Merge branch 'master' of gitlab.odin.cse.buffalo.edu:ahuber/SketchingWorlds

Boris Glavic 2020-12-19 22:54:35 -06:00
commit 3e43bd053a
12 changed files with 90 additions and 65 deletions

@ -7,7 +7,7 @@
For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage, if this polynomial is encoded as a sum of products.
However, using a reduction from the problem of counting k-matchings, we demonstrate that calculating the expectation is \sharpwonehard when the polynomial is compressed, for example through factorization.
As we show, this result has a significant implication: a Bag-PDB doing exact computations will never be as fast as a classical (deterministic) database.
The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $p$ (s.t. $p \not \in \{0,1\}$).
The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $\prob$ (s.t. $\prob \not \in \{0,1\}$).
We proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs). We develop an algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of such polynomials in linear time in the size of the polynomial, paving the way for PDBs competitive with deterministic databases.
% \AH{High-level intuition}
% \BG{Most people think that computing expected multiplicity of an output tuple in a probabilistic database (PDB) is easy. Due to the fact that most modern implementations of PDBs represent tuple lineage in their expanded form, it has to be the case that such a computation is linear in the size of the lineage. This follows since, when we have an uncompressed lineage, linearity allows for expectation to be pushed through the sum.}

@ -5,7 +5,7 @@
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.
Unlike the results in~\Cref{sec:hard} our approximation algorithm works for \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis, as well as the PDBench benchmark.
Unlike the results in~\Cref{sec:hard}, our approximation algorithm works for \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis and the PDBench benchmark~\cite{pdbench}.
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
\subsection{Preliminaries and some more notation}
@ -132,10 +132,10 @@ For any expression tree $\etree$, the corresponding
{\em positive tree}, denoted $\abs{\etree}$ obtained from $\etree$ as follows. For each leaf node $\ell$ of $\etree$ where $\ell.\type$ is $\tnum$, update $\ell.\vari{value}$ to $|\ell.\vari{value}|$. %value $\coef$ of each coefficient leaf node in $\etree$ is set to %$\coef_i$ in $\etree$ is exchanged with its absolute value$|\coef|$.
\end{Definition}
Using the same factorization from ~\Cref{example:expr-tree-T}, $poly(\abs{\etree}) = (X + 2Y)(2X + Y) = 2X^2 +XY +4XY + 2Y^2 = 2X^2 + 5XY + 2Y^2$. Note that this \textit{is not} the same as the polynomial from~\Cref{eq:poly-eg}.
Using the same factorization from ~\Cref{example:expr-tree-T}, $\polyf(\abs{\etree}) = (X + 2Y)(2X + Y) = 2X^2 +XY +4XY + 2Y^2 = 2X^2 + 5XY + 2Y^2$. Note that this \textit{is not} the same as the polynomial from~\Cref{eq:poly-eg}.
\begin{Definition}[Evaluation]\label{def:exp-poly-eval}
Given an expression tree $\etree$ and $\vct{v} \in \mathbb{R}^\numvar$, we define the evaluation of $\etree$ on $\vct{v}$ as $\etree(\vct{v}) = poly(\etree)(\vct{v})$.
Given an expression tree $\etree$ and $\vct{v} \in \mathbb{R}^\numvar$, we define the evaluation of $\etree$ on $\vct{v}$ as $\etree(\vct{v}) = \polyf(\etree)(\vct{v})$.
\end{Definition}
\subsection{Our main result}
@ -144,9 +144,9 @@ Given an expression tree $\etree$ and $\vct{v} \in \mathbb{R}^\numvar$, we defin
In the subsequent subsections we will prove the following theorem.
\begin{Theorem}\label{lem:approx-alg}
Let $\etree$ be an expression tree for a UCQ over \bi and define $\poly(\vct{X})=\polyf(\etree)$ and let $k=\degree(\poly)$
Let $\etree$ be an expression tree for a UCQ over \bi and define $\poly(\vct{X})=\polyf(\etree)$ and let $k=\degree(\poly)$.
%Let $\poly(\vct{X})$ be a query polynomial corresponding to the output of a UCQ in a \bi.
An estimate $\mathcal{E}$ %=\approxq(\etree, (p_1,\dots,p_\numvar), \conf, \error')$
Then an estimate $\mathcal{E}$ %=\approxq(\etree, (p_1,\dots,p_\numvar), \conf, \error')$
of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ can be computed in time
\[O\left(\treesize(\etree) + \frac{\log{\frac{1}{\conf}}\cdot \abs{\etree}^2(1,\ldots, 1)\cdot k\cdot \log{k} \cdot depth(\etree)}{\inparen{\error'}^2\cdot\rpoly^2(\prob_1,\ldots, \prob_\numvar)}\right)\]
such that
@ -165,25 +165,30 @@ Given an expression tree $\etree$, define
\[\gamma(\etree)=\frac{\sum_{(\monom, \coef)\in \expandtree{\etree}} \abs{\coef}\cdot \indicator{\monom\mod{\mathcal{B}}\equiv 0}}{\abs{\etree}(1,\ldots, 1)}\]
\end{Definition}
%\AH{This....combined with \Cref{def:mod-set-polys} is \emph{really} nice notation!}
\AR{Need to make sure use of indicator variable $\onesymbol$ above is consistent with the rest of the paper.}
%\AR{Need to make sure use of indicator variable $\onesymbol$ above is consistent with the rest of the paper.}
%\OK{Done}
We next present a couple of corollaries of~\Cref{lem:approx-alg}.
\begin{Corollary}
\label{cor:approx-algo-const-p}
Let $\poly(\vct{X})$ be as in~\Cref{lem:approx-alg} and let $\gamma=\gamma(\etree)$. Further let it be the case that $p_i\ge p_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ satisfying~\Cref{eq:approx-algo-bound} can be computed in time
\[O\left(\treesize(\etree) + \frac{\log{\frac{1}{\conf}}\cdot k\cdot \log{k} \cdot depth(\etree)}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot p_0^{2k}}\right)\]
In particular, if $p_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\frac 1{\eps^2}\cdot\treesize(\etree)\cdot \log{\frac{1}{\conf}}\right)$.
In particular, if $p_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\treesize(\etree)\cdot \log{\frac{1}{\conf}}\right)$.
\end{Corollary}
The proof for~\Cref{cor:approx-algo-const-p} can be seen in~\Cref{sec:proofs-approx-alg}.
The restriction on $\gamma$ is satisfied by \ti (where $\gamma=0$) as well as for all queries of the PDBench \bi benchmark (see \Cref{app:subsec:experiment}).
\AH{I am thinking that perhaps the terminology and presentation of~\Cref{sec:experiments} may need word-smithing to clearly illustrate the $\bi$ benchmarks satisfied--although the substance is already written there.}
\AR{Yes! E.g. $\gamma$ is not used at all in~\Cref{sec:experiments}}
\AR{{\bf Boris/Oliver:} Is there a way to claim that all probabilities in practice are actually constants: i.e. they do not increase with the number of tuples?}
\OK{@Atri: This seems like a reasonable claim. It's too late for me to come up with a reasonable motivation (maybe something will come to me in the morning), but the intuition for me is that each tuple/block is independent... it would be hard for that to be the case if the probability were a function of the number of tuples.}
Note that (i) tuple presence is independent across blocks, so the corresponding probabilities (and hence $p_0$) are independent of the number of blocks, and (ii) \bis model uncertain attributes, so block size (and hence $\gamma$) is a function of the ``messiness'' of a dataset, rather than its size.
Thus, we expect the corollary to hold in general.
% \AH{I am thinking that perhaps the terminology and presentation of~\Cref{app:subsec:experiment} may need word-smithing to clearly illustrate the $\bi$ benchmarks satisfied--although the substance is already written there.}
% \AR{Yes! E.g. $\gamma$ is not used at all in~\Cref{app:subsec:experiment}}
% \AR{{\bf Boris/Oliver:} Is there a way to claim that all probabilities in practice are actually constants: i.e. they do not increase with the number of tuples?}
% \OK{@Atri: This seems like a reasonable claim. It's too late for me to come up with a reasonable motivation (maybe something will come to me in the morning), but the intuition for me is that each tuple/block is independent... it would be hard for that to be the case if the probability were a function of the number of tuples.}
\subsection{Approximating $\rpoly$}
The algorithm to prove~\Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=poly(\etree)$ for expression tree $\etree$ over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
The algorithm to prove~\Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=\polyf(\etree)$ for expression tree $\etree$ over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
\begin{equation}
\label{eq:tilde-Q-bi}
\rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(v,c)\in \expandtree{\etree}} \hspace*{-2mm} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot c\cdot\hspace*{-2mm}\prod_{X_i\in \var\inparen{v}}\hspace*{-2mm} X_i
@ -330,7 +335,7 @@ The number of samples is computed by (see \Cref{app:subsec-th-mon-samp}):
\subsubsection{Correctness}
In order to prove~\Cref{lem:approx-alg}, we will need to argue the correctness of~\Cref{alg:mon-sam}. Before we formally do that,
we first state the lemmas that summarize the relevant properties of $\onepass$ and $\sampmon$, the auxiliary algorithms on which ~\Cref{alg:mon-sam} relies. Their proofs are given in~\Cref{sec:onepass} and~\Cref{sec:samplemonomial} respectively.
we first state the lemmas that summarize the relevant properties of $\onepass$ and $\sampmon$, the auxiliary algorithms on which ~\Cref{alg:mon-sam} relies. %Their proofs are given in~\Cref{sec:onepass} and~\Cref{sec:samplemonomial} respectively.
\begin{Lemma}\label{lem:one-pass}
@ -349,7 +354,7 @@ Armed with the above two lemmas, we are ready to argue the following result (pro
\begin{Theorem}\label{lem:mon-samp}
%If the contracts for $\onepass$ and $\sampmon$ hold, then
For any $\etree$ with $\degree(poly(|\etree|)) = k$, algorithm \ref{alg:mon-sam} outputs an estimate $\vari{acc}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ such that %$\expct\pbox{\empmean} = \frac{\rpoly(\prob_1,\ldots, \prob_\numvar)\cdot(1 - \gamma)}{\abs{\etree}(1,\ldots, 1)}$. %within an additive $\error \cdot \abs{\etree}(1,\ldots, 1)$ error with
$\empmean$ has bounds
%$\empmean$ has bounds
\[P\left(\left|\vari{acc} - \rpoly(\prob_1,\ldots, \prob_\numvar)\right|> \error \cdot \abs{\etree}(1,\ldots, 1)\right) \leq \conf,\]
in $O\left(\treesize(\etree)\right.$ $+$ $\left.\left(\frac{\log{\frac{1}{\conf}}}{\error^2} \cdot k \cdot\log{k} \cdot depth(\etree)\right)\right)$ time.
\end{Theorem}
@ -394,7 +399,7 @@ It turns out that for proof of~\Cref{lem:sample}, we need to argue that when $\e
%\begin{align*}
%&\eval{\etree~|~\etree.\type = +}_{\wght} =&&\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}; \etree_\lchild.\wght = \frac{\eval{\etree_\lchild}_{\abs{\etree}}}{\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}}; \etree_\rchild.\wght = \frac{\eval{\etree_\rchild}_{\abs{\etree}}}{\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}}
%\end{align*}
\noindent \onepass\ (Algorithm ~\ref{alg:one-pass} in \Cref{sec:proofs-approx-alg}) essentially populates the \vari{weight} variable on each node with the above definitions.
\noindent \onepass\ (Algorithm ~\ref{alg:one-pass} in \Cref{sec:proofs-approx-alg}) essentially populates the \vari{weight} variable on each node with the above definitions. Lemma~\ref{lem:one-pass} is also proved in~\Cref{sec:proofs-approx-alg}.
%\subsubsection{Psuedo Code}

@ -22,7 +22,7 @@ In~\Cref{sec:results-circuits} we argue why results from earlier sections also h
\subsubsection{Extending our results to lineage circuits}
\label{sec:results-circuits}
We first note that since expression trees are a special case of them, all of our hardness results in~\Cref{sec:hard} are still valid for lineage circuits.
We first note that since expression trees are a special case of lineage circuits, all of our hardness results in~\Cref{sec:hard} are still valid for the latter.
Observe that \textsc{Approx}\textsc{imate}$\rpoly$ (\Cref{alg:mon-sam} in \Cref{sec:algo}) works for lineage circuits as long as the same guarantees on $\onepass$ and $\sampmon$ (\Cref{lem:one-pass} and \Cref{lem:sample} respectively) hold for lineage circuits as well.
It turns out that this is the case, simply because both algorithms rely on only one property of expression trees: that each node has two children;
@ -76,7 +76,8 @@ For the specifics on how lineage circuits are translated to represent polynomial
\subsubsection{Circuit size vs. runtime}
\label{sec:circuit-runtime}
We now connect the size of a lineage circuit (where the size of a lineage circuit is the number of vertices in the corresponding DAG\footnote{since each node has indegree at most two, this also is the same up to constants to counting the number of edges in the DAG.}) for a given SPJU query $Q$ to its $\qruntime{Q}$. We do this formally by showing that the size of the lineage circuit is asymptotically no worse than the corresponding runtime of a large class of deterministic query processing algorithms.
We now connect the size of a lineage circuit (where the size of a lineage circuit is the number of vertices in the corresponding DAG) %\footnote{since each node has indegree at most two, this also is the same up to constants to counting the number of edges in the DAG.}
for a given SPJU query $Q$ to its $\qruntime{Q}$. We do this formally by showing that the size of the lineage circuit is asymptotically no worse than the corresponding runtime of a large class of deterministic query processing algorithms.
\begin{lemma}
\label{lem:circuits-model-runtime}
@ -89,7 +90,7 @@ We now have all the pieces to argue the following, which formally states that ou
Given an SPJU query $Q$ for a TIDB, we can present a $(1\pm\eps)$-approximation to the expectation of each output tuple with probability at least $1-\delta$ in time $O_k\left(\frac 1{\eps^2}\cdot\qruntime{Q}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right)$.
\end{Corollary}
\begin{proof}
This follows from~\Cref{lem:circuits-model-runtime} and (the lineage circuit counterpart-- see~\Cref{sec:results-circuits} of)~\Cref{cor:approx-algo-const-p} (where the latter is used with $\delta$ being substituted\footnote{Recall that~\Cref{cor:approx-algo-const-p} is stated for a single output tuple so to get the required guarantee for all (at most $n^k$) output tuples of $Q$ we get at most $\frac \delta{n^k}$ probability of failure for each output tuple and then just a union bound over all output tuples. } with $\frac \delta{n^k}$).
This follows from~\Cref{lem:circuits-model-runtime} and (the lineage circuit counterpart-- see~\Cref{sec:results-circuits})~\Cref{cor:approx-algo-const-p} (where the latter is used with $\delta$ being substituted\footnote{Recall that~\Cref{cor:approx-algo-const-p} is stated for a single output tuple so to get the required guarantee for all (at most $n^k$) output tuples of $Q$ we get at most $\frac \delta{n^k}$ probability of failure for each output tuple and then just a union bound over all output tuples. } with $\frac \delta{n^k}$).
\end{proof}
\subsection{Higher moments}
@ -101,4 +102,4 @@ In addition, we could e.g. prove bounds of probability of the multiplicity being
While we do not have a good approximation algorithm for this problem, we can make some progress as follows:
Note that for any positive integer $m$ we can compute the expectation of $\poly^m$ (since this only changes the degree of the corresponding lineage polynomial by a factor of $m$).
In other words, we can compute the $m$-th moment of the multiplicities as well, allowing us to use, e.g., Chebyshev's inequality or other higher-moment-based probability bounds on the events we might be interested in.
However, we leave the question of coming up with a wider range of approximation algorithms for future work.
However, we leave the question of coming up with more accurate approximation algorithms for future work.

@ -1,13 +1,13 @@
%!TEX root=./main.tex
\section{Conclusions and Future Work}\label{sec:concl-future-work}
We have studied the problem of calculating the expectation of polynomials over random integer variables.
We have studied the problem of calculating the expectation of query polynomials over BIDBs. %random integer variables.
This problem has a practical application in probabilistic databases over multisets, where it corresponds to calculating the expected multiplicity of a query result tuple.
This problem has been studied extensively for sets (lineage formulas), but the bag setting has not received much attention so far.
While the expectation of a polynomial can be calculated in linear time in the size of polynomials that are in SOP form, the problem is \sharpwonehard for factorized polynomials.
We have proven this claim through a reduction from the problem of counting k-matchings.
When only considering polynomials for result tuples of UCQs over TIDBs and BIDBs (under the assumption that there are $O(1)$ cancellations), we prove that it is still possible to approximate the expectation of a polynomial in linear time.
An interesting direction for future work would be development of a dichotomy for queries over bag PDBs.
When only considering polynomials for result tuples of UCQs over TIDBs and BIDBs (under the assumption that there are few cancellations), we prove that it is still possible to approximate the expectation of a polynomial in linear time.
Interesting directions for future work include developing a dichotomy for queries over bag PDBs and designing approximation schemes for data models beyond those we consider in this paper.
% Furthermore, it would be interesting to see whether our approximation algorithm can be extended to support queries with negations, perhaps using circuits with monus as a representation system.
\BG{I am not sure what interesting future work is here. Some wild guesses, if anybody agrees I'll try to flesh them out:

@ -1,3 +1,5 @@
%!TEX root=./main.tex
\section{Missing details from Section~\ref{sec:background}}\label{sec:proofs-background}
\subsection{Supplementary Material for~\Cref{prop:expection-of-polynom}}\label{subsec:supp-mat-background}
@ -74,7 +76,20 @@ Since $\semNX$-PDBs $\pxdb$ are a complete representation system for $\semN$-PDB
\subsection{Proof of~\Cref{prop:expection-of-polynom}}
\label{subsec:expectation-of-polynom-proof}
\BG{TODO}
We need to prove for $\semN$-PDB $\pdb = (\idb,\pd)$ and $\semNX$-PDB $\pxdb = (\db',\pd')$ where $\rmod(\pxdb) = \pdb$ that $\expct_{\db \sim \pd}[\query(\db)(t)] = \expct_{\vct{w} \sim \pd'}\pbox{\polyForTuple(\vct{w})}$.
By expanding $\polyForTuple$ and the expectation we have:
\begin{align*}
\expct_{\vct{w} \sim \pd'}\pbox{\polyForTuple(\vct{w})}
& = \sum_{\vct{w} \in \{0,1\}^n}\pd'(\vct{w}) \cdot Q(\pxdb)(t)(\vct{w})\\
\intertext{From $\rmod(\pxdb) = \pdb$, we have that the range of $\assign_{\vct{w}}(\pxdb)$ is $\idb$, so}
& = \sum_{\db \in \idb}\;\;\sum_{\vct{w} \in \{0,1\}^n : \assign_{\vct{w}}(\pxdb) = \db}\pd'(\vct{w}) \cdot Q(\pxdb)(t)(\vct{w})\\
\intertext{In the inner sum, $\assign_{\vct{w}}(\pxdb) = \db$, so by distributivity of $\times$ over $+$}
& = \sum_{\db \in \idb}\query(\db)(t)\sum_{\vct{w} \in \{0,1\}^n : \assign_{\vct{w}}(\pxdb) = \db}\pd'(\vct{w})\\
\intertext{From the definition of $\pd$, given $\rmod(\pxdb) = \pdb$, we get}
& = \sum_{\db \in \idb}\query(\db)(t) \cdot \pd(\db) \quad = \expct_{\db \sim \pd}[\query(\db)(t)]
\end{align*}
\subsection{Supplementary Material for~\Cref{subsec:tidbs-and-bidbs}}\label{subsec:supp-mat-ti-bi-def}
Two important subclasses of $\semNX$-PDBs that are of interest to us are the bag versions of tuple-independent databases (\tis) and block-independent databases (\bis). Under set semantics, a \ti is a deterministic database $\db$ where each tuple $\tup$ is assigned a probability $\prob(\tup)$. The set of possible worlds represented by a \ti $\db$ is all subsets of $\db$. The probability of each world is the product of the probabilities of all tuples that exist with one minus the probability of all tuples of $\db$ that are not part of this world, i.e., tuples are treated as independent random events. In a \bi, we also assign each tuple a probability, but additionally partition $\db$ into blocks. The possible worlds of a \bi $\db$ are all subsets of $\db$ that contain at most one tuple from each block. Note then that the tuples sharing the same block are disjoint, and the sum of the probabilities of all the tuples in the same block $\block$ is $1$. The probability of such a world is the product of the probabilities of all tuples present in the world. %and one minus the sum of the probabilities of all tuples from blocks for which no tuple is present in the world.

@ -145,8 +145,8 @@ Consider the Tuple Independent ($\ti$) Set-PDB\footnote{Our work does also handl
Each input tuple is assigned an annotation (attribute $\Phi_{set}$): an independent random Boolean variable ($W_i$) or the constant $\top$.
% Each assignment of values to variables ($\{\;W_a,W_b,W_c\;\}\mapsto \{\;\top,\bot\;\}$) \SF{Do we need to state the meaning of $\top$ and $\bot$? Also do we want to add bag annotation to Figure 1 too since we are discussing both sets and bags later?} identifies one \emph{possible world}, a deterministic database instance that contains exactly the tuples annotated by the constant $\top$ or by a variable assigned to $\top$.
The probability of this world is the joint probability of the corresponding assignments.
For example, let $P[W_a] = P[W_b] = P[W_c] = p$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.
The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and the probability of this world is $P[W_a]\cdot P[W_b] \cdot P[\neg W_c] = p\cdot p\cdot (1-p)=p^2-p^3$.
For example, let $\probOf[W_a] = \probOf[W_b] = \probOf[W_c] = \prob$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.
The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and the probability of this world is $\probOf[W_a]\cdot \probOf[W_b] \cdot \probOf[\neg W_c] = \prob\cdot \prob\cdot (1-\prob)=\prob^2-\prob^3$.
\end{Example}
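The possible-world probability in the example above can be checked numerically. The following is a minimal sketch, assuming a tuple-independent database with exactly the three tuples $a$, $b$, $c$; the function name `world_probability` and the concrete value of $\prob$ are illustrative assumptions and do not appear in the paper.

```python
def world_probability(present, probs):
    """Probability of the world containing exactly the tuples in `present`.

    Assumes tuple independence: each tuple t exists with probability
    probs[t], independently of all other tuples.
    """
    prob = 1.0
    for tup, p in probs.items():
        # Present tuples contribute p, absent tuples contribute (1 - p).
        prob *= p if tup in present else 1.0 - p
    return prob


p = 0.25  # illustrative value for the shared tuple probability
probs = {"a": p, "b": p, "c": p}

# World R = {a, b}: W_a and W_b are true, W_c is false.
prob_world = world_probability({"a", "b"}, probs)
```

Under these assumptions `prob_world` agrees with the closed form $\prob^2-\prob^3$ from the example.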
Following prior efforts~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,GL16}, we generalize this model of Set-PDBs to bags using $\semN$-valued random variables (i.e., $Dom(W_i) \subseteq \mathbb N$) and constants (annotation $\Phi_{bag}$ in the example).
@ -172,9 +172,9 @@ The marginal probability (resp., expected count) of this query is computed over
% \AR{What is $\mu$ below?}
{\small
\begin{align*}
P[\poly_{set}] &= \hspace*{-1mm}
\sum_{w_i \in \{\top,\bot\}} \indicator{\poly_{set}(w_a, w_b, w_c)}P[W_a = w_a,W_b = w_b,W_c = w_c]\\
\expct[\poly_{bag}] &= \sum_{w_i \in \{0,1\}} \poly_{bag}(w_a, w_b, w_c)\cdot P[W_a = w_a,W_b = w_b,W_c = w_c]
\probOf[\poly_{set}] &= \hspace*{-1mm}
\sum_{w_i \in \{\top,\bot\}} \indicator{\poly_{set}(w_a, w_b, w_c)}\probOf[W_a = w_a,W_b = w_b,W_c = w_c]\\
\expct[\poly_{bag}] &= \sum_{w_i \in \{0,1\}} \poly_{bag}(w_a, w_b, w_c)\cdot \probOf[W_a = w_a,W_b = w_b,W_c = w_c]
\end{align*}
}
\end{Example}
@ -203,13 +203,13 @@ In this particular lineage polynomial, all variables in each product clause are
\end{align*}
}
Computing such expectations is indeed linear in the size of the SOP as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial.
As a further interesting feature of this example, note that $\expct\pbox{W_i} = P[W_i = 1]$, and so taking the same polynomial over the reals:
As a further interesting feature of this example, note that $\expct\pbox{W_i} = \probOf[W_i = 1]$, and so taking the same polynomial over the reals:
\begin{multline}
\label{eqn:can-inline-probabilities-into-polynomial}
\expct\pbox{\poly_{bag}}
% = P[W_a = 1]P[W_b = 1] + P[W_b = 1]P[W_c = 1]\\
% + P[W_c = 1]P[W_a = 1]\\
= \poly_{bag}(P[W_a=1], P[W_b=1], P[W_c=1])
= \poly_{bag}(\probOf[W_a=1], \probOf[W_b=1], \probOf[W_c=1])
\end{multline}
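The identity above can be sanity-checked by brute force over all possible worlds. This is a sketch under stated assumptions: the polynomial $W_aW_b + W_bW_c + W_cW_a$ is taken from the commented-out lines of the equation, each $W_i$ is an independent $\{0,1\}$-valued variable, and the probability values and function names are hypothetical.

```python
from itertools import product


def poly_bag(w_a, w_b, w_c):
    """Assumed lineage polynomial W_a*W_b + W_b*W_c + W_c*W_a."""
    return w_a * w_b + w_b * w_c + w_c * w_a


def expectation_brute_force(poly, probs):
    """Sum poly(world) * P[world] over all 0/1 assignments (exponential)."""
    total = 0.0
    for world in product((0, 1), repeat=len(probs)):
        weight = 1.0
        for w, p in zip(world, probs):
            weight *= p if w == 1 else 1.0 - p
        total += weight * poly(*world)
    return total


probs = (0.3, 0.5, 0.9)  # illustrative P[W_a=1], P[W_b=1], P[W_c=1]
brute = expectation_brute_force(poly_bag, probs)
inlined = poly_bag(*probs)  # evaluate the same polynomial on the probabilities
```

Both values coincide here because every monomial is a product of distinct independent variables, so the expectation can be pushed through sums and products.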
\begin{figure}[t]
@ -277,7 +277,7 @@ The expectation $\expct\pbox{\poly^2(W_a, W_b, W_c)}$ then is:
+ \expct\pbox{2W_a}\expct\pbox{W_b}\expct\pbox{W_c^2}
\end{multline*}
Recall the nice property of $\query$ that its expected count could be computed by evaluating its lineage on the probability vector (i.e., \Cref{eqn:can-inline-probabilities-into-polynomial}).
This property does not hold for $\poly^2$ (i.e., $\expct\pbox{\poly^2} \neq \poly^2(P\pbox{W_a}, P\pbox{W_b}, P\pbox{W_c})$), but does suggest a related closed form formula.
This property does not hold for $\poly^2$ (i.e., $\expct\pbox{\poly^2} \neq \poly^2(\probOf\pbox{W_a}, \probOf\pbox{W_b}, \probOf\pbox{W_c})$), but does suggest a related closed form formula.
Note that if $Dom(W_i) = \{0, 1\}$, then for any $k > 0$, $\expct\pbox{W_i^k} = \expct\pbox{W_i}$.
This property leads us to consider a structure related to $\poly$.
% \AH{I don't know if we want to include the following statement: \par \emph{ bags are only hard with self-joins }
@ -291,7 +291,7 @@ With $\poly^2$ as an example, we have:
=&\; W_aW_b + W_bW_c + W_cW_a + 6W_aW_bW_c
\end{align*}
%\SF{Should this be like $\tilde{\poly^2}$ to avoid ambiguous?}
Note that the reduced polynomial is a closed form of the expected count (i.e., $\expct\pbox{\poly^2} = \rpoly(P\pbox{W_a=1}, P\pbox{W_b=1}, P\pbox{W_c=1})$).
Note that the reduced polynomial is a closed form of the expected count (i.e., $\expct\pbox{\poly^2} = \rpoly(\probOf\pbox{W_a=1}, \probOf\pbox{W_b=1}, \probOf\pbox{W_c=1})$).
Also note that the $\poly$ in~\Cref{ex:bag-vs-set} is already in reduced form.
The reduced form of a polynomial can be obtained in a linear scan over the clauses of a SOP encoding of the polynomial.
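The linear scan can be sketched as follows for the tuple-independent case with $Dom(W_i) = \{0,1\}$. The SOP encoding as a list of (coefficient, exponent map) pairs and the function names are assumptions made for illustration; they are not the paper's data structures.

```python
from collections import defaultdict


def reduce_sop(sop):
    """One pass over the SOP clauses: drop every exponent to 1 and merge
    clauses that end up with the same set of variables."""
    reduced = defaultdict(int)
    for coef, exponents in sop:
        # Only the set of variables matters once all exponents are 1.
        reduced[frozenset(exponents)] += coef
    return dict(reduced)


def evaluate_reduced(reduced, probs):
    """Evaluate the reduced polynomial on a map of tuple probabilities."""
    total = 0.0
    for variables, coef in reduced.items():
        term = float(coef)
        for v in variables:
            term *= probs[v]
        total += term
    return total


# SOP expansion of (W_a W_b + W_b W_c + W_c W_a)^2.
sop_q2 = [
    (1, {"a": 2, "b": 2}),
    (1, {"b": 2, "c": 2}),
    (1, {"c": 2, "a": 2}),
    (2, {"a": 1, "b": 2, "c": 1}),
    (2, {"a": 1, "b": 1, "c": 2}),
    (2, {"a": 2, "b": 1, "c": 1}),
]
reduced_q2 = reduce_sop(sop_q2)
expected_q2 = evaluate_reduced(reduced_q2, {"a": 0.3, "b": 0.5, "c": 0.9})
```

The result `reduced_q2` has coefficient 1 on each of the three pairs and coefficient 6 on the triple, matching the reduced form of $\poly^2$ shown above. The scan touches each clause once, so it is linear in the size of the SOP, but not in the size of a factorized representation.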
@ -318,10 +318,10 @@ Concretely, in this paper:
(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
(iv) We further generalize our results to higher moments, polynomial circuits, and prove that for RA+ queries, the processing time in approximation is within a constant factor of the same query processed deterministically.
Our hardness results follow by considering a suitable generalization of the lineage polynomial in Example~\ref{ex:bag-vs-set}. First it is easy to generalize the polynomial in Example~\ref{ex:bag-vs-set} to $\poly_G^k(X_1,\dots,X_n)$ that represents the edge set of a graph $G$ in $n$ vertices. Then $\inparen{\poly_G^k(X_1,\dots,X_n)}^k$ encodes as its monomials all subgraphs of $G$ with at most $k$ edges in it. This implies that the corresponding reduced polynomial $\rpoly_G^k(p,\dots,p)$ can be written as $\sum_{i=0}^{2k} c_i\cdot p^i$ and we observe that $c_{2k}$ is proportional to the number of $k$-matchings (computing which is \sharpwonehard\ ) in $G$. Thus, if we have access to $\rpoly_G^k(p_i,\dots,p_i)$ for distinct values of $p_i$ for $0\le i\le 2k$, then we can setup a system of linear equations and compute $c_{2k}$ (and hence the number of $k$-matchings in $G$). This result, however, does not rule out the possibility that computing $\rpoly_G^k(p,\dots,p)$ for a {\em single specific} value of $p$ might be easy: indeed it is easy for $p=0$ or $p=1$. However, we are able to show that for any other value of $p$, computing $\rpoly_G^k(p,\dots,p)$ exactly will most probably require super-linear time. This reduction needs more work (and we cannot yet extend our results to $k>3$). Further, we have to rely on more recent conjectures in {\em fine-grained} complexity on e.g. the complexity of counting the number of triangles in $G$ and not more standard parameterized hardness like \sharpwonehard.
Our hardness results follow by considering a suitable generalization of the lineage polynomial in Example~\ref{ex:bag-vs-set}. First, it is easy to generalize the polynomial in Example~\ref{ex:bag-vs-set} to $\poly_G^k(X_1,\dots,X_n)$ that represents the edge set of a graph $G$ on $n$ vertices. Then $\inparen{\poly_G^k(X_1,\dots,X_n)}^k$ encodes as its monomials all subgraphs of $G$ with at most $k$ edges in it. This implies that the corresponding reduced polynomial $\rpoly_G^k(\prob,\dots,\prob)$ can be written as $\sum_{i=0}^{2k} c_i\cdot \prob^i$ and we observe that $c_{2k}$ is proportional to the number of $k$-matchings (computing which is \sharpwonehard\ ) in $G$. Thus, if we have access to $\rpoly_G^k(\prob_i,\dots,\prob_i)$ for distinct values of $\prob_i$ for $0\le i\le 2k$, then we can set up a system of linear equations and compute $c_{2k}$ (and hence the number of $k$-matchings in $G$). This result, however, does not rule out the possibility that computing $\rpoly_G^k(\prob,\dots, \prob)$ for a {\em single specific} value of $\prob$ might be easy: indeed it is easy for $\prob=0$ or $\prob=1$. However, we are able to show that for any other value of $\prob$, computing $\rpoly_G^k(\prob,\dots, \prob)$ exactly will most probably require super-linear time. This reduction needs more work (and we cannot yet extend our results to $k>3$). Further, we have to rely on more recent conjectures in {\em fine-grained} complexity on e.g. the complexity of counting the number of triangles in $G$ and not more standard parameterized hardness like \sharpwonehard.
The starting point of our approximation algorithm was the simple observation that for any lineage polynomial $\poly(X_1,\dots,X_n)$, we have $\rpoly(1,\dots,1)=Q(1,\dots,1)$ and if all the coefficients of $\poly$ are constants, then $\poly(X_1,\dots,X_n)$ (which can be easily computed in linear time) is a $p^k$ approximation to the value $\rpoly(p,\dots,p)$ that we are after. If $p$ and $k=\deg(\poly)$ are constants, then this gives a constant factor approximation. We then use sampling to get a better approximation factor of $(1\pm \eps)$: we sample monomials from $\poly(X_1,\dots,X_n)$ and do an appropriate weighted sum of their coefficients. Standard tail bounds then allow us to get our desired approximation scheme. To get a linear runtime, it turns out that we need the following properties from our compressed representation of $\poly$: (i) be able to compute $\poly(X_1,\dots,X_n)$ in linear time and (ii) be able to sample monomials from $\poly(X_1,\dots,X_n)$ quickly as well. For the ease of exposition, we start off with expression trees (see~\Cref{fig:intro-q2-etree} for an example) and show that they satisfy both of these properties. Later we show that it is easy to show that these properties also extend to polynomial circuits as well (we essentially show that in the required time bound, we can simulate access to the `unrolled' expression tree by considering the polynomial circuit).
The starting point of our approximation algorithm was the simple observation that for any lineage polynomial $\poly(X_1,\dots,X_n)$, we have $\rpoly(1,\dots,1)=Q(1,\dots,1)$ and if all the coefficients of $\poly$ are constants, then $\poly(X_1,\dots,X_n)$ (which can be easily computed in linear time) is a $\prob^k$ approximation to the value $\rpoly(\prob,\dots, \prob)$ that we are after. If $\prob$ and $k=\deg(\poly)$ are constants, then this gives a constant factor approximation. We then use sampling to get a better approximation factor of $(1\pm \eps)$: we sample monomials from $\poly(X_1,\dots,X_n)$ and do an appropriate weighted sum of their coefficients. Standard tail bounds then allow us to get our desired approximation scheme. To get a linear runtime, it turns out that we need the following properties from our compressed representation of $\poly$: (i) be able to compute $\poly(X_1,\dots,X_n)$ in linear time and (ii) be able to sample monomials from $\poly(X_1,\dots,X_n)$ quickly as well. For the ease of exposition, we start off with expression trees (see~\Cref{fig:intro-q2-etree} for an example) and show that they satisfy both of these properties. Later we show that these properties extend to polynomial circuits as well (we essentially show that in the required time bound, we can simulate access to the `unrolled' expression tree by considering the polynomial circuit).
We also formalize our claim that, since our approximation algorithm runs in time linear in the size of the polynomial circuit, we can approximate the expected output tuple multiplicities with only an $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).

@ -21,7 +21,7 @@
\newcommand{\pxdb}{\mathbf{D}}
\newcommand{\nxdb}{D(\vct{X})}%\mathbb{N}[\vct{X}] db
\newcommand{\tset}{\mathcal{T}}%the set of tuples in a database
\newcommand{\pd}{P}%pd for probability distribution
\newcommand{\pd}{\vct{P}}%pd for probability distribution
\newcommand{\eval}[1]{\llbracket #1 \rrbracket}%evaluation double brackets
\newcommand{\evald}[2]{\eval{{#1}}_{#2}}
\newcommand{\query}{Q}
@ -116,6 +116,9 @@
%PDBs
\newcommand{\pdbx}{X_{DB}}
\newcommand{\prob}{p}
\newcommand{\probOf}{P}
\newcommand{\probDist}{\vct{\probOf}}
\newcommand{\probAllTup}{\vct{\prob}}
\newcommand{\wSet}{\Omega}
\newcommand{\ti}{TIDB\xspace}
\newcommand{\tis}{TIDBs\xspace}

@ -4,7 +4,7 @@
\label{sec:hard}
\AH{The notation used here is different than in~\Cref{sec:background}, in particular~\Cref{eq:expect-q-nx}. Maybe we should decide on a notation and try to stick to it as much as possible?}
\BG{We sometimes use $\expct_{\vct{X} \sim P}$ sometimes $\expct_{\vct{X}}$}
In this section, we will prove that computing $\expct\limits_{\vct{X} \sim \pd}\pbox{\poly(\vct{X})}$ for a \ti-lineage polynomial $\poly(\vct{X})$ generated from a project-join query is \sharpwonehard. Note that this implies hardness for \bis and general $\semNX$-PDBs. Furthermore, we demonstrate in \Cref{sec:single-p} that the problem remains hard even if $\pd(X_i) = p$ for all $X_i$ and some fixed value $p$. More precisely, using popular hardness conjectures in fine-grained complexity, we show that if these conjectures hold, then the problem is hard for any given $p$ except for the trivial choices $p \in \{0,1\}$.
In this section, we will prove that computing $\expct\limits_{\vct{W} \sim \pd}\pbox{\poly(\vct{W})}$ for a \ti-lineage polynomial $\poly(\vct{X})$ generated from a project-join query is \sharpwonehard. Note that this implies hardness for \bis and general $\semNX$-PDBs. Furthermore, we demonstrate in \Cref{sec:single-p} that the problem remains hard even if $\probOf(X_i) = \prob$ for all $X_i$ and some fixed value $\prob$. More precisely, using popular hardness conjectures in fine-grained complexity, we show that if these conjectures hold, then the problem is hard for any given $\prob$ except for the trivial choices $\prob \in \{0,1\}$.
% We would like to argue for a compressed version of $\poly(\vct{X})$, in general $\expct\limits_{\vct{X} \sim \pd}\pbox{\poly(\vct{X})}$ even for tis, cannot be computed in linear time. We will argue two flavors of such a hardness result. In Section~\ref{sec:multiple-p}, we argue that computing the expected value exactly for all query polynommials $\poly(\vct{X})$ for multiple values of $p$ is \sharpwonehard. However, this does not rule out the possibility of being able to solve the problem for a any {\em fixed} value of $p$ in linear time. In Section~\ref{sec:single-p}, we rule out even this possibility (based on some popular hardness conjectures in fine-grained complexity).
@ -49,11 +49,11 @@ For any graph $G=([n],E)$ and $\kElem\ge 1$, define
\[\poly_{G}^\kElem(X_1,\dots,X_n) = \left(\sum\limits_{(i, j) \in E} X_i \cdot X_j\right)^\kElem\]
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Our hardness results only need a \ti instance; we also consider the special case when all the tuple probabilities (the probabilities assigned to $X_i$ by $\vct{p}$) are the same value. Note that this polynomial can be encoded in an expression tree of size $\Theta(km)$.
Our hardness results only need a \ti instance; we also consider the special case when all the tuple probabilities (the probabilities assigned to $X_i$ by $\probAllTup$) are the same value. Note that this polynomial can be encoded in an expression tree of size $\Theta(km)$.
Following on Example~\ref{ex:intro}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ is the query polynomial corresponding to the query:
\[\poly^k_G:- R(A_1),E(A_1,B_1),R(B_1),\dots,R(A_\kElem),E(A_\kElem,B_\kElem),R(B_\kElem)\]
where, generalizing the PDB instance in Example~\ref{ex:intro}, relation $R$ has $n$ tuples, one for each vertex in $V=[n]$, each with probability $p$, and $E(A,B)$ has tuples corresponding to the edges in $E$ (each with probability $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $E$ as well, but since they are always present with probability $1$, we drop those. Our argument also works when all the tuples in $E$ are also present with probability $p$, but to simplify notation we assign probability $1$ to edges.}
where, generalizing the PDB instance in Example~\ref{ex:intro}, relation $R$ has $n$ tuples, one for each vertex in $V=[n]$, each with probability $\prob$, and $E(A,B)$ has tuples corresponding to the edges in $E$ (each with probability $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $E$ as well, but since they are always present with probability $1$, we drop those. Our argument also works when all the tuples in $E$ are also present with probability $\prob$, but to simplify notation we assign probability $1$ to edges.}
Note that this implies that our hard query polynomial can be created from a project-join query -- by contrast, our approximation algorithm in \Cref{sec:algo} can handle lineage polynomials generated by union of select-project-join (SPJU) queries. % (i.e. we do not need union or select operator to derive our hardness result).
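
\noindent As a small sanity check (an illustrative toy instance of our own choosing, not part of the reduction), let $G$ be the path on vertices $\{1,2,3\}$ with edges $(1,2)$ and $(2,3)$, and let $\kElem = 2$. Expanding the definition gives
\[\poly_{G}^2(X_1,X_2,X_3) = \left(X_1X_2 + X_2X_3\right)^2 = X_1^2X_2^2 + 2X_1X_2^2X_3 + X_2^2X_3^2.\]
In general, multiplying out $\poly_{G}^\kElem$ for a graph with $m$ edges produces $m^\kElem$ products before like terms are merged, whereas the factorized form above has size $\Theta(km)$.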

@ -52,8 +52,9 @@ We call a polynomial $\query(\vct{X})$ a \emph{\bi-lineage polynomial} (resp., \
there exists a $\raPlus$ query $\query$, \bi $\pxdb$ (\ti $\pxdb$, or $\semNX$-PDB $\pxdb$), and tuple $\tup$ such that $\query(\vct{X}) = \query(\pxdb)(\tup)$. % Before proceeding, note that the following is assume that polynomials are \bis (which subsume \tis as a special case).
As they are a special case of \bis, the following applies to \tis as well.
Recall that in a \bi $\pxdb$ with tuples $t_1, \ldots, t_n$, each input tuple $t_i$ is annotated with a unique variable $X_i$.
Tuples of $\pxdb$ are partitioned into $\ell$ blocks $\block_1, \ldots, \block_\ell$ where tuple $t_i$ is associated with a probability $\prob(\tup_i) = \pd[X_i = 1]$.\footnote{
Note the deviation from the more common approach of defining a single independent, $[\abs{\block_i}+1]$-valued variable per block; Here we define $\abs{\block_i}$ correlated variables per block.
Tuples of $\pxdb$ are partitioned into $\ell$ blocks $\block_1, \ldots, \block_\ell$ where tuple $t_i$ is associated with a probability $\prob_{\tup_i} = \pd[X_i = 1]$.
\footnote{
Although it is customary to define a single independent, $[\abs{\block_j}+1]$-valued variable per block, we decompose it into $\abs{\block_j}$ correlated $\{0,1\}$-valued variables per block that can be directly used in polynomials (without an indicator function). For $t_i \in b_j$, the event $(X_i = 1)$ is identical to the event $(X_j = i)$ in the customary annotation scheme.
}
Because blocks are independent and tuples from the same block are disjoint, $\prob$ and the blocks induce the probability distribution $\pd$ of $\pxdb$.
We will write a \bi-lineage polynomial $\poly(\vct{X})$ for a \bi with $\ell$ blocks as
@ -77,8 +78,8 @@ For example when $S_0=\inset{X^2-X, Y^2-Y}$, taking the polynomial $2X^2 + 3XY -
%
\begin{Definition}\label{def:mod-set-polys}
Given the set of BIDB variables $\inset{X_{b,i}}$, define
\[\mathcal{B}=\comprehension{X_{b,i}\cdot X_{b,j}}{\text{ for every block } b \text{and } i\ne j}\]
\[\mathcal{T}=\comprehension{X_{b,i}^2-X_{b,i}}{\text{ for every block } b \text{and } i}\]
\[\mathcal{B}=\comprehension{X_{b,i}\cdot X_{b,j}}{\text{ for every block } b \text{ and } i\ne j \in [~\abs{\block}~]}\]
\[\mathcal{T}=\comprehension{X_{b,i}^2-X_{b,i}}{\text{ for every block } b \text{ and } i \in [~\abs{\block}~]}\]
\end{Definition}
%
\begin{Definition}[Reduced \bi Polynomials]\label{def:reduced-bi-poly}
@ -123,9 +124,9 @@ Consider $\poly(X, Y) = (X + Y)(X + Y)$ where $X$ and $Y$ are from different blo
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Valid Worlds]
For probability distribution $\vct{P}$ and its corresponding PMF $P$, the set of valid worlds $\eta$ consists of the worlds with probability greater than $0$; i.e., for variable vector $\vct{W}$
For probability distribution $\probDist$ and its corresponding PMF $\probOf$, the set of valid worlds $\eta$ consists of the worlds with probability greater than $0$; i.e., for variable vector $\vct{W}$
\[
\eta = \{\vct{w}\st P[\vct{W} = \vct{w}] > 0\}
\eta = \{\vct{w}\st \probOf[\vct{W} = \vct{w}] > 0\}
\]
\end{Definition}
@ -143,10 +144,10 @@ We state additional equivalences between $\poly(\vct{X})$ and $\rpoly(\vct{X})$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}\label{lem:exp-poly-rpoly}
Let $\pxdb$ be a \bi over variables $\vct{X} = \{X_1, \ldots, X_\numvar\}$ and with probability distribution $\vct{p} = (\prob_1, \ldots, \prob_\numvar)$ over all $\vct{w}$ in $\eta$. For any \bi-lineage polynomial $\poly(\vct{X})$ based on $\pxdb$ and query $\query$ we have:
Let $\pxdb$ be a \bi over variables $\vct{X} = \{X_1, \ldots, X_\numvar\}$ and with probability distribution $\probDist$ produced by the tuple probability vector $\probAllTup = (\prob_1, \ldots, \prob_\numvar)$ over all $\vct{w}$ in $\eta$. For any \bi-lineage polynomial $\poly(\vct{X})$ based on $\pxdb$ and query $\query$ we have:
% The expectation over possible worlds in $\poly(\vct{X})$ is equal to $\rpoly(\prob_1,\ldots, \prob_\numvar)$.
\begin{equation*}
\expct_{\vct{w}\sim \vct{p}}\pbox{\poly(\vct{W})} = \rpoly(\vct{p}).
\expct_{\vct{W}\sim \probDist}\pbox{\poly(\vct{W})} = \rpoly(\probAllTup).
\end{equation*}
\end{Lemma}
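
\noindent To illustrate the lemma (a hand-worked instance under our own assumptions: $X$ and $Y$ lie in different blocks and are present with probabilities $\prob_X$ and $\prob_Y$, names introduced only for this example), take $\poly(X, Y) = (X + Y)(X + Y)$. Writing $W_X, W_Y \in \{0,1\}$ for the corresponding random variables, we have $W_X^2 = W_X$ and $W_Y^2 = W_Y$, and independence across blocks gives $\expct\pbox{W_X W_Y} = \prob_X \prob_Y$, so
\[\expct\pbox{(W_X + W_Y)^2} = \expct\pbox{W_X^2} + 2\expct\pbox{W_X W_Y} + \expct\pbox{W_Y^2} = \prob_X + 2\prob_X\prob_Y + \prob_Y.\]
On the other side, reducing every exponent to $1$ yields $\rpoly(X, Y) = X + 2XY + Y$, and hence $\rpoly(\prob_X, \prob_Y) = \prob_X + 2\prob_X\prob_Y + \prob_Y$, which matches the expectation.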
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

@ -11,7 +11,7 @@ Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\p
\[\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}\]
For a probabilistic database $\pdb = (\idb, \pd)$, the result of a query is the pair $(\query(\idb), \pd')$ where $\pd'$ is a probability distribution over $\query(\idb)$ that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer:
\[\forall \db \in \query(\idb): \pd'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \pd(\db') \]
\[\forall \db \in \query(\idb): \probOf'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \probOf(\db') \]
Note that in this work we consider multisets, i.e., each possible world is a set of multiset relations and queries are evaluated using bag semantics. We will use K-relations to model multisets. A \emph{K-relation}~\cite{DBLP:conf/pods/GreenKT07} is a relation whose tuples are annotated with elements from a commutative semiring $\semK = (\domK, \addK, \multK, \zeroK, \oneK)$. A commutative semiring is a structure with a domain $\domK$ and associative and commutative binary operations $\addK$ and $\multK$ such that $\multK$ distributes over $\addK$, $\zeroK$ is the identity of $\addK$, $\oneK$ is the identity of $\multK$, and $\zeroK$ annihilates all elements of $\domK$ when combined by $\multK$.
Let $\udom$ be a countable domain of values.
@ -20,10 +20,10 @@ A $\semK$-database is a set of $\semK$-relations. It will be convenient to also
We review positive relational algebra semantics for $\semK$-relations below.
Consider the semiring $\semN = (\domN,+,\times,0,1)$ of natural numbers. $\semN$-databases model bag semantics by annotating each tuple with its multiplicity. A probabilistic $\semN$-database ($\semN$-PDB) is a PDB where each possible world is an $\semN$-database. We study the problem of computing statistical moments for query results over such databases. Specifically, given a probabilistic $\semN$-database $\pdb = (\idb, \pd)$, query $\query$, and possible result $t$, we treat $\query(\db)(t)$ as a random $\semN$-valued variable and are interested in computing its expectation $\expct_{\idb \sim \pd}[\query(\db)(t)]$:
Consider the semiring $\semN = (\domN,+,\times,0,1)$ of natural numbers. $\semN$-databases model bag semantics by annotating each tuple with its multiplicity. A probabilistic $\semN$-database ($\semN$-PDB) is a PDB where each possible world is an $\semN$-database. We study the problem of computing statistical moments for query results over such databases. Specifically, given a probabilistic $\semN$-database $\pdb = (\idb, \pd)$, query $\query$, and possible result $t$, we treat $\query(\db)(t)$ as a random $\semN$-valued variable and are interested in computing its expectation $\expct_{\idb \sim \probDist}[\query(\db)(t)]$:
%
\begin{align}\label{eq:bag-expectation}
\expct_{\idb \sim \pd}[\query(\db)(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \pd(\db)
\expct_{\idb \sim \probDist}[\query(\db)(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \probOf(\db)
\end{align}
%
Intuitively, the expectation of $\query(\db)(t)$ is the number of duplicates of $t$ we expect to find in the result of query $\query$.
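
\noindent For a concrete feel (a made-up instance, with worlds $\db_1, \db_2$ and probabilities chosen purely for illustration), suppose $\idb = \{\db_1, \db_2\}$ with $\probOf(\db_1) = 0.3$ and $\probOf(\db_2) = 0.7$, and that $\query(\db_1)(t) = 2$ while $\query(\db_2)(t) = 1$. Then \Cref{eq:bag-expectation} evaluates to
\[\expct_{\idb \sim \probDist}[\query(\db)(t)] = 2 \cdot 0.3 + 1 \cdot 0.7 = 1.3.\]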
@ -53,13 +53,13 @@ Let $\semNX$ denote the set of polynomials over variables $\vct{X}$ with natural
Consider now the semiring $(\semNX, +, \cdot, 0, 1)$ whose domain is $\semNX$ and with the standard addition and multiplication of polynomials.
We will utilize $\semNX$-PDB $\pxdb$, defined as the tuple $(\db, \pd)$, where $\semNX$-database $\db$ is paired with probability distribution $\pd$.
We denote by $\polyForTuple$ the annotation of tuple $t$ in the result of $\query$ (i.e., $\polyForTuple = \query(\pxdb)(t)$) and as before, interpret it as a function $\polyForTuple: \{0,1\}^{|\vct X|} \rightarrow \semN$ from vectors of variable assignments to the corresponding value of the annotating polynomial.
$\semNX$-PDBs, a function $\rmod$, which takes an $\semNX$-PDB input and outputs an equivalent $\semN$-PDB are formally defined in \Cref{subsec:supp-mat-background}.
$\semNX$-PDBs and a function $\rmod$ that takes an $\semNX$-PDB input and outputs an equivalent $\semN$-PDB are formally defined in \Cref{subsec:supp-mat-background}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
Given an $\semN$-PDB $\pdb = (\idb,\pd)$ and $\semNX$-PDB $\pxdb = (\db,\pd')$ where $\rmod(\pxdb) = \pdb$:
\[ \expct_{\idb \sim \pd}[\query(\db)(t)] = \expct_{\vct{w} \sim \pd'}\pbox{\polyForTuple(\vct{w})} \]
Given an $\semN$-PDB $\pdb = (\idb,\pd)$ and $\semNX$-PDB $\pxdb = (\db',\pd')$ where $\rmod(\pxdb) = \pdb$:
\[ \expct_{\idb \sim \pd}[\query(\db)(t)] = \expct_{\vct{W} \sim \pd'}\pbox{\polyForTuple(\vct{W})} \]
\end{Proposition}
\noindent A formal proof of \Cref{prop:expection-of-polynom} is given in \Cref{subsec:expectation-of-polynom-proof}.
This proposition shows that computing expected tuple multiplicities is equivalent to computing the expectation of a polynomial (for that tuple) from a probability distribution over all possible assignments of variables in the polynomial to $\{0,1\}$.

@ -1,7 +1,7 @@
%!TEX root=./main.tex
\section{Related Work}\label{sec:related-work}
In addition to probabilistic databases, our work has connections to work on compact representations of polynomials and on fine-grained complexity which we review in \Cref{sec:compr-repr-polyn,sec:param-compl}.
In addition to probabilistic databases, our work has connections to work on compact representations of polynomials and on fine-grained complexity, which we review in \Cref{sec:compr-repr-polyn,sec:param-compl}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\subsection{Probabilistic Databases}\label{sec:prob-datab}
@ -19,7 +19,7 @@ This is similar to our $\semNX$-PDBs, but we use polynomials instead of Boolean
Approaches for probabilistic query processing (i.e., computing the marginal probability for query result tuples) fall into two broad categories.
\emph{Intensional} (or \emph{grounded}) query evaluation computes the \emph{lineage} of a tuple % (a Boolean formula encoding the provenance of the tuple)
and then the probability of the lineage formula.
In this paper we focus on intensional query evaluation using polynomials instead of boolean formulas.
In this paper we focus on intensional query evaluation using polynomials instead of Boolean formulas.
It is a well-known fact that computing the marginal probability of a tuple is \sharpphard (proven through a reduction from weighted model counting~\cite{valiant-79-cenrp} %provan-83-ccccptg
using the fact that the tuple's marginal probability is the probability of its lineage formula).
The second category, \emph{extensional} query evaluation, % avoids calculating the lineage.

@ -6,24 +6,24 @@
\label{sec:single-p}
%In this discussion, let us fix $\kElem = 3$.
While \Cref{thm:mult-p-hard-result} shows that computing $\rpoly(\prob,\dots,\prob)$ in general is hard, it does not rule out the possibility that one can compute this value exactly for a {\em fixed} value of $p$. Indeed, it is easy to check that one can compute $\rpoly(\prob,\dots,\prob)$ exactly in linear time for $p\in \inset{0,1}$. In this section, we show that these two are the only possibilities:
While \Cref{thm:mult-p-hard-result} shows that computing $\rpoly(\prob,\dots,\prob)$ in general is hard, it does not rule out the possibility that one can compute this value exactly for a {\em fixed} value of $\prob$. Indeed, it is easy to check that one can compute $\rpoly(\prob,\dots,\prob)$ exactly in linear time for $\prob\in \inset{0,1}$. In this section, we show that these two are the only possibilities:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Theorem}\label{th:single-p-hard}
Fix $p\in (0,1)$. Assuming \Cref{conj:graph} is true, any algorithm that computes $\rpoly_{G}^3(\prob,\dots,\prob)$ exactly has to run in time $\Omega\inparen{\abs{E(G)}^{1+\eps_0}}$, where $\eps_0$ is as defined in \Cref{conj:graph}.
Fix $\prob\in (0,1)$. Assuming \Cref{conj:graph} is true, any algorithm that computes $\rpoly_{G}^3(\prob,\dots,\prob)$ exactly has to run in time $\Omega\inparen{\abs{E(G)}^{1+\eps_0}}$, where $\eps_0$ is as defined in \Cref{conj:graph}.
\end{Theorem}
%\begin{proof}[Proof of Corollary ~\ref{th:single-p-gen-k}]
%Consider $\poly^3_{G}$ and $\poly' = 1$ such that $\poly'' = \poly^3_{G} \cdot \poly'$. By \Cref{th:single-p}, query $\poly''$ with $\kElem = 4$ has $\Omega(\numvar^{\frac{4}{3}})$ complexity.
%\end{proof}
The above shows the hardness for a very specific query polynomial but it is easy to come up with an infinite family of hard query polynomials by `embedding' $\rpoly_{G}^3$ into an infinite family of trivial query polynomials.
Unlike \Cref{thm:mult-p-hard-result} the above result does not show that computing $\rpoly_{G}^3(\prob,\dots,\prob)$ for a fixed $p\in (0,1)$ is \sharpwonehard.
Unlike \Cref{thm:mult-p-hard-result} the above result does not show that computing $\rpoly_{G}^3(\prob,\dots,\prob)$ for a fixed $\prob\in (0,1)$ is \sharpwonehard.
However, in \Cref{sec:algo} we show that if we are willing to settle for an approximation, then this problem (and indeed our problem in a much more general setting) can be solved in linear time.
%\AH{@atri needs to put in the result for triangles of $\numvar^{\frac{4}{3}}$ runtime.}
We will prove the above result by the following reduction:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Theorem}\label{th:single-p}
Fix $p\in (0,1)$. Let $G$ be a graph on $\numedge$ edges.
Fix $\prob\in (0,1)$. Let $G$ be a graph on $\numedge$ edges.
If we can compute $\rpoly_{G}^3(\prob,\dots,\prob)$ exactly in $T(\numedge)$ time, then we can exactly compute $\numocc{G}{\tri}$, $\numocc{G}{\threepath}$ and $\numocc{G}{\threedis}$ %count the number of triangles, 3-paths, and 3-matchings in $G$
in $O\inparen{T(\numedge) + \numedge}$ time.
\end{Theorem}
@ -80,7 +80,7 @@ Note that $\rpoly_{G}^3(\prob,\ldots, \prob)$ as a polynomial in $\prob$ has deg
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}\label{lem:qE3-exp}
%When we expand $\poly_{G}^3(\vct{X})$ out and assign all exponents $e \geq 1$ a value of $1$, we have the following result,
For any $p$, we have:
For any $\prob$, we have:
{\small
\begin{align}
&\rpoly_{G}^3(\prob,\ldots, \prob) = \numocc{G}{\ed}\prob^2 + 6\numocc{G}{\twopath}\prob^3 + 6\numocc{G}{\twodis}\prob^4 + 6\numocc{G}{\tri}\prob^3\nonumber\\
@ -94,7 +94,7 @@ For any $p$, we have:
\begin{proof}[Proof of \Cref{lem:qE3-exp}]
By definition we have that
\[\poly_{G}^3(\vct{X}) = \sum_{\substack{(i_1, j_1), (i_2, j_2), (i_3, j_3) \in E}}~\; \prod_{\ell = 1}^{3}X_{i_\ell}X_{j_\ell}.\]
Hence $\rpoly_{G}^3(\vct{X})$ has degree six. Note that the monomial $\prod_{\ell = 1}^{3}X_{i_\ell}X_{j_\ell}$ will contribute to the coefficient of $p^\nu$ in $\rpoly_{G}^3(\vct{X})$, where $\nu$ is the number of distinct variables in the monomial.
Hence $\rpoly_{G}^3(\vct{X})$ has degree six. Note that the monomial $\prod_{\ell = 1}^{3}X_{i_\ell}X_{j_\ell}$ will contribute to the coefficient of $\prob^\nu$ in $\rpoly_{G}^3(\vct{X})$, where $\nu$ is the number of distinct variables in the monomial.
%Rather than list all the expressions in full detail, let us make some observations regarding the sum.
Let $e_1 = (i_1, j_1), e_2 = (i_2, j_2), e_3 = (i_3, j_3)$.
We compute $\rpoly_{G}^3(\vct{X})$ by considering each of the three forms that the triple $(e_1, e_2, e_3)$ can take.
@ -103,14 +103,14 @@ We compute $\rpoly_{G}^3(\vct{X})$ by considering each of the three forms that t
\textsc{case 2:} This case occurs when there are two distinct edges of the three, call them $e$ and $e'$. When there are two distinct edges, there is then the occurrence when $2$ variables in the triple $(e_1, e_2, e_3)$ are bound to $e$. There are three combinations for this occurrence in $\poly_{G}^3(\vct{X})$. Analogously, there are three such occurrences in $\poly_{G}^3(\vct{X})$ when there is only one occurrence of $e$, i.e. $2$ of the variables in $(e_1, e_2, e_3)$ are $e'$. %Again, there are three combinations for this.
This implies that all $3 + 3 = 6$ combinations of two distinct edges $e$ and $e'$ contribute to the same monomial in $\rpoly_{G}^3$. % consist of the same monomial in $\rpoly$, i.e. $(e_1, e_1, e_2)$ is the same as $(e_2, e_1, e_2)$.
Since $e\ne e'$, this case produces the following edge patterns: $\twopath, \twodis$, which contribute $p^3$ and $p^4$ respectively to $\rpoly_{G}^3\left(\prob,\ldots, \prob\right)$.
Since $e\ne e'$, this case produces the following edge patterns: $\twopath, \twodis$, which contribute $\prob^3$ and $\prob^4$ respectively to $\rpoly_{G}^3\left(\prob,\ldots, \prob\right)$.
\textsc{case 3:} All $e_1,e_2$ and $e_3$ are distinct. For this case, we have $3! = 6$ permutations of $(e_1, e_2, e_3)$, each of which contributes to a different monomial in the SOP expansion of $\poly_{G}^3(\vct{X})$. This case consists of the following edge patterns: $\tri, \oneint, \threepath, \twopathdis, \threedis$, which contribute $p^3,p^4,p^4,p^5$ and $p^6$ respectively to $\rpoly_{G}^3\left(\prob,\ldots, \prob\right)$.
\textsc{case 3:} All $e_1,e_2$ and $e_3$ are distinct. For this case, we have $3! = 6$ permutations of $(e_1, e_2, e_3)$, each of which contributes to a different monomial in the SOP expansion of $\poly_{G}^3(\vct{X})$. This case consists of the following edge patterns: $\tri, \oneint, \threepath, \twopathdis, \threedis$, which contribute $\prob^3, \prob^4, \prob^4, \prob^5$ and $\prob^6$ respectively to $\rpoly_{G}^3\left(\prob,\ldots, \prob\right)$.
\end{proof}
\qed
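
\noindent As a hand-worked check of the counting (on a toy graph of our own choosing), let $G$ be the path with edges $(1,2)$ and $(2,3)$. Then
\[\poly_{G}^3(\vct{X}) = \left(X_1X_2 + X_2X_3\right)^3 = X_1^3X_2^3 + 3X_1^2X_2^3X_3 + 3X_1X_2^3X_3^2 + X_2^3X_3^3,\]
and replacing every exponent $e \geq 1$ by $1$ gives $\rpoly_{G}^3(\vct{X}) = X_1X_2 + 6X_1X_2X_3 + X_2X_3$, so that
\[\rpoly_{G}^3(\prob,\prob,\prob) = 2\prob^2 + 6\prob^3.\]
This agrees with the terms $\numocc{G}{\ed}\prob^2 + 6\numocc{G}{\twopath}\prob^3$ for $\numocc{G}{\ed} = 2$ and $\numocc{G}{\twopath} = 1$; all other edge patterns have count zero in this graph.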
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Since $p$ is fixed, \Cref{lem:qE3-exp} gives us one linear equation in $\numocc{G}{\tri},$ $\numocc{G}{\threepath}$ and $\numocc{G}{\threedis}$ (we can handle the other counts due to \Cref{eq:1e}-\Cref{eq:2pd-3d}). However, we plan to generate two more independent linear equations in these three variables. Towards this end, we generate more graphs that are related to $G$:
Since $\prob$ is fixed, \Cref{lem:qE3-exp} gives us one linear equation in $\numocc{G}{\tri},$ $\numocc{G}{\threepath}$ and $\numocc{G}{\threedis}$ (we can handle the other counts due to \Cref{eq:1e}-\Cref{eq:2pd-3d}). However, we plan to generate two more independent linear equations in these three variables. Towards this end, we generate more graphs that are related to $G$:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}\label{def:Gk}
For $\ell > 1$, let graph $\graph{\ell}$ be a graph generated from an arbitrary graph $\graph{1}$, by replacing every edge $e$ of $\graph{1}$ with an $\ell$-path, such that all $\ell$-path replacement edges are disjoint. % in the sense that they only intersect at the original intersection endpoints as seen in $\graph{1}$.
@ -167,7 +167,7 @@ Using the results we have obtained so far, we will prove the following reduction
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}\label{lem:lin-sys}
%Using the identities of lemmas [\ref{lem:3m-G2}, \ref{lem:3m-G3}, \ref{lem:3p-G2}, \ref{lem:3p-G3}, \ref{lem:tri}] to compute $\numocc{G}{\threedis}, \numocc{G}{\threepath}, \numocc{G}{\tri}$ for $G \in \{\graph{2}, \graph{3}\}$, there exists a linear system $\mtrix{\rpoly}\cdot (x~y~z~)^T = \vct{b}$ which can then be solved to determine the unknown quantities of $\numocc{\graph{1}}{\threedis}, \numocc{\graph{1}}{\threepath}$, and $\numocc{\graph{1}}{\tri}$.
Fix $p\in (0,1)$. Given $\rpoly_{\graph{\ell}}^3(\prob,\dots,\prob)$ for $\ell\in [3]$, we can compute in $O(m)$ time a vector $\vct{b}\in\rel^3$ such that
Fix $\prob\in (0,1)$. Given $\rpoly_{\graph{\ell}}^3(\prob,\dots,\prob)$ for $\ell\in [3]$, we can compute in $O(m)$ time a vector $\vct{b}\in\rel^3$ such that
\[ \begin{pmatrix}
1 & \prob & -(3\prob^2 - \prob^3)\\
-2(3\prob^2 - \prob^3) & -4(3\prob^2 - \prob^3) & 10(3\prob^2 - \prob^3)\\