Merge branch 'master' of gitlab.odin.cse.buffalo.edu:ahuber/SketchingWorlds

master
Boris Glavic 2020-12-19 15:00:34 -06:00
commit 00ef8e88c6
14 changed files with 206 additions and 164 deletions

View File

@ -3,8 +3,8 @@
\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability of being present (\Cref{th:single-p-hard}).
Given this, in this section we design an approximation algorithm for our problem that runs in {\em linear time}.
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.
Unlike the results in~\Cref{sec:hard}, our approximation algorithm works for \bis, though our bounds are more meaningful for a non-trivial subclass of \bis that includes \tis as well as the PDBench benchmark.
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
@ -87,8 +87,9 @@ Consider the factorized representation $(X+ 2Y)(2X - Y)$ of the polynomial in~\C
\end{Example}
\begin{figure}[h!]
\begin{figure}[t]
\resizebox{0.65\columnwidth}{!}{
\begin{tikzpicture}[thick, level distance=0.9cm,level 1/.style={sibling distance=3.55cm}, level 2/.style={sibling distance=1.8cm}, level 3/.style={sibling distance=0.8cm}]% level/.style={sibling distance=6cm/(#1 * 1.5)}]
\node[tree_node](root){$\boldsymbol{\times}$}
child{node[tree_node]{$\boldsymbol{+}$}
@ -119,12 +120,13 @@ Consider the factorized representation $(X+ 2Y)(2X - Y)$ of the polynomial in~\C
\draw[<-|, highlight_color] (TR) -- (tr-label);
\draw[<-|, highlight_color] (root) -- (t-label);
\end{tikzpicture}
}
\caption{Expression tree $\etree$ for the product $\boldsymbol{(x + 2y)(2x - y)}$.}
\label{fig:expr-tree-T}
\trimfigurespacing
\end{figure}
\begin{Definition}[Positive Tree]\label{def:positive-tree}
For any expression tree $\etree$, the corresponding
{\em positive tree}, denoted $\abs{\etree}$, is obtained from $\etree$ as follows. For each leaf node $\ell$ of $\etree$ where $\ell.\type$ is $\tnum$, update $\ell.\vari{value}$ to $|\ell.\vari{value}|$. %value $\coef$ of each coefficient leaf node in $\etree$ is set to %$\coef_i$ in $\etree$ is exchanged with its absolute value$|\coef|$.
@ -133,7 +135,7 @@ For any expression tree $\etree$, the corresponding
Using the same factorization from~\Cref{example:expr-tree-T}, $poly(\abs{\etree}) = (X + 2Y)(2X + Y) = 2X^2 + XY + 4XY + 2Y^2 = 2X^2 + 5XY + 2Y^2$. Note that this \textit{is not} the same as the polynomial from~\Cref{eq:poly-eg}.
\begin{Definition}[Evaluation]\label{def:exp-poly-eval}
Given an expression tree $\etree$ and $\vct{v} \in \mathbb{R}^\numvar$, $\etree(\vct{v}) = poly(\etree)(\vct{v})$.
Given an expression tree $\etree$ and $\vct{v} \in \mathbb{R}^\numvar$, we define the evaluation of $\etree$ on $\vct{v}$ as $\etree(\vct{v}) = poly(\etree)(\vct{v})$.
\end{Definition}
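To make the definitions above concrete, here is a minimal Python sketch (not the paper's implementation; all names are hypothetical) of expression trees, the positive tree $\abs{\etree}$, and evaluation, using the running example $(X + 2Y)(2X - Y)$:

```python
# A minimal sketch of expression trees, |T|, and evaluation (hypothetical
# names, not the paper's implementation).
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Node:
    type: str                                  # '+', '*', 'var', or 'num'
    value: Union[str, float, None] = None      # variable name or coefficient
    left: Optional['Node'] = None
    right: Optional['Node'] = None

def positive(t: Node) -> Node:
    """The positive tree |T|: numeric leaves replaced by their absolute values."""
    if t.type == 'num':
        return Node('num', abs(t.value))
    if t.type == 'var':
        return Node('var', t.value)
    return Node(t.type, None, positive(t.left), positive(t.right))

def evaluate(t: Node, point: dict) -> float:
    """T(v) = poly(T)(v): evaluate the polynomial represented by the tree."""
    if t.type == 'num':
        return t.value
    if t.type == 'var':
        return point[t.value]
    l, r = evaluate(t.left, point), evaluate(t.right, point)
    return l + r if t.type == '+' else l * r

# The running example (X + 2Y)(2X - Y):
def add(a, b): return Node('+', None, a, b)
def mul(a, b): return Node('*', None, a, b)
def num(c):    return Node('num', c)
def var(x):    return Node('var', x)

T = mul(add(var('X'), mul(num(2), var('Y'))),
        add(mul(num(2), var('X')), mul(num(-1), var('Y'))))
```

At the all-ones point, $\etree$ evaluates to $3$ while $\abs{\etree}$ evaluates to $9$, consistent with $2X^2 + 5XY + 2Y^2$ at $(1,1)$.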
\subsection{Our main result}
@ -157,7 +159,7 @@ P\left(\left|\mathcal{E} - \rpoly(\prob_1,\dots,\prob_\numvar)\right|> \error' \
The proof of~\Cref{lem:approx-alg} can be found in~\Cref{sec:proofs-approx-alg}.
It turns out that to get linear runtime results from~\Cref{lem:approx-alg}, we will need to define another parameter (which roughly counts the (weighted) number of monomials in $\expandtree{\etree}$ that get `canceled' when modded with $\mathcal{B}$):
To get linear runtime results from~\Cref{lem:approx-alg}, we will need to define another parameter that models the (weighted) number of monomials in $\expandtree{\etree}$ that are `canceled' when it is modded with $\mathcal{B}$:
\begin{Definition}[Parameter $\gamma$]\label{def:param-gamma}
Given an expression tree $\etree$, define
\[\gamma(\etree)=\frac{\sum_{(\monom, \coef)\in \expandtree{\etree}} \abs{\coef}\cdot \indicator{\monom\mod{\mathcal{B}}\equiv 0}}{\abs{\etree}(1,\ldots, 1)}\]
@ -182,10 +184,10 @@ We note that the restriction on $\gamma$ is satisfied by \ti (where $\gamma=0$)
\OK{@Atri: This seems like a reasonable claim. It's too late for me to come up with a reasonable motivation (maybe something will come to me in the morning), but the intuition for me is that each tuple/block is independent... it would be hard for that to be the case if the probability were a function of the number of tuples.}
\subsection{Approximating $\rpoly$}
The algorithm to prove~\Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=poly(\etree)$ for expression tree $\etree$ over $\bi$, we note that we can exactly represent $\rpoly(\vct{X}$ as follows:
The algorithm to prove~\Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=poly(\etree)$ for expression tree $\etree$ over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
\begin{equation}
\label{eq:tilde-Q-bi}
\rpoly\inparen{X_1,\dots,X_\numvar}=\sum_{(v,c)\in \expandtree{\etree}} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot c\cdot\prod_{X_i\in \var\inparen{v}} X_i.
\rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(v,c)\in \expandtree{\etree}} \hspace*{-2mm} \indicator{v\mod{\mathcal{B}}\not\equiv 0}\cdot c\cdot\hspace*{-1mm}\prod_{X_i\in \var\inparen{v}}\hspace*{-2mm} X_i.
\end{equation}
Given the above, our algorithm is a sampling-based algorithm for this sum: we sample $(v,c)\in \expandtree{\etree}$ with probability proportional\footnote{We could have also uniformly sampled from $\expandtree{\etree}$, but sampling proportional to $\abs{c}$ gives better parameters.}
%\AH{Regarding the footnote, is there really a difference? I \emph{suppose} technically, but in this case they are \emph{effectively} the same. Just wondering.}
@ -233,7 +235,7 @@ to $\abs{c}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot
%\bi Version of Approximation Algorithm
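The estimator behind this sum can be sketched as follows (illustrative Python only; for clarity it samples from an explicitly materialized $\expandtree{\etree}$, whereas the actual algorithm samples via the tree without materialization, and all names are hypothetical):

```python
import random

def estimate_rpoly(expanded, probs, blocks, n_samples):
    """Monte Carlo sketch: sample (v, c) proportional to |c|, compute
    Y = 1[v mod B != 0] * sign(c) * prod p_i, and scale the mean by |T|(1,...,1).
    expanded: list of (vars, coef) pairs, i.e. E(T) materialized (illustration only).
    probs:    dict variable -> probability p_i.
    blocks:   dict variable -> block id; a monomial is canceled (0 mod B) when
              two distinct variables share a block.
    """
    total = sum(abs(c) for _, c in expanded)          # |T|(1, ..., 1)
    weights = [abs(c) / total for _, c in expanded]
    acc = 0.0
    for _ in range(n_samples):
        vs, c = random.choices(expanded, weights=weights, k=1)[0]
        block_ids = [blocks[v] for v in set(vs)]
        canceled = len(block_ids) != len(set(block_ids))
        if not canceled:
            y = 1.0 if c > 0 else -1.0                # sign(c)
            for v in set(vs):
                y *= probs[v]
            acc += y
    return total * acc / n_samples
```

With all probabilities $1$, only positive coefficients, and no shared blocks, every sample contributes $1$ and the estimate equals $\abs{\etree}(1,\ldots,1)$ exactly, as expected.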
\begin{algorithm}[H]
\begin{algorithm}[t]
\caption{$\approxq(\etree, \vct{p}, \conf, \error)$}
\label{alg:mon-sam}
\begin{algorithmic}[1]
@ -328,9 +330,9 @@ we first state the lemmas that summarize the relevant properties of $\onepass$ a
\begin{Lemma}\label{lem:one-pass}
The $\onepass$ function completes in $O(size(\etree))$ time. After $\onepass$ returns the following post conditions hold. First, for each subtree $\vari{S}$ of $\etree$, we have that $\vari{S}.\vari{partial}$ is set to $\abs{\vari{S}}(1,\ldots, 1)$. Second, when $\vari{S}.\type = +$, each $\vari{child}$ of $\vari{S}$, $\vari{child}.\vari{weight}$ is set to $\frac{\abs{\vari{S}_{\vari{child}}}(1,\ldots, 1)}{\abs{\vari{S}}(1,\ldots, 1)}$. % is correctly computed for each child of $\vari{S}.$
The $\onepass$ function completes in $O(size(\etree))$ time. $\onepass$ guarantees two post conditions: First, for each subtree $\vari{S}$ of $\etree$, we have that $\vari{S}.\vari{partial}$ is set to $\abs{\vari{S}}(1,\ldots, 1)$. Second, when $\vari{S}.\type = +$, for each $\vari{child}$ of $\vari{S}$, $\vari{child}.\vari{weight}$ is set to $\frac{\abs{\vari{S}_{\vari{child}}}(1,\ldots, 1)}{\abs{\vari{S}}(1,\ldots, 1)}$. % is correctly computed for each child of $\vari{S}.$
\end{Lemma}
In proving correctness of~\Cref{alg:mon-sam}, we will only use the following fact (which follows from the above lemma), $\etree_{\vari{mod}}.\vari{partial}=\abs{\etree}(1,\dots,1)$.
To prove correctness of~\Cref{alg:mon-sam}, we only use the following fact that follows from the above lemma: $\etree_{\vari{mod}}.\vari{partial}=\abs{\etree}(1,\dots,1)$.
%\AH{I'm wondering if there is a better notation to use here. I myself got confused by my own notation of $\etree_{\vari{mod}}$. \emph{But}, we need to to be referencing the modified $\etree$ returned by $\onepass$ in the algorithm, so maybe this is the best we can do?}
%\AR{yeah, I think this is fine.}
%At the conclusion of $\onepass$, $\etree.\vari{partial}$ will hold the sum of all coefficients in $\expandtree{\abs{\etree}}$, i.e., $\sum\limits_{(\monom, \coef) \in \expandtree{\abs{\etree}}}\coef$. $\etree.\vari{weight}$ will hold the weighted probability that $\etree$ is sampled from from its parent $+$ node.
@ -357,7 +359,7 @@ $\empmean$ has bounds
The evaluation of $\abs{\etree}(1,\ldots, 1)$ can be defined recursively, as follows (where $\etree_\lchild$ and $\etree_\rchild$ are the `left' and `right' children of $\etree$ if they exist):
{\small
\begin{align}
\label{eq:T-all-ones}
\abs{\etree}(1,\ldots, 1) = \begin{cases}
@ -367,6 +369,7 @@ The evaluation of $\abs{\etree}(1,\ldots, 1)$ can be defined recursively, as fol
1 &\textbf{if }\etree.\type = \var.
\end{cases}
\end{align}
}
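The recursive cases above translate directly into code; a minimal sketch (hypothetical tuple-based node encoding, not the paper's implementation):

```python
# Recursive computation of |T|(1,...,1), mirroring the cases above
# (hypothetical encoding: ('+', l, r), ('*', l, r), ('num', c), ('var', x)).
def abs_t_all_ones(t) -> float:
    kind = t[0]
    if kind == 'num':
        return abs(t[1])      # coefficient leaf contributes |value|
    if kind == 'var':
        return 1              # variable leaf evaluates to 1
    l, r = abs_t_all_ones(t[1]), abs_t_all_ones(t[2])
    return l + r if kind == '+' else l * r
```

For the running example $(X + 2Y)(2X - Y)$ this returns $3 \cdot 3 = 9$, matching $\abs{\etree}(1,1) = 2 + 5 + 2$.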
%\begin{align*}
%&\eval{\etree ~|~ \etree.\type = +}_{\abs{\etree}} =&& \eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}\\
@ -387,38 +390,36 @@ It turns out that for proof of~\Cref{lem:sample}, we need to argue that when $\e
%\begin{align*}
%&\eval{\etree~|~\etree.\type = +}_{\wght} =&&\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}; \etree_\lchild.\wght = \frac{\eval{\etree_\lchild}_{\abs{\etree}}}{\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}}; \etree_\rchild.\wght = \frac{\eval{\etree_\rchild}_{\abs{\etree}}}{\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}}
%\end{align*}
Algorithm ~\ref{alg:one-pass} essentially implements the above definitions.
\noindent \onepass\ (Algorithm~\ref{alg:one-pass} in \Cref{sec:proofs-approx-alg}) essentially populates the \vari{weight} variable of each node according to the above definitions.
%\subsubsection{Psuedo Code}
%See algorithm ~\ref{alg:one-pass} for details.
For an example of how $\onepass$ works, the pseudocode, and the proof of correctness (\Cref{lem:one-pass}) of Algorithm~\ref{alg:one-pass}, see~\Cref{sec:proofs-approx-alg}.
\subsection{\sampmon\ Algorithm}
\label{sec:samplemonomial}
%Algorithm ~\ref{alg:sample} takes $\etree$ as input, samples an arbitrary $(\monom, \coef)$ from $\expandtree{\etree}$ with probabilities $\stree_\lchild.\wght$ and $\stree_\rchild.\wght$ for each subtree $\stree$ with $\stree.\type = +$, outputting the tuple $(\monom, \sign(\coef))$. While one cannot compute $\expandtree{\etree}$ in time better than $O(N^k)$, the algorithm, similar to \textsc{OnePass}, uses a technique on $\etree$ which produces a sample from $\expandtree{\etree}$ without ever materializing $\expandtree{\etree}$.
One way to implement \sampmon\ would be to compute $E(T)$ and then sample from it. However, this would be too time consuming.
A naive (slow) implementation of \sampmon\ would first compute $E(T)$ and then sample from it.
% However, this would be too time consuming.
%
Instead~\Cref{alg:sample} selects a monomial from $\expandtree{\etree}$ by the following top-down traversal. For a parent $+$ node, a subtree is chosen over the previously computed weighted sampling distribution. When a parent $\times$ node is visited, both children are visited. All variable leaf nodes of the subgraph traversal are added to a set. Additionally, the product of signs over all coefficient leaf nodes of the subgraph traversal is computed. The algorithm returns a set of the distinct variables of which the monomial is composed and the monomial's sign.
Instead, \Cref{alg:sample} selects a monomial from $\expandtree{\etree}$ by top-down traversal.
For a parent $+$ node, the child to be visited is sampled from the weighted distribution precomputed by \onepass.
When a parent $\times$ node is visited, both children are visited.
The algorithm computes two properties: the set of all variable leaf nodes visited, and the product of signs of visited coefficient leaf nodes.
%\begin{Definition}[TreeSet]
%A TreeSet is a data structure whose elements form a set, each of which are stored in a binary tree.
%\end{Definition}
We assume a TreeSet data structure that maintains a set of elements with logarithmic-time insertion and linear-time traversal.
%
$\sampmon$ is given in \Cref{alg:sample}, and a proof of its correctness (via \Cref{lem:sample}) is provided in \Cref{sec:proofs-approx-alg}.
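The traversal just described can be sketched as follows (illustrative Python with a hypothetical node layout; the weights at $+$ nodes are assumed to have been populated by $\onepass$):

```python
import random

def sample_monomial(t):
    """Top-down sketch of SampleMonomial (hypothetical node layout: dicts with
    'type', plus 'children'/'weights' at '+' and '*', 'value' at leaves).
    Returns (set of variables in the sampled monomial, its sign)."""
    variables, sign = set(), 1
    stack = [t]
    while stack:
        node = stack.pop()
        kind = node['type']
        if kind == '+':
            # choose exactly one child via the weights precomputed by OnePass
            child = random.choices(node['children'], weights=node['weights'], k=1)[0]
            stack.append(child)
        elif kind == '*':
            stack.extend(node['children'])     # visit both children
        elif kind == 'var':
            variables.add(node['value'])       # collect variable leaf
        else:                                  # numeric coefficient leaf
            sign *= 1 if node['value'] >= 0 else -1
    return variables, sign
```

On a tree with no $+$ nodes the output is deterministic: every leaf is visited, and the sign is the product of the coefficient signs.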
\subsubsection{Pseudo Code}
See Algorithm~\ref{alg:sample} for the details of the $\sampmon$ algorithm.
\begin{algorithm}
\begin{algorithm}[t]
\caption{\sampmon(\etree)}
\label{alg:sample}
\begin{algorithmic}[1]
@ -451,7 +452,6 @@ See algorithm ~\ref{alg:sample} for the details of $\sampmon$ algorithm.
\end{algorithmic}
\end{algorithm}
We argue the correctness of Algorithm~\ref{alg:sample} by proving~\Cref{lem:sample} in~\Cref{sec:proofs-approx-alg}.
% \subsection{Experimental results}
% \label{sec:experiments}
% We conducted an experiment running modified TPCH queries over uncertain data generated by pdbench~\cite{pdbench}, both of which (data and queries) represent what is typically encountered in practice. Queries were run two times, once filtering $\bi$ cancellations, and then second not filtering the cancellations. The purpose of this was to determine an indication for how many $\bi$ cancellations occur in practice. Details and results can be found in~.

View File

@ -1,26 +1,35 @@
%!TEX root=./main.tex
\section{Generalizations}
\label{sec:gen}
In this section, we consider a couple of generalizations/corollaries of our results so far. In particular, in~\Cref{sec:circuits} we first consider the case when the compressed polynomial is represented by a Directed Acyclic Graph (DAG) instead of the earlier (expression) tree (\Cref{def:express-tree}) and we observe that all of our results carry over to the DAG representation. Then we formalize our claim in~\Cref{sec:intro} that a linear runtime algorithm for our problem would imply that we can process PDBs in the same time as deterministic query processing. Finally, in~\Cref{sec:momemts}, we make some simple observations on how our results can be used to estimate moments beyond the expectation of a lineage polynomial.
In this section, we consider several generalizations/corollaries of our results.
In particular, in~\Cref{sec:circuits} we first consider the case when the compressed polynomial is represented by a Directed Acyclic Graph (DAG) instead of an expression tree (\Cref{def:express-tree}) and observe that our results carry over.
Then we formalize our claim in~\Cref{sec:intro} that a linear algorithm for our problem implies that PDB queries can be answered in the same runtime as deterministic queries.
Finally, in~\Cref{sec:momemts}, we observe how our results can be used to estimate moments other than the expectation.
\subsection{Lineage circuits}
\label{sec:circuits}
In~\Cref{sec:semnx-as-repr}, we switched to thinking of our query results as polynomials and indeed pretty much of the rest of the paper has focused on thinking of our input as a polynomial. In particular, starting with~\Cref{sec:expression-trees} we considered these polynomials to be represented as an expression tree. However, these do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `storing' any intermediate results, which is crucial for these algorithms (and other query processing results as well).
In~\Cref{sec:semnx-as-repr}, we switched to thinking of our query results as polynomials and, until now, have focused on treating our input as a polynomial. In particular, starting with~\Cref{sec:expression-trees} we considered these polynomials to be represented as an expression tree. However, these do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `storing' any intermediate results, which is crucial for these algorithms (and other query processing results as well).
In this section, we represent query polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, which are a standard way to represent polynomials over fields (and is standard in the field of algebraic complexity), though in our case we use them for polynomials over $\mathbb N$ in the obvious way. We present a formal treatment of {\em lineage circuit}s in~\Cref{sec:circuits-formal}, with only a quick overview to start. A lineage circuit is represented by a DAG, where each source node corresponds to either one of the input variables or a constant, and the sinks correspond to the output. Every other node has at most two incoming edges (and is labeled as either an addition or a multiplication node), but there is no limit on the outdegree of such nodes. We note that if we restricted the outdegree to be one, then we get back expression trees.
In this section, we represent query polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, a standard way to represent polynomials over fields (particularly in the field of algebraic complexity) that we use for polynomials over $\mathbb N$ in the obvious way.
We present a formal treatment of {\em lineage circuit}s in~\Cref{sec:circuits-formal}, with only a quick overview to start.
A lineage circuit is represented by a DAG, where each source node corresponds to either one of the input variables or a constant and the sinks correspond to the output.
Every other node has at most two in-edges, is labeled as an addition or a multiplication node, and has no limit on its outdegree.
Note that if we limit the outdegree to one, then we get back expression trees.
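To illustrate the difference from expression trees, here is a minimal sketch (hypothetical Python, not from the paper) of a circuit gate whose outdegree may exceed one; sharing lets $(X+Y)^2$ reuse the gate for $X+Y$, and memoized evaluation runs in time linear in the number of gates rather than the (possibly much larger) tree size:

```python
from dataclasses import dataclass, field
from typing import List, Union

# Hypothetical sketch of a lineage-circuit gate: unlike an expression-tree
# node, a gate may feed many parents (outdegree > 1), so the structure is a DAG.
@dataclass(eq=False)
class Gate:
    label: Union[str, int]                               # '+', '*', constant, or variable
    inputs: List['Gate'] = field(default_factory=list)   # at most two in-edges

def eval_circuit(g: Gate, env: dict, memo=None):
    """Memoized evaluation: each gate is visited once (circuit size, not tree size)."""
    if memo is None:
        memo = {}
    if id(g) in memo:
        return memo[id(g)]
    if g.label == '+':
        v = sum(eval_circuit(c, env, memo) for c in g.inputs)
    elif g.label == '*':
        v = 1
        for c in g.inputs:
            v *= eval_circuit(c, env, memo)
    elif isinstance(g.label, str):
        v = env[g.label]          # variable source node
    else:
        v = g.label               # constant source node
    memo[id(g)] = v
    return v

X, Y = Gate('X'), Gate('Y')
s = Gate('+', [X, Y])             # shared subexpression X + Y
square = Gate('*', [s, s])        # (X + Y)^2: s is reused, impossible in a tree
```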
In~\Cref{sec:results-circuits} we argue why our results from earlier sections also hold for lineage circuits and then argue why lineage circuits do indeed capture the notion of runtime of some well-known query processing algorithms in~\Cref{sec:circuit-runtime} (We formally define the corresponding cost model in~\Cref{sec:cost-model}).
In~\Cref{sec:results-circuits} we argue why results from earlier sections also hold for lineage circuits and then argue why lineage circuits capture the notion of runtime of well-known query processing algorithms in~\Cref{sec:circuit-runtime} (\Cref{sec:cost-model} formalizes the query cost model).
\subsubsection{Extending our results to lineage circuits}
\label{sec:results-circuits}
We first note that since expression trees are a special case of lineage circuits, all of our hardness results in~\Cref{sec:hard} are still valid for lineage circuits.
For the approximation algorithm in~\Cref{sec:algo} we note that \textsc{Approx}\textsc{imate}$\rpoly$ (\Cref{alg:mon-sam}) works for lineage circuits as long as the same guarantees on $\onepass$ and $\sampmon$ (\Cref{lem:one-pass} and \Cref{lem:sample} respectively) hold for lineage circuits as well. It turns out that both $\onepass$ and $\sampmon$ work for lineage circuits as well, simply because the only property these use for expression trees is that each node has two children. This is still valid of lineage circuits where for each non-source node the children correspond to the two nodes that have incoming edges to the given node. Put another way, our argument never used the fact that in an expression tree, each node has at most one parent.
For further discussion on why~\Cref{lem:approx-alg} holds for a lineage circuit, see~\Cref{app:lineage-circuit-ext}.
Observe that \textsc{Approx}\textsc{imate}$\rpoly$ (\Cref{alg:mon-sam} in \Cref{sec:algo}) works for lineage circuits as long as the same guarantees on $\onepass$ and $\sampmon$ (\Cref{lem:one-pass} and \Cref{lem:sample} respectively) hold for lineage circuits as well.
It turns out that this is the case, simply because both algorithms rely on only one property of expression trees: that each node has two children.
Analogously, each node in a circuit has in-degree at most two.
Put another way, our argument never used the fact that in an expression tree, each node has at most one parent.
%
For a more detailed discussion of why~\Cref{lem:approx-alg} holds for a lineage circuit, see~\Cref{app:lineage-circuit-ext}.
\subsubsection{The cost model}
\label{sec:cost-model}
@ -28,13 +37,15 @@ Thus far, our analysis of the runtime of $\onepass$ has been in terms of the siz
We now show that this model corresponds to the behavior of a deterministic database by proving that for any union of conjunctive queries, we can construct a compressed lineage polynomial in the same complexity it would take to evaluate the query on a deterministic \emph{bag-relational} database.
We adopt a minimalistic compute-bound model of query evaluation drawn from worst-case optimal joins~\cite{skew,ngo-survey}.
\newcommand{\qruntime}[1]{\textbf{cost}(#1)}
{\small
\begin{align*}
\qruntime{Q} & = |Q|\\
\qruntime{\sigma Q} & = \qruntime{Q}\\
\qruntime{\pi Q} & = \qruntime{Q} + \abs{Q}\\
\qruntime{Q \cup Q'} & = \qruntime{Q} + \qruntime{Q'} +\abs{Q}+\abs{Q'}\\
\qruntime{Q_1 \bowtie \ldots \bowtie Q_n} & = \qruntime{Q_1} + \ldots + \qruntime{Q_n} + |Q_1 \bowtie \ldots \bowtie Q_n|\\
\qruntime{Q_1 \bowtie \ldots \bowtie Q_n} & = \qruntime{Q_1} + \ldots + \qruntime{Q_n} + |Q_1 \bowtie \ldots \bowtie Q_n|
\end{align*}
}
Under this model the query plan $Q(D)$ has runtime $O(\qruntime{Q(D)})$.
Base relations assume that a full table scan is required; we model index scans by treating an index scan query $\sigma_\theta(R)$ as a single base relation.
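The cost model above can be transcribed recursively as follows (an illustrative sketch, not the paper's code; the plan encoding and the externally supplied size function are hypothetical):

```python
# Illustrative transcription of the cost model: cost(Q) charges |Q| for scans,
# projections, unions, and joins; sizes of results are supplied externally.
def cost(plan, size):
    """plan: ('rel', name) | ('select', q) | ('project', q)
             | ('union', q1, q2) | ('join', q1, ..., qn)
       size: function mapping a plan to |Q|, its output cardinality."""
    op = plan[0]
    if op == 'rel':
        return size(plan)                      # cost(Q) = |Q|: full table scan
    if op == 'select':
        return cost(plan[1], size)             # selection adds nothing
    if op == 'project':
        return cost(plan[1], size) + size(plan[1])
    if op == 'union':
        q1, q2 = plan[1], plan[2]
        return cost(q1, size) + cost(q2, size) + size(q1) + size(q2)
    if op == 'join':
        return sum(cost(q, size) for q in plan[1:]) + size(plan)
    raise ValueError(op)
```

For example, joining a relation of size 10 with itself into a result of size 20 costs $10 + 10 + 20 = 40$ under this model.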
@ -49,16 +60,16 @@ It can be verified that the worst-case join algorithms~\cite{skew,ngo-survey}, a
\subsubsection{Lineage circuit for query plans}
\label{sec:circuits-formal}
We now define a lineage circuit more formally and also show how to construct a lineage circuit given a SPJU query $Q$.
We now formalize lineage circuits and the construction of lineage circuits for SPJU queries.
As mentioned earlier, we represent lineage polynomials with arithmetic circuits over $\mathbb N$ with $+$, $\times$.
A circuit for query $Q$ is a directed acyclic graph $\tuple{V_Q, E_Q, \phi_Q, \ell_Q}$ with vertices $V_Q$ and directed edges $E_Q \subset V_Q^2$.
A sink function $\phi_Q : \udom^n \rightarrow V_Q$ is a partial function that maps the tuples of the $n$-ary relation defined by $Q$ to vertices in the graph.
A sink function $\phi_Q : \udom^n \rightarrow V_Q$ is a partial function that maps the tuples of the $n$-ary relation defined by $Q$ to vertices.
We require that $\phi_Q$'s range be limited to sink vertices (i.e., vertices with out-degree 0).
%We call a sink vertex not in the range of $\phi_R$ a \emph{dead sink}.
A function $\ell_Q : V_Q \rightarrow \{\;+,\times\;\}\cup \mathbb N \cup \vct X$ assigns a label to each node: Source nodes (i.e., vertices with in-degree 0) are labeled with constants or variables (i.e., $\mathbb N \cup \vct X$), while the remaining nodes are labeled with the symbol $+$ or $\times$.
We require that vertices have an in-degree of at most two.
%
For the specifics on how lineage circuits are translated to represent polynomials see~\Cref{app:subsec-rep-poly-lin-circ}.
@ -71,7 +82,7 @@ We now connect the size of a lineage circuit (where the size of a lineage circui
\label{lem:circuits-model-runtime}
The runtime of any query plan $Q$ has the same or better complexity as the lineage of the corresponding query result for any specific database instance. That is, for any query plan $Q$ we have $|V_Q| \leq (k-1)\qruntime{Q}$, where $k$ is the degree of the query polynomial corresponding to $Q$.
\end{lemma}
Proof is in~\Cref{app:subsec-lem-lin-vs-qplan}.
\noindent The proof appears in~\Cref{app:subsec-lem-lin-vs-qplan}.
We now have all the pieces to argue the following, which formally states that our approximation algorithm implies that approximating the expected multiplicities of an SPJU query can be done in essentially the same runtime as deterministic query processing of the same query:
\begin{Corollary}
@ -84,4 +95,10 @@ This follows from~\Cref{lem:circuits-model-runtime} and (the lineage circuit cou
\subsection{Higher moments}
\label{sec:momemts}
We make a simple observation to conclude the presentation of our results. So far we have presented algorithms that when given $\poly$, we approximate its expectation. In addition, we would e.g. prove bounds of probability of the multiplicity being at least $1$. While we do not have a good approximation algorithm for this problem, we can make some progress as follows. We first note that for any positive integer $m$ we can compute the expectation $\poly^m$ (since this only changes the degree of the corresponding lineage polynomial by a factor of $m$). In other words, we can compute the $m$-th moment of the multiplicities as well. This allows us e.g. to use Chebyschev inequality or other high moment based probability bounds on the events we might be interested in. However, we leave the question of coming up with better approximation algorithms for proving probability bounds for future work.
We make a simple observation to conclude the presentation of our results.
So far we have presented algorithms that approximate the expectation of $\poly$.
In addition, we would like to, e.g., prove bounds on the probability of the multiplicity being at least $1$.
While we do not have a good approximation algorithm for this problem, we can make some progress as follows:
Note that for any positive integer $m$ we can compute the expectation of $\poly^m$ (since this only changes the degree of the corresponding lineage polynomial by a factor of $m$).
In other words, we can compute the $m$-th moment of the multiplicities as well, allowing us, e.g., to use Chebyshev's inequality or other higher-moment-based probability bounds on the events we might be interested in.
However, we leave the question of coming up with a wider range of approximation algorithms for future work.
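For instance, once the first two moments are in hand, Chebyshev's inequality immediately yields a tail bound (a standard fact, sketched here in the paper's notation; both expectations on the right are the $m=1$ and $m=2$ moments computable as described above):

```latex
\begin{align*}
P\left(\left|\poly(\vct{w}) - \expct_{\vct{w}}\pbox{\poly(\vct{w})}\right| \geq a\right)
\leq \frac{\expct_{\vct{w}}\pbox{\poly^2(\vct{w})} - \left(\expct_{\vct{w}}\pbox{\poly(\vct{w})}\right)^2}{a^2}.
\end{align*}
```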

View File

@ -1,6 +1,14 @@
%!TEX root=./main.tex
\section{Conclusions and Future Work}\label{sec:concl-future-work}
We have studied the problem of calculating the expectation of polynomials over random integer variables. This problem has a practical application in probabilistic databases over multisets where it corresponds to calculating the expected multiplicity of a query result tuple using the tuple's provenance polynomial. This problem has been studied extensively for sets (lineage formulas), but the bag settings has not received much attention so far. While the expectation of a polynomial can be calculated in linear time in the size of polynomials that are in sum-of-products normal form, the problem is \sharpwonehard for factorized polynomials. We have proven this claim through a reduction from the problem of counting k-matchings. When only considering polynomials for result tuples of UCQs over TIDBs and BIDBs (under the assumption that there are $O(1)$ cancellations), we prove that it is possible to approximate the expectation of a polynomial in linear time. An interesting direction for future work would be development of a dichotomy for queries over bag PDBs. Furthermore, it would be interesting to see whether our approximation algorithm can be extended to support queries with negations, perhaps using circuits with monus as a representation system.
We have studied the problem of calculating the expectation of polynomials over random integer variables.
This problem has a practical application in probabilistic databases over multisets, where it corresponds to calculating the expected multiplicity of a query result tuple.
This problem has been studied extensively for sets (lineage formulas), but the bag setting has not received much attention so far.
While the expectation of a polynomial can be calculated in linear time in the size of polynomials that are in SOP form, the problem is \sharpwonehard for factorized polynomials.
We have proven this claim through a reduction from the problem of counting k-matchings.
When only considering polynomials for result tuples of UCQs over TIDBs and BIDBs (under the assumption that there are $O(1)$ cancellations), we prove that it is still possible to approximate the expectation of a polynomial in linear time.
An interesting direction for future work would be development of a dichotomy for queries over bag PDBs.
Furthermore, it would be interesting to see whether our approximation algorithm can be extended to support queries with negations, perhaps using circuits with monus as a representation system.
\BG{I am not sure what interesting future work is here. Some wild guesses, if anybody agrees I'll try to flesh them out:
\textbullet{More queries: what happens with negation can circuits with monus be used?}

View File

@ -46,8 +46,7 @@ $\semNX$-PDBs are a complete representation system for $\semN$-PDBs that is clos
\end{Proposition}
\subsection{Proof of~\Cref{prop:semnx-pdbs-are-a-}}
\AH{I made small changes to the proof, notably the summation, the variable definition and the world subscript, the latter of which I am not sure is the best notation or not.}
To prove that $\semNX$-PDBs are complete consider the following construction that for any $\semN$-PDB $\pdb = (\idb, \pd)$ produces an $\semNX$-PDB $\pxdb = (\db, \pd')$ such that $\rmod(\pxdb) = \pdb$. Let $\idb = \{D_1, \ldots, D_{\abs{\idb}}\}$ and let $\max(D_i)$ denote $\max_{\tup} D_i(\tup)$. For each world $D_i$ we create a corresponding variable $X_i$.
%variables $X_{i1}$, \ldots, $X_{im}$ where $m = max(D_i)$.
In $\db$ we assign each tuple $\tup$ the polynomial:
@ -77,7 +76,7 @@ Since $\semNX$-PDBs $\pxdb$ are a complete representation system for $\semN$-PDB
\label{subsec:expectation-of-polynom-proof}
\BG{TODO}
\subsection{Supplementary Material for~\Cref{def:tidbs-and-bidbs}}\label{subsec:supp-mat-ti-bi-def}
\subsection{Supplementary Material for~\Cref{subsec:tidbs-and-bidbs}}\label{subsec:supp-mat-ti-bi-def}
Two important subclasses of $\semNX$-PDBs that are of interest to us are the bag versions of tuple-independent databases (\tis) and block-independent databases (\bis). Under set semantics, a \ti is a deterministic database $\db$ where each tuple $\tup$ is assigned a probability $\prob(\tup)$. The set of possible worlds represented by a \ti $\db$ is all subsets of $\db$. The probability of each world is the product of the probabilities of the tuples it contains and of one minus the probabilities of the tuples of $\db$ it does not contain, i.e., tuples are treated as independent random events. In a \bi, we also assign each tuple a probability, but additionally partition $\db$ into blocks. The possible worlds of a \bi $\db$ are all subsets of $\db$ that contain at most one tuple from each block. Note then that the tuples sharing the same block are disjoint, and the sum of the probabilities of all the tuples in the same block $\block$ is $1$. The probability of such a world is the product of the probabilities of all tuples present in the world. %and one minus the sum of the probabilities of all tuples from blocks for which no tuple is present in the world.
For bag \tis and \bis, we define the probability of a tuple to be the probability that the tuple exists with multiplicity at least $1$.
@ -94,10 +93,33 @@ Note that the main difference to the standard definitions of \tis and \bis is th
A well-known result for set semantics PDBs is that while not all finite PDBs can be encoded as \tis, any finite PDB can be encoded using a \ti and a query. An analogous result holds in our case: any finite $\semN$-PDB can be encoded as a bag \ti and a query (WHAT CLASS? ADD PROOF)
}
\subsection{~\Cref{lem:pre-poly-rpoly}}\label{app:subsec-pre-poly-rpoly}
\begin{Lemma}\label{lem:pre-poly-rpoly}
If
$\poly(X_1,\ldots, X_\numvar) = \sum\limits_{\vct{d} \in \{0,\ldots, B\}^\numvar}q_{\vct{d}} \cdot \prod\limits_{\substack{i = 1\\s.t. d_i\geq 1}}^{\numvar}X_i^{d_i}$
then
$\rpoly(X_1,\ldots, X_\numvar) = \sum\limits_{\vct{d} \in \eta} q_{\vct{d}}\cdot\prod\limits_{\substack{i = 1\\s.t. d_i\geq 1}}^{\numvar}X_i$% \;\;\; for some $\eta \subseteq \{0,\ldots, B\}^\numvar$
\end{Lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{proof}[Proof for~\Cref{lem:pre-poly-rpoly}]
Follows by the construction of $\rpoly$ in \cref{def:reduced-bi-poly}. \qed
\end{proof}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Proposition~\ref{proposition:q-qtilde}}\label{app:subsec-prop-q-qtilde}
\noindent Note the following fact:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Proposition}\label{proposition:q-qtilde} For any \bi-lineage polynomial $\poly(X_1, \ldots, X_\numvar)$ and all $\vct{w} \in \eta$, it holds that
$% \[
\poly(\vct{w}) = \rpoly(\vct{w}).
$% \]
\end{Proposition}
\begin{proof}[Proof for~\Cref{proposition:q-qtilde}]
Note that any $\poly$ in factorized form is equivalent to its \abbrSMB expansion. For each term in the expanded form, further note that for all $b \in \{0, 1\}$ and all $e \geq 1$, $b^e = b$. \qed
\end{proof}
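This argument is easy to check mechanically; a sketch on an arbitrary example polynomial (chosen purely for illustration) confirms that dropping exponents never changes the value on $\{0,1\}$ inputs:

```python
from itertools import product

def poly(x, y):
    # Example polynomial in SMB form: 2*x^2*y + 3*x*y^3.
    return 2 * x**2 * y + 3 * x * y**3

def rpoly(x, y):
    # The same polynomial with every exponent collapsed to 1.
    return 2 * x * y + 3 * x * y

# Since b^e = b for b in {0,1} and e >= 1, the two agree on all 0/1 inputs.
agree = all(poly(x, y) == rpoly(x, y) for x, y in product([0, 1], repeat=2))
```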
@ -108,10 +130,10 @@ Let $\poly$ be the generalized polynomial, i.e., the polynomial of $\numvar$ var
Then, assigning $\vct{w}$ to $\vct{X}$ and taking the expectation, we have
\begin{align}
\expct_{\vct{w}}\pbox{\poly(\vct{w})} &= \sum_{\vct{d} \in \eta}q_{\vct{d}}\cdot \expct_{\vct{w}}\pbox{\prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar w_i^{d_i}}\label{p1-s1}\\
&= \sum_{\vct{d} \in \eta}q_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \expct_{\vct{w}}\pbox{w_i^{d_i}}\label{p1-s2}\\
&= \sum_{\vct{d} \in \eta}q_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \expct_{\vct{w}}\pbox{w_i}\label{p1-s3}\\
&= \sum_{\vct{d} \in \eta}q_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \prob_i\label{p1-s4}\\
&= \rpoly(\prob_1,\ldots, \prob_\numvar)\label{p1-s5}
\end{align}
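The derivation above can be verified by brute force on a small example (arbitrary probabilities and an arbitrary lineage polynomial with exponents, chosen for illustration):

```python
from itertools import product

p = [0.3, 0.6, 0.5]  # example tuple probabilities p_1, p_2, p_3

def poly(w):
    # Example lineage polynomial with exponents: 2*w1^2*w2 + w2*w3^4.
    return 2 * w[0]**2 * w[1] + w[1] * w[2]**4

def rpoly(x):
    # The reduced polynomial: all exponents dropped.
    return 2 * x[0] * x[1] + x[1] * x[2]

# Exact expectation by enumerating all 0/1 worlds of the independent W_i.
exact = 0.0
for w in product([0, 1], repeat=3):
    prob = 1.0
    for wi, pi in zip(w, p):
        prob *= pi if wi else 1 - pi
    exact += prob * poly(w)
```

Here `exact` agrees with `rpoly(p)` ($= 2 \cdot 0.3 \cdot 0.6 + 0.6 \cdot 0.5 = 0.66$), matching the chain of equalities above.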
@ -221,7 +243,7 @@ When $\eset{1} \equiv \oneint$, the inner edges $(e_i, 1)$ of $\eset{2}$ are all
\item $3$-path ($\threepath$)
\end{itemize}
When $\eset{1} \equiv \threepath$, all edges beginning with $e_1$ and ending with $e_3$ are successively connected. This means that the edges of $\eset{2}$ form a $6$-path in $f_2^{-1}(\eset{1})$, in which all edges from $(e_1, 0)$ to $(e_3, 1)$ are successively connected. For a $3$-matching to exist in $f_2^{-1}(\eset{1})$, we cannot pick both $(e_i,0)$ and $(e_i,1)$ for any $i$. % there must be at least one edge separating edges picked from a sequence.
There are four such possibilities: $\pbrace{(e_1, 0), (e_2, 0), (e_3, 0)}$, $\pbrace{(e_1, 0), (e_2, 0), (e_3, 1)}$, $\pbrace{(e_1, 0), (e_2, 1), (e_3, 1)}$, $\pbrace{(e_1, 1), (e_2, 1), (e_3, 1)}$. Thus, there are four possible $3$-matchings in $f_2^{-1}(\eset{1})$.
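This enumeration can be double-checked mechanically; a small sketch, with the six path edges named as in the text:

```python
from itertools import combinations

# The six edges of the 6-path, in path order: (e1,0),(e1,1),...,(e3,1).
edges = [(i, b) for i in (1, 2, 3) for b in (0, 1)]

def is_matching(subset):
    # In a path, two edges conflict iff they are consecutive in path order.
    idx = sorted(edges.index(e) for e in subset)
    return all(j - i >= 2 for i, j in zip(idx, idx[1:]))

three_matchings = [s for s in combinations(edges, 3) if is_matching(s)]
```

Exactly the four sets listed above survive the matching check.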
\begin{itemize}
\item Triangle ($\tri$)
@ -362,8 +384,10 @@ Consider now the random variables $\randvar_1,\dots,\randvar_\numvar$, where eac
\[\randvar_i = \onesymbol\inparen{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{\monom}} p_i,\]
where the indicator variable handles the check in~\Cref{alg:check-duplicate-block}.
Then for random variable $\randvar_i$, it is the case that
\begin{align*}
\expct\pbox{\randvar_i} &= \sum\limits_{(\monom, \coef) \in \expandtree{\etree} }\frac{\onesymbol\inparen{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \coef\cdot\prod_{X_i\in \var\inparen{\monom}} p_i }{\abs{\etree}(1,\dots,1)} \\
&= \frac{\rpoly(\prob_1,\ldots, \prob_\numvar)}{\abs{\etree}(1,\ldots, 1)},
\end{align*}
where in the first equality we use the fact that $\vari{sgn}_{\vari{i}}\cdot \abs{\coef}=\coef$ and the second equality follows from~\cref{eq:tilde-Q-bi} with $X_i$ substituted by $\prob_i$.
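As a sketch of how these samples might be drawn in practice (with a hypothetical two-monomial expansion and made-up probabilities standing in for $\expandtree{\etree}$ and $\vct{p}$; names are illustrative):

```python
import random

# Hypothetical expanded form of a small lineage polynomial
# Q = 2*X1*X2 - X2*X3, with made-up tuple probabilities.
monomials = [(("X1", "X2"), 2), (("X2", "X3"), -1)]
prob = {"X1": 0.3, "X2": 0.6, "X3": 0.5}
block = {"X1": 1, "X2": 2, "X3": 3}  # each variable in its own block (a TI)

abs_total = sum(abs(c) for _, c in monomials)  # |etree|(1,...,1)

def sample_y(rng):
    """One sample: pick a monomial with probability |coef|/|etree|(1,...,1);
    return 0 if two of its variables share a block, else sign(coef) times
    the product of its variables' probabilities."""
    r, acc = rng.random() * abs_total, 0.0
    for mono, coef in monomials:
        acc += abs(coef)
        if r <= acc:
            break
    if len({block[v] for v in mono}) < len(mono):
        return 0.0  # cancelled monomial (duplicate block)
    y = 1.0 if coef > 0 else -1.0
    for v in mono:
        y *= prob[v]
    return y

rng = random.Random(0)
n = 200_000
# Unbiased: mean(Y) * |etree|(1,...,1) estimates rpoly(p) = 2*.3*.6 - .6*.5 = 0.06
estimate = sum(sample_y(rng) for _ in range(n)) / n * abs_total
```

With this seed the estimate lands close to the true value $0.06$, as the unbiasedness argument predicts.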
Let $\empmean = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\randvar_i$. It is also true that
@ -666,8 +690,6 @@ The circuit for $Q$ has at most $|V_{Q_1}|+|{Q_1}|$ vertices.
\intertext{(By definition of $\qruntime{Q}$)}
& \le (k-1)\qruntime{Q}.
\end{align*}
\AH{In the inductive step above, where does $\abs{\poly_1}$ come from? I understand that $b_i$ is part of the inductive hypothesis, but, is it \emph{legal/justifiable} to just throw in \emph{any} constant we so desire?}
\caseheading{Union}
Assume that $Q = Q_1 \cup Q_2$.
The circuit for $Q$ has $|V_{Q_1}|+|V_{Q_2}|+|{Q_1} \cap {Q_2}|$ vertices.

View File

@ -14,17 +14,17 @@ However, after discarding a long-held approach to representing lineage, we prove
This finding shows that even Bag-PDB query processing has a higher complexity than deterministic query processing, and opens a rich landscape of opportunities for research on approximate algorithms.
The fundamental challenge is lineage formulas, a key component of query processing in PDBs.
Using the standard (i.e., DNF) encoding of lineage, computing typical statistics like marginal probabilities or moments is easy (i.e., $O(|\text{lineage}|)$) for bags and hence, perhaps not worthy of research attention, but hard (i.e., $O(2^{|\text{lineage}|})$) for sets and hence, interesting from a research perspective.
However, the standard encoding is unnecessarily large, and so even for Bag-PDBs, computing such statistics from lineage formulas still has a higher complexity than answering queries in a deterministic (i.e., non-probabilistic) database.
In this paper, we formally prove this limitation of PDBs and address it by proposing an algorithm that, to the best of our knowledge, is the first $(1-\epsilon)$-approximation for expectations of counts to have a runtime within a constant factor of deterministic query processing.\footnote{
MCDB~\cite{jampani2008mcdb} is also a constant factor slower, but only guarantees additive bounds.
}
Consider the dominant problem in Set-PDBs (Computing marginal probabilities) and the corresponding problem in Bag-PDBs (computing expectations of counts).
In work that addresses the former problem~\cite{DBLP:series/synthesis/2011Suciu}, the lineage of a query result tuple is a Boolean formula over random variables that captures the conditions under which the tuple appears in the result.
Computing the probability of the tuple appearing in the result is thus analogous to weighted model counting (a known \sharpphard problem).
In the corresponding Bag-PDB problem~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,feng:2019:sigmod:uncertainty,GL16}, lineage is a polynomial over random variables that captures the multiplicity of the output tuple.
The expectation of the multiplicity is the expectation of this polynomial.
Lineage in Set-PDBs is typically encoded in disjunctive normal form.
This representation is significantly larger than the query result sans lineage.
@ -32,16 +32,15 @@ However, even with alternative encodings~\cite{FH13}, the limiting factor in com
The corresponding lineage encoding for Bag-PDBs is a polynomial in sum of products (SOP) form --- a sum of `clauses', each of which is the product of a set of integer or variable atoms.
Thanks to linearity of expectation, computing the expectation of a count query is linear in the number of clauses in the SOP polynomial.
Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general.
Compressed representations like Factorized Databases~\cite{factorized-db,DBLP:conf/tapp/Zavodny11} or Arithmetic/Polynomial Circuits~\cite{arith-complexity} are analogous to deterministic query optimizations (e.g. pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}.
Thus, measuring the performance of a PDB algorithm in terms of the size of the \emph{compressed} lineage formula more closely relates the algorithm's performance to the complexity of query evaluation in a deterministic database.
The initial picture is not good.
We prove that computing expected counts is \emph{not} linear in the size of a compressed --- specifically a factorized~\cite{factorized-db} --- lineage polynomial by reduction from counting $k$-matchings.
Thus, even Bag-PDBs cannot enjoy the same time complexity as deterministic databases.
This motivates our second goal, a linear time (in the size of the factorized lineage) approximation of expected counts for SPJU query results over Bag-PDBs.
We also show that the size of the factorized lineage formula for a query (and by extension, our approximation algorithm) has complexity proportional to the same query on a deterministic database.
% In other words, our approximation algorithm can estimate expected multiplicities for tuples in the result of an SPJU query with a complexity comparable to deterministic query-processing.
\subsection{Sets vs Bags}
@ -49,7 +48,7 @@ In other words, our approximation algorithm can estimate expected multiplicities
%Figures, etc
%Relations for example 1
\begin{figure}[t]
\begin{subfigure}{0.2\textwidth}
\centering
\begin{tabular}{ c | c c c}
@ -90,6 +89,7 @@ In other words, our approximation algorithm can estimate expected multiplicities
\caption{$\ti$ relations for $\poly$}
\label{fig:intro-ex}
\trimfigurespacing
\end{figure}
%Graph of query output for intro example
@ -115,28 +115,25 @@ For example, let $P[W_a] = P[W_b] = P[W_c] = p$ and consider the possible world
The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and the probability of this world is $P[W_a]\cdot P[W_b] \cdot P[\neg W_c] = p\cdot p\cdot (1-p)=p^2-p^3$.
\end{Example}
Following prior efforts~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,DBLP:journals/sigmod/GuagliardoL17}, we generalize this model of Set-PDBs to bags using $\semN$-valued random variables (i.e., $Dom(W_i) \subseteq \mathbb N$) and constants (annotation $\Phi_{bag}$ in the example).
Without loss of generality, we assume that input relations are sets (i.e. $Dom(W_i) = \{0, 1\}$), while query evaluation follows bag semantics.
We contrast bag and set query evaluation with the following example:
\begin{Example}\label{ex:bag-vs-set}
Continuing the prior example, we are given the following Boolean (resp., count) query
$$\poly() :- R(A), E(A, B), R(B)$$
The lineage of the result in a Set-PDB (resp., Bag-PDB) is a Boolean (polynomial) formula over random variables annotating the input relations (i.e., $W_a$, $W_b$, $W_c$).
Because the query result is a nullary relation, we write $Q(\cdot)$ to denote the function that evaluates the lineage over one specific assignment of values to the variables (i.e., the value of the lineage in the corresponding possible world):
\begin{align*}
\poly_{set}(W_a, W_b, W_c) &= W_aW_b \vee W_bW_c \vee W_cW_a\\
\poly_{bag}(W_a, W_b, W_c) &= W_aW_b + W_bW_c + W_cW_a
\end{align*}
Given $W_a$, $W_b$, $W_c$, these functions compute the existence (resp., count) of the nullary result tuple for $\poly$ applied to the database instance in \Cref{fig:intro-ex}.
We show one possible world here, with the set assignment $\{\;W_a\mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$ and the analogous bag assignment:
\begin{align*}
&\poly_{set}(\top, \top, \bot) = \top\top \vee \top\bot \vee \top\bot = \top\\
&\poly_{bag}(1, 1, 0) = 1 \cdot 1 + 1\cdot 0 + 0 \cdot 1 = 1
\end{align*}
The Set-PDB query is satisfied in this possible world and the Bag-PDB result tuple has a multiplicity of 1.
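The two lineage formulas can be written directly as executable functions; evaluating them in the world shown above reproduces the claimed values:

```python
# The set and bag lineage formulas from the example, as functions.
def q_set(wa, wb, wc):
    return (wa and wb) or (wb and wc) or (wc and wa)

def q_bag(wa, wb, wc):
    return wa * wb + wb * wc + wc * wa

sat = q_set(True, True, False)   # the world {W_a, W_b}: query satisfied
mult = q_bag(1, 1, 0)            # same world under bag semantics: count 1
```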
The marginal probability (resp., expected count) of this query is computed over all possible worlds:
% \AR{What is $\mu$ below?}
{\small
@ -175,13 +172,14 @@ Computing such expectations is indeed linear in the size of the SOP as the numbe
As a further interesting feature of this example, note that $\expct\pbox{W_i} = P[W_i = 1]$, and so taking the same polynomial over the reals:
\begin{multline}
\label{eqn:can-inline-probabilities-into-polynomial}
\expct\pbox{\poly_{bag}}
% = P[W_a = 1]P[W_b = 1] + P[W_b = 1]P[W_c = 1]\\
% + P[W_c = 1]P[W_a = 1]\\
= \poly_{bag}(P[W_a=1], P[W_b=1], P[W_c=1])
\end{multline}
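The identity above can be confirmed by brute-force enumeration over the eight possible worlds; a sketch with arbitrary (unequal) example probabilities:

```python
from itertools import product

def q_bag(wa, wb, wc):
    return wa * wb + wb * wc + wc * wa

p = {"a": 0.4, "b": 0.7, "c": 0.2}  # arbitrary example probabilities

# Expectation by enumerating all 2^3 possible worlds.
expectation = 0.0
for wa, wb, wc in product([0, 1], repeat=3):
    w_prob = ((p["a"] if wa else 1 - p["a"])
              * (p["b"] if wb else 1 - p["b"])
              * (p["c"] if wc else 1 - p["c"]))
    expectation += w_prob * q_bag(wa, wb, wc)

# Linearity of expectation lets us simply plug the probabilities in.
inlined = q_bag(p["a"], p["b"], p["c"])
```

The two quantities coincide ($0.28 + 0.14 + 0.08 = 0.5$ here).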
\begin{figure}[t]
\resizebox{0.8\columnwidth}{!}{
\begin{tikzpicture}[thick, level distance=0.9cm,level 1/.style={sibling distance=4.55cm}, level 2/.style={sibling distance=1.5cm}, level 3/.style={sibling distance=0.7cm}]% level/.style={sibling distance=6cm/(#1 * 1.5)}]
\node[tree_node](root){$\boldsymbol{\times}$}
child{node[tree_node]{$\boldsymbol{+}$}
@ -216,6 +214,7 @@ As a further interesting feature of this example, note that $\expct\pbox{W_i} =
}
\caption{Expression tree for query $\poly^2$.}
\label{fig:intro-q2-etree}
\trimfigurespacing
\end{figure}
\subsection{Superlinearity of Bag PDBs}
@ -225,7 +224,7 @@ Consider the Cartesian product of $\poly$ with itself:
\poly^2() := \rel(A), E(A, B), \rel(B),\; \rel(C), E(C, D), \rel(D)
\end{equation*}
For an arbitrary polynomial, it is known that there may exist equivalent compressed representations.
One such compression is the factorized polynomial~\cite{factorized-db}, where the polynomial is broken up into separate factors.
For example:
{\small
\begin{equation*}

View File

@ -363,11 +363,17 @@
\newcommand{\inparen}[1]{\left({#1}\right)}
\newcommand{\inset}[1]{\left\{{#1}\right\}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Forcing Layouts
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\trimfigurespacing}{\vspace*{-5mm}}
%%%Adding stuff below so that long chain of display equatoons can be split across pages
\allowdisplaybreaks
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"

View File

@ -82,7 +82,6 @@ Christoph Koch},
Christoph Koch and
Dan Olteanu},
booktitle = {ICDE},
pages = {1479--1480},
title = {MayBMS: Managing Incomplete Information with Probabilistic World-Set
Decompositions},
year = {2007}
@ -104,17 +103,6 @@ Atri Rudra},
year = {2016}
}
@article{10.1145/3003665.3003667,
author = {Olteanu, Dan and Schleich, Maximilian},
journal = {SIGMOD Rec.},
number = {2},
numpages = {12},
pages = {5--16},
title = {Factorized Databases},
volume = {45},
year = {2016}
}
@article{DBLP:journals/sigmod/GuagliardoL17,
author = {Paolo Guagliardo and
Leonid Libkin},
@ -177,7 +165,6 @@ Val Tannen},
@inproceedings{ngo-survey,
author = {Hung Q. Ngo},
booktitle = {PODS},
pages = {111--124},
title = {Worst-Case Optimal Join Algorithms: Techniques, Results, and Open
Problems},
year = {2018}
@ -329,7 +316,6 @@ Virginia Vassilevska Williams},
year = {2009}
}
@article{FO16,
author = {Robert Fink and Dan Olteanu},
journal = {TODS},
@ -378,7 +364,6 @@ Virginia Vassilevska Williams},
@inproceedings{roy-11-f,
author = {Sudeepa Roy and Vittorio Perduca and Val Tannen},
booktitle = {ICDT},
pages = {232--243},
title = {Faster query answering in probabilistic databases using read-once functions},
year = {2011}
}
@ -476,7 +461,6 @@ Virginia Vassilevska Williams},
@conference{BD05,
author = {Boulos, J. and Dalvi, N. and Mandhani, B. and Mathur, S. and Re, C. and Suciu, D.},
booktitle = {SIGMOD},
pages = {891--893},
title = {MYSTIQ: a system for finding more answers by using probabilities},
year = {2005}
}
@ -511,8 +495,7 @@ Virginia Vassilevska Williams},
author = {R. Iris Bahar and Erica A. Frohm and Charles M. Gaona and Gary
D. Hachtel and Enrico Macii and Abelardo Pardo and Fabio
Somenzi},
booktitle = {ICCAD},
pages = {188--191},
booktitle = {IEEE CAD},
title = {Algebraic decision diagrams and their applications},
year = {1993}
}

View File

@ -21,7 +21,7 @@
\usepackage[normalem]{ulem}
\usepackage{subcaption}
\usepackage{booktabs}
\usepackage{todonotes}
\usepackage{graphicx}
\usepackage{listings}
%%%%%%%%%% SQL + proveannce listing settings

View File

@ -41,12 +41,12 @@ Based on the so called {\em Triangle detection hypothesis} (cf.~\cite{triang-har
Both of our hardness results rely on a simple query polynomial encoding of the edges of a graph.
To prove our hardness result, consider a graph $G(V, E)$, where $|E| = m$, $|V| = \numvar$. Our query polynomial has a variable $X_i$ for every $i$ in $[\numvar]$.
Consider the polynomial
\[\poly_{G}(\vct{X}) = \sum\limits_{(i, j) \in E} X_i \cdot X_j.\]
The hard polynomial for our problem will be a suitable power $k\ge 3$ of the polynomial above, i.e.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}
For any graph $G=([n],E)$ and $\kElem\ge 1$, define
\[\poly_{G}^\kElem(X_1,\dots,X_n) = \left(\sum\limits_{(i, j) \in E} X_i \cdot X_j\right)^\kElem.\]
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Our hardness results only need a \ti instance; we also consider the special case when all the tuple probabilities (the probabilities assigned to each $X_i$ by $\vct{p}$) are the same value. Note that this polynomial can be encoded in an expression tree of size $\Theta(km)$.
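As a concrete illustration (with a hypothetical $4$-cycle as $G$), the polynomial and its $k$-th power can be written down directly:

```python
# Hypothetical 4-cycle graph; Q_G(X) = sum over edges of X_i * X_j.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

def q_g(x):
    return sum(x[i] * x[j] for i, j in edges)

def q_g_k(x, k):
    # The hard polynomial: the k-th power of Q_G, for k >= 3.
    return q_g(x) ** k

x = [1, 1, 0, 1]   # a 0/1 world: vertices 0, 1, 3 present
val = q_g(x)       # edges (0,1) and (3,0) survive, so Q_G evaluates to 2
```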

View File

@ -122,27 +122,16 @@ Consider $\poly(X, Y) = (X + Y)(X + Y)$ where $X$ and $Y$ are from different blo
\noindent The usefulness of this reduction will become clear in \Cref{lem:exp-poly-rpoly}.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Valid Worlds]
For a probability distribution $\vct{P}$ and its corresponding PMF $P$, the set of valid worlds $\eta$ consists of all worlds that have probability greater than $0$; formally, for random variable $\vct{W}$,
\[
\eta = \{\vct{w}\st P[\vct{W} = \vct{w}] > 0\}
\]
\end{Definition}
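In other words, $\eta$ is the support of the distribution. For instance (a sketch with one \bi block of two disjoint tuples with probabilities $0.4$ and $0.6$):

```python
# Worlds as bit-vectors (w1, w2) over one block of two disjoint tuples:
# worlds picking both tuples, or neither, have zero probability.
pmf = {(1, 0): 0.4, (0, 1): 0.6, (0, 0): 0.0, (1, 1): 0.0}

# eta keeps exactly the worlds with positive mass.
eta = {w for w, pr in pmf.items() if pr > 0}
```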
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We state additional equivalences between $\poly(\vct{X})$ and $\rpoly(\vct{X})$ in~\Cref{app:subsec-pre-poly-rpoly} and~\Cref{app:subsec-prop-q-qtilde}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -154,10 +143,10 @@ $% \]
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}\label{lem:exp-poly-rpoly}
Let $\pxdb$ be a \bi over variables $\vct{X} = \{X_1, \ldots, X_\numvar\}$ with probability distribution $\vct{p} = (\prob_1, \ldots, \prob_\numvar)$ over all $\vct{w} \in \eta$. For any \bi-lineage polynomial $\poly(\vct{X})$ based on $\pxdb$ and some query $\query$, we have
% The expectation over possible worlds in $\poly(\vct{X})$ is equal to $\rpoly(\prob_1,\ldots, \prob_\numvar)$.
\begin{equation*}
\expct_{\vct{W}\sim \vct{p}}\pbox{\poly(\vct{W})} = \rpoly(\vct{p}).
\end{equation*}
\end{Lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View File

@ -76,6 +76,7 @@ We focus on this problem exclusively from now on, assume an implicit result tupl
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{\tis and \bis}
\label{subsec:tidbs-and-bidbs}
In this paper, we focus on two popular forms of PDB: Block-Independent (\bi) and Tuple-Independent (\ti) PDBs.
%
A \bi $\pxdb = (\db, \pd)$ is an $\semNX$-PDB such that (i) every tuple is annotated with either $0$ or a unique variable $X_i$ and (ii) the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup) \neq 0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same block are disjoint events.
@ -91,7 +92,7 @@ A \emph{\ti} is a \bi where each block contains exactly one tuple.
\subsection{Expression Trees}\label{sec:expression-trees}
In this section, we formally define expression trees, an encoding of polynomials that we use throughout much of the paper before generalizing to circuits in~\Cref{sec:gen}.
For illustrative purposes consider the polynomial $\poly(\vct{X}) = 2X^2 + 3XY - 2Y^2$ over $\vct{X} = [X, Y]$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Expression Tree]\label{def:express-tree}
@ -101,29 +102,29 @@ tree, whose internal nodes are from the set $\{+, \times\}$, with leaf nodes bei
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We ignore the remaining fields (\vari{partial} and \vari{weight}) until \Cref{sec:algo}. Note that $\etree$ need not encode an expression in standard monomial basis. For instance, $\etree$ could represent a compressed form of the running example, such as $(X + 2Y)(2X - Y)$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[poly$(\cdot)$]\label{def:poly-func}
Denote $poly(\etree)$ to be the function that takes as input expression tree $\etree$ and outputs its corresponding polynomial in \abbrSMB. $poly(\cdot)$ is recursively defined on $\etree$ as follows, where $\etree_\lchild$ and $\etree_\rchild$ denote the left and right child of $\etree$ respectively.
%
% \begin{align*}
% &\etree.\type = +\mapsto&& \polyf(\etree_\lchild) + \polyf(\etree_\rchild)\\
% &\etree.\type = \times\mapsto&& \polyf(\etree_\lchild) \cdot \polyf(\etree_\rchild)\\
% &\etree.\type = \var \text{ OR } \tnum\mapsto&& \etree.\val
% \end{align*}
%
\begin{equation*}
\polyf(\etree) = \begin{cases}
\polyf(\etree_\lchild) + \polyf(\etree_\rchild) &\text{ if \etree.\type } = +\\
\polyf(\etree_\lchild) \cdot \polyf(\etree_\rchild) &\text{ if \etree.\type } = \times\\
\etree.\val &\text{ if \etree.\type } = \var \text{ OR } \tnum.
\end{cases}
\end{equation*}
\end{Definition}
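A minimal sketch of this recursion (evaluating the tree on a variable assignment rather than symbolically expanding to \abbrSMB; field and constructor names are illustrative, and the \vari{partial}/\vari{weight} fields are omitted):

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class ETree:
    """One expression-tree node: '+', 'x', 'var', or 'num'."""
    type: str
    val: Union[str, int, None] = None
    left: Optional["ETree"] = None
    right: Optional["ETree"] = None

def poly_eval(t: ETree, assign: dict) -> int:
    # The recursion of poly(.): internal nodes apply + or x,
    # leaves return their variable's value or their constant.
    if t.type == '+':
        return poly_eval(t.left, assign) + poly_eval(t.right, assign)
    if t.type == 'x':
        return poly_eval(t.left, assign) * poly_eval(t.right, assign)
    if t.type == 'var':
        return assign[t.val]
    return t.val  # numeric leaf

def var(n): return ETree('var', n)
def num(c): return ETree('num', c)
def add(l, r): return ETree('+', left=l, right=r)
def mul(l, r): return ETree('x', left=l, right=r)

# (X + 2Y)(2X - Y), the compressed form of the running example.
t = mul(add(var('X'), mul(num(2), var('Y'))),
        add(mul(num(2), var('X')), mul(num(-1), var('Y'))))
```

Evaluating `t` agrees with the \abbrSMB form $2X^2 + 3XY - 2Y^2$ on any assignment.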
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Note that addition and multiplication above follow the standard interpretation over polynomials.
%Specifically, when adding two monomials whose variables and respective exponents agree, the coefficients corresponding to the monomials are added and their sum is multiplied to the monomial. Multiplication here is denoted by concatenation of the monomial and coefficient. When two monomials are multiplied, the product of each corresponding coefficient is computed, and the variables in each monomial are multiplied, i.e., the exponents of like variables are added. Again we notate this by the direct product of coefficient product and all disitinct variables in the two monomials, with newly computed exponents.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -131,12 +132,11 @@ Note that addition and multiplication above follow the standard interpretation o
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
For our running example, $\etreeset{\smb} = \{2X^2 + 3XY - 2Y^2, (X + 2Y)(2X - Y), X(2X - Y) + 2Y(2X - Y), 2X(X + 2Y) - Y(X + 2Y)\}$. Note that \cref{def:express-tree-set} implies that $\etree \in \etreeset{poly(\etree)}$.
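That all four members of $\etreeset{\smb}$ denote the same polynomial can be checked by evaluating them on sample points (agreement on sufficiently many points implies equality for low-degree polynomials):

```python
# The four representations of the running example 2X^2 + 3XY - 2Y^2.
forms = [
    lambda x, y: 2*x**2 + 3*x*y - 2*y**2,
    lambda x, y: (x + 2*y) * (2*x - y),
    lambda x, y: x*(2*x - y) + 2*y*(2*x - y),
    lambda x, y: 2*x*(x + 2*y) - y*(x + 2*y),
]

# All four agree on every point of a small integer grid.
points = [(x, y) for x in range(-3, 4) for y in range(-3, 4)]
all_equal = all(len({f(x, y) for f in forms}) == 1 for x, y in points)
```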
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Problem Definition}\label{sec:problem-definition}
\noindent We are now ready to formally state our \textbf{main problem}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[The Expected Result Multiplicity Problem]\label{def:the-expected-multipl}

View File

@ -1,7 +1,8 @@
%!TEX root=./main.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Compressed Representations of Polynomials and Boolean Formulas}\label{sec:compr-repr-polyn}
There is a large body of work on using compact representations of Boolean formulas (e.g., various types of circuits including OBDDs~\cite{jha-12-pdwm}) and polynomials (e.g., factorizations~\cite{DBLP:conf/tapp/Zavodny11,factorized-db}), some of which have been utilized for probabilistic query processing, e.g.,~\cite{jha-12-pdwm}. Compact representations of Boolean formulas for which probabilities can be computed in linear time include OBDDs, SDDs, d-DNNFs, and FBDDs. In terms of circuits over semiring expressions,~\cite{DM14c} studies circuits for absorptive semirings while~\cite{S18a} studies circuits that include negation (expressed as the monus operation of a semiring). Algebraic Decision Diagrams~\cite{bahar-93-al} (ADDs) generalize BDDs to variables with more than two values. Chen et al.~\cite{chen-10-cswssr} introduced the generalized disjunctive normal form.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Parameterized Complexity}\label{sec:param-compl}

%!TEX root=./main.tex
\section{Related Work}\label{sec:related-work}
In addition to work on probabilistic databases, our work has connections to work on compact representations of polynomials and relies on past work in fine-grained complexity, which we review in \Cref{sec:compr-repr-polyn} and \Cref{sec:param-compl}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\subsection{Probabilistic Databases}\label{sec:prob-datab}
Probabilistic Databases (PDBs) have been studied predominantly under set semantics.
A multitude of data models have been proposed for encoding a PDB more compactly than as its set of possible worlds.
Tuple-independent databases (\tis) consist of a classical database where each tuple is associated with a probability and tuples are treated as independent probabilistic events.
While unable to encode correlations directly, \tis are popular because any finite probabilistic database can be encoded as a \ti and a set of constraints that ``condition'' the \ti~\cite{VS17}.
Block-independent databases (\bis) generalize \tis by partitioning the input into blocks of disjoint tuples, where blocks are independent~\cite{RS07,BS06}. \emph{PC-tables}~\cite{GT06} pair a C-table~\cite{IL84a} with a probability distribution over its variables. This is similar to our $\semNX$-PDBs, except that we do not allow for variables as attribute values and, instead of local conditions (propositional formulas that may contain comparisons), we associate tuples with polynomials from $\semNX$.
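To make the \ti model concrete, the following minimal pure-Python sketch (the relation and its probabilities are invented for illustration) enumerates the possible worlds of a tiny \ti together with their probabilities:

```python
from itertools import product

# A toy TI-DB: three tuples, each with an independent presence probability.
tidb = {('a',): 0.9, ('b',): 0.5, ('c',): 0.2}

def worlds(db):
    """Enumerate (world, probability) pairs of a tuple-independent DB.

    Each world is a subset of the tuples; its probability is the product
    of p_t for present tuples and (1 - p_t) for absent ones.
    """
    tuples = list(db)
    for mask in product([False, True], repeat=len(tuples)):
        w = frozenset(t for t, keep in zip(tuples, mask) if keep)
        pr = 1.0
        for t, keep in zip(tuples, mask):
            pr *= db[t] if keep else 1 - db[t]
        yield w, pr

# The world probabilities form a distribution, and the marginal of a
# tuple recovers its input probability.
assert abs(sum(pr for _, pr in worlds(tidb)) - 1.0) < 1e-9
assert abs(sum(pr for w, pr in worlds(tidb) if ('a',) in w) - 0.9) < 1e-9
```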
Approaches for probabilistic query processing (i.e., computing the marginal probability of each query result tuple) fall into two broad categories.
\emph{Intensional} (or \emph{grounded}) query evaluation computes the \emph{lineage} of a tuple (a Boolean formula encoding the provenance of the tuple) and then computes the probability of the lineage formula.
In this paper we also focus on intensional query evaluation, but use polynomials instead of Boolean formulas to deal with multisets.
It is well known that computing the marginal probability of a tuple is \sharpphard (proven through a reduction from weighted model counting~\cite{provan-83-ccccptg,valiant-79-cenrp}, using the fact that the tuple's marginal probability equals the probability of its lineage formula).
The second category, \emph{extensional} query evaluation, avoids calculating the lineage.
This approach is in \ptime, but is limited to certain classes of queries.
Dalvi et al.~\cite{DS12} proved a dichotomy for unions of conjunctive queries (UCQs): for any UCQ, probabilistic query evaluation is either \sharpphard or in \ptime, with the \ptime cases admitting extensional evaluation.
Olteanu et al.~\cite{FO16} presented dichotomies for two classes of queries with negation, and R\'e et al.~\cite{RS09b} presented a trichotomy for HAVING queries.
Amarilli et al. investigated tractable classes of databases for more complex queries~\cite{AB15,AB15c}.
Another line of work studies which structural properties of lineage formulas lead to tractable cases~\cite{kenig-13-nclexpdc,roy-11-f,sen-10-ronfqevpd}.
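The intensional approach can be illustrated with a brute-force sketch (the lineage formula and probabilities below are hypothetical): it sums the probabilities of all worlds satisfying the lineage, which is exponential in the number of variables and hence only feasible for tiny instances, in line with the \sharpphard{}ness of the general problem.

```python
from itertools import product

def lineage_prob(lineage, probs):
    """Exhaustively compute the marginal probability of a result tuple
    from its lineage formula, assuming independent tuple events.

    lineage: a predicate over a dict {variable: bool}.
    probs:   {variable: probability of the tuple event being true}.
    """
    names = list(probs)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        assign = dict(zip(names, bits))
        if lineage(assign):
            pr = 1.0
            for n in names:
                pr *= probs[n] if assign[n] else 1 - probs[n]
            total += pr
    return total

# A typical join-then-project lineage: (x1 AND y1) OR (x2 AND y2).
probs = {'x1': 0.5, 'x2': 0.5, 'y1': 0.5, 'y2': 0.5}
phi = lambda a: (a['x1'] and a['y1']) or (a['x2'] and a['y2'])
# By inclusion-exclusion: 0.25 + 0.25 - 0.0625 = 0.4375.
assert abs(lineage_prob(phi, probs) - 0.4375) < 1e-9
```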
Several techniques for approximating tuple probabilities have been proposed in related work~\cite{FH13,heuvel-19-anappdsd,DBLP:conf/icde/OlteanuHK10,DS07,re-07-eftqevpd}, relying on Monte Carlo sampling, e.g., \cite{DS07,re-07-eftqevpd}, or a branch-and-bound paradigm~\cite{DBLP:conf/icde/OlteanuHK10,fink-11}.
The approximation algorithm for bag expectation we present in this work is based on sampling.
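As a point of reference, a naive world-sampling estimator for the expected multiplicity can be sketched in a few lines (this baseline is only a sketch; it is \emph{not} the linear-time algorithm developed in this paper, and the polynomial and probabilities below are invented for illustration):

```python
import random

def expected_mult_mc(poly, probs, samples=20000, seed=0):
    """Monte Carlo estimate of E[poly(W)] where each variable W_i is an
    independent Bernoulli(p_i), i.e., the tuple-independent case.

    poly:  a function from a dict {variable: 0 or 1} to a number.
    probs: {variable: presence probability}.
    """
    rng = random.Random(seed)
    names = list(probs)
    acc = 0.0
    for _ in range(samples):
        w = {n: (1 if rng.random() < probs[n] else 0) for n in names}
        acc += poly(w)
    return acc / samples

# Example: the bag polynomial X * Y with p_X = p_Y = 0.5 has
# expected multiplicity 0.25.
est = expected_mult_mc(lambda w: w['X'] * w['Y'], {'X': 0.5, 'Y': 0.5})
assert abs(est - 0.25) < 0.02
```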
Fink et al.~\cite{FH12} study aggregate queries over a probabilistic version of the extension of K-relations for aggregate queries proposed in~\cite{AD11d} (a data model referred to as \emph{pvc-tables}). As an extension of K-relations, this approach supports bags. Probabilities are computed using a decomposition approach~\cite{DBLP:conf/icde/OlteanuHK10} over the symbolic expressions that serve as tuple annotations and values in pvc-tables. \cite{FH12} identifies a tractable class of queries involving aggregation. In contrast, we study a less general data model and query class, but provide a linear-time approximation algorithm and new insights into the complexity of computing expectations (while \cite{FH12} computes probabilities for individual output annotations).

%Let $\numocc{G}{H}$ denote the number of occurrences of pattern $H$ in graph $G$, where, for example, $\numocc{G}{\ed}$ means the number of single edges in $G$.
For any graph $G$, the following formulas for $\numocc{G}{H}$ compute their respective patterns exactly in $O(\numedge)$ time, with $d_i$ representing the degree of vertex $i$ (proofs are in \Cref{app:easy-counts}):
\begin{align}
&\numocc{G}{\ed} = \numedge, \label{eq:1e}\\
&\numocc{G}{\twopath} = \sum_{i \in V} \binom{d_i}{2} \label{eq:2p}\\
\begin{Lemma}\label{lem:qE3-exp}
%When we expand $\poly_{G}^3(\vct{X})$ out and assign all exponents $e \geq 1$ a value of $1$, we have the following result,
For any $p$, we have:
{\small
\begin{align}
&\rpoly_{G}^3(\prob,\ldots, \prob) = \numocc{G}{\ed}\prob^2 + 6\numocc{G}{\twopath}\prob^3 + 6\numocc{G}{\twodis}\prob^4 + 6\numocc{G}{\tri}\prob^3\nonumber\\
&+ 6\numocc{G}{\oneint}\prob^4 + 6\numocc{G}{\threepath}\prob^4 + 6\numocc{G}{\twopathdis}\prob^5 + 6\numocc{G}{\threedis}\prob^6.\label{claim:four-one}
\end{align}
}
\end{Lemma}
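The lemma is easy to check by brute force on small graphs. The sketch below (illustrative only) computes $\rpoly_{G}^3(\prob,\ldots,\prob)$ directly by summing $\prob^{\nu}$ over all ordered edge triples and compares it against the closed form on the triangle $K_3$, whose pattern counts are known (all patterns other than the single edge, the $2$-path, and the triangle occur zero times in $K_3$):

```python
from itertools import product

def rpoly3(edges, p):
    """Brute-force rpoly_G^3(p,...,p): each ordered triple of edges
    contributes p^(number of distinct vertices it touches)."""
    return sum(p ** len(set(e1) | set(e2) | set(e3))
               for e1, e2, e3 in product(edges, repeat=3))

# The triangle K3: 3 edges, 3 two-paths, 1 triangle; everything else 0.
K3 = [(0, 1), (1, 2), (0, 2)]
p = 0.5
closed_form = 3 * p**2 + 6 * 3 * p**3 + 6 * 1 * p**3
assert abs(rpoly3(K3, p) - closed_form) < 1e-9
```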
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{proof}[Proof of \Cref{lem:qE3-exp}]
By definition we have that
\[\poly_{G}^3(\vct{X}) = \sum_{\substack{(i_1, j_1), (i_2, j_2), (i_3, j_3) \in E}}~\; \prod_{\ell = 1}^{3}X_{i_\ell}X_{j_\ell}.\]
Hence $\rpoly_{G}^3(\vct{X})$ has degree six. Note that the monomial $\prod_{\ell = 1}^{3}X_{i_\ell}X_{j_\ell}$ contributes to the coefficient of $\prob^\nu$ in $\rpoly_{G}^3(\prob,\ldots,\prob)$, where $\nu$ is the number of distinct variables in the monomial.
%Rather than list all the expressions in full detail, let us make some observations regarding the sum.
Let $e_1 = (i_1, j_1), e_2 = (i_2, j_2), e_3 = (i_3, j_3)$.
We compute $\rpoly_{G}^3(\vct{X})$ by considering each of the three forms that the triple $(e_1, e_2, e_3)$ can take.
\textsc{case 1:} $e_1 = e_2 = e_3$ (all edges are the same). There are exactly $\numedge=\numocc{G}{\ed}$ such triples, each with a $\prob^2$ factor in $\rpoly_{G}^3\left(\prob,\ldots, \prob\right)$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}\label{lem:3m-G2}
The $3$-matchings in graph $\graph{2}$ satisfy the identity:
\begin{align*}
\numocc{\graph{2}}{\threedis} &= 8 \cdot \numocc{\graph{1}}{\threedis} + 6 \cdot \numocc{\graph{1}}{\twopathdis}\\
&+ 4 \cdot \numocc{\graph{1}}{\oneint} + 4 \cdot \numocc{\graph{1}}{\threepath} + 2 \cdot \numocc{\graph{1}}{\tri}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}\label{lem:3m-G3}
The 3-matchings in $\graph{3}$ satisfy the identity:
\begin{align*}
\numocc{\graph{3}}{\threedis} &= 4\numocc{\graph{1}}{\twopath} + 6\numocc{\graph{1}}{\twodis} + 18\numocc{\graph{1}}{\tri}\\
&+ 21\numocc{\graph{1}}{\threepath}+ 24\numocc{\graph{1}}{\twopathdis} + 20\numocc{\graph{1}}{\oneint}\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}\label{lem:3p-G2}
The $3$-paths in $\graph{2}$ satisfy the identity:
\[\numocc{\graph{2}}{\threepath} = 2 \cdot \numocc{\graph{1}}{\twopath}.\]
\end{Lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}\label{lem:3p-G3}
The $3$-paths in $\graph{3}$ satisfy the identity:
\[\numocc{\graph{3}}{\threepath} = \numocc{\graph{1}}{\ed} + 2 \cdot \numocc{\graph{1}}{\twopath}.\]
\end{Lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{pmatrix}
=\vct{b},
\]
giving $\numocc{G}{\tri}, \numocc{G}{\threepath}$ and $\numocc{G}{\threedis}$ in $O(1)$ time.
\end{Lemma}
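Solving a fixed $3 \times 3$ system is indeed $O(1)$; a self-contained sketch via Cramer's rule follows (the coefficient matrix and right-hand side below are placeholders, since the actual entries are determined by the preceding lemmas and the choice of $\prob$):

```python
def solve3(A, b):
    """Solve a 3x3 linear system A x = b by Cramer's rule (O(1) work)."""
    def det(m):
        return (m[0][0] * (m[1][1]*m[2][2] - m[1][2]*m[2][1])
              - m[0][1] * (m[1][0]*m[2][2] - m[1][2]*m[2][0])
              + m[0][2] * (m[1][0]*m[2][1] - m[1][1]*m[2][0]))
    d = det(A)
    xs = []
    for col in range(3):
        Ai = [row[:] for row in A]          # copy A ...
        for r in range(3):
            Ai[r][col] = b[r]               # ... with column `col` replaced by b
        xs.append(det(Ai) / d)
    return xs

# Placeholder system: diag(2, 3, 4) x = (2, 6, 8) has solution (1, 2, 2).
assert solve3([[2, 0, 0], [0, 3, 0], [0, 0, 4]], [2, 6, 8]) == [1.0, 2.0, 2.0]
```

In the lemma, the unknowns would be $\numocc{G}{\tri}$, $\numocc{G}{\threepath}$, and $\numocc{G}{\threedis}$, and $\vct{b}$ is assembled from the three evaluations $\rpoly_{\graph{\ell}}^3(\prob,\dots,\prob)$.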
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%