More tweaks to Introduction 120220.

This commit is contained in:
Aaron Huber 2020-12-02 16:30:42 -05:00
parent ea5f4ccd05
commit c204c9fc61
4 changed files with 119 additions and 134 deletions


@@ -1,3 +1,20 @@
@book{DBLP:series/synthesis/2011Suciu,
author = {Dan Suciu and
Dan Olteanu and
Christopher R{\'{e}} and
Christoph Koch},
title = {Probabilistic Databases},
series = {Synthesis Lectures on Data Management},
publisher = {Morgan {\&} Claypool Publishers},
year = {2011},
url = {https://doi.org/10.2200/S00362ED1V01Y201105DTM016},
doi = {10.2200/S00362ED1V01Y201105DTM016},
timestamp = {Tue, 16 May 2017 14:24:20 +0200},
biburl = {https://dblp.org/rec/series/synthesis/2011Suciu.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{10.1145/1265530.1265571,
author = {Dalvi, Nilesh and Suciu, Dan},
title = {The Dichotomy of Conjunctive Queries on Probabilistic Structures},

abstract.tex Normal file

@@ -0,0 +1,3 @@
%root: main.tex
Abstract here.

intro.tex

@@ -3,7 +3,9 @@
\section{Introduction}
In practice, modern production databases, e.g., Postgres, Oracle, etc., use bag semantics. In contrast, most implementations of PDBs are built in the setting of set semantics, where the annotation polynomial is in disjunctive normal form (DNF), and computations over the output polynomial such as expectation (the probability of the tuple) are \#P-hard in general. However, for the equivalent sum of products (SOP) representation in the bag setting, computing the expectation (expected multiplicity) over the output polynomial is linear. In this work we show that, if we use alternative representations of the output polynomial, such as factorized forms, the complexity landscape becomes much more nuanced.
In practice, modern production databases, e.g., Postgres, Oracle, etc., use bag semantics. In contrast, most implementations of probabilistic databases (PDBs) are built in the setting of set semantics, where computing expectations and other moments is analogous to counting the number of solutions to a boolean formula, a problem known to be \#P-hard
%the annotation of the tuple is a lineage formula ~\cite{DBLP:series/synthesis/2011Suciu}, which can essentially be thought of as a boolean formula. It is known that computing the probability of a lineage formula is \#-P hard in general
~\cite{DBLP:series/synthesis/2011Suciu}. In PDBs, this boolean formula, which is generated by query processing, is called a lineage formula~\cite{DBLP:series/synthesis/2011Suciu}. However, computing the expectation in the bag setting is linear, leading many to regard bags as easy. In this work we consider compressed representations of the lineage formula, showing that the complexity landscape becomes much more nuanced and is not linear in general.
@@ -36,18 +38,18 @@ In practice, modern production databases, e.g., Postgres, Oracle, etc. use bag s
%\caption{Atom 3 of query $\poly$ in ~\cref{intro:ex}}
\label{subfig:ex-atom3}
\end{subfigure}
\begin{subfigure}{0.15\textwidth}
\centering
\begin{tabular}{ c | c | c}
$\rel$ & B & $\Phi$\\
\hline
& b & $W_b$\\
& c & $W_c$\\
& a & $W_a$\\
\end{tabular}
%\caption{Atom 2 of query $\poly$ in ~\cref{intro:ex}}
\label{subfig:ex-atom2}
\end{subfigure}
% \begin{subfigure}{0.15\textwidth}
% \centering
% \begin{tabular}{ c | c | c}
% $\rel$ & B & $\Phi$\\
% \hline
% & b & $W_b$\\
% & c & $W_c$\\
% & a & $W_a$\\
% \end{tabular}
% %\caption{Atom 2 of query $\poly$ in ~\cref{intro:ex}}
% \label{subfig:ex-atom2}
% \end{subfigure}
\caption{$\ti$ relations for $\poly$}
@@ -69,31 +71,39 @@ In practice, modern production databases, e.g., Postgres, Oracle, etc. use bag s
\end{figure}
\begin{Example}\label{ex:intro}
Suppose we are given the following boolean query $\poly() := R(A), E(A, B), R(B)$ over a Tuple Independent Database ($\ti$), where the output polynomial will consist of all tuple annotations contributing to the output. The $\ti$ relations are given in~\cref{fig:intro-ex}. While for completeness we should include annotations for table E, since each tuple has a probability of $1$, we drop them for simplicity. Note that the attribute column $\Phi$ contains a variable/value, where in the former case the variable ranges in $[0, 1]$, denoting its marginal probability of appearing in the set of possible worlds, and the latter is the fixed (marginal) probability of the tuple across the set of possible worlds. Finally, see that the tuples in table E can be visualized as the graph in~\cref{fig:intro-ex-graph}.
Suppose we are given the following boolean query $\poly() := R(A), E(A, B), R(B)$ over a Tuple Independent Database ($\ti$), where the annotation of the output will consist of all contributing tuple annotations. The $\ti$ relations are given in~\cref{fig:intro-ex}. While for completeness we should include annotations for table E, since each tuple has a probability of $1$, we drop them for simplicity. Note that the attribute column $\Phi$ contains either a variable or a value: in the former case the variable ranges over $[0, 1]$, denoting the tuple's marginal probability of appearing in the set of possible worlds, and in the latter case the value is the fixed (marginal) probability of the tuple across the set of possible worlds. Finally, note that the tuples in table E can be visualized as the graph in~\cref{fig:intro-ex-graph}.
This query is hard in set semantics because of correlations in the lineage formula, but under bag semantics it is easy since we enjoy linearity of expectation.
\end{Example}
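For concreteness, the lineage of the example can be computed mechanically by evaluating the join and multiplying annotations. The sketch below hardcodes the instance as we read it from the figure; the relation contents and variable names are assumptions of this illustration, not part of any system:

```python
from collections import Counter

# Assumed from the example: R's tuples a, b, c carry annotation variables
# W_a, W_b, W_c; E is the triangle {(a,b),(b,c),(c,a)}, all annotated 1.
R = {"a": "W_a", "b": "W_b", "c": "W_c"}
E = [("a", "b"), ("b", "c"), ("c", "a")]

# Q() :- R(A), E(A,B), R(B): every satisfying join contributes one monomial,
# the product of the annotations of the participating R-tuples.
lineage = Counter()
for (x, y) in E:
    if x in R and y in R:
        lineage[frozenset((R[x], R[y]))] += 1

# lineage now represents W_a*W_b + W_b*W_c + W_c*W_a
```

Each key of the counter is one monomial; the multiplicity records how often the join produced it.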
While our work handles Block Independent Disjoint Databases ($\bi$), for now we consider the $\ti$ model. Define the probability distribution to be $P[W_i = 1] = \prob$ for $i$ in $\{a, b, c\}$.
Note that the query of~\cref{ex:intro} in set semantics is indeed \#P-hard, since it is a query that is non-hierarchical, i.e., for $Vars(\poly)$ denoting the set of variables occurring across all atoms of $\poly$, and a function $sg(x)$ whose output is the set of all atoms that contain variable $x$, we have that $sg(A) \cap sg(B) \neq \emptyset$ and $sg(A)\not\subseteq sg(B)$ and $sg(B)\not\subseteq sg(A)$, as defined by Dalvi and Suciu in~\cite{10.1145/1265530.1265571}. Thus, computing $\expct\pbox{\poly(W_a, W_b, W_c)}$, i.e., the probability of the output with annotation $\poly(W_a, W_b, W_c)$ ($\prob(q)$ in Dalvi, Suciu), is hard in set semantics. To see this intuitively, for query $\poly$ over set semantics, we have that the output polynomial $\poly(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a$. Note that the conjunctive clauses are not independent, and the computation of the probability is not linear in the size of $\poly(W_a, W_b, W_c)$ but exponential in the worst case.
Note that computing the probability of the query of~\cref{ex:intro} in set semantics is indeed \#P-hard, since it is a query that is non-hierarchical
%, i.e., for $Vars(\poly)$ denoting the set of variables occurring across all atoms of $\poly$, a function $sg(x)$ whose output is the set of all atoms that contain variable $x$, we have that $sg(A) \cap sg(B) \neq \emptyset$ and $sg(A)\not\subseteq sg(B)$ and $sg(B)\not\subseteq sg(A)$,
as defined by Dalvi and Suciu in~\cite{10.1145/1265530.1265571}. For the purposes of this work, we define hard to be anything greater than linear time. %Thus, computing $\expct\pbox{\poly(W_a, W_b, W_c)}$, i.e. the probability of the output with annotation $\poly(W_a, W_b, W_c)$, ($\prob(q)$ in Dalvi, Suciu) is hard in set semantics.
To see why this computation is hard for query $\poly$ over set semantics, we have an output lineage formula of $\poly(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a$. Note that the conjunctive clauses are not independent of one another, and the computation
\begin{equation*}
\expct\pbox{\poly(W_a, W_b, W_c)} = P[W_aW_b] + P[W_a\overline{W_b}W_c] + P[\overline{W_a}W_bW_c] = \prob^2 + 2\prob^2(1 - \prob) = 3\prob^2 - 2\prob^3
\end{equation*}
of the probability (here a sum over disjoint events covering the disjunction) is not linear in the size of $\poly(W_a, W_b, W_c)$. In general, such a computation can be exponential.
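The closed form $3\prob^2 - 2\prob^3$ can be sanity-checked by brute force: enumerate all possible worlds of the $\ti$ and sum the probabilities of the worlds satisfying the lineage. A throwaway sketch, with an arbitrary choice of $p$:

```python
from itertools import product

p = 0.3  # example marginal probability P[W_i = 1]

# Enumerate all 2^3 possible worlds of the TI-DB and sum the probabilities
# of those in which the lineage W_aW_b OR W_bW_c OR W_cW_a is satisfied.
prob = 0.0
for wa, wb, wc in product([0, 1], repeat=3):
    if wa * wb or wb * wc or wc * wa:
        weight = 1.0
        for w in (wa, wb, wc):
            weight *= p if w else (1 - p)
        prob += weight

assert abs(prob - (3 * p**2 - 2 * p**3)) < 1e-12
```

Of course, this enumeration is exponential in the number of variables, which is exactly the point.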
%Using Shannon's Expansion,
%\begin{align*}
%&W_aW_b \vee W_bW_c \vee W_cW_a
%= &W_a
%\end{align*}
\AH{The value $\expct\pbox{\poly(W_a, W_b, W_c)}$ (in the set semantics case) needs to be computed, but I don't think I've arrived at a correct answer. In the interest of time, I am coming back to this. I appreciate any help on this. I have googled this, but search results point to instructional resources which demonstrate how to use shannon's expansion as a tool for multiplexers and not directly for computing the probability of propositional formulas...and I haven't seemed to make the connection yet.}
However, in the bag setting, the output polynomial is $\poly(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a$. Computing the expectation over the output polynomial computes the `average' multiplicity of the output tuple across possible worlds. In~\cref{ex:intro}, the expectation is simply
However, in the bag setting, the lineage formula is $\poly(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a$. To be precise, the output lineage formula is produced by a query over a set $\ti$ input, where duplicates are allowed in the output. Computing the expectation over the output lineage computes the `average' multiplicity of an output tuple across possible worlds. In~\cref{ex:intro}, the expectation is simply
\begin{align*}
&\expct\pbox{\poly(W_a, W_b, W_c)} = \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}\\
= &\expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}\\
= &\prob^2 + \prob^2 + \prob^2 = 3\prob^2,
\end{align*}
which is indeed linear in the size of the output polynomial, as the number of operations in the computation is \textit{exactly} the number of output polynomial operations. The above equalities hold since expectation is linear over addition. Further, the expectation of each product factors into a product of expectations because in the $\ti$ model all variables are independent. Note that the answer is the same as $\poly(\prob, \prob, \prob)$, although this is coincidental and not true for the general case.
which is indeed linear in the size of the lineage, as the number of operations in the computation is \textit{exactly} the number of lineage operations. The above equalities hold since expectation is linear over addition. Further, the expectation of each product factors into a product of expectations because in the $\ti$ model all variables are independent. Note that the answer is the same as $\poly(\prob, \prob, \prob)$, although this is coincidental and not true for the general case.
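The linear computation above can be cross-checked against a brute-force enumeration of worlds; a small sketch with an arbitrary choice of $p$:

```python
from itertools import product

p = 0.3  # example marginal probability P[W_i = 1]

# E[W_aW_b + W_bW_c + W_cW_a] by direct enumeration over the 2^3 worlds;
# linearity of expectation predicts 3p^2 without any enumeration.
expectation = 0.0
for wa, wb, wc in product([0, 1], repeat=3):
    weight = 1.0
    for w in (wa, wb, wc):
        weight *= p if w else (1 - p)
    expectation += weight * (wa * wb + wb * wc + wc * wa)

assert abs(expectation - 3 * p**2) < 1e-12
```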
Now, consider the query
\begin{equation*}
\poly^2() := \rel(A), E(A, B), R(B)\rel(C), E(C, D), R(D),
\poly^2() := \rel(A), E(A, B), \rel(B), \rel(C), E(C, D), \rel(D).
\end{equation*}
For an arbitrary polynomial, it is known that there may exist equivalent compressed representations of the polynomial. One such compression is known as the factorized polynomial, where the polynomial can be broken up into separate factors, and this is generally smaller than the expanded polynomial. Another equivalent form of the polynomial is the SOP, which is the expansion of the factorized polynomial by multiplying out all terms, and in general is exponentially larger (in the number of products) than the factorized version.
For an arbitrary lineage formula, which we can view as a polynomial, it is known that there may exist equivalent compressed representations of the polynomial. One such compression is known as the factorized polynomial, where the polynomial can be broken up into separate factors, and this is generally smaller than the expanded polynomial. Another equivalent form of the polynomial is the SOP, which is the expansion of the factorized polynomial by multiplying out all terms, and in general is exponentially larger (in the number of products) than the factorized version.
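To make the size gap concrete, here is a throwaway sketch that multiplies out the factorized form of $\poly^2$ and counts terms; the term names are mere labels for this illustration:

```python
from collections import Counter
from itertools import product

# One factor of the factorized form of poly^2: the three summands of poly.
factor = ["W_aW_b", "W_bW_c", "W_cW_a"]

# Multiplying out (W_aW_b + W_bW_c + W_cW_a)^2 pairs every term of the first
# factor with every term of the second: 3 * 3 = 9 products in the SOP form,
# versus only 3 + 3 terms in the factorized representation.
sop = Counter(tuple(sorted(pair)) for pair in product(factor, repeat=2))

assert sum(sop.values()) == 9   # total products in the expansion
assert len(sop) == 6            # distinct monomials after commuting factors
```

For $k$ factors the expansion has $3^k$ products while the factorized form stays linear in $k$.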
A factorized polynomial of $\poly^2$ is
@@ -152,136 +162,90 @@ The expectation then is
&\qquad \expct\pbox{2W_a^2}\expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{2W_a}\expct\pbox{W_b^2}\expct\pbox{W_c} +\\
&\qquad \expct\pbox{2W_a}\expct\pbox{W_b}\expct\pbox{W_c^2}\\
= &\prob^2 + \prob^2 + \prob^2 + 2\prob^3 + 2\prob^3 + 2\prob^3\\
= & 3\prob^2(1 + 2\prob) \neq \poly(\prob, \prob, \prob).
= & 3\prob^2(1 + 2\prob) \neq \poly^2(\prob, \prob, \prob).
\end{align*}
In this case, $\poly(\prob, \prob, \prob)$ is not the answer we seek since for a random variable $X$, $\expct\pbox{X^2} = \sum_{x \in Dom(X)}x^2 \cdot p(x)$. Note, that for our example, $Dom(W_i) = \{0, 1\}$.
In this case, even though we substitute probability (expectation) values for each variable, $\poly^2(\prob, \prob, \prob)$ is not the answer we seek, since for a random variable $X$, $\expct\pbox{X^2} = \sum_{x \in Dom(X)}x^2 \cdot p(x)$. Note that for our example, $Dom(W_i) = \{0, 1\}$. Intuitively, bags are only hard in the presence of self-joins.\AH{Atri suggests a proof in the appendix regarding the last claim.}
Define $\rpoly(\vct{X})$ to be the resulting polynomial when all exponents $e > 1$ are set to $1$ in $\poly$. Note that this structure $\rpoly(\prob, \prob, \prob)$ is the expectation we computed, since it is always the case that $i^2 = i$ for all $i$ in $\{0, 1\}$. And, $\poly^2()$ is still computable in linear time \textit{in} the size of the output polynomial, compressed or SOP.
Define $\rpoly^2(\vct{X})$ to be the polynomial that results from setting every exponent $e > 1$ to $1$ in $\poly^2$. Note that $\rpoly^2(\prob, \prob, \prob)$ is exactly the expectation we computed, since it is always the case that $i^2 = i$ for $i \in \{0, 1\}$. And the expectation of $\poly^2()$ is still computable in linear time in the size of the output polynomial, compressed or SOP.
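A brute-force check of this claim: the expectation of $\poly^2$ over all worlds equals $\rpoly^2(\prob,\prob,\prob)$, while naive substitution into $\poly^2$ itself does not. The values below are illustrative:

```python
from itertools import product

p = 0.3  # any marginal probability in [0, 1] works here

def poly2(wa, wb, wc):
    return (wa * wb + wb * wc + wc * wa) ** 2

# Exact expectation of poly^2 by enumerating all 2^3 possible worlds:
exact = sum(
    (p if wa else 1 - p) * (p if wb else 1 - p) * (p if wc else 1 - p)
    * poly2(wa, wb, wc)
    for wa, wb, wc in product([0, 1], repeat=3)
)

# rpoly^2 sets every exponent e > 1 to 1; over {0,1}-valued variables this
# changes nothing, so evaluating rpoly^2 at (p, p, p) gives the expectation:
rpoly2_at_p = 3 * p**2 + 6 * p**3  # = 3p^2(1 + 2p)

assert abs(exact - rpoly2_at_p) < 1e-12
assert abs(exact - poly2(p, p, p)) > 1e-6  # naive substitution is wrong
```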
As seen in the example, a compressed polynomial can be exponentially smaller in $k$ for $k$-products. Computing the expectation of an output polynomial in SOP form is always linear in the size of the polynomial, since expectation can be pushed through addition.
A compressed polynomial can be exponentially smaller in $k$ for $k$-products. Computing the expectation of an output polynomial in SOP form is always linear in the size of the polynomial, since expectation can be pushed through addition.
This work seeks to explore the complexity landscape for compressed representations of polynomials. We use the term `easy' to mean linear time, and the term `hard' to mean superlinear time. Up to this point the message seems consistent that bags are always easy, but
This work seeks to explore the complexity landscape for compressed representations of polynomials. We use the term `easy' to mean linear time, and the term `hard' to mean superlinear time or greater. Note that when we are linear in the size of the lineage formula, our runtime essentially matches deterministic query complexity.
Up to this point the message seems consistent that bags are always easy, but
\begin{Question}
Is it always the case that bags are easy in the size of the polynomial?
Is it always the case that bags are easy in the size of the compressed polynomial?
\end{Question}
If bags \textit{are} always easy for any compressed version of the polynomial, then there is no need for improvement. But if provably not, then the option to approximate the computation over a compressed polynomial in linear time is desirable.
When we consider the query
Consider the query
\begin{equation*}
\poly^3() := \rel(A), E(A, B), R(B), \rel(C), E(C, D), R(D), \rel(F), E(F, G), R(G),
\poly^3() := \rel(A), E(A, B), R(B), \rel(C), E(C, D), R(D), \rel(F), E(F, G), R(G).
\end{equation*}
the answer is no in the general case. Upon inspection one can see that the factorized output polynomial consists of three terms, while the SOP version consists of $3^3$ terms. We show in this paper that this particular query is hard given a factorized polynomial as input. We show this via a reduction from a problem with known hardness results in graph theory. The fact that bags are not easy in the general case when considering compressed polynomials necessitates an approximation algorithm that computes the expected multiplicity of the output in linear time when the output polynomial is in factorized form. We introduce such an approximation algorithm with confidence guarantees to compute $\rpoly(\vct{X})$ in linear time. Further, our approximation algorithm generalizes to the $\bi$ model as well.
Upon inspection one can see that the factorized output polynomial consists of three product terms, while the SOP version consists of $3^3$ terms. We show in this paper that this query is hard given a factorized polynomial as input, over a $\ti$ where every variable of $\poly^3$ has probability $\prob$. We show this via a reduction from counting the number of $3$-matchings in an arbitrary graph. The fact that bags are not easy in the general case when considering compressed polynomials necessitates an approximation algorithm that computes the expected multiplicity of the output in linear time when the output polynomial is in factorized form. We introduce such an approximation algorithm with confidence guarantees to compute $\rpoly(\vct{X})$ in linear time. Our approximation algorithm generalizes to the $\bi$ model as well. This shows that for all RA+ queries, the approximate processing time is essentially the same as deterministic processing.
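For intuition on the reduction target: the quantity in question is the number of $3$-matchings of a graph. A naive counter (far from linear time) can be sketched as follows; this is illustrative only, not the reduction itself:

```python
from itertools import combinations

def count_k_matchings(edges, k):
    """Brute-force count of k-matchings: sets of k pairwise
    vertex-disjoint edges. Exponential in k; the point of the
    hardness result is that no linear-time algorithm is known."""
    count = 0
    for subset in combinations(edges, k):
        vertices = [v for edge in subset for v in edge]
        if len(set(vertices)) == 2 * k:  # all endpoints distinct
            count += 1
    return count

# A triangle has no 3 pairwise disjoint edges; a 6-cycle has exactly
# its two perfect matchings.
triangle = [(0, 1), (1, 2), (2, 0)]
hexagon = [(i, (i + 1) % 6) for i in range(6)]
assert count_k_matchings(triangle, 3) == 0
assert count_k_matchings(hexagon, 3) == 2
```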
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Interesting contributions, problem definition, known results, our results, etc
%\[\poly^3() := \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right)\times \left(\rel(A), E(A, B), R(B)\right),\]
%it is then the case that output polynomial in a compressed version is
%\[\left(W_aW_b + W_bW_c + W_cW_a\right)^3, \]
%while the expanded equivalent version has too many (27) terms to list.
%
%In $\poly^2()$ and $\poly^3()$, we would like to when computing the expectation over the compressed output polynomial is linear in the size of the compressed polynomial, and when it is not. Additionally, given a class of queries such that the expectation is hard, is there a linear approximation algorithm with confidence guarantees?
%\paragraph{Problem Definition/Known Results/Our Results/Our Techniques}
%This work addresses the problem of performing computations over the output query polynomial efficiently. We specifically focus on computing the
%expectation over the polynomial that is the result of a query over a PDB. This is a problem where, to the best of our knowledge, there has not
%been a lot of study. Our results show that the problem is hard (superlinear) in the general case via a reduction to known hardness results
%in the field of graph theory. Further we introduce a linear approximation time algorithm with guaranteed confidence bounds. We then prove the
%claimed runtime and confidence bounds. The algorithm accepts an expression tree which models the output polynomial, samples uniformly from the
%expression tree, and then outputs an approximation within the claimed bounds in the claimed runtime.
%
%\AH{{\bf 1)} Be sure to speak of our results.\par{\bf 2)} Explicitly mention the substitution of $\prob_i$ in for $W_i$ vars.}
%Now, assume the following restrictions. First, all variables $X \in \vct{X}$ are set to $\prob$. Second, all exponents $e > 1$ in the expanded polynomial are set to $1$. Call this modified polynomial $\rpoly(\prob,\ldots, \prob)$. We show that $\expct\pbox{\poly(\prob,\ldots, \prob)} = \rpoly(\prob,\ldots, \prob)$. Here, again, in the setting of bag semantics, we have a query that is linear in the size of the expanded output polynomial, however it is not readily obvious that we achieve linearity for the factorized version of the polynomial as well. But if we think of this query in a graph theoretic setting, one can see that we end up with
%\paragraph{Interesting Mathematical Contributions}
%This work shows an equivalence between the polynomial $\poly$ and $\rpoly$, where $\rpoly$ is the polynomial $\poly$ such that all
%exponents $e > 1$ are set to $1$ across all variables over all monomials. The equivalence is realized when $\vct{X}$ is in $\{0, 1\}^\numvar$.
%This setting then allows for yet another equivalence, where we prove that $\rpoly(\prob,\ldots, \prob)$ is indeed $\expct\pbox{\poly(\vct{X})}$.
%This realization facilitates the building of an algorithm which approximates $\rpoly(\prob,\ldots, \prob)$ and in turn the expectation of
%$\poly(\vct{X})$.
%
%\[\sum\limits_{(i, j) \in E}X_iX_j + \sum\limits_{\substack{(i, j), (i \ell) \in E,\\ i \neq \ell}}X_iX_jX_\ell + \sum\limits_{\substack{(i, j), (k, \ell) \in E,\\ i\neq j\neq k \neq \ell}}X_iX_jX_kX_\ell.\]
%Another interesting result in this work is the reduction of the computation of $\rpoly(\prob,\ldots, \prob)$ to finding the number of
%3-paths, 3-matchings, and triangles of an arbitrary graph, a problem that is known to be superlinear in the general case, which is, by our definition
%hard. We show in Thm 2.1 that the exact computation of $\rpoly(\prob, \ldots, \prob)$ is indeed hard. We finally propose and prove
%an approximation algorithm of $\rpoly(\prob,\ldots, \prob)$, a linear time algorithm with guaranteed $\epsilon/\delta$ bounds. The algorithm
%leverages the efficiency of compressed polynomial input by taking in an expression tree of the output polynomial, which allows for factorized
%forms of the polynomial to be input and efficiently sampled from. One subtlety that comes up in the discussion of the algorithm is that the input
%of the algorithm is the output polynomial of the query as opposed to the input DB of the query. This then implies that our results are linear
%in the size of the output polynomial rather than the input DB of the query, a polynomial that might be greater or lesser than the input depending
%on the structure of the query.
%
%Notice that the first term is the sum of edges, and for $\rpoly(\prob,\ldots, \prob)$, this summation is computable in $O(\numedge)$ time. Similarly, the second summation is the sum over all two paths, which can also be evaluated in $O(\numedge)$ time. Finally, the third term is indeed computable in $O(\numedge)$ time by the closed form expression $\sum\limits_{(i, j) \in E}\binom{\numedge - d_i - d_j + 1}{2}$, and for all summations, we only need to multiply by the correct exponentiation of $\prob$.
%
%It is not until we compute a query such as $\poly^3() := \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right)$ that we find hardness results for a compressed polynomial, specifically that the computation is greater than linear time, i.e., superlinear.
\AH{{\bf \Large New Material Stops Here.}}
\AR{The para below has some text that is too coloquial and should not be in a paper, e.g. ``giant pile of monomials" or ``folks".}
Most implementations of modern probabilistic databases (PDBs) view the annotation polynomials as a giant pile of monomials in disjunctive normal form (DNF). Most folks have considered bag PDBs as easy, and since almost all of the theoretical framework of PDBs is in set semantics, few have considered bag PDBs. However, there is a subtle but easily missed advantage in the bag semantic setting: expectation can push through addition, making the computation easier than the oversimplified view of the polynomial being in its expanded sum of products (SOP) form suggests. There is not a lot of existing work in bag PDBs per se; this work seeks to unite previous work in factorized databases with theoretical guarantees when used in computations over bag PDBs, which have not been extensively studied in the literature. We give theoretical results for computing the expectation of a bag polynomial, while introducing a linear time approximation algorithm for computing the expectation of a bag PDB tuple.
\AR{The para above does not quite seem to follow the outline we discussed for the first para. Here is what I thought we had discussed (but it is possible I'm mis-remembering). Here is how you can phrase this, line by line:
(Line 1): State modern DBs are based on bag semantics.
(Line 2): But most implementations of PDBs are for set semantics, where the annotation polynomial is represented in DNF form and blah problems are \#P-hard.
(Line 3): However, for the equivalent to DNF representation for bags, blah problems can be computed in linear time.
(Line 4): In this work we show that if we use better representation like factorized DBs for annotation polynomial then the complexity landscape becomes much more nuanced.
}
and this contributes to slow computation time \AR{Why did you put this comment on slow computation time? What is it buying you? It seems like you are jumping ahead over here.}. In both settings it is the case that each tuple is annotated with a polynomial, which describes the tuples contributing to the given tuple's presence in the output. While in set semantics, the output polynomial is predominantly viewed as the probability\AR{The annotation polynomial does {\bf NOT} give the probability of a tuple-- it just says whether a tuple is present in a world or not. Only when you take the expectation of the polynomial do you get a probability value.} that its associated tuple exists or not, in the bag setting the polynomial is an encoding of the multiplicity of the tuple in the output. Note that in general, as we allude to later on, the polynomial can also represent set semantics, access levels, and other encodings.\AR{How does the last statement help? In this paper we are interested in set and bag semantics: why are you bring up other possibilities that have nothing to do with the main message of the paper?}
\AR{I'll pause here now (except for two comments on next page) since it would be quicker to give comments during the meeting but here is something to ponder about, which might make things simpler to present. I would recommend that you introduce the theoretical problem {\bf right after} the first para. Once you define the problem, what you are trying to describe in words in this para can be done much more succinctly with relevant notation. If you want to be bit more gentle then perhaps start off with an example annotation polynomial and talk about that example-- this latter one might be better for PODS. But {\em ideally}, you would like an example query that is hard in set semantics but easy in bag in SoP representation but also hard with more succinct representation. Side Q-- is our hard query that we use for triangle counting etc. \#P-hard in the set semantics? If so that would be a great example to use throughout the intro.}
In bag semantics, the polynomial is composed of $+$ and $\times$ operators, with constants from $\mathbb{N}$ and variables from the set of variables $\vct{X}$. Should we attempt computations, e.g., expectation, over the output polynomial, the naive algorithm cannot hope to do better than linear time in the size of the polynomial. However, in the set semantics setting, when, e.g., computing the expectation (probability) of the output polynomial given values for each variable in the polynomial's set of variables $\vct{X}$, this problem is \#P-hard. %of the output polynomial of a result tuple for an arbitrary query. In contrast, in the bag setting, one cannot generate a result better than linear time in the size of the polynomial.
There is limited work and few results in the area of bag semantic PDBs. This work seeks to leverage prior work in factorized databases (e.g., Olteanu et al.)~\cite{DBLP:conf/tapp/Zavodny11} with PDB implementations to improve efficient computation over output polynomials, with theoretical guarantees. \AR{I know what you are trying to say in the rest of the para but it can be easily interpreted to be saying something that is {\bf false}. It is always the case that the compressed form of the polynomial always evaluates to the same value as the extended SoP form for any value. So the expected value of compressed poly is {\em always the same} as expected value of the SoP forms. What you are trying to get here is what happens when you "push" in the expectations. Again the latter is very hard to describe in words. But this would be much easier to state once you have the notation in place. Or if you have a running example.} When considering PDBs in the bag setting, a subtlety arises that is easily overlooked due to the \textit{oversimplification} of PDBs in the set setting, i.e., in set semantics expectation does not have linearity over disjunction, and a consequence of this is that it is not true in the general case that a compressed polynomial has an equivalent expectation to its DNF form. In the bag PDB setting, however, expectation does enjoy linearity over addition, and the expectation of a compressed polynomial and its equivalent SOP are indeed the same.
For almost all modern PDB implementations, an output polynomial is only ever considered in its expanded SOP form.
\BG{I don't think this statement holds up to scrutiny. For instance, ProvSQL uses circuits.}
Any computation over the polynomial then cannot hope to run in less than linear time in the number of monomials, which is known to be exponential in the size of the input in the general case.
The landscape of bags changes when we think of annotation polynomials in a compressed form rather than as traditionally implemented in DNF. The implementation of PDBs has followed a course of producing output tuple annotation polynomials as a
large collection of monomials, i.e., in DNF. This is seen in many implementations, including MayBMS, MystiQ, GProM, and Orion~\cite{DBLP:conf/icde/AntovaKO07a, DBLP:conf/sigmod/BoulosDMMRS05, AF18, DBLP:conf/sigmod/SinghMMPHS08}, all of which use an
encoding that is essentially an enumeration of all the monomials in the DNF. \AR{I don't think a PODS reader would care much about the next few sentences. Compress the argument into a sentence. If you cannot, perhaps it is not worth putting in?} The reason for this is the customary fixed data
size rule for attributes in the classical approach to building DBs. Such an approach lets practitioners know how big a tuple will be,
and thus paves the way for several optimizations. The goal is to avoid the situation where a field might get too big. However, annotations
break this convention, since, e.g., a projection can produce an annotation that is arbitrarily large relative to the size of the data. Other RA operators, such as join,
grow the annotations unboundedly as well, albeit at a lesser rate. As a result, the aforementioned PDBs avoid creating arbitrarily sized
fields, and the strategy has been to take the provenance polynomial, flatten it into individual monomials, and store each monomial in
a table. This restriction carries with it an $O(n^2)$ runtime in the size of the input tables to materialize the monomials, and it precludes more compact representations. Those PDBs that do allow factorized polynomials, e.g., Sprout~\cite{DBLP:conf/icde/OlteanuHK10}, assume such an encoding in set semantics. With compressed encodings, the problem in bag semantics is actually hard in a non-obvious way.
It turns out that, should an implementation allow a compressed form of the polynomial, computations over the polynomial can be done in better than linear time in the number of monomials of the expanded SOP form. While runtime in the number of monomials and runtime in the size of the polynomial coincide when the polynomial is given in SOP form, they differ when we allow compressed versions of the polynomial as input to the desired computation. While the naive algorithm to compute the expectation of a polynomial, given probability values for all variables in $\vct{X}$, is to generate all the monomials and compute each of their probabilities, factorized polynomials in the bag setting allow us to compute the expectation in the number of terms (with their corresponding probability values substituted in for their respective variables) that make up the compressed representation. For clarity, the probability we are considering is whether or not the tuple exists in the input DB; in other words, the input to an arbitrary query $Q$ is a set PDB. Note that our scheme takes the \textit{output polynomial} generated by the query over the input DB as its input.
As implied above, we define \emph{hard} to mean any runtime superlinear in the number of monomials of the SOP form. In this work, we show that computing the expectation of the output polynomial is hard in the general case, even for the class of conjunctive queries ($CQ$), which allows only projections and joins, over a $\ti$ in which all tuples have probability $\prob$. However, compressed representations of the polynomial pave the way for an approximation algorithm that runs in linear time with $(\epsilon, \delta)$ guarantees.
The representations we consider thus range in size between the fully compressed form (a lower bound) and the expanded SOP form (an upper bound), and our bounds are stated over this range. In approximating the expectation, we model the query's output polynomial as an expression tree, which naturally admits polynomials in compressed form.
\paragraph{Problem Definition/Known Results/Our Results/Our Techniques}
This work addresses the problem of performing computations over the output query polynomial efficiently. We focus specifically on computing the
expectation of the polynomial that results from a query over a PDB, a problem that, to the best of our knowledge, has received little prior study. Our results show that the problem is hard (superlinear) in the general case, via a reduction from known hardness results
in graph theory. Further, we introduce a linear-time approximation algorithm with guaranteed confidence bounds, and we prove the
claimed runtime and confidence bounds. The algorithm accepts an expression tree modeling the output polynomial, samples uniformly from the
expression tree, and outputs an approximation within the claimed bounds in the claimed runtime.
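To make the sampling intuition concrete, the following is a minimal Monte Carlo sketch in Python. It is \emph{not} the algorithm of this paper (which samples monomials from the expression tree and carries $(\epsilon, \delta)$ guarantees); it only illustrates that a factorized expression tree supports evaluation, and hence a naive sampling-based estimate of the expectation, in time linear in the compressed size per sample. The tuple-based tree encoding and all function names are hypothetical.

```python
import random

def eval_tree(node, assignment):
    """Evaluate a factorized polynomial expression tree on a 0/1 assignment.
    A node is ('var', name), ('const', c), ('+', l, r), or ('*', l, r)."""
    op = node[0]
    if op == 'var':
        return assignment[node[1]]
    if op == 'const':
        return node[1]
    left = eval_tree(node[1], assignment)
    right = eval_tree(node[2], assignment)
    return left + right if op == '+' else left * right

def estimate_expectation(tree, variables, p, samples=20000, seed=0):
    """Monte Carlo estimate of E[poly(X)] when each X_i ~ Bernoulli(p)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        assignment = {v: 1 if rng.random() < p else 0 for v in variables}
        total += eval_tree(tree, assignment)
    return total / samples

# Example: poly = (X + Y) * (X + Z), kept in factorized (tree) form.
tree = ('*', ('+', ('var', 'X'), ('var', 'Y')),
             ('+', ('var', 'X'), ('var', 'Z')))
est = estimate_expectation(tree, ['X', 'Y', 'Z'], 0.5)
```

For $(X+Y)(X+Z)$ with independent Bernoulli($\prob$) variables, the true expectation is $\prob + 3\prob^2$ (i.e., $1.25$ at $\prob = 0.5$), which the estimate approaches as the sample count grows.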
\paragraph{Interesting Mathematical Contributions}
This work shows an equivalence between the polynomial $\poly$ and $\rpoly$, where $\rpoly$ is obtained from $\poly$ by setting every
exponent $e > 1$ to $1$, across all variables and all monomials. The equivalence holds when $\vct{X}$ ranges over $\{0, 1\}^\numvar$.
In this setting we prove a further equivalence: $\rpoly(\prob,\ldots, \prob)$ is exactly $\expct\pbox{\poly(\vct{X})}$.
This observation underpins an algorithm that approximates $\rpoly(\prob,\ldots, \prob)$, and in turn the expectation of
$\poly(\vct{X})$.
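As a small illustrative example of this equivalence (not drawn from the paper's running examples; $X$ and $Y$ are independent tuple variables, each $1$ with probability $\prob$), take $\poly(X, Y) = (X + Y)^2$:
\begin{align*}
\poly(X, Y) &= X^2 + 2XY + Y^2, \qquad \rpoly(X, Y) = X + 2XY + Y,\\
\expct\pbox{\poly(X, Y)} &= \expct\pbox{X^2} + 2\expct\pbox{XY} + \expct\pbox{Y^2} = \prob + 2\prob^2 + \prob = \rpoly(\prob, \prob),
\end{align*}
where the second line uses $X^2 = X$ for $X \in \{0, 1\}$ and the independence of $X$ and $Y$.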
Another interesting result in this work is a reduction from counting the 3-paths, 3-matchings, and triangles of an arbitrary graph, problems known to be superlinear in the general case and thus, by our definition, hard, to the computation of $\rpoly(\prob,\ldots, \prob)$. We show in Thm 2.1 that exact computation of $\rpoly(\prob, \ldots, \prob)$ is indeed hard. Finally, we propose and prove correct
an approximation algorithm for $\rpoly(\prob,\ldots, \prob)$: a linear-time algorithm with guaranteed $(\epsilon, \delta)$ bounds. The algorithm
leverages the efficiency of compressed polynomial input by taking an expression tree of the output polynomial, which allows factorized
forms of the polynomial to be input and sampled from efficiently. One subtlety that comes up in the discussion of the algorithm is that the input
of the algorithm is the output polynomial of the query rather than the query's input database. Our results are therefore linear
in the size of the output polynomial, which may be larger or smaller than the input database depending
on the structure of the query.
\section{Outline of the rest of the paper}
\begin{enumerate}
\item Background Knowledge and Notation
\begin{enumerate}
\item Review notation for PDBs
\item Review the use of semirings as generating output polynomials
\item Review the translation of semiring operators to RA operators
\item Polynomial formulation and notation
\end{enumerate}
\item Reduction to hardness results in graph theory
\begin{enumerate}
\item $\rpoly$ and its equivalence to $\expct\pbox{\poly}$ when $\vct{X} \in \{0, 1\}^\numvar$
\item Results for SOP polynomial
\item Results for compressed version of polynomial
\item ~\cref{lem:const-p} proof
\end{enumerate}
\item Approximation Algorithm
\begin{enumerate}
\item Description of the Algorithm
\item Theoretical guarantees
\item Will we have time to tackle BIDB?
\begin{enumerate}
\item If so, experiments on BIDBs?
\end{enumerate}
\end{enumerate}
\item Future Work
\item Conclusion
\end{enumerate}
