Small tweaks to the Intro

This commit is contained in:
Aaron Huber 2020-11-19 10:30:17 -05:00
parent e9e03df7b1
commit d86b4747f2


%root: main.tex
\AH{I need help not being redundant...}
\section{Introduction}
Most implementations of modern probabilistic databases (PDBs) view the annotation polynomials as a giant pile of monomials in disjunctive normal form (DNF). Bag PDBs are widely considered easy, and since almost all of the theoretical framework for PDBs is developed in set semantics, they have received little attention. However, there is a subtle but easily missed advantage in the bag semantic setting: expectation pushes through addition, making the computation easier than the oversimplified view of the polynomial in its expanded sum of products (SOP) form would suggest. There is little existing work on bag PDBs per se; this work seeks to unite prior work on factorized databases with theoretical guarantees for computations over bag PDBs, which have not been extensively studied in the literature. We give theoretical results for computing the expectation of a bag polynomial, and introduce a linear time approximation algorithm for computing the expectation of a bag PDB tuple's annotation polynomial.
In practice, modern production databases, e.g., Postgres, Oracle, etc., use bag semantics. In contrast, as noted above, most implementations of PDBs are built in the setting of set semantics, and this contributes to slow computation time. In both settings, each tuple is annotated with a polynomial that describes the input tuples contributing to the given tuple's presence in the output. While in set semantics the output polynomial is predominantly viewed as encoding the probability that its associated tuple exists, in the bag setting the polynomial encodes the multiplicity of the tuple in the output. Note that in general, as we allude to later on, the polynomial can also represent set semantics, access levels, and other encodings. In bag semantics, the polynomial is composed of $+$ and $\times$ operators, with constants from $\mathbb{N}$ and variables from the set of variables $\vct{X}$. Should we attempt computations, e.g., expectation, over the output polynomial, the naive algorithm cannot hope to do better than linear time in the size of the polynomial. In the set semantics setting, however, computing the expectation (probability) of the output polynomial given values for each variable in $\vct{X}$ is \#P-hard. %of the output polynomial of a result tuple for an arbitrary query. In contrast, in the bag setting, one cannot generate a result better than linear time in the size of the polynomial.
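As a toy illustration (our own, for concreteness): if an output tuple can be derived either by joining two input tuples annotated $x$ and $y$, or directly from an input tuple annotated $z$, then its annotation polynomial is
\[
x \cdot y + z,
\]
and substituting input multiplicities, e.g., $x = 2$, $y = 3$, $z = 1$, gives the output multiplicity $2 \cdot 3 + 1 = 7$.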
There is limited work in the area of bag semantic PDBs. This work seeks to leverage prior work on factorized databases (e.g., Olteanu et al.)~\cite{DBLP:conf/tapp/Zavodny11} in PDB implementations to enable efficient computation over output polynomials, with theoretical guarantees. When considering PDBs in the bag setting, a subtlety arises that is easily overlooked due to the \textit{oversimplification} of PDBs in the set setting: in set semantics, expectation is not linear over disjunction, and as a consequence it is not true in general that a compressed polynomial has the same expectation as its DNF form. In the bag PDB setting, however, expectation does enjoy linearity over addition, and the expectation of a compressed polynomial and of its equivalent SOP form are indeed the same.
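As a quick worked contrast (a standard calculation, included for concreteness): for independent Boolean variables $x, y$ with probabilities $p_x, p_y$, set semantics gives
\[
\mathbb{E}[x \vee y] \;=\; p_x + p_y - p_x p_y \;\neq\; \mathbb{E}[x] + \mathbb{E}[y],
\]
whereas in bag semantics linearity of expectation gives
\[
\mathbb{E}[x + y] \;=\; \mathbb{E}[x] + \mathbb{E}[y] \;=\; p_x + p_y,
\]
with no independence assumption needed at all.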
For almost all modern PDB implementations, an output polynomial is only ever considered in its expanded SOP form.
\BG{I don't think this statement holds up to scrutiny. For instance, ProvSQL uses circuits.}
Any computation over the polynomial then cannot hope to take less than time linear in the number of monomials, which is known to be exponential in the size of the input in the general case.
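To see how quickly this blowup occurs (our own toy example): a $k$-fold self-product projected down to a single output tuple can annotate it with the compressed polynomial $(x_1 + \cdots + x_n)^k$, of size $O(nk)$, whose expansion has $n^k$ monomial terms, exponential in the number of joins $k$.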
The landscape of bags changes when we think of annotation polynomials in compressed form rather than as traditionally implemented in DNF. PDB implementations have followed a course of producing output tuple annotation polynomials as a giant pile of monomials, i.e., a DNF. This is seen in many implementations, including MayBMS, MystiQ, GProM, and Orion~\cite{DBLP:conf/icde/AntovaKO07a,DBLP:conf/sigmod/BoulosDMMRS05,AF18,DBLP:conf/sigmod/SinghMMPHS08}, all of which use an encoding that is essentially an enumeration of all the monomials in the DNF. The reason for this is the customary fixed data size rule for attributes in the classical approach to building DBs. Such an approach lets practitioners know how big a tuple will be,
and thus paves the way for several optimizations. The goal is to avoid the situation where RA operators break this convention, since, e.g., a projection can produce an annotation that is arbitrarily large in the size of the data. Other RA operators, such as join, grow the annotations unboundedly as well, albeit at a lesser rate. As a result, the aforementioned PDBs want to avoid creating arbitrarily sized fields, and the strategy has been to take the provenance polynomial, flatten it into individual monomials, and store each individual monomial in
a table. This restriction carries with it an $O(n^2)$ run time in the size of the input tables to materialize the monomials. Obviously, such an approach disallows doing anything clever. Those PDBs that do allow factorized polynomials, e.g., Sprout~\cite{DBLP:conf/icde/OlteanuHK10}, assume such an encoding under set semantics. With compressed encodings, the problem in bag semantics is actually hard in a non-obvious way.
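For intuition on the quadratic materialization cost (our own toy example): a join that pairs every tuple of one $n$-tuple relation with every tuple of another can annotate a single projected output tuple with
\[
(x_1 + \cdots + x_n)\,(y_1 + \cdots + y_n),
\]
whose flattened DNF consists of the $n^2$ monomials $x_i y_j$, while the factorized form occupies only $O(n)$ space.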
It turns out that, should an implementation allow for a compressed form of the polynomial, computations over the polynomial can be done in better than linear runtime, where linear, again, means linear in the number of monomials of the expanded SOP form. While runtime in the number of monomials and runtime in the size of the polynomial coincide when the polynomial is given in SOP form, they are not the same when we allow compressed versions of the polynomial as input to the desired computation. The naive algorithm to compute the expectation of a polynomial, given probability values for all variables in $\vct{X}$, is to generate all the monomials and compute each of their probabilities; factorized polynomials in the bag setting instead allow us to compute the expectation in time linear in the number of terms of the compressed representation (with the corresponding probability values substituted in for their respective variables). For clarity, the probability we are considering is that of whether or not a tuple exists in the input DB; in other words, the input to an arbitrary query $Q$ is a set PDB. Note that our scheme takes the \textit{output polynomial} generated by the query over the input DB as its input.
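Continuing the toy example above, and assuming tuple-independence so that $\mathbb{E}[x_i] = p_{x_i}$ and distinct variables are independent:
\[
\mathbb{E}\!\left[\Big(\sum_{i=1}^{n} x_i\Big)\Big(\sum_{j=1}^{n} y_j\Big)\right]
= \Big(\sum_{i=1}^{n} p_{x_i}\Big)\Big(\sum_{j=1}^{n} p_{y_j}\Big),
\]
computable in $O(n)$ time directly over the factorized form, versus the $n^2$ monomials of the expanded SOP form.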
As implied above, we define hard to be anything greater than linear in the number of monomials of the polynomial in SOP form. In this work, we show that computing the expectation of the output polynomial, even for the query class $CQ$, which allows only projections and joins, over a $\ti$ in which all tuples have probability $\prob$, is hard in the general case. However, allowing for compressed versions of the polynomial paves the way for an approximation algorithm that runs in linear time with $\epsilon/\delta$ guarantees. %Also, while implied in the preceding, in this work, the input size to the approximation algorithm is considered to be the query polynomial as opposed to the input database.
The richness of the problem we explore yields both a lower and an upper bound, in terms of the size of the compressed form of the polynomial and of its SOP form, i.e., the range [compressed, SOP]. In approximating the expectation, an expression tree is used to model the query output polynomial, which indeed admits polynomials in compressed form.
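To make the expression tree representation concrete, the following is a minimal sketch (our own illustration, not an actual implementation from this or any cited system) of exact expectation computation over such a tree, assuming a tuple-independent input so that distinct variables are independent $\{0,1\}$ random variables. Note that the one-pass product rule below is only valid when the children of a $\times$ node share no variables; it is exactly the repetition of variables in compressed polynomials that makes the general problem hard.
\begin{verbatim}
# Minimal sketch (illustration only): an expression tree over + and *
# with variable and constant leaves, and a bottom-up pass computing the
# expectation of the polynomial it represents.
from dataclasses import dataclass
from typing import Union

@dataclass
class Var:
    prob: float   # E[x] for a {0,1} tuple-existence variable

@dataclass
class Const:
    value: int    # constant from N

@dataclass
class Node:
    op: str       # '+' or '*'
    left: "Expr"
    right: "Expr"

Expr = Union[Var, Const, Node]

def expectation(e: Expr) -> float:
    """Runs in time linear in the tree size. '+' uses linearity of
    expectation; '*' assumes its two subtrees share no variables
    (independence), which fails for general compressed polynomials."""
    if isinstance(e, Var):
        return e.prob
    if isinstance(e, Const):
        return float(e.value)
    l, r = expectation(e.left), expectation(e.right)
    return l + r if e.op == '+' else l * r

# (x + y) * z with p_x = 0.5, p_y = 0.4, p_z = 0.9
tree = Node('*', Node('+', Var(0.5), Var(0.4)), Var(0.9))
print(expectation(tree))   # (0.5 + 0.4) * 0.9 = ~0.81
\end{verbatim}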
\paragraph{Problem Definition/Known Results/Our Results/Our Techniques}
This work addresses the problem of performing computations over the output query polynomial efficiently. We specifically focus on computing the
expectation of the polynomial that results from a query over a PDB. This is a problem that, to the best of our knowledge, has not been studied extensively. Our results show that the problem is hard (superlinear) in the general case, via a reduction from known hardness results in graph theory. Further, we introduce a linear time approximation algorithm with guaranteed confidence bounds, and we prove the claimed runtime and confidence bounds. The algorithm accepts an expression tree which models the output polynomial, thereby leveraging the efficiency of compressed polynomial input: compressed forms of the polynomial can be input and efficiently sampled from, with the algorithm sampling uniformly from the monomials of the equivalent SOP form. One subtlety that comes up in the discussion of the algorithm is that the input
of the algorithm is the output polynomial of the query, as opposed to the input DB of the query. This implies that our results are linear in the size of the output polynomial rather than in the size of the input DB, a polynomial that may be larger or smaller than the input depending
on the structure of the query.
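As a simple illustration of this last point (our own example): a cross product of two $n$-tuple relations yields output annotations of total size $\Theta(n^2)$, larger than the input, whereas a highly selective query can yield an output polynomial far smaller than the input DB.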