Added more to the Intro; implemented almost all of Boris' suggestions

master
Aaron Huber 2020-11-17 15:38:04 -05:00
parent 5af0132778
commit 2ee0e3c50e
2 changed files with 62 additions and 17 deletions


@ -24,6 +24,7 @@ BEGIN: Introduction Outline
-Thus, it is relevant and interesting to explore PDBs from a bag perspective
-in practice, Production Databases (Postgres, Oracle, etc.) use bags; modern pdbs are slow since they are dealing with sets rather than bags
-Brief overview of the challenges in the very beginning--1st paragraph, be very economical with words
--COMPUTATIONS (efficiently) over the output polynomial, in our case, expectation
-focus on intuition
-what are the key points?
-one cannot generate a result better than linear time in the size of the polynomial; sets are even harder than that--in
@ -34,13 +35,13 @@ BEGIN: Introduction Outline
Naive alg: generate all the monomials, and compute each of their probabilities
-why do people think bags are easy?
-???
THIS COULD BE BROUGHT OUT MORE -how does this tie into how people approach implementing pdbs?
-for one, the customary rule of fixed data size on attributes has significantly influenced how folks implement PDBs, i.e., with polynomials in DNF
-how can we do better than the standard bar that most pdbs use
-by accepting factorized polynomials as input
-don't worry about specifics
-Result--here is what we show (in gentle English and not too technical terms)
MAYBE VERIFY THIS IS EFFECTIVELY BROUGHT OUT -Computation over bags is hard, i.e. superlinear time if we are given a compressed/factorized db; on the other hand, when we use an approximation
algorithm, we get linear time
Can have another segue paragraph
@ -49,15 +50,19 @@ After motivation up front, forget it all, and get into the nitty gritty
Typical Theory Paper Structure (Look at PODS papers you've read and see their structures):
-------------------------------
#Define the (mathematical) problem
-computing expectation over bag PDB query output polynomial
#Here are known results (you want to articulate why this problem (that you are addressing) is non-trivial)
-people have not really studied this
#Here are our results
-hard in the general case via a reduction to computing the number of 3-paths, 3-matchings, and triangles in an arbitrary graph
#Here are the techniques for our results
-the algorithm uniformly samples monomials from the expression tree of the polynomial, approximating $\rpoly{Q}$.
-we perform an analysis of the approximation algorithm that proves linear time with confidence guarantees
2nd Paragraph:
-------------
-Somewhere we need to mention...
THIS WAS NOT INCLUDED IN THE ORIGINAL PASS OF INTRO-------v
-Interesting mathematically
-\tilde{Q} equivalence with \poly{Q} under \vct{X} \in {0, 1}^n
-what does this buy us, aside from being an interesting fact?
@ -86,14 +91,14 @@ Typical Theory Paper Structure (Look at PODS papers you've read and see their st
and/or forget about listing subtleties and just talk about them in some decent order
-list the subtleties that we want to make VERY CLEAR and will subsequently detail in the next paragraphs
BE CERTAIN THAT THIS IS EXPLICITLY STATED -#2 clarify how you are counting the input size (the size of the db instance vs. the size of the query polynomial--which might be significantly larger or smaller
PLACEHOLDER IN THE TEXT FOR THE EXAMPLE than the former); this may be a good place to have an example
-better than linear time in the output
-since we take as input the polynomial encoding the query output
-by the fact that this output polynomial can be in factorized form
Q: -the above is the only subtlety that comes to mind currently
<-------------
DONE------------->ADD THE HISTORY
-Historical Overview
-why is this the way it is
-the customary fixed attribute data size rule
@ -102,7 +107,7 @@ Q: -the above is the only subtelty that comes to mind currently
an enumeration through the monomials
-this classical approach disallows doing anything clever
-those that use a factorized encoding assume sets, as is the case for Sprout
SKIPPED--->with new encodings, the bag problem is actually hard in a non-obvious way
- a common convention in DBs is a fixed bound on the size of a column, i.e., on the size of the data
-if you know how big the tuple is, there are a bunch of optimizations that you can do
-you want to avoid the situation where the field gets too big
@ -118,7 +123,11 @@ Q: -the above is the only subtelty that comes to mind currently
Factorized Databases implement factorizations (SIGMOD 2012? Olteanu)
-think about MayBMS
-
END HISTORY
CAN PERHAPS INCORPORATE THIS PART OF THE OUTLINE MORE STRONGLY IN THE BEGINNING PARAs
-Describe the subtlety of our scheme performing "better than linear in the output size"
-explicitly define our definition of 'hard' query in this setting
-hard is anything worse than linear time in the size of the SOP polynomial


@ -2,26 +2,62 @@
\section{Introduction}
In practice, modern production databases, e.g., Postgres, Oracle, etc. use bag semantics. In contrast, most implementations of modern probabilistic databases (PDBs) are built in the setting of set semantics, and this contributes to slow computation time. In the set semantics setting, one cannot get better than \#P runtime\BG{I think when talking complexity classes, it is preferable to state the complexity, e.g., saying this problem is \#P-hard} in computing the expectation\BG{of what?} of the output polynomial\BG{I think this is dropped on the reader without any warning. What is the polynomial here. I think before getting at this point you need an introductory sentence to explain how polynomials are used in (bag) PDB. Also mention briefly what the output is: the expected multiplicity of the tuple?} of a result tuple for an arbitrary query. In contrast, in the bag setting, one cannot generate a result better than linear time in the size of the polynomial.
In practice, modern production databases, e.g., Postgres, Oracle, etc., use bag semantics. In contrast, most implementations of modern probabilistic databases (PDBs) are built in the setting of set semantics, and this contributes to slow computation time. When we consider PDBs in the bag setting, each output tuple is annotated with a polynomial that describes which input tuples contribute to that tuple's presence in the output. The polynomial is composed of $+$ and $\times$ operators, with constants from $\mathbb{N}$ and variables from the set $\vct{X}$. The polynomial encodes the multiplicity of the tuple in the output, although in general, as we allude to later on, such polynomials can also represent set semantics, access levels, and other annotations. Should we attempt computations over the output polynomial, the naive algorithm cannot hope to do better than linear time in the size of the polynomial. In the set semantics setting, however, computing, e.g., the expectation of the output polynomial given a probability for each variable in $\vct{X}$ is \#P-hard. %of the output polynomial of a result tuple for an arbitrary query. In contrast, in the bag setting, one cannot generate a result better than linear time in the size of the polynomial.
\BG{Introductions also serve to embed the work into the context of related work, it would be good to add citations and state explicitly who has done what}
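For intuition, consider a small, purely illustrative example (the particular annotation is ours and not tied to any specific query discussed here): suppose an output tuple is annotated with the polynomial
\[
\poly(\vct{X}) = X_1 X_2 + X_1 X_3,
\]
meaning the tuple can be derived either from input tuples $1$ and $2$ together or from input tuples $1$ and $3$. Under bag semantics, substituting each $X_i$'s multiplicity yields the output multiplicity; e.g., for $X_1 = 2$, $X_2 = 1$, $X_3 = 3$ the tuple appears $2 \cdot 1 + 2 \cdot 3 = 8$ times.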
\BG{what is the oversimplification of PDBs in the set semantics setting? what subtlety arises because of this?}
There is limited work and few results in the area of bag semantic PDBs. When considering PDBs in the bag setting, a subtlety arises that is easily overlooked due to the \textit{oversimplification} of PDBs in the set setting: in set semantics, expectation is not linear over disjunction, and as a consequence it is not true in general that a compressed polynomial has the same expectation as its DNF form. In the bag PDB setting, however, expectation does enjoy linearity over addition, and the expectation of a compressed polynomial and that of its equivalent SOP are indeed the same.
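To make the contrast concrete, here is a minimal sketch using hypothetical independent annotation variables: in the bag setting, linearity of expectation gives
\[
\expct\pbox{X_1 X_2 + X_1 X_3} = \expct\pbox{X_1 X_2} + \expct\pbox{X_1 X_3},
\]
and the compressed form $X_1(X_2 + X_3)$ has exactly the same expectation as this SOP form. In the set setting the analogous identity fails, since in general
\[
\expct\pbox{(X_1 \wedge X_2) \vee (X_1 \wedge X_3)} \neq \expct\pbox{X_1 \wedge X_2} + \expct\pbox{X_1 \wedge X_3},
\]
because the two disjuncts are not disjoint events.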
For almost all modern PDB implementations, an output polynomial is only ever considered in its expanded SOP form.
\BG{I don't think this statement holds up to scrutiny. For instance, ProvSQL uses circuits.}
Any computation over the polynomial then cannot hope to run in less than linear time in the number of monomials, a number that is known to be exponential in the size of the input in the general case.
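As a hypothetical illustration of this blowup, the factorized polynomial
\[
\prod_{i=1}^{n} (X_i + Y_i)
\]
has only $2n$ leaves, yet its expanded SOP form contains $2^{n}$ monomials; any algorithm that insists on first enumerating the monomials pays this exponential cost.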
The landscape of bags changes when we think of annotation polynomials in a compressed form rather than as traditionally implemented in
disjunctive normal form (DNF). The implementation of PDBs has followed a course of producing output tuple annotation polynomials as a
giant pile of monomials, aka DNF. This is seen in many implementations (including MayBMS, MystiQ, GProM, Orion, etc.), all of which use an
encoding that is essentially an enumeration through all the monomials of the DNF. The reason for this is the customary fixed data
size rule for attributes in the classical approach to building DBs. Such an approach lets practitioners know how big a tuple will be,
and thus paves the way for several optimizations; the goal is to avoid the situation where a field grows too big. Annotations, however,
break this convention, since, e.g., a projection can produce an annotation whose size is arbitrarily large in the size of the data. Other RA operators, such as join,
also grow the annotations unboundedly, albeit at a lesser rate. As a result, the aforementioned PDBs want to avoid creating arbitrarily sized
fields, and the strategy has been to take the provenance polynomial, flatten it into individual monomials, and store each individual monomial in
a table. This restriction carries with it an $O(n^2)$ cost in the size of the input tables to materialize the monomials. Obviously, such an approach
disallows doing anything clever. Those PDBs that do allow factorized polynomials, e.g., Sprout, assume such an encoding in the set
semantics setting. With compressed encodings, the problem in bag semantics is actually hard in a non-obvious way.
\AH{New para maybe? Introduce the subtlety here between a giant pile of monomials and a compressed version of the polynomial}
It turns out, that should an implementation allow for a compressed form of the polynomial, that computations over the polynomial can be done in better than linear runtime, again, linear in the number of monomials of the expanded Sum of Products (SOP) form. While it is true that the runtime in the number of monomials versus runtime in the size of the polynomial is the same when the polynomial is given in SOP form, the runtimes are not the same when we allow for compressed versions of the polynomial as input to the desired computation. While the naive algorithm to compute the expectation of a polynomial\BG{what is the input to the problem? The polynomial + X. What is X?} is to generate all the monomials and compute each of their probabilities, factorized polynomials in the bag setting allow us to compute the probability\BG{It is probably not clear to the reader what a probability means in the bag setting. An output tuple would have a multiplicity in the bag setting. What probability are we computing? The probability that the multiplicity is larger than 0 (the tuple exists)?} in the number of terms that make up the compressed representation.
It turns out that, should an implementation allow for a compressed form of the polynomial, computations over the polynomial can be done in better than linear runtime, where linear is measured in the number of monomials of the expanded Sum of Products (SOP) form. While runtime in the number of monomials and runtime in the size of the polynomial coincide when the polynomial is given in SOP form, they are not the same when we allow compressed versions of the polynomial as input to the desired computation. The naive algorithm for computing the expectation of a polynomial, given a probability value for each variable in $\vct{X}$, is to generate all the monomials and compute each of their probabilities; factorized polynomials in the bag setting instead allow us to compute this quantity in time linear in the number of terms of the compressed representation (with the corresponding probability values substituted in for their respective variables). For clarity, the probability we are considering is the probability that the tuple exists in the output.
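For a sense of the gap, consider the hypothetical case where the factors of a compressed polynomial mention pairwise disjoint, independent variables, e.g.,
\[
\expct\pbox{(X_1 + X_2)(X_3 + X_4)} = \left(\expct\pbox{X_1} + \expct\pbox{X_2}\right)\left(\expct\pbox{X_3} + \expct\pbox{X_4}\right),
\]
which is computable with a handful of operations directly on the compressed form, whereas the SOP form already has four monomials. When variables repeat across factors, this simple push-down of expectation is no longer sound ($\expct\pbox{X^2} \neq \expct\pbox{X}^2$ in general), which is precisely where the difficulty addressed in this work arises.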
\AH{Perhaps a new para here? Above para needs to gently but explicitly highlight the differences in traditional logical implementation and our approach.}
As implied above, we define hard to be anything greater than linear in the number of monomials when the polynomial is in SOP form. In this work, we show, that computing the expectation over the output polynomial for even a query class of $PJ$\BG{should we use query classes that are more familiar to PODS people like CQ? Or at least mention that this is equivalent to CQ?} over a bag $\ti$ where all tuples have probability $\prob$\BG{I think the setting needs to be explained better. Are we restricting ourselves to the case where the input tuple's probability means it's probability to appear once? This needs to be pointed out?} is hard in the general case. However, allowing for compressed versions of the polynomial paves the way for an approximation algorithm that performs in linear time with $\epsilon/\delta$ guarantees. Also, while implied in the preceeding, in this work, the input size to the approximation algorithm is considered to be the query polynomial as opposed to the input database.
As implied above, we define hard to be anything greater than linear in the number of monomials when the polynomial is in SOP form. In this work, we show that computing the expectation over the output polynomial, even for the query class $CQ$, which here allows only projections and joins, over a bag $\ti$ where all tuples have probability $\prob$, is hard in the general case. However, allowing for compressed versions of the polynomial paves the way for an approximation algorithm that performs in linear time with $\epsilon/\delta$ guarantees. Also, while implied in the preceding, in this work the input size to the approximation algorithm is considered to be the query polynomial as opposed to the input database.
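One standard way to read such an $\epsilon/\delta$ guarantee, stated here only for intuition and in additive form (the precise form of the bound is spelled out later in the paper), is that the algorithm's output $\hat{r}$ satisfies
\[
\Pr\left[\,\left|\hat{r} - \expct\pbox{\poly(\vct{X})}\right| > \epsilon\,\right] \leq \delta .
\]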
\AH{The para below I think should be incorporated in the para above.}
The richness of the problem we explore gives us a lower and upper bound in the compressed form of the polynomial, and its size in SOP form, [compressed, SOP]. \BG{This sentence is incomplete?:} In approximating the expectation, an expression tree to model the query output polynomial, which indeed facilitates polyomials in compressed form.
The richness of the problem we explore gives us a lower and an upper bound on the size of the polynomial, namely its compressed form and its SOP form, specifically the range [compressed, SOP]. In approximating the expectation, we use an expression tree to model the query output polynomial, which indeed accommodates polynomials in compressed form.
\paragraph{Problem Definition/Known Results/Our Results/Our Techniques}
This work addresses the problem of performing computations over the output query polynomial efficiently. We specifically focus on computing the
expectation over the polynomial that results from a query over a bag PDB, a problem which, to the best of our knowledge, has not
been studied much. Our results show that the problem is hard (superlinear) in the general case, via a reduction to known hardness results
in graph theory. Further, we introduce a linear time approximation algorithm with guaranteed confidence bounds. We then prove the
claimed runtime and confidence bounds. The algorithm accepts an expression tree which models the output polynomial, samples uniformly from the
expression tree, and then outputs an approximation within the claimed bounds in the claimed runtime.
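To give a flavor of how such sampling can exploit the compressed representation, the following is a rough, self-contained Python sketch (our own illustration, not the algorithm analyzed in this paper): it draws a monomial of the expanded SOP form with probability proportional to its coefficient (uniform when all coefficients are one) by descending the expression tree, and averages $\prob^{d}$ over the samples, where $d$ is the number of distinct variables in the sampled monomial. The tuple-based tree encoding and all function names are hypothetical, and constants are assumed to be nonnegative, as they are for $\mathbb{N}$-valued annotations.
\begin{verbatim}
import random

# Expression trees are nested tuples, e.g.
#   ("+", ("*", "x1", "x2"), ("*", 2, "x1", "x3"))
# Leaves are variable names (str) or natural-number constants (int).

def weight(node):
    """Sum of the coefficients of all monomials in the expansion of node."""
    if isinstance(node, str):        # a variable contributes coefficient 1
        return 1
    if isinstance(node, int):        # a constant is its own coefficient
        return node
    op, *children = node
    if op == "+":
        return sum(weight(c) for c in children)
    if op == "*":
        w = 1
        for c in children:
            w *= weight(c)
        return w
    raise ValueError("unknown operator: " + repr(op))

def sample_monomial(node):
    """Draw one monomial (as its set of variables) with probability
    proportional to its coefficient in the expanded SOP form."""
    if isinstance(node, str):
        return {node}
    if isinstance(node, int):
        return set()
    op, *children = node
    if op == "+":                    # pick one branch, weighted by its mass
        chosen = random.choices(children,
                                weights=[weight(c) for c in children])[0]
        return sample_monomial(chosen)
    if op == "*":                    # combine one monomial from every branch
        variables = set()
        for c in children:
            variables |= sample_monomial(c)
        return variables
    raise ValueError("unknown operator: " + repr(op))

def estimate_rpoly(tree, p, n_samples=10000):
    """Monte Carlo estimate of rpoly(p, ..., p): average p^(#distinct vars)
    over sampled monomials, scaled by the total coefficient mass."""
    total = weight(tree)
    acc = sum(p ** len(sample_monomial(tree)) for _ in range(n_samples))
    return total * acc / n_samples

if __name__ == "__main__":
    # (x1 + x2) * (x1 + x3): four SOP monomials; x1*x1 collapses to x1
    # once exponents are flattened to 1, so the exact value at p = 0.5
    # is 0.5 + 3 * 0.25 = 1.25.
    tree = ("*", ("+", "x1", "x2"), ("+", "x1", "x3"))
    print(estimate_rpoly(tree, 0.5))   # converges to 1.25
\end{verbatim}
A real implementation would precompute the subtree weights once instead of recomputing them for every sample; doing so is part of what makes a linear time guarantee plausible.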
\paragraph{Interesting Mathematical Contributions}
This work shows an equivalence between the polynomial $\poly$ and $\rpoly$, where $\rpoly$ is the polynomial $\poly$ such that all
exponents $e > 1$ are set to $1$ across all variables over all monomials. The equivalence is realized when $\vct{X}$ is in $\{0, 1\}^\numvar$.
This setting then allows for yet another equivalence, where we prove that $\rpoly(\prob,\ldots, \prob)$ is indeed $\expct\pbox{\poly(\vct{X})}$.
This realization facilitates the building of an algorithm which approximates $\rpoly(\prob,\ldots, \prob)$ and in turn the expectation of
$\poly(\vct{X})$.
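As a small worked example of this equivalence (ours, for illustration, in the tuple-independent setting where each variable is $1$ with probability $\prob$): let $\poly(X, Y) = (X + Y)^2 = X^2 + 2XY + Y^2$, so that $\rpoly(X, Y) = X + 2XY + Y$. On $\{0, 1\}^2$ we have $X^2 = X$ and $Y^2 = Y$, hence $\poly$ and $\rpoly$ agree on every such point, and with $X$ and $Y$ independent,
\[
\expct\pbox{\poly(X, Y)} = \expct\pbox{\rpoly(X, Y)} = \expct\pbox{X} + 2\,\expct\pbox{X}\expct\pbox{Y} + \expct\pbox{Y} = \prob + 2\prob^2 + \prob = \rpoly(\prob, \prob).
\]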
Another interesting result in this work is the reduction of the computation of $\rpoly(\prob,\ldots, \prob)$ to counting the number of
3-paths, 3-matchings, and triangles in an arbitrary graph, a problem that is known to be superlinear in the general case and is therefore, by our definition,
hard. We show in Thm 2.1 that the exact computation of $\rpoly(\prob, \ldots, \prob)$ is indeed hard. We finally propose, and prove correct,
an approximation algorithm for $\rpoly(\prob,\ldots, \prob)$, a linear time algorithm with guaranteed $\epsilon/\delta$ bounds. The algorithm
leverages the efficiency of compressed polynomial input by taking in an expression tree of the output polynomial, which allows factorized
forms of the polynomial to be input and efficiently sampled from. One subtlety that comes up in the discussion of the algorithm is that the input
of the algorithm is the output polynomial of the query as opposed to the input DB of the query. This implies that our results are linear
in the size of the output polynomial rather than the input DB, where the former might be larger or smaller than the latter depending
on the query.
%%% Local Variables: