First draft of Introduction.

master
Aaron Huber 2020-11-16 12:10:35 -05:00
parent e2bd3111f0
commit 9f2a1cc70c
4 changed files with 59 additions and 22 deletions

View File

@@ -2,22 +2,36 @@ You don't need to obsess over the structure or the wording; just focus on the co
Q: Where can I find the semantics as defined for RA+ set operations?
!Thinking about query results as (a giant pile of) monomials vs. (compressed representations of, e.g., a factorized db) polynomials
needs to come out as EARLY as possible; this isn't entirely out of the blue
(since our audience is PODS, we don't need to spend more than a line or two on this);
the landscape changes for bags IF you think of the annotation in terms of a polynomial rather than a giant pile of monomials-->better than linear in the number
of monomials; don't tie this to specific structure but RATHER to the general flow of the text... (toy example sketched below)
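(toy example we could use--notation ours, assuming each variable annotates one input tuple:
(X_1 + X_2)(Y_1 + Y_2) = X_1Y_1 + X_1Y_2 + X_2Y_1 + X_2Y_2;
the factorized form has 4 variable occurrences while the SOP form has 4 monomials, and for (X_1 + ... + X_n)(Y_1 + ... + Y_n) it is 2n occurrences vs. n^2 monomials,
so computing over the compressed form can beat linear in the number of monomials)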
You want to get into this twice.
Focus on the 1st paragraph,
once you're done, get it to them
spend a lot of time on our contributions
also a table for sets/bags, data models, etc. (rough sketch below)
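(rough sketch of such a table; claims pulled from the notes/draft below, cells marked ? still TBD:
  semantics | polynomial representation     | exact expectation     | approximation
  set       | SOP/lineage                   | \#P-hard in general   | ?
  bag       | SOP (giant pile of monomials) | linear in # monomials | trivial (exact is already linear)
  bag       | factorized/compressed         | superlinear (hard)    | linear time, eps/delta guarantees)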
===================================================================================================
BEGIN: Introduction Outline
1st Paragraph
-------------
-Motivation (Reader must be convinced that this problem is interesting from a DB perspective)
-In practice PDBs are bags
-Thus, it is relevant and interesting to explore PDBs from a bag perspective
-in practice, production databases (Postgres, Oracle, etc.) use bags; modern PDBs are slow since they deal with sets rather than bags
-Brief overview of the challenges in the very beginning--1st paragraph, be very economical with words
-focus on intuition
-what are the key points?
-we don't require the polynomial to be given to us in DNF
-one cannot generate a result in better than linear time in the size of the polynomial; sets are even harder than that:
\#P-hard in the set setting; as a result, if you assume that the result is given to you in SOP, then the naive method is optimal;
however, a factorized form of the polynomial allows for better results; runtime in the number of monomials vs. runtime in the
size of the polynomial are the same when the polynomial is given to you in DNF; however, they are not the same when given
compressed version(s) of the polynomial; this work looks into the case where the polynomial is NOT given to us in SOP;
Naive alg: generate all the monomials, and compute each of their probabilities (worked form sketched after this list)
-why do people think bags are easy?
-???
-how does this tie into how do people approach implementing pdbs?
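(worked form of the naive alg--notation assumed: for a TIDB where X_i = 1 with probability p_i and \poly(\vct{X}) = \sum_m c_m \prod_{X_i \in m} X_i in SOP form,
  \E[\poly(\vct{X})] = \sum_m c_m \E[\prod_{X_i \in m} X_i] = \sum_m c_m \prod_{X_i \in m} p_i
by linearity plus tuple independence, using X_i^k = X_i for X_i \in \{0,1\};
this is linear in the number of monomials, which is why the naive method is optimal for SOP input)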
@@ -25,6 +39,25 @@ BEGIN: Introduction Outline
-how can we do better than the standard bar that most pdb use
-by accepting factorized polynomials as input
-don't worry about specifics
-Result--here is what we show (in gentle English and not too technical terms)
-Computation over bags is hard, i.e., superlinear time, if we are given a compressed/factorized db; on the other hand, when we use an approximation
algorithm, we get linear time
Can have another segue paragraph
After motivation up front, forget it all, and get into the nitty gritty
Typical Theory Paper Structure (Look at PODS papers you've read and see their structures):
-------------------------------
#Define the (mathematical) problem
#Here are known results (you want to articulate why this problem (that you are addressing) is non-trivial)
-people have not really studied this
#Here are our results
#Here are the techniques for our results
2nd Paragraph:
-------------
-Somewhere we need to mention...
-Interesting mathematically
-\tilde{Q} equivalence with \poly{Q} under \vct{X} \in \{0, 1\}^n (concrete instance sketched after this list)
-what does this buy us, aside from being an interesting fact?
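(concrete instance--notation per the draft: for \poly{Q} = X_1^2X_2 + 3X_1X_2^2, dropping exponents gives \tilde{Q} = X_1X_2 + 3X_1X_2 = 4X_1X_2;
the two agree on all \vct{X} \in \{0, 1\}^n since x^k = x on \{0, 1\}, so \E[\poly{Q}] = \E[\tilde{Q}] = 4p_1p_2 falls out by linearity--
i.e., we may always work with the reduced polynomial)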
@@ -38,6 +71,8 @@ BEGIN: Introduction Outline
-eeee, are we just saying that we can compute approximations of hard problems in linear time, but, that
should be a given I would think?
-???
-Interesting results
-???
-Why bags are more interesting than previously thought
@@ -51,6 +86,8 @@ BEGIN: Introduction Outline
and/or forget about listing subtleties and just talk about them in some decent order
-list the subtleties that we want to make VERY CLEAR and will subsequently detail in the next paragraphs
-#2 clarify how you are counting the input size (the size of the db instance vs. the size of the query polynomial--which might be significantly larger or smaller
than the former); this may be a good place to have an example (one sketched below)
-better than linear time in the output
-since we take as input the polynomial encoding the query output
-by the fact that this output polynomial can be in factorized form
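(candidate example--numbers ours: the k-fold cross product of relations of size n has output polynomial \prod_{j=1}^{k}(X_{j,1} + ... + X_{j,n})
with n^k monomials in SOP form--much larger than the instance--while the factorized form has only kn variable occurrences)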

intro.tex Normal file
View File

@@ -0,0 +1,19 @@
%root: main.tex
\section{Introduction}
In practice, modern production databases, e.g., Postgres, Oracle, etc., use bag semantics. In contrast, most implementations of modern probabilistic databases (PDBs) are built in the setting of set semantics, and this contributes to slow computation time. In the set semantics setting, computing the expectation of the output polynomial of a result tuple is \#P-hard for an arbitrary query. In contrast, in the bag setting, one cannot generate a result in better than linear time in the size of the polynomial.
There has been limited work on PDBs under bag semantics. When considering PDBs in the bag setting, a subtlety arises that is easily overlooked due to the \textit{oversimplification} of PDBs in the set setting. For almost all modern PDB implementations, an output polynomial is only ever considered in its expanded SOP form. Any computation over the polynomial then cannot hope to run in less than linear time in the number of monomials, which can be exponential in the size of the input in the general case.
\AH{New para maybe? Introduce the subtlety here between a giant pile of monomials and a compressed version of the polynomial}
It turns out that, should an implementation allow for a compressed form of the polynomial, computations over the polynomial can be done in better than linear runtime in the number of monomials of the expanded sum-of-products (SOP) form. While runtime in the number of monomials and runtime in the size of the polynomial coincide when the polynomial is given in SOP form, they are not the same when we allow compressed versions of the polynomial as input to the desired computation. The naive algorithm for computing the expectation of a polynomial is to generate all the monomials and compute each of their probabilities; factorized polynomials in the bag setting, however, allow us to compute the expectation in time linear in the size of the compressed representation.
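\AH{Illustrative example; the polynomial and independence assumptions here are placeholders.} For instance, for a tuple-independent PDB where each variable $X_i$ is $1$ with probability $p_i$, and a factorized polynomial whose factors share no variables,
\[ \mathbb{E}\left[(X_1+X_2)(X_3+X_4)\right] = \mathbb{E}\left[X_1+X_2\right]\cdot\mathbb{E}\left[X_3+X_4\right] = (p_1+p_2)(p_3+p_4), \]
by independence and linearity of expectation: four reads of the compressed form, versus expanding to the four monomials $X_1X_3 + X_1X_4 + X_2X_3 + X_2X_4$ (and $n^2$ monomials for the $n$-ary analogue).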
\AH{Perhaps a new para here? Above para needs to gently but explicitly highlight the differences in traditional logical implementation and our approach.}
As implied above, we define hard to be anything greater than linear in the number of monomials when the polynomial is in SOP form. In this work we show that computing the expectation over the output polynomial of even a query class $PJ$ over a bag $\ti$, where all tuples have probability $\prob$, is hard in the general case. However, allowing for compressed versions of the polynomial paves the way for an approximation algorithm that runs in linear time with $\epsilon/\delta$ guarantees. Also, as implied in the preceding, in this work the input size to the approximation algorithm is considered to be the query polynomial as opposed to the input database.
\AH{The para below I think should be incorporated in the para above.}
The richness of the problem we explore gives us lower and upper bounds ranging between the size of the compressed form of the polynomial and its size in SOP form. In approximating the expectation, we use an expression tree to model the query output polynomial, which naturally facilitates polynomials in compressed form.
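\AH{Rough sketch of the expression tree model; the grammar and notation here are provisional.} An expression tree $T$ is generated by
\[ T \;::=\; c \;\mid\; X_i \;\mid\; T_1 + T_2 \;\mid\; T_1 \times T_2, \]
with constants and variables at the leaves and $+$ or $\times$ at internal nodes; e.g., $(X_1+X_2)\times(X_3+X_4)$ is a $\times$ node over two $+$ nodes, so the tree size tracks the compressed form rather than the SOP expansion.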

View File

@@ -157,6 +157,7 @@ sensitive=true
%\input{pos}
%\input{sop}
%\input{davidscheme}
\input{intro}
\input{ra-to-poly}
\input{poly-form}
\input{approx_alg}

View File

@@ -3,26 +3,6 @@
\onecolumn
\section{Query translation into polynomials}
\begin{tikzpicture}[every node/.style={circle, draw=black, fill=black, text=white}]
\node{1}
child { node{2}
child { node{4}}
child { node{5}}
}
child{ node{3} };
\node at (5, 5) [fill,circle,inner sep=0pt,minimum size=3pt] (top) {};
\node[fill,circle,inner sep=0pt,minimum size=3pt, below=0.5cm of top] (bottom){};
\draw (top)--(bottom);
% \draw[red, very thick] (0, 0) rectangle (4, 4);
% \draw[blue, very thin] (0, 0) circle (2cm);
% \draw (0, 0) ellipse (4cm and 2cm);
% \draw (4, 4) arc (0:180:4);
% \fill[olive] (0, 0) rectangle (3, 3);
% \fill[teal] (-2, 2) circle (1cm);
% \fill[blue] (1, 1) circle(2pt);
% \fill[blue!50!red](3, 3) circle(.2cm);
% \fill[blue!50] (4, 4) circle (3pt);
\end{tikzpicture}
%\AH{This section will involve the set of queries (RA+) that we are interested in, the probabilistic/incomplete models we address, and the outer aggregate functions we perform over the output \textit{annotation}
%1) RA notation
%2) DB (TIDB) notation