paper-BagRelationalPDBsAreHard/IntroOutline.txt

You don't need to obsess over the structure or the wording; just focus on the concepts; think only about the outline rather than specific phrasing.
Q: Where can I find the semantics as defined for RA+ set operations?
!Thinking about query results as a giant pile of monomials vs. compressed representations of polynomials (a factorized DB is one example of the latter)
(since our audience is PODS, we don't need to spend more than a line or two on this; it needs to come out as EARLY as possible; this isn't entirely out of the blue;
the landscape changes for bags IF you think of the annotation in terms of a polynomial rather than a giant pile of monomials-->better than linear in the number
of monomials; don't tie this to a specific structure but RATHER to the general flow of the text...)
You want to get into this twice.
Focus on the 1st paragraph;
once you're done, get it to them
spend a lot of time on our contributions
also include a table for sets/bags, data models, etc.
===================================================================================================
BEGIN: Introduction Outline
1st Paragraph
-------------
-Motivation (Reader must be convinced that this problem is interesting from a DB perspective)
-In practice PDBs are bags
-Thus, it is relevant and interesting to explore PDBs from a bag perspective
-in practice, production databases (Postgres, Oracle, etc.) use bags; modern PDBs are slow since they are dealing with sets rather than bags
-Brief overview of the challenges in the very beginning--1st paragraph, be very economical with words
--(efficient) COMPUTATION over the output polynomial, in our case, expectation
-focus on intuition
-what are the key points?
-one cannot generate a result in better than linear time in the size of the polynomial; sets are even harder than that
(#P-hard in the set case); as a result, if you assume that the result is given to you in SOP, then the naive method is optimal;
however, a factorized form of the polynomial allows for better results; runtime in the number of monomials vs. runtime in the
size of the polynomial is the same when the polynomial is given to you in DNF; however, they are not the same when given
compressed version(s) of the polynomial; this work looks into the case where the polynomial is NOT given to us in SOP;
Naive alg: generate all the monomials, and compute each of their probabilities (a worked micro-example follows below)
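A candidate worked micro-example for the naive algorithm (my notation; assume a TIDB where each X_i is an
independent Bernoulli variable with P[X_i = 1] = p_i and all coefficients are 1):

    E[X_1 X_2 + X_1 X_3] = E[X_1 X_2] + E[X_1 X_3]    (linearity of expectation)
                         = p_1 p_2 + p_1 p_3          (independence)

This is linear in the number of monomials once the SOP form is materialized; the question is what happens when
the input is instead the factorized form X_1(X_2 + X_3), which encodes the same polynomial with fewer symbols.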
-why do people think bags are easy?
-presumably because, in the SOP/DNF representation, linearity of expectation makes the computation trivial (cf. the 'oversimplification' bullet further down)
THIS COULD BE BROUGHT OUT MORE -how does this tie into how people approach implementing PDBs?
-for one, the customary rule of fixed data sizes on attributes has significantly influenced how folks implement PDBs, i.e., with polynomials in DNF
-how can we do better than the standard bar that most PDBs set?
-by accepting factorized polynomials as input
-don't worry about specifics
-Result--here is what we show (in gentle English and not too technical terms)
MAYBE VERIFY THIS IS EFFECTIVELY BROUGHT OUT -Computation over bags is hard, i.e., superlinear time, if we are given a compressed/factorized DB; on the other hand, when we use an approximation
algorithm, we get linear time
Can have another segue paragraph
After motivation up front, forget it all, and get into the nitty gritty
Typical Theory Paper Structure (Look at PODS papers you've read and see their structures):
-------------------------------
#Define the (mathematical) problem
-computing expectation over bag PDB query output polynomial
#Here are known results (you want to articulate why this problem (that you are addressing) is non-trivial)
-people have not really studied this
#Here are our results
-hard in the general case, via a reduction from counting the number of 3-paths, 3-matchings, and triangles in an arbitrary graph
#Here are the techniques for our results
-the algorithm uniformly samples monomials from the expression tree of the polynomial, approximating $\rpoly{Q}$ (a hedged sketch follows below)
-we perform an analysis of the approximation algorithm that proves linear time with confidence guarantees
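A minimal sketch of the sampling idea (NOT the paper's exact algorithm--the estimator, coefficient handling,
and confidence analysis are omitted, and all names here are illustrative). At a '+' node, descend into one
child with probability proportional to the number of monomials below it; at a '*' node, descend into all
children and concatenate; the result is a uniform sample over the monomials of the expanded polynomial:

    import random
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        op: str                       # '+', '*', or 'var'
        children: list = field(default_factory=list)
        name: str = ''                # variable name when op == 'var'

    def n_monomials(node):
        # number of monomials in the fully expanded (SOP) form of this subtree
        if node.op == 'var':
            return 1
        if node.op == '+':
            return sum(n_monomials(c) for c in node.children)
        out = 1                       # '*': the children's SOP sizes multiply
        for c in node.children:
            out *= n_monomials(c)
        return out

    def sample_monomial(node):
        # one monomial (a list of variable names), uniform over the SOP monomials
        if node.op == 'var':
            return [node.name]
        if node.op == '+':
            weights = [n_monomials(c) for c in node.children]
            return sample_monomial(random.choices(node.children, weights=weights)[0])
        out = []                      # '*': concatenate one sample per child
        for c in node.children:
            out += sample_monomial(c)
        return out

    # (x + y)*(u + v) expands to xu, xv, yu, yv; each is drawn w.p. 1/4
    tree = Node('*', [Node('+', [Node('var', name='x'), Node('var', name='y')]),
                      Node('+', [Node('var', name='u'), Node('var', name='v')])])
    print(sample_monomial(tree))

(Memoizing n_monomials makes repeated samples cheap; this sketch recomputes it for clarity.)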
2nd Paragraph:
-------------
-Somewhere we need to mention...
THIS WAS NOT INCLUDED IN THE ORIGINAL PASS OF INTRO-------v
-Interesting mathematically
-\tilde{Q} equivalence with \poly{Q} under \vct{X} \in \{0, 1\}^n
-what does this buy us, aside from being an interesting fact?
-\tilde{Q} is the expectation of Q under the above assumption and the additional assumption of independent variables in \vct{X}
(a candidate derivation follows after this block)
-which allows us to build an approximation alg for \tilde{Q} for the purpose of estimating E[\poly{Q}]
-I may need to think about this more.
-the computation of 3-paths, 3-matchings, and triangles via a linear system and approximation of \tilde{Q}
-Thm 2.1 shows that such a computation is hard (superlinear) without approximation
-what does this buy us practically speaking?
-hmm, are we just saying that we can compute approximations of hard problems in linear time? But that
should be a given, I would think?
-???
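A candidate one-line derivation to anchor the \tilde{Q} discussion (my phrasing; m ranges over the monomials
of \poly{Q}, with coefficient c_m):

    E[\poly{Q}(\vct{X})] = \sum_m c_m E[\prod_{i \in m} X_i]
                         = \sum_m c_m \prod_{i \in m} p_i
                         = \tilde{Q}(p_1, ..., p_n)

using linearity of expectation, independence of the X_i, and X_i^k = X_i for X_i \in \{0,1\} (exactly the
exponent-dropping that defines \tilde{Q}). The hardness direction then reads the counts of triangles, 3-paths,
and 3-matchings off exact values of \tilde{Q} via a linear system, per the bullet above.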
-Interesting results
-???
-Why bags are more interesting than previously thought
-the landscape for bags changes when we think of the annotation as a factorized polynomial rather than a giant pile of monomials
-Describe the problem
-Computing expectation over bag PDBs
-Discuss hardness results
-using PJ queries over TIDB with all p_i = p is hard in general--link with thm 2.1
------------->I think here we either decide that there is only one subtlety we want to bring out to the forefront
and/or forget about listing subtleties and just talk about them in some decent order
-list the subtleties that we want to make VERY CLEAR and will subsequently detail in the next paragraphs
BE CERTAIN THAT THIS IS EXPLICITLY STATED -#2 clarify how you are counting the input size (the size of the db instance vs. the size of the query polynomial--which might be significantly larger or smaller
PLACEHOLDER IN THE TEXT FOR THE EXAMPLE than the former); this may be a good place to have an example (a candidate sketch follows this block)
-better than linear time in the output
-since we take as input the polynomial encoding the query output
-by the fact that this output polynomial can be in factorized form
Q: -the above is the only subtlety that comes to mind currently
<-------------
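A candidate example for the placeholder above (hypothetical instance, my notation): take
\poly{Q} = (x_1 + ... + x_n)(y_1 + ... + y_n). The factorized form has 2n variable occurrences, while the SOP
expansion has n^2 monomials x_i y_j. "Linear in the output" is thus two very different bars depending on
whether the input is the factorized form (O(n)) or the SOP form (O(n^2)).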
DONE------------->ADD THE HISTORY
-Historical Overview
-why is this the way it is
-the customary fixed attribute data size rule
-existing work
-most systems (MayBMS, MystiQ, Orion, GProM, etc.) use an encoding of possible tuples that is essentially
an enumeration of the monomials
-this classical approach disallows doing anything clever
-those that use a factorized encoding assume sets, as is the case for Sprout
SKIPPED--->with new encodings, the bag problem is actually hard in a non-obvious way
- a common convention in DBs is a fixed bound on the size of a column, i.e., on the size of the data
-if you know how big the tuple is, there are a bunch of optimizations that you can do
-you want to avoid the situation where the field gets too big
-BUT, annotations break this, since a projection (or a join--less so, since there you end up with an annotation that is linear in the number of joins) can give
you an annotation of arbitrary size (growing with the size of the data); a minimal illustration follows at the end of this block
-therefore, every implemented PDB system (MystiQ, Sprout, etc.) really wants to avoid creating an arbitrarily sized annotation column
-take the provenance polynomial,
-flatten it into individual monomials
-store the individual monomials in a table
*PDB implementations have restricted themselves to a giant pile of monomials because of the fixed data requirement
-in the worst case, it takes time polynomial in the size of the input tables to materialize all monomials
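A minimal illustration of the blow-up (my example): if R has n tuples annotated x_1, ..., x_n, then the single
tuple in \pi_{\emptyset}(R) carries the annotation x_1 + x_2 + ... + x_n, one column value whose size is linear
in |R|. Flattening to one monomial per row restores fixed-width storage, which is exactly the "giant pile of
monomials" layout described above.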
*Orchestra (Val Tannen, Zach Ives--take the earliest journal papers (any journal--the SIGMOD Record paper perhaps?)) and
Factorized Databases (SIGMOD 2012? Olteanu) implement factorizations
-think about MayBMS
END HISTORY
CAN PERHAPS INCORPORATE THIS PART OF THE OUTLINE MORE STRONGLY IN THE BEGINNING PARAs
-Describe the subtlety of our scheme performing "better than linear in the output size"
-explicitly define our definition of 'hard' query in this setting
-hard is anything worse than linear time in the size of the SOP polynomial
-explain how the traditional 'set' setting for PDBs is an oversimplification
-note that most PDB implementations use DNF to model tuple polynomials, essentially an enumeration through the monomials
-thus computing the expectation (though trivial by linearity of expectation) is linear in the number of monomials
-limited results in complexity over PDBs in the bag setting
EEEEEEEEEEEEEEEEE-a good spot to discuss work in the bag setting, but as noted below, I need to do a Lit Survey to add any more to this
bullet point
-the richness of the problem: lower and upper bounds [factorized polynomial, SOP polynomial]
-discuss the 'expression tree' representation of the polynomial, emphasizing the ability to model the factorized form of a polynomial
-we can factorize the polynomial, which produces an output polynomial that is smaller than the equivalent one in DNF, and
this gives us time sublinear in the (larger) DNF size.
EEEEEEEEEEEEEEEEEEEEEEEE-link this to work in 'factorized dbs'
-again, I need to reread the Olteanu Factorized DB paper
-motivating example(s)
-don't have any clearly worked out running examples at this point in time
END: Introduction Outline
===============================================================================================
*What other subtleties would be good to explicitly bring out here?
-this is a good question, perhaps either would be beneficial
-reread what we have written so far
-think about any other subtleties that need to be brought out
-think about why this problem is interesting from a mathematical perspective, and the mathematical results that we have
*Somewhere we need to list all the ways the annotation polynomial can be compressed*
-my understanding was that the factorization is limited to Products of Sums
-Boris' comment on Element
-pushing projections down (a micro-example follows below)
-this is seen on p. 3, paragraph 2, of Factorisation of Provenance Polynomials
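A candidate micro-example of the projection-pushdown compression (hypothetical instance, my notation): for
Q = \pi_{\emptyset}(R \bowtie_A S), grouping the annotation by the join key a gives
\sum_a (\sum_{r \in R: r.A = a} x_r)(\sum_{s \in S: s.A = a} y_s),
i.e., a product of sums per key instead of one monomial x_r y_s per joined pair of tuples.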
EEEEEEEEEEEEEEEEEEBackground Work: What have people done with PDBs? What have people done with Bag-PDBs?
-need to do a Lit Survey before tackling this