171 lines
10 KiB
Plaintext
171 lines
10 KiB
Plaintext
You don't need to obsess over the structure or the wording; just focus on the concepts; think only about the outline rather than specific phrasing.
|
|
|
|
Q: Where can I find the semantics as defined for RA+ set operations
|
|
|
|
!Thinking about query results as (a giant pile of) monomials vs. (compressed representations of, e.g. factorized db is an example of) polynomials
|
|
(since our audience is PODS, we don't need to spend more that a line or two on this; needs to come out as EARLY as possible; this isn't entirely out of the blue;
|
|
the landscape changes for bags IF you think of the annotation in terms of a polynomial rather than a giant pile of monomials-->better than linear in the number
|
|
of monomials; don't tie this to specific structure but RATHER to the general flow of the text...
|
|
You want to get into this twice.
|
|
|
|
|
|
Focus on the 1st paragrah,
|
|
once you're done, get it to them
|
|
spend a lot of time on our contributions
|
|
|
|
also a table for sets/bags, data models, etc
|
|
===================================================================================================
|
|
BEGIN: Introduction Outline
|
|
|
|
1st Paragraph
|
|
-------------
|
|
-Motivation (Reader must be convinced that this problem is interesting from a DB perspective)
|
|
-In practice PDBs are bags
|
|
-Thus, it is relevant and interesting to explore PDBs from a bag perspective
|
|
-in practice, Production Databases (Postgres, Oracle, etc.) use bags; modern pdbs are slow since they are dealing with sets rather than bags
|
|
-Brief overview of the challenges in the very beginning--1st paragraph, be very economical with words
|
|
--COMPUTATIONS (efficiently) over the output polynomial, in our case, expectation
|
|
-focus on intuition
|
|
-what are the key points?
|
|
-one cannot generate a result better than linear time in the size of the polynomial; sets, are even harder than that--in
|
|
#P in sets; as a result if you assume that the result is given to you in SOP, then the naive method is the optimal;
|
|
however, factorized form of the polynomial allows for better results; runtime in the number of monomials vs. runtime in the
|
|
size of the polynomial is the same when the polynomial is given to you in DNF; however, they are not the same when given
|
|
compressed version(s) of the polynomial; this work looks into when the polynomal to NOT be given to us in SOP;
|
|
Naive alg: generate all the monomials, and compute each of their probabilities
|
|
-why do people think bags are easy?
|
|
-???
|
|
THIS COULD BE BROUGHT OUT MORE -how does this tie into how do people approach implementing pdbs?
|
|
-for one, the customary rule of fixed data size on attributes has significantly influenced how folks implement PDBs, i.e., with polynomials in DNF
|
|
-how can we do better than the standard bar that most pdb use
|
|
-by accepting factorized polynomials as input
|
|
-don't worry about specifics
|
|
-Result--here is what we show (in gentle English and not too technical terms)
|
|
MAYBE VERIFY THIS IS EFFECTIVELY BROUGHT OUT -Computation over bags is hard, i.e. superlinear time if we are given a compressed/factorized db; on the other hand, when we use an approximation
|
|
algorithm, we get linear time
|
|
|
|
Can have another segway paragraph
|
|
|
|
After motivation up front, forget it all, and get into the nitty gritty
|
|
Typical Theory Paper Structure (Look at PODS papers you've read and see their structures):
|
|
-------------------------------
|
|
#Define the (mathematical) problem
|
|
-computing expectation over bag PDB query output polynomial
|
|
#Here are known results (you want to articulate why this problem (that you are addressing) is non-trivial)
|
|
-people have not really studied this
|
|
#Here are our results
|
|
-hard in the general case via a reduction to computing the number of 3-paths, 3-matchings, and triangles in an arbitrary graph
|
|
#Here are the techniques for our results
|
|
-the algorithm uniformly samples monomials from expression tree of the polynomial, approximating $\rpoly{Q}$.
|
|
-we perform an analysis of the approximation algorithm that proves linear time with confidence guarantees
|
|
|
|
2nd Paragraph:
|
|
-------------
|
|
-Somewhere we need to mention...
|
|
THIS WAS NOT INCLUDED IN THE ORIGINAL PASS OF INTRO-------v
|
|
-Interesting mathematically
|
|
-\tilde{Q} equivalence with \poly{Q} under \vct{X} \in {0, 1}^n
|
|
-what does this buy us, aside from being an interesting fact?
|
|
-\tilde{Q} is the expectation of Q under the above assumption and the additional assumption of independent variables in \vct{X}
|
|
-which allows us to build an approximation alg of \tilde{Q} for the purpose of estimating E[\poly{Q}]
|
|
-I may need to think about this more.
|
|
|
|
-the computation of 3-paths, 3-matchings, and triangles via a linear system and approximation of \tilde{Q}
|
|
-Thm 2.1 shows that such a computation is hard (superlinear) without approximation
|
|
-what does this buy us practically speaking?
|
|
-eeee, are we just saying that we can compute approximations of hard problems in linear time, but, that
|
|
should be a given I would think?
|
|
-???
|
|
|
|
|
|
-Interesting results
|
|
-???
|
|
-Why bags are more interesting than previously thought
|
|
-the landscape for bags changes when we think of the annotation as a factorized polynomial rather than a giant pile of monomials
|
|
-Describe the problem
|
|
-Computing expectation over bag PDBs
|
|
-Discuss hardness results
|
|
-using PJ queries over TIDB with all p_i = p is hard in general--link with thm 2.1
|
|
|
|
------------->I think here we either decide that there is only one subtelty we want to bring out to the forefront
|
|
and/or forget about listing subtelties and just talk about them in some decent order
|
|
|
|
-list the subtleties that we want to make VERY CLEAR and will subsequently detail in the next paragraphs
|
|
BE CERTAIN THAT THIS IS EXPLICITLY STATED -#2 clarify how you are counting the input size (the size of the db instance vs. the size of the query polynomial--which might be significantly larger or smaller
|
|
PLACEHOLDER IN THE TEXT FOR THE EXAMPLE than the former); this may be a good place to have an example
|
|
-better than linear time in the output
|
|
-since we take as input the polynomial encoding the query output
|
|
-by the fact that this output polynomial can be in factorized form
|
|
Q: -the above is the only subtelty that comes to mind currently
|
|
<-------------
|
|
DONE------------->ADD THE HISTORY
|
|
-Historical Overview
|
|
-why is this the way it is
|
|
-the customary fixed attribute data size rule
|
|
-existing work
|
|
-most systems (MayBMS, MystiQ, Orion, GProM, etc) use an encoding of possible tuples that is essentially
|
|
an enumerating through the monomials
|
|
-this classical approach disallows doing anything clever
|
|
-those that use a factorized encoding assume sets, as is the case for Sprout
|
|
SKIPPED--->with new encodings, the bag problem is actually hard in a non-obvious way
|
|
- a common convention in DBs is a fixed size on the size of a column, the size of the data
|
|
-if you know how big the tuple is, there are a bunch of optimizations that you can do
|
|
-you want to avoid the situation where the field gets too big
|
|
-BUT, annotations break this, since a projection (or join--less so, since you end up with an annotation that is linear in the number of joins) can give
|
|
you an annotation that is of arbitrary size greater (in the size of the data).
|
|
-therefore, every implemented pdb system (mystique, sprout, etc) really want to avoid creating arbitrary sized annotation column
|
|
-take the provenance polynomial,
|
|
-flatten it into individual monomials
|
|
-store the individual monomials in a table
|
|
*PDB implementations have restricted themselves to a giant pile of monomials because of the fixed data requirement
|
|
-in the worst case, polynomial in the size of the input tables (the table sizes) to materialize all monomials
|
|
*Orchestra (Val Tannen, Zach Ives--take the earliest journal papers (any journal--SIGMOD Record paper perhaps?)),
|
|
Factorized Databases implement factorizations (SIGMOD 2012? Olteanu) cgpgrey (England/Northern Ireland)
|
|
-think about MayBMS
|
|
-
|
|
END HISTORY
|
|
|
|
|
|
|
|
CAN PERHAPS INCORPORATE THIS PART OF THE OUTLINE MORE STRONGLY IN THE BEGINNING PARAs
|
|
-Describe the subtelty of our scheme performing "better than linear in the output size"
|
|
-explicitly define our definition of 'hard' query in this setting
|
|
-hard is anything worse than linear time in the size of the SOP polynomial
|
|
-explain how the traditional 'set' setting for PDBs is an oversimplification
|
|
-note that most PDB implementations use DNF to model tuple polynomials, essentially an enumeration though the number of monomials
|
|
-thus computing the expectation (though trivial by linearity of expectation) is linear in the number of monomials
|
|
-limited results in complexity over PDBs in the bag setting
|
|
EEEEEEEEEEEEEEEEE-a good spot to discuss work in bags setting, but as noted below, I need to do a Lit Survey to add any more to this
|
|
bullet point
|
|
-the richness of the problem: lower, upper bound [factorized polynomial, sop polynomial]
|
|
-discuss the 'expression tree' representation of the polynomial, emphasizing the ability to model the factorized form of a polynomial
|
|
-we can factorize the polynomial, which produces an output polynomial which is smaller than the equivalent one in DNF, and
|
|
this gives us less than linear time.
|
|
EEEEEEEEEEEEEEEEEEEEEEEE-link this to work in 'factorized dbs'
|
|
-again, I need to reread the Olteanu Factorized DB paper
|
|
-motivating example(s)
|
|
-don't have any clearly worked out running examples at this point in time
|
|
END: Introduction Outline
|
|
===============================================================================================
|
|
|
|
|
|
|
|
|
|
|
|
|
|
*What other subtelties would be good to explicity bring out here?
|
|
-this is a good question, perhaps either would be beneficial
|
|
-reread what we have written so far
|
|
-think about any other subtelties that need to be brought out
|
|
-think about why this problem is interesting from a mathematical perspective, and the mathematical results that we have
|
|
|
|
|
|
|
|
*Somewhere we need to list all the ways the annotation polynomial can be compressed*
|
|
-my understanding was that the factorization is limited to Products of Sums
|
|
-Boris' comment on element
|
|
-pushing projections down
|
|
-this is seen on p. 3 paragraph 2 of Factorisation of Provenance Polynomials
|
|
|
|
EEEEEEEEEEEEEEEEEEBackground Work: What have people done with PDBs? What have people done with Bag-PDBs?
|
|
-need to do a Lit Survey before tackling this |