paper-BagRelationalPDBsAreHard/IntroOutline.txt

Don't obsess over the structure or the wording; just focus on the concepts, and think about the outline rather than specific phrasing.
Q: Where can I find the semantics as defined for RA+ set operations?
!Thinking about query results as (a giant pile of) monomials vs. as polynomials needs to come out as EARLY as possible; this isn't entirely out of the blue.
The landscape changes for bags IF you think of the annotation in terms of a polynomial rather than a giant pile of monomials --> better than linear in the number
of monomials. Don't tie this to specific structure but RATHER to the general flow of the text (a small worked example follows below).
We want to get into this twice.
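
A minimal illustration of the monomials-vs.-polynomial gap (my own toy example, not one from the paper):

    (X_1 + X_2)(Y_1 + Y_2)(Z_1 + Z_2) = X_1 Y_1 Z_1 + X_1 Y_1 Z_2 + ... + X_2 Y_2 Z_2

The factorized form on the left has 6 variable occurrences; the expanded sum-of-products on the right has 2^3 = 8 monomials. With n such factors the gap is 2n symbols vs. 2^n monomials, so any algorithm whose cost is charged per monomial is already exponentially worse than one that works directly on the factorized form.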
===================================================================================================
BEGIN: Introduction Outline
-Motivation (Reader must be convinced that this problem is interesting from a DB perspective)
-In practice PDBs are bags
-Thus, it is relevant and interesting to explore PDBs from a bag perspective
-Brief overview of the challenges in the very beginning--1st paragraph, be very economical with words
-focus on intuition
-what are the key points?
-we don't require the polynomial to be given to us in DNF
-why do people think bags are easy?
-???
-how does this tie into how people approach implementing PDBs?
-for one, the customary rule of fixed data size on attributes has significantly influenced how folks implement PDBs, i.e., with polynomials in DNF
-how can we do better than the standard bar that most PDB implementations use?
-by accepting factorized polynomials as input
-don't worry about specifics
-Interesting mathematically
-\tilde{Q} equivalence with \poly{Q} under \vct{X} \in \{0, 1\}^n
-what does this buy us, aside from being an interesting fact?
-\tilde{Q} (evaluated at the probabilities \vct{p}) is the expectation of \poly{Q} under the above assumption and the additional assumption of independent variables in \vct{X}
-which allows us to build an approximation algorithm for \tilde{Q} for the purpose of estimating E[\poly{Q}] (see the sketch after this list)
-I may need to think about this more.
-the computation of 3-paths, 3-matchings, and triangles via a linear system and approximation of \tilde{Q}
-Thm 2.1 shows that such a computation is hard (superlinear) without approximation
-what does this buy us practically speaking?
-are we just saying that we can compute approximations of hard problems in linear time? But I would think that should be a given?
-???
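
A sketch of why \tilde{Q} captures the expectation (as I understand the argument; the paper's formal statement may differ in details):

    For \vct{X} \in \{0,1\}^n we have X_i^k = X_i, so collapsing every exponent in \poly{Q}
    to 1 gives \tilde{Q} with \tilde{Q}(\vct{X}) = \poly{Q}(\vct{X}) on all of \{0,1\}^n.
    If the X_i are independent with E[X_i] = p_i, then for a monomial over distinct variables
        E[c \cdot X_{i_1} \cdots X_{i_k}] = c \cdot p_{i_1} \cdots p_{i_k},
    and by linearity of expectation
        E[\poly{Q}(\vct{X})] = E[\tilde{Q}(\vct{X})] = \tilde{Q}(p_1, ..., p_n).
    So an (approximation) algorithm that evaluates \tilde{Q} at \vct{p} is an (approximation)
    algorithm for E[\poly{Q}].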
-Interesting results
-???
-Why bags are more interesting than previously thought
-the landscape for bags changes when we think of the annotation as a factorized polynomial rather than a giant pile of monomials
-Describe the problem
-Computing expectation over bag PDBs
-Discuss hardness results
-computing the expectation for PJ queries over a TIDB with all p_i = p is hard in general--link with Thm 2.1
------------->I think here we either decide that there is only one subtlety we want to bring to the forefront
and/or forget about listing subtleties and just talk about them in some decent order
-list the subtleties that we want to make VERY CLEAR and will subsequently detail in the next paragraphs
-better than linear time in the size of the output
-since we take as input the polynomial encoding the query output
-and this output polynomial can be in factorized form
Q: -the above is the only subtlety that comes to mind currently
<-------------
-Historical Overview
-why is this the way it is
-the customary fixed attribute data size rule
-existing work
- a common convention in DBs is a fixed bound on the size of a column, i.e., on the size of the data in an attribute
-if you know how big the tuple is, there are a bunch of optimizations that you can do
-you want to avoid the situation where the field gets too big
-BUT, annotations break this, since a projection (or a join--less so, since a join only grows the annotation linearly in the number of joins) can give
you an annotation of arbitrary size (growing with the size of the data).
-therefore, every implemented PDB system (MystiQ, Sprout, etc.) really wants to avoid creating an arbitrarily sized annotation column
-take the provenance polynomial,
-flatten it into individual monomials
-store the individual monomials in a table
*PDB implementations have restricted themselves to a giant pile of monomials because of the fixed data size requirement
-in the worst case it takes time (and space) polynomial in the sizes of the input tables to materialize all monomials (a tiny illustrative sketch appears at the end of this section)
*Orchestra (Val Tannen, Zach Ives--take the earliest journal papers (any journal--SIGMOD Record paper perhaps?)) and
Factorized Databases (SIGMOD 2012? Olteanu) implement factorizations
-think about MayBMS
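
A tiny, purely hypothetical sketch (not any system's actual code; the variable names are my own) of what flattening a factorized annotation into a monomial table amounts to, and why its size blows up:

    # Expand a factorized annotation (a product of sums) into one row per monomial,
    # which is what a fixed-width annotation column effectively forces you to store.
    from itertools import product

    # (x1 + x2) * (y1 + y2) * (z1 + z2), each inner list being one sum of variables
    factorized = [["x1", "x2"], ["y1", "y2"], ["z1", "z2"]]

    # Flattening = distributing the product over the sums
    monomial_table = [tuple(choice) for choice in product(*factorized)]

    print(len(monomial_table))  # 8 monomial rows from 6 variable occurrences;
                                # with n binary factors this is 2**n rows vs. 2*n symbols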
-Describe the subtlety of our scheme performing "better than linear in the output size"
-explicitly define our definition of 'hard' query in this setting
-hard is anything worse than linear time in the size of the SOP polynomial
-explain how the traditional 'set' setting for PDBs is an oversimplification
-note that most PDB implementations use DNF to model tuple polynomials, essentially an enumeration of the monomials
-thus computing the expectation (though trivial by linearity of expectation) is linear in the number of monomials
-limited results in complexity over PDBs in the bag setting
TODO: a good spot to discuss work in the bag setting, but as noted below, I need to do a Lit Survey before adding any more to this bullet point
-the richness of the problem: lower and upper bounds [factorized polynomial, SOP polynomial]
-discuss the 'expression tree' representation of the polynomial, emphasizing the ability to model the factorized form of a polynomial (see the sketch after this list)
-we can factorize the polynomial, which produces an output polynomial that is smaller than the equivalent one in DNF, and
this gives us running time that is less than linear in the size of the DNF.
TODO: link this to work in 'factorized DBs'
-again, I need to reread the Olteanu Factorized DB paper
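
A minimal sketch of what an expression-tree encoding could look like (my own illustration with my own names Var/Add/Mul/evaluate; the paper's formal definition may differ). It also exposes the subtlety: naively plugging probabilities into the tree does NOT give the expectation when variables repeat:

    from dataclasses import dataclass
    from typing import Union

    @dataclass
    class Var:
        name: str

    @dataclass
    class Add:
        left: "Expr"
        right: "Expr"

    @dataclass
    class Mul:
        left: "Expr"
        right: "Expr"

    Expr = Union[Var, Add, Mul]

    def evaluate(e: Expr, val: dict) -> float:
        # One pass over the tree: time linear in the tree size, which can be far
        # smaller than the number of monomials of the equivalent SOP/DNF polynomial.
        if isinstance(e, Var):
            return val[e.name]
        if isinstance(e, Add):
            return evaluate(e.left, val) + evaluate(e.right, val)
        return evaluate(e.left, val) * evaluate(e.right, val)

    # (x1 + x2) * (x1 + x3): a 7-node tree whose expansion has 4 monomials, incl. x1^2
    tree = Mul(Add(Var("x1"), Var("x2")), Add(Var("x1"), Var("x3")))
    print(evaluate(tree, {"x1": 0.5, "x2": 0.5, "x3": 0.5}))  # 1.0
    # NOTE: 1.0 is \poly{Q} evaluated at p, not the expectation \tilde{Q}(p) = 1.25,
    # because E[x1^2] = p1 (not p1^2) for 0/1-valued x1 -- exactly the subtlety that
    # makes computing expected multiplicities over factorized input nontrivial.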
-motivating example(s)
-don't have any clearly worked out running examples at this point in time
END: Introduction Outline
===============================================================================================
*What other subtleties would be good to explicitly bring out here?
-this is a good question; either of the following would probably help:
-reread what we have written so far
-think about any other subtelties that need to be brought out
-think about why this problem is interesting from a mathematical perspective, and the mathematical results that we have
*Somewhere we need to list all the ways the annotation polynomial can be compressed*
-my understanding was that the factorization is limited to Products of Sums
-Boris' comment on Element
-pushing projections down (a small illustration follows below)
-this is seen on p. 3 paragraph 2 of Factorisation of Provenance Polynomials
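
A generic illustration of how pushing a projection below a join yields a product-of-sums factorization (my own toy example, not necessarily the one from that paper):

    For the result tuple a of \pi_A(R(A,B) \bowtie S(B,C)), expanding the join first gives the flat annotation
        \sum_{b} \sum_{c} r_{a,b} \cdot s_{b,c}
    with up to |R| \cdot |S| monomials, while pushing the projection down (summing C out of S before the join) groups it as
        \sum_{b} r_{a,b} \cdot ( \sum_{c} s_{b,c} ),
    a product-of-sums with only |R| + |S| variable occurrences; both expressions denote the same polynomial.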
TODO: Background Work: What have people done with PDBs? What have people done with Bag-PDBs?
-need to do a Lit Survey before tackling this