Added the IntroOutline to the Repo
parent d7c0677955
commit 15e06595e7
|
@ -0,0 +1,114 @@
|
|||
You don't need to obsess over the structure or the wording; just focus on the concepts; think only about the outline rather than specific phrasing.
|
||||
|
||||
Q: Where can I find the semantics as defined for RA+ set operations?
|
||||
|
||||
!Thinking about query results as (a giant pile of) monomials vs. polynomials needs to come out as EARLY as possible; this isn't entirely out of the blue;
|
||||
the landscape changes for bags IF you think of the annotation in terms of a polynomial rather than a giant pile of monomials-->better than linear in the number
|
||||
of monomials; don't tie this to specific structure but RATHER to the general flow of the text...
|
||||
You want to get into this twice.
|
||||
|
||||
-------->Below needs to be incorporated into the main outline
|
||||
1)Brief overview of the challenges in the very beginning--1st paragraph, be very economical with words
|
||||
-focus on intuition
|
||||
-what are the key points?
|
||||
-we don't require the polynomial to be given to us in DNF
|
||||
-why do people think bags are easy?
|
||||
-???
|
||||
-how does this tie into how do people approach implementing pdbs?
|
||||
-for one, the customary rule of fixed data size on attributes has significantly influenced how folks implement PDBs, i.e., with polynomials in DNF
|
||||
-how can we do better than the standard bar that most pdb use
|
||||
-by accepting factorized polynomials as input
|
||||
-don't worry about specifics
|
||||
2)Historical overview later on
|
||||
-why is this the way it is
|
||||
-the customary fixed attribute data size rule
|
||||
<----------
|
||||
|
||||
Brief overview of the Introduction
|
||||
-Motivation (Reader must be convinced that this problem is interesting from a DB perspective)
|
||||
-In practice PDBs are bags
|
||||
-Thus, it is relevant and interesting to explore PDBs from a bag perspective
|
||||
-Interesting mathematically
|
||||
-\tilde{Q} equivalence with \poly{Q} under \vct{X} \in {0, 1}^n
|
||||
-what does this buy us, aside from being an interesting fact?
|
||||
-\tilde{Q} is the expectation of Q under the above assumption and the additional assumption of independent variables in \vct{X}
|
||||
-which allows us to build an approximation alg of \tilde{Q} for the purpose of estimating E[\poly{Q}]
|
||||
-I may need to think about this more.
|
||||
|
||||
-the computation of 3-paths, 3-matchings, and triangles via a linear system and approximation of \tilde{Q}
|
||||
-Thm 2.1 shows that such a computation is hard (superlinear) without approximation
|
||||
-what does this buy us practically speaking?
|
||||
-eeee, are we just saying that we can compute approximations of hard problems in linear time? That should be a given, I would think.
|
||||
-???
|
||||
-Interesting results
|
||||
-???
|
||||
-Why bags are more interesting than previously thought
|
||||
-???
|
||||
-Describe the problem
|
||||
-Computing expectation over bag PDBs
|
||||
-Discuss hardness results
|
||||
-using PJ queries over TIDB with all p_i = p is hard in general--link with thm 2.1
|
||||
|
||||
------------->I think here we either decide that there is only one subtlety and/or forget about listing subtleties and just talk about them in some decent order
|
||||
|
||||
-list the subtleties that we want to make VERY CLEAR and will subsequently detail in the next paragraphs
|
||||
-better than linear time in the output
|
||||
-since we take as input the polynomial encoding the query output
|
||||
-by the fact that this output polynomial can be in factorized form
|
||||
Q: -the above is the only subtlety that comes to mind currently
|
||||
<-------------
|
||||
-perhaps make comparisons to previous work such as Olteanu Anytime work, Factorized DB work, etc.
|
||||
|
||||
------------->We need to merge the existing work bullets
|
||||
-existing work
|
||||
-a common convention in DBs is a fixed bound on the size of a column, i.e., the size of the data
|
||||
-if you know how big the tuple is, there are a bunch of optimizations that you can do
|
||||
-you want to avoid the situation where the field gets too big
|
||||
-BUT, annotations break this, since a projection (or join--less so, since you end up with an annotation that is linear in the number of joins) can give you an annotation of arbitrary size (in the size of the data).
|
||||
-therefore, every implemented pdb system (MystiQ, SPROUT, etc.) really wants to avoid creating an arbitrarily sized annotation column
|
||||
-take the provenance polynomial,
|
||||
-flatten it into individual monomials
|
||||
-store the individual monomials in a table
|
||||
*PDB implementations have restricted themselves to a giant pile of monomials because of the fixed data requirement
|
||||
-in the worst case, polynomial in the size of the input tables (the table sizes) to materialize all monomials
|
||||
*Orchestra (Val Tannen, Zach Ives--take the earliest journal papers (any journal--SIGMOD Record paper perhaps?)),
|
||||
Factorized Databases implement factorizations (SIGMOD 2012? Olteanu) cgpgrey (England/Northern Ireland)
|
||||
-think about MayBMS
|
||||
-
|
||||
<--------------
|
||||
|
||||
Describe the subtlety of our scheme performing "better than linear in the output size"
|
||||
-explicitly define our definition of 'hard' query in this setting
|
||||
-hard is anything worse than linear time in the size of the SOP polynomial
|
||||
-explain how the traditional 'set' setting for PDBs is an oversimplification
|
||||
-note that most PDB implementations use DNF to model tuple polynomials, essentially an enumeration of the monomials
|
||||
-thus computing the expectation (though trivial by linearity of expectation) is linear in the number of monomials
|
||||
-limited results in complexity over PDBs in the bag setting
|
||||
EEEEEEEEEEEEEEEEE-a good spot to discuss work in bags setting, but as noted below, I need to do a Lit Survey to add any more to this
|
||||
bullet point
|
||||
-the richness of the problem: lower, upper bound [factorized polynomial, sop polynomial]
|
||||
-we can factorize the polynomial, which produces an output polynomial that is smaller than the equivalent one in DNF, and this gives us better than linear time in the size of the DNF.
|
||||
EEEEEEEEEEEEEEEEEEEEEEEE-link this to work in 'factorized dbs'
|
||||
-again, I need to reread the Olteanu Factorized DB paper
|
||||
-motivating example(s)
|
||||
-don't have any clearly worked out running examples at this point in time
|
||||
|
||||
*What other subtleties would be good to explicitly bring out here?
|
||||
-this is a good question, perhaps either would be beneficial
|
||||
-reread what we have written so far
|
||||
-think about any other subtleties that need to be brought out
|
||||
-think about why this problem is interesting from a mathematical perspective, and the mathematical results that we have
|
||||
|
||||
|
||||
|
||||
*Somewhere we need to list all the ways the annotation polynomial can be compressed*
|
||||
-my understanding was that the factorization is limited to Products of Sums
|
||||
-Boris' comment on element
|
||||
-pushing projections down
|
||||
-this is seen on p. 3 paragraph 2 of Factorisation of Provenance Polynomials
|
||||
|
||||
EEEEEEEEEEEEEEEEEEBackground Work: What have people done with PDBs? What have people done with Bag-PDBs?
|
||||
-need to do a Lit Survey before tackling this
|
|
@ -95,7 +95,7 @@ Finally, observe \cref{p1-s5} by construction in \cref{lem:pre-poly-rpoly}, that
|
|||
\end{proof}
|
||||
|
||||
\begin{Corollary}\label{cor:expct-sop}
|
||||
If $\poly$ is given as a sum of monomials, the expectation of $\poly$, i.e., $\ex{\poly}$ can be computed in $O(|\poly|)$, where $|\poly|$ denotes the total number of multiplication/addition operators.
|
||||
If $\poly$ is given as a sum of monomials, the expectation of $\poly$, i.e., $\ex{\poly} = \rpoly{Q}\left(\prob_1,\ldots, \prob_\numvar\right)$ can be computed in $O(|\poly|)$, where $|\poly|$ denotes the total number of multiplication/addition operators.
|
||||
\end{Corollary}
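As an illustrative sketch (the polynomial below is a hypothetical example, not one taken from the paper), assume the variables are independent and $\{0,1\}$-valued with $\ex{X_i} = \prob_i$, so that $X_i^2 = X_i$. For $\poly = 2X_1^2X_2 + X_3$, linearity of expectation handles one monomial at a time:
\[\ex{\poly} = 2\,\ex{X_1^2X_2} + \ex{X_3} = 2\,\ex{X_1}\ex{X_2} + \ex{X_3} = 2\prob_1\prob_2 + \prob_3,\]
which amounts to a single pass over the monomials of $\poly$.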
|
||||
|
||||
\begin{proof}[Proof For Corollary ~\ref{cor:expct-sop}]
|
||||
|
@ -150,7 +150,7 @@ By definition we have that
|
|||
\[\poly_{G}(\vct{X}) = \sum_{\substack{(i_1, j_1),\\ (i_2, j_2),\\ (i_3, j_3) \in E}} \prod_{\ell = 1}^{3}X_{i_\ell}X_{j_\ell}.\]
|
||||
Rather than list all the expressions in full detail, let us make some observations regarding the sum. Let $e_1 = (i_1, j_1), e_2 = (i_2, j_2), e_3 = (i_3, j_3)$. Notice that each expression in the sum consists of a triple $(e_1, e_2, e_3)$. There are three forms the triple $(e_1, e_2, e_3)$ can take.
|
||||
|
||||
\textsc{case 1:} $e_1 = e_2 = e_3$, where all edges are the same. There are exactly $\numedge$ such triples, each with a $\prob^2$ factor.
|
||||
\textsc{case 1:} $e_1 = e_2 = e_3$, where all edges are the same. There are exactly $\numedge$ such triples, each with a $\prob^2$ factor in $\rpoly_{G}\left(\prob_1,\ldots, \prob_\numvar\right)$.
|
||||
|
||||
\textsc{case 2:} This case occurs when there are two distinct edges of the three, call them $e$ and $e'$. When there are two distinct edges, there is then the occurrence when $2$ variables in the triple $(e_1, e_2, e_3)$ are bound to $e$. There are three combinations for this occurrence. It is the analogue for when there is only one occurrence of $e$, i.e. $2$ of the variables in $(e_1, e_2, e_3)$ are $e'$. Again, there are three combinations for this. All $3 + 3 = 6$ combinations of two distinct values consist of the same monomial in $\rpoly$, i.e. $(e_1, e_1, e_2)$ is the same as $(e_2, e_1, e_2)$. This case produces the following edge patterns: $\twopath, \twodis$.
|
||||
|
||||
|
@ -158,6 +158,8 @@ By definition we have that
|
|||
\end{proof}
|
||||
\qed
|
||||
|
||||
Notice that~\cref{lem:qE3-exp} is an example of a query that reduces to the hard problems in graph theory of counting triangles, three-matchings, three-paths, etc. Thus, in general, computing $\expct_{\wVec}\pbox{\poly(\wVec)} = \rpoly\left(\prob_1,\ldots, \prob_\numvar\right)$ is a hard problem.
|
||||
|
||||
\begin{Claim}\label{claim:four-two}
|
||||
If one can compute $\rpoly_{G}(\prob,\ldots, \prob)$ in time $T(\numedge)$, then we can compute the following in $O(T(\numedge) + \numedge)$:
|
||||
\[\numocc{G}{\tri} + \numocc{G}{\threepath} \cdot \prob - \numocc{G}{\threedis}\cdot(3\prob^2 - \prob^3).\]
|
||||
|
@ -189,7 +191,7 @@ The implication in \cref{claim:four-two} follows by the above and \cref{lem:qE3-
|
|||
\qed
|
||||
|
||||
\begin{Lemma}\label{lem:gen-p}
|
||||
If we can compute $\rpoly_{G}(\vct{X})$ in $T(\numedge)$ time for $O(1)$ distinct values of $\prob$ then we can count the number of triangles, 3-paths, and 3-matchings in $G$ in $T(\numedge) + O(\numedge)$ time.
|
||||
If we can compute $\rpoly_{G}(\vct{X})$ in $T(\numedge)$ time for $O(1)$ distinct values $\vct{\prob}$ such that all $\prob_i = \prob$ for all $i \in [\numvar], \prob_i \in \vct{\prob}$, then we can count the number of triangles, 3-paths, and 3-matchings in $G$ in $T(\numedge) + O(\numedge)$ time.
|
||||
\end{Lemma}
|
||||
|
||||
\begin{proof}[Proof of \cref{lem:gen-p}]
|
||||
|
|
|
@ -53,7 +53,7 @@ Using $\mathbb B$-typed variables in an $\mathbb{N}[\vct{X}]$ relation would cor
|
|||
Further define $\nxdb$ as an $\mathbb{N}[\vct{X}]$ database where each tuple $\tup \in \db$ is annotated with a polynomial over variables $X_1,\ldots, X_M$ for some value of $M$ that will be specified later.
|
||||
Since $\nxdb$ is a database that maps tuples to polynomials, it is customary for arbitrary table $\rel$ to be viewed as a function $\rel: \tset \mapsto \mathbb{N}[\vct{X}]$, where $\rel(\tup)$ denotes the polynomial annotating tuple $\tup$.
|
||||
|
||||
It has been shown in previous work that commutative semirings precisely model translations of RA+ query operations to set annotations.
|
||||
It has been shown in previous work that commutative semirings precisely model translations of RA+ query operations to $k$-annotations.
|
||||
The evaluation semantics notation $\llbracket \cdot \rrbracket = x$ simply means that the result of evaluating expression $\cdot$ is given by following the semantics $x$. Given a query $\query$, operations in $\query$ are translated into the following polynomial expressions.
|
||||
|
||||
\begin{align*}
|
||||
|
@ -67,7 +67,7 @@ The evalution semantics notation $\llbracket \cdot \rrbracket = x$ simply mean t
|
|||
&\eval{R}(\tup) && = &&\rel(\tup)
|
||||
\end{align*}
|
||||
|
||||
The above semantics show us how to obtain the annotation on a tuple in the result of query $\query$ from the annotations on the tuples in the input of $\query$.
|
||||
The above semantics show us how to obtain the $k$-annotation on a tuple in the result of query $\query$ from the annotations on the tuples in the input of $\query$.
|
||||
|
||||
\subsection{Defining the Data}\label{subsec:def-data}
|
||||
For the set of possible worlds, $\wSet$, i.e. the set of all $\db_i \in \idb$, define an injective mapping to the set $\{0, 1\}^M$, where for each vector $\vct{w} \in \{0, 1\}^M$ there is at most one element $\db_i \in \idb$ mapped to $\vct{w}$.
|
||||
|
@ -90,6 +90,6 @@ One of the aggregates we desire to compute over the annotated polynomial is the
|
|||
|
||||
Above, $\poly(\vct{w})$ is used to mean the assignment of $\vct{w}$ to $\vct{X}$.
|
||||
|
||||
For a $\ti$, the bit-string world value $\vct{w}$ can be used as indexing to determine which tuples are present in the $\vct{w}$ world, where the $i^{th}$ bit position $(\wbit_i)$ represents whether a tuple $\tup_i$ appears in the unique world identified by the binary value of $\vct{w}$. Denote the vector $\vct{p}$ to be a vector whose elements are the individual probabilities $\prob_i$ of each tuple $\tup_i$ such that those probabilities produce the possible worlds in D with a distribution $\pd$ over all worlds. Let $\pd^{(\vct{p})}$ represent the distribution induced by $\vct{p}$.
|
||||
For a $\ti$, the bit-string world value $\vct{w}$ can be used as indexing to determine which tuples are present in the $\vct{w}$ world, where the $i^{th}$ bit position $(\wbit_i)$ represents whether a tuple $\tup_i$ appears in the unique world identified by the binary value of $\vct{w}[i]$. Denote the vector $\vct{p}$ to be a vector whose elements are the individual probabilities $\prob_i$ of each tuple $\tup_i$ such that those probabilities produce the possible worlds in D with a distribution $\pd$ over all worlds. Let $\pd^{(\vct{p})}$ represent the distribution induced by $\vct{p}$.
|
||||
|
||||
\[\expct_{\rw\sim \pd^{(\vct{p})}}\pbox{\poly(\rw)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\prod_{\substack{i \in [\numTup]\\ \text{s.t. } \wElem_i = 1}}\prob_i \prod_{\substack{i \in [\numTup]\\ \text{s.t. } \wElem_i = 0}}\left(1 - \prob_i\right).\]
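As a small sanity check of the sum above (a hypothetical instance with $\numTup = 2$, chosen only for illustration), take $\poly(X_1, X_2) = X_1X_2 + X_1^2$. Only the worlds $(1, 0)$ and $(1, 1)$ give a nonzero value, namely $\poly(1, 0) = 1$ and $\poly(1, 1) = 2$, so
\[\expct_{\rw\sim \pd^{(\vct{p})}}\pbox{\poly(\rw)} = 1\cdot\prob_1\left(1 - \prob_2\right) + 2\cdot\prob_1\prob_2 = \prob_1 + \prob_1\prob_2.\]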
|
||||
|
|