Added the IntroOutline to the Repo

master
Aaron Huber 2020-11-11 15:45:38 -05:00
parent d7c0677955
commit 15e06595e7
3 changed files with 122 additions and 6 deletions

114
IntroOutline.txt Normal file

@ -0,0 +1,114 @@
You don't need to obsess over the structure or the wording; just focus on the concepts; think only about the outline rather than specific phrasing.
Q: Where can I find the semantics as defined for RA+ set operations?
!Thinking about query results as (a giant pile of) monomials vs. polynomials needs to come out as EARLY as possible; this isn't entirely out of the blue;
the landscape changes for bags IF you think of the annotation in terms of a polynomial rather than a giant pile of monomials-->better than linear in the number
of monomials; don't tie this to specific structure but RATHER to the general flow of the text...
You want to get into this twice.
-------->Below needs to be incorporated into the main outline
1)Brief overview of the challenges in the very beginning--1st paragraph, be very economical with words
-focus on intuition
-what are the key points?
-we don't require the polynomial to be given to us in DNF
-why do people think bags are easy?
-???
-how does this tie into how do people approach implementing pdbs?
-for one, the customary rule of fixed data size on attributes has significantly influenced how folks implement PDBs, i.e., with polynomials in DNF
-how can we do better than the standard bar that most pdb use
-by accepting factorized polynomials as input
-don't worry about specifics
2)Historical overview later on
-why is this the way it is
-the customary fixed attribute data size rule
<----------
Brief overview of the Introduction
-Motivation (Reader must be convinced that this problem is interesting from a DB perspective)
-In practice PDBs are bags
-Thus, it is relevant and interesting to explore PDBs from a bag perspective
-Interesting mathematically
-\tilde{Q} equivalence with \poly{Q} under \vct{X} \in {0, 1}^n
-what does this buy us, aside from being an interesting fact?
-\tilde{Q} is the expectation of Q under the above assumption and the additional assumption of independent variables in \vct{X}
-which allows us to build an approximation alg of \tilde{Q} for the purpose of estimating E[\poly{Q}]
-I may need to think about this more.
-the computation of 3-paths, 3-matchings, and triangles via a linear system and approximation of \tilde{Q}
-Thm 2.1 shows that such a computation is hard (superlinear) without approximation
-what does this buy us practically speaking?
-eeee, are we just saying that we can compute approximations of hard problems in linear time, but, that
should be a given I would think?
-???
-Interesting results
-???
-Why bags are more interesting than previously thought
-???
-Describe the problem
-Computing expectation over bag PDBs
-Discuss hardness results
-using PJ queries over TIDB with all p_i = p is hard in general--link with thm 2.1
------------->I think here we either decide that there is only one subtlety and/or forget about listing subtleties and just talk about them in some decent order
-list the subtleties that we want to make VERY CLEAR and will subsequently detail in the next paragraphs
-better than linear time in the output
-since we take as input the polynomial encoding the query output
-by the fact that this output polynomial can be in factorized form
Q: -the above is the only subtlety that comes to mind currently
<-------------
-perhaps make comparisons to previous work such as Olteanu Anytime work, Factorized DB work, etc.
------------->We need to merge the existing work bullets
-existing work
- a common convention in DBs is a fixed size on the size of a column, the size of the data
-if you know how big the tuple is, there are a bunch of optimizations that you can do
-you want to avoid the situation where the field gets too big
-BUT, annotations break this, since a projection (or join--less so, since you end up with an annotation that is linear in the number of joins) can give
you an annotation that is of arbitrary size greater (in the size of the data).
-therefore, every implemented pdb system (MystiQ, SPROUT, etc.) really wants to avoid creating an arbitrarily sized annotation column
-take the provenance polynomial,
-flatten it into individual monomials
-store the individual monomials in a table
*PDB implementations have restricted themselves to a giant pile of monomials because of the fixed data requirement
-in the worst case, polynomial in the size of the input tables (the table sizes) to materialize all monomials
*Orchestra (Val Tannen, Zach Ives--take the earliest journal papers (any journal--SIGMOD Record paper perhaps?)),
Factorized Databases implement factorizations (SIGMOD 2012? Olteanu) cgpgrey (England/Northern Ireland)
-think about MayBMS
-
<--------------
Describe the subtlety of our scheme performing "better than linear in the output size"
-explicitly define our definition of 'hard' query in this setting
-hard is anything worse than linear time in the size of the SOP polynomial
-explain how the traditional 'set' setting for PDBs is an oversimplification
-note that most PDB implementations use DNF to model tuple polynomials, essentially an enumeration through the number of monomials
-thus computing the expectation (though trivial by linearity of expectation) is linear in the number of monomials
-limited results in complexity over PDBs in the bag setting
EEEEEEEEEEEEEEEEE-a good spot to discuss work in bags setting, but as noted below, I need to do a Lit Survey to add any more to this
bullet point
-the richness of the problem: lower, upper bound [factorized polynomial, sop polynomial]
-we can factorize the polynomial, which produces an output polynomial which is smaller than the equivalent one in DNF, and
this gives us less than linear time.
EEEEEEEEEEEEEEEEEEEEEEEE-link this to work in 'factorized dbs'
-again, I need to reread the Olteanu Factorized DB paper
-motivating example(s)
-don't have any clearly worked out running examples at this point in time
*What other subtleties would be good to explicitly bring out here?
-this is a good question, perhaps either would be beneficial
-reread what we have written so far
-think about any other subtleties that need to be brought out
-think about why this problem is interesting from a mathematical perspective, and the mathematical results that we have
*Somewhere we need to list all the ways the annotation polynomial can be compressed*
-my understanding was that the factorization is limited to Products of Sums
-Boris' comment on element
-pushing projections down
-this is seen on p. 3 paragraph 2 of Factorisation of Provenance Polynomials
EEEEEEEEEEEEEEEEEEBackground Work: What have people done with PDBs? What have people done with Bag-PDBs?
-need to do a Lit Survey before tackling this
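
A minimal sketch of the outline's "factorized polynomial vs. giant pile of monomials" point, under stated assumptions: variables are independent and take values in {0, 1}, and the factors of any product share no variables. The nested-tuple encoding and all function names below are hypothetical and chosen only for illustration; this is not the paper's algorithm.

```python
from itertools import product
from math import prod


def variables(expr):
    """Return the set of variable indices that occur in an expression tree."""
    if expr[0] == "var":
        return {expr[1]}
    found = set()
    for child in expr[1]:
        found |= variables(child)
    return found


def expectation_factorized(expr, probs):
    """Expectation of a factorized polynomial over independent {0,1} variables.

    ASSUMPTION: children of a "mul" node use disjoint variables, so the
    expectation of the product is the product of the expectations.
    """
    kind = expr[0]
    if kind == "var":
        return probs[expr[1]]
    if kind == "add":
        # Linearity of expectation.
        return sum(expectation_factorized(c, probs) for c in expr[1])
    if kind == "mul":
        seen = set()
        result = 1.0
        for child in expr[1]:
            child_vars = variables(child)  # sanity check only, not optimized
            if seen & child_vars:
                raise ValueError("factors share variables; sketch does not apply")
            seen |= child_vars
            result *= expectation_factorized(child, probs)
        return result
    raise ValueError(f"unknown node kind: {kind}")


def flatten_product_of_sums(factors):
    """Expand a product of sums of single variables into explicit monomials."""
    return list(product(*factors))


def expectation_monomials(monomials, probs):
    """Expectation of a sum of monomials with distinct variables per monomial."""
    return sum(prod(probs[i] for i in m) for m in monomials)


# (X0 + X1) * (X2 + X3): 4 variable occurrences when factorized,
# 4 monomials (8 variable occurrences) when flattened.
FACTORS = [(0, 1), (2, 3)]
FACTORIZED = ("mul", [("add", [("var", i) for i in f]) for f in FACTORS])
PROBS = [0.5, 0.25, 0.5, 0.75]
MONOMIALS = flatten_product_of_sums(FACTORS)

e_fact = expectation_factorized(FACTORIZED, PROBS)
e_flat = expectation_monomials(MONOMIALS, PROBS)
```

With k factors of m variables each, the flattened form has m^k monomials while the factorized form has only k*m leaves, which is the gap the outline refers to as "better than linear in the number of monomials".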


@ -95,7 +95,7 @@ Finally, observe \cref{p1-s5} by construction in \cref{lem:pre-poly-rpoly}, that
\end{proof}
\begin{Corollary}\label{cor:expct-sop}
If $\poly$ is given as a sum of monomials, the expectation of $\poly$, i.e., $\ex{\poly}$ can be computed in $O(|\poly|)$, where $|\poly|$ denotes the total number of multiplication/addition operators.
If $\poly$ is given as a sum of monomials, the expectation of $\poly$, i.e., $\ex{\poly} = \rpoly{Q}\left(\prob_1,\ldots, \prob_\numvar\right)$ can be computed in $O(|\poly|)$, where $|\poly|$ denotes the total number of multiplication/addition operators.
\end{Corollary}
\begin{proof}[Proof For Corollary ~\ref{cor:expct-sop}]
@ -150,7 +150,7 @@ By definition we have that
\[\poly_{G}(\vct{X}) = \sum_{\substack{(i_1, j_1),\\ (i_2, j_2),\\ (i_3, j_3) \in E}} \prod_{\ell = 1}^{3}X_{i_\ell}X_{j_\ell}.\]
Rather than list all the expressions in full detail, let us make some observations regarding the sum. Let $e_1 = (i_1, j_1), e_2 = (i_2, j_2), e_3 = (i_3, j_3)$. Notice that each expression in the sum consists of a triple $(e_1, e_2, e_3)$. There are three forms the triple $(e_1, e_2, e_3)$ can take.
\textsc{case 1:} $e_1 = e_2 = e_3$, where all edges are the same. There are exactly $\numedge$ such triples, each with a $\prob^2$ factor.
\textsc{case 1:} $e_1 = e_2 = e_3$, where all edges are the same. There are exactly $\numedge$ such triples, each with a $\prob^2$ factor in $\rpoly_{G}\left(\prob_1,\ldots, \prob_\numvar\right)$.
\textsc{case 2:} This case occurs when there are two distinct edges of the three, call them $e$ and $e'$. When there are two distinct edges, there is then the occurrence when $2$ variables in the triple $(e_1, e_2, e_3)$ are bound to $e$. There are three combinations for this occurrence. It is the analogue for when there is only one occurrence of $e$, i.e. $2$ of the variables in $(e_1, e_2, e_3)$ are $e'$. Again, there are three combinations for this. All $3 + 3 = 6$ combinations of two distinct values consist of the same monomial in $\rpoly$, i.e. $(e_1, e_1, e_2)$ is the same as $(e_2, e_1, e_2)$. This case produces the following edge patterns: $\twopath, \twodis$.
@ -158,6 +158,8 @@ By definition we have that
\end{proof}
\qed
Notice that ~\cref{lem:qE3-exp} is an example of a query that reduces to the hard problems in graph theory of counting triangles, three-matchings, three-paths, etc. Thus, in general, computing $\expct_{\wVec}\pbox{\poly(\wVec)} = \rpoly\left(\prob_1,\ldots, \prob_\numvar\right)$ is a hard problem.
\begin{Claim}\label{claim:four-two}
If one can compute $\rpoly_{G}(\prob,\ldots, \prob)$ in time $T(\numedge)$, then we can compute the following in $O(T(\numedge) + \numedge)$:
\[\numocc{G}{\tri} + \numocc{G}{\threepath} \cdot \prob - \numocc{G}{\threedis}\cdot(3\prob^2 - \prob^3).\]
@ -189,7 +191,7 @@ The implication in \cref{claim:four-two} follows by the above and \cref{lem:qE3-
\qed
\begin{Lemma}\label{lem:gen-p}
If we can compute $\rpoly_{G}(\vct{X})$ in $T(\numedge)$ time for $O(1)$ distinct values of $\prob$ then we can count the number of triangles, 3-paths, and 3-matchings in $G$ in $T(\numedge) + O(\numedge)$ time.
If we can compute $\rpoly_{G}(\vct{X})$ in $T(\numedge)$ time for $O(1)$ distinct values $\vct{\prob}$ such that all $\prob_i = \prob$ for all $i \in [\numvar], \prob_i \in \vct{\prob}$, then we can count the number of triangles, 3-paths, and 3-matchings in $G$ in $T(\numedge) + O(\numedge)$ time.
\end{Lemma}
\begin{proof}[Proof of \cref{lem:gen-p}]


@ -53,7 +53,7 @@ Using $\mathbb B$-typed variables in an $\mathbb{N}[\vct{X}]$ relation would cor
Further define $\nxdb$ as an $\mathbb{N}[\vct{X}]$ database where each tuple $\tup \in \db$ is annotated with a polynomial over variables $X_1,\ldots, X_M$ for some value of $M$ that will be specified later.
Since $\nxdb$ is a database that maps tuples to polynomials, it is customary for arbitrary table $\rel$ to be viewed as a function $\rel: \tset \mapsto \mathbb{N}[\vct{X}]$, where $\rel(\tup)$ denotes the polynomial annotating tuple $\tup$.
It has been shown in previous work that commutative semirings precisely model translations of RA+ query operations to set annotations.
It has been shown in previous work that commutative semirings precisely model translations of RA+ query operations to $k$-annotations.
The evaluation semantics notation $\llbracket \cdot \rrbracket = x$ simply means that the result of evaluating expression $\cdot$ is given by following the semantics $x$. Given a query $\query$, operations in $\query$ are translated into the following polynomial expressions.
\begin{align*}
@ -67,7 +67,7 @@ The evalution semantics notation $\llbracket \cdot \rrbracket = x$ simply mean t
&\eval{R}(\tup) && = &&\rel(\tup)
\end{align*}
The above semantics show us how to obtain the annotation on a tuple in the result of query $\query$ from the annotations on the tuples in the input of $\query$.
The above semantics show us how to obtain the $k$-annotation on a tuple in the result of query $\query$ from the annotations on the tuples in the input of $\query$.
\subsection{Defining the Data}\label{subsec:def-data}
For the set of possible worlds, $\wSet$, i.e. the set of all $\db_i \in \idb$, define an injective mapping to the set $\{0, 1\}^M$, where for each vector $\vct{w} \in \{0, 1\}^M$ there is at most one element $\db_i \in \idb$ mapped to $\vct{w}$.
@ -90,6 +90,6 @@ One of the aggregates we desire to compute over the annotated polynomial is the
Above, $\poly(\vct{w})$ is used to mean the assignment of $\vct{w}$ to $\vct{X}$.
For a $\ti$, the bit-string world value $\vct{w}$ can be used as indexing to determine which tuples are present in the $\vct{w}$ world, where the $i^{th}$ bit position $(\wbit_i)$ represents whether a tuple $\tup_i$ appears in the unique world identified by the binary value of $\vct{w}$. Denote the vector $\vct{p}$ to be a vector whose elements are the individual probabilities $\prob_i$ of each tuple $\tup_i$ such that those probabilities produce the possible worlds in D with a distribution $\pd$ over all worlds. Let $\pd^{(\vct{p})}$ represent the distribution induced by $\vct{p}$.
For a $\ti$, the bit-string world value $\vct{w}$ can be used as indexing to determine which tuples are present in the $\vct{w}$ world, where the $i^{th}$ bit position $(\wbit_i)$ represents whether a tuple $\tup_i$ appears in the unique world identified by the binary value of $\vct{w}[i]$. Denote the vector $\vct{p}$ to be a vector whose elements are the individual probabilities $\prob_i$ of each tuple $\tup_i$ such that those probabilities produce the possible worlds in D with a distribution $\pd$ over all worlds. Let $\pd^{(\vct{p})}$ represent the distribution induced by $\vct{p}$.
\[\expct_{\rw\sim \pd^{(\vct{p})}}\pbox{\poly(\rw)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\prod_{\substack{i \in [\numTup]\\ s.t. \wElem_i = 1}}\prob_i \prod_{\substack{i \in [\numTup]\\s.t. \wElem_i = 0}}\left(1 - \prob_i\right).\]
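
The expectation formula above can be checked numerically on a small instance. The following is a sketch, not part of the paper: it assumes tuple-independent probabilities, encodes a polynomial as a list of `(coefficient, {variable index: exponent})` pairs (a hypothetical encoding), and compares the exponential sum over all worlds with the monomial-by-monomial computation that drops exponents because $X^k = X$ on $\{0, 1\}$.

```python
from itertools import product


def evaluate_sop(monomials, world):
    """Evaluate a sum-of-monomials polynomial at a 0/1 world vector."""
    total = 0
    for coeff, exps in monomials:
        term = coeff
        for i, e in exps.items():
            term *= world[i] ** e
        total += term
    return total


def expectation_brute_force(monomials, probs):
    """Sum poly(w) * Pr[w] over all worlds w in {0,1}^n (exponential in n)."""
    total = 0.0
    for world in product((0, 1), repeat=len(probs)):
        weight = 1.0
        for w_i, p_i in zip(world, probs):
            weight *= p_i if w_i == 1 else (1.0 - p_i)
        total += evaluate_sop(monomials, world) * weight
    return total


def expectation_sop(monomials, probs):
    """Linear-time expectation for a sum of monomials.

    ASSUMPTION: variables are independent and 0/1 valued, so every positive
    exponent reduces to 1 and each monomial contributes coeff * prod(p_i).
    """
    total = 0.0
    for coeff, exps in monomials:
        term = float(coeff)
        for i, e in exps.items():
            if e > 0:
                term *= probs[i]
        total += term
    return total


# Phi(X) = X0^2 * X1 + 2 * X1 * X2 (illustrative polynomial, not from the paper)
PHI = [(1, {0: 2, 1: 1}), (2, {1: 1, 2: 1})]
PROBS = [0.5, 0.25, 0.8]

brute = expectation_brute_force(PHI, PROBS)
fast = expectation_sop(PHI, PROBS)
```

Both values agree on this instance; the brute-force version enumerates $2^{\numTup}$ worlds, while the second pass touches each monomial once.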