Added the IntroOutline to the Repo

2020-11-11 15:45:38 -05:00 · 15e06595e7
parent d7c0677955
commit 15e06595e7
3 changed files with 122 additions and 6 deletions
--- a/IntroOutline.txt
+++ b/IntroOutline.txt
@@ -0,0 +1,114 @@
+You don't need to obsess over the structure or the wording; just focus on the concepts; think only about the outline rather than specific phrasing.
+
+Q: Where can I find the semantics as defined for RA+ set operations?
+
+!Thinking about query results as (a giant pile of) monomials vs. polynomials needs to come out as EARLY as possible; this isn't entirely out of the blue;
+the landscape changes for bags IF you think of the annotation in terms of a polynomial rather than a giant pile of monomials-->better than linear in the number
+of monomials; don't tie this to specific structure but RATHER to the general flow of the text...
+You want to get into this twice. 
+
+-------->Below needs to be incorporated into the main outline 
+	1)Brief overview of the challenges in the very beginning--1st paragraph, be very economical with words
+		-focus on intuition
+			-what are the key points?
+				-we don't require the polynomial to be given to us in DNF
+			-why do people think bags are easy?
+				-???
+			-how does this tie into how people approach implementing PDBs?
+				-for one, the customary rule of fixed data size on attributes has significantly influenced how folks implement PDBs, i.e., with polynomials in DNF
+			-how can we do better than the standard bar that most pdb use
+				-by accepting factorized polynomials as input
+		-don't worry about specifics
+	2)Historical overview later on
+		-why is this the way it is
+			-the customary fixed attribute data size rule
+<----------
+
+Brief overview of the Introduction
+	-Motivation (Reader must be convinced that this problem is interesting from a DB perspective)
+		-In practice PDBs are bags
+			-Thus, it is relevant and interesting to explore PDBs from a bag perspective
+		-Interesting mathematically
+			-\tilde{Q} equivalence with \poly{Q} under \vct{X} \in {0, 1}^n 
+				-what does this buy us, aside from being an interesting fact?
+					-\tilde{Q} is the expectation of Q under the above assumption and the additional assumption of independent variables in \vct{X}
+						-which allows us to build an approximation alg of \tilde{Q} for the purpose of estimating E[\poly{Q}]
+					-I may need to think about this more. 
+			
+			-the computation of 3-paths, 3-matchings, and triangles via a linear system and approximation of \tilde{Q}
+				-Thm 2.1 shows that such a computation is hard (superlinear) without approximation
+				-what does this buy us practically speaking?
+					-eeee, are we just saying that we can compute approximations of hard problems in linear time, but, that 
+					should be a given I would think?
+			-???
+		-Interesting results
+			-???
+		-Why bags are more interesting than previously thought
+			-???
+	-Describe the problem
+		-Computing expectation over bag PDBs
+		-Discuss hardness results
+			-using PJ queries over TIDB with all p_i = p is hard in general--link with thm 2.1
+
+------------->I think here we either decide that there is only one subtlety and/or forget about listing subtleties and just talk about them in some decent order
+
+	-list the subtleties that we want to make VERY CLEAR and will subsequently detail in the next paragraphs
+		-better than linear time in the output
+			-since we take as input the polynomial encoding the query output
+			-by the fact that this output polynomial can be in factorized form
+Q:		-the above is the only subtlety that comes to mind currently
+<-------------
+	-perhaps make comparisons to previous work such as Olteanu Anytime work, Factorized DB work, etc.
+
+------------->We need to merge the existing work bullets
+	-existing work
+		- a common convention in DBs is a fixed size on the size of a column, the size of the data
+			-if you know how big the tuple is, there are a bunch of optimizations that you can do
+			-you want to avoid the situation where the field gets too big
+			-BUT, annotations break this, since a projection (or join--less so, since you end up with an annotation that is linear in the number of joins) can give 
+			you an annotation that is of arbitrary size greater (in the size of the data).
+			-therefore, every implemented PDB system (MystiQ, SPROUT, etc.) really wants to avoid creating an arbitrarily sized annotation column
+				-take the provenance polynomial,
+					-flatten it into individual monomials
+						-store the individual monomials in a table
+			*PDB implementations have restricted themselves to a giant pile of monomials because of the fixed data requirement
+				-in the worst case, polynomial in the size of the input tables (the table sizes) to materialize all monomials
+			*Orchestra (Val Tannen, Zach Ives--take the earliest journal papers (any journal--SIGMOD Record paper perhaps?)), 
+			Factorized Databases implement factorizations (SIGMOD 2012?  Olteanu) cgpgrey (England/Northern Ireland)
+		-think about MayBMS
+			-
+<--------------
+
+Describe the subtlety of our scheme performing "better than linear in the output size"
+	-explicitly define our definition of 'hard' query in this setting 
+		-hard is anything worse than linear time in the size of the SOP polynomial
+	-explain how the traditional 'set' setting for PDBs is an oversimplification
+		-note that most PDB implementations use DNF to model tuple polynomials, essentially an enumeration through the number of monomials
+			-thus computing the expectation (though trivial by linearity of expectation) is linear in the number of monomials
+		-limited results in complexity over PDBs in the bag setting
+			EEEEEEEEEEEEEEEEE-a good spot to discuss work in bags setting, but as noted below, I need to do a Lit Survey to add any more to this
+			bullet point
+	-the richness of the problem: lower, upper bound [factorized polynomial, sop polynomial]
+		-we can factorize the polynomial, which produces an output polynomial which is smaller than the equivalent one in DNF, and 
+		this gives us less than linear time.
+		EEEEEEEEEEEEEEEEEEEEEEEE-link this to work in 'factorized dbs'
+			-again, I need to reread the Olteanu Factorized DB paper
+	-motivating example(s) 
+		-don't have any clearly worked out running examples at this point in time
+
+*What other subtleties would be good to explicitly bring out here?
+	-this is a good question, perhaps either would be beneficial
+		-reread what we have written so far
+			-think about any other subtleties that need to be brought out
+			-think about why this problem is interesting from a mathematical perspective, and the mathematical results that we have
+
+
+
+*Somewhere we need to list all the ways the annotation polynomial can be compressed*
+	-my understanding was that the factorization is limited to Products of Sums
+	-Boris' comment on element
+		-pushing projections down
+			-this is seen on p. 3 paragraph 2 of Factorisation of Provenance Polynomials
+
+EEEEEEEEEEEEEEEEEEBackground Work:  What have people done with PDBs?  What have people done with Bag-PDBs?
+	-need to do a Lit Survey before tackling this
--- a/poly-form.tex
+++ b/poly-form.tex
@@ -95,7 +95,7 @@ Finally, observe \cref{p1-s5} by construction in \cref{lem:pre-poly-rpoly}, that
 \end{proof}

 \begin{Corollary}\label{cor:expct-sop}
-If $\poly$ is given as a sum of monomials, the expectation of $\poly$, i.e., $\ex{\poly}$ can be computed in $O(|\poly|)$, where $|\poly|$ denotes the total number of multiplication/addition operators.
+If $\poly$ is given as a sum of monomials, the expectation of $\poly$, i.e., $\ex{\poly} = \rpoly{Q}\left(\prob_1,\ldots, \prob_\numvar\right)$ can be computed in $O(|\poly|)$, where $|\poly|$ denotes the total number of multiplication/addition operators.
 \end{Corollary}

 \begin{proof}[Proof For Corollary ~\ref{cor:expct-sop}]
@@ -150,7 +150,7 @@ By definition we have that
 		\[\poly_{G}(\vct{X}) = \sum_{\substack{(i_1, j_1),\\ (i_2, j_2),\\ (i_3, j_3) \in E}} \prod_{\ell = 1}^{3}X_{i_\ell}X_{j_\ell}.\]
 		Rather than list all the expressions in full detail, let us make some observations regarding the sum.  Let $e_1 = (i_1, j_1), e_2 = (i_2, j_2), e_3 = (i_3, j_3)$.  Notice that each expression in the sum consists of a triple $(e_1, e_2, e_3)$.  There are three forms the triple $(e_1, e_2, e_3)$ can take.

-\textsc{case 1:} $e_1 = e_2 = e_3$, where all edges are the same.  There are exactly $\numedge$ such triples, each with a $\prob^2$ factor.
+\textsc{case 1:} $e_1 = e_2 = e_3$, where all edges are the same.  There are exactly $\numedge$ such triples, each with a $\prob^2$ factor in $\rpoly_{G}\left(\prob_1,\ldots, \prob_\numvar\right)$.

 \textsc{case 2:}  This case occurs when there are two distinct edges of the three, call them $e$ and $e'$.  When there are two distinct edges, there is then the occurrence when $2$ variables in the triple $(e_1, e_2, e_3)$ are bound to $e$.  There are three combinations for this occurrence.  It is the analogue for when there is only one occurrence of $e$, i.e. $2$ of the variables in $(e_1, e_2, e_3)$ are $e'$.  Again, there are three combinations for this.  All $3 + 3 = 6$ combinations of two distinct values consist of the same monomial in $\rpoly$, i.e. $(e_1, e_1, e_2)$ is the same as $(e_2, e_1, e_2)$.  This case produces the following edge patterns: $\twopath, \twodis$.

@@ -158,6 +158,8 @@ By definition we have that
 \end{proof}
 \qed

+Notice that~\cref{lem:qE3-exp} is an example of a query that reduces to the hard graph-theoretic problems of counting triangles, three-matchings, three-paths, etc.  Thus, in general, computing $\expct_{\wVec}\pbox{\poly(\wVec)} = \rpoly\left(\prob_1,\ldots, \prob_\numvar\right)$ is a hard problem.
+
 \begin{Claim}\label{claim:four-two}
 If one can compute $\rpoly_{G}(\prob,\ldots, \prob)$ in time $T(\numedge)$, then we can compute the following in $O(T(\numedge) + \numedge)$:
 \[\numocc{G}{\tri} + \numocc{G}{\threepath} \cdot \prob - \numocc{G}{\threedis}\cdot(3\prob^2 - \prob^3).\]
@@ -189,7 +191,7 @@ The implication in \cref{claim:four-two} follows by the above and \cref{lem:qE3-
 \qed

 \begin{Lemma}\label{lem:gen-p}
-If we can compute $\rpoly_{G}(\vct{X})$ in $T(\numedge)$ time for $O(1)$ distinct values of $\prob$ then we can count the number of  triangles, 3-paths, and 3-matchings in $G$ in $T(\numedge) + O(\numedge)$ time.
+If we can compute $\rpoly_{G}(\vct{X})$ in $T(\numedge)$ time for $O(1)$ distinct vectors $\vct{\prob}$, each satisfying $\prob_i = \prob$ for all $i \in [\numvar]$, then we can count the number of triangles, 3-paths, and 3-matchings in $G$ in $T(\numedge) + O(\numedge)$ time.
 \end{Lemma}

 \begin{proof}[Proof of \cref{lem:gen-p}]
--- a/ra-to-poly.tex
+++ b/ra-to-poly.tex
@@ -53,7 +53,7 @@ Using $\mathbb B$-typed variables in an $\mathbb{N}[\vct{X}]$ relation would cor
 Further define $\nxdb$ as an $\mathbb{N}[\vct{X}]$ database where each tuple $\tup \in \db$ is annotated with a polynomial over variables $X_1,\ldots, X_M$ for some value of $M$ that will be specified later.  
 Since $\nxdb$ is a database that maps tuples to polynomials, it is customary for arbitrary table $\rel$ to be viewed as a function $\rel: \tset \mapsto \mathbb{N}[\vct{X}]$, where $\rel(\tup)$ denotes the polynomial annotating tuple $\tup$.

-It has been shown in previous work that commutative semirings precisely model translations of RA+ query operations to set annotations.  
+It has been shown in previous work that commutative semirings precisely model translations of RA+ query operations to $k$-annotations.  
 The evaluation semantics notation $\llbracket \cdot \rrbracket = x$ simply means that the result of evaluating expression $\cdot$ is given by following the semantics $x$.  Given a query $\query$, operations in $\query$ are translated into the following polynomial expressions.

 \begin{align*}
@@ -67,7 +67,7 @@ The evaluation semantics notation $\llbracket \cdot \rrbracket = x$ simply mean t
 &\eval{R}(\tup) && = &&\rel(\tup)
 \end{align*}

-The above semantics show us how to obtain the annotation on a tuple in the result of query $\query$ from the annotations on the tuples in the input of $\query$.
+The above semantics show us how to obtain the $k$-annotation on a tuple in the result of query $\query$ from the annotations on the tuples in the input of $\query$.

 \subsection{Defining the Data}\label{subsec:def-data}
 For the set of possible worlds, $\wSet$, i.e. the set of all $\db_i \in \idb$, define an injective mapping to the set $\{0, 1\}^M$, where for each vector $\vct{w} \in \{0, 1\}^M$ there is at most one element $\db_i \in \idb$ mapped to $\vct{w}$.  
@@ -90,6 +90,6 @@ One of the aggregates we desire to compute over the annotated polynomial is the

 Above, $\poly(\vct{w})$ is used to mean the assignment of $\vct{w}$ to $\vct{X}$.

-For a $\ti$, the bit-string world value $\vct{w}$ can be used as indexing to determine which tuples are present in the $\vct{w}$ world, where the $i^{th}$ bit position $(\wbit_i)$ represents whether a tuple $\tup_i$ appears in the unique world identified by the binary value of $\vct{w}$.  Denote the vector $\vct{p}$ to be a vector whose elements are the individual probabilities $\prob_i$ of each tuple $\tup_i$ such that those probabilities produce the possible worlds in D with a distribution $\pd$ over all worlds.  Let $\pd^{(\vct{p})}$ represent the distribution induced by $\vct{p}$.
+For a $\ti$, the bit-string world value $\vct{w}$ can be used as indexing to determine which tuples are present in the $\vct{w}$ world, where the $i^{th}$ bit position $(\wbit_i)$ represents whether a tuple $\tup_i$ appears in the unique world identified by the binary value of $\vct{w}[i]$.  Denote the vector $\vct{p}$ to be a vector whose elements are the individual probabilities $\prob_i$ of each tuple $\tup_i$ such that those probabilities produce the possible worlds in D with a distribution $\pd$ over all worlds.  Let $\pd^{(\vct{p})}$ represent the distribution induced by $\vct{p}$.

 \[\expct_{\rw\sim \pd^{(\vct{p})}}\pbox{\poly(\rw)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\prod_{\substack{i \in [\numTup]\\ s.t. \wElem_i = 1}}\prob_i \prod_{\substack{i \in [\numTup]\\ s.t. \wElem_i = 0}}\left(1 - \prob_i\right).\]
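
Reviewer sketch (not part of the commit): the expectation in the last hunk sums over all 2^n worlds, while the corollary on sum-of-monomials input says the same value can be read off monomial by monomial. The Python below compares the two on a tiny tuple-independent instance. The encoding of a polynomial as a list of (coefficient, {variable index: exponent}) pairs and both function names are assumptions made for illustration; they do not appear in the paper. The shortcut relies on each variable being 0/1-valued (so any positive exponent collapses to 1) and on the variables being independent.

```python
from itertools import product
from math import prod

# Hypothetical encoding: a polynomial in sum-of-monomials form is a list of
# (coefficient, {variable_index: exponent}) pairs.

def evaluate(poly, w):
    """Evaluate the polynomial at the 0/1 world vector w."""
    return sum(c * prod(w[i] ** e for i, e in mono.items()) for c, mono in poly)

def expectation_brute_force(poly, p):
    """Sum poly(w) * Pr[w] over all 2^n worlds of a tuple-independent database."""
    n = len(p)
    total = 0.0
    for w in product((0, 1), repeat=n):
        # Probability of world w: p_i if tuple i is present, (1 - p_i) otherwise.
        weight = prod(p[i] if w[i] == 1 else 1 - p[i] for i in range(n))
        total += evaluate(poly, w) * weight
    return total

def expectation_by_monomials(poly, p):
    """Linearity of expectation: drop exponents, substitute p_i for X_i.

    Runs in time linear in the size of the sum-of-monomials representation.
    """
    return sum(c * prod(p[i] for i, e in mono.items() if e > 0) for c, mono in poly)

if __name__ == "__main__":
    # X_0^2 * X_1 + 2 * X_1 * X_2 with per-tuple probabilities p.
    poly = [(1, {0: 2, 1: 1}), (2, {1: 1, 2: 1})]
    p = [0.5, 0.25, 0.8]
    print(expectation_brute_force(poly, p), expectation_by_monomials(poly, p))
```

The brute-force version is exponential in the number of tuples and is only there as a cross-check; the monomial version is the linear-time baseline that the outline wants to beat by accepting factorized input.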