Finished implementing @atri's 070221 changes.

master
Aaron Huber 2021-07-07 10:31:35 -04:00
parent d8b3df2d89
commit e3e5f6ee13
3 changed files with 22 additions and 17 deletions

@ -178,7 +178,7 @@ A bag probabilistic database (\abbrPDB) is a probability distribution over $\num
In general, evaluating $\query$ over a deterministic database is \sharpwonehard in a parameter $k$, meaning the runtime is superlinear in the size-$\numvar$ input; counting $k$-cliques and computing multiway joins ($k$-joins) are standard examples. This hardness result is unsatisfying in that it does not account for the cost of computing the expected count $\expct\pbox{\poly(\vct{X})}$. A natural question is whether we can quantify the hardness of the probability computation separately from the complexity of the deterministic (pre)processing. The model illustrated in \cref{fig:two-step} is one way to do this. %Assuming $\query$ is linear or better, how does query computation of a $\abbrPDB$ compare to deterministic query processing?
This model views \abbrPDB query processing as two steps. As depicted, the first step of computing $\query$ over a \abbrPDB is essentially deterministic: it produces both the query output and $\poly(\vct{X})$.%result tuple lineage polynomial(s) encoded in the respective representation.
\footnote{Note that computing a lineage polynomial is of the same complexity as computing the query output.}
\footnote{Note that, assuming standard query algorithms over $\raPlus$ operators, computing a lineage polynomial is of the same complexity as computing the query output.}
% the runtime of the first step is the same in both the deterministic and \abbrPDB settings, since the computation of the linage is never greater than the query processing time.}
The second step consists of computing the expectation of $\poly({\vct{X}})$, i.e., $\expct\pbox{\poly(\vct{X})}$. This model of computation is naturally followed by set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu} (e.g., intensional evaluation computes the marginal probability as a separate step, while extensional evaluation computes marginal probabilities as a separate step of each operator, so in either case the two concerns can be separated) and by semiring provenance \cite{DBLP:conf/pods/GreenKT07} (the $\semNX$-DB first computes the annotation via the query, and the polynomial is then evaluated under a specific valuation). It is also useful in this work for separating the deterministic computation from the probability computation.
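To make the two steps concrete, consider a schematic instance (the tuple variables $X_1, X_2, X_3$ and their probabilities here are hypothetical and assumed independent; this is not one of the paper's figures). Step one, the deterministic step, evaluates $\query$ and along the way produces a lineage polynomial, say
\begin{equation*}
\poly(\vct{X}) = X_1X_2 + X_1X_3.
\end{equation*}
Step two computes its expectation; by linearity of expectation and independence,
\begin{equation*}
\expct\pbox{\poly(\vct{X})} = \expct\pbox{X_1X_2} + \expct\pbox{X_1X_3} = \prob_1\prob_2 + \prob_1\prob_3.
\end{equation*}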
@ -189,7 +189,7 @@ Most work done in \abbrPDB\xplural has been done in the setting of set \abbrPDB\
%Since set-\abbrPDB\xplural are essentially limited to computing the marginal probability of $\tup$, bag-\abbrPDB\xplural are a more natural fit for computing queries such as count queries.
Traditionally, bag-\abbrPDB\xplural have been considered to be bottlenecked in step one only, with step two assumed to be linear in the size of the lineage. This may be due in part to the prevalence of sum-of-products (\abbrSOP) representations of the lineage polynomial among many of the best-known set-\abbrPDB implementations. Such a representation used in the bag-\abbrPDB setting \emph{indeed} allows step two to be linear in the \emph{size} of the \abbrSOP representation, a consequence of linearity of expectation.
The main insight of the paper is that we should not stop here. One can have compact representations of $\poly(\vct{X})$ resulting from, for example, optimizations like projection push-down produce factorized representations of $\poly(\vct{X})$. To capture such factorizations, this work uses (arithmetic) circuits as the representation system of $\poly(\vct{X})$, which are a natural fit to $\raPlus$ queries as each operator maps to either a $\circplus$ or $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07}. Our work explores whether or not step two in the computation model is \emph{always} linear in the \emph{size} of the representation of the lineage polynomial when step one of $\query(\pdb)$ is easy. %This works focuses on step two of the computation model specifically in regards to bag-\abbrPDB queries.
The main insight of the paper is that we should not stop here. One can have compact representations of $\poly(\vct{X})$: for example, optimizations like projection push-down produce factorized representations of $\poly(\vct{X})$. To capture such factorizations, this work uses (arithmetic) circuits as the representation system of $\poly(\vct{X})$; they are a natural fit for $\raPlus$ queries since each operator maps to either a $\circplus$ or $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07} (as shown in \cref{fig:nxDBSemantics}). Our work explores whether step two in the computation model is \emph{always} linear in the \emph{size} of the representation of the lineage polynomial when step one of $\query(\pdb)$ is easy. %This work focuses on step two of the computation model, specifically in regards to bag-\abbrPDB queries.
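As a toy illustration of why factorized forms matter (the variables $X_1, X_2, X_3, Y$ are hypothetical and not tied to the running example): projection push-down can turn an \abbrSOP lineage into a product,
\begin{equation*}
X_1Y + X_2Y + X_3Y \;=\; \left(X_1 + X_2 + X_3\right) \circmult Y,
\end{equation*}
so a circuit needs only two $\circplus$ gates and one $\circmult$ gate instead of the three $\circmult$ and two $\circplus$ gates of the \abbrSOP form; the gap grows with the number of summands.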
Consider again the bag-\abbrTIDB
%\footnote{A \abbrTIDB is a \abbrPDB such that each tuple is considered to be an independent random event.}
$\pdb$. When all tuple probabilities satisfy $\prob_i = 1$, computing the expected count is linear in the size of the arithmetic circuit, and $\query(\pdb)$ has polytime complexity. This leads us to our problem statement:
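(The $\prob_i = 1$ observation above can be checked in one line, in our notation: each $X_i$ is then deterministically $1$, so
\begin{equation*}
\prob_i = 1 \text{ for all } i \quad\Longrightarrow\quad \expct\pbox{\poly(\vct{X})} = \poly(1,\ldots,1),
\end{equation*}
i.e., the expected count coincides with the deterministic bag count, which a single pass over the arithmetic circuit computes.)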
@ -321,22 +321,26 @@ we can approximate the expected output tuple multiplicities with only $O(\log{Z}
Consider the query $\query(\pdb) \coloneqq \project_\emptyset(OnTime \join_{City = City_1} Route \join_{{City}_2 = City'}\rename_{City' \leftarrow City}(OnTime))$
%$Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$
over the bag relations of \Cref{fig:ex-shipping-simp}. It can be verified that $\Phi$ for $Q$ is $L_aL_b + L_bL_d + L_bL_c$. Now consider the product query $\query^2()\dlImp Q(), Q()$.
over the bag relations of \Cref{fig:ex-shipping-simp}. It can be verified that $\Phi$ for $Q$ is $L_aR_aL_b + L_bR_bL_d + L_bR_cL_c$. Now consider the product query $\query^2(\pdb) = \query(\pdb) \times \query(\pdb)$.
The lineage polynomial for $Q^2$ is given by $\Phi^2$:
\begin{equation*}
\left(L_aL_b + L_bL_d + L_bL_c\right)^2=L_a^2L_b^2 + L_b^2L_d^2 + L_b^2L_c^2 + 2L_aL_b^2L_d + 2L_aL_b^2L_c + 2L_b^2L_dL_c.
\end{equation*}
The expectation $\expct\pbox{\Phi^2}$ then is:
\begin{multline*}
\expct\pbox{L_a}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_d} \\
+ 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_c} + 2\expct\pbox{L_b^2}\expct\pbox{L_d}\expct\pbox{L_c}
\left(L_aR_aL_b + L_bR_bL_d + L_bR_cL_c\right)^2\\
=L_a^2R_a^2L_b^2 + L_b^2R_b^2L_d^2 + L_b^2R_c^2L_c^2 + 2L_aR_aL_b^2R_bL_d + 2L_aR_aL_b^2R_cL_c + 2L_b^2R_bL_dR_cL_c.
\end{multline*}
The expectation $\expct\pbox{\Phi^2}$ then is:
\begin{footnotesize}
\begin{multline*}
\expct\pbox{L_a^2}\expct\pbox{R_a^2}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{R_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{R_c^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b^2}\expct\pbox{R_b}\expct\pbox{L_d}\\
+ 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b^2}\expct\pbox{R_c}\expct\pbox{L_c} + 2\expct\pbox{L_b^2}\expct\pbox{R_b}\expct\pbox{L_d}\expct\pbox{R_c}\expct\pbox{L_c}
\end{multline*}
\end{footnotesize}
\noindent If the domain of a random variable $W$ is $\{0, 1\}$, then for any $k > 0$, $\expct\pbox{W^k} = \expct\pbox{W}$, which means that $\expct\pbox{\Phi^2}$ simplifies to:
\begin{footnotesize}
\begin{equation*}
\expct\pbox{L_a}\expct\pbox{L_b} + \expct\pbox{L_b}\expct\pbox{L_d} + \expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_d} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_b}\expct\pbox{L_d}\expct\pbox{L_c}
\end{equation*}
\begin{multline*}
\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b} + \expct\pbox{L_b}\expct\pbox{R_b}\expct\pbox{L_d} + \expct\pbox{L_b}\expct\pbox{R_c}\expct\pbox{L_c} + 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b}\expct\pbox{R_b}\expct\pbox{L_d} \\
+ 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b}\expct\pbox{R_c}\expct\pbox{L_c} + 2\expct\pbox{L_b}\expct\pbox{R_b}\expct\pbox{L_d}\expct\pbox{R_c}\expct\pbox{L_c}
\end{multline*}
\end{footnotesize}
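For completeness, the identity used above follows directly from the definition of expectation: for any random variable $W$ with domain $\{0,1\}$ and any $k > 0$,
\begin{equation*}
\expct\pbox{W^k} = 0^k\cdot\probOf\pbox{W=0} + 1^k\cdot\probOf\pbox{W=1} = \probOf\pbox{W=1} = \expct\pbox{W}.
\end{equation*}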
\noindent This property leads us to consider a structure related to the lineage polynomial.
\begin{Definition}\label{def:reduced-poly}
@ -344,12 +348,12 @@ For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpol
\end{Definition}
With $\Phi^2$ as an example, we have:
\begin{align*}
\widetilde{\Phi^2}(L_a, L_b, L_c, L_d)
=&\; L_aL_b + L_bL_d + L_bL_c + 2L_aL_bL_d + 2L_aL_bL_c + 2L_bL_cL_d
&\widetilde{\Phi^2}(L_a, L_b, L_c, L_d, R_a, R_b, R_c)\\
&\; = L_aR_aL_b + L_bR_bL_d + L_bR_cL_c + 2L_aR_aL_bR_bL_d + 2L_aR_aL_bR_cL_c + 2L_bR_bL_dR_cL_c
\end{align*}
It can be verified that the reduced polynomial, evaluated at each variable's marginal probability, is a closed form of the expected count (i.e., $\expct\pbox{\Phi^2} = \widetilde{\Phi^2}(\probOf\pbox{L_a=1},$ $\probOf\pbox{L_b=1}, \probOf\pbox{L_c=1}, \probOf\pbox{L_d=1}, \probOf\pbox{R_a=1}, \probOf\pbox{R_b=1}, \probOf\pbox{R_c=1})$). In fact, we show in \Cref{lem:exp-poly-rpoly} that this equivalence holds for {\em all} $\raPlus$ queries over TIDB/BIDB.
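As a spot check of this closed form on a single monomial (the general argument is \Cref{lem:exp-poly-rpoly}), take $2L_aR_aL_b^2R_bL_d$ from $\Phi^2$: independence of the \abbrTIDB variables together with the identity $\expct\pbox{L_b^2} = \expct\pbox{L_b}$ gives
\begin{footnotesize}
\begin{equation*}
\expct\pbox{2L_aR_aL_b^2R_bL_d} = 2\,\probOf\pbox{L_a=1}\probOf\pbox{R_a=1}\probOf\pbox{L_b=1}\probOf\pbox{R_b=1}\probOf\pbox{L_d=1},
\end{equation*}
\end{footnotesize}
which is exactly the corresponding monomial $2L_aR_aL_bR_bL_d$ of $\widetilde{\Phi^2}$ evaluated at the marginal probabilities.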
To prove our hardness result we show that for the same $Q$ considered in the running example, the query $Q^k$ is able to encode various hard graph-counting problems. We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$). For an upper bound on approximating the expected count, it is easy to check that if all the probabilties are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e. evaluating the original lineage polynomial over the probability values) is a constant factor approximation. To get an $(1\pm \epsilon)$-multiplicative approximation we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.
To prove our hardness result we show that, for the same $Q$ considered in the running example, the query $Q^k$ can encode various hard graph-counting problems. We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$). For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant factor approximation. For example, if we know that $\prob_0 = \max_{i \in [\numvar]}\prob_i$, then $\poly(\prob_0,\ldots, \prob_0)$ is an upper-bound constant factor approximation; the opposite choice yields a constant factor lower bound. To get a $(1\pm \epsilon)$-multiplicative approximation we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.
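To give a flavor of this connection (a sketch in our notation; the exact encoding appears in \Cref{sec:hard}), suppose $OnTime$ holds one probabilistic tuple per vertex of a graph $G$, with variable $X_i$ for vertex $i$, while $Route$ deterministically encodes the edge set $E$ of $G$. The lineage of $\query$ then specializes to $\Phi_G = \sum_{(i,j) \in E} X_iX_j$, and for $k = 2$,
\begin{equation*}
\widetilde{\Phi_G^2}(\prob,\ldots,\prob) \;=\; |E|\cdot\prob^2 \;+\; 2\cdot\left(\text{\# 2-paths}\right)\cdot\prob^3 \;+\; 2\cdot\left(\text{\# 2-matchings}\right)\cdot\prob^4,
\end{equation*}
since squaring a single edge term yields a monomial over two variables, two edges sharing a vertex yield one over three variables, and two disjoint edges yield one over four; recovering such subgraph counts from the coefficients is what underlies the hardness argument.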
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl})\AH{Aren't they the same?}. Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.

@ -43,11 +43,11 @@
\newcommand{\db}{D}
\newcommand{\query}{Q}
\newcommand{\tset}{\mathcal{T}}%the set of tuples in a database
\newcommand{\join}{\Join}
\newcommand{\join}{\mathlarger\Join}
\newcommand{\select}{\sigma}
\newcommand{\project}{\pi}
\newcommand{\union}{\cup}
\newcommand{\rename}{\rho}
\newcommand{\rename}{\mathlarger\rho}
\newcommand{\sch}{sch}
\newcommand{\attr}[1]{attr\left(#1\right)}

@ -7,6 +7,7 @@
\usepackage{caption}%caption for table
\usepackage{booktabs}
\usepackage{bm}%for math mode bold font
\usepackage{relsize}%\mathlarger
\usepackage{algpseudocode}
\usepackage{algorithm}