More changes based on @atri's 072021 suggestions.

master
Aaron Huber 2021-07-27 12:23:06 -04:00
parent 6ebf335e90
commit 598687320e
2 changed files with 24 additions and 11 deletions


@ -156,10 +156,10 @@ Note that any $\poly$ in factorized form is equivalent to its \abbrSMB expansion
\subsection{Proof for Lemma~\ref{lem:exp-poly-rpoly}}
\subsection{Proof for Lemma~\ref{lem:exp-poly-rpoly}}\label{subsec:proof-exp-poly-rpoly}
\begin{proof}
Let $\poly$ be the generalized polynomial, i.e., the polynomial of $\numvar$ variables with highest degree $= B$: %, in which every possible monomial permutation appears,
\[\poly(X_1,\ldots, X_\numvar) = \sum_{\vct{d} \in \{0,\ldots, B\}^\numvar}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar X_i^{d_i}\].
\[\poly(X_1,\ldots, X_\numvar) = \sum_{\vct{d} \in \{0,\ldots, B\}^\numvar}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar X_i^{d_i}.\]
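For concreteness (an illustrative instantiation, not part of the original argument), taking $\numvar = 2$ and $B = 1$ yields
\[\poly(X_1, X_2) = c_{(0,0)} + c_{(1,0)}X_1 + c_{(0,1)}X_2 + c_{(1,1)}X_1X_2.\]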
Then, denoting the corresponding exponent vector $\vct{d}$ for a world $\vct{\wElem}$ over the set of valid worlds $\valworlds$ as $\vct{d} \in \valworlds$, in expectation we have
\begin{align}
\expct_{\vct{W}}\pbox{\poly(\vct{W})} &= \sum_{\vct{d} \in \valworlds}c_{\vct{d}}\cdot \expct_{\vct{w}}\pbox{\prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar w_i^{d_i}}\label{p1-s1}\\


@ -48,16 +48,17 @@ There exist some queries for which \emph{bag}-\abbrPDB\xplural are a more natura
The semantics of $\query(\pdb)$ in bag-\abbrPDB\xplural allow output tuples to appear \emph{more} than once, which is naturally captured by a lineage polynomial with standard polynomial addition and multiplication operators. In this setting, linearity of expectation holds over the standard addition operator of the lineage polynomial, and given a standard monomial basis (\abbrSMB) representation of the lineage polynomial, the complexity of computing step two is linear in the size of the lineage polynomial. This holds because the addition and multiplication operators in \cref{fig:nxDBSemantics} are those of the $\semN$-semiring, so computing the expected count enjoys linearity of expectation over addition; moreover, since \abbrSMB has no factorization, the monomials with dependent multiplicative variables are known up front without any additional operations. Thus, the expected count can indeed be computed by the same order of operations as contained in $\poly$. This result, coupled with the prevalence of sum-of-products\footnote{A sum of products differs from \abbrSMB in allowing an arbitrary monomial $m_i$ to appear in the polynomial more than once, whereas \abbrSMB requires all equal monomials to be combined into one, so that each monomial appearing in \abbrSMB is unique. The complexity difference between the two representations is at most a constant factor.} representations amongst most well-known \abbrPDB implementations, may partially explain why the bag-\abbrPDB query problem has long been thought to be easy.
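The linear step-two computation over an \abbrSMB representation can be sketched as follows (a minimal Python sketch with hypothetical tuple probabilities and a toy multilinear polynomial, not the paper's implementation):

```python
from itertools import product

# Hypothetical TIDB tuple probabilities (illustrative values only).
p = [0.9, 0.5, 0.5, 1.0]

# Phi = 2*X0*X1 + X2*X3 + X1*X2 as (coefficient, variable-index list) pairs.
# Monomials here are multilinear (distinct variables per monomial).
smb = [(2, [0, 1]), (1, [2, 3]), (1, [1, 2])]

def expected_count(smb, p):
    """One linear pass: by linearity of expectation, E[Phi] is the sum of
    monomial expectations; by independence, E[Xi*Xj] = E[Xi]*E[Xj]."""
    total = 0.0
    for coeff, idxs in smb:
        term = coeff
        for i in idxs:
            term *= p[i]
        total += term
    return total

def eval_monomial(bits, idxs):
    v = 1
    for i in idxs:
        v *= bits[i]
    return v

def brute(smb, p):
    """Sanity check: enumerate all 2^n possible worlds."""
    exp = 0.0
    for bits in product([0, 1], repeat=len(p)):
        w = 1.0
        for pi, b in zip(p, bits):
            w *= pi if b else 1 - pi
        exp += w * sum(c * eval_monomial(bits, idxs) for c, idxs in smb)
    return exp

assert abs(expected_count(smb, p) - brute(smb, p)) < 1e-9
```

The one-pass computation agrees exactly with brute-force enumeration over all possible worlds, illustrating why \abbrStepTwo is linear for \abbrSMB inputs.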
The main insight of the paper is that we should not stop here. One can have compact representations of $\poly(\vct{X})$ resulting from, for example, optimizations like projection push-down which produce factorized representations of $\poly(\vct{X})$. To capture such factorizations, this work uses (arithmetic) circuits as the representation system of $\poly(\vct{X})$, which are a natural fit to $\raPlus$ queries as each operator maps to either a $\circplus$ or $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07} (as shown in \cref{fig:nxDBSemantics}).
%Our work explores whether or not \abbrStepTwo in the computation model is \emph{always} in the same complexity class as deterministic query evaluation, when \abbrStepOne of $\query(\pdb)$ is easy.
We examine the class of queries whose lineage computation in step one is lower bounded by the query runtime of step one. Consider the bag-\abbrTIDB $\pdb$. Denote the probability of a tuple $\tup_i$ as $\prob_i$. When $\prob_i = 1$ for all $i$ in $[\numvar]$, the problem of computing the expected count is linear in the size of the arithmetic circuit, since we can push expectation through products of variables, and we have polytime complexity for computing $\query(\pdb)$. Is this the general case? This leads us to our problem statement:
%Our work explores whether or not \abbrStepTwo in the computation model is \emph{always} in the same complexity class as deterministic query evaluation, when \abbrStepOne of $\query(\pdb)$ is easy. We examine the class of queries whose lineage computation in step one is lower bounded by the query runtime of step one.
Let us consider $\poly(\vct{X})$ for an arbitrary $\tup$ in $\query(\pdb)$ such that $\pdb$ is a bag-\abbrTIDB. Denote the probability of a tuple $\tup_i$ as $\prob_i$, that is, $\probOf\pbox{X_i} = \prob_i$ for all $i$ in $[\numvar]$. Consider the special case when $\pdb$ is a deterministic database with one possible world $\db$. In this case, $\prob_i = 1$ for all $i$ in $[\numvar]$, and it can be seen that the problem of computing the expected count is linear in the size of the arithmetic circuit, since we can essentially push expectation through multiplication of variables dependent on one another\footnote{For example in this special case, computing $\expct\pbox{(X_iX_j + X_\ell X_k)^2}$ does not require product expansion, since for a \abbrTIDB variable $X_i$ we have $\expct\pbox{X_i^h} = \prob_i$ for any $h \geq 1$.}. This means that \abbrStepTwo is upper bounded by \abbrStepOne and we always have deterministic query runtime for $\query\inparen{\pdb}$ up to a constant factor for this special case. Is this the general case? This leads us to our problem statement:
\begin{Problem}\label{prob:intro-stmt}
Given a query $\query$ in $\raPlus$ and bag \abbrPDB $\pdb$, what is the complexity (in the size of the circuit representation) of computing step two ($\expct\pbox{\poly(\vct{X})}$) for each tuple $\tup$ in the output of $\query(\pdb)$?
\end{Problem}
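Returning to the special case above (all $\prob_i = 1$), the claim that no product expansion is needed can be checked directly by brute force (a minimal Python sketch of the footnote's example; the variable names and polynomial are illustrative):

```python
from itertools import product

# Deterministic special case: every tuple probability is 1, so each
# Bernoulli variable is constantly 1 and E[Phi(W)] = Phi(1,...,1).
def phi(x1, x2, x3, x4):
    # The footnote's example polynomial, (XiXj + XlXk)^2
    return (x1 * x2 + x3 * x4) ** 2

probs = [1.0, 1.0, 1.0, 1.0]
expectation = 0.0
for bits in product([0, 1], repeat=4):
    weight = 1.0
    for pr, b in zip(probs, bits):
        weight *= pr if b else 1 - pr  # only the all-ones world has weight
    expectation += weight * phi(*bits)

# No product expansion needed: evaluating at the all-ones point suffices.
assert expectation == phi(1, 1, 1, 1) == 4
```

Here \abbrStepTwo collapses to a single evaluation of the circuit, matching the claimed linear complexity for this special case.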
We show, for the class of \abbrTIDB\xplural with $\prob_i < 1$, the problem of computing step two in general is no longer linear in the size of the lineage polynomial representation.
We show, for the class of \abbrTIDB\xplural with $0 < \prob_i < 1$, the problem of computing \abbrStepTwo in general is no longer linear in the size of the lineage polynomial representation.
Our work further introduces an approximation algorithm for the expected count of $\tup$ in the output of the bag-\abbrPDB query $\query$, which runs in linear time.
As noted, bag-\abbrPDB query output is a probability distribution over the possible multiplicities of $\tup$, which is a stark contrast to the marginal probability ($\expct\pbox{\poly\inparen{\vct{X}}}$) paradigm of set-\abbrPDB\xplural. We focus on computing the expected count ($\expct\pbox{\poly\inparen{\vct{X}}}$) of $\tup$ as a natural statistic to develop the theoretical foundations of bag-\abbrPDB complexity. Other statistical measures are beyond the scope of this paper, though we consider higher moments in the appendix.
As noted, bag-\abbrPDB query output is a probability distribution over the possible multiplicities of $\tup$, which is a stark contrast to the marginal probability ($\expct\pbox{\poly\inparen{\vct{X}}}$) paradigm of set-\abbrPDB\xplural. To address the question of whether bag-\abbrPDB\xplural are easy, we focus on computing the expected count ($\expct\pbox{\poly\inparen{\vct{X}}}$) of $\tup$ as a natural statistic to develop the theoretical foundations of bag-\abbrPDB complexity. Other statistical measures are beyond the scope of this paper, though we consider higher moments in the appendix.
Our work focuses on the following setting for query computation. Inputs of $\query$ are set-\abbrPDB\xplural, while the output of $\query$ is a bag-\abbrPDB. This setting, however, is not limiting as a simple generalization exists, reducing a bag \abbrPDB to a set \abbrPDB with typically only an $O(c)$ increase in size, for $c = \max_{\tup \in \db}\db\inparen{\tup}$.
@ -66,7 +67,7 @@ Our work focuses on the following setting for query computation. Inputs of $\qu
%%%%%%%%%%%%%%%%%%%%%%%%%
Concretely, we make the following contributions:
(i) We show that \cref{prob:intro-stmt} for bag-\abbrTIDB\xplural is \sharpwonehard in the size of the lineage circuit by reduction from counting the number of $k$-matchings over an arbitrary graph; we further show superlinear hardness for a specific cubic graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
(ii) We present a $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-\abbrTIDB\xplural and $\raPlus$ queries; we further show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their FAQ followups~\cite{DBLP:conf/pods/KhamisNR16}) it has complexity linear in the size of the compressed lineage encoding (in contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic); (iii) We generalize the approximation algorithm to a class of bag-\abbrBIDB\xplural, a more general model of probabilistic data; (iv) We further prove that for $\raPlus$ queries
(ii) We present a $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-\abbrTIDB\xplural and $\raPlus$ queries; we further show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ) followups~\cite{DBLP:conf/pods/KhamisNR16}) its runtime is linear in the size of the compressed lineage encoding (in contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic\footnote{Note that this does not rule out queries for which approximation is linear.}); (iii) We generalize the approximation algorithm to a class of bag-Block Independent Disjoint databases (\abbrBIDB\xplural; see \cref{subsec:tidbs-and-bidbs}), a more general model of probabilistic data; (iv) We further prove that for $\raPlus$ queries
\AH{This point \emph{\Large seems} weird to me. I thought we just said that the approximation complexity is linear in step one, but now it's as if we're saying that it's $\log{\text{step one}} + $ the runtime of step one. Where am I missing it?}
we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
@ -81,7 +82,7 @@ The lineage polynomial for $Q^2$ is given by $\Phi^2$:
\left(L_aR_aL_b + L_bR_bL_d + L_bR_cL_c\right)^2\\
=L_a^2R_a^2L_b^2 + L_b^2R_b^2L_d^2 + L_b^2R_c^2L_c^2 + 2L_aR_aL_b^2R_bL_d + 2L_aR_aL_b^2R_cL_c + 2L_b^2R_bL_dR_cL_c.
\end{multline*}
The expectation $\expct\pbox{\Phi^2}$ then is:
By exploiting linearity of expectation over the summand terms, and pushing expectation through products of independent \abbrTIDB variables, the expectation $\expct\pbox{\Phi^2}$ is then:
\begin{footnotesize}
\begin{multline*}
\expct\pbox{L_a^2}\expct\pbox{R_a^2}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{R_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{R_c^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b^2}\expct\pbox{R_b}\expct\pbox{L_d}\\
@ -97,16 +98,28 @@ The expectation $\expct\pbox{\Phi^2}$ then is:
\end{footnotesize}
\noindent This property leads us to consider a structure related to the lineage polynomial.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the SOP form of $\poly(\vct{X})$ to $1$.
For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the \abbrSMB form of $\poly(\vct{X})$ to $1$.
\end{Definition}
With $\Phi^2$ as an example, we have:
\begin{align*}
&\widetilde{\Phi^2}(L_a, L_b, L_c, L_d, R_a, R_b, R_c)\\
&\; = L_aR_aL_b + L_bR_bL_d + L_bR_cL_c + 2L_aR_aL_bR_bL_d + 2L_aR_aL_bR_cL_c + 2L_bR_bL_dR_cL_c
\end{align*}
It can be verified that the reduced polynomial parameterized with each variable's respective marginal probability is a closed form of the expected count (i.e., $\expct\pbox{\Phi^2} = \widetilde{\Phi^2}(\probOf\pbox{L_a=1},\ldots,\probOf\pbox{R_c=1})$). In fact, we show in \Cref{lem:exp-poly-rpoly} that this equivalence holds for {\em all} $\raPlus$ queries over TIDB/BIDB.
It can be verified that the reduced polynomial parameterized with each variable's respective marginal probability is a closed form of the expected count (i.e., $\expct\pbox{\Phi^2} = \widetilde{\Phi^2}(\probOf\pbox{L_a=1},\ldots,\probOf\pbox{R_c=1})$). In fact, the following lemma shows that this equivalence holds for {\em all} $\raPlus$ queries over TIDB (proof in \cref{subsec:proof-exp-poly-rpoly}).
\begin{Lemma}
Let $\pdb$ be a \abbrTIDB over variables $\vct{X} = \{X_1,\ldots,X_\numvar\}$, with the probability distribution $\pd$ over the set of all $2^\numvar$ possible worlds $\vct{W}$ induced by the probability vector $\probAllTup = \inparen{\prob_1,\ldots,\prob_\numvar}$ of individual tuple probabilities. For any \abbrTIDB-lineage polynomial $\poly\inparen{\vct{X}}$ based on $\query\inparen{\pdb}$, the following holds:
\begin{equation*}
\expct_{\vct{W} \sim \pd}\pbox{\poly\inparen{\vct{W}}} = \rpoly\inparen{\probAllTup}.
\end{equation*}
\end{Lemma}
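The lemma can be checked numerically on the running example by enumerating all $2^7$ possible worlds (a self-contained Python sketch; the marginal probabilities below are hypothetical, not those of the paper's figure):

```python
from itertools import product

# Hypothetical marginal probabilities for the seven tuple variables.
p = {"La": 0.9, "Lb": 0.5, "Lc": 0.5, "Ld": 1.0,
     "Ra": 1.0, "Rb": 1.0, "Rc": 0.4}
names = list(p)

def phi(v):
    # Lineage polynomial Phi = LaRaLb + LbRbLd + LbRcLc
    return (v["La"]*v["Ra"]*v["Lb"] + v["Lb"]*v["Rb"]*v["Ld"]
            + v["Lb"]*v["Rc"]*v["Lc"])

# Left side: E[Phi^2], by exact enumeration of all 2^7 worlds.
exp_phi2 = 0.0
for bits in product([0, 1], repeat=len(names)):
    v = dict(zip(names, bits))
    weight = 1.0
    for n, b in zip(names, bits):
        weight *= p[n] if b else 1 - p[n]
    exp_phi2 += weight * phi(v) ** 2

def rphi2(v):
    # Right side: reduced polynomial ~Phi^2 (all exponents set to 1).
    return (v["La"]*v["Ra"]*v["Lb"] + v["Lb"]*v["Rb"]*v["Ld"]
            + v["Lb"]*v["Rc"]*v["Lc"]
            + 2*v["La"]*v["Ra"]*v["Lb"]*v["Rb"]*v["Ld"]
            + 2*v["La"]*v["Ra"]*v["Lb"]*v["Rc"]*v["Lc"]
            + 2*v["Lb"]*v["Rb"]*v["Ld"]*v["Rc"]*v["Lc"])

assert abs(exp_phi2 - rphi2(p)) < 1e-9
```

The two quantities agree exactly (up to floating-point rounding), as the lemma predicts for any choice of marginals.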
To prove our hardness result we show that for the same $\query$ considered in the query above, the query $\query^k$ is able to encode various hard graph-counting problems\footnote{Note that for $k > 2$, the set of tuples is larger than in the $\query^2$ above.}. We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $\query$). For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e. evaluating the original lineage polynomial over the probability values) is a constant factor approximation. Recall $\query^2$ from above, and note that
\begin{equation*}
\poly^2\inparen{\probAllTup} = \inparen{0.9\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 0.5}^2 = 2.7225 < 3.45 = \rpoly^2\inparen{\probAllTup}
\end{equation*}
%For example, if we know that $\prob_0 = \max_{i \in [\numvar]}\prob_i$, then $\poly(\prob_0,\ldots, \prob_0)$ is an upper bound constant factor approximation. Consider the first output tuple of \cref{fig:two-step}. Here, we set $\prob_0 = 1$, and the approximation $\poly\inparen{\vct{1}} = 1 \cdot 1 = 1$. The opposite holds true for determining a constant factor lower bound.
To get a $(1\pm \epsilon)$-multiplicative approximation we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.
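The sampling idea can be sketched on the running example's \abbrSMB form (a minimal Python illustration with hypothetical probabilities; the paper's actual algorithm operates over circuits, not an explicit monomial list):

```python
import random

random.seed(0)

# Hypothetical marginal probabilities for the seven tuple variables.
p = {"La": 0.9, "Lb": 0.5, "Lc": 0.5, "Ld": 1.0,
     "Ra": 1.0, "Rb": 1.0, "Rc": 0.4}

# (coefficient, variable multiset) pairs for Phi^2 of the running example.
monomials = [
    (1, ["La", "La", "Ra", "Ra", "Lb", "Lb"]),
    (1, ["Lb", "Lb", "Rb", "Rb", "Ld", "Ld"]),
    (1, ["Lb", "Lb", "Rc", "Rc", "Lc", "Lc"]),
    (2, ["La", "Ra", "Lb", "Lb", "Rb", "Ld"]),
    (2, ["La", "Ra", "Lb", "Lb", "Rc", "Lc"]),
    (2, ["Lb", "Lb", "Rb", "Ld", "Rc", "Lc"]),
]
total = sum(c for c, _ in monomials)

def contribution(vars_):
    # The monomial's "adjusted" contribution to ~Phi: exponents collapse
    # to 1, so multiply marginals over the *distinct* variables only.
    out = 1.0
    for v in set(vars_):
        out *= p[v]
    return out

exact = sum(c * contribution(vs) for c, vs in monomials)

# Draw monomials with probability proportional to their coefficients;
# the scaled sample mean is an unbiased estimator of ~Phi(p).
n = 200_000
coeffs = [c for c, _ in monomials]
samples = random.choices(monomials, weights=coeffs, k=n)
estimate = total * sum(contribution(vs) for _, vs in samples) / n

assert abs(estimate - exact) / exact < 0.02
```

Each draw is cheap and the estimator is unbiased, which is the intuition behind the multiplicative approximation guarantee; the error bound above holds with high probability for this sample size.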
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl})\AH{Aren't they the same?}. Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.