Try this one neat trick to save 2 pages :)

master
Oliver Kennedy 2020-12-19 16:46:26 -05:00
parent 183ca09eae
commit fd767a507b
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
7 changed files with 62 additions and 52 deletions

View File

@ -7,8 +7,8 @@
For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage, if this polynomial is encoded as a sum of products.
However, using a reduction from the problem of counting k-matchings, we demonstrate that calculating the expectation is \sharpwonehard when the polynomial is compressed, for example through factorization.
As we show, this result has a significant implication: a Bag-PDB doing exact computations will never be as fast as a classical (deterministic) database.
The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $p$ (where $p \neq 0$ and $p \neq 1$).
We then proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs). We develop an algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of such polynomials in linear time in the size of the polynomial, paving the way for PDBs that can be competitive with deterministic databases.
The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $p$ (s.t. $p \not \in \{0,1\}$).
We proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs). We develop an algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of such polynomials in linear time in the size of the polynomial, paving the way for PDBs competitive with deterministic databases.
% \AH{High-level intuition}
% \BG{Most people think that computing expected multiplicity of an output tuple in a probabilistic database (PDB) is easy. Due to the fact that most modern implementations of PDBs represent tuple lineage in their expanded form, it has to be the case that such a computation is linear in the size of the lineage. This follows since, when we have an uncompressed lineage, linearity allows for expectation to be pushed through the sum.}
% \AH{Low-level why-would-an-expert-read-this}

View File

@ -3,15 +3,14 @@
\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.
Unlike the results in~\Cref{sec:hard} our approximation algorithm works for \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis, as well as the PDBench benchmark.
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
We now design an approximation algorithm for our problem that does run in {\em linear time}.
The algorithm itself works for any \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis, as well as the PDBench benchmark.
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
\subsection{Preliminaries and some more notation}
First, let us introduce some useful definitions and notation related to polynomials and their representations. For illustrative purposes in the definitions below, we use the following %{\em bivariate}
polynomial:
First, we introduce some useful definitions and notation related to polynomials. The following polynomial serves as an example:
\begin{equation}
\label{eq:poly-eg}
\poly(X, Y) = 2X^2 + 3XY - 2Y^2.
@ -32,7 +31,8 @@ For example the monomial $XY$ has $\var(XY)=\inset{X,Y}$.
%Note that $\etree$ need not encode an expression in the standard monomial basis. For instance, $\etree$ could represent a compressed form of the polynomial in~\Cref{eq:poly-eg}, such as $(x + 2y)(2x - y)$.
\begin{Definition}[$\polyf(\cdot)$]\label{def:poly-func}
Denote $\polyf(\etree)$ to be the function that takes as input expression tree $\etree$ and outputs its corresponding polynomial. $poly(\cdot)$ is recursively defined on $\etree$ as follows, where $\etree_\lchild$ and $\etree_\rchild$ denote the left and right child of $\etree$ respectively.
Denote $\polyf(\etree)$ to be the function from expression tree $\etree$ to its corresponding polynomial. $poly(\cdot)$ is recursively defined on $\etree$ as follows, where $\etree_\lchild$ and $\etree_\rchild$ denote the left and right child of $\etree$ respectively.
With addition and multiplication following the standard interpretation:
%
% \begin{align*}
% &\etree.\type = +\mapsto&& \polyf(\etree_\lchild) + \polyf(\etree_\rchild)\\
@ -48,8 +48,6 @@ Denote $\polyf(\etree)$ to be the function that takes as input expression tree $
\end{cases}
\end{equation*}
\end{Definition}
%
Note that addition and multiplication above follow the standard interpretation over polynomials.
%Specifically, when adding two monomials whose variables and respective exponents agree, the coefficients corresponding to the monomials are added and their sum is multiplied to the monomial. Multiplication here is denoted by concatenation of the monomial and coefficient. When two monomials are multiplied, the product of each corresponding coefficient is computed, and the variables in each monomial are multiplied, i.e., the exponents of like variables are added. Again we notate this by the direct product of coefficient product and all disitinct variables in the two monomials, with newly computed exponents.
%\begin{Definition}[Expression Tree Set]\label{def:express-tree-set}$\etreeset{\smb}$ is the set of all possible expression trees $\etree$, such that $poly(\etree) = \poly(\vct{X})$.
@ -68,15 +66,17 @@ $\expandtree{\etree}$ is the (pure) sum of products expansion of $\etree$, which
% &\etree.\type = \tnum \mapsto&& \elist{(\emptyset, \etree.\val)}\\
% &\etree.\type = \var \mapsto&& \elist{(\etree.\val, 1)}
% \end{align*}
\begin{align*}
&\expandtree{\etree} = \\
&\begin{cases}
{\small
\begin{multline*}
\expandtree{\etree} =
\begin{cases}
\expandtree{\etree_\lchild} \circ \expandtree{\etree_\rchild} &\textbf{ if }\etree.\type = +\\
\left\{(\monom_\lchild \cup \monom_\rchild, \coef_\lchild \cdot \coef_\rchild) ~|~\right.&\\ \left.(\monom_\lchild, \coef_\lchild) \in \expandtree{\etree_\lchild}, (\monom_\rchild, \coef_\rchild) \in \expandtree{\etree_\rchild}\right\} &\textbf{ if }\etree.\type = \times\\
\left\{(\monom_\lchild \cup \monom_\rchild, \coef_\lchild \cdot \coef_\rchild) ~|~\right.&\\ \quad \left.(\monom_\lchild, \coef_\lchild) \in \expandtree{\etree_\lchild}, (\monom_\rchild, \coef_\rchild) \in \expandtree{\etree_\rchild}\right\} &\textbf{ if }\etree.\type = \times\\
\elist{(\emptyset, \etree.\val)} &\textbf{ if }\etree.\type = \tnum\\
\elist{(\{\etree.\val\}, 1)} &\textbf{ if }\etree.\type = \var.\\
\end{cases}
\end{align*}
\end{multline*}
}
\end{Definition}
%where that the multiplication of two tuples %is the standard multiplication over monomials and the standard multiplication over coefficients to produce the product tuple, as in
%is their direct product $(\monom_1, \coef_1) \cdot (\monom_2, \coef_2) = (\monom_1 \cdot \monom_2, \coef_1 \times \coef_2)$ such that monomials $\monom_1$ and $\monom_2$ are concatenated in a product operation, while the standard product operation over reals applies to $\coef_1 \times \coef_2$. The product of $\expandtree{\etree_\lchild} \cdot \expandtree{\etree'_\rchild}$ is then the cross product of the multiplication of all such tuples returned to both $\expandtree{\etree_\lchild}$ and $\expandtree{\etree_\rchild}$. %The operator $\otimes$ is defined as the cross-product tuple multiplication of all such tuples returned by both $\expandtree{\etree_\lchild}$ and $\expandtree{\etree_\rchild}$.
@ -121,7 +121,7 @@ Consider the factorized representation $(X+ 2Y)(2X - Y)$ of the polynomial in~\C
\draw[<-|, highlight_color] (root) -- (t-label);
\end{tikzpicture}
}
\vspace*{-2mm}
\caption{Expression tree $\etree$ for the product $\boldsymbol{(x + 2y)(2x - y)}$.}
\label{fig:expr-tree-T}
\trimfigurespacing
@ -177,7 +177,7 @@ In particular, if $p_0>0$ and $\gamma<1$ are absolute constants then the above r
The proof for~\Cref{cor:approx-algo-const-p} can be seen in~\Cref{sec:proofs-approx-alg}.
We note that the restriction on $\gamma$ is satisfied by \ti (where $\gamma=0$) as well as for the three queries of the popular PDBench \bi benchmark (see \Cref{app:subsec:experiment}).
The restriction on $\gamma$ is satisfied by \ti (where $\gamma=0$) as well as for all queries of the PDBench \bi benchmark (see \Cref{app:subsec:experiment}).
\AH{I am thinking that perhaps the terminology and presentation of~\Cref{sec:experiments} may need word-smithing to clearly illustrate the $\bi$ benchmarks satisfied--although the substance is already written there.}
\AR{Yes! E.g. $\gamma$ is not used at all in~\Cref{sec:experiments}}
\AR{{\bf Boris/Oliver:} Is there a way to claim that all probabilities in practice are actually constants: i.e. they do not increase with the number of tuples?}
@ -187,12 +187,13 @@ We note that the restriction on $\gamma$ is satisfied by \ti (where $\gamma=0$)
The algorithm to prove~\Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=poly(\etree)$ for expression tree $\etree$ over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
\begin{equation}
\label{eq:tilde-Q-bi}
\rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(v,c)\in \expandtree{\etree}} \hspace*{-2mm} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot c\cdot\hspace*{-1mm}\prod_{X_i\in \var\inparen{v}}\hspace*{-2mm} X_i.
\rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(v,c)\in \expandtree{\etree}} \hspace*{-2mm} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot c\cdot\hspace*{-2mm}\prod_{X_i\in \var\inparen{v}}\hspace*{-2mm} X_i
\end{equation}
Given the above, the algorithm is a sampling based algorithm for the above sum: we sample $(v,c)\in \expandtree{\etree}$ with probability proportional\footnote{We could have also uniformly sampled from $\expandtree{\etree}$ but this gives better parameters.}
\approxq (\Cref{alg:mon-sam}) is a sampling-based algorithm for the above sum: we sample $(v,c)\in \expandtree{\etree}$ with probability proportional\footnote{We could have also uniformly sampled from $\expandtree{\etree}$ but this gives better parameters.}
%\AH{Regarding the footnote, is there really a difference? I \emph{suppose} technically, but in this case they are \emph{effectively} the same. Just wondering.}
%\AR{Yes, there is! If we used uniform distribution then in our bounds we will have a parameter that depends on the largest $\abs{coef}$, which e.g. could be dependent on $n$. But with the weighted probability distribution, we avoid paying this price. Though I guess perhaps we can say for the kinds of queries we consider thhese coefficients are all constants?}
to $\abs{c}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{v}} p_i$. Taking $\numsamp$ samples and computing the average of $Y$ gives us our final estimate. Algorithm~\ref{alg:mon-sam} has the details for $\approxq$. The derivation of $\numsamp$ (\Cref{alg:mon-sam-global2}) can be found in~\Cref{app:subsec-th-mon-samp}, and from those results, one can further see that
to $\abs{c}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{v}} p_i$. Taking $\numsamp$ samples and computing the average of $Y$ gives us our final estimate.
The number of samples is computed by (see \Cref{app:subsec-th-mon-samp}):
\begin{equation*}
2\exp{\left(-\frac{\samplesize\error^2}{2}\right)}\leq \conf \implies\samplesize \geq \frac{2\log{\frac{2}{\conf}}}{\error^2}.
%\exp{\left(-\frac{\samplesize\error^2}{2}\right)}\leq \frac{\conf}{2}\\

View File

@ -15,8 +15,8 @@ This finding shows that even Bag-PDB query processing has a higher complexity th
The fundamental challenge is lineage formulas, a key component of query processing in PDBs.
Using the standard (i.e., DNF) encoding of lineage, computing typical statistics like marginal probabilities or moments is easy (i.e., $O(|\text{lineage}|)$) for bags and hence, perhaps not worthy of research attention, but hard (i.e., $O(2^{|\text{lineage}|})$) for sets and hence, interesting from a research perspective.
However, the standard encoding is unnecessarily large, and so even for Bag-PDBs, computing such statistics from lineage formulas still has a higher complexity than answering queries in a deterministic (i.e., non-probabilistic) database.
In this paper, we formally prove this limitation of PDBs and address it by proposing an algorithm that, to the best of our knowledge, is the first $(1-\epsilon)$-approximation for expectations of counts to have a runtime within a constant factor of deterministic query processing.\footnote{
However, the standard encoding is unnecessarily large, and so even for Bag-PDBs, computing such statistics still has a higher complexity than answering queries in a deterministic (i.e., non-probabilistic) database.
In this paper, we formally prove this limitation of PDBs and address it by proposing an algorithm that, to our knowledge, is the first $(1-\epsilon)$-approximation for expectations of counts to have a runtime within a constant factor of deterministic query processing.\footnote{
MCDB~\cite{jampani2008mcdb} is also a constant factor slower, but only guarantees additive bounds.
}
@ -28,18 +28,18 @@ The expectation of the multiplicity is the expectation of this polynomial.
Lineage in Set-PDBs is typically encoded in disjunctive normal form.
This representation is significantly larger than the query result sans lineage.
However, even with alternative encodings~\cite{FH13}, the limiting factor in computing marginal probabilities remains the probability computation itself, and not the lineage formula.
The corresponding lineage encoding for Bag-PDBs is a polynomial in sum of products (SOP) form --- a sum of `clauses', each of which is the product of a set of integer or variable atoms.
However, even with alternative encodings~\cite{FH13}, the limiting factor in computing marginal probabilities is the probability computation itself, not the lineage formula.
The Bag-PDB analog is a polynomial in sum of products (SOP) form --- a sum of `clauses', each the product of a set of integer or variable atoms.
Thanks to linearity of expectation, computing the expectation of a count query is linear in the number of clauses in the SOP polynomial.
Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general.
Compressed representations like Factorized Databases~\cite{factorized-db,DBLP:conf/tapp/Zavodny11} or Arithmetic/Polynomial Circuits~\cite{arith-complexity} are analogous to deterministic query optimizations (e.g. pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}.
Thus, measuring the performance of a PDB algorithm in terms of the size of the \emph{compressed} lineage formula more closely relates the algorithm's performance to the complexity of query evaluation in a deterministic database.
The initial picture is not good.
We prove that computing expected counts is \emph{not} linear in the size of a compressed --- specifically a factorized~\cite{factorized-db} --- lineage polynomial by reduction from counting $k$-matchings.
Thus, even Bag-PDBs can not enjoy the same time complexity as deterministic databases.
This motivates our second goal, a linear time (in the size of the factorized lineage) approximation of expected counts for SPJU query results over Bag-PDBs.
We also show that the size of the factorized lineage formula for a query (and by extension, our approximation algorithm) has complexity proportional to the same query on a deterministic database.
We prove that computing expected counts is \emph{not} linear in the size of a compressed (factorized~\cite{factorized-db}) lineage polynomial by reduction from counting $k$-matchings.
Thus, even Bag-PDBs do not enjoy the same time complexity as deterministic databases.
This motivates our second goal, a linear time (in the size of the factorized lineage) approximation of expected counts for SPJU results in Bag-PDBs.
As we also show, this complexity is proportional to the same query on a deterministic database.
% In other words, our approximation algorithm can estimate expected multiplicities for tuples in the result of an SPJU query with a complexity comparable to deterministic query-processing.
\subsection{Sets vs Bags}
@ -85,8 +85,7 @@ We also show that the size of the factorized lineage formula for a query (and by
% %\caption{Atom 2 of query $\poly$ in ~\Cref{intro:ex}}
% \label{subfig:ex-atom2}
% \end{subfigure}
\vspace*{-3mm}
\caption{$\ti$ relations for $\poly$}
\label{fig:intro-ex}
\trimfigurespacing
@ -107,7 +106,7 @@ We also show that the size of the factorized lineage formula for a query (and by
%\end{figure}
\begin{Example}\label{ex:intro}
Consider the Tuple Independent ($\ti$) Set-PDB\footnote{Our work also handles Block Independent Disjoint Databases ($\bi$)~\cite{BD05,DBLP:series/synthesis/2011Suciu} and we return to this model later.} given in \Cref{fig:intro-ex} with two input relations $R$ and $E$.
Consider the Tuple Independent ($\ti$) Set-PDB\footnote{Our work does also handle Block Independent Databases ($\bi$)~\cite{BD05,DBLP:series/synthesis/2011Suciu}.} given in \Cref{fig:intro-ex} with two input relations $R$ and $E$.
Each input tuple is assigned an annotation (attribute $\Phi_{set}$): an independent random Boolean variable ($W_i$) or the constant $\top$.
% Each assignment of values to variables ($\{\;W_a,W_b,W_c\;\}\mapsto \{\;\top,\bot\;\}$) \SF{Do we need to state the meaning of $\top$ and $\bot$? Also do we want to add bag annotation to Figure 1 too since we are discussing both sets and bags later?} identifies one \emph{possible world}, a deterministic database instance that contains exactly the tuples annotated by the constant $\top$ or by a variable assigned to $\top$.
The probability of this world is the joint probability of the corresponding assignments.
@ -236,25 +235,24 @@ This factorized expression can be easily modeled as an expression tree, as in \C
\begin{equation*}
W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c + 2W_aW_bW_c^2.
\end{equation*}
The expectation then is
\begin{align*}
&\expct\pbox{\poly^2(W_a, W_b, W_c)}\\
&= \expct\pbox{W_a^2}\expct\pbox{W_b^2} + \expct\pbox{W_b^2}\expct\pbox{W_c^2} + \expct\pbox{W_c^2}\expct\pbox{W_a^2} +\\
&\qquad \expct\pbox{2W_a^2}\expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{2W_a}\expct\pbox{W_b^2}\expct\pbox{W_c} +\\
&\qquad \expct\pbox{2W_a}\expct\pbox{W_b}\expct\pbox{W_c^2}\\
\end{align*}
In our original example, the lineage polynomial for $\poly$ had the nice property that the expected count could be computed by replacing each variable with its probability (i.e., \Cref{eqn:can-inline-probabilities-into-polynomial}).
This property does not hold for $\poly^2$ (i.e., $\expct\pbox{\poly^2} \neq \poly^2(P\pbox{W_a}, P\pbox{W_b}, P\pbox{W_c})$).
However, a similar closed form formula for the expected count might be possible.
Under the assumption that $Dom(W_i) = \{0, 1\}$, it is generally true that for any $k$, $\expct\pbox{W_i^k} = \expct\pbox{W_i}$.
This property leads us to consider another structure related to $\poly$.
The expectation $\expct\pbox{\poly^2(W_a, W_b, W_c)}$ then is:
\begin{multline*}
\expct\pbox{W_a^2}\expct\pbox{W_b^2} + \expct\pbox{W_b^2}\expct\pbox{W_c^2} + \expct\pbox{W_c^2}\expct\pbox{W_a^2} \\
+ \expct\pbox{2W_a^2}\expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{2W_a}\expct\pbox{W_b^2}\expct\pbox{W_c} \\
+ \expct\pbox{2W_a}\expct\pbox{W_b}\expct\pbox{W_c^2}
\end{multline*}
Recall the nice property of $\query$ that its expected count could be computed by evaluating its lineage on the probability vector (i.e., \Cref{eqn:can-inline-probabilities-into-polynomial}).
This property does not hold for $\poly^2$ (i.e., $\expct\pbox{\poly^2} \neq \poly^2(P\pbox{W_a}, P\pbox{W_b}, P\pbox{W_c})$), but does suggest a related closed form formula.
Note that if $Dom(W_i) = \{0, 1\}$, then for any $k > 0$, $\expct\pbox{W_i^k} = \expct\pbox{W_i}$.
This property leads us to consider a structure related to $\poly$.
% \AH{I don't know if we want to include the following statement: \par \emph{ bags are only hard with self-joins }
% \par Atri suggests a proof in the appendix regarding this claim.}
For any polynomial $\poly(\vct{X})$, we define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in $\poly(\vct{X})$ to $1$.
With $\poly^2$ as an example, we have:
\begin{align*}
\rpoly^2(W_a, W_b, W_c) =&\; W_aW_b + W_bW_c + W_cW_a + 2W_aW_bW_c + 2W_aW_bW_c\\
&+ 2W_aW_bW_c\\
\rpoly^2(W_a, W_b, W_c)
% =&\; W_aW_b + W_bW_c + W_cW_a + 2W_aW_bW_c + 2W_aW_bW_c\\
% &+ 2W_aW_bW_c\\
=&\; W_aW_b + W_bW_c + W_cW_a + 6W_aW_bW_c
\end{align*}
%\SF{Should this be like $\tilde{\poly^2}$ to avoid ambiguous?}
@ -269,8 +267,8 @@ This leads us to the \textbf{central question of this paper}:
{\em
Is it always the case that the expectation of a UCQ in a Bag-PDB can be computed in time linear in the size of the \textbf{compressed} lineage polynomial?}
\end{quote}
If the answer is yes, then it is possible for Bag-PDBs to achieve performance competitive with deterministic databases.
The answer, unfortunately, is no, and an approximation algorithm is required.
If so, then Bag-PDBs can compete with deterministic databases.
This is ufortunately not the case, and an approximation is required.
% Consider the :
% \begin{equation*}

View File

@ -549,3 +549,14 @@ Study},
howpublished = {\url{http://pdbench.sourceforge.net/}},
note = {Accessed: 2020-12-15}
}
@article{factorized-db,
author = {Dan Olteanu and
Maximilian Schleich},
journal = {SIGMOD Rec.},
number = {2},
pages = {5--16},
title = {Factorized Databases},
volume = {45},
year = {2016}
}

View File

@ -123,7 +123,7 @@ Consider $\poly(X, Y) = (X + Y)(X + Y)$ where $X$ and $Y$ are from different blo
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Valid Worlds]
For probability distribution $\vct{P}$ and its corresponding PMF $P$, the set of valid worlds $\eta$ corresponds to all worlds that have a probability value greater than $0$, formally, for random variable $\vct{W}$
For probability distribution $\vct{P}$ and its corresponding PMF $P$, the set of valid worlds $\eta$ is the worlds with probability value greater than $0$; i.e., for variable vector $\vct{W}$
\[
\eta = \{\vct{w}\st P[\vct{W} = \vct{w}] > 0\}
\]
@ -143,7 +143,7 @@ We state additional equivalences between $\poly(\vct{X})$ and $\rpoly(\vct{X})$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}\label{lem:exp-poly-rpoly}
Let $\pxdb$ be a \bi over variables $\vct{X} = \{X_1, \ldots, X_\numvar\}$ and with probability distribution $\vct{p} = (\prob_1, \ldots, \prob_\numvar)$ over all $\vct{w}$ in $\eta$. For any \bi-lineage polynomial $\poly(\vct{X})$ based on $\pxdb$ and some query $\query$ we have
Let $\pxdb$ be a \bi over variables $\vct{X} = \{X_1, \ldots, X_\numvar\}$ and with probability distribution $\vct{p} = (\prob_1, \ldots, \prob_\numvar)$ over all $\vct{w}$ in $\eta$. For any \bi-lineage polynomial $\poly(\vct{X})$ based on $\pxdb$ and query $\query$ we have:
% The expectation over possible worlds in $\poly(\vct{X})$ is equal to $\rpoly(\prob_1,\ldots, \prob_\numvar)$.
\begin{equation*}
\expct_{\vct{w}\sim \vct{p}}\pbox{\poly(\vct{W})} = \rpoly(\vct{p}).

View File

@ -62,8 +62,8 @@ $\semNX$-PDBs, a function $\rmod$, which takes an $\semNX$-PDB input and outputs
\[ \expct_{\idb \sim \pd}[\query(\db)(t)] = \expct_{\vct{w} \sim \pd'}\pbox{\polyForTuple(\vct{w})} \]
\end{Proposition}
\noindent A formal proof of \Cref{prop:expection-of-polynom} is given in \Cref{subsec:expectation-of-polynom-proof}.
This proposition shows that computing the expected multiplicity of a query result tuple is equivalent to computing the expectation of a polynomial (for that tuple) from a probability distribution over all possible assignments of variables in the polynomial to $\{0,1\}$.
We focus on this problem exclusively from now on, assume an implicit result tuple, and drop the subscript from $\polyForTuple$ (i.e., $\poly$ is used as a polynomial from this point on).
This proposition shows that computing expected tuple multiplicities is equivalent to computing the expectation of a polynomial (for that tuple) from a probability distribution over all possible assignments of variables in the polynomial to $\{0,1\}$.
We focus on this problem from now on, assume an implicit result tuple, and so drop the subscript from $\polyForTuple$ (i.e., $\poly$ is used as a polynomial from now on).

View File

@ -157,7 +157,7 @@ The $3$-paths in $\graph{3}$ satisfy the identity:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}\label{lem:tri}
For $\ell > 1$, any graph $\graph{\ell}$ has the property that $\numocc{\graph{\ell}}{\tri} = 0$.
For $\ell > 1$ and any graph $\graph{\ell}$, $\numocc{\graph{\ell}}{\tri} = 0$.
\end{Lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%