More changes Intro Restructuring Round 2

Aaron Huber 2021-03-23 11:41:08 -04:00
parent 6393951e20
commit 7472bd83e1
2 changed files with 264 additions and 157 deletions

419
intro.tex

@ -3,189 +3,296 @@
\section{Introduction}
\label{sec:intro}
Computing the expectation of a lineage formula in probabilistic databases (PDBs) involves two steps. The first step computes the lineage formula as part of query evaluation; the second step computes its expectation. Under set semantics, the lineage formula is a boolean formula whose marginal probability we compute. Under bag semantics, the lineage formula of a tuple $\tup$ is a polynomial, and we compute the expected count of $\tup$.
%The computation of the marginal probability in is a known \sharpphard problem. Its corresponding problem in bag PDBs is computing the expected count of a tuple $\tup$. The computation is a two step process. The first step involves actually computing the lineage formula of $\tup$. The second step then computes the marginal probability (expected count) of the lineage formula of $\tup$, a boolean formula (polynomial) in the set (bag) semantics setting.
It is known that a dichotomy exists for queries over set PDBs, partitioning the set of queries into tractable and intractable sets. Queries that are safe can be evaluated using extensional query evaluation, which is linear in the query runtime. Extensional query evaluation essentially performs both steps in one pass, computing the marginal probability (expected count) of $\tup$ as part of the actual query evaluation. Unsafe queries must, however, be evaluated using intensional query evaluation semantics, which computes step two only \emph{after} step one is computed by the query.
In the case of set semantics, computing the marginal probability of a boolean formula is in general \sharpphard, even when the boolean formula is in DNF. In this setting, step two is indeed the bottleneck. In the case of bag semantics, however, it has commonly been assumed that computing the expected count over the lineage polynomial is linear in the size of the polynomial. This is because many modern PDB systems represent the polynomial as a sum of products (SOP), where independent models like Tuple Independent Databases (\ti) allow the expectation to be pushed through both the sum operator (by linearity of expectation) and the product operator (by independence). However, what can be said about lineage polynomials in a compressed representation (e.g., factorized), i.e., not in SOP form?
In this paper we study computing the expected count of an output tuple of a PDB using intensional query evaluation, specifically when the polynomial is in a compressed representation.
%Most theoretical developments in probabilistic databases (PDBs) have been made in the setting of set semantics. This is largely due to the stark contrast in hardness results when computing the first moment of a tuple's lineage formula (a boolean formula encoding the contributing input tuples to the output tuple) in set semantics versus the linear runtime when computing the expectation over the lineage polynomial (a standard polynomial analogously encoding contributing input tuples) of a tuple from an output bag PDB. However, when viewed more closely, the assumption of linear runtime in the bag setting relies on the lineage polynomial being in its "expanded" sum of products (SOP) form (each term is a product, where all (product) terms are summed). What can be said about computing the expectation of a more compressed form of the lineage polyomial (e.g. factorized polynomial) under bag semantics?
%As explainability and fairness become more relevant to the data science community, it is now more critical than ever to understand how reliable a dataset is.
%Probabilistic databases (PDBs)~\cite{DBLP:series/synthesis/2011Suciu} are a compelling solution, but a major roadblock to their adoption remains:
%PDBs are orders of magnitude slower than classical (i.e., deterministic) database systems~\cite{feng:2019:sigmod:uncertainty}.
%Naively, one might suggest that this is because most work on probabilistic databases assumes set semantics, while, virtually all implementations of the relational data model use bag semantics.
%However, as we show in this paper, there is a more subtle problem behind this barrier to adoption.
\subsection{Sets vs. Bags}
In the setting of set semantics, this problem can be defined as: given a query, probabilistic database, and possible result tuple, compute the marginal probability of the tuple appearing in the result. It has been shown that this is equivalent to computing the probability of the lineage formula. %, which records how the result tuple was derived from input tuples.
Given this correspondence, the problem reduces to weighted model counting over the lineage (a \sharpphard problem, even if the lineage is in DNF--the "expanded" form of the lineage formula in set semantics, corresponding to SOP of bag semantics).
%A large body of work has focused on identifying tractable cases by either identifying tractable classes of queries (e.g.,~\cite{DS12}) or studying compressed representations of lineage formulas that are tractable for certain classes of input databases (e.g.,~\cite{AB15}). In this work we define a compressed representation as any one of the possible circuit representations of the lineage formula (please see Definitions~\ref{def:circuit},~\ref{def:poly-func}, and~\ref{def:circuit-set}).
In bag semantics this problem corresponds to computing the expected multiplicity of a query result tuple, which can be reduced to computing the expectation of the lineage polynomial.
\begin{Example}\label{ex:intro}
The tables $\rel$ and $E$ in \Cref{fig:intro-ex} are examples of an incomplete database. In the setting of set semantics (disregard $\Phi_{bag}$ for the moment), every tuple $\tup$ of these tables is annotated with a variable or the symbol $\top$. Each assignment of values to variables ($\{\;W_a,W_b,W_c\;\}\mapsto \{\;\top,\bot\;\}$) identifies one \emph{possible world}, a deterministic database instance containing exactly the tuples annotated by the constant $\top$ or by a variable assigned to $\top$. When each variable represents an \emph{independent} event, this encoding is called a Tuple Independent Database $(\ti)$.
The probability of this world is the joint probability of the corresponding assignments.
For example, let $\probOf[W_a] = \probOf[W_b] = \probOf[W_c] = \prob$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.
The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and its probability is $\probOf[W_a]\cdot \probOf[W_b] \cdot \probOf[\neg W_c] = \prob\cdot \prob\cdot (1-\prob)=\prob^2-\prob^3$.
Computing the expectation of a lineage formula in probabilistic databases (PDBs) involves two steps. First, the lineage formula is computed (typically by instrumenting the query); second, its expectation is computed. The computation may be performed under either set or bag semantics. In set semantics, the lineage formula is an element of the $PosBool[\vct{X}]$ semiring, where the lineage variables in $\vct{X}$ range over the elements of $\mathbb{B}$, and computing the expectation of this formula is the same as computing the marginal probability of the output tuple's existence. In bag semantics, we compute the expected count of a tuple $\tup$ whose lineage formula is an element of the $\mathbb{N}[\vct{X}]$ semiring, with coefficients in the natural numbers $\mathbb{N}$ and variables from $\vct{X}$ ranging over some given set such as the reals $\mathbb{R}$.
\begin{Example}
Consider the \ti tables (\cref{fig:ex-shipping}) from an international shipping company. Table Loc lists all the locations to which air transport is provided. Table Route identifies all locations connected via air transit. Elements from both $PosBool[\vct{X}]$ and $\mathbb{N}[\vct{X}]$ semirings can be seen as annotations in attributes $\Phi_{set}$ and $\Phi_{bag}$ of \cref{subfig:ex-shipping-queries}.
\end{Example}
\begin{figure}[t]
\begin{subfigure}{0.33\linewidth}
\begin{subfigure}[b]{0.33\linewidth}
\centering
\resizebox{!}{10mm}{
\resizebox{!}{9mm}{
\begin{tabular}{ c | c c c}
$\rel$ & A & $\Phi_{set}$ & $\Phi_{bag}$\\
$Loc$ & City & $\Phi_{set}$ & $\Phi_{bag}$\\
\hline
& a & $W_a$ & $W_a$\\
& b & $W_b$ & $W_b$\\
& c & $W_c$ & $W_c$\\
& Buffalo & $L_a$ & $L_a$\\
& Chicago & $L_b$ & $L_b$\\
& Bremen & $L_c$ & $L_c$\\
%& Tel Aviv & $L_d$ & $L_d$\\
		& Zurich & $L_d$ & $L_d$\\
\end{tabular}
} \caption{Relation $R$ in ~\Cref{ex:intro}}
\label{subfig:ex-atom1}
}
\caption{Relation $Loc$ in ~\Cref{ex:intro}}
\label{subfig:ex-shipping-loc}
\end{subfigure}%
\begin{subfigure}{0.33\linewidth}
\begin{subfigure}[b]{0.33\linewidth}
\centering
\resizebox{!}{10mm}{
\resizebox{!}{9mm}{
\begin{tabular}{ c | c c c c}
$E$ & A & B & $\Phi_{set}$ & $\Phi_{bag}$ \\
$Route$ & $\text{City}_1$ & $\text{City}_2$ & $\Phi_{set}$ & $\Phi_{bag}$ \\
\hline
& a & b & $\top$ & $1$\\
& b & c & $\top$ & $1$\\
& c & a & $\top$ & $1$\\
& Buffalo & Chicago & $\top$ & $1$\\
& Chicago & Zurich & $\top$ & $1$\\
& $\cdots$ & $\cdots$ & $\cdots$ & $\cdots$\\
& Chicago & Bremen & $\top$ & $1$\\
\end{tabular}
}
\caption{Relation $E$ in ~\Cref{ex:intro}}
\label{subfig:ex-atom3}
\caption{Relation $Route$ in ~\Cref{ex:intro}}
\label{subfig:ex-shipping-route}
\end{subfigure}%
\begin{subfigure}{0.33\linewidth}
\begin{subfigure}[b]{0.33\linewidth}
\centering
\resizebox{!}{29mm}{
\begin{tikzpicture}[thick]
\node[tree_node] (a1) at (0, 0){$W_a$};
\node[tree_node] (b1) at (1, 0){$W_b$};
\node[tree_node] (c1) at (2, 0){$W_c$};
\node[tree_node] (d1) at (3, 0){$W_d$};
\node[tree_node] (a2) at (0.75, 0.8){$\boldsymbol{\circmult}$};
\node[tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circmult}$};
\node[tree_node] (c2) at (2.25, 0.8){$\boldsymbol{\circmult}$};
\node[tree_node] (a3) at (1.9, 1.6){$\boldsymbol{\circplus}$};
\node[tree_node] (a4) at (0.75, 1.6){$\boldsymbol{\circplus}$};
\node[tree_node] (a5) at (0.75, 2.5){$\boldsymbol{\circmult}$};
\draw[->] (a1) -- (a2);
\draw[->] (b1) -- (a2);
\draw[->] (b1) -- (b2);
\draw[->] (c1) -- (b2);
\draw[->] (c1) -- (c2);
\draw[->] (d1) -- (c2);
\draw[->] (c2) -- (a3);
\draw[->] (a2) -- (a4);
\draw[->] (b2) -- (a3);
\draw[->] (a3) -- (a4);
%sink
\draw[thick, ->] (a4.110) -- (a5.250);
\draw[thick, ->] (a4.70) -- (a5.290);
\draw[thick, ->] (a5) -- (0.75, 3.0);
\end{tikzpicture}
}
\caption{Circuit encoding for query $\poly^2$.}
\label{fig:circuit-q2-intro}
\end{subfigure}
\resizebox{!}{9mm}{
\begin{tabular}{ c | c c c}
$Q_{1}$ & $\text{City}_1$ & $\Phi_{set}$ & $\Phi_{bag}$ \\
\hline
& Chicago & $\top \vee \top = \top$ & $1 + 1 = 2$\\
\multicolumn{1}{c}{\vspace{1mm}}\\
$Q_{2}$ & $\text{City}_1$ & $\Phi_{set}$ & $\Phi_{bag}$ \\
\hline
& Buffalo & $L_a \wedge \top$ & $2L_a$\\
\end{tabular}
}
\caption{$Q_1$ and $Q_2$ in ~\Cref{ex:intro}}
\label{subfig:ex-shipping-queries}
\end{subfigure}%
% \begin{subfigure}{0.33\linewidth}
% \centering
% \resizebox{!}{29mm}{
% \begin{tikzpicture}[thick]
% \node[tree_node] (a1) at (0, 0){$W_a$};
% \node[tree_node] (b1) at (1, 0){$W_b$};
% \node[tree_node] (c1) at (2, 0){$W_c$};
% \node[tree_node] (d1) at (3, 0){$W_d$};
%
% \node[tree_node] (a2) at (0.75, 0.8){$\boldsymbol{\circmult}$};
% \node[tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circmult}$};
% \node[tree_node] (c2) at (2.25, 0.8){$\boldsymbol{\circmult}$};
%
% \node[tree_node] (a3) at (1.9, 1.6){$\boldsymbol{\circplus}$};
% \node[tree_node] (a4) at (0.75, 1.6){$\boldsymbol{\circplus}$};
% \node[tree_node] (a5) at (0.75, 2.5){$\boldsymbol{\circmult}$};
%
% \draw[->] (a1) -- (a2);
% \draw[->] (b1) -- (a2);
% \draw[->] (b1) -- (b2);
% \draw[->] (c1) -- (b2);
% \draw[->] (c1) -- (c2);
% \draw[->] (d1) -- (c2);
% \draw[->] (c2) -- (a3);
% \draw[->] (a2) -- (a4);
% \draw[->] (b2) -- (a3);
% \draw[->] (a3) -- (a4);
% %sink
% \draw[thick, ->] (a4.110) -- (a5.250);
% \draw[thick, ->] (a4.70) -- (a5.290);
% \draw[thick, ->] (a5) -- (0.75, 3.0);
% \end{tikzpicture}
% }
% \caption{Circuit encoding for query $\poly^2$.}
% \label{fig:circuit-q2-intro}
% \end{subfigure}
%\vspace*{3mm}
\vspace*{-3mm}
\caption{ }%{$\ti$ relations for $\poly$}
\label{fig:intro-ex}
\label{fig:ex-shipping}
\trimfigurespacing
\end{figure}
\begin{Example}
Now consider the case when a customer service representative needs to expedite a shipment en route to Western Europe. Further assume that $L_c$ and $L_d$ are both set to $\mathbbold{1}$, the multiplicative neutral element of their respective semirings. To find the cities providing air service to either Zurich or Bremen, the query $Q_1 := \pi_{\text{City}_1}\left(\sigma_{\text{City}_2 = \text{``Bremen''} ~OR~ \text{City}_2 = \text{``Zurich''}}(Route)\right)$ might be issued, where the output is a bag PDB of the cities that have air transit to either Zurich or Bremen. The output bag PDB of \cref{subfig:ex-shipping-queries} is a bag \ti, and we see an example of a bag PDB modeled by a query $Q$ over a set input relation (annotations with $\domain(W_i) = \{0, 1\}$), where the query output is closed with respect to the data model. This assumption is without loss of generality.
Following prior efforts~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,GL16}, we generalize this model of Set-PDBs to Bag-PDBs using $\semN$-valued random variables (i.e., $\domain(\randomvar_i) \subseteq \mathbb N$) and constants (annotation $\Phi_{bag}$ in the example).
Without loss of generality, we assume that input relations are sets (i.e. $Dom(W_i) = \{0, 1\}$), while \emph{query evaluation follows bag semantics}.
\begin{Example}\label{ex:bag-vs-set}
Continuing the prior example, we are given the following Boolean (resp., count) query
$$\poly() :- R(A), E(A, B), R(B)$$
The lineage of the result in a Set-PDB (Bag-PDB) is a Boolean formula (polynomial) over random variables annotating the input relations (i.e., $W_a$, $W_b$, $W_c$).
Because the query result is a nullary relation, in what follows we can write $Q(\cdot)$ to denote the function that evaluates the lineage over one specific assignment of values to the variables (i.e., the value of the lineage in the corresponding possible world):
\setlength\parindent{0pt}
\vspace*{-3mm}
\begin{tabular}{@{}l l}
\begin{minipage}[b]{0.45\linewidth}
\begin{equation}
\poly_{set}(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a\label{eq:poly-set}
\end{equation}
\end{minipage}\hspace*{5mm}
&
\begin{minipage}[b]{0.45\linewidth}
\begin{equation}
\poly_{bag}(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a\label{eq:poly-bag}
\end{equation}
\end{minipage}\\
\end{tabular}
\vspace*{1mm}
These functions compute the existence (count) of the nullary tuple resulting from applying $\poly$ on the PDB of \Cref{fig:intro-ex}.
For the same possible world identified in \Cref{ex:intro}:
$$
\begin{tabular}{c c}
\begin{minipage}[b]{0.45\linewidth}
$\poly_{set}(\top, \top, \bot) = \top\top \vee \top\bot \vee \bot\top = \top$
\end{minipage}
&
\begin{minipage}[b]{0.45\linewidth}
$\poly_{bag}(1, 1, 0) = 1 \cdot 1 + 1\cdot 0 + 0 \cdot 1 = 1$
\end{minipage}\\
\end{tabular}
$$
The Set-PDB query is satisfied in this possible world and the output Bag-PDB tuple has a multiplicity of 1.
The marginal probability (expected count) of this query is computed over all possible worlds:
{\small
\begin{align*}
\probOf[\poly_{set}] &= \hspace*{-1mm}
\sum_{w_i \in \{\top,\bot\}} \indicator{\poly_{set}(w_a, w_b, w_c)}\probOf[W_a = w_a,W_b = w_b,W_c = w_c]\\
\expct[\poly_{bag}] &= \sum_{w_i \in \{0,1\}} \poly_{bag}(w_a, w_b, w_c)\cdot \probOf[W_a = w_a,W_b = w_b,W_c = w_c]
\end{align*}
}
Suppose the customer service agent would like to see whether there are locations closer to the shipment's origin that connect to Chicago, using the query $Q_2 := \pi_{\text{City}_1}(Loc \bowtie_{\text{City}_2 = \text{``Chicago''}} Q_{1})$. Assume that, at this point in time, government regulations allow transportation of goods only from the location in Buffalo, so that the random variable $L_a = \top$ under set semantics, and the output of $Q_2$ tells us that the shipment can be made from Buffalo. Under bag semantics $L_a = 1$, and the output of $Q_2$ tells us that there are $1 \cdot 2 = 2$ possible options for shipping from Buffalo to Western Europe.
\end{Example}
Note that the query of \Cref{ex:bag-vs-set} in set semantics is indeed non-hierarchical~\cite{DS12}, and thus \sharpphard.
To see why computing this probability is hard, observe that the three clauses $(W_aW_b, W_bW_c, W_aW_c)$ of $(\ref{eq:poly-set})$ are neither independent (the same variables appear in multiple clauses) nor disjoint (the clauses are not mutually exclusive). The known exact algorithms for computing the probability of such formulas (e.g., Shannon decomposition) take exponential time in the worst case.
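As a worked illustration (a sketch under the assumption, carried over from \Cref{ex:intro}, that $\probOf[W_a] = \probOf[W_b] = \probOf[W_c] = \prob$ and that the three variables are independent), $\poly_{set}$ is satisfied exactly in the worlds where at least two of the three variables are $\top$. Enumerating those worlds gives
\begin{equation*}
\probOf[\poly_{set}] = 3\prob^2(1-\prob) + \prob^3 = 3\prob^2 - 2\prob^3.
\end{equation*}
This value cannot be obtained by summing the clause probabilities, which would give $3\prob^2$, because the clauses overlap.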
Conversely, in Bag-PDBs, correlations between monomials of the SOP polynomial (\ref{eq:poly-bag}) are not problematic thanks to linearity of expectation.
The expectation computation over the output lineage is simply the sum of expectations of each clause.
Referring again to example~\ref{ex:intro}, the expectation is simply
\begin{equation*}
\expct\pbox{\poly_{bag}(W_a, W_b, W_c)} = \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}
\end{equation*}
In this particular lineage polynomial, all variables in each product clause are independent, so we can push expectations through.
\begin{equation*}
= \expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}
\end{equation*}
Computing such expectations is indeed linear in the size of the SOP as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial.
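For instance, assuming (as in \Cref{ex:intro}) that each $W_i \in \{0, 1\}$ is independent with $\probOf[W_i = 1] = \prob$, each product term has expectation $\prob^2$. The result can be cross-checked against the possible-world definition: worlds with exactly two variables set to $1$ contribute a count of $1$, the world with all three set to $1$ contributes a count of $3$, and all remaining worlds contribute $0$:
\begin{equation*}
\expct\pbox{\poly_{bag}} = 3\prob^2(1-\prob)\cdot 1 + \prob^3 \cdot 3 = 3\prob^2 = \prob\cdot\prob + \prob\cdot\prob + \prob\cdot\prob.
\end{equation*}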
As a further interesting feature of this example, note that $\expct\pbox{W_i} = \probOf[W_i = 1]$, and so taking the same polynomial over the reals:
\begin{equation}
\label{eqn:can-inline-probabilities-into-polynomial}
\expct\pbox{\poly_{bag}}
= \poly_{bag}(\probOf[W_a=1], \probOf[W_b=1], \probOf[W_c=1])
\end{equation}
\Cref{eqn:can-inline-probabilities-into-polynomial} is not true in general, as we shall see in \Cref{sec:suplin-bags}.
The workflow modeling this particular problem can be broken down into two steps. We first convert the output boolean formula (polynomial) into some representation. This representation is then the interface for the second step, which computes the marginal probability (count) of the encoded boolean formula (polynomial). A natural question arises as to which representation to use. Our choice to use circuits (\Cref{def:circuit}) to represent the lineage polynomials follows from the observation that the work in WCOJ/FAQ/Factorized DB's --\color{red}CITATION HERE\color{black}-- all contain algorithms that can easily be modified to output circuits without changing their runtime. Further, circuits generally allow for greater compression than other representations, such as expression trees. By the former observation, step one is always linear in the size of the circuit representation of the boolean formula (polynomial), implying that if the second step of the workflow takes superlinear time, then reducing the complexity of the second step would indeed improve the overall efficiency of computing the marginal probability (count) of an output set (bag) PDB tuple. This, however, as noted earlier, cannot be done in the set semantics setting, due to known hardness results.
%The computation of the marginal probability in is a known \sharpphard problem. Its corresponding problem in bag PDBs is computing the expected count of a tuple $\tup$. The computation is a two step process. The first step involves actually computing the lineage formula of $\tup$. The second step then computes the marginal probability (expected count) of the lineage formula of $\tup$, a boolean formula (polynomial) in the set (bag) semantics setting.
It is known that a dichotomy exists for queries over set PDBs, partitioning the set of queries into tractable and intractable classes. Queries that are safe can be evaluated using extensional query evaluation, which is linear in the query runtime. Extensional query evaluation essentially performs both steps of the expectation computation in one pass, merging step two into the actual query evaluation. Unsafe queries must, however, be evaluated using intensional query evaluation semantics, which computes step two only \emph{after} step one is computed by the query.
Though computing the expected count of an output bag PDB tuple $\tup$ is linear (in the size of the polynomial) when the lineage polynomial of $\tup$ is in SOP form, %has received much less attention, perhaps due to the property of linearity of expectation noted above.
%, perhaps because on the surface, the problem is trivially tractable.In fact, as mentioned, it is linear time when the lineage polynomial is encoded in an SOP representation.
is this computation also linear (in the size of an equivalent compressed representation) when the lineage polynomial of $\tup$ is in compressed form?
Historically, the first step in computing the expectation of the lineage formula has not been the bottleneck, since the lineage of an output tuple can be computed by instrumenting the query, where each additional operation takes $O(1)$ time. In the case of set semantics, computing the marginal probability of a boolean formula is in general \sharpphard, even when the boolean formula is in DNF. In this setting, step two is indeed the bottleneck. In the case of bag semantics, however, the underlying assumption has been that computing the expected count over the lineage polynomial is linear in the size of the polynomial. This is perhaps due to the observation that many modern PDB systems represent the polynomial as a sum of products (SOP), where independent PDB data models like Tuple Independent Databases (\ti) allow the expectation to be pushed through both the sum operator (by linearity of expectation) and the product operator (by independence). Unlike set semantics, when the lineage is in SOP form (the bag equivalent of DNF in set semantics), the expectation of the lineage polynomial can be computed in linear time. However, what can be said about lineage polynomials in a compressed representation (e.g., factorized), i.e., not in SOP form?
In this paper we study computing the expected count of an output bag PDB tuple whose lineage formula is in a compressed representation, using the more general intensional query evaluation semantics.
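To see why the compressed case is not immediate, consider a small hypothetical factorized polynomial, chosen here only for illustration and not taken from the example tables: $(W_a + W_b)(W_a + W_c)$, with independent $W_i \in \{0, 1\}$ and $\probOf[W_i = 1] = \prob$. Expanding to SOP form and using $W_a^2 = W_a$ for $\{0, 1\}$-valued variables gives
\begin{align*}
\expct\pbox{(W_a + W_b)(W_a + W_c)} &= \expct\pbox{W_a^2} + \expct\pbox{W_aW_c} + \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c}\\
&= \prob + 3\prob^2,
\end{align*}
whereas substituting probabilities directly into the factorized form gives $(\prob + \prob)(\prob + \prob) = 4\prob^2$. The two values agree only when $\prob \in \{0, 1\}$, since the two factors share $W_a$ and are therefore not independent.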
%
%%Most theoretical developments in probabilistic databases (PDBs) have been made in the setting of set semantics. This is largely due to the stark contrast in hardness results when computing the first moment of a tuple's lineage formula (a boolean formula encoding the contributing input tuples to the output tuple) in set semantics versus the linear runtime when computing the expectation over the lineage polynomial (a standard polynomial analogously encoding contributing input tuples) of a tuple from an output bag PDB. However, when viewed more closely, the assumption of linear runtime in the bag setting relies on the lineage polynomial being in its "expanded" sum of products (SOP) form (each term is a product, where all (product) terms are summed). What can be said about computing the expectation of a more compressed form of the lineage polyomial (e.g. factorized polynomial) under bag semantics?
%
%%As explainability and fairness become more relevant to the data science community, it is now more critical than ever to understand how reliable a dataset is.
%%Probabilistic databases (PDBs)~\cite{DBLP:series/synthesis/2011Suciu} are a compelling solution, but a major roadblock to their adoption remains:
%%PDBs are orders of magnitude slower than classical (i.e., deterministic) database systems~\cite{feng:2019:sigmod:uncertainty}.
%%Naively, one might suggest that this is because most work on probabilistic databases assumes set semantics, while, virtually all implementations of the relational data model use bag semantics.
%%However, as we show in this paper, there is a more subtle problem behind this barrier to adoption.
%\subsection{Sets vs. Bags}
%In the setting of set semantics, this problem can be defined as: given a query, probabilistic database, and possible result tuple, compute the marginal probability of the tuple appearing in the result. It has been shown that this is equivalent to computing the probability of the lineage formula. %, which records how the result tuple was derived from input tuples.
%Given this correspondence, the problem reduces to weighted model counting over the lineage (a \sharpphard problem, even if the lineage is in DNF--the "expanded" form of the lineage formula in set semantics, corresponding to SOP of bag semantics).
%%A large body of work has focused on identifying tractable cases by either identifying tractable classes of queries (e.g.,~\cite{DS12}) or studying compressed representations of lineage formulas that are tractable for certain classes of input databases (e.g.,~\cite{AB15}). In this work we define a compressed representation as any one of the possible circuit representations of the lineage formula (please see Definitions~\ref{def:circuit},~\ref{def:poly-func}, and~\ref{def:circuit-set}).
%
%In bag semantics this problem corresponds to computing the expected multiplicity of a query result tuple, which can be reduced to computing the expectation of the lineage polynomial.
%
%\begin{Example}\label{ex:intro}
%The tables $\rel$ and $E$ in \Cref{fig:intro-ex} are examples of an incomplete database. In the setting of set semantics (disregard $\Phi_{bag}$ for the moment), every tuple $\tup$ of these tables is annotated with a variable or the symbol $\top$. Each assignment of values to variables ($\{\;W_a,W_b,W_c\;\}\mapsto \{\;\top,\bot\;\}$) identifies one \emph{possible world}, a deterministic database instance containing exactly the tuples annotated by the constant $\top$ or by a variable assigned to $\top$. When each variable represents an \emph{independent} event, this encoding is called a Tuple Independent Database $(\ti)$.
%
%The probability of this world is the joint probability of the corresponding assignments.
%For example, let $\probOf[W_a] = \probOf[W_b] = \probOf[W_c] = \prob$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.
%The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and its probability is $\probOf[W_a]\cdot \probOf[W_b] \cdot \probOf[\neg W_c] = \prob\cdot \prob\cdot (1-\prob)=\prob^2-\prob^3$.
%\end{Example}
%
%\begin{figure}[t]
% \begin{subfigure}{0.33\linewidth}
% \centering
% \resizebox{!}{10mm}{
% \begin{tabular}{ c | c c c}
% $\rel$ & A & $\Phi_{set}$ & $\Phi_{bag}$\\
% \hline
% & a & $W_a$ & $W_a$\\
% & b & $W_b$ & $W_b$\\
% & c & $W_c$ & $W_c$\\
% \end{tabular}
%} \caption{Relation $R$ in ~\Cref{ex:intro}}
% \label{subfig:ex-atom1}
% \end{subfigure}%
% \begin{subfigure}{0.33\linewidth}
% \centering
% \resizebox{!}{10mm}{
% \begin{tabular}{ c | c c c c}
% $E$ & A & B & $\Phi_{set}$ & $\Phi_{bag}$ \\
% \hline
% & a & b & $\top$ & $1$\\
% & b & c & $\top$ & $1$\\
% & c & a & $\top$ & $1$\\
% \end{tabular}
% }
% \caption{Relation $E$ in ~\Cref{ex:intro}}
% \label{subfig:ex-atom3}
% \end{subfigure}%
% \begin{subfigure}{0.33\linewidth}
% \centering
% \resizebox{!}{29mm}{
% \begin{tikzpicture}[thick]
% \node[tree_node] (a1) at (0, 0){$W_a$};
% \node[tree_node] (b1) at (1, 0){$W_b$};
% \node[tree_node] (c1) at (2, 0){$W_c$};
% \node[tree_node] (d1) at (3, 0){$W_d$};
%
% \node[tree_node] (a2) at (0.75, 0.8){$\boldsymbol{\circmult}$};
% \node[tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circmult}$};
% \node[tree_node] (c2) at (2.25, 0.8){$\boldsymbol{\circmult}$};
%
% \node[tree_node] (a3) at (1.9, 1.6){$\boldsymbol{\circplus}$};
% \node[tree_node] (a4) at (0.75, 1.6){$\boldsymbol{\circplus}$};
% \node[tree_node] (a5) at (0.75, 2.5){$\boldsymbol{\circmult}$};
%
% \draw[->] (a1) -- (a2);
% \draw[->] (b1) -- (a2);
% \draw[->] (b1) -- (b2);
% \draw[->] (c1) -- (b2);
% \draw[->] (c1) -- (c2);
% \draw[->] (d1) -- (c2);
% \draw[->] (c2) -- (a3);
% \draw[->] (a2) -- (a4);
% \draw[->] (b2) -- (a3);
% \draw[->] (a3) -- (a4);
% %sink
% \draw[thick, ->] (a4.110) -- (a5.250);
% \draw[thick, ->] (a4.70) -- (a5.290);
% \draw[thick, ->] (a5) -- (0.75, 3.0);
% \end{tikzpicture}
% }
% \caption{Circuit encoding for query $\poly^2$.}
% \label{fig:circuit-q2-intro}
% \end{subfigure}
% %\vspace*{3mm}
% \vspace*{-3mm}
% \caption{ }%{$\ti$ relations for $\poly$}
% \label{fig:intro-ex}
% \trimfigurespacing
%\end{figure}
%
%
%Following prior efforts~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,GL16}, we generalize this model of Set-PDBs to Bag-PDBs using $\semN$-valued random variables (i.e., $\domain(\randomvar_i) \subseteq \mathbb N$) and constants (annotation $\Phi_{bag}$ in the example).
%Without loss of generality, we assume that input relations are sets (i.e. $Dom(W_i) = \{0, 1\}$), while \emph{query evaluation follows bag semantics}.
%
%\begin{Example}\label{ex:bag-vs-set}
%Continuing the prior example, we are given the following Boolean (resp,. count) query
%$$\poly() :- R(A), E(A, B), R(B)$$
%The lineage of the result in a Set-PDB (Bag-PDB) is a Boolean formula (polynomial) over random variables annotating the input relations (i.e., $W_a$, $W_b$, $W_c$).
%Because the query result is a nullary relation, in what follows we can write $Q(\cdot)$ to denote the function that evaluates the lineage over one specific assignment of values to the variables (i.e., the value of the lineage in the corresponding possible world):
%
%\setlength\parindent{0pt}
%\vspace*{-3mm}
%\begin{tabular}{@{}l l}
% \begin{minipage}[b]{0.45\linewidth}
% \begin{equation}
% \poly_{set}(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a\label{eq:poly-set}
% \end{equation}
% \end{minipage}\hspace*{5mm}
% &
% \begin{minipage}[b]{0.45\linewidth}
% \begin{equation}
% \poly_{bag}(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a\label{eq:poly-bag}
% \end{equation}
% \end{minipage}\\
%\end{tabular}
%\vspace*{1mm}
%
%
%
%These functions compute the existence (count) of the nullary tuple resulting from applying $\poly$ on the PDB of \Cref{fig:intro-ex}.
%For the same possible world identified in \Cref{ex:intro}:
%$$
%\begin{tabular}{c c}
% \begin{minipage}[b]{0.45\linewidth}
% $\poly_{set}(\top, \top, \bot) = \top\top \vee \top\bot \vee \bot\top = \top$
% \end{minipage}
% &
% \begin{minipage}[b]{0.45\linewidth}
% $\poly_{bag}(1, 1, 0) = 1 \cdot 1 + 1\cdot 0 + 0 \cdot 1 = 1$
% \end{minipage}\\
%\end{tabular}
%$$
%
%The Set-PDB query is satisfied in this possible world and the output Bag-PDB tuple has a multiplicity of 1.
%The marginal probability (expected count) of this query is computed over all possible worlds:
%{\small
%\begin{align*}
%\probOf[\poly_{set}] &= \hspace*{-1mm}
% \sum_{w_i \in \{\top,\bot\}} \indicator{\poly_{set}(w_a, w_b, w_c)}\probOf[W_a = w_a,W_b = w_b,W_c = w_c]\\
%\expct[\poly_{bag}] &= \sum_{w_i \in \{0,1\}} \poly_{bag}(w_a, w_b, w_c)\cdot \probOf[W_a = w_a,W_b = w_b,W_c = w_c]
%\end{align*}
%}
%\end{Example}
%
%Note that the query of \Cref{ex:bag-vs-set} in set semantics is indeed non-hierarchical~\cite{DS12}, and thus \sharpphard.
%To see why computing this probability is hard, observe that the three clauses $(W_aW_b, W_bW_c, W_aW_c)$ of $(\ref{eq:poly-set})$ are not independent (the same variables appear in multiple clauses) nor disjoint (the clauses are not mutually exclusive). Computing the probability of such formulas exactly requires exponential time algorithms (e.g., Shanon Decomposition).
%Conversely, in Bag-PDBs, correlations between monomials of the SOP polynomial (\ref{eq:poly-bag}) are not problematic thanks to linearity of expectation.
%The expectation computation over the output lineage is simply the sum of expectations of each clause.
%Referring again to example~\ref{ex:intro}, the expectation is simply
%\begin{equation*}
%\expct\pbox{\poly_{bag}(W_a, W_b, W_c)} = \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}
%\end{equation*}
%In this particular lineage polynomial, all variables in each product clause are independent, so we can push expectations through.
%\begin{equation*}
%= \expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}
%\end{equation*}
%Computing such expectations is indeed linear in the size of the SOP as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial.
%As a further interesting feature of this example, note that $\expct\pbox{W_i} = \probOf[W_i = 1]$, and so taking the same polynomial over the reals:
%\begin{equation}
%\label{eqn:can-inline-probabilities-into-polynomial}
%\expct\pbox{\poly_{bag}}
%= \poly_{bag}(\probOf[W_a=1], \probOf[W_b=1], \probOf[W_c=1])
%\end{equation}
%\Cref{eqn:can-inline-probabilities-into-polynomial} is not true in general, as we shall see in \Cref{sec:suplin-bags}.
%
%The workflow modeling this particular problem can be broken down into two steps. We start by converting the output boolean formula (polynomial) into a representation. This representation is then the interface for the second step, which computes the marginal probability (count) of the encoded boolean formula (polynomial). A natural question arises as to which representation to use. Our choice to use circuits (\Cref{def:circuit}) to represent the lineage polynomials follows from the observation that the work in WCOJ/FAQ/Factorized DBs --\color{red}CITATION HERE\color{black}-- all contain algorithms that can easily be modified to output circuits without changing their runtime. Further, circuits generally allow for greater compression than other representations, such as expression trees. By the former observation, step one is always linear in the size of the circuit representation of the boolean formula (polynomial). Hence, if the second step of the workflow takes more than linear time, then reducing the complexity of the second step would indeed improve the overall efficiency of computing the marginal probability (count) of an output set (bag) PDB tuple. This, however, as noted earlier, cannot be done in the set semantics setting, due to known hardness results.
%
%Though computing the expected count of an output bag PDB tuple $\tup$ is linear (in the size of the polynomial) when the lineage polynomial of $\tup$ is in SOP form, %has received much less attention, perhaps due to the property of linearity of expectation noted above.
%%, perhaps because on the surface, the problem is trivially tractable.In fact, as mentioned, it is linear time when the lineage polynomial is encoded in an SOP representation.
%is this computation also linear (in the size of an equivalent compressed representation) when the lineage polynomial of $\tup$ is in compressed form?
%there exist compressed representations of polynomials, e.g., factorizations~\cite{factorized-db}, that can be polynomially more concise than their SOP counterpart.
Such compressed forms arise naturally from typical database optimizations such as projection push-down~\cite{DBLP:books/daglib/0020812}: when a projection is pushed below a join, addition is performed before multiplication, yielding a product of sums instead of an SOP.
\begin{Example}
Consider again the tables in \cref{subfig:ex-shipping-loc} and \cref{subfig:ex-shipping-route}, and assume that the tuples in $Route$ are annotated with random variables $R_a$, $R_b$, and $R_c$, respectively. The query $Q_3 := \pi_{\text{City}}(Loc \bowtie_{\text{City} = \text{City}_1}Route)$ is equivalent to $Q_4 := Loc \bowtie_{\text{City} = \text{City}_1} \pi_{\text{City}_1}(Route)$. Writing $\otimes$ and $\oplus$ for semiring multiplication and addition, respectively, the annotation of tuple $\langle \text{Chicago} \rangle$ in $Q_4$ is $L_b\otimes (R_b \oplus R_c)$, whereas $Q_3$ yields the SOP $(L_b\otimes R_b) \oplus (L_b\otimes R_c)$.
\end{Example}
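To make the contrast concrete, the following is a worked sketch under assumptions not stated in the example itself: the variables $L_b$, $R_b$, and $R_c$ are taken to be independent and $\{0,1\}$-valued, and $\otimes$, $\oplus$ are interpreted as multiplication and addition over the natural numbers. Since no variable occurs more than once, expectation can be pushed through the factorized form directly:
\begin{align*}
\expct\pbox{L_b \cdot (R_b + R_c)} &= \expct\pbox{L_b}\cdot\left(\expct\pbox{R_b} + \expct\pbox{R_c}\right)\\
&= \expct\pbox{L_b}\expct\pbox{R_b} + \expct\pbox{L_b}\expct\pbox{R_c},
\end{align*}
which agrees with the expectation of the SOP form obtained from $Q_3$. This shortcut relies on each variable appearing in only one factor. In a hypothetical product of sums such as $(W_a + W_b)(W_a + W_c)$, expanding produces the term $W_a^2$, and for a $\{0,1\}$-valued variable we have $\expct\pbox{W_a^2} = \expct\pbox{W_a}$, which in general differs from $\left(\expct\pbox{W_a}\right)^2$. Substituting probabilities into the compressed form is therefore not correct in general.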
This suggests that perhaps even Bag-PDBs have higher query processing complexity than deterministic databases.
In this paper, we confirm this intuition, first proving that computing the expected count of a query result tuple is super-linear (\sharpwonehard) in the size of a compressed lineage representation, and then relating the size of the compressed lineage to the cost of answering a deterministic query.
In view of this hardness result (i.e., step 2 of the workflow is indeed the bottleneck in the bag setting for compressed representations), we develop an approximation algorithm for the expected counts of tuples in the Bag-PDB output of SPJU queries. To our knowledge, it is the first $(1-\epsilon)$-\emph{multiplicative} approximation that runs in time linear in the size of the factorized lineage, so that step 2 is no longer the bottleneck of the workflow.

@ -1 +1 @@
\contitem\title{Standard Operating Procedure in Bag PDBs Queries Considered Harmful}\author{Su Feng, Boris Glavic, Aaron Huber, Oliver Kennedy, and Atri Rudra}\page{23:1--23:51}
\contitem\title{Standard Operating Procedure in Bag PDBs Queries Considered Harmful}\author{Su Feng, Boris Glavic, Aaron Huber, Oliver Kennedy, and Atri Rudra}\page{23:1--23:50}