%root: main.tex
%!TEX root=./main.tex
\section{Introduction}
\label{sec:intro}
In their most general form, tuple-independent set-probabilistic databases~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs) answer existential queries (queries for the probability of a specific condition holding over the input database) in two steps: (i) computing the lineage and (ii) computing its probability.
The lineage is a Boolean formula, an element of the $\text{PosBool}[\vct{X}]$ semiring, where the lineage variables $\vct{X}\in \mathbb{B}^\numvar$ are random variables corresponding to the presence of each of the $\numvar$ input tuples in a possible world of the input database.
The lineage models the relationship between the presence of these input tuples and the query condition being satisfied, and thus the probability of this formula is exactly the query result.
The analogous query in the bag setting~\cite{DBLP:journals/sigmod/GuagliardoL17,feng:2019:sigmod:uncertainty} asks for the expectation of the number (multiplicity) of result tuples that satisfy the query condition.
The process for responding to such queries is also analogous, save that the lineage is a polynomial, an element of the $\mathbb{N}[\vct{X}]$ semiring, with coefficients in the set of natural numbers $\mathbb{N}$ and random variables $\vct{X} \in \mathbb{N}^\numvar$.
The expectation of this polynomial is the query result.
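For intuition, here is a small sketch (with made-up tuple probabilities and a hypothetical SOP lineage polynomial) of how linearity of expectation collapses the expected-count computation when the lineage variables are independent:

```python
from itertools import product

# P[X_i = 1] for each input tuple; values are hypothetical.
p = {"X1": 0.9, "X2": 0.5, "X3": 0.2}

def lineage(w):
    # An SOP lineage polynomial in N[X]: X1*X2 + X2*X3.
    return w["X1"] * w["X2"] + w["X2"] * w["X3"]

# Brute force: sum the multiplicity over all 2^n possible worlds,
# weighted by each world's probability.
names = list(p)
exact = 0.0
for bits in product([0, 1], repeat=len(names)):
    w = dict(zip(names, bits))
    weight = 1.0
    for x, b in w.items():
        weight *= p[x] if b else 1 - p[x]
    exact += lineage(w) * weight

# Linearity of expectation: E[X1*X2 + X2*X3] = E[X1]E[X2] + E[X2]E[X3]
# (independence lets the expectation push through each product).
linear = p["X1"] * p["X2"] + p["X2"] * p["X3"]
assert abs(exact - linear) < 1e-9
```

The linear-time evaluation on the last line is the reason the SOP case is easy; the brute-force enumeration above is exponential in the number of variables.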
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\begin{subfigure}[b]{0.33\linewidth}
\centering
\resizebox{!}{9mm}{
\begin{tabular}{ c | c c c}
$Loc$ & $\text{City}_\ell$ & $\Phi_{set}$ & $\Phi_{bag}$\\
\hline
& Buffalo & $L_a$ & $L_a$\\
& Chicago & $L_b$ & $L_b$\\
& Bremen & $L_c$ & $L_c$\\
%& Tel Aviv & $L_d$ & $L_d$\\
 & Zurich & $L_d$ & $L_d$\\
\end{tabular}
}
\caption{Relation $Loc$ in \Cref{ex:intro-tbls}}
\label{subfig:ex-shipping-loc}
\end{subfigure}%
\begin{subfigure}[b]{0.33\linewidth}
\centering
\resizebox{!}{9mm}{
\begin{tabular}{ c | c c c c}
$Route$ & $\text{City}_1$ & $\text{City}_2$ & $\Phi_{set}$ & $\Phi_{bag}$ \\
\hline
& Buffalo & Chicago & $\top$ & $1$\\
& Chicago & Zurich & $\top$ & $1$\\
& $\cdots$ & $\cdots$ & $\cdots$ & $\cdots$\\
& Chicago & Bremen & $\top$ & $1$\\
\end{tabular}
}
\caption{Relation $Route$ in \Cref{ex:intro-tbls}}
\label{subfig:ex-shipping-route}
\end{subfigure}%
\begin{subfigure}[b]{0.33\linewidth}
\centering
\resizebox{!}{9mm}{
\begin{tabular}{ c | c c c}
$Q_{1}$ & $\text{City}_1$ & $\Phi_{set}$ & $\Phi_{bag}$ \\
\hline
& Chicago & $\top \vee \top = \top$ & $1 + 1 = 2$\\
\multicolumn{1}{c}{\vspace{1mm}}\\
$Q_{2}$ & $\text{City}_1$ & $\Phi_{set}$ & $\Phi_{bag}$ \\
\hline
 & Chicago & $L_b \wedge \top$ & $2L_b$\\
\end{tabular}
}
\caption{$Q_1$ and $Q_2$ in \Cref{ex:intro-tbls}}
\label{subfig:ex-shipping-queries}
\end{subfigure}%
\vspace*{-3mm}
\caption{ }%{$\ti$ relations for $\poly$}
\label{fig:ex-shipping}
\trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Example}\label{ex:intro-tbls}
Consider the \ti tables (\Cref{fig:ex-shipping}) from an international shipping company.
Table $Loc$ lists airport locations.
Table $Route$ lists direct flight routes.
The tuples of both tables are annotated with elements of the $\text{PosBool}[\vct{X}]$ ($\Phi_{set}$) and $\mathbb{N}[\vct{X}]$ ($\Phi_{bag}$) semirings that indicate each tuple's presence or multiplicity, respectively.
Tuples of $Loc$ are annotated with random variables $L_i$ that model the event of no delays at the corresponding airport on a given day\footnote{We assume for simplicity that these variables are independent events.}.
Tuples of $Route$ are annotated with a constant ($\top$ or $1$, respectively) and are thus deterministic; queries over this table follow classical query evaluation semantics.
Consider a customer service representative who needs to expedite a shipment to Western Europe.
The query $Q_1 := \pi_{\text{City}_1}\left(\sigma_{\text{City}_2 = \text{``Bremen"} ~OR~ \text{City}_2 = \text{``Zurich"}}\right.$$\left.(Route)\right)$ asks for all cities with routes to either Zurich or Bremen.
Both routes exist from Chicago, and so the result lineage~\cite{DBLP:conf/pods/GreenKT07} of the corresponding tuple (\Cref{subfig:ex-shipping-queries}) indicates that the tuple is deterministically present, either via Zurich or Bremen.
Analogously, under bag semantics Chicago appears in the result twice.
Observe that even when the input is a set (i.e., input tuple annotations are at most $1$), we can still evaluate bag queries over it.
Suppose the representative would like to consider delays from the originating city, as per the query $Q_2 := \pi_{\text{City}_1}(Loc$ $\bowtie_{\text{City}_\ell = \text{City}_1} Q_{1})$.
The resulting lineage formulas (\Cref{subfig:ex-shipping-queries}) concisely describe the event of delivering a shipment to Zurich or Bremen without departure delay, or the number of departure-delay-free routes to these cities given an assignment to $L_b$.
If Chicago is delay-free ($L_b = \top$, $L_b = 1$, respectively), there exists a route (set semantics) or there are two routes (bag semantics).
\end{Example}
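Concretely, assuming a hypothetical delay-free probability for Chicago, the two semantics of $Q_2$'s Chicago result tuple yield a marginal probability and an expected count, respectively:

```python
# Q2's Chicago tuple from the example; the probability is made up.
p_Lb = 0.7                    # P[no delay at Chicago], hypothetical

# Set semantics: Phi_set = L_b AND (T OR T) = L_b,
# so the marginal probability is just P[L_b = T].
prob_set = p_Lb

# Bag semantics: Phi_bag = 2 * L_b, so by linearity of expectation
# the expected multiplicity is 2 * E[L_b].
expected_count = 2 * p_Lb
```

Note how the two routes through Chicago collapse to a single disjunct under set semantics but are counted twice under bag semantics.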
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%The computation of the marginal probability in is a known . Its corresponding problem in bag PDBs is computing the expected count of a tuple $\tup$. The computation is a two step process. The first step involves actually computing the lineage formula of $\tup$. The second step then computes the marginal probability (expected count) of the lineage formula of $\tup$, a boolean formula (polynomial) in the set (bag) semantics setting.
A well-known dichotomy~\cite{10.1145/1265530.1265571} separates the common case, in which computing the probability of a Boolean lineage formula is \sharpphard, from the case in which the probability computation can be inlined into the polynomial-time lineage construction process.
Historically, the bottleneck for \emph{set-}probabilistic databases has been the second step: an instrumented query can compute a circuit encoding of the result lineage with at most a constant-factor overhead over the un-instrumented query ((TODO: Find citation)).
Because the probability computation is the bottleneck, it is typical to assume that the lineage formula is provided in disjunctive normal form (DNF); even under this assumption, the problem remains \sharpphard in general.
For bag semantics, by contrast, the analogous sum-of-products (SOP) lineage representation admits a trivial linear-time algorithm thanks to linearity of expectation.
But what can be said about lineage polynomials (i.e., bag-probabilistic database query results) that are instead in a compressed (e.g., circuit) representation?
In this paper, we study computing the expected count of an output bag-PDB tuple whose lineage polynomial is given in a compressed representation, using the more general intensional query evaluation semantics.
%
%%Most theoretical developments in probabilistic databases (PDBs) have been made in the setting of set semantics. This is largely due to the stark contrast in hardness results when computing the first moment of a tuple's lineage formula (a boolean formula encoding the contributing input tuples to the output tuple) in set semantics versus the linear runtime when computing the expectation over the lineage polynomial (a standard polynomial analogously encoding contributing input tuples) of a tuple from an output bag PDB. However, when viewed more closely, the assumption of linear runtime in the bag setting relies on the lineage polynomial being in its "expanded" sum of products (SOP) form (each term is a product, where all (product) terms are summed). What can be said about computing the expectation of a more compressed form of the lineage polyomial (e.g. factorized polynomial) under bag semantics?
%
%%As explainability and fairness become more relevant to the data science community, it is now more critical than ever to understand how reliable a dataset is.
%%Probabilistic databases (PDBs)~\cite{DBLP:series/synthesis/2011Suciu} are a compelling solution, but a major roadblock to their adoption remains:
%%PDBs are orders of magnitude slower than classical (i.e., deterministic) database systems~\cite{feng:2019:sigmod:uncertainty}.
%%Naively, one might suggest that this is because most work on probabilistic databases assumes set semantics, while, virtually all implementations of the relational data model use bag semantics.
%%However, as we show in this paper, there is a more subtle problem behind this barrier to adoption.
%\subsection{Sets vs. Bags}
%In the setting of set semantics, this problem can be defined as: given a query, probabilistic database, and possible result tuple, compute the marginal probability of the tuple appearing in the result. It has been shown that this is equivalent to computing the probability of the lineage formula. %, which records how the result tuple was derived from input tuples.
%Given this correspondence, the problem reduces to weighted model counting over the lineage (a \sharpphard problem, even if the lineage is in DNF--the "expanded" form of the lineage formula in set semantics, corresponding to SOP of bag semantics).
%%A large body of work has focused on identifying tractable cases by either identifying tractable classes of queries (e.g.,~\cite{DS12}) or studying compressed representations of lineage formulas that are tractable for certain classes of input databases (e.g.,~\cite{AB15}). In this work we define a compressed representation as any one of the possible circuit representations of the lineage formula (please see Definitions~\ref{def:circuit},~\ref{def:poly-func}, and~\ref{def:circuit-set}).
%
%In bag semantics this problem corresponds to computing the expected multiplicity of a query result tuple, which can be reduced to computing the expectation of the lineage polynomial.
%
%\begin{Example}\label{ex:intro}
%The tables $\rel$ and $E$ in \Cref{fig:intro-ex} are examples of an incomplete database. In the setting of set semantics (disregard $\Phi_{bag}$ for the moment), every tuple $\tup$ of these tables is annotated with a variable or the symbol $\top$. Each assignment of values to variables ($\{\;W_a,W_b,W_c\;\}\mapsto \{\;\top,\bot\;\}$) identifies one \emph{possible world}, a deterministic database instance containing exactly the tuples annotated by the constant $\top$ or by a variable assigned to $\top$. When each variable represents an \emph{independent} event, this encoding is called a Tuple Independent Database $(\ti)$.
%
%The probability of this world is the joint probability of the corresponding assignments.
%For example, let $\probOf[W_a] = \probOf[W_b] = \probOf[W_c] = \prob$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.
%The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and its probability is $\probOf[W_a]\cdot \probOf[W_b] \cdot \probOf[\neg W_c] = \prob\cdot \prob\cdot (1-\prob)=\prob^2-\prob^3$.
%\end{Example}
%
%\begin{figure}[t]
% \begin{subfigure}{0.33\linewidth}
% \centering
% \resizebox{!}{10mm}{
% \begin{tabular}{ c | c c c}
% $\rel$ & A & $\Phi_{set}$ & $\Phi_{bag}$\\
% \hline
% & a & $W_a$ & $W_a$\\
% & b & $W_b$ & $W_b$\\
% & c & $W_c$ & $W_c$\\
% \end{tabular}
%} \caption{Relation $R$ in \Cref{ex:intro}}
% \label{subfig:ex-atom1}
% \end{subfigure}%
% \begin{subfigure}{0.33\linewidth}
% \centering
% \resizebox{!}{10mm}{
% \begin{tabular}{ c | c c c c}
% $E$ & A & B & $\Phi_{set}$ & $\Phi_{bag}$ \\
% \hline
% & a & b & $\top$ & $1$\\
% & b & c & $\top$ & $1$\\
% & c & a & $\top$ & $1$\\
% \end{tabular}
% }
% \caption{Relation $E$ in \Cref{ex:intro}}
% \label{subfig:ex-atom3}
% \end{subfigure}%
% \begin{subfigure}{0.33\linewidth}
% \centering
% \resizebox{!}{29mm}{
% \begin{tikzpicture}[thick]
% \node[tree_node] (a1) at (0, 0){$W_a$};
% \node[tree_node] (b1) at (1, 0){$W_b$};
% \node[tree_node] (c1) at (2, 0){$W_c$};
% \node[tree_node] (d1) at (3, 0){$W_d$};
%
% \node[tree_node] (a2) at (0.75, 0.8){$\boldsymbol{\circmult}$};
% \node[tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circmult}$};
% \node[tree_node] (c2) at (2.25, 0.8){$\boldsymbol{\circmult}$};
%
% \node[tree_node] (a3) at (1.9, 1.6){$\boldsymbol{\circplus}$};
% \node[tree_node] (a4) at (0.75, 1.6){$\boldsymbol{\circplus}$};
% \node[tree_node] (a5) at (0.75, 2.5){$\boldsymbol{\circmult}$};
%
% \draw[->] (a1) -- (a2);
% \draw[->] (b1) -- (a2);
% \draw[->] (b1) -- (b2);
% \draw[->] (c1) -- (b2);
% \draw[->] (c1) -- (c2);
% \draw[->] (d1) -- (c2);
% \draw[->] (c2) -- (a3);
% \draw[->] (a2) -- (a4);
% \draw[->] (b2) -- (a3);
% \draw[->] (a3) -- (a4);
% %sink
% \draw[thick, ->] (a4.110) -- (a5.250);
% \draw[thick, ->] (a4.70) -- (a5.290);
% \draw[thick, ->] (a5) -- (0.75, 3.0);
% \end{tikzpicture}
% }
% \caption{Circuit encoding for query $\poly^2$.}
% \label{fig:circuit-q2-intro}
% \end{subfigure}
% %\vspace*{3mm}
% \vspace*{-3mm}
% \caption{ }%{$\ti$ relations for $\poly$}
% \label{fig:intro-ex}
% \trimfigurespacing
%\end{figure}
%
%
%Following prior efforts~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,GL16}, we generalize this model of Set-PDBs to Bag-PDBs using $\semN$-valued random variables (i.e., $\domain(\randomvar_i) \subseteq \mathbb N$) and constants (annotation $\Phi_{bag}$ in the example).
%Without loss of generality, we assume that input relations are sets (i.e. $Dom(W_i) = \{0, 1\}$), while \emph{query evaluation follows bag semantics}.
%
%\begin{Example}\label{ex:bag-vs-set}
%Continuing the prior example, we are given the following Boolean (resp,. count) query
%$$\poly() \dlImp R(A), E(A, B), R(B)$$
%The lineage of the result in a Set-PDB (Bag-PDB) is a Boolean formula (polynomial) over random variables annotating the input relations (i.e., $W_a$, $W_b$, $W_c$).
%Because the query result is a nullary relation, in what follows we can write $Q(\cdot)$ to denote the function that evaluates the lineage over one specific assignment of values to the variables (i.e., the value of the lineage in the corresponding possible world):
%
%\setlength\parindent{0pt}
%\vspace*{-3mm}
%\begin{tabular}{@{}l l}
% \begin{minipage}[b]{0.45\linewidth}
% \begin{equation}
% \poly_{set}(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a\label{eq:poly-set}
% \end{equation}
% \end{minipage}\hspace*{5mm}
% &
% \begin{minipage}[b]{0.45\linewidth}
% \begin{equation}
% \poly_{bag}(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a\label{eq:poly-bag}
% \end{equation}
% \end{minipage}\\
%\end{tabular}
%\vspace*{1mm}
%
%
%
%These functions compute the existence (count) of the nullary tuple resulting from applying $\poly$ on the PDB of \Cref{fig:intro-ex}.
%For the same possible world identified in \Cref{ex:intro}:
%$$
%\begin{tabular}{c c}
% \begin{minipage}[b]{0.45\linewidth}
% $\poly_{set}(\top, \top, \bot) = \top\top \vee \top\bot \vee \bot\top = \top$
% \end{minipage}
% &
% \begin{minipage}[b]{0.45\linewidth}
% $\poly_{bag}(1, 1, 0) = 1 \cdot 1 + 1\cdot 0 + 0 \cdot 1 = 1$
% \end{minipage}\\
%\end{tabular}
%$$
%
%The Set-PDB query is satisfied in this possible world and the output Bag-PDB tuple has a multiplicity of 1.
%The marginal probability (expected count) of this query is computed over all possible worlds:
%{\small
%\begin{align*}
%\probOf[\poly_{set}] &= \hspace*{-1mm}
% \sum_{w_i \in \{\top,\bot\}} \indicator{\poly_{set}(w_a, w_b, w_c)}\probOf[W_a = w_a,W_b = w_b,W_c = w_c]\\
%\expct[\poly_{bag}] &= \sum_{w_i \in \{0,1\}} \poly_{bag}(w_a, w_b, w_c)\cdot \probOf[W_a = w_a,W_b = w_b,W_c = w_c]
%\end{align*}
%}
%\end{Example}
%
%Note that the query of \Cref{ex:bag-vs-set} in set semantics is indeed non-hierarchical~\cite{DS12}, and thus \sharpphard.
%To see why computing this probability is hard, observe that the three clauses $(W_aW_b, W_bW_c, W_aW_c)$ of $(\ref{eq:poly-set})$ are not independent (the same variables appear in multiple clauses) nor disjoint (the clauses are not mutually exclusive). Computing the probability of such formulas exactly requires exponential time algorithms (e.g., Shanon Decomposition).
%Conversely, in Bag-PDBs, correlations between monomials of the SOP polynomial (\ref{eq:poly-bag}) are not problematic thanks to linearity of expectation.
%The expectation computation over the output lineage is simply the sum of expectations of each clause.
%Referring again to example~\ref{ex:intro}, the expectation is simply
%\begin{equation*}
%\expct\pbox{\poly_{bag}(W_a, W_b, W_c)} = \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}
%\end{equation*}
%In this particular lineage polynomial, all variables in each product clause are independent, so we can push expectations through.
%\begin{equation*}
%= \expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}
%\end{equation*}
%Computing such expectations is indeed linear in the size of the SOP as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial.
%As a further interesting feature of this example, note that $\expct\pbox{W_i} = \probOf[W_i = 1]$, and so taking the same polynomial over the reals:
%\begin{equation}
%\label{eqn:can-inline-probabilities-into-polynomial}
%\expct\pbox{\poly_{bag}}
%= \poly_{bag}(\probOf[W_a=1], \probOf[W_b=1], \probOf[W_c=1])
%\end{equation}
%\Cref{eqn:can-inline-probabilities-into-polynomial} is not true in general, as we shall see in \Cref{sec:suplin-bags}.
%
%The workflow modeling this particular problem can be broken down into two steps. We start with converting the output boolean formula (polynomial) into a representation. This representation is then the interface for the second step, which is computing the marginal probability (count) of the encoded boolean formula (polynomial). A natural question arises as to which representation to use. Our choice to use circuits (\Cref{def:circuit}) to represent the lineage polynomials follows from the observation that the work in WCOJ/FAQ/Factorized DB's --\color{red}CITATION HERE\color{black}-- all contain algorithms that can be easily be modified to output circuits without changing their runtime. Further, circuits generally allow for greater compression than other respresentations, such as expression trees. By the former observation, step one is always linear in the size of the circuit representation of the boolean formula (polynomial), implying that if the second step of the workflow is computed in time greater, then reducing the complexity of the second step would indeed improve the overall efficiency of computing the marginal probability (count) of an output set (bag) PDB tuple. This, however, as noted earlier, cannot be done in the set semantics setting, due to known hardness results.
%
%Though computing the expected count of an output bag PDB tuple $\tup$ is linear (in the size of the polynomial) when the lineage polynomial of $\tup$ is in SOP form, %has received much less attention, perhaps due to the property of linearity of expectation noted above.
%%, perhaps because on the surface, the problem is trivially tractable.In fact, as mentioned, it is linear time when the lineage polynomial is encoded in an SOP representation.
%is this computation also linear (in the size of an equivalent compressed representation) when the lineage polynomial of $\tup$ is in compressed form?
%there exist compressed representations of polynomials, e.g., factorizations~\cite{factorized-db}, that can be polynomially more concise than their SOP counterpart.
Compressed lineage representations arise naturally from typical database optimizations, e.g., projection push-down~\cite{DBLP:books/daglib/0020812}: when a projection precedes a join, additions are performed before multiplications, yielding a product of sums rather than an SOP.
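As a back-of-the-envelope sketch (operator counts only, with a hypothetical shared factor $L$), push-down shrinks a sum of $k$ products sharing $L$ to a single product of $L$ with a sum:

```python
# Gate counts for the two equivalent lineage shapes of
# L*R1 + L*R2 + ... + L*Rk (a toy model of projection push-down).
def sop_ops(k):
    # SOP form: k multiplications plus k-1 additions.
    return k + (k - 1)

def factored_ops(k):
    # Pushed-down form L*(R1 + ... + Rk): 1 multiplication, k-1 additions.
    return 1 + (k - 1)

assert sop_ops(100) == 199
assert factored_ops(100) == 100
```

The saving here is only a constant factor, but nested factorizations compound: repeated sharing of sub-circuits is what makes circuit encodings exponentially smaller in the best case.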
\begin{figure}[t]
\begin{subfigure}[b]{0.51\linewidth}
\centering
\resizebox{!}{20mm} {
\begin{tabular}{c | c c c}
$Route$ & $\text{City}_1$ & $\text{City}_2$ &$\Phi$\\
\hline
& Buffalo & Chicago & $R_a$\\
& Chicago & Zurich & $R_b$\\
& $\cdots$ & $\cdots$ & $\cdots$\\
& Chicago & Bremen & $R_c$\\
\multicolumn{1}{c}{\vspace{1mm}}\\
$Q_3$ & \text{City} & $\Phi_{set}$ & $\Phi_{bag}$\\
\hline
& Chicago & $L_b \wedge R_b \vee L_b \wedge R_c$ & $ L_b \cdot R_b + L_b \cdot R_c$\\
\multicolumn{1}{c}{\vspace{1mm}}\\
$Q_4$ & \text{City} & $\Phi_{set}$ & $\Phi_{bag}$\\
\hline
& Chicago & $L_b \wedge (R_b \vee R_c)$ & $L_b \cdot (R_b + R_c)$\\
\end{tabular}
}
\caption{$Route$, $Q_3$, $Q_4$}
\label{subfig:ex-proj-push-q4}
\end{subfigure}%
\begin{subfigure}[b]{0.24\linewidth}
\centering
\resizebox{!}{29mm} {
\begin{tikzpicture}[thick]
\node[tree_node] (a2) at (0, 0){$R_b$};
\node[tree_node] (b2) at (1, 0){$L_b$};
\node[tree_node] (c2) at (2, 0){$R_c$};
%level 1
\node[tree_node] (a1) at (0.5, 0.8){$\boldsymbol{\circmult}$};
\node[tree_node] (b1) at (1.5, 0.8){$\boldsymbol{\circmult}$};
%level 0
\node[tree_node] (a0) at (1.0, 1.6){$\boldsymbol{\circplus}$};
%edges
\draw[->] (a2) -- (a1);
\draw[->] (b2) -- (a1);
\draw[->] (b2) -- (b1);
\draw[->] (c2) -- (b1);
\draw[->] (a1) -- (a0);
\draw[->] (b1) -- (a0);
\end{tikzpicture}
}
\caption{Circuit encoding $Q_3$}
\label{subfig:ex-proj-push-circ-q3}
\end{subfigure}%
\begin{subfigure}[b]{0.24\linewidth}
\centering
\resizebox{!}{29mm} {
\begin{tikzpicture}[thick]
\node[tree_node] (a1) at (1, 0){$R_b$};
\node[tree_node] (b1) at (2, 0){$R_c$};
%level 1
\node[tree_node] (a2) at (0.75, 0.8){$L_b$};
\node[tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circplus}$};
%level 0
\node[tree_node] (a3) at (1.1, 1.6){$\boldsymbol{\circmult}$};
%edges
\draw[->] (a1) -- (b2);
\draw[->] (b1) -- (b2);
\draw[->] (a2) -- (a3);
\draw[->] (b2) -- (a3);
\end{tikzpicture}
}
\caption{Circuit encoding $Q_4$.}
\label{subfig:ex-proj-push-circ-q4}
\end{subfigure}%
\caption{ }
\label{fig:ex-proj-push}
\end{figure}
\begin{Example}
Consider again the tables in \Cref{subfig:ex-shipping-loc} and \Cref{subfig:ex-shipping-route} and let us assume that the tuples in $Route$ are annotated with random variables as shown in \Cref{subfig:ex-proj-push-q4}.
Consider the equivalent queries $Q_3 := \pi_{\text{City}_1}(Loc \bowtie_{\text{City}_\ell = \text{City}_1}Route)$ and $Q_4 := Loc \bowtie_{\text{City}_\ell = \text{City}_1}\pi_{\text{City}_1}(Route)$.
The latter's ``pushed-down'' projection produces a compressed annotation, both as a polynomial and in its circuit encoding (\Cref{subfig:ex-proj-push-circ-q3,subfig:ex-proj-push-circ-q4}).
In general, a compressed representation of a lineage polynomial can be exponentially smaller than the polynomial itself.
\end{Example}
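As a quick sanity check, the two circuits in the figure compute the same polynomial, while the factorized form uses one fewer product gate (gate counts as noted in the comments):

```python
from itertools import product

q3 = lambda Lb, Rb, Rc: Lb * Rb + Lb * Rc   # SOP circuit: 2 mult, 1 add
q4 = lambda Lb, Rb, Rc: Lb * (Rb + Rc)      # factorized:  1 mult, 1 add

# The circuits agree on every 0/1 assignment; since both encode
# multilinear polynomials, agreement on {0,1}^3 implies equality
# as polynomials.
assert all(q3(*w) == q4(*w) for w in product([0, 1], repeat=3))
```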
This suggests that Bag-PDBs, too, may have higher query processing complexity than deterministic databases.
In this paper, we confirm this intuition, first proving that computing the expected count of a query result tuple is super-linear (\sharpwonehard) in the size of a compressed lineage representation, and then relating the size of the compressed lineage to the cost of answering a deterministic query.
In view of this hardness result (i.e., step 2 of the workflow is the bottleneck in the bag setting as well), we develop an approximation algorithm for the expected counts of SPJU query results over Bag-PDBs that is, to our knowledge, the first linear-time (in the size of the factorized lineage) $(1-\epsilon)$-\emph{multiplicative} approximation, eliminating step 2 as the bottleneck of the workflow.
By extension, this algorithm is only a constant factor slower than deterministic query processing.\footnote{
Monte Carlo sampling~\cite{jampani2008mcdb} is also trivially only a constant factor slower, but can only guarantee additive, rather than our stronger multiplicative, bounds.
}
This is an important result: it implies that computing approximate expectations for the output of SPJU queries over Bag-PDBs can indeed be competitive with deterministic query evaluation over bag databases.
\subsection{Overview of our results and techniques}
Concretely, in this paper:
(i) We show that computing the expected count of a conjunctive query whose output is a bag-$\ti$ is hard (i.e., superlinear in the size of a compressed lineage encoding) by reduction from counting the number of $k$-matchings in an arbitrary graph;
(ii) We present a $(1-\epsilon)$-\emph{multiplicative} approximation algorithm for bag-$\ti$s and show that its complexity is linear in the size of the compressed lineage encoding;
(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
(iv) We further generalize our results to higher moments and prove that for RA+ queries, approximate processing time is within a constant factor of deterministic processing of the same query.
Our hardness results follow from a suitable generalization of the lineage polynomial in \Cref{eq:edge-query}. First, it is easy to generalize the polynomial to $\poly_G(X_1,\dots,X_n)$, which represents the edge set of a graph $G$ on $n$ vertices. Then $\poly_G^k(X_1,\dots,X_n)$ (i.e., $\inparen{\poly_G(X_1,\dots,X_n)}^k$) encodes as its monomials all subgraphs of $G$ with at most $k$ edges. This implies that the corresponding reduced polynomial $\rpoly_G^k(\prob,\dots,\prob)$ (see \Cref{def:reduced-poly}) can be written as $\sum_{i=0}^{2k} c_i\cdot \prob^i$, and we observe that $c_{2k}$ is proportional to the number of $k$-matchings in $G$, whose computation is \sharpwonehard. Thus, if we have access to $\rpoly_G^k(\prob_i,\dots,\prob_i)$ for distinct values $\prob_i$, $0\le i\le 2k$, then we can set up a system of linear equations and compute $c_{2k}$ (and hence the number of $k$-matchings in $G$). This result, however, does not rule out the possibility that computing $\rpoly_G^k(\prob,\dots, \prob)$ for a {\em single specific} value of $\prob$ might be easy: indeed, it is easy for $\prob=0$ or $\prob=1$. However, we are able to show that for any other value of $\prob$, exactly computing $\rpoly_G^k(\prob,\dots, \prob)$ most likely requires super-linear time. This reduction needs more work (and we cannot yet extend our results to $k>3$). Further, we have to rely on recent conjectures from {\em fine-grained} complexity, e.g., on the complexity of counting triangles in $G$, rather than on more standard parameterized hardness such as \sharpwonehard.
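The interpolation step at the heart of this reduction is standard: evaluations of a degree-$2k$ univariate polynomial at $2k+1$ distinct points determine its coefficients via an (invertible) Vandermonde system. A sketch, with a made-up coefficient vector standing in for the $c_i$:

```python
# Recover the coefficients of sum_i c_i * p^i from its values at
# distinct points. The underlying linear system is a Vandermonde
# matrix; we solve it here via Lagrange interpolation.
def eval_poly(coeffs, x):
    return sum(c * x**i for i, c in enumerate(coeffs))

def recover(points, values):
    n = len(points)
    coeffs = [0.0] * n
    for j in range(n):
        xj, yj = points[j], values[j]
        basis = [1.0]          # coefficients of the j-th Lagrange basis
        denom = 1.0
        for m, xm in enumerate(points):
            if m == j:
                continue
            denom *= xj - xm
            basis = [0.0] + basis          # multiply basis by x ...
            for i in range(len(basis) - 1):
                basis[i] -= xm * basis[i + 1]   # ... minus xm
        for i, b in enumerate(basis):
            coeffs[i] += yj * b / denom
    return coeffs

c = [3, 0, 5, 2, 7]            # hypothetical; c[4] plays the role of c_{2k}
pts = [0.1, 0.2, 0.3, 0.4, 0.5]
vals = [eval_poly(c, p) for p in pts]
rec = recover(pts, vals)
assert all(abs(a - b) < 1e-6 for a, b in zip(rec, c))
```

Given oracle access to $\rpoly_G^k$ at $2k+1$ distinct probabilities, the top coefficient recovered this way yields the $k$-matching count, which is what makes exact evaluation at generic points hard.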
The starting point of our approximation algorithm was the simple observation that for any lineage polynomial $\poly(X_1,\dots,X_n)$, we have $\rpoly(1,\dots,1)=\poly(1,\dots,1)$, and, if all the coefficients of $\poly$ are constants, then $\poly(\prob,\dots, \prob)$ (which can easily be computed in linear time) is a $\prob^k$-approximation of the value $\rpoly(\prob,\dots, \prob)$ that we are after. If $\prob$ (i.e., the \emph{input} tuple probabilities) and $k=\degree(\poly)$ are constants, then this gives a constant-factor approximation. We then use sampling to obtain a better approximation factor of $(1\pm \eps)$: we sample monomials from $\poly(X_1,\dots,X_\numvar)$ and take an appropriately weighted sum of their coefficients. Standard tail bounds then yield our desired approximation scheme. To achieve a linear runtime, it turns out that our compressed representation of $\poly$ must satisfy two properties: (i) we can compute $\poly(1,\ldots, 1)$ in linear time, and (ii) we can sample monomials from $\poly(X_1,\dots,X_n)$ quickly.
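A minimal Monte Carlo sketch of the sampling idea, with a made-up monomial list: sample monomials with probability proportional to their coefficients and average $\poly(1,\dots,1)\cdot \prob^{d}$, where $d$ is the number of distinct variables in the sampled monomial:

```python
import random

# rpoly(p,...,p) = sum over monomials of coeff * p^(#distinct variables).
# Estimator: sample a monomial with probability coeff/total and return
# total * p^(#distinct vars); its expectation is exactly rpoly(p,...,p).
monomials = [({"X1", "X2"}, 3), ({"X2", "X3"}, 1), ({"X1"}, 2)]  # made up
p = 0.5
total = sum(c for _, c in monomials)      # plays the role of Q(1,...,1)

exact = sum(c * p ** len(vs) for vs, c in monomials)

random.seed(0)
n = 200_000
acc = 0.0
for _ in range(n):
    r = random.uniform(0, total)          # pick a monomial by coeff mass
    for vs, c in monomials:
        r -= c
        if r <= 0:
            acc += total * p ** len(vs)
            break
est = acc / n
assert abs(est - exact) / exact < 0.05
```

In the actual algorithm, the monomials are never materialized; the circuit structure is what allows both $\poly(1,\dots,1)$ and coefficient-proportional sampling to be computed in time linear in the circuit size.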
%For the ease of exposition, we start off with expression trees (see \Cref{fig:circuit-q2-intro} for an example) and show that they satisfy both of these properties. Later we show that it is easy to show that these properties also extend to polynomial circuits as well (we essentially show that in the required time bound, we can simulate access to the `unrolled' expression tree by considering the polynomial circuit).
We formalize our claim that, because our approximation algorithm runs in time linear in the size of the polynomial circuit, we can approximate the expected output tuple multiplicities with only an $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results extend trivially to higher moments of the tuple multiplicity (not just the expectation).
\paragraph{Paper Organization.} We present relevant background and set up our notation in \Cref{sec:background}. We present our hardness results in \Cref{sec:hard} and our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen}, give a brief overview of related work in \Cref{sec:related-work}, and conclude with open questions in \Cref{sec:concl-future-work}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: