%!TEX root=./main.tex
%root: main.tex
\section{Introduction}\label{sec:intro}
\secrev{
This work explores the problem of computing the expectation of a tuple's multiplicity in a specific construction of bag \abbrTIDB, which we call a \abbrCTIDB. A \abbrCTIDB
$\pdb = \inparen{\worlds, \bpd}$ encodes a bag of uncertain tuples such that each possible tuple encoded in $\pdb$ has a multiplicity of at most $\bound$. Here $\tupset$ is the set of tuples appearing across all possible worlds, and $\worlds$ is the set of all worlds, i.e., the set of all vectors of length $\numvar=\abs{\tupset}$ in which each index corresponds to a distinct $\tup \in \tupset$ and stores its multiplicity. The distribution $\bpd$ is the probability distribution over $\worlds$. A given world $\worldvec \in\worlds$ can be interpreted such that, for each $\tup \in \tupset$, $\worldvec_{\tup}$ is the multiplicity of $\tup$ in $\worldvec$. The probability distribution $\bpd$ for any tuple $\tup$ can then be encoded as $\prob_{\tup, j} = \probOf\pbox{\worldvec_{\tup} = j}$ (for $j \in\pbox{\bound}$), where each tuple multiplicity combination $\inparen{\inparen{\tup, j} \in \tupset\times\pbox{\bound}}$ %distribution
is an independent random event. %for $\tup \in \tupset$.
}
%\mypar{For a later section}
%\sout{
%Since each tuple in $\pdb$ has a mutually exclusive probability distribution over its possible multiplicities, it is natural to reduce a \abbrCTIDB to traditional (set) block independent database (\abbrBIDB). We refer to the reduced \abbrBIDB as a $1$-\abbrBIDB, as it is the case that each tuple can appear in a possible world at most $c = 1$ time. \Cref{fig:ctidb-red} shows an example of this reduction.
%}
\secrev{
Allowing each tuple a multiplicity of at most $\bound$ gives rise to at most $\inparen{\bound+1}^\numvar$ possible worlds, instead of the usual $2^\numvar$ possible worlds of a $1$-\abbrTIDB, which (assuming set query semantics) is the same as the traditional set \abbrTIDB.
Since our query inputs are bags, we consider only bag query semantics in this work. We denote by $\query\inparen{\worldvec}\inparen{\tup}$ the multiplicity of $\tup$ in query $\query$ over possible world $\worldvec\in\worlds$.
We can formally state our problem of computing the expected multiplicity of a result tuple as:
\begin{Problem}\label{prob:expect-mult}
Given a \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, $\raPlus$ query $\query$
\footnote{
A query $\query$ is an $\raPlus$ query if it is composed entirely of one or more of the positive relational operators $\inset{\select, \project, \join, \union}$.
}
, and result tuple $\tup$, compute the expected multiplicity of $\tup$: $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$.
\end{Problem}
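As a small illustration of \Cref{prob:expect-mult} (the numbers are hypothetical and chosen only for this example), let $\bound = 3$, let $\query$ simply return a single input relation, and let $\tup$ be an input tuple with $\prob_{\tup, 1} = 0.2$, $\prob_{\tup, 2} = 0.35$, and $\prob_{\tup, 3} = 0.15$, so that $\tup$ is absent with probability $0.3$. Then
\begin{equation*}
\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}} = \sum_{j\in\pbox{\bound}}j\cdot\prob_{\tup, j} = 1\cdot 0.2 + 2\cdot 0.35 + 3\cdot 0.15 = 1.35.
\end{equation*}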
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Example of product distribution of c-TIDB
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{figure}[h!]
% \centering
% \textcolor{red}{
% \begin{tabular}{>{\footnotesize}c | >{\footnotesize}c | >{\footnotesize}c | >{\footnotesize}c | >{\footnotesize}c}
% \multicolumn{5}{c}{$\mathbf{\rel}$}\\
% \toprule
% A & Mult. $\inparen{M}$ &$\probOf\pbox{M=1}$ &$\probOf\pbox{M=2}$ &$\probOf\pbox{M=3}$\\
% \midrule
% $a$ & $2$ & $0.4$ &$0.3$ &$0.0$\\
% $b$ & $3$ & $0.2$ &$0.35$ &$0.15$\\
% \end{tabular}
%% \hspace*{0.5cm}
%% {\LARGE $\Rightarrow$}
%% \hspace*{0.5cm}
% \begin{tabular}{>{\footnotesize}c | >{\footnotesize}c}
% \multicolumn{2}{c}{$\textbf{World Probabilities}$}\\
% \toprule
% World & Probability \\
% \midrule
% $\emptyset$ & $0.3\cdot0.3 = 0.09$\\
% $\inset{\intup{a, 1}}$ & $0.4\cdot0.3 = 0.12$\\
% $\inset{\intup{a, 2}}$ & $0.3\cdot0.3 = 0.09$\\
% $\inset{\intup{b, 1}}$ & $0.3\cdot0.2 = 0.06$\\
% $\inset{\intup{b, 2}}$ & $0.3\cdot0.35 = 0.105$\\
% $\inset{\intup{b, 3}}$ & $0.3\cdot0.15 = 0.045$\\
% $\inset{\intup{a, 1}, \intup{b, 1}}$ & $0.4\cdot0.2 = 0.08$\\
% $\inset{\intup{a, 1}, \intup{b, 2}}$ & $0.4\cdot0.35 = 0.14$\\
% $\inset{\intup{a, 1}, \intup{b, 3}}$ & $0.4\cdot0.15 = 0.06$\\
% $\inset{\intup{a, 2}, \intup{b, 1}}$ & $0.3\cdot0.2 = 0.06$\\
% $\inset{\intup{a, 2}, \intup{b, 2}}$ & $0.3\cdot0.35 = 0.105$\\
% $\inset{\intup{a, 2}, \intup{b, 3}}$ & $0.3\cdot0.15 = 0.045$\\
% \end{tabular}
% }
% \caption{\textcolor{red}{\abbrCTIDB relation$\rel$ and its possible worlds with their probabilities.% Reduction to $1$-\abbrBIDB ($\rel'$). Note the probability distribution over tuple multiplicities of $\rel$, where e.g. tuple $\tup_1$ has a probability $\prob_{1, j} > 0$ for each multiplicity $j \in [2]$. This is better expressed as a block of mutually exclusive tuples, where each tuple has a specific multiplicity in $[c]$. Multiplicities that don't exist are automatically assigned a probability of $0$. Also note that it is implicit in \abbrBIDB\xplural for a block $i$ that $1 - \sum_{j \in [c]}\prob_{i, j}$ is the probability that no tuple in that block will be selected for a possible world.
%}}
% \label{fig:ctidb-red}
%\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Figure may work for an example in a later section of the Intro
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{figure}[h!]
% \centering
% \textcolor{red}{
% \begin{tabular}{c | c | c}
% \multicolumn{3}{c}{$\mathbf{\rel}$}\\
% \toprule
% A & Mult. & id\\
% \midrule
% $a$ & $2$ & $\tup_1$\\
% $b$ & $3$ & $\tup_2$\\
% \end{tabular}
% \hspace*{1cm}
% {\LARGE $\equiv$}
% \hspace*{1cm}
% \begin{tabular}{c | c | c}
% \multicolumn{3}{c}{$\textbf{\rel'}$}\\
% \toprule
% A & $\poly$ & $\probOf\pbox{x_{i, j}}$\\
% \midrule
% a & $\pVar_{1, 1}$ & $0.4$\\
% a & $\pVar_{1, 2}$ & $0.3$\\
% a & $\pVar_{1, 3}$ & $0.0$\\
% \midrule
% b & $\pVar_{2, 1}$ & $0.2$\\
% b & $\pVar_{2, 2}$ & $0.35$\\
% b & $\pVar_{2, 3}$ & $0.15$\\
% \end{tabular}
% }
% \caption{\textcolor{red}{\abbrCTIDB ($\rel$) Reduction to $1$-\abbrBIDB ($\rel'$). Note the probability distribution over tuple multiplicities of $\rel$, where e.g. tuple $\tup_1$ has a probability $\prob_{1, j} > 0$ for each multiplicity $j \in [2]$. This is better expressed as a block of mutually exclusive tuples, where each tuple has a specific multiplicity in $[c]$. Multiplicities that don't exist are automatically assigned a probability of $0$. Also note that it is implicit in \abbrBIDB\xplural for a block $i$ that $1 - \sum_{j \in [c]}\prob_{i, j}$ is the probability that no tuple in that block will be selected for a possible world.
%}}
% \label{fig:ctidb-red}
%\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
It is natural to explore computing the expected multiplicity of a result tuple, as this is the bag analog of computing the marginal probability of a tuple in a set \abbrPDB.
In this work we will assume that $c =\bigO{1}$, since this is what is typically seen in practice.
%because of the cancellation effect of queries over a $1$-\abbrBIDB (introduced later), where, for the worst case, a self join query, we would have a factor of $\frac{1}{c^{n-1}}$ cancellations.
Allowing for unbounded $c$ is an interesting open problem.
\mypar{Hardness of Set Query Semantics and Bag Query Semantics}
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and the data complexity of the problem in general has been shown by Dalvi and Suciu to be \sharpphard\cite{10.1145/1265530.1265571}. For our setting, there exists a trivial polytime algorithm to solve~\Cref{prob:expect-mult} for any $\raPlus$ query over a \abbrCTIDB: by linearity of expectation, it suffices to compute the expectation over a `sum-of-products' representation of the query operations of $\query\inparen{\pdb}\inparen{\tup}$. %We discuss polynomial representation and equivalence in the following subsection.
Since we can compute~\Cref{prob:expect-mult} in polynomial time, the interesting question that we explore deals with analyzing the hardness of computing expectation using fine-grained analysis and parameterized complexity, where we are interested in the exponent of polynomial runtime.
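To sketch why the trivial algorithm works (the symbols $c_i$, $S_i$, and $d_{i,\tup}$ are introduced only for this illustration), write the sum-of-products representation as $\sum_{i}c_i\prod_{\tup\in S_i}X_\tup^{d_{i,\tup}}$, with one variable $X_\tup$ per input tuple $\tup$ and each $S_i\subseteq\tupset$. Linearity of expectation, together with the independence of distinct tuples, gives
\begin{equation*}
\expct\pbox{\sum_{i}c_i\prod_{\tup\in S_i}X_\tup^{d_{i,\tup}}} = \sum_{i}c_i\prod_{\tup\in S_i}\expct\pbox{X_\tup^{d_{i,\tup}}}, \qquad\text{where}\qquad \expct\pbox{X_\tup^{d}} = \sum_{j\in\pbox{\bound}}j^{d}\cdot\prob_{\tup, j},
\end{equation*}
so the expectation can be evaluated one monomial at a time.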
}
%\sout{
%\mypar{Example that can perhaps be used later on (using commented out figure above)}
%Given a \abbrCTIDB $\pdb$ with $\numvar$ tuples, we can encode a possible world by the vector $\vct{W} \in \inset{0,\ldots, c}^\numvar$, with the intuitive interpretation when bit $W_i = j$, then tuple $\tup_i$ with multiplicity $j$ is selected, with $\tup_i$ not existing for the special case of $j = 0$. For the example in ~\Cref{fig:ctidb-red}, we have that for \abbrCTIDB $\textbf{R}$, $\numvar = 2$. Then, e.g., arbitrary world vector $\vct{W} = [2, 3]$ encodes the possible world $\db = \inset{\intup{a, 2}, \intup{b, 3}}$ Computing ~\Cref{prob:expect-mult} for tuple $\tup_2$ in ~\Cref{fig:ctidb-red} when $\query = \mathbf{\rel}$ then becomes $\expct_{\randDB\sim\pd}\pbox{\mathbf{\rel}\inparen{\tup_2}} = 1\cdot\prob_{2,1} + 2\cdot\prob_{2,2} + 3\cdot\prob_{2,3} = 1\cdot 0.2 + 2\cdot 0.35 + 3\cdot 0.15 = 1.35$.
%}
\secrev{
Specifically, in this work we ask if~\Cref{prob:expect-mult} can be solved in time linear in the runtime of an equivalent deterministic query. If this is true, then this would open up the way for deployment of \abbrCTIDB\xplural in practice. To analyze this question we denote by $\timeOf{}^*(Q,\pdb)$ the optimal runtime complexity of computing~\Cref{prob:expect-mult} over \abbrCTIDB $\pdb$.
%Let $\gentupset$ denote the set of tuples in $\pdb$, i.e.,
%\begin{Definition}[$\gentupset$]
%Define $\gentupset$ to be the set of tuples appearing across all the possible worlds of a $\abbrCTIDB$, formally $\gentupset = \inset{\tup_i ~|~ \forall \worldvec \in \worlds,~\forall i \in \abs{\tupset}:~\worldvec\pbox{i} > 0}$. When a specific $\pdb = \inparen{\worlds, \bpd}$ is being referred to, we will use $\tupset$ to denote the set of tuples.
%\end{Definition}
Let $\qruntime{\optquery{\query},\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\optquery{\query}$, deterministic database $\gentupset$, and multiplicity bound $\bound$. Since we consider $\raPlus$ queries in which the order of operators can impact runtime, we denote the optimal query as $\optquery{\query} = \arg\min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$.
%let $\qruntim{\optquery{\query}, \gentupset, \bound} = \min_{\query'\in\raPlus,~\query'\equiv\query}T_{det}\inparen{\query, \gentupset, \bound}$ be the runtime for the optimally structured equivalent $\raPlus$ query $\query'$ (with some caveats; discussed in~\Cref{sec:gen}). % of query $\query$ on deterministic database $\tupset$.
%{\newline\noindent\centerline{\Huge \textcolor{black}{Or instead$\ldots$}}}
%\newline\noindent Let $T_{det}\inparen{\query, \gentupset, \bound}$ denote the runtime for $\raPlus$ query $\query$, deterministic database $\gentupset$, and multiplicity bound $\bound$. Since this paper does not consider optimization schemes, we leave optimization to the reader and show that our results hold across all inputs.
%We make this runtime concrete later on.
%We denote by $\dbbase$ the base \abbrCTIDB table containing all possible tuples, formally as,
%\AR{Again if we are defining \abbrCTIDB `from scratch' instead of in terms of general PDBs, then the above might not be needed. Also it should be \abbrCTIDB instead of \abbrPDB in the sentence below.}
\begin{table}[h!]
\begin{tabular}{|p{0.43\textwidth}|p{0.12\textwidth}|p{0.35\textwidth}|}
\hline
Lower bound on $\timeOf{}^*(\query,\pdb)$ & Num. $\bpd$s & Hardness Assumption\\
\hline
$\Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
%\hline
$\omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
%\hline
$\Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & \Cref{conj:known-algo-kmatch}\\ %Multiple & Current $k$-matching algorithms\\
\hline
\end{tabular}
\caption{Our lower bounds for a specific hard query $\query$ parameterized by $k$. For $\pdb = \inparen{\worlds, \bpd}$, the rows with `Multiple' in the second column need the algorithm to be able to handle multiple $\bpd$, i.e., probability distributions (for a given $\tupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_0,C_0,c_0$ are constants that are independent of $k$).}
\label{tab:lbs}
\end{table}
\mypar{Our lower bound results}
Our question is whether or not it is always true that $\timeOf{}^*\inparen{\query, \pdb}\leq\qruntime{\optquery{\query}, \tupset, \bound}$. Unfortunately this is not the case.
\Cref{tab:lbs} shows our results.%our lower bounds for computing~\Cref{prob:expect-mult} on \abbrCTIDB\xplural.
Specifically, depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le \bigO{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$, where $k$ is the join width of the query $\query$ over all result tuples $\tup$ (our notion of join width follows from~\Cref{def:degree-of-poly} and~\Cref{fig:nxDBSemantics}) and is the parameter that defines our family of hard queries.
What our lower bound in the third row says is that one cannot get more than a polynomial improvement over essentially the trivial algorithm for~\Cref{prob:expect-mult}.
However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace the $\qruntime{\optquery{\query}, \tupset, \bound}$ by just $\numvar$ (indeed these results follow from known lower bounds for deterministic query processing). Our contribution is then to identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
\mypar{Our upper bound results} We introduce a $(1\pm \epsilon)$-approximation algorithm that solves~\Cref{prob:expect-mult} in time $O_\epsilon\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}$. This means that, when we are okay with approximation, we can solve~\Cref{prob:expect-mult} in time linear in the runtime of the deterministic query %$\timeOf{Approx}^*\inparen{\query, \pdb}\leq\qruntim{\optquery{\query},\tupset,\bound}$ (where $\timeOf{Approx}^*\inparen{\cdot}$ denotes runtime of approximation algorithm),
and bag \abbrPDB\xplural are deployable in practice.
% In particular, we show the following upper bound results.
%(i) We show that e.g. for a circuit representation of the lineage polynomial (more on this later), when the circuit is a tree and there is a single
% result tuple, we also have the same runtime (we can also handle the case of multiple result tuples\footnote{We can approximate the expected result tuple multiplicities (for all result tuples {\em simultanesouly}) with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of result tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}).
%Further, we show that for {\em any} $\raPlus$ query on a \abbrTIDB $(1$-$\abbrTIDB)$, we also obtain linear runtime for approximation.
% the approximation algorithm has runtime linear in the size of the compressed lineage encoding (
In contrast, known approximation techniques (\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}) in set-\abbrPDB\xplural need time $\Omega(\qruntime{\optquery{\query}, \tupset, \bound}^{2k})$ %, where $\circuit$ is a representation of the query operations and input to produce $\tup$; more on this shortly.
(see \Cref{sec:karp-luby}).
Further, our approximation algorithm works for a more general notion of bag \abbrPDB\xplural beyond \abbrCTIDB\xplural
%we generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases
(see \Cref{subsec:tidbs-and-bidbs}). %(\abbrBIDB\xplural).
}
\secrev{
\subsection{Polynomial Equivalence}
A common encoding of probabilistic databases (e.g., in \cite{IL84a,Imielinski1989IncompleteII,Antova_fastand,DBLP:conf/vldb/AgrawalBSHNSW06} and many others) relies on annotating tuples with lineages, propositional formulas that describe the set of possible worlds that the tuple appears in. The bag semantics analog is a provenance/lineage polynomial (see~\Cref{fig:nxDBSemantics}) $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07}, a polynomial with non-zero integer coefficients and exponents, over integer variables $\vct{X}$ encoding input tuple multiplicities.
%Intuitively, a \abbrCTIDB lends itself to a useful reduction to a specific type of block independent database (\abbrBIDB) which we refer to as a $1$-\abbrBIDB. A $1$-\abbrBIDB is a \abbrBIDB in the traditional sense of allowing no duplicate tuples, \emph{but} where we use bag query semantics instead of the usual set query semantics.
%(see~\Cref{fig:nxDBSemantics} for a definition)
\begin{figure}
\begin{align*}
\polyqdt{\project_A(\query)}{\gentupset}{\tup} =& \sum_{\tup': \project_A(\tup') = \tup} \polyqdt{\query}{\gentupset}{\tup'} &
\polyqdt{\query_1 \union \query_2}{\gentupset}{\tup} =& \polyqdt{\query_1}{\gentupset}{\tup} + \polyqdt{\query_2}{\gentupset}{\tup}\\
\polyqdt{\select_\theta(\query)}{\gentupset}{\tup} =& \begin{cases}
\polyqdt{\query}{\gentupset}{\tup} & \text{if }\theta(\tup) \\
0 & \text{otherwise}.
\end{cases} &
\begin{aligned}
\polyqdt{\query_1 \join \query_2}{\gentupset}{\tup} =\\ ~
\end{aligned}&
\begin{aligned}
&\polyqdt{\query_1}{\gentupset}{\project_{\attr{\query_1}}{\tup}} \\
&~~~\cdot\polyqdt{\query_2}{\gentupset}{\project_{\attr{\query_2}}{\tup}}
\end{aligned}\\
& & & \polyqdt{\rel}{\gentupset}{\tup} = X_\tup%\sum_{j \in [c]}j\cdot\pVar_{\tup, j}
\end{align*}\\[-10mm]
\caption{Construction of the lineage (polynomial) for an $\raPlus$ query $\query$ over an arbitrary deterministic database $\gentupset$, where $\vct{X}$ consists of all $X_\tup$ over all $\rel$ in $\gentupset$ and $\tup$ in $\rel$. Here $\gentupset.\rel$ denotes the instance of relation $\rel$ in $\gentupset$. Note that after we introduce the reduction to $1$-\abbrBIDB, the base case will be expressed alternatively.}
\label{fig:nxDBSemantics}
\end{figure}
We drop $\query$, $\tupset$, and $\tup$ from $\apolyqdt$ when they are clear from the context or irrelevant to the discussion. We now specify the problem of computing the expectation of tuple multiplicity in the language of lineage polynomials:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}[Expected Multiplicity of Lineage Polynomials]\label{prob:bag-pdb-poly-expected}
Given an $\raPlus$ query $\query$, \abbrCTIDB $\pdb$ and result tuple $\tup$, compute the expected
multiplicity of the polynomial $\apolyqdt$ (i.e., $\expct_{\vct{W}\sim \pdassign}\pbox{\apolyqdt(\vct{W})}$, where $\vct{W} \in \worlds$).%\inset{0,\ldots, \bound}^{\abs{\tupset}}$).
%,
%where $\pdassign$ is the distribution induced by $\pd$ on the relevant assignments $\vct{W}$ to variables of $\apolyqdt$.
\end{Problem}
We note that computing \Cref{prob:expect-mult}
is equivalent to (yields the same result as) computing \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
%In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for several models of probabilistic databases and various encodings of such polynomials.
}
\secrev{
All of our results rely on working with a {\em reduced} form $\inparen{\rpoly}$ of the lineage polynomial $\poly$. In fact, it turns out that for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the $1$-\abbrTIDB. This is also true when the query input(s) is a block independent disjoint probabilistic database (with tuple multiplicity of at most $1$), which we refer to as a $1$-\abbrBIDB.
% For our results to be applicable to \abbrCTIDB\xplural, we introduce the following reduction.
%\begin{Definition}
%Any \abbrCTIDB $\pdb$, can be reduced to an equivalent $1$-\abbrBIDB $\pdb'$ in the following manner. For each $\tup_i \in \tupset$, create a block of $\bound + 1$ disjoint \abbrBIDB tuples in $\pdb'$ such that each tuple in the newly formed block is mapped to its own boolean variable $X_{i, j}$ for $i \in \abs{D}$ and $j \in \pbox{c+1}$. Then, given $\worldvec \in \worlds$, the equivalent world in $\pdb'$ will set each variable $X_{i, j} = 1$ for each $\worldvec\pbox{i} = j$, while $\inparen{\text{for }\ell \neq j}$ all other $X_{i, \ell} \in \vct{X}$ of $\pdb'$ are set to $0$.
%\end{Definition}
%\begin{Example}
%Consider the $\boldsymbol{Route}$ relation of~\Cref{fig:two-step} and query $\query = \project_{\text{City}_1}\inparen{\boldsymbol{Route}}$. The output relation $\query$ is $\inset{\intup{Chicago, X}, \intup{Chicago, Y}}$ and can be represented as a \abbrCTIDB $\query' = \inset{\intup{Chicago, X', 2}}$, where the following probabilities are true: $\probOf\pbox{X' = 0} = \probOf\pbox{\neg X \wedge \neg Y}$, $\probOf\pbox{X' = 1} = \probOf\pbox{\inparen{X \vee Y}\wedge\inparen{\neg X \vee \neg Y}}$, and $\probOf\pbox{X' = 2} = \probOf\pbox{X\wedge Y}$. $\query'$ can then be reduced to a $1$-\abbrBIDB by creating a block of the following disjoint tuples: $\query'' = \inset{\intup{\text{Chicago}, X'_0}, \intup{\text{Chicago}, X'_1}, \intup{\text{Chicago}, X'_2}}$ such that $\probOf\pbox{X'_i = 1} = \probOf\pbox{X' = i}$.
%\end{Example}
Next, we motivate this reduced polynomial.
Consider the query $\query_1$ defined as follows over the bag relations of \Cref{fig:two-step}:
}
\begin{lstlisting}
SELECT 1 FROM T $t_1$, Route r, T $t_2$
WHERE $t_1$.city = r.city1 AND $t_2$.city = r.city2
\end{lstlisting}
\secrev{
It can be verified that $\poly_1\inparen{A, B, C, E, X, Y, Z}$ for the sole result tuple (i.e., the count) of $\query_1$ is $AXB + BYE + BZC$. Now consider the product query $\query_1^2 = \query_1 \times \query_1$.
The lineage polynomial for $\query_1^2$ is given by $\poly_1^2\inparen{A, B, C, E, X, Y, Z}$
$$
=A^2X^2B^2 + B^2Y^2E^2 + B^2Z^2C^2 + 2AXB^2YE + 2AXB^2ZC + 2B^2YEZC.
$$
To compute $\expct\pbox{\poly_1^2}$ we can use linearity of expectation and push the expectation through each summand. To keep things simple, let us focus on the monomial $\poly_1^{\inparen{ABX}^2} = A^2X^2B^2$ as the procedure is the same for all other monomials of $\poly_1^2$. Let $\randWorld_X$ be the random variable corresponding to a lineage variable $X$. Because the distinct variables in the product are independent, we can push expectation through them yielding $\expct\pbox{\randWorld_A^2\randWorld_X^2\randWorld_B^2}=\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B^2}$. Since $\randWorld_A, \randWorld_B\in \inset{0, 1}$ we can further derive $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B}$ by the fact that for any $W\in \inset{0, 1}$, $W^2 = W$. Observe that if $X\in\inset{0, 1}$, then we further would have $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B} = \prob_A\cdot\prob_X\cdot\prob_B$ (denoting $\probOf\pbox{\randWorld_A = 1} = \prob_A$) $= \rpoly_1^{\inparen{ABX}^2}\inparen{\prob_A, \prob_X, \prob_B}$ (see $ii)$ of~\Cref{def:reduced-poly}). However, in this example, we get stuck with $\expct\pbox{\randWorld_X^2}$, since $\randWorld_X\in\inset{0, 1, 2}$ and for $\randWorld_X \gets 2$, $\randWorld_X^2 \neq \randWorld_X$.
%the expectation is $\expct\pbox{A^2X^2B^2} = A\cdot\prob_A\cdot\inparen{\sum\limits_{i \in [2]}X_i\cdot \prob_{X, i}}\cdot B\prob_B$ for $X \in \inset{0, 1, 2}$.
Denote the variables of $\poly$ by $\vars{\poly}.$ In the \abbrCTIDB setting, $\poly\inparen{\vct{X}}$ has an equivalent reformulation $\inparen{\refpoly{}}$ that is of use to us. Given $X_\tup \in\vars{\poly}$, by definition $X_\tup \in\inset{0,\ldots, c}$. We can replace $X_\tup$ by $\sum_{j\in\pbox{\bound}}j\cdot X_{\tup, j}$ where each $X_{\tup, j}\in\inset{0, 1}$. Then for any $\worldvec\in\worlds$, we set $X_{\tup, j} = 1$ for $\worldvec_\tup = j$, while $X_{\tup, j'} = 0$ for all $j'\neq j\in\pbox{\bound}$. By construction then $\poly\inparen{\vct{X}}\equiv\refpoly{}\inparen{\vct{X_R}}$ $\inparen{\vct{X_R} = \vars{\refpoly{}}}$ since for any $X_\tup\in\vars{\poly}$ we have the equality $X_\tup = \worldvec_\tup = \sum_{j\in\pbox{\bound}}j\cdot X_{\tup, j}$.
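For instance, with $\bound = 2$ (a value chosen only for illustration), the substitution of \Cref{def:reduced-poly} reads $X_\tup = X_{\tup, 1} + 2X_{\tup, 2}$, and the three possible multiplicities of $\tup$ are recovered as
\begin{align*}
\worldvec_\tup = 0 &:\quad \inparen{X_{\tup, 1}, X_{\tup, 2}} = \inparen{0, 0}, & 1\cdot 0 + 2\cdot 0 &= 0,\\
\worldvec_\tup = 1 &:\quad \inparen{X_{\tup, 1}, X_{\tup, 2}} = \inparen{1, 0}, & 1\cdot 1 + 2\cdot 0 &= 1,\\
\worldvec_\tup = 2 &:\quad \inparen{X_{\tup, 1}, X_{\tup, 2}} = \inparen{0, 1}, & 1\cdot 0 + 2\cdot 1 &= 2.
\end{align*}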
Considering again our example,
\begin{multline*}
\refpoly{1, }^{\inparen{ABX}^2}\inparen{A, X, B} = \poly_1^{\inparen{AXB}^2}\inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}, \sum_{j_2\in\pbox{\bound}}j_2X_{j_2}, \sum_{j_3\in\pbox{\bound}}j_3B_{j_3}} \\
= \inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}}^2\inparen{\sum_{j_2\in\pbox{\bound}}j_2X_{j_2}}^2\inparen{\sum_{j_3\in\pbox{\bound}}j_3B_{j_3}}^2.
\end{multline*}
Since the possible multiplicities of a tuple $\tup$ are mutually exclusive, we can drop all cross terms and have $\refpoly{1, }^2 = \sum_{j_1, j_2, j_3 \in \pbox{\bound}}j_1^2A^2_{j_1}j_2^2X_{j_2}^2j_3^2B^2_{j_3}$. Computing expectation we get $\expct\pbox{\refpoly{1, }^2}=\sum_{j_1,j_2,j_3\in\pbox{\bound}}j_1^2j_2^2j_3^2\expct\pbox{\randWorld_{A_{j_1}}}\expct\pbox{\randWorld_{X_{j_2}}}\expct\pbox{\randWorld_{B_{j_3}}}$, since we now have that all $\randWorld_{X_j}\in\inset{0, 1}$.
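The dropped cross terms can be checked directly on a single variable. As a sketch for the multiplicity-$2$ variable $X$ of the running example, we use that $X_1X_2 = 0$ in every possible world and that $X_j^2 = X_j$ for $X_j\in\inset{0, 1}$ (here $\prob_{X, j}$ is shorthand, used only in this illustration, for $\probOf\pbox{\randWorld_X = j}$):
\begin{align*}
\inparen{X_1 + 2X_2}^2 &= X_1^2 + 4X_1X_2 + 4X_2^2 = X_1 + 4X_2,\\
\expct\pbox{\randWorld_X^2} &= \expct\pbox{\randWorld_{X_1}} + 4\expct\pbox{\randWorld_{X_2}} = \prob_{X, 1} + 4\prob_{X, 2}.
\end{align*}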
% \begin{footnotesize}
% \begin{align*}
% &\expct\pbox{\randWorld_A^2\randWorld_X^2\randWorld_B^2} = \expct\pbox{\randWorld_A^2}\expct\pbox{\inparen{\randWorld_{X_1} + \randWorld_{X_2}}^2}\expct\pbox{\randWorld_B^2} = \expct\pbox{\randWorld_A}\expct\pbox{\randWorld_{X_1}^2 + 2\randWorld_{X_1}\randWorld_{X_2} + \randWorld_{X_2}^2}\expct\pbox{\randWorld_B} =\\
% &\expct\pbox{\randWorld_A}\inparen{\expct\pbox{\randWorld_{X_1}^2}+\expct\pbox{2\randWorld_{X_1}\randWorld_{X_2}}+\expct\pbox{\randWorld_{X_2}^2}}\expct\pbox{\randWorld_B} = \expct\pbox{\randWorld_A}\inparen{\expct\pbox{\randWorld_{X_1}} + \expct\pbox{2\randWorld_{X_1}\randWorld_{X_2}} + \expct\pbox{\randWorld_{X_2}}}\expct\pbox{\randWorld_B} = \\
% &\expct\pbox{\randWorld_A}\inparen{\sum\limits_{j \in \pbox{\bound}}\expct\pbox{j\cdot\randWorld_{X_j}}}\expct\pbox{\randWorld_B}.
% \end{align*}
% \end{footnotesize}
%We can drop the term $\expct\pbox{2\randWorld_{X_1}\randWorld_{X_2}}$ since by definition a tuple can only have one multiplicity value in a possible world, thus always making $\randWorld_{X_1}\cdot \randWorld_{X_2} = 0$.
%Another subtlety to note is that for any $i\in \pbox{\bound}$, $\expct\pbox{\randWorld_{X_i}} = i\cdot\prob_{X, i}$.
This leads us to consider a structure related to the lineage polynomial.
%By exploiting linearity of expectation, further pushing expectation through independent variables and observing that for any $\randWorld\in\{0, 1\}$, we have $\randWorld^2=\randWorld$, the expectation is
%$\expct\limits_{\vct{\randWorld}\sim\pdassign}\pbox{\poly^2\inparen{\vct{\randWorld}}}$ (where $\randWorld_A$ is the random variable corresponding to $A$, distributed by $\pdassign$).
%Atri: Combined the the first step below with the next one to save space.
%\begin{footnotesize}
%\begin{multline*}
%\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B^2} + \expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y^2}\expct\pbox{\randWorld_E^2} + \expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Z^2}\expct\pbox{\randWorld_C^2} + 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E}\\
%+ 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C}.
%\end{multline*}
%\end{footnotesize}
%\noindent Since for any $\randWorld\in\{0, 1\}$, we have $\randWorld^2=\randWorld$,
%then for any $k > 0$, $\expct\pbox{\randWorld^k} = \expct\pbox{\randWorld}$, which means that
%$\expct\limits_{\vct{\randWorld}\sim\pdassign}\pbox{\poly^2\inparen{\vct{\randWorld}}}$ simplifies to:
%\begin{footnotesize}
%\begin{multline*}
%\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B} + \expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E} + \expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B}\expct{\randWorld_Y}\expct\pbox{\randWorld_E} \\
%+ 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C}.
%\end{multline*}
%\end{footnotesize}
%\noindent This property leads us to consider a structure related to the lineage polynomial.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly\inparen{\inparen{X_\tup}_{\tup\in\tupset}}$, i) define the reformulated polynomial $\refpoly{}\inparen{\inparen{X_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}%X_{1, 1},\ldots X_{1, \bound}, X_{2, 1}\ldots X_{\numvar, \bound}}
$ to be the polynomial $\refpoly{}$ = $\poly\inparen{\inparen{\sum_{j\in\pbox{\bound}}j\cdot X_{\tup, j}}_{\tup\in\tupset}}%,\ldots,\sum_{j\in\pbox{\bound}}j\cdot X_{\numvar, j}}
$ and ii) define the \emph{reduced polynomial} $\rpoly\inparen{\inparen{X_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}%(X_{1, 1},\ldots X_{1, \bound}, X_{2, 1}\ldots X_{\numvar, \bound})
$ to be the polynomial resulting from converting $\refpoly{}$ into the standard monomial basis (\abbrSMB),
\footnote{
This is the representation, typically used in set-\abbrPDB\xplural, where the polynomial is represented as a sum of `pure' products. See \Cref{def:smb} for a formal definition.
}
removing all monomials containing the term $X_{\tup, j}X_{\tup, j'}$ for $\tup\in\tupset, j\neq j'\in\pbox{c}$, and setting all \emph{variable} exponents $e > 1$ to $1$.
\end{Definition}
Continuing with the example $\poly_1^2\inparen{A, B, C, E, X_1, X_2, Y, Z}$, to save clutter we i) do not show the full expansion for variables with greatest multiplicity $= 1$, since e.g. for variable $A$ the sum of products reduces to $1^2\cdot A^2 = A$, and ii) for $\sum_{j\in\pbox{\bound}}j^2\cdot X_j$, we omit the summands encoding multiplicities $> 2$, since the greatest multiplicity of the tuple annotated with $X$ is $2$ and those summands therefore always evaluate to $0$.
\begin{multline*}
\rpoly_1^2(A, B, C, E, X_1, X_2, Y, Z) = \\
A\inparen{\sum\limits_{j\in\pbox{\bound}}j^2X_j}B + BYE + BZC + 2A\inparen{\sum\limits_{j\in\pbox{\bound}}jX_j}BYE + 2A\inparen{\sum\limits_{j\in\pbox{\bound}}jX_j}BZC + 2BYEZC =\\
ABX_1 + AB\inparen{2}^2X_2 + BYE + BZC + 2AX_1BYE + 2A\inparen{2}X_2BYE + 2AX_1BZC + 2A\inparen{2}X_2BZC + 2BYEZC.
%&\; = AXB + BYD + BZC + 2AXBYD + 2AXBZC + 2BYDZC
\end{multline*}
Note that we have argued that for our specific example the expectation that we want is $\rpoly_1^2(\probOf\inparen{A=1},$ $\probOf\inparen{B=1}, \probOf\inparen{C=1}, \probOf\inparen{E=1}, \probOf\inparen{X_1=1}, \probOf\inparen{X_2=1}, \probOf\inparen{Y=1}, \probOf\inparen{Z=1})$.
%It can be verified that the reduced polynomial parameterized with each variable's respective marginal probability is a closed form of the expected count (i.e., $\expct\limits_{\vct{\randWorld}\sim\pd}\pbox{\Phi^2\inparen{\vct{X}}} = \widetilde{\Phi^2}(\probOf\pbox{A=1},$ $\probOf\pbox{B=1}, \probOf\pbox{C=1}), \probOf\pbox{D=1}, \probOf\pbox{X=1}, \probOf\pbox{Y=1}, \probOf\pbox{Z=1})$).
\Cref{lem:tidb-reduce-poly} generalizes the equivalence to {\em all} $\raPlus$ queries on \abbrCTIDB\xplural (proof in \Cref{subsec:proof-exp-poly-rpoly}).
\begin{Lemma}\label{lem:tidb-reduce-poly}
For any \abbrCTIDB $\pdb$, $\raPlus$ query $\query$, and lineage polynomial
%\BG{Term has not been introduced yet.}
%Atri: fixed
$\poly\inparen{\vct{X}}=\poly\pbox{\query,\tupset,\tup}\inparen{\vct{X}}$, it holds that $
\expct_{\vct{W} \sim \pdassign}\pbox{\refpoly{}\inparen{\vct{W}}} = \rpoly\inparen{\probAllTup}
$, where $\probAllTup = \inparen{\inparen{\prob_{\tup, j}}_{\tup\in\tupset, j\in\pbox{c}}}.$%,\ldots,\prob_{\abs{\tupset}, \bound}}$ is defined by $\bpd$.
\end{Lemma}
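As a sanity check of \Cref{lem:tidb-reduce-poly} on the running example (a sketch that assumes, as above, that $X_1$ and $X_2$ are $0/1$-valued indicators of the mutually exclusive events that the tuple annotated with $X$ has multiplicity $1$ and $2$, respectively), consider the factor $X_1 + 2X_2$ in isolation. In every possible world $X_j^2 = X_j$ and $X_1X_2 = 0$, so
\[
\inparen{X_1 + 2X_2}^2 = X_1^2 + 4X_1X_2 + 4X_2^2 = X_1 + 4X_2,
\]
and by linearity of expectation the expected value of the left-hand side is $\probOf\inparen{X_1=1} + 4\cdot\probOf\inparen{X_2=1}$. This mirrors the two reduction rules: the cross term $X_1X_2$ is dropped and every variable exponent is set to $1$, leaving $1^2\cdot X_1 + 2^2\cdot X_2$.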
}
\secrev{
\subsection{Our Techniques}
\mypar{Lower Bound Proof Techniques}
Our main hardness result shows that solving~\Cref{prob:expect-mult} is $\sharpwonehard$ for $1$-\abbrTIDB. To prove this result we show that for the same $\query_1$ from the example above, for an arbitrary `product width' $k$, the query $\query^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $\bigO{1}$ tuples in \Cref{fig:two-step}).
We do so by considering an arbitrary graph $G$ (analogous to relation $\boldsymbol{R}$ of $\query$) and analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in $G$ that are isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the leading coefficient in $\poly$ corresponding to $\query^k$ is proportional to the number of $k$-matchings in $G$, a known hard problem in the parameterized/fine-grained complexity literature.
\mypar{Upper Bound Techniques}
Our negative results (\Cref{tab:lbs}) indicate that \abbrCTIDB{}s (even for $\bound=1$) cannot achieve comparable performance to deterministic databases for exact results (under complexity assumptions). In fact, under plausible hardness conjectures, one cannot (drastically) improve upon the trivial algorithm to exactly compute the expected multiplicities for $1$-\abbrTIDB\xplural. A natural follow-up question is whether we can do better if we are willing to settle for an approximation to the expected multiplicities.
\input{two-step-model}
We adopt the two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
(i) \termStepOne (\abbrStepOne): Given input $\tupset$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial ($\poly(\vct{X})=\apolyqdt\inparen{\vct{X}}$);
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct_{\randWorld\sim\bpd}\pbox{\poly(\vct{\randWorld})}$.
Let $\timeOf{\abbrStepOne}(Q,\tupset,\circuit)$ denote the runtime of \abbrStepOne when it outputs $\circuit$ (a representation of $\poly$ as an arithmetic circuit; see~\Cref{sec:expression-trees} for more on this representation).
Denote by $\timeOf{\abbrStepTwo}(\circuit, \epsilon)$ (recall $\circuit$ is the output of \abbrStepOne) the runtime of \abbrStepTwo. We can leverage~\Cref{def:reduced-poly} and~\Cref{lem:tidb-reduce-poly} to address the next formal objective: % to formally define our objective:
\begin{Problem}[\abbrCTIDB linear time approximation]\label{prob:big-o-joint-steps}
Given \abbrCTIDB $\pdb$, $\raPlus$ query $\query$,
is there a $(1\pm\epsilon)$-approximation of $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$ for all result tuples $\tup$ where
$\exists \circuit : \timeOf{\abbrStepOne}(Q,\tupset, \circuit) + \timeOf{\abbrStepTwo}(\circuit, \epsilon) \le O_\epsilon(\qruntime{\optquery{\query}, \tupset, \bound})$?
\end{Problem}
We show in \Cref{sec:circuit-depth} an $\bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more precisely, a single circuit $\circuit$ with one sink per tuple representing the tuple's lineage).
A key insight of this paper is that the representation of $\circuit$ matters.
For example, if we insist that $\circuit$ represent the lineage polynomial in \abbrSMB, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$,
and hence, just $\timeOf{\abbrStepOne}(\query,\tupset,\circuit)$ will be too large.
However, systems can directly emit compact, factorized representations of $\poly(\vct{X})$ (e.g., as a consequence of the standard projection push-down optimization~\cite{DBLP:books/daglib/0020812}).
For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$.
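To illustrate how large the gap between the two representations can become, consider a hypothetical lineage polynomial (not taken from the running example) that is a product of $k$ sums, each over $m$ distinct variables $V_{i,1},\ldots,V_{i,m}$:
\[
\prod_{i=1}^{k}\inparen{V_{i,1} + V_{i,2} + \cdots + V_{i,m}}.
\]
The factorized form has only $k\cdot m$ variable occurrences, whereas expanding it into \abbrSMB yields $m^k$ distinct monomials, one for each way of choosing a single variable from every factor.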
Accordingly, this work uses (arithmetic) circuits\footnote{
An arithmetic circuit is a DAG with variable and/or numeric source nodes and internal nodes, each representing either an addition or a multiplication operator.
}
as the representation system of $\poly(\vct{X})$.
Given that there exists a representation $\circuit^*$ such that $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$, we can now focus on the complexity of \abbrStepTwo.
We can represent the factorized lineage polynomial by its corresponding arithmetic circuit $\circuit$ (whose size we denote by $|\circuit|$).
As we show in \Cref{sec:circuit-runtime}, this size is also bounded by $\qruntime{\optquery{\query}, \tupset, \bound}$ (i.e., $|\circuit^*| \le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$).
Thus, the question of approximation %\Cref{prob:big-o-joint-steps}
can be stated as the following stronger (since~\Cref{prob:big-o-joint-steps} has access to \emph{all} equivalent \circuit representing $\query\inparen{\vct{W}}\inparen{\tup}$), but sufficient condition:
\begin{Problem}\label{prob:intro-stmt}
Given one circuit $\circuit$ that encodes $\apolyqdt$ for all result tuples $\tup$ (one sink per $\tup$) for \abbrBPDB $\pdb$ and $\raPlus$ query $\query$, does there exist an algorithm that computes a $(1\pm\epsilon)$-approximation of $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$ (for all result tuples $\tup$) in $\bigO{|\circuit|}$ time?
\end{Problem}
For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then $\poly\left(\prob_1,\dots, \prob_n\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant factor approximation. For example, for $\query^2$ from above, with $\prob_A$ denoting $\probOf\pbox{A = 1}$ (and similarly for the other variables), we can see that
\begin{footnotesize}
\begin{align*}
\hspace*{-3mm}
\poly_1^2\inparen{\probAllTup} &= \prob_A^2\prob_X^2\prob_B^2 + \prob_B^2\prob_Y^2\prob_E^2 + \prob_B^2\prob_Z^2\prob_C^2 + 2\prob_A\prob_X\prob_B^2\prob_Y\prob_E + 2\prob_A\prob_X\prob_B^2\prob_Z\prob_C + 2\prob_B^2\prob_Y\prob_E\prob_Z\prob_C\\
&\leq\prob_A\prob_X\prob_B + \prob_B\prob_Y\prob_E + \prob_B\prob_Z\prob_C +
2\prob_A\prob_X\prob_B\prob_Y\prob_E + 2\prob_A\prob_X\prob_B\prob_Z\prob_C + 2\prob_B\prob_Y\prob_E\prob_Z\prob_C
= \rpoly_1^2\inparen{\vct{p}}
%\inparen{0.9\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 0.5}^2 = 2.7225 < 3.45 = \rpoly^2\inparen{\probAllTup}
\end{align*}
\end{footnotesize}
If we assume that all seven probability values are at least $p_0>0$,
%Choose the least factor that is reduced in $\rpoly^2\inparen{\vct{X}}$, in this case $\prob_A\prob_X\prob_B$, and
we get that $\poly_1^2\inparen{\vct{\prob}}$ is in the range $[\inparen{p_0}^3\cdot\rpoly^2_1\inparen{\vct{\prob}}, \rpoly_1^2\inparen{\vct{\prob}}]$.
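As a concrete instance of this range (a sketch with hypothetical values, not the probabilities of \Cref{fig:two-step}), suppose all seven probabilities equal $p_0 = \frac{1}{2}$. Then
\[
\poly_1^2\inparen{\vct{\prob}} = \inparen{3\cdot\frac{1}{8}}^2 = \frac{9}{64}
\quad\text{and}\quad
\rpoly_1^2\inparen{\vct{\prob}} = 3\cdot\frac{1}{8} + 3\cdot 2\cdot\frac{1}{32} = \frac{9}{16},
\]
so that $\inparen{p_0}^3\cdot\rpoly_1^2\inparen{\vct{\prob}} = \frac{9}{128} \leq \frac{9}{64} \leq \frac{9}{16}$, consistent with the stated range.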
%
%To get an $(1\pm \epsilon)$-multiplicative approximation we uniformly sample monomials from the \abbrSMB representation of $\poly$ and `adjust' their contribution to $\widetilde{\poly}\left(\cdot\right)$.
In~\Cref{sec:algo} we demonstrate that a $(1\pm\epsilon)$ (multiplicative) approximation with competitive performance is achievable.
To get a $(1\pm \epsilon)$-multiplicative approximation and solve~\Cref{prob:intro-stmt}, using \circuit we uniformly sample monomials from the equivalent \abbrSMB representation of $\poly$ (without materializing the \abbrSMB representation) and `adjust' their contribution to $\widetilde{\poly}\left(\cdot\right)$.
\rule{\textwidth}{1.5pt}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Applications}
Recent work in heuristic data cleaning~\cite{yang:2015:pvldb:lenses,DBLP:journals/vldb/SaRR0W0Z17,DBLP:journals/pvldb/RekatsinasCIR17,DBLP:journals/pvldb/BeskalesIG10} emits a \abbrPDB when insufficient data exists to select the `correct' data repair.
Probabilistic data cleaning is a crucial innovation, as the alternative is to arbitrarily select one repair and `hope' that queries receive meaningful results.
Although \abbrPDB queries instead convey the trustworthiness of results~\cite{kumari:2016:qdb:communicating}, they are impractically slow~\cite{feng:2019:sigmod:uncertainty,feng:2021:sigmod:efficient}, even in approximation (see \Cref{sec:karp-luby}).
Bags, as we consider, are sufficient for production use, where bag-relational algebra is already the default for performance reasons.
Our results show that bag-\abbrPDB\xplural can be competitive, laying the groundwork for probabilistic functionality in production database engines.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. %We present some (easy) generalizations of our results in \Cref{sec:gen}.
%and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem
%\AH{I don't think I understand what the sentence (about extensions) is saying.}
% (\Cref{def:the-expected-multipl}).
Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}. All proofs are in the appendix.
%No reviewer comments in arxiv submission.
%Our responses to ICDT first cycle reviewer comments are in \Cref{sec:rebuttal}. % the appendix.\AR{Would be good to have a specific app ref to rebuttal}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: