
%!TEX root=./main.tex
%root: main.tex
\section{Introduction}\label{sec:intro}
\secrev{
This work explores the problem of computing the expectation of a tuple's multiplicity in an important special case of bag \abbrTIDB, which we call a \abbrCTIDB. A \abbrCTIDB
$\pdb = \inparen{\worlds, \bpd}$ encodes a bag of uncertain tuples such that each tuple in $\pdb$ has a multiplicity of at most $\bound$. The set of all worlds is encoded by $\worlds$, the set of all vectors of length $\abs{\tupset}$ in which each index corresponds to a distinct $\tup \in \tupset$ and stores its multiplicity. $\bpd$ is a product distribution over the set of all worlds. A given world $\worldvec \in \inset{0,\ldots, \bound}^{\abs{\tupset}}$ is interpreted such that, for each $\tup \in \tupset$, $\worldvec\pbox{\tup}$ is the multiplicity of $\tup$ in $\worldvec$. The product distribution is then encoded by the probabilities $\prob_{\tup, j} = \probOf\pbox{\worldvec\pbox{\tup} = j}$ (for $j \in\pbox{\bound}$), where each
$\tup$ is an independent random event.
}
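As a small illustrative instance, consider a \abbrCTIDB with $\tupset = \inset{\tup_1, \tup_2}$ and $\bound = 2$. Then $\worlds = \inset{0,1,2}^2$ contains $9$ possible worlds; the world $\worldvec = \inparen{1, 2}$ is the bag with one copy of $\tup_1$ and two copies of $\tup_2$, and, since $\bpd$ is a product distribution, this world has probability $\prob_{\tup_1, 1}\cdot\prob_{\tup_2, 2}$ under $\bpd$.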
%\mypar{For a later section}
%\sout{
%Since each tuple in $\pdb$ has a mutually exclusive probability distribution over its possible multiplicities, it is natural to reduce a \abbrCTIDB to traditional (set) block independent database (\abbrBIDB). We refer to the reduced \abbrBIDB as a $1$-\abbrBIDB, as it is the case that each tuple can appear in a possible world at most $c = 1$ time. \Cref{fig:ctidb-red} shows an example of this reduction.
%}
\secrev{
Allowing multiplicities of up to $\bound$ across all tuples gives rise to at most $\inparen{\bound+1}^\numvar$ possible worlds instead of the usual $2^\numvar$ possible worlds of a $1$-\abbrTIDB, which (assuming set query semantics) is the same as the traditional set \abbrTIDB.
Since our inputs are bags, we consider only bag query semantics in this work. We denote by $\query\inparen{\worldvec}\inparen{\tup}$ the multiplicity of $\tup$ in the result of query $\query$ over possible world $\worldvec\in\worlds$.
We can formally state this problem as:
\begin{Problem}\label{prob:expect-mult}
Given a \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, an $\raPlus$ query $\query$, and a result tuple $\tup$, compute the expected multiplicity of $\tup$: $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$.
\end{Problem}
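For instance, for the query $\query = \rel$ (i.e., the query that simply returns the relation $\rel$) over a \abbrCTIDB with $\bound = 3$ and a tuple $\tup$ whose (hypothetical) multiplicity probabilities are $\prob_{\tup, 1} = 0.2$, $\prob_{\tup, 2} = 0.35$, and $\prob_{\tup, 3} = 0.15$, the expected multiplicity of $\tup$ is $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}} = 1\cdot 0.2 + 2\cdot 0.35 + 3\cdot 0.15 = 1.35$.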
\AH{I \emph{think} we use $\randDB$ to denote something different in one of the proofs. Have to keep an eye open for this to avoid overloading notation.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Example of product distribution of c-TIDB
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{figure}[h!]
% \centering
% \textcolor{red}{
% \begin{tabular}{>{\footnotesize}c | >{\footnotesize}c | >{\footnotesize}c | >{\footnotesize}c | >{\footnotesize}c}
% \multicolumn{5}{c}{$\mathbf{\rel}$}\\
% \toprule
% A & Mult. $\inparen{M}$ &$\probOf\pbox{M=1}$ &$\probOf\pbox{M=2}$ &$\probOf\pbox{M=3}$\\
% \midrule
% $a$ & $2$ & $0.4$ &$0.3$ &$0.0$\\
% $b$ & $3$ & $0.2$ &$0.35$ &$0.15$\\
% \end{tabular}
%% \hspace*{0.5cm}
%% {\LARGE $\Rightarrow$}
%% \hspace*{0.5cm}
% \begin{tabular}{>{\footnotesize}c | >{\footnotesize}c}
% \multicolumn{2}{c}{$\textbf{World Probabilities}$}\\
% \toprule
% World & Probability \\
% \midrule
% $\emptyset$ & $0.3\cdot0.3 = 0.09$\\
% $\inset{\intup{a, 1}}$ & $0.4\cdot0.3 = 0.12$\\
% $\inset{\intup{a, 2}}$ & $0.3\cdot0.3 = 0.09$\\
% $\inset{\intup{b, 1}}$ & $0.3\cdot0.2 = 0.06$\\
% $\inset{\intup{b, 2}}$ & $0.3\cdot0.35 = 0.105$\\
% $\inset{\intup{b, 3}}$ & $0.3\cdot0.15 = 0.045$\\
% $\inset{\intup{a, 1}, \intup{b, 1}}$ & $0.4\cdot0.2 = 0.08$\\
% $\inset{\intup{a, 1}, \intup{b, 2}}$ & $0.4\cdot0.35 = 0.14$\\
% $\inset{\intup{a, 1}, \intup{b, 3}}$ & $0.4\cdot0.15 = 0.06$\\
% $\inset{\intup{a, 2}, \intup{b, 1}}$ & $0.3\cdot0.2 = 0.06$\\
% $\inset{\intup{a, 2}, \intup{b, 2}}$ & $0.3\cdot0.35 = 0.105$\\
% $\inset{\intup{a, 2}, \intup{b, 3}}$ & $0.3\cdot0.15 = 0.045$\\
% \end{tabular}
% }
% \caption{\textcolor{red}{\abbrCTIDB relation$\rel$ and its possible worlds with their probabilities.% Reduction to $1$-\abbrBIDB ($\rel'$). Note the probability distribution over tuple multiplicities of $\rel$, where e.g. tuple $\tup_1$ has a probability $\prob_{1, j} > 0$ for each multiplicity $j \in [2]$. This is better expressed as a block of mutually exclusive tuples, where each tuple has a specific multiplicity in $[c]$. Multiplicities that don't exist are automatically assigned a probability of $0$. Also note that it is implicit in \abbrBIDB\xplural for a block $i$ that $1 - \sum_{j \in [c]}\prob_{i, j}$ is the probability that no tuple in that block will be selected for a possible world.
%}}
% \label{fig:ctidb-red}
%\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Figure may work for an example in a later section of the Intro
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{figure}[h!]
% \centering
% \textcolor{red}{
% \begin{tabular}{c | c | c}
% \multicolumn{3}{c}{$\mathbf{\rel}$}\\
% \toprule
% A & Mult. & id\\
% \midrule
% $a$ & $2$ & $\tup_1$\\
% $b$ & $3$ & $\tup_2$\\
% \end{tabular}
% \hspace*{1cm}
% {\LARGE $\equiv$}
% \hspace*{1cm}
% \begin{tabular}{c | c | c}
% \multicolumn{3}{c}{$\textbf{\rel'}$}\\
% \toprule
% A & $\poly$ & $\probOf\pbox{x_{i, j}}$\\
% \midrule
% a & $\pVar_{1, 1}$ & $0.4$\\
% a & $\pVar_{1, 2}$ & $0.3$\\
% a & $\pVar_{1, 3}$ & $0.0$\\
% \midrule
% b & $\pVar_{2, 1}$ & $0.2$\\
% b & $\pVar_{2, 2}$ & $0.35$\\
% b & $\pVar_{2, 3}$ & $0.15$\\
% \end{tabular}
% }
% \caption{\textcolor{red}{\abbrCTIDB ($\rel$) Reduction to $1$-\abbrBIDB ($\rel'$). Note the probability distribution over tuple multiplicities of $\rel$, where e.g. tuple $\tup_1$ has a probability $\prob_{1, j} > 0$ for each multiplicity $j \in [2]$. This is better expressed as a block of mutually exclusive tuples, where each tuple has a specific multiplicity in $[c]$. Multiplicities that don't exist are automatically assigned a probability of $0$. Also note that it is implicit in \abbrBIDB\xplural for a block $i$ that $1 - \sum_{j \in [c]}\prob_{i, j}$ is the probability that no tuple in that block will be selected for a possible world.
%}}
% \label{fig:ctidb-red}
%\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We upper bound the multiplicity of tuples in a \abbrCTIDB since this is what is typically seen in practice.
%because of the cancellation effect of queries over a $1$-\abbrBIDB (introduced later), where, for the worst case, a self join query, we would have a factor of $\frac{1}{c^{n-1}}$ cancellations.
Allowing for unbounded $c$ is an interesting open problem.
\mypar{Hardness of Set Query Semantics and Bag Query Semantics}
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and the data complexity of the problem in general has been shown by Dalvi and Suciu to be \sharpphard~\cite{10.1145/1265530.1265571}. For our setting, there exists a trivial algorithm to compute~\cref{prob:expect-mult} for any query over a \abbrCTIDB due to linearity of expectation: simply perform the probability computations in a `sum-of-products' fashion (this is made precise when we discuss polynomial equivalence in the following subsection). Since we can compute~\cref{prob:expect-mult} in polynomial time, the interesting question that we explore is the hardness of computing the expectation through the lens of fine-grained analysis and parameterized complexity.
}
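To make the trivial algorithm concrete, the following is a minimal Python sketch (illustrative only; the tuple identifiers and probabilities are hypothetical, and this is not an artifact of this paper). It assumes the lineage polynomial of the result tuple (introduced formally in the next subsection) is given in sum-of-products form, and computes~\cref{prob:expect-mult} by pushing the expectation through each monomial, using linearity of expectation and tuple independence.
\begin{lstlisting}
# Sketch of the trivial exact algorithm for the expected multiplicity.
# smb is a list of (coefficient, {tuple_id: exponent}) monomials of the
# lineage polynomial in sum-of-products form; prob[t][j] is the
# probability that tuple t has multiplicity j, for j in {0, ..., c}.
def expected_multiplicity(smb, prob, c):
    total = 0.0
    for coeff, monomial in smb:
        term = coeff
        for t, e in monomial.items():
            # Independence lets us push the expectation through each
            # distinct variable: E[X_t^e] = sum_j j^e * Pr[X_t = j].
            term *= sum((j ** e) * prob[t][j] for j in range(c + 1))
        total += term
    return total

# Hypothetical example: lineage X_a * X_b with c = 2.
prob = {"a": [0.3, 0.4, 0.3], "b": [0.5, 0.25, 0.25]}
print(expected_multiplicity([(1, {"a": 1, "b": 1})], prob, c=2))
# (1*0.4 + 2*0.3) * (1*0.25 + 2*0.25) = 1.0 * 0.75 = 0.75
\end{lstlisting}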
%\sout{
%\mypar{Example that can perhaps be used later on (using commented out figure above)}
%Given a \abbrCTIDB $\pdb$ with $\numvar$ tuples, we can encode a possible world by the vector $\vct{W} \in \inset{0,\ldots, c}^\numvar$, with the intuitive interpretation when bit $W_i = j$, then tuple $\tup_i$ with multiplicity $j$ is selected, with $\tup_i$ not existing for the special case of $j = 0$. For the example in ~\cref{fig:ctidb-red}, we have that for \abbrCTIDB $\textbf{R}$, $\numvar = 2$. Then, e.g., arbitrary world vector $\vct{W} = [2, 3]$ encodes the possible world $\db = \inset{\intup{a, 2}, \intup{b, 3}}$ Computing ~\cref{prob:expect-mult} for tuple $\tup_2$ in ~\cref{fig:ctidb-red} when $\query = \mathbf{\rel}$ then becomes $\expct_{\randDB\sim\pd}\pbox{\mathbf{\rel}\inparen{\tup_2}} = 1\cdot\prob_{2,1} + 2\cdot\prob_{2,2} + 3\cdot\prob_{2,3} = 1\cdot 0.2 + 2\cdot 0.35 + 3\cdot 0.15 = 1.35$.
%}
\secrev{
One of the main theoretical questions in this work is whether computing the expected multiplicities of a bag \abbrCTIDB query can be done in time linear in the runtime of an equivalent deterministic query. If this were true, it would open the way for the deployment of \abbrCTIDB\xplural in practice.
Unfortunately, we prove that this is not the case. To analyze this question we denote by $\timeOf{}^*(Q,\pdb)$ the optimal runtime complexity of computing~\cref{prob:expect-mult} over \abbrCTIDB $\pdb$. Let $\gentupset$ denote the set of tuples in $\pdb$, i.e.,
\begin{Definition}[$\gentupset$]
Define $\gentupset$ to be the set of tuples appearing in at least one possible world of a $\abbrCTIDB$, formally $\gentupset = \inset{\tup_i ~|~ \exists \worldvec \in \worlds:~\worldvec\pbox{i} > 0}$. When a specific $\pdb = \inparen{\worlds, \bpd}$ is being referred to, we will use $\tupset$ to denote its set of tuples.
\end{Definition}
Let $\qruntime{\query, \gentupset}$ be the optimal runtime (with some caveats, discussed in~\cref{sec:gen}) of query $\query$ on the comparable deterministic database $\gentupset$.
%We make this runtime concrete later on.
%We denote by $\dbbase$ the base \abbrCTIDB table containing all possible tuples, formally as,
%\AR{Again if we are defining \abbrCTIDB `from scratch' instead of in terms of general PDBs, then the above might not be needed. Also it should be \abbrCTIDB instead of \abbrPDB in the sentence below.}
\Cref{tab:lbs} shows our lower bounds for computing~\cref{prob:expect-mult} on \abbrCTIDB\xplural.
\begin{table}[h!]
\begin{tabular}{|p{0.43\textwidth}|p{0.12\textwidth}|p{0.35\textwidth}|}
\hline
Lower bound on $\timeOf{}^*(\query,\pdb)$ & Num. $\pd$s & Hardness Assumption\\
\hline
$\Omega\inparen{\inparen{\qruntime{\query, \gentupset}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
%\hline
$\omega\inparen{\inparen{\qruntime{\query, \gentupset}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
%\hline
$\Omega\inparen{\inparen{\qruntime{\query, \gentupset}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & \Cref{conj:known-algo-kmatch}\\ %Multiple & Current $k$-matching algorithms\\
\hline
\end{tabular}
\caption{Our lower bounds for a specific hard query $Q$ parameterized by $k$. Each bound is for a $\pdb$ over the same (family of) $\gentupset$; rows marked `Multiple' in the second column need the algorithm to handle multiple $\pd$ (for a given $\gentupset$). The last column states the hardness assumption that implies the corresponding lower bound in the first column ($\eps_0,C_0,c_0$ are constants that are independent of $k$).}
\label{tab:lbs}
\end{table}
\mypar{Our lower bound results} \Cref{tab:lbs} shows that, depending on which hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to our question. To make sense of the lower bounds in \Cref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \gentupset}}^k}$, where $k$ is the join width of the query $\query$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries); our notion of join width follows from~\cref{def:degree-of-poly} and~\cref{fig:nxDBSemantics}.
Our lower bound in the third row says that one cannot get more than a polynomial improvement over essentially the trivial algorithm for~\cref{prob:expect-mult}.
However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion of the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace $\qruntime{\query, \gentupset}$ by just $\abs{\gentupset}$ (indeed, these results follow from known lower bounds for deterministic query processing). Our contribution is to identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
\mypar{Our upper bound results} We introduce a $(1\pm \epsilon)$-approximation algorithm that computes~\cref{prob:expect-mult} in $O_\epsilon\inparen{\qruntime{\query, \tupset}}$ time.
% In particular, we show the following upper bound results.
%(i) We show that e.g. for a circuit representation of the lineage polynomial (more on this later), when the circuit is a tree and there is a single
% result tuple, we also have the same runtime (we can also handle the case of multiple result tuples\footnote{We can approximate the expected result tuple multiplicities (for all result tuples {\em simultanesouly}) with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of result tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}).
%Further, we show that for {\em any} $\raPlus$ query on a \abbrTIDB $(1$-$\abbrTIDB)$, we also obtain linear runtime for approximation.
% the approximation algorithm has runtime linear in the size of the compressed lineage encoding (
In contrast, known approximation techniques (\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}) in set-\abbrPDB\xplural need time $\Omega(\abs{\circuit}^{2k})$, where $\circuit$ is a circuit representation of the lineage (introduced below); see \Cref{sec:karp-luby}.
Further, we generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag Block Independent Disjoint Databases (\abbrBIDB\xplural; see \Cref{subsec:tidbs-and-bidbs}).
}
\secrev{
\subsection{Polynomial Equivalence}
A common encoding of probabilistic databases (e.g., in \cite{IL84a,Imielinski1989IncompleteII,Antova_fastand,DBLP:conf/vldb/AgrawalBSHNSW06} and many others) relies on annotating tuples with lineages, propositional formulas that describe the set of possible worlds that the tuple appears in. The bag semantics analog is a provenance/lineage polynomial $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07}, a polynomial with non-zero integer coefficients and exponents, over integer variables $\vct{X}$ encoding input tuple multiplicities.
%Intuitively, a \abbrCTIDB lends itself to a useful reduction to a specific type of block independent database (\abbrBIDB) which we refer to as a $1$-\abbrBIDB. A $1$-\abbrBIDB is a \abbrBIDB in the traditional sense of allowing no duplicate tuples, \emph{but} where we use bag query semantics instead of the usual set query semantics.
%(see~\Cref{fig:nxDBSemantics} for a definition)
\begin{figure}
\begin{align*}
\polyqdt{\project_A(\query)}{\gentupset}{\tup} =& \sum_{\tup': \project_A(\tup') = \tup} \polyqdt{\query}{\gentupset}{\tup'} &
\polyqdt{\query_1 \union \query_2}{\gentupset}{\tup} =& \polyqdt{\query_1}{\gentupset}{\tup} + \polyqdt{\query_2}{\gentupset}{\tup}\\
\polyqdt{\select_\theta(\query)}{\gentupset}{\tup} =& \begin{cases}
\polyqdt{\query}{\gentupset}{\tup} & \text{if }\theta(\tup) \\
0 & \text{otherwise}.
\end{cases} &
\begin{aligned}
\polyqdt{\query_1 \join \query_2}{\gentupset}{\tup} =\\ ~
\end{aligned}&
\begin{aligned}
&\polyqdt{\query_1}{\gentupset}{\project_{\attr{\query_1}}{\tup}} \\
&~~~\cdot\polyqdt{\query_2}{\gentupset}{\project_{\attr{\query_2}}{\tup}}
\end{aligned}\\
& & & \polyqdt{\rel}{\gentupset}{\tup} = X_\tup%\sum_{j \in [c]}j\cdot\pVar_{\tup, j}
\end{align*}\\[-10mm]
\caption{Construction of the lineage (polynomial) for an $\raPlus$ query over a \abbrCTIDB, where $\vct{X}$ consists of all $X_\tup$ over all $\rel$ in $\gentupset$ and $\tup$ in $\rel$. Here $\gentupset.\rel$ denotes the instance of relation $\rel$ in $\gentupset$. Note that once we introduce the reduction to $1$-\abbrBIDB, the base case will be expressed alternatively.}
\label{fig:nxDBSemantics}
\end{figure}
We drop $\query$, $\gentupset$, and $\tup$ from $\apolyqdt$ when they are clear from the context or irrelevant to the discussion. We now specify the problem of computing the expectation of tuple multiplicity in the language of lineage polynomials:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}[Expected Multiplicity of Lineage Polynomials]\label{prob:bag-pdb-poly-expected}
Given an $\raPlus$ query $\query$, \abbrCTIDB $\pdb$, and result tuple $\tup$, compute the expectation of the lineage polynomial $\apolyqdt$ (i.e., $\expct_{\vct{W}\sim \pdassign}\pbox{\apolyqdt(\vct{W})}$, where $\vct{W} \in \inset{0,\ldots, \bound}^{\abs{\tupset}}$).
%,
%where $\pdassign$ is the distribution induced by $\pd$ on the relevant assignments $\vct{W}$ to variables of $\apolyqdt$.
\end{Problem}
We note that computing \Cref{prob:expect-mult}
is equivalent to computing \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
%In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for several models of probabilistic databases and various encodings of such polynomials.
}
\secrev{
\subsection{Our Machinery}
\mypar{Lower Bound Proof Techniques}
All of our results rely on working with a {\em reduced} form of the lineage polynomial $\poly$. In fact, it turns out that for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the $1$-\abbrTIDB. This is also true when the query input is a block independent disjoint probabilistic database (with tuple multiplicity of at most $1$), which we refer to as a $1$-\abbrBIDB.
% For our results to be applicable to \abbrCTIDB\xplural, we introduce the following reduction.
%\begin{Definition}
%Any \abbrCTIDB $\pdb$, can be reduced to an equivalent $1$-\abbrBIDB $\pdb'$ in the following manner. For each $\tup_i \in \tupset$, create a block of $\bound + 1$ disjoint \abbrBIDB tuples in $\pdb'$ such that each tuple in the newly formed block is mapped to its own boolean variable $X_{i, j}$ for $i \in \abs{D}$ and $j \in \pbox{c+1}$. Then, given $\worldvec \in \worlds$, the equivalent world in $\pdb'$ will set each variable $X_{i, j} = 1$ for each $\worldvec\pbox{i} = j$, while $\inparen{\text{for }\ell \neq j}$ all other $X_{i, \ell} \in \vct{X}$ of $\pdb'$ are set to $0$.
%\end{Definition}
%\begin{Example}
%Consider the $\boldsymbol{Route}$ relation of~\cref{fig:two-step} and query $\query = \project_{\text{City}_1}\inparen{\boldsymbol{Route}}$. The output relation $\query$ is $\inset{\intup{Chicago, X}, \intup{Chicago, Y}}$ and can be represented as a \abbrCTIDB $\query' = \inset{\intup{Chicago, X', 2}}$, where the following probabilities are true: $\probOf\pbox{X' = 0} = \probOf\pbox{\neg X \wedge \neg Y}$, $\probOf\pbox{X' = 1} = \probOf\pbox{\inparen{X \vee Y}\wedge\inparen{\neg X \vee \neg Y}}$, and $\probOf\pbox{X' = 2} = \probOf\pbox{X\wedge Y}$. $\query'$ can then be reduced to a $1$-\abbrBIDB by creating a block of the following disjoint tuples: $\query'' = \inset{\intup{\text{Chicago}, X'_0}, \intup{\text{Chicago}, X'_1}, \intup{\text{Chicago}, X'_2}}$ such that $\probOf\pbox{X'_i = 1} = \probOf\pbox{X' = i}$.
%\end{Example}
Next, we motivate this reduced polynomial.
Consider the query $\query_1$ defined as follows over the bag relations of \Cref{fig:two-step}:
}
\begin{lstlisting}
SELECT 1 FROM OnTime a, Route r, OnTime b
WHERE a.city = r.city1 AND b.city = r.city2
\end{lstlisting}
\secrev{
It can be verified that the lineage polynomial $\poly\inparen{A, B, C, E, X, Y, Z}$ for the sole result tuple (i.e., the count) of $\query_1$ is $AXB + BYE + BZC$. Now consider the product query $\query_1^2 = \query_1 \times \query_1$.
The lineage polynomial for $\query_1^2$ is given by $\poly^2\inparen{A, B, C, E, X, Y, Z}$
$$
=A^2X^2B^2 + B^2Y^2E^2 + B^2Z^2C^2 + 2AXB^2YE + 2AXB^2ZC + 2B^2YEZC.
$$
To compute $\expct\pbox{\poly^2}$ we can use linearity of expectation and push the expectation through each summand. To keep things simple, let us focus on the summand $A^2X^2B^2$; the procedure is the same for all other summands of $\poly^2$. Let $\randWorld_X$ be the random variable corresponding to the lineage variable $X$. Because the distinct variables in the product are independent, we can push the expectation through them, yielding $\expct\pbox{\randWorld_A^2\randWorld_X^2\randWorld_B^2}=\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B^2}$. Since $\randWorld_A, \randWorld_B\in \inset{0, 1}$, we can further derive $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B}$ using the fact that $W^2 = W$ for any $W\in \inset{0, 1}$. However, we get stuck with $\expct\pbox{\randWorld_X^2}$: since $\randWorld_X\in\inset{0, 1, 2}$, when $\randWorld_X = 2$ we have $\randWorld_X^2 \neq \randWorld_X$.
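Concretely, writing out these expectations from the definition of $\randWorld_X$,
\begin{align*}
\expct\pbox{\randWorld_X^2} &= 1^2\cdot\probOf\pbox{\randWorld_X = 1} + 2^2\cdot\probOf\pbox{\randWorld_X = 2}, \text{ while}\\
\expct\pbox{\randWorld_X} &= 1\cdot\probOf\pbox{\randWorld_X = 1} + 2\cdot\probOf\pbox{\randWorld_X = 2},
\end{align*}
so the two quantities differ whenever $\probOf\pbox{\randWorld_X = 2} > 0$.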
%the expectation is $\expct\pbox{A^2X^2B^2} = A\cdot\prob_A\cdot\inparen{\sum\limits_{i \in [2]}X_i\cdot \prob_{X, i}}\cdot B\prob_B$ for $X \in \inset{0, 1, 2}$.
An equivalent representation of $\poly^2$ can be derived by thinking of having a separate product $j\cdot X_j$ for each multiplicity value $j\in\pbox{\bound}$ such that the original `base' variable $X$ is equal to the sum of these products. For this example, the set of variables becomes $\inset{A_1,\ldots,A_\bound, X_1,\ldots,X_\bound, B_1,\ldots,B_\bound}$, where e.g. $X$ now equals $\sum_{j\in\pbox{\bound}}j\cdot X_j$ and each new variable takes values from the set $\inset{0, 1}$. Our reformulated polynomial is $\poly_R^2 = \inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}}^2$ $\inparen{\sum_{j_2\in\pbox{\bound}}j_2X_{j_2}}^2$ $\inparen{\sum_{j_3\in\pbox{\bound}}j_3B_{j_3}}^2$. Since the multiplicity values of a tuple are by nature mutually exclusive, we can drop all cross terms and have $\poly_R^2 = \sum_{j_1, j_2, j_3 \in \pbox{\bound}}j_1^2A^2_{j_1}j_2^2X_{j_2}^2j_3^2B^2_{j_3}$. With the reframed polynomial, the expectation is $\expct\pbox{\poly_R^2}=\sum_{j_1,j_2,j_3\in\pbox{\bound}}j_1^2j_2^2j_3^2\expct\pbox{A_{j_1}}\expct\pbox{X_{j_2}}\expct\pbox{B_{j_3}}$, since now all of the new variables take values in $\inset{0, 1}$.
% \begin{footnotesize}
% \begin{align*}
% &\expct\pbox{\randWorld_A^2\randWorld_X^2\randWorld_B^2} = \expct\pbox{\randWorld_A^2}\expct\pbox{\inparen{\randWorld_{X_1} + \randWorld_{X_2}}^2}\expct\pbox{\randWorld_B^2} = \expct\pbox{\randWorld_A}\expct\pbox{\randWorld_{X_1}^2 + 2\randWorld_{X_1}\randWorld_{X_2} + \randWorld_{X_2}^2}\expct\pbox{\randWorld_B} =\\
% &\expct\pbox{\randWorld_A}\inparen{\expct\pbox{\randWorld_{X_1}^2}+\expct\pbox{2\randWorld_{X_1}\randWorld_{X_2}}+\expct\pbox{\randWorld_{X_2}^2}}\expct\pbox{\randWorld_B} = \expct\pbox{\randWorld_A}\inparen{\expct\pbox{\randWorld_{X_1}} + \expct\pbox{2\randWorld_{X_1}\randWorld_{X_2}} + \expct\pbox{\randWorld_{X_2}}}\expct\pbox{\randWorld_B} = \\
% &\expct\pbox{\randWorld_A}\inparen{\sum\limits_{j \in \pbox{\bound}}\expct\pbox{j\cdot\randWorld_{X_j}}}\expct\pbox{\randWorld_B}.
% \end{align*}
% \end{footnotesize}
%We can drop the term $\expct\pbox{2\randWorld_{X_1}\randWorld_{X_2}}$ since by definition a tuple can only have one multiplicity value in a possible world, thus always making $\randWorld_{X_1}\cdot \randWorld_{X_2} = 0$.
%Another subtlety to note is that for any $i\in \pbox{\bound}$, $\expct\pbox{\randWorld_{X_i}} = i\cdot\prob_{X, i}$.
This reformulation of the problem leads us to consider a structure related to the lineage polynomial.
%By exploiting linearity of expectation, further pushing expectation through independent variables and observing that for any $\randWorld\in\{0, 1\}$, we have $\randWorld^2=\randWorld$, the expectation is
%$\expct\limits_{\vct{\randWorld}\sim\pdassign}\pbox{\poly^2\inparen{\vct{\randWorld}}}$ (where $\randWorld_A$ is the random variable corresponding to $A$, distributed by $\pdassign$).
%Atri: Combined the the first step below with the next one to save space.
%\begin{footnotesize}
%\begin{multline*}
%\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B^2} + \expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y^2}\expct\pbox{\randWorld_E^2} + \expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Z^2}\expct\pbox{\randWorld_C^2} + 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E}\\
%+ 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C}.
%\end{multline*}
%\end{footnotesize}
%\noindent Since for any $\randWorld\in\{0, 1\}$, we have $\randWorld^2=\randWorld$,
%then for any $k > 0$, $\expct\pbox{\randWorld^k} = \expct\pbox{\randWorld}$, which means that
%$\expct\limits_{\vct{\randWorld}\sim\pdassign}\pbox{\poly^2\inparen{\vct{\randWorld}}}$ simplifies to:
%\begin{footnotesize}
%\begin{multline*}
%\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B} + \expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E} + \expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B}\expct{\randWorld_Y}\expct\pbox{\randWorld_E} \\
%+ 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C}.
%\end{multline*}
%\end{footnotesize}
%\noindent This property leads us to consider a structure related to the lineage polynomial.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly(\vct{X})$ define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by i) replacing each $X_\tup \in \vct{X}$ for $\tup \in \tupset$ with $\sum_{j\in\pbox{\bound}}j\cdot X_{\tup, j}$, i.e., $\rpoly\inparen{\vct{X}}$ has variables $X_{\tup, j}$ for $j \in \pbox{\bound}$ such that $X_{\tup, j} \in \inset{0, 1}$, and ii) converting the resulting polynomial into the standard monomial basis (\abbrSMB)
\footnote{
 This is the representation, typically used in set-\abbrPDB\xplural, where the polynomial is represented as a sum of `pure' products. See \Cref{def:smb} for a formal definition.
}
while setting all \emph{variable} exponents $e > 1$ to $1$.
\end{Definition}
Continuing with the example $\poly^2\inparen{A, B, C, E, X_1, X_2, Y, Z}$, to save clutter we i) do not show the full expansion for variables with maximum multiplicity $1$ since, e.g., for variable $A$ the substituted sum reduces to $1^2\cdot A_1^2 = A_1$ (written simply as $A$), and ii) in the sums over $j$ (e.g., $\sum_{j\in\pbox{\bound}}j^2\cdot X_j$) we omit the summands encoding multiplicities $> 2$, since the greatest multiplicity of the tuple annotated with $X$ is $2$ and hence those summands always evaluate to $0$.
\begin{multline*}
\rpoly^2(A, B, C, E, X_1, X_2, Y, Z) = \\
A\inparen{\sum\limits_{j\in\pbox{\bound}}j^2X_j}B + BYE + BZC + 2A\inparen{\sum\limits_{j\in\pbox{\bound}}jX_j}BYE + 2A\inparen{\sum\limits_{j\in\pbox{\bound}}jX_j}BZC + 2BYEZC =\\
ABX_1 + AB\inparen{2}^2X_2 + BYE + BZC + 2AX_1BYE + 2A\inparen{2}X_2BYE + 2AX_1BZC + 2A\inparen{2}X_2BZC + 2BYEZC.
\end{multline*}
Note that we have argued that, for our specific example, the expectation that we want is $\widetilde{\poly^2}(\probOf\inparen{A=1},$ $\probOf\inparen{B=1}, \probOf\inparen{C=1}, \probOf\inparen{E=1}, \probOf\inparen{X_1=1}, \probOf\inparen{X_2=1}, \probOf\inparen{Y=1}, \probOf\inparen{Z=1})$.
%It can be verified that the reduced polynomial parameterized with each variable's respective marginal probability is a closed form of the expected count (i.e., $\expct\limits_{\vct{\randWorld}\sim\pd}\pbox{\Phi^2\inparen{\vct{X}}} = \widetilde{\Phi^2}(\probOf\pbox{A=1},$ $\probOf\pbox{B=1}, \probOf\pbox{C=1}), \probOf\pbox{D=1}, \probOf\pbox{X=1}, \probOf\pbox{Y=1}, \probOf\pbox{Z=1})$).
\Cref{lem:tidb-reduce-poly} generalizes the equivalence to {\em all} $\raPlus$ queries on \abbrCTIDB\xplural (proof in \Cref{subsec:proof-exp-poly-rpoly}).
\begin{Lemma}\label{lem:tidb-reduce-poly}
For any \abbrCTIDB $\pdb$, $\raPlus$ query $\query$, and lineage polynomial
%\BG{Term has not been introduced yet.}
%Atri: fixed
$\poly\inparen{\vct{X}}=\poly\pbox{\query,\tupset,\tup}\inparen{\vct{X}}$, it holds that $
\expct_{\vct{W} \sim \pdassign}\pbox{\poly\inparen{\vct{W}}} = \rpoly\inparen{\probAllTup}
$, where $\probAllTup = \inparen{\prob_{1, 1},\ldots,\prob_{\abs{\tupset}, \bound}}$ is defined by $\bpd$.
\end{Lemma}
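As a minimal sanity check of \Cref{lem:tidb-reduce-poly}, consider the lineage polynomial $\poly\inparen{X_\tup} = X_\tup$ of the identity query for a single tuple $\tup$. Then $\rpoly\inparen{\vct{X}} = \sum_{j \in \pbox{\bound}} j\cdot X_{\tup, j}$ and
\[
\expct_{\vct{W} \sim \pdassign}\pbox{\poly\inparen{\vct{W}}} = \sum_{j \in \pbox{\bound}} j\cdot\prob_{\tup, j} = \rpoly\inparen{\prob_{\tup, 1},\ldots,\prob_{\tup, \bound}},
\]
which is exactly the expected multiplicity of $\tup$ in \Cref{prob:expect-mult}.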
\AH{Here is what I stopped.}
To prove our hardness results we show that, for the same query $\query_1$ from the example above and an arbitrary `product width' $k$, the query $\query_1^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $O(1)$ tuples in \Cref{fig:two-step}).
We do so by considering an arbitrary graph $G$ (analogous to the $Route$ relation of $\query_1$) and analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in $G$ that are isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the leading coefficient of the polynomial corresponding to $\query_1^k$ is proportional to the number of $k$-matchings in $G$; counting $k$-matchings is a known hard problem in the parameterized/fine-grained complexity literature.
For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then $\poly\left(\prob_1,\dots, \prob_n\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant factor approximation. For example, using $\query_1^2$ from above and writing $\prob_A$ for $\probOf\pbox{A = 1}$ (and similarly for the other variables), we can see that
\begin{footnotesize}
\begin{align*}
\hspace*{-3mm}
\poly^2\inparen{\probAllTup} &= \prob_A^2\prob_X^2\prob_B^2 + \prob_B^2\prob_Y^2\prob_E^2 + \prob_B^2\prob_Z^2\prob_C^2 + 2\prob_A\prob_X\prob_B^2\prob_Y\prob_E + 2\prob_A\prob_X\prob_B^2\prob_Z\prob_C + 2\prob_B^2\prob_Y\prob_E\prob_Z\prob_C\\
&\leq\prob_A\prob_X\prob_B + \prob_B\prob_Y\prob_E + \prob_B\prob_Z\prob_C +
2\prob_A\prob_X\prob_B\prob_Y\prob_E + 2\prob_A\prob_X\prob_B\prob_Z\prob_C + 2\prob_B\prob_Y\prob_E\prob_Z\prob_C
= \rpoly\inparen{\vct{p}}
%\inparen{0.9\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 0.5}^2 = 2.7225 < 3.45 = \rpoly^2\inparen{\probAllTup}
\end{align*}
\end{footnotesize}
If we assume that all seven probability values are at least $p_0>0$,
%Choose the least factor that is reduced in $\rpoly^2\inparen{\vct{X}}$, in this case $\prob_A\prob_X\prob_B$, and
we get that $\poly^2\inparen{\vct{\prob}}$ is in the range $[\inparen{p_0}^3\cdot\rpoly\inparen{\vct{\prob}}, \rpoly\inparen{\vct{\prob}}]$.
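As a quick numeric sanity check, if (hypothetically) all seven probabilities equal $\frac{1}{2}$, then $\poly^2\inparen{\vct{\prob}} = \frac{3}{64} + \frac{6}{64} = \frac{9}{64}$ while $\rpoly\inparen{\vct{\prob}} = \frac{24}{64} + \frac{12}{64} = \frac{36}{64}$, and indeed $\inparen{\frac{1}{2}}^3\cdot\frac{36}{64} \le \frac{9}{64} \le \frac{36}{64}$.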
%
To get a $(1\pm \epsilon)$-multiplicative approximation, we uniformly sample monomials from the \abbrSMB representation of $\poly$ and `adjust' their contribution to $\widetilde{\poly}\left(\cdot\right)$.
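As a simplified illustration of why such sampling can work (this is not the algorithm of \Cref{sec:algo}, which operates over circuits and never materializes the \abbrSMB form), suppose $\widetilde{\poly}$ has $N$ monomials $m_1,\ldots,m_N$ with coefficients $\alpha_1,\ldots,\alpha_N$. Sampling an index $i$ uniformly from $\pbox{N}$ and outputting $Y = N\cdot \alpha_i\cdot\prod_{X_{\tup, j}\in m_i}\prob_{\tup, j}$ yields an unbiased estimate:
\[
\expct\pbox{Y} = \sum_{i=1}^{N}\frac{1}{N}\cdot N\cdot \alpha_i\cdot\prod_{X_{\tup, j}\in m_i}\prob_{\tup, j} = \widetilde{\poly}\inparen{\probAllTup};
\]
averaging independent samples then concentrates around this value.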
\mypar{Upper Bound Techniques}
Our negative results (\cref{tab:lbs}) indicate that \abbrCTIDB{}s cannot achieve performance comparable to deterministic databases for exact results (under complexity assumptions). In fact, under plausible hardness conjectures, one cannot (drastically) improve upon the trivial algorithm to exactly compute the expected multiplicities for \abbrCTIDB\xplural. A natural follow-up question is whether we can do better if we are willing to settle for an approximation to the expected multiplicities.
In the remainder of this work, we demonstrate that a $(1\pm\epsilon)$ (multiplicative) approximation with competitive performance is achievable.
\input{two-step-model}
We adopt the two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
(i) \termStepOne (\abbrStepOne): Given input $\tupset$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial ($\poly(\vct{X})=\apolyqdt\inparen{\vct{X}}$);
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct\pbox{\poly(\vct{\randWorld})}$.
Let $\timeOf{\abbrStepOne}(Q,\tupset,\circuit)$ denote the runtime of \abbrStepOne when it outputs $\circuit$ (which is a representation of $\poly$ as an arithmetic circuit --- more on this representation shortly).
Denote by $\timeOf{\abbrStepTwo}(\circuit)$ (recall $\circuit$ is the output of \abbrStepOne) the runtime of \abbrStepTwo, allowing us to formally define our objective:
\begin{Problem}[Bag-\abbrCTIDB linear time approximation]\label{prob:big-o-joint-steps}
Given \abbrCTIDB $\pdb$, $\raPlus$ query $\query$,
is there a $(1\pm\epsilon)$-approximation of $\expct_{\vct{\db}\sim\pd}\pbox{\query\inparen{\vct{\db}}\inparen{\tup}}$ for all result tuples $\tup$ where
$\exists \circuit : \timeOf{\abbrStepOne}(Q,\tupset, \circuit) + \timeOf{\abbrStepTwo}(\circuit) \le O_\epsilon(\qruntime{Q, \tupset})$?
\end{Problem}
We show in \Cref{sec:circuit-depth} an $O(\qruntime{Q, \tupset})$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or, more precisely, a single circuit $\circuit$ with one sink per tuple representing the tuple's lineage).
A key insight of this paper is that the representation of $\circuit$ matters.
For example, if we insist that $\circuit$ represent the lineage polynomial in \abbrSMB, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{Q, \tupset}}^k}$,
and hence, just $\timeOf{\abbrStepOne}(Q,\tupset,\circuit)$ will be too large.
However, systems can directly emit compact, factorized representations of $\poly(\vct{X})$ (e.g., as a consequence of the standard projection push-down optimization~\cite{DBLP:books/daglib/0020812}).
For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$.
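For intuition on how large this gap can be, note (as a generic example, not our hard query) that the polynomial $\inparen{X_1 + \cdots + X_\numvar}^k$ admits a circuit of size $O\inparen{\numvar + k}$, while its \abbrSMB form contains $\binom{\numvar + k - 1}{k} = \Omega\inparen{\numvar^k}$ monomials for constant $k$.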
Accordingly, this work uses (arithmetic) circuits\footnote{
An arithmetic circuit is a DAG with variable and/or numeric source nodes and internal nodes, each representing either an addition or a multiplication operator.
}
as the representation system of $\poly(\vct{X})$.
Given that there exists a representation $\circuit^*$ such that $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le O(\qruntime{\query, \tupset})$, we can now focus on the complexity of \abbrStepTwo.
We can represent the factorized lineage polynomial by its corresponding arithmetic circuit $\circuit$ (whose size we denote by $|\circuit|$).
As we show in \Cref{sec:circuit-runtime}, this size is also bounded by $\qruntime{Q, \tupset}$ (i.e., $|\circuit^*| \le O(\qruntime{Q, \tupset})$).
Thus, the question of approximation %\Cref{prob:big-o-joint-steps}
can be reframed as:
\begin{Problem}[\Cref{prob:big-o-joint-steps} reframed]\label{prob:intro-stmt}
Given one circuit $\circuit$ that encodes $\apolyqdt$ for all result tuples $\tup$ (one sink per $\tup$) for \abbrCTIDB $\pdb$ and $\raPlus$ query $\query$, does there exist an algorithm that computes a $(1\pm\epsilon)$-approximation of $\expct_{\vct{\db}\sim\pd}\pbox{\query\inparen{\vct{\db}}\inparen{\tup}}$ (for all result tuples $\tup$) in $\bigO{|\circuit|}$ time?
\end{Problem}
\rule[1cm]{\textwidth}{1.5pt}
}
\AHchange{
\LARGE Old Stuff
}
A probabilistic database (PDB) $\pdb$ is a pair $\inparen{\idb, \pd}$, where $\idb$ is a set of deterministic database instances called possible worlds and $\pd$ is a probability distribution over $\idb$.
\AHchange{
A tuple independent database (\abbrTIDB) (to which we will refer later) is a \abbrPDB such that each tuple is an independent random event.
}
A commonly studied problem in probabilistic databases is, given a query $\query$, PDB $\pdb$, and possible query result tuple $\tup$, to compute the tuple's \textit{marginal probability} of being in the query's result, i.e., computing the expectation of a Boolean random variable over $\pd$ that is $1$ for every $\db \in \idb$ for which $\tup \in \query(\db)$ and $0$ otherwise.
In this work, we are interested in bag semantics, where each tuple is associated with a multiplicity.
Following~\cite{DBLP:conf/pods/GreenKT07}, we model bag databases (resp., relations) as functions from each $\tup$ to the tuple's multiplicity $\db(\tup) \in \semN$ in a possible world $\db$.
\sout{
We refer to such a probabilistic database as a bag-probabilistic database or \abbrBPDB for short.
}
The natural generalization of the \AHchange{(set)} problem of computing marginal probabilities of query result tuples to bag semantics is to compute the expectation of a random variable over $\pd$ that is assigned value $\query(\db)(\tup) \in \semN$ in world $\db \in \idb$
\AHchange{
, formally $\expct_{\randDB\sim\pd}\pbox{\query\inparen{\randDB}\inparen{\tup}}$.
}
For \lstinline{COUNT(*)} queries, expected multiplicities can model the expected count. \sout{
The equivalent set-\abbrPDB operation, simply computes the probability that this count is non-zero.
Further,
} We are interested in the parameterized complexity of
\AHchange{
computing the expectation,
}%\Cref{prob:bag-pdb-query-eval}
(i.e. we think of $\query$ as being parameterized by some parameter $k$ with the size of the database going to infinity relative to $k$). Unless stated otherwise, we implicitly assume the probability distribution $\pd$, and for notational convenience use $\expct\pbox{\cdot}$ instead of $\expct_\pd\pbox{\cdot}$.
\AHchange{
While the parameterized and fine-grained results of this paper apply to general \abbrPDB\xplural, we start by focusing on a restricted form of \abbrTIDB which we refer to as \abbrCTIDB.
As alluded to, a \abbrTIDB is a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world is modeled as an independent probabilistic event.\footnote{
This model is exactly the definition of \abbrTIDB{}s \cite{VS17} under set semantics. Note that this is only one possible definition of \abbrTIDB{}s under bag semantics. In \Cref{sec:gener-results-beyond} we discuss alternatives and to what degree our results extend to these alternatives.\label{footnote:set-not-limit}
}
We will denote $\dbbase=(t_1,\dots,t_\numvar)$. Each of the $2^n$ possible worlds in $\Omega$ can be encoded as a string in $\{0,1\}^\numvar$. In particular, any vector $\vct{W}=\inparen{W_1,\dots,W_n}\in \{0,1\}^\numvar$ represents a world $\db\in\idb$ in the natural way: i.e. $\tup_i\in\db$
iff $W_i=1$. Furthermore, $\pd$ is compactly described by a tuple $\vct{p}=\inparen{p_1,\dots,p_n}$, which induces the Bernoulli distribution over vectors $\vct{W}\in\{0,1\}^\numvar$ where, for each $i\in [n]$, $\probOf(W_i=1)=p_i$.
We then define a \abbrCTIDB to be a bag \abbrTIDB with the further restriction that each tuple $\tup$ has a multiplicity of at most some constant $c$, formally: $\forall \db \in \idb, ~\forall \tup \in \db, ~\db\inparen{\tup}\leq c$. That is, any tuple in a \abbrCTIDB has a multiplicity of at most $c$.
}
%\mypar{\abbrTIDB\xplural}
%We initially focus on tuple-independent probabilistic bag-databases\footnote{See \cite{DBLP:series/synthesis/2011Suciu} for a survey of set-\abbrTIDBs; the bag encoding is analogous~\cite{DBLP:conf/pods/GreenKT07}.} (\abbrTIDB\xplural), a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world is modeled as an independent probabilistic event.\footnote{
% This model is exactly the definition of \abbrTIDB{}s \cite{VS17} under set semantics. Note that this is only one possible definition of \abbrTIDB{}s under bag semantics. In \Cref{sec:gener-results-beyond} we discuss alternatives and to what degree our results extend to these alternatives.\label{footnote:set-not-limit}
% % Mirroring the implementation of bag relations in production database systems (e.g., Postgresql, DB2), tuple multiplicities are modeled by retaining copies of each tuple (up to its largest possible multiplicity).
% % % To make each duplicate tuple unique in a set-\abbrTIDB we can assign unique keys across all duplicates.
% % When the multiplicity of input tuple is bound by some constant,
% % the increased input size is negligible.\label{footnote:set-not-limit}
%}
%% OK: I tidied things up a touch.
%%\BG{The footnote is still a bit hard to follow I think, but I do not have a great suggestion on how to improve it.}
%We will denote $\dbbase=(t_1,\dots,t_\numvar)$. Each of the $2^n$ possible worlds in $\Omega$ can be encoded as a string in $\{0,1\}^\numvar$. In particular, any vector $\vct{W}=\inparen{W_1,\dots,W_n}\in \{0,1\}^\numvar$ represents a world $\db\in\idb$ in the natural way: i.e. $\tup_i\in\db$
%iff $W_i=1$. Furthermore, $\pd$ is compactly described by a tuple $\vct{p}=\inparen{p_1,\dots,p_n}$, which induces the Bernoulli distribution over vectors $\vct{W}\in\{0,1\}^\numvar$ where each $i\in [n]$, $\probOf(W_i=1)=p_i$.
%Finally for each $\vct{W}\in\{0,1\}^\numvar$, we define $\pdb_{\vct{W}}$
%\AH{Where do we use this notation? If we use this somewhere, should we maybe use $\db_{\vct{\randWorld}}$ instead?}
% as the world represented by $\vct{W}$.
%Atri: I don't think we use it so removing it.
%Atri: Stuff below was confusing, so am re-writing it.
%A \abbrTIDB encodes a compatible $\pdb$ as a deterministic database $\encodedDB$ with $\numvar$ tuples, each annotated with a probability $\prob_\tup$, and with $\pd$
%with a deterministic table $\encodedDB$ which is a set of $\numvar$ tuples, encoding the set of possible worlds $\idb$. The probability distribution $\pd$ over the set of database instances (possible worlds) is the one
%being the distribution induced from the requirement that each tuple $\tup \in \encodedDB$ be treated as an independent Bernoulli distributed random variable with probability $\prob_\tup$.
%The possible worlds of a \abbrTIDB can be encoded by the vector $\vct{W}$, such that each of the $\numvar$ tuples in $\vct{W}$ has its own unique Bernoulli-distributed random variable, i.e. $\vct{W} = \inparen{W_{\tup_1},\ldots, W_{\tup_\numvar}}$, and for each tuple $\tup$, $\probOf(W_\tup) = \prob_\tup$.
%Given a vector $\vct{X}$ such that each $\tup \in \encodedDB$ has a unique formal variable annotation $X_\tup \in \vct{X}$, for a boolean domain $\{0,1\}^\numvar$, denote by $\pdb_{\vct{X}}$ the deterministic database consisting of exactly those tuples $\tup$ where $X_\tup = 1$.
%\BG{REMOVED:
%When $\pdb$ is a \abbrTIDB, for every output tuple $\tup$, $\query\inparen{\pdb}\inparen{\tup}$ can be encoded by a polynomial, with variables in $\vct{X}$.
%Green, Karvounarakis, and Tannen established (\cite{DBLP:conf/pods/GreenKT07}; see \Cref{fig:nxDBSemantics}) that for any $\raPlus$ query $\query$ and \abbrTIDB $\pdb$, there exists a polynomial $\poly_\tup\inparen{\vct{X}}$ following the standard addition and multiplication operators over Natural numbers (i.e., $\semN$-semiring semantics), such that $\query\inparen{\pdb_{\vct{W}}}\inparen{\tup} = \poly_\tup\inparen{\vct{W}}$.
%This in turn implies that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct_{\vct{W}\sim\pd}\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$.}
Thanks to linearity of expectation, simple polynomial-time algorithms (for fixed query $\query$) exist for computing the expectation of a lineage polynomial $\apolyqdt$ when $\pdb$ is a \abbrTIDB and $\query$ is an $\raPlus$ query.
% The algo is trivial so I think putting in a 2010 cite seems like bit too much
%\cite{kennedy:2010:icde:pip})
% for computing exact results for bag-probabilistic count queries $Q$ over \abbrTIDB{}s.
However, it is known that {\em deterministic} query processing for the same query $Q$ (over the deterministic database instance $\dbbase$) can also be done in polynomial time.
If our notion of efficiency were simply achieving a polynomial time algorithm, then we would be done.
However, in practice (and in theory), we care about the {\em fine-grained}/parameterized complexity of deterministic query processing (i.e. we care about the exact exponent in our polynomial runtime).
Given \abbrCTIDB $\pdb$ and query $\query$, let $\timeOf{}^*(Q,\pdb)$ denote the (optimal) runtime complexity of computing expected multiplicity (over all result tuples $\tup$). %\AR{Am changing these runtime definitions to include the runtime for all result tuples $\tup$.}
Denote by $\qruntime{Q, \db}$ the `runtime' of query $Q$ on deterministic database $\db$ under a cost model that is satisfied by a wide range of query processing algorithms, including those based on the recent work on worst-case optimal join algorithms (we make this runtime concrete in \Cref{sec:gen}). %\AR{We need to move the definition of $\qruntime{}$ to \Cref{sec:background} because among others we now need it in our lower bound arguments as well.}).
%Denoting by $\dbbase = \bigcup_{\db \in \idb} \db$ the set of all possible tuples in \abbrPDB $\pdb = \inparen{\idb, \pd}$,
\sout{
We finally have all the pieces to state a formal specification of our problem:
}
\AHchange{
Given the above, the natural question to ask is whether it is always the case that $\timeOf{}^*\inparen{\query, \pdb}\leq O\inparen{\qruntime{\query, \dbbase}}$.
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{Problem}\label{prob:informal}
%Given an $\raPlus$ query $\query$ and \abbrTIDB
%% OK: added motivation
%%\AR{Changed this to \abbrTIDB: we should motivate why we are restricting ourselves to this special case here.}
%\abbrBPDB $\pdb$, is it the case that $\timeOf{}^*(Q,\pdb) \le O(\qruntime{Q, \dbbase})$?
%\end{Problem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% However the question remains: \emph{can bag-probabilistic databases be as fast as deterministic queries}.
%In this paper, we explore the \emph{fine-grained complexity} of bag-probabilistic database query evaluation.
%Atri: I'm not sure if this comment makes much sense here-- it sort of breaks the flow I think. I'll refer to this when talking about our results.
%The problem of deterministic query evaluation is known to be \sharpwonehard\footnote{A problem is in \sharpwone if the runtime of the most efficient known algorithm to solve it is lower bounded by some function $f$ of a parameter $k$, where the growth in runtime is polynomially dependent on $f(k)$, i.e. $\Omega\inparen{\numvar^{f(k)}}$.} in data complexity for general $\query$. For example, the counting $k$-cliques query problem (where the parameter $k$ is the size of the clique) is \sharpwonehard since (under standard complexity assumptions) it cannot run in time faster than $n^{f(k)}$ for some strictly increasing $f(k)$.
%In this paper, we begin to explore whether the problem of bag-probabilistic query evaluation (which we relate to deterministic query processing more precisely below) falls into this same complexity class.
\AHchange{
This question
}
is a special case of the general question of computing the expected multiplicity of $\tup$: we are asking whether query evaluation over a \abbrCTIDB can be {\em linear} in the runtime of deterministic query processing.
We stress that this question is very well motivated, even for \abbrTIDBs: An answer \AHchange{to the above question} in the affirmative \AH{not sure that this is the best way of putting it} indicates that bag-probabilistic databases can be competitive with deterministic databases, opening the door for deployment in practice.
%%%%%%%%%%%%%%%%%%%%%%%%%
%Contributions, Overview, Paper Organization
%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Applications}
Recent work in heuristic data cleaning~\cite{yang:2015:pvldb:lenses,DBLP:journals/vldb/SaRR0W0Z17,DBLP:journals/pvldb/RekatsinasCIR17,DBLP:journals/pvldb/BeskalesIG10,DBLP:journals/vldb/SaRR0W0Z17} emits a \abbrPDB when insufficient data exists to select the `correct' data repair.
Probabilistic data cleaning is a crucial innovation, as the alternative is to arbitrarily select one repair and `hope' that queries receive meaningful results.
Although \abbrPDB queries instead convey the trustworthiness of results~\cite{kumari:2016:qdb:communicating}, they are impractically slow~\cite{feng:2019:sigmod:uncertainty,feng:2021:sigmod:efficient}, even in approximation (see \Cref{sec:karp-luby}).
Bag semantics, as considered here, suffices for production use, where bag-relational algebra is already the default for performance reasons.
Our results show that bag-\abbrPDB\xplural can be competitive, laying the groundwork for probabilistic functionality in production database engines.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. %We present some (easy) generalizations of our results in \Cref{sec:gen}.
%and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem
%\AH{I don't think I understand what the sentence (about extensions) is saying.}
% (\Cref{def:the-expected-multipl}).
Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}. All proofs are in the appendix.
%No reviewer comments in arxiv submission.
%Our responses to ICDT first cycle reviewer comments are in \Cref{sec:rebuttal}. % the appendix.\AR{Would be good to have a specific app ref to rebuttal}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: