paper-BagRelationalPDBsAreHard/introduction.tex

336 lines
37 KiB
TeX

%!TEX root=./main.tex
%root: main.tex
\section{Introduction}\label{sec:intro}
We explore the problem of computing the expectation of the multiplicity of a tuple in the result of a query over a \abbrCTIDB (tuple independent database), a type of probabilistic database with bag semantics where the multiplicity of a tuple is a random variable with range $[0,\bound]\stackrel{\text{def}}{=}\{0,1,\dots,\bound\}$ for some fixed constant $\bound$, and multiplicities assigned to any two tuples are independent of each other.
Formally, a \abbrCTIDB,
$\pdb = \inparen{\worlds, \bpd}$ is defined over a \dbbaseName (i.e. a `base' set of tuples) $\tupset$ and a probability distribution $\bpd$ over all possible worlds generated by assigning each tuple $\tup \in \tupset$ a multiplicity in the range $[0,\bound]$.
Any such world can be encoded as a vector (of length $\numvar=\abs{\tupset}$) from $\worlds$, such that the multiplicity of each $\tup \in \tupset$ is stored at a distinct index.
A given world $\worldvec \in\worlds$ can be interpreted as follows: for each $\tup \in \tupset$, $\worldvec_{\tup}$ is the multiplicity of $\tup$ in $\worldvec$.
We note that encoding a possible world as a vector, while non-standard, is equivalent to encoding it as a bag of tuples (\Cref{app:subsec:background-nxdbs}). %(\Cref{prop:expection-of-polynom}).
%in \Cref{subsec:expectation-of-polynom-proof}).
Given that tuple multiplicities are independent events, the probability distribution $\bpd$ can be expressed compactly by assigning each tuple a probability distribution over $[0,\bound]$. Let $\prob_{\tup,j}$ denote the probability that tuple $\tup$ is assigned multiplicity $j$. The probability of a world $\worldvec$ is then $\prod_{\tup \in \tupset} \prob_{\tup,j(t)}$ for $j(t) = \worldvec_{\tup}$.
%
% Allowing for $\leq \bound$ multiplicities across all tuples gives rise to having $\leq \inparen{\bound+1}^\numvar$ possible worlds instead of the usual $2^\numvar$ possible worlds of a $1$-\abbrTIDB, which (assuming set query semantics), is the same as the traditional set \abbrTIDB.
% In this work, since we are generally considering bag query input, we will only be considering bag query semantics.
In this work, we consider \emph{queries with bag semantics} over such bag probabilistic databases.
We denote by $\query\inparen{\worldvec}\inparen{\tup}$ the multiplicity of a result tuple $\tup$ in query $\query$ over possible world $\worldvec\in\worlds$.
%
We now formally state our problem of computing the expected multiplicity: % of a result tuple as:
\begin{Problem}\label{prob:expect-mult}
Given \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, $\raPlus$ query\footnote{
An $\raPlus$ query is a query expressed in positive relational algebra, i.e., using only the operators selection ($\select$), projection ($\project$), natural join ($\join$) and union ($\union$).
}
$\query$, and result tuple $\tup$, compute the expectation $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$.
\end{Problem}
\begin{figure}[t!]
\begin{align*}
&\begin{aligned}[t]
&\polyqdt{\project_A(\query)}{\gentupset}{\tup}\inparen{\vct{X}} =\\
&~\sum_{\tup': \project_A(\tup') = \tup} \polyqdt{\query}{\gentupset}{\tup'}\inparen{\vct{X}}
\end{aligned}
&
&\begin{aligned}[t]
&\polyqdt{\query_1 \union \query_2}{\gentupset}{\tup}\inparen{\vct{X}} =\\
&~ \polyqdt{\query_1}{\gentupset}{\tup}\inparen{\vct{X}} + \polyqdt{\query_2}{\gentupset}{\tup}\inparen{\vct{X}}\\
\end{aligned}\\
&\begin{aligned}
&\polyqdt{\select_\theta(\query)}{\gentupset}{\tup}\inparen{\vct{X}} =\\
&~ \begin{cases}
\polyqdt{\query}{\gentupset}{\tup}\inparen{\vct{X}} & \text{if }\theta(\tup) \\
0 & \text{otherwise}.
\end{cases}
\end{aligned}
&
&\begin{aligned}
&\polyqdt{\query_1 \join \query_2}{\gentupset}{\tup}\inparen{\vct{X}} =\\
&~~\polyqdt{\query_1}{\gentupset}{\project_{\attr{\query_1}}{\tup}}\inparen{\vct{X}}\\
&~\cdot\polyqdt{\query_2}{\gentupset}{\project_{\attr{\query_2}}{\tup}}\inparen{\vct{X}}
\end{aligned}
\end{align*}%\\[-10mm]
%\setlength{\abovecaptionskip}{-0.25cm}
\savecaptionspace{
\caption{Lineage polynomial semantics for any $\raPlus$ query $\query$, arbitrary \dbbaseName $\gentupset$ (with variables $\vct{X}=\inparen{X_\tup}_{\tup \in\gentupset}$), where for $\rel\in\gentupset$, $\tup\in\rel$, the base case is $\polyqdt{\rel}{\gentupset}{\tup}\inparen{\vct{X}} = X_\tup$.}% for any $\rel\in\gentupset$ and $\tup\in\rel$.}% consists of all $X_\tup$ over all $\rel$ in $\gentupset$ and $\tup$ in $\rel$, such that the base case $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$.} %Here $\gentupset.\rel$ denotes the instance of relation $\rel$ in $\gentupset$. Please note, after we introduce the reduction to $1$-\abbrBIDB, the base case will be expressed alternatively. The base case is $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$}
\label{fig:nxDBSemantics}
}{\abovecapshrink}{\belowcapshrink}
%\vspace{-0.53cm}
\end{figure}
As also observed in \cite{https://doi.org/10.48550/arxiv.2201.11524,DBLP:journals/sigmod/GuagliardoL17}, computing the expected multiplicity of a result tuple in a bag probabilistic database is the analog of computing the marginal probability in a set \abbrPDB.
% We will assume that $c =\bigO{1}$, since this is what is typically seen in practice.
% Allowing for unbounded $c$ is an interesting open problem.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Hardness of Set Query Semantics and Bag Query Semantics}
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and their data complexity has, in general been shown % by Dalvi and Suicu
to be \sharpphard\cite{10.1145/1265530.1265571}.
In an independent work, Grohe et. al.~\cite{https://doi.org/10.48550/arxiv.2201.11524} studied bag-\abbrTIDB\xplural with unbounded multiplicities, which requires %them to explicitly address the issue of
a succinct representation of probability distributions over infinitely many multiplicities and proved a dichotomy for
%the problem of
computing the probability of an output tuple's multiplicity being bounded by given $s$.
% investigates the query evaluation problem over bag-\abbrTIDB\xplural when computing the probability of an output tuple having at most a multiplicity of $k$, showing that a dichotomy exists for this problem.
% While the authors observe that computing the expectation of an output tuple multiplicity is in polynomial time, no further (fine-grained) analysis of the expected value is considered.
% Our work in contrast assumes a finite bound on the multiplicities where we simply list the finitely many probability values (and hence do not need consider a more succinct representation). Further, our work primarily looks into the fine-grained analysis of computing the expected multiplicity of an output tuple.
\mypar{Our Setup} In contrast to~\cite{https://doi.org/10.48550/arxiv.2201.11524}, we consider \abbrCTIDB\xplural, i.e., the multiplicity of input tuples is bound by a constant $\bound$.
Then, % (\abbrCTIDB\xplural, i.e., the multiplicity of input tuples is bound by a constant $\bound$), however,
there exists a trivial \ptime algorithm for computing the expectation of a result tuple's multiplicity~(\Cref{prob:expect-mult}) for any fixed $\raPlus$ query due to linearity of expectation (see~\Cref{sec:intro-poly-equiv}).
Since the {\em data complexity} of~\Cref{prob:expect-mult} is in \ptime, %interesting
we then explore the question of %the hardness of
computing expectation using fine-grained and parameterized complexity, where we are interested in the exponent of polynomial runtime.\footnote{While %the authors of
\cite{https://doi.org/10.48550/arxiv.2201.11524} also observes that computing the expectation of an output tuple multiplicity is in \ptime, it does not investigate the fine-grained complexity of this problem.}
Specifically, in this work we ask if~\Cref{prob:expect-mult} can be solved in time linear in the runtime of an analogous deterministic query, which we make more precise shortly.
If true, this opens up the way for deployment of \abbrCTIDB\xplural in practice. We expand on the practical implications of this problem later in the section but for now we stress that in practice, $\bound$ is indeed constant and most often $\bound=1$.
That is, although production database systems use bag semantics for query evaluation, allowing duplicate intermediate or output tuples, input tuples in real world datasets are still frequently unique.
To analyze this question we denote by $\timeOf{}^*(\query,\pdb, \bound)$ the optimal runtime complexity of computing~\Cref{prob:expect-mult} over \abbrCTIDB $\pdb$ and query $\query$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{table*}[t!]
\centering
\begin{tabular}{|p{0.43\textwidth}|p{0.12\textwidth}|p{0.35\textwidth}|}
\hline
\textbf{Lower bound on $\timeOf{}^*(\qhard^k,\pdb,1)$} & \textbf{Num.} $\bpd$s
& \textbf{Hardness Assumption}\\
\hline
$\Omega\inparen{\inparen{\qruntime{\optquery{\qhard^k}, \tupset, \bound}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
$\omega\inparen{\inparen{\qruntime{\optquery{\qhard^k}, \tupset, \bound}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
$\Omega\inparen{\inparen{\qruntime{\optquery{\qhard^k}, \tupset, \bound}}^{c_0\cdot k/\log{k}}}$ for {\em some} $c_0>0$ & Multiple & Exponential Time Hypothesis (ETH)\\%\Cref{conj:known-algo-kmatch}\\
\hline
\end{tabular}
\savecaptionspace{
\caption{Our lower bounds for $\qhard^k$ parameterized by $k$ over $1$-TIDB $\pdb$. % = \inset{\worlds, \bpd}$.
Those with `Multiple' in the second column need the algorithm to be able to handle multiple $\bpd$. See~\Cref{sec:hard} for further details.}%, i.e. probability distributions (for a given $\tupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_o,C_0,c_0$ are constants that are independent of $k$).}
\label{tab:lbs}
}{0cm}{-0.73cm}
%\vspace{-0.73cm}
\end{table*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Our lower bound results}
%
Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for a formal definition) denote the runtime for query $\query$ over any deterministic database
%\dbbaseName
that maps each tuple in $\gentupset$ to a multiplicity in $[0,\bound]$.
%where the maximum multiplicity of any tuple is less than or equal to $\bound$. % This paper considers $\raPlus$ queries, for which order of operations is \emph{explicit}, as opposed to other query languages, e.g. Datalog, UCQ. Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} \approx \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{The upper bounds on runtime that we derive apply pointwise to any $\query \in\raPlus$, allowing us to abstract away the specific heuristics for choosing an optimized query (i.e., Any deterministic query optimization heuristic is equally useful for \abbrCTIDB queries).}\BG{Rewrite: since an optimized Q is also a Q this also applies in the case where there is a query optimizer the rewrites Q}
Our question is whether or not it is always true that for every $\query$, $\timeOf{}^*\inparen{\query, \pdb, \bound}\leq \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$. We remark that the issue of query optimization is orthogonal to this question (recall that an $\raPlus$ query explicitly encodes order of operations) since we want to answer the above question for {\em all} $\query$. \emph{Specifically, if there is an equivalent and more efficient query $\query'$, we allow both deterministic and probabilistic query processing access to $\query'$}.
Unfortunately the answer to the above question is no--
\Cref{tab:lbs} shows our results.
Specifically, depending on what hardness result/conjecture we assume, we get various weaker or stronger versions of {\em no} as an answer to our question. To make some sense of the lower bounds in \Cref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(\query,\pdb, \bound) \le \bigO{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$, where $k$ is the join width of $\query$ (our notion of join width
%follows from~\Cref{def:degree-of-poly}
is essentially the degree of the corresponding lineage polynomial we introduce in \Cref{sec:intro-poly-equiv}).
%% OK: Fig 1 hasn't been introduced yet
% defined in~\Cref{fig:nxDBSemantics}.)
%of the query $\query$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries).
%
What our lower bound in the third row says, is that for a specific family of hard queries, one cannot get more than a polynomial improvement (for fixed $k$) over essentially the trivial algorithm for~\Cref{prob:expect-mult}, assuming the Exponential Time Hypothesis (ETH)~\cite{eth}.
%However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions).
We also note that existing results\footnote{This claim follows when we set $\query$ to the query that counts the number of $k$-cliques over database $\tupset$ that encodes a graph. Precisely the same bounds as in the three rows of~ \Cref{tab:lbs} (with $n$ replacing $\qruntime{\optquery{\query}, \tupset, \bound}$) follow from the same complexity assumptions we make: triangle detection hypothesis (by definition), $\sharpwzero\ne\sharpwone$~\cite{10.5555/645413.652181} and Strong ETH~\cite{CHEN20061346}. For the last result we can replace $k/\log{k}$ by just $k$.
%This claim follows from known results for the problem of evaluating a query $\query$ that counts the number of $k$-cliques over database $\tupset$. Specifically, a lower bound of the form $\Omega\inparen{n^{1+\eps_0}}$ for {\em some} $\eps_0>0$ follows from the triangle detection hypothesis (this like our result is for $k=3$). Second, a lower bound of $\omega\inparen{n^{C_0}}$ for {\em all} $C_0>0$ under the assumption $\sharpwzero\ne\sharpwone$~\cite{10.5555/645413.652181}. Finally, a lower bound of $\Omega\inparen{n^{c_0\cdot k}}$ for {\em some} $c_0>0$ was shown by~\cite{CHEN20061346} (under the strong exponential time hypothesis).
}
imply the claimed lower bounds if we replace the $\qruntime{\optquery{\query}, \tupset, \bound}$ by just $\numvar = |\tupset|$.
%(indeed these results follow from known lower bounds for deterministic query processing).
Our contribution is to identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
\mypar{Our upper bound results} We introduce a $(1\pm \epsilon)$-approximation algorithm that computes ~\Cref{prob:expect-mult} in time $O_\epsilon\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}$. This means, when we are okay with approximation, that we solve~\Cref{prob:expect-mult} in time linear in the size of the deterministic query\BG{What is the size of the deterministic query?}. % and bag \abbrPDB\xplural are deployable in practice.
In contrast, known approximation techniques (\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}) in set-\abbrPDB\xplural need time $\Omega(\qruntime{\optquery{\query}, \tupset, \bound}^{2k})$
(see \Cref{sec:karp-luby}).
Further, our approximation algorithm works for a more general notion of bag \abbrPDB\xplural beyond \abbrCTIDB\xplural
(see \Cref{subsec:tidbs-and-bidbs}).
\subsection{Polynomial Equivalence}\label{sec:intro-poly-equiv}
A common encoding of probabilistic databases (e.g., in \cite{IL84a,4497507,DBLP:conf/vldb/AgrawalBSHNSW06} and many others) annotates tuples with lineages, propositional formulas that describe the set of possible worlds that the tuple appears in. The bag semantics analog is a provenance/lineage polynomial (see~\Cref{fig:nxDBSemantics}) $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07}, a polynomial with non-zero integer coefficients and exponents, over variables $\vct{X}$ encoding input tuple multiplicities. The lineage polynomial for result tuple $t_{out}$ evaluates to $t_{out}$'s multiplicity in a given possible world when each $X_{t_{in}}$ is replaced by the multiplicity of $t_{in}$ in the possible world.
We now state the problem of computing the expectation of tuple multiplicity in terms of lineage polynomials (which is equivalent to \Cref{prob:bag-pdb-poly-expected}-- see \Cref{prop:expection-of-polynom}):
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}[Expected Multiplicity of Lineage Polynomials]\label{prob:bag-pdb-poly-expected}
Given an $\raPlus$ query $\query$, \abbrCTIDB $\pdb$ and result tuple $\tup$,
%compute the expected multiplicity of the polynomial $\poly$ (i.e.,
%for $\worldvec\in\worlds$,
compute $\expct_{\vct{W}\sim \pdassign}\pbox{\apolyqdt\inparen{\worldvec}}$).
\end{Problem}
%We note that computing \Cref{prob:expect-mult} is equivalent (yields the same result as) to computing \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
We drop $\query$, $\tupset$, and $\tup$ from $\apolyqdt$ when they are clear from the context or not relevant to the discussion.
All of our results rely on working with a {\em reduced} form $\rpoly$ of the lineage polynomial $\poly$. As we show, for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating $\rpoly$ over the probabilities that define the $1$-\abbrTIDB.
Further, only light extensions (see \Cref{def:reduced-poly-one-bidb}) are required to support block independent disjoint probabilistic databases~\cite{DBLP:conf/icde/OlteanuHK10}. % (bag query semantics with input tuple multiplicity at most $1$). %, for which the proof of~\Cref{lem:tidb-reduce-poly} (introduced shortly) holds .
Next, we motivate the reduced polynomial $\rpoly$.
Consider the query $\query_1$ defined as follows over the bag relations of \Cref{fig:two-step}:
\begin{lstlisting}[frame=single,framerule=0pt]
SELECT DISTINCT 1 FROM T $t_1$, R r, T $t_2$
WHERE $t_1$.Point = r.Point$_1$ AND $t_2$.Point = r.Point$_2$
\end{lstlisting}
It can be verified that $\poly\inparen{A, B, C, E, U, Y, Z}$ for the sole result tuple of $\query_1$ is $AUB + BYE + BZC$. Now consider the product query $\query_1^2 = \query_1 \times \query_1$.
The lineage polynomial for $\query_1^2$ is $\poly_1^2\inparen{A, B, C, E, U, Y, Z}$
$$
=A^2U^2B^2 + B^2Y^2E^2 + B^2Z^2C^2 + 2AUB^2YE + 2AUB^2ZC + 2B^2YEZC.
$$
To compute $\expct\pbox{\poly_1^2}$ we can use linearity of expectation and push the expectation through each summand. To keep things simple, let us focus on the monomial $\monomial{1}(A,U,B) = A^2U^2B^2$ as the procedure is the same for all other monomials of $\poly_1^2$. Let $\randWorld_X$ be the random variable corresponding to a variable $X$. Because the distinct variables in the product are independent, we can push expectation through them yielding $\expct\pbox{\randWorld_A^2\randWorld_U^2\randWorld_B^2}=\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_U^2}\expct\pbox{\randWorld_B^2}$. Since $\randWorld_A, \randWorld_B\in \inset{0, 1}$ we can simplify to $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_U^2}\expct\pbox{\randWorld_B}$ by the fact that for any $W\in \inset{0, 1}$, $W^2 = W$. Observe that if $W_U\in\inset{0, 1}$, then we further would have $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_U}\expct\pbox{\randWorld_B} = \prob_A\cdot\prob_U\cdot\prob_B$ % = \rmonomial{1}\inparen{\prob_A, \prob_U, \prob_B}$
(denoting $\probOf\pbox{\randWorld_X = 1} = \prob_X$). However, in this example, we get stuck with $\expct\pbox{\randWorld_U^2}$, since $\randWorld_U\in\inset{0, 1, 2}$ and for $\randWorld_U \gets 2$, $\randWorld_U^2 \neq \randWorld_U$.
The simple insight to get around this issue to note that the random variables $\randWorld_U$ and $\randWorld_{U_1}+2\randWorld_{U_2}$ have exactly the same distribution, where $\randWorld_{U_1},\randWorld_{U_2}\in\inset{0,1}$ and $\probOf\pbox{\randWorld_{U_j} = 1} = \probOf\pbox{\randWorld_{U} = j}$. Thus, the idea is to replace the variable $U$ by $U_1+2U_2$ (where $U_j$ corresponds to the event that $U$ has multiplicity $j$) yielding% to obtain the following polynomial:
%
%Denote the variables of $\poly$ to be $\vars{\poly}.$ In the \abbrCTIDB setting, $\poly\inparen{\vct{X}}$ has an equivalent reformulation $\inparen{\refpoly{}\inparen{\vct{X_R}}}$ that is of use to us, where $\abs{\vct{X_R}} = \bound\cdot\abs{\vct{X}}$ . Given $X_\tup \in\vars{\poly}$ and integer valuation $X_\tup \in\inset{0,\ldots, c}$. We can replace $X_\tup$ by $\sum_{j\in\pbox{\bound}}jX_{\tup, j}$ where the variables $\inparen{X_{\tup, j}}_{j\in\pbox{\bound}}$ are disjoint with integer assignments $X_{\tup, j}\in\inset{0, 1}$. Then for any $\worldvec\in\worlds$ and corresponding reformulated world $\worldvec_{\vct{R}}\in\inset{0, 1}^{\tupset\bound}$, we set $\worldvec_{\vct{R}_{\tup, j}} = 1$ for $\worldvec_\tup = j$, while $\worldvec_{\vct{R}_{\tup, j'}} = 0$ for all $j'\neq j\in\pbox{\bound}$. By construction then $\poly\inparen{\vct{X}}\equiv\refpoly{}\inparen{\vct{X_R}}$ $\inparen{\vct{X_R} = \vars{\refpoly{}}}$ since for any integer valuation $X_\tup\in\pbox{\bound}$, $X_{\tup, j}\in\inset{0, 1}$ we have the equality $X_\tup = j = \sum_{j\in\pbox{\bound}}jX_{t, j}$.
%
%Considering again our example,
%{\small
%\begin{multline*}
%\refpoly{1, }^{\inparen{ABX}^2}\inparen{A, X, B} = \poly_1^{\inparen{AXB}^2}\inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}, \sum_{j_2\in\pbox{\bound}}j_2X_{j_2}, \sum_{j_3\in\pbox{\bound}}j_3B_{j_3}} \\
%= \inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}}^2\inparen{\sum_{j_2\in\pbox{\bound}}j_2X_{j_2}}^2\inparen{\sum_{j_3\in\pbox{\bound}}j_3B_{j_3}}^2.
\[\monomial{1,R}\inparen{A, U_1, U_2, B} = \monomial{1}\inparen{A,(U_1+2U_2),B}.\]
%\end{multline*}
%}
%Since the set of multiplicities for tuple $\tup$ by nature are disjoint we can drop all cross terms and have $\refpoly{1, }^2 = \sum_{j_1, j_2, j_3 \in \pbox{\bound}}j_1^2A^2_{j_1}j_2^2X_{j_2}^2j_3^2B^2_{j_3}$. Since we now have that all $\randWorld_{X_j}\in\inset{0, 1}$, computing expectation yields $\expct\pbox{\refpoly{1, }^2}=\sum_{j_1,j_2,j_3\in\pbox{\bound}}j_1^2j_2^2j_3^2$ \allowbreak $\expct\pbox{\randWorld_{A_{j_1}}}\expct\pbox{\randWorld_{X_{j_2}}}\expct\pbox{\randWorld_{B_{j_3}}}$.
Given that $U$ can only have multiplicity of $1$ or $2$ but not both,
%we drop the monomials with the term $U_1U_2$ to get
%$\refpoly{1, }^{\inparen{ABU}^2}\inparen{A, U_1, U_2, B} = A^2U_1^2B^2+2^2\cdot A^2 U_2^2B^2.$
given world vectors $(\randWorld_A,\randWorld_{U_1},\randWorld_{U_2},\randWorld_A)\in\inset{0,1}^2$, we have $\expct\pbox{\randWorld_{U_1}\randWorld_{U_2}}=0$. Further, since the world vectors are Binary vectors, we have $\expct\pbox{\monomial{1,R}}=\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_1}}\expct\pbox{\randWorld_{B}}+$ $4\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_2}}\expct\pbox{\randWorld_{B}}\stackrel{\text{def}}{=}\rmonomial{1}\inparen{p_A,\probOf\inparen{U=1},\probOf\inparen{U=2},p_B}$. We only did the argument for a single monomial but by linearity of expectation we can apply the same argument to all monomials in $\poly_1^2$. Generalizing this argument to arbitrary $\poly$ leads us to consider its following `reduced' version:
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly\inparen{\inparen{X_\tup}_{\tup\in\tupset}}$ define the reformulated polynomial $\refpoly{}\inparen{\inparen{X_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}
$ to be the polynomial $\refpoly{}$ = $\poly\inparen{\inparen{\sum_{j\in\pbox{\bound}}j\cdot X_{\tup, j}}_{\tup\in\tupset}}
$ and ii) define the \emph{reduced polynomial} $\rpoly\inparen{\inparen{X_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}
$ to be the polynomial resulting from converting $\refpoly{}$ into the standard monomial basis\footnote{
This is the representation, typically used in set-\abbrPDB\xplural, where the polynomial is reresented as sum of `pure' products. See \Cref{def:smb} for a formal definition.
} (\abbrSMB),
removing all monomials containing the term $X_{\tup, j}X_{\tup, j'}$ for any $\tup\in\tupset, j\neq j'\in\pbox{c}$, and setting each \emph{variable}'s exponents $e > 1$ to $1$.
\end{Definition}
%Continuing with the example\footnote{To save clutter we do not show the full expansion for variables with greatest multiplicity $= 1$ since e.g. for variable $A$, the sum of products itself evaluates to $1^2\cdot A^2 = A$.}
% $\poly_1^2\inparen{A, B, C, E, X_1, X_2, Y, Z}$ we have $\rpoly_1^2(A, B, C, E, X_1, X_2, Y, Z)=$
%\begin{align*}
%&A\inparen{\sum\limits_{j\in\pbox{\bound}}j^2X_j}B + BYE + BZC + 2A\inparen{\sum\limits_{j\in\pbox{\bound}}j^2X_j}BYE \\
%&\qquad+ 2A\inparen{\sum\limits_{j\in\pbox{\bound}}j^2X_j}BZC + 2BYEZC \\
%&= ABX_1 + AB\inparen{2}^2X_2+ BYE + BZC + 2AX_1BYE+ 2A\inparen{2}^2X_2BYE\\
%&\qquad + 2AX_1BZC + 2A\inparen{2}^2X_2BZC + 2BYEZC.
%\end{align*}
As we have essentially argued earlier, the expecation for our specific example is $\rpoly_1^2(\probOf\inparen{A=1}, \ldots,
%$\allowbreak$\probOf\inparen{B=1}, \probOf\inparen{C=1}$,\allowbreak $\probOf\inparen{E=1},$\allowbreak $\probOf\inparen{U_1=1}, \probOf\inparen{U_2=1}, \probOf\inparen{Y=1},
\probOf\inparen{Z=1})$.
This equivalence generalizes to {\em all} $\raPlus$ queries on \abbrCTIDB\xplural (proof in \Cref{subsec:proof-exp-poly-rpoly}):
\begin{Lemma}\label{lem:tidb-reduce-poly}
For any \abbrCTIDB $\pdb$, $\raPlus$ query $\query$, and lineage polynomial
$\poly\inparen{\vct{X}}$%=\textcolor{red}{CHANGE}\poly\pbox{\query,\tupset,\tup}\inparen{\vct{X}}$
, it holds that $
\expct_{\vct{W} \sim \pdassign}\pbox{\poly\inparen{\vct{W}}} = \rpoly\inparen{\probAllTup}
$, where $\probAllTup = \inparen{\prob_{\tup,j}}_{\tup\in\tupset,j\in\pbox{\bound}}.$
\end{Lemma}
%\noindent
%The proof in \Cref{subsec:proof-exp-poly-rpoly} follows by~\Cref{prop:ctidb-reduct} and~\Cref{lem:bin-bidb-phi-eq-redphi}.
\subsection{Our Techniques}
\mypar{Lower Bound Proof Techniques}
%Our main hardness result shows that computing~\Cref{prob:expect-mult} is $\sharpwonehard$ for $1$-\abbrTIDB.
To prove the lower bounds in \Cref{tab:lbs} we show that for the same $\query_1$ from the example above, for an arbitrary `product width' $k$, the query $\qhard^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $\bigO{1}$ tuples in \Cref{fig:two-step}).
We do so by considering an arbitrary graph $G$ (analogous to relation $\boldsymbol{R}$ of $\query_1$) and analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in $G$ that are isomorphic to various subgraphs with $k$ edges. E.g., for the last two rows in \cref{tab:lbs}, we exploit the fact that the coefficient corresponding to $\prob^{2k}$ in $\rpoly\inparen{\prob,\ldots,\prob}$ of $\qhard^k$ is proportional to the number of $k$-matchings in $G$,
a known hard problem in parameterized/fine-grained complexity literature.
\mypar{Upper Bound Techniques}
Our negative results (\Cref{tab:lbs}) indicate that \abbrCTIDB{}s (even for $\bound=1$) cannot achieve comparable performance to deterministic databases for exact results (under complexity assumptions). In fact, under well established hardness conjectures, one cannot (drastically) improve upon the trivial algorithm to exactly compute the expected multiplicities for $1$-\abbrTIDB\xplural. A natural followup is whether we can do better if we are willing to settle for an approximation to the expected multiplities.
\input{two-step-model}
We adopt the two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
(i) \termStepOne (\abbrStepOne): Given input $\tupset$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial $\poly(\vct{X})%=\textcolor{red}{CHANGE}\apolyqdt\inparen{\vct{X}}$
$;
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute a $(1\pm \eps)$-approximation $\expct_{\randWorld\sim\bpd}\pbox{\poly(\vct{\randWorld})}$.
Let $\timeOf{\abbrStepOne}(\query,\tupset,\circuit)$ denote the runtime of \abbrStepOne when it outputs $\circuit$ (a representation of $\poly$ as an arithmetic circuit --- more on this representation in~\Cref{sec:expression-trees}).
Denote by $\timeOf{\abbrStepTwo}(\circuit, \epsilon)$ (recall $\circuit$ is the output of \abbrStepOne) the runtime of \abbrStepTwo when $\poly$ is input as $\circuit$. Then to answer if we can compute a $(1\pm \eps)$-approximation to the expected multiplicity, it is enough to answer the following:
%which we can leverage~\Cref{def:reduced-poly} and~\Cref{lem:tidb-reduce-poly} to address the next formal objective:
\begin{Problem}[\abbrCTIDB linear time approximation]\label{prob:big-o-joint-steps}
Given \abbrCTIDB $\pdb$, $\raPlus$ query $\query$,
is there a $(1\pm\epsilon)$-approximation of $\expct_{\rvworld\sim\bpd}$\allowbreak$\pbox{\query\inparen{\rvworld}\inparen{\tup}}$ for all result tuples $\tup$ where there exists
$\circuit : \timeOf{\abbrStepOne}(\query,\tupset, \circuit) + \timeOf{\abbrStepTwo}(\circuit, \epsilon) \le$\allowbreak$ O_\epsilon(\qruntime{\optquery{\query}, \tupset, \bound})$?
\end{Problem}
A key insight of this paper is that the representation of $\circuit$ matters.
For example, if we insist that $\circuit$ represent the lineage polynomial in \abbrSMB, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$, where $|\circuit|$ is the size of circuit $\circuit$.
Hence, just $\timeOf{\abbrStepOne}(\query,\tupset,\circuit)$ is too large.
However, systems can directly emit compact, factorized representations of $\poly(\vct{X})$ (e.g., as a consequence of the standard projection push-down optimization~\cite{DBLP:books/daglib/0020812}).
Accordingly, this work uses (arithmetic) circuits\footnote{
An arithmetic circuit is a DAG with variable/numeric source gates and multiplication/addition internal gates.
}
as the representation system of $\poly(\vct{X})$, and we show in \Cref{sec:circuit-depth} an $\bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more precisely, a circuit $\circuit$ with $\numvar$ sinks, one per output tuple).% representing the tuple's lineage).
%
Since a representation $\circuit^*$ exists where $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$ and
the size of $\circuit^*$ is bounded by $\qruntime{\optquery{\query}, \tupset, \bound}$ (i.e., $|\circuit^*| \le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$) (see~\Cref{sec:circuit-runtime}), we can focus on the complexity of \abbrStepTwo.
%Thus, the question of approximation can be stated as the following stronger (since~\Cref{prob:big-o-joint-steps} has access to \emph{all} equivalent \circuit representing $\query\inparen{\vct{W}}\inparen{\tup}$), but sufficient condition:
%Given such a $\circuit^*$,
Thus, to solve \Cref{prob:big-o-joint-steps}, it is \emph{sufficient} to solve: % the following problem:
\begin{Problem}\label{prob:intro-stmt}
Given any circuit $\circuit$ that encodes $\Phi\inparen{\vct{X}}$ for all result tuples $\tup$ (one sink per $\tup$) for \abbrCTIDB $\pdb$ and $\raPlus$ query $\query$, does there exist an algorithm that computes a $(1\pm\epsilon)$-approximation of $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$ (for all result tuples $\tup$) in $\bigO{|\circuit|}$ time?
\end{Problem}
We will formalize the notions of circuits and hence, \Cref{prob:intro-stmt} in \Cref{sec:expression-trees}. For an upper bound on approximating the expected count, it is easy to check that if all the probabilties are constant then (with an additive adjustment) $\poly\left(\prob_1,\dots, \prob_n\right)$ is a constant factor approximation of $\rpoly$ (where we assume $\tupset=[n]$).
This is illustrated in the following example using $\query_1^2$ from earlier. To aid in presentation we again limit our focus to $\monomial{1,R}(A,U,B)$, assume $\bound = 2$ for variable $U$ and $\bound = 1$ for all other variables. Let $\prob_X$ denote $\probOf\pbox{X = 1}$.
%In computing $\rpoly$, we have some cancellations to deal with:
Then we have:
%
%\begin{footnotesize}
%\begin{equation*}
$\monomial{1,R}\inparen{\vct{X}} = A^2\inparen{U_1^2 + 4U_1U_2 + 4U_2^2}B^2 =A^2U_1^2B^2 + 4A^2U_1U_2B^2+4A^2U_2^2B^2$, which in turn implies:
%&\qquad+ 2AX_2B^2YE + 2AX_1B^2ZC + 2AX_2B^2ZC + 2B^2YEZC\\
%\end{equation*}
%\end{footnotesize}
%Recall that
%\begin{footnotesize}
%\begin{equation*}
%\end{equation*}
%\end{footnotesize}
\[ \monomial{1,R}\inparen{\probAllTup} -4\prob_A^2\prob_{U_1}\prob_{U_2}\prob_B^2=\prob_A^2\prob_{U_1}^2\prob_B^2 + 4\prob_A^2\prob_{U_2}^2\prob_B^2.\]
%Substituting $\vct{\prob}$ for $\vct{X}$,
%\begin{footnotesize}
%\begin{align*}
%\hspace*{-3mm}
% \refpoly{1, }^{\inparen{ABX}^2}\inparen{\probAllTup} &=\prob_A^2\prob_{X_1}^2\prob_B^2 + 4\prob_A^2\prob_{X_1}\prob_{X_2}\prob_B^2 + 4\prob_A^2\prob_{X_2}^2\prob_B^2\\% + \prob_B^2\prob_Y^2\prob_E^2 + \prob_B^2\prob_Z^2\prob_C^2 + 2\prob_A\prob_{X_1}\prob_B^2\prob_Y\prob_E\\
%&\qquad+ 2\prob_A\prob_{X_2}\prob_B^2\prob_Y\prob_E + 2\prob_A\prob_{X_1}\prob_B^2\prob_Z\prob_C + 2\prob_A\prob_{X_2}\prob_B^2\prob_Z\prob_C+ 2\prob_B^2\prob_Y\prob_E\prob_Z\prob_C\\
% &\leq\prob_A\prob_{X_1}\prob_B + 4\prob_A^2\prob_{X_1}\prob_{X_2}\prob_B^2 + 4\prob_A\prob_{X_2}\prob_b\\% + \prob_B\prob_Y\prob_E + \prob_B\prob_Z\prob_C + 2\prob_A\prob_{X_1}\prob_B\prob_Y\prob_E \\
%&\qquad + 2\prob_A\prob_{X_2}\prob_B\prob_Y\prob_E+ 2\prob_A\prob_{X_1}\prob_B\prob_Z\prob_C + 2\prob_A\prob_{X_2}\prob_B\prob_Z\prob_C + 2\prob_B\prob_Y\prob_E\prob_Z\prob_C\\
% &= \rpoly_1^{\inparen{ABX}^2}\inparen{\vct{p}} + 4\prob_A^2\prob_{X_1}\prob_{X_2}\prob_B^2.
%\end{align*}
%\end{footnotesize}
Noting that $\rmonomial{1}\inparen{\vct{X}} = AU_1B+4AU_2B$,
if we assume that all probability values are in $[p_0,1]$ for some $p_0>0$,
%then given access to $\refpoly{1, }^{\inparen{ABX}^2}\inparen{\vct{\prob}} - 4\prob_A^2\prob_{X_1}\prob_{X_2}\prob_B^2$
we get that $\monomial{1,R}\inparen{\vct{\prob}} - 4\prob_A^2\prob_{U_1}\prob_{U_2}\prob_B^2$ is in the range $\pbox{p_0^3\cdot\rmonomial{1}\inparen{\vct{\prob}}, \rmonomial{1}\inparen{\vct{\prob}}}$.
%We can simulate sampling from $\refpoly{1, }^2\inparen{\vct{X}}$ by sampling monomials from $\refpoly{1, }^2$ while ignoring any samples $A^2X_1X_2B^2$.
Note however, that this is \emph{not a tight approximation}.
In~\Cref{sec:algo} we demonstrate that a $(1\pm\epsilon)$ (multiplicative) approximation with competitive performance is achievable.
To get an $(1\pm \epsilon)$-multiplicative approximation and solve~\Cref{prob:intro-stmt}, using \circuit we uniformly sample monomials from the equivalent \abbrSMB representation of $\poly$ (without materializing the \abbrSMB representation) and `adjust' their contribution to $\widetilde{\poly}\left(\vct{p}\right)$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Applications}
Recent work in heuristic data cleaning~\cite{yang:2015:pvldb:lenses,DBLP:journals/vldb/SaRR0W0Z17,DBLP:journals/pvldb/RekatsinasCIR17,DBLP:journals/pvldb/BeskalesIG10} emits a \abbrPDB when insufficient data exists to select the `correct' data repair.
Probabilistic data cleaning is a crucial innovation, as the alternative is to arbitrarily select one repair and `hope' that queries receive meaningful results.
Although \abbrPDB queries instead convey the trustworthiness of results~\cite{kumari:2016:qdb:communicating}, they are impractically slow~\cite{feng:2019:sigmod:uncertainty,feng:2021:sigmod:efficient}, even in approximation (see \Cref{sec:karp-luby}).
Bags, as we consider, are sufficient for production use, where bag-relational algebra is already the default for performance reasons.
Our results show that bag-\abbrPDB\xplural can be competitive, laying the groundwork for probabilistic functionality in production database engines.
% \mypar{Concurrent Work}
% In work independent of ours, Grohe, et. al.~\cite{https://doi.org/10.48550/arxiv.2201.11524} investigate bag-\abbrTIDB\xplural allowing for unbounded multiplicities (which requires them to explicitly address the issue of a succinct representation of the distribution over infinitely many multiplicities). While the authors observe that computing the expected value of an output tuple multiplicity is in polynomial time, no further (fine-grained) analysis of the expected value is considered. The work primarily investigates the query evaluation problem over bag-\abbrTIDB\xplural when computing the probability of an output tuple having at most a multiplicity of $k$, showing that a dichotomy exists for this problem. Our work in contrast assumes a finite bound on the multiplicities where we simply list the finitely many probability values (and hence do not need consider a more succinct representation). Further, our work primarily looks into the fine-grained analysis of computing the expected multiplicity of an output tuple.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present background and notation in \Cref{sec:background}. We prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}.
Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}. All proofs are in the appendix.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: