%!TEX root=./main.tex
%root: main.tex
\section{Introduction}\label{sec:intro}
A probabilistic database (PDB) $\pdb$ is a pair $\inparen{\idb, \pd}$, where $\idb$ is a set of deterministic database instances called possible worlds and $\pd$ is a probability distribution over $\idb$.
\AHchange{
A tuple-independent database (\abbrTIDB), to which we return below, is a \abbrPDB in which the presence of each tuple is an independent random event.
}
A commonly studied problem in probabilistic databases is, given a query $\query$, PDB $\pdb$, and possible query result tuple $\tup$, to compute the tuple's \textit{marginal probability} of being in the query's result, i.e., computing the expectation of a Boolean random variable over $\pd$ that is $1$ for every $\db \in \idb$ for which $\tup \in \query(\db)$ and $0$ otherwise.
In this work, we are interested in bag semantics, where each tuple is associated with a multiplicity.
Following~\cite{DBLP:conf/pods/GreenKT07}, we model bag databases (resp., relations) as functions from each $\tup$ to the tuple's multiplicity $\db(\tup) \in \semN$ in a possible world $\db$.
\sout{
We refer to such a probabilistic database as a bag-probabilistic database or \abbrBPDB for short.
}
The natural generalization of the \AHchange{(set)} problem of computing marginal probabilities of query result tuples to bag semantics is to compute the expectation of a random variable over $\pd$ that is assigned value $\query(\db)(\tup) \in \semN$ in world $\db \in \idb$
\AHchange{
, formally $\expct_{\randDB\sim\pd}\pbox{\query\inparen{\randDB}\inparen{\tup}}$.
}
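For intuition, consider a toy \abbrPDB (a hypothetical example of ours, with numbers chosen only for concreteness) that has two possible worlds $\db_1$ and $\db_2$ with probabilities $0.3$ and $0.7$, where $\query\inparen{\db_1}\inparen{\tup} = 2$ and $\query\inparen{\db_2}\inparen{\tup} = 0$. Assuming a finite set of worlds, expanding the expectation world by world gives
\begin{align*}
\expct_{\randDB\sim\pd}\pbox{\query\inparen{\randDB}\inparen{\tup}}
  = \sum_{\db\in\idb} \pd\inparen{\db}\cdot\query\inparen{\db}\inparen{\tup}
  = 0.3\cdot 2 + 0.7\cdot 0 = 0.6.
\end{align*}
Under set semantics, the corresponding quantity would instead be the marginal probability $0.3$ that $\tup$ appears in the result at all.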
%OK: done
%\AH{I think I understand what is being stated in this last sentence, but I wonder if phrasing the end something like, ``for world $\db \in \idb$ would be easier to digest for the average reviewer...maybe it was just me.}
% In bag query semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ is the multiplicity of its corresponding output tuple $\tup$ (in a random database instance in $\idb$ chosen according to $\pd$).
%In addition to traditional deterministic query evaluation requirements (for a given query class), the query evaluation problem in bag-\abbrPDB semantics can be formally stated as:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{Problem}[Expected Multiplicity]\label{prob:bag-pdb-query-eval}
%Given an \raPlus query\footnote{The class of positive relational algebra (\raPlus) queries consists of all queries that can be composed of the positive (monotonic) relational algebra operators: selection, projection, join, and union (SPJU).\label{footnote:ra-def}} $\query$, \abbrBPDB $\pdb$, and result tuple $\tup$, compute the expected
%multiplicity ($\expct_{\randDB\sim\pd}\pbox{\query\inparen{\randDB}\inparen{\tup}}$)
%of tuple $\tup$.
%\end{Problem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
For \lstinline{COUNT(*)} queries, expected multiplicities can model the expected count. \sout{
The equivalent set-\abbrPDB operation, simply computes the probability that this count is non-zero.
Further,
} We are interested in the parameterized complexity of
\AHchange{
computing the expectation,
}%\Cref{prob:bag-pdb-query-eval}
(i.e. we think of $\query$ as being parameterized by some parameter $k$ with the size of the database going to infinity relative to $k$). Unless stated otherwise, we implicitly assume the probability distribution $\pd$, and for notational convenience use $\expct\pbox{\cdot}$ instead of $\expct_\pd\pbox{\cdot}$.
\AHchange{
While the parameterized and fine-grained results of this paper apply to general \abbrPDB\xplural, we start by focusing on a restricted form of \abbrTIDB which we refer to as \abbrCTIDB.
As alluded to, a \abbrTIDB is a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world is modeled as an independent probabilistic event.\footnote{
This model is exactly the definition of \abbrTIDB{}s \cite{VS17} under set semantics. Note that this is only one possible definition of \abbrTIDB{}s under bag semantics. In \Cref{sec:gener-results-beyond} we discuss alternatives and to what degree our results extend to these alternatives.\label{footnote:set-not-limit}
}
We will denote $\dbbase=(\tup_1,\dots,\tup_\numvar)$. Each of the $2^\numvar$ possible worlds in $\idb$ can be encoded as a string in $\{0,1\}^\numvar$. In particular, any vector $\vct{W}=\inparen{W_1,\dots,W_\numvar}\in \{0,1\}^\numvar$ represents a world $\db\in\idb$ in the natural way, i.e., $\tup_i\in\db$
iff $W_i=1$. Furthermore, $\pd$ is compactly described by a tuple $\vct{p}=\inparen{p_1,\dots,p_\numvar}$, which induces a product of Bernoulli distributions over vectors $\vct{W}\in\{0,1\}^\numvar$: the $W_i$ are independent and, for each $i\in [\numvar]$, $\probOf(W_i=1)=p_i$.
We then define a \abbrCTIDB to be a bag \abbrTIDB with the further restriction that each tuple $\tup$ has a multiplicity of at most some constant $c$; formally, $\forall \db \in \idb, ~\forall \tup \in \db, ~\db\inparen{\tup}\leq c$.
}
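To make this encoding concrete, consider a toy instance with $\numvar = 2$ and $\vct{p}=\inparen{p_1,p_2}$ (our own illustration; the lowercase $\vct{w}$ below is notation introduced only for this example). The four worlds and their probabilities are
\begin{align*}
\probOf(\vct{W}=(0,0)) &= (1-p_1)(1-p_2), & \probOf(\vct{W}=(1,0)) &= p_1(1-p_2),\\
\probOf(\vct{W}=(0,1)) &= (1-p_1)\,p_2,   & \probOf(\vct{W}=(1,1)) &= p_1p_2,
\end{align*}
which sum to $1$. In general, independence gives $\probOf(\vct{W}=\vct{w}) = \prod_{i:\, w_i=1} p_i \cdot \prod_{i:\, w_i=0} \inparen{1-p_i}$ for every $\vct{w}\in\{0,1\}^\numvar$.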
\noindent\AHchange{
For notational convenience we make use of the following definitions.
\begin{Definition}[$\dbbase$]
Let $\dbbase$ be the relation composed of all possible tuples in $\pdb$, i.e. $\dbbase = \bigcup_{\db \in \idb}\db$.
\end{Definition}
\begin{Definition}[$\pdassign$]
Given a \abbrCTIDB $\pdb = \inparen{\idb, \pd}$ and the set of all $2^\numvar$ worlds $W$, denote the probability distribution induced from $\pd$ over each world $\wElem \in W$ as $\pdassign$.
\end{Definition}
}
\sout{Further, define $\dbbase=\bigcup_{\db\in\idb} \db$.}
A common encoding of probabilistic databases (e.g., in \cite{IL84a,Imielinski1989IncompleteII,Antova_fastand,DBLP:conf/vldb/AgrawalBSHNSW06} and many others) relies on annotating tuples with lineages, propositional formulas that describe the set of possible worlds that the tuple appears in.
%\AR{Removed couple of sentence on lineage formula since we explicitly define $\poly$ now.}
%
%Each valuation of the random variables appearing in this formula corresponds to one possible world.
%Given a joint probability distribution over such assignments, the marginal probability of a query result tuple $\tup$ is the probability that the lineage formula of $\tup$ evaluates to true. Given a \abbrBPDB $\pdb$, we refer to the above encoding of $\pdb$ as \dbbaseName and denote it as $\dbbase$.
%
The bag semantics analog is a provenance/lineage polynomial $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07} (see~\Cref{fig:nxDBSemantics} for a definition), a polynomial with non-zero integer coefficients and exponents, over integer variables $\vct{X}$ encoding input tuple multiplicities.
\begin{figure}
\begin{align*}
\polyqdt{\project_A(\query)}{\dbbase}{\tup} =& \sum_{\tup': \project_A(\tup') = \tup} \polyqdt{\query}{\dbbase}{\tup'} &
\polyqdt{\query_1 \union \query_2}{\dbbase}{\tup} =& \polyqdt{\query_1}{\dbbase}{\tup} + \polyqdt{\query_2}{\dbbase}{\tup}\\
\polyqdt{\select_\theta(\query)}{\dbbase}{\tup} =& \begin{cases}
\polyqdt{\query}{\dbbase}{\tup} & \text{if }\theta(\tup) \\
0 & \text{otherwise}.
\end{cases} &
\begin{aligned}
\polyqdt{\query_1 \join \query_2}{\dbbase}{\tup} =\\ ~
\end{aligned}&
\begin{aligned}
&\polyqdt{\query_1}{\dbbase}{\project_{\attr{\query_1}}{\tup}} \\
&~~~\cdot\polyqdt{\query_2}{\dbbase}{\project_{\attr{\query_2}}{\tup}}
\end{aligned}\\
& & \polyqdt{\rel}{\dbbase}{\tup} =&\begin{cases}
X_\tup & \text{if }\dbbase.\rel\inparen{\tup} = 1 \\
0 &\text{otherwise.}\end{cases}
%\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \evald{\project_A(\rel)}{\db}(\tup) =& \sum_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') &
% \evald{(\rel_1 \union \rel_2)}{\db}(\tup) =& \evald{\rel_1}{\db}(\tup) + \evald{\rel_2}{\db}(\tup)\\
% \evald{\select_\theta(\rel)}{\db}(\tup) =& \begin{cases}
% \evald{\rel}{\db}(\tup) & \text{if }\theta(\tup) \\
% 0 & \text{otherwise}.
% \end{cases} &
% \begin{aligned}
% \evald{(\rel_1 \join \rel_2)}{\db}(\tup) =\\ ~
% \end{aligned}&
% \begin{aligned}
% &\evald{\rel_1}{\db}(\project_{\attr{\rel_1}}(\tup)) \\
% &~~~\cdot\evald{\rel_2}{\db}(\project_{\attr{\rel_2}}(\tup))
% \end{aligned}\\
% & & \evald{R}{\db}(\tup) =& \rel(\tup)
\end{align*}\\[-10mm]
\caption{Construction of the lineage (polynomial) for an $\raPlus$ query over a \abbrBPDB, where $\vct{X}$ consists of all $X_\tup$ over all $\rel$ in $\dbbase$ and $\tup$ in $\rel$. Here $\dbbase.\rel$ denotes the instance of relation $\rel$ in $\dbbase$.} % Evaluation semantics $\evald{\cdot}{\db}$ for $\semNX$-DBs~\cite{DBLP:conf/pods/GreenKT07}.}
\label{fig:nxDBSemantics}
\end{figure}
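As an illustration of the rules in \Cref{fig:nxDBSemantics} (the instance is hypothetical, and $X_1,X_2,X_3$ are shorthand for the corresponding $X_\tup$), let $\dbbase$ contain a relation $R(A)$ with the single tuple $(a)$, annotated $X_1$, and a relation $S(A,B)$ with tuples $(a,b_1)$ and $(a,b_2)$, annotated $X_2$ and $X_3$ respectively. For $\query = \project_A\inparen{R\join S}$ and $\tup = (a)$, the projection rule sums over the two join results that project onto $\tup$, and the join rule multiplies the annotations of the joined input tuples:
\begin{align*}
\polyqdt{\query}{\dbbase}{\tup}
  = \polyqdt{R\join S}{\dbbase}{(a,b_1)} + \polyqdt{R\join S}{\dbbase}{(a,b_2)}
  = X_1X_2 + X_1X_3.
\end{align*}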
%Analog to set-semantics, computing the expected multiplicity of a tuple reduces to computing the expectation of this polynomial.
We drop $\query$, $\dbbase$, and $\tup$ from $\apolyqdt$ when they are clear from the context or irrelevant to the discussion. We now
\sout{
re-state
} %~\Cref{prob:bag-pdb-query-eval}
\AHchange{
specify the problem of computing the expectation of tuple multiplicity
}
in the language of lineage polynomials:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}[Expected Multiplicity of Lineage Polynomials]\label{prob:bag-pdb-poly-expected}
Given an $\raPlus$ query $\query$,
\AHchange{
\abbrCTIDB $\pdb$
}
and result tuple $\tup$, compute the expected
multiplicity of the polynomial $\apolyqdt$ (i.e., $\expct_{\vct{W}\sim \pdassign}\pbox{\apolyqdt(\vct{W})}$).
\sout{,
where $\pdassign$ is the distribution induced by $\pd$ on the relevant assignments $\vct{W}$ to variables of $\apolyqdt$.
}
\end{Problem}
We note that %\Cref{prob:bag-pdb-query-eval}
\AHchange{
computing $\expct_{\randDB\sim\pd}\pbox{\query\inparen{\randDB}\inparen{\tup}}$
}
is equivalent to \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for several models of probabilistic databases and various encodings of such polynomials.
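As a small illustration (a hypothetical instance, not one used later in the paper), suppose a result tuple has lineage polynomial $\poly(\vct{X}) = X_1X_2 + X_1X_3$ over a \abbrTIDB with tuple probabilities $p_1,p_2,p_3$. Linearity of expectation together with independence of the $W_i$ gives
\begin{align*}
\expct\pbox{\poly(\vct{W})} = \expct\pbox{W_1W_2} + \expct\pbox{W_1W_3} = p_1p_2 + p_1p_3.
\end{align*}
Note that since each $W_i\in\{0,1\}$, we have $W_i^2 = W_i$, so a monomial such as $W_1^2W_2$ has expectation $p_1p_2$ rather than $p_1^2p_2$.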
%\mypar{\abbrTIDB\xplural}
%We initially focus on tuple-independent probabilistic bag-databases\footnote{See \cite{DBLP:series/synthesis/2011Suciu} for a survey of set-\abbrTIDBs; the bag encoding is analogous~\cite{DBLP:conf/pods/GreenKT07}.} (\abbrTIDB\xplural), a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world is modeled as an independent probabilistic event.\footnote{
% This model is exactly the definition of \abbrTIDB{}s \cite{VS17} under set semantics. Note that this is only one possible definition of \abbrTIDB{}s under bag semantics. In \Cref{sec:gener-results-beyond} we discuss alternatives and to what degree our results extend to these alternatives.\label{footnote:set-not-limit}
% % Mirroring the implementation of bag relations in production database systems (e.g., Postgresql, DB2), tuple multiplicities are modeled by retaining copies of each tuple (up to its largest possible multiplicity).
% % % To make each duplicate tuple unique in a set-\abbrTIDB we can assign unique keys across all duplicates.
% % When the multiplicity of input tuple is bound by some constant,
% % the increased input size is negligible.\label{footnote:set-not-limit}
%}
%% OK: I tidied things up a touch.
%%\BG{The footnote is still a bit hard to follow I think, but I do not have a great suggestion on how to improve it.}
%We will denote $\dbbase=(t_1,\dots,t_\numvar)$. Each of the $2^n$ possible worlds in $\Omega$ can be encoded as a string in $\{0,1\}^\numvar$. In particular, any vector $\vct{W}=\inparen{W_1,\dots,W_n}\in \{0,1\}^\numvar$ represents a world $\db\in\idb$ in the natural way: i.e. $\tup_i\in\db$
%iff $W_i=1$. Furthermore, $\pd$ is compactly described by a tuple $\vct{p}=\inparen{p_1,\dots,p_n}$, which induces the Bernoulli distribution over vectors $\vct{W}\in\{0,1\}^\numvar$ where each $i\in [n]$, $\probOf(W_i=1)=p_i$.
%Finally for each $\vct{W}\in\{0,1\}^\numvar$, we define $\pdb_{\vct{W}}$
%\AH{Where do we use this notation? If we use this somewhere, should we maybe use $\db_{\vct{\randWorld}}$ instead?}
% as the world represented by $\vct{W}$.
%Atri: I don't think we use it so removing it.
%Atri: Stuff below was confusing, so am re-writing it.
%A \abbrTIDB encodes a compatible $\pdb$ as a deterministic database $\encodedDB$ with $\numvar$ tuples, each annotated with a probability $\prob_\tup$, and with $\pd$
%with a deterministic table $\encodedDB$ which is a set of $\numvar$ tuples, encoding the set of possible worlds $\idb$. The probability distribution $\pd$ over the set of database instances (possible worlds) is the one
%being the distribution induced from the requirement that each tuple $\tup \in \encodedDB$ be treated as an independent Bernoulli distributed random variable with probability $\prob_\tup$.
%The possible worlds of a \abbrTIDB can be encoded by the vector $\vct{W}$, such that each of the $\numvar$ tuples in $\vct{W}$ has its own unique Bernoulli-distributed random variable, i.e. $\vct{W} = \inparen{W_{\tup_1},\ldots, W_{\tup_\numvar}}$, and for each tuple $\tup$, $\probOf(W_\tup) = \prob_\tup$.
%Given a vector $\vct{X}$ such that each $\tup \in \encodedDB$ has a unique formal variable annotation $X_\tup \in \vct{X}$, for a boolean domain $\{0,1\}^\numvar$, denote by $\pdb_{\vct{X}}$ the deterministic database consisting of exactly those tuples $\tup$ where $X_\tup = 1$.
%\BG{REMOVED:
%When $\pdb$ is a \abbrTIDB, for every output tuple $\tup$, $\query\inparen{\pdb}\inparen{\tup}$ can be encoded by a polynomial, with variables in $\vct{X}$.
%Green, Karvounarakis, and Tannen established (\cite{DBLP:conf/pods/GreenKT07}; see \Cref{fig:nxDBSemantics}) that for any $\raPlus$ query $\query$ and \abbrTIDB $\pdb$, there exists a polynomial $\poly_\tup\inparen{\vct{X}}$ following the standard addition and multiplication operators over Natural numbers (i.e., $\semN$-semiring semantics), such that $\query\inparen{\pdb_{\vct{W}}}\inparen{\tup} = \poly_\tup\inparen{\vct{W}}$.
%This in turn implies that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct_{\vct{W}\sim\pd}\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$.}
Thanks to linearity of expectation, simple polynomial-time algorithms (for fixed query $\query$) exist for computing the expectation of a lineage polynomial $\apolyqdt$ when $\pdb$ is a \abbrTIDB and $\query$ is an $\raPlus$ query.
% The algo is trivial so I think putting in a 2010 cite seems like bit too much
%\cite{kennedy:2010:icde:pip})
% for computing exact results for bag-probabilistic count queries $Q$ over \abbrTIDB{}s.
However, it is also known that {\em deterministic} query processing for the same query $Q$ (over the deterministic database instance $\dbbase$) can be done in polynomial time.
If our notion of efficiency were simply achieving a polynomial time algorithm, then we would be done.
However, in practice (and in theory), we care about the {\em fine-grained}/parameterized complexity of deterministic query processing (i.e. we care about the exact exponent in our polynomial runtime).
Given \abbrCTIDB $\pdb$ and query $\query$, let $\timeOf{}^*(Q,\pdb)$ denote the (optimal) runtime complexity of computing expected multiplicity (over all result tuples $\tup$). %\AR{Am changing these runtime definitions to include the runtime for all result tuples $\tup$.}
Denote by $\qruntime{Q, \db}$ the `runtime' of query $Q$ on deterministic database $\db$ under a cost model that is satisfied by a wide range of query processing algorithms, including those based on the recent work on worst-case optimal join algorithms (we make this runtime concrete in \Cref{sec:gen}). %\AR{We need to move the definition of $\qruntime{}$ to \Cref{sec:background} because among others we now need it in our lower bound arguments as well.}).
%Denoting by $\dbbase = \bigcup_{\db \in \idb} \db$ the set of all possible tuples in \abbrPDB $\pdb = \inparen{\idb, \pd}$,
\sout{
We finally have all the pieces to state a formal specification of our problem:
}
\AHchange{
Given the above, the natural question is whether it is always the case that $\timeOf{}^*\inparen{\query, \pdb}\leq O\inparen{\qruntime{\query, \dbbase}}$.
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{Problem}\label{prob:informal}
%Given an $\raPlus$ query $\query$ and \abbrTIDB
%% OK: added motivation
%%\AR{Changed this to \abbrTIDB: we should motivate why we are restricting ourselves to this special case here.}
%\abbrBPDB $\pdb$, is it the case that $\timeOf{}^*(Q,\pdb) \le O(\qruntime{Q, \dbbase})$?
%\end{Problem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% However the question remains: \emph{can bag-probabilistic databases be as fast as deterministic queries}.
%In this paper, we explore the \emph{fine-grained complexity} of bag-probabilistic database query evaluation.
%Atri: I'm not sure if this comment makes much sense here-- it sort of breaks the flow I think. I'll refer to this when talking about our results.
%The problem of deterministic query evaluation is known to be \sharpwonehard\footnote{A problem is in \sharpwone if the runtime of the most efficient known algorithm to solve it is lower bounded by some function $f$ of a parameter $k$, where the growth in runtime is polynomially dependent on $f(k)$, i.e. $\Omega\inparen{\numvar^{f(k)}}$.} in data complexity for general $\query$. For example, the counting $k$-cliques query problem (where the parameter $k$ is the size of the clique) is \sharpwonehard since (under standard complexity assumptions) it cannot run in time faster than $n^{f(k)}$ for some strictly increasing $f(k)$.
%In this paper, we begin to explore whether the problem of bag-probabilistic query evaluation (which we relate to deterministic query processing more precisely below) falls into this same complexity class.
\AHchange{
This question
}
concerns a special case of computing the expected multiplicity of $\tup$: we are asking whether query evaluation over a \abbrCTIDB takes time {\em linear} in the runtime of deterministic query processing.
We stress that this question is very well motivated, even for \abbrTIDB\xplural: an answer \AHchange{to the above question} in the affirmative \AH{not sure that this is the best way of putting it} would indicate that bag-probabilistic databases can be competitive with deterministic databases, opening the door for deployment in practice.
\mypar{Our lower bound results} Unfortunately, we prove that this is not the case. In fact in Table~\ref{tab:lbs} %\AR{Cref was not formatting Table correct so added Table in explicitly.}
we show that depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to \AHchange{the above question}.%\Cref{prob:informal}.
\begin{table}
\begin{tabular}{|p{0.43\textwidth}|p{0.12\textwidth}|p{0.35\textwidth}|}
\hline
Lower bound on $\timeOf{}^*(\query,\pdb)$ & Num. $\pd$s & Hardness Assumption\\
\hline
$\Omega\inparen{\inparen{\qruntime{\query, \dbbase}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
%\hline
$\omega\inparen{\inparen{\qruntime{\query, \dbbase}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
%\hline
$\Omega\inparen{\inparen{\qruntime{\query, \dbbase}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & \Cref{conj:known-algo-kmatch}\\ %Multiple & Current $k$-matching algorithms\\
\hline
\end{tabular}
\caption{Our lower bounds for a specific hard query $Q$ parameterized by $k$. The $\pdb$ is over the same (family of) $\dbbase$, and the rows with `Multiple' in the second column require the algorithm to handle multiple $\pd$ (for a given $\dbbase$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_0,C_0,c_0$ are constants that are independent of $k$).}
\label{tab:lbs}
\end{table}
Note that the lower bound in the first row by itself is enough to refute \AHchange{the above question}. %\Cref{prob:informal}.
To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$, where $k$ is the largest degree of the polynomial $\apolyqdt$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries). What our lower bound in the third row says is that one cannot get more than a polynomial improvement over essentially the trivial algorithm that achieves this upper bound for \Cref{prob:bag-pdb-poly-expected}.\AH{Not sure what is meant by `the trivial algorithm for (what was originally called) Problem 1.4'}
%\footnote{
% We note similar hardness results for determinsitic query processing that apply lower bounds in terms of $\abs{\dbbase}$. Our lower bounds are in terms of $\qruntime{Q,\dbbase}$, which in general can be super-linear in $\abs{\dbbase}$.
%}
However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace the $\qruntime{\query, \dbbase}$ by just $\abs{\dbbase}$ (indeed, these results follow from known lower bounds for deterministic query processing). Our contribution is then to identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard. To put these hardness results in context, we will next take a short detour
\AH{Is this detour still necessary?}
to review the existing hardness results for \abbrPDB\xplural under set semantics.
% Atri: Converting sub-section to para since it saves space
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\mypar{Relationship to Set-Probabilistic Query Evaluation}
%%
%\Cref{prob:bag-pdb-query-eval} has been extensively studied in the context of \emph{set}-\abbrPDB\xplural. % , where each output tuple appears at most once.
%As mentioned before, under set semantics, $\apolyqdt\inparen{\vct{X}}$ is a propositional formula
%%Atri: If we get a reviewer who does not know what a propositional formula is then we are in trouble-- I did move some of the footnote text to the main part though
%%\footnote{To be precise, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula composed of boolean variables and the logical disjunction and conjunction connectives. Evaluating such a formula follows the standard semantics of the said operators on boolean variables ($\semB$-semiring semantics).}
%% whose evaluation follows the standard Boolean semi-ring semantics (i.e. addition is logical OR and multiplication is logical AND), denoting the presence or absence of $\tup$.
%and $\expct\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ is the marginal probability of $\tup$ appearing in the output. We note that the answer to \Cref{prob:informal} for set-sematics is also no. Dalvi and Suicu \cite{10.1145/1265530.1265571} showed that the (data) complexity of the query evaluation problem over set-\abbrTIDB\xplural is \sharpphard
%%OK: The former ---v
%%\AR{This is result for TIDBs for general set-PDBs?}
%%Atri: Again if we have a reviewer who does not know what \sharpp is then we are in trouble
%%\footnote{\sharpp is the counting version for problems residing in the NP complexity class.}
%in general.
%%, and proved that a dichotomy exists for this problem for the class of union of conjunctive queries (with the same expressive power as $\raPlus$), where the runtime of $\query(\pdb)$ is either polynomial or \sharpphard in data complexity. %for any polynomial-time deterministic query.
%%Thus, for the hard queries, the answer to~\Cref{prob:informal} is {\em no} for set-PDBs (under the standard complexity assumption that $\sharpp\ne \polytime$).
%We note that the \sharpphard lower bound is much stronger than what one can hope for in \abbrBPDB since, as mentioned earlier, for a fixed query one can always solve \Cref{prob:bag-pdb-query-eval} in polynomial time (for \abbrTIDB).
%
%%Concretely, easy queries in this dichotomy can be answered through so-called \emph{extensional} query evaluation, where probability computation is inlined into normal deterministic query processing.
%%This is possible, because queries on the easy side of the dichotomy can always be rewritten into a form that guarantees that, for every relational operator in the query, the presence of every tuple in the operator's output is governed by either a conjunction or disjunction of \emph{independent} events.
%%Atri: Removed the para above since the above does not seem to add much to the current intro flow.
%
%%Such a guarantee is not possible
%\AH{Do we need the two step model?}
%For queries on set-PDBs, the best known algorithmic approach is the \emph{intensional} query evaluation~\cite{DBLP:series/synthesis/2011Suciu}, where one explicitly computes the lineage formula and then its expectation --- we will come back this framework shortly. % as in \Cref{prob:bag-pdb-poly-expected}. % , a two step process that first computes the lineage of the query result --- a representation of $\Phi_\tup$ --- which it then uses to compute the desired probability.
%The complexity of this approach is, in general, dominated by computing the expectation $\expct\pbox{\apolyqdt(\vct{\randWorld})}$, a problem known to be \sharpphard~\cite{DS07}.
%BEGIN Needs to be said
%Since the hardness is in data complexity (the size of the input, $\Theta(\numvar$)), techniques such as parameterized complexity (bounding complexity by another parameter other than $\numvar$) and fine grained analysis (complexity analysis that asks what precisely is the value of this other parameter, for example, what is the value of $f(k)$ given a \sharpwone algorithm) of \abbrStepTwo will not refine the hardness results from \sharpphard.
%END NEeds to be said
%Atri: Again changing subsection below to para
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \mypar{Intensional Bag-Probabilistic Query Evaluation}
% However, there exist some queries for which \abbrBPDB\xplural are a more natural fit than set-\abbrPDB\xplural. One such query is the count query, where one might desire, for example, to compute the expected multiplicity ($\expct\pbox{\poly\inparen{\vct{\randWorld}}}$) of the result. This work focuses on computing $\expct\pbox{\poly\inparen{\vct{\randWorld}}}$ as a natural statistic to develop the theoretical foundations of \abbrBPDB complexity. Other statistical measures are beyond the scope of this paper, though we consider higher moments in the appendix.
%BEGIN Needs to be noted.
%As noted, bag-\abbrPDB query output is a probability distribution over the possible multiplicities of $\poly_\tup\inparen{\vct{X}}$, a stark contrast to the marginal probability %($\expct\pbox{\poly\inparen{\vct{X}}}$)
% paradigm of set-\abbrPDB\xplural. To address the question of whether or not bag-\abbrPDB\xplural are easy,
%END Needs to be noted.
% Atri: Removing stuff below as per conversation with Oliver on matrix on Aug 26
%A natural question is whether or not we can quantify the complexity of computing $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$ separately from the complexity of deterministic query evaluation, effectively dividing \abbrPDB query evaluation into two steps: deterministic query evaluation\footnote{Given input $\pdb$, this step includes outputting every tuple $\tup$ that satisfies $\query$, annotated with its lineage polynomial ($\poly_\tup$) which is computed inline across the query operators of $\query$.\cite{Imielinski1989IncompleteII}\cite{DBLP:conf/pods/GreenKT07}} and computing expectation. Viewing \abbrPDB query evaluation as these two seperate steps is also known as intensional evaluation \cite{DBLP:series/synthesis/2011Suciu}, illustrated in \Cref{fig:two-step}.
%The first step, which we will refer to as \termStepOne (\abbrStepOne), consists of computing both $\query\inparen{\db}$ and $\poly_\tup(\vct{X})$.\footnote{Assuming standard $\raPlus$ query processing algorithms, computing the lineage polynomial of $\tup$ is upperbounded by the runtime of deterministic query evaluation of $\tup$, as we show in \Cref{sec:circuit-runtime}.} The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly_\tup(\vct{\randWorld})}$. Such a model of computation is nicely followed in set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where $\poly_\tup\inparen{\vct{X}}$ must be computed separate from deterministic query evaluation to obtain exact output when $\query(\pdb)$ is hard since evaluating the probability inline with query operators (extensional evaluation) will only approximate the actual probability in such a case. The paradigm of \Cref{fig:two-step} is also analogous to semiring provenance, where $\semNX$-DB\footnote{An $\semNX$-DB is a database whose tuples are annotated with elements from the set of polynomials with variables in $\vct{X}$ and natural number coeficients and exponents.} query processing \cite{DBLP:conf/pods/GreenKT07} first computes the query and polynomial, and the $\semNX$-polynomial can then subsequently evaluated over a semantically appropriate semiring, e.g. $\semN$ to model bag semantics. Further, in this work, the intensional model lends itself nicely in separating the concerns of deterministic computation and the probability computation.
\mypar{Approximating the expected multiplicities}
%\AR{Have done my pass till here}
Our negative results indicate that \abbrBPDB{}s cannot achieve comparable performance to deterministic databases for exact results (under complexity assumptions). In fact, under plausible hardness conjectures, one cannot (drastically) improve upon the trivial algorithm to exactly compute the expected multiplicities for \abbrTIDB\xplural. A natural follow-up is whether we can do better if we are willing to settle for an approximation to the expected multiplicities.
In the remainder of this work, we demonstrate that a $(1\pm\epsilon)$ (multiplicative) approximation with competitive performance is achievable.
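To fix ideas, we spell out the standard meaning of such a guarantee (the symbol $\mathcal{E}$ for the returned estimate is notation introduced only here, and the precise guarantee, including any failure probability of a randomized algorithm, is deferred to the technical sections). An estimate $\mathcal{E}$ is a $(1\pm\epsilon)$ multiplicative approximation if
\begin{align*}
(1-\epsilon)\cdot\expct\pbox{\apolyqdt(\vct{W})} \;\le\; \mathcal{E} \;\le\; (1+\epsilon)\cdot\expct\pbox{\apolyqdt(\vct{W})}.
\end{align*}
For example, if the expected multiplicity is $0.6$ and $\epsilon = 0.1$, any estimate in the interval $[0.54, 0.66]$ is acceptable.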
\input{two-step-model}\AR{In \Cref{fig:two-step}, perhaps in caption give f/w pointer to definition of $Q$?}
We adopt the two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
(i) \termStepOne (\abbrStepOne): Given input $\dbbase$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial ($\poly(\vct{X})=\apolyqdt\inparen{\vct{X}}$);
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct\pbox{\poly(\vct{\randWorld})}$.
Let $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ denote the runtime of \abbrStepOne when it outputs $\circuit$ (which is a representation of $\poly$ as an arithmetic circuit --- more on this representation shortly).
Denote by $\timeOf{\abbrStepTwo}(\circuit)$ (recall $\circuit$ is the output of \abbrStepOne) the runtime of \abbrStepTwo, allowing us to formally define our objective:\AH{What if there are more efficient representations than circuits?}
\AHchange{
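Our next question is whether there exists a $\inparen{1\pm\epsilon}$-approximation algorithm whose runtime is linear in that of deterministic query processing, i.e., whether there is some $\circuit$ with $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit) + \timeOf{\abbrStepTwo}(\circuit) \le O_\epsilon(\qruntime{Q, \dbbase})$. If so, approximate query evaluation over bag \abbrPDB\xplural is competitive with deterministic query processing.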
}
%\sout{
%\begin{Problem}\label{prob:big-o-joint-steps}
%Given \abbrBPDB $\pdb$, $\raPlus$ query $\query$,
%is there a $(1\pm\epsilon)$-approximation of $\expct_{\db\sim\pd}\pbox{\query\inparen{\db}\inparen{\tup}}$ for all result tuples $\tup$ where
%$\exists \circuit : \timeOf{\abbrStepOne}(Q,\dbbase,\circuit) + \timeOf{\abbrStepTwo}(\circuit) \le O_\epsilon(\qruntime{Q, \dbbase})$?
%\end{Problem}
%Note that if the answer to the above problem is yes, then we have shown that the answer to \Cref{prob:informal} is yes (when we are interested in approximating the expected multiplicities).
%}
We show in \Cref{sec:circuit-depth} %{sec:gen}\AR{Refs needs to be updated}
%\OK{confirm this ref}
%Atri: fixed the ref
an $O(\qruntime{Q, \dbbase})$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or, more precisely, a single circuit $\circuit$ with one sink per tuple representing the tuple's lineage).
% , and by extension the first step is in \sharpwonehard\AH{\sharpwonehard is not defined.}.
A key insight of this paper is that the representation of $\circuit$ matters.
For example, if we insist that $\circuit$ represent the lineage polynomial in the standard monomial basis (henceforth, \abbrSMB)\footnote{
This is the representation typically used in set-\abbrPDB\xplural, where the polynomial is represented as a sum of `pure' products. See \Cref{def:smb} for a formal definition.
}, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$,
%\BG{should be $|\idb |$?},
%Atri: No, this is fine
and hence, just $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ will be too large.
However, systems can directly emit compact, factorized representations of $\poly(\vct{X})$ (e.g., as a consequence of the standard projection push-down optimization~\cite{DBLP:books/daglib/0020812}).
For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$.
Accordingly, this work uses (arithmetic) circuits\footnote{
An arithmetic circuit is a DAG with variable and/or numeric source nodes and internal nodes, each representing either an addition or a multiplication operator.
}
as the representation system of $\poly(\vct{X})$.
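To see why the choice of representation matters, consider the following illustrative family of polynomials (the variables $X_i, Y_i$ and the parameter $k$ are hypothetical and are not taken from \Cref{fig:two-step}):
\begin{align*}
\prod_{i=1}^{k}\inparen{X_i + Y_i} = \sum_{S\subseteq\{1,\ldots,k\}}\;\prod_{i\in S}X_i\prod_{i\notin S}Y_i.
\end{align*}
The factorized form on the left can be written as a circuit with $2k$ source nodes, $k$ addition nodes, and $k-1$ binary multiplication nodes, whereas the \abbrSMB form on the right has $2^k$ distinct monomials.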
% When $\poly(\vct{X})$ is in standard monomial basis (\abbrSMB)\footnote{A polynomial is in \abbrSMB when it is a sum of products of variables (a variable can occur more than once), where each product of variables is unique.}, by linearity of expectation and independence of \abbrTIDB, it follows that $\timeOf{\abbrStepTwo}(Q,\pdb)$ is $O(|\poly_\tup(\vct{X})|)$ and thus also $\bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$.
% \AH{Is this obvious enough for the typical reviewer to realize?}
% Recall that $\prob_i$ denotes the probability of tuple $\tup_i$ (i.e. $\probOf\pbox{W_i = 1}$) for $i \in [\numvar]$. Consider another special case when for all $i$ in $[\numvar]$, $\prob_i = 1$.
% % Replaced the stuff below with something more auccint
% %For output tuple $\tup'$ of $\query\inparen{\pdb}$, computing $\expct\pbox{\poly_{\tup'}\inparen{\vct{\randWorld}}}$ is linear in
% %$\abs{\poly_\tup}$
% %the size of the arithemetic circuit
% %, since we can essentially push expectation through multiplication of variables dependent on one another.\footnote{For example in this special case, computing $\expct\pbox{(X_iX_j + X_\ell X_k)^2}$ does not require product expansion, since we have that $p_i^h x_i^h = p_i \cdot 1^{h-1}x_i^h$.}
% In this case, we have for any output tuple $\tup$, $\expct\pbox{\poly(\vct{W})}=\Phi(1,\dots,1)$.
% Thus, we have another case where $\timeOf{\abbrStepTwo}(Q,\pdb)$ is $\bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$ and we again achieve deterministic query runtime for $\query\inparen{\pdb}$ (up to a constant factor). These observations introduce our first formalization of~\Cref{prob:informal}:
Given that there exists a representation $\circuit^*$ such that $\timeOf{\abbrStepOne}(\query,\dbbase,\circuit^*)\le O(\qruntime{\query, \dbbase})$, we can now focus on the complexity of \abbrStepTwo.
We can represent the factorized lineage polynomial by its corresponding arithmetic circuit $\circuit$ (whose size we denote by $|\circuit|$).
%\BG{This sentence didn't parse for me. What do we mean by representing a polynomial by a size?}
%Atri: fixed
As we show in \Cref{sec:circuit-runtime}, this size is also bounded by $\qruntime{Q, \dbbase}$ (i.e., $|\circuit^*| \le O(\qruntime{Q, \dbbase})$).
Thus, \AHchange{the question of approximation} %\Cref{prob:big-o-joint-steps}
can be reframed as:
%Atri: Replaced the text below by the above. I know I had talked about $|\circuit|^k$ but I think the stuff below breaks the flow a bit
%Re-stating our earlier observation, given a circuit \circuit, if \circuit is in \abbrSMB (i.e. every sink to source path has a prefix of addition nodes and the rest of the internal nodes are multiplication nodes), then we have that $\timeOf{\abbrStepTwo}(Q,\pdb)$ is indeed $\bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$. We note that \abbrSMB representations are produced by queries with a projection operation on top of a join operation.
% the form $\project, \project\inparen{\join},$ etc.
% Suppose, on the contrary, that \circuit is not in \abbrSMB and rather in some factorized form. Then to naively compute \abbrStepTwo, one needs to convert \circuit into \circuit' such that \circuit' is in \abbrSMB, and then compute $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$, which takes $\bigO{|\circuit|^k}$ time for the case that $k$ is the degree of the polynimial $\Phi_\tup(\vct{X})$. Since $|\circuit'|$ lies between $\bigO{|\circuit|}$ and $\bigO{|\circuit|^k}$, it behooves us to determine which of these extremes is true for the general \circuit. This leads us to the main problem statement of our paper:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}\label{prob:intro-stmt}
Given one circuit $\circuit$ that encodes $\apolyqdt$ for all result tuples $\tup$ (one sink per $\tup$) for \abbrBPDB $\pdb$ and $\raPlus$ query $\query$, does there exist an algorithm that computes a $(1\pm\epsilon)$-approximation of $\expct_{\db\sim\pd}\pbox{\query\inparen{\db}\inparen{\tup}}$ (for all result tuples $\tup$) in $\bigO{|\circuit|}$ time?
%\OK{This doesn't parse. What is $\bigO{\abbrStepOne}$? Should this be $\bigO{\poly}$?}
\end{Problem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%
%Contributions, Overview, Paper Organization
%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Our upper bound results} We show that the answer to \Cref{prob:intro-stmt} (and hence the answer to the approximation question posed above) is yes. In particular, we show the following upper bound results.
%In this paper we tackle~\Cref{prob:bag-pdb-query-eval} to~\Cref{prob:intro-stmt}.
%Concretely, we make the following contributions:
%(i) %Under fine grained hardness assumption,
%We show that the answer to~\Cref{prob:bag-pdb-query-eval} is \textit{no} in general for exact computation. %\cref{prob:intro-stmt} for bag-\abbrTIDB\xplural is not true in general
% \sharpwonehard in the size of the lineage circuit
%In fact, via a
%reduction from counting the number of $k$-matchings over an arbitrary graph, we show that the problem of \abbrStepTwo (\termStepTwo) is \sharpwonehard. I.e., not only is the answer to~\Cref{prob:intro-stmt} no, but \abbrStepTwo cannot be solved in fully polynomial time, i.e. there is no algorithm for \abbrStepTwo with runtime that grows as $f(k)\cdot |\circuit|^d$, where $k$ is the degree of the corresponding lineage polynomial and $d$ is any fixed constant.\footnote{We would like to note that it is a well-known result in deterministic query computation that \abbrStepOne is also \sharpwonehard. What our result says is that \abbrStepTwo is \sharpwonehard\emph{even if} we exclude the complexity of \abbrStepOne .}
%This hardness result requires the algorithm to be able to solve the hard query $Q$ for {\em multiple} PDBs. We further show that the answer to~\Cref{prob:intro-stmt} is no even if we fix the $\pd$ (in particular, we insist on $\prob_i = \prob$ for some $\prob$ in $(0, 1)$).
%Atri: The footnote above is where I talk about \sharpwonehard of det query complexity.
%We further note that in our hardness proofs, we have $|\circuit|=\Theta\inparen{\timeOf{\abbrStepOne}(Q,\pdb)}$, which shows that the answer to~\Cref{prob:bag-pdb-query-eval} is also no.\AR{Need to make sure we have the correct statement for this claim (i) in the main paper.}
%we further show superlinear hardness in the size of \circuit for a specific %cubic
%graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
%(ii) To complement our hardness results, we consider an approximate version of~\Cref{prob:intro-stmt}, where instead of computing the expected multiplicity exactly, we allow for an $(1\pm\epsilon)$-\emph{multiplicative} approximation of the expected multiplicitly.
(i) We show that %for typical database usage patterns\BG{Not sure what we mean by that?}
e.g. when the circuit is a tree
%or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}),
and there is a single
result tuple, %\BG{This sounds like we restricting the discussion to queries that return a single tuple}\AR{Does moving the footnote help?},
the answer to \Cref{prob:intro-stmt} for \abbrTIDB is {\em yes} (we can also handle the case of multiple result tuples\footnote{We can approximate the expected result tuple multiplicities (for all result tuples {\em simultaneously}) with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of result tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}).
Further, we show that for {\em any} $\raPlus$ query on a \abbrTIDB, the answer to the approximation question posed above is also yes.
% the approximation algorithm has runtime linear in the size of the compressed lineage encoding (
In contrast, known approximation techniques (\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}) in set-\abbrPDB\xplural need time $\Omega(\abs{\circuit}^{2k})$ (see \Cref{sec:karp-luby}).
%\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}.
%Atri: The footnote below does not add much
%\footnote{Note that this doesn't rule out queries for which approximation is linear});
(ii) We generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases (\abbrBIDB\xplural; see \Cref{subsec:tidbs-and-bidbs}).
%\AH{This point \emph{\Large seems} weird to me. I thought we just said that the approximation complexity is linear in step one, but now it's as if we're saying that it's $\log{\text{step one}} + $ the runtime of step one. Where am I missing it?}
%\OK{Atri's (and most theoretician's) statements about complexity always need to be suffixed with ``to within a log factor''}
%(iii) We finally observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\poly$. In fact, it turns out that for the \abbrTIDB (and \abbrBIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the \abbrTIDB/\abbrBIDB. Next, we motivate this reduced polynomial.
Consider the query $\query$ defined as follows over the bag relations of \Cref{fig:two-step}:
\begin{lstlisting}
SELECT 1 FROM OnTime a, Route r, OnTime b
WHERE a.city = r.city1 AND b.city = r.city2
\end{lstlisting}
%$Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$
It can be verified that $\poly\inparen{A, B, C, E, X, Y, Z}$ for the sole result tuple (i.e. the count) of $\query$ is $AXB + BYE + BZC$. Now consider the product query $\query^2 = \query \times \query$.
%\BG{$\query(\db)$ is a query result, so I changed it to this}
%Atri: Sounds good.
The lineage polynomial for $Q^2$ is given by $\poly^2\inparen{A, B, C, E, X, Y, Z}$:%\AR{Changed the variable $D$ to $E$ to avoid conflict with use of $D$ as a DB.}
$$%\begin{multline*}
%\inparen{AXB + BYE + BZC}^2\\
=A^2X^2B^2 + B^2Y^2E^2 + B^2Z^2C^2 + 2AXB^2YE + 2AXB^2ZC + 2B^2YEZC.
$$%\end{multline*}
By exploiting linearity of expectation, pushing the expectation through the independent \abbrTIDB variables, and observing that $\randWorld^2=\randWorld$ for any $\randWorld\in\{0, 1\}$, the expectation
%\AH{If we choose to use $\pd$ in \Cref{prob:bag-pdb-poly-expected}, then we either need to follow the same convention here OR introduce the notation $\pdassign$ before using it.}
$\expct\limits_{\vct{\randWorld}\sim\pdassign}\pbox{\poly^2\inparen{\vct{\randWorld}}}$ (where $\randWorld_A$ is the random variable corresponding to $A$, and $\vct{\randWorld}$ is distributed as $\pdassign$) simplifies to
%Atri: Combined the the first step below with the next one to save space.
%\begin{footnotesize}
%\begin{multline*}
%\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B^2} + \expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y^2}\expct\pbox{\randWorld_E^2} + \expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Z^2}\expct\pbox{\randWorld_C^2} + 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E}\\
%+ 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C}.
%\end{multline*}
%\end{footnotesize}
%\noindent Since for any $\randWorld\in\{0, 1\}$, we have $\randWorld^2=\randWorld$,
%then for any $k > 0$, $\expct\pbox{\randWorld^k} = \expct\pbox{\randWorld}$, which means that
%$\expct\limits_{\vct{\randWorld}\sim\pdassign}\pbox{\poly^2\inparen{\vct{\randWorld}}}$ simplifies to:
\begin{footnotesize}
\begin{multline*}
\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B} + \expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E} + \expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E} \\
+ 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C}.
\end{multline*}
\end{footnotesize}
\noindent This property leads us to consider a structure related to the lineage polynomial.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly(\vct{X})$ corresponding to a \abbrTIDB (henceforth, \abbrTIDB-lineage polynomial),
%\BG{Better introduce the notion of TIDB lin poly before here, then it iis more clear?},
%Atri: Done
define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the \abbrSMB form of $\poly(\vct{X})$ to $1$.
\end{Definition}
With $\poly^2\inparen{A, B, C, E, X, Y, Z}$ as an example, we have:
\begin{align*}
&\widetilde{\poly^2}(A, B, C, E, X, Y, Z) = AXB + BYE + BZC + 2AXBYE + 2AXBZC + 2BYEZC.
%&\; = AXB + BYD + BZC + 2AXBYD + 2AXBZC + 2BYDZC
\end{align*}
Note that we have argued that, for our specific example, the expectation that we want is $\widetilde{\poly^2}(\probOf\inparen{A=1},$ $\probOf\inparen{B=1}, \probOf\inparen{C=1}, \probOf\inparen{E=1}, \probOf\inparen{X=1}, \probOf\inparen{Y=1}, \probOf\inparen{Z=1})$.
%It can be verified that the reduced polynomial parameterized with each variable's respective marginal probability is a closed form of the expected count (i.e., $\expct\limits_{\vct{\randWorld}\sim\pd}\pbox{\Phi^2\inparen{\vct{X}}} = \widetilde{\Phi^2}(\probOf\pbox{A=1},$ $\probOf\pbox{B=1}, \probOf\pbox{C=1}), \probOf\pbox{D=1}, \probOf\pbox{X=1}, \probOf\pbox{Y=1}, \probOf\pbox{Z=1})$).
\Cref{lem:tidb-reduce-poly} generalizes the equivalence to {\em all} $\raPlus$ queries on \abbrTIDB\xplural (proof in \Cref{subsec:proof-exp-poly-rpoly}).
\begin{Lemma}\label{lem:tidb-reduce-poly}
Let $\pdb$ be a \abbrTIDB over $n$ input tuples
such that the probability distribution $\pdassign$ over $\vct{W}\in\{0,1\}^\numvar$ (the set of possible worlds) is induced by the probability vector $\probAllTup = \inparen{\prob_1,\ldots,\prob_\numvar}$ where $\prob_i=\probOf\inparen{W_i=1}$.
For any \abbrTIDB-lineage polynomial
%\BG{Term has not been introduced yet.}
%Atri: fixed
$\poly\inparen{\vct{X}}=\apolyqdt(\vct{X})$, it holds that $
\expct_{\vct{W} \sim \pdassign}\pbox{\poly\inparen{\vct{W}}} = \rpoly\inparen{\probAllTup}.
$
\end{Lemma}
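As a small sanity check of \Cref{lem:tidb-reduce-poly}, consider the polynomial $\inparen{X+Y}^2$ (a hypothetical two-variable example chosen only for illustration, e.g., the lineage of a self-product of a query whose lineage is $X+Y$), whose reduced form is $X + 2XY + Y$. Using $\randWorld_X^2=\randWorld_X$, $\randWorld_Y^2=\randWorld_Y$, and independence,
\begin{align*}
\expct\pbox{\inparen{\randWorld_X+\randWorld_Y}^2}
&= \expct\pbox{\randWorld_X^2} + 2\expct\pbox{\randWorld_X\randWorld_Y} + \expct\pbox{\randWorld_Y^2}
= \prob_X + 2\prob_X\prob_Y + \prob_Y.
\end{align*}
For instance, with $\prob_X=\prob_Y=\frac{1}{2}$ the right-hand side is $\frac{3}{2}$, which matches enumerating the four equally likely worlds: $\frac{1}{4}\inparen{0+1+1+4}=\frac{3}{2}$. In contrast, evaluating the unreduced polynomial at the same point gives $\inparen{\frac{1}{2}+\frac{1}{2}}^2=1$.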
To prove our hardness result we show that for the same $Q$ from the example above, for an arbitrary `product width' $k$, the query $Q^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $O(1)$ tuples in \Cref{fig:two-step}).
We do so by considering an arbitrary graph $G$ (analogous to the $Route$ relation of $\query$) and analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in $G$ that are isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the leading coefficient of $\widetilde{\poly}\left(p,\dots,p\right)$ for $\query^k$ is proportional to the number of $k$-matchings in $G$, whose computation is a known hard problem in the parameterized/fine-grained complexity literature.
For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then $\poly\left(\prob_1,\dots, \prob_n\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant factor approximation. For example, using $\query^2$ from above and writing $\prob_A$ for $\probOf\pbox{A = 1}$ (and similarly for the other variables), we can see that
\begin{footnotesize}
\begin{align*}
\hspace*{-3mm}
\poly^2\inparen{\probAllTup} &= \prob_A^2\prob_X^2\prob_B^2 + \prob_B^2\prob_Y^2\prob_E^2 + \prob_B^2\prob_Z^2\prob_C^2 + 2\prob_A\prob_X\prob_B^2\prob_Y\prob_E + 2\prob_A\prob_X\prob_B^2\prob_Z\prob_C + 2\prob_B^2\prob_Y\prob_E\prob_Z\prob_C\\
&\leq\prob_A\prob_X\prob_B + \prob_B\prob_Y\prob_E + \prob_B\prob_Z\prob_C +
2\prob_A\prob_X\prob_B\prob_Y\prob_E + 2\prob_A\prob_X\prob_B\prob_Z\prob_C + 2\prob_B\prob_Y\prob_E\prob_Z\prob_C
= \widetilde{\poly^2}\inparen{\probAllTup}.
%\inparen{0.9\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 0.5}^2 = 2.7225 < 3.45 = \rpoly^2\inparen{\probAllTup}
\end{align*}
\end{footnotesize}
If we assume that all seven probability values are at least $p_0>0$,
%Choose the least factor that is reduced in $\rpoly^2\inparen{\vct{X}}$, in this case $\prob_A\prob_X\prob_B$, and
we get that $\poly^2\inparen{\vct{\prob}}$ is in the range $[\inparen{p_0}^3\cdot\widetilde{\poly^2}\inparen{\vct{\prob}}, \widetilde{\poly^2}\inparen{\vct{\prob}}]$.
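The lower end of this range can be checked monomial by monomial; as a sketch for this example only, using $p_0\le\prob_i\le 1$ for each of the seven probabilities,
\begin{align*}
\prob_A^2\prob_X^2\prob_B^2 &= \inparen{\prob_A\prob_X\prob_B}\cdot\prob_A\prob_X\prob_B \;\ge\; \inparen{p_0}^3\cdot\prob_A\prob_X\prob_B,\\
2\prob_A\prob_X\prob_B^2\prob_Y\prob_E &= \prob_B\cdot 2\prob_A\prob_X\prob_B\prob_Y\prob_E \;\ge\; \inparen{p_0}^3\cdot 2\prob_A\prob_X\prob_B\prob_Y\prob_E,
\end{align*}
where the second line uses $\prob_B\ge p_0\ge\inparen{p_0}^3$. The remaining four monomials are handled in the same way.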
%
To get a $(1\pm \epsilon)$-multiplicative approximation, we uniformly sample monomials from the \abbrSMB representation of $\poly$ and `adjust' their contribution to $\widetilde{\poly}\left(\cdot\right)$.
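One way to read this sampling step, given here only as a hedged sketch with hypothetical notation and not as a statement of the algorithm in \Cref{sec:algo}, is the following. Write the full expansion of $\poly$ as a sum of $m$ (not necessarily distinct) monomials with unit coefficients, $\poly\inparen{\vct{X}}=\sum_{j=1}^{m}M_j\inparen{\vct{X}}$, so that $m=\poly\inparen{1,\ldots,1}$, and let $\mathrm{var}\inparen{M_j}$ denote the set of indices of the variables occurring in $M_j$. Then, for a \abbrTIDB,
\begin{align*}
\rpoly\inparen{\probAllTup}
= \sum_{j=1}^{m}\;\prod_{i\in\mathrm{var}\inparen{M_j}}\prob_i
= m\cdot\expct_{j}\pbox{\prod_{i\in\mathrm{var}\inparen{M_j}}\prob_i},
\end{align*}
where $j$ is drawn uniformly from $\{1,\ldots,m\}$. Under these assumptions, $m$ times the empirical mean of $\prod_{i\in\mathrm{var}\inparen{M_j}}\prob_i$ over independent uniform samples of $j$ is an unbiased estimate of $\rpoly\inparen{\probAllTup}$.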
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Applications}
Recent work in heuristic data cleaning~\cite{yang:2015:pvldb:lenses,DBLP:journals/vldb/SaRR0W0Z17,DBLP:journals/pvldb/RekatsinasCIR17,DBLP:journals/pvldb/BeskalesIG10} emits a \abbrPDB when insufficient data exists to select the `correct' data repair.
Probabilistic data cleaning is a crucial innovation, as the alternative is to arbitrarily select one repair and `hope' that queries receive meaningful results.
Although \abbrPDB queries instead convey the trustworthiness of results~\cite{kumari:2016:qdb:communicating}, they are impractically slow~\cite{feng:2019:sigmod:uncertainty,feng:2021:sigmod:efficient}, even in approximation (see \Cref{sec:karp-luby}).
Bags, as we consider, are sufficient for production use, where bag-relational algebra is already the default for performance reasons.
Our results show that bag-\abbrPDB\xplural can be competitive, laying the groundwork for probabilistic functionality in production database engines.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. %We present some (easy) generalizations of our results in \Cref{sec:gen}.
%and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem
%\AH{I don't think I understand what the sentence (about extensions) is saying.}
% (\Cref{def:the-expected-multipl}).
Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}. All proofs are in the appendix.
%No reviewer comments in arxiv submission.
%Our responses to ICDT first cycle reviewer comments are in \Cref{sec:rebuttal}. % the appendix.\AR{Would be good to have a specific app ref to rebuttal}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: