Boris Glavic 2021-09-17 12:39:58 -05:00
parent 8d8e369962
commit 51fcabfe5a
2 changed files with 44 additions and 37 deletions


@ -2,17 +2,24 @@
%!TEX root=./main.tex
\begin{abstract}
% The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) can be reduced to calculating the probability of the \emph{lineage formula} of the result, a Boolean formula over random variables representing the existence of tuples in the database's possible worlds.
The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) is arguably the most fundamental problem in set-PDBs.
The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) is a % arguably the most
fundamental problem in set-PDBs.
%can be reduced to calculating the probability of the \emph{lineage formula} of the result, a Boolean formula over random variables representing the existence of tuples in the database's possible worlds.
%The analog for bag semantics is a natural number-valued polynomial over random variables that evaluates to the multiplicity of the tuple in each world.
The analog for bag semantics is computing the expected multiplicity of a result tuple.
% The analog for bag semantics is computing the expected multiplicity of a result tuple.
%In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected multiplicity) exactly and approximately.
In this work, we study the problem of a tuple's expected multiplicity exactly and approximately.
We are specifically interested in the fine-grained complexity of this problem relative to the complexity of deterministic query evaluation --- if these complexities are comparable, it opens the door to practical deployment of probabilistic databases.
Unfortunately, we show the reverse; our results imply that computing expected multiplicities for Bag-PDB based on the results produced by such algorithms introduces super-linear overhead.
In this work, we study the analogous problem for bag semantics: computing a tuple's expected multiplicity exactly and approximately.
% Specifically, we are interested in the fine-grained complexity of computing this type of expectation based on a query result tuple's lineage polynomial which encodes how the tuple's multiplicity is computed based on the multiplicity of input tuples.
% Furthermore, we study how the complexity of this problem compares to
We are specifically
interested in the fine-grained complexity and how it compares to the complexity of deterministic query evaluation algorithms --- if these complexities are comparable, it opens the door to practical deployment of probabilistic databases.
Unfortunately, % we show the reverse;
our results imply that computing expected multiplicities for Bag-PDBs based on the results produced by such query evaluation algorithms introduces super-linear overhead.
% Such factorized representations are necessary to realize the performance of modern join algorithms (e.g., worst-case optimal joins), and so our results imply that a Bag-PDB doing exact computations (via these factorized representations) can never be as fast as a classical (deterministic) database.
The problem stays hard even if all input tuples have a fixed probability $\prob$ (s.t. $\prob \in (0,1)$).
We proceed to study polynomials of result tuples of positive relational algebra queries ($\raPlus$) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs).
% The problem stays hard even if
This is the case even if
all input tuples have a fixed probability $\prob$ (s.t. $\prob \in (0,1)$).\BG{Replace with this because notion of hardness unclear: This is the case even if \ldots}
We proceed to study how to approximate expected multiplicities using lineage polynomials of result tuples of positive relational algebra queries ($\raPlus$) over TIDBs and over a non-trivial subclass of block-independent databases (BIDBs).
We develop a sampling algorithm that computes a $1 \pm \epsilon$-approximation of the expected multiplicity of an output tuple in linear time in the runtime of a comparable deterministic query.
% By removing Bag-PDB's reliance on the sum-of-products representation of polynomials, this result paves the way for future work on PDBs that are competitive with deterministic databases.
\end{abstract}


@ -2,7 +2,7 @@
%root: main.tex
\section{Introduction}\label{sec:intro}
A probabilistic database (PDB) $\pdb$ is a pair $\inparen{\idb, \pd}$, where $\idb$ is a set of deterministic database instances called possible worlds and $\pd$ is a probability distribution over $\idb$.
A commonly studied problem in probabilistic databases is, given a query $\query$, PDB $\pdb$, and possible query result tuple $\tup$, to compute the tuple's \textit{marginal probability} of being in the query's result, i.e., computing the expectation of a Boolean random variable over $\pd$ that is $1$ for every $\db \in \idb$ for which $\tup \in \query(\db)$ and $0$ otherwise.
A commonly studied problem in probabilistic databases is, given a query $\query$, PDB $\pdb$, and possible query result tuple $\tup$, to compute the tuple's \textit{marginal probability} of being in the query's result, i.e., computing the expectation of a Boolean random variable over $\pd$ that is $1$ for every $\db \in \idb$ for which $\tup \in \query(\db)$ and $0$ otherwise.
In this work, we are interested in bag semantics, where each tuple is associated with a multiplicity.
Following~\cite{DBLP:conf/pods/GreenKT07}, we model bag databases (resp., relations) as functions from each $\tup$ to the tuple's multiplicity $\db(\tup) \in \semN$ in a possible world $\db$.
We refer to such a probabilistic database as a bag-probabilistic database or \abbrBPDB for short.
@ -29,7 +29,7 @@ A common encoding of probabilistic databases (e.g., in \cite{IL84a,Imielinski198
%Each valuation of the random variables appearing in this formula corresponds to one possible world.
%Given a joint probability distribution over such assignments, the marginal probability of a query result tuple $\tup$ is the probability that the lineage formula of $\tup$ evaluates to true. Given a \abbrBPDB $\pdb$, we refer to the above encoding of $\pdb$ as \dbbaseName and denote it as $\dbbase$.
%
The bag semantics analog is a provenance/lineage polynomial $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07} (see~\Cref{fig:nxDBSemantics} for a definition), a polynomial with integer coefficients and exponents, over integer variables $\vct{X}$ encoding input tuple multiplicities.
The bag semantics analog is a provenance/lineage polynomial $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07} (see~\Cref{fig:nxDBSemantics} for a definition), a polynomial with integer coefficients and exponents, over integer variables $\vct{X}$ encoding input tuple multiplicities.
\begin{figure}
\begin{align*}
\polyqdt{\project_A(\query)}{\dbbase}{\tup} =& \sum_{\tup': \project_A(\tup') = \tup} \polyqdt{\query}{\dbbase}{\tup'} &
@ -70,7 +70,7 @@ The bag semantics analog is a provenance/lineage polynomial $\apolyqdt$~\cite{DB
\end{figure}
%Analog to set-semantics, computing the expected multiplicity of a tuple reduces to computing the expectation of this polynomial.
%Analog to set-semantics, computing the expected multiplicity of a tuple reduces to computing the expectation of this polynomial.
We drop $\query$, $\dbbase$, and $\tup$ from $\apolyqdt$ when they are clear from the context or irrelevant to the discussion. We now re-state~\Cref
{prob:bag-pdb-query-eval} in the language of lineage polynomials:
@ -80,8 +80,8 @@ Given an $\raPlus$ query $\query$, \abbrBPDB $\pdb$, and result tuple $\tup$, co
multiplicity of the polynomial $\apolyqdt$ (i.e., $\expct_{\vct{W}\sim \pdassign}\pd\pbox{\apolyqdt(\vct{W})}$),
where $\pdassign$ is the distribution induced by $\pd$ on the relevant assignments $\vct{W}$ to variables of $\apolyqdt$.
\end{Problem}
We note that \Cref{prob:bag-pdb-query-eval} is equivalent to \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for several models of probabilistic databases and various encodings of such polynomials.
We note that \Cref{prob:bag-pdb-query-eval} is equivalent to \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for several models of probabilistic databases and various encodings of such polynomials.
\mypar{\abbrTIDB\xplural}
We initially focus on tuple-independent probabilistic bag-databases\footnote{See \cite{DBLP:series/synthesis/2011Suciu} for a survey of set-\abbrTIDBs; the bag encoding is analogous~\cite{DBLP:conf/pods/GreenKT07}.} (\abbrTIDB), a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world is modeled as an independent probabilistic event.\footnote{
This model is exactly the definition of \abbrTIDB{}s \cite{VS17} under classical set semantics.
@ -92,8 +92,8 @@ We initially focus on tuple-independent probabilistic bag-databases\footnote{See
}
% OK: I tidied things up a touch.
%\BG{The footnote is still a bit hard to follow I think, but I do not have a great suggestion on how to improve it.}
We will denote $\dbbase=(t_1,\dots,t_\numvar)$. Each of the $2^n$ possible worlds in $\Omega$ can be encoded as a string in $\{0,1\}^\numvar$. In particular, any vector $\vct{W}=\inparen{W_1,\dots,W_n}\in \{0,1\}^\numvar$ represents a world $\db\in\idb$ in the natural way, i.e., $\tup_i\in\db$
iff $W_i=1$. Furthermore, $\pd$ is compactly described by a tuple $\vct{p}=\inparen{p_1,\dots,p_n}$, which induces the Bernoulli distribution over vectors $\vct{W}\in\{0,1\}^\numvar$ where, for each $i\in [n]$, $\probOf(W_i=1)=p_i$.
We will denote $\dbbase=(t_1,\dots,t_\numvar)$. Each of the $2^n$ possible worlds in $\Omega$ can be encoded as a string in $\{0,1\}^\numvar$. In particular, any vector $\vct{W}=\inparen{W_1,\dots,W_n}\in \{0,1\}^\numvar$ represents a world $\db\in\idb$ in the natural way, i.e., $\tup_i\in\db$
iff $W_i=1$. Furthermore, $\pd$ is compactly described by a tuple $\vct{p}=\inparen{p_1,\dots,p_n}$, which induces the Bernoulli distribution over vectors $\vct{W}\in\{0,1\}^\numvar$ where, for each $i\in [n]$, $\probOf(W_i=1)=p_i$.
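As a small illustration of this encoding (the symbol $\vct{w}$ for a fixed world vector is introduced only for this example), independence of the $W_i$ means that the probability of any fixed world factorizes:
\begin{align*}
\probOf\pbox{\vct{W}=\vct{w}} = \prod_{i=1}^{\numvar} p_i^{w_i}\inparen{1-p_i}^{1-w_i} \qquad \text{for } \vct{w}\in\{0,1\}^\numvar.
\end{align*}
For instance, with $\numvar=2$ and $\vct{p}=\inparen{0.5,0.2}$, the world that contains only the first tuple has probability $0.5\cdot\inparen{1-0.2}=0.4$.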
%Finally for each $\vct{W}\in\{0,1\}^\numvar$, we define $\pdb_{\vct{W}}$
%\AH{Where do we use this notation? If we use this somewhere, should we maybe use $\db_{\vct{\randWorld}}$ instead?}
@ -116,21 +116,21 @@ Thanks to linearity of expectation, simple polynomial-time algorithms (for fixed
% The algo is trivial so I think putting in a 2010 cite seems like bit too much
%\cite{kennedy:2010:icde:pip})
% for computing exact results for bag-probabilistic count queries $Q$ over \abbrTIDB{}s.
However, since we are considering data complexity, it is also known that {\em deterministic} query processing for the same query $Q$ (over the deterministic database instance $\dbbase$) can also be done in polynomial time.
If our notion of efficiency were simply achieving a polynomial time algorithm, then we would be done.
However, in practice (and in theory), we care about the {\em fine-grained} complexity of deterministic query processing (i.e., we care about the exact exponent in our polynomial runtime).
However, since we are considering data complexity, it is also known that {\em deterministic} query processing for the same query $Q$ (over the deterministic database instance $\dbbase$) can also be done in polynomial time.
If our notion of efficiency were simply achieving a polynomial time algorithm, then we would be done.
However, in practice (and in theory), we care about the {\em fine-grained} complexity of deterministic query processing (i.e., we care about the exact exponent in our polynomial runtime).
For \abbrBPDB $\pdb$ and query $Q$, let $\timeOf{}^*(Q,\pdb)$ denote the (optimal) runtime complexity of \Cref{prob:bag-pdb-query-eval} (over all result tuples $\tup$).\AR{Am changing these runtime definitions to include the runtime for all result tuples $\tup$.}
Denote by $\qruntime{Q, \db}$ the `runtime' of query $Q$ on deterministic database $\db$ under a cost model that is satisfied by a wide range of query processing algorithms, including those based on the recent work on worst-case optimal join algorithms (we make this runtime concrete in \Cref{sec:gen}\AR{We need to move the definition of $\qruntime{}$ to \Cref{sec:background} because among others we now need it in our lower bound arguments as well.}).
%Denoting by $\dbbase = \bigcup_{\db \in \idb} \db$ the set of all possible tuples in \abbrPDB $\pdb = \inparen{\idb, \pd}$,
Denote by $\qruntime{Q, \db}$ the `runtime' of query $Q$ on deterministic database $\db$ under a cost model that is satisfied by a wide range of query processing algorithms, including those based on the recent work on worst-case optimal join algorithms (we make this runtime concrete in \Cref{sec:gen}\AR{We need to move the definition of $\qruntime{}$ to \Cref{sec:background} because among others we now need it in our lower bound arguments as well.}).
%Denoting by $\dbbase = \bigcup_{\db \in \idb} \db$ the set of all possible tuples in \abbrPDB $\pdb = \inparen{\idb, \pd}$,
We finally have all the pieces to state a formal specification of our problem:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}\label{prob:informal}
Given an $\raPlus$ query $\query$ and \abbrTIDB
% OK: added motivation
%\AR{Changed this to \abbrTIDB: we should motivate why we are restricting ourselves to this special case here.}
%\AR{Changed this to \abbrTIDB: we should motivate why we are restricting ourselves to this special case here.}
\abbrBPDB $\pdb$, is it the case that $\timeOf{}^*(Q,\pdb) \le O(\qruntime{Q, \dbbase})$?
\end{Problem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -141,7 +141,7 @@ Given an $\raPlus$ query $\query$ and \abbrTIDB
%The problem of deterministic query evaluation is known to be \sharpwonehard\footnote{A problem is in \sharpwone if the runtime of the most efficient known algorithm to solve it is lower bounded by some function $f$ of a parameter $k$, where the growth in runtime is polynomially dependent on $f(k)$, i.e. $\Omega\inparen{\numvar^{f(k)}}$.} in data complexity for general $\query$. For example, the counting $k$-cliques query problem (where the parameter $k$ is the size of the clique) is \sharpwonehard since (under standard complexity assumptions) it cannot run in time faster than $n^{f(k)}$ for some strictly increasing $f(k)$.
%In this paper, we begin to explore whether the problem of bag-probabilistic query evaluation (which we relate to deterministic query processing more precisely below) falls into this same complexity class.
We note that the above is a special case of \Cref{prob:bag-pdb-query-eval} since we are asking whether the runtime of query evaluation over \abbrBPDB is {\em linear} in the runtime of deterministic query processing.
We note that the above is a special case of \Cref{prob:bag-pdb-query-eval} since we are asking whether the runtime of query evaluation over \abbrBPDB is {\em linear} in the runtime of deterministic query processing.
We stress that this question is very well motivated, even for one of the simplest models of probabilistic databases (i.e., \abbrTIDBs): An answer in the affirmative for~\Cref{prob:informal} indicates that bag-probabilistic databases can be competitive with deterministic databases, opening the door for deployment in practice.
\mypar{Our lower bound results} Unfortunately, we prove the negative. In fact in Table~\ref{tab:lbs}\AR{Cref was not formatting Table correct so added Table in explicitly.} we show that depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to \Cref{prob:informal}.
@ -151,9 +151,9 @@ We stress that this question is very well motivated, even for one of the simples
Lower bound on $\timeOf{}^*(\query,\pdb)$ & Num. $\pd$s & Hardness Assumption\\
\hline
$\Omega\inparen{\inparen{\qruntime{\query, \dbbase}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
\hline
%\hline
$\omega\inparen{\inparen{\qruntime{\query, \dbbase}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
\hline
%\hline
$\Omega\inparen{\inparen{\qruntime{\query, \dbbase}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & Current $k$-matching algorithms\\
\hline
\end{tabular}
@ -177,21 +177,21 @@ As mentioned before, under set semantics, $\apolyqdt\inparen{\vct{X}}$ is a prop
%Atri: If we get a reviewer who does not know what a propositional formula is then we are in trouble-- I did move some of the footnote text to the main part though
%\footnote{To be precise, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula composed of boolean variables and the logical disjunction and conjunction connectives. Evaluating such a formula follows the standard semantics of the said operators on boolean variables ($\semB$-semiring semantics).}
% whose evaluation follows the standard Boolean semi-ring semantics (i.e. addition is logical OR and multiplication is logical AND), denoting the presence or absence of $\tup$.
and $\expct\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ is the marginal probability of $\tup$ appearing in the output. We note that the answer to \Cref{prob:informal} for set semantics is also no. Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the (data) complexity of the query evaluation problem over set-\abbrTIDB\xplural is \sharpphard
and $\expct\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ is the marginal probability of $\tup$ appearing in the output. We note that the answer to \Cref{prob:informal} for set semantics is also no. Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the (data) complexity of the query evaluation problem over set-\abbrTIDB\xplural is \sharpphard
%OK: The former ---v
%\AR{This is result for TIDBs for general set-PDBs?}
%Atri: Again if we have a reviewer who does not know what \sharpp is then we are in trouble
%\footnote{\sharpp is the counting version for problems residing in the NP complexity class.}
in general.
%, and proved that a dichotomy exists for this problem for the class of union of conjunctive queries (with the same expressive power as $\raPlus$), where the runtime of $\query(\pdb)$ is either polynomial or \sharpphard in data complexity. %for any polynomial-time deterministic query.
%Thus, for the hard queries, the answer to~\Cref{prob:informal} is {\em no} for set-PDBs (under the standard complexity assumption that $\sharpp\ne \polytime$).
%Thus, for the hard queries, the answer to~\Cref{prob:informal} is {\em no} for set-PDBs (under the standard complexity assumption that $\sharpp\ne \polytime$).
We note that the \sharpphard lower bound is much stronger than what one can hope for in \abbrBPDB since, as mentioned earlier, for a fixed query one can always solve \Cref{prob:bag-pdb-query-eval} in polynomial time.
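To illustrate why polynomial time suffices for a fixed query over a \abbrTIDB, consider the following sketch. It assumes that the lineage polynomial is written out as a sum of monomials; the index sets $S_j$, coefficients $c_j$, and exponents $d_{j,i}\ge 1$ are notation introduced only for this example. Linearity of expectation, independence of the $W_i$, and the fact that $W_i^{d}=W_i$ for $W_i\in\{0,1\}$ and $d\ge 1$ give
\begin{align*}
\expct\pbox{\sum_{j} c_j \prod_{i\in S_j} W_i^{d_{j,i}}} = \sum_{j} c_j \prod_{i\in S_j}\expct\pbox{W_i^{d_{j,i}}} = \sum_{j} c_j \prod_{i\in S_j} p_i .
\end{align*}
For a fixed query the degree of the polynomial is bounded by a constant, so the number of monomials, and hence the cost of evaluating the right-hand side, is polynomial in $\numvar$.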
%Concretely, easy queries in this dichotomy can be answered through so-called \emph{extensional} query evaluation, where probability computation is inlined into normal deterministic query processing.
%This is possible, because queries on the easy side of the dichotomy can always be rewritten into a form that guarantees that, for every relational operator in the query, the presence of every tuple in the operator's output is governed by either a conjunction or disjunction of \emph{independent} events.
%Atri: Removed the para above since the above does not seem to add much to the current intro flow.
%Such a guarantee is not possible
%Such a guarantee is not possible
For queries on the hard side of the dichotomy, the best known algorithmic approach is \emph{intensional} query evaluation~\cite{DBLP:series/synthesis/2011Suciu}, where one explicitly computes the lineage polynomial and then its expectation --- we will come back to this framework shortly. % as in \Cref{prob:bag-pdb-poly-expected}. % , a two step process that first computes the lineage of the query result --- a representation of $\Phi_\tup$ --- which it then uses to compute the desired probability.
%The complexity of this approach is, in general, dominated by computing the expectation $\expct\pbox{\apolyqdt(\vct{\randWorld})}$, a problem known to be \sharpphard~\cite{DS07}.
@ -223,14 +223,14 @@ In the remainder of this work, we demonstrate that a $(1\pm\epsilon)$ (multiplic
\input{two-step-model}
Like set-probabilistic databases, our approach adopts the two-step intensional model of query evaluation, as illustrated in \Cref{fig:two-step}:
(i) \termStepOne (\abbrStepOne): Given input $\dbbase$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial ($\poly(\vct{X})=\apolyqdt\inparen{\vct{X}}$);
(i) \termStepOne (\abbrStepOne): Given input $\dbbase$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial ($\poly(\vct{X})=\apolyqdt\inparen{\vct{X}}$);
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct\pbox{\poly(\vct{\randWorld})}$.
Let $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ denote the runtime of \abbrStepOne when it outputs $\circuit$ (which is a representation of $\poly$ --- more on this representation shortly).
Respectively denote by $\timeOf{\abbrStepTwo}(\circuit)$ (recall $\circuit$ is the output of \abbrStepOne) the runtime of \abbrStepTwo, allowing us to formally define our objective:
\begin{Problem}\label{prob:big-o-joint-steps}
Given \abbrBPDB $\pdb$, $\raPlus$ query $\query$,
Given \abbrBPDB $\pdb$, $\raPlus$ query $\query$,
is there a $(1\pm\epsilon)$-approximation of $\expct_{\db\sim\pd}\pbox{\query\inparen{\db}\inparen{\tup}}$ for all result tuples $\tup$ where
$\exists \circuit : \timeOf{\abbrStepOne}(Q,\dbbase,\circuit) + \timeOf{\abbrStepTwo}(\circuit) \le O(\qruntime{Q, \dbbase})$?
\end{Problem}
@ -241,13 +241,13 @@ We show in \Cref{sec:gen}
%Atri: fixed the ref
an $O(\qruntime{Q, \dbbase})$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more precisely, a single $\circuit$ with one sink per tuple representing the lineage).
% , and by extension the first step is in \sharpwonehard\AH{\sharpwonehard is not defined.}.
A key insight of this paper is that the representation of $\circuit$ matters.
A key insight of this paper is that the representation of $\circuit$ matters.
For example, if we insist that $\circuit$ represent the lineage polynomial in the standard monomial basis (henceforth, \abbrSMB)\footnote{
This is the representation typically used in set-\abbrPDB\xplural, where the polynomial is represented as a sum of `pure' products. See \Cref{def:smb} for a formal definition.
}, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$, and hence, just $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ will be too large.
However, systems can directly emit compact, factorized representations of $\poly(\vct{X})$ (e.g., as a consequence of the standard projection push-down optimization~\cite{DBLP:books/daglib/0020812}).
For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$.
For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$.
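As a further illustration of the gap between the two representations (this product is a hypothetical example rather than one of the queries considered here, and the variables $X_i$, $Y_i$ are introduced only for this purpose), consider
\begin{align*}
\prod_{i=1}^{k}\inparen{X_i+Y_i} = \sum_{S\subseteq [k]}\; \prod_{i\in S} X_i \prod_{i\in [k]\setminus S} Y_i .
\end{align*}
The factorized form on the left has $2k$ variable occurrences, whereas the \abbrSMB form on the right has $2^k$ monomials, each of degree $k$.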
Accordingly, this work uses (arithmetic) circuits\footnote{
An arithmetic circuit is a DAG with variable and/or numeric source nodes and internal nodes, each representing either an addition or multiplication operator.
}
@ -298,14 +298,14 @@ Given one circuit $\circuit$ that encodes $\apolyqdt$ for all result tuples $\tu
%We further note that in our hardness proofs, we have $|\circuit|=\Theta\inparen{\timeOf{\abbrStepOne}(Q,\pdb)}$, which shows that the answer to~\Cref{prob:bag-pdb-query-eval} is also no.\AR{Need to make sure we have the correct statement for this claim (i) in the main paper.}
%we further show superlinear hardness in the size of \circuit for a specific %cubic
%graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
%(ii) To complement our hardness results, we consider an approximate version of~\Cref{prob:intro-stmt}, where instead of computing the expected multiplicity exactly, we allow for an $(1\pm\epsilon)$-\emph{multiplicative} approximation of the expected multiplicitly.
%(ii) To complement our hardness results, we consider an approximate version of~\Cref{prob:intro-stmt}, where instead of computing the expected multiplicity exactly, we allow for an $(1\pm\epsilon)$-\emph{multiplicative} approximation of the expected multiplicitly.
(i) We show that for typical database usage patterns (e.g., when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}), where there is a single result tuple, the answer to \Cref{prob:intro-stmt} for \abbrTIDB is {\em yes}.\footnote{We can approximate the expected result tuple multiplicities (for all result tuples {\em simultaneously}) with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of result tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}
% the approximation algorithm has runtime linear in the size of the compressed lineage encoding (
In contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic in the size of the compressed lineage encoding~\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}.
%Atri: The footnote below does not add much
%\footnote{Note that this doesn't rule out queries for which approximation is linear});
(ii) We generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases (see \Cref{subsec:tidbs-and-bidbs}) (\abbrBIDB\xplural);
(ii) We generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases (see \Cref{subsec:tidbs-and-bidbs}) (\abbrBIDB\xplural);
%\AH{This point \emph{\Large seems} weird to me. I thought we just said that the approximation complexity is linear in step one, but now it's as if we're saying that it's $\log{\text{step one}} + $ the runtime of step one. Where am I missing it?}
%\OK{Atri's (and most theoretician's) statements about complexity always need to be suffixed with ``to within a log factor''}
(iii) We finally observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
@ -324,7 +324,7 @@ $$%\begin{multline*}
%\inparen{AXB + BYE + BZC}^2\\
=A^2X^2B^2 + B^2Y^2E^2 + B^2Z^2C^2 + 2AXB^2YE + 2AXB^2ZC + 2B^2YEZC.
$$%\end{multline*}
By exploiting linearity of expectation, further pushing expectation through independent \abbrTIDB variables and observing that for any $\randWorld\in\{0, 1\}$, we have $\randWorld^2=\randWorld$, the expectation
By exploiting linearity of expectation, further pushing expectation through independent \abbrTIDB variables and observing that for any $\randWorld\in\{0, 1\}$, we have $\randWorld^2=\randWorld$, the expectation
\AH{If we choose to use $\pd$ in \Cref{prob:bag-pdb-poly-expected}, then we either need to follow the same convention here OR introduce the notation $\pdassign$ before using it.}
$\expct\limits_{\vct{\randWorld}\sim\pdassign}\pbox{\poly^2\inparen{\vct{\randWorld}}}$ (Where $\randWorld_A$ is the random variable corresponding to $A$, distributed as $\pdassign$).
@ -364,8 +364,8 @@ For any \abbrTIDB-lineage polynomial $\poly\inparen{\vct{X}}=\apolyqdt(\vct{X})$
$
\end{Lemma}
To prove our hardness result we show that for the same $Q$ from the example above, for an arbitrary `product width' $k$, the query $Q^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $O(1)$ tuples in \Cref{fig:two-step}).
We do so by considering an arbitrary graph $G$ (analogous to the $Route$ relation of $\query$) and analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in $G$ that are isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the leading coefficient in $\poly$ corresponding to $\query^k$ is proportional to the number of $k$-matchings in $G$, a known hard problem in parameterized/fine-grained complexity literature.
To prove our hardness result we show that for the same $Q$ from the example above, for an arbitrary `product width' $k$, the query $Q^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $O(1)$ tuples in \Cref{fig:two-step}).
We do so by considering an arbitrary graph $G$ (analogous to the $Route$ relation of $\query$) and analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in $G$ that are isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the leading coefficient in $\poly$ corresponding to $\query^k$ is proportional to the number of $k$-matchings in $G$, a known hard problem in parameterized/fine-grained complexity literature.
For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then $\poly\left(\prob_1,\dots, \prob_n\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant factor approximation. For example, using $\query^2$ from above, using $\prob_A$ to denote $\probOf\pbox{A = 1}$ (and similarly for the other variables), we can see that
\begin{footnotesize}
@ -373,7 +373,7 @@ For an upper bound on approximating the expected count, it is easy to check that
\hspace*{-3mm}
\poly^2\inparen{\probAllTup} &= \prob_A^2\prob_X^2\prob_B^2 + \prob_B^2\prob_Y^2\prob_E^2 + \prob_B^2\prob_Z^2\prob_C^2 + 2\prob_A\prob_X\prob_B^2\prob_Y\prob_E + 2\prob_A\prob_X\prob_B^2\prob_Z\prob_C + 2\prob_B^2\prob_Y\prob_E\prob_Z\prob_C\\
&\leq\prob_A\prob_X\prob_B + \prob_B\prob_Y\prob_E + \prob_B\prob_Z\prob_C +
2\prob_A\prob_X\prob_B\prob_Y\prob_E + 2\prob_A\prob_X\prob_B\prob_Z\prob_C + 2\prob_B\prob_Y\prob_E\prob_Z\prob_C
2\prob_A\prob_X\prob_B\prob_Y\prob_E + 2\prob_A\prob_X\prob_B\prob_Z\prob_C + 2\prob_B\prob_Y\prob_E\prob_Z\prob_C
= \rpoly\inparen{\vct{p}}
%\inparen{0.9\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 0.5}^2 = 2.7225 < 3.45 = \rpoly^2\inparen{\probAllTup}
\end{align*}
@ -388,7 +388,7 @@ To get an $(1\pm \epsilon)$-multiplicative approximation we uniformly sample mon
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen}.
%and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem
%\AH{I don't think I understand what the sentence (about extensions) is saying.}
% (\Cref{def:the-expected-multipl}).
% (\Cref{def:the-expected-multipl}).
Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}. All proofs are in the appendix.