diff --git a/abstract.tex b/abstract.tex index f25d56b..0064395 100644 --- a/abstract.tex +++ b/abstract.tex @@ -2,17 +2,24 @@ %!TEX root=./main.tex \begin{abstract} % The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) can be reduced to calculating the probability of the \emph{lineage formula} of the result, a Boolean formula over random variables representing the existence of tuples in the database's possible worlds. - The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) is arguably the most fundamental problem in set-PDBs. + The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) is a % arguably the most + fundamental problem in set-PDBs. %can be reduced to calculating the probability of the \emph{lineage formula} of the result, a Boolean formula over random variables representing the existence of tuples in the database's possible worlds. %The analog for bag semantics is a natural number-valued polynomial over random variables that evaluates to the multiplicity of the tuple in each world. - The analog for bag semantics is computing the expected multiplicity of a result tuple. + % The analog for bag semantics is computing the expected multiplicity of a result tuple. %In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected multiplicity) exactly and approximately. - In this work, we study the problem of a tuple's expected multiplicity exactly and approximately. - We are specifically interested in the fine-grained complexity of this problem relative to the complexity of deterministic query evaluation --- if these complexities are comparable, it opens the door to practical deployment of probabilistic databases. 
- Unfortunately, we show the reverse; our results imply that computing expected multiplicities for Bag-PDB based on the results produced by such algorithms introduces super-linear overhead. + In this work, we study the analogous problem for bag semantics: computing a tuple's expected multiplicity exactly and approximately. +% Specifically, we are interested in the fine-grained complexity of computing this type of expectation based on a query result tuple's lineage polynomial which encodes how the tuple's multiplicity is computed based on the multiplicity of input tuples. +% Furthermore, we study how the complexity of this problem compares to + We are specifically + interested in the fine-grained complexity and how it compares to the complexity of deterministic query evaluation algorithms --- if these complexities are comparable, it opens the door to practical deployment of probabilistic databases. + Unfortunately, % we show the reverse; + our results imply that computing expected multiplicities for Bag-PDBs based on the results produced by such query evaluation algorithms introduces super-linear overhead. % Such factorized representations are necessary to realize the performance of modern join algorithms (e.g., worst-case optimal joins), and so our results imply that a Bag-PDB doing exact computations (via these factorized representations) can never be as fast as a classical (deterministic) database. - The problem stays hard even if all input tuples have a fixed probability $\prob$ (s.t. $\prob \in (0,1)$). - We proceed to study polynomials of result tuples of positive relational algebra queries ($\raPlus$) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs). + % The problem stays hard even if + This is the case even if +all input tuples have a fixed probability $\prob$ (s.t. 
$\prob \in (0,1)$).\BG{Replace with this because notion of hardness unclear: This is the case even if \ldots} + We proceed to study how to approximate multiplicities using lineage polynomials of result tuples of positive relational algebra queries ($\raPlus$) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs). We develop a sampling algorithm that computes a $1 \pm \epsilon$-approximation of the expected multiplicity of an output tuple in linear time in the runtime of a comparable deterministic query. % By removing Bag-PDB's reliance on the sum-of-products representation of polynomials, this result paves the way for future work on PDBs that are competitive with deterministic databases. \end{abstract} diff --git a/intro-rewrite-070921.tex b/intro-rewrite-070921.tex index 37eed7b..6a5c085 100644 --- a/intro-rewrite-070921.tex +++ b/intro-rewrite-070921.tex @@ -2,7 +2,7 @@ %root: main.tex \section{Introduction}\label{sec:intro} A probabilistic database (PDB) $\pdb$ is a pair $\inparen{\idb, \pd}$, where $\idb$ is a set of deterministic database instances called possible worlds and $\pd$ is a probability distribution over $\idb$. -A commonly studied problem in probabilistic databases is, given a query $\query$, PDB $\pdb$, and possible query result tuple $\tup$, to compute the tuple's \textit{marginal probability} of being in the query's result, i.e., computing the expectation of a Boolean random variable over $\pd$ that is $1$ for every $\db \in \idb$ for which $\tup \in \query(\db)$ and $0$ otherwise. +A commonly studied problem in probabilistic databases is, given a query $\query$, PDB $\pdb$, and possible query result tuple $\tup$, to compute the tuple's \textit{marginal probability} of being in the query's result, i.e., computing the expectation of a Boolean random variable over $\pd$ that is $1$ for every $\db \in \idb$ for which $\tup \in \query(\db)$ and $0$ otherwise. 
In this work, we are interested in bag semantics, where each tuple is associated with a multiplicity. Following~\cite{DBLP:conf/pods/GreenKT07}, we model bag databases (resp., relations) as functions from each $\tup$ to the tuple's multiplicity $\db(\tup) \in \semN$ in a possible world $\db$. We refer to such a probabilistic database as a bag-probabilistic database or \abbrBPDB for short. @@ -29,7 +29,7 @@ A common encoding of probabilistic databases (e.g., in \cite{IL84a,Imielinski198 %Each valuation of the random variables appearing in this formula corresponds to one possible world. %Given a joint probability distribution over such assignments, the marginal probability of a query result tuple $\tup$ is the probability that the lineage formula of $\tup$ evaluates to true. Given a \abbrBPDB $\pdb$, we refer to the above encoding of $\pdb$ as \dbbaseName and denote it as $\dbbase$. % -The bag semantics analog is a provenance/lineage polynomial $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07} (see~\Cref{fig:nxDBSemantics} for a definition), a polynomial with integer coefficients and exponents, over integer variables $\vct{X}$ encoding input tuple multiplicities. +The bag semantics analog is a provenance/lineage polynomial $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07} (see~\Cref{fig:nxDBSemantics} for a definition), a polynomial with integer coefficients and exponents, over integer variables $\vct{X}$ encoding input tuple multiplicities. \begin{figure} \begin{align*} \polyqdt{\project_A(\query)}{\dbbase}{\tup} =& \sum_{\tup': \project_A(\tup') = \tup} \polyqdt{\query}{\dbbase}{\tup'} & @@ -70,7 +70,7 @@ The bag semantics analog is a provenance/lineage polynomial $\apolyqdt$~\cite{DB \end{figure} -%Analog to set-semantics, computing the expected multiplicity of a tuple reduces to computing the expectation of this polynomial. +%Analog to set-semantics, computing the expected multiplicity of a tuple reduces to computing the expectation of this polynomial. 
We drop $\query$, $\dbbase$, and $\tup$ from $\apolyqdt$ when they are clear from the context or irrelevant to the discussion. We now re-state~\Cref {prob:bag-pdb-query-eval} in the language of lineage polynomials: @@ -80,8 +80,8 @@ Given an $\raPlus$ query $\query$, \abbrBPDB $\pdb$, and result tuple $\tup$, co multiplicity of the polynomial $\apolyqdt$ (i.e., $\expct_{\vct{W}\sim \pdassign}\pd\pbox{\apolyqdt(\vct{W})}$), where $\pdassign$ is the distribution induced by $\pd$ on the relevant assignments $\vct{W}$ to variables of $\apolyqdt$. \end{Problem} -We note that \Cref{prob:bag-pdb-query-eval} is equivalent to \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}). -In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for several models of probabilistic databases and various encodings of such polynomials. +We note that \Cref{prob:bag-pdb-query-eval} is equivalent to \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}). +In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for several models of probabilistic databases and various encodings of such polynomials. \mypar{\abbrTIDB\xplural} We initially focus on tuple-independent probabilistic bag-databases\footnote{See \cite{DBLP:series/synthesis/2011Suciu} for a survey of set-\abbrTIDBs; The bag encoding is analogous~\cite{DBLP:conf/pods/GreenKT07}.} (\abbrTIDB), a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world is modeled as an independent probabilistic event.\footnote{ This model is exactly the definition of \abbrTIDB{}s \cite{VS17} under classical set semantics. @@ -92,8 +92,8 @@ We initially focus on tuple-independent probabilistic bag-databases\footnote{See } % OK: I tidied things up a touch. 
%\BG{The footnote is still a bit hard to follow I think, but I do not have a great suggestion on how to improve it.} -We will denote $\dbbase=(t_1,\dots,t_\numvar)$. Each of the $2^n$ possible worlds in $\Omega$ can be encoded as a string in $\{0,1\}^\numvar$. In particular, any vector $\vct{W}=\inparen{W_1,\dots,W_n}\in \{0,1\}^\numvar$ represents a world $\db\in\idb$ in the natural way: i.e. $\tup_i\in\db$ -iff $W_i=1$. Furthermore, $\pd$ is compactly described by a tuple $\vct{p}=\inparen{p_1,\dots,p_n}$, which induces the Bernoulli distribution over vectors $\vct{W}\in\{0,1\}^\numvar$ where each $i\in [n]$, $\probOf(W_i=1)=p_i$. +We will denote $\dbbase=(t_1,\dots,t_\numvar)$. Each of the $2^n$ possible worlds in $\Omega$ can be encoded as a string in $\{0,1\}^\numvar$. In particular, any vector $\vct{W}=\inparen{W_1,\dots,W_n}\in \{0,1\}^\numvar$ represents a world $\db\in\idb$ in the natural way: i.e. $\tup_i\in\db$ +iff $W_i=1$. Furthermore, $\pd$ is compactly described by a tuple $\vct{p}=\inparen{p_1,\dots,p_n}$, which induces the Bernoulli distribution over vectors $\vct{W}\in\{0,1\}^\numvar$ where each $i\in [n]$, $\probOf(W_i=1)=p_i$. %Finally for each $\vct{W}\in\{0,1\}^\numvar$, we define $\pdb_{\vct{W}}$ %\AH{Where do we use this notation? If we use this somewhere, should we maybe use $\db_{\vct{\randWorld}}$ instead?} @@ -116,21 +116,21 @@ Thanks to linearity of expectation, simple polynomial-time algorithms (for fixed % The algo is trivial so I think putting in a 2010 cite seems like bit too much %\cite{kennedy:2010:icde:pip}) % for computing exact results for bag-probabilistic count queries $Q$ over \abbrTIDB{}s. -However, it is also known that since we are considering data complexity, that {\em deterministic} query processing for the same query $Q$ (over the deterministic database instance $\dbbase$) can also be done in polynomial time. -If our notion of efficiency were simply achieving a polynomial time algorithm, then we would be done. 
-However, in practice (and in theory), we care about the {\em fine-grained} complexity of deterministic query processing (i.e. we care about the exact exponent in our polynomial runtime). +However, it is also known that since we are considering data complexity, that {\em deterministic} query processing for the same query $Q$ (over the deterministic database instance $\dbbase$) can also be done in polynomial time. +If our notion of efficiency were simply achieving a polynomial time algorithm, then we would be done. +However, in practice (and in theory), we care about the {\em fine-grained} complexity of deterministic query processing (i.e. we care about the exact exponent in our polynomial runtime). For \abbrBPDB $\pdb$ and query $Q$, let $\timeOf{}^*(Q,\pdb)$ denote the (optimal) runtime complexity of \Cref{prob:bag-pdb-query-eval} (over all result tuples $\tup$).\AR{Am changing these runtime definitions to include the runtime for all result tuples $\tup$.} -Denote by $\qruntime{Q, \db}$ the `runtime' of query $Q$ on deterministic database $\db$ under a cost model that is satisfied by a wide range of query processing algorithms, including those based on the recent work on worst-case optimal join algorithms (we make this runtime concrete in \Cref{sec:gen}\AR{We need to move the definition of $\qruntime{}$ to \Cref{sec:background} because among others we now need it in our lower bound arguments as well.}). 
-%Denoting by $\dbbase = \bigcup_{\db \in \idb} \db$ the set of all possible tuples in \abbrPDB $\pdb = \inparen{\idb, \pd}$, +Denote by $\qruntime{Q, \db}$ the `runtime' of query $Q$ on deterministic database $\db$ under a cost model that is satisfied by a wide range of query processing algorithms, including those based on the recent work on worst-case optimal join algorithms (we make this runtime concrete in \Cref{sec:gen}\AR{We need to move the definition of $\qruntime{}$ to \Cref{sec:background} because among others we now need it in our lower bound arguments as well.}). +%Denoting by $\dbbase = \bigcup_{\db \in \idb} \db$ the set of all possible tuples in \abbrPDB $\pdb = \inparen{\idb, \pd}$, We finally have all the pieces to state a formal specification of our problem: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{Problem}\label{prob:informal} Given an $\raPlus$ query $\query$ and \abbrTIDB % OK: added motivation -%\AR{Changed this to \abbrTIDB: we should motivate why we are restricting ourselves to this special case here.} +%\AR{Changed this to \abbrTIDB: we should motivate why we are restricting ourselves to this special case here.} \abbrBPDB $\pdb$, is it the case that $\timeOf{}^*(Q,\pdb) \le O(\qruntime{Q, \dbbase})$? \end{Problem} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% @@ -141,7 +141,7 @@ Given an $\raPlus$ query $\query$ and \abbrTIDB %The problem of deterministic query evaluation is known to be \sharpwonehard\footnote{A problem is in \sharpwone if the runtime of the most efficient known algorithm to solve it is lower bounded by some function $f$ of a parameter $k$, where the growth in runtime is polynomially dependent on $f(k)$, i.e. $\Omega\inparen{\numvar^{f(k)}}$.} in data complexity for general $\query$. 
For example, the counting $k$-cliques query problem (where the parameter $k$ is the size of the clique) is \sharpwonehard since (under standard complexity assumptions) it cannot run in time faster than $n^{f(k)}$ for some strictly increasing $f(k)$. %In this paper, we begin to explore whether the problem of bag-probabilistic query evaluation (which we relate to deterministic query processing more precisely below) falls into this same complexity class. -We note that the above is a special case of \Cref{prob:bag-pdb-query-eval} since we are asking whether the query evaluation over \abbrBPDB is {\em linear} in the runtime of deterministic query processing time. +We note that the above is a special case of \Cref{prob:bag-pdb-query-eval} since we are asking whether the query evaluation over \abbrBPDB is {\em linear} in the runtime of deterministic query processing time. We stress that this question is very well motivated, even for one of the simplest models of probabilistic databases (i.e., \abbrTIDBs): An answer in the affirmative for~\Cref{prob:informal} indicates that bag-probabilistic databases can be competitive with deterministic databases, opening the door for deployment in practice. \mypar{Our lower bound results} Unfortunately, we prove the negative. In fact in Table~\ref{tab:lbs}\AR{Cref was not formatting Table correct so added Table in explicitly.} we show that depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to \Cref{prob:informal}. @@ -151,9 +151,9 @@ We stress that this question is very well motivated, even for one of the simples Lower bound on $\timeOf{}^*(\query,\pdb)$ & Num. 
$\pd$s & Hardness Assumption\\ \hline $\Omega\inparen{\inparen{\qruntime{\query, \dbbase}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\ -\hline +%\hline $\omega\inparen{\inparen{\qruntime{\query, \dbbase}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\ -\hline +%\hline $\Omega\inparen{\inparen{\qruntime{\query, \dbbase}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & Current $k$-matching algorithms\\ \hline \end{tabular} @@ -177,21 +177,21 @@ As mentioned before, under set semantics, $\apolyqdt\inparen{\vct{X}}$ is a prop %Atri: If we get a reviewer who does not know what a propositional formula is then we are in trouble-- I did move some of the footnote text to the main part though %\footnote{To be precise, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula composed of boolean variables and the logical disjunction and conjunction connectives. Evaluating such a formula follows the standard semantics of the said operators on boolean variables ($\semB$-semiring semantics).} % whose evaluation follows the standard Boolean semi-ring semantics (i.e. addition is logical OR and multiplication is logical AND), denoting the presence or absence of $\tup$. -and $\expct\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ is the marginal probability of $\tup$ appearing in the output. We note that the answer to \Cref{prob:informal} for set-sematics is also no. Dalvi and Suicu \cite{10.1145/1265530.1265571} showed that the (data) complexity of the query evaluation problem over set-\abbrTIDB\xplural is \sharpphard +and $\expct\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ is the marginal probability of $\tup$ appearing in the output. We note that the answer to \Cref{prob:informal} for set-semantics is also no. 
Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the (data) complexity of the query evaluation problem over set-\abbrTIDB\xplural is \sharpphard %OK: The former ---v %\AR{This is result for TIDBs for general set-PDBs?} %Atri: Again if we have a reviewer who does not know what \sharpp is then we are in trouble %\footnote{\sharpp is the counting version for problems residing in the NP complexity class.} in general. %, and proved that a dichotomy exists for this problem for the class of union of conjunctive queries (with the same expressive power as $\raPlus$), where the runtime of $\query(\pdb)$ is either polynomial or \sharpphard in data complexity. %for any polynomial-time deterministic query. -%Thus, for the hard queries, the answer to~\Cref{prob:informal} is {\em no} for set-PDBs (under the standard complexity assumption that $\sharpp\ne \polytime$). +%Thus, for the hard queries, the answer to~\Cref{prob:informal} is {\em no} for set-PDBs (under the standard complexity assumption that $\sharpp\ne \polytime$). We note that the \sharpphard lower bound is much stronger than what one can hope for in \abbrBPDB since, as mentioned earlier, for a fixed query one can always solve \Cref{prob:bag-pdb-query-eval} in polynomial time. %Concretely, easy queries in this dichotomy can be answered through so-called \emph{extensional} query evaluation, where probability computation is inlined into normal deterministic query processing. %This is possible, because queries on the easy side of the dichotomy can always be rewritten into a form that guarantees that, for every relational operator in the query, the presence of every tuple in the operator's output is governed by either a conjunction or disjunction of \emph{independent} events. %Atri: Removed the para above since the above does not seem to add much to the current intro flow. 
-%Such a guarantee is not possible +%Such a guarantee is not possible For queries on the hard side of the dichotomy, the best known algorithmic approach is the \emph{intensional} query evaluation~\cite{DBLP:series/synthesis/2011Suciu}, where one explicitly computes the lineage polynomial and then its expectation --- we will come back this framework shortly. % as in \Cref{prob:bag-pdb-poly-expected}. % , a two step process that first computes the lineage of the query result --- a representation of $\Phi_\tup$ --- which it then uses to compute the desired probability. %The complexity of this approach is, in general, dominated by computing the expectation $\expct\pbox{\apolyqdt(\vct{\randWorld})}$, a problem known to be \sharpphard~\cite{DS07}. @@ -223,14 +223,14 @@ In the remainder of this work, we demonstrate that a $(1\pm\epsilon)$ (multiplic \input{two-step-model} Like set-probabilistic databases, our approach adopts the two-step intensional model of query evaluation, as illustrated in \Cref{fig:two-step}: -(i) \termStepOne (\abbrStepOne): Given input $\dbbase$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial ($\poly(\vct{X})=\apolyqdt\inparen{\vct{X}}$); +(i) \termStepOne (\abbrStepOne): Given input $\dbbase$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial ($\poly(\vct{X})=\apolyqdt\inparen{\vct{X}}$); (ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct\pbox{\poly(\vct{\randWorld})}$. Let $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ denote the runtime of \abbrStepOne when it outputs $\circuit$ (which is a representation of $\poly$ --- more on this representation shortly). 
Respectively denote by $\timeOf{\abbrStepTwo}(\circuit)$ (recall $\circuit$ is the output of \abbrStepOne) the runtime of \abbrStepTwo, allowing us to formally define our objective: \begin{Problem}\label{prob:big-o-joint-steps} -Given \abbrBPDB $\pdb$, $\raPlus$ query $\query$, +Given \abbrBPDB $\pdb$, $\raPlus$ query $\query$, is there a $(1\pm\epsilon)$-approximation of $\expct_{\db\sim\pd}\pbox{\query\inparen{\db}\inparen{\tup}}$ for all result tuples $\tup$ where $\exists \circuit : \timeOf{\abbrStepOne}(Q,\dbbase,\circuit) + \timeOf{\abbrStepTwo}(\circuit) \le O(\qruntime{Q, \dbbase})$? \end{Problem} @@ -241,13 +241,13 @@ We show in \Cref{sec:gen} %Atri: fixed the ref an $O(\qruntime{Q, \dbbase})$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more more precisely, a single $\circuit$ with one sink per tuple representing the lineage). % , and by extension the first step is in \sharpwonehard\AH{\sharpwonehard is not defined.}. -A key insight of this paper is that the representation of $\circuit$ matters. +A key insight of this paper is that the representation of $\circuit$ matters. For example, if we insist that $\circuit$ represent the lineage polynomial in the standard monomial basis (henceforth, \abbrSMB)\footnote{ This is the representation, typically used in set-\abbrPDB\xplural, where the polynomial is reresented as sum of `pure' products. See \Cref{def:smb} for a formal definition. }, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$, and hence, just $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ will be too large. However, systems can directly emit compact, factorized representations of $\poly(\vct{X})$ (e.g., as a consequence of the standard projection push-down optimization~\cite{DBLP:books/daglib/0020812}). 
-For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$. +For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$. Accordingly, this work uses (arithmetic) circuits\footnote{ An arithmetic circuit is a DAG with variable and/or numeric source nodes and internal, each nodes representing either an addition or multiplication operator. } @@ -298,14 +298,14 @@ Given one circuit $\circuit$ that encodes $\apolyqdt$ for all result tuples $\tu %We further note that in our hardness proofs, we have $|\circuit|=\Theta\inparen{\timeOf{\abbrStepOne}(Q,\pdb)}$, which shows that the answer to~\Cref{prob:bag-pdb-query-eval} is also no.\AR{Need to make sure we have the correct statement for this claim (i) in the main paper.} %we further show superlinear hardness in the size of \circuit for a specific %cubic %graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$; -%(ii) To complement our hardness results, we consider an approximate version of~\Cref{prob:intro-stmt}, where instead of computing the expected multiplicity exactly, we allow for an $(1\pm\epsilon)$-\emph{multiplicative} approximation of the expected multiplicitly. +%(ii) To complement our hardness results, we consider an approximate version of~\Cref{prob:intro-stmt}, where instead of computing the expected multiplicity exactly, we allow for an $(1\pm\epsilon)$-\emph{multiplicative} approximation of the expected multiplicitly. (i) We show that for typical database usage patterns (e.g. 
when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}), where there is a single result tuple, the answer to \Cref{prob:intro-stmt} for \abbrTIDB is {\em yes}.\footnote{We can approximate the expected result tuple multiplicities (for all result tuples {\em simultanesouly} with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of result tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).} % the approximation algorithm has runtime linear in the size of the compressed lineage encoding ( In contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic in the size of the compressed lineage encoding~\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}. %Atri: The footnote below does not add much %\footnote{Note that this doesn't rule out queries for which approximation is linear}); -(ii) We generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases (see \Cref{subsec:tidbs-and-bidbs}) (\abbrBIDB\xplural); +(ii) We generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases (see \Cref{subsec:tidbs-and-bidbs}) (\abbrBIDB\xplural); %\AH{This point \emph{\Large seems} weird to me. I thought we just said that the approximation complexity is linear in step one, but now it's as if we're saying that it's $\log{\text{step one}} + $ the runtime of step one. Where am I missing it?} %\OK{Atri's (and most theoretician's) statements about complexity always need to be suffixed with ``to within a log factor''} (iii) We finally observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation). 
@@ -324,7 +324,7 @@ $$%\begin{multline*} %\inparen{AXB + BYE + BZC}^2\\ =A^2X^2B^2 + B^2Y^2E^2 + B^2Z^2C^2 + 2AXB^2YE + 2AXB^2ZC + 2B^2YEZC. $$%\end{multline*} -By exploiting linearity of expectation, further pushing expectation through independent \abbrTIDB variables and observing that for any $\randWorld\in\{0, 1\}$, we have $\randWorld^2=\randWorld$, the expectation +By exploiting linearity of expectation, further pushing expectation through independent \abbrTIDB variables and observing that for any $\randWorld\in\{0, 1\}$, we have $\randWorld^2=\randWorld$, the expectation \AH{If we choose to use $\pd$ in \Cref{prob:bag-pdb-poly-expected}, then we either need to follow the same convention here OR introduce the notation $\pdassign$ before using it.} $\expct\limits_{\vct{\randWorld}\sim\pdassign}\pbox{\poly^2\inparen{\vct{\randWorld}}}$ (Where $\randWorld_A$ is the random variable corresponding to $A$, distributed as $\pdassign$). @@ -364,8 +364,8 @@ For any \abbrTIDB-lineage polynomial $\poly\inparen{\vct{X}}=\apolyqdt(\vct{X})$ $ \end{Lemma} -To prove our hardness result we show that for the same $Q$ from the example above, for an arbitrary `product width' $k$, the query $Q^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $O(1)$ tuples in \Cref{fig:two-step}). -We do so by considering an arbitrary graph $G$ (analogous to the $Route$ relation of $\query$) and analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in $G$ that are isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the leading coefficient in $\poly$ corresponding to $\query^k$ is proportional to the number of $k$-matchings in $G$, a known hard problem in parameterized/fine-grained complexity literature. 
+To prove our hardness result we show that for the same $Q$ from the example above, for an arbitrary `product width' $k$, the query $Q^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $O(1)$ tuples in \Cref{fig:two-step}). +We do so by considering an arbitrary graph $G$ (analogous to the $Route$ relation of $\query$) and analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in $G$ that are isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the leading coefficient in $\poly$ corresponding to $\query^k$ is proportional to the number of $k$-matchings in $G$, a known hard problem in parameterized/fine-grained complexity literature. For an upper bound on approximating the expected count, it is easy to check that if all the probabilties are constant then $\poly\left(\prob_1,\dots, \prob_n\right)$ (i.e. evaluating the original lineage polynomial over the probability values) is a constant factor approximation. 
For example, using $\query^2$ from above, using $\prob_A$ to denote $\probOf\pbox{A = 1}$ (and similarly for the other variables), we can see that \begin{footnotesize} @@ -373,7 +373,7 @@ For an upper bound on approximating the expected count, it is easy to check that \hspace*{-3mm} \poly^2\inparen{\probAllTup} &= \prob_A^2\prob_X^2\prob_B^2 + \prob_B^2\prob_Y^2\prob_E^2 + \prob_B^2\prob_Z^2\prob_C^2 + 2\prob_A\prob_X\prob_B^2\prob_Y\prob_E + 2\prob_A\prob_X\prob_B^2\prob_Z\prob_C + 2\prob_B^2\prob_Y\prob_E\prob_Z\prob_C\\ &\leq\prob_A\prob_X\prob_B + \prob_B\prob_Y\prob_E + \prob_B\prob_Z\prob_C + -2\prob_A\prob_X\prob_B\prob_Y\prob_E + 2\prob_A\prob_X\prob_B\prob_Z\prob_C + 2\prob_B\prob_Y\prob_E\prob_Z\prob_C +2\prob_A\prob_X\prob_B\prob_Y\prob_E + 2\prob_A\prob_X\prob_B\prob_Z\prob_C + 2\prob_B\prob_Y\prob_E\prob_Z\prob_C = \rpoly\inparen{\vct{p}} %\inparen{0.9\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 0.5}^2 = 2.7225 < 3.45 = \rpoly^2\inparen{\probAllTup} \end{align*} @@ -388,7 +388,7 @@ To get an $(1\pm \epsilon)$-multiplicative approximation we uniformly sample mon \mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen}. %and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem %\AH{I don't think I understand what the sentence (about extensions) is saying.} -% (\Cref{def:the-expected-multipl}). +% (\Cref{def:the-expected-multipl}). Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}. All proofs are in the appendix.