Finished pass over S1.

This commit is contained in:
Aaron Huber 2021-09-08 10:43:54 -04:00
parent c724d60157
commit 513b345ccb

\section{Introduction (Rewrite - 070921)}\label{sec:intro-rewrite-070921}
\input{two-step-model}
A probabilistic database (PDB) $\pdb$ is a tuple $\inparen{\idb, \pd}$, where $\idb$ is a set of deterministic database instances called possible worlds and $\pd$ is a probability distribution over $\idb$.
A commonly studied problem in probabilistic databases is, given a query $\query$, PDB $\pdb$, and possible query result tuple $\tup$, to compute the tuple's \textit{marginal probability} of being in the query's result, i.e., computing the expectation of a Boolean random variable over $\pd$ that is $1$ for every $\db \in \idb$ for which $\tup \in \query(\db)$ and $0$ otherwise. In this work, we are interested in bag semantics where each tuple $\tup$ is associated with a multiplicity $\db(\tup)$ from $\semN$ in each possible world\footnote{We find it convenient to use the notation from~\cite{DBLP:conf/pods/GreenKT07} which models bag relations as functions that map tuples to their multiplicity.}.
We refer to such a probabilistic database as a bag-probabilistic database or \abbrBPDB for short.
The natural generalization of the problem of computing marginal probabilities of query result tuples to bag semantics is to compute the expectation of a random variable over $\pd$ that assigns value $\query(\db)(\tup)$ in world $\db$:
\AH{I think I understand what is being stated in this last sentence, but I wonder if phrasing the end something like, ``for world $\db \in \idb$ would be easier to digest for the average reviewer...maybe it was just me.}
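The expected multiplicity defined above can be illustrated by brute-force enumeration of possible worlds. A minimal sketch, assuming a hypothetical three-tuple instance (the names `t1`–`t3`, the probabilities, and the query below are illustrative, not from the paper):

```python
from itertools import product

# Hypothetical tuple-independent instance: each tuple present (multiplicity 1)
# independently with the given probability.
probs = {"t1": 0.5, "t2": 0.25, "t3": 1.0}

def expected_multiplicity(query_mult, probs):
    """Brute-force E[Q(D)(t)] over all 2^n possible worlds; query_mult
    maps a world (dict name -> 0/1 multiplicity) to a natural number."""
    names = list(probs)
    total = 0.0
    for bits in product([0, 1], repeat=len(names)):
        world = dict(zip(names, bits))
        weight = 1.0
        for name in names:
            weight *= probs[name] if world[name] else 1 - probs[name]
        total += weight * query_mult(world)
    return total

# Multiplicity of one hypothetical output tuple: Q(D)(t) = D(t1)*D(t2) + D(t3).
q = lambda w: w["t1"] * w["t2"] + w["t3"]
# By linearity and independence this equals 0.5*0.25 + 1.0 = 1.125.
assert abs(expected_multiplicity(q, probs) - 1.125) < 1e-9
```

The enumeration is exponential in the number of tuples; the point of the compressed encodings discussed below is to avoid it.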
% In bag query semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ is the multiplicity of its corresponding output tuple $\tup$ (in a random database instance in $\idb$ chosen according to $\pd$).
%In addition to traditional deterministic query evaluation requirements (for a given query class), the query evaluation problem in bag-\abbrPDB semantics can be formally stated as:
A common encoding of probabilistic databases (e.g., in \cite{IL84a,Imielinski1989IncompleteII}) annotates each tuple with a lineage formula, a propositional formula over Boolean random variables.
Each valuation of the random variables appearing in this formula corresponds to one possible world.
Given a joint probability distribution over such assignments, the marginal probability of a query result tuple $\tup$ is the probability that the lineage formula of $\tup$ evaluates to true.
The bag semantics analog of a lineage formula is a provenance polynomial $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07}, a polynomial with integer coefficients and exponents over integer random variables $\vct{\randWorld}$ encoding the multiplicity of input tuples.
Analogous to the set-semantics case, computing the expected multiplicity of a tuple reduces to computing the expectation of this polynomial. We drop $\query$, $\pdb$, and $\tup$ from $\apolyqdt$ when they are clear from the context or irrelevant to the discussion.
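For intuition, a provenance polynomial over independent $\{0,1\}$-valued variables admits a direct expectation computation: by linearity, expectation distributes over monomials, and since $X^e = X$ for $X \in \{0,1\}$, each variable contributes its marginal. A minimal sketch with an illustrative encoding (a polynomial as a map from monomials to coefficients; none of these names are the paper's notation):

```python
# Sketch: a polynomial as {monomial: coefficient}, where a monomial is
# a tuple of (variable, exponent) pairs.

def expectation(poly, p):
    """E[sum_m c_m * prod_i X_i^{e_i}] for independent 0/1-valued X_i
    with marginals p; uses E[X^e] = E[X] = p for any e >= 1."""
    total = 0.0
    for monomial, coeff in poly.items():
        term = coeff
        for var, _exp in monomial:  # exponent is irrelevant for 0/1 X
            term *= p[var]
        total += term
    return total

# Phi = 2*X*Y + 3*Z^2 with P[X=1] = P[Y=1] = 0.5, P[Z=1] = 0.1.
poly = {(("X", 1), ("Y", 1)): 2, (("Z", 2),): 3}
assert abs(expectation(poly, {"X": 0.5, "Y": 0.5, "Z": 0.1}) - 0.8) < 1e-9
```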
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}\label{prob:bag-pdb-poly-expected}
Given an $\raPlus$ query $\query$, \abbrBPDB $\pdb$, and output tuple $\tup$, compute the expected multiplicity of $\apolyqdt$ ($\expct_\pd\pbox{\apolyqdt}$).
\end{Problem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\AH{I think that \Cref{prob:bag-pdb-poly-expected} needs to define the all worlds distribution $\pdassign$ over the set $\vct{W}\in\{0, 1\}^\numvar$, as well as the assumption or justification that $\pd \equiv \pdassign$.}
Note that, if $\apolyqdt$ is given, then \Cref{prob:bag-pdb-query-eval} reduces to \Cref{prob:bag-pdb-poly-expected} (see \Cref{subsec:expectation-of-polynom-proof} for the proof). Evaluating queries over probabilistic databases in this fashion (first computing a tuple's lineage and then calculating the expectation of the lineage) has been referred to as \textit{intensional query evaluation}~\cite{DBLP:series/synthesis/2011Suciu}. In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for several models of probabilistic databases and various encodings of such polynomials, considering the size of the encoding as the input size. % specifically, the bag semantics version of tuple-independent probabilistic bag-databases (\abbrTIDB) and block-independent probabilistic databases (\abbrBIDB).
% Our main technical focus is on studying the complexity of this problem for various encoding of such polynomials.
However, as we will show, these results have implications for the complexity of \Cref{prob:bag-pdb-query-eval} as well.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{\abbrTIDB\xplural}
%Solving~\cref{prob:bag-pdb-query-eval} for arbitrary $\pd$ is hopeless since we need exponential space to represent an arbitrary $\pd$.
We initially focus on tuple-independent probabilistic bag-databases\footnote{See \cite{DBLP:series/synthesis/2011Suciu} for a survey of set-\abbrTIDBs; the bag encoding is analogous~\cite{DBLP:conf/pods/GreenKT07}.} (\abbrTIDB), a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world is modeled as an independent probabilistic event.\footnote{
This model is exactly the definition of \abbrTIDB{}s \cite{VS17} under classical set semantics.
Mirroring the implementation of bag relations in production database systems (e.g., Postgresql, DB2), we model tuples with possible multiplicities greater than one by replacing each input tuple with as many copies as its largest possible multiplicity.
% To make each duplicate tuple unique in a set-\abbrTIDB we can assign unique keys across all duplicates.
This increases the size of the input, but the overhead is negligible when each input tuple has constant multiplicity. %$\tup$ in $\pdb$.
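The copy-based encoding just described can be sketched in a few lines; the tuple names, probabilities, and the `#i` copy-naming convention below are illustrative assumptions, not the paper's notation:

```python
# Sketch: an input tuple whose largest possible multiplicity is c is
# replaced by c independent 0/1 copies, reducing a bag-TIDB to purely
# Boolean tuple-independent events.

def expand(tuples):
    """tuples: {name: (prob, max_multiplicity)} -> {copy_name: prob}."""
    copies = {}
    for name, (p, c) in tuples.items():
        for i in range(c):
            copies[f"{name}#{i}"] = p  # each copy is an independent event
    return copies

expanded = expand({"r1": (0.5, 3), "r2": (0.9, 1)})
# Blow-up is bounded by the maximum multiplicity: 3 + 1 = 4 copies.
assert len(expanded) == 4
```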
%This typically has an $\bigO{c}$ increase in size, for $c = \max_{\tup \in \db}\db\inparen{\tup}$, where $\db\inparen{\tup}$ denotes $\tup$'s multiplicity in the encoding.
Thanks to linearity of expectation, simple polynomial-time algorithms exist for computing exact results for bag-probabilistic count queries $Q$ over \abbrTIDB{}s.
However, it is also known that, since we are considering data complexity, {\em deterministic} query processing for the same query $Q$ can be done in polynomial time.
If our notion of efficiency were simply achieving a polynomial time algorithm, then we would be done.
However, in practice (and in theory), we care about the {\em fine-grained} complexity of deterministic query processing (i.e., we care about the exact exponent in our polynomial runtime). Given that there is a huge literature on the fine-grained complexity of deterministic query processing, a natural (informal) specialization of~\cref{prob:bag-pdb-query-eval} is:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}[Informal problem statement]\label{prob:informal}
As mentioned before, under set semantics, $\apolyqdt\inparen{\vct{X}}$ is a propositional formula, and $\expct\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ is the marginal probability of $\tup$ appearing in the output. Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the complexity of the query evaluation problem over set-\abbrPDB\xplural is \sharpphard
%Atri: Again if we have a reviewer who does not know what \sharpp is then we are in trouble
%\footnote{\sharpp is the counting version for problems residing in the NP complexity class.}
in general, and proved that a dichotomy exists for this problem for the class of unions of conjunctive queries (with the same expressive power as $\raPlus$): $\query(\pdb)$ is either computable in polynomial time or is \sharpphard in data complexity. %for any polynomial-time deterministic query.
Thus, for the hard queries, the answer to~\cref{prob:informal} is {\em no} for set-PDBs (under the standard complexity assumption that $\sharpp\ne \polytime$).
Concretely, easy queries in this dichotomy can be answered through so-called \emph{extensional} query evaluation, where probability computation is inlined into normal deterministic query processing.
This is possible, because queries on the easy side of the dichotomy can always be rewritten into a form that guarantees that, for every relational operator in the query, the presence of every tuple in the operator's output is governed by either a conjunction or disjunction of \emph{independent} events.
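The two independence cases above can be sketched as probability combinators; this is a simplified illustration of the extensional idea, not the dichotomy algorithm itself, and it computes correct marginals only when the rewritten plan guarantees the stated independence:

```python
# Minimal sketch of inlined probability computation for independent events.

def p_join(p1, p2):
    """Conjunction of independent events (e.g., a join of two tuples)."""
    return p1 * p2

def p_union(p1, p2):
    """Disjunction of independent events (e.g., duplicate-eliminating union)."""
    return 1 - (1 - p1) * (1 - p2)

assert abs(p_join(0.5, 0.4) - 0.2) < 1e-9
assert abs(p_union(0.5, 0.4) - 0.7) < 1e-9
```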
However, there exist some queries for which \abbrBPDB\xplural are a more natural model.
%The first step, which we will refer to as \termStepOne (\abbrStepOne), consists of computing both $\query\inparen{\db}$ and $\poly_\tup(\vct{X})$.\footnote{Assuming standard $\raPlus$ query processing algorithms, computing the lineage polynomial of $\tup$ is upperbounded by the runtime of deterministic query evaluation of $\tup$, as we show in \cref{sec:circuit-runtime}.} The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly_\tup(\vct{\randWorld})}$. Such a model of computation is nicely followed in set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where $\poly_\tup\inparen{\vct{X}}$ must be computed separate from deterministic query evaluation to obtain exact output when $\query(\pdb)$ is hard since evaluating the probability inline with query operators (extensional evaluation) will only approximate the actual probability in such a case. The paradigm of \cref{fig:two-step} is also analogous to semiring provenance, where $\semNX$-DB\footnote{An $\semNX$-DB is a database whose tuples are annotated with elements from the set of polynomials with variables in $\vct{X}$ and natural number coeficients and exponents.} query processing \cite{DBLP:conf/pods/GreenKT07} first computes the query and polynomial, and the $\semNX$-polynomial can then subsequently evaluated over a semantically appropriate semiring, e.g. $\semN$ to model bag semantics. Further, in this work, the intensional model lends itself nicely in separating the concerns of deterministic computation and the probability computation.
Analogous to set-probabilistic databases, we focus on the intensional model of query evaluation, as illustrated in \cref{fig:two-step}.
Given input $\pdb$ and $\query$, the first step, which we will refer to as \termStepOne (\abbrStepOne), outputs every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial ($\poly$), which is computed inline
\AH{While correct, I wonder if the average reviewer could confuse the language (based on the previous discussion of extensional evaluation) and think that the \emph{probability computation} is computed \emph{inline}.}
across the query operators of $\query$~\cite{Imielinski1989IncompleteII,DBLP:conf/pods/GreenKT07}.
We show in \cref{sec:circuit-runtime} that, assuming a standard $\raPlus$ query evaluation algorithm, the cost of constructing the lineage polynomial for tuples in a query result is upper-bounded by the runtime of generating those tuples through deterministic query evaluation.
In other words, the first step is in \sharpwonehard,
\AH{\sharpwonehard is not defined.}
allowing us to focus on the complexity of the second step, \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly(\vct{\randWorld})}$.
There is significant precedent for the intensional model in \abbrBPDB{}s, as several systems, including UADBs~\cite{feng:2019:sigmod:uncertainty}, PIP~\cite{kennedy:2010:icde:pip}, and Trio \cite{DBLP:conf/vldb/AgrawalBSHNSW06} adopt similar strategies for computing expectations.
Notably, intensional query evaluation also mirrors the approach of semiring provenance~\cite{DBLP:conf/pods/GreenKT07}, where $\semNX$-DB\footnote{An $\semNX$-DB is a database whose tuples are annotated with standard polynomials, i.e. elements from $\semNX$ connected by multiplication and addition operators.} query processing first constructs a $\semNX$-polynomial for each result tuple, which can then be evaluated over a semantically appropriate semiring (e.g. $\semN$ for bag semantics multiplicities).
Finally, the intensional model lends itself nicely to separating the concerns of deterministic computation and probability computation.
For bag-\abbrPDB $\pdb$ and query $Q$, let $\timeOf{\abbrStepOne}(Q,\pdb)$ denote the runtime of \abbrStepOne (Lineage Computation) and similarly for $\timeOf{\abbrStepTwo}(Q,\pdb)$ (Expectation Computation).
%Atri: Don't see what the sentence below is adding, so removing
%Given bag-\abbrPDB query $\query$ and \abbrTIDB $\pdb$ with $\numvar$ tuples, let us go a step further and assume that computing $\poly_\tup$ is lower bounded by the runtime of determistic query computation of $\query$ (e.g. when $\abs{\textnormal{input}} \leq \abs{\textnormal{output}}$).
When $\poly(\vct{X})$ is in standard monomial basis (\abbrSMB)\footnote{A polynomial is in \abbrSMB when it is a sum of products of variables (a variable can occur more than once), where each product of variables is unique.}, by linearity of expectation and independence of \abbrTIDB, it follows that $\timeOf{\abbrStepTwo}(Q,\pdb)$ is $O(|\poly(\vct{X})|)$ and thus also $\bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$.
\AH{Is this obvious enough for the typical reviewer to realize?}
Recall that $\prob_i$ denotes the probability of tuple $\tup_i$ (i.e. $\probOf\pbox{W_i = 1}$) for $i \in [\numvar]$. Consider another special case when for all $i$ in $[\numvar]$, $\prob_i = 1$.
% Replaced the stuff below with something more auccint
%For output tuple $\tup'$ of $\query\inparen{\pdb}$, computing $\expct\pbox{\poly_{\tup'}\inparen{\vct{\randWorld}}}$ is linear in
%$\abs{\poly_\tup}$
%the size of the arithemetic circuit
%, since we can essentially push expectation through multiplication of variables dependent on one another.\footnote{For example in this special case, computing $\expct\pbox{(X_iX_j + X_\ell X_k)^2}$ does not require product expansion, since we have that $p_i^h x_i^h = p_i \cdot 1^{h-1}x_i^h$.}
In this case, we have for any output tuple $\tup$, $\expct\pbox{\poly(\vct{W})}=\poly(1,\dots,1)$.
Thus, we have another case where $\timeOf{\abbrStepTwo}(Q,\pdb)$ is $\bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$ and we again achieve deterministic query runtime for $\query\inparen{\pdb}$ (up to a constant factor). These observations introduce our first formalization of~\Cref{prob:informal}:
\begin{Problem}\label{prob:big-o-step-one}
Given bag-\abbrPDB $\pdb$, $\raPlus$ query $\query$ and output tuple $\tup$, is it \emph{always} the case that $\timeOf{\abbrStepTwo}(Q,\pdb)$ is $\bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$?
\end{Problem}
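The two easy cases discussed above can be sketched concretely. A minimal illustration, assuming a hypothetical \abbrSMB polynomial $\Phi = XY + 2YZ$ encoded as (coefficient, variables) pairs: one pass over the monomials computes the expectation (so \abbrStepTwo costs $O(|\Phi|)$), and when every $\prob_i = 1$ the result is just $\Phi(1,\dots,1)$:

```python
from math import prod

# Hypothetical SMB polynomial Phi = X*Y + 2*Y*Z as (coeff, variables) pairs.
smb = [(1, ["X", "Y"]), (2, ["Y", "Z"])]

def step2_smb(smb, p):
    """E[Phi(W)] by linearity + independence: a single pass over Phi,
    so the cost is linear in the size of the SMB representation."""
    return sum(c * prod(p[v] for v in vs) for c, vs in smb)

half = dict.fromkeys("XYZ", 0.5)
assert abs(step2_smb(smb, half) - 0.75) < 1e-9  # 0.25 + 2*0.25

# Special case p_i = 1 for all i: the expectation is Phi(1,...,1).
ones = dict.fromkeys("XYZ", 1.0)
assert step2_smb(smb, ones) == 3.0
```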
If the answer to \cref{prob:big-o-step-one} is yes, then the query evaluation problem over bag \abbrPDB\xplural is of the same complexity as deterministic query evaluation, and probabilistic databases can offer performance competitive with deterministic databases.
The main insight of the paper is that to answer~\Cref{prob:big-o-step-one}, the representation of $\poly(\vct{X})$ matters. One can have compact representations of $\poly(\vct{X})$ (e.g., resulting from optimizations like projection push-down~\cite{DBLP:books/daglib/0020812}, which produce factorized representations
%Atri: footnote below was not informative: used an example instead
%\footnote{A factorized representation is a representation of a polynomial that is not in \abbrSMB form.}
of $\poly(\vct{X})$).
For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$. To capture such factorizations, this work uses (arithmetic) circuits\footnote{An arithmetic circuit has variable and/or numeric inputs, with internal nodes representing either an addition or multiplication operator.}
as the representation system of $\poly(\vct{X})$.
These are a natural fit to $\raPlus$ queries, as each operator maps to either a $\circplus$ or $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07}. The standard query evaluation semantics depicted in \cref{fig:nxDBSemantics} illustrate this.
\begin{figure}
\begin{align*}
\polyqdt{\project_A(\query)}{\dbbase}{\tup} =& \sum_{\tup': \project_A(\tup') = \tup} \polyqdt{\query}{\dbbase}{\tup'} &
\polyqdt{\query_1 \union \query_2}{\dbbase}{\tup} =& \polyqdt{\query_1}{\dbbase}{\tup} + \polyqdt{\query_2}{\dbbase}{\tup}\\
\polyqdt{\select_\theta(\query)}{\dbbase}{\tup} =& \begin{cases}
\polyqdt{\query}{\dbbase}{\tup} & \text{if }\theta(\tup) \\
0 & \text{otherwise}.
\end{cases} &
\begin{aligned}
\polyqdt{\query_1 \join \query_2}{\dbbase}{\tup} =\\ ~
\end{aligned}&
\begin{aligned}
&\polyqdt{\query_1}{\dbbase}{\project_{\attr{\query_1}}{\tup}} \\
&~~~\cdot\polyqdt{\query_2}{\dbbase}{\project_{\attr{\query_2}}{\tup}}
\end{aligned}\\
& & \polyqdt{\rel}{\dbbase}{\tup} =&\begin{cases}
X_\tup & \text{if }\dbbase.\rel\inparen{\tup} = 1 \\
0 &\text{otherwise.}\end{cases}
%\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \evald{\project_A(\rel)}{\db}(\tup) =& \sum_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') &
% \end{aligned}\\
% & & \evald{R}{\db}(\tup) =& \rel(\tup)
\end{align*}\\[-10mm]
\caption{Construction of the lineage (polynomial) for an $\raPlus$ query over a \abbrBPDB, where $\vct{X}$ consists of all $X_\tup$ over all $\rel$ in $\dbbase$ and $\tup$ in $\rel$.} % Evaluation semantics $\evald{\cdot}{\db}$ for $\semNX$-DBs~\cite{DBLP:conf/pods/GreenKT07}.}
\label{fig:nxDBSemantics}
\end{figure}
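The factorized-versus-\abbrSMB distinction above can be made concrete with a toy circuit encoding for the example $B(Y+Z)$ versus $BY+BZ$. This is an illustrative sketch only; `size()` here counts nodes of the expression tree (a shared input counted once per use), a simplification of circuit size $|\circuit|$:

```python
# Toy arithmetic circuit: internal nodes are + or *, leaves are inputs.

class Gate:
    def __init__(self, op, kids=(), var=None):
        self.op, self.kids, self.var = op, kids, var  # op in {"+", "*", "in"}
    def size(self):
        return 1 + sum(k.size() for k in self.kids)
    def eval(self, env):
        if self.op == "in":
            return env[self.var]
        a, b = (k.eval(env) for k in self.kids)
        return a * b if self.op == "*" else a + b

B, Y, Z = (Gate("in", var=v) for v in "BYZ")
factorized = Gate("*", (B, Gate("+", (Y, Z))))           # B(Y+Z)
smb = Gate("+", (Gate("*", (B, Y)), Gate("*", (B, Z))))  # BY + BZ

env = {"B": 2, "Y": 3, "Z": 4}
assert factorized.eval(env) == smb.eval(env) == 14  # same polynomial
assert factorized.size() == 5 and smb.size() == 7   # smaller representation
```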
In other words, we can capture the size of a factorized lineage polynomial by the size of its corresponding arithmetic circuit $\circuit$ (which we denote by $|\circuit|$).
More importantly, our result in \cref{sec:circuit-runtime} shows that, assuming a standard $\raPlus$ query evaluation algorithm for \abbrStepOne (\termStepOne), given the arithmetic circuit $\circuit$ corresponding to the lineage polynomial output at the end of \abbrStepOne, we always have $|\circuit|\le \bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$. Given this, we study the following stronger version of~\Cref{prob:big-o-step-one}:
%Atri: Replaced the text below by the above. I know I had talked about $|\circuit|^k$ but I think the stuff below breaks the flow a bit
%Re-stating our earlier observation, given a circuit \circuit, if \circuit is in \abbrSMB (i.e. every sink to source path has a prefix of addition nodes and the rest of the internal nodes are multiplication nodes), then we have that $\timeOf{\abbrStepTwo}(Q,\pdb)$ is indeed $\bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$. We note that \abbrSMB representations are produced by queries with a projection operation on top of a join operation.
Concretely, we make the following contributions:
We show that the answer to~\Cref{prob:big-o-step-one} is \textit{no} in general for exact computation. %\cref{prob:intro-stmt} for bag-\abbrTIDB\xplural is not true in general
% \sharpwonehard in the size of the lineage circuit
In fact, via a
reduction from counting the number of $k$-matchings over an arbitrary graph, we show that the problem of \abbrStepTwo (\termStepTwo) is \sharpwonehard. I.e., not only is the answer to~\Cref{prob:intro-stmt} no, but \abbrStepTwo cannot be solved in fully polynomial time, i.e. there is no algorithm for \abbrStepTwo with runtime that grows as $f(k)\cdot |\circuit|^d$, where $k$ is the degree of the corresponding lineage polynomial and $d$ is any fixed constant.\footnote{We would like to note that it is a well-known result in deterministic query computation that \abbrStepOne is also \sharpwonehard. What our result says is that \abbrStepTwo is \sharpwonehard \emph{even if} we exclude the complexity of \abbrStepOne.}
This hardness result requires the algorithm to be able to solve the hard query $Q$ for {\em multiple} PDBs. We further show that the answer to~\Cref{prob:intro-stmt} is no even if we fix the $\pd$ (in particular, we insist on $\prob_i = \prob$ for some $\prob$ in $(0, 1)$).
%Atri: The footnote above is where I talk about \sharpwonehard of det query complexity.
We further note that in our hardness proofs, we have $|\circuit|=\Theta\inparen{\timeOf{\abbrStepOne}(Q,\pdb)}$, which shows that the answer to~\Cref{prob:big-o-step-one} is also no.\AR{Need to make sure we have the correct statement for this claim (i) in the main paper.}
%we further show superlinear hardness in the size of \circuit for a specific %cubic
In contrast, known approximation techniques in set-\abbrPDB\xplural are at most
(iii) We generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases (see \cref{subsec:tidbs-and-bidbs}) (\abbrBIDB\xplural); (iv) We further prove that for \raPlus queries
%\AH{This point \emph{\Large seems} weird to me. I thought we just said that the approximation complexity is linear in step one, but now it's as if we're saying that it's $\log{\text{step one}} + $ the runtime of step one. Where am I missing it?}
%\OK{Atri's (and most theoretician's) statements about complexity always need to be suffixed with ``to within a log factor''}
we can approximate the expected output tuple multiplicities (for all output tuples {\em simultaneously}) with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}). We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\poly$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. Next, we motivate this reduced polynomial.
Consider the query $\query(\pdb)$ defined as follows over the bag relations of \cref{fig:two-step}:
\begin{lstlisting}
SELECT 1 FROM OnTime a, Route r, OnTime b
WHERE a.city = r.city1 AND b.city = r.city2
\end{lstlisting}
%$Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$
It can be verified that $\poly\inparen{A, B, C, D, X, Y, Z}$ for the sole output tuple of $\query$ is $AXB + BYD + BZC$. Now consider the product query $\query^2(\pdb) = \query(\pdb) \times \query(\pdb)$.
The lineage polynomial for $Q^2$ is given by $\poly^2\inparen{A, B, C, D, X, Y, Z}$:
\begin{multline*}
\inparen{AXB + BYD + BZC}^2\\
=A^2X^2B^2 + B^2Y^2D^2 + B^2Z^2C^2 + 2AXB^2YD + 2AXB^2ZC + 2B^2YDZC.
\end{multline*}
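The expansion above can be checked mechanically. A small sketch that verifies $(AXB + BYD + BZC)^2$ equals the displayed six-term sum on every $\{0,1\}$ assignment (it is an algebraic identity, so any inputs would do):

```python
from itertools import product

def phi(A, B, C, D, X, Y, Z):
    # Lineage polynomial of the sole output tuple of Q.
    return A*X*B + B*Y*D + B*Z*C

def expansion(A, B, C, D, X, Y, Z):
    # The six-term expansion of Phi^2 shown above.
    return (A**2*X**2*B**2 + B**2*Y**2*D**2 + B**2*Z**2*C**2
            + 2*A*X*B**2*Y*D + 2*A*X*B**2*Z*C + 2*B**2*Y*D*Z*C)

for vals in product([0, 1], repeat=7):
    assert phi(*vals) ** 2 == expansion(*vals)
```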
By exploiting linearity of expectation of summand terms, and further pushing expectation through independent \abbrTIDB variables, the expectation $\expct\limits_{\vct{\randWorld}\sim\pdassign}\pbox{\Phi^2\inparen{\vct{\randWorld}}}$ then is:\footnote{The random variable corresponding to a formal variable $A$ is denoted $\randWorld_A$, with probability drawn from $\pdassign$.}
\begin{footnotesize}
\begin{multline*}
\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B^2} + \expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y^2}\expct\pbox{\randWorld_D^2} + \expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Z^2}\expct\pbox{\randWorld_C^2} + 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_D}\\
+ 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_D}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C}.
\end{multline*}
\end{footnotesize}
\noindent Since for any $\randWorld\in\{0, 1\}$, we have $\randWorld^2=\randWorld$,
%then for any $k > 0$, $\expct\pbox{\randWorld^k} = \expct\pbox{\randWorld}$, which means that
$\expct\limits_{\vct{\randWorld}\sim\pdassign}\pbox{\Phi^2\inparen{\vct{\randWorld}}}$ simplifies to:
\begin{footnotesize}
\begin{multline*}
\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B} + \expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_D} + \expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_D} \\
+ 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_D}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C}.
\end{multline*}
\end{footnotesize}
\noindent This is precisely the reduced polynomial of $\poly^2$ (obtained by dropping all exponents) evaluated at the marginal probabilities. In fact, the following lemma shows that this equivalence holds for {\em all} $\rpoly$.
\begin{Lemma}\label{lem:tidb-reduce-poly}
Let $\pdb$ be a \abbrTIDB over $n$ input tuples
%\OK{Should this be $\vct{W}$?} $\vct{X} = \{X_1,\ldots,X_\numvar\}$
such that the probability distribution $\pdassign$ over $\vct{W}\in\{0,1\}^\numvar$ (the set of possible worlds) is induced by the probability vector $\probAllTup = \inparen{\prob_1,\ldots,\prob_\numvar}$ where $\prob_i=\probOf\pbox{W_i=1}$.
% $\probAllTup$ consists of each individual tuple's marginal probability across $\idb$.
For any \abbrTIDB-lineage polynomial $\poly\inparen{\vct{X}}$ based on $\query\inparen{\pdb}$, it holds that $
%\begin{equation*}
\expct_{\vct{W} \sim \pdassign}\pbox{\poly\inparen{\vct{W}}} = \rpoly\inparen{\probAllTup}.
%\end{equation*}
$
\end{Lemma}
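The lemma can be checked on the running example by brute-force enumeration. The following Python sketch (illustrative only; the marginal probability values are made up) computes $\expct\pbox{\Phi^2}$ exactly over all $2^7$ possible worlds and compares it to the reduced polynomial evaluated at the marginals:

```python
import itertools

# Hypothetical marginal probabilities for the seven input tuples.
probs = {'A': 0.9, 'B': 0.5, 'C': 0.4, 'D': 0.8, 'X': 0.3, 'Y': 0.7, 'Z': 0.6}
names = list(probs)

def phi_sq(v):
    # Phi^2 for the example query Q^2.
    return (v['A'] * v['X'] * v['B'] + v['B'] * v['Y'] * v['D']
            + v['B'] * v['Z'] * v['C']) ** 2

# Left-hand side: exact expectation over all 2^7 possible worlds.
expectation = 0.0
for bits in itertools.product([0, 1], repeat=len(names)):
    world = dict(zip(names, bits))
    weight = 1.0
    for n in names:
        weight *= probs[n] if world[n] else 1.0 - probs[n]
    expectation += weight * phi_sq(world)

# Right-hand side: reduced polynomial of Phi^2 (all exponents dropped),
# evaluated at the marginal probabilities.
p = probs
reduced = (p['A'] * p['X'] * p['B'] + p['B'] * p['Y'] * p['D']
           + p['B'] * p['Z'] * p['C']
           + 2 * p['A'] * p['X'] * p['B'] * p['Y'] * p['D']
           + 2 * p['A'] * p['X'] * p['B'] * p['Z'] * p['C']
           + 2 * p['B'] * p['Y'] * p['D'] * p['Z'] * p['C'])

assert abs(expectation - reduced) < 1e-9
```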
To prove our hardness result we show that for the same $\query$ considered in the example above, for an arbitrary product width $k$, the query $\query^k$ is able to encode various hard graph-counting problems.\footnote{While $\query$ is the same, our results assume $\bigO{\numvar}$ tuples rather than the constant number of tuples appearing in \cref{fig:two-step}.} We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of subgraphs in an arbitrary graph $G$ (which is used to define the $Route$ relation in $\query$) isomorphic to various graphs with $k$ edges.
For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then ${\Phi}\left(\prob_1,\dots, \prob_n\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant factor approximation. For example, using $\query^2$ from above and writing $\prob_A$ for $\probOf\pbox{A = 1}$ (and similarly for the other six variables), one can check that $\poly^2\inparen{\vct{\prob}}$ lies within a constant factor (depending only on the minimum probability) of $\rpoly^2\inparen{\vct{\prob}}$.
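Concretely, this constant-factor relationship can be checked numerically (a hypothetical Python sketch; the probability values are made up for illustration):

```python
# Hypothetical marginal probabilities p_A, p_B, p_C, p_D, p_X, p_Y, p_Z.
pA, pB, pC, pD, pX, pY, pZ = 0.9, 0.5, 0.4, 0.8, 0.3, 0.7, 0.6

# Phi^2 evaluated directly at the marginals (no exponent reduction).
phi_sq_at_p = (pA * pX * pB + pB * pY * pD + pB * pZ * pC) ** 2

# The reduced polynomial of Phi^2 at the same point (the true expectation).
rpoly_sq_at_p = (pA * pX * pB + pB * pY * pD + pB * pZ * pC
                 + 2 * pA * pX * pB * pY * pD
                 + 2 * pA * pX * pB * pZ * pC
                 + 2 * pB * pY * pD * pZ * pC)

# For probabilities in (0, 1], each monomial with exponents is at most its
# exponent-free counterpart, so the direct evaluation never overshoots.
ratio = phi_sq_at_p / rpoly_sq_at_p
assert 0.0 < ratio <= 1.0
```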
To get an $(1\pm \epsilon)$-multiplicative approximation we uniformly sample monomials from the \abbrSMB representation of $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.
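The sampling idea can be sketched as follows (a hypothetical Python illustration of the idea, not the algorithm analyzed later; monomials are drawn with probability proportional to their coefficients and their `adjusted' contributions averaged):

```python
import random

# Hypothetical marginal probabilities (same toy values as above).
p = {'A': 0.9, 'B': 0.5, 'C': 0.4, 'D': 0.8, 'X': 0.3, 'Y': 0.7, 'Z': 0.6}

# SMB representation of Phi^2: (coefficient, variables of the monomial).
monomials = [
    (1, 'AXB'), (1, 'BYD'), (1, 'BZC'),
    (2, 'AXBYD'), (2, 'AXBZC'), (2, 'BYDZC'),
]
total = sum(c for c, _ in monomials)

random.seed(0)
n_samples = 200_000
acc = 0.0
for _ in range(n_samples):
    # Draw a monomial with probability proportional to its coefficient.
    r = random.uniform(0, total)
    for c, vs in monomials:
        r -= c
        if r <= 0:
            break
    # Its adjusted contribution: product of marginals over *distinct*
    # variables, i.e. the monomial's value in the reduced polynomial.
    val = 1.0
    for v in set(vs):
        val *= p[v]
    acc += val

# Unbiased estimate of rpoly^2(p): scale the sample mean by the coefficient sum.
estimate = total * acc / n_samples
```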
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss how our results extend from computing expectations of lineage polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl}). Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}. All proofs are in the appendix.

\newcommand{\pd}{{\mathcal{P}_{\idb}}}%pd for probability distribution
\newcommand{\pdassign}{\mathcal{P}}
\newcommand{\pdb}{\mathcal{D}}
\newcommand{\encodedDB}{\textnormal{\db}}
\newcommand{\dbbase}{\db_\idb}
\newcommand{\pxdb}{\pdb_{\semNX}}
\newcommand{\nxdb}{D(\vct{X})}%\mathbb{N}[\vct{X}] db--Are we currently using this?
%PDB Abbreviations
\newcommand{\abbrPDB}{\textnormal{PDB}\xspace}
\newcommand{\abbrBPDB}{\textnormal{bag-PDB}\xspace}
\newcommand{\abbrTIDB}{\textnormal{TIDB}\xspace}%replace \ti with this
\newcommand{\abbrTIDBs}{\textnormal{TIDBs}\xspace}%replace \ti with this
\newcommand{\abbrBIDB}{\textnormal{BIDB}\xspace}