diff --git a/app_set_to_bag_pdb.tex b/app_set_to_bag_pdb.tex index 623281a..e02fd74 100644 --- a/app_set_to_bag_pdb.tex +++ b/app_set_to_bag_pdb.tex @@ -1,8 +1,28 @@ -\section{Generalizing Results Beyond set-TIDBs} +\section{Generalizing Beyond Set Inputs} \label{sec:gener-results-beyond} -For results for \abbrTIDBs, we assumed a model of \abbrTIDBs where each input tuple is assigned a probability of having multiplicity $1$. +\subsection{\abbrTIDB{}s} +\label{sec:abbrtidbs} + +For our results on \abbrTIDBs, we assumed a model of \abbrTIDBs where each input tuple is assigned a probability $p$ of having multiplicity $1$. That is, we assumed inputs to be sets, but interpreted queries under bag semantics. Other sensible generalizations of \abbrTIDBs from sets to bags exist. + +One important such generalization is to assign each input tuple $\tup$ a multiplicity $m_\tup$ and probability $p$: with probability $p$ the tuple exists with multiplicity $m_\tup$, and otherwise it has multiplicity $0$. If the maximal multiplicity over all tuples in the \abbrTIDB is bounded by some constant, then a generalization of our hardness results and approximation algorithm can be achieved by changing the construction of lineage polynomials as follows: + +\begin{align*} + \polyqdt{\rel}{\dbbase}{\tup} =&\begin{cases} + m_\tup X_\tup & \text{if }\dbbase.\rel\inparen{\tup} = m_\tup \\ + 0 &\text{otherwise.}\end{cases} +\end{align*} +That is, the variable representing a tuple is multiplied by $m_\tup$ to encode the tuple's multiplicity $m_\tup$. + +Yet another option would be to assign each tuple a probability distribution over multiplicities. It seems clear that our results would not extend to a model that allows arbitrary probability distributions for this purpose. However, we would like to note that the special case of a normal distribution over multiplicities can be handled as follows: we add an additional identifier attribute to each relation in the database. 
For a tuple $\tup$ with maximal multiplicity $m_\tup$, we create $m_\tup$ copies of $\tup$ with different identifiers. To answer a query over this encoding, we first project away the identifier attribute. + +\subsection{\abbrBIDB{}s} +\label{sec:abbrbidbs} + +The approach described above works for \abbrBIDB{}s as well if we define the bag version of \abbrBIDB{}s to associate each tuple $\tup$ with a multiplicity $m_\tup$. Recall that we associate each tuple in a block with a unique variable. Thus, the modified lineage polynomial construction shown above can be applied to \abbrBIDB{}s too. + %%% Local Variables: diff --git a/intro-rewrite-070921.tex b/intro-rewrite-070921.tex index dea4c7e..d92508e 100644 --- a/intro-rewrite-070921.tex +++ b/intro-rewrite-070921.tex @@ -85,11 +85,11 @@ In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for s \mypar{\abbrTIDB\xplural} We initially focus on tuple-independent probabilistic bag-databases\footnote{See \cite{DBLP:series/synthesis/2011Suciu} for a survey of set-\abbrTIDBs; the bag encoding is analogous~\cite{DBLP:conf/pods/GreenKT07}.} (\abbrTIDB\xplural), a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world is modeled as an independent probabilistic event.\footnote{ - This model is exactly the definition of \abbrTIDB{}s \cite{VS17} under classical set semantics. - Mirroring the implementation of bag relations in production database systems (e.g., Postgresql, DB2), tuple multiplicities are modeled by retaining copies of each tuple (up to its largest possible multiplicity). - % To make each duplicate tuple unique in a set-\abbrTIDB we can assign unique keys across all duplicates. - When the multiplicity of input tuple is bound by some constant, - the increased input size is negligible.\label{footnote:set-not-limit} + This model is exactly the definition of \abbrTIDB{}s \cite{VS17} under set semantics. 
Note that this is only one possible definition of \abbrTIDB{}s under bag semantics. In \Cref{sec:gener-results-beyond} we discuss alternatives and to what degree our results extend to them. + % Mirroring the implementation of bag relations in production database systems (e.g., Postgresql, DB2), tuple multiplicities are modeled by retaining copies of each tuple (up to its largest possible multiplicity). + % % To make each duplicate tuple unique in a set-\abbrTIDB we can assign unique keys across all duplicates. + % When the multiplicity of input tuple is bound by some constant, + % the increased input size is negligible.\label{footnote:set-not-limit} } % OK: I tidied things up a touch. %\BG{The footnote is still a bit hard to follow I think, but I do not have a great suggestion on how to improve it.} @@ -246,7 +246,7 @@ A key insight of this paper is that the representation of $\circuit$ matters. For example, if we insist that $\circuit$ represent the lineage polynomial in the standard monomial basis (henceforth, \abbrSMB)\footnote{ This is the representation, typically used in set-\abbrPDB\xplural, where the polynomial is represented as a sum of `pure' products. See \Cref{def:smb} for a formal definition. }, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$, -%\BG{should be $|\idb |$?}, +%\BG{should be $|\idb |$?}, %Atri: No, this is fine and hence, just $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ will be too large. 
@@ -306,11 +306,11 @@ Given one circuit $\circuit$ that encodes $\apolyqdt$ for all result tuples $\tu %graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$; %(ii) To complement our hardness results, we consider an approximate version of~\Cref{prob:intro-stmt}, where instead of computing the expected multiplicity exactly, we allow for an $(1\pm\epsilon)$-\emph{multiplicative} approximation of the expected multiplicitly. -(i) We show that %for typical database usage patterns\BG{Not sure what we mean by that?} -e.g. when the circuit is a tree -%or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}), - and there is a single - result tuple, %\BG{This sounds like we restricting the discussion to queries that return a single tuple}\AR{Does moving the footnote help?}, +(i) We show that %for typical database usage patterns\BG{Not sure what we mean by that?} +e.g. 
when the circuit is a tree +%or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}), + and there is a single + result tuple, %\BG{This sounds like we restricting the discussion to queries that return a single tuple}\AR{Does moving the footnote help?}, the answer to \Cref{prob:intro-stmt} for \abbrTIDB is {\em yes} (we can also handle the case of multiple result tuples\footnote{We can approximate the expected result tuple multiplicities (for all result tuples {\em simultaneously}) with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of result tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}). Further, we show that for {\em any} $\raPlus$ query on a \abbrTIDB, the answer to \Cref{prob:big-o-joint-steps} is also yes. % the approximation algorithm has runtime linear in the size of the compressed lineage encoding (