set to bag

2021-09-20 08:54:17 -05:00 · 2021-09-20 08:54:17 -05:00 · b1ffd39f78
parent 626e86536f
commit b1ffd39f78
2 changed files with 33 additions and 13 deletions
--- a/app_set_to_bag_pdb.tex
+++ b/app_set_to_bag_pdb.tex
@ -1,8 +1,28 @@

-\section{Generalizing Results Beyond set-TIDBs}
+\section{Generalizing Beyond Set Inputs}
 \label{sec:gener-results-beyond}

-For results for \abbrTIDBs, we assumed a model of \abbrTIDBs where each input tuple is assigned a probability of having multiplicity $1$.
+\subsection{\abbrTIDB{}s}
+\label{sec:abbrtidbs}
+
+In our results for \abbrTIDBs, we assumed a model of \abbrTIDBs where each input tuple is assigned a probability $p$ of having multiplicity $1$. That is, we assumed inputs to be sets, but interpreted queries under bag semantics. Other sensible generalizations of \abbrTIDBs from sets to bags exist.
+
+One important such generalization is to assign each input tuple $\tup$ a multiplicity $m_\tup$ and a probability $p$: the tuple exists with multiplicity $m_\tup$ with probability $p$, and otherwise has multiplicity $0$. If the maximal multiplicity of all tuples in the \abbrTIDB is bounded by some constant, then our hardness results and approximation algorithm can be generalized by changing the construction of lineage polynomials as follows:
+
+\begin{align*}
+  \polyqdt{\rel}{\dbbase}{\tup} =&\begin{cases}
+                                           		m_\tup X_\tup & \text{if }\dbbase.\rel\inparen{\tup} = m_\tup \\
+                                           		0		 &\text{otherwise.}\end{cases}
+\end{align*}
+That is, the variable representing a tuple is multiplied by $m_\tup$ to encode the tuple's multiplicity $m_\tup$.
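+
+As a minimal worked instance of this construction, consider a tuple $\tup$ with $m_\tup = 3$ and $p = 0.5$. The concrete numbers and the symbol $\mathbb{E}$ for expectation are illustrative assumptions rather than notation fixed elsewhere in the paper, and $X_\tup$ is taken to be the $\{0,1\}$-valued variable that equals $1$ with probability $p$:
+\begin{align*}
+  \polyqdt{\rel}{\dbbase}{\tup} &= 3 X_\tup, &
+  \mathbb{E}\left[3 X_\tup\right] &= 3 \cdot \mathbb{E}\left[X_\tup\right] = 3 \cdot 0.5 = 1.5.
+\end{align*}
+In this sketch the expected multiplicity of the input tuple is simply scaled by $m_\tup$; the set of variables is unchanged.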
+
+Yet another option would be to assign each tuple a probability distribution over multiplicities. It seems clear that our results would not extend to a model that allows arbitrary probability distributions for this purpose. However, we note that the special case of a normal distribution over multiplicities can be handled as follows: we add an additional identifier attribute to each relation in the database. For a tuple $\tup$ with maximal multiplicity $m_\tup$, we create $m_\tup$ copies of $\tup$ with different identifiers. To answer a query over this encoding, we first project away the identifier attribute.
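+
+One way to make this encoding concrete is sketched below. It assumes, purely for illustration, that the $j$-th copy of $\tup$ is an independent tuple with its own variable $X_{\tup,j}$ and probability $p_{\tup,j}$; these symbols and $\mathbb{E}$ for expectation are hypothetical notation. Under bag semantics, projecting away the identifier attribute sums the copies, so the lineage polynomial of $\tup$ and its expectation are
+\begin{align*}
+  \sum_{j=1}^{m_\tup} X_{\tup,j}
+  \qquad\text{and}\qquad
+  \mathbb{E}\left[\sum_{j=1}^{m_\tup} X_{\tup,j}\right] = \sum_{j=1}^{m_\tup} p_{\tup,j}.
+\end{align*}
+The equality on the right uses only linearity of expectation.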
+
+\subsection{\abbrBIDB{}s}
+\label{sec:abbrbidbs}
+
+The approach described above works for \abbrBIDB{}s as well if we define the bag version of \abbrBIDB{}s to associate each tuple $\tup$ with a multiplicity $m_\tup$. Recall that we associate each tuple in a block with a unique variable. Thus, the modified lineage polynomial construction shown above can be applied to \abbrBIDB{}s too.
+


 %%% Local Variables:
--- a/intro-rewrite-070921.tex
+++ b/intro-rewrite-070921.tex
@ -85,11 +85,11 @@ In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for s

 \mypar{\abbrTIDB\xplural}
 We initially focus on tuple-independent probabilistic bag-databases\footnote{See \cite{DBLP:series/synthesis/2011Suciu} for a survey of set-\abbrTIDBs; the bag encoding is analogous~\cite{DBLP:conf/pods/GreenKT07}.} (\abbrTIDB\xplural), a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world is modeled as an independent probabilistic event.\footnote{
-  This model is exactly the definition of \abbrTIDB{}s \cite{VS17} under classical set semantics.
-  Mirroring the implementation of bag relations in production database systems (e.g., Postgresql, DB2), tuple multiplicities are modeled by retaining copies of each tuple (up to its largest possible multiplicity).
-  % To make each duplicate tuple unique in a set-\abbrTIDB we can assign unique keys across all duplicates.
-  When the multiplicity of input tuple is bound by some constant,
-  the increased input size is negligible.\label{footnote:set-not-limit}
+  This model is exactly the definition of \abbrTIDB{}s \cite{VS17} under set semantics. Note that this is only one possible definition of \abbrTIDB{}s under bag semantics. In \Cref{sec:gener-results-beyond} we discuss alternatives and the degree to which our results extend to them.
+  % Mirroring the implementation of bag relations in production database systems (e.g., Postgresql, DB2), tuple multiplicities are modeled by retaining copies of each tuple (up to its largest possible multiplicity).
+  % % To make each duplicate tuple unique in a set-\abbrTIDB we can assign unique keys across all duplicates.
+  % When the multiplicity of input tuple is bound by some constant,
+  % the increased input size is negligible.\label{footnote:set-not-limit}
 }
 % OK: I tidied things up a touch.
 %\BG{The footnote is still a bit hard to follow I think, but I do not have a great suggestion on how to improve it.}
@ -246,7 +246,7 @@ A key insight of this paper is that the representation of $\circuit$ matters.
 For example, if we insist that $\circuit$ represent the lineage polynomial in the standard monomial basis (henceforth, \abbrSMB)\footnote{
   This is the representation, typically used in set-\abbrPDB\xplural, where the polynomial is represented as a sum of `pure' products. See \Cref{def:smb} for a formal definition.
 }, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$,
-%\BG{should be $|\idb |$?}, 
+%\BG{should be $|\idb |$?},
 %Atri: No, this is fine
 and hence, just $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ will be too large.

@ -306,11 +306,11 @@ Given one circuit $\circuit$ that encodes $\apolyqdt$ for all result tuples $\tu
 %graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
 %(ii) To complement our hardness results, we consider an approximate version of~\Cref{prob:intro-stmt}, where instead of computing the expected multiplicity exactly, we allow for  an $(1\pm\epsilon)$-\emph{multiplicative} approximation of the expected multiplicitly.

-(i) We show that %for typical database usage patterns\BG{Not sure what we mean by that?} 
-e.g. when the circuit is a tree 
-%or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}), 
- and there is a single 
- result tuple, %\BG{This sounds like we restricting the discussion to queries that return a single tuple}\AR{Does moving the footnote help?}, 
+(i) We show that %for typical database usage patterns\BG{Not sure what we mean by that?}
+e.g. when the circuit is a tree
+%or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}),
+ and there is a single
+ result tuple, %\BG{This sounds like we restricting the discussion to queries that return a single tuple}\AR{Does moving the footnote help?},
 the answer to \Cref{prob:intro-stmt} for \abbrTIDB is {\em yes} (we can also handle the case of multiple result tuples\footnote{We can approximate the expected result tuple multiplicities (for all result tuples {\em simultaneously}) with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of result tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}).
 Further, we show that for {\em any} $\raPlus$ query on a \abbrTIDB, the answer to \Cref{prob:big-o-joint-steps} is also yes.
 % the approximation algorithm has runtime linear in the size of the compressed lineage encoding (