set to bag

Boris Glavic 2021-09-20 08:54:17 -05:00
parent 626e86536f
commit b1ffd39f78
2 changed files with 33 additions and 13 deletions

@ -1,8 +1,28 @@
\section{Generalizing Results Beyond set-TIDBs}
\section{Generalizing Beyond Set Inputs}
\label{sec:gener-results-beyond}
For results for \abbrTIDBs, we assumed a model of \abbrTIDBs where each input tuple is assigned a probability of having multiplicity $1$.
\subsection{\abbrTIDB{}s}
\label{sec:abbrtidbs}
Our results for \abbrTIDBs assume a model where each input tuple is assigned a probability $p$ of having multiplicity $1$. That is, we assume the inputs to be sets, but interpret queries under bag semantics. Other sensible generalizations of \abbrTIDBs from sets to bags exist.
One important such generalization is to assign each input tuple $\tup$ a multiplicity $m_\tup$ and a probability $p$: the tuple exists with multiplicity $m_\tup$ with probability $p$, and otherwise has multiplicity $0$. If the maximal multiplicity of all tuples in the \abbrTIDB is bounded by some constant, then a generalization of our hardness results and approximation algorithm can be achieved by changing the construction of lineage polynomials as follows:
\begin{align*}
\polyqdt{\rel}{\dbbase}{\tup} =&\begin{cases}
m_\tup X_\tup & \text{if }\dbbase.\rel\inparen{\tup} = m_\tup \\
0 &\text{otherwise.}\end{cases}
\end{align*}
That is, the variable representing a tuple is multiplied by $m_\tup$ to encode the tuple's multiplicity.
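As a small worked illustration (the concrete values are assumptions chosen only for this example), consider a single input tuple $\tup$ with $m_\tup = 3$ and $p = 0.5$. For the identity query, the lineage polynomial of $\tup$ is $3X_\tup$; for a self-join that pairs $\tup$ with itself, it is $(3X_\tup)^2$. Since $X_\tup \in \{0,1\}$ implies $X_\tup^2 = X_\tup$, the expected multiplicities are
\begin{align*}
\mathbb{E}\left[3X_\tup\right] &= 3 \cdot p = 1.5, \\
\mathbb{E}\left[(3X_\tup)^2\right] &= 9 \cdot \mathbb{E}\left[X_\tup\right] = 9 \cdot p = 4.5.
\end{align*}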
Yet another option would be to assign each tuple a probability distribution over multiplicities. It seems clear that our results would not extend to a model that allows arbitrary probability distributions for this purpose. However, we would like to note that the special case of a normal distribution over multiplicities can be handled as follows: we add an additional identifier attribute to each relation in the database. For a tuple $\tup$ with maximal multiplicity $m_\tup$, we create $m_\tup$ copies of $\tup$ with different identifiers. To answer a query over this encoding, we first project away the identifier attribute.
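To sketch this encoding on a hypothetical instance (the identifiers and the variable names $X_{\tup,1}$ and $X_{\tup,2}$ are assumptions made for illustration), let $\tup$ have maximal multiplicity $m_\tup = 2$. The encoding stores the copies $(\tup, 1)$ and $(\tup, 2)$, annotated with $X_{\tup,1}$ and $X_{\tup,2}$, respectively. Projecting away the identifier attribute under bag semantics sums these annotations, so the lineage polynomial of $\tup$ is $X_{\tup,1} + X_{\tup,2}$, and by linearity of expectation
\begin{align*}
\mathbb{E}\left[X_{\tup,1} + X_{\tup,2}\right] &= \Pr\left[X_{\tup,1} = 1\right] + \Pr\left[X_{\tup,2} = 1\right].
\end{align*}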
\subsection{\abbrBIDB{}s}
\label{sec:abbrbidbs}
The approach described above works for \abbrBIDB{}s as well if we define the bag version of \abbrBIDB{}s to associate each tuple $\tup$ with a multiplicity $m_\tup$. Recall that we associate each tuple in a block with a unique variable. Thus, the modified lineage polynomial construction shown above can be applied to \abbrBIDB{}s too.
%%% Local Variables:

@ -85,11 +85,11 @@ In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for s
\mypar{\abbrTIDB\xplural}
We initially focus on tuple-independent probabilistic bag-databases\footnote{See \cite{DBLP:series/synthesis/2011Suciu} for a survey of set-\abbrTIDBs; the bag encoding is analogous~\cite{DBLP:conf/pods/GreenKT07}.} (\abbrTIDB\xplural), a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world is modeled as an independent probabilistic event.\footnote{
This model is exactly the definition of \abbrTIDB{}s \cite{VS17} under classical set semantics.
Mirroring the implementation of bag relations in production database systems (e.g., Postgresql, DB2), tuple multiplicities are modeled by retaining copies of each tuple (up to its largest possible multiplicity).
% To make each duplicate tuple unique in a set-\abbrTIDB we can assign unique keys across all duplicates.
When the multiplicity of input tuple is bound by some constant,
the increased input size is negligible.\label{footnote:set-not-limit}
This model is exactly the definition of \abbrTIDB{}s \cite{VS17} under set semantics. Note that this is only one possible definition of \abbrTIDB{}s under bag semantics. In \Cref{sec:gener-results-beyond} we discuss alternatives and to what degree our results extend to these alternatives.
% Mirroring the implementation of bag relations in production database systems (e.g., Postgresql, DB2), tuple multiplicities are modeled by retaining copies of each tuple (up to its largest possible multiplicity).
% % To make each duplicate tuple unique in a set-\abbrTIDB we can assign unique keys across all duplicates.
% When the multiplicity of input tuple is bound by some constant,
% the increased input size is negligible.\label{footnote:set-not-limit}
}
% OK: I tidied things up a touch.
%\BG{The footnote is still a bit hard to follow I think, but I do not have a great suggestion on how to improve it.}
@ -246,7 +246,7 @@ A key insight of this paper is that the representation of $\circuit$ matters.
For example, if we insist that $\circuit$ represent the lineage polynomial in the standard monomial basis (henceforth, \abbrSMB)\footnote{
This is the representation, typically used in set-\abbrPDB\xplural, where the polynomial is represented as a sum of `pure' products. See \Cref{def:smb} for a formal definition.
}, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$,
%\BG{should be $|\idb |$?},
%\BG{should be $|\idb |$?},
%Atri: No, this is fine
and hence, just $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ will be too large.
@ -306,11 +306,11 @@ Given one circuit $\circuit$ that encodes $\apolyqdt$ for all result tuples $\tu
%graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
%(ii) To complement our hardness results, we consider an approximate version of~\Cref{prob:intro-stmt}, where instead of computing the expected multiplicity exactly, we allow for an $(1\pm\epsilon)$-\emph{multiplicative} approximation of the expected multiplicitly.
(i) We show that %for typical database usage patterns\BG{Not sure what we mean by that?}
e.g. when the circuit is a tree
%or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}),
and there is a single
result tuple, %\BG{This sounds like we restricting the discussion to queries that return a single tuple}\AR{Does moving the footnote help?},
(i) We show that %for typical database usage patterns\BG{Not sure what we mean by that?}
e.g. when the circuit is a tree
%or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}),
and there is a single
result tuple, %\BG{This sounds like we restricting the discussion to queries that return a single tuple}\AR{Does moving the footnote help?},
the answer to \Cref{prob:intro-stmt} for \abbrTIDB is {\em yes} (we can also handle the case of multiple result tuples\footnote{We can approximate the expected result tuple multiplicities (for all result tuples {\em simultaneously}) with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of result tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}).
Further, we show that for {\em any} $\raPlus$ query on a \abbrTIDB, the answer to \Cref{prob:big-o-joint-steps} is also yes.
% the approximation algorithm has runtime linear in the size of the compressed lineage encoding (