set to bag
parent
626e86536f
commit
b1ffd39f78
|
@ -1,8 +1,28 @@
|
|||
|
||||
\section{Generalizing Results Beyond set-TIDBs}
|
||||
\section{Generalizing Beyond Set Inputs}
|
||||
\label{sec:gener-results-beyond}
|
||||
|
||||
For results for \abbrTIDBs, we assumed a model of \abbrTIDBs where each input tuple is assigned a probability of having multiplicity $1$.
|
||||
\subsection{\abbrTIDB{}s}
|
||||
\label{sec:abbrtidbs}
|
||||
|
||||
For our results on \abbrTIDBs, we assumed a model in which each input tuple is assigned a probability $p$ of having multiplicity $1$. That is, we assumed inputs to be sets, but interpreted queries under bag semantics. Other sensible generalizations of \abbrTIDBs from sets to bags exist.
|
||||
|
||||
One important such generalization is to assign each input tuple $\tup$ a multiplicity $m_\tup$ and a probability $p$: with probability $p$ the tuple exists with multiplicity $m_\tup$, and otherwise it has multiplicity $0$. If the maximal multiplicity of all tuples in the \abbrTIDB is bounded by some constant, then our hardness results and approximation algorithm can be generalized by changing the construction of lineage polynomials as follows:
|
||||
|
||||
\begin{align*}
|
||||
\polyqdt{\rel}{\dbbase}{\tup} =&\begin{cases}
|
||||
m_\tup X_\tup & \text{if }\dbbase.\rel\inparen{\tup} = m_\tup \\
|
||||
0 &\text{otherwise.}\end{cases}
|
||||
\end{align*}
|
||||
That is, the variable representing a tuple is multiplied by $m_\tup$ to encode the tuple's multiplicity.
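As a small worked illustration (a sketch only, assuming that $X_\tup$ is a $\{0,1\}$-valued random variable that equals $1$ with probability $p$; the concrete numbers are hypothetical), linearity of expectation gives the expected multiplicity of a single input tuple:
\begin{align*}
\mathrm{E}\left[m_\tup X_\tup\right] &= m_\tup \cdot \mathrm{E}\left[X_\tup\right] = m_\tup \cdot p, \\
m_\tup = 3,\ p = 0.5 &\implies \mathrm{E}\left[m_\tup X_\tup\right] = 1.5.
\end{align*}
Under this assumption, the constant $m_\tup$ only scales the coefficients of the lineage polynomial and leaves its variables unchanged.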
|
||||
|
||||
Yet another option would be to assign each tuple a probability distribution over multiplicities. It seems clear that our results would not extend to a model that allows arbitrary probability distributions for this purpose. However, we would like to note that the special case of a normal distribution over multiplicities can be handled as follows: we add an additional identifier attribute to each relation in the database. For a tuple $\tup$ with maximal multiplicity $m_\tup$, we create $m_\tup$ copies of $\tup$ with different identifiers. To answer a query over this encoding, we first project away the identifier attribute.
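To illustrate this encoding (a sketch under the assumption, not stated above, that the $m_\tup$ copies are independent and that copy $i$ is present with probability $p_{\tup,i}$; the variables $X_{\tup,i}$ are introduced here only for illustration), the multiplicity of $\tup$ after projecting away the identifier is the sum of the variables of its copies:
\begin{align*}
\sum_{i=1}^{m_\tup} X_{\tup,i}, \qquad \mathrm{E}\left[\sum_{i=1}^{m_\tup} X_{\tup,i}\right] = \sum_{i=1}^{m_\tup} p_{\tup,i}.
\end{align*}
If all copies share the same probability $p$, this sum follows a binomial distribution with parameters $m_\tup$ and $p$, and its mean is $m_\tup \cdot p$.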
|
||||
|
||||
\subsection{\abbrBIDB{}s}
|
||||
\label{sec:abbrbidbs}
|
||||
|
||||
The approach described above works for \abbrBIDB{}s as well if we define the bag version of \abbrBIDB{}s to associate each tuple $\tup$ with a multiplicity $m_\tup$. Recall that we associate each tuple in a block with a unique variable. Thus, the modified lineage polynomial construction shown above can be applied to \abbrBIDB{}s too.
|
||||
|
||||
|
||||
|
||||
%%% Local Variables:
|
||||
|
|
|
@ -85,11 +85,11 @@ In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for s
|
|||
|
||||
\mypar{\abbrTIDB\xplural}
|
||||
We initially focus on tuple-independent probabilistic bag-databases\footnote{See \cite{DBLP:series/synthesis/2011Suciu} for a survey of set-\abbrTIDBs; the bag encoding is analogous~\cite{DBLP:conf/pods/GreenKT07}.} (\abbrTIDB\xplural), a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world is modeled as an independent probabilistic event.\footnote{
|
||||
This model is exactly the definition of \abbrTIDB{}s \cite{VS17} under classical set semantics.
|
||||
Mirroring the implementation of bag relations in production database systems (e.g., Postgresql, DB2), tuple multiplicities are modeled by retaining copies of each tuple (up to its largest possible multiplicity).
|
||||
% To make each duplicate tuple unique in a set-\abbrTIDB we can assign unique keys across all duplicates.
|
||||
When the multiplicity of each input tuple is bounded by some constant,
|
||||
the increased input size is negligible.\label{footnote:set-not-limit}
|
||||
This model is exactly the definition of \abbrTIDB{}s \cite{VS17} under set semantics. Note that this is only one possible definition of \abbrTIDB{}s under bag semantics. In \Cref{sec:gener-results-beyond} we discuss alternatives and to what degree our results extend to these alternatives.
|
||||
% Mirroring the implementation of bag relations in production database systems (e.g., Postgresql, DB2), tuple multiplicities are modeled by retaining copies of each tuple (up to its largest possible multiplicity).
|
||||
% % To make each duplicate tuple unique in a set-\abbrTIDB we can assign unique keys across all duplicates.
|
||||
% When the multiplicity of input tuple is bound by some constant,
|
||||
% the increased input size is negligible.\label{footnote:set-not-limit}
|
||||
}
|
||||
% OK: I tidied things up a touch.
|
||||
%\BG{The footnote is still a bit hard to follow I think, but I do not have a great suggestion on how to improve it.}
|
||||
|
@ -246,7 +246,7 @@ A key insight of this paper is that the representation of $\circuit$ matters.
|
|||
For example, if we insist that $\circuit$ represent the lineage polynomial in the standard monomial basis (henceforth, \abbrSMB)\footnote{
|
||||
This is the representation, typically used in set-\abbrPDB\xplural, where the polynomial is represented as a sum of `pure' products. See \Cref{def:smb} for a formal definition.
|
||||
}, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$,
|
||||
%\BG{should be $|\idb |$?},
|
||||
%\BG{should be $|\idb |$?},
|
||||
%Atri: No, this is fine
|
||||
and hence, just $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ will be too large.
|
||||
|
||||
|
@ -306,11 +306,11 @@ Given one circuit $\circuit$ that encodes $\apolyqdt$ for all result tuples $\tu
|
|||
%graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
|
||||
%(ii) To complement our hardness results, we consider an approximate version of~\Cref{prob:intro-stmt}, where instead of computing the expected multiplicity exactly, we allow for a $(1\pm\epsilon)$-\emph{multiplicative} approximation of the expected multiplicity.
|
||||
|
||||
(i) We show that %for typical database usage patterns\BG{Not sure what we mean by that?}
|
||||
e.g. when the circuit is a tree
|
||||
%or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}),
|
||||
and there is a single
|
||||
result tuple, %\BG{This sounds like we restricting the discussion to queries that return a single tuple}\AR{Does moving the footnote help?},
|
||||
(i) We show that %for typical database usage patterns\BG{Not sure what we mean by that?}
|
||||
e.g. when the circuit is a tree
|
||||
%or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}),
|
||||
and there is a single
|
||||
result tuple, %\BG{This sounds like we restricting the discussion to queries that return a single tuple}\AR{Does moving the footnote help?},
|
||||
the answer to \Cref{prob:intro-stmt} for \abbrTIDB is {\em yes} (we can also handle the case of multiple result tuples\footnote{We can approximate the expected result tuple multiplicities (for all result tuples {\em simultaneously}) with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of result tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}).
|
||||
Further, we show that for {\em any} $\raPlus$ query on a \abbrTIDB, the answer to \Cref{prob:big-o-joint-steps} is also yes.
|
||||
% the approximation algorithm has runtime linear in the size of the compressed lineage encoding (
|
||||
|
|