Addressing the deterministic database issue.

master
Aaron Huber 2022-06-07 10:10:19 -04:00
parent 07ea722712
commit 5a18732c08
4 changed files with 5 additions and 5 deletions

@ -39,7 +39,7 @@ Define a \emph{\abbrOneBIDB} to be the pair $\pdb' = \inparen{\bigtimes_{\tup\in
%\footnote{We slightly abuse notation here, denoting a world vector as $W$ rather than $\worldvec$ to distinguish between the random variable and the world instance. When there is no ambiguity, we will denote a world vector as $\worldvec$.}
\end{Definition}
Lineage polynomials for arbitrary deterministic $\gentupset'$ are constructed in a manner analogous to $1$-\abbrTIDB\xplural (see \Cref{fig:nxDBSemantics}), differing only in the base case.
Lineage polynomials for arbitrary \dbbaseName $\gentupset'$ are constructed in a manner analogous to $1$-\abbrTIDB\xplural (see \Cref{fig:nxDBSemantics}), differing only in the base case.
In a $1$-\abbrTIDB, each tuple contributes a multiplicity of 0 or 1, and $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$. %\textcolor{red}{CHANGE}
In a \abbrOneBIDB, each tuple $\tup\in\tupset'$ contributes its corresponding multiplicity: %\textcolor{red}{CHANGE}
$\polyqdt{\rel}{\gentupset}{\tup} = c_\tup\cdot X_\tup$. These semantics are fully detailed in \Cref{fig:lin-poly-bidb}.
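% Illustrative sketch, not part of the original text: the value $c_\tup = 3$ is an assumption chosen only for this example, and $X_\tup$ is assumed to range over $\{0,1\}$.
As a minimal illustration of how the two base cases differ, consider a single input tuple $\tup\in\rel$ and suppose, purely for the sake of example, that $c_\tup = 3$:
\begin{align*}
\text{$1$-\abbrTIDB:}\quad & \polyqdt{\rel}{\gentupset}{\tup} = X_\tup, \\
\text{\abbrOneBIDB:}\quad & \polyqdt{\rel}{\gentupset}{\tup} = c_\tup\cdot X_\tup = 3\cdot X_\tup.
\end{align*}
Setting $X_\tup = 1$ then yields multiplicity $1$ in the first case and $3$ in the second, while $X_\tup = 0$ yields multiplicity $0$ in both.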

@ -1,4 +1,4 @@
%!TEX root=./main.tex
%!TEX root= prob-def.tex
\subsection{Deterministic Query Runtimes}\label{sec:gen}
%We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries under reasonable assumptions.

@ -5,7 +5,7 @@ We have studied the problem of calculating the expected multiplicity of a bag-qu
a problem that has a practical application in probabilistic databases over multisets.
We show that, under various parameterized complexity hardness results/conjectures, computing the expected multiplicities exactly is not possible in time linear in the corresponding deterministic query processing time.
We prove that it is possible to approximate the expectation of a lineage polynomial in linear time
in the deterministic query processing over TIDBs and BIDBs (assuming that there are few cancellations).
in the deterministic query processing over TIDBs and BIDBs (assuming that there are few cancellations).
Interesting directions for future work include development of a dichotomy for bag \abbrPDB\xplural. While we can handle higher moments (this follows fairly easily from our existing results-- see \Cref{sec:momemts}), more general approximations are an interesting area for exploration, including those for more general data models.
%%% Local Variables:

@ -51,7 +51,7 @@ An $\raPlus$ query is a query expressed in positive relational algebra, i.e., us
\end{align*}%\\[-10mm]
%\setlength{\abovecaptionskip}{-0.25cm}
\savecaptionspace{
\caption{Lineage polynomial semantics given $\raPlus$ query $\query$, arbitrary deterministic database $\gentupset$ with variables $\inparen{X_\tup}_{\tup \in\gentupset}$, where for $\rel\in\gentupset$, $\tup\in\rel$, the base case is $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$.}% for any $\rel\in\gentupset$ and $\tup\in\rel$.}% consists of all $X_\tup$ over all $\rel$ in $\gentupset$ and $\tup$ in $\rel$, such that the base case $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$.} %Here $\gentupset.\rel$ denotes the instance of relation $\rel$ in $\gentupset$. Please note, after we introduce the reduction to $1$-\abbrBIDB, the base case will be expressed alternatively. The base case is $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$}
\caption{Lineage polynomial semantics given $\raPlus$ query $\query$, arbitrary \dbbaseName $\gentupset$ with variables $\inparen{X_\tup}_{\tup \in\gentupset}$, where for $\rel\in\gentupset$, $\tup\in\rel$, the base case is $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$.}% for any $\rel\in\gentupset$ and $\tup\in\rel$.}% consists of all $X_\tup$ over all $\rel$ in $\gentupset$ and $\tup$ in $\rel$, such that the base case $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$.} %Here $\gentupset.\rel$ denotes the instance of relation $\rel$ in $\gentupset$. Please note, after we introduce the reduction to $1$-\abbrBIDB, the base case will be expressed alternatively. The base case is $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$}
\label{fig:nxDBSemantics}
}{\abovecapshrink}{\belowcapshrink}
%\vspace{-0.53cm}
@ -112,7 +112,7 @@ Those with `Multiple' in the second column need the algorithm to be able to hand
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Our lower bound results}
%
Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$ over a deterministic database $\gentupset$ where the maximum multiplicity of any tuple is less than or equal to $\bound$. % This paper considers $\raPlus$ queries, for which order of operations is \emph{explicit}, as opposed to other query languages, e.g. Datalog, UCQ. Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} \approx \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{The upper bounds on runtime that we derive apply pointwise to any $\query \in\raPlus$, allowing us to abstract away the specific heuristics for choosing an optimized query (i.e., Any deterministic query optimization heuristic is equally useful for \abbrCTIDB queries).}\BG{Rewrite: since an optimized Q is also a Q this also applies in the case where there is a query optimizer the rewrites Q}
Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$ over a \dbbaseName $\gentupset$ where the maximum multiplicity of any tuple is less than or equal to $\bound$. % This paper considers $\raPlus$ queries, for which order of operations is \emph{explicit}, as opposed to other query languages, e.g. Datalog, UCQ. Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} \approx \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{The upper bounds on runtime that we derive apply pointwise to any $\query \in\raPlus$, allowing us to abstract away the specific heuristics for choosing an optimized query (i.e., Any deterministic query optimization heuristic is equally useful for \abbrCTIDB queries).}\BG{Rewrite: since an optimized Q is also a Q this also applies in the case where there is a query optimizer the rewrites Q}
Our question is whether or not it is always true that for every $\query$, $\timeOf{}^*\inparen{\query, \pdb, \bound}\leq \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$. We remark that the issue of query optimization is orthogonal to this question (recall that an $\raPlus$ query explicitly encodes order of operations) since we want to answer the above question for all $\query$. \emph{Specifically, if there is an equivalent query $\query'$ that is more efficient to evaluate, we allow both deterministic and probabilistic query processing access to $\query'$}.
Unfortunately, the answer to the above question is no--