diff --git a/binarybidb.tex b/binarybidb.tex index a72c939..ad70ded 100644 --- a/binarybidb.tex +++ b/binarybidb.tex @@ -39,7 +39,7 @@ Define a \emph{\abbrOneBIDB} to be the pair $\pdb' = \inparen{\bigtimes_{\tup\in %\footnote{We slightly abuse notation here, denoting a world vector as $W$ rather than $\worldvec$ to distinguish between the random variable and the world instance. When there is no ambiguity, we will denote a world vector as $\worldvec$.} \end{Definition} -Lineage polynomials for arbitrary deterministic $\gentupset'$ are constructed in a manner analogous to $1$-\abbrTIDB\xplural (see \Cref{fig:nxDBSemantics}), differing only in the base case. +Lineage polynomials for arbitrary \dbbaseName $\gentupset'$ are constructed in a manner analogous to $1$-\abbrTIDB\xplural (see \Cref{fig:nxDBSemantics}), differing only in the base case. In a $1$-\abbrTIDB, each tuple contributes a multiplicity of 0 or 1, and $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$. %\textcolor{red}{CHANGE} In a \abbrOneBIDB, each tuple $\tup\in\tupset'$ contributes its corresponding multiplicity: %\textcolor{red}{CHANGE} $\polyqdt{\rel}{\gentupset}{\tup} = c_\tup\cdot X_\tup$. These semantics are fully detailed in \Cref{fig:lin-poly-bidb}. diff --git a/circuits-model-runtime.tex b/circuits-model-runtime.tex index d207032..156e007 100644 --- a/circuits-model-runtime.tex +++ b/circuits-model-runtime.tex @@ -1,4 +1,4 @@ -%!TEX root=./main.tex +%!TEX root= prob-def.tex \subsection{Deterministic Query Runtimes}\label{sec:gen} %We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries under reasonable assumptions. 
diff --git a/conclusions.tex b/conclusions.tex index b070521..fffdd96 100644 --- a/conclusions.tex +++ b/conclusions.tex @@ -5,7 +5,7 @@ We have studied the problem of calculating the expected multiplicity of a bag-qu a problem that has a practical application in probabilistic databases over multisets. We show that under various parameterized complexity hardness results/conjectures computing the expected multiplicities exactly is not possible in time linear in the corresponding deterministic query processing time. We prove that it is possible to approximate the expectation of a lineage polynomial in linear time - in the deterministic query processing over TIDBs and BIDBs (assuming that there are few cancellations). + in the deterministic query processing over TIDBs and BIDBs (assuming that there are few cancellations). Interesting directions for future work include development of a dichotomy for bag \abbrPDB\xplural. While we can handle higher moments (this follows fairly easily from our existing results-- see \Cref{sec:momemts}), more general approximations are an interesting area for exploration, including those for more general data models. 
%%% Local Variables: diff --git a/introduction.tex b/introduction.tex index f1cdf30..1edbda4 100644 --- a/introduction.tex +++ b/introduction.tex @@ -51,7 +51,7 @@ An $\raPlus$ query is a query expressed in positive relational algebra, i.e., us \end{align*}%\\[-10mm] %\setlength{\abovecaptionskip}{-0.25cm} \savecaptionspace{ - \caption{Lineage polynomial semantics given $\raPlus$ query $\query$, arbitrary deterministic database $\gentupset$ with variables $\inparen{X_\tup}_{\tup \in\gentupset}$, where for $\rel\in\gentupset$, $\tup\in\rel$, the base case is $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$.}% for any $\rel\in\gentupset$ and $\tup\in\rel$.}% consists of all $X_\tup$ over all $\rel$ in $\gentupset$ and $\tup$ in $\rel$, such that the base case $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$.} %Here $\gentupset.\rel$ denotes the instance of relation $\rel$ in $\gentupset$. Please note, after we introduce the reduction to $1$-\abbrBIDB, the base case will be expressed alternatively. The base case is $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$} + \caption{Lineage polynomial semantics given $\raPlus$ query $\query$, arbitrary \dbbaseName $\gentupset$ with variables $\inparen{X_\tup}_{\tup \in\gentupset}$, where for $\rel\in\gentupset$, $\tup\in\rel$, the base case is $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$.}% for any $\rel\in\gentupset$ and $\tup\in\rel$.}% consists of all $X_\tup$ over all $\rel$ in $\gentupset$ and $\tup$ in $\rel$, such that the base case $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$.} %Here $\gentupset.\rel$ denotes the instance of relation $\rel$ in $\gentupset$. Please note, after we introduce the reduction to $1$-\abbrBIDB, the base case will be expressed alternatively. 
The base case is $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$} \label{fig:nxDBSemantics} }{\abovecapshrink}{\belowcapshrink} %\vspace{-0.53cm} @@ -112,7 +112,7 @@ Those with `Multiple' in the second column need the algorithm to be able to hand %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \mypar{Our lower bound results} % -Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$ over a deterministic database $\gentupset$ where the maximum multiplicity of any tuple is less than or equal to $\bound$. % This paper considers $\raPlus$ queries, for which order of operations is \emph{explicit}, as opposed to other query languages, e.g. Datalog, UCQ. Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} \approx \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{The upper bounds on runtime that we derive apply pointwise to any $\query \in\raPlus$, allowing us to abstract away the specific heuristics for choosing an optimized query (i.e., Any deterministic query optimization heuristic is equally useful for \abbrCTIDB queries).}\BG{Rewrite: since an optimized Q is also a Q this also applies in the case where there is a query optimizer the rewrites Q} +Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$ over a \dbbaseName $\gentupset$ where the maximum multiplicity of any tuple is less than or equal to $\bound$. % This paper considers $\raPlus$ queries, for which order of operations is \emph{explicit}, as opposed to other query languages, e.g. Datalog, UCQ. 
Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} \approx \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{The upper bounds on runtime that we derive apply pointwise to any $\query \in\raPlus$, allowing us to abstract away the specific heuristics for choosing an optimized query (i.e., Any deterministic query optimization heuristic is equally useful for \abbrCTIDB queries).}\BG{Rewrite: since an optimized Q is also a Q this also applies in the case where there is a query optimizer the rewrites Q} Our question is whether or not it is always true that for every $\query$, $\timeOf{}^*\inparen{\query, \pdb, \bound}\leq \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$. We remark that the issue of query optimization is orthogonal to this question (recall that an $\raPlus$ query explicitly encodes order of operations) since we want to answer the above question for all $\query$. \emph{Specifically, if there is an equivalent query $\query'$ that is more efficient to evaluate, we allow both deterministic and probabilistic query processing access to $\query'$}. Unfortunately the answer to the above question is no--