From f7cc75ef6c7ef82af01ae241d18c2ce431e3e120 Mon Sep 17 00:00:00 2001 From: Aaron Huber Date: Mon, 23 Aug 2021 09:01:25 -0400 Subject: [PATCH] Changes I apparently hadn't committed. --- intro-rewrite-070921.tex | 4 ++-- ra-to-poly.tex | 2 ++ 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/intro-rewrite-070921.tex b/intro-rewrite-070921.tex index fa0d234..6d6341c 100644 --- a/intro-rewrite-070921.tex +++ b/intro-rewrite-070921.tex @@ -1,5 +1,5 @@ %root: main.tex -\section{Introduction (Rewrite - 070921)} +\section{Introduction (Rewrite - 070921)}\label{sec:intro-rewrite-070921} \input{two-step-model} A tuple independent probabilistic database\footnote{In \cref{sec:background} and beyond, we generalize the data model.} (\abbrTIDB) $\pdb$ is a tuple $\inparen{\idb, \pd}$ such that $\idb$ is the set of deterministic instances of $\pdb$ (possible worlds) and $\pd$ is the probability distribution over $\idb$. $\pdb$ can equivalently encoded as deterministic database with $\numvar$ tuples, with $\pd$ %with a deterministic table $\encodedDB$ which is a set of $\numvar$ tuples, encoding the set of possible worlds $\idb$. The probability distribution $\pd$ over the set of database instances (possible worlds) is the one @@ -25,7 +25,7 @@ There exist some queries for which \emph{bag}-\abbrPDB\xplural are a more natura %END Needs to be noted. A natural question is whether or not we can quantify the complexity of computing $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$ separately from the complexity of deterministic query evaluation, effectively dividing \abbrPDB query evaluation into two steps: deterministic query evaluation\footnote{Given input $\pdb$, this step includes outputting every tuple $\tup$ that satisfies $\query$, annotated with its lineage polynomial ($\poly_\tup$) which is computed inline across the query operators of $\query$.\cite{Imielinski1989IncompleteII}\cite{DBLP:conf/pods/GreenKT07}} and computing expectation. 
Viewing \abbrPDB query evaluation as these two seperate steps is also known as intensional evaluation \cite{DBLP:series/synthesis/2011Suciu}, illustrated in \cref{fig:two-step}. -The first step, which we will refer to as \termStepOne (\abbrStepOne), consists of computing both $\query\inparen{\db}$ and $\poly_\tup(\vct{X})$.\footnote{Assuming standard $\raPlus$ query processing algorithms, computing the lineage polynomial of $\tup$ is upperbounded by the runtime of deterministic query evaluation of $\tup$, as we show in \cref{sec:circuit-runtime}.} The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly_\tup(\vct{\randWorld})}$. Such a model of computation is nicely followed in set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where $\poly_\tup\inparen{\vct{X}}$ must be computed separate from deterministic query evaluation to obtain exact output when $\query(\pdb)$ is hard since evaluating the probability inline with query operators (extensional evaluation) will only approximate the actual probability in such a case. The paradigm of \cref{fig:two-step} is also analogous to semiring provenance, where $\semNX$-DB\footnote{An $\semNX$-DB is a database whose tuples are annotated with standard polynomials, i.e. elements from $\semNX$ connected by multiplication and addition operators.} query processing \cite{DBLP:conf/pods/GreenKT07} first computes the query and polynomial, and the $\semNX$-polynomial can then subsequently evaluated over a semantically appropriate semiring, e.g. $\semN$ to model bag semantics. Further, in this work, the intensional model lends itself nicely in separating the concerns of deterministic computation and the probability computation. 
+The first step, which we will refer to as \termStepOne (\abbrStepOne), consists of computing both $\query\inparen{\db}$ and $\poly_\tup(\vct{X})$.\footnote{Assuming standard $\raPlus$ query processing algorithms, computing the lineage polynomial of $\tup$ is upper bounded by the runtime of deterministic query evaluation of $\tup$, as we show in \cref{sec:circuit-runtime}.} The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly_\tup(\vct{\randWorld})}$. Such a model of computation is nicely followed in set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where $\poly_\tup\inparen{\vct{X}}$ must be computed separately from deterministic query evaluation to obtain exact output when $\query(\pdb)$ is hard since evaluating the probability inline with query operators (extensional evaluation) will only approximate the actual probability in such a case. The paradigm of \cref{fig:two-step} is also analogous to semiring provenance, where $\semNX$-DB\footnote{An $\semNX$-DB is a database whose tuples are annotated with elements from the set of polynomials with variables in $\vct{X}$ and natural number coefficients and exponents.} query processing \cite{DBLP:conf/pods/GreenKT07} first computes the query and polynomial, and the $\semNX$-polynomial can then be evaluated over a semantically appropriate semiring, e.g. $\semN$ to model bag semantics. Further, in this work, the intensional model lends itself nicely to separating the concerns of deterministic computation and the probability computation. Let $\timeOf{\abbrStepOne}$ denote the runtime of \abbrStepOne and similarly for $\timeOf{\abbrStepTwo}$. Given bag-\abbrPDB query $\query$ and \abbrTIDB $\pdb$ with $\numvar$ tuples, let us go a step further and assume that computing $\poly_\tup$ is lower bounded by the runtime of determistic query computation of $\query$ for the following situations, i.e. when $\abs{\textnormal{input}} \leq \abs{\textnormal{output}}$. 
When $\poly_\tup$ is in standard monomial basis (\abbrSMB)\footnote{A polynomial is in \abbrSMB when it consists of a sum of unique products.}, by linearity of expectation and independence of \abbrTIDB, it follows that $\timeOf{\abbrStepTwo}$ is indeed $\bigO{\timeOf{\abbrStepOne}}$. Let $\prob_i$ denote the probability of tuple $\tup_i$ ($\probOf\pbox{X_i = 1}$) for $i \in [\numvar]$. Consider another special case when for all $i$ in $[\numvar]$, $\prob_i = 1$. For output tuple $\tup'$ of $\query\inparen{\pdb}$, computing $\expct\pbox{\poly_{\tup'}\inparen{\vct{\randWorld}}}$ is linear in diff --git a/ra-to-poly.tex b/ra-to-poly.tex index 9b32d34..eedeccc 100644 --- a/ra-to-poly.tex +++ b/ra-to-poly.tex @@ -5,6 +5,8 @@ \subsection{Probabilistic Databases} +We focus primarily on set-\abbrPDB inputs in this section, but as noted in \cref{sec:intro-rewrite-070921}, this is not limiting. + An \textit{incomplete database} $\idb$ is a set of deterministic databases $\db$ called possible worlds. Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is an incomplete database and $\pd$ is a probability distribution over $\idb$. Queries over probabilistic databases are evaluated using the so-called possible world semantics. Under the possible world semantics, the result of a query $\query$ over an incomplete database $\idb$ is the set of query answers produced by evaluating $\query$ over each possible world: $\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}$.