Started my intro pass

Atri Rudra 2021-08-26 19:18:56 -04:00
parent bd70b6147e
commit 26dbf42e56


@ -2,26 +2,33 @@
%root: main.tex %root: main.tex
\section{Introduction (Rewrite - 070921)}\label{sec:intro-rewrite-070921} \section{Introduction (Rewrite - 070921)}\label{sec:intro-rewrite-070921}
\input{two-step-model} \input{two-step-model}
A probabilistic database $\pdb$ is a tuple $\inparen{\idb, \pd}$ such that $\idb$ is a set of deterministic database instances (possible worlds) and $\pd$ is a probability distribution over $\idb$. A probabilistic database (or PDB) $\pdb$ is a pair $\inparen{\idb, \pd}$ such that $\idb$ is a set of deterministic database instances (possible worlds) and $\pd$ is a probability distribution over $\idb$.
In bag count-query semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ computes the multiplicity of its corresponding tuple $\tup$. In bag count-query\AR{Why are we restricting ourselves to a count query here? Why not just say `bag query'?} semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ is the multiplicity of its corresponding output tuple $\tup$ (in a random database instance in $\idb$ chosen according to $\pd$).
In addition to traditional deterministic query evaluation requirements (for a given query class), the count-query evaluation problem in bag-\abbrPDB semantics can be formally stated as: In addition to traditional deterministic query evaluation requirements (for a given query class), the count-query evaluation problem in bag-\abbrPDB semantics can be formally stated as:
\begin{Problem}\label{prob:bag-pdb-query-eval} \begin{Problem}\label{prob:bag-pdb-query-eval}
Given a query $\query$ from the set of positive relational algebra queries ($\raPlus$),\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the positive (monotonic) relational algebra operators: selection, projection, join, and union (SPJU).} compute the expected multiplicity ($\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$)\footnote{We assume the implicit probability distribution $\pd$, and explicitly denote the distribution if it is not implicit.} of output tuple $\tup$. Given a query $\query$ from the set of positive relational algebra queries\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the positive (monotonic) relational algebra operators: selection, projection, join, and union (SPJU).} ($\raPlus$), compute the expected\footnote{Unless stated otherwise, we assume the implicit probability distribution $\pd$, and for notational convenience use $\expct\pbox{\cdot}$ instead of $\expct_\pd\pbox{\cdot}$.}
multiplicity ($\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$)
of output tuple $\tup$. We will be interested in the data complexity of this problem (i.e., we think of $\query$ as having constant size).
\end{Problem} \end{Problem}
We initially focus on tuple-independent probabilistic bag-databases (\abbrTIDB), a compressed encoding of probabilistic databases where the presence of each individual copy of a tuple in a possible world can be modeled as an independent probabilistic event\footnote{ Solving~\cref{prob:bag-pdb-query-eval} for arbitrary $\pd$ is hopeless since we need exponential space to represent an arbitrary $\pd$.
This model corresponds to the classical set-relational approach to \abbrTIDB{}s, reducing duplicate tuples to a set-\abbrTIDB by assigning unique keys across all $\tup$ in $\pdb$. This typically has an $\bigO{c}$ increase in size, for $c = \max_{\tup \in \db}\db\inparen{\tup}$, where $\db\inparen{\tup}$ denotes $\tup$'s multiplicity in the encoding. We initially focus on tuple-independent probabilistic bag-databases (\abbrTIDB), a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world can be modeled as an independent probabilistic event\footnote{
This model corresponds to the classical set-relational approach to \abbrTIDB{}s, where we can handle the case of each input tuple having its own multiplicity by replacing each input tuple with as many copies as its multiplicity. To make each duplicate tuple unique in a set-\abbrTIDB we can assign unique keys across all duplicates. This increases the size of the input, but this overhead is negligible when each input tuple has constant multiplicity. %$\tup$ in $\pdb$.
%This typically has an $\bigO{c}$ increase in size, for $c = \max_{\tup \in \db}\db\inparen{\tup}$, where $\db\inparen{\tup}$ denotes $\tup$'s multiplicity in the encoding.
We further generalize this model in \cref{sec:background} and beyond. We further generalize this model in \cref{sec:background} and beyond.
} }.
A \abbrTIDB encodes a compatible $\pdb$ as a deterministic database $\encodedDB$ with $\numvar$ tuples, each annotated with a probability $\prob_\tup$, and with $\pd$ We will denote the $\numvar$ input tuples by $t_1,\dots,t_\numvar$; each of the $2^\numvar$ possible database instances in $\idb$ can be encoded as a string in $\{0,1\}^\numvar$. In particular, any vector $\vct{W}=\inparen{W_1,\dots,W_\numvar}\in \{0,1\}^\numvar$ represents a database instance that has $t_i$ in it if and only if $W_i=1$. Further, $\pd$ is compactly described by the tuple $\vct{p}=\inparen{p_1,\dots,p_\numvar}$, which induces the product distribution over vectors $\vct{W}\in\{0,1\}^\numvar$ in which, for each $i\in [\numvar]$, $W_i$ is an independent Bernoulli random variable with $\probOf(W_i=1)=p_i$. Finally, for each $\vct{W}\in\{0,1\}^\numvar$, define $\pdb_{\vct{W}}$ to be the deterministic database represented by $\vct{W}$.
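For concreteness, the probability of a single world under this encoding can be written out explicitly. This is a standard consequence of independence, stated here as a sketch in the notation above: for every $\vct{W}\in\{0,1\}^\numvar$,
\[
\probOf(\vct{W}) = \prod_{i=1}^{\numvar} p_i^{W_i}\inparen{1-p_i}^{1-W_i}.
\]
For example, with $\numvar=2$ the world $\vct{W}=\inparen{1,0}$, which contains $t_1$ but not $t_2$, has probability $p_1\inparen{1-p_2}$.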
%with a deterministic table $\encodedDB$ which is a set of $\numvar$ tuples, encoding the set of possible worlds $\idb$. The probability distribution $\pd$ over the set of database instances (possible worlds) is the one
being the distribution induced from the requirement that each tuple $\tup \in \encodedDB$ be treated as an independent Bernoulli distributed random variable with probability $\prob_\tup$.
The possible worlds of a \abbrTIDB can be encoded by the vector $\vct{W}$, such that each of the $\numvar$ tuples in $\vct{W}$ has its own unique Bernoulli-distributed random variable, i.e. $\vct{W} = \inparen{W_{\tup_1},\ldots, W_{\tup_\numvar}}$, and for each tuple $\tup$, $\probOf(W_\tup) = \prob_\tup$.
Given a vector $\vct{X}$ such that each $\tup \in \encodedDB$ has a unique formal variable annotation $X_\tup \in \vct{X}$, for a boolean domain $\{0,1\}^\numvar$, denote by $\pdb_{\vct{X}}$ the deterministic database consisting of exactly those tuples $\tup$ where $X_\tup = 1$.
When $\pdb$ is a \abbrTIDB, $\query\inparen{\pdb}\inparen{\tup}$ can be encoded by a polynomial, with variables in $\vct{X}$. %Atri: Stuff below was confusing, so am re-writing it.
Green, Karvounarakis, and Tannen established (\cite{DBLP:conf/pods/GreenKT07}; see \cref{fig:nxDBSemantics}) that for any $\raPlus$ query $\query$ and \abbrTIDB $\pdb$, there exists a polynomial $\poly_\tup\inparen{\vct{X}}$ following the standard addition and multiplication operators (i.e., $\semN$-semiring semantics), such that $\query\inparen{\pdb_{\vct{X}}}\inparen{\tup} = \poly_\tup\inparen{\vct{X}}$. %A \abbrTIDB encodes a compatible $\pdb$ as a deterministic database $\encodedDB$ with $\numvar$ tuples, each annotated with a probability $\prob_\tup$, and with $\pd$
This in turn implies that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$. %with a deterministic table $\encodedDB$ which is a set of $\numvar$ tuples, encoding the set of possible worlds $\idb$. The probability distribution $\pd$ over the set of database instances (possible worlds) is the one
%being the distribution induced from the requirement that each tuple $\tup \in \encodedDB$ be treated as an independent Bernoulli distributed random variable with probability $\prob_\tup$.
%The possible worlds of a \abbrTIDB can be encoded by the vector $\vct{W}$, such that each of the $\numvar$ tuples in $\vct{W}$ has its own unique Bernoulli-distributed random variable, i.e. $\vct{W} = \inparen{W_{\tup_1},\ldots, W_{\tup_\numvar}}$, and for each tuple $\tup$, $\probOf(W_\tup) = \prob_\tup$.
%Given a vector $\vct{X}$ such that each $\tup \in \encodedDB$ has a unique formal variable annotation $X_\tup \in \vct{X}$, for a boolean domain $\{0,1\}^\numvar$, denote by $\pdb_{\vct{X}}$ the deterministic database consisting of exactly those tuples $\tup$ where $X_\tup = 1$.
When $\pdb$ is a \abbrTIDB, for every output tuple $\tup$, $\query\inparen{\pdb}\inparen{\tup}$ can be encoded by a polynomial with variables $\vct{X}=\inparen{X_1,\dots,X_\numvar}$, one variable per input tuple.
Green, Karvounarakis, and Tannen established (\cite{DBLP:conf/pods/GreenKT07}; see \cref{fig:nxDBSemantics}) that for any $\raPlus$ query $\query$ and \abbrTIDB $\pdb$, there exists a polynomial $\poly_\tup\inparen{\vct{X}}$, built using the standard addition and multiplication operators (i.e., $\semN$-semiring semantics), such that $\query\inparen{\pdb_{\vct{W}}}\inparen{\tup} = \poly_\tup\inparen{\vct{W}}$ for every $\vct{W}\in\{0,1\}^\numvar$.
This in turn implies that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct_{\vct{W}\sim\pd}\pbox{\poly_\tup\inparen{\vct{W}}}$.
Thanks to linearity of expectation, polynomial-time algorithms exist (e.g., \cite{kennedy:2010:icde:pip}) for computing exact results for bag-probabilistic count queries over \abbrTIDB{}s. Thanks to linearity of expectation, polynomial-time algorithms exist (e.g., \cite{kennedy:2010:icde:pip}) for computing exact results for bag-probabilistic count queries over \abbrTIDB{}s.
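As a small illustration of this argument (the polynomial below is a hypothetical example, not one taken from the cited work), suppose an output tuple $\tup$ has two derivations that share the first input tuple, so that $\poly_\tup\inparen{\vct{X}} = X_1X_2 + X_1X_3$. Linearity of expectation splits the sum, and independence of the $W_i$ factors each product:
\[
\expct\pbox{\poly_\tup\inparen{\vct{W}}} = \expct\pbox{W_1W_2} + \expct\pbox{W_1W_3} = p_1p_2 + p_1p_3.
\]
Since each $W_i\in\{0,1\}$ satisfies $W_i^2 = W_i$, a monomial with a repeated variable, such as $X_1^2X_2$, likewise has expectation $p_1p_2$.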
However, the question remains: \emph{can queries over bag-probabilistic databases be evaluated as fast as deterministic queries?} However, the question remains: \emph{can queries over bag-probabilistic databases be evaluated as fast as deterministic queries?}