boris bad intro

This commit is contained in:
Boris Glavic 2021-04-02 12:00:56 -05:00
parent 9fc77e005b
commit 37196c454b
2 changed files with 37 additions and 22 deletions

View file

@ -3,6 +3,16 @@
\section{Introduction}
\label{sec:intro}
A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$ called possible worlds paired with a probability distribution $\pd$ over these worlds. A well-studied problem in probabilistic databases is to compute the \emph{marginal probability} of a tuple $\tup$, i.e., its probability to to exist in the result of a query $\query$ over $\pdb$. This problem is \sharpphard for set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs) which are a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases from cases that are in \ptime for the class of union of conjunctive queries (UCQs). In this work, we consider bag semantics where each tuple is associated with a multiplicity $\db_i(\tup)$ in each possible world $\db_i$ and study the analog problem of computing the expectation of the multiplicity of a query result tuple $\tup$:
\begin{equation}\label{eq:bag-expectation}
\expct_{\idb \sim \probDist}[\query(\db)(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \pd(\db)
\end{equation}
Under set semantics, the marginal probability of a query result $\tup$ can be computed based on the lineage $\linsett{\query}{\pdb}{\tup}$ of tuple $\tup$. The lineage $\linsett{\query}{\pdb}{\tup}$ is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions over variables $\vec{X}$) over random variables that represent input tuples. Each possible world $\db$ corresponds to an assignment $\mathbb{B}^\numvar$ of each of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). The marginal probability of the tuple is equal to the probability of the lineage $\linsett{\query}{\pdb}{\tup}$ evaluating to true over these assignments. This following from the following fact: iff $\linsett{\query}{\pdb}{\tup}$ evaluates to true over the assignment for a world $\db$, then $\tup \in \query(\db)$. For bag semantics, the lineage of a tuple is a polynomial over random variables from the set $\vct{X} \in \mathbb{N}^\numvar$ with
coefficients in the set of natural numbers $\mathbb{N}$ (an element of semiring $\mathbb{N}[\vct{X}]$). Analog to the set case, evaluating the lineage over an assignment corresponding to a possible world (mapping variables to natural numbers representing input tuple multiplicities) yields the multiplicity of result tuple $\tup$ in this world.
In their most general form, tuple-independent set-probabilistic databases~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs) answer existential queries (queries for the probability of a specific condition holding over the input database) in two steps: (i) lineage and (ii) probability.
The lineage is a boolean formula, an element of the $\text{PosBool}[\vct{X}]$ semiring, where lineage variables $\vct{X}\in \mathbb{B}^\numvar$ are random variables corresponding to the presence of each of the $\numvar$ input tuples in one possible world of the input database.
The lineage models the relationship between the presence of these input tuples and the query condition being satisfied, and thus the probability of this formula is exactly the query result.
@ -10,6 +20,7 @@ The analogous query in the bag setting~\cite{DBLP:journals/sigmod/GuagliardoL17,
The process for responding to such queries is also analogous, save that the lineage is a polynomial, an element from the $\mathbb{N}[\vct{X}]$ semiring, with coefficients in the set of natural numbers $\mathbb{N}$ and random variables from the set $\vct{X} \in \mathbb{N}^\numvar$.
The expectation of this polynomial is the query result.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\begin{subfigure}[b]{0.33\linewidth}
\centering
@ -63,7 +74,9 @@ The expectation of this polynomial is the query result.
\label{fig:ex-shipping}
\trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Example}\label{ex:intro-tbls}
Consider the \ti tables (\cref{fig:ex-shipping}) from an international shipping company.
Table Loc lists all the locations of airports.
@ -82,7 +95,7 @@ Suppose the representative would like to consider delays from the originating ci
The resulting lineage formulas (\cref{subfig:ex-shipping-queries}) concisely describe the event of delivering a shipment to Zurich or Bremen without departure delay, or the number of departure-delay-free routes to these cities given an assignment to $L_b$.
If Chicago is delay-free ($L_b = \top$, $L_b = 1$, respectively), there exists a route (set semantics) or there are two routes (bag semantics).
\end{Example}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%The computation of the marginal probability in is a known . Its corresponding problem in bag PDBs is computing the expected count of a tuple $\tup$. The computation is a two step process. The first step involves actually computing the lineage formula of $\tup$. The second step then computes the marginal probability (expected count) of the lineage formula of $\tup$, a boolean formula (polynomial) in the set (bag) semantics setting.
A well-known dichotomy~\cite{10.1145/1265530.1265571} separates the common-case \sharpphard problem of computing the probability of a boolean lineage formulas, from the case where the probability computation can be inlined into the polynomial-time lineage construction process.

View file

@ -24,7 +24,9 @@
\newcommand{\pd}{\vct{P}}%pd for probability distribution
\newcommand{\eval}[1]{\llbracket #1 \rrbracket}%evaluation double brackets
\newcommand{\evald}[2]{\eval{{#1}}_{#2}}
\newcommand{\linsett}[3]{\Phi_{#1,#2}^{#3}}
\newcommand{\query}{Q}
\newcommand{\join}{\Join}
\newcommand{\select}{\sigma}
\newcommand{\project}{\pi}