\chapter{First Steps}

\section{Installing MayBMS}

\subsubsection{Using the installers}

Installers for MayBMS are available for both the Windows and Linux operating systems and can be downloaded at

\url{https://sourceforge.net/projects/maybms/}

After you have obtained a copy of the installer, start it and follow the instructions.

\subsubsection{Compiling from scratch}

If the prepackaged installers do not work for you, you can download the MayBMS source code and compile it yourself.
A copy is available for download at

\url{https://sourceforge.net/projects/maybms/}

Alternatively, you can obtain the latest snapshot from the repository by issuing the following command (on a single line):

\begin{verbatim}
cvs -z3 -d:pserver:anonymous@maybms.cvs.sourceforge.net:/cvsroot/maybms
co -P maybms
\end{verbatim}

This creates a directory maybms/ with a subdirectory postgresql-8.3.3/
that contains the source code of the system.

To compile and install MayBMS, just follow the instructions for installing
PostgreSQL 8.3.3. The latter is documented at

\medskip

\url{http://www.postgresql.org/docs/8.3/interactive/installation.html}.

%\medskip
%
%\noindent
%See
%
%\medskip
%
%\url{http://www.postgresql.org/docs/8.3/interactive/install-short.html}
%
%\medskip
%
%\noindent
%for a short version of the installation instructions.

\section{Running MayBMS}

After you have installed MayBMS (in either of the ways described), you can
set up a database and start using it.
Creating and accessing databases works exactly as in PostgreSQL 8.3.3. Follow the links

\medskip

\url{http://www.postgresql.org/docs/8.3/interactive/tutorial-createdb.html}

\medskip

\noindent
and

\medskip

\url{http://www.postgresql.org/docs/8.3/interactive/tutorial-accessdb.html}.

\medskip

\noindent
See the next section for short instructions on how to run MayBMS.

\section{Short Instructions}

Alternatively, you can follow this set of instructions.\footnote{If
you already know how to compile and install Postgres, or have followed the
installation instructions above, you can skip this section.}

On most UNIX machines, Postgres
is by default installed in the directory /usr/local/pgsql/
and run under the user ``postgres''.
MayBMS uses the same defaults.
If you prefer to install MayBMS
in your home directory and run it with your user privileges, you do not
need root privileges to install it. Proceed as follows:
change the path ac\_default\_prefix in line 279 of the file
maybms/postgresql-8.3.3/configure to a path inside your home directory
(e.g., /home/myname/pgsql/ if your home directory is /home/myname/).

To compile, install, and start the Postgres server, execute the
following statements:
\begin{verbatim}
cd maybms/postgresql-8.3.3/
./configure
make
make install
cd
pgsql/bin/initdb -D mydbname
pgsql/bin/pg_ctl start -D mydbname
\end{verbatim}

Note: In these minimal instructions, we did not create a special database
using createdb (so the default, template1, has to be used), and error
messages are written to the console.

Now MayBMS is available for connections from applications.

For example, the Postgres command line interface psql,
in which you can issue MayBMS queries
and data manipulation language statements, is started with
\begin{verbatim}
psql template1
\end{verbatim}
Now you can enter the examples from, e.g., the following tutorial.
The psql program is terminated using the command ``$\backslash$q''.
The database server is stopped with
\begin{verbatim}
pgsql/bin/pg_ctl stop -D mydbname
\end{verbatim}

\subsubsection{Remark}

Since Postgres and MayBMS use the same process identifiers,
MayBMS and Postgres cannot run concurrently on the same machine.
If you start Postgres when MayBMS is already running
(or vice versa), there will be an error message stating that Postgres is
already running. Since MayBMS always identifies itself as Postgres,
standard Postgres applications and middleware can run on MayBMS.

\chapter{Probabilistic Databases}

We first give an informal definition of probabilistic databases, followed
by a formal definition.
%You should be able to do fine by just reading the
%first if the latter is too mathematical.

\section{Informal Definition}

Fix a relational database schema (i.e., the structural information usually
specified by SQL CREATE TABLE statements).
A probabilistic database is a finite set of {\em possible worlds}\/, where each
possible world has a weight greater than 0 and no greater than 1,
such that the sum of the weights
of all the worlds is one.
Each possible world is a relational database over the given
schema. That is, the schema is common to all possible worlds.

Possible worlds are a means of expressing uncertainty.
\begin{itemize}
\item
In a frequentist interpretation, the probabilistic database represents the
possible outcomes of a random experiment whose outcomes are relational
databases (or can be conveniently represented as relational databases).
The probability weight of a possible world is (the limit of) the relative
frequency with which that possible world occurs as the outcome of the random
experiment over a large number of trials.

\item
In a Bayesian interpretation, one of the possible worlds is
``true'', but we do not know which one, and the probabilities represent
degrees of belief in the various possible worlds.
\end{itemize}

Note that these interpretations of probabilistic databases are completely
standard in probability theory (and formalized via the notion of {\em probability
spaces}). The only aspect particular to probabilistic
databases is the fact that possible worlds are relational databases.

Note that the idea of a probabilistic database as a set of possible worlds
is only the conceptual model. The physical representation of the
set of possible worlds in the MayBMS system is quite different
(see Section~\ref{sect:representation}) and allows for the efficient and
space-saving (compressed) representation of very large sets of possible
worlds.

\section{Formal Definition}

The following is a standard definition from probability theory and
shall only be recalled to demonstrate the close connection of
probabilistic databases to classical concepts in mathematics.

\begin{definition}\em
A {\em finite probability space}\/ is a triple $(\Omega, {\cal F}, \Pr)$ where
\begin{itemize}
\item
$\Omega$ is a finite set called the {\em sample space}\/,

\item
${\cal F} = 2^\Omega$ is the set of
subsets of $\Omega$ (these subsets are called {\em events}; the one-element
subsets $\{\omega\}$ are called atomic events), and

\item
$\Pr$ is a
{\em probability measure}\/, i.e., a function
that maps each element $\omega \in \Omega$ (i.e., each atomic event)
to a number between 0 and 1 such that
\[
\sum_{\omega \in \Omega} \Pr[\omega] = 1
\]
and that maps
each event $E \in {\cal F}$ to
$\sum_{\omega \in E} \Pr[\omega]$.
\punto
\end{itemize}
\end{definition}


Formally,
a {\em probabilistic database}\/ over a relational database schema $sch$
is a finite probability space $(\Omega, {\cal F} = 2^\Omega, \Pr)$ with an associated
function $I$ (for {\em instance}) that maps each
$\omega \in \Omega$ to a relational database
over schema $sch$.

We call the elements $\omega$ of $\Omega$ the {\em possible worlds}\/
of the probabilistic database.

We can identify events with Boolean queries $Q$ that are true on a subset of
$\Omega$. Of course, the probability of such an event is given by
\[
\Pr[Q] = \sum_{\omega \in \Omega \;:\; Q(I(\omega))=true} \Pr[\omega].
\]

One particular type of event is membership of a given tuple $\vec{t}$
in the result of a (nonboolean) query, i.e., an event
\[
\{\omega \in \Omega \;:\; \vec{t} \in Q(I(\omega)) \}.
\]
The probability of this event is called the
{\em tuple confidence}\/ for tuple $\vec{t}$.

A {\em random variable} $X$ is a function from $\Omega$ to a set $D$
(the ``values'' of the random variable).
We can associate each expression $X=x$, where $x \in D$, with the event
\[
\{ \omega \in \Omega \mid X(\omega) = x \}.
\]
Again, this is the usual notion from probability theory.

\section{An Example}

Consider a finite probability space with
\[
\Omega = \{
\omega_{rain, wet},
\omega_{\neg rain, wet},
\omega_{rain, \neg wet},
\omega_{\neg rain, \neg wet}
\}
\]
and
$\Pr[\omega_{rain, wet}] = 0.35$,
$\Pr[\omega_{rain, \neg wet}] = 0.05$,
$\Pr[\omega_{\neg rain, wet}] = 0.1$, and
$\Pr[\omega_{\neg rain, \neg wet}] = 0.5$.

Let $Wet$ be the event $\{ \omega_{rain, wet}, \omega_{\neg rain, wet} \}$.
Then $\Pr[Wet] = 0.35 + 0.1 = 0.45$.
%
We define Boolean random variables $Wet$ and $Rain$ as follows:
\[
Wet = \{
\omega_{rain, wet} \mapsto true,
\omega_{\neg rain, wet} \mapsto true,
\omega_{rain, \neg wet} \mapsto false,
\omega_{\neg rain, \neg wet} \mapsto false
\};
\]\[
Rain = \{
\omega_{rain, wet} \mapsto true,
\omega_{\neg rain, wet} \mapsto false,
\omega_{rain, \neg wet} \mapsto true,
\omega_{\neg rain, \neg wet} \mapsto false
\}.
\]
Then, $\Pr[Wet=true]$ is again $0.45$.
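
These computations can be checked mechanically. The following is a minimal Python sketch (independent of MayBMS; the world names are ours) that encodes the four worlds and sums weights over the event $Wet$:

```python
# The probability space of the example: four worlds, keyed by the
# values of the Boolean random variables Rain and Wet.
worlds = {
    ("rain", "wet"):        0.35,
    ("rain", "not wet"):    0.05,
    ("no rain", "wet"):     0.10,
    ("no rain", "not wet"): 0.50,
}

# The weights sum to one, as a probability measure requires.
assert abs(sum(worlds.values()) - 1.0) < 1e-9

# Pr[Wet] is the total weight of the worlds in the event Wet.
pr_wet = worlds[("rain", "wet")] + worlds[("no rain", "wet")]
print(round(pr_wet, 2))  # 0.45
```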

The first example of the following tutorial chapter captures this
example in the framework of the MayBMS query and update language.

\chapter{Tutorial}

This tutorial introduces the main features of MayBMS in an informal
way. The full examples can be run using the psql command line interface.

\section{A Really Simple Example}

We start by creating a simple table using SQL commands.
The table encodes that we see rain and wet ground with probability 0.35,
rain but dry ground with probability 0.05,
no rain but wet ground with probability 0.1, and
no rain and dry ground with probability 0.5.
%
\begin{verbatim}
create table R (Dummy varchar, Weather varchar,
                Ground varchar, P float);
insert into R values ('dummy', 'rain', 'wet', 0.35);
insert into R values ('dummy', 'rain', 'dry', 0.05);
insert into R values ('dummy', 'no rain', 'wet', 0.1);
insert into R values ('dummy', 'no rain', 'dry', 0.5);

select * from R;
 dummy | weather | ground |  p
-------+---------+--------+------
 dummy | rain    | wet    | 0.35
 dummy | rain    | dry    | 0.05
 dummy | no rain | wet    |  0.1
 dummy | no rain | dry    |  0.5
(4 rows)
\end{verbatim}

Table R is a completely standard relational database table,
created using standard SQL statements. One of the columns, P, stores
probabilities, but to the system these are so far only numbers without any
particular meaning.

The following statement creates a probabilistic database table S:
%
\begin{verbatim}
create table S as
repair key Dummy in R weight by P;
\end{verbatim}

The repair-key statement is one of the extensions of the MayBMS query
language over standard SQL, and it associates a special meaning to the
values taken from the ``weight by'' column.

The statement creates a probability space with a sample space consisting
of four possible databases -- each one consisting of just one tuple
from $R$ -- with an associated probability measure given by the P column.

There are at least two natural interpretations of this example,
one using random variables and one using a possible worlds semantics.
%
\begin{itemize}
\item
We can think of S as a table specifying the joint probability distribution
of two discrete random variables Weather (with values ``rain'' and ``no rain'')
and Ground (with values ``wet'' and ``dry'').

\item
Alternatively, there are four possible worlds. Each of these worlds
is a relation S with a single tuple from R. The probability of such a world is
the value given for the tuple in column P of R.
\end{itemize}


We can compute the probabilities Pr[Ground='wet'] and
Pr[Weather='rain' and Ground='wet'] as follows,
using the MayBMS conf() aggregate (which stands for ``confidence'').
%
\begin{verbatim}
create table Wet as
select conf() as P from S where Ground = 'wet';

select * from Wet;
  p
------
 0.45
(1 row)

create table Rain_and_Wet as
select conf() as P from S
where Weather = 'rain' and Ground = 'wet';

select * from Rain_and_Wet;
  p
------
 0.35
(1 row)
\end{verbatim}

The conditional probability
Pr[Weather='rain' $|$ Ground='wet'] can be computed as the ratio
\[
\mbox{Pr[Weather='rain' and Ground='wet'] / Pr[Ground='wet'].}
\]
%
\begin{verbatim}
select R1.P/R2.P as Rain_if_Wet from Rain_and_Wet R1, Wet R2;
 rain_if_wet
-------------
 0.777777778
(1 row)
\end{verbatim}

Since conf() is an aggregate, we can compute the marginal probability
table for the random variable Ground as
\begin{verbatim}
select Ground, conf() from S group by Ground;
 ground | conf
--------+------
 dry    | 0.55
 wet    | 0.45
(2 rows)
\end{verbatim}
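
All of the conf() results above can be reproduced outside the database by summing world weights directly. A small Python sketch (our own encoding of the four-world space, not MayBMS code):

```python
# Each possible world of S holds exactly one (weather, ground) tuple.
S = [("rain", "wet", 0.35), ("rain", "dry", 0.05),
     ("no rain", "wet", 0.10), ("no rain", "dry", 0.50)]

def conf(pred):
    """Total weight of the worlds whose tuple satisfies the predicate."""
    return sum(p for (w, g, p) in S if pred(w, g))

p_wet = conf(lambda w, g: g == "wet")
p_rain_and_wet = conf(lambda w, g: w == "rain" and g == "wet")

print(round(p_wet, 2))                   # 0.45
print(round(p_rain_and_wet, 2))          # 0.35
print(round(p_rain_and_wet / p_wet, 9))  # 0.777777778
print({g: round(conf(lambda w, g2: g2 == g), 2) for g in ("dry", "wet")})
```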
\section{Example: Triangles in Random Graphs}
\label{sec:randgraph}

In this example, we compute the probability that a triangle occurs
in a random graph
with $k$ named (and thus distinguishable) nodes.
That is, we ask for the probability that an undirected
graph, chosen uniformly at random among the graphs over $k$ nodes,
contains at least one triangle.
This is equivalent to computing the number $n$ of graphs
that contain a triangle among the $2^{k \cdot (k-1)/2}$ undirected graphs over
$k$ distinguished nodes.
Indeed, an undirected graph over $k$ nodes has at most
$k \cdot (k-1)/2$ edges, and we obtain all the graphs over the given
$k$ nodes by considering all subsets of this maximal set of edges.


We start by creating a unary ``node'' relation, say with five nodes.
We do this with the standard SQL ``create table'' and ``insert'' commands,
which behave as usual in a relational database system.

\begin{verbatim}
create table node (n integer);
insert into node values (1);
insert into node values (2);
insert into node values (3);
insert into node values (4);
insert into node values (5);
\end{verbatim}

Next we create the total order over the nodes, i.e., a binary relation
with exactly one edge between any two nodes. This is again a standard
SQL ``create table'' statement where we compute the tuples to be inserted
with a standard SQL query over the ``node'' relation.
%
\begin{verbatim}
create table total_order as
(
   select n1.n as u, n2.n as v
   from node n1, node n2
   where n1.n < n2.n
);
\end{verbatim}

We create a table to represent that each edge is either in the
graph (bit=1) or missing (bit=0).
%
\begin{verbatim}
create table inout (bit integer);
insert into inout values (1);
insert into inout values (0);
\end{verbatim}

The following operation introduces uncertainty into the database and
creates a probabilistic database with $2^{5 \cdot 4/2} = 1024$ possible
worlds, one for each possible edge relation over the five nodes
(i.e., for each subset of the total order).
We do this with the query operation ``repair key'', which for each edge
of the total order nondeterministically chooses whether the edge is in
the graph (bit=1) or not (bit=0). Since we do not indicate at what
probability either of the two alternatives for bit is to be chosen, the
system makes the decision {\em uniformly}\/ at random, choosing bit=1 with
probability 0.5. The resulting probabilistic database
represents all the alternative edge relations as possible worlds.
%
\begin{verbatim}
create table to_subset as
(
   repair key u,v in (select * from total_order, inout)
);
\end{verbatim}


The ``repair key'' operation is the most difficult to understand and
at the same time the most interesting addition to SQL that MayBMS provides.
Conceptually, ``repair key'' takes a set of attributes $\vec{K}$
and a relation $R$ (in this case the relational
product of total\_order and inout)
as arguments and nondeterministically chooses a maximal repair
of key $\vec{K}$ in $R$; that is, it removes a minimal set of tuples from
$R$ such that the result no longer violates the key constraint on the
columns $\vec{K}$ (here $u, v$).
In this case, there are exactly two tuples for each pair $(u,v)$, namely
$(u,v,1)$ and $(u,v,0)$, and repair key keeps exactly one of them.
The consequence is that, overall, the operation nondeterministically
chooses a subset of the set of all edges, and it chooses among these subsets
uniformly. The ``repair key'' operation accepts an additional argument that
allows us to assign nonuniform probabilities to the possible choices, but
in this case we do want uniform probabilities.
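
The conceptual semantics of ``repair key'' can be illustrated outside the database. The following Python sketch (an illustration only, not how MayBMS is implemented; the toy relation and names are made up) enumerates the maximal repairs of a key, one tuple per key value:

```python
from itertools import product

# A toy relation with key columns (u, v) and a payload column bit:
# for each key value there are two candidate tuples, bit=0 and bit=1.
R = [(1, 2, 0), (1, 2, 1), (1, 3, 0), (1, 3, 1)]

# Group the tuples by their key value (u, v).
groups = {}
for (u, v, bit) in R:
    groups.setdefault((u, v), []).append((u, v, bit))

# Each maximal repair keeps exactly one tuple per key value; without an
# explicit weight column, repair key chooses among these repairs uniformly.
repairs = [list(choice) for choice in product(*groups.values())]
print(len(repairs))  # 4 possible worlds, each with probability 1/4
```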

We have now created a probabilistic database. Conceptually, queries
and updates are evaluated in all possible worlds in parallel. Viewed
differently, there is only one to\_subset relation (but we do not know
which one), and
we continue to run queries and updates on this uncertain relation.

To actually create the edge relation, we select those tuples that have
bit=1 and compute their symmetric closure (to really represent an undirected
graph).
%
\begin{verbatim}
create table edge0 as (select u,v from to_subset where bit=1);

create table edge as (select * from edge0);
insert into edge (select v as u, u as v from edge0);
\end{verbatim}

Now we can compute the probability that the chosen graph has a triangle
as
%
\begin{verbatim}
select conf() as triangle_prob
from edge e1, edge e2, edge e3
where e1.v = e2.u and e2.v = e3.u and e3.v = e1.u
  and e1.u <> e2.u and e1.u <> e3.u and e2.u <> e3.u;
\end{verbatim}
where the conf aggregate computes the probability (``confidence'') that
the query given by the from-where statement returns a nonempty result.
This results in
%
\begin{verbatim}
 triangle_prob
---------------
      0.621094
(1 row)
\end{verbatim}

This is the correct probability: out of the 1024 possible
graphs of five nodes, 636 have a triangle, and $636/1024 = 0.62109375$.
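
The count of 636 triangle-containing graphs can be verified by brute force. A short Python sketch (independent of MayBMS) that enumerates all $2^{10} = 1024$ graphs over five nodes:

```python
from itertools import combinations, product

nodes = range(1, 6)
pairs = list(combinations(nodes, 2))   # the 10 possible undirected edges

def has_triangle(edges):
    """True iff some three nodes are pairwise connected."""
    es = set(edges)
    return any({(a, b), (a, c), (b, c)} <= es
               for a, b, c in combinations(nodes, 3))

# Enumerate all 1024 subsets of the maximal edge set.
count = sum(has_triangle([e for e, bit in zip(pairs, bits) if bit])
            for bits in product((0, 1), repeat=len(pairs)))
print(count, count / 1024)  # 636 0.62109375
```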

Indeed, the query
%
\begin{verbatim}
select *
from edge e1, edge e2, edge e3
where e1.v = e2.u and e2.v = e3.u and e3.v = e1.u
  and e1.u <> e2.u and e1.u <> e3.u and e2.u <> e3.u;
\end{verbatim}
%
computes at least one tuple in exactly those possible worlds (i.e., on those
graphs) that have a triangle.
The conf() aggregate applied to this query conceptually
computes the sum of the probability
weights of the worlds in which the query has a nonempty result.
(The actual implementation does not naively iterate over the possible
worlds, because this would be very inefficient.)

A more efficient formulation of the same query starts from the
``edge0'' relation:
%
\begin{verbatim}
select conf() as triangle_prob
from edge0 e1, edge0 e2, edge0 e3
where e1.v = e2.u and e2.v = e3.v and e1.u = e3.u
  and e1.u < e2.u and e2.u < e3.v;
\end{verbatim}

Finally, an even more efficient implementation uses the
aconf$(\epsilon, \delta)$ aggregate
to compute an $(\epsilon, \delta)$-approximation of the probability;
that is, the probability that the value $\hat{p}$ returned by aconf
deviates from the
correct probability $p$ by more than $\epsilon \cdot p$ is less than $\delta$.
%
\begin{verbatim}
select aconf(.05,.05) as triangle_prob
from edge0 e1, edge0 e2, edge0 e3
where e1.v = e2.u and e2.v = e3.v and e1.u = e3.u
  and e1.u < e2.u and e2.u < e3.v;
\end{verbatim}

The result may be somewhat off, but the probability that the relative error
is greater than 5\% is less than 5\%.

Note that in this example we have seen only two extensions of SQL, ``repair
key'' and ``[a]conf''. The good news is that this is essentially
all there is. SQL extended by just these two features allows for very powerful
queries, including the computation of conditional probability tables,
maximum likelihood estimates, maximum-a-posteriori hypotheses, Bayesian
learning, and much more.

\section{Example: Skills Management}

The following example demonstrates that probabilistic databases can be useful
even if the input data is not uncertain and the desired result is a classical
relational table. We define a hypothetical query in the context of skills
management. Assume we are given a classical relational database with two
tables. The first, CE, stores possible takeover targets -- companies that we
might decide to buy -- together with the employees who work for these
companies.
The second table, ES, stores each employee's skills.

Here is an example database. We can build this database in MayBMS with the
standard SQL ``create table'' and ``insert'' statements.

\[
\begin{tabular}{l@{~}|@{~}l@{~~}l}
CE & CID & EID \\
\hline
& Google & Bob \\
& Google & Joe \\
& Yahoo & Dan \\
& Yahoo & Bill \\
& Yahoo & Fred \\
\end{tabular}
\hspace{2mm}
\begin{tabular}{l@{~}|@{~}l@{~~}l}
ES & EID & Skill \\
\hline
& Bob & Web \\
& Joe & Web \\
& Dan & Java \\
& Dan & Web \\
& Bill & Search \\
& Fred & Java \\
\end{tabular}
\]


Now suppose that we want to buy exactly one of these companies, and
we expect exactly one employee to leave as a result of the takeover.
Which skills can we gain for certain?

We express this query in two steps. First we randomly choose a
company to buy and an employee who leaves, and compute the remaining
employees of the chosen company. We obtain this uncertain table using the
following query:
%
\begin{verbatim}
create table RemainingEmployees as
select CE.cid, CE.eid
from CE,
     (repair key dummy
      in (select 1 as dummy, * from CE)) Choice
where CE.cid = Choice.cid
  and CE.eid <> Choice.eid;
\end{verbatim}

Note that the probabilistic database thus created contains five possible
worlds (since there are five tuples in CE), with a uniform probability
distribution. Not all these worlds have the same number of tuples: if we
choose Google and Bob, the world contains one tuple, (Google, Joe). If we
choose Yahoo and Dan, the world contains two tuples, (Yahoo, Bill) and
(Yahoo, Fred).

Now we compute which skills we gain for certain:
\begin{verbatim}
create table SkillGained as
select Q1.cid, Q1.skill, p1, p2, p1/p2 as p
from (select R.cid, ES.skill, conf() as p1
      from RemainingEmployees R, ES
      where R.eid = ES.eid
      group by R.cid, ES.skill) Q1,
     (select cid, conf() as p2
      from RemainingEmployees
      group by cid) Q2
where Q1.cid = Q2.cid;

select cid, skill from SkillGained where p=1;
\end{verbatim}

The result is the table
\begin{center}
\begin{tabular}{ll}
CID & Skill \\
\hline
Google & Web \\
Yahoo & Java \\
\end{tabular}
\end{center}
indicating that if we buy Google, we gain the skill ``Web'' for certain, and
if we buy Yahoo, we gain the skill ``Java'' for certain.

It is worth looking at the auxiliary table SkillGained:
\begin{center}
\begin{tabular}{l|l@{~~}l@{~~}l@{~~}l@{~~}l}
SkillGained & CID & Skill & p1 & p2 & p \\
\hline
& Google & Web & 2/5 & 2/5 & 1 \\
& Yahoo & Java & 3/5 & 3/5 & 1 \\
& Yahoo & Web & 2/5 & 3/5 & 2/3 \\
& Yahoo & Search & 2/5 & 3/5 & 2/3 \\
\end{tabular}
\end{center}

This table consists of the tuples $(x, y, p1, p2, p)$ such that
\begin{itemize}
\item
$x$ is a company,

\item
$y$ is a skill,

\item
$p1$ is the probability that the chosen company is $x$ and the skill
$y$ is gained (e.g., for $x$=Yahoo and $y$=Web, this is true in two of the
five possible worlds),

\item
$p2$ is the probability that $x$ is the chosen company
(e.g., for $x$=Yahoo, this is true in three of the five possible worlds), and

\item
$p=p1/p2$ is the probability that skill $y$ is gained if company $x$ is bought
(e.g., for $x$=Yahoo and $y$=Web, the probability is 2/3: of the three possible
worlds in which Yahoo is bought, only two guarantee that the skill
Web is gained).
\end{itemize}

Thus, indeed, if we select those tuples of SkillGained for which $p=1$, we
obtain the desired pairs of companies and skills -- those skills that we
obtain for certain if we buy a company.
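
The p1, p2, and p columns above can be replayed in plain Python by enumerating the five worlds. This is a sketch under the uniform-choice assumption; variable names are our own:

```python
from fractions import Fraction

# The CE and ES tables from the example.
CE = [("Google", "Bob"), ("Google", "Joe"),
      ("Yahoo", "Dan"), ("Yahoo", "Bill"), ("Yahoo", "Fred")]
ES = {"Bob": {"Web"}, "Joe": {"Web"}, "Dan": {"Java", "Web"},
      "Bill": {"Search"}, "Fred": {"Java"}}

w = Fraction(1, len(CE))  # uniform probability of each (company, leaver) world

p1 = {}  # Pr[company cid is chosen and skill s is gained]
p2 = {}  # Pr[company cid is chosen]
for (cid, leaver) in CE:
    remaining = [e for (c, e) in CE if c == cid and e != leaver]
    p2[cid] = p2.get(cid, 0) + w
    for s in set().union(*(ES[e] for e in remaining)):
        p1[(cid, s)] = p1.get((cid, s), 0) + w

# Skills gained for certain: those with p = p1/p2 = 1.
certain = sorted((c, s) for (c, s), p in p1.items() if p == p2[c])
print(certain)  # [('Google', 'Web'), ('Yahoo', 'Java')]
```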
\section{Data Cleaning}

The following example is from the domain of data cleaning. Consider a
census in which a number of individuals complete forms that are subsequently
digitized using an OCR system; in some cases the system indicates a number
of alternative readings, together with probabilities.
For simplicity, let us assume that the forms only ask for a social security
number (SSN).

For example, suppose two individuals complete their forms and the OCR system
recognizes the SSN of the first to be either 185 (with probability .4) or
785 (with probability .6), and the SSN of the second to be either 185 (with
probability .7) or 186 (with probability .3). We store this information in a
probabilistic database constructed as
follows:
%
\begin{verbatim}
create table Census_SSN_0 (tid integer, ssn integer, p float);

insert into Census_SSN_0 values (1, 185, .4);
insert into Census_SSN_0 values (1, 785, .6);
insert into Census_SSN_0 values (2, 185, .7);
insert into Census_SSN_0 values (2, 186, .3);

create table Census_SSN as
repair key tid in Census_SSN_0 weight by p;
\end{verbatim}

We can view the alternatives and their probability weights using the following
query:
\begin{verbatim}
select tid, ssn, conf() as prior
from Census_SSN
group by tid, ssn;

 tid | ssn | prior
-----+-----+-------
   1 | 185 |   0.4
   1 | 785 |   0.6
   2 | 185 |   0.7
   2 | 186 |   0.3
\end{verbatim}

We can determine the probability that at least one individual has a
particular SSN (assuming that the OCR system did not miss the correct
SSN as an alternative) using the following query:
\begin{verbatim}
select ssn, conf() as ssn_prior
from Census_SSN
group by ssn;

 ssn | ssn_prior
-----+-----------
 185 |      0.82
 186 |       0.3
 785 |       0.6
\end{verbatim}
Indeed, the probability that at least one individual has SSN 185 is
$1 - .6 \cdot .3 = .82$.
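
The ssn\_prior values can be cross-checked by enumerating the four possible worlds (one OCR alternative per form). A small Python sketch, independent of MayBMS:

```python
from itertools import product

# OCR alternatives for each form: (ssn, probability).
alternatives = {1: [(185, 0.4), (785, 0.6)],
                2: [(185, 0.7), (186, 0.3)]}

# Enumerate the four worlds; a world's weight is the product of the
# chosen alternatives' probabilities.
prior = {}
for (ssn1, p1), (ssn2, p2) in product(*alternatives.values()):
    for ssn in {ssn1, ssn2}:   # at least one form has this ssn in the world
        prior[ssn] = prior.get(ssn, 0) + p1 * p2

print({s: round(p, 2) for s, p in sorted(prior.items())})
# {185: 0.82, 186: 0.3, 785: 0.6}
```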

We now perform data cleaning using a single integrity constraint, namely
that no two individuals can have the same SSN.
Conceptually, we want to exclude worlds that violate the functional dependency
\[
ssn \rightarrow tid,
\]
i.e., the constraint that ssn must be a key for the relation.

We start by computing an auxiliary relation that
contains, in each possible world, the ssn values that violate the
integrity constraint.
\begin{verbatim}
create table FD_Violations as
select S1.ssn
from Census_SSN S1, Census_SSN S2
where S1.tid < S2.tid and S1.ssn = S2.ssn;
\end{verbatim}
Note that two tuples violate the constraint if they have the same ssn
but different tids. We express this in the above query using a slightly
changed condition: (S1.tid $<$ S2.tid and S1.ssn = S2.ssn) instead of
(S1.tid $<>$ S2.tid and S1.ssn = S2.ssn). However, both conditions
select the same set of distinct ssn values that violate the integrity
constraint.

This query computes the uncertain table that holds 185 in the world in
which both forms have ssn value 185. In all other worlds it is empty.

Next we compute an auxiliary relation which contains, for each
SSN that occurs in at least one world in which the FD is violated,
the sum of the weights of those worlds in
which the SSN occurs and the FD is violated.
\begin{verbatim}
create table FD_Violations_by_ssn as
(
   select S.ssn, conf() as p
   from FD_Violations V,
        Census_SSN S
   group by S.ssn
);
\end{verbatim}


Next we compute the conditional probability table:
\begin{verbatim}
create table TidSSNPosterior as
select Q1.tid, Q1.ssn, p1, p2, p3,
       cast((p1-p2)/(1-p3) as real) as posterior
from
(
   select tid, ssn, conf() as p1
   from Census_SSN
   group by tid, ssn
) Q1,
(
   (select ssn, p as p2 from FD_Violations_by_ssn)
   union
   (
      (select ssn, 0 as p2 from Census_SSN_0)
      except
      (select possible ssn, 0 as p2 from FD_Violations_by_ssn)
   )
) Q2,
(
   select conf() as p3
   from FD_Violations
) Q3
where Q1.ssn = Q2.ssn;

select * from TidSSNPosterior;

 tid | ssn | p1  |  p2  |  p3  | posterior
-----+-----+-----+------+------+-----------
   1 | 185 | 0.4 | 0.28 | 0.28 |  0.166667
   1 | 785 | 0.6 |    0 | 0.28 |  0.833333
   2 | 185 | 0.7 | 0.28 | 0.28 |  0.583333
   2 | 186 | 0.3 |    0 | 0.28 |  0.416667
\end{verbatim}
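
The posterior column can be cross-checked by enumerating the four OCR worlds, discarding those that violate the constraint, and renormalizing by the surviving weight. A Python sketch of this conditioning (not MayBMS code):

```python
from itertools import product

# OCR alternatives for each form: (ssn, probability).
alternatives = {1: [(185, 0.4), (785, 0.6)],
                2: [(185, 0.7), (186, 0.3)]}

# Keep only worlds satisfying the constraint ssn -> tid (distinct ssns),
# then renormalize by the total weight of the surviving worlds.
posterior, total = {}, 0.0
for (s1, p1), (s2, p2) in product(*alternatives.values()):
    if s1 != s2:
        total += p1 * p2
        posterior[(1, s1)] = posterior.get((1, s1), 0) + p1 * p2
        posterior[(2, s2)] = posterior.get((2, s2), 0) + p1 * p2

print({k: round(v / total, 6) for k, v in sorted(posterior.items())})
# {(1, 185): 0.166667, (1, 785): 0.833333, (2, 185): 0.583333, (2, 186): 0.416667}
```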

This table stores, for each pair of form tid and ssn, the posterior probability
that the individual who completed form tid has the social security number
ssn, given that no two individuals can have the same ssn.

We can compute, for each form, the maximum-a-posteriori ssn (the most likely
ssn given the evidence specified by the integrity constraint) as
\begin{verbatim}
select tid, argmax(ssn, posterior) as map
from TidSSNPosterior
group by tid
order by tid;

 tid | map
-----+-----
   1 | 785
   2 | 185
\end{verbatim}

In a sense, these map values are the locally best values that we could decide
upon for each uncertain answer in our census database. Note, however, that
if we always choose the map value, we may sometimes create a database that
again violates the integrity constraints used for data cleaning.
This would have been the case if we had indicated probability .9 for both
185 alternatives in the input database.

A further example that computes conditional probabilities and MAP values
in a different context can be
found in Chapter~\ref{sect:ql} (Example~\ref{ex:coins_sql}).