Consider a schema with relation names $R_1, \dots, R_k$. We use $sch(R_l)$ to denote the attributes of relation schema $R_l$.
Formally,
a {\em probabilistic database}\/ is a {\em finite}\/ set of structures
\[
\ww = \{ \tuple{R_1^1, \dots, R_k^1, p^{[1]}}, \dots,
\tuple{R_1^n, \dots, R_k^n, p^{[n]}} \}
\]
Computing possible and certain tuples is redundant with conf:
\begin{eqnarray*}
\mbox{poss}(R) &:=&
\pi_{sch(R)}(\mbox{conf}(R))
\\
\mbox{cert}(R) &:=& \pi_{sch(R)}(\sigma_{P=1}(\mbox{conf}(R)))
\end{eqnarray*}
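A minimal sketch of this redundancy, assuming $\mbox{conf}(R)$ is available as a list of (tuple, probability) pairs; the function names and the tiny instance are invented for illustration, not MayBMS's actual API:

```python
# Toy sketch: deriving poss and cert from conf, assuming conf(R) is
# given as a list of (data_tuple, probability) pairs.

def poss(conf_r):
    # Possible tuples: project away the probability column P.
    return {t for (t, p) in conf_r}

def cert(conf_r):
    # Certain tuples: those that hold with probability 1, i.e. in every world.
    return {t for (t, p) in conf_r if p == 1.0}

conf_r = [(("Smith", 185), 1.0), (("Brown", 186), 0.4)]
print(poss(conf_r))  # both tuples are possible
print(cert(conf_r))  # only the Smith tuple is certain
```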
For example, the US census consists of many dozens of questions for about 300 million individuals.
Suppose forms are digitized using OCR and the resulting data contains just
two possible readings for 0.1\% of the answers before cleaning.
Then, there are on the order of $2^{10,000,000}$ possible worlds, and each one will take
close to one Terabyte of data to store.
Clearly, we need a way of representing this data that is much better than a naive enumeration of possible worlds.
Also, the repair-key operator of probabilistic world-set algebra in general causes an exponential increase in the number of possible worlds.
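The arithmetic behind the census example can be checked directly; the per-form question count below is an assumed stand-in for ``many dozens'':

```python
# Rough arithmetic behind the census example: n answers, a fraction of
# which (0.1%) have two possible OCR readings, yields 2^(uncertain) worlds.
individuals = 300_000_000
questions_per_form = 50            # "many dozens" -- an assumed figure
answers = individuals * questions_per_form
uncertain = answers // 1000        # 0.1% of answers have two readings
print(uncertain)                   # tens of millions of uncertain answers,
                                   # i.e. on the order of 2^10,000,000 worlds
```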
This section gives a complete solution for efficiently evaluating a large fragment of probabilistic world-set algebra.
\paragraph{Translating queries down to the representation relations}
Let $\textit{rep}$ be the {\em representation function}\/, which
maps a U-relational data\-base to the set of possible worlds it represents.
Our goal is to give a reduction that maps
any positive relational algebra query $Q$ over probabilistic databases represented as U-relational
databases \textit{T} to an equivalent positive relational algebra query
$\overline{Q}$ of polynomial size such that
\[
\textit{rep}(\overline{Q}(T)) = \{Q({\cal A}^1),\ldots,Q({\cal A}^n)\}
\]
where the ${\cal A}^i$ are relational database instances (possible worlds) with $\textit{rep}(T) = \{{\cal A}^1,\ldots,{\cal A}^n\}$,
or, as a commutative diagram,
\vspace*{1em}
\begin{center}
\begin{psmatrix}[colsep=8em,rowsep=5em,nodesepA=3pt,nodesepB=3pt]
$T$ & $\overline{Q}(T)$\\
$\{{\cal A}^1,\ldots,{\cal A}^n\}$ &
$\{Q({\cal A}^1),\ldots,Q({\cal A}^n)\}$
%
\ncline{->}{1,1}{2,1}<{\textit{rep}}
\ncline{->}{2,1}{2,2}^{$Q$}
\ncline{->}{1,1}{1,2}^{$\overline{Q}$}
\ncline{->}{1,2}{2,2}>{\textit{rep}}
\end{psmatrix}
\end{center}
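A minimal executable sketch of this commutative diagram, with an invented toy U-relation and a simple selection standing in for $Q$; the encoding of conditions and the instance are illustrative, not MayBMS's internals:

```python
# Toy model of U-relations: each tuple carries a condition (a partial
# assignment of random variables to values) plus data values; rep
# enumerates all total valuations of the variables.
from itertools import product

domains = {"x": [1, 2], "y": [1, 2]}      # variable domains (from W)

# U-relation over schema (N, SSN): (condition, data) pairs
U_R = [
    ({"x": 1}, ("Smith", 185)),
    ({"x": 2}, ("Smith", 785)),
    ({"y": 1}, ("Brown", 186)),
]

def rep(U):
    """Map a U-relation to the set of possible worlds it represents."""
    worlds = set()
    for vals in product(*([(v, d) for d in domains[v]] for v in sorted(domains))):
        theta = dict(vals)
        world = frozenset(data for cond, data in U
                          if all(theta[v] == c for v, c in cond.items()))
        worlds.add(world)
    return worlds

def Q(world):        # query on one possible world: sigma_{SSN=185}
    return frozenset(t for t in world if t[1] == 185)

def Q_bar(U):        # rewritten query on the representation itself
    return [(cond, data) for cond, data in U if data[1] == 185]

# The diagram commutes: rep(Q_bar(T)) == { Q(A) : A in rep(T) }
print(rep(Q_bar(U_R)) == {Q(A) for A in rep(U_R)})
```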
For example, the query asking for the possible names of individuals whose SSN is 185 is evaluated as
\[
\mbox{poss}(\pi_N(\sigma_{SSN=185}(R[SSN] \bowtie R[N]))).
\]
We rewrite the query using our rewrite rules into
\[
\pi_N(\sigma_{SSN=185}(U_{R[SSN]}
\bowtie_{\psi\wedge\phi} U_{R[N]})),
\]
where the join condition $\psi\wedge\phi$ requires the variable assignments in the condition columns of the two U-relations to be consistent.
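The essential point of such joins is that the condition columns of the two operands must be merged consistently: a pairing that assigns two different values to the same variable describes no possible world. A hypothetical sketch (all names and the instance invented):

```python
# Consistency test for joining two U-relations: two conditions are
# joinable iff they never assign different values to the same variable.

def consistent(cond1, cond2):
    return all(cond1[v] == cond2[v] for v in cond1.keys() & cond2.keys())

def u_join(U1, U2):
    # Join on a shared TID column (data[0]), merging consistent conditions.
    return [({**c1, **c2}, d1 + d2[1:])
            for c1, d1 in U1 for c2, d2 in U2
            if d1[0] == d2[0] and consistent(c1, c2)]

U_ssn = [({"x": 1}, ("t1", 185)), ({"x": 2}, ("t1", 785))]
U_name = [({}, ("t1", "Smith"))]
print(u_join(U_ssn, U_name))
# With overlapping variables, inconsistent pairings such as x->1 joined
# with x->2 would be filtered out by the consistency test.
```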
To compute the confidence in a tuple of data values that may occur in several tuples
of a U-relation, we have to compute the probability of the disjunction of the local conditions of all these tuples. We have to eliminate duplicate tuples because we are interested in the probability of the data tuples rather than some abstract notion of tuple identity that is really an artifact of our representation. That is, we have to compute the probability of a
DNF, i.e., the sum of the weights of the worlds identified with valuations $\theta$ of the random variables such that the DNF becomes true under $\theta$. This problem is \#P-complete
\cite{GGH1998,dalvi07efficient}. The result is not the sum of the probabilities of the individual conjunctive local conditions, because they may, intuitively, ``overlap''.
\begin{example}\em
Consider
a U-relation with schema $\{V,D\}$ (representing a nullary relation) and two tuples
$\tuple{x,1}$ and $\tuple{y,1}$, with the $W$ relation from Example~\ref{ex:urelation}.
Then the confidence in the nullary tuple $\tuple{}$ is $\Pr[x \mapsto 1 \lor y \mapsto 1] =
\Pr[x \mapsto 1] + \Pr[y \mapsto 1] - \Pr[x \mapsto 1 \land y \mapsto 1] = .82$.
\end{example}
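The probabilities in the $W$ relation of Example~\ref{ex:urelation} are not repeated here; the marginals below (0.7 and 0.4) are hypothetical values that happen to yield the same confidence of .82:

```python
# Probability of the DNF (x=1) OR (y=1) over independent variables,
# computed both by brute-force world enumeration and by
# inclusion-exclusion. The marginals 0.7 and 0.4 are invented.
from itertools import product

p = {"x": {1: 0.7, 2: 0.3}, "y": {1: 0.4, 2: 0.6}}

def dnf_prob(clauses):
    """Sum the weights of the valuations theta satisfying some clause."""
    total = 0.0
    for vx, vy in product(p["x"], p["y"]):
        theta = {"x": vx, "y": vy}
        if any(all(theta[v] == c for v, c in cl.items()) for cl in clauses):
            total += p["x"][vx] * p["y"][vy]
    return total

brute = dnf_prob([{"x": 1}, {"y": 1}])
incl_excl = 0.7 + 0.4 - 0.7 * 0.4          # subtract the "overlap"
print(round(brute, 2), round(incl_excl, 2))  # 0.82 0.82
```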
This may lead to an exponentially large output.
For these reasons, MayBMS currently does not implement the difference operation.
In many practical applications, the difference operation can be avoided.
Difference is only hard on uncertain relations. On such relations, it can only lead to displayable query results in queries that close the possible worlds semantics using conf, computing a single certain relation.
%
Probably the most important application of the difference operation is for encoding universal constraints, for example in data cleaning.
But if the confidence operation is applied on top of a universal query, there is a trick that often allows rewriting the query into one that computes a difference of confidence values instead.
$
\phi(t_i,s) = \exists t \in R\; t.TID=t_i \land t.SSN=s.
$
%
Constraint $\psi$ can be thought of as a data cleaning constraint that ensures that the SSN fields in no two distinct census forms (belonging to two different individuals) are interpreted as the same number.
We compute the desired conditional probabilities, for each possible pair of a TID and an SSN, as
$
\Pr[\phi \mid \psi] = \Pr[\phi \land \psi]/\Pr[\psi].
$
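A brute-force sketch of this conditioning trick on an invented two-form instance; all names and probabilities are illustrative:

```python
# Enforce the cleaning constraint psi (distinct forms get distinct SSNs)
# by conditioning: conf(phi AND psi) / conf(psi), instead of using
# relational difference. Two forms, each with two possible SSN readings.
from itertools import product

p = {"t1": {185: 0.6, 785: 0.4}, "t2": {185: 0.5, 186: 0.5}}

def prob(event):
    """Total weight of the worlds (one reading per form) satisfying event."""
    total = 0.0
    for s1, s2 in product(p["t1"], p["t2"]):
        if event(s1, s2):
            total += p["t1"][s1] * p["t2"][s2]
    return total

def psi(s1, s2):
    return s1 != s2          # no two forms share an SSN

def phi(s1, s2):
    return s1 == 185         # form t1 is read as SSN 185

cond = prob(lambda a, b: phi(a, b) and psi(a, b)) / prob(psi)
print(round(cond, 3))        # probability of phi given the constraint
```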
It is worth noting that repair-key by itself, despite the blowup of possible worlds, does not make queries hard. For the language consisting of positive relational algebra, repair-key, and poss, we have shown by construction that it has PTIME complexity: we have given a positive relational algebra rewriting to queries on the representations earlier in this section. Thus queries are even in the highly parallelizable complexity class AC$^0$.
The final result in Figure~\ref{tab:complexity} concerns the language consisting of the positive relational algebra operations, repair-key, $(\epsilon, \delta)$-approximation of confidence computation, and the generalized equality generating dependencies of \cite{Koch2008} for which we can rewrite difference of uncertain relations to difference of confidence values (see Example~\ref{ex:trick}). The result is that queries of that language that close the possible worlds semantics -- i.e., that use conf to compute a certain relation -- are in PTIME overall.