





Consider a schema with relation names $R_1, \dots, R_k$; we use $sch(R_l)$ to denote the set of attributes of relation schema $R_l$.


Formally,


a {\em probabilistic database}\/ is a {\em finite}\/ set of structures


\[


\ww = \{ \tuple{R_1^1, \dots, R_k^1, p^{[1]}}, \dots,
         \tuple{R_1^n, \dots, R_k^n, p^{[n]}} \}
\]
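This definition can be sketched in code. The following is a minimal illustration with made-up data (the relation name, attribute values, and weights are all invented for the example): a probabilistic database as a finite set of possible worlds, each pairing one instance of the schema with a weight $p^{[i]}$, and a query evaluated independently in every world.

```python
from fractions import Fraction

# A probabilistic database: a finite set of (instance, weight) pairs.
# Each instance maps relation names to sets of tuples. All data is made up.
worlds = [
    ({"R": {("Smith", 185)}}, Fraction(1, 2)),
    ({"R": {("Smith", 186)}}, Fraction(1, 4)),
    ({"R": {("Brown", 186)}}, Fraction(1, 4)),
]

# A valid probabilistic database has weights in (0, 1] that sum to one.
assert all(0 < p <= 1 for _, p in worlds)
assert sum(p for _, p in worlds) == 1

# Possible-worlds semantics: a query runs independently in every world.
def query_in_all_worlds(worlds, q):
    return [(q(inst), p) for inst, p in worlds]

# Example query: project R onto its first attribute.
names = query_in_all_worlds(worlds, lambda inst: {n for n, _ in inst["R"]})
```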







Computing possible and certain tuples is redundant with conf:


\begin{eqnarray*}
\mbox{poss}(R) &:=& \pi_{sch(R)}(\mbox{conf}(R))
\\
\mbox{cert}(R) &:=& \pi_{sch(R)}(\sigma_{P=1}(\mbox{conf}(R)))
\end{eqnarray*}
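A sketch of this reduction, over possible worlds represented directly as (instance, weight) pairs with made-up data: conf returns each possible tuple with its total probability; projecting away the probability column gives poss, and selecting $P=1$ first gives cert.

```python
# Made-up example worlds: tuple ("a",) appears in both, ("b",) in one.
worlds = [
    ({"R": {("a",), ("b",)}}, 0.5),
    ({"R": {("a",)}},         0.5),
]

def conf(worlds, rel):
    """Map each possible tuple to the total weight of worlds containing it."""
    out = {}
    for inst, p in worlds:
        for t in inst[rel]:
            out[t] = out.get(t, 0.0) + p
    return out

def poss(worlds, rel):
    # poss(R) := pi_sch(R)(conf(R)) -- drop the probability column
    return set(conf(worlds, rel))

def cert(worlds, rel):
    # cert(R) := pi_sch(R)(sigma_{P=1}(conf(R)))
    return {t for t, p in conf(worlds, rel).items() if abs(p - 1.0) < 1e-9}
```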



For example, the US census consists of many dozens of questions for about 300 million people.


Suppose forms are digitized using OCR and the resulting data contains just


two possible readings for 0.1\% of the answers before cleaning.


Then, there are on the order of $2^{10,000,000}$ possible worlds, and each one will take


close to one Terabyte of data to store.


Clearly, we need a way of representing this data that is much better than a naive enumeration of possible worlds.
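A quick back-of-the-envelope check of these numbers (the per-form question count of 33 is an assumption standing in for "many dozens"):

```python
import math

# Census example: ~300 million forms, ~33 questions each (assumed figure).
forms = 300_000_000
questions_per_form = 33
answers = forms * questions_per_form   # ~1e10 answers in total
ambiguous = answers // 1000            # 0.1% have two possible readings

# Each ambiguous answer independently doubles the number of worlds,
# giving 2**ambiguous worlds -- on the order of 2^{10,000,000}.
digits = int(ambiguous * math.log10(2))
print(f"2^{ambiguous} worlds, i.e. roughly 10^{digits}")
```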




Also, the repairkey operator of probabilistic worldset algebra in general causes an exponential increase in the number of possible worlds.
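The blowup can be made concrete with a hedged sketch of a repairkey-style operator (the data and the weight-normalization scheme below are illustrative assumptions, not MayBMS's exact semantics): given tuples that may violate a key, it produces one world per way of choosing exactly one tuple per key value, so the number of worlds is the product of the group sizes.

```python
from itertools import product

# Made-up rows: (key, value, weight); both keys have two candidate tuples.
rows = [
    ("f1", 185, 2.0), ("f1", 785, 1.0),
    ("f2", 185, 1.0), ("f2", 186, 3.0),
]

def repairkey(rows):
    """One world per choice of a single tuple per key, weights normalized
    within each key group and multiplied across groups."""
    groups = {}
    for k, v, w in rows:
        groups.setdefault(k, []).append((v, w))
    worlds = []
    for choice in product(*groups.values()):
        p = 1.0
        inst = {}
        for (k, alts), (v, w) in zip(groups.items(), choice):
            p *= w / sum(wt for _, wt in alts)  # normalize per key group
            inst[k] = v
        worlds.append((inst, p))
    return worlds

worlds = repairkey(rows)
# 2 * 2 = 4 worlds: the count is exponential in the number of key groups.
```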



This section gives a complete solution for efficiently evaluating a large fragment of probabilistic worldset algebra.


\paragraph{Translating queries down to the representation relations}


Let $\textit{rep}$ be the {\em representation function}\/, which


maps a U-relational database to the set of possible worlds it represents.


Our goal is to give a reduction that maps


any positive relational algebra query $Q$ over probabilistic databases represented as U-relational


databases $T$ to an equivalent positive relational algebra query


$\overline{Q}$ of polynomial size such that



where the ${\cal A}^i$ are relational database instances (possible worlds)


or, as a commutative diagram,


\vspace*{1em}


\begin{center}


\begin{psmatrix}[colsep=8em,rowsep=5em,nodesepA=3pt,nodesepB=3pt]


$T$ & $\overline{Q}(T)$\\


$\{{\cal A}^1,\ldots,{\cal A}^n\}$ &


$\{Q({\cal A}^1),\ldots,Q({\cal A}^n)\}$


%


\ncline{->}{1,1}{2,1}<{\textit{rep}}
\ncline{->}{2,1}{2,2}^{$Q$}
\ncline{->}{1,1}{1,2}^{$\overline{Q}$}
\ncline{->}{1,2}{2,2}>{\textit{rep}}


\end{psmatrix}




\end{center}







is evaluated as


\[


\mbox{poss}(\pi_N(\sigma_{SSN=185}(R[SSN] \bowtie R[N]))).


\]


We rewrite the query using our rewrite rules into


\[


\pi_N(\sigma_{SSN=185}(U_{R[SSN]}


\bowtie_{\psi\wedge\phi} U_{R[N]})),





To compute the confidence in a tuple of data values occurring possibly in several tuples


of a U-relation, we have to compute the probability of the disjunction of the local conditions of all these tuples. We have to eliminate duplicate tuples because we are interested in the probability of the data tuples rather than in some abstract notion of tuple identity that is really an artifact of our representation. That is, we have to compute the probability of a DNF, i.e., the sum of the weights of the worlds identified with valuations $\theta$ of the random variables under which the DNF becomes true. This problem is \#P-complete \cite{GGH1998,dalvi07efficient}. The result is not the sum of the probabilities of the individual conjunctive local conditions, because they may, intuitively, ``overlap''.




\begin{example}\em


Consider


a U-relation with schema $\{V,D\}$ (representing a nullary relation) and two tuples


$\tuple{x,1}$ and $\tuple{y,1}$, with the $W$ relation from Example~\ref{ex:urelation}.


Then the confidence in the nullary tuple $\tuple{}$ is $\Pr[x \mapsto 1 \lor y \mapsto 1] =


\Pr[x \mapsto 1] + \Pr[y \mapsto 1] - \Pr[x \mapsto 1 \land y \mapsto 1] = .82$.
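This calculation can be checked by brute-force enumeration over valuations. The marginal distributions below are assumptions chosen so that $\Pr[x \mapsto 1] = .7$ and $\Pr[y \mapsto 1] = .4$, which yield $.7 + .4 - .7 \cdot .4 = .82$ under independence; they stand in for the $W$ relation of the example.

```python
from itertools import product

# Assumed marginals for independent variables x and y (value -> probability).
W = {"x": {1: 0.7, 2: 0.3}, "y": {1: 0.4, 2: 0.6}}
# Local conditions of the two tuples, as a DNF: x = 1  or  y = 1.
dnf = [{"x": 1}, {"y": 1}]

def prob_dnf(W, dnf):
    """Sum the weights of all valuations theta under which some clause holds."""
    names = sorted(W)
    total = 0.0
    for values in product(*(W[v] for v in names)):
        theta = dict(zip(names, values))
        weight = 1.0
        for v in names:
            weight *= W[v][theta[v]]
        if any(all(theta[v] == c for v, c in clause.items()) for clause in dnf):
            total += weight
    return total

print(round(prob_dnf(W, dnf), 2))  # 0.82 under the assumed marginals
```

Enumeration is exponential in the number of variables, which is exactly why the \#P-hardness above matters in practice.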





For these reasons, MayBMS currently does not implement the difference operation.




In many practical applications, the difference operation can be avoided.


Difference is only hard on uncertain relations. On such relations, it can only lead to displayable query results in queries that close the possible worlds semantics using conf, computing a single certain relation.


%


Probably the most important application of the difference operation is for encoding universal constraints, for example in data cleaning.


But if the confidence operation is applied on top of a universal query, there is a trick that will often allow us to rewrite the query into an



$


\phi(t_i,s) = \exists t \in R\; t.TID=t_i \land t.SSN=s.


$


%


Constraint $\psi$ can be thought of as a data cleaning constraint that ensures that the SSN fields in no two distinct census forms (belonging to two different individuals) are interpreted as the same number.




We compute the desired conditional probabilities, for each possible pair of a TID and an SSN, as


$





It is worth noting that repairkey by itself, despite the blowup of possible worlds, does not make queries hard. For the language consisting of positive relational algebra, repairkey, and poss, we have shown by construction that it has PTIME complexity: we have given a positive relational algebra rewriting to queries on the representations earlier in this section. Thus queries are even in the highly parallelizable complexity class AC$^0$.




The final result in Figure~\ref{tab:complexity} concerns the language consisting of the positive relational algebra operations, repairkey, $(\epsilon, \delta)$-approximation of confidence computation, and the generalized equality-generating dependencies of \cite{Koch2008}, for which we can rewrite difference of uncertain relations to difference of confidence values (see Example~\ref{ex:trick}). The result is that queries of that language that close the possible worlds semantics, i.e., that use conf to compute a certain relation, are in PTIME overall.



















