Added probability notation to notation section

This commit is contained in:
Aaron Huber 2020-07-02 12:06:59 -04:00
parent b8d2ca529d
commit 9df39097c8
3 changed files with 25 additions and 19 deletions

View file

@ -11,6 +11,7 @@
\newcommand{\relii}{T}
\newcommand{\db}{D}
\newcommand{\idb}{\mathcal{\db}}
\newcommand{\pd}{\idb_{pd}}%pd for probability distribution
\newcommand{\query}{Q}
\newcommand{\join}{\Join}
\newcommand{\select}{\sigma}

View file

@ -8,7 +8,7 @@ Further, define $\rpoly(X_1,\ldots, X_\numTup)$ as the reduced version of $\poly
\[\rpoly(X_1,\ldots, X_\numTup) = \poly(X_1,\ldots, X_\numTup)_{\Sigma} \mod \wbit_1^2-\wbit_1\cdots\mod \wbit_\numTup^2 - \wbit_\numTup.\]
Intuitively, $\rpoly(\textbf{X})$ is the expanded sum of products form of $\poly(\textbf{X})$ such that if any $X_j$ term has an exponent $e > 1$, it is reduced to $1$, i.e. $X_j^e\mapsto X_j$ for any $e > 1$.
\AH{The next sentence is just to help the reader have a stronger intution for what's going on.}Alternatively, one can gain intuition for $\rpoly$ by thinking of $\rpoly$ as the resulting sum of product expansion of $\poly$ when $\poly$ is in a factorized form such that none of its terms have an exponent $e > 1$, with an idempotent product operator.
Alternatively, one can gain intuition for $\rpoly$ by thinking of $\rpoly$ as the resulting sum of product expansion of $\poly$ when $\poly$ is in a factorized form such that none of its terms have an exponent $e > 1$, with an idempotent product operator.
The usefulness of this reduction will be seen shortly.
@ -89,6 +89,14 @@ If we can compute $\poly(\wElem_1,\ldots, \wElem_\numTup) = q_E(\wElem_1,\ldots,
If we can compute $\poly(\wElem_1,\ldots, \wElem_\numTup) = q_E(\wElem_1,\ldots, \wElem_\numTup)^3$ in T(m) time for O(1) distinct values of $\prob$ then we can count the number of triangles (and the number of 3-paths, the number of 3-matchings) in $G$ in O(T(m) + m) time.
\end{Lemma}
\begin{Lemma}\label{lem:qE3-exp}
When we expand $\poly = q_E(\wElem_1,\ldots, \wElem_\numTup)$ out and assign all exponents $e \geq 1$ a value of $1$, we have the following,
\begin{align}
&\rpoly(\prob,\ldots, \prob) = \numocc{\ed}\prob^2 + 6\numocc{\twopath}\prob^3 + 6\numocc{\twodis} + 6\numocc{\tri}\prob^3 +\nonumber\\
&\qquad\qquad6\numocc{\oneint}\prob^4 + 6\numocc{\threepath}\prob^4 + 6\numocc{\twopathdis}\prob^5 + 6\numocc{\threedis}\prob^6.\label{claim:four-one}
\end{align}
\end{Lemma}
\AH{The warm-up below is fine for now, but will need to be removed for the final draft}
First, let us do a warm-up by computing $\rpoly(\wElem_1,\dots, \wElem_\numTup)$ when $\poly = q_E(\wElem_1,\ldots, \wElem_\numTup)$. Before doing so, we introduce a notation. Let $\numocc{H}$ denote the number of occurrences that $H$ occurs in $G$. So, e.g., $\numocc{\ed}$ is the number of edges ($m$) in $G$.
@ -98,6 +106,7 @@ First, let us do a warm-up by computing $\rpoly(\wElem_1,\dots, \wElem_\numTup)$
I'm not sure what I can add. The existing notation is fine (for now). I would suggest adding
a definition table.
}
\AH{UPDATE: I did a quick google, and it \textit{appears} that there is a bit of a learning curve to implement node/edge symbols in LaTeX. So, maybe, if time is of the essence, we go with another notation.}
\begin{Claim}
We can compute $\rpoly_2$ in O(m) time.
@ -135,12 +144,6 @@ Let $\poly(\wVec) = q_E^3(\wVec)^3$.
\end{Claim}
\begin{proof}
\AH{Use this equation as its own lemma, to be used in lemmas 3 and 4.}
When we expand $\poly$ out and assign all exponents $e \geq 1$ a value of $1$, we have the following,
\begin{align}
&\rpoly(\prob,\ldots, \prob) = \numocc{\ed}\prob^2 + 6\numocc{\twopath}\prob^3 + 6\numocc{\twodis} + 6\numocc{\tri}\prob^3 +\nonumber\\
&\qquad\qquad6\numocc{\oneint}\prob^4 + 6\numocc{\threepath}\prob^4 + 6\numocc{\twopathdis}\prob^5 + 6\numocc{\threedis}\prob^6.\label{claim:four-one}
\end{align}
We have shown and will show that the following subgraph cardinalities can be computed in $O(m)$ time:
\[\numocc{\ed}, \numocc{\twopath}, \numocc{\twodis}, \numocc{\oneint}, \numocc{\twopathdis} + \numocc{\threedis}.\]
@ -158,7 +161,7 @@ We have shown and will show that the following subgraph cardinalities can be com
It has already been shown previously that $\numocc{\ed}, \numocc{\twopath}, \numocc{\twodis}$ can be computed in O(m) time. Here are the arguments for the rest.
\[\numocc{\oneint} = \sum_{u \in V} \binom{d_u}{3}\]
$\numocc{\twopathdis} + \numocc{\threedis} = $ the number of occurrences of three distinct edges with five or six vertices. This can be counted in the following manner. For every edge $(u, v) \in E$, throw away all neighbors of $u$ and $v$ and pick two more distinct edges.
\[\numocc{\twopathdis} + \numocc{\threedis} = \sum_{(u, v) \in E} \binom{m - d_u - d_v - 1}{2}\] The implication in \cref{claim:four-two} follows by the above and \cref{claim:four-one}.\qed
\[\numocc{\twopathdis} + \numocc{\threedis} = \sum_{(u, v) \in E} \binom{m - d_u - d_v - 1}{2}\] The implication in \cref{claim:four-two} follows by the above and \cref{lem:qE3-exp}.\qed
\end{proof}
\begin{proof}[Proof of \cref{lem:gen-p}]

View file

@ -7,15 +7,15 @@
%2) DB (TIDB) notation
%3) How queries translate into polynomials
%}
\AH{I think I need to$\ldots$, possibly also the notion of $P(t \in \query(\idb))$, and then lead into discussing the semiring paper annotation of $\query(\rel)(t)$.}
An incomplete database $\idb$ is a set of deterministic databases $\db_i$ where each element is known a possible world. Since $\idb$ is modeling all the possible worlds of an uncertain database, it follows that each $\db_i \in \idb$ has the same named set of relations, $\{\rel_1,\ldots, \rel_n\}$ (albeit not equivalent across all instances), whose schemas are unchanging across each $\db_i$. When $\idb$ is a probabilistice database, $\idb$ can be viewed as having two components, the set of possible worlds, and a probability space $\left(\Omega, \mathcal{A}, P\right)$ over that set. Since the set of possible outcomes is the set of possible worlds, $\wSet$, and the set of outcomes is equivalent to the set of events, we will simplify notation and use $\left(\wSet, P\right)$ to denote the probability space of $\idb$.
\subsection{Intro}
An incomplete database $\idb$ is a set of deterministic databases $\db_i$ where each element is known as a possible world. Since $\idb$ is modeling all the possible worlds of an uncertain database, it follows that each $\db_i \in \idb$ has the same named set of relations, $\{\rel_1,\ldots, \rel_n\}$ (albeit not equivalent across all instances), whose schemas are unchanging across each $\db_i$. When $\idb$ is a probabilistice database, $\idb$ can be viewed as having two components, the set of possible worlds, and a probability space $\left(\Omega, \mathcal{A}, P\right)$ over that set. Since the set of possible outcomes is the set of possible worlds, $\wSet$, and the set of outcomes is equivalent to the set of events, we will simplify notation and use $\left(\wSet, P\right)$ to denote the probability space of $\idb$.
\subsection{Modeling and Semantics}
$\idb$ can be generally viewed as the set of relations $\{\prel_1,\ldots, \prel_n\}$, where for each $\prel_i \in \idb$, $\prel_i$ consists of the set of all tuples appearing in $\rel_i$ across each of the possible worlds $\db_i \in \idb$, where each tuple is annotated with a provenance polynomial from the set $\mathbb{N}[X]$, and the set $X$ is the alphabet of variables in $\idb$. One can think of $\idb$ as a parameterized database, whose abstract form maps to a deterministic $\db_i \in \idb$ based on the valuation to which the variables of $\idb$ are bound.
Note that the polynomial annotation of an arbitrary tuple can be viewed as a function $\poly(X_1,\ldots, X_N)$, where the variables can be bound to a specific valuation to determine the output of a tuple $\tup$'s annotation given the input valuation. Alternatively, the annotation for arbitrary tuple $\tup$ can be viewed as an element of the image of $\query(\prel)$, where relation $\query(\prel)$ can be thought of as a function with preimage of all tuples in $\query(\prel)$, such that $\query(\prel)(\tup) = \poly(X_1,\ldots, X_\numTup)$. Further, it is known that the algebraic semiring structure aptly models the translation and computation of query operations into tuple annotation, aka polynomials.
To make things more concrete, consider the $\{\mathbb{N}, \times, +, 1, 0\}$ bag semiring. Here the set in which the tuple annotations (computed polynomials) exist is the natural numbers. Query operations are translated into one of the two semiring operators, with $\project$ and $\union$ of agreeing tuples being the equivalent of the '+' opertator in polynomial $\poly$, $\join$ translating into the $\times$ operator, and finally, $\select$ is better modeled as a function that returns either $\rel(\tup)$ or $0$ based on some predicate.
Operations in $\query$ are translated into the following polynomial operations.
For the general commutative semiring, denote the plus and multiplication operators as $\oplus$ and $\otimes$ respectively, where summation represents summing over $\oplus$. Operations in $\query$ are translated into the following polynomial operations.
%\OK{
% Eventually, you probably want a little more background here, depending on the query notation you choose to use. The simplest approach would be basing it on the Green et. al. Provenance Semirings paper. As we discussed, that would make $\query(\mathcal D)(t)$ the query polynomial.
@ -32,9 +32,9 @@ Operations in $\query$ are translated into the following polynomial operations.
\begin{align*}
&\project_A(\rel)(\tup) = &&\sum_{\tup' s.t. \tup'[A] = \tup} \rel(\tup')\\
& (\rel_1 \union \rel_2)(\tup) = &&\rel_1(\tup) + \rel_2(\tup)\\
& (\rel_1 \union \rel_2)(\tup) = &&\rel_1(\tup) \oplus \rel_2(\tup)\\
&(\rel_1 \join_\theta \rel_2)(\tup) = &&\begin{cases}
\rel_1(\tup_1) \times \rel_2(\tup_2) &\text{if }\theta(\tup_1, \tup_2)\\
\rel_1(\tup_1) \otimes \rel_2(\tup_2) &\text{if }\theta(\tup_1, \tup_2)\\
0 &\text{otherwise}
\end{cases} \\
&\select_\theta(\rel) = &&\begin{cases}
@ -43,15 +43,17 @@ Operations in $\query$ are translated into the following polynomial operations.
\end{cases}
\end{align*}
Considering probabilistic databases, let $\prob(X_i)$ $\left(\prob(\vct{X})\right)$ denote the probability that a given variable (set of variables) occur(s). We can substitute $\wVec$ for $\vct{X}$ where the $i^{th}$ bit of $\wVec$ is bound to it's corresponding $X_i$ variable. Then $\prob(\wVec)$ denotes the probability that a given world occurs.
\subsection{Defining the Data}
Define $\pd$ to be the probability distribution for $\idb$. Let $\vct{w}$ be a $\left\lceil\log_2\left(\left|\wSet\right|\right)\right\rceil = \numTup$ binary bit vector, uniquely identifying possible world $\db_i \in \idb$. Let $\prob(X_i)$ $\left(\prob(\vct{X})\right)$ denote the probability that a given variable (set of variables) occur(s). We can substitute $\wVec$ for $\vct{X}$ where the $i^{th}$ bit of $\wVec$ is bound to it's corresponding $X_i$ variable, and it follows that $\prob(\wVec)$ denotes the probability that a given world occurs.
The output we desire performed over the tuple annotations, i.e. polynomial $\poly(X_1,\ldots, X_\numTup)$ is the expectation, i.e.
\[\expct_{\wVec}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\cdot \prob(\wVec).\]
Denote $\vct{w} \sim \pd$ to mean the probability of $\vct{w} \left(\prob(\vct{w})\right)$ according to the $\pd$ distribution.
One of the aggregates we desire to compute over the polynomial $\poly(X_1,\ldots, X_\numTup)$ is the expectation, denoted,
\[\expct_{\wVec \sim \pd}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\cdot \prob(\wVec).\]
A specific probabilistic data model is the Tuple Independent Database (\ti). This is a database model in which each table is a set of tuples, each of which are independent of one another, and individually occur with a specific probability, $\prob_\tup$.
There are features of $\ti$ that we can exploit. Note that a $\ti$ naturally has $2^\numTup$ possible worlds, each of which can be conveniently modeled by an $\numTup$ bit string. The bit-string world value can be used as an index to determine which tuples are present in the $\wVec$ world. We can then write and equivalent expectation for $\ti$ models,
There are features of $\ti$ that we can exploit. Note that a $\ti$ with $\numTup$ tuples naturally has $2^\numTup$ possible worlds, each of which can be conveniently modeled by an $\numTup$ bit string. The bit-string world value can be used as an index to determine which tuples are present in the $\wVec$ world. Given an $\numTup$ vector $\vct{p}$, where the $i^{th}$ element, $\prob_i$ is the probability of the $i^{th}$ tuple, we can then write an equivalent expectation for $\ti$ models,
\[\expct_{\wVec}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\prod_{\substack{i \in [\numTup]\\ s.t. \wElem_i = 1}}\prob_i \prod_{\substack{i \in [\numTup]\\s.t. w_i = 0}}\left(1 - \prob_i\right).\]
\[\expct_{\wVec\sim \pd^{\vct{p}}}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\prod_{\substack{i \in [\numTup]\\ s.t. \wElem_i = 1}}\prob_i \prod_{\substack{i \in [\numTup]\\s.t. w_i = 0}}\left(1 - \prob_i\right).\]