Some notation changes and clarifications; changes to prose of S3

master
Aaron Huber 2022-02-10 13:03:25 -05:00
parent 94073d22a2
commit 344cb72646
9 changed files with 55 additions and 51 deletions

@ -15,12 +15,12 @@ and this yields the claimed runtime.
\subsection{Proof of \Cref{lem:tdet-om}}
\begin{proof}
By the recursive definition of $\qruntime{\cdot, \cdot}$ (see \Cref{sec:gen}), we have the following equation for our hard query $\query$ when $k = 1$ (we denote this as $\query^1$).
By the recursive definition of $\qruntimenoopt{\cdot, \cdot}$ (see \Cref{sec:gen}), we have the following equation for our hard query $\query$ when $k = 1$ (we denote this as $\query^1$).
\begin{equation*}
\qruntime{\query^1, \dbbase} = \abs{\dbbase.\vset} + \abs{\dbbase.\edgeSet} + \abs{\dbbase.\vset} + \jointime{\dbbase.\vset , \dbbase.\edgeSet , \dbbase.\vset}.
\qruntimenoopt{\query^1, \tupset} = \abs{\tupset.\vset} + \abs{\tupset.\edgeSet} + \abs{\tupset.\vset} + \jointime{\tupset.\vset , \tupset.\edgeSet , \tupset.\vset}.
\end{equation*}
We argue that $\jointime{\dbbase.\vset , \dbbase.\edgeSet , \dbbase.\vset}$ is at most $O(\numedge)$ by noting that there exists an algorithm that computes $\dbbase.\vset\join\dbbase.\edgeSet\join\dbbase.\vset$ in the same runtime\footnote{Indeed, the trivial algorithm that computes the obvious pair-wise joins has the claimed runtime. That is, we first compute $\dbbase.\vset\join\dbbase.\edgeSet$, which takes $O(m)$ (assuming $\dbbase.\vset$ is stored in a hash map) since tuples in $\dbbase.\vset$ can only filter tuples in $\dbbase.\edgeSet$. The resulting subset of tuples in $\dbbase.\edgeSet$ is then again joined (on the right) with $\dbbase.\vset$, which by the same argument as before also takes $O(m)$ time, as desired.}. Then by the assumption of \Cref{lem:pdb-for-def-qk} (each $v \in \vset$ has degree $\geq 1$), the sum of the first three terms is $\bigO{\numedge}$. We then obtain that $\qruntime{\query^1, \dbbase} = \bigO{\numedge} + \bigO{\numedge} = \bigO{\numedge}$. For $\query^k = \query_1^1 \times\cdots\times\query_k^1$, we have the recurrence $\qruntime{\query^k, \dbbase} = \qruntime{\query_1^1, \dbbase} + \cdots +\qruntime{\query_k^1, \dbbase} + \jointime{\query_1^1,\cdots,\query_k^1}$. Since $\query^1$ outputs a count, computing the join $\query_1^1\join\cdots\join\query_k^1$ is just multiplying $k$ numbers, which takes $O(k)$ time. Thus, we have
\[\qruntime{\query^k, \dbbase} \le k\cdot O(m)+O(k)\le O(km),\]
We argue that $\jointime{\tupset.\vset , \tupset.\edgeSet , \tupset.\vset}$ is at most $O(\numedge)$ by noting that there exists an algorithm that computes $\tupset.\vset\join\tupset.\edgeSet\join\tupset.\vset$ in the same runtime\footnote{Indeed, the trivial algorithm that computes the obvious pair-wise joins has the claimed runtime. That is, we first compute $\tupset.\vset\join\tupset.\edgeSet$, which takes $O(m)$ (assuming $\tupset.\vset$ is stored in a hash map) since tuples in $\tupset.\vset$ can only filter tuples in $\tupset.\edgeSet$. The resulting subset of tuples in $\tupset.\edgeSet$ is then again joined (on the right) with $\tupset.\vset$, which by the same argument as before also takes $O(m)$ time, as desired.}. Then by the assumption of \Cref{lem:pdb-for-def-qk} (each $v \in \vset$ has degree $\geq 1$), the sum of the first three terms is $\bigO{\numedge}$. We then obtain that $\qruntimenoopt{\query^1, \tupset} = \bigO{\numedge} + \bigO{\numedge} = \bigO{\numedge}$. For $\query^k = \query_1^1 \times\cdots\times\query_k^1$, we have the recurrence $\qruntimenoopt{\query^k, \tupset} = \qruntimenoopt{\query_1^1, \tupset} + \cdots +\qruntimenoopt{\query_k^1, \tupset} + \jointime{\query_1^1,\cdots,\query_k^1}$. Since $\query^1$ outputs a count, computing the join $\query_1^1\join\cdots\join\query_k^1$ is just multiplying $k$ numbers, which takes $O(k)$ time. Thus, we have
\[\qruntimenoopt{\query^k, \tupset} \le k\cdot O(m)+O(k)\le O(km),\]
as desired.
\qed
%The dominating term in the sum of the recursive definition is $\abs{\query_1^1\join\cdots\join\query_k^1} = \bigO{\numedge^k} = O_k(\numedge)$.

@ -141,7 +141,7 @@ Finally, note that there are exactly three cases where the expectation of a mono
\subsection{Proof for Lemma~\ref{lem:exp-poly-rpoly}}\label{subsec:proof-exp-poly-rpoly}
\subsection{Proof for Lemma~\ref{lem:tidb-reduce-poly}}\label{subsec:proof-exp-poly-rpoly}
\begin{proof}
Let $\poly$ be a polynomial of $\numvar$ variables with highest degree $= B$, defined as follows: %, in which every possible monomial permutation appears,
\[\poly(X_1,\ldots, X_\numvar) = \sum_{\vct{d} \in \{0,\ldots, B\}^\numvar}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar X_i^{d_i}.\]
@ -159,14 +159,14 @@ Then in expectation we have
\end{align}
\Cref{p1-s1a} is the result of substituting in the definition of $\poly$ given above. Then we arrive at \cref{p1-s1b} by linearity of expectation. Next, \cref{p1-s1c} is the result of the independence constraint of \abbrBIDB\xplural, specifically that any monomial composed of dependent variables, i.e., variables from the same block $\block$, has a probability of $0$. \Cref{p1-s2} is obtained by the fact that all variables in each monomial are independent, which allows for the expectation to be pushed through the product. In \cref{p1-s3}, since $\randWorld_i \in \{0, 1\}$ it is the case that for any exponent $e \geq 1$, $\randWorld_i^e = \randWorld_i$. Next, in \cref{p1-s4} the expectation of a tuple is indeed its probability.
Finally, it can be verified that \Cref{p1-s5} follows since \cref{p1-s4} satisfies the construction of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ in \Cref{def:reduced-bi-poly}.
Finally, it can be verified that \Cref{p1-s5} follows since \cref{p1-s4} satisfies the construction of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ in \Cref{def:reduced-poly}.
\qed
\end{proof}
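As a small sanity check of the steps in the proof above (an illustrative example that is not part of the original argument), assume that $X_1$ and $X_2$ lie in different blocks, so that $\randWorld_1, \randWorld_2 \in \{0, 1\}$ are independent, and take the hypothetical polynomial $\poly(X_1, X_2) = X_1^2X_2 + 3X_1$. Then
\begin{align*}
\expct\pbox{\poly(\randWorld_1, \randWorld_2)} &= \expct\pbox{\randWorld_1^2\randWorld_2} + 3\expct\pbox{\randWorld_1} && \text{(linearity of expectation)}\\
&= \expct\pbox{\randWorld_1^2}\expct\pbox{\randWorld_2} + 3\expct\pbox{\randWorld_1} && \text{(independence)}\\
&= \expct\pbox{\randWorld_1}\expct\pbox{\randWorld_2} + 3\expct\pbox{\randWorld_1} && \text{(since $\randWorld_1^2 = \randWorld_1$)}\\
&= \prob_1\prob_2 + 3\prob_1 = \rpoly(\prob_1, \prob_2),
\end{align*}
where the last equality holds because dropping the exponent in $X_1^2X_2 + 3X_1$ gives $X_1X_2 + 3X_1$.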
\subsection{Proof For Corollary~\ref{cor:expct-sop}}
\begin{proof}
Note that \cref{lem:exp-poly-rpoly} shows that $\expct\pbox{\poly} =$ $\rpoly(\prob_1,\ldots, \prob_\numvar)$. Therefore, if $\poly$ is already in \abbrSMB form, one only needs to compute $\poly(\prob_1,\ldots, \prob_\numvar)$ ignoring exponent terms (note that such a polynomial is $\rpoly(\prob_1,\ldots, \prob_\numvar)$), which indeed has $\bigO{\abs{\poly}}$ computations.
Note that~\Cref{lem:tidb-reduce-poly} shows that $\expct\pbox{\poly} =$ $\rpoly(\prob_1,\ldots, \prob_\numvar)$. Therefore, if $\poly$ is already in \abbrSMB form, one only needs to compute $\poly(\prob_1,\ldots, \prob_\numvar)$ ignoring exponent terms (note that such a polynomial is $\rpoly(\prob_1,\ldots, \prob_\numvar)$), which indeed has $\bigO{\abs{\poly}}$ computations.
\qed
\end{proof}

@ -220,27 +220,27 @@ Next, we motivate this reduced polynomial.
Consider the query $\query_1$ defined as follows over the bag relations of \Cref{fig:two-step}:
}
\begin{lstlisting}
SELECT 1 FROM OnTime a, Route r, OnTime b
WHERE a.city = r.city1 AND b.city = r.city2
SELECT 1 FROM T $t_1$, Route r, T $t_2$
WHERE $t_1$.city = r.city1 AND $t_2$.city = r.city2
\end{lstlisting}
\secrev{
It can be verified that $\poly\inparen{A, B, C, E, X, Y, Z}$ for the sole result tuple (i.e. the count) of $\query$ is $AXB + BYE + BZC$. Now consider the product query $\query_1^2 = \query_1 \times \query_1$.
The lineage polynomial for $Q_1^2$ is given by $\poly^2\inparen{A, B, C, E, X, Y, Z}$
The lineage polynomial for $Q_1^2$ is given by $\poly_1^2\inparen{A, B, C, E, X, Y, Z}$
$$
=A^2X^2B^2 + B^2Y^2E^2 + B^2Z^2C^2 + 2AXB^2YE + 2AXB^2ZC + 2B^2YEZC.
$$
To compute $\expct\pbox{\poly^2}$ we can use linearity of expectation and push the expectation through each summand. To keep things simple, let us focus on the monomial $\poly^{\inparen{ABX}^2} = A^2X^2B^2$ as the procedure is the same for all other monomials of $\poly^2$. Let $\randWorld_X$ be the random variable corresponding to a lineage variable $X$. Because the distinct variables in the product are independent, we can push expectation through them yielding $\expct\pbox{\randWorld_A^2\randWorld_X^2\randWorld_B^2}=\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B^2}$. Since $\randWorld_A, \randWorld_B\in \inset{0, 1}$ we can further derive $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B}$ by the fact that for any $W\in \inset{0, 1}$, $W^2 = W$. However, we get stuck with $\expct\pbox{\randWorld_X^2}$, since $\randWorld_X\in\inset{0, 1, 2}$ and for $\randWorld_X \gets 2$, $\randWorld_X^2 \neq \randWorld_X$.
To compute $\expct\pbox{\poly_1^2}$ we can use linearity of expectation and push the expectation through each summand. To keep things simple, let us focus on the monomial $\poly_1^{\inparen{ABX}^2} = A^2X^2B^2$ as the procedure is the same for all other monomials of $\poly_1^2$. Let $\randWorld_X$ be the random variable corresponding to a lineage variable $X$. Because the distinct variables in the product are independent, we can push expectation through them yielding $\expct\pbox{\randWorld_A^2\randWorld_X^2\randWorld_B^2}=\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B^2}$. Since $\randWorld_A, \randWorld_B\in \inset{0, 1}$ we can further derive $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B}$ by the fact that for any $W\in \inset{0, 1}$, $W^2 = W$. However, we get stuck with $\expct\pbox{\randWorld_X^2}$, since $\randWorld_X\in\inset{0, 1, 2}$ and for $\randWorld_X \gets 2$, $\randWorld_X^2 \neq \randWorld_X$.
%the expectation is $\expct\pbox{A^2X^2B^2} = A\cdot\prob_A\cdot\inparen{\sum\limits_{i \in [2]}X_i\cdot \prob_{X, i}}\cdot B\prob_B$ for $X \in \inset{0, 1, 2}$.
Denote the variables of $\poly$ to be $\vars{\poly}.$ In the \abbrCTIDB setting, $\poly\inparen{\vct{X}}$ has an equivalent reformulation $\inparen{\refpoly}$ that is of use to us. Given $X_\tup \in\vars{\poly}$, by definition $X_\tup \in\inset{0,\ldots, c}$. We can replace $X_\tup$ by $\sum_{j\in\pbox{\bound}}X_{\tup, j}$ where each $X_{\tup, j}\in\inset{0, 1}$. Then for any $\worldvec\in\worlds$, we set $X_{\tup, j} = 1$ for $\worldvec\pbox{\tup} = j$, while $X_{\tup, j'} = 0$ for all $j'\neq j\in\pbox{\bound}$. By construction then $\poly\inparen{\vct{X}}\equiv\refpoly\inparen{\vct{X}}$ since for any $X_\tup\in\vars{\poly}$ we have the equality $X_\tup = j = \sum_{j\in\pbox{\bound}}jX_j$.
Denote the variables of $\poly$ to be $\vars{\poly}.$ In the \abbrCTIDB setting, $\poly\inparen{\vct{X}}$ has an equivalent reformulation $\inparen{\refpoly{}}$ that is of use to us. Given $X_\tup \in\vars{\poly}$, by definition $X_\tup \in\inset{0,\ldots, c}$. We can replace $X_\tup$ by $\sum_{j\in\pbox{\bound}}X_{\tup, j}$ where each $X_{\tup, j}\in\inset{0, 1}$. Then for any $\worldvec\in\worlds$, we set $X_{\tup, j} = 1$ for $\worldvec_\tup = j$, while $X_{\tup, j'} = 0$ for all $j'\neq j\in\pbox{\bound}$. By construction then $\poly\inparen{\vct{X}}\equiv\refpoly{}\inparen{\vct{X}}$ since for any $X_\tup\in\vars{\poly}$ we have the equality $X_\tup = j = \sum_{j\in\pbox{\bound}}jX_j$.
Considering again our example,
\begin{multline*}
\refpoly^{\inparen{ABX}^2}\inparen{A, X, B} = \poly^{\inparen{AXB}^2}\inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}, \sum_{j_2\in\pbox{\bound}}j_2X_{j_2}, \sum_{j_3\in\pbox{\bound}}j_3B_{j_3}} \\
\refpoly{1, }^{\inparen{ABX}^2}\inparen{A, X, B} = \poly^{\inparen{AXB}^2}\inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}, \sum_{j_2\in\pbox{\bound}}j_2X_{j_2}, \sum_{j_3\in\pbox{\bound}}j_3B_{j_3}} \\
= \inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}}^2\inparen{\sum_{j_2\in\pbox{\bound}}j_2X_{j_2}}^2\inparen{\sum_{j_3\in\pbox{\bound}}j_3B_{j_3}}^2.
\end{multline*}
Since the set of multiplicities for tuple $\tup$ by nature are disjoint we can drop all cross terms and have $\poly_R^2 = \sum_{j_1, j_2, j_3 \in \pbox{\bound}}j_1^2A^2_{j_1}j_2^2X_{j_2}^2j_3^2B^2_{j_3}$. Computing expectation we get $\expct\pbox{\poly^2}=\sum_{j_1,j_2,j_3\in\pbox{\bound}}j_1^2j_2^2j_3^2\expct\pbox{\randWorld_{A_{j_1}}}\expct\pbox{\randWorld_{X_{j_2}}}\expct\pbox{\randWorld_{B_{j_3}}}$, since we now have that all $\randWorld_{X_j}\in\inset{0, 1}$.
Since the set of multiplicities for tuple $\tup$ by nature are disjoint we can drop all cross terms and have $\refpoly{1, }^2 = \sum_{j_1, j_2, j_3 \in \pbox{\bound}}j_1^2A^2_{j_1}j_2^2X_{j_2}^2j_3^2B^2_{j_3}$. Computing expectation we get $\expct\pbox{\refpoly{1, }^2}=\sum_{j_1,j_2,j_3\in\pbox{\bound}}j_1^2j_2^2j_3^2\expct\pbox{\randWorld_{A_{j_1}}}\expct\pbox{\randWorld_{X_{j_2}}}\expct\pbox{\randWorld_{B_{j_3}}}$, since we now have that all $\randWorld_{X_j}\in\inset{0, 1}$.
% \begin{footnotesize}
% \begin{align*}
% &\expct\pbox{\randWorld_A^2\randWorld_X^2\randWorld_B^2} = \expct\pbox{\randWorld_A^2}\expct\pbox{\inparen{\randWorld_{X_1} + \randWorld_{X_2}}^2}\expct\pbox{\randWorld_B^2} = \expct\pbox{\randWorld_A}\expct\pbox{\randWorld_{X_1}^2 + 2\randWorld_{X_1}\randWorld_{X_2} + \randWorld_{X_2}^2}\expct\pbox{\randWorld_B} =\\
@ -274,23 +274,23 @@ This leads us to consider a structure related to the lineage polynomial.
%\end{footnotesize}
%\noindent This property leads us to consider a structure related to the lineage polynomial.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly\inparen{\inparen{X_\tup}_{\tup\in\tupset}}$ define the reformulated polynomial $\refpoly\inparen{\inparen{X_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}%X_{1, 1},\ldots X_{1, \bound}, X_{2, 1}\ldots X_{\numvar, \bound}}
$ to be the polynomial $\refpoly$ = $\poly\inparen{\inparen{\sum_{j\in\pbox{\bound}}j\cdot X_{\tup, j}}_{\tup\in\tupset}}%,\ldots,\sum_{j\in\pbox{\bound}}j\cdot X_{\numvar, j}}
For any polynomial $\poly\inparen{\inparen{X_\tup}_{\tup\in\tupset}}$ define the reformulated polynomial $\refpoly{}\inparen{\inparen{X_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}%X_{1, 1},\ldots X_{1, \bound}, X_{2, 1}\ldots X_{\numvar, \bound}}
$ to be the polynomial $\refpoly{}$ = $\poly\inparen{\inparen{\sum_{j\in\pbox{\bound}}j\cdot X_{\tup, j}}_{\tup\in\tupset}}%,\ldots,\sum_{j\in\pbox{\bound}}j\cdot X_{\numvar, j}}
$ and ii) define the \emph{reduced polynomial} $\rpoly\inparen{\inparen{X_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}%(X_{1, 1},\ldots X_{1, \bound}, X_{2, 1}\ldots X_{\numvar, \bound})
$ to be the polynomial resulting from converting $\refpoly$ into the standard monomial basis (\abbrSMB),
$ to be the polynomial resulting from converting $\refpoly{}$ into the standard monomial basis (\abbrSMB),
\footnote{
This is the representation, typically used in set-\abbrPDB\xplural, where the polynomial is represented as a sum of `pure' products. See \Cref{def:smb} for a formal definition.
}
removing all monomials containing the term $X_{\tup, j}X_{\tup, j'}$ for $\tup\in\tupset, j\neq j'\in\pbox{c}$, and setting all \emph{variable} exponents $e > 1$ to $1$.
\end{Definition}
Continuing with the example $\poly^2\inparen{A, B, C, E, X_1, X_2, Y, Z}$, to save clutter we i) do not show the full expansion for variables with greatest multiplicity $= 1$ since e.g. for variable $A$, the sum of products itself evaluates to $1^2\cdot A^2 = A$, and ii) for $\sum_{j\in\pbox{\bound}}j^2\cdot X_j$, we omit the summands encoding multiplicities $> 2$, since the greatest multiplicity of the tuple annotated with $X$ is $2$; likewise, those summands will always evaluate to $0$ since the tuple will never have a multiplicity $>2$.
Continuing with the example $\poly_1^2\inparen{A, B, C, E, X_1, X_2, Y, Z}$, to save clutter we i) do not show the full expansion for variables with greatest multiplicity $= 1$ since e.g. for variable $A$, the sum of products itself evaluates to $1^2\cdot A^2 = A$, and ii) for $\sum_{j\in\pbox{\bound}}j^2\cdot X_j$, we omit the summands encoding multiplicities $> 2$, since the greatest multiplicity of the tuple annotated with $X$ is $2$; likewise, those summands will always evaluate to $0$ since the tuple will never have a multiplicity $>2$.
\begin{multline*}
\rpoly^2(A, B, C, E, X_1, X_2, Y, Z) = \\
\rpoly_1^2(A, B, C, E, X_1, X_2, Y, Z) = \\
A\inparen{\sum\limits_{j\in\pbox{\bound}}j^2X_j}B + BYE + BZC + 2A\inparen{\sum\limits_{j\in\pbox{\bound}}j^2X_j}BYE + 2A\inparen{\sum\limits_{j\in\pbox{\bound}}j^2X_j}BZC + 2BYEZC =\\
ABX_1 + AB\inparen{2}^2X_2 + BYE + BZC + 2AX_1BYE + 2A\inparen{2}^2X_2BYE + 2AX_1BZC + 2A\inparen{2}^2X_2BZC + 2BYEZC.
%&\; = AXB + BYD + BZC + 2AXBYD + 2AXBZC + 2BYDZC
\end{multline*}
Note that we have argued that for our specific example the expectation that we want is $\widetilde{\poly^2}(\probOf\inparen{A=1},$ $\probOf\inparen{B=1}, \probOf\inparen{C=1}, \probOf\inparen{E=1}, \probOf\inparen{X_1=1}, \probOf\inparen{X_2=1}, \probOf\inparen{Y=1}, \probOf\inparen{Z=1})$.
Note that we have argued that for our specific example the expectation that we want is $\rpoly_1^2(\probOf\inparen{A=1},$ $\probOf\inparen{B=1}, \probOf\inparen{C=1}, \probOf\inparen{E=1}, \probOf\inparen{X_1=1}, \probOf\inparen{X_2=1}, \probOf\inparen{Y=1}, \probOf\inparen{Z=1})$.
%It can be verified that the reduced polynomial parameterized with each variable's respective marginal probability is a closed form of the expected count (i.e., $\expct\limits_{\vct{\randWorld}\sim\pd}\pbox{\Phi^2\inparen{\vct{X}}} = \widetilde{\Phi^2}(\probOf\pbox{A=1},$ $\probOf\pbox{B=1}, \probOf\pbox{C=1}), \probOf\pbox{D=1}, \probOf\pbox{X=1}, \probOf\pbox{Y=1}, \probOf\pbox{Z=1})$).
\Cref{lem:tidb-reduce-poly} generalizes the equivalence to {\em all} $\raPlus$ queries on \abbrCTIDB\xplural (proof in \Cref{subsec:proof-exp-poly-rpoly}).
\begin{Lemma}\label{lem:tidb-reduce-poly}
@ -298,7 +298,7 @@ For any \abbrCTIDB $\pdb$, $\raPlus$ query $\query$, and lineage polynomial
%\BG{Term has not been introduced yet.}
%Atri: fixed
$\poly\inparen{\vct{X}}=\poly\pbox{\query,\tupset,\tup}\inparen{\vct{X}}$, it holds that $
\expct_{\vct{W} \sim \pdassign}\pbox{\refpoly\inparen{\vct{W}}} = \rpoly\inparen{\probAllTup}
\expct_{\vct{W} \sim \pdassign}\pbox{\refpoly{}\inparen{\vct{W}}} = \rpoly\inparen{\probAllTup}
$, where $\probAllTup = \inparen{\inparen{\prob_{\tup, j}}_{\tup\in\tupset, j\in\pbox{c}}}.$%,\ldots,\prob_{\abs{\tupset}, \bound}}$ is defined by $\bpd$.
\end{Lemma}
}
@ -351,16 +351,16 @@ For an upper bound on approximating the expected count, it is easy to check that
\begin{footnotesize}
\begin{align*}
\hspace*{-3mm}
\poly^2\inparen{\probAllTup} &= \prob_A^2\prob_X^2\prob_B^2 + \prob_B^2\prob_Y^2\prob_E^2 + \prob_B^2\prob_Z^2\prob_C^2 + 2\prob_A\prob_X\prob_B^2\prob_Y\prob_E + 2\prob_A\prob_X\prob_B^2\prob_Z\prob_C + 2\prob_B^2\prob_Y\prob_E\prob_Z\prob_C\\
\poly_1^2\inparen{\probAllTup} &= \prob_A^2\prob_X^2\prob_B^2 + \prob_B^2\prob_Y^2\prob_E^2 + \prob_B^2\prob_Z^2\prob_C^2 + 2\prob_A\prob_X\prob_B^2\prob_Y\prob_E + 2\prob_A\prob_X\prob_B^2\prob_Z\prob_C + 2\prob_B^2\prob_Y\prob_E\prob_Z\prob_C\\
&\leq\prob_A\prob_X\prob_B + \prob_B\prob_Y\prob_E + \prob_B\prob_Z\prob_C +
2\prob_A\prob_X\prob_B\prob_Y\prob_E + 2\prob_A\prob_X\prob_B\prob_Z\prob_C + 2\prob_B\prob_Y\prob_E\prob_Z\prob_C
= \rpoly\inparen{\vct{p}}
= \rpoly_1^2\inparen{\vct{p}}
%\inparen{0.9\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 0.5}^2 = 2.7225 < 3.45 = \rpoly^2\inparen{\probAllTup}
\end{align*}
\end{footnotesize}
If we assume that all seven probability values are at least $p_0>0$,
%Choose the least factor that is reduced in $\rpoly^2\inparen{\vct{X}}$, in this case $\prob_A\prob_X\prob_B$, and
we get that $\poly^2\inparen{\vct{\prob}}$ is in the range $[\inparen{p_0}^3\cdot\rpoly\inparen{\vct{\prob}}, \rpoly\inparen{\vct{\prob}}]$.
we get that $\poly_1^2\inparen{\vct{\prob}}$ is in the range $[\inparen{p_0}^3\cdot\rpoly^2_1\inparen{\vct{\prob}}, \rpoly_1^2\inparen{\vct{\prob}}]$.
%
%To get an $(1\pm \epsilon)$-multiplicative approximation we uniformly sample monomials from the \abbrSMB representation of $\poly$ and `adjust' their contribution to $\widetilde{\poly}\left(\cdot\right)$.
In~\cref{sec:algo} we demonstrate that a $(1\pm\epsilon)$ (multiplicative) approximation with competitive performance is achievable.
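As an illustrative check of the range stated above (the values are assumptions chosen for easy arithmetic and are not taken from \Cref{fig:two-step}), set all seven probabilities to $p_0 = \frac{1}{2}$ in the displayed expansion. Each of the three squared monomials and each of the three cross terms of $\poly_1^2$ then has six probability factors, while in the reduced polynomial the first three terms have three factors and the cross terms have five:
\begin{align*}
\poly_1^2\inparen{\vct{\prob}} &= 3\cdot\tfrac{1}{64} + 3\cdot 2\cdot\tfrac{1}{64} = \tfrac{9}{64}, &
\rpoly_1^2\inparen{\vct{\prob}} &= 3\cdot\tfrac{1}{8} + 3\cdot 2\cdot\tfrac{1}{32} = \tfrac{9}{16}.
\end{align*}
Hence $\inparen{p_0}^3\cdot\rpoly_1^2\inparen{\vct{\prob}} = \tfrac{9}{128} \leq \tfrac{9}{64} \leq \tfrac{9}{16}$, consistent with the claimed range.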

@ -14,10 +14,13 @@ Next, we show why we can compute $\numocc{G}{\kmatch}$ from $c_{2k}$ in $O(1)$ a
We claim that $c_{2\kElem}$ is $\kElem! \cdot \numocc{G}{\kmatch}$. This can be seen intuitively by looking at the expansion of the original factorized representation
\[\poly_{G}^\kElem(\vct{X}) = \sum_{\substack{(i_1, j_1),\cdots,(i_\kElem, j_\kElem) \in E}}X_{i_1}X_{j_1}\cdots X_{i_\kElem}X_{j_\kElem},\]
where a unique $\kElem$-matching in the multi-set of product terms can be selected $\prod_{i = 1}^\kElem i = \kElem!$ times.
Indeed, note that each $\kElem$-matching $(i_1, j_1)\ldots$ $(i_k, j_k)$ in $G$ corresponds to the monomial $\prod_{\ell = 1}^\kElem X_{i_\ell}X_{j_\ell}$ in $\poly_{G}^\kElem(\vct{X})$, with distinct indexes, and this implies that each distinct $\kElem$-matching appears the exact number of permutations that exist for the set of its edges, or $k!$. Second, the only surviving monomials $\prod_{\ell = 1}^\kElem X_{i_\ell}X_{j_\ell}$ of degree exactly $2k$ in $\rpoly_{G}^{\kElem}(\vct{X})$ must have that all of $i_1,j_1,\dots,i_\kElem,j_\kElem$ are distinct in $\poly_{G}^{\kElem}(\vct{X})$.
By the last two statements, only monomials composed of $2k$ distinct variables in $\poly_{G}^{\kElem}(\vct{X})$ (and hence of degree $2\kElem$ in $\rpoly_{G}^{\kElem}(\vct{X})$) correspond to a $k$-matching in $G$.
As noted above, each of the $k!$ permutations of an arbitrary monomial maps to the same distinct $\kElem$-matching in $G$, and this implies a $\kElem!$ to $1$ mapping between degree $2\kElem$ monomials in $\rpoly_{G}^{\kElem}(\vct{X})$ and $\kElem$-matchings in $G$.
It then follows that $c_{2\kElem}= \kElem! \cdot \numocc{G}{\kmatch}$.
Indeed, note that each $\kElem$-matching $(i_1, j_1)\ldots$ $(i_k, j_k)$ in $G$ corresponds to the monomial $\prod_{\ell = 1}^\kElem X_{i_\ell}X_{j_\ell}$ in $\poly_{G}^\kElem(\vct{X})$, with distinct indexes, and this implies that each distinct $\kElem$-matching appears the exact number of permutations that exist for its particular set of $\kElem$ edges, or $k!$.
Since, as noted earlier, $c_{2\kElem}$ counts the monomials with $2\kElem$ distinct variables, $c_{2\kElem}$ counts the $\kElem$-matchings of $G$ with multiplicity. And since each distinct $\kElem$-matching appears exactly $\kElem!$ times, it follows that
%Second, the only surviving monomials $\prod_{\ell = 1}^\kElem X_{i_\ell}X_{j_\ell}$ of degree exactly $2k$ in $\rpoly_{G}^{\kElem}(\vct{X})$ must have that all of $i_1,j_1,\dots,i_\kElem,j_\kElem$ are distinct in $\poly_{G}^{\kElem}(\vct{X})$.
%By the last two statements, only monomials composed of $2k$ distinct variables in $\poly_{G}^{\kElem}(\vct{X})$ (and hence of degree $2\kElem$ in $\rpoly_{G}^{\kElem}(\vct{X})$) correspond to a $k$-matching in $G$.
%As noted above, each of the $k!$ permutations of an arbitrary monomial maps to the same distinct $\kElem$-matching in $G$, and this implies a $\kElem!$ to $1$ mapping between degree $2\kElem$ monomials in $\rpoly_{G}^{\kElem}(\vct{X})$ and $\kElem$-matchings in $G$.
%It then follows that
$c_{2\kElem}= \kElem! \cdot \numocc{G}{\kmatch}$.
Thus, simply dividing $c_{2\kElem}$ by $\kElem!$ gives us $\numocc{G}{\kmatch}$, as needed. \qed
\end{proof}
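To illustrate the counting argument (a hypothetical example, not from the original text), let $G$ be the path on vertices $1, 2, 3, 4$ with $\edgeSet = \{(1,2), (2,3), (3,4)\}$ and let $\kElem = 2$. Then
\begin{align*}
\poly_{G}^2(\vct{X}) &= \inparen{X_1X_2 + X_2X_3 + X_3X_4}^2\\
&= X_1^2X_2^2 + X_2^2X_3^2 + X_3^2X_4^2 + 2X_1X_2^2X_3 + 2X_1X_2X_3X_4 + 2X_2X_3^2X_4.
\end{align*}
The only monomial with $2\kElem = 4$ distinct variables is $X_1X_2X_3X_4$, and its coefficient is $2 = 2!$. This matches the single $2$-matching $\{(1,2), (3,4)\}$ of $G$, so under the reading of $c_{2\kElem}$ used in the proof above we get $c_4 = 2!\cdot 1$.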

@ -70,7 +70,8 @@ In other words, if \Cref{th:single-p} holds, then so must \Cref{th:single-p-hard
\subsection{Proof of \Cref{th:single-p-hard}}
\begin{proof}
For the sake of contradiction, assume that for any $G$, we can compute $\rpoly_{G}^3(\prob,\dots,\prob)$ in $o\inparen{m^{1+\eps_0}}$ time.
Let $G$ be the input graph. It is easy to see that one can compute the expression tree for $\poly_{G}^3(\vct{X})$ in $O(m)$ time. Then by \Cref{th:single-p} we can compute $\numocc{G}{\tri}$ in further time $o\inparen{m^{1+\eps_0}}+O(m)$. Thus, the overall reduction takes $o\inparen{m^{1+\eps_0}}+O(m)= o\inparen{m^{1+\eps_0}}$ time, which violates \Cref{conj:graph}.
Let $G$ be the input graph. %It is easy to see that one can compute the expression tree for $\poly_{G}^3(\vct{X})$ in $O(m)$ time.
Then by \Cref{th:single-p} we can compute $\numocc{G}{\tri}$ in further time $o\inparen{m^{1+\eps_0}}+O(m)$. Thus, the overall reduction takes $o\inparen{m^{1+\eps_0}}+O(m)= o\inparen{m^{1+\eps_0}}$ time, which violates \Cref{conj:graph}.
\qed
\end{proof}

@ -120,8 +120,8 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Incomplete DB/PDBs %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\idb}{{\overline{\Omega}}}
\newcommand{\pd}{{\mathcal{P}_{\idb}}}%pd for probability distribution
\newcommand{\idb}{{\Omega}}
\newcommand{\pd}{{\mathcal{P}}}%pd for probability distribution
\newcommand{\pdassign}{\mathcal{P}}
\newcommand{\pdb}{\mathcal{D}}
\newcommand{\dbbase}{\db_\idb}
@ -242,7 +242,7 @@
\newcommand{\atupvar}{\tupvar{\rel}{\tup}}
\newcommand{\polyX}{\poly\inparen{\vct{\pVar}}}%<---let's see if this proves handy
\newcommand{\rpoly}{\widetilde{\poly}}%r for reduced as in reduced 'Q'
\newcommand{\refpoly}{\poly_R}
\newcommand{\refpoly}[1]{\poly_{#1R}}
\newcommand{\rpolyX}{\rpoly\inparen{\pVar}}%<---if this isn't something we use much, we can get rid of it
\newcommand{\biDisProd}{\mathcal{B}}%bidb disjoint tuple products (def 2.5)
\newcommand{\rExp}{\mathcal{T}}%the set of variables to reduce all exponents to 1 via modulus operation; I think \mathcal T collides with the notation used for the set of tuples in D

@ -2,9 +2,8 @@
%!TEX root=./main.tex
\section{Hardness of Exact Computation}
\label{sec:hard}
\AH{If anything need be changed in~\Cref{sec:hard}, it would only be in the following (opening) paragraph.}
In this section, we will prove the hardness results claimed in Table~\ref{tab:lbs} for a specific family of hard instances $(\query,\pdb)$ for \Cref{prob:bag-pdb-poly-expected} where $\pdb$ is a \abbrTIDB.
Note that this implies hardness for \bis and general \abbrBPDB, showing \Cref{prob:bag-pdb-poly-expected} cannot be done in $O\inparen{\qruntime{\query,\dbbase}}$ runtime.
In this section, we will prove the hardness results claimed in Table~\ref{tab:lbs} for a specific family of hard instances $(\query,\pdb)$ for \Cref{prob:bag-pdb-poly-expected} where $\pdb$ is a $1$-\abbrTIDB.
Note that this implies hardness for \abbrCTIDB\xplural $\inparen{\bound\geq1}$, \bis and general \abbrBPDB, showing \Cref{prob:bag-pdb-poly-expected} cannot be done in $\bigO{\qruntime{\query,\tupset}}$ runtime.
%(and hence the equivalent \Cref{prob:bag-pdb-query-eval})
%in the negative.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -15,7 +14,7 @@ In particular, we will consider the problems of computing the following counts (
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Theorem}[\cite{k-match}]
\label{thm:k-match-hard}
Given positive integer $k$ and undirected graph $G=(\vset,\edgeSet)$ with no self-loops or parallel edges, $\kmatchtime\ge \littleomega{f(k)\cdot |\edgeSet|^c}$ for any function $f$ and fixed constant $c$ independent of $\numedge$ and $k$ (assuming $\sharpwzero\ne\sharpwone$).
Given positive integer $k$ and undirected graph $G=(\vset,\edgeSet)$ with no self-loops or parallel edges, $\kmatchtime\ge \littleomega{f(k)\cdot |\edgeSet|^c}$ for any function $f$ and fixed constant $c$ independent of $\abs{E}$ and $k$ (assuming $\sharpwzero\ne\sharpwone$).
\end{Theorem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{hypo}\label{conj:known-algo-kmatch}
@ -47,32 +46,33 @@ For any graph $G=(V,\edgeSet)$ and $\kElem\ge 1$, define
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\noindent Returning to \Cref{fig:two-step}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ is the lineage polynomial corresponding to the query that generalizes our example query from \Cref{sec:intro}. Let us alias
\noindent Returning to \Cref{fig:two-step}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ is the lineage polynomial whose structure mirrors the query $\query_2$ from \Cref{sec:intro}. Let us alias
\begin{lstlisting}
SELECT 1 FROM OnTime a, Route r, OnTime b
WHERE a.city = r.city1 AND b.city = r.city2
SELECT 1 FROM T $t_1$, R r, T $t_2$
WHERE $t_1$.city = r.city1 AND $t_2$.city = r.city2
\end{lstlisting}
as $R_i$ for each $i \in [k]$. The query $\query^k$ then becomes
\begin{lstlisting}
SELECT COUNT(*) FROM $R_1$ JOIN $R_2$ JOIN$\cdots$JOIN $R_k$
\end{lstlisting}
\noindent Further, the PDB instance generalizes the one in \Cref{fig:two-step} as follows. Relation $OnTime$ has $n$ tuples corresponding to each vertex for $i$ in $[n]$, each with probability $\prob_i$ and $Route$ has tuples corresponding to the edges $\edgeSet$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $Route$ as well, but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $Route$ also are present with probability $\prob$ but to simplify notation we assign probability $1$ to edges.}
In other words, for this instance $\dbbase$ contains the set of $n$ unary tuples in $OnTime$ (which corresponds to $\vset$) and $m$ binary tuples in $Route$ (which corresponds to $\edgeSet$).
Note that this implies that $\poly_{G}^\kElem$ is indeed a \abbrTIDB-lineage polynomial. % for a \abbrTIDB \abbrPDB.
Next, we note that the runtime for answering $\query^k$ on deterministic database $\dbbase$, as defined above, is $O(m)$ (i.e. deterministic query processing is `easy' for this query):
\noindent Further, the \abbrCTIDB instance of~\Cref{fig:two-step} generalizes to one compatible with~\Cref{def:qk} as follows. Relation $T$ has $n$ tuples, one for each vertex $i$ in $[n]$, each with probability $\prob_i$, and $R$ has tuples corresponding to the edges $\edgeSet$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $R$ as well, but since they are always present with probability $1$, we drop those. Our argument also works when all the tuples in $R$ are also present with probability $\prob$, but to simplify notation we assign probability $1$ to edges.}
In other words, for this instance $\tupset$ contains the set of $\numvar$ unary tuples in $T$ (which corresponds to $\vset$) and $\numedge$ binary tuples in $R$ (which corresponds to $\edgeSet$).
Note that this implies that $\poly_{G}^\kElem$ is indeed a \abbrCTIDB-lineage polynomial. % for a \abbrTIDB \abbrPDB.
\AH{Can the proofs generalize to $2$-\abbrTIDB, as the new updated~\Cref{fig:two-step} now is?}
Next, we note that the runtime for answering $\query^k$ on deterministic database $\tupset$, as defined above, is $\bigO{\numedge}$ (i.e. deterministic query processing is `easy' for this query):
\begin{Lemma}\label{lem:tdet-om}
Let $\query^k$ and $\dbbase$ be as defined above. Then
Let $\query^k$ and $\tupset$ be as defined above. Then
% of \Cref{def:qk}, the runtime
$\qruntime{\query^k, \dbbase}$ is $O(\kElem\numedge)$.
$\qruntimenoopt{\query^k, \tupset}$ is $\bigO{\kElem\numedge}$.
\end{Lemma}
\AH{Should the above be $\qruntimenoopt{}$ or $\qruntime$?}
\subsection{Multiple Distinct $\prob$ Values}
\label{sec:multiple-p}
%Unless otherwise noted, all proofs for this section are in \Cref{app:single-mult-p}.
We are now ready to present our main hardness result.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\AH{Note that~\Cref{def:reduced-poly} has been changed, where we compute $\rpoly$ from $\refpoly{}$.}
\begin{Theorem}\label{thm:mult-p-hard-result}
Let $\prob_0,\ldots,\prob_{2k}$ be $2k + 1$ distinct values in $(0, 1]$. Then computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ (over all $i\in [2k+1]$ for arbitrary $G=(\vset,\edgeSet)$
%and any $(2k+1)$ distinct values $\prob_i$ ($0\le i \le 2k$)
@ -80,7 +80,7 @@ needs time $\bigOmega{\kmatchtime}$, assuming $\kmatchtime\ge \omega\inparen{\ab
\end{Theorem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
Note that the second row of \Cref{tab:lbs} follows from \Cref{prop:expection-of-polynom}, \Cref{thm:mult-p-hard-result}, \Cref{lem:tdet-om}, and \Cref{thm:k-match-hard}, while the third row is proved by \Cref{prop:expection-of-polynom}, \Cref{thm:mult-p-hard-result}, \Cref{lem:tdet-om}, and \Cref{conj:known-algo-kmatch}. Since \Cref{conj:known-algo-kmatch} is non-standard, the latter hardness result should be interpreted as follows. Any substantial polynomial improvement for \Cref{prob:bag-pdb-poly-expected} (over the trivial algorithm that converts $\poly$ into SMB and then uses \Cref{cor:expct-sop} for \abbrStepTwo) would lead to an improvement over the state-of-the-art {\em upper} bounds on $\kmatchtime$. Finally, note that \Cref{thm:mult-p-hard-result} requires one to be able to compute the expected multiplicities over $(2k+1)$ distinct values of $p_i$, each of which corresponds to a distinct $\pd$ (for the same $\dbbase$), which explains the `Multiple' entry in the second column in the second and third rows in \Cref{tab:lbs}. Next, we argue how to get rid of this latter requirement.
Note that the second row of \Cref{tab:lbs} follows from \Cref{prop:expection-of-polynom}, \Cref{thm:mult-p-hard-result}, \Cref{lem:tdet-om}, and \Cref{thm:k-match-hard}, while the third row is proved by \Cref{prop:expection-of-polynom}, \Cref{thm:mult-p-hard-result}, \Cref{lem:tdet-om}, and \Cref{conj:known-algo-kmatch}. Since \Cref{conj:known-algo-kmatch} is non-standard, the latter hardness result should be interpreted as follows. Any substantial polynomial improvement for \Cref{prob:bag-pdb-poly-expected} (over the trivial algorithm that converts $\poly$ into SMB and then uses \Cref{cor:expct-sop} for \abbrStepTwo) would lead to an improvement over the state-of-the-art {\em upper} bounds on $\kmatchtime$. Finally, note that \Cref{thm:mult-p-hard-result} requires one to be able to compute the expected multiplicities over $(2k+1)$ distinct values of $p_i$, each of which corresponds to a distinct $\bpd$ (for the same $\tupset$), which explains the `Multiple' entry in the second column in the second and third rows in \Cref{tab:lbs}. Next, we argue how to get rid of this latter requirement.
%%% Local Variables:

View File

@ -52,7 +52,7 @@ Recalling \Cref{fig:nxDBSemantics} again, which defines the lineage polynomial $
\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
Given a \abbrBPDB $\pdb = (\Omega,\bpd)$, $\raPlus$ query $\query$, and lineage polynomial $\apolyqdt$ for arbitrary result tuple $\tup$, %$\semNX$-\abbrPDB $\pxdb = (\idb_{\semNX}',\pd')$ where $\rmod(\pxdb) = \pdb$,
we have (denoting $\randDB$ as the random variable over $\idb$):
we have (denoting $\randDB$ as the random variable over $\Omega$):
$ \expct_{\randDB \sim \bpd}[\query(\randDB)(t)] = \expct_{\vct{\randWorld}\sim \pdassign}\pbox{\apolyqdt\inparen{\vct{\randWorld}}}. $
\end{Proposition}
\noindent A formal proof of \Cref{prop:expection-of-polynom} is given in \Cref{subsec:expectation-of-polynom-proof}.\footnote{Although \Cref{prop:expection-of-polynom} follows, e.g., as an obvious consequence of~\cite{IL84a}'s Theorem 7.1, we are unaware of any formal proof for bag-probabilistic databases.}
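\noindent As a small illustration of \Cref{prop:expection-of-polynom} (a hypothetical example, not taken from the running example), suppose the lineage polynomial of $\tup$ is $X_1X_2 + X_1$, where $X_1$ and $X_2$ are assumed to be independent $\inset{0, 1}$-valued variables that equal $1$ with probability $\prob_1$ and $\prob_2$, respectively. Linearity of expectation and independence then give
\begin{equation*}
\expct\pbox{X_1X_2 + X_1} = \expct\pbox{X_1}\cdot\expct\pbox{X_2} + \expct\pbox{X_1} = \prob_1\prob_2 + \prob_1,
\end{equation*}
which, by the proposition, is also the expected multiplicity of $\tup$ in the query result.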

View File

@ -56,7 +56,7 @@ The probability distribution $\bpd'$ is the one induced by $\vct{p} = \inparen{\
\end{Definition}
For the \abbrCTIDB $\pdb$, each $X_\tup\in\pbox{\bound}$, while in the reduced \abbrOneBIDB $\pdb'$, each $X_{\tup, j}\in\inset{0, 1}$. %As previously noted, unlike $X_{\tup}\in\inset{0,\ldots,\bound}$ for $X_{\tup}\in\vars{\pdb}$, $X_{\tup, j}\in\inset{0,1}$ for $X_{\tup, j}\in\vars{\pdb'}$.
Hence, in the setting of \abbrOneBIDB, the base case of~\Cref{fig:nxDBSemantics} now becomes $\poly\pbox{\rel,\tupset, \tup} = \sum_{j\in\pbox{\bound}}jX_{\tup, j}$. Then given the disjoint requirement and the semantics for constructing the lineage polynomial over a \abbrOneBIDB, $\poly\pbox{\rel,\tupset',\tup}$ is of the same structure as the reformulated polynomial $\refpoly$ of step i) from~\Cref{def:reduced-poly}, which then implies that $\rpoly$ is the reduced polynomial that results from step ii) of~\Cref{def:reduced-poly}, and further that~\Cref{lem:tidb-reduce-poly} immediately follows for \abbrOneBIDB polynomials: $\expct_{\rvworld\sim\bpd'}\pbox{\poly\inparen{\rvworld}} = \rpoly\inparen{\vct{\prob}}$.
Hence, in the setting of \abbrOneBIDB, the base case of~\Cref{fig:nxDBSemantics} now becomes $\poly\pbox{\rel,\tupset, \tup} = \sum_{j\in\pbox{\bound}}jX_{\tup, j}$. Then given the disjoint requirement and the semantics for constructing the lineage polynomial over a \abbrOneBIDB, $\poly\pbox{\rel,\tupset',\tup}$ is of the same structure as the reformulated polynomial $\refpoly{}$ of step i) from~\Cref{def:reduced-poly}, which then implies that $\rpoly$ is the reduced polynomial that results from step ii) of~\Cref{def:reduced-poly}, and further that~\Cref{lem:tidb-reduce-poly} immediately follows for \abbrOneBIDB polynomials: $\expct_{\rvworld\sim\bpd'}\pbox{\poly\inparen{\rvworld}} = \rpoly\inparen{\vct{\prob}}$.
\AH{@atri, not sure if $\bpd'$ should be $\bpd''$ (in the above expectation) as discussed below. Since $\bpd'\equiv\bpd''$, then the proof still holds for~\Cref{lem:tidb-reduce-poly}, but maybe it is important to $\bpd''$ to drive the point home that we iterate over the all worlds set (as opposed to the set of possible worlds) when computing the expectation of a polynomial. Or maybe it suffices to note that $\bpd'\equiv\bpd''$.}
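\noindent To illustrate the reduction with a hypothetical instance (a sketch, assuming $\bound = 2$, a single block for $\tup$ with probabilities $\prob_{\tup, 1}$ and $\prob_{\tup, 2}$, and a self-join that squares the base polynomial), the reformulated polynomial is
\begin{equation*}
\inparen{X_{\tup, 1} + 2X_{\tup, 2}}^2 = X_{\tup, 1}^2 + 4X_{\tup, 1}X_{\tup, 2} + 4X_{\tup, 2}^2.
\end{equation*}
Under the assumption that the reduction drops monomials containing two distinct variables of the same block (they are disjoint) and lowers every exponent to $1$, this becomes $X_{\tup, 1} + 4X_{\tup, 2}$, whose value at $\vct{\prob}$ is $\prob_{\tup, 1} + 4\prob_{\tup, 2}$. This agrees with computing the expectation directly, since the squared multiplicity is $1$ with probability $\prob_{\tup, 1}$ and $4$ with probability $\prob_{\tup, 2}$.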
}
%In this paper, we focus on two popular forms of \abbrPDB\xplural: Block-Independent (\bi) and Tuple-Independent (\ti) \abbrPDB\xplural.