Finished pass on S2 with comments.

master
Aaron Huber 2022-03-03 10:06:07 -05:00
parent b9ee014642
commit bda73134c7
5 changed files with 25 additions and 19 deletions

View File

@ -63,7 +63,7 @@ We assume that full table scans are used for every base relation access. We can
%Observe that
% () .\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} It can be verified that the above cost model on the corresponding $\raPlus$ join queries correctly captures the runtime of current best known .
\Cref{lem:circ-model-runtime} and \Cref{lem:tlc-is-the-same-as-det} show that for any $\raPlus$ query $\query$ and $\tupset$, there exists a circuit $\circuit^*$ such that $\timeOf{\abbrStepOne}(Q,\tupset,\circuit^*)$ and $|\circuit^*|$ are both $O(\qruntimenoopt{\optquery{\query}, \tupset,\bound})$. Recall we assumed these two bounds when we moved from \Cref{prob:big-o-joint-steps} to \Cref{prob:intro-stmt}. Lastly, we can handle FAQs and factorized databases with allowing for optimization, i.e. $\optquery{\query}$.
\Cref{lem:circ-model-runtime} and \Cref{lem:tlc-is-the-same-as-det} show that for any $\raPlus$ query $\query$ and $\tupset$, there exists a circuit $\circuit^*$ such that $\timeOf{\abbrStepOne}(Q,\tupset,\circuit^*)$ and $|\circuit^*|$ are both $O(\qruntimenoopt{\optquery{\query}, \tupset,\bound})$. Recall we assumed these two bounds when we moved from \Cref{prob:big-o-joint-steps} to \Cref{prob:intro-stmt}. Lastly, we can handle FAQs and factorized databases by allowing for optimization, i.e. $\optquery{\query}$.
%
%We now make a simple observation on the above cost model:
%\begin{proposition}

View File

@ -8,10 +8,11 @@ Formally, a \abbrCTIDB,
$\pdb = \inparen{\worlds, \bpd}$ consists of a set of tuples $\tupset$ and a probability distribution $\bpd$ over all possible worlds generated by assigning each tuple $\tup \in \tupset$ a multiplicity in the range $[0,\bound]$.
Any such world can be encoded as a vector from $\worlds$, the set of all vectors of length $\numvar=\abs{\tupset}$ such that each index corresponds to a distinct $\tup \in \tupset$ storing its multiplicity.
\BGdel{encodes a bag of uncertain tuples such that each possible tuple encoded in $\pdb$ has a multiplicity of at most $\bound$. $\tupset$ is the set of tuples appearing across all possible worlds, and the set of all worlds is encoded in $\worlds$, which is the set of all vectors of length $\numvar=\abs{\tupset}$ such that each index corresponds to a distinct $\tup \in \tupset$ storing its multiplicity and $\bpd$ is the probability distribution over $\worlds$.}{tried to rephrase}
A given world $\worldvec \in\worlds$ can be interpreted as follows: for each $\tup \in \tupset$, $\worldvec_{\tup}$ is the multiplicity of $\tup$ in $\worldvec$. Given that the multiplicities of tuples are independent events, the probability distribution $\bpd$ can be expressed compactly by assigning each tuple a probability distribution over $[0,\bound]$. Let $\prob_{\tup,j}$ denote the probability that tuple $\tup$ is assigned multiplicity $j$. The probability of a particular world $\worldvec$ is then $\prod_{\tup \in \tupset} \prob_{\tup,\worldvec_{\tup}}$.
A given world $\worldvec \in\worlds$ can be interpreted as follows: for each $\tup \in \tupset$, $\worldvec_{\tup}$ is the multiplicity of $\tup$ in $\worldvec$. Given that the multiplicities of tuples are independent events, the probability distribution $\bpd$ can be expressed compactly by assigning each tuple a (disjoint) probability distribution over $[0,\bound]$. Let $\prob_{\tup,j}$ denote the probability that tuple $\tup$ is assigned multiplicity $j$. The probability of a particular world $\worldvec$ is then $\prod_{\tup \in \tupset} \prob_{\tup,\worldvec_{\tup}}$.
\BGdel{any tuple $\tup$ can then be encoded as $\prob_{\tup, j} = \probOf\pbox{\worldvec_{\tup} = j}$ (for $j \in\pbox{\bound}$), where each tuple multiplicity combination $\inparen{\inparen{\tup, \bound} \in \tupset\times\pbox{\bound}}$ %distribution
is an independent random event. %for $\tup \in \tupset$.
}{tried to repharse, but also this is technically not correct, the random variables in $[0,c]$ are independent by the different multiplicities for the same tuple are disjoint.}
\AH{I have added the term (disjoint) to the text to explicitly mention the obvious; though this should be obvious by the fact that $\worldvec_\tup$ can only take one value from $\pbox{\bound}$.}
Allowing for $\leq \bound$ multiplicities across all tuples gives rise to having $\leq \inparen{\bound+1}^\numvar$ possible worlds instead of the usual $2^\numvar$ possible worlds of a $1$-\abbrTIDB, which (assuming set query semantics), is the same as the traditional set \abbrTIDB.
In this work, since we are generally considering bag query input, we will only be considering bag query semantics. We denote by $\query\inparen{\worldvec}\inparen{\tup}$ the multiplicity of $\tup$ in query $\query$ over possible world $\worldvec\in\worlds$.

View File

@ -50,16 +50,16 @@ Queries over probabilistic databases are traditionally viewed as being evaluated
The result of a query is the pair $\inparen{\query\inparen{\Omega}, \bpd'}$ where $\bpd'$ is a probability distribution that assigns to each possible query result the sum of the probabilites of the worlds that produce this answer: $\probOf\pbox{\omega\in\Omega} = \sum_{\omega'\in\Omega,\\\query\inparen{\omega'}=\query\inparen{\omega}}\probOf\pbox{\omega'}$.
Suppose that $\pdb''$ is a reduced \abbrOneBIDB from \abbrCTIDB $\pdb'$ as defined by~\Cref{def:ctidb-reduct}. Instead of looking only at the possible worlds of $\pdb''$, one can consider the set of all worlds, including those that cannot exist due to, e.g., disjointness. Since $\abs{\tupset'} = \numvar$ the all worlds set can be modeled by $\worldvec\in \{0, 1\}^{\numvar\bound}$, such that $\worldvec_{\tup, j} \in \worldvec$ represents whether or not the multiplicity of $\tup$ is $j$ (\emph{here and later, especially in \Cref{sec:algo}, we will rename the variables as $X_1,\dots,X_{\numvar'}$, where $\numvar'=\sum_{\tup\in\tupset}\abs{\block_\tup}$}).
Suppose that $\pdb'$ is a reduced \abbrOneBIDB from \abbrCTIDB $\pdb$ as defined by~\Cref{def:ctidb-reduct}. Instead of looking only at the possible worlds of $\pdb'$, one can consider the set of all worlds, including those that cannot exist due to, e.g., disjointness. Since $\abs{\tupset} = \numvar$ the all worlds set can be modeled by $\worldvec\in \{0, 1\}^{\numvar\bound}$, such that $\worldvec_{\tup, j} \in \worldvec$ represents whether or not the multiplicity of $\tup$ is $j$ (\emph{here and later, especially in \Cref{sec:algo}, we will rename the variables as $X_1,\dots,X_{\numvar'}$, where $\numvar'=\sum_{\tup\in\tupset}\abs{\block_\tup}$}).
\footnote{
In this example, $\abs{\block_\tup} = \bound$ for all $\tup$.
}%(where $k = \sum_{\ell = 1}^{i - 1} \abs{b_\ell} + j$).
We can denote a probability distribution over all $\worldvec \in \{0, 1\}^{\numvar\bound}$ as $\bpd''$. When $\bpd''$ is the one induced from each $\prob_{\tup, j}$ while assigning $\probOf\pbox{\worldvec} = 0$ for any $\worldvec$ with $\worldvec_{\tup, j}, \worldvec_{\tup, j'} \neq 0$ for $j\neq j'$, we end up with a bijective mapping from $\bpd$ to $\bpd''$, such that each mapping is equivalent, implying the distributions are equivalent, and thus query results.
We can denote a probability distribution over all $\worldvec \in \{0, 1\}^{\numvar\bound}$ as $\bpd'$. When $\bpd'$ is the one induced from each $\prob_{\tup, j}$ while assigning $\probOf\pbox{\worldvec} = 0$ for any $\worldvec$ with $\worldvec_{\tup, j}, \worldvec_{\tup, j'} \neq 0$ for $j\neq j'$, we end up with a bijective mapping from $\bpd$ to $\bpd'$, such that each mapping is equivalent, implying the distributions are equivalent, and thus query results.
\Cref{subsec:supp-mat-ti-bi-def} has more details. \medskip
We now make a meaningful connection between possible world semantics and world assignments on the lineage polynomial.
\AH{Not sure if I should change~\Cref{prop:expection-of-polynom} to \abbrCTIDB. The proof is for general \abbrBPDB, which is a stronger statement, but our work focuses mostly on \abbrCTIDB. Do I stick with \abbrCTIDB and modify the proof? Why?}
\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
Given a \abbrBPDB $\pdb = (\Omega,\bpd)$, $\raPlus$ query $\query$, and lineage polynomial $\apolyqdt$ for arbitrary result tuple $\tup$, %$\semNX$-\abbrPDB $\pxdb = (\idb_{\semNX}',\pd')$ where $\rmod(\pxdb) = \pdb$,
we have (denoting $\randDB$ as the random variable over $\Omega$):

View File

@ -10,7 +10,6 @@
We focus on the problem of computing $\expct_{\worldvec\sim\pdassign}\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ from now on, assume implicit $\query, \tupset, \tup$, and drop them from $\apolyqdt$ (i.e., $\poly\inparen{\vct{X}}$ will denote a polynomial).
\Cref{prob:intro-stmt} asks if there exists a linear time approximation algorithm in the size of a given circuit \circuit which encodes $\poly\inparen{\vct{X}}$. Recall that in this work we
represent lineage polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, a standard way to represent polynomials over fields (particularly in the field of algebraic complexity) that we use for polynomials over $\mathbb N$ in the obvious way. Since we are specifically using circuits to model lineage polynomials, we can refer to these circuits as lineage circuits. However, when the meaning is clear, we will drop the term lineage and only refer to them as circuits.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -92,12 +91,12 @@ The circuit of \Cref{fig:circuit} is an element of $\circuitset{2X^2+3XY-2Y^2}$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[The Expected Result Multiplicity Problem]\label{def:the-expected-multipl}
Let $\pdb$ be an arbitrary \abbrBIDB-PDB and $\vct{X}$ be the set of variables annotating tuples in $\dbbase$. Fix an $\raPlus$ query $\query$ and a result tuple $\tup$.
Let $\pdb'$ be an arbitrary \abbrCTIDB and $\vct{X}$ be the set of variables annotating tuples in $\tupset'$. Fix an $\raPlus$ query $\query$ and a result tuple $\tup$.
The \expectProblem is defined as follows:\\[-7mm]
\begin{center}
\textbf{Input}: $\circuit \in \circuitset{\polyX}$ for $\polyX = \apolyqdt$
\textbf{Input}: $\circuit \in \circuitset{\polyX}$ for $\poly'\inparen{\vct{X}} = \poly'\pbox{\query,\tupset',\tup}$
\hspace*{2mm}
\textbf{Output}: $\expct_{\vct{W} \sim \pdassign}[\apolyqdt(\vct{W})]$
\textbf{Output}: $\expct_{\vct{W} \sim \bpd}\pbox{\poly'\pbox{\query, \tupset', \tup}\inparen{\vct{W}}}$
\end{center}
\end{Definition}

View File

@ -26,8 +26,8 @@ When it is unclear, we use $\smbOf{\genpoly}~\inparen{\smbOf{\poly}}$ to denote
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Degree]\label{def:degree-of-poly}
The degree of polynomial $\genpoly(\vct{X})$ is the largest $\sum_{i\in\pbox{\numedge}}d_i %= \norm{\vct{d}}_1
$% = \sum_{\tup\in\tupset} d_\tup$
The degree of polynomial $\genpoly(\vct{X})$ is the largest $\sum_{\tup\in S}d_\tup %= \norm{\vct{d}}_1
$ for all $\vct{d}\in\inset{0,\ldots,\hideg}^S$% = \sum_{\tup\in\tupset} d_\tup$
such that $c_{(d_1,\dots,d_n)}\ne 0$.
We denote the degree of $\genpoly$ as $\deg\inparen{\genpoly}$.
% maximum sum of exponents, over all monomials in $\smbOf{\poly(\vct{X})}$.
@ -87,25 +87,31 @@ We now define the reduced polynomial $\rpoly'$ of a \abbrOneBIDB.
\resizebox{\textwidth}{!}{
\begin{minipage}{\textwidth}
\begin{align*}
\poly'\pbox{\project_A\inparen{\query}, \tupset', \tup_j} =& \sum_{\substack{\tup_{j'},\\\project_{A}\inparen{\tup_{j'}} = \tup_j}}\poly'\pbox{\query, \tupset', \tup_{j'}} &
\poly'\pbox{\query_1\union\query_2, \tupset', \tup_j} =& \poly'\pbox{\query_1, \tupset', \tup_j}+\poly'\pbox{\query_2, \tupset', \tup_j}\\
\poly'\pbox{\select_\theta\inparen{\query}, \tupset', \tup_j} =& \begin{cases}\theta = 1&\poly'\pbox{\query, \tupset', \tup_j}\\\theta = 0& 0\\\end{cases} &
\poly'\pbox{\project_A\inparen{\query}, \gentupset', \tup_j} =& \sum_{\substack{\tup_{j'},\\\project_{A}\inparen{\tup_{j'}} = \tup_j}}\poly'\pbox{\query, \gentupset', \tup_{j'}} &
\poly'\pbox{\query_1\union\query_2, \gentupset', \tup_j} =& \poly'\pbox{\query_1, \gentupset', \tup_j}+\poly'\pbox{\query_2, \gentupset', \tup_j}\\
\poly'\pbox{\select_\theta\inparen{\query}, \gentupset', \tup_j} =& \begin{cases}\theta = 1&\poly'\pbox{\query, \gentupset', \tup_j}\\\theta = 0& 0\\\end{cases} &
\begin{aligned}
\poly'\pbox{\query_1\join\query_2, \tupset', \tup_j} = \\~
\poly'\pbox{\query_1\join\query_2, \gentupset', \tup_j} = \\~
\end{aligned} &
\begin{aligned}
&\poly'\pbox{\query_1, \tupset', \project_{attr\inparen{\query_1}}\inparen{\tup_j}}\\ &~~~\cdot\poly'\pbox{\query_2, \tupset', \project_{attr\inparen{\query_2}}\inparen{\tup_j}}
&\poly'\pbox{\query_1, \gentupset', \project_{attr\inparen{\query_1}}\inparen{\tup_j}}\\ &~~~\cdot\poly'\pbox{\query_2, \gentupset', \project_{attr\inparen{\query_2}}\inparen{\tup_j}}
\end{aligned}\\
&&&\poly'\pbox{\rel,\tupset', \tup_j} = j\cdot X_{\tup, j}.
&&&\poly'\pbox{\rel,\gentupset', \tup_j} = j\cdot X_{\tup, j}.
\end{align*}\\[-10mm]
\end{minipage}}
\caption{Construction of the lineage (polynomial) for an $\raPlus$ query $\query$ over $\tupset'$.}
\caption{Construction of the lineage (polynomial) for an $\raPlus$ query $\query$ over $\gentupset'$.}
\label{fig:lin-poly-bidb-redux}
\end{figure}
\AH{I feel disatisfied by the following items:\newline
i) The use of $X\in\vct{X}$ in~\Cref{fig:lin-poly-bidb-redux} is assuming a \emph{type}! This is an abuse of our designating $X\in\vct{X}$ to be typless/abstract.\newline
ii) I am not sure whether to use $\tupset'$ or $\gentupset'$ in~\Cref{fig:lin-poly-bidb-redux}. I am inclined to use $\gentupset'$ to denote arbitrary reduced \abbrOneBIDB.
\newline
iii) The typeset of $\gentupset'$ is a bit odd, with the apostrophe being higher; however, placing the apostrophe underneath the overline is too low.
}
\begin{Definition}[$\rpoly'$]\label{def:reduced-poly-redux}
Given a polynomial $\poly'\inparen{\vct{X}}$ generated from a \abbrOneBIDB produced from the reduction of~\Cref{def:ctidb-reduct} and let $\rpoly'\inparen{\vct{X}}$ denote the reduced form of $\poly'\inparen{\vct{X}}$ computed as follows: i) compute $\smbOf{\poly'\inparen{\vct{X}}}$, ii) reduce all \emph{variable} exponents $e > 1$ to $1$.
\end{Definition}
Then given the disjoint requirement and the semantics for constructing the lineage polynomial over a \abbrOneBIDB, $\poly'\pbox{\rel,\tupset',\tup}$ is of the same structure as the reformulated polynomial $\refpoly{}$ of step i) from~\Cref{def:reduced-poly}, which then implies that $\rpoly'$ is the reduced polynomial that results from step ii) of~\Cref{def:reduced-poly}, and further that~\Cref{lem:tidb-reduce-poly} immediately follows for \abbrOneBIDB polynomials.
Then given $\worldvec\in\inset{0,1}^{\tupset'}$, the disjoint requirement and the semantics for constructing the lineage polynomial over a \abbrOneBIDB, $\poly'\inparen{\worldvec}$ is of the same structure as the reformulated polynomial $\refpoly{}\inparen{\worldvec}$ of step i) from~\Cref{def:reduced-poly}, which then implies that $\rpoly'$ is the reduced polynomial that results from step ii) of~\Cref{def:reduced-poly}, and further that~\Cref{lem:tidb-reduce-poly} immediately follows for \abbrOneBIDB polynomials.
\begin{Lemma}
Given any %\abbrCTIDB $\pdb$, its reduced counterpart
\emph{\abbrOneBIDB} $\pdb'$, $\raPlus$ query $\query$, and lineage polynomial