diff --git a/intro-rewrite-070921.tex b/intro-rewrite-070921.tex index a2c35ac..a4c60be 100644 --- a/intro-rewrite-070921.tex +++ b/intro-rewrite-070921.tex @@ -4,20 +4,21 @@ \secrev{ This work explores the problem of computing the expectation of a tuple's multiplicity in an important special case of bag \abbrTIDB, which we call a \abbrCTIDB. A \abbrCTIDB, -$\pdb = \inparen{\worlds, \bpd}$ encodes a bag of uncertain tuples such that each tuple in $\pdb$ has a multiplicity of at most $\bound$. The set of all worlds is encoded in $\worlds$, which is the set of all vectors of length $\abs{\tupset}$ such that each index corresponds to a distinct $\tup \in \tupset$ storing its multiplicity. $\bpd$ is a product distribution over the set of all worlds. A given world $\worldvec = \inset{0,\ldots, \bound}^{\abs{\tupset}}$ can be interpreted such that, for each $\tup \in \tupset$, $\worldvec\pbox{\tup}$ is the multiplicity of $\tup$ in $\worldvec$. The resulting product distribution can then be encoded as $\prob_{\tup} = \probOf\pbox{W\pbox{i} = j}$ (for $j \in\pbox{\bound}$), where each distribution is independent for $\tup \in \tupset$. +$\pdb = \inparen{\worlds, \bpd}$ encodes a bag of uncertain tuples such that each tuple in $\pdb$ has a multiplicity of at most $\bound$. The set of all worlds is encoded in $\worlds$, which is the set of all vectors of length $\abs{\tupset}$ such that each index corresponds to a distinct $\tup \in \tupset$ storing its multiplicity. $\bpd$ is a product distribution over the set of all worlds. A given world $\worldvec = \inset{0,\ldots, \bound}^{\abs{\tupset}}$ can be interpreted such that, for each $\tup \in \tupset$, $\worldvec\pbox{\tup}$ is the multiplicity of $\tup$ in $\worldvec$. The resulting product distribution can then be encoded as $\prob_{\tup} = \probOf\pbox{W\pbox{\tup} = j}$ (for $j \in\pbox{\bound}$), where each %distribution +$\tup$ is an independent random event. %for $\tup \in \tupset$. 
} %\mypar{For a later section} %\sout{ %Since each tuple in $\pdb$ has a mutually exclusive probability distribution over its possible multiplicities, it is natural to reduce a \abbrCTIDB to traditional (set) block independent database (\abbrBIDB). We refer to the reduced \abbrBIDB as a $1$-\abbrBIDB, as it is the case that each tuple can appear in a possible world at most $c = 1$ time. \Cref{fig:ctidb-red} shows an example of this reduction. %} \secrev{ -Allowing for $\leq \bound$ multiplicities across all tuples gives rise to having $\leq \inparen{\bound+1}^\numvar$ possible worlds instead of the usual $2^\numvar$ possible worlds of a $1$-\abbrTIDB$, which (assuming set query semantics), is the same as the traditional set \abbrTIDB. -In this work, since we are generally considering bag query input, we will only be considering bag query semantics. +Allowing for $\leq \bound$ multiplicities across all tuples gives rise to having $\leq \inparen{\bound+1}^\numvar$ possible worlds instead of the usual $2^\numvar$ possible worlds of a $1$-\abbrTIDB, which (assuming set query semantics), is the same as the traditional set \abbrTIDB. +In this work, since we are generally considering bag query input, we will only be considering bag query semantics. We denote by $\query\inparen{\vct{W}}\inparen{\tup}$ the multiplicity of $\tup$ in query $\query$ over possible world $\vct{W}\in\worlds$. We can formally state this problem as: \begin{Problem}\label{prob:expect-mult} -Given a \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, $\raPlus$ query $\query$, and result tuple $\tup$, compute the expected multiplicity of $\tup$: $\expct_{\randDB\sim\bpd}\pbox{\query\inparen{\randDB}\inparen{\tup}}$. +Given a \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, $\raPlus$ query $\query$, and result tuple $\tup$, compute the expected multiplicity of $\tup$: $\expct_{\vct{W}\sim\bpd}\pbox{\query\inparen{\vct{W}}\inparen{\tup}}$. 
\end{Problem} \AH{I \emph{think} we use $\randDB$ to denote something different in one of the proofs. Have to keep an eye open for this to avoid overloading notation.} @@ -144,7 +145,7 @@ $\Omega\inparen{\inparen{\qruntime{\query, \gentupset}}^{c_0\cdot k}}$ for {\em \caption{Our lower bounds for a specific hard query $Q$ parameterized by $k$. The $\pdb$ is over the same (family of) $\gentupset$ and those with `Multiple' in the second column need the algorithm to be able to handle multiple $\pd$ (for a given $\gentupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_o,C_0,c_0$ are constants that are independent of $k$).} \label{tab:lbs} \end{table} -\mypar{Our lower bound results} In table~\ref{tab:lbs} we show that depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \gentupset}}^k}$, where $k$ is the largest degree of the query $\query$ (i.e., join width) over all result tuples $\tup$ (and the parameter that defines our family of hard queries). +\mypar{Our lower bound results} In Table~\ref{tab:lbs} we show that depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \gentupset}}^k}$, where $k$ is the join width (our notion of join width follows from~\cref{def:degree-of-poly} and~\cref{fig:nxDBSemantics}) of the query $\query$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries).
What our lower bound in the third row says is that one cannot get more than a polynomial improvement over essentially the trivial algorithm for~\cref{prob:expect-mult}. However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace the $\qruntime{\query, \gentupset}$ by just $\abs{\gentupset}$ (indeed these results follow from known lower bound for deterministic query processing). Our contribution is to then identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard. @@ -243,10 +244,7 @@ $\expct\limits_{\vct{\randWorld}\sim\pdassign}\pbox{\poly^2\inparen{\vct{\randWo \end{footnotesize} \noindent This property leads us to consider a structure related to the lineage polynomial. \begin{Definition}\label{def:reduced-poly} -For any polynomial $\poly(\vct{X})$ corresponding to a \abbrCTIDB (henceforth, \abbrCTIDB-lineage polynomial), -%\BG{Better introduce the notion of TIDB lin poly before here, then it iis more clear?}, -%Atri: Done - define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the standard monomial basis (\abbrSMB) \footnote{ +For any polynomial $\poly(\vct{X})$ define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the standard monomial basis (\abbrSMB) \footnote{ This is the representation, typically used in set-\abbrPDB\xplural, where the polynomial is reresented as sum of `pure' products. See \Cref{def:smb} for a formal definition. } form of $\poly(\vct{X})$ to $1$. 
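As a small worked instance of \Cref{def:reduced-poly} (the polynomial below is a hypothetical example chosen for illustration, not one taken from the paper), expanding into \abbrSMB form and then capping every exponent at $1$ gives:

```latex
\begin{align*}
\poly(X, Y)  &= (X + Y)^2 = X^2 + 2XY + Y^2, \\
\rpoly(X, Y) &= X + 2XY + Y.
\end{align*}
```

The two polynomials agree on every input in $\inset{0,1}^2$, since $x^2 = x$ whenever $x \in \inset{0,1}$; they differ in general on larger values.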
diff --git a/poly-form.tex b/poly-form.tex index 823584d..85396c2 100644 --- a/poly-form.tex +++ b/poly-form.tex @@ -22,7 +22,7 @@ Unless othewise noted, we consider all polynomials to be in \abbrSMB representat When it is unclear, we use $\smbOf{\poly}$ to denote the \abbrSMB form of a polynomial $\poly$. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -\begin{Definition}[Degree]\label{def:degree} +\begin{Definition}[Degree]\label{def:degree-of-poly} The degree of polynomial $\poly(\vct{X})$ is the largest $\sum_{i=1}^n d_i$ such that $c_{(d_1,\dots,d_n)}\ne 0$. % maximum sum of exponents, over all monomials in $\smbOf{\poly(\vct{X})}$. \end{Definition} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% diff --git a/two-step-model.tex b/two-step-model.tex index 7b86e71..35f336b 100644 --- a/two-step-model.tex +++ b/two-step-model.tex @@ -15,26 +15,26 @@ \node[cylinder, text width=0.28\textwidth, align=center, draw=black, text=black, cylinder uses custom fill, cylinder body fill=blue!10, aspect=0.12, minimum height=5cm, minimum width=2.5cm, cylinder end fill=blue!50, shape border rotate=90] (cylinder) at (0, 0) { \tabcolsep=0.1cm \begin{tabular}{>{\small}c | >{\small}c | >{\small}c} - \multicolumn{3}{c}{$\boldsymbol{OnTime}$}\\ + \multicolumn{3}{c}{$\boldsymbol{T}$}\\ %\toprule - City & $\Phi$ & \textbf{p}\\ + Point & $\Phi$ & $\semN$\\ \midrule - Buffalo & $A$ & 0.9 \\ - Chicago & $B$ & 0.5\\ - Bremen & $C$ & 0.5\\ - Zurich & $E$ & 1.0\\ + $e_1$ & $A$ & 1 \\ + $e_2$ & $B$ & 1\\ + $e_3$ & $C$ & 1\\ + $e_4$ & $E$ & 1\\ \end{tabular}\\ \tabcolsep=0.05cm %\captionof{table}{Route} \begin{tabular}{>{\footnotesize}c | >{\footnotesize}c | >{\footnotesize}c | >{\footnotesize}c} - \multicolumn{4}{c}{$\boldsymbol{Route$}}\\ + \multicolumn{4}{c}{$\boldsymbol{R}$}\\ %\toprule - $\text{City}_1$ & $\text{City}_2$ & $\Phi$ & \textbf{p} \\ + $\text{Point}_1$ & $\text{Point}_2$ & $\Phi$ & $\semN$ \\ \midrule - Buffalo & Chicago & $X$ & 1.0 \\ - Chicago & Zurich & $Y$ & 1.0 \\ + $e_1$ & $e_2$ & $X$
& 2 \\ + $e_2$ & $e_4$ & $Y$ & 4 \\ %& $\cdots$ & $\cdots$ & $\cdots$ & $\cdots$ \\ - Chicago & Bremen & $Z$ & 1.0 \\ + $e_2$ & $e_3$ & $Z$ & 3 \\ \end{tabular}}; %label below cylinder \node[below=0.2 cm of cylinder]{{\LARGE$ \tupset$}}; @@ -51,11 +51,11 @@ \begin{tabular}{>{\normalsize}c | >{\centering\arraybackslash\normalsize}m{1.95cm} | >{\centering\arraybackslash\small}m{1.95cm}} %\multicolumn{3}{c}{$\boldsymbol{\query(\pdb)}$}\\[1mm] %\toprule - City & $\Phi$ & Circuit\\% & $\expct_{\idb \sim \probDist}[\query(\db)(t)]$ \\ \hline + Point & $\Phi$ & Circuit\\% & $\expct_{\idb \sim \probDist}[\query(\db)(t)]$ \\ \hline \midrule %\hline %\\\\[-3.5\medskipamount] - Buffalo & $AX$ &\resizebox{!}{10mm}{ + $e_1$ & $AX$ &\resizebox{!}{10mm}{ \begin{tikzpicture}[thick] \node[gen_tree_node](sink) at (0.5, 0.8){$\boldsymbol{\circmult}$}; \node[gen_tree_node](source1) at (0, 0){$A$}; @@ -64,7 +64,7 @@ \draw[->] (source2)--(sink); \end{tikzpicture}% & $0.5 \cdot 1.0 + 0.5 \cdot 1.0 = 1.0$ }\\% & $0.9$ \\ - Chicago & $B(Y + Z)$\newline \text{Or}\newline $BY+ BZ$& + $e_2$ & $B(Y + Z)$\newline \text{Or}\newline $BY+ BZ$& \resizebox{!}{16mm} { \begin{tikzpicture}[thick] \node[gen_tree_node] (a1) at (1, 0){$Y$}; @@ -116,17 +116,17 @@ \begin{tabular}{>{\small}c | >{\centering\arraybackslash\small}m{1.95cm}} %\multicolumn{2}{c}{$\expct\pbox{\poly(\vct{X})}$}\\[1mm] %\toprule - City & $\mathbb{E}[\poly(\vct{X})]$\\ + Point & $\mathbb{E}[\poly(\vct{X})]$\\ \midrule%[0.05pt] - Buffalo & $1.0 \cdot 0.9 = 0.9$\\[3mm] - Chicago & $(0.5 \cdot 1.0) + $\newline $\hspace{0.2cm}(0.5 \cdot 1.0)$\newline $= 1.0$\\ + $e_1$ & $A\cdot\probOf\pbox{A = 1}\inparen{X\cdot\probOf\pbox{X = 1} + X\cdot\probOf\pbox{X = 2}}$\\[2mm]%$1.0 \cdot 0.9 = 0.9$\\[3mm] + $e_2$ & $(0.5 \cdot 1.0) + $\newline $\hspace{0.2cm}(0.5 \cdot 1.0)$\newline $= 1.0$\\ \end{tabular} }; %label of rounded rectangle \node[below=0.2cm of rrect]{{\LARGE $\expct\pbox{\poly(\vct{X})}$}}; \end{tikzpicture} } - \caption{Intensional 
Query Evaluation Model ($\query = \project_{\text{City}}\inparen{Route\join_{\text{City}_1 = City}OnTime}$).} + \caption{Intensional Query Evaluation Model ($\query = \project_{\text{Point}}\inparen{T\join_{\text{Point} = \text{Point}_1}R}$).} \label{fig:two-step} \end{figure}
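The $e_1$ row of \Cref{fig:two-step} can be worked out explicitly. The sketch below assumes (as a reading of the $\semN$ column, not something the figure states outright) that $A$ ranges over $\inset{0,1}$, that $X$ ranges over $\inset{0,1,2}$, and that the two variables are independent, as the product distribution of a \abbrCTIDB would imply:

```latex
\begin{align*}
\expct\pbox{AX} &= \expct\pbox{A}\cdot\expct\pbox{X} \\
                &= \inparen{1\cdot\probOf\pbox{A = 1}}\cdot
                   \inparen{1\cdot\probOf\pbox{X = 1} + 2\cdot\probOf\pbox{X = 2}}.
\end{align*}
```

Under the same assumptions, the $e_2$ row follows from linearity of expectation together with independence: $\expct\pbox{B(Y + Z)} = \expct\pbox{B}\cdot\inparen{\expct\pbox{Y} + \expct\pbox{Z}}$.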