diff --git a/introduction.tex b/introduction.tex
index b2fa300..97be00d 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -19,6 +19,39 @@ An $\raPlus$ query is a query expressed in positive relational algebra, i.e., us
 }
 $\query$, and result tuple $\tup$, compute the expected multiplicity of $\tup$: $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$.
 \end{Problem}
+
+\begin{figure}[t!]
+  \begin{align*}
+    &\begin{aligned}[t]
+      &\polyqdt{\project_A(\query)}{\gentupset}{\tup} =\\
+      &~~\sum_{\tup': \project_A(\tup') = \tup} \polyqdt{\query}{\gentupset}{\tup'}
+    \end{aligned}
+    &
+    &\begin{aligned}[t]
+      &\polyqdt{\query_1 \union \query_2}{\gentupset}{\tup} =\\
+      &\qquad \polyqdt{\query_1}{\gentupset}{\tup} + \polyqdt{\query_2}{\gentupset}{\tup}\\
+    \end{aligned}\\
+    &\begin{aligned}
+      &\polyqdt{\select_\theta(\query)}{\gentupset}{\tup} =\\
+      &~~ \begin{cases}
+        \polyqdt{\query}{\gentupset}{\tup} & \text{if }\theta(\tup) \\
+        0 & \text{otherwise}.
+      \end{cases}
+    \end{aligned}
+    &
+    &\begin{aligned}
+      &\polyqdt{\query_1 \join \query_2}{\gentupset}{\tup} =\\
+      &\qquad\polyqdt{\query_1}{\gentupset}{\project_{\attr{\query_1}}{\tup}}\\
+      &\qquad\cdot\polyqdt{\query_2}{\gentupset}{\project_{\attr{\query_2}}{\tup}}
+    \end{aligned}\\
+    &&&\polyqdt{\rel}{\gentupset}{\tup} = X_\tup
+  \end{align*}%\\[-10mm]
+  \setlength{\abovecaptionskip}{-0.25cm}
+  \caption{Construction of the lineage (polynomial) for an $\raPlus$ query $\query$ over an arbitrary deterministic database $\gentupset$, where $\vct{X}$ consists of all $X_\tup$ over all $\rel$ in $\gentupset$ and $\tup$ in $\rel$. Here $\gentupset.\rel$ denotes the instance of relation $\rel$ in $\gentupset$. Please note, after we introduce the reduction to $1$-\abbrBIDB, the base case will be expressed alternatively.}
+  \label{fig:nxDBSemantics}
+  \vspace{-0.53cm}
+\end{figure}
+
 It is natural to explore computing the expected multiplicity of a result tuple as this is the analog for computing the marginal probability of a tuple in a set \abbrPDB.
 In this work we will assume that $c =\bigO{1}$ since this is what is typically seen in practice.
 Allowing for unbounded $c$ is an interesting open problem.
@@ -32,7 +65,8 @@ Specifically, in this work we ask if~\Cref{prob:expect-mult} can be solved in ti
 Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$, deterministic database $\gentupset$, and multiplicity bound $\bound$.
 This paper considers $\raPlus$ queries for which order of operations is \emph{explicit}, as opposed to other query languages, e.g. Datalog, UCQ.
 Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} = \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{Note that our work applies to any $\query \in\raPlus$, which implies that specific heuristics for choosing an optimized query can be abstracted away, i.e., our work does not consider heuristic techniques.}
-\begin{table}[t!]
+\begin{table*}[t!]
+\centering
 \begin{tabular}{|p{0.43\textwidth}|p{0.12\textwidth}|p{0.35\textwidth}|}
 \hline
 \textbf{Lower bound on $\timeOf{}^*(\qhard,\pdb)$} & \textbf{Num.} $\bpd$s
@@ -45,7 +79,9 @@ $\Omega\inparen{\inparen{\qruntime{\optquery{\qhard}, \tupset, \bound}}^{c_0\cdo
 \end{tabular}
 \caption{Our lower bounds for a specific hard query $\qhard$ parameterized by $k$. For $\pdb = \inset{\worlds, \bpd}$ those with `Multiple' in the second column need the algorithm to be able to handle multiple $\bpd$, i.e. probability distributions (for a given $\tupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_o,C_0,c_0$ are constants that are independent of $k$).}
 \label{tab:lbs}
-\end{table}
+\vspace{-0.73cm}
+\end{table*}
+
 \mypar{Our lower bound results} Our question is whether or not it is always true that $\timeOf{}^*\inparen{\query, \pdb}\leq\qruntime{\optquery{\query}, \tupset, \bound}$.
 Unfortunately this is not the case.
 ~\Cref{tab:lbs} shows our results.
@@ -64,26 +100,6 @@ Further, our approximation algorithm works for a more general notion of bag \abb
 \subsection{Polynomial Equivalence}\label{sec:intro-poly-equiv}
 A common encoding of probabilistic databases (e.g., in \cite{IL84a,Imielinski1989IncompleteII,Antova_fastand,DBLP:conf/vldb/AgrawalBSHNSW06} and many others) relies on annotating tuples with lineages or propositional formulas that describe the set of possible worlds that the tuple appears in.
 The bag semantics analog is a provenance/lineage polynomial (see~\Cref{fig:nxDBSemantics}) $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07}, a polynomial with non-zero integer coefficients and exponents, over variables $\vct{X}$ encoding input tuple multiplicities. Evaluating a lineage polynomial for a query result tuple $t_{out}$ by, for each tuple $\tup_{in}$, assigning the variable $X_{t_{in}}$ encoding the tuple's multiplicity to the tuple's multiplicity in the possible world yields the multiplicity of the $\tup_{out}$ in the query result for this world.
-\begin{figure}[b!]
-  \begin{align*}
-    \polyqdt{\project_A(\query)}{\gentupset}{\tup} =& \sum_{\tup': \project_A(\tup') = \tup} \polyqdt{\query}{\gentupset}{\tup'} &
-    \polyqdt{\query_1 \union \query_2}{\gentupset}{\tup} =& \polyqdt{\query_1}{\gentupset}{\tup} + \polyqdt{\query_2}{\gentupset}{\tup}\\
-    \polyqdt{\select_\theta(\query)}{\gentupset}{\tup} =& \begin{cases}
-      \polyqdt{\query}{\gentupset}{\tup} & \text{if }\theta(\tup) \\
-      0 & \text{otherwise}.
-    \end{cases} &
-    \begin{aligned}
-      \polyqdt{\query_1 \join \query_2}{\gentupset}{\tup} =\\ ~
-    \end{aligned}&
-    \begin{aligned}
-      &\polyqdt{\query_1}{\gentupset}{\project_{\attr{\query_1}}{\tup}} \\
-      &~~~\cdot\polyqdt{\query_2}{\gentupset}{\project_{\attr{\query_2}}{\tup}}
-    \end{aligned}\\
-    & & & \polyqdt{\rel}{\gentupset}{\tup} = X_\tup
-  \end{align*}\\[-10mm]
-  \caption{Construction of the lineage (polynomial) for an $\raPlus$ query $\query$ over an arbitrary deterministic database $\gentupset$, where $\vct{X}$ consists of all $X_\tup$ over all $\rel$ in $\gentupset$ and $\tup$ in $\rel$. Here $\gentupset.\rel$ denotes the instance of relation $\rel$ in $\gentupset$. Please note, after we introduce the reduction to $1$-\abbrBIDB, the base case will be expressed alternatively.}
-  \label{fig:nxDBSemantics}
-\end{figure}
 We drop $\query$, $\tupset$, and $\tup$ from $\apolyqdt$ when they are clear from the context or irrelevant to the discussion.
 We now specify the problem of computing the expectation of tuple multiplicity in the language of lineage polynomials:
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
diff --git a/main.pdf b/main.pdf
index 838fb61..c35774d 100644
Binary files a/main.pdf and b/main.pdf differ
diff --git a/main.synctex.gz b/main.synctex.gz
index 96540c7..190224c 100644
Binary files a/main.synctex.gz and b/main.synctex.gz differ