Trimming for space

master
Oliver Kennedy 2020-12-19 12:59:27 -05:00
parent 22e5af76db
commit a2bf8d6daf
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
8 changed files with 62 additions and 81 deletions

View File

@ -3,8 +3,8 @@
\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability of being present (\Cref{th:single-p-hard}).
Given this, in this section we design an approximation algorithm for our problem that runs in {\em linear time}.
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.
Unlike the results in~\Cref{sec:hard} our approximation algorithm works for \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis, as well as the PDBench benchmark.
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
@ -89,6 +89,7 @@ Consider the factorized representation $(X+ 2Y)(2X - Y)$ of the polynomial in~\C
\begin{figure}[h!]
\resizebox{0.65\columnwidth}{!}{
\begin{tikzpicture}[thick, level distance=0.9cm,level 1/.style={sibling distance=3.55cm}, level 2/.style={sibling distance=1.8cm}, level 3/.style={sibling distance=0.8cm}]% level/.style={sibling distance=6cm/(#1 * 1.5)}]
\node[tree_node](root){$\boldsymbol{\times}$}
child{node[tree_node]{$\boldsymbol{+}$}
@ -119,12 +120,13 @@ Consider the factorized representation $(X+ 2Y)(2X - Y)$ of the polynomial in~\C
\draw[<-|, highlight_color] (TR) -- (tr-label);
\draw[<-|, highlight_color] (root) -- (t-label);
\end{tikzpicture}
}
\caption{Expression tree $\etree$ for the product $\boldsymbol{(x + 2y)(2x - y)}$.}
\label{fig:expr-tree-T}
\trimfigurespacing
\end{figure}
\begin{Definition}[Positive T]\label{def:positive-tree}
For any expression tree $\etree$, the corresponding
{\em positive tree}, denoted $\abs{\etree}$ obtained from $\etree$ as follows. For each leaf node $\ell$ of $\etree$ where $\ell.\type$ is $\tnum$, update $\ell.\vari{value}$ to $|\ell.\vari{value}|$. %value $\coef$ of each coefficient leaf node in $\etree$ is set to %$\coef_i$ in $\etree$ is exchanged with its absolute value$|\coef|$.
@ -133,7 +135,7 @@ For any expression tree $\etree$, the corresponding
Using the same factorization from ~\Cref{example:expr-tree-T}, $poly(\abs{\etree}) = (X + 2Y)(2X + Y) = 2X^2 +XY +4XY + 2Y^2 = 2X^2 + 5XY + 2Y^2$. Note that this \textit{is not} the same as the polynomial from~\Cref{eq:poly-eg}.
\begin{Definition}[Evaluation]\label{def:exp-poly-eval}
Given an expression tree $\etree$ and $\vct{v} \in \mathbb{R}^\numvar$, $\etree(\vct{v}) = poly(\etree)(\vct{v})$.
Given an expression tree $\etree$ and $\vct{v} \in \mathbb{R}^\numvar$, we define the evaluation of $\etree$ on $\vct{v}$ as $\etree(\vct{v}) = poly(\etree)(\vct{v})$.
\end{Definition}
\subsection{Our main result}
@ -157,7 +159,7 @@ P\left(\left|\mathcal{E} - \rpoly(\prob_1,\dots,\prob_\numvar)\right|> \error' \
The proof of~\Cref{lem:approx-alg} can be found in~\Cref{sec:proofs-approx-alg}.
It turns out that to get linear runtime results from~\Cref{lem:approx-alg}, we will need to define another parameter (which roughly counts the (weighted) number of monomials in $\expandtree{\etree}$ that get `canceled' when modded with $\mathcal{B}$):
To get linear runtime results from~\Cref{lem:approx-alg}, we will need to define another parameter modeling the (weighted) number of monomials in $\expandtree{\etree}$ to be `canceled' when it is modded with $\mathcal{B}$:
\begin{Definition}[Parameter $\gamma$]\label{def:param-gamma}
Given an expression tree $\etree$, define
\[\gamma(\etree)=\frac{\sum_{(\monom, \coef)\in \expandtree{\etree}} \abs{\coef}\cdot \indicator{\monom\mod{\mathcal{B}}\equiv 0}}{\abs{\etree}(1,\ldots, 1)}\]
@ -182,10 +184,10 @@ We note that the restriction on $\gamma$ is satisfied by \ti (where $\gamma=0$)
\OK{@Atri: This seems like a reasonable claim. It's too late for me to come up with a reasonable motivation (maybe something will come to me in the morning), but the intuition for me is that each tuple/block is independent... it would be hard for that to be the case if the probability were a function of the number of tuples.}
\subsection{Approximating $\rpoly$}
The algorithm to prove~\Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=poly(\etree)$ for expression tree $\etree$ over $\bi$, we note that we can exactly represent $\rpoly(\vct{X}$ as follows:
The algorithm to prove~\Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=poly(\etree)$ for expression tree $\etree$ over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
\begin{equation}
\label{eq:tilde-Q-bi}
\rpoly\inparen{X_1,\dots,X_\numvar}=\sum_{(v,c)\in \expandtree{\etree}} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot c\cdot\prod_{X_i\in \var\inparen{v}} X_i.
\rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(v,c)\in \expandtree{\etree}} \hspace*{-2mm} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot c\cdot\hspace*{-1mm}\prod_{X_i\in \var\inparen{v}}\hspace*{-2mm} X_i.
\end{equation}
Given the above, the algorithm is a sampling based algorithm for the above sum: we sample $(v,c)\in \expandtree{\etree}$ with probability proportional\footnote{We could have also uniformly sampled from $\expandtree{\etree}$ but this gives better parameters.}
%\AH{Regarding the footnote, is there really a difference? I \emph{suppose} technically, but in this case they are \emph{effectively} the same. Just wondering.}
@ -233,7 +235,7 @@ to $\abs{c}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot
%\bi Version of Approximation Algorithm
\begin{algorithm}[H]
\begin{algorithm}[t]
\caption{$\approxq(\etree, \vct{p}, \conf, \error)$}
\label{alg:mon-sam}
\begin{algorithmic}[1]
@ -328,9 +330,9 @@ we first state the lemmas that summarize the relevant properties of $\onepass$ a
\begin{Lemma}\label{lem:one-pass}
The $\onepass$ function completes in $O(size(\etree))$ time. After $\onepass$ returns the following post conditions hold. First, for each subtree $\vari{S}$ of $\etree$, we have that $\vari{S}.\vari{partial}$ is set to $\abs{\vari{S}}(1,\ldots, 1)$. Second, when $\vari{S}.\type = +$, each $\vari{child}$ of $\vari{S}$, $\vari{child}.\vari{weight}$ is set to $\frac{\abs{\vari{S}_{\vari{child}}}(1,\ldots, 1)}{\abs{\vari{S}}(1,\ldots, 1)}$. % is correctly computed for each child of $\vari{S}.$
The $\onepass$ function completes in $O(size(\etree))$ time. $\onepass$ guarantees two post conditions: First, for each subtree $\vari{S}$ of $\etree$, we have that $\vari{S}.\vari{partial}$ is set to $\abs{\vari{S}}(1,\ldots, 1)$. Second, when $\vari{S}.\type = +$, each $\vari{child}$ of $\vari{S}$, $\vari{child}.\vari{weight}$ is set to $\frac{\abs{\vari{S}_{\vari{child}}}(1,\ldots, 1)}{\abs{\vari{S}}(1,\ldots, 1)}$. % is correctly computed for each child of $\vari{S}.$
\end{Lemma}
In proving correctness of~\Cref{alg:mon-sam}, we will only use the following fact (which follows from the above lemma), $\etree_{\vari{mod}}.\vari{partial}=\abs{\etree}(1,\dots,1)$.
To prove correctness of~\Cref{alg:mon-sam}, we only use the following fact that follows from the above lemma: $\etree_{\vari{mod}}.\vari{partial}=\abs{\etree}(1,\dots,1)$.
%\AH{I'm wondering if there is a better notation to use here. I myself got confused by my own notation of $\etree_{\vari{mod}}$. \emph{But}, we need to to be referencing the modified $\etree$ returned by $\onepass$ in the algorithm, so maybe this is the best we can do?}
%\AR{yeah, I think this is fine.}
%At the conclusion of $\onepass$, $\etree.\vari{partial}$ will hold the sum of all coefficients in $\expandtree{\abs{\etree}}$, i.e., $\sum\limits_{(\monom, \coef) \in \expandtree{\abs{\etree}}}\coef$. $\etree.\vari{weight}$ will hold the weighted probability that $\etree$ is sampled from from its parent $+$ node.
@ -357,7 +359,7 @@ $\empmean$ has bounds
The evaluation of $\abs{\etree}(1,\ldots, 1)$ can be defined recursively, as follows (where $\etree_\lchild$ and $\etree_\rchild$ are the `left' and `right' children of $\etree$ if they exist):
{\small
\begin{align}
\label{eq:T-all-ones}
\abs{\etree}(1,\ldots, 1) = \begin{cases}
@ -367,6 +369,7 @@ The evaluation of $\abs{\etree}(1,\ldots, 1)$ can be defined recursively, as fol
1 &\textbf{if }\etree.\type = \var.
\end{cases}
\end{align}
}
%\begin{align*}
%&\eval{\etree ~|~ \etree.\type = +}_{\abs{\etree}} =&& \eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}\\
@ -387,38 +390,36 @@ It turns out that for proof of~\Cref{lem:sample}, we need to argue that when $\e
%\begin{align*}
%&\eval{\etree~|~\etree.\type = +}_{\wght} =&&\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}; \etree_\lchild.\wght = \frac{\eval{\etree_\lchild}_{\abs{\etree}}}{\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}}; \etree_\rchild.\wght = \frac{\eval{\etree_\rchild}_{\abs{\etree}}}{\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}}
%\end{align*}
Algorithm ~\ref{alg:one-pass} essentially implements the above definitions.
\noindent \onepass\ (Algorithm ~\ref{alg:one-pass} in \Cref{sec:proofs-approx-alg}) essentially populates the \vari{weight} variable on each node with the above definitions.
%\subsubsection{Psuedo Code}
%See algorithm ~\ref{alg:one-pass} for details.
For an example of how $\onepass$ works, the pseudocode, and the proof of correctness (~\Cref{lem:one-pass}) of Algorithm ~\ref{alg:one-pass}see~\Cref{sec:proofs-approx-alg}.
\subsection{\sampmon\ Algorithm}
\label{sec:samplemonomial}
%Algorithm ~\ref{alg:sample} takes $\etree$ as input, samples an arbitrary $(\monom, \coef)$ from $\expandtree{\etree}$ with probabilities $\stree_\lchild.\wght$ and $\stree_\rchild.\wght$ for each subtree $\stree$ with $\stree.\type = +$, outputting the tuple $(\monom, \sign(\coef))$. While one cannot compute $\expandtree{\etree}$ in time better than $O(N^k)$, the algorithm, similar to \textsc{OnePass}, uses a technique on $\etree$ which produces a sample from $\expandtree{\etree}$ without ever materializing $\expandtree{\etree}$.
One way to implement \sampmon\ would be to compute $E(T)$ and then sample from it. However, this would be too time consuming.
A naive (slow) implementation of \sampmon\ would first compute $E(T)$ and then sample from it.
% However, this would be too time consuming.
%
Instead~\Cref{alg:sample} selects a monomial from $\expandtree{\etree}$ by the following top-down traversal. For a parent $+$ node, a subtree is chosen over the previously computed weighted sampling distribution. When a parent $\times$ node is visited, both children are visited. All variable leaf nodes of the subgraph traversal are added to a set. Additionally, the product of signs over all coefficient leaf nodes of the subgraph traversal is computed. The algorithm returns a set of the distinct variables of which the monomial is composed and the monomial's sign.
Instead, \Cref{alg:sample} selects a monomial from $\expandtree{\etree}$ by top-down traversal.
For a parent $+$ node, the child to be visited is sampled from the weighted distribution precomputed by \onepass.
When a parent $\times$ node is visited, both children are visited.
The algorithm computes two properties: the set of all variable leaf nodes visited, and the product of signs of visited coefficient leaf nodes.
%\begin{Definition}[TreeSet]
%A TreeSet is a data structure whose elements form a set, each of which are stored in a binary tree.
%\end{Definition}
We will assume the TreeSet data structure to maintain sets with logarithmic time insertion and linear time traversal of its elements.
%
$\sampmon$ is given in \Cref{alg:sample}, and a proof of its correctness (via \Cref{lem:sample}) is provided in \Cref{sec:proofs-approx-alg}.
\subsubsection{Pseudo Code}
See algorithm ~\ref{alg:sample} for the details of $\sampmon$ algorithm.
\begin{algorithm}
\begin{algorithm}[t]
\caption{\sampmon(\etree)}
\label{alg:sample}
\begin{algorithmic}[1]
@ -451,7 +452,6 @@ See algorithm ~\ref{alg:sample} for the details of $\sampmon$ algorithm.
\end{algorithmic}
\end{algorithm}
We argue the correctness of Algorithm ~\ref{alg:sample} by proving~\Cref{lem:sample} in~\Cref{sec:proofs-approx-alg}.
% \subsection{Experimental results}
% \label{sec:experiments}
% We conducted an experiment running modified TPCH queries over uncertain data generated by pdbench~\cite{pdbench}, both of which (data and queries) represent what is typically encountered in practice. Queries were run two times, once filtering $\bi$ cancellations, and then second not filtering the cancellations. The purpose of this was to determine an indication for how many $\bi$ cancellations occur in practice. Details and results can be found in~.

View File

@ -1,26 +1,35 @@
%!TEX root=./main.tex
\section{Generalizations}
\label{sec:gen}
In this section, we consider a couple of generalizations/corollaries of our results so far. In particular, in~\Cref{sec:circuits} we first consider the case when the compressed polynomial is represented by a Directed Acyclic Graph (DAG) instead of the earlier (expression) tree (\Cref{def:express-tree}) and we observe that all of our results carry over to the DAG representation. Then we formalize our claim in~\Cref{sec:intro} that a linear runtime algorithm for our problem would imply that we can process PDBs in the same time as deterministic query processing. Finally, in~\Cref{sec:momemts}, we make some simple observations on how our results can be used to estimate moments beyond the expectation of a lineage polynomial.
In this section, we consider a several generalizations/corollaries of our results.
In particular, in~\Cref{sec:circuits} we first consider the case when the compressed polynomial is represented by a Directed Acyclic Graph (DAG) instead of an expression tree (\Cref{def:express-tree}) and observe that our results carry over.
Then we formalize our claim in~\Cref{sec:intro} that a linear algorithm for our problem implies that PDB queries can be answered in the same runtime as deterministic queries.
Finally, in~\Cref{sec:momemts}, we observe how our results can be used to estimate moments other than the expectation.
\subsection{Lineage circuits}
\label{sec:circuits}
In~\Cref{sec:semnx-as-repr}, we switched to thinking of our query results as polynomials and indeed pretty much of the rest of the paper has focused on thinking of our input as a polynomial. In particular, starting with~\Cref{sec:expression-trees} we considered these polynomials to be represented as an expression tree. However, these do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `storing' any intermediate results, which is crucial for these algorithms (and other query processing results as well).
In~\Cref{sec:semnx-as-repr}, we switched to thinking of our query results as polynomials and until now, have has focused on thinking of our input as a polynomial. In particular, starting with~\Cref{sec:expression-trees} we considered these polynomials to be represented as an expression tree. However, these do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `storing' any intermediate results, which is crucial for these algorithms (and other query processing results as well).
In this section, we represent query polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, which are a standard way to represent polynomials over fields (and is standard in the field of algebraic complexity), though in our case we use them for polynomials over $\mathbb N$ in the obvious way. We present a formal treatment of {\em lineage circuit}s in~\Cref{sec:circuits-formal}, with only a quick overview to start. A lineage circuit is represented by a DAG, where each source node corresponds to either one of the input variables or a constant, and the sinks correspond to the output. Every other node has at most two incoming edges (and is labeled as either an addition or a multiplication node), but there is no limit on the outdegree of such nodes. We note that if we restricted the outdegree to be one, then we get back expression trees.
In this section, we represent query polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, a standard way to represent polynomials over fields (particularly in the field of algebraic complexity) that we use for polynomials over $\mathbb N$ in the obvious way.
We present a formal treatment of {\em lineage circuit}s in~\Cref{sec:circuits-formal}, with only a quick overview to start.
A lineage circuit is represented by a DAG, where each source node corresponds to either one of the input variables or a constant and the sinks correspond to the output.
Every other node has at most two in-edges, is labeled as an addition or a multiplication node, and has no limit on its outdegree.
Note that if we limit the outdegree to one, then we get back expression trees.
In~\Cref{sec:results-circuits} we argue why our results from earlier sections also hold for lineage circuits and then argue why lineage circuits do indeed capture the notion of runtime of some well-known query processing algorithms in~\Cref{sec:circuit-runtime} (We formally define the corresponding cost model in~\Cref{sec:cost-model}).
In~\Cref{sec:results-circuits} we argue why results from earlier sections also hold for lineage circuits and then argue why lineage circuits capture the notion of runtime of well-known query processing algorithms in~\Cref{sec:circuit-runtime} (\Cref{sec:cost-model} formalizes the query cost model).
\subsubsection{Extending our results to lineage circuits}
\label{sec:results-circuits}
We first note that since expression trees are a special case of them, all of our hardness results in~\Cref{sec:hard} are still valid for lineage circuits.
For the approximation algorithm in~\Cref{sec:algo} we note that \textsc{Approx}\textsc{imate}$\rpoly$ (\Cref{alg:mon-sam}) works for lineage circuits as long as the same guarantees on $\onepass$ and $\sampmon$ (\Cref{lem:one-pass} and \Cref{lem:sample} respectively) hold for lineage circuits as well. It turns out that both $\onepass$ and $\sampmon$ work for lineage circuits as well, simply because the only property these use for expression trees is that each node has two children. This is still valid of lineage circuits where for each non-source node the children correspond to the two nodes that have incoming edges to the given node. Put another way, our argument never used the fact that in an expression tree, each node has at most one parent.
For further discussion on why~\Cref{lem:approx-alg} holds for a lineage circuit, see~\Cref{app:lineage-circuit-ext}.
Observe that \textsc{Approx}\textsc{imate}$\rpoly$ (\Cref{alg:mon-sam} in \Cref{sec:algo}) works for lineage circuits as long as the same guarantees on $\onepass$ and $\sampmon$ (\Cref{lem:one-pass} and \Cref{lem:sample} respectively) hold for lineage circuits as well.
It turns out that this is the case, simply because both algorithms rely on only one property of expression trees: that each node has two children;
Analogously in a circuit, each node has a maximum in-degree of two.
Put another way, our argument never used the fact that in an expression tree, each node has at most one parent.
%
For a more detailed discussion of why~\Cref{lem:approx-alg} holds for a lineage circuit, see~\Cref{app:lineage-circuit-ext}.
\subsubsection{The cost model}
\label{sec:cost-model}
@ -28,13 +37,15 @@ Thus far, our analysis of the runtime of $\onepass$ has been in terms of the siz
We now show that this model corresponds to the behavior of a deterministic database by proving that for any union of conjunctive query, we can construct a compressed lineage polynomial with the same complexity as it would take to evaluate the query on a deterministic \emph{bag-relational} database.
We adopt a minimalistic compute-bound model of query evaluation drawn from worst-case optimal joins~\cite{skew,ngo-survey}.
\newcommand{\qruntime}[1]{\textbf{cost}(#1)}
{\small
\begin{align*}
\qruntime{Q} & = |Q|\\
\qruntime{\sigma Q} & = \qruntime{Q}\\
\qruntime{\pi Q} & = \qruntime{Q} + \abs{Q}\\
\qruntime{Q \cup Q'} & = \qruntime{Q} + \qruntime{Q'} +\abs{Q}+\abs{Q'}\\
\qruntime{Q_1 \bowtie \ldots \bowtie Q_n} & = \qruntime{Q_1} + \ldots + \qruntime{Q_n} + |Q_1 \bowtie \ldots \bowtie Q_n|\\
\qruntime{Q_1 \bowtie \ldots \bowtie Q_n} & = \qruntime{Q_1} + \ldots + \qruntime{Q_n} + |Q_1 \bowtie \ldots \bowtie Q_n|
\end{align*}
}
Under this model the query plan $Q(D)$ has runtime $O(\qruntime{Q(D)})$.
Base relations assume that a full table scan is required; We model index scans by treating an index scan query $\sigma_\theta(R)$ as a single base relation.

View File

@ -77,7 +77,7 @@ Since $\semNX$-PDBs $\pxdb$ are a complete representation system for $\semN$-PDB
\label{subsec:expectation-of-polynom-proof}
\BG{TODO}
\subsection{Supplementary Material for~\Cref{def:tidbs-and-bidbs}}\label{subsec:supp-mat-ti-bi-def}
\subsection{Supplementary Material for~\Cref{subsec:tidbs-and-bidbs}}\label{subsec:supp-mat-ti-bi-def}
Two important subclasses of $\semNX$-PDBs that are of interest to us are the bag versions of tuple-independent databases (\tis) and block-independent databases (\bis). Under set semantics, a \ti is a deterministic database $\db$ where each tuple $\tup$ is assigned a probability $\prob(\tup)$. The set of possible worlds represented by a \ti $\db$ is all subsets of $\db$. The probability of each world is the product of the probabilities of all tuples that exist with one minus the probability of all tuples of $\db$ that are not part of this world, i.e., tuples are treated as independent random events. In a \bi, we also assign each tuple a probability, but additionally partition $\db$ into blocks. The possible worlds of a \bi $\db$ are all subsets of $\db$ that contain at most one tuple from each block. Note then that the tuples sharing the same block are disjoint, and the sum of the probabilitites of all the tuples in the same block $\block$ is $1$. The probability of such a world is the product of the probabilities of all tuples present in the world. %and one minus the sum of the probabilities of all tuples from blocks for which no tuple is present in the world.
For bag \tis and \bis, we define the probability of a tuple to be the probability that the tuple exists with multiplicity at least $1$.

View File

@ -32,15 +32,15 @@ However, even with alternative encodings~\cite{FH13}, the limiting factor in com
The corresponding lineage encoding for Bag-PDBs is a polynomial in sum of products (SOP) form --- a sum of `clauses', each of which is the product of a set of integer or variable atoms.
Thanks to linearity of expectation, computing the expectation of a count query is linear in the number of clauses in the SOP polynomial.
Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general.
Such compressed representations like Factorized Databases~\cite{10.1145/3003665.3003667,DBLP:conf/tapp/Zavodny11} or Arithmetic/Polynomial Circuits~\cite{arith-complexity}, are analogous to deterministic query optimizations (e.g. pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,10.1145/3003665.3003667}.
Such compressed representations like Factorized Databases~\cite{factorized-db,DBLP:conf/tapp/Zavodny11} or Arithmetic/Polynomial Circuits~\cite{arith-complexity}, are analogous to deterministic query optimizations (e.g. pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}.
Thus, measuring the performance of a PDB algorithm in terms of the size of the \emph{compressed} lineage formula allows us to more closely relate the algorithm's performance to the complexity of query evaluation in a deterministic database.
The initial picture is not good.
In this paper, we prove that computing expected counts is \emph{not} linear in the size of a compressed --- specifically a factorized~\cite{10.1145/3003665.3003667} --- lineage polynomial by reduction from counting $k$-matchings.
In this paper, we prove that computing expected counts is \emph{not} linear in the size of a compressed --- specifically a factorized~\cite{factorized-db} --- lineage polynomial by reduction from counting $k$-matchings.
Thus, even bag PDBs do not enjoy the same computational complexity as deterministic databases.
This motivates our second goal, a linear time approximation algorithm for computing expected counts in a bag database, with complexity linear in the size of a factorized lineage formula.
As we will show, the size of the factorized
lineage formula for a query --- and by extension, our approximation algorithm --- is proportional to the complexity of evaluating the same query on a comparable deterministic database instance~\cite{DBLP:conf/pods/KhamisNR16,10.1145/3003665.3003667}.
lineage formula for a query --- and by extension, our approximation algorithm --- is proportional to the complexity of evaluating the same query on a comparable deterministic database instance~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}.
In other words, our approximation algorithm can estimate expected multiplicities for tuples in the result of an SPJU query with a complexity comparable to deterministic query-processing.
\subsection{Sets vs Bags}
@ -225,7 +225,7 @@ Consider the Cartesian product of $\poly$ with itself:
\poly^2() := \rel(A), E(A, B), \rel(B),\; \rel(C), E(C, D), \rel(D)
\end{equation*}
For an arbitrary polynomial, it is known that there may exist equivalent compressed representations.
One such compression is the factorized polynomial~\cite{10.1145/3003665.3003667}, where the polynomial is broken up into separate factors.
One such compression is the factorized polynomial~\cite{factorized-db}, where the polynomial is broken up into separate factors.
For example:
{\small
\begin{equation*}

View File

@ -363,11 +363,17 @@
\newcommand{\inparen}[1]{\left({#1}\right)}
\newcommand{\inset}[1]{\left\{{#1}\right\}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Forcing Layouts
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\trimfigurespacing}{\vspace*{-5mm}}
%%%Adding stuff below so that long chain of display equatoons can be split across pages
\allowdisplaybreaks
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"

View File

@ -84,11 +84,6 @@ Juliana Freire},
Christoph Koch and
Dan Olteanu},
booktitle = {ICDE},
editor = {Rada Chirkova and
Asuman Dogac and
M. Tamer Özsu and
Timos K. Sellis},
pages = {1479--1480},
title = {MayBMS: Managing Incomplete Information with Probabilistic World-Set
Decompositions},
year = {2007}
@ -110,17 +105,6 @@ Atri Rudra},
year = {2016}
}
@article{10.1145/3003665.3003667,
author = {Olteanu, Dan and Schleich, Maximilian},
journal = {SIGMOD Rec.},
number = {2},
numpages = {12},
pages = {5--16},
title = {Factorized Databases},
volume = {45},
year = {2016}
}
@article{DBLP:journals/sigmod/GuagliardoL17,
author = {Paolo Guagliardo and
Leonid Libkin},
@ -148,8 +132,7 @@ Jennifer Widom},
@inproceedings{k-match,
author = {Radu Curticapean},
booktitle = {Automata, Languages, and Programming - 40th International Colloquium,
ICALP 2013, Riga, Latvia, July 8-12, 2013, Proceedings, Part I},
booktitle = {ICALP},
editor = {Fedor V. Fomin and
Rusins Freivalds and
Marta Z. Kwiatkowska and
@ -197,11 +180,7 @@ Maximilian Schleich},
@inproceedings{ngo-survey,
author = {Hung Q. Ngo},
booktitle = {Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles
of Database Systems, Houston, TX, USA, June 10-15, 2018},
editor = {Jan Van den Bussche and
Marcelo Arenas},
pages = {111--124},
booktitle = {PODS},
title = {Worst-Case Optimal Join Algorithms: Techniques, Results, and Open
Problems},
year = {2018}
@ -354,16 +333,6 @@ Emanuela Merelli},
year = {2009}
}
@article{OS16,
author = {Olteanu, Dan and Schleich, Maximilian},
journal = {SIGMOD Record},
number = {2},
pages = {5--16},
title = {Factorized Databases},
volume = {45},
year = {2016}
}
@article{FO16,
author = {Robert Fink and Dan Olteanu},
journal = {TODS},
@ -386,7 +355,7 @@ Emanuela Merelli},
@inproceedings{AB15c,
author = {Antoine Amarilli and Pierre Bourhis and Pierre Senellart},
booktitle = {Automata, Languages, and Programming - 42nd International Colloquium, ICALP 2015, Kyoto, Japan, July 6-10, 2015, Proceedings, Part II},
booktitle = {ICALP},
pages = {56--68},
title = {Provenance Circuits for Trees and Treelike Instances},
year = {2015}
@ -414,8 +383,6 @@ Emanuela Merelli},
@inproceedings{roy-11-f,
author = {Sudeepa Roy and Vittorio Perduca and Val Tannen},
booktitle = {ICDT},
editor = {Tova Milo},
pages = {232--243},
title = {Faster query answering in probabilistic databases using read-once functions},
year = {2011}
}
@ -498,7 +465,6 @@ Emanuela Merelli},
@inproceedings{fink-11,
author = {Robert Fink and Dan Olteanu},
booktitle = {ICDT},
editor = {Tova Milo},
pages = {174--185},
title = {On the optimal approximation of queries using tractable propositional languages},
year = {2011}
@ -522,7 +488,6 @@ Emanuela Merelli},
@conference{BD05,
author = {Boulos, J. and Dalvi, N. and Mandhani, B. and Mathur, S. and Re, C. and Suciu, D.},
booktitle = {SIGMOD},
pages = {891--893},
title = {MYSTIQ: a system for finding more answers by using probabilities},
year = {2005}
}
@ -565,10 +530,7 @@ Design, 1993, Santa Clara, California, USA, November 7-11, 1993},
author = {R. Iris Bahar and Erica A. Frohm and Charles M. Gaona and Gary
D. Hachtel and Enrico Macii and Abelardo Pardo and Fabio
Somenzi},
booktitle = {Proceedings of the 1993 IEEE/ACM International Conference on
Computer-Aided Design, 1993, Santa Clara, California, USA,
November 7-11, 1993},
pages = {188--191},
booktitle = {IEEE CAD},
title = {Algebraic decision diagrams and their applications},
year = {1993}
}

View File

@ -76,6 +76,7 @@ We focus on this problem exclusively from now on, assume an implicit result tupl
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{\tis and \bis}
\label{subsec:tidbs-and-bidbs}
In this paper, we focus on two popular forms of PDB: Block-Independent (\bi) and Tuple-Independent (\ti) PDBs.
%
A \bi $\pxdb = (\db, \pd)$ is an $\semNX$-PDB such that (i) every tuple is annotated with either $0$ or a unique variable $X_i$ and (ii) that the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup) \neq 0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same blocks are disjoint events.

View File

@ -1,7 +1,8 @@
%!TEX root=./main.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Compressed Representations of Polynomials and Boolean Formulas}\label{sec:compr-repr-polyn}
There is a large body of work on compact using representations of Boolean formulas (e.g, various types of circuits including OBDDs~\cite{jha-12-pdwm}) and polynomials (e.g.,factorizations~\cite{OS16,DBLP:conf/tapp/Zavodny11}) some of which have been utilized for probabilistic query processing, e.g.,~\cite{jha-12-pdwm}. Compact representations of Boolean formulas for which probabilities can be computed in linear time include OBDDs, SDDs, d-DNNF, and FBDD. In terms of circuits over semiring expression,~\cite{DM14c} studies circuits for absorptive semirings while~\cite{S18a} studies circuits that include negation (expressed as the monus operation of a semiring). Algebraic Decision Diagrams~\cite{bahar-93-al} (ADDs) generalize BDDs to variables with more than two values. Chen et al.~\cite{chen-10-cswssr} introduced the generalized disjunctive normal form.
There is a large body of work on compact using representations of Boolean formulas (e.g, various types of circuits including OBDDs~\cite{jha-12-pdwm}) and polynomials (e.g.,factorizations~\cite{DBLP:conf/tapp/Zavodny11,factorized-db}) some of which have been utilized for probabilistic query processing, e.g.,~\cite{jha-12-pdwm}. Compact representations of Boolean formulas for which probabilities can be computed in linear time include OBDDs, SDDs, d-DNNF, and FBDD. In terms of circuits over semiring expression,~\cite{DM14c} studies circuits for absorptive semirings while~\cite{S18a} studies circuits that include negation (expressed as the monus operation of a semiring). Algebraic Decision Diagrams~\cite{bahar-93-al} (ADDs) generalize BDDs to variables with more than two values. Chen et al.~\cite{chen-10-cswssr} introduced the generalized disjunctive normal form.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Parameterized Complexity}\label{sec:param-compl}