Done with my pass

master
Atri Rudra 2021-09-18 01:47:02 -04:00
parent 7351a432ad
commit d0b2c75d63
4 changed files with 107 additions and 61 deletions

View File

@ -2,6 +2,23 @@
%\input{app_approx-alg-pseudo-code}
The following results assume input circuit \circuit computed from an arbitrary $\raPlus$ query $\query$ and arbitrary \abbrBIDB $\pdb$. We refer to \circuit as a \abbrBIDB circuit.
\AH{Verify that the proof for \Cref{lem:approx-alg} doesn't rely on properties of $\raPlus$ or \abbrBIDB.}
\begin{Theorem}\label{lem:approx-alg}
Let \circuit be an arbitrary \abbrBIDB circuit %for a UCQ over \bi
and define $\poly(\vct{X})=\polyf(\circuit)$ and let $k=\degree(\circuit)$.
Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ can be computed in time
{\small
\[O\left(\left(\size(\circuit) + \frac{\log{\frac{1}{\conf}}\cdot \abs{\circuit}^2(1,\ldots, 1)\cdot k\cdot \log{k} \cdot \depth(\circuit))}{\inparen{\error}^2\cdot\rpoly^2(\prob_1,\ldots, \prob_\numvar)}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)\]
}
such that
\begin{equation}
\label{eq:approx-algo-bound}
\probOf\left(\left|\mathcal{E} - \rpoly(\prob_1,\dots,\prob_\numvar)\right|> \error \cdot \rpoly(\prob_1,\dots,\prob_\numvar)\right) \leq \conf.
\end{equation}
\end{Theorem}
\AR{\textbf{Aaron:}Just copied over from S4. The above text might need smoothening into the appendix.}
\subsection{Proof of Theorem \ref{lem:approx-alg}}\label{sec:proof-lem-approx-alg}
\input{app_approx-alg-pseudo-code}
In order to prove \Cref{lem:approx-alg}, we will need to argue the correctness of \approxq, which relies on the correctness of auxiliary algorithms \onepass and \sampmon.

View File

@ -4,13 +4,13 @@
\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
In \Cref{sec:hard}, we showed that the answer to \Cref{prob:intro-stmt} is no.
With this result, we now design an approximation algorithm for our problem that runs in $\bigO{\abs{\circuit}}$.\footnote{For a very broad class of circuits: please see the discussion after \Cref{lem:val-ub} for more.}
The folowing approximation algorithm applies to \abbrBIDB lineage polynomials (over $\raPlus$ queries), though our bounds are more meaningful for a non-trivial subclass of queries over \bis that contains all queries on \tis, as well as the queries of the PDBench benchmark~\cite{pdbench}. As before, all proofs and pseudocode can be found in \Cref{sec:proofs-approx-alg}.
With this result, we now design an approximation algorithm for our problem that runs in $\bigO{\abs{\circuit}}$ for a very broad class of circuits (see the discussion after \Cref{lem:val-ub} for more).
The folowing approximation algorithm applies to \abbrBIDB lineage polynomials (over $\raPlus$ queries), though our bounds are more meaningful for a non-trivial subclass of queries over \bis that contains all queries on \tis, as well as the queries of the PDBench benchmark~\cite{pdbench}. All proofs and pseudocode can be found in \Cref{sec:proofs-approx-alg}.
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
\subsection{Preliminaries and some more notation}
We now introduce useful definitions and notation related to circuits and polynomials.
We now introduce definitions and notation related to circuits and polynomials that we will need to state our upper bound results.
\begin{Definition}[$\expansion{\circuit}$]\label{def:expand-circuit}
For a circuit $\circuit$, we define $\expansion{\circuit}$ as a list of tuples $(\monom, \coef)$, where $\monom$ is a set of variables and $\coef \in \domN$.
@ -49,29 +49,12 @@ $\degree(\circuit)$ is defined recursively as follows:
\]
\end{Definition}
Finally, we use the following notation for the complexity of multiplying integers:
Next, we use the following notation for the complexity of multiplying integers:
\begin{Definition}[$\multc{\cdot}{\cdot}$]\footnote{We note that when doing arithmetic operations on the RAM model for input of size $N$, we have that $\multc{O(\log{N})}{O(\log{N})}=O(1)$. More generally we have $\multc{N}{O(\log{N})}=O(N\log{N}\log\log{N})$.}
In a RAM model of word size of $W$-bits, $\multc{M}{W}$ denotes the complexity of multiplying two integers represented with $M$-bits. (We will assume that for input of size $N$, $W=O(\log{N})$.
In a RAM model of word size of $W$-bits, $\multc{M}{W}$ denotes the complexity of multiplying two integers represented with $M$-bits. (We will assume that for input of size $N$, $W=O(\log{N})$.)
\end{Definition}
\subsection{Our main result}\label{sec:algo:sub:main-result}
The following results assume input circuit \circuit computed from an arbitrary $\raPlus$ query $\query$ and arbitrary \abbrBIDB $\pdb$. We refer to \circuit as a \abbrBIDB circuit.
\AH{Verify that the proof for \Cref{lem:approx-alg} doesn't rely on properties of $\raPlus$ or \abbrBIDB.}
\begin{Theorem}\label{lem:approx-alg}
Let \circuit be an arbitrary \abbrBIDB circuit %for a UCQ over \bi
and define $\poly(\vct{X})=\polyf(\circuit)$ and let $k=\degree(\circuit)$.
Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ can be computed in time
{\small
\[O\left(\left(\size(\circuit) + \frac{\log{\frac{1}{\conf}}\cdot \abs{\circuit}^2(1,\ldots, 1)\cdot k\cdot \log{k} \cdot \depth(\circuit))}{\inparen{\error}^2\cdot\rpoly^2(\prob_1,\ldots, \prob_\numvar)}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)\]
}
such that
\begin{equation}
\label{eq:approx-algo-bound}
\probOf\left(\left|\mathcal{E} - \rpoly(\prob_1,\dots,\prob_\numvar)\right|> \error \cdot \rpoly(\prob_1,\dots,\prob_\numvar)\right) \leq \conf.
\end{equation}
\end{Theorem}
To get linear runtime results from \Cref{lem:approx-alg}, we will need to define another parameter modeling the (weighted) number of monomials in %$\poly\inparen{\vct{X}}$
Finally, to get linear runtime results from \Cref{lem:approx-alg}, we will need to define another parameter modeling the (weighted) number of monomials in %$\poly\inparen{\vct{X}}$
$\expansion{\circuit}$
that need to be `canceled' when monomials with dependent variables are removed (\Cref{def:reduced-bi-poly}). %def:hen it is modded with $\mathcal{B}$ (\Cref{def:mod-set-polys}).
Let $\isInd{\cdot}$ be a boolean function returning true if monomial $\encMon$ is composed of independent variables and false otherwise; further, let $\indicator{\theta}$ also be a boolean function returning true if $\theta$ evaluates to true.
@ -82,36 +65,14 @@ Given a \abbrBIDB circuit $\circuit$ define
\[\gamma(\circuit)=\frac{\sum_{(\monom, \coef)\in \expansion{\circuit}} \abs{\coef}\cdot \indicator{\neg\isInd{\encMon}} }%\encMon\mod{\mathcal{B}}\equiv 0}}
{\abs{\circuit}(1,\ldots, 1)}.\]
\end{Definition}
\noindent We next present a few corollaries of \Cref{lem:approx-alg}.
\begin{Corollary}
\label{cor:approx-algo-const-p}
Let $\poly(\vct{X})$ be as in \Cref{lem:approx-alg} and let $\gamma=\gamma(\circuit)$ for \abbrBIDB circuit \circuit. Further let it be the case that $\prob_i\ge \prob_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ satisfying \Cref{eq:approx-algo-bound} can be computed in time
\[O\left(\left(\size(\circuit) + \frac{\log{\frac{1}{\conf}}\cdot k\cdot \log{k} \cdot \depth(\circuit))}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot \prob_0^{2k}}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)\]
In particular, if $\prob_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)$.
\end{Corollary}
The restriction on $\gamma$ is satisfied by any \ti (where $\gamma=0$) as well as for all three queries of the PDBench \bi benchmark (see \Cref{app:subsec:experiment} for experimental results).
Finally, we address the $\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}$ term in the runtime. %In \Cref{susec:proof-val-up}, we show the following:
\begin{Lemma}
\label{lem:val-ub}
For any \abbrBIDB circuit $\circuit$ with $\degree(\circuit)=k$, we have
$\abs{\circuit}(1,\ldots, 1)\le 2^{2^k\cdot \size(\circuit)}.$
Further, under either of the following conditions:
\begin{enumerate}
\item $\circuit$ is a tree,
\item $\circuit$ encodes the run of the algorithm on a FAQ~\cite{DBLP:conf/pods/KhamisNR16}\AH{AJAR citation.} query,
\end{enumerate}
we have $\abs{\circuit}(1,\ldots, 1)\le \size(\circuit)^{O(k)}.$
\end{Lemma}
Note that the above implies that with the assumption $\prob_0>0$ and $\gamma<1$ are absolute constants from \Cref{cor:approx-algo-const-p}, then the runtime there simplies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$ and to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$ for the case when $\circuit$ satisfies the specific conditions in \Cref{lem:val-ub}. In \Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios, including query evaluation under \raPlus or FAQ.
\AH{AJAR reference.}
\subsection{Our main result}\label{sec:algo:sub:main-result}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Approximating $\rpoly$}
We prove \Cref{lem:approx-alg} by developing an approximation algorithm (\approxq pseudo code in \Cref{sec:proof-lem-approx-alg}) with the desired runtime. This algorithm is based on the following observation.
\mypar{Algorithm Idea}
%We prove \Cref{lem:approx-alg} by developing an
Our approximation algorithm (\approxq pseudo code in \Cref{sec:proof-lem-approx-alg})
%with the desired runtime. This algorithm
is based on the following observation.
% The algorithm (\approxq detailed in \Cref{alg:mon-sam}) to prove \Cref{lem:approx-alg} follows from the following observation.
Given a lineage polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over $\bi$, we have: % can exactly represent $\rpoly(\vct{X})$ as follows:
@ -126,9 +87,76 @@ Given a lineage polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circui
Given the above, the algorithm is a sampling based algorithm for the above sum: we sample (via \sampmon) $(\monom,\coef)\in \expansion{\circuit}$ with probability proportional
to $\abs{\coef}$ and compute $\vari{Y}=\indicator{\isInd{\encMon}}
\cdot \prod_{X_i\in \monom} p_i$. Taking $\ceil{\frac{2 \log{\frac{2}{\conf}}}{\error^2}}$ samples and computing the average of $\vari{Y}$ gives us our final estimate. \onepass is used to compute the sampling probabilities needed in \sampmon (details are in \Cref{sec:proofs-approx-alg}).
\cdot \prod_{X_i\in \monom} p_i$. %Taking $\ceil{\frac{2 \log{\frac{2}{\conf}}}{\error^2}}$ samples
Repeating the sampling appropriate number of times
and computing the average of $\vari{Y}$ gives us our final estimate. \onepass is used to compute the sampling probabilities needed in \sampmon (details are in \Cref{sec:proofs-approx-alg}).
%%%%%%%%%%%%%%%%%%%%%%%
%The following results assume input circuit \circuit computed from an arbitrary $\raPlus$ query $\query$ and arbitrary \abbrBIDB $\pdb$. We refer to \circuit as a \abbrBIDB circuit.
%\AH{Verify that the proof for \Cref{lem:approx-alg} doesn't rely on properties of $\raPlus$ or \abbrBIDB.}
%\begin{Theorem}\label{lem:approx-alg}
%Let \circuit be an arbitrary \abbrBIDB circuit %for a UCQ over \bi
%and define $\poly(\vct{X})=\polyf(\circuit)$ and let $k=\degree(\circuit)$.
%Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ can be computed in time
%{\small
%\[O\left(\left(\size(\circuit) + \frac{\log{\frac{1}{\conf}}\cdot \abs{\circuit}^2(1,\ldots, 1)\cdot k\cdot \log{k} \cdot \depth(\circuit))}{\inparen{\error}^2\cdot\rpoly^2(\prob_1,\ldots, \prob_\numvar)}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)\]
%}
%such that
%\begin{equation}
%\label{eq:approx-algo-bound}
%\probOf\left(\left|\mathcal{E} - \rpoly(\prob_1,\dots,\prob_\numvar)\right|> \error \cdot \rpoly(\prob_1,\dots,\prob_\numvar)\right) \leq \conf.
%\end{equation}
%\end{Theorem}
\mypar{Runtime analysis} We can argue the following runtime for the algorithm outlined above:
% We next present a few corollaries of \Cref{lem:approx-alg}.
\begin{Theorem}
\label{cor:approx-algo-const-p}
Let \circuit be an arbitrary \abbrBIDB circuit %for a UCQ over \bi
and define $\poly(\vct{X})=\polyf(\circuit)$ and let $k=\degree(\circuit)$.
%Let $\poly(\vct{X})$ be as in \Cref{lem:approx-alg} and
Let $\gamma=\gamma(\circuit)$ for \abbrBIDB circuit \circuit. Further let it be the case that $\prob_i\ge \prob_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$
satisfying
\begin{equation}
\label{eq:approx-algo-bound-main}
\probOf\left(\left|\mathcal{E} - \rpoly(\prob_1,\dots,\prob_\numvar)\right|> \error' \cdot \rpoly(\prob_1,\dots,\prob_\numvar)\right) \leq \conf
\end{equation}
can be computed in time
\begin{equation}
\label{eq:approx-algo-runtime}
O\left(\left(\size(\circuit) + \frac{\log{\frac{1}{\conf}}\cdot k\cdot \log{k} \cdot \depth(\circuit))}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot \prob_0^{2k}}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right).
\end{equation}
In particular, if $\prob_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)$.
\end{Theorem}
The restriction on $\gamma$ is satisfied by any \ti (where $\gamma=0$) as well as for all three queries of the PDBench \bi benchmark (see \Cref{app:subsec:experiment} for experimental results).
We briefly connect the runtime in \Cref{eq:approx-algo-runtime} to the algorithm outline earlier (where we ignore the dependence on $\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}$). The $\size(\circuit)$ comes from the time take to run \onepass once (\onepass essentially computes $\abs{\circuit}(1,\ldots, 1)$ using the natural circuit evaluation algorithm on $\circuit$). We make $\frac{\log{\frac{1}{\conf}}}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot \prob_0^{2k}}$ many calls to \sampmon (each of which essentially traces $O(k)$ random sink to source paths in $\circuit$ all of which by definition have length at most $\depth(\circuit)$.
Finally, we address the $\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}$ term in the runtime. %In \Cref{susec:proof-val-up}, we show the following:
\begin{Lemma}
\label{lem:val-ub}
For any \abbrBIDB circuit $\circuit$ with $\degree(\circuit)=k$, we have
$\abs{\circuit}(1,\ldots, 1)\le 2^{2^k\cdot \size(\circuit)}.$
Further, under either of the following conditions:
\begin{enumerate}
\item $\circuit$ is a tree,
\item $\circuit$ encodes the run of the algorithm on a FAQ~\cite{DBLP:conf/pods/KhamisNR16}/AJAR~\cite{ajar} query,
\end{enumerate}
we have $\abs{\circuit}(1,\ldots, 1)\le \size(\circuit)^{O(k)}.$
\end{Lemma}
Note that the above implies that with the assumption $\prob_0>0$ and $\gamma<1$ are absolute constants from \Cref{cor:approx-algo-const-p}, then the runtime there simplies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$ and
\begin{Corollary}
Let $\poly(\vct{X})$ be a \abbrBIDB-lineage polynomial correspoding to an \abbrBIDB circuit $\circuit$ that satisfies the specific conditions in \Cref{lem:val-ub}. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time
$O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$. % for the case when $\circuit$ satisfies the specific conditions in \Cref{lem:val-ub}.
\end{Corollary}
\AR{The above Corollary needs to be improved/generalized. This is a place-holder for now.}
In \Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios, including query evaluation under FAQ/AJAR setup.
%\AH{AJAR reference.}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"

View File

@ -1,17 +1,18 @@
%!TEX root=./main.tex
\section{Conclusions and Future Work}\label{sec:concl-future-work}
We have studied the problem of calculating the expectation of lineage polynomials over BIDBs. %random integer variables.
This problem has a practical application in probabilistic databases over multisets, where it corresponds to calculating the expected multiplicity of a query result tuple.
We have studied the problem of calculating the expected multiplicity of a query result tuple, %expectation of lineage polynomials over BIDBs. %random integer variables.
a problem that has a practical application in probabilistic databases over multisets. %, where it corresponds to calculating the expected multiplicity of a query result tuple.
% It has been studied extensively for sets (lineage formulas), but the bag settings has not received much attention.
While the expectation of a polynomial can be calculated in linear time for % in the size of
polynomials % that are
in SOP form, the problem is \sharpwonehard for factorized polynomials (proven through a reduction from the problem of counting k-matchings).
%While the expectation of a polynomial can be calculated in linear time for % in the size of
% polynomials % that are
%in SOP form, the problem is \sharpwonehard for factorized polynomials (proven through a reduction from the problem of counting k-matchings).
%We have proven this claim through a reduction from the problem of counting k-matchings.
We show that under various parameterized complexity hardness results/conjectures computing the expected multiplicities exactly is not possible in time linear in the corresponding deterministic query processing time.
We prove that it is possible to approximate the expectation of a lineage polynomial in linear time
% When only considering polynomials for result tuples of
for UCQs over TIDBs and BIDBs (assuming that there are few cancellations).
Interesting directions for future work include development of a dichotomy for bag \abbrPDB\xplural. While we handle higher moments in \Cref{app:sec-cicuits}, more general approximations are an interesting area for exploration, including those for more general data models. % beyond what we consider in this paper.
in the deterministic query processing over TIDBs and BIDBs (assuming that there are few cancellations).
Interesting directions for future work include development of a dichotomy for bag \abbrPDB\xplural. While we can handle higher moments (this follows fairly easily from our existing results-- see \Cref{app:sec-cicuits}), more general approximations are an interesting area for exploration, including those for more general data models. % beyond what we consider in this paper.
% Furthermore, it would be interesting to see whether our approximation algorithm can be extended to support queries with negations, perhaps using circuits with monus as a representation system.
% \BG{I am not sure what interesting future work is here. Some wild guesses, if anybody agrees I'll try to flesh them out:

View File

@ -17,14 +17,14 @@ Dalvi et al.~\cite{DS12} and Olteanu et al.~\cite{FO16} proved dichotomies for U
% Olteanu et al.~\cite{FO16} presented dichotomies for two classes of queries with negation.
% R\'e et al~\cite{RS09b} present a trichotomy for HAVING queries.
Amarilli et al. investigated tractable classes of databases for more complex queries~\cite{AB15}. %,AB15c
Another line of work, studies which structural properties of lineage formulas lead to tractable cases~\cite{kenig-13-nclexpdc,roy-11-f,sen-10-ronfqevpd}.
Another line of work studies which structural properties of lineage formulas lead to tractable cases~\cite{kenig-13-nclexpdc,roy-11-f,sen-10-ronfqevpd}.
In this paper we focus on intensional query evaluation with polynomials.
Many data models have been proposed for encoding PDBs more compactly than as sets of possible worlds.
These include tuple-independent databases~\cite{VS17} (\tis), block-independent databases (\bis)~\cite{RS07}, and \emph{PC-tables}~\cite{GT06}.
%
Fink et al.~\cite{FH12} study aggregate queries over a probabilistic version of the extension of K-relations for aggregate queries proposed in~\cite{AD11d} (\emph{pvc-tables}) that supports bags, and has runtime complexity linear in the size of the lineage.
However, this lineage is encoded as a tree; the size (and thus the runtime) are still superlinear in $\qruntime(\query, \dbbase)$.
However, this lineage is encoded as a tree; the size (and thus the runtime) are still superlinear in $\qruntime{\query, \dbbase}$.
The runtime bound is also limited to a specific class of (hierarchical) queries, suggesting the possibility of a generalization of \cite{DS12}'s dichotomy result to \abbrBPDB\xplural.
% Probabilities are computed using a decomposition approach~\cite{DBLP:conf/icde/OlteanuHK10}.
% over the symbolic expressions that are used as tuple annotations and values in pvc-tables.