master
Oliver Kennedy 2021-09-15 23:51:29 -04:00
parent 473b5885a4
commit c7cae52c11
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
5 changed files with 15 additions and 16 deletions


@ -5,7 +5,7 @@
In \Cref{sec:hard}, we showed that the answer to \Cref{prob:intro-stmt} is no.
With this result, we now design an approximation algorithm for our problem that runs in $\bigO{\abs{\circuit}}$.\footnote{For a very broad class of circuits: please see the discussion after \Cref{lem:val-ub} for more.}
The following approximation algorithm applies to \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis, as well as the PDBench benchmark~\cite{pdbench}. As before, all proofs and pseudocode can be found in \Cref{sec:proofs-approx-alg}.
The following approximation algorithm applies to \bi, though our bounds are more meaningful for a non-trivial subclass of queries over \bis that contains all queries on \tis, as well as the queries of the PDBench benchmark~\cite{pdbench}. As before, all proofs and pseudocode can be found in \Cref{sec:proofs-approx-alg}.
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
\subsection{Preliminaries and some more notation}
@ -121,7 +121,7 @@ $\abs{\circuit}(1,\ldots, 1)\le 2^{2^k\cdot \size(\circuit)}.$
Further, under either of the following conditions:
\begin{enumerate}
\item $\circuit$ is a tree,
\item $\circuit$ encodes the run of the algorithm in~\cite{DBLP:conf/pods/KhamisNR16} on an FAQ\AH{AJAR citation.} query,
\item $\circuit$ encodes the run of the algorithm in~\cite{DBLP:conf/pods/KhamisNR16} on a FAQ\AH{AJAR citation.} query,
\end{enumerate}
we have $\abs{\circuit}(1,\ldots, 1)\le \size(\circuit)^{O(k)}.$
\end{Lemma}


@ -1,6 +1,6 @@
%!TEX root=./main.tex
\subsection{More on Circuits and Moments}\label{sec:gen}
\subsection{Relationship to Deterministic Query Runtimes}\label{sec:gen}
We formalize our claim from \Cref{sec:intro} that a linear-time approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries under reasonable assumptions.
Lastly, we generalize our result for expectation to other moments.
@ -17,7 +17,7 @@ Lastly, we generalize our result for expectation to other moments.
\mypar{The cost model}
%\label{sec:cost-model}
So far our analysis of $\approxq$ has been in terms of the size of the lineage circuits.
So far our analysis of \Cref{prob:intro-stmt} has been in terms of the size of the lineage circuits.
We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\query$, we can construct a compressed circuit for $\poly$ and \bi $\pdb$ of size and runtime linear in that of a general class of query processing algorithms for the same query $\query$ on $\pdb$'s \dbbaseName $\dbbase$.
% Note that by definition, there exists a linear relationship between input sizes $|\pxdb|$ and $|\dbbase|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db})$).
% \footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
@ -54,7 +54,7 @@ It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
We are now ready to formally state our claim from \Cref{sec:intro}:
We are now ready to formally state our claim with respect to \Cref{prob:intro-stmt}:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Corollary}\label{cor:cost-model}
Given an $\raPlus$ query $\query$ over a \ti $\pdb$ with \dbbaseName $\dbbase$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pdb)$ with probability at least $1-\delta$ in time


@ -1,6 +1,6 @@
%root:main.tex
%!TEX root=./main.tex
\section{Hardness of exact computation}
\section{Hardness of Exact Computation}
\label{sec:hard}
In this section, we will prove the hardness results claimed in Table~\ref{tab:lbs} for a specific (family of) hard instances $(\query,\pdb)$ for \Cref{prob:bag-pdb-poly-expected} where $\pdb$ is a \abbrTIDB.
@ -11,7 +11,7 @@ In this section, we will prove the hardness results claimed in Table~\ref{tab:lb
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Preliminaries}
Our hardness results are based on (exactly) counting the number of (not necessarily induced) subgraphs in $G$ isomorphic to $H$. Let $\numocc{G}{H}$ denote this quantity. We can think of $H$ as being of constant size and $G$ as growing. %In query processing, $H$ can be viewed as the query while $G$ as the database instance.
In particular, we will consider the problems of computing the following counts (given $G$ in its adjacency list representation): $\numocc{G}{\tri}$ (the number of triangles), $\numocc{G}{\threedis}$ (the number of $3$-matchings), and the latter's generalization $\numocc{G}{\kmatch}$ (the number of $k$-matchings). We use $\kmatchtime$ to denote the optimal runtime of computing $\numocc{G}{\kmatch}$. Our hardness results in \Cref{sec:multiple-p} is based on the following hardness results/conjectures:
In particular, we will consider the problems of computing the following counts (given $G$ in its adjacency list representation): $\numocc{G}{\tri}$ (the number of triangles), $\numocc{G}{\threedis}$ (the number of $3$-matchings), and the latter's generalization $\numocc{G}{\kmatch}$ (the number of $k$-matchings). We use $\kmatchtime$ to denote the optimal runtime of computing $\numocc{G}{\kmatch}$. Our hardness results in \Cref{sec:multiple-p} are based on the following hardness results/conjectures:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Theorem}[\cite{k-match}]
@ -21,7 +21,7 @@ Given positive integer $k$ and undirected graph $G=(\vset,\edgeSet)$ with no sel
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%The above result means that we cannot hope to count the number of $k$-matchings in $G=(\vset,\edgeSet)$ in time $f(k)\cdot |\vset|^{c}$ for any function $f$ and constant $c$ independent of $k$.
\begin{hypo}\label{conj:known-algo-kmatch}
There exists an absolute constant $c_0>0$ such that for every $G=(\vset,\edgeSet)$, we have $\kmatchtime \ge \Omega{|E|^{c_0\cdot k}}$.
There exists an absolute constant $c_0>0$ such that for every $G=(\vset,\edgeSet)$, we have $\kmatchtime \ge \Omega\inparen{|E|^{c_0\cdot k}}$.
\end{hypo}
We note that the above conjecture is somewhat non-standard. In particular, the best known algorithm to compute $\numocc{G}{\kmatch}$ takes time $\Omega\inparen{|V|^{k/2}}$ (i.e., if this is the best algorithm then $c_0=\frac 14$)~\cite{k-match}. What the above conjecture is saying is that one can only hope for a polynomial improvement over the state of the art algorithm to compute $\numocc{G}{\kmatch}$.
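To see where the value $c_0=\frac 14$ comes from, the following short sanity check may help; it assumes that $G$ is a simple graph, so that $|E|\le |V|^2$:
\begin{equation*}
|E|\le |V|^2 \;\Longrightarrow\; |V|\ge |E|^{1/2} \;\Longrightarrow\; |V|^{k/2}\ge |E|^{k/4} = |E|^{\frac 14\cdot k}.
\end{equation*}
That is, a runtime of order $|V|^{k/2}$ is at least $|E|^{c_0\cdot k}$ for $c_0=\frac 14$, and the two quantities agree up to constants (for fixed $k$) on dense graphs where $|E|=\Theta(|V|^2)$.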
%
@ -59,7 +59,7 @@ WHERE a.city = r.city1 AND b.city = r.city2
\end{lstlisting}
as $R_i$ for each $i \in [k]$. The query $\query^k$ then becomes
\begin{lstlisting}
SELECT 1 FROM $R_1$ JOIN $R_2$ JOIN$\cdots$JOIN $R_k$
SELECT COUNT(*) FROM $R_1$ JOIN $R_2$ JOIN$\cdots$JOIN $R_k$
\end{lstlisting}
%RA format for the same query
%\begin{align*}
@ -100,7 +100,6 @@ $\qruntime{\query^k, \dbbase}$ is $O(\kElem\numedge)$.
%Unless otherwise noted, all proofs for this section are in \Cref{app:single-mult-p}.
We are now ready to present our main hardness result.
%
e
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Theorem}\label{thm:mult-p-hard-result}
Let $\prob_0,\ldots,\prob_{2k}$ be $2k + 1$ distinct values in $(0, 1]$. Then computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ (over all $i\in [2k+1]$ for arbitrary $G=(\vset,\edgeSet)$


@ -1,7 +1,7 @@
%root: main.tex
%!TEX root=./main.tex
\subsection{Problem Definition}\label{sec:expression-trees}
\subsection{Formalizing \Cref{prob:intro-stmt}}\label{sec:expression-trees}
%We first formally define circuits, an encoding of polynomials that we use throughout the paper.
%
@ -11,9 +11,9 @@ We represent lineage polynomials via {\em arithmetic circuits}~\cite{arith-compl
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Circuit]\label{def:circuit}
A circuit $\circuit$ is a Directed Acyclic Graph (DAG) whose source gates (in-degree of $0$) consist of elements in either $\domN$ or $\vct{X}$. For each output tuple there exists one source gate. The internal gates have binary input and are either sum ($\circplus$) or product ($\circmult$) gates.
A circuit $\circuit$ is a Directed Acyclic Graph (DAG) whose source gates (in-degree of $0$) consist of elements in either $\domN$ or $\vct{X}$. For each output tuple there exists one sink gate. The internal gates have binary input and are either sum ($\circplus$) or product ($\circmult$) gates.
%
Each gate has the following members: \type, \vpartial, \vari{input}, \degval, \vari{Lweight}, and \vari{Rweight}, where \type is the value type $\{\circplus, \circmult, \var, \tnum\}$ and \vari{input} the list of inputs. Source gates have an additional member \val storing the value. $\circuit_\linput$ ($\circuit_\rinput$) denotes the left (right) input of \circuit.
Each gate has the following members: \type, \vpartial, \vari{input}, \degval, \vari{Lweight}, and \vari{Rweight}, where \type is the value type $\{\circplus, \circmult, \var, \tnum\}$ and \vari{input} the list of inputs. Source gates have an extra member \val storing the value. $\circuit_\linput$ ($\circuit_\rinput$) denotes the left (right) input of \circuit.
\end{Definition}
When the underlying DAG is a tree (with edges pointing towards the root), the structure is an expression tree \etree. In such a case, the root of \etree is analogous to the sink of \circuit. The fields \vari{partial}, \degval, \vari{Lweight}, and \vari{Rweight} are used in the proofs of \Cref{sec:proofs-approx-alg}.
@ -92,7 +92,7 @@ Note that each circuit \circuit encodes a tree, with edges pointing towards the
%\end{figure}
We next formally define the relationship of circuits with polynomials. While the definition assumes one sink for notational convenience, it easily generalizes to the multiple sinks case.
\begin{Definition}[$\polyf(\cdot)$]\label{def:poly-func}
Denote $\polyf(\circuit)$ to be the function from circuit $\circuit$ to its corresponding polynomial (in \abbrSMB). $\polyf(\cdot)$ is recursively defined on $\circuit$ as follows, with addition and multiplication following the standard interpretation for polynomials:
Denote $\polyf(\circuit)$ to be the function from the sink of circuit $\circuit$ to its corresponding polynomial (in \abbrSMB). $\polyf(\cdot)$ is recursively defined on $\circuit$ as follows, with addition and multiplication following the standard interpretation for polynomials:
\begin{equation*}
\polyf(\circuit) = \begin{cases}
\polyf(\circuit_\lchild) + \polyf(\circuit_\rchild) &\text{ if \circuit.\type } = \circplus\\
@ -102,7 +102,7 @@ Denote $\polyf(\circuit)$ to be the function from circuit $\circuit$ to its corr
\end{equation*}
\end{Definition}
$\circuit$ need not encode $\poly\inparen{\vct{X}}$ in the same, default \abbrSMB representation. For instance, $\circuit$ could encode the factorized representation $(X + 2Y)(2X - Y)$ of $\poly\inparen{\vct{X}} = 2X^2+3XY-2Y^2$, as shown in \Cref{fig:circuit}, while $\polyf(\circuit) = \poly\inparen{\vct{X}}$, the equivalent \abbrSMB representation.
$\circuit$ need not encode $\poly\inparen{\vct{X}}$ in the same, default \abbrSMB representation. For instance, $\circuit$ could encode the factorized representation $(X + 2Y)(2X - Y)$ of $\poly\inparen{\vct{X}} = 2X^2+3XY-2Y^2$, as shown in \Cref{fig:circuit}, while $\polyf(\circuit) = \poly\inparen{\vct{X}}$ is always the equivalent \abbrSMB representation.
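As a quick check of this example, expanding the factorized form term by term recovers the \abbrSMB representation:
\begin{equation*}
(X + 2Y)(2X - Y) = 2X^2 - XY + 4XY - 2Y^2 = 2X^2 + 3XY - 2Y^2.
\end{equation*}
Both forms therefore denote the same polynomial, even though the factorized circuit never materializes the three monomials explicitly.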
\begin{Definition}[Circuit Set]\label{def:circuit-set}
$\circuitset{\polyX}$ is the set of all possible circuits $\circuit$ such that $\polyf(\circuit) = \polyX$.


@ -55,7 +55,7 @@ A \bi $\pdb$ is a \abbrPDB with the constraint that
the tuples in $\dbbase$ can be partitioned into a set of $\ell$ blocks such that tuples $\tup_{i, j}, \tup_{k, j'}$ from separate blocks $(i\neq k, j \in [\abs{i}], j' \in [\abs{k}])$ are independent of each other while tuples $\tup_{i, j}, \tup_{i, k}$ from the same block are disjoint events.\footnote{
Although only a single independent, $[\abs{\block_i}+1]$-valued variable is customarily used per block~\cite{DBLP:series/synthesis/2011Suciu}, we decompose it into $\abs{\block_i}$ correlated $\{0,1\}$-valued variables per block that can be used directly in polynomials (without an indicator function). For $t_{i, j} \in b_i$, the event $(\randWorld_{i,j} = 1)$ corresponds to the event $(\randWorld_i = j)$ in the customary annotation scheme.
}
Each tuple $\tup_{i, j}$ is annotated with a random variable $\randWorld_{i, j} \in \{0, 1\}$ denoting its presence in a possible world $\db$. The probability distribution $\pd$ over $\dbbase$ is the one induced from individual tuple probabilities $\prob_{i, j}\in \vct{\prob}=\inparen{\prob_{1, 1},\ldots,\prob_{\abs{\block},\ldots,\abs{\block_{\abs{\block}}}}}$ and the conditions on the blocks. A \abbrTIDB is a \abbrBIDB with the added requirement that each block is size $1$.
Each tuple $\tup_{i, j}$ is annotated with a random variable $\randWorld_{i, j} \in \{0, 1\}$ denoting its presence in a possible world $\db$. The probability distribution $\pd$ over $\dbbase$ is the one induced from individual tuple probabilities $\prob_{i, j}\in \vct{\prob}=\inparen{\prob_{1, 1},\ldots,\prob_{\abs{\block},\ldots,\abs{\block_{\abs{\block}}}}}$ and the conditions on the blocks. A \abbrTIDB is a \abbrBIDB where each block has size exactly $1$.
Instead of looking only at the possible worlds of $\pdb$, one can consider all worlds, including those that cannot exist due to disjointness. The all worlds set can be modeled by $\vct{\randWorld}\in \{0, 1\}^\numvar$,\footnote{Here and later on in the paper, especially in \Cref{sec:algo}, we will overload notation and rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$.} such that $\randWorld_k \in \vct{\randWorld}$ represents the presence of $\tup_{i, j}$ (where $k = \sum_{\ell = 1}^{i - 1} \abs{b_\ell} + j$). We denote a probability distribution over all $\vct{\randWorld} \in \{0, 1\}^\numvar$ as $\pdassign$. When $\pdassign$ is the one induced from each $\prob_{i, j}$ while assigning $\probOf\pbox{\vct{\randWorld}} = 0$ for any $\vct{\randWorld}$ with $\randWorld_{i, j} = \randWorld_{i, k} = 1$ for any block $i$ and $j\neq k$, we end up with a bijective mapping from $\pd$ to $\pdassign$, such that each mapping is equivalent, implying the distributions are equivalent.
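As a small illustration (the block and its probabilities are hypothetical and chosen only for exposition), consider a single block $\block_1$ with two tuples $\tup_{1,1},\tup_{1,2}$ and probabilities $\prob_{1,1},\prob_{1,2}$ satisfying $\prob_{1,1}+\prob_{1,2}\le 1$. Writing a world as the pair $(\randWorld_{1,1},\randWorld_{1,2})\in\{0,1\}^2$, the induced distribution $\pdassign$ under block disjointness would be
\begin{align*}
\probOf\pbox{(0,0)} &= 1-\prob_{1,1}-\prob_{1,2}, & \probOf\pbox{(1,0)} &= \prob_{1,1},\\
\probOf\pbox{(0,1)} &= \prob_{1,2}, & \probOf\pbox{(1,1)} &= 0.
\end{align*}
The three worlds with nonzero probability are exactly the possible worlds of the block, while $(1,1)$ is the world that cannot exist due to disjointness; the four values sum to $1$.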
%that $\forall i \in \abs{\block}, \forall j\neq k \in [\block_i] \suchthat \db\inparen{\tup_{i, j}} = 0 \vee \db\inparen{\tup_{i, k} = 0}$.In other words, each random variable corresponds to the event of a single tuple's presence.