typos
parent
473b5885a4
commit
c7cae52c11
|
@ -5,7 +5,7 @@
|
|||
|
||||
In \Cref{sec:hard}, we showed that the answer to \Cref{prob:intro-stmt} is no.
|
||||
With this result, we now design an approximation algorithm for our problem that runs in $\bigO{\abs{\circuit}}$.\footnote{For a very broad class of circuits: please see the discussion after \Cref{lem:val-ub} for more.}
|
||||
The following approximation algorithm applies to \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis, as well as the PDBench benchmark~\cite{pdbench}. As before, all proofs and pseudocode can be found in \Cref{sec:proofs-approx-alg}.
|
||||
The following approximation algorithm applies to \bi, though our bounds are more meaningful for a non-trivial subclass of queries over \bis that contains all queries on \tis, as well as the queries of the PDBench benchmark~\cite{pdbench}. As before, all proofs and pseudocode can be found in \Cref{sec:proofs-approx-alg}.
|
||||
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
|
||||
|
||||
\subsection{Preliminaries and some more notation}
|
||||
|
@ -121,7 +121,7 @@ $\abs{\circuit}(1,\ldots, 1)\le 2^{2^k\cdot \size(\circuit)}.$
|
|||
Further, under either of the following conditions:
|
||||
\begin{enumerate}
|
||||
\item $\circuit$ is a tree,
|
||||
\item $\circuit$ encodes the run of the algorithm in~\cite{DBLP:conf/pods/KhamisNR16} on an FAQ\AH{AJAR citation.} query,
|
||||
\item $\circuit$ encodes the run of the algorithm in~\cite{DBLP:conf/pods/KhamisNR16} on a FAQ\AH{AJAR citation.} query,
|
||||
\end{enumerate}
|
||||
we have $\abs{\circuit}(1,\ldots, 1)\le \size(\circuit)^{O(k)}.$
|
||||
\end{Lemma}
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
%!TEX root=./main.tex
|
||||
|
||||
\subsection{More on Circuits and Moments}\label{sec:gen}
|
||||
\subsection{Relationship to Deterministic Query Runtimes}\label{sec:gen}
|
||||
We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries under reasonable assumptions.
|
||||
Lastly, we generalize our result for expectation to other moments.
|
||||
|
||||
|
@ -17,7 +17,7 @@ Lastly, we generalize our result for expectation to other moments.
|
|||
|
||||
\mypar{The cost model}
|
||||
%\label{sec:cost-model}
|
||||
So far our analysis of $\approxq$ has been in terms of the size of the lineage circuits.
|
||||
So far our analysis of \Cref{prob:intro-stmt} has been in terms of the size of the lineage circuits.
|
||||
We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\query$, we can construct a compressed circuit for $\poly$ and \bi $\pdb$ of size and runtime linear in that of a general class of query processing algorithms for the same query $\query$ on $\pdb$'s \dbbaseName $\dbbase$.
|
||||
% Note that by definition, there exists a linear relationship between input sizes $|\pxdb|$ and $|\dbbase|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db})$).
|
||||
% \footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
|
||||
|
@ -54,7 +54,7 @@ It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey
|
|||
%
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%
|
||||
We are now ready to formally state our claim from \Cref{sec:intro}:
|
||||
We are now ready to formally state our claim with respect to \Cref{prob:intro-stmt}:
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Corollary}\label{cor:cost-model}
|
||||
Given an $\raPlus$ query $\query$ over a \ti $\pdb$ with \dbbaseName $\dbbase$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pdb)$ with probability at least $1-\delta$ in time
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
%root:main.tex
|
||||
%!TEX root=./main.tex
|
||||
\section{Hardness of exact computation}
|
||||
\section{Hardness of Exact Computation}
|
||||
\label{sec:hard}
|
||||
|
||||
In this section, we will prove the hardness results claimed in Table~\ref{tab:lbs} for a specific family of hard instances $(\query,\pdb)$ for \Cref{prob:bag-pdb-poly-expected} where $\pdb$ is a \abbrTIDB.
|
||||
|
@ -11,7 +11,7 @@ In this section, we will prove the hardness results claimed in Table~\ref{tab:lb
|
|||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Preliminaries}
|
||||
Our hardness results are based on (exactly) counting the number of (not necessarily induced) subgraphs in $G$ isomorphic to $H$. Let $\numocc{G}{H}$ denote this quantity. We can think of $H$ as being of constant size and $G$ as growing. %In query processing, $H$ can be viewed as the query while $G$ as the database instance.
|
||||
In particular, we will consider the problems of computing the following counts (given $G$ in its adjacency list representation): $\numocc{G}{\tri}$ (the number of triangles), $\numocc{G}{\threedis}$ (the number of $3$-matchings), and the latter's generalization $\numocc{G}{\kmatch}$ (the number of $k$-matchings). We use $\kmatchtime$ to denote the optimal runtime of computing $\numocc{G}{\kmatch}$. Our hardness results in \Cref{sec:multiple-p} is based on the following hardness results/conjectures:
|
||||
In particular, we will consider the problems of computing the following counts (given $G$ in its adjacency list representation): $\numocc{G}{\tri}$ (the number of triangles), $\numocc{G}{\threedis}$ (the number of $3$-matchings), and the latter's generalization $\numocc{G}{\kmatch}$ (the number of $k$-matchings). We use $\kmatchtime$ to denote the optimal runtime of computing $\numocc{G}{\kmatch}$. Our hardness results in \Cref{sec:multiple-p} are based on the following hardness results/conjectures:
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Theorem}[\cite{k-match}]
|
||||
|
@ -21,7 +21,7 @@ Given positive integer $k$ and undirected graph $G=(\vset,\edgeSet)$ with no sel
|
|||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%The above result means that we cannot hope to count the number of $k$-matchings in $G=(\vset,\edgeSet)$ in time $f(k)\cdot |\vset|^{c}$ for any function $f$ and constant $c$ independent of $k$.
|
||||
\begin{hypo}\label{conj:known-algo-kmatch}
|
||||
There exists an absolute constant $c_0>0$ such that for every $G=(\vset,\edgeSet)$, we have $\kmatchtime \ge \Omega{|E|^{c_0\cdot k}}$.
|
||||
There exists an absolute constant $c_0>0$ such that for every $G=(\vset,\edgeSet)$, we have $\kmatchtime \ge \Omega\inparen{|E|^{c_0\cdot k}}$.
|
||||
\end{hypo}
|
||||
We note that the above conjecture is somewhat non-standard. In particular, the state-of-the-art algorithm to compute $\numocc{G}{\kmatch}$ takes time $\Omega\inparen{|V|^{k/2}}$ (i.e., if this is the best algorithm then $c_0=\frac 14$)~\cite{k-match}. What the above conjecture says is that one can only hope for a polynomial improvement over the state-of-the-art algorithm to compute $\numocc{G}{\kmatch}$.
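One way to see where $c_0=\frac 14$ comes from is sketched below. The sketch assumes that $G$ has no parallel edges (an assumption made here for illustration), so that $|E|\le |V|^2$:
\begin{equation*}
|V|^{k/2} = \inparen{|V|^{2}}^{k/4} \ge |E|^{k/4} = |E|^{\frac 14\cdot k}.
\end{equation*}
Under this reading, a runtime of $|V|^{k/2}$ is consistent with the bound in \Cref{conj:known-algo-kmatch} for $c_0=\frac 14$.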
|
||||
%
|
||||
|
@ -59,7 +59,7 @@ WHERE a.city = r.city1 AND b.city = r.city2
|
|||
\end{lstlisting}
|
||||
as $R_i$ for each $i \in [k]$. The query $\query^k$ then becomes
|
||||
\begin{lstlisting}
|
||||
SELECT 1 FROM $R_1$ JOIN $R_2$ JOIN$\cdots$JOIN $R_k$
|
||||
SELECT COUNT(*) FROM $R_1$ JOIN $R_2$ JOIN$\cdots$JOIN $R_k$
|
||||
\end{lstlisting}
|
||||
%RA format for the same query
|
||||
%\begin{align*}
|
||||
|
@ -100,7 +100,6 @@ $\qruntime{\query^k, \dbbase}$ is $O(\kElem\numedge)$.
|
|||
%Unless otherwise noted, all proofs for this section are in \Cref{app:single-mult-p}.
|
||||
We are now ready to present our main hardness result.
|
||||
%
|
||||
e
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Theorem}\label{thm:mult-p-hard-result}
|
||||
Let $\prob_0,\ldots,\prob_{2k}$ be $2k + 1$ distinct values in $(0, 1]$. Then computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ (over all $i\in [2k+1]$ for arbitrary $G=(\vset,\edgeSet)$
|
||||
|
|
10
prob-def.tex
|
@ -1,7 +1,7 @@
|
|||
%root: main.tex
|
||||
%!TEX root=./main.tex
|
||||
|
||||
\subsection{Problem Definition}\label{sec:expression-trees}
|
||||
\subsection{Formalizing \Cref{prob:intro-stmt}}\label{sec:expression-trees}
|
||||
|
||||
%We first formally define circuits, an encoding of polynomials that we use throughout the paper.
|
||||
%
|
||||
|
@ -11,9 +11,9 @@ We represent lineage polynomials via {\em arithmetic circuits}~\cite{arith-compl
|
|||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Definition}[Circuit]\label{def:circuit}
|
||||
A circuit $\circuit$ is a Directed Acyclic Graph (DAG) whose source gates (in degree of $0$) consist of elements in either $\domN$ or $\vct{X}$. For each output tuple there exists one source gate. The internal gates have binary input and are either sum ($\circplus$) or product ($\circmult$) gates.
|
||||
A circuit $\circuit$ is a Directed Acyclic Graph (DAG) whose source gates (in-degree of $0$) consist of elements in either $\domN$ or $\vct{X}$. For each output tuple there exists one sink gate. The internal gates have binary input and are either sum ($\circplus$) or product ($\circmult$) gates.
|
||||
%
|
||||
Each gate has the following members: \type, \vpartial, \vari{input}, \degval, \vari{Lweight}, and \vari{Rweight}, where \type is the value type $\{\circplus, \circmult, \var, \tnum\}$ and \vari{input} the list of inputs. Source gates have an additional member \val storing the value. $\circuit_\linput$ ($\circuit_\rinput$) denotes the left (right) input of \circuit.
|
||||
Each gate has the following members: \type, \vpartial, \vari{input}, \degval, \vari{Lweight}, and \vari{Rweight}, where \type is the value type $\{\circplus, \circmult, \var, \tnum\}$ and \vari{input} the list of inputs. Source gates have an extra member \val storing the value. $\circuit_\linput$ ($\circuit_\rinput$) denotes the left (right) input of \circuit.
|
||||
\end{Definition}
|
||||
When the underlying DAG is a tree (with edges pointing towards the root), the structure is an expression tree \etree. In such a case, the root of \etree is analogous to the sink of \circuit. The fields \vari{partial}, \degval, \vari{Lweight}, and \vari{Rweight} are used in the proofs of \Cref{sec:proofs-approx-alg}.
|
||||
|
||||
|
@ -92,7 +92,7 @@ Note that each circuit \circuit encodes a tree, with edges pointing towards the
|
|||
%\end{figure}
|
||||
We next formally define the relationship of circuits with polynomials. While the definition assumes one sink for notational convenience, it easily generalizes to the multiple sinks case.
|
||||
\begin{Definition}[$\polyf(\cdot)$]\label{def:poly-func}
|
||||
Denote $\polyf(\circuit)$ to be the function from circuit $\circuit$ to its corresponding polynomial (in \abbrSMB). $\polyf(\cdot)$ is recursively defined on $\circuit$ as follows, with addition and multiplication following the standard interpretation for polynomials:
|
||||
Denote $\polyf(\circuit)$ to be the function from the sink of circuit $\circuit$ to its corresponding polynomial (in \abbrSMB). $\polyf(\cdot)$ is recursively defined on $\circuit$ as follows, with addition and multiplication following the standard interpretation for polynomials:
|
||||
\begin{equation*}
|
||||
\polyf(\circuit) = \begin{cases}
|
||||
\polyf(\circuit_\lchild) + \polyf(\circuit_\rchild) &\text{ if \circuit.\type } = \circplus\\
|
||||
|
@ -102,7 +102,7 @@ Denote $\polyf(\circuit)$ to be the function from circuit $\circuit$ to its corr
|
|||
\end{equation*}
|
||||
\end{Definition}
|
||||
|
||||
$\circuit$ need not encode $\poly\inparen{\vct{X}}$ in the same, default \abbrSMB representation. For instance, $\circuit$ could encode the factorized representation $(X + 2Y)(2X - Y)$ of $\poly\inparen{\vct{X}} = 2X^2+3XY-2Y^2$, as shown in \Cref{fig:circuit}, while $\polyf(\circuit) = \poly\inparen{\vct{X}}$, the equivalent \abbrSMB representation.
|
||||
$\circuit$ need not encode $\poly\inparen{\vct{X}}$ in the same, default \abbrSMB representation. For instance, $\circuit$ could encode the factorized representation $(X + 2Y)(2X - Y)$ of $\poly\inparen{\vct{X}} = 2X^2+3XY-2Y^2$, as shown in \Cref{fig:circuit}, while $\polyf(\circuit) = \poly\inparen{\vct{X}}$ is always the equivalent \abbrSMB representation.
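As a quick check of this example, expanding the factorized form term by term recovers the \abbrSMB form:
\begin{align*}
(X + 2Y)(2X - Y) &= 2X^2 - XY + 4XY - 2Y^2\\
&= 2X^2 + 3XY - 2Y^2.
\end{align*}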
|
||||
|
||||
\begin{Definition}[Circuit Set]\label{def:circuit-set}
|
||||
$\circuitset{\polyX}$ is the set of all possible circuits $\circuit$ such that $\polyf(\circuit) = \polyX$.
|
||||
|
|
|
@ -55,7 +55,7 @@ A \bi $\pdb$ is a \abbrPDB with the constraint that
|
|||
the tuples in $\dbbase$ can be partitioned into a set of $\ell$ blocks such that tuples $\tup_{i, j}, \tup_{k, j'}$ from separate blocks $(i\neq k, j \in [\abs{i}], j' \in [\abs{k}])$ are independent of each other while tuples $\tup_{i, j}, \tup_{i, k}$ from the same block are disjoint events.\footnote{
|
||||
Although only a single independent, $[\abs{\block_i}+1]$-valued variable is customarily used per block~\cite{DBLP:series/synthesis/2011Suciu}, we decompose it into $\abs{\block_i}$ correlated $\{0,1\}$-valued variables per block that can be used directly in polynomials (without an indicator function). For $t_{i, j} \in b_i$, the event $(\randWorld_{i,j} = 1)$ corresponds to the event $(\randWorld_i = j)$ in the customary annotation scheme.
|
||||
}
|
||||
Each tuple $\tup_{i, j}$ is annotated with a random variable $\randWorld_{i, j} \in \{0, 1\}$ denoting its presence in a possible world $\db$. The probability distribution $\pd$ over $\dbbase$ is the one induced from individual tuple probabilities $\prob_{i, j}\in \vct{\prob}=\inparen{\prob_{1, 1},\ldots,\prob_{\abs{\block},\ldots,\abs{\block_{\abs{\block}}}}}$ and the conditions on the blocks. A \abbrTIDB is a \abbrBIDB with the added requirement that each block is size $1$.
|
||||
Each tuple $\tup_{i, j}$ is annotated with a random variable $\randWorld_{i, j} \in \{0, 1\}$ denoting its presence in a possible world $\db$. The probability distribution $\pd$ over $\dbbase$ is the one induced from individual tuple probabilities $\prob_{i, j}\in \vct{\prob}=\inparen{\prob_{1, 1},\ldots,\prob_{\abs{\block},\ldots,\abs{\block_{\abs{\block}}}}}$ and the conditions on the blocks. A \abbrTIDB is a \abbrBIDB where each block has size exactly $1$.
|
||||
|
||||
Instead of looking only at the possible worlds of $\pdb$, one can consider all worlds, including those that cannot exist due to disjointness. The all worlds set can be modeled by $\vct{\randWorld}\in \{0, 1\}^\numvar$,\footnote{Here and later on in the paper, especially in \Cref{sec:algo}, we will overload notation and rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$.} such that $\randWorld_k \in \vct{\randWorld}$ represents the presence of $\tup_{i, j}$ (where $k = \sum_{\ell = 1}^{i - 1} \abs{b_\ell} + j$). We denote a probability distribution over all $\vct{\randWorld} \in \{0, 1\}^\numvar$ as $\pdassign$. When $\pdassign$ is the one induced from each $\prob_{i, j}$ while assigning $\probOf\pbox{\vct{\randWorld}} = 0$ for any $\vct{\randWorld}$ with $\randWorld_{i, j} = \randWorld_{i, k} = 1$ for any block $i$ and $j\neq k$, we end up with a bijective mapping from $\pd$ to $\pdassign$, such that each mapping is equivalent, implying the distributions are equivalent.
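For illustration, consider a hypothetical \abbrBIDB (not taken from the paper) with two blocks where $\abs{b_1} = 2$ and $\abs{b_2} = 3$, giving five variables in total. The tuple $\tup_{2, 1}$ is then mapped to the index
\begin{equation*}
k = \sum_{\ell' = 1}^{1} \abs{b_{\ell'}} + 1 = 2 + 1 = 3,
\end{equation*}
that is, to $\randWorld_3$. Since $\tup_{1, 1}$ and $\tup_{1, 2}$ belong to the same block, any $\vct{\randWorld}$ with $\randWorld_1 = \randWorld_2 = 1$ would be assigned probability $0$ under $\pdassign$ in this sketch.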
|
||||
%that $\forall i \in \abs{\block}, \forall j\neq k \in [\block_i] \suchthat \db\inparen{\tup_{i, j}} = 0 \vee \db\inparen{\tup_{i, k} = 0}$.In other words, each random variable corresponds to the event of a single tuple's presence.
|
||||
|
|