updates
parent
40cac20325
commit
c9cde8f57a
|
@ -11,14 +11,17 @@ The following approximation algorithm applies to \bi, though our bounds are more
|
|||
\subsection{Preliminaries and some more notation}
|
||||
|
||||
We now introduce useful definitions and notation related to circuits and polynomials. All proofs and missing pseudocode can be found in \Cref{sec:proofs-approx-alg}.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Definition}[Variables in a monomial]\label{def:vars}
|
||||
Given a monomial $v$, we use $\var(v)$ to denote the set of variables in $v$.
|
||||
\end{Definition}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\noindent For example, the monomial $XY$ has $\var(XY)=\inset{X,Y}$.
|
||||
|
||||
|
||||
\begin{Definition}[$\expansion{\circuit}$]\label{def:expand-circuit}
|
||||
The logical view of $\expansion{\circuit}$ is a list of tuples $(\monom, \coef)$, where $\monom$ is a set of variables and $\coef$ is in $\reals$.
|
||||
For a circuit $\circuit$, we define $\expansion{\circuit}$ as a list of tuples $(\monom, \coef)$, where $\monom$ is a set of variables and $\coef \in \reals$.
|
||||
$\expansion{\circuit}$ has the following recursive definition ($\circ$ is list concatenation).
|
||||
|
||||
$\expansion{\circuit} =
|
||||
|
@ -108,17 +111,26 @@ we have $\abs{\circuit}(1,\ldots, 1)\le \size(\circuit)^{O(k)}.$
|
|||
|
||||
Note that the above implies that if $\prob_0>0$ and $\gamma<1$ are absolute constants, as assumed in \Cref{cor:approx-algo-const-p}, then the runtime there simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$ and to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$ when $\circuit$ satisfies the specific conditions in \Cref{lem:val-ub}. In \Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios, including query evaluation under \raPlus or FAQ.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Approximating $\rpoly$}
|
||||
The algorithm (\approxq detailed in \Cref{alg:mon-sam}) to prove \Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
|
||||
We prove \Cref{lem:approx-alg} by developing an approximation algorithm (\approxq, detailed in \Cref{alg:mon-sam}) with the desired runtime. This algorithm is based on the following observation.
|
||||
% The algorithm (\approxq detailed in \Cref{alg:mon-sam}) to prove \Cref{lem:approx-alg} follows from the following observation.
|
||||
Given a query polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over $\bi$, we have: % can exactly represent $\rpoly(\vct{X})$ as follows:
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{equation}
|
||||
\label{eq:tilde-Q-bi}
|
||||
\rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(\monom,\coef)\in \expansion{\circuit}} \hspace*{-2mm} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \coef\cdot\hspace*{-2mm}\prod_{X_i\in \var\inparen{\monom}}\hspace*{-2mm} X_i
|
||||
\end{equation}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
\input{app_approx-alg-pseudo-code}
|
||||
|
||||
Given the above, the algorithm is a sampling-based algorithm for the above sum: we sample (via \sampmon) $(\monom,\coef)\in \expansion{\circuit}$ with probability proportional %\footnote{We could have also uniformly sampled from $\expansion{\circuit}$ but this gives better parameters.}
|
||||
to $\abs{\coef}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{\monom}} p_i$. Taking $\numsamp$ samples and computing the average of $Y$ gives us our final estimate. \onepass is used to compute the sampling probabilities needed in \sampmon (details are in \Cref{sec:proofs-approx-alg}).
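
\noindent As a sanity check on why averaging works, the following is an informal sketch (ours, not the formal analysis in \Cref{sec:proofs-approx-alg}). The symbol $Z=\sum_{(\monom,\coef)\in\expansion{\circuit}}\abs{\coef}$ is notation introduced only for this illustration and denotes the normalizing constant of the sampling distribution; we also assume here that every coefficient $\coef$ is non-negative, so that no sign correction is needed. Then, by \Cref{eq:tilde-Q-bi},
\begin{align*}
\operatorname{E}[Y] &= \sum_{(\monom,\coef)\in \expansion{\circuit}} \frac{\abs{\coef}}{Z}\cdot\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot\prod_{X_i\in \var\inparen{\monom}} p_i\\
&= \frac{1}{Z}\cdot\rpoly\inparen{p_1,\dots,p_\numvar},
\end{align*}
so rescaling the empirical mean of $Y$ by $Z$ gives an unbiased estimate of $\rpoly\inparen{p_1,\dots,p_\numvar}$. With coefficients of mixed sign, the same calculation goes through if each sample is additionally weighted by the sign of $\coef$.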
|
||||
%\approxq (\Cref{alg:mon-sam}) modifies \circuit with a call to \onepass. It then samples from $\circuit_{\vari{mod}}\numsamp$ times and uses that information to approximate $\rpoly$.
|
||||
|
||||
\input{app_approx-alg-pseudo-code}
|
||||
|
||||
|
||||
|
||||
%\subsubsection{Correctness}
|
||||
|
|
|
@ -45,16 +45,16 @@ We adopt a minimalistic compute-bound model of query evaluation drawn from the w
|
|||
Under this model a query $Q$ evaluated over database $D$ has runtime $O(\qruntime{Q,D})$.
|
||||
We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.
|
||||
|
||||
It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db}\AR{See my comment on element on whether we should include this ref or not.} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as select-union-project-join queries (though these queries can be data dependent).\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} Further, it can be verified that the above cost model on the corresponding SPJU join queries correctly captures their runtime.
|
||||
|
||||
It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db}\AR{See my comment on element on whether we should include this ref or not.} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as select-union-project-join queries (though the size of these queries is data dependent).\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} It can be verified that the above cost model on the corresponding SPJU join queries correctly captures their runtime.
|
||||
%
|
||||
%We now make a simple observation on the above cost model:
|
||||
%\begin{proposition}
|
||||
%\label{prop:queries-need-to-output-tuples}
|
||||
%The runtime $\qruntime{Q}$ of any query $Q$ is at least $|Q|$
|
||||
%\end{proposition}
|
||||
|
||||
%
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
%
|
||||
We are now ready to formally state our claim from \Cref{sec:intro}:
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Corollary}
|
||||
|
@ -75,10 +75,10 @@ This follows from \Cref{lem:circuits-model-runtime} (\Cref{sec:circuit-runtime})
|
|||
%\label{sec:momemts}
|
||||
%
|
||||
We make a simple observation to conclude the presentation of our results.
|
||||
So far we have only focused on the expectation of $\poly$.
|
||||
In addition, we could e.g. prove bounds of probability of the multiplicity being at least $1$.
|
||||
So far we have only focused on the expectation of $\poly$.
|
||||
In addition, we could, e.g., prove bounds on the probability of a tuple's multiplicity being at least $1$.
|
||||
Progress can be made on this as follows:
|
||||
For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to e.g. use Chebyshev inequality or other high moment based probability bounds on the events we might be interested in.
|
||||
For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to use, e.g., the Chebyshev inequality or other higher-moment-based probability bounds on the events we might be interested in.
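
\noindent For instance (a standard fact, stated here only as an illustration), once the first two moments are available, Chebyshev's inequality gives, for any $a>0$,
\[
\Pr\inparen{\abs{\poly-\operatorname{E}[\poly]}\geq a}\;\leq\;\frac{\operatorname{E}[\poly^2]-\operatorname{E}[\poly]^2}{a^2},
\]
where $\poly$ is used as shorthand for the random multiplicity and $\operatorname{E}[\cdot]$ denotes expectation over possible worlds (both are notational assumptions made for this illustration).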
|
||||
We leave further investigations for future work.
|
||||
|
||||
%%% Local Variables:
|
||||
|
|
|
@ -10,7 +10,7 @@ in SOP form, the problem is \sharpwonehard for factorized polynomials (proven th
|
|||
%We have proven this claim through a reduction from the problem of counting k-matchings.
|
||||
We prove that it is possible to approximate the expectation of a lineage polynomial in linear time
|
||||
% When only considering polynomials for result tuples of
|
||||
UCQs over TIDBs and BIDBs (under the assumption that there are few cancellations).
|
||||
for UCQs over TIDBs and BIDBs (assuming that there are few cancellations).
|
||||
Interesting directions for future work include development of a dichotomy for bag PDBs and approximations for more general data models. % beyond what we consider in this paper.
|
||||
% Furthermore, it would be interesting to see whether our approximation algorithm can be extended to support queries with negations, perhaps using circuits with monus as a representation system.
|
||||
|
||||
|
|
|
@ -51,7 +51,11 @@ even hold for the expression trees. %this polynomial can be encoded in an expres
|
|||
|
||||
|
||||
\noindent Returning to \Cref{fig:ex-shipping-simp}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ generalizes our running example query:
|
||||
\resizebox{1\linewidth}{!}{
|
||||
\begin{minipage}{1.05\linewidth}
|
||||
\[\poly^k_G\dlImp OnTime(C_1),Route(C_1, C_1'),OnTime(C_1'),\dots,OnTime(C_\kElem),Route(C_\kElem,C_\kElem'),OnTime(C_\kElem')\]
|
||||
\end{minipage}
|
||||
}
|
||||
where, adapting the PDB instance in \Cref{fig:ex-shipping-simp}, relation $OnTime$ has $n$ tuples, one for each vertex in $V=[n]$ and each with probability $\prob$, and $Route(\text{City}_1, \text{City}_2)$ has one tuple for each edge in $E$ (each with probability $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $Route$ as well, but since these tuples are always present with probability $1$, we drop them. Our argument also works when all the tuples in $Route$ are present with probability $\prob$, but to simplify notation we assign probability $1$ to edges.}
|
||||
Note that this implies that our hard query polynomial can be represented as an expression tree produced by a project-join query with the same probability value $\prob_i$ for each input tuple. %; our hardness result transfers here as well.
|
||||
% OK: The following (commented-out) sentence feels a bit misplaced here.
|
||||
|
@ -70,9 +74,14 @@ Computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ for arbitrary $G$ and any $(2
|
|||
\end{Theorem}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%
|
||||
We will prove the above result by reducing from the problem of computing the number of $k$-matchings in $G$. Given the current best-known algorithm for this counting problem, our results imply that unless the state-of-the-art $k$-matching algorithms are improved, we cannot hope to solve our problem in time better than $\Omega_k\inparen{m^{k/2}}$ where $m=\abs{E}$, which is only quadratically faster than expanding $\poly_{G}^\kElem(\vct{X})$ into its \abbrSMB form and then using \Cref{cor:expct-sop}. By contrast the approximation algorithm we present in \Cref{sec:algo} has runtime $O_k\inparen{m}$ for this query. % (since it runs in linear-time on all lineage polynomials).
|
||||
We will prove the above result by reducing from the problem of computing the number of $k$-matchings in $G$. Given the current best-known algorithm for this counting problem, our results imply that unless the state-of-the-art $k$-matching algorithms are improved, we cannot hope to solve our problem in time better than $\Omega_k\inparen{m^{k/2}}$ where $m=\abs{E}$, which is only quadratically faster than expanding $\poly_{G}^\kElem(\vct{X})$ into its \abbrSMB form and then using \Cref{cor:expct-sop}. The approximation algorithm we present in \Cref{sec:algo} has runtime $O_k\inparen{m}$ for this query. % (since it runs in linear-time on all lineage polynomials).
|
||||
|
||||
\noindent The following lemma reduces the problem of counting $\kElem$-matchings in a graph to our problem (and proves \Cref{thm:mult-p-hard-result}):
|
||||
\begin{Lemma}\label{lem:qEk-multi-p}
|
||||
Let $\prob_0,\ldots, \prob_{2\kElem}$ be distinct values in $(0, 1]$. Then given the values $\rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$ for $0\leq i\leq 2\kElem$, the number of $\kElem$-matchings in $G$ can be computed in $O\inparen{\kElem^3}$ time.
|
||||
\end{Lemma}
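
\noindent To see why $2\kElem+1$ evaluations suffice, the following is an informal sketch (ours, not the formal proof; it assumes that each edge of $G$ is stored once in $Route$). Every monomial of $\poly_{G}^\kElem(\vct{X})$ is a product of $\kElem$ edge terms and hence contains at most $2\kElem$ distinct variables, so $\rpoly_{G}^\kElem(\prob,\ldots,\prob)$ is a univariate polynomial in $\prob$ of degree at most $2\kElem$:
\[
\rpoly_{G}^\kElem(\prob,\ldots,\prob)=\sum_{j=0}^{2\kElem} c_j\,\prob^{j}.
\]
The $2\kElem+1$ given values therefore define the linear system
\[
\sum_{j=0}^{2\kElem} c_j\,\prob_i^{j}=\rpoly_{G}^\kElem(\prob_i,\ldots,\prob_i),\qquad 0\leq i\leq 2\kElem,
\]
whose Vandermonde coefficient matrix is invertible because the $\prob_i$ are distinct, so Gaussian elimination recovers all $c_j$ in $O\inparen{\kElem^3}$ time. A monomial attains exactly $2\kElem$ distinct variables only when its $\kElem$ edges are pairwise disjoint, and each $\kElem$-matching arises from $\kElem!$ orderings of its edges, so under the stated assumption the number of $\kElem$-matchings is $c_{2\kElem}/\kElem!$.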
|
||||
|
||||
%%% Local Variables:
|
||||
%%% mode: latex
|
||||
%%% TeX-master: "main"
|
||||
%%% End:
|
||||
|
|
|
@ -50,11 +50,12 @@ An \textit{incomplete database} $\idb$ is a set of deterministic databases $\db$
|
|||
Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is an incomplete database and $\pd$ is a probability distribution over $\idb$. Queries over probabilistic databases are evaluated using the so-called possible world semantics. Under the possible world semantics, the result of a query $\query$ over an incomplete database $\idb$ is the set of query answers produced by evaluating $\query$ over each possible world: $\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}$.
|
||||
|
||||
For a probabilistic database $\pdb = (\idb, \pd)$, the result of a query is the pair $(\query(\idb), \pd')$ where $\pd'$ is a probability distribution over $\query(\idb)$ that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer:
|
||||
%
|
||||
\[\forall \db \in \query(\idb): \pd'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \pd(\db') \]
|
||||
|
||||
Let $\semNX$ denote the set of polynomials over variables $\vct{X}=(X_1,\dots,X_n)$ with natural number coefficients and exponents.
|
||||
We model incomplete relations using Green et al.'s $\semNX$-databases~\cite{DBLP:conf/pods/GreenKT07}, discussed in detail in \Cref{subsec:supp-mat-krelations} and summarized here.
|
||||
In an $\semNX$-database, relations are defined as functions from tuples to elements of $\semNX$, typically called annotations.
|
||||
We model incomplete relations using Green et al.'s $\semNX$-databases~\cite{DBLP:conf/pods/GreenKT07}, discussed in detail in \Cref{subsec:supp-mat-krelations}. % and summarized here.
|
||||
$\semNX$-relations are functions from tuples to elements of $\semNX$, typically called annotations.
|
||||
We write $R(t)$ to denote the polynomial annotating tuple $t$ in relation $R$. Note that $R(t)$ is the lineage polynomial for $t$.
|
||||
Each possible world is defined by an assignment of $n$ binary values $\vct{W} \in \{0, 1\}^{\abs{\vct{X}}}$ to $\vct{X}$.
|
||||
The multiplicity of $t \in R$ in this possible world, denoted $R(t)(\vct{W})$, is obtained by evaluating the polynomial annotating $t$ on $\vct{W}$.
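
\noindent As a small illustration (with a hypothetical annotation that is not taken from the running example), suppose $R(t)=X_1X_2+X_3$ and $\vct{W}=(1,0,1)$. Then $R(t)(\vct{W})=1\cdot 0+1=1$, i.e., $t$ has multiplicity $1$ in this possible world.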
|
||||
|
@ -77,7 +78,7 @@ $\semNX$-relations are closed under $\raPlus$ (\Cref{fig:nxDBSemantics}).
|
|||
&~~~\cdot\evald{\rel_2}{\db}(\project_{\sch(\rel_2)}(\tup))
|
||||
\end{aligned}\\
|
||||
& & \evald{R}{\db}(\tup) =& \rel(\tup)
|
||||
\end{align*}
|
||||
\end{align*}\\[-10mm]
|
||||
\caption{Evaluation semantics $\evald{\cdot}{\db}$ for $\semNX$-DBs~\cite{DBLP:conf/pods/GreenKT07}.}
|
||||
\label{fig:nxDBSemantics}
|
||||
\end{figure}
|
||||
|
|