Trimming about all that I can trim through rephrasing+spacing cheats

master
Oliver Kennedy 2020-12-19 14:02:12 -05:00
parent 6662a05813
commit 5321761acf
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
7 changed files with 54 additions and 22 deletions

View File

@ -87,7 +87,7 @@ Consider the factorized representation $(X+ 2Y)(2X - Y)$ of the polynomial in~\C
\end{Example}
\begin{figure}[h!]
\begin{figure}[t]
\resizebox{0.65\columnwidth}{!}{
\begin{tikzpicture}[thick, level distance=0.9cm,level 1/.style={sibling distance=3.55cm}, level 2/.style={sibling distance=1.8cm}, level 3/.style={sibling distance=0.8cm}]% level/.style={sibling distance=6cm/(#1 * 1.5)}]

View File

@ -60,16 +60,16 @@ It can be verified that the worst-case join algorithms~\cite{skew,ngo-survey}, a
\subsubsection{Lineage circuit for query plans}
\label{sec:circuits-formal}
We now define a lineage circuit more formally and also show how to construct a lineage circuit given a SPJU query $Q$.
We now formalize lineage circuits and the construction of lineage circuits for SPJU queries.
As mentioned earlier, we represent lineage polynomials with arithmetic circuits over $\mathbb N$ with $+$, $\times$.
A circuit for query $Q$ is a directed acyclic graph $\tuple{V_Q, E_Q, \phi_Q, \ell_Q}$ with vertices $V_Q$ and directed edges $E_Q \subset V_Q^2$.
A sink function $\phi_Q : \udom^n \rightarrow V_Q$ is a partial function that maps the tuples of the $n$-ary relation defined by $Q$ to vertices in the graph.
A sink function $\phi_Q : \udom^n \rightarrow V_Q$ is a partial function that maps the tuples of the $n$-ary relation defined by $Q$ to vertices.
We require that $\phi_Q$'s range be limited to sink vertices (i.e., vertices with out-degree 0).
%We call a sink vertex not in the range of $\phi_R$ a \emph{dead sink}.
A function $\ell_Q : V_Q \rightarrow \{\;+,\times\;\}\cup \mathbb N \cup \vct X$ assigns a label to each node: Source nodes (i.e., vertices with in-degree 0) are labeled with constants or variables (i.e., $\mathbb N \cup \vct X$), while the remaining nodes are labeled with the symbol $+$ or $\times$.
We require that vertices have an in-degree of at most two.
%
For the specifics of how lineage circuits are translated into the polynomials they represent, see~\Cref{app:subsec-rep-poly-lin-circ}.
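For intuition, here is a small hand-built instance of this definition (the vertex names and the result tuple $t$ are ours, chosen purely for illustration): a circuit whose single sink computes $X \cdot (Y + Z)$:
\[
V_Q = \{v_1,\dots,v_5\},\qquad
E_Q = \{(v_2,v_4),\,(v_3,v_4),\,(v_1,v_5),\,(v_4,v_5)\},\qquad
\phi_Q(t) = v_5,
\]
\[
\ell_Q(v_1) = X,\quad \ell_Q(v_2) = Y,\quad \ell_Q(v_3) = Z,\quad \ell_Q(v_4) = +,\quad \ell_Q(v_5) = \times.
\]
Here $v_1,v_2,v_3$ are sources labeled with variables, $v_5$ is the unique sink in the range of $\phi_Q$, and every vertex has in-degree at most two, as required.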
@ -82,7 +82,7 @@ We now connect the size of a lineage circuit (where the size of a lineage circui
\label{lem:circuits-model-runtime}
For any specific database instance, the lineage of the result of a query plan $Q$ is no larger (asymptotically) than the runtime of the plan. That is, for any query plan $Q$ we have $|V_Q| \leq (k-1)\qruntime{Q}$, where $k$ is the degree of the query polynomial corresponding to $Q$.
\end{lemma}
Proof is in~\Cref{app:subsec-lem-lin-vs-qplan}.
\noindent The proof appears in~\Cref{app:subsec-lem-lin-vs-qplan}.
We now have all the pieces to argue the following, which formally states that the expected multiplicities of an SPJU query can be approximated in essentially the same runtime as deterministic query processing of the same query:
\begin{Corollary}
@ -95,4 +95,10 @@ This follows from~\Cref{lem:circuits-model-runtime} and (the lineage circuit cou
\subsection{Higher moments}
\label{sec:momemts}
We make a simple observation to conclude the presentation of our results. So far we have presented algorithms that when given $\poly$, we approximate its expectation. In addition, we would e.g. prove bounds of probability of the multiplicity being at least $1$. While we do not have a good approximation algorithm for this problem, we can make some progress as follows. We first note that for any positive integer $m$ we can compute the expectation $\poly^m$ (since this only changes the degree of the corresponding lineage polynomial by a factor of $m$). In other words, we can compute the $m$-th moment of the multiplicities as well. This allows us e.g. to use Chebyschev inequality or other high moment based probability bounds on the events we might be interested in. However, we leave the question of coming up with better approximation algorithms for proving probability bounds for future work.
We make a simple observation to conclude the presentation of our results.
So far we have presented algorithms that approximate the expectation of $\poly$.
In addition, we could, for example, prove bounds on the probability of the multiplicity being at least $1$.
While we do not have a good approximation algorithm for this problem, we can make some progress as follows:
Note that for any positive integer $m$ we can compute the expectation of $\poly^m$ (since this only changes the degree of the corresponding lineage polynomial by a factor of $m$).
In other words, we can also compute the $m$-th moment of the multiplicities, allowing us, for example, to apply Chebyshev's inequality or other higher-moment probability bounds to the events we might be interested in.
However, we leave the question of coming up with a wider range of approximation algorithms for future work.
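To make the Chebyshev route above concrete (writing $\vct{W}$ for the random assignment to the variables of $\poly$; this shorthand is ours): for any $a > 0$,
\[
\Pr\Bigl[\,\bigl|\poly(\vct{W}) - \expct\pbox{\poly(\vct{W})}\bigr| \geq a\,\Bigr] \;\leq\; \frac{\expct\pbox{\poly^2(\vct{W})} - \bigl(\expct\pbox{\poly(\vct{W})}\bigr)^2}{a^2},
\]
and both expectations on the right-hand side are instances of the problem discussed above (for $\poly$ and $\poly^2$, respectively).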

View File

@ -1,6 +1,14 @@
%!TEX root=./main.tex
\section{Conclusions and Future Work}\label{sec:concl-future-work}
We have studied the problem of calculating the expectation of polynomials over random integer variables. This problem has a practical application in probabilistic databases over multisets where it corresponds to calculating the expected multiplicity of a query result tuple using the tuple's provenance polynomial. This problem has been studied extensively for sets (lineage formulas), but the bag settings has not received much attention so far. While the expectation of a polynomial can be calculated in linear time in the size of polynomials that are in sum-of-products normal form, the problem is \sharpwonehard for factorized polynomials. We have proven this claim through a reduction from the problem of counting k-matchings. When only considering polynomials for result tuples of UCQs over TIDBs and BIDBs (under the assumption that there are $O(1)$ cancellations), we prove that it is possible to approximate the expectation of a polynomial in linear time. An interesting direction for future work would be development of a dichotomy for queries over bag PDBs. Furthermore, it would be interesting to see whether our approximation algorithm can be extended to support queries with negations, perhaps using circuits with monus as a representation system.
We have studied the problem of calculating the expectation of polynomials over random integer variables.
This problem has a practical application in probabilistic databases over multisets, where it corresponds to calculating the expected multiplicity of a query result tuple.
This problem has been studied extensively for sets (lineage formulas), but the bag setting has not received much attention so far.
While the expectation of a polynomial can be calculated in linear time in the size of polynomials that are in SOP form, the problem is \sharpwonehard for factorized polynomials.
We have proven this claim through a reduction from the problem of counting $k$-matchings.
When only considering polynomials for result tuples of UCQs over TIDBs and BIDBs (under the assumption that there are $O(1)$ cancellations), we prove that it is still possible to approximate the expectation of a polynomial in linear time.
An interesting direction for future work would be the development of a dichotomy for queries over bag PDBs.
Furthermore, it would be interesting to see whether our approximation algorithm can be extended to support queries with negations, perhaps using circuits with monus as a representation system.
\BG{I am not sure what interesting future work is here. Some wild guesses, if anybody agrees I'll try to flesh them out:
\textbullet{More queries: what happens with negation? Can circuits with monus be used?}

View File

@ -181,7 +181,7 @@ As a further interesting feature of this example, note that $\expct\pbox{W_i} =
\end{multline}
\begin{figure}[h!]
\resizebox{\columnwidth}{!}{
\resizebox{0.8\columnwidth}{!}{
\begin{tikzpicture}[thick, level distance=0.9cm,level 1/.style={sibling distance=4.55cm}, level 2/.style={sibling distance=1.5cm}, level 3/.style={sibling distance=0.7cm}]% level/.style={sibling distance=6cm/(#1 * 1.5)}]
\node[tree_node](root){$\boldsymbol{\times}$}
child{node[tree_node]{$\boldsymbol{+}$}
@ -216,6 +216,7 @@ As a further interesting feature of this example, note that $\expct\pbox{W_i} =
}
\caption{Expression tree for query $\poly^2$.}
\label{fig:intro-q2-etree}
\trimfigurespacing
\end{figure}
\subsection{Superlinearity of Bag PDBs}

View File

@ -41,12 +41,12 @@ Based on the so called {\em Triangle detection hypothesis} (cf.~\cite{triang-har
Both of our hardness results rely on a simple query polynomial encoding of the edges of a graph.
To prove our hardness result, consider a graph $G(V, E)$, where $|E| = m$, $|V| = \numvar$. Our query polynomial has a variable $X_i$ for every $i$ in $[\numvar]$.
Consider the polynomial
\[\poly_{G}(\vct{X}) = \sum\limits_{(i, j) \in E} X_i \cdot X_j.\]
\[\poly_{G}(\vct{X}) = \sum\limits_{(i, j) \in E} X_i \cdot X_j\]
The hard polynomial for our problem will be a suitable power $k\ge 3$ of the polynomial above, i.e.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}
For any graph $G=([n],E)$ and $\kElem\ge 1$, define
\[\poly_{G}^\kElem(X_1,\dots,X_n) = \left(\sum\limits_{(i, j) \in E} X_i \cdot X_j\right)^\kElem.\]
\[\poly_{G}^\kElem(X_1,\dots,X_n) = \left(\sum\limits_{(i, j) \in E} X_i \cdot X_j\right)^\kElem\]
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Our hardness results only need a \ti instance; we also consider the special case when all the tuple probabilities (the probabilities assigned to $X_i$ by $\vct{p}$) are the same value. Note that this polynomial can be encoded in an expression tree of size $\Theta(km)$.
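As a toy illustration of this encoding (our example; it plays no role in the reduction itself): for the triangle graph $G = ([3], \{(1,2),(2,3),(1,3)\})$ we get
\[
\poly_{G}(X_1,X_2,X_3) = X_1X_2 + X_2X_3 + X_1X_3,
\qquad
\poly_{G}^{\kElem}(X_1,X_2,X_3) = \bigl(X_1X_2 + X_2X_3 + X_1X_3\bigr)^{\kElem},
\]
and an expression tree of size $\Theta(\kElem m)$ is obtained by multiplying together $\kElem$ copies of the $\Theta(m)$-size tree that sums one product gate per edge.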

View File

@ -1,16 +1,30 @@
%!TEX root=./main.tex
\section{Related Work}\label{sec:related-work}
In addition to work on probabilistic databases, our work has connections to research on compact representations of polynomials and relies on past results in fine-grained complexity, which we review in \Cref{sec:compr-repr-polyn} and \Cref{sec:param-compl}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Probabilistic Databases}\label{sec:prob-datab}
%\subsection{Probabilistic Databases}\label{sec:prob-datab}
Probabilistic databases have been studied predominantly under set semantics.
A multitude of probabilistic data models have been proposed for encoding a probabilistic database more compactly than as its set of possible worlds. Tuple-independent databases~ consist of a classical database where each tuple associated with a probability and tuples are treated as independent probabilistic events. In spite of the inability to encode correlations, \tis have received much attention, because it was shown that any finite probabilistic database can be encode as a \ti and a set of constraints that ``condition'' the \ti~\cite{VS17}. Block-independent databases (\bis) generalize \tis by partitioning the input into blocks where tuples within each block as disjoint events and blocks are independent~\cite{RS07,BS06}. \emph{PC-tables}~\cite{GT06} pair a C-table~\cite{IL84a} with probability distribution for each of its variables. This is similar to the $\semNX$-PDBs we use here, except that we do not allow for variables as attribute values and instead of local conditions which are propositional formulas which may contain comparisons, we associate tuples with polynomials $\semNX$.
Probabilistic Databases (PDBs) have been studied predominantly under set semantics.
A multitude of data models have been proposed for encoding a PDB more compactly than as its set of possible worlds.
Tuple-independent databases (\tis) consist of a classical database where each tuple is associated with a probability and tuples are treated as independent probabilistic events.
While unable to encode correlations directly, \tis are popular because any finite probabilistic database can be encoded as a \ti and a set of constraints that ``condition'' the \ti~\cite{VS17}.
Block-independent databases (\bis) generalize \tis by partitioning the input into blocks of disjoint tuples, where blocks are independent~\cite{RS07,BS06}. \emph{PC-tables}~\cite{GT06} pair a C-table~\cite{IL84a} with a probability distribution over its variables. This is similar to our $\semNX$-PDBs, except that we do not allow variables as attribute values, and instead of local conditions (propositional formulas that may contain comparisons), we associate tuples with polynomials from $\semNX$.
Approaches for probabilistic query processing, i.e., computing the marginal probability for each result tuple of a query over a probabilistic database, fall into two broad categories. \emph{Intensional} (or \emph{grounded}) query evaluation approaches compute the \emph{lineage} of a tuple which is a Boolean formula encoding the provenance of the tuple and then compute the probability of the lineage formula. In this paper we also focus on intensional query evaluation, but use polynomials instead of boolean formulas to deal with multisets. It is a well-known fact that computing the probability of a tuple in the result of a query over a probabilistic database (the \emph{marginal probability of a tuple}) is \sharpphard which can be proven through a reduction from weighted model counting~\cite{provan-83-ccccptg,valiant-79-cenrp} using the fact the the probability of a tuple's lineage formula is equal to the marginal probability of the tuple. The second category, \emph{extensional} query evaluation, avoids calculating the lineage. This approach is in \ptime, but is limited to certain classes of queries. Dalvi et al.~\cite{DS12} proved a dichotomy for unions of conjunctive queries (UCQs): for any UCQ the probabilistic query evaluation problem is either \sharpphard or \ptime. Olteanu et al.~\cite{FO16} presented dichotomies for two classes of queries with negation, R\'e et al~\cite{RS09b} present a trichotomy for HAVING queries. Amarilli et al. investigated tractable classes of databases for more complex queries~\cite{AB15,AB15c}. Another line of work, studies which structural properties of lineage formulas lead to tractable cases~\cite{kenig-13-nclexpdc,roy-11-f,sen-10-ronfqevpd}.
Approaches for probabilistic query processing (i.e., computing the marginal probability for query result tuples) fall into two broad categories.
\emph{Intensional} (or \emph{grounded}) query evaluation computes the \emph{lineage} of a tuple (a Boolean formula encoding the provenance of the tuple) and then the probability of the lineage formula.
In this paper we focus on intensional query evaluation, using polynomials instead of Boolean formulas.
It is a well-known fact that computing the marginal probability of a tuple is \sharpphard (proven through a reduction from weighted model counting~\cite{provan-83-ccccptg,valiant-79-cenrp}, using the fact that a tuple's marginal probability equals the probability of its lineage formula).
The second category, \emph{extensional} query evaluation, avoids calculating the lineage.
This approach is in \ptime, but is limited to certain classes of queries.
Dalvi et al.~\cite{DS12} proved a dichotomy for unions of conjunctive queries (UCQs): for any UCQ, probabilistic query evaluation is either \sharpphard or \ptime, and in the latter case the query admits an extensional evaluation strategy.
Olteanu et al.~\cite{FO16} present dichotomies for two classes of queries with negation, and R\'e et al.~\cite{RS09b} present a trichotomy for HAVING queries.
Amarilli et al. investigated tractable classes of databases for more complex queries~\cite{AB15,AB15c}.
Another line of work studies which structural properties of lineage formulas lead to tractable cases~\cite{kenig-13-nclexpdc,roy-11-f,sen-10-ronfqevpd}.
Several techniques for approximating the probability of a query result tuple have been proposed in related work~\cite{FH13,heuvel-19-anappdsd,DBLP:conf/icde/OlteanuHK10,DS07,re-07-eftqevpd}. These approaches either rely on Monte Carlo sampling, e.g., \cite{DS07,re-07-eftqevpd}, or a branch-and-bound paradigm~\cite{DBLP:conf/icde/OlteanuHK10,fink-11}. The approximation algorithm for bag expectation we present in this work is based on sampling.
Several techniques for approximating tuple probabilities have been proposed in related work~\cite{FH13,heuvel-19-anappdsd,DBLP:conf/icde/OlteanuHK10,DS07,re-07-eftqevpd}, relying on Monte Carlo sampling, e.g., \cite{DS07,re-07-eftqevpd}, or a branch-and-bound paradigm~\cite{DBLP:conf/icde/OlteanuHK10,fink-11}.
The approximation algorithm for bag expectation we present in this work is based on sampling.
Fink et al.~\cite{FH12} study aggregate queries over \emph{pvc-tables}, a probabilistic version of the extension of K-relations to aggregation proposed in~\cite{AD11d}. As an extension of K-relations, this approach supports bags. Probabilities are computed using a decomposition approach~\cite{DBLP:conf/icde/OlteanuHK10} over the symbolic expressions that pvc-tables use as tuple annotations and values. \cite{FH12} identifies a tractable class of queries involving aggregation. In contrast, we study a less general data model and query class, but provide a linear-time approximation algorithm and new insights into the complexity of computing expectations (while \cite{FH12} computes probabilities for individual output annotations).

View File

@ -57,7 +57,7 @@ We need all the possible edge patterns in an arbitrary $G$ with at most three di
%Let $\numocc{G}{H}$ denote the number of occurrences of pattern $H$ in graph $G$, where, for example, $\numocc{G}{\ed}$ means the number of single edges in $G$.
For any graph $G$, the following formulas for $\numocc{G}{H}$ compute their respective patterns exactly in $O(\numedge)$ time, with $d_i$ representing the degree of vertex $i$ (proofs are n \Cref{app:easy-counts}):
For any graph $G$, the following formulas compute $\numocc{G}{H}$ exactly for their respective patterns $H$ in $O(\numedge)$ time, with $d_i$ representing the degree of vertex $i$ (proofs are in \Cref{app:easy-counts}):
\begin{align}
&\numocc{G}{\ed} = \numedge, \label{eq:1e}\\
&\numocc{G}{\twopath} = \sum_{i \in V} \binom{d_i}{2} \label{eq:2p}\\
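For concreteness, here is a minimal self-contained sketch computing the two counts displayed above in $O(\numedge)$ time from an edge list (the function and variable names are ours, not from any accompanying implementation):
\begin{verbatim}
from collections import Counter
from math import comb

def easy_counts(edges):
    """Count single edges and 2-paths (wedges) in O(|E|) time."""
    deg = Counter()                      # d_i for every vertex i
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    num_edge = len(edges)                # numocc(G, ed) = |E|
    num_two_path = sum(comb(d, 2) for d in deg.values())  # sum_i C(d_i, 2)
    return num_edge, num_two_path

# The path 1-2-3 has two edges and exactly one 2-path:
print(easy_counts([(1, 2), (2, 3)]))     # -> (2, 1)
\end{verbatim}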
@ -81,10 +81,12 @@ Note that $\rpoly_{G}^3(\prob,\ldots, \prob)$ as a polynomial in $\prob$ has deg
\begin{Lemma}\label{lem:qE3-exp}
%When we expand $\poly_{G}^3(\vct{X})$ out and assign all exponents $e \geq 1$ a value of $1$, we have the following result,
For any $p$, we have:
{\small
\begin{align}
&\rpoly_{G}^3(\prob,\ldots, \prob) = \numocc{G}{\ed}\prob^2 + 6\numocc{G}{\twopath}\prob^3 + 6\numocc{G}{\twodis}\prob^4 + 6\numocc{G}{\tri}\prob^3\nonumber\\
&+ 6\numocc{G}{\oneint}\prob^4 + 6\numocc{G}{\threepath}\prob^4 + 6\numocc{G}{\twopathdis}\prob^5 + 6\numocc{G}{\threedis}\prob^6.\label{claim:four-one}
\end{align}
}
\end{Lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
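As a quick sanity check of this identity (our own toy example): for the triangle graph $K_3$ we have $\numocc{K_3}{\ed} = 3$, $\numocc{K_3}{\twopath} = 3$, $\numocc{K_3}{\tri} = 1$, and every other count is $0$, so
\[
\rpoly_{K_3}^{3}(\prob,\ldots,\prob) = 3\prob^{2} + 6\cdot 3\cdot\prob^{3} + 6\cdot 1\cdot\prob^{3} = 3\prob^{2} + 24\prob^{3},
\]
which agrees with expanding $(X_1X_2 + X_2X_3 + X_1X_3)^3$ directly, collapsing all exponents to $1$, and substituting $\prob$ for every variable.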
@ -94,7 +96,8 @@ By definition we have that
\[\poly_{G}^3(\vct{X}) = \sum_{\substack{(i_1, j_1), (i_2, j_2), (i_3, j_3) \in E}}~\; \prod_{\ell = 1}^{3}X_{i_\ell}X_{j_\ell}.\]
Hence $\rpoly_{G}^3(\vct{X})$ has degree six. Note that the monomial $\prod_{\ell = 1}^{3}X_{i_\ell}X_{j_\ell}$ will contribute to the coefficient of $\prob^\nu$ in $\rpoly_{G}^3(\prob,\ldots,\prob)$, where $\nu$ is the number of distinct variables in the monomial.
%Rather than list all the expressions in full detail, let us make some observations regarding the sum.
Let $e_1 = (i_1, j_1), e_2 = (i_2, j_2), e_3 = (i_3, j_3)$. Notice that each expression in the sum is a triple $(e_1, e_2, e_3)$. There are three forms the triple $(e_1, e_2, e_3)$ can take (and in each case, we account for their contribution to $\rpoly_{G}^3(\vct{X})$).
Let $e_1 = (i_1, j_1), e_2 = (i_2, j_2), e_3 = (i_3, j_3)$.
We compute $\rpoly_{G}^3(\vct{X})$ by considering each of the three forms that the triple $(e_1, e_2, e_3)$ can take.
\textsc{case 1:} $e_1 = e_2 = e_3$ (all edges are the same). There are exactly $\numedge=\numocc{G}{\ed}$ such triples, each with a $\prob^2$ factor in $\rpoly_{G}^3\left(\prob,\ldots, \prob\right)$.
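Spelling out the collapse for this case (a restatement of the step above, not a new claim): the monomial of such a triple is $X_i^3X_j^3$, and
\[
X_{i}^{3}X_{j}^{3} \;\longmapsto\; X_iX_j \;\longmapsto\; \prob^{2},
\]
where the first step sets all exponents to $1$ and the second substitutes $\prob$ for every variable; this is exactly the source of the $\numocc{G}{\ed}\prob^2$ term in \Cref{lem:qE3-exp}.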
@ -118,7 +121,7 @@ Next, we relate the various sub-graph counts in $\graph{2}$ and $\graph{3}$ to t
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}\label{lem:3m-G2}
The number of $3$-matchings in graph $\graph{2}$ satisfies the following identity,
The $3$-matchings in graph $\graph{2}$ satisfy the identity:
\begin{align*}
\numocc{\graph{2}}{\threedis} &= 8 \cdot \numocc{\graph{1}}{\threedis} + 6 \cdot \numocc{\graph{1}}{\twopathdis}\\
&+ 4 \cdot \numocc{\graph{1}}{\oneint} + 4 \cdot \numocc{\graph{1}}{\threepath} + 2 \cdot \numocc{\graph{1}}{\tri}.
@ -128,7 +131,7 @@ The number of $3$-matchings in graph $\graph{2}$ satisfies the following identit
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}\label{lem:3m-G3}
The number of 3-matchings in $\graph{3}$ satisfy the following identity,
The 3-matchings in $\graph{3}$ satisfy the identity:
\begin{align*}
\numocc{\graph{3}}{\threedis} &= 4\numocc{\graph{1}}{\twopath} + 6\numocc{\graph{1}}{\twodis} + 18\numocc{\graph{1}}{\tri}\\
&+ 21\numocc{\graph{1}}{\threepath}+ 24\numocc{\graph{1}}{\twopathdis} + 20\numocc{\graph{1}}{\oneint}\\
@ -140,14 +143,14 @@ The number of 3-matchings in $\graph{3}$ satisfy the following identity,
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}\label{lem:3p-G2}
The number of $3$-paths in $\graph{2}$ satisfies the following identity,
The $3$-paths in $\graph{2}$ satisfy the identity:
\[\numocc{\graph{2}}{\threepath} = 2 \cdot \numocc{\graph{1}}{\twopath}.\]
\end{Lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}\label{lem:3p-G3}
The number of $3$-paths in $\graph{3}$ satisfies the following identity,
The $3$-paths in $\graph{3}$ satisfy the identity:
\[\numocc{\graph{3}}{\threepath} = \numocc{\graph{1}}{\ed} + 2 \cdot \numocc{\graph{1}}{\twopath}.\]
\end{Lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -178,7 +181,7 @@ Fix $p\in (0,1)$. Given $\rpoly_{\graph{\ell}}^3(\prob,\dots,\prob)$ for $\ell\i
\end{pmatrix}
=\vct{b},
\]
from which we can compute $\numocc{G}{\tri}, \numocc{G}{\threepath}$ and $\numocc{G}{\threedis}$ in $O(1)$ time.
giving $\numocc{G}{\tri}, \numocc{G}{\threepath}$ and $\numocc{G}{\threedis}$ in $O(1)$ time.
\end{Lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
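A minimal sketch of this last constant-time step (the function name is ours, and \texttt{M}, \texttt{b} are placeholders for the coefficient matrix and right-hand side defined above, which are not reproduced here):
\begin{verbatim}
import numpy as np

def recover_counts(M, b):
    """Solve the 3x3 linear system once: O(1) work, independent of |V|, |E|.

    M -- the 3x3 coefficient matrix of the system above (placeholder here)
    b -- the right-hand side assembled from rpoly^3 evaluated on G^(1), G^(2), G^(3)
    Returns (#triangles, #3-paths, #3-matchings), assuming the unknowns are
    ordered as in the lemma.
    """
    tri, three_path, three_matching = np.linalg.solve(M, b)
    return tri, three_path, three_matching
\end{verbatim}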