Merge branch 'master' of gitlab.odin.cse.buffalo.edu:ahuber/SketchingWorlds

This commit is contained in:
Boris Glavic 2020-12-20 16:19:40 -06:00
commit 32c2511129
7 changed files with 46 additions and 34 deletions

View file

@ -8,7 +8,7 @@
However, using a reduction from the problem of counting k-matchings, we demonstrate that calculating the expectation is \sharpwonehard when the polynomial is compressed, for example through factorization.
As we show, this result has a significant implication: a Bag-PDB doing exact computations will never be as fast as a classical (deterministic) database.
The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $\prob$ (s.t. $\prob \not \in \{0,1\}$).
We proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs). We develop an algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of such polynomials in linear time in the size of the polynomial, paving the way for PDBs competitive with deterministic databases.
We proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs). We develop an algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of such polynomials in linear time in the size of the polynomial, paving the way for PDBs to be competitive with deterministic databases.
% \AH{High-level intuition}
% \BG{Most people think that computing expected multiplicity of an output tuple in a probabilistic database (PDB) is easy. Due to the fact that most modern implementations of PDBs represent tuple lineage in their expanded form, it has to be the case that such a computation is linear in the size of the lineage. This follows since, when we have an uncompressed lineage, linearity allows for expectation to be pushed through the sum.}
% \AH{Low-level why-would-an-expert-read-this}

View file

@ -19,7 +19,7 @@ We now introduce useful definitions and notation related to polynomials. We use
\begin{Definition}[Variables in a monomial]\label{def:vars}
Given a monomial $v$, we use $\var(v)$ to denote the set of variables in $v$.
\end{Definition}
For example, the monomial $XY$ has $\var(XY)=\inset{X,Y}$.
\noindent For example, the monomial $XY$ has $\var(XY)=\inset{X,Y}$.
@ -32,7 +32,7 @@ For example the monomial $XY$ has $\var(XY)=\inset{X,Y}$.
\begin{Definition}[$\polyf(\cdot)$]\label{def:poly-func}
Denote by $\polyf(\etree)$ the function from expression tree $\etree$ to its corresponding polynomial. $\polyf(\cdot)$ is recursively defined on $\etree$ as follows, where $\etree_\lchild$ and $\etree_\rchild$ denote the left and right child of $\etree$ respectively.
With addition and multiplication following the standard interpretation:
With addition and multiplication following the standard interpretation for polynomials:
%
% \begin{align*}
% &\etree.\type = +\mapsto&& \polyf(\etree_\lchild) + \polyf(\etree_\rchild)\\
@ -57,7 +57,9 @@ With addition and multiplication following the standard interpretation:
\begin{Definition}[Expanded T]\label{def:expand-tree}
$\expandtree{\etree}$ is the (pure) sum of products expansion of $\etree$, which we formally define next. The logical view of \expandtree{\etree} ~is a list of tuples $(\monom, \coef)$, where $\monom$ is a monomial and $\coef$ is in $\mathbb{R}$. \expandtree{\etree} has the following recursive definition (where $\circ$ is list concatenation).
$\expandtree{\etree}$ is the reduced sum of products expansion of $\etree$.
The logical view of \expandtree{\etree} is a list of tuples $(\monom, \coef)$, where $\monom$ is a set of variables and $\coef$ is in $\mathbb{R}$.
\expandtree{\etree} has the following recursive definition ($\circ$ is list concatenation).
%
% recursively defined as
% \begin{align*}
@ -81,10 +83,15 @@ $\expandtree{\etree}$ is the (pure) sum of products expansion of $\etree$, which
%where that the multiplication of two tuples %is the standard multiplication over monomials and the standard multiplication over coefficients to produce the product tuple, as in
%is their direct product $(\monom_1, \coef_1) \cdot (\monom_2, \coef_2) = (\monom_1 \cdot \monom_2, \coef_1 \times \coef_2)$ such that monomials $\monom_1$ and $\monom_2$ are concatenated in a product operation, while the standard product operation over reals applies to $\coef_1 \times \coef_2$. The product of $\expandtree{\etree_\lchild} \cdot \expandtree{\etree'_\rchild}$ is then the cross product of the multiplication of all such tuples returned to both $\expandtree{\etree_\lchild}$ and $\expandtree{\etree_\rchild}$. %The operator $\otimes$ is defined as the cross-product tuple multiplication of all such tuples returned by both $\expandtree{\etree_\lchild}$ and $\expandtree{\etree_\rchild}$.
In the following, we abuse notation and write $\monom$ to denote the monomial obtained as the product of the variables in the set.
\begin{Example}\label{example:expr-tree-T}
Consider the factorized representation $(X+ 2Y)(2X - Y)$ of the polynomial in~\Cref{eq:poly-eg}. Its expression tree $\etree$ is illustrated in Figure ~\ref{fig:expr-tree-T}. The pure expansion of the product is $2X^2 - XY + 4XY - 2Y^2$ and the $\expandtree{\etree}$ is $[(2, X^2), (-1, XY), (4, XY), (-2, Y^2)]$.
Consider the factorized representation $(X+ 2Y)(2X - Y)$ of the polynomial in~\Cref{eq:poly-eg}.
Its expression tree $\etree$ is illustrated in Figure~\ref{fig:expr-tree-T}.
The pure expansion of the product is $2X^2 - XY + 4XY - 2Y^2$ and the $\expandtree{\etree}$ is $[(X, 2), (XY, -1), (XY, 4), (Y, -2)]$.
\end{Example}
$\expandtree{\etree}$ encodes the \emph{reduced} form of $\polyf\inparen{\etree}$, decoupling each monomial into a set of variables $\monom$ and a real coefficient $\coef$.
Note, however, that unlike $\rpoly$, $\expandtree{\etree}$ does not need to be in SOP form.
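\noindent As a sketch of how this list is used, assume for illustration that $X$ and $Y$ are independent \ti variables with marginal probabilities $\prob_X$ and $\prob_Y$ (names introduced only for this sketch). Summing $\coef \cdot \prod_{X_i \in \monom} X_i$ over the four entries of $\expandtree{\etree}$ in \Cref{example:expr-tree-T} gives
\begin{equation*}
2X - XY + 4XY - 2Y = 2X + 3XY - 2Y,
\end{equation*}
and substituting the probabilities yields $2\prob_X + 3\prob_X\prob_Y - 2\prob_Y$, which under the stated assumption is the expectation of $(X + 2Y)(2X - Y)$.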
\begin{figure}[t]
@ -157,7 +164,7 @@ such that
%with multiplicative $(\error,\delta)$-bounds, where $k$ denotes the degree of $\poly$.
\end{Theorem}
The proof of~\Cref{lem:approx-alg} can be found in~\Cref{sec:proofs-approx-alg}.
\noindent The proof of~\Cref{lem:approx-alg} can be found in~\Cref{sec:proofs-approx-alg}.
To get linear runtime results from~\Cref{lem:approx-alg}, we will need to define another parameter modeling the (weighted) number of monomials in $\expandtree{\etree}$ to be `canceled' when it is modded with $\mathcal{B}$:
\begin{Definition}[Parameter $\gamma$]\label{def:param-gamma}
@ -168,7 +175,7 @@ Given an expression tree $\etree$, define
%\AR{Need to make sure use of indicator variable $\onesymbol$ above is consistent with the rest of the paper.}
%\OK{Done}
We next present a couple of corollaries of~\Cref{lem:approx-alg}.
\noindent We next present a couple of corollaries of~\Cref{lem:approx-alg}.
\begin{Corollary}
\label{cor:approx-algo-const-p}
Let $\poly(\vct{X})$ be as in~\Cref{lem:approx-alg} and let $\gamma=\gamma(\etree)$. Further let it be the case that $\prob_i\ge \prob_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ satisfying~\Cref{eq:approx-algo-bound} can be computed in time
@ -177,8 +184,8 @@ In particular, if $\prob_0>0$ and $\gamma<1$ are absolute constants then the abo
\end{Corollary}
The proof for~\Cref{cor:approx-algo-const-p} can be seen in~\Cref{sec:proofs-approx-alg}.
The restriction on $\gamma$ is satisfied by \ti (where $\gamma=0$) as well as for all queries of the PDBench \bi benchmark (see \Cref{app:subsec:experiment}).
Note that (i) tuple presence is independent across blocks, so the corresponding probabilities (and hence $\prob_0$) are independent of the number of blocks, and (ii) \bis model uncertain attributes, so block size (and hence $\gamma$) is a function of the ``messiness'' of a dataset, rather than its size.
The restriction on $\gamma$ is satisfied by any \ti (where $\gamma=0$) as well as by all three queries of the PDBench \bi benchmark (\Cref{app:subsec:experiment} shows experimentally that $\gamma$ is negligible in practice for these queries).
We also observe that (i) tuple presence is independent across blocks, so the corresponding probabilities (and hence $\prob_0$) are independent of the number of blocks, and (ii) \bis model uncertain attributes, so block size (and hence $\gamma$) is a function of the ``messiness'' of a dataset, rather than its size.
Thus, we expect the corollary to hold in general.
% \AH{I am thinking that perhaps the terminology and presentation of~\Cref{app:subsec:experiment} may need word-smithing to clearly illustrate the $\bi$ benchmarks satisfied--although the substance is already written there.}
% \AR{Yes! E.g. $\gamma$ is not used at all in~\Cref{app:subsec:experiment}}
@ -191,12 +198,12 @@ Thus, we expect the corollary to hold in general.
The algorithm to prove~\Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=\polyf(\etree)$ for expression tree $\etree$ over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
\begin{equation}
\label{eq:tilde-Q-bi}
\rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(v,c)\in \expandtree{\etree}} \hspace*{-2mm} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot c\cdot\hspace*{-2mm}\prod_{X_i\in \var\inparen{v}}\hspace*{-2mm} X_i
\rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(\monom,\coef)\in \expandtree{\etree}} \hspace*{-2mm} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \coef\cdot\hspace*{-2mm}\prod_{X_i\in \var\inparen{\monom}}\hspace*{-2mm} X_i
\end{equation}
Given the above, the algorithm is a sampling-based algorithm for the above sum: we sample $(v,c)\in \expandtree{\etree}$ with probability proportional\footnote{We could have also uniformly sampled from $\expandtree{\etree}$ but this gives better parameters.}
Given the above, the algorithm is a sampling-based algorithm for the above sum: we sample $(\monom,\coef)\in \expandtree{\etree}$ with probability proportional\footnote{We could have also uniformly sampled from $\expandtree{\etree}$ but this gives better parameters.}
%\AH{Regarding the footnote, is there really a difference? I \emph{suppose} technically, but in this case they are \emph{effectively} the same. Just wondering.}
%\AR{Yes, there is! If we used uniform distribution then in our bounds we will have a parameter that depends on the largest $\abs{coef}$, which e.g. could be dependent on $n$. But with the weighted probability distribution, we avoid paying this price. Though I guess perhaps we can say for the kinds of queries we consider these coefficients are all constants?}
to $\abs{c}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{v}} p_i$. Taking $\numsamp$ samples and computing the average of $Y$ gives us our final estimate.
to $\abs{\coef}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{\monom}} p_i$. Taking $\numsamp$ samples and computing the average of $Y$ gives us our final estimate.
The number of samples is computed by (see \Cref{app:subsec-th-mon-samp}):
\begin{equation*}
2\exp{\left(-\frac{\samplesize\error^2}{2}\right)}\leq \conf \implies\samplesize \geq \frac{2\log{\frac{2}{\conf}}}{\error^2}.
@ -399,7 +406,7 @@ It turns out that for proof of~\Cref{lem:sample}, we need to argue that when $\e
%\begin{align*}
%&\eval{\etree~|~\etree.\type = +}_{\wght} =&&\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}; \etree_\lchild.\wght = \frac{\eval{\etree_\lchild}_{\abs{\etree}}}{\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}}; \etree_\rchild.\wght = \frac{\eval{\etree_\rchild}_{\abs{\etree}}}{\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}}
%\end{align*}
\noindent \onepass\ (Algorithm~\ref{alg:one-pass} in \Cref{sec:proofs-approx-alg}) essentially populates the \vari{weight} variable on each node with the above definitions. Lemma~\ref{lem:one-pass} is also proved in~\Cref{sec:proofs-approx-alg}.
\noindent \onepass\ (Algorithm~\ref{alg:one-pass} in \Cref{sec:proofs-approx-alg}) essentially populates the \wght and \vpartial variables on each node with the definitions above. Lemma~\ref{lem:one-pass} is also proved in~\Cref{sec:proofs-approx-alg}.
%\subsubsection{Psuedo Code}

View file

@ -19,7 +19,7 @@ A large body of work has focused on identifying tractable cases by either identi
The problem of computing the marginal probability of a result tuple has a natural correspondence under bag semantics: computing the expected multiplicity of a query result tuple.
Analogously, this problem can be reduced to computing the expectation of the lineage, which under bag semantics is a polynomial.
This problem has received much less attention, perhaps because the problem is trivially tractable;
This problem has received much less attention, perhaps because the problem is trivially tractable.
In fact it is linear time when the lineage polynomial is encoded in the typical sum of products (SOP) representation.
However, there exist compressed representations of polynomials, e.g., factorizations~\cite{factorized-db}, that can be polynomially more concise than the SOP representation of a polynomial.
These compression schemes are close analogs of typical database optimizations like projection push-down~\cite{DBLP:conf/pods/KhamisNR16}, hinting that perhaps even Bag-PDBs inherently have higher query processing complexity than deterministic databases.
@ -286,8 +286,8 @@ W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c + 2W_aW_bW_c^
The expectation $\expct\pbox{\poly^2(W_a, W_b, W_c)}$ then is:
\begin{multline*}
\expct\pbox{W_a^2}\expct\pbox{W_b^2} + \expct\pbox{W_b^2}\expct\pbox{W_c^2} + \expct\pbox{W_c^2}\expct\pbox{W_a^2} \\
+ \expct\pbox{2W_a^2}\expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{2W_a}\expct\pbox{W_b^2}\expct\pbox{W_c} \\
+ \expct\pbox{2W_a}\expct\pbox{W_b}\expct\pbox{W_c^2}
+ 2\expct\pbox{W_a^2}\expct\pbox{W_b}\expct\pbox{W_c} + 2\expct\pbox{W_a}\expct\pbox{W_b^2}\expct\pbox{W_c} \\
+ 2\expct\pbox{W_a}\expct\pbox{W_b}\expct\pbox{W_c^2}
\end{multline*}
Recall the nice property of $\query$ that its expected count could be computed by evaluating its lineage on the probability vector (i.e., \Cref{eqn:can-inline-probabilities-into-polynomial}).
This property does not hold for $\poly^2$ (i.e., $\expct\pbox{\poly^2} \neq \poly^2(\probOf\pbox{W_a}, \probOf\pbox{W_b}, \probOf\pbox{W_c})$), but does suggest a related closed form formula.
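\noindent As a sketch of such a closed form, assume that $\poly = W_aW_b + W_bW_c + W_cW_a$ (as the expansion above indicates), that each $W_i$ is an independent $\{0,1\}$-valued variable, and write $p_i$ as shorthand (introduced here) for $\probOf\pbox{W_i}$. Then $W_i^2 = W_i$, so $\expct\pbox{W_i^2} = p_i$ and the expectation above simplifies to
\begin{equation*}
\expct\pbox{\poly^2(W_a, W_b, W_c)} = p_ap_b + p_bp_c + p_cp_a + 6p_ap_bp_c.
\end{equation*}
For instance, with $p_a = p_b = p_c = \frac{1}{2}$ this evaluates to $\frac{3}{4} + \frac{6}{8} = \frac{3}{2}$, whereas $\poly^2\inparen{\frac{1}{2}, \frac{1}{2}, \frac{1}{2}} = \inparen{\frac{3}{4}}^2 = \frac{9}{16}$.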
@ -331,14 +331,14 @@ Concretely, in this paper:
(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
(iv) We further generalize our results to higher moments, polynomial circuits, and prove that for RA+ queries, the processing time in approximation is within a constant factor of the same query processed deterministically.
Our hardness results follow by considering a suitable generalization of the lineage polynomial in Example~\ref{ex:bag-vs-set}. First it is easy to generalize the polynomial in Example~\ref{ex:bag-vs-set} to $\poly_G(X_1,\dots,X_n)$ that represents the edge set of a graph $G$ in $n$ vertices. Then $\inparen{\poly_G(X_1,\dots,X_n)}^k$ encodes as its monomials all subgraphs of $G$ with at most $k$ edges in it. This implies that the corresponding reduced polynomial $\rpoly_G^k(\prob,\dots,\prob)$ can be written as $\sum_{i=0}^{2k} c_i\cdot \prob^i$ and we observe that $c_{2k}$ is proportional to the number of $k$-matchings (computing which is \sharpwonehard\ ) in $G$. Thus, if we have access to $\rpoly_G^k(\prob_i,\dots,\prob_i)$ for distinct values of $\prob_i$ for $0\le i\le 2k$, then we can setup a system of linear equations and compute $c_{2k}$ (and hence the number of $k$-matchings in $G$). This result, however, does not rule out the possibility that computing $\rpoly_G^k(\prob,\dots, \prob)$ for a {\em single specific} value of $\prob$ might be easy: indeed it is easy for $\prob=0$ or $\prob=1$. However, we are able to show that for any other value of $\prob$, computing $\rpoly_G^k(\prob,\dots, \prob)$ exactly will most probably require super-linear time. This reduction needs more work (and we cannot yet extend our results to $k>3$). Further, we have to rely on more recent conjectures in {\em fine-grained} complexity on e.g. the complexity of counting the number of triangles in $G$ and not more standard parameterized hardness like \sharpwonehard.
Our hardness results follow by considering a suitable generalization of the lineage polynomial in Example~\ref{ex:bag-vs-set}. First, it is easy to generalize the polynomial in Example~\ref{ex:bag-vs-set} to $\poly_G(X_1,\dots,X_n)$ that represents the edge set of a graph $G$ on $n$ vertices. Then $\poly_G^k(X_1,\dots,X_n)$ (i.e., $\inparen{\poly_G(X_1,\dots,X_n)}^k$) encodes as its monomials all subgraphs of $G$ with at most $k$ edges. This implies that the corresponding reduced polynomial $\rpoly_G^k(\prob,\dots,\prob)$ can be written as $\sum_{i=0}^{2k} c_i\cdot \prob^i$ and we observe that $c_{2k}$ is proportional to the number of $k$-matchings in $G$ (computing which is \sharpwonehard). Thus, if we have access to $\rpoly_G^k(\prob_i,\dots,\prob_i)$ for distinct values of $\prob_i$ for $0\le i\le 2k$, then we can set up a system of linear equations and compute $c_{2k}$ (and hence the number of $k$-matchings in $G$). This result, however, does not rule out the possibility that computing $\rpoly_G^k(\prob,\dots, \prob)$ for a {\em single specific} value of $\prob$ might be easy: indeed it is easy for $\prob=0$ or $\prob=1$. However, we are able to show that for any other value of $\prob$, computing $\rpoly_G^k(\prob,\dots, \prob)$ exactly will most probably require super-linear time. This reduction needs more work (and we cannot yet extend our results to $k>3$). Further, we have to rely on more recent conjectures in {\em fine-grained} complexity, e.g., on the complexity of counting the number of triangles in $G$, rather than on more standard parameterized hardness assumptions like \sharpwonehard.
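\noindent To sketch the interpolation step (the matrix form below is an illustration, not notation used elsewhere in the paper): evaluating at $2k+1$ distinct values $\prob_0,\dots,\prob_{2k}$ gives the linear system
\begin{equation*}
\begin{pmatrix}
1 & \prob_0 & \cdots & \prob_0^{2k}\\
\vdots & \vdots & \ddots & \vdots\\
1 & \prob_{2k} & \cdots & \prob_{2k}^{2k}
\end{pmatrix}
\begin{pmatrix} c_0\\ \vdots\\ c_{2k}\end{pmatrix}
=
\begin{pmatrix}\rpoly_G^k(\prob_0,\dots,\prob_0)\\ \vdots\\ \rpoly_G^k(\prob_{2k},\dots,\prob_{2k})\end{pmatrix}.
\end{equation*}
The coefficient matrix is a Vandermonde matrix, which is invertible whenever the $\prob_i$ are pairwise distinct, so $c_{2k}$ is uniquely determined by the $2k+1$ evaluations.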
The starting point of our approximation algorithm was the simple observation that for any lineage polynomial $\poly(X_1,\dots,X_n)$, we have $\rpoly(1,\dots,1)=Q(1,\dots,1)$ and if all the coefficients of $\poly$ are constants, then $\poly(X_1,\dots,X_n)$ (which can be easily computed in linear time) is a $\prob^k$ approximation to the value $\rpoly(\prob,\dots, \prob)$ that we are after. If $\prob$ and $k=\deg(\poly)$ are constants, then this gives a constant factor approximation. We then use sampling to get a better approximation factor of $(1\pm \eps)$: we sample monomials from $\poly(1,\dots,1)$ and do an appropriate weighted sum of their coefficients. Standard tail bounds then allow us to get our desired approximation scheme. To get a linear runtime, it turns out that we need the following properties from our compressed representation of $\poly$: (i) be able to compute $\poly(X_1,\dots,X_n)$ in linear time and (ii) be able to sample monomials from $\poly(X_1,\dots,X_n)$ quickly as well. For the ease of exposition, we start off with expression trees (see~\Cref{fig:intro-q2-etree} for an example) and show that they satisfy both of these properties. Later we show that it is easy to show that these properties also extend to polynomial circuits as well (we essentially show that in the required time bound, we can simulate access to the `unrolled' expression tree by considering the polynomial circuit).
The starting point of our approximation algorithm was the simple observation that for any lineage polynomial $\poly(X_1,\dots,X_n)$, we have $\rpoly(1,\dots,1)=\poly(1,\dots,1)$ and if all the coefficients of $\poly$ are constants, then $\poly(X_1,\dots,X_n)$ (which can be easily computed in linear time) is a $\prob^k$ approximation to the value $\rpoly(\prob,\dots, \prob)$ that we are after. If $\prob$ (i.e., the \emph{input} tuple probabilities) and $k=\deg(\poly)$ are constants, then this gives a constant factor approximation. We then use sampling to get a better approximation factor of $(1\pm \eps)$: we sample monomials from $\poly(1,\dots,1)$ and do an appropriate weighted sum of their coefficients. Standard tail bounds then allow us to get our desired approximation scheme. To get a linear runtime, it turns out that we need the following properties from our compressed representation of $\poly$: (i) be able to compute $\poly(X_1,\dots,X_n)$ in linear time and (ii) be able to sample monomials from $\poly(X_1,\dots,X_n)$ quickly as well. For ease of exposition, we start off with expression trees (see~\Cref{fig:intro-q2-etree} for an example) and show that they satisfy both of these properties. Later we show that these properties also extend to polynomial circuits (we essentially show that in the required time bound, we can simulate access to the `unrolled' expression tree by considering the polynomial circuit).
We also formalize our claim that, since our approximation algorithm runs in time linear in the size of the polynomial circuit, we can approximate the expected output tuple multiplicities with only an $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
\paragraph{Paper Organization.} We present some relevant background and setup our notation in~\Cref{sec:background}. We present our hardness results in~\Cref{sec:hard} and our approximation algorithm in~\Cref{sec:algo}. We present some (easy) generalizations of our results in~\Cref{sec:gen}. We do a quick overview of related work in~\Cref{sec:related-work} and conclude with some open questions in~\Cref{sec:concl-future-work}.
\paragraph{Paper Organization.} We present some relevant background and set up our notation in~\Cref{sec:background}. We present our hardness results in~\Cref{sec:hard} and our approximation algorithm in~\Cref{sec:algo}. We present some (easy) generalizations of our results in~\Cref{sec:gen}. We do a quick overview of related work in~\Cref{sec:related-work} and conclude with some open questions in~\Cref{sec:concl-future-work}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View file

@ -88,6 +88,7 @@
\newcommand{\val}{\vari{val}\xspace}
\newcommand{\type}{\vari{type}\xspace}
\newcommand{\wght}{\vari{weight}\xspace}
\newcommand{\vpartial}{\vari{partial}\xspace}
%types of T
\newcommand{\var}{\textsc{var}}
\newcommand{\tnum}{num}

View file

@ -4,7 +4,7 @@
\subsection{Reduced Polynomials and Equivalences}
We now introduce some terminology for polynomials and develop a reduced form for polynomials --- a closed form of the polynomial's expectation over probability distributions derived from a \bi or \ti.
Throughout, we will use $(X + Y)^2$ as a running example.
We will use $(X + Y)^2$ as a running example.
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \begin{Definition}[Monomial]\label{def:monomial}
@ -36,7 +36,9 @@ The degree of polynomial $\poly(\vct{X})$ is the maximum sum of exponents, over
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The degree of the running example polynomial is $2$. In this paper we consider only finite degree polynomials.
The degree of the running example polynomial is $2$.
Note that product terms can only arise as a consequence of join operations, so intuitively, the degree of a lineage polynomial is analogous to the largest number of joins in one clause of the UCQ query that created it.
In this paper we consider only finite degree polynomials.
%
% Throughout this paper, we also make the following \textit{assumption}.
%
@ -54,7 +56,7 @@ As they are a special case of \bis, the following applies to \tis as well.
Recall that in a \bi $\pxdb$ with tuples $t_1, \ldots, t_n$, each input tuple $t_i$ is annotated with a unique variable $X_i$.
Tuples of $\pxdb$ are partitioned into $\ell$ blocks $\block_1, \ldots, \block_\ell$ where tuple $t_i$ is associated with a probability $\prob_{\tup_i} = \pd[X_i = 1]$.
\footnote{
Although it is customary to define a single independent, $[\abs{\block_j}+1]$-valued variable per block, we decompose it into $\abs{\block_j}$ correlated $\{0,1\}$-valued variables per block that can be directly used in polynomials (without an indicator function). For $t_i \in b_j$, the event $(X_i = 1)$ is identical to the event $(X_j = i)$ in the customary annotation scheme.
Although it is customary to define a single independent, $[\abs{\block_i}+1]$-valued variable per block, we decompose it into $\abs{\block_i}$ correlated $\{0,1\}$-valued variables per block that can be used directly in polynomials (without an indicator function). For $t_j \in \block_i$, the event $(X_j = 1)$ corresponds to the event $(X_i = j)$ in the customary annotation scheme.
}
Because blocks are independent and tuples from the same block are disjoint, $\prob$ and the blocks induce the probability distribution $\pd$ of $\pxdb$.
We will write a \bi-lineage polynomial $\poly(\vct{X})$ for a \bi with $\ell$ blocks as
@ -92,7 +94,7 @@ Given the set of BIDB variables $\inset{X_{b,i}}$, define
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
Intuitively, in the reduced form, all exponents $e > 1$ are reduced to $e = 1$ and all monomials with multiple variables from the same block $\block$ are dropped (any world containing more than one tuple from a block has $0$ probability and can be ignored).
Intuitively, in the reduced form, all exponents $e > 1$ are reduced to $e = 1$ by $\text{mod } \mathcal T$, and all monomials with multiple variables from the same block $\block$ are dropped by $\text{mod } \mathcal B$ (i.e., any world containing more than one tuple from a block has $0$ probability and can be ignored).
For the special case of \tis, the second step is not necessary since every block contains a single tuple.
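\noindent For illustration, consider the running example $(X+Y)^2 = X^2 + 2XY + Y^2$. If $X$ and $Y$ belong to different blocks (in particular, for any \ti), only the exponents are reduced, giving
\begin{equation*}
X + 2XY + Y.
\end{equation*}
If we instead assume, purely for this sketch, that $X$ and $Y$ belong to the same block, then the monomial $2XY$ is additionally dropped, leaving $X + Y$.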
%Alternatively, one can think of $\rpoly$ as the \abbrSMB of $\poly(\vct{X})$ when the product operator is idempotent.
%

View file

@ -50,10 +50,10 @@ We use $\evald{\cdot}{\db}$ to denote the result of evaluating query $\query$ ov
\subsubsection{$\semNX$ as a Representation System}\label{sec:semnx-as-repr}
Let $\semNX$ denote the set of polynomials over variables $\vct{X}$ with natural number coefficients and exponents.
Consider now the semiring $(\semNX, +, \cdot, 0, 1)$ whose domain is $\semNX$ and with the standard addition and multiplication of polynomials.
We will utilize $\semNX$-PDB $\pxdb$, defined as the tuple $(\db, \pd)$, where $\semNX$-database $\db$ is paired with probability distribution $\pd$.
We denote by $\polyForTuple$ the annotation of tuple $t$ in the result of $\query$ (i.e., $\polyForTuple = \query(\pxdb)(t)$) and as before, interpret it as a function $\polyForTuple: \{0,1\}^{|\vct X|} \rightarrow \semN$ from vectors of variable assignments to the corresponding value of the annotating polynomial.
$\semNX$-PDBs and a function $\rmod$ that takes an $\semNX$-PDB input and outputs an equivalent $\semN$-PDB are formally defined in \Cref{subsec:supp-mat-background}.
Consider now the semiring $(\semNX, +, \cdot, 0, 1)$ whose domain is $\semNX$, with the standard addition and multiplication of polynomials.
We will use $\semNX$-PDB $\pxdb$, defined as the tuple $(\db, \pd)$, where $\semNX$-database $\db$ is paired with probability distribution $\pd$.
We denote by $\polyForTuple$ the annotation of tuple $t$ in the result of $\query$ on an implicit $\semNX$-PDB (i.e., $\polyForTuple = \query(\pxdb)(t)$ for some $\pxdb$) and as before, interpret it as a function $\polyForTuple: \{0,1\}^{|\vct X|} \rightarrow \semN$ from vectors of variable assignments to the corresponding value of the annotating polynomial.
$\semNX$-PDBs and a function $\rmod$ from an $\semNX$-PDB to an equivalent $\semN$-PDB are both formalized in \Cref{subsec:supp-mat-background}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -102,7 +102,9 @@ tree, whose internal nodes are from the set $\{+, \times\}$, with leaf nodes bei
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We ignore the remaining fields (\vari{partial} and \vari{weight}) until \Cref{sec:algo}. Note that $\etree$ need not encode an expression in standard monomial basis. For instance, $\etree$ could represent a compressed form of the running example, such as $(X + 2Y)(2X - Y)$.
We ignore the remaining fields (\vari{partial} and \vari{weight}) until \Cref{sec:algo}.
The semantics of expression trees follow the obvious interpretation; we define them formally in \Cref{def:poly-func} in \Cref{sec:algo}.
Note that $\etree$ need not encode an expression in standard monomial basis. For instance, $\etree$ could represent a compressed form of the running example, such as $(X + 2Y)(2X - Y)$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{Definition}[poly$(\cdot)$]\label{def:poly-func}
@ -135,9 +137,9 @@ We ignore the remaining fields (\vari{partial} and \vari{weight}) until \Cref{se
For our running example, $\etreeset{\smb} = \{2X^2 + 3XY - 2Y^2, (X + 2Y)(2X - Y), X(2X - Y) + 2Y(2X - Y), 2X(X + 2Y) - Y(X + 2Y)\}$. Note that \Cref{def:express-tree-set} implies that $\etree \in \etreeset{\polyf(\etree)}$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\medskip
\noindent We are now ready to formally state our \textbf{main problem}.
\medskip
\noindent We are now ready to formally state our \textbf{main problem}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[The Expected Result Multiplicity Problem]\label{def:the-expected-multipl}
Let $\vct{X} = (X_1, \ldots, X_n)$, and $\pdb$ be an $\semNX$-PDB over $\vct{X}$ with probability distribution $\pd$ over assignments $\vct{X} \to [0,1]$, $\query$ an $n$-ary query, and $t$ an $n$-ary tuple.

View file

@ -10,7 +10,7 @@ While \Cref{thm:mult-p-hard-result} shows that computing $\rpoly(\prob,\dots,\pr
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Theorem}\label{th:single-p-hard}
Fix $\prob\in (0,1)$. Then assuming \Cref{conj:graph} is true, any algorithm that computes $\rpoly_{G}^3(\prob,\dots,\prob)$ exactly has to run in time $\Omega\inparen{\abs{E(G)}^{1+\eps_0}}$, where $\eps_0$ is as defined in \Cref{conj:graph}.
Fix $\prob\in (0,1)$. Then assuming \Cref{conj:graph} is true, any algorithm that computes $\rpoly_{G}^3(\prob,\dots,\prob)$ from $G$ exactly has to run in time $\Omega\inparen{\abs{E(G)}^{1+\eps_0}}$, where $\eps_0$ is as defined in \Cref{conj:graph}.
\end{Theorem}
%\begin{proof}[Proof of Corollary ~\ref{th:single-p-gen-k}]
%Consider $\poly^3_{G}$ and $\poly' = 1$ such that $\poly'' = \poly^3_{G} \cdot \poly'$. By \Cref{th:single-p}, query $\poly''$ with $\kElem = 4$ has $\Omega(\numvar^{\frac{4}{3}})$ complexity.
@ -51,7 +51,7 @@ We need all the possible edge patterns in an arbitrary $G$ with at most three di
%\item Triangle ($\tri$)
%\item 3-path ($\threepath$)
\item 3-star ($\oneint$)--this is the graph that results when all three edges share exactly one common endpoint. The remaining endpoint for each edge is disconnected from any endpoint of the remaining two edges.
\item Disjoint Two-Path ($\twopathdis$)--this subgraph consists of a two path and a remaining disjoint edge.
\item Disjoint Two-Path ($\twopathdis$)--this subgraph consists of a two-path and a remaining disjoint edge.
%\item 3-matching ($\threedis$)--this subgraph is composed of three disjoint edges.
\end{itemize}
@ -103,9 +103,9 @@ We compute $\rpoly_{G}^3(\vct{X})$ by considering each of the three forms that t
\textsc{case 2:} This case occurs when there are two distinct edges of the three, call them $e$ and $e'$. When there are two distinct edges, there is then the occurrence when $2$ variables in the triple $(e_1, e_2, e_3)$ are bound to $e$. There are three combinations for this occurrence in $\poly_{G}^3(\vct{X})$. Analogously, there are three such occurrences in $\poly_{G}^3(\vct{X})$ when there is only one occurrence of $e$, i.e., $2$ of the variables in $(e_1, e_2, e_3)$ are $e'$. %Again, there are three combinations for this.
This implies that all $3 + 3 = 6$ combinations of two distinct edges $e$ and $e'$ contribute to the same monomial in $\rpoly_{G}^3$. % consist of the same monomial in $\rpoly$, i.e. $(e_1, e_1, e_2)$ is the same as $(e_2, e_1, e_2)$.
Since $e\ne e'$, this case produces the following edge patterns: $\twopath, \twodis$, which contribute $\prob^3$ and $\prob^4$ respectively to $\rpoly_{G}^3\left(\prob,\ldots, \prob\right)$.
Since $e\ne e'$, this case produces the following edge patterns: $\twopath, \twodis$, which contribute $6\prob^3$ and $6\prob^4$ respectively to $\rpoly_{G}^3\left(\prob,\ldots, \prob\right)$.
\textsc{case 3:} All $e_1,e_2$ and $e_3$ are distinct. For this case, we have $3! = 6$ permutations of $(e_1, e_2, e_3)$, each of which contribute to a different monomial in the SMP representation of $\poly_{G}^3(\vct{X})$. This case consists of the following edge patterns: $\tri, \oneint, \threepath, \twopathdis, \threedis$, which contribute $\prob^3, \prob^4, \prob^4, \prob^5$ and $\prob^6$ respectively to $\rpoly_{G}^3\left(\prob,\ldots, \prob\right)$.
\textsc{case 3:} All $e_1,e_2$ and $e_3$ are distinct. For this case, we have $3! = 6$ permutations of $(e_1, e_2, e_3)$, each of which contributes to a different monomial in the SMP representation of $\poly_{G}^3(\vct{X})$. This case consists of the following edge patterns: $\tri, \oneint, \threepath, \twopathdis, \threedis$, which contribute $6\prob^3, 6\prob^4, 6\prob^4, 6\prob^5$ and $6\prob^6$ respectively to $\rpoly_{G}^3\left(\prob,\ldots, \prob\right)$.
\end{proof}
\qed
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%