More adjustments to save space; currently ~8.5 pages over.

Aaron Huber 2021-03-09 11:43:38 -05:00
parent 0530ffc5cf
commit 39b176f045
4 changed files with 161 additions and 367 deletions

276
intro.tex

@ -4,8 +4,6 @@
\section{Introduction}
\label{sec:intro}
% \AR{\textbf{Oliver/Boris:} What is missing from the intro is why would someone care about bag-PDBs in {\em practice}? This is kinda obliquely referred to in the first para but it would be good to motivate this more. The intro (rightly) focuses on the theoretical reasons to study bag PDBs but what (if any) are the practical significance of getting bag PDBs done in linear-time? Would this lead to much faster real-life PDB systems?}
As explainability and fairness become more relevant to the data science community, it is now more critical than ever to understand how reliable a dataset is.
Probabilistic databases (PDBs)~\cite{DBLP:series/synthesis/2011Suciu} are a compelling solution, but a major roadblock to their adoption remains:
PDBs are orders of magnitude slower than classical (i.e., deterministic) database systems~\cite{feng:2019:sigmod:uncertainty}.
@ -32,87 +30,13 @@ By extension, this algorithm only has a constant factor overhead relative to det
}
This is an important result, because it implies that computing approximate expectations for SPJU queries can indeed be competitive with deterministic query evaluation over bag databases.
% is \emph{not} linear in the size of a compressed (factorized~\cite{factorized-db}) lineage polynomial by reduction from counting $k$-matchings.
% Thus, even Bag-PDBs do not enjoy the same time complexity as deterministic databases.
% This motivates our second goal, a linear time (in the size of the factorized lineage) approximation of expected counts for SPJU results in Bag-PDBs.
% As we also show, this complexity is proportional to the same query on a deterministic database.
% In this paper, we prove this
% limitation of PDBs and address it by proposing an algorithm that, to our knowledge, is the first $(1-\epsilon)$-approximation for expectations of counts to have a runtime within a constant factor of deterministic query processing.
% The fundamental challenge is lineage formulas, a key component of query processing in PDBs.
% Using the standard (i.e., DNF) encoding of lineage, computing typical statistics like marginal probabilities or moments is easy (i.e., $O(|\text{lineage}|)$) for bags and hence, perhaps not worthy of research attention, but hard (i.e., $O(2^{|\text{lineage}|})$) for sets and hence, interesting from a research perspective.
% However, the standard encoding is unnecessarily large, and so even for Bag-PDBs, computing such statistics still has a higher complexity than answering queries in a deterministic (i.e., non-probabilistic) database.
% A naive strategy might be to move from the theoretically simpler set-relational model~\cite{DBLP:series/synthesis/2011Suciu,BD05,DBLP:conf/icde/AntovaKO07a,DBLP:conf/sigmod/SinghMMPHS08} to the computationally simpler bag-relational model, mirroring a similar transition in deterministic databases decades ago.
% However, after discarding a long-held approach to representing lineage, we prove that query processing in Bag-PDBs is \sharpwonehard.
% This finding shows that even Bag-PDB query processing has a higher complexity than deterministic query processing, and opens a rich landscape of opportunities for research on approximate algorithms.
% The fundamental challenge is lineage formulas, a key component of query processing in PDBs.
% Using the standard (i.e., DNF) encoding of lineage, computing typical statistics like marginal probabilities or moments is easy (i.e., $O(|\text{lineage}|)$) for bags and hence, perhaps not worthy of research attention, but hard (i.e., $O(2^{|\text{lineage}|})$) for sets and hence, interesting from a research perspective.
% However, the standard encoding is unnecessarily large, and so even for Bag-PDBs, computing such statistics still has a higher complexity than answering queries in a deterministic (i.e., non-probabilistic) database.
% In this paper, we formally prove this limitation of PDBs and address it by proposing an algorithm that, to our knowledge, is the first $(1-\epsilon)$-approximation for expectations of counts to have a runtime within a constant factor of deterministic query processing.\footnote{
% MCDB~\cite{jampani2008mcdb} is also a constant factor slower, but only guarantees additive bounds.
% }
% Consider the dominant problem in Set-PDBs (Computing marginal probabilities) and the corresponding problem in Bag-PDBs (computing expectations of counts).
% In work that addresses the former problem~\cite{DBLP:series/synthesis/2011Suciu}, the lineage of a query result tuple is a Boolean formula over random variables that captures the conditions under which the tuple appears in the result.
% Computing the probability of the tuple appearing in the result is thus analogous to weighted model counting (a known \sharpphard problem).
% In the corresponding Bag-PDB problem~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,feng:2019:sigmod:uncertainty,GL16}, lineage is a polynomial over random variables that captures the multiplicity of the output tuple.
% The expectation of the multiplicity is the expectation of this polynomial.
% Lineage in Set-PDBs is typically encoded in disjunctive normal form.
% This representation is significantly larger than the query result sans lineage.
% However, even with alternative encodings~\cite{FH13}, the limiting factor in computing marginal probabilities is the probability computation itself, not the lineage formula.
% The Bag-PDB analog is a polynomial in sum of products (SOP) form --- a sum of `clauses', each the product of a set of integer or variable atoms.
% Thanks to linearity of expectation, computing the expectation of a count query is linear in the number of clauses in the SOP polynomial.
% Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general.
% Compressed representations like Factorized Databases~\cite{factorized-db} %DBLP:conf/tapp/Zavodny11
% or Arithmetic/Polynomial Circuits~\cite{arith-complexity} are analogous to deterministic query optimizations (e.g. pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}.
% Thus, measuring the performance of a PDB algorithm in terms of the size of the \emph{compressed} lineage formula more closely relates the algorithm's performance to the complexity of query evaluation in a deterministic database.
% The initial picture is not good.
% We prove that computing expected counts is \emph{not} linear in the size of a compressed (factorized~\cite{factorized-db}) lineage polynomial by reduction from counting $k$-matchings.
% Thus, even Bag-PDBs do not enjoy the same time complexity as deterministic databases.
% This motivates our second goal, a linear time (in the size of the factorized lineage) approximation of expected counts for SPJU results in Bag-PDBs.
% As we also show, this complexity is proportional to the same query on a deterministic database.
% In other words, our approximation algorithm can estimate expected multiplicities for tuples in the result of an SPJU query with a complexity comparable to deterministic query-processing.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Sets vs Bags}
%Consider an arbitrary output polynomial $\poly$. Further, consider the same polynomial, with all exponents $e > 1$ set to $1$ and call the resulting polynomial $\rpoly$.
%Figures, etc
%Relations for example 1
%Graph of query output for intro example
%\begin{figure}
% \begin{tikzpicture}
% \node at (1.5, 3) [tree_node](top){a};
% \node at (0, 0) [tree_node](left){b};
% \node at (3, 0) [tree_node](right){c};
% \draw (top)--(left);
% \draw (left)--(right);
% \draw (right)--(top);
% \end{tikzpicture}
%\caption{Graph of tuples in table E}
%\label{fig:intro-ex-graph}
%\end{figure}
\begin{Example}\label{ex:intro}
Consider the Tuple Independent ($\ti$) Set-PDB\footnote{Our work does also handle Block Independent Databases ($\bi$)~\cite{BD05,DBLP:series/synthesis/2011Suciu}.} given in \Cref{fig:intro-ex} with two input relations $R$ and $E$.
Each input tuple is assigned an annotation (attribute $\Phi_{set}$): an independent random Boolean variable ($W_i$) or the constant $\top$.
Each assignment of values to variables ($\{\;W_a,W_b,W_c\;\}\mapsto \{\;\top,\bot\;\}$)
% \SF{Do we need to state the meaning of $\top$ and $\bot$? Also do we want to add bag annotation to Figure 1 too since we are discussing both sets and bags later?}
identifies one \emph{possible world}, a deterministic database instance containing exactly the tuples annotated by the constant $\top$ or by a variable assigned to $\top$.
The probability of this world is the joint probability of the corresponding assignments.
For example, let $\probOf[W_a] = \probOf[W_b] = \probOf[W_c] = \prob$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.
@ -120,9 +44,9 @@ The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top
\end{Example}
\begin{figure}[t]
\begin{subfigure}{0.45\textwidth}
\begin{subfigure}{0.33\linewidth}
\centering
\resizebox{!}{8mm}{
\resizebox{!}{10mm}{
\begin{tabular}{ c | c c c}
$\rel$ & A & $\Phi_{set}$ & $\Phi_{bag}$\\
\hline
@ -130,12 +54,12 @@ The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top
& b & $W_b$ & $W_b$\\
& c & $W_c$ & $W_c$\\
\end{tabular}
} %\caption{Atom 1 of query $\poly$ in ~\Cref{intro:ex}}
} \caption{Relation $R$ in~\Cref{ex:intro}}
\label{subfig:ex-atom1}
\end{subfigure}
\begin{subfigure}{0.45\textwidth}
\end{subfigure}%
\begin{subfigure}{0.33\linewidth}
\centering
\resizebox{!}{8mm}{
\resizebox{!}{10mm}{
\begin{tabular}{ c | c c c c}
$E$ & A & B & $\Phi_{set}$ & $\Phi_{bag}$ \\
\hline
@ -144,23 +68,48 @@ The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top
& c & a & $\top$ & $1$\\
\end{tabular}
}
%\caption{Atom 3 of query $\poly$ in ~\Cref{intro:ex}}
\caption{Relation $E$ in~\Cref{ex:intro}}
\label{subfig:ex-atom3}
\end{subfigure}%
\begin{subfigure}{0.33\linewidth}
\centering
\resizebox{!}{29mm}{
\begin{tikzpicture}[thick]
\node[tree_node] (a1) at (0, 0){$W_a$};
\node[tree_node] (b1) at (1, 0){$W_b$};
\node[tree_node] (c1) at (2, 0){$W_c$};
\node[tree_node] (d1) at (3, 0){$W_d$};
\node[tree_node] (a2) at (0.75, 0.8){$\boldsymbol{\circmult}$};
\node[tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circmult}$};
\node[tree_node] (c2) at (2.25, 0.8){$\boldsymbol{\circmult}$};
\node[tree_node] (a3) at (1.9, 1.6){$\boldsymbol{\circplus}$};
\node[tree_node] (a4) at (0.75, 1.6){$\boldsymbol{\circplus}$};
\node[tree_node] (a5) at (0.75, 2.5){$\boldsymbol{\circmult}$};
\draw[->] (a1) -- (a2);
\draw[->] (b1) -- (a2);
\draw[->] (b1) -- (b2);
\draw[->] (c1) -- (b2);
\draw[->] (c1) -- (c2);
\draw[->] (d1) -- (c2);
\draw[->] (c2) -- (a3);
\draw[->] (a2) -- (a4);
\draw[->] (b2) -- (a3);
\draw[->] (a3) -- (a4);
%sink
\draw[thick, ->] (a4.110) -- (a5.250);
\draw[thick, ->] (a4.70) -- (a5.290);
\draw[thick, ->] (a5) -- (0.75, 3.0);
\end{tikzpicture}
}
\caption{Circuit encoding for query $\poly^2$.}
\label{fig:circuit-q2-intro}
\end{subfigure}
% \begin{subfigure}{0.15\textwidth}
% \centering
% \begin{tabular}{ c | c | c}
% $\rel$ & B & $\Phi$\\
% \hline
% & b & $W_b$\\
% & c & $W_c$\\
% & a & $W_a$\\
% \end{tabular}
% %\caption{Atom 2 of query $\poly$ in ~\Cref{intro:ex}}
% \label{subfig:ex-atom2}
% \end{subfigure}
%\vspace*{3mm}
\vspace*{-3mm}
\caption{$\ti$ relations for $\poly$}
\caption{ }%{$\ti$ relations for $\poly$}
\label{fig:intro-ex}
\trimfigurespacing
\end{figure}
@ -172,7 +121,7 @@ Without loss of generality, we assume that input relations are sets (i.e. $Dom(W
\begin{Example}\label{ex:bag-vs-set}
Continuing the prior example, we are given the following Boolean (resp., count) query
$$\poly() :- R(A), E(A, B), R(B)$$
The lineage of the result in a Set-PDB (resp., Bag-PDB) is a Boolean (polynomial) formula over random variables annotating the input relations (i.e., $W_a$, $W_b$, $W_c$).
The lineage of the result in a Set-PDB (resp., Bag-PDB) is a Boolean formula (polynomial) over random variables annotating the input relations (i.e., $W_a$, $W_b$, $W_c$).
Because the query result is a nullary relation, we write $Q(\cdot)$ to denote the function that evaluates the lineage over one specific assignment of values to the variables (i.e., the value of the lineage in the corresponding possible world):
$$
\begin{tabular}{c c}
@ -202,7 +151,6 @@ $$
The Set-PDB query is satisfied in this possible world and the Bag-PDB result tuple has a multiplicity of 1.
The marginal probability (resp., expected count) of this query is computed over all possible worlds:
% \AR{What is $\mu$ below?}
{\small
\begin{align*}
\probOf[\poly_{set}] &= \hspace*{-1mm}
@ -214,15 +162,6 @@ The marginal probability (resp., expected count) of this query is computed over
Note that the query of \Cref{ex:bag-vs-set} in set semantics is indeed non-hierarchical~\cite{DS12}, and thus \sharpphard.
To see why computing this probability is hard, observe that the clauses of the disjunctive normal form Boolean lineage are neither independent nor disjoint. Existing approaches (e.g.,~\cite{FH13}) therefore resort to Shannon decomposition, which is at worst exponential in the size of the input.
% \begin{equation*}
% \expct\pbox{\poly(W_a, W_b, W_c)} = W_aW_b + W_a\overline{W_b}W_c + \overline{W_a}W_bW_c = 3\prob^2 - 2\prob^3
% \end{equation*}
% In general, such a computation can be exponential in the size of the database.
%Using Shannon's Expansion,
%\begin{align*}
%&W_aW_b \vee W_bW_c \vee W_cW_a
%= &W_a
%\end{align*}
Conversely, in Bag-PDBs, correlations between clauses of the SOP polynomial are not problematic thanks to linearity of expectation.
The expectation computation over the output lineage is simply the sum of expectations of each clause.
For \Cref{ex:intro}, the expectation is simply
@ -236,52 +175,12 @@ In this particular lineage polynomial, all variables in each product clause are
Computing such expectations is indeed linear in the size of the SOP as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial.
As a further interesting feature of this example, note that $\expct\pbox{W_i} = \probOf[W_i = 1]$, and so taking the same polynomial over the reals:
\begin{multline}
\begin{equation}
\label{eqn:can-inline-probabilities-into-polynomial}
\expct\pbox{\poly_{bag}}
% = P[W_a = 1]P[W_b = 1] + P[W_b = 1]P[W_c = 1]\\
% + P[W_c = 1]P[W_a = 1]\\
= \poly_{bag}(\probOf[W_a=1], \probOf[W_b=1], \probOf[W_c=1])
\end{multline}
\end{equation}
\begin{figure}[t]
\resizebox{0.8\columnwidth}{!}{
\begin{tikzpicture}[thick, level distance=0.9cm,level 1/.style={sibling distance=4.55cm}, level 2/.style={sibling distance=1.5cm}, level 3/.style={sibling distance=0.7cm}]% level/.style={sibling distance=6cm/(#1 * 1.5)}]
\node[tree_node](root){$\boldsymbol{\times}$}
child{node[tree_node]{$\boldsymbol{+}$}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_a$}}
child{node[tree_node]{$W_b$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_b$}}
child{node[tree_node]{$W_c$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_c$}}
child{node[tree_node]{$W_a$}}
}
}
child{node[tree_node]{$\boldsymbol{+}$}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_a$}}
child{node[tree_node]{$W_b$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_b$}}
child{node[tree_node]{$W_c$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_c$}}
child{node[tree_node]{$W_a$}}
}
};
\end{tikzpicture}
}
\caption{Expression tree for query $\poly^2$.}
\label{fig:intro-q2-etree}
\trimfigurespacing
\end{figure}
\subsection{Superlinearity of Bag PDBs}
Moving forward, we focus exclusively on bags and drop the subscript from $\poly_{bag}$.
@ -297,32 +196,27 @@ For example:
\poly^2(W_a, W_b, W_c) = \left(W_aW_b + W_bW_c + W_cW_a\right) \cdot \left(W_aW_b + W_bW_c + W_cW_a\right)
\end{equation*}
}
This factorized expression can be easily modeled as an expression tree, as in \Cref{fig:intro-q2-etree},
This factorized expression can be easily modeled as an expression tree, as in \Cref{fig:circuit-q2-intro},
while the equivalent SOP representation is
\begin{equation*}
W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c + 2W_aW_bW_c^2.
\end{equation*}
The expectation $\expct\pbox{\poly^2(W_a, W_b, W_c)}$ then is:
\begin{multline*}
\expct\pbox{W_a^2}\expct\pbox{W_b^2} + \expct\pbox{W_b^2}\expct\pbox{W_c^2} + \expct\pbox{W_c^2}\expct\pbox{W_a^2} \\
+ 2\expct\pbox{W_a^2}\expct\pbox{W_b}\expct\pbox{W_c} + 2\expct\pbox{W_a}\expct\pbox{W_b^2}\expct\pbox{W_c} \\
+ 2\expct\pbox{W_a}\expct\pbox{W_b}\expct\pbox{W_c^2}
\end{multline*}
\begin{footnotesize}
\begin{equation*}
\expct\pbox{W_a^2}\expct\pbox{W_b^2} + \expct\pbox{W_b^2}\expct\pbox{W_c^2} + \expct\pbox{W_c^2}\expct\pbox{W_a^2} + 2\expct\pbox{W_a^2}\expct\pbox{W_b}\expct\pbox{W_c} + 2\expct\pbox{W_a}\expct\pbox{W_b^2}\expct\pbox{W_c} + 2\expct\pbox{W_a}\expct\pbox{W_b}\expct\pbox{W_c^2}
\end{equation*}
\end{footnotesize}
Recall the nice property of $\query$ that its expected count could be computed by evaluating its lineage on the probability vector (i.e., \Cref{eqn:can-inline-probabilities-into-polynomial}).
This property does not hold for $\poly^2$ (i.e., $\expct\pbox{\poly^2} \neq \poly^2(\probOf\pbox{W_a}, \probOf\pbox{W_b}, \probOf\pbox{W_c})$), but does suggest a related closed form formula.
Note that if $Dom(W_i) = \{0, 1\}$, then for any $k > 0$, $\expct\pbox{W_i^k} = \expct\pbox{W_i}$.
This property leads us to consider a structure related to $\poly$.
% \AH{I don't know if we want to include the following statement: \par \emph{ bags are only hard with self-joins }
% \par Atri suggests a proof in the appendix regarding this claim.}
For any polynomial $\poly(\vct{X})$, we define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in $\poly(\vct{X})$ to $1$.
With $\poly^2$ as an example, we have:
\begin{align*}
\rpoly^2(W_a, W_b, W_c)
% =&\; W_aW_b + W_bW_c + W_cW_a + 2W_aW_bW_c + 2W_aW_bW_c\\
% &+ 2W_aW_bW_c\\
=&\; W_aW_b + W_bW_c + W_cW_a + 6W_aW_bW_c
\end{align*}
%\SF{Should this be like $\tilde{\poly^2}$ to avoid ambiguous?}
Note that the reduced polynomial is a closed form of the expected count (i.e., $\expct\pbox{\poly^2} = \rpoly^2(\probOf\pbox{W_a=1}, \probOf\pbox{W_b=1}, \probOf\pbox{W_c=1})$).
Also note that the $\poly$ in~\Cref{ex:bag-vs-set} is already in reduced form.
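As a sanity check of this closed form (a worked instance under the assumption, as in \Cref{ex:intro}, that $\probOf\pbox{W_a=1} = \probOf\pbox{W_b=1} = \probOf\pbox{W_c=1} = \prob$), evaluating the reduced polynomial gives
\begin{equation*}
\rpoly^2(\prob, \prob, \prob) = 3\prob^2 + 6\prob^3,
\end{equation*}
which agrees with a direct enumeration of possible worlds: $\poly = 1$ in each of the three worlds with exactly two variables set to $1$ (probability $\prob^2(1-\prob)$ each), $\poly = 3$ in the world with all three variables set to $1$ (probability $\prob^3$), and $\poly = 0$ otherwise, so that
\begin{equation*}
\expct\pbox{\poly^2} = 3 \cdot 1^2 \cdot \prob^2(1-\prob) + 3^2 \cdot \prob^3 = 3\prob^2 + 6\prob^3.
\end{equation*}
In contrast, $\poly^2(\prob, \prob, \prob) = 9\prob^4$, which differs from the value above for every $\prob \in (0, 1)$.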
@ -337,12 +231,6 @@ Is it always the case that the expectation of a UCQ in a Bag-PDB can be computed
If so, then Bag-PDBs can compete with deterministic databases.
This is unfortunately not the case, and an approximation is required.
% Consider the :
% \begin{equation*}
% \poly^3() := \left(\rel(A), E(A, B), R(B)\right), \left(\rel(C), E(C, D), R(D)\right), \left(\rel(F), E(F, G), R(G)\right).
% \end{equation*}
% The factorized output polynomial consists of a product of three identical three-way summations, while the SOP encoding is exponential --- $3^3$ clauses to be precise.
\subsection{Overview of our results and techniques}
Concretely, in this paper:
(i) We show that conjunctive queries over a bag-$\ti$ are hard (i.e., superlinear in the size of a compressed lineage encoding) by reduction from counting the number of $k$-matchings over an arbitrary graph;
@ -361,62 +249,6 @@ We also formalize our claim that, since our approximation algorithm runs in time
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Interesting contributions, problem definition, known results, our results, etc
%
%\paragraph{Problem Definition/Known Results/Our Results/Our Techniques}
%This work addresses the problem of performing computations over the output query polynomial efficiently. We specifically focus on computing the
%expectation over the polynomial that is the result of a query over a PDB. This is a problem where, to the best of our knowledge, there has not
%been a lot of study. Our results show that the problem is hard (superlinear) in the general case via a reduction to known hardness results
%in the field of graph theory. Further we introduce a linear approximation time algorithm with guaranteed confidence bounds. We then prove the
%claimed runtime and confidence bounds. The algorithm accepts an expression tree which models the output polynomial, samples uniformly from the
%expression tree, and then outputs an approximation within the claimed bounds in the claimed runtime.
%
%\paragraph{Interesting Mathematical Contributions}
%This work shows an equivalence between the polynomial $\poly$ and $\rpoly$, where $\rpoly$ is the polynomial $\poly$ such that all
%exponents $e > 1$ are set to $1$ across all variables over all monomials. The equivalence is realized when $\vct{X}$ is in $\{0, 1\}^\numvar$.
%This setting then allows for yet another equivalence, where we prove that $\rpoly(\prob,\ldots, \prob)$ is indeed $\expct\pbox{\poly(\vct{X})}$.
%This realization facilitates the building of an algorithm which approximates $\rpoly(\prob,\ldots, \prob)$ and in turn the expectation of
%$\poly(\vct{X})$.
%
%Another interesting result in this work is the reduction of the computation of $\rpoly(\prob,\ldots, \prob)$ to finding the number of
%3-paths, 3-matchings, and triangles of an arbitrary graph, a problem that is known to be superlinear in the general case, which is, by our definition
%hard. We show in Thm 2.1 that the exact computation of $\rpoly(\prob, \ldots, \prob)$ is indeed hard. We finally propose and prove
%an approximation algorithm of $\rpoly(\prob,\ldots, \prob)$, a linear time algorithm with guaranteed $\epsilon/\delta$ bounds. The algorithm
%leverages the efficiency of compressed polynomial input by taking in an expression tree of the output polynomial, which allows for factorized
%forms of the polynomial to be input and efficiently sampled from. One subtlety that comes up in the discussion of the algorithm is that the input
%of the algorithm is the output polynomial of the query as opposed to the input DB of the query. This then implies that our results are linear
%in the size of the output polynomial rather than the input DB of the query, a polynomial that might be greater or lesser than the input depending
%on the structure of the query.
%
%\section{Outline of the rest of the paper}
%\begin{enumerate}
% \item Background Knowledge and Notation
% \begin{enumerate}
% \item Review notation for PDBs
% \item Review the use of semirings as generating output polynomials
% \item Review the translation of semiring operators to RA operators
% \item Polynomial formulation and notation
% \end{enumerate}
% \item Reduction to hardness results in graph theory
% \begin{enumerate}
% \item $\rpoly$ and its equivalence to $\expct\pbox{\poly}$ when $\vct{X} \in \{0, 1\}^\numvar$
% \item Results for SOP polynomial
% \item Results for compressed version of polynomial
% \item ~\Cref{lem:const-p} proof
% \end{enumerate}
% \item Approximation Algorithm
% \begin{enumerate}
% \item Description of the Algorithm
% \item Theoretical guarantees
% \item Will we have time to tackle BIDB?
% \begin{enumerate}
% \item If so, experiments on BIDBs?
% \end{enumerate}
% \end{enumerate}
% \item Future Work
% \item Conclusion
%\end{enumerate}


@ -171,7 +171,7 @@
\tikzset{
default_node/.style={align=center, inner sep=0pt},
pattern_node/.style={fill=gray!50, draw=black, semithick, inner sep=0pt, minimum size = 2pt, circle},
tree_node/.style={default_node, draw=black, black, circle, text width=0.3cm, font=\bfseries, minimum size=0.65cm},
tree_node/.style={default_node, draw=black, black, circle, text width=0.5cm, font=\bfseries, minimum size=0.65cm},
highlight_color/.style={black}, wght_color/.style={black},
highlight_treenode/.style={tree_node, draw=black, black},
edge from parent path={(\tikzparentnode) -- (\tikzchildnode)}


@ -18,11 +18,7 @@ We will use $(X + Y)^2$ as a running example.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Standard Monomial Basis]\label{def:smb}
A monomial is a product of variable terms, each raised to a non-negative integer power.
A polynomial in \termSMB (\abbrSMB) has the form:
\[
\sum_{i=1}^n c_i \cdot m_i
\]
where each $c_i$ is an integer and each $m_i$ is a monomial and $m_i \neq m_j$ for $i \neq j$. The \abbrSMB of a polynomial $\poly$ is $\smbOf{\poly}$.
A polynomial in \termSMB (\abbrSMB) has the form: $\sum_{i=1}^n c_i \cdot m_i$, where each $c_i$ is an integer and each $m_i$ is a monomial and $m_i \neq m_j$ for $i \neq j$. The \abbrSMB of a polynomial $\poly$ is $\smbOf{\poly}$.
% fully expanded out such that no product of sums exist and where each unique monomial appears exactly once.
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -80,8 +76,28 @@ For example when $S_0=\inset{X^2-X, Y^2-Y}$, taking the polynomial $2X^2 + 3XY -
%
\begin{Definition}\label{def:mod-set-polys}
Given the set of BIDB variables $\inset{X_{b,i}}$, define
\[\mathcal{B}=\comprehension{X_{b,i}\cdot X_{b,j}}{\text{ for every block } b \text{ and } i\ne j \in [~\abs{\block}~]}\]
\[\mathcal{T}=\comprehension{X_{b,i}^2-X_{b,i}}{\text{ for every block } b \text{ and } i \in [~\abs{\block}~]}\]
\setlength\parindent{0pt}
\vspace*{-3mm}
{\small
\begin{tabular}{@{}l l}
\begin{minipage}[b]{0.45\linewidth}
\centering
\begin{equation*}
\mathcal{B}=\comprehension{X_{b,i}\cdot X_{b,j}}{\text{ for every block } b \text{ and } i\ne j \in [~\abs{\block}~]},
\end{equation*}
\end{minipage}%
\hspace{13mm}
&
\begin{minipage}[b]{0.45\linewidth}
\centering
\begin{equation*}
\mathcal{T}=\comprehension{X_{b,i}^2-X_{b,i}}{\text{ for every block } b \text{ and } i \in [~\abs{\block}~]}
\end{equation*}
\end{minipage}
\\
\end{tabular}
}
\end{Definition}
%
\begin{Definition}[Reduced \bi Polynomials]\label{def:reduced-bi-poly}


@ -22,9 +22,9 @@ We review positive relational algebra semantics for $\semK$-relations below.
Consider the semiring $\semN = (\domN,+,\times,0,1)$ of natural numbers. $\semN$-databases model bag semantics by annotating each tuple with its multiplicity. A probabilistic $\semN$-database ($\semN$-PDB) is a PDB where each possible world is an $\semN$-database. We study the problem of computing statistical moments for query results over such databases. Specifically, given a probabilistic $\semN$-database $\pdb = (\idb, \pd)$, query $\query$, and possible result $t$, we treat $\query(\db)(t)$ as a random $\semN$-valued variable and are interested in computing its expectation $\expct_{\idb \sim \probDist}[\query(\db)(t)]$:
%
\begin{align}\label{eq:bag-expectation}
\begin{equation}\label{eq:bag-expectation}
\expct_{\idb \sim \probDist}[\query(\db)(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \probOf(\db)
\end{align}
\end{equation}
%
Intuitively, the expectation of $\query(\db)(t)$ is the number of duplicates of $t$ we expect to find in the result of query $\query$.
@ -34,18 +34,18 @@ Intuitively, the expectation of $\query(\db)(t)$ is the number of duplicates of
\subsubsection{$\semK$-relational Query Semantics}
For completeness, we briefly review the semantics for $\raPlus$ queries over $\semK$-relations~\cite{DBLP:conf/pods/GreenKT07}.
We use $\evald{\cdot}{\db}$ to denote the result of evaluating query $\query$ over $\semK$-database $\db$. In the definition shown below, we assume that tuples are of appropriate arity and use $\project_A(\tup)$ to denote the projection of tuple $\tup$ on a list of attributes $A$. Furthermore, $\theta(\tup)$ denotes the (Boolean) result of evaluating condition $\theta$ over $\tup$.
%
\begin{align*}
& \evald{\project_A(\rel)}{\db}(\tup) & & = & & \sum_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') \\
& \evald{(\rel_1 \union \rel_2)}{\db}(\tup) & & = & & \evald{\rel_1}{\db}(\tup) \addK \evald{\rel_2}{\db}(\tup) \\
& \evald{(\rel_1 \join \rel_2)}{\db}(\tup) & & = & & \evald{\rel_1}{\db}(\project_{\sch(\rel_1)}(\tup)) \multK \evald{\rel_2}{\db}(\project_{\sch(\rel_2)}(\tup)) \\
& \evald{\select_\theta(\rel)}{\db}(\tup) & & = & & \begin{cases}
\evald{\rel}{\db}(\tup) & \text{if }\theta(\tup) \\
\zeroK & \text{otherwise}.
\end{cases} \\
& \evald{R}{\db}(\tup) & & = & & \rel(\tup)
\evald{\project_A(\rel)}{\db}(\tup) &= \sum_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') &
\evald{(\rel_1 \union \rel_2)}{\db}(\tup) &= \evald{\rel_1}{\db}(\tup) \addK \evald{\rel_2}{\db}(\tup)\\
\evald{\select_\theta(\rel)}{\db}(\tup) &= \begin{cases}
\evald{\rel}{\db}(\tup) & \text{if }\theta(\tup) \\
\zeroK & \text{otherwise}.
\end{cases} &
\evald{(\rel_1 \join \rel_2)}{\db}(\tup) &= \evald{\rel_1}{\db}(\project_{\sch(\rel_1)}(\tup)) \multK \evald{\rel_2}{\db}(\project_{\sch(\rel_2)}(\tup)) \\
& & \evald{R}{\db}(\tup) &= \rel(\tup)
\end{align*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{$\semNX$ as a Representation System}\label{sec:semnx-as-repr}
@ -95,15 +95,6 @@ We first formally define circuits, an encoding of polynomials that we use throug
For illustrative purposes consider the polynomial $\poly(\vct{X}) = 2X^2 + 3XY - 2Y^2$ over $\vct{X} = [X, Y]$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\oldstuff{\begin{Definition}[Expression Tree]\label{def:express-tree}
%Consider a vector of variables $\vct{X}$.
% An expression tree $\etree$ over $\vct{X}$ is a binary %an ADT logically viewed as an n-ary
%tree, whose internal nodes are from the set $\{+, \times\}$, with leaf nodes being either from the set $\mathbb{R}$ $(\tnum)$ or from the set of monomials $(\var)$. The members of $\etree$ are \type, \val, \vari{partial}, \vari{children}, and \vari{weight}, where \type is the type of value stored in the node $\etree$ (i.e. one of $\{+, \times, \var, \tnum\}$, \val is the value stored, and \vari{children} is the list of $\etree$'s children where $\etree_\lchild$ is the left child and $\etree_\rchild$ the right child.
%\end{Definition}}
\revision{
We represent query polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, a standard way to represent polynomials over fields (particularly in the field of algebraic complexity) that we use for polynomials over $\mathbb N$ in the obvious way.
\begin{Definition}[Circuit]\label{def:circuit}
@ -122,139 +113,94 @@ The circuit \circuit in ~\Cref{fig:circuit-express-tree} encodes the polynomial
\end{Example}
\begin{figure}[t]
\begin{tikzpicture}[thick]
\node[tree_node] (a1) at (0, 0){$\boldsymbol{X}$};
\node[tree_node] (b1) at (1, 0){$\boldsymbol{Y}$};
\node[tree_node] (c1) at (2, 0){$\boldsymbol{W}$};
\node[tree_node] (d1) at (3, 0){$\boldsymbol{Z}$};
\node[tree_node] (a2) at (0.5, 1){$\boldsymbol{\circmult}$};
\node[tree_node] (b2) at (2.5, 1){$\boldsymbol{\circmult}$};
\node[tree_node] (a3) at (1.5, 2){$\boldsymbol{\circplus}$};
\draw[->] (a1) -- (a2);
\draw[->] (b1) -- (a2);
\draw[->] (c1) -- (b2);
\draw[->] (d1) -- (b2);
\draw[->] (a2) -- (a3);
\draw[->] (b2) -- (a3);
\end{tikzpicture}
\caption{Circuit encoding $XY + WZ$, a special case of an expression tree}
\label{fig:circuit-express-tree}
\begin{subfigure}[b]{0.45\linewidth}
\centering
\begin{tikzpicture}[thick]
\node[tree_node] (a1) at (0, 0){$\boldsymbol{X}$};
\node[tree_node] (b1) at (1, 0){$\boldsymbol{Y}$};
\node[tree_node] (c1) at (2, 0){$\boldsymbol{W}$};
\node[tree_node] (d1) at (3, 0){$\boldsymbol{Z}$};
\node[tree_node] (a2) at (0.5, 1){$\boldsymbol{\circmult}$};
\node[tree_node] (b2) at (2.5, 1){$\boldsymbol{\circmult}$};
\node[tree_node] (a3) at (1.5, 2){$\boldsymbol{\circplus}$};
\draw[->] (a1) -- (a2);
\draw[->] (b1) -- (a2);
\draw[->] (c1) -- (b2);
\draw[->] (d1) -- (b2);
\draw[->] (a2) -- (a3);
\draw[->] (b2) -- (a3);
\draw[->] (a3) -- (1.5, 2.5);
\end{tikzpicture}
\caption{Circuit encoding $XY + WZ$, a special case of an expression tree}
\label{fig:circuit-express-tree}
\end{subfigure}
\hspace{5mm}
\begin{subfigure}[b]{0.45\linewidth}
\centering
\begin{tikzpicture}[thick]
\node[tree_node] (a1) at (0, 0) {$\boldsymbol{X}$};
\node[tree_node] (b1) at (1.5, 0) {$\boldsymbol{2}$};
\node[tree_node] (c1) at (3, 0) {$\boldsymbol{Y}$};
\node[tree_node] (d1) at (4.5, 0) {$\boldsymbol{-1}$};
\node[tree_node] (a2) at (0.75, 0.75) {$\boldsymbol{\circmult}$};
\node[tree_node] (b2) at (2.25, 0.75) {$\boldsymbol{\circmult}$};
\node[tree_node] (c2) at (3.75, 0.75) {$\boldsymbol{\circmult}$};
\node[tree_node] (a3) at (0.55, 1.5) {$\boldsymbol{\circplus}$};
\node[tree_node] (b3) at (3.75, 1.5) {$\boldsymbol{\circplus}$};
\node[tree_node] (a4) at (2.25, 2.25) {$\boldsymbol{\circmult}$};
\draw[->] (a1) -- (a2);
\draw[->, thick] (a1) -- (a3);
\draw[->] (b1) -- (a2);
\draw[->] (b1) -- (b2);
\draw[->] (c1) -- (c2);
\draw[->] (c1) -- (b2);
\draw[->] (d1) -- (c2);
\draw[->] (a2) -- (b3);
\draw[->] (b2) -- (a3);
\draw[->] (c2) -- (b3);
\draw[->] (a3) -- (a4);
\draw[->] (b3) -- (a4);
\draw[->] (a4) -- (2.25, 2.75);
\end{tikzpicture}
\caption{Circuit encoding of $(X + 2Y)(2X - Y)$}
\label{fig:circuit}
\end{subfigure}
\caption{Example circuits.}
\end{figure}
%\begin{figure}[t]
%
%\resizebox{0.65\columnwidth}{!}{
%\begin{tikzpicture}[thick, level distance=0.9cm,level 1/.style={sibling distance=3.55cm}, level 2/.style={sibling distance=1.8cm}, level 3/.style={sibling distance=0.8cm}]% level/.style={sibling distance=6cm/(#1 * 1.5)}]
% \node[tree_node](root){$\boldsymbol{\times}$}
% child{node[tree_node]{$\boldsymbol{+}$}
% child{node[tree_node]{x}
% %child[missing]{node[tree_node]{}}
% %child{node[tree_node]{x}}
% }
% child{node[tree_node]{$\boldsymbol{\times}$}
% child{node[tree_node]{2}}
% child{node[tree_node]{y}}
% }
% }
% child{node[highlight_treenode] (TR) {$\boldsymbol{+}$}
% child{node[tree_node]{$\boldsymbol{\times}$}
% child{node[tree_node]{2}}
% child{node[tree_node]{x}}
% }
% child{node[tree_node]{$\boldsymbol{\times}$}
% child{node[tree_node] (neg-leaf) {-1}}
% child{node[tree_node]{y}}
% }
% %child[sibling distance= 0cm, grow=north east, red]{node[tree_node]{$\circuit_\rchild$}}
% };
%% \node[below=2pt of neg-leaf, inner sep=1pt, blue] (neg-comment) {\textbf{Negation pushed to leaf nodes}};
%% \draw[<-|, blue] (neg-leaf) -- (neg-comment);
% \node[above right=0.7cm of TR, highlight_color, inner sep=0pt, font=\bfseries] (tr-label) {$\circuit_\rinput$};
% \node[above right=0.7cm of root, highlight_color, inner sep=0pt, font=\bfseries] (t-label) {$\circuit$};
% \draw[<-|, highlight_color] (TR) -- (tr-label);
% \draw[<-|, highlight_color] (root) -- (t-label);
%\end{tikzpicture}
%}
%\vspace*{-2mm}
%\caption{Expression tree $\circuit$ for the product $\boldsymbol{(x + 2y)(2x - y)}$.}
%\label{fig:expr-tree-T}
%\trimfigurespacing
%\end{figure}
We ignore the remaining fields (\vari{partial}, \vari{Lweight}, and \vari{Rweight}) until \Cref{sec:algo}.
}
%Also note that the out degree of any internal node can grow with the circuit size.
The semantics of \revision{circuits}~follows the obvious interpretation. We \revision{next} define \revision{their relationship with polynomials} formally:
The semantics of circuits follows the obvious interpretation. We next define their relationship with polynomials formally:
\begin{Definition}[$\polyf(\cdot)$]\label{def:poly-func}
Let \revision{$\polyf(\cdot)$}~denote the function that maps a circuit \revision{$\circuit$}~to its corresponding polynomial. $\polyf(\cdot)$ is defined recursively on \revision{$\circuit$}~as follows, with addition and multiplication following the standard interpretation for polynomials:
Let $\polyf(\cdot)$ denote the function that maps a circuit $\circuit$ to its corresponding polynomial. $\polyf(\cdot)$ is defined recursively on $\circuit$ as follows, with addition and multiplication following the standard interpretation for polynomials:
\begin{equation*}
\polyf(\revision{\circuit}) = \begin{cases}
\polyf(\revision{\circuit_\lchild}) + \polyf(\revision{\circuit_\rchild}) &\text{ if \revision{\circuit}.\type } = \revision{\circplus}\\
\polyf(\revision{\circuit_\lchild}) \cdot \polyf(\revision{\circuit_\rchild}) &\text{ if \revision{\circuit}.\type } = \revision{\circmult}\\
		\revision{\circuit.\val} &\text{ if \revision{\circuit}.\type } \in \{\var, \tnum\}.
\polyf(\circuit) = \begin{cases}
\polyf(\circuit_\lchild) + \polyf(\circuit_\rchild) &\text{ if \circuit.\type } = \circplus\\
\polyf(\circuit_\lchild) \cdot \polyf(\circuit_\rchild) &\text{ if \circuit.\type } = \circmult\\
		\circuit.\val &\text{ if \circuit.\type } \in \{\var, \tnum\}.
\end{cases}
\end{equation*}
\end{Definition}
Note that $\circuit$ need not encode an expression in the standard monomial basis (SMB). As stated previously, however, a polynomial is considered to be in SMB, so the output of $\polyf(\cdot)$ is in SMB. For instance, $\circuit$ could represent a compressed form of the running example, such as $(X + 2Y)(2X - Y)$\revision{
Note that $\circuit$ need not encode an expression in the standard monomial basis (SMB). As stated previously, however, a polynomial is considered to be in SMB, so the output of $\polyf(\cdot)$ is in SMB. For instance, $\circuit$ could represent a compressed form of the running example, such as $(X + 2Y)(2X - Y)$
, as shown in \Cref{fig:circuit}.
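As an informal sketch of how \Cref{def:poly-func} applies to this circuit (the intermediate steps below are written out only for illustration), the two $\circplus$ gates evaluate to $X + 2 \cdot Y$ and $2 \cdot X + (-1) \cdot Y$, and the $\circmult$ gate at the sink multiplies them, so that in SMB
\begin{equation*}
	\polyf(\circuit) = (X + 2Y)(2X - Y) = 2X^2 - XY + 4XY - 2Y^2 = 2X^2 + 3XY - 2Y^2.
\end{equation*}
The circuit itself has only ten gates, and the leaves for $X$, $2$, and $Y$ are each shared by two gates.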
\begin{figure}[t]
\begin{tikzpicture}[thick]
\node[tree_node] (a1) at (0, 0) {$\boldsymbol{X}$};
\node[tree_node] (b1) at (1.5, 0) {$\boldsymbol{2}$};
\node[tree_node] (c1) at (3, 0) {$\boldsymbol{Y}$};
\node[tree_node] (d1) at (4.5, 0) {$\boldsymbol{-1}$};
\node[tree_node] (a2) at (0.75, 1) {$\boldsymbol{\circmult}$};
\node[tree_node] (b2) at (2.25, 1) {$\boldsymbol{\circmult}$};
\node[tree_node] (c2) at (3.75, 1) {$\boldsymbol{\circmult}$};
\node[tree_node] (a3) at (0.55, 2) {$\boldsymbol{\circplus}$};
\node[tree_node] (b3) at (3.75, 2) {$\boldsymbol{\circplus}$};
\node[tree_node] (a4) at (2.25, 3) {$\boldsymbol{\circmult}$};
\draw[->] (a1) -- (a2);
\draw[->, thick] (a1) -- (a3);
\draw[->] (b1) -- (a2);
\draw[->] (b1) -- (b2);
\draw[->] (c1) -- (c2);
\draw[->] (c1) -- (b2);
\draw[->] (d1) -- (c2);
\draw[->] (a2) -- (b3);
\draw[->] (b2) -- (a3);
\draw[->] (c2) -- (b3);
\draw[->] (a3) -- (a4);
\draw[->] (b3) -- (a4);
\draw[->] (a4) -- (2.25, 3.5);
\end{tikzpicture}
\caption{Circuit encoding of the formula (X + 2Y)(2X - Y)}
\label{fig:circuit}
\end{figure}
}
\oldstuff{
\begin{Definition}[Expression Tree Set]\label{def:express-tree-set}$\etreeset{\smb}$ is the set of all possible expression trees $\etree$, such that $poly(\etree) = \poly(\vct{X})$.
\end{Definition}
\revision{
\begin{Definition}[Circuit Set]\label{def:circuit-set}
$\circuitset{\smb}$ is the set of all possible circuits $\circuit$ such that $\polyf(\circuit) = \poly(\vct{X})$.
\end{Definition}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
For our running example, $\circuitset{\smb} \supset \{2X^2 + 3XY - 2Y^2, (X + 2Y)(2X - Y), X(2X - Y) + 2Y(2X - Y), 2X(X + 2Y) - Y(X + 2Y)\}$. Note that~\Cref{def:circuit-set} implies that \revision{$\circuit \in \circuitset{\polyf(\circuit)}$}.
}
For our running example, $\circuitset{\smb} \supset \{2X^2 + 3XY - 2Y^2, (X + 2Y)(2X - Y), X(2X - Y) + 2Y(2X - Y), 2X(X + 2Y) - Y(X + 2Y)\}$. Note that~\Cref{def:circuit-set} implies that $\circuit \in \circuitset{\polyf(\circuit)}$.
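As a quick sanity check (included for intuition only, and not part of the formal development), each partially factorized member listed above expands to the same SMB polynomial:
\begin{align*}
	X(2X - Y) + 2Y(2X - Y) &= 2X^2 - XY + 4XY - 2Y^2 = 2X^2 + 3XY - 2Y^2,\\
	2X(X + 2Y) - Y(X + 2Y) &= 2X^2 + 4XY - XY - 2Y^2 = 2X^2 + 3XY - 2Y^2.
\end{align*}
Hence circuits encoding any of these expressions are mapped by $\polyf(\cdot)$ to the same polynomial, which is why they all belong to the same circuit set.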
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\medskip
@ -265,7 +211,7 @@ Let $\vct{X} = (X_1, \ldots, X_n)$, and $\pdb$ be an $\semNX$-PDB over $\vct{X}$
The \expectProblem is defined as follows:
% \AH{I think we mean $\poly(\vct{X}) = \query(\pxdb)(t)$ instead of $\poly(\vct{X}) = \query(\pdb)(t)$. I changed the following to reflect this.}
% \BG{Correct}
\\\hspace*{5mm}\textbf{Input}: A \revision{circuit $\circuit \in \circuitset{\smb}$}~ for $\poly(\vct{X}) = \query(\pxdb)(t)$
\\\hspace*{5mm}\textbf{Input}: A circuit $\circuit \in \circuitset{\smb}$ for $\poly(\vct{X}) = \query(\pxdb)(t)$
\\\hspace*{5mm}\textbf{Output}: $\expct_{\vct{W} \sim \pd}[\poly(\vct{W})]$
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%