Wordsmithing

Oliver Kennedy 2021-03-27 22:26:31 -04:00
parent 8a37d83737
commit fda627ff38
Signed by: okennedy
GPG key ID: 3E5F9B3ABD3FDB60


@ -3,17 +3,19 @@
\section{Introduction}
\label{sec:intro}
Computing the expectation of a lineage formula in a probabilistic database (PDB) involves two steps. First, the lineage formula is computed (typically this is instrumented in the query), while the second step computes the expectation. The computation might be performed in either the setting of set or bag semantics. In set semantics, the lineage formula is an element of the $PosBool[\vct{X}]$ semiring, where lineage variables are from the set $\vct{X}$ ranging over the elements of $\mathbb{B}$, and computing the expectation of this formula is the same as computing the marginal probability of the output tuple's existence. In bag semantics, we are computing the expected count of a tuple whose lineage formula is an element from the $\mathbb{N}[\vct{X}]$ semiring, with coefficients in the set of natural numbers $\mathbb{N}$, and variables from the set $\vct{X}$, ranging over some given set such as the reals $\mathbb{R}$.
\begin{Example}\label{ex:intro-tbls}
Consider the \ti tables (\cref{fig:ex-shipping}) from an international shipping company. Table Loc lists all the locations to which air transport is provided. Table Route identifies all locations connected via air transit. Elements from both $PosBool[\vct{X}]$ and $\mathbb{N}[\vct{X}]$ semirings can be seen as annotations in attributes $\Phi_{set}$ and $\Phi_{bag}$.
\end{Example}
In their most general form, tuple-independent set-probabilistic databases~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs) answer existential queries (queries for the probability of a specific condition holding over the input database) in two steps: (i) lineage and (ii) probability.
The lineage is a boolean formula, an element of the $\text{PosBool}[\vct{X}]$ semiring, where lineage variables $\vct{X}\in \mathbb{B}^\numvar$ are random variables corresponding to the presence of each of the $\numvar$ input tuples in one possible world of the input database.
The lineage models the relationship between the presence of these input tuples and the query condition being satisfied, and thus the probability of this formula is exactly the query result.
The analogous query in the bag setting~\cite{DBLP:journals/sigmod/GuagliardoL17,feng:2019:sigmod:uncertainty} asks for the expectation of the number (multiplicity) of result tuples that satisfy the query condition.
The process for responding to such queries is also analogous, save that the lineage is a polynomial, an element from the $\mathbb{N}[\vct{X}]$ semiring, with coefficients in the set of natural numbers $\mathbb{N}$ and random variables from the set $\vct{X} \in \mathbb{N}^\numvar$.
The expectation of this polynomial is the query result.
\begin{figure}[t]
\begin{subfigure}[b]{0.33\linewidth}
\centering
\resizebox{!}{9mm}{
\begin{tabular}{ c | c c c}
$Loc$ & City & $\Phi_{set}$ & $\Phi_{bag}$\\
$Loc$ & City$_\ell$ & $\Phi_{set}$ & $\Phi_{bag}$\\
\hline
& Buffalo & $L_a$ & $L_a$\\
& Chicago & $L_b$ & $L_b$\\
@ -50,7 +52,7 @@ Consider the \ti tables (\cref{fig:ex-shipping}) from an international shipping
\multicolumn{1}{c}{\vspace{1mm}}\\
$Q_{2}$ & $\text{City}_1$ & $\Phi_{set}$ & $\Phi_{bag}$ \\
\hline
& Buffalo & $L_a \wedge \top$ & $2L_a$\\
& Chicago & $L_b \wedge \top$ & $2L_b$\\
\end{tabular}
}
\caption{$Q_1$ and $Q_2$ in~\Cref{ex:intro-tbls}}
@ -62,17 +64,33 @@ Consider the \ti tables (\cref{fig:ex-shipping}) from an international shipping
\trimfigurespacing
\end{figure}
\begin{Example}
Now consider the case when a customer service representative needs to expedite a shipment en route to Western Europe. Further assume that $L_c$ and $L_d$ both are set to $\mathbbold{1}$, the semiring multiplicative identity. To find the cities providing air service to either Zurich or Bremen, the query $Q_1 := \pi_{\text{City}_1}\left(\sigma_{\text{City}_2 = "Bremen" ~OR~ \text{City}_2 = "Zurich"}\right.$$\left.(Route)\right)$ might be issued, where the output is a bag PDB of the cities that have air transit to either Zurich and Bremen. The output bag PDB of \cref{subfig:ex-shipping-queries} is a bag \ti, and we see an example of a bag PDBs modeled by a query $Q$ over a set input relation (annotations from $\domain(W_i) = \{0, 1\})$, where the query output is closed with respect to the data model. This generalization to bags can be done WLOG.
\begin{Example}\label{ex:intro-tbls}
Consider the \ti tables (\cref{fig:ex-shipping}) from an international shipping company.
Table Loc lists all the locations of airports.
Table Route identifies all flight routes.
The tuples of both tables are annotated with elements of the $PosBool[\vct{X}]$ ($\Phi_{set}$) and $\mathbb{N}[\vct{X}]$ ($\Phi_{bag}$) semirings that indicate the tuple's presence or multiplicity, respectively.
Tuples of Loc are annotated with random variables $L_i$ that model the absence of delays at the airport on a given day\footnote{We assume for simplicity that these variables are independent events.}.
Tuples of Route are annotated with a constant ($\top$ or $1$, respectively) and are deterministic; queries over this table follow classical query evaluation semantics.
Suppose the customer service agent would like to see if there are locations in closer proximity of the shipment's origination that connect to Chicago by the query $Q_2 := \pi_{\text{City}_1}(Loc$ $\bowtie_{\text{City}_2 = "Chicago"} Q_{1})$. Assume that at this point in time, government regulations are allowing transpotation of goods from the only location in Buffalo, where random variable $L_a = \top$ in the set semantics setting, and the $Q_2$ output table tells us that the shipment can be made from Buffalo. In the bags case $L_a = 1$, and $Q_2$ output then tells us we have $1 \cdot 2 = 2$ possible options to ship from Buffalo to Western Europe.
Consider a customer service representative who needs to expedite a shipment to Western Europe.
The query $Q_1 := \pi_{\text{City}_1}\left(\sigma_{\text{City}_2 = \text{``Bremen''} ~OR~ \text{City}_2 = \text{``Zurich''}}\right.$$\left.(Route)\right)$ asks for all cities with routes to either Zurich or Bremen.
Both routes exist from Chicago, and so the result lineage~\cite{DBLP:conf/pods/GreenKT07} of the corresponding tuple (\cref{subfig:ex-shipping-queries}) indicates that the tuple is deterministically present, either via Zurich or Bremen.
Analogously, under bag semantics Chicago appears in the result twice.
Observe that even when the input is a set (i.e., input tuple annotations are at most $1$), we can still evaluate bag queries over it.
Suppose the representative would like to consider delays from the originating city, as per the query $Q_2 := \pi_{\text{City}_1}(Loc$ $\bowtie_{\text{City}_\ell = \text{City}_1} Q_{1})$.
The resulting lineage formulas (\cref{subfig:ex-shipping-queries}) concisely describe the event of delivering a shipment to Zurich or Bremen without departure delay, or the number of departure-delay-free routes to these cities given an assignment to $L_b$.
If Chicago is delay-free ($L_b = \top$, $L_b = 1$, respectively), there exists a route (set semantics) or there are two routes (bag semantics).
\end{Example}
%The computation of the marginal probability in is a known \sharpphard problem. Its corresponding problem in bag PDBs is computing the expected count of a tuple $\tup$. The computation is a two step process. The first step involves actually computing the lineage formula of $\tup$. The second step then computes the marginal probability (expected count) of the lineage formula of $\tup$, a boolean formula (polynomial) in the set (bag) semantics setting.
It is known that a dichotomy exists for queries over set PDBs, partitioning the set of queries into classes of tractability and intractibility. Queries that are safe can be evaluated using extensional query evaluation, which is linear in the query runtime. Extensional query evaluation essentially performs both steps of the expectation computation in one pass, merging step two into the actual query evaluation. Unsafe queries must, however, be evaluated using intensional query evaluation semantics, which computes step two only \emph{after} step one is computed by the query.
%The computation of the marginal probability in is a known . Its corresponding problem in bag PDBs is computing the expected count of a tuple $\tup$. The computation is a two step process. The first step involves actually computing the lineage formula of $\tup$. The second step then computes the marginal probability (expected count) of the lineage formula of $\tup$, a boolean formula (polynomial) in the set (bag) semantics setting.
A well-known dichotomy~\cite{10.1145/1265530.1265571} separates the common-case \sharpphard problem of computing the probability of a boolean lineage formula from the case where the probability computation can be inlined into the polynomial-time lineage construction process.
Historically, the first step in computing the expectation of the lineage formula is not the bottleneck, since computing the lineage of an output tuple can be done by instrumenting the query, where the additional operations are $O(1)$ time. In the case of set semantics, computing the marginal probability of a boolean formula in general is a problem in \sharpphard, even when the boolean formula is in DNF. In this setting, it is indeed the case that step two is the bottleneck. However, in the case of bag semantics the underlying assumption has been that computing the expected count over the polynomial lineage formula is linear in the size of the polynomial. This is perhaps due to the observation that many modern PDB systems represent the polynomial as a sum of products (SOP), where independent PDB data models like Tuple Independent Databases (\ti) enjoy linearity of expecation in both the sum and product operators. Unlike set semantics, when the lineage is in SOP (bag equivalent to DNF in set semantics), bags enjoy linear time in computing the expectation over the lineage polynomial. However, what can be said about lineage polynomials in a compressed representation (e.g. factorized), i.e., not in SOP form?
Historically, the bottleneck for \emph{set-}probabilistic databases has been the second step; an instrumented query can compute a circuit encoding of the result lineage with at most a constant-factor overhead over the un-instrumented query ((TODO: Find citation)).
Because the probability computation is the bottleneck, it is typical to assume that the lineage formula is provided in disjunctive normal form (DNF), as even when this assumption holds the problem remains \sharpphard in general.
For bag semantics, by contrast, the analogous sum of products (SOP) lineage representation admits a trivial linear-time implementation due to linearity of expectation.
However, what can be said about lineage polynomials (i.e., bag-probabilistic database query results) that are in a compressed (e.g., circuit) representation instead?
In this paper we study computing the expected count of an output bag PDB tuple whose lineage formula is in a compressed representation, using the more general intensional query evaluation semantics.
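As a sketch of why the SOP case is easy, assume (for illustration only) that each variable $X_j$ is an independent $\{0,1\}$-valued random variable with $\Pr[X_j = 1] = p_j$; the symbols $p_j$, $c_i$, $S_i$, and $m$ are introduced here and are not notation used elsewhere in this section.
Under this assumption $X_j^d = X_j$ for any $d \geq 1$, so exponents can be dropped, and linearity of expectation together with independence gives
\[
\mathbb{E}\Big[\sum_{i=1}^{m} c_i \prod_{j \in S_i} X_j\Big] = \sum_{i=1}^{m} c_i \prod_{j \in S_i} p_j,
\]
which can be evaluated in a single pass over the $m$ monomials.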
%
@ -250,7 +268,7 @@ In this paper we study computing the expected count of an output bag PDB tuple w
%%, perhaps because on the surface, the problem is trivially tractable.In fact, as mentioned, it is linear time when the lineage polynomial is encoded in an SOP representation.
%is this computation also linear (in the size of an equivalent compressed representation) when the lineage polynomial of $\tup$ is in compressed form?
%there exist compressed representations of polynomials, e.g., factorizations~\cite{factorized-db}, that can be polynomially more concise than their SOP counterpart.
Such compressed forms naturally occur in typical database optimizations, e.g., projection push-down~\cite{DBLP:books/daglib/0020812}, (where e.g. in the case of a projection followed by a join, addition would be performed prior to multiplication, yielding a product of sums instead of an SOP).
Such compressed forms naturally occur in typical database optimizations, e.g., projection push-down~\cite{DBLP:books/daglib/0020812}, (where e.g. in the case of a projection followed by a join, addition would be performed prior to multiplication, yielding a product of sums instead of a SOP).
\begin{figure}[t]
\begin{subfigure}[b]{0.51\linewidth}
\centering
@ -323,12 +341,15 @@ Such compressed forms naturally occur in typical database optimizations, e.g., p
\label{fig:ex-proj-push}
\end{figure}
\begin{Example}
Consider again the tables in \cref{subfig:ex-shipping-loc} and \cref{subfig:ex-shipping-route} and let us assume that the tuples in $Route$ are annotated with random variables as shown in \cref{subfig:ex-proj-push-q4}. Note that the query $Q_3 := \pi_{\text{City}}(Loc \bowtie_{\text{City} = \text{City}_1}Route)$ is equivalent to $Q_4 := Loc \bowtie\rho_{\text{City}_1 \mapsto \text{City}}\left(\pi_{\text{City}_1}(Route)\right)$, where $Q_4$ performs a projection push-down, producing a compressed annotation. Note, while in this example there is one less gate in the compressed circuit representation of $Q_4$ (\cref{subfig:ex-proj-push-circ-q4}), in general compressed representations of the lineage polynomial can have an exponential reduction of gates in the product width of the query.
Consider again the tables in \cref{subfig:ex-shipping-loc} and \cref{subfig:ex-shipping-route} and let us assume that the tuples in $Route$ are annotated with random variables as shown in \cref{subfig:ex-proj-push-q4}.
Consider the equivalent queries $Q_3 := \pi_{\text{City}_1}(Loc \bowtie_{\text{City}_\ell = \text{City}_1}Route)$ and $Q_4 := Loc \bowtie_{\text{City}_\ell = \text{City}_1}\pi_{\text{City}_1}(Route)$.
The latter's ``pushed down'' projection produces a compressed annotation, both in the polynomial and in its circuit encoding (\cref{subfig:ex-proj-push-circ-q3,subfig:ex-proj-push-circ-q4}).
In general, compressed representations of the lineage polynomial can be exponentially smaller than the polynomial.
\end{Example}
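To illustrate why a compressed form may not admit the same one-pass computation as an SOP, consider a hypothetical product of sums in which two factors share a variable, assuming independent $\{0,1\}$-valued variables with $\Pr[X_j = 1] = p_j$ (the polynomial and the symbols $p_j$ are introduced here for illustration):
\[
(X_1 + X_2)(X_1 + X_3) = X_1^2 + X_1X_3 + X_1X_2 + X_2X_3 .
\]
Because $X_1^2 = X_1$ for a $\{0,1\}$-valued variable, the expectation is $p_1 + p_1p_3 + p_1p_2 + p_2p_3$, whereas substituting the probabilities directly into the factorized form gives $(p_1 + p_2)(p_1 + p_3)$, which has $p_1^2$ in place of $p_1$.
The two values differ by $p_1(1 - p_1)$ and so agree only when $p_1 \in \{0, 1\}$; evaluating the compressed form at the probabilities is therefore not correct in general.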
This suggests that perhaps even Bag-PDBs have higher query processing complexity than deterministic databases.
In this paper, we confirm this intuition, first proving that computing the expected count of a query result tuple is super-linear (\sharpwonehard) in the size of a compressed lineage representation, and then relating the size of the compressed lineage to the cost of answering a deterministic query.
In view of this hardness result (i.e., step 2 of the workflow is indeed the bottleneck in the bag setting for compressed representations), we develop an approximation algorithm for expected counts of SPJU query Bag-PDB output, that is, to our knowledge, the first linear time (in the size of the factorized lineage) $(1-\epsilon)$-\emph{multiplicative} approximation, eliminating step 2 from being the bottleneck of the workflow.
In view of this hardness result (i.e., step 2 of the workflow is the bottleneck in the bag setting as well), we develop an approximation algorithm for expected counts of SPJU query results over Bag-PDBs that is, to our knowledge, the first linear-time (in the size of the factorized lineage) $(1-\epsilon)$-\emph{multiplicative} approximation, removing step 2 as the bottleneck of the workflow.
By extension, this algorithm is only a constant factor slower than deterministic query processing.\footnote{
Monte Carlo sampling~\cite{jampani2008mcdb} is also trivially a constant factor slower, but can only guarantee additive rather than our stronger multiplicative bounds.
}