Addressed @oliver and @atri comments 021822, readied the paper for Pablo.

master
Aaron Huber 2022-02-22 22:16:29 -05:00
parent bfdbe56742
commit 1e27058b0c
11 changed files with 145 additions and 138 deletions


@ -1,27 +1,12 @@
%root: main.tex
%!TEX root=./main.tex
\begin{abstract}
% The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) can be reduced to calculating the probability of the \emph{lineage formula} of the result, a Boolean formula over random variables representing the existence of tuples in the database's possible worlds.
The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) is a % arguably the most
fundamental problem in set-PDBs.
%can be reduced to calculating the probability of the \emph{lineage formula} of the result, a Boolean formula over random variables representing the existence of tuples in the database's possible worlds.
%The analog for bag semantics is a natural number-valued polynomial over random variables that evaluates to the multiplicity of the tuple in each world.
% The analog for bag semantics is computing the expected multiplicity of a result tuple.
%In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected multiplicity) exactly and approximately.
In this work, we study the analog problem for bag semantics: computing a tuple's expected multiplicity exactly and approximately.
% Specifically, we are interested in the fine-grained complexity of computing this type of expectation based on a query result tuple's lineage polynomial which encodes how the tuple's multiplicity is computed based on the multiplicity of input tuples.
% Furthermore, we study how the complexity of this problem compares to
We are specifically
In this work, we study the problem of computing a tuple's expected multiplicity over bag-\abbrTIDB\xplural exactly and approximately.
We refer to bag-\abbrTIDB\xplural as \abbrCTIDB\xplural, where $\bound$ is the bound on the maximum multiplicity. We are specifically
interested in the fine-grained complexity and how it compares to the complexity of deterministic query evaluation algorithms --- if these complexities are comparable, it opens the door to practical deployment of probabilistic databases.
Unfortunately, % we show the reverse;
our results imply that computing expected multiplicities for Bag-PDBs based on the results produced by such query evaluation algorithms introduces super-linear overhead (under parameterized complexity hardness assumptions/conjectures).
% Such factorized representations are necessary to realize the performance of modern join algorithms (e.g., worst-case optimal joins), and so our results imply that a Bag-PDB doing exact computations (via these factorized representations) can never be as fast as a classical (deterministic) database.
% The problem stays hard even if
% This is the case even if
%all input tuples have a fixed probability $\prob$ (s.t. $\prob \in (0,1)$).\BG{Replace with this because notion of hardness unclear: This is the case even if \ldots}
%Atri: Fair enough: droppped.
%We proceed to study how approximate multiplicities using lineage polynomials of result tuples of positive relational algebra queries ($\raPlus$) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs).
We proceed to study approximation of expected multiplicities of result tuples of positive relational algebra queries ($\raPlus$) over \AHchange{\abbrCTIDB\xplural} and for a non-trivial subclass of block-independent databases (\abbrBIDB\xplural).
our results imply that computing expected multiplicities for \abbrCTIDB\xplural based on the results produced by such query evaluation algorithms introduces super-linear overhead (under parameterized complexity hardness assumptions/conjectures).
We proceed to study approximation of expected multiplicities of result tuples of positive relational algebra queries ($\raPlus$) over \abbrCTIDB\xplural and for a non-trivial subclass of block-independent databases (\abbrBIDB\xplural).
We develop a sampling algorithm that computes a $(1 \pm \epsilon)$-approximation of the expected multiplicity of an output tuple in time linear in the runtime of a comparable deterministic query for any $\raPlus$ query.
% By removing Bag-PDB's reliance on the sum-of-products representation of polynomials, this result paves the way for future work on PDBs that are competitive with deterministic databases.
\end{abstract}


@ -2,22 +2,19 @@
%!TEX root=./main.tex
\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
In \Cref{sec:hard}, we showed that \secrev{
\Cref{prob:bag-pdb-poly-expected} cannot be solved in $\bigO{\qruntime{\optquery{\query},\tupset,\bound}}$ runtime.
}
With this result, we now design an approximation algorithm for our problem that runs in $\bigO{\abs{\circuit}}$ for a very broad class of circuits\secrev{, (thus affirming~\Cref{prob:intro-stmt};} see the discussion after \Cref{lem:val-ub} for more).
The following approximation algorithm applies to \secrev{
\abbrCTIDB lineage polynomials and general \abbrBIDB (over bag-$\raPlus$ query semantics) lineage polynomials in practice, where for the latter we note that a $1$-\abbrTIDB is equivalently a $1$-\abbrBIDB (blocks are size $1$) and our experimental results (see~\Cref{app:subsec:experiment}) using queries from the PDBench benchmark~\cite{pdbench} show a low $\gamma$ (see~\Cref{def:param-gamma}) supporting the notion that our bounds hold for general \abbrBIDB in practice.
}
\secrev{Corresponding proofs and pseudocode for all formal statements and algorithms }
In \Cref{sec:hard}, we showed that \Cref{prob:bag-pdb-poly-expected} cannot be solved in $\bigO{\qruntime{\optquery{\query},\tupset,\bound}}$ runtime. In light of this, we seek an approximation algorithm that runs in time $\bigO{\qruntime{\optquery{\query},\tupset,\bound}}$. We obtain this result via circuits:
our approximation algorithm for this problem runs in $\bigO{\abs{\circuit}}$ for a very broad class of circuits (thus affirming~\Cref{prob:intro-stmt}; see the discussion after \Cref{lem:val-ub} for more).
The following approximation algorithm applies to bag query semantics over both
\abbrCTIDB lineage polynomials and general \abbrBIDB lineage polynomials in practice, where for the latter we note that a $1$-\abbrTIDB is equivalently a \abbrBIDB (blocks are size $1$). Our experimental results (see~\Cref{app:subsec:experiment}), which use queries from the PDBench benchmark~\cite{pdbench}, show a low $\gamma$ (see~\Cref{def:param-gamma}), supporting the notion that our bounds hold for general \abbrBIDB in practice.
Corresponding proofs and pseudocode for all formal statements and algorithms
can be found in \Cref{sec:proofs-approx-alg}.
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
\subsection{Preliminaries and some more notation}
We now introduce definitions and notation related to circuits and polynomials that we will need to state our upper bound results. First we introduce the expansion $\expansion{\circuit}$ of circuit $\circuit$ which % encodes the reduced polynomial for $\polyf\inparen{\circuit}$ and is the basis
is used in our
\secrev{auxiliary algorithm for sampling monomials when computing the approximation. }% (part of our approximation algorithm).
is used in our auxiliary algorithm~\Cref{alg:sample} for sampling monomials when computing the approximation. % (part of our approximation algorithm).
\begin{Definition}[$\expansion{\circuit}$]\label{def:expand-circuit}
For a circuit $\circuit$, we define $\expansion{\circuit}$ as a list of tuples $(\monom, \coef)$, where $\monom$ is a set of variables and $\coef \in \domN$.
@ -66,11 +63,12 @@ $\expansion{\circuit}$
that need to be `canceled' when monomials with dependent variables are removed (\Cref{subsec:one-bidb}). %def:hen it is modded with $\mathcal{B}$ (\Cref{def:mod-set-polys}).
Let $\isInd{\cdot}$ be a boolean function returning true if monomial $\encMon$ is composed of independent variables and false otherwise; further, let $\indicator{\theta}$ also be a boolean function returning true if $\theta$ evaluates to true.
\begin{Definition}[Parameter $\gamma$]\label{def:param-gamma}
Given a \abbrBIDB circuit $\circuit$ define
Given a \abbrOneBIDB circuit $\circuit$ define
\[\gamma(\circuit)=\frac{\sum_{(\monom, \coef)\in \expansion{\circuit}} \abs{\coef}\cdot \indicator{\neg\isInd{\encMon}} }%\encMon\mod{\mathcal{B}}\equiv 0}}
{\abs{\circuit}(1,\ldots, 1)}.\]
\end{Definition}
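As a small illustration of \Cref{def:param-gamma} (a sketch under assumptions of our own, not an example taken from the experiments), suppose $X_1$ and $X_2$ belong to the same block, $X_3$ belongs to a different block, and $\isInd{\cdot}$ is false exactly for monomials containing two distinct variables of one block. For the hypothetical circuit $\circuit = (X_1 + X_3)\cdot(X_2 + X_3)$, the expansion consists of the monomials $X_1X_2$, $X_1X_3$, $X_3X_2$, and $X_3X_3$, each with coefficient $1$, and only $X_1X_2$ is dependent, so
\[\gamma(\circuit) = \frac{1}{\abs{\circuit}(1,\ldots, 1)} = \frac{1}{(1+1)\cdot(1+1)} = \frac{1}{4}.\]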
\subsection{Our main result}\label{sec:algo:sub:main-result}
We solve~\Cref{prob:intro-stmt} for any fixed $\epsilon > 0$ in what follows.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Algorithm Idea}
@ -80,9 +78,8 @@ Our approximation algorithm (\approxq pseudo code in \Cref{sec:proof-lem-approx-
is based on the following observation.
% The algorithm (\approxq detailed in \Cref{alg:mon-sam}) to prove \Cref{lem:approx-alg} follows from the following observation.
Given a lineage polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over
\secrev{
$1$-\abbrBIDB (recall that all \abbrCTIDB can be reduced to $1$-\abbrBIDB by~\Cref{def:ctidb-reduct}), we have: % can exactly represent $\rpoly(\vct{X})$ as follows:
}
\abbrOneBIDB (recall that all \abbrCTIDB can be reduced to \abbrOneBIDB by~\Cref{def:ctidb-reduct}), we have: % can exactly represent $\rpoly(\vct{X})$ as follows:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{equation}
@ -121,7 +118,7 @@ and computing the average of $\vari{Y}$ gives us our final estimate. \onepass is
% We next present a few corollaries of \Cref{lem:approx-alg}.
\begin{Theorem}
\label{cor:approx-algo-const-p}
Let \circuit be an arbitrary $1$-\abbrBIDB circuit, define $\poly(\vct{X})=\polyf(\circuit)$, let $k=\degree(\circuit)$, and let $\gamma=\gamma(\circuit)$. Further let it be the case that $\prob_i\ge \prob_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$
Let \circuit be an arbitrary \emph{\abbrOneBIDB} circuit, define $\poly(\vct{X})=\polyf(\circuit)$, let $k=\degree(\circuit)$, and let $\gamma=\gamma(\circuit)$. Further let it be the case that $\prob_i\ge \prob_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$
satisfying
\begin{equation}
\label{eq:approx-algo-bound-main}
@ -135,16 +132,23 @@ O\left(\left(\size(\circuit) + \frac{\log{\frac{1}{\conf}}\cdot k\cdot \log{k} \
In particular, if $\prob_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)$.
\end{Theorem}
The restriction on $\gamma$ is satisfied by any \secrev{
$1$-\abbrTIDB (where $\gamma=0$ in the equivalent $1$-\abbrBIDB of~\Cref{def:ctidb-reduct})
} as well as for all three queries of the PDBench \abbrBIDB benchmark (see \Cref{app:subsec:experiment} for experimental results).
\begin{Lemma}
Given a \emph{\abbrOneBIDB} computed from the reduction of~\Cref{def:ctidb-reduct}, $\gamma\inparen{\circuit}=\inparen{c + 1}^{-k}$.
\end{Lemma}
\begin{Corollary}
Let \circuit be any \abbrCTIDB circuit, define $\poly\inparen{\vct{X}} = \polyf\inparen{\circuit}$, let $k =\degree\inparen{\circuit}$ and $\gamma = \gamma\inparen{\circuit}$, and let $\prob_i\ge\prob_0$ for all $i\in\pbox{\numvar}$. Then the results of~\Cref{cor:approx-algo-const-p} hold for estimating $\rpoly\inparen{\prob_1,\ldots, \prob_\numvar}$.
\end{Corollary}
We briefly connect the runtime in \Cref{eq:approx-algo-runtime} to the algorithm outline earlier (where we ignore the dependence on $\multc{\cdot}{\cdot}$, which is needed to handle the cost of arithmetic operations over integers). The $\size(\circuit)$ term comes from the time taken to run \onepass once (\onepass essentially computes $\abs{\circuit}(1,\ldots, 1)$ using the natural circuit evaluation algorithm on $\circuit$). We make $\frac{\log{\frac{1}{\conf}}}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot \prob_0^{2k}}$ many calls to \sampmon (each of which essentially traces $O(k)$ random sink-to-source paths in $\circuit$, all of which by definition have length at most $\depth(\circuit)$).
The restriction on $\gamma$ is satisfied by any
$1$-\abbrTIDB (where $\gamma=0$ in the equivalent $1$-\abbrBIDB of~\Cref{def:ctidb-reduct})
as well as for all three queries of the PDBench \abbrBIDB benchmark (see \Cref{app:subsec:experiment} for experimental results).
We briefly connect the runtime in \Cref{eq:approx-algo-runtime} to the algorithm outline earlier (where we ignore the dependence on $\multc{\cdot}{\cdot}$, which is needed to handle the cost of arithmetic operations over integers). The $\size(\circuit)$ term comes from the time taken to run \onepass once (\onepass essentially computes $\abs{\circuit}(1,\ldots, 1)$ using the natural circuit evaluation algorithm on $\circuit$). We make $\frac{\log{\frac{1}{\conf}}}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot \prob_0^{2k}}$ many calls to \sampmon (each of which essentially traces $O(k)$ random sink-to-source paths in $\circuit$, all of which by definition have length at most $\depth(\circuit)$).
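To get a feel for the number of \sampmon calls, consider the purely illustrative values $\error' = 0.1$, $\conf = 0.05$, $\gamma = 0$, $\prob_0 = 0.5$, and $k = 2$ (our own choice, not taken from the experiments), and assume natural logarithms while ignoring constant factors:
\[\frac{\log{\frac{1}{\conf}}}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot \prob_0^{2k}} = \frac{\ln 20}{0.01\cdot 1\cdot 0.0625} \approx \frac{2.9957}{0.000625} \approx 4793.\]
The $\prob_0^{2k}$ factor dominates in this setting: raising $k$ to $3$ with the other values fixed multiplies the count by $1/\prob_0^{2} = 4$.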
Finally, we address the $\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}$ term in the runtime. %In \Cref{susec:proof-val-up}, we show the following:
\begin{Lemma}
\label{lem:val-ub}
For any \abbrBIDB circuit $\circuit$ with $\degree(\circuit)=k$, we have
For any \emph{\abbrOneBIDB} circuit $\circuit$ with $\degree(\circuit)=k$, we have
$\abs{\circuit}(1,\ldots, 1)\le 2^{2^k\cdot \depth(\circuit)}.$
Further, %under either of the following conditions:
%\begin{enumerate}
@ -156,16 +160,16 @@ we have $\abs{\circuit}(1,\ldots, 1)\le \size(\circuit)^{O(k)}.$
Note that the above implies that, under the assumption from \Cref{cor:approx-algo-const-p} that $\prob_0>0$ and $\gamma<1$ are absolute constants, the runtime there simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$. If $\circuit$ is a tree, then the runtime simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$, which answers \Cref{prob:intro-stmt} in the affirmative for such circuits.
\AH{Is it standard to assume that in the asymptotic notation above, $\error$ and $\delta$ are constant? Otherwise this does not uphold~\Cref{prob:intro-stmt}.}
%\AH{Is it standard to assume that in the asymptotic notation above, $\error$ and $\delta$ are constant? Otherwise this does not uphold~\Cref{prob:intro-stmt}.}
Finally, note that by \Cref{prop:circuit-depth} and \Cref{lem:circ-model-runtime} for any $\raPlus$ query $\query$, there exists a circuit $\circuit^*$ for $\apolyqdt$ such that $\depth(\circuit^*)\le O_{|Q|}(\log{n})$ and $\size(\circuit^*)\le O_k\inparen{\qruntime{\query, \dbbase}}$. Using this along with \Cref{lem:val-ub}, \Cref{cor:approx-algo-const-p} and the fact that $n\le \qruntime{\query, \dbbase}$, we answer \Cref{prob:big-o-joint-steps} in the affirmative as follows:
\begin{Corollary}
\label{cor:approx-algo-punchline}
Let $\query$ be an $\raPlus$ query and $\pdb$ be a $1$-\abbrBIDB such that $p_0>0$ and $\gamma<1$ (with $p_0,\gamma$ as in \Cref{cor:approx-algo-const-p}) are absolute constants. Let $\poly(\vct{X})=\apolyqdt$ for any result tuple $\tup$ with $\deg(\poly)=k$. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time $O_{k,|Q|,\error',\conf}\inparen{\qruntime{\query, \tupset, \bound}}$ (given $\query,\tupset$ and $p_i$ for each $i\in [n]$ that defines $\pd$).
Let $\query$ be an $\raPlus$ query and $\pdb$ be a \emph{\abbrOneBIDB} such that $p_0>0$ and $\gamma<1$ (with $p_0,\gamma$ as in \Cref{cor:approx-algo-const-p}) are absolute constants. Let $\poly(\vct{X})=\apolyqdt$ for any result tuple $\tup$ with $\deg(\poly)=k$. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time $O_{k,|Q|,\error',\conf}\inparen{\qruntime{\query, \tupset, \bound}}$ (given $\query,\tupset$ and $p_i$ for each $i\in [n]$ that defines $\pd$).
%Let $\poly(\vct{X})$ be a \abbrBIDB-lineage polynomial correspoding to an \abbrBIDB circuit $\circuit$ that satisfies the specific conditions in \Cref{lem:val-ub}. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time
% $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$. % for the case when $\circuit$ satisfies the specific conditions in \Cref{lem:val-ub}.
\end{Corollary}
\AH{What is $\abs{\query}$? Isn't that just $k$?}
%\AH{What is $\abs{\query}$? Isn't that just $k$?}
If we want to approximate the expected multiplicities of all $Z=O(n^k)$ result tuples $\tup$ simultaneously, we just need to run the above result with $\conf$ replaced by $\frac \conf Z$. Note this increases the runtime by only a logarithmic factor.
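The replacement of $\conf$ by $\frac \conf Z$ is the standard union bound, sketched here for completeness. If each of the $Z$ estimates falls outside its error bound with probability at most $\frac \conf Z$, then the probability that at least one estimate does so is at most $Z\cdot\frac \conf Z=\conf$. The runtime depends on the confidence only through the factor
\[\log{\frac{Z}{\conf}} = \log{\frac{1}{\conf}} + \log{Z} = \log{\frac{1}{\conf}} + O(k\log{n}),\]
where the last step uses $Z = O(n^k)$.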


@ -25,12 +25,12 @@
%That is for \bis that fulfill this restriction approximating the expectation of results of SPJU queries is only has a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim).
% with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB.
In~\Cref{sec:intro}, we introduced the structure $T_{det}\inparen{\cdot}$ to analyze the runtime complexity of~\Cref{prob:expect-mult}.
To decouple our results from specific join algorithms, we first abstract the cost of a join.
To decouple our results from specific join algorithms, we first lower bound the cost of a join.
\begin{Definition}[Join Cost]
\label{def:join-cost}
Denote by $\jointime{R_1, \ldots, R_m}$ the runtime of an algorithm for computing the $m$-ary join $R_1 \bowtie \ldots \bowtie R_m$.
We require only that the algorithm must enumerate its output, i.e., that $\jointime{R_1, \ldots, R_m} \geq |R_1 \bowtie \ldots \bowtie R_m|$.
We require only that the algorithm must enumerate its output, i.e., that $\jointime{R_1, \ldots, R_m} \geq |R_1 \bowtie \ldots \bowtie R_m|$. With this definition of $\jointime{\cdot}$, worst-case optimal join algorithms are handled.
\end{Definition}
Worst-case optimal join algorithms~\cite{skew,ngo-survey} and query evaluation via factorized databases~\cite{factorized-db} (as well as work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as $\raPlus$ queries (though the query size is data dependent).
@ -58,12 +58,12 @@ For these algorithms, $\jointime{R_1, \ldots, R_n}$ is linear in the {\em AGM bo
}\\
Under this model, an $\raPlus$ query $\query$ evaluated over database $\gentupset$ has runtime $O(\qruntimenoopt{Q,\gentupset})$.
Under this model, an $\raPlus$ query $\query$ evaluated over database $\gentupset$ has runtime $O(\qruntimenoopt{Q,\gentupset, \bound})$.
We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.
%Observe that
% () .\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} It can be verified that the above cost model on the corresponding $\raPlus$ join queries correctly captures the runtime of current best known .
Finally, \Cref{lem:circ-model-runtime} and \Cref{lem:tlc-is-the-same-as-det} show that for any $\raPlus$ query $\query$ and $\tupset$, there exists a circuit $\circuit^*$ such that $\timeOf{\abbrStepOne}(Q,\tupset,\circuit^*)$ and $|\circuit^*|$ are both $O(\qruntimenoopt{Q, \tupset,\bound})$. Recall we assumed these two bounds when we moved from \Cref{prob:big-o-joint-steps} to \Cref{prob:intro-stmt}.
\Cref{lem:circ-model-runtime} and \Cref{lem:tlc-is-the-same-as-det} show that for any $\raPlus$ query $\query$ and $\tupset$, there exists a circuit $\circuit^*$ such that $\timeOf{\abbrStepOne}(Q,\tupset,\circuit^*)$ and $|\circuit^*|$ are both $O(\qruntimenoopt{\optquery{\query}, \tupset,\bound})$. Recall we assumed these two bounds when we moved from \Cref{prob:big-o-joint-steps} to \Cref{prob:intro-stmt}. Lastly, we can handle FAQs and factorized databases by allowing for optimization, i.e., $\optquery{\query}$.
%
%We now make a simple observation on the above cost model:
%\begin{proposition}


@ -2,13 +2,11 @@
%root: main.tex
\section{Introduction}\label{sec:intro}
\secrev{
This work explores the problem of computing the expectation of a tuple's multiplicity in a specific construction of bag \abbrTIDB, which we call a \abbrCTIDB. A \abbrCTIDB,
This work explores the problem of computing the expectation of a tuple's multiplicity in bag \abbrTIDB\xplural, which we term \abbrCTIDB\xplural. A \abbrCTIDB,
$\pdb = \inparen{\worlds, \bpd}$ encodes a bag of uncertain tuples such that each possible tuple encoded in $\pdb$ has a multiplicity of at most $\bound$. $\tupset$ is the set of tuples appearing across all possible worlds, and the set of all worlds is encoded in $\worlds$, which is the set of all vectors of length $\numvar=\abs{\tupset}$ such that each index corresponds to a distinct $\tup \in \tupset$ storing its multiplicity and $\bpd$ is the probability distribution over $\worlds$. A given world $\worldvec \in\worlds$ can be interpreted such that, for each $\tup \in \tupset$, $\worldvec_{\tup}$ is the multiplicity of $\tup$ in $\worldvec$. The probability distribution $\bpd$ for any tuple $\tup$ can then be encoded as $\prob_{\tup, j} = \probOf\pbox{\worldvec_{\tup} = j}$ (for $j \in\pbox{\bound}$), where each tuple multiplicity combination $\inparen{\inparen{\tup, \bound} \in \tupset\times\pbox{\bound}}$ %distribution
is an independent random event. %for $\tup \in \tupset$.
}
\secrev{
Allowing for $\leq \bound$ multiplicities across all tuples gives rise to having $\leq \inparen{\bound+1}^\numvar$ possible worlds instead of the usual $2^\numvar$ possible worlds of a $1$-\abbrTIDB, which (assuming set query semantics), is the same as the traditional set \abbrTIDB.
In this work, since we are generally considering bag query input, we will only be considering bag query semantics. We denote by $\query\inparen{\worldvec}\inparen{\tup}$ the multiplicity of $\tup$ in query $\query$ over possible world $\worldvec\in\worlds$.
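For instance (illustrative values of our own choosing), with $\numvar = 2$ tuples and $\bound = 3$, each tuple takes a multiplicity in $\{0,1,2,3\}$, so there are at most
\[\inparen{\bound+1}^\numvar = 4^2 = 16\]
possible worlds, compared with $2^\numvar = 4$ for a $1$-\abbrTIDB over the same tuples.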
@ -107,19 +105,12 @@ Allowing for unbounded $c$ is an interesting open problem.
\mypar{Hardness of Set Query Semantics and Bag Query Semantics}
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and the data complexity of the problem in general has been shown by Dalvi and Suciu to be \sharpphard~\cite{10.1145/1265530.1265571}. For our setting, there exists a trivial polytime algorithm to compute~\Cref{prob:expect-mult} for any $\raPlus$ query over a \abbrCTIDB: by linearity of expectation, one simply computes the expectation over a `sum-of-products' representation of the query operations of $\query\inparen{\pdb}\inparen{\tup}$.
Since we can compute~\Cref{prob:expect-mult} in polynomial time, the interesting question that we explore deals with analyzing the hardness of computing expectation using fine-grained analysis and parameterized complexity, where we are interested in the exponent of polynomial runtime.
}
%\sout{
%\mypar{Example that can perhaps be used later on (using commented out figure above)}
%Given a \abbrCTIDB $\pdb$ with $\numvar$ tuples, we can encode a possible world by the vector $\vct{W} \in \inset{0,\ldots, c}^\numvar$, with the intuitive interpretation when bit $W_i = j$, then tuple $\tup_i$ with multiplicity $j$ is selected, with $\tup_i$ not existing for the special case of $j = 0$. For the example in ~\Cref{fig:ctidb-red}, we have that for \abbrCTIDB $\textbf{R}$, $\numvar = 2$. Then, e.g., arbitrary world vector $\vct{W} = [2, 3]$ encodes the possible world $\db = \inset{\intup{a, 2}, \intup{b, 3}}$ Computing ~\Cref{prob:expect-mult} for tuple $\tup_2$ in ~\Cref{fig:ctidb-red} when $\query = \mathbf{\rel}$ then becomes $\expct_{\randDB\sim\pd}\pbox{\mathbf{\rel}\inparen{\tup_2}} = 1\cdot\prob_{2,1} + 2\cdot\prob_{2,2} + 3\cdot\prob_{2,3} = 1\cdot 0.2 + 2\cdot 0.35 + 3\cdot 0.15 = 1.35$.
%}
\secrev{
Specifically, in this work we ask if~\Cref{prob:expect-mult} can be solved in time linear in the runtime of an equivalent deterministic query. If this is true, then this would open up the way for deployment of \abbrCTIDB\xplural in practice. To analyze this question we denote by $\timeOf{}^*(Q,\pdb)$ the optimal runtime complexity of computing~\Cref{prob:expect-mult} over \abbrCTIDB $\pdb$.
Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$, deterministic database $\gentupset$, and multiplicity bound $\bound$. Being we consider $\raPlus$ queries in which order of operators can impact runtime, we denote the optimized $\raPlus$ query as $\optquery{\query} = \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.
Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$, deterministic database $\gentupset$, and multiplicity bound $\bound$. This paper considers $\raPlus$ queries for which order of operations is \emph{explicit}, as opposed to other query languages, e.g. Datalog, UCQ. Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} = \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{Note that our work applies to any $\query \in\raPlus$, which implies that specific heuristics for choosing an optimized query can be abstracted away, i.e., our work does not consider heuristic techniques.}
\begin{table}[h!]
\begin{table}[t!]
\begin{tabular}{|p{0.43\textwidth}|p{0.12\textwidth}|p{0.35\textwidth}|}
\hline
Lower bound on $\timeOf{}^*(\query,\pdb)$ & Num. $\bpd$s & Hardness Assumption\\
@ -141,7 +132,7 @@ Our question is whether or not it is always true that $\timeOf{}^*\inparen{\quer
Specifically, depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le \bigO{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$, where $k$ is the join width (our notion of join width follows from~\Cref{def:degree-of-poly} and~\Cref{fig:nxDBSemantics}) of the query $\query$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries).
What our lower bound in the third row says is that one cannot get more than a polynomial improvement over essentially the trivial algorithm for~\Cref{prob:expect-mult}.
However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace the $\qruntime{\optquery{\query}, \tupset, \bound}$ by just $\numvar$ (indeed these results follow from known lower bound for deterministic query processing). Our contribution is to then identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace the $\qruntime{\optquery{\query}, \tupset, \bound}$ by just $\numvar$ (indeed these results follow from known lower bounds for deterministic query processing). Our contribution is to then identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
\mypar{Our upper bound results} We introduce a $(1\pm \epsilon)$-approximation algorithm that computes~\Cref{prob:expect-mult} in time $O_\epsilon\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}$. This means, when we are okay with approximation, that we solve~\Cref{prob:expect-mult} in time linear in the runtime of the deterministic query %$\timeOf{Approx}^*\inparen{\query, \pdb}\leq\qruntim{\optquery{\query},\tupset,\bound}$ (where $\timeOf{Approx}^*\inparen{\cdot}$ denotes runtime of approximation algorithm),
and bag \abbrPDB\xplural are deployable in practice.
@ -155,14 +146,13 @@ In contrast, known approximation techniques (\cite{DBLP:conf/icde/OlteanuHK10,DB
Further, our approximation algorithm works for a more general notion of bag \abbrPDB\xplural beyond \abbrCTIDB\xplural
%we generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases
(see \Cref{subsec:tidbs-and-bidbs}). %(\abbrBIDB\xplural).
}
\secrev{
\subsection{Polynomial Equivalence}
A common encoding of probabilistic databases (e.g., in \cite{IL84a,Imielinski1989IncompleteII,Antova_fastand,DBLP:conf/vldb/AgrawalBSHNSW06} and many others) relies on annotating tuples with lineages, propositional formulas that describe the set of possible worlds that the tuple appears in. The bag semantics analog is a provenance/lineage polynomial (see~\Cref{fig:nxDBSemantics}) $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07}, a polynomial with non-zero integer coefficients and exponents, over integer variables $\vct{X}$ encoding input tuple multiplicities.
\subsection{Polynomial Equivalence}\label{sec:intro-poly-equiv}
A common encoding of probabilistic databases (e.g., in \cite{IL84a,Imielinski1989IncompleteII,Antova_fastand,DBLP:conf/vldb/AgrawalBSHNSW06} and many others) relies on annotating tuples with lineages or propositional formulas that describe the set of possible worlds that the tuple appears in. The bag semantics analog is a provenance/lineage polynomial (see~\Cref{fig:nxDBSemantics}) $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07}, a polynomial with non-zero integer coefficients and exponents, over integer variables $\vct{X}$ encoding input tuple multiplicities.
%Intuitively, a \abbrCTIDB lends itself to a useful reduction to a specific type of block independent database (\abbrBIDB) which we refer to as a $1$-\abbrBIDB. A $1$-\abbrBIDB is a \abbrBIDB in the traditional sense of allowing no duplicate tuples, \emph{but} where we use bag query semantics instead of the usual set query semantics.
%(see~\Cref{fig:nxDBSemantics} for a definition)
\begin{figure}
\begin{figure}[b!]
\begin{align*}
\polyqdt{\project_A(\query)}{\gentupset}{\tup} =& \sum_{\tup': \project_A(\tup') = \tup} \polyqdt{\query}{\gentupset}{\tup'} &
\polyqdt{\query_1 \union \query_2}{\gentupset}{\tup} =& \polyqdt{\query_1}{\gentupset}{\tup} + \polyqdt{\query_2}{\gentupset}{\tup}\\
@ -179,7 +169,7 @@ A common encoding of probabilistic databases (e.g., in \cite{IL84a,Imielinski198
\end{aligned}\\
& & & \polyqdt{\rel}{\gentupset}{\tup} = X_\tup%\sum_{j \in [c]}j\cdot\pVar_{\tup, j}
\end{align*}\\[-10mm]
\caption{Construction of the lineage (polynomial) for an $\raPlus$ query $\query$ over a arbitrary deterministic database $\gentupset$, where $\vct{X}$ consists of all $X_\tup$ over all $\rel$ in $\gentupset$ and $\tup$ in $\rel$. Here $\gentupset.\rel$ denotes the instance of relation $\rel$ in $\gentupset$. Please note, after we introduce the reduction to $1$-\abbrBIDB, the base case will be expressed alternatively.}
\caption{Construction of the lineage (polynomial) for an $\raPlus$ query $\query$ over an arbitrary deterministic database $\gentupset$, where $\vct{X}$ consists of all $X_\tup$ over all $\rel$ in $\gentupset$ and $\tup$ in $\rel$. Here $\gentupset.\rel$ denotes the instance of relation $\rel$ in $\gentupset$. Please note, after we introduce the reduction to $1$-\abbrBIDB, the base case will be expressed alternatively.}
\label{fig:nxDBSemantics}
\end{figure}
@ -194,10 +184,8 @@ multiplicity of the polynomial $\apolyqdt$ (i.e., $\expct_{\vct{W}\sim \pdassign
We note that computing \Cref{prob:expect-mult}
is equivalent to (i.e., yields the same result as) computing \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
%In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for several models of probabilistic databases and various encodings of such polynomials.
}
\secrev{
All of our results rely on working with a {\em reduced} form $\inparen{\poly}$ of the lineage polynomial $\poly$. In fact, it turns out that for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the $1$-\abbrTIDB. This is also true when the query input(s) is a block independent disjoint probabilistice database (with tuple multiplicity of at most $1$), which we refer to as a $1$-\abbrBIDB.
All of our results rely on working with a {\em reduced} form $\inparen{\rpoly}$ of the lineage polynomial $\poly$. In fact, it turns out that for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the $1$-\abbrTIDB. This is also true when the query input(s) is a block independent disjoint probabilistic database~\cite{DBLP:conf/icde/OlteanuHK10} (bag query semantics with tuple multiplicity at most $1$), for which the proof of~\Cref{lem:tidb-reduce-poly} (introduced shortly) holds.
% For our results to be applicable to \abbrCTIDB\xplural, we introduce the following reduction.
%\begin{Definition}
%Any \abbrCTIDB $\pdb$, can be reduced to an equivalent $1$-\abbrBIDB $\pdb'$ in the following manner. For each $\tup_i \in \tupset$, create a block of $\bound + 1$ disjoint \abbrBIDB tuples in $\pdb'$ such that each tuple in the newly formed block is mapped to its own boolean variable $X_{i, j}$ for $i \in \abs{D}$ and $j \in \pbox{c+1}$. Then, given $\worldvec \in \worlds$, the equivalent world in $\pdb'$ will set each variable $X_{i, j} = 1$ for each $\worldvec\pbox{i} = j$, while $\inparen{\text{for }\ell \neq j}$ all other $X_{i, \ell} \in \vct{X}$ of $\pdb'$ are set to $0$.
@ -207,12 +195,12 @@ All of our results rely on working with a {\em reduced} form $\inparen{\poly}$ o
%\end{Example}
Next, we motivate this reduced polynomial.
Consider the query $\query_1$ defined as follows over the bag relations of \Cref{fig:two-step}:
}
\begin{lstlisting}
SELECT 1 FROM T $t_1$, Route r, T $t_2$
SELECT 1 FROM T $t_1$, R r, T $t_2$
WHERE $t_1$.city = r.city1 AND $t_2$.city = r.city2
\end{lstlisting}
\secrev{
It can be verified that $\poly\inparen{A, B, C, E, X, Y, Z}$ for the sole result tuple (i.e., the count) of $\query_1$ is $AXB + BYE + BZC$. Now consider the product query $\query_1^2 = \query_1 \times \query_1$.
The lineage polynomial for $\query_1^2$ is given by $\poly_1^2\inparen{A, B, C, E, X, Y, Z}$
$$
@ -222,7 +210,7 @@ To compute $\expct\pbox{\poly_1^2}$ we can use linearity of expectation and push
%the expectation is $\expct\pbox{A^2X^2B^2} = A\cdot\prob_A\cdot\inparen{\sum\limits_{i \in [2]}X_i\cdot \prob_{X, i}}\cdot B\prob_B$ for $X \in \inset{0, 1, 2}$.
Denote the variables of $\poly$ to be $\vars{\poly}.$ In the \abbrCTIDB setting, $\poly\inparen{\vct{X}}$ has an equivalent reformulation $\inparen{\refpoly{}}$ that is of use to us. Given $X_\tup \in\vars{\poly}$, by definition $X_\tup \in\inset{0,\ldots, c}$. We can replace $X_\tup$ by $\sum_{j\in\pbox{\bound}}X_{\tup, j}$ where each $X_{\tup, j}\in\inset{0, 1}$. Then for any $\worldvec\in\worlds$, we set $X_{\tup, j} = 1$ for $\worldvec_\tup = j$, while $X_{\tup, j'} = 0$ for all $j'\neq j\in\pbox{\bound}$. By construction then $\poly\inparen{\vct{X}}\equiv\refpoly{}\inparen{\vct{X_R}}$ $\inparen{\vct{X_R} = \vars{\refpoly{}}}$ since for any $X_\tup\in\vars{\poly}$ we have the equality $X_\tup = j = \sum_{j\in\pbox{\bound}}jX_j$.
Denote the variables of $\poly$ to be $\vars{\poly}.$ In the \abbrCTIDB setting, $\poly\inparen{\vct{X}}$ has an equivalent reformulation $\inparen{\refpoly{}\inparen{\vct{X_R}}}$ that is of use to us, where $\abs{\vct{X_R}} = \bound\cdot\abs{\vct{X}}$. Given $X_\tup \in\vars{\poly}$, by definition $X_\tup \in\inset{0,\ldots, c}$. We can replace $X_\tup$ by $\sum_{j\in\pbox{\bound}}jX_{\tup, j}$ where each $X_{\tup, j}\in\inset{0, 1}$. Then for any $\worldvec\in\worlds$, we set $X_{\tup, j} = 1$ for $\worldvec_\tup = j$, while $X_{\tup, j'} = 0$ for all $j'\neq j\in\pbox{\bound}$. By construction then $\poly\inparen{\vct{X}}\equiv\refpoly{}\inparen{\vct{X_R}}$ $\inparen{\vct{X_R} = \vars{\refpoly{}}}$ since for any $X_\tup\in\vars{\poly}$ we have the equality $X_\tup = j = \sum_{j\in\pbox{\bound}}jX_j$.
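As a small illustration of this substitution (a sketch under the assumption $\bound = 2$, for a single variable $X$ with indicator variables $X_1, X_2$), squaring the replaced variable gives
\begin{align*}
X^2 = \inparen{X_1 + 2X_2}^2 = X_1^2 + 4X_1X_2 + 4X_2^2.
\end{align*}
In every world at most one of $X_1, X_2$ equals $1$, so the cross term $X_1X_2$ always evaluates to $0$, and $X_j^2 = X_j$ for $X_j\in\inset{0,1}$. Hence $X^2$ agrees with $X_1 + 4X_2 = \sum_{j\in\pbox{2}}j^2X_j$ on all worlds.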
Considering again our example,
\begin{multline*}
@ -272,7 +260,11 @@ $ to be the polynomial resulting from converting $\refpoly{}$ into the standard
}
removing all monomials containing the term $X_{\tup, j}X_{\tup, j'}$ for $\tup\in\tupset, j\neq j'\in\pbox{c}$, and setting all \emph{variable} exponents $e > 1$ to $1$.
\end{Definition}
Continuing with the example $\poly_1^2\inparen{A, B, C, E, X_1, X_2, Y, Z}$, to save clutter we i) do not show the full expansion for variables with greatest multiplicity $= 1$ since e.g. for variable $A$, the sum of products itself evaluates to $1^2\cdot A^2 = A$, and ii) for $\sum_{j\in\pbox{\bound}}j^2\cdot X_j$, we omit the summands encoding multiplicities $> 2$, since the greatest multiplicity of the tuple annotated with $X$ is $2$, likewise those summands will always evaluated to $0$ since the tuple will never have a multiplicity of $>2$.
Continuing with the example
\footnote{
To save clutter we do not show the full expansion for variables with greatest multiplicity $= 1$ since e.g. for variable $A$, the sum of products itself evaluates to $1^2\cdot A^2 = A$.
}
$\poly_1^2\inparen{A, B, C, E, X_1, X_2, Y, Z}$ we have
\begin{multline*}
\rpoly_1^2(A, B, C, E, X_1, X_2, Y, Z) = \\
A\inparen{\sum\limits_{j\in\pbox{\bound}}j^2X_j}B + BYE + BZC + 2A\inparen{\sum\limits_{j\in\pbox{\bound}}j^2X_j}BYE + 2A\inparen{\sum\limits_{j\in\pbox{\bound}}j^2X_j}BZC + 2BYEZC =\\
@ -290,9 +282,7 @@ For any \abbrCTIDB $\pdb$, $\raPlus$ query $\query$, and lineage polynomial
\expct_{\vct{W} \sim \pdassign}\pbox{\refpoly{}\inparen{\vct{W}}} = \rpoly\inparen{\probAllTup}
$, where $\probAllTup = \inparen{\inparen{\prob_{\tup, j}}_{\tup\in\tupset, j\in\pbox{c}}}.$%,\ldots,\prob_{\abs{\tupset}, \bound}}$ is defined by $\bpd$.
\end{Lemma}
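To illustrate the lemma on a toy instance (an assumption made for illustration, not the running example), let $\bound = 1$ and let $A$ and $B$ annotate two independent tuples that are present with probabilities $\prob_A$ and $\prob_B$. For $\poly = \inparen{A + B}^2$, linearity of expectation together with $W^2 = W$ for $W\in\inset{0,1}$ gives
\begin{align*}
\expct\pbox{\inparen{W_A + W_B}^2} = \expct\pbox{W_A^2} + 2\expct\pbox{W_AW_B} + \expct\pbox{W_B^2} = \prob_A + 2\prob_A\prob_B + \prob_B,
\end{align*}
which is exactly $\rpoly\inparen{\prob_A, \prob_B}$ for $\rpoly\inparen{A, B} = A + 2AB + B$.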
}
\secrev{
\subsection{Our Techniques}
\mypar{Lower Bound Proof Techniques}
Our main hardness result shows that computing~\Cref{prob:expect-mult} is $\sharpwonehard$ for $1$-\abbrTIDB. To prove this result we show that for the same $\query_1$ from the example above, for an arbitrary `product width' $k$, the query $Q^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $\bigO{1}$ tuples in \Cref{fig:two-step}).
@ -302,7 +292,7 @@ We do so by considering an arbitrary graph $G$ (analogous to relation $\boldsymb
Our negative results (\Cref{tab:lbs}) indicate that \abbrCTIDB{}s (even for $\bound=1$) cannot achieve comparable performance to deterministic databases for exact results (under complexity assumptions). In fact, under plausible hardness conjectures, one cannot (drastically) improve upon the trivial algorithm to exactly compute the expected multiplicities for $1$-\abbrTIDB\xplural. A natural followup is whether we can do better if we are willing to settle for an approximation to the expected multiplicities.
\input{two-step-model}
We adopt the two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
We adopt a two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
(i) \termStepOne (\abbrStepOne): Given input $\tupset$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial ($\poly(\vct{X})=\apolyqdt\inparen{\vct{X}}$);
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct_{\randWorld\sim\bpd}\pbox{\poly(\vct{\randWorld})}$.
Let $\timeOf{\abbrStepOne}(Q,\tupset,\circuit)$ denote the runtime of \abbrStepOne when it outputs $\circuit$ (which is a representation of $\poly$ as an arithmetic circuit --- more on this representation in~\Cref{sec:expression-trees}).
@ -326,7 +316,7 @@ Accordingly, this work uses (arithmetic) circuits\footnote{
}
as the representation system of $\poly(\vct{X})$.
Given that there exists a representation $\circuit^*$ such that $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$, we can now focus on the complexity of \abbrStepTwo.
Given that there exists a representation $\circuit^*$ such that $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$, we can now focus on the complexity of the \abbrStepTwo step.
We can represent the factorized lineage polynomial by its corresponding arithmetic circuit $\circuit$ (whose size we denote by $|\circuit|$).
As we also show in \Cref{sec:circuit-runtime}, this size is also bounded by $\qruntime{\optquery{\query}, \tupset, \bound}$ (i.e., $|\circuit^*| \le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$).
Thus, the question of approximation %\Cref{prob:big-o-joint-steps}
@ -349,15 +339,12 @@ For an upper bound on approximating the expected count, it is easy to check that
\end{footnotesize}
If we assume that all seven probability values are at least $p_0>0$,
%Choose the least factor that is reduced in $\rpoly^2\inparen{\vct{X}}$, in this case $\prob_A\prob_X\prob_B$, and
we get that $\poly_1^2\inparen{\vct{\prob}}$ is in the range $[\inparen{p_0}^3\cdot\rpoly^2_1\inparen{\vct{\prob}}, \rpoly_1^2\inparen{\vct{\prob}}]$.
we get that $\poly_1^2\inparen{\vct{\prob}}$ is in the range $[\inparen{p_0}^3\cdot\rpoly^2_1\inparen{\vct{\prob}}, \rpoly_1^2\inparen{\vct{\prob}}]$, which is \emph{not a tight approximation}.
%
%To get an $(1\pm \epsilon)$-multiplicative approximation we uniformly sample monomials from the \abbrSMB representation of $\poly$ and `adjust' their contribution to $\widetilde{\poly}\left(\cdot\right)$.
In~\cref{sec:algo} we demonstrate that a $(1\pm\epsilon)$ (multiplicative) approximation with competitive performance is achievable.
To get a $(1\pm \epsilon)$-multiplicative approximation and solve~\Cref{prob:intro-stmt}, using \circuit we uniformly sample monomials from the equivalent \abbrSMB representation of $\poly$ (without materializing the \abbrSMB representation) and `adjust' their contribution to $\widetilde{\poly}\left(\cdot\right)$.
\rule{\textwidth}{1.5pt}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


@ -85,7 +85,7 @@ sensitive=true
\title{Parameterized and Fine-Grained Analysis of Query Evaluation Over Bag PDBs}
\title{Computing expected multiplicities for bag-TIDBs with bounded multiplicities}
\titlerunning{Bag PDB Queries}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


@ -3,7 +3,7 @@
\section{Hardness of Exact Computation}
\label{sec:hard}
In this section, we will prove the hardness results claimed in Table~\ref{tab:lbs} for a specific (family of) hard instance(s) $(\query,\pdb)$ for \Cref{prob:bag-pdb-poly-expected} where $\pdb$ is a $1$-\abbrTIDB.
Note that this implies hardness for \abbrCTIDB\xplural $\inparen{\bound\geq1}$, \bis and general \abbrBPDB, showing \Cref{prob:bag-pdb-poly-expected} cannot be done in $\bigO{\qruntime{\optquery{\query},\tupset,\bound}}$ runtime.
Note that this implies hardness for \abbrCTIDB\xplural $\inparen{\bound\geq1}$, showing \Cref{prob:bag-pdb-poly-expected} cannot be done in $\bigO{\qruntime{\optquery{\query},\tupset,\bound}}$ runtime. The results also apply to \abbrOneBIDB and other more general \abbrPDB\xplural.
%(and hence the equivalent \Cref{prob:bag-pdb-query-eval})
%in the negative.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -14,7 +14,7 @@ In particular, we will consider the problems of computing the following counts (
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Theorem}[\cite{k-match}]
\label{thm:k-match-hard}
Given positive integer $k$ and undirected graph $G=(\vset,\edgeSet)$ with no self-loops or parallel edges, $\kmatchtime\ge \littleomega{f(k)\cdot |\edgeSet|^c}$ for any function $f$ and fixed constant $c$ independent of $\abs{E}$ and $k$ (assuming $\sharpwzero\ne\sharpwone$).
Given positive integer $k$ and undirected graph $G=(\vset,\edgeSet)$ with no self-loops or parallel edges, $\kmatchtime\ge \littleomega{f(k)\cdot |\edgeSet|^c}$ for any function $f$ and any constant $c$ independent of $\abs{E}$ and $k$ (assuming $\sharpwzero\ne\sharpwone$).
\end{Theorem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{hypo}\label{conj:known-algo-kmatch}
@ -46,7 +46,7 @@ For any graph $G=(V,\edgeSet)$ and $\kElem\ge 1$, define
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\noindent Returning to \Cref{fig:two-step}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ is the lineage polynomial whose structure mirrors the query $\query_2$ from \Cref{sec:intro}. Let us alias
\noindent Returning to \Cref{fig:two-step}, it can be seen that $\poly_{G}^\kElem(\vct{X})$ is the lineage polynomial from query $\query_k$, which we define next ($\query_2$ from~\Cref{sec:intro} is the same query with $k=2$). Let us alias
\begin{lstlisting}
SELECT 1 FROM T $t_1$, R r, T $t_2$
WHERE $t_1$.city = r.city1 AND $t_2$.city = r.city2
@ -55,16 +55,15 @@ as $R_i$ for each $i \in [k]$. The query $\query^k$ then becomes
\begin{lstlisting}
SELECT COUNT(*) FROM $R_1$ JOIN $R_2$ JOIN$\cdots$JOIN $R_k$
\end{lstlisting}
\noindent Consider again the \abbrCTIDB instance $\pdb$ of~\Cref{fig:two-step} and, for our hard instance, let $\bound = 1$. $\pdb$ generalizes to one compatible to~\Cref{def:qk} as follows. Relation $T$ has $n$ tuples corresponding to each vertex for $i$ in $[n]$, each with probability $\prob_i$ and $R$ has tuples corresponding to the edges $\edgeSet$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $R$ as well, but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $R$ also are present with probability $\prob$ but to simplify notation we assign probability $1$ to edges.}
In other words, for this instance $\tupset$ contains the set of $\numvar$ unary tuples in $T$ (which corresponds to $\vset$) and $\numedge$ binary tuples in $R$ (which corresponds to $\edgeSet$).
Note that this implies that $\poly_{G}^\kElem$ is indeed a \abbrCTIDB-lineage polynomial. % for a \abbrTIDB \abbrPDB.
\AH{
\textbf{@atri}, we discussed this last meeting, but I am not sure if we really pinpointed how we want to treat (\emph{in a consistent manner}) the runtime of~\Cref{lem:tdet-om} since $k$ is a constant and $m$ is growing. Would it be a good idea to be consistent with the $O_\epsilon$ notation of~\Cref{prob:big-o-joint-steps} and say $O_k(\numedge)$}
Next, we note that the runtime for answering $\query^k$ on deterministic database $\tupset$, as defined above, is $\bigO{\numedge}$ (i.e. deterministic query processing is `easy' for this query):
\noindent Consider again the \abbrCTIDB instance $\pdb$ of~\Cref{fig:two-step} and, for our hard instance, let $\bound = 1$. $\pdb$ generalizes to one compatible to~\Cref{def:qk} as follows. Relation $T$ has $n$ tuples corresponding to each vertex for $i$ in $[n]$, each with probability $\prob$ and $R$ has tuples corresponding to the edges $\edgeSet$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $R$ as well, but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $R$ also are present with probability $\prob$ but to simplify notation we assign probability $1$ to edges.}
In other words, this instance $\tupset$ contains the set of $\numvar$ unary tuples in $T$ (which corresponds to $\vset$) and $\numedge$ binary tuples in $R$ (which corresponds to $\edgeSet$).
Note that this implies that $\poly_{G}^\kElem$ is indeed a $1$-\abbrTIDB lineage polynomial. % for a \abbrTIDB \abbrPDB.
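As a hedged illustration, assume that $\poly_{G}^\kElem(\vct{X})$ of~\Cref{def:qk} takes the form $\inparen{\sum_{(i, j)\in\edgeSet}X_iX_j}^\kElem$, as the join structure of $\query^k$ suggests (the definition itself is not restated here). For the path on three vertices with $\edgeSet = \inset{(1,2), (2,3)}$ and $\kElem = 2$,
\begin{align*}
\poly_{G}^2(\vct{X}) &= \inparen{X_1X_2 + X_2X_3}^2 = X_1^2X_2^2 + 2X_1X_2^2X_3 + X_2^2X_3^2,\\
\rpoly_{G}^2(\vct{X}) &= X_1X_2 + 2X_1X_2X_3 + X_2X_3,
\end{align*}
so when every tuple of $T$ is present with probability $\prob$, the expected multiplicity is $\rpoly_{G}^2(\prob,\prob,\prob) = 2\prob^2 + 2\prob^3$.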
Next, we note that the runtime for answering $\query^k$ on deterministic database $\tupset$, as defined above, is $O_k\inparen{\numedge}$ (i.e. deterministic query processing is `easy' for this query):
\begin{Lemma}\label{lem:tdet-om}
Let $\query^k$ and $\tupset$ be as defined above. Then
% of \Cref{def:qk}, the runtime
$\qruntimenoopt{\query^k, \tupset}$ is $\bigO{\kElem\numedge}$.
$\qruntimenoopt{\query^k, \tupset}$ is $O_k\inparen{\numedge}$.
\end{Lemma}
\subsection{Multiple Distinct $\prob$ Values}


@ -42,20 +42,23 @@ Let $\abs{\poly}$ be the number of operators in $\poly$.
\begin{Corollary}\label{cor:expct-sop}
If $\poly$ is a $1$-\abbrBIDB lineage polynomial already in \abbrSMB, then the expectation of $\poly$, i.e., $\expct\pbox{\poly} = \rpoly\left(\prob_1,\ldots, \prob_\numvar\right)$ can be computed in $\bigO{\abs{\poly}}$ time.
\end{Corollary}
\secrev{
\subsubsection{Possible World Semantics}\label{subsub:possible-world-sem}
Queries over probabilistic databases are traditionally viewed as being evaluated using the so-called possible world semantics. A general bag-\abbrPDB can be defined as the pair $\pdb = \inparen{\Omega, \bpd}$ where $\Omega$ is the set of possible worlds represented by $\pdb$. Under the possible world semantics, the result of a query $\query$ over an incomplete database $\Omega$ is the set of query answers produced by evaluating $\query$ over each possible world $\omega\in\Omega$: $\inset{\query\inparen{\omega}: \omega\in\Omega}$.
The result of a query is the pair $\inparen{\query\inparen{\omega}, \bpd'}$ where $\bpd'$ is a probability distribution that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer: $\probOf\pbox{\omega\in\Omega} = \sum_{\omega'\in\Omega,\\\query\inparen{\omega'}=\query\inparen{\omega}}\probOf\pbox{\omega'}$.
}
\AH{
I am not sure the following paragraph is needed, since the reduction definition says pretty much the same thing. Unless that definition changes, we can get rid of this paragraph.
}
Suppose that $\pdb$ is a $1$-\abbrBIDB. Instead of looking only at the possible worlds of $\pdb$, one can consider all worlds, including those that cannot exist due to, e.g., disjointness. The all worlds set can be modeled by $\worldvec\in \{0, 1\}^{\bound\numvar}$, such that $\worldvec_{\tup, j} \in \worldvec$ represents whether or not the multiplicity of $\tup$ is $j$ (\emph{here and later, especially in \Cref{sec:algo}, we will rename the variables as $X_1,\dots,X_n$, where $n=\sum_{\tup\in\tupset}\abs{b_\tup}$}).%(where $k = \sum_{\ell = 1}^{i - 1} \abs{b_\ell} + j$).
We can denote a probability distribution over all $\worldvec \in \{0, 1\}^{\bound\numvar}$ as $\bpd'$. When $\bpd'$ is the one induced from each $\prob_{\tup, j}$ while assigning $\probOf\pbox{\worldvec} = 0$ for any $\worldvec$ with $\worldvec_{\tup, j}, \worldvec_{\tup, j'} \geq 1$ for $j\neq j'$, we end up with a bijective mapping from $\bpd$ to $\bpd'$, such that each mapping is equivalent, implying the distributions are equivalent.
\Cref{subsec:supp-mat-ti-bi-def} has more details.
In this section, we show how the traditional possible worlds semantics corresponds to our setup. Readers can safely skip this part without missing anything vital to the results of this paper.
Queries over probabilistic databases are traditionally viewed as being evaluated using the so-called possible world semantics. A general bag-\abbrPDB can be defined as the pair $\pdb = \inparen{\Omega, \bpd}$ where $\Omega$ is the set of possible worlds represented by $\pdb$ and $\bpd$ the probability distribution over $\Omega$. Under the possible world semantics, the result of a query $\query$ over an incomplete database $\Omega$ is the set of query answers produced by evaluating $\query$ over each possible world $\omega\in\Omega$: $\inset{\query\inparen{\omega}: \omega\in\Omega}$.
The result of a query is the pair $\inparen{\query\inparen{\Omega}, \bpd'}$ where $\bpd'$ is a probability distribution that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer: $\probOf\pbox{\omega\in\Omega} = \sum_{\omega'\in\Omega,\\\query\inparen{\omega'}=\query\inparen{\omega}}\probOf\pbox{\omega'}$.
Recall \Cref{fig:nxDBSemantics} again, which defines the lineage polynomial $\apolyqdt$ for any $\raPlus$ query. We now make a meaningful connection between possible world semantics and world assignments on the lineage polynomial.
Suppose that $\pdb''$ is a reduced \abbrOneBIDB from \abbrCTIDB $\pdb'$ as defined by~\Cref{def:ctidb-reduct}. Instead of looking only at the possible worlds of $\pdb''$, one can consider the set of all worlds, including those that cannot exist due to, e.g., disjointness. Since $\abs{\tupset'} = \numvar$ the all worlds set can be modeled by $\worldvec\in \{0, 1\}^{\numvar\bound}$, such that $\worldvec_{\tup, j} \in \worldvec$ represents whether or not the multiplicity of $\tup$ is $j$ (\emph{here and later, especially in \Cref{sec:algo}, we will rename the variables as $X_1,\dots,X_{\numvar'}$, where $\numvar'=\sum_{\tup\in\tupset}\abs{\block_\tup}$}).
\footnote{
In this example, $\abs{\block_\tup} = \bound$ for all $\tup$.
}%(where $k = \sum_{\ell = 1}^{i - 1} \abs{b_\ell} + j$).
We can denote a probability distribution over all $\worldvec \in \{0, 1\}^{\numvar\bound}$ as $\bpd''$. When $\bpd''$ is the one induced from each $\prob_{\tup, j}$ while assigning $\probOf\pbox{\worldvec} = 0$ for any $\worldvec$ with $\worldvec_{\tup, j}, \worldvec_{\tup, j'} \neq 0$ for $j\neq j'$, we end up with a bijective mapping from $\bpd$ to $\bpd''$, such that each mapping is equivalent, implying that the distributions, and thus the query results, are equivalent.
\Cref{subsec:supp-mat-ti-bi-def} has more details. \medskip
We now make a meaningful connection between possible world semantics and world assignments on the lineage polynomial.
\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
Given a \abbrBPDB $\pdb = (\Omega,\bpd)$, $\raPlus$ query $\query$, and lineage polynomial $\apolyqdt$ for arbitrary result tuple $\tup$, %$\semNX$-\abbrPDB $\pxdb = (\idb_{\semNX}',\pd')$ where $\rmod(\pxdb) = \pdb$,
@ -63,7 +66,7 @@ we have (denoting $\randDB$ as the random variable over $\Omega$):
$ \expct_{\randDB \sim \bpd}[\query(\randDB)(t)] = \expct_{\vct{\randWorld}\sim \pdassign}\pbox{\apolyqdt\inparen{\vct{\randWorld}}}. $
\end{Proposition}
\noindent A formal proof of \Cref{prop:expection-of-polynom} is given in \Cref{subsec:expectation-of-polynom-proof}.\footnote{Although \Cref{prop:expection-of-polynom} follows, e.g., as an obvious consequence of~\cite{IL84a}'s Theorem 7.1, we are unaware of any formal proof for bag-probabilistic databases.}
We focus on the problem of computing $\expct_{\worldvec\sim\pdassign}\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ from now on, assume implicit $\query, \tupset, \tup$, and drop them from $\apolyqdt$ (i.e., $\poly\inparen{\vct{X}}$ will denote a polynomial).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


@ -6,24 +6,26 @@
%We first formally define circuits, an encoding of polynomials that we use throughout the paper.
%
%For illustrative purposes consider the polynomial $\poly(\vct{X}) = 2X^2 + 3XY - 2Y^2$ over $\vct{X} = [X, Y]$.
\secrev{
~\Cref{prob:intro-stmt} asks if there exists a linear time approximation algorithm in the size of a given circuit \circuit which encodes $\poly\inparen{\vct{X}}$. In this work we
}
represent lineage polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, a standard way to represent polynomials over fields (particularly in the field of algebraic complexity) that we use for polynomials over $\mathbb N$ in the obvious way. Since we are particularly using circuits to model lineage polynomials, we can refer to these circuits as lineage circuits. However, when the meaning is clear, we will drop the term lineage and only refer to them as circuits.
We focus on the problem of computing $\expct_{\worldvec\sim\pdassign}\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ from now on, assume implicit $\query, \tupset, \tup$, and drop them from $\apolyqdt$ (i.e., $\poly\inparen{\vct{X}}$ will denote a polynomial).
\Cref{prob:intro-stmt} asks if there exists a linear time approximation algorithm in the size of a given circuit \circuit which encodes $\poly\inparen{\vct{X}}$. Recall that in this work we
represent lineage polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, a standard way to represent polynomials over fields (particularly in the field of algebraic complexity) that we use for polynomials over $\mathbb N$ in the obvious way. Since we are specifically using circuits to model lineage polynomials, we can refer to these circuits as lineage circuits. However, when the meaning is clear, we will drop the term lineage and only refer to them as circuits.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Circuit]\label{def:circuit}
A circuit $\circuit$ is a Directed Acyclic Graph (DAG) whose source gates (in degree of $0$) consist of elements in either $\domN$ or $\vct{X} = \inparen{X_1,\ldots,X_\numvar}$. For each result tuple there exists one sink gate. The internal gates have binary input and are either sum ($\circplus$) or product ($\circmult$) gates.
%
Each gate has the following members: \type, \vpartial, \vari{input}, \degval, \vari{Lweight}, and \vari{Rweight}, where \type is the value type $\{\circplus, \circmult, \var, \tnum\}$ and \vari{input} the list of inputs. Source gates have an extra member \val storing the value. $\circuit_\linput$ ($\circuit_\rinput$) denotes the left (right) input of \circuit.
Each gate has the following members: \type, \vari{input}, \val, \vpartial, \degval, \vari{Lweight}, and \vari{Rweight}, where \type is the value type $\{\circplus, \circmult, \var, \tnum\}$ and \vari{input} the list of inputs. Source gates have an extra member \val storing the value. $\circuit_\linput$ ($\circuit_\rinput$) denotes the left (right) input of \circuit.
\end{Definition}
\AH{Does the following matter, i.e., does it point anything out special for our research? \textbf{EDIT}: ~\Cref{lem:val-ub} does use this (when \circuit is a tree) to answer~\Cref{prob:intro-stmt} with a yes.}
%\AH{Does the following matter, i.e., does it point anything out special for our research? \textbf{EDIT}: ~\Cref{lem:val-ub} does use this (when \circuit is a tree) to answer~\Cref{prob:intro-stmt} with a yes.}
When the underlying DAG is a tree (with edges pointing towards the root), the structure is an expression tree \etree. In such a case, the root of \etree is analogous to the sink of \circuit. The fields \vari{partial}, \degval, \vari{Lweight}, and \vari{Rweight} are used in the proofs of \Cref{sec:proofs-approx-alg}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The circuits in \Cref{fig:two-step} encode their respective polynomials in column $\poly$.
Note that each circuit \circuit encodes a tree, with edges pointing towards the root.
Note that the circuit \circuit representing $AX$ and the circuit \circuit' representing $B\inparen{Y+Z}$ each encode a tree, with edges pointing towards the root.
\begin{wrapfigure}{l}{0.45\linewidth}
@ -64,7 +66,7 @@ Note that each circuit \circuit encodes a tree, with edges pointing towards the
%\end{figure}
We next formally define the relationship of circuits with polynomials. While the definition assumes one sink for notational convenience, it easily generalizes to the multiple sinks case.
\begin{Definition}[$\polyf(\cdot)$]\label{def:poly-func}
Denote $\polyf(\circuit)$ to be the function from the sink of circuit $\circuit$ to its corresponding polynomial (in \abbrSMB). $\polyf(\cdot)$ is recursively defined on $\circuit$ as follows, with addition and multiplication following the standard interpretation for polynomials:
$\polyf(\circuit)$ maps the sink of circuit $\circuit$ to its corresponding polynomial (in \abbrSMB). $\polyf(\cdot)$ is recursively defined on $\circuit$ as follows, with addition and multiplication following the standard interpretation for polynomials:
\begin{equation*}
\polyf(\circuit) = \begin{cases}
\polyf(\circuit_\lchild) + \polyf(\circuit_\rchild) &\text{ if \circuit.\type } = \circplus\\


@ -7,19 +7,18 @@
%We now introduce some terminology
%and develop a reduced form of lineage polynomials for a \abbrBIDB or \abbrTIDB.
%Note that
\secrev{
Given an index set $S$ over variables $X_\tup$ for $\tup\in S$, a (general) polynomial $\genpoly$ over $\inparen{X_\tup}_{\tup \in S}$ with individual degree $\hideg <\infty$
is formally defined as (where $c_{\vct{d}}\in \semN$):
\begin{equation}
is formally defined as:
\begin{align}
\label{eq:sop-form}
\genpoly\inparen{\inparen{X_\tup}_{\tup\in S}}=\sum_{\vct{d}\in\{0,\ldots,\hideg\}^{S}} c_{\vct{d}}\cdot \prod_{\tup\in S}X_\tup^{d_\tup}.
\end{equation}
}
\genpoly\inparen{\inparen{X_\tup}_{\tup\in S}}=\sum_{\vct{d}\in\{0,\ldots,\hideg\}^{S}} c_{\vct{d}}\cdot \prod_{\tup\in S}X_\tup^{d_\tup}&&\text{ where } c_{\vct{d}}\in \semN.
\end{align}
%where $c_{\vct{d}}\in \semN$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Standard Monomial Basis]\label{def:smb}
\secrev{The term $\prod_{\tup\in S} X_\tup^{d_\tup}$ }in \Cref{eq:sop-form} is a {\em monomial}. A polynomial $\genpoly\inparen{\vct{X}}$ is in standard monomial basis (\abbrSMB) when we keep only the terms with $c_{\vct{d}}\ne 0$ from \Cref{eq:sop-form}.
The term $\prod_{\tup\in S} X_\tup^{d_\tup}$ in \Cref{eq:sop-form} is a {\em monomial}. A polynomial $\genpoly\inparen{\vct{X}}$ is in standard monomial basis (\abbrSMB) when we keep only the terms with $c_{\vct{d}}\ne 0$ from \Cref{eq:sop-form}.
\end{Definition}
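For example (a small illustration, with a polynomial chosen here for convenience), $\inparen{X + Y}\inparen{X + 2Y}$ is not in \abbrSMB, whereas its expansion
\begin{align*}
\inparen{X + Y}\inparen{X + 2Y} = X^2 + 3XY + 2Y^2
\end{align*}
is: it lists exactly the monomials $X^2$, $XY$, and $Y^2$ with their non-zero coefficients $c_{(2,0)} = 1$, $c_{(1,1)} = 3$, and $c_{(0,2)} = 2$.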
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Unless otherwise noted, we consider all polynomials to be in \abbrSMB representation.
@ -27,20 +26,19 @@ When it is unclear, we use $\smbOf{\genpoly}~\inparen{\smbOf{\poly}}$ to denote
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Degree]\label{def:degree-of-poly}
The degree of polynomial $\genpoly(\vct{X})$ is the largest \secrev{$\sum_{i\in\pbox{\numedge}}d_i %= \norm{\vct{d}}_1
$}% = \sum_{\tup\in\tupset} d_\tup$
such that $c_{(d_1,\dots,d_n)}\ne 0$. \secrev{
The degree of polynomial $\genpoly(\vct{X})$ is the largest $\sum_{i\in\pbox{\numedge}}d_i %= \norm{\vct{d}}_1
$% = \sum_{\tup\in\tupset} d_\tup$
such that $c_{(d_1,\dots,d_n)}\ne 0$.
We denote the degree of $\genpoly$ as $\deg\inparen{\genpoly}$.
}% maximum sum of exponents, over all monomials in $\smbOf{\poly(\vct{X})}$.
% maximum sum of exponents, over all monomials in $\smbOf{\poly(\vct{X})}$.
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As an example, the degree of the polynomial $X^2+2XY^2+Y^2$ is $3$.
Product terms in lineage arise only from join operations (\Cref{fig:nxDBSemantics}), so intuitively, the degree of a lineage polynomial is analogous to the largest number of joins needed to produce a result tuple.
%in any clause of the $\raPlus$ query that created it.
\secrev{
We call a polynomial $\poly\inparen{\vct{X}}$ a \emph{\abbrCTIDB-lineage polynomial} (%resp., \emph{\ti-lineage polynomial},
or simply lineage polynomial), if it is clear from context that there exists an $\raPlus$ query $\query$, \abbrCTIDB $\pdb$, and result tuple $\tup$ such that $\poly\inparen{\vct{X}} = \apolyqdt\inparen{\vct{X}}.$
}
@ -51,7 +49,7 @@ or simply lineage polynomial), if it is clear from context that there exists an
%\noindent\secrev{
%A block independent database \abbrBIDB $\pdb'$ is the union of $\numblock$ sets of tuples, where each set of tuples consists of elements all of which are disjoint to one another. Each set of tuples is called a block, denoted $\block_i$ for $i\in\pbox{\numblock}$, where all $\block_i$ are independent events. We define next a specific construction of \abbrBIDB that is useful for our work.}
\noindent \secrev{A block independent database \abbrBIDB $\pdb'$ models a set of worlds each of which consists of a subset of the possible tuples $\tupset'$, where $\tupset'$ is partitioned into $\numblock$ blocks $\block_i$ and all $\block_i$ are independent random events. $\pdb'$ further constrains that all $\tup\in\block_i$ for all $i\in\pbox{\numblock}$ of $\tupset'$ be disjoint events. We define next a specific construction of \abbrBIDB that is useful for our work.
\noindent A block-independent database (\abbrBIDB) $\pdb'$ models a set of worlds, each of which consists of a subset of the possible tuples $\tupset'$, where $\tupset'$ is partitioned into $\numblock$ blocks $\block_i$ and all $\block_i$ are independent random events. $\pdb'$ further requires that, for every $i\in\pbox{\numblock}$, all tuples $\tup\in\block_i$ be mutually disjoint events. We next define a specific construction of \abbrBIDB that is useful for our work.
%\secrev{
%A block independent database \abbrBIDB $\pdb'$ can viewed as a $1$-\abbrTIDB $\pdb$ with the added flexibility that each $\tup\in\tupset$ has multiple disjoint alternatives, i.e., all $\tup \in \tupset'$ are partitioned into $m$ independent blocks with the condition that tuples $\tup \in \block_i$ for $i \in \pbox{m}$ are disjoint events. We define next a specific construction of \abbrBIDB that is useful for our work.
@ -80,13 +78,42 @@ Given \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, let $\pdb' = \inparen{\onebid
%$\tup_j\geq1\implies \tup_{j'} = 0$.$\forall j, j' \in \pbox{\bound},\forall \tup\in\tupset, \tup_j\geq 1\implies \tup_{j'} = 0$ for any block $\block_\tup$.
\end{Proposition}
For the \abbrCTIDB $\pdb$, each $X_\tup\in\pbox{\bound}$, while in the reduced \abbrOneBIDB $\pdb'$, each $X_{\tup, j}\in\inset{0, 1}$. %As previously noted, unlike $X_{\tup}\in\inset{0,\ldots,\bound}$ for $X_{\tup}\in\vars{\pdb}$, $X_{\tup, j}\in\inset{0,1}$ for $X_{\tup, j}\in\vars{\pdb'}$.
Hence, in the setting of \abbrOneBIDB, we have the following semantics for generating lineage polynomials in $\raPlus$ queries: $\poly'\pbox{\project_A\inparen{\query}, \tupset', \tup_j} = \sum_{\tup_{j'} \in \project_{A}\inparen{\query\inparen{\tupset'}}: \tup_{j'} = \tup_j}\poly'\pbox{\query, \tupset', \tup_{j'}}$,
$\poly'\pbox{\select_\theta\inparen{\query}, \tupset', \tup_j} = \begin{cases}\theta = 1&\poly'\pbox{\query, \tupset', \tup_j}\\\theta = 0& 0\\\end{cases}$,
$\poly'\pbox{\query_1\join\query_2, \tupset', \tup_j} = \poly'\pbox{\query_1, \tupset', \project_{attr\inparen{\query_1}}\inparen{\tup_j}}\cdot\poly'\pbox{\query_2, \tupset', \project_{attr\inparen{\query_2}}\inparen{\tup_j}}$,
$\poly'\pbox{\query_1\union\query_2, \tupset', \tup_j} = \poly'\pbox{\query_1, \tupset', \tup_j}+\poly'\pbox{\query_2, \tupset', \tup_j}$,
and the base case now becomes $\poly'\pbox{\rel,\tupset', \tup_j} = j\cdot X_{\tup, j}$ (cf.~\Cref{fig:nxDBSemantics}). Then given the disjoint requirement and the semantics for constructing the lineage polynomial over a \abbrOneBIDB, $\poly'\pbox{\rel,\tupset',\tup}$ is of the same structure as the reformulated polynomial $\refpoly{}$ of step i) from~\Cref{def:reduced-poly}, which then implies that $\rpoly'$ is the reduced polynomial that results from step ii) of~\Cref{def:reduced-poly}, and further that~\Cref{lem:tidb-reduce-poly} immediately follows for \abbrOneBIDB polynomials: $\expct_{\rvworld\sim\bpd'}\pbox{\poly'\inparen{\rvworld}} = \rpoly'\inparen{\vct{\prob}}$.
}
For $\poly\inparen{\vct{X}}$ generated from \abbrCTIDB $\pdb$, each $X_\tup\in\pbox{\bound}$, while, given $\poly'\inparen{\vct{X}}$ produced from the reduced \abbrOneBIDB $\pdb'$, each $X_{\tup, j}\in\inset{0, 1}$. %As previously noted, unlike $X_{\tup}\in\inset{0,\ldots,\bound}$ for $X_{\tup}\in\vars{\pdb}$, $X_{\tup, j}\in\inset{0,1}$ for $X_{\tup, j}\in\vars{\pdb'}$.
Hence, in the setting of \abbrOneBIDB, the semantics for generating lineage polynomials of $\raPlus$ queries are those shown in~\Cref{fig:lin-poly-bidb-redux}. Note that the semantics for lineage polynomial construction change only in the base case.
We now define the reduced polynomial $\rpoly'$ of a \abbrOneBIDB.
\begin{figure}[t!]
\centering
\resizebox{\textwidth}{!}{
\begin{minipage}{\textwidth}
\begin{align*}
\poly'\pbox{\project_A\inparen{\query}, \tupset', \tup_j} =& \sum_{\substack{\tup_{j'},\\\project_{A}\inparen{\tup_{j'}} = \tup_j}}\poly'\pbox{\query, \tupset', \tup_{j'}} &
\poly'\pbox{\query_1\union\query_2, \tupset', \tup_j} =& \poly'\pbox{\query_1, \tupset', \tup_j}+\poly'\pbox{\query_2, \tupset', \tup_j}\\
\poly'\pbox{\select_\theta\inparen{\query}, \tupset', \tup_j} =& \begin{cases}\theta = 1&\poly'\pbox{\query, \tupset', \tup_j}\\\theta = 0& 0\\\end{cases} &
\begin{aligned}
\poly'\pbox{\query_1\join\query_2, \tupset', \tup_j} = \\~
\end{aligned} &
\begin{aligned}
&\poly'\pbox{\query_1, \tupset', \project_{attr\inparen{\query_1}}\inparen{\tup_j}}\\ &~~~\cdot\poly'\pbox{\query_2, \tupset', \project_{attr\inparen{\query_2}}\inparen{\tup_j}}
\end{aligned}\\
&&&\poly'\pbox{\rel,\tupset', \tup_j} = j\cdot X_{\tup, j}.
\end{align*}\\[-10mm]
\end{minipage}}
\caption{Construction of the lineage (polynomial) for an $\raPlus$ query $\query$ over $\gentupset$.}
\label{fig:lin-poly-bidb-redux}
\end{figure}
\begin{Definition}[$\rpoly'$]\label{def:reduced-poly-redux}
Given a polynomial $\poly'\inparen{\vct{X}}$ generated from a \abbrOneBIDB produced by the reduction of~\Cref{def:ctidb-reduct}, let $\rpoly'\inparen{\vct{X}}$ denote the reduced form of $\poly'\inparen{\vct{X}}$, computed as follows: i) compute $\smbOf{\poly'\inparen{\vct{X}}}$, ii) reduce all \emph{variable} exponents $e > 1$ to $1$.
\end{Definition}
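As a small sketch of the two steps (the tuples $\tup$ and $\tup'$ and the polynomial below are hypothetical, chosen only for illustration, and assumed to belong to different blocks), consider $\poly'\inparen{\vct{X}} = X_{\tup, 1}\cdot\inparen{X_{\tup, 1} + X_{\tup', 1}}$. Then
\begin{align*}
\text{step i):}\quad & \smbOf{\poly'\inparen{\vct{X}}} = X_{\tup, 1}^2 + X_{\tup, 1}X_{\tup', 1},\\
\text{step ii):}\quad & \rpoly'\inparen{\vct{X}} = X_{\tup, 1} + X_{\tup, 1}X_{\tup', 1}.
\end{align*}
Only the exponent of the variable $X_{\tup, 1}$ is reduced; the coefficients and the remaining monomial are unchanged.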
Given the disjointness requirement and the semantics for constructing the lineage polynomial over a \abbrOneBIDB, $\poly'\pbox{\rel,\tupset',\tup}$ has the same structure as the reformulated polynomial $\refpoly{}$ of step i) of~\Cref{def:reduced-poly}. Consequently, $\rpoly'$ is the reduced polynomial that results from step ii) of~\Cref{def:reduced-poly}, and~\Cref{lem:tidb-reduce-poly} carries over directly to \abbrOneBIDB polynomials.
\begin{Lemma}
Given any \abbrCTIDB $\pdb$, its reduced counterpart \emph{\abbrOneBIDB} $\pdb'$, $\raPlus$ query $\query$, and lineage polynomial
$\poly'\inparen{\vct{X}}=\poly'\pbox{\query,\tupset,\tup}\inparen{\vct{X}}$, it holds that $
\expct_{\vct{W} \sim \pdassign'}\pbox{\poly'\inparen{\vct{W}}} = \rpoly'\inparen{\probAllTup}.
$%, where $\probAllTup = \inparen{\inparen{\prob_{\tup, j}}_{\tup\in\tupset, j\in\pbox{c}}}.$%,\ldots,\prob_{\abs{\tupset}, \bound}}$ is defined by $\bpd$.
%$\expct_{\rvworld\sim\bpd'}\pbox{\poly'\inparen{\rvworld}} = \rpoly'\inparen{\vct{\prob}}$.
\end{Lemma}
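To illustrate the statement on a hypothetical instance (the tuples $\tup$ and $\tup'$ are assumptions made only for this example, taken to lie in different blocks, with $\prob_{\tup, j}$ denoting the probability that $X_{\tup, j} = 1$), consider $\poly'\inparen{\vct{X}} = \inparen{X_{\tup, 1} + 2X_{\tup, 2}}\cdot X_{\tup', 1}$. Every monomial of its \abbrSMB form, $X_{\tup, 1}X_{\tup', 1} + 2X_{\tup, 2}X_{\tup', 1}$, already has exponents of at most $1$, so $\rpoly'\inparen{\vct{X}} = \poly'\inparen{\vct{X}}$. By linearity of expectation and the independence of distinct blocks,
\begin{align*}
\expct_{\vct{W} \sim \pdassign'}\pbox{\poly'\inparen{\vct{W}}}
&= \expct\pbox{X_{\tup, 1}}\cdot\expct\pbox{X_{\tup', 1}} + 2\,\expct\pbox{X_{\tup, 2}}\cdot\expct\pbox{X_{\tup', 1}}\\
&= \prob_{\tup, 1}\prob_{\tup', 1} + 2\prob_{\tup, 2}\prob_{\tup', 1}
= \rpoly'\inparen{\probAllTup}.
\end{align*}
The variables $X_{\tup, 1}$ and $X_{\tup, 2}$ belong to the same block, but they never appear in the same monomial here, so the disjointness of tuples within a block plays no role in this particular computation.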
%In this paper, we focus on two popular forms of \abbrPDB\xplural: Block-Independent (\bi) and Tuple-Independent (\ti) \abbrPDB\xplural.
%%
%A \bi $\pdb$ is a \abbrPDB with the constraint that


@ -7,7 +7,7 @@
\renewcommand{\belowrulesep}{0pt}
\begin{figure}[h!]
\begin{figure}[t!]
\centering
\resizebox{\textwidth}{5.2cm}{%
\begin{tikzpicture}
@ -119,14 +119,14 @@
Point & $\mathbb{E}[\poly(\vct{X})]$\\
\midrule%[0.05pt]
$e_1$ & $\inparen{\prob_{A, 1} +\prob_{A, 2}}\cdot\left(\prob_{X, 1} + 2\prob_{X, 2}\right)$\\%$1.0 \cdot 0.9 = 0.9$\\[3mm]
$e_2$ & $\inparen{\prob_{B, 1} + \prob_{B_2}}\inparen{\prob_{Y, 1}+\prob_{Y, 2} + \prob_{Z, 1} + \prob_{Z, 2}}$\\%$(0.5 \cdot 1.0) + $\newline $\hspace{0.2cm}(0.5 \cdot 1.0)$\newline $= 1.0$\\
$e_2$ & $\inparen{\prob_{B, 1} + \prob_{B, 2}}\inparen{\prob_{Y, 1}+2\prob_{Y, 2} + \prob_{Z, 1} + 2\prob_{Z, 2}}$\\%$(0.5 \cdot 1.0) + $\newline $\hspace{0.2cm}(0.5 \cdot 1.0)$\newline $= 1.0$\\
\end{tabular}
};
%label of rounded rectangle
\node[below=0.2cm of rrect]{{\LARGE $\expct\pbox{\poly(\vct{X})}$}};
\end{tikzpicture}
}
\caption{Intensional Query Evaluation Model $\inparen{\query_2 = \project_{\text{Point}}\inparen{T\join_{\text{Point} = \text{Point}_1}R} \text{ and }c = 2}$.}
\caption{Intensional Query Evaluation Model ($\query_2 = \project_{\text{Point}}$ $\inparen{T\join_{\text{Point} = \text{Point}_1}R}$, where $\bound = 2$ for table $R$ and $\bound = 1$ for table $T$).}
\label{fig:two-step}
\end{figure}