Started proof for |C|(1,...,1)

master
Aaron Huber 2021-03-31 23:02:13 -04:00
parent 9f53fe0aca
commit 9c410fa3b8
5 changed files with 23 additions and 9 deletions


@ -9,12 +9,7 @@
As we show, this result has a significant implication: a Bag-PDB doing exact computations will never be as fast as a classical (deterministic) database.
The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $\prob$ (s.t. $\prob \not \in \{0,1\}$).
We proceed to study polynomials of result tuples of unions of conjunctive queries (UCQs) over TIDBs and over a non-trivial subclass of block-independent databases (BIDBs). We develop an algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of such polynomials in time linear in the size of the polynomial, paving the way for PDBs to be competitive with deterministic databases.
% \AH{High-level intuition}
% \BG{Most people think that computing expected multiplicity of an output tuple in a probabilistic database (PDB) is easy. Due to the fact that most modern implementations of PDBs represent tuple lineage in their expanded form, it has to be the case that such a computation is linear in the size of the lineage. This follows since, when we have an uncompressed lineage, linearity allows for expectation to be pushed through the sum.}
% \AH{Low-level why-would-an-expert-read-this}
% \BG{However, when we consider compressed representations of the tuple lineage, the complexity landscape changes. If we use a lineage computed over a factorized database, we find in general that computation time is not linear in the size of the compressed lineage.}
% \AH{Key technical contributions}
% \BG{This work theoretically demonstrates that bags are not easy in general, and in the case of compressed lineage forms, the computation can be greater than linear. As such, it is desirable to have an algorithm that approximates the expected multiplicity in linear time. We introduce such an algorithm and give theoretical guarantees on its efficiency and accuracy. It then follows that computing an approximate value of a tuple's expected multiplicity on a bag PDB matches deterministic query processing complexity.}
\end{abstract}
%%% Local Variables:


@ -624,7 +624,26 @@ When $\gate_{k+1}.\type = \circmult$, then line ~\ref{alg:one-pass-mult} compute
\paragraph{\onepass Runtime}
It is known that $\topord(G)$ is computable in linear time. Next, each of the $\numvar$ iterations of the loop in~\Cref{alg:one-pass-loop} takes $O(1)$ time, thus yielding a runtime of $O\left(\size(\circuit)\right)$.
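As a sanity check of the linear-time claim, the evaluation of $\abs{\circuit}(1,\ldots, 1)$ can be sketched as a single pass over the gates in topological order. This is a minimal illustration, not the paper's \onepass pseudocode; the gate encoding below is an assumption made for exposition.

```python
# Minimal sketch (assumed gate encoding, not the paper's \onepass):
# evaluate |C|(1,...,1) in one linear pass over gates in topological order.

def abs_circuit_all_ones(gates):
    """gates: list in topological order; each entry is ('const', c),
    ('var',), ('+', i, j), or ('*', i, j), where i, j index earlier gates.
    Returns |C|(1,...,1): the circuit with all constants replaced by their
    absolute values, evaluated with every variable set to 1."""
    val = []
    for g in gates:
        if g[0] == 'const':
            val.append(abs(g[1]))      # |C|: absolute value of each constant
        elif g[0] == 'var':
            val.append(1)              # every variable is set to 1
        elif g[0] == '+':
            val.append(val[g[1]] + val[g[2]])
        else:                          # '*'
            val.append(val[g[1]] * val[g[2]])
    return val[-1]                     # value at the sink gate

# (x + y)^2 as a circuit: gates x, y, x+y, (x+y)*(x+y)
square = [('var',), ('var',), ('+', 0, 1), ('*', 2, 2)]
print(abs_circuit_all_ones(square))    # -> 4
```

Each gate is visited exactly once, so the pass is linear in $\size(\circuit)$, matching the runtime claim above.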
\paragraph{$\abs{\circuit}(1,\ldots, 1)$ is size $O(N)$}
For our runtime results to be relevant, it must be the case that the sum of the coefficients computed by \onepass can be represented in $O(\log{N})$ bits, since a word in the RAM model has $O(\log{N})$ bits, where $N$ is the size of the input. The size of the input here is \size(\circuit). We show that $\abs{\circuit}(1,\ldots, 1) \leq N^{2^k}$, where $k = \degree(\circuit)$, by induction on \depth(\circuit); for fixed $k$ this implies the desired $O(\log{N})$-bit representation.
\begin{proof}[Proof that $\abs{\circuit}(1,\ldots, 1)$ is size $O(N)$]
To prove this result, we first show that $\abs{\circuit}(1,\ldots, 1) \leq N^{2^k}$ for $\degree(\circuit) = k$.
For the base case we have $\depth(\circuit) = 0$, so \circuit consists of a single gate, which must contain a coefficient (or constant) of $1$. In this case $\abs{\circuit}(1,\ldots, 1) = 1$ and $\size(\circuit) = 1$, so indeed $\abs{\circuit}(1,\ldots, 1) = 1 \leq N^{2^k} = 1^{2^0} = 1$.
For the inductive hypothesis, assume that for some $\ell \geq 0$ the claim $\abs{\circuit}(1,\ldots, 1) \leq N^{2^k}$ holds for every circuit \circuit with $\depth(\circuit) \leq \ell$.
For the inductive step, consider a circuit \circuit with $\depth(\circuit) = \ell + 1$. Each input subcircuit $\circuit_i$ for $i \in \{\linput, \rinput\}$ has depth at most $\ell$ and size at most $N - 1$, so the inductive hypothesis gives $\abs{\circuit_i}(1,\ldots, 1) \leq (N-1)^{2^{k_i}}$, where $k_i = \degree(\circuit_i)$. The sink can only be either a $\circmult$ or a $\circplus$ gate. First consider the case that the sink is a $\circmult$ gate, where the input degrees $k_i, k_j$ satisfy $k_i + k_j = k - 1$. Then
\begin{equation}
\abs{\circuit}(1,\ldots, 1) = \abs{\circuit_\linput}(1,\ldots, 1) \cdot \abs{\circuit_\rinput}(1,\ldots, 1) \leq (N-1)^{2^{k_i}} \cdot (N-1)^{2^{k_j}} \leq (N-1)^{2^{k-1}} \cdot (N-1) \leq N^{2^k},
\end{equation}
where the second inequality holds since $2^{k_i} + 2^{k_j} \leq 2^{k-1} + 2^0$ whenever $k_i + k_j = k - 1$ (the sum is maximized when one of the two exponents is $0$).
For the case when the sink node is a $\circplus$ gate, we have $k = k_i + k_j$ and thus
\begin{equation}
\abs{\circuit}(1,\ldots, 1) = \abs{\circuit_\linput}(1,\ldots, 1) + \abs{\circuit_\rinput}(1,\ldots, 1) \leq
(N-1)^{2^{k_i}} + (N-1)^{2^{k_j}} \leq (N-1)^{2^k} + (N-1) \leq N^{2^k}.
\end{equation}
Since $\abs{\circuit}(1,\ldots, 1) \leq N^{2^k}$, the number of bits needed to represent it is at most $\log{N^{2^k}} = 2^k \cdot \log{N}$, which for fixed $k$ is the desired $O(\log{N})$ bits, allowing $O(1)$-time arithmetic operations in the RAM model.
\end{proof}
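The bound can also be sanity-checked numerically. The sketch below is a hypothetical illustration: the gate encoding is assumed, and the degree bookkeeping mirrors the conventions used in the proof above ($\circmult$ gates combine input degrees as $k_i + k_j + 1$, $\circplus$ gates as $k_i + k_j$), which is an assumption rather than the paper's formal definition.

```python
# Numeric sanity check of |C|(1,...,1) <= N^(2^k), with the degree
# bookkeeping from the proof sketch above (assumed encoding, not the
# paper's definitions): deg(*) = deg_l + deg_r + 1, deg(+) = deg_l + deg_r.

def eval_and_degree(gates):
    """Return (|C|(1,...,1), degree) for a circuit in topological order."""
    val, deg = [], []
    for g in gates:
        if g[0] == 'const':
            val.append(abs(g[1])); deg.append(0)
        elif g[0] == 'var':
            val.append(1); deg.append(0)
        elif g[0] == '+':
            val.append(val[g[1]] + val[g[2]])
            deg.append(deg[g[1]] + deg[g[2]])
        else:  # '*'
            val.append(val[g[1]] * val[g[2]])
            deg.append(deg[g[1]] + deg[g[2]] + 1)
    return val[-1], deg[-1]

# (x + y)^2: N = 4 gates; the bound N^(2^k) should dominate the value
gates = [('var',), ('var',), ('+', 0, 1), ('*', 2, 2)]
v, k = eval_and_degree(gates)
N = len(gates)
print(v, k, v <= N ** (2 ** k))   # -> 4 1 True
```

Such spot checks do not replace the induction, but they catch off-by-one errors in the exponent arithmetic quickly.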
\subsection{\sampmon Notes}
\revision{


@ -64,7 +64,7 @@ sensitive=true
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\author{Su Feng}{Illinois Institute of Technology, Chicago, USA}{sfeng14@hawk.iit.edu}{}{}{}
\author{Boris Glavic}{Illinois Institure of Technology, USA}{bglavic@iit.edu}{}{}{}
\author{Boris Glavic}{Illinois Institute of Technology, USA}{bglavic@iit.edu}{}{}{}
\author{Aaron Huber}{University at Buffalo, USA}{ahuber@buffalo.edu}{}{}{}
\author{Oliver Kennedy}{University at Buffalo, USA}{okennedy@buffalo.edu}{}{}{}
\author{Atri Rudra}{University at Buffalo, USA}{atri@buffalo.edu}{}{}{}


@ -50,7 +50,7 @@ For any graph $G=([n],E)$ and $\kElem\ge 1$, define
\[\poly_{G}^\kElem(X_1,\dots,X_n) = \left(\sum\limits_{(i, j) \in E} X_i \cdot X_j\right)^\kElem\]
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Our hardness results only need a \ti instance; We also consider the special case when all the tuple probabilities (probabilities assigned to $X_i$ by $\probAllTup$) are the same value. Note that this polynomial can be encoded in an expression tree of size $\Theta(km)$.
Our hardness results only need a \ti instance; we also consider the special case when all the tuple probabilities (probabilities assigned to $X_i$ by $\probAllTup$) are the same value. Note that our hardness results do not require the general circuit representation and hold even for the expression tree representation. %this polynomial can be encoded in an expression tree of size $\Theta(km)$.
Using the tables in \cref{fig:ex-shipping}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ can be constructed as follows:
\[\poly^k_G:- Loc(C_1),Route(C_1, C_1'),Loc(C_1'),\dots,Loc(C_\kElem),Route(C_\kElem,C_\kElem'),Loc(C_\kElem')\]
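As a concrete illustration of the definition, $\poly_{G}^\kElem$ can be evaluated directly from the edge sum. The sketch below is an assumed example (the triangle graph and the variable assignments are not from the paper):

```python
# Illustrative sketch: Q_G^k(X) = (sum over edges (i,j) of X_i * X_j)^k.
# The triangle graph and the assignments below are assumed examples.

def poly_G_k(edges, X, k):
    """Evaluate the hardness polynomial for graph edges E at point X."""
    return sum(X[i] * X[j] for (i, j) in edges) ** k

triangle = [(0, 1), (1, 2), (0, 2)]
print(poly_G_k(triangle, [1, 1, 1], 2))   # each edge contributes 1; 3^2 -> 9
```

Note that evaluating the expression this way touches each of the $m$ edges once per factor, consistent with the $\Theta(km)$ expression-tree size mentioned above.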


@ -101,7 +101,7 @@ Given the set of BIDB variables $\inset{X_{b,i}}$, define
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
In the reduced form, all exponents $e > 1$ are reduced to $e = 1$ via mod $\mathcal{T}$. (Note that this can be seen inductively by the fact that $x \mod (x^2 - x) = x$ and $x^2 \mod (x^2 - x) = x$ and by the multiplicative rule for modular arithmetic, $x \cdot x^2 \mod (x^2 - x) = x$). To filter disallowed $\bi$ cross-terms, all monomials with multiple variables from the same block $\block$ are dropped by $\text{mod } \mathcal B$ (i.e., any monomial containing more than one tuple from a block has $0$ probability and can be ignored).
All exponents $e > 1$ in $\smbOf{\poly(\vct{X})}$ are reduced to $e = 1$ via mod $\mathcal{T}$. Performing the modulus of $\rpoly(\vct{X})$ with $\mathcal{B}$ ensures the disjoint condition of \bi, removing monomials with lineage variables from the same block.%, (recall the constraint on tuples from the same block being disjoint in a \bi).% any monomial containing more than one tuple from a block has $0$ probability and can be ignored).
For the special case of \tis, the second step is not necessary since every block contains a single tuple.
%Alternatively, one can think of $\rpoly$ as the \abbrSMB of $\poly(\vct{X})$ when the product operator is idempotent.
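The two reduction steps can be sketched as follows. This is a hypothetical illustration: the monomial representation, variable names, and block labels are assumptions for exposition, not the paper's notation.

```python
# Sketch of the reduced form: mod T collapses every exponent e >= 1 to 1
# (since x^2 = x modulo x^2 - x), and mod B drops monomials containing two
# distinct variables from the same BIDB block (disjoint tuples).

def reduce_poly(monomials, block_of):
    """monomials: list of (coeff, {var: exponent}); block_of: var -> block id.
    Returns the reduced polynomial as {frozenset of vars: coeff}."""
    out = {}
    for coeff, mono in monomials:
        blocks = [block_of[v] for v in mono]
        if len(blocks) != len(set(blocks)):
            continue              # cross-term within one block: probability 0
        key = frozenset(mono)     # mod T: all exponents reduced to 1
        out[key] = out.get(key, 0) + coeff
    return out

# X_{b,1}^2 * X_{c,1}  +  X_{b,1} * X_{b,2}  (second monomial is a cross-term)
poly = [(1, {'Xb1': 2, 'Xc1': 1}), (1, {'Xb1': 1, 'Xb2': 1})]
blocks = {'Xb1': 'b', 'Xb2': 'b', 'Xc1': 'c'}
print(reduce_poly(poly, blocks))  # keeps only the monomial X_{b,1} * X_{c,1}
```

For a \ti input, every block is a singleton, so the cross-term filter never fires, matching the special case noted above.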