%root = main.tex
\AH{\large\bf{New stuff 092520.}}
\begin{Claim}\label{claim:constpk-TI}
Given a positive query polynomial $\poly$ over a $\ti$ with constant degree $k = \degree(\poly)$, and a constant $\prob_0 > 0$ such that $\prob_0 \leq \prob_i$ for all $i$, the ratio $\frac{\abs{\etree}(1,\ldots, 1)}{\rpoly(\prob_1,\ldots, \prob_\numvar)}$ is bounded by a constant.
\end{Claim}
\begin{proof}[Proof of Claim ~\ref{claim:constpk-TI}]
By independence, a $\ti$ has the property that all of its annotations are positive. Combined with the fact that \Cref{claim:constpk-TI} uses only positive queries, i.e., queries that only use $\oplus$ and $\otimes$ semiring operators over its polynomial annotations, it is the case that no negation exists pre or post query.
For any $\poly$ then, it is true that all coefficients in $\abs{\etree}(1,\ldots, 1)$ are positive and thus the same as their $\rpoly$ counterparts. Since every monomial of $\rpoly$ contains at most $k$ variables, each assigned a probability of at least $\prob_0$, this implies that the ratio $\frac{\abs{\etree}(1,\ldots, 1)}{\rpoly(\prob_1,\ldots, \prob_\numvar)} \leq \frac{\abs{\etree}(1,\ldots, 1)}{\abs{\etree}(1,\ldots, 1) \cdot \prob_0^k} = \frac{1}{\prob_0^k}$, which is indeed a constant because both $\prob_0$ and $k$ are constants.
\end{proof}
\qed
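As a small sanity check of \Cref{claim:constpk-TI}, consider a hypothetical instance chosen here purely for illustration: $\poly = X_1X_2 + X_2X_3$ (so $k = 2$ and, since no variable is repeated within a monomial, $\rpoly = \poly$) with lower bound $\prob_0 = \frac{1}{2}$. Assuming $\etree$ is the natural expression tree for this sum of products, we would get
\begin{align*}
\abs{\etree}(1, 1, 1) =& 1 \cdot 1 + 1 \cdot 1 = 2,\\
\rpoly(\prob_1, \prob_2, \prob_3) =& \prob_1\prob_2 + \prob_2\prob_3 \geq 2 \cdot \prob_0^2 = \frac{1}{2},\\
\frac{\abs{\etree}(1, 1, 1)}{\rpoly(\prob_1, \prob_2, \prob_3)} \leq& \frac{2}{1/2} = 4 = \frac{1}{\prob_0^2}.
\end{align*}
The resulting bound depends only on $\prob_0$ and $k$, not on the number of variables.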
\subsection{$\rpoly$ over $\bi$}
\AH{A general sufficient condition is the $\bi$ having fixed block size (thus implying increasing number of blocks for growing $\numvar$). For increasing $\numvar$, the ratio $\frac{\abs{\etree}(1,\ldots, 1)}{\rpoly(\prob_1,\ldots, \prob_\numvar)}$ can be proven to be a constant since, as $\numvar$ increases, it has to be the case that new blocks are added, and this results in a constant number of terms cancelled out by $\rpoly$, with the rest surviving, which gives us a constant $\frac{\abs{\etree}(1,\ldots, 1)}{\rpoly(\prob_1,\ldots, \prob_\numvar)}$.
\par In the general case, with fixed number of blocks and growing $\numvar$, all additional terms will be cancelled out by $\rpoly$ while for $\abs{\etree}(1,\ldots, 1)$ it is the case that it will grow exponentially with $\numvar$, yielding a ratio $\frac{O(2^\numvar)}{O(1)}$ and (as will be seen) greater.}
\subsubsection{Known Reduction Result $\bi \mapsto \ti$}
Denote an arbitrary $\bi$ as $\bipdb = (\bipd, \biwset)$ and a constructed $\ti$ to be $\tipdb = (\tipd, \tiwset)$, the details to be described next.
It is well known that $\bipdb$ can be reduced to a query $\poly$ over $\tipdb$. For completeness, let us describe the reduction.
Let tuples in $\bipdb$ be denoted $a_{\block, i}$ and their $\tipdb$ counterparts as $x_{\block, i}$, where $\block$ represents the block id in which $a_{\block, i}$ resides.
\begin{Theorem}\label{theorem:bi-red-ti}
For any $\bipdb$, there exists a query $\poly$ and $\tipdb$ such that $\poly(\tiwset)$ over distribution $\tipd$ outputs elements in $\biwset$ according to their respective probabilities in $\bipd$.
\end{Theorem}
\begin{Definition}[Total Ordering $\biord$]\label{def:bi-red-ti-order}
The order $\biord$ is a fixed total order across all tuples in block $\block$ of $\bipdb$.
\end{Definition}
\begin{Definition}[Query $\poly$]\label{def:bi-red-ti-q}
$\poly$ is constructed to map all possible worlds of $\db_{ti} \in \tiwset$ for which $x_{\block, i}$ is the greatest according to $\biord$, to the worlds $\vct{w}$ in $\biwset$ in which $a_{\block, i}$ is present and $\bipd(\vct{w}) > 0$. Recall the constraint on $\bipdb$: if $a_{\block, i}$ is present, then for all $j \neq i$, tuple $a_{\block, j}$ is not present. For $\bipdb$ with exactly one block, all such worlds $\db_{ti}$ are mapped to the world $\{a_{\block, i}\}$.
\end{Definition}
For simplicity, we will consider $\bipdb$ to consist of one block $\block$. By independence of blocks in $\bi$, the proofs below immediately generalize to the case of $\bipdb$ with multiple blocks\textcolor{blue}{...umm, we'll see, we may need to argue this}.
The reduction consists of the construction of a query $\poly$ and $\tipdb$ such that $\poly$ is computed over $\tipdb$. To construct the $\tipdb$ given an arbitrary $\bipdb$, a tuple alternative $a_{\block, i}$ is transcribed to a tuple in $\tipdb$ with probability
\begin{equation}
P(x_{\block, i}) = \begin{cases}
\frac{P(a_{\block, i})}{\prod_{j = 1}^{i - 1}(1 - P(x_{\block, j}))} &\textbf{if }i > 1\\
P(a_{\block, i}) &\textbf{if } i = 1.
\end{cases}\label{eq:bi-red-ti-func}
\end{equation}
The above is more simply written as
\begin{equation*}
\tipd(x_{\block, i}) = \frac{P(a_{\block, i})}{1 - \sum_{j = 1}^{i - 1} P(a_{\block, j})}
\end{equation*}
The above mapping is applied across all tuples of $\bipdb$.
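To see why the recursive and the closed forms agree, the following sketch uses the shorthand $S_i = \sum_{j = 1}^{i} P(a_{\block, j})$ with $S_0 = 0$ (notation introduced here only for this derivation) and assumes $S_{i - 1} < 1$ so that no denominator vanishes. Supposing inductively that $\prod_{j = 1}^{i - 1}(1 - P(x_{\block, j})) = 1 - S_{i - 1}$, \Cref{eq:bi-red-ti-func} gives $P(x_{\block, i}) = \frac{P(a_{\block, i})}{1 - S_{i - 1}}$, and therefore
\begin{align*}
\prod_{j = 1}^{i}(1 - P(x_{\block, j})) =& \left(1 - S_{i - 1}\right) \cdot \left(1 - \frac{P(a_{\block, i})}{1 - S_{i - 1}}\right)\\
=& 1 - S_{i - 1} - P(a_{\block, i})\\
=& 1 - S_i.
\end{align*}
For a hypothetical block with three alternatives of probabilities $0.2$, $0.3$, and $0.5$ (values chosen only for illustration), this yields
\begin{align*}
P(x_{\block, 1}) =& 0.2, & P(x_{\block, 2}) =& \frac{0.3}{1 - 0.2} = 0.375, & P(x_{\block, 3}) =& \frac{0.5}{1 - 0.2 - 0.3} = 1.
\end{align*}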
This method for computing the probabilities of the tuples in $\tipdb$ allows for the following. According to $\biord$, the powerset of possible worlds is mapped in such a way that the first ordered tuple appearing in a possible world $\db_{\tiabb}$ of $\tiwset$ has that world mapped to the world $\db_{\biabb} \in \biwset$ where $a_{\block, i}$ is present with $\bipd(\db_{\biabb}) > 0$. Recall that since we are considering a $\bi$ with one block, there is only one such world in $\biwset$.
\begin{Lemma}\label{lem:bi-red-ti-prob}
The sum of the probabilities of all database worlds $\db_{\tiabb} \in \tiwset$ mapped to a given tuple $x_{\block, i}$ equals the probability of the tuple $a_{\block, i}$ in the original $\bipdb$.
\end{Lemma}
\begin{proof}[Proof of Lemma ~\ref{lem:bi-red-ti-prob}]
The proof is by induction. Given a tuple $a_{\block, i}$ in $\bipdb$ such that $1 \leq i \leq \abs{b}$, (where $\abs{b}$ denotes the number of alternative tuples in block $\block$), by \Cref{eq:bi-red-ti-func} $P(x_{\block, i}) = \frac{P(a_{\block, i})}{1 \cdot \prod_{j = 1}^{i - 1} (1 - P(x_{\block, j}))}$.
For the base case, we have that $i = 1$ which implies that $P(x_{\block, i}) = P(a_{\block, i})$ and the base case is satisfied.
%Other neat tidbits include that $\abs{b} = 1$, the set $b = \{a_1\}$, and the powerset $2^b = \{\emptyset, \{1\}\} = \tiwset$. For coolness, also see that $P(\neg x_i) = 1 - P(x_i) = 1 - P(a_i) = \emptyset$, so there is, in this case, a one to one correspondence of possible worlds and their respective probabilities in both $\ti$ and $\bi$, but this is extraneous information for the proof.
The hypothesis is then that for $k \geq 1$ tuple alternatives, \Cref{lem:bi-red-ti-prob} holds.
For the inductive step, prove that \Cref{lem:bi-red-ti-prob} holds for $k + 1$ alternatives. By definition of the query $\poly$ (\Cref{def:bi-red-ti-q}), it is a fact that only the world $\wElem_{x_{\block, k + 1}} = \{x_{\block, k + 1}\}$ in the set of possible worlds is mapped to $\bi$ world $\{a_{\block, k + 1}\}$. Then for world $\wElem_{x_{\block, k + 1}}$ it is the case that $P(\wElem_{x_{\block, k + 1}}) = \prod_{j = 1}^{k} (1 - P(x_{\block, j})) \cdot P(x_{\block, k + 1})$. Since by \Cref{eq:bi-red-ti-func} $P(x_{\block, k + 1}) = \frac{P(a_{\block, k + 1})}{\prod_{j = 1}^{k}(1 - P(x_{\block, j}))}$, we get
\begin{align*}
P(\wElem_{x_{\block, k + 1}}) =& \prod_{j = 1}^{k} (1 - P(x_{\block, j})) \cdot P(x_{\block, k + 1})\\
=&\prod_{j = 1}^{k} (1 - P(x_{\block, j})) \cdot \frac{P(a_{\block, k + 1})}{\prod_{j = 1}^{k}(1 - P(x_{\block, j}))}\\
=&P(a_{\block, k + 1}).
\end{align*}
\end{proof}
\qed
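As a numeric check of \Cref{lem:bi-red-ti-prob}, take a hypothetical block whose alternatives have probabilities $P(a_{\block, 1}) = 0.2$, $P(a_{\block, 2}) = 0.3$, and $P(a_{\block, 3}) = 0.5$ (values assumed only for illustration). \Cref{eq:bi-red-ti-func} then gives $P(x_{\block, 1}) = 0.2$, $P(x_{\block, 2}) = 0.375$, and $P(x_{\block, 3}) = 1$. Summing over the worlds in which $x_{\block, i}$ is the first tuple present according to $\biord$ (the later tuples are unconstrained, so their factors sum to $1$) recovers the original probabilities:
\begin{align*}
i = 1:\quad& 0.2 = P(a_{\block, 1}),\\
i = 2:\quad& (1 - 0.2) \cdot 0.375 = 0.3 = P(a_{\block, 2}),\\
i = 3:\quad& (1 - 0.2) \cdot (1 - 0.375) \cdot 1 = 0.5 = P(a_{\block, 3}).
\end{align*}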
This leaves us with the task of constructing a query $\poly$ over $\tipdb$ to perform the desired mapping of possible worlds. Setting $\poly$ to the following query yields the desired result.
\begin{lstlisting}
SELECT A FROM TI as a
WHERE a.A = 1
OR a.A = 2 AND NOT EXISTS(SELECT A FROM TI as b
WHERE b.A = 1 AND a.blockID = b.blockID)
$\vdots$
OR a.A = $|$b.blockID$|$ AND NOT EXISTS(SELECT A FROM TI as b
WHERE (b.A = 1 OR b.A = 2 OR $\ldots$ OR b.A = $|$b.blockID$|$ - 1) AND a.blockID = b.blockID)
\end{lstlisting}
\begin{Lemma}\label{lem:bi-red-ti-q}
The query $\poly$ satisfies the requirements of \Cref{def:bi-red-ti-q}.
\end{Lemma}
\begin{proof}[Proof of Lemma ~\ref{lem:bi-red-ti-q}]
For any possible world in $2^b$, notice that the WHERE clause selects the tuple with the greatest ordering in the possible world. For all other tuples, disjunction of predicates dictates that no other tuple will be in the output by mutual exclusivity of the disjunction. Thus, it is the case for any $\ti$ possible world, that the tuple $x_{\block, i}$ with the greatest ordering appearing in that possible world will alone be in the output, and all such possible worlds with $x_{\block, i}$ as the greatest in the ordering will output the same world corresponding to the $\bi$ world for the disjoint tuple $a_{\block, i}$.
\end{proof}
\qed
\begin{proof}[Proof of Theorem ~\ref{theorem:bi-red-ti}]
For multiple blocks in $\bipdb$, note that the above reduction to $\poly(\tipdb)$ with multiple `blocks' will behave the same as $\bipdb$ since the property of independence for $\ti$ ensures that all tuples in the $\ti$ will have the same marginal probability across all possible worlds as their tuple probability, regardless of how many tuples and, thus, worlds the $\tipdb$ has. Note that this property is unchanging no matter what probabilities additional tuples in $\tipdb$ are assigned.
To see this consider the following.
\begin{Lemma}\label{lem:bi-red-ti-ind}
For any set of independent variables $S$ with size $\abs{S}$, when adding another distinct independent variable $y$ to $S$ with probability $\prob_y$, it is the case that the probability of each variable $x_i$ in $S$ remains unchanged.
\AH{This may be a well known property that I might not even have the need to prove, but since I am not certain, here goes.}
\end{Lemma}
\begin{proof}[Proof of Lemma ~\ref{lem:bi-red-ti-ind}]
The proof is by induction. For the base case, consider a set of one element $S = \{x\}$ with probability $\prob_x$. The set of possible outcomes is $2^S = \{\emptyset, \{x\}\}$, with $P(\emptyset) = 1 - \prob_x$ and $P(x) = \prob_x$. Now, consider $S' = \{y\}$ with $P(y) = \prob_y$ and $S \cup S' = \{x, y\}$ with the set of possible outcomes now $2^{S \cup S'} = \{\emptyset, \{x\}, \{y\}, \{x, y\}\}$. The probabilities for each world then are $P(\emptyset) = (1 - \prob_x)\cdot(1 - \prob_y), P(x) = \prob_x \cdot (1 - \prob_y), P(y) = (1 - \prob_x)\cdot \prob_y$, and $P(xy) = \prob_x \cdot \prob_y$. For the worlds where $x$ appears we have
\[P(x) + P(xy) = \prob_x \cdot (1 - \prob_y) + \prob_x \cdot \prob_y = \prob_x \cdot \left((1 - \prob_y) + \prob_y\right) = \prob_x \cdot 1 = \prob_x.\]
Thus, the base case is satisfied.
For the hypothesis, assume that $\abs{S} = k$ for some $k \geq 1$, and for $S'$ such that $\abs{S'} = 1$ where its element is distinct from all elements in $S$, the probability of each independent variable in $S$ is the same in $S \cup S'$.
For the inductive step, let us prove that for a set $S_{k + 1}$ with $\abs{S_{k + 1}} = k + 1$ elements, adding another element will not change the probabilities of the independent variables in $S_{k + 1}$. By the hypothesis, all probabilities in $S_k$ remained unchanged after the union that produced $S_{k + 1}$. Now consider a set $S' = \{z\}$ and the union $S_{k + 1} \cup S'$. Since all variables are distinct and independent, the set of possible outcomes of $S_{k + 1} \cup S'$ is $2^{S_{k + 1} \cup S'}$ with $\abs{2^{S_{k + 1} \cup S'}} = 2^{\abs{S_{k + 1}} + \abs{S'}}$, since $\abs{S_{k + 1}} + \abs{S'} = \abs{S_{k + 1} \cup S'}$. Then, since $2^{\abs{S_{k + 1}} + \abs{S'}} = 2^{\abs{S_{k + 1}}} \cdot 2^{\abs{S'}}$, and $2^{S'} = \{\emptyset, \{z\}\}$, it is the case that all elements in the original set of outcomes will appear \textit{exactly one} time without $z$ and \textit{exactly one} time with $z$, such that for element $x \in 2^{S_{k + 1}}$ with probability $\prob_x$ we have $P(x\text{ }OR\text{ }xz) = \prob_x \cdot (1 - \prob_z) + \prob_x \cdot \prob_z = \prob_x\cdot \left((1 - \prob_z) + \prob_z\right) = \prob_x \cdot 1 = \prob_x$. The probabilities of the original outcomes therefore remain unchanged, and, thus, the marginal probabilities for each variable in $S_{k + 1}$ across all possible outcomes remain unchanged.
\end{proof}
\qed
The repeated application of \Cref{lem:bi-red-ti-ind} to any 'block' of independent variables in $\tipdb$ provides the same result as joining two sets of distinct elements of size $\abs{S_1}, \abs{S_2} > 1$.
Thus, by lemmas ~\ref{lem:bi-red-ti-prob}, ~\ref{lem:bi-red-ti-q}, and ~\ref{lem:bi-red-ti-ind}, the proof follows.
\end{proof}
\qed
\subsubsection{General results for $\bi$}\label{subsubsec:bi-gen}
\AH{One thing I don't see in the argument below is that as $\numvar \rightarrow \infty$, we have that $\prob_0 \rightarrow 0$.}
The general results of approximating a $\bi$ using the reduction and \Cref{alg:mon-sam} do not allow for the ratio $\frac{\abs{\etree}(1,\ldots, 1)}{\rpoly(\prob_1,\ldots, \prob_\numvar)}$ to be a constant. Consider the following example.
Let monomial $y_i = P(x_i) \cdot \prod_{j = 1}^{i - 1}(1 - P(x_j))$. Let $\poly(\vct{X}) = \sum_{i = 1}^{\numvar}y_i$. Note that this query output can exist on a projection for which each tuple agrees on the projected values of the query in a $\bi$ consisting of one block and $\numvar$ tuples.
First, let's analyze the numerator $\abs{\etree}(1,\ldots, 1)$. Expanding $\abs{\etree}$ yields $X_1 + (1 + X_1)\cdot X_2 + \cdots + (1 + X_1)\cdot(1 + X_2)\cdots(1 + X_{\numvar - 1})\cdot X_{\numvar}$, which evaluated at $(1,\ldots, 1)$ yields a geometric series $S_{\abs{\etree}} = 2^0 + 2^1 +\cdots+2^{\numvar - 1}$. We can perform the following manipulations to obtain a closed form.
\begin{align*}
2 \cdot S_{\abs{\etree}} =& 2^1 +\cdots+2^\numvar = 2^{\numvar} + S_{\abs{\etree}} - 1 \\
S_{\abs{\etree}} =& 2^{\numvar} - 1
\end{align*}
So, then $\abs{\etree}(1,\ldots, 1) = 2^{\numvar} - 1$.
On the other hand, considering $\rpoly(\prob_1,\ldots, \prob_\numvar)$, since we are simply summing up the probabilities of one block of disjoint tuples (recall that $P(x_i) = \frac{P(a_i)}{1\cdot\prod_{j = 1}^{i - 1}(1 - P(x_j))}$ in the reduction for $a_i$ the original $\bi$ probability), it is the case that $\rpoly(\prob_1,\ldots, \prob_\numvar) \leq 1$, and the ratio $\frac{\abs{\etree}(1,\ldots, 1)}{\rpoly(\prob_1,\ldots, \prob_\numvar)}$ in this case is exponential $O(2^\numvar)$. Further note that setting $\poly(\vct{X}) = \sum_{i = 1}^{\numvar} y_i^k$ will yield an $O(2^{\numvar \cdot k})$ bound.
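For a concrete instance of this blow-up, take $\numvar = 3$ (a value chosen only for illustration). Evaluating the expanded form of $\abs{\etree}$ at all ones gives
\begin{align*}
\abs{\etree}(1, 1, 1) =& 1 + (1 + 1)\cdot 1 + (1 + 1)\cdot(1 + 1)\cdot 1\\
=& 1 + 2 + 4 = 7 = 2^3 - 1.
\end{align*}
Since $\rpoly(\prob_1, \prob_2, \prob_3) \leq 1$, the ratio is at least $7$ in this instance, and each additional tuple in the block doubles the size of the next summand in the numerator.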
\subsubsection{Sufficient Condition for $\bi$ for linear time Approximation Algorithm}
Let us introduce a sufficient condition on $\bipdb$ for a linear time approximation algorithm.
\AH{Lemma ~\ref{lem:bi-suf-cond} is not true for the case of $\sigma$, where a $\sigma(\bowtie)$ query could select tuples from the same block, and self join them such that all tuples cancel out. We need a definition for 'safe' (in this context) queries, to prove the lemma.}
\begin{Lemma}\label{lem:bi-suf-cond}
For $\bipdb$ with fixed block size $\abs{b}$, the ratio $\frac{\abs{\etree}(1,\ldots, 1)}{\rpoly(\prob_1,\ldots, \prob_\numvar)}$ is a constant.
\end{Lemma}
\AH{Two observations.
\par
1) I am not sure that the argument below is correct, as I think we would still get something exponential in the numerator $\abs{\etree}(1,\ldots, 1)$.
\par2) I \textit{think} a similar argument will hold however for the method of not using the reduction.}
\begin{proof}[Proof of Lemma ~\ref{lem:bi-suf-cond}]
For increasing $\numvar$ and fixed block size $\abs{b}$ in $\bipdb$, given query $\poly = \sum_{i = 1}^{\numvar} y_i$ where $y_i = x_i \cdot \prod_{j = 1}^{i - 1} (1 - x_j)$, a query whose output is the maximum possible output, it has to be the case as seen in \Cref{subsubsec:bi-gen} that for each block $b$, $\rpoly(\prob_{b, 1},\ldots, \prob_{b, \abs{b}}) = P(a_{b, 1}) + P(a_{b, 2}) + \cdots + P(a_{b, \abs{b}})$ for $a_{b, i}$ in $\bipdb$. As long as there exists no block in $\bipdb$ such that the sum of alternatives is $0$ (which by definition of $\bi$ should be the case), we can bound $\rpoly(\prob_1,\ldots, \prob_\numvar) \geq \frac{\prob_0 \cdot \numvar}{\abs{\block}}$ for $\prob_0 > 0$, and then we have that $\frac{\abs{\etree}(1,\ldots, 1)}{\rpoly(\prob_1,\ldots, \prob_\numvar)}$ is indeed a constant.
\end{proof}
\qed
Given a $\bipdb$ satisfying \Cref{lem:bi-suf-cond}, it is the case by \Cref{lem:approx-alg} that \Cref{alg:mon-sam} runs in linear time.
\AH{\Large \bf{092520 -- 100220 New material.}}
\section{Algorithm ~\ref{alg:mon-sam} for $\bi$}
We may be able to get a better run time by developing a separate approximation algorithm for the case of $\bi$. Instead of performing the reduction from $\bi \mapsto \poly(\ti)$, we decide to work with the original variable annotations given to each tuple alternative in $\bipdb$. For clarity, let us assume the notation of $\bivar$ for the annotation of a tuple alternative. The algorithm yields $0$ for any monomial sampled that cannot exist in $\bipdb$ due to the disjoint property characterizing $\bi$. The semantics for $\rpoly$ change in this case. $\rpoly$ not only performs the same modding function, but also sets all monomial terms to $0$ if they contain distinct variables which appear within the same block.
\begin{algorithm}[H]
\caption{$\approxq_{\biabb}$($\etree$, $\vct{p}$, $\conf$, $\error$, $\bivec$)}
\label{alg:bi-mon-sam}
\begin{algorithmic}[1]
\Require \etree: Binary Expression Tree
\Require $\vct{p} = (\prob_1,\ldots, \prob_\numvar)$ $\in [0, 1]^\numvar$
\Require $\conf$ $\in [0, 1]$
\Require $\error$ $\in [0, 1]$
\Require $\bivec$ $\in [0, 1]^{\abs{\block}}$\Comment{$\abs{\block}$ is the number of blocks}
\Ensure \vari{acc} $\in \mathbb{R}$
\State $\vari{sample}_\vari{next} \gets 0$
\State $\accum \gets 0$\label{alg:mon-sam-global1}
\State $\numsamp \gets \ceil{\frac{2 \log{\frac{2}{\conf}}}{\error^2}}$\label{alg:mon-sam-global2}
\State $(\vari{\etree}_\vari{mod}, \vari{size}) \gets $ \onepass($\etree$)\label{alg:mon-sam-onepass}\Comment{$\onepass$ is \Cref{alg:one-pass} \;and \sampmon \; is \Cref{alg:sample}}
\For{\vari{i} \text{ in } $1\text{ to }\numsamp$}\Comment{Perform the required number of samples}
\State $\bivec \gets \vct{0}$\Comment{Reset the per-block markers before checking a new sample}
\State $(\vari{M}, \vari{sgn}_\vari{i}) \gets $ \sampmon($\etree_\vari{mod}$)\label{alg:mon-sam-sample}
\For{$\vari{x}_\vari{\block,i}$ \text{ in } $\vari{M}$}
\If{$\bivec[\block] = 1$}\Comment{If we have already had a variable from this block, $\rpoly$ drops the sample.}
\State $\vari{sample}_{\vari{next}} \gets 1$
\State break
\Else
\State $\bivec[\block] \gets 1$
% \State $\vari{sum} = 0$
% \For{$\ell \in [\abs{\block}]$}
% \State $\vari{sum} = \vari{sum} + \bivec[\block][\ell]$
% \EndFor
% \If{$\vari{sum} \geq 2$}
% \State $\vari{sample}_{\vari{next}} \gets 1$
% \State continue\Comment{Not sure for psuedo code the best way to state this, but this is analogous to C language continue statement.}
\EndIf
\EndFor
\If{$\vari{sample}_{\vari{next}} = 1$}
\State $\vari{sample}_{\vari{next}} \gets 0$
\State continue
\EndIf
\State $\vari{Y}_\vari{i} \gets 1$\label{alg:mon-sam-assign1}
\For{$\vari{x}_{\vari{j}}$ \text{ in } $\vari{M}$}%_{\vari{i}}$}
\State $\vari{Y}_\vari{i} \gets \vari{Y}_\vari{i} \times \; \vari{\prob}_\vari{j}$\label{alg:mon-sam-product2} \Comment{$\vari{p}_\vari{j}$ is the assignment to $\vari{x}_\vari{j}$ from input $\vct{p}$}
\EndFor
\State $\vari{Y}_\vari{i} \gets \vari{Y}_\vari{i} \times\; \vari{sgn}_\vari{i}$\label{alg:mon-sam-product}
\State $\accum \gets \accum + \vari{Y}_\vari{i}$\Comment{Store the sum over all samples}\label{alg:mon-sam-add}
\EndFor
\State $\vari{acc} \gets \vari{acc} \times \frac{\vari{size}}{\numsamp}$\label{alg:mon-sam-global3}
\State \Return \vari{acc}
\end{algorithmic}
\end{algorithm}
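To illustrate the block check in \Cref{alg:bi-mon-sam}, consider two hypothetical sampled monomials over two blocks, each processed starting from an all-zero $\bivec$. The sign $\vari{sgn} = 1$ and the probabilities $\prob_{1, 1} = 0.5$ and $\prob_{2, 1} = 0.4$ are assumptions made only for this sketch.
\begin{align*}
\vari{M} = \{x_{1, 1}, x_{2, 1}\}:\quad& \vari{Y} = \vari{sgn} \cdot \prob_{1, 1} \cdot \prob_{2, 1} = 1 \cdot 0.5 \cdot 0.4 = 0.2,\\
\vari{M} = \{x_{1, 1}, x_{2, 1}, x_{1, 2}\}:\quad& \bivec[1] = 1 \text{ when } x_{1, 2} \text{ is reached, so the sample contributes } 0.
\end{align*}
Similarly, assuming the logarithm in the sample count is the natural logarithm, the illustrative parameters $\conf = 0.05$ and $\error = 0.1$ would give
\[\numsamp = \ceil{\frac{2 \ln{\frac{2}{0.05}}}{0.1^2}} \approx \ceil{\frac{2 \cdot 3.689}{0.01}} = \ceil{737.8} = 738.\]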
Before redefining $\rpoly$ in terms of the $\bi$ model, we need to define the notion of performing a mod operation with a set of polynomials.
\begin{Definition}[Mod with a set of polynomials]\label{def:mod-set-poly}
To mod a polynomial $\poly$ with a set $\vct{Z} = \{Z_1,\ldots Z_x\}$ of polynomials, the mod operation is performed successively on the $\poly$ modding out each element of the set $\vct{Z}$ from $\poly$.
\end{Definition}
\begin{Example}\label{example:mod-set-poly}
To illustrate for $\poly = X_1^2 + X_1X_2^3$ and the set $\vct{Z} = \{X_1^2 - X_1, X_2^2 - X_2, X_1X_2\}$ we get
\begin{align*}
&X_1^2 + X_1X_2^3 \mod X_1^2 - X_1 \mod X_2^2 - X_2 \mod X_1X_2\\
=&X_1 + X_1X_2^3 \mod X_2^2 - X_2 \mod X_1X_2\\
=&X_1 + X_1X_2 \mod X_1X_2\\
=&X_1
\end{align*}
\end{Example}
\begin{Definition}[$\rpoly$ for $\bi$ Data Model]\label{def:bi-alg-rpoly}
$\rpoly(\vct{X})$ over the $\bi$ data model is redefined to include the following mod operation in addition to definition ~\ref{def:qtilde}. For every $j \neq i$, we add the operation $\mod X_{\block, i}\cdot X_{\block, j}$. For set of blocks $\mathcal{B}$ and the size of block $\block$ as $\abs{\block}$,
\[\rpoly(\vct{X}) = \poly(\vct{X}) \mod \left(\{X_{\block, i}^2 - X_{\block, i} \st \block \in \mathcal{B}, i \in [\abs{\block}]\} \cup \{X_{\block, i}X_{\block, j} \st \block \in \mathcal{B}, i, j \in [\abs{\block}], i \neq j\}\right)
% \mod X_{\block_1, 1}^2 - X_{\block_1, 1} \cdots \mod X_{\block_k, \abs{\block_k}}^2 - X_{\block_k, \abs{\block_k}} \mod X_{b_1, 1} \cdot X_{b_1, 2}\cdots \mod X_{\block_1, \abs{\block_1} -1} \cdot X_{\block, \abs{\block_1}}\cdots \mod X_{\block_k, 1} \cdot X_{\block_k, 2} \cdots \mod X_{\block_k, \abs{\block_k} - 1}\cdot X_{\block_K, \abs{\block_k}}.
\]
\end{Definition}
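As an illustration of \Cref{def:bi-alg-rpoly}, consider a hypothetical instance (assumed here only for the sake of example) with a block $1$ containing two alternatives and a block $2$ containing one alternative, and the polynomial $\poly(\vct{X}) = (X_{1, 1} + X_{1, 2})^2 + X_{1, 1}X_{2, 1}$. Expanding and then applying the mod operations gives
\begin{align*}
\poly(\vct{X}) =& X_{1, 1}^2 + 2X_{1, 1}X_{1, 2} + X_{1, 2}^2 + X_{1, 1}X_{2, 1}\\
\rpoly(\vct{X}) =& X_{1, 1} + 0 + X_{1, 2} + X_{1, 1}X_{2, 1}.
\end{align*}
The squared variables are reduced to degree one, the cross term $X_{1, 1}X_{1, 2}$ is dropped because both variables belong to block $1$, and the term $X_{1, 1}X_{2, 1}$ survives because its variables come from different blocks.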
\subsection{Correctness}
\begin{Theorem}\label{theorem:bi-approx-rpoly-bound}
For any query polynomial $\poly(\vct{X})$, an approximation of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ in the $\bi$ setting can be computed in time $O\left(\treesize(\etree) + \frac{\log{\frac{1}{\conf}}\cdot \abs{\etree}^2(1,\ldots, 1)}{\error^2\cdot\rpoly^2(\prob_1,\ldots, \prob_\numvar)}\right)$, with multiplicative $(\error,\conf)$-bounds, where $k$ denotes the degree of $\poly$.
\end{Theorem}
\begin{proof}[Proof of Theorem ~\ref{theorem:bi-approx-rpoly-bound}]
By the proof of \Cref{lem:approx-alg}, with a minor adjustment on $\evalmp$, such that we define the function to output $0$ for any monomial sharing disjoint variables, coupled with the fact that additional operations in \Cref{alg:bi-mon-sam} are $O(1)$ occurring at most $k$ times for each of the $\numsamp$ samples, the proof of \Cref{theorem:bi-approx-rpoly-bound} immediately follows.
\end{proof}
\qed
\subsection{Safe Query Class for $\bi$}
We want to analyze what is the class of queries and data restrictions that are necessary to guarantee that $\frac{\abs{\etree}(1,\ldots, 1)}{\rpoly(\prob_{1},\ldots, \prob_{\numvar})}$ is $O(1)$.
\subsubsection{When $\rpoly$ is zero}
First, consider the case when $\rpoly$ cancels out all terms in $\poly$, where $\poly \neq \emptyset$. For $\rpoly$ to cancel out a tuple $\tup$, by \Cref{def:bi-alg-rpoly} it must be the case that output tuple $\tup$ is dependent on two different tuples appearing in the same block. For this condition to occur, it must be that the query $\poly$ contains a self join operation on a table $\rel$, from which $\tup$ has been derived.
Certain conditions on both the data and query must exist for all tuples $\tup$ to be cancelled out by $\rpoly$ as described above.
For $\rpoly$ to be $0$, the data of a $\bi$ must satisfy certain conditions.
\begin{Definition}[Data Restrictions]\label{def:bi-qtilde-data}
Consider $\bi$ table $\rel$. For $\rpoly$ to potentially cancel all its terms, $\rel$ must be such that given a self join, the join constraints remain unsatisfied for all tuple combinations $x_{\block_i, \ell} \times x_{\block_j, \ell'}$ for $i \neq j$, $\ell \in [\abs{\block_i}], \ell' \in [\abs{\block_j}]$, i.e., combinations across different blocks. Note that this is trivially satisfied with a $\rel$ composed of just one block. Further, it must be the case that the self join constraint is only satisfied in one or more crossterm combinations $x_{\block, i} \times x_{\block, j}$ for $i \neq j$, i.e., within the same block of the input data.
\end{Definition}
To be precise, only equijoins are considered in the following definition. Before proceeding, note that a natural self join will never result in $\rpoly$ cancelling all terms, since it is the case that each tuple will necessarily join with itself, and $\rpoly$ will not mod out this case. Also, although we are using the term self join, we consider cases such that query operations over $\rel$ might be performed on each join input prior to the join operation. While technically the inputs may not be the same set of tuples, this case must be considered, since all the tuples originate from the table $\rel$. To this end, let $\poly_1(\rel) = S_1$ and $\poly_2(\rel) = S_2$ be the input tables to the join operation.
\begin{Definition}[Class of Cancelling Queries]\label{def:bi-qtilde-query-class}
When \Cref{def:bi-qtilde-data} is satisfied, it must be that $\poly$ contains a join $S_1 \bowtie_\theta S_2$ such that either% that satisfies the following constraints based on its structure.
\textsc{Case 1:} $S_1 \cap S_2 = \emptyset$
%Any join over this structure will produce a $\poly$ such that $\rpoly$ cancels all monomials out.
%Such a condition implies $\rpoly$ is $0$ regardless of join condition $\theta$. Note the beginning premise of this definition, and the fact that such premise rules out the natural join across all attributes, since we would have that $\poly = \rpoly = 0$.
Or
\textsc{Case 2:} $S_1 \cap S_2 \neq \emptyset$, the attributes in the join predicate are non-matching, i.e., neither operand of the comparison is a strict subset of the other, and no input tuple has agreeing values across the join attributes.
%\begin{enumerate}
% \item When the join condition $\theta$ involves equality between matching attributes, it must be that the attributes of the join conditon $\attr{\theta}$ are a strict subset of $\attr{\rel}$. Then, to satisfy \Cref{def:bi-qtilde-data} it must be that the join input consists of non-intersecting strict subsets of $\rel$, meaning $S_1 \cap S_2 = \emptyset$ and $S_1, S_2 \neq \emptyset$. $\poly_1$ in \Cref{ex:bi-tildeq-0} illustrates this condition.
% \item If $\theta$ involves an equality on non-matching attributes, there exist two cases.
% \begin{enumerate}
% \item The first case consists of when the join inputs intersect, i.e., $S_1 \cap S_2 \neq \emptyset$ . To satisfy \Cref{def:bi-qtilde-data} it must be the case that no tuple can exist with agreeing values across all attributes in $\attr{\theta}$. $\poly_3$ of \Cref{ex:bi-tildeq-0} demonstrates this condition.
% \item The second case consists of when $S_1 \cap S_2 = \emptyset$ and $S_1, S_2 \neq \emptyset$ in the join input, and this case does not contradict the requirements of \Cref{def:bi-qtilde-query-class}. This case is illustrated in $\poly_2$ of \Cref{ex:bi-tildeq-0}.
% \end{enumerate}
%\end{enumerate}% , cause $\rpoly$ to be $0$ must have the following characteristics. First, there must be a self join. Second, prior to the self join, there must be operations that produce non-intersecting sets of tuples for each block in $\bi$ as input to the self join operation.
\end{Definition}
In \Cref{ex:bi-tildeq-0}, $\poly_1$ and $\poly_2$ are both examples of \textsc{Case 1}, while $\poly_3$ is an example of \textsc{Case 2}.
\begin{Theorem}\label{theorem:bi-safe-q}
When both \Cref{def:bi-qtilde-data} and \Cref{def:bi-qtilde-query-class} are satisfied, $\rpoly$ cancels out all monomials.
\end{Theorem}
\begin{proof}[Proof of Theorem ~\ref{theorem:bi-safe-q}]
We start with the case that $S_1 \cap S_2 = \emptyset$. When this is the case, by definition, all joins on tuples in $S_1$ and $S_2$ will involve elements in $S_1 \times S_2$ such that both tuples are distinct. Further, \Cref{def:bi-qtilde-data} rules out joins across different blocks, while calling for joins of the above form within the same block. Thus all tuples in the query output are dependent on more than one tuple from the same block, which implies by \Cref{def:bi-alg-rpoly} that $\rpoly$ will cancel all monomials.
For the next case where $S_1 \cap S_2 \neq \emptyset$, note that there exists at least one tuple in both $S_1$ and $S_2$ that is the same. Therefore, all equijoins involving matching attributes will produce at least one self joined tuple in the output, breaking the last property of \Cref{def:bi-qtilde-data}. For the case of equijoins with predicates involving non-matching attribute operands, note that by definition of equijoin, the only case that a tuple shared in both $S_1$ and $S_2$ can join on itself is precisely when that tuple's values agree across all the join attributes in $\theta$. Thus, it is the case that when $S_1 \cap S_2 \neq \emptyset$ and the join predicate involves equality comparison between non-matching attributes such that the values of the non-matching comparison attributes for each tuple in $\{S_1 \cap S_2\}$ do not agree, we have that \Cref{def:bi-qtilde-data} is not contradicted, and when \Cref{def:bi-qtilde-data} is fulfilled, it must be the case that $\poly \neq 0$ while $\rpoly = 0$.
This concludes the proof.
\end{proof}
\qed
Note then that the class of queries described in \Cref{def:bi-qtilde-query-class} belongs to the set of queries containing some form of selection over a self cross product.
%\begin{proof}[Proof of Lemma ~\ref{lem:bi-qtilde-data}]
%\end{proof}
%\begin{proof}[Proof of Lemma ~\ref{lem:bi-qtilde-query-class}]
%\end{proof}
%%%%%%%%%%%%%%%%%%%%%%%
%The condition that causes $\rpoly(\prob_1,\ldots, \prob_\numvar)$ to be $0$ is when all the output tuples in each block cancel each other out. Such occurs when the annotations of each output tuple break the required $\bi$ property that tuples in the same block must be disjoint. This can only occur for the case when a self-join outputs tuples each of which have been joined to another tuple from its block other than itself.
%
%The observation is then the following. In order for such a condition to occur, we must have a query that is a self-join such that the join is on two different sets of atoms for each block. This condition can occur when inner query operations with different constraints on input table $\rel$ produce two non-intersecting sets of tuples and then performs a self join on them, such that the join condition \textit{only} holds for tuples that are members of the same block.
%
%There are two operators that can produce the aforementioned selectivity. First, consider $\sigma$, where two different selection conditions $\theta_1$ and $\theta_2$ over $\rel$ can output sets $S_{\sigma_{\theta_1}}$ and $S_{\sigma_{\theta_2}}$ where $S_{\sigma_{\theta_1}} \cap S_{\sigma_{\theta_2}} = \emptyset$. A join over these two outputs can produce an ouput $\poly$ where all annotations will be disjoint and $\rpoly$ will effectively cancel them all out. Second, consider the projection operator $\pi$, such that projections over $\rel$ which project on different attributes can output two non-intersecting sets of tuples, which when joined, again, provided that the join condition holds only for tuples appearing in the same block, can output tuples all of which will break the disjoint requirement and $\rpoly$ will cancel them out.
\begin{Example}\label{ex:bi-tildeq-0}
Consider the following $\bi$ table $\rel$ consisting of one block, with the following queries $\poly_1 = \sigma_{A = 1}(\rel)\bowtie_{B = B'} \sigma_{A = 2}(\rel)$, $\poly_2 = \sigma_{A = 1}(\rel)\bowtie_{A = B'} \sigma_{A = 2}(\rel)$, and $\poly_3 = \rel \bowtie_{A = B} \rel$. While the output $\poly_i \neq \emptyset$, all queries have that $\rpoly_i = 0$. Since $\rel$ consists of only one block, we will use single indexing over the annotations.
\end{Example}
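To make the cancellation concrete for $\poly_2$, read the annotations off the table $\rel$ in \Cref{fig:bi-ex-table}: the tuples with $A = 1$ carry $x_1$ and $x_3$, and the only tuple with $A = 2$ carries $x_2$ and has $B' = 1$. Both left tuples therefore satisfy $A = B'$, and the two output tuples carry the annotations $x_1x_2$ and $x_2x_3$. Since $\rel$ consists of a single block, \Cref{def:bi-alg-rpoly} mods out every product of two distinct variables:
\begin{align*}
x_1x_2 \mod x_1x_2 =& 0,\\
x_2x_3 \mod x_2x_3 =& 0.
\end{align*}
Every output annotation is thus cancelled, while evaluating the unmodded annotations at $(1,\ldots, 1)$ gives $1$ for each output tuple, so the numerator of the ratio stays positive although $\rpoly_2 = 0$.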
\begin{figure}[ht]
\begin{tabular}{ c | c c c }
\rel & A & B & $\phi$\\
\hline
& 1 & 2 & $x_1$\\
& 2 & 1 & $x_2$\\
& 1 & 3 & $x_3$\\
& 3 & 1 & $x_4$\\
\end{tabular}
\caption{Example~\ref{ex:bi-tildeq-0} Table $\rel$}
\label{fig:bi-ex-table}
\end{figure}
%%%%%%%%%%Query 1 and 2
\begin{figure}[ht]
\begin{subfigure}{0.2\textwidth}
\centering
\begin{tabular}{ c | c c c }
$\sigma_{\theta_{A = 1}}(\rel )$& A & B & $\phi$\\
\hline
& 1 & 2 & $x_1$\\
& 1 & 3 & $x_3$\\
\end{tabular}
\caption{$\poly_1, \poly_2$ First Selection}
\label{subfig:bi-q1-sigma1}
\end{subfigure}
\begin{subfigure}{0.2\textwidth}
\centering
\begin{tabular}{ c | c c c}
$\sigma_{\theta_{A = 2}}(\rel)$ & A & B' & $\phi$\\
\hline
& 2 & 1 & $x_2$\\
\end{tabular}
\caption{$\poly_1, \poly_2$ Second Selection}
\label{subfig:bi-q1-sigma2}
\end{subfigure}
\begin{subfigure}{0.25\textwidth}
\centering
\begin{tabular}{ c | c c c c c}
$\poly_1(\rel)$ & $A_R$ & $B_R$ & $A_{\rel'}$ & $B_{\rel'}$ & $\phi$\\
\hline
& 1 & 2 & 2 & 1 & $x_1x_2$\\
\end{tabular}
\caption{$\poly_1(\rel)$ Output}
\label{subfig:bi-q1-output}
\end{subfigure}
\begin{subfigure}{0.4\textwidth}
\centering
\begin{tabular}{ c | c c c c c}
$\poly_2(\rel)$ & $A_R$ & $B_R$ & $A_{\rel'}$ & $B_{\rel'}$ & $\phi$\\
\hline
& 1 & 2 & 2 & 1 & $x_1x_2$\\
& 1 & 3 & 2 & 1 & $x_2x_3$\\
\end{tabular}
\caption{$\poly_2(\rel)$ Output}
\label{subfig:bi-q2-output}
\end{subfigure}
\caption{$\poly_1, \poly_2(\rel)$}
\label{fig:bi-q1-q2}
\end{figure}
%%%%%%%%%%%Query 3
\begin{figure}[ht]
% \begin{subfigure}{0.2\textwidth}
% \centering
% \begin{tabular}{ c | c c }
% $\pi_{A}(\rel)$ & A & $\phi$\\
% \hline
% & 1 & $x_1$\\
% & 2 & $x_2$\\
% & 1 & $x_3$\\
% & 3 & $x_4$\\
% \end{tabular}
% \caption{$\poly_3$ First Projection}
% \label{subfig:bi-q3-pi1}
% \end{subfigure}
% \begin{subfigure}{0.2\textwidth}
% \centering
% \begin{tabular}{ c | c c }
% $\pi_{B}(\rel)$ & B & $\phi$\\
% \hline
% & 2 & $x_1$\\
% & 1 & $x_2$\\
% & 3 & $x_3$\\
% & 1 & $x_4$\\
% \end{tabular}
% \caption{$\poly_3$ Second Projection}
% \label{subfig:bi-q3-pi2}
% \end{subfigure}
\begin{subfigure}{0.2\textwidth}
\centering
\begin{tabular}{ c | c c c c c }
$\poly_3(\rel)$ & A & B & $A_{\rel'}$ & $B_{\rel'}$ & $\phi$\\
\hline
& 1 & 2 & 2 & 1 & $x_1x_2$\\
& 1 & 2 & 3 & 1 & $x_1x_4$\\
& 2 & 1 & 1 & 2 & $x_1x_2$\\
& 1 & 3 & 2 & 1 & $x_2x_3$\\
& 1 & 3 & 3 & 1 & $x_3x_4$\\
& 3 & 1 & 1 & 3 & $x_3x_4$\\
\end{tabular}
\caption{$\poly_3(\rel)$ Output}
\label{subfig:bi-q3-output}
\end{subfigure}
\caption{$\poly_3(\rel)$}
\label{fig:bi-q3}
\end{figure}
Note that in each of \Cref{subfig:bi-q1-output}, \Cref{subfig:bi-q2-output}, and \Cref{subfig:bi-q3-output}, every output tuple is annotated with a product of two distinct variables from the same block. By \Cref{def:bi-alg-rpoly}, $\rpoly$ cancels every such cross term, and therefore eliminates all tuples output by the respective queries.
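\par To make the cancellation concrete, consider $\poly_2$ from \Cref{subfig:bi-q2-output}. The following sketch assumes that the query polynomial is read as the sum of the annotations of its output tuples, and that \Cref{def:bi-alg-rpoly} maps any monomial containing two distinct variables of the same block to $0$; both are assumptions made here for illustration only.
\[
\poly_2 = x_1x_2 + x_2x_3, \qquad \abs{\etree}(1,\ldots, 1) = 1 + 1 = 2, \qquad \rpoly_2(\vct{\prob}) = 0 + 0 = 0.
\]
Since $x_1$, $x_2$, and $x_3$ all belong to the single block of $\rel$, both monomials vanish, and the ratio $\frac{\abs{\etree}(1,\ldots, 1)}{\rpoly_2(\vct{\prob})}$ is undefined for this query.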
\subsubsection{When $\rpoly > 0$}
\par\AH{General Case and Sufficient Condition for $\bi$ and $\rpoly_{\bi}$ approx alg needs to be written.}
\paragraph{General Case}
Consider the query $\poly = \sum_{i = 1}^{\numvar}x_i$, analogous to a projection in which all tuples agree on the projected set of attributes, i.e., $\tup_i[A] = \tup_j[A]$ for all $i, j \in [\numvar]$ with $i \neq j$. For every $\numvar$ we have $\abs{\etree}(1,\ldots, 1) = \numvar$, which grows without bound as $\numvar$ does. We assume that all $\numvar$ tuples belong to a single block and that the sum of their probabilities remains a constant $c$ as $\numvar$ grows. Since $\poly$ is linear, $\rpoly(\vct{\prob}) = \sum_{i = 1}^{\numvar}\prob_i = c$. Thus, $\frac{\abs{\etree}(1,\ldots, 1)}{\rpoly(\vct{\prob})} = \frac{\numvar}{c}$, i.e., the ratio grows as $O(\numvar)$.
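\par As a concrete instance of this growth (the choice of probabilities is an illustrative assumption), suppose the $\numvar$ alternatives of the block are equiprobable with $\prob_i = \frac{1}{\numvar}$, so that the probabilities sum to $1$. Then
\[
\frac{\abs{\etree}(1,\ldots, 1)}{\rpoly(\vct{\prob})} = \frac{\numvar}{\sum_{i = 1}^{\numvar}\frac{1}{\numvar}} = \frac{\numvar}{1} = \numvar,
\]
so in this instance the ratio is exactly linear in $\numvar$.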
% while $\rpoly(\vct{\prob}) \leq 1$, which implies that the ratio is linear, i.e., $\frac{\abs{\etree}(1,\ldots, 1)}{\rpoly(\vct{p})} = \frac{\numvar}{\numvar \cdot \prob_0} = \frac{1}{\prob_0}$ for $\prob_0 = min(\vct{\prob})$. However, note that for $\numvar \rightarrow \infty$ it is the case that $\prob_0 \rightarrow 0$, and as $\numvar$ grows, so does $\frac{1}{\prob_0}$. Intuitively, consider when $p_0 = \frac{1}{\numvar}$. Then we know that the bound is $\frac{\numvar}{1}$ which is $O(\numvar)$.
\paragraph{Sufficient Condition for $\bi$ to achieve linear approximation}
Consider the same query $\poly = \sum_{i = 1}^{\numvar}x_i$, but this time with a fixed block size, which we denote $\abs{\block}$. Again $\abs{\etree}(1,\ldots, 1) = \numvar$. If the probabilities in every block sum to $1$, then each of the $\frac{\numvar}{\abs{\block}}$ blocks contributes $1$, so $\rpoly(\vct{\prob}) = \frac{\numvar}{\abs{\block}}$ and $\frac{\abs{\etree}(1,\ldots, 1)}{\rpoly(\vct{\prob})} = \frac{\numvar}{\frac{\numvar}{\abs{\block}}} = \abs{\block}$. In the general case, when the probabilities of the alternatives in a block need not sum to $1$, we can lower bound the sum of probabilities by $\frac{\numvar}{\abs{\block}} \cdot \prob_0$ for $\prob_0 = \min(\vct{\prob})$. Note that in $\numvar \cdot \frac{\prob_0}{\abs{\block}}$, the factor $\frac{\prob_0}{\abs{\block}}$ is a constant, so the ratio is at most $\frac{\abs{\block}}{\prob_0}$, which is $O(1)$ as $\numvar$ increases.
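\par For instance, with numbers chosen purely for illustration, take $\numvar = 12$ and $\abs{\block} = 3$, which gives $4$ blocks. If the probabilities in every block sum to $1$, then
\[
\rpoly(\vct{\prob}) = \frac{12}{3} = 4, \qquad \frac{\abs{\etree}(1,\ldots, 1)}{\rpoly(\vct{\prob})} = \frac{12}{4} = 3 = \abs{\block}.
\]
If instead we only assume $\prob_0 = 0.1$, the lower bound gives $\rpoly(\vct{\prob}) \geq 4 \cdot 0.1 = 0.4$, and hence a ratio of at most $\frac{12}{0.4} = 30 = \frac{\abs{\block}}{\prob_0}$. Doubling $\numvar$ to $24$ yields $8$ blocks and leaves both values, $3$ and $30$, unchanged.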