Minor cleaning

master
Oliver Kennedy 2022-06-05 13:15:51 -04:00
parent 8d4467f7c4
commit 71f7c3259f
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
7 changed files with 26 additions and 25 deletions

@ -3,7 +3,7 @@
In this work, we study the problem of computing a query result tuple's expected multiplicity for probabilistic databases under bag semantics (where each tuple is associated with a multiplicity) exactly and approximately.
Specifically, we are interested in the fine-grained complexity of this problem for \abbrCTIDB\xplural, i.e., probabilistic databases where tuples are independent probabilistic events and the multiplicity of each tuple is bound by a constant $\bound$.
% We consider bag-\abbrTIDB\xplural where we have a bound $\bound$ on the maximum multiplicity of each tuple and tuples are independent probabilistic events (we refer to such databases as \abbrCTIDB\xplural).
Unfortunately, our results imply that computing expected multiplicities for \abbrCTIDB\xplural based on the results produced by deterministic query evaluation algorithms introduces super-linear overhead (under certain parameterized complexity hardness conjectures).
Unfortunately, our results imply that computing expected multiplicities for \abbrCTIDB\xplural based on the output of deterministic query evaluation algorithms introduces super-linear overhead (under certain parameterized complexity hardness conjectures).
% We are specifically interested in the fine-grained complexity of computing expected multiplicities and how it compares to the complexity of deterministic query evaluation algorithms --- if these complexities are comparable, it opens the door to practical deployment of probabilistic databases.
% Unfortunately, our results imply that computing expected multiplicities for \abbrCTIDB\xplural based on the results produced by such query evaluation algorithms introduces super-linear overhead (under parameterized complexity hardness assumptions/conjectures).
Nonetheless, we develop a sampling algorithm that computes a $(1 \pm \epsilon)$-approximation of the expected multiplicity of an output tuple in time linear in the runtime of the corresponding deterministic query for any positive relational algebra ($\raPlus$) query over \abbrCTIDB\xplural and for a non-trivial subclass of block-independent databases (\abbrBIDB\xplural).

@ -19,7 +19,7 @@ is used in our auxiliary algorithm \sampmon for sampling monomials when computin
\begin{Definition}[$\expansion{\circuit}$]\label{def:expand-circuit}
For a circuit $\circuit$, we define $\expansion{\circuit}$ as a list of tuples $(\monom, \coef)$, where $\monom$ is a set of variables and $\coef \in \domN$.
$\expansion{\circuit}$ has the following recursive definition ($\circ$ is list concatenation).
$\expansion{\circuit}$ has the following recursive definition ($\circ$ is list concatenation).\\
$\expansion{\circuit} =
\begin{cases}
\expansion{\circuit_\linput} \circ \expansion{\circuit_\rinput} &\textbf{ if }\circuit.\type = \circplus\\

@ -32,7 +32,7 @@ We call a polynomial $\poly\inparen{\vct{X}}$ a \emph{\abbrCTIDB-lineage polynom
Define a \emph{\abbrOneBIDB} to be the pair $\pdb' = \inparen{\bigtimes_{\tup\in\tupset'}\inset{0, \bound_\tup}, \bpd'},$ where $\tupset'$ is the set of possible tuples such that each $\tup \in \tupset'$ has a multiplicity domain of $\inset{0, \bound_\tup}$, with $\bound_\tup\in\mathbb{N}$. $\tupset'$ is partitioned into $\numblock$ independent blocks $\block_i,$ for $i\in\pbox{\numblock}$, of disjoint tuples. $\bpd'$ is characterized by the vector $\inparen{\prob_\tup}_{\tup\in\tupset'}$ where for every block $\block_i$, $\sum_{\tup \in \block_i}\prob_\tup \leq 1$. Given $W\in\onebidbworlds{\tupset'}$ and for $i\in\pbox{\numblock}$, let $\prob_\tup(W) = \begin{cases}
1 - \sum_{\tup\in\block_i}\prob_\tup & \text{if }W_\tup = 0\text{ for all }\tup\in\block_i\\
0 & \text{if there exists } \tup \neq \tup'\in\block_i; W_\tup, W_{\tup'}\neq 0\\
\prob_\tup & W_\tup \ne 0 \text{ for the unique } t\in B_i.\\
\prob_\tup & \text{if } W_\tup \ne 0 \text{ for one unique } \tup\in\block_i.\\
\end{cases}$
\noindent$\bpd'$ is the probability distribution across all worlds such that, given $W\in\bigtimes_{\tup \in \tupset'}\inset{0,\bound_\tup}$, $\probOf\pbox{\worldvec = W} = \prod_{\tup\in\tupset'}\prob_{\tup}(W)$.
@ -40,7 +40,7 @@ Define a \emph{\abbrOneBIDB} to be the pair $\pdb' = \inparen{\bigtimes_{\tup\in
\end{Definition}
Lineage polynomials for arbitrary deterministic $\gentupset'$ are constructed in a manner analogous to $1$-\abbrTIDB\xplural (see \Cref{fig:nxDBSemantics}), differing only in the base case.
In a $1$-\abbrTIDB, each tuple contributes a multiplicity of 0 or 1, and $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$.%\textcolor{red}{CHANGE}
In a $1$-\abbrTIDB, each tuple contributes a multiplicity of 0 or 1, and $\polyqdt{\rel}{\gentupset}{\tup} = X_\tup$. %\textcolor{red}{CHANGE}
In a \abbrOneBIDB, each tuple $\tup\in\tupset'$ contributes its corresponding multiplicity: %\textcolor{red}{CHANGE}
$\polyqdt{\rel}{\gentupset}{\tup} = c_\tup\cdot X_\tup$. These semantics are fully detailed in \Cref{fig:lin-poly-bidb}.
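As a small illustration of this base case (a sketch only; the values $c_\tup = 3$ and $\prob_\tup = 0.5$ are assumed here for illustration and are not taken from the running example): since $X_\tup$ takes the value $1$ with probability $\prob_\tup$ and $0$ otherwise, linearity of expectation gives
\begin{align*}
\expct\pbox{c_\tup\cdot X_\tup} = c_\tup\cdot\prob_\tup = 3\cdot 0.5 = 1.5 .
\end{align*}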

@ -25,7 +25,7 @@
%That is for \bis that fulfill this restriction approximating the expectation of results of SPJU queries is only has a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim).
% with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB.
In~\Cref{sec:intro}, we introduced the structure $T_{det}\inparen{\cdot}$ to analyze the runtime complexity of~\Cref{prob:expect-mult}.
To decouple our results from specific join algorithms, we first lower bound the cost of a join.
To decouple our results from any specific join algorithm, we first lower bound the cost of a join.
\begin{Definition}[Join Cost]
\label{def:join-cost}

@ -2,7 +2,7 @@
%root: main.tex
\section{Introduction}\label{sec:intro}
This work explores the problem of computing the expectation of the multiplicity of a tuple in the result of a query over a \abbrCTIDB (tuple independent database), a type of probabilistic database with bag semantics where the multiplicity of a tuple is a random variable with range $[0,\bound]\stackrel{\text{def}}{=}\{0,1,\dots,\bound\}$ for some fixed constant $\bound$ and multiplicities assigned to any two tuples are independent of each other.
This work explores the problem of computing the expectation of the multiplicity of a tuple in the result of a query over a \abbrCTIDB (tuple independent database), a type of probabilistic database with bag semantics where the multiplicity of a tuple is a random variable with range $[0,\bound]\stackrel{\text{def}}{=}\{0,1,\dots,\bound\}$ for some fixed constant $\bound$, and multiplicities assigned to any two tuples are independent of each other.
Formally, a \abbrCTIDB,
$\pdb = \inparen{\worlds, \bpd}$ is defined over a set of tuples $\tupset$ and a probability distribution $\bpd$ over all possible worlds generated by assigning each tuple $\tup \in \tupset$ a multiplicity in the range $[0,\bound]$.
Any such world can be encoded as a vector (of length $\numvar=\abs{\tupset}$) from $\worlds$, such that the multiplicity of each $\tup \in \tupset$ is stored at a distinct index.
@ -57,13 +57,13 @@ An $\raPlus$ query is a query expressed in positive relational algebra, i.e., us
%\vspace{-0.53cm}
\end{figure}
As also observed in \cite{https://doi.org/10.48550/arxiv.2201.11524}, computing the expected multiplicity of a result tuple in a bag probabilistic database is the analog of computing the marginal probability in a set \abbrPDB.
As also observed in \cite{https://doi.org/10.48550/arxiv.2201.11524,DBLP:journals/sigmod/GuagliardoL17}, computing the expected multiplicity of a result tuple in a bag probabilistic database is the analog of computing the marginal probability in a set \abbrPDB.
% We will assume that $c =\bigO{1}$, since this is what is typically seen in practice.
% Allowing for unbounded $c$ is an interesting open problem.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Hardness of Set Query Semantics and Bag Query Semantics}
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and its data complexity has, in general, been shown % by Dalvi and Suciu
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and their data complexity has, in general, been shown % by Dalvi and Suciu
to be \sharpphard\cite{10.1145/1265530.1265571}.
In an independent work, Grohe et al.~\cite{https://doi.org/10.48550/arxiv.2201.11524} studied bag-\abbrTIDB\xplural with unbounded multiplicities, which requires %them to explicitly address the issue of
a succinct representation of probability distributions over infinitely many multiplicities.
@ -83,7 +83,8 @@ question that we explore is %the hardness of
\cite{https://doi.org/10.48550/arxiv.2201.11524} also observe that computing the expectation of an output tuple multiplicity is in \ptime, they do not investigate the fine-grained complexity of this problem.}
Specifically, in this work we ask if~\Cref{prob:expect-mult} can be solved in time linear in the runtime of an analogous deterministic query, which we make more precise shortly.
If this is true, then this would open up the way for deployment of \abbrCTIDB\xplural in practice. We expand on the potential practical implications of this problem later in the section but for now we stress that in practice, $\bound$ is indeed constant and most often $\bound=1$; Although higher multiplicities may arise in intermediate results or outputs, input tuples are frequently unique, even in bag-relational data.
If true, this opens up the way for deployment of \abbrCTIDB\xplural in practice. We expand on the practical implications of this problem later in the section but for now we stress that in practice, $\bound$ is indeed constant and most often $\bound=1$.
That is, although production database systems use bag semantics for query evaluation, allowing duplicate intermediate or output tuples, input tuples in real world datasets are still frequently unique.
To analyze this question we denote by $\timeOf{}^*(\query,\pdb, \bound)$ the optimal runtime complexity of computing~\Cref{prob:expect-mult} over \abbrCTIDB $\pdb$ and query $\query$.
@ -112,7 +113,7 @@ Those with `Multiple' in the second column need the algorithm to be able to hand
\mypar{Our lower bound results}
%
Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$ over a deterministic database $\gentupset$ where the maximum multiplicity of any tuple is less than or equal to $\bound$. % This paper considers $\raPlus$ queries, for which order of operations is \emph{explicit}, as opposed to other query languages, e.g. Datalog, UCQ. Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} \approx \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{The upper bounds on runtime that we derive apply pointwise to any $\query \in\raPlus$, allowing us to abstract away the specific heuristics for choosing an optimized query (i.e., Any deterministic query optimization heuristic is equally useful for \abbrCTIDB queries).}\BG{Rewrite: since an optimized Q is also a Q this also applies in the case where there is a query optimizer the rewrites Q}
Our question is whether or not it is always true that for every $\query$, $\timeOf{}^*\inparen{\query, \pdb, \bound}\leq \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$. We remark that the issue of query optimization is orthogonal to this question (recall that an $\raPlus$ query also encodes order of operations) since we want to answer the above question for all $\query$. \emph{Specifically, if there is an equivalent query $\query'$ that is more efficient to evaluate, we allow both the deterministic and probabilistic query processing access to $\query'$}.
Our question is whether or not it is always true that for every $\query$, $\timeOf{}^*\inparen{\query, \pdb, \bound}\leq \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$. We remark that the issue of query optimization is orthogonal to this question (recall that an $\raPlus$ query explicitly encodes order of operations) since we want to answer the above question for all $\query$. \emph{Specifically, if there is an equivalent query $\query'$ that is more efficient to evaluate, we allow both deterministic and probabilistic query processing access to $\query'$}.
Unfortunately, the answer to the above question is no.
\Cref{tab:lbs} shows our results.
@ -152,7 +153,7 @@ compute $\expct_{\vct{W}\sim \pdassign}\pbox{\poly\inparen{\worldvec}}$).
%We note that computing \Cref{prob:expect-mult} is equivalent (yields the same result as) to computing \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
All of our results rely on working with a {\em reduced} form $\inparen{\rpoly}$ of the lineage polynomial $\poly$. As we show, for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating $\rpoly$ over the probabilities that define the $1$-\abbrTIDB.
Further, only light extensions are required to support block independent disjoint probabilistic databases~\cite{DBLP:conf/icde/OlteanuHK10} (bag query semantics with input tuple multiplicity at most $1$). %, for which the proof of~\Cref{lem:tidb-reduce-poly} (introduced shortly) holds .
Further, only light extensions (see \Cref{def:reduced-poly-one-bidb}) are required to support block independent disjoint probabilistic databases~\cite{DBLP:conf/icde/OlteanuHK10} (bag query semantics with input tuple multiplicity at most $1$). %, for which the proof of~\Cref{lem:tidb-reduce-poly} (introduced shortly) holds .
Next, we motivate this reduced polynomial $\rpoly$.
Consider the query $\query_1$ defined as follows over the bag relations of \Cref{fig:two-step}:
@ -185,7 +186,7 @@ The simple insight to get around this issue to note that the random variables $\
Given that $U$ can only have multiplicity of $1$ or $2$ but not both, we drop the monomials with the term $U_1U_2$ to get
$\refpoly{1, }^{\inparen{ABU}^2}\inparen{A, U_1, U_2, B} = A^2U_1^2B^2+2^2\cdot A^2 U_2^2B^2.$
Now that all the world vectors $(\randWorld_A,\randWorld_{U_1},\randWorld_{U_2},\randWorld_B)\in\inset{0,1}^4$, we have $\expct\pbox{\refpoly{1, }^2}=\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_1}}\expct\pbox{\randWorld_{B}}+$ \\ $4\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_2}}\expct\pbox{\randWorld_{B}}\stackrel{\text{def}}{=}\rpoly_1^2\inparen{p_A,\probOf\inparen{U=1},\probOf\inparen{U=2},p_B}$. We only made the argument for a single monomial, but by linearity of expectation we can apply the same argument to all monomials in $\poly_1^2$. Generalizing this argument to an arbitrary $\poly$ leads us to consider the following `reduced' version:
Now that world vectors $(\randWorld_A,\randWorld_{U_1},\randWorld_{U_2},\randWorld_B)\in\inset{0,1}^4$, we have $\expct\pbox{\refpoly{1, }^2}=\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_1}}\expct\pbox{\randWorld_{B}}+$ \\ $4\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_2}}\expct\pbox{\randWorld_{B}}\stackrel{\text{def}}{=}\rpoly_1^2\inparen{p_A,\probOf\inparen{U=1},\probOf\inparen{U=2},p_B}$. We only made the argument for a single monomial, but by linearity of expectation we can apply the same argument to all monomials in $\poly_1^2$. Generalizing this argument to an arbitrary $\poly$ leads us to consider the following `reduced' version:
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly\inparen{\inparen{X_\tup}_{\tup\in\tupset}}$ define the reformulated polynomial $\refpoly{}\inparen{\inparen{X_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}
@ -231,7 +232,7 @@ We do so by considering an arbitrary graph $G$ (analogous to relation $\boldsymb
Our negative results (\Cref{tab:lbs}) indicate that \abbrCTIDB{}s (even for $\bound=1$) cannot achieve comparable performance to deterministic databases for exact results (under complexity assumptions). In fact, under plausible hardness conjectures, one cannot (drastically) improve upon the trivial algorithm to exactly compute the expected multiplicities for $1$-\abbrTIDB\xplural. A natural followup is whether we can do better if we are willing to settle for an approximation to the expected multiplicities.
\input{two-step-model}
We adopt a two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
We adopt the two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
(i) \termStepOne (\abbrStepOne): Given input $\tupset$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial $\poly(\vct{X})%=\textcolor{red}{CHANGE}\apolyqdt\inparen{\vct{X}}$
$;
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct_{\randWorld\sim\bpd}\pbox{\poly(\vct{\randWorld})}$.
@ -245,8 +246,8 @@ $\circuit : \timeOf{\abbrStepOne}(\query,\tupset, \circuit) + \timeOf{\abbrStepT
\end{Problem}
A key insight of this paper is that the representation of $\circuit$ matters.
For example, if we insist that $\circuit$ represent the lineage polynomial in \abbrSMB, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$,
and hence, just $\timeOf{\abbrStepOne}(\query,\tupset,\circuit)$ is too large.
For example, if we insist that $\circuit$ represent the lineage polynomial in \abbrSMB, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$, where $|\circuit|$ is the size of circuit $\circuit$.
Hence, just $\timeOf{\abbrStepOne}(\query,\tupset,\circuit)$ is too large.
However, systems can directly emit compact, factorized representations of $\poly(\vct{X})$ (e.g., as a consequence of the standard projection push-down optimization~\cite{DBLP:books/daglib/0020812}).
Accordingly, this work uses (arithmetic) circuits\footnote{
An arithmetic circuit is a DAG with variable/numeric source gates and multiplication/addition internal/sink gates.
@ -254,14 +255,14 @@ Accordingly, this work uses (arithmetic) circuits\footnote{
as the representation system of $\poly(\vct{X})$, and we show in \Cref{sec:circuit-depth} an $\bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more precisely, a single circuit $\circuit$ with one sink per tuple representing the tuple's lineage).
Given that a representation $\circuit^*$ exists where $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$, we can focus on the complexity of \abbrStepTwo.
As we also show in \Cref{sec:circuit-runtime}, the size is also bounded by $\qruntime{\optquery{\query}, \tupset, \bound}$ (i.e., $|\circuit^*| \le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$), where $|\circuit|$ is the size of circuit $\circuit$.
As we also show in \Cref{sec:circuit-runtime}, the size is also bounded by $\qruntime{\optquery{\query}, \tupset, \bound}$ (i.e., $|\circuit^*| \le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$).
%Thus, the question of approximation can be stated as the following stronger (since~\Cref{prob:big-o-joint-steps} has access to \emph{all} equivalent \circuit representing $\query\inparen{\vct{W}}\inparen{\tup}$), but sufficient condition:
Given such a $\circuit^*$, to solve \Cref{prob:big-o-joint-steps}, it is \emph{sufficient} to solve: % the following problem:
\begin{Problem}\label{prob:intro-stmt}
Given one circuit $\circuit$ that encodes $\Phi\inparen{\vct{X}}$ for all result tuples $\tup$ (one sink per $\tup$) for \abbrCTIDB $\pdb$ and $\raPlus$ query $\query$, does there exist an algorithm that computes a $(1\pm\epsilon)$-approximation of $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$ (for all result tuples $\tup$) in $\bigO{|\circuit|}$ time?
\end{Problem}
We will formalize the notions of circuits and hence, \Cref{prob:intro-stmt} in \Cref{sec:expression-trees}. For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then (with an additive adjustment) $\poly\left(\prob_1,\dots, \prob_n\right)$ (recall \Cref{def:reduced-poly}) is a constant factor approximation of $\rpoly$.
We will formalize the notions of circuits and hence, \Cref{prob:intro-stmt} in \Cref{sec:expression-trees}. For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then (with an additive adjustment) $\poly\left(\prob_1,\dots, \prob_n\right)$ is a constant factor approximation of $\rpoly$ (recall \Cref{def:reduced-poly}).
This is illustrated in the following example using $\query_1^2$ from earlier. To aid in presentation we again limit our focus to $\refpoly{1, }^{\inparen{ABU}^2}$, assume $\bound = 2$ for variable $U$ and $\bound = 1$ for all other variables. Let $\prob_A$ denote $\probOf\pbox{A = 1}$.
%In computing $\rpoly$, we have some cancellations to deal with:
Then we have:
@ -296,7 +297,7 @@ If we assume that all probability values are in $[p_0,1]$ for some $p_0>0$,
we get that $\refpoly{1, }^{\inparen{ABU}^2}\inparen{\vct{\prob}} - 4\prob_A^2\prob_{U_1}\prob_{U_2}\prob_B^2$ is in the range $\pbox{p_0^3\cdot\rpoly^{\inparen{ABU}^2}_1\inparen{\vct{\prob}}, \rpoly_1^{\inparen{ABU}^2}\inparen{\vct{\prob}}}$.
%We can simulate sampling from $\refpoly{1, }^2\inparen{\vct{X}}$ by sampling monomials from $\refpoly{1, }^2$ while ignoring any samples $A^2X_1X_2B^2$.
Note however, that this is \emph{not a tight approximation}.
In~\cref{sec:algo} we demonstrate that a $(1\pm\epsilon)$ (multiplicative) approximation with competitive performance is achievable.
In~\Cref{sec:algo} we demonstrate that a $(1\pm\epsilon)$ (multiplicative) approximation with competitive performance is achievable.
To get a $(1\pm \epsilon)$-multiplicative approximation and solve~\Cref{prob:intro-stmt}, using \circuit we uniformly sample monomials from the equivalent \abbrSMB representation of $\poly$ (without materializing the \abbrSMB representation) and `adjust' their contribution to $\widetilde{\poly}\left(\cdot\right)$.
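To see concretely why the range stated above for $\refpoly{1, }^{\inparen{ABU}^2}$ is loose, the following worked instance plugs numbers into it. The values $\prob_A=\prob_B=0.5$, $\prob_{U_1}=\probOf\inparen{U=1}=0.4$, and $\prob_{U_2}=\probOf\inparen{U=2}=0.2$ (so that $p_0 = 0.2$) are assumed purely for illustration.
\begin{align*}
\rpoly_1^{\inparen{ABU}^2}\inparen{\vct{\prob}} &= \prob_A\prob_{U_1}\prob_B + 4\,\prob_A\prob_{U_2}\prob_B = 0.1 + 0.2 = 0.3,\\
\refpoly{1, }^{\inparen{ABU}^2}\inparen{\vct{\prob}} - 4\prob_A^2\prob_{U_1}\prob_{U_2}\prob_B^2 &= \prob_A^2\prob_{U_1}^2\prob_B^2 + 4\,\prob_A^2\prob_{U_2}^2\prob_B^2 = 0.01 + 0.01 = 0.02,\\
p_0^3\cdot\rpoly_1^{\inparen{ABU}^2}\inparen{\vct{\prob}} &= 0.008\cdot 0.3 = 0.0024 .
\end{align*}
The adjusted value $0.02$ does fall in $\pbox{0.0024, 0.3}$, yet it underestimates the true expectation $0.3$ by a factor of $15$ in this instance, which is consistent with the remark that the bound is not tight.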

@ -2,7 +2,7 @@
%!TEX root=./main.tex
\section{Hardness of Exact Computation}
\label{sec:hard}
In this section, we will prove the hardness results claimed in Table~\ref{tab:lbs} for a specific (family) of hard instance $(\qhard,\pdb)$ for \Cref{prob:bag-pdb-poly-expected} where $\pdb$ is a $1$-\abbrTIDB.
In this section, we will prove the hardness results claimed in Table~\ref{tab:lbs} for a specific (family) of hard instances $(\qhard,\pdb)$ for \Cref{prob:bag-pdb-poly-expected} where $\pdb$ is a $1$-\abbrTIDB.
Note that this implies hardness for \abbrCTIDB\xplural $\inparen{\bound\geq1}$
%; \Cref{prob:bag-pdb-poly-expected} cannot be done in $\bigO{\qruntime{\optquery{\query},\tupset,\bound}}$ runtime. The results also apply to
as well as \abbrOneBIDB. % and other \abbrPDB\xplural.
@ -76,7 +76,7 @@ SELECT COUNT(*) FROM $\underbrace{Q_1\text{ JOIN }Q_1\text{ JOIN}\cdots\text{JOI
In the above, $\query_1$ is defined in \Cref{sec:intro}, which is the same as $\qhard^1$.
%
%\noindent %Consider again the \abbrCTIDB instance $\pdb$ of~\Cref{fig:two-step} and, for our hard instance, let $\bound = 1$. $\pdb$ generalizes to one compatible
We next define the instances for $T$ and $R$ that lead to the lineage polynomial in~\Cref{def:qk} as follows. Relation $T$ has $n$ tuples corresponding to each vertex for $i$ in $[n]$, each with probability $\prob$ and $R$ has tuples corresponding to the edges $\edgeSet$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $R$ as well, but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $R$ also are present with probability $\prob$ but to simplify notation we assign probability $1$ to edges.}
We next define the instances for $T$ and $R$ that lead to the lineage polynomial in~\Cref{def:qk} as follows. Relation $T$ has $n$ tuples corresponding to each vertex for $i$ in $[n]$, each with probability $\prob$ and $R$ has tuples corresponding to the edges $\edgeSet$ (each with a probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $R$ as well, but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $R$ also are present with probability $\prob$ but to simplify notation we assign probability $1$ to edges.}
In other words, this instance $\tupset$ contains the set of $\numvar$ unary tuples in $T$ (which corresponds to $\vset$) and $\numedge$ binary tuples in $R$ (which corresponds to $\edgeSet$).
Note that this implies that $\poly_{G}^\kElem$ is indeed a $1$-\abbrTIDB lineage polynomial.

@ -2,10 +2,10 @@
%!TEX root=./main.tex
\subsection{Formalizing \Cref{prob:intro-stmt}}\label{sec:expression-trees}
We focus on the problem of computing $\expct_{\worldvec\sim\pdassign}\pbox{\poly\inparen{\vct{\randWorld}}}$ from now on.%, assume implicit $\query, \tupset, \tup$, and drop them from $\apolyqdt$ (i.e., $\poly\inparen{\vct{X}}$ will denote a polynomial).
We focus on the problem of computing $\expct_{\worldvec\sim\pdassign}\pbox{\poly\inparen{\vct{\randWorld}}}$ from now on. %, assume implicit $\query, \tupset, \tup$, and drop them from $\apolyqdt$ (i.e., $\poly\inparen{\vct{X}}$ will denote a polynomial).
%
\Cref{prob:intro-stmt} asks if there exists a linear time approximation algorithm in the size of a given circuit \circuit which encodes $\poly\inparen{\vct{X}}$. Recall that in this work we
represent lineage polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, a standard way to represent polynomials over fields (particularly in the field of algebraic complexity) that we use for polynomials over $\mathbb N$ in the obvious way. Since we are specifically using circuits to model lineage polynomials, we can refer to these circuits as lineage circuits. However, when the meaning is clear, we will drop the term lineage and only refer to them as circuits.
\Cref{prob:intro-stmt} asks if there exists a linear time approximation algorithm in the size of a given circuit \circuit that encodes $\poly\inparen{\vct{X}}$. Recall that in this work we
represent lineage polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, a standard way to represent polynomials over fields (particularly in the field of algebraic complexity), which we use for polynomials over $\mathbb N$ in the obvious way. Since we are specifically using circuits to model lineage polynomials, we can refer to these circuits as lineage circuits. However, when the meaning is clear, we will drop the term lineage and only refer to them as circuits.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Circuit]\label{def:circuit}
@ -100,7 +100,7 @@ $\circuitset{\polyX}$ is the set of all possible circuits $\circuit$ such that $
\Cref{fig:circuit} depicts a circuit \circuit in $\circuitset{2X^2+3XY-2Y^2}$. Light-text annotations
%denote the computation of $\abs{\circuit}\inparen{1, \ldots, 1}$ which we introduce
can be ignored until~\Cref{sec:algo}.%One can think of $\circuitset{\polyX}$ as the infinite set of circuits where for each element \circuit, $\polyf\inparen{\circuit} = \polyX$.
can be ignored until~\Cref{sec:algo}. %One can think of $\circuitset{\polyX}$ as the infinite set of circuits where for each element \circuit, $\polyf\inparen{\circuit} = \polyX$.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\medskip
@ -115,7 +115,7 @@ can be ignored until~\Cref{sec:algo}.%One can think of $\circuitset{\polyX}$ as
Let $\pdb$ be an arbitrary \abbrCTIDB and $\vct{X}$ be the set of variables annotating tuples in $\tupset$. Fix an $\raPlus$ query $\query$ and a result tuple $\tup$.
The \expectProblem is defined as follows:%\\[-7mm]
\begin{flalign*}
&\textbf{Input}: \circuit \in \circuitset{\polyX}\\% \text{ for }\poly\inparen{\vct{X}} = \textcolor{red}{CHANGE}\poly\pbox{\query,\tupset,\tup}&\\
&\textbf{Input}: \circuit \in \circuitset{\poly\pbox{\query, \tupset, \tup}\inparen{\vct{X}}}\\% \text{ for }\poly\inparen{\vct{X}} = \textcolor{red}{CHANGE}\poly\pbox{\query,\tupset,\tup}&\\
&\textbf{Output}: \expct_{\vct{W} \sim \bpd}\pbox{\poly\pbox{\query, \tupset, \tup}\inparen{\vct{W}}}.&
\end{flalign*}
\end{Definition}
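As a sanity check on what the output of this problem looks like, consider the polynomial $2X^2+3XY-2Y^2$ encoded by the circuit of \Cref{fig:circuit}. Because of its negative coefficient it is not itself a lineage polynomial, and it is used here only because it is the circuit example at hand. Assume, for illustration only, a $1$-\abbrTIDB in which $X$ and $Y$ are independent, $\inset{0,1}$-valued, and $\probOf\pbox{X = 1} = \probOf\pbox{Y = 1} = 0.5$. Since $W^2 = W$ for any $W\in\inset{0,1}$, linearity of expectation and independence give
\begin{align*}
\expct\pbox{2X^2+3XY-2Y^2} &= 2\,\expct\pbox{X} + 3\,\expct\pbox{X}\expct\pbox{Y} - 2\,\expct\pbox{Y}\\
&= 2(0.5) + 3(0.5)(0.5) - 2(0.5) = 0.75 .
\end{align*}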