Changes per meeting 030422.

master
Aaron Huber 2022-03-08 11:38:31 -05:00
parent 78bb297b88
commit fb95a8972f
3 changed files with 18 additions and 15 deletions

@ -108,27 +108,29 @@ In this work we will assume that $c =\bigO{1}$ since this is what is typically s
Allowing for unbounded $c$ is an interesting open problem.
\mypar{Hardness of Set Query Semantics and Bag Query Semantics}
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and the data complexity of the problem in general has been shown by Dalvi and Suciu to be \sharpphard\cite{10.1145/1265530.1265571}. For our setting, there exists a trivial polytime algorithm to compute~\Cref{prob:expect-mult} for any $\raPlus$ query over a \abbrCTIDB due to linearity of expectation by simply computing the expectation over a `sum-of-products' representation of the query operations of $\query\inparen{\pdb}\inparen{\tup}$.\BG{I don't think this is understandable, so far we have not defined $\query\inparen{\pdb}\inparen{\tup}$. I don't think that anybody would understand that we mean the lineage polynomial. Also query operators is really misleading here, since people would probably take this to mean relational algebra operators in this context. Perhaps, we just mention here that there exists a trivial algorithm and point forward and discuss the trivial algorithm in 1.1}
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and the data complexity of the problem in general has been shown by Dalvi and Suciu to be \sharpphard\cite{10.1145/1265530.1265571}. For our setting, there exists a trivial polytime algorithm to compute~\Cref{prob:expect-mult} for any $\raPlus$ query over a \abbrCTIDB due to linearity of expectation (see~\Cref{sec:intro-poly-equiv}).% by simply computing the expectation over a `sum-of-products' representation of the query operations of $\query\inparen{\pdb}\inparen{\tup}$.\BG{I don't think this is understandable, so far we have not defined $\query\inparen{\pdb}\inparen{\tup}$. I don't think that anybody would understand that we mean the lineage polynomial. Also query operators is really misleading here, since people would probably take this to mean relational algebra operators in this context. Perhaps, we just mention here that there exists a trivial algorithm and point forward and discuss the trivial algorithm in 1.1}
Since we can compute~\Cref{prob:expect-mult} in polynomial time, the interesting question that we explore deals with analyzing the hardness of computing expectation using fine-grained analysis and parameterized complexity, where we are interested in the exponent of polynomial runtime.
Specifically, in this work we ask if~\Cref{prob:expect-mult} can be solved in time linear in the runtime of an equivalent\BG{what do we mean here by equivalent?} deterministic query. If this is true, then this would open up the way for deployment of \abbrCTIDB\xplural in practice. To analyze this question we denote by $\timeOf{}^*(Q,\pdb)$ the optimal runtime complexity of computing~\Cref{prob:expect-mult} over \abbrCTIDB $\pdb$.
Specifically, in this work we ask if~\Cref{prob:expect-mult} can be solved in time linear in the runtime of an analogous deterministic query, which we make more precise shortly.
%\BG{what do we mean here by equivalent?} deterministic query.
If this is true, then this would open up the way for deployment of \abbrCTIDB\xplural in practice. To analyze this question we denote by $\timeOf{}^*(Q,\pdb)$ the optimal runtime complexity of computing~\Cref{prob:expect-mult} over \abbrCTIDB $\pdb$.
Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$, deterministic database $\gentupset$, and multiplicity bound $\bound$. This paper considers $\raPlus$ queries for which order of operations is \emph{explicit}, as opposed to other query languages, e.g., Datalog, UCQ. Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} = \arg\min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{Note that our work applies to any $\query \in\raPlus$, which implies that specific heuristics for choosing an optimized query can be abstracted away, i.e., our work does not consider heuristic techniques.}
\begin{table}[t!]
\begin{tabular}{|p{0.43\textwidth}|p{0.12\textwidth}|p{0.35\textwidth}|}
\hline
\textbf{Lower bound on $\timeOf{}^*(\query,\pdb)$} & \textbf{Num.} $\bpd$s% \BG{$\bpd$ is used for probability distribution before}
\textbf{Lower bound on $\timeOf{}^*(\qhard,\pdb)$} & \textbf{Num.} $\bpd$s% \BG{$\bpd$ is used for probability distribution before}
& \textbf{Hardness Assumption}\\
\hline
$\Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
$\Omega\inparen{\inparen{\qruntime{\optquery{\qhard}, \tupset, \bound}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
%\hline
$\omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
$\omega\inparen{\inparen{\qruntime{\optquery{\qhard}, \tupset, \bound}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
%\hline
$\Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & \Cref{conj:known-algo-kmatch}\\ %Multiple & Current $k$-matching algorithms\\
$\Omega\inparen{\inparen{\qruntime{\optquery{\qhard}, \tupset, \bound}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & \Cref{conj:known-algo-kmatch}\\ %Multiple & Current $k$-matching algorithms\\
\hline
\end{tabular}
\caption{Our lower bounds for a specific hard query $\query$ parameterized by $k$. For $\pdb = \inset{\worlds, \bpd}$ those with `Multiple' in the second column need the algorithm to be able to handle multiple $\bpd$, i.e., probability distributions (for a given $\tupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_0,C_0,c_0$ are constants that are independent of $k$).\BG{Maybe we should give $Q$ a name, e.g., $Q_a$, $Q_{hard}$, ...}}
\caption{Our lower bounds for a specific hard query $\qhard$ parameterized by $k$. For $\pdb = \inset{\worlds, \bpd}$ those with `Multiple' in the second column need the algorithm to be able to handle multiple $\bpd$, i.e., probability distributions (for a given $\tupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_0,C_0,c_0$ are constants that are independent of $k$).\BG{Maybe we should give $Q$ a name, e.g., $Q_a$, $Q_{hard}$, ...}}
\label{tab:lbs}
\end{table}
\mypar{Our lower bound results}
@ -293,8 +295,8 @@ $, where $\probAllTup = \inparen{\inparen{\prob_{\tup, j}}_{\tup\in\tupset, j\in
\subsection{Our Techniques}
\mypar{Lower Bound Proof Techniques}
%\AH{Regarding what follows (in the next paragraph): I think this \emph{may} be misleading (also, technically incorrect since $\poly$ is used instead of $\rpoly$) since it the \emph{lead} $c_{2k}$ of the term in $\rpoly\inparen{\vct{X}}$ with $2k$ distinct variables. However, technically, since we have that $\rpoly\inparen{\vct{\prob}}$ is a univariate polynomial, then, indeed this IS an accurate statement, since the term with $2k$ distinct variables in $\rpoly\inparen{\vct{\prob}}$ is the term with the highest degree (this assumes for $d$ distinct edges that $d \geq k$ for our special graph query; otherwise, there is no $k$-matching, and the leading coefficient is not $c_{2k}$). Perhaps we should note this. However, the context is in light of considering the \emph{univariate} polynomial $\rpoly\inparen{\vct{\prob}}$. Perhaps change $\poly$ to $\rpoly\inparen{\prob,\ldots,\prob}$.}
Our main hardness result shows that computing~\Cref{prob:expect-mult} is $\sharpwonehard$ for $1$-\abbrTIDB. To prove this result we show that for the same $\query_1$ from the example above, for an arbitrary `product width' $k$, the query $Q^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $\bigO{1}$ tuples in \Cref{fig:two-step}).
We do so by considering an arbitrary graph $G$ (analogous to relation $\boldsymbol{R}$ of $\query$) and analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in $G$ that are isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the leading coefficient $c_{2k}$ (assuming $c_{2k}>0$; see~\Cref{subsec:c2k-proportional}) in $\poly$ corresponding to $\query^k$ is proportional to the number of $k$-matchings in $G$,
Our main hardness result shows that computing~\Cref{prob:expect-mult} is $\sharpwonehard$ for $1$-\abbrTIDB. To prove this result we show that for the same $\query_1$ from the example above, for an arbitrary `product width' $k$, the query $\qhard^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $\bigO{1}$ tuples in \Cref{fig:two-step}).
We do so by considering an arbitrary graph $G$ (analogous to relation $\boldsymbol{R}$ of $\query_1$) and analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in $G$ that are isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the coefficient corresponding to the power of $2k$ in $\poly$ of $\qhard^k$ is proportional to the number of $k$-matchings in $G$,
a known hard problem in parameterized/fine-grained complexity literature.

@ -55,6 +55,7 @@
\newcommand{\relii}{T}
\newcommand{\db}{D}
\newcommand{\query}{Q}
\newcommand{\qhard}{\query_{hard}}
\newcommand{\tset}{\mathcal{T}}%the set of tuples in a database
\newcommand{\join}{\mathlarger\Join}
\newcommand{\select}{\sigma}

@ -2,7 +2,7 @@
%!TEX root=./main.tex
\section{Hardness of Exact Computation}
\label{sec:hard}
In this section, we will prove the hardness results claimed in Table~\ref{tab:lbs} for a specific (family of) hard instance(s) $(\query,\pdb)$ for \Cref{prob:bag-pdb-poly-expected} where $\pdb$ is a $1$-\abbrTIDB.
In this section, we will prove the hardness results claimed in Table~\ref{tab:lbs} for a specific (family of) hard instance(s) $(\qhard,\pdb)$ for \Cref{prob:bag-pdb-poly-expected} where $\pdb$ is a $1$-\abbrTIDB.
Note that this implies hardness for \abbrCTIDB\xplural $\inparen{\bound\geq1}$, showing \Cref{prob:bag-pdb-poly-expected} cannot be done in $\bigO{\qruntime{\optquery{\query},\tupset,\bound}}$ runtime. The results also apply to \abbrOneBIDB and other more general \abbrPDB\xplural.
%(and hence the equivalent \Cref{prob:bag-pdb-query-eval})
%in the negative.
@ -46,12 +46,12 @@ For any graph $G=(V,\edgeSet)$ and $\kElem\ge 1$, define
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\noindent Returning to \Cref{fig:two-step}, it can be seen that $\poly_{G}^\kElem(\vct{X})$ is the lineage polynomial from query $\query_k$, which we define next ($\query_2$ from~\Cref{sec:intro} is the same query with $k=2$). Let us alias
\noindent Returning to \Cref{fig:two-step}, it can be seen that $\poly_{G}^\kElem(\vct{X})$ is the lineage polynomial from query $\qhard^k$, which we define next ($\query_2$ from~\Cref{sec:intro} is the same query with $k=2$). Let us alias
\begin{lstlisting}
SELECT DISTINCT 1 FROM T $t_1$, R r, T $t_2$
WHERE $t_1$.Point = r.Point$_1$ AND $t_2$.Point = r.Point$_2$
\end{lstlisting}
as $R$. The query $\query^k$ then becomes
as $R$. The query $\qhard^k$ then becomes
\mdfdefinestyle{underbrace}{topline=false, rightline=false, bottomline=false, leftline=false, backgroundcolor=black!15!white, innerbottommargin=0pt}
\begin{mdframed}[style=underbrace]
\begin{lstlisting}
@ -62,11 +62,11 @@ SELECT COUNT(*) FROM $\underbrace{R\text{ JOIN }R\text{ JOIN}\cdots\text{JOIN }R
In other words, this instance $\tupset$ contains the set of $\numvar$ unary tuples in $T$ (which corresponds to $\vset$) and $\numedge$ binary tuples in $R$ (which corresponds to $\edgeSet$).
Note that this implies that $\poly_{G}^\kElem$ is indeed a $1$-\abbrTIDB lineage polynomial. % for a \abbrTIDB \abbrPDB.
Next, we note that the runtime for answering $\query^k$ on deterministic database $\tupset$, as defined above, is $O_k\inparen{\numedge}$ (i.e. deterministic query processing is `easy' for this query):
Next, we note that the runtime for answering $\qhard^k$ on deterministic database $\tupset$, as defined above, is $O_k\inparen{\numedge}$ (i.e. deterministic query processing is `easy' for this query):
\begin{Lemma}\label{lem:tdet-om}
Let $\query^k$ and $\tupset$ be as defined above. Then
Let $\qhard^k$ and $\tupset$ be as defined above. Then
% of \Cref{def:qk}, the runtime
$\qruntimenoopt{\query^k, \tupset}$ is $O_k\inparen{\numedge}$.
$\qruntimenoopt{\qhard^k, \tupset}$ is $O_k\inparen{\numedge}$.
\end{Lemma}
\subsection{Multiple Distinct $\prob$ Values}