multi

2021-04-10 13:11:35 -05:00 · 2021-04-10 13:11:35 -05:00 · 40cac20325
parent e836805534
commit 40cac20325
4 changed files with 19 additions and 18 deletions
--- a/mult_distinct_p.tex
+++ b/mult_distinct_p.tex
@@ -45,14 +45,15 @@ For any graph $G=([n],E)$ and $\kElem\ge 1$, define
 \[\poly_{G}^\kElem(X_1,\dots,X_n) = \left(\sum\limits_{(i, j) \in E} X_i \cdot X_j\right)^\kElem\]
 \end{Definition}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-Our hardness results only need a \ti instance; We also consider the special case when all the tuple probabilities (probabilities assigned to $X_i$ by $\probAllTup$) are the same value. Note that our hardness results do not require the general circuit representation and hold for even the expression tree representation. %this polynomial can be encoded in an expression tree of size $\Theta(km)$.
+Our hardness results only need a \ti instance; we also consider the special case when all the tuple probabilities (probabilities assigned to $X_i$ by $\probAllTup$) are the same value. Note that our hardness results % do not require the general circuit representation and
+even hold for expression trees. %this polynomial can be encoded in an expression tree of size $\Theta(km)$.



-\noindent Returning to in \Cref{fig:ex-shipping-simp}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ generalizes our running example query:
-\[\poly^k_G\dlImp Loc(C_1),Route(C_1, C_1'),Loc(C_1'),\dots,Loc(C_\kElem),Route(C_\kElem,C_\kElem'),Loc(C_\kElem')\]
-where adapting the PDB instance in \Cref{fig:ex-shipping-simp}, relation $Loc$ has $n$ tuples corresponding to each vertex in $V=[n]$ each with probability $\prob$ and $Route(\text{City}_1, \text{City}_2)$ has tuples corresponding to the edges $E$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $Route$ as well, but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $Route$ also are present with probability $\prob$ but to simplify notation we assign probability $1$ to edges.}
-Note that this implies that our hard query polynomial can be represented even as an expression tree and is created from a project-join query with same probability value for each $\prob_i$. %; our hardness result transfers here as well.
+\noindent Returning to \Cref{fig:ex-shipping-simp}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ generalizes our running example query:
+\[\poly^k_G\dlImp OnTime(C_1),Route(C_1, C_1'),OnTime(C_1'),\dots,OnTime(C_\kElem),Route(C_\kElem,C_\kElem'),OnTime(C_\kElem')\]
+where, adapting the PDB instance in \Cref{fig:ex-shipping-simp}, relation $OnTime$ has $n$ tuples, one for each vertex in $V=[n]$, each with probability $\prob$, and $Route(\text{City}_1, \text{City}_2)$ has tuples corresponding to the edges $E$ (each with probability $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $Route$ as well, but since they are always present with probability $1$, we drop those. Our argument also works when all the tuples in $Route$ are also present with probability $\prob$, but to simplify notation we assign probability $1$ to edges.}
+Note that this implies that our hard query polynomial can be represented as an expression tree produced by a project-join query with the same probability value $\prob_i$ for each input tuple. %; our hardness result transfers here as well.
 % OK: The following (commented-out) sentence feels a bit misplaced here.
 % -- by contrast our approximation algorithm in \Cref{sec:algo} can handle lineage polynomials represented as circuits generated by union of select-project-join (SPJU) queries with potentially distinct $\prob_i$ values. % (i.e. we do not need union or select operator to derive our hardness result).
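To make the definition of $\poly_{G}^\kElem$ concrete, the following Python sketch expands the polynomial for a small graph, reduces every exponent to $1$, and evaluates the result at a single probability. It then checks the value against a brute-force expectation over all possible worlds, assuming independent tuples that are each present with probability `p`. The function names (`reduced_poly`, `eval_reduced`, `brute_force_expectation`) and the 0-indexed vertex labels are assumptions made for this illustration and do not appear in the paper sources.

```python
from collections import defaultdict
from itertools import product


def reduced_poly(edges, k):
    """Expand (sum over edges (i, j) of X_i * X_j)^k with all exponents reduced to 1.

    Each monomial is stored as the frozenset of variables it contains, so
    repeated variables collapse automatically. Returns {monomial: coefficient}.
    """
    poly = {frozenset(): 1}
    for _ in range(k):
        nxt = defaultdict(int)
        for mono, coef in poly.items():
            for i, j in edges:
                nxt[mono | {i, j}] += coef
        poly = dict(nxt)
    return poly


def eval_reduced(poly, p):
    """Evaluate the reduced polynomial with every variable set to p."""
    return sum(coef * p ** len(mono) for mono, coef in poly.items())


def brute_force_expectation(edges, k, n, p):
    """Expectation of the unreduced polynomial over all 2^n worlds (illustrative only)."""
    total = 0.0
    for world in product((0, 1), repeat=n):
        weight = 1.0
        for w in world:
            weight *= p if w else 1 - p
        total += weight * sum(world[i] * world[j] for i, j in edges) ** k
    return total


# Path on three vertices: edges (0, 1) and (1, 2), with k = 2.
edges = [(0, 1), (1, 2)]
poly = reduced_poly(edges, 2)
```

For this path graph the reduced polynomial is $X_0X_1 + 2X_0X_1X_2 + X_1X_2$, so both evaluation routes agree on $2\prob^2 + 2\prob^3$. The brute-force routine is exponential in $n$ and is included only as a sanity check.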

@@ -69,7 +70,7 @@ Computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ for arbitrary $G$ and any $(2
 \end{Theorem}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %
-We will prove the above result by reducing from the problem of computing the number of $k$-matchings in $G$. Given the current best-known algorithm for this counting problem, our results imply that unless the state-of-the-art $k$-matching algorithms are improved, we cannot hope to solve our problem in time better than $\Omega_k\inparen{m^{k/2}}$, which is only quadratically faster than expanding $\poly_{G}^\kElem(\vct{X})$ into its \abbrSMB form and then using \Cref{cor:expct-sop}. By contrast the approximation algorithm we present in \Cref{sec:algo} has runtime $O_k\inparen{m}$ for  this query. % (since it runs in linear-time on all lineage polynomials).
+We will prove the above result by reducing from the problem of computing the number of $k$-matchings in $G$. Given the current best-known algorithm for this counting problem, our results imply that unless the state-of-the-art $k$-matching algorithms are improved, we cannot hope to solve our problem in time better than $\Omega_k\inparen{m^{k/2}}$ where $m=\abs{E}$, which is only quadratically faster than expanding $\poly_{G}^\kElem(\vct{X})$ into its \abbrSMB form and then using \Cref{cor:expct-sop}. By contrast, the approximation algorithm we present in \Cref{sec:algo} has runtime $O_k\inparen{m}$ for this query. % (since it runs in linear-time on all lineage polynomials).

 \noindent The following lemma reduces the problem of counting $\kElem$-matchings in a graph to our problem (and proves \Cref{thm:mult-p-hard-result}):
 \begin{Lemma}\label{lem:qEk-multi-p}
--- a/prob-def.tex
+++ b/prob-def.tex
@@ -9,15 +9,15 @@ For illustrative purposes consider the polynomial $\poly(\vct{X}) = 2X^2 + 3XY -

 We represent query polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, a standard way to represent polynomials over fields (particularly in the field of algebraic complexity) that we use for polynomials over $\mathbb N$ in the obvious way.

+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Definition}[Circuit]\label{def:circuit}
 A circuit $\circuit$ is a Directed Acyclic Graph (DAG) whose source nodes (in-degree of $0$) consist of elements in either $\reals$ or $\vct{X}$.  The internal nodes and (the single) sink node of $\circuit$ (corresponding to the result tuple $t$) have binary input and are either sum ($\circplus$) or product ($\circmult$) gates.
-
-$\circuit$ additionally has the following members: \type, \val, \vpartial, \vari{input}, \degval and \vari{Lweight}, \vari{Rweight}, where \type is the type of value stored in the node $\circuit$ (i.e. one of $\{\circplus, \circmult, \var, \tnum\}$, \val is the value stored (a constant or variable), and \vari{input} is the list of \circuit 's inputs where $\circuit_\linput$ is the left input and $\circuit_\rinput$ the right input.
+%
+Each node in a circuit $\circuit$ has the following members: \type, \val, \vpartial, \vari{input}, \degval, \vari{Lweight}, and \vari{Rweight}, where \type is the type of value stored in the node (one of $\{\circplus, \circmult, \var, \tnum\}$), \val is the value stored (a constant or variable), and \vari{input} is the list of the node's inputs. We use $\circuit_\linput$ to denote the left input and $\circuit_\rinput$ the right input of the sink of circuit $\circuit$.
 %The member \degval holds the degree of \circuit.
 When the underlying DAG is a tree (with edges pointing towards the root), we will refer to the structure as an expression tree \etree.  Note that in such a case, the root of \etree is analogous to the sink of \circuit.
 \end{Definition}
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


 As stated in \Cref{def:circuit}, every internal node has at most two in-edges, is labeled as an addition or a multiplication node, and has no limit on its outdegree.
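The circuit definition in this hunk can be sketched as a small data structure. The Python below is a minimal illustration, assuming a hypothetical `Node` class that keeps only the `type`, `val`, and `inputs` members (the remaining members from the definition are omitted) and a hypothetical `evaluate` helper. It shows how a DAG lets one gate feed several parents, which an expression tree cannot do.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """One circuit node; type is one of "+", "*", "var", "num" (illustrative encoding)."""
    type: str
    val: object = None                           # constant or variable name for source nodes
    inputs: list = field(default_factory=list)   # [left, right] for gates, empty for sources


def evaluate(node, assignment, cache=None):
    """Evaluate the polynomial at `node`, visiting each shared node only once."""
    if cache is None:
        cache = {}
    key = id(node)
    if key in cache:
        return cache[key]
    if node.type == "num":
        result = node.val
    elif node.type == "var":
        result = assignment[node.val]
    else:
        left, right = (evaluate(child, assignment, cache) for child in node.inputs)
        result = left + right if node.type == "+" else left * right
    cache[key] = result
    return result


# (X + Y) * (X + Y): the sum gate has out-degree 2, so four nodes suffice.
x, y = Node("var", "X"), Node("var", "Y")
plus = Node("+", inputs=[x, y])
sink = Node("*", inputs=[plus, plus])
```

Evaluating `sink` at `{"X": 2, "Y": 3}` gives 25. The equivalent expression tree would need to duplicate the sum gate and both variable leaves.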
--- a/ra-to-poly.tex
+++ b/ra-to-poly.tex
@@ -47,7 +47,7 @@ The reduced form of a lineage polynomial can be obtained but requires a linear s
 \subsection{Probabilistic Databases (PDBs)}

 An \textit{incomplete database} $\idb$ is a set of deterministic databases $\db$ called possible worlds.
-Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is an incomplete database and $\pd$ is a probability distribution over $\idb$. Queries over probabilistic databases are evaluated using the so-called possible world semantics. Under possible world semantics, the result of a query $\query$ over an incomplete database $\idb$ is the set of query answers produced by evaluating $\query$ over each possible world: $\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}$
+Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is an incomplete database and $\pd$ is a probability distribution over $\idb$. Queries over probabilistic databases are evaluated using the so-called possible world semantics. Under the possible world semantics, the result of a query $\query$ over an incomplete database $\idb$ is the set of query answers produced by evaluating $\query$ over each possible world: $\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}$.

 For a probabilistic  database $\pdb = (\idb, \pd)$,  the result of a query is the pair $(\query(\idb), \pd')$ where $\pd'$ is a probability distribution over $\query(\idb)$  that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer:
 \[\forall \db \in \query(\idb): \pd'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \pd(\db') \]
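The possible world semantics above amounts to grouping worlds by their query answer and summing their probabilities. The following Python sketch illustrates this on a toy instance; the representation of a world as a `frozenset` of tuples, the helper `query_result_distribution`, and the example query `select_a` are all assumptions made for illustration.

```python
from collections import defaultdict


def query_result_distribution(worlds, query):
    """Map each distinct query answer to the total probability of the worlds producing it.

    `worlds` maps a deterministic database (a frozenset of tuples) to its probability.
    """
    dist = defaultdict(float)
    for db, prob in worlds.items():
        dist[query(db)] += prob
    return dict(dist)


def select_a(db):
    """Toy selection query: keep tuples whose first attribute equals "a"."""
    return frozenset(t for t in db if t[0] == "a")


# Three possible worlds whose probabilities sum to 1.
worlds = {
    frozenset({("a", 1), ("b", 2)}): 0.5,
    frozenset({("a", 1)}): 0.3,
    frozenset({("b", 2)}): 0.2,
}
dist = query_result_distribution(worlds, select_a)
```

The first two worlds produce the same answer, so that answer receives probability $0.5 + 0.3$, while the empty answer receives the remaining $0.2$.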
@@ -55,8 +55,8 @@ For a probabilistic  database $\pdb = (\idb, \pd)$,  the result of a query is th
 Let $\semNX$ denote the set of polynomials over variables $\vct{X}=(X_1,\dots,X_n)$ with natural number coefficients and exponents.
 We model incomplete relations using Green et al.'s $\semNX$-databases~\cite{DBLP:conf/pods/GreenKT07}, discussed in detail in \Cref{subsec:supp-mat-krelations} and summarized here.
 In an $\semNX$-database, relations are defined as functions from tuples to elements of $\semNX$, typically called annotations.
-We write $R(t)$ to denote the polynomial annotating tuple $t$ in relation $R$.
-Each possible world is defined by an assignment of $N$ binary values $\vct{W} \in \{0, 1\}^{\abs{\vct{X}}}$.
+We write $R(t)$ to denote the polynomial annotating tuple $t$ in relation $R$. Note that $R(t)$ is the lineage polynomial for $t$.
+Each possible world is defined by an assignment of $N$ binary values $\vct{W} \in \{0, 1\}^{\abs{\vct{X}}}$ to $\vct{X}$.
 The multiplicity of $t \in R$ in this possible world, denoted $R(t)(\vct{W})$, is obtained by evaluating the polynomial annotating $t$ on $\vct{W}$.
 $\semNX$-relations are closed under $\raPlus$ (\Cref{fig:nxDBSemantics}).

--- a/single_p.tex
+++ b/single_p.tex
@@ -10,13 +10,13 @@ While \Cref{thm:mult-p-hard-result} shows that computing $\rpoly(\prob,\dots,\pr

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Theorem}\label{th:single-p-hard}
-Fix $\prob\in (0,1)$. Then assuming \Cref{conj:graph} is true, any algorithm that computes $\rpoly_{G}^3(\prob,\dots,\prob)$ from $G$ exactly has to run in time $\Omega\inparen{\abs{E(G)}^{1+\eps_0}}$, where $\eps_0$ is as defined in \Cref{conj:graph}. 
+Fix $\prob\in (0,1)$. Then assuming \Cref{conj:graph} is true, any algorithm that computes $\rpoly_{G}^3(\prob,\dots,\prob)$ from $G$ exactly has to run in time $\Omega\inparen{m^{1+\eps_0}}$, where $\eps_0$ is as defined in \Cref{conj:graph}.
 \end{Theorem}
 %\begin{proof}[Proof of Corollary ~\ref{th:single-p-gen-k}]
 %Consider $\poly^3_{G}$ and $\poly' = 1$ such that $\poly'' = \poly^3_{G} \cdot \poly'$.  By  \Cref{th:single-p}, query $\poly''$ with $\kElem = 4$ has $\Omega(\numvar^{\frac{4}{3}})$ complexity.
 %\end{proof}
-The above shows the hardness for a very specific query polynomial but it is easy to come up with an infinite family of hard query polynomials by `embedding' $\rpoly_{G}^3$ into an infinite family of trivial query polynomials. 
-Unlike \Cref{thm:mult-p-hard-result} the above result does not show that computing $\rpoly_{G}^3(\prob,\dots,\prob)$ for a fixed $\prob\in (0,1)$ is \sharpwonehard. 
+The above shows the hardness for a very specific query polynomial but it is easy to come up with an infinite family of hard query polynomials by `embedding' $\rpoly_{G}^3$ into an infinite family of trivial query polynomials.
+Unlike \Cref{thm:mult-p-hard-result} the above result does not show that computing $\rpoly_{G}^3(\prob,\dots,\prob)$ for a fixed $\prob\in (0,1)$ is \sharpwonehard.
 However, in \Cref{sec:algo} we show that if we are willing to compute an approximation, then this problem (and indeed our problem in a much more general setting) can be solved in linear time.

 %\AH{@atri needs to put in the result for triangles of $\numvar^{\frac{4}{3}}$ runtime.}
@@ -24,7 +24,7 @@ We will prove the above result by the following reduction:
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Theorem}\label{th:single-p}
 Fix $\prob\in (0,1)$. Let $G$ be a graph on $\numedge$ edges.
-If we can compute $\rpoly_{G}^3(\prob,\dots,\prob)$ exactly in $T(\numedge)$ time, then we can exactly compute $\numocc{G}{\tri}$ %count the number of triangles, 3-paths, and 3-matchings in $G$ 
+If we can compute $\rpoly_{G}^3(\prob,\dots,\prob)$ exactly in $T(\numedge)$ time, then we can exactly compute $\numocc{G}{\tri}$ %count the number of triangles, 3-paths, and 3-matchings in $G$
 in $O\inparen{T(\numedge) + \numedge}$ time.
 \end{Theorem}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -44,7 +44,7 @@ Fix $\prob\in (0,1)$. Given $\rpoly_{\graph{\ell}}^3(\prob,\dots,\prob)$ for $\e
 1 - 3\prob                                 &       -(3\prob^2 - \prob^3)\\
 10(3\prob^2 - \prob^3)		&       10(3\prob^2 - \prob^3)
 \end{pmatrix}
-\cdot 
+\cdot
 \begin{pmatrix}
 \numocc{G}{\tri}\\
 \numocc{G}{\threedis}