More changes to notation, etc.

Aaron Huber 2021-06-11 11:22:58 -04:00
parent 569ae79057
commit 95a311565b
8 changed files with 73 additions and 64 deletions

@ -3,7 +3,7 @@
To justify the use of $\semNX$-databases, we need to show that we can encode any $\semN$-PDB in this way and that the query semantics over this representation coincides with query semantics over $\semN$-PDB. For that it will be opportune to define representation systems for $\semN$-PDBs.\BG{cite}
To justify the use of $\semNX$-databases, we need to show that we can encode any $\semN$-PDB in this way and that the query semantics over this representation coincides with query semantics over its respective $\semN$-PDB. For that it will be opportune to define representation systems for $\semN$-PDBs.\BG{cite}
@ -19,16 +19,17 @@ To justify the use of $\semNX$-databases, we need to show that we can encode any
As mentioned above we will use $\semNX$-databases paired with a probability distribution as a representation system.
We refer to such databases as $\semNX$-PDBs and use bold symbols to distinguish them from possible worlds (which are $\semN$-databases).
Formally, an $\semNX$-PDB is an $\semNX$-database $\idb_{\semNX}$ and a probability distribution $\pd$ over assignments $\assign$ of the variables $\vct{X} = \{X_1, \ldots, X_\numvar\}$ occurring in annotations of $\idb_{\semNX}$ to $\{0,1\}$. Note that an assignment $\assign: \vct{X} \to \{0,1\}^\numvar$ can be represented as a vector $\vct{w} \in \{0,1\}^n$ where $\vct{w}[i]$ records the value assigned to variable $X_i$. Thus, from now on we will solely use such vectors which we refer to as \emph{world vectors} and implicitly understand them to represent assignments. Given an assignment $\assign$ we use $\assign(\pxdb)$ to denote the semiring homomorphism $\semNX \to \semN$ that applies the assignment $\assign$ to all variables of a polynomial and evaluates the resulting expression in $\semN$.\BG{explain connection to homomorphism lifting in K-relations}
As mentioned above we will use $\semNX$-databases paired with a probability distribution as a representation system, referring to such databases as $\semNX$-PDBs.
Formally, an $\semNX$-PDB is an $\semNX$-database $\idb_{\semNX}$ and a probability distribution $\pd$ over assignments $\assign$ of the variables $\vct{X} = \{X_1, \ldots, X_\numvar\}$ occurring in annotations of $\idb_{\semNX}$ to $\{0,1\}$.
\AH{There was a big ICDT reviewer complaint in this section, but I don't know that I think it confuses things to think of them as both an assignment and/or a vector of variables.}
Note that an assignment $\assign: \vct{X} \to \{0,1\}$ can be represented as a vector $\vct{w} \in \{0,1\}^n$ where $\vct{w}[i]$ records the value assigned to variable $X_i$. Thus, from now on we will solely use such vectors, which we refer to as \emph{world vectors}, and implicitly understand them to represent assignments. Given an assignment $\assign$ we use $\assign(\pxdb)$ to denote the semiring homomorphism $\semNX \to \semN$ that applies the assignment $\assign$ to all variables of a polynomial and evaluates the resulting expression in $\semN$.\BG{explain connection to homomorphism lifting in K-relations}
An $\semNX$-PDB $\pxdb$ over variables $\vct{X} = \{X_1, \ldots, X_n\}$ is a tuple $(\idb_{\semNX},\pd)$ where $\idb_{\semNX}$ is an $\semNX$-database and $\pd$ is a probability distribution over $\vct{w} \in \{0,1\}^n$. We use $\assign_{\vct{w}}$ to denote the assignment corresponding to $\vct{w} \in \{0,1\}^n$. The $\semN$-PDB $\rmod(\pxdb) = (\idb, \pd')$ encoded by $\pxdb$ is defined as:
\idb & = \{ \assign_{\vct{w}}(\pxdb) \mid \vct{w} \in \{0,1\}^n \} \\
\forall \db \in \idb: \probOf'(\db) & = \sum_{\vct{w} \in \{0,1\}^n: \assign_{\vct{w}}(\pxdb) = \db} \probOf(\vct{w})
\forall \db \in \idb: \probOf(\db) & = \sum_{\vct{w} \in \{0,1\}^n: \assign_{\vct{w}}(\pxdb) = \db} \probOf(\vct{w})
@ -39,7 +40,9 @@ For instance, consider a $\pxdb$ consisting of a single tuple $\tup_1 = (1)$ ann
D_{[0,1]}(\tup_1) = 1 \hspace{0.3cm} \mathbf{and} \hspace{0.3cm} D_{[1,1]}(\tup_1) = 2
Importantly, as the following proposition shows, any finite $\semN$-PDB can be encoded as an $\semNX$-PDB and $\semNX$-PDBs are closed under positive relational algebra queries, the class of queries we are interested in in this work.
\AH{I get the notation above, but we never formally introduced it.}
Importantly, as the following proposition shows, any finite $\semN$-PDB can be encoded as an $\semNX$-PDB and $\semNX$-PDBs are closed under positive relational algebra queries, the class of queries we are interested in in this work. \AH{We need to stick with one formalism.}
\AH{Is it a known result that $\semNX$-\abbrPDB\xplural are closed under $\raPlus$ queries?}
@ -48,23 +51,27 @@ $\semNX$-PDBs are a complete representation system for $\semN$-PDBs that is clos
%\subsection{Proof of~\Cref{prop:semnx-pdbs-are-a-}}
To prove that $\semNX$-PDBs are complete consider the following construction that for any $\semN$-PDB $\pdb = (\idb, \pd)$ produces an $\semNX$-PDB $\pxdb = (\idb_{\semNX}, \pd')$ such that $\rmod(\pxdb) = \pdb$. Let $\idb = \{D_1, \ldots, D_{\abs{\idb}}\}$ and let $max(D_i)$ denote $max_{\tup} D_i(\tup)$. For each world $D_i$ we create a corresponding variable $X_i$.
To prove that $\semNX$-PDBs are complete consider the following construction that for any $\semN$-PDB $\pdb = (\idb, \pd)$ produces an $\semNX$-PDB $\pxdb = (\idb_{\semNX}, \pd')$ such that $\rmod(\pxdb) = \pdb$. Let $\idb = \{D_1, \ldots, D_{\abs{\idb}}\}$ and let $max(D_i)$
\AH{What are we using $max(D_i)$ for?}
denote $max_{\tup} D_i(\tup)$. For each world $D_i$ we create a corresponding variable $X_i$.
%variables $X_{i1}$, \ldots, $X_{im}$ where $m = max(D_i)$.
In $\idb_{\semNX}$ we assign each tuple $\tup$ the polynomial:
\idb_{\semNX}(\tup) = \sum_{i=1}^{\abs{\idb}} D_i(\tup)\cdot X_{i}
The probability distribution $\pd'$ assigns all world vectors zero probability except for $\abs{\idb}$ world vectors (representing the possible worlds) $\vct{w_i}$. All elements of $\vct{w_i}$ are zero except for the position corresponding to variables $X_{i}$ which is set to $1$. Unfolding definitions it is trivial to show that $\rmod(\pxdb) = \pdb$. Thus, $\semNX$ are a complete representation system.
The probability distribution $\pd'$ assigns all world vectors zero probability except for $\abs{\idb}$ world vectors (representing the possible worlds) $\vct{w}_i$. All elements of $\vct{w}_i$ are zero except for the position corresponding to variable $X_{i}$, which is set to $1$. Unfolding definitions, it is trivial to show that $\rmod(\pxdb) = \pdb$. Thus, $\semNX$-PDBs are a complete representation system.
The closure under $\raPlus$ queries follows from the fact that an assignment $\vct{X} \to \{0,1\}$ is a semiring homomorphism and that semiring homomorphisms commute with queries over $\semK$-relations.
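The completeness construction above can be illustrated on a toy instance. The following is a minimal sketch, not the authors' implementation: the two-world database, the tuple names, and the helpers `encode` and `rmod` are hypothetical, and `rmod` is simplified to enumerate only world vectors with nonzero probability.

```python
# Hypothetical toy N-PDB: two possible worlds over tuples "t1" and "t2".
# Each world maps a tuple to its multiplicity; the probabilities sum to 1.
worlds = [{"t1": 2, "t2": 0}, {"t1": 1, "t2": 3}]
probs = [0.4, 0.6]

def encode(worlds, probs):
    """One variable X_i per world; the annotation of t is sum_i D_i(t) * X_i."""
    tuples = sorted({t for w in worlds for t in w})
    # A linear polynomial is stored as its coefficient list [D_1(t), ..., D_m(t)].
    annotations = {t: [w.get(t, 0) for w in worlds] for t in tuples}
    m = len(worlds)
    # Only the unit world vectors w_i receive nonzero probability.
    dist = {tuple(int(j == i) for j in range(m)): probs[i] for i in range(m)}
    return annotations, dist

def rmod(annotations, dist):
    """Apply each world vector in dist to every annotation and collect the worlds."""
    decoded = {}
    for w, p in dist.items():
        world = tuple(sorted(
            (t, sum(c * x for c, x in zip(coefs, w)))
            for t, coefs in annotations.items()))
        decoded[world] = decoded.get(world, 0.0) + p
    return decoded

annotations, dist = encode(worlds, probs)
decoded = rmod(annotations, dist)
```

Under these assumptions, `decoded` contains exactly the two original worlds with their original probabilities, which is the statement $\rmod(\pxdb) = \pdb$ restricted to worlds of nonzero probability.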
Now let us consider computing the expected multiplicity of a tuple $\tup$ in the result of a query $\query$ over an $\semN$-PDB $\pdb$ using the annotation of $\tup$ in the result of evaluating $\query$ over an $\semNX$-PDB $\pxdb$ for which $\rmod(\pxdb) = \pdb$. The expectation of the polynomial $\poly = \query(\pxdb)(\tup)$ based on the probability distribution of $\pxdb$ over the variables in $\pxdb$ is:
\AH{The wording ``...over the variables...'' I \emph{think} can be misleading since we also discuss a probability distribution $\pd$ being induced by a vector of probability assignments $\vct{p}$ to each variable $\pVar_i$.}
\expct_{\vct{W} \sim \pd}\pbox{\poly(\vct{W})} = \sum_{\vct{w} \in \{0,1\}^n} \assign_{\vct{w}}(\query(\pxdb)(\tup)) \cdot \probOf(\vct{w})\label{eq:expect-q-nx}
@ -76,7 +83,8 @@ Since $\semNX$-PDBs $\pxdb$ are a complete representation system for $\semN$-PDB
\subsection{\tis and \bis in the $\semNX$-PDB model}\label{subsec:supp-mat-ti-bi-def}
Two important subclasses of $\semNX$-PDBs that are of interest to us are the bag versions of tuple-independent databases (\tis) and block-independent databases (\bis). Under set semantics, a \ti is a deterministic database $\db$ where each tuple $\tup$ is assigned a probability $\prob_\tup$. The set of possible worlds represented by a \ti $\db$ is all subsets of $\db$. The probability of each world is the product of the probabilities of all tuples that exist with one minus the probability of all tuples of $\db$ that are not part of this world, i.e., tuples are treated as independent random events. In a \bi, we also assign each tuple a probability, but additionally partition $\db$ into blocks. The possible worlds of a \bi $\db$ are all subsets of $\db$ that contain at most one tuple from each block. Note then that the tuples sharing the same block are disjoint, and the sum of the probabilitites of all the tuples in the same block $\block$ is $1$. The probability of such a world is the product of the probabilities of all tuples present in the world. %and one minus the sum of the probabilities of all tuples from blocks for which no tuple is present in the world.
Two important subclasses of $\semNX$-PDBs that are of interest to us are the bag versions of tuple-independent databases (\tis) and block-independent databases (\bis). Under set semantics, a \ti is a deterministic database $\db$ where each tuple $\tup$ is assigned a probability $\prob_\tup$. The set of possible worlds represented by a \ti $\db$ is all subsets of $\db$. The probability of each world is the product of the probabilities of all tuples that exist in the world and of one minus the probability of each tuple of $\db$ that is not part of this world, i.e., tuples are treated as independent random events. In a \bi, we also assign each tuple a probability, but additionally partition $\db$ into blocks. The possible worlds of a \bi $\db$ are all subsets of $\db$ that contain at most one tuple from each block. Note then that the tuples sharing the same block are disjoint, and the sum of the probabilities of all the tuples in the same block $\block$ is $1$. \AH{Reviewer complaint: This is not true by definition.}
The probability of such a world is the product of the probabilities of all tuples present in the world. %and one minus the sum of the probabilities of all tuples from blocks for which no tuple is present in the world.
For bag \tis and \bis, we define the probability of a tuple to be the probability that the tuple exists with multiplicity at least $1$.
As already noted above, in this work, we define \tis and \bis as subclasses of $\semNX$-PDBs.
@ -100,16 +108,16 @@ A well-known result for set semantics PDBs is that while not all finite PDBs can
\subsection{Proof of~\Cref{prop:expection-of-polynom}}
We need to prove for $\semN$-PDB $\pdb = (\idb,\pd)$ and $\semNX$-PDB $\pxdb = (\db',\pd')$ where $\rmod(\pxdb) = \pdb$ that $\expct_{\db \sim \pd}[\query(\db)(t)] = \expct_{\vct{W} \sim \pd'}\pbox{\polyForTuple(\vct{W})}$
We need to prove for $\semN$-PDB $\pdb = (\idb,\pd)$ and $\semNX$-PDB $\pxdb = (\db',\pd')$ where $\rmod(\pxdb) = \pdb$ that $\expct_{\randDB\sim \pd}[\query(\db)(t)] = \expct_{\vct{W} \sim \pd'}\pbox{\polyForTuple(\vct{W})}$
By expanding $\polyForTuple$ and the expectation we have:
\expct_{\vct{W} \sim \pd'}\pbox{\polyForTuple(\vct{W})}
& = \sum_{\vct{w} \in \{0,1\}^n}\probOf'(\vct{w}) \cdot Q(\pxdb)(t)(\vct{w})\\
& = \sum_{\vct{w} \in \{0,1\}^n}\probOf(\vct{w}) \cdot Q(\pxdb)(t)(\vct{w})\\
\intertext{From $\rmod(\pxdb) = \pdb$, we have that the range of $\assign_{\vct{w}}(\pxdb)$ is $\idb$, so}
& = \sum_{\db \in \idb}\;\;\sum_{\vct{w} \in \{0,1\}^n : \assign_{\vct{w}}(\pxdb) = \db}\probOf'(\vct{w}) \cdot Q(\pxdb)(t)(\vct{w})\\
& = \sum_{\db \in \idb}\;\;\sum_{\vct{w} \in \{0,1\}^n : \assign_{\vct{w}}(\pxdb) = \db}\probOf(\vct{w}) \cdot Q(\pxdb)(t)(\vct{w})\\
\intertext{In the inner sum, $\assign_{\vct{w}}(\pxdb) = \db$, so by distributivity of $\times$ over $+$}
& = \sum_{\db \in \idb}\query(\db)(t)\sum_{\vct{w} \in \{0,1\}^n : \assign_{\vct{w}}(\pxdb) = \db}\probOf'(\vct{w})\\
\intertext{From the definition of $\probOf$, given $\rmod(\pxdb) = \pdb$, we get}
& = \sum_{\db \in \idb}\query(\db)(t)\sum_{\vct{w} \in \{0,1\}^n : \assign_{\vct{w}}(\pxdb) = \db}\probOf(\vct{w})\\
\intertext{From the definition of $\pd$ in \cref{def:semnx-pdbs}, given $\rmod(\pxdb) = \pdb$, we get}
& = \sum_{\db \in \idb}\query(\db)(t) \cdot \probOf(D) \quad = \expct_{\db \sim \pd}[\query(\db)(t)]
@ -119,9 +127,9 @@ By expanding $\polyForTuple$ and the expectation we have:
$\poly(X_1,\ldots, X_\numvar) = \sum\limits_{\vct{d} \in \{0,\ldots, B\}^\numvar}q_{\vct{d}} \cdot \prod\limits_{\substack{i = 1\\s.t. d_i\geq 1}}^{\numvar}X_i^{d_i}$
$\poly(X_1,\ldots, X_\numvar) = \sum\limits_{\vct{d} = \{d_1,\ldots, d_\numvar\}\in \domN^\numvar}c_{\vct{d}} \cdot \prod\limits_{\substack{i = 1\\s.t. d_i\geq 1}}^{\numvar}X_i^{d_i}$
$\rpoly(X_1,\ldots, X_\numvar) = \sum\limits_{\vct{d} \in \eta} q_{\vct{d}}\cdot\prod\limits_{\substack{i = 1\\s.t. d_i\geq 1}}^{\numvar}X_i$% \;\;\; for some $\eta \subseteq \{0,\ldots, B\}^\numvar$
$\rpoly(X_1,\ldots, X_\numvar) = \sum\limits_{\vct{d} = \{d_1,\ldots, d_\numvar\}\in \semN^\numvar} c_{\vct{d}}\cdot\prod\limits_{\substack{i = 1\\s.t. d_i\geq 1}}^{\numvar}X_i$% \;\;\; for some $\eta \subseteq \{0,\ldots, B\}^\numvar$
@ -151,16 +159,16 @@ Note that any $\poly$ in factorized form is equivalent to its \abbrSMB expansion
\subsection{Proof for Lemma~\ref{lem:exp-poly-rpoly}}
Let $\poly$ be the generalized polynomial, i.e., the polynomial of $\numvar$ variables with highest degree $= B$: %, in which every possible monomial permutation appears,
\[\poly(X_1,\ldots, X_\numvar) = \sum_{\vct{d} \in \{0,\ldots, B\}^\numvar}q_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar X_i^{d_i}\].
Then, in expectation we have
\[\poly(X_1,\ldots, X_\numvar) = \sum_{\vct{d} \in \{0,\ldots, B\}^\numvar}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar X_i^{d_i}\].
Then, denoting the corresponding exponent vector $\vct{d}$ for a world $\vct{\wElem}$ over the set of valid worlds $\valworlds$ as $\vct{d} \in \valworlds$, in expectation we have
\expct_{\vct{W}}\pbox{\poly(\vct{W})} &= \sum_{\vct{d} \in \eta}q_{\vct{d}}\cdot \expct_{\vct{w}}\pbox{\prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar w_i^{d_i}}\label{p1-s1}\\
&= \sum_{\vct{d} \in \eta}q_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \expct_{\vct{w}}\pbox{w_i^{d_i}}\label{p1-s2}\\
&= \sum_{\vct{d} \in \eta}q_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \expct_{\vct{w}}\pbox{w_i}\label{p1-s3}\\
&= \sum_{\vct{d} \in \eta}q_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \prob_i\label{p1-s4}\\
\expct_{\vct{W}}\pbox{\poly(\vct{W})} &= \sum_{\vct{d} \in \eta}c_{\vct{d}}\cdot \expct_{\vct{w}}\pbox{\prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar w_i^{d_i}}\label{p1-s1}\\
&= \sum_{\vct{d} \in \eta}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \expct_{\vct{w}}\pbox{w_i^{d_i}}\label{p1-s2}\\
&= \sum_{\vct{d} \in \eta}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \expct_{\vct{w}}\pbox{w_i}\label{p1-s3}\\
&= \sum_{\vct{d} \in \eta}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \prob_i\label{p1-s4}\\
&= \rpoly(\prob_1,\ldots, \prob_\numvar)\label{p1-s5}
In steps \cref{p1-s1} and \cref{p1-s2}, by linearity of expectation (recall the variables are independent, or the monomial expectation is 0), the expecation can be pushed all the way inside of the product. In \cref{p1-s3}, note that $w_i \in \{0, 1\}$ which further implies that for any exponent $e \geq 1$, $w_i^e = w_i$. Next, in \cref{p1-s4} the expectation of a tuple is indeed its probability.
In steps \cref{p1-s1} and \cref{p1-s2}, by linearity of expectation (recall that by \bi constraints, the variables are independent, otherwise the monomial expectation is 0), the expectation can be pushed all the way inside of the product. In \cref{p1-s3}, note that $w_i \in \{0, 1\}$ which further implies that for any exponent $e \geq 1$, $w_i^e = w_i$. Next, in \cref{p1-s4} the expectation of a tuple is indeed its probability.
Finally, \Cref{p1-s5} follows by construction in \Cref{lem:pre-poly-rpoly}: $\rpoly(\prob_1,\ldots, \prob_\numvar)$ is exactly the sum, over all monomials, of the product of the probabilities of the variables in that monomial, weighted by the monomial's coefficient.
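The chain of equalities above can be checked numerically in the simplest setting. The following is a minimal sketch that assumes fully independent variables (the \ti case only, with no block constraints); the example polynomial, the marginals `p`, and the helper names are hypothetical.

```python
from itertools import product
from math import prod

# Hypothetical polynomial (X1 + X2)^2 = X1^2 + 2*X1*X2 + X2^2,
# stored as {exponent vector d: coefficient c_d}.
poly = {(2, 0): 1, (1, 1): 2, (0, 2): 1}
p = [0.5, 0.25]  # assumed independent marginals Pr[W_i = 1]

def evaluate(poly, xs):
    """Evaluate the polynomial at the point xs."""
    return sum(c * prod(x ** d for x, d in zip(xs, ds)) for ds, c in poly.items())

def reduce_exponents(poly):
    """Reduced form: replace every exponent d_i >= 1 by 1 and merge equal monomials."""
    reduced = {}
    for ds, c in poly.items():
        key = tuple(min(d, 1) for d in ds)
        reduced[key] = reduced.get(key, 0) + c
    return reduced

def expectation(poly, p):
    """Brute-force expectation over all world vectors w in {0,1}^n."""
    total = 0.0
    for w in product((0, 1), repeat=len(p)):
        pr = prod(pi if wi else 1 - pi for pi, wi in zip(p, w))
        total += pr * evaluate(poly, w)
    return total

lhs = expectation(poly, p)                 # E[poly(W)] by enumeration
rhs = evaluate(reduce_exponents(poly), p)  # reduced form evaluated at p
```

Here the reduced form is $X_1 + 2X_1X_2 + X_2$, and both `lhs` and `rhs` evaluate to $1$ for the chosen marginals. The enumeration is exponential in the number of variables and serves only as a check; the reduced form is evaluated in time linear in the number of monomials.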
@ -169,6 +177,6 @@ Finally, observe \Cref{p1-s5} by construction in \Cref{lem:pre-poly-rpoly}, that
\subsection{Proof For Corollary~\ref{cor:expct-sop}}
Note that \cref{lem:exp-poly-rpoly} shows that $\expct\pbox{\poly} =$ $\rpoly(\prob_1,\ldots, \prob_\numvar)$. Therefore, if $\poly$ is already in \abbrSMB form, one only needs to compute $\poly(\prob_1,\ldots, \prob_\numvar)$ ignoring exponent terms (note that such a polynomial is $\rpoly(\prob_1,\ldots, \prob_\numvar)$), which indeed has $O(\smbOf{|\poly|})$ computations.
Note that \cref{lem:exp-poly-rpoly} shows that $\expct\pbox{\poly} =$ $\rpoly(\prob_1,\ldots, \prob_\numvar)$. Therefore, if $\poly$ is already in \abbrSMB form, one only needs to compute $\poly(\prob_1,\ldots, \prob_\numvar)$ ignoring exponent terms (note that such a polynomial is $\rpoly(\prob_1,\ldots, \prob_\numvar)$), which indeed has $\bigO{\size\inparen{\smbOf{\poly}}}$ computations.

@ -49,7 +49,7 @@ The function \size~ takes a circuit $\circuit$ as input and outputs the number o
The function \depth~ has circuit $\circuit$ as input and outputs the number of levels in \circuit.
\begin{Definition}[$\degree(\cdot)$]\label{def:degree}\footnote{Note that the degree of $\polyf(\abs{\circuit})$ is always upper bounded by $\deg(\circuit)$ and the latter can be strictly larger (e.g. consider the case when $\circuit$ multiplies two copies of the constant $1$-- here we have $\deg(\circuit)=1$ but degree of $\polyf(\abs{\circuit})$ is $0$).}
\begin{Definition}[$\degree(\cdot)$]\label{def:degree}\footnote{Note that the degree of $\polyf(\abs{\circuit})$ is always upper bounded by $\degree(\circuit)$ and the latter can be strictly larger (e.g. consider the case when $\circuit$ multiplies two copies of the constant $1$-- here we have $\degree(\circuit)=1$ but degree of $\polyf(\abs{\circuit})$ is $0$).}
$\degree(\circuit)$ is defined recursively as follows:
@ -65,7 +65,6 @@ In a RAM model of word size of $W$-bits, $\multc{M}{W}$ denotes the complexity o
\subsection{Our main result}
%In the subsequent subsections we will prove the following theorem.
Let \circuit be a circuit for a UCQ over \bi and define $\poly(\vct{X})=\polyf(\circuit)$ and let $k=\degree(\circuit)$.
@ -83,6 +82,7 @@ such that
To get linear runtime results from \Cref{lem:approx-alg}, we will need to define another parameter modeling the (weighted) number of monomials in $\expansion{\circuit}$ to be `canceled' when it is modded with $\mathcal{B}$ (\Cref{def:mod-set-polys}).
\begin{Definition}[Parameter $\gamma$]\label{def:param-gamma}
Given an expression tree $\circuit$, define
\AH{Technically, $\monom$ is a set of variables rather than a monomial. Perhaps we don't need the $\var(\cdot)$ function and can replace it with a function that returns the monomial represented by a set of variables.}
\[\gamma(\circuit)=\frac{\sum_{(\monom, \coef)\in \expansion{\circuit}} \abs{\coef}\cdot \indicator{\monom\mod{\mathcal{B}}\equiv 0}}{\abs{\circuit}(1,\ldots, 1)}\]

@ -18,7 +18,7 @@ Lastly, we generalize our result for expectation to other moments.
\mypar{The cost model}
So far our analysis of $\approxq$ has been in terms of the size of the lineage circuits.
We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\poly$, we can construct a compressed circuit for $\poly$ and \bi $\pxdb$ of size (and in runtime) linear in that of a general class of query processing algorithms for the same query $\poly$ on a deterministic database $\db$.
We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\query$, we can construct a compressed circuit for $\poly$ and \bi $\pxdb$ of size and runtime linear in that of a general class of query processing algorithms for the same query $\query$ on a deterministic database $\db$.
We assume a linear relationship between input sizes $|\pxdb|$ and $|\db|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db}$).
\footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}

@ -97,10 +97,10 @@
\newcommand{\pd}{\mathcal{P}}%pd for probability distribution
\newcommand{\pxdb}{\pdb_{\semNX}}%<---changed from the original \mathbf{\db}
\newcommand{\nxdb}{D(\vct{X})}%\mathbb{N}[\vct{X}] db--Are we currently using this?
\newcommand{\valworlds}{\eta}%valid worlds--in particular referring to something like a BIDB, where not all worlds have Pr[w] > 0.
@ -134,7 +134,7 @@
%Math Symbols, Functions/Operators, Containers %
%Number Sets
\newcommand{\domR}{\mathbb{R}}%Consider changing this to \domR
%Probability, Expectation
@ -167,7 +167,7 @@
%Random Variables
\newcommand{\rvW}{W}%\rvW for random variable of type World
\newcommand{\rvW}{W}%\rvW for random variable of type World<---this is the same as \randWorld
%One of these needs to go...I think...
\newcommand{\randomvar}{W}%this little guy needs a home!
@ -185,7 +185,7 @@
\newcommand{\rpolyX}{\rpoly\inparen{\pVar}}%<---if this isn't something we use much, we can get rid of it
\newcommand{\biDisProd}{\mathcal{B}}%bidb disjoint tuple products (def 2.5)
\newcommand{\rExp}{\mathcal{T}}%the set of variables to reduce all exponents to 1 via modulus operation; I think \mathcal T collides with the notation used for the set of tuples in D
\newcommand{\polyForTuple}{\poly_{\tup}}%do we use this?
\newcommand{\polyForTuple}{\poly_{\tup}}%do we use this?<--S 2
%Do we use this?
\newcommand{\out}{output}%output aggregation over the output vector
\newcommand{\prel}{\mathcal{\rel}}%What is this?
@ -196,6 +196,8 @@
%Graph Notation %
\newcommand{\eset}[1]{E^{(#1)}_S} %edge set for arbitrary subgraph
@ -216,7 +218,6 @@
%formally \rchild and \lchild
@ -228,6 +229,10 @@
%types of C
%Do we use this?
@ -252,6 +257,7 @@
@ -301,12 +307,10 @@
%I don't think we talk of T but of C; let's update this. These should be used only in S 2 and S4
%members of T
%types of T
@ -351,7 +355,7 @@
%not sure about these? Perhaps in appendix B for \assign and S 5 for \support?
\newcommand{\assign}{\psi}%assignment function from a world vector to polynomial output in App A

@ -19,7 +19,7 @@
%%%%%%%%%% SQL + provenance listing settings

@ -19,30 +19,30 @@ Given positive integer $k$ and undirected graph $G$ with no self-loops or parall
\sharpwonehard (parameterization is in $k$).
The above result means that we cannot hope to count the number of $k$-matchings in $G=(V,E)$ in time $f(k)\cdot |V|^{c}$ for any function $f$ and constant $c$ independent of $k$. In fact, all known algorithms to solve this problem take time $|V|^{\Omega(k)}$.
The above result means that we cannot hope to count the number of $k$-matchings in $G=(\vset,\edgeSet)$ in time $f(k)\cdot |\vset|^{c}$ for any function $f$ and constant $c$ independent of $k$. In fact, all known algorithms to solve this problem take time $|\vset|^{\Omega(k)}$.
Our hardness result in Section~\ref{sec:single-p} is based on the following conjectured hardness result:
There exists a constant $\eps_0>0$ such that given an undirected graph $G=(V,E)$, computing exactly $\numocc{G}{\tri}$ cannot be done in time $o\inparen{|E|^{1+\eps_0}}$.
There exists a constant $\eps_0>0$ such that given an undirected graph $G=(\vset,\edgeSet)$, computing exactly $\numocc{G}{\tri}$ cannot be done in time $o\inparen{|\edgeSet|^{1+\eps_0}}$.
Based on the so called {\em Triangle detection hypothesis} (cf.~\cite{triang-hard}), which states that detection of whether $G$ has a triangle or not takes time $\Omega\inparen{|E|^{4/3}}$, implies that in Conjecture~\ref{conj:graph} we can take $\eps_0\ge \frac 13$.
The so-called {\em Triangle detection hypothesis} (cf.~\cite{triang-hard}), which states that detecting whether $G$ has a triangle takes time $\Omega\inparen{|\edgeSet|^{4/3}}$, implies that in Conjecture~\ref{conj:graph} we can take $\eps_0\ge \frac 13$.
%The current best known algorithm to count the number of $3$-matchings, to
%\AR{Need to add something about 3-paths and 3-matchings as well.}
Both of our hardness results rely on a simple query polynomial encoding of the edges of a graph.
To prove our hardness result, consider a graph $G(V, E)$, where $|E| = m$, $|V| = \numvar$. Our query polynomial has a variable $X_i$ for every $i$ in $[\numvar]$.
To prove our hardness result, consider a graph $G(\vset, \edgeSet)$, where $|\edgeSet| = m$, $|\vset| = \numvar$. Our query polynomial has a variable $X_i$ for every $i$ in $[\numvar]$.
Consider the polynomial
$\poly_{G}(\vct{X}) = \sum\limits_{(i, j) \in E} X_i \cdot X_j.$
$\poly_{G}(\vct{X}) = \sum\limits_{(i, j) \in \edgeSet} X_i \cdot X_j.$
The hard polynomial for our problem will be a suitable power $k\ge 3$ of the polynomial above:
For any graph $G=([n],E)$ and $\kElem\ge 1$, define
\[\poly_{G}^\kElem(X_1,\dots,X_n) = \left(\sum\limits_{(i, j) \in E} X_i \cdot X_j\right)^\kElem\]
For any graph $G=([n],\edgeSet)$ and $\kElem\ge 1$, define
\[\poly_{G}^\kElem(X_1,\dots,X_n) = \left(\sum\limits_{(i, j) \in \edgeSet} X_i \cdot X_j\right)^\kElem\]
Our hardness results only need a \ti instance; we also consider the special case when all the tuple probabilities (probabilities assigned to $X_i$ by $\probAllTup$) are the same value. Note that our hardness results % do not require the general circuit representation and
@ -56,12 +56,8 @@ even hold for the expression trees. %this polynomial can be encoded in an expres
\[\poly^k_G\dlImp OnTime(C_1),Route(C_1, C_1'),OnTime(C_1'),\dots,OnTime(C_\kElem),Route(C_\kElem,C_\kElem'),OnTime(C_\kElem')\]
where adapting the PDB instance in \Cref{fig:ex-shipping-simp}, relation $OnTime$ has $n$ tuples corresponding to each vertex in $V=[n]$ each with probability $\prob$ and $Route(\text{City}_1, \text{City}_2)$ has tuples corresponding to the edges $E$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $Route$ as well, but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $Route$ also are present with probability $\prob$ but to simplify notation we assign probability $1$ to edges.}
Note that this implies that our hard query polynomial can be represented as an expression tree produced by a project-join query with same probability value for each input tuple $\prob_i$. %; our hardness result transfers here as well.
% OK: The following (commented-out) sentence feels a bit misplaced here.
% -- by contrast our approximation algorithm in \Cref{sec:algo} can handle lineage polynomials represented as circuits generated by union of select-project-join (SPJU) queries with potentially distinct $\prob_i$ values. % (i.e. we do not need union or select operator to derive our hardness result).
%\AR{need discussion on the `tightness' of various params. First, this is for degree 6 poly-- while things are easy for say deg 2. Second this is for any fixed p. Finally, we only need project-join queries to get the hardness results. Also need to compare this with the generality of the approx upper bound results.}
where adapting the PDB instance in \Cref{fig:ex-shipping-simp}, relation $OnTime$ has $n$ tuples corresponding to each vertex in $\vset=[n]$ each with probability $\prob$ and $Route(\text{City}_1, \text{City}_2)$ has tuples corresponding to the edges $\edgeSet$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $Route$ as well, but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $Route$ also are present with probability $\prob$ but to simplify notation we assign probability $1$ to edges.}
Note that this implies that our hard query polynomial can be represented as an expression tree produced by a project-join query with the same probability value for each input tuple $\prob_i$.
\subsection{Multiple Distinct $\prob$ Values}
@ -74,11 +70,11 @@ Computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ for arbitrary $G$ and any $(2
We will prove the above result by reducing from the problem of computing the number of $k$-matchings in $G$. Given the current best-known algorithm for this counting problem, our results imply that unless the state-of-the-art $k$-matching algorithms are improved, we cannot hope to solve our problem in time better than $\Omega_k\inparen{m^{k/2}}$ where $m=\abs{E}$, which is only quadratically faster than expanding $\poly_{G}^\kElem(\vct{X})$ into its \abbrSMB form and then using \Cref{cor:expct-sop}. The approximation algorithm we present in \Cref{sec:algo} has runtime $O_k\inparen{m}$ for this query. % (since it runs in linear-time on all lineage polynomials).
We will prove the above result by reducing from the problem of computing the number of $k$-matchings in $G$. Given the current best-known algorithm for this counting problem, our results imply that unless the state-of-the-art $k$-matching algorithms are improved, we cannot hope to solve our problem in time better than $\Omega_k\inparen{m^{k/2}}$ where $m=\abs{\edgeSet}$, which is only quadratically faster than expanding $\poly_{G}^\kElem(\vct{X})$ into its \abbrSMB form and then using \Cref{cor:expct-sop}. The approximation algorithm we present in \Cref{sec:algo} has runtime $O_k\inparen{m}$ for this query. % (since it runs in linear-time on all lineage polynomials).
\noindent The following lemma reduces the problem of counting $\kElem$-matchings in a graph to our problem (and proves \Cref{thm:mult-p-hard-result}):
Let $\prob_0,\ldots, \prob_{2\kElem}$ be distinct values in $(0, 1]$. Then given the values $\rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$ for $0\leq i\leq 2\kElem$, the number of $\kElem$-matchings in $G$ can be computed in $O\inparen{\kElem^3}$ time.
Let $\prob_0,\ldots, \prob_{2\kElem}$ be distinct values in $(0, 1]$. Then given the values $\rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$ for $0\leq i\leq 2\kElem$, the number of $\kElem$-matchings in $G$ can be computed in $\bigO{\kElem^3}$ time.
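The interpolation step behind this lemma can be sketched on a small graph. The sketch below is illustrative only and rests on two assumptions that are not spelled out in the text above: that $\rpoly_{G}^\kElem(\prob,\ldots,\prob)$ is a polynomial in $\prob$ of degree at most $2\kElem$, and that its coefficient of $\prob^{2\kElem}$ equals $\kElem!$ times the number of $\kElem$-matchings (a monomial with $2\kElem$ distinct variables arises exactly from an ordered choice of $\kElem$ vertex-disjoint edges). The example graph and all function names are hypothetical, and the brute-force evaluator merely stands in for the values that the lemma takes as given.

```python
from fractions import Fraction
from itertools import combinations, product
from math import factorial

def rpoly_power_at(edges, k, p):
    """Evaluate the reduced form of (sum_{(i,j) in E} X_i X_j)^k at X = (p, ..., p).

    Brute force: expand the k-th power one edge sequence at a time; a monomial
    touching v distinct vertices contributes p^v once exponents are dropped.
    """
    total = Fraction(0)
    for seq in product(edges, repeat=k):
        vertices = {v for e in seq for v in e}
        total += p ** len(vertices)
    return total

def interpolate(points, values):
    """Solve the Vandermonde system by Gauss-Jordan elimination (cubic time)."""
    m = len(points)
    rows = [[x ** j for j in range(m)] + [y] for x, y in zip(points, values)]
    for col in range(m):
        pivot = next(r for r in range(col, m) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [a / rows[col][col] for a in rows[col]]
        for r in range(m):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]  # coefficients c_0, ..., c_{m-1}

def count_matchings(edges, k):
    """Direct count of k-matchings, used only to check the result."""
    return sum(1 for sub in combinations(edges, k)
               if len({v for e in sub for v in e}) == 2 * k)

# Hypothetical example: the path 0-1-2-3 with k = 2.
edges, k = [(0, 1), (1, 2), (2, 3)], 2
points = [Fraction(i + 1, 2 * k + 1) for i in range(2 * k + 1)]  # distinct, in (0, 1]
values = [rpoly_power_at(edges, k, p) for p in points]
coeffs = interpolate(points, values)
matchings = coeffs[2 * k] / factorial(k)
```

For this path the reduced polynomial is $3\prob^2 + 4\prob^3 + 2\prob^4$, so the leading coefficient $2$ divided by $2!$ recovers the single $2$-matching. Exact rational arithmetic is used so that the recovered coefficients are integers without rounding.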
%%% Local Variables:

@ -7,6 +7,7 @@ We now introduce some terminology % for polynomials
and develop a reduced form (a closed form of the polynomial's expectation) for polynomials over probability distributions derived from a \bi or \ti.
%We will use $(X + Y)^2$ as a running example.
Note that a polynomial over $\vct{X}=(X_1,\dots,X_n)$ is formally defined as (with $c_{\vct{d}} \in \domN$):
\AH{My attempt to clear up any confusion in the ambiguity of $c_{\vct{i}}$. We may want to say that $\domain\inparen{c_\vct{i}} = \domR$ instead?}
\poly\inparen{X_1,\dots,X_n}=\sum_{\vct{d}=(d_1,\dots,d_n)\in \semN^n} c_{\vct{d}}\cdot \prod_{i=1}^n X_i^{d_i}.
@ -17,8 +18,8 @@ Note that a polynomial over $\vct{X}=(X_1,\dots,X_n)$ is formally defined as (wi
From above, the term $\prod_{i=1}^n X_i^{d_i}$ is a {\em monomial}. A polynomial $\poly\inparen{\vct{X}}$ is in standard monomial basis (\abbrSMB) when we keep only the terms with $c_{\vct{d}}\ne 0$ from \Cref{eq:sop-form}.
We consider \abbrSMB as the default representation of a polynomial.
We use $\smbOf{\poly}$ to denote the \abbrSMB form of a polynomial $\poly$.
Unless otherwise noted, we consider all polynomials to be in \abbrSMB representation.
When it is unclear, we use $\smbOf{\poly}$ to denote the \abbrSMB form of a polynomial $\poly$.
@ -155,7 +156,7 @@ to the variables $\vct{X}$. Intuitively, \Cref{lem:exp-poly-rpoly} states that w
If $\poly$ is a \bi-lineage polynomial, then the expectation of $\poly$, i.e., $\expct\pbox{\poly} = \rpoly\left(\prob_1,\ldots, \prob_\numvar\right)$ can be computed in $O(\size\inparen{\smbOf{\poly}})$, where $\size\inparen{\poly}$ (\Cref{def:size}) is proportional to the total number of multiplication/addition operators in $\poly$.
If $\poly$ is a \bi-lineage polynomial, then the expectation of $\poly$, i.e., $\expct\pbox{\poly} = \rpoly\left(\prob_1,\ldots, \prob_\numvar\right)$, can be computed in $\bigO{\size\inparen{\smbOf{\poly}}}$ time, where $\size\inparen{\poly}$ (\Cref{def:size}) is proportional to the total number of multiplication/addition operators in $\poly$.
%\AH{What if $\poly$ is not in \abbrSMB form?}
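To illustrate the statement, consider the polynomial $2X^2+3XY-2Y^2$ under the simplifying assumption (made here only for illustration) that $X$ and $Y$ are independent $\{0,1\}$-valued variables with $\Pr[X=1]=p_X$ and $\Pr[Y=1]=p_Y$, i.e., the \ti special case. We further assume that the reduced polynomial is obtained by lowering every exponent greater than one to one, which is sound for $\{0,1\}$-valued variables since $X^d = X$ for $d \geq 1$. The symbols $p_X$ and $p_Y$ are ours, and the sketch assumes the \texttt{amsmath} and \texttt{amssymb} packages:

```latex
\begin{align*}
\mathbb{E}\left[2X^2 + 3XY - 2Y^2\right]
  &= 2\,\mathbb{E}\left[X^2\right] + 3\,\mathbb{E}\left[XY\right] - 2\,\mathbb{E}\left[Y^2\right]
     && \text{(linearity of expectation)} \\
  &= 2\,\mathbb{E}\left[X\right] + 3\,\mathbb{E}\left[X\right]\mathbb{E}\left[Y\right] - 2\,\mathbb{E}\left[Y\right]
     && \text{($X^2 = X$, $Y^2 = Y$ on $\{0,1\}$; independence)} \\
  &= 2p_X + 3p_X p_Y - 2p_Y.
\end{align*}
```

Each of the three monomials is processed once, which is consistent with a running time linear in the size of the \abbrSMB form.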


@ -13,7 +13,7 @@ We represent query polynomials via {\em arithmetic circuits}~\cite{arith-complex
A circuit $\circuit$ is a Directed Acyclic Graph (DAG) whose source gates (in degree of $0$) consist of elements in either $\domR$ or $\vct{X}$. The internal gates and (the single) sink gate of $\circuit$ (corresponding to the result tuple $t$) have binary input and are either sum ($\circplus$) or product ($\circmult$) gates.
Each node in a circuit $\circuit$ has the following members: \type, \val, \vpartial, \vari{input}, \degval and \vari{Lweight}, \vari{Rweight}, where \type is the type of value stored in the gatee (one of $\{\circplus, \circmult, \var, \tnum\}$, \val is the value stored (a constant or variable), and \vari{input} is the list of the gate's inputs. We use $\circuit_\linput$ to denote the left input and $\circuit_\rinput$ the right input of the sink of circuit $\circuit$.
Each node in a circuit $\circuit$ has the following members: \type, \val, \vpartial, \vari{input}, \degval, \vari{Lweight}, and \vari{Rweight}, where \type is the type of value stored in the gate (one of $\{\circplus, \circmult, \var, \tnum\}$), \val is the value stored (a constant or variable), and \vari{input} is the list of the gate's inputs. We use $\circuit_\linput$ to denote the left input and $\circuit_\rinput$ the right input of the sink of circuit $\circuit$.
%The member \degval holds the degree of \circuit.
When the underlying DAG is a tree (with edges pointing towards the root), we will refer to the structure as an expression tree \etree. Note that in such a case, the root of \etree is analogous to the sink of \circuit.
@ -22,7 +22,7 @@ When the underlying DAG is a tree (with edges pointing towards the root), we wil
As stated in \Cref{def:circuit}, every internal node has at most two in-edges, is labeled as an addition or a multiplication node, and has no limit on its outdegree.
Note that if we limit the outdegree to one, then we get expression trees.
We ignore the fields \vari{partial}, \vari{Lweight}, and \vari{Rweight} until \Cref{sec:algo}.
We ignore the fields \vari{partial}, \vari{Lweight}, and \vari{Rweight} until \Cref{sec:algo}.\AH{We omit degree here too, which \emph{I think} is used only in appendix proofs.}
@ -105,15 +105,15 @@ Denote $\polyf(\circuit)$ to be the function from circuit $\circuit$ to its corr
Note that $\circuit$ need not encode an expression in SMB. For instance, $\circuit$ could represent a compressed form of the running example, such as $(X + 2Y)(2X - Y)$, as shown in \Cref{fig:circuit}, while $\polyf(\circuit) = 2X^2+3XY-2Y^2$.
Note that $\circuit$ need not encode an expression in SMB. For instance, $\circuit$ could represent a compressed form of the running example, such as $(X + 2Y)(2X - Y)$, as shown in \Cref{fig:circuit}, while $\polyf(\circuit) = 2X^2+3XY-2Y^2$.\footnote{As stated previously, unless otherwise mentioned all polynomials are considered in the $\abbrSMB$ representation, and this implies that the output of $\polyf\inparen{\cdot}$ is indeed $\abbrSMB$.}
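As a worked check of the expansion above, the product can be multiplied out term by term. This is an illustrative sketch only; it assumes the \texttt{amsmath} package for \texttt{align*} and uses no macros from the paper:

```latex
\begin{align*}
(X + 2Y)(2X - Y) &= X \cdot 2X - X \cdot Y + 2Y \cdot 2X - 2Y \cdot Y \\
                 &= 2X^2 - XY + 4XY - 2Y^2 \\
                 &= 2X^2 + 3XY - 2Y^2.
\end{align*}
```

Both forms denote the same polynomial; only the last line is in \abbrSMB.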
\begin{Definition}[Circuit Set]\label{def:circuit-set}
$\circuitset{\smbOf{\polyX}}$ is the set of all possible circuits $\circuit$ such that $\polyf(\circuit) = \smbOf\polyX$.
$\circuitset{\polyX}$ is the set of all possible circuits $\circuit$ such that $\polyf(\circuit) = \polyX$.\footnote{Again, the representation of $\polyX$ is $\abbrSMB$.}
The circuit of \Cref{fig:circuit} is an element of $\circuitset{2X^2+3XY-2Y^2}$. One can think of $\circuitset{\smbOf{\polyX}}$ as the infinite set of circuits each of which model an encoding (factorization) equal to $\polyf(\circuit)$.
The circuit of \Cref{fig:circuit} is an element of $\circuitset{2X^2+3XY-2Y^2}$. One can think of $\circuitset{\polyX}$ as the infinite set of circuits, each of which equals $\polyX$ when represented in $\abbrSMB$.
Note that \Cref{def:circuit-set} implies that $\circuit \in \circuitset{\polyf(\circuit)}$.
@ -122,10 +122,10 @@ Note that \Cref{def:circuit-set} implies that $\circuit \in \circuitset{\polyf(\
\noindent We are now ready to formally state our \textbf{main problem}.
\begin{Definition}[The Expected Result Multiplicity Problem]\label{def:the-expected-multipl}
Let $\vct{X} = (X_1, \ldots, X_n)$, and $\pdb$ be an $\semNX$-PDB over $\vct{X}$ with probability distribution $\pd$ over assignments $\vct{X} \to \{0,1\}$, $\query$ an n-ary query, and $t$ an n-ary tuple.
Let $\vct{X} = (X_1, \ldots, X_n)$, and $\pxdb$ be an $\semNX$-PDB over $\vct{X}$ with probability distribution $\pd$ over assignments $\vct{X} \to \{0,1\}$, $\query$ an n-ary query, and $t$ an n-ary tuple.
The \expectProblem is defined as follows:\\[-7mm]
\textbf{Input}: A circuit $\circuit \in \circuitset{\smbOf{\polyX}}$ for $\poly(\vct{X}) = \query(\pxdb)(t)$
\textbf{Input}: A circuit $\circuit \in \circuitset{\polyX}$ for $\polyX = \query(\pxdb)(t)$
\hspace*{5mm}\textbf{Output}: $\expct_{\vct{W} \sim \pd}[\poly(\vct{W})]$