An \textit{incomplete database}$\idb$ is a set of deterministic databases $\db$ called possible worlds.
Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database}$\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is an incomplete database and $\pd$ is a probability distribution over $\idb$. Queries over probabilistic databases are evaluated using the so-called possible world semantics. Under possible world semantics, the result of a query $\query$ over an incomplete database $\idb$ is the set of query answers produced by evaluating $\query$ over each possible world:
For a probabilistic database $\pdb=(\idb, \pd)$, the result of a query is the pair $(\query(\idb), \pd')$ where $\pd'$ is a probability distribution over $\query(\idb)$ that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer:
Note that in this work we consider multisets, i.e., each possible world is a set of multiset relations and queries are evaluated using bag semantics. We will use K-relations to model multisets. A \emph{K-relation}~\cite{DBLP:conf/pods/GreenKT07} is a relation whose tuples are annotated with elements from a commutative semiring $\semK=(\domK, \addK, \multK, \zeroK, \oneK)$. A commutative semiring is a structure with a domain $\domK$ and associative and commutative binary operations $\addK$ and $\multK$ such that $\multK$ distributes over $\addK$, $\zeroK$ is the identity of $\addK$, $\oneK$ is the identity of $\multK$, and $\zeroK$ annihilates all elements of $\domK$ when being combined with $\multK$.
Formally, an n-ary $\semK$-relation over $\udom$ is a function $\rel: \udom^n \to\domK$ with finite support $\support{\rel}=\{\tup\mid\rel(\tup)\neq\zeroK\}$.
A $\semK$-database is a set of $\semK$-relations. It will be convenient to also interpret a $\semK$-database as a function from tuples to annotations. Thus, $\rel(t)$ ($\db(t)$) denotes the annotation associated by $\semK$-relation $\rel$ ($\semK$-database $\db$) to tuple $t$.
Consider the semiring $\semN=(\domN,+,\times,0,1)$ of natural numbers. $\semN$-databases are used to model bag semantics by annotating each tuple with its multiplicity. A probabilistic $\semN$-database ($\semN$-PDB) is a PDB where each possible world is an $\semN$-database. We will study the problem of evaluating statical moments of query results over such databases. Specifically, given a probabilistic $\semN$-database $\pdb=(\idb, \pd)$, query $\query$, and possible result tuple $t$, we treat $\query(\db)(t)$ as a random $\semN$-valued variable and are interested in computing its expectation $\expct_{\idb\sim\pd}[\query(\db)(t)]$:
We use $\evald{\cdot}{\db}$ to denote the result of evaluating query $\query$ over $\semK$-database $\db$. In the definition shown below, we assume that tuples are of appropriate arity and use $\project_A(\tup)$ to denote the projection of tuple $\tup$ on a list of attributes $A$. Furthermore, $\theta(\tup)$ denotes the (boolean) result of evaluating condition $\theta$ over $\tup$.
Consider now the semiring $(\semNX, +, \cdot, 0, 1)$ whose domain is $\semNX$ and addition and multiplication are standard addition and multiplication of polynomials. We will utilize $\semNX$-databases $\db$ paired with probability distributions to represent $\semN$-PDBs.\BG{Need more motivation?} To justify the use of $\semNX$-databases, we need to show that we can encode any $\semN$-PDB in this way and that the query semantics over this representation coincides with query semantics over $\semN$-PDB. For that it will be opportune to define representation systems for $\semN$-PDBs.\BG{cite}
A representation system for $\semN$-PDBs is a tuple $(\reprs, \rmod)$ where $\reprs$ is a set of representations and $\rmod$ associates which each $\repr\in\reprs$ a $\semN$-PDB $\pdb$. We say that a representation system is \emph{closed} under a class of queries $\qClass$ if for any query $\query\in\qClass$ we have:
%
\[\rmod(\query(\repr))=\query(\rmod(\repr))\]
A representation system is \emph{complete} if for every $\semN$-PDB $\pdb$ there exists $\repr\in\reprs$ such that:
Formally, a $\semNX$-PDB is a $\semNX$-database $\db$ and a probability distribution $\pd$ over assignments $\assign$ of the variables $\vct{X}=\{X_1, \ldots, X_n\}$ occurring in annotations of $\db$ to $\{0,1\}$. Note that an assignment $\assign: \vct{X}\to\{0,1\}$ can be represented as a vector $\vct{w}\in\{0,1\}^n$ where $\vct{w}[i]$ records the value assigned to variable $X_i$. Thus, from now on we will solely use such vectors which we refer to as \emph{world vectors} and implicitly understand them to represent assignments. Given an assignment $\assign$ we use $\assign(\pxdb)$ to denote the semiring homomorphism $\semNX\to\semN$ that applies the assignment $\assign$ to all variables of a polynomial and evaluates the resulting expression in $\semN$.\BG{explain connection to homomorphism lifting in K-relations}
A $\semNX$-PDB $\pxdb$ over variables $\vct{X}=\{X_1, \ldots, X_n\}$ is a tuple $(\db,\pd)$ where $\db$ is an $\semNX$-database and $\pd$ is a probability distribution over $\vct{w}\in\{0,1\}^n$. We use $\assign_{\vct{w}}$ to denote the assignment corresponding to $\vct{w}\in\{0,1\}^n$. The $\semN$-PDB $\rmod(\pxdb)=(\idb, \pd')$ encoded by $\pxdb$ is defined as:
For instance, consider a $\pxdb$ consisting of a single tuple $\tup_1=(1)$ annotated with $X_1+ X_2$ with probability distribution $\pd([0,0])=0$, $\pd([0,1])=0$, $\pd([1,0])=0.3$ and $\pd([1,1])=0.7$. This $\semNX$-PDB encodes two possible worlds (with non-zero) probability that we denote using their world vectors.
Importantly, as the following proposition shows, any finite $\semN$-PDB can be encoded as a $\semNX$-PDBs and $\semNX$-PDBs are closed under positive relational algebra queries, the class of queries we are interested in in this work.
To prove that $\semNX$-PDBs are complete consider the following construction that for any $\semN$-PDB $\pdb=(\idb, \pd)$ produces a $\semNX$-PDB $\pxdb=(\db, \pd')$ such that $\rmod(\pxdb)=\pdb$. Let $\idb=\{D_1, \ldots, D_n\}$ and let $max(D_i)$ denote $max_{\tup} D_i(\tup)$. For each world $D_i$ we create variables $X_{i1}$, \ldots, $X_{im}$ where $m = max(D_i)$. In $\db$ we assign each tuple $\tup$ the polynomial:
The probability distribution $\pd'$ assigns all world vectors zero probability except for $n$ world vectors (representing the possible worlds) $\vct{w_i}$. All elements of $\vct{w_i}$ are zero except for the positions corresponding to variables $X_{ij}$ for $j \in\{1, \ldots\}$ which are set to $1$. Unfolding definitions it is trivial to show that $\rmod(\pxdb)=\pdb$. Thus, $\semNX$ are a complete representation system.
The closure under $\raPlus$ queries follows from the fact that an assignment $\vct{X}\to\{0,1\}$ is a semiring homomorphism and that semiring homomorphisms commute with queries over $\semK$-relations.
Now let us consider computing the expected multiplicity of a tuple $\tup$ in the result of a query $\query$ over a $\semN$-PDB $\pdb$ using the annotation of $\tup$ in the result of evaluating $\query$ over an $\semNX$-PDB $\pxdb$ for which $\rmod{\pxdb}=\pdb$. The expectation of the polynomial $\poly=\query(\pxdb)(\tup)$ based on the probability distribution of $\pxdb$ over the variables in $\pxdb$ is:
Since $\semNX$-PDBs $\pxdb$ are a complete representation system closed under $\raPlus$, computing the expectation of the multiplicity of a tuple $t$ in the result a $\raPlus$ query over the $\semN$-PDB $\rmod(\pxdb)$, is the same as computing the expectation of the polynomial $\query(\pxdb)(t)$.
Two important subclasses of $\semNX$-PDBs that are of interested to us are the bag versions of tuple-independent databases (\tis) and block-independent databases (\bis). Under set semantics, a \ti is a deterministic database $\db$ where each tuple $\tup$ is assigned a probability $\vct{p}(\tup)$. The set of possible worlds represented by a \ti$\db$ is all subsets of $\db$. The probability of each world is the product of the probabilities of all tuples that exist with one minus the probability of all tuples of $\db$ that are not part of this world, i.e., tuples are treated as independent random events. In a \bi, we also assign each tuple a probability, but additionally partition $\db$ into blocks. The possible worlds of a \bi$\db$ are all subsets of $\db$ that contain at most one tuple from each block. The probability of such a world is the product of the probabilities of all tuples present in the world and one minus the sum of the probabilities of all tuples from blocks for which no tuple is present in the world. For bag \tis and \bis, we define the probability of a tuple to be the probability that the tuple exists with multiplicity at least $1$.
\begin{Definition}[\tis and \bis]\label{def:tidbs-and-bidbs}
A \emph{\ti}$\pxdb=(\db, \pd)$ is a $\semNX$-PDB such that (i) every tuple is annotated with either $0$ or a unique variable $X_i$ and (ii) the probability distribution $\pd$ is such that all variables are independent.
A \emph{\bi}$\pxdb=(\db, \pd)$ is a $\semNX$-PDB such that (i) every tuple is annotated with either $0$ or a unique variable $X_i$ and (ii) that the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup)\neq0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same blocks are disjoint events.
Note that the main difference to the standard definitions of \tis and \bis is that we define them as subclasses of $\semNX$-PDBs and that we use bag semantics. Even though tuples cannot occur more than once in the input \ti or \bi, they can occur with a multiplicity larger than one in the result of a query. Since in \tis and \bis, there is a one-to-one correspondence between tuples in the database and variables, we can interpret a vector $\vct{w}\in\{0,1\}^n$ as denoting which tuples exist in the possible world $\assign_{\vct{w}}(\pxdb)$ (the ones where $\vct{w}[i]=1$). Denote the vector $\vct{p}$ to be a vector whose elements are the individual probabilities $\prob_i$ of each tuple $\tup_i$. Let $\pd^{(\vct{p})}$ denote the distribution induced by $\vct{p}$.
\BG{Oliver's conjecture: Bag-\tis + Q can express any finite bag-PDB:
A well-known result for set semantics PDBs is that while not all finite PDBs can be encoded as \tis, any finite PDB can be encoded using a \ti and a query. An analog result holds in our case: any finite $\semN$-PDB can be encoded as a bag \ti and a query (WHAT CLASS? ADD PROOF)
An expression tree $\etree$ over $\vct{X}$ is a binary %an ADT logically viewed as an n-ary
tree, whose internal nodes are from the set $\{+, \times\}$, with leaf nodes being either from the set $\mathbb{R}$$(\tnum)$ or from the set of monomials $(\var)$. The members of $\etree$ are \type, \val, \vari{partial}, \vari{children}, and \vari{weight}, where \type is the type of value stored in the node $\etree$ (i.e. one of $\{+, \times, \var, \tnum\}$, \val is the value stored, and \vari{children} is the list of $\etree$'s children where $\etree_\lchild$ is the left child and $\etree_\rchild$ the right child.
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We ignore the remaining fields (\vari{partial} and \vari{weight}) for now. Their purpose will become clear in~\Cref{sec:approximation-algo}. Note that $\etree$ need not encode an expression in standard monomial basis. For instance, $\etree$ could represent a compressed form of the running example, such as $(x +2y)(2x - y)$.
Denote $poly(\etree)$ to be the function that takes as input expression tree $\etree$ and outputs its corresponding polynomial. $poly(\cdot)$ is recursively defined on $\etree$ as follows, where $\etree_\lchild$ and $\etree_\rchild$ denote the left and right child of $\etree$ respectively.
% &\etree.\type = \var \text{ OR } \tnum\mapsto&& \etree.\val
% \end{align*}
\begin{equation*}
\polyf(\etree) = \begin{cases}
\polyf(\etree_\lchild) + \polyf(\etree_\rchild) &\text{ if \etree.\type} = +\\
\polyf(\etree_\lchild) \cdot\polyf(\etree_\rchild) &\text{ if \etree.\type} = \times\\
\etree.\val&\text{ if \etree.\type} = \var\text{ OR }\tnum.
\end{cases}
\end{equation*}
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Note that addition and multiplication above follow the standard interpretation over polynomials.
%Specifically, when adding two monomials whose variables and respective exponents agree, the coefficients corresponding to the monomials are added and their sum is multiplied to the monomial. Multiplication here is denoted by concatenation of the monomial and coefficient. When two monomials are multiplied, the product of each corresponding coefficient is computed, and the variables in each monomial are multiplied, i.e., the exponents of like variables are added. Again we notate this by the direct product of coefficient product and all disitinct variables in the two monomials, with newly computed exponents.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Expression Tree Set]\label{def:express-tree-set}$\etreeset{\smb}$ is the set of all possible expression trees $\etree$, such that $poly(\etree)=\poly(\vct{X})$.
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
For our running example, $\etreeset{\smb}=\{2x^2+3xy -2y^2, (x +2y)(2x - y), x(2x - y)+2y(2x - y), 2x(x +2y)- y(x +2y)\}$. Note that \cref{def:express-tree-set} implies that $\etree\in\etreeset{poly(\etree)}$.
Let $\vct{X} = (X_1, \ldots, X_n)$, and $\pdb$ be an $\semNX$-PDB over $\vct{X}$ with probability distribution $\pd$ over assignments $\vct{X}\to [0,1]$, $\query$ an n-ary query, and $t$ an n-ary tuple.
% When $\idb$ is a probabilistic database, $\idb$ can be viewed as a two-tuple $(\wSet, \pd)$, where $\wSet$ as noted, is the set of possible worlds, and $\pd$ is a probability distribution over $\wSet$.
% The possible worlds semantics gives a framework for how to think about running queries over $\idb$. Given a query $\query$, $\query$ is deterministically run over each $\db \in \idb$, and the output of $\query(\idb)$ is defined as the set of results (worlds) from running $\query$ over each $\db_i \in \idb$. We write this formally as,