% Commit 37196c454b (parent 9fc77e005b), Boris Glavic, 2021-04-02: "boris bad intro"
\section{Introduction}
\label{sec:intro}
A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is a set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$, called possible worlds, paired with a probability distribution $\pd$ over these worlds. A well-studied problem in probabilistic databases is computing the \emph{marginal probability} of a tuple $\tup$, i.e., the probability that it exists in the result of a query $\query$ over $\pdb$. This problem is \sharpphard under set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs), a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases from those in \ptime for the class of unions of conjunctive queries (UCQs). In this work, we consider bag semantics, where each tuple is associated with a multiplicity $\db_i(\tup)$ in each possible world $\db_i$, and study the analogous problem of computing the expectation of the multiplicity of a query result tuple $\tup$:
\begin{equation}\label{eq:bag-expectation}
\expct_{\db \sim \pd}[\query(\db)(\tup)] = \sum_{\db \in \idb} \query(\db)(\tup) \cdot \pd(\db)
\end{equation}
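As a sanity check, the expectation above can be evaluated by brute-force enumeration of the possible worlds. The following sketch uses an illustrative two-world distribution and a hypothetical query that sums two input multiplicities; none of it is drawn from the paper's examples:

```python
# Sketch: expected multiplicity of a result tuple via brute-force world
# enumeration, as in the expectation formula above. The worlds, query,
# and probabilities are illustrative toys, not the paper's example.

worlds = [
    {"r1": 2, "r2": 1},   # world D1: input tuple multiplicities
    {"r1": 0, "r2": 3},   # world D2
]
probs = [0.6, 0.4]        # P(D1), P(D2)

def query_multiplicity(world):
    # Hypothetical query: the result tuple's multiplicity is r1 + r2
    # (e.g., a projection merging both input tuples).
    return world["r1"] + world["r2"]

expected = sum(query_multiplicity(d) * p for d, p in zip(worlds, probs))
print(expected)  # 0.6*3 + 0.4*3, i.e., 3 up to float rounding
```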
Under set semantics, the marginal probability of a query result $\tup$ can be computed from the lineage $\linsett{\query}{\pdb}{\tup}$ of tuple $\tup$. The lineage $\linsett{\query}{\pdb}{\tup}$ is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions over variables $\vct{X}$) over random variables that represent input tuples. Each possible world $\db$ corresponds to an assignment in $\mathbb{B}^\numvar$ mapping each variable in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). The marginal probability of the tuple is equal to the probability that the lineage $\linsett{\query}{\pdb}{\tup}$ evaluates to true over these assignments. This follows from the fact that $\linsett{\query}{\pdb}{\tup}$ evaluates to true over the assignment for a world $\db$ iff $\tup \in \query(\db)$. For bag semantics, the lineage of a tuple is a polynomial over the variables $\vct{X}$ with
coefficients in the set of natural numbers $\mathbb{N}$ (an element of the semiring $\mathbb{N}[\vct{X}]$). Analogously to the set case, evaluating the lineage over an assignment corresponding to a possible world (mapping variables to natural numbers representing input tuple multiplicities) yields the multiplicity of result tuple $\tup$ in this world.
In their most general form, tuple-independent set-probabilistic databases~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs) answer existential queries (queries for the probability of a specific condition holding over the input database) in two steps: (i) lineage construction and (ii) probability computation.
The lineage is a Boolean formula, an element of the $\text{PosBool}[\vct{X}]$ semiring, whose variables $\vct{X}\in \mathbb{B}^\numvar$ are random variables corresponding to the presence of each of the $\numvar$ input tuples in one possible world of the input database.
The lineage models the relationship between the presence of these input tuples and the query condition being satisfied, and thus the probability of this formula is exactly the query result.
The analogous query in the bag setting~\cite{DBLP:journals/sigmod/GuagliardoL17,feng:2019:sigmod:uncertainty} asks for the expectation of the number (multiplicity) of result tuples that satisfy the query condition.
The process for answering such queries is also analogous, save that the lineage is a polynomial, an element of the $\mathbb{N}[\vct{X}]$ semiring, with coefficients in the natural numbers $\mathbb{N}$ and random variables $\vct{X} \in \mathbb{N}^\numvar$.
The expectation of this polynomial is the query result.
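The role of the $\mathbb{N}[\vct{X}]$ lineage polynomial can be made concrete: representing a polynomial as a list of (coefficient, monomial) terms, evaluating it at a world's tuple multiplicities yields the result tuple's multiplicity in that world. The polynomial and world below are illustrative, not drawn from the paper:

```python
# Sketch: a lineage polynomial in N[X] as a list of (coefficient, monomial)
# terms, where each monomial maps variables to exponents. Evaluating the
# polynomial at a world's tuple multiplicities gives the result tuple's
# multiplicity in that world. Polynomial and world are illustrative.

poly = [(1, {"X1": 1, "X2": 1}), (2, {"X3": 1})]   # X1*X2 + 2*X3

def evaluate(poly, assignment):
    total = 0
    for coeff, monomial in poly:
        term = coeff
        for var, exp in monomial.items():
            term *= assignment[var] ** exp
        total += term
    return total

# A world where tuples X1, X2, X3 have multiplicities 2, 1, 3:
print(evaluate(poly, {"X1": 2, "X2": 1, "X3": 3}))  # 2*1 + 2*3 = 8
```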
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\begin{subfigure}[b]{0.33\linewidth}
\centering
%& Tel Aviv & $L_d$ & $L_d$\\
& Zurich & $L_d$ & $L_e$\\
\end{tabular}
}
\caption{Relation $Loc$ in~\Cref{ex:intro-tbls}}
\label{subfig:ex-shipping-loc}
\end{subfigure}%
\label{fig:ex-shipping}
\trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Example}\label{ex:intro-tbls}
Consider the \ti tables (\cref{fig:ex-shipping}) from an international shipping company.
Table Loc lists all the locations of airports.
Table Route identifies all flight routes.
The tuples of both tables are annotated with elements of the $\text{PosBool}[\vct{X}]$ ($\Phi_{set}$) and $\mathbb{N}[\vct{X}]$ ($\Phi_{bag}$) semirings that indicate the tuples' presence or multiplicity, respectively.
Tuples of Loc are annotated with random variables $L_i$ that model the event of no delays at the corresponding airport on a given day\footnote{We assume for simplicity that these variables are independent events.}.
Tuples of Route are annotated with a constant ($\top$ or $1$, respectively) and are deterministic; queries over this table follow classical query evaluation semantics.
Consider a customer service representative who needs to expedite a shipment to Western Europe.
The query $Q_1 := \pi_{\text{City}_1}\left(\sigma_{\text{City}_2 = \text{``Bremen''} ~OR~ \text{City}_2 = \text{``Zurich''}}\right.$$\left.(Route)\right)$ asks for all cities with routes to either Zurich or Bremen.
Both routes exist from Chicago, and so the result lineage~\cite{DBLP:conf/pods/GreenKT07} of the corresponding tuple (\cref{subfig:ex-shipping-queries}) indicates that the tuple is deterministically present, either via Zurich or Bremen.
Analogously, under bag semantics Chicago appears in the result twice.
Suppose the representative would like to consider delays from the originating city.
The resulting lineage formulas (\cref{subfig:ex-shipping-queries}) concisely describe the event of delivering a shipment to Zurich or Bremen without departure delay, or the number of departure-delay-free routes to these cities given an assignment to $L_b$.
If Chicago is delay-free ($L_b = \top$, $L_b = 1$, respectively), there exists a route (set semantics) or there are two routes (bag semantics).
\end{Example}
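The bag-semiring annotation propagation behind the example can be sketched as follows; the relation instance and annotations are toy stand-ins for the Route table, not the paper's exact data:

```python
# Sketch: K-relation evaluation over the bag semiring N: selection keeps a
# tuple's annotation when the predicate holds; projection adds annotations
# of tuples that collapse together. Toy stand-in for the Route table.

route = [  # (City1, City2, annotation in N)
    ("Chicago", "Zurich", 1),
    ("Chicago", "Bremen", 1),
    ("New York", "Zurich", 1),
]

# sigma_{City2 = 'Bremen' OR City2 = 'Zurich'}
selected = [(c1, c2, k) for (c1, c2, k) in route if c2 in ("Bremen", "Zurich")]

# pi_{City1}: sum annotations of tuples agreeing on City1
result = {}
for c1, _, k in selected:
    result[c1] = result.get(c1, 0) + k

print(result["Chicago"])  # 1 + 1 = 2: Chicago appears twice under bags
```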
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%The computation of the marginal probability in is a known . Its corresponding problem in bag PDBs is computing the expected count of a tuple $\tup$. The computation is a two step process. The first step involves actually computing the lineage formula of $\tup$. The second step then computes the marginal probability (expected count) of the lineage formula of $\tup$, a boolean formula (polynomial) in the set (bag) semantics setting.
A well-known dichotomy~\cite{10.1145/1265530.1265571} separates the common case, where computing the probability of a Boolean lineage formula is \sharpphard, from the cases where the probability computation can be inlined into the polynomial-time lineage construction process.
Historically, the bottleneck for \emph{set}-probabilistic databases has been the second step: an instrumented query can compute a circuit encoding of the result lineage with at most a constant-factor overhead over the un-instrumented query ((TODO: Find citation)).
Because the probability computation is the bottleneck, it is typical to assume that the lineage formula is provided in disjunctive normal form (DNF), as even when this assumption holds the problem remains \sharpphard in general.
However, for bag semantics the analogous sum-of-products (SOP) lineage representation admits a trivial linear-time algorithm due to linearity of expectation.
However, what can be said about lineage polynomials (i.e., bag-probabilistic database query results) that are in a compressed (e.g., circuit) representation instead?
In this paper we study computing the expected count of an output bag PDB tuple whose lineage formula is in a compressed representation, using the more general intensional query evaluation semantics.
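For the SOP case, the linearity-of-expectation argument can be sketched directly. Assuming, for illustration, independent $\{0,1\}$-valued tuple variables, the expectation of each monomial is the product of the probabilities of its distinct variables, so the total is computable in time linear in the SOP size:

```python
# Sketch: expectation of an SOP lineage polynomial over independent
# {0,1}-valued tuple variables. By linearity of expectation, E[poly] is
# the sum over monomials of coeff * product of the probabilities of the
# monomial's distinct variables (X^e = X on {0,1}); this is linear in
# the SOP size. Probabilities and polynomial are illustrative.

prob = {"X1": 0.5, "X2": 0.4, "X3": 0.9}   # P(X = 1) for each variable
sop = [(1, ["X1", "X2"]), (2, ["X3"])]     # X1*X2 + 2*X3

expectation = 0.0
for coeff, variables in sop:
    term = coeff
    for v in set(variables):               # distinct variables only
        term *= prob[v]
    expectation += term

print(expectation)  # 1*(0.5*0.4) + 2*0.9, i.e., 2.0 up to float rounding
```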
%
%%Most theoretical developments in probabilistic databases (PDBs) have been made in the setting of set semantics. This is largely due to the stark contrast in hardness results when computing the first moment of a tuple's lineage formula (a boolean formula encoding the contributing input tuples to the output tuple) in set semantics versus the linear runtime when computing the expectation over the lineage polynomial (a standard polynomial analogously encoding contributing input tuples) of a tuple from an output bag PDB. However, when viewed more closely, the assumption of linear runtime in the bag setting relies on the lineage polynomial being in its "expanded" sum of products (SOP) form (each term is a product, where all (product) terms are summed). What can be said about computing the expectation of a more compressed form of the lineage polyomial (e.g. factorized polynomial) under bag semantics?
%
%%As explainability and fairness become more relevant to the data science community, it is now more critical than ever to understand how reliable a dataset is.
%%Probabilistic databases (PDBs)~\cite{DBLP:series/synthesis/2011Suciu} are a compelling solution, but a major roadblock to their adoption remains:
%%Naively, one might suggest that this is because most work on probabilistic databases assumes set semantics, while, virtually all implementations of the relational data model use bag semantics.
%%However, as we show in this paper, there is a more subtle problem behind this barrier to adoption.
%\subsection{Sets vs. Bags}
%In the setting of set semantics, this problem can be defined as: given a query, probabilistic database, and possible result tuple, compute the marginal probability of the tuple appearing in the result. It has been shown that this is equivalent to computing the probability of the lineage formula. %, which records how the result tuple was derived from input tuples.
%Given this correspondence, the problem reduces to weighted model counting over the lineage (a \sharpphard problem, even if the lineage is in DNF--the "expanded" form of the lineage formula in set semantics, corresponding to SOP of bag semantics).
%%A large body of work has focused on identifying tractable cases by either identifying tractable classes of queries (e.g.,~\cite{DS12}) or studying compressed representations of lineage formulas that are tractable for certain classes of input databases (e.g.,~\cite{AB15}). In this work we define a compressed representation as any one of the possible circuit representations of the lineage formula (please see Definitions~\ref{def:circuit},~\ref{def:poly-func}, and~\ref{def:circuit-set}).
%
%In bag semantics this problem corresponds to computing the expected multiplicity of a query result tuple, which can be reduced to computing the expectation of the lineage polynomial.
% \node[tree_node] (b1) at (1, 0){$W_b$};
% \node[tree_node] (c1) at (2, 0){$W_c$};
% \node[tree_node] (d1) at (3, 0){$W_d$};
%
% \node[tree_node] (a2) at (0.75, 0.8){$\boldsymbol{\circmult}$};
% \node[tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circmult}$};
% \node[tree_node] (c2) at (2.25, 0.8){$\boldsymbol{\circmult}$};
%
% \node[tree_node] (a3) at (1.9, 1.6){$\boldsymbol{\circplus}$};
% \node[tree_node] (a4) at (0.75, 1.6){$\boldsymbol{\circplus}$};
% \node[tree_node] (a5) at (0.75, 2.5){$\boldsymbol{\circmult}$};
%
% \draw[->] (a1) -- (a2);
% \draw[->] (b1) -- (a2);
% \draw[->] (b1) -- (b2);
%\end{Example}
%
%Note that the query of \Cref{ex:bag-vs-set} in set semantics is indeed non-hierarchical~\cite{DS12}, and thus \sharpphard.
%To see why computing this probability is hard, observe that the three clauses $(W_aW_b, W_bW_c, W_aW_c)$ of $(\ref{eq:poly-set})$ are not independent (the same variables appear in multiple clauses) nor disjoint (the clauses are not mutually exclusive). Computing the probability of such formulas exactly requires exponential time algorithms (e.g., Shanon Decomposition).
%Conversely, in Bag-PDBs, correlations between monomials of the SOP polynomial (\ref{eq:poly-bag}) are not problematic thanks to linearity of expectation.
%The expectation computation over the output lineage is simply the sum of expectations of each clause.
%Referring again to example~\ref{ex:intro}, the expectation is simply
%The workflow modeling this particular problem can be broken down into two steps. We start with converting the output boolean formula (polynomial) into a representation. This representation is then the interface for the second step, which is computing the marginal probability (count) of the encoded boolean formula (polynomial). A natural question arises as to which representation to use. Our choice to use circuits (\Cref{def:circuit}) to represent the lineage polynomials follows from the observation that the work in WCOJ/FAQ/Factorized DB's --\color{red}CITATION HERE\color{black}-- all contain algorithms that can be easily be modified to output circuits without changing their runtime. Further, circuits generally allow for greater compression than other respresentations, such as expression trees. By the former observation, step one is always linear in the size of the circuit representation of the boolean formula (polynomial), implying that if the second step of the workflow is computed in time greater, then reducing the complexity of the second step would indeed improve the overall efficiency of computing the marginal probability (count) of an output set (bag) PDB tuple. This, however, as noted earlier, cannot be done in the set semantics setting, due to known hardness results.
%
%Though computing the expected count of an output bag PDB tuple $\tup$ is linear (in the size of the polynomial) when the lineage polynomial of $\tup$ is in SOP form, %has received much less attention, perhaps due to the property of linearity of expectation noted above.
%%, perhaps because on the surface, the problem is trivially tractable.In fact, as mentioned, it is linear time when the lineage polynomial is encoded in an SOP representation.
%is this computation also linear (in the size of an equivalent compressed representation) when the lineage polynomial of $\tup$ is in compressed form?
%there exist compressed representations of polynomials, e.g., factorizations~\cite{factorized-db}, that can be polynomially more concise than their SOP counterpart.
Such compressed forms naturally occur under typical database optimizations, e.g., projection push-down~\cite{DBLP:books/daglib/0020812}, where, for instance, a projection below a join causes addition to be performed prior to multiplication, yielding a product of sums instead of an SOP.
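The size gap between a pushed-down (factorized) annotation and its expanded SOP form can be sketched with a toy product of sums; the variable names are illustrative:

```python
# Sketch: a product of sums (as produced by pushing a projection below a
# join) versus its expanded SOP form: the factorized form has |A| + |B|
# leaves while the SOP has |A| * |B| monomials. Names are illustrative.
from itertools import product

factor_a = ["X1", "X2", "X3"]   # sum produced on one side of the join
factor_b = ["Y1", "Y2", "Y3"]   # sum produced on the other side

factorized_size = len(factor_a) + len(factor_b)   # (X1+X2+X3)*(Y1+Y2+Y3)
sop = [a + "*" + b for a, b in product(factor_a, factor_b)]

print(factorized_size, len(sop))  # 6 9
```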
\begin{figure}[t]
\begin{subfigure}[b]{0.51\linewidth}
\label{fig:ex-proj-push}
\end{figure}
\begin{Example}
Consider again the tables in \cref{subfig:ex-shipping-loc} and \cref{subfig:ex-shipping-route} and let us assume that the tuples in $Route$ are annotated with random variables as shown in \cref{subfig:ex-proj-push-q4}.
Consider the equivalent queries $Q_3 := \pi_{\text{City}_1}(Loc \bowtie_{\text{City}_\ell = \text{City}_1}Route)$ and $Q_4 := Loc \bowtie_{\text{City}_\ell = \text{City}_1}\pi_{\text{City}_1}(Route)$.
The latter's ``pushed down'' projection produces a compressed annotation, both in the polynomial, as well as its circuit encoding (\cref{subfig:ex-proj-push-circ-q3,subfig:ex-proj-push-circ-q4}).
In general, compressed representations of the lineage polynomial can be exponentially smaller than the polynomial.
\end{Example}
This suggests that perhaps even Bag-PDBs have higher query processing complexity than deterministic databases.
In this paper, we confirm this intuition, first proving that computing the expected count of a query result tuple is super-linear (\sharpwonehard) in the size of a compressed lineage representation, and then relating the size of the compressed lineage to the cost of answering a deterministic query.
In view of this hardness result (i.e., step 2 of the workflow is the bottleneck in the bag setting as well), we develop an approximation algorithm for expected counts of SPJU query results over Bag-PDBs that is, to our knowledge, the first linear-time (in the size of the factorized lineage) $(1-\epsilon)$-\emph{multiplicative} approximation, so that step 2 is no longer the bottleneck of the workflow.
By extension, this algorithm is only a constant factor slower than deterministic query processing.\footnote{
Monte Carlo sampling~\cite{jampani2008mcdb} is also trivially only a constant factor slower, but can only guarantee additive rather than our stronger multiplicative bounds.
}
This is an important result, because it implies that computing approximate expectations for bag output PDBs of SPJU queries can indeed be competitive with deterministic query evaluation over bag databases.
\subsection{Overview of our results and techniques}
Our hardness results follow by considering a suitable generalization of the lineage polynomial in \cref{eq:edge-query}. First, it is easy to generalize the polynomial to $\poly_G(X_1,\dots,X_n)$, which represents the edge set of a graph $G$ on $n$ vertices. Then $\poly_G^k(X_1,\dots,X_n)$ (i.e., $\inparen{\poly_G(X_1,\dots,X_n)}^k$) encodes as its monomials all subgraphs of $G$ with at most $k$ edges. This implies that the corresponding reduced polynomial $\rpoly_G^k(\prob,\dots,\prob)$ (see \Cref{def:reduced-poly}) can be written as $\sum_{i=0}^{2k} c_i\cdot \prob^i$, and we observe that $c_{2k}$ is proportional to the number of $k$-matchings in $G$, counting which is \sharpwonehard. Thus, if we have access to $\rpoly_G^k(\prob_i,\dots,\prob_i)$ for distinct values $\prob_i$ for $0\le i\le 2k$, then we can set up a system of linear equations and compute $c_{2k}$ (and hence the number of $k$-matchings in $G$). This result, however, does not rule out the possibility that computing $\rpoly_G^k(\prob,\dots, \prob)$ for a {\em single specific} value of $\prob$ might be easy: indeed it is easy for $\prob=0$ or $\prob=1$. However, we are able to show that for any other value of $\prob$, computing $\rpoly_G^k(\prob,\dots, \prob)$ exactly will most likely require super-linear time. This reduction needs more work (and we cannot yet extend our results to $k>3$). Further, we have to rely on more recent conjectures in {\em fine-grained} complexity, e.g., on the complexity of counting the number of triangles in $G$, rather than on more standard parameterized hardness assumptions (e.g., being \sharpwonehard).
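The linear-system step of this reduction can be sketched on a toy univariate polynomial: evaluating $\sum_{i=0}^{2k} c_i \prob^i$ at $2k+1$ distinct points yields a Vandermonde system whose solution recovers all coefficients, including $c_{2k}$. The degree-4 polynomial below is hypothetical:

```python
# Sketch: recovering the coefficients c_0..c_{2k} of r(p) = sum_i c_i p^i
# from evaluations at 2k+1 distinct points via a Vandermonde system.
# Gauss-Jordan elimination in pure Python; the coefficients are toys.

def solve(A, b):
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]          # partial pivoting
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

coeffs = [3, 0, 1, 0, 2]                         # r(p) = 3 + p^2 + 2p^4 (toy)
r = lambda p: sum(c * p**i for i, c in enumerate(coeffs))

points = [0.1, 0.2, 0.3, 0.4, 0.5]               # 2k+1 = 5 distinct values
A = [[p**i for i in range(5)] for p in points]
recovered = solve(A, [r(p) for p in points])

print(round(recovered[4], 6))  # leading coefficient c_4 = 2.0
```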
The starting point of our approximation algorithm was the simple observation that for any lineage polynomial $\poly(X_1,\dots,X_n)$ we have $\rpoly(1,\dots,1)=\poly(1,\dots,1)$, and that if all the coefficients of $\poly$ are constants, then $\poly(\prob,\dots, \prob)$ (which can easily be computed in linear time) is a $\prob^k$-approximation to the value $\rpoly(\prob,\dots, \prob)$ that we are after. If $\prob$ (i.e., the \emph{input} tuple probabilities) and $k=\degree(\poly)$ are constants, then this gives a constant-factor approximation. We then use sampling to get a better approximation factor of $(1\pm \eps)$: we sample monomials from $\poly(X_1,\dots,X_\numvar)$ and compute an appropriately weighted sum of their coefficients. Standard tail bounds then give us our desired approximation scheme. To obtain a linear runtime, it turns out that we need the following properties from our compressed representation of $\poly$: (i) we can compute $\poly(1,\ldots, 1)$ in linear time, and (ii) we can sample monomials from $\poly(X_1,\dots,X_n)$ quickly.
%For the ease of exposition, we start off with expression trees (see~\Cref{fig:circuit-q2-intro} for an example) and show that they satisfy both of these properties. Later we show that it is easy to show that these properties also extend to polynomial circuits as well (we essentially show that in the required time bound, we can simulate access to the `unrolled' expression tree by considering the polynomial circuit).
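The sampling scheme can be sketched as follows: draw monomials with probability proportional to their coefficients (so that $\poly(1,\dots,1)$ acts as the normalizing constant) and average $\prob$ raised to the number of distinct variables per monomial. The polynomial, probabilities, and independence assumption below are illustrative:

```python
# Sketch: estimating the expectation of a lineage polynomial by sampling
# monomials with probability proportional to their coefficients;
# poly(1,...,1) is the normalizer. Toy polynomial, i.i.d. {0,1}-valued
# variables with P(X = 1) = p.
import random

random.seed(0)
sop = [(1, ["X1", "X2"]), (2, ["X3"]), (3, ["X1", "X3"])]
p = 0.5

norm = sum(c for c, _ in sop)            # poly(1,...,1) = 6
weights = [c / norm for c, _ in sop]

n_samples = 20000
acc = 0.0
for _ in range(n_samples):
    _, monomial = random.choices(sop, weights=weights)[0]
    acc += p ** len(set(monomial))       # distinct vars: X^e = X on {0,1}
estimate = norm * acc / n_samples

exact = sum(c * p ** len(set(m)) for c, m in sop)   # 0.25 + 1.0 + 0.75
print(abs(estimate - exact) < 0.1)       # True: estimate is close to 2.0
```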
We formalize our claim that, since our approximation algorithm runs in time linear in the size of the polynomial circuit, we can approximate the expected output tuple multiplicities with only an $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results extend trivially to higher moments of the tuple multiplicity (instead of just the expectation).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Macro definitions (second changed file in this commit)

\newcommand{\pd}{\vct{P}}%pd for probability distribution
\newcommand{\eval}[1]{\llbracket #1 \rrbracket}%evaluation double brackets
\newcommand{\evald}[2]{\eval{{#1}}_{#2}}
\newcommand{\linsett}[3]{\Phi_{#1,#2}^{#3}}
\newcommand{\query}{Q}
\newcommand{\join}{\Join}
\newcommand{\select}{\sigma}
\newcommand{\project}{\pi}
\newcommand{\treesize}{\func{size}}
\newcommand{\sign}{\func{sgn}}
%Random Variable
\newcommand{\randomvar}{W}
\newcommand{\domain}{\func{Dom}}
%PDBs