diff --git a/conclusions.tex b/conclusions.tex
new file mode 100644
index 0000000..0ec5697
--- /dev/null
+++ b/conclusions.tex
@@ -0,0 +1,14 @@
+\section{Conclusions and Future Work}\label{sec:concl-future-work}
+
+We have studied the problem of calculating the expectation of polynomials over random integer variables. This problem has a practical application in probabilistic databases over multisets, where it corresponds to calculating the expected multiplicity of a query result tuple using the tuple's provenance polynomial. This problem has been studied extensively for sets (lineage formulas), but the bag setting has not received much attention so far. While the expectation of a polynomial can be calculated in time linear in the size of the polynomial when it is in sum-of-products normal form, the problem is \sharpwonehard for factorized polynomials. We have proven this claim through a reduction from the problem of counting $k$-matchings. When considering only polynomials for result tuples of UCQs over TIDBs and BIDBs (under the assumption that there are $O(1)$ cancellations), we prove that it is possible to approximate the expectation of a polynomial in linear time.
+
+\BG{I am not sure what interesting future work there is here. Some wild guesses; if anybody agrees, I'll try to flesh them out:
+\textbullet{More queries: what happens with negation? Can circuits with monus be used?}
+\textbullet{More databases: can we push beyond BIDBs? E.g., C-tables / aggregate semimodules, or just TIDBs where each input tuple is a random variable over $\mathbb{N}$?}
+\textbullet{Other results: can we extend the work to approximate $P(R(t) = n)$?}
+}
+
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: "main"
+%%% End:
diff --git a/intro.tex b/intro.tex
index 139f459..1047ed6 100644
--- a/intro.tex
+++ b/intro.tex
@@ -2,7 +2,8 @@
 
 \section{Introduction}
 
-Modern production databases like Postgres and Oracle use bag semantics. In contrast, most implementations of probabilistic databases (PDBs) are built in the setting of set semantics, where computing the probability of an output tuple is analogous to weighted model counting (a known $\sharpphard$ problem).
+Modern production databases like Postgres and Oracle use bag semantics. In contrast, most implementations of probabilistic databases (PDBs) are built in the setting of set semantics, where computing the probability of an output tuple is analogous to weighted model counting (a known \sharpphard problem).
+
 %the annotation of the tuple is a lineage formula ~\cite{DBLP:series/synthesis/2011Suciu}, which can essentially be thought of as a boolean formula. It is known that computing the probability of a lineage formula is \#-P hard in general
 In PDBs, a boolean formula, ~\cite{DBLP:series/synthesis/2011Suciu} also called a lineage formula, encodes the conditions under which each output tuple appears in the result.
 %The marginal probability of this formula being true is the tuple's probability to appear in a possible world.
@@ -99,7 +100,9 @@ Assume the following $\mathbb{B}/\mathbb{N}$ variable assignments: $W_a\mapsto T
 \end{align*}
 In the set/lineage setting, we find that the boolean query is satisfied, while in the bags evaluation we see how many combinations of the input satsify the query. 
 \end{Example}
-Note that computing the probability of the query of ~\cref{ex:intro} in set semantics is indeed $\sharpphard$, since it is a query that is non-hierarchical
+
+Note that computing the probability of the query of~\cref{ex:intro} in set semantics is indeed \sharpphard, since it is a query that is non-hierarchical.
+
 %, i.e., for $Vars(\poly)$ denoting the set of variables occuring across all atoms of $\poly$, a function $sg(x)$ whose output is the set of all atoms that contain variable $x$, we have that $sg(A) \cap sg(B) \neq \emptyset$ and $sg(A)\not\subseteq sg(B)$ and $sg(B)\not\subseteq sg(A)$, ~\cite{10.1145/1265530.1265571}.
 %Thus, computing $\expct\pbox{\poly(W_a, W_b, W_c)}$, i.e. the probability of the output with annotation $\poly(W_a, W_b, W_c)$, ($\prob(q)$ in Dalvi, Sucui) is hard in set semantics.
 To see why this computation is hard for query $\poly$ over set semantics, from the query input we compute an output lineage formula of $\poly(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a$. Note that the conjunctive clauses are not independent of one another and the computation of the probability is not linear in the size of $\poly(W_a, W_b, W_c)$:
diff --git a/main.tex b/main.tex
index 73f4581..d696c52 100644
--- a/main.tex
+++ b/main.tex
@@ -169,8 +169,9 @@ sensitive=true
 \input{single_p}
 \input{lin_sys}
 \input{approx_alg}
-%\input{bi_cancellation}
-
+% \input{bi_cancellation}
+\input{related-work}
+\input{conclusions}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
 
diff --git a/related-work.tex b/related-work.tex
new file mode 100644
index 0000000..a843c0c
--- /dev/null
+++ b/related-work.tex
@@ -0,0 +1,6 @@
+\section{Related Work}\label{sec:related-work}
+
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: "main"
+%%% End:
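A minimal worked sketch of the conclusions' linear-time claim for sum-of-products polynomials, reusing the query polynomial from the intro.tex example above. The bag-semantics form of $\poly$ and the independence of $W_a$, $W_b$, $W_c$ (as in a TIDB) are assumptions of this sketch, not taken from the patch; \expct and \pbox are the paper's own macros.

% Sketch only: assumes the annotations W_a, W_b, W_c are independent random
% variables (TIDB-style) and that the bag version of the example query
% polynomial is the sum-of-products form below.
\begin{align*}
\expct\pbox{\poly(W_a, W_b, W_c)}
  &= \expct\pbox{W_aW_b + W_bW_c + W_aW_c}\\
  &= \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_aW_c}\\
  &= \expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_a}\expct\pbox{W_c}.
\end{align*}
% One term per monomial, so the computation is linear in the size of the
% sum-of-products representation. By contrast, a factorized polynomial such as
% (W_a + W_b)(W_b + W_c) expands to W_aW_b + W_aW_c + W_b^2 + W_bW_c, and the
% non-linear monomial W_b^2 requires a higher moment, since in general
% \expct\pbox{W_b^2} differs from \expct\pbox{W_b}^2.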