Tweaking S5, and trimming back down to 12

master
Oliver Kennedy 2020-12-20 18:38:59 -05:00
parent 32c2511129
commit 13a7cd264d
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
3 changed files with 29 additions and 25 deletions

@ -1,20 +1,22 @@
%!TEX root=./main.tex
\section{Generalizations}
\label{sec:gen}
In this section, we consider generalizations and corollaries of our results.
In particular, in~\Cref{sec:circuits} we first consider the case when the compressed polynomial is represented by a Directed Acyclic Graph (DAG) instead of an expression tree (\Cref{def:express-tree}) and observe that our results carry over.
Then, we formalize our claim from \Cref{sec:intro} that a linear algorithm for our problem implies that PDB queries can be answered in the same runtime as deterministic queries under reasonable assumptions.
Finally, in~\Cref{sec:momemts}, we generalize our result for expectation to other moments.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Lineage Circuits}
\label{sec:circuits}
In~\Cref{sec:semnx-as-repr}, we switched to thinking of our query results as polynomials, and until now we have thought of our inputs the same way.
In particular, starting with~\Cref{sec:expression-trees} we considered these polynomials to be represented as expression trees.
However, expression trees do not capture many of the compressed polynomial representations produced by query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `sharing' of intermediate results, which is crucial for these algorithms (and other query processing methods as well).
In this section, we represent query polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, a standard way to represent polynomials over fields (particularly in the field of algebraic complexity) that we use for polynomials over $\mathbb N$ in the obvious way.
We present a formal treatment of {\em lineage circuits} in~\Cref{sec:circuits-formal}, and give only a quick overview in this section.
A lineage circuit is represented by a DAG, where each source node corresponds to either one of the input variables or a constant, and the sinks to output tuples.
Every other node has at most two in-edges, is labeled as an addition or a multiplication node, and has no limit on its outdegree.
Note that if we limit the outdegree to one, then we get back expression trees.
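As an illustrative aside (a minimal Python sketch with hypothetical names, not part of the paper's formalism), the savings enabled by sharing can be seen by counting nodes two ways for the polynomial $(X+Y)^4$: once as an expression tree, where every reuse of a subexpression is counted again, and once as a circuit, where each gate is counted only once regardless of its outdegree.

```python
# Illustrative sketch: expression-tree size vs. lineage-circuit (DAG) size
# for (X + Y)^4. All class and function names here are hypothetical.

class Gate:
    def __init__(self, op, left=None, right=None):
        self.op, self.left, self.right = op, left, right  # op: 'var', '+', or '*'

def tree_size(g):
    """Count nodes, re-counting shared subexpressions (expression-tree view)."""
    if g.op == 'var':
        return 1
    return 1 + tree_size(g.left) + tree_size(g.right)

def circuit_size(g, seen=None):
    """Count each distinct gate once (DAG / lineage-circuit view)."""
    if seen is None:
        seen = set()
    if id(g) in seen:
        return 0
    seen.add(id(g))
    if g.op == 'var':
        return 1
    return 1 + circuit_size(g.left, seen) + circuit_size(g.right, seen)

x, y = Gate('var'), Gate('var')
s = Gate('+', x, y)      # the shared gate X + Y
sq = Gate('*', s, s)     # (X + Y)^2, reusing s twice (outdegree 2)
p4 = Gate('*', sq, sq)   # (X + Y)^4, reusing sq twice

print(tree_size(p4), circuit_size(p4))  # → 15 5
```

Repeated squaring makes the gap grow: $(X+Y)^{2^k}$ needs $\Theta(2^k)$ tree nodes but only $O(k)$ circuit gates, which is exactly the kind of compression that outdegree-one expression trees cannot express.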
@ -36,11 +38,11 @@ For a more detailed discussion of why~\Cref{lem:approx-alg} holds for a lineage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{The cost model}
\label{sec:cost-model}
So far our analysis of $\approxq$ has been in terms of the size of the compressed lineage polynomial.
We now show that this model corresponds to the behavior of a deterministic database by proving that, for any union of conjunctive queries, we can construct a compressed lineage polynomial for a query $Q$ and \bi $\pxdb$ of size (and in runtime) linear in the runtime of a general class of query processing algorithms for the same query $Q$ on a deterministic database $\db$.
We assume a linear relationship between the input sizes $\abs{\pxdb}$ and $\abs{\db}$, i.e., that there exists a constant $c$ such that $\abs{\pxdb} \leq c \cdot \abs{\db}$ for any world $\db$ of $\pxdb$.
This is a reasonable assumption because each block of a \bi represents an entity with uncertain attributes --- in practice there is a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition for $c = 1$.
%That is for \bis that fulfill this restriction approximating the expectation of results of SPJU queries is only has a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim).
% with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB.
We adopt a minimalistic compute-bound model of query evaluation drawn from the worst-case optimal join literature~\cite{skew,ngo-survey}.
\newcommand{\qruntime}[1]{\textbf{cost}(#1)}
@ -115,15 +117,15 @@ This follows from~\Cref{lem:circuits-model-runtime} and (the lineage circuit cou
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Higher Moments}
\label{sec:momemts}
We make a simple observation to conclude the presentation of our results.
So far we have focused on the expectation of $\poly$.
In addition, we could, e.g., prove bounds on the probability of the multiplicity being at least $1$.
While we do not have a good approximation algorithm for this problem, we can make some progress as follows:
Note that for any positive integer $m$ we can compute the expectation of $\poly^m$ (which only changes the degree of the corresponding lineage polynomial by a factor of $m$).
In other words, we can compute the $m$-th moment of the multiplicities as well, allowing us, e.g., to use Chebyshev's inequality or other higher-moment probability bounds on the events we might be interested in.
However, we leave the question of devising more accurate approximation algorithms to future work.
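To make the moment-based bound concrete, the following illustrative sketch (Python, hypothetical names, not from the paper) computes the first two moments of a small lineage polynomial $\Phi = XY + XZ$ over a TIDB with independent Boolean variables by brute-force world enumeration, then applies Chebyshev's inequality $\Pr[|\Phi - \mu| \geq t] \leq \sigma^2 / t^2$.

```python
# Illustrative sketch: exact moments of Phi = X*Y + X*Z over a TIDB where
# each Boolean variable is 1 independently with the given probability.
# All names are hypothetical; this is brute force, not the paper's algorithm.
from itertools import product

p = {'X': 0.5, 'Y': 0.8, 'Z': 0.4}

def phi(w):
    """Multiplicity of the output tuple in world w."""
    return w['X'] * w['Y'] + w['X'] * w['Z']

def moment(m):
    """Brute-force E[Phi^m] by enumerating all 2^3 possible worlds."""
    total = 0.0
    for bits in product([0, 1], repeat=3):
        w = dict(zip('XYZ', bits))
        prob = 1.0
        for v, b in w.items():
            prob *= p[v] if b else 1 - p[v]
        total += prob * phi(w) ** m
    return total

mu = moment(1)            # E[Phi]   = 0.6
var = moment(2) - mu**2   # E[Phi^2] - E[Phi]^2 = 0.92 - 0.36 = 0.56 (up to rounding)
t = 1.0
cheby = var / t**2        # Chebyshev: Pr[|Phi - mu| >= t] <= var / t^2
print(mu, var, cheby)
```

The paper's point is that $\E[\Phi^m]$ can instead be read off the degree-$m \cdot \deg(\Phi)$ lineage polynomial directly, avoiding the exponential world enumeration used in this toy check.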
%%% Local Variables:

@ -3,11 +3,11 @@
We have studied the problem of calculating the expectation of query polynomials over BIDBs. %random integer variables.
This problem has a practical application in probabilistic databases over multisets, where it corresponds to calculating the expected multiplicity of a query result tuple.
It has been studied extensively for sets (lineage formulas), but the bag setting has not received much attention.
While the expectation of a polynomial in SOP form can be calculated in time linear in the size of the polynomial, the problem is \sharpwonehard for factorized polynomials.
We have proven this claim through a reduction from the problem of counting $k$-matchings.
When only considering polynomials for result tuples of UCQs over TIDBs and BIDBs (under the assumption that there are few cancellations), we prove that it is still possible to approximate the expectation of a polynomial in linear time.
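The linear-time SOP case mentioned above can be sketched in a few lines (an illustrative Python fragment with hypothetical names, assuming a TIDB with independent Boolean variables): by linearity of expectation, $\E[\Phi]$ is a sum over monomials, and since $X^k = X$ for Boolean $X$, each monomial contributes the product of the probabilities of its distinct variables.

```python
# Illustrative sketch (hypothetical names, assumed TIDB setting): linear-time
# expectation of a lineage polynomial given in SOP form, via linearity of
# expectation and X^k = X for Boolean variables.

def expectation_sop(monomials, prob):
    """monomials: list of (coefficient, [variable names]); prob: var -> p."""
    total = 0.0
    for coeff, vars_ in monomials:
        term = coeff
        for v in set(vars_):   # distinct variables only, since X^k = X
            term *= prob[v]
        total += term
    return total

# Phi = X*Y + X*Z, a toy example (not taken from the paper)
prob = {'X': 0.5, 'Y': 0.8, 'Z': 0.4}
sop = [(1, ['X', 'Y']), (1, ['X', 'Z'])]
print(expectation_sop(sop, prob))  # ≈ 0.6
```

The hardness result is precisely that no analogously cheap pass exists for factorized (compressed) representations unless one first expands them, which can be exponentially larger.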
Interesting directions for future work include the development of a dichotomy for queries over bag PDBs and approximation schemes for data models beyond those considered in this paper.
% Furthermore, it would be interesting to see whether our approximation algorithm can be extended to support queries with negations, perhaps using circuits with monus as a representation system.
\BG{I am not sure what interesting future work is here. Some wild guesses, if anybody agrees I'll try to flesh them out:

@ -4,25 +4,25 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\subsection{Probabilistic Databases}\label{sec:prob-datab}
\textbf{Probabilistic Databases} (PDBs) have been studied predominantly for set semantics.
Many data models have been proposed for encoding PDBs more compactly than as sets of possible worlds.
These include tuple-independent databases~\cite{VS17} (\tis), block-independent databases (\bis)~\cite{RS07}, and \emph{PC-tables}~\cite{GT06}, which pair a C-table % ~\cite{IL84a}
with a probability distribution over its variables.
This is similar to our $\semNX$-PDBs, except that PC-tables use Boolean expressions where we use polynomials.
% Tuple-independent databases (\tis) consist of a classical database where each tuple associated with a probability and tuples are treated as independent probabilistic events.
% While unable to encode correlations directly, \tis are popular because any finite probabilistic database can be encoded as a \ti and a set of constraints that ``condition'' the \ti~\cite{VS17}.
% Block-independent databases (\bis) generalize \tis by partitioning the input into blocks of disjoint tuples, where blocks are independent~\cite{RS07}. %,BS06
% \emph{PC-tables}~\cite{GT06} pair a C-table % ~\cite{IL84a}
% with probability distribution over its variables. This is similar to our $\semNX$-PDBs, except that we do not allow for variables as attribute values and instead of local conditions (propositional formulas that may contain comparisons), we associate tuples with polynomials $\semNX$.
Approaches for probabilistic query processing (i.e., computing marginal probabilities for tuples) fall into two broad categories.
\emph{Intensional} (or \emph{grounded}) query evaluation computes the \emph{lineage} of a tuple % (a Boolean formula encoding the provenance of the tuple)
and then the probability of the lineage formula.
In this paper we focus on intensional query evaluation with polynomials.
It has been shown that computing the marginal probability of a tuple is \sharpphard~\cite{valiant-79-cenrp} (by reduction from weighted model counting).
The second category, \emph{extensional} query evaluation, % avoids calculating the lineage.
% This approach
is in \ptime, but is limited to certain classes of queries.
Dalvi et al.~\cite{DS12} proved a dichotomy for unions of conjunctive queries (UCQs):
for any UCQ, the probabilistic query evaluation problem is either \sharpphard (requiring intensional evaluation) or \ptime (permitting extensional evaluation).
Olteanu et al.~\cite{FO16} presented dichotomies for two classes of queries with negation. % R\'e et al~\cite{RS09b} present a trichotomy for HAVING queries.
Amarilli et al. investigated tractable classes of databases for more complex queries~\cite{AB15}. %,AB15c
@ -35,9 +35,11 @@ Fink et al.~\cite{FH12} study aggregate queries over a probabilistic version of
% \cite{FH12} identifies a tractable class of queries involving aggregation.
In contrast, we study a less general data model and query class, but provide a linear-time approximation algorithm and new insights into the complexity of computing expectation (while~\cite{FH12} computes probabilities for individual output annotations).
\noindent \textbf{Compressed Encodings} are used for Boolean formulas (e.g., various types of circuits including OBDDs~\cite{jha-12-pdwm}) and polynomials (e.g., factorizations~\cite{factorized-db}), some of which have been utilized for probabilistic query processing, e.g.,~\cite{jha-12-pdwm}.
Compact representations for which probabilities can be computed in linear time include OBDDs, SDDs, d-DNNFs, and FBDDs.
\cite{DM14c} studies circuits for absorptive semirings, while~\cite{S18a} studies circuits that include negation (expressed as the monus operation). Algebraic Decision Diagrams~\cite{bahar-93-al} (ADDs) generalize BDDs to variables with more than two values. Chen et al.~\cite{chen-10-cswssr} introduced the generalized disjunctive normal form.
\noindent \Cref{sec:param-compl} covers more related work on fine-grained complexity.
%%% Local Variables: