MayBMS_Mirror/Documents/experiments.tex



\chapter{Experiments}


This section reports on experiments performed with the first MayBMS release
(beta) and a benchmark consisting of two parts,
which are described in more detail in the remainder of this chapter:
%
\begin{enumerate}
\item
Computing the probability of triangles in random graphs.

\item
A modified subset of the TPC-H queries on uncertain TPC-H datasets.
\end{enumerate}


By this benchmark, we do not attempt to simulate a representative set of
use cases: the jury is still out on what such a set of use cases might be.
Instead, we focus on a benchmark that allows us to see how the performance
of MayBMS develops across releases on the two core technical problems solved
by MayBMS: polynomial-time query evaluation for the polynomial-time fragment
of our query language and the efficient approximation of query results for
queries that do not belong to the polynomial-time fragment. (Finding triangles
in random graphs is a near-canonical example of such queries.)

We will keep monitoring the development of the state of the art and will
continue to survey applications and collect use cases; we will extend or
replace this benchmark as consensus develops regarding the most important
applications of probabilistic databases.


\medskip

Experimental setup.
All the experiments reported on in this chapter were conducted on an Athlon-X2(4600+)64bit / 1.8GB / Linux2.6.20 / gcc4.1.2 machine.


\section{Random Graphs}
\subsection{Experiments with Varying Levels of Precision}

In this experiment, we create undirected random graphs in which
the presence of each edge is independent of that of the other edges. The probability that an edge is in the graph is 0.5 and
this applies to each edge. Then we compute the probability that there exists a triangle in the graphs using approximation.
The queries can be found in Appendix~\ref{app:randgraph}.

We report wall-clock execution times
of queries run in the PostgreSQL8.3.3 psql shell with a warm
cache obtained by running a query once and then reporting
the average execution time over three subsequent, identical executions.
Figure \ref{fig:randgraph} shows the execution time of approximation with different precision parameters for random graphs composed of 5 to 33 nodes. An ($\epsilon, \delta$) approximation has the following property: let $p$ be the exact probability and $\hat{p}$ be the approximate probability, then
$\Pr\big[ |p - \hat{p}| \ge \epsilon \cdot p \big] \le \delta$.

\begin{figure}[htp]

\begin{center}
  \begin{tabular}{ | c | c | c | c | c | c | }
    \hline
    \multirow{2}{*}{\#nodes} & \multirow{2}{*}{\#clauses} & \multicolumn{4}{|c|}{Execution Time(Seconds)} \\ \cline{3-6}
          &  & (.05,.05) & (.01,.01) & (.005,.005) &  (.001,.001)  \\ \hline
    5 & 10 & 0.01 & 0.03 & 0.11 & 2.08  \\ \hline
    6 & 20 & 0.01 & 0.08 & 0.26 & 5.27   \\ \hline
	7 & 35 & 0.02 & 0.14 & 0.46 & 9.15   \\ \hline
	8 & 56 & 0.03 & 0.22 & 0.7 & 12.49   \\ \hline
	9 & 84 & 0.04 & 0.28 & 0.85 & 14.95   \\ \hline
	10 & 120 & 0.08 & 0.44 & 1.13 & 16.19   \\ \hline
	11 & 165 & 0.15 & 0.60 & 1.60 & 17.98   \\ \hline
	12 & 220 & 0.29 & 1.24 & 2.48 & 24.31   \\ \hline
	13 & 286 & 0.55 & 2.38 & 4.74 & 35.29   \\ \hline
	14 & 364 & 0.98 & 4.26 & 8.38 & 51.51   \\ \hline
	15 & 455 & 1.56 & 6.74 & 13.29 & 73.00   \\ \hline
	16 & 560 & 2.37 & 10.26 & 19.21 & 102.97   \\ \hline
	17 & 680 & 3.46 & 14.6 & 28.76 & 144.02   \\ \hline
	18 & 816 & 4.92 & 20.49 & 41.1 & 206.18  \\ \hline
	19 & 969 & 7.03 & 28.52 & 56.43 & 291.21  \\ \hline
	20 & 1140 & 9.97 & 39.72 & 81.01 & 395.18  \\ \hline
	21 & 1330 & 14.74 & 57.13 & 123.79 &  597.86  \\ \hline
	22 & 1540 & 23.94 & 119.81 & 218.62 & 600+   \\ \hline
	23 & 1771 & 46.21 & 204.83 & 416.42 & 600+  \\ \hline
	24 & 2024 & 79.03 & 411.67 & 600+ & 600+  \\ \hline
	25 & 2300 & 115.64 & 515.65 & 600+ & 600+  \\ \hline
	26 & 2600 & 159.66 & 600+ & 600+ & 600+  \\ \hline
	27 & 2925 & 202.98 & 600+ & 600+ & 600+  \\ \hline
	28 & 3276 & 251.82 & 600+ & 600+ & 600+  \\ \hline
	29 & 3654 & 312.89 & 600+ & 600+ & 600+  \\ \hline
	30 & 4060 & 387.72 & 600+ & 600+ & 600+  \\ \hline
	31 & 4495 & 475.78 & 600+ & 600+ & 600+  \\ \hline
	32 & 4960 & 582.4 & 600+ & 600+ & 600+  \\ \hline
	33 & 5456 & 600+ & 600+ & 600+ & 600+  \\ \hline

  \end{tabular}
\end{center}

\caption{Comparison between execution time of approximation with different precision}

\label{fig:randgraph}
\end{figure}

\subsection{Experiments with Different Edge Probabilities}

In the previous experiments, each edge had probability 0.5. We use other values as the edge probability(all edges still have the same probability) and run the experiment again with (0.05,0.05) approximation. The SQL statements in Appendix~\ref{app:randgraph} should be modified accordingly. Let $p$ be the probability, change the following statements
\begin{verbatim}
insert into  inout values (1, 0.5);
insert into  inout values (0, 0.5);
\end{verbatim}
 to
\begin{verbatim}
insert into  inout values (1, p);
insert into  inout values (0, 1 - p);
\end{verbatim}
Figure \ref{fig:edge-prob} shows the execution time for queries of random graphs composed of 25 to 101 nodes with different fixed edge probabilities.

\begin{figure}[htp]

\begin{center}
  \begin{tabular}{ | c | c | c | c | c | }
    \hline
    \multirow{2}{*}{\#nodes} & \multirow{2}{*}{\#clauses} & \multicolumn{3}{|c|}{Execution Time(Seconds)} \\ \cline{3-5}
      	&  & p=0.5 & p=0.1 & p=0.05   \\ \hline
	25 & 2300 & 115.64 & 1.77 & 0.55   \\ \hline
	%26 & 2600 & 159.66 & 2.28 & 0.72   \\ \hline
	%27 & 2925 & 202.98 & 2.52 & 0.83   \\ \hline
	%28 & 3276 & 251.82 & 3.19 & 1.02   \\ \hline
	%29 & 3654 & 312.89 & 3.73 & 1.19   \\ \hline
	30 & 4060 & 387.72 & 4.13 & 1.35   \\ \hline
	31 & 4495 & 475.78 & 4.94 & 1.54   \\ \hline
	32 & 4960 & 582.40 & 5.72 & 1.82   \\ \hline
	33 & 5456 & 600+ & 6.87 & 2.12   \\ \hline
	%34 & 5984 & 600+ & 7.48 & 2.60  \\ \hline
	35 & 6545 & 600+ & 8.74 & 2.74  \\ \hline
	%36 & 7140 & 600+ & 9.59 & 3.12  \\ \hline
	%37 & 7770 & 600+ & 11.53 & 3.63  \\ \hline
	%38 & 8436 & 600+ & 13.38 & 3.92  \\ \hline
	%39 & 9139 & 600+ & 15.32 & 4.6  \\ \hline
	40 & 9880 & 600+ & 18.32 & 5.06  \\ \hline
	%41 & 10660 & 600+ & 20.65 & 5.76  \\ \hline
	%42 & 11480 & 600+ & 23.91 & 6.51  \\ \hline
	%43 & 12341 & 600+ & 28.44 & 7.7  \\ \hline
	%44 & 13244 & 600+ & 32.38 & 8.48  \\ \hline
	45 & 14190 & 600+ & 36.77 & 8.96  \\ \hline
	%46 & 15180 & 600+ & 41.09 & 9.99  \\ \hline
	%47 & 16215 & 600+ & 48.68 & 11.45  \\ \hline
	%48 & 17296 & 600+ & 54.66 & 12.62  \\ \hline
	%49 & 18424 & 600+ & 61.05 & 13.39  \\ \hline
	50 & 19600 & 600+ & 70.79 & 15.79  \\ \hline
	%51 & 20825 & 600+ & 80.19 & 16.09  \\ \hline
	%52 & 22100 & 600+ & 88.32 & 17.16  \\ \hline
	%53 & 23426 & 600+ & 97.99 & 19.49  \\ \hline
	%54 & 24804 & 600+ & 112.07 & 21.58  \\ \hline
	55 & 26235 & 600+ & 123.69 & 21.97  \\ \hline
	%56 & 27720 & 600+ & 138.92 & 25.73  \\ \hline
	%57 & 29260 & 600+ & 155.86 & 27.52  \\ \hline
	%58 & 30856 & 600+ & 172.39 & 29.37  \\ \hline
	%59 & 32509 & 600+ & 190.98 & 32.06  \\ \hline
	60 & 34220 & 600+ & 214.06 & 33.94  \\ \hline
	%61 & 35990 & 600+ &  & 36.97  \\ \hline
	%62 & 37820 & 600+ &  & 38.40  \\ \hline
	%63 & 39711 & 600+ &  & 42.80  \\ \hline
	%64 & 41664 & 600+ &  & 43.89  \\ \hline
	65 & 43680 & 600+ & 343.66 & 47.09  \\ \hline
	%66 & 45760 & 600+ &  & 51.56  \\ \hline
	%67 & 47905 & 600+ &  & 54.87  \\ \hline
	68 & 50116 & 600+ & 451.06 & 59.87  \\ \hline
	69 & 52934 & 600+ & 490.64 & 64.69  \\ \hline
	70 & 54740 & 600+ & 542.61 & 68.98  \\ \hline
	71 & 57155 & 600+ & 595.03 & 72.88  \\ \hline
	72 & 59640 & 600+ & 600+ & 82.30  \\ \hline
	75 & 67525 & 600+ & 600+ & 106.49  \\ \hline
	80 & 82160 & 600+ & 600+ & 154.92  \\ \hline
	85 & 98770 & 600+ & 600+ & 224.3  \\ \hline
	90 & 117480 & 600+ & 600+ & 316.28  \\ \hline
	95 & 138415 & 600+ & 600+ & 437.39  \\ \hline
	97 & 147440 & 600+ & 600+ & 510.39  \\ \hline
	98 & 152096 & 600+ & 600+ & 543.87  \\ \hline
	99 & 156849 & 600+ & 600+ & 558.44  \\ \hline
	100 & 161700 & 600+ & 600+ & 593.84  \\ \hline
	101 & 166650 & 600+ & 600+ & 600+  \\ \hline

  \end{tabular}
\end{center}

\caption{Comparison between execution time of queries of random graphs with different fixed edge probabilities}

\label{fig:edge-prob}
\end{figure}

\subsection{Experiments with General Random Graphs}

The previous experiments were conducted on undirected graphs in which every pair of nodes had a possibly present edge. However, this may not be the case in general. In many scenarios, each pair of nodes may have a certainly present, certainly absent or possibly present edge. In our following experiments, we construct such general probabilistic random graphs from data representing directed links between webpage within nd.edu domain\footnote{http://www.nd.edu/~networks/resources/www/www.dat.gz}. If a link between two pages is absent from the data, then it is also absent from our graphs. If a link is present in the data, then it is a certainly or possibly present edge in our graphs. We run again the queries computing the probabilities of existence of triangles in such graphs with (0.05,0.05) approximation. The probabilities that possibly present edges are in the graphs are randomly distributed in (0,0.1). The queries of the graph constructions and confidence computation can be found in Appendix~\ref{app:general-randgraph}. Figure \ref{fig:general-randgraph} shows the execution time for queries of such random graphs composed of 1000 to 30000 nodes.

\begin{figure}[htp]

\begin{center}
  \begin{tabular}{ | c | c | c | c | }
    \hline
    \#nodes & \#possible edges & \#clauses & Execution Time(Seconds) \\ \hline
	  	1000 & 3271 & 6367 &  4.04   \\ \hline
	  	2000 & 6446 & 12598 & 11.84    \\ \hline
	  	3000 & 9056 & 19836 &  21.88   \\ \hline
	  	4000 & 11366 & 22455 &  28.57   \\ \hline
	  	5000 & 13497 & 24574 &  31.38   \\ \hline
	  	6000 & 16095 & 25731 &  35.36   \\ \hline
	  	7000 & 17958 & 26070 &  35.82   \\ \hline
	  	8000 & 23113 & 39481 &  80.14   \\ \hline
	  	9000 & 26114 & 43369 &  115.45   \\ \hline
	  	10000 & 32975 & 51586 &  140.00   \\ \hline
	  	11000 & 35507 & 55562 &  157.34   \\ \hline
	  	12000 & 37623 & 57260 &  170.05   \\ \hline
	  	13000 & 40246 & 61060 &  197.67   \\ \hline
	  	14000 & 44045 & 66530 &  225.88   \\ \hline
	  	15000 & 45434 & 66966 &  230.51   \\ \hline
	  	16000 & 47814 & 69787 &  260.70   \\ \hline
	  	17000 & 50456 & 72710 &  278.48  \\ \hline
	  	18000 & 52145 & 73043 &  280.76  \\ \hline
	  	19000 & 53849 & 73437 &  288.01  \\ \hline
	  	20000 & 55584 & 73953 &  289.30  \\ \hline
	  	21000 & 57654 & 74688 &  290.37  \\ \hline
		22000 & 59274 & 74991 &  295.66   \\ \hline
		23000 & 61308 & 75954 &  296.13   \\ \hline
		24000 & 63000 & 76288 &  313.13  \\ \hline
		25000 & 65538 & 79404 &  354.95   \\ \hline
		26000 & 69741 & 89888 &  439.01   \\ \hline
		27000 & 72741 & 93016 &  479.78  \\ \hline
		28000 & 76148 & 98065 &  553.75   \\ \hline
		29000 & 79414 & 104328 &  573.24   \\ \hline
		30000 & 82714 & 107633 &  601.33   \\ \hline

  \end{tabular}
\end{center}

\caption{Execution time of confidence computation for existence of triangles in general random graphs}

\label{fig:general-randgraph}
\end{figure}

\section{Probabilistic TPC-H}

SPROUT\footnote{http://web.comlab.ox.ac.uk/projects/SPROUT/index.html} is a part
of the query engine of MayBMS and provides state-of-the-art techniques for efficient
exact confidence computation. In this section, we show how TPC-H queries can
benefit from these techniques. For each TPC-H query, we consider its largest subquery
without aggregations and inequality joins but with
conf() for specifying exact probability computation
for distinct tuples in query answers. We consider two
flavours of each of these queries: A version with original
selection attributes (again, without aggregations), and a version
where we drop keys from the selection attributes. Queries are included in the
experiments if SPROUT's techniques can be applied to them. Our data set consists
of tuple-independent probabilistic databases obtained from deterministic
databases produced by TPC-H 2.7.0 by associating each
tuple with a Boolean random variable and by choosing at random
a probability distribution over these variables. We perform experiments with TPC-H
scale factor 1 (1GB database size) and evaluate the
TPC-H-like queries mentioned above. The queries can
be found in Appendix~\ref{app:tpch}. In addition, we compare our results
with the reported time from \cite{OHK2008} in which SPROUT was only partially
integrated into PostgreSQL and storing temporary relations to the disk was
sometimes necessary. The average time shown below is obtained from ten subsequent,
identical executions with a warm cache by running the query once.

\begin{figure}[htp]

\begin{center}
  \begin{tabular}{ | c | c | c | }
    \hline
    Query &  \multicolumn{2}{|c|}{Average Time(Seconds)} \\ \cline{2-3}
           &  Current running time & Time reported in \cite{OHK2008} \\ \hline
	1 & 8.21 & 120.13 \\ \hline
	4 & 40.57 & 39.52 \\ \hline
	12 & 17.1 & 21.94 \\ \hline
	15 & 5.5 & 3.2 \\ \hline
	B1 & 5.37 & 14.92 \\ \hline
	B4 & 31.88 & 33.02 \\ \hline
	B6 & 3.82 & 6.37 \\ \hline
	B12 & 15.91 & 18.56 \\ \hline
	B14 & 4.17 & 4.86 \\ \hline
	B15 & 4.81 & 5.24 \\ \hline
	B16 & 0.87 & 3.16 \\ \hline
	B17 & 3.25 & 2.43 \\ \hline

  \end{tabular}
\end{center}
\caption{Current running times vs.\ running times reported in \cite{OHK2008}.
Boolean queries are prefixed by B. }

\end{figure}