293 lines
13 KiB
TeX
293 lines
13 KiB
TeX
|
|
|
|
\chapter{Experiments}
|
|
|
|
|
|
|
|
This section reports on experiments performed with the first MayBMS release
|
|
(beta) and a benchmark consisting of two parts,
|
|
which are described in more detail in the remainder of this chapter:
|
|
%
|
|
\begin{enumerate}
|
|
\item
|
|
Computing the probability of triangles in random graphs.
|
|
|
|
\item
|
|
A modified subset of the TPC-H queries on uncertain TPC-H datasets.
|
|
\end{enumerate}
|
|
|
|
|
|
By this benchmark, we do not attempt to simulate a representative set of
|
|
use cases: the jury is still out on what such a set of use cases might be.
|
|
Instead, we focus on a benchmark that allows us to see how the performance
|
|
of MayBMS develops across releases on the two core technical problems solved
|
|
by MayBMS: polynomial-time query evaluation for the polynomial-time fragment
|
|
of our query language and the efficient approximation of query results for
|
|
queries that do not belong to the polynomial-time fragment. (Finding triangles
|
|
in random graphs is a near-canonical example of such queries.)
|
|
|
|
We will keep monitoring the development of the state of the art and will
|
|
continue to survey applications and collect use cases; we will extend or
|
|
replace this benchmark as consensus develops regarding the most important
|
|
applications of probabilistic databases.
|
|
|
|
|
|
\medskip
|
|
|
|
Experimental setup.
|
|
All the experiments reported on in this chapter were conducted on an Athlon-X2(4600+)64bit / 1.8GB / Linux2.6.20 / gcc4.1.2 machine.
|
|
|
|
|
|
|
|
\section{Random Graphs}
|
|
\subsection{Experiments with Varying Levels of Precision}
|
|
|
|
In this experiment, we create undirected random graphs in which
|
|
the presence of each edge is independent of that of the other edges. The probability that an edge is in the graph is 0.5 and
|
|
this applies to each edge. Then we compute the probability that there exists a triangle in the graphs using approximation.
|
|
The queries can be found in Appendix~\ref{app:randgraph}.
|
|
|
|
We report wall-clock execution times
|
|
of queries run in the PostgreSQL8.3.3 psql shell with a warm
|
|
cache obtained by running a query once and then reporting
|
|
the average execution time over three subsequent, identical executions.
|
|
Figure \ref{fig:randgraph} shows the execution time of approximation with different precision parameters for random graphs composed of 5 to 33 nodes. An ($\epsilon, \delta$) approximation has the following property: let $p$ be the exact probability and $\hat{p}$ be the approximate probability, then
|
|
$\Pr\big[ |p - \hat{p}| \ge \epsilon \cdot p \big] \le \delta$.
|
|
|
|
\begin{figure}[htp]
|
|
|
|
\begin{center}
|
|
\begin{tabular}{ | c | c | c | c | c | c | }
|
|
\hline
|
|
\multirow{2}{*}{\#nodes} & \multirow{2}{*}{\#clauses} & \multicolumn{4}{|c|}{Execution Time(Seconds)} \\ \cline{3-6}
|
|
& & (.05,.05) & (.01,.01) & (.005,.005) & (.001,.001) \\ \hline
|
|
5 & 10 & 0.01 & 0.03 & 0.11 & 2.08 \\ \hline
|
|
6 & 20 & 0.01 & 0.08 & 0.26 & 5.27 \\ \hline
|
|
7 & 35 & 0.02 & 0.14 & 0.46 & 9.15 \\ \hline
|
|
8 & 56 & 0.03 & 0.22 & 0.7 & 12.49 \\ \hline
|
|
9 & 84 & 0.04 & 0.28 & 0.85 & 14.95 \\ \hline
|
|
10 & 120 & 0.08 & 0.44 & 1.13 & 16.19 \\ \hline
|
|
11 & 165 & 0.15 & 0.60 & 1.60 & 17.98 \\ \hline
|
|
12 & 220 & 0.29 & 1.24 & 2.48 & 24.31 \\ \hline
|
|
13 & 286 & 0.55 & 2.38 & 4.74 & 35.29 \\ \hline
|
|
14 & 364 & 0.98 & 4.26 & 8.38 & 51.51 \\ \hline
|
|
15 & 455 & 1.56 & 6.74 & 13.29 & 73.00 \\ \hline
|
|
16 & 560 & 2.37 & 10.26 & 19.21 & 102.97 \\ \hline
|
|
17 & 680 & 3.46 & 14.6 & 28.76 & 144.02 \\ \hline
|
|
18 & 816 & 4.92 & 20.49 & 41.1 & 206.18 \\ \hline
|
|
19 & 969 & 7.03 & 28.52 & 56.43 & 291.21 \\ \hline
|
|
20 & 1140 & 9.97 & 39.72 & 81.01 & 395.18 \\ \hline
|
|
21 & 1330 & 14.74 & 57.13 & 123.79 & 597.86 \\ \hline
|
|
22 & 1540 & 23.94 & 119.81 & 218.62 & 600+ \\ \hline
|
|
23 & 1771 & 46.21 & 204.83 & 416.42 & 600+ \\ \hline
|
|
24 & 2024 & 79.03 & 411.67 & 600+ & 600+ \\ \hline
|
|
25 & 2300 & 115.64 & 515.65 & 600+ & 600+ \\ \hline
|
|
26 & 2600 & 159.66 & 600+ & 600+ & 600+ \\ \hline
|
|
27 & 2925 & 202.98 & 600+ & 600+ & 600+ \\ \hline
|
|
28 & 3276 & 251.82 & 600+ & 600+ & 600+ \\ \hline
|
|
29 & 3654 & 312.89 & 600+ & 600+ & 600+ \\ \hline
|
|
30 & 4060 & 387.72 & 600+ & 600+ & 600+ \\ \hline
|
|
31 & 4495 & 475.78 & 600+ & 600+ & 600+ \\ \hline
|
|
32 & 4960 & 582.4 & 600+ & 600+ & 600+ \\ \hline
|
|
33 & 5456 & 600+ & 600+ & 600+ & 600+ \\ \hline
|
|
|
|
\end{tabular}
|
|
\end{center}
|
|
|
|
\caption{Comparison between execution time of approximation with different precision}
|
|
|
|
\label{fig:randgraph}
|
|
\end{figure}
|
|
|
|
\subsection{Experiments with Different Edge Probabilities}
|
|
|
|
In the previous experiments, each edge had probability 0.5. We use other values as the edge probability(all edges still have the same probability) and run the experiment again with (0.05,0.05) approximation. The SQL statements in Appendix~\ref{app:randgraph} should be modified accordingly. Let $p$ be the probability, change the following statements
|
|
\begin{verbatim}
|
|
insert into inout values (1, 0.5);
|
|
insert into inout values (0, 0.5);
|
|
\end{verbatim}
|
|
to
|
|
\begin{verbatim}
|
|
insert into inout values (1, p);
|
|
insert into inout values (0, 1 - p);
|
|
\end{verbatim}
|
|
Figure \ref{fig:edge-prob} shows the execution time for queries of random graphs composed of 25 to 101 nodes with different fixed edge probabilities.
|
|
|
|
\begin{figure}[htp]
|
|
|
|
\begin{center}
|
|
\begin{tabular}{ | c | c | c | c | c | }
|
|
\hline
|
|
\multirow{2}{*}{\#nodes} & \multirow{2}{*}{\#clauses} & \multicolumn{3}{|c|}{Execution Time(Seconds)} \\ \cline{3-5}
|
|
& & p=0.5 & p=0.1 & p=0.05 \\ \hline
|
|
25 & 2300 & 115.64 & 1.77 & 0.55 \\ \hline
|
|
%26 & 2600 & 159.66 & 2.28 & 0.72 \\ \hline
|
|
%27 & 2925 & 202.98 & 2.52 & 0.83 \\ \hline
|
|
%28 & 3276 & 251.82 & 3.19 & 1.02 \\ \hline
|
|
%29 & 3654 & 312.89 & 3.73 & 1.19 \\ \hline
|
|
30 & 4060 & 387.72 & 4.13 & 1.35 \\ \hline
|
|
31 & 4495 & 475.78 & 4.94 & 1.54 \\ \hline
|
|
32 & 4960 & 582.40 & 5.72 & 1.82 \\ \hline
|
|
33 & 5456 & 600+ & 6.87 & 2.12 \\ \hline
|
|
%34 & 5984 & 600+ & 7.48 & 2.60 \\ \hline
|
|
35 & 6545 & 600+ & 8.74 & 2.74 \\ \hline
|
|
%36 & 7140 & 600+ & 9.59 & 3.12 \\ \hline
|
|
%37 & 7770 & 600+ & 11.53 & 3.63 \\ \hline
|
|
%38 & 8436 & 600+ & 13.38 & 3.92 \\ \hline
|
|
%39 & 9139 & 600+ & 15.32 & 4.6 \\ \hline
|
|
40 & 9880 & 600+ & 18.32 & 5.06 \\ \hline
|
|
%41 & 10660 & 600+ & 20.65 & 5.76 \\ \hline
|
|
%42 & 11480 & 600+ & 23.91 & 6.51 \\ \hline
|
|
%43 & 12341 & 600+ & 28.44 & 7.7 \\ \hline
|
|
%44 & 13244 & 600+ & 32.38 & 8.48 \\ \hline
|
|
45 & 14190 & 600+ & 36.77 & 8.96 \\ \hline
|
|
%46 & 15180 & 600+ & 41.09 & 9.99 \\ \hline
|
|
%47 & 16215 & 600+ & 48.68 & 11.45 \\ \hline
|
|
%48 & 17296 & 600+ & 54.66 & 12.62 \\ \hline
|
|
%49 & 18424 & 600+ & 61.05 & 13.39 \\ \hline
|
|
50 & 19600 & 600+ & 70.79 & 15.79 \\ \hline
|
|
%51 & 20825 & 600+ & 80.19 & 16.09 \\ \hline
|
|
%52 & 22100 & 600+ & 88.32 & 17.16 \\ \hline
|
|
%53 & 23426 & 600+ & 97.99 & 19.49 \\ \hline
|
|
%54 & 24804 & 600+ & 112.07 & 21.58 \\ \hline
|
|
55 & 26235 & 600+ & 123.69 & 21.97 \\ \hline
|
|
%56 & 27720 & 600+ & 138.92 & 25.73 \\ \hline
|
|
%57 & 29260 & 600+ & 155.86 & 27.52 \\ \hline
|
|
%58 & 30856 & 600+ & 172.39 & 29.37 \\ \hline
|
|
%59 & 32509 & 600+ & 190.98 & 32.06 \\ \hline
|
|
60 & 34220 & 600+ & 214.06 & 33.94 \\ \hline
|
|
%61 & 35990 & 600+ & & 36.97 \\ \hline
|
|
%62 & 37820 & 600+ & & 38.40 \\ \hline
|
|
%63 & 39711 & 600+ & & 42.80 \\ \hline
|
|
%64 & 41664 & 600+ & & 43.89 \\ \hline
|
|
65 & 43680 & 600+ & 343.66 & 47.09 \\ \hline
|
|
%66 & 45760 & 600+ & & 51.56 \\ \hline
|
|
%67 & 47905 & 600+ & & 54.87 \\ \hline
|
|
68 & 50116 & 600+ & 451.06 & 59.87 \\ \hline
|
|
69 & 52934 & 600+ & 490.64 & 64.69 \\ \hline
|
|
70 & 54740 & 600+ & 542.61 & 68.98 \\ \hline
|
|
71 & 57155 & 600+ & 595.03 & 72.88 \\ \hline
|
|
72 & 59640 & 600+ & 600+ & 82.30 \\ \hline
|
|
75 & 67525 & 600+ & 600+ & 106.49 \\ \hline
|
|
80 & 82160 & 600+ & 600+ & 154.92 \\ \hline
|
|
85 & 98770 & 600+ & 600+ & 224.3 \\ \hline
|
|
90 & 117480 & 600+ & 600+ & 316.28 \\ \hline
|
|
95 & 138415 & 600+ & 600+ & 437.39 \\ \hline
|
|
97 & 147440 & 600+ & 600+ & 510.39 \\ \hline
|
|
98 & 152096 & 600+ & 600+ & 543.87 \\ \hline
|
|
99 & 156849 & 600+ & 600+ & 558.44 \\ \hline
|
|
100 & 161700 & 600+ & 600+ & 593.84 \\ \hline
|
|
101 & 166650 & 600+ & 600+ & 600+ \\ \hline
|
|
|
|
\end{tabular}
|
|
\end{center}
|
|
|
|
\caption{Comparison between execution time of queries of random graphs with different fixed edge probabilities}
|
|
|
|
\label{fig:edge-prob}
|
|
\end{figure}
|
|
|
|
\subsection{Experiments with General Random Graphs}
|
|
|
|
The previous experiments were conducted on undirected graphs in which every pair of nodes had a possibly present edge. However, this may not be the case in general. In many scenarios, each pair of nodes may have a certainly present, certainly absent or possibly present edge. In our following experiments, we construct such general probabilistic random graphs from data representing directed links between webpage within nd.edu domain\footnote{http://www.nd.edu/~networks/resources/www/www.dat.gz}. If a link between two pages is absent from the data, then it is also absent from our graphs. If a link is present in the data, then it is a certainly or possibly present edge in our graphs. We run again the queries computing the probabilities of existence of triangles in such graphs with (0.05,0.05) approximation. The probabilities that possibly present edges are in the graphs are randomly distributed in (0,0.1). The queries of the graph constructions and confidence computation can be found in Appendix~\ref{app:general-randgraph}. Figure \ref{fig:general-randgraph} shows the execution time for queries of such random graphs composed of 1000 to 30000 nodes.
|
|
|
|
\begin{figure}[htp]
|
|
|
|
\begin{center}
|
|
\begin{tabular}{ | c | c | c | c | }
|
|
\hline
|
|
\#nodes & \#possible edges & \#clauses & Execution Time(Seconds) \\ \hline
|
|
1000 & 3271 & 6367 & 4.04 \\ \hline
|
|
2000 & 6446 & 12598 & 11.84 \\ \hline
|
|
3000 & 9056 & 19836 & 21.88 \\ \hline
|
|
4000 & 11366 & 22455 & 28.57 \\ \hline
|
|
5000 & 13497 & 24574 & 31.38 \\ \hline
|
|
6000 & 16095 & 25731 & 35.36 \\ \hline
|
|
7000 & 17958 & 26070 & 35.82 \\ \hline
|
|
8000 & 23113 & 39481 & 80.14 \\ \hline
|
|
9000 & 26114 & 43369 & 115.45 \\ \hline
|
|
10000 & 32975 & 51586 & 140.00 \\ \hline
|
|
11000 & 35507 & 55562 & 157.34 \\ \hline
|
|
12000 & 37623 & 57260 & 170.05 \\ \hline
|
|
13000 & 40246 & 61060 & 197.67 \\ \hline
|
|
14000 & 44045 & 66530 & 225.88 \\ \hline
|
|
15000 & 45434 & 66966 & 230.51 \\ \hline
|
|
16000 & 47814 & 69787 & 260.70 \\ \hline
|
|
17000 & 50456 & 72710 & 278.48 \\ \hline
|
|
18000 & 52145 & 73043 & 280.76 \\ \hline
|
|
19000 & 53849 & 73437 & 288.01 \\ \hline
|
|
20000 & 55584 & 73953 & 289.30 \\ \hline
|
|
21000 & 57654 & 74688 & 290.37 \\ \hline
|
|
22000 & 59274 & 74991 & 295.66 \\ \hline
|
|
23000 & 61308 & 75954 & 296.13 \\ \hline
|
|
24000 & 63000 & 76288 & 313.13 \\ \hline
|
|
25000 & 65538 & 79404 & 354.95 \\ \hline
|
|
26000 & 69741 & 89888 & 439.01 \\ \hline
|
|
27000 & 72741 & 93016 & 479.78 \\ \hline
|
|
28000 & 76148 & 98065 & 553.75 \\ \hline
|
|
29000 & 79414 & 104328 & 573.24 \\ \hline
|
|
30000 & 82714 & 107633 & 601.33 \\ \hline
|
|
|
|
\end{tabular}
|
|
\end{center}
|
|
|
|
\caption{Execution time of confidence computation for existence of triangles in general random graphs}
|
|
|
|
\label{fig:general-randgraph}
|
|
\end{figure}
|
|
|
|
\section{Probabilistic TPC-H}
|
|
|
|
SPROUT\footnote{http://web.comlab.ox.ac.uk/projects/SPROUT/index.html} is a part
|
|
of the query engine of MayBMS and provides state-of-the-art techniques for efficient
|
|
exact confidence computation. In this section, we show how TPC-H queries can
|
|
benefit from these techniques. For each TPC-H query, we consider its largest subquery
|
|
without aggregations and inequality joins but with
|
|
conf() for specifying exact probability computation
|
|
for distinct tuples in query answers. We consider two
|
|
flavours of each of these queries: A version with original
|
|
selection attributes (again, without aggregations), and a version
|
|
where we drop keys from the selection attributes. Queries are included in the
|
|
experiments if SPROUT's techniques can be applied to them. Our data set consists
|
|
of tuple-independent probabilistic databases obtained from deterministic
|
|
databases produced by TPC-H 2.7.0 by associating each
|
|
tuple with a Boolean random variable and by choosing at random
|
|
a probability distribution over these variables. We perform experiments with TPC-H
|
|
scale factor 1 (1GB database size) and evaluate the
|
|
TPC-H-like queries mentioned above. The queries can
|
|
be found in Appendix~\ref{app:tpch}. In addition, we compare our results
|
|
with the reported time from \cite{OHK2008} in which SPROUT was only partially
|
|
integrated into PostgreSQL and storing temporary relations to the disk was
|
|
sometimes necessary. The average time shown below is obtained from ten subsequent,
|
|
identical executions with a warm cache by running the query once.
|
|
|
|
\begin{figure}[htp]
|
|
|
|
\begin{center}
|
|
\begin{tabular}{ | c | c | c | }
|
|
\hline
|
|
Query & \multicolumn{2}{|c|}{Average Time(Seconds)} \\ \cline{2-3}
|
|
& Current running time & Time reported in \cite{OHK2008} \\ \hline
|
|
1 & 8.21 & 120.13 \\ \hline
|
|
4 & 40.57 & 39.52 \\ \hline
|
|
12 & 17.1 & 21.94 \\ \hline
|
|
15 & 5.5 & 3.2 \\ \hline
|
|
B1 & 5.37 & 14.92 \\ \hline
|
|
B4 & 31.88 & 33.02 \\ \hline
|
|
B6 & 3.82 & 6.37 \\ \hline
|
|
B12 & 15.91 & 18.56 \\ \hline
|
|
B14 & 4.17 & 4.86 \\ \hline
|
|
B15 & 4.81 & 5.24 \\ \hline
|
|
B16 & 0.87 & 3.16 \\ \hline
|
|
B17 & 3.25 & 2.43 \\ \hline
|
|
|
|
\end{tabular}
|
|
\end{center}
|
|
\caption{Current running times vs.\ running times reported in \cite{OHK2008}.
|
|
Boolean queries are prefixed by B. }
|
|
|
|
\end{figure}
|
|
|
|
|
|
|