diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/JoinGraph.svg b/slides/talks/2017-1-EDBT-Inference/graphics/JoinGraph.svg
new file mode 100644
index 00000000..f526f1b5
--- /dev/null
+++ b/slides/talks/2017-1-EDBT-Inference/graphics/JoinGraph.svg
@@ -0,0 +1,3 @@
+ + + Produced by OmniGraffle 6.6.2 2017-03-19 09:42:51 +0000 Canvas 1 Layer 1 φD(D) (Ψ1) τ1 φG(G,D,I) ⋈ φI(I) ⋈ (Ψ2) τ2
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/barley.png b/slides/talks/2017-1-EDBT-Inference/graphics/barley.png
new file mode 100644
index 00000000..16080a9d
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/barley.png differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/barley_avg.pdf b/slides/talks/2017-1-EDBT-Inference/graphics/barley_avg.pdf
new file mode 100644
index 00000000..c997e466
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/barley_avg.pdf differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/barley_avg.png b/slides/talks/2017-1-EDBT-Inference/graphics/barley_avg.png
new file mode 100644
index 00000000..7f18f370
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/barley_avg.png differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/child.png b/slides/talks/2017-1-EDBT-Inference/graphics/child.png
new file mode 100644
index 00000000..b602290e
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/child.png differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/child_avg.pdf b/slides/talks/2017-1-EDBT-Inference/graphics/child_avg.pdf
new file mode 100644
index 00000000..6158a4c3
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/child_avg.pdf differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/child_avg.png b/slides/talks/2017-1-EDBT-Inference/graphics/child_avg.png
new file mode 100644
index 00000000..8d751174
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/child_avg.png differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/diabetes.png b/slides/talks/2017-1-EDBT-Inference/graphics/diabetes.png
new file mode 100644
index 00000000..cce4b5fa
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/diabetes.png differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/diabetes_avg.pdf b/slides/talks/2017-1-EDBT-Inference/graphics/diabetes_avg.pdf
new file mode 100644
index 00000000..f6c1dc6d
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/diabetes_avg.pdf differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/diabetes_avg.png b/slides/talks/2017-1-EDBT-Inference/graphics/diabetes_avg.png
new file mode 100644
index 00000000..2f718da3
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/diabetes_avg.png differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/extended_student.png b/slides/talks/2017-1-EDBT-Inference/graphics/extended_student.png
new file mode 100644
index 00000000..45001ef8
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/extended_student.png differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/fixed_time.pdf b/slides/talks/2017-1-EDBT-Inference/graphics/fixed_time.pdf
new file mode 100644
index 00000000..8c3e7537
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/fixed_time.pdf differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/fixed_time.png b/slides/talks/2017-1-EDBT-Inference/graphics/fixed_time.png
new file mode 100644
index 00000000..cb7c13a7
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/fixed_time.png differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/insurance.png b/slides/talks/2017-1-EDBT-Inference/graphics/insurance.png
new file mode 100644
index 00000000..56c89936
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/insurance.png differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/insurance_avg.pdf b/slides/talks/2017-1-EDBT-Inference/graphics/insurance_avg.pdf
new file mode 100644
index 00000000..73258546
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/insurance_avg.pdf differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/insurance_avg.png b/slides/talks/2017-1-EDBT-Inference/graphics/insurance_avg.png
new file mode 100644
index 00000000..646033e0
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/insurance_avg.png differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/student_avg.pdf b/slides/talks/2017-1-EDBT-Inference/graphics/student_avg.pdf
new file mode 100644
index 00000000..c10090a4
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/student_avg.pdf differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/student_avg.png b/slides/talks/2017-1-EDBT-Inference/graphics/student_avg.png
new file mode 100644
index 00000000..b425f14e
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/student_avg.png differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/student_scaling.pdf b/slides/talks/2017-1-EDBT-Inference/graphics/student_scaling.pdf
new file mode 100644
index 00000000..f792a665
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/student_scaling.pdf differ
diff --git a/slides/talks/2017-1-EDBT-Inference/graphics/student_scaling.png b/slides/talks/2017-1-EDBT-Inference/graphics/student_scaling.png
new file mode 100644
index 00000000..877bfdba
Binary files /dev/null and b/slides/talks/2017-1-EDBT-Inference/graphics/student_scaling.png differ
diff --git a/slides/talks/2017-1-EDBT-Inference/index.html b/slides/talks/2017-1-EDBT-Inference/index.html
index a64e3d88..e7fdeb71 100644
--- a/slides/talks/2017-1-EDBT-Inference/index.html
+++ b/slides/talks/2017-1-EDBT-Inference/index.html
@@ -79,42 +79,6 @@
-
-
-

Online Aggregation (OLA)

- -

$Avg(3,6,10,9,1,3,9,7,9,4,7,9,2,1,2,4,10,8,9,7) = 6$

-

$Avg(3,6,10,9,1) = 5.8$ $\approx 6$

- -

$Sum\left(\frac{k}{N} Samples\right) \cdot \frac{N}{k} \approx Sum(*)$

- -

Sampling lets you approximate aggregate values with orders of magnitude less data.

-
- -
-

Typical OLA Challenges

-
-
Birthday Paradox
-
$Sample(R) \bowtie Sample(S)$ is likely to be empty.
-
Stratified Sampling
-
It doesn't matter how important they are to the aggregate, rare samples are still rare.
-
Replacement
-
Does the sampling algorithm converge exactly or asymptotically?
-
-
- -
-

Replacement

-
-
Sampling Without Replacement
-
... eventually converges to a precise answer.
-
Sampling With Replacement
-
... doesn't need to track what's been sampled.
-
... produces a better behaved estimate distribution.
-
-
-
-
@@ -214,11 +178,259 @@
-

Key Idea: OLA

+

Idea: Use OLA

+
+
+

Online Aggregation (OLA)

+ +

$Avg(3,6,10,9,1,3,9,7,9,4,7,9,2,1,2,4,10,8,9,7) = 6$

+

$Avg(3,6,10,9,1) = 5.8$ $\approx 6$

+ +

$Sum\left(\frac{k}{N} Samples\right) \cdot \frac{N}{k} \approx Sum(*)$

+ +

Sampling lets you approximate aggregate values with orders of magnitude less data.
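As a concrete illustration of the $\frac{N}{k}$ scale-up above, a minimal Python sketch (illustrative only, not from the talk) that estimates a sum from a $k$-of-$N$ sample:

import random

def ola_sum_estimate(values, k, seed=0):
    """Estimate Sum(*) from k of the N values, scaled back up by N/k."""
    N = len(values)
    sample = random.Random(seed).sample(values, k)   # k values, drawn without replacement
    return sum(sample) * (N / k)                     # scale the partial sum back up

values = [3, 6, 10, 9, 1, 3, 9, 7, 9, 4, 7, 9, 2, 1, 2, 4, 10, 8, 9, 7]
print(sum(values))                  # exact sum: 120, so Avg = 6
print(ola_sum_estimate(values, 5))  # approximate sum from only 5 of the 20 values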

+
+ +
+

Typical OLA Challenges

+
+
Birthday Paradox
+
$Sample(R) \bowtie Sample(S)$ is likely to be empty.
+
Stratified Sampling
+
No matter how important they are to the aggregate, rare values are still rarely sampled.
+
Replacement
+
Does the sampling algorithm converge exactly or asymptotically?
+
+
+ +
+

Replacement

+
+
Sampling Without Replacement
+
... eventually converges to a precise answer.
+
Sampling With Replacement
+
... doesn't need to track what's been sampled.
+
... produces a better behaved estimate distribution.
+
+
+ +
+

OLA over GMs

+
+
Tables are Small
+
Compute, not IO, is the bottleneck.
+
Tables are Dense
+
Birthday Paradox and Stratified Sampling irrelevant.
+
Queries have High Tree-Width
+
Intermediate tables are large.
+
+

Classical OLA techniques aren't entirely appropriate.

+
+
+ +
+
+

(Naive) OLA: Cyclic Sampling

+
+ +
+

A Few Quick Insights

+
    +
  1. Small Tables make random access to data possible.
  2. +
  3. Dense Tables mean we can sample directly from join outputs.
  4. +
  5. Cyclic PRNGs like Linear Congruential Generators can generate a pseudorandomly ordered but non-repeating sequence of the integers in $[0, N)$, for any $N$, in constant memory.
  6. +
+
+ +
+

Linear Congruential Generators

+

If you pick $a$, $b$, and $N$ correctly, then the sequence:

+

$K_i = (a \cdot K_{i-1} + b) \bmod N$

+

will produce $N$ distinct, pseudorandom integers $K_i \in [0, N)$
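The "pick $a$, $b$, and $N$ correctly" condition can be made concrete via the Hull-Dobell full-period conditions for an LCG. A minimal Python sketch, illustrative rather than the paper's implementation:

def lcg_cycle(a, b, N, seed=0):
    """Yield every integer in [0, N) exactly once, in a pseudorandom order.

    Full-period (Hull-Dobell) conditions assumed here:
      - b and N share no common factor,
      - a - 1 is divisible by every prime factor of N,
      - a - 1 is divisible by 4 whenever N is divisible by 4.
    """
    k = seed % N
    for _ in range(N):
        yield k
        k = (a * k + b) % N

# Example: N = 12 = 2^2 * 3, so a - 1 must be divisible by 2, 3, and 4.
# a = 13 and b = 7 satisfy the conditions.
order = list(lcg_cycle(a=13, b=7, N=12))
assert sorted(order) == list(range(12))   # each of 0..11 appears exactly once
print(order)                              # [0, 7, 2, 9, 4, 11, 6, 1, 8, 3, 10, 5]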

+
+ +
+

Cyclic Sampling

+

To marginalize $p(\{X_i\})$ (sketched in code after this list)...

+
    +
  1. Init an LCG with a cycle of $N = \prod_i |dom(X_i)|$
  2. +
  3. Use the LCG to sample $\{x_i\} \in \{X_i\}$
  4. +
  5. Incorporate $p(X_i = x_i)$ into the OLA estimate
  6. +
  7. Repeat from 2 until done
  8. +
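A minimal Python sketch of the procedure above (hypothetical helper names, not the paper's code): decode each LCG value into a joint assignment via its mixed-radix digits, then fold that assignment's probability into the running estimate.

def cyclic_marginal(domains, joint_prob, query_var, a, b):
    """Marginalize a discrete joint distribution by cyclic sampling.

    domains:    list of domain sizes, one per variable X_i
    joint_prob: maps a full assignment (tuple of ints) to p(x_1, ..., x_n)
    query_var:  index of the variable whose marginal we want
    a, b:       LCG parameters giving a full cycle of length N
    """
    N = 1
    for d in domains:
        N *= d                                # N = prod_i |dom(X_i)|
    marginal = [0.0] * domains[query_var]
    k = 0
    for step in range(1, N + 1):
        k = (a * k + b) % N                   # next LCG value = index of a fresh joint assignment
        idx, assignment = k, []
        for d in domains:                     # mixed-radix decode of k into (x_1, ..., x_n)
            assignment.append(idx % d)
            idx //= d
        marginal[assignment[query_var]] += joint_prob(tuple(assignment))
        # At this point, marginal scaled by N/step is the OLA estimate;
        # after the full cycle (step == N) it is exact.
    return marginal

# Toy usage: two binary variables with a uniform joint distribution.
print(cyclic_marginal([2, 2], lambda x: 0.25, query_var=0, a=5, b=3))   # [0.5, 0.5]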
+
+ +
+

Cyclic Sampling

+
+
Advantages
+
Progressively better estimates over time.
+
Converges in bounded time.
+
Disadvantages
+
Exponential time in the number of variables.
+
+
+ +
+

Accuracy

+
+
Sampling with Replacement
+
Chernoff and Hoeffding Bounds give an $\epsilon$-$\delta$ guarantee on the sum/average of a sample taken with replacement.
+
Without Replacement?
+
Serfling et al. give a variant of Hoeffding Bounds for sampling without replacement (reference forms of both are sketched below).
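For reference (standard textbook forms, assuming values bounded in $[0,1]$; the exact constants used in the paper may differ): with replacement, Hoeffding gives

$\Pr\left[\,|\bar{X}_n - \mu| \geq \epsilon\,\right] \leq 2\exp\left(-2n\epsilon^2\right)$

while for sampling without replacement from a population of size $N$, Serfling's refinement tightens this to

$\Pr\left[\,|\bar{X}_n - \mu| \geq \epsilon\,\right] \leq 2\exp\left(-\frac{2n\epsilon^2}{1 - \frac{n-1}{N}}\right)$

Setting the right-hand side to $\delta$ and solving for $n$ gives an $\epsilon$-$\delta$ stopping rule.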
+

+
+
+

Better OLA: Leaky Joins

+

Make Cyclic Sampling into a composable operator

+
+ +
+ + + + + + +
$G$   #   $\sum p_{\psi_2}$
 1    3   0.348
 2    4   0.288
 3    4   0.350

$D$   $G$   #   $\sum p_{\psi_1}$
 0     1    1   0.126
 1     1    2   0.222
 0     2    2   0.238
 1     2    2   0.050
 0     3    2   0.322
 1     3    2   0.028
+
+ +
+ + + + + + +
$G$   #   $\sum p_{\psi_2}$
 1    3   0.348
 2    4   0.288
 3    4   0.350

$D$   $G$   #   $\sum p_{\psi_1}$
 0     1    2   0.140
 1     1    2   0.222
 0     2    2   0.238
 1     2    2   0.050
 0     3    2   0.322
 1     3    2   0.028
+
+ +
+ + + + + + +
$G$   #   $\sum p_{\psi_2}$
 1    4   0.362
 2    4   0.288
 3    4   0.350

$D$   $G$   #   $\sum p_{\psi_1}$
 0     1    2   0.140
 1     1    2   0.222
 0     2    2   0.238
 1     2    2   0.050
 0     3    2   0.322
 1     3    2   0.028
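The two partial tables above are related by a simple roll-up, shown in this illustrative Python snippet (not the paper's operator code): grouping the partial $\psi_1$ table by $G$ and summing both the sample counts and the partial probabilities reproduces the $\psi_2$ table.

# (D, G) -> (sample count, partial sum of p): the psi_1 table above
psi1 = {
    (0, 1): (2, 0.140), (1, 1): (2, 0.222),
    (0, 2): (2, 0.238), (1, 2): (2, 0.050),
    (0, 3): (2, 0.322), (1, 3): (2, 0.028),
}

psi2 = {}
for (d, g), (count, p) in psi1.items():
    c, s = psi2.get(g, (0, 0.0))
    psi2[g] = (c + count, s + p)              # marginalize D out of the partial table

for g, (count, p) in sorted(psi2.items()):
    print(g, count, f"{p:.3f}")               # 1 4 0.362 / 2 4 0.288 / 3 4 0.350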
+
+ +
+

Leaky Joins

+
    +
  1. Build a normal join/aggregate graph as in variable elimination: One Cyclic Sampler for each Join+Aggregate.
  2. +
  3. Keep advancing Cyclic Samplers in parallel, resetting their output after every cycle so samples "leak" through.
  4. +
  5. When a sampler completes one full cycle over a complete input, mark it complete and stop sampling it.
  6. +
  7. Continue until the desired accuracy is reached or all tables are marked complete (see the sketch after this list).
  8. +
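A rough Python sketch of the driver loop above, written against a hypothetical sampler interface (step, cycle_finished, inputs_complete, reset_output, and current_estimate are illustrative names, not the paper's API):

def leaky_joins(samplers, target, error_bound):
    """Advance all cyclic samplers round-robin until accurate enough or all complete.

    samplers:    nodes of the join/aggregate graph, leaves first; each wraps one
                 Cyclic Sampler whose partial output table is visible to its parent.
    target:      desired accuracy (epsilon)
    error_bound: callback computing the current epsilon-delta error estimate
    """
    complete = set()
    while len(complete) < len(samplers):
        for i, s in enumerate(samplers):
            if i in complete:
                continue
            s.step()                          # draw one sample; its effect "leaks" downstream
            if s.cycle_finished():
                if s.inputs_complete():
                    complete.add(i)           # full cycle over complete input: this table is exact
                else:
                    s.reset_output()          # start a fresh cycle over the newer, fuller input
        if error_bound(samplers) <= target:
            break                             # accurate enough: stop early
    return samplers[-1].current_estimate()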
+
+ +
+

There's a bit of extra math to compute $\epsilon$-$\delta$ bounds by adapting Serfling's results. It's in the paper.

+
+
+ +
+
+

Experiments

+
+
Microbenchmarks
+
Fix time, vary domain size, measure accuracy
+
Fix domain size, vary time, measure accuracy
+
Vary domain size, measure time to completion
+
Macrobenchmarks
+
4 graphs from the bnlearn Repository
+
+
+ +
+

Microbenchmarks

+ +

Student: A common benchmark graph.

+
+ +
+

Accuracy vs Domain

+ +

VE is binary: It completes, or it doesn't.

+
+ +
+

Accuracy vs Time

+ +

CS gets early results faster, but is overtaken by LJ.

+
+ +
+

Domain vs Time to 100%

+ +

LJ is only 3-5x slower than VE.

+
+ +
+

"Child"

+
+ + +
+

LJ converges to an exact result before Gibbs produces an approximation.

+
+ +
+

"Insurance"

+
+ + +
+

On some graphs Gibbs is better, but only marginally.

+
+ +
+

More graphs in the paper.

+
+
+ +
+

Leaky Joins

+ +
+