SELECT COUNT(DISTINCT A) FROM R
SELECT A, COUNT(*) FROM R GROUP BY A
SELECT A, COUNT(*) ... ORDER BY COUNT(*) DESC LIMIT 10
These are all "Holistic" aggregates ($O(|A|)$ memory). What happens when you run out of memory?
Sketching: Hash function tricks used to estimate useful statistical properties.
Challenge: To avoid double counting, we need to track which values of $A$ we've seen. $O(|A|)$ memory required.
A brief digression
Flips | Score |
---|---|
(👽) | 0 |
(🐕)(👽) | 1 |
(🐕)(🐕)(🐕)(🐕)(🐕)(👽) | 5 |
Flips | Score | Probability | E[# Games] |
---|---|---|---|
(👽) | 0 | 0.5 | 2 |
(🐕)(👽) | 1 | 0.25 | 4 |
(🐕)(🐕)(👽) | 2 | 0.125 | 8 |
(🐕)$\times N$ (👽) | $N$ | $\frac{1}{2^{N+1}}$ | $2^{N+1}$ |
If I told you that in a series of games, my best score was $N$, you might expect that I played $2^{N+1}$ games.
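One way to see where the $2^{N+1}$ comes from, assuming every flip is an independent fair coin (a worked sketch, not part of the original table): a score of exactly $N$ needs $N$ (🐕) flips followed by one (👽), and the expected wait for an event of probability $p$ is $1/p$.

$$P(\text{score} = N) = \left(\frac{1}{2}\right)^{N} \cdot \frac{1}{2} = \frac{1}{2^{N+1}}$$

$$E[\text{# Games until a score of } N] = \frac{1}{P(\text{score} = N)} = 2^{N+1}$$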
To do that, I only need to track my top score!
Idea: Simulate coin flips with a hash function
... take the index of the lowest-order nonzero bit of each object's hash (equivalently, the number of trailing zero bits) as its score
Object | Hash Bits | Score |
---|---|---|
$O_1$ | 01011011 | 0 |
$O_2$ | 00110111 | 0 |
$O_3$ | 00111000 | 3 |
$O_4$ | 10010010 | 1 |
$O_3$ | 00111000 | 3 |
Top score | | 3 |
Estimate: $2^{3+1} = 16$
Duplicates can't raise the top score!
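The top-score estimate can be reproduced in a few lines. This is a minimal sketch that reuses the hash bits from the table; the helper name `score` and the handling of an all-zero hash are assumptions made for illustration.

```python
def score(hash_bits: str) -> int:
    """Index of the lowest-order nonzero bit (number of trailing zero bits)."""
    value = int(hash_bits, 2)
    if value == 0:
        # Assumption: an all-zero hash scores the full bit width.
        return len(hash_bits)
    # value & -value isolates the lowest set bit.
    return (value & -value).bit_length() - 1

# Hash bits copied from the table; O3 appears twice on purpose.
stream = [
    ("O1", "01011011"),
    ("O2", "00110111"),
    ("O3", "00111000"),
    ("O4", "10010010"),
    ("O3", "00111000"),
]

top_score = max(score(bits) for _, bits in stream)
estimate = 2 ** (top_score + 1)
print(top_score, estimate)  # 3 16
```

Only `top_score` has to be kept between objects, so the memory used does not depend on how many values have been seen.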
Problem: Noisy estimate! (Here the estimate is 16, but only 4 distinct objects were seen.)
Idea 1: Instead of your top score, track the lowest score you have not gotten yet ($R$).
Object | Hash Bits | Score |
---|---|---|
$O_1$ | 01011011 | 0 |
$O_2$ | 00110111 | 0 |
$O_3$ | 00111000 | 3 |
$O_4$ | 10010010 | 1 |
$O_3$ | 00111000 | 3 |
Scores seen: {0, 1, 3} | | $R = 2$ |
Estimate: $\frac{2^R}{\phi} = \frac{2^{2}}{0.77351} \approx 5.2$
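The calculation of $R$ can be sketched with a small bitmap, one bit per score seen. The scores are the ones from the table; the names `PHI`, `bitmap`, and `r` are illustrative choices, and the constant is the one quoted in the estimate.

```python
PHI = 0.77351  # correction factor quoted in the estimate

scores = [0, 0, 3, 1, 3]  # scores of O1, O2, O3, O4, O3 from the table

# Set bit s of the bitmap whenever score s is observed.
bitmap = 0
for s in scores:
    bitmap |= 1 << s

# R is the lowest score whose bit is still unset.
r = 0
while (bitmap >> r) & 1:
    r += 1

estimate = 2 ** r / PHI
print(r, round(estimate, 1))  # 2 5.2
```

The bitmap needs one bit per possible score, so it grows with the hash width rather than with the number of distinct values.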
Idea 2: Compute several estimates in parallel (one per independent hash function) and average the estimates.
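Combining both ideas might look like the following sketch. It is not a tuned implementation: salted SHA-256 stands in for a family of independent hash functions, and `salted_hash`, `estimate_distinct`, and the default of 16 hash functions are all assumptions for illustration.

```python
import hashlib

PHI = 0.77351  # same correction factor as in the single-hash estimate

def salted_hash(obj: str, salt: int) -> int:
    """Stand-in for the k-th hash function: 32 bits of a salted SHA-256."""
    digest = hashlib.sha256(f"{salt}:{obj}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def score(value: int, width: int = 32) -> int:
    """Index of the lowest-order nonzero bit; `width` if the hash is all zeros."""
    return width if value == 0 else (value & -value).bit_length() - 1

def estimate_distinct(stream, num_hashes: int = 16) -> float:
    # One bitmap of observed scores per hash function.
    bitmaps = [0] * num_hashes
    for obj in stream:
        for k in range(num_hashes):
            bitmaps[k] |= 1 << score(salted_hash(obj, k))
    estimates = []
    for bitmap in bitmaps:
        r = 0
        while (bitmap >> r) & 1:  # lowest score not yet seen
            r += 1
        estimates.append(2 ** r / PHI)
    # Average the per-hash estimates.
    return sum(estimates) / len(estimates)

print(estimate_distinct(["O1", "O2", "O3", "O4", "O3"]))
```

Because a repeated object sets bits that are already set, adding duplicates to the stream leaves the estimate unchanged.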
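Problem: Need a counter for each distinct value of $A$
Idea: Keep only one counter!
No... seriously
Give each object $O_i$ a fixed sign $\delta(O_i) \in \{-1, +1\}$, chosen pseudo-randomly (for example by a hash function), and add $\delta(O_i)$ to the counter each time $O_i$ appears.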
Object | $\delta(O_i)$ | Running Count |
---|---|---|
$O_3$ | -1 | -1 |
$O_1$ | +1 | 0 |
$O_4$ | -1 | -1 |
$O_2$ | +1 | 0 |
$O_4$ | -1 | -1 |
$O_1$ | +1 | 0 |
$O_3$ | -1 | -1 |
$O_3$ | -1 | -2 |
$O_1$ | +1 | -1 |
$Total = \texttt{COUNT_OF}(O_i) \cdot \delta(O_i) + \sum_{j \neq i}\texttt{COUNT_OF}(O_j) \cdot \delta(O_j)$
$E[\sum_{j \neq i}\texttt{COUNT_OF}(O_j) \cdot \delta(O_j)] = \frac{1}{2}\sum_{j \neq i} \texttt{COUNT_OF}(O_j) - \frac{1}{2}\sum_{j \neq i} \texttt{COUNT_OF}(O_j) = 0$
$$Total \approx \texttt{COUNT_OF}(O_i) \cdot \delta(O_i) + 0$$
Running total was $-1$, so each estimate is $Total \cdot \delta(O_i) = -1 \cdot \delta(O_i)$
Object | $\delta(O_i)$ | Estimate |
---|---|---|
$O_1$ | +1 | -1 |
$O_2$ | +1 | -1 |
$O_3$ | -1 | +1 |
$O_4$ | -1 | +1 |
Not... so... great
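The single-counter experiment can be replayed directly. This sketch hard-codes the $\delta$ values and the object order from the tables; the variable names are illustrative.

```python
# Signs and stream order copied from the running-count table.
delta = {"O1": +1, "O2": +1, "O3": -1, "O4": -1}
stream = ["O3", "O1", "O4", "O2", "O4", "O1", "O3", "O3", "O1"]

total = 0
running = []
for obj in stream:
    total += delta[obj]  # the one and only counter
    running.append(total)

# Since delta is +1 or -1, COUNT_OF(O) is estimated as total * delta(O).
estimates = {obj: total * sign for obj, sign in delta.items()}
true_counts = {obj: stream.count(obj) for obj in delta}

print(running)      # [-1, 0, -1, 0, -1, 0, -1, -2, -1]
print(estimates)    # {'O1': -1, 'O2': -1, 'O3': 1, 'O4': 1}
print(true_counts)  # {'O1': 3, 'O2': 1, 'O3': 3, 'O4': 2}
```

Comparing `estimates` with `true_counts` shows both issues at once: objects with the same sign get the same estimate, and none of the estimates is close to the real count.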
Problem 1: All of the objects use the same counter (no way to differentiate an estimate for $O_1$ from $O_2$).
Problem 2: The estimate is really noisy
Idea 1: Multiple Buckets ($h(x)$ picks a bucket)
Idea 2: Multiple Trials ($h \rightarrow h_1, h_2, \ldots$; $\delta \rightarrow \delta_1, \delta_2, \ldots$)
Object | <% (1..num_trials).each do |i| %>$h_<%=i%>(O_i)$ | $\delta_<%=i%>(O_i)$ | <% end %>
---|---|---|
$O_<%=i+1%>$ | <% o_fns.each do |d, h| %>Bucket <%=h+1%> | <%=d%> | <% end %>
Objects Seen: $<%= log.map { |l| "O_#{l}" }.join(",") %>$
 | Bucket 1 | Bucket 2 |
---|---|---|
Trial <%=trial%> | <% buckets.each do |cnt| %><%= cnt %> | <% end %>
Object | <% (0...num_trials).each do |i| %>Trial <%=i+1%> | <% end %>Estimate | Real |
---|---|---|---|
$O_<%=o+1%>$ | <% est = 0; (0...num_trials).each do |i| %><% delta, bucket = all_fns[o][i]; est += m[i][bucket] * delta %><%= m[i][bucket] * delta %> | <% end %><%= est.to_f/num_trials %> | <%= log.select { |x| x == o+1 }.count %> |
In practice, use the median rather than the mean to combine trials
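Buckets, trials, and the median can be put together as in the following sketch. It is an illustration under assumptions rather than a reference implementation: salted SHA-256 stands in for the functions $h_k$ and $\delta_k$, and the class name `CountSketch` and its default sizes are hypothetical.

```python
import hashlib
from statistics import median

def _hash(obj: str, salt: str) -> int:
    """Stand-in hash: 64 bits of a salted SHA-256 digest."""
    digest = hashlib.sha256(f"{salt}:{obj}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

class CountSketch:
    def __init__(self, num_trials: int = 5, num_buckets: int = 64):
        self.num_trials = num_trials
        self.num_buckets = num_buckets
        # One row of signed counters per trial.
        self.counters = [[0] * num_buckets for _ in range(num_trials)]

    def _bucket(self, obj: str, trial: int) -> int:
        return _hash(obj, f"h{trial}") % self.num_buckets

    def _delta(self, obj: str, trial: int) -> int:
        return 1 if _hash(obj, f"d{trial}") & 1 else -1

    def add(self, obj: str) -> None:
        for t in range(self.num_trials):
            self.counters[t][self._bucket(obj, t)] += self._delta(obj, t)

    def estimate(self, obj: str):
        # Each trial estimates counter * delta; combine trials with the median.
        return median(
            self.counters[t][self._bucket(obj, t)] * self._delta(obj, t)
            for t in range(self.num_trials)
        )

sketch = CountSketch()
for obj in ["O3", "O1", "O4", "O2", "O4", "O1", "O3", "O3", "O1"]:
    sketch.add(obj)
print({obj: sketch.estimate(obj) for obj in ["O1", "O2", "O3", "O4"]})
```

With an odd number of trials the median is always one of the per-trial estimates, so a single trial with a bad collision cannot drag the result the way it would drag a mean.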
Problem: "Heavy Hitters" overwhelm smaller counts
Idea: Give up. Drop $\delta$. Every bucket then only overcounts, so take the minimum across trials.
Object | Appearances | <% (1..num_trials).each do |i| %>$h_<%=i%>(O_i)$ | <% end %>
---|---|---|
$O_<%=i+1%>$ | <%= counts[i] %> | <% o_fns.each do |d, h| %>Bucket <%=h+1%> | <% end %>
<% (1..num_buckets).each { |b| %> | Bucket <%=b%> | <% } %>
---|---|
Trial <%=trial%> | <% buckets.each do |cnt| %><%= cnt %> | <% end %>
<% (1..num_buckets).each { |b| %> | Bucket <%=b%> | <% } %>
---|---|
Trial <%=trial%> | <% buckets.each do |cnt| %><%= cnt %> | <% end %>
Object | Appearances | <% (1..num_trials).each do |i| %>Estimate <%=i%> | <% end %>Min |
---|---|---|---|
$O_<%=o+1%>$ | <%= counts[o] %> | <% (0...num_trials).each do |i| %><%=m[i][o_fns[i][1]]%> | <% end %><%=(0...num_trials).map { |i| m[i][o_fns[i][1]]}.min%> |
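Dropping $\delta$ and taking the minimum across trials can be sketched as follows. As before, salted SHA-256 is only a stand-in for the functions $h_k$, and the names `CountMinSketch` and `bucket_of`, the default sizes, and the sample stream are assumptions for illustration.

```python
import hashlib

def bucket_of(obj: str, trial: int, num_buckets: int) -> int:
    """Stand-in for h_trial: a salted SHA-256 digest reduced to a bucket index."""
    digest = hashlib.sha256(f"h{trial}:{obj}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

class CountMinSketch:
    def __init__(self, num_trials: int = 3, num_buckets: int = 8):
        self.num_buckets = num_buckets
        # One row of plain (unsigned) counters per trial.
        self.rows = [[0] * num_buckets for _ in range(num_trials)]

    def add(self, obj: str) -> None:
        for t, row in enumerate(self.rows):
            row[bucket_of(obj, t, self.num_buckets)] += 1

    def estimate(self, obj: str) -> int:
        # Collisions can only inflate a bucket, so the smallest one is the best guess.
        return min(
            row[bucket_of(obj, t, self.num_buckets)]
            for t, row in enumerate(self.rows)
        )

# Hypothetical input: the same object order as the single-counter example.
stream = ["O3", "O1", "O4", "O2", "O4", "O1", "O3", "O3", "O1"]
sketch = CountMinSketch()
for obj in stream:
    sketch.add(obj)
for obj in sorted(set(stream)):
    print(obj, stream.count(obj), sketch.estimate(obj))
```

Each printed estimate is at least the true number of appearances; it is exact whenever at least one trial places the object in a bucket it shares with nothing else.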