diff --git a/src/teaching/cse-350/2024sp/slide/15-sketches.erb b/src/teaching/cse-350/2024sp/slide/15-sketches.erb
new file mode 100644
index 00000000..672ee2af
--- /dev/null
+++ b/src/teaching/cse-350/2024sp/slide/15-sketches.erb
@@ -0,0 +1,431 @@
+---
+template: templates/cse4562_2021_slides.erb
+title: Data Sketching
+date: March 25, 2021
+textbook: (readings only)
+class_name: CSE 350
+---
+
+
+
+ +

These are all "Holistic" aggregates ($O(|A|)$ memory). What happens when you run out of memory?

+
+ +
+

Sketching: Hash function tricks used to estimate useful statistical properties.

+
+ +
+
+
Flajolet-Martin Sketches (HyperLogLog)
+
Estimating Count-Distinct
+ +
Count Sketches
+
Estimating Count-GroupBy
+ +
Count-Min Sketches
+
Estimating Count-GroupBy-TopK
+
+
+
+ +
+
+

Count-Distinct

+ +
+ $3$ + $5$ + $4$ + $4$ + $2$ + $4$ + $3$ + $\ldots$ +
+ +
+
+ $3$ + $5$ + $4$ + $2$ + $\ldots$ +
+
+ +

Challenge: To avoid double counting, we need to track which values of $A$ we've seen. $O(|A|)$ memory required.
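
As a baseline, here is a minimal Ruby sketch of the exact approach, run on the example stream shown above. The function name `exact_count_distinct` is only illustrative. The `seen` set grows with the number of distinct values, which is the $O(|A|)$ memory cost the challenge describes.

```ruby
require 'set'

# Exact COUNT(DISTINCT A): remembers every distinct value seen so far,
# so memory grows linearly with the number of distinct values.
def exact_count_distinct(stream)
  seen = Set.new
  stream.each { |value| seen.add(value) }
  seen.size
end

stream = [3, 5, 4, 4, 2, 4, 3]
puts exact_count_distinct(stream)  # prints 4 (the values 3, 5, 4, 2)
```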

+
+ +
+

A brief digression

+
+ +
+

The Coin Flip Game

+ + Start with 0 points and flip a coin +
+
+
Tails (🐕)
+
Get a point and flip again.
+
+
+
Heads (👽)
+
Game over.
+
+
+
+ + + +
+ + + + + + +
| Flips | Score |
|---|---|
| (👽) | 0 |
| (🐕)(👽) | 1 |
| (🐕)(🐕)(🐕)(🐕)(🐕)(👽) | 5 |
+
+ +
+ + + + + + + + + + + + + + + + +
| Flips | Score | Probability | E[# Games] |
|---|---|---|---|
| (👽) | 0 | 0.5 | 2 |
| (🐕)(👽) | 1 | 0.25 | 4 |
| (🐕)(🐕)(👽) | 2 | 0.125 | 8 |
| (🐕)$\times N$ (👽) | $N$ | $\frac{1}{2^{N+1}}$ | $2^{N+1}$ |
+

If I told you that in a series of games, my best score was $N$, you might expect that I played $2^{N+1}$ games.

+

To do that, I only need to track my top score!
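
The arithmetic of the game can be sketched in a few lines of Ruby. The helper names (`game_score`, `probability_of_score`, `estimated_games`) and the `:T`/`:H` encoding of tails and heads are assumptions made for illustration.

```ruby
# Score of one game: the number of tails (:T) before the first heads (:H).
def game_score(flips)
  flips.take_while { |flip| flip == :T }.size
end

# Probability that a single game ends with exactly score n: (1/2)^(n+1).
def probability_of_score(n)
  Rational(1, 2**(n + 1))
end

# Rough guess at how many games were played, given only the best score.
def estimated_games(best_score)
  2**(best_score + 1)
end

games = [[:H], [:T, :H], [:T, :T, :T, :T, :T, :H]]
best = games.map { |flips| game_score(flips) }.max
puts best                   # prints 5
puts estimated_games(best)  # prints 64
```

Only `best` has to be remembered between games, which is what makes the trick cheap in memory.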

+
+ +
+

Idea: Simulate coin flips with a hash function

+ +

... take the index of the lowest-order nonzero bit

+
+ +
+ + + + + + + + +
| Object | Hash Bits | Score |
|---|---|---|
| $O_1$ | 01011011 | 0 |
| $O_2$ | 00110111 | 0 |
| $O_3$ | 00111000 | 3 |
| $O_4$ | 10010010 | 1 |
| $O_3$ | 00111000 | 3 |
| | Top score | 3 |
+ +

Estimate: $2^{3+1} = 16$

+

Duplicates can't raise the top score!
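
A small Ruby sketch of this idea, using the 8-bit hash values from the example as hard-coded constants. In practice they would come from a real hash function. The names `hash_score`, `HASHES`, and `STREAM` are illustrative, and returning `nil` for an all-zero hash is an assumption the slides do not cover.

```ruby
# Index of the lowest-order nonzero bit (0 = rightmost bit).
# Returns nil for an all-zero hash (an assumption; the slides do not cover it).
def hash_score(hash_bits)
  return nil if hash_bits.zero?
  index = 0
  index += 1 while hash_bits[index].zero?  # Integer#[] reads a single bit
  index
end

# Hash values copied from the example table.
HASHES = {
  "O_1" => 0b01011011,
  "O_2" => 0b00110111,
  "O_3" => 0b00111000,
  "O_4" => 0b10010010
}
STREAM = %w[O_1 O_2 O_3 O_4 O_3]

# Only the best score is kept; the duplicate O_3 cannot raise it.
def top_score(stream, hashes)
  stream.map { |object| hash_score(hashes[object]) }.compact.max
end

best = top_score(STREAM, HASHES)
puts best            # prints 3
puts 2**(best + 1)   # prints 16
```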

+
+ +
+

Problem: Noisy estimate!

+

Idea 1: Instead of your top score, track the lowest score you have not gotten yet ($R$).

+
+ +
+ + + + + + + + +
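| Object | Hash Bits | Score |
|---|---|---|
| $O_1$ | 01011011 | 0 |
| $O_2$ | 00110111 | 0 |
| $O_3$ | 00111000 | 3 |
| $O_4$ | 10010010 | 1 |
| $O_3$ | 00111000 | 3 |
| | Scores seen | {0, 1, 3} |
| | | $R = 2$ |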
+ +

Estimate: $\frac{2^R}{\phi} = \frac{2^{2}}{0.77351} \approx 5.2$
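
The same calculation as a short Ruby sketch. The helper names `lowest_missing` and `fm_estimate` are illustrative, and the set of scores is copied from the example rather than computed from hashes.

```ruby
require 'set'

PHI = 0.77351  # correction factor quoted on the slide

# R: the lowest score that has not been observed yet.
def lowest_missing(scores)
  r = 0
  r += 1 while scores.include?(r)
  r
end

# Count-distinct estimate 2^R / phi.
def fm_estimate(scores)
  2**lowest_missing(scores) / PHI
end

scores = Set[0, 1, 3]
puts lowest_missing(scores)        # prints 2
puts fm_estimate(scores).round(1)  # prints 5.2
```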

+
+ +
+

Idea 2: Compute several estimates in parallel and average estimates.

+
+ +
+

Flajolet-Martin Sketches

+

($\approx$ HyperLogLog)

+ +
  1. For each record...
    1. Hash each record
    2. Find the index of the lowest-order non-zero bit
    3. Add the index of the bit to a set
  2. Find $R$, the lowest index not in the set
  3. Estimate Count-Distinct as $\frac{2^R}{\phi}$ ($\phi \approx 0.77351$)
  4. Repeat (in parallel) as needed
+
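
The steps above can be sketched as a small Ruby class. This is a minimal illustration under stated assumptions, not a production HyperLogLog: the class name `FlajoletMartin`, the use of salted MD5 digests to simulate independent hash functions, the default of 16 trials, and the plain mean used to combine trials are all choices made for this example.

```ruby
require 'digest'
require 'set'

class FlajoletMartin
  PHI = 0.77351

  # One set of observed bit indexes per parallel trial.
  def initialize(num_trials = 16)
    @seen = Array.new(num_trials) { Set.new }
  end

  # Steps 1.1 to 1.3: hash the record, find the lowest nonzero bit, remember it.
  def add(record)
    @seen.each_with_index do |scores, trial|
      # Salting with the trial number stands in for independent hash functions.
      h = Digest::MD5.hexdigest("#{trial}:#{record}").to_i(16)
      next if h.zero?
      scores.add(lowest_set_bit(h))
    end
  end

  # Steps 2 to 4: find R per trial, estimate 2^R / phi, then average the trials.
  def estimate
    estimates = @seen.map do |scores|
      r = 0
      r += 1 while scores.include?(r)
      2**r / PHI
    end
    estimates.sum / estimates.size
  end

  private

  def lowest_set_bit(h)
    index = 0
    index += 1 while h[index].zero?
    index
  end
end

sketch = FlajoletMartin.new
(1..1000).each { |i| sketch.add(i) }
puts sketch.estimate  # an approximation of 1000; the exact value depends on the hashes
```

Memory use depends on the number of trials and the hash width, not on the number of records, and adding a record twice never changes the estimate.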
+
+ +
+
+

Group-By Count

+

Problem: Need a counter for each individual A

+
+ +
+

Idea: Keep only one counter!

+
+ +
+ +

No... seriously

+
+ +
+ $$\delta(O_i) = \begin{cases} -1 & \textbf{if } h(O_i) \equiv 0 \pmod 2 \\ +1 & \textbf{if } h(O_i) \equiv 1 \pmod 2 \end{cases}$$ +
+ +
+ $$\sum_i \delta(O_i)$$ +
+ +
+ + + + + + + + + + + +
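| Object | $\delta(O_i)$ | Running Count |
|---|---|---|
| $O_3$ | -1 | -1 |
| $O_1$ | +1 | 0 |
| $O_4$ | -1 | -1 |
| $O_2$ | +1 | 0 |
| $O_4$ | -1 | -1 |
| $O_1$ | +1 | 0 |
| $O_3$ | -1 | -1 |
| $O_3$ | -1 | -2 |
| $O_1$ | +1 | -1 |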
+
+ +
+ + + + + +
$Total = \texttt{COUNT_OF}(O_i) \cdot \delta(O_i) + \sum_{j \neq i}\texttt{COUNT_OF}(O_j) \cdot \delta(O_j)$
+ + + + + +
$E[\sum_{j \neq i}\texttt{COUNT_OF}(O_j) \cdot \delta(O_j)] = \frac{1}{2}\sum_{j \neq i} \texttt{COUNT_OF}(O_j) - \frac{1}{2}\sum_{j \neq i} \texttt{COUNT_OF}(O_j) = 0$
+ +

+ $$Total \approx \texttt{COUNT_OF}(O_i) \cdot \delta(O_i) + 0$$ +
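
To turn the total back into a count, multiply by $\delta(O_i)$ once more. Since $\delta(O_i) \in \{-1, +1\}$, we have $\delta(O_i)^2 = 1$. The hat notation below is only shorthand for "estimate of" and is not used elsewhere in the slides.

$$\widehat{\texttt{COUNT_OF}}(O_i) = Total \cdot \delta(O_i) \approx \texttt{COUNT_OF}(O_i) \cdot \delta(O_i)^2 = \texttt{COUNT_OF}(O_i)$$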

+
+ +
+

Running total was $-1$

+ + + + + + +
| Object | $\delta(O_i)$ | Estimate |
|---|---|---|
| $O_1$ | +1 | -1 |
| $O_2$ | +1 | -1 |
| $O_3$ | -1 | +1 |
| $O_4$ | -1 | +1 |
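
The single-counter example can be replayed in Ruby. The $\delta$ values and the stream are copied from the tables in this section. The names `DELTA`, `STREAM`, `running_total`, and `single_counter_estimate` are illustrative.

```ruby
# delta values from the example (normally derived from a hash function).
DELTA  = { "O_1" => +1, "O_2" => +1, "O_3" => -1, "O_4" => -1 }
STREAM = %w[O_3 O_1 O_4 O_2 O_4 O_1 O_3 O_3 O_1]

# The whole sketch is this one number.
def running_total(stream, delta)
  stream.sum { |object| delta[object] }
end

# Estimate for one object: total * delta(object).
def single_counter_estimate(object, stream, delta)
  running_total(stream, delta) * delta[object]
end

puts running_total(STREAM, DELTA)  # prints -1
DELTA.each_key do |object|
  estimate = single_counter_estimate(object, STREAM, DELTA)
  actual   = STREAM.count(object)
  puts "#{object}: estimate #{estimate}, actual #{actual}"
end
```

The true counts are 3, 1, 3, and 2, so every estimate is far off, which sets up the two problems listed next.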
+ +

Not... so... great

+
+ +
+

Problem 1: All of the objects use the same counter (no way to differentiate an estimate for $O_1$ from $O_2$).

+

Problem 2: The estimate is really noisy

+
+ +
+

Idea 1: Multiple Buckets ($h(x)$ picks a bucket)

+

Idea 2: Multiple Trials ($h \rightarrow h_1, h_2, \ldots$; $\delta \rightarrow \delta_1, \delta_2, \ldots$)

+
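+
+<%
+  # Seeded PRNG so that every render of the slides shows the same example.
+  prng = Random.new(2019)
+  num_trials = 2
+  num_buckets = 2
+  num_objects = 4
+  # all_fns[object][trial] = [delta (-1 or +1), bucket index]
+  all_fns = (0...num_objects).map {
+    (0...num_trials).map {
+      [ prng.rand(2)*2-1,
+        prng.rand(num_buckets)
+      ] } }
+%>
+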
+ + + <% (1..num_trials).each do |i| %> + + + <% end %> + <% all_fns.each.with_index do |o_fns, i| %> + + <% o_fns.each do |d, h| %> + + + <% end %> + <% end %> +
| Object | $h_<%=i%>(O_i)$ | $\delta_<%=i%>(O_i)$ |
| $O_<%=i+1%>$ | Bucket <%=h+1%> | <%=d%> |
+
+
+    <% m = (0...num_trials).map { [0]*num_buckets } %>
+    <% log = [] %>
+    <% [ 2, 1, 4, 1, 2, 1 ].each do |i| %>
+    <% o_fns = all_fns[i-1]; %>
+
+

Objects Seen: $<%= log.map { |l| "O_#{l}" }.join(",") %>$

+ + + + <% m.each.with_index do |buckets, trial| %> + + <% buckets.each do |cnt| %><% end %> + + <% end %> +
| | Bucket 1 | Bucket 2 |
| Trial <%=trial+1%> | <%= cnt %> |
+ + + + <% (0...num_trials).each do |i| %><% end %> + + <% (0...num_objects).each do |o| %> + + <% est = 0; (0...num_trials).each do |i| %> + <% delta, bucket = all_fns[o][i]; est += m[i][bucket] * delta %> + + <% end %> + + + + <% end %> +
ObjectTrial <%=i+1%>EstimateReal
$O_<%=o+1%>$<%= m[i][bucket] * delta %><%= est.to_f/num_trials %><%= log.select { |x| x == o+1 }.count %>
+    <%
+      # Record the object, then apply its delta to one bucket in each trial.
+      log.push(i)
+      o_fns.each.with_index do |d_h, trial|
+        d, h = d_h
+        m[trial][h] += d
+      end
+    %>
+
+ <% end %> + +
+

In practice, use Median and not Mean to combine trials
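
Putting buckets, trials, and the median together gives roughly the following Ruby sketch. It is an illustration under assumptions: the class name `CountSketch`, the keyword arguments, and the salted MD5 digests standing in for the per-trial functions $h_t$ and $\delta_t$ are not taken from the slides.

```ruby
require 'digest'

class CountSketch
  def initialize(num_trials:, num_buckets:)
    @num_trials  = num_trials
    @num_buckets = num_buckets
    @counters    = Array.new(num_trials) { Array.new(num_buckets, 0) }
  end

  # Add +amount+ occurrences: one signed update per trial.
  def add(object, amount = 1)
    @num_trials.times do |t|
      @counters[t][bucket(t, object)] += delta(t, object) * amount
    end
  end

  # Per-trial estimate is counter * delta; trials are combined with the median.
  def estimate(object)
    per_trial = (0...@num_trials).map do |t|
      @counters[t][bucket(t, object)] * delta(t, object)
    end
    median(per_trial)
  end

  private

  # Salted digest standing in for an independent hash function per trial.
  def hash_of(tag, trial, object)
    Digest::MD5.hexdigest("#{tag}:#{trial}:#{object}").to_i(16)
  end

  def bucket(trial, object)
    hash_of("h", trial, object) % @num_buckets
  end

  # Even hash -> -1, odd hash -> +1, matching the definition of delta.
  def delta(trial, object)
    hash_of("d", trial, object).even? ? -1 : +1
  end

  def median(values)
    sorted = values.sort
    mid = sorted.size / 2
    sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
  end
end

sketch = CountSketch.new(num_trials: 5, num_buckets: 64)
%w[a b a c a].each { |object| sketch.add(object) }
puts sketch.estimate("a")  # close to 3; collisions with b and c can shift it
```

The median is less sensitive than the mean to a single trial in which a heavy object collides with the one being queried.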

+
+ +
+ +
+
+

Top-K Group-By Count

+

Problem: "Heavy Hitters" overwhelm smaller counts

+
+ +
+

Idea: Give up. Drop $\delta$.

+
+ +
+

Count-Min Sketch

+
+
+    <% m = (0...num_trials).map { [0]*num_buckets } %>
+    <% counts = [ 10, 32, 1002, 500 ] %>
+    <%
+      # Count-Min update: add the raw count (delta is ignored) to one bucket per trial.
+      counts.each.with_index do |cnt, o|
+        o_fns = all_fns[o]
+        (0...num_trials).each do |i|
+          d, h = o_fns[i]
+          m[i][h] += cnt
+        end
+      end
+    %>
+
+ + + + <% (1..num_trials).each do |i| %> + + <% end %> + <% all_fns.each.with_index do |o_fns, i| %> + + + <% o_fns.each do |d, h| %> + + <% end %> + <% end %> +
| Object | Appearances | $h_<%=i%>(O_i)$ |
| $O_<%=i+1%>$ | <%= counts[i] %> | Bucket <%=h+1%> |
+ + + <% (1..num_buckets).each { |b| %><% } %> + <% m.each.with_index do |buckets, trial| %> + + <% buckets.each do |cnt| %><% end %> + + <% end %> +
| | Bucket <%=b%> |
| Trial <%=trial+1%> | <%= cnt %> |
+
+ +
+ + + <% (1..num_buckets).each { |b| %><% } %> + <% m.each.with_index do |buckets, trial| %> + + <% buckets.each do |cnt| %><% end %> + + <% end %> +
| | Bucket <%=b%> |
| Trial <%=trial+1%> | <%= cnt %> |
+ + + + + <% (1..num_trials).each do |i| %> + + <% end %> + + + <% all_fns.each.with_index do |o_fns, o| %> + + + <% (0...num_trials).each do |i| %> + + <% end %> + + + <% end %> +
ObjectAppearancesEstimate <%=i%>Min
$O_<%=o+1%>$<%= counts[o] %><%=m[i][o_fns[i][1]]%><%=(0...num_trials).map { |i| m[i][o_fns[i][1]]}.min%>
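
A minimal Ruby sketch of the Count-Min idea, using the appearance counts from this example. The class name `CountMinSketch` and the salted MD5 digests used as per-trial bucket functions are assumptions. The slide template draws its buckets from a seeded PRNG instead, so the bucket assignments here will differ.

```ruby
require 'digest'

class CountMinSketch
  def initialize(num_trials:, num_buckets:)
    @num_buckets = num_buckets
    @counters    = Array.new(num_trials) { Array.new(num_buckets, 0) }
  end

  # No delta: every update is a plain increment in one bucket per trial.
  def add(object, amount = 1)
    @counters.each_with_index do |row, trial|
      row[bucket(trial, object)] += amount
    end
  end

  # Collisions can only inflate a bucket, so the smallest one is the best guess.
  def estimate(object)
    @counters.each_with_index.map { |row, trial| row[bucket(trial, object)] }.min
  end

  private

  # Salted digest standing in for an independent hash function per trial.
  def bucket(trial, object)
    Digest::MD5.hexdigest("#{trial}:#{object}").to_i(16) % @num_buckets
  end
end

counts = { "O_1" => 10, "O_2" => 32, "O_3" => 1002, "O_4" => 500 }
sketch = CountMinSketch.new(num_trials: 2, num_buckets: 2)
counts.each { |object, count| sketch.add(object, count) }
counts.each_key { |object| puts "#{object}: #{sketch.estimate(object)}" }
```

Every estimate is at least the true count. Small objects may be badly overestimated when they share all of their buckets with a heavy hitter, while the heavy hitters themselves come out roughly right, which is what a Top-K query needs.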
+
+
\ No newline at end of file diff --git a/src/teaching/cse-350/2024sp/slide/ubodin.css b/src/teaching/cse-350/2024sp/slide/ubodin.css new file mode 100644 index 00000000..2e4e7baa --- /dev/null +++ b/src/teaching/cse-350/2024sp/slide/ubodin.css @@ -0,0 +1,409 @@ +@font-face { + font-family: 'News Cycle'; + font-style: normal; + font-weight: 400; + src: local('News Cycle'), local('NewsCycle'), url(../../../slide/reveal.js-3.7.0/fonts/9Xe8dq6pQDsPyVH2D3tMQsDdSZkkecOE1hvV7ZHvhyU.ttf) format('truetype'); +} +@font-face { + font-family: 'News Cycle'; + font-style: normal; + font-weight: 700; + src: local('News Cycle Bold'), local('NewsCycle-Bold'), url(../../../slide/reveal.js-3.7.0/fonts/G28Ny31cr5orMqEQy6ljt8BaWKZ57bY3RXgXH6dOjZ0.ttf) format('truetype'); +} +@font-face { + font-family: 'Lato'; + font-style: normal; + font-weight: 400; + src: local('Lato Regular'), local('Lato-Regular'), url(../../../slide/reveal.js-3.7.0/fonts/1EqTbJWOZQBfhZ0e3RL9uvesZW2xOQ-xsNqO47m55DA.ttf) format('truetype'); +} +@font-face { + font-family: 'Lato'; + font-style: normal; + font-weight: 700; + src: local('Lato Bold'), local('Lato-Bold'), url(../../../slide/reveal.js-3.7.0/fonts/MZ1aViPqjfvZwVD_tzjjkwLUuEpTyoUstqEm5AMlJo4.ttf) format('truetype'); +} +@font-face { + font-family: 'Lato'; + font-style: italic; + font-weight: 400; + src: local('Lato Italic'), local('Lato-Italic'), url(../../../slide/reveal.js-3.7.0/fonts/61V2bQZoWB5DkWAUJStypevvDin1pK8aKteLpeZ5c0A.ttf) format('truetype'); +} +@font-face { + font-family: 'Lato'; + font-style: italic; + font-weight: 700; + src: local('Lato Bold Italic'), local('Lato-BoldItalic'), url(../../../slide/reveal.js-3.7.0/fonts/HkF_qI1x_noxlxhrhMQYECZ2oysoEQEeKwjgmXLRnTc.ttf) format('truetype'); +} + + + +/**@import url(https://fonts.googleapis.com/css?family=News+Cycle:400,700); +@import url(https://fonts.googleapis.com/css?family=Lato:400,700,400italic,700italic); +**/ +/** + * A simple theme for reveal.js presentations, similar + * to the default 
theme. The accent color is darkblue. + * + * This theme is Copyright (C) 2012 Owen Versteeg, https://github.com/StereotypicalApps. It is MIT licensed. + * reveal.js is Copyright (C) 2011-2012 Hakim El Hattab, http://hakim.se + * + * with edits (C) 2017-2021 Oliver Kennedy. + */ +/********************************************* + * GLOBAL STYLES + *********************************************/ +body { + background: #fff; + background-color: #fff; } + +.reveal { + font-family: 'Lato', sans-serif; + font-size: 36px; + font-weight: normal; + color: #000; } + +::selection { + color: #fff; + background: rgba(0, 0, 0, 0.99); + text-shadow: none; } + +.reveal .slides > section, .reveal .slides > section > section { + line-height: 1.3; + font-weight: inherit; } + +/********************************************* + * STATIC HEADER/FOOTER + *********************************************/ + +.reveal .header { + position: absolute; + top: 0px; + left: 0px; + right: 0px; + height: 25px; + text-align: center; + padding-left: 15px; + padding-right: 15px; + padding-bottom: 10px; + padding-top: 15px; + background-color: #041a9b; + color: white; + font-size: 0.5em; + z-index: 100; +} +.reveal .footer { + position: absolute; + bottom: 0px; + left: 0px; + right: 0px; + height: 40px; + text-align: center; + padding-left: 15px; + padding-right: 15px; + padding-bottom: 10px; + padding-top: 20px; + background-color: #041a9b; + color: white; + font-size: 0.5em; + z-index: 100; +} + + +/********************************************* + * HEADERS + *********************************************/ +.reveal h1, .reveal h2, .reveal h3, .reveal h4, .reveal h5, .reveal h6 { + margin: 0 0 20px 0; + color: #000; + font-family: 'News Cycle', Impact, sans-serif; + font-weight: normal; + line-height: 1.2; + letter-spacing: normal; + text-transform: none; + text-shadow: none; + word-wrap: break-word; } + +.reveal h1 { + font-size: 3.77em; } + +.reveal h2 { + font-size: 2.11em; } + +.reveal h3 { + font-size: 
1.55em; } + +.reveal h4 { + font-size: 1em; } + +.reveal h1 { + text-shadow: none; } + +/********************************************* + * OTHER + *********************************************/ +.reveal p { + margin: 20px 0; + line-height: 1.3; } + +.reveal imagecredits { + font-size: 12pt; + position: absolute; + right: -10px; + bottom: -10px; + text-align: right; +} +.reveal citation { + font-size: 12pt; + position: absolute; + right: -10px; + bottom: -10px; + text-align: right; +} +.reveal tt { + font-family: courier; + font-weight: bold; +} + +/* Ensure certain elements are never larger than the slide itself */ +.reveal img, .reveal video, .reveal iframe { + max-width: 95%; + max-height: 95%; } + +.reveal strong, .reveal b { + font-weight: bold; } + +.reveal em { + font-style: italic; } + +.reveal ol, .reveal dl, .reveal ul { + display: inline-block; + text-align: left; + margin: 0 0 0 1em; } + +.reveal ol { + list-style-type: decimal; } + +.reveal ul { + list-style-type: disc; } + +.reveal ul > li { + margin-top: 20px; } + +.reveal ul.tight > li { + margin-top: 10px; } + +.reveal ol > li { + margin-top: 20px; } + +.reveal ol.tight > li { + margin-top: 0px; } + +.reveal ul ul { + list-style-type: square; } + +.reveal ul ul ul { + list-style-type: circle; } + +.reveal ul ul, .reveal ul ol, .reveal ol ol, .reveal ol ul { + display: block; + margin-left: 40px; } + +.reveal dt { + margin-top: 20px; + margin-bottom: 0px; + font-weight: bold; } + +.reveal dd { + margin-top: 0px; + margin-left: 40px; } + +.reveal q, .reveal blockquote { + quotes: none; } + +.reveal blockquote { + display: block; + position: relative; + width: 70%; + margin: 20px auto; + padding: 5px; + font-style: italic; + background: rgba(255, 255, 255, 0.05); + box-shadow: 0px 0px 2px rgba(0, 0, 0, 0.2); } + +.reveal blockquote p:first-child, .reveal blockquote p:last-child { + display: inline-block; } + +.reveal q { + font-style: italic; } + +.reveal pre { + display: block; + position: relative; + 
width: 90%; + margin: 20px auto; + text-align: left; + font-size: 0.55em; + font-family: monospace; + line-height: 1.2em; + word-wrap: break-word; + box-shadow: 0px 0px 6px rgba(0, 0, 0, 0.3); } + +.reveal code { + font-family: monospace; +} + +.reveal pre code { + display: block; + padding: 5px; + overflow: auto; + max-height: 400px; + word-wrap: normal; + background: #3F3F3F; + color: #DCDCDC; } + +.reveal table { + margin: auto; + border-collapse: collapse; + border-spacing: 0; } + +.reveal table th { + font-weight: bold; + border-bottom: 1px solid; } + +.reveal table th, .reveal table td { + text-align: center; + padding: 0.2em 0.5em 0.2em 0.5em;} + +.reveal table th[align="left"], .reveal table td[align="left"] { + text-align: left; } + +.reveal table th[align="right"], .reveal table td[align="right"] { + text-align: right; } + +.reveal table tr:last-child td { + border-bottom: none; } + +.reveal sup { + vertical-align: super; } + +.reveal sub { + vertical-align: sub; } + +.reveal small { + display: inline-block; + font-size: 0.6em; + line-height: 1.2em; + vertical-align: top; } + +.reveal small * { + vertical-align: top; } + +/********************************************* + * LINKS + *********************************************/ +.reveal a { + color: #00008B; + text-decoration: none; + -webkit-transition: color 0.15s ease; + -moz-transition: color 0.15s ease; + transition: color 0.15s ease; } + +.reveal a:hover { + color: #0000f1; + text-shadow: none; + border: none; } + +.reveal .roll span:after { + color: #fff; + background: #00003f; } + +/********************************************* + * IMAGES + *********************************************/ +.reveal section img { + margin: 15px 0px; + background: rgba(255, 255, 255, 0.12); +} + +.reveal section img.bordered +{ + border: 4px solid #000; + box-shadow: 0 0 10px rgba(0, 0, 0, 0.15); +} + +.reveal a img { + -webkit-transition: all 0.15s linear; + -moz-transition: all 0.15s linear; + transition: all 0.15s 
linear; } + +.reveal a:hover img { + background: rgba(255, 255, 255, 0.2); + border-color: #00008B; + box-shadow: 0 0 20px rgba(0, 0, 0, 0.55); } + +/********************************************* + * NAVIGATION CONTROLS + *********************************************/ +.reveal .controls div.navigate-left, .reveal .controls div.navigate-left.enabled { + border-right-color: #00008B; } + +.reveal .controls div.navigate-right, .reveal .controls div.navigate-right.enabled { + border-left-color: #00008B; } + +.reveal .controls div.navigate-up, .reveal .controls div.navigate-up.enabled { + border-bottom-color: #00008B; } + +.reveal .controls div.navigate-down, .reveal .controls div.navigate-down.enabled { + border-top-color: #00008B; } + +.reveal .controls div.navigate-left.enabled:hover { + border-right-color: #0000f1; } + +.reveal .controls div.navigate-right.enabled:hover { + border-left-color: #0000f1; } + +.reveal .controls div.navigate-up.enabled:hover { + border-bottom-color: #0000f1; } + +.reveal .controls div.navigate-down.enabled:hover { + border-top-color: #0000f1; } + +/********************************************* + * PROGRESS BAR + *********************************************/ +.reveal .progress { + background: rgba(0, 0, 0, 0.2); } + +.reveal .progress span { + background: #00008B; + -webkit-transition: width 800ms cubic-bezier(0.26, 0.86, 0.44, 0.985); + -moz-transition: width 800ms cubic-bezier(0.26, 0.86, 0.44, 0.985); + transition: width 800ms cubic-bezier(0.26, 0.86, 0.44, 0.985); } + +/********************************************* + * SLIDE NUMBER + *********************************************/ +.reveal .slide-number { + color: #00008B; } + +/********************************************* + * CUSTOM HIGHLIGHTS + *********************************************/ +.reveal .slides section .fragment.highlight-grey, +.reveal .slides section .fragment.highlight-current-grey { + opacity: 1; + visibility: inherit; } +.reveal .slides section 
.fragment.highlight-grey.visible { + color: lightgrey; } +.reveal .slides section .fragment.highlight-current-grey.current-fragment { + color: lightgrey; } + +/********************************************* + * CUSTOM TAGS + *********************************************/ +attribution { + width: 100%; + text-align: right; + font-size: 40%; + display: block; +} \ No newline at end of file diff --git a/templates/cse4562_2021_slides.erb b/templates/cse4562_2021_slides.erb index 198d1f34..c573bbdc 100644 --- a/templates/cse4562_2021_slides.erb +++ b/templates/cse4562_2021_slides.erb @@ -1,7 +1,7 @@ <% -class_name = "CSE-4/562 Spring 2021" +class_name = "CSE-4/562 Spring 2021" unless defined? class_name %>