Merge branch 'master' of gram.cse.buffalo.edu:ODIn/Website

master
Oliver Kennedy 2024-05-01 23:53:20 -04:00
commit 9b1d0275ff
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
12 changed files with 1849 additions and 58 deletions

View File

@ -0,0 +1,46 @@
Open farmersmarket_2024-42231059.xlsx
Name it: usda_farmers_markets
Geotag: Lon - Y; Lat - X
Geoplot
- 1. too much data
- 2. oops, flipped
Alter geotag. See geoplot rerun
Still too much data. Have a .shp file, but Vizier doesn't support an adaptor for it. Python:
-------------------
# Extract County Shapes
import shapefile
with shapefile.Reader("cb_2018_us_county_500k.zip") as sf:
    #for field in sf.fields:
    #    print(field)
    # Get object containing an empty dataset.
    ds = vizierdb.new_dataset()
    ds.insert_column("county")
    ds.insert_column("zip")
    ds.insert_column("geometry", "geometry")
    for entry in sf.shapeRecords():
        if entry.record[0] == '36': # 36 is NYS
            row = [ entry.record[5], entry.record[4], entry.shape ]
            #print(row)
            ds.insert_row( row )
    ds.save("nys_counties")
    ds.show()
-------------------
Spatial join
-------------------
SELECT *
FROM nys_counties nys
JOIN wny_counties wny ON nys.county = wny.county
-------------------
Add below, and name: usda_farmers_markets
-------------------
JOIN usda_farmers_markets f ON ST_CONTAINS(nys.geometry, f.geometry)
-------------------
Watch updated chart

View File

@ -124,6 +124,21 @@ end
</p>
</section>
<section>
<h3>Notebooks: The Good</h3>
<ul>
<li>They're interactive</li>
<li>Less intimidating than the command line</li>
<li class="fragment highlight-blue">In principle, a record of what you did.</li>
<li>Everyone's using them</li>
</ul>
</section>
<section>
<h3>Notebook</h3>
<svg data-src="graphics/2024-04-12/NotebookExtensions.svg" height="400px"/>
</section>
<section>
<h3>High-Level Challenges</h3>
<svg data-src="graphics/2024-04-12/Dependencies.svg" height="400px"/>
@ -197,11 +212,10 @@ end
<section>
<%=
notebook() do
nbnote("$$\\{\\;x \\rightarrow \\textbf{@1}\\;\\}$$")
nbcell("y = x + 102", idx: 5)
nbnote("$$\\{\\;x \\rightarrow \\textbf{@1},\\;y \\rightarrow \\color{blue}{\\textbf{@4}}\\;\\}$$", show:1)
nbcell("y = x + 102", idx: 5)
nbnote("$\\{\\;\\;\\}$ vs $\\{\\;x \\rightarrow \\textbf{@1},\\;y \\rightarrow \\color{blue}{\\textbf{@4}}\\;\\}$", show:1)
nbcell("x = 4", idx: 3, highlight: 2)
nbnote("$$\\{\\;x \\rightarrow \\textbf{@3},\\;y \\rightarrow \\color{blue}{\\textbf{@4}}\\;\\}$$", show:3)
nbnote("$\\{\\;x \\rightarrow \\textbf{@2}\\;\\}$ vs $\\{\\;x \\rightarrow \\textbf{@3},\\;y \\rightarrow \\color{blue}{\\textbf{@4}}\\;\\}$", show:3)
nbcell("print(y)", idx: 4, highlight: 4)
end
%>
@ -211,10 +225,6 @@ end
</p>
</section>
<section>
<h3>Vizier Demo</h3>
</section>
<section>
<h3>Overview: Workflow-Style Notebooks</h3>
@ -293,7 +303,7 @@ end
nbcell("if z:\n y = x + 2", idx: 2)
end
%>
<p>If <tt>z == False</tt>:</p>
<p><b>Reads: </b> $\{\;\textbf{z}\;\}$</p>
<p><b>Writes: </b> $\{\;\;\}$</p>
@ -412,7 +422,6 @@ end
<img src="graphics/2024-04-12/LibraryStaticAnalysisDSL.png">
<attribution>"Bolt-on, Compact, and Rapid Program Slicing for Notebooks" (Shankar et al.; VLDB 2023)</attribution>
<attribution>(Similar ideas in Nodebook, etc...)</attribution>
</section>
@ -452,12 +461,6 @@ end
<p class="fragment takeaway" data-fragment-index="2">We need to be able to recover the kernel to <i>any</i> state.</p>
</section>
<section>
<h2>Why have only one kernel?</h2>
<p class="fragment">🤷</p>
</section>
<section>
<%=
notebook() do
@ -474,8 +477,17 @@ end
</section>
<section>
<h3>When is parallelism allowed?</h3>
<h3 class="fragment">When is a cell runnable?</h3>
<h2>Why have only one kernel?</h2>
<p class="fragment">🤷</p>
</section>
<section>
<h3>Parallelism</h3>
<ul>
<li class="fragment">When is parallelism allowed?</li>
<li class="fragment">When is a cell runnable?</li>
</ul>
</section>
<section>
@ -507,14 +519,14 @@ end
<dd>Active if: $\forall (x \rightarrow \textbf{@i}) \in \texttt{DynamicReads} : \texttt{InState}[x] = \textbf{@i}$</dd>
<dd>$\texttt{OutState} = \texttt{InState} + \{\;x \rightarrow \textbf{@i}\;|\;\forall (x \rightarrow \textbf{@i}) \in \texttt{DynamicWrites}\;\}$</dd>
<dt>Stale</dt>
<dd>Active if: first run or $\exists (x \rightarrow \textbf{@i}) \in \texttt{DynamicReads} : \texttt{InState}[x] \neq \textbf{@i}$</dd>
<dd>$\texttt{OutState} = \texttt{InState} + \{\;x \rightarrow \textbf{???}\;|\;\forall x \in \texttt{StaticWrites}\;\}$</dd>
<dt>Runnable</dt>
<dd>Active if: $\forall x \in \texttt{StaticReads} : \texttt{InState}[x] \neq \textbf{???}$</dd>
<dd>$\texttt{OutState} = \texttt{InState} + \{\;x \rightarrow \textbf{???}\;|\;\forall x \in \texttt{StaticWrites}\;\}$</dd>
<dt>Stale</dt>
<dd>Active if: first run or $\exists (x \rightarrow \textbf{@i}) \in \texttt{DynamicReads} : \texttt{InState}[x] \neq \textbf{@i}$</dd>
<dd>$\texttt{OutState} = \texttt{InState} + \{\;x \rightarrow \textbf{???}\;|\;\forall x \in \texttt{StaticWrites}\;\}$</dd>
<dt>Unknown</dt>
<dd>Active otherwise.</dd>
<dd>$\texttt{OutState} = \texttt{InState} + \{\;x \rightarrow \textbf{???}\;|\;\forall x \in \texttt{StaticWrites}\;\}$</dd>
@ -522,19 +534,19 @@ end
</section>
<section>
<svg data-src="graphics/2024-04-12/MultiRunnerBlockDiagram.svg" height="300px"/>
<attribution>"The Right Tool for the Job: Data-Centric Workflows in Vizier" (Kennedy et al.; IEEE DEB 2022)</attribution>
</section>
<section>
<h3>Serial</h3>
<svg data-src="graphics/2024-04-12/gantt_serial.svg" height="200px"/>
<h3>Parallel</h3>
<svg data-src="graphics/2024-04-12/gantt_parallel.svg" height="200px"/>
<div style="display: inline-block;">
<h3>Serial</h3>
<svg data-src="graphics/2024-04-12/gantt_serial.svg" width="450px"/>
</div>
<div style="display: inline-block;">
<h3>Parallel</h3>
<svg data-src="graphics/2024-04-12/gantt_parallel.svg" width="450px"/>
</div>
<attribution>"Runtime Provenance Refinement for Notebooks" (Deo et al.; TaPP 2022)</attribution>
</section>
<section>
<h3>Microkernel Notebooks</h3>
<img src="graphics/2022-06-20/MicrokernelCheckpoints.svg" height="400px">
<attribution>https://openclipart.com</attribution>
</section>
@ -565,6 +577,15 @@ end
<p class="fragment">🤷</p>
</section>
<section>
<h3>Vizier Demo</h3>
</section>
<section>
<svg data-src="graphics/2024-04-12/MultiRunnerBlockDiagram.svg" height="300px"/>
<attribution>"The Right Tool for the Job: Data-Centric Workflows in Vizier" (Kennedy et al.; IEEE DEB 2022)</attribution>
</section>
<section>
<h3>Repeatable Spreadsheet Dataframe Editing</h3>
<img src="graphics/2023-06-18/vizier-spreadsheet.png" height="300px">
@ -593,48 +614,156 @@ end
<section>
<img src="graphics/2024-04-12/14thWarrior-Cartoon-Elephant.svg" height="300px">
<p class="fragment takeaway">... but this requires migrating state.</p>
<p class="fragment takeaway">... but this requires migrating state.<span class="fragment">.. across languages.</span></p>
<attribution>https://openclipart.com</attribution>
</section>
<section>
<h3>State Management</h3>
<svg data-src="graphics/2024-04-12/Dependencies.svg" height="400px"/>
</section>
<section>
<h3>Approach 1: Pickle</h3>
<p style="font-size: 70%">Python's native serialization support.</p>
<dl style="font-size: 90%">
<div class="fragment" data-fragment-index="1">
<dt>The Good</dt>
<dd>Easy</dd>
</div>
<div class="fragment" data-fragment-index="2">
<dt>The Bad</dt>
<dd><span class="fragment highlight-grey" data-fragment-index="3">Not everything is serializable</span><span class="fragment" data-fragment-index="3" style="font-size: 50%; vertical-align: top;">†</span></dd>
<dd>Limited compatibility with ¬Python</dd>
<dd>Expensive for e.g., dataframes</dd>
</div>
</dl>
</section>
<section>
<h3>Approach 2: JSON</h3>
<p style="font-size: 70%">Standard data interchange format.</p>
<dl style="font-size: 90%">
<div class="fragment" data-fragment-index="1">
<dt>The Good</dt>
<dd>Easy</dd>
<dd>Near universal platform compatibility</dd>
</div>
<div class="fragment" data-fragment-index="2">
<dt>The Bad</dt>
<dd>Even less state is supported</dd>
<dd>Even more expensive for e.g., dataframes</dd>
<dd>Limited support for nuanced types (e.g., dates)</dd>
</div>
</dl>
</section>
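<section>
<p style="font-size: 70%">A minimal sketch of the trade-offs on the Pickle and JSON slides, using only the Python standard library. The helper <tt>can_serialize</tt> and the sample values are illustrative assumptions and are not part of Vizier.</p>

```python
import datetime
import json
import os
import pickle


def can_serialize(value, dumps) -> bool:
    """Return True if dumps(value) succeeds, False if it raises."""
    try:
        dumps(value)
        return True
    except Exception:
        return False


when = datetime.date(2024, 5, 1)   # a "nuanced" type
simple = {"x": 4, "y": [1, 2, 3]}  # plain data both formats handle

# An open file handle stands in for state that cannot be checkpointed.
with open(os.devnull) as handle:
    results = {
        "json/simple": can_serialize(simple, json.dumps),
        "pickle/simple": can_serialize(simple, pickle.dumps),
        "json/date": can_serialize(when, json.dumps),
        "pickle/date": can_serialize(when, pickle.dumps),
        "json/file": can_serialize(handle, json.dumps),
        "pickle/file": can_serialize(handle, pickle.dumps),
    }

for name, ok in results.items():
    print(name, "ok" if ok else "fails")
```

<p style="font-size: 70%">Pickle handles the date but not the file handle; JSON handles neither without a custom encoding.</p>
</section>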
<section>
<h3>Approach 3: Arrow, Shapefile, Parquet, NPY</h3>
<p style="font-size: 70%">Specialized formats for specific datatypes.</p>
<dl style="font-size: 90%">
<div class="fragment" data-fragment-index="1">
<dt>The Good</dt>
<dd>High Performance</dd>
<dd>Precise, Well Typed</dd>
</div>
<div class="fragment" data-fragment-index="2">
<dt>The Bad</dt>
<dd>Only one type of state is supported</dd>
</div>
</dl>
</section>
<section>
<h3>Vizier (Now)</h3>
<p style="font-size: 70%">Vizier-level Typing.</p>
<ul>
<li>State needs to be <b>checkpointed</b> out of the process that created it.</li>
<li>State needs to be <b>restored</b> into the cell that is about to consume it.</li>
<li class="fragment" data-fragment-index="1"><b>Simple Data:</b> JSON</li>
<li class="fragment" data-fragment-index="2"><b>Typed Data:</b> Standard JSON Encoding</li>
<li class="fragment" data-fragment-index="3"><b>Special Data:</b> <span class="fragment highlight-blue" data-fragment-index="5">'Active' Data</span></li>
<li class="fragment" data-fragment-index="4"><b>Fallback:</b> Pickle</li>
</ul>
</section>
<section>
Naive approach: Pickle
<h3>Active Data</h3>
... but pickle doesn't allow interop
... but pickle doesn't always work (e.g., for 'File' objects)
<p style="font-size: 70%">Datasets, Functions/Classes, etc...</p>
<ul style="font-size: 80%">
<li class="fragment">One concept, Many physical representations (Arrow, Parquet, CSV).
<ul>
<li class="fragment">A cell interpreter may not support a representation.</li>
<li class="fragment">Generating a standard representation can be expensive.</li>
</ul>
</li>
<li class="fragment">State (e.g., Datasets) can get big.
<ul>
<li class="fragment">An interpreter may not want/need to load the entire state.</li>
<li class="fragment">Versioning all checkpoints becomes infeasible.</li>
</ul>
</li>
</ul>
</section>
<section>
Interop: Define standards
<h3>Desiderata</h3>
- Primitive Values (int, float, date, etc...)
- Collection Types (map, list, etc...)
- Libraries
- Function [Challenge: Chained Dependencies]
- Dataframe/Series [Challenge: These are BIG]
<div style="text-align: left; font-size: 80%;">
<p>An abstraction that...</p>
<ul>
<li>... represents the concept.</li>
<li>... allows on-demand conversion between representations.</li>
<li>... allows partial in-store interactions.</li>
<li>... allows incremental changes.</li>
</ul>
</div>
<p class="fragment takeaway">Vizier's artifact store provides a thin wrapper around standards-compliant libraries (e.g., Apache Spark).</p>
</section>
<section>
<h3>"Active" Data</h3>
<svg data-src="graphics/2024-04-12/DataframeAbstraction.svg" height="400px"/>
</section>
<section>
<p>... but it's a lot of special case code.</p>
</section>
<section>
<h3>Generalizing Active Data</h3>
<p style="font-size: 60%">(future work)</p>
<ul>
<li class="fragment">What's the right abstraction?</li>
<li class="fragment">Efficient type coercion (without $N^2$)</li>
<li class="fragment">Microservice RPCs</li>
<li class="fragment">Caching Strategies</li>
</ul>
<p class="fragment takeaway">Questions?</p>
</section>
<!------------------------- Closing -------------------------->
<%#
<section>
<a href="https://vizierdb.info">
<img src="graphics/2022-06-20/vizier.svg" height="200px">
<p style="margin-top: -20px;">https://vizierdb.info</p>
</a>
<p style="font-size: 65%"><b>Mike Brachmann, Boris Glavic, Nachiket Deo</b>, Juliana Freire, Heiko Mueller, Sonia Castello, Munaf Arshad Qazi, William Spoth, Poonam Kumari, Soham Patel, and more...</p>
<p style="font-size: 65%">Mike Brachmann, Boris Glavic, Nachiket Deo, Juliana Freire, Heiko Mueller, Sonia Castello, Munaf Arshad Qazi, William Spoth, Poonam Kumari, Nicholas Brown, Soham Patel, Thomas Slowe, and more...</p>
<div style="width: 100%; text-align: right">
<span style="font-size: 40%; vertical-align: top;">Supported by:</span>
<img src="graphics/logos/nsf.png" height="50px">
<img src="graphics/logos/breadcrumb.png" height="50px">
</div>
</section>
%>

File diff suppressed because one or more lines are too long


View File

@ -0,0 +1,229 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Created with Inkscape (http://www.inkscape.org/) -->
<svg
width="174.83727mm"
height="60.998596mm"
viewBox="0 0 174.83727 60.998596"
version="1.1"
id="svg1"
sodipodi:docname="Dependencies-State.svg"
inkscape:version="1.3.2 (091e20ef0f, 2023-11-25)"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns="http://www.w3.org/2000/svg"
xmlns:svg="http://www.w3.org/2000/svg">
<sodipodi:namedview
id="namedview1"
pagecolor="#ffffff"
bordercolor="#666666"
borderopacity="1.0"
inkscape:showpageshadow="2"
inkscape:pageopacity="0.0"
inkscape:pagecheckerboard="0"
inkscape:deskcolor="#d1d1d1"
inkscape:document-units="mm"
inkscape:zoom="0.76129652"
inkscape:cx="327.07361"
inkscape:cy="177.98584"
inkscape:window-width="1120"
inkscape:window-height="651"
inkscape:window-x="26"
inkscape:window-y="23"
inkscape:window-maximized="0"
inkscape:current-layer="layer1" />
<defs
id="defs1">
<marker
style="overflow:visible"
id="Triangle3"
refX="0"
refY="0"
orient="auto-start-reverse"
inkscape:stockid="Triangle arrow"
markerWidth="1"
markerHeight="1"
viewBox="0 0 1 1"
inkscape:isstock="true"
inkscape:collect="always"
preserveAspectRatio="xMidYMid">
<path
transform="scale(0.5)"
style="fill:context-stroke;fill-rule:evenodd;stroke:context-stroke;stroke-width:1pt"
d="M 5.77,0 -2.88,5 V -5 Z"
id="path135" />
</marker>
<linearGradient
id="linearGradient4"
inkscape:collect="always">
<stop
style="stop-color:#ffffff;stop-opacity:1"
offset="0"
id="stop4" />
<stop
style="stop-color:#ffffff;stop-opacity:1"
offset="0.36767325"
id="stop6" />
<stop
style="stop-color:#b3b3b3;stop-opacity:0"
offset="1"
id="stop5" />
</linearGradient>
<linearGradient
inkscape:collect="always"
xlink:href="#linearGradient4"
id="linearGradient5"
x1="8.422595"
y1="74.255554"
x2="48.759907"
y2="74.394501"
gradientUnits="userSpaceOnUse"
gradientTransform="matrix(1,0,0,0.4234772,0,41.532591)" />
<linearGradient
inkscape:collect="always"
xlink:href="#linearGradient4"
id="linearGradient6"
gradientUnits="userSpaceOnUse"
gradientTransform="matrix(1,0,0,0.4234772,-186.98905,41.532591)"
x1="8.422595"
y1="74.255554"
x2="48.759907"
y2="74.394501" />
</defs>
<g
inkscape:label="Layer 1"
inkscape:groupmode="layer"
id="layer1"
transform="translate(-6.0758843,-41.482174)">
<rect
style="fill:#b3b3b3;stroke:#333333;stroke-width:1;stroke-linecap:round;stroke-linejoin:bevel;stroke-dasharray:none"
id="rect1"
width="42.189392"
height="26.216751"
x="74.883568"
y="60.04607" />
<rect
style="fill:#b3b3b3;stroke:#333333;stroke-width:1;stroke-linecap:round;stroke-linejoin:bevel;stroke-dasharray:none"
id="rect2"
width="42.189392"
height="26.216751"
x="21.9669"
y="60.04607" />
<rect
style="fill:#b3b3b3;stroke:#333333;stroke-width:1;stroke-linecap:round;stroke-linejoin:bevel;stroke-dasharray:none"
id="rect3"
width="42.189392"
height="26.216751"
x="127.8002"
y="60.04607" />
<rect
style="fill:url(#linearGradient5);fill-opacity:1;stroke:none;stroke-width:0.650751;stroke-linecap:round;stroke-linejoin:bevel;stroke-dasharray:none"
id="rect4"
width="44.133179"
height="27.703424"
x="6.0758843"
y="59.434734" />
<rect
style="fill:url(#linearGradient6);fill-opacity:1;stroke:none;stroke-width:0.650751;stroke-linecap:round;stroke-linejoin:bevel;stroke-dasharray:none"
id="rect6"
width="44.133179"
height="27.703424"
x="-180.91315"
y="59.434734"
transform="scale(-1,1)" />
<path
style="fill:none;stroke:#000000;stroke-width:0.7;stroke-linecap:butt;stroke-linejoin:miter;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Triangle3)"
d="m 64.696622,73.154446 h 7.104642"
id="path6"
sodipodi:nodetypes="cc" />
<path
style="fill:none;stroke:#000000;stroke-width:0.7;stroke-linecap:butt;stroke-linejoin:miter;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Triangle3)"
d="m 117.6133,73.154446 h 7.10465"
id="path7"
sodipodi:nodetypes="cc" />
<g
id="g11"
transform="translate(41.511012,64.989937)"
class="fragment">
<circle
style="opacity:1;fill:#ffffff;stroke:#000000;stroke-width:0.7;stroke-linecap:round;stroke-linejoin:bevel;stroke-dasharray:none"
id="circle10"
cx="79.934235"
cy="33.449871"
r="3.690964" />
<text
xml:space="preserve"
style="font-size:6.35px;line-height:1.25;font-family:Ubuntu;-inkscape-font-specification:Ubuntu;text-align:center;letter-spacing:0px;text-anchor:middle;stroke-width:0.264583"
x="79.906754"
y="35.650146"
id="text11"><tspan
sodipodi:role="line"
id="tspan11"
style="text-align:center;text-anchor:middle;stroke-width:0.264583"
x="79.906754"
y="35.650146">1</tspan></text>
<text
xml:space="preserve"
style="font-size:6.35px;line-height:1.25;font-family:Ubuntu;-inkscape-font-specification:Ubuntu;letter-spacing:0px;stroke-width:0.264583"
x="74.513275"
y="35.563065"
id="text12"><tspan
sodipodi:role="line"
id="tspan12"
style="text-align:end;text-anchor:end;stroke-width:0.264583"
x="74.513275"
y="35.563065">State needs to be <tspan
style="font-weight:bold"
id="tspan2">checkpointed</tspan>.</tspan></text>
<path
style="fill:none;stroke:#000000;stroke-width:0.264583px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#Triangle3)"
d="M 79.977143,29.634024 V 9.5097307"
id="path13"
sodipodi:nodetypes="cc" />
</g>
<g
id="g13"
class="fragment">
<g
id="g9"
transform="translate(-11.405656,12.073267)">
<circle
style="opacity:1;fill:#ffffff;stroke:#000000;stroke-width:0.7;stroke-linecap:round;stroke-linejoin:bevel;stroke-dasharray:none"
id="path9"
cx="79.934235"
cy="33.449871"
r="3.690964" />
<text
xml:space="preserve"
style="font-size:6.35px;line-height:1.25;font-family:Ubuntu;-inkscape-font-specification:Ubuntu;text-align:center;letter-spacing:0px;text-anchor:middle;stroke-width:0.264583"
x="79.906754"
y="35.650146"
id="text9"><tspan
sodipodi:role="line"
id="tspan9"
style="text-align:center;text-anchor:middle;stroke-width:0.264583"
x="79.906754"
y="35.650146">2</tspan></text>
</g>
<text
xml:space="preserve"
style="font-size:6.35px;line-height:1.25;font-family:Ubuntu;-inkscape-font-specification:Ubuntu;letter-spacing:0px;stroke-width:0.264583"
x="75.052376"
y="47.658699"
id="text10"><tspan
sodipodi:role="line"
id="tspan10"
style="stroke-width:0.264583"
x="75.052376"
y="47.658699">State needs to be <tspan
style="font-weight:bold"
id="tspan1">restored</tspan>.</tspan></text>
<path
style="fill:none;stroke:#000000;stroke-width:0.264583px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#Triangle3)"
d="M 27.060473,-15.336639 V 6.9043202"
id="path12"
transform="translate(41.511011,64.989938)" />
</g>
</g>
</svg>


View File

@ -0,0 +1,261 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Created with Inkscape (http://www.inkscape.org/) -->
<svg
width="183.93697mm"
height="118.02839mm"
viewBox="0 0 183.93697 118.02839"
version="1.1"
id="svg1"
inkscape:version="1.3.2 (091e20ef0f, 2023-11-25)"
sodipodi:docname="NotebookExtensions.svg"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
xmlns="http://www.w3.org/2000/svg"
xmlns:svg="http://www.w3.org/2000/svg">
<sodipodi:namedview
id="namedview1"
pagecolor="#ffffff"
bordercolor="#000000"
borderopacity="0.25"
inkscape:showpageshadow="2"
inkscape:pageopacity="0.0"
inkscape:pagecheckerboard="0"
inkscape:deskcolor="#d1d1d1"
inkscape:document-units="mm"
inkscape:zoom="0.40273357"
inkscape:cx="307.89586"
inkscape:cy="331.48465"
inkscape:window-width="1120"
inkscape:window-height="651"
inkscape:window-x="26"
inkscape:window-y="23"
inkscape:window-maximized="0"
inkscape:current-layer="g9" />
<defs
id="defs1">
<marker
style="overflow:visible"
id="Triangle"
refX="0"
refY="0"
orient="auto-start-reverse"
inkscape:stockid="Triangle arrow"
markerWidth="1"
markerHeight="1"
viewBox="0 0 1 1"
inkscape:isstock="true"
inkscape:collect="always"
preserveAspectRatio="xMidYMid">
<path
transform="scale(0.5)"
style="fill:context-stroke;fill-rule:evenodd;stroke:context-stroke;stroke-width:1pt"
d="M 5.77,0 -2.88,5 V -5 Z"
id="path135" />
</marker>
<marker
style="overflow:visible"
id="Triangle-1"
refX="0"
refY="0"
orient="auto-start-reverse"
inkscape:stockid="Triangle arrow"
markerWidth="1"
markerHeight="1"
viewBox="0 0 1 1"
inkscape:isstock="true"
inkscape:collect="always"
preserveAspectRatio="xMidYMid">
<path
transform="scale(0.5)"
style="fill:context-stroke;fill-rule:evenodd;stroke:context-stroke;stroke-width:1pt"
d="M 5.77,0 -2.88,5 V -5 Z"
id="path135-2" />
</marker>
<marker
style="overflow:visible"
id="Triangle-6"
refX="0"
refY="0"
orient="auto-start-reverse"
inkscape:stockid="Triangle arrow"
markerWidth="1"
markerHeight="1"
viewBox="0 0 1 1"
inkscape:isstock="true"
inkscape:collect="always"
preserveAspectRatio="xMidYMid">
<path
transform="scale(0.5)"
style="fill:context-stroke;fill-rule:evenodd;stroke:context-stroke;stroke-width:1pt"
d="M 5.77,0 -2.88,5 V -5 Z"
id="path135-26" />
</marker>
</defs>
<g
inkscape:label="Layer 1"
inkscape:groupmode="layer"
id="layer1"
transform="translate(-4.2232184,-32.921722)">
<rect
style="fill:#b3b3b3;stroke:#333333;stroke-width:1;stroke-linecap:round;stroke-linejoin:bevel;stroke-dasharray:none"
id="rect1"
width="42.189392"
height="26.216751"
x="46.134819"
y="41.674355" />
<rect
style="fill:#b3b3b3;stroke:#333333;stroke-width:1;stroke-linecap:round;stroke-linejoin:bevel;stroke-dasharray:none"
id="rect2"
width="42.189392"
height="26.216751"
x="46.134819"
y="78.716019" />
<path
style="fill:none;stroke:#000000;stroke-width:0.7;stroke-linecap:butt;stroke-linejoin:miter;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Triangle)"
d="m 67.229515,68.360195 v 7.104637"
id="path6"
sodipodi:nodetypes="cc" />
<g
id="g10"
class="fragment">
<rect
style="fill:#b3b3b3;stroke:#333333;stroke-width:1;stroke-linecap:round;stroke-linejoin:bevel;stroke-dasharray:none"
id="rect3"
width="42.189392"
height="26.216751"
x="46.134819"
y="115.75767" />
<g
id="g13"
transform="translate(-52.325262,113.08537)">
<g
id="g9-0"
transform="translate(4.469344,-35.551734)">
<circle
style="opacity:1;fill:#ffffff;stroke:#000000;stroke-width:0.7;stroke-linecap:round;stroke-linejoin:bevel;stroke-dasharray:none"
id="path9-9"
cx="79.934235"
cy="33.449871"
r="3.690964" />
<text
xml:space="preserve"
style="font-size:6.35px;line-height:1.25;font-family:Ubuntu;-inkscape-font-specification:Ubuntu;text-align:center;letter-spacing:0px;text-anchor:middle;stroke-width:0.264583"
x="79.906754"
y="35.650146"
id="text9"><tspan
sodipodi:role="line"
id="tspan9"
style="text-align:center;text-anchor:middle;stroke-width:0.264583"
x="79.906754"
y="35.650146">1</tspan></text>
</g>
<text
xml:space="preserve"
style="font-size:6.35px;line-height:1.25;font-family:Ubuntu;-inkscape-font-specification:Ubuntu;letter-spacing:0px;stroke-width:0.264583"
x="56.00238"
y="0.033697922"
id="text10"><tspan
sodipodi:role="line"
id="tspan10"
style="stroke-width:0.264583"
x="56.00238"
y="0.033697922">Explore</tspan></text>
<path
style="fill:none;stroke:#000000;stroke-width:0.264583px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#Triangle-1)"
d="M 87.571736,0.03230374 102.8876,11.893201"
id="path12"
sodipodi:nodetypes="cc" />
</g>
<path
style="fill:none;stroke:#000000;stroke-width:0.7;stroke-linecap:butt;stroke-linejoin:miter;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Triangle)"
d="m 67.229515,105.40187 v 7.10464"
id="path4"
sodipodi:nodetypes="cc" />
</g>
<g
id="g9"
class="fragment">
<rect
style="fill:#b3b3b3;stroke:#333333;stroke-width:1;stroke-linecap:round;stroke-linejoin:bevel;stroke-dasharray:none"
id="rect4"
width="42.189392"
height="26.216751"
x="109.6348"
y="41.674355" />
<rect
style="fill:#6c5353;stroke:#333333;stroke-width:1;stroke-linecap:round;stroke-linejoin:bevel;stroke-dasharray:none"
id="rect5"
width="42.189392"
height="26.216751"
x="109.6348"
y="78.716019" />
<rect
style="fill:#b3b3b3;stroke:#333333;stroke-width:1;stroke-linecap:round;stroke-linejoin:bevel;stroke-dasharray:none"
id="rect6"
width="42.189392"
height="26.216751"
x="109.6348"
y="115.75767" />
<path
style="fill:none;stroke:#000000;stroke-width:0.7;stroke-linecap:butt;stroke-linejoin:miter;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Triangle)"
d="m 130.72953,68.360195 v 7.104637"
id="path7"
sodipodi:nodetypes="cc" />
<path
style="fill:none;stroke:#000000;stroke-width:0.7;stroke-linecap:butt;stroke-linejoin:miter;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Triangle)"
d="m 130.72953,105.40187 v 7.10464"
id="path8"
sodipodi:nodetypes="cc" />
<rect
style="fill:none;stroke:#550000;stroke-width:3;stroke-linejoin:bevel"
id="rect8"
width="54.404568"
height="115.02839"
x="40.169895"
y="34.421722" />
<path
style="fill:none;stroke:#000000;stroke-width:0.7;stroke-linecap:butt;stroke-linejoin:miter;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Triangle)"
d="M 88.363491,91.676226 H 106.78837"
id="path9"
sodipodi:nodetypes="cc" />
<g
id="g11"
transform="translate(136.66141,95.831963)">
<circle
style="opacity:1;fill:#ffffff;stroke:#000000;stroke-width:0.7;stroke-linecap:round;stroke-linejoin:bevel;stroke-dasharray:none"
id="circle10"
cx="27.017569"
cy="17.574873"
r="3.690964" />
<text
xml:space="preserve"
style="font-size:6.35px;line-height:1.25;font-family:Ubuntu;-inkscape-font-specification:Ubuntu;text-align:center;letter-spacing:0px;text-anchor:middle;stroke-width:0.264583"
x="26.990088"
y="19.775148"
id="text11"><tspan
sodipodi:role="line"
id="tspan11"
style="text-align:center;text-anchor:middle;stroke-width:0.264583"
x="26.990088"
y="19.775148">2</tspan></text>
<text
xml:space="preserve"
style="font-size:6.35px;line-height:1.25;font-family:Ubuntu;-inkscape-font-specification:Ubuntu;letter-spacing:0px;stroke-width:0.264583"
x="33.12188"
y="19.688066"
id="text12"><tspan
sodipodi:role="line"
id="tspan12"
style="stroke-width:0.264583"
x="33.12188"
y="19.688066">Revise</tspan></text>
<path
style="fill:none;stroke:#000000;stroke-width:0.264583px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#Triangle-6)"
d="M 23.923653,15.181444 6.6687291,1.9874655"
id="path13"
sodipodi:nodetypes="cc" />
</g>
</g>
</g>
</svg>


Binary file not shown.


View File

@ -63,18 +63,24 @@ schedule:
notes: slide/07-write-optimized.pdf
- topic: Dataframe Storage
detail: We design, from first principles, a storage format for persistent, mutable dataframes (aka relational tables)
- topic: TBD
detail: A fun topic for the day before spring break
- topic: Clustered and Unclustered Indexing
detail: We extend our dataframe storage model with support for indexing and explore techniques for supporting efficient access over multiple attributes.
docs:
notes: slide/09-dataframes.pdf
- topic: Buffer Management
detail: We add support for caching to our dataframe storage model
- topic: Transactions Overview
detail: ADT support for combining multiple operations together into a single, atomic operation.
docs:
notes: slide/11-concurrency-control.pdf
- topic: Locking
detail: Efficient strategies for enforcing transaction isolation on dataframes through locks
docs:
notes: slide/11-concurrency-control.pdf
- topic: Logging
detail: How to enforce transaction durability, how to efficiently support atomic durability, and an overview of the ARIES recovery protocol
docs:
slides: https://odin.cse.buffalo.edu/teaching/cse-562/2021sp/slide/2021-04-20-Logging.html
- topic: Clustered and Unclustered Indexing
detail: We extend our dataframe storage model with support for indexing and explore techniques for supporting efficient access over multiple attributes.
- topic: Versioned/Immutable Data
detail: What are immutable data types, and how they can be used to support efficient concurrent access to data
- topic: Sketches (time-permitting)
@ -110,13 +116,13 @@ deliverables:
submit: https://autolab.cse.buffalo.edu/courses/cse410-s24/assessments/P2-B-Trees
- item: "Written 2: B+ Tree Analysis"
due: Mar 17
- item: "Project 3: LSM Tree"
- item: "Project 3: Joins"
due: Apr 9
- item: "Written 3: LSM Tree Analysis"
- item: "Written 3: Joins Analysis"
due: Apr 9
- item: "Project 4: B+ Tree with Buffer Manager"
due: Apr 21
- item: "Project 5: Concurrent B+ Tree with Buffer Manager"
- item: "Project 4: MiniDB"
due: May 5
- item: "Written 4: MiniDB Analysis"
due: May 12
dates:
- event: Midterm

Binary file not shown.

View File

@ -0,0 +1,431 @@
---
template: templates/cse4562_2021_slides.erb
title: Data Sketching
date: March 25, 2021
textbook: (readings only)
class_name: CSE 350
---
<section>
<section>
<ul>
<li><code>SELECT COUNT(DISTINCT A) FROM R</code></li>
<li><code>SELECT A, COUNT(*) FROM R GROUP BY A</code></li>
<li><code>SELECT A, COUNT(*) ... ORDER BY COUNT(*) DESC LIMIT 10</code></li>
</ul>
<p class="fragment" style="margin-top: 100px;">These are all "Holistic" aggregates ($O(|A|)$ memory). What happens when you run out of memory?</p>
</section>
<section>
<p><b>Sketching:</b> Hash function tricks used to <u>estimate</u> useful statistical properties.</p>
</section>
<section>
<dl>
<dt>Flajolet-Martin Sketches (HyperLogLog)</dt>
<dd>Estimating Count-Distinct</dd>
<dt>Count Sketches</dt>
<dd>Estimating Count-GroupBy</dd>
<dt>Count-Min Sketches</dt>
<dd>Estimating Count-GroupBy-TopK</dd>
</dl>
</section>
</section>
<section>
<section>
<h3>Count-Distinct</h3>
<div>
<span class="fragment" data-fragment-index="1">$3$</span>
<span class="fragment" data-fragment-index="2">$5$</span>
<span class="fragment" data-fragment-index="3">$4$</span>
<span class="fragment" data-fragment-index="4">$4$</span>
<span class="fragment" data-fragment-index="6">$2$</span>
<span class="fragment" data-fragment-index="6">$4$</span>
<span class="fragment" data-fragment-index="6">$3$</span>
<span class="fragment" data-fragment-index="6">$\ldots$</span>
</div>
<center>
<div style="border: 1px solid black; width: 200px; margin-top: 50px;" class="fragment" data-fragment-index="5">
<span>$3$</span>
<span>$5$</span>
<span>$4$</span>
<span class="fragment" data-fragment-index="6">$2$</span>
<span class="fragment" data-fragment-index="6">$\ldots$</span>
</div>
</center>
<p style="margin-top: 50px" class="fragment" data-fragment-index="5"><b>Challenge:</b> To avoid double counting, we need to track which values of $A$ we've seen. <span class="fragment" data-fragment-index="6">$O(|A|)$ memory required.</span></p>
</section>
<section>
<p>A brief digression</p>
</section>
<section>
<h3>The Coin Flip Game</h3>
Start with 0 points and flip a coin
<dl>
<div class="fragment">
<dt>Tails (🐕)</dt>
<dd>Get a point and flip again.</dd>
</div>
<div class="fragment">
<dt>Heads (👽)</dt>
<dd>Game over.</dd>
</div>
</dl>
</section>
<!--
[ 100d2 = 2 1 2 1 1 1 1 1 2 1 2 1 2 2 1 2 2 1 1 2 2 1 2 1 1 2 2 2 2 2 1 2 1 2 1 2 1 1 1 2 1 1 1 2 1 2 2 2 2 1 2 2 1 2 1 1 1 1 1 2 1 2 2 2 2 1 1 1 2 2 1 1 2 2 2 2 2 2 1 2 1 1 1 1 1 1 2 1 2 2 2 2 2 2 1 2 2 1 1 1 ] = 151
-->
<section>
<table>
<tr><th>Flips</th><th>Score</th></tr>
<tr><td align="left">
<span class="fragment">(👽)</span>
</td><td class="fragment">0</td></tr>
<tr><td align="left">
<span class="fragment">(🐕)</span>
<span class="fragment">(👽)</span>
</td><td class="fragment">1</td></tr>
<tr><td align="left">
<span class="fragment">(🐕)</span>
<span class="fragment">(🐕)</span>
<span class="fragment">(🐕)</span>
<span class="fragment">(🐕)</span>
<span class="fragment">(🐕)</span>
<span class="fragment">(👽)</span>
</td><td class="fragment">5</td></tr>
</table>
</section>
<section>
<table>
<tr><th>Flips</th><th>Score</th><th>Probability</th>
<th class="fragment" data-fragment-index="1">E[# Games]</th>
</tr>
<tr><td>(👽)</td><td>0</td><td>0.5</td>
<td class="fragment" data-fragment-index="1">2</td>
</tr>
<tr><td>(🐕)(👽)</td><td>1</td><td>0.25</td>
<td class="fragment" data-fragment-index="1">4</td>
</tr>
<tr><td>(🐕)(🐕)(👽)</td><td>2</td><td>0.125</td>
<td class="fragment" data-fragment-index="1">8</td>
</tr>
<tr class="fragment"><td>(🐕)$\times N$ &nbsp;&nbsp;(👽)</td><td>$N$</td><td>$\frac{1}{2^{N+1}}$</td>
<td>$2^{N+1}$</td>
</tr>
</table>
<p class="fragment" style="margin-top: 50px;">If I told you that, in a series of games, my best score was $N$, you might expect that I played about $2^{N+1}$ games.</p>
<p class="fragment" style="margin-top: 50px;">To make that estimate, I only need to track my top score!</p>
</section>
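<!--
A minimal sketch of the coin flip game, for readers who want to try it.
The names play_game, top_score, and estimate_games are illustrative
assumptions, not part of the slides. The only state carried between games
is the best score so far.

```ruby
# Play one game: score one point per tails, stop at the first heads.
# `rng` is a Random instance, passed in so that runs are reproducible.
def play_game(rng)
  score = 0
  score += 1 while rng.rand(2) == 1 # 1 = tails, 0 = heads
  score
end

# Best score over a series of games.
def top_score(num_games, rng)
  (1..num_games).map { play_game(rng) }.max
end

# Guess how many games were played, given only the top score N: 2^(N+1).
def estimate_games(top)
  2**(top + 1)
end
```

For example, estimate_games(top_score(1000, Random.new(42))) gives one
(noisy) guess at the number of games; a single guess can be far off.
-->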
<section>
<p><b>Idea:</b> Simulate coin flips with a hash function</p>
<p class="fragment" style="margin-top: 50px;">... take the index of the lowest-order nonzero bit</p>
</section>
<section>
<table>
<tr><th>Object</th><th>Hash Bits</th><th>Score</th></tr>
<tr class="fragment"><td>$O_1$</td><td>0101101<u>1</u></td><td>0</td></tr>
<tr class="fragment"><td>$O_2$</td><td>0011011<u>1</u></td><td>0</td></tr>
<tr class="fragment"><td>$O_3$</td><td>0011<u>1</u>000</td><td>3</td></tr>
<tr class="fragment"><td>$O_4$</td><td>100100<u>1</u>0</td><td>1</td></tr>
<tr class="fragment"><td>$O_3$</td><td>0011<u>1</u>000</td><td>3</td></tr>
<tr class="fragment"><td></td><td></td><td style="border-top: 1px solid black">3</td></tr>
</table>
<p class="fragment"><b>Estimate: </b> $2^{3+1} = 16$</p>
<p class="fragment">Duplicates can't raise the top score!</p>
</section>
<section>
<p><b>Problem: </b> Noisy estimate!</p>
<p class="fragment"><b>Idea 1:</b> Instead of your top score, track the lowest score you have not gotten yet ($R$).</p>
</section>
<section>
<table>
<tr><th>Object</th><th>Hash Bits</th><th>Score</th></tr>
<tr><td>$O_1$</td><td>0101101<u>1</u></td><td>0</td></tr>
<tr><td>$O_2$</td><td>0011011<u>1</u></td><td>0</td></tr>
<tr><td>$O_3$</td><td>0011<u>1</u>000</td><td>3</td></tr>
<tr><td>$O_4$</td><td>100100<u>1</u>0</td><td>1</td></tr>
<tr><td>$O_3$</td><td>0011<u>1</u>000</td><td>3</td></tr>
<tr class="fragment"><td></td><td></td><td style="border-top: 1px solid black">{0, 1, 3}<div class="fragment">$R = 2$</div></td></tr>
</table>
<p class="fragment"><b>Estimate: </b> $\frac{2^R}{\phi} = \frac{2^{2}}{0.77351} \approx 5.2$</p>
</section>
<section>
<p><b>Idea 2:</b> Compute several estimates in parallel and average them.</p>
</section>
<section>
<h3>Flajolet-Martin Sketches</h3>
<h4>($\approx$ HyperLogLog)</h4>
<ol>
<li>For each record...
<ol>
<li>Hash the record</li>
<li>Find the index of the lowest-order non-zero bit</li>
<li>Add the index of the bit to a set</li>
</ol></li>
<li>Find $R$, the lowest index <b>not</b> in the set</li>
<li>Estimate Count-Distinct as $\frac{2^R}{\phi}$ ($\phi \approx 0.77351$)</li>
<li>Repeat (in parallel) as needed</li>
</ol>
</section>
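<!--
The steps above can be sketched roughly as follows. This is an
illustration under stated assumptions, not a reference implementation:
MD5 truncated to 32 bits is an arbitrary stand-in for the hash function,
and the helper names (lowest_set_bit, hash32, estimate_from_scores,
fm_estimate, fm_average) are hypothetical.

```ruby
require 'digest'

PHI = 0.77351 # correction factor quoted on the slide

# Index of the lowest-order nonzero bit (0 for an odd number).
# Returns `width` when no bit is set.
def lowest_set_bit(value, width = 32)
  (0...width).find { |i| value[i] == 1 } || width
end

# Hash a record to a 32-bit integer; `seed` selects a hash function.
def hash32(record, seed = 0)
  Digest::MD5.hexdigest("#{seed}:#{record}")[0, 8].to_i(16)
end

# Find R, the lowest score not in the set, and return 2^R / phi.
def estimate_from_scores(scores)
  r = 0
  r += 1 while scores.include?(r)
  (2**r) / PHI
end

# One Flajolet-Martin estimate of the number of distinct records.
def fm_estimate(records, seed = 0)
  scores = records.map { |rec| lowest_set_bit(hash32(rec, seed)) }.uniq
  estimate_from_scores(scores)
end

# Average several estimates, one per seed, to reduce noise.
def fm_average(records, trials = 16)
  (0...trials).sum { |seed| fm_estimate(records, seed) } / trials
end
```

Duplicates hash to the same bits, so they never change the set of scores.
-->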
</section>
<section>
<section>
<h3>Group-By Count</h3>
<p style="margin-top: 100px;"><b>Problem: </b> Need a counter for each distinct value of $A$</p>
</section>
<section>
<p><b>Idea:</b> Keep only one counter!</p>
</section>
<section>
<img src="../../cse-562/2021fa/graphics/Clipart/facepalm.jpg"/>
<p class="fragment">No... seriously</p>
</section>
<section>
$$\delta(O_i) = \begin{cases} -1 & \textbf{if } h(O_i) \bmod 2 = 0 \\ +1 & \textbf{if } h(O_i) \bmod 2 = 1\end{cases}$$
</section>
<section>
$$\sum_i \delta(O_i)$$
</section>
<section>
<table>
<tr><th>Object</th><th>$\delta(O_i)$</th><th>Running Count</th></tr>
<tr class="fragment"><td>$O_3$</td><td>-1</td><td>-1</td></tr>
<tr class="fragment"><td>$O_1$</td><td>+1</td><td>0</td></tr>
<tr class="fragment"><td>$O_4$</td><td>-1</td><td>-1</td></tr>
<tr class="fragment"><td>$O_2$</td><td>+1</td><td>0</td></tr>
<tr class="fragment"><td>$O_4$</td><td>-1</td><td>-1</td></tr>
<tr class="fragment"><td>$O_1$</td><td>+1</td><td>0</td></tr>
<tr class="fragment"><td>$O_3$</td><td>-1</td><td>-1</td></tr>
<tr class="fragment"><td>$O_3$</td><td>-1</td><td>-2</td></tr>
<tr class="fragment"><td>$O_1$</td><td>+1</td><td>-1</td></tr>
</table>
</section>
<section>
<table>
<tr><td align="left">$Total =$</td></tr>
<tr class="fragment"><td align="center">$\texttt{COUNT_OF}(O_i) \cdot \delta(O_i)$</td></tr>
<tr class="fragment"><td align="right">$+ \sum_{j \neq i}\texttt{COUNT_OF}(O_j) \cdot \delta(O_j)$</td></tr>
</table>
<table style="margin-top: 60px">
<tr class="fragment"><td align="left">$E[\sum_{j \neq i}\texttt{COUNT_OF}(O_j) \cdot \delta(O_j)] =$</td></tr>
<tr class="fragment"><td align="center">$\frac{1}{2}\sum_{j \neq i} \texttt{COUNT_OF}(O_j)$</td></tr>
<tr class="fragment"><td align="right">$ - \frac{1}{2}\sum_{j \neq i} \texttt{COUNT_OF}(O_j)$</td></tr>
</table>
<p style="margin-top: 60px;" class="fragment">
$$Total \approx \texttt{COUNT_OF}(O_i) \cdot \delta(O_i) + 0$$
</p>
</section>
<section>
<p>Running total was $-1$</p>
<table>
<tr><th>Object</th><th>$\delta(O_i)$</th><th>Estimate</th></tr>
<tr class="fragment"><td>$O_1$</td><td>+1</td><td>-1</td></tr>
<tr class="fragment"><td>$O_2$</td><td>+1</td><td>-1</td></tr>
<tr class="fragment"><td>$O_3$</td><td>-1</td><td>+1</td></tr>
<tr class="fragment"><td>$O_4$</td><td>-1</td><td>+1</td></tr>
</table>
<p class="fragment">Not... so... great</p>
</section>
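<!--
The single-counter example from the last few slides, replayed as a small
sketch. DELTA copies the signs shown in the table rather than computing
them from a hash function, and the names single_counter and estimate are
illustrative assumptions.

```ruby
# Sign assigned to each object, copied from the slide's table
# (a real sketch would derive this from a hash function).
DELTA = { 1 => +1, 2 => +1, 3 => -1, 4 => -1 }.freeze

# The stream of objects from the running-count slide.
STREAM = [3, 1, 4, 2, 4, 1, 3, 3, 1].freeze

# A single counter: add delta(O_i) for every object seen.
def single_counter(stream)
  stream.sum { |o| DELTA.fetch(o) }
end

# Estimate for one object: the counter times that object's sign.
def estimate(counter, object)
  counter * DELTA.fetch(object)
end
```

With this stream the counter ends at -1, so the estimates are -1 for
O_1 and O_2 and +1 for O_3 and O_4, while STREAM.tally shows the true
counts are 3, 1, 3, and 2.
-->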
<section>
<p><b>Problem 1:</b> All of the objects use the same counter (no way to differentiate an estimate for $O_1$ from $O_2$).</p>
<p><b>Problem 2:</b> The estimate is <b>really</b> noisy</p>
</section>
<section>
<p><b>Idea 1:</b> Multiple Buckets ($h(x)$ picks a bucket)</p>
<p><b>Idea 2:</b> Multiple Trials ($h \rightarrow h_1, h_2, \ldots$; $\delta \rightarrow \delta_1, \delta_2, \ldots$)</p>
</section>
<%
prng = Random.new(2019)
num_trials = 2
num_buckets = 2
num_objects = 4
all_fns = (0...num_objects).map {
(0...num_trials).map {
[ prng.rand(2)*2-1,
prng.rand(num_buckets)
] } }
%>
<section>
<table>
<tr><th>Object</th>
<% (1..num_trials).each do |i| %>
<th>$h_<%=i%>(O_i)$</th>
<th>$\delta_<%=i%>(O_i)$</th>
<% end %></tr>
<% all_fns.each.with_index do |o_fns, i| %>
<tr><td>$O_<%=i+1%>$</td>
<% o_fns.each do |d, h| %>
<td>Bucket <%=h+1%></td>
<td><%=d%></td>
<% end %></tr>
<% end %>
</table>
</section>
<% m = (0...num_trials).map { [0]*num_buckets } %>
<% log = [] %>
<% [ 2, 1, 4, 1, 2, 1 ].each do |i| %>
<% o_fns = all_fns[i-1]; %>
<section>
<p>Objects Seen: <%= log.empty? ? "(none yet)" : "$" + log.map { |l| "O_#{l}" }.join(",") + "$" %></p>
<table>
<tr><th></th><% (1..num_buckets).each { |b| %><th>Bucket <%=b%></th><% } %></tr>
<% m.each.with_index do |buckets, trial| %>
<tr><td style="font-weight: bold;">Trial <%=trial+1%></td>
<% buckets.each do |cnt| %><td><%= cnt %></td><% end %>
</tr>
<% end %>
</table>
<table>
<tr><th>Object</th>
<% (0...num_trials).each do |i| %><th>Trial <%=i+1%></th><% end %>
<th>Estimate</th><th>Real</th></tr>
<% (0...num_objects).each do |o| %>
<tr><td>$O_<%=o+1%>$</td>
<% est = 0; (0...num_trials).each do |i| %>
<% delta, bucket = all_fns[o][i]; est += m[i][bucket] * delta %>
<td><%= m[i][bucket] * delta %></td>
<% end %>
<td><%= est.to_f/num_trials %></td>
<td><%= log.select { |x| x == o+1 }.count %></td>
</tr>
<% end %>
</table>
<%
log.push(i)
o_fns.each.with_index do |d_h, trial|
d, h = d_h
m[trial][h] += d
end
%>
</section>
<% end %>
<section>
<p>In practice, use the <i>Median</i> and not the <i>Mean</i> to combine trials</p>
</section>
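<!--
Putting buckets, trials, and a median together gives something like the
sketch below. It is a hedged illustration only: the class name CountSketch
and the use of one MD5 digest per (trial, object) pair to derive both the
sign and the bucket are assumptions, not the construction used to generate
the tables on the slides.

```ruby
require 'digest'

# A small Count Sketch: `trials` rows of `buckets` signed counters.
class CountSketch
  def initialize(trials, buckets)
    @trials = trials
    @buckets = buckets
    @rows = Array.new(trials) { Array.new(buckets, 0) }
  end

  # Record one appearance of `object` in every trial.
  def add(object)
    @trials.times do |t|
      sign, bucket = fns(object, t)
      @rows[t][bucket] += sign
    end
  end

  # One estimate per trial, combined with the median.
  def estimate(object)
    ests = (0...@trials).map { |t| trial_estimate(object, t) }.sort
    mid = ests.size / 2
    ests.size.odd? ? ests[mid] : (ests[mid - 1] + ests[mid]) / 2.0
  end

  private

  def trial_estimate(object, trial)
    sign, bucket = fns(object, trial)
    @rows[trial][bucket] * sign
  end

  # Derive (delta_t, h_t) for an object from a single digest.
  def fns(object, trial)
    bits = Digest::MD5.hexdigest("#{trial}:#{object}").to_i(16)
    [bits.odd? ? 1 : -1, (bits >> 1) % @buckets]
  end
end
```

The median is less sensitive than the mean to a single trial in which the
object happens to share a bucket with a much more frequent one.
-->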
</section>
<section>
<section>
<h3>Top-K Group-By Count</h3>
<p style="margin-top: 100px;"><b>Problem:</b> "Heavy Hitters" overwhelm smaller counts</p>
</section>
<section>
<p><b>Idea: </b> Give up. Drop $\delta$.</p>
</section>
<section>
<h3>Count-Min Sketch</h3>
</section>
<% m = (0...num_trials).map { [0]*num_buckets } %>
<% counts = [ 10, 32, 1002, 500 ] %>
<%
counts.each.with_index do |cnt, o|
o_fns = all_fns[o]
(0...num_trials).each do |i|
_, h = o_fns[i]
m[i][h] += cnt
end
end
%>
<section>
<table>
<tr><th>Object</th>
<th>Appearances</th>
<% (1..num_trials).each do |i| %>
<th>$h_<%=i%>(O_i)$</th>
<% end %></tr>
<% all_fns.each.with_index do |o_fns, i| %>
<tr><td>$O_<%=i+1%>$</td>
<td><%= counts[i] %></td>
<% o_fns.each do |d, h| %>
<td>Bucket <%=h+1%></td>
<% end %></tr>
<% end %>
</table>
<table>
<tr><th></th><% (1..num_buckets).each { |b| %><th>Bucket <%=b%></th><% } %></tr>
<% m.each.with_index do |buckets, trial| %>
<tr><td style="font-weight: bold;">Trial <%=trial+1%></td>
<% buckets.each do |cnt| %><td><%= cnt %></td><% end %>
</tr>
<% end %>
</table>
</section>
<section>
<table>
<tr><th></th><% (1..num_buckets).each { |b| %><th>Bucket <%=b%></th><% } %></tr>
<% m.each.with_index do |buckets, trial| %>
<tr><td style="font-weight: bold;">Trial <%=trial+1%></td>
<% buckets.each do |cnt| %><td><%= cnt %></td><% end %>
</tr>
<% end %>
</table>
<table>
<tr><th>Object</th>
<th>Appearances</th>
<% (1..num_trials).each do |i| %>
<th>Estimate <%=i%></th>
<% end %>
<th>Min</th>
</tr>
<% all_fns.each.with_index do |o_fns, o| %>
<tr><td>$O_<%=o+1%>$</td>
<td><%= counts[o] %></td>
<% (0...num_trials).each do |i| %>
<td><%=m[i][o_fns[i][1]]%></td>
<% end %>
<td><%=(0...num_trials).map { |i| m[i][o_fns[i][1]]}.min%></td>
</tr>
<% end %>
</table>
</section>
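<!--
A minimal Count-Min Sketch along the lines of the tables above, assuming
an MD5-based bucket function (an arbitrary choice) and a hypothetical
class name CountMinSketch. Without the sign delta, collisions can only add
to a counter, so every trial overestimates and the minimum is the best
available guess.

```ruby
require 'digest'

# `trials` rows of `buckets` unsigned counters.
class CountMinSketch
  def initialize(trials, buckets)
    @buckets = buckets
    @rows = Array.new(trials) { Array.new(buckets, 0) }
  end

  # Record `count` appearances of `object` in every trial.
  def add(object, count = 1)
    @rows.each_with_index { |row, t| row[bucket(object, t)] += count }
  end

  # Each trial gives an upper bound; take the smallest.
  def estimate(object)
    @rows.each_with_index.map { |row, t| row[bucket(object, t)] }.min
  end

  private

  # h_t(object): pick a bucket for this object in trial t.
  def bucket(object, trial)
    Digest::MD5.hexdigest("#{trial}:#{object}").to_i(16) % @buckets
  end
end
```

Loading the slide's counts (10, 32, 1002, 500) into a sketch with 2 trials
and 2 buckets gives estimates that are never below the true count and
never above the total of 1544.
-->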
</section>

@ -0,0 +1,409 @@
@font-face {
font-family: 'News Cycle';
font-style: normal;
font-weight: 400;
src: local('News Cycle'), local('NewsCycle'), url(../../../slide/reveal.js-3.7.0/fonts/9Xe8dq6pQDsPyVH2D3tMQsDdSZkkecOE1hvV7ZHvhyU.ttf) format('truetype');
}
@font-face {
font-family: 'News Cycle';
font-style: normal;
font-weight: 700;
src: local('News Cycle Bold'), local('NewsCycle-Bold'), url(../../../slide/reveal.js-3.7.0/fonts/G28Ny31cr5orMqEQy6ljt8BaWKZ57bY3RXgXH6dOjZ0.ttf) format('truetype');
}
@font-face {
font-family: 'Lato';
font-style: normal;
font-weight: 400;
src: local('Lato Regular'), local('Lato-Regular'), url(../../../slide/reveal.js-3.7.0/fonts/1EqTbJWOZQBfhZ0e3RL9uvesZW2xOQ-xsNqO47m55DA.ttf) format('truetype');
}
@font-face {
font-family: 'Lato';
font-style: normal;
font-weight: 700;
src: local('Lato Bold'), local('Lato-Bold'), url(../../../slide/reveal.js-3.7.0/fonts/MZ1aViPqjfvZwVD_tzjjkwLUuEpTyoUstqEm5AMlJo4.ttf) format('truetype');
}
@font-face {
font-family: 'Lato';
font-style: italic;
font-weight: 400;
src: local('Lato Italic'), local('Lato-Italic'), url(../../../slide/reveal.js-3.7.0/fonts/61V2bQZoWB5DkWAUJStypevvDin1pK8aKteLpeZ5c0A.ttf) format('truetype');
}
@font-face {
font-family: 'Lato';
font-style: italic;
font-weight: 700;
src: local('Lato Bold Italic'), local('Lato-BoldItalic'), url(../../../slide/reveal.js-3.7.0/fonts/HkF_qI1x_noxlxhrhMQYECZ2oysoEQEeKwjgmXLRnTc.ttf) format('truetype');
}
/**@import url(https://fonts.googleapis.com/css?family=News+Cycle:400,700);
@import url(https://fonts.googleapis.com/css?family=Lato:400,700,400italic,700italic);
**/
/**
* A simple theme for reveal.js presentations, similar
* to the default theme. The accent color is darkblue.
*
* This theme is Copyright (C) 2012 Owen Versteeg, https://github.com/StereotypicalApps. It is MIT licensed.
* reveal.js is Copyright (C) 2011-2012 Hakim El Hattab, http://hakim.se
*
* with edits (C) 2017-2021 Oliver Kennedy.
*/
/*********************************************
* GLOBAL STYLES
*********************************************/
body {
background: #fff;
background-color: #fff; }
.reveal {
font-family: 'Lato', sans-serif;
font-size: 36px;
font-weight: normal;
color: #000; }
::selection {
color: #fff;
background: rgba(0, 0, 0, 0.99);
text-shadow: none; }
.reveal .slides > section, .reveal .slides > section > section {
line-height: 1.3;
font-weight: inherit; }
/*********************************************
* STATIC HEADER/FOOTER
*********************************************/
.reveal .header {
position: absolute;
top: 0px;
left: 0px;
right: 0px;
height: 25px;
text-align: center;
padding-left: 15px;
padding-right: 15px;
padding-bottom: 10px;
padding-top: 15px;
background-color: #041a9b;
color: white;
font-size: 0.5em;
z-index: 100;
}
.reveal .footer {
position: absolute;
bottom: 0px;
left: 0px;
right: 0px;
height: 40px;
text-align: center;
padding-left: 15px;
padding-right: 15px;
padding-bottom: 10px;
padding-top: 20px;
background-color: #041a9b;
color: white;
font-size: 0.5em;
z-index: 100;
}
/*********************************************
* HEADERS
*********************************************/
.reveal h1, .reveal h2, .reveal h3, .reveal h4, .reveal h5, .reveal h6 {
margin: 0 0 20px 0;
color: #000;
font-family: 'News Cycle', Impact, sans-serif;
font-weight: normal;
line-height: 1.2;
letter-spacing: normal;
text-transform: none;
text-shadow: none;
word-wrap: break-word; }
.reveal h1 {
font-size: 3.77em; }
.reveal h2 {
font-size: 2.11em; }
.reveal h3 {
font-size: 1.55em; }
.reveal h4 {
font-size: 1em; }
.reveal h1 {
text-shadow: none; }
/*********************************************
* OTHER
*********************************************/
.reveal p {
margin: 20px 0;
line-height: 1.3; }
.reveal imagecredits {
font-size: 12pt;
position: absolute;
right: -10px;
bottom: -10px;
text-align: right;
}
.reveal citation {
font-size: 12pt;
position: absolute;
right: -10px;
bottom: -10px;
text-align: right;
}
.reveal tt {
font-family: courier;
font-weight: bold;
}
/* Ensure certain elements are never larger than the slide itself */
.reveal img, .reveal video, .reveal iframe {
max-width: 95%;
max-height: 95%; }
.reveal strong, .reveal b {
font-weight: bold; }
.reveal em {
font-style: italic; }
.reveal ol, .reveal dl, .reveal ul {
display: inline-block;
text-align: left;
margin: 0 0 0 1em; }
.reveal ol {
list-style-type: decimal; }
.reveal ul {
list-style-type: disc; }
.reveal ul > li {
margin-top: 20px; }
.reveal ul.tight > li {
margin-top: 10px; }
.reveal ol > li {
margin-top: 20px; }
.reveal ol.tight > li {
margin-top: 0px; }
.reveal ul ul {
list-style-type: square; }
.reveal ul ul ul {
list-style-type: circle; }
.reveal ul ul, .reveal ul ol, .reveal ol ol, .reveal ol ul {
display: block;
margin-left: 40px; }
.reveal dt {
margin-top: 20px;
margin-bottom: 0px;
font-weight: bold; }
.reveal dd {
margin-top: 0px;
margin-left: 40px; }
.reveal q, .reveal blockquote {
quotes: none; }
.reveal blockquote {
display: block;
position: relative;
width: 70%;
margin: 20px auto;
padding: 5px;
font-style: italic;
background: rgba(255, 255, 255, 0.05);
box-shadow: 0px 0px 2px rgba(0, 0, 0, 0.2); }
.reveal blockquote p:first-child, .reveal blockquote p:last-child {
display: inline-block; }
.reveal q {
font-style: italic; }
.reveal pre {
display: block;
position: relative;
width: 90%;
margin: 20px auto;
text-align: left;
font-size: 0.55em;
font-family: monospace;
line-height: 1.2em;
word-wrap: break-word;
box-shadow: 0px 0px 6px rgba(0, 0, 0, 0.3); }
.reveal code {
font-family: monospace;
}
.reveal pre code {
display: block;
padding: 5px;
overflow: auto;
max-height: 400px;
word-wrap: normal;
background: #3F3F3F;
color: #DCDCDC; }
.reveal table {
margin: auto;
border-collapse: collapse;
border-spacing: 0; }
.reveal table th {
font-weight: bold;
border-bottom: 1px solid; }
.reveal table th, .reveal table td {
text-align: center;
padding: 0.2em 0.5em 0.2em 0.5em;}
.reveal table th[align="left"], .reveal table td[align="left"] {
text-align: left; }
.reveal table th[align="right"], .reveal table td[align="right"] {
text-align: right; }
.reveal table tr:last-child td {
border-bottom: none; }
.reveal sup {
vertical-align: super; }
.reveal sub {
vertical-align: sub; }
.reveal small {
display: inline-block;
font-size: 0.6em;
line-height: 1.2em;
vertical-align: top; }
.reveal small * {
vertical-align: top; }
/*********************************************
* LINKS
*********************************************/
.reveal a {
color: #00008B;
text-decoration: none;
-webkit-transition: color 0.15s ease;
-moz-transition: color 0.15s ease;
transition: color 0.15s ease; }
.reveal a:hover {
color: #0000f1;
text-shadow: none;
border: none; }
.reveal .roll span:after {
color: #fff;
background: #00003f; }
/*********************************************
* IMAGES
*********************************************/
.reveal section img {
margin: 15px 0px;
background: rgba(255, 255, 255, 0.12);
}
.reveal section img.bordered
{
border: 4px solid #000;
box-shadow: 0 0 10px rgba(0, 0, 0, 0.15);
}
.reveal a img {
-webkit-transition: all 0.15s linear;
-moz-transition: all 0.15s linear;
transition: all 0.15s linear; }
.reveal a:hover img {
background: rgba(255, 255, 255, 0.2);
border-color: #00008B;
box-shadow: 0 0 20px rgba(0, 0, 0, 0.55); }
/*********************************************
* NAVIGATION CONTROLS
*********************************************/
.reveal .controls div.navigate-left, .reveal .controls div.navigate-left.enabled {
border-right-color: #00008B; }
.reveal .controls div.navigate-right, .reveal .controls div.navigate-right.enabled {
border-left-color: #00008B; }
.reveal .controls div.navigate-up, .reveal .controls div.navigate-up.enabled {
border-bottom-color: #00008B; }
.reveal .controls div.navigate-down, .reveal .controls div.navigate-down.enabled {
border-top-color: #00008B; }
.reveal .controls div.navigate-left.enabled:hover {
border-right-color: #0000f1; }
.reveal .controls div.navigate-right.enabled:hover {
border-left-color: #0000f1; }
.reveal .controls div.navigate-up.enabled:hover {
border-bottom-color: #0000f1; }
.reveal .controls div.navigate-down.enabled:hover {
border-top-color: #0000f1; }
/*********************************************
* PROGRESS BAR
*********************************************/
.reveal .progress {
background: rgba(0, 0, 0, 0.2); }
.reveal .progress span {
background: #00008B;
-webkit-transition: width 800ms cubic-bezier(0.26, 0.86, 0.44, 0.985);
-moz-transition: width 800ms cubic-bezier(0.26, 0.86, 0.44, 0.985);
transition: width 800ms cubic-bezier(0.26, 0.86, 0.44, 0.985); }
/*********************************************
* SLIDE NUMBER
*********************************************/
.reveal .slide-number {
color: #00008B; }
/*********************************************
* CUSTOM HIGHLIGHTS
*********************************************/
.reveal .slides section .fragment.highlight-grey,
.reveal .slides section .fragment.highlight-current-grey {
opacity: 1;
visibility: inherit; }
.reveal .slides section .fragment.highlight-grey.visible {
color: lightgrey; }
.reveal .slides section .fragment.highlight-current-grey.current-fragment {
color: lightgrey; }
/*********************************************
* CUSTOM TAGS
*********************************************/
attribution {
width: 100%;
text-align: right;
font-size: 40%;
display: block;
}

@ -1,7 +1,7 @@
<!doctype html>
<%
class_name = "CSE-4/562 Spring 2021"
class_name ||= "CSE-4/562 Spring 2021"
%>
<html lang="en">