diff --git a/src/talks/2024-04-12-UIC-script.txt b/src/talks/2024-04-12-UIC-script.txt
new file mode 100644
index 00000000..8244de11
--- /dev/null
+++ b/src/talks/2024-04-12-UIC-script.txt
@@ -0,0 +1,45 @@
+Open farmersmarket_2024-42231059.xlsx
+
+Geotag: Lon - Y; Lat - X
+
+Geoplot
+ - 1. too much data
+ - 2. oops, flipped
+
+Alter geotag. See geoplot rerun
+
+Still too much data. Have a .shp file, but vizier doesn't support an adaptor. Python:
+-------------------
+# Extract County Shapes
+import shapefile
+
+with shapefile.Reader("cb_2018_us_county_500k.zip") as sf:
+  #for field in sf.fields:
+  #  print(field)
+  # Get object containing an empty dataset.
+  ds = vizierdb.new_dataset()
+  ds.insert_column("county")
+  ds.insert_column("zip")
+  ds.insert_column("geometry", "geometry")
+  for entry in sf.shapeRecords():
+    if entry.record[0] == '36': # 36 is NYS
+      row = [ entry.record[5], entry.record[4], entry.shape ]
+      #print(row)
+      ds.insert_row( row )
+  ds.save("nys_counties")
+  ds.show()
+-------------------
+
+Spatial join
+-------------------
+SELECT *
+FROM nys_counties nys
+  JOIN wny_counties wny ON nys.county = wny.county
+-------------------
+
+Add below, and name: usda_farmers_markets
+-------------------
+  JOIN usda_farmers_markets f ON ST_CONTAINS(nys.geometry, f.geometry)
+-------------------
+
+Watch updated chart
\ No newline at end of file
diff --git a/src/talks/2024-04-12-UIC.erb b/src/talks/2024-04-12-UIC.erb
index 64897c03..e0476a91 100644
--- a/src/talks/2024-04-12-UIC.erb
+++ b/src/talks/2024-04-12-UIC.erb
@@ -304,7 +304,7 @@ end
 nbcell("if z:\n y = x + 2", idx: 2)
 end
 %>
-
+

If z == False:

Reads: $\{\;\textbf{z}\;\}$

Writes: $\{\;\;\}$

@@ -423,7 +423,6 @@ end
 "Bolt-on, Compact, and Rapid Program Slicing for Notebooks" (Shankar et al.; VLDB 2023)
- (Similar ideas in Nodebook, etc...)
@@ -446,10 +445,6 @@ end
-
-

Vizier Demo

-
-
@@ -467,12 +462,6 @@ end

We need to be able to recover the kernel to any state.

-
-

Why have only one kernel?

- -

🤷

-
-
<%= notebook() do @@ -489,8 +478,17 @@ end
-

When is parallelism allowed?

-

When is a cell runnable?

+

Why have only one kernel?

+ +

🤷

+
+ +
+

Parallelism

+
@@ -522,14 +520,14 @@ end
Active if: $\forall (x \rightarrow \textbf{@i}) \in \texttt{DynamicReads} : \texttt{InState}[x] = \textbf{@i}$
$\texttt{OutState} = \texttt{InState} + \{\;x \rightarrow \textbf{@i}\;|\;\forall (x \rightarrow \textbf{@i}) \in \texttt{DynamicWrites}\;\}$
-
Stale
-
Active if: first run or $\exists (x \rightarrow \textbf{@i}) \in \texttt{DynamicReads} : \texttt{InState}[x] \neq \textbf{@i}$
-
$\texttt{OutState} = \texttt{InState} + \{\;x \rightarrow \textbf{???}\;|\;\forall x \in \texttt{StaticWrites}\;\}$
-
Runnable
Active if: $\forall x \in \texttt{StaticReads} : \texttt{InState}[x] \neq \textbf{???}$
$\texttt{OutState} = \texttt{InState} + \{\;x \rightarrow \textbf{???}\;|\;\forall x \in \texttt{StaticWrites}\;\}$
+
Stale
+
Active if: first run or $\exists (x \rightarrow \textbf{@i}) \in \texttt{DynamicReads} : \texttt{InState}[x] \neq \textbf{@i}$
+
$\texttt{OutState} = \texttt{InState} + \{\;x \rightarrow \textbf{???}\;|\;\forall x \in \texttt{StaticWrites}\;\}$
+
Unknown
Active otherwise.
$\texttt{OutState} = \texttt{InState} + \{\;x \rightarrow \textbf{???}\;|\;\forall x \in \texttt{StaticWrites}\;\}$
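The state rules above can be sketched in Python. This is a minimal sketch under stated assumptions, not Vizier's implementation: the name `done` for the first rule is hypothetical (its label falls outside this hunk), the rules are tried in the order the slide lists them, a never-run cell is modelled as `dynamic_reads=None`, and a `???` version is treated as neither equal nor unequal to a recorded version so that the "Unknown" case stays reachable.

```python
from typing import Dict, Optional, Set

UNKNOWN = "???"  # version placeholder: the writer of this variable has not finished yet


def classify(in_state: Dict[str, str],
             static_reads: Set[str],
             dynamic_reads: Optional[Dict[str, str]]) -> str:
    """Classify one cell; dynamic_reads is None if the cell has never run."""
    # "done" (hypothetical name): every version read last time is still current.
    if dynamic_reads is not None and all(
            in_state.get(x) == v for x, v in dynamic_reads.items()):
        return "done"
    # Runnable: every variable the cell might read has a known version.
    if all(in_state.get(x) != UNKNOWN for x in static_reads):
        return "runnable"
    # Stale: first run, or a recorded read now definitely sees another version.
    if dynamic_reads is None or any(
            in_state.get(x) not in (v, UNKNOWN) for x, v in dynamic_reads.items()):
        return "stale"
    return "unknown"


def out_state(in_state, state, static_writes, dynamic_writes):
    """Propagate versions: exact writes when done, ??? for possible writes otherwise."""
    out = dict(in_state)
    if state == "done":
        out.update(dynamic_writes)
    else:
        out.update({x: UNKNOWN for x in static_writes})
    return out
```

For the `if z:` cell shown earlier, a run with `z == False` records only `z` as a dynamic read, so a later change to `x` alone leaves the cell in the first state.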
@@ -537,19 +535,19 @@ end
- - "The Right Tool for the Job: Data-Centric Workflows in Vizier" (Kennedy et al.; IEEE DEB 2022) -
- -
-

Serial

- -

Parallel

- +
+

Serial

+ +
+
+

Parallel

+ +
+ "Runtime Provenance Refinement for Notebooks" (Deo et al.; TaPP 2022)
+

Microkernel Notebooks

https://openclipart.com
@@ -580,6 +578,15 @@ end

🤷

+
+

Vizier Demo

+
+ +
+ + "The Right Tool for the Job: Data-Centric Workflows in Vizier" (Kennedy et al.; IEEE DEB 2022) +
+

Repeatable Spreadsheet Dataframe Editing

@@ -608,48 +615,156 @@ end
-

... but this requires migrating state.

+

... but this requires migrating state... across languages.

https://openclipart.com
-

State Management

+ +
+ +
+

Approach 1: Pickle

+ +

Python's native serialization support.

+ +
+
+
The Good
+
Easy
+
+
+
The Bad
+
Not everything is serializable†
+
Limited compatibility with ¬Python
+
Expensive (e.g., for dataframes)
+
+
+
+ +
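A small illustration of the "not everything is serializable" point, using only the standard library. The helper `try_pickle` is a hypothetical name introduced for this sketch; the file-handle case mirrors the 'File' object example from the earlier outline of this slide.

```python
import os
import pickle


def try_pickle(value):
    """Return (True, payload) if value pickles, else (False, error description)."""
    try:
        return True, pickle.dumps(value)
    except (pickle.PicklingError, TypeError, AttributeError) as err:
        return False, f"{type(err).__name__}: {err}"


# Plain data round-trips without extra work.
ok_list, payload = try_pickle([1, 2.5, "three"])
restored = pickle.loads(payload)

# An open file handle wraps operating-system state and is rejected.
with open(os.devnull, "r") as handle:
    ok_file, reason = try_pickle(handle)
```

The payload is also a Python-specific byte format, which is what limits its use from other languages.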
+

Approach 2: JSON

+ +

Standard data interchange format.

+ +
+
+
The Good
+
Easy
+
Near universal platform compatibility
+
+
+
The Bad
+
Even less state is supported
+
Even more expensive (e.g., for dataframes)
+
Limited support for nuanced types (e.g., dates)
+
+
+
+ +
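The "nuanced types" limitation can be seen with a date, using only the standard library. The record below is made-up illustrative data, and encoding the date as an ISO string is one common workaround, not something the slides prescribe.

```python
import json
from datetime import date

# Made-up record for illustration only.
record = {"market": "Example Market", "opened": date(2024, 4, 12)}

# The standard encoder has no representation for date objects.
try:
    json.dumps(record)
    native_ok = True
except TypeError:
    native_ok = False

# Workaround: fall back to an ISO-8601 string for unsupported values.
encoded = json.dumps(record, default=lambda value: value.isoformat())
decoded = json.loads(encoded)

# The value survives, but its type does not: it comes back as a plain string.
opened_type = type(decoded["opened"]).__name__
```

The reader has to know out of band that `opened` was a date, which is the kind of typing information a notebook-level type system would need to carry.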
+

Approach 3: Arrow, Shapefile, Parquet, NPY

+ +

Specialized formats for specific datatypes.

+
+
+
The Good
+
High Performance
+
Precise, Well Typed
+
+
+
The Bad
+
Only one type of state is supported
+
+
+
+ +
+

Vizier (Now)

+ +

Vizier-level Typing.

- Naive approach: Pickle
+

Active Data

- ... but pickle doesn't allow interop
- ... but pickle doesn't always work (e.g., for 'File' objects)
+

Datasets, Functions/Classes, etc...

+
+
- Interop: Define standards
+

Desiderata

-   - Primitive Values (int, float, date, etc...)
-   - Collection Types (map, list, etc...)
-   - Libraries
-   - Function [Challenge: Chained Dependencies]
-   - Dataframe/Series [Challenge: These are BIG]
+
+

An abstraction that...

+
+
  • ... represents the concept.
  • ... allows on-demand conversion between representations.
  • ... allows partial in-store interactions.
  • ... allows incremental changes.
+
+
+

Vizier's artifact store provides a thin wrapper around standards-compliant libraries (e.g., Apache Spark).
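One of the desiderata, on-demand conversion between representations, might be sketched as follows. The `Artifact` class, its method names, and the format labels are all hypothetical and are not Vizier's API; the sketch only shows the idea of keeping one logical value and materializing other representations lazily.

```python
import json
from typing import Any, Callable, Dict


class Artifact:
    """Hypothetical wrapper: one logical value, many on-demand representations."""

    def __init__(self, kind: str, native_format: str, value: Any):
        self.kind = kind
        self._native = native_format
        self._cache: Dict[str, Any] = {native_format: value}
        self._converters: Dict[str, Callable[[Any], Any]] = {}
        self.conversions = 0  # counts how often a converter actually ran

    def register(self, target_format: str, convert: Callable[[Any], Any]) -> None:
        """Declare how to derive target_format from the native representation."""
        self._converters[target_format] = convert

    def as_format(self, target_format: str) -> Any:
        """Return the requested representation, converting only on first use."""
        if target_format not in self._cache:
            convert = self._converters[target_format]
            self._cache[target_format] = convert(self._cache[self._native])
            self.conversions += 1
        return self._cache[target_format]


rows = Artifact("dataset", "rows", [{"a": 1, "b": 2}])
rows.register("json", json.dumps)
first = rows.as_format("json")
second = rows.as_format("json")  # served from the cache, no second conversion
```

Partial in-store interaction and incremental change are not covered here; they would need the store itself to understand the representation.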

- +

"Active" Data

+
+
+

... but it's a lot of special case code.

+
+ +
+

Generalizing Active Data

+

(future work)

+ + + +

Questions?

+
+ + -<%# +

https://vizierdb.info

-

Mike Brachmann, Boris Glavic, Nachiket Deo, Juliana Freire, Heiko Mueller, Sonia Castello, Munaf Arshad Qazi, William Spoth, Poonam Kumari, Soham Patel, and more...

+

Mike Brachmann, Boris Glavic, Nachiket Deo, Juliana Freire, Heiko Mueller, Sonia Castello, Munaf Arshad Qazi, William Spoth, Poonam Kumari, Nicholas Brown, Soham Patel, Thomas Slowe, and more...

+ + +
+ Supported by: + + +
- %>
diff --git a/src/talks/graphics/2024-04-12/DataframeAbstraction.svg b/src/talks/graphics/2024-04-12/DataframeAbstraction.svg
new file mode 100644
index 00000000..f1cda281
--- /dev/null
+++ b/src/talks/graphics/2024-04-12/DataframeAbstraction.svg
@@ -0,0 +1,280 @@
[SVG markup not recoverable; surviving text labels: "Vdb", "create artifact", "artifact @1 created", "write data", "SQL", "Export", "Summary"]
diff --git a/src/talks/graphics/2024-04-12/Dependencies-State.svg b/src/talks/graphics/2024-04-12/Dependencies-State.svg
new file mode 100644
index 00000000..07b36255
--- /dev/null
+++ b/src/talks/graphics/2024-04-12/Dependencies-State.svg
@@ -0,0 +1,229 @@
[SVG markup not recoverable; surviving text labels: "1: State needs to be checkpointed.", "2: State needs to be restored."]
diff --git a/src/talks/graphics/2024-04-12/NotebookExtensions.svg b/src/talks/graphics/2024-04-12/NotebookExtensions.svg
new file mode 100644
index 00000000..63f00831
--- /dev/null
+++ b/src/talks/graphics/2024-04-12/NotebookExtensions.svg
@@ -0,0 +1,261 @@
[SVG markup not recoverable; surviving text labels: "1: Explore", "2: Revise"]
diff --git a/src/talks/graphics/logos/breadcrumb.png b/src/talks/graphics/logos/breadcrumb.png
new file mode 100644
index 00000000..86de55c3
Binary files /dev/null and b/src/talks/graphics/logos/breadcrumb.png differ