master
Oliver Kennedy 2022-03-27 17:33:54 -04:00
parent 4c4efd668e
commit b0a006bdaa
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
6 changed files with 105 additions and 1 deletions

.gitignore vendored

@ -8,3 +8,4 @@ main.synctex.gz
pdfa.xmpi
main.bbl
main.blg
vizier.db


@ -0,0 +1,12 @@
0,3680,1,127,585
1,3681,1,684,4097
2,3682,1,4173,10320
3,3683,1,4185,9881
4,3684,1,4200,10269
5,3685,1,4219,10284
6,3686,1,4236,10076
7,3687,1,4251,10373
8,3688,1,4269,10188
9,3689,1,4287,10131
10,3690,1,4311,9839
11,3691,1,4333,10006

data/100000_10_serial.csv Normal file

@ -0,0 +1,12 @@
0,3692,1,157,546
1,3693,1,625,4021
2,3694,1,4092,7507
3,3695,1,7555,8370
4,3696,1,8416,9186
5,3697,1,9231,9900
6,3698,1,9939,10633
7,3699,1,10666,11444
8,3700,1,11475,12166
9,3701,1,12205,12921
10,3702,1,12951,13717
11,3703,1,13746,14393


@ -1,5 +1,4 @@
@inproceedings{brachmann:2020:cidr:your,
author = {Brachmann, Michael and Spoth, William and Kennedy, Oliver and Glavic, Boris and Mueller, Heiko and Castelo, Sonia and Bautista, Carlos and Freire, Juliana},
title = {Your notebook is not crumby enough, REPLace it},
@ -363,3 +362,16 @@
Title = {{Principles of Program Analysis}},
Year = {1999},
}
@article{DBLP:journals/tse/Weiser84,
author = {Mark D. Weiser},
title = {Program Slicing},
journal = {{IEEE} Trans. Software Eng.},
volume = {10},
number = {4},
pages = {352--357},
year = {1984}
}


@ -1,5 +1,13 @@
%!TEX root=../main.tex
We introduce a provenance-based approach for predicting and tracking dependencies across Python cells in a computational notebook, and an implementation of this approach in Vizier, a data-centric notebook system where cells are isolated from each other and communicate through data artifacts. By combining best-effort static analysis with an adaptable runtime schedule for notebook cell execution, we achieve (i) parallel execution of Python cells, (ii) automatic refresh of dependent cells when the notebook is modified, and (iii) translation of Jupyter notebooks into our model.
This paper represents an initial proof of concept; we note several opportunities for improvement.
Crucially, there are still considerable opportunities to reduce blocking due to state transfer between kernels.
For example, it may be possible to re-use a kernel for applications involving large state, trading the increased overhead of sequential execution for a reduction in overhead from state transfer.
We also plan to explore ways to make state export and import asynchronous.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"


@ -1 +1,60 @@
%!TEX root=../main.tex
Jupyter notebooks are evaluated in a single long-running kernel that facilitates inter-cell communication through a shared global state.
In this section, we outline the process for converting a Jupyter notebook into a form compatible with \systemname's isolated cell execution model.
For this preliminary work, we make the simplifying assumption that cells do not perform out-of-band communication (e.g., through files or external services), and that all state required by a cell is passed via the kernel's global scope.
Our first challenge is deriving each cell's read and write sets.
To accomplish this, we build a dataflow graph over the cells of the notebook using Python's \texttt{ast} module to obtain a structured representation of the code: an \emph{abstract syntax tree} (AST).
Symbol and attribute references are marked by instances of the \texttt{Name} and \texttt{Attribute} nodes, conveniently annotated with the directionality of the reference: \texttt{Load}, \texttt{Store}, or \texttt{Del}.
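To make this concrete, the extraction of a cell's read and write sets can be sketched as follows. This is a simplified illustration, not Vizier's implementation: the function name is ours, and it tracks only bare \texttt{Name} references, ignoring attributes and scoping.

```python
# Sketch: recover a cell's read and write sets from its AST.
# Simplification (ours): only bare Name nodes are tracked; a full
# implementation would also handle Attribute references and scopes.
import ast

def read_write_sets(cell_source):
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(cell_source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            elif isinstance(node.ctx, (ast.Store, ast.Del)):
                writes.add(node.id)
    return reads, writes

r, w = read_write_sets("b = a + 1")
# r == {'a'}, w == {'b'}
```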
We simulate execution of the Python code in a manner analogous to Program Slicing~\cite{DBLP:journals/tse/Weiser84}, using an in-order traversal of the AST's statements (lines or blocks of code) to build a \emph{fine-grained} data flow graph.
A data flow graph is a directed graph where each node corresponds to a single statement in the AST; we identify nodes by a 2-tuple of the cell index and the line number.
Each edge corresponds to a read dependency, directed from a statement containing an attribute \texttt{Load} to the statement whose \texttt{Store} most recently wrote that attribute.
Due to control flow constructs (e.g., if-then-else blocks and for loops), a single attribute \texttt{Load} can potentially read from multiple \texttt{Store} operations; we represent this by encoding one edge for each potential dependency.
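The last-writer bookkeeping behind this construction might be sketched as follows; this is a toy version under our own assumed representation, which ignores scoping and control flow (so each \texttt{Load} maps to a single \texttt{Store} here).

```python
# Toy sketch of fine-grained data flow graph construction: visit
# statements in order, linking each Load back to the most recent
# Store of the same symbol.  Nodes are (cell_index, line) tuples.
# Simplification (ours): no scoping or control flow handling.
import ast

def fine_grained_edges(cells):
    edges = []        # (reading_node, writing_node, symbol)
    last_store = {}   # symbol -> node that most recently stored it
    for cell_idx, src in enumerate(cells):
        for stmt in ast.parse(src).body:
            node = (cell_idx, stmt.lineno)
            for ref in ast.walk(stmt):
                if isinstance(ref, ast.Name):
                    if isinstance(ref.ctx, ast.Load) and ref.id in last_store:
                        edges.append((node, last_store[ref.id], ref.id))
                    elif isinstance(ref.ctx, ast.Store):
                        last_store[ref.id] = node
    return edges

fine_grained_edges(["x = 1", "y = x + 1"])
# → [((1, 1), (0, 1), 'x')]
```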
Python's scoping logic presents additional complications.
First, function and class definitions may reference symbols from a containing scope.
For example, Python's \texttt{import} statement simply declares imported modules as symbols in the current scope.
Thus, references to the module's functions within a function or class definition create transitive dependencies.
When the traversal visits a function or class declaration statement, we record the symbols the declaration references from enclosing scopes; a later invocation of the function (or instantiation of the class) then induces read dependencies on those symbols.
An additional complication arises from Python's scope capture semantics.
When a function (or class) is declared, it records a reference to all enclosing scopes. Consider the following example code:
\begin{lstlisting}
def foo():
  print(a)
a = 1
foo()   # Prints '1'
def bar():
  a = 2
  foo()
bar()   # Prints '1'
\end{lstlisting}
Contrast this with the following example, in which \texttt{bar} is declared inside \texttt{foo} and so captures \texttt{foo}'s scope:
\begin{lstlisting}
def foo():
  a = 2
  def bar():
    print(a)
  return bar
bar = foo()
bar()   # Prints '2'
a = 1
bar()   # Prints '2'
\end{lstlisting}
In the latter instance, when \texttt{bar} is declared, it `captures' the scope of \texttt{foo}, in which \texttt{a = 2}; this captured binding takes precedence over the assignment in the global scope. In the former instance, conversely, \texttt{bar}'s assignment to \texttt{a} happens in its own local scope, and so the invocation of \texttt{foo} reads the instance of \texttt{a} in the global scope.
\tinysection{Coarse-Grained Data Flow}
We next reduce the fine-grained data flow graph, as defined above, into a simplified \emph{coarse-grained} data flow graph by (i) merging the nodes for each cell, (ii) removing self-edges, and (iii) removing parallel edges with identical labels.
The coarse-grained data flow graph provides us with the cell's dependencies: the set of in-edges (resp., out-edges) is a guaranteed upper bound on the cell's write set (resp., read set).
We note that the coarse-grained data flow graph is free of several complications that affect most data flow analyses.
Notably, Jupyter does not allow control-flow primitives to span cells.
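Assuming fine-grained edges are represented as (reader node, writer node, symbol) triples over (cell, line) nodes (our representation, chosen for illustration), the reduction amounts to:

```python
# Sketch of the coarse-graining step: (i) merge nodes by cell,
# (ii) drop self-edges, (iii) dedupe parallel edges with the same label.
# Edge representation (assumed for illustration):
#   ((reader_cell, reader_line), (writer_cell, writer_line), symbol)
def coarsen(fine_edges):
    coarse = set()
    for (r_cell, _), (w_cell, _), symbol in fine_edges:
        if r_cell != w_cell:                      # (ii) remove self-edges
            coarse.add((r_cell, w_cell, symbol))  # (i)+(iii) via the set
    return coarse

coarsen([((1, 1), (0, 1), 'x'),   # cell 1 reads x from cell 0
         ((1, 2), (0, 1), 'x'),   # parallel edge with the same label
         ((0, 2), (0, 1), 'x')])  # self-edge within cell 0
# → {(1, 0, 'x')}
```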
For each cell, we record the approximate read and write sets. We then instrument the cell by injecting \systemname import operations for every element of the read set into the start of the cell and export operations for every element of the write set at the end of the cell.
We leave to future work a more fine-grained instrumentation that avoids importing or exporting symbols until they are explicitly required.
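The instrumentation step might look like the following sketch; the \texttt{vizierdb.get} and \texttt{vizierdb.export} calls are illustrative placeholders of our own, not Vizier's actual artifact API.

```python
# Hypothetical sketch of cell instrumentation: prepend imports for the
# read set, append exports for the write set.  The vizierdb.get/export
# names are placeholders, not Vizier's real API.
def instrument(cell_source, read_set, write_set):
    header = [f"{s} = vizierdb.get('{s}')" for s in sorted(read_set)]
    footer = [f"vizierdb.export('{s}', {s})" for s in sorted(write_set)]
    return "\n".join(header + [cell_source] + footer)

print(instrument("b = a + 1", {"a"}, {"b"}))
# a = vizierdb.get('a')
# b = a + 1
# vizierdb.export('b', b)
```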