master
Oliver Kennedy 2022-03-27 17:33:54 -04:00
parent 4c4efd668e
commit b0a006bdaa
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
6 changed files with 105 additions and 1 deletions

.gitignore vendored

@ -8,3 +8,4 @@ main.synctex.gz
pdfa.xmpi
main.bbl
main.blg
vizier.db


@ -0,0 +1,12 @@
0,3680,1,127,585
1,3681,1,684,4097
2,3682,1,4173,10320
3,3683,1,4185,9881
4,3684,1,4200,10269
5,3685,1,4219,10284
6,3686,1,4236,10076
7,3687,1,4251,10373
8,3688,1,4269,10188
9,3689,1,4287,10131
10,3690,1,4311,9839
11,3691,1,4333,10006

data/100000_10_serial.csv Normal file

@ -0,0 +1,12 @@
0,3692,1,157,546
1,3693,1,625,4021
2,3694,1,4092,7507
3,3695,1,7555,8370
4,3696,1,8416,9186
5,3697,1,9231,9900
6,3698,1,9939,10633
7,3699,1,10666,11444
8,3700,1,11475,12166
9,3701,1,12205,12921
10,3702,1,12951,13717
11,3703,1,13746,14393


@ -1,5 +1,4 @@
@inproceedings{brachmann:2020:cidr:your,
author = {Brachmann, Michael and Spoth, William and Kennedy, Oliver and Glavic, Boris and Mueller, Heiko and Castelo, Sonia and Bautista, Carlos and Freire, Juliana},
title = {Your notebook is not crumby enough, REPLace it},
@ -363,3 +362,16 @@
Title = {{Principles of Program Analysis}},
Year = {1999},
}
@article{DBLP:journals/tse/Weiser84,
author = {Mark D. Weiser},
title = {Program Slicing},
journal = {{IEEE} Trans. Software Eng.},
volume = {10},
number = {4},
pages = {352--357},
year = {1984}
}


@ -1,5 +1,13 @@
%!TEX root=../main.tex
We introduce a provenance-based approach for predicting and tracking dependencies across Python cells in a computational notebook, and an implementation of this approach in Vizier, a data-centric notebook system where cells are isolated from each other and communicate through data artifacts. By combining best-effort static analysis with an adaptable runtime schedule for notebook cell execution, we achieve (i) parallel execution of Python cells, (ii) automatic refresh of dependent cells when the notebook is modified, and (iii) translation of Jupyter notebooks into our model.
This paper represents an initial proof of concept; we note several opportunities for improvement.
Crucially, there are still considerable opportunities to reduce blocking due to state transfer between kernels.
For example, it may be possible to re-use a kernel for applications involving large state, trading the increased overhead of sequential execution for a reduction in overhead from state transfer.
We also plan to explore ways to make state export and import asynchronous.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"


@ -1 +1,60 @@
%!TEX root=../main.tex
Jupyter notebooks are evaluated in a single long-running kernel that facilitates inter-cell communication through a shared global state.
In this section, we outline the process for converting a Jupyter notebook into a form compatible with \systemname's isolated cell execution model.
For this preliminary work, we make the simplifying assumption that cells do not perform out-of-band communication (e.g., through files or external services), and that all state required by a cell is passed via the kernel's global scope.
Our first challenge is deriving each cell's read and write sets.
To accomplish this, we build a dataflow graph over the cells of the notebook using Python's \texttt{ast} module to obtain a structured representation of the code: an \emph{abstract syntax tree} (AST).
Symbol and attribute references are marked by instances of the \texttt{Name} and \texttt{Attribute} nodes, conveniently annotated with the directionality of the reference: \texttt{Load}, \texttt{Store}, or \texttt{Del}.
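To make this concrete, the extraction of a cell's read and write sets can be sketched as follows. This is a simplified illustration, not Vizier's implementation: the function name is ours, and it tracks only bare \texttt{Name} references, ignoring attributes and scoping.

```python
# Sketch: recover a cell's read and write sets from its AST.
# Simplification (ours): only bare Name nodes are tracked; a full
# implementation would also handle Attribute references and scopes.
import ast

def read_write_sets(cell_source):
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(cell_source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            elif isinstance(node.ctx, (ast.Store, ast.Del)):
                writes.add(node.id)
    return reads, writes

r, w = read_write_sets("b = a + 1")
# r == {'a'}, w == {'b'}
```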
We simulate execution of the Python code in a manner analogous to Program Slicing~\cite{DBLP:journals/tse/Weiser84}, using an in-order traversal of the AST's statements (lines or blocks of code) to build a \emph{fine-grained} data flow graph.
A data flow graph is a directed graph where each node corresponds to a single statement in the AST; we identify nodes by a 2-tuple of the cell index and the line number.
Each edge corresponds to a read dependency, directed from a statement containing an attribute \texttt{Load} to the statement whose \texttt{Store} most recently wrote that attribute.
Due to control flow constructs (e.g., if-then-else blocks and for loops), a single attribute \texttt{Load} can potentially read from multiple \texttt{Store} operations; we represent this by encoding one edge for each potential dependency.
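The last-writer bookkeeping behind this construction might be sketched as follows; this is a toy version under our own assumed representation, which ignores scoping and control flow (so each \texttt{Load} maps to a single \texttt{Store} here).

```python
# Toy sketch of fine-grained data flow graph construction: visit
# statements in order, linking each Load back to the most recent
# Store of the same symbol.  Nodes are (cell_index, line) tuples.
# Simplification (ours): no scoping or control flow handling.
import ast

def fine_grained_edges(cells):
    edges = []        # (reading_node, writing_node, symbol)
    last_store = {}   # symbol -> node that most recently stored it
    for cell_idx, src in enumerate(cells):
        for stmt in ast.parse(src).body:
            node = (cell_idx, stmt.lineno)
            for ref in ast.walk(stmt):
                if isinstance(ref, ast.Name):
                    if isinstance(ref.ctx, ast.Load) and ref.id in last_store:
                        edges.append((node, last_store[ref.id], ref.id))
                    elif isinstance(ref.ctx, ast.Store):
                        last_store[ref.id] = node
    return edges

fine_grained_edges(["x = 1", "y = x + 1"])
# → [((1, 1), (0, 1), 'x')]
```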
Python's scoping logic presents additional complications.
First, function and class definitions may reference symbols from a containing scope.
For example, Python's \texttt{import} statement simply declares imported modules as symbols in the current scope.
Thus, references to the module's functions within a function or class definition create transitive dependencies.
When the traversal visits a function or class declaration statement, we record the symbols the declaration references from enclosing scopes; a later invocation of the function (or instantiation of the class) then induces read dependencies on those symbols.
An additional complication arises from Python's scope capture semantics.
When a function (or class) is declared, it records a reference to all enclosing scopes. Consider the following example code:
\begin{lstlisting}
def foo():
  print(a)
a = 1
foo()   # Prints '1'
def bar():
  a = 2
  foo()
bar()   # Prints '1'
\end{lstlisting}
Contrast this with the following example, in which \texttt{bar} is declared inside \texttt{foo} and so captures \texttt{foo}'s scope:
\begin{lstlisting}
def foo():
  a = 2
  def bar():
    print(a)
  return bar
bar = foo()
bar()   # Prints '2'
a = 1
bar()   # Prints '2'
\end{lstlisting}
In the latter instance, when \texttt{bar} is declared, it `captures' the scope of \texttt{foo}, in which \texttt{a = 2}; this captured binding takes precedence over the assignment in the global scope. In the former instance, conversely, \texttt{bar}'s assignment to \texttt{a} happens in its own local scope, and so the invocation of \texttt{foo} reads the instance of \texttt{a} in the global scope.
\tinysection{Coarse-Grained Data Flow}
We next reduce the fine-grained data flow graph, as defined above, into a simplified \emph{coarse-grained} data flow graph by (i) merging the nodes for each cell, (ii) removing self-edges, and (iii) removing parallel edges with identical labels.
The coarse-grained data flow graph provides us with the cell's dependencies: the set of in-edges (resp., out-edges) is a guaranteed upper bound on the cell's write set (resp., read set).
We note that the coarse-grained data flow graph is free of several complications that affect most data flow analyses.
Notably, Jupyter does not allow control-flow primitives to span cells.
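Assuming fine-grained edges are represented as (reader node, writer node, symbol) triples over (cell, line) nodes (our representation, chosen for illustration), the reduction amounts to:

```python
# Sketch of the coarse-graining step: (i) merge nodes by cell,
# (ii) drop self-edges, (iii) dedupe parallel edges with the same label.
# Edge representation (assumed for illustration):
#   ((reader_cell, reader_line), (writer_cell, writer_line), symbol)
def coarsen(fine_edges):
    coarse = set()
    for (r_cell, _), (w_cell, _), symbol in fine_edges:
        if r_cell != w_cell:                      # (ii) remove self-edges
            coarse.add((r_cell, w_cell, symbol))  # (i)+(iii) via the set
    return coarse

coarsen([((1, 1), (0, 1), 'x'),   # cell 1 reads x from cell 0
         ((1, 2), (0, 1), 'x'),   # parallel edge with the same label
         ((0, 2), (0, 1), 'x')])  # self-edge within cell 0
# → {(1, 0, 'x')}
```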
For each cell, we record the approximate read and write sets. We then instrument the cell by injecting \systemname import operations for every element of the read set into the start of the cell and export operations for every element of the write set at the end of the cell.
We leave to future work a more fine-grained instrumentation that avoids importing or exporting symbols until they are explicitly required.
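The instrumentation step might look like the following sketch; the \texttt{vizierdb.get} and \texttt{vizierdb.export} calls are illustrative placeholders of our own, not Vizier's actual artifact API.

```python
# Hypothetical sketch of cell instrumentation: prepend imports for the
# read set, append exports for the write set.  The vizierdb.get/export
# names are placeholders, not Vizier's real API.
def instrument(cell_source, read_set, write_set):
    header = [f"{s} = vizierdb.get('{s}')" for s in sorted(read_set)]
    footer = [f"vizierdb.export('{s}', {s})" for s in sorted(write_set)]
    return "\n".join(header + [cell_source] + footer)

print(instrument("b = a + 1", {"a"}, {"b"}))
# a = vizierdb.get('a')
# b = a + 1
# vizierdb.export('b', b)
```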