Boris Glavic 2022-03-31 18:50:32 -05:00
parent 2c8a5c36c2
commit fcb5cb31d4

@ -25,13 +25,13 @@ For our use cases of provenance (parallel execution of cells and limiting automa
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Approximate Provenance through Static Dataflow Analysis}
%
For static analysis, we built a dataflow graph for the code of Python cells, using Python's \texttt{ast} module and the dataflow equations that are standard in static program analysis. We analyze only the user's code and do not extend the analysis to other modules (libraries). As a result, we may miss data dependencies between user-created objects that are caused by such library code. This is acceptable, because such dependencies are discovered and compensated for at runtime.
Like any static dataflow analysis technique, our approach may produce false positives.\footnote{This is because static analysis has to reason about all possible control-flow paths through a program that could arise for some input, while for a concrete input only some of these paths may be taken.} For example, in the code snippet shown at the bottom of \Cref{fig:example-python-code}, the value of the variable \texttt{b} may depend on either \texttt{c} or \texttt{d}, depending on whether \texttt{a < 10} evaluates to true. Since the value of \texttt{a} is only known at runtime, static analysis has to assume that \texttt{b} depends on both \texttt{c} and \texttt{d}.
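The conservative behavior described above can be illustrated with a much-simplified sketch built on Python's \texttt{ast} module. It is not the dataflow analysis used in the system: the helper names \texttt{names\_read} and \texttt{may\_depend} are hypothetical, only plain assignments and \texttt{if} statements are handled, only direct (non-transitive) dependencies are reported, and the analyzed snippet is an assumed reconstruction of the example rather than the actual content of the figure. The sketch also records the branch condition as a control dependency, so \texttt{a} appears next to \texttt{c} and \texttt{d}.

```python
import ast


def names_read(node):
    """Return the names that are loaded (read) anywhere inside an AST node."""
    return {n.id for n in ast.walk(node)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}


def may_depend(stmts, control=frozenset()):
    """Map each assigned variable to the variables it MAY depend on.

    Hypothetical helper: handles only `x = expr` and `if/else`, and
    over-approximates by taking the union over both branches.
    """
    deps = {}
    for stmt in stmts:
        if isinstance(stmt, ast.Assign):
            # Data dependencies of the right-hand side plus the
            # conditions of all enclosing branches.
            reads = names_read(stmt.value) | control
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    deps[target.id] = set(reads)
        elif isinstance(stmt, ast.If):
            # The branch taken is unknown statically, so both are analyzed
            # and their results are merged (a conservative union).
            cond = control | names_read(stmt.test)
            for branch in (stmt.body, stmt.orelse):
                for var, reads in may_depend(branch, cond).items():
                    deps.setdefault(var, set()).update(reads)
    return deps


# Assumed reconstruction of the example snippet (not taken from the figure).
source = """
if a < 10:
    b = c
else:
    b = d
"""
deps = may_depend(ast.parse(source).body)
print(sorted(deps["b"]))
```

For any concrete value of \texttt{a}, only one of the two assignments executes, so one of the reported dependencies on \texttt{c} or \texttt{d} is a false positive for that run.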
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Refining Approximate Provenance at Runtime}
%
To deal with the approximate nature of the provenance generated by our static analysis, we allow the educated guesses made by static analysis to be refined at runtime. This is where the benefits of isolated cell execution become clear: cells communicate explicitly through data artifacts that are created, read, and written through an API provided by the system. Because the system keeps track of which cells access which data through this API, new data dependencies not predicted by static analysis are automatically detected at runtime. Similarly, when a cell finishes execution, we know which predicted data dependencies have not materialized. Next, we introduce a scheduling algorithm that dynamically adapts its execution plan when new dependencies are detected or predicted cell dependencies do not materialize.
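The bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions and not the system's actual API: the class \texttt{ArtifactStore}, its \texttt{read} and \texttt{write} methods, the function \texttt{refine}, and the cell identifiers are all hypothetical names introduced for this example.

```python
class ArtifactStore:
    """Hypothetical artifact API that logs which cell touches which artifact."""

    def __init__(self):
        self._data = {}
        self.reads = {}   # cell id -> set of artifact names read
        self.writes = {}  # cell id -> set of artifact names written

    def read(self, cell, name):
        # Every read goes through the API, so it can be recorded.
        self.reads.setdefault(cell, set()).add(name)
        return self._data[name]

    def write(self, cell, name, value):
        self.writes.setdefault(cell, set()).add(name)
        self._data[name] = value


def refine(predicted, observed):
    """Compare statically predicted reads with the reads observed at runtime."""
    return {
        "missed": observed - predicted,  # dependencies static analysis did not predict
        "unused": predicted - observed,  # predicted dependencies that did not materialize
    }


store = ArtifactStore()
# An upstream cell produces three artifacts.
for name, value in {"a": 5, "c": 1, "d": 2}.items():
    store.write("cell1", name, value)

# A downstream cell reads `a` and then only one of `c` or `d`.
chosen = "c" if store.read("cell2", "a") < 10 else "d"
store.write("cell2", "b", store.read("cell2", chosen))

# Static analysis had to predict reads of a, c, and d for cell2.
result = refine({"a", "c", "d"}, store.reads["cell2"])
print(result)
```

In this run the predicted dependency on \texttt{d} does not materialize, which is the kind of information a scheduler could use to adapt its execution plan once the cell has finished.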