Pimentel et al.: "4.03% of notebooks on GitHub are reproducible"
Joel Grus: "For beginners, with dozens of cells and more complex code [the ability to run code snippets out of order] is utterly confusing."
A modest proposal...
So now...
and...
and...
... or worse ...
and...
and...
Why are you getting my hopes up?
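def social_link(base, provider="facebook.com"):
    # Normalize a profile handle or partial URL into a full https:// link.
    if base is None:
        return None
    # Upgrade plain http links to https.
    if base.startswith("http://"):
        base = base.replace("http://", "https://")
    if base.startswith("https://"):
        return base
    # A bare domain (with or without www.) only needs a scheme.
    if base.startswith(provider) or base.startswith(f"www.{provider}"):
        return "https://" + base
    # Otherwise treat the input as a handle on the provider's site.
    return f"https://{provider}/" + base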
vizierdb.export_module(social_link)
... but they're annoying
c = 19
b = 23
a = b + c
Writes: c
Writes: b
Reads: b, c; Writes: a
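Read and write sets like the ones above can be approximated statically. The following is a minimal sketch, assuming Python's standard `ast` module; the helper name `cell_reads_writes` is hypothetical and is not part of Vizier. It over-approximates: names local to functions or comprehensions are counted too, and it cannot see through `eval` or attribute mutation.

```python
import ast

def cell_reads_writes(source):
    """Approximate the names a cell reads and writes (hypothetical helper)."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)   # name appears as an assignment target
            elif isinstance(node.ctx, ast.Load):
                reads.add(node.id)    # name is looked up

    return reads, writes

# The three cells from the example above.
for cell in ["c = 19", "b = 23", "a = b + c"]:
    reads, writes = cell_reads_writes(cell)
    print(cell, "| Reads:", sorted(reads), "| Writes:", sorted(writes))
```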
🧹
Python's scoping rules are a mess.
x = 1
def foo():
    x = 2
    def bar():
        print(x)
    return bar
x += 10
baz = foo()
baz() # What is printed?
... fortunately we only care about cross-cell dependencies (for the most part).
import urllib.request as r
with r.urlopen("https://not.sus.com/code.py") as response:
    eval( response.read() )
???
... fortunately eval isn't a major part of notebook use.
import pandas as pd
pd.read_csv("myfile.csv")
maybe safe???
... fortunately libraries are usually good at abstracting.
Idea: Optimistic Concurrency Control.
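One way to read this idea: run a cell against a snapshot, trace what it actually reads, and commit its writes only if none of those inputs changed in the meantime; otherwise re-run it. The sketch below is illustrative only and is not Vizier's implementation. `TracedState`, `run_cell`, `commit`, and the integer version counters are all assumptions made for the example.

```python
class TracedState(dict):
    """A dict that records which keys are read (hypothetical helper)."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.reads = set()

    def __getitem__(self, key):
        self.reads.add(key)
        return super().__getitem__(key)

def run_cell(cell, snapshot):
    """Run a cell on a private copy of the state; return (reads, writes)."""
    traced = TracedState(snapshot)
    writes = cell(traced)          # the cell returns a dict of new values
    return traced.reads, writes

def commit(state, versions, snapshot_versions, reads, writes):
    """Validate, then apply writes. Returns False on a conflict."""
    for name in reads:
        # Abort if an input changed after the snapshot was taken.
        if versions.get(name, 0) != snapshot_versions.get(name, 0):
            return False
    for name, value in writes.items():
        state[name] = value
        versions[name] = versions.get(name, 0) + 1
    return True
```

A caller would record `dict(versions)` before calling `run_cell`, pass it to `commit` as `snapshot_versions`, and simply re-run the cell whenever `commit` returns `False`.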
(Work in progress)
System | Dependencies | Execution | Parallelism |
---|---|---|---|
Notebook | Unknown | Manual | None |
Workflows | Fully Known | DAG | |
Vizier | Bounded+Trace | ??? | |
How do we know when it is safe to reuse a result?
How do we know what is safe to parallelize?
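If each cell's read and write sets are known, one possible answer to the parallelism question is to group cells into stages whose members do not depend on one another. The sketch below assumes that every write creates a fresh version of a variable, so only read-after-write dependencies matter; the function name `parallel_stages` and the `(reads, writes)` representation are hypothetical.

```python
def parallel_stages(cells):
    """Group cells into stages that could run in parallel.

    cells: list of (reads, writes) pairs of sets, in notebook order.
    Returns a list of stages, each a list of cell indices.
    """
    last_writer = {}   # variable name -> index of the latest cell writing it
    level = []         # level[i] = stage number of cell i
    for idx, (reads, writes) in enumerate(cells):
        # A cell depends on the most recent earlier writer of each input.
        deps = {last_writer[name] for name in reads if name in last_writer}
        level.append(1 + max((level[d] for d in deps), default=-1))
        for name in writes:
            last_writer[name] = idx
    stages = {}
    for idx, lvl in enumerate(level):
        stages.setdefault(lvl, []).append(idx)
    return [stages[lvl] for lvl in sorted(stages)]

# Cells: c = 19, b = 23, a = b + c
cells = [(set(), {"c"}), (set(), {"b"}), ({"b", "c"}, {"a"})]
print(parallel_stages(cells))  # [[0, 1], [2]]
```

Here the first two cells share a stage because neither reads anything the other writes, while the third must wait for both.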
df = pd.read_csv("foo.csv")
(variable → version)
(e.g., $\{ \text{retail} \rightarrow 937, \text{markets} \rightarrow 252 \}$)
is the cell...
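A variable-to-version map suggests a simple reuse check: a cached result stays valid as long as every variable the cell read is still at the version it saw. This is a minimal sketch under that assumption; `can_reuse` is a hypothetical name, and the version numbers are the ones from the example above.

```python
def can_reuse(cached_inputs, current_versions):
    """True if every input the cell read is still at the version it saw.

    cached_inputs:    {variable: version} recorded when the cell last ran.
    current_versions: {variable: version} for the notebook right now.
    """
    return all(
        current_versions.get(name) == version
        for name, version in cached_inputs.items()
    )

seen = {"retail": 937, "markets": 252}
print(can_reuse(seen, {"retail": 937, "markets": 252}))  # True
print(can_reuse(seen, {"retail": 937, "markets": 253}))  # False
```

Variables the cell never read do not appear in `cached_inputs`, so changes to them do not invalidate the cached result.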
https://vizierdb.info
Mike Brachmann, Boris Glavic, Nachiket Deo, Stefan Muller, Juliana Freire, Heiko Mueller, Sonia Castello, Munaf Arshad Qazi, William Spoth, Poonam Kumari, Soham Patel, and more...