master
Boris Glavic 2019-06-25 19:00:55 -05:00
parent 0db40c9831
commit 98fe6697e8
25 changed files with 35383 additions and 0 deletions

demo_script.md Normal file

@ -0,0 +1,77 @@
Key things to emphasize during this script:
- Vizier is data-centric --- the medium of exchange between cells is data frames, rather than Python kernels, allowing us to mix different scripting languages in a single notebook.
- Vizier is multi-modal --- Better still, we can mix entirely different interaction modalities: Script, Spreadsheet, or Data Widget (plot, lens, etc...)
- Vizier is history-aware and collaborative --- Versioning comes entirely for free (versions can be shared)
- Vizier greatly aids iterative construction of data analysis and curation pipelines (this is how people do it!) and their reuse, through versioning (+branching), automatic refresh of dependent results on update (works because we are data-centric), and the error list (you have an overview of what problems still need to be addressed)
- Vizier enables data testing and debugging (lenses, error list, uncertainty tracking)
## Screen: Vizier Home
- Intro
- Hi, I'm _____. I'm going to show you Vizier, our tool for data exploration and curation.
- Creating a project
- I'm going to start by creating a new project [New project, give it a name]
- If you're familiar with notebooks like Jupyter or Zeppelin, you'll feel right at home. Vizier projects are notebooks.
## Basic Demo
- Loading Data
- We're going to start by loading a dataset. [ Load "New_York_City_Leading_Causes_of_Death_12_11_2018.csv" and name it "causes". Also make sure "Detect Headers" and "Detect Types" are checked ]
- We got this collection of cause of death statistics from the NYS open data portal. It has data on 2012 through 2014.
- Vizier does the usual things: It detects the types of each column, and the column names in the header row.
- Script and Plot Cells
- Vizier lets you do the standard things you'd expect a notebook to do. For example, I can use Python to find all of the distinct causes of death [Copy "Causes" script]
- But you're not tied to a single language. Vizier is data-centric: data flows between cells through data frames/tables/relations
- For example, you can create SQL cells. Here we'll use a query to compute the breakdown by gender [Copy "Group By Gender" script]. [Select table output]
- Tabular outputs are nice, but we can also plot this data. [ Create a plot cell ]
- In contrast to Jupyter and other notebooks, Vizier is data-centric: cells communicate by passing inputs and outputs. This enables automatic refresh of dependent results for immediate feedback on the effect of an update [Change SQL query by adding a condition]. As you can see, the plot was automatically refreshed.
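
The "Causes" and "Group By Gender" scripts themselves are not reproduced in this script. As a rough stand-in, the sketch below runs outside Vizier using only Python's standard library: it loads a few made-up rows into an in-memory SQLite table named `causes`, lists the distinct causes in Python, and computes the breakdown by sex in SQL. The column names (`year`, `cause`, `sex`, `deaths`), the sample values, and the helper names are assumptions for illustration, not the real dataset's schema or Vizier's API.

```python
import sqlite3

# Hypothetical sample rows: (year, cause, sex, deaths). Not real data.
ROWS = [
    (2012, "Heart Disease", "F", 10),
    (2012, "Heart Disease", "M", 12),
    (2013, "Influenza", "F", 3),
    (2013, "Influenza", "M", 4),
]


def load_causes(rows):
    """Create an in-memory table standing in for the 'causes' dataset."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE causes (year INTEGER, cause TEXT, sex TEXT, deaths INTEGER)"
    )
    conn.executemany("INSERT INTO causes VALUES (?, ?, ?, ?)", rows)
    return conn


def distinct_causes(conn):
    """Python-style step: collect the distinct causes, sorted by name."""
    return sorted({row[0] for row in conn.execute("SELECT cause FROM causes")})


def deaths_by_sex(conn):
    """SQL-style step: total deaths per sex."""
    query = "SELECT sex, SUM(deaths) FROM causes GROUP BY sex ORDER BY sex"
    return conn.execute(query).fetchall()


conn = load_causes(ROWS)
print(distinct_causes(conn))
print(deaths_by_sex(conn))
```

Both steps read the same table, which is the point to make on stage: the table, not a language runtime, is what the two cells share.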
- The Spreadsheet
- Another benefit of being data-centric is that at any point, you can just open up a view of the data. [ open spreadsheet view ]
- Not everyone is super comfortable with scripting languages. Or, sometimes it just doesn't make sense to write an entire script just to fix one or two bugs in the data. That's why Vizier also lets you edit your data cells directly like a spreadsheet. [ edit a few values ]
- One of the great things about notebooks is that they document what operations have been applied to the data. Vizier is no different [ switch to notebook view ]. However, Vizier takes this idea to the next level: every edit that happens in the spreadsheet is also recorded in the notebook through a language we call Vizual. Every transformation you do in the spreadsheet maps to a Vizual operation.
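
To make the idea of "edits recorded as operations" concrete, here is a minimal sketch of an operation log that can be replayed over a table. It is not Vizual's actual syntax or Vizier's implementation; the `update_cell` record format, the `replay` helper, and the sample table are all hypothetical.

```python
import copy

# Hypothetical table: a list of rows, each a dict of column -> value.
TABLE = [
    {"cause": "Heart Disease", "deaths": 10},
    {"cause": "Influenza", "deaths": 3},
]


def update_cell(row, column, value):
    """Build a record describing one spreadsheet edit (illustrative format)."""
    return {"op": "update_cell", "row": row, "column": column, "value": value}


def replay(table, log):
    """Apply the logged edits in order to a copy of the table."""
    result = copy.deepcopy(table)
    for entry in log:
        if entry["op"] != "update_cell":
            raise ValueError(f"unknown operation: {entry['op']}")
        result[entry["row"]][entry["column"]] = entry["value"]
    return result


# Two edits made "in the spreadsheet" become two entries in the log.
log = [update_cell(0, "deaths", 11), update_cell(1, "cause", "Flu")]
edited = replay(TABLE, log)
print(edited)
```

Because the original table is never modified, replaying only a prefix of the log (for example `replay(TABLE, log[:1])`) reproduces an earlier state, which is one simple way to think about going back to an earlier version.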
- History
- Another cool feature of Vizier is built-in versioning. Every change you make gets recorded. [ open time travel view ] The time travel view lets you go back to any earlier version of the notebook to see what happened.
- You can view a snapshot of the notebook [ open the snapshot before the spreadsheet edits ], for example the one from right before we changed anything in the spreadsheet.
- This is a read-only view, but you can pick up editing from here by branching the notebook. [ hit the branch button ]. This is useful for restarting development from a stable version when iterative exploration has taken the pipeline in an unproductive direction.
- You can also share a read-only snapshot of the notebook, either with collaborators or so that another researcher can reproduce your results [ click the share button ]
- Lenses and the Error List
- In addition to script and plot cells, Vizier provides a set of data validation and cleaning operations called Lenses. Lenses specify a constraint over the data, identify violations, and do their best to repair any violations.
- Let's see an example. In this dataset [point to plot], the data is broken down by sex into 'M' and 'F'. We can register this assumption with Vizier: [ Create a Missing Value lens with the constraint in "SEX Constraint" immediately after the load data step ].
- When the constraint is violated, Vizier records the fact in a convenient error list [ open the error list ].
- Fortunately, all of the data passes the constraint we defined: the SEX column shows no errors. However, you can see a few fixes that Vizier had to apply to the source data. For example, it reinterpreted a bunch of records that were literal period characters ('.') as NULLs (the LOAD DATASET cell automatically creates lenses). Like other data cleaning techniques, repairs applied by lenses are often only best-effort guesses based on whatever information is available in the dataset. That is, they are not guaranteed to be correct. Vizier automatically marks such data as uncertain [switch to spreadsheet and show red cell].
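
The three ingredients described above (a constraint, a list of violations, and repairs marked as uncertain) can be illustrated with a small standalone sketch. This is a loose illustration of the idea only, not how Vizier's lenses are implemented; the `domain_lens` function, its parameters, and the sample rows are hypothetical.

```python
ALLOWED_SEX = {"M", "F"}


def domain_lens(rows, column, allowed, null_tokens=(".",)):
    """Check a simple domain constraint on one column.

    Returns (repaired_rows, errors, uncertain):
    - values listed in `null_tokens` are replaced by None and flagged uncertain
    - any other value outside `allowed` is reported as an error, unrepaired
    """
    repaired, errors, uncertain = [], [], []
    for index, row in enumerate(rows):
        new_row = dict(row)  # leave the input rows untouched
        value = new_row[column]
        if value in null_tokens:
            new_row[column] = None  # best-effort repair, so mark it uncertain
            uncertain.append((index, column))
        elif value not in allowed:
            errors.append((index, column, value))
        repaired.append(new_row)
    return repaired, errors, uncertain


rows = [{"sex": "M"}, {"sex": "."}, {"sex": "F"}, {"sex": "Female"}]
repaired, errors, uncertain = domain_lens(rows, "sex", ALLOWED_SEX)
print(errors)
print(uncertain)
```

The `errors` list plays the role of the error list, and `uncertain` plays the role of the red cells in the spreadsheet view.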
- Data Testing and Reuse
- Let's say that our sources provide us with a data update. For example, here's a copy of the same dataset, downloaded from the NYS open data portal a week later. This new dataset adds data from 2015 and 2016. [ Change the uploaded file to "New_York_City_Leading_Causes_of_Death_12_18_2018.csv" ]. Reusing the pipeline we have constructed for a new dataset is trivial in Vizier, because results that depend on this dataset in the remainder of the notebook are automatically refreshed.
- Now let's see if we broke anything [ Switch to errors tab ].
- Whoops... looks like the SEX column now has a bunch of values that violate our assumption. Let's see what's up [ click the "go to error" button ]. The year here is 2016... that seems odd. Part of the new data.
- Let's see the original values. [ go to the notebook, open the "causes" table in the output of the load data cell. Jump to the last page. ]
- Anyone want to guess what's wrong? (it's 'Female'/'Male' in 2015/16 vs 'F'/'M' in 2012-14)
- Let's fix that. [ Add a Python/SQL cell after the load data step and copy in "Repair Code" or "Repair Query" ]
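
The "Repair Code" and "Repair Query" scripts are not included here. A minimal sketch of what such a repair might look like, assuming rows are plain dictionaries with a `sex` column (an assumption, as are the function names), is to map the long-form labels onto the short codes:

```python
# Map the long-form labels from the newer rows onto the short codes
# used by the older rows. Keys are lowercase for case-insensitive lookup.
SEX_CODES = {"female": "F", "male": "M", "f": "F", "m": "M"}


def normalize_sex(value):
    """Return 'F' or 'M' for recognized labels; otherwise return the value unchanged."""
    if value is None:
        return None
    return SEX_CODES.get(str(value).strip().lower(), value)


def repair_rows(rows, column="sex"):
    """Return new rows with the given column normalized."""
    return [{**row, column: normalize_sex(row[column])} for row in rows]


rows = [
    {"year": 2014, "sex": "F"},
    {"year": 2016, "sex": "Female"},
    {"year": 2016, "sex": "Male"},
]
fixed = repair_rows(rows)
print([row["sex"] for row in fixed])
```

Unrecognized values are passed through unchanged rather than guessed at, so they would still show up as constraint violations in the error list.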
- Matplotlib
- We can also explore the data with Matplotlib [ "Plot by ethnicity" ]
## Other Demos
See: https://docs.google.com/spreadsheets/d/1ikRZtk0p8qOM4mE0-2K5bfdY296dr3ZV88Gs8muVuRw/edit#gid=0
Other demos and points to emphasize
* Buffalo Tech Salaries:
- Import from Google Sheets
- Red text highlights values that are uncertain (e.g., see "by_language" dataset)
* Home Sensing
- Missing Value lens repairs error values
* SEC Filings
- Multi-modal interface: Simulation in Python + GUI data visualization
* Buffalo Census Plots
- Bokeh integration (interactive plots)

Binary file not shown.

File diff suppressed because it is too large
