From 096b4bf405cf148e2812a78aa9aff7f5605b2bc3 Mon Sep 17 00:00:00 2001
From: Oliver Kennedy
Date: Tue, 29 Jan 2019 19:23:26 -0500
Subject: [PATCH] Drafting the 724 page

---
 src/teaching/cse-7xx/2019sp.erb | 81 +++++++++++++++++++++++++++++++++
 src/teaching/cse-7xx/2019sp.md  |  1 -
 2 files changed, 81 insertions(+), 1 deletion(-)
 create mode 100644 src/teaching/cse-7xx/2019sp.erb
 delete mode 100644 src/teaching/cse-7xx/2019sp.md

diff --git a/src/teaching/cse-7xx/2019sp.erb b/src/teaching/cse-7xx/2019sp.erb
new file mode 100644
index 00000000..7bb1967f
--- /dev/null
+++ b/src/teaching/cse-7xx/2019sp.erb
@@ -0,0 +1,81 @@
+---
+title: Database Systems for Data Quality and Curation
+papers:
+  - title: "Wrangler: Interactive visual specification of data transformation scripts"
+    url: https://dl.acm.org/citation.cfm?id=1979444
+    abstract: |
+      Though data analysis tools continue to improve, analysts still expend an inordinate amount of time and effort manipulating data and assessing data quality issues. Such "data wrangling" regularly involves reformatting data values or layout, correcting erroneous or missing values, and integrating multiple data sources. These transforms are often difficult to specify and difficult to reuse across analysis tasks, teams, and tools. In response, we introduce Wrangler, an interactive system for creating data transformations. Wrangler combines direct manipulation of visualized data with automatic inference of relevant transforms, enabling analysts to iteratively explore the space of applicable operations and preview their effects. Wrangler leverages semantic data types (e.g., geographic locations, dates, classification codes) to aid validation and type conversion. Interactive histories support review, refinement, and annotation of transformation scripts. User study results show that Wrangler significantly reduces specification time and promotes the use of robust, auditable transforms instead of manual editing.
+  - title: "Descriptive and Prescriptive Data Cleaning"
+    url: https://cs.uwaterloo.ca/~ilyas/papers/AnupSIGMOD2014.pdf
+    abstract: |
+      Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations using some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some target report generated by transformations over multiple data sources. This creates a situation where the violations detected in the report are decoupled in space and time from the actual source of errors. In addition, applying the repair on the report would need to be repeated whenever the data sources change. Finally, even if repairing the report is possible and affordable, this would be of little help towards identifying and analyzing the actual sources of errors for future prevention of violations at the target. In this paper, we propose a system to address this decoupling. The system takes quality rules defined over the output of a transformation and computes explanations of the errors seen on the output. This is performed both at the target level to describe these errors and at the source level to prescribe actions to solve them. We present scalable techniques to detect, propagate, and explain errors. We also study the effectiveness and efficiency of our techniques using the TPC-H Benchmark for different scenarios and classes of quality rules.
+  - title: "Incremental knowledge base construction using deepdive"
+    url: https://dl.acm.org/citation.cfm?id=2809991
+    abstract: |
+      Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.
+schedule:
+  - date: Feb 6
+  - date: Feb 13
+  - date: Feb 20
+  - date: Feb 27
+  - date: Mar 6
+  - date: Mar 13
+    event: Spring Break
+  - date: Mar 20
+  - date: Mar 27
+  - date: Apr 3
+  - date: Apr 10
+  - date: Apr 17
+  - date: Apr 24
+  - date: May 1
+  - date: May 8
+---
+
+<h1><%= title %></h1>
+
+This course will survey research systems, tools, and techniques for data quality management. Aspects of data quality covered will include interfaces and human-in-the-loop approaches to data refinement, techniques for schema and structure detection, and automated data ingest.
+
+<h2>Details</h2>
+
+The course is graded Sat/Unsat. For a satisfactory grade:
+
+<h2>Schedule</h2>
+
+<table>
+  <% schedule.each do |event| %>
+    <tr>
+      <td><%= event["date"] %></td>
+      <td>
+        <% if event.has_key? "event" %>
+          <%= event["event"] %>
+        <% elsif event.has_key? "speakers" %>
+        <% else %>
+          No One Signed Up (yet)
+        <% end %>
+      </td>
+    </tr>
+  <% end %>
+</table>
+
+<h2>Suggested Papers</h2>
+
+<div>
+<% papers.each do |paper| %>
+  <div>
+    <a href="<%= paper["url"] %>"><%= paper["title"] %></a>
+    <div>
+      <%= paper.fetch("abstract", "[Abstract Missing]") %>
+    </div>
+  </div>
+<% end %>
+</div>
\ No newline at end of file
diff --git a/src/teaching/cse-7xx/2019sp.md b/src/teaching/cse-7xx/2019sp.md
deleted file mode 100644
index 411e0b4b..00000000
--- a/src/teaching/cse-7xx/2019sp.md
+++ /dev/null
@@ -1 +0,0 @@
-Content to appear soon(tm)
\ No newline at end of file