Drafting the 724 page

Oliver Kennedy 2019-01-29 19:23:26 -05:00
parent e49318c930
commit 096b4bf405
2 changed files with 81 additions and 1 deletions

@@ -0,0 +1,81 @@
---
title: Database Systems for Data Quality and Curation
papers:
  - title: "Wrangler: Interactive visual specification of data transformation scripts"
    url: https://dl.acm.org/citation.cfm?id=1979444
    abstract: |
      Though data analysis tools continue to improve, analysts still expend an inordinate amount of time and effort manipulating data and assessing data quality issues. Such "data wrangling" regularly involves reformatting data values or layout, correcting erroneous or missing values, and integrating multiple data sources. These transforms are often difficult to specify and difficult to reuse across analysis tasks, teams, and tools. In response, we introduce Wrangler, an interactive system for creating data transformations. Wrangler combines direct manipulation of visualized data with automatic inference of relevant transforms, enabling analysts to iteratively explore the space of applicable operations and preview their effects. Wrangler leverages semantic data types (e.g., geographic locations, dates, classification codes) to aid validation and type conversion. Interactive histories support review, refinement, and annotation of transformation scripts. User study results show that Wrangler significantly reduces specification time and promotes the use of robust, auditable transforms instead of manual editing.
  - title: "Descriptive and Prescriptive Data Cleaning"
    url: https://cs.uwaterloo.ca/~ilyas/papers/AnupSIGMOD2014.pdf
    abstract: |
      Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations using some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some target report generated by transformations over multiple data sources. This creates a situation where the violations detected in the report are decoupled in space and time from the actual source of errors. In addition, applying the repair on the report would need to be repeated whenever the data sources change. Finally, even if repairing the report is possible and affordable, this would be of little help towards identifying and analyzing the actual sources of errors for future prevention of violations at the target. In this paper, we propose a system to address this decoupling. The system takes quality rules defined over the output of a transformation and computes explanations of the errors seen on the output. This is performed both at the target level to describe these errors and at the source level to prescribe actions to solve them. We present scalable techniques to detect, propagate, and explain errors. We also study the effectiveness and efficiency of our techniques using the TPC-H Benchmark for different scenarios and classes of quality rules.
  - title: "Incremental knowledge base construction using DeepDive"
    url: https://dl.acm.org/citation.cfm?id=2809991
    abstract: |
      Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.
schedule:
  - date: Feb 6
  - date: Feb 13
  - date: Feb 20
  - date: Feb 27
  - date: Mar 6
  - date: Mar 13
    event: Spring Break
  - date: Mar 20
  - date: Mar 27
  - date: Apr 3
  - date: Apr 10
  - date: Apr 17
  - date: Apr 24
  - date: May 1
  - date: May 8
---
<h2><%= title %></h2>
This course will survey research systems, tools, and techniques for data quality management. Aspects of data quality covered will include interfaces and human-in-the-loop approaches to data refinement, techniques for schema and structure detection, and automated data ingest.
<h2>Details</h2>
<ul>
<li><b>Instructor: </b> Oliver Kennedy (Capen 211; Inside the Library)</li>
<li><b>Office Hours: </b> Weds: 10AM-Noon</li>
<li><b>Seminar Time: </b> Weds: 2:00-3:30</li>
<li><b>Seminar Location: </b><a href="http://www.buffalo.edu/home/visiting-ub/CampusMaps/maps.html#CLEMEN">Clemens</a> 117</li>
</ul>
The course is graded Satisfactory/Unsatisfactory. For a satisfactory grade, you must:
<ul>
<li>Present at least one paper listed below</li>
<li>Attend class (at most 3 absences, regardless of reason)</li>
<li>Come prepared: read the paper beforehand and be ready to participate in discussing it. If necessary, I will call on students randomly to answer questions.</li>
</ul>
<h2>Schedule</h2>
<table style="padding-left: 30px;">
<% schedule.each do |event| %>
<tr>
<th style="text-align: right; padding-right: 30px;"><%= event["date"] %></th>
<td><% if event.has_key? "event" %>
<b><%= event["event"] %></b>
<% elsif event.has_key? "speakers" %>
<ul><% event["speakers"].each do |speaker| %>
<li><%= speaker["name"] %> presents <a href="<%= speaker["url"]%>"><%= speaker["title"] %></a><br/></li>
<% end %></ul>
<% else %>
No One Signed Up (yet)
<% end %></td>
</tr>
<% end %>
</table>
<h2>Suggested Papers</h2>
<dl>
<% papers.each do |paper| %>
<dt style="margin-top: 30px;"><a href="<%= paper["url"]%>"><%= paper["title"] %></a></dt>
<dd style="max-width: 600px; padding-left: 30px;"><%= paper.fetch("abstract", "[Abstract Missing]") %></dd>
<% end %>
</dl>
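
The schedule loop in the template above has three branches (a named event, a list of speakers, or the "No One Signed Up (yet)" fallback), but the committed front matter only exercises the first and last. The following is a minimal sketch, not the site's actual build code, of how a `speakers` entry might look and how the three branches render. It assumes the front matter is parsed with Ruby's standard `yaml` library and the body is rendered with the standard `erb` library; the `render_schedule` helper, the trimmed-down template, and the "Example Student" entry are all hypothetical and exist only for illustration.

```ruby
require "yaml"
require "erb"

# Hypothetical front matter. The "speakers" entry is invented for illustration;
# the committed schedule only contains dates and the Spring Break event.
FRONT_MATTER = <<~'YAML'
  schedule:
    - date: Feb 6
      speakers:
        - name: Example Student
          title: "Wrangler: Interactive visual specification of data transformation scripts"
          url: https://dl.acm.org/citation.cfm?id=1979444
    - date: Mar 13
      event: Spring Break
    - date: Mar 20
YAML

# A trimmed copy of the schedule loop from the template (inline styles omitted).
SCHEDULE_TEMPLATE = <<~'TEMPLATE'
  <table>
  <% schedule.each do |event| %>
  <tr>
    <th><%= event["date"] %></th>
    <td><% if event.key? "event" %>
      <b><%= event["event"] %></b>
    <% elsif event.key? "speakers" %>
      <ul><% event["speakers"].each do |speaker| %>
        <li><%= speaker["name"] %> presents <a href="<%= speaker["url"] %>"><%= speaker["title"] %></a></li>
      <% end %></ul>
    <% else %>
      No One Signed Up (yet)
    <% end %></td>
  </tr>
  <% end %>
  </table>
TEMPLATE

# Hypothetical helper: parse the YAML and expose "schedule" to the template
# as a local variable, mirroring how the page template refers to it.
def render_schedule(front_matter_yaml)
  data = YAML.safe_load(front_matter_yaml)
  ERB.new(SCHEDULE_TEMPLATE).result_with_hash(schedule: data["schedule"])
end

html = render_schedule(FRONT_MATTER)
puts html
```

Under these assumptions, an entry with an `event` key wins over `speakers` because it is tested first, and each speaker needs `name`, `title`, and `url` keys for the link to render completely.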

@@ -1 +0,0 @@
Content to appear soon(tm)