From 096b4bf405cf148e2812a78aa9aff7f5605b2bc3 Mon Sep 17 00:00:00 2001
From: Oliver Kennedy
Date: Tue, 29 Jan 2019 19:23:26 -0500
Subject: [PATCH] Drafting the 724 page

---
 src/teaching/cse-7xx/2019sp.erb | 81 +++++++++++++++++++++++++++++++++
 src/teaching/cse-7xx/2019sp.md  |  1 -
 2 files changed, 81 insertions(+), 1 deletion(-)
 create mode 100644 src/teaching/cse-7xx/2019sp.erb
 delete mode 100644 src/teaching/cse-7xx/2019sp.md

diff --git a/src/teaching/cse-7xx/2019sp.erb b/src/teaching/cse-7xx/2019sp.erb
new file mode 100644
index 00000000..7bb1967f
--- /dev/null
+++ b/src/teaching/cse-7xx/2019sp.erb
@@ -0,0 +1,81 @@
+---
+title: Database Systems for Data Quality and Curation
+papers:
+  - title: "Wrangler: Interactive visual specification of data transformation scripts"
+    url: https://dl.acm.org/citation.cfm?id=1979444
+    abstract: |
+      Though data analysis tools continue to improve, analysts still expend an inordinate amount of time and effort manipulating data and assessing data quality issues. Such "data wrangling" regularly involves reformatting data values or layout, correcting erroneous or missing values, and integrating multiple data sources. These transforms are often difficult to specify and difficult to reuse across analysis tasks, teams, and tools. In response, we introduce Wrangler, an interactive system for creating data transformations. Wrangler combines direct manipulation of visualized data with automatic inference of relevant transforms, enabling analysts to iteratively explore the space of applicable operations and preview their effects. Wrangler leverages semantic data types (e.g., geographic locations, dates, classification codes) to aid validation and type conversion. Interactive histories support review, refinement, and annotation of transformation scripts. User study results show that Wrangler significantly reduces specification time and promotes the use of robust, auditable transforms instead of manual editing.
+  - title: "Descriptive and Prescriptive Data Cleaning"
+    url: https://cs.uwaterloo.ca/~ilyas/papers/AnupSIGMOD2014.pdf
+    abstract: |
+      Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations using some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some target report generated by transformations over multiple data sources. This creates a situation where the violations detected in the report are decoupled in space and time from the actual source of errors. In addition, applying the repair on the report would need to be repeated whenever the data sources change. Finally, even if repairing the report is possible and affordable, this would be of little help towards identifying and analyzing the actual sources of errors for future prevention of violations at the target. In this paper, we propose a system to address this decoupling. The system takes quality rules defined over the output of a transformation and computes explanations of the errors seen on the output. This is performed both at the target level to describe these errors and at the source level to prescribe actions to solve them. We present scalable techniques to detect, propagate, and explain errors. We also study the effectiveness and efficiency of our techniques using the TPC-H Benchmark for different scenarios and classes of quality rules.
+  - title: "Incremental knowledge base construction using deepdive"
+    url: https://dl.acm.org/citation.cfm?id=2809991
+    abstract: |
+      Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.
+schedule:
+  - date: Feb 6
+  - date: Feb 13
+  - date: Feb 20
+  - date: Feb 27
+  - date: Mar 6
+  - date: Mar 13
+    event: Spring Break
+  - date: Mar 20
+  - date: Mar 27
+  - date: Apr 3
+  - date: Apr 10
+  - date: Apr 17
+  - date: Apr 24
+  - date: May 1
+  - date: May 8
+---
+
+<h1><%= title %></h1>
+
+This course will survey research systems, tools, and techniques for data quality management. Aspects of data quality covered will include interfaces and human-in-the-loop approaches to data refinement, techniques for schema and structure detection, and automated data ingest.
+
+<h2>Details</h2>
+
+The course is graded Sat/Unsat. For a satisfactory grade:
+
+<h2>Schedule</h2>
+
+<table>
+  <% schedule.each do |event| %>
+    <tr>
+      <td><%= event["date"] %></td>
+      <td>
+        <% if event.has_key? "event" %>
+          <%= event["event"] %>
+        <% elsif event.has_key? "speakers" %>
+        <% else %>
+          No One Signed Up (yet)
+        <% end %>
+      </td>
+    </tr>
+  <% end %>
+</table>
+
+<h2>Suggested Papers</h2>
+
+<div>
+<% papers.each do |paper| %>
+  <div>
+    <a href="<%= paper["url"] %>"><%= paper["title"] %></a>
+    <div>
+      <%= paper.fetch("abstract", "[Abstract Missing]") %>
+    </div>
+  </div>
+<% end %>
+</div>
\ No newline at end of file
diff --git a/src/teaching/cse-7xx/2019sp.md b/src/teaching/cse-7xx/2019sp.md
deleted file mode 100644
index 411e0b4b..00000000
--- a/src/teaching/cse-7xx/2019sp.md
+++ /dev/null
@@ -1 +0,0 @@
-Content to appear soon(tm)
\ No newline at end of file