---
title: The UBDB Seminar - Spring 2016
schedule:
  - when: Feb. 1
    what: Brainstorming Joint Project Ideas
    who: The UB Database Group
    where: Davis 113A
  - when: Feb. 8
    what: Reenacting Transactional Histories to Compute Their Provenance
    who: Boris Glavic (IIT)
    site: http://www.cs.iit.edu/~glavic/
    where: Davis 113A
    details:
      abstract: |
        Provenance for database queries, information about how the outputs of a query were derived from its inputs, has recently gained traction in the database community, resulting in the development of several models and their implementation in prototype systems. However, there is currently no system or model that supports transactional updates, limiting the applicability of provenance to databases that are never updated. In this talk, I introduce reenactment, a novel declarative replay technique for transactional histories, and demonstrate how reenactment can be used to retroactively compute the provenance of past updates, transactions, and histories.<br/>
        The foundation of this research is MV-semirings, our extension of the well-established semiring provenance model for queries to updates and transactions running under multi-versioning concurrency control protocols. In this model, any transactional history (or part thereof) can be simulated through a query, i.e., any state of a relation R produced by a history can be reconstructed by a query. We call this process reenactment. More formally, the reenactment query for a transactional history H is equivalent (in the sense of query equivalence) to the history under MV-semiring semantics.<br/>
        These formal underpinnings are the basis of an efficient approach for computing the provenance of past transactions using a standard relational DBMS. I will show how reenactment queries can be constructed from an audit log, a log of past SQL operations, and how queries with MV-semiring semantics can be encoded as standard relational queries. A naive implementation would require either replay of the complete history from the beginning or proactive materialization of provenance while transactions run. However, as long as a transaction time history is available, reenactment can be started from any past history state. Since most modern DBMS support audit logs and time travel (querying transaction time histories) out of the box, and these features incur only moderate overhead on transaction execution, this approach enables efficient provenance computation for transactions on top of standard database systems. I present encouraging experimental results based on our implementation of these techniques in GProM, our Generic Provenance Middleware.
      bio: |
        Boris Glavic is an Assistant Professor of Computer Science at the Illinois Institute of Technology where he leads the IIT database group (<a href="http://www.cs.iit.edu/%7Edbgroup/">http://www.cs.iit.edu/~dbgroup/</a>). Before coming to IIT, Boris spent two years as a PostDoc in the <a href="http://www.cs.toronto.edu/">Department of Computer Science</a> at the <a href="http://www.utoronto.ca/">University of Toronto</a>, working in the <a href="http://dblab.cs.toronto.edu/home/">Database Research Group</a> under <a href="http://www.cs.toronto.edu/%7Emiller">Renée J. Miller</a>. He received a Diploma (Master) in Computer Science from <a href="http://www.informatik.rwth-aachen.de/">RWTH Aachen</a> in Germany, and a PhD in Computer Science from the University of Zurich in Switzerland, advised by <a href="http://www.ifi.uzh.ch/dbtg/Staff/Boehlen">Michael Böhlen</a> and <a href="http://people.inf.ethz.ch/alonso/">Gustavo Alonso</a>. Boris is a professed database guy who enjoys systems research based on solid theoretical foundations. His main research interests are provenance and information integration. He has built several provenance-aware systems (see <a href="http://cs.iit.edu/%7Edbgroup/research/index.html">http://cs.iit.edu/~dbgroup/research/index.html</a>) including Perm (relational databases), Ariadne (stream processing), GProM (database provenance middleware), Vagabond, and LDV (database virtualization and repeatability).
  - when: Feb. 15
    who: Zack Ives (UPenn)
    site: http://www.cis.upenn.edu/~zives/
    what: Rethinking the Database for the Data Science Era
    where: Davis 113A
    details:
      abstract: |
        The relational DBMS was developed to address the core needs of the enterprise: it provides a rigid schema to ensure data consistency, provides query optimization for flexibility, and supports high-throughput transactions or bigger-than-memory analytics. Over time the database community has pushed into many new frontiers, extending out to Web data and beyond, but most people think of a DBMS as enabling roughly this same set of core capabilities.<br/>
        In the era of data science, the Web, and the cloud -- where machine learning techniques, data in files, and robust MapReduce-style distributed compute platforms are all the rage -- the question is whether the "common core" data needs have fundamentally changed. I will describe our (still evolving) answer to this question, which is that the "new DBMS" should be focused on iteratively improving and integrating unstructured data to make it amenable to analysis, on making the data more useful through annotation and provenance, and on enabling interaction among users, algorithms, and the broader community. I will describe our experiences in trying to foster community-scale data science (for the neuroscience domain) with these techniques.
      bio: |
        Zachary Ives is a Professor and Markowitz Faculty Fellow at the University of Pennsylvania. He received his PhD from the University of Washington. His research interests include data integration and sharing, managing "big data," sensor networks, and data provenance and authoritativeness. He is a recipient of the NSF CAREER award, and an alumnus of the DARPA Computer Science Study Panel and Information Science and Technology advisory panel. He has also been awarded the Christian R. and Mary F. Lindback Foundation Award for Distinguished Teaching. He serves as the Director for Penn's Singh Program in Networked and Social Systems Engineering, and he is a Penn Engineering Fellow. He is a co-author of the textbook Principles of Data Integration, and received an ICDE 2013 ten-year Most Influential Paper award. He has been an Associate Editor for Proceedings of the VLDB Endowment (2014) and a Program Co-Chair for SIGMOD (2015).
  - when: Feb. 22
    what: Large-Scale Machine Learning With The SimSQL System
    who: Chris Jermaine (Rice University)
    site: http://www.cs.rice.edu/~cmj4/
    where: Davis 113A
    details:
      abstract: |
        In this talk, I'll describe the SimSQL system, which is a platform for writing and executing statistical codes over large data sets, particularly for machine learning applications. Codes that run on SimSQL can be written in a very high-level, declarative language called Buds. A Buds program looks a lot like a mathematical specification of an algorithm, and statistical codes written in Buds are often just a few lines long.<br/>
        At its heart, SimSQL is really a relational database system, and like other relational systems, SimSQL is designed to support data independence. That is, a single declarative code for a particular statistical inference problem can be used regardless of data set size, compute hardware, and physical data storage and distribution across machines. One concern is that a platform supporting data independence will not perform well. But we've done extensive experimentation, and have found that SimSQL performs as well as other competitive platforms that support writing and executing machine learning codes for large data sets.
      bio: |
        Chris Jermaine is an associate professor of computer science at Rice University. He is the recipient of an Alfred P. Sloan Foundation Research Fellowship, a National Science Foundation CAREER award, and an ACM SIGMOD Best Paper Award. In his spare time, Chris enjoys outdoor activities such as hiking, climbing, and whitewater boating. In one particular exploit, Chris and his wife floated a whitewater raft (home-made from scratch using a sewing machine, glue, and plastic) over 100 miles down the Nizina River (and beyond) in Alaska.
  - when: Feb. 29
    what: Too many VLDB submissions to meet
  - when: Mar. 7
    what: Convergent Inference with Leaky Joins
    who: Ying Yang
    url: /papers/2016/VLDB-Inference-submitted.pdf
  - when: Mar. 14
    what: <b>Spring Break</b>
  - when: Mar. 21
    who: Wolfgang Gatterbauer (CMU)
    site: http://www.andrew.cmu.edu/user/gatt/
    what: Approximate lifted inference with probabilistic databases
    where: Davis 113A
    details:
      abstract: |
        Probabilistic inference over large data sets is becoming a central data management problem. Recent large knowledge bases, such as Yago, Nell, or DeepDive, have millions to billions of uncertain tuples. Yet probabilistic inference is known to be #P-hard in the size of the database, even for some very simple queries. This talk shows a new approach that allows ranking answers to hard probabilistic queries in guaranteed polynomial time, using only basic operators of existing database management systems (e.g., no sampling required).<br/>
        (1) The first part of this talk develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e., when the new probabilities are chosen independently of the probabilities of all other variables. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space.<br/>
        (2) The second part then draws the connection to lifted inference and shows how application of this theory allows a standard relational database management system to both upper and lower bound hard probabilistic queries in guaranteed polynomial time. We give experimental evidence on synthetic TPC-H data that our approach is orders of magnitude faster and also more accurate than currently used sampling-based approaches.<br/>
        (Talk based on joint work with Dan Suciu from TODS 2014 and VLDB 2015: http://arxiv.org/abs/1409.6052, http://arxiv.org/pdf/1412.1069)
      bio: |
        Wolfgang Gatterbauer is an Assistant Professor in Business Technologies and Computer Science at CMU. His current research focus is on scalable approaches to performing inference over uncertain data. He received degrees in Mechanical Engineering, Electrical Engineering &amp; Computer Science, and Technology &amp; Policy, and then earned his PhD in Computer Science from Vienna University of Technology. Prior to joining CMU, he was a Post-Doc in the Database group at the University of Washington. In earlier times, he won a Bronze medal at the International Physics Olympiad, worked in the steam turbine development department of ABB Alstom Power, and in the German office of McKinsey &amp; Company.
  - when: Mar. 28
    what: Data generation for testing and grading SQL queries
    who: Gökhan Kul
    url: http://dl.acm.org/citation.cfm?id=2846644
  - when: Apr. 11
    what: TBD
  - when: Apr. 18
    who: Ihab Ilyas (Waterloo)
    site: https://cs.uwaterloo.ca/~ilyas/
    what: Data Cleaning from Theory to Practice
    details:
      abstract: |
        With decades of research on the various aspects of data cleaning, multiple technical challenges have been tackled and interesting results have been published in many research papers. Example quality problems include missing values, functional dependency violations, and duplicate records. Unfortunately, very little success can be claimed in adopting any of these results in practice. Businesses and enterprises are building silos of home-grown data curation solutions under various names, often referred to as ETL layers in the business intelligence stack. The impedance mismatch between the challenges faced in industry and the challenges tackled in research papers explains to a large extent the growing gap between the two worlds. In this talk I claim that being pragmatic in developing data cleaning solutions does not necessarily mean being unprincipled or ad hoc. I discuss a subset of these practical challenges, including data ownership, human involvement, and holistic data quality concerns. This new set of challenges often hinders current research proposals from being adopted in the real world. I also give a quick overview of the approach we use at Tamr (a data curation startup) to tackle these challenges.
      bio: |
        Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo. He received his PhD in computer science from Purdue University, West Lafayette. His main research is in the area of database systems, with special interest in data quality, managing uncertain data, rank-aware query processing, and information extraction. Ihab is a recipient of the Ontario Early Researcher Award (2009), a Cheriton Faculty Fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is a co-founder of Tamr, a startup focusing on large-scale data integration and cleaning. He serves on the VLDB Board of Trustees, and he is an associate editor of ACM Transactions on Database Systems (TODS).
    where: Davis 113A
    note: Wants 3PM-4PM blocked off for a conference call.
  - when: Apr. 25
    what: Probabilistic Databases (title TBD)
    who: Niccolò Meneghetti
  - when: May 2
    what: TBD
    candidates:
      - who: Amol Deshpande
      - who: Boris Glavic
      - who: Daisy Wang
      - who: Dan Suciu
---
<p>The UBDB seminar meets on Mondays at 10:30 AM, typically in Davis 113A. Subscribe to cse-database-list for more details. </p>