Merge branch 'master' of gitlab.odin.cse.buffalo.edu:odin-lab/Website

This commit is contained in:
Oliver Kennedy 2018-03-31 19:58:38 -04:00
commit 3204fe4ac3
13 changed files with 152 additions and 20 deletions


@@ -139,16 +139,16 @@
}
],
"reviewer" : [
- { "venue" : "VLDBJ", "years" : [ 2013, 2017 ] },
+ { "venue" : "VLDBJ", "years" : [ 2013, 2017, 2018 ] },
{ "venue" : "TKDE", "years" : [ 2013, 2014 ] },
{ "venue" : "TODS", "years" : [ 2015 ] },
{ "venue" : "CSE", "years" : [ 2015 ] },
{ "venue" : "pVLDB", "track" : "PhD", "years" : [ 2013 ], "pc": true },
- { "venue" : "pVLDB", "track" : "Demo", "years" : [ 2016, 2017, 2018 ], "pc": true },
+ { "venue" : "pVLDB", "track" : "Demo", "years" : [ 2016, 2017, 2018, 2019 ], "pc": true },
{ "venue" : "pVLDB", "years" : [ 2017, 2018 ], "pc": true },
{ "venue" : "SIGMOD", "years" : [ 2015, 2016, 2017, 2019 ], "pc": true },
{ "venue" : "PWEEK", "years" : [ 2016 ], "pc": true },
- { "venue" : "HILDA", "years" : [ 2016, 2017 ], "pc": true },
+ { "venue" : "HILDA", "years" : [ 2016, 2017, 2018 ], "pc": true },
{ "venue" : "TOIT", "years" : [ 2016 ] },
{ "venue" : "KAIS", "years" : [ 2017 ] },
{ "venue" : "SoCC", "years" : [ 2017 ], "pc": true },


@@ -9,7 +9,7 @@
"semester" : "(planned) Fall 2018" },
{ "code" : "CSE 4/562",
"title" : "Database Systems",
- "enrollment" : 162,
+ "enrollment" : 93,
"semester" : "Spring 2018" },
{ "code" : "CSE 662",
"title" : "Languages and Runtimes for Big Data",


@@ -59,6 +59,7 @@
"type" : "grant",
"commitment" : { "summer" : "0,0,¾,1½,1½" },
"supports": ["Poonam Kumari"],
+ "agency_id" : "IIS-1750460",
"urls" : {
"proposal" : "https://odin.cse.buffalo.edu/grants/2018-NSF-CAREER.pdf"
}


@@ -1,11 +1,11 @@
[
- { "talk" : "Don't Wrangle, Guess Instead (with Mimir)", "date" : "(scheduled) Jan. 2018",
+ { "talk" : "Don't Wrangle, Guess Instead (with Mimir)", "date" : "Jan. 2018",
"venue" : "Cornell" },
- { "talk" : "Don't Wrangle, Guess Instead (with Mimir)", "date" : "(scheduled) Jan. 2018",
+ { "talk" : "Don't Wrangle, Guess Instead (with Mimir)", "date" : "Jan. 2018",
"venue" : "Penn State U." },
- { "talk" : "Just-In-Time Data Structures", "date" : "(scheduled) Dec. 2017",
+ { "talk" : "Just-In-Time Data Structures", "date" : "Dec. 2017",
"venue" : "Harvard" },
- { "talk" : "Don't Wrangle, Guess Instead (with Mimir)", "date" : "(scheduled) Dec. 2017",
+ { "talk" : "Don't Wrangle, Guess Instead (with Mimir)", "date" : "Dec. 2017",
"venue" : "University of Washington" },
{ "talk" : "Don't Wrangle, Guess Instead (with Mimir)", "date" : "Oct. 2017",
"venue" : "Columbia" },


@@ -173,6 +173,8 @@
"due": "Homework 2",
"topic": "Midterm Review",
"materials": {
+ "slides" : "https://odin.cse.buffalo.edu/slides/cse4562sp2018/2018-03-09-Review.html",
+ "homework 2 answers" : "https://odin.cse.buffalo.edu/teaching/cse-562/2018sp/homeworks/homework2-answers.pdf"
}
},
{
@@ -189,10 +191,11 @@
},
{
"date": "Mar 14",
- "type": "Lecture",
+ "type": "Practicum",
"due": "",
- "topic": "TBD",
+ "topic": "Translating SQL to RA and Optimization Review",
"materials": {
+ "slides" : "https://odin.cse.buffalo.edu/slides/cse4562sp2018/2018-03-02-Optimization.pdf"
}
},
{
@@ -231,15 +234,16 @@
"date": "Mar 26",
"type": "Lecture",
"due": "",
- "topic": "Incremental View Maintenance",
+ "topic": "Views",
"materials": {
+ "slides" : "https://odin.cse.buffalo.edu/slides/cse4562sp2018/2018-03-26-Views.pdf"
}
},
{
"date": "Mar 28",
"type": "Lecture",
"due": "",
- "topic": "Buffer Management",
+ "topic": "Materialized Views + Buffer Management",
"materials": {
}
},


@@ -959,7 +959,7 @@
// Full list of configuration options available at:
// https://github.com/hakimel/../reveal.js#configuration
Reveal.initialize({
- controls: false,
+ controls: true,
progress: true,
history: true,
center: true,


@@ -692,7 +692,7 @@
// Full list of configuration options available at:
// https://github.com/hakimel/../reveal.js#configuration
Reveal.initialize({
- controls: false,
+ controls: true,
progress: true,
history: true,
center: true,

Binary file not shown.


@@ -0,0 +1,15 @@
---
title: Vizier Workflows (rant)
author: Oliver Kennedy
---
I'd like to talk a little about abstractions for communication. In particular, I want to talk about a favorite workhorse of the data science community these days: the Jupyter notebook. For those unfamiliar with it, Jupyter users work with blocks of code called "cells." Each cell has an opportunity to produce a result, which is then displayed inline, immediately after the cell. This makes it a lot easier for users to break up complex tasks, showing intermediate results inline with the rest of the code.
Let's spend a little time digging into how this works. For each language that Jupyter supports (Python, Scala, Ruby, and more...), the developers have created a way to snapshot the language's state: global variables, runtime information, file handles, and more. They call this a *kernel*. When you execute a cell, Jupyter loads the kernel, runs the code against it, and saves the result.
This means that if you want code in one cell to talk to code in another cell, the natural way to do it is to create a global variable. The fundamental communication abstraction in Jupyter is the kernel. On the one hand, this is a very powerful abstraction: anything that you can represent as a global variable in Python can be sent.
On the other hand, it also means that the main way to communicate is through language-specific binary blobs.
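As a toy illustration of this kernel-as-shared-state model (the variable names here are made up; in a notebook, each block below would be its own cell):

```python
# --- Cell 1: define a global; it lives on in the kernel's state. ---
cleaned = [row.strip().split("|") for row in ["1|2|6 ", "3|3|2"]]

# --- Cell 2: a later cell sees the same global, because both cells ---
# --- execute against the same kernel snapshot.                     ---
first_columns = [fields[0] for fields in cleaned]
print(first_columns)  # ['1', '3']
```

Nothing marks `cleaned` as the hand-off point; the coupling between the two cells is implicit in the kernel's state.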
At UB, we're working with NYU and IIT on a data exploration tool called [Vizier](http://vizierdb.info). Expect to hear more about Vizier here in the coming weeks and months, but what I want to focus on right now is the fact that cells in Vizier talk through **tables** (or DataFrames or Relations, if you like). The fact that they're tables isn't even all that important; what we care about is that they're in a standardized format that Vizier understands. This is why data debugging in Vizier is easier, and why we expect to be able to provide some powerful query optimization down the line. Again, more on each of those as they develop.
What I want to focus on today is interoperability. Because all communication in Vizier happens through tables, you can write a Python script that transforms data in one cell and a SQL query over the same data in the next. Better still, it means that we can allow direct manipulation of data: for Vizier, we're developing a new language called Vizual. Every expression in Vizual corresponds to an action in a spreadsheet (rewriting a cell, adding a formula, etc...). So, you can write a Python script, manually fine-tune the output table as a spreadsheet, and then query the results. None of that would have been possible if the underlying communication abstraction was opaque to Vizier.
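To make the tables-as-the-medium idea concrete, here's a toy sketch (this is *not* Vizier's actual API; sqlite3 stands in for the SQL cell, and all names are invented):

```python
import sqlite3

# --- "Cell 1": a Python step emits a plain table (rows of named columns). ---
table = [("alice", 3), ("bob", 5)]

# --- "Cell 2": because the hand-off is a table rather than an opaque   ---
# --- language-specific object, a SQL step can consume it directly.     ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, score INT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", table)
result = conn.execute("SELECT name FROM t WHERE score > 4").fetchall()
print(result)  # [('bob',)]
```

The system mediating the hand-off can inspect, version, or optimize the table; it could not do any of that with a pickled Python object.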


@@ -65,13 +65,15 @@ schedule:
received the ACM SIGMOD Best Dissertation Award in 2006, 2010, and
2016 respectively, and Nilesh Dalvi was a runner up in 2008.
- when: May 17; Time TBD
- what: Title TBD
+ what: "Diagnoses and Explanations: Creating a Higher-Quality Data World"
who: Alexandra Meliou (UMass Amherst)
where: Location TBD
details:
- abstract: TBD
- bio: |
-   Alexandra Meliou is an Assistant Professor in the College of Information and Computer Science, at the University of Massachusetts, Amherst. She has held this position since September 2012. Prior to that, she was a Post-Doctoral Research Associate at the University of Washington, working with Dan Suciu. Alexandra received her PhD and MS degrees from the Electrical Engineering and Computer Sciences Department at the University of California, Berkeley, in 2009 and 2005 respectively. She is the recipient of an ACM SIGMOD Research Highlight Award, an ACM SIGSOFT Distinguished Paper Award, an NSF CAREER Award, a Google Faculty Research Award, and a Siebel Scholarship. Her research interests are in the area of data and information management, with a current emphasis on provenance, causality, and reverse data management.
+ abstract: |
+   The correctness and proper function of data-driven systems and applications relies heavily on the correctness of their data. Low quality data can be costly and disruptive, leading to revenue loss, incorrect conclusions, and misguided policy decisions. Improving data quality is far more than purging datasets of errors; it is critical to improve the processes that produce the data, to collect good data sources for generating the data, and to address the root causes of problems.<br/>
+   Our work is grounded on an important insight: While existing data cleaning techniques can be effective at purging datasets of errors, they disregard the fact that a lot of errors are systemic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, we focus on data diagnosis: explaining where and how the errors happen in a data generative process. I will describe our work on Data X-Ray and QFix, two diagnostic frameworks for large-scale extraction systems and relational data systems. I will also discuss our work on MIDAS, a recommendations system that improves the quality of datasets by identifying and filling information gaps. Finally, I will discuss a vision for explanation frameworks to assist the exploration of information in a varied, diverse, highly non-integrated data world.
+ bio: |
+   Alexandra Meliou is an Assistant Professor in the College of Information and Computer Sciences, at the University of Massachusetts, Amherst. Prior to that, she was a Post-Doctoral Research Associate at the University of Washington, working with Dan Suciu. Alexandra received her PhD degree from the Electrical Engineering and Computer Sciences Department at the University of California, Berkeley. She has received recognitions for research and teaching, including a CACM Research Highlight, an ACM SIGMOD Research Highlight Award, an ACM SIGSOFT Distinguished Paper Award, an NSF CAREER Award, a Google Faculty Research Award, and a Lilly Fellowship for Teaching Excellence. Her research focuses on data provenance, causality, explanations, data quality, and algorithmic fairness.
---
<p>Subscribe to <a href="https://listserv.buffalo.edu/cgi-bin/wa?A0=cse-database-list">cse-database-list</a> for more details about the UBDB seminar.</p>


@@ -0,0 +1,110 @@
---
title: CSE-562; Project 3
---
<h1>Checkpoint 3</h1>
<ul>
<li><strong>Overview</strong>: New SQL features (Aggregation), Limited Memory, Faster Performance, Different Join Algorithms</li>
<li><strong>Deadline</strong>: April 13</li>
<li><strong>Grade</strong>: 15% of Project Component
<ul>
<li>8% Correctness</li>
<li>7% Efficiency</li>
</ul>
</li>
</ul>
<p>This project follows the same outline as Checkpoints 1 and 2. Your code receives SQL queries and is expected to answer them. You are expected to implement all the features from Checkpoints 1 and 2. Additionally, there are two key differences:
<ul>
<li>Queries may now include a <tt>GROUP BY</tt> clause, and <tt>MIN(), MAX(), SUM(), COUNT(), AVG()</tt> functions.</li>
<li>You will be expected to process queries faster, and use less memory.</li>
</ul>
</p>
<h2>Grouping Data</h2>
<p>Like Order-by, Group-by aggregation is a blocking operator. If you run out of memory for the groups, you will need to implement a memory-aware grouping operator. One idea is to re-use your sort operator to bring rows of the same group together, and then use the sorted-grouping technique. Queries will have only one or two attributes in the group-by clause, so you do not need to handle an unlimited number of group-by attributes.</p>
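The sort-then-group approach can be sketched as follows (Python for brevity, though the project itself is in Java; this sketch uses an in-memory sort where a real implementation would use your external sort operator):

```python
from itertools import groupby

def sorted_group_sum(rows, key_index, agg_index):
    """Group rows on one column and SUM another, one group at a time.

    Sorting first makes each group's rows adjacent, so only one group's
    state is held in memory at once, instead of one accumulator per group.
    """
    ordered = sorted(rows, key=lambda r: r[key_index])  # external sort in practice
    for key, group in groupby(ordered, key=lambda r: r[key_index]):
        yield (key, sum(r[agg_index] for r in group))

rows = [(1, 5), (2, 7), (1, 6)]
print(list(sorted_group_sum(rows, 0, 1)))  # [(1, 11), (2, 7)]
```

The same skeleton works for MIN, MAX, COUNT, and AVG by swapping the per-group accumulator.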
<h2>Optimization/Query Rewriting</h2>
<p>In the prior checkpoints, you were encouraged to parse SQL into a relational algebra tree. This checkpoint is where that design choice begins to pay off. We've discussed expression equivalences in relational algebra, and identified several that are always good (e.g., pushing down selection operators). You should have implemented selection pushdown for Checkpoint 2. The reference implementation uses some simple recursion to identify patterns of expressions that can be optimized and rewrite them. For example, if I wanted to define a new HashJoin operator, I might go through and replace every qualifying Selection operator sitting on top of a CrossProduct operator with a HashJoin.</p>
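That recursive rewrite might look like the sketch below (Python for brevity; the node classes and field names are invented for illustration and are not the reference implementation's):

```python
from dataclasses import dataclass
from typing import Any, Tuple

@dataclass
class Scan:
    table: str

@dataclass
class CrossProduct:
    left: Any
    right: Any

@dataclass
class Selection:
    predicate: Tuple[str, str, str]  # invented shape: ("eq", lhs_col, rhs_col)
    child: Any

@dataclass
class HashJoin:
    left: Any
    right: Any
    on: Tuple[str, str]

def rewrite(node):
    """Recur through the RA tree; replace each Selection sitting on a
    CrossProduct with a HashJoin when its predicate is a simple equality."""
    if isinstance(node, Selection):
        child = rewrite(node.child)
        if isinstance(child, CrossProduct) and node.predicate[0] == "eq":
            return HashJoin(child.left, child.right, tuple(node.predicate[1:]))
        return Selection(node.predicate, child)
    if isinstance(node, CrossProduct):
        return CrossProduct(rewrite(node.left), rewrite(node.right))
    return node  # leaves (Scan) pass through unchanged

plan = Selection(("eq", "R.A", "S.D"), CrossProduct(Scan("R"), Scan("S")))
print(rewrite(plan))
```

The same pattern-match-and-replace skeleton handles other rewrites (e.g., selection pushdown) by changing the matched shape.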
<p>Another optimization that is always good is projection pushdown. Essentially, you read only the attributes that the query needs from each database file, and discard all the attributes that you will not use. In practice, it is expensive to copy the values of a tuple into a new tuple, so this is especially helpful when an operator changes the schema of its input tuple and outputs a tuple with a different schema (e.g., Cross Product and Join). You will also save considerable memory with this improvement.</p>
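A minimal sketch of the idea, with invented names (a real implementation would derive <tt>needed</tt> from the parsed query and push the projection below joins):

```python
schema = ["A", "B", "C"]   # the scan's full schema
needed = ["A", "C"]        # columns the rest of the query references
keep = [schema.index(c) for c in needed]

def project_early(scan_rows):
    # Trim each tuple right after the scan, so narrower tuples flow upstream.
    for row in scan_rows:
        yield tuple(row[i] for i in keep)

# Rows from R.dat in the example below: 1|1|5, 1|2|6, 2|3|7
print(list(project_early([(1, 1, 5), (1, 2, 6), (2, 3, 7)])))  # [(1, 5), (1, 6), (2, 7)]
```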
<h4>Grading Workflow</h4>
<p>All .java files in the src directory at the root of your repository will be compiled (and linked against JSQLParser). A main file that you can take as an example is given <a href="https://www.cse.buffalo.edu/~gokhanku/Main.java">here</a>. As before, the class <tt>edu.buffalo.www.cse4562.Main</tt> will be invoked, and a stream of <b>semicolon-delimited</b> queries will be provided on System.in (after you print out a prompt). Also, make sure that you use the path we provide in the <tt>--data</tt> argument. Hardcoding the location may cause problems.</p>
<p>For example (<span style="color: red">red</span> text is entered by the user/grader):</p>
<pre>bash&gt; <span style="color: red">ls data</span>
R.dat
S.dat
T.dat
bash&gt; <span style="color: red">cat data/R.dat</span>
1|1|5
1|2|6
2|3|7
bash&gt; <span style="color: red">cat data/S.dat</span>
1|2|6
3|3|2
3|5|2
bash&gt; <span style="color: red">find {code root directory} -name \*.java -print > compile.list</span>
bash&gt; <span style="color: red">javac -cp {libs location}/commons-csv-1.5.jar:{libs location}/evallib-1.0.jar:{libs location}/jsqlparser-1.0.0.jar -d {compiled directory name} @compile.list</span>
bash&gt; <span style="color: red">java -cp {compiled directory name}/src/:{libs location}/commons-csv-1.5.jar:{libs location}/evallib-1.0.jar:{libs location}/jsqlparser-1.0.0.jar edu.buffalo.www.cse4562.Main --data data/</span>
$> <span style="color: red">CREATE TABLE R(A int, B int, C int);</span>
$> <span style="color: red">CREATE TABLE S(D int, E int, F int);</span>
$> <span style="color: red">SELECT B, C FROM R WHERE A = 1;</span>
1|5
2|6
$> <span style="color: red">SELECT A, E FROM R, S WHERE R.A = S.D;</span>
1|2
1|2
</pre>
<p>For this project, we will issue 5 queries to your program, excluding <tt>CREATE TABLE</tt> statements. Two of these queries will NOT be timed; they will be evaluated based on the correctness of their results, and answering each successfully earns 2 points. An example data file is given <a href="https://www.cse.buffalo.edu/~gokhanku/R.dat">here</a>. The remaining three queries will be timed, and they will run on files that are around 500 MB in total. You will receive performance points for each timed query for matching or beating the reference implementation's time. Keep in mind that for ALL queries, the grader will time out and exit after 5 minutes.
There is also a memory limit that will not allow you to load full tables and cross-product them for joins.
</p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg .tg-s6z2{text-align:center}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-hgcj{font-weight:bold;text-align:center}
.tg .tg-amwm{font-weight:bold;text-align:center;vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-s6z2"></th>
<th class="tg-hgcj">Points for<br>Correctness</th>
<th class="tg-hgcj">Points for<br>Performance</th>
<th class="tg-amwm">Table<br>Size</th>
</tr>
<tr>
<td class="tg-hgcj">Query 1</td>
<td class="tg-s6z2">2</td>
<td class="tg-s6z2">0</td>
<td class="tg-baqh">~500 MB</td>
</tr>
<tr>
<td class="tg-hgcj">Query 2</td>
<td class="tg-s6z2">2</td>
<td class="tg-s6z2">0</td>
<td class="tg-baqh">~500 MB</td>
</tr>
<tr>
<td class="tg-hgcj">Query 3</td>
<td class="tg-s6z2">1</td>
<td class="tg-s6z2">2</td>
<td class="tg-baqh">~500 MB</td>
</tr>
<tr>
<td class="tg-amwm">Query 4</td>
<td class="tg-baqh">2</td>
<td class="tg-baqh">2</td>
<td class="tg-baqh">~500 MB</td>
</tr>
<tr>
<td class="tg-amwm">Query 5</td>
<td class="tg-baqh">1</td>
<td class="tg-baqh">3</td>
<td class="tg-baqh">~500 MB</td>
</tr>
</table>


@@ -53,8 +53,8 @@ In this course, you will learn...
<ul>
<li>5% <a title="Checkpoint 0" href="https://odin.cse.buffalo.edu/slides/cse4562sp2018/Checkpoint0.pdf">Checkpoint 0</a> due on Feb. 8.</li>
<li>10% <a title="Checkpoint 1" href="https://odin.cse.buffalo.edu/teaching/cse-562/2018sp/checkpoint1.html">Checkpoint 1</a> due on Feb. 23</li>
- <li>10% <a title="Checkpoint 2" href="https://odin.cse.buffalo.edu/teaching/cse-562/2018sp/checkpoint2.html">Checkpoint 2</a> due on Mar. 15</li>
- <li>15% <a title="Checkpoint 3" href="#">Checkpoint 3</a> due on Apr. 12</li>
+ <li>10% <a title="Checkpoint 2" href="https://odin.cse.buffalo.edu/teaching/cse-562/2018sp/checkpoint2.html">Checkpoint 2</a> due on Mar. 16</li>
+ <li>15% <a title="Checkpoint 3" href="https://odin.cse.buffalo.edu/teaching/cse-562/2018sp/checkpoint3.html">Checkpoint 3</a> due on Apr. 13</li>
<li>10% <a title="Checkpoint 4" href="#">Checkpoint 4</a> due on May 11</li>
</ul>
</li>