Checkpoint 3 specs added

This commit is contained in:
Gokhan Kul 2018-03-14 23:22:51 -04:00
parent c98a446fed
commit 3467fe364e
2 changed files with 111 additions and 1 deletions

@ -0,0 +1,110 @@
---
title: CSE-562; Project 3
---
<h1>Checkpoint 3</h1>
<ul>
<li><strong>Overview</strong>: New SQL features (Aggregation), Limited Memory, Faster Performance, Different Join Algorithms</li>
<li><strong>Deadline</strong>: April 13</li>
<li><strong>Grade</strong>: 15% of Project Component
<ul>
<li>8% Correctness</li>
<li>7% Efficiency</li>
</ul>
</li>
</ul>
<p>This project follows the same outline as Checkpoints 1 and 2: your code receives SQL queries and is expected to answer them. You are expected to support all of the features from Checkpoints 1 and 2. In addition, there are two key differences:
<ul>
<li>Queries may now include a <tt>GROUP BY</tt> clause, and <tt>MIN(), MAX(), SUM(), COUNT(), AVG()</tt> functions.</li>
<li>You will be expected to process queries faster, and use less memory.</li>
</ul>
</p>
<h2>Grouping Data</h2>
<p>Just like Order-By, a Group-By aggregate is a blocking operator. If the groups do not fit in memory, you will need to implement a memory-aware grouping operator. One idea is to reuse the sort operator to bring rows with equal group-by values together, and then aggregate them using the sorted grouping technique. Queries will have only one or two attributes in the group-by clause, so you do not need to handle an unlimited number of group-by attributes.</p>
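<p>The sorted grouping idea can be sketched as follows. This is a minimal illustration, not the reference implementation: the class <tt>SortedGroupBy</tt>, the <tt>{groupKey, value}</tt> row layout, and the in-memory sort are all assumptions made for the example. A memory-aware operator would replace the in-memory sort with your external sort operator and stream the groups out instead of collecting them.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: SUM(value) grouped by one key, computed over rows sorted on that key. */
final class SortedGroupBy {
    /** Each input row is {groupKey, value}; each output row is {groupKey, sum}. */
    static List<long[]> sumByKey(List<long[]> rows) {
        List<long[]> sorted = new ArrayList<>(rows);
        // Stand-in for the sort operator; assumed to be an external sort in a real system.
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));

        List<long[]> out = new ArrayList<>();
        long currentKey = 0;
        long sum = 0;
        boolean open = false; // true while a group is being accumulated
        for (long[] row : sorted) {
            if (open && row[0] != currentKey) {
                // The key changed, so the previous group is complete and can be emitted.
                out.add(new long[] {currentKey, sum});
                sum = 0;
            }
            currentKey = row[0];
            sum += row[1];
            open = true;
        }
        if (open) {
            out.add(new long[] {currentKey, sum}); // emit the final group
        }
        return out;
    }

    public static void main(String[] args) {
        // Columns A and C of the example table R, i.e. SELECT A, SUM(C) FROM R GROUP BY A
        List<long[]> r = Arrays.asList(new long[] {1, 5}, new long[] {2, 7}, new long[] {1, 6});
        for (long[] g : sumByKey(r)) {
            System.out.println(g[0] + "|" + g[1]);
        }
    }
}
```

<p>Because the input is sorted on the key, only one running aggregate is held at a time, regardless of how many groups there are.</p>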
<h2>Optimization/Query Rewriting</h2>
<p>In the prior checkpoints, you were encouraged to parse SQL into a relational algebra tree. This checkpoint is where that design choice begins to pay off. We've discussed expression equivalences in relational algebra, and identified several that are always good (e.g., pushing down selection operators). You should have implemented selection pushdown for Checkpoint 2. The reference implementation uses some simple recursion to identify patterns of expressions that can be optimized and to rewrite them. For example, if I wanted to define a new HashJoin operator, I might go through the tree and replace every qualifying Selection operator sitting on top of a CrossProduct operator with a HashJoin.</p>
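<p>A recursive rewrite of this kind might look like the sketch below. All of the operator classes here (<tt>Op</tt>, <tt>Scan</tt>, <tt>CrossProduct</tt>, <tt>Selection</tt>, <tt>HashJoin</tt>) are hypothetical stand-ins for whatever your own tree uses, and the sketch assumes every <tt>Selection</tt> holds a single equality between one column of each input, which is what makes it "qualifying". Real code would have to inspect the predicate before rewriting.</p>

```java
/** Hypothetical operator tree; names and fields are illustrative only. */
abstract class Op {}

final class Scan extends Op {
    final String table;
    Scan(String table) { this.table = table; }
}

final class CrossProduct extends Op {
    final Op left, right;
    CrossProduct(Op left, Op right) { this.left = left; this.right = right; }
}

/** Assumed to represent the predicate leftCol = rightCol over its child. */
final class Selection extends Op {
    final String leftCol, rightCol;
    final Op child;
    Selection(String leftCol, String rightCol, Op child) {
        this.leftCol = leftCol; this.rightCol = rightCol; this.child = child;
    }
}

final class HashJoin extends Op {
    final String leftCol, rightCol;
    final Op left, right;
    HashJoin(String leftCol, String rightCol, Op left, Op right) {
        this.leftCol = leftCol; this.rightCol = rightCol; this.left = left; this.right = right;
    }
}

final class Rewriter {
    /** Rewrites children first, then replaces Selection-over-CrossProduct with a HashJoin. */
    static Op rewrite(Op op) {
        if (op instanceof Selection) {
            Selection s = (Selection) op;
            Op child = rewrite(s.child);
            if (child instanceof CrossProduct) {
                CrossProduct c = (CrossProduct) child;
                return new HashJoin(s.leftCol, s.rightCol, c.left, c.right);
            }
            return new Selection(s.leftCol, s.rightCol, child);
        }
        if (op instanceof CrossProduct) {
            CrossProduct c = (CrossProduct) op;
            return new CrossProduct(rewrite(c.left), rewrite(c.right));
        }
        if (op instanceof HashJoin) {
            HashJoin j = (HashJoin) op;
            return new HashJoin(j.leftCol, j.rightCol, rewrite(j.left), rewrite(j.right));
        }
        return op; // leaves such as Scan are returned unchanged
    }
}
```

<p>For instance, <tt>Rewriter.rewrite(new Selection("R.A", "S.D", new CrossProduct(new Scan("R"), new Scan("S"))))</tt> returns a <tt>HashJoin</tt> over the two scans.</p>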
<p>Another optimization that is always good is projection pushdown. Essentially, you read only the attributes that the query needs from each data file, and discard all the attributes that you will not use. This helps because, in practice, it is expensive to copy the values of a tuple into a new tuple, and narrower tuples are cheaper to copy. It is especially helpful when an operator outputs tuples with a different schema than its input tuples (e.g., Cross Product and Join). You will also save considerable memory with this improvement.</p>
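<p>As a rough illustration of projection pushdown at the scan, the sketch below keeps only the requested columns of a <tt>|</tt>-delimited line as it is read. The class <tt>ProjectingScan</tt> and its <tt>keep</tt> parameter are hypothetical names for this example, and it assumes the needed column positions have already been worked out from the query.</p>

```java
/** Hypothetical sketch: keep only the needed columns of a '|'-delimited row at scan time. */
final class ProjectingScan {
    /**
     * Returns the fields at the positions listed in keep, in that order.
     * The limit of -1 makes split retain trailing empty fields.
     */
    static String[] project(String line, int[] keep) {
        String[] fields = line.split("\\|", -1);
        String[] narrow = new String[keep.length];
        for (int i = 0; i < keep.length; i++) {
            narrow[i] = fields[keep[i]];
        }
        return narrow;
    }

    public static void main(String[] args) {
        // For SELECT B, C FROM R, only columns 1 and 2 of each R row are needed downstream.
        String[] row = project("1|2|6", new int[] {1, 2});
        System.out.println(String.join("|", row));
    }
}
```

<p>Every operator above the scan then copies and stores the narrower tuple instead of the full row.</p>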
<h4>Grading Workflow</h4>
<p>All .java files in the src directory at the root of your repository will be compiled (and linked against JSQLParser). An example main file is available <a href="https://www.cse.buffalo.edu/~gokhanku/Main.java">here</a>. As before, the class <tt>edu.buffalo.www.cse4562.Main</tt> will be invoked, and a stream of <b>semicolon-delimited</b> queries will be written to your program's System.in (after you print out a prompt). Also, make sure that you use the path we provide in the <tt>--data</tt> argument; hardcoding the location may cause problems.</p>
<p>For example (<span style="color: red">red</span> text is entered by the user/grader):</p>
<pre>bash&gt; <span style="color: red">ls data</span>
R.dat
S.dat
T.dat
bash&gt; <span style="color: red">cat data/R.dat</span>
1|1|5
1|2|6
2|3|7
bash&gt; <span style="color: red">cat data/S.dat</span>
1|2|6
3|3|2
3|5|2
bash&gt; <span style="color: red">find {code root directory} -name \*.java -print > compile.list</span>
bash&gt; <span style="color: red">javac -cp {libs location}/commons-csv-1.5.jar:{libs location}/evallib-1.0.jar:{libs location}/jsqlparser-1.0.0.jar -d {compiled directory name} @compile.list</span>
bash&gt; <span style="color: red">java -cp {compiled directory name}/src/:{libs location}/commons-csv-1.5.jar:{libs location}/evallib-1.0.jar:{libs location}/jsqlparser-1.0.0.jar edu.buffalo.www.cse4562.Main --data data/</span>
$> <span style="color: red">CREATE TABLE R(A int, B int, C int);</span>
$> <span style="color: red">CREATE TABLE S(D int, E int, F int);</span>
$> <span style="color: red">SELECT B, C FROM R WHERE A = 1;</span>
1|5
2|6
$> <span style="color: red">SELECT A, E FROM R, S WHERE R.A = S.D;</span>
1|2
1|2
</pre>
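<p>The prompt-and-read loop from the grading workflow could be sketched roughly as follows. This is not the linked <tt>Main.java</tt>: the class <tt>QueryLoop</tt>, its helper methods, and the <tt>data/</tt> fallback are assumptions for illustration. It splits the input on semicolons with <tt>java.util.Scanner</tt> and collects the query text where a real program would hand each query to the parser and evaluate it. Splitting on every semicolon would break on a semicolon inside a string literal, so treat this only as a starting point.</p>

```java
import java.io.InputStream;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Hypothetical sketch of the read loop; not the course's Main.java. */
final class QueryLoop {
    /** Returns the value following --data, or "data/" (an assumed fallback) if it is absent. */
    static String dataDir(String[] args) {
        for (int i = 0; i + 1 < args.length; i++) {
            if ("--data".equals(args[i])) {
                return args[i + 1];
            }
        }
        return "data/";
    }

    /** Prints a prompt, then reads semicolon-delimited queries until the input ends. */
    static List<String> readQueries(InputStream in, PrintStream out) {
        List<String> queries = new ArrayList<>();
        Scanner scanner = new Scanner(in).useDelimiter(";");
        out.print("$> ");
        out.flush();
        while (scanner.hasNext()) {
            String sql = scanner.next().trim();
            if (sql.isEmpty()) {
                continue; // whitespace left over after the final semicolon
            }
            queries.add(sql); // a real program would parse and evaluate the query here
            out.print("$> ");
            out.flush();
        }
        return queries;
    }

    public static void main(String[] args) {
        String dir = dataDir(args);
        for (String sql : readQueries(System.in, System.out)) {
            System.err.println("[" + dir + "] " + sql);
        }
    }
}
```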
<p>For this project, we will issue 5 queries to your program, excluding <tt>CREATE TABLE</tt> queries. 2 of these queries will NOT be timed, and they will be evaluated based on the correctness of the query results. Each of these queries is worth 1 point if answered correctly. An example of a data file you will read from is given <a href="https://www.cse.buffalo.edu/~gokhanku/R.dat">here</a>. The remaining three queries will be timed, and they will run on files that are around 500 MB in total. You will receive 1.5 points for each query for which you return correct results, and an additional 2 points per query for matching or beating the running time of the reference implementation. Also keep in mind that for ALL queries, the grader will time out and exit after 5 minutes.
There is also a memory limit that will not allow you to load full tables and compute their cross product for joins.
</p>
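<p>One way to avoid materializing a cross product is an in-memory hash join: build a hash table on one input and stream the other past it. The sketch below is an assumption-laden illustration (the class <tt>SimpleHashJoin</tt>, integer-only rows, and a build side small enough to fit in memory are all made up for the example), not the reference implementation.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of an equi-join that builds on one input and probes with the other. */
final class SimpleHashJoin {
    /** Output rows are the probe row followed by the matching build row. */
    static List<int[]> join(List<int[]> build, int buildKey, Iterable<int[]> probe, int probeKey) {
        // Build phase: index the (assumed smaller) input by its join column.
        Map<Integer, List<int[]>> table = new HashMap<>();
        for (int[] b : build) {
            table.computeIfAbsent(b[buildKey], k -> new ArrayList<>()).add(b);
        }
        // Probe phase: stream the other input and look up matches one row at a time.
        List<int[]> out = new ArrayList<>();
        for (int[] p : probe) {
            List<int[]> matches = table.get(p[probeKey]);
            if (matches == null) {
                continue; // no build row shares this key
            }
            for (int[] b : matches) {
                int[] joined = new int[p.length + b.length];
                System.arraycopy(p, 0, joined, 0, p.length);
                System.arraycopy(b, 0, joined, p.length, b.length);
                out.add(joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The example tables R(A, B, C) and S(D, E, F), joined on R.A = S.D.
        List<int[]> r = Arrays.asList(new int[] {1, 1, 5}, new int[] {1, 2, 6}, new int[] {2, 3, 7});
        List<int[]> s = Arrays.asList(new int[] {1, 2, 6}, new int[] {3, 3, 2}, new int[] {3, 5, 2});
        for (int[] row : join(s, 0, r, 0)) {
            System.out.println(row[0] + "|" + row[4]); // columns A and E
        }
    }
}
```

<p>Only the build side is held in memory, and the probe side is read once. If neither input fits within the memory limit, a partitioned or sort-based join would be needed instead.</p>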
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg .tg-s6z2{text-align:center}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-hgcj{font-weight:bold;text-align:center}
.tg .tg-amwm{font-weight:bold;text-align:center;vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-s6z2"></th>
<th class="tg-hgcj">Points for<br>Correctness</th>
<th class="tg-hgcj">Points for<br>Performance</th>
<th class="tg-amwm">Table<br>Size</th>
</tr>
<tr>
<td class="tg-hgcj">Query 1</td>
<td class="tg-s6z2">2</td>
<td class="tg-s6z2">0</td>
<td class="tg-baqh">~500 MB</td>
</tr>
<tr>
<td class="tg-hgcj">Query 2</td>
<td class="tg-s6z2">2</td>
<td class="tg-s6z2">0</td>
<td class="tg-baqh">~500 MB</td>
</tr>
<tr>
<td class="tg-hgcj">Query 3</td>
<td class="tg-s6z2">1</td>
<td class="tg-s6z2">2</td>
<td class="tg-baqh">~500 MB</td>
</tr>
<tr>
<td class="tg-amwm">Query 4</td>
<td class="tg-baqh">2</td>
<td class="tg-baqh">2</td>
<td class="tg-baqh">~500 MB</td>
</tr>
<tr>
<td class="tg-amwm">Query 5</td>
<td class="tg-baqh">1</td>
<td class="tg-baqh">3</td>
<td class="tg-baqh">~500 MB</td>
</tr>
</table>

@ -54,7 +54,7 @@ In this course, you will learn...
<li>5% <a title="Checkpoint 0" href="https://odin.cse.buffalo.edu/slides/cse4562sp2018/Checkpoint0.pdf">Checkpoint 0</a> due on Feb. 8.</li>
<li>10% <a title="Checkpoint 1" href="https://odin.cse.buffalo.edu/teaching/cse-562/2018sp/checkpoint1.html">Checkpoint 1</a> due on Feb. 23</li>
<li>10% <a title="Checkpoint 2" href="https://odin.cse.buffalo.edu/teaching/cse-562/2018sp/checkpoint2.html">Checkpoint 2</a> due on Mar. 16</li>
<li>15% <a title="Checkpoint 3" href="#">Checkpoint 3</a> due on Apr. 13</li>
<li>15% <a title="Checkpoint 3" href="https://odin.cse.buffalo.edu/teaching/cse-562/2018sp/checkpoint3.html">Checkpoint 3</a> due on Apr. 13</li>
<li>10% <a title="Checkpoint 4" href="#">Checkpoint 4</a> due on May 11</li>
</ul>
</li>