Checkpoint2

Oliver Kennedy 2019-03-12 00:00:24 -04:00
parent 9c2667c352
commit 6dd81fca0b
2 changed files with 138 additions and 2 deletions


@ -0,0 +1,137 @@
---
title: CSE-562; Project 2
---
<h1>Checkpoint 2</h1>
<ul>
<li><strong>Overview</strong>: New SQL features, Limited Memory, Faster Performance</li>
<li><strong>Deadline</strong>: March 16</li>
<li><strong>Grade</strong>: 10% of Project Component
<ul>
<li>5% Correctness</li>
<li>5% Efficiency</li>
</ul>
</li>
</ul>
<p>This project follows the same outline as Checkpoint 1. Your code gets SQL queries and is expected to answer them. There are a few key differences:
<ul>
<li>Queries may now include an <tt>ORDER BY</tt> clause.</li>
<li>Queries may now include a <tt>LIMIT</tt> clause.</li>
<li>Queries may now include aggregate operators, a <tt>GROUP BY</tt> clause, and/or a <tt>HAVING</tt> clause.</li>
<li>For part of the workload, your program will be re-launched with heavy restrictions on available heap space (see Java's <tt>-Xmx</tt> option). You will most likely have insufficient memory for any task that requires O(N) memory.</li>
</ul>
</p>
<h2>Sorting and Grouping Data</h2>
<p>Sort is a blocking operator. Before it emits even one row, it needs to see the entire dataset. If you have enough memory to hold the entire input to be sorted, then you can just use Java's built-in <a href="http://docs.oracle.com/javase/7/docs/api/java/util/Collections.html#sort(java.util.List,%20java.util.Comparator)">Collections.sort</a> method. However, for the memory-restricted part of the workload, you will likely not have enough memory to hold the entire input at once. In that case, a good option is to use the 2-pass sort algorithm that we discussed in class.</p>
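<p>The two passes can be pictured roughly as follows. This is a minimal sketch, not the reference implementation: the class <tt>TwoPassSort</tt>, its method names, and the choice to treat each row as one line of text are assumptions made for illustration. Runs are spilled to the system temp directory here; a submission could write them under <tt>data/</tt> instead.</p>

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper: sorts rows (one row per line of text) using bounded memory.
class TwoPassSort {
  // Pass 1: cut the input into runs of at most runSize rows, sort each run
  // in memory, and spill it to its own temporary file.
  static List<Path> writeSortedRuns(Iterator<String> rows, int runSize,
                                    Comparator<String> order) throws IOException {
    List<Path> runs = new ArrayList<>();
    List<String> buffer = new ArrayList<>(runSize);
    while (rows.hasNext()) {
      buffer.add(rows.next());
      if (buffer.size() == runSize || !rows.hasNext()) {
        buffer.sort(order);
        Path run = Files.createTempFile("run", ".dat");
        Files.write(run, buffer);
        runs.add(run);
        buffer.clear();
      }
    }
    return runs;
  }

  // One open run plus the row currently at its head.
  static final class Cursor {
    final BufferedReader in;
    String head;
    Cursor(Path run) throws IOException {
      in = Files.newBufferedReader(run);
      head = in.readLine();
    }
    void advance() throws IOException { head = in.readLine(); }
  }

  // Pass 2: merge all runs, holding only one row per run in memory.
  static List<String> merge(List<Path> runs, Comparator<String> order) throws IOException {
    PriorityQueue<Cursor> heap =
        new PriorityQueue<Cursor>((a, b) -> order.compare(a.head, b.head));
    for (Path run : runs) {
      Cursor c = new Cursor(run);
      if (c.head != null) heap.add(c); else c.in.close();
    }
    // Collected into a list only to keep the sketch short;
    // a real operator would emit rows one at a time.
    List<String> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      Cursor c = heap.poll();
      out.add(c.head);
      c.advance();
      if (c.head != null) heap.add(c); else c.in.close();
    }
    for (Path run : runs) Files.delete(run);
    return out;
  }

  static List<String> sort(List<String> rows, int runSize, Comparator<String> order) {
    try {
      return merge(writeSortedRuns(rows.iterator(), runSize, order), order);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

<p>The run size is the knob that trades memory for the number of temporary files: pass 1 never holds more than <tt>runSize</tt> rows, and pass 2 holds one row per run.</p>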
<h2>Join Ordering</h2>
<p>The order in which you join tables together is <strong>incredibly important</strong>, and can change the runtime of your query by <strong>multiple orders of magnitude</strong>. However, choosing well between different join orderings requires statistics about the data, something that won't really be feasible until the next project. Instead, here's a present for those of you paying attention. The tables in each FROM clause are ordered so that you will get our recommended join order by building a <em>left-deep plan</em> going in-order of the relation list (something that many of you are doing already), and (for hybrid hash joins) using the left-hand-side relation to build your hash table.</p>
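<p>To make "left-deep, in order of the relation list" concrete, here is a small sketch. The <tt>Plan</tt>, <tt>Scan</tt>, and <tt>Join</tt> classes are hypothetical stand-ins for whatever plan nodes your own code uses; only the folding order is the point.</p>

```java
import java.util.List;

// Hypothetical plan nodes used only to illustrate the shape of a left-deep plan.
class LeftDeepPlan {
  interface Plan {}

  static final class Scan implements Plan {
    final String table;
    Scan(String table) { this.table = table; }
    @Override public String toString() { return table; }
  }

  static final class Join implements Plan {
    final Plan lhs, rhs;
    Join(Plan lhs, Plan rhs) { this.lhs = lhs; this.rhs = rhs; }
    @Override public String toString() { return "(" + lhs + " JOIN " + rhs + ")"; }
  }

  // Fold the FROM list left to right: the plan built so far is always the
  // left-hand side, and the next table in the list becomes the right-hand side.
  static Plan build(List<String> fromList) {
    Plan plan = null;
    for (String table : fromList) {
      Plan scan = new Scan(table);
      plan = (plan == null) ? scan : new Join(plan, scan);
    }
    return plan;
  }
}
```

<p>For a FROM list of <tt>R, S, T</tt>, this produces <tt>((R JOIN S) JOIN T)</tt>: every right-hand input is a base table, and each join's left-hand input is the side a hash join would build on.</p>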
<h2>Query Rewriting</h2>
<p>In Project 1, you were encouraged to parse SQL into a relational algebra tree.  Project 2 is where that design choice begins to pay off.  We've discussed expression equivalences in relational algebra, and identified several that are always good (e.g., pushing down selection operators). The reference implementation uses some simple recursion to identify patterns of expressions that can be optimized and rewrite them.  For example, if I wanted to define a new HashJoin operator, I might go through and replace every qualifying Selection operator sitting on top of a CrossProduct operator with a HashJoin.</p>
<pre class="prettyprint">if(o instanceof Selection){
  Selection s = (Selection)o;
  if(s.getChild() instanceof CrossProduct){
    CrossProduct prod =
      (CrossProduct)s.getChild();
    Expression join_cond =
      // find a good join condition in
      // the predicate of s.
    Expression rest =
      // the remaining conditions
    return new Selection(
      rest,
      new HashJoin(
        join_cond,
        prod.getLHS(),
        prod.getRHS()
      )
    );
  }
}
return o;</pre>
<p>The reference implementation has a function similar to this snippet of code, and applies the function to every node in the relational algebra tree.</p>
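<p>One way to apply such a rule to every node is a recursive walk that rebuilds each node's children before offering the node itself to the rule. The sketch below is an assumption about how this could be structured, not a description of the reference implementation. The operator classes are toy stand-ins with string predicates, and the rule is simplified to treat the whole selection predicate as the join condition.</p>

```java
import java.util.function.UnaryOperator;

// Toy stand-ins for relational algebra operators; predicates are plain strings.
class RewriteDemo {
  static abstract class Op {}

  static final class Table extends Op {
    final String name;
    Table(String name) { this.name = name; }
    @Override public String toString() { return name; }
  }

  static final class CrossProduct extends Op {
    final Op lhs, rhs;
    CrossProduct(Op lhs, Op rhs) { this.lhs = lhs; this.rhs = rhs; }
    @Override public String toString() { return "(" + lhs + " x " + rhs + ")"; }
  }

  static final class Selection extends Op {
    final String cond;
    final Op child;
    Selection(String cond, Op child) { this.cond = cond; this.child = child; }
    @Override public String toString() { return "select[" + cond + "](" + child + ")"; }
  }

  static final class HashJoin extends Op {
    final String cond;
    final Op lhs, rhs;
    HashJoin(String cond, Op lhs, Op rhs) { this.cond = cond; this.lhs = lhs; this.rhs = rhs; }
    @Override public String toString() { return "(" + lhs + " hashjoin[" + cond + "] " + rhs + ")"; }
  }

  // Rebuild the children first, then offer the rebuilt node to the rule.
  static Op rewrite(Op o, UnaryOperator<Op> rule) {
    if (o instanceof Selection) {
      Selection s = (Selection) o;
      o = new Selection(s.cond, rewrite(s.child, rule));
    } else if (o instanceof CrossProduct) {
      CrossProduct p = (CrossProduct) o;
      o = new CrossProduct(rewrite(p.lhs, rule), rewrite(p.rhs, rule));
    } else if (o instanceof HashJoin) {
      HashJoin j = (HashJoin) o;
      o = new HashJoin(j.cond, rewrite(j.lhs, rule), rewrite(j.rhs, rule));
    }
    return rule.apply(o);
  }

  // Simplified rule: a Selection directly over a CrossProduct becomes a HashJoin
  // that uses the entire predicate as its join condition.
  static Op selectionOverProductToJoin(Op o) {
    if (o instanceof Selection && ((Selection) o).child instanceof CrossProduct) {
      Selection s = (Selection) o;
      CrossProduct p = (CrossProduct) s.child;
      return new HashJoin(s.cond, p.lhs, p.rhs);
    }
    return o;
  }
}
```

<p>Calling <tt>RewriteDemo.rewrite(plan, RewriteDemo::selectionOverProductToJoin)</tt> visits the whole tree, so a qualifying Selection is rewritten even when it is buried under other operators.</p>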
<p>Because selection can be decomposed, you may find it useful to have a piece of code that can split AndExpressions into a list of conjunctive terms:</p>
<pre class="prettyprint">List&lt;Expression&gt; splitAndClauses(Expression e)
{
  List&lt;Expression&gt; ret =
    new ArrayList&lt;Expression&gt;();
  if(e instanceof AndExpression){
    // Recurse into both sides of the AND
    AndExpression a = (AndExpression)e;
    ret.addAll(
      splitAndClauses(a.getLeftExpression())
    );
    ret.addAll(
      splitAndClauses(a.getRightExpression())
    );
  } else {
    // Not an AND: e is a single conjunctive term
    ret.add(e);
  }
  return ret;
}</pre>
<h2>Grading Workflow</h2>
<p>As before, the class <tt>dubstep.Main</tt> will be invoked, and a stream of <b>semicolon-delimited</b> queries will be written to System.in (one after each time you print out a prompt).</p>
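<p>The prompt-then-read protocol can be sketched as below. This is an illustration under stated assumptions: the class <tt>QueryLoop</tt> is hypothetical, the prompt string <tt>$&gt; </tt> is taken from the example session in this document, and splitting on every <tt>;</tt> would break on a semicolon inside a string literal. A SQL parser that reads statements directly from the stream avoids that problem.</p>

```java
import java.io.PrintStream;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper: reads semicolon-delimited statements, printing a prompt
// before each read so the grader knows the program is ready for the next query.
class QueryLoop {
  static List<String> readQueries(Reader in, PrintStream out) {
    List<String> queries = new ArrayList<>();
    Scanner scanner = new Scanner(in).useDelimiter(";");
    out.print("$> ");
    out.flush();
    while (scanner.hasNext()) {
      String sql = scanner.next().trim();
      if (sql.isEmpty()) continue;  // e.g. the newline after the final semicolon
      queries.add(sql);             // a real loop would parse and run the query here
      out.print("$> ");
      out.flush();
    }
    return queries;
  }
}
```

<p>Flushing after each prompt matters when standard output is buffered: a prompt that never reaches the grader looks like a hung program.</p>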
<p>All .java / .scala files in your repository will be compiled (and linked against JSQLParser). Your code will be subjected to a sequence of test cases and evaluated on speed and correctness. Note that unlike Project 1, you will receive neither a warning about nor partial credit for out-of-order query results if the outermost query includes an ORDER BY clause. For this checkpoint, we will predominantly use queries chosen from the <a href="http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.18.0.pdf">TPC-H benchmark workload</a>.</p>
<p>Phase 1 (big queries) will be graded on a TPC-H SF 1 dataset (1 GB of raw text data). Phase 2 (limited memory) will be graded on either a TPC-H SF 1 or SF 0.2 dataset (200 MB of raw text data for SF 0.2). Grades are assigned based on per-query thresholds:</p>
<ul>
<li style="text-align: justify;"><strong>0/10 (F)</strong>: Your submission does not compile, does not produce correct output, or fails in some other way. Resubmission is highly encouraged.</li>
<li style="text-align: justify;"><strong>5/10 (C)</strong>: Your submission completes the test query workload within the timeout period, and produces the correct output.</li>
<li style="text-align: justify;"><strong>7.5/10 (B)</strong>: Your submission completes the test query workload notably slower than the reference implementation, and produces the correct output.</li>
<li><strong>10/10 (A)</strong>: Your submission runs the test query within a factor of 2 of the reference implementation, and produces the correct output.</li>
</ul>
<p>Unlike before, your code will be given arguments. During the initial phase of the workload, your code will be launched with <tt>--in-mem</tt> as one of its arguments. During the memory-restricted phase of the workload, your code will be launched with <tt>--on-disk</tt> as one of its arguments. You may use the <tt>data/</tt> directory to store temporary files.</p>
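<p>Since the flag is described only as "one of its arguments", checking for it anywhere in the argument list is safer than assuming a position. A minimal sketch, with the class and method names chosen here purely for illustration:</p>

```java
import java.util.Arrays;

// Hypothetical helper: decides which execution mode the grader requested.
class LaunchMode {
  // True when --on-disk appears anywhere in the arguments, i.e. the
  // memory-restricted phase; otherwise assume the in-memory phase.
  static boolean isOnDisk(String[] args) {
    return Arrays.asList(args).contains("--on-disk");
  }
}
```

<p>A <tt>main</tt> method could call <tt>LaunchMode.isOnDisk(args)</tt> once at startup and use the result to choose, for example, between an in-memory sort and one that spills to <tt>data/</tt>.</p>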
<p>For example (<span style="color: red">red</span> text is entered by the user/grader):</p>
<pre>bash&gt; <span style="color: red">ls data</span>
R.dat
S.dat
T.dat
bash&gt; <span style="color: red">cat data/R.dat</span>
1|1|5
1|2|6
2|3|7
bash&gt; <span style="color: red">cat data/S.dat</span>
1|2|6
3|3|2
3|5|2
bash&gt; <span style="color: red">find {code root directory} -name \*.java -print > compile.list</span>
bash&gt; <span style="color: red">javac -cp {libs location}/commons-csv-1.5.jar:{libs location}/evallib-1.0.jar:{libs location}/jsqlparser-1.0.0.jar -d {compiled directory name} @compile.list</span>
bash&gt; <span style="color: red">java -cp {compiled directory name}/src/:{libs location}/commons-csv-1.5.jar:{libs location}/evallib-1.0.jar:{libs location}/jsqlparser-1.0.0.jar edu.buffalo.www.cse4562.Main - --in-mem</span>
$> <span style="color: red">CREATE TABLE R(A int, B int, C int);</span>
$> <span style="color: red">CREATE TABLE S(D int, E int, F int);</span>
$> <span style="color: red">SELECT B, C FROM R WHERE A = 1;</span>
1|5
2|6
$> <span style="color: red">SELECT A, E FROM R, S WHERE R.A = S.D;</span>
1|2
1|2
</pre>
<p>For this project, we will issue a sequence of queries to your program and time your performance. A randomly chosen subset of these queries will be checked for correctness. Producing an incorrect answer on any query will result in a 0.</p>


@ -212,7 +212,6 @@ In this course, you will learn...
<li><strong>Ninjas: </strong><ul>
<li>William Spoth: Davis TA Area, Fridays 9:00-11:00 AM</li>
<li>Darshana Balakrishnan: 212 Capen Hall, Thursdays 2:00-4:00 PM</li>
<li>Carl Nuessle (Availability TBD)</li>
</ul></li>
<li><strong>Course Discussions: </strong> <a href="https://piazza.com/buffalo/spring2019/cse4562/home">Piazza</a></li>
<li><strong>Textbook</strong>: <ul>
@ -243,7 +242,7 @@ In this course, you will learn...
<ul>
<li>5% <a href="checkpoint0.html">Checkpoint 0</a> due on Feb. 8.</li>
<li>10% <a href="checkpoint1.html">Checkpoint 1</a> due on Mar. 1</li>
<li>10% Checkpoint 2 due on Mar. 29</li>
<li>10% <a href="checkpoint2.html">Checkpoint 2</a> due on Mar. 29</li>
<li>15% Checkpoint 3 due on TBD</li>
<li>10% Checkpoint 4 due on TBD</li>
</ul>