---
title: DµBStep Checkpoint 1
---
<h1>DuBStep Checkpoint 1</h1>
<ul>
<li><strong>Overview</strong>: Submit a simple SPJUA query evaluator.</li>
<li><strong>Deadline</strong>: March 7</li>
<li><strong>Grade</strong>: 15% of Project Component
<ul>
<li>5% Correctness</li>
<li>5% Efficiency</li>
</ul>
</li>
</ul>
<p>In this project, you will implement a simple SQL query evaluator with support for Select, Project, Join, Bag Union, and Aggregate operations.  You will receive a set of data files and schema information, and will be expected to evaluate multiple SELECT queries over those data files.</p>
<p>Your code is expected to evaluate the SELECT statements on provided data, and produce output in a standardized form. Your code will be evaluated for both correctness and performance (in comparison to a naive evaluator based on iterators and nested-loop joins).</p>
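<p>To make the baseline concrete, here is a minimal sketch of an iterator-style nested-loop join. It is not the reference implementation: the class name <tt>NestedLoopJoin</tt>, the use of <tt>List&lt;Object&gt;</tt> as a row, and the in-memory inner table are all assumptions made for illustration.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiPredicate;

/** Hypothetical sketch: joins an outer row iterator against an in-memory inner table. */
final class NestedLoopJoin implements Iterator<List<Object>> {
    private final Iterator<List<Object>> outer;
    private final List<List<Object>> inner;   // re-scanned once per outer row
    private final BiPredicate<List<Object>, List<Object>> condition;
    private List<Object> currentOuter;
    private int innerPos;
    private List<Object> buffered;            // next matching row, or null when exhausted

    NestedLoopJoin(Iterator<List<Object>> outer, List<List<Object>> inner,
                   BiPredicate<List<Object>, List<Object>> condition) {
        this.outer = outer;
        this.inner = inner;
        this.condition = condition;
        this.innerPos = inner.size();         // forces the first advance() to pull an outer row
        advance();
    }

    /** Scans forward to the next (outer, inner) pair that satisfies the join condition. */
    private void advance() {
        buffered = null;
        while (true) {
            if (innerPos >= inner.size()) {
                if (!outer.hasNext()) return; // both sides exhausted
                currentOuter = outer.next();
                innerPos = 0;
                continue;
            }
            List<Object> right = inner.get(innerPos++);
            if (condition.test(currentOuter, right)) {
                buffered = new ArrayList<>(currentOuter);
                buffered.addAll(right);       // output row = outer columns, then inner columns
                return;
            }
        }
    }

    @Override public boolean hasNext() { return buffered != null; }

    @Override public List<Object> next() {
        if (buffered == null) throw new NoSuchElementException();
        List<Object> result = buffered;
        advance();
        return result;
    }

    public static void main(String[] args) {
        List<List<Object>> r = Arrays.asList(
            Arrays.<Object>asList(1L, 10L), Arrays.<Object>asList(2L, 20L));
        List<List<Object>> s = Arrays.asList(
            Arrays.<Object>asList(1L, "a"), Arrays.<Object>asList(3L, "c"));
        // Equi-join on the first column of each side.
        Iterator<List<Object>> join =
            new NestedLoopJoin(r.iterator(), s, (l, rt) -> l.get(0).equals(rt.get(0)));
        while (join.hasNext()) System.out.println(join.next()); // prints [1, 10, 1, a]
    }
}
```

<p>Every outer row triggers a full scan of the inner table, so the cost grows with the product of the two input sizes. That is the behavior your evaluator is compared against.</p>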
<h1>Parsing SQL</h1>
<p>A parser converts a human-readable string into a structured representation of the program (or query) that the string describes. A fork of the <a href="http://github.com/ubodin/jsqlparser">JSQLParser</a> open-source SQL parser will be provided for your use.  The JAR may be downloaded from:</p>
<center><a href="/software/jsqlparser/jsqlparser.jar">https://odin.cse.buffalo.edu/software/jsqlparser/jsqlparser.jar</a></center>
<p>And documentation for the fork is available at</p>
<center><a href="/software/jsqlparser">https://odin.cse.buffalo.edu/software/jsqlparser</a></center>
<p>You are not required to use this parser (i.e., you may write your own if you like). However, we will be testing your code on SQL that is guaranteed to parse with JSqlParser.</p>
<p>Basic use of the parser requires a <tt>java.io.Reader</tt> or <tt>java.io.InputStream</tt> from which to read the data to be parsed (for example, a <tt>java.io.FileReader</tt>). Let's assume you've created one already (of either type) and called it <tt>inputFile</tt>.</p>
<pre>CCJSqlParser parser = new CCJSqlParser(inputFile);
Statement statement;
while((statement = parser.Statement()) != null){
  // `statement` now has one of the several
  // implementations of the Statement interface
}
// End-of-file. Exit!</pre>
<p>At this point, you'll need to figure out what kind of statement you're dealing with. For this project, we'll be working with <tt>Select</tt> and <tt>CreateTable</tt>. There are two ways to do this: Visitor classes, or the <tt>instanceof</tt> relation. We strongly recommend using <tt>instanceof</tt>:</p>
<pre>if(statement instanceof Select) {
  Select selectStatement = (Select)statement;
  // handle the select
} else if(statement instanceof CreateTable) {
  // and so forth
}</pre>
<h2>Example</h2>
<iframe src="https://www.youtube.com/embed/U4TyaHTJ3Zg" width="540" height="405" frameborder="0" allowfullscreen="allowfullscreen"></iframe>
<h1>Expressions</h1>
<p>JSQLParser includes an interface called <tt>Expression</tt> that represents a primitive-valued expression parse tree.  UB's JSQLParser fork includes a class called <tt>Eval</tt> that can be used to evaluate <tt>Expression</tt> objects. To use the <tt>Eval</tt> class, you will need to define a method for dereferencing <tt>Column</tt> objects.  For example, if I have a <tt>Map</tt> called <tt>tupleSchema</tt> that contains my tuple schema, and an <tt>ArrayList</tt> called <tt>tuple</tt> that contains the tuple I am currently evaluating, I might write:</p>
<pre>public PrimitiveValue eval(Column x){
  // Look up the column's position in the schema, then read that slot of the tuple
  int colID = tupleSchema.get(x.getName());
  return tuple.get(colID);
}</pre>
<p>After doing this, you can use <tt>Eval.eval()</tt> to evaluate any expression in the context of <tt>tuple</tt>.</p>
<h1>Source Data</h1>
<p>Because you are implementing a query evaluator and not a full database engine, there will not be any tables -- at least not in the traditional sense of persistent objects that can be updated and modified. Instead, you will be given a <strong>Table Schema</strong> and a <strong>CSV File</strong> with the instance in it. To keep things simple, we will use the <tt>CREATE TABLE</tt> statement to define a relation's schema. To reiterate, <tt>CREATE TABLE</tt> statements <strong>only appear to give you a schema</strong>. You do not need to allocate any resources for the table in reaction to a <tt>CREATE TABLE</tt> statement; simply save the schema that you are given for later use. The SQL types (and their corresponding Java types) that will be used in this project are as follows:</p>
<table>
<tbody>
<tr>
<th>SQL Type</th>
<th>Java Equivalent</th>
</tr>
<tr>
<td>string</td>
<td>StringValue</td>
</tr>
<tr>
<td>varchar</td>
<td>StringValue</td>
</tr>
<tr>
<td>char</td>
<td>StringValue</td>
</tr>
<tr>
<td>int</td>
<td>LongValue</td>
</tr>
<tr>
<td>decimal</td>
<td>DoubleValue</td>
</tr>
<tr>
<td>date</td>
<td>DateValue</td>
</tr>
</tbody>
</table>
<p>In addition to the schema, you will be given a data directory containing multiple data files whose names correspond to the table names given in the <tt>CREATE TABLE</tt> statements. For example, let's say that you see the following statement in your query file:</p>
<pre>CREATE TABLE R(A int, B int, C int);</pre>
<p>That means that the data directory contains a data file called 'R.dat' that might look like this:</p>
<pre>1|1|5
1|2|6
2|3|7</pre>
<p>Each line of text (see <tt>java.io.BufferedReader.readLine()</tt>) corresponds to one row of data. Fields within a row are delimited by a vertical pipe '|' character.  Integers and floats are stored in a form recognized by Java's <tt>Long.parseLong()</tt> and <tt>Double.parseDouble()</tt> methods. Dates are stored in YYYY-MM-DD form, where YYYY is the 4-digit year, MM is the 2-digit month number, and DD is the 2-digit day of the month. Strings are stored unescaped and unquoted and are guaranteed to contain no vertical pipe characters.</p>
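<p>As a rough sketch of the format described above, the helper below splits one line on '|' and converts each field according to its declared SQL type. The class name <tt>RowParser</tt> is hypothetical, and for the sake of a self-contained example it produces plain Java values (<tt>Long</tt>, <tt>Double</tt>, <tt>java.time.LocalDate</tt>, <tt>String</tt>) rather than the parser's <tt>LongValue</tt>, <tt>DoubleValue</tt>, <tt>DateValue</tt>, and <tt>StringValue</tt> classes. It also assumes the type name is passed without any length parameter (e.g., <tt>varchar</tt>, not <tt>varchar(20)</tt>).</p>

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: turns one '|'-delimited line into a list of typed values. */
final class RowParser {
    /** Converts one raw field to a plain Java value based on the SQL type name. */
    static Object parseField(String raw, String sqlType) {
        switch (sqlType.toLowerCase(Locale.ROOT)) {
            case "int":
                return Long.parseLong(raw);
            case "decimal":
                return Double.parseDouble(raw);
            case "date":
                return LocalDate.parse(raw);   // ISO format, i.e. YYYY-MM-DD
            case "string":
            case "varchar":
            case "char":
                return raw;                    // stored unescaped and unquoted
            default:
                throw new IllegalArgumentException("Unsupported SQL type: " + sqlType);
        }
    }

    /** Splits a line into fields and converts each one by its column type. */
    static List<Object> parseLine(String line, List<String> columnTypes) {
        // A limit of -1 keeps trailing empty fields (e.g., an empty last string).
        String[] fields = line.split("\\|", -1);
        if (fields.length != columnTypes.size()) {
            throw new IllegalArgumentException(
                "Expected " + columnTypes.size() + " fields but found " + fields.length);
        }
        List<Object> row = new ArrayList<>(fields.length);
        for (int i = 0; i < fields.length; i++) {
            row.add(parseField(fields[i], columnTypes.get(i)));
        }
        return row;
    }

    public static void main(String[] args) {
        // Schema from: CREATE TABLE R(A int, B int, C int);
        List<String> types = Arrays.asList("int", "int", "int");
        System.out.println(parseLine("1|2|6", types)); // prints [1, 2, 6]
    }
}
```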
<h1>Queries</h1>
<p>Your code is expected to support both aggregate and non-aggregate queries with the following features.  Keep in mind that this is only a minimum requirement.</p>
<ul>
<li>Non-Aggregate Queries
<ul>
<li>SelectItems may include:
<ul>
<li><strong>SelectExpressionItem</strong>: Any expression that <tt>Eval</tt> can evaluate.  Note that Column expressions may or may not include an appropriate source.  Where relevant, column aliases will be given, unless the SelectExpressionItem's expression is a Column (in which case the Column's name attribute should be used as an alias)</li>
<li><strong>AllTableColumns</strong>: For any aliased term in the from clause</li>
<li><strong>AllColumns</strong>: If present, this will be the only SelectItem in a given PlainSelect.</li>
</ul>
</li>
</ul>
</li>
<li>Aggregate Queries
<ul>
<li><strong>SelectItems</strong> may include:
<ul>
<li><strong>SelectExpressionItem</strong>s where the Expression is one of:
<ul>
<li>A Function with the (case-insensitive) name: SUM, COUNT, AVG, MIN or MAX.  The Function's argument(s) may be any expression(s) that can be evaluated by <tt>Eval</tt>.</li>
<li>A Single Column that also occurs in the GroupBy list.</li>
</ul>
</li>
<li><strong>AllTableColumns</strong>: If all of the table's columns also occur in the GroupBy list</li>
<li><strong>AllColumns</strong>: If all of the source's columns also occur in the GroupBy list.</li>
</ul>
</li>
<li>GroupBy column references are all Columns.</li>
</ul>
</li>
<li>Both Types of Queries
<ul>
<li>From/Joins may include:
<ul>
<li><strong>Join</strong>: All joins will be simple joins</li>
<li><strong>Table</strong>: Tables may or may not be aliased.  Non-Aliased tables should be treated as being aliased to the table's name.</li>
<li><strong>SubSelect</strong>: SubSelects may be aggregate or non-aggregate queries, as here.</li>
</ul>
</li>
<li>The Where/Having clauses may include:
<ul>
<li>Any expression that <tt>Eval</tt> will evaluate to an instance of BooleanValue</li>
</ul>
</li>
<li>Allowable Select Options include
<ul>
<li>SELECT DISTINCT (but not SELECT DISTINCT ON)</li>
<li>UNION ALL (but not UNION)</li>
<li>Order By: The OrderByItem expressions may include any expression that can be evaluated by <tt>Eval</tt>.  Columns in the OrderByItem expressions will refer only to aliases defined in the SelectItems (i.e., the output schema of the query's projection.  See TPC-H Benchmark Query 5 for an example of this)</li>
<li>Limit: RowCount limits (e.g., LIMIT 5), but not Offset limits (e.g., LIMIT 5 OFFSET 10) or JDBC parameter limits.</li>
</ul>
</li>
</ul>
</li>
</ul>
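<p>For the aggregate case, one common approach (not necessarily the one used by the reference implementation) is to keep one running total per group in a hash table. The sketch below evaluates the equivalent of <tt>SELECT A, SUM(C) FROM R GROUP BY A</tt> over in-memory rows. The class name <tt>HashAggregate</tt> and the row representation are assumptions for illustration, and it handles only a single integer-valued SUM.</p>

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a single-column GROUP BY with one SUM aggregate. */
final class HashAggregate {
    /**
     * Groups rows by the value at groupCol and sums the Long at sumCol.
     * A LinkedHashMap keeps groups in first-seen order, which is convenient
     * for testing; without an ORDER BY any output order is acceptable.
     */
    static Map<Object, Long> sumBy(List<List<Object>> rows, int groupCol, int sumCol) {
        Map<Object, Long> totals = new LinkedHashMap<>();
        for (List<Object> row : rows) {
            // merge() inserts the value for a new group, or adds it to the running total.
            totals.merge(row.get(groupCol), (Long) row.get(sumCol), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<List<Object>> r = Arrays.asList(
            Arrays.<Object>asList(1L, 1L, 5L),
            Arrays.<Object>asList(1L, 2L, 6L),
            Arrays.<Object>asList(2L, 3L, 7L));
        System.out.println(sumBy(r, 0, 2)); // prints {1=11, 2=7}
    }
}
```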
<h1>Output</h1>
<p>Your code is expected to output query results in the same format as the input data:</p>
<ul>
<li>One output row per ('\n'-delimited) line.  If there is no ORDER BY clause, you may emit the rows in any order.</li>
<li>One output value per ('|'-delimited) field.  Columns should appear in the same order that they appear in the query.  Table Wildcards should be resolved in the same order that the columns appear in the CREATE TABLE statement.  Global Wildcards should be resolved as Table Wildcards with the tables in the same order that they appear in the FROM clause.</li>
<li>A trailing newline as the last character of the file.</li>
<li>You should not output any header information or other formatting.</li>
</ul>
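<p>The rules above can be sketched as a small formatter. The class name <tt>RowWriter</tt> is hypothetical, and each value is rendered with its <tt>toString()</tt> method, which is an assumption: check how your value classes print decimals and dates against the expected output in the example test cases.</p>

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: renders result rows in the '|'-delimited output format. */
final class RowWriter {
    /** Joins the values of one row with '|' and no trailing delimiter. */
    static String formatRow(List<?> values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) sb.append('|');
            sb.append(values.get(i));
        }
        return sb.toString();
    }

    /** Emits one row per line, ending every row (including the last) with '\n'. */
    static String formatResult(List<? extends List<?>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<?> row : rows) {
            sb.append(formatRow(row)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<List<Long>> rows = Arrays.asList(Arrays.asList(1L, 5L), Arrays.asList(2L, 6L));
        // print(), not println(): formatResult already supplies the trailing newline.
        System.out.print(formatResult(rows));
    }
}
```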
<h1>Example Queries and Data</h1>
<p>These are only examples.  Your code will be expected to handle these queries, as well as others.</p>
<ul>
<li><a href="Sanity_Check_Examples.tgz">Sanity Check Examples</a>: A thorough suite of test cases covering most simple query features.</li>
<li><a href="NBA_Query_Examples.tgz">Example NBA Benchmark Queries</a>: Some very simple queries to get you started.</li>
<li><a href="http://www.tpc.org/information/current_specifications.asp">The TPC-H Benchmark</a>: This benchmark consists of two parts: DBGen (generates the data) and a specification document (defines the queries).  A nice summary of the TPC-H queries can be found <a href="http://www.dbtoaster.org/index.php?page=samples">here</a>.</li>
</ul>
<p>The SQL implementation used by TPC-H differs in a few subtle ways from the implementation used by JSqlParser.  Minor structural rewrites to the queries in the specification document will be required:</p>
<ul>
<li>The date format used by TPC-H differs from the date format used by JSqlParser.  You will need to replace all instances of date 'YYYY-MM-DD' with DATE('YYYY-MM-DD') or {d'YYYY-MM-DD'}.</li>
<li>Many queries in TPC-H use INTERVALs, which the project does not require support for.  However, these are all added to hard-coded parameters.  You will need to manually add the interval to the parameter (e.g., DATE '1982-01-01' + INTERVAL '1 YEAR' becomes DATE('1983-01-01'))</li>
</ul>
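<p>If you would rather not do the interval arithmetic by hand when rewriting queries, <tt>java.time.LocalDate</tt> can do it for you. The helper below is only a convenience sketch for preparing query files; the class name <tt>IntervalFolder</tt> and its method are hypothetical and are not part of the project's required interface.</p>

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper: folds "date + interval" into a single DATE('...') literal. */
final class IntervalFolder {
    /**
     * Shifts an ISO (YYYY-MM-DD) date by the given amount and unit and
     * returns the result in the DATE('YYYY-MM-DD') form described above.
     * A negative amount subtracts the interval.
     */
    static String fold(String isoDate, long amount, ChronoUnit unit) {
        LocalDate shifted = LocalDate.parse(isoDate).plus(amount, unit);
        return "DATE('" + shifted + "')";
    }

    public static void main(String[] args) {
        // DATE '1982-01-01' + INTERVAL '1 YEAR'
        System.out.println(fold("1982-01-01", 1, ChronoUnit.YEARS)); // prints DATE('1983-01-01')
    }
}
```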
<p>Queries that conform to the specifications for this project include: Q1, Q3, Q5, Q6, Q8*, Q9, Q10, Q12*, Q14*, Q15*, Q19* (Asterisks mean that the query doesn't meet the spec as written, but can easily be rewritten into one that does)</p>
<ul>
<li>Q2 requires SubSelect expressions.</li>
<li>Q4  requires EXISTS and SubSelect expressions.</li>
<li>Q7 requires an implementation of the EXTRACT function.</li>
<li>Q8 violates the restriction on simple select items in aggregate queries.  It can be rewritten into a compliant form with FROM-nested Selects.</li>
<li>Q11 violates the simple select item restriction, and requires  SubSelect expressions.</li>
<li>Q12 requires IN expressions, but may be rewritten into a compliant form.</li>
<li>Q13 requires Outer Joins.</li>
<li>Q14 violates the simple select item restriction, but may be rewritten into a compliant form.</li>
<li>Q15 uses views, but may be rewritten into a compliant form</li>
<li>Q16 uses IN and NOT IN expressions as well as SubSelects</li>
<li>Q17 uses SubSelect expressions and violates the simple select item restriction</li>
<li>Q18 uses IN and violates the simple select item restriction</li>
<li>Q19 uses IN but may be rewritten into a compliant form</li>
<li>Q20 uses IN and SubSelects</li>
<li>Q21 uses EXISTS, NOT EXISTS and SubSelects</li>
<li>Q22 requires an implementation of the SUBSTRING function, IN, NOT EXISTS and SubSelects</li>
</ul>
<h1>Code Submission</h1>
<p>As before, all .java files in the src directory at the root of your repository will be compiled (and linked against JSQLParser). Also as before, the class <tt>dubstep.Main</tt> will be invoked with the following arguments:</p>
<ul>
<li>--data data directory: A path to a directory containing the .dat data files for this test.</li>
<li>sql file: one or more sql files for you to parse and evaluate. Treat multiple files as if they were one really really long file (queries will never span multiple files, but we may use one file to define a schema and one with the queries). </li>
</ul>
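<p>A minimal sketch of handling these arguments follows. The class name <tt>ArgsSketch</tt> and its fields are hypothetical; in your submission this logic would live in (or be called from) <tt>dubstep.Main</tt>. It assumes that <tt>--data</tt> may appear anywhere among the arguments and that every other argument is a SQL file, kept in the order given.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for the parsed command line. */
final class ArgsSketch {
    final String dataDir;          // value following --data
    final List<String> sqlFiles;   // remaining arguments, in order

    private ArgsSketch(String dataDir, List<String> sqlFiles) {
        this.dataDir = dataDir;
        this.sqlFiles = sqlFiles;
    }

    static ArgsSketch parse(String[] args) {
        String dataDir = null;
        List<String> sqlFiles = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("--data")) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("--data requires a directory");
                }
                dataDir = args[++i];   // consume the directory that follows the flag
            } else {
                sqlFiles.add(args[i]); // files are evaluated in the order given
            }
        }
        if (dataDir == null) {
            throw new IllegalArgumentException("Missing --data <directory>");
        }
        return new ArgsSketch(dataDir, sqlFiles);
    }

    public static void main(String[] args) {
        ArgsSketch parsed = parse(new String[] {"--data", "data", "schema.sql", "query.sql"});
        System.out.println(parsed.dataDir + " " + parsed.sqlFiles);
    }
}
```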
<p>For example:</p>
<pre>$&gt; ls data
R.dat
S.dat
T.dat
$&gt; cat data/R.dat
1|1|5
1|2|6
2|3|7
$&gt; cat query.sql
CREATE TABLE R(A int, B int, C int)
SELECT B, C FROM R WHERE A = 1
$&gt; java -cp build:jsqlparser.jar dubstep.Main --data data query.sql
1|5
2|6
</pre>
<p>Once again, the data directory contains files named <tt>&lt;table name&gt;.dat</tt>, where <tt>&lt;table name&gt;</tt> is the name used in a CREATE TABLE statement. Notice that the effect of a CREATE TABLE statement is not to create a new file, but simply to link the given schema to an existing .dat file. These files use vertical-pipe (|) as a field delimiter, and newlines (\n) as record delimiters.</p>
<p>The testing environment is configured with the Sun JDK version 1.8.</p>
<h1>Grading</h1>
<p>Your code will be subjected to a sequence of test cases, most of which are provided in the project code (though different data will be used). The NBA queries (in the examples given above) and TPC-H queries (under the constraints listed above) are both fair game. For TPC-H, an SF 0.1 (100MB) dataset will be used. Time constraints are based on the reference implementation for Checkpoint 1.</p>
<table>
<tr>
<th>Query</th>
<th>Max Credit</th>
<th>Fast Time (full credit)</th>
<th>Slow Time (75% credit)</th>
<th>Cutoff Time (50% credit)</th>
<th>Reference Time</th>
</tr>
<tr>
<td>NBA Q1,Q2,Q3,Q4</td>
<td>1 point</td><td>5 s</td><td>15 s</td><td>30 s</td><td>1.5-1.8 s</td>
</tr>
<tr>
<td>TPCH Q1, Q6</td>
<td>1 point</td><td>5 s</td><td>15 s</td><td>30 s</td><td>1.4-1.5 s</td>
</tr>
<tr>
<td>TPCH Q3</td>
<td>2 points</td><td>200 s</td><td>300 s</td><td>420 s</td><td>170 s</td>
</tr>
<tr>
<td>TPCH Q12</td>
<td>2 points</td><td>30 s</td><td>60 s</td><td>90 s</td><td>16 s</td>
</tr>
</table>
<p>Producing the correct result on the test cluster and beating the fast time for a query will earn you full credit for that query. Beating only the slow or cutoff time will earn you 75% or 50% credit, respectively. Your query will be terminated if it runs slower than the cutoff time. The runtime of the reference implementation is also given. Your overall project grade will be the sum of your scores on the individual components.</p>
<p>Additionally, there will be a per-query leader-board for all groups who manage to get full credit on the overall assignment. Good luck.</p>