Deleting old text from checkpoints

This commit is contained in:
Oliver Kennedy 2017-01-03 08:05:23 -05:00
parent 821f516006
commit 5f2209bdca
7 changed files with 41 additions and 643 deletions

@ -1,6 +1,9 @@
---
title: CSE-562; Project 0
---
<ul>
<li><strong>Overview</strong>: Submit a hello world program and agree not to plagiarize code.</li>
<li><strong>Deadline</strong>: Feb 6</li>
<li><strong>Overview</strong>: Answer Select/Project Queries</li>
<li><strong>Deadline</strong>: Feb 10</li>
<li><strong>Grade</strong>: 5% of Overall Grade</li>
</ul>
<h1>The Submission System</h1>
@ -93,7 +96,6 @@ A snapshot of your repository will be taken, and your entire group will receive
<li>Validate the output.</li>
</ol>
If these steps fail for any reason, your submission will receive a 0 and you will need to resubmit. A log of the testing process will be made available on the submission page so that you may correct any errors that occur.
<h1>Project: Hello World!</h1>
Create a class edu.buffalo.cse562.Main with a main function that prints out the following (with no newlines) and exits.
<pre>We, the members of our team, agree that we will not submit any code that we have not written ourselves, share our code with anyone outside of our group, or use code that we have not written ourselves as a reference.</pre>
Make sure your class compiles, push your (committed) repository, and hit Submit.
<h1>Project: A Database Hello World!</h1>
TBD

@ -0,0 +1,11 @@
<ul>
<li><strong>Overview</strong>: Answer Select/Project/Aggregate Queries Efficiently</li>
<li><strong>Deadline</strong>: TBD</li>
<li><strong>Grade</strong>: 15% of Project Component
<ul>
<li>5% Correctness</li>
<li>5% Efficiency</li>
<li>5% Code Review</li>
</ul>
</li>
</ul>

@ -1,235 +0,0 @@
<ul>
<li><strong>Overview</strong>: Submit a simple SPJUA query evaluator.</li>
<li><strong>Deadline</strong>: Feb 23</li>
<li><strong>Grade</strong>: 15% of Project Component
<ul>
<li>5% Correctness</li>
<li>5% Efficiency</li>
<li>5% Code Review</li>
</ul>
</li>
</ul>
In this project, you will implement a simple SQL query evaluator with support for Select, Project, Join, Bag Union, and Aggregate operations.  You will receive a set of data files and schema information, and will be expected to evaluate multiple SELECT queries over those data files.
Your code is expected to evaluate the SELECT statements on provided data, and produce output in a standardized form. Your code will be evaluated for both correctness and performance (in comparison to a naive evaluator based on iterators and nested-loop joins).
<h1>Parsing SQL</h1>
A parser converts a human-readable string into a <span style="line-height: 1.5;">structured representation of the program (or query) that the string describes. A fork of the open-source SQL parser <a href="http://jsqlparser.sourceforge.net">JSQLParser</a> will be provided for your use.  The JAR may be downloaded from</span>
<p style="text-align: center;"><a href="http://odin.cse.buffalo.edu/resources/jsqlparser/jsqlparser.jar">http://odin.cse.buffalo.edu/resources/jsqlparser/jsqlparser.jar</a></p>
And documentation for the fork is available at
<p style="text-align: center;"><a href="http://odin.cse.buffalo.edu/resources/jsqlparser"><span style="line-height: 1.5;">http://odin.cse.buffalo.edu/resources/jsqlparser</span></a></p>
You are not required to use this parser (i.e., you may write your own if you like). However, we will be testing your code on SQL that is guaranteed to parse with JSqlParser.
Basic use of the parser requires a <tt>java.io.Reader</tt> or <tt>java.io.InputStream</tt> from which the data to be parsed is read (for example, a <tt>java.io.FileReader</tt>). Let's assume you've created one already (of either type) and called it <tt>inputFile</tt>.
<pre class="prettyprint">CCJSqlParser parser = new CCJSqlParser(inputFile);
Statement statement;
while((statement = parser.Statement()) != null){
  // `statement` now has one of the several
  // implementations of the Statement interface
}
// End-of-file. Exit!</pre>
At this point, you'll need to figure out what kind of statement you're dealing with. For this project, we'll be working with <tt>Select</tt> and <tt>CreateTable</tt>. There are two ways to do this. JSqlParser defines a Visitor style interface that you can use if you're familiar with the pattern. However, my preference is for the simpler and lighter-weight <tt>instanceof</tt> relation:
<pre class="prettyprint">if(statement instanceof Select) {
  Select selectStatement = (Select)statement;
  // handle the select
} else if(statement instanceof CreateTable) {
  // and so forth
}</pre>
<h2>Example</h2>
<iframe src="https://www.youtube.com/embed/U4TyaHTJ3Zg" width="420" height="315" frameborder="0" allowfullscreen="allowfullscreen"></iframe>
<h1>Expressions</h1>
JSQLParser includes an object called Expression that represents a primitive-valued expression parse tree.  In addition to the parser, we are providing a collection of classes for manipulating and evaluating Expressions.  The JAR may be downloaded from
<p style="text-align: center;"><a href="http://odin.cse.buffalo.edu/resources/expressionlib/expression.jar"><small>http://odin.cse.buffalo.edu/resources/expressionlib/expression.jar</small></a></p>
<p style="text-align: left;"> Documentation for the library is available at</p>
<p style="text-align: center;"><a href="http://odin.cse.buffalo.edu/resources/expressionlib">http://odin.cse.buffalo.edu/resources/expressionlib</a></p>
<p style="text-align: left;">To use the <tt>Eval</tt> class, you will need to define a method for dereferencing <tt>Column</tt> objects.  For example, if I have a <tt>Map</tt> called <tt>tupleSchema</tt> that contains my tuple schema, and an <tt>ArrayList</tt> called <tt>tuple</tt> that contains the tuple I am currently evaluating, I might write:</p>
<pre class="prettyprint">public LeafValue eval(Column x){
  // look up the column's position in the current tuple
  int colID = tupleSchema.get(x.getName());
  return tuple.get(colID);
}</pre>
<p style="text-align: left;">After doing this, you can use Eval.eval() to evaluate any expression in the context of tuple.</p>
<h1 style="text-align: left;">Source Data</h1>
Because you are implementing a query evaluator and not a full database engine, there will not be any tables -- at least not in the traditional sense of persistent objects that can be updated and modified. Instead, you will be given a <strong>Table Schema</strong> and a <strong>CSV File</strong> with the instance in it. To keep things simple, we will use the <tt>CREATE TABLE</tt> statement to define a relation's schema. You do not need to allocate any resources for the table in reaction to a <tt>CREATE TABLE</tt> statement -- simply save the schema that you are given for later use. The SQL types (and their corresponding Java types) that will be used in this project are as follows:
<table>
<tbody>
<tr>
<th><span style="text-decoration: underline;">SQL Type</span></th>
<th><span style="text-decoration: underline;">Java Equivalent</span></th>
</tr>
<tr>
<td>string</td>
<td>StringValue</td>
</tr>
<tr>
<td>varchar</td>
<td>StringValue</td>
</tr>
<tr>
<td>char</td>
<td>StringValue</td>
</tr>
<tr>
<td>int</td>
<td>LongValue</td>
</tr>
<tr>
<td>decimal</td>
<td>DoubleValue</td>
</tr>
<tr>
<td>date</td>
<td>DateValue</td>
</tr>
</tbody>
</table>
In addition to the schema, you will be given a data directory containing multiple data files whose names correspond to the table names given in the <tt>CREATE TABLE</tt> statements. For example, let's say that you see the following statement in your query file:
<pre class="prettyprint">CREATE TABLE R(A int, B int, C int);</pre>
That means that the data directory contains a data file called 'R.dat' that might look like this:
<pre>1|1|5
1|2|6
2|3|7</pre>
Each line of text (see <tt>java.io.BufferedReader.readLine()</tt>) corresponds to one row of data. Fields within a row are delimited by a vertical pipe '|' character.  Integers and floats are stored in a form recognized by Java's Long.parseLong() and Double.parseDouble() methods. Dates are stored in YYYY-MM-DD form, where YYYY is the 4-digit year, MM is the 2-digit month number, and DD is the 2-digit day of the month. Strings are stored unescaped and unquoted and are guaranteed to contain no vertical pipe characters.
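As a rough illustration of the file format described above, the following sketch splits one line on '|' and converts each field according to its declared SQL type. The class name <tt>RowParser</tt> and the use of plain Java objects (Long, Double, LocalDate, String) in place of the parser's LeafValue classes are assumptions made to keep the example self-contained; your own evaluator would likely build LongValue, DoubleValue, and so on instead.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JSqlParser or ExpressionLib.
final class RowParser {
    /** Parses one '|'-delimited line using the column types from CREATE TABLE. */
    static List<Object> parseRow(String line, List<String> sqlTypes) {
        // A limit of -1 keeps trailing empty fields (e.g., an empty string column).
        String[] fields = line.split("\\|", -1);
        if (fields.length != sqlTypes.size()) {
            throw new IllegalArgumentException(
                "expected " + sqlTypes.size() + " fields, got " + fields.length);
        }
        List<Object> row = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            // Drop any length/precision suffix such as "(25)" before matching.
            String type = sqlTypes.get(i).toLowerCase().replaceAll("\\(.*\\)", "").trim();
            switch (type) {
                case "int":     row.add(Long.parseLong(fields[i])); break;
                case "decimal": row.add(Double.parseDouble(fields[i])); break;
                case "date":    row.add(LocalDate.parse(fields[i])); break; // YYYY-MM-DD
                default:        row.add(fields[i]); // string, varchar, char
            }
        }
        return row;
    }
}
```

For the R.dat example above, <tt>RowParser.parseRow("1|1|5", Arrays.asList("int", "int", "int"))</tt> would yield three Long values.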
<h1>Queries</h1>
Your code is expected to support both aggregate and non-aggregate queries with the following features.  Keep in mind that this is only a minimum requirement.
<ul>
<li>Non-Aggregate Queries
<ul>
<li>SelectItems may include:
<ul>
<li><strong>SelectExpressionItem</strong>: Any expression that ExpressionLib can evaluate.  Note that Column expressions may or may not include an appropriate source.  Where relevant, column aliases will be given, unless the SelectExpressionItem's expression is a Column (in which case the Column's name attribute should be used as an alias)</li>
<li><strong>AllTableColumns</strong>: For any aliased term in the from clause</li>
<li><strong>AllColumns</strong>: If present, this will be the only SelectItem in a given PlainSelect.</li>
</ul>
</li>
</ul>
</li>
<li>Aggregate Queries
<ul>
<li><strong>SelectItems</strong> may include:
<ul>
<li><strong>SelectExpressionItem</strong>s where the Expression is one of:
<ul>
<li>A Function with the (case-insensitive) name: SUM, COUNT, AVG, MIN or MAX.  The Function's argument(s) may be any expression(s) that can be evaluated by ExpressionLib.</li>
<li>A Single Column that also occurs in the GroupBy list.</li>
</ul>
</li>
<li><strong>AllTableColumns</strong>: If all of the table's columns also occur in the GroupBy list</li>
<li><strong>AllColumns</strong>: If all of the source's columns also occur in the GroupBy list.</li>
</ul>
</li>
<li>GroupBy column references are all Columns.</li>
</ul>
</li>
<li>Both Types of Queries
<ul>
<li>From/Joins may include:
<ul>
<li><strong>Join</strong>: All joins will be simple joins</li>
<li><strong>Table</strong>: Tables may or may not be aliased.  Non-Aliased tables should be treated as being aliased to the table's name.</li>
<li><strong>SubSelect</strong>: SubSelects may be aggregate or non-aggregate queries, as here.</li>
</ul>
</li>
<li>The Where/Having clauses may include:
<ul>
<li>Any expression that ExpressionLib will evaluate to an instance of BooleanValue</li>
</ul>
</li>
<li>Allowable Select Options include
<ul>
<li>SELECT DISTINCT (but not SELECT DISTINCT ON)</li>
<li>UNION ALL (but not UNION)</li>
<li>Order By: The OrderByItem expressions may include any expression that can be evaluated by ExpressionLib.  Columns in the OrderByItem expressions will refer only to aliases defined in the SelectItems (i.e., the output schema of the query's projection).  See TPC-H Benchmark Query 5 for an example of this.</li>
<li>Limit: RowCount limits (e.g., LIMIT 5), but not Offset limits (e.g., LIMIT 5 OFFSET 10) or JDBC parameter limits.</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1>Output</h1>
Your code is expected to output query results in the same format as the input data:
<ul>
<li>One output row per ('\n'-delimited) line.  If there is no ORDER BY clause, you may emit the rows in any order.</li>
<li>One output value per ('|'-delimited) field.  Columns should appear in the same order that they appear in the query.  Table Wildcards should be resolved in the same order that the columns appear in the CREATE TABLE statement.  Global Wildcards should be resolved as Table Wildcards with the tables in the same order that they appear in the FROM clause.</li>
<li>A trailing newline as the last character of the file.</li>
<li>You should not output any header information or other formatting.</li>
</ul>
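The output rules above amount to joining each row's values with '|' and ending every row with '\n'. A minimal sketch, assuming a row is held as a list of Java objects (the class name <tt>RowFormatter</tt> is made up for this example, and how decimal values should be printed is not specified here, so this simply uses each value's default string form):

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper for writing result rows.
final class RowFormatter {
    /** Returns the row as '|'-delimited text terminated by a newline. */
    static String formatRow(List<?> row) {
        StringJoiner fields = new StringJoiner("|");
        for (Object value : row) {
            fields.add(String.valueOf(value));
        }
        // Every row, including the last, ends in '\n', which also satisfies
        // the trailing-newline requirement for the whole output.
        return fields.toString() + "\n";
    }
}
```

Writing through <tt>System.out.print(RowFormatter.formatRow(row))</tt> rather than <tt>println</tt> avoids emitting a platform-specific line separator.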
<h1>Example Queries and Data</h1>
These are only examples.  Your code will be expected to handle these queries, as well as others.
<a href="http://odin.cse.buffalo.edu/resources/cse562/Sanity_Check_Examples.tgz">Sanity Check Examples</a>: A thorough suite of test cases covering most simple query features.
<a href="http://odin.cse.buffalo.edu/resources/cse562/NBA_Query_Examples.tgz">Example NBA Benchmark Queries</a>: Some very simple queries to get you started.
<a href="http://www.tpc.org/information/current_specifications.asp">The TPC-H Benchmark</a>: This benchmark consists of two parts: DBGen (generates the data) and a specification document (defines the queries).  A nice summary of the TPC-H queries can be found <a href="http://www.dbtoaster.org/index.php?page=samples">here</a>.
The SQL implementation used by TPC-H differs in a few subtle ways from the implementation used by JSqlParser.  Minor structural rewrites to the queries in the specification document will be required:
<ul>
<li>The date format used by TPC-H differs from the date format used by JSqlParser.  You will need to replace all instances of date 'YYYY-MM-DD' with DATE('YYYY-MM-DD') or {d'YYYY-MM-DD'}</li>
<li>Many queries in TPC-H use INTERVALs, which the project does not require support for.  However, these are all added to hard-coded parameters.  You will need to manually add the interval to the parameter (e.g., DATE '1982-01-01' + INTERVAL '1 YEAR' becomes DATE('1983-01-01'))</li>
</ul>
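The INTERVAL rewrite is ordinary calendar arithmetic, so it can be checked mechanically instead of by hand. The sketch below uses <tt>java.time.LocalDate</tt> to fold a year or month interval into a hard-coded date and print it in the DATE('YYYY-MM-DD') form; the class and method names are hypothetical and only meant as a convenience while editing the benchmark queries.

```java
import java.time.LocalDate;

// Hypothetical helper for pre-computing "DATE '...' + INTERVAL ..." by hand.
final class IntervalRewrite {
    /** DATE 'isoDate' + INTERVAL 'years' YEAR, as a DATE(...) literal. */
    static String addYears(String isoDate, int years) {
        return "DATE('" + LocalDate.parse(isoDate).plusYears(years) + "')";
    }

    /** DATE 'isoDate' + INTERVAL 'months' MONTH, as a DATE(...) literal. */
    static String addMonths(String isoDate, int months) {
        return "DATE('" + LocalDate.parse(isoDate).plusMonths(months) + "')";
    }
}
```

For instance, <tt>IntervalRewrite.addYears("1982-01-01", 1)</tt> gives <tt>DATE('1983-01-01')</tt>, matching the example in the list above.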
Queries that conform to the specifications for this project include: Q1, Q3, Q5, Q6, Q8*, Q9, Q10, Q12*, Q14*, Q15*, Q19* (Asterisks mean that the query doesn't meet the spec as written, but can easily be rewritten into one that does)
<ul>
<li>Q2 requires SubSelect expressions.</li>
<li>Q4  requires EXISTS and SubSelect expressions.</li>
<li>Q7 requires an implementation of the EXTRACT function.</li>
<li>Q8 violates the restriction on simple select items in aggregate queries.  It can be rewritten into a compliant form with FROM-nested Selects.</li>
<li>Q11 violates the simple select item restriction, and requires  SubSelect expressions.</li>
<li>Q12 requires IN expressions, but may be rewritten into a compliant form.</li>
<li>Q13 requires Outer Joins.</li>
<li>Q14 violates the simple select item restriction, but may be rewritten into a compliant form.</li>
<li>Q15 uses views, but may be rewritten into a compliant form</li>
<li>Q16 uses IN and NOT IN expressions as well as SubSelects</li>
<li>Q17 uses SubSelect expressions and violates the simple select item restriction</li>
<li>Q18 uses IN and violates the simple select item restriction</li>
<li>Q19 uses IN but may be rewritten into a compliant form</li>
<li>Q20 uses IN and SubSelects</li>
<li>Q21 uses EXISTS, NOT EXISTS and SubSelects</li>
<li>Q22 requires an implementation of the SUBSTRING function, IN, NOT EXISTS and SubSelects</li>
</ul>
<h1 style="text-align: left;">Code Submission</h1>
As before, all .java files in the src directory at the root of your repository will be compiled (and linked against JSQLParser). Also as before, the class
<pre> edu.buffalo.cse562.Main
</pre>
will be invoked with the following arguments:
<ul>
<li>--data data directory: A path to a directory containing the .dat data files for this test.</li>
<li>sql file: one or more sql files for you to parse and evaluate.</li>
</ul>
For example:
<pre>$&gt; ls data
R.dat
S.dat
T.dat
$&gt; cat data/R.dat
1|1|5
1|2|6
2|3|7
$&gt; cat query.sql
CREATE TABLE R(A int, B int, C int)
SELECT B, C FROM R WHERE A = 1
$&gt; java -cp build:jsqlparser.jar edu.buffalo.cse562.Main --data data query.sql
1|5
2|6
</pre>
Once again, the data directory contains files named table name.dat where table name is the name used in a CREATE TABLE statement. Notice the effect of CREATE TABLE statements is not to create a new file, but simply to link the given schema to an existing .dat file. These files use vertical-pipe (|) as a field delimiter, and newlines (\n) as record delimiters.
The testing environment is configured with the Sun JDK version 1.8.
<h1>Grading</h1>
Your code will be subjected to a sequence of test cases, most of which are provided in the project code (though different data will be used). Two evaluation phases will be performed. Phase 1 will be performed on small datasets (&lt; 100 rows per input table) and each run will be graded on a per-test-case basis as follows:
<ul>
<li><strong>0/10 (F)</strong>: Your submission does not compile, does not produce correct output, or fails in some other way. Resubmission is highly encouraged.</li>
<li><strong>5/10 (C)</strong>: Your submission runs the test query in under 30 seconds on the test machine, and produces properly formatted output.</li>
<li><strong>7.5/10 (B)</strong>: Your submission runs the test query in under 15 seconds on the test machine, and produces the correct output.</li>
<li><strong>10/10 (A)</strong>: Your submission runs the test query in under 5 seconds on the test machine, and produces the correct output.</li>
</ul>
Phase 2 will evaluate your code on more complex queries that create large intermediate states (100+ MB). Queries for which your submission does not produce correct output, or which your submission takes over 1 minute to process will receive an F. Otherwise, your submission will be graded on the runtime of each test as follows
<ul>
<li><strong>5/10 (C)</strong>: Produce correct output in 1 minute or less.</li>
<li><strong>7.5/10 (B)</strong>: Produce correct output in 30 seconds or less.</li>
<li><strong>10/10 (A)</strong>: Produce correct output in 15 seconds or less</li>
<li><strong>12/10 (A+)</strong>: Produce correct output in 8 seconds or less</li>
</ul>
Your overall project grade will be a weighted average of the individual components.  It will be possible to earn extra credit by beating the reference implementation.
Additionally, there will be a per-query leader-board for all groups who manage to beat the reference implementation.

@ -0,0 +1,11 @@
<ul>
<li><strong>Overview</strong>: Add Sort/Limit/Join</li>
<li><strong>Deadline</strong>: TBD</li>
<li><strong>Grade</strong>: 15% of Project Component
<ul>
<li>5% Correctness</li>
<li>5% Efficiency</li>
<li>5% Code Review</li>
</ul>
</li>
</ul>

@ -1,193 +0,0 @@
<ul>
<li><strong>Overview</strong>: Optimize your implementation to support specialized join algorithms and limited memory.</li>
<li><strong>Deadline</strong>: March 30</li>
<li><strong>Grade</strong>: 15% of Project Component
<ul>
<li>5% Correctness</li>
<li>5% Efficiency</li>
<li>5% Code Review</li>
</ul>
</li>
</ul>
<p style="text-align: justify;"><span id="LC20" class="line">This project is, in effect, a more rigorous form of Project 1. The requirements are identical: We give you a query and some data, you evaluate the query on the data and give us a response as quickly as possible.</span></p>
<p style="text-align: justify;"><span id="LC24" class="line">First, this means that we'll be expecting a more feature-complete submission. Your code will be evaluated on more queries from TPC-H benchmark, which exercises a broader range of SQL features than the Project 1 test cases did.</span></p>
<p style="text-align: justify;"><span id="LC28" class="line">Second, performance constraints will be tighter. The reference implementation for this project has been improved over that of Project 1, meaning that you'll be expected to perform more efficiently, and to handle data that does not fit into main memory.</span></p>
<hr />
<h1>Join Ordering</h1>
<p style="text-align: justify;">The order in which you join tables together is <strong>incredibly important</strong>, and can change the runtime of your query by <strong>multiple orders of magnitude</strong>.  However, choosing between join orderings requires statistics about the data, something that won't really be feasible until the next project.  Instead, here's a present for those of you paying attention.  The tables in each FROM clause are ordered so that you will get our recommended join order by building a <em>left-deep plan</em> going in-order of the relation list (something that many of you are doing already), and (for hybrid hash joins) using the left-hand-side relation to build your hash table.</p>
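To make the "build on the left" advice concrete, here is a small in-memory equi-join sketch that builds its hash table from the left input and probes it with the right. It is only an illustration under simplifying assumptions: tuples are <tt>long[]</tt> arrays, the join is on a single column, and everything fits in memory (so it is a plain hash join, not the hybrid variant). The class name <tt>HashJoinSketch</tt> is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified operator: single-column equi-join over long[] tuples.
final class HashJoinSketch {
    static List<long[]> hashJoin(List<long[]> left, int leftKey,
                                 List<long[]> right, int rightKey) {
        // Build phase: index the left-hand-side relation by its join key.
        Map<Long, List<long[]>> table = new HashMap<>();
        for (long[] l : left) {
            table.computeIfAbsent(l[leftKey], k -> new ArrayList<>()).add(l);
        }
        // Probe phase: stream the right-hand side and emit left ++ right.
        List<long[]> out = new ArrayList<>();
        for (long[] r : right) {
            List<long[]> matches = table.get(r[rightKey]);
            if (matches == null) continue;
            for (long[] l : matches) {
                long[] joined = Arrays.copyOf(l, l.length + r.length);
                System.arraycopy(r, 0, joined, l.length, r.length);
                out.add(joined);
            }
        }
        return out;
    }
}
```

In a left-deep plan over FROM R, S, T, the result of joining R with S would become the left (build) input of the join with T, and so on down the relation list.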
<h1><span id="LC33" class="line">Blocking Operators and Memory</span></h1>
<p style="text-align: justify;"><span id="LC35" class="line">Blocking operators (e.g., the Sort operator and joins other than Merge Join) are generally blocking because they need to materialize instances of a relation. For half of this project, you will not have enough memory available to materialize a full relation, to say nothing of join results. To successfully process these queries, you will need to implement out-of-core equivalents of these operators: at least one External Join (e.g., Block-Nested-Loop, Hash, or Sort/Merge Join) and an out-of-core Sort Algorithm (e.g., External Sort).</span></p>
<p style="text-align: justify;"><span id="LC37" class="line">For your reference, the evaluation machines have 2GB of memory.  In phase 2, Java will be configured for <strong>100 MB of heap space</strong> (see the command line argument -Xmx).  To work with such a small amount of heap space, <strong>you will need to manually invoke Java's garbage collector</strong> by calling <tt>System.gc()</tt>.  How frequently you do this is up to you.  The longer you wait, the greater the chance that you'll run out of memory.  The reference implementation calls it in the Two-Phase sort operator, every time it finishes flushing a file out to disk.</span></p>
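One possible shape for a two-phase external sort is sketched below. This is not the reference implementation: it sorts whole text lines lexicographically, bounds memory by a line count rather than by bytes, and the class name and parameters are invented for the example. A real ORDER BY would compare typed column values instead. The call to <tt>System.gc()</tt> after each flush mirrors the advice above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch of a two-phase (run generation + k-way merge) sort.
final class ExternalSortSketch {
    /** Sorts the lines of input into output, holding at most maxLines lines at once. */
    static void sort(Path input, Path output, Path swapDir, int maxLines) throws IOException {
        List<Path> runs = new ArrayList<>();
        // Phase 1: read bounded chunks, sort each in memory, flush as a sorted run.
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() >= maxLines) runs.add(flush(chunk, swapDir));
            }
            if (!chunk.isEmpty()) runs.add(flush(chunk, swapDir));
        }
        // Phase 2: merge the runs; the heap holds one (line, run index) pair per run.
        List<BufferedReader> readers = new ArrayList<>();
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(Map.Entry.<String, Integer>comparingByKey());
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (int i = 0; i < runs.size(); i++) {
                BufferedReader r = Files.newBufferedReader(runs.get(i));
                readers.add(r);
                String first = r.readLine();
                if (first != null) heap.add(new AbstractMap.SimpleEntry<>(first, i));
            }
            while (!heap.isEmpty()) {
                Map.Entry<String, Integer> smallest = heap.poll();
                out.write(smallest.getKey());
                out.newLine();
                String next = readers.get(smallest.getValue()).readLine();
                if (next != null) heap.add(new AbstractMap.SimpleEntry<>(next, smallest.getValue()));
            }
        } finally {
            for (BufferedReader r : readers) r.close();
            for (Path run : runs) Files.deleteIfExists(run);
        }
    }

    /** Sorts the chunk, writes it to a temp file in swapDir, and empties it. */
    private static Path flush(List<String> chunk, Path swapDir) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile(swapDir, "run", ".tmp");
        Files.write(run, chunk);
        chunk.clear();
        System.gc(); // give the collector a chance before the next chunk is read
        return run;
    }
}
```

Choosing <tt>maxLines</tt> is the main tuning knob: larger chunks mean fewer runs to merge, but more heap in use between flushes.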
<h1>Query Rewriting</h1>
<p style="text-align: justify;">In Project 1, you were encouraged to parse SQL into a relational algebra tree.  Project 2 is where that design choice begins to pay off.  We've discussed expression equivalences in relational algebra, and identified several that are always good (e.g., pushing down selection operators). The reference implementation uses some simple recursion to identify patterns of expressions that can be optimized and rewrite them.  For example, if I wanted to define a new HashJoin operator, I might go through and replace every qualifying Selection operator sitting on top of a CrossProduct operator with a HashJoin.</p>
<pre class="prettyprint">if(o instanceof Selection){
  Selection s = (Selection)o;
  if(s.getChild() instanceof CrossProduct){
    CrossProduct prod =
      (CrossProduct)s.getChild();
    Expression join_cond =
      // find a good join condition in
      // the predicate of s.
    Expression rest =
      // the remaining conditions
    return new Selection(
      rest,
      new HashJoin(
        join_cond,
        prod.getLHS(),
        prod.getRHS()
      )
    );
  }
}
return o;</pre>
<p style="text-align: justify;">The reference implementation has a function similar to this snippet of code, and applies the function to every node in the relational algebra tree.</p>
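Applying a rule "to every node" can be done with a short bottom-up recursion. The sketch below uses toy plan classes invented for this example (they carry no predicates or schemas), and its rule assumes the selection's entire predicate is the join condition, so no residual Selection is kept. Treat it as an outline of the traversal rather than of the reference implementation's operators.

```java
import java.util.function.UnaryOperator;

// Hypothetical plan nodes and rewrite driver, for illustration only.
final class PlanRewriteSketch {
    static abstract class Node {
        Node left, right; // null where a node has fewer than two children
        Node(Node left, Node right) { this.left = left; this.right = right; }
    }
    static final class Scan extends Node {
        final String table;
        Scan(String table) { super(null, null); this.table = table; }
        @Override public String toString() { return table; }
    }
    static final class Selection extends Node {
        Selection(Node child) { super(child, null); }
        @Override public String toString() { return "Select(" + left + ")"; }
    }
    static final class CrossProduct extends Node {
        CrossProduct(Node l, Node r) { super(l, r); }
        @Override public String toString() { return "Cross(" + left + ", " + right + ")"; }
    }
    static final class HashJoin extends Node {
        HashJoin(Node l, Node r) { super(l, r); }
        @Override public String toString() { return "HashJoin(" + left + ", " + right + ")"; }
    }

    /** Rewrites the children first, then offers the updated node to the rule. */
    static Node rewrite(Node n, UnaryOperator<Node> rule) {
        if (n == null) return null;
        n.left = rewrite(n.left, rule);
        n.right = rewrite(n.right, rule);
        return rule.apply(n);
    }

    /** Simplified rule: a Selection directly over a CrossProduct becomes a HashJoin. */
    static Node selectionToJoin(Node n) {
        if (n instanceof Selection && n.left instanceof CrossProduct) {
            return new HashJoin(n.left.left, n.left.right);
        }
        return n;
    }
}
```

Calling <tt>PlanRewriteSketch.rewrite(plan, PlanRewriteSketch::selectionToJoin)</tt> on a Selection over a CrossProduct of scans R and S returns a plan that prints as <tt>HashJoin(R, S)</tt>; nodes that do not match the pattern are returned unchanged.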
<p style="text-align: justify;">Because selection can be decomposed, you may find it useful to have a piece of code that can split AndExpressions into a list of conjunctive terms:</p>
<pre class="prettyprint">List&lt;Expression&gt; splitAndClauses(Expression e)
{
  List&lt;Expression&gt; ret =
    new ArrayList&lt;Expression&gt;();
  if(e instanceof AndExpression){
    // recursively flatten both sides of the AND
    AndExpression a = (AndExpression)e;
    ret.addAll(
      splitAndClauses(a.getLeftExpression())
    );
    ret.addAll(
      splitAndClauses(a.getRightExpression())
    );
  } else {
    ret.add(e);
  }
  return ret;
}</pre>
<h1>Interface</h1>
<p style="text-align: justify;">Your code will be evaluated in exactly the same way as Project 1.  Your code will be presented with a 1GB (SF 1) TPC-H dataset.  Grading will proceed in two phases.  In the first phase, you will have an unlimited amount of memory, but very tight time constraints.  In the second phase, you will have slightly looser time constraints, but will be limited to 100 MB of memory, and presented with either a 1GB or a 200 MB (SF 0.2) dataset.</p>
<p style="text-align: justify;">As before, your code will be invoked with the data directory and the relevant SQL files. An additional parameter will be used in Phase 2:</p>
<ul>
<li><tt>--swap directory</tt>: A swap directory that you're allowed to write to. No other directories are guaranteed to be available or writeable. If the --swap parameter is not present, you should not swap (i.e., the data size is small enough to be handled entirely in-memory)</li>
</ul>
<pre>java -cp build:jsqlparser.jar
-Xmx100m # Heap limit (Phase 2 only)
edu.buffalo.cse562.Main
--data [data]
--swap [swap]
[sqlfile1] [sqlfile2] ...</pre>
This example uses the following directories and files:
<ul>
<li><tt>[data]</tt>: Table data stored in '|' separated files. As before, table names match the names provided in the matching CREATE TABLE with the .dat suffix.</li>
<li><tt>[swap]</tt>: A temporary directory for an individual run. This directory will be emptied after every trial.</li>
<li><tt>[sqlfileX]</tt>: A file containing CREATE TABLE and SELECT statements, defining the schema of the dataset and the query to process</li>
</ul>
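A minimal sketch of reading these arguments follows. The class and field names are hypothetical; the only behavior taken from the description above is that <tt>--data</tt> and <tt>--swap</tt> are each followed by a directory, that <tt>--swap</tt> may be absent (in which case nothing should be written to disk), and that every remaining argument is a SQL file.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the command-line arguments described above.
final class CliArgs {
    File dataDir;  // value following --data
    File swapDir;  // value following --swap; null means "do not swap"
    final List<File> sqlFiles = new ArrayList<>();

    static CliArgs parse(String[] argv) {
        CliArgs args = new CliArgs();
        for (int i = 0; i < argv.length; i++) {
            if (argv[i].equals("--data") && i + 1 < argv.length) {
                args.dataDir = new File(argv[++i]);
            } else if (argv[i].equals("--swap") && i + 1 < argv.length) {
                args.swapDir = new File(argv[++i]);
            } else {
                // Anything that is not a recognized flag is treated as a SQL file.
                args.sqlFiles.add(new File(argv[i]));
            }
        }
        return args;
    }
}
```

An operator can then check <tt>args.swapDir != null</tt> to decide between its in-memory and out-of-core variants.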
<h1>Grading</h1>
<p style="text-align: justify;">Your code will be subjected to a sequence of test cases and evaluated on speed and correctness.  Note that unlike Project 1, you will neither receive a warning about, nor partial credit for out-of-order query results if the outermost query includes an ORDER BY clause.</p>
<p style="text-align: justify;">Phase 1 (big queries) will be graded on a TPC-H SF 1 dataset (1 GB of raw text data).  Phase 2 (limited memory) will be graded on either a TPC-H SF 1 or SF 0.2 (200 MB of raw text data) dataset as listed in the chart below.  Grades are assigned based on per-query thresholds:</p>
<ul>
<li style="text-align: justify;"><strong>0/10 (F)</strong>: Your submission does not compile, does not produce correct output, or fails in some other way. Resubmission is highly encouraged.</li>
<li style="text-align: justify;"><strong>5/10 (C)</strong>: Your submission runs the test query faster than the C threshold (listed below for each query), and produces the correct output.</li>
<li style="text-align: justify;"><strong>7.5/10 (B)</strong>: Your submission runs the test query faster than the B threshold (listed below for each query), and produces the correct output.</li>
<li><strong>10/10 (A)</strong>: Your submission runs the test query faster than the A threshold (listed below for each query), and produces the correct output.</li>
</ul>
<table>
<tbody>
<tr>
<th>TPC-H Query</th>
<th>Phase 1 Runtimes</th>
<td></td>
<th>Phase 2 Runtimes</th>
<th>Phase 2 Scaling Factor</th>
</tr>
<tr>
<th style="text-align: center;" rowspan="3">1</th>
<td style="text-align: center;">45 s</td>
<th style="text-align: center;">A</th>
<td style="text-align: center;">1 min</td>
<td style="text-align: center;" rowspan="3">SF = 0.2</td>
</tr>
<tr>
<td style="text-align: center;">67.5 s</td>
<th>B</th>
<td style="text-align: center;">2 min</td>
</tr>
<tr>
<td style="text-align: center;">90 s</td>
<th>C</th>
<td style="text-align: center;">3 min</td>
</tr>
<tr>
<th style="text-align: center;" rowspan="3">3</th>
<td style="text-align: center;">45 s</td>
<th>A</th>
<td style="text-align: center;">40 s</td>
<td style="text-align: center;" rowspan="3">SF = 0.2</td>
</tr>
<tr>
<td style="text-align: center;">90 s</td>
<th>B</th>
<td style="text-align: center;">80 s</td>
</tr>
<tr>
<td style="text-align: center;">120 s</td>
<th>C</th>
<td style="text-align: center;">120 s</td>
</tr>
<tr>
<th style="text-align: center;" rowspan="3">5</th>
<td style="text-align: center;">45 s</td>
<th>A</th>
<td style="text-align: center;">70 s</td>
<td style="text-align: center;" rowspan="3">SF = 0.2</td>
</tr>
<tr>
<td style="text-align: center;">90 s</td>
<th>B</th>
<td style="text-align: center;">140 s</td>
</tr>
<tr>
<td style="text-align: center;">120 s</td>
<th>C</th>
<td style="text-align: center;">210 s</td>
</tr>
<tr>
<th style="text-align: center;" rowspan="3">10</th>
<td style="text-align: center;">45 s</td>
<th>A</th>
<td style="text-align: center;">2 min</td>
<td style="text-align: center;" rowspan="3">SF = 1</td>
</tr>
<tr>
<td style="text-align: center;">67.5 s</td>
<th>B</th>
<td style="text-align: center;">4 min</td>
</tr>
<tr>
<td style="text-align: center;">90 s</td>
<th>C</th>
<td style="text-align: center;">6 min</td>
</tr>
<tr>
<th style="text-align: center;" rowspan="3">12</th>
<td style="text-align: center;">45 s</td>
<th>A</th>
<td style="text-align: center;">1.5 min</td>
<td style="text-align: center;" rowspan="3">SF = 1</td>
</tr>
<tr>
<td style="text-align: center;">67.5 s</td>
<th>B</th>
<td style="text-align: center;">3 min</td>
</tr>
<tr>
<td style="text-align: center;">90 s</td>
<th style="text-align: center;">C</th>
<td style="text-align: center;">4.5 min</td>
</tr>
</tbody>
</table>

@ -0,0 +1,11 @@
<ul>
<li><strong>Overview</strong>: Answer Big Data Queries Efficiently</li>
<li><strong>Deadline</strong>: TBD</li>
<li><strong>Grade</strong>: 15% of Project Component
<ul>
<li>5% Correctness</li>
<li>5% Efficiency</li>
<li>5% Code Review</li>
</ul>
</li>
</ul>

@ -1,209 +0,0 @@
<ul>
<li><strong>Overview</strong>: Add a pre-processing phase to your system.</li>
<li><strong>Deadline</strong>: May 8</li>
<li><strong>Grade</strong>: 15% of Project Component
<ul>
<li>5% Correctness</li>
<li>5% Efficiency</li>
<li>5% Code Review</li>
</ul>
</li>
</ul>
<p style="text-align: justify;">Once again, we will be tightening performance constraints.  You will be expected to complete queries in seconds, rather than tens of seconds as before.  This time however, you will be given a few minutes alone with the data before we start timing you.</p>
<p style="text-align: justify;">Concretely, you will be given a period of up to 5 minutes that we'll call the Load Phase.  During the load phase, you will have access to the data, as well as a database directory that will not be erased in between runs of your application.  Example uses for this time include building indexes or  gathering statistics about the data for use in cost-based estimation.</p>
<p style="text-align: justify;">Additionally, CREATE TABLE statements are now annotated with PRIMARY KEY and FOREIGN KEY attributes.  You may hardcode index selections for the TPC-H benchmark based on your own experimentation.</p>
<hr />
<h1>BerkeleyDB</h1>
<p style="text-align: justify;">For this project, you will get access to a new library: BerkeleyDB (Java Edition).  Don't let the name mislead you, BDB is not actually a full database system.  Rather, BDB implements the indexing and persistence layers of a database system.  Download BDB at:</p>
<p style="text-align: center;"><a href="http://odin.cse.buffalo.edu/resources/berkeleydb/berkeleydb.jar">http://odin.cse.buffalo.edu/resources/berkeleydb/berkeleydb.jar</a></p>
<p style="text-align: justify;">The BerkeleyDB documentation is mirrored at:</p>
<p style="text-align: center;"><a href="http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide/dpl.html">http://odin.cse.buffalo.edu/resources/berkeleydb/</a></p>
<p style="text-align: justify;">You can find a getting started guide at:</p>
<p style="text-align: center; font-size: 10pt;"><a href="http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide">http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide</a></p>
And the javadoc at:
<p style="text-align: center;"><a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/">http://odin.cse.buffalo.edu/resources/berkeleydb/java/</a></p>
<p style="text-align: justify;">BDB can be used in two ways: the Direct Persistence Layer and the Base API.  The <a href="http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide/dpl.html">Direct Persistence Layer</a> is easier to use at first, as it handles index management and serialization through compiler annotations.  However, this ease comes at the cost of flexibility.  Especially if you plan to use secondary indexes, you may find it substantially easier to work with the <a href="http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide/baseapi.html">Base API</a>.  For this reason, this summary will focus on the Base API.</p>
<h1 style="text-align: justify;">Environments and Databases</h1>
<p style="text-align: justify;">A relation or table is represented in BDB as a <a href="http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide/databases.html#DBOpen">Database</a>, and Databases are grouped into units of storage called <a href="http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide/env.html">Environments</a>.  The first thing that you should do in the pre-computation phase is to <a href="http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide/databases.html#DBOpen">create an Environment and one or more Databases</a>.  <strong>Be absolutely sure to <a href="http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide/databases.html#dbclose">close both the environment and the database</a> before you exit</strong>, as not doing so could lead to file corruption.</p>
<p style="text-align: justify;">BDB Databases are in effect clustered indexes, which means that every record stored in one is identified (and sorted by) a key.  A database supports efficient access to records or ranges of records based on their keys.</p>
<h1 style="text-align: justify;">Representing, Storing, and Reading Tuples</h1>
<p style="text-align: justify;">Every tuple must be marked with a primary key, and may include one or more secondary keys.  In the Base API, both the value and its key are represented as a string of bytes.  Both key and value must be <a style="line-height: 1.5;" href="http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide/DBEntry.html#usingDbEntry">stored as a byte array encapsulated in a DatabaseEntry object</a>.  Secondary keys are defined when creating a secondary index.</p>
<p style="text-align: justify;">Note that you will need to manually extract the key from the rest of the record and write some code to serialize the record and the key into byte arrays.  You could use <span style="line-height: 1.5;">toString(), but you may find it substantially faster to use Java's native object serialization:</span></p>
<p style="text-align: center;"><a href="http://docs.oracle.com/javase/8/docs/api/java/io/ObjectOutputStream.html">ObjectOutputStream </a> |  <a href="http://docs.oracle.com/javase/8/docs/api/java/io/ObjectInputStream.html">ObjectInputStream</a></p>
<p style="text-align: justify;">... or a pair of classes that Java provides for serializing primitive data:</p>
<p style="text-align: center;"><a href="http://docs.oracle.com/javase/8/docs/api/java/io/DataOutputStream.html">DataOutputStream</a>  |  <a href="http://docs.oracle.com/javase/8/docs/api/java/io/DataInputStream.html">DataInputStream</a></p>
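<p style="text-align: justify;">As a rough illustration of the second approach, the sketch below round-trips a tuple through <tt>DataOutputStream</tt> and <tt>DataInputStream</tt>.  The class name <tt>TupleCodec</tt> and the three-column layout (an int, a string, and a double) are assumptions made up for this example; your own schema and field order will differ.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical codec for a three-column tuple: (int orderKey, String segment, double price). */
final class TupleCodec {

    /** Serializes the three fields, in order, into a byte array. */
    static byte[] encode(int orderKey, String segment, double price) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(orderKey);    // 4 bytes
            out.writeUTF(segment);     // 2-byte length prefix, then the encoded characters
            out.writeDouble(price);    // 8 bytes
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            // In-memory streams should not fail, so rethrow unchecked.
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the fields back in the same order they were written. */
    static Object[] decode(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            int orderKey = in.readInt();
            String segment = in.readUTF();
            double price = in.readDouble();
            return new Object[] { orderKey, segment, price };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Object[] fields = decode(encode(42, "BUILDING", 3.5));
        System.out.println(fields[0] + "|" + fields[1] + "|" + fields[2]);
    }
}
```

<p style="text-align: justify;">The resulting byte array can then be wrapped with <tt>new DatabaseEntry(bytes)</tt> before it is handed to BDB.</p>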
<p style="text-align: justify;">Like a HashMap, BDB supports a <a href="http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide/usingDbt.html">simple get/put interface</a>.  Tuples can be stored or looked up by their key.  Like your code, BDB also provides an iterator interface called a <a href="http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide/Cursors.html">Cursor</a>.  Of note, BDB's cursor interface supports <a href="http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide/Positioning.html#cursorsearch">index lookups</a>.</p>
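<p style="text-align: justify;">The pieces described so far (environment, database, put/get, cursor, and closing everything) might fit together roughly as follows.  This is a minimal sketch rather than a reference solution: it assumes the BDB Java Edition jar is on the classpath and that the directory passed in already exists, and the class name <tt>BdbHelloSketch</tt>, the database name <tt>customer</tt>, and the string-encoded records are placeholders invented for the example.</p>

```java
import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;
import java.io.File;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: store two records, look one up by key, then scan with a cursor. */
final class BdbHelloSketch {

    /** Returns the number of records seen by the cursor scan. The directory must exist. */
    static int run(File dbDir) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(dbDir, envConfig);
        Database db = null;
        try {
            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            db = env.openDatabase(null, "customer", dbConfig);

            // put: key and value are both byte arrays wrapped in DatabaseEntry objects.
            db.put(null, entry("1"), entry("1|BUILDING"));
            db.put(null, entry("2"), entry("2|MACHINERY"));

            // get: look a record up by its primary key.
            DatabaseEntry found = new DatabaseEntry();
            OperationStatus status = db.get(null, entry("2"), found, LockMode.DEFAULT);
            if (status != OperationStatus.SUCCESS) {
                throw new IllegalStateException("key 2 not found");
            }

            // cursor: iterate over every record in key order.
            int count = 0;
            Cursor cursor = db.openCursor(null, null);
            try {
                DatabaseEntry key = new DatabaseEntry();
                DatabaseEntry value = new DatabaseEntry();
                while (cursor.getNext(key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                    count++;
                }
            } finally {
                cursor.close();
            }
            return count;
        } finally {
            // Close the database first, then the environment.
            if (db != null) {
                db.close();
            }
            env.close();
        }
    }

    private static DatabaseEntry entry(String text) {
        return new DatabaseEntry(text.getBytes(StandardCharsets.UTF_8));
    }
}
```

<p style="text-align: justify;">The <tt>finally</tt> blocks are there so that the cursor, database, and environment are closed even if a lookup throws.</p>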
<h1 style="text-align: justify;">Secondary Indexes</h1>
<p style="text-align: justify;">The Database represents a clustered index.  In addition, BDB has support for unclustered indexes, which it calls <a href="http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide/indexes.html">SecondaryDatabases</a>. As an unclustered index, a secondary database doesn't dictate how the tuples themselves are laid out, but still allows for (mostly) efficient lookups by secondary "keys".  The term "keys" is in quotation marks because, unlike the primary database with its primary key, a secondary database allows multiple records with the same secondary key.</p>
<p style="text-align: justify;">To automate the management process, a secondary index is defined using an implementation of <a href="http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide/keyCreator.html">SecondaryKeyCreator</a>.  This class should map record DatabaseEntry objects to a (not necessarily unique) DatabaseEntry object that acts as a secondary key.</p>
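<p style="text-align: justify;">The core of such a class is a function from record bytes to secondary-key bytes.  The sketch below shows only that mapping, using plain Java so that it can be tried without BDB.  The class name <tt>SecondaryKeySketch</tt> and the record layout (an int followed by a string) are assumptions for illustration.  Inside a real <tt>SecondaryKeyCreator</tt>, the <tt>createSecondaryKey(secondary, key, data, result)</tt> method would call something like <tt>result.setData(secondaryKeyOf(data.getData()))</tt> and return <tt>true</tt>.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical mapping from a serialized record to its (non-unique) secondary key. */
final class SecondaryKeySketch {

    /** Builds a sample record in the assumed layout: int custKey, then a UTF string segment. */
    static byte[] record(int custKey, String segment) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(custKey);
            out.writeUTF(segment);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Extracts the segment column and returns it as the secondary key bytes. */
    static byte[] secondaryKeyOf(byte[] record) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            in.readInt();  // skip the primary-key column
            return in.readUTF().getBytes(StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] key = secondaryKeyOf(record(7, "BUILDING"));
        System.out.println(new String(key, StandardCharsets.UTF_8));
    }
}
```

<p style="text-align: justify;">Because several records can map to the same secondary key, the secondary database would typically also need sorted duplicates enabled (<tt>SecondaryConfig.setSortedDuplicates(true)</tt>).</p>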
<h1 style="text-align: justify;">BDB Joins</h1>
<p style="text-align: justify;">The name here is another misnomer.  BDB allows you to define so-called <a href="http://odin.cse.buffalo.edu/resources/berkeleydb/GettingStartedGuide/joins.html">Join Cursors</a>, but a Join Cursor is <strong>not</strong> a relational join in the traditional sense.   Rather, it allows you to define multiple <strong>equality</strong> predicates over the base relation and scan over all records that match all of the specified lookup conditions.</p>
<h1 style="text-align: justify;">Performance Tuning</h1>
<p style="text-align: justify;">Getting good performance out of BerkeleyDB can be quite tricky.  A number of options and usage patterns can help you get the most out of this indexing software.  Since evaluation on the grading boxes takes time due to the end-to-end testing process, I encourage you to evaluate on your own machines.  For best results, be sure to store your database on an HDD (results from SSDs will not be representative of the grading boxes).  Recall that the grading boxes have 4 GB of RAM.</p>
<h2 style="text-align: justify;">Heap Scans</h2>
<p style="text-align: justify;">Depending on how you've implemented deserialization of the raw data files, you may find it faster to read directly from the clustered index rather than from the data file.  In the reference implementation, reading from a clustered index is about twice as fast as reading from a data file, but this performance boost stems from several factors.  If you choose to do this, take a look at <a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/DiskOrderedCursor.html">DiskOrderedCursor</a>, which my experiments show is roughly twice as fast as a regular in-order Cursor on an HDD on a fully compacted relation.</p>
<h2 style="text-align: justify;">Locking Policies</h2>
<p style="text-align: justify;">Locking is slow.  Consistency is slow.  As long as your code is single-threaded and does not use updates or transactions, you'll find that cursor operations will be faster under <a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/LockMode.html">LockMode</a>.<a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/LockMode.html#READ_UNCOMMITTED">READ_UNCOMMITTED</a>.  See below for ways to set this parameter globally.</p>
<h2 style="text-align: justify;">Config Options</h2>
<p style="text-align: justify;">BDB also has numerous options that will affect the performance of your system.  Here are several options you may wish to evaluate for both the load and run phases:</p>
<ul>
<li style="text-align: justify;"><a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/EnvironmentConfig.html">EnvironmentConfig</a>
<ul>
<li style="text-align: justify;"><a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/EnvironmentMutableConfig.html#setCachePercent(int)">setCachePercent</a></li>
<li style="text-align: justify;"><a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/EnvironmentConfig.html#setLocking(boolean)">setLocking</a></li>
<li style="text-align: justify;"><a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/EnvironmentConfig.html#setConfigParam(java.lang.String,%20java.lang.String)">setConfigParam</a>
<ul>
<li style="text-align: justify;">EnvironmentConfig.<a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/EnvironmentConfig.html#ENV_RUN_CLEANER">ENV_RUN_CLEANER</a></li>
<li style="text-align: justify;">EnvironmentConfig.<a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/EnvironmentConfig.html#ENV_RUN_CHECKPOINTER">ENV_RUN_CHECKPOINTER</a>
<ul>
<li style="text-align: justify;">See the documentation for Environment.<a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/Environment.html#cleanLog()">cleanLog</a>() if you plan to turn either of these off.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li style="text-align: justify;"><a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/DatabaseConfig.html">DatabaseConfig</a> and <a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/SecondaryConfig.html">SecondaryConfig</a>
<ul>
<li style="text-align: justify;"><a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/DatabaseConfig.html#setReadOnly(boolean)">setReadOnly</a></li>
<li style="text-align: justify;"><a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/DatabaseConfig.html#setTransactional(boolean)">setTransactional</a></li>
</ul>
</li>
<li style="text-align: justify;"><a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/DiskOrderedCursorConfig.html">DiskOrderedCursorConfig</a>
<ul>
<li style="text-align: justify;"><a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/DiskOrderedCursorConfig.html#setInternalMemoryLimit(long)">setInternalMemoryLimit</a></li>
<li style="text-align: justify;"><a href="http://odin.cse.buffalo.edu/resources/berkeleydb/java/com/sleepycat/je/DiskOrderedCursorConfig.html#setQueueSize(int)">setQueueSize</a></li>
</ul>
</li>
</ul>
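<p style="text-align: justify;">For orientation, the options listed above might be set along the following lines for a read-only run phase.  Every value here (the cache percentage, the memory limit, the queue size, and the choice to switch each feature off) is an assumed placeholder to be measured rather than a recommendation, and the class name <tt>RunPhaseConfigSketch</tt> is invented for this example.  The BDB Java Edition jar is assumed to be on the classpath.</p>

```java
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DiskOrderedCursorConfig;
import com.sleepycat.je.EnvironmentConfig;

/** Hypothetical configuration helpers for a single-threaded, read-only run phase. */
final class RunPhaseConfigSketch {

    static EnvironmentConfig environmentConfig() {
        EnvironmentConfig config = new EnvironmentConfig();
        config.setCachePercent(50);   // assumed value: share of the JVM heap used for the BDB cache
        config.setLocking(false);     // only reasonable without threads, updates, or transactions
        // Background threads; see Environment.cleanLog() before turning these off.
        config.setConfigParam(EnvironmentConfig.ENV_RUN_CLEANER, "false");
        config.setConfigParam(EnvironmentConfig.ENV_RUN_CHECKPOINTER, "false");
        return config;
    }

    static DatabaseConfig databaseConfig() {
        DatabaseConfig config = new DatabaseConfig();
        config.setReadOnly(true);        // the run phase only reads what the load phase wrote
        config.setTransactional(false);
        return config;
    }

    static DiskOrderedCursorConfig scanConfig() {
        DiskOrderedCursorConfig config = new DiskOrderedCursorConfig();
        config.setInternalMemoryLimit(64L * 1024 * 1024);  // assumed value: 64 MB
        config.setQueueSize(1000);                         // assumed value
        return config;
    }
}
```

<p style="text-align: justify;">The load phase would need a different <tt>DatabaseConfig</tt>, since it has to create and write to the databases.</p>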
<hr />
<h1>Interface</h1>
<p style="text-align: justify;">Your code will be evaluated in exactly the same way as in Projects 1 and 2.  Your code will be presented with a 500MB (SF 0.5) TPC-H dataset.  Before grading begins, your code will be run once to preprocess the data.  You will have up to 5 minutes, after which your process will be killed (if it has not yet terminated).  Your code will then be run on the test suite.</p>
<p style="text-align: justify;">As before, your code will be invoked with the data directory and the relevant SQL files. Two additional parameters will be used in the preprocessing stage:</p>
<ul>
<li><tt>--db directory</tt>: A directory in which it is safe to persist data.  The contents of this directory will be persisted across the entire grading run.</li>
<li><tt>--load</tt>: This parameter will be passed during the preprocessing phase.  When it appears on the command line, you have up to 5 minutes to preprocess the data.</li>
</ul>
<pre>java -cp build:jsqlparser.jar:...
edu.buffalo.cse562.Main
--data [data]
--db [db]
--load
[sqlfile1] [sqlfile2] ...</pre>
This example uses the following directories and files:
<ul>
<li><tt>[data]</tt>: Table data stored in '|' separated files. As before, table names match the names provided in the corresponding CREATE TABLE statement, with the .dat suffix.</li>
<li><tt>[db]</tt>: A directory for permanent data files.  The contents of this directory will be persisted across the entire grading run.</li>
<li><tt>[sqlfileX]</tt>: A file containing CREATE TABLE and SELECT statements, defining the schema of the dataset and the query to process.  If --load appears on the command line, these files will contain only CREATE TABLE statements.</li>
</ul>
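<p style="text-align: justify;">One way to pick these arguments apart is sketched below.  The class name <tt>GraderArgs</tt> and its field names are hypothetical, and the sketch assumes that the flags may appear in any order and that every argument that is not a flag is a SQL file.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for the command-line arguments described above. */
final class GraderArgs {
    String dataDir;                                   // value following --data
    String dbDir;                                     // value following --db
    boolean load;                                     // true when --load is present
    final List<String> sqlFiles = new ArrayList<>();  // every remaining argument

    static GraderArgs parse(String[] args) {
        GraderArgs parsed = new GraderArgs();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "--data":
                    parsed.dataDir = requireValue(args, ++i, "--data");
                    break;
                case "--db":
                    parsed.dbDir = requireValue(args, ++i, "--db");
                    break;
                case "--load":
                    parsed.load = true;
                    break;
                default:
                    parsed.sqlFiles.add(args[i]);
            }
        }
        return parsed;
    }

    private static String requireValue(String[] args, int index, String flag) {
        if (index >= args.length) {
            throw new IllegalArgumentException(flag + " requires a directory argument");
        }
        return args[index];
    }

    public static void main(String[] args) {
        GraderArgs parsed = parse(args);
        System.out.println("data=" + parsed.dataDir + " db=" + parsed.dbDir
                + " load=" + parsed.load + " sql=" + parsed.sqlFiles);
    }
}
```

<p style="text-align: justify;">When <tt>load</tt> is true, the SQL files contain only CREATE TABLE statements, so the program can build its databases and indexes under <tt>dbDir</tt> and exit.</p>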
<h1>Grading</h1>
<p style="text-align: justify;">Your code will be subjected to a sequence of test cases and evaluated on speed and correctness.</p>
<ul>
<li style="text-align: justify;"><strong>0/10 (F)</strong>: Your submission does not compile, does not produce correct output, or fails in some other way. Resubmission is highly encouraged.</li>
<li style="text-align: justify;"><strong>5/10 (C)</strong>: Your submission runs the test query faster than the C threshold (listed below for each query), and produces the correct output.</li>
<li style="text-align: justify;"><strong>7.5/10 (B)</strong>: Your submission runs the test query faster than the B threshold (listed below for each query), and produces the correct output.</li>
<li><strong>10/10 (A)</strong>: Your submission runs the test query faster than the A threshold (listed below for each query), and produces the correct output.</li>
</ul>
<table>
<tbody>
<tr>
<th style="text-align: center;">TPC-H Query</th>
<th style="text-align: center;">Grade</th>
<th style="text-align: center;">Maximum Run-Time (s)</th>
</tr>
<tr>
<th style="text-align: center;" rowspan="3">Q1</th>
<th style="text-align: center;">A</th>
<td style="text-align: center;">30</td>
</tr>
<tr>
<th style="text-align: center;">B</th>
<td style="text-align: center;">60</td>
</tr>
<tr>
<th style="text-align: center;">C</th>
<td style="text-align: center;">90</td>
</tr>
<tr>
<th style="text-align: center;" rowspan="3">Q3</th>
<th style="text-align: center;">A</th>
<td style="text-align: center;">5</td>
</tr>
<tr>
<th style="text-align: center;">B</th>
<td style="text-align: center;">30</td>
</tr>
<tr>
<th style="text-align: center;">C</th>
<td style="text-align: center;">120</td>
</tr>
<tr>
<th style="text-align: center;" rowspan="3">Q5</th>
<th style="text-align: center;">A</th>
<td style="text-align: center;">10</td>
</tr>
<tr>
<th style="text-align: center;">B</th>
<td style="text-align: center;">60</td>
</tr>
<tr>
<th style="text-align: center;">C</th>
<td style="text-align: center;">120</td>
</tr>
<tr>
<th style="text-align: center;" rowspan="3">Q6</th>
<th style="text-align: center;">A</th>
<td style="text-align: center;">20</td>
</tr>
<tr>
<th style="text-align: center;">B</th>
<td style="text-align: center;">45</td>
</tr>
<tr>
<th style="text-align: center;">C</th>
<td style="text-align: center;">70</td>
</tr>
<tr>
<th style="text-align: center;" rowspan="3">Q10</th>
<th style="text-align: center;">A</th>
<td style="text-align: center;">10</td>
</tr>
<tr>
<th style="text-align: center;">B</th>
<td style="text-align: center;">30</td>
</tr>
<tr>
<th style="text-align: center;">C</th>
<td style="text-align: center;">90</td>
</tr>
<tr>
<th style="text-align: center;" rowspan="3">Q12</th>
<th style="text-align: center;">A</th>
<td style="text-align: center;">40</td>
</tr>
<tr>
<th style="text-align: center;">B</th>
<td style="text-align: center;">60</td>
</tr>
<tr>
<th style="text-align: center;">C</th>
<td style="text-align: center;">90</td>
</tr>
</tbody>
</table>
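<p style="text-align: justify;">When timing runs on your own machine, a small helper like the one below can translate a measured run-time into the score the table implies.  The class name <tt>GradeEstimator</tt> is hypothetical.  The thresholds are copied from the table; treating a correct run that misses the C threshold as a 0 is an assumption, as is the strict "faster than" comparison.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that maps a measured run-time to the score implied by the table. */
final class GradeEstimator {
    // Maximum run-times in seconds for grades {A, B, C}, copied from the table above.
    private static final Map<String, double[]> THRESHOLDS = new LinkedHashMap<>();
    static {
        THRESHOLDS.put("Q1", new double[] {30, 60, 90});
        THRESHOLDS.put("Q3", new double[] {5, 30, 120});
        THRESHOLDS.put("Q5", new double[] {10, 60, 120});
        THRESHOLDS.put("Q6", new double[] {20, 45, 70});
        THRESHOLDS.put("Q10", new double[] {10, 30, 90});
        THRESHOLDS.put("Q12", new double[] {40, 60, 90});
    }

    /** Returns 10, 7.5, 5, or 0 for one query, given its run-time and whether the output was correct. */
    static double score(String query, double seconds, boolean correct) {
        double[] limits = THRESHOLDS.get(query);
        if (limits == null) {
            throw new IllegalArgumentException("no thresholds for " + query);
        }
        if (!correct) {
            return 0.0;
        }
        if (seconds < limits[0]) {
            return 10.0;  // A
        }
        if (seconds < limits[1]) {
            return 7.5;   // B
        }
        if (seconds < limits[2]) {
            return 5.0;   // C
        }
        return 0.0;       // assumption: slower than the C threshold counts as a failure
    }

    public static void main(String[] args) {
        System.out.println(score("Q3", 4.2, true));
    }
}
```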