Merge branch 'master' of gitlab.odin.cse.buffalo.edu:odin-lab/Website

This commit is contained in:
Oliver Kennedy 2018-02-12 15:48:50 -05:00
commit 7f58bf50c0

@ -0,0 +1,240 @@
---
title: CSE-562; Project 1
---
<h1>Checkpoint 1</h1>
<ul>
<li><strong>Overview</strong>: Answer Select/Project Queries</li>
<li><strong>Deadline</strong>: Friday, Feb 23</li>
<li><strong>Grade</strong>: 10% of Project Component
<ul>
<li>7% Correctness</li>
<li>3% Efficiency</li>
</ul>
</li>
</ul>
<p>Checkpoint 1 of the project makes sure you can apply the relational algebra you have learned in class. Building a good infrastructure for evaluating queries at this early stage is essential preparation for the later checkpoints. In this checkpoint, you will be asked to evaluate a few single-table SQL queries. There are several ways to complete this project, and some are better than others. We will go through a few design decisions, pointing out tradeoffs and explaining why a strategy that seems easier in the short term turns out to be significantly harder later. The success criterion for this checkpoint is evaluating 5 queries correctly.</p>
<p>Specifically, you'll be given a number of queries in one of the following patterns:
<ol>
<li><tt>CREATE TABLE R (A int, B date, C string, ... )</tt></li>
<li><tt>SELECT A, B, ... FROM R</tt></li>
<li><tt>SELECT A, B, ... FROM R WHERE ...</tt></li>
<li><tt>SELECT A+B AS C, ... FROM R</tt></li>
<li><tt>SELECT A+B AS C, ... FROM R WHERE ...</tt></li>
<li><tt>SELECT * FROM R</tt></li>
<li><tt>SELECT * FROM R WHERE ...</tt></li>
<li><tt>SELECT R.A, ... FROM R WHERE ...</tt></li>
<li><tt>SELECT Q.C, ... FROM (SELECT A, C, ... FROM R) Q WHERE ...</tt></li>
</ol>
Your task is to answer these queries as they arrive.
</p>
<h2>Volcano-Style Computation (Iterators)</h2>
Let's take a look at the script we've used as an example in class.
<div style="text-align:left;color:#000000; background-color:#ffffff; border:solid black 1px; padding:0.5em 1em 0.5em 1em; overflow:auto;font-size:small; font-family:monospace; ">with <span style="color:#5b2a96;">open</span>(<span style="color:#f4181b;">'data.csv'</span>, <span style="color:#f4181b;">'r'</span>) <span style="color:#a71790;"><strong>as</strong></span> f:<br />
&nbsp;&nbsp;<span style="color:#a71790;"><strong>for</strong></span> line <span style="color:#a71790;"><strong>in</strong></span> f:<br />
&nbsp;&nbsp;&nbsp;&nbsp;fields = line.split(<span style="color:#f4181b;">&quot;,&quot;</span>)<br />
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#a71790;"><strong>if</strong></span>(fields[<span style="color:#0000ff;">2</span>] != <span style="color:#f4181b;">&quot;Ensign&quot;</span> <span style="color:#a71790;"><strong>and</strong></span> <span style="color:#5b2a96;">int</span>(fields[<span style="color:#0000ff;">3</span>]) &gt; <span style="color:#0000ff;">25</span>):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#a71790;">print</span>(fields[<span style="color:#0000ff;">1</span>])<br />
</div>
<p>This script is essentially an instance of pattern 3 above:
<pre><code class="sql">SELECT fields[1] FROM 'data.csv'
WHERE fields[2] != 'Ensign' AND CAST(fields[3] AS int) > 25
</code></pre>
</p>
<p>Or in other words, any query that follows the pattern...
<pre><code class="sql">SELECT /*targets*/ FROM /*file*/ WHERE /*condition*/</code></pre>
...becomes a script of the form...</p>
<div style="text-align:left;color:#000000; background-color:#ffffff; border:solid black 1px; padding:0.5em 1em 0.5em 1em; overflow:auto;font-size:small; font-family:monospace; ">with <span style="color:#5b2a96;">open</span>(<span style="color:#4444ff;font-weight:bold;">file</span>, <span style="color:#f4181b;">'r'</span>) <span style="color:#a71790;"><strong>as</strong></span> f:<br />
&nbsp;&nbsp;<span style="color:#a71790;"><strong>for</strong></span> line <span style="color:#a71790;"><strong>in</strong></span> f:<br />
&nbsp;&nbsp;&nbsp;&nbsp;fields = line.split(<span style="color:#f4181b;">&quot;,&quot;</span>)<br />
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#a71790;"><strong>if</strong></span> <span style="color:#4444ff;font-weight:bold;">condition</span>:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#a71790;">print</span>(<span style="color:#4444ff;font-weight:bold;">targets</span>)<br />
</div>
<p>This is nice and simple, but the code is very specific to pattern 3. That's something that will lead us into trouble. To see a simple example of the sort of problems we're going to run into, let's come up with an example of pattern 5:</p>
<pre><code class="sql">SELECT height + weight FROM 'data.csv' WHERE rank != 'Ensign'</code></pre>
<p>That is, we're asking for the sum of the height and weight of each non-ensign in our example table. An equivalent script would be...</p>
<div style="text-align:left;color:#000000; background-color:#ffffff; border:solid black 1px; padding:0.5em 1em 0.5em 1em; overflow:auto;font-size:small; font-family:monospace; ">total = <span style="color:#0000ff;">0</span><br />
<br />
with <span style="color:#5b2a96;">open</span>(<span style="color:#f4181b;">'data.csv'</span>, <span style="color:#f4181b;">'r'</span>) <span style="color:#a71790;"><strong>as</strong></span> f:<br />
&nbsp;&nbsp;<span style="color:#a71790;"><strong>for</strong></span> line <span style="color:#a71790;"><strong>in</strong></span> f:<br />
&nbsp;&nbsp;&nbsp;&nbsp;fields = line.split(<span style="color:#f4181b;">&quot;,&quot;</span>)<br />
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#a71790;"><strong>if</strong></span> fields[<span style="color:#0000ff;">2</span>] != <span style="color:#f4181b;">'Ensign'</span>:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;total = <span style="color:#5b2a96;">int</span>(fields[<span style="color:#0000ff;">4</span>]) + <span style="color:#5b2a96;">int</span>(fields[<span style="color:#0000ff;">5</span>])<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#a71790;">print</span>(total) &nbsp;&nbsp;&nbsp;<br />
</div>
<p>There's a pretty significant difference in the flow of the code in this version of the script. For one, there's a new global variable, <tt>total</tt>, that holds the sum of height and weight. Now let's say we wanted to support both patterns 3 and 5. We would need to check which pattern each query follows and write separate code for each one. Taken to its conclusion, this approach means writing special-case code for every one of patterns 1 through 9, not to mention the more complex queries that will show up in later checkpoints.</p>
<p>There are a number of workflow steps that appear in more than one pattern. For example:
<ol>
<li>Loading the CSV file in as data</li>
<li>Filtering rows out of data</li>
<li>Transforming (mapping) data into a new structure</li>
<li>Printing output data</li>
</ol>
Most of these steps do something with <b>data</b>, so let's be a little more precise with respect to what we mean there. (1) When a CSV file is loaded, it's a sequence of rows and attributes. (2) Filtering doesn't change the structure: it's still rows and attributes. (3) Transforming (picking out specific columns) does change the structure, but at the end of the day we're still working with rows and attributes (or in the case of this script, just one attribute). (4) Printing doesn't change the structure: however, you need to do it in the correct format.</p>
<p>In short, this idea of rows and attributes is pretty fundamental, so let's use it. We're going to work with data expressed in terms of <tt>table</tt>s, that is, collections of rows and attributes. This allows us to abstract each of those workflow steps from before into a function:
<ol style="font-family: Courier;font-size: 10pt;">
<li>read_table(filename) -> table</li>
<li>filter_table(table, condition) -> table</li>
<li>map_table(table, rules) -> table</li>
<li>print_table(table)</li>
</ol>
</p>
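<p>As a rough sketch of those four functions, assuming a <tt>table</tt> is nothing more than a list of <tt>String[]</tt> rows (the class and method names below are hypothetical, and file I/O is left out), the steps might compose like this:</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical names throughout; a "table" here is simply a list of rows.
class TableSteps {
    // read_table: split each text line into fields (file I/O omitted for brevity).
    static List<String[]> readTable(List<String> lines) {
        List<String[]> table = new ArrayList<>();
        for (String line : lines) {
            table.add(line.split(","));
        }
        return table;
    }

    // filter_table: keep only the rows that satisfy the condition.
    static List<String[]> filterTable(List<String[]> table, Predicate<String[]> condition) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : table) {
            if (condition.test(row)) {
                out.add(row);
            }
        }
        return out;
    }

    // map_table: build one new row from each input row.
    static List<String[]> mapTable(List<String[]> table, Function<String[], String[]> rule) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : table) {
            out.add(rule.apply(row));
        }
        return out;
    }

    // print_table: write one line per row.
    static void printTable(List<String[]> table) {
        for (String[] row : table) {
            System.out.println(String.join(",", row));
        }
    }
}
```

<p>Note that every function returns a complete list, so each intermediate result is held in memory in full.</p>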
<p><b>But we still have a problem.</b> These <tt>table</tt> objects are going to be as big as the data they represent... they can get super large. That's a massive drawback compared to our initial script design, which has constant-space usage. So what else can we do?</p>
<p>Let's look at why the original script uses constant-space. We load one record in upfront (that's constant space). We decide whether the record is useful to us (still constant space). Whether or not we print it, by the time we get to the next record, we're done with the current row and can safely discard it. Can we recover the same sort of property?</p>
<p>For this checkpoint, it turns out that we can. If you've used Java, you're probably familiar with the <a href="http://docs.oracle.com/javase/8/docs/api/java/util/Iterator.html">Iterator</a> interface. An iterator lets you access the elements of a collection without needing to have all of those elements available at once. That is, you define two methods:
<dl>
<dt style="font-family: Courier">hasNext()</dt>
<dd>Returns true if there are any more rows to read</dd>
<dt style="font-family: Courier">next()</dt>
<dd>Returns exactly one row (the next row in the list).</dd>
</dl>
Because the iterator eventually returns each row of the table, it behaves sort of like a <tt>table</tt> object, but because it only returns one row at a time it doesn't strictly need all of the data to be in memory at once. Moreover, you can define one iterator in terms of another. For example, you might define a filtering iterator that takes a source iterator as part of its constructor, and every time you call <tt>next()</tt>, keeps calling <tt>source.next()</tt> until it finds a row that satisfies the where clause.</p>
<p>In short, iterators give you <i>composability</i> and <i>low memory use</i>. The first property is important for your sanity, while the latter property is important for your performance.</p>
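<p>As a minimal sketch of the filtering iterator described above, assuming rows are plain <tt>String[]</tt> arrays and using the hypothetical class name <tt>FilterIterator</tt> (neither is prescribed by the project), the idea might look like this:</p>

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical class: wraps a source iterator and yields only matching rows.
class FilterIterator implements Iterator<String[]> {
    private final Iterator<String[]> source;
    private final Predicate<String[]> condition;
    private String[] lookahead; // next matching row, or null if none is buffered

    FilterIterator(Iterator<String[]> source, Predicate<String[]> condition) {
        this.source = source;
        this.condition = condition;
    }

    @Override
    public boolean hasNext() {
        // Pull from the source until a row passes or the source runs dry.
        while (lookahead == null && source.hasNext()) {
            String[] row = source.next();
            if (condition.test(row)) {
                lookahead = row;
            }
        }
        return lookahead != null;
    }

    @Override
    public String[] next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String[] row = lookahead;
        lookahead = null; // the buffered row is handed out exactly once
        return row;
    }
}
```

<p>This version does its searching in <tt>hasNext()</tt> rather than in <tt>next()</tt>, because <tt>hasNext()</tt> cannot answer truthfully without knowing whether a matching row remains. At most one row is buffered at a time, so memory use stays constant.</p>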
<h2>Data Representation</h2>
<p>When it comes to figuring out how to represent one row of data, you have two questions to answer: (1) How do I represent a single primitive value? (2) How do I represent an entire row of primitive values?</p>
<p>For the first question, there are two practical choices: Either as raw strings (taken directly from the CSV file) or parsed into <a href="http://doc.odin.cse.buffalo.edu/jsqlparser/net/sf/jsqlparser/expression/PrimitiveValue.html">PrimitiveValue</a> objects. PrimitiveValue is an interface implemented by several classes that represent specific types of values, for example <a href="http://doc.odin.cse.buffalo.edu/jsqlparser/net/sf/jsqlparser/expression/LongValue.html">longs</a>, <a href="http://doc.odin.cse.buffalo.edu/jsqlparser/net/sf/jsqlparser/expression/DateValue.html">dates</a>, and others. Because EvalLib (a library that I'll describe shortly) uses PrimitiveValues internally, most students find that it is easier to write code that performs well if you use PrimitiveValue.</p>
<p>For the second question, I strongly encourage the use of Java arrays. There are other options, including ArrayLists, Vectors, and Maps, but plain Java arrays outperform them all pretty drastically.</p>
<h2>EvalLib</h2>
<p>The JSqlParser <a href="http://doc.odin.cse.buffalo.edu/jsqlparser/net/sf/jsqlparser/expression/Expression.html">Expression</a> type can represent a whole mess of different arithmetic, boolean, and other primitive-valued expressions. For this project, you'll have a library to help you in evaluating these expressions: <b>EvalLib</b>. Before we get into it, you should note a distinction between two types of expression:
<dl>
<dt><a href="http://doc.odin.cse.buffalo.edu/jsqlparser/net/sf/jsqlparser/expression/Expression.html">Expression</a></dt>
<dd>A generic expression. Can be anything: a comparison, a string, a multiplication, a regular expression match.</dd>
<dt><a href="http://doc.odin.cse.buffalo.edu/jsqlparser/net/sf/jsqlparser/expression/PrimitiveValue.html">PrimitiveValue</a></dt>
<dd>The basic unit of data. Can be a: <a href="http://doc.odin.cse.buffalo.edu/jsqlparser/net/sf/jsqlparser/expression/BooleanValue.html">Boolean</a>, <a href="http://doc.odin.cse.buffalo.edu/jsqlparser/net/sf/jsqlparser/expression/DateValue.html">Date</a>, <a href="http://doc.odin.cse.buffalo.edu/jsqlparser/net/sf/jsqlparser/expression/DoubleValue.html">Double</a>, <a href="http://doc.odin.cse.buffalo.edu/jsqlparser/net/sf/jsqlparser/expression/LongValue.html">Long</a>, <a href="http://doc.odin.cse.buffalo.edu/jsqlparser/net/sf/jsqlparser/expression/NullValue.html">Null</a>, <a href="http://doc.odin.cse.buffalo.edu/jsqlparser/net/sf/jsqlparser/expression/StringValue.html">String</a>, <a href="http://doc.odin.cse.buffalo.edu/jsqlparser/net/sf/jsqlparser/expression/TimestampValue.html">Timestamp</a> or a <a href="http://doc.odin.cse.buffalo.edu/jsqlparser/net/sf/jsqlparser/expression/TimeValue.html">Time</a>. Note that PrimitiveValues are also perfectly legitimate (if somewhat boring) Expressions.</dd>
</dl>
EvalLib includes a single class called <a href="https://github.com/UBOdin/evallib/blob/master/src/main/java/net/sf/jsqlparser/eval/Eval.java">Eval</a> that helps you to resolve Expression objects into PrimitiveValues. Eval is an abstract class, which means you'll need to subclass it to make use of it, but we'll get back to that in a moment. First, let's see a quick example.</p>
<pre><code class="java">
Eval eval = new Eval(){ /* we'll get what goes here shortly */ };
// Evaluate "1 + 2.0"
PrimitiveValue result;
result =
  eval.eval(
    new Addition(
      new LongValue(1),
      new DoubleValue(2.0)
    )
  );
System.out.println("Result: "+result); // "Result: 3.0"
// Evaluate "1 > (3.0 * 2)"
result =
  eval.eval(
    new GreaterThan(
      new LongValue(1),
      new Multiplication(
        new DoubleValue(3.0),
        new LongValue(2)
      )
    )
  );
System.out.println("Result: "+result); // "Result: false"
</code></pre>
<p>In short, Eval helps you evaluate the Expression objects that JSqlParser gives you. However, there's one thing it can't do: it has no idea how to convert attribute names to values. In other words, there's one type of Expression object that Eval has no clue how to evaluate: <a href="http://doc.odin.cse.buffalo.edu/jsqlparser/net/sf/jsqlparser/schema/Column.html">Column</a>. Take the following example:</p>
<pre><code class="java">
// Evaluate "R.A >= 5"
result =
  eval.eval(
    new GreaterThanEquals(
      new Column(new Table(null, "R"), "A"),
      new LongValue(5)
    )
  );
</code></pre>
<p>What value should Eval give to R.A? This depends on the data. Because EvalLib has no way of knowing how you represent your data, you need to tell it:</p>
<pre><code class="java">
Eval eval = new Eval(){
  public PrimitiveValue eval(Column c){
    /* Figure out what value 'c' has */
  }
};
</code></pre>
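<p>One way to "figure out what value 'c' has" is to remember, for each column name in the schema, its position in the row array. The sketch below is only an illustration under assumptions: <tt>RowLookup</tt> is a hypothetical helper, plain <tt>Object</tt> values stand in for PrimitiveValues, and column names are passed as strings so that the example runs without JSqlParser on the classpath.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: remembers where each column lives in the row array.
class RowLookup {
    private final Map<String, Integer> positions = new HashMap<>();
    private Object[] currentRow;

    // Column names are given in schema order, e.g. "A", "B", "C".
    RowLookup(String... columnNames) {
        for (int i = 0; i < columnNames.length; i++) {
            positions.put(columnNames[i], i);
        }
    }

    // Call once per row, before evaluating any expression against it.
    void setRow(Object[] row) {
        this.currentRow = row;
    }

    // The body of eval(Column c) could delegate here, passing the column's name.
    Object valueOf(String columnName) {
        Integer index = positions.get(columnName);
        if (index == null) {
            throw new IllegalArgumentException("Unknown column: " + columnName);
        }
        return currentRow[index];
    }
}
```

<p>Building the name-to-position map once per table, rather than searching the schema for every row, keeps the per-row cost to a single hash lookup.</p>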
<h2>Deliverable</h2>
<p>For this checkpoint, you'll be running multiple queries in sequence. This means a few changes to your code. First, before calling <tt>parser.Statement()</tt>, you will need to print a prompt to <tt>System.out</tt>. Use the string '<tt>$&gt; </tt>' (without quotes), and make sure that it's the very first thing on its own line. This is so that the testing framework knows when your code is ready for the next query.</p>
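<p>The prompt-then-read cycle might be sketched as follows. This is an illustration only: <tt>PromptLoop</tt> is a hypothetical class, and it splits statements on semicolons by hand (which would break on a semicolon inside a string literal), whereas a real solution would hand the input stream to the parser.</p>

```java
import java.io.IOException;
import java.io.PrintStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the prompt/read cycle; not the JSqlParser API.
class PromptLoop {
    static final String PROMPT = "$> ";

    // Prints the prompt, reads one semicolon-terminated statement, and repeats
    // until the input ends. Returns the statements that were read.
    static List<String> run(Reader in, PrintStream out) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        out.print(PROMPT);
        out.flush(); // make sure the grader sees the prompt immediately
        try {
            int ch;
            while ((ch = in.read()) != -1) {
                if (ch == ';') {
                    statements.add(current.toString().trim());
                    current.setLength(0);
                    // A real evaluator would print the query's results here.
                    out.print(PROMPT);
                    out.flush();
                } else {
                    current.append((char) ch);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return statements;
    }
}
```

<p>Flushing after each prompt matters because a buffered prompt that never reaches the testing framework looks the same as a program that is not ready.</p>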
<h4>Source Data</h4>
<p>Because you are implementing a query evaluator and not a full database engine, there will not be any tables -- at least not in the traditional sense of persistent objects that can be updated and modified. Instead, you will be given a <strong>Table Schema</strong> and a <strong>CSV File</strong> with the instance in it. To keep things simple, we will use the <tt>CREATE TABLE</tt> statement to define a relation's schema. To reiterate, <tt>CREATE TABLE</tt> statements <strong>only appear to give you a schema</strong>. You do not need to allocate any resources for the table in reaction to a <tt>CREATE TABLE</tt> statement -- simply save the schema that you are given for later use. The SQL types (and their corresponding Java types) that will be used in this project are as follows:</p>
<table>
<tbody>
<tr>
<th>SQL Type</th>
<th>Java Equivalent</th>
</tr>
<tr>
<td>string</td>
<td>StringValue</td>
</tr>
<tr>
<td>varchar</td>
<td>StringValue</td>
</tr>
<tr>
<td>char</td>
<td>StringValue</td>
</tr>
<tr>
<td>int</td>
<td>LongValue</td>
</tr>
<tr>
<td>decimal</td>
<td>DoubleValue</td>
</tr>
<tr>
<td>date</td>
<td>DateValue</td>
</tr>
</tbody>
</table>
<p>In addition to the schema, you will find a corresponding <tt>[tablename].dat</tt> file in the <tt>data</tt> directory. The name of the file corresponds to the table name given in the <tt>CREATE TABLE</tt> statements your code receives. For example, let's say that you see the following statement in your query file:</p>
<pre>CREATE TABLE R(A int, B int, C int);</pre>
<p>That means that the data directory contains a data file called 'R.dat' that might look like this:</p>
<pre>1|1|5
1|2|6
2|3|7</pre>
<p>Each line of text (see <a href="http://docs.oracle.com/javase/8/docs/api/java/io/BufferedReader.html">BufferedReader.readLine()</a>) corresponds to one row of data. Fields within a row are delimited by a vertical pipe '|' character. Integers and floats are stored in a form recognized by Java's Long.parseLong() and Double.parseDouble() methods. Dates are stored in YYYY-MM-DD form, where YYYY is the 4-digit year, MM is the 2-digit month number, and DD is the 2-digit day of the month. Strings are stored unescaped and unquoted and are guaranteed to contain no vertical pipe characters.</p>
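<p>Turning one line of such a file into a typed row might look like the sketch below. It is an illustration under assumptions: <tt>RowParser</tt> is a hypothetical helper, the schema is passed as an array of SQL type names, and plain Java types (<tt>Long</tt>, <tt>Double</tt>, <tt>LocalDate</tt>, <tt>String</tt>) stand in for the PrimitiveValue classes so that the example runs on its own.</p>

```java
import java.time.LocalDate;

// Hypothetical helper; plain Java types stand in for the PrimitiveValue classes.
class RowParser {
    // types[i] is the declared SQL type of column i, e.g. "int" or "date".
    static Object[] parse(String line, String[] types) {
        // "\\|" escapes the pipe, which is a regex metacharacter;
        // the limit of -1 keeps trailing empty fields.
        String[] raw = line.split("\\|", -1);
        Object[] row = new Object[types.length];
        for (int i = 0; i < types.length; i++) {
            switch (types[i].toLowerCase()) {
                case "int":
                    row[i] = Long.parseLong(raw[i]);
                    break;
                case "decimal":
                    row[i] = Double.parseDouble(raw[i]);
                    break;
                case "date":
                    row[i] = LocalDate.parse(raw[i]); // expects YYYY-MM-DD
                    break;
                default:
                    row[i] = raw[i]; // string, varchar, char
                    break;
            }
        }
        return row;
    }
}
```

<p>Because the schema is fixed for a whole file, the type dispatch could also be resolved once per column instead of once per field if parsing turns out to be a bottleneck.</p>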
<h4>Grading Workflow</h4>
<p>As before, all .java files in the src directory at the root of your repository will be compiled (and linked against JSqlParser). Also as before, the class <tt>edu.buffalo.www.cse4562.Main</tt> will be invoked with no arguments, and a stream of <b>semicolon-delimited</b> queries will be written to System.in (after you print out a prompt).</p>
<p>For example (<span style="color: red">red</span> text is entered by the user/grader):</p>
<pre>bash&gt; <span style="color: red">ls data</span>
R.dat
S.dat
T.dat
bash&gt; <span style="color: red">cat data/R.dat</span>
1|1|5
1|2|6
2|3|7
bash&gt; <span style="color: red">java -cp build:jsqlparser.jar edu.buffalo.www.cse4562.Main -</span>
$> <span style="color: red">CREATE TABLE R(A int, B int, C int);</span>
$> <span style="color: red">SELECT B, C FROM R WHERE A = 1;</span>
1|5
2|6
$> <span style="color: red">SELECT A + B AS Q FROM R;</span>
2
3
5
</pre>
<p>For this project, your code will not be timed, but you will need to answer some queries with a cap on available memory. You will receive up to 7 points for answering queries successfully, and up to 3 additional points for beating the reference implementation timewise.</p>