Checkpoint 3

Oliver Kennedy 2021-04-13 21:57:12 -04:00
parent df1a4c23cc
commit f898857de4
Signed by: okennedy
GPG key ID: 3E5F9B3ABD3FDB60
2 changed files with 334 additions and 3 deletions


@@ -0,0 +1,330 @@
---
title: "CSE-4/562 Database Systems: Checkpoint 3"
---
<%
def console(str)
return "<tt style='background-color: #111; color: #3f3; padding: 3px;'>#{str}</tt>"
end
prompt = console("$&gt;\\n")
%>
<div style="width: 600px; margin-left: auto; margin-right: auto;">
<h2>Checkpoint 3</h2>
<p>
In this project, you'll extend your SQL runtime to support aggregate queries. It is worth 8 points.
</p>
<h3>Requirements</h3>
<ul>
<li>All <tt>.scala</tt> files in <tt>/src/main/scala</tt> and its subdirectories will be compiled and the main function of the object <tt>microbase.Microbase</tt> will be run.</li>
<li>The grader will wait until the code prints <%= prompt %>. If this takes more than 2 seconds, you will receive a 0.</li>
<li>The grader will write a series of <tt>CREATE TABLE</tt> and <tt>SELECT</tt> commands to your code's <tt>System.in</tt>, with one command per <tt>\n</tt>-delimited line. After processing each statement, your code <b>must</b> print <%= prompt %> on a new line to indicate that it is done. If your code exceeds a per-operation time-out, you will receive a 0 for that query and all subsequent parts of the assignment.</li>
<li>When your code is provided with a <tt>CREATE TABLE</tt> statement, this indicates that there is a file called <%= console("data/[tableName].data") %>, where <tt>[tableName]</tt> is the name of the table. This file contains UTF-8-encoded records, one per <tt>\n</tt>-delimited line, with fields in human-readable string representation (i.e., as in a CSV file) delimited by the pipe character (<tt>|</tt>). Note that there will <b>not</b> be any <tt>INSERT</tt> statements.</li>
<li>When your code is provided with a <tt>SELECT</tt> statement, it <b>must</b> evaluate the SELECT statement and print the results to <tt>System.out</tt>, one per <tt>\n</tt>-delimited line, with fields in human-readable string representations delimited by the pipe character (<tt>|</tt>). You will be expected to support the following features of SQL:<ul>
<li>Arbitrary expression targets</li>
<li>Attribute and relation aliasing (i.e., <tt>SELECT bar.foo AS baz FROM table AS bar</tt>)</li>
<li>Project, Filter, Table, and Equi-Joins</li>
<li>Order-By and LIMIT queries</li>
<li style="font-weight: bold">Aggregate (Single-Valued, and GROUP BY) queries</li>
<li><tt>FROM</tt>-Nested Subqueries</li>
</ul></li>
<li>Your responses to <tt>SELECT</tt> queries will be checked against SQLite3.</li>
<li>Once again, your code <b>must</b> print <%= prompt %> on a new line after each <tt>CREATE TABLE</tt> or <tt>SELECT</tt> to indicate that it is done processing the statement. A minimal sketch of this input loop appears after this list.</li>
<li>You <i>may</i> use <tt>System.err</tt> to print debugging information. This output (or a subset of it) will be included in your debug log in Autolab and will be ignored by the grading script.</li>
</ul>
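<p>
For concreteness, here is a minimal sketch of the input loop described above. The <tt>handleStatement</tt> helper is hypothetical; parsing and evaluating each statement is the substance of this project.
</p>
<pre>
import java.util.Scanner

object Microbase
{
  // Hypothetical dispatcher: a CREATE TABLE statement registers the schema
  // and the file data/[tableName].data; a SELECT statement is evaluated and
  // its result rows printed to System.out, pipe-delimited, one per line.
  def handleStatement(sql: String): Unit = ???

  def main(args: Array[String]): Unit =
  {
    val input = new Scanner(System.in)
    println("$>")                 // initial prompt: must appear within 2 seconds
    while(input.hasNextLine){
      handleStatement(input.nextLine)
      println("$>")               // signal that this statement is done
    }
  }
}
</pre>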
<h3>Grading Rubric</h3>
All tests will be run on dedicated hardware equipped with an Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz with a standard 5400 RPM HDD. Queries will have a per-query timeout as listed below. Grading will be based on total runtime for each batch of queries.
<dl>
<dt>8 randomly-generated queries based on the <a href="http://www.tpc.org/tpch/">TPC-H</a> benchmark, with a scale factor of 0.1 (100MB) and templates listed below</dt>
<dd><b>Under 80 seconds total</b>: 8 of 8 points + leaderboard ranking</dd>
<dd><b>Under 150 seconds total</b>: 8 of 8 points</dd>
<dd><b>Under 5 minutes total</b>: 4 of 8 points</dd>
<dd><b>Under 1 minute per query</b>: 2 of 8 points</dd>
</dl>
Note in particular that these queries make extensive use of aggregates, equi-joins, order-by, and limit clauses, which will all need to be supported.
<h3>UnresolvedFunction</h3>
<p>
When the Spark SQL Parser encounters something that looks like a function, it doesn't try to interpret it directly. Instead, it'll produce a <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/analysis/UnresolvedFunction.html">UnresolvedFunction</a> expression node. You'll need to replace these.
</p>
<p>
Like most databases, Spark maintains a <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/analysis/FunctionRegistry$.html">"Function Registry"</a>, a catalog of all functions and their implementations. All of the "built-in" functions are provided in <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/analysis/FunctionRegistry$.html#builtin:org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry">FunctionRegistry.builtin</a>. Here's a little snippet you can use to replace functions. It doesn't support everything, but will be sufficient for this project.
</p>
<pre>
// Assumes:
//   import org.apache.spark.sql.catalyst.analysis.{FunctionRegistry, UnresolvedFunction}
case UnresolvedFunction(name, arguments, isDistinct, filter, ignoreNulls) =>
{
  // Look the function up in the registry of built-in implementations.
  val builder =
    FunctionRegistry.builtin
      .lookupFunctionBuilder(name)
      .getOrElse {
        throw new RuntimeException(
          s"Unable to resolve function `${name}`"
        )
      }
  builder(arguments) // returns the replacement expression node.
}
</pre>
<p>
<a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/analysis/SimpleFunctionRegistry.html#lookupFunction(name:org.apache.spark.sql.catalyst.FunctionIdentifier):Option[org.apache.spark.sql.catalyst.expressions.ExpressionInfo]">FunctionRegistry.lookupFunctionBuilder</a> returns a 'builder' function. When called with the arguments of the UnresolvedFunction, the builder returns an expression that implements the function. For example, looking up "regexp_extract" in the registry returns a builder that, when called with two string-typed expressions and a literal integer, returns a <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/RegExpExtract.html">RegExpExtract</a> object.
</p>
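<p>
As a sketch of where this case might live (assuming the <tt>UnresolvedFunction</tt> signature shown above, which varies across Catalyst versions), you can apply it to every expression in a plan with <tt>transformAllExpressions</tt>:
</p>
<pre>
import org.apache.spark.sql.catalyst.analysis.{FunctionRegistry, UnresolvedFunction}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

def resolveFunctions(plan: LogicalPlan): LogicalPlan =
  // Rewrite every UnresolvedFunction, wherever it appears in the plan.
  plan.transformAllExpressions {
    case UnresolvedFunction(name, arguments, isDistinct, filter, ignoreNulls) =>
      FunctionRegistry.builtin
        .lookupFunctionBuilder(name)
        .getOrElse {
          throw new RuntimeException(s"Unable to resolve function `${name}`")
        }
        .apply(arguments)
  }
</pre>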
<h3>Aggregates disguised as Projects</h3>
<p>
Because the Spark SQL Parser doesn't try to resolve functions, it is incapable of distinguishing between normal functions:
<pre>
SELECT regexp_extract(A, "a(b+)a", 1) FROM R
</pre>
and aggregate functions:
<pre>
SELECT sum(A) FROM R
</pre>
Both of these will parse into a LogicalPlan topped with a Project node.
</p>
<p>
While not required, you might find it easier to work with the resulting plans if you replace them with actual <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/plans/logical/Aggregate.html">Aggregate</a> plan nodes. Look for Project nodes whose <tt>projectList</tt> contains an expression with a subclass of <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/AggregateFunction.html">AggregateFunction</a> anywhere in its tree (typically nested under an <tt>Alias</tt>), as in the sketch below.
</p>
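<p>
Under the simplifying assumptions above, one (hedged) way to write the rewrite: any <tt>Project</tt> whose <tt>projectList</tt> mentions an aggregate becomes an <tt>Aggregate</tt> with no grouping expressions. Queries with an explicit <tt>GROUP BY</tt> clause already parse into <tt>Aggregate</tt> nodes, so only the ungrouped case needs this treatment.
</p>
<pre>
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Project}

def liftAggregates(plan: LogicalPlan): LogicalPlan =
  plan.transform {
    case Project(projectList, child)
      if projectList.exists {
           // true if an AggregateFunction appears anywhere in the
           // expression tree, e.g., nested under an Alias
           _.collectFirst { case agg: AggregateFunction => agg }.isDefined
         } =>
      Aggregate(
        groupingExpressions = Seq(),
        aggregateExpressions = projectList,
        child = child
      )
  }
</pre>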
<h3>Aggregate</h3>
An <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/plans/logical/Aggregate.html">Aggregate</a> is a logical plan node with three fields (a sketch of the grouping step follows the list):
<dl>
<dt>groupingExpressions</dt>
<dd>The GROUP BY attributes. Normally these can be any expression, but for this checkpoint, it will be sufficient to assume that all of these expressions are <tt>Attribute</tt>s</dd>
<dt>aggregateExpressions</dt>
<dd>The SELECT expressions. Normally these can be arbitrary arithmetic over aggregates, but for this checkpoint, it will be sufficient to assume that all of these expressions are either an <tt>Alias</tt> of an <tt>AggregateFunction</tt>, or an <tt>Attribute</tt> that also appears in the groupingExpressions field.</dd>
<dt>child</dt>
<dd>The input operator.</dd>
</dl>
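<p>
A minimal sketch of the grouping half of evaluation, assuming (as above) that the groupingExpressions are plain <tt>Attribute</tt>s: bucket the child's rows by their grouping values, then fold each bucket through the aggregate functions using the machinery described in the next two sections. The <tt>groupInput</tt> helper is illustrative, not a Spark API.
</p>
<pre>
import scala.collection.mutable
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, BindReferences, Expression}

def groupInput(
  groupingAttributes: Seq[Attribute],
  childOutput: Seq[Attribute],
  rows: Iterator[InternalRow]
): mutable.LinkedHashMap[Seq[Any], mutable.ArrayBuffer[InternalRow]] =
{
  // Bind each grouping attribute to its position in the child's output,
  // so that it can be evaluated directly on the child's rows.
  val boundKeys: Seq[Expression] =
    groupingAttributes.map { a =>
      BindReferences.bindReference(a: Expression, childOutput)
    }
  val groups =
    mutable.LinkedHashMap[Seq[Any], mutable.ArrayBuffer[InternalRow]]()
  for(row <- rows){
    val key = boundKeys.map { _.eval(row) }
    // copy() matters if the iterator reuses a single InternalRow instance
    groups.getOrElseUpdate(key, mutable.ArrayBuffer()) += row.copy()
  }
  groups
}
</pre>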
<h3>AggregateFunctions</h3>
<p>
<tt>AggregateFunction</tt>s are unevaluable, because they don't get evaluated on a single row. Instead, there are several methods on an <tt>AggregateFunction</tt> that describe how to initialize an accumulator (what Spark calls an AggregationBuffer), how to incorporate input rows into it, and how to extract a final result value from the buffer.
</p>
<p>
The <tt>AggregateFunction</tt> can be an instance of either:
<ul>
<li><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/DeclarativeAggregate.html">DeclarativeAggregate</a>: Update operations are given as Spark <tt>Expression</tt>s</li>
<li><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/ImperativeAggregate.html">ImperativeAggregate</a>: Update operations are given as Scala functions</li>
</ul>
For the purposes of this checkpoint, you will need to support SUM, COUNT, AVG, MIN, and MAX, all of which are implemented in Spark as <tt>DeclarativeAggregate</tt>s.
</p>
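<p>
A quick sketch of discriminating between the two cases (only the first arm is needed for this checkpoint):
</p>
<pre>
import org.apache.spark.sql.catalyst.expressions.aggregate.{
  AggregateFunction, DeclarativeAggregate, ImperativeAggregate
}

def planFor(fn: AggregateFunction): Unit =
  fn match {
    case d: DeclarativeAggregate =>
      // Use d.initialValues / d.updateExpressions / d.evaluateExpression,
      // as described in the next section.
      ???
    case i: ImperativeAggregate =>
      // Buffer manipulated by Scala methods; not needed for this checkpoint.
      ???
  }
</pre>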
<h3>DeclarativeAggregates</h3>
<p>The following methods are relevant; a sketch that uses them follows the list:</p>
<dl>
<dt><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/DeclarativeAggregate.html#aggBufferAttributes:Seq[org.apache.spark.sql.catalyst.expressions.AttributeReference]">aggBufferAttributes</a></dt>
<dd>The "schema" of the aggregate buffer. Note that these are attributes, and their ExprIds here line up with the Attributes used in the expressions below.</dd>
<dt><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/DeclarativeAggregate.html#initialValues:Seq[org.apache.spark.sql.catalyst.expressions.Expression]">initialValues</a></dt>
<dd>A sequence of expressions, one for every attribute in aggBufferAttributes. These are the initial values for the buffer.</dd>
<dt><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/DeclarativeAggregate.html#updateExpressions:Seq[org.apache.spark.sql.catalyst.expressions.Expression]">updateExpressions</a></dt>
<dd>A sequence of expressions, one for every attribute in aggBufferAttributes. Evaluate these expressions on an InternalRow that includes both the aggBufferAttributes and the <tt>.output</tt> of the Aggregate's child LogicalPlan operator.</dd>
<dt><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/DeclarativeAggregate.html#evaluateExpression:org.apache.spark.sql.catalyst.expressions.Expression">evaluateExpression</a></dt>
<dd>An expression that, if evaluated on an InternalRow storing the aggregation buffer, will return the result of the aggregate function.</dd>
</dl>
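<p>
Putting these together: a sketch of running a single <tt>DeclarativeAggregate</tt> over the rows of one group. This is one way to do it, not the only one; <tt>runAggregate</tt> and its signature are illustrative, not part of any Spark API.
</p>
<pre>
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{
  Attribute, BindReferences, GenericInternalRow, JoinedRow
}
import org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate

def runAggregate(
  agg: DeclarativeAggregate,
  childOutput: Seq[Attribute],
  rows: Seq[InternalRow]
): Any =
{
  // 1. Initialize the buffer: one slot per aggBufferAttribute.
  val buffer = new GenericInternalRow(
    agg.initialValues.map { _.eval(InternalRow.empty) }.toArray
  )

  // The update expressions see the buffer attributes first, then the
  // child's output, so bind them against that combined schema.
  val boundUpdates =
    agg.updateExpressions.map {
      BindReferences.bindReference(_, agg.aggBufferAttributes ++ childOutput)
    }

  // 2. Fold each input row into the buffer.  Evaluate all of the update
  //    expressions before writing, so none sees a half-updated buffer.
  for(row <- rows){
    val joined = new JoinedRow(buffer, row)
    val updated = boundUpdates.map { _.eval(joined) }
    for(i <- updated.indices){ buffer.update(i, updated(i)) }
  }

  // 3. Extract the final result from the buffer.
  BindReferences
    .bindReference(agg.evaluateExpression, agg.aggBufferAttributes)
    .eval(buffer)
}
</pre>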
<h3>Example Queries</h3>
<p><a href="http://www.tpc.org/tpch/">TPC-H</a> is a standard database benchmark. The benchmark consists of a dataset generator and 22 standard query templates. This checkpoint uses eight queries based on TPC-H Queries 1, 3, 5, 6, 10, 11, 12, and 14. The dataset generator and template values can be found at the <a href="http://www.tpc.org/tpch/">TPC-H website</a>; the dataset is generated at scale factor (SF) 0.1. Minor variations in the queries may be made. The queries have been rewritten slightly to make them easier to analyze.</p>
<h5>Query 1</h5>
<pre>
SELECT
LINEITEM.RETURNFLAG,
LINEITEM.LINESTATUS,
SUM(LINEITEM.QUANTITY) AS SUM_QTY,
SUM(LINEITEM.EXTENDEDPRICE) AS SUM_BASE_PRICE,
SUM(LINEITEM.EXTENDEDPRICE*(CAST(1.0 as float)-LINEITEM.DISCOUNT)) AS SUM_DISC_PRICE,
SUM(LINEITEM.EXTENDEDPRICE*(CAST(1.0 as float)-LINEITEM.DISCOUNT)*(CAST(1.0 as float)+LINEITEM.TAX)) AS SUM_CHARGE,
AVG(LINEITEM.QUANTITY) AS AVG_QTY,
AVG(LINEITEM.EXTENDEDPRICE) AS AVG_PRICE,
AVG(LINEITEM.DISCOUNT) AS AVG_DISC,
COUNT(*) AS COUNT_ORDER
FROM
LINEITEM
WHERE
LINEITEM.SHIPDATE <= DATE '1998-10-01'
GROUP BY
LINEITEM.RETURNFLAG, LINEITEM.LINESTATUS
ORDER BY
LINEITEM.RETURNFLAG, LINEITEM.LINESTATUS
</pre>
<h5>Query 3</h5>
<pre>
SELECT
LINEITEM.ORDERKEY,
SUM(LINEITEM.EXTENDEDPRICE*(CAST(1.0 as float)-LINEITEM.DISCOUNT)) AS REVENUE,
ORDERS.ORDERDATE,
ORDERS.SHIPPRIORITY
FROM
CUSTOMER,
ORDERS,
LINEITEM
WHERE
CUSTOMER.MKTSEGMENT = 'BUILDING' AND CUSTOMER.CUSTKEY = ORDERS.CUSTKEY
AND LINEITEM.ORDERKEY = ORDERS.ORDERKEY
AND ORDERS.ORDERDATE < DATE '1995-03-15'
AND LINEITEM.SHIPDATE > DATE '1995-03-15'
GROUP BY LINEITEM.ORDERKEY, ORDERS.ORDERDATE, ORDERS.SHIPPRIORITY
ORDER BY REVENUE DESC, ORDERDATE
LIMIT 10
</pre>
<h5>Query 5</h5>
<pre>
SELECT
NATION.NAME,
SUM(LINEITEM.EXTENDEDPRICE * (CAST(1.0 as float) - LINEITEM.DISCOUNT)) AS REVENUE
FROM
REGION, NATION, CUSTOMER, ORDERS, LINEITEM, SUPPLIER
WHERE
CUSTOMER.CUSTKEY = ORDERS.CUSTKEY
AND LINEITEM.ORDERKEY = ORDERS.ORDERKEY
AND LINEITEM.SUPPKEY = SUPPLIER.SUPPKEY
AND CUSTOMER.NATIONKEY = NATION.NATIONKEY
AND SUPPLIER.NATIONKEY = NATION.NATIONKEY
AND NATION.REGIONKEY = REGION.REGIONKEY
AND REGION.NAME = 'ASIA'
AND ORDERS.ORDERDATE >= DATE '1994-01-01'
AND ORDERS.ORDERDATE < DATE '1995-01-01'
GROUP BY NATION.NAME
ORDER BY REVENUE DESC
</pre>
<h5>Query 6</h5>
<pre>
SELECT
SUM(LINEITEM.EXTENDEDPRICE*LINEITEM.DISCOUNT) AS REVENUE
FROM LINEITEM
WHERE LINEITEM.SHIPDATE >= DATE '1994-01-01'
AND LINEITEM.SHIPDATE < DATE '1995-01-01'
AND LINEITEM.DISCOUNT > CAST(0.05 AS float) AND LINEITEM.DISCOUNT < CAST(0.07 as float)
AND LINEITEM.QUANTITY < CAST(24 AS float)
</pre>
<h5>Query 10</h5>
<pre>
SELECT
CUSTOMER.CUSTKEY,
SUM(LINEITEM.EXTENDEDPRICE * (CAST(1.0 as float) - LINEITEM.DISCOUNT)) AS REVENUE,
CUSTOMER.ACCTBAL,
NATION.NAME,
CUSTOMER.ADDRESS,
CUSTOMER.PHONE,
CUSTOMER.COMMENT
FROM
CUSTOMER, ORDERS, LINEITEM, NATION
WHERE
CUSTOMER.CUSTKEY = ORDERS.CUSTKEY
AND LINEITEM.ORDERKEY = ORDERS.ORDERKEY
AND ORDERS.ORDERDATE >= DATE '1993-10-01'
AND ORDERS.ORDERDATE < DATE '1994-01-01'
AND LINEITEM.RETURNFLAG = 'R'
AND CUSTOMER.NATIONKEY = NATION.NATIONKEY
GROUP BY
CUSTOMER.CUSTKEY, CUSTOMER.ACCTBAL, CUSTOMER.PHONE, NATION.NAME, CUSTOMER.ADDRESS, CUSTOMER.COMMENT
ORDER BY REVENUE ASC
LIMIT 20
</pre>
<h5>Query 11</h5>
<pre>
SELECT PK_V.PARTKEY,
PK_V.VALUE
FROM (
SELECT PS.PARTKEY,
SUM(PS.SUPPLYCOST * CAST(PS.AVAILQTY AS float)) AS VALUE
FROM PARTSUPP PS,
SUPPLIER S,
NATION N
WHERE PS.SUPPKEY = S.SUPPKEY
AND S.NATIONKEY = N.NATIONKEY
AND N.NAME = 'GERMANY'
GROUP BY PS.PARTKEY
) PK_V, (
SELECT SUM(PS.SUPPLYCOST * CAST(PS.AVAILQTY AS float)) AS VALUE
FROM PARTSUPP PS,
SUPPLIER S,
NATION N
WHERE PS.SUPPKEY = S.SUPPKEY
AND S.NATIONKEY = N.NATIONKEY
AND N.NAME = 'GERMANY'
) CUTOFF_V
WHERE PK_V.VALUE > (CUTOFF_V.VALUE * CAST(0.0001 AS double) / CAST(100.0 AS double))
ORDER BY PK_V.VALUE DESC
</pre>
<h5>Query 12</h5>
<pre>
SELECT LINEITEM.SHIPMODE,
SUM(CASE WHEN ORDERS.ORDERPRIORITY = '1-URGENT'
OR ORDERS.ORDERPRIORITY = '2-HIGH'
THEN 1
ELSE 0 END) AS HIGH_LINE_COUNT,
SUM(CASE WHEN ORDERS.ORDERPRIORITY <> '1-URGENT'
AND ORDERS.ORDERPRIORITY <> '2-HIGH'
THEN 1
ELSE 0 END) AS LOW_LINE_COUNT
FROM LINEITEM, ORDERS
WHERE ORDERS.ORDERKEY = LINEITEM.ORDERKEY
AND (LINEITEM.SHIPMODE='MAIL' OR LINEITEM.SHIPMODE='SHIP')
AND LINEITEM.COMMITDATE < LINEITEM.RECEIPTDATE
AND LINEITEM.SHIPDATE < LINEITEM.COMMITDATE
AND LINEITEM.RECEIPTDATE >= DATE '1994-01-01'
AND LINEITEM.RECEIPTDATE < DATE '1995-01-01'
GROUP BY LINEITEM.SHIPMODE
ORDER BY LINEITEM.SHIPMODE
</pre>
<h5>Query 14</h5>
<pre>
SELECT
CAST(100.00 AS double)
* PROMO_ONLY
/ ALL_REVENUE
AS PROMO_REVENUE
FROM (
SELECT
SUM(
CASE WHEN PART.TYPE LIKE 'PROMO%'
THEN LINEITEM.EXTENDEDPRICE * (CAST(1.0 as float) - LINEITEM.DISCOUNT)
ELSE cast(0 as float)
END
) AS PROMO_ONLY,
SUM(
LINEITEM.EXTENDEDPRICE * (CAST(1.0 as float) - LINEITEM.DISCOUNT)
) AS ALL_REVENUE
FROM
LINEITEM,
PART
WHERE
LINEITEM.PARTKEY = PART.PARTKEY
AND LINEITEM.SHIPDATE >= DATE '1995-09-01'
AND LINEITEM.SHIPDATE < DATE '1995-10-01'
) AGGREGATE
</pre>
</div>


@@ -117,8 +117,9 @@ schedule:
 - date: "Apr. 20"
   topic: "Indexing Review + Checkpoint 4"
 - date: "Apr. 22"
-  due: "Checkpoint 3"
   topic: "Logging + Recovery"
+- date: "Apr. 26"
+  due: "Checkpoint 3"
 - date: "Apr. 27"
   topic: "Distributed Commit"
 - date: "Apr. 29"
@@ -187,8 +188,8 @@ In this course, you will learn...
 <ul>
 <li>5% <a href="checkpoint0.html">Checkpoint 0</a> due on Feb. 16</li>
 <li>10% <a href="checkpoint1.html">Checkpoint 1</a> due on Mar. 15</li>
-<li>12% <a href="checkpoint2.html">Checkpoint 2</a> due on Apr. 6</li>
-<li>8% <a>Checkpoint 3</a> due on Apr. 20</li>
+<li>12% <a href="checkpoint2.html">Checkpoint 2</a> due on Apr. 9</li>
+<li>8% <a href="checkpoint3.html">Checkpoint 3</a> due on Apr. 26</li>
 <li>15% <a>Checkpoint 4</a> due on May 14</li>
 </ul>
 </li>