Checkpoint 3

Oliver Kennedy 2021-04-13 21:57:12 -04:00
parent df1a4c23cc
commit f898857de4
Signed by: okennedy
GPG key ID: 3E5F9B3ABD3FDB60
2 changed files with 334 additions and 3 deletions


@@ -0,0 +1,330 @@
---
title: "CSE-4/562 Database Systems: Checkpoint 3"
---
<%
def console(str)
return "<tt style='background-color: #111; color: #3f3; padding: 3px;'>#{str}</tt>"
end
prompt = console("$&gt;\\n")
%>
<div style="width: 600px; margin-left: auto; margin-right: auto;">
<h2>Checkpoint 3</h2>
<p>
In this project, you'll extend your SQL runtime to support aggregate queries. It is worth 8 points.
</p>
<h3>Requirements</h3>
<ul>
<li>All <tt>.scala</tt> files in <tt>/src/main/scala</tt> and its subdirectories will be compiled and the main function of the object <tt>microbase.Microbase</tt> will be run.</li>
<li>The grader will wait until the code prints <%= prompt %>. If this takes more than 2 seconds, you will receive a 0.</li>
<li>The grader will write a series of <tt>CREATE TABLE</tt> and <tt>SELECT</tt> commands to your code's <tt>System.in</tt>, with one command per <tt>\n</tt>-delimited line. After processing each statement, your code <b>must</b> print <%= prompt %> on a new line to indicate that it is done. If your code exceeds a per-operation time-out, you will receive a 0 for that query and all subsequent parts of the assignment.</li>
<li>When your code is provided with a <tt>CREATE TABLE</tt> statement, this indicates that there is a file called <%= console("data/[tableName].data") %>, where <tt>[tableName]</tt> is the name of the table. This file contains UTF-8-encoded records, one per <tt>\n</tt>-delimited line, with fields in human-readable string representation (i.e., as in a CSV file) delimited by the pipe character (<tt>|</tt>). Note that there will <b>not</b> be any <tt>INSERT</tt> statements.</li>
<li>When your code is provided with a <tt>SELECT</tt> statement, it <b>must</b> evaluate the SELECT statement and print the results to <tt>System.out</tt>, one per <tt>\n</tt>-delimited line, with fields in human-readable string representations delimited by the pipe character (<tt>|</tt>). You will be expected to support the following features of SQL:<ul>
<li>Arbitrary expression targets</li>
<li>Attribute and relation aliasing (i.e., <tt>SELECT bar.foo AS baz FROM table AS bar</tt>)</li>
<li>Project, Filter, Table, and Equi-Joins</li>
<li>Order-By and LIMIT queries</li>
<li style="font-weight: bold">Aggregate (Single-Valued, and GROUP BY) queries</li>
<li><tt>FROM</tt>-Nested Subqueries</li>
</ul></li>
<li>Your responses to <tt>SELECT</tt> queries will be checked against SQLite3.</li>
<li>Once again, your code <b>must</b> print <%= prompt %> on a new line after each <tt>CREATE TABLE</tt> or <tt>SELECT</tt> to indicate that it is done processing the statement. A minimal sketch of this input loop appears after this list.</li>
<li>You <i>may</i> use <tt>System.err</tt> to print debugging information. This output (or a subset of it) will be included in your debug log in Autolab and will be ignored by the grading script.</li>
</ul>
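<p>
For concreteness, here is a minimal sketch of the input loop described above. The <tt>handleStatement</tt> helper is hypothetical; parsing and evaluating each statement is the substance of this project.
</p>
<pre>
import java.util.Scanner

object Microbase
{
  // Hypothetical dispatcher: a CREATE TABLE statement registers the schema
  // and the file data/[tableName].data; a SELECT statement is evaluated and
  // its result rows printed to System.out, pipe-delimited, one per line.
  def handleStatement(sql: String): Unit = ???

  def main(args: Array[String]): Unit =
  {
    val input = new Scanner(System.in)
    println("$>")                 // initial prompt: must appear within 2 seconds
    while(input.hasNextLine){
      handleStatement(input.nextLine)
      println("$>")               // signal that this statement is done
    }
  }
}
</pre>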
<h3>Grading Rubric</h3>
All tests will be run on dedicated hardware equipped with an Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz with a standard 5400 RPM HDD. Queries will have a per-query timeout as listed below. Grading will be based on total runtime for each batch of queries.
<dl>
<dt>8 randomly-generated queries based on the <a href="http://www.tpc.org/tpch/">TPC-H</a> benchmark, with a scale factor of 0.1 (100MB) and templates listed below</dt>
<dd><b>Under 80 seconds total</b>: 8 of 8 points + leaderboard ranking</dd>
<dd><b>Under 150 seconds total</b>: 8 of 8 points</dd>
<dd><b>Under 5 minutes total</b>: 4 of 8 points</dd>
<dd><b>Under 1 minute per query</b>: 2 of 8 points</dd>
</dl>
Note in particular that these queries make extensive use of aggregates, equi-joins, order-by, and limit clauses, which will all need to be supported.
<h3>UnresolvedFunction</h3>
<p>
When the Spark SQL Parser encounters something that looks like a function, it doesn't try to interpret it directly. Instead, it'll produce a <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/analysis/UnresolvedFunction.html">UnresolvedFunction</a> expression node. You'll need to replace these.
</p>
<p>
Like most databases, Spark maintains a <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/analysis/FunctionRegistry$.html">"Function Registry"</a>, a catalog of all functions and their implementations. All of the "built-in" functions are provided in <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/analysis/FunctionRegistry$.html#builtin:org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry">FunctionRegistry.builtin</a>. Here's a little snippet you can use to replace functions. It doesn't support everything, but will be sufficient for this project.
</p>
<pre>
// Assumes:
//   import org.apache.spark.sql.catalyst.analysis.{FunctionRegistry, UnresolvedFunction}
case UnresolvedFunction(name, arguments, isDistinct, filter, ignoreNulls) =>
{
  // Look the function up in the registry of built-in implementations.
  val builder =
    FunctionRegistry.builtin
      .lookupFunctionBuilder(name)
      .getOrElse {
        throw new RuntimeException(
          s"Unable to resolve function `${name}`"
        )
      }
  builder(arguments) // returns the replacement expression node.
}
</pre>
<p>
<a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/analysis/SimpleFunctionRegistry.html#lookupFunction(name:org.apache.spark.sql.catalyst.FunctionIdentifier):Option[org.apache.spark.sql.catalyst.expressions.ExpressionInfo]">FunctionRegistry.lookupFunctionBuilder</a> returns a 'builder' function. When called with the arguments of the UnresolvedFunction, the builder returns an expression that implements the function. For example, looking up "regexp_extract" in the registry returns a builder that, when called with two string-typed expressions and a literal integer, returns a <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/RegExpExtract.html">RegExpExtract</a> object.
</p>
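<p>
As a sketch of where this case might live (assuming the <tt>UnresolvedFunction</tt> signature shown above, which varies across Catalyst versions), you can apply it to every expression in a plan with <tt>transformAllExpressions</tt>:
</p>
<pre>
import org.apache.spark.sql.catalyst.analysis.{FunctionRegistry, UnresolvedFunction}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

def resolveFunctions(plan: LogicalPlan): LogicalPlan =
  // Rewrite every UnresolvedFunction, wherever it appears in the plan.
  plan.transformAllExpressions {
    case UnresolvedFunction(name, arguments, isDistinct, filter, ignoreNulls) =>
      FunctionRegistry.builtin
        .lookupFunctionBuilder(name)
        .getOrElse {
          throw new RuntimeException(s"Unable to resolve function `${name}`")
        }
        .apply(arguments)
  }
</pre>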
<h3>Aggregates disguised as Projects</h3>
<p>
Because the Spark SQL Parser doesn't try to resolve functions, it is incapable of distinguishing between normal functions:
<pre>
SELECT regexp_extract(A, "a(b+)a", 1) FROM R
</pre>
and aggregate functions:
<pre>
SELECT sum(A) FROM R
</pre>
Both of these will parse into a LogicalPlan topped with a Project node.
</p>
<p>
While not required, you might find it easier to work with the resulting plans if you replace them with actual <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/plans/logical/Aggregate.html">Aggregate</a> plan nodes. Look for Project nodes whose <tt>projectList</tt> contains an expression with a subclass of <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/AggregateFunction.html">AggregateFunction</a> anywhere in its tree (typically nested under an <tt>Alias</tt>), as in the sketch below.
</p>
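<p>
Under the simplifying assumptions above, one (hedged) way to write the rewrite: any <tt>Project</tt> whose <tt>projectList</tt> mentions an aggregate becomes an <tt>Aggregate</tt> with no grouping expressions. Queries with an explicit <tt>GROUP BY</tt> clause already parse into <tt>Aggregate</tt> nodes, so only the ungrouped case needs this treatment.
</p>
<pre>
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Project}

def liftAggregates(plan: LogicalPlan): LogicalPlan =
  plan.transform {
    case Project(projectList, child)
      if projectList.exists {
           // true if an AggregateFunction appears anywhere in the
           // expression tree, e.g., nested under an Alias
           _.collectFirst { case agg: AggregateFunction => agg }.isDefined
         } =>
      Aggregate(
        groupingExpressions = Seq(),
        aggregateExpressions = projectList,
        child = child
      )
  }
</pre>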
<h3>Aggregate</h3>
An <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/plans/logical/Aggregate.html">Aggregate</a> is a logical plan node with three fields (a sketch of the grouping step follows the list):
<dl>
<dt>groupingExpressions</dt>
<dd>The GROUP BY attributes. Normally these can be any expression, but for this checkpoint, it will be sufficient to assume that all of these expressions are <tt>Attribute</tt>s</dd>
<dt>aggregateExpressions</dt>
<dd>The SELECT expressions. Normally these can be arbitrary arithmetic over aggregates, but for this checkpoint, it will be sufficient to assume that all of these expressions are either an <tt>Alias</tt> of an <tt>AggregateFunction</tt>, or an <tt>Attribute</tt> that also appears in the groupingExpressions field.</dd>
<dt>child</dt>
<dd>The input operator.</dd>
</dl>
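<p>
A minimal sketch of the grouping half of evaluation, assuming (as above) that the groupingExpressions are plain <tt>Attribute</tt>s: bucket the child's rows by their grouping values, then fold each bucket through the aggregate functions using the machinery described in the next two sections. The <tt>groupInput</tt> helper is illustrative, not a Spark API.
</p>
<pre>
import scala.collection.mutable
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, BindReferences, Expression}

def groupInput(
  groupingAttributes: Seq[Attribute],
  childOutput: Seq[Attribute],
  rows: Iterator[InternalRow]
): mutable.LinkedHashMap[Seq[Any], mutable.ArrayBuffer[InternalRow]] =
{
  // Bind each grouping attribute to its position in the child's output,
  // so that it can be evaluated directly on the child's rows.
  val boundKeys: Seq[Expression] =
    groupingAttributes.map { a =>
      BindReferences.bindReference(a: Expression, childOutput)
    }
  val groups =
    mutable.LinkedHashMap[Seq[Any], mutable.ArrayBuffer[InternalRow]]()
  for(row <- rows){
    val key = boundKeys.map { _.eval(row) }
    // copy() matters if the iterator reuses a single InternalRow instance
    groups.getOrElseUpdate(key, mutable.ArrayBuffer()) += row.copy()
  }
  groups
}
</pre>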
<h3>AggregateFunctions</h3>
<p>
<tt>AggregateFunction</tt>s are unevaluable, because they don't get evaluated on a single row. Instead, there are several methods on an <tt>AggregateFunction</tt> that describe how to initialize an accumulator (what Spark calls an AggregationBuffer), how to incorporate input rows into it, and how to extract a final result value from the buffer.
</p>
<p>
The <tt>AggregateFunction</tt> can be an instance of either:
<ul>
<li><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/DeclarativeAggregate.html">DeclarativeAggregate</a>: Update operations are given as Spark <tt>Expression</tt>s</li>
<li><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/ImperativeAggregate.html">ImperativeAggregate</a>: Update operations are given as Scala functions</li>
</ul>
For the purposes of this checkpoint, you will need to support SUM, COUNT, AVG, MIN, and MAX, all of which are implemented in Spark as <tt>DeclarativeAggregate</tt>s.
</p>
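<p>
A quick sketch of discriminating between the two cases (only the first arm is needed for this checkpoint):
</p>
<pre>
import org.apache.spark.sql.catalyst.expressions.aggregate.{
  AggregateFunction, DeclarativeAggregate, ImperativeAggregate
}

def planFor(fn: AggregateFunction): Unit =
  fn match {
    case d: DeclarativeAggregate =>
      // Use d.initialValues / d.updateExpressions / d.evaluateExpression,
      // as described in the next section.
      ???
    case i: ImperativeAggregate =>
      // Buffer manipulated by Scala methods; not needed for this checkpoint.
      ???
  }
</pre>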
<h3>DeclarativeAggregates</h3>
<p>The following methods are relevant; a sketch that uses them follows the list:</p>
<dl>
<dt><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/DeclarativeAggregate.html#aggBufferAttributes:Seq[org.apache.spark.sql.catalyst.expressions.AttributeReference]">aggBufferAttributes</a></dt>
<dd>The "schema" of the aggregate buffer. Note that these are attributes, and their ExprIds here line up with the Attributes used in the expressions below.</dd>
<dt><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/DeclarativeAggregate.html#initialValues:Seq[org.apache.spark.sql.catalyst.expressions.Expression]">initialValues</a></dt>
<dd>A sequence of expressions, one for every attribute in aggBufferAttributes. These are the initial values for the buffer.</dd>
<dt><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/DeclarativeAggregate.html#updateExpressions:Seq[org.apache.spark.sql.catalyst.expressions.Expression]">updateExpressions</a></dt>
<dd>A sequence of expressions, one for every attribute in aggBufferAttributes. Evaluate these expressions on an InternalRow that includes both the aggBufferAttributes and the <tt>.output</tt> of the Aggregate's child LogicalPlan operator.</dd>
<dt><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/DeclarativeAggregate.html#evaluateExpression:org.apache.spark.sql.catalyst.expressions.Expression">evaluateExpression</a></dt>
<dd>An expression that, if evaluated on an InternalRow storing the aggregation buffer, will return the result of the aggregate function.</dd>
</dl>
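<p>
Putting these together: a sketch of running a single <tt>DeclarativeAggregate</tt> over the rows of one group. This is one way to do it, not the only one; <tt>runAggregate</tt> and its signature are illustrative, not part of any Spark API.
</p>
<pre>
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{
  Attribute, BindReferences, GenericInternalRow, JoinedRow
}
import org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate

def runAggregate(
  agg: DeclarativeAggregate,
  childOutput: Seq[Attribute],
  rows: Seq[InternalRow]
): Any =
{
  // 1. Initialize the buffer: one slot per aggBufferAttribute.
  val buffer = new GenericInternalRow(
    agg.initialValues.map { _.eval(InternalRow.empty) }.toArray
  )

  // The update expressions see the buffer attributes first, then the
  // child's output, so bind them against that combined schema.
  val boundUpdates =
    agg.updateExpressions.map {
      BindReferences.bindReference(_, agg.aggBufferAttributes ++ childOutput)
    }

  // 2. Fold each input row into the buffer.  Evaluate all of the update
  //    expressions before writing, so none sees a half-updated buffer.
  for(row <- rows){
    val joined = new JoinedRow(buffer, row)
    val updated = boundUpdates.map { _.eval(joined) }
    for(i <- updated.indices){ buffer.update(i, updated(i)) }
  }

  // 3. Extract the final result from the buffer.
  BindReferences
    .bindReference(agg.evaluateExpression, agg.aggBufferAttributes)
    .eval(buffer)
}
</pre>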
<h3>Example Queries</h3>
<p><a href="http://www.tpc.org/tpch/">TPC-H</a> is a standard database benchmark. The benchmark consists of a dataset generator and 22 standard query templates. This checkpoint uses eight queries based on TPC-H Queries 1, 3, 5, 6, 10, 11, 12, and 14. The dataset generator and template values can be found at the <a href="http://www.tpc.org/tpch/">TPC-H website</a>; the dataset is generated at scale factor (SF) 0.1. Minor variations in the queries may be made. The queries have been rewritten slightly to make them easier to analyze.</p>
<h5>Query 1</h5>
<pre>
SELECT
LINEITEM.RETURNFLAG,
LINEITEM.LINESTATUS,
SUM(LINEITEM.QUANTITY) AS SUM_QTY,
SUM(LINEITEM.EXTENDEDPRICE) AS SUM_BASE_PRICE,
SUM(LINEITEM.EXTENDEDPRICE*(CAST(1.0 as float)-LINEITEM.DISCOUNT)) AS SUM_DISC_PRICE,
SUM(LINEITEM.EXTENDEDPRICE*(CAST(1.0 as float)-LINEITEM.DISCOUNT)*(CAST(1.0 as float)+LINEITEM.TAX)) AS SUM_CHARGE,
AVG(LINEITEM.QUANTITY) AS AVG_QTY,
AVG(LINEITEM.EXTENDEDPRICE) AS AVG_PRICE,
AVG(LINEITEM.DISCOUNT) AS AVG_DISC,
COUNT(*) AS COUNT_ORDER
FROM
LINEITEM
WHERE
LINEITEM.SHIPDATE <= DATE '1998-10-01'
GROUP BY
LINEITEM.RETURNFLAG, LINEITEM.LINESTATUS
ORDER BY
LINEITEM.RETURNFLAG, LINEITEM.LINESTATUS
</pre>
<h5>Query 3</h5>
<pre>
SELECT
LINEITEM.ORDERKEY,
SUM(LINEITEM.EXTENDEDPRICE*(CAST(1.0 as float)-LINEITEM.DISCOUNT)) AS REVENUE,
ORDERS.ORDERDATE,
ORDERS.SHIPPRIORITY
FROM
CUSTOMER,
ORDERS,
LINEITEM
WHERE
CUSTOMER.MKTSEGMENT = 'BUILDING' AND CUSTOMER.CUSTKEY = ORDERS.CUSTKEY
AND LINEITEM.ORDERKEY = ORDERS.ORDERKEY
AND ORDERS.ORDERDATE < DATE '1995-03-15'
AND LINEITEM.SHIPDATE > DATE '1995-03-15'
GROUP BY LINEITEM.ORDERKEY, ORDERS.ORDERDATE, ORDERS.SHIPPRIORITY
ORDER BY REVENUE DESC, ORDERDATE
LIMIT 10
</pre>
<h5>Query 5</h5>
<pre>
SELECT
NATION.NAME,
SUM(LINEITEM.EXTENDEDPRICE * (CAST(1.0 as float) - LINEITEM.DISCOUNT)) AS REVENUE
FROM
REGION, NATION, CUSTOMER, ORDERS, LINEITEM, SUPPLIER
WHERE
CUSTOMER.CUSTKEY = ORDERS.CUSTKEY
AND LINEITEM.ORDERKEY = ORDERS.ORDERKEY
AND LINEITEM.SUPPKEY = SUPPLIER.SUPPKEY
AND CUSTOMER.NATIONKEY = NATION.NATIONKEY
AND SUPPLIER.NATIONKEY = NATION.NATIONKEY
AND NATION.REGIONKEY = REGION.REGIONKEY
AND REGION.NAME = 'ASIA'
AND ORDERS.ORDERDATE >= DATE '1994-01-01'
AND ORDERS.ORDERDATE < DATE '1995-01-01'
GROUP BY NATION.NAME
ORDER BY REVENUE DESC
</pre>
<h5>Query 6</h5>
<pre>
SELECT
SUM(LINEITEM.EXTENDEDPRICE*LINEITEM.DISCOUNT) AS REVENUE
FROM LINEITEM
WHERE LINEITEM.SHIPDATE >= DATE '1994-01-01'
AND LINEITEM.SHIPDATE < DATE '1995-01-01'
AND LINEITEM.DISCOUNT > CAST(0.05 AS float) AND LINEITEM.DISCOUNT < CAST(0.07 as float)
AND LINEITEM.QUANTITY < CAST(24 AS float)
</pre>
<h5>Query 10</h5>
<pre>
SELECT
CUSTOMER.CUSTKEY,
SUM(LINEITEM.EXTENDEDPRICE * (CAST(1.0 as float) - LINEITEM.DISCOUNT)) AS REVENUE,
CUSTOMER.ACCTBAL,
NATION.NAME,
CUSTOMER.ADDRESS,
CUSTOMER.PHONE,
CUSTOMER.COMMENT
FROM
CUSTOMER, ORDERS, LINEITEM, NATION
WHERE
CUSTOMER.CUSTKEY = ORDERS.CUSTKEY
AND LINEITEM.ORDERKEY = ORDERS.ORDERKEY
AND ORDERS.ORDERDATE >= DATE '1993-10-01'
AND ORDERS.ORDERDATE < DATE '1994-01-01'
AND LINEITEM.RETURNFLAG = 'R'
AND CUSTOMER.NATIONKEY = NATION.NATIONKEY
GROUP BY
CUSTOMER.CUSTKEY, CUSTOMER.ACCTBAL, CUSTOMER.PHONE, NATION.NAME, CUSTOMER.ADDRESS, CUSTOMER.COMMENT
ORDER BY REVENUE ASC
LIMIT 20
</pre>
<h5>Query 11</h5>
<pre>
SELECT PK_V.PARTKEY,
PK_V.VALUE
FROM (
SELECT PS.PARTKEY,
SUM(PS.SUPPLYCOST * CAST(PS.AVAILQTY AS float)) AS VALUE
FROM PARTSUPP PS,
SUPPLIER S,
NATION N
WHERE PS.SUPPKEY = S.SUPPKEY
AND S.NATIONKEY = N.NATIONKEY
AND N.NAME = 'GERMANY'
GROUP BY PS.PARTKEY
) PK_V, (
SELECT SUM(PS.SUPPLYCOST * CAST(PS.AVAILQTY AS float)) AS VALUE
FROM PARTSUPP PS,
SUPPLIER S,
NATION N
WHERE PS.SUPPKEY = S.SUPPKEY
AND S.NATIONKEY = N.NATIONKEY
AND N.NAME = 'GERMANY'
) CUTOFF_V
WHERE PK_V.VALUE > (CUTOFF_V.VALUE * CAST(0.0001 AS double) / CAST(100.0 AS double))
ORDER BY PK_V.VALUE DESC
</pre>
<h5>Query 12</h5>
<pre>
SELECT LINEITEM.SHIPMODE,
SUM(CASE WHEN ORDERS.ORDERPRIORITY = '1-URGENT'
OR ORDERS.ORDERPRIORITY = '2-HIGH'
THEN 1
ELSE 0 END) AS HIGH_LINE_COUNT,
SUM(CASE WHEN ORDERS.ORDERPRIORITY <> '1-URGENT'
AND ORDERS.ORDERPRIORITY <> '2-HIGH'
THEN 1
ELSE 0 END) AS LOW_LINE_COUNT
FROM LINEITEM, ORDERS
WHERE ORDERS.ORDERKEY = LINEITEM.ORDERKEY
AND (LINEITEM.SHIPMODE='MAIL' OR LINEITEM.SHIPMODE='SHIP')
AND LINEITEM.COMMITDATE < LINEITEM.RECEIPTDATE
AND LINEITEM.SHIPDATE < LINEITEM.COMMITDATE
AND LINEITEM.RECEIPTDATE >= DATE '1994-01-01'
AND LINEITEM.RECEIPTDATE < DATE '1995-01-01'
GROUP BY LINEITEM.SHIPMODE
ORDER BY LINEITEM.SHIPMODE
</pre>
<h5>Query 14</h5>
<pre>
SELECT
CAST(100.00 AS double)
* PROMO_ONLY
/ ALL_REVENUE
AS PROMO_REVENUE
FROM (
SELECT
SUM(
CASE WHEN PART.TYPE LIKE 'PROMO%'
THEN LINEITEM.EXTENDEDPRICE * (CAST(1.0 as float) - LINEITEM.DISCOUNT)
ELSE cast(0 as float)
END
) AS PROMO_ONLY,
SUM(
LINEITEM.EXTENDEDPRICE * (CAST(1.0 as float) - LINEITEM.DISCOUNT)
) AS ALL_REVENUE
FROM
LINEITEM,
PART
WHERE
LINEITEM.PARTKEY = PART.PARTKEY
AND LINEITEM.SHIPDATE >= DATE '1995-09-01'
AND LINEITEM.SHIPDATE < DATE '1995-10-01'
) AGGREGATE
</pre>
</div>


@@ -117,8 +117,9 @@ schedule:
 - date: "Apr. 20"
   topic: "Indexing Review + Checkpoint 4"
 - date: "Apr. 22"
-  due: "Checkpoint 3"
   topic: "Logging + Recovery"
+- date: "Apr. 26"
+  due: "Checkpoint 3"
 - date: "Apr. 27"
   topic: "Distributed Commit"
 - date: "Apr. 29"
@@ -187,8 +188,8 @@ In this course, you will learn...
 <ul>
 <li>5% <a href="checkpoint0.html">Checkpoint 0</a> due on Feb. 16</li>
 <li>10% <a href="checkpoint1.html">Checkpoint 1</a> due on Mar. 15</li>
-<li>12% <a href="checkpoint2.html">Checkpoint 2</a> due on Apr. 6</li>
-<li>8% <a>Checkpoint 3</a> due on Apr. 20</li>
+<li>12% <a href="checkpoint2.html">Checkpoint 2</a> due on Apr. 9</li>
+<li>8% <a href="checkpoint3.html">Checkpoint 3</a> due on Apr. 26</li>
 <li>15% <a>Checkpoint 4</a> due on May 14</li>
 </ul>
 </li>