Checkpoint 3

parent df1a4c23cc
commit f898857de4

src/teaching/cse-562/2021sp/checkpoint3.erb (new file, 330 lines)
@@ -0,0 +1,330 @@
---
title: "CSE-4/562 Database Systems: Checkpoint 3"
---

<%
  def console(str)
    return "<tt style='background-color: #111; color: #3f3; padding: 3px;'>#{str}</tt>"
  end

  prompt = console("$>\\n")
%>

<div style="width: 600px; margin-left: auto; margin-right: auto;">
<h2>Checkpoint 3</h2>

<p>
In this project, you'll extend your SQL runtime to support aggregate queries. It is worth 8 points.
</p>

<h3>Requirements</h3>

<ul>
<li>All <tt>.scala</tt> files in <tt>/src/main/scala</tt> and its subdirectories will be compiled, and the main function of the object <tt>microbase.Microbase</tt> will be run.</li>
<li>The grader will wait until your code prints <%= prompt %>. If this takes more than 2 seconds, you will receive a 0.</li>
<li>The grader will write a series of <tt>CREATE TABLE</tt> and <tt>SELECT</tt> commands to your code's <tt>System.in</tt>, with one command per <tt>\n</tt>-delimited line. After processing each statement, your code <b>must</b> print <%= prompt %> on a new line to indicate that it is done. If your code exceeds a per-operation timeout, you will receive a 0 for that query and all subsequent parts of the assignment.</li>
<li>When your code is provided with a <tt>CREATE TABLE</tt> statement, this indicates that there is a file called <%= console("data/[tableName].data") %>, where <tt>[tableName]</tt> is the name of the table. This file contains UTF-8-encoded records, one per <tt>\n</tt>-delimited line, with fields in human-readable string representation (i.e., as in a CSV file) delimited by the pipe character (<tt>|</tt>). Note that there will <b>not</b> be any <tt>INSERT</tt> statements.</li>
<li>When your code is provided with a <tt>SELECT</tt> statement, it <b>must</b> evaluate the statement and print the results to <tt>System.out</tt>, one record per <tt>\n</tt>-delimited line, with fields in human-readable string representation delimited by the pipe character (<tt>|</tt>). You will be expected to support the following features of SQL:<ul>
<li>Arbitrary expression targets</li>
<li>Attribute and relation aliasing (i.e., <tt>SELECT bar.foo AS baz FROM table AS bar</tt>)</li>
<li>Project, Filter, Table, and Equi-Joins</li>
<li>Order-By and LIMIT queries</li>
<li style="font-weight: bold">Aggregate (single-valued and GROUP BY) queries</li>
<li><tt>FROM</tt>-nested subqueries</li>
</ul></li>
<li>Your responses to <tt>SELECT</tt> queries will be checked against SQLite3.</li>
<li>Once again, your code <b>must</b> print <%= prompt %> on a new line after each <tt>CREATE TABLE</tt> or <tt>SELECT</tt> to indicate that it is done processing the statement.</li>
<li>You <i>may</i> use <tt>System.err</tt> to print debugging information to yourself. This output (or a subset of it) will be included in your debug log in Autolab and will be ignored by the grading script.</li>
</ul>
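<p>
The grader protocol above can be sketched as a simple read-eval-print loop. This is only a sketch, not a required structure; <tt>handleStatement</tt> is a hypothetical placeholder for your parser and runtime, and it assumes the prompt is the literal text <tt>$></tt> followed by a newline:
</p>

<pre>
// Minimal sketch of the grader protocol loop (not a required structure).
// `handleStatement` is a hypothetical placeholder for your parser/runtime.
import scala.io.StdIn

object Microbase {
  def main(args: Array[String]): Unit = {
    prompt()                     // signal readiness within 2 seconds
    var line = StdIn.readLine()
    while (line != null) {       // EOF on System.in ends the session
      handleStatement(line)      // CREATE TABLE or SELECT; results go to System.out
      prompt()                   // prompt on a new line after every statement
      line = StdIn.readLine()
    }
  }

  // The handout renders the prompt as "$>" followed by a newline.
  def prompt(): Unit = { print("$>\n"); System.out.flush() }

  def handleStatement(sql: String): Unit = ???  // placeholder: parse and evaluate one command
}
</pre>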
<h3>Grading Rubric</h3>

All tests will be run on dedicated hardware equipped with an Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz and a standard 5400 RPM HDD. Queries will have a per-query timeout as listed below. Grading will be based on the total runtime for each batch of queries.

<dl>
<dt>8 randomly-generated queries based on the <a href="http://www.tpc.org/tpch/">TPC-H</a> benchmark, with a scale factor of 0.1 (100MB) and templates listed below</dt>
<dd><b>Under 80 seconds total</b>: 8 of 8 points + leaderboard ranking</dd>
<dd><b>Under 150 seconds total</b>: 8 of 8 points</dd>
<dd><b>Under 5 minutes total</b>: 4 of 8 points</dd>
<dd><b>Under 1 minute per query</b>: 2 of 8 points</dd>
</dl>

Note in particular that these queries make extensive use of aggregates, equi-joins, order-by, and limit clauses, all of which will need to be supported.

<h3>UnresolvedFunction</h3>

<p>
When the Spark SQL parser encounters something that looks like a function, it doesn't try to interpret it directly. Instead, it produces an <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/analysis/UnresolvedFunction.html">UnresolvedFunction</a> expression node. You'll need to replace these.
</p>

<p>
Like most databases, Spark maintains a <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/analysis/FunctionRegistry$.html">"Function Registry"</a>: a catalog of all functions and their implementations. All of the "built-in" functions are provided in <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/analysis/FunctionRegistry$.html#builtin:org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry">FunctionRegistry.builtin</a>. Here's a little snippet you can use to replace functions. It doesn't support everything, but it will be sufficient for this project.
</p>

<pre>
case UnresolvedFunction(name, arguments, isDistinct, filter, ignoreNulls) =>
{
  val builder =
    FunctionRegistry.builtin
                    .lookupFunctionBuilder(name)
                    .getOrElse {
                      throw new RuntimeException(
                        s"Unable to resolve function `${name}`"
                      )
                    }
  builder(arguments) // returns the replacement expression node
}
</pre>

<p>
<a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/analysis/SimpleFunctionRegistry.html#lookupFunction(name:org.apache.spark.sql.catalyst.FunctionIdentifier):Option[org.apache.spark.sql.catalyst.expressions.ExpressionInfo]">FunctionRegistry.lookupFunctionBuilder</a> returns a 'builder' function. When called on the arguments of the UnresolvedFunction, the builder returns an expression that implements the function. For example, looking up "regexp_extract" in the registry returns a builder that, when called on two string-typed expressions and a literal integer, returns a <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/RegExpExtract.html">RegExpExtract</a> object.
</p>

<h3>Aggregates disguised as Projects</h3>

<p>
Because the Spark SQL parser doesn't try to resolve functions, it is incapable of distinguishing between normal functions:
<pre>
SELECT regexp_extract(A, "a(b+)a", 1) FROM R
</pre>
and aggregate functions:
<pre>
SELECT sum(A) FROM R
</pre>
Both of these will parse into a LogicalPlan topped with a Project node.
</p>

<p>
While not required, you might find it easier to work with the resulting plans if you replace such Project nodes with actual <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/plans/logical/Aggregate.html">Aggregate</a> plan nodes. Look for Project nodes with any expression in their <tt>projectList</tt> that is a subclass of <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/AggregateFunction.html">AggregateFunction</a>.
</p>

<h3>Aggregate</h3>

An <a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/plans/logical/Aggregate.html">Aggregate</a> is a logical plan node with three fields:
<dl>
<dt>groupingExpressions</dt>
<dd>The GROUP BY attributes. Normally these can be any expression, but for this checkpoint it will be sufficient to assume that all of these expressions are <tt>Attribute</tt>s.</dd>

<dt>aggregateExpressions</dt>
<dd>The SELECT expressions. Normally these can be arbitrary arithmetic over aggregates, but for this checkpoint it will be sufficient to assume that each of these expressions is either an <tt>Alias</tt> of an <tt>AggregateFunction</tt>, or an <tt>Attribute</tt> that also appears in the groupingExpressions field.</dd>

<dt>child</dt>
<dd>The input operator.</dd>
</dl>
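
<p>
Under the simplifying assumptions above, splitting an Aggregate's target list into grouping columns and aggregate calls might look like the following sketch (not a required structure):
</p>

<pre>
import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, NamedExpression}
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction

// For this checkpoint, every aggregateExpression is assumed to be either
// a grouping Attribute or an Alias wrapping an AggregateFunction.
def splitTargets(aggregateExpressions: Seq[NamedExpression])
    : (Seq[Attribute], Seq[(Alias, AggregateFunction)]) = {
  val groupCols = aggregateExpressions.collect { case a: Attribute => a }
  val aggCalls  = aggregateExpressions.collect {
    case al @ Alias(f: AggregateFunction, _) => (al, f)
  }
  (groupCols, aggCalls)
}
</pre>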

<h3>AggregateFunctions</h3>

<p>
<tt>AggregateFunction</tt>s are unevaluable, because they don't get evaluated on a single row. Instead, an <tt>AggregateFunction</tt> provides several methods that describe how to initialize an accumulator (what Spark calls an aggregation buffer), how to incorporate input rows into it, and how to extract a final result value from the buffer.
</p>

<p>
An <tt>AggregateFunction</tt> can be an instance of either:
<ul>
<li><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/DeclarativeAggregate.html">DeclarativeAggregate</a>: update operations are given as Spark <tt>Expression</tt>s</li>
<li><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/ImperativeAggregate.html">ImperativeAggregate</a>: update operations are given as Scala functions</li>
</ul>
For the purposes of this checkpoint, you will need to support SUM, COUNT, AVG, MIN, and MAX, all of which are implemented in Spark as <tt>DeclarativeAggregate</tt>s.
</p>

<h3>DeclarativeAggregates</h3>
<p>The following methods are relevant:</p>
<dl>
<dt><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/DeclarativeAggregate.html#aggBufferAttributes:Seq[org.apache.spark.sql.catalyst.expressions.AttributeReference]">aggBufferAttributes</a></dt>
<dd>The "schema" of the aggregation buffer. Note that these are attributes, and their ExprIds line up with the Attributes used in the expressions below.</dd>

<dt><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/DeclarativeAggregate.html#initialValues:Seq[org.apache.spark.sql.catalyst.expressions.Expression]">initialValues</a></dt>
<dd>A sequence of expressions, one for every attribute in aggBufferAttributes. These are the initial values for the buffer.</dd>

<dt><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/DeclarativeAggregate.html#updateExpressions:Seq[org.apache.spark.sql.catalyst.expressions.Expression]">updateExpressions</a></dt>
<dd>A sequence of expressions, one for every attribute in aggBufferAttributes. Evaluate these expressions on an InternalRow that includes both the aggBufferAttributes and the <tt>.output</tt> of the Aggregate's child LogicalPlan operator.</dd>

<dt><a href="https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/expressions/aggregate/DeclarativeAggregate.html#evaluateExpression:org.apache.spark.sql.catalyst.expressions.Expression">evaluateExpression</a></dt>
<dd>An expression that, when evaluated on an InternalRow storing the aggregation buffer, returns the result of the aggregate function.</dd>
</dl>
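
<p>
Putting these together, one possible evaluation loop for a single DeclarativeAggregate over one group is sketched below. This is only a sketch: <tt>bind</tt> is a hypothetical helper that rewrites attribute references into positional references against a given schema (however your runtime does that), not a Spark API:
</p>

<pre>
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, GenericInternalRow, JoinedRow}
import org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate

// Hypothetical helper: rewrite AttributeReferences in `e` into
// positional references against `schema`, so that eval() can run.
def bind(e: Expression, schema: Seq[Attribute]): Expression = ???

def aggregateOneGroup(agg: DeclarativeAggregate,
                      childOutput: Seq[Attribute],
                      rows: Iterator[InternalRow]): Any = {
  val bufferSchema = agg.aggBufferAttributes

  // 1. Initialize the buffer from initialValues (typically literals).
  var buffer: InternalRow = new GenericInternalRow(
    agg.initialValues.map(_.eval(InternalRow.empty)).toArray
  )

  // 2. Fold each input row in: updateExpressions are evaluated over a row
  //    containing the buffer attributes followed by the child's output.
  val updates = agg.updateExpressions.map(e => bind(e, bufferSchema ++ childOutput))
  for (row <- rows) {
    val joined = new JoinedRow(buffer, row)
    buffer = new GenericInternalRow(updates.map(_.eval(joined)).toArray)
  }

  // 3. Extract the final value from the buffer.
  bind(agg.evaluateExpression, bufferSchema).eval(buffer)
}
</pre>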

<h3>Example Queries</h3>

<p><a href="http://www.tpc.org/tpch/">TPC-H</a> is a standard database benchmark. The benchmark consists of a dataset generator and 22 standard query templates. This checkpoint uses eight queries based on TPC-H queries 1, 3, 5, 6, 10, 11, 12, and 14. The dataset generator and template values can be found at the <a href="http://www.tpc.org/tpch/">TPC-H website</a>; the generator is run at scale factor (SF) 0.1. Minor variations in the queries may be made. The queries have been rewritten slightly to make them easier to analyze.</p>

<h5>Query 1</h5>
<pre>
SELECT
  LINEITEM.RETURNFLAG,
  LINEITEM.LINESTATUS,
  SUM(LINEITEM.QUANTITY) AS SUM_QTY,
  SUM(LINEITEM.EXTENDEDPRICE) AS SUM_BASE_PRICE,
  SUM(LINEITEM.EXTENDEDPRICE*(CAST(1.0 as float)-LINEITEM.DISCOUNT)) AS SUM_DISC_PRICE,
  SUM(LINEITEM.EXTENDEDPRICE*(CAST(1.0 as float)-LINEITEM.DISCOUNT)*(CAST(1.0 as float)+LINEITEM.TAX)) AS SUM_CHARGE,
  AVG(LINEITEM.QUANTITY) AS AVG_QTY,
  AVG(LINEITEM.EXTENDEDPRICE) AS AVG_PRICE,
  AVG(LINEITEM.DISCOUNT) AS AVG_DISC,
  COUNT(*) AS COUNT_ORDER
FROM
  LINEITEM
WHERE
  LINEITEM.SHIPDATE <= DATE '1998-10-01'
GROUP BY
  LINEITEM.RETURNFLAG, LINEITEM.LINESTATUS
ORDER BY
  LINEITEM.RETURNFLAG, LINEITEM.LINESTATUS
</pre>

<h5>Query 3</h5>
<pre>
SELECT
  LINEITEM.ORDERKEY,
  SUM(LINEITEM.EXTENDEDPRICE*(CAST(1.0 as float)-LINEITEM.DISCOUNT)) AS REVENUE,
  ORDERS.ORDERDATE,
  ORDERS.SHIPPRIORITY
FROM
  CUSTOMER,
  ORDERS,
  LINEITEM
WHERE
  CUSTOMER.MKTSEGMENT = 'BUILDING' AND CUSTOMER.CUSTKEY = ORDERS.CUSTKEY
  AND LINEITEM.ORDERKEY = ORDERS.ORDERKEY
  AND ORDERS.ORDERDATE < DATE '1995-03-15'
  AND LINEITEM.SHIPDATE > DATE '1995-03-15'
GROUP BY LINEITEM.ORDERKEY, ORDERS.ORDERDATE, ORDERS.SHIPPRIORITY
ORDER BY REVENUE DESC, ORDERDATE
LIMIT 10
</pre>

<h5>Query 5</h5>
<pre>
SELECT
  NATION.NAME,
  SUM(LINEITEM.EXTENDEDPRICE * (CAST(1.0 as float) - LINEITEM.DISCOUNT)) AS REVENUE
FROM
  REGION, NATION, CUSTOMER, ORDERS, LINEITEM, SUPPLIER
WHERE
  CUSTOMER.CUSTKEY = ORDERS.CUSTKEY
  AND LINEITEM.ORDERKEY = ORDERS.ORDERKEY
  AND LINEITEM.SUPPKEY = SUPPLIER.SUPPKEY
  AND CUSTOMER.NATIONKEY = NATION.NATIONKEY
  AND SUPPLIER.NATIONKEY = NATION.NATIONKEY
  AND NATION.REGIONKEY = REGION.REGIONKEY
  AND REGION.NAME = 'ASIA'
  AND ORDERS.ORDERDATE >= DATE '1994-01-01'
  AND ORDERS.ORDERDATE < DATE '1995-01-01'
GROUP BY NATION.NAME
ORDER BY REVENUE DESC
</pre>

<h5>Query 6</h5>
<pre>
SELECT
  SUM(LINEITEM.EXTENDEDPRICE*LINEITEM.DISCOUNT) AS REVENUE
FROM LINEITEM
WHERE LINEITEM.SHIPDATE >= DATE '1994-01-01'
  AND LINEITEM.SHIPDATE < DATE '1995-01-01'
  AND LINEITEM.DISCOUNT > CAST(0.05 AS float) AND LINEITEM.DISCOUNT < CAST(0.07 as float)
  AND LINEITEM.QUANTITY < CAST(24 AS float)
</pre>

<h5>Query 10</h5>
<pre>
SELECT
  CUSTOMER.CUSTKEY,
  SUM(LINEITEM.EXTENDEDPRICE * (CAST(1.0 as float) - LINEITEM.DISCOUNT)) AS REVENUE,
  CUSTOMER.ACCTBAL,
  NATION.NAME,
  CUSTOMER.ADDRESS,
  CUSTOMER.PHONE,
  CUSTOMER.COMMENT
FROM
  CUSTOMER, ORDERS, LINEITEM, NATION
WHERE
  CUSTOMER.CUSTKEY = ORDERS.CUSTKEY
  AND LINEITEM.ORDERKEY = ORDERS.ORDERKEY
  AND ORDERS.ORDERDATE >= DATE '1993-10-01'
  AND ORDERS.ORDERDATE < DATE '1994-01-01'
  AND LINEITEM.RETURNFLAG = 'R'
  AND CUSTOMER.NATIONKEY = NATION.NATIONKEY
GROUP BY
  CUSTOMER.CUSTKEY, CUSTOMER.ACCTBAL, CUSTOMER.PHONE, NATION.NAME, CUSTOMER.ADDRESS, CUSTOMER.COMMENT
ORDER BY REVENUE ASC
LIMIT 20
</pre>

<h5>Query 11</h5>
<pre>
SELECT PK_V.PARTKEY,
       PK_V.VALUE
FROM (
    SELECT PS.PARTKEY,
           SUM(PS.SUPPLYCOST * CAST(PS.AVAILQTY AS float)) AS VALUE
    FROM PARTSUPP PS,
         SUPPLIER S,
         NATION N
    WHERE PS.SUPPKEY = S.SUPPKEY
      AND S.NATIONKEY = N.NATIONKEY
      AND N.NAME = 'GERMANY'
    GROUP BY PS.PARTKEY
  ) PK_V, (
    SELECT SUM(PS.SUPPLYCOST * CAST(PS.AVAILQTY AS float)) AS VALUE
    FROM PARTSUPP PS,
         SUPPLIER S,
         NATION N
    WHERE PS.SUPPKEY = S.SUPPKEY
      AND S.NATIONKEY = N.NATIONKEY
      AND N.NAME = 'GERMANY'
  ) CUTOFF_V
WHERE PK_V.VALUE > (CUTOFF_V.VALUE * CAST(0.0001 AS double) / CAST(100.0 AS double))
ORDER BY PK_V.VALUE DESC
</pre>

<h5>Query 12</h5>
<pre>
SELECT LINEITEM.SHIPMODE,
       SUM(CASE WHEN ORDERS.ORDERPRIORITY = '1-URGENT'
                  OR ORDERS.ORDERPRIORITY = '2-HIGH'
                THEN 1
                ELSE 0 END) AS HIGH_LINE_COUNT,
       SUM(CASE WHEN ORDERS.ORDERPRIORITY <> '1-URGENT'
                 AND ORDERS.ORDERPRIORITY <> '2-HIGH'
                THEN 1
                ELSE 0 END) AS LOW_LINE_COUNT
FROM LINEITEM, ORDERS
WHERE ORDERS.ORDERKEY = LINEITEM.ORDERKEY
  AND (LINEITEM.SHIPMODE='MAIL' OR LINEITEM.SHIPMODE='SHIP')
  AND LINEITEM.COMMITDATE < LINEITEM.RECEIPTDATE
  AND LINEITEM.SHIPDATE < LINEITEM.COMMITDATE
  AND LINEITEM.RECEIPTDATE >= DATE '1994-01-01'
  AND LINEITEM.RECEIPTDATE < DATE '1995-01-01'
GROUP BY LINEITEM.SHIPMODE
ORDER BY LINEITEM.SHIPMODE
</pre>

<h5>Query 14</h5>
<pre>
SELECT
  CAST(100.00 AS double)
    * PROMO_ONLY
    / ALL_REVENUE
    AS PROMO_REVENUE
FROM (
    SELECT
      SUM(
        CASE WHEN PART.TYPE LIKE 'PROMO%'
             THEN LINEITEM.EXTENDEDPRICE * (CAST(1.0 as float) - LINEITEM.DISCOUNT)
             ELSE cast(0 as float)
        END
      ) AS PROMO_ONLY,
      SUM(
        LINEITEM.EXTENDEDPRICE * (CAST(1.0 as float) - LINEITEM.DISCOUNT)
      ) AS ALL_REVENUE
    FROM
      LINEITEM,
      PART
    WHERE
      LINEITEM.PARTKEY = PART.PARTKEY
      AND LINEITEM.SHIPDATE >= DATE '1995-09-01'
      AND LINEITEM.SHIPDATE < DATE '1995-10-01'
  ) AGGREGATE
</pre>

</div>

@@ -117,8 +117,9 @@ schedule:
   - date: "Apr. 20"
     topic: "Indexing Review + Checkpoint 4"
   - date: "Apr. 22"
-    due: "Checkpoint 3"
     topic: "Logging + Recovery"
+  - date: "Apr. 26"
+    due: "Checkpoint 3"
   - date: "Apr. 27"
     topic: "Distributed Commit"
   - date: "Apr. 29"

@@ -187,8 +188,8 @@ In this course, you will learn...
   <ul>
     <li>5% <a href="checkpoint0.html">Checkpoint 0</a> due on Feb. 16</li>
     <li>10% <a href="checkpoint1.html">Checkpoint 1</a> due on Mar. 15</li>
-    <li>12% <a href="checkpoint2.html">Checkpoint 2</a> due on Apr. 6</li>
+    <li>12% <a href="checkpoint2.html">Checkpoint 2</a> due on Apr. 9</li>
-    <li>8% <a>Checkpoint 3</a> due on Apr. 20</li>
+    <li>8% <a href="checkpoint3.html">Checkpoint 3</a> due on Apr. 26</li>
     <li>15% <a>Checkpoint 4</a> due on May 14</li>
   </ul>
 </li>