From 2d0fcc7ded41c780918b8459723966f78f159433 Mon Sep 17 00:00:00 2001
From: Oliver
Date: Mon, 1 Feb 2021 12:55:00 -0500
Subject: [PATCH] Updated slides

---
 src/teaching/cse-562/2021sp/index.erb | 2 +-
 .../cse-562/2021sp/slide/2021-02-02-Intro.erb | 331 +++++++++++++++++-
 .../graphics/2021-02-02-parts_of_sql.svg | 277 +++++++++++++++
 .../cse-562/2021sp/slide/graphics/slido.png | Bin 0 -> 402 bytes
 src/teaching/cse-562/2021sp/slide/ubodin.css | 12 +-
 5 files changed, 600 insertions(+), 22 deletions(-)
 create mode 100644 src/teaching/cse-562/2021sp/slide/graphics/2021-02-02-parts_of_sql.svg
 create mode 100644 src/teaching/cse-562/2021sp/slide/graphics/slido.png

diff --git a/src/teaching/cse-562/2021sp/index.erb b/src/teaching/cse-562/2021sp/index.erb
index 44de50ae..399edc0a 100644
--- a/src/teaching/cse-562/2021sp/index.erb
+++ b/src/teaching/cse-562/2021sp/index.erb
@@ -87,7 +87,7 @@ In this course, you will learn...
  • TAs:
  • -
  • Course Discussions: Canvas
  • +
  • Course Discussions: Piazza
  • No Required Textbook
  • Optional References:
    Practicum (50% of Grade)
      -
    • Rebuild the guts of Apache SparkSQL
    • +
    • Rebuild the guts of Spark's Catalyst Engine
    • Solo Project
    • 4 Checkpoints (+ 5 free points for Checkpoint 0)
    @@ -294,19 +328,29 @@ textbook: "Ch. 1, 2.1-2.2"
    - + +
    -

    I've torn the guts out of Apache Spark.
    - Your mission: Replace them (sort of).

    -
      +
      +

      I've torn the guts out of Catalyst. What remains: +

        +
      • SQL Parser
      • +
      • Logical Plans: Relational Algebra Trees
      • +
      • Expression: Primitive-valued expression trees + evaluation logic
      • +
      +

      + +

      Your mission: Replace the missing bits (sort of).

      +
      • Analysis: Tidying up SQL's corner cases
      • Execution: Answering queries fast (single-node)
      • -
      • Optimization: Eliminating redundancy
      • +
      • Optimization: Eliminating redundancy
        and picking the best algorithms.

      We give you...

      +

      Gutted Catalyst

      Data (CSV Files)

      Schema Information (CREATE TABLE)

      Questions (SQL Queries)

      @@ -346,7 +390,7 @@ textbook: "Ch. 1, 2.1-2.2"

      Checkpoint 1: "Get it Working"

      10/50 pts

        -
      • Interpret Relational Algebra (Spark Operators)
      • +
      • Interpret Relational Algebra (Spark's LogicalPlan)
      • Load CSV Files
      • Run Basic Select, Project, Join Queries
      • Nested Queries
      • @@ -362,14 +406,14 @@ textbook: "Ch. 1, 2.1-2.2"
      -

      Checkpoint 3: "Precomputation"

      +

      Checkpoint 3: "Aggregates"

      8/50 pts

      • Aggregation
      -

      Checkpoint 4: "The Real World"

      +

      Checkpoint 4: "Precomputation"

      15/50 pts

      • You get a few minutes to pre-compute
      • @@ -378,6 +422,10 @@ textbook: "Ch. 1, 2.1-2.2"
      • Build indexes
      + +
      + +
    @@ -389,7 +437,7 @@ textbook: "Ch. 1, 2.1-2.2" @@ -474,6 +522,10 @@ textbook: "Ch. 1, 2.1-2.2" +
    + +
    +
    @@ -505,17 +557,21 @@ textbook: "Ch. 1, 2.1-2.2"

    Your data is currently an Unordered Set
    - of Tuples with 100 fields each. + of Tuples with 100 attributes each.

    Tomorrow, you’ll be repeatedly asked for 1 specific attribute
    - of 5 specific rows identified by the first attribute + from 5 specific tuples identified by the first attribute

    Can you do better?

    +
    + <%= sli_do_link %> +
    +

    Better Idea: Rewrite data into a 99-Tuple of Maps keyed on the 1st attribute

    This representation is equivalent and better for your needs.
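
A minimal sketch of the rewrite described above, in Python rather than the course's Scala. The toy 4-attribute rows and the names `rows`, `to_column_maps`, and `lookup` are hypothetical, and the sketch assumes the first attribute is unique (otherwise the two representations are not equivalent).

```python
# Hypothetical toy data: an unordered set of tuples; attribute 0 is the key.
rows = {
    (1, "a", 10, True),
    (2, "b", 20, False),
    (3, "c", 30, True),
}

def to_column_maps(rows):
    """Rewrite a set of n-tuples into an (n-1)-tuple of maps keyed on attribute 0."""
    width = len(next(iter(rows)))
    maps = [dict() for _ in range(width - 1)]
    for row in rows:
        key = row[0]
        for i, value in enumerate(row[1:]):
            maps[i][key] = value
    return tuple(maps)

def lookup(column_maps, attribute, keys):
    """Fetch one non-key attribute (1-based, counted after the key) for given keys."""
    column = column_maps[attribute - 1]
    return [column[k] for k in keys]

column_maps = to_column_maps(rows)
print(lookup(column_maps, 2, [1, 3]))  # prints [10, 30]
```

Each request now touches one map and a handful of hash lookups instead of scanning every tuple.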

    @@ -523,3 +579,248 @@ textbook: "Ch. 1, 2.1-2.2"

    Declarative specifications make it easier to find equivalences.

    + + +
    +
    +

    Declarative Languages

    + +
      +
    • Don't need to think about algorithms.
    • +
    • Independent of the data representation.
    • +
    +
    + +
    +

    SQL

    +
      +
    • Developed by IBM (for System R) in the 1970s.
    • +
    • Standard used by many vendors.
        +
      • SQL-86 (original standard)
      • +
      • SQL-89 (minor revisions; integrity constraints)
      • +
      • SQL-92 (major revision; basis for modern SQL)
      • +
      • SQL-99 (XML, window queries, generated default values)
      • +
      • SQL 2003 (major revisions to XML support)
      • +
      • SQL 2008 (minor extensions)
      • +
      • SQL 2011 (minor extensions; temporal databases)
      • +
      +
    +
    + +
    +

    A Basic SQL Query

    + +
    + +
    +
    
    +            SELECT  [DISTINCT] targetlist
    +            FROM    relationlist
    +            WHERE   condition
    +    
    +
      +
    1. Compute the cross product of the relations appearing in relationlist (every combination of one tuple from each relation)
    2. +
    3. Discard tuples that fail the condition
    4. +
    5. Delete attributes not in targetlist
    6. +
    7. If DISTINCT is specified, eliminate duplicate rows
    8. +
    +

    + This is probably the least efficient strategy to compute a query! + A good optimizer will find more efficient strategies to compute the same answer. +
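
The conceptual strategy above can be sketched in a few lines of Python. This is only an illustration of the semantics, not how Catalyst (or any real engine) evaluates queries; the function `evaluate` and the toy `trees` / `species` rows are hypothetical.

```python
from itertools import product

def evaluate(relations, condition, targets, distinct=False):
    """Naive SELECT-FROM-WHERE: cross product, filter, project, dedupe.

    relations: dict of relation name -> list of dict rows
    condition: function from {relation name: row} to bool
    targets:   list of (relation name, attribute) pairs
    """
    names = list(relations)
    result = []
    # 1. Enumerate every combination of one tuple from each relation.
    for combo in product(*(relations[n] for n in names)):
        env = dict(zip(names, combo))
        # 2. Discard combinations that fail the condition.
        if not condition(env):
            continue
        # 3. Keep only the attributes in the target list.
        result.append(tuple(env[r][a] for r, a in targets))
    # 4. Optionally eliminate duplicate rows (keeps first-seen order).
    if distinct:
        result = list(dict.fromkeys(result))
    return result

# Hypothetical toy data.
trees = [
    {"tree_id": 1, "spc_common": "red maple"},
    {"tree_id": 2, "spc_common": "ginkgo"},
    {"tree_id": 3, "spc_common": "ginkgo"},
]
species = [
    {"name": "ginkgo", "has_unpleasant_smell": "Yes"},
    {"name": "red maple", "has_unpleasant_smell": "No"},
]
smelly = evaluate(
    {"Trees": trees, "SpeciesInfo": species},
    lambda e: e["Trees"]["spc_common"] == e["SpeciesInfo"]["name"]
    and e["SpeciesInfo"]["has_unpleasant_smell"] == "Yes",
    [("Trees", "spc_common")],
    distinct=True,
)
print(smelly)  # prints [('ginkgo',)]
```

The loop visits 3 × 2 = 6 combinations to produce one answer row, which is why the cost grows with the product of the relation sizes.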

    +
    + +
    +

    Example Data

    + +
    + +
    +
    SELECT * FROM Trees;
    + +

    Wildcards (*, tablename.*) are special targets that select all attributes.

    + +
    + + + + + + + + +
    CREATED_AT TREE_ID BLOCK_ID THE_GEOM TREE_DBH STUMP_DIAM CURB_LOC STATUS HEALTH SPC_LATIN SPC_COMMON STEWARD GUARDS SIDEWALK USER_TYPE PROBLEMS ROOT_STONE ROOT_GRATE ROOT_OTHER TRNK_WIRE TRNK_LIGHT TRNK_OTHER BRNCH_LIGH BRNCH_SHOE BRNCH_OTHE ADDRESS ZIPCODE ZIP_CITY CB_NUM BOROCODE BORONAME CNCLDIST ST_ASSEM ST_SENATE NTA NTA_NAME BORO_CT STATE LATITUDE LONGITUDE X_SP Y_SP
    '08/27/2015' 180683 348711 'POINT (-73.84421521958048 40.723091773924274)' 3 0 'OnCurb' 'Alive' 'Fair' 'Acer rubrum' 'red maple' 'None' 'None' 'NoDamage' 'TreesCount Staff' 'None' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' '108-005 70 AVENUE' '11375' 'Forest Hills' 406 4 'Queens' 29 28 16 'QN17' 'Forest Hills' 4073900 'New York' 40.72309177 -73.84421522 1027431.14821 202756.768749
    '09/03/2015' 200540 315986 'POINT (-73.81867945834878 40.79411066708779)' 21 0 'OnCurb' 'Alive' 'Fair' 'Quercus palustris' 'pin oak' 'None' 'None' 'Damage' 'TreesCount Staff' 'Stones' 'Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' '147-074 7 AVENUE' '11357' 'Whitestone' 407 4 'Queens' 19 27 11 'QN49' 'Whitestone' 4097300 'New York' 40.79411067 -73.81867946 1034455.70109 228644.837379
    '09/05/2015' 204026 218365 'POINT (-73.93660770459083 40.717580740099116)' 3 0 'OnCurb' 'Alive' 'Good' 'Gleditsia triacanthos var. inermis' 'honeylocust' '1or2' 'None' 'Damage' 'Volunteer' 'None' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' '390 MORGAN AVENUE' '11211' 'Brooklyn' 301 3 'Brooklyn' 34 50 18 'BK90' 'East Williamsburg' 3044900 'New York' 40.71758074 -73.9366077 1001822.83131 200716.891267
    '09/05/2015' 204337 217969 'POINT (-73.93445615919741 40.713537494833226)' 10 0 'OnCurb' 'Alive' 'Good' 'Gleditsia triacanthos var. inermis' 'honeylocust' 'None' 'None' 'Damage' 'Volunteer' 'Stones' 'Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' '1027 GRAND STREET' '11211' 'Brooklyn' 301 3 'Brooklyn' 34 53 18 'BK90' 'East Williamsburg' 3044900 'New York' 40.71353749 -73.93445616 1002420.35833 199244.253136
    '08/30/2015' 189565 223043 'POINT (-73.97597938483258 40.66677775537875)' 21 0 'OnCurb' 'Alive' 'Good' 'Tilia americana' 'American linden' 'None' 'None' 'Damage' 'Volunteer' 'Stones' 'Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' '603 6 STREET' '11215' 'Brooklyn' 306 3 'Brooklyn' 39 44 21 'BK37' 'Park Slope-Gowanus' 3016500 'New York' 40.66677776 -73.97597938 990913.775046 182202.425999
    ... and 683783 more
    +
    +
    + +
    +
    
    +            SELECT tree_id, spc_common, boroname
    +            FROM Trees
    +            WHERE boroname = 'Brooklyn'
    +    
    + +

    In English, what does this query compute?

    + + <%= sli_do_link_small %> +
    + +
    +

    What are the ID, Common Name, and Borough of Trees in Brooklyn?

    + + + + + + + + + +
    TREE_ID SPC_COMMON BORONAME
    204026 'honeylocust' 'Brooklyn'
    204337 'honeylocust' 'Brooklyn'
    189565 'American linden' 'Brooklyn'
    192755 'London planetree' 'Brooklyn'
    189465 'London planetree' 'Brooklyn'
    ... and 177287 more
    +
    + +
    +
    
    +      SELECT latitude, longitude 
    +      FROM Trees, SpeciesInfo
    +      WHERE Trees.spc_common = SpeciesInfo.name
    +        AND SpeciesInfo.has_unpleasant_smell = 'Yes';
    +    
    + +

    In English, what does this query compute?

    + + <%= sli_do_link_small %> + +
    + +
    +

    What are the coordinates of Trees with bad smells?

    + + + + + + + + + +
    LATITUDE LONGITUDE
    40.59378755 -73.9915968
    40.69149917 -73.97258754
    40.74829709 -73.98065645
    40.68767857 -73.96764605
    40.739991 -73.86526993
    ... and more
    +
    + +
    +
    
    +      SELECT Trees.latitude, Trees.longitude 
    +      FROM Trees, SpeciesInfo
    +      WHERE Trees.spc_common = SpeciesInfo.name
    +        AND SpeciesInfo.has_unpleasant_smell = 'Yes';
    +    
    + +

    ... is the same as ...

    + +
    
    +      SELECT T.latitude, T.longitude 
    +      FROM Trees T, SpeciesInfo S
    +      WHERE T.spc_common = S.name
    +        AND S.has_unpleasant_smell = 'Yes';
    +    
    + +

    ... is (usually) the same as ...

    + +
    
    +      SELECT latitude, longitude 
    +      FROM Trees, SpeciesInfo
    +      WHERE spc_common = name
    +        AND has_unpleasant_smell = 'Yes';
    +    
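
A small sketch of why the unqualified form is only "usually" equivalent. It uses Python's built-in `sqlite3` as a stand-in engine (an assumption; the course engine and other databases may word the error differently), and it adds a hypothetical `name` column to `Trees` so that the unqualified `name` becomes ambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: Trees gets its own `name` column.
    CREATE TABLE Trees (tree_id INTEGER, spc_common TEXT, name TEXT);
    CREATE TABLE SpeciesInfo (name TEXT, has_unpleasant_smell TEXT);
    INSERT INTO Trees VALUES (1, 'ginkgo', 'Old Stinky');
    INSERT INTO SpeciesInfo VALUES ('ginkgo', 'Yes');
""")

# Qualified attribute names always resolve to exactly one relation.
qualified = conn.execute(
    "SELECT T.tree_id FROM Trees T, SpeciesInfo S "
    "WHERE T.spc_common = S.name AND S.has_unpleasant_smell = 'Yes'"
).fetchall()

# Unqualified `name` now appears in both relations, so it cannot be resolved.
try:
    conn.execute(
        "SELECT tree_id FROM Trees, SpeciesInfo "
        "WHERE spc_common = name AND has_unpleasant_smell = 'Yes'"
    ).fetchall()
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)

print(qualified)
print(error)
```

Qualifying every attribute costs a few keystrokes but keeps the query valid if a schema later gains a column with a clashing name.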
    + +
    + +
    +

    Expressions

    + +
    
    +            SELECT tree_id, 
    +                   stump_diam / 2 AS stump_radius,
    +                   3.14 * stump_diam * stump_diam / 4 AS stump_area
    +            FROM Trees;
    +    
    + +

    + Arithmetic expressions can appear in targets or conditions. + Use ‘AS’ to assign names to these attributes; some dialects (e.g., SQL Server) also accept ‘name = expression’. + (The behavior of unnamed attributes is unspecified.) +

    +
    + +
    +

    Expressions

    + +
    
    +  SELECT tree_id, spc_common FROM Trees WHERE spc_common LIKE '%maple'
    +    
    + + + + + + + +
    TREE_ID SPC_COMMON
    180683 'red maple'
    204325 'sycamore maple'
    205044 'Amur maple'
    184031 'red maple'
    208974 'red maple'
    +

    SQL uses single quotes for ‘string literals’

    +

    LIKE is used for String Matches

    +

    ‘%’ matches 0 or more characters
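
A quick way to try the pattern above, sketched with Python's built-in `sqlite3` (an assumed stand-in engine; note that SQLite's `LIKE` is case-insensitive for ASCII by default, which other engines may not be). The first three rows reuse values from these slides; the two rows with IDs 900001 and 900002 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Trees (tree_id INTEGER, spc_common TEXT)")
conn.executemany(
    "INSERT INTO Trees VALUES (?, ?)",
    [
        (180683, "red maple"),
        (204325, "sycamore maple"),
        (204026, "honeylocust"),
        (900001, "maple"),               # hypothetical: '%' matches zero characters
        (900002, "mapleleaf viburnum"),  # hypothetical: does not END in 'maple'
    ],
)

matches = conn.execute(
    "SELECT tree_id, spc_common FROM Trees "
    "WHERE spc_common LIKE '%maple' ORDER BY tree_id"
).fetchall()
print(matches)
```

Only values that end in `maple` match, including the bare string `maple` itself.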

    +
    + +
    +

    Union

    +
    
    +    SELECT tree_id FROM Trees WHERE spc_common = 'red maple'
    +    UNION [ALL]
    +    SELECT tree_id FROM Trees WHERE spc_common = 'sycamore maple'
    +    
    +

    Computes the set-union of any two union-compatible sets of tuples

    +

    Adding ALL preserves duplicates across the inputs (bag-union).
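
To see the difference between set-union and bag-union, here is a sketch using Python's built-in `sqlite3` as an assumed stand-in engine. The three rows are hypothetical, and the query selects `spc_common` (rather than `tree_id`) so that the two inputs actually share a value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Trees (tree_id INTEGER, spc_common TEXT, boroname TEXT)")
conn.executemany(
    "INSERT INTO Trees VALUES (?, ?, ?)",
    [
        (1, "red maple", "Brooklyn"),
        (2, "honeylocust", "Brooklyn"),
        (3, "red maple", "Queens"),
    ],
)

query = (
    "SELECT spc_common FROM Trees WHERE boroname = 'Brooklyn' "
    "{op} "
    "SELECT spc_common FROM Trees WHERE boroname = 'Queens'"
)
# UNION removes duplicates; UNION ALL keeps one output row per input row.
set_union = sorted(conn.execute(query.format(op="UNION")).fetchall())
bag_union = sorted(conn.execute(query.format(op="UNION ALL")).fetchall())
print(set_union)  # 2 rows
print(bag_union)  # 3 rows: 'red maple' appears twice
```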

    +
    + +
    +

    Aggregate Queries

    +
    
    +    SELECT [DISTINCT] targetlist
    +    FROM relationlist
    +    WHERE condition
    +    GROUP BY groupinglist
    +    HAVING groupcondition
    +    
    +
    +

    The targetlist now contains (a) Grouped attributes, and (b) Aggregate expressions.

    +

    The attributes in targets of type (a) must be a subset of groupinglist.

    +

    (intuitively each answer tuple corresponds to a single group, and each group must have a single value for each attribute)

    +

    groupcondition is applied after aggregation and may contain aggregate expressions.
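
A sketch of how `WHERE`-free grouping and `HAVING` interact, using Python's built-in `sqlite3` as an assumed stand-in engine. The six rows are hypothetical; the point is that the `HAVING` predicate sees one row per group, after `count(*)` has been computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Trees (tree_id INTEGER, spc_common TEXT)")
conn.executemany(
    "INSERT INTO Trees VALUES (?, ?)",
    [
        (1, "red maple"),
        (2, "red maple"),
        (3, "honeylocust"),
        (4, "pin oak"),
        (5, "pin oak"),
        (6, "pin oak"),
    ],
)

# One output row per species; groups with fewer than 2 trees are dropped
# by HAVING, which runs after aggregation.
common_species = conn.execute(
    "SELECT spc_common, count(*) FROM Trees "
    "GROUP BY spc_common "
    "HAVING count(*) >= 2 "
    "ORDER BY spc_common"
).fetchall()
print(common_species)  # prints [('pin oak', 3), ('red maple', 2)]
```

The same filter could not be written in `WHERE`, because `WHERE` is applied to individual tuples before any group exists.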

    +
    +
    + +
    +

    Aggregate Queries

    +
    
    +    SELECT spc_common, count(*) FROM Trees GROUP BY spc_common
    +    
    + + + + + + + + +
    SPC_COMMON COUNT
    ''Schubert' chokecherry' 4888
    'American beech' 273
    'American elm' 7975
    'American hophornbeam' 1081
    'American hornbeam' 1517
    ... and more
    + +
    + +
    + + +
    +

    Next time...

    + +

    Scala for Java programmers.

    +
diff --git a/src/teaching/cse-562/2021sp/slide/graphics/2021-02-02-parts_of_sql.svg b/src/teaching/cse-562/2021sp/slide/graphics/2021-02-02-parts_of_sql.svg
new file mode 100644
index 00000000..691a0702
--- /dev/null
+++ b/src/teaching/cse-562/2021sp/slide/graphics/2021-02-02-parts_of_sql.svg
@@ -0,0 +1,277 @@
+ image/svg+xml
+ SELECT [DISTINCT] target-list
+ FROM relation-list
+ WHERE condition
+ A list of relation names (possibly with a range-variable after each name)
+ A list of attributes of relations in relation-list
+ Comparisons (‘=’, ‘<>’, ‘<’, ‘>’, ‘<=’, ‘>=’) and other boolean predicates, combined using AND, OR, and NOT (a boolean formula)
+ (optional) keyword indicating that the answer should not contain duplicates
diff --git a/src/teaching/cse-562/2021sp/slide/graphics/slido.png b/src/teaching/cse-562/2021sp/slide/graphics/slido.png
new file mode 100644
index 0000000000000000000000000000000000000000..74e1b953cc805bf5d7f1de86ca7edeb5ecfe7d74
GIT binary patch
literal 402
zcmV;D0d4+?P)*
z)VpoOFboD@9$cuqEFgeusN+iZ0`g@6XC-x7Lx2{LIv0W%`Yte###D+cBDnBZ7~tcV
zfqxzNA1nZCY#tf_s;t~RWe#@b0S3G5{1w}5hUjUk;V<0B$ZZW4pq)itQ2Je>mFH~#
zf^zE48{s+F3l;-6#$wVMb#{lFn?6|@L6scSO