Just like Checkpoint 3, but now...
+ +
+ CREATE TABLE CUSTOMER(
+ CUSTKEY INT,
+ NAME STRING,
+ ADDRESS STRING,
+ NATIONKEY INT,
+ PHONE STRING,
+ ACCTBAL FLOAT,
+ MKTSEGMENT STRING,
+ COMMENT STRING
+ ) USING csv OPTIONS(
+ path 'data/CUSTOMER.data',
+ delimiter = '|',
+ primary_key = 'custkey'
+ )
+
+
+
+ primary_key = 'custkey',
+
+ or
+
+ primary_key = 'orderkey,lineitem',
+
+
+ comma-separated list describing the primary key of the table
+Time to scan SF 0.1 LINEITEM
+Source | Time |
---|---|
CSV on NVME SSD | +0.88745s | +
+ | + |
+
Time to scan SF 0.1 LINEITEM
+Source | Time |
---|---|
CSV on NVME SSD | +0.88745s | +
IndexedSeq[InternalRow] | +0.018s | +
~30x speedup
+Takeaway: Read data in at CREATE TABLE
+
+ Table(...)
+
+ Return an iterator over the preloaded table.
+Time to filter SF 0.1 LINEITEM for one orderkey
+Source | Time |
---|---|
CSV on NVME SSD | +0.9196s | +
IndexedSeq[InternalRow] | +0.0624s | +
+ | + |
+
Time to filter SF 0.1 LINEITEM for one orderkey
+Source | Time |
---|---|
CSV on NVME SSD | +0.9196s | +
IndexedSeq[InternalRow] | +0.0624s | +
Sorted IndexedSeq[InternalRow] + Bin Search | +0.0008s | +
~80x speedup
+Takeaway: Sort on primary key and binary search.
+
+ Filter(expression, Table(...))
+
+ If expression is a ...
+If expression is a ...
+But TPC-H doesn't have filters on keys...
+
+ USING csv OPTIONS(
+ path '../TPCH/LINEITEM.csv',
+ delimiter = '|',
+ primary_key = 'orderkey,linenumber',
+ tree_index = 'shipdate',
+ hash_index = 'linestatus|shipmode'
+ )
+
+ tree_index and hash_index are |-separated lists of ,-separated indexes.
+