Just-in-Time Data Structures

Oliver Kennedy

okennedy@buffalo.edu

Darshana Balakrishnan (PhD In Progress)
Hank Lin (BS 2017)
Ankur Upadhyay (MS 2014)
Lukasz Ziarek (Prof @ UB)

With support from NSF Awards IIS-1617586 and CNS-1629791

What is best in life?

(for organizing your data)

© Universal Pictures

API

    Insert $\lt key, value\gt$

    Query for $key \in [low, high)$
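The two operations above can be sketched as a small Java interface. Every name here (`RangeStore`, `ListStore`, `Rec`) is hypothetical and chosen for illustration; the slides only specify an insert and a half-open range query. The sketch assumes Java 16 or later and uses an unsorted list as the simplest possible backing structure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; only the two operations come from the slides.
final class ApiSketch {
    record Rec(long key, String value) {}

    interface RangeStore {
        // Insert <key, value>
        void insert(long key, String value);

        // Query for key in [low, high): low is inclusive, high is exclusive
        List<Rec> query(long low, long high);
    }

    // Simplest realization: an unsorted list. O(1) insert, O(N) query.
    static final class ListStore implements RangeStore {
        private final List<Rec> data = new ArrayList<>();

        @Override public void insert(long key, String value) {
            data.add(new Rec(key, value));
        }

        @Override public List<Rec> query(long low, long high) {
            List<Rec> out = new ArrayList<>();
            for (Rec r : data) {
                if (low <= r.key() && r.key() < high) out.add(r);
            }
            return out;
        }
    }
}
```

Any other structure (binary tree, sorted array) could sit behind the same interface, which is what makes the choice between them a pure performance question.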


Available Structures

Binary Tree, Linked List, Sorted Array

You guessed wrong!

(unless you didn't)

Other Tradeoffs

  • Support for Threads
  • Lookup vs Full Scan vs Range Scan
  • Optimal Update Size

Interactive Analytics

  1. User Opens CSV File
  2. User Poses Query as File Loads
  3. Lots More Queries
  4. User Adds More Data

Even in a single session, there may be more than one "optimal" data structure.

State of the Art

Victorinox
  • Jack of All Trades Datastructures
    (e.g., B+ Tree, LSM Tree)
  • Keep re-building structures for different workloads
    (e.g., DROP INDEX; LOAD TABLE; CREATE INDEX)
  • Bespoke data structures
    (e.g., KD+R*++#N-Tree; Author et al. SIGMOD 2023)

No way to gracefully transition between different trade-offs.

  1. What does it mean for a data structure to be halfway between a Binary Tree and a Linked List?
  2. How would we access and manipulate such a data structure?
  3. How do we automatically generate such data-structures?

Incremental Structure Transitions

  1. A Composable Organizational Grammar
  2. Realizing Universal Data Structures
  3. Just-In-Time Data Structure Optimization
  4. Performance Testing

Logical Content

Physical Structure

A Bag of $\lt Key \rightarrow Value \gt$ Pairs

One Physical Realization of the Bag

Core Idea: A grammar of physical realizations.

Primitives

  • A Key ($\mathbb K$): Any ordered set
  • A Record ($\mathbb R$): A key/value pair
  • A Pointer ($\mathbb P$): Logically a bag of records

Grammar

\begin{align} \mathbb P :=\; &|\;Sng(\mathbb R) \\ &|\uplus(\mathbb P, \mathbb P) \\ &|\;BT_{\mathbb K}(\mathbb P, \mathbb P) \\ &|\;Array_N(\mathbb R \ldots \mathbb R) \\ &|\;Sorted_N(\mathbb R \ldots \mathbb R) \end{align}
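As a rough illustration, the grammar can be encoded as one Java record per production. This is a sketch under assumptions rather than the authors' implementation: the type names (`Cog`, `Sng`, `Union`, `BT`, `Arr`, `Sorted`), the `long` key type, and the `contents` helper are all hypothetical, and Java 16 or later is assumed. `contents` maps any physical realization back to its logical bag of records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical Java encoding of the COG grammar; all names are illustrative.
final class CogGrammar {
    record Rec(long key, String value) {}                   // R: a key/value pair

    interface Cog {}                                        // P: logically a bag of records
    record Sng(Rec r) implements Cog {}                     // Sng(R)
    record Union(Cog a, Cog b) implements Cog {}            // union of two bags
    record BT(long k, Cog a, Cog b) implements Cog {}       // BT_K(P, P)
    record Arr(List<Rec> xs) implements Cog {}              // Array_N(R ... R)
    record Sorted(List<Rec> xs) implements Cog {}           // Sorted_N(R ... R)

    // Logical contents: flatten a physical realization into its bag of records.
    static List<Rec> contents(Cog c) {
        List<Rec> out = new ArrayList<>();
        collect(c, out);
        return out;
    }

    private static void collect(Cog c, List<Rec> out) {
        if (c instanceof Sng sg) {
            out.add(sg.r());
        } else if (c instanceof Union un) {
            collect(un.a(), out);
            collect(un.b(), out);
        } else if (c instanceof BT bt) {
            collect(bt.a(), out);
            collect(bt.b(), out);
        } else if (c instanceof Arr ar) {
            out.addAll(ar.xs());
        } else if (c instanceof Sorted so) {
            out.addAll(so.xs());
        } else {
            throw new IllegalArgumentException("unknown node: " + c);
        }
    }

    static Rec r(long key) { return new Rec(key, "v" + key); }

    // A hybrid sentence: a singleton, an unsorted array, and a small binary tree.
    static Cog example() {
        return new Union(new Sng(r(1)),
               new Union(new Arr(List.of(r(2), r(4), r(7))),
                         new BT(6, new Sorted(List.of(r(3), r(5))), new Sng(r(6)))));
    }
}
```

Two structurally different sentences are logically equivalent exactly when `contents` returns the same bag for both, which is the property every rewrite has to preserve.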

Singleton

Visual:
COG:$Sng(x: \mathbb R)$
Logical:$\{ x \}$

Union Node

Visual:
COG:$\uplus(a: \mathbb P, b: \mathbb P)$
Logical:$a \uplus b$

Combining Primitives: Linked List

Visual:
COG:\begin{align}LL :=\;&|\;\uplus(Sng(x: \mathbb R), a: LL)\\&|\;Sng(x: \mathbb R)\end{align}
Logical:$\{ x \} \uplus a$ or $\{ x \}$

Many existing data structures can be expressed as syntactic restrictions on this grammar.

Extension 1: Semantic Constraints

Visual:
COG:$BT_{k: \mathbb K}(a: \mathbb P, b: \mathbb P)$
Logical:$a \uplus b$
Constraint:$\forall r \in a: r.key \lt k$
$\forall r \in b: r.key \geq k$

Nodes can define semantic constraints over the logical contents of their descendants.

Combining Primitives: Binary Tree

\begin{align} BinTree :=\;&|\;BT_{k: \mathbb K}(a: BinTree, b: BinTree)\\&|\;Sng(x: \mathbb R) \end{align}
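The $LL$ and $BinTree$ restrictions can be read as shape checks over a generic COG value, with the $BT$ constraint checked separately. The following is a minimal sketch assuming a hypothetical record encoding of the grammar (`Cog`, `Sng`, `Union`, `BT`) and Java 16 or later; none of these names come from the slides.

```java
// Hypothetical encoding; only the grammar rules themselves come from the slides.
final class CogShapes {
    record Rec(long key, String value) {}
    interface Cog {}
    record Sng(Rec r) implements Cog {}
    record Union(Cog a, Cog b) implements Cog {}
    record BT(long k, Cog a, Cog b) implements Cog {}

    // LL := Union(Sng(x), LL) | Sng(x)
    static boolean isLinkedList(Cog c) {
        if (c instanceof Sng) return true;
        return c instanceof Union u && u.a() instanceof Sng && isLinkedList(u.b());
    }

    // BinTree := BT_k(BinTree, BinTree) | Sng(x)
    static boolean isBinTree(Cog c) {
        if (c instanceof Sng) return true;
        return c instanceof BT t && isBinTree(t.a()) && isBinTree(t.b());
    }

    // Semantic constraint for a BinTree: every key below a BT_k node is < k on
    // the left and >= k on the right. Checks that all keys lie in [lo, hi).
    static boolean respectsSeparators(Cog c, long lo, long hi) {
        if (c instanceof Sng s) return lo <= s.r().key() && s.r().key() < hi;
        if (c instanceof BT t) {
            return respectsSeparators(t.a(), lo, Math.min(hi, t.k()))
                && respectsSeparators(t.b(), Math.max(lo, t.k()), hi);
        }
        return false; // not a BinTree shape
    }

    static Sng sng(long key) { return new Sng(new Rec(key, "v" + key)); }
}
```

The shape checks are purely syntactic, while `respectsSeparators` has to look at keys, which mirrors the split between grammar restrictions and semantic constraints.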

Extension 2: Repetition

Visual:
COG:$Array_{N : \mathbb N}(x_1: \mathbb R, \ldots, x_N: \mathbb R)$
Logical:$\{ x_1, \ldots, x_N \}$

Can repeat structures for efficiency (e.g., B+Tree vs BinTree)

Combining Extensions

Visual:
COG:$Sorted_{N : \mathbb N}(x_1: \mathbb R, \ldots, x_N: \mathbb R)$
Logical:$\{ x_1, \ldots, x_N \}$
Constraint:$\forall i \lt j: x_i.key \leq x_j.key$

Example

$\uplus(Sng(1), \uplus(Array_3(2,4,7), BT_6(Sorted_2(3, 5), Sng(6))))$

Incremental Structure Transitions

  1. A Composable Organizational Grammar
  2. Realizing Universal Data Structures
  3. Just-In-Time Data Structure Optimization
  4. Performance Testing

Universal Data Structures

  • Physiological Morphisms
    • Queries
    • Updates
  • Purely Physical Morphisms
    • Optimization

Example: Range Queries

$Q_{\ell,h} : \mathbb P \mapsto \mathbb P$

Return tuples in $[\ell,h)$

\begin{align} Q_{\ell,h}(\uplus(a, b)) \rightarrow &\;\uplus(Q_{\ell,h}(a), Q_{\ell,h}(b))\\[10px] Q_{\ell,h}(BT_k(a, b)) \rightarrow &\; \begin{cases} Q_{\ell,h}(a) & \text{if } h \lt k\\ Q_{\ell,h}(b) & \text{if } \ell \geq k\\ BT_k(Q_{\ell,h}(a), Q_{\ell,h}(b)) & \text{otherwise} \end{cases}\\[10px] Q_{\ell,h}(Array_N(x_1,\ldots,x_N)) \rightarrow &\; Array_{|Y|}(y_1, \ldots, y_{|Y|}) \\&\;\;\text{ s.t. } Y = \{\;x_i\;|\;\ell \leq x_i.key \lt h\;\}\\[10px] Q_{\ell,h}(Sorted_N(x_1,\ldots,x_N)) \rightarrow &\; Sorted_{j-i+1}(x_i, \ldots, x_j) \\&\;\;\text{ s.t. } i = argmin_i(x_i.key \geq \ell); \\&\;\;\;\;\;\;\;\;j = argmax_j(x_j.key \lt h); \end{align}
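The range-query rules translate almost line for line into a recursive function. This is a minimal sketch over a hypothetical record encoding of the grammar (Java 16 or later assumed). The slide gives no rule for $Sng$, so the singleton case below is an assumption: it returns the singleton when its key is in range and an empty array otherwise.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical encoding; the case analysis follows the Q rules on the slide.
final class RangeQuerySketch {
    record Rec(long key, String value) {}
    interface Cog {}
    record Sng(Rec r) implements Cog {}
    record Union(Cog a, Cog b) implements Cog {}
    record BT(long k, Cog a, Cog b) implements Cog {}
    record Arr(List<Rec> xs) implements Cog {}
    record Sorted(List<Rec> xs) implements Cog {}

    // Q_{lo,hi}: keep only records with lo <= key < hi.
    static Cog query(Cog c, long lo, long hi) {
        if (c instanceof Union u) {
            return new Union(query(u.a(), lo, hi), query(u.b(), lo, hi));
        }
        if (c instanceof BT t) {
            if (hi < t.k()) return query(t.a(), lo, hi);   // range entirely left of k
            if (lo >= t.k()) return query(t.b(), lo, hi);  // range entirely right of k
            return new BT(t.k(), query(t.a(), lo, hi), query(t.b(), lo, hi));
        }
        if (c instanceof Arr ar) {                         // linear filter
            List<Rec> ys = new ArrayList<>();
            for (Rec x : ar.xs()) {
                if (lo <= x.key() && x.key() < hi) ys.add(x);
            }
            return new Arr(ys);
        }
        if (c instanceof Sorted so) {                      // two binary searches
            int i = lowerBound(so.xs(), lo);
            int j = Math.max(i, lowerBound(so.xs(), hi));
            return new Sorted(new ArrayList<>(so.xs().subList(i, j)));
        }
        if (c instanceof Sng sg) {                         // assumption: not on the slide
            boolean hit = lo <= sg.r().key() && sg.r().key() < hi;
            return hit ? sg : new Arr(List.of());
        }
        throw new IllegalArgumentException("unknown node: " + c);
    }

    // Index of the first record whose key is >= key (xs.size() if none).
    static int lowerBound(List<Rec> xs, long key) {
        int lo = 0, hi = xs.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (xs.get(mid).key() < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    static List<Long> keys(Cog c) {
        List<Long> out = new ArrayList<>();
        if (c instanceof Sng sg) out.add(sg.r().key());
        else if (c instanceof Union u) { out.addAll(keys(u.a())); out.addAll(keys(u.b())); }
        else if (c instanceof BT t) { out.addAll(keys(t.a())); out.addAll(keys(t.b())); }
        else if (c instanceof Arr ar) { for (Rec x : ar.xs()) out.add(x.key()); }
        else if (c instanceof Sorted so) { for (Rec x : so.xs()) out.add(x.key()); }
        return out;
    }

    static List<Rec> recs(long... ks) {
        List<Rec> out = new ArrayList<>();
        for (long k : ks) out.add(new Rec(k, "v" + k));
        return out;
    }
}
```

Note that the result is itself a COG value, so a query result can be queried or rewritten again without conversion.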

Insert

$$Insert_{\mathbb P}: \mathbb P \rightarrow \mathbb P$$

Do the least work possible (optimize later)

$$Insert_{a}(old) \rightarrow \uplus(old, a)$$
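In code, the insert rule is a single allocation. The sketch below uses a hypothetical, cut-down encoding with only `Union` and `Arr` nodes (Java 16 or later assumed) to show that the old root is reused untouched and that the cost does not depend on how much data is already stored.

```java
import java.util.List;

// Hypothetical encoding; insert follows the rule Insert_a(old) -> Union(old, a).
final class InsertSketch {
    record Rec(long key, String value) {}
    interface Cog {}
    record Union(Cog a, Cog b) implements Cog {}
    record Arr(List<Rec> xs) implements Cog {}

    // O(1): wrap the old root and the new bag in a union node.
    static Cog insert(Cog old, Cog a) {
        return new Union(old, a);
    }

    // Number of records in the logical bag.
    static int size(Cog c) {
        if (c instanceof Union u) return size(u.a()) + size(u.b());
        if (c instanceof Arr ar) return ar.xs().size();
        throw new IllegalArgumentException("unknown node: " + c);
    }
}
```

The new records are left wherever they landed; moving them somewhere useful is deferred to later rewrites.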

Incremental Structure Transitions

  1. A Composable Organizational Grammar
  2. Realizing Universal Data Structures
  3. Just-In-Time Data Structure Optimization
  4. Performance Testing

Core Idea: Physical layout as a compiler optimization problem.

Example: Organize A Hybrid Data Structure

$$\uplus(Sng(x), BT_k(a, b)) \rightarrow \begin{cases} BT_k(\uplus(Sng(x), a), b) & \text{if } x.key \lt k\\ BT_k(a, \uplus(Sng(x), b)) & \text{if } x.key \geq k\end{cases}$$
$$\uplus(Sng(x), Sorted_N(y_1, \ldots, y_N)) \rightarrow Sorted_{N+1}(y_1, \ldots, y_i, x, y_{i+1}, \ldots, y_N)$$ $$\text{ where }y_i.key \leq x.key \leq y_{i+1}.key$$
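Both rewrites can be sketched as one pattern-matching function that returns the rewritten node, or the input unchanged when neither pattern applies. The encoding and the name `pushdown` are hypothetical, and Java 16 or later is assumed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical encoding of the two "organize" rewrites shown on the slide.
final class PushdownSketch {
    record Rec(long key, String value) {}
    interface Cog {}
    record Sng(Rec r) implements Cog {}
    record Union(Cog a, Cog b) implements Cog {}
    record BT(long k, Cog a, Cog b) implements Cog {}
    record Sorted(List<Rec> xs) implements Cog {}

    static Cog pushdown(Cog c) {
        if (c instanceof Union u && u.a() instanceof Sng s) {
            Rec x = s.r();
            if (u.b() instanceof BT t) {
                // Route the singleton to whichever side of the separator it belongs.
                return x.key() < t.k()
                    ? new BT(t.k(), new Union(s, t.a()), t.b())
                    : new BT(t.k(), t.a(), new Union(s, t.b()));
            }
            if (u.b() instanceof Sorted so) {
                // Insert x after the last y_i with y_i.key <= x.key.
                List<Rec> ys = new ArrayList<>(so.xs());
                int i = 0;
                while (i < ys.size() && ys.get(i).key() <= x.key()) i++;
                ys.add(i, x);
                return new Sorted(ys);
            }
        }
        return c; // pattern does not match: leave the node as it is
    }

    static Rec rec(long key) { return new Rec(key, "v" + key); }
    static Sng sng(long key) { return new Sng(rec(key)); }
}
```

Each rewrite is local: it inspects only a node and its immediate children, and it preserves the logical contents of the subtree.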

Rewrites

A pattern/replacement pair.

  • Crack-Array
  • Sort-Array
  • Sort-Merge
  • Pushdown-Array
  • Pushdown-BT
  • Pushdown-Sorted
  • ...

Events

A trigger for applying a rewrite.

  • Before-Scan
  • After-Scan
  • Before-Visit
  • After-Visit
  • Before-Insert
  • After-Insert
  • Idle-Tick

Policies (Take 1)

A set of Rewrite/Event pairs.

  • Cracker (Implements [Idreos et al.-CIDR 2007])
  • Adaptive Merge (Implements [Graefe/Kuno-EDBT 2010])
  • Swap (Heuristic Hybrid: Switch after 2000 events)
  • Transition (Heuristic Hybrid: Gradient from 1-3k events)
[Kennedy/Ziarek-CIDR 2015]; https://github.com/UBOdin/jitd

The Entire Transition Policy


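package jitd;

import java.util.Random;

/**
 * A Mode that gradually hands control from a source policy to a target policy.
 * Mode, Driver, Cog, and KeyValueIterator are defined elsewhere in the jitd package.
 */
public class TransitionMode extends Mode {
  int stepsTotal;       // number of operations over which the transition is spread
  int stepsTaken = 0;   // number of operations seen so far
  Random rand = new Random();
  Mode source, target;
  
  public TransitionMode(Mode source, Mode target, int steps)
  {
    this.stepsTotal = steps;
    this.source = source;
    this.target = target;
  }
  
  /**
   * Choose which policy handles the next operation. The chance of picking
   * the target is stepsTaken / stepsTotal, so it rises linearly and the
   * target is always chosen once stepsTaken reaches stepsTotal.
   */
  public Mode pick()
  {
    stepsTaken++;
    if(rand.nextInt(stepsTotal) < stepsTaken){
      return target;
    } else {
      return source;
    }
  }

  // Every operation is delegated to whichever policy pick() selects.
  public KeyValueIterator scan(Driver driver, long low, long high)
  {
    return pick().scan(driver, low, high);
  }
  public void insert(Driver driver, Cog values)
  {
    pick().insert(driver, values);
  }
  public void idle(Driver driver)
  {
    pick().idle(driver);
  }
}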

(40 lines of java)
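To see the gradient that `pick()` produces, the selection rule can be isolated from the rest of the jitd classes and simulated. The class below is a hypothetical stand-alone harness, not part of the jitd repository; it reuses only the `rand.nextInt(stepsTotal) < stepsTaken` test from the listing above, with a fixed seed so that runs are repeatable.

```java
import java.util.Random;

// Hypothetical harness: replays the pick() rule without the jitd Mode classes.
final class TransitionGradient {
    // Returns, for each call, whether the target policy would have been picked.
    static boolean[] simulate(int stepsTotal, int calls, long seed) {
        Random rand = new Random(seed);
        boolean[] pickedTarget = new boolean[calls];
        int stepsTaken = 0;
        for (int i = 0; i < calls; i++) {
            stepsTaken++;
            // nextInt(stepsTotal) is uniform over 0 .. stepsTotal-1, so the
            // target is picked with probability min(1, stepsTaken / stepsTotal).
            pickedTarget[i] = rand.nextInt(stepsTotal) < stepsTaken;
        }
        return pickedTarget;
    }

    // Number of target picks among calls with index in [from, to).
    static int countTarget(boolean[] picks, int from, int to) {
        int n = 0;
        for (int i = from; i < to; i++) if (picks[i]) n++;
        return n;
    }
}
```

With `stepsTotal = 1000`, early calls almost always go to the source policy, and every call from the 1000th onward goes to the target, because `nextInt(1000)` can never reach 1000.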

Cracker Policy

(incrementally improving performance)

Adaptive Merge Policy

(first read: 33s; bimodal: merge vs already merged)

Swap Policy

(can arbitrarily switch to a different policy)

Transition Policy

(can run two policies simultaneously)

Universal data structures allow us to
hybridize policies "for free".

Policies (Take 2)

Core Idea: Physical layout as a just-in-time compiler optimization problem.

Fluid Data Structures

Encode COG sentences as functional$^*$ data structures.

A background thread incrementally applies rewrites to optimize the data structure.

A functional data structure allows continuous availability (while performance improves).

Optimizer Work Loop

  1. Decide which rewrite to apply and where to apply it.
  2. Compute rewritten subtree.
  3. Deploy subtree into the data structure.
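The three steps of the work loop can be sketched with an immutable tree and an atomically swapped root. This is an illustration under assumptions, not the deck's implementation: the encoding, the choice of "sort the first unsorted array" as the only rewrite, and the use of `AtomicReference` with path copying are all hypothetical. In particular, it copies the path from the root to the rewritten subtree on every step rather than using the handles discussed later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the optimizer loop over an immutable COG tree.
final class OptimizerLoopSketch {
    record Rec(long key, String value) {}
    interface Cog {}
    record Union(Cog a, Cog b) implements Cog {}
    record Arr(List<Rec> xs) implements Cog {}
    record Sorted(List<Rec> xs) implements Cog {}

    // Steps 1 and 2: find the first unsorted array and compute a new tree in
    // which it is sorted. Returns the same instance if nothing can be rewritten.
    static Cog rewriteOnce(Cog c) {
        if (c instanceof Arr ar) {
            List<Rec> xs = new ArrayList<>(ar.xs());
            xs.sort(Comparator.comparingLong(Rec::key));
            return new Sorted(xs);
        }
        if (c instanceof Union u) {
            Cog a2 = rewriteOnce(u.a());
            if (a2 != u.a()) return new Union(a2, u.b());   // copy the path, share the rest
            Cog b2 = rewriteOnce(u.b());
            if (b2 != u.b()) return new Union(u.a(), b2);
        }
        return c;
    }

    // Step 3: deploy with compare-and-set; readers always see a complete tree.
    // Returns the number of rewrites that were successfully deployed.
    static int optimize(AtomicReference<Cog> root) {
        int applied = 0;
        while (true) {
            Cog before = root.get();
            Cog after = rewriteOnce(before);
            if (after == before) return applied;            // fully optimized
            if (root.compareAndSet(before, after)) applied++;
            // On failure another thread changed the root; loop and re-read it.
        }
    }
}
```

A background thread could run `optimize(root)` while other threads keep reading `root.get()`; each reader works on whichever immutable snapshot it obtained.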

Example: A Load-Time Available Index

Input: An Unsorted Array

  • Crack-in-Two (a.k.a. Radix-Partition): Fast ($O(N)$), but only a small improvement ... but can be recursively improved
  • Sort: Slow ($O(N\cdot \log(N))$), but a big improvement

Crack

\begin{align} Array_N(x_1, \ldots, x_N) \rightarrow BT_{x_j.key}(\;\;&Array_{|Y|}(y_1, \ldots, y_{|Y|}), \\&Array_{|Z|}(z_1, \ldots, z_{|Z|})\;\;) \end{align}

where $j \in [1, N]$, $Y = \{x_i | x_i.key \lt x_j.key\}$, $Z = \{x_i | x_i.key \geq x_j.key\}$

Sort

$$Array_N(x_1, \ldots, x_N) \rightarrow Sorted_N(x_{f(1)}, \ldots, x_{f(N)})$$

where $f : [N] \rightarrow [N]$ is a permutation and $x_{f(i)}.key \leq x_{f(i+1)}.key$
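The Crack and Sort rewrites can be sketched as two functions over an unsorted array node. The encoding is hypothetical and assumes Java 16 or later. The pivot index `j` is 0-based here, whereas the formulas above use 1-based indices.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical encoding of the Crack and Sort rewrites.
final class CrackSortSketch {
    record Rec(long key, String value) {}
    interface Cog {}
    record BT(long k, Cog a, Cog b) implements Cog {}
    record Arr(List<Rec> xs) implements Cog {}
    record Sorted(List<Rec> xs) implements Cog {}

    // Crack: one O(N) pass that partitions around the key of element j.
    static BT crack(Arr arr, int j) {
        long pivot = arr.xs().get(j).key();
        List<Rec> y = new ArrayList<>();   // keys <  pivot
        List<Rec> z = new ArrayList<>();   // keys >= pivot
        for (Rec x : arr.xs()) {
            if (x.key() < pivot) y.add(x); else z.add(x);
        }
        return new BT(pivot, new Arr(y), new Arr(z));
    }

    // Sort: O(N log N), produces a fully ordered node.
    static Sorted sort(Arr arr) {
        List<Rec> xs = new ArrayList<>(arr.xs());
        xs.sort(Comparator.comparingLong(Rec::key));
        return new Sorted(xs);
    }

    static List<Rec> recs(long... ks) {
        List<Rec> out = new ArrayList<>();
        for (long k : ks) out.add(new Rec(k, "v" + k));
        return out;
    }
}
```

For example, cracking an array with keys 5, 2, 8, 1 around its first element yields $BT_5(Array_2(2, 1), Array_2(5, 8))$: both halves are still unsorted, but a range query can now skip one of them.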

Crack

Consume: 1x Array

Produce: 2x Array, 1x BinTree

Sort

Consume: 1x Array

Produce: 1x Sorted Array

Crack and sort both consume an array.

Option 1: Crack($Array_8(1 \ldots 8)$)

Option 2: Sort($Array_8(1 \ldots 8)$)

Option 1: Crack($Array_4(1 \ldots 4)$)

Option 2: Sort($Array_4(1 \ldots 4)$)

Option 3: Crack($Array_4(5 \ldots 8)$)

Option 4: Sort($Array_4(5 \ldots 8)$)

Option 1: Crack($Array_4(1 \ldots 4)$)

Option 2: Sort($Array_4(1 \ldots 4)$)

The set of rewrites available is a top-1 tree query.

Accelerate with materialized views, delta queries, and priority queues.

Applying rewrites is expensive in a functional data structure.

How are pointers used?

                       Data-Structure   Accessor Code
Persistence            Long-lived       Transient
Equivalence Guarantee  Logical          Physical
                       Fluid DS         Functional DS
                       Handle           Pointer
[Balakrishnan et al.-DBPL 2019 (To Appear)]

Does it work?

Incremental Structure Transitions

  1. A Composable Organizational Grammar
  2. Realizing Universal Data Structures
  3. Just-In-Time Data Structure Optimization
  4. Performance Testing

Setup

  • Hand Generated C++ Code
  • $10^9$ records (8-byte key + 8-byte value = 16GB)
  • Compared vs STL's std::map, std::unordered_map, Google's cpp-btree

Example Policy

  1. Crack arrays until threshold $10^6 \rightarrow 10^9$
  2. Sort unsorted arrays
  3. Merge binary trees
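The first two steps of this policy can be sketched as a single recursive pass. Several things here are assumptions rather than facts from the slides: the reading of "threshold" as a maximum array size below which arrays are sorted instead of cracked, the radix-style pivot (the midpoint of the smallest and largest key, which may not be a key in the array), and the one-shot recursion in place of incremental background rewrites. The third step (merging binary trees) is omitted. Keys are assumed to be close enough together that `max - min` does not overflow.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: crack large arrays, sort small ones.
final class ExamplePolicySketch {
    record Rec(long key, String value) {}
    interface Cog {}
    record Union(Cog a, Cog b) implements Cog {}
    record BT(long k, Cog a, Cog b) implements Cog {}
    record Arr(List<Rec> xs) implements Cog {}
    record Sorted(List<Rec> xs) implements Cog {}

    static Cog organize(Cog c, int threshold) {
        if (c instanceof Union u) {
            return new Union(organize(u.a(), threshold), organize(u.b(), threshold));
        }
        if (c instanceof BT t) {
            return new BT(t.k(), organize(t.a(), threshold), organize(t.b(), threshold));
        }
        if (c instanceof Arr ar) {
            List<Rec> xs = ar.xs();
            long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
            for (Rec x : xs) {
                min = Math.min(min, x.key());
                max = Math.max(max, x.key());
            }
            // Small arrays, and arrays holding a single distinct key, are sorted.
            if (xs.size() <= Math.max(threshold, 1) || min == max) {
                List<Rec> copy = new ArrayList<>(xs);
                copy.sort(Comparator.comparingLong(Rec::key));
                return new Sorted(copy);
            }
            // min < pivot <= max, so both partitions are non-empty and the
            // recursion always makes progress.
            long pivot = min + (max - min + 1) / 2;
            List<Rec> y = new ArrayList<>(), z = new ArrayList<>();
            for (Rec x : xs) {
                if (x.key() < pivot) y.add(x); else z.add(x);
            }
            return new BT(pivot, organize(new Arr(y), threshold),
                                 organize(new Arr(z), threshold));
        }
        return c; // already sorted
    }

    // Keys in left-to-right traversal order.
    static List<Long> keys(Cog c) {
        List<Long> out = new ArrayList<>();
        if (c instanceof Union u) { out.addAll(keys(u.a())); out.addAll(keys(u.b())); }
        else if (c instanceof BT t) { out.addAll(keys(t.a())); out.addAll(keys(t.b())); }
        else if (c instanceof Arr ar) { for (Rec x : ar.xs()) out.add(x.key()); }
        else if (c instanceof Sorted so) { for (Rec x : so.xs()) out.add(x.key()); }
        return out;
    }

    static Arr arr(long... ks) {
        List<Rec> out = new ArrayList<>();
        for (long k : ks) out.add(new Rec(k, "v" + k));
        return new Arr(out);
    }
}
```

Starting from a single unsorted array, the result is a binary tree whose leaves are sorted arrays of at most `threshold` records (or one distinct key), so a left-to-right traversal visits keys in order.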

Range Scans

Point Lookups Scans

Concurrency / Handles

Load + One Point Lookup

Ongoing Work

  • Efficiently selecting optimizer targets with view maintenance
  • DSL compiler to generate C++ code
  • Synthesizing new structures & rules
  • Modeling access and transformation costs
  • Learning optimal policies

Just-in-Time Data Structures

  • A Composable Organizational Grammar can describe the intermediate state of a data structure in transition.
  • COG + localized rewrite rules can emulate the behaviors of existing data structures and be hybridized.
  • Handles allow rewrites to be applied efficiently.

Questions?