8 ReadingList Approximate QPs
Oliver Kennedy edited this page 2017-10-23 19:41:06 -04:00
This file contains ambiguous Unicode characters!

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

Motivation for AQP and Data Exploration


Example Frontends for Approximate Data Exploration


Systems

Fuzzy Prophet

Precursor to this system, based on parameter exploration. Precompute for different parameter values. Attempt to re-exploit computations already performed for previous operators.

SqlShare

UI-focused data management system. Intuitive approach to specifying joins, views, queries, etc... Append only, etc...

DBO

Run queries at each level of the plan in parallel. As you're joining, allow operators at the same level to communicate. F'rex for a NLJ, estimate a 4-way NLJ by periodically sampling the 4-way join results from state available in-memory at the time. Followup work re-spins it using a pipelined version that mirrors currents somewhat. Incremental state is cached and used for downstream computations.

"The problem with traditional database engines in this context lies with the fact that relational operations are treated as black boxes. This abstraction renders accurate statistical estimation impossible because it hides interme- diate results as well as internal state from the remainder of the system. If intermediate results are not externally visible, it is impossible for the system to guess the final answer to the query because no entity has access to informa- tion about every input relation." (Systems paper)

Aqua

System designed for approximate query answering based on join synopses; Challenge: Join sampling suffers from the birthday paradox; quadratic (not linear) reduction in number of join results at low sampling rates. Observation: Foreign key constraints imply a 1-to-1 relationship between source relation and outputs. Solution: sample from source relations, and store and use the full joins of those samples. Congressional joins extends this to use an index-striding-like approach to group-by vars. Each paper builds a statistical analysis of the results of aggregates over such joins.

  • Blink and it's done: interactive queries on very large data
  • BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Scale up data processing by sampling. Maintain (offline) not only purely random samples, but also samples biased (by grouping?) around combinations of columns. Goal is to identify rare/long-tail samples and keep them available for query processing that relies on those samples. Gimmicks: Stratified (grouped) sampling with hierarchies: Don't sample from the distribution: group on an attribute(s) and build one set of samples where each group is capped to X instances. Overflow samples go into a second set, which is capped at X' instances. Overflow from that goes into a third set, etc... When processing use the first N sets, where N is determined by accuracy and time constraint bounds.

Derby

Maintain a set of incrementally maintained "sample" views along the lines of DBO. Integrated directly into the underlying data management system. The main gimmick of the sampling strategy is that they try to capture variance in the attributes of interest. If a group (potentially the entire relation) has a high variance, we need more samples from it than if it has low variance. E.g., Compare two classes, one where all students get 4.0s and one where there's a uniform distribution from 2.5 to 4.0. The latter needs more samples to accurately capture the variance in grades.

MapReduce Online (HOP)

PF-OLA / GLADE

Sampling-based evaluation tied directly into the standard query evaluation process. Parallel sampling-based processor. Extensions to the definition of aggregate operators that allow estimates of arbitrary (linear) aggregates.

Eddies

Incremental, stepwise query evaluation. Each operator processes one "chunk" at a time.

Use eddies for stream processing. Tag tuples with lineage to allow parallel multi-query out-of-order evaluation (MCDB parallelism)

Octopus

Best-effort utilities for data extraction and query processing. Hybrid automation under human direction query processing system.

QWiK

Alternative interfaces for interactive data exploration.

dbTouch

Data migration / Data exploration

DeVIL (Declarative Visual Interaction Language)

A language for declaratively specifying visualizations over relational data. Emphasis on using transactional processing techniques (reordering, consistency, literal view serializability) to enable faster (interactive-latency) responses to user interactions.


Incremental/Partial Processing Strategies

  • Online Dynamic Reordering for Interactive Data Processing
    • A datastructure/operator roughly analogous to a priority queue, but buffered, out-of-core, supporting changing priorities, and designed to support prioritized sampling. "Here, you'll find this interesting"
  • Distributed Online Aggregation
    • Distributed DHT-based implementation of the Hellerstein paper. Varies number of processing unit, and persists state for use in future computations.
  • Sets
    • Blakeley,J.A.,Larson,P.A ̊.,Tompa,F.W.:Efficientlyupdating materialized views. In: SIGMOD, pp. 6171 (1986)
    • Buneman,P.,Clemons,E.K.:Efficientlymonitoringrelationalda-tabases. ACM TODS 4(3), 368382 (1979)
  • Bags
    • Chaudhuri, S., Krishnamurthy, R., Potamianos, S., Shim, K.: Op- timizing queries with materialized views. In: ICDE, pp. 190200 (1995)
    • Griffin, T., Libkin, L.: Incremental maintenance of views with du-plicates. In: SIGMOD, pp. 328339 (1995)
  • Aggregates
    • Palpanas, T., Sidle, R., Cochrane, R., Pirahesh, H.:Incremental maintenance for non-distributive aggregate functions. In: VLDB, pp. 802813 (2002)
  • Lazy Evaluation
    • Colby,L.S.,Griffin,T.,Libkin,L.,Mumick,I.S.,Trickey,H.:Al-gorithms for deferred view maintenance. In: SIGMOD, pp. 469480 (1996)
    • Zhou,J.,Larson,P.A ̊.,Elmongui,H.G.:Lazymaintenanceofma-terialized views. In: VLDB, pp. 231242 (2007)
  • Asynchronous
    • Salem,K.,Beyer,K.S.,Cochrane,R.,Lindsay,B.G.:Howtorollajoin: Asynchronous incremental view maintenance. In: SIGMOD,pp. 129140 (2000)
  • Map/Reduce
    • Large-scale incremental processing with MapReduce

Sampling Strategies


Estimators and Accuracy Analyses


Online Joins


Creating Data Synopses


Query evaluation Over Synopses


Precomputation

  • Wave Computing in the Cloud
    • Capture recurring queries by examining a queries executed repeatedly on later instantiations of the same dataset. Use shared scans, etc...

Distributed and Continuous Workflow Systems


Predictive, Anticipatory Query Suggestions


Not Immediately Related work

UI Work