---
title: CIDR Recap
projects:
- mimir
author: William Spoth
---
How big is BIG and how fast is FAST? This seemed to be a recurring theme of the
CIDR 2017 conference. A general consensus across many presentations was that
RDBMSs were the kings of scaling to large data twenty years ago but, for some
inexplicable reason, have since become lost to the ever-changing scope of BIG
and FAST. Multiple papers attacked this problem from different directions,
adding to the many tools already on the market for stream processing and
large-scale computation such as Spark, but there seemed to be no silver bullet.
Adding to the theme that big data is too big, keynote talks by Emily Galt and
Sam Madden drove the point home, offering different real-world scenarios and
outlooks on the problem.

To break this theme apart, I'll split the papers into groups and explain the
different outlooks the authors took and how they addressed this common problem.
The first group of papers, *Prioritizing Attention in Analytic Monitoring*,
*The Myria Big Data Management and Analytics System and Cloud Services*,
*Weld: A Common Runtime for High Performance Data Analytics*, *A Database
System with Amnesia*, and *Releasing Cloud Databases from the Chains of
Performance Prediction Models*, focused on the theme that databases are not
keeping pace with the rate at which data is growing. Sam Madden raised the
interesting point that hardware components like the bus are not the bottleneck
here. With advances in big data computing like Apache Spark, it can feel like
RDBMSs are the end of the line, where data goes to die. Each paper addressed
this in its own way: *A Database System with Amnesia*, for example, looked at
throwing out unused data, since most data in an RDBMS gets put in and never
touched again, and with the increasing use of data streams the problem of not
being able to process and store data fast enough is only amplified.
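
To make the "amnesia" idea a bit more concrete, here is a minimal sketch of a
store that simply forgets rows that have not been touched within a retention
window. To be clear, this is not the mechanism from *A Database System with
Amnesia*; the class, its parameters, and the eviction rule are all invented for
illustration, purely to show the flavor of treating cold data as disposable so
the working set stays small and fast.

```python
# Illustrative sketch only: a toy "forgetting" store that evicts rows not
# accessed within a retention window. Not the algorithm from the paper.
import time


class ForgetfulStore:
    def __init__(self, retention_seconds=3600):
        self.retention = retention_seconds
        self.rows = {}         # key -> value
        self.last_access = {}  # key -> timestamp of last read or write

    def put(self, key, value):
        self.rows[key] = value
        self.last_access[key] = time.time()

    def get(self, key):
        if key in self.rows:
            self.last_access[key] = time.time()
            return self.rows[key]
        return None  # the row may have been forgotten

    def forget_cold_rows(self):
        """Drop every row not touched within the retention window."""
        cutoff = time.time() - self.retention
        cold = [k for k, t in self.last_access.items() if t < cutoff]
        for k in cold:
            del self.rows[k]
            del self.last_access[k]
        return len(cold)


if __name__ == "__main__":
    store = ForgetfulStore(retention_seconds=0.1)
    store.put("sensor-1", 42)
    time.sleep(0.2)
    store.put("sensor-2", 7)           # recent, survives eviction
    print(store.forget_cold_rows())    # 1 row forgotten
    print(store.get("sensor-1"))       # None: it is gone
    print(store.get("sensor-2"))       # 7
```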

The second common problem is that even if you can efficiently store and query
your data lakes, humans often lack the ability to write the right queries or
the necessary insight into how the data is formatted. The papers *The Data
Civilizer System*, *Establishing Common Ground with Data Context*, *Adaptive
Schema Databases*, and *Combining Design and Performance in a Data
Visualization Management System* all try to address this problem, but from
slightly different angles. *The Data Civilizer System* and *Adaptive Schema
Databases* look at aiding an analyst in schema and table exploration and at
helping them discover unknown or desired qualities of their data sources. These
papers provide the kind of insight that would otherwise exist only as internal
middleware at large companies; the problem is that big data and messy data
lakes are becoming more and more prevalent for everyone else. Medium-sized
businesses can be buried in data following user surges or new product upgrades,
and government agencies can sit on large amounts of uncleaned sensor and
user-submitted data that they lack the staff or tools to manage.
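
As a rough illustration of the kind of legwork these systems try to automate,
here is a small sketch that infers a loose, best-effort "schema" from a pile of
messy, semi-structured records. This is not the approach taken by *Adaptive
Schema Databases* or *The Data Civilizer System*; the function and the example
rows are hypothetical, and real systems do far more (matching, cleaning,
lineage), but it shows the sort of insight an analyst needs before they can
even write a query.

```python
# Illustrative sketch only: summarize the columns, value types, and missing
# values found in messy, semi-structured records (schema-on-read flavor).
from collections import defaultdict


def infer_schema(records):
    """Return, per column, the observed value types and a count of missing values."""
    types = defaultdict(set)
    nulls = defaultdict(int)
    columns = set()
    for rec in records:
        columns.update(rec)
        for col, val in rec.items():
            if val is None or val == "":
                nulls[col] += 1
            else:
                types[col].add(type(val).__name__)
    return {col: {"types": sorted(types[col]), "nulls": nulls[col]}
            for col in sorted(columns)}


if __name__ == "__main__":
    # Heterogeneous rows like you might pull out of a data lake.
    rows = [
        {"id": 1, "temp": 71.3, "unit": "F"},
        {"id": "2", "temp": None, "unit": "F", "note": "sensor offline"},
        {"id": 3, "temp": 22.1, "unit": "C"},
    ]
    for col, summary in infer_schema(rows).items():
        print(col, summary)
```

Even this trivial pass surfaces the kind of surprises (an `id` column that is
sometimes a string, a `note` column that only occasionally exists) that an
analyst would otherwise trip over mid-query.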

To me, a big takeaway from this conference was that databases need a better way
to handle big data. Databases are the hero big data needs AND the one it
deserves. To get there, databases are going to need to relax their constraints
on rigid schemas and perfect data, which opens up a large amount of research
opportunity along with the realization that there might not currently be a
right answer to this problem. Either way, it should be interesting to see what
sacrifices RDBMSs make to compete with the growing amount of data, and whether
they can apply decades' worth of research to a hot field that is looking for an
answer.