Website/src/rants/2015-08-13-incorrect-dbs.md

| title | author |
| --- | --- |
| What if Databases Could Answer Incorrectly? | Oliver Kennedy |

(an open letter to the database community)

For as long as databases have existed, they have held themselves to an invariant.  This invariant has become so ingrained in the psyche of database theoreticians, researchers, and designers that even the few who have tried to break it have only done so with cumbersome data models, by involving huge warning signs, or by using similarly obnoxious user interfaces.  The invariant that I'm talking about is that a database must never give the user an incorrect answer.

Admittedly, this invariant has been broken now and again: Approximate (née Online) Query Processing uses sampling to satisfy user-provided bounds, Probabilistic and Uncertain Databases work with underspecified data, while Model Databases allow users to query graphical models.  Yet, even in these cases, we as a community feel compelled to force the user to suffer immeasurable pain and anguish for the sin of working with uncertain data.  Probabilistic databases are impenetrable to anyone without a degree in statistics.  Every single AQP system and model database adds arcane syntax to SQL that allows users to specify how much uncertainty they're willing to tolerate, or worse still, requires a magical frontend that screams at the top of its lungs about just how bad the results that it's producing are.

Enough is enough!

Who do we think we are that we can provide a user with 100% correct answers?  Ask anyone who's run a production database or done any sort of analytics: Data is uncertain.  Data is messy.  Let's stop trying to prop up the failing illusion that it's anything else, and work towards embracing that uncertainty.  Let's give up on "certain" answers, and just give the users our best guess!

But Oliver, I hear you all screaming, this means that the users will get the wrong answers!

Of course they will.  Their data is already so screwed up that they're getting the wrong answers anyway.  The difference is that now we can actually do something about it.  If the database is making guesses, the database knows exactly what it's guessing about, and why it's making a guess.  Instead of trying to hide that uncertainty from the user, let's try to better communicate that uncertainty to the user by shutting up, speaking English, and listening when we need to.

The first part of communication is knowing when to shut up.  Let's not overwhelm the user with details about (potential) errors.  A small, simple indicator like an asterisk or a red-colored result is enough to let the user know that something is up.  For God's sake, don't cover the result screen in epsilon-delta bounds, or ask the user to write queries in your own brand of SQL+uncertainty bounds.
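To make "a small, simple indicator" concrete, here's a toy sketch in Python.  The `Cell` type and the `render_row` function are invented for this illustration (they aren't taken from Mimir or any real system), and the only assumption is that the database can flag which values it guessed.  A guessed value gets an asterisk, and that's all the user sees up front.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Cell:
    """One result value plus a flag saying whether the database guessed it.

    Hypothetical type, made up for this sketch.
    """
    value: Any
    guessed: bool = False


def render_cell(cell: Cell) -> str:
    # A guess gets a trailing asterisk; a certain value is shown as-is.
    return f"{cell.value}*" if cell.guessed else str(cell.value)


def render_row(row: List[Cell]) -> str:
    # No bounds, no probabilities, no extra columns: just the values.
    return " | ".join(render_cell(c) for c in row)


row = [Cell("Alice"), Cell(42, guessed=True), Cell("Buffalo")]
print(render_row(row))  # prints "Alice | 42* | Buffalo"
```

The point of the design is what's missing: the row looks exactly like an ordinary result row, plus one character.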

The second part of communication is speaking the user's language.  If you're going to make guesses that affect a user's analysis... tell them... but tell them in English (or your localization of choice).  Prioritize.  Let the user know why their result is uncertain, what you did to fix it, and whether they should be concerned or not.  Above all, let the user dictate the pace at which they absorb information.

The third part of communication is listening.  If there's an error that affects the user's results, we can't just stop at telling the user.  We need to make it as easy as possible for the user to fix it.
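Here's a rough Python sketch of the second and third parts together.  Everything in it is hypothetical (the `Guess` and `Cell` types, the two-level `severity` scale, the wording of the messages); it's one way the idea might look, not how Mimir or any other system does it.  Each guess carries a plain-English reason, explanations are only produced when the user asks for them, the most worrying ones come first, and fixing a guess is a single call.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Guess:
    """Why a value is uncertain, what was done about it, and how much it matters."""
    why: str
    what_we_did: str
    severity: str  # "minor" or "major"; a made-up two-level scale


@dataclass
class Cell:
    value: Any
    guess: Optional[Guess] = None

    def explain(self) -> str:
        # Speaking the user's language: no bounds, just sentences.
        if self.guess is None:
            return "This value is not a guess."
        concern = ("You probably don't need to worry about it."
                   if self.guess.severity == "minor"
                   else "You may want to double-check it.")
        return f"{self.guess.why} {self.guess.what_we_did} {concern}"

    def fix(self, corrected: Any) -> None:
        # Listening: the user supplies the right value and the guess goes away.
        self.value = corrected
        self.guess = None


def explanations(cells: List[Cell]) -> List[str]:
    """Explanations for guessed cells only, most worrying first."""
    guessed = [c for c in cells if c.guess is not None]
    # False sorts before True, so "major" guesses come first.
    guessed.sort(key=lambda c: c.guess.severity != "major")
    return [c.explain() for c in guessed]


age = Cell(42, Guess("The age for this row was missing.",
                     "We filled it in with the average age.",
                     "major"))
print(age.explain())
age.fix(37)
print(age.explain())  # prints "This value is not a guess."
```

Nothing here is pushed at the user: `explain` and `explanations` run only when asked, which is one way of letting the user dictate the pace.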

Don't shun uncertainty, embrace it.  Better still, make it easier for your users to embrace it!

And, if you're interested in how the ODIn Lab is trying to approach these problems, check out the Mimir project and our 2015 VLDB paper (Research Session 25, Thursday, Sept 3 at 1:30 PM), or drop us a line!