pull/1/head
Oliver Kennedy 2020-05-03 15:01:54 -04:00
parent ddd5d987e6
commit 361386e275
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
3 changed files with 73 additions and 0 deletions

View File

@ -1,4 +1,19 @@
[
  {
    "title" : "Loki: Streamlining Integration and Enrichment",
    "authors" : [
      "William Spoth",
      "Poonam Kumari",
      "Oliver Kennedy",
      "Fatemeh Nargesian"
    ],
    "venue" : "HILDA",
    "year" : 2020,
    "length" : 4,
    "urls" : {
      "workshop" : "https://odin.cse.buffalo.edu/papers/2020/HILDA-Loki.pdf"
    }
  },
  {
    "title" : "Your notebook is not crumby enough, REPLace it",
    "authors" : [

View File

@ -0,0 +1,58 @@
---
title: Re-using work in data integration
author: Oliver Kennedy
---
Data integration is a huge problem. There's a ton of work out there on
automating the process of merging two datasets into a unified whole, but most of
it misses one important factor: Exceptions are the norm in data integration.
That makes data integration a labor-intensive task, involving everything from
encoding standard translations (e.g., (℉ - 32) × 5/9 = ℃) and dictionaries
(e.g., NY = New York) to more complex relationships (e.g., geocoding street
addresses). Worse, once datasets A and B are integrated, integrating dataset
C is nearly as much work. Maybe you have some code left over from integrating
A and B (and hopefully your student/employee is still around to explain it to
you), but you still need to sit down and figure out which bits of that
code can be re-used... or you do everything from scratch.
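
To make that concrete, here's a tiny sketch (in Python, with made-up names and
values, not code from any particular project) of the sort of one-off translation
logic that tends to accumulate:

```python
# Hypothetical examples of the kinds of one-off translation rules that
# pile up during integration.

def fahrenheit_to_celsius(temp_f: float) -> float:
    """Standard unit conversion: C = (F - 32) * 5/9."""
    return (temp_f - 32) * 5 / 9

# A dictionary translation from state abbreviations to full names
# (truncated here; a real mapping would cover all fifty states).
STATE_NAMES = {"NY": "New York", "CA": "California", "TX": "Texas"}

def expand_state(abbrev: str) -> str:
    return STATE_NAMES.get(abbrev.strip().upper(), abbrev)
```

None of this is hard to write, but it is tedious, easy to lose, and hard to
find again when dataset C shows up.
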
This is why I got very excited when I ran into some work by Fatemeh Nargesian
on [searching for unionable datasets](http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf).
The idea is simple: you index a data lake, hand it a dataset, and it figures out
which datasets in the lake have "similar" columns (based on a clever use of
word embeddings). Enough similar columns, and there's a good chance that you
can just union the datasets together.
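
The paper is much more careful about this, but the core intuition (compare
columns by embedding their values) can be sketched roughly like so; the `embed`
function, the averaging, and the cosine comparison here are placeholder choices
of mine, not the method from the paper:

```python
import numpy as np

def column_similarity(col_a, col_b, embed):
    """Rough sketch: average the word embeddings of each column's values and
    compare the averages by cosine similarity. The actual system uses a much
    more careful statistical test over the embeddings."""
    va = np.mean([embed(v) for v in col_a], axis=0)
    vb = np.mean([embed(v) for v in col_b], axis=0)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

# Columns whose similarity clears a threshold are candidates for unioning.
```
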
My students Will Spoth and Poonam Kumari got to talking with Fatemeh and me
about how we could use this idea to make data integration code easier to
re-use --- basically, how we could make integration work itself re-usable.
Our first steps towards this goal just got
[accepted at HILDA](https://odin.cse.buffalo.edu/papers/2020/HILDA-Loki.pdf).

Our approach, called Link Once and Keep It (Loki), is also simple: When you
integrate two datasets, you record the translations that got you from each
dataset to the common schema. This is something that can easily be done in
our prototype provenance-aware notebook, [Vizier](https://vizierdb.info). For
example, we might record how body temperature in one dataset was translated from
℉ to the ℃ used in the other dataset, or how state abbreviations (NY) were
expanded via a dictionary to the full state names (New York) used in the other
dataset.
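
The bookkeeping itself doesn't need to be fancy. As an illustration (a sketch in
plain Python, not Vizier's actual API), each recorded translation just needs to
remember which attribute "families" it connects and how to apply it:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Translation:
    """One recorded piece of integration work (illustrative only)."""
    source_family: str   # e.g., "temperature (F)"
    target_family: str   # e.g., "temperature (C)"
    apply: Callable      # the translation logic itself

registry = [
    Translation("temperature (F)", "temperature (C)",
                lambda f: (f - 32) * 5 / 9),
    Translation("state abbreviation", "state name",
                {"NY": "New York", "CA": "California"}.get),
]
```
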
Now that we have one mapping, anytime we need to translate ℉ to ℃ or to expand
abbreviated state names, we have the logic needed to pull it off. What remains
is to figure out when to propose the translation to the user. This is where
Fatemeh's work on unionability comes in: Whenever two columns are "similar",
there's a good chance they're of the same type and that the same mappings apply.
We took the opportunity to define a new similarity metric for numeric types,
based on the distribution of values in the data. Unlike the prior approach
based on word embeddings, this is far more likely to give false positives, but
in this setting that's ok, since our goal is only to find and suggest
translations to the user.
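
The paper defines the actual metric; purely to illustrate the shape of the idea
(comparing value distributions rather than embeddings), a stand-in might compare
normalized quantile profiles:

```python
import numpy as np

def numeric_similarity(col_a, col_b, quantiles=np.linspace(0.05, 0.95, 19)):
    """Illustrative stand-in for a distribution-based similarity score:
    normalize each column to zero mean and unit variance, then compare the
    columns' quantile profiles. The metric in the paper differs; this only
    shows the general flavor."""
    def profile(col):
        col = np.asarray(col, dtype=float)
        col = (col - col.mean()) / (col.std() + 1e-9)
        return np.quantile(col, quantiles)
    diff = np.abs(profile(col_a) - profile(col_b)).mean()
    return 1.0 / (1.0 + diff)   # higher means more similar
```
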
Loki combines this with a graph-based search for *chains* of translations that
can be used to translate a source attribute family into a target attribute
family. This will allow Loki to answer two classes of queries: (1) What
transformations will get me from a source dataset to a target schema? and (2)
Is there a schema that I can map two datasets into with minimal work? While
Loki is still in the exploratory prototype phase, we hope to be able to release
it one day as a slowly growing repository of translation rules.
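
As a rough sketch of what that search looks like (again illustrative, reusing
the hypothetical `Translation` records from above rather than Loki's actual
representation), a breadth-first search over attribute families treats each
recorded translation as a directed edge:

```python
from collections import deque

def find_chain(translations, source_family, target_family):
    """Return one shortest chain of translations from source to target,
    or None if the families aren't connected. (A sketch of the idea,
    not Loki's actual search.)"""
    frontier = deque([(source_family, [])])
    seen = {source_family}
    while frontier:
        family, chain = frontier.popleft()
        if family == target_family:
            return chain
        for t in translations:
            if t.source_family == family and t.target_family not in seen:
                seen.add(t.target_family)
                frontier.append((t.target_family, chain + [t]))
    return None
```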

Binary file not shown.