Loki
parent
ddd5d987e6
commit
361386e275
|
@ -1,4 +1,19 @@
|
|||
[
|
||||
{
|
||||
"title" : "Loki: Streamlining Integration and Enrichment",
|
||||
"authors" : [
|
||||
"William Spoth",
|
||||
"Poonam Kumari",
|
||||
"Oliver Kennedy",
|
||||
"Fatemeh Nargessian"
|
||||
],
|
||||
"venue" : "HILDA",
|
||||
"year" : 2020,
|
||||
"length" : 4,
|
||||
"urls" : {
|
||||
"workshop" : "https://odin.cse.buffalo.edu/papers/2020/HILDA-Loki.pdf"
|
||||
}
|
||||
},
|
||||
{
|
||||
"title" : "Your notebook is not crumby enough, REPLace it",
|
||||
"authors" : [
|
||||
|
|
|
@ -0,0 +1,58 @@
|
|||
---
|
||||
title: Re-using work in data integration
|
||||
author: Oliver Kennedy
|
||||
---
|
||||
|
||||
Data integration is a huge problem. There's a ton of work out there on
|
||||
automating the process of merging two datasets into a unified whole, but most of
|
||||
it misses one important factor: Exceptions are the norm in data integration.
|
||||
That means that data integration is a labor intensive task, involving everything
|
||||
from encoding standard translations (e.g., ℉ - 32 * 9/2 = ℃), dictionaries
|
||||
(e.g., NY = New York), and more complex relationships (e.g., geocoding street
|
||||
addresses). Worse, once datasets A and B are integrated, integrating dataset
|
||||
C is nearly as much work. Maybe you have some code left over from integrating
|
||||
A and B (and hopefully your student/employee is still around to explain it to
|
||||
you), but you really need to sit there to try to figure out which bits of that
|
||||
code can be re-used... or you do everything from scratch.
|
||||
|
||||
This is why, I got very excited when I ran into some work by Fatemeh Nargesian
|
||||
on [searching for unionable datasets](http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf).
|
||||
The idea is simple: you index a data lake, hand it a dataset, and it figures out
|
||||
which datasets in the lake have "similar" columns (based on a clever use of
|
||||
word embeddings). Enough similar columns, and there's a good chance that you
|
||||
can just union the datasets together.
|
||||
|
||||
My students Will Spoth and Poonam Kumari got to talking with Fatemeh and me
|
||||
about how we could use this idea to make it easier to re-use data integration
|
||||
code --- basically, how could we make it easier to re-use integration work.
|
||||
Our first steps towards this goal just got
|
||||
[accepted at HILDA](https://odin.cse.buffalo.edu/papers/2020/HILDA-Loki.pdf).
|
||||
Our approach, called Link Once and Keep It (Loki) is also simple: When you
|
||||
integrate two datasets together, you record the translations that got you from
|
||||
the dataset to a common schema. This is something that can easily be done in
|
||||
our prototype, provenance-aware notebook [Vizier](https://vizierdb.info). For
|
||||
example, we might record how body temperature in one dataset was translated from
|
||||
℉ to the ℃ used in the other dataset, or the dictionary translation of state
|
||||
abbreviations (NY) to full state names (New York) used in the other dataset.
|
||||
|
||||
Now that we have one mapping, anytime we need to translate ℉ to ℃ or to expand
|
||||
abbreviated state names, we have the logic needed to pull it off. What remains
|
||||
is to figure out when to propose the translation to the user. This is where
|
||||
Fatemeh's work on unionability comes in: Whenever two columns are "similar",
|
||||
there's a good chance they're of the same type and that the same mappings apply.
|
||||
We took the opportunity to define a new similarity metric for numeric types,
|
||||
based on the distribution of values in the data. Unlike the prior approach
|
||||
based on word embeddings, this is far more likely to give false positives, but
|
||||
in this setting that's ok, since our goal is only to find and suggest
|
||||
translations to the user.
|
||||
|
||||
Loki combines this with a graph-based search for *chains* of translations that
|
||||
can be used to translate a source attribute family into a target attribute
|
||||
family. This will allow Loki to answer two classes of queries: (1) What
|
||||
transformations will get me from a source dataset to a target schema, and (2)
|
||||
Is there a schema that I can map two datasets into with minimal work. While
|
||||
Loki is still in the exploratory prototype phase, we hope to be able to release
|
||||
it one day as a slowly growing repository of translation rules.
|
||||
|
||||
|
||||
|
Binary file not shown.
Loading…
Reference in New Issue