pull/1/head
Oliver Kennedy 2020-05-03 15:01:54 -04:00
parent ddd5d987e6
commit 361386e275
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
3 changed files with 73 additions and 0 deletions

View File

@ -1,4 +1,19 @@
[
  {
    "title" : "Loki: Streamlining Integration and Enrichment",
    "authors" : [
      "William Spoth",
      "Poonam Kumari",
      "Oliver Kennedy",
      "Fatemeh Nargesian"
    ],
    "venue" : "HILDA",
    "year" : 2020,
    "length" : 4,
    "urls" : {
      "workshop" : "https://odin.cse.buffalo.edu/papers/2020/HILDA-Loki.pdf"
    }
  },
  {
    "title" : "Your notebook is not crumby enough, REPLace it",
    "authors" : [

View File

@ -0,0 +1,58 @@
---
title: Re-using work in data integration
author: Oliver Kennedy
---
Data integration is a huge problem. There's a ton of work out there on
automating the process of merging two datasets into a unified whole, but most of
it misses one important factor: Exceptions are the norm in data integration.
That makes data integration a labor-intensive task, involving everything from
encoding standard translations (e.g., (℉ - 32) × 5/9 = ℃) and dictionaries
(e.g., NY = New York) to more complex relationships (e.g., geocoding street
addresses). Worse, once datasets A and B are integrated, integrating dataset
C is nearly as much work. Maybe you have some code left over from integrating
A and B (and hopefully your student/employee is still around to explain it to
you), but you still need to sit down and figure out which bits of that
code can be re-used... or you do everything from scratch.
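
To make that concrete, here's a tiny sketch (in Python, with made-up names and
values, not code from any particular project) of the sort of one-off translation
logic that tends to accumulate:

```python
# Hypothetical examples of the kinds of one-off translation rules that
# pile up during integration.

def fahrenheit_to_celsius(temp_f: float) -> float:
    """Standard unit conversion: C = (F - 32) * 5/9."""
    return (temp_f - 32) * 5 / 9

# A dictionary translation from state abbreviations to full names
# (truncated here; a real mapping would cover all fifty states).
STATE_NAMES = {"NY": "New York", "CA": "California", "TX": "Texas"}

def expand_state(abbrev: str) -> str:
    return STATE_NAMES.get(abbrev.strip().upper(), abbrev)
```

None of this is hard to write, but it is tedious, easy to lose, and hard to
find again when dataset C shows up.
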
This is why I got very excited when I ran into some work by Fatemeh Nargesian
on [searching for unionable datasets](http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf).
The idea is simple: you index a data lake, hand it a dataset, and it figures out
which datasets in the lake have "similar" columns (based on a clever use of
word embeddings). Enough similar columns, and there's a good chance that you
can just union the datasets together.
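
The paper is much more careful about this, but the core intuition (compare
columns by embedding their values) can be sketched roughly like so; the `embed`
function, the averaging, and the cosine comparison here are placeholder choices
of mine, not the method from the paper:

```python
import numpy as np

def column_similarity(col_a, col_b, embed):
    """Rough sketch: average the word embeddings of each column's values and
    compare the averages by cosine similarity. The actual system uses a much
    more careful statistical test over the embeddings."""
    va = np.mean([embed(v) for v in col_a], axis=0)
    vb = np.mean([embed(v) for v in col_b], axis=0)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

# Columns whose similarity clears a threshold are candidates for unioning.
```
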
My students Will Spoth and Poonam Kumari got to talking with Fatemeh and me
about how we could use this idea to make data integration code easier to
re-use --- basically, how we could make integration work itself re-usable.
Our first steps towards this goal just got
[accepted at HILDA](https://odin.cse.buffalo.edu/papers/2020/HILDA-Loki.pdf).

Our approach, called Link Once and Keep It (Loki), is also simple: When you
integrate two datasets, you record the translations that got you from each
dataset to the common schema. This is something that can easily be done in
our prototype provenance-aware notebook, [Vizier](https://vizierdb.info). For
example, we might record how body temperature in one dataset was translated from
℉ to the ℃ used in the other dataset, or how state abbreviations (NY) were
expanded via a dictionary to the full state names (New York) used in the other
dataset.
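
The bookkeeping itself doesn't need to be fancy. As an illustration (a sketch in
plain Python, not Vizier's actual API), each recorded translation just needs to
remember which attribute "families" it connects and how to apply it:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Translation:
    """One recorded piece of integration work (illustrative only)."""
    source_family: str   # e.g., "temperature (F)"
    target_family: str   # e.g., "temperature (C)"
    apply: Callable      # the translation logic itself

registry = [
    Translation("temperature (F)", "temperature (C)",
                lambda f: (f - 32) * 5 / 9),
    Translation("state abbreviation", "state name",
                {"NY": "New York", "CA": "California"}.get),
]
```
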
Now that we have one mapping, anytime we need to translate ℉ to ℃ or to expand
abbreviated state names, we have the logic needed to pull it off. What remains
is to figure out when to propose the translation to the user. This is where
Fatemeh's work on unionability comes in: Whenever two columns are "similar",
there's a good chance they're of the same type and that the same mappings apply.
We took the opportunity to define a new similarity metric for numeric types,
based on the distribution of values in the data. Unlike the prior approach
based on word embeddings, this is far more likely to give false positives, but
in this setting that's ok, since our goal is only to find and suggest
translations to the user.
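
The paper defines the actual metric; purely to illustrate the shape of the idea
(comparing value distributions rather than embeddings), a stand-in might compare
normalized quantile profiles:

```python
import numpy as np

def numeric_similarity(col_a, col_b, quantiles=np.linspace(0.05, 0.95, 19)):
    """Illustrative stand-in for a distribution-based similarity score:
    normalize each column to zero mean and unit variance, then compare the
    columns' quantile profiles. The metric in the paper differs; this only
    shows the general flavor."""
    def profile(col):
        col = np.asarray(col, dtype=float)
        col = (col - col.mean()) / (col.std() + 1e-9)
        return np.quantile(col, quantiles)
    diff = np.abs(profile(col_a) - profile(col_b)).mean()
    return 1.0 / (1.0 + diff)   # higher means more similar
```
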
Loki combines this with a graph-based search for *chains* of translations that
can be used to translate a source attribute family into a target attribute
family. This will allow Loki to answer two classes of queries: (1) What
transformations will get me from a source dataset to a target schema? and (2)
Is there a schema that I can map two datasets into with minimal work? While
Loki is still in the exploratory prototype phase, we hope to be able to release
it one day as a slowly growing repository of translation rules.
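
As a rough sketch of what that search looks like (again illustrative, reusing
the hypothetical `Translation` records from above rather than Loki's actual
representation), a breadth-first search over attribute families treats each
recorded translation as a directed edge:

```python
from collections import deque

def find_chain(translations, source_family, target_family):
    """Return one shortest chain of translations from source to target,
    or None if the families aren't connected. (A sketch of the idea,
    not Loki's actual search.)"""
    frontier = deque([(source_family, [])])
    seen = {source_family}
    while frontier:
        family, chain = frontier.popleft()
        if family == target_family:
            return chain
        for t in translations:
            if t.source_family == family and t.target_family not in seen:
                seen.add(t.target_family)
                frontier.append((t.target_family, chain + [t]))
    return None
```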

Binary file not shown.