Reading Lists


Probabilistic databases revolve around so-called possible worlds semantics.
Uncertain data typically appears in one of three forms: row-level uncertainty, where a tuple may or may not be in a given table; attribute-level uncertainty, where the exact value of an attribute is not known; and open-world uncertainty, where the set of tuples in a given result table is not known ahead of time. Note the distinction between row-level and open-world uncertainty: with the former you can describe precisely, and in a finite way, which tuples could be in the table, while with the latter you cannot.
Benchmarks
-----------------------------------------------------
* [PDBench](http://pdbench.sourceforge.net)
Surveys
-----------------------------------------------------
* [Morgan & Claypool: Probabilistic Databases](http://www.morganclaypool.com/doi/abs/10.2200/S00362ED1V01Y201105DTM016)
A solid, theory-focused survey of techniques for probabilistic databases. A good starting point for anyone working in the space.
* [Springer: Managing and Mining Uncertain Data](http://link.springer.com/book/10.1007/978-0-387-09690-2?no-access=true#page=131)
Theoretical Foundations of Uncertain Databases
-----------------------------------------------------
This section outlines foundational theoretical work on probabilistic and uncertain databases.
#### C-Tables
Somewhat related to uncertainty is a concept called three-valued logic, which extends the usual true/false of boolean logic with a third 'unknown' truth value (this is how SQL treats comparisons involving NULL).
* [SQL's Three-Valued Logic and Certain Answers](http://homepages.inf.ed.ac.uk/libkin/papers/icdt15.pdf)
#### Dichotomy Results
In general, exact query processing in a probabilistic database is #P-Hard. This means that for any meaningfully sized query, good luck getting precise results before the heat death of the universe. However, there are a number of cases where you can take shortcuts. For example, if you can prove that query results are independent (P(X|Y) = P(X) and P(Y|X) = P(Y)) or disjoint (P(X|Y) = P(Y|X) = 0), you can compute the probability of a disjunction significantly faster (see the small worked example after the papers below). There's been a bunch of theoretical work on trying to distinguish between queries for which such tricks exist and queries for which they don't. These are typically called 'dichotomy results'.
* [The dichotomy of probabilistic inference for unions of conjunctive queries](http://dl.acm.org/citation.cfm?id=2395119)
* [The dichotomy of conjunctive queries on probabilistic structures](http://dl.acm.org/citation.cfm?id=1265571)
* [The complexity of causality and responsibility for query answers and non-answers](http://dl.acm.org/citation.cfm?id=1880176)
* [Dichotomies for Queries with Negation in Probabilistic Databases](http://www.cs.ox.ac.uk/dan.olteanu/papers/fo-tods16.pdf)
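To make the shortcut concrete, here is a tiny worked example of the two "easy" cases for a disjunction; the probabilities are made up purely for illustration.

```python
# Probability of a disjunction (X OR Y) under the two easy cases above.
# The probabilities here are hypothetical, purely for illustration.
p_x, p_y = 0.3, 0.5

# Independent: P(X or Y) = 1 - (1 - P(X)) * (1 - P(Y))
p_or_independent = 1 - (1 - p_x) * (1 - p_y)   # 0.65

# Disjoint: P(X or Y) = P(X) + P(Y)
p_or_disjoint = p_x + p_y                      # 0.8

print(p_or_independent, p_or_disjoint)
```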
#### Provenance Semirings
A few UPenn folks sat down to formally define a general model for provenance and came up with this beauty. There was already a well-known link between database computations and semirings (Union and Projection are +, Joins are *). The authors here noted that many existing provenance models slot nicely into the semiring framework as well. The key idea is that you can express provenance as a polynomial expression (e.g., `a + b*c`), and then depending on what you slot in for the +, *, and base type, you get a different provenance model. This lets you build one system that supports all of those models (a small sketch of this idea follows the reference below). Another key insight (and the reason that this paper is on _this_ list) is that the semiring model is equivalent to the condition columns of C-Tables (although the paper incorrectly claims that it subsumes all of C-Tables).
* [Provenance Semirings](http://dl.acm.org/citation.cfm?id=1265535)
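As a rough sketch of the "slotting in" idea (not the paper's formalism, just an illustration), the same provenance polynomial can be evaluated under different choices of + and *; the annotation values below are hypothetical.

```python
# The same provenance polynomial, a + b*c, evaluated under two different
# (+, *) interpretations. Values are hypothetical annotations on source tuples.

def provenance(a, b, c, plus, times):
    # the polynomial a + b*c, with + and * supplied by the chosen semiring
    return plus(a, times(b, c))

# Counting semiring (bag semantics): how many derivations produce the tuple?
multiplicity = provenance(2, 1, 3,
                          plus=lambda x, y: x + y,
                          times=lambda x, y: x * y)        # 2 + 1*3 = 5

# Boolean semiring (set semantics): is the tuple derivable at all?
possible = provenance(True, False, True,
                      plus=lambda x, y: x or y,
                      times=lambda x, y: x and y)          # True

print(multiplicity, possible)
```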
Probabilistic Database Systems
-----------------------------------------------------
This section outlines complete end-to-end systems for probabilistic query processing. A number of these are purely academic systems, and not all have been released.
#### MCDB (UFL/Rice/IBM)
#### Orchestra (UPenn)
Though not technically a probabilistic database, Orchestra shares many of the same characteristics. Orchestra was designed as a data-sharing system for biologists, and as a result most of the uncertainty that they deal with comes from data exchange. Under the hood, they share a lot of ideas with C-Tables-based systems like Pip and MayBMS. They also leverage a lot of this infrastructure for provenance computations.
* [ORCHESTRA: facilitating collaborative data sharing](http://dl.acm.org/citation.cfm?id=1247631)
* [ORCHESTRA: Rapid, Collaborative Sharing of Dynamic Data](http://cis.upenn.edu/~zives/research/orchestra-cidr.pdf)
* [The ORCHESTRA Collaborative Data Sharing System](http://dl.acm.org/citation.cfm?id=1462577)
#### Trio (Stanford)
Trio was pretty much the first major system for provenance management, and was also one of the first whose authors recognized the connection with uncertainty (like provenance semirings above). Unfortunately, Trio restricted itself to lineage as a provenance model (Why provenance), and as a result was only able to handle comparatively limited classes of uncertainty and queries.
* [Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4497511)
* [Working Models for Uncertain Data](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1617375)
* [ULDBs: databases with uncertainty and lineage](http://dl.acm.org/citation.cfm?id=1164209)
#### Pip (Cornell)
Pip was an early attempt to fully implement a database engine supporting C-Tables using user-defined types and a front-end query rewriter. Pip builds on MCDB, allowing VG-Functions to be defined alongside secondary metadata functions; consequently, models are defined using what the paper refers to as a 'Grey-Box' function. Although not explicitly called out in the paper, one of the key contributions is the idea of using expressions as variable identifiers: this makes it possible to perform extended projections over probabilistic data without needing to allocate fresh variable identifiers during query evaluation (a toy illustration follows the reference below).
* [PIP: A database system for great and small expectations](http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5447879)
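A toy illustration of "expressions as variable identifiers" (hypothetical names, not Pip's actual API): a projection that adds two uncertain attributes just builds a symbolic expression over the base variables instead of allocating a fresh variable, and the expression is evaluated only once a sample of the base variables is available.

```python
# Toy sketch: symbolic expressions over random variables stand in for
# freshly-allocated variable identifiers. (Hypothetical, for illustration.)

class Var:
    def __init__(self, name):
        self.name = name
    def __add__(self, other):
        return Expr('+', self, other)
    def eval(self, sample):
        return sample[self.name]

class Expr:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def eval(self, sample):
        l, r = self.left.eval(sample), self.right.eval(sample)
        return l + r if self.op == '+' else l * r

A, B = Var('A'), Var('B')
projected = A + B                        # no new variable identifier is allocated
print(projected.eval({'A': 3, 'B': 4}))  # 7, once a sample of A and B is drawn
```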
#### Mimir (UB)
Mimir is the first effort to virtualize uncertainty-based computations. In comparison to other systems, which define an on-disk representation of uncertainty, Mimir borrows the VG-Terms of Pip and defines uncertain relations by the query used to construct them. Then, depending on the specific requirements of the user, Mimir selects one of several intermediate representations best suited for the workload. In addition to exploring virtualized query processing, the Mimir group has also explored techniques for presenting uncertain query results to users.
* [On-Demand Query Result Cleaning](http://www.vldb.org/2014/phd_workshop.proceedings_files/Camera-Ready%20Papers/Paper%201283/p1283-Yang.pdf)
* [Detecting the Temporal Context of Queries](http://link.springer.com/chapter/10.1007/978-3-662-46839-5_7)
* [Lenses: an on-demand approach to ETL](http://dl.acm.org/citation.cfm?id=2824055)
* [Communicating Data Quality in On-Demand Curation](http://odin.cse.buffalo.edu/papers/2016/QDB-Uncertainty-final.pdf)
* [Mimir: Bringing CTables into Practice](https://arxiv.org/abs/1601.00073)
#### Jigsaw (Cornell/Microsoft)
Jigsaw is a variant of MCDB; the underlying system implementation basically follows the same pattern. The goal here is to address the fact that VG-Functions are black boxes, and to identify scenarios where two different parameterizations are going to create similar results. The trick is to generate a bunch of samples from the function and use them as a fingerprint. When approximating query results in a first pass, Jigsaw avoids repeated invocation of a VG-Function by calling it only once per fingerprint (a rough sketch follows the references below).
* [Jigsaw: Efficient optimization over uncertain enterprise data](http://dl.acm.org/citation.cfm?id=1989410)
* DEMO: [Fuzzy prophet: parameter exploration in uncertain enterprise scenarios](http://dl.acm.org/citation.cfm?id=1989482)
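A rough sketch of the fingerprinting trick; the VG function, its parameters, and the sample sizes below are all hypothetical, not Jigsaw's actual implementation.

```python
import random

# Draw a small, seeded batch of samples from a black-box VG function and use it
# as a fingerprint; parameterizations with matching fingerprints share one full
# (expensive) invocation.

def vg_function(mean, spread, rng):
    return rng.gauss(mean, spread)

def fingerprint(mean, spread, k=8):
    rng = random.Random(42)   # fixed seed, so similar parameters give comparable samples
    return tuple(round(vg_function(mean, spread, rng), 2) for _ in range(k))

_cache = {}

def samples_for(mean, spread, n=1000):
    fp = fingerprint(mean, spread)
    if fp not in _cache:      # only pay for the full sample set once per fingerprint
        rng = random.Random()
        _cache[fp] = [vg_function(mean, spread, rng) for _ in range(n)]
    return _cache[fp]

print(len(samples_for(10.0, 1.0)), len(samples_for(10.0, 1.0)))  # second call hits the cache
```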
Querying Machine Learning Models
-----------------------------------------------------
Virtually all probabilistic database systems adopt a data model based on tuples. A number of efforts have instead looked at how to use similar techniques to directly query data defined by a graphical model, and/or how to represent graphical models in a database.
#### BayesStore (Berkeley/UFL)
#### ModelDB (MIT)
Integration for Spark/SciKit that allows users to store, query, and retrieve models.
* [ModelDB: a system for machine learning model management](http://dl.acm.org/citation.cfm?id=2939516)
#### MCMC using IVM
* [Scalable probabilistic databases with factor graphs and MCMC](http://dl.acm.org/citation.cfm?id=1920942)
Probabilistic Programming Languages
-----------------------------------------------------
This section outlines a few languages specifically aimed at allowing programmers to write Turing-complete programs (or at least imperative-looking programs) where data/variables are defined as probability distributions.
#### ENFrame
* [ENFrame: A Framework for Processing Probabilistic Data](http://www.cs.ox.ac.uk/dan.olteanu/papers/os-tods16.pdf)
Formal semantics for a language for defining hierarchical linear regressions, based on similar language extensions found in many R packages. The authors integrate the language functionality into Tabular, a DSL for programming Excel spreadsheets.
Result Completeness
-----------------------------------------------------
A number of efforts have taken a step back from **probabilistic** query processing and come up with some clever things to do when all you have is a weaker description of uncertainty in your data. One particular class of efforts explored in this section is incomplete results: cases where you can precisely characterize what fragment of your input data is missing (e.g., when one partition goes down in a distributed system, the other partitions generally know exactly which shards of data were on that partition).
* [Partial Results in Database Systems](http://dl.acm.org/citation.cfm?id=2612176) develops a classification system for completeness. As in most work on PQP, they adopt notions of value- and row-level uncertainty, which they model by indicating that a particular attribute/table is complete or incomplete (and/or by providing statistical measures over the resulting attributes). They then show how completeness propagates through a query.
* [m-tables: Representing Missing Data](http://drops.dagstuhl.de/opus/volltexte/2017/7061/) is the theoretical answer to Partial Results in Database Systems. The idea is to model tuples that could be missing, along with a hypothetical distribution over their multiplicities.
* [Identifying the Extent of Completeness of Query Answers over Partially Complete Databases](http://dl.acm.org/citation.cfm?id=2750544) does something almost identical. The underlying idea is to provide a set of query *patterns* and show how these patterns propagate through a query. This allows them to relate annotated fragments of query inputs to query outputs, which in turn makes it possible to detect which results are incomplete.
Markov Processes
-----------------------------------------------------
#### Lahar
* [Access Methods for Markovian Streams](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4812407)
* [Approximation trade-offs in a Markovian stream warehouse: An empirical study](http://www.sciencedirect.com/science/article/pii/S0306437912000555)
* THESIS: [Lahar: Warehousing Markovian Streams](http://lahar.cs.washington.edu/content/papers/letchner_thesis.pdf)
* DEMO: [Lahar demonstration: warehousing Markovian streams](http://dl.acm.org/citation.cfm?id=1687605)
#### SimSQL
* [Simulation of database-valued markov chains using SimSQL](http://dl.acm.org/ft_gateway.cfm?id=2465283)
Data Models
-----------------------------------------------------
#### Spatial Data
* [Indexing Metric Uncertain Data for Range Queries](http://dl.acm.org/citation.cfm?id=2723728),
in which the authors propose an indexing scheme for spatial data with an uncertain position. One of several papers that adopts a similar key trick: index one point in the probability distribution. Then, when you query, ask the user to provide a confidence bound/threshold, and use that threshold to define a sufficiently large radius around the area you're querying, ensuring that you hit any point that could possibly be in the result with at least that threshold (a small sketch of the radius expansion follows this entry).
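A minimal sketch of the radius-expansion idea, assuming (purely for illustration) that each uncertain point is indexed at its mean and distributed as an isotropic 2D Gaussian; the paper itself handles general metric uncertain data.

```python
import math

# Pad the query radius so that any object whose probability of falling inside
# the original range is at least tau must have its indexed (mean) position
# inside the padded range: if the mean were farther away than R + d, at most
# tau of the object's probability mass could reach the query region.

def expanded_radius(query_radius, sigma, tau):
    # distance of the point from its mean is Rayleigh(sigma); choose d so that
    # P(distance >= d) <= tau, i.e. d = sigma * sqrt(-2 ln(tau))
    d = sigma * math.sqrt(-2.0 * math.log(tau))
    return query_radius + d

print(expanded_radius(query_radius=10.0, sigma=2.0, tau=0.9))   # ~10.9
print(expanded_radius(query_radius=10.0, sigma=2.0, tau=0.1))   # ~14.3 (lower threshold, bigger pad)
```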
#### Array Data
* [Supporting Data Uncertainty in Array Databases](http://dl.acm.org/citation.cfm?id=2723738),
in which the authors address positional uncertainty in an array database, a premise quite similar to spatial uncertainty but on a discrete domain. The key challenge the paper addresses is the tradeoff between replication (casting attribute-level uncertainty as tuple-level uncertainty as in MayBMS) and the overhead of relaxing query constraints on indexes to detect non-replicated records. Interestingly, more replication increases query cost due to redundant tuple copies getting returned.
Lineage & Why/Why Not Queries
-----------------------------------------------------
* Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases
* Meliou/Gatterbauer/Suciu: "Why So? or Why No?"
* Approximate Lineage for Probabilistic Databases
Sensitivity
-----------------------------------------------------
* [Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases](http://dl.acm.org/citation.cfm?id=1989411)
* [Lenses: An On-Demand Approach to ETL](http://dl.acm.org/citation.cfm?id=2824055)
Explanations
-----------------------------------------------------
* [Scorpion: Explaining away outliers in aggregate queries.](http://dl.acm.org/citation.cfm?id=2536356)
* [A formal approach to finding explanations for database queries.](http://dl.acm.org/citation.cfm?id=2588578)
* [Explaining Query Answers with Explanation-Ready Databases](https://users.cs.duke.edu/~sudeepa/ExplReady-FullVersion.pdf)
Training Probabilistic Databases
-----------------------------------------------------
Classically, PDBs assume that you come to them with data already annotated with probabilities. This is, however, not always the case. Papers in this section explore how probabilistic databases can 'learn' the underlying distribution over time.
#### BetaPDBs
A popular model for probabilistic databases is the Tuple-Independent model (TI-PDBs for short). Tuple-independent probabilistic databases annotate each input tuple with a Bernoulli-distributed random variable. That is, we assume that each row of the input data is present according to a random coin flip. In a Beta-PDB, this is instead a Beta-Bernoulli distribution: it's still a coin flip, but the bias of the coin comes from training data, given by two parameters (typically called a, b). Naively, these parameters represent samples: you flip a coin a+b times and it comes up heads a times; that corresponds to a Beta distribution with parameters a, b (see the tiny worked example after the link below). Propagating this training data through queries turns out to be surprisingly hard, which is the subject of this paper.
* [Beta Probabilistic Databases: A Scalable Approach to Belief Updating and Parameter Learning](http://odin.cse.buffalo.edu/papers/2017/SIGMOD-BetaPDBs-final.pdf)
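A tiny worked example of the Beta-Bernoulli intuition above; the counts are made up, and this is only the intuition, not the paper's machinery.

```python
# The tuple's presence probability is itself uncertain, summarized by counts
# (a, b): roughly, a observations of the tuple being present and b of it being
# absent. The expected presence probability is the mean of Beta(a, b).
a, b = 3, 1

expected_presence = a / (a + b)   # mean of Beta(3, 1) = 0.75

# Folding in one more "present" observation is just a counter update:
a += 1
print(a / (a + b))                # 0.8
```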

User Studies
-----------------------------------------------------
#### Berkeley + Columbia
* [Towards Reliable Interactive Data Cleaning: A User Survey and Recommendations](http://sirrice.github.io/files/papers/cleaning-hilda16.pdf)
A few nice insights from this paper. One obvious one: data cleaning is iterative.
#### When is My Bus
* [When (ish) is My Bus? User-centered Visualizations of Uncertainty in Everyday, Mobile Predictive Systems ](http://idl.cs.washington.edu/files/2016-WhenIsMyBus-CHI.pdf)
Data Visualization
-----------------------------------------------------
* The [VisTrails](http://www.vistrails.org/) ecosystem
* [Fuzzy Prophet](http://dl.acm.org/citation.cfm?id=1989482)
* [VisGets](http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4658131)
* [Semantics of Interactive Visualization](http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=559213)
* Software: [LightTable](http://lighttable.com), [Stenci.la](https://stenci.la)
Data Integration
-----------------------------------------------------
##### Orchestra
* [ORCHESTRA: Facilitating collaborative data sharing](http://dl.acm.org/citation.cfm?id=1247631)
* [ORCHESTRA: Rapid, Collaborative Sharing of Dynamic Data.](http://cis.upenn.edu/~zives/research/orchestra-cidr.pdf)
##### Magellan
Magellan is an end-to-end system for doing entity matching. System may not be the right word, as it's more of a toolkit embedded into the Python Data Science stack (SciPy, NumPy, Pandas, etc...) together with an associated "How-To" guide. A general theme throughout the toolkit is that there are multiple stages in a typical entity-matching pipeline (sampling, blocking, matching, etc...), and for each stage there are a variety of different algorithms available. Magellan helps users identify the right algorithm/procedure for each stage through a few resources: (1) The How-To guide outlines the space, (2) Debugging tools help users rapidly validate and iterate over possibilities in the space, and (3) For several stages, they have developed automated training procedures that interactively gather labels from users to select the algorithm/tool best suited to the user's needs. The final challenge is metadata: incorporating Magellan into the Python Data Science stack requires using Data Frames, which lack support for schema-level metadata (e.g., key attributes or foreign keys), so they developed an external metadata manager to track these associations. They also rewrote many of the existing tools in the stack to propagate this information when available. Unfortunately, propagation isn't guaranteed, so they adopt a validate-and-warn approach when metadata-derived constraints are broken.
Data Extraction
-----------------------------------------------------
### Extracting Web-Tables
##### Tegra
* [TEGRA: Table Extraction by Global Record Alignment](http://dl.acm.org/citation.cfm?id=2723725)
The challenge is parsing sequences of strings (logs, lists, etc...) into tabular data. The basic approach is to (1) use common separators (e.g., ',' or ' ') to tokenize each string, and then (2) align the tokens into columns.
Step 2 is made more complicated by the fact that you can't tell upfront whether two tokens (and their conjoining separator) are actually part of the same column. The paper proposes a coherence metric, based on term co-occurrence elsewhere in the corpus, bounds checking, etc..., that evaluates whether the elements of a column belong together, and solves a bin-packing-style optimization problem to maximize coherence in the extracted table (a minimal sketch of the tokenization step follows below).
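As a minimal sketch of step (1): the input lines and separator set below are made up, and the interesting part of the paper, the coherence-driven alignment of step (2), is not shown.

```python
import re

# Split each input string on a set of candidate separators; a naive alignment
# then treats the i-th token of every line as column i. TEGRA's contribution is
# doing much better than this when token counts and separators disagree.
lines = [
    "2017-06-30, Kennedy; Buffalo NY",
    "2016-01-04, Olteanu; Oxford UK",
]

rows = [[tok.strip() for tok in re.split(r"[,;]", line)] for line in lines]
for row in rows:
    print(row)   # ['2017-06-30', 'Kennedy', 'Buffalo NY'] ...
```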
Data Cleaning
-----------------------------------------------------
#### Katara
Use knowledge-bases and crowdsourcing to clean relational data. Given a table of input data, look the column headers up in the knowledge base to figure out relationships between them. These relationships and corresponding entries already in the KB provide ground truth and a sanity check. Crowdsourcing fills in the blanks.
* [KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing](http://dl.acm.org/citation.cfm?id=2749431)
#### BigDansing
Sanity checking of messy data is typically done with constraint queries. These queries are often *super* expensive, as they involve things like NOT EXISTS, negated implication, or UDFs. BigDansing is a distributed system for processing these types of queries efficiently on large input data.
* [BigDansing: A System for Big Data Cleansing](http://dl.acm.org/citation.cfm?id=2747646)
#### QOCO
QOCO assumes that the input database is messy (contains incorrect tuples and is missing correct tuples). As users query it, they can provide feedback of the form: This result should or should not be in the result set, and QOCO comes up with a resolution plan. The idea, loosely put, is to use crowdsourcing to clean the data. The number of crowd queries is minimized by computing the minimal number of edits required to insert/remove the desired result tuple.
* [Query-Oriented Data Cleaning with Oracles](http://dl.acm.org/citation.cfm?id=2737786)
Information Fusion
-----------------------------------------------------
A broad term roughly equivalent to Information Integration. Information Fusion work typically approaches the process of merging data sources from a domain-specific perspective. As a result, such approaches tend to be more accurate, but less general.
* [Query Time Data Integration](http://www.qucosa.de/recherche/frontdoor/?tx_slubopus4frontend%5bid%5d=urn:nbn:de:bsz:14-qucosa-191560) (Thesis)
* [Human Performance and Data Fusion Based Decision Aids](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.463.2451&rep=rep1&type=pdf)
Provenance
-----------------------------------------------------
#### VisTrails
A multi-granularity model of provenance, and a visualization tool for that model. Groups operations into a hierarchy of modules and allows provenance to be viewed at different levels of aggregation. Concrete implementation on top of Pig Latin.
Sensitivity/Influence
-----------------------------------------------------
* Meliou/Gatterbauer/Suciu: Sensitivity
* Kanagal/Deshpande: Influence
Data Discovery
-----------------------------------------------------
* [Goods: Organizing Google's Datasets](http://dl.acm.org/citation.cfm?id=2903730)
An overview of a crawler developed and run internally at Google to help users track and find datasets. Interesting IMO because of the breadth of metadata it collects about the data. In addition to trivialities like timestamps and formats, the system collects: (1) Provenance, relying on logs from M/R and other bulk data processing systems to establish links between data files, (2) Schema information, where a particularly nifty trick is using protocol buffer code checked into Google's repo to identify candidate schemas, (3) Frequent tokens, keywords, etc..., (4) Metadata from the filename (e.g., date, version, etc...), and (5) Semantic information, where it can be extracted from code comments, dataset content, or other details.
Versioned Data Management
-----------------------------------------------------
##### GIT for Sci Data
* [Looking at Everything in Context](http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper10.pdf)
##### Parameswaran/Deshpande/Madden: DataHub
* [DataHub: Collaborative Data Science & Dataset Version Management at Scale](http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper18.pdf)
Cleaning/Modeling/Extraction/Integration (to be categorized)
-----------------------------------------------------
* Ilyas: Cleanr
* Re: DeepDive
* [Davidson/Milo: MoDAS/CrowdSourcing](http://www.cs.tau.ac.il/~milo/projects/modas/Publications.html)
* [LearnPADS++](https://www.cs.princeton.edu/~dpw/papers/learnpads-padl2012.pdf)
* [Dexter](http://dl.acm.org/citation.cfm?id=2505635)
Model-Building
-----------------------------------------------------
* Jermaine: MCDB/SimSQL
* Wang/Hellerstein: BayesDB
* Deshpande/Madden: MauveDB
* [Crankshaw: Velox](http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper19u.pdf)
* [Papakonstantinou: Plato](http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper26.pdf)
Data/Idea Discovery
-----------------------------------------------------
* [Duggan: Hephaestus](http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper29.pdf)
* Meliou/Gatterbauer
Semistructured Data Management
-----------------------------------------------------
* Idreos: Cracking, ... / Adaptive data management
* Ailamaki/Idreos: NoDB/VirtualDB
* [Here are my Data Files. Here are my Queries. Where are my Results](http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf)
* [Management of Flexible Schema Data in RDBMSs: Opportunities and Limitations for NoSQL](http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper5.pdf)
* [KIDS](http://www.cidrdb.org/cidr2015/Papers/15_Abstract43GD.pdf)
Data Playgrounds
-----------------------------------------------------
* Bakke
* [The Schema-Independent Database UI: A Proposed Holy Grail and Some Suggestions](http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper31.pdf)
* Wolfram Language