pdbench/wikispaces/Datasets

Note: This page is under construction! Explanations and further entries will be added soon!
==Data Generator: Uncertain TPC-H for MayBMS (and Trio)==
|| Availability: || [[http://pdbench.sourceforge.net/MayBMS-tpch.tgz|here]] (580kB .tgz file) ||
|| Contributors: || MayBMS team ||
|| Tags: || Discrete distributions; complex conditions aka "external lineage" ||
|| Comments: || A modification of the standard TPC-H data generator. The archive includes the generator itself (C code extending the standard TPC-H data generator), queries, a translator from attribute-level to tuple-level U-relations (PL/SQL script), and a translator from tuple-level U-relations to Trio ULDBs (PL/SQL script). ||
|| References: || Lyublena Antova, Thomas Jansen, Christoph Koch, Dan Olteanu. "Fast and Simple Relational Processing of Uncertain Data". //Proc. ICDE 2008//. ||
==Data Generator: Uncertain TPC-H for Trio==
|| Availability: || [[http://pdbench.sourceforge.net/TPCH-Trio.zip|here]] (3kB .zip file) ||
|| Contributors: || Trio team ||
|| Tags: || ||
|| Comments: || Two Python data generators that produce vertical and horizontal partitionings of the various TPC-H tables into Trio tables with x-tuples ("run_horizontal_partitioner.sh" shows how to run the horizontal partitioner, for example). The archive also contains DDL commands for some example schema definitions in Trio. (1) The horizontal partitioner simply groups the relational tuples of a given TPC-H table into Trio alternatives with a uniform distribution of confidences; all alternatives in the same group become mutually exclusive (which effectively ignores the original TPC-H keys in the resulting possible worlds). The number of alternatives per x-tuple, the number of partitions, etc., can be adjusted via various parameters. (2) The vertical partitioner can additionally split a TPC-H table into its individual attributes; here, the original TPC-H keys may be maintained and can be used to reconstruct the original possible worlds of the TPC-H database. For the TPC-H settings the Trio team can also provide readily extracted data dumps (> 1 GB) and the queries used in two previous papers. The queries are also in TriQL syntax and probably quite specific to our settings, i.e., they would already serve as actual benchmarks for measuring confidence computation and update performance in Trio. ||
|| References: || ||
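The horizontal partitioning described above can be sketched in a few lines of Python. This is a minimal illustration under assumed interfaces (tuples as an iterable, x-tuples as lists of (tuple, confidence) pairs); the actual generator in the archive is more configurable.

```python
import itertools

def horizontal_partition(tuples, alternatives_per_xtuple):
    """Group consecutive relational tuples into x-tuples of mutually
    exclusive alternatives with a uniform distribution of confidences."""
    xtuples = []
    it = iter(tuples)
    while True:
        group = list(itertools.islice(it, alternatives_per_xtuple))
        if not group:
            break
        conf = 1.0 / len(group)  # uniform confidence per alternative
        xtuples.append([(t, conf) for t in group])
    return xtuples
```

For example, `horizontal_partition(rows, 2)` turns five rows into two x-tuples of two alternatives each (confidence 0.5) plus one certain x-tuple (confidence 1.0).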
==Dataset: IMDB Movies==
|| Availability: || [[http://pdbench.sourceforge.net/movie_data.zip|IMDB Movies and Netflix ratings dataset]] (100kB zipped SQL file); [[http://pdbench.sourceforge.net/movie_query.triql|TriQL queries]] ||
|| Contributors: || Trio team ||
|| Tags: || ||
|| Comments: || This is the IMDB data and query set which is also used for the Trio online demo. It models an "uncertain" data integration scenario between IMDB movies and Netflix ratings, along with a few queries that have been used for the demo (both the data and query files are in TriQL syntax). IMDB movies have been grouped together into x-tuples by similar titles, and each original Netflix rating has been artificially extended by additional rating alternatives according to some normal confidence distribution around the original (i.e., actual Netflix) rating. ||
|| References: || ||
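The rating extension described above can be sketched as follows. This is an assumed reconstruction (discrete 1-5 rating scale, Gaussian weights renormalized to confidences); the exact distribution parameters used for the demo data may differ.

```python
import math

def rating_alternatives(orig, sigma=1.0, scale=(1, 5)):
    """Extend one original Netflix rating into an x-tuple of rating
    alternatives whose confidences follow a normal distribution
    centred on the original rating."""
    lo, hi = scale
    weights = {r: math.exp(-((r - orig) ** 2) / (2 * sigma ** 2))
               for r in range(lo, hi + 1)}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}  # confidences sum to 1
```

For an original rating of 3, the result assigns the highest confidence to 3 and symmetrically lower confidences to 2/4 and 1/5.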
==Dataset: RFID Sensor Data==
|| Availability: || soon ||
|| Contributors: || Mystiq team ||
|| Tags: || ||
|| Comments: || ||
|| References: || ||
==Dataset: Data integration==
|| Availability: || soon ||
|| Contributors: || Twente team ||
|| Tags: || ||
|| Comments: || Tasks: Turn the Twente probabilistic information integrator into a data generator; produce a dataset in the movie rating domain. **Source 1**: XML file constructed by extracting data from [[http://www.tvguide.com/|http://www.tvguide.com]] (unfortunately, TVguide.com discontinued its movie recommendation service; changing to the top 100 most popular): http://library.cs.utwente.nl/xquery/docs/tvguidemostpopular.xml (100 movies; 700kB). **Source 2**: XML file constructed from moviedb-3.24 (see for example uiarchive.cso.uiuc.edu in /pub/info/imdb/tools): http://library.cs.utwente.nl/xquery/docs/ImdbRestricted.xml (243856 movies; 241MB; all movies with at least one genre that is not "Documentary" or "Adult"). **[ToDo]** Data generator based on the probabilistic XML information integrator for these two sources. ||
|| References: || de Keijzer, A. and van Keulen, M. (2008) [[http://eprints.eemcs.utwente.nl/11232/|//IMPrECISE: Good-is-good-enough data integration.//]] ICDE 2008 (demo). van Keulen, M. and de Keijzer, A. and Alink, W. (2005) [[http://eprints.eemcs.utwente.nl/7273/|//A probabilistic XML approach to data integration.//]] ICDE 2005. ||
==Dataset, Generator: IPUMS US census data==
|| Availability: || Anonymized subset of the US census: http://usa.ipums.org/usa/; uncertainty generator: soon; small example involving vertical decomposition and data cleaning: [[http://maybms.cvs.sourceforge.net/viewvc/maybms/maybms/examples/census.sql?view=markup|here]] ||
|| Contributors: || MayBMS team ||
|| Tags: || Discrete distributions; or-sets; data cleaning; conditional probability tables ||
|| Comments: || Tasks: Contribute a MayBMS data generator that reintroduces uncertainty into the census data, together with queries. ||
|| References: || ||
==Nascent Use Case: Skills Management==
|| Availability: || [[http://maybms.cvs.sourceforge.net/viewvc/maybms/maybms/examples/companies.sql?view=markup|here]] ||
|| Contributors: || MayBMS team ||
|| Tags: || Discrete distributions; conditional probability tables ||
|| Comments: || ||
|| References: || ||
==Nascent Use Case: Random Graphs and Social Networks==
|| Availability: || [[http://maybms.cvs.sourceforge.net/viewvc/maybms/maybms/examples/randgraph.sql?view=markup|here]] ||
|| Contributors: || MayBMS team ||
|| Tags: || Discrete distributions ||
|| Comments: || Confidence computation is challenging because the DNFs/lineage get very large and do not decompose using any of the known techniques. ||
|| References: || ||
==Nascent Use Case: Analyzing Web Graphs==
|| Availability: || WWW data set: [[http://www.nd.edu/~networks/resources/www/www.dat.gz|here]], SQL scripts: [[http://pdbench.sourceforge.net/webgraph.zip|here]] (3kB .zip file) ||
|| Contributors: || MayBMS team ||
|| Tags: || Discrete distributions ||
|| Comments: || A variation of the random graph example using web graph data. Each edge is assigned a probability relative to the degrees of its end nodes, so the graph has few edges with high probability while the majority of edges have low probability. The dataset contains example queries computing the probability that a pattern, such as a triangle, occurs in the random graph. ||
|| References: || Dataset due to Réka Albert, Hawoong Jeong and Albert-László Barabási. "Diameter of the World-Wide Web". //Nature// 401, 130 (1999). See [[http://www.nd.edu/~networks/resources.htm]] for more details on the data. ||
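The edge-probability assignment and a pattern query can be sketched in Python. The weighting below (probability inversely proportional to the product of the endpoint degrees) is a hypothetical choice consistent with the description above; the SQL scripts in the archive may use a different formula. Edges are assumed independent, so a fixed triangle's probability is the product of its three edge probabilities.

```python
from collections import Counter

def edge_probs(edges):
    """Assign each edge a probability relative to the degrees of its
    end nodes (assumed scheme: 1 / (deg(u) * deg(v)), so high-degree
    endpoints yield low-probability edges)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {(u, v): 1.0 / (deg[u] * deg[v]) for u, v in edges}

def triangle_prob(p, a, b, c):
    """Probability that the triangle (a, b, c) is present in the
    random graph, assuming independent edges."""
    def q(u, v):
        return p.get((u, v), p.get((v, u), 0.0))
    return q(a, b) * q(b, c) * q(a, c)
```

On a toy graph with edges (1,2), (2,3), (1,3), (3,4) this assigns probability 1/4 to (1,2) and 1/6 to the two edges incident to node 3 that close the triangle.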
==Use Case: Data cleaning==
|| Availability: || Generator for uncertain data: [[http://pdbench.sourceforge.net/census.zip|here]] (424kB .zip file) ||
|| Contributors: || MayBMS team ||
|| Tags: || ||
|| Comments: || This is a noise generator for the census data set. It contains two tools: DataNoise, which inserts noise in the form of or-sets into the census data set, and Chase, which cleans the data using dependencies on the data set. More information is available in the README file supplied with the archive. ||
|| References: || Original dataset available for download at [[http://www.ipums.org]]. See Lyublena Antova, Christoph Koch, Dan Olteanu. "10^10^6 Worlds and Beyond: Efficient Representation and Processing of Incomplete Information". //Proc. ICDE 2007// for more details on the representation and query rewriting. ||
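Or-set noise insertion of the kind DataNoise performs can be sketched as follows. This is a minimal illustration under assumptions not taken from the archive (rows as dicts, a fixed noise rate and or-set size, the true value always kept in the or-set); consult the README for the tool's actual behaviour.

```python
import random

def add_orset_noise(rows, column, domain, noise_rate=0.2,
                    orset_size=3, seed=0):
    """Replace some values in `column` with or-sets: lists of possible
    values that always contain the original (true) value."""
    rng = random.Random(seed)
    noisy = []
    for row in rows:
        row = dict(row)  # leave the input rows untouched
        if rng.random() < noise_rate:
            alts = {row[column]}
            while len(alts) < orset_size:
                alts.add(rng.choice(domain))  # pad with random domain values
            row[column] = sorted(alts)
        noisy.append(row)
    return noisy
```

Running it with `noise_rate=1.0` turns every value in the chosen column into a three-element or-set containing the original value.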