pdbench/census/README

100 lines
5.3 KiB
Plaintext

This is a noise generator for the census data set.
The generator is implemented in Java & JDBC and uses as storage engine PostgreSQL.
----------------
Data
----------------
The generator works with the anonymized US census data set, which can be obtained from [2].
----------------
Data Format
----------------
The generator generates data in UWSDT format as used by MayBMS v1 and described in reference [1]. In short, the schema of the database contains the following relations:
rt(<column1>, ..., tid int)
ft(Relation text, tid int, col text, cid text, hid text)
ct(cid text, hid text, wid text, value int)
wt(cid text, wid text)
----------------
Files:
----------------
* DataNoise/
introduces noise in the data set by inserting additional values in randomly selected fields, that is, by replacing certain values by or-sets.
* Chase/
implements the chase procedure to ensures a given set of equality-generating dependencies hold on the data.
* Queries/
example queries on the census data.
* generate_data.sh
example script to generate uncertain data.
* run_queries.sh
example script to run queries on the census data.
Note: DataNoise and Chase are not constrained on the census data set; the configuration files allow for use with other data sets and schemata. Currently the programs are constrained to work with a single relation only but the data model allows for multiple relations and there is no conceptual difficulty in extending the code to support that.
Below is a detailed description of the subprojects.
----------------
DataNoise
----------------
Inserts additional values to randomly selected fields of the relation.
To compile:
javac -sourcepath DataNoise/src -d DataNoise/bin/ DataNoise/src/*.java
To run follow the steps:
1. Create index on the template relation <rt> if one does not exist:
create index rtididx on rt(tid);
2. java -classpath DataNoise/bin:DataNoise/jdbc3.jar Main <rt> <ft> <ct> <size> <arity> <noise_ratio> <settings.xml> | psql -d <database>
The following parameters have to be specified:
- rt, ft, ct: names of the template, mapping and component relation, respectively
- size: size of the template relation rt where noise should be introduced. Tuples in rt are expected to have tuple id tid ranging from 1 to size.
- arity: arity of the relation (not counting the tid column)
- noise_ratio: percentage of the fields in the relation that should contain uncertainty.
- settings.xml: file with additional settings in xml format, including the following:
* dbname, username, password: database name, user name and password, respectively
* maxholespertuple: max number of uncertain fields per tuple
* maxholesize: max number of values to generate for an uncertain field
* vars: description of attribute names and possible values
See DataNoise/settings.xml for more information.
In short, the algorithm works as follows. Given the desired <noise_ratio>, it selects an appropriate number of tuples to add noise to, and then chooses for each of them uniformly at random between 1 and <maxholespertuple> fields. Those fields are then replaced by or-sets by inserting additional values from the attribute domain, where the max number of values to insert is controlled by the <maxholesize> argument and is selected uniformly at random between 1 and the min of <maxholsize> and domain size. The original value of the field is preserved. The output of the program is the sequence SQL functions needed to execute the updates, which can be streamed to psql or saved in a file for later execution.
----------------
Chase
----------------
Chases a given set of dependencies on a UWSDT, see [1] for more details. Assumes that the UWSDT relations (template rt, mappping ft, component ct) exist. The UWSDT need not be an or-set database, it can be a general UWSDT. Currently, only one (template) relation is supported.
To compile:
javac -sourcepath Chase/src -d Chase/bin/ Chase/src/*.java
To run follow the steps:
1. Load language pl/pgSQL in PostgreSQL (in case not already done):
psql -d <database> -c 'create language plpgsql'
2. Load the prsel pl/pgSQL function needed for the chase procedure:
psql -d <database> -f Chase/prsel.sql
2. Create indices on relations:
psql -d <database> -c 'create index ftididx on ft(tid);'
psql -d <database> -c 'create index fcididx on ft(cid);'
psql -d <database> -c 'create index fhididx on ft(hid);'
psql -d <database> -c 'create index ccididx on ct(cid);'
psql -d <database> -c 'create index chididx on ct(hid);'
java -classpath Chase/bin:Chase/jdbc3.jar Main <rt> <ft> <ct> deps.xml
The following parameters have to be specified:
- rt, ft, ct: names of the template, mapping and component relation, respectively
- dependencies.xml: xml file specifying the following
* dbname, username, password: database name, user name and password, respectively
* rules: description of the dependencies to be chased. The currently supported type of dependencies is 'eq', equality-generating dependencies of the form
IF Attr1 \theta c1 [and ...] THEN Attr0 \theta c0
See Chase/dependencies.xml for more information.
----------------
References:
----------------
[1] "10^10^6 Worlds and Beyond: Efficient Representation and Processing of Incomplete Information", Lyublena Antova, Christoph Koch, Dan Olteanu, Proc. ICDE 2007
[2] "Integrated Public Use Microdata Series: V3.0", Steven Ruggles et al., 2004, http://www.ipums.org