1 Details on Quick Start Guide
1.1 PDGF zip contents
The PDGFEnvironment folder contains the file “pdgf.jar”, which is the main program, and a folder named “config”, which contains the configuration files delivered with PDGF. In this tutorial, we will use two of them: “demo-schema.xml” (which contains the schema information) and “default-csv-generation.xml” (which contains the output specification). PDGF always needs both types of configuration file, a schema configuration and a generation (output) configuration, to generate data.
The following table shows an overview of the directory structure of the PDGF environment:
config/ | Folder containing the schema and generation xml config files |
dict/ | Folder containing dictionaries |
extlib/ | Folder containing external libs used by PDGF |
output/ | Default output folder for generated data |
plugins/ | Folder containing plugins loaded during PDGF execution |
LICENSE.txt | PDGF’s license |
pdgf.jar | The main program |
THIRD_PARTY_LICENSE.txt | Licenses of third party libraries |
1.2 PDGF options
The following command was used in the quick start to generate the demo example schema:
cd PDGFEnvironment
java -jar pdgf.jar -l demo-schema.xml -l default-csv-generation.xml -c -ns -sf 1 -s
This command uses the following PDGF options:
-l | This loads a configuration file. The schema configuration file must be loaded first, followed by the generation configuration file |
-c | Automatically exits PDGF when data generation is complete |
-ns | Does not start the interactive shell |
-sf | Sets the scale factor of the dataset. This overrides the scale factor setting in the schema configuration file. |
-s | Starts the data generation |
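As a minimal sketch, the quick-start command can also be assembled from a script. The helper names `build_pdgf_command` and `run_pdgf` are hypothetical and not part of PDGF. Only the options listed in the table above are used, and the script assumes that `java` is on the PATH and that `PDGFEnvironment` is the working directory.

```python
import subprocess
from typing import List


def build_pdgf_command(schema: str, generation: str, scale_factor: float = 1) -> List[str]:
    """Assemble the quick-start command line (hypothetical helper, not part of PDGF)."""
    return [
        "java", "-jar", "pdgf.jar",
        "-l", schema,              # the schema configuration must be loaded first
        "-l", generation,          # the generation configuration is loaded second
        "-c",                      # exit PDGF when data generation is complete
        "-ns",                     # do not start the interactive shell
        "-sf", str(scale_factor),  # override the scale factor from the schema file
        "-s",                      # start the data generation
    ]


def run_pdgf(env_dir: str, schema: str, generation: str, scale_factor: float = 1) -> None:
    """Run PDGF inside env_dir; raises CalledProcessError on a non-zero exit code."""
    command = build_pdgf_command(schema, generation, scale_factor)
    subprocess.run(command, cwd=env_dir, check=True)
```

Calling `run_pdgf("PDGFEnvironment", "demo-schema.xml", "default-csv-generation.xml")` would then correspond to the two shell commands shown above.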
2 Demo modification examples
2.1 Demo schema modification
2.1.1 Schema xml explanation
The schema xml file describes the dataset. It purely describes how the data are structured and what each field value looks like. The format in which the data are written out, and the system they are written to, are configured in the generation xml file.
To modify the provided demo schema, take a look at the demo-schema.xml file by opening it in a text editor. Every field value is generated by a specific field value generator configured in the demo-schema.xml. The general structure is:
<schema>
  ...Internal PDGF settings...
  <property name="SF" type="double">200</property>
  ...
  <table name="table1">
    <size>100 * ${SF}</size>
    <field name="field1" size="20" type="...">
      <gen_GENERATOR1>
        ...generator specific elements...
      </gen_GENERATOR1>
    </field>
    ...
  </table>
  ...
</schema>
In this example, a table named “table1” is generated. The table will have 100 * 200 = 20000 rows (the size element within table, evaluated with SF = 200). The table has one field, “field1”, which has a field width of 20 characters (the size attribute of field). The value in this field is generated by a generator named “GENERATOR1”, which has to be either one of the delivered core generators (like “Id”) or a custom-implemented generator.
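The row count arithmetic can be reproduced with a small sketch. It only illustrates how the size expression resolves for a given SF value and is not PDGF's own expression evaluator. The function name `resolve_size` is hypothetical, and the sketch assumes that the `-sf` option replaces the value of the SF property.

```python
import ast
import operator
import re
from typing import Dict, Union

Number = Union[int, float]

# Only the four basic arithmetic operators are supported in this sketch.
_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST) -> Number:
    """Evaluate a parsed arithmetic expression made of numbers and + - * /."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
        return _OPERATORS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")


def resolve_size(expression: str, properties: Dict[str, Number]) -> int:
    """Substitute ${NAME} placeholders and evaluate the resulting arithmetic."""
    substituted = re.sub(
        r"\$\{(\w+)\}", lambda match: str(properties[match.group(1)]), expression
    )
    return int(_evaluate(ast.parse(substituted, mode="eval").body))


rows_default = resolve_size("100 * ${SF}", {"SF": 200})  # SF as set in the schema
rows_sf_one = resolve_size("100 * ${SF}", {"SF": 1})     # SF as set by "-sf 1"
```

With the schema's SF of 200 the expression yields 20000 rows. Under the stated assumption, running with `-sf 1` would yield 100 rows for the same table.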
2.1.2 List PDGF’s core generators
To be able to change the dataset, get familiar with PDGF’s core generators first. To get a list of all data generators present in PDGF, run it in interactive mode:
java -jar pdgf.jar
You will be greeted with the PDGF command line interpreter:
PDGF:>
Now you are in the interactive PDGF shell. Running
PDGF:> help
in this shell lists all commands the shell understands. You can list all available generators by running the elementList command
PDGF:> ell generator
in the shell. If you need help for a specific generator, use its associated number as argument for the command:
PDGF:> ell generator
...
26: Generator | pdgf.generator.Id
...
PDGF:> ell generator 26
Details for: pdgf.generator.Id
Tag usage:
...
You can exit the PDGF shell with:
PDGF:> exit
2.1.3 schema.xml modification example
You can modify the schema to add, for example, an email address. Open “demo-schema.xml” in an editor and paste the following lines right after the firstname field element:
<field name="email_address" size="50" type="CHAR" primary="false">
  <gen_Email>
    <file>dicts/mail_provider.dict</file>
    <reference id="lastname" field="c_last_name"/>
    <reference id="firstname" field="c_first_name"/>
  </gen_Email>
</field>
Save the modified file as “demo-schema_modified.xml” and re-run the data generation:
java -jar pdgf.jar -l demo-schema_modified.xml -l default-csv-generation.xml -c -ns -sf 1 -s
Your customer output file (in “output/Customer.csv”) now has an additional column with an email address referencing the customer’s name. Note that there is no need to alter the generation configuration file, as only an additional column was added which will be written out the same way as the other fields.
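One way to check that the extra column arrived is to count the columns per row before and after the change. The following is a minimal sketch. The function name `column_counts` is hypothetical, and it assumes the default comma delimiter and that field values contain no unquoted delimiter characters.

```python
import csv
from typing import Set


def column_counts(path: str, delimiter: str = ",") -> Set[int]:
    """Return the distinct column counts found in a delimited text file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        # Skip empty lines; every remaining row contributes its field count.
        return {len(row) for row in reader if row}
```

Running `column_counts("output/Customer.csv")` once with the original schema and once with “demo-schema_modified.xml” should return a single value each time, with the second value larger by one.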
2.2 Demo generation modification
2.2.1 Generation xml explanation
The generation xml configuration specifies how data are written out. It is possible to specify the format of the data as well as where the data should be written to (write to flat files on disk, stream the generated data into databases/Kafka/Hadoop/…). The general structure is:
<generation>
  ...Internal PDGF settings...
  <output name="DEFAULTOUTPUT">
    <fileTemplate>...</fileTemplate>
    <outputDir>...</outputDir>
    <fileEnding>...</fileEnding>
    <delimiter>...</delimiter>
    ...
  </output>
  <schema name="default">
    <tables>
      <table name="table1">
        ...specific output configuration for table1 only...
      </table>
    </tables>
  </schema>
</generation>
In this xml fragment, one default output is specified. The output plugin here is “DEFAULTOUTPUT”, which has to be either a core output plugin (like CSVRowOutput) or a custom output plugin class. This default output is used for every table that has no entry within the tables element. It sets the filename pattern of the files the data are written to, the directory in which the result files are placed, the filename extension, and the delimiter used to separate the fields.
For the table “table1”, the above example provides a specific output configuration. “table1” therefore uses its own output configuration, while all other tables keep using the default output configuration. If all tables should use the default output, leave the tables element empty (the tables element itself must still be present).
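The fallback rule described above can be sketched as a lookup over the tables element. This mirrors only the behavior described in this section and is not PDGF's internal logic. The function name `output_source` is hypothetical, and the table entry in the sample document is left empty as a placeholder for the table-specific configuration.

```python
import xml.etree.ElementTree as ET

# Reduced sample document following the general structure shown above.
GENERATION_XML = """
<generation>
  <output name="DEFAULTOUTPUT"/>
  <schema name="default">
    <tables>
      <table name="table1"/>
    </tables>
  </schema>
</generation>
"""


def output_source(generation_xml: str, table_name: str) -> str:
    """Report whether a table has its own entry or falls back to the default output."""
    root = ET.fromstring(generation_xml.strip())
    tables = root.find("./schema/tables")
    if tables is None:
        raise ValueError("the tables element must be present, even if it is empty")
    for table in tables.findall("table"):
        if table.get("name") == table_name:
            return "table-specific"
    return "default"
```

For the sample document, `output_source(GENERATION_XML, "table1")` reports a table-specific output, while any other table name falls back to the default output.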
2.2.2 List PDGF’s outputs
To get a list of all outputs present in PDGF, run it in interactive mode again:
java -jar pdgf.jar
Now, you can list all available outputs by running the elementList command
PDGF:> ell output
in the shell. If you need help for a specific output, use its associated number as argument for the command:
PDGF:> ell output
...
1: FileOutputSkeleton | pdgf.output.CSVRowOutput
...
PDGF:> ell output 1
Details for: pdgf.output.CSVRowOutput
Tag usage:
...
You can exit the PDGF shell with:
PDGF:> exit
2.2.3 generation.xml modification example
To change the output format, open “default-csv-generation.xml” in a text editor (like vi or notepad). To change the field delimiter of the default output from “,” to, e.g., “|”, change the content of the “delimiter” element from “,” to “|”. Save the file as “default-csv-generation_modified.xml” and re-run the data generation:
java -jar pdgf.jar -l demo-schema_modified.xml -l default-csv-generation_modified.xml -c -ns -sf 1 -s
Your output files now use a pipe as field delimiter. PDGF also ships with a file that already contains this modification: “default-psv-generation.xml”. You can load this file in place of “default-csv-generation_modified.xml” in the above command to achieve the same effect:
java -jar pdgf.jar -l demo-schema_modified.xml -l default-psv-generation.xml -c -ns -sf 1 -s
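The manual delimiter edit from this section can also be scripted. The following is a minimal sketch. The function name `set_default_delimiter` is hypothetical, and it assumes that the delimiter element sits directly under the default output element, as in the general structure in section 2.2.1. Note that Python's ElementTree does not preserve XML comments or a DOCTYPE by default, so editing by hand is the safer choice if the file contains either.

```python
import xml.etree.ElementTree as ET


def set_default_delimiter(source_path: str, target_path: str, delimiter: str) -> None:
    """Copy a generation config and replace the default output's field delimiter."""
    tree = ET.parse(source_path)
    # Assumed location: <generation><output ...><delimiter>...</delimiter></output>
    element = tree.getroot().find("./output/delimiter")
    if element is None:
        raise ValueError("no delimiter element found under the default output")
    element.text = delimiter
    tree.write(target_path, encoding="utf-8", xml_declaration=True)
```

Under these assumptions, `set_default_delimiter("config/default-csv-generation.xml", "config/default-csv-generation_modified.xml", "|")` would produce the modified file used in the commands above.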