IFL User Guide

This walks you through the usage of the PostgreSQL IFL, mapping modifications, and possible input errors and how to avoid them.

1 Prerequisites
2 Single File Loading Guide
3 Batch Loading Guide
4 Data Integrity Check
5 StdErr
6 Debugging and Testing

Prerequisites

Intermediate file (or directory of such files) that contains the data to be loaded in tab-delimited format.
- The filename should be of this format: freetext.tablename (Sample filenames: project1.marker, project1.marker_prop, project1.germplasm, project1.dnasample) The IFL will determine the table to load the file to based on its extension.
- These files should have the same columns as the table they will be loaded to. For example:

Note that the columns in the intermediate file doesn't need to be complete nor in order. The only required columns in the intermediate files are those that have "not null" constraints in their corresponding tables.
- Foreign columns, ie. IDs derived from another table (ex. strand_id, reference_id), by convention, we name them
  - _name
  (ex. strand_name, reference_name). These columns will be replaced by database IDs based on the mapping files (discussed here) when you run the IFL.
  - This is, however, configurable so columns can actually be named however you want, you just have to update the mapping files accordingly.
Output Directory where the IFL will write temporary files to. There will be two temporary files, ppd_*.tablename which will have all the name columns replaced with ids, and the nodups_*.tablename which will have no duplicate rows and is the actual file to bulk load. Postgres user should have write access to this directory.
Database Connection String which specifies the database server, port, user, and database name (RFC 3986 URI). It will be passed as a parameter to the IFL command call.

It is up to you to determine the hierarchy of files to load. For example, when loading germplasm and dnasample data, you should always load the germplasm first, otherwise the dnasample.germplasm_id won't be fetched as it doesn't exist yet.

Single File Loading Guide

The main entry point for the IFL is gobii_ifl.py (located in $IFL_ROOT/gobii_ifl/gobii_ifl.py). Run "gobii_ifl -h" to display the usage options
gobii_ifl.py -c <connectionString> -i <inputFile> -d <inputDir> -o <outputDirectory> -v -h = Usage help -c or --connectionString = Database connection string (RFC 3986 URI). Format: postgresql://[user[:password]@][netloc][:port][/dbname][?param1=value1&...] -i or --inputFile = The intermediate file. Expected format: freetext.tablename -d or --inputDir = The input directory. The IFL will load each file found in this directory. In case both inputFile and inputDir is specified, inputDir will take precedence. -o or --outputDir = The output directory where preprocessed file and file for bulk loading (no duplicates) will be placed. Ensure that this path is writeable. -v or --verbose = Print the status of the IFL in more detail -l or --fileLengthCheck = This will check if the preprocessed file is of the same length as the input file. A mismatch indicates duplicate entries in the table where the NMAP file maps to. Input file should not be loaded in that case.
To load a single intermediate file, run: $> python gobii_ifl.py -c <connectionString> -i <inputFile> -o <outputDir>

For example:

python gobii_ifl.py  -c postgresql://appuser:appuser@localhost:5432/gobii_rice_test -i /shared_data/gobii/crops/maize/loader/digest/jdl232-01.marker -o /tmp/

If loading is successful, you'll see the following output messages:

On the other hand, if there's an error, the error message will be printed to stderr.

There may be cases when the output directory you specify can be written to but files cannot be deleted, like in the example above. User postgres can write to /tmp/ directory but cannot remove those temp files so an error message will be thrown, saying something like "Failed to remove temporary files. Check file permissions. Error: [Errno 1] Operation not permitted: '/tmp/ppd_jdl232-01.marker'" which is okay as this directory is wiped clean on certain intervals (or on reboot) anyway.

Batch Loading Guide

The main entry point for the IFL is gobii_ifl.py (located in $IFL_ROOT/gobii_ifl/gobii_ifl.py). Run "gobii_ifl -h" to display the usage options
gobii_ifl.py -c <connectionString> -i <inputFile> -d <inputDir> -o <outputDirectory> -v -h = Usage help -c or --connectionString = Database connection string (RFC 3986 URI). Format: postgresql://[user[:password]@][netloc][:port][/dbname][?param1=value1&...] -i or --inputFile = The intermediate file. Expected format: freetext.tablename -d or --inputDir = The input directory. The IFL will load each file found in this directory. In case both inputFile and inputDir is specified, inputDir will take precedence. -o or --outputDir = The output directory where preprocessed file and file for bulk loading (no duplicates) will be placed. Ensure that this path is writeable. -v or --verbose = Print the status of the IFL in more detail -l or --fileLengthCheck = This will check if the preprocessed file is of the same length as the input file. A mismatch indicates duplicate entries in the table where the NMAP file maps to. Input file should not be loaded in that case.
To do batch loading, run: $> python gobii_ifl.py -c <connectionString> -d <inputDir> -o <outputDir>

For example:

Similar to single file loading, you get a "File loaded successfully" per file loaded, or an error message "Failed to load <name_of_file>. Error=<error_message>" printed to stderr if it wasn't successful.

Note that the batch loader will continue loading succeeding files even if a certain file fails. What you can do afterwards, in case any of your files failed, is to look at the stderr that lists out all the failed files, fix those files and move them to a new directory then re-run the IFL batch loader.

Although the IFL allows the input directory to have any kind of intermediate file (ex. *.germplasm, *.marker, *.dnasample), it is best not to mix them. Otherwise, you need to make sure that the hierarchy is respected. The IFL will fetch a directory's content based on the OS's sorting of them, which will likely put your intermediate file in the incorrect order. On the other hand, if you only put one type, say *.marker, in the input directory, regardless of the number of files, the IFL will be able to handle them correctly.

Data Integrity Check

Usage:

This will get the total number of lines in the input file (ex. project1.marker) and the preprocessed file (ppd_project1.marker) and compare them. If the number of lines match, the IFL will proceed as usual. If they don't match and your running a single file loading, the IFL will throw an error and exit. If they don't match and you were running a batch loading, the IFL will print an error message to stderr and proceed with the next file. A mismatch in the line count on those files indicates duplicate entries in the table where the NMAP file maps to (which should really be fixed).

This feature is disabled by default because having it enabled for large files defeats the purpose of the bulk loading (FDW table). I did research on the fastest and most efficient way to get the number of lines in a file in Python, ran tests, and came up with the one that performs best. However, this will still be a bottleneck when we're loading large files. This will work for up to a few million lines, but in the magnitude of hundreds of millions/billions, it probably will slow down the IFL significantly (not tested). The bulk loading IFL is doing pretty good loading rates without it. So I leave the decision to use it on a case to case basis on your hand.

StdErr

All error messages and stack traces are printed to stderr. So you can do something like:

Debugging and Testing

The IFL is composed of two components, preprocess_ifl and load_ifl. You can run them directly without using gobii_ifl.py if ever you need to. Preprocess_ifl will replace names with their corresponding IDs in the database and output result to ppd_*.tablename. Load_ifl will remove duplicates and output the result to nodups_*.tablename and bulk load it to the corresponding table.