IFL Mapping Files

 

Mapping files are part of the IFL distribution and are located at $IFL_ROOT/gobii_ifl/res/map. Because this directory is a data package, the scripts load the files as package resources:

nameMappingFile = resource_stream('res.map', tableName+'.nmap')

There are two types of mapping files: the name mapping (nmap) and the duplicate mapping (dupmap). These are discussed in detail below.

Name Mapping Files (NMAP)

The name mapping file specifies which columns the preprocessing script converts to IDs and how each one is mapped.

Format and example (germplasm.nmap):

File Column To Match | Derived Column Alias | Table Name | Table Column To Match | Table's ID Column | Table Alias
species_name         | species_id           | cv         | term                  | cv_id             | cv1
type_name            | type_id              | cv         | term                  | cv_id             | cv2

This tells the preprocessing script to look up the ID for each value in the species_name column in the database table 'cv', using the criterion species_name = cv.term. In the result file, the column is renamed to the alias 'species_id' so that it maps directly to the species_id column of the germplasm table. The same applies to the type_name column, which becomes type_id in the output file. The table alias is required when the same table appears more than once in the mapping, as here, where both species_name and type_name map to 'cv'. Otherwise it can be omitted, but you must still leave a placeholder for the empty field (a trailing tab).
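As an illustration only, the six-field layout described above could be read as follows. This is a minimal sketch, not the IFL's actual parser: `NameMapping` and `parse_nmap` are hypothetical names, and the sample text assumes tab-delimited data rows with no header row in the file.

```python
from typing import List, NamedTuple


class NameMapping(NamedTuple):
    file_column: str    # column in the input file to match
    column_alias: str   # name the column gets in the output file
    table_name: str     # database table in which the ID is looked up
    table_column: str   # table column compared against the file column
    id_column: str      # table column that holds the ID to fetch
    table_alias: str    # empty when the table appears only once


def parse_nmap(text: str) -> List[NameMapping]:
    """Parse tab-delimited name-mapping rows (hypothetical helper)."""
    mappings = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        # A row without a table alias ends in a trailing tab,
        # so it still splits into six fields (the last one empty).
        if len(fields) != 6:
            raise ValueError(
                f"expected 6 tab-delimited fields, got {len(fields)}: {line!r}"
            )
        mappings.append(NameMapping(*(f.strip() for f in fields)))
    return mappings


# Assumed file content, mirroring the germplasm.nmap example.
sample = (
    "species_name\tspecies_id\tcv\tterm\tcv_id\tcv1\n"
    "type_name\ttype_id\tcv\tterm\tcv_id\tcv2\n"
)
germplasm_mappings = parse_nmap(sample)
```

Keeping the trailing tab means every row has the same number of fields, which is why the sketch can reject any row that does not split into exactly six.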

Another example (marker.nmap):

File Column To Match | Derived Column Alias | Table Name | Table Column To Match | Table's ID Column | Table Alias
reference_name       | reference_id         | reference  | name                  | reference_id      | (empty)
strand_name          | strand_id            | cv         | term                  | cv_id             | (empty)

The second row says: for the 'strand_name' column in the file, find its ID in the database table 'cv' using the criterion strand_name = cv.term. In the result file, the column is renamed to the alias 'strand_id' so that it maps directly to the strand_id column of the marker table. Since no table repeats in this mapping, no table alias is needed.

The IFL silently ignores any row in the name mapping file whose file column is not present in the current input file.
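The effect of this rule, together with the column renaming described earlier, can be sketched as follows. The function name `applicable_mappings` and the sample header are assumptions made for illustration; the IFL's own implementation may differ.

```python
def applicable_mappings(header, mappings):
    """Keep only the mapping rows whose file column appears in the input header."""
    present = set(header)
    return [m for m in mappings if m[0] in present]


# Hypothetical input file header: it has strand_name but no reference_name.
header = ["name", "platform_id", "strand_name"]

# Rows of marker.nmap as (file column, alias, table, table column, ID column, table alias).
marker_mappings = [
    ("reference_name", "reference_id", "reference", "name", "reference_id", ""),
    ("strand_name", "strand_id", "cv", "term", "cv_id", ""),
]

used = applicable_mappings(header, marker_mappings)

# Mapped columns take their alias in the output header; the others are unchanged.
aliases = {m[0]: m[1] for m in used}
output_header = [aliases.get(col, col) for col in header]
```

Here the reference_name row is dropped without an error, and only strand_name is renamed.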

Multi-condition Name Mapping

This is a new feature added to the mapping capabilities of the IFL. NMAP files can now have the form:

File: marker_linkage_group.nmap

File Column To Match      | Derived Column Alias | Table Name    | Table Column To Match | Table's ID Column | Table Alias
marker_name,platform_id   | marker_id            | marker        | name,platform_id      | marker_id         | (empty)
linkage_group_name,map_id | linkage_group_id     | linkage_group | name,map_id           | linkage_group_id  | (empty)

Previously, the marker_name column of the marker_linkage_group intermediate file was converted to marker_id (fetched from the database) based only on the mapping "file.marker_name = marker.name". This new feature allows multiple conditions, such as "file.marker_name = marker.name and file.platform_id = marker.platform_id", to derive the correct marker IDs. You can specify as many conditions as needed; the IFL's NMAP module determines each column's type and performs the type casts automatically.
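To make the pairing of comma-separated columns concrete, the sketch below expands one multi-condition row into the kind of condition string quoted above. It is an assumption about how the columns pair up (first with first, second with second), not the NMAP module's real code; `build_conditions` is a hypothetical helper, and the automatic type casts are left out.

```python
def build_conditions(file_columns: str, table_name: str, table_columns: str,
                     file_alias: str = "file") -> str:
    """Pair comma-separated file and table columns into one combined condition."""
    file_cols = [c.strip() for c in file_columns.split(",")]
    table_cols = [c.strip() for c in table_columns.split(",")]
    if len(file_cols) != len(table_cols):
        raise ValueError("file and table column lists must have the same length")
    return " and ".join(
        f"{file_alias}.{fc} = {table_name}.{tc}"
        for fc, tc in zip(file_cols, table_cols)
    )


# First row of marker_linkage_group.nmap.
marker_condition = build_conditions("marker_name,platform_id", "marker", "name,platform_id")
# Second row of marker_linkage_group.nmap.
group_condition = build_conditions("linkage_group_name,map_id", "linkage_group", "name,map_id")
```

A single-column row goes through the same path and yields a single condition, so the older one-to-one mappings are just the one-element case.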

 

The NMAPs of the following tables now use multi-condition mapping by default:

  • marker_linkage_group: as shown above, both marker_id and linkage_group_id are derived using multiple conditions

  • dataset_marker: marker_id column derived using marker_name and platform_id

  • dnarun: dnasample_id column derived using dnasample_name, platename, and project_id

Duplicate Mapping Files (DUPMAP)

Duplicate maps are simpler. They tell the IFL what criteria to use in checking for duplicates.

Format and example (marker.dupmap):

File Column Name | Table Column Name | Table Column Type
platform_id      | platform_id       | integer
name             | name              | text

This tells the script to use the following criteria for duplicates:
If the name column in the file equals marker.name AND the platform_id column in the file equals marker.platform_id, then that row is a duplicate, and the script will exclude it from the file for bulk loading. The third column is the data type of the column being compared; it is used to cast the column via ::<column_type>. A duplicate mapping file can have an arbitrary number of criteria, but note that the comparison is always an exact match.
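The criteria above can be pictured as a single combined predicate. The sketch below is illustrative only: `duplicate_predicate` and the file alias `f` are hypothetical, and it assumes the cast is applied to the file-side value, which the description does not spell out.

```python
def duplicate_predicate(table, criteria, file_alias="f"):
    """Combine (file column, table column, column type) rows into one exact-match predicate."""
    clauses = [
        # Assumption: the ::<column_type> cast is applied to the file-side column.
        f"{file_alias}.{file_col}::{col_type} = {table}.{table_col}"
        for file_col, table_col, col_type in criteria
    ]
    return " AND ".join(clauses)


# Rows of marker.dupmap.
marker_dupmap = [
    ("platform_id", "platform_id", "integer"),
    ("name", "name", "text"),
]
predicate = duplicate_predicate("marker", marker_dupmap)
```

Every criterion is joined with AND, so a row counts as a duplicate only when all listed columns match exactly.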

The duplicate check is limited to the columns of the table you are loading into. For example, a duplicate check for file.marker can only have comparison criteria on the marker table, so comparing against a join of marker and marker_linkage_group (or any other table) is not possible. You must also ensure that the columns being compared are present in the file; otherwise an error is thrown.
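A pre-flight check along these lines can catch a missing column before loading. This is a sketch under assumptions: `missing_dupmap_columns` and the sample header are hypothetical, and the error the IFL actually raises is not shown in this document.

```python
def missing_dupmap_columns(header, criteria):
    """Return the file columns named in the duplicate map that the input file lacks."""
    present = set(header)
    return [file_col for file_col, _table_col, _col_type in criteria
            if file_col not in present]


# Rows of marker.dupmap.
marker_dupmap = [
    ("platform_id", "platform_id", "integer"),
    ("name", "name", "text"),
]

# Hypothetical input file header that lacks platform_id.
header = ["name", "ref", "alts"]
missing = missing_dupmap_columns(header, marker_dupmap)
if missing:
    print("Columns required by the duplicate map are missing: " + ", ".join(missing))
```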

 
