IFL Mapping Files

1 Name Mapping Files (NMAP)
- 1.1 Multi-condition Name Mapping
2 Duplicate Mapping Files (DUPMAP)

Mapping files are part of the IFL distribution and are located at $IFL_ROOT/gobii_ifl/res/map. This directory is a data package, so the scripts reference the files with:

from pkg_resources import resource_stream  # import needed by the call below
nameMappingFile = resource_stream('res.map', tableName + '.nmap')

There are two types of mapping files: the name mapping (nmap) and the duplicate mapping (dupmap). These are discussed in detail below.

Name Mapping Files (NMAP)

The name mapping file specifies which columns the preprocessing script will convert to IDs and how they are mapped.

Format and example (germplasm.nmap):

File Column To Match	Derived Column Alias	Table Name	Table Column To Match	Table's ID column	Table Alias
species_name	species_id	cv	term	cv_id	cv1
type_name	type_id	cv	term	cv_id	cv2

This tells the preprocessing script that, for the column species_name, it should find the ID in the database table 'cv' using the criterion species_name = cv.term. In the result file, the column is then renamed to the column alias 'species_id' so that it maps directly to the species_id column of the germplasm table. The same applies to the type_name column in the file: it is renamed to type_id in the output file. The table alias is necessary when the table name repeats, i.e. when another column (here 'type_id') maps to the same table 'cv'. In other cases you don't need to specify it, but you must leave a placeholder (a trailing tab).
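The six-column layout described above can be read with a few lines of Python. This is only a minimal sketch, not the IFL's actual parser: the NameMapping type and the parse_nmap function are hypothetical names, and the sketch assumes the file holds data rows only (no header row) and that a missing trailing alias should become an empty string.

```python
import csv
import io
from typing import List, NamedTuple


class NameMapping(NamedTuple):
    """One NMAP row (hypothetical type, named after the six documented columns)."""
    file_column: str
    column_alias: str
    table_name: str
    table_column: str
    id_column: str
    table_alias: str  # empty string when no alias is given


def parse_nmap(text: str) -> List[NameMapping]:
    """Parse tab-separated NMAP rows; assumes the text contains no header row."""
    mappings = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        # Pad to six fields so a missing table alias becomes "".
        fields = (fields + [""] * 6)[:6]
        mappings.append(NameMapping(*fields))
    return mappings


# The germplasm.nmap rows from the example above.
GERMPLASM_NMAP = (
    "species_name\tspecies_id\tcv\tterm\tcv_id\tcv1\n"
    "type_name\ttype_id\tcv\tterm\tcv_id\tcv2\n"
)
mappings = parse_nmap(GERMPLASM_NMAP)
```

Each parsed row exposes its fields by name, e.g. mappings[0].column_alias is 'species_id' and mappings[1].table_alias is 'cv2'.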

Another example (marker.nmap):

File Column To Match	Derived Column Alias	Table Name	Table Column To Match	Table's ID column	Table Alias
reference_name	reference_id	reference	name	reference_id
strand_name	strand_id	cv	term	cv_id

The second row says that, for the 'strand_name' column in the file, the script should find the ID in the database table 'cv' using the criterion strand_name = cv.term. In the result file, the column is then renamed to the column alias 'strand_id' so that it maps directly to the strand_id column of the marker table. Since no table repeats in this mapping, no table alias is needed.
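The effect of such a row on the file can be illustrated in memory. The sketch below is not how the IFL performs the lookup (it queries the database); it uses a plain dictionary as a stand-in for the 'cv' table, and the function name, the terms, and the ID values are all made up for illustration.

```python
from typing import Dict, List, Optional


def apply_name_mapping(
    rows: List[Dict[str, object]],
    file_column: str,
    column_alias: str,
    term_to_id: Dict[str, int],
) -> List[Dict[str, Optional[object]]]:
    """Replace file_column with column_alias, swapping each name for its ID."""
    converted = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            if column == file_column:
                # Names without a match get None in this sketch.
                new_row[column_alias] = term_to_id.get(value)
            else:
                new_row[column] = value
        converted.append(new_row)
    return converted


# Hypothetical stand-in for the cv table (term -> cv_id); IDs are invented.
cv_terms = {"plus": 101, "minus": 102}
file_rows = [
    {"name": "m1", "strand_name": "plus"},
    {"name": "m2", "strand_name": "minus"},
]
result = apply_name_mapping(file_rows, "strand_name", "strand_id", cv_terms)
```

After the call, each row carries a strand_id column holding the looked-up ID, and the strand_name column is gone.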

Rows in the name mapping file whose file column is not present in the current input file are silently ignored by the IFL.
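The skipping behaviour can be pictured as a simple filter over the mapping rows. This is a sketch under assumptions: the function name is hypothetical, and each mapping is represented as a plain tuple whose first element is the file column to match.

```python
from typing import List, Sequence, Tuple


def applicable_mappings(
    header: Sequence[str],
    mappings: Sequence[Tuple[str, ...]],
) -> List[Tuple[str, ...]]:
    """Keep only mapping rows whose file column appears in the input header."""
    present = set(header)
    # Rows for absent columns are dropped without any warning or error.
    return [m for m in mappings if m[0] in present]


# The marker.nmap rows from the example above.
marker_nmap = [
    ("reference_name", "reference_id", "reference", "name", "reference_id", ""),
    ("strand_name", "strand_id", "cv", "term", "cv_id", ""),
]
# A hypothetical input file that has no reference_name column.
kept = applicable_mappings(["name", "platform_id", "strand_name"], marker_nmap)
```

Here only the strand_name row is kept, because the input header has no reference_name column.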

Multi-condition Name Mapping

This is a new feature added to the mapping capabilities of the IFL. NMAP files can now have the form:

File: marker_linkage_group.nmap

File Column To Match	Derived Column Alias	Table Name	Table Column To Match	Table's ID column	Table Alias
marker_name,platform_id	marker_id	marker	name,platform_id	marker_id
linkage_group_name,map_id	linkage_group_id	linkage_group	name,map_id	linkage_group_id

Previously, the marker_name column of the marker_linkage_group intermediate file was converted to marker_id (fetched from the database) based only on the mapping "file.marker_name = marker.name". This new feature allows multiple conditions, such as "file.marker_name = marker.name and file.platform_id = marker.platform_id", to derive the correct marker IDs. You can specify as many conditions as needed, and the IFL's NMAP module now detects each column's type and performs the type casts automatically.

The NMAPs of the following tables are now set to multi-condition mapping by default:

marker_linkage_group: both marker_id and linkage_group_id are derived using multiple conditions, as shown above
dataset_marker: marker_id column derived using marker_name and platform_id
dnarun: dnasample_id column derived using dnasample_name, platename, and project_id
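To make the comma-separated form concrete, the sketch below pairs the file columns with the table columns of one multi-condition row and joins them into a single condition string. The function name and the 'file' alias are assumptions made for illustration, and the automatic type casts mentioned above are left out.

```python
def join_condition(
    file_columns: str,
    table_name: str,
    table_columns: str,
    file_alias: str = "file",
) -> str:
    """Build 'file.a = table.x and file.b = table.y' from comma-separated lists."""
    f_cols = [c.strip() for c in file_columns.split(",")]
    t_cols = [c.strip() for c in table_columns.split(",")]
    if len(f_cols) != len(t_cols):
        # Each file column needs exactly one table column to compare against.
        raise ValueError("file and table column lists differ in length")
    return " and ".join(
        f"{file_alias}.{fc} = {table_name}.{tc}" for fc, tc in zip(f_cols, t_cols)
    )


# First row of marker_linkage_group.nmap.
condition = join_condition("marker_name,platform_id", "marker", "name,platform_id")
```

For the first row of marker_linkage_group.nmap this yields the two-condition string quoted earlier; a single-column row produces just one comparison.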

Duplicate Mapping Files (DUPMAP)

Duplicate maps are simpler. They tell the IFL what criteria to use in checking for duplicates.

Format and example (marker.dupmap):

File Column Name	Table Column Name	Table Column Type
platform_id	platform_id	integer
name	name	text

This tells the script to use the following criteria for duplicates:
If the name column in the file equals marker.name AND the platform_id column in the file equals marker.platform_id, then that row is a duplicate, and the script will exclude it from the file for bulk loading. The third column is the data type of the column being compared; it is used to cast the column via ::<column_type>. This mapping file can have an arbitrary number of criteria, but note that the comparison is always an exact match.
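The criteria can be pictured as one AND-ed condition with a cast per row. The sketch below is an illustration only: the function name and the 'file' alias are hypothetical, and applying the ::<column_type> cast to the file-side column is an assumption, since the text does not say which side is cast.

```python
from typing import Iterable, Tuple


def duplicate_criteria(
    dupmap_rows: Iterable[Tuple[str, str, str]],
    table_name: str,
    file_alias: str = "file",
) -> str:
    """Combine DUPMAP rows into one exact-match condition joined with AND."""
    parts = []
    for file_col, table_col, col_type in dupmap_rows:
        # Assumption: the cast is applied to the file column.
        parts.append(f"{file_alias}.{file_col}::{col_type} = {table_name}.{table_col}")
    return " AND ".join(parts)


# The marker.dupmap rows from the example above.
MARKER_DUPMAP = [
    ("platform_id", "platform_id", "integer"),
    ("name", "name", "text"),
]
criteria = duplicate_criteria(MARKER_DUPMAP, "marker")
```

Adding a third row to the DUPMAP simply appends one more AND-ed comparison.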

The duplicate check is limited to the columns of the table you are loading into. For example, a duplicate check for file.marker can only have comparison criteria on the marker table, so comparing against marker joined with marker_linkage_group (or any other table) is not possible. You also have to ensure that the columns being compared are present in the file; otherwise an error will be thrown.
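A quick pre-flight check can catch a missing column before the file is handed to the IFL. This is a hypothetical helper, not part of the IFL; raising ValueError is an assumption, as the text above does not name the error the IFL itself throws.

```python
from typing import Iterable, Sequence, Tuple


def check_dupmap_columns(
    header: Sequence[str],
    dupmap_rows: Iterable[Tuple[str, str, str]],
) -> None:
    """Raise ValueError if a DUPMAP file column is absent from the file header."""
    present = set(header)
    missing = [file_col for file_col, _, _ in dupmap_rows if file_col not in present]
    if missing:
        raise ValueError("columns missing from input file: " + ", ".join(missing))


# The marker.dupmap rows from the example above.
MARKER_DUPMAP = [
    ("platform_id", "platform_id", "integer"),
    ("name", "name", "text"),
]

# Passes silently: both compared columns are in this hypothetical header.
check_dupmap_columns(["name", "platform_id", "strand_id"], MARKER_DUPMAP)
```

Calling the helper with a header that lacks platform_id raises a ValueError naming that column.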