Objective
Prevent wrong or bad data from being loaded into GOBii. Such data will most likely either fail at the database level or corrupt data in the DB.
Success metrics
Assumptions
Milestones
Requirements
# | Requirement | User Story | Importance | Jira Issue | Notes |
---|
1 | digest.germplasm | Required fields: Optional fields: - species_name => germplasm_species in DB[CV table]
- type_name => germplasm_type in DB[CV table]
Conditions: - All required columns must exist in the file
- all required fields cannot contain null or empty (nothing but whitespace) records
- external_code column must not contain any duplicates
- if species_name column exists, then check that the unique set of species values (excluding nulls) exists in CV table germplasm_species group
- if type_name column exists, then check that the unique set of germplasm type values (excluding nulls) exists in CV table germplasm_type group
| HIGH |
|
|
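The file-level conditions listed for digest.germplasm (required columns present, no null or whitespace-only values in required fields, no duplicates in external_code) could be sketched roughly as below. This is only an illustration, not the GOBii implementation: the function name `check_digest_file`, the tab delimiter, and the error message wording are all assumptions, and the list of required columns is passed in by the caller because it is not spelled out in this row.

```python
import csv
import io


def check_digest_file(text, required, unique_cols=(), delimiter="\t"):
    """Return a list of error strings for one delimited digest file.

    Hypothetical helper: `required` and `unique_cols` are supplied by the
    caller; the tab delimiter is an assumption about the digest format.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    header = reader.fieldnames or []

    # Condition: all required columns must exist in the file.
    missing = [c for c in required if c not in header]
    if missing:
        return [f"missing required column: {c}" for c in missing]

    errors = []
    seen = set()
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        # Condition: required fields cannot be null or whitespace-only.
        for col in required:
            value = row[col]
            if value is None or not value.strip():
                errors.append(f"line {line_no}: empty value in required column {col}")
        # Condition: the given column combination must be unique in the file.
        if unique_cols:
            key = tuple(row.get(c) for c in unique_cols)
            if key in seen:
                errors.append(f"line {line_no}: duplicate {'/'.join(unique_cols)} {key}")
            seen.add(key)
    return errors
```

The same helper would cover the "combination must be unique (in file)" conditions in the later rows by passing several column names in `unique_cols`, for example `["name", "external_code", "num"]`.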
2 | digest.germplasm_prop | Required fields: Conditions: - All required columns must exist in the file
- all required fields cannot contain null or empty records
- digest.germplasm must exist
- external_code column must be the same as external_code column in digest.germplasm
|
|
|
|
3 | digest.dnasample | Required fields: - project_id
- name
- external_code
- num
Conditions: - All required columns must exist in the file
- all required fields cannot contain null or empty records
- name/external_code/num combination must be unique (in file)
- if digest.germplasm does not exist, check that unique set of external_codes exist in germplasm table [DB]
- if digest.germplasm exists, external_code column must be the same as external_code column in digest.germplasm Daw 12/10/18
- if digest.germplasm exists, then unique list of external_code should be the same as unique list of external_code in digest.germplasm
|
|
|
|
4 | digest.dnasample_prop | Required fields: - project_id
- dnasample_name
- external_code
- num
Optional fields: - dnasample table must be mapped in this file if this table is mapped
Conditions: - All required columns must exist in the file
- all required fields cannot contain null or empty records
- dnasample_name/external_code/num combination must be unique (in file)
- digest.dnasample must exist
- project_id column must be the same as project_id column in digest.dnasample
- dnasample_name column must be the same as name column in digest.dnasample
- external_code column must be the same as external_code column in digest.dnasample
- num column must be the same as num column in digest.dnasample
|
|
|
|
5 | digest.dnarun | Required fields: - project_id
- experiment_id
- name
Optional fields: - dnasample_name IS NON NULL if exists
- num IS NON NULL if exists
Conditions: - All required columns must exist in the file
- all required fields cannot contain null or empty records
- If dnasample_name column exists then it cannot contain any null or empty records
- if num column exists then it cannot contain any null or empty records
- (daw: removed 10/17/18) if digest.dnasample exists: project_id column contents must exist in digest.dnasample project_id column; dnasample_name column must exist; dnasample_name column contents must exist in digest.dnasample name column; num column must exist; num column contents must exist in digest.dnasample num column
- if digest.dnasample exists: (daw: added 10/17/18)
- dnasample_name column must exist
- unique list of project_id must equal unique list of digest.dnasample.project_id
- unique list of dnasample_name must equal unique list of digest.dnasample.name
- if num column exists:
- unique list of dnasample_name and num combinations must equal unique list of digest.dnasample.name and digest.dnasample.num combinations
- else if digest.dnasample does not exist:
- dnasample_name column must exist DAW: 12/10/18
- check that unique list of dnasample_names exists in dnasample name column [DB] for project_id DAW: 07/11/19
- check that unique list of name (dnarun_names) exists in dnarun name column [DB] for exp_id DAW: 07/11/19
- if num column exists:
- dnasample_name column must exist
- check that unique list of dnasample_name and num combinations exists in dnasample table [DB] within name and num columns
|
|
|
|
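The "unique list must equal unique list" conditions for digest.dnarun against digest.dnasample amount to a set comparison over one or more columns. A minimal sketch follows, assuming both files have already been parsed into lists of dicts; the helper names `unique_values` and `compare_unique_lists` and the sample rows are hypothetical.

```python
def unique_values(rows, columns):
    """Return the set of unique value tuples for the given columns."""
    return {tuple(row[c] for c in columns) for row in rows}


def compare_unique_lists(rows_a, cols_a, rows_b, cols_b):
    """Return (only_in_a, only_in_b); both are empty when the lists are equal."""
    a = unique_values(rows_a, cols_a)
    b = unique_values(rows_b, cols_b)
    return sorted(a - b), sorted(b - a)


# Hypothetical in-memory rows standing in for the two digest files.
dnarun = [
    {"project_id": "1", "dnasample_name": "s1", "num": "1"},
    {"project_id": "1", "dnasample_name": "s1", "num": "1"},
    {"project_id": "1", "dnasample_name": "s3", "num": "1"},
]
dnasample = [
    {"project_id": "1", "name": "s1", "num": "1"},
    {"project_id": "1", "name": "s2", "num": "1"},
]

# dnasample_name/num in digest.dnarun versus name/num in digest.dnasample.
only_in_dnarun, only_in_dnasample = compare_unique_lists(
    dnarun, ["dnasample_name", "num"], dnasample, ["name", "num"]
)
```

Returning the two differences rather than a single boolean lets the validator report which values are missing on which side.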
6 | digest.dnarun_prop | Required fields: Conditions: - All required columns must exist in the file
- all required fields cannot contain null or empty records
- digest.dnarun must exist
- experiment_id column must be the same as experiment_id column in digest.dnarun
- dnarun_name column must be the same as name column in digest.dnarun
|
|
|
|
7 | digest.marker | Required fields: Optional fields: - reference_name => reference_name in DB[References table]
- strand_name => marker_strand in DB[CV table]
Conditions: - All required columns must exist in the file
- all required fields cannot contain null or empty records
- if reference_name column exists then check if unique set of reference names (excluding nulls) exist in reference table
- if strand_name column exists then check if unique set of strands (excluding nulls) exist in CV table marker_strand group
|
|
|
|
8 | digest.marker_prop | Required fields: Conditions: - All required columns must exist in the file
- all required fields cannot contain null or empty records
- digest.marker must exist
|
|
|
|
9 | digest.linkage_group | Required fields: Conditions: - All required columns must exist in the file
- all required fields cannot contain null or empty records
|
|
|
|
10 | digest.marker_linkage_group | Required fields: - platform_id
- map_id
- linkage_group_name
- marker_name
Conditions: - All required columns must exist in the file
- all required fields cannot contain null or empty records
- If digest.linkage_group does not exist, then unique set of linkage_group_names must exist in linkage_group table [DB]
- if digest.marker does not exist, then unique set of marker_names must exist in marker table [DB]
|
|
|
|
11 | digest.dataset_dnarun | Required fields: - experiment_id
- dataset_id
- dnarun_name
- dnarun_idx
Conditions: - All required columns must exist in the file
- all required fields cannot contain null or empty records
- If digest.dnarun does not exist, then unique set of dnarun_names must exist in dnarun table [DB] for associated experiment_id
|
|
|
|
12 | digest.dataset_marker | Required fields: - platform_id
- dataset_id
- marker_name
- marker_idx
Conditions: - All required columns must exist in the file
- all required fields cannot contain null or empty records
- if digest.marker does not exist, then unique set of marker_names must exist in marker table [DB] for associated platform_id
|
|
|
|
13 | digest.matrix | This check can be performed after data transformations but will need Josh's advice. If we have time - Users will greatly appreciate having this on input types instead of extract types. Conditions: - digest.dataset_dnarun and digest.dataset_marker must exist
- Matrix must not contain any null values
- Check for undesirable characters in genotype matrix based on dataset type (check for character length as well)
- Matrix must be rectangular (no jagged edges or double-delimiters)
- (Added 01/02/2019 by Yaw) matrix column and row lengths must correspond to record length in digest.dataset_dnarun and digest.dataset_marker files
Allowed characters for each data type are as follows: - dominant - 0,1,N
- co-dominant – 0,1,2,N
- SNP – A,T,C,G,-,+,N
- Allele sizes – numeric values, N
- All data types - any set on the 'invalid characters' set
|
|
|
|
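The digest.matrix conditions (no nulls, rectangular shape, dimensions matching the dataset_dnarun and dataset_marker record counts, allowed characters per data type) might be checked along the lines below. This is a sketch under several assumptions: the matrix is tab-delimited with no header row, every character of a cell is tested against the allowed set, allele sizes are plain integers or N, and the expected row and column counts are passed in by the caller because the matrix orientation is not stated here. The function name `check_matrix` and the dataset type keys are hypothetical, and the character-length check is left out because no lengths are specified.

```python
# Allowed characters per data type, taken from the list in this row.
ALLOWED = {
    "dominant": set("01N"),
    "co-dominant": set("012N"),
    "snp": set("ATCG-+N"),
}


def check_matrix(lines, dataset_type, expected_rows, expected_cols, delimiter="\t"):
    """Return a list of error strings for a genotype matrix (hypothetical helper)."""
    errors = []
    rows = [line.rstrip("\n").split(delimiter) for line in lines]

    # Row count must correspond to the record count of the matching digest file.
    if len(rows) != expected_rows:
        errors.append(f"expected {expected_rows} rows, found {len(rows)}")

    for i, row in enumerate(rows, start=1):
        # A jagged edge or a double delimiter changes the column count.
        if len(row) != expected_cols:
            errors.append(f"row {i}: expected {expected_cols} columns, found {len(row)}")
        for j, cell in enumerate(row, start=1):
            if cell == "":
                errors.append(f"row {i}, column {j}: empty cell")
            elif dataset_type == "allele sizes":
                # Assumption: a cell is either N or an ASCII integer.
                if cell != "N" and not (cell.isascii() and cell.isdigit()):
                    errors.append(f"row {i}, column {j}: invalid value {cell!r}")
            elif not set(cell) <= ALLOWED[dataset_type]:
                errors.append(f"row {i}, column {j}: invalid value {cell!r}")
    return errors
```

A double delimiter shows up twice in the result, once as a column-count mismatch and once as an empty cell, which makes the offending position easy to locate.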
14 | Webservices needed | - check if species_names exist in CV table
- input: List of names
- output: list of key:value pairs, where key is species_name and value is a boolean indicating whether the key exists in the table
- search: CV table, germplasm_species group
- check if type_name exist in CV table
- input: List of names
- output: list of key:value pairs, where key is type_name and value is a boolean indicating whether the key exists in the table
- search: CV table, germplasm_type group
- check if reference_names exist in reference table
- input: List of names
- output: list of key:value pairs, where key is reference_name and value is a boolean indicating whether the key exists in the table
- search: reference table
- check if strand_name exist in CV table
- input: List of names
- output: list of key:value pairs, where key is strand_name and value is a boolean indicating whether the key exists in the table
- search: CV table, marker_strand group
|
|
|
|
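All four web services listed above share one contract: a list of names goes in, and a mapping from each name to a boolean comes out. The sketch below shows that contract with an in-memory list standing in for the CV table group or reference table. The function names, the placeholder terms, and the choice of exact, case-sensitive matching after trimming whitespace are assumptions, not the actual GOBii service API.

```python
def unique_non_null(values):
    """Unique values from a column, dropping nulls and whitespace-only entries."""
    return sorted({v.strip() for v in values if v is not None and v.strip()})


def names_exist(names, known_terms):
    """Map each name to True/False depending on whether it is a known term."""
    known = set(known_terms)
    return {name: name in known for name in names}


# Hypothetical contents of the germplasm_species group in the CV table.
germplasm_species = ["species_a", "species_b"]

# Hypothetical species_name column from a digest.germplasm file.
column = ["species_a", None, "", "species_c", "species_a"]

result = names_exist(unique_non_null(column), germplasm_species)
```

The validator would then fail the file if any value in `result` is False, and could list the offending names in the error message.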
User interaction and design
Open Questions
Question | Answer | Date Answered |
---|
| |
|
Out of Scope