Objective
Prevent wrong or malformed data from being loaded into GOBii, where it would most likely fail at the database level or corrupt data in the DB.
Success metrics
Assumptions
Milestones
[Roadmap timeline, Jul 2018 to Jun 2019: Milestone 1, Go/No go, Milestone 2]
Requirements
Columns: #, Requirement, User Story, Importance, Jira Issue, Notes
1 digest.germplasm
Required fields:
Optional fields:
- species_name => germplasm_species in DB [CV table]
- type_name => germplasm_type in DB [CV table]
Conditions:
- All required columns must exist in the file
- Required fields must not contain null or empty (nothing but whitespace) records
- external_code column must not contain any duplicates
- If the species_name column exists, check that the unique set of species values (excluding nulls) exists in the CV table, germplasm_species group
- If the type_name column exists, check that the unique set of germplasm type values (excluding nulls) exists in the CV table, germplasm_type group
Importance: HIGH
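The file-level conditions for digest.germplasm (required columns present, no empty required values, no duplicate external_code) can be sketched as below. This is a minimal illustration, not the GOBii implementation: the function name `validate_germplasm`, the tab-delimited layout with a header row, and the example required columns are all assumptions, since the required field list is not given here. The CV table lookups are not covered.

```python
import csv
import io
from collections import Counter


def validate_germplasm(text, required, delimiter="\t"):
    """Return a list of error strings for one digest.germplasm file body.

    `required` is the caller-supplied list of required column names
    (hypothetical here; the source does not list them).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    columns = reader.fieldnames or []
    errors = []

    # Every required column must be present in the header.
    missing = [c for c in required if c not in columns]
    if missing:
        errors.append(f"missing required columns: {missing}")
        return errors  # the remaining checks need these columns

    rows = list(reader)

    # Required fields may not be null, empty, or whitespace-only.
    for line_no, row in enumerate(rows, start=2):  # header is line 1
        for col in required:
            if (row[col] or "").strip() == "":
                errors.append(f"line {line_no}: empty value in {col}")

    # external_code must not repeat within the file.
    if "external_code" in columns:
        counts = Counter((row["external_code"] or "").strip() for row in rows)
        dupes = sorted(code for code, n in counts.items() if n > 1 and code)
        if dupes:
            errors.append(f"duplicate external_code values: {dupes}")

    return errors
```

Returning a list of messages instead of raising on the first failure lets a caller report every problem in the file in one pass.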
2 digest.germplasm_prop
Required fields:
Conditions:
- All required columns must exist in the file
- Required fields must not contain null or empty records
- digest.germplasm must exist
- external_code column must be the same as external_code column in digest.germplasm
3 digest.dnasample
Required fields:
- project_id
- name
- external_code
- num
Conditions:
- All required columns must exist in the file
- Required fields must not contain null or empty records
- name/external_code/num combination must be unique (in file)
- If digest.germplasm does not exist, check that the unique set of external_codes exists in the germplasm table [DB]
- If digest.germplasm exists, external_code column must be the same as external_code column in digest.germplasm
  Daw 12/10/18
- If digest.germplasm exists, then the unique list of external_code should be the same as the unique list of external_code in digest.germplasm
4 digest.dnasample_prop
Required fields:
- project_id
- dnasample_name
- external_code
- num
Optional fields:
- dnasample table must be mapped in this file if this table is mapped
Conditions:
- All required columns must exist in the file
- Required fields must not contain null or empty records
- dnasample_name/external_code/num combination must be unique (in file)
- digest.dnasample must exist
- project_id column must be the same as project_id column in digest.dnasample
- dnasample_name column must be the same as name column in digest.dnasample
- external_code column must be the same as external_code column in digest.dnasample
- num column must be the same as num column in digest.dnasample
5 digest.dnarun
Required fields:
- project_id
- experiment_id
- name
Optional fields:
- dnasample_name (must be non-null if the column exists)
- num (must be non-null if the column exists)
Conditions:
- All required columns must exist in the file
- Required fields must not contain null or empty records
- If dnasample_name column exists, it must not contain any null or empty records
- If num column exists, it must not contain any null or empty records
- If digest.dnasample exists: (daw: removed 10/17/18)
  - project_id column contents must exist in digest.dnasample project_id column
  - dnasample_name column must exist
  - dnasample_name column contents must exist in digest.dnasample name column
  - num column must exist
  - num column contents must exist in digest.dnasample num column
- If digest.dnasample exists: (daw: added 10/17/18)
  - dnasample_name column must exist
  - unique list of project_id must equal unique list of digest.dnasample.project_id
  - unique list of dnasample_name must equal unique list of digest.dnasample.name
  - if num column exists: unique list of dnasample_name and dnasample_num combinations must equal unique list of digest.dnasample.name and digest.dnasample.num combinations
- Else, if digest.dnasample does not exist:
  - dnasample_name column must exist
    DAW: 121018
  - check that the unique list of dnasample_names exists in the dnasample name column [DB] for project_id
    DAW: 0711/19
  - check that the unique list of name (dnarun_names) exists in the dnarun name column [DB] for exp_id
    DAW: 0711/19
  - if num column exists:
    - dnasample_name column must exist
    - check that the unique list of dnasample_name and num combinations exists in the dnasample table [DB] within the name and num columns
6 digest.dnarun_prop
Required fields:
Conditions:
- All required columns must exist in the file
- Required fields must not contain null or empty records
- digest.dnarun must exist
- experiment_id column must be the same as experiment_id column in digest.dnarun
- dnarun_name column must be the same as name column in digest.dnarun
7 digest.marker
Required fields:
Optional fields:
- reference_name => reference_name in DB [References table]
- strand_name => marker_strand in DB [CV table]
Conditions:
- All required columns must exist in the file
- Required fields must not contain null or empty records
- If reference_name column exists, check that the unique set of reference names (excluding nulls) exists in the reference table
- If strand_name column exists, check that the unique set of strands (excluding nulls) exists in the CV table, marker_strand group
8 digest.marker_prop
Required fields:
Conditions:
- All required columns must exist in the file
- Required fields must not contain null or empty records
- digest.marker must exist
9 digest.linkage_group
Required fields:
Conditions:
- All required columns must exist in the file
- Required fields must not contain null or empty records
10 digest.marker_linkage_group
Required fields:
- platform_id
- map_id
- linkage_group_name
- marker_name
Conditions:
- All required columns must exist in the file
- Required fields must not contain null or empty records
- If digest.linkage_group does not exist, then the unique set of linkage_group_names must exist in the linkage_group table [DB]
- If digest.marker does not exist, then the unique set of marker_names must exist in the marker table [DB]
11 digest.dataset_dnarun
Required fields:
- experiment_id
- dataset_id
- dnarun_name
- dnarun_idx
Conditions:
- All required columns must exist in the file
- Required fields must not contain null or empty records
- If digest.dnarun does not exist, then the unique set of dnarun_names must exist in the dnarun table [DB] for the associated experiment_id
12 digest.dataset_marker
Required fields:
- platform_id
- dataset_id
- marker_name
- marker_idx
Conditions:
- All required columns must exist in the file
- Required fields must not contain null or empty records
- If digest.marker does not exist, then the unique set of marker_names must exist in the marker table [DB] for the associated platform_id
13 digest.matrix
This check can be performed after data transformations but will need Josh's advice.
If we have time - Users will greatly appreciate having this on input types instead of extract types.
Conditions:
- digest.dataset_dnarun and digest.dataset_marker must exist
- Matrix must not contain any null values
- Check for undesirable characters in genotype matrix based on dataset type (check for character length as well)
- Matrix must be rectangular (no jagged edges or double-delimiters) (Added 01/02/2019 by Yaw)
- Matrix column and row lengths must correspond to record length in digest.dataset_dnarun and digest.dataset_marker files
Allowed characters for each data type are as follows:
- dominant: 0, 1, N
- co-dominant: 0, 1, 2, N
- SNP: A, T, C, G, -, +, N
- Allele sizes: numeric values, N
- All data types: any set on the 'invalid characters' set
14 Webservices needed
- Check if species_names exist in CV table
  input: list of names
  output: list of key:value pairs, where key is species_name and value is a boolean indicating whether the key exists in the table
  search: CV table, germplasm_species group
- Check if type_name exists in CV table
  input: list of names
  output: list of key:value pairs, where key is type_name and value is a boolean indicating whether the key exists in the table
  search: CV table, germplasm_type group
- Check if reference_names exist in reference table
  input: list of names
  output: list of key:value pairs, where key is reference_name and value is a boolean indicating whether the key exists in the table
  search: reference table
- Check if strand_name exists in CV table
  input: list of names
  output: list of key:value pairs, where key is strand_name and value is a boolean indicating whether the key exists in the table
  search: CV table, marker_strand group
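The matrix conditions in requirement 13 (expected dimensions, rectangular shape, no null cells, allowed characters per data type) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the GOBii validator: `validate_matrix`, its parameters, and the idea that each cell is a short string checked character by character are hypothetical. The matrix orientation is not stated, so the expected row and column counts are passed in by the caller. Allele sizes and the 'invalid characters' set are left out because they are not defined precisely here.

```python
# Allowed symbols per data type, copied from requirement 13.
ALLOWED = {
    "dominant": {"0", "1", "N"},
    "co-dominant": {"0", "1", "2", "N"},
    "SNP": {"A", "T", "C", "G", "-", "+", "N"},
}


def validate_matrix(rows, dataset_type, expected_rows, expected_cols, cell_len=None):
    """Return a list of error strings for a genotype matrix.

    `rows` is a list of rows, each a list of cell strings, for example
    produced by line.split("\t"). A double delimiter then shows up as an
    empty cell. `cell_len` is an optional, assumed fixed cell length.
    """
    allowed = ALLOWED[dataset_type]
    errors = []

    # Row count must match the record count of the corresponding digest file.
    if len(rows) != expected_rows:
        errors.append(f"expected {expected_rows} rows, found {len(rows)}")

    for r, row in enumerate(rows, start=1):
        # A jagged row has a different number of cells than expected.
        if len(row) != expected_cols:
            errors.append(f"row {r}: expected {expected_cols} columns, found {len(row)}")
        for c, cell in enumerate(row, start=1):
            if cell is None or cell == "":
                errors.append(f"row {r}, col {c}: null or empty cell")
            elif cell_len is not None and len(cell) != cell_len:
                errors.append(f"row {r}, col {c}: length {len(cell)} != {cell_len}")
            elif not set(cell) <= allowed:
                errors.append(f"row {r}, col {c}: invalid characters in {cell!r}")

    return errors
```

For example, `validate_matrix([["AA", "AT"], ["NN", "GG"]], "SNP", 2, 2, cell_len=2)` returns an empty list, while a row with a missing cell or a character outside the SNP set produces one message per problem.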
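The four lookups in requirement 14 share one shape: a list of names goes in, and a name-to-boolean mapping comes out. A minimal sketch of that contract is shown below. The function `check_cv_terms`, the in-memory `CV_TABLE` stand-in, and the sample terms are all hypothetical; a real service would query the database, and exact, case-sensitive matching is an assumption.

```python
# Hypothetical in-memory stand-in for the CV table, keyed by group name.
# The terms are made-up sample data, not real GOBii CV entries.
CV_TABLE = {
    "germplasm_species": {"Zea mays", "Oryza sativa"},
    "germplasm_type": {"inbred_line", "hybrid"},
    "marker_strand": {"+", "-"},
}


def check_cv_terms(names, group, cv_table=CV_TABLE):
    """Map each unique, non-null name to True if it exists in the group."""
    terms = cv_table.get(group, set())
    # Keep only the unique set of values, excluding nulls and blanks,
    # while preserving first-seen order.
    unique = dict.fromkeys(n for n in names if n is not None and n.strip())
    return {name: name in terms for name in unique}
```

The reference table lookup would have the same input and output, searching a different table. Returning a boolean per name, instead of a single pass/fail flag, lets the caller report exactly which values are missing.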
User interaction and design
Open Questions
Columns: Question, Answer, Date Answered
Out of Scope