Digest Data Validation

Target release	2.1
Epic	GDM-3
Document status	IN PROGRESS
Document owner	Yaw Nti-Addae
Designer	Yaw Nti-Addae
Tech lead	Joshua Lamos-Sweeney
Technical writers
QA	Deb Weigand

Objective

Prevent invalid data from being loaded into GOBii. Such data will most likely either fail at the database level or corrupt data in the DB.

Success metrics

Goal	Metric

Assumptions

Milestones

Requirements

#	Requirement	User Story	Importance
1	digest.germplasm	Required fields: name, external_code	HIGH
		Optional fields:
		  species_name => germplasm_species in DB [CV table]
		  type_name => germplasm_type in DB [CV table]
		Conditions:
		  All required columns must exist in the file.
		  Required fields cannot contain null or empty (whitespace-only) records.
		  The external_code column must not contain any duplicates.
		  If the species_name column exists, check that the unique set of species values (excluding nulls) exists in the CV table, germplasm_species group.
		  If the type_name column exists, check that the unique set of germplasm type values (excluding nulls) exists in the CV table, germplasm_type group.
2	digest.germplasm_prop	Required fields: external_code
		Conditions:
		  All required columns must exist in the file.
		  Required fields cannot contain null or empty records.
		  digest.germplasm must exist.
		  The external_code column must be the same as the external_code column in digest.germplasm.
3	digest.dnasample	Required fields: project_id, name, external_code, num
		Conditions:
		  All required columns must exist in the file.
		  Required fields cannot contain null or empty records.
		  The name/external_code/num combination must be unique (in file).
		  If digest.germplasm does not exist, check that the unique set of external_codes exists in the germplasm table [DB].
		  ~~If digest.germplasm exists, external_code column must be the same as external_code column in digest.germplasm~~ Daw 12/10/18
		  If digest.germplasm exists, the unique list of external_code must be the same as the unique list of external_code in digest.germplasm.
4	digest.dnasample_prop	Required fields: project_id, dnasample_name, external_code, num
		Optional fields:
		  dnasample table must be mapped in this file if this table is mapped
		Conditions:
		  All required columns must exist in the file.
		  Required fields cannot contain null or empty records.
		  The dnasample_name/external_code/num combination must be unique (in file).
		  digest.dnasample must exist.
		  The project_id column must be the same as the project_id column in digest.dnasample.
		  The dnasample_name column must be the same as the name column in digest.dnasample.
		  The external_code column must be the same as the external_code column in digest.dnasample.
		  The num column must be the same as the num column in digest.dnasample.
5	digest.dnarun	Required fields: project_id, experiment_id, name
		Optional fields:
		  dnasample_name (must be non-null if the column exists)
		  num (must be non-null if the column exists)
		Conditions:
		  All required columns must exist in the file.
		  Required fields cannot contain null or empty records.
		  If the dnasample_name column exists, it cannot contain any null or empty records.
		  If the num column exists, it cannot contain any null or empty records.
		  ~~if digest.dnasample exists:~~ (daw: removed 10/17/18)
		    ~~project_id column contents must exist in digest.dnasample project_id column~~
		    ~~dnasample_name column must exist~~
		    ~~dnasample_name column contents must exist in digest.dnasample name column~~
		    ~~num column must exist~~
		    ~~num column contents must exist in digest.dnasample num column~~
		  If digest.dnasample exists: (daw: added 10/17/18)
		    The dnasample_name column must exist.
		    The unique list of project_id must equal the unique list of digest.dnasample.project_id.
		    The unique list of dnasample_name must equal the unique list of digest.dnasample.name.
		    If the num column exists: the unique list of dnasample_name and num combinations must equal the unique list of digest.dnasample.name and digest.dnasample.num combinations.
		  Else, if digest.dnasample does not exist:
		    ~~dnasample_name column must exist~~ DAW: 121018
		    Check that the unique list of dnasample_names exists in the dnasample name column [DB] for project_id. DAW: 0711/19
		    Check that the unique list of name (dnarun_names) exists in the dnarun name column [DB] for exp_id. DAW: 0711/19
		    If the num column exists:
		      The dnasample_name column must exist.
		      Check that the unique list of dnasample_name and num combinations exists in the dnasample table [DB] within the name and num columns.
6	digest.dnarun_prop	Required fields: experiment_id, dnarun_name
		Conditions:
		  All required columns must exist in the file.
		  Required fields cannot contain null or empty records.
		  digest.dnarun must exist.
		  The experiment_id column must be the same as the experiment_id column in digest.dnarun.
		  The dnarun_name column must be the same as the name column in digest.dnarun.
7	digest.marker	Required fields: platform_id, name
		Optional fields:
		  reference_name => reference_name in DB [References table]
		  strand_name => marker_strand in DB [CV table]
		Conditions:
		  All required columns must exist in the file.
		  Required fields cannot contain null or empty records.
		  If the reference_name column exists, check that the unique set of reference names (excluding nulls) exists in the reference table.
		  If the strand_name column exists, check that the unique set of strands (excluding nulls) exists in the CV table, marker_strand group.
8	digest.marker_prop	Required fields: platform_id, marker_name
		Conditions:
		  All required columns must exist in the file.
		  Required fields cannot contain null or empty records.
		  digest.marker must exist.
9	digest.linkage_group	Required fields: map_id, name
		Conditions:
		  All required columns must exist in the file.
		  Required fields cannot contain null or empty records.
10	digest.marker_linkage_group	Required fields: platform_id, map_id, linkage_group_name, marker_name
		Conditions:
		  All required columns must exist in the file.
		  Required fields cannot contain null or empty records.
		  If digest.linkage_group does not exist, the unique set of linkage_group_names must exist in the linkage_group table [DB].
		  If digest.marker does not exist, the unique set of marker_names must exist in the marker table [DB].
11	digest.dataset_dnarun	Required fields: experiment_id, dataset_id, dnarun_name, dnarun_idx
		Conditions:
		  All required columns must exist in the file.
		  Required fields cannot contain null or empty records.
		  If digest.dnarun does not exist, the unique set of dnarun_names must exist in the dnarun table [DB] for the associated experiment_id.
12	digest.dataset_marker	Required fields: platform_id, dataset_id, marker_name, marker_idx
		Conditions:
		  All required columns must exist in the file.
		  Required fields cannot contain null or empty records.
		  If digest.marker does not exist, the unique set of marker_names must exist in the marker table [DB] for the associated platform_id.
13	digest.matrix	This check can be performed after data transformations, but it will need Josh's advice. If we have time: users would greatly appreciate having this check on input types instead of extract types.
		Conditions:
		  digest.dataset_dnarun and digest.dataset_marker must exist.
		  The matrix must not contain any null values.
		  Check for undesirable characters in the genotype matrix based on dataset type (check character length as well).
		  The matrix must be rectangular (no jagged edges or double delimiters). (Added 01/02/2019 by Yaw)
		  Matrix column and row lengths must correspond to the record length in the digest.dataset_dnarun and digest.dataset_marker files.
		  Allowed characters for each data type are as follows:
		    dominant: 0,1,N
		    co-dominant: 0,1,2,N
		    SNP: A,T,C,G,-,+,N
		    Allele sizes: numeric values, N
		    All data types: any set on the 'invalid characters' set
14	Webservices needed	Check if species_names exist in the CV table
		  input: list of names
		  output: list of key:value pairs, where key is species_name and value is a boolean indicating whether the key exists in the table
		  search: CV table, germplasm_species group
		Check if type_names exist in the CV table
		  input: list of names
		  output: list of key:value pairs, where key is type_name and value is a boolean indicating whether the key exists in the table
		  search: CV table, germplasm_type group
		Check if reference_names exist in the reference table
		  input: list of names
		  output: list of key:value pairs, where key is reference_name and value is a boolean indicating whether the key exists in the table
		  search: reference table
		Check if strand_names exist in the CV table
		  input: list of names
		  output: list of key:value pairs, where key is strand_name and value is a boolean indicating whether the key exists in the table
		  search: CV table, marker_strand group
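
The file-level conditions in requirement 1 (required columns present, no null or whitespace-only values, no duplicate external_code, species names present in the CV table) could be sketched roughly as follows. This is an illustration only, not the GOBii implementation: it assumes the digest file is tab-delimited with a header row, and `validate_germplasm` and the `species_exists` callback are hypothetical names. The callback stands in for the first web service in requirement 14 and is assumed to return a name-to-boolean mapping.

```python
import csv
import io

REQUIRED = ("name", "external_code")


def is_blank(value):
    """True for a missing value or a string holding nothing but whitespace."""
    return value is None or value.strip() == ""


def validate_germplasm(text, species_exists):
    """Return a list of error strings for a tab-delimited digest.germplasm file.

    species_exists is a hypothetical lookup (list of names -> {name: bool})
    standing in for the CV-table web service.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []

    # Condition: all required columns must exist in the file.
    missing = [col for col in REQUIRED if col not in header]
    if missing:
        return [f"missing required column: {col}" for col in missing]

    errors = []
    seen_codes = set()
    species = set()
    for row in reader:
        line_no = reader.line_num
        # Condition: required fields cannot be null or whitespace-only.
        for col in REQUIRED:
            if is_blank(row[col]):
                errors.append(f"line {line_no}: {col} is empty")
        # Condition: external_code must not contain duplicates.
        code = row["external_code"]
        if not is_blank(code):
            if code in seen_codes:
                errors.append(f"line {line_no}: duplicate external_code {code}")
            seen_codes.add(code)
        # Collect the unique, non-null species values for a single lookup.
        if "species_name" in header and not is_blank(row["species_name"]):
            species.add(row["species_name"])

    if species:
        found = species_exists(sorted(species))
        for name in sorted(species):
            if not found.get(name, False):
                errors.append(f"species_name not in CV: {name}")
    return errors
```

Collecting the unique species values first and sending them in one call mirrors the "list of names in, key:value pairs out" shape described for the web services; the same pattern would apply to type_name, reference_name and strand_name.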

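The matrix conditions in requirement 13 (no null values, rectangular shape, allowed characters per dataset type) could be checked along these lines. This is a minimal sketch under stated assumptions: the matrix is tab-delimited text, `validate_matrix` and its parameters are hypothetical names, and the caller supplies the expected row and column counts from the digest.dataset_dnarun and digest.dataset_marker record counts, since the requirement does not say which file maps to rows and which to columns. An allele size is assumed to be a single non-negative integer, and the 'invalid characters' set and the character-length check are left out.

```python
# Allowed characters per dataset type, taken from requirement 13.
ALLOWED = {
    "dominant": set("01N"),
    "co-dominant": set("012N"),
    "snp": set("ATCG-+N"),
}


def cell_is_valid(cell, dataset_type):
    """Check one matrix cell against the character rules for its dataset type."""
    if dataset_type == "allele sizes":
        # Assumption: a numeric value is a single non-negative integer.
        return cell == "N" or cell.isdigit()
    return all(ch in ALLOWED[dataset_type] for ch in cell)


def validate_matrix(text, dataset_type, expected_rows, expected_cols, delimiter="\t"):
    """Return a list of error strings for a delimited genotype matrix."""
    errors = []
    rows = [line.split(delimiter) for line in text.splitlines()]
    if len(rows) != expected_rows:
        errors.append(f"expected {expected_rows} rows, found {len(rows)}")
    for i, cells in enumerate(rows, start=1):
        # A jagged edge or a double delimiter both show up as a wrong cell count.
        if len(cells) != expected_cols:
            errors.append(f"row {i}: expected {expected_cols} columns, found {len(cells)}")
        for j, cell in enumerate(cells, start=1):
            if cell.strip() == "":
                errors.append(f"row {i}, column {j}: empty value")
            elif not cell_is_valid(cell, dataset_type):
                errors.append(f"row {i}, column {j}: invalid value {cell!r}")
    return errors
```

A double delimiter is reported twice, once as an unexpected column count and once as an empty value, which makes it easier to locate in a large matrix.
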
User interaction and design

Open Questions

Question	Answer	Date Answered
