Digest Data Validation

Target release: 2.1
Epic: GDM-3
Document status: IN PROGRESS
Document owner:
Designer: Yaw Nti-Addae
Tech lead: Joshua Lamos-Sweeney
Technical writers:
QA: Deb Weigand

Objective

Prevent incorrect or malformed data from being loaded into GOBii. Such data will most likely fail at the database level or corrupt data in the DB.

Success metrics

Goal | Metric


Assumptions

Milestones

Requirements

# | Requirement | User Story | Importance | Jira Issue | Notes
1. digest.germplasm

Required fields:

  • name
  • external_code

Optional fields:

  • species_name => germplasm_species in DB [CV table]
  • type_name => germplasm_type in DB [CV table]

Conditions:

  1. All required columns must exist in the file
  2. Required fields must not contain null or empty (whitespace-only) values
  3. The external_code column must not contain duplicates
  4. If the species_name column exists, every unique non-null value must exist in the CV table, germplasm_species group
  5. If the type_name column exists, every unique non-null value must exist in the CV table, germplasm_type group

Importance: HIGH
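
The first three conditions for digest.germplasm can be sketched as follows. This is a minimal illustration, not the GOBii implementation: the function name `validate_germplasm` and the input shape (a header list plus a list of row dicts) are assumptions, since this page does not specify how digest files are parsed.

```python
from collections import Counter

REQUIRED = ("name", "external_code")  # required fields listed above


def validate_germplasm(header, rows):
    """Return a list of error strings for conditions 1-3 (empty list = pass).

    `header` is the list of column names and `rows` a list of dicts keyed by
    column name; both are assumed to come from a parsed digest.germplasm file.
    """
    # Condition 1: every required column must be present.
    missing = [col for col in REQUIRED if col not in header]
    if missing:
        return [f"missing required column: {col}" for col in missing]

    errors = []

    # Condition 2: no null or whitespace-only values in required columns.
    for line_no, row in enumerate(rows, start=1):
        for col in REQUIRED:
            value = row.get(col)
            if value is None or not value.strip():
                errors.append(f"row {line_no}: empty value in {col}")

    # Condition 3: external_code must not repeat.
    counts = Counter(
        row["external_code"] for row in rows if row.get("external_code")
    )
    for code, n in counts.items():
        if n > 1:
            errors.append(f"duplicate external_code: {code} ({n} occurrences)")

    return errors
```

Returning every error rather than stopping at the first one lets a user fix a file in a single pass; whether the real validator should do so is a design choice this page leaves open.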


2. digest.germplasm_prop

Required fields:

  • external_code

Conditions:

  1. All required columns must exist in the file
  2. Required fields must not contain null or empty values
  3. digest.germplasm must exist
  4. The external_code column must match the external_code column in digest.germplasm



3. digest.dnasample

Required fields:

  • project_id
  • name
  • external_code
  • num

Conditions:

  1. All required columns must exist in the file
  2. Required fields must not contain null or empty values
  3. The name/external_code/num combination must be unique within the file
  4. If digest.germplasm does not exist, every unique external_code must exist in the germplasm table [DB]
  5. If digest.germplasm exists, the external_code column must match the external_code column in digest.germplasm (DAW, 12/10/18)
  6. If digest.germplasm exists, the unique list of external_code values should equal the unique list of external_code values in digest.germplasm
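
Conditions 3 and 6 both reduce to set operations on parsed rows. The sketch below assumes rows are dicts keyed by column name; `duplicate_keys` and `unique_set_mismatch` are hypothetical helper names, not existing GOBii functions.

```python
def duplicate_keys(rows, key_columns):
    """Return the key tuples that occur more than once in `rows`."""
    seen, dupes = set(), set()
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return dupes


def unique_set_mismatch(rows_a, rows_b, column):
    """Compare the unique values of `column` in two files.

    Returns (only_in_a, only_in_b); both sets are empty when the unique
    lists are equal.
    """
    a = {row[column] for row in rows_a}
    b = {row[column] for row in rows_b}
    return a - b, b - a
```

For example, `duplicate_keys(dnasample_rows, ("name", "external_code", "num"))` covers condition 3, and `unique_set_mismatch(dnasample_rows, germplasm_rows, "external_code")` covers condition 6.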



4. digest.dnasample_prop

Required fields:

  • project_id
  • dnasample_name
  • external_code
  • num

Optional fields:

  • dnasample table must be mapped in this file if this table is mapped

Conditions:

  1. All required columns must exist in the file
  2. Required fields must not contain null or empty values
  3. The dnasample_name/external_code/num combination must be unique within the file
  4. digest.dnasample must exist
  5. The project_id column must match the project_id column in digest.dnasample
  6. The dnasample_name column must match the name column in digest.dnasample
  7. The external_code column must match the external_code column in digest.dnasample
  8. The num column must match the num column in digest.dnasample



5. digest.dnarun

Required fields:

  • project_id
  • experiment_id
  • name

Optional fields:

  • dnasample_name (must be non-null if present)
  • num (must be non-null if present)

Conditions:

  1. All required columns must exist in the file
  2. Required fields must not contain null or empty values
  3. If the dnasample_name column exists, it must not contain null or empty values
  4. If the num column exists, it must not contain null or empty values
  5. If digest.dnasample exists (DAW: removed 10/17/18):
    • project_id column contents must exist in the digest.dnasample project_id column
    • dnasample_name column must exist
    • dnasample_name column contents must exist in the digest.dnasample name column
    • num column must exist
    • num column contents must exist in the digest.dnasample num column
  6. If digest.dnasample exists (DAW: added 10/17/18):
    • dnasample_name column must exist
    • unique list of project_id must equal the unique list of digest.dnasample.project_id
    • unique list of dnasample_name must equal the unique list of digest.dnasample.name
    • if the num column exists:
      • unique list of dnasample_name and num combinations must equal the unique list of digest.dnasample.name and digest.dnasample.num combinations
  7. Else, if digest.dnasample does not exist:
    • dnasample_name column must exist (DAW: 121018)
    • every unique dnasample_name must exist in the dnasample name column [DB] for the project_id (DAW: 0711/19)
    • every unique name (dnarun name) must exist in the dnarun name column [DB] for the experiment_id (DAW: 0711/19)
    • if the num column exists:
      • dnasample_name column must exist
      • every unique dnasample_name and num combination must exist in the dnasample table [DB] name and num columns
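
The branch between conditions 6 and 7 can be sketched as below. This covers only the dnasample_name checks (not the dnarun name lookup or the name/num lookup against the DB). The function name and the `names_in_db` callable are hypothetical; the callable stands in for whatever database or web-service lookup is eventually used.

```python
def check_dnarun_samples(header, rows, dnasample_rows=None, names_in_db=None):
    """Check dnarun sample references against digest.dnasample or the DB.

    dnasample_rows: parsed digest.dnasample rows, or None if the file is absent.
    names_in_db: hypothetical callable (project_id, names) -> set of the names
        that exist in the dnasample table; used only when the file is absent.
    """
    if "dnasample_name" not in header:
        return ["dnasample_name column must exist"]

    errors = []
    if dnasample_rows is not None:
        # Condition 6: unique lists must be equal in both files.
        if {r["project_id"] for r in rows} != {r["project_id"] for r in dnasample_rows}:
            errors.append("project_id values differ from digest.dnasample")
        if {r["dnasample_name"] for r in rows} != {r["name"] for r in dnasample_rows}:
            errors.append("dnasample_name values differ from digest.dnasample")
        if "num" in header:
            ours = {(r["dnasample_name"], r["num"]) for r in rows}
            theirs = {(r["name"], r["num"]) for r in dnasample_rows}
            if ours != theirs:
                errors.append(
                    "dnasample_name/num combinations differ from digest.dnasample"
                )
    else:
        # Condition 7: fall back to the database, one lookup per project_id.
        by_project = {}
        for r in rows:
            by_project.setdefault(r["project_id"], set()).add(r["dnasample_name"])
        for project_id, names in by_project.items():
            for name in sorted(names - names_in_db(project_id, names)):
                errors.append(
                    f"dnasample {name} not found in DB for project {project_id}"
                )
    return errors
```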



6. digest.dnarun_prop

Required fields:

  • experiment_id
  • dnarun_name

Conditions:

  1. All required columns must exist in the file
  2. Required fields must not contain null or empty values
  3. digest.dnarun must exist
  4. The experiment_id column must match the experiment_id column in digest.dnarun
  5. The dnarun_name column must match the name column in digest.dnarun



7. digest.marker

Required fields:

  • platform_id
  • name

Optional fields:

  • reference_name => reference_name in DB [References table]
  • strand_name => marker_strand in DB [CV table]

Conditions:

  1. All required columns must exist in the file
  2. Required fields must not contain null or empty values
  3. If the reference_name column exists, every unique non-null reference name must exist in the reference table
  4. If the strand_name column exists, every unique non-null strand must exist in the CV table, marker_strand group



8. digest.marker_prop

Required fields:

  • platform_id
  • marker_name

Conditions:

  1. All required columns must exist in the file
  2. Required fields must not contain null or empty values
  3. digest.marker must exist



9. digest.linkage_group

Required fields:

  • map_id
  • name

Conditions:

  1. All required columns must exist in the file
  2. Required fields must not contain null or empty values



10. digest.marker_linkage_group

Required fields:

  • platform_id
  • map_id
  • linkage_group_name
  • marker_name

Conditions:

  1. All required columns must exist in the file
  2. Required fields must not contain null or empty values
  3. If digest.linkage_group does not exist, every unique linkage_group_name must exist in the linkage_group table [DB]
  4. If digest.marker does not exist, every unique marker_name must exist in the marker table [DB]
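
Conditions 3 and 4 apply the database check only when the sibling digest file is absent. A minimal sketch of that rule follows; the function name, the directory layout, and the `db_lookup` callable are assumptions made for illustration, not part of this specification.

```python
from pathlib import Path


def names_missing_from_db(digest_dir, sibling_file, names, db_lookup):
    """Return the names that fail the DB check, or an empty set.

    If `sibling_file` (e.g. "digest.linkage_group") is present in
    `digest_dir`, the conditions above require no DB check. Otherwise every
    unique name must already exist in the database.
    `db_lookup` is a hypothetical callable: set of names -> set of names found.
    """
    if (Path(digest_dir) / sibling_file).exists():
        return set()
    unique = set(names)
    return unique - db_lookup(unique)
```

The same helper would serve the digest.dataset_dnarun and digest.dataset_marker conditions if the lookup callable is scoped to the relevant experiment_id or platform_id.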



11. digest.dataset_dnarun

Required fields:

  • experiment_id
  • dataset_id
  • dnarun_name
  • dnarun_idx

Conditions:

  1. All required columns must exist in the file
  2. Required fields must not contain null or empty values
  3. If digest.dnarun does not exist, every unique dnarun_name must exist in the dnarun table [DB] for the associated experiment_id**



12. digest.dataset_marker

Required fields:

  • platform_id
  • dataset_id
  • marker_name
  • marker_idx

Conditions:

  1. All required columns must exist in the file
  2. Required fields must not contain null or empty values
  3. If digest.marker does not exist, every unique marker_name must exist in the marker table [DB] for the associated platform_id



13. digest.matrix

This check can be performed after data transformations, but it will need Josh's advice.

If we have time: users would greatly appreciate having this check on input types instead of extract types.

Conditions:

  1. digest.dataset_dnarun and digest.dataset_marker must exist
  2. Matrix must not contain any null values
  3. Check for undesirable characters in genotype matrix based on dataset type (check for character length as well)
  4. Matrix must be rectangular (no jagged edges or double-delimiters)
  5. (Added 01/02/2019 by Yaw) matrix column and row lengths must correspond to record length in digest.dataset_dnarun and digest.dataset_marker files

Allowed characters for each data type are as follows:

  • dominant - 0,1,N
  • co-dominant - 0,1,2,N
  • SNP - A,T,C,G,-,+,N
  • Allele sizes - numeric values, N
  • All data types - any set on the 'invalid characters' set
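
The matrix conditions (no empty cells, allowed characters, rectangular shape, expected dimensions) can be sketched as below. Several things here are assumptions: the matrix is taken to be tab-delimited with one genotype call per cell, the expected row and column counts are passed in by the caller because this page does not say which axis corresponds to markers, and the allele-size type and the 'invalid characters' set are left out.

```python
# Allowed characters per dataset type, from the list above.
ALLOWED = {
    "dominant": set("01N"),
    "co-dominant": set("012N"),
    "SNP": set("ATCG-+N"),
}


def validate_matrix(lines, dataset_type, expected_rows, expected_cols, delimiter="\t"):
    """Return a list of error strings for a genotype matrix (empty = pass)."""
    errors = []
    rows = [line.rstrip("\n").split(delimiter) for line in lines]

    # Row count must match the record count of the corresponding digest file.
    if len(rows) != expected_rows:
        errors.append(f"expected {expected_rows} rows, found {len(rows)}")

    allowed = ALLOWED[dataset_type]
    for i, cells in enumerate(rows, start=1):
        # A jagged row or a doubled delimiter changes the cell count.
        if len(cells) != expected_cols:
            errors.append(f"row {i}: expected {expected_cols} columns, found {len(cells)}")
        for j, cell in enumerate(cells, start=1):
            if cell == "":
                errors.append(f"row {i}, column {j}: empty cell")
            elif not set(cell) <= allowed:
                errors.append(f"row {i}, column {j}: invalid value {cell!r}")
    return errors
```

A check on the length of each call (for example, one or two characters for SNP data) would slot into the inner loop once the permitted lengths per dataset type are agreed.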



14. Web services needed
  1. Check whether species_names exist in the CV table
    • input: list of names
    • output: list of key:value pairs, where the key is a species_name and the value is a boolean indicating whether the key exists in the table
    • search: CV table, germplasm_species group
  2. Check whether type_names exist in the CV table
    • input: list of names
    • output: list of key:value pairs, where the key is a type_name and the value is a boolean indicating whether the key exists in the table
    • search: CV table, germplasm_type group
  3. Check whether reference_names exist in the reference table
    • input: list of names
    • output: list of key:value pairs, where the key is a reference_name and the value is a boolean indicating whether the key exists in the table
    • search: reference table
  4. Check whether strand_names exist in the CV table
    • input: list of names
    • output: list of key:value pairs, where the key is a strand_name and the value is a boolean indicating whether the key exists in the table
    • search: CV table, marker_strand group
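
All four services share one contract: a list of names in, a name-to-boolean mapping out. The sketch below illustrates that contract against an in-memory collection of known terms. The function name `names_exist` is hypothetical, and exact, case-sensitive matching is an assumption; the real services would query the CV or reference tables.

```python
def names_exist(names, known_terms):
    """Map each unique, non-null name to True if it appears in `known_terms`.

    `known_terms` stands in for the rows of one CV group (or the reference
    table); matching is exact and case-sensitive in this sketch.
    """
    known = set(known_terms)
    # dict.fromkeys de-duplicates while preserving the input order.
    return {name: name in known for name in dict.fromkeys(names) if name}
```

For example, calling it with `["Zea mays", "Triticum aestivum"]` against a germplasm_species group that contains only "Zea mays" yields `{"Zea mays": True, "Triticum aestivum": False}`.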



User interaction and design

Open Questions

QuestionAnswerDate Answered

Out of Scope