This meeting produced a great deal of useful feedback on GOBii and the GDM platform. This document focuses on the suggestions for improvement raised at that meeting.
CIMMYT
Overall
Benefits
Centralized, searchable institutional record of genotype data
Not dependent on external services
Challenges
Can’t complete genotyping analysis workflows smoothly
“run and maintain” mode at CIMMYT pending GOBii / EBS integration
Loader
Benefits
Flexible
Templates
Challenges
Loader validations are stricter than the database requirements
Diagnosing errors relies too heavily on the help desk
Lacking useful transformations during loading
Update marker names that are based on sequence positions
Convoluted - requires substantial training
No tool for SNP recalling (for KASP markers, but unclear where/how this could happen)
Extractor
Benefits
Fast
Email notifications
Can aggregate some data across datasets
Challenges:
CAST
Benefits
Consensus call functionality
Splitting on multiple criteria
Some combining of data across datasets
Challenges
Cannot combine data from same samples in different datasets into one row
Does not facilitate the selection of marker groups to use in analyses
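The first CAST challenge above (one row per sample across datasets) can be illustrated with a minimal sketch. This is not CAST or GOBii code: the function name `combine_by_sample`, the input layout (dataset name mapped to sample ID mapped to marker calls), and the `dataset:marker` column naming are all assumptions made for illustration only.

```python
from collections import defaultdict


def combine_by_sample(datasets):
    """Merge per-dataset genotype rows into one row per sample.

    `datasets` is a hypothetical structure:
    {dataset_name: {sample_id: {marker_name: call}}}
    """
    combined = defaultdict(dict)
    for dataset_name, rows in datasets.items():
        for sample_id, calls in rows.items():
            for marker, call in calls.items():
                # Prefix each marker with its dataset so the same marker
                # genotyped in two datasets does not overwrite itself.
                combined[sample_id][f"{dataset_name}:{marker}"] = call
    return dict(combined)


# Made-up example data: sample S1 appears in both datasets.
datasets = {
    "ds1": {"S1": {"m1": "A/A", "m2": "A/G"}, "S2": {"m1": "G/G"}},
    "ds2": {"S1": {"m3": "C/T"}},
}
rows = combine_by_sample(datasets)
```

Here `rows["S1"]` holds the calls from both `ds1` and `ds2` in a single record, which is the behaviour the feedback asks for. Keeping the dataset name in the column key is one possible design choice; it preserves provenance at the cost of wider rows.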
Timescope
Benefits:
Provides important functionalities to enable the “safe” deleting of datasets, markers, and samples
Challenges:
Requires deployment of another tool instead of combining all CRUD functionalities in one tool
Separate authentication system, not linked to institutional authentication system
Data
Benefits
Large data storage for common data types
Flexible properties facilitate metadata storage
Marker groups allow the storage of different haplotypes associated with the same group of markers
Challenges
Some data dependencies have led to unplanned processes, e.g. sample linkage to a project and UUID implementation have caused CIMMYT to use one project for all datasets
Variants have not been implemented to facilitate analyses for the “same” marker used in different platforms, potentially with different names, over time
Marker groups are based on markers instead of variants and are not linked to traits, phenotypes, etc.
The current data structure seems to prevent the storage of data linked to each genotypic call or data point
VCF metadata is not preserved
QC values can’t be associated with data points
Allele frequency data cannot be stored or retrieved easily
Across ST and GOBii, there is no clear model for how to store “consensus” calls, or a “reference” genotype or fingerprint constructed from different samples over time
In an integrated system, many fields of information may be duplicated and sometimes have different “IDs”, e.g. an ID for germplasm in CB and a new ID for the same germplasm in GOBii
IRRI
Loader
Challenges
Manual mapping is tedious
different fields from different service providers
Errors with certain characters in sample files generated by B4R
Indels not supported unless encoded as +/-
Data often requires cleaning prior to upload
Requirement to associate data with PI
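As an illustration of the indel point above, the kind of pre-upload cleaning it implies might look like the sketch below. The notes do not specify the exact encoding the loader expects, so this assumes "+" means an insertion and "-" a deletion relative to a reference allele; the function name `encode_allele` and the ref/alt input form are hypothetical.

```python
def encode_allele(ref, alt):
    """Return a loader-friendly allele code for one ref/alt pair.

    Assumption: "+" marks an insertion and "-" a deletion relative to
    the reference; alleles of equal length are passed through as-is.
    """
    if len(alt) == len(ref):
        return alt
    return "+" if len(alt) > len(ref) else "-"


# Made-up alleles: an insertion, a deletion, and a plain SNP.
encoded = [encode_allele(r, a) for r, a in [("A", "AT"), ("AT", "A"), ("A", "G")]]
```

Collapsing an indel to "+" or "-" discards the inserted or deleted sequence itself, so the original alleles would need to be kept elsewhere if they matter for later analysis.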