Genotyping data management systems
System | Group/Institution | Contact |
---|---|---|
Germinate | JHI | Sebastian |
MontyDB | Cornell | Francisco |
GDM | Cornell | Joel |
Breedbase | BTI | Ask Tetima |
Gigwa | CIRAD | Guilhem |
PHG | Cornell | Ask Ed |
BCF | Broad Institute | |
GDR-BIMS (https://github.com/laceysanderson) | Un. of Washington | Dori? / Patrick |
VM allocations
Status | VM Hostname | Server Pool | Assignment |
---|---|---|---|
OFF | cbsugobii09 | | Germinate |
ON | cbsugobii09 | | GDM |
ON | cbsugobii10 | | Gigwa |
OFF | cbsugobii10 | | PHG |
ON | cbsugobii11 | | MontyDB |
OFF | cbsugobii11 | | Breedbase |
Each VM has the following resources:
8 CPUs
64 GB RAM
2 TB SSD
/storage mounted volume
/shared_data mounted volume
Users
Username | User |
---|---|
gadm | system |
yaw | |
dave | |
francisco | |
Datasets
Dataset | Format | Location / notes |
---|---|---|
Maize NAM | CSV | /shared_data/test_data/NAM_HM32/csv |
Simulated datasets | | |
Polyploid data | VCF | Moira to share a dataset; invite her to the next meeting |
Indel data | | |
Rice High Density Array | VCF | 700K SNPs x ~1,500 samples; SNPs only; also available as VCF. Francisco has already loaded it into his own Gigwa instance without problems. http://rs-bt-mccouch4.biotech.cornell.edu/staged_data/CSHL_EVA_Release_HDRA.tar |
African rice | VCF | https://gigwa.ird.fr/gigwa/?module=AfricanRice (metadata availability to be confirmed) |
3,000 rice genomes | | 29M SNPs; possibly too large |
Lettuce (Wageningen public dataset) | VCF | 12M markers x 500 accessions; 3 VCFs (one SNPs, one indels, one structural variants); 40 GB. https://www.nature.com/articles/s41588-021-00831-0 /pub/CNSA/data2/CNP0000335/Other/variation |
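Before benchmarking, it may help to confirm the marker and sample counts quoted for each VCF dataset. The following is a minimal sketch in plain Python, not part of any of the systems listed; the function names `count_dimensions` and `vcf_dimensions` are hypothetical. It relies only on the standard VCF layout, in which sample columns follow the nine fixed columns of the `#CHROM` header line.

```python
import gzip


def count_dimensions(lines):
    """Return (n_variants, n_samples) from an iterable of VCF text lines."""
    n_variants = 0
    n_samples = 0
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines
        if line.startswith("#CHROM"):
            # Nine fixed columns (CHROM..FORMAT) precede the sample columns.
            columns = line.rstrip("\n").split("\t")
            n_samples = max(0, len(columns) - 9)
            continue
        if line.strip():
            n_variants += 1  # one data line per variant record
    return n_variants, n_samples


def vcf_dimensions(path):
    """Count variants and samples in a plain or gzip-compressed VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_dimensions(handle)
```

Calling `vcf_dimensions` on a dataset file returns a tuple that can be compared against the dimensions listed in the table above.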
Actions:
- presentations on polyploid data
- user accounts for participants
- identify benchmarking criteria
Benchmarking suggestions
Start with a SNP dataset in VCF? Check with Sebastian and Breedbase (Titima)
Gigwa - tens of millions of markers x thousands of samples
Loading times?
Extract times - increasing marker and sample numbers?
Start with an overview of features so we can better understand what to benchmark
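One possible way to record load and extract times across increasing marker and sample counts is sketched below. It is only an illustration under stated assumptions: `time_call`, `benchmark_grid`, and `fake_extract` are hypothetical names, and `fake_extract` is a placeholder that would be replaced by a real load or extract call against each system.

```python
import time
from itertools import product


def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def benchmark_grid(operation, marker_counts, sample_counts, repeats=3):
    """Time operation(n_markers, n_samples) for every size combination."""
    rows = []
    for n_markers, n_samples in product(marker_counts, sample_counts):
        seconds = time_call(operation, n_markers, n_samples, repeats=repeats)
        rows.append(
            {"markers": n_markers, "samples": n_samples, "seconds": seconds}
        )
    return rows


def fake_extract(n_markers, n_samples):
    """Placeholder (assumption): stands in for a real extract request."""
    return [[0] * n_samples for _ in range(n_markers)]


if __name__ == "__main__":
    for row in benchmark_grid(fake_extract, [1_000, 10_000], [10, 100]):
        print(row)
```

Taking the best of several repeats reduces noise from other activity on a shared VM; reporting the median instead would be an equally reasonable choice.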
Action items April 21st
All - check that you can access the site and load your database. Gigwa is still to be loaded onto its VM; Guilhem can access the site but needs a user name
Yaw - update Confluence so all participants can edit
Yaw - set up Slack channel
Yaw - confirm whether user accounts have been set up; if not, set them up and distribute them. User names are needed for the people setting up databases
Yaw - request that the VMs not be publicly accessible, for security
Dave - set up training with Liz to learn how to use GDM
Yaw - make sure to invite Moira to the next meeting to discuss polyploid data
Invite Breedbase to next meeting
Liz - put together a table overview of features that we can all align against
Schedule a demo of each system's features for a future meeting
Features of Gigwa
Basic filtering functionalities
By chromosome / sequence
By position
By variant type
Advanced filtering functionalities
By functional annotations
By genotype patterns
Using multiple groups of samples
By metadata
Visualization
Allows consulting genotypes
Graphical representation
File formats
Multiple import formats
Multiple export formats
Data peculiarities
Support for INDELs
Support for polyploids
Support for phasing information
Support for metadata
Interoperability
Standard API support (as a data-provider)
Standard API support (as a data-consumer)
Supports feeding genotypes through API
Link to external software (e.g. JBrowse)
Software availability
Distributable
Open-source
Available as Docker container
Embeddable
Data compactness (percentage of disk space occupied by the data once in the system, compared to that occupied by the original VCF file)
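The data compactness measure can be computed as stored size divided by original VCF size, times 100. The sketch below is one way to do this, assuming the system's data lives in a single directory on disk. The helper names are hypothetical, and whether the reference VCF is compressed or uncompressed is a choice that would need to be applied consistently across systems.

```python
import os


def directory_size(path):
    """Total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def compactness_percent(stored_bytes, vcf_bytes):
    """Disk space used in the system as a percentage of the original VCF."""
    if vcf_bytes <= 0:
        raise ValueError("vcf_bytes must be positive")
    return 100.0 * stored_bytes / vcf_bytes
```

For example, a system that stores a 200-byte VCF in 50 bytes would score 25.0, and values above 100 mean the system uses more space than the original file.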