Genotyping data management systems

System	Group/Institution	Contact
Germinate	JHI	Sebastian
MontyDB	Cornell	Francisco
GDM	Cornell	Joel
Breedbase	BTI	Ask Tetima
Gigwa	CIRAD	Guilhem
PHG	Cornell	Ask Ed
BCF	Broad Institute
GDR-BIMS (https://github.com/laceysanderson)	Un. of Washington	Dori?
IPK (https://zarr.readthedocs.io/en/stable/)		Patrick

VM allocations

VM Hostname	Status	Server Pool	Assignment
cbsugobiizvm20.biohpc.cornell.edu	OFF	cbsugobii09	Germinate
cbsugobiizvm23.biohpc.cornell.edu	ON	cbsugobii09	GDM

cbsugobiizvm19.biohpc.cornell.edu	ON	cbsugobii10	Gigwa
cbsugobiizvm22.biohpc.cornell.edu	OFF	cbsugobii10	PHG

cbsugobiizvm21.biohpc.cornell.edu	ON	cbsugobii11	MontyDB
	OFF	cbsugobii11	Breedbase

Each VM has the following resources:

8 CPUs
64 GB RAM
2 TB SSD
/storage mounted volume
/shared_data mounted volume

Users

Username	User
gadm	system
yaw
dave
francisco

Datasets

Dataset	Format	Location
Maize NAM	CSV	/shared_data/test_data/NAM_HM32/csv
Simulated datasets
polyploid data in VCF		Moira to share a dataset; invite her to the next meeting
indel data
rice high density array	vcf	Rice High Density Array: 700K SNPs x ~1500 samples; SNPs only; available as VCF too. Francisco has already loaded it to Gigwa (own instance) with no problem. http://rs-bt-mccouch4.biotech.cornell.edu/staged_data/CSHL_EVA_Release_HDRA.tar
African rice		https://gigwa.ird.fr/gigwa/?module=AfricanRice (available as VCF; metadata availability?)
3,000 rice genomes		too large? 29M SNPs
lettuce Wageningen Public dataset	vcf	12M markers x 500 accessions; 3 VCFs (one SNPs, one indels, one structural variants); 40 GB. Paper: https://www.nature.com/articles/s41588-021-00831-0 Data: ftp.cngb.org/pub/CNSA/data2/CNP0000335/Other/variation
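
To compare the scale of the candidate datasets, the marker and sample counts listed in the table can be multiplied out into a number of genotype calls. The snippet below is only a rough sketch: the sample count for the 3,000 rice genomes is taken from the dataset name, and the one-byte-per-call size estimate is an assumption for illustration, not a property of any of the systems.

```python
def genotype_calls(n_markers: int, n_samples: int) -> int:
    """Number of genotype calls in a full markers x samples matrix."""
    return n_markers * n_samples


def approx_size_gib(n_calls: int, bytes_per_call: int = 1) -> float:
    """Rough uncompressed size in GiB, assuming a fixed number of bytes per call.

    bytes_per_call=1 is an illustrative assumption, not a measured value.
    """
    return n_calls * bytes_per_call / 2**30


# (markers, samples) as listed in the Datasets table; the rice HDRA sample
# count is approximate and the 3,000 genomes sample count comes from the name.
datasets = {
    "rice high density array": (700_000, 1_500),
    "3,000 rice genomes": (29_000_000, 3_000),
    "lettuce Wageningen": (12_000_000, 500),
}

if __name__ == "__main__":
    for name, (markers, samples) in datasets.items():
        calls = genotype_calls(markers, samples)
        print(f"{name}: {calls:,} calls, ~{approx_size_gib(calls):.1f} GiB at 1 byte/call")
```

This kind of estimate may help decide whether a dataset such as the 3,000 rice genomes is "too large" for a first benchmarking round.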

Actions:

presentations on polyploid data
user accounts for participants
identify benchmarking criteria

Benchmarking suggestions

Start with a SNP dataset - vcf ? - check with Sebastian and Breedbase (Titima)

Gigwa - tens of millions of markers x thousands of samples

Loading times?

Extract times - increasing marker and sample numbers?
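
One possible way to measure extract times over increasing marker and sample numbers is a small timing harness like the sketch below. The `extract` callable is hypothetical: each system would need its own wrapper around whatever export mechanism it offers, and `fake_extract` here is only a placeholder so the harness can be run on its own.

```python
import time
from typing import Callable, Dict, List, Sequence


def time_call(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by a single call of fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def benchmark_grid(
    extract: Callable[[int, int], object],
    marker_counts: Sequence[int],
    sample_counts: Sequence[int],
    repeats: int = 3,
) -> List[Dict[str, float]]:
    """Time extract(n_markers, n_samples) for every combination in the grid.

    The best (minimum) of `repeats` runs is kept for each combination.
    """
    results = []
    for n_markers in marker_counts:
        for n_samples in sample_counts:
            best = min(
                time_call(lambda: extract(n_markers, n_samples))
                for _ in range(repeats)
            )
            results.append(
                {"markers": n_markers, "samples": n_samples, "seconds": best}
            )
    return results


def fake_extract(n_markers: int, n_samples: int) -> list:
    """Placeholder for a real system's export call (hypothetical)."""
    return [[0] * n_samples for _ in range(n_markers)]


if __name__ == "__main__":
    for row in benchmark_grid(fake_extract, [1_000, 10_000], [100, 1_000]):
        print(row)
```

The same harness could be reused for loading times by passing a callable that wraps a system's import step instead of its export.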

Start with an overview of features so we can better understand what to benchmark

Action items April 21st

All - check that you can access the site and load your database. Gigwa is still to be loaded to its VM. Guilhem can access the site but needs a user name.

Yaw - update confluence so all participants can edit

Yaw - set up slack channel

Yaw - Have user accounts been set up? Set up and distribute. Need user names for people setting up databases.

Yaw - request that the VMs not be openly accessible, for security

Dave to set up training with Liz to learn how to use GDM

Yaw - make sure to invite Moira to the next meeting to discuss polyploid data

Invite Breedbase to next meeting

Liz put together a table overview of features that we can all align against

Schedule a demo of each system features for a future meeting

Features of Gigwa

Basic filtering functionalities

By chromosome / sequence
By position
By variant type
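
To make the three basic filters concrete for benchmarking purposes, the sketch below applies them to plain VCF data lines. It is not how Gigwa implements filtering; it is a minimal stand-alone illustration, and the SNP/INDEL classification is deliberately simplified (it ignores symbolic alleles such as `<DEL>`).

```python
from typing import Iterable, Iterator, Optional


def variant_type(ref: str, alt: str) -> str:
    """Simplified classification: SNP if REF and every ALT are one base, else INDEL."""
    alts = alt.split(",")
    if len(ref) == 1 and all(len(a) == 1 for a in alts):
        return "SNP"
    return "INDEL"


def filter_vcf_lines(
    lines: Iterable[str],
    chrom: Optional[str] = None,
    start: Optional[int] = None,
    end: Optional[int] = None,
    vtype: Optional[str] = None,
) -> Iterator[str]:
    """Yield VCF data lines matching chromosome, position range and variant type."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        c, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        if chrom is not None and c != chrom:
            continue
        if start is not None and pos < start:
            continue
        if end is not None and pos > end:
            continue
        if vtype is not None and variant_type(ref, alt) != vtype:
            continue
        yield line


# Tiny made-up records, for illustration only
records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\t.\t.",
    "1\t250\t.\tAT\tA\t.\t.\t.",
    "2\t100\t.\tC\tT\t.\t.\t.",
]

if __name__ == "__main__":
    for rec in filter_vcf_lines(records, chrom="1", vtype="SNP"):
        print(rec)
```

A reference filter like this could serve as a cross-check that each system returns the same variants for the same query.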

Advanced filtering functionalities

By functional annotations
By genotype patterns
Using multiple groups of samples
By metadata

Visualization

Allows consulting genotypes
Graphical representation

File formats

Multiple import formats
Multiple export formats

Data peculiarities

Support for INDELs
Support for polyploids
Support for phasing information
Support metadata

Interoperability

Standard API support (as a data-provider)
Standard API support (as a data-consumer)
Supports feeding genotypes through API
Link to external software (e.g. Jbrowse)

Software availability

Distributable
Open-source
Available as Docker container
Embeddable

Data compactness (percentage of disk space occupied by the data once in the system, compared to that occupied by the original VCF file)
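
The compactness figure defined above could be computed along the following lines. This is a sketch under assumptions: it supposes that a system's stored data can be measured as the total size of a directory on disk, which may not hold for every system, and the function names are illustrative.

```python
import os


def directory_size_bytes(path: str) -> int:
    """Total size in bytes of all regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def compactness_percent(stored_bytes: int, original_vcf_bytes: int) -> float:
    """Disk space used in the system as a percentage of the original VCF size."""
    if original_vcf_bytes <= 0:
        raise ValueError("original_vcf_bytes must be positive")
    return 100.0 * stored_bytes / original_vcf_bytes
```

For example, `compactness_percent(directory_size_bytes(data_dir), os.path.getsize(vcf_path))` would give the percentage for one dataset, where `data_dir` and `vcf_path` are placeholders for the system's data directory and the source file. Values below 100 mean the system stores the data more compactly than the original file.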
