Genotyping data management systems

System	Group/Institution	Contact	VM Hostname	Phase
Germinate	JHI	Sebastian Raubach	cbsugobiizvm21.biohpc.cornell.edu	one
GDM	GOBii	Dave Matthews, Evan Rees	cbsugobiizvm23.biohpc.cornell.edu	one
Gigwa	CIRAD	guilhem.sempere	cbsugobiizvm19.biohpc.cornell.edu	one
Breedbase	BTI	Titima Tantikanjana	cbsugobiizvm20.biohpc.cornell.edu	TWO
MontyDB	Cornell, McCouch Lab	Francisco Agosto		TWO
BCFTools	Broad Institute			TWO
PHG	Cornell, Buckler Lab	Ask Ed		HOLD
Breeding Insight		Moira Sheehan		HOLD
GDR-BIMS (https://github.com/laceysanderson)	University of Washington	Dori?		HOLD
IPK (https://zarr.readthedocs.io/en/stable/)		Patrick		HOLD

VM allocations

VM Hostname	Status	Server Pool	Assignment	Username
cbsugobiizvm20.biohpc.cornell.edu	off	cbsugobii09	Breedbase	breedbase
cbsugobiizvm23.biohpc.cornell.edu	on	cbsugobii09	GDM	gadm
cbsugobiizvm19.biohpc.cornell.edu	on	cbsugobii10	Gigwa	gigwa
cbsugobiizvm22.biohpc.cornell.edu	off	cbsugobii10	PHG	phg
cbsugobiizvm21.biohpc.cornell.edu	on	cbsugobii11	Germinate	jhi
	off	cbsugobii11	MontyDB	montydb

Each VM has the following resources:

8 CPUs
64 GB RAM
2 TB SSD
/storage mounted volume
/shared_data mounted volume

Users

...

Username

...

User

...

gadm

...

system

...

yaw

...

dave

...

francisco


Datasets

Dataset	Format	Location
Maize NAM	CSV	/shared_data/test_data/NAM_HM32/csv
Simulated datasets
polyploid data in VCF		Moira to share a dataset; invite her to the next meeting
indel data
rice high density array	VCF	700K SNPs x ~1500 samples; SNPs only; VCF too. Francisco has already loaded it into Gigwa (own instance) with no problem. http://rs-bt-mccouch4.biotech.cornell.edu/staged_data/CSHL_EVA_Release_HDRA.tar Hapmap: cbsugobiizvm19:/shared_data/test_data/genomics-systems-comparison/rice/Dataset.hmp.txt Flapjack: cbsugobiizvm19:/shared_data/test_data/genomics-systems-comparison/rice/flapjack/Dataset.*
African rice		https://gigwa.ird.fr/gigwa/?module=AfricanRice; available as VCF; metadata availability?
3,000 rice genomes		too large? 29M SNPs
lettuce (Wageningen public dataset)	VCF	12M markers x 500 accessions; 3 VCFs (one SNPs, one indels, one structural variants); 40 GB; https://www.nature.com/articles/s41588-021-00831-0

ftp.cngb.org/pub/CNSA/data2/CNP0000335/Other/variation
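
To put the "too large?" question in perspective, the dataset dimensions in the table above can be turned into rough genotype-call counts. This is only a back-of-the-envelope sketch: the figures are the rounded numbers from the table, and the sample count of 3,000 for the rice genomes is an assumption read off the dataset name.

```python
# Rough genotype-matrix sizes (markers x samples) for the candidate datasets.
# Figures are the rounded values from the Datasets table; the 3,000-sample
# count for the rice genomes is an assumption taken from the dataset name.
DATASETS = {
    "rice high density array": (700_000, 1_500),
    "3,000 rice genomes": (29_000_000, 3_000),
    "lettuce (Wageningen)": (12_000_000, 500),
}


def genotype_calls(markers: int, samples: int) -> int:
    """Number of genotype calls in a dense markers x samples matrix."""
    return markers * samples


def summarize(datasets):
    """Return (name, calls) pairs sorted from smallest to largest."""
    rows = [(name, genotype_calls(m, s)) for name, (m, s) in datasets.items()]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, calls in summarize(DATASETS):
        print(f"{name}: {calls:.2e} genotype calls")
```

On these rounded figures, the rice genomes come out well over an order of magnitude larger than the other two datasets, which may be a reason to start benchmarking with the smaller ones.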

Actions:

presentations on polyploid data
user accounts for participants
identify benchmarking criteria

Benchmarking suggestions

Start with a SNP dataset (VCF?); check with Sebastian and Breedbase (Titima)

Gigwa: tens of millions of markers x thousands of samples

Loading times?

Extract times with increasing marker and sample numbers?

Start with an overview of features so we can better understand what to benchmark
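
One possible way to collect the loading and extract times suggested above is a small timing grid over marker and sample counts. The sketch below is illustrative only: `benchmark_grid`, `time_call` and `fake_extract` are hypothetical names, and the placeholder workload would need to be replaced by a real load or extract call against each system.

```python
import time


def time_call(func, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def benchmark_grid(operation, marker_counts, sample_counts, repeats=3):
    """Time operation(n_markers, n_samples) for every size combination."""
    results = []
    for n_markers in marker_counts:
        for n_samples in sample_counts:
            seconds = time_call(lambda: operation(n_markers, n_samples), repeats)
            results.append(
                {"markers": n_markers, "samples": n_samples, "seconds": seconds}
            )
    return results


def fake_extract(n_markers, n_samples):
    """Hypothetical placeholder workload; swap in a real load/extract call."""
    return sum(1 for _ in range(n_markers * n_samples // 1000))


if __name__ == "__main__":
    for row in benchmark_grid(fake_extract, [10_000, 100_000], [100, 1_000]):
        print(row)
```

Taking the best of several runs is one common choice for reducing noise from other activity on a shared VM; medians or full distributions may be preferable once the benchmarking criteria are agreed.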

Action items April 29th

All - check that you can access the site and load your database. Gigwa is still to be loaded to the VM; Guilhem can access the site but needs a user name

Yaw - update Confluence so all participants can edit

Yaw - set up Slack channel

Yaw - have user accounts been set up? Set up and distribute them; user names are needed for the people setting up databases

Yaw - request that the VMs not be left open, for security reasons

Dave - set up training with Liz to learn how to use GDM

Yaw - invite Moira to the next meeting to discuss polyploid data

Invite Breedbase to the next meeting

Liz - put together a table overview of features that we can all align against

Schedule a demo of each system's features for a future meeting

Features of Gigwa

Basic filtering functionalities

By chromosome / sequence
By position
By variant type
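
For a shared understanding of what the basic filters mean when comparing systems, they can be sketched on plain VCF data lines. This is not Gigwa's implementation, just a minimal illustration; the `Variant` class and `basic_filter` function are hypothetical, and the SNP/INDEL classification only looks at the first ALT allele.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str

    @property
    def kind(self) -> str:
        """SNP when REF and ALT are single bases, otherwise INDEL (simplified)."""
        return "SNP" if len(self.ref) == 1 and len(self.alt) == 1 else "INDEL"


def parse_vcf_line(line: str) -> Variant:
    """Parse the fixed columns of one VCF data line (first ALT allele only)."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return Variant(chrom, int(pos), ref, alt.split(",")[0])


def basic_filter(variants, chrom=None, start=None, end=None, kinds=None):
    """Yield variants matching chromosome, position range and variant type."""
    for v in variants:
        if chrom is not None and v.chrom != chrom:
            continue
        if start is not None and v.pos < start:
            continue
        if end is not None and v.pos > end:
            continue
        if kinds is not None and v.kind not in kinds:
            continue
        yield v


if __name__ == "__main__":
    lines = [
        "chr1\t100\t.\tA\tG\t.\tPASS\t.",
        "chr1\t250\t.\tAT\tA\t.\tPASS\t.",
        "chr2\t50\t.\tC\tT\t.\tPASS\t.",
    ]
    variants = [parse_vcf_line(line) for line in lines]
    for v in basic_filter(variants, chrom="chr1", kinds={"SNP"}):
        print(v)
```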

Advanced filtering functionalities

By functional annotations
By genotype patterns
Using multiple groups of samples
By metadata

Visualization

Allows consulting genotypes
Graphical representation

File formats

Multiple import formats
Multiple export formats

Data peculiarities

Support for INDELs
Support for polyploids
Support for phasing information
Support for metadata
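
When checking how each system handles polyploid and phased data, it may help to recall how both show up in a VCF GT field: the number of alleles gives the ploidy, and the separator (`|` versus `/`) indicates phasing. The helper names below (`parse_gt`, `ploidy`, `dosage`) are hypothetical and the sketch ignores mixed separators within one call.

```python
import re


def parse_gt(gt: str):
    """Split a VCF GT string into allele indexes and a phased flag."""
    phased = "|" in gt and "/" not in gt
    alleles = [None if a == "." else int(a) for a in re.split(r"[/|]", gt)]
    return alleles, phased


def ploidy(gt: str) -> int:
    """Number of allele slots in the genotype call."""
    return len(parse_gt(gt)[0])


def dosage(gt: str, allele: int = 1):
    """Count copies of `allele`, or return None if any allele is missing."""
    alleles, _ = parse_gt(gt)
    if any(a is None for a in alleles):
        return None
    return sum(1 for a in alleles if a == allele)


if __name__ == "__main__":
    for gt in ["0/1", "0|1", "0/0/1/1", "./."]:
        print(gt, ploidy(gt), parse_gt(gt)[1], dosage(gt))
```

A tetraploid call such as `0/0/1/1` has ploidy 4 and an alternate-allele dosage of 2, which a system that assumes diploid genotypes could not represent without loss.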

Interoperability

Standard API support (as a data-provider)
Standard API support (as a data-consumer)
Supports feeding genotypes through API
Link to external software (e.g. JBrowse)

Software availability

Distributable
Open-source
Available as Docker container
Embeddable

...

Lettuce

hapmap

flapjack

Code Block

language	bash

/shared_data/test_data/genomics-systems-comparison/lettuce/
  chr1/
    Lactuca__project1__2021-06-24__1152198variants__FLAPJACK.fjzip
    Lactuca__project1__2021-06-24__1152198variants__HAPMAP.zip
    markerlists.zip
  full/
    Lactuca__project1__2021-06-28__12983735variants__FLAPJACK.fjzip
    Lactuca__project1__2021-06-28__12983735variants__HAPMAP.zip

potato (polyploid)

VCF

Code Block

language	bash

/shared_data/test_data/genomics-systems-comparison/potato/
  PRJNA414303.CHR5.filterNullGT.vcf.gz

source

