2021-08-10 Meeting notes
Date
Aug 10, 2021
Participants
@Evan Rees
@Dave Matthews
@Former user (Deleted)
@Pierre Larmande
@Sebastian Raubach
Goals
Progress updates
Blockers
Discussion topics
Time | Item | Presenter | Notes |
---|---|---|---|
| Access to CUVPN | @Sebastian Raubach | |
| polyploid dataset | @Evan Rees @Yaw Nti-Addae | Awaiting decision on polyploid dataset |
| lettuce full dataset | @Former user (Deleted) | Is the full lettuce genome too large to work with? ~200M markers / ~6B datapoints (markers * accessions). Full dataset is tens of GB, probably not feasible to manipulate. Loading hapmap format will impact performance: genotype data are highly condensed. Difference between hapmap and VCF is large, so not a fair comparison. Try ‘plain’ VCF with only GT fields. Need to account for conversion time from VCF to hapmap. Did Gigwa have a limit on the number of markers to query? |
| data formats | @Former user (Deleted) | Timings of bcftools for slicing / processing / conversion |
| data export | @Pierre Larmande | When exporting VCF, need to specify whether annotations are present; this will impact performance |
| parity / equivalence | @Pierre Larmande | when comparing datasets, need to account for annotations / info content in VCF |
| lettuce scaling | @Pierre Larmande | see if import times are linear across chromosomes |
| Measuring resource usage | @Pierre Larmande | Do we want to compare memory footprint? Compare efficiency across platforms. CPU usage / multithreading. |
| measuring time | @Evan Rees | Are we timing things consistently? Measuring resource usage |
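
One possible way to keep timings consistent across platforms is to wrap every import or conversion command in the same small harness. The sketch below is only an illustration, not an agreed procedure: the function name `time_command` and the default of three repeats are assumptions. It records wall-clock time per run and the peak resident memory of child processes using the Python standard library (Unix only).

```python
import resource
import subprocess
import time


def time_command(cmd, repeats=3):
    """Run cmd `repeats` times; return wall-clock seconds and peak child RSS.

    Hypothetical helper for illustration. `cmd` is an argument list such as
    ["bcftools", "view", "input.vcf.gz"].
    """
    wall_times = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Discard stdout so terminal output does not distort the timing.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        wall_times.append(time.perf_counter() - start)
    # ru_maxrss is the largest peak RSS among all child processes waited for
    # so far by this Python process (kilobytes on Linux, bytes on macOS).
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return {
        "cmd": " ".join(cmd),
        "wall_s": wall_times,
        "best_s": min(wall_times),
        "peak_rss": peak_rss,
    }
```

Because the peak memory figure covers every child started by the same Python process, running each measured command from a fresh invocation of the script keeps the numbers separate.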
Action items
procpath
Decisions
- Record timings for import from VCF, plain VCF, flapjack, and hapmap as each platform is able
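
To make the ‘plain’ VCF (GT fields only) concrete, here is a minimal sketch of what the reduction does to a single record. The function name `strip_to_gt` is hypothetical, and this is not necessarily how the files will be produced; in practice a tool such as `bcftools annotate -x INFO,^FORMAT/GT` is likely the better choice for datasets of this size. Note that this sketch leaves header lines untouched, so INFO and FORMAT definitions in the header would remain even though the fields are gone.

```python
def strip_to_gt(line):
    """Return a VCF line with INFO blanked and only GT kept for each sample."""
    if line.startswith("#"):
        return line  # header lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return line  # no FORMAT/sample columns, nothing to reduce
    format_keys = fields[8].split(":")
    gt_index = format_keys.index("GT")  # raises ValueError if GT is absent
    fields[7] = "."   # drop all INFO annotations
    fields[8] = "GT"  # FORMAT now declares only the genotype
    fields[9:] = [sample.split(":")[gt_index] for sample in fields[9:]]
    return "\t".join(fields) + "\n"


record = "1\t100\trs1\tA\tG\t50\tPASS\tDP=10;AF=0.5\tGT:DP:GQ\t0/1:12:99\t1/1:8:60\n"
plain = strip_to_gt(record)
```

The time taken to produce the plain file is itself a conversion cost, so it may be worth recording alongside the import timings.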