Benchmarking comparisons
Logistics
Source files
All files are stored in a shared folder mounted on each VM at HOST:/shared_data/test_data/genomics-systems-comparison.
Files referenced in Table 1 and throughout this document are relative to the shared folder.
Output files
Each platform has a subdirectory within the outputs folder, which can be organized as needed. At least one output file for every benchmark should be retained and stored in this folder. It is not necessary to store replicates of the same benchmark.
Output files should be renamed to match the template {platform}-{name}-{format}.{extension}.
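As an illustration of the naming template, the following Python sketch builds a conforming file name and path. The function names, the lowercase normalization, and the idea that each platform subdirectory is named after the platform are assumptions made for this example and are not specified above; the txt extension is only a placeholder.

```python
from pathlib import PurePosixPath


def output_name(platform: str, name: str, fmt: str, extension: str) -> str:
    """Return a file name following {platform}-{name}-{format}.{extension}."""
    # Assumption: platform and format are lowercased, and a leading dot on
    # the extension is tolerated and removed.
    return f"{platform.lower()}-{name}-{fmt.lower()}.{extension.lstrip('.')}"


def output_path(platform: str, name: str, fmt: str, extension: str,
                outputs_dir: str = "outputs") -> str:
    """Place the file in the platform's subdirectory of the outputs folder."""
    # Assumption: the subdirectory is named after the platform.
    file_name = output_name(platform, name, fmt, extension)
    return str(PurePosixPath(outputs_dir) / platform.lower() / file_name)


print(output_path("gigwa", "lettuce_chr1", "hapmap", "txt"))
# outputs/gigwa/gigwa-lettuce_chr1-hapmap.txt
```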
Table 1: Datasets
Each maize chromosome file should be loaded as a separate dataset. These will be used to benchmark querying samples from multiple datasets as in Table 4b. Only maize_chr1 needs to be benchmarked in Table 3.
name | polyploid | indels | format | file_size (MB) | num_markers | num_samples | num_missing | missingness | type | file |
---|---|---|---|---|---|---|---|---|---|---|
lettuce_chr1 | NO | NO | hapmap | 1,582 | 1,152,198 | 440 | 20,220,080 | 0.04 | 2letter | |
lettuce_full | NO | NO | hapmap | 17,828 | 12,983,735 | 440 | 231,704,035 | 0.04 | 2letter | |
rice | NO | NO | hapmap | 3,312 | 700,000 | 1,568 | 85,835,610 | 0.08 | 2letter | |
potato | yes | yes | hapmap | 37 | 142,479 | 38 | 0 | 0 | 4letter | |
maize_chr1 | NO | NO | hapmap | 1,449 | 148,752 | 4,845 | 359,630,401 | 0.50 | gbs | |
maize_chr2 | NO | NO | hapmap | 1,122 | 115,173 | 4,845 | 280,942,230 | 0.50 | gbs | |
maize_chr3 | NO | NO | hapmap | 1,055 | 108,224 | 4,845 | 262,919,216 | 0.50 | gbs | |
maize_chr4 | NO | NO | hapmap | 923 | 94,726 | 4,845 | 235,124,013 | 0.51 | gbs | |
maize_chr5 | NO | NO | hapmap | 1,075 | 110,328 | 4,845 | 265,417,151 | 0.50 | gbs | |
maize_chr6 | NO | NO | hapmap | 745 | 76,475 | 4,845 | 187,639,259 | 0.51 | gbs | |
maize_chr7 | NO | NO | hapmap | 785 | 80,517 | 4,845 | 195,381,118 | 0.50 | gbs | |
maize_chr8 | NO | NO | hapmap | 794 | 81,431 | 4,845 | 198,298,688 | 0.50 | gbs | |
maize_chr9 | NO | NO | hapmap | 705 | 72,368 | 4,845 | 176,850,405 | 0.50 | gbs | |
maize_chr10 | NO | NO | hapmap | 654 | 67,126 | 4,845 | 166,253,346 | 0.51 | gbs | |
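The missingness column is not defined explicitly in this document. The values in Table 1 appear consistent with the number of missing calls divided by the total number of calls (num_markers × num_samples), and the short sketch below checks that reading against two rows. The function name is hypothetical, and the definition itself should be treated as an assumption.

```python
def missingness(num_missing: int, num_markers: int, num_samples: int) -> float:
    """Fraction of genotype calls that are missing.

    Assumption: the total number of calls is num_markers * num_samples.
    """
    total_calls = num_markers * num_samples
    return num_missing / total_calls


# Values copied from the lettuce_chr1 and maize_chr1 rows of Table 1.
lettuce_chr1 = missingness(20_220_080, 1_152_198, 440)
maize_chr1 = missingness(359_630_401, 148_752, 4_845)

print(f"{lettuce_chr1:.2f}")  # 0.04
print(f"{maize_chr1:.2f}")    # 0.50
```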
Table 2a: Marker lists
Lists of markers used in the analysis. Contiguous vs non-contiguous indicates whether markers are consecutive along the genome. The data file includes IDs for all markers in the given range.
name | contiguous | count | range_start | range_end | file |
---|---|---|---|---|---|
markers.lettuce_chr1_10pct | yes | 114,877 | 0 | 16,000,000 | |
markers.lettuce_chr1_20pct | yes | 228,493 | 0 | 32,000,000 | |
markers.lettuce_chr1_30pct | yes | 346,711 | 0 | 47,000,000 | |
markers.lettuce_chr1_40pct | yes | 458,369 | 0 | 63,000,000 | |
markers.lettuce_chr1_50pct | yes | 578,777 | 0 | 79,000,000 | |
markers.lettuce_chr1_1K | no | 1,000 | NA | NA | |
markers.lettuce_chr1_10K | no | 10,000 | NA | NA | |
markers.lettuce_chr1_100K | no | 100,000 | NA | NA | |
markers.lettuce_chr1_1M | no | 1,000,000 | NA | NA | |
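To make the contiguous marker lists concrete, here is a minimal Python sketch that collects the IDs of all markers whose position falls inside a range in a HapMap file. It relies on the standard HapMap column layout (marker ID in the first column, chromosome in the third, position in the fourth). The function name, the inclusive treatment of both bounds, and the toy data are assumptions for illustration; how the lists in Table 2a were actually produced is not stated here.

```python
import csv


def markers_in_range(hapmap_lines, chrom, start, end):
    """Return marker IDs on `chrom` with start <= pos <= end (inclusive)."""
    reader = csv.reader(hapmap_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header line
    return [row[0] for row in reader
            if row[2] == chrom and start <= int(row[3]) <= end]


# Toy in-memory HapMap content: 11 fixed columns followed by two samples.
header = ["rs#", "alleles", "chrom", "pos", "strand", "assembly#", "center",
          "protLSID", "assayLSID", "panelLSID", "QCcode", "S1", "S2"]


def toy_row(marker_id, pos):
    return "\t".join([marker_id, "A/G", "1", str(pos)] + ["NA"] * 7 + ["AA", "AG"])


toy = ["\t".join(header), toy_row("m1", 100), toy_row("m2", 250), toy_row("m3", 900)]
print(markers_in_range(toy, "1", 0, 300))  # ['m1', 'm2']
```

For a real file, an open file handle can be passed in place of the list of lines, since `csv.reader` accepts any iterable of strings.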
Table 2b: Sample lists
Lists of samples used for extraction benchmarks. The samples in each list occupy adjacent columns in the input data file: each sample file contains the names of the first count samples in the data file, where count is the value given in the table.
name | count | file |
---|---|---|
samples.lettuce_chr1_5pct | 22 | |
samples.lettuce_chr1_10pct | 44 | |
samples.lettuce_chr1_25pct | 110 | |
samples.lettuce_chr1_50pct | 220 | |
samples.maize_5pct | 242 | |
samples.maize_10pct | 484 | |
samples.maize_25pct | 1,211 | |
samples.maize_50pct | 2,422 | |
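The sample lists can be sketched in a few lines of Python. The counts in Table 2b match the stated percentage of num_samples rounded down (for example, 5% of 4,845 maize samples gives 242), and the sample names are read from the HapMap header after its 11 fixed columns. The function names and the toy header are hypothetical, and the rounding rule is inferred from the table rather than documented.

```python
HAPMAP_FIXED_COLUMNS = 11  # standard HapMap metadata columns before the samples


def count_for_percent(num_samples: int, percent: int) -> int:
    """Number of samples for a given percentage, rounded down (assumption)."""
    return num_samples * percent // 100


def first_samples(header_line: str, count: int) -> list:
    """Return the names of the first `count` samples in a HapMap header line."""
    columns = header_line.rstrip("\n").split("\t")
    samples = columns[HAPMAP_FIXED_COLUMNS:]
    if count > len(samples):
        raise ValueError(f"requested {count} samples, file has {len(samples)}")
    return samples[:count]


# Toy header with four hypothetical sample names.
toy_header = "\t".join(
    ["rs#", "alleles", "chrom", "pos", "strand", "assembly#", "center",
     "protLSID", "assayLSID", "panelLSID", "QCcode", "S1", "S2", "S3", "S4"]
)

print(count_for_percent(4845, 5))                             # 242
print(first_samples(toy_header, count_for_percent(4, 50)))   # ['S1', 'S2']
```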
Table 3: Dataset import / export
Time (in seconds) spent importing and exporting data to/from Flapjack, Hapmap, and VCF format, and disk space (in megabytes) occupied after upload.
See comments for discussion on VCF format: vcf.gz
Import formats: hapmap and vcf are marker-oriented; flapjack and plink are sample-oriented. Export formats: vcf and hapmap are marker-oriented; flapjack is sample-oriented. Cells list replicate measurements separated by semicolons.
dataset | platform | import hapmap (s) | import vcf (s) | import flapjack (s) | import plink (s) | space_occupied (MB) | export vcf (s) | export hapmap (s) | export flapjack (s) |
---|---|---|---|---|---|---|---|---|---|
lettuce_chr1 | gigwa | 326; 323; 327 | 294; 294; 294 | 387; 367; 361 | 374; 374; 390 | 438 | 335; 332; 334 | 268; 268; 267 | 487; 498; 491 |
lettuce_chr1 | gobii | 320; 250; 273 | 84,770; 85,962; 85,041 | 1,793; 1,984; 1,503 | U | 2,028 | U | 935; 985; 979 | 896; 933; 926 |
lettuce_chr1 | germinate | 118; 120; 118 | U | 160; 160; 159 | | 1,020 | U | 106; 104; 107 | 201; 202; 204 |
lettuce_chr1 | montydb | | | | | | | | |
lettuce_chr1 | breedbase | | | | | | | | |
lettuce_full | gigwa | 3,892; 3,841; 3,888 | 3,401; 3,379; 3,391 | 4,787; 4,903; 4,889 | 6,431; 7,552; 6,237 | 4,750 | 3,783; 3,726; 3,712 | 3,085; 3,052; 2,928 | 5,710; 5,721; 5,719 |
lettuce_full | gobii | 5,819; 6,173; 6,482 | | U | U | 22,851 | U | 11,442; 11,322; 11,228 | 10,811; 10,715; 10,839 |
lettuce_full | germinate | 1,716; 1,696; 1,732 | U | 4,214; 4,146; 4,185 | | 17,233 | U | 1,022; 1,033; 1,078 | 2,366; 2,352; 2,344 |
lettuce_full | montydb | | | | | | | | |
lettuce_full | breedbase | | | | | | | | |
rice | gigwa | 663; 669; 677 | 624; 627; 616 | 728; 722; 741 | 760; 749; 762 | 1,827 | 755; 778; 765 | 532; 550; 551 | 889; 890; 913 |
rice | gobii | 400; 423; 425 | 32,473; 32,496; 32,377 | 1,576 | U | 4,390 | U | 706; 721; 712 | 593; 599; 595 |
rice | germinate | 226; 222; 223 | U | 337; 336; 339 | | 2,202 | U | 192; 189; 190 | 512; 515; 514 |
rice | montydb | | | | | | | | |
rice | breedbase | | | | | | | | |
potato | gigwa | 6; 6; 7 | 8; 6; 6 | 13; 13; 13 | U (ploidy) | 12 | 9; 9; 6 | 5; 5; 4 | 10; 10; 10 |
potato | gobii | | | | | | U | | U |
potato | germinate | 9; 9; 9 | | | | 12 | U | 5; 5; 5 | 3; 3; 3 |
potato | montydb | | | | | | | | U |
potato | breedbase | | | | | | | | U |
maize_chr1 | gigwa | 219; 221; 222 | 274; 282; 270 | 301; 288; 295 | 327; 321; 334 (!truncated since multi-allelic) | 870.8 | 442; 439; 438 | 241; 238; 236 | 568; 574; 569 |
maize_chr1 | gobii | 250; 454; 341; 301 | | 1,035; 912; 895 | | 2,883 | U | 159; 156; 157 | 148; 144; 144 |
maize_chr1 | germinate | 130; 130; 129 | | 107; 106; 110 | | 722 | U | 168; 175; 170 | 205; 216; 208 |
maize_chr1 | montydb | | | | | | | | |
maize_chr1 | breedbase | | | | | | | | |
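Because each timing cell holds several replicates separated by semicolons, comparing platforms usually means reducing a cell to a single number first. The sketch below parses a cell and reports the replicate count, mean, and median. The function names are hypothetical, and the choice to skip non-numeric entries such as U and to ignore trailing notes in parentheses is an assumption about how such cells should be handled.

```python
import re
from statistics import mean, median

# A number with optional thousands separators and an optional decimal part.
NUMBER = re.compile(r"\s*(\d+(?:,\d{3})*(?:\.\d+)?)")


def parse_replicates(cell: str) -> list:
    """Extract replicate values from a cell such as '3,892; 3,841; 3,888'."""
    values = []
    for token in cell.split(";"):
        match = NUMBER.match(token)
        if match:  # entries like 'U' or empty strings are skipped
            values.append(float(match.group(1).replace(",", "")))
    return values


def summarize(cell: str):
    """Return (n, mean, median) for a cell, or None if it has no numbers."""
    values = parse_replicates(cell)
    if not values:
        return None
    return len(values), mean(values), median(values)


print(summarize("326; 323; 327"))
print(summarize("U (ploidy)"))  # None
```

The median is often the safer summary here, since a few cells (for example the gobii hapmap import of maize_chr1) contain one replicate that differs noticeably from the others.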
Table 4a: Marker extraction
Time in seconds spent extracting contiguous and non-contiguous subsets of markers. Gigwa supports subsetting by genomic interval rather than an uploaded list of marker names, so timings are from the equivalent interval.
markers | gigwa hapmap | gigwa flapjack | gobii hapmap | gobii flapjack | germinate hapmap | germinate flapjack | montydb hapmap | montydb flapjack | breedbase hapmap | breedbase flapjack |
---|---|---|---|---|---|---|---|---|---|---|
markers.lettuce_chr1_10pct | 28; 27; 27 | 68; 58; 57 | 102; 102; 101 | 97; 97; 96 | 12; 12; 12 | 23; 22; 23 | | | | |
markers.lettuce_chr1_20pct | 54; 52; 52 | 110; 111; 112 | 205; 202; 203 | 192; 193; 191 | 22; 22; 21 | 43; 43; 43 | | | | |
markers.lettuce_chr1_30pct | 80; 78; 80 | 165; 165; 167 | 298; 313; 309 | 291; 291; 293 | 30; 31; 31 | 63; 63; 64 | | | | |
markers.lettuce_chr1_40pct | 107; 103; 103 | 227; 220; 217 | 411; 406; 407 | 385; 383; 385 | 39; 39; 39 | 81; 83; 82 | | | | |
markers.lettuce_chr1_50pct | 132; 140; 128 | 270; 271; 273 | 511; 514; 521 | 486; 487; 484 | 48; 47; 48 | 97; 96; 95 | | | | |
markers.lettuce_chr1_1K | 1; 1; 1 | 1; 1; 1 | 2; 2; 2 | 5; 2; 2 | 3; 3; 3 | 2; 2; 2 | | | | |
markers.lettuce_chr1_10K | 5; 5; 5 | 7; 7; 7 | 11; 11; 10 | 10; 10; 10 | 4; 4; 4 | 3; 3; 4 | | | | |
markers.lettuce_chr1_100K | 48; 48; 46 | 73; 71; 74 | 93; 93; 95 | 88; 87; 87 | 11; 11; 11 | 19; 19; 19 | | | | |
markers.lettuce_chr1_1M | 469; 478; 482 | 718; 717; 728 | 891; 893; 900 | 839; 840; 843 | 81; 83; 81 | 165; 163; 165 | | | | |
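How the extraction timings were collected is not described in this document. As one possible approach, the sketch below measures the wall-clock time of a task over three replicates and formats the result in the same "a; b; c" style used in the tables. The function names and the placeholder task are hypothetical; a real run would replace the task with the platform-specific export call.

```python
import time


def time_replicates(task, replicates: int = 3) -> list:
    """Run `task` several times and return wall-clock seconds per run."""
    timings = []
    for _ in range(replicates):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return timings


def format_cell(timings) -> str:
    """Format timings as whole seconds separated by '; ', as in the tables."""
    return "; ".join(f"{round(t):,}" for t in timings)


def placeholder_task():
    # Stand-in for a real extraction request (hypothetical).
    time.sleep(0.01)


timings = time_replicates(placeholder_task)
print(len(timings))                          # 3
print(format_cell([468.6, 478.2, 1482.4]))   # 469; 478; 1,482
```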
Table 4b: Sample extraction
Maize sample lists should be extracted across all 10 chromosome datasets.
samples | gigwa hapmap | gigwa flapjack | gobii hapmap | gobii flapjack | germinate hapmap | germinate flapjack | montydb hapmap | montydb flapjack | breedbase hapmap | breedbase flapjack |
---|---|---|---|---|---|---|---|---|---|---|
samples.lettuce_chr1_5pct | 42; 43; 41 | 68; 68; 72 | 945; 955; 943 | 921; 923; 924 | 30; 31; 29 | 12; 12; 12 | | | | |
samples.lettuce_chr1_10pct | 71; 72; 70 | 112; 109; 112 | 959; 967; 969 | 922; 917; 926 | 33; 33; 33 | 21; 22; 21 | | | | |
samples.lettuce_chr1_25pct | 172; 172; 173 | 240; 238; 244 | 962; 978; 955 | 932; 934; 932 | 42; 41; 41 | 48; 49; 49 | | | | |
samples.lettuce_chr1_50pct | 297; 300; 304 | 455; 451; 454 | 991; 994; 989 | 934; 928; 934 | 58; 58; 58 | 94; 93; 94 | | | | |
samples.maize_5pct | 76; 76; 79 | 113; 113; 113 | | | 32; 34; 33 | 50; 49; 51 | | | | |
samples.maize_10pct | 126; 125; 126 | 192; 191; 190 | | | 54; 55; 55 | 74; 74; 73 | | | | |
samples.maize_25pct | 270; 268; 259 | 434; 407; 407 | | | 110; 111; 113 | 162; 160; 159 | | | | |
samples.maize_50pct | 482; 490; 488 | 782; 785; 817 | | | 201; 199; 202 | 298; 298; 297 | | | | |