- Created by Yaw Nti-Addae, last modified on Feb 06, 2019
Quality Control using GOBii-KDCompute Module
TODO
How to initiate QC on dataset loading
TODO
Data Types | Description |
---|---|
Dominant | Presence, Absence |
Codominant | hom_class_1 = 0, het_class = 1, hom_class_2 = 2. 0 usually represents the homozygous absence class, 1 the heterozygous class, and 2 the homozygous presence class |
SSR_allele | Each discovered combination of alleles whether homozygote or heterozygote |
2_letter_nucleotide (SNP) | hom_class_1 is the first nucleotide encountered in the list of samples, hom_class_2 is the alternate allele encountered, and het_class is a combination of the two alleles |
IUPAC | IUPAC is converted to 2_letter_nucleotide upon loading, so the description above applies |
Output from QC
Follow the file path in the QC email notification to access output files as below:
This file contains the genotype data extracted from GDM during the QC process.
Field | Description |
---|---|
project_pi_contact | |
project_name | |
project_genotyping_purpose | |
project_date_sampled | |
project_division | |
project_study_name | |
experiment_name | |
platform_name | |
vendor_protocol_name | |
vendor_name | |
protocol_name | |
dataset_name | |
dataset_type | |
analysis_name |
Pair-wise comparison between all samples whose values match exactly (case sensitive) for the metadata fields listed below.
For example, samples A, B and C with the same germplasm_external_code = 10001 will have 3 reproducibility comparisons (AB, AC, BC).
Field | Description |
---|---|
dnarun_name | |
germplasm_name | |
germplasm_external_code | |
dnasample_name | |
dnasample_platename | |
dnasample_num | |
dnasample_ref_sample |
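
The pairing rule described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name `reproducibility_pairs` and the sample dictionaries are hypothetical, and skipping blank metadata values is an assumption.

```python
from collections import defaultdict
from itertools import combinations

def reproducibility_pairs(samples, field):
    """Return every pair of samples sharing an exact (case-sensitive) value for `field`."""
    groups = defaultdict(list)
    for sample in samples:
        value = sample.get(field)
        if value:  # assumption: blank metadata values are not grouped
            groups[value].append(sample["dnarun_name"])
    pairs = []
    for names in groups.values():
        # n samples with the same value give n * (n - 1) / 2 comparisons
        pairs.extend(combinations(names, 2))
    return pairs

samples = [
    {"dnarun_name": "A", "germplasm_external_code": "10001"},
    {"dnarun_name": "B", "germplasm_external_code": "10001"},
    {"dnarun_name": "C", "germplasm_external_code": "10001"},
    {"dnarun_name": "D", "germplasm_external_code": "10002"},
]
pairs = reproducibility_pairs(samples, "germplasm_external_code")
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

Sample D has no partner with the same code, so it contributes no comparison.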
Reproducibility calculations depend on the dataset_type. For all dataset_types, if either sample has missing data (NN or N) at a marker, that marker is ignored in the calculation.
2_letter_nucleotide
- 1_allele_mismatch: number of markers where one sample has a single allele that differs from the other sample, e.g. sample 1 = AA and sample 2 = AT, or sample 1 = AT and sample 2 = CA (i.e. phasing is not taken into consideration). Calculated as a ratio of the markers that have no missing data for either sample.
- 2_allele_mismatch: number of markers where both alleles of one sample differ from the other sample, e.g. sample 1 = AA and sample 2 = TT, or sample 1 = AA and sample 2 = CG. Calculated as a ratio of the markers that have no missing data for either sample. Note that phasing is not considered, so sample 1 = AT is considered a match to sample 2 = TA.
- total_sample_mismatch: sum of 1_allele_mismatch and 2_allele_mismatch, as a ratio of all markers that have no missing data for both samples.
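
One way to implement the 2_letter_nucleotide rules above is to compare the two calls as unordered multisets of alleles. The sketch below is illustrative only: the function names are hypothetical, and the multiset approach is an assumption about how phasing can be ignored, not a description of the module's code.

```python
from collections import Counter

MISSING = {"N", "NN"}  # missing-data calls named in the text

def allele_mismatches(call_1, call_2):
    """Number of mismatched alleles (0, 1 or 2) between two unphased 2-letter calls."""
    shared = sum((Counter(call_1) & Counter(call_2)).values())
    return 2 - shared

def nucleotide_reproducibility(sample_1, sample_2):
    """Mismatch ratios over the markers with no missing data in either sample."""
    one = two = compared = 0
    for call_1, call_2 in zip(sample_1, sample_2):
        if call_1 in MISSING or call_2 in MISSING:
            continue  # marker ignored when either sample is missing
        compared += 1
        mismatches = allele_mismatches(call_1, call_2)
        one += mismatches == 1
        two += mismatches == 2
    if compared == 0:
        return None  # assumption: no comparable markers gives no score
    return {
        "1_allele_mismatch": one / compared,
        "2_allele_mismatch": two / compared,
        "total_sample_mismatch": (one + two) / compared,
    }

result = nucleotide_reproducibility(
    ["AA", "AT", "AA", "AT", "NN"],
    ["AT", "CA", "TT", "TA", "AA"],
)
print(result)
```

In this example four markers are compared: AA/AT and AT/CA are 1 allele mismatches, AA/TT is a 2 allele mismatch, and AT/TA is a match.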
Codominant
- 1_allele_mismatch: number of markers where one sample is a 0 or 2 and the other sample is a 1 (heterozygote), inferring one allele is mismatched, as a ratio of all markers with valid allele data for both samples.
- 2_allele_mismatch: number of markers where one sample is a 0 and the other sample is a 2, inferring both alleles are mismatched, as a ratio of all markers with valid allele data for both samples.
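
For Codominant data the two rules above reduce to the absolute difference between the class codes. A minimal sketch, assuming calls are stored as the integers 0, 1 and 2 and that missing data is represented as `None` (both are assumptions made for illustration):

```python
def codominant_reproducibility(sample_1, sample_2):
    """Mismatch ratios for codominant calls coded 0, 1 (heterozygote) and 2."""
    one = two = compared = 0
    for call_1, call_2 in zip(sample_1, sample_2):
        if call_1 is None or call_2 is None:
            continue  # assumption: None marks missing data
        compared += 1
        difference = abs(call_1 - call_2)
        one += difference == 1  # 0 or 2 against the heterozygote 1
        two += difference == 2  # 0 against 2
    if compared == 0:
        return None
    return {
        "1_allele_mismatch": one / compared,
        "2_allele_mismatch": two / compared,
    }

codominant_result = codominant_reproducibility(
    [0, 2, 0, 1, None, 2],
    [1, 1, 2, 1, 0, 2],
)
print(codominant_result)  # {'1_allele_mismatch': 0.4, '2_allele_mismatch': 0.2}
```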
Dominant
- Mismatch: number of markers where one sample has a ‘0’ allele call and the other sample has a ‘1’ allele call, as a ratio of all markers without missing data for both samples.
SSR_allele
- 1_allele_mismatch: Either the first allele in both samples mismatch OR the second allele in both samples mismatch, as a ratio of all markers with valid allele data for both samples. Note that only the simplest case of SSR alleles, with 2 alleles per sample, is considered. It is an open question whether a sample 123,124,125 would be scored as a 1 allele mismatch against 123,124,127.
- 2_allele_mismatch: First allele in both samples mismatch AND the second allele in both samples mismatch, as a ratio of all markers with valid allele data for both samples.
- total_sample_mismatch: sum of 1_allele_mismatch and 2_allele_mismatch inferring an overall mismatch score, as a ratio of all markers with valid allele data for both samples.
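
Unlike the nucleotide case, the SSR rules above compare alleles by position. The following sketch assumes each call is a pair of allele sizes and that missing data is `None`; both representations, and the function name, are hypothetical.

```python
def ssr_reproducibility(sample_1, sample_2):
    """Positional mismatch ratios for SSR calls given as (allele_1, allele_2) pairs."""
    one = two = compared = 0
    for call_1, call_2 in zip(sample_1, sample_2):
        if call_1 is None or call_2 is None:
            continue  # assumption: None marks missing data
        compared += 1
        # First allele against first allele, second against second
        mismatches = (call_1[0] != call_2[0]) + (call_1[1] != call_2[1])
        one += mismatches == 1
        two += mismatches == 2
    if compared == 0:
        return None
    return {
        "1_allele_mismatch": one / compared,
        "2_allele_mismatch": two / compared,
        "total_sample_mismatch": (one + two) / compared,
    }

ssr_result = ssr_reproducibility(
    [(123, 124), (123, 124), (123, 123), None],
    [(123, 127), (125, 127), (123, 123), (120, 120)],
)
print(ssr_result)
```

Three markers are compared here, giving one 1 allele mismatch and one 2 allele mismatch.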
Pair-wise calculation of genotypic similarity amongst all samples, with sample metadata provided above and to the left of the matrix. The calculation is displayed as a symmetric matrix (each diagonal cell compares a sample with itself and should always equal 1) with column names identical to row names.
For example [Table 1]:
| | Sample_1 | Sample_2 | Sample_3 |
---|---|---|---|
Sample_1 | 1.0 | 0.9 | 0.4 |
Sample_2 | 0.9 | 1.0 | 0.3 |
Sample_3 | 0.4 | 0.3 | 1.0 |
In the above table, samples 1 and 2 are very similar, whereas sample 3 is less so; the genetic similarity between samples 1 and 3 is 0.4.
Genetic similarity ranges from 0 (no similarity) to 1 (identical) and is calculated as the average of the comparison scores across all markers
using the following scoring methodology for markers with valid allele calls:
- Dominant: For markers with both samples having a ‘0’ or both samples having a ‘1’, the value is 1 (a match), otherwise the value is 0 (no match)
- Codominant & 2_letter_nucleotide: For markers where both samples have the same homozygote the value is 1, if one sample is a heterozygote the value is 0.5, and if the samples have different homozygotes the value is 0. If both samples are the same heterozygote the value is 1 (phasing is not considered, so AT is the same as TA).
- SSR: For markers with both samples having the same homozygous allele pair the value is 1, if either sample is a heterozygote the value is 0.5, for different homozygotes the value is 0.
Markers with missing nucleotides in either sample are omitted from the similarity calculation.
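
The 2_letter_nucleotide scoring above can be sketched as a per-marker score averaged over the comparable markers. The function names are hypothetical, and scoring two different heterozygotes as 0.5 is an assumption, since the text does not cover that case explicitly.

```python
MISSING = {"N", "NN"}

def marker_score(call_1, call_2):
    """Score one marker for two unphased 2-letter calls."""
    if sorted(call_1) == sorted(call_2):
        return 1.0  # same homozygote, or same heterozygote regardless of phase
    if call_1[0] != call_1[1] or call_2[0] != call_2[1]:
        return 0.5  # at least one heterozygote (assumed to cover het vs het too)
    return 0.0      # different homozygotes

def similarity(sample_1, sample_2):
    """Average marker score, skipping markers missing in either sample."""
    scores = [
        marker_score(a, b)
        for a, b in zip(sample_1, sample_2)
        if a not in MISSING and b not in MISSING
    ]
    return sum(scores) / len(scores) if scores else None

def similarity_matrix(samples):
    """Symmetric matrix as nested dicts keyed by sample name."""
    names = list(samples)
    return {r: {c: similarity(samples[r], samples[c]) for c in names} for r in names}

matrix = similarity_matrix({
    "Sample_1": ["AA", "AT", "CC", "GG"],
    "Sample_2": ["AA", "TA", "CG", "NN"],
    "Sample_3": ["TT", "AA", "GG", "GG"],
})
print(matrix["Sample_1"]["Sample_3"])  # 0.375
```

Because the score for a pair does not depend on the order of the two samples, the resulting matrix is symmetric and its diagonal is 1.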
Field | Description |
---|---|
germplasm_name | |
germplasm_external_code | |
dnasample_name | |
dnasample_num | |
dnasample_platename | |
dnasample_ref_sample | |
dnarun_name |
Alternative representation of the Similarity Matrix. Each pair-wise comparison result is written on its own row, together with the metadata of the compared samples, in the following structure:
< Sample 1 meta fields > , Similarity score , < Sample 2 meta fields >
Field | Description |
---|---|
germplasm_name_sample | |
germplasm_external_code_sample | |
dnasample_name_sample | |
dnasample_num_sample | |
dnasample_platename_sample | |
dnasample_ref_sample_sample | |
dnarun_name_sample |
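
Converting the matrix into this row-per-comparison layout might look like the sketch below. It uses the values of Table 1; the metadata values and the function name are hypothetical, and emitting each unordered pair once (without the diagonal) is an assumption.

```python
from itertools import combinations

def matrix_to_rows(names, matrix, metadata):
    """One row per pair: sample 1 meta fields, similarity score, sample 2 meta fields."""
    rows = []
    for i, j in combinations(range(len(names)), 2):
        rows.append(metadata[names[i]] + [matrix[i][j]] + metadata[names[j]])
    return rows

names = ["Sample_1", "Sample_2", "Sample_3"]
matrix = [
    [1.0, 0.9, 0.4],
    [0.9, 1.0, 0.3],
    [0.4, 0.3, 1.0],
]
# Hypothetical metadata: [germplasm_name, dnarun_name] for each sample
metadata = {
    "Sample_1": ["G1", "run_1"],
    "Sample_2": ["G2", "run_2"],
    "Sample_3": ["G3", "run_3"],
}
rows = matrix_to_rows(names, matrix, metadata)
for row in rows:
    print(row)
```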
Field | Description |
---|---|
marker_name | |
platform_name | |
Sample_count | Number of samples genotyped |
Missing_count | Number of samples with missing data allele calls (eg N, NN) |
Unexpected_count | Number of samples with unexpected allele calls for the dataset_type |
No_data | Number of samples with blank fields (no allele call provided) |
Data_count | Number of samples with valid allele calls for the dataset_type ie (sample_count minus (missing_count plus unexpected_count plus no_data)) |
Call_rate | frequency of data_count in relation to sample_count, i.e. data_count/sample_count |
freq | Allele counts and frequencies, calculated as a ratio of the data_count, vary according to the dataset_type |
MAF | minor allele frequency: frequency of the allele with the lowest frequency (including when present in the het class) |
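
The marker-level fields above follow from simple counts. The sketch below works through them for a 2_letter_nucleotide marker; treating only A, C, G and T as valid alleles, and reporting a MAF of 0 for a monomorphic marker, are assumptions made for illustration.

```python
from collections import Counter

MISSING = {"N", "NN"}        # missing-data calls named in the table above
VALID_ALLELES = set("ACGT")  # assumption: what counts as a valid 2-letter call

def marker_stats(calls):
    """Summarise one marker from a list of per-sample allele calls."""
    sample_count = len(calls)
    missing_count = sum(call in MISSING for call in calls)
    no_data = sum(call == "" for call in calls)
    valid = [c for c in calls if len(c) == 2 and set(c) <= VALID_ALLELES]
    unexpected_count = sample_count - missing_count - no_data - len(valid)
    data_count = sample_count - (missing_count + unexpected_count + no_data)
    alleles = Counter("".join(valid))  # heterozygotes contribute one of each allele
    total = sum(alleles.values())
    # Assumption: a monomorphic marker, or one with no valid data, gets MAF 0
    maf = min(alleles.values()) / total if len(alleles) > 1 else 0.0
    return {
        "Sample_count": sample_count,
        "Missing_count": missing_count,
        "Unexpected_count": unexpected_count,
        "No_data": no_data,
        "Data_count": data_count,
        "Call_rate": data_count / sample_count if sample_count else None,
        "MAF": maf,
    }

stats = marker_stats(["AA", "AT", "TT", "AA", "NN", "", "A?", "AA"])
print(stats)
```

Here 5 of the 8 calls are valid (Call_rate 0.625), and T accounts for 3 of the 10 valid alleles (MAF 0.3).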
Field | Description |
---|---|
sample_name | |
dnarun_name | |
germplasm_name | |
germplasm_external_code | |
dnasample_name | |
dnasample_platename | |
dnasample_num | |
germplasm_type | |
germplasm_species | |
germplasm_subsp | |
germplasm_heterotic_group | |
dnasample_sample_group | |
dnasample_sample_group_cycle | |
dnasample_ref_sample | |
Marker_count | Number of markers genotyped for the sample |
Missing_count | Number of markers with missing data allele calls (eg N, NN) |
Unexpected_count | Number of markers with unexpected allele calls for the dataset_type |
No_data | Number of markers with blank fields (no allele call provided) |
Data_count | Number of markers with valid allele calls for the dataset_type, i.e. (marker_count minus (missing_count plus unexpected_count plus no_data)) |
Call_rate | frequency of data_count in relation to marker_count ie data_count/marker_count |
freq | Allele counts and frequencies (freq: calculated as a ratio of the data_count) vary according to the dataset_type |
Provides sample statistics averaged by the following sample metadata fields:
Field | Description |
---|---|
germplasm_type | |
germplasm_species | |
germplasm_subsp | |
germplasm_heterotic_group | |
dnasample_sample_group | |
dnasample_sample_group_cycle | |
Marker_count | Number of markers genotyped for the sample |
Missing_count | Number of markers with missing data allele calls (eg N, NN) |
Unexpected_count | Number of markers with unexpected allele calls for the dataset_type |
No_data | Number of markers with blank fields (no allele call provided) |
Data_count | Number of markers with valid allele calls for the dataset_type, i.e. (marker_count minus (missing_count plus unexpected_count plus no_data)) |
Call_rate | frequency of data_count in relation to marker_count ie data_count/marker_count |
freq | Allele counts and frequencies (freq: calculated as a ratio of the data_count) vary according to the dataset_type |
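
Averaging the per-sample statistics by metadata fields is a plain group-by. A minimal sketch, with a hypothetical function name and made-up sample records:

```python
from collections import defaultdict
from statistics import mean

def average_by(samples, group_fields, stat_fields):
    """Average the given statistics over samples sharing the same metadata values."""
    groups = defaultdict(list)
    for sample in samples:
        key = tuple(sample[field] for field in group_fields)
        groups[key].append(sample)
    return {
        key: {stat: mean(member[stat] for member in members) for stat in stat_fields}
        for key, members in groups.items()
    }

# Hypothetical per-sample statistics
samples = [
    {"germplasm_type": "F2", "Marker_count": 100, "Call_rate": 0.75},
    {"germplasm_type": "F2", "Marker_count": 96, "Call_rate": 0.5},
    {"germplasm_type": "RIL", "Marker_count": 100, "Call_rate": 1.0},
]
averages = average_by(samples, ["germplasm_type"], ["Marker_count", "Call_rate"])
print(averages[("F2",)])
```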
A chisq test is carried out for samples identified as having one of the germplasm_type values listed below, calculating deviations from the expected allele ratios below. If the dnasample_sample_group or dnasample_sample_group_cycle fields are provided, the chisq tests are carried out separately by these fields.
H0 (Null Hypothesis): Across samples, the marker supports the expected segregation ratio of the specified germplasm population.
H1 (Alternative Hypothesis): Reject H0
Calculation:
Zygotes of the samples, grouped by the meta field criteria and with missing values excluded, are counted as nz. The total number of zygotes is ntotal.
The following formula is applied to all zygotes to calculate Chisq, using the lookup tables in the Germplasm Population Distributions section below:
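
The formula itself is not present in this version of the page. Assuming the standard Pearson chi-square statistic is intended, it would read as follows, where $n_z$ and $n_{\text{total}}$ are the counts nz and ntotal defined above and $p_z$ (a symbol introduced here for illustration) is the expected ratio of zygote class $z$ from the lookup table:

```latex
\chi^2 = \sum_{z} \frac{\left(n_z - n_{\text{total}}\, p_z\right)^2}{n_{\text{total}}\, p_z}
```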
Field | Description |
---|---|
germplasm_type | |
dnasample_sample_group | |
marker_name | |
Chisq | Chisq statistic |
P< | Probability value calculated from chisq statistic |
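
As an illustration of the Chisq and P< fields, the sketch below computes a Pearson chi-square statistic for one marker from observed zygote counts and the expected ratios in the Germplasm Population Distributions table. It is a sketch under stated assumptions, not the module's code: classes with an expected ratio of 0 are skipped, the degrees of freedom are taken as the number of remaining classes minus 1, and the p-value uses the closed forms for 1 and 2 degrees of freedom only.

```python
import math

# Expected (Homozygote 1, Heterozygote, Homozygote 2) ratios, copied from the
# Germplasm Population Distributions table for two germplasm types
EXPECTED = {
    "F2": (0.25, 0.5, 0.25),
    "BC1F1": (0.5, 0.5, 0.0),
}

def chisq_test(counts, ratios):
    """Return (chisq, p) for observed class counts against expected ratios."""
    n_total = sum(counts)
    if n_total == 0:
        return None, None
    chisq = 0.0
    classes = 0
    for n_z, ratio in zip(counts, ratios):
        if ratio == 0:
            continue  # assumption: classes with zero expectation are skipped
        expected = n_total * ratio
        chisq += (n_z - expected) ** 2 / expected
        classes += 1
    df = classes - 1  # assumption about the degrees of freedom
    if df == 1:
        p = math.erfc(math.sqrt(chisq / 2))  # chi-square survival function, df = 1
    elif df == 2:
        p = math.exp(-chisq / 2)             # chi-square survival function, df = 2
    else:
        p = None
    return chisq, p

chisq, p = chisq_test([30, 50, 20], EXPECTED["F2"])
print(chisq, p)
```

For 100 F2 zygotes observed as 30, 50 and 20, the expected counts are 25, 50 and 25, giving a statistic of 2.0.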
Germplasm Population Distributions
The following table is used for the Chisq test on 2_letter_nucleotide and SSR datasets. For SSR, the most frequent allele pair is labelled Homozygote 1, the second most frequent allele pair is labelled Homozygote 2, and the heterozygote between the two homozygotes is labelled Heterozygote.
Germplasm_type | Homozygote 1 ratio | Heterozygote ratio | Homozygote 2 ratio |
---|---|---|---|
RH | 0.5 | 0 | 0.5 |
RIL | 0.5 | 0 | 0.5 |
F2 | 0.25 | 0.5 | 0.25 |
F3 | 0.375 | 0.25 | 0.375 |
F4 | 0.4375 | 0.125 | 0.4375 |
F5 | 0.46875 | 0.0625 | 0.46875 |
F6 | 0.484375 | 0.03125 | 0.484375 |
F7 | 0.4921875 | 0.015625 | 0.4921875 |
F8 | 0.49609375 | 0.0078125 | 0.49609375 |
F9 | 0.498046875 | 0.00390625 | 0.498046875 |
BC1F1 | 0.5 | 0.5 | 0 |
BC2F1 | 0.75 | 0.25 | 0 |
BC3F1 | 0.875 | 0.125 | 0 |
BC4F1 | 0.9375 | 0.0625 | 0 |
BC5F1 | 0.96875 | 0.03125 | 0 |
BC6F1 | 0.984375 | 0.015625 | 0 |
BC7F1 | 0.9921875 | 0.0078125 | 0 |
BC8F1 | 0.99609375 | 0.00390625 | 0 |
BC1F2 | 0.625 | 0.25 | 0.125 |
BC2F2 | 0.8125 | 0.125 | 0.0625 |
BC3F2 | 0.90625 | 0.0625 | 0.03125 |
BC4F2 | 0.953125 | 0.03125 | 0.015625 |
BC5F2 | 0.9765625 | 0.015625 | 0.0078125 |
BC6F2 | 0.98828125 | 0.0078125 | 0.00390625 |
BC7F2 | 0.994140625 | 0.00390625 | 0.001953125 |
BC8F2 | 0.997070313 | 0.001953125 | 0.000976563 |
BC1F3 | 0.6875 | 0.125 | 0.1875 |
BC2F3 | 0.84375 | 0.0625 | 0.09375 |
BC3F3 | 0.921875 | 0.03125 | 0.046875 |
BC4F3 | 0.9609375 | 0.015625 | 0.0234375 |
BC5F3 | 0.98046875 | 0.0078125 | 0.01171875 |
BC6F3 | 0.990234375 | 0.00390625 | 0.005859375 |
BC7F3 | 0.995117188 | 0.001953125 | 0.002929688 |
BC8F3 | 0.997558594 | 0.000976563 | 0.001464844 |
BC1F4 | 0.71875 | 0.0625 | 0.21875 |
BC2F4 | 0.859375 | 0.03125 | 0.109375 |
BC3F4 | 0.9296875 | 0.015625 | 0.0546875 |
BC4F4 | 0.96484375 | 0.0078125 | 0.02734375 |
BC5F4 | 0.982421875 | 0.00390625 | 0.013671875 |
BC6F4 | 0.991210938 | 0.001953125 | 0.006835938 |
BC7F4 | 0.995605469 | 0.000976563 | 0.003417969 |
BC8F4 | 0.997802734 | 0.000488281 | 0.001708984 |
BC1F5 | 0.734375 | 0.03125 | 0.234375 |
BC2F5 | 0.8671875 | 0.015625 | 0.1171875 |
BC3F5 | 0.93359375 | 0.0078125 | 0.05859375 |
BC4F5 | 0.966796875 | 0.00390625 | 0.029296875 |
BC5F5 | 0.983398438 | 0.001953125 | 0.014648438 |
BC6F5 | 0.991699219 | 0.000976563 | 0.007324219 |
BC7F5 | 0.995849609 | 0.000488281 | 0.003662109 |
BC8F5 | 0.997924805 | 0.000244141 | 0.001831055 |
BC1F6 | 0.7421875 | 0.015625 | 0.2421875 |
BC2F6 | 0.87109375 | 0.0078125 | 0.12109375 |
BC3F6 | 0.935546875 | 0.00390625 | 0.060546875 |
BC4F6 | 0.967773438 | 0.001953125 | 0.030273438 |
BC5F6 | 0.983886719 | 0.000976563 | 0.015136719 |
BC6F6 | 0.991943359 | 0.000488281 | 0.007568359 |
BC7F6 | 0.99597168 | 0.000244141 | 0.00378418 |
BC8F6 | 0.99798584 | 0.00012207 | 0.00189209 |
BC1F7 | 0.74609375 | 0.0078125 | 0.24609375 |
BC2F7 | 0.873046875 | 0.00390625 | 0.123046875 |
BC3F7 | 0.936523438 | 0.001953125 | 0.061523438 |
BC4F7 | 0.968261719 | 0.000976563 | 0.030761719 |
BC5F7 | 0.984130859 | 0.000488281 | 0.015380859 |
BC6F7 | 0.99206543 | 0.000244141 | 0.00769043 |
BC7F7 | 0.996032715 | 0.00012207 | 0.003845215 |
BC8F7 | 0.998016357 | 0.000061035 | 0.001922607 |
BC1F8 | 0.748046875 | 0.00390625 | 0.248046875 |
BC2F8 | 0.874023438 | 0.001953125 | 0.124023438 |
BC3F8 | 0.937011719 | 0.000976563 | 0.062011719 |
BC4F8 | 0.968505859 | 0.000488281 | 0.031005859 |
BC5F8 | 0.98425293 | 0.000244141 | 0.01550293 |
BC6F8 | 0.992126465 | 0.00012207 | 0.007751465 |
BC7F8 | 0.996063232 | 0.000061035 | 0.003875732 |
BC8F8 | 0.998031616 | 0.000030518 | 0.001937866 |
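
The F and BC rows of the table above appear consistent with standard single-locus Mendelian expectations: each backcross halves the heterozygote fraction, and each generation of selfing halves it again while splitting the lost half equally between the two homozygotes. The sketch below reproduces the rows on that assumption; the function name and the parsing of the germplasm_type label are hypothetical.

```python
import re

def expected_ratios(germplasm_type):
    """Expected (Homozygote 1, Heterozygote, Homozygote 2) ratios at one locus."""
    if germplasm_type in ("RH", "RIL"):
        return (0.5, 0.0, 0.5)  # taken directly from the table
    match = re.fullmatch(r"(?:BC(\d+))?F(\d+)", germplasm_type)
    if not match:
        raise ValueError(f"unsupported germplasm_type: {germplasm_type}")
    backcrosses = int(match.group(1) or 0)
    generation = int(match.group(2))
    # F1 is fully heterozygous; each backcross to parent 1 halves the heterozygotes
    het = 0.5 ** backcrosses
    hom1, hom2 = 1.0 - het, 0.0
    for _ in range(generation - 1):  # one round of selfing per generation after F1
        hom1, het, hom2 = hom1 + het / 4, het / 2, hom2 + het / 4
    return (hom1, het, hom2)

print(expected_ratios("F2"))     # (0.25, 0.5, 0.25)
print(expected_ratios("BC1F3"))  # (0.6875, 0.125, 0.1875)
```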
Dominant Germplasm Type Distributions
For Dominant datasets only. Two sets of statistics are calculated: Major-Pairing and Minor-Pairing.
Germplasm_type | Major-Pairing Major Allele ratio | Major-Pairing Minor Allele ratio | Minor-Pairing Major Allele ratio | Minor-Pairing Minor Allele ratio |
---|---|---|---|---|
RH | 0.5 | 0.5 | 0.5 | 0.5 |
RIL | 0.5 | 0.5 | 0.5 | 0.5 |
F2 | 0.75 | 0.25 | 0.25 | 0.75 |
F3 | 0.625 | 0.375 | 0.375 | 0.625 |
F4 | 0.5625 | 0.4375 | 0.4375 | 0.5625 |
F5 | 0.53125 | 0.46875 | 0.46875 | 0.53125 |
F6 | 0.515625 | 0.484375 | 0.484375 | 0.515625 |
F7 | 0.5078125 | 0.4921875 | 0.4921875 | 0.5078125 |
F8 | 0.50390625 | 0.49609375 | 0.49609375 | 0.50390625 |
F9 | 0.501953125 | 0.498046875 | 0.498046875 | 0.501953125 |
BC1F1 | 1 | 0 | 0.5 | 0.5 |
BC2F1 | 1 | 0 | 0.75 | 0.25 |
BC3F1 | 1 | 0 | 0.875 | 0.125 |
BC4F1 | 1 | 0 | 0.9375 | 0.0625 |
BC5F1 | 1 | 0 | 0.96875 | 0.03125 |
BC6F1 | 1 | 0 | 0.984375 | 0.015625 |
BC7F1 | 1 | 0 | 0.9921875 | 0.0078125 |
BC8F1 | 1 | 0 | 0.99609375 | 0.00390625 |
BC1F2 | 0.875 | 0.125 | 0.625 | 0.375 |
BC2F2 | 0.9375 | 0.0625 | 0.8125 | 0.1875 |
BC3F2 | 0.96875 | 0.03125 | 0.90625 | 0.09375 |
BC4F2 | 0.984375 | 0.015625 | 0.953125 | 0.046875 |
BC5F2 | 0.9921875 | 0.0078125 | 0.9765625 | 0.0234375 |
BC6F2 | 0.99609375 | 0.00390625 | 0.98828125 | 0.01171875 |
BC7F2 | 0.998046875 | 0.001953125 | 0.994140625 | 0.005859375 |
BC8F2 | 0.999023438 | 0.000976563 | 0.997070313 | 0.002929688 |
BC1F3 | 0.8125 | 0.1875 | 0.6875 | 0.3125 |
BC2F3 | 0.90625 | 0.09375 | 0.84375 | 0.15625 |
BC3F3 | 0.953125 | 0.046875 | 0.921875 | 0.078125 |
BC4F3 | 0.9765625 | 0.0234375 | 0.9609375 | 0.0390625 |
BC5F3 | 0.98828125 | 0.01171875 | 0.98046875 | 0.01953125 |
BC6F3 | 0.994140625 | 0.005859375 | 0.990234375 | 0.009765625 |
BC7F3 | 0.997070313 | 0.002929688 | 0.995117188 | 0.004882813 |
BC8F3 | 0.998535156 | 0.001464844 | 0.997558594 | 0.002441406 |
BC1F4 | 0.78125 | 0.21875 | 0.71875 | 0.28125 |
BC2F4 | 0.890625 | 0.109375 | 0.859375 | 0.140625 |
BC3F4 | 0.9453125 | 0.0546875 | 0.9296875 | 0.0703125 |
BC4F4 | 0.97265625 | 0.02734375 | 0.96484375 | 0.03515625 |
BC5F4 | 0.986328125 | 0.013671875 | 0.982421875 | 0.017578125 |
BC6F4 | 0.993164063 | 0.006835938 | 0.991210938 | 0.008789063 |
BC7F4 | 0.996582031 | 0.003417969 | 0.995605469 | 0.004394531 |
BC8F4 | 0.998291016 | 0.001708984 | 0.997802734 | 0.002197266 |
BC1F5 | 0.765625 | 0.234375 | 0.734375 | 0.265625 |
BC2F5 | 0.8828125 | 0.1171875 | 0.8671875 | 0.1328125 |
BC3F5 | 0.94140625 | 0.05859375 | 0.93359375 | 0.06640625 |
BC4F5 | 0.970703125 | 0.029296875 | 0.966796875 | 0.033203125 |
BC5F5 | 0.985351563 | 0.014648438 | 0.983398438 | 0.016601563 |
BC6F5 | 0.992675781 | 0.007324219 | 0.991699219 | 0.008300781 |
BC7F5 | 0.996337891 | 0.003662109 | 0.995849609 | 0.004150391 |
BC8F5 | 0.998168945 | 0.001831055 | 0.997924805 | 0.002075195 |
BC1F6 | 0.7578125 | 0.2421875 | 0.7421875 | 0.2578125 |
BC2F6 | 0.87890625 | 0.12109375 | 0.87109375 | 0.12890625 |
BC3F6 | 0.939453125 | 0.060546875 | 0.935546875 | 0.064453125 |
BC4F6 | 0.969726563 | 0.030273438 | 0.967773438 | 0.032226563 |
BC5F6 | 0.984863281 | 0.015136719 | 0.983886719 | 0.016113281 |
BC6F6 | 0.992431641 | 0.007568359 | 0.991943359 | 0.008056641 |
BC7F6 | 0.99621582 | 0.00378418 | 0.99597168 | 0.00402832 |
BC8F6 | 0.99810791 | 0.00189209 | 0.99798584 | 0.00201416 |
BC1F7 | 0.75390625 | 0.24609375 | 0.74609375 | 0.25390625 |
BC2F7 | 0.876953125 | 0.123046875 | 0.873046875 | 0.126953125 |
BC3F7 | 0.938476563 | 0.061523438 | 0.936523438 | 0.063476563 |
BC4F7 | 0.969238281 | 0.030761719 | 0.968261719 | 0.031738281 |
BC5F7 | 0.984619141 | 0.015380859 | 0.984130859 | 0.015869141 |
BC6F7 | 0.99230957 | 0.00769043 | 0.99206543 | 0.00793457 |
BC7F7 | 0.996154785 | 0.003845215 | 0.996032715 | 0.003967285 |
BC8F7 | 0.998077393 | 0.001922607 | 0.998016357 | 0.001983643 |
BC1F8 | 0.751953125 | 0.248046875 | 0.748046875 | 0.251953125 |
BC2F8 | 0.875976563 | 0.124023438 | 0.874023438 | 0.125976563 |
BC3F8 | 0.937988281 | 0.062011719 | 0.937011719 | 0.062988281 |
BC4F8 | 0.968994141 | 0.031005859 | 0.968505859 | 0.031494141 |
BC5F8 | 0.98449707 | 0.01550293 | 0.98425293 | 0.01574707 |
BC6F8 | 0.992248535 | 0.007751465 | 0.992126465 | 0.007873535 |
BC7F8 | 0.996124268 | 0.003875732 | 0.996063232 | 0.003936768 |
BC8F8 | 0.998062134 | 0.001937866 | 0.998031616 | 0.001968384 |
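
The values in this table appear consistent with pooling the heterozygote class of the codominant table with Homozygote 1 (Major-Pairing) or with Homozygote 2 (Minor-Pairing). That reading is an inference from the numbers rather than something the page states, and the function name below is hypothetical.

```python
def dominant_ratios(hom1, het, hom2):
    """Derive the two dominant pairings from codominant (hom1, het, hom2) ratios."""
    major_pairing = (hom1 + het, hom2)  # heterozygotes scored with Homozygote 1
    minor_pairing = (hom1, het + hom2)  # heterozygotes scored with Homozygote 2
    return major_pairing, minor_pairing

# Codominant ratios for F2 and BC1F2 from the Germplasm Population Distributions table
print(dominant_ratios(0.25, 0.5, 0.25))    # ((0.75, 0.25), (0.25, 0.75))
print(dominant_ratios(0.625, 0.25, 0.125)) # ((0.875, 0.125), (0.625, 0.375))
```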