Quality Control using GOBii-KDCompute Module

TODO

How to initiate QC on dataset loading

TODO

Data Types	Description
Dominant	Presence, Absence
Codominant	hom_class_1 = 0, het_class = 1, hom_class_2 = 2. 0 usually represents the homozygous absence class, 1 the heterozygous class, and 2 the homozygous presence class
SSR_allele	Each discovered combination of alleles whether homozygote or heterozygote
2_letter_nucleotide (SNP)	hom_class_1 is the first nucleotide encountered in the list of samples, hom_class_2 is the alternate allele encountered, and het_class is a combination of these two alleles
IUPAC	IUPAC calls are converted to 2_letter_nucleotide upon loading, so the classes are the same as for 2_letter_nucleotide
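
As an illustration of the IUPAC row above, the conversion to 2_letter_nucleotide can be sketched as a lookup table. This is a minimal sketch, not the GDM loader itself: the function name iupac_to_two_letter is hypothetical, the letter order within a heterozygous call and the mapping of N to NN are assumptions, and only the standard single-base and two-base IUPAC codes are covered.

```python
# Standard IUPAC codes for single bases and two-base ambiguity codes.
# Assumption: heterozygous calls are written in alphabetical order and
# N is treated as the missing call NN.
IUPAC_TO_TWO_LETTER = {
    "A": "AA", "C": "CC", "G": "GG", "T": "TT",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "N": "NN",
}


def iupac_to_two_letter(call):
    """Return the 2-letter genotype for one IUPAC call (hypothetical helper)."""
    code = call.strip().upper()
    if code not in IUPAC_TO_TWO_LETTER:
        raise ValueError(f"unexpected IUPAC call: {call!r}")
    return IUPAC_TO_TWO_LETTER[code]
```

For instance, under these assumptions a call of R becomes AG and a call of T becomes TT.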

Output from QC

Follow the file path in the QC email notification to access the output files described below:

dataset.hmp.txt

This file contains the genotype data extracted from GDM during the QC process.

dataset_summary.csv

Field	Description
project_pi_contact
project_name
project_genotyping_purpose
project_date_sampled
project_division
project_study_name
experiment_name
platform_name
vendor_protocol_name
vendor_name
protocol_name
dataset_name
dataset_type
analysis_name

F1.csv

marker.file

Reproducibility.csv

Pair-wise comparison between all samples with exact matches (case sensitive) for the metadata field names.

For example, samples A, B, and C having the same germplasm_external_code=10001 will have 3 (AB, AC, BC) reproducibility comparisons.

Field	Description
dnarun_name
germplasm_name
germplasm_external_code
dnasample_name
dnasample_platename
dnasample_num
dnasample_ref_sample
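
The pairing rule described above (every pair of samples sharing an exact, case-sensitive value for a metadata field) can be sketched as follows. This is only an illustration under stated assumptions: the function name reproducibility_pairs is hypothetical, samples are represented as plain dictionaries keyed by the field names in the table, and samples with a blank value for the field are assumed to be skipped.

```python
from collections import defaultdict
from itertools import combinations


def reproducibility_pairs(samples, field="germplasm_external_code"):
    """List the sample pairs that share an exact value for one metadata field."""
    groups = defaultdict(list)
    for sample in samples:
        value = sample.get(field)
        if value:  # assumption: blank values are never grouped together
            groups[value].append(sample["dnarun_name"])
    pairs = []
    for names in groups.values():
        # Every unordered pair within a group is one comparison.
        pairs.extend(combinations(names, 2))
    return pairs


samples = [
    {"dnarun_name": "A", "germplasm_external_code": "10001"},
    {"dnarun_name": "B", "germplasm_external_code": "10001"},
    {"dnarun_name": "C", "germplasm_external_code": "10001"},
    {"dnarun_name": "D", "germplasm_external_code": "10002"},
]
pairs = reproducibility_pairs(samples)
```

With the three samples sharing code 10001, the sketch yields the AB, AC, and BC comparisons; sample D has no partner and produces none.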

Reproducibility calculations depend on the dataset_type. For all dataset_types, if either sample has missing data (NN or N) at a marker, that marker is ignored in the calculation.

2_letter_nucleotide

1_allele_mismatch: number of markers where one sample has a single allele that differs from the other sample, e.g. sample 1 = AA and sample 2 = AT, or sample 1 = AT and sample 2 = CA (i.e. phasing is not taken into consideration). Calculated as a ratio of markers that have no missing data in either sample.
2_allele_mismatch: number of markers where both alleles of one sample differ from the other sample, e.g. sample 1 = AA and sample 2 = TT, or sample 1 = AA and sample 2 = CG. Calculated as a ratio of markers that have no missing data in either sample. Note that phasing is not considered, so sample 1 = AT is considered a match to sample 2 = TA.
total_sample_mismatch: sum of 1_allele_mismatch and 2_allele_mismatch, as a ratio of all markers that have no missing data in either sample.
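
The 2_letter_nucleotide rules above can be sketched by treating each call as an unordered pair of alleles, so that AT and TA match. This is a minimal sketch rather than the module's actual code: the function names are hypothetical, calls are assumed to be exactly two characters long, and only N and NN are treated as missing.

```python
from collections import Counter

MISSING = {"N", "NN"}


def allele_mismatches(call_1, call_2):
    """Count mismatched alleles (0, 1 or 2) between two calls, ignoring phase."""
    shared = sum((Counter(call_1) & Counter(call_2)).values())
    return 2 - shared


def nucleotide_reproducibility(calls_1, calls_2):
    """Return mismatch ratios over markers with data in both samples."""
    compared = one = two = 0
    for a, b in zip(calls_1, calls_2):
        if a in MISSING or b in MISSING:
            continue  # marker ignored when either sample is missing
        compared += 1
        mismatches = allele_mismatches(a, b)
        if mismatches == 1:
            one += 1
        elif mismatches == 2:
            two += 1
    if compared == 0:
        return None  # nothing to compare
    return {
        "1_allele_mismatch": one / compared,
        "2_allele_mismatch": two / compared,
        "total_sample_mismatch": (one + two) / compared,
    }


sample_1 = ["AA", "AT", "AA", "AT", "NN"]
sample_2 = ["AT", "CA", "TT", "TA", "AA"]
result = nucleotide_reproducibility(sample_1, sample_2)
```

In this example the fifth marker is ignored because sample 1 is NN, leaving four compared markers: two with one mismatched allele, one with two mismatched alleles, and one match (AT versus TA).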

Codominant

1_allele_mismatch: number of markers where one sample is a 0 or 2 and the other sample is a 1 (heterozygote), inferring one allele is mismatched, as a ratio of all markers with valid allele data for both samples.
2_allele_mismatch: number of markers where one sample is a 0 and the other sample is a 2, inferring both alleles are mismatched, as a ratio of all markers with valid allele data for both samples.

Dominant

Mismatch: number of markers where one sample has a ‘0’ allele call and the other sample has a ‘1’ allele call, as a ratio of all markers without missing data for both samples.
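
The Codominant and Dominant rules above reduce to simple comparisons of the numeric class codes. The sketch below is illustrative only: the function names are hypothetical, calls are assumed to be stored as strings, and any call outside the valid codes for the dataset_type is assumed to be ignored in the same way as missing data.

```python
def codominant_reproducibility(calls_1, calls_2):
    """Mismatch ratios for Codominant calls coded 0, 1 (het) and 2."""
    valid = {"0", "1", "2"}
    compared = one = two = 0
    for a, b in zip(calls_1, calls_2):
        if a not in valid or b not in valid:
            continue  # assumption: anything else is skipped like missing data
        compared += 1
        difference = abs(int(a) - int(b))
        if difference == 1:    # 0 or 2 against 1: one allele mismatched
            one += 1
        elif difference == 2:  # 0 against 2: both alleles mismatched
            two += 1
    if compared == 0:
        return None
    return {"1_allele_mismatch": one / compared,
            "2_allele_mismatch": two / compared}


def dominant_reproducibility(calls_1, calls_2):
    """Mismatch ratio for Dominant calls coded 0 (absence) and 1 (presence)."""
    valid = {"0", "1"}
    compared = mismatched = 0
    for a, b in zip(calls_1, calls_2):
        if a not in valid or b not in valid:
            continue
        compared += 1
        if a != b:
            mismatched += 1
    return None if compared == 0 else mismatched / compared


codominant = codominant_reproducibility(["0", "1", "2", "0", "N"],
                                        ["1", "1", "0", "0", "2"])
dominant = dominant_reproducibility(["0", "1", "1", "0", "N"],
                                    ["1", "1", "0", "0", "0"])
```

In both examples the last marker is skipped because one sample has no valid call, so the ratios are taken over four markers.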

SSR_allele

1_allele_mismatch: Either the first allele in both samples mismatch OR the second allele in both samples mismatch, as a ratio of all markers with valid allele data for both samples. Note that only the simplest case of SSR alleles, where there are 2 alleles, is considered, i.e. a sample 123,124,125 will have a 1 allele mismatch with 123, 124, 127.
2_allele_mismatch: First allele in both samples mismatch AND the second allele in both samples mismatch, as a ratio of all markers with valid allele data for both samples.
total_sample_mismatch: sum of 1_allele_mismatch and 2_allele_mismatch inferring an overall mismatch score, as a ratio of all markers with valid allele data for both samples.
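
The positional SSR_allele comparison above can be sketched for the simple two-allele case. This sketch rests on several assumptions that are not stated in the text: each call is represented as a (first allele, second allele) tuple, None stands for a missing or invalid call, a 1-allele mismatch means exactly one of the two positions differs, and the function name ssr_reproducibility is hypothetical.

```python
def ssr_reproducibility(calls_1, calls_2):
    """Position-wise mismatch ratios for two-allele SSR calls."""
    compared = one = two = 0
    for a, b in zip(calls_1, calls_2):
        if a is None or b is None:
            continue  # no valid allele data for one of the samples
        compared += 1
        # First allele against first allele, second against second.
        mismatches = int(a[0] != b[0]) + int(a[1] != b[1])
        if mismatches == 1:
            one += 1
        elif mismatches == 2:
            two += 1
    if compared == 0:
        return None
    return {
        "1_allele_mismatch": one / compared,
        "2_allele_mismatch": two / compared,
        "total_sample_mismatch": (one + two) / compared,
    }


sample_1 = [(123, 125), (123, 125), (123, 125), (130, 130), None]
sample_2 = [(123, 125), (123, 127), (121, 127), (130, 130), (123, 125)]
ssr_result = ssr_reproducibility(sample_1, sample_2)
```

Unlike the 2_letter_nucleotide sketch, this comparison is positional, so the order in which the two alleles are recorded affects the result.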

sample.file

similarity_matrix.csv

Pair-wise calculation of genotypic similarity amongst all samples, with sample metadata provided above and to the left of the matrix. The result is displayed as a symmetric matrix with column names identical to row names; each diagonal cell compares a sample with itself and should always equal 1.

For example [Table 1]:

	Sample_1	Sample_2	Sample_3
Sample_1	1.0	0.9	0.4
Sample_2	0.9	1.0	0.3
Sample_3	0.4	0.3	1.0
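
The two properties stated above (a diagonal of 1 and symmetry) are easy to verify once the numeric part of the matrix has been read. The following is a small sketch using the values from Table 1; the function name check_similarity_matrix is hypothetical, and parsing of the metadata rows and columns in the real file is deliberately left out because their layout is not described here.

```python
def check_similarity_matrix(names, rows, tol=1e-9):
    """Return a list of problems found in a square similarity matrix."""
    problems = []
    for i, name_i in enumerate(names):
        if abs(rows[i][i] - 1.0) > tol:
            problems.append(f"diagonal for {name_i} is {rows[i][i]}, expected 1")
        for j in range(i + 1, len(names)):
            if abs(rows[i][j] - rows[j][i]) > tol:
                problems.append(f"{name_i}/{names[j]} is not symmetric")
    return problems


names = ["Sample_1", "Sample_2", "Sample_3"]
table_1 = [
    [1.0, 0.9, 0.4],
    [0.9, 1.0, 0.3],
    [0.4, 0.3, 1.0],
]
problems = check_similarity_matrix(names, table_1)
```

For Table 1 the returned list is empty, since every diagonal value is 1 and each off-diagonal value equals its mirror.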

similarity_matrix_columnwise.csv

similarity_matrix_with_meta.csv

summary.file

summary_markers.csv

Field	Description
marker_name
platform_name
Sample_count	Number of samples genotyped
Missing_count	Number of samples with missing data allele calls (eg N, NN)
Unexpected_count	Number of samples with unexpected allele calls for the dataset_type
No_data	Number of samples with blank fields (no allele call provided)
Data_count	Number of samples with valid allele calls for the dataset_type ie (sample_count minus (missing_count plus unexpected_count plus no_data))
Call_rate	frequency of data_count in relation to sample_count, i.e. data_count/sample_count
freq	Allele counts and frequencies, calculated as a ratio of the data_count, vary according to the dataset_type
MAF	minor allele frequency: frequency of the allele with the lowest frequency (including when present in the het class)
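
The per-marker statistics above can be sketched for a 2_letter_nucleotide dataset as follows. This is an illustration under stated assumptions, not the module's implementation: the function name summarize_marker is hypothetical, a valid call is assumed to be two letters drawn from A, C, G, T, anything that is neither valid, missing (N, NN), nor blank is counted as unexpected, and MAF is reported as 0 for a marker with a single observed allele.

```python
from collections import Counter

MISSING = {"N", "NN"}
NUCLEOTIDES = set("ACGT")


def summarize_marker(calls):
    """Summarize the allele calls of all samples for one marker."""
    cleaned = [c.strip().upper() for c in calls]
    sample_count = len(cleaned)
    no_data = sum(1 for c in cleaned if c == "")
    missing_count = sum(1 for c in cleaned if c in MISSING)
    valid = [c for c in cleaned if len(c) == 2 and set(c) <= NUCLEOTIDES]
    data_count = len(valid)
    unexpected_count = sample_count - no_data - missing_count - data_count
    # Count every allele, so heterozygous calls contribute one of each.
    alleles = Counter("".join(valid))
    total = sum(alleles.values())
    maf = min(alleles.values()) / total if len(alleles) > 1 else 0.0
    return {
        "Sample_count": sample_count,
        "Missing_count": missing_count,
        "Unexpected_count": unexpected_count,
        "No_data": no_data,
        "Data_count": data_count,
        "Call_rate": data_count / sample_count if sample_count else 0.0,
        "MAF": maf,
    }


summary = summarize_marker(["AA", "AT", "TT", "AA", "NN", "", "XY", "AA"])
```

Here five of the eight samples have valid calls, giving a Call_rate of 0.625. The valid calls contain seven A alleles and three T alleles, so T is the minor allele with a frequency of 0.3.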

summary_samples.csv

Field	Description
sample_name
dnarun_name
germplasm_name
germplasm_external_code
dnasample_name
dnasample_platename
dnasample_num
germplasm_type
germplasm_species
germplasm_subsp
germplasm_heterotic_group
dnasample_sample_group
dnasample_sample_group_cycle
dnasample_ref_sample
Marker_count	Number of markers genotyped for the sample
Missing_count	Number of markers with missing data allele calls (eg N, NN)
Unexpected_count	Number of markers with unexpected allele calls for the dataset_type
No_data	Number of markers with blank fields (no allele call provided)
Data_count	Number of markers with valid allele calls for the dataset_type, i.e. marker_count minus (missing_count plus unexpected_count plus no_data)
Call_rate	frequency of data_count in relation to marker_count ie data_count/marker_count
freq	Allele counts and frequencies (freq: calculated as a ratio of the data_count) vary according to the dataset_type

summary_samples_averages.csv

Provides sample statistics averaged by the following sample metadata fields:

Field	Description
germplasm_type
germplasm_species
germplasm_subsp
germplasm_heterotic_group
dnasample_sample_group
dnasample_sample_group_cycle
Marker_count	Number of markers genotyped for the sample
Missing_count	Number of markers with missing data allele calls (eg N, NN)
Unexpected_count	Number of markers with unexpected allele calls for the dataset_type
No_data	Number of markers with blank fields (no allele call provided)
Data_count	Number of markers with valid allele calls for the dataset_type, i.e. marker_count minus (missing_count plus unexpected_count plus no_data)
Call_rate	frequency of data_count in relation to marker_count ie data_count/marker_count
freq	Allele counts and frequencies (freq: calculated as a ratio of the data_count) vary according to the dataset_type
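
Averaging a per-sample statistic by a metadata field, as in the table above, can be sketched with a simple group-and-mean step. The sketch makes assumptions not confirmed here: the function name average_by_field is hypothetical, rows are dictionaries such as those produced by reading summary_samples.csv with Python's csv.DictReader, and samples are grouped by one metadata field at a time.

```python
from collections import defaultdict
from statistics import mean


def average_by_field(sample_rows, field, stat="Call_rate"):
    """Average one per-sample statistic within each value of a metadata field."""
    groups = defaultdict(list)
    for row in sample_rows:
        groups[row[field]].append(float(row[stat]))
    return {value: mean(values) for value, values in groups.items()}


rows = [
    {"germplasm_type": "inbred", "Call_rate": "0.5"},
    {"germplasm_type": "inbred", "Call_rate": "1.0"},
    {"germplasm_type": "hybrid", "Call_rate": "0.25"},
]
averages = average_by_field(rows, "germplasm_type")
```

In this made-up example the two inbred samples average to a Call_rate of 0.75, and the single hybrid sample keeps its own value of 0.25.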

summary_samples_chisq.csv

Report.xlsx
