

Quality Control using GOBii-KDCompute Module

TODO

How to initiate QC on dataset loading

TODO

Data Types | Description
Dominant | Presence, Absence
Codominant | hom_class = 0, het_class = 1, hom_class_2 = 2. 0 usually represents homozygous absence, 1 the heterozygous class, and 2 the homozygous presence class
SSR_allele | Each discovered combination of alleles, whether homozygote or heterozygote
2_letter_nucleotide (SNP) | hom_class_1 is the first nucleotide encountered in the list of samples, hom_class_2 is the alternate allele encountered, and the het class is a combination of the two
IUPAC | IUPAC is converted to 2_letter_nucleotide upon loading, so the classes are as above


Output from QC

Follow the file path in the QC email notification to access the output files listed below:

 dataset.hmp.txt

This file contains the genotype data extracted from GDM during the QC process.

 dataset_summary.csv
Field | Description
project_pi_contact |
project_name |
project_genotyping_purpose |
project_date_sampled |
project_division |
project_study_name |
experiment_name |
platform_name |
vendor_protocol_name |
vendor_name |
protocol_name |
dataset_name |
dataset_type |
analysis_name |
 F1.csv


 marker.file


 Reproducibility.csv

Pairwise comparison between all samples whose values match exactly (case-sensitive) for the metadata fields listed below.

For example, samples A, B, and C that share germplasm_external_code=10001 will have 3 reproducibility comparisons (AB, AC, BC).
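The pairing rule in the example above can be sketched in a few lines of Python. This is an illustration only, not KDCompute's own code: the function name `reproducibility_pairs`, the dictionary layout of `samples`, and the choice to skip samples with an empty field value are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def reproducibility_pairs(samples, field="germplasm_external_code"):
    """Return every pair of samples sharing an identical value for `field`.

    `samples` maps a sample name to a dict of metadata (hypothetical layout).
    Matching is exact and case-sensitive, as plain string equality.
    """
    groups = defaultdict(list)
    for name, meta in samples.items():
        value = meta.get(field)
        if value:  # assumption: samples with no value are not compared
            groups[value].append(name)

    pairs = []
    for members in groups.values():
        # n samples in a group give n * (n - 1) / 2 comparisons
        pairs.extend(combinations(sorted(members), 2))
    return pairs


samples = {
    "A": {"germplasm_external_code": "10001"},
    "B": {"germplasm_external_code": "10001"},
    "C": {"germplasm_external_code": "10001"},
    "D": {"germplasm_external_code": "10002"},
}
pairs = reproducibility_pairs(samples)
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

Sample D has no partner with the same code, so it contributes no comparison.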

Field | Description
dnarun_name |
germplasm_name |
germplasm_external_code |
dnasample_name |
dnasample_platename |
dnasample_num |
dnasample_ref_sample |

Reproducibility calculations depend on the dataset_type. For all dataset_types, if either sample has missing data (NN or N) at a marker, that marker is ignored in the calculation.

2_letter_nucleotide

  • 1_allele_mismatch: number of markers where exactly one allele differs between the two samples, e.g. sample 1 = AA and sample 2 = AT, or sample 1 = AT and sample 2 = CA (i.e. phasing is not taken into consideration). Calculated as a ratio of the markers that have no missing data for either sample.
  • 2_allele_mismatch: number of markers where both alleles differ between the two samples, e.g. sample 1 = AA and sample 2 = TT, or sample 1 = AA and sample 2 = CG. Calculated as a ratio of the markers that have no missing data for either sample. Note that phasing is not considered, so sample 1 = AT is considered a match to sample 2 = TA.
  • total_sample_mismatch: sum of 1_allele_mismatch and 2_allele_mismatch, as a ratio of all markers that have no missing data for either sample.
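A minimal sketch of the 2_letter_nucleotide rules, assuming each call is a two-character string and that the two lists are aligned marker by marker. The function names are hypothetical and this is not the module's actual implementation; it only shows one way to count mismatched alleles without regard to phase, by comparing the two calls as unordered multisets.

```python
from collections import Counter

MISSING = {"N", "NN"}  # missing-data calls named in the text


def allele_mismatches(call_1, call_2):
    """Number of alleles (0, 1 or 2) that differ, ignoring phase."""
    shared = sum((Counter(call_1) & Counter(call_2)).values())
    return 2 - shared


def nucleotide_reproducibility(calls_1, calls_2):
    """Mismatch ratios over the markers with data in both samples."""
    compared = one = two = 0
    for c1, c2 in zip(calls_1, calls_2):
        if c1 in MISSING or c2 in MISSING:
            continue  # marker ignored when either sample is missing
        compared += 1
        mismatches = allele_mismatches(c1, c2)
        if mismatches == 1:
            one += 1
        elif mismatches == 2:
            two += 1
    if compared == 0:
        return None  # assumption: no ratio is reported without data
    return {
        "1_allele_mismatch": one / compared,
        "2_allele_mismatch": two / compared,
        "total_sample_mismatch": (one + two) / compared,
    }


sample_1 = ["AA", "AT", "AA", "AT", "NN"]
sample_2 = ["AT", "CA", "TT", "TA", "AA"]
result = nucleotide_reproducibility(sample_1, sample_2)
print(result)
```

Here four markers are compared (the fifth has NN in sample 1): two show a 1-allele mismatch, one shows a 2-allele mismatch, and AT versus TA counts as a match.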

Codominant

  • 1_allele_mismatch: number of markers where one sample is a 0 or a 2 and the other sample is a 1 (heterozygote), implying that one allele is mismatched, as a ratio of all markers with valid allele data for both samples.
  • 2_allele_mismatch: number of markers where one sample is a 0 and the other sample is a 2, implying that both alleles are mismatched, as a ratio of all markers with valid allele data for both samples.

Dominant

  • Mismatch: number of markers where one sample has a ‘0’ allele call and the other sample has a ‘1’ allele call, as a ratio of all markers without missing data for both samples.
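The Codominant and Dominant rules above reduce to comparing integer class codes. The sketch below assumes calls are stored as integers (0/1/2 for Codominant, 0/1 for Dominant) and that anything else counts as not valid; the function names and that storage format are assumptions, not taken from the module.

```python
def codominant_reproducibility(calls_1, calls_2):
    """Mismatch ratios for codominant calls coded 0, 1 (het) or 2."""
    valid = {0, 1, 2}
    compared = one = two = 0
    for c1, c2 in zip(calls_1, calls_2):
        if c1 not in valid or c2 not in valid:
            continue  # skip markers without valid data in both samples
        compared += 1
        diff = abs(c1 - c2)
        if diff == 1:    # 0 or 2 against the heterozygote
            one += 1
        elif diff == 2:  # 0 against 2
            two += 1
    if compared == 0:
        return None
    return {"1_allele_mismatch": one / compared,
            "2_allele_mismatch": two / compared}


def dominant_mismatch(calls_1, calls_2):
    """Ratio of markers where one sample is 0 and the other is 1."""
    valid = {0, 1}
    compared = mismatched = 0
    for c1, c2 in zip(calls_1, calls_2):
        if c1 not in valid or c2 not in valid:
            continue
        compared += 1
        if c1 != c2:
            mismatched += 1
    return mismatched / compared if compared else None


codominant = codominant_reproducibility([0, 2, 1, 0, None], [1, 1, 1, 2, 2])
dominant = dominant_mismatch([0, 1, 1, None], [1, 1, 0, 0])
print(codominant, dominant)
```

In the codominant example four markers are compared, giving ratios of 0.5 and 0.25; in the dominant example two of the three compared markers mismatch.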

SSR_allele

  • 1_allele_mismatch: either the first allele mismatches between the two samples OR the second allele mismatches between the two samples, as a ratio of all markers with valid allele data for both samples. Note that only the simplest case of SSR alleles, where there are 2 alleles, is considered (would a sample 123,124,125 have a 1-allele mismatch with 123,124,127?).
  • 2_allele_mismatch: the first allele mismatches between the two samples AND the second allele mismatches between the two samples, as a ratio of all markers with valid allele data for both samples.
  • total_sample_mismatch: sum of 1_allele_mismatch and 2_allele_mismatch, giving an overall mismatch score, as a ratio of all markers with valid allele data for both samples.
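Unlike the nucleotide case, the SSR rules are written position by position (first allele against first, second against second). A sketch under several assumptions: each call is a 2-tuple of allele sizes, a missing call is represented as `None`, and "either ... OR" is read as exactly one position differing. None of these details are confirmed by the description above.

```python
def ssr_reproducibility(calls_1, calls_2):
    """Position-wise SSR mismatch ratios for two-allele calls."""
    compared = one = two = 0
    for c1, c2 in zip(calls_1, calls_2):
        if c1 is None or c2 is None:
            continue  # assumption: None marks a missing call
        compared += 1
        first_differs = c1[0] != c2[0]
        second_differs = c1[1] != c2[1]
        if first_differs and second_differs:
            two += 1
        elif first_differs or second_differs:
            one += 1
    if compared == 0:
        return None
    return {
        "1_allele_mismatch": one / compared,
        "2_allele_mismatch": two / compared,
        "total_sample_mismatch": (one + two) / compared,
    }


sample_1 = [(123, 125), (123, 125), (123, 125), None]
sample_2 = [(123, 125), (123, 127), (121, 127), (123, 125)]
ssr = ssr_reproducibility(sample_1, sample_2)
print(ssr)
```

Three markers are compared here: one matches, one differs at the second allele only, and one differs at both.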
 sample.file


 similarity_matrix.csv

Pairwise calculation of genotypic similarity among all samples, with sample metadata provided above and to the left of the matrix. The result is displayed as a symmetric matrix whose column names are identical to its row names; each diagonal entry compares a sample with itself and should always equal 1.


For example [Table 1]:

 | Sample_1 | Sample_2 | Sample_3
Sample_1 | 1.0 | 0.9 | 0.4
Sample_2 | 0.9 | 1.0 | 0.3
Sample_3 | 0.4 | 0.3 | 1.0
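The two properties stated for the matrix (symmetry and a diagonal of 1) are easy to check once the numeric block has been read in. The helper below is a hypothetical sanity check, not part of the QC output; it assumes the metadata rows and columns have already been stripped so that only the square block of similarity values remains.

```python
def is_valid_similarity_matrix(names, rows, tol=1e-9):
    """True if `rows` is square, symmetric, and has 1 on the diagonal."""
    n = len(names)
    if len(rows) != n or any(len(row) != n for row in rows):
        return False
    for i in range(n):
        if abs(rows[i][i] - 1.0) > tol:
            return False  # a sample compared with itself must score 1
        for j in range(i + 1, n):
            if abs(rows[i][j] - rows[j][i]) > tol:
                return False  # similarity(i, j) must equal similarity(j, i)
    return True


names = ["Sample_1", "Sample_2", "Sample_3"]
table_1 = [
    [1.0, 0.9, 0.4],
    [0.9, 1.0, 0.3],
    [0.4, 0.3, 1.0],
]
print(is_valid_similarity_matrix(names, table_1))  # True
```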

 similarity_matrix_columnwise.csv


 similarity_matrix_with_meta.csv


 summary.file


 summary_markers.csv
Field | Description
marker_name |
platform_name |
Sample_count | Number of samples genotyped
Missing_count | Number of samples with missing-data allele calls (e.g. N, NN)
Unexpected_count | Number of samples with unexpected allele calls for the dataset_type
No_data | Number of samples with blank fields (no allele call provided)
Data_count | Number of samples with valid allele calls for the dataset_type, i.e. Sample_count - (Missing_count + Unexpected_count + No_data)
Call_rate | Frequency of Data_count relative to Sample_count, i.e. Data_count / Sample_count
freq | Allele counts and frequencies, calculated as a ratio of the Data_count; these vary according to the dataset_type
MAF | Minor allele frequency: frequency of the allele with the lowest frequency (including when present in the het class)
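As a worked illustration of the per-marker fields, the sketch below computes the counts, Call_rate and MAF for one marker of a 2_letter_nucleotide dataset. Several details are assumptions rather than documented behaviour: a valid call is taken to be two letters from A, C, G, T; a blank field is the empty string; and MAF is reported as 0 when only one allele is observed.

```python
from collections import Counter

MISSING = {"N", "NN"}
NUCLEOTIDES = set("ACGT")  # assumption: the only valid allele letters


def summarise_marker(calls):
    """Per-marker summary for a list of 2-letter calls, one per sample."""
    sample_count = len(calls)
    missing = no_data = unexpected = 0
    alleles = Counter()
    for call in calls:
        if call == "":
            no_data += 1            # blank field, no call provided
        elif call in MISSING:
            missing += 1
        elif len(call) == 2 and set(call) <= NUCLEOTIDES:
            alleles.update(call)    # counts each allele, including in hets
        else:
            unexpected += 1
    data_count = sample_count - (missing + unexpected + no_data)
    call_rate = data_count / sample_count if sample_count else 0.0
    total = sum(alleles.values())
    # assumption: a monomorphic marker is given a MAF of 0
    maf = min(alleles.values()) / total if len(alleles) > 1 else 0.0
    return {
        "Sample_count": sample_count,
        "Missing_count": missing,
        "Unexpected_count": unexpected,
        "No_data": no_data,
        "Data_count": data_count,
        "Call_rate": call_rate,
        "MAF": maf,
    }


summary = summarise_marker(["AA", "AT", "TT", "AA", "NN", "", "XY"])
print(summary)
```

With these seven calls, four are valid, so Call_rate is 4/7; the valid calls hold five A alleles and three T alleles, giving a MAF of 3/8.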
 summary_samples.csv
Field | Description
sample_name |
dnarun_name |
germplasm_name |
germplasm_external_code |
dnasample_name |
dnasample_platename |
dnasample_num |
germplasm_type |
germplasm_species |
germplasm_subsp |
germplasm_heterotic_group |
dnasample_sample_group |
dnasample_sample_group_cycle |
dnasample_ref_sample |
Marker_count | Number of markers genotyped for the sample
Missing_count | Number of markers with missing-data allele calls (e.g. N, NN)
Unexpected_count | Number of markers with unexpected allele calls for the dataset_type
No_data | Number of markers with blank fields (no allele call provided)
Data_count | Number of markers with valid allele calls for the dataset_type, i.e. Marker_count - (Missing_count + Unexpected_count + No_data)
Call_rate | Frequency of Data_count relative to Marker_count, i.e. Data_count / Marker_count
freq | Allele counts and frequencies (freq is calculated as a ratio of the Data_count); these vary according to the dataset_type
 summary_samples_averages.csv

Provides sample statistics averaged by the following sample metadata fields.

Field | Description
germplasm_type |
germplasm_species |
germplasm_subsp |
germplasm_heterotic_group |
dnasample_sample_group |
dnasample_sample_group_cycle |
Marker_count | Number of markers genotyped for the sample
Missing_count | Number of markers with missing-data allele calls (e.g. N, NN)
Unexpected_count | Number of markers with unexpected allele calls for the dataset_type
No_data | Number of markers with blank fields (no allele call provided)
Data_count | Number of markers with valid allele calls for the dataset_type, i.e. Marker_count - (Missing_count + Unexpected_count + No_data)
Call_rate | Frequency of Data_count relative to Marker_count, i.e. Data_count / Marker_count
freq | Allele counts and frequencies (freq is calculated as a ratio of the Data_count); these vary according to the dataset_type
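One way to picture the averaging is a group-by over the per-sample rows. The sketch below averages a single statistic over a single metadata field; whether the file groups by one field at a time or by a combination of fields is not stated here, so that choice, the function name `average_by`, and the example values ("inbred", "hybrid") are all assumptions made for illustration.

```python
from collections import defaultdict


def average_by(rows, group_field, value_field="Call_rate"):
    """Mean of `value_field` for each distinct value of `group_field`."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for row in rows:
        bucket = totals[row[group_field]]
        bucket[0] += float(row[value_field])
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Hypothetical rows in the shape of summary_samples.csv (values as strings,
# as they would arrive from a CSV reader).
rows = [
    {"germplasm_type": "inbred", "Call_rate": "1.0"},
    {"germplasm_type": "inbred", "Call_rate": "0.5"},
    {"germplasm_type": "hybrid", "Call_rate": "0.5"},
]
averages = average_by(rows, "germplasm_type")
print(averages)  # {'inbred': 0.75, 'hybrid': 0.5}
```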
 summary_samples_chisq.csv


 Report.xlsx

