...

F1.csv

F1 Pedigree Test

Where the germplasm_type of a genotyped sample is F1, and its germplasm_par1 and germplasm_par2 fields contain germplasm_names that match samples in the same dataset, the allele match between the F1 and its identified parents can be calculated.

To calculate the F1 match, an expected F1 is first derived and then compared to the F1 progeny. The expected F1 can only be derived at markers where neither parent has missing or heterozygous data. In the example below, an expected F1 value can be derived for only 7 of the 10 markers.



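|  | germplasm_type | germplasm_par1 | germplasm_par2 | Mkr 1 | Mkr 2 | Mkr 3 | Mkr 4 | Mkr 5 | Mkr 6 | Mkr 7 | Mkr 8 | Mkr 9 | Mkr 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Parent1 |  |  |  | TT | TT | CC | CC | CC | TT | CC | CT | CC | TT |
| Parent2 |  |  |  | CC | TT | CC | CC | TT | TT | CT | NN | TT | CT |
| SampledF1 | F1 | Parent1 | Parent2 | TT | TT | CT | CT | CC | TT | CC | CC | TT | TT |
| Exp F1 (derived) |  |  |  | CT | TT | CC | CC | CT | TT | - | - | CT | - |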


The ‘par_1 contained’ calculation looks at how many alleles from Parent1 are contained in the SampledF1, i.e. how many of the marker alleles in the SampledF1 can be explained by the Parent1 contribution. In this case 9/10 of the SampledF1 marker alleles could have been derived from Parent1, so the result is 90% P1_contained. For the ‘par_2 contained’ calculation, Mkr 8 is missing (NN) in Parent2, which leaves 9 markers; 7/9 of the SampledF1 marker alleles could have been derived from Parent2, so the result is 78% P2_contained.

The calculation of Percent_F1_match is based on the number of marker genotype calls that exactly match between the SampledF1 and the derived F1, as a percent of the total number of markers that are neither missing nor heterozygous in either parent. An exact match requires both alleles to match, so AA and AA are a 100% match, but AA and AT are a 0% match.
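The three calculations above can be sketched in Python using the example table. This is a minimal illustration, not the tool's actual implementation: the function names are hypothetical, "contained" is interpreted as the F1 call sharing at least one allele with the parent call (an interpretation that reproduces the 9/10 and 7/9 figures above), and markers that are missing in either sample are skipped.

```python
MISSING = {"N", "NN"}

def is_homozygous(call):
    """True for a non-missing two-letter call with identical alleles."""
    return call not in MISSING and len(call) == 2 and call[0] == call[1]

def expected_f1(p1_call, p2_call):
    """Expected F1 call, or None if either parent is missing or heterozygous."""
    if not (is_homozygous(p1_call) and is_homozygous(p2_call)):
        return None
    # One allele from each parent; sorted so that phasing is ignored.
    return "".join(sorted(p1_call[0] + p2_call[0]))

def percent_contained(parent, f1):
    """Percent of usable markers where the F1 carries a parental allele (assumed meaning)."""
    pairs = [(p, f) for p, f in zip(parent, f1)
             if p not in MISSING and f not in MISSING]
    hits = sum(1 for p, f in pairs if set(p) & set(f))
    return 100.0 * hits / len(pairs) if pairs else None

def percent_f1_match(parent1, parent2, f1):
    """Percent of derivable markers where the sampled F1 equals the expected F1."""
    expected = [expected_f1(a, b) for a, b in zip(parent1, parent2)]
    pairs = [(e, f) for e, f in zip(expected, f1)
             if e is not None and f not in MISSING]
    hits = sum(1 for e, f in pairs if e == "".join(sorted(f)))
    return 100.0 * hits / len(pairs) if pairs else None

# Genotypes copied from the example table (Mkr 1 to Mkr 10).
parent1 = ["TT", "TT", "CC", "CC", "CC", "TT", "CC", "CT", "CC", "TT"]
parent2 = ["CC", "TT", "CC", "CC", "TT", "TT", "CT", "NN", "TT", "CT"]
sampled_f1 = ["TT", "TT", "CT", "CT", "CC", "TT", "CC", "CC", "TT", "TT"]

print(round(percent_contained(parent1, sampled_f1)))             # 90
print(round(percent_contained(parent2, sampled_f1)))             # 78
print(round(percent_f1_match(parent1, parent2, sampled_f1), 1))  # 28.6
```

With the example data the sketch finds 2 exact matches (Mkr 2 and Mkr 6) among the 7 derivable markers. That figure comes from the sketch; the page itself does not state a Percent_F1_match value for this example.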

| Field | Description |
|---|---|
| dnarun_name |  |
| germplasm_name |  |
| germplasm_external_code |  |
| germplasm_par1 |  |
| par1_dnarun_name | dnarun_name of the par1 germplasm |
| germplasm_par2 |  |
| par2_dnarun_name | dnarun_name of the par2 germplasm |
| dnasample_name |  |
| dnasample_platename |  |
| dnasample_num |  |
| Count_data |  |
| Percent_P1_contained |  |
| Percent_P2_contained |  |
| Percent_F1_match |  |

...

Reproducibility.csv

Reproducibility

Pair-wise comparisons are made between all samples that have exactly matching (case-sensitive) values in the metadata fields listed below.

For example, samples A, B, and C that share the same germplasm_external_code=10001 will have 3 reproducibility comparisons (AB, AC, BC).
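The pairing rule in the example above can be sketched as follows. This is an illustration only: the function name and the sample records are hypothetical, and how blank metadata values are handled is not covered by this page, so the sketch does nothing special for them.

```python
from collections import defaultdict
from itertools import combinations

def reproducibility_pairs(samples, field):
    """Return every pair of samples whose value for `field` matches exactly."""
    groups = defaultdict(list)
    for sample in samples:
        # Plain dictionary keys give an exact, case-sensitive match.
        groups[sample[field]].append(sample["dnarun_name"])
    pairs = []
    for names in groups.values():
        pairs.extend(combinations(names, 2))
    return pairs

samples = [
    {"dnarun_name": "A", "germplasm_external_code": "10001"},
    {"dnarun_name": "B", "germplasm_external_code": "10001"},
    {"dnarun_name": "C", "germplasm_external_code": "10001"},
    {"dnarun_name": "D", "germplasm_external_code": "10002"},
]

print(reproducibility_pairs(samples, "germplasm_external_code"))
# [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

Sample D has no partner with the same code, so it contributes no comparison.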

| Field | Description |
|---|---|
| dnarun_name |  |
| germplasm_name |  |
| germplasm_external_code |  |
| dnasample_name |  |
| dnasample_platename |  |
| dnasample_num |  |
| dnasample_ref_sample |  |


Reproducibility calculations depend on the dataset_type. For all dataset_types, if there is any missing data (NN or N) in either sample, the marker is ignored in the calculation.

2_letter_nucleotide
  • 1_allele_mismatch: number of markers where one sample has a single allele that differs from the other sample, e.g. sample 1 = AA and sample 2 = AT, or sample 1 = AT and sample 2 = CA (i.e. phasing is not taken into consideration). Calculated as a ratio of the markers that have no missing data in either sample.
  • 2_allele_mismatch: number of markers where both alleles of one sample differ from the other sample, e.g. sample 1 = AA and sample 2 = TT, or sample 1 = AA and sample 2 = CG. Calculated as a ratio of the markers that have no missing data in either sample. Note that phasing is not considered, so sample 1 = AT is considered a match to sample 2 = TA.
  • total_sample_mismatch: sum of 1_allele_mismatch and 2_allele_mismatch, as a ratio of all markers that have no missing data in either sample.
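One way to express the 2_letter_nucleotide rules above is to treat each call as an unordered pair of alleles and count how many alleles the two calls do not share. The sketch below is hypothetical (function names are not from the tool) and assumes every call is either a two-letter genotype or a missing value containing N.

```python
from collections import Counter

def allele_mismatches(call_1, call_2):
    """Number of mismatched alleles (0, 1 or 2), or None if either call is missing."""
    if "N" in call_1 or "N" in call_2:
        return None
    # Multiset intersection ignores phasing: AT and TA share both alleles.
    shared = sum((Counter(call_1) & Counter(call_2)).values())
    return 2 - shared

def mismatch_ratios(sample_1, sample_2):
    """Mismatch ratios over the markers with no missing data in either sample."""
    scores = [allele_mismatches(a, b) for a, b in zip(sample_1, sample_2)]
    scores = [s for s in scores if s is not None]
    if not scores:
        return None
    one = scores.count(1) / len(scores)
    two = scores.count(2) / len(scores)
    return {
        "1_allele_mismatch": one,
        "2_allele_mismatch": two,
        "total_sample_mismatch": one + two,
    }

sample_1 = ["AA", "AT", "AA", "AT", "NN"]
sample_2 = ["AT", "CA", "TT", "TA", "AA"]
print(mismatch_ratios(sample_1, sample_2))
# {'1_allele_mismatch': 0.5, '2_allele_mismatch': 0.25, 'total_sample_mismatch': 0.75}
```

The fifth marker is ignored because sample 1 is NN, so the ratios are taken over the remaining four markers.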


Codominant
  • 1_allele_mismatch: number of markers where one sample is a 0 or 2 and the other sample is a 1 (heterozygote), inferring one allele is mismatched, as a ratio of all markers with valid allele data for both samples.
  • 2_allele_mismatch: number of markers where one sample is a 0 and the other sample is a 2, inferring both alleles are mismatched, as a ratio of all markers with valid allele data for both samples.
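For codominant data the two rules above reduce to the absolute difference between the two calls. The sketch below assumes calls are stored as the integers 0, 1 and 2 and that any other value (here None) means the call is not valid; both are assumptions made for illustration, and the function name is hypothetical.

```python
def codominant_mismatch_ratios(sample_1, sample_2):
    """1- and 2-allele mismatch ratios over markers valid in both samples."""
    valid = {0, 1, 2}
    diffs = [abs(a - b) for a, b in zip(sample_1, sample_2)
             if a in valid and b in valid]
    if not diffs:
        return None
    return {
        "1_allele_mismatch": diffs.count(1) / len(diffs),  # 0 or 2 versus 1
        "2_allele_mismatch": diffs.count(2) / len(diffs),  # 0 versus 2
    }

sample_1 = [0, 2, 1, 0, None]
sample_2 = [1, 1, 1, 2, 0]
print(codominant_mismatch_ratios(sample_1, sample_2))
# {'1_allele_mismatch': 0.5, '2_allele_mismatch': 0.25}
```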


Dominant
  • Mismatch: number of markers where one sample has a ‘0’ allele call and the other sample has a ‘1’ allele call, as a ratio of all markers without missing data for both samples.
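A minimal sketch of the dominant rule above, assuming calls are stored as the strings "0" and "1" and that any other value counts as missing (the storage format and the function name are assumptions, not taken from the tool):

```python
def dominant_mismatch_ratio(sample_1, sample_2):
    """Ratio of markers where one sample is '0' and the other is '1'."""
    valid = {"0", "1"}
    pairs = [(a, b) for a, b in zip(sample_1, sample_2)
             if a in valid and b in valid]
    if not pairs:
        return None
    return sum(1 for a, b in pairs if a != b) / len(pairs)

# The last marker is skipped because sample 1 has no valid call there.
print(dominant_mismatch_ratio(["0", "1", "1", "N"], ["0", "0", "1", "1"]))
```

In this example one of the three usable markers differs, so the ratio is one third.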


SSR_allele
  • 1_allele_mismatch: either the first allele in both samples mismatches OR the second allele in both samples mismatches, as a ratio of all markers with valid allele data for both samples. Note, only the simplest case of SSR alleles is considered where there are 2 alleles, i.e. a sample 123,124,125 will have a 1 allele mismatch with 123, 124, 127?
  • 2_allele_mismatch: the first allele in both samples mismatches AND the second allele in both samples mismatches, as a ratio of all markers with valid allele data for both samples.
  • total_sample_mismatch: sum of 1_allele_mismatch and 2_allele_mismatch, inferring an overall mismatch score, as a ratio of all markers with valid allele data for both samples.
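The SSR rules above compare alleles by position (first with first, second with second). The sketch below covers only the simple two-allele case and assumes each call is a `(first, second)` tuple of allele sizes, with None standing in for a call without valid data; the representation and the function name are assumptions.

```python
def ssr_mismatch_ratios(sample_1, sample_2):
    """Positional 1- and 2-allele mismatch ratios for two-allele SSR calls."""
    counts = {0: 0, 1: 0, 2: 0}
    for a, b in zip(sample_1, sample_2):
        if a is None or b is None:
            continue  # marker lacks valid data in at least one sample
        # Each comparison is True (1) or False (0), so the sum is 0, 1 or 2.
        counts[(a[0] != b[0]) + (a[1] != b[1])] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    one = counts[1] / total
    two = counts[2] / total
    return {
        "1_allele_mismatch": one,
        "2_allele_mismatch": two,
        "total_sample_mismatch": one + two,
    }

sample_1 = [(123, 125), (123, 125), (123, 125), None]
sample_2 = [(123, 125), (123, 127), (121, 127), (123, 125)]
print(ssr_mismatch_ratios(sample_1, sample_2))
```

Here the three usable markers show 0, 1 and 2 mismatched positions respectively, so each mismatch ratio is one third and the total is two thirds.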



...

similarity_matrix.csv

Similarity Matrix

Pair-wise calculation of genotypic similarity among all samples, with sample metadata provided above and to the left of the matrix. The result is displayed as a symmetric matrix with column names identical to row names; the diagonal compares each sample with itself and should always equal 1.


For example [Table 1]:



|  | Sample_1 | Sample_2 | Sample_3 |
|---|---|---|---|
| Sample_1 | 1.0 | 0.9 | 0.4 |
| Sample_2 | 0.9 | 1.0 | 0.3 |
| Sample_3 | 0.4 | 0.3 | 1.0 |


In the above table, samples 1 and 2 are very similar, whereas sample 3 is less so; the genetic similarity between samples 1 and 3 is 0.4.

Genetic similarity ranges from 0 (no similarity) to 1 (identical) and is calculated as the average of the comparison scores across all markers, using the following scoring methodology for markers with valid allele calls:

  • Dominant: For markers where both samples have a ‘0’ or both samples have a ‘1’, the value is 1 (a match); otherwise the value is 0 (no match).
  • Codominant & 2_letter_nucleotide: For markers where both samples have the same homozygote the value is 1; if one sample is a heterozygote the value is 0.5; if the samples have different homozygotes the value is 0. If both samples are the same heterozygote the value is 1 (note that phasing is not considered, so AT is the same as TA).
  • SSR: For markers where both samples have the same homozygous allele pair the value is 1; if either sample is a heterozygote the value is 0.5; for different homozygotes the value is 0.


Warning

A marker with a missing nucleotide call in either sample is omitted from the similarity calculation.
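The 2_letter_nucleotide scoring and the missing-data rule above can be sketched as below. The function names are hypothetical, and one case is an assumption rather than something this page states: two different heterozygotes (for example AT and CG) are scored 0.5, following the "one sample is a heterozygote" rule.

```python
def marker_score(call_1, call_2):
    """Similarity score for one marker, or None if either call is missing."""
    if "N" in call_1 or "N" in call_2:
        return None
    if sorted(call_1) == sorted(call_2):
        return 1.0  # same homozygote, or same heterozygote ignoring phasing
    if call_1[0] != call_1[1] or call_2[0] != call_2[1]:
        return 0.5  # at least one heterozygote (assumed for unlike hets too)
    return 0.0      # different homozygotes

def similarity(sample_1, sample_2):
    """Average marker score over the markers with data in both samples."""
    scores = [marker_score(a, b) for a, b in zip(sample_1, sample_2)]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None

def similarity_matrix(samples):
    """Symmetric matrix as nested dicts; `samples` maps sample name to calls."""
    names = list(samples)
    return {row: {col: similarity(samples[row], samples[col]) for col in names}
            for row in names}

samples = {
    "Sample_1": ["AA", "AT", "CC", "NN"],
    "Sample_2": ["AA", "TA", "GG", "CC"],
}
matrix = similarity_matrix(samples)
print(matrix["Sample_1"]["Sample_1"])  # 1.0
print(matrix["Sample_1"]["Sample_2"])
```

For the two samples shown, the fourth marker is dropped because Sample_1 is NN, and the remaining scores are 1, 1 and 0, giving a similarity of two thirds.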


| Field | Description |
|---|---|
| germplasm_name |  |
| germplasm_external_code |  |
| dnasample_name |  |
| dnasample_num |  |
| dnasample_platename |  |
| dnasample_ref_sample |  |
| dnarun_name |  |


...

summary_markers.csv

Summary by Markers

| Field | Description |
|---|---|
| marker_name |  |
| platform_name |  |
| Sample_count | Number of samples genotyped |
| Missing_count | Number of samples with missing-data allele calls (e.g. N, NN) |
| Unexpected_count | Number of samples with unexpected allele calls for the dataset_type |
| No_data | Number of samples with blank fields (no allele call provided) |
| Data_count | Number of samples with valid allele calls for the dataset_type, i.e. sample_count minus (missing_count plus unexpected_count plus no_data) |
| Call_rate | Frequency of data_count in relation to sample_count, i.e. data_count/sample_count |
| freq | Allele counts and frequencies, calculated as a ratio of the data_count; these vary according to the dataset_type |
| MAF | Minor allele frequency: frequency of the allele with the lowest frequency (including when present in the het class) |
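The per-marker counts above can be sketched for 2_letter_nucleotide data as follows. Several details are assumptions made for illustration: a valid call is taken to be two letters from A, C, G, T; a blank field is an empty string; allele frequency is computed as allele count divided by twice the data_count; and MAF is reported as 0 when only one allele is observed. The function name is hypothetical.

```python
from collections import Counter

MISSING = {"N", "NN"}
BASES = set("ACGT")

def summarise_marker(calls):
    """Counts, call rate, allele frequencies and MAF for one marker."""
    sample_count = len(calls)
    no_data = sum(1 for c in calls if c == "")
    missing_count = sum(1 for c in calls if c in MISSING)
    valid = [c for c in calls if len(c) == 2 and set(c) <= BASES]
    data_count = len(valid)
    unexpected_count = sample_count - missing_count - no_data - data_count
    # Heterozygous calls contribute one copy of each allele.
    allele_counts = Counter("".join(valid))
    freq = {a: n / (2 * data_count) for a, n in allele_counts.items()}
    maf = min(freq.values()) if len(freq) > 1 else 0.0
    return {
        "Sample_count": sample_count,
        "Missing_count": missing_count,
        "Unexpected_count": unexpected_count,
        "No_data": no_data,
        "Data_count": data_count,
        "Call_rate": data_count / sample_count if sample_count else None,
        "freq": freq,
        "MAF": maf,
    }

calls = ["AA", "AT", "TT", "NN", "", "XX", "AA"]
summary = summarise_marker(calls)
print(summary["Data_count"], summary["freq"], summary["MAF"])
# 4 {'A': 0.625, 'T': 0.375} 0.375
```

Of the seven calls, one is missing (NN), one is blank and one is unexpected (XX), leaving a data_count of 4 and a call rate of 4/7.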


...

summary_samples_averages.csv

Summary Samples Averages

Provides averaged statistics of samples by the following sample metadata fields.

| Field | Description |
|---|---|
| germplasm_type |  |
| germplasm_species |  |
| germplasm_subsp |  |
| germplasm_heterotic_group |  |
| dnasample_sample_group |  |
| dnasample_sample_group_cycle |  |
| Marker_count | Number of markers genotyped for the sample |
| Missing_count | Number of markers with missing-data allele calls (e.g. N, NN) |
| Unexpected_count | Number of markers with unexpected allele calls for the dataset_type |
| No_data | Number of markers with blank fields (no allele call provided) |
| Data_count | Number of markers with valid allele calls for the dataset_type, i.e. marker_count minus (missing_count plus unexpected_count plus no_data) |
| Call_rate | Frequency of data_count in relation to marker_count, i.e. data_count/marker_count |
| freq | Allele counts and frequencies (freq: calculated as a ratio of the data_count); these vary according to the dataset_type |
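As a rough illustration of averaging a per-sample statistic by a metadata field, the sketch below groups sample rows by one field and takes the mean of one statistic. The helper name, the example rows and the use of a simple arithmetic mean are all assumptions; this page does not specify how the averages are computed.

```python
from collections import defaultdict
from statistics import mean

def average_by(samples, group_field, stat_field):
    """Mean of `stat_field` for each distinct value of `group_field`."""
    groups = defaultdict(list)
    for row in samples:
        groups[row[group_field]].append(row[stat_field])
    return {value: mean(stats) for value, stats in groups.items()}

# Hypothetical per-sample rows, for illustration only.
rows = [
    {"germplasm_type": "F1", "Call_rate": 0.5},
    {"germplasm_type": "F1", "Call_rate": 1.0},
    {"germplasm_type": "inbred", "Call_rate": 1.0},
]
print(average_by(rows, "germplasm_type", "Call_rate"))
# {'F1': 0.75, 'inbred': 1.0}
```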


...