Genomic selection (GS): a new approach for improving quantitative traits in large plant breeding populations that uses whole-genome molecular markers (high-density markers and high-throughput genotyping). Genomic prediction combines marker data with phenotypic and, when available, pedigree data to increase the accuracy with which breeding and genotypic values are predicted.
Training Population: also called the candidate population or reference population; a set of individuals that are both phenotyped and genotyped and can be used to predict the performance of related individuals that are only genotyped, under related environmental and management conditions.
Training Datasets: GS uses a training population of individuals with known phenotypes and marker data to build a model for the prediction of performance in a population of untested individuals based on marker data.
Prediction Populations: also called test or validation populations; individuals that are genotyped but not phenotyped. The performance of individuals in the population is estimated from the marker effects estimated in the training set.
Prediction Datasets: The predictor of an individual phenotype is the genomic estimated breeding value (GEBV) obtained as the sum of all corresponding marker effects of the individual.
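The GEBV definition above can be illustrated with a minimal sketch. The marker effects, the line names, and the -1/0/1 genotype coding are all hypothetical values chosen for illustration; they are not taken from any particular dataset or tool.

```python
# Hypothetical marker effects estimated from a training set (illustrative values).
marker_effects = [0.5, -0.2, 0.1]

# Hypothetical genotype codes per individual; the -1/0/1 coding is an assumption.
genotypes = {
    "line_A": [1, 0, -1],
    "line_B": [-1, 1, 1],
}


def gebv(marker_codes, effects):
    """Return the sum over markers of (genotype code * estimated marker effect)."""
    if len(marker_codes) != len(effects):
        raise ValueError("marker codes and effects must have the same length")
    return sum(x * b for x, b in zip(marker_codes, effects))


# One GEBV per individual, then a ranking from highest to lowest GEBV.
gebvs = {name: gebv(codes, marker_effects) for name, codes in genotypes.items()}
ranking = sorted(gebvs, key=gebvs.get, reverse=True)
```

Here `ranking` orders the individuals by predicted value, which is how GEBVs would typically be used for selection.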
Training model fitting: GS uses marker data in two different ways: by using markers to model relationships between individuals, or by estimating the effect of each marker on the trait of interest. Some of these markers are presumed to be in linkage disequilibrium (LD) with relevant quantitative trait loci (QTL).
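As a rough illustration of the first approach (modeling relationships between individuals from markers), the sketch below builds a simple marker-based relationship matrix as the cross-product of centered marker codes divided by the number of markers. This particular form, the 0/1/2 coding, and the function name `relationship_matrix` are assumptions made for illustration; production tools commonly also scale by allele frequencies.

```python
def relationship_matrix(markers):
    """Build a simple marker-based relationship matrix.

    markers: list of individuals, each a list of numeric marker codes.
    Returns an n x n nested list, where n is the number of individuals.
    """
    n = len(markers)
    m = len(markers[0])
    if any(len(row) != m for row in markers):
        raise ValueError("all individuals must have the same number of markers")

    # Center each marker (column) on its mean across individuals.
    means = [sum(row[j] for row in markers) / n for j in range(m)]
    z = [[row[j] - means[j] for j in range(m)] for row in markers]

    # Cross-product of centered codes, averaged over markers.
    return [
        [sum(a * b for a, b in zip(z[i], z[k])) / m for k in range(n)]
        for i in range(n)
    ]


# Hypothetical 0/1/2 marker codes: the first two individuals are identical.
markers = [[0, 1, 2], [0, 1, 2], [2, 1, 0]]
g = relationship_matrix(markers)
```

In this toy example the two identical individuals receive the same positive relationship value, while the third, with opposite genotypes, receives a negative one.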
Prediction accuracies: measured as the correlation between observed and predicted phenotypes, using a cross validation method.
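A minimal sketch of this accuracy measure, assuming the Pearson correlation is the correlation intended (the source does not name a specific coefficient). The phenotype values are hypothetical.

```python
import math


def pearson(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    if sd_o == 0 or sd_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_o * sd_p)


# Hypothetical observed and predicted phenotypes for four genotypes.
observed = [4.1, 5.0, 6.2, 5.5]
predicted = [4.4, 4.9, 5.8, 5.6]
accuracy = pearson(observed, predicted)
```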
Cross validation: the method randomly partitions the genotypes into folds. Each fold in turn is set aside as the target population, with its phenotypic data omitted; the remaining folds of genotypes, with their phenotypic data intact, are used as the training dataset. This process is repeated for each fold.
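The fold partitioning described above can be sketched as follows. The function name `k_fold_partitions`, the fixed seed, and the genotype identifiers are hypothetical; the sketch only produces the training/target splits and leaves model fitting out.

```python
import random


def k_fold_partitions(genotype_ids, k, seed=0):
    """Yield (training_ids, target_ids) pairs, one pair per fold."""
    ids = list(genotype_ids)
    if not 2 <= k <= len(ids):
        raise ValueError("k must be between 2 and the number of genotypes")

    # Randomly shuffle, then deal the genotypes into k folds.
    rng = random.Random(seed)
    rng.shuffle(ids)
    folds = [ids[i::k] for i in range(k)]

    # Each fold is the target once; all other folds form the training set.
    for i, target in enumerate(folds):
        training = [g for j, fold in enumerate(folds) if j != i for g in fold]
        yield training, target


# Hypothetical genotype identifiers split into 5 folds.
ids = [f"G{i}" for i in range(10)]
splits = list(k_fold_partitions(ids, k=5))
```

For each `(training, target)` pair, a model would be fitted on the training genotypes with their phenotypes and then used to predict the target genotypes, whose phenotypes are withheld until the accuracy is computed.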
BLUP: best linear unbiased prediction
GEBV: genomic estimated breeding value. GEBVs are used to rank and select genotypes, without phenotypic data, for the next generation of breeding.
The Bayesian methods address the problem of having a small number of observations (n) and a large number of parameters (p) to be estimated (n << p) by restricting the size of the regression coefficients via shrinkage or regularization.
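As a simple worked illustration of shrinkage, consider ridge regression, which is not itself one of the Bayesian methods named in the source but coincides with the posterior mean under a Gaussian prior on the marker effects. The symbols below are assumptions introduced for illustration: y is the vector of n phenotypes, X the n x p marker matrix, beta the vector of marker effects, and lambda the penalty.

```latex
\hat{\boldsymbol{\beta}}_{\text{ridge}}
  = \arg\min_{\boldsymbol{\beta}}
    \left\{ \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert^{2}
          + \lambda \lVert \boldsymbol{\beta} \rVert^{2} \right\}
  = \left( \mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I}_{p} \right)^{-1}
    \mathbf{X}^{\top}\mathbf{y},
\qquad
\lambda = \frac{\sigma_{e}^{2}}{\sigma_{\beta}^{2}}
```

With n << p, the matrix X'X is singular, so the unpenalized least-squares solution is not unique; adding lambda > 0 makes the system solvable and pulls every coefficient toward zero. Under the prior beta ~ N(0, sigma_beta^2 I) with residual variance sigma_e^2, the penalty equals the variance ratio shown.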
COP: coefficient of parentage
BLUE: best linear unbiased estimator
PCA: principal component analysis
PEV: prediction error variance
Input file for phenotypes: the phenotype file contains two columns:
Input file for genotypes: a genotype matrix with the sample names in the first row (unique, with no duplicated sample names) and with the marker names removed from the first column. Each column corresponds to a different sample, identified by its alphanumeric name, and the rows below hold that sample's genotype data (real values).