Consensus And Splitting Tool (CAST)

 

The Consensus And Splitting Tool (CAST) lets you select a dataset from GOBii, apply some basic filtering, carry out consensus calling of replicate samples, split the file by sample metadata, and create a project file for analysis in Flapjack. Details of these steps are provided below. Note: the pipeline currently supports 2-letter nucleotide datatypes only.

 

Login

BrAPI URL

First, enter the URL for the environment that you would like to pull data from.

For GOBii-GDM versions 2.2 and earlier, the URL will look something like http://example.gobii.org:8081/gobii-zoan/, where the crop/environment name is zoan. This URL can be copied and pasted from the Extractor link in the main portal.

For GOBii-GDM versions 3.0 and later, the URL will look like http://example.gobii.org/gdm/crops/zoan/.

This URL automatically points to /brapi/v2, so you do not need to append that path to the URL as you would when importing data into some other tools (e.g. Flapjack).
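As a rough illustration of the difference, the sketch below shows how a base URL of the kind entered here relates to the full BrAPI endpoint that other tools expect. The helper name `brapi_endpoint` is hypothetical, the hostnames are the placeholder examples from above, and it is an assumption that both URL styles take the same `brapi/v2` suffix; CAST handles this step for you.

```python
def brapi_endpoint(base_url: str) -> str:
    """Return base_url with a single trailing 'brapi/v2' path appended.

    Hypothetical helper for illustration only; it is not part of CAST.
    """
    suffix = "brapi/v2"
    trimmed = base_url.rstrip("/")
    if trimmed.endswith(suffix):
        return trimmed  # already a full BrAPI endpoint
    return f"{trimmed}/{suffix}"


# GOBii-GDM 2.2 and earlier style URL (placeholder host)
print(brapi_endpoint("http://example.gobii.org:8081/gobii-zoan/"))
# GOBii-GDM 3.0 and later style URL (placeholder host)
print(brapi_endpoint("http://example.gobii.org/gdm/crops/zoan/"))
```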

Credentials

Log in to the GOBii-GDM system using your GOBii-GDM credentials.

 

Get Data

This page uses BrAPI calls to select a dataset from the GOBii-GDM database.

  • Select a Study (optional). ‘Study’ is a BrAPI term that is equivalent to GOBii’s Experiment. Once a study is selected, the datasets are filtered by that study/experiment. You can also proceed directly to selecting a variantset without selecting a study.

  • Select a Variantset (required). ‘Variantset’ is a BrAPI term equivalent to GOBii’s Dataset. In brackets, after the variantset name, you will see a count of the number of markers x the number of samples.

  • Select a Mapset (optional). Note that the mapsets are NOT filtered by the markers in the dataset/variantset (this would be possible, but the query would be very long and likely not worth the wait), so you need to know ahead of time which mapset is relevant to the selected dataset. You do not have to select a mapset: Flapjack handles a missing mapset quite well, and simply places the markers, in the order they are received, at 1 cM intervals on a ‘synthetic’ map.

  • In the future, we will enable more than one study and dataset to be selected and the resulting data will show the union of datasets selected. However, only one map can be selected.

  • Select the ‘eye’ icon next to the variantset selection to preview the dataset selected.

  • To remove a selection, click on the x next to the study, variantset, or mapset name.

  • Note: the associated metadata for the samples will also be used in the downstream steps of this tool, as follows:

    • The most important metadata are the parents of the F1 samples: make sure to enter the germplasm_name of the parents in the germplasm_parent1 and germplasm_parent2 fields of GOBii-GDM for each F1 sample. This identifies the correct parents against which the F1 samples are compared.

    • There may be multiple samples of each germplasm_name, and these can be consensus called to give a single parental genotype against which the F1 samples are compared. If you don’t apply consensus calling and there is more than one replicate sample of a parent in Flapjack, you will need to manually select which parent sample to reference and will not be able to take advantage of the automated batch analysis of multiple datasets.

    • Several sample metadata fields can be used to split the genotyping data into different datasets (e.g. sample_group, sample_group_cycle, pedigree, germplasm parent 1 and germplasm parent 2). Enter these fields when loading data to GOBii-GDM.

    • In the future, we will allow metadata for samples to be uploaded to the tool in case these data were not loaded to GOBii or are stored in a different database.
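The study, variantset and mapset selections on this page correspond to BrAPI v2 resources. The sketch below only builds request URLs and makes no network call. It assumes the standard BrAPI v2 paths `studies`, `variantsets` and `maps` and the `studyDbId` filter on variantsets; the `base` value and the study ID `"12"` are made-up placeholders, and how GOBii-GDM authenticates these requests is not shown.

```python
from typing import Optional
from urllib.parse import urlencode


def brapi_url(base: str, resource: str, **params: str) -> str:
    """Build a BrAPI v2 request URL (no network call is made)."""
    url = f"{base.rstrip('/')}/brapi/v2/{resource}"
    return f"{url}?{urlencode(params)}" if params else url


def selection_urls(base: str, study_id: Optional[str] = None) -> dict:
    """URLs behind the three pick-lists: studies, variantsets and maps."""
    # Only the variantset list is narrowed by the selected study;
    # the map list is not filtered by the selected dataset.
    variantset_filter = {"studyDbId": study_id} if study_id else {}
    return {
        "studies": brapi_url(base, "studies"),
        "variantsets": brapi_url(base, "variantsets", **variantset_filter),
        "maps": brapi_url(base, "maps"),
    }


base = "http://example.gobii.org/gdm/crops/zoan"  # placeholder host
for name, url in selection_urls(base, study_id="12").items():
    print(name, url)
```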

 

Table 1 below shows example sample metadata.

  • The dnarun_name is the sample name associated with the genotyping data matrix

  • The germplasm_par1 and germplasm_par2 fields identify the parents by their germplasm_name

  • The file can be split into 2 separate datasets for analysis based on any of the following fields: germplasm_name, germplasm_pedigree, dnasample_group, or a combination of the germplasm_par1 and germplasm_par2 fields

  • Consensus calling will be carried out on parents with the same germplasm_name; in this case only p4 has replicate samples with the same germplasm_name. Consensus called parents are indicated by a star in the output file, e.g. p4*

Table 1. Example metadata used in CAST

| germplasm_name | dnarun_name | dnasample_num | germplasm_par1 | germplasm_par2 | germplasm_pedigree | dnasample_group |
|---|---|---|---|---|---|---|
| cross1_p1/p2 | cross1_p1/p2-sample1 | 1 | p1 | p2 | p1/p2 | cross1 |
| cross1_p1/p2 | cross1_p1/p2-sample2 | 2 | p1 | p2 | p1/p2 | cross1 |
| cross1_p1/p2 | cross1_p1/p2-sample3 | 3 | p1 | p2 | p1/p2 | cross1 |
| cross2_p3/p4 | cross2_p3/p4-sample1 | 4 | p3 | p4 | p3/p4 | cross2 |
| cross2_p3/p4 | cross2_p3/p4-sample2 | 5 | p3 | p4 | p3/p4 | cross2 |
| cross2_p3/p4 | cross2_p3/p4-sample3 | 6 | p3 | p4 | p3/p4 | cross2 |
| p1 | p1-sample1 | 7 | | | | |
| p2 | p2-sample2 | 8 | | | | |
| p3 | p3-sample3 | 9 | | | | |
| p4 | p4-sample1 | 10 | | | | |
| p4 | p4-sample2 | 11 | | | | |
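To make the role of these fields concrete, the sketch below encodes the rows of Table 1 as plain Python data and works out which parents have replicate samples, and are therefore candidates for consensus calling. The variable and function names are illustrative only; they are not part of CAST.

```python
from collections import Counter

# (germplasm_name, dnarun_name, germplasm_par1, germplasm_par2) from Table 1
SAMPLES = [
    ("cross1_p1/p2", "cross1_p1/p2-sample1", "p1", "p2"),
    ("cross1_p1/p2", "cross1_p1/p2-sample2", "p1", "p2"),
    ("cross1_p1/p2", "cross1_p1/p2-sample3", "p1", "p2"),
    ("cross2_p3/p4", "cross2_p3/p4-sample1", "p3", "p4"),
    ("cross2_p3/p4", "cross2_p3/p4-sample2", "p3", "p4"),
    ("cross2_p3/p4", "cross2_p3/p4-sample3", "p3", "p4"),
    ("p1", "p1-sample1", None, None),
    ("p2", "p2-sample2", None, None),
    ("p3", "p3-sample3", None, None),
    ("p4", "p4-sample1", None, None),
    ("p4", "p4-sample2", None, None),
]


def replicated_parents(samples):
    """Return parent germplasm names that have more than one sample."""
    # A parent is any germplasm_name referenced in par1 or par2.
    parents = {p for _, _, p1, p2 in samples for p in (p1, p2) if p}
    counts = Counter(name for name, *_ in samples if name in parents)
    return sorted(name for name, n in counts.items() if n > 1)


print(replicated_parents(SAMPLES))  # ['p4']
```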

 

 

 

 

 

Filter Data

Use this page to filter your data by marker and sample percent values.

  • For ‘Marker Percent’: enter a percent value and your data will be filtered to include only markers with more than this percentage of data

  • For ‘Sample Percent’: enter a percent value and your data will be filtered to include only samples with more than this percentage of data

  • Note: the percent values are based on the original, unfiltered data matrix and are not recalculated following removal of markers or samples

  • Select ‘APPLY’. The numbers of markers and samples remaining after filtering are summarized at the top right of the page. A preview of the filtered data is also shown on the page, and you can click through the preview pages to see more of the file. The filtered file can be downloaded if desired
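The filtering rules above can be sketched as follows. This is a minimal illustration, not CAST's implementation: it assumes that "percent of data" means the percentage of non-missing calls and that missing calls are coded as NN, and the function names and the small matrix are invented for the example. Both sets of percentages are computed from the original matrix, matching the note above.

```python
MISSING = "NN"  # assumed code for a missing call


def percent_present(calls):
    """Percentage of calls that are not missing."""
    if not calls:
        return 0.0
    return 100.0 * sum(c != MISSING for c in calls) / len(calls)


def filter_matrix(matrix, markers, samples, marker_pct, sample_pct):
    """matrix[i][j] is the call for marker i in sample j."""
    # Marker and sample percentages both come from the unfiltered matrix;
    # nothing is recalculated after rows or columns are dropped.
    keep_m = [i for i, row in enumerate(matrix)
              if percent_present(row) > marker_pct]
    keep_s = [j for j in range(len(samples))
              if percent_present([row[j] for row in matrix]) > sample_pct]
    filtered = [[matrix[i][j] for j in keep_s] for i in keep_m]
    return filtered, [markers[i] for i in keep_m], [samples[j] for j in keep_s]


matrix = [
    ["AA", "AA", "NN"],  # m1: 2 of 3 calls present
    ["AT", "NN", "NN"],  # m2: 1 of 3 calls present
    ["TT", "TT", "TT"],  # m3: 3 of 3 calls present
]
data, kept_markers, kept_samples = filter_matrix(
    matrix, ["m1", "m2", "m3"], ["s1", "s2", "s3"], marker_pct=50, sample_pct=50)
print(kept_markers, kept_samples)  # ['m1', 'm3'] ['s1', 's2']
```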

 

Consensus Call Data

  • Consensus calling is currently only available using the algorithm ‘majority genotype (favoring homozygotes)’, and only parent replicate samples are consensus called. The parent samples are identified by the fields germplasm_par1 and germplasm_par2, which reference the germplasm_name of the parents

  • The term ‘favoring homozygotes’ means that if a homozygous and a heterozygous genotype call are tied as the most frequent, the homozygous genotype is called. For example, if the replicate calls are AA, AA, AT and AT, the consensus call is AA. However, if two different homozygous calls are tied, e.g. replicate calls AA, AA, TT and TT, the consensus call is NN (missing). A tie between homozygous calls takes precedence over the homozygote preference, so if the replicate calls are AA, AT and TT, the consensus call is NN.

  • A consensus threshold can optionally be applied if more stringency is needed, e.g. if the user wants at least 50% of the replicate calls to agree on one genotype. For example, if the replicate calls are AA, AA, AT, AT and TT and a 50% consensus threshold is applied, the consensus call is NN, because fewer than 50% of the calls are AA (2 of 5, or 40%). However, AA, AA, AA, AT and TT return a consensus call of AA, because more than 50% of the calls are now AA (3 of 5, or 60%)

  • Select ‘APPLY’. A preview of the consensus calling is shown on the screen. Each consensus called parent can be selected from the drop-down menu to see the contributing replicate sample calls. The consensus calls can be edited if you do not agree with them.

  • To see all the consensus calls, click on ‘Download’ to view the consensus calling in Excel
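The calling rules described above can be sketched in a few lines. This is an illustration under stated assumptions, not CAST's code: the function name is invented, missing replicate calls (NN) are assumed to be ignored when counting, heterozygotes are assumed to be equivalent regardless of allele order (AT and TA), a tie between two heterozygous calls is assumed to give NN, and a call that reaches exactly the threshold is assumed to pass.

```python
from collections import Counter

MISSING = "NN"


def consensus_call(calls, threshold_pct=0.0):
    """Majority genotype, favoring homozygotes, for 2-letter calls."""
    # Assumption: missing calls are ignored; allele order is normalized.
    observed = ["".join(sorted(c)) for c in calls if c != MISSING]
    if not observed:
        return MISSING
    counts = Counter(observed)
    top_count = max(counts.values())
    top = [g for g, n in counts.items() if n == top_count]
    if len(top) > 1:
        # On a tie, only homozygous genotypes stay in contention.
        top = [g for g in top if g[0] == g[1]]
    if len(top) != 1:
        return MISSING  # two homozygotes tied, or no homozygote in the tie
    if 100.0 * top_count / len(observed) < threshold_pct:
        return MISSING  # winner does not reach the consensus threshold
    return top[0]


print(consensus_call(["AA", "AA", "AT", "AT"]))            # AA
print(consensus_call(["AA", "AA", "TT", "TT"]))            # NN
print(consensus_call(["AA", "AT", "TT"]))                  # NN
print(consensus_call(["AA", "AA", "AT", "AT", "TT"], 50))  # NN
print(consensus_call(["AA", "AA", "AA", "AT", "TT"], 50))  # AA
```

The five calls at the end reproduce the worked examples from the text.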

 

Split Data

The dataset can be split into multiple datasets for downstream analysis using any of the Available Split Categories.

  • Drag and drop the category that you want to split data by from ‘Available Split Categories’ to ‘Selected Split Columns’

  • You may want to select more than one split category, for example if your data needs to be split by the combination of parents identified in the germplasm_par1 and germplasm_par2 fields

  • Select ‘Apply’ to split your data. A ‘Successful’ message is shown when the data has been split

  • You will see a summary of the number of split datasets that have 2 parents and can be analyzed in Flapjack using pedigree verification. Datasets that do NOT have 2 parents will not be included in the project file.

  • Note: the parents of split datasets will be automatically pulled into the first two rows of each dataset according to the parentage defined in germplasm_par1 and germplasm_par2 fields.
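The splitting behaviour described above can be sketched as follows. The function and helper names are hypothetical and the logic is a simplification: it assumes parents have already been consensus called (one sample per parent germplasm_name) and that all samples in a split group share the same two parents. Groups whose two parents cannot both be found are left out, mirroring the rule that such datasets are not included in the project file.

```python
from collections import defaultdict


def row(name, run, par1="", par2="", group=""):
    """Build one sample-metadata record using the Table 1 field names."""
    return {"germplasm_name": name, "dnarun_name": run,
            "germplasm_par1": par1, "germplasm_par2": par2,
            "dnasample_group": group}


def split_datasets(samples, split_columns):
    """Group progeny by split_columns and put their two parents first."""
    by_name = {}
    for s in samples:
        by_name.setdefault(s["germplasm_name"], s)  # first sample per name

    groups = defaultdict(list)
    for s in samples:
        if s["germplasm_par1"] and s["germplasm_par2"]:  # progeny rows only
            groups[tuple(s[c] for c in split_columns)].append(s)

    datasets = {}
    for key, members in groups.items():
        # Assumption: every member of a group has the same two parents.
        parent_names = (members[0]["germplasm_par1"],
                        members[0]["germplasm_par2"])
        if not all(p in by_name for p in parent_names):
            continue  # fewer than 2 parents found: leave this dataset out
        runs = [by_name[p]["dnarun_name"] for p in parent_names]
        runs += [m["dnarun_name"] for m in members]
        datasets[key] = runs
    return datasets


samples = [
    row("p1", "p1-sample1"), row("p2", "p2-sample2"), row("p3", "p3-sample3"),
    row("cross1_p1/p2", "cross1_p1/p2-sample1", "p1", "p2", "cross1"),
    row("cross1_p1/p2", "cross1_p1/p2-sample2", "p1", "p2", "cross1"),
    row("cross2_p3/p4", "cross2_p3/p4-sample1", "p3", "p4", "cross2"),  # p4 absent
]
result = split_datasets(samples, ["germplasm_par1", "germplasm_par2"])
print(result)
```

In this example only the p1 x p2 group is returned, with p1-sample1 and p2-sample2 as its first two rows; the p3 x p4 group is dropped because no p4 sample is present.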

 

Export Data

This page shows a summary of the actions taken by the user and lets you export the result:

  • Select ‘Download Flapjack File’ to download the split dataset in a Flapjack project file format

  • Select the downloaded file to automatically open the project file in Flapjack, where you will see the results of your consensus calling and splitting. Each split dataset is listed on the left-hand side of the Flapjack application. The parents are automatically positioned at the top of each dataset, with consensus calling having been applied for any replicate parent samples. You are now ready for batch analysis for F1 pedigree verification. See the Flapjack help menu for more details.