k-fold cross-validation

Created by Star Yanxin Gao, last modified on Aug 31, 2018

This method has a single parameter, k, which is the number of groups that a given data set is split into. In most cross-validation studies k is set to 5 or 10. The general procedure is as follows:

I) Shuffle the dataset randomly.

II) Split the dataset into k groups of approximately equal size.

III) For each unique group: a) take the group as the hold-out (test) data set, b) take the remaining groups as the training data set, c) fit a model on the training set and evaluate it on the test set, d) retain the evaluation score, then repeat the process with the next group as the test set.

IV) Summarize the scores, typically by estimating the correlation between the predicted values and the observed values of the test data set.
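
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (k_fold_indices, fit_line, pearson, cross_validate), the synthetic data, and the least-squares line used as the model are all made up for the example. Each fold is scored by the Pearson correlation between predicted and observed test values, and the per-fold scores are then averaged.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Steps I and II: shuffle the indices, then split them into k near-equal groups."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives groups whose sizes differ by at most one.
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Toy stand-in model (an assumption): least-squares fit of y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

def cross_validate(xs, ys, k=5, seed=0):
    """Step III: hold out each group once, train on the rest, keep the score."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        predicted = [a * xs[i] + b for i in test_idx]
        observed = [ys[i] for i in test_idx]
        scores.append(pearson(predicted, observed))
    return scores

# Synthetic data (an assumption for the example): a noisy straight line.
rng = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]

scores = cross_validate(xs, ys, k=5)
# Step IV: summarize the per-fold scores.
print("per-fold correlation:", [round(s, 3) for s in scores])
print("mean correlation:", round(mean(scores), 3))
```

Any model and any score can be substituted for fit_line and pearson; only the loop over groups is specific to k-fold cross-validation.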

Each sample is assigned to exactly one group, so it is used in the hold-out set once and used to train the model k-1 times.
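
This property can be checked directly by counting how often each sample index lands in the test set and in the training set. The sketch below is illustrative only: the function name k_fold_indices and the values n_samples = 23 and k = 5 are assumptions chosen so that the groups cannot all be the same size.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices and split them into k groups of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

n_samples, k = 23, 5  # hypothetical sizes; 23 is not divisible by 5
folds = k_fold_indices(n_samples, k)

test_counts = Counter()
train_counts = Counter()
for test_idx in folds:
    held_out = set(test_idx)
    test_counts.update(test_idx)
    train_counts.update(i for i in range(n_samples) if i not in held_out)

# Every sample is held out exactly once and used for training k-1 times.
assert all(test_counts[i] == 1 for i in range(n_samples))
assert all(train_counts[i] == k - 1 for i in range(n_samples))

# Group sizes differ by at most one.
print(sorted(len(f) for f in folds))  # prints [4, 4, 5, 5, 5]
```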