Clustering Analysis

Created by Star Yanxin Gao, last modified by Xuecai Zhang on Jul 03, 2018


Why do we need to do clustering analysis in the Galaxy GS pipeline? The main objective of clustering analysis, or population structure analysis, in GS is to estimate how many subgroups exist in the training population. Based on the number of subgroups in the training population, the user can implement customized cross validations across the subgroups. Another potential application of the clustering/population structure analysis results is optimizing the training population for GEBV prediction, according to the number of subgroups in the training population and the relatedness between the training and predicted populations (not functional now).



Which kinds of clustering analysis are provided in the Galaxy GS pipeline?

What are the major differences between Hierarchical K-Means Clustering, Hybrid K-Means Clustering, and K-Means Clustering?

What are the requirements for creating the input files for running Hierarchical K-Means Clustering, Hybrid K-Means Clustering, or K-Means Clustering?

Set parameters for running the cluster analysis (a minimal example call is shown after this list):

  1. Distance Metric [hc.metric]: the distance measure to be used
  2. Method [hc.method]: the agglomeration method to be used
  3. Algorithm [km.algorithm]: the algorithm to be used for k-means
  4. Number of clusters [k]: (integer) the number of clusters to form
  5. Iterations [iter]: (integer) the maximum number of iterations allowed for k-means

How to explain the results?

How to link the output results to the next steps, i.e., customized cross validation or a customized training population for prediction?
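As a minimal sketch of how these settings map onto a clustering call, and assuming the Galaxy tool wraps the hkmeans() function from the factoextra R package (an assumption, not confirmed by this page; the data set here is only a placeholder):

```r
# Minimal sketch, assuming the tool wraps factoextra::hkmeans();
# 'geno' stands in for a numeric matrix (individuals in rows, markers/traits in columns).
library(factoextra)

geno <- scale(iris[, 1:4])                      # placeholder data for illustration

res <- hkmeans(geno,
               k            = 3,                # Number of clusters [k]
               hc.metric    = "euclidean",      # Distance Metric [hc.metric]
               hc.method    = "ward.D2",        # Method [hc.method]
               iter.max     = 10,               # Iterations [iter]
               km.algorithm = "Hartigan-Wong")  # Algorithm [km.algorithm]

head(res$cluster)                               # cluster assignment per individual
fviz_cluster(res)                               # quick visual check of the clusters
```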



In the Galaxy GS pipeline, two kinds of cluster analysis methods are provided, i.e., Hierarchical K-Means Clustering and Hybrid K-Means Clustering.

K-means starts with a random choice of cluster centers, so it may yield different clustering results on different runs of the algorithm. The results may therefore not be repeatable and can lack consistency. With hierarchical clustering, by contrast, you will get the same clustering results every time.

Of course, K-means clustering requires prior knowledge of K (the number of clusters), whereas in hierarchical clustering you can stop at whatever level (or number of clusters) you wish.

K-means is one of the most popular clustering algorithms. However, it has some limitations: it requires the user to specify the number of clusters in advance, and it selects the initial centroids randomly. The final k-means clustering solution is very sensitive to this initial random selection of cluster centers, so the result might be (slightly) different each time you compute k-means.
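This sensitivity is easy to reproduce. A short base-R illustration (the data set and seeds are arbitrary):

```r
# Illustration only: single-start k-means can settle in different solutions,
# while several random starts (nstart) keep the best one and stabilise the result.
x <- scale(iris[, 1:4])                  # arbitrary example data

set.seed(1); km1 <- kmeans(x, centers = 3, nstart = 1)
set.seed(2); km2 <- kmeans(x, centers = 3, nstart = 1)
c(km1$tot.withinss, km2$tot.withinss)    # may differ between runs

km_best <- kmeans(x, centers = 3, nstart = 25)  # 25 random starts, best solution kept
km_best$tot.withinss
```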

To improve the k-means results, a hybrid method named hierarchical k-means clustering (hkmeans) is provided.



The algorithm is summarized as follows (a base-R sketch of these steps is given after the list):

  1. Compute hierarchical clustering and cut the tree into k clusters

  2. Compute the center (i.e., the mean) of each cluster

  3. Compute k-means by using the set of cluster centers (defined in step 2) as the initial cluster centers
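Spelled out in base R, these three steps might look like the sketch below; hybrid_kmeans, the example data, and the default settings are illustrative, not the pipeline's actual implementation.

```r
# Sketch of the hierarchical k-means idea in base R; not the pipeline's own code.
hybrid_kmeans <- function(x, k, hc.metric = "euclidean", hc.method = "ward.D2",
                          iter.max = 10) {
  # 1. Hierarchical clustering, then cut the tree into k clusters
  tree   <- hclust(dist(x, method = hc.metric), method = hc.method)
  groups <- cutree(tree, k = k)

  # 2. Compute the center (mean) of each cluster
  centers <- do.call(rbind, lapply(split(as.data.frame(x), groups), colMeans))

  # 3. Run k-means using those centers as the initial centroids
  kmeans(x, centers = centers, iter.max = iter.max)
}

res <- hybrid_kmeans(scale(iris[, 1:4]), k = 3)  # illustrative data
table(res$cluster)
```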

There are a number of important differences between k-means and hierarchical clustering, ranging from how the algorithms are implemented to how you can interpret the results.

The k-means algorithm is parameterized by the value k, which is the number of clusters that you want to create. As the animation below illustrates, the algorithm begins by creating centroids. It then iterates between an assign step (where each sample is assigned to its closest centroid) and an update step (where each centroid is updated to become the mean of all the samples assigned to it). This iteration continues until some stopping criterion is met, for example when no sample is reassigned to a different centroid.
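For illustration only, a bare-bones version of this assign/update loop (Lloyd's algorithm) in R; simple_kmeans is a made-up name, and empty clusters and restarts are not handled:

```r
# Bare-bones assign/update loop (Lloyd's algorithm), for illustration only;
# no handling of empty clusters or multiple restarts.
simple_kmeans <- function(x, k, iter.max = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]    # start from k random samples
  assign  <- integer(nrow(x))
  for (i in seq_len(iter.max)) {
    # Assign step: each sample goes to its closest centroid
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    new_assign <- max.col(-d)
    if (all(new_assign == assign)) break               # stop when no sample moves
    assign <- new_assign
    # Update step: each centroid becomes the mean of its assigned samples
    centers <- do.call(rbind, lapply(seq_len(k), function(j)
      colMeans(x[assign == j, , drop = FALSE])))
  }
  list(cluster = assign, centers = centers)
}
```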

The k-means algorithm makes a number of assumptions about the data, which are demonstrated in the scikit-learn example "Demonstration of k-means assumptions". The most notable assumption is that the data are 'spherical'; see "How to understand the drawbacks of K-means" for a detailed discussion.



Agglomerative hierarchical clustering, by contrast, builds clusters incrementally, producing a dendrogram. As the picture below shows, the algorithm begins by assigning each sample to its own cluster (top level). At each step, the two clusters that are most similar are merged; the algorithm continues until all of the clusters have been merged. Unlike k-means, you don't need to specify the number of clusters in advance: once the dendrogram has been produced, you can navigate the layers of the tree to see which number of clusters makes the most sense for your particular application.
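A small base-R sketch of this (the example data and the choice of "ward.D2" are arbitrary): the dendrogram is built once and can then be cut at different levels.

```r
# One dendrogram, cut at whichever level (number of clusters) suits the analysis.
x    <- scale(iris[, 1:4])               # arbitrary example data
tree <- hclust(dist(x), method = "ward.D2")

plot(tree, labels = FALSE)               # inspect the dendrogram
groups3 <- cutree(tree, k = 3)           # cut into 3 clusters
groups5 <- cutree(tree, k = 5)           # ...or into 5, from the same tree
table(groups3, groups5)
```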