Convex Clustering and Biclustering with application in R

Marta Karas

30 Sep 2016

My story with R in the USA

  • R skills brought me to work in Biostatistics departments in the USA

  • The #1 tool used in academic biostatistics

  • Reproducibility & usability awareness - a methodology paper is rarely published without a corresponding R package

Clustering

Cluster analysis - goals


  • We perform clustering when we want to group or segment a data set into subsets so that objects within each subset are more closely related to one another than to objects assigned to other subsets.

Motivation: cancer subtype discovery

  • A cancer may present clinically as a homogeneous disease, but it typically consists of several distinct subtypes at the molecular level

  • One goal of cancer research is to identify such subtypes - groups of patients that share a distinct genomic signature and whose clinical outcomes differ from those of other groups

  • It is the first step towards developing personalized treatment strategies (example: see MD Anderson Cancer Center (Houston, TX) Personalized Cancer Therapy webpage based on Molecular Profiling)

Cancer subtype discovery as a biclustering problem

  • We look for subtypes of cancerous tumors that have similar molecular profiles and the genes that characterize each of them

  • This subtype discovery problem can be posed as a biclustering problem on a gene expression data matrix, where the data are partitioned into a checkerboard-like pattern
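
To make the checkerboard picture concrete, here is a minimal sketch on simulated data (not data from this talk). The matrix size, the block means and the names `row.group`, `col.group` and `block.means` are all illustrative assumptions.

```r
# Simulated checkerboard: 2 row groups x 2 column groups (illustrative only)
set.seed(1)
n.row <- 20                                   # hypothetical number of genes
n.col <- 12                                   # hypothetical number of samples
row.group <- rep(1:2, each = n.row / 2)       # row (gene) cluster labels
col.group <- rep(1:2, each = n.col / 2)       # column (sample) cluster labels

# Each (row group, column group) pair is one bicluster with its own mean
block.means <- matrix(c(2, -2, -2, 2), nrow = 2)
X <- block.means[row.group, col.group] +
  matrix(rnorm(n.row * n.col, sd = 0.5), n.row, n.col)

# In real data the rows and columns arrive in arbitrary order,
# so the biclustering task is to recover the block structure from this
X.shuffled <- X[sample(n.row), sample(n.col)]

# Average value within each true block of the unshuffled matrix
block.avg <- tapply(X, list(row.group[row(X)], col.group[col(X)]), mean)
```

`X.shuffled` can be passed to `heatmap()` to check whether a given method recovers the blocks.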

Cancer subtype discovery as a biclustering problem - successes and failures

  • Biclustering breast cancer data has identified sets of genes whose expression levels segregated patients into five subtypes with distinct survival outcomes (Sørlie et al., 2001)

  • These subtypes have been reproduced in numerous other studies (Sørlie et al., 2003)

  • … but some other subtype discoveries have not been reproduced (e.g. in ovarian cancer (Tothill et al., 2008))

Cancer subtype discovery as a biclustering problem - challenges

  • The failure to reproduce these other results may reflect an absence of biologically meaningful groupings

  • … but another possibility may be related to limitations in the computational methods currently used to identify biclusters

  • Goal: Discover clinically meaningful & reproducible cancer subtypes

Simple solution: bicluster heatmap with hclust

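library(s4vd)
data(lung200) # lung cancer data set (Bhattacharjee et al. 2001)

# Euclidean distances between the rows of the expression matrix
lung200.D.mat <- dist(lung200, method = "euclidean")
# hclust() takes the "dist" object directly (no transpose)
lung200.hclust <- hclust(lung200.D.mat, method = "average")

# Rows follow the dendrogram above; columns use heatmap()'s default clustering
heatmap(lung200, Rowv = as.dendrogram(lung200.hclust), cexRow = 0.3, cexCol = 0.7)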

Simple solution: hclust - algorithm

  1. Begin with \(n\) observations and a measure (e.g. Euclidean distance) of all the \(\binom{n}{2} = n(n-1)/2\) pairwise dissimilarities. Treat each observation as its own cluster.

  2. For \(i = n, n-1, \ldots, 2\):

    • Examine all pairwise inter-cluster dissimilarities among the \(i\) clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters.

    • Compute the new pairwise inter-cluster dissimilarities among the \(i-1\) remaining clusters.
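
The steps above can be sketched in a few lines of base R. This is a naive illustration with average linkage on simulated data, not how `hclust` is implemented. The function name `naive.hclust.average` is hypothetical, and the sketch returns only the fusion heights so that they can be compared with `hclust`.

```r
# Naive agglomerative clustering with average linkage (illustrative sketch)
naive.hclust.average <- function(x) {
  D <- as.matrix(dist(x, method = "euclidean"))  # all pairwise dissimilarities
  clusters <- as.list(seq_len(nrow(x)))          # step 1: every observation is a cluster
  heights <- numeric(0)
  while (length(clusters) > 1) {                 # step 2: i = n, n-1, ..., 2
    k <- length(clusters)
    best <- c(1, 2)
    best.d <- Inf
    for (a in 1:(k - 1)) {
      for (b in (a + 1):k) {
        # average linkage: mean dissimilarity over all cross-cluster pairs
        d.ab <- mean(D[clusters[[a]], clusters[[b]]])
        if (d.ab < best.d) {
          best.d <- d.ab
          best <- c(a, b)
        }
      }
    }
    # fuse the least dissimilar pair and record the fusion height
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
    heights <- c(heights, best.d)
  }
  heights
}

set.seed(1)
x <- matrix(rnorm(20), nrow = 10)                # 10 simulated observations
naive.heights <- naive.hclust.average(x)
ref.heights <- hclust(dist(x), method = "average")$height
heights.agree <- isTRUE(all.equal(naive.heights, ref.heights))
```

Recomputing every inter-cluster dissimilarity from `D` at each pass keeps the sketch close to the description above, at the cost of speed.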

Simple solution: hclust - pros and cons

  • Pros:

    • Easy to interpret
    • Fast computation
  • Cons:

    • Lack of global objective function
    • Instability (sensitive to perturbations in the data and to the choice of distance and linkage type)
    • How to choose number of biclusters?
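
One way to look at the sensitivity to linkage is to cut the trees from several linkage types at the same number of clusters and cross-tabulate the labels. The sketch below uses simulated data. The choice of `k = 3` and of the three linkage types is arbitrary and only for illustration.

```r
# Compare cluster labels across linkage types (simulated data, illustrative)
set.seed(2016)
x <- matrix(rnorm(60 * 5), nrow = 60)   # 60 simulated observations, 5 variables
d <- dist(x, method = "euclidean")

linkages <- c("average", "complete", "single")
# One column of cluster labels per linkage type, each tree cut into k = 3 clusters
labels <- sapply(linkages, function(m) cutree(hclust(d, method = m), k = 3))

# Cross-tabulation of the labels from two linkage types
tab <- table(average = labels[, "average"], complete = labels[, "complete"])
```

The number of clusters has to be supplied to `cutree` by hand, and the dendrogram itself does not say which value to use.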

hclust - distance choice effect

# Function to plot heatmap with hclust dendrogram (method = "average") 
plot.lung200.heatmap.1 <- function(row.dist, col.dist, plt.title){
  heatmap(lung200, main = plt.title,
          cexRow = 0.3, cexCol = 0.7, labRow = FALSE, labCol = FALSE, 
          Rowv = as.dendrogram(hclust(row.dist, method = "average")), 
          Colv = as.dendrogram(hclust(col.dist, method = "average")))
}


D.mat.row <- dist(lung200, method = "euclidean")
D.mat.col <- dist(t(lung200), method = "euclidean")
plot.lung200.heatmap.1(D.mat.row, D.mat.col, "Euclidean dist")

D.mat.row <- dist(lung200, method = "maximum")
D.mat.col <- dist(t(lung200), method = "maximum")
plot.lung200.heatmap.1(D.mat.row, D.mat.col, "Maximum dist")

D.mat.row <- as.dist(1 - cor(t(lung200)))
D.mat.col <- as.dist(1 - cor(lung200))
plot.lung200.heatmap.1(D.mat.row, D.mat.col, "Correlation dist")
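
The three heatmaps are compared by eye. As a possible numeric complement, the sketch below correlates the cophenetic distances of the trees built from each distance. It uses a simulated matrix `x` as a stand-in for an expression matrix, so it runs without the `s4vd` package, and the summary itself is an illustrative choice rather than part of the original analysis.

```r
# Numeric comparison of dendrograms built from different distances (sketch)
set.seed(30)
x <- matrix(rnorm(40 * 8), nrow = 40)   # simulated stand-in for an expression matrix

# The same three row distances as in the heatmaps above
dists <- list(
  euclidean   = dist(x, method = "euclidean"),
  maximum     = dist(x, method = "maximum"),
  correlation = as.dist(1 - cor(t(x)))
)
trees <- lapply(dists, hclust, method = "average")

# Cophenetic distance: the height at which two observations are first merged
coph <- sapply(trees, function(tr) as.vector(cophenetic(tr)))

# Pairwise correlations between trees; values near 1 mean similar dendrograms
coph.cor <- cor(coph)
```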