Convex Clustering and Biclustering with application in R

Marta Karas

30 Sep 2016

My story with R in USA

  • My R skills brought me to work in the USA in Biostatistics departments

  • The #1 tool used in academic Biostatistics

  • Reproducibility & usability awareness - one rarely sees a methodology paper published without a corresponding R package

Clustering

Cluster analysis - goals


  • We perform clustering when we want to group or segment the data set into subsets so that objects within each subset are more closely related to one another than to objects assigned to other subsets.

Motivation: cancer subtype discovery

  • A cancer may present clinically as a homogeneous disease, but it typically consists of several distinct subtypes at the molecular level

  • One of the goals of cancer research is to identify such cancer subtypes - groups of patients that share distinct genomic signatures and differing clinical outcomes

  • This is the first step towards developing personalized treatment strategies (example: see the MD Anderson Cancer Center (Houston, TX) Personalized Cancer Therapy webpage based on Molecular Profiling)

Cancer subtype discovery as a biclustering problem

  • We look for subtypes of cancerous tumors that have similar molecular profiles, and the genes that characterize each of them

  • This subtype discovery problem can be posed as biclustering of the gene expression data matrix, where the data is partitioned into a checkerboard-like pattern
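To make the checkerboard idea concrete, here is a toy base-R sketch (not the lung data - the block means, cluster sizes, and noise level are made up for illustration):

```r
set.seed(1)
# 2 row clusters x 3 column clusters -> 6 biclusters (blocks of constant mean)
row.cl <- rep(1:2, each = 20)   # 40 rows (e.g. genes)
col.cl <- rep(1:3, each = 10)   # 30 columns (e.g. patients)
block.means <- matrix(c(-2, 0, 2,
                         2, -2, 0), nrow = 2, byrow = TRUE)
# Mean of entry (i, j) is determined by the (row cluster, column cluster) pair
mu <- block.means[cbind(rep(row.cl, times = 30), rep(col.cl, each = 40))]
X <- matrix(mu, 40, 30) + matrix(rnorm(40 * 30, sd = 0.5), 40, 30)
```

Running `heatmap(X)` on such a matrix recovers the blocks almost perfectly; a checkerboard hidden by noise is exactly the pattern biclustering methods target in real data.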

Cancer subtype discovery as a biclustering problem - successes and failures

  • Biclustering breast cancer data has identified sets of genes whose expression levels segregated patients into five subtypes with distinct survival outcomes (Sørlie et al., 2001)

  • These subtypes have been reproduced in numerous other studies (Sørlie et al., 2003)

  • … but other subtype discoveries were not (e.g. with ovarian cancer (Tothill et al., 2008))

Cancer subtype discovery as a biclustering problem - challenges

  • The failure to reproduce these other results may reflect an absence of biologically meaningful groupings

  • … but another possibility may be related to limitations in the computational methods currently used to identify biclusters

  • Goal: Discover clinically meaningful & reproducible cancer subtypes

Simple solution: bicluster heatmap with hclust


library(s4vd)
data(lung200) # lung cancer data set (Bhattacharjee et al. 2001)

lung200.D.mat <- dist(lung200, method = "euclidean")  # pairwise distances between rows (genes)
lung200.hclust <- hclust(lung200.D.mat, method = "average")

heatmap(lung200, Rowv = as.dendrogram(lung200.hclust), cexRow = 0.3, cexCol = 0.7)

Simple solution: hclust - algorithm

  1. Begin with \(n\) observations and a measure (e.g. Euclidean distance) of all the \(n(n-1)/2\) pairwise dissimilarities. Treat each observation as its own cluster.

  2. For \(i=n,n-1,...,2\):

    • Examine all pairwise inter-cluster dissimilarities among the \(i\) clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters.

    • Compute the new pairwise inter-cluster dissimilarities among the \(i-1\) remaining clusters.
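The fusion steps can be inspected directly in R: `hclust` returns a `merge` matrix with one row per fusion (negative entries are singleton observations, positive entries refer to earlier fusions) and a `height` vector with the dissimilarity at which each fusion occurred. A minimal sketch on toy data:

```r
set.seed(1)
x <- matrix(rnorm(10 * 2), nrow = 10)      # 10 observations in 2 dimensions
hc <- hclust(dist(x), method = "average")  # average-linkage agglomeration

hc$merge   # n - 1 = 9 fusions; row i records the pair fused at step i
hc$height  # inter-cluster dissimilarity at each fusion
```

For average linkage the fusion heights are non-decreasing, which is what makes the dendrogram drawable without crossings.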

Simple solution: hclust - pros and cons

  • Pros:

    • Easy to interpret
    • Fast computation
  • Cons:

    • Lack of a global objective function
    • Instability (sensitive to perturbations in the data, distance choice, and linkage choice)
    • No principled way to choose the number of biclusters
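The last point shows up in practice as cutting the dendrogram at an arbitrary height or cluster count, e.g. with `cutree` - a choice `hclust` itself gives no guidance on (toy data for illustration):

```r
set.seed(2)
# Two well-separated groups of 10 points each in 2 dimensions
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
hc <- hclust(dist(x), method = "average")

cutree(hc, k = 2)  # labels when we *declare* 2 clusters
cutree(hc, k = 4)  # the same tree cut into 4 clusters - equally "valid"
```

Both cuts come from the same tree; nothing in the output says which k is right.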

hclust - distance choice effect

# Function to plot heatmap with hclust dendrogram (method = "average") 
plot.lung200.heatmap.1 <- function(row.dist, col.dist, plt.title){
  heatmap(lung200, main = plt.title,
          cexRow = 0.3, cexCol = 0.7, labRow = FALSE, labCol = FALSE, 
          Rowv = as.dendrogram(hclust(row.dist, method = "average")), 
          Colv = as.dendrogram(hclust(col.dist, method = "average")))
}


D.mat.row <- dist(lung200, method = "euclidean")
D.mat.col <- dist(t(lung200), method = "euclidean")
plot.lung200.heatmap.1(D.mat.row, D.mat.col, "Euclidean dist")

D.mat.row <- dist(lung200, method = "maximum")
D.mat.col <- dist(t(lung200), method = "maximum")
plot.lung200.heatmap.1(D.mat.row, D.mat.col, "Maximum dist")

D.mat.row <-  as.dist(1 - cor(t(lung200)))
D.mat.col <- as.dist(1 - cor(lung200))
plot.lung200.heatmap.1(D.mat.row, D.mat.col, "Correlation dist")

hclust - linkage function choice effect

# Function to plot heatmap with hclust dendrogram (distance method = "euclidean") 
plot.lung200.heatmap.2 <- function(hclust.method, plt.title){
  D.mat.row <- dist(lung200, method = "euclidean")
  D.mat.col <- dist(t(lung200), method = "euclidean")
  heatmap(lung200, main = plt.title,
          cexRow = 0.3, cexCol = 0.7, labRow = FALSE, labCol = FALSE, 
          Rowv = as.dendrogram(hclust(D.mat.row, method = hclust.method)), 
          Colv = as.dendrogram(hclust(D.mat.col, method = hclust.method)))
}

plot.lung200.heatmap.2("average", "'average' linkage")
plot.lung200.heatmap.2("single", "'single' linkage")
plot.lung200.heatmap.2("complete", "'complete' linkage")
plot.lung200.heatmap.2("centroid", "'centroid' linkage")

hclust - perturbations in data effect

plot.lung200.heatmap.3 <- function(gaussian.sd, plt.title){
  n <- dim(lung200)[1] * dim(lung200)[2]
  gaussian.err <-  matrix(rnorm(n, sd = gaussian.sd), 
                          nrow = dim(lung200)[1], ncol = dim(lung200)[2])
  lung200.pert <- lung200 + gaussian.err
  D.mat.row <- dist(lung200.pert, method = "euclidean")
  D.mat.col <- dist(t(lung200.pert), method = "euclidean")
  heatmap(lung200.pert, main = plt.title,
          cexRow = 0.3, cexCol = 0.7, labRow = FALSE, labCol = FALSE, 
          Rowv = as.dendrogram(hclust(D.mat.row, method = "average")), 
          Colv = as.dendrogram(hclust(D.mat.col, method = "average")))
}

R2 <- 0.8
err.sd <- sqrt(var(as.vector(lung200)) * (1 - R2) / R2) 
set.seed(1)
plot.lung200.heatmap.3(0, "No Gaussian pert.")
plot.lung200.heatmap.3(err.sd, "Gaussian pert. 1.")
plot.lung200.heatmap.3(err.sd, "Gaussian pert. 2.")
plot.lung200.heatmap.3(err.sd, "Gaussian pert. 3.")

Question:

Can we formulate a convex method for clustering that will yield a unique & global solution?

Convex Biclustering

Convex Clustering - idea

Goal:

  • Simple and interpretable like cluster dendrogram

  • Good algorithmic behavior:
    • Global minimizer
    • Stability with respect to data and other inputs

  • Good, data-driven way to select cluster number and extent

Convex Clustering - idea

Solution:

  • Replace the combinatorially hard clustering problem with a convex surrogate:

    • All local minima are global minima

    • Algorithms converge to global minimizer regardless of initialization

Convex Clustering - formula

  • Minimize \[\frac{1}{2}\sum_{i=1}^{n}||\mathbf{x}_i - \mathbf{u}_i||_2^2 + \gamma \sum_{i<j}w_{ij}||\mathbf{u}_i- \mathbf{u}_j||_1\]

  • \(\mathbf{u}_i\) is a centroid assigned to each data point \(\mathbf{x}_i\)

  • the \(\gamma \sum_{i<j}w_{ij}|| \mathbf{u}_i- \mathbf{u}_j||_1\) penalty shrinks centroids together

  • \(\gamma\) controls BOTH cluster assignments & the number of clusters:
    • \(\gamma=0\) - each observation is its own cluster
    • \(\gamma\) larger - centroids begin to merge together into clusters
    • \(\gamma\) very large - all observations fuse into one cluster

  • \(w_{ij}\) - pre-assigned sparse weights, typically decreasing in the distance between observations (e.g. Gaussian kernel weights)
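The convex clustering objective - squared-error fit plus a weighted fusion penalty on centroid pairs - is easy to evaluate directly. A minimal base-R sketch (uniform weights and the \(\ell_1\) fusion penalty for illustration; this is not a solver):

```r
# Objective: 0.5 * sum_i ||x_i - u_i||_2^2
#            + gamma * sum_{i<j} w_ij * ||u_i - u_j||_1
# X, U: n x p matrices (rows = observations / centroids); W: n x n weights
cvxclust.obj <- function(X, U, W, gamma) {
  fit <- 0.5 * sum((X - U)^2)
  pen <- 0
  n <- nrow(X)
  for (i in seq_len(n - 1))
    for (j in (i + 1):n)
      pen <- pen + W[i, j] * sum(abs(U[i, ] - U[j, ]))
  fit + gamma * pen
}

X <- matrix(c(0, 0, 1, 1, 5, 5), ncol = 2, byrow = TRUE)
W <- matrix(1, 3, 3)
cvxclust.obj(X, U = X, W, gamma = 0)  # gamma = 0: U = X is optimal, objective 0
```

At \(\gamma = 0\) setting each centroid to its own data point gives objective 0; as \(\gamma\) grows, the penalty term makes it cheaper to let centroids coincide, which is the fusion mechanism described above.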

Convex Clustering - solution path

\(\gamma_0 = 0\) used - gives as many clusters as data points

\(\gamma_{k_1}\) used, such that \(\gamma_{k_1} > \gamma_0\)

\(\gamma_{k_2}\) used, such that \(\gamma_{k_2} > \gamma_{k_1}\); 5 clusters

\(\gamma_{k_3}\) used, such that \(\gamma_{k_3} > \gamma_{k_2}\); 3 clusters

\(\gamma_{k_4}\) used, such that \(\gamma_{k_4} > \gamma_{k_3}\); 3 clusters

\(\gamma_{k_5}\) used, such that \(\gamma_{k_5} > \gamma_{k_4}\); 1 cluster

Convex Biclustering - formula

  • Similar idea applies when we cluster both columns and rows (biclustering):

  • \(i\)-th and \(j\)-th columns belong to the same column cluster if \[\mathbf{U}_{\cdot i} - \mathbf{U}_{\cdot j} = \mathbf{0}\]
  • \(k\)-th and \(l\)-th rows belong to the same row cluster if \[\mathbf{U}_{k \cdot} - \mathbf{U}_{l \cdot} = \mathbf{0}\]
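Cluster assignments can thus be read off a fused solution \(\mathbf{U}\) by grouping (numerically) identical columns or rows. A minimal illustrative helper - `col.clusters` is a made-up name, and real solvers such as cvxbiclustr return the cluster structure directly:

```r
# Group identical columns of U (up to tolerance) into column-cluster labels
col.clusters <- function(U, tol = 1e-6) {
  key <- apply(round(U / tol) * tol, 2, paste, collapse = ",")
  match(key, unique(key))
}

U <- cbind(c(1, 2), c(1, 2), c(3, 4))  # columns 1 and 2 are fused
col.clusters(U)  # 1 1 2
```

The same idea applied to `t(U)` yields row-cluster labels; together the two label vectors define the checkerboard of biclusters.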

Convex Biclustering - application on lung200 data

lung200 data set

  • The data consists of the expression levels of 200 genes across 56 individuals, a subset of the data studied in Lee et al. (2010).

  • Subjects belong to one of four lung cancer subgroups: Normal, Carcinoid, Colon, or Small Cell.

lung200[1:5, 1:5]  # lung cancer data set 
##            Carcinoid Carcinoid  Carcinoid Carcinoid   Carcinoid
## 40808_at    3.945499  3.798840  4.6410622  4.866513  4.34580553
## 36924_r_at  4.453260  4.287104  4.2031069  4.764685  4.15189982
## 32252_at    1.081860  5.731959  5.7501960  5.797660 -0.05565615
## 37864_s_at -4.963623 -4.730377 -0.6692398 -2.775429 -3.01534195
## 38691_s_at -2.745149 -3.118464 -3.2292698 -2.875352 -2.81904103

# Normalize data 
X <- lung200
X <- X - mean(X)
X <- X/norm(X,'f') 


# Create annotation for heatmap
types <- colnames(lung200)
ty <- as.numeric(factor(types))
cols <- rainbow(4)
YlGnBu5 <- c('#ffffd9','#c7e9b4','#41b6c4','#225ea8','#081d58')
hmcols <- colorRampPalette(YlGnBu5)(256)


# Construct weights and edge-incidence matrices
library(cvxbiclustr)
wts <- gkn_weights(X) # combines Gaussian kernel weights with k-nearest neighbor weights
w_row <- wts$w_row  # Vector of weights for row graph
w_col <- wts$w_col  # Vector of weights for column graph
E_row <- wts$E_row  # Edge-incidence matrix for row graph
E_col <- wts$E_col  # Edge-incidence matrix for column graph

# Initialize path parameters and structures
(gammaSeq <- 10^seq(0, 3, length.out = 5))

# Generate solution path,
# using an MM algorithm wrapper that also performs parameter selection
#
# (The MM (majorization-minimization) algorithm is an iterative optimization
# method that repeatedly minimizes a simpler surrogate function majorizing
# the convex objective, driving it down to its global minimum)
solution <- cobra_validate(X, E_row, E_col, w_row, w_col, gammaSeq)
# Solutions validation error
round(solution$validation_error, 2)
## [1] 0.31 0.20 0.31 0.31 0.31
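The data-driven choice of \(\gamma\) is then the one minimizing the validation error; with the values printed above (hard-coded here so the snippet stands alone):

```r
validation.error <- c(0.31, 0.20, 0.31, 0.31, 0.31)  # from cobra_validate above
gammaSeq <- 10^seq(0, 3, length.out = 5)

ix <- which.min(validation.error)
gammaSeq[ix]  # the gamma achieving the lowest validation error
```

Here the second grid point wins, so its heatmap (below) is the one to report.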

# One heatmap of the fused solution U per gamma in the grid
for (idx in seq_along(gammaSeq)) {
  heatmap(solution$U[[idx]], col = hmcols, labRow = NA, labCol = NA,
          ColSideColors = cols[ty],
          main = paste0("gamma = ", round(gammaSeq[idx], 2)))
}

Convex Biclustering - summary

  • Key property: the global solution \(\mathbf{U}^*\) (the row- and column-cluster assignment) exists, is unique, and depends continuously on the shrinkage parameter, weights, and data \((\gamma,\mathbf{W},\mathbf{\widetilde{W}},\mathbf{X})\) => implication: STABILITY to perturbations in the data

  • Yields simple & interpretable solutions like cluster heatmap

  • Well-behaved:
    • Unique, global minimizer
    • Stability with respect to initialization, parameters, and data

  • Fast algorithm

  • One tuning parameter \(\gamma\) that controls number and extent of biclusters; data-dependent way of selecting \(\gamma\)

Acknowledgements

  • The presentation topic was inspired by Module 5: Unsupervised Methods for Statistical Machine Learning of the Summer Institute in Statistics for Big Data, which I attended at the University of Washington in Seattle, WA in 2016.

  • Some of the images (the convex clustering solution path) were reproduced from materials publicly available on GitHub (link), with consent given orally during the Module classes by its instructors, Genevera Allen & Yufeng Liu.

Bibliography

  1. Allen, G. I. (2015), Convex Biclustering Techniques for Cancer Subtype Discovery. Slides (link)
  2. Chi, E. C., Allen, G. I., Baraniuk, R. G. (2015), Convex Biclustering
  3. Sørlie, T., Perou, C. M., Tibshirani, R. et al. (2001), Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences 98 10869-10874.
  4. Sørlie, T., Tibshirani, R., Parker, J. et al. (2003). Repeated observation of breast tumor subtypes in independent gene expression data sets. Proceedings of the National Academy of Sciences 100 8418-8423.
  5. Tothill, R. W., Tinker, A. V., George, J. et al. (2008). Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clinical Cancer Research 14 5198-5208.

Bibliography (cont.)

  1. Lee, M., Shen, H., Huang, J. Z. and Marron, J. S. (2010). Biclustering via Sparse Singular Value Decomposition. Biometrics 66 1087-1095.
  2. James, G., Witten, D., Hastie, T., Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R, Springer Science+Business Media, New York.