Convex Clustering and Biclustering with application in R

Marta Karas

30 Sep 2016

My story with R in USA

  • My R skills brought me to work in the USA in Biostatistics departments

  • The #1 tool used in academic Biostatistics

  • Reproducibility & usability awareness - one rarely sees a methodology paper published without a corresponding R package

Clustering

Cluster analysis - goals


  • We perform clustering when we want to group or segment the data set into subsets so that objects within each subset are more closely related to one another than to objects assigned to other subsets.

Motivation: cancer subtype discovery

  • A cancer may present clinically as a homogeneous disease, but it typically consists of several distinct subtypes at the molecular level

  • One of the goals of cancer research is to identify such cancer subtypes - groups of patients that share distinct genomic signatures and differing clinical outcomes

  • This is the first step towards developing personalized treatment strategies (example: see the MD Anderson Cancer Center (Houston, TX) Personalized Cancer Therapy webpage based on Molecular Profiling)

Cancer subtype discovery as a biclustering problem

  • We look for subtypes of cancerous tumors that have similar molecular profiles, and the genes that characterize each of them

  • This subtype discovery problem can be posed as biclustering of the gene expression data matrix, where the data is partitioned into a checkerboard-like pattern
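To make the checkerboard idea concrete, here is a toy base-R sketch (not the lung data - the block means, cluster sizes, and noise level are made up for illustration):

```r
set.seed(1)
# 2 row clusters x 3 column clusters -> 6 biclusters (blocks of constant mean)
row.cl <- rep(1:2, each = 20)   # 40 rows (e.g. genes)
col.cl <- rep(1:3, each = 10)   # 30 columns (e.g. patients)
block.means <- matrix(c(-2, 0, 2,
                         2, -2, 0), nrow = 2, byrow = TRUE)
# Mean of entry (i, j) is determined by the (row cluster, column cluster) pair
mu <- block.means[cbind(rep(row.cl, times = 30), rep(col.cl, each = 40))]
X <- matrix(mu, 40, 30) + matrix(rnorm(40 * 30, sd = 0.5), 40, 30)
```

Running `heatmap(X)` on such a matrix recovers the blocks almost perfectly; a checkerboard hidden by noise is exactly the pattern biclustering methods target in real data.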

Cancer subtype discovery as a biclustering problem - successes and failures

  • Biclustering breast cancer data has identified sets of genes whose expression levels segregated patients into five subtypes with distinct survival outcomes (Sørlie et al., 2001)

  • These subtypes have been reproduced in numerous other studies (Sørlie et al., 2003)

  • … but other subtype discoveries were not (e.g. with ovarian cancer (Tothill et al., 2008))

Cancer subtype discovery as a biclustering problem - challenges

  • The failure to reproduce these other results may reflect an absence of biologically meaningful groupings

  • … but another possibility may be related to limitations in the computational methods currently used to identify biclusters

  • Goal: Discover clinically meaningful & reproducible cancer subtypes

Simple solution: bicluster heatmap with hclust


library(s4vd)
data(lung200) # lung cancer data set (Bhattacharjee et al. 2001)

lung200.D.mat <- dist(lung200, method = "euclidean")  # pairwise distances between rows (genes)
lung200.hclust <- hclust(lung200.D.mat, method = "average")

heatmap(lung200, Rowv = as.dendrogram(lung200.hclust), cexRow = 0.3, cexCol = 0.7)

Simple solution: hclust - algorithm

  1. Begin with \(n\) observations and a measure (e.g. Euclidean distance) of all the \(n(n-1)/2\) pairwise dissimilarities. Treat each observation as its own cluster.

  2. For \(i=n,n-1,...,2\):

    • Examine all pairwise inter-cluster dissimilarities among the \(i\) clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters.

    • Compute the new pairwise inter-cluster dissimilarities among the \(i-1\) remaining clusters.
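The fusion steps can be inspected directly in R: `hclust` returns a `merge` matrix with one row per fusion (negative entries are singleton observations, positive entries refer to earlier fusions) and a `height` vector with the dissimilarity at which each fusion occurred. A minimal sketch on toy data:

```r
set.seed(1)
x <- matrix(rnorm(10 * 2), nrow = 10)      # 10 observations in 2 dimensions
hc <- hclust(dist(x), method = "average")  # average-linkage agglomeration

hc$merge   # n - 1 = 9 fusions; row i records the pair fused at step i
hc$height  # inter-cluster dissimilarity at each fusion
```

For average linkage the fusion heights are non-decreasing, which is what makes the dendrogram drawable without crossings.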

Simple solution: hclust - pros and cons

  • Pros:

    • Easy to interpret
    • Fast computation
  • Cons:

    • Lack of a global objective function
    • Instability (sensitive to perturbations in the data, distance choice, and linkage choice)
    • No principled way to choose the number of biclusters
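The last point shows up in practice as cutting the dendrogram at an arbitrary height or cluster count, e.g. with `cutree` - a choice `hclust` itself gives no guidance on (toy data for illustration):

```r
set.seed(2)
# Two well-separated groups of 10 points each in 2 dimensions
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
hc <- hclust(dist(x), method = "average")

cutree(hc, k = 2)  # labels when we *declare* 2 clusters
cutree(hc, k = 4)  # the same tree cut into 4 clusters - equally "valid"
```

Both cuts come from the same tree; nothing in the output says which k is right.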

hclust - distance choice effect

# Function to plot heatmap with hclust dendrogram (method = "average") 
plot.lung200.heatmap.1 <- function(row.dist, col.dist, plt.title){
  heatmap(lung200, main = plt.title,
          cexRow = 0.3, cexCol = 0.7, labRow = FALSE, labCol = FALSE, 
          Rowv = as.dendrogram(hclust(row.dist, method = "average")), 
          Colv = as.dendrogram(hclust(col.dist, method = "average")))
}


D.mat.row <- dist(lung200, method = "euclidean")
D.mat.col <- dist(t(lung200), method = "euclidean")
plot.lung200.heatmap.1(D.mat.row, D.mat.col, "Euclidean dist")

D.mat.row <- dist(lung200, method = "maximum")
D.mat.col <- dist(t(lung200), method = "maximum")
plot.lung200.heatmap.1(D.mat.row, D.mat.col, "Maximum dist")

D.mat.row <-  as.dist(1 - cor(t(lung200)))
D.mat.col <- as.dist(1 - cor(lung200))
plot.lung200.heatmap.1(D.mat.row, D.mat.col, "Correlation dist")

hclust - linkage function choice effect

# Function to plot heatmap with hclust dendrogram (distance method = "euclidean") 
plot.lung200.heatmap.2 <- function(hclust.method, plt.title){
  D.mat.row <- dist(lung200, method = "euclidean")
  D.mat.col <- dist(t(lung200), method = "euclidean")
  heatmap(lung200, main = plt.title,
          cexRow = 0.3, cexCol = 0.7, labRow = FALSE, labCol = FALSE, 
          Rowv = as.dendrogram(hclust(D.mat.row, method = hclust.method)), 
          Colv = as.dendrogram(hclust(D.mat.col, method = hclust.method)))
}

plot.lung200.heatmap.2("average", "'average' linkage")
plot.lung200.heatmap.2("single", "'single' linkage")
plot.lung200.heatmap.2("complete", "'complete' linkage")
plot.lung200.heatmap.2("centroid", "'centroid' linkage")

hclust - perturbations in data effect

plot.lung200.heatmap.3 <- function(gaussian.sd, plt.title){
  n <- dim(lung200)[1] * dim(lung200)[2]
  gaussian.err <-  matrix(rnorm(n, sd = gaussian.sd), 
                          nrow = dim(lung200)[1], ncol = dim(lung200)[2])
  lung200.pert <- lung200 + gaussian.err
  D.mat.row <- dist(lung200.pert, method = "euclidean")
  D.mat.col <- dist(t(lung200.pert), method = "euclidean")
  heatmap(lung200.pert, main = plt.title,
          cexRow = 0.3, cexCol = 0.7, labRow = FALSE, labCol = FALSE, 
          Rowv = as.dendrogram(hclust(D.mat.row, method = "average")), 
          Colv = as.dendrogram(hclust(D.mat.col, method = "average")))
}

R2 <- 0.8
err.sd <- sqrt(var(as.vector(lung200)) * (1 - R2) / R2) 
set.seed(1)
plot.lung200.heatmap.3(0, "No Gaussian pert.")
plot.lung200.heatmap.3(err.sd, "Gaussian pert. 1.")
plot.lung200.heatmap.3(err.sd, "Gaussian pert. 2.")
plot.lung200.heatmap.3(err.sd, "Gaussian pert. 3.")

Question:

Can we formulate a convex method for clustering that will yield a unique & global solution?

Convex Biclustering

Convex Clustering - idea

Goal:

  • Simple and interpretable like cluster dendrogram

  • Good algorithmic behavior:
    • Global minimizer
    • Stability with respect to data and other inputs

  • Good, data-driven way to select cluster number and extent

Convex Clustering - idea

Solution:

  • Replace the combinatorially hard clustering problem with a convex surrogate:

    • All local minima are global minima

    • Algorithms converge to global minimizer regardless of initialization

Convex Clustering - formula

  • Minimize \[\frac{1}{2}\sum_{i=1}^{n}||\mathbf{x}_i - \mathbf{u}_i||_2^2 + \gamma \sum_{i<j}w_{ij}||\mathbf{u}_i- \mathbf{u}_j||_1\]

  • \(\mathbf{u}_i\) is a centroid assigned to each data point \(\mathbf{x}_i\)

  • the \(\gamma \sum_{i<j}w_{ij}|| \mathbf{u}_i- \mathbf{u}_j||_1\) penalty shrinks centroids together

  • \(\gamma\) controls BOTH cluster assignments & the number of clusters:
    • \(\gamma=0\) - each observation is its own cluster
    • \(\gamma\) larger - centroids begin to merge together into clusters
    • \(\gamma\) very large - all observations fuse into one cluster

  • \(w_{ij}\) - pre-assigned sparse weights, typically decreasing in the distance between observations (e.g. Gaussian kernel weights)
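The convex clustering objective - squared-error fit plus a weighted fusion penalty on centroid pairs - is easy to evaluate directly. A minimal base-R sketch (uniform weights and the \(\ell_1\) fusion penalty for illustration; this is not a solver):

```r
# Objective: 0.5 * sum_i ||x_i - u_i||_2^2
#            + gamma * sum_{i<j} w_ij * ||u_i - u_j||_1
# X, U: n x p matrices (rows = observations / centroids); W: n x n weights
cvxclust.obj <- function(X, U, W, gamma) {
  fit <- 0.5 * sum((X - U)^2)
  pen <- 0
  n <- nrow(X)
  for (i in seq_len(n - 1))
    for (j in (i + 1):n)
      pen <- pen + W[i, j] * sum(abs(U[i, ] - U[j, ]))
  fit + gamma * pen
}

X <- matrix(c(0, 0, 1, 1, 5, 5), ncol = 2, byrow = TRUE)
W <- matrix(1, 3, 3)
cvxclust.obj(X, U = X, W, gamma = 0)  # gamma = 0: U = X is optimal, objective 0
```

At \(\gamma = 0\) setting each centroid to its own data point gives objective 0; as \(\gamma\) grows, the penalty term makes it cheaper to let centroids coincide, which is the fusion mechanism described above.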

Convex Clustering - solution path

\(\gamma_0 = 0\) used - gives as many clusters as data points

\(\gamma_{k_1}\) used, such that \(\gamma_{k_1} > \gamma_0\)

\(\gamma_{k_2}\) used, such that \(\gamma_{k_2} > \gamma_{k_1}\); 5 clusters

\(\gamma_{k_3}\) used, such that \(\gamma_{k_3} > \gamma_{k_2}\); 3 clusters

\(\gamma_{k_4}\) used, such that \(\gamma_{k_4} > \gamma_{k_3}\); 3 clusters

\(\gamma_{k_5}\) used, such that \(\gamma_{k_5} > \gamma_{k_4}\); 1 cluster

Convex Biclustering - formula

  • Similar idea applies when we cluster both columns and rows (biclustering):

  • \(i\)-th and \(j\)-th columns belong to the same column cluster if \[\mathbf{U}_{\cdot i} - \mathbf{U}_{\cdot j} = \mathbf{0}\]
  • \(k\)-th and \(l\)-th rows belong to the same row cluster if \[\mathbf{U}_{k \cdot} - \mathbf{U}_{l \cdot} = \mathbf{0}\]
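Cluster assignments can thus be read off a fused solution \(\mathbf{U}\) by grouping (numerically) identical columns or rows. A minimal illustrative helper - `col.clusters` is a made-up name, and real solvers such as cvxbiclustr return the cluster structure directly:

```r
# Group identical columns of U (up to tolerance) into column-cluster labels
col.clusters <- function(U, tol = 1e-6) {
  key <- apply(round(U / tol) * tol, 2, paste, collapse = ",")
  match(key, unique(key))
}

U <- cbind(c(1, 2), c(1, 2), c(3, 4))  # columns 1 and 2 are fused
col.clusters(U)  # 1 1 2
```

The same idea applied to `t(U)` yields row-cluster labels; together the two label vectors define the checkerboard of biclusters.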

Convex Biclustering - application on lung200 data

lung200 data set

  • The data consists of the expression levels of 200 genes across 56 individuals, a subset of the data studied in Lee et al. (2010).

  • Subjects belong to one of four lung cancer subgroups: Normal, Carcinoid, Colon, or Small Cell.

lung200[1:5, 1:5]  # lung cancer data set 
##            Carcinoid Carcinoid  Carcinoid Carcinoid   Carcinoid
## 40808_at    3.945499  3.798840  4.6410622  4.866513  4.34580553
## 36924_r_at  4.453260  4.287104  4.2031069  4.764685  4.15189982
## 32252_at    1.081860  5.731959  5.7501960  5.797660 -0.05565615
## 37864_s_at -4.963623 -4.730377 -0.6692398 -2.775429 -3.01534195
## 38691_s_at -2.745149 -3.118464 -3.2292698 -2.875352 -2.81904103

# Normalize data 
X <- lung200
X <- X - mean(X)
X <- X/norm(X,'f') 


# Create annotation for heatmap
types <- colnames(lung200)
ty <- as.numeric(factor(types))
cols <- rainbow(4)
YlGnBu5 <- c('#ffffd9','#c7e9b4','#41b6c4','#225ea8','#081d58')
hmcols <- colorRampPalette(YlGnBu5)(256)


# Construct weights and edge-incidence matrices
library(cvxbiclustr)
wts <- gkn_weights(X) # combines Gaussian kernel weights with k-nearest neighbor weights
w_row <- wts$w_row  # Vector of weights for row graph
w_col <- wts$w_col  # Vector of weights for column graph
E_row <- wts$E_row  # Edge-incidence matrix for row graph
E_col <- wts$E_col  # Edge-incidence matrix for column graph

# Initialize path parameters and structures
(gammaSeq <- 10^seq(0, 3, length.out = 5))

# Generate solution path,
# using an MM algorithm wrapper that also performs parameter selection
#
# (The MM (majorization-minimization) algorithm is an iterative optimization
# method that repeatedly minimizes a simpler surrogate function majorizing
# the convex objective, driving it down to its global minimum)
solution <- cobra_validate(X, E_row, E_col, w_row, w_col, gammaSeq)
# Solutions validation error
round(solution$validation_error, 2)
## [1] 0.31 0.20 0.31 0.31 0.31
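The data-driven choice of \(\gamma\) is then the one minimizing the validation error; with the values printed above (hard-coded here so the snippet stands alone):

```r
validation.error <- c(0.31, 0.20, 0.31, 0.31, 0.31)  # from cobra_validate above
gammaSeq <- 10^seq(0, 3, length.out = 5)

ix <- which.min(validation.error)
gammaSeq[ix]  # the gamma achieving the lowest validation error
```

Here the second grid point wins, so its heatmap (below) is the one to report.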

# One heatmap of the fused solution U per gamma in the grid
for (idx in seq_along(gammaSeq)) {
  heatmap(solution$U[[idx]], col = hmcols, labRow = NA, labCol = NA,
          ColSideColors = cols[ty],
          main = paste0("gamma = ", round(gammaSeq[idx], 2)))
}

Convex Biclustering - summary

  • Key property: the global solution \(\mathbf{U}^*\) (the row- and column-cluster assignment) exists, is unique, and depends continuously on the shrinkage parameter, weights, and data \((\gamma,\mathbf{W},\mathbf{\widetilde{W}},\mathbf{X})\) => implication: STABILITY to perturbations in the data

  • Yields simple & interpretable solutions like cluster heatmap

  • Well-behaved:
    • Unique, global minimizer
    • Stability with respect to initialization, parameters, and data

  • Fast algorithm

  • One tuning parameter \(\gamma\) that controls number and extent of biclusters; data-dependent way of selecting \(\gamma\)

Acknowledgements

  • The presentation topic was inspired by Module 5: Unsupervised Methods for Statistical Machine Learning of the Summer Institute in Statistics for Big Data, which I attended at the University of Washington in Seattle, WA in 2016.

  • Some of the images (the convex clustering solution path) were reproduced from materials publicly available on GitHub (link), with consent given orally during the Module classes by its instructors, Genevera Allen & Yufeng Liu.

Bibliography

  1. Allen, G. I. (2015), Convex Biclustering Techniques for Cancer Subtype Discovery. Slides (link)
  2. Chi, E. C., Allen, G. I., Baraniuk, R. G. (2015), Convex Biclustering
  3. Sørlie, T., Perou, C. M., Tibshirani, R. et al. (2001), Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences 98 10869-10874.
  4. Sørlie, T., Tibshirani, R., Parker, J. et al. (2003). Repeated observation of breast tumor subtypes in independent gene expression data sets. Proceedings of the National Academy of Sciences 100 8418-8423.
  5. Tothill, R. W., Tinker, A. V., George, J. et al. (2008). Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clinical Cancer Research 14 5198-5208.

Bibliography (cont.)

  1. Lee, M., Shen, H., Huang, J. Z. and Marron, J. S. (2010). Biclustering via Sparse Singular Value Decomposition. Biometrics 66 1087-1095.
  2. James, G., Witten, D., Hastie, T., Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R, Springer Science+Business Media, New York.