R
in USAR
skill brought me to work in USA in Biostatistics DepartmentsR
packagehclust
hclust
library(s4vd)
data(lung200) # lung cancer data set (Bhattacharjee et al. 2001)
lung200.D.mat <- dist(lung200, method = "euclidean")
lung200.hclust <- hclust(t(lung200.D.mat), method = "average")
heatmap(lung200, Rowv = as.dendrogram(lung200.hclust), cexRow = 0.3, cexCol = 0.7)
hclust
- algorithmBegin with n observations and a measure (i.e. Euclidean distance) of all the n=n(n−1)/2 pairwise dissimilarities. Treat each observation as its own cluster.
For i=n,n−1,...,2:
Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters.
Compute the new pairwise inter-cluster dissimilarities among the i−1 remaining clusters.
hclust
- pros and consPros:
Cons:
hclust
- distance choice effect# Function to plot heatmap with hclust dendrogram (method = "average")
plot.lung200.heatmap.1 <- function(row.dist, col.dist, plt.title){
heatmap(lung200, main = plt.title,
cexRow = 0.3, cexCol = 0.7, labRow = FALSE, labCol = FALSE,
Rowv = as.dendrogram(hclust(row.dist, method = "average")),
Colv = as.dendrogram(hclust(col.dist, method = "average")))
}
D.mat.row <- dist(lung200, method = "euclidean")
D.mat.col <- dist(t(lung200), method = "euclidean")
plot.lung200.heatmap.1(D.mat.row, D.mat.col, "Euclidean dist")
D.mat.row <- dist(lung200, method = "maximum")
D.mat.col <- dist(t(lung200), method = "maximum")
plot.lung200.heatmap.1(D.mat.row, D.mat.col, "Maximum dist")
D.mat.row <- as.dist(1 - cor(t(lung200)))
D.mat.col <- as.dist(1 - cor(lung200))
plot.lung200.heatmap.1(D.mat.row, D.mat.col, "Correlation dist")
hclust
- linkage function choice effect# Function to plot heatmap with hclust dendrogram (distance method = "euclidean")
plot.lung200.heatmap.2 <- function(hclust.method, plt.title){
D.mat.row <- dist(lung200, method = "euclidean")
D.mat.col <- dist(t(lung200), method = "euclidean")
heatmap(lung200, main = plt.title,
cexRow = 0.3, cexCol = 0.7, labRow = FALSE, labCol = FALSE,
Rowv = as.dendrogram(hclust(D.mat.row, method = hclust.method)),
Colv = as.dendrogram(hclust(D.mat.col, method = hclust.method)))
}
plot.lung200.heatmap.2("average", "'average' linkage")
plot.lung200.heatmap.2("single", "'single' linkage")
plot.lung200.heatmap.2("complete", "'complete' linkage")
plot.lung200.heatmap.2("centroid", "'centroid' linkage")
hclust
- perturbations in data effectplot.lung200.heatmap.3 <- function(gaussian.sd, plt.title){
n <- dim(lung200)[1] * dim(lung200)[2]
gaussian.err <- matrix(rnorm(n, sd = gaussian.sd),
nrow = dim(lung200)[1], ncol = dim(lung200)[2])
lung200.pert <- lung200 + gaussian.err
D.mat.row <- dist(lung200.pert, method = "euclidean")
D.mat.col <- dist(t(lung200.pert), method = "euclidean")
heatmap(lung200.pert, main = plt.title,
cexRow = 0.3, cexCol = 0.7, labRow = FALSE, labCol = FALSE,
Rowv = as.dendrogram(hclust(D.mat.row, method = "average")),
Colv = as.dendrogram(hclust(D.mat.col, method = "average")))
}
R2 <- 0.8
err.sd <- sqrt(var(as.vector(lung200)) * (1 - R2) / R2)
set.seed(1)
plot.lung200.heatmap.3(0, "No Gaussian pert.")
plot.lung200.heatmap.3(err.sd, "Gaussian pert. 1.")
plot.lung200.heatmap.3(err.sd, "Gaussian pert. 2.")
plot.lung200.heatmap.3(err.sd, "Gaussian pert. 3.")
γ0=0 used, gives as many clusters as points
γk1 used, such as: γk1>γ0
γk2 used, such as: γk2>γk1; 5 clusters
γk3 used, such as: γk3>γk2; 3 clusters
γk4 used, such as: γk4>γk3; 3 clusters
γk5 used, such as: γk5>γk4; 1 cluster
lung200
datalung200
data setThe data consists of the expression levels of 200 genes across 56 individuals, a subset of the data studied in Lee et al. (2010).
Subjects belong to one of four lung cancer subgroups: Normal, Carcinoid, Colon, or Small Cell.
lung200[1:5, 1:5] # lung cancer data set
## Carcinoid Carcinoid Carcinoid Carcinoid Carcinoid
## 40808_at 3.945499 3.798840 4.6410622 4.866513 4.34580553
## 36924_r_at 4.453260 4.287104 4.2031069 4.764685 4.15189982
## 32252_at 1.081860 5.731959 5.7501960 5.797660 -0.05565615
## 37864_s_at -4.963623 -4.730377 -0.6692398 -2.775429 -3.01534195
## 38691_s_at -2.745149 -3.118464 -3.2292698 -2.875352 -2.81904103
# Normalize data
X <- lung200
X <- X - mean(X)
X <- X/norm(X,'f')
# Create annotation for heatmap
types <- colnames(lung200)
ty <- as.numeric(factor(types))
cols <- rainbow(4)
YlGnBu5 <- c('#ffffd9','#c7e9b4','#41b6c4','#225ea8','#081d58')
hmcols <- colorRampPalette(YlGnBu5)(256)
# Construct weights and edge-incidence matrices
library(cvxbiclustr)
wts <- gkn_weights(X) # combines Gaussian kernel weights with k-nearest neighbor weights
w_row <- wts$w_row # Vector of weights for row graph
w_col <- wts$w_col # Vector of weights for column graph
E_row <- wts$E_row # Edge-incidence matrix for row graph
E_col <- wts$E_col # Edge-incidence matrix for column graph
# Initialize path parameters and structures
(gammaSeq <- 10^seq(0, 3, length.out = 5))
# Generate solution path,
# perform an MM algorithm wrapper to do parameter selection
#
# (MM algorithm is an iterative optimization method which exploits the convexity
# of a function in order to find their maxima or minima)
solution <- cobra_validate(X, E_row, E_col, w_row, w_col, gammaSeq)
# Solutions validation error
round(solution$validation_error, 2)
## [1] 0.31 0.20 0.31 0.31 0.31
idx <- 1
heatmap(solution$U[[idx]], col = hmcols, labRow = NA, labCol = NA, ColSideCol = cols[ty],
main = paste0("gamma = ",round(gammaSeq[idx], 2)))
idx <- 2
heatmap(solution$U[[idx]], col = hmcols, labRow = NA, labCol = NA, ColSideCol = cols[ty],
main = paste0("gamma = ",round(gammaSeq[idx], 2)))
idx <- 3
heatmap(solution$U[[idx]], col = hmcols, labRow = NA, labCol = NA, ColSideCol = cols[ty],
main = paste0("gamma = ",round(gammaSeq[idx], 2)))
idx <- 4
heatmap(solution$U[[idx]], col = hmcols, labRow = NA, labCol = NA, ColSideCol = cols[ty],
main = paste0("gamma = ",round(gammaSeq[idx], 2)))
idx <- 5
heatmap(solution$U[[idx]], col = hmcols, labRow = NA, labCol = NA, ColSideCol = cols[ty],
main = paste0("gamma = ",round(gammaSeq[idx], 2)))
Presentation topic was inspired by Module 5: Unsupervised Methods for Statistical Machine Learning of Summer Institute in Statistics for Big Data I attended at University of Washington in Seattle, WA in 2016.
A part of the images (convex clustering solution path) was copy-pasted from materials publicly available on GitHub (link), with the consent offered orally during the Module classes by its instructors, Genevera Allen & Yufeng Liu.