`R` in USA

- `R` skill brought me to work in the USA in Biostatistics Departments
- The #1 tool used in academic biostatistics
- Reproducibility & usability awareness: a methodology paper is rarely published without a corresponding `R` package

- We perform clustering when we want to group or segment a data set into subsets so that objects within each subset are more closely related to one another than to objects assigned to other subsets.

- A cancer may present clinically as a homogeneous disease, but it typically consists of several distinct subtypes at the molecular level

- One goal of cancer research is to identify such subtypes: groups of patients that share distinct genomic signatures and differing clinical outcomes

- It is the first step towards developing personalized treatment strategies (example: see MD Anderson Cancer Center (Houston, TX) Personalized Cancer Therapy webpage based on Molecular Profiling)

- We look for subtypes of cancerous tumors that have similar molecular profiles, and for the genes that characterize each of them
- This subtype discovery problem can be posed as a biclustering problem on a gene expression data matrix, where the data are partitioned into a checkerboard-like pattern
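
To make the checkerboard picture concrete, here is a minimal sketch on simulated data (not real expression data). The group sizes, block means and noise level are arbitrary assumptions chosen for illustration: each (gene group, sample group) block gets its own mean, and the rows and columns are then shuffled to hide the structure.

```r
set.seed(42)
n.genes <- 60
n.samples <- 40
# Hypothetical latent groups: 3 gene groups x 2 sample groups
gene.grp <- rep(1:3, each = 20)
sample.grp <- rep(1:2, each = 20)
# Each (gene group, sample group) block has its own mean level (assumed values)
block.means <- matrix(c(-2, 0, 2,
                         2, 0, -2), nrow = 3, ncol = 2)
mu <- block.means[gene.grp, sample.grp]   # 60 x 40 matrix of block means
X <- mu + matrix(rnorm(n.genes * n.samples, sd = 0.5), n.genes, n.samples)
# Shuffle rows and columns so the blocks are no longer visible by eye
X.shuffled <- X[sample(n.genes), sample(n.samples)]
# heatmap() reorders rows and columns by hierarchical clustering
heatmap(X.shuffled, labRow = NA, labCol = NA)
```

If the row and column clusterings recover the hidden groups, the reordered heatmap should show the blocks again; in real data the groups are unknown and the blocks are far noisier.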

- Biclustering breast cancer data has identified sets of genes whose expression levels segregated patients into five subtypes with distinct survival outcomes (Sørlie et al., 2001)
- These subtypes have been reproduced in numerous other studies (Sørlie et al., 2003)
- … but some other subtype discoveries were not (e.g. in ovarian cancer (Tothill et al., 2008))

- The failure to reproduce these other results may reflect an absence of biologically meaningful groupings
- … but another possibility may be related to limitations in the computational methods currently used to identify biclusters

**Goal**: Discover clinically meaningful & reproducible cancer subtypes

`hclust`

```
library(s4vd)
data(lung200) # lung cancer data set (Bhattacharjee et al. 2001)
# Pairwise Euclidean distances between the rows of lung200
lung200.D.mat <- dist(lung200, method = "euclidean")
# hclust() expects the dist object itself (no transpose)
lung200.hclust <- hclust(lung200.D.mat, method = "average")
heatmap(lung200, Rowv = as.dendrogram(lung200.hclust), cexRow = 0.3, cexCol = 0.7)
```

`hclust` - algorithm

Begin with \(n\) observations and a measure (e.g. Euclidean distance) of all the \(\binom{n}{2} = n(n-1)/2\) pairwise dissimilarities. Treat each observation as its own cluster.

For \(i = n, n-1, \ldots, 2\):

Examine all pairwise inter-cluster dissimilarities among the \(i\) clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters.

Compute the new pairwise inter-cluster dissimilarities among the \(i-1\) remaining clusters.
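
The steps above can be sketched in a few lines of `R`. This is a naive illustration under stated assumptions, not how `hclust` is implemented: `naive_agglomerate` is a hypothetical helper, it assumes Euclidean distance and average linkage, and it returns only the sequence of fusion heights.

```r
# Naive agglomerative clustering with average linkage (illustration only).
# `naive_agglomerate` is a hypothetical helper, not part of any package.
naive_agglomerate <- function(x) {
  d <- as.matrix(dist(x, method = "euclidean"))  # all pairwise dissimilarities
  clusters <- as.list(seq_len(nrow(x)))          # each observation is its own cluster
  heights <- numeric(0)
  while (length(clusters) > 1) {
    best <- c(1, 2)
    best.d <- Inf
    # Examine all pairwise inter-cluster dissimilarities
    for (a in seq_len(length(clusters) - 1)) {
      for (b in (a + 1):length(clusters)) {
        # Average linkage: mean distance over all cross-cluster pairs
        d.ab <- mean(d[clusters[[a]], clusters[[b]]])
        if (d.ab < best.d) {
          best.d <- d.ab
          best <- c(a, b)
        }
      }
    }
    # Fuse the least dissimilar pair of clusters
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
    heights <- c(heights, best.d)
  }
  heights
}

set.seed(1)
x <- matrix(rnorm(20), nrow = 10)   # 10 simulated observations, 2 variables
naive.heights <- naive_agglomerate(x)
hclust.heights <- hclust(dist(x), method = "average")$height
```

With no ties among the distances, `naive.heights` should agree with `hclust.heights` up to floating-point error. The sketch recomputes every inter-cluster dissimilarity in each pass, so it is much slower than `hclust` and only suitable for small examples.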

`hclust` - pros and cons

Pros:

- Easy to interpret
- Fast computation

Cons:

- Lack of global objective function
- Instability (subject to perturbations in data, distance choice, linkage type choice)
- How to choose number of biclusters?
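
A quick way to see the instability is to cluster the same distances with different linkage types and cross-tabulate the resulting assignments. This is a minimal sketch on simulated data; the data dimensions and the choice of `k = 3` are arbitrary assumptions.

```r
set.seed(7)
x <- matrix(rnorm(100 * 5), nrow = 100)   # 100 simulated observations, 5 variables
d <- dist(x, method = "euclidean")

# Same distances, three linkage types, each tree cut into k = 3 clusters
cl.average  <- cutree(hclust(d, method = "average"),  k = 3)
cl.complete <- cutree(hclust(d, method = "complete"), k = 3)
cl.single   <- cutree(hclust(d, method = "single"),   k = 3)

# Cross-tabulate assignments: off-diagonal mass means the linkages disagree
table(average = cl.average, complete = cl.complete)
table(average = cl.average, single = cl.single)
```

Cluster labels are arbitrary, so agreement shows up as each row of the table having most of its counts in a single column, not necessarily on the diagonal.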

`hclust` - distance choice effect

```
# Function to plot heatmap with hclust dendrogram (method = "average")
plot.lung200.heatmap.1 <- function(row.dist, col.dist, plt.title){
  heatmap(lung200, main = plt.title,
          cexRow = 0.3, cexCol = 0.7, labRow = FALSE, labCol = FALSE,
          Rowv = as.dendrogram(hclust(row.dist, method = "average")),
          Colv = as.dendrogram(hclust(col.dist, method = "average")))
}

# Euclidean distance
D.mat.row <- dist(lung200, method = "euclidean")
D.mat.col <- dist(t(lung200), method = "euclidean")
plot.lung200.heatmap.1(D.mat.row, D.mat.col, "Euclidean dist")

# Maximum distance
D.mat.row <- dist(lung200, method = "maximum")
D.mat.col <- dist(t(lung200), method = "maximum")
plot.lung200.heatmap.1(D.mat.row, D.mat.col, "Maximum dist")

# Correlation-based distance (1 - Pearson correlation)
D.mat.row <- as.dist(1 - cor(t(lung200)))
D.mat.col <- as.dist(1 - cor(lung200))
plot.lung200.heatmap.1(D.mat.row, D.mat.col, "Correlation dist")
```
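
Beyond comparing heatmaps by eye, the effect of the distance choice can be inspected numerically. The sketch below uses a simulated matrix in place of `lung200` so that it runs without `s4vd`; the matrix size and `k = 4` are arbitrary assumptions.

```r
set.seed(123)
X <- matrix(rnorm(50 * 12), nrow = 50, ncol = 12)   # simulated stand-in for an expression matrix

# The same three row distances as in the heatmaps above
row.dists <- list(
  euclidean   = dist(X, method = "euclidean"),
  maximum     = dist(X, method = "maximum"),
  correlation = as.dist(1 - cor(t(X)))
)

# Average-linkage tree for each distance, cut into k = 4 row clusters
row.clusters <- sapply(row.dists, function(d) {
  cutree(hclust(d, method = "average"), k = 4)
})

# Cross-tabulate cluster assignments under two of the distances
table(euclidean   = row.clusters[, "euclidean"],
      correlation = row.clusters[, "correlation"])
```

Each column of `row.clusters` holds the row assignments under one distance, so any pair of columns can be cross-tabulated the same way.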