Mark Dunning (Acknowledgements: Oscar Rueda, Matt Ritchie, Christina Curtis)
Last modified: 23 Mar 2016
When clustering genes, it is common to pre-process;
Common Similarity / Dissimilarity measures include
The dist
function can be used to calculate a variety of similarity measures
?dist
.myMatrix <- matrix(rnorm(1000),ncol=10)
colnames(myMatrix) <- LETTERS[1:10]
head(myMatrix)
## A B C D E F
## [1,] 0.4313777 -2.4670200 -0.7073725 0.3194366 0.8602843 -0.7072524
## [2,] -0.1119854 0.2155933 1.3840235 -2.1342261 0.4873031 -1.4214807
## [3,] -0.8989685 0.4339697 1.5643489 -0.1795468 0.1206374 0.8601242
## [4,] 0.6369946 -0.9573872 0.1323738 -0.3081852 1.0382901 -0.8216086
## [5,] 0.1357330 0.2440095 1.1538843 -1.0067510 -1.3158878 -0.2761261
## [6,] 0.6717109 -1.4595497 -0.2515446 -1.5297978 1.1495169 1.6037638
## G H I J
## [1,] -0.6794123 -1.6310056 0.4481794 -1.2323060
## [2,] 0.8233594 1.1139286 -0.8939646 1.2021941
## [3,] 0.3049781 -0.1753819 1.5622296 -0.5362093
## [4,] -1.9204444 -1.5448343 0.4831130 -0.9426206
## [5,] -0.5078418 -2.5823836 0.4086202 -2.2609447
## [6,] -0.6668203 1.2564544 2.1929750 -0.7531744
d <- dist(myMatrix)
d
t
d <- dist(t(myMatrix))
d
## A B C D E F G H
## B 14.08415
## C 14.88023 15.01007
## D 15.07597 13.81700 13.92782
## E 15.30797 15.18065 12.91680 14.36210
## F 13.92214 15.01753 14.35741 12.90844 14.22044
## G 16.78331 15.87847 14.93457 14.26315 13.65642 14.24959
## H 15.18897 14.91864 13.59910 13.75474 13.14450 13.81787 14.26011
## I 15.21106 15.15853 13.90017 13.34724 14.63817 12.75963 16.45115 13.91325
## J 16.24253 14.95350 14.28564 14.37443 14.50068 15.69133 14.79427 14.30121
## I
## B
## C
## D
## E
## F
## G
## H
## I
## J 15.45298
method
argumentd.man <- dist(t(myMatrix),method="manhattan")
d.man
## A B C D E F G H
## B 115.2690
## C 122.7021 121.9089
## D 122.9448 113.7223 109.2488
## E 121.8248 116.7150 104.0412 117.9643
## F 111.4721 121.4489 118.7915 105.4497 116.4869
## G 133.4619 132.4103 120.1602 110.9659 108.6304 109.9150
## H 122.5602 114.8863 110.4001 106.7878 105.2342 114.3228 114.1843
## I 122.4814 121.2553 114.1197 101.7003 117.7069 101.9944 131.5409 111.6427
## J 131.3290 123.4407 115.3894 114.4036 122.6288 122.0493 122.2218 117.2269
## I
## B
## C
## D
## E
## F
## G
## H
## I
## J 125.1519
cor
function (?cor
)dist
?cor(myMatrix)
## A B C D E
## A 1.00000000 0.1873257371 -0.023000453 -0.08803428 -0.13080526
## B 0.18732574 1.0000000000 -0.011511807 0.10968627 -0.08226229
## C -0.02300045 -0.0115118072 1.000000000 -0.03532202 0.10502192
## D -0.08803428 0.1096862726 -0.035322018 1.00000000 -0.16237803
## E -0.13080526 -0.0822622909 0.105021917 -0.16237803 1.00000000
## F 0.10557545 0.0004291339 -0.040904036 0.13286359 -0.06779513
## G -0.18307097 -0.0408289615 -0.033375088 0.01087487 0.09412331
## H -0.12118891 -0.0463548665 0.004657798 -0.06479482 0.02138155
## I -0.01338224 0.0169886817 0.066632105 0.10070537 -0.08673286
## J -0.14899699 0.0576154555 0.028275765 -0.02098191 -0.04670010
## F G H I J
## A 0.1055754466 -0.18307097 -0.121188905 -0.01338224 -0.14899699
## B 0.0004291339 -0.04082896 -0.046354867 0.01698868 0.05761546
## C -0.0409040361 -0.03337509 0.004657798 0.06663211 0.02827576
## D 0.1328635922 0.01087487 -0.064794815 0.10070537 -0.02098191
## E -0.0677951324 0.09412331 0.021381549 -0.08673286 -0.04670010
## F 1.0000000000 0.08674360 -0.018787424 0.22943055 -0.16930951
## G 0.0867435951 1.00000000 0.015941398 -0.19833124 0.05636269
## H -0.0187874244 0.01594140 1.000000000 0.01838101 -0.02447738
## I 0.2294305467 -0.19833124 0.018381014 1.00000000 -0.07764364
## J -0.1693095113 0.05636269 -0.024477377 -0.07764364 1.00000000
as.dist
corMat <- as.dist(cor(myMatrix))
corMat
## A B C D E
## B 0.1873257371
## C -0.0230004526 -0.0115118072
## D -0.0880342757 0.1096862726 -0.0353220181
## E -0.1308052649 -0.0822622909 0.1050219171 -0.1623780286
## F 0.1055754466 0.0004291339 -0.0409040361 0.1328635922 -0.0677951324
## G -0.1830709748 -0.0408289615 -0.0333750883 0.0108748704 0.0941233070
## H -0.1211889054 -0.0463548665 0.0046577978 -0.0647948150 0.0213815490
## I -0.0133822445 0.0169886817 0.0666321053 0.1007053715 -0.0867328634
## J -0.1489969940 0.0576154555 0.0282757647 -0.0209819071 -0.0467000977
## F G H I
## B
## C
## D
## E
## F
## G 0.0867435951
## H -0.0187874244 0.0159413977
## I 0.2294305467 -0.1983312437 0.0183810136
## J -0.1693095113 0.0563626950 -0.0244773771 -0.0776436408
corMat <- as.dist(1-abs(cor(myMatrix)))
corMat
## A B C D E F G
## B 0.8126743
## C 0.9769995 0.9884882
## D 0.9119657 0.8903137 0.9646780
## E 0.8691947 0.9177377 0.8949781 0.8376220
## F 0.8944246 0.9995709 0.9590960 0.8671364 0.9322049
## G 0.8169290 0.9591710 0.9666249 0.9891251 0.9058767 0.9132564
## H 0.8788111 0.9536451 0.9953422 0.9352052 0.9786185 0.9812126 0.9840586
## I 0.9866178 0.9830113 0.9333679 0.8992946 0.9132671 0.7705695 0.8016688
## J 0.8510030 0.9423845 0.9717242 0.9790181 0.9532999 0.8306905 0.9436373
## H I
## B
## C
## D
## E
## F
## G
## H
## I 0.9816190
## J 0.9755226 0.9223564
## gene1 gene2
## gene2 6.3245553
## gene3 0.6057322 6.0805548
## gene1 gene2 gene3
## gene1 1.0000000 1.0000000 0.9797137
## gene2 1.0000000 1.0000000 0.9797137
## gene3 0.9797137 0.9797137 1.0000000
Clustering Algorithms
hclust
(?hclust
)clust <- hclust(d)
clust
##
## Call:
## hclust(d = d)
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 10
plot(clust)
clust.ward <- hclust(d,method = "ward.D")
par(mfrow=c(1,2))
plot(clust)
plot(clust.ward)
cutree
and rect.hclust
par(mfrow=c(1,1))
plot(clust)
cutree(clust, h=16)
## A B C D E F G H I J
## 1 1 2 1 2 1 2 2 1 2
rect.hclust(clust, h=16)
plot(clust)
cutree(clust, k=2)
## A B C D E F G H I J
## 1 1 2 1 2 1 2 2 1 2
rect.hclust(clust, k=2)
silhouette
function for the gene expression example earlierIn genomic data, we have a large number of variables which are often highly correlated
PCA is one of many dimension-reduction techniques that can remove redundancy and give a smaller more mangeable set of variables. In summary it can help us to:-
Compress the data
The data: 86 malt whiskies scored for 12 different taste categories
(https://www.mathstat.strath.ac.uk/outreach/nessie/nessie_whisky.html)
## Body Sweetness Smoky Medicinal Tobacco Honey Spicy Winey Nutty
## Aberfeldy 2 2 2 0 0 2 1 2 2
## Aberlour 3 3 1 0 0 4 3 2 2
## AnCnoc 1 3 2 0 0 2 0 0 2
## Malty Fruity Floral
## Aberfeldy 2 2 2
## Aberlour 3 3 2
## AnCnoc 2 3 2
So, PCA can be used as a quality assessment tool and can inform us of any factors we need to account for in the analysis
A two-dimensional case
Getting the data…
whisky <- read.csv("https://www.mathstat.strath.ac.uk/outreach/nessie/datasets/whiskies.txt")
scores <- whisky[,c(3:14)]
rownames(scores) <- whisky$Distillery
head(scores,n=3)
## Body Sweetness Smoky Medicinal Tobacco Honey Spicy Winey Nutty
## Aberfeldy 2 2 2 0 0 2 1 2 2
## Aberlour 3 3 1 0 0 4 3 2 2
## AnCnoc 1 3 2 0 0 2 0 0 2
## Malty Fruity Floral
## Aberfeldy 2 2 2
## Aberlour 3 3 2
## AnCnoc 2 3 2
The prcomp
function does the hard-work for us
pca <- prcomp(scores)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.5358 1.2270 0.8654 0.8039 0.75261 0.68513 0.63256
## Proportion of Variance 0.3011 0.1922 0.0956 0.0825 0.07231 0.05992 0.05108
## Cumulative Proportion 0.3011 0.4933 0.5889 0.6714 0.74370 0.80363 0.85471
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.59943 0.52347 0.50049 0.42422 0.27266
## Proportion of Variance 0.04587 0.03498 0.03198 0.02297 0.00949
## Cumulative Proportion 0.90058 0.93556 0.96754 0.99051 1.00000
pca$rotation[,1:5]
## PC1 PC2 PC3 PC4 PC5
## Body 0.36119005 -0.49130643 0.0301178096 -0.07460952 0.227019231
## Sweetness -0.20298238 -0.04659634 -0.2638792230 -0.37063897 0.009220720
## Smoky 0.47794419 -0.06874216 0.2188100647 0.08852143 -0.201614612
## Medicinal 0.57527678 0.16079484 0.0431598432 0.08237288 -0.033120119
## Tobacco 0.09173306 0.02004776 -0.0006685359 -0.03337611 -0.008967789
## Honey -0.22090804 -0.41799491 0.1102471134 0.03315990 -0.596888644
## Spicy 0.05811101 -0.17548310 0.6992443906 -0.17163088 -0.133792990
## Winey -0.03745608 -0.63964979 -0.2331295871 -0.22573776 0.110928262
## Nutty -0.04766410 -0.26036122 -0.1785529039 0.85059012 0.025288924
## Malty -0.12781608 -0.10296202 0.1084169899 0.07177790 -0.105401421
## Fruity -0.20235755 -0.12374977 0.4034627791 0.09457148 0.702770800
## Floral -0.38394443 0.13074914 0.3433008365 0.14903423 -0.120141367
We look at the variance explained by each component and judge where it “drops-off”
plot(pca)
The new co-ordinate system is given by pca$x
pca$x[1:3,1:5]
## PC1 PC2 PC3 PC4 PC5
## Aberfeldy -0.5033841 -1.1220223 -0.1612002 0.5058255 -0.2841501
## Aberlour -1.4788883 -3.0048507 1.5170911 -0.1385370 -0.7102894
## AnCnoc -1.2531129 0.6537207 -0.2847196 0.9274739 0.1127587
ggplot2
is especially convenient for visualising the results
df <- data.frame(whisky,pca$x)
ggplot(df, aes(x=PC1,y=PC2,label=Distillery,cex=Smoky,col=as.factor(Floral))) + geom_point() + geom_text(cex=5,alpha=0.3)
The question we are asking here is;
“Are the number of DE genes associated with Theme X significantly greater than what we might expect by chance alone?”
Where Theme X could be genes belonging to a particular GO (Gene Onotology) term.
Let’s imagine that we have a bag full of balls. Each balls represents a gene in the gene universe. - Paint the balls representing our selected list grey, and paint the rest red.
In this small example, we can define;
Select a number (say, 12) of the balls at random without seeing into the bag and look at what we get
We have picked, at random, 8 grey balls.
rnorm
, rbinom
, rhyper
etc….#see ?rhyper for argument definition
trials <- rhyper(10000,40,10,12)
hist(trials)
table(trials)
## trials
## 5 6 7 8 9 10 11 12
## 7 58 404 1333 2639 3126 1934 499
table(trials)/10000
## trials
## 5 6 7 8 9 10 11 12
## 0.0007 0.0058 0.0404 0.1333 0.2639 0.3126 0.1934 0.0499
cumsum(table(trials)/10000)
## 5 6 7 8 9 10 11 12
## 0.0007 0.0065 0.0469 0.1802 0.4441 0.7567 0.9501 1.0000
1-cumsum(table(trials)/10000)
## 5 6 7 8 9 10 11 12
## 0.9993 0.9935 0.9531 0.8198 0.5559 0.2433 0.0499 0.0000
Back to our example, the distribution of balls can be expressed as a contingency table, on which we can use a Fisher’s exact test
Total grey balls: 10 Total in subset: 12
Selection_in | Selection_out | |
---|---|---|
Grey_in | 8 | 2 |
Grey_out | 4 | 26 |
Selection_in | Selection_out | RowTotal | |
---|---|---|---|
Grey_in | a | b | a +b |
Grey_out | c | d | c+d |
Column Total | a+c | b+d | a+b+c+d (=n) |
The formula for Fishers exact test is;
p = \frac{\binom{a + b}{a}\binom{c +d}{c}}{\binom{n}{a +c}} = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!n!}
or less formally;
P = (ways of choosing grey balls) X (ways of non-grey balls amongst subset) / ways of choosing subset
goseq
in the practical
For these tests, we don’t have to specify a statistical threshold and use the test statistics from all genes as the input to the test. The popular Gene Set Enrichment Analysis (GSEA) uses this approach. These tests can be used to detect differential expression for a group of genes, even when the effects are too small or there is too little data to detect the genes individually.
Subramanian et al, PNAS 2005
The required input text contains a set of gene identifiers that are ordered according to some test statistic. To produce such a file, we first have to order the statistics from our test (e.g. with the order
function)
load("Day3/edgeRAnalysis_ALLGENES.RData")
y <- y[order(y$LR,decreasing = TRUE),]
write.table(y, file="genetrail-gsea.txt",sep="\t",row.names = FALSE,quote=FALSE)
geneSetTest
function in limma
,e.g. pick 50 genes at random.
library(limma)
load("Day3/edgeRAnalysis_ALLGENES.RData")
stats <- y$logFC
myGenes <- sample(1:nrow(y),50)
geneSetTest(index = myGenes, statistics = stats)
## [1] 0.7891464
barcodeplot(stats, index=myGenes)
On the other hand, we could deliberately pick genes with a significant p-value and notice how the results are altered.
myGenes <- which(y$FDR < 0.01)[1:50]
geneSetTest (index = myGenes , statistics = stats)
## [1] 1.30939e-33
geneSetTest (index = myGenes , statistics = stats ,
alternative = "down" )
## [1] 0.069947
geneSetTest (index = myGenes , statistics = stats ,
alternative = "up" )
## [1] 0.9300574
barcodeplot (statistics = stats , index = myGenes )
Space, Right Arrow or swipe left to move to next slide, click help below for more details