3. Data Manipulation Techniques

Motivation

  • So far we have been lucky that all our data have been in the same file:
    • This is not usually the case
    • Dataset may be spread over several files
      • Combining them takes longer, and is harder, than many people realise
    • We need to combine before doing an analysis

Combining data from multiple sources: Gene Clustering Example

  • R has powerful functions to combine heterogeneous data sources into a single data set
  • Gene clustering example data:
    • Gene expression values in gene.expression.txt
    • Gene information in gene.description.txt
    • Patient information in cancer.patients.txt
  • A breast cancer dataset with numerous patient characteristics:
    • We will concentrate on ER status (positive / negative)
    • Which genes show a statistically significant difference in expression between ER groups?

Analysis goals

  • We will show how to look up a particular gene in the dataset
  • Also, how to look up genes in a given genomic region
  • Perform a “sanity-check” to see if a previously-known gene exhibits a difference in our dataset
  • How many genes on chromosome 8 are differentially-expressed?
  • Create a heatmap to cluster the samples and reveal any subgroups in the data
    • do the subgroups agree with our prior knowledge about the samples?

Peek at the data

  • gene.expression.txt is a tab-delimited file, so we can use read.delim to import it
  • here the head function is used as a convenient way to see the first six rows of the resulting data frame
normalizedValues <- read.delim("gene.expression.txt")
head(normalizedValues)
  • 498 rows and 337 columns
  • One row for each gene:
    • Rows are named according to the particular technology used to make the measurement
    • The names of each row can be returned by rownames(normalizedValues); giving a vector
  • One column for each patient:
    • The names of each column can be returned by colnames(normalizedValues); giving a vector
geneAnnotation <- read.delim("gene.description.txt",stringsAsFactors = FALSE)
head(geneAnnotation)
  • 498 rows and 4 columns
  • One row for each gene
  • Includes mapping between manufacturer ID and Gene name
patientMetadata <- read.delim("cancer.patients.txt",stringsAsFactors = FALSE)
head(patientMetadata)
  • One row for each patient in the study
  • Each column is a different characteristic of that patient
    • e.g. whether a patient is ER positive (value of 1) or negative (value of 0)
table(patientMetadata$er)

  0   1 
 88 249 
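
Before combining the three tables it can help to confirm that their dimensions agree with each other. The following is a minimal sketch using small invented data frames; toyExpression, toyAnnotation and toyPatients are hypothetical stand-ins, not the course files.

```r
# Toy stand-ins for the three course tables (all values are invented)
toyExpression <- data.frame(P1 = c(0.1, -0.5, 1.2),
                            P2 = c(0.3, -0.2, 0.9),
                            row.names = c("probeA", "probeB", "probeC"))
toyAnnotation <- data.frame(probe  = c("probeA", "probeB", "probeC"),
                            symbol = c("G1", "G2", "G3"),
                            stringsAsFactors = FALSE)
toyPatients <- data.frame(samplename = c("P1", "P2"),
                          er = c(0, 1),
                          stringsAsFactors = FALSE)

# Number of rows (genes) and columns (patients) in the expression table
dim(toyExpression)

# We expect one annotation row per gene and one metadata row per patient
genesAgree    <- nrow(toyExpression) == nrow(toyAnnotation)
patientsAgree <- ncol(toyExpression) == nrow(toyPatients)
genesAgree
patientsAgree
```

If either check returned FALSE, that would be a hint that some genes or patients are missing from one of the files.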

Ordering and sorting

To get a feel for these data, we will look at how we can subset and order them

  • R allows us to do the kinds of filtering, sorting and ordering operations you might be familiar with in Excel
  • For example, if we want to get information about patients that are ER negative
    • these are indicated by an entry of 0 in the er column
patientMetadata$er == 0

We can do the comparison within the square brackets

  • Remember to include a , after the row condition; leaving the column index empty keeps all the columns
  • Best practice to create a new variable and leave the original data frame untouched
erNegPatients <- patientMetadata[patientMetadata$er == 0,]
head(erNegPatients)

or

View(erNegPatients)
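
To make the logical-indexing step concrete, here is a small sketch on an invented data frame (toyPatients and its values are hypothetical). It shows the intermediate logical vector, how to count or locate the TRUE entries, and how two conditions might be combined with &.

```r
# Invented patient table for illustration only
toyPatients <- data.frame(samplename = c("P1", "P2", "P3", "P4"),
                          er  = c(0, 1, 0, 1),
                          age = c(41, 55, 38, 62),
                          stringsAsFactors = FALSE)

# The comparison gives one TRUE/FALSE per row
isErNeg <- toyPatients$er == 0
isErNeg

sum(isErNeg)     # TRUE counts as 1, so this is the number of ER negative patients
which(isErNeg)   # row positions of the ER negative patients

# Use the logical vector as the row index; empty column index keeps all columns
toyErNeg <- toyPatients[isErNeg, ]

# Conditions can be combined with & (and) or | (or)
toyYoungErNeg <- toyPatients[toyPatients$er == 0 & toyPatients$age < 40, ]
toyYoungErNeg
```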

Sorting is supported by the sort() function

  • Given a vector, it will return a sorted version of the same length
sort(erNegPatients$grade)
 [1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[55] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
  • But this is not useful in all cases
    • We have lost the extra information that we have about the patients
  • Instead, we can use order()
  • Given a vector, order() returns the positions (indices) of its elements in sorted order; using these indices to subset the vector gives a sorted version
    • default is smallest –> largest
myvec <- c(90,100,40,30,80,50,60,20,10,70)
myvec
 [1]  90 100  40  30  80  50  60  20  10  70
order(myvec)
 [1]  9  8  4  3  6  7 10  5  1  2
  • i.e. number in position 9 is the smallest, number in position 8 is the second smallest:
myvec[9]
[1] 10
myvec[8]
[1] 20

N.B. order will also work on character vectors

firstName  <- c("Adam", "Eve", "John", "Mary", "Peter", "Paul", "Joanna", "Matthew", "David", "Sally")
order(firstName)
 [1]  1  9  2  7  3  4  8  6  5 10
  • We can use the result of order() to perform a subset of our original vector
  • The result is an ordered vector
myvec.ord <- myvec[order(myvec)]
myvec.ord
 [1]  10  20  30  40  50  60  70  80  90 100
  • Implication: We can use order on a particular column of a data frame, and use the result to sort all the rows
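
As a quick check of the relationship between sort() and order(), the sketch below reuses the myvec values from above. It also contrasts order() with rank(), which is easy to confuse with it: order() says where each sorted value comes from, whereas rank() says where each original value would end up.

```r
myvec <- c(90, 100, 40, 30, 80, 50, 60, 20, 10, 70)

# Subsetting by order() gives the same answer as sort()
sameAsSort <- identical(sort(myvec), myvec[order(myvec)])
sameAsSort

# order(): positions in the original vector, listed smallest value first
order(myvec)

# rank(): for each original element, its place in the sorted vector
rank(myvec)
```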

  • We might want to select the youngest ER negative patients for a follow-up study
  • Here we order the age column and use the result to re-order the rows in the data frame

erNegPatientsByAge <- erNegPatients[order(erNegPatients$age),]
head(erNegPatientsByAge)
  • can change the behaviour of order to be Largest –> Smallest
erNegPatientsByAge <- erNegPatients[order(erNegPatients$age,decreasing = TRUE),]
head(erNegPatientsByAge)
  • we can write the result to a file if we wish
write.table(erNegPatientsByAge, file="erNegativeSubjectsByAge.txt", sep="\t")
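
order() can take more than one vector, in which case later vectors are used to break ties in earlier ones. The sketch below uses an invented table (toyPatients is hypothetical) and then writes the result to a temporary file. The row.names = FALSE and quote = FALSE arguments are optional choices, not something the course script uses; they tend to give a file that is easier to open in other software.

```r
# Invented patient table for illustration only
toyPatients <- data.frame(samplename = c("P1", "P2", "P3", "P4", "P5"),
                          grade = c(3, 2, 3, 1, 2),
                          age   = c(41, 55, 38, 62, 47),
                          stringsAsFactors = FALSE)

# Order by grade first; patients with the same grade are then ordered by age
toyOrdered <- toyPatients[order(toyPatients$grade, toyPatients$age), ]
toyOrdered

# Write to a temporary file without row names or quotation marks
outFile <- tempfile(fileext = ".txt")
write.table(toyOrdered, file = outFile, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Read the file back in to confirm that it round-trips
toyReadBack <- read.delim(outFile, stringsAsFactors = FALSE)
toyReadBack
```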

Exercise: Exercise 7

  • Imagine we want to know information about chromosome 8 genes that have been measured.
  1. Create a new data frame containing information on genes on Chromosome 8
  2. Order the rows in this data frame according to start position, and write the results to a file
### Your Answer Here ###

Alternative:

  • you might find the function subset a bit easier to use
    • no messing around with square brackets
    • no need to remember row and column indices
    • no need for $ operator to access columns
  • more advanced packages like dplyr use a similar approach
    • you’ll find out about this on our intermediate course
chr8Genes <- subset(geneAnnotation, Chromosome=="chr8")
head(chr8Genes)
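
To see how subset relates to the square-bracket approach, the following sketch applies both to an invented annotation table (toyAnnotation and its column values are hypothetical; only the Chromosome column name is taken from the code above). It also shows the select argument of subset, which picks columns by name.

```r
# Invented annotation table for illustration only
toyAnnotation <- data.frame(probe  = c("p1", "p2", "p3", "p4"),
                            symbol = c("G1", "G2", "G3", "G4"),
                            Chromosome = c("chr8", "chr1", "chr8", "chrX"),
                            Start = c(500, 100, 200, 900),
                            stringsAsFactors = FALSE)

# The same rows selected in two ways
viaSubset   <- subset(toyAnnotation, Chromosome == "chr8")
viaBrackets <- toyAnnotation[toyAnnotation$Chromosome == "chr8", ]
all.equal(viaSubset, viaBrackets)

# select keeps only the named columns
chr8Small <- subset(toyAnnotation, Chromosome == "chr8",
                    select = c(symbol, Start))
chr8Small
```

One difference worth knowing: subset drops rows where the condition is NA, whereas square brackets return a row of NAs for them.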

Retrieving data for a particular gene

  • Gene ESR1 is known to be expressed very differently between ER positive and ER negative patients
    • let’s check that this is evident in our dataset
    • if not, something has gone wrong!
  • First step is to locate this gene in our dataset
  • We can use == to do this, but there are some alternatives that are worth knowing about

Character matching in R

  • match() and grep() are often used to find particular matches
    • CAUTION: match will only return the position of the first match!
match("D", LETTERS)
[1] 4
grep("F", rep(LETTERS,2))
[1]  6 32
match("F", rep(LETTERS,2))
[1] 6
  • grep can also do partial matching
    • can also do complex matching using “regular expressions”
month.name
 [1] "January"   "February"  "March"     "April"     "May"       "June"      "July"      "August"    "September"
[10] "October"   "November"  "December" 
grep("ary",month.name)
[1] 1 2
grep("ber",month.name)
[1]  9 10 11 12
  • %in% returns a logical vector indicating whether each element is contained in a given set of values
month.name %in% c("May", "June")
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
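
The differences between these functions are easiest to see side by side on the same vector. The sketch below uses an invented vector of gene symbols (the vector itself is hypothetical, chosen so that one symbol is repeated and one is a partial match).

```r
# Invented vector with a repeated entry ("ESR1") and a near match ("ESR2")
symbols <- c("ESR1", "GATA3", "ESR1", "TP53", "ESR2")

match("ESR1", symbols)              # position of the first exact match only
which(symbols == "ESR1")            # positions of all exact matches
grep("ESR", symbols)                # positions of all partial matches
grep("ESR", symbols, value = TRUE)  # the matching values rather than positions
grep("^ESR1$", symbols)             # a regular expression anchored at both ends

symbols %in% c("GATA3", "TP53")     # one TRUE/FALSE per element of symbols

# match() accepts several queries at once; NA marks a query that was not found
match(c("TP53", "BRCA1"), symbols)
```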

Retrieving data for a particular gene

  • Find the name of the ID that corresponds to gene ESR1 using match
    • mapping between IDs and genes is in the geneAnnotation data frame
      • ID in first column, gene name in the second
  • Save this ID as a variable
rowInd <- match("ESR1", geneAnnotation$HUGO.gene.symbol)
geneAnnotation[rowInd,]
myProbe <- geneAnnotation$probe[rowInd]
myProbe
[1] "NM_000125"

Now, find which row in our expression matrix is indexed by this ID

  • recall that the rownames of the expression matrix are the probe IDs
  • save the expression values as a variable
match(myProbe, rownames(normalizedValues))
[1] 384
normalizedValues[match(myProbe, rownames(normalizedValues)), 1:10]
myGeneExpression <- normalizedValues[match(myProbe,rownames(normalizedValues)),]
class(myGeneExpression)
[1] "data.frame"

Relating to patient characteristics

We have expression values and want to visualise them against our categorical data

  • use a boxplot, for example
  • however, we first have to make sure our values are treated as numeric data
  • as we created the subset of a data frame, the result was also a data frame
    • use as.numeric to create a vector that we can plot
    • various as. functions exist to convert between various data types
boxplot(as.numeric(myGeneExpression) ~ patientMetadata$er)
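
The conversion step can be seen in isolation on a small invented table (toyExpression is a hypothetical stand-in for the expression matrix). Selecting one row of a data frame gives a one-row data frame, which as.numeric turns into a plain vector; unlist is an alternative that keeps the column names.

```r
# Invented expression table: two probes measured in three patients
toyExpression <- data.frame(P1 = c(0.1, -0.5),
                            P2 = c(0.3, -0.2),
                            P3 = c(1.1, 0.4),
                            row.names = c("probeA", "probeB"))

# Selecting a single row still gives a data frame
oneGene <- toyExpression["probeB", ]
class(oneGene)

# as.numeric gives an unnamed numeric vector
geneValues <- as.numeric(oneGene)
class(geneValues)
geneValues

# unlist gives a numeric vector that keeps the patient names
namedValues <- unlist(oneGene)
namedValues
```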

  • In this case there is a clear difference, so we probably don’t even need a p-value to convince ourselves of the difference
    • in real life, we would probably test lots of genes and apply some kind of multiple-testing correction
    • e.g. p.adjust (?p.adjust)
t.test(as.numeric(myGeneExpression) ~ patientMetadata$er)

    Welch Two Sample t-test

data:  as.numeric(myGeneExpression) by patientMetadata$er
t = -38.746, df = 205.88, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.246953 -1.126198
sample estimates:
mean in group 0 mean in group 1 
    -1.17388506      0.01269076 
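
As a rough illustration of the multiple-testing point, the sketch below runs a t-test on every row of a simulated matrix and adjusts the resulting p-values with p.adjust. All the data are randomly generated (toyValues and toyEr are hypothetical), one gene is given an artificial shift so that there is something to find, and the Benjamini-Hochberg method is just one of the options listed in ?p.adjust.

```r
set.seed(1)
nGenes    <- 20
nPatients <- 30

# Simulated expression values: genes in rows, patients in columns
toyValues <- matrix(rnorm(nGenes * nPatients), nrow = nGenes,
                    dimnames = list(paste0("gene", 1:nGenes),
                                    paste0("P", 1:nPatients)))

# Simulated ER status: first half negative (0), second half positive (1)
toyEr <- rep(c(0, 1), each = nPatients / 2)

# Give the first gene an artificial difference between the groups
toyValues[1, toyEr == 1] <- toyValues[1, toyEr == 1] + 3

# One t-test per row, keeping only the p-value
rawP <- apply(toyValues, 1, function(x) t.test(x ~ toyEr)$p.value)

# Adjust for the number of tests performed
adjP <- p.adjust(rawP, method = "BH")
head(sort(adjP))
```

Adjusted p-values are never smaller than the raw ones, so fewer genes pass a given threshold after correction.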

Complete script

geneAnnotation    <- read.delim("gene.description.txt",stringsAsFactors = FALSE)
patientMetadata <- read.delim("cancer.patients.txt",stringsAsFactors = FALSE)
normalizedValues    <- read.delim("gene.expression.txt")
rowInd      <- match("ESR1", geneAnnotation$HUGO.gene.symbol)
myProbe    <- geneAnnotation$probe[rowInd]
myGeneExpression <- normalizedValues[match(myProbe,rownames(normalizedValues)),]
boxplot(as.numeric(myGeneExpression) ~ patientMetadata$er)

t.test(as.numeric(myGeneExpression) ~ patientMetadata$er)

    Welch Two Sample t-test

data:  as.numeric(myGeneExpression) by patientMetadata$er
t = -38.746, df = 205.88, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.246953 -1.126198
sample estimates:
mean in group 0 mean in group 1 
    -1.17388506      0.01269076 

Exercise: Exercise 8

Repeat the same steps we performed for the gene ESR1, but for GATA3:

  • Try and make as few changes as possible from the ESR1 script
  • Can you see why making a markdown document is useful for analysis?
### Your Answer Here ###

Extra Discussion

This example has been simplified by the fact that the columns in the expression matrix are in the same order as the patient metadata. This would normally be the case for data obtained from a public repository such as Gene Expression Omnibus.

colnames(normalizedValues)
  [1] "NKI_4"   "NKI_6"   "NKI_7"   "NKI_8"   "NKI_9"   "NKI_11"  "NKI_12"  "NKI_13"  "NKI_14"  "NKI_17" 
 [11] "NKI_23"  "NKI_24"  "NKI_26"  "NKI_27"  "NKI_28"  "NKI_29"  "NKI_30"  "NKI_31"  "NKI_32"  "NKI_34" 
 [21] "NKI_35"  "NKI_36"  "NKI_37"  "NKI_38"  "NKI_39"  "NKI_40"  "NKI_41"  "NKI_42"  "NKI_43"  "NKI_44" 
 [31] "NKI_45"  "NKI_48"  "NKI_51"  "NKI_56"  "NKI_57"  "NKI_58"  "NKI_59"  "NKI_60"  "NKI_61"  "NKI_62" 
 [41] "NKI_69"  "NKI_70"  "NKI_71"  "NKI_72"  "NKI_73"  "NKI_75"  "NKI_76"  "NKI_78"  "NKI_79"  "NKI_80" 
 [51] "NKI_83"  "NKI_84"  "NKI_85"  "NKI_86"  "NKI_88"  "NKI_89"  "NKI_90"  "NKI_91"  "NKI_92"  "NKI_93" 
 [61] "NKI_94"  "NKI_95"  "NKI_96"  "NKI_97"  "NKI_98"  "NKI_99"  "NKI_100" "NKI_102" "NKI_103" "NKI_104"
 [71] "NKI_106" "NKI_107" "NKI_108" "NKI_109" "NKI_110" "NKI_111" "NKI_113" "NKI_114" "NKI_116" "NKI_117"
 [81] "NKI_118" "NKI_119" "NKI_120" "NKI_122" "NKI_123" "NKI_124" "NKI_125" "NKI_126" "NKI_127" "NKI_128"
 [91] "NKI_129" "NKI_130" "NKI_131" "NKI_132" "NKI_133" "NKI_134" "NKI_135" "NKI_136" "NKI_137" "NKI_138"
[101] "NKI_139" "NKI_140" "NKI_141" "NKI_142" "NKI_144" "NKI_145" "NKI_146" "NKI_147" "NKI_148" "NKI_149"
[111] "NKI_150" "NKI_151" "NKI_153" "NKI_154" "NKI_155" "NKI_156" "NKI_157" "NKI_158" "NKI_159" "NKI_160"
[121] "NKI_161" "NKI_162" "NKI_163" "NKI_164" "NKI_165" "NKI_166" "NKI_167" "NKI_169" "NKI_170" "NKI_172"
[131] "NKI_174" "NKI_175" "NKI_176" "NKI_177" "NKI_178" "NKI_179" "NKI_180" "NKI_181" "NKI_182" "NKI_183"
[141] "NKI_184" "NKI_185" "NKI_186" "NKI_187" "NKI_188" "NKI_189" "NKI_190" "NKI_191" "NKI_192" "NKI_193"
[151] "NKI_194" "NKI_195" "NKI_196" "NKI_197" "NKI_198" "NKI_199" "NKI_200" "NKI_201" "NKI_202" "NKI_203"
[161] "NKI_205" "NKI_207" "NKI_208" "NKI_209" "NKI_210" "NKI_212" "NKI_213" "NKI_214" "NKI_215" "NKI_217"
[171] "NKI_218" "NKI_219" "NKI_220" "NKI_221" "NKI_222" "NKI_224" "NKI_226" "NKI_227" "NKI_228" "NKI_229"
[181] "NKI_230" "NKI_231" "NKI_233" "NKI_235" "NKI_236" "NKI_237" "NKI_238" "NKI_239" "NKI_240" "NKI_241"
[191] "NKI_243" "NKI_245" "NKI_246" "NKI_247" "NKI_248" "NKI_249" "NKI_250" "NKI_251" "NKI_252" "NKI_254"
[201] "NKI_256" "NKI_257" "NKI_258" "NKI_259" "NKI_260" "NKI_261" "NKI_263" "NKI_264" "NKI_265" "NKI_266"
[211] "NKI_267" "NKI_268" "NKI_269" "NKI_270" "NKI_271" "NKI_272" "NKI_273" "NKI_274" "NKI_275" "NKI_276"
[221] "NKI_277" "NKI_278" "NKI_280" "NKI_281" "NKI_282" "NKI_283" "NKI_284" "NKI_285" "NKI_286" "NKI_287"
[231] "NKI_288" "NKI_290" "NKI_291" "NKI_292" "NKI_293" "NKI_294" "NKI_295" "NKI_296" "NKI_297" "NKI_298"
[241] "NKI_300" "NKI_301" "NKI_302" "NKI_303" "NKI_304" "NKI_305" "NKI_306" "NKI_307" "NKI_308" "NKI_309"
[251] "NKI_310" "NKI_311" "NKI_312" "NKI_313" "NKI_314" "NKI_315" "NKI_317" "NKI_318" "NKI_319" "NKI_320"
[261] "NKI_321" "NKI_322" "NKI_323" "NKI_324" "NKI_325" "NKI_326" "NKI_327" "NKI_328" "NKI_329" "NKI_330"
[271] "NKI_331" "NKI_332" "NKI_333" "NKI_334" "NKI_335" "NKI_336" "NKI_337" "NKI_338" "NKI_339" "NKI_340"
[281] "NKI_341" "NKI_342" "NKI_343" "NKI_344" "NKI_345" "NKI_346" "NKI_347" "NKI_348" "NKI_349" "NKI_350"
[291] "NKI_351" "NKI_352" "NKI_353" "NKI_354" "NKI_355" "NKI_356" "NKI_357" "NKI_358" "NKI_359" "NKI_360"
[301] "NKI_361" "NKI_362" "NKI_363" "NKI_364" "NKI_365" "NKI_366" "NKI_367" "NKI_368" "NKI_369" "NKI_370"
[311] "NKI_371" "NKI_373" "NKI_374" "NKI_375" "NKI_377" "NKI_378" "NKI_379" "NKI_380" "NKI_381" "NKI_383"
[321] "NKI_385" "NKI_387" "NKI_388" "NKI_389" "NKI_390" "NKI_391" "NKI_392" "NKI_393" "NKI_394" "NKI_395"
[331] "NKI_396" "NKI_397" "NKI_398" "NKI_401" "NKI_402" "NKI_403" "NKI_404"
patientMetadata$samplename
  [1] "NKI_4"   "NKI_6"   "NKI_7"   "NKI_8"   "NKI_9"   "NKI_11"  "NKI_12"  "NKI_13"  "NKI_14"  "NKI_17" 
 [11] "NKI_23"  "NKI_24"  "NKI_26"  "NKI_27"  "NKI_28"  "NKI_29"  "NKI_30"  "NKI_31"  "NKI_32"  "NKI_34" 
 [21] "NKI_35"  "NKI_36"  "NKI_37"  "NKI_38"  "NKI_39"  "NKI_40"  "NKI_41"  "NKI_42"  "NKI_43"  "NKI_44" 
 [31] "NKI_45"  "NKI_48"  "NKI_51"  "NKI_56"  "NKI_57"  "NKI_58"  "NKI_59"  "NKI_60"  "NKI_61"  "NKI_62" 
 [41] "NKI_69"  "NKI_70"  "NKI_71"  "NKI_72"  "NKI_73"  "NKI_75"  "NKI_76"  "NKI_78"  "NKI_79"  "NKI_80" 
 [51] "NKI_83"  "NKI_84"  "NKI_85"  "NKI_86"  "NKI_88"  "NKI_89"  "NKI_90"  "NKI_91"  "NKI_92"  "NKI_93" 
 [61] "NKI_94"  "NKI_95"  "NKI_96"  "NKI_97"  "NKI_98"  "NKI_99"  "NKI_100" "NKI_102" "NKI_103" "NKI_104"
 [71] "NKI_106" "NKI_107" "NKI_108" "NKI_109" "NKI_110" "NKI_111" "NKI_113" "NKI_114" "NKI_116" "NKI_117"
 [81] "NKI_118" "NKI_119" "NKI_120" "NKI_122" "NKI_123" "NKI_124" "NKI_125" "NKI_126" "NKI_127" "NKI_128"
 [91] "NKI_129" "NKI_130" "NKI_131" "NKI_132" "NKI_133" "NKI_134" "NKI_135" "NKI_136" "NKI_137" "NKI_138"
[101] "NKI_139" "NKI_140" "NKI_141" "NKI_142" "NKI_144" "NKI_145" "NKI_146" "NKI_147" "NKI_148" "NKI_149"
[111] "NKI_150" "NKI_151" "NKI_153" "NKI_154" "NKI_155" "NKI_156" "NKI_157" "NKI_158" "NKI_159" "NKI_160"
[121] "NKI_161" "NKI_162" "NKI_163" "NKI_164" "NKI_165" "NKI_166" "NKI_167" "NKI_169" "NKI_170" "NKI_172"
[131] "NKI_174" "NKI_175" "NKI_176" "NKI_177" "NKI_178" "NKI_179" "NKI_180" "NKI_181" "NKI_182" "NKI_183"
[141] "NKI_184" "NKI_185" "NKI_186" "NKI_187" "NKI_188" "NKI_189" "NKI_190" "NKI_191" "NKI_192" "NKI_193"
[151] "NKI_194" "NKI_195" "NKI_196" "NKI_197" "NKI_198" "NKI_199" "NKI_200" "NKI_201" "NKI_202" "NKI_203"
[161] "NKI_205" "NKI_207" "NKI_208" "NKI_209" "NKI_210" "NKI_212" "NKI_213" "NKI_214" "NKI_215" "NKI_217"
[171] "NKI_218" "NKI_219" "NKI_220" "NKI_221" "NKI_222" "NKI_224" "NKI_226" "NKI_227" "NKI_228" "NKI_229"
[181] "NKI_230" "NKI_231" "NKI_233" "NKI_235" "NKI_236" "NKI_237" "NKI_238" "NKI_239" "NKI_240" "NKI_241"
[191] "NKI_243" "NKI_245" "NKI_246" "NKI_247" "NKI_248" "NKI_249" "NKI_250" "NKI_251" "NKI_252" "NKI_254"
[201] "NKI_256" "NKI_257" "NKI_258" "NKI_259" "NKI_260" "NKI_261" "NKI_263" "NKI_264" "NKI_265" "NKI_266"
[211] "NKI_267" "NKI_268" "NKI_269" "NKI_270" "NKI_271" "NKI_272" "NKI_273" "NKI_274" "NKI_275" "NKI_276"
[221] "NKI_277" "NKI_278" "NKI_280" "NKI_281" "NKI_282" "NKI_283" "NKI_284" "NKI_285" "NKI_286" "NKI_287"
[231] "NKI_288" "NKI_290" "NKI_291" "NKI_292" "NKI_293" "NKI_294" "NKI_295" "NKI_296" "NKI_297" "NKI_298"
[241] "NKI_300" "NKI_301" "NKI_302" "NKI_303" "NKI_304" "NKI_305" "NKI_306" "NKI_307" "NKI_308" "NKI_309"
[251] "NKI_310" "NKI_311" "NKI_312" "NKI_313" "NKI_314" "NKI_315" "NKI_317" "NKI_318" "NKI_319" "NKI_320"
[261] "NKI_321" "NKI_322" "NKI_323" "NKI_324" "NKI_325" "NKI_326" "NKI_327" "NKI_328" "NKI_329" "NKI_330"
[271] "NKI_331" "NKI_332" "NKI_333" "NKI_334" "NKI_335" "NKI_336" "NKI_337" "NKI_338" "NKI_339" "NKI_340"
[281] "NKI_341" "NKI_342" "NKI_343" "NKI_344" "NKI_345" "NKI_346" "NKI_347" "NKI_348" "NKI_349" "NKI_350"
[291] "NKI_351" "NKI_352" "NKI_353" "NKI_354" "NKI_355" "NKI_356" "NKI_357" "NKI_358" "NKI_359" "NKI_360"
[301] "NKI_361" "NKI_362" "NKI_363" "NKI_364" "NKI_365" "NKI_366" "NKI_367" "NKI_368" "NKI_369" "NKI_370"
[311] "NKI_371" "NKI_373" "NKI_374" "NKI_375" "NKI_377" "NKI_378" "NKI_379" "NKI_380" "NKI_381" "NKI_383"
[321] "NKI_385" "NKI_387" "NKI_388" "NKI_389" "NKI_390" "NKI_391" "NKI_392" "NKI_393" "NKI_394" "NKI_395"
[331] "NKI_396" "NKI_397" "NKI_398" "NKI_401" "NKI_402" "NKI_403" "NKI_404"

There is a quick shortcut to check that these names are the same using the all function

colnames(normalizedValues) == patientMetadata$samplename
  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [22] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [64] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[106] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[127] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[148] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[169] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[190] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[211] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[232] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[253] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[274] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[295] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[316] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[337] TRUE
all(colnames(normalizedValues) == patientMetadata$samplename)
[1] TRUE
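
A few related checks may be useful alongside all. The sketch below uses three of the sample names shown above in short invented vectors (exprCols, metaNames and shuffled are hypothetical). identical is stricter than all with ==, setequal ignores order, and stopifnot turns a check into an error that halts a script.

```r
# Short invented vectors using sample names of the same form as the data
exprCols  <- c("NKI_4", "NKI_6", "NKI_7")
metaNames <- c("NKI_4", "NKI_6", "NKI_7")

all(exprCols == metaNames)       # element-by-element comparison, then combined
identical(exprCols, metaNames)   # also requires the same length and type

# The same names in a different order
shuffled <- metaNames[c(2, 1, 3)]
all(exprCols == shuffled)        # the order differs, so this is FALSE
setequal(exprCols, shuffled)     # same set of names, ignoring order

# In a script, stop with an error if the names are not aligned
stopifnot(identical(exprCols, metaNames))
```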

Let’s say that our metadata have been re-ordered by ER status and age, and not by patient ID

patientMetadata <- patientMetadata[order(patientMetadata$er,patientMetadata$age),]
patientMetadata
  • If we run the same code as before to produce the boxplot and perform the t-test, we get a very different result.
  • This should make us immediately suspicious, as the ESR1 gene is known to be highly differentially-expressed in the contrast we are making
  • Such sanity checks are an important way to check your code
rowInd      <- match("ESR1", geneAnnotation$HUGO.gene.symbol)
myProbe    <- geneAnnotation$probe[rowInd]
myGeneExpression <- normalizedValues[match(myProbe,rownames(normalizedValues)),]
boxplot(as.numeric(myGeneExpression) ~ patientMetadata$er)

t.test(as.numeric(myGeneExpression) ~ patientMetadata$er)

    Welch Two Sample t-test

data:  as.numeric(myGeneExpression) by patientMetadata$er
t = -1.7848, df = 133.53, p-value = 0.07656
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.29727383  0.01525417
sample estimates:
mean in group 0 mean in group 1 
     -0.3990460      -0.2580361 

If we run the same check as before on the column names and patient IDs, we see that it fails:

all(colnames(normalizedValues) == patientMetadata$samplename)
[1] FALSE

A solution is to use match again. Specifically, we want to know where each column in the expression matrix can be found in the patient metadata. The result is a vector, each item of which is an index for a particular row in the patient metadata

match(colnames(normalizedValues),patientMetadata$samplename)
  [1] 143 260  53  62 244 105  54 144 245 246 145  68 134 206  32 311 327 312 328 329 330 313 331 314 221 324 332
 [28] 333 334  85 106   5 146  90   2 111 196 207 112 100 147 113  63 122 281  82  83 335 325  20  16  80  91   6
 [55]  11  76   4  48   7  87 114 208  88  35   9  92  64 282  39  13  17 115 247  36 297 283  57 336 337 298 231
 [82]  86 162 179 248 116 284 117 163 285 180  58  28 232  93 118  49   8 148 197 149 222  12 123  24 249 233  25
[109] 261  43 101 164  21 135 262 165 209  69 198  94 223 286 119  77 124 181 199 136 166 224 150  65 225  66 250
[136]  33 102  18 182 167  44 168  59 137 151 125 251  97 152 287 210  50 234  22 183 126 184  95  60 263 288 200
[163] 211 153 289  40 264 154  70  42 120 226 169 212  14 201  71 127  29 213   3 185 170 235  72  78  41 138 236
[190]  37 202 252  19 290  34 203 128 265 227 107 266  10 186 171 172 291 129 173  38 187 267  26  27  79 174 188
[217] 292 268 269  23 108 270 253 254 255 271 214 189 155 204  98 272 130 256 228 273 257 229 109 293  99 237 238
[244] 190 139 140 191  45 156 192  51 175 239 141 131 142 193 110 258 205 132 215 157  55 176  30 274 158 275   1
[271] 259  73  46 103  67 240  89  74 216 194  52 217 218  47 241 276 159 294 219  56 160 195 104 242 295 277 121
[298] 220 278 279 177 230 178  75 299 296 280  96  31 300 301 302 315 316  84 317 318 319 320 321 326 322 323 303
[325] 304 305 306 307 308 309  81 310  15 161  61 243 133

The vector we have just generated can then be used to re-order the rows in the patient metadata

patientMetadata <- patientMetadata[match(colnames(normalizedValues),patientMetadata$samplename),]
patientMetadata
all(colnames(normalizedValues) == patientMetadata$samplename)
[1] TRUE
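
The direction of the match call is easy to get the wrong way round, so here is the same re-ordering on a tiny invented example (toyExpression and toyPatients are hypothetical). match(a, b) gives, for each element of a, its position in b. Checking the result for NA values is a sensible precaution, since an NA would mean a column with no corresponding metadata row.

```r
# Invented expression table with patients in the order P1, P2, P3
toyExpression <- data.frame(P1 = c(0.1, -0.5),
                            P2 = c(0.3, -0.2),
                            P3 = c(1.1, 0.4))

# Invented metadata with the patients in a different order
toyPatients <- data.frame(samplename = c("P3", "P1", "P2"),
                          er = c(1, 0, 1),
                          stringsAsFactors = FALSE)

# For each expression column, which metadata row describes it?
idx <- match(colnames(toyExpression), toyPatients$samplename)
idx

# An NA here would mean a column without a matching metadata row
anyNA(idx)

# Re-order the metadata rows and confirm that the names now line up
toyAligned <- toyPatients[idx, ]
all(colnames(toyExpression) == toyAligned$samplename)
```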

And we can now proceed to perform the analysis and get the result we expect

rowInd      <- match("ESR1", geneAnnotation$HUGO.gene.symbol)
myProbe    <- geneAnnotation$probe[rowInd]
myGeneExpression <- normalizedValues[match(myProbe,rownames(normalizedValues)),]
boxplot(as.numeric(myGeneExpression) ~ patientMetadata$er)

t.test(as.numeric(myGeneExpression) ~ patientMetadata$er)

    Welch Two Sample t-test

data:  as.numeric(myGeneExpression) by patientMetadata$er
t = -38.746, df = 205.88, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.246953 -1.126198
sample estimates:
mean in group 0 mean in group 1 
    -1.17388506      0.01269076 
cnMgPSBGQUxTRSkKbm9ybWFsaXplZFZhbHVlcyAgICA8LSByZWFkLmRlbGltKCJnZW5lLmV4cHJlc3Npb24udHh0IikKCnJvd0luZCAgICAgIDwtIG1hdGNoKCJFU1IxIiwgZ2VuZUFubm90YXRpb24kSFVHTy5nZW5lLnN5bWJvbCkKbXlQcm9iZSAgICA8LSBnZW5lQW5ub3RhdGlvbiRwcm9iZVtyb3dJbmRdCm15R2VuZUV4cHJlc3Npb24gPC0gbm9ybWFsaXplZFZhbHVlc1ttYXRjaChteVByb2JlLHJvd25hbWVzKG5vcm1hbGl6ZWRWYWx1ZXMpKSxdCgpib3hwbG90KGFzLm51bWVyaWMobXlHZW5lRXhwcmVzc2lvbikgfiBwYXRpZW50TWV0YWRhdGEkZXIpCnQudGVzdChhcy5udW1lcmljKG15R2VuZUV4cHJlc3Npb24pIH4gcGF0aWVudE1ldGFkYXRhJGVyKQpgYGAKCiMjIEV4ZXJjaXNlOiBFeGVyY2lzZSA4CgpSZXBlYXQgdGhlIHNhbWUgc3RlcHMgd2UgcGVyZm9ybWVkIGZvciB0aGUgZ2VuZSBFU1IxLCBidXQgZm9yIEdBVEEzOgoKLSBUcnkgYW5kIG1ha2UgYXMgZmV3IGNoYW5nZXMgYXMgcG9zc2libGUgZnJvbSB0aGUgRVNSMSBzY3JpcHQKLSBDYW4geW91IHNlZSB3aHkgbWFraW5nIGEgbWFya2Rvd24gZG9jdW1lbnQgaXMgdXNlZnVsIGZvciBhbmFseXNpcz8KCgpgYGB7cn0KCiMjIyBZb3VyIEFuc3dlciBIZXJlICMjIwoKYGBgCgojIyBFeHRyYSBEaXNjdXNzaW9uCgpUaGlzIGV4YW1wbGUgaGFzIGJlZW4gc2ltcGxpZmllZCBieSB0aGUgZmFjdCB0aGF0IHRoZSBjb2x1bW5zIGluIHRoZSBleHByZXNzaW9uIG1hdHJpeCBhcmUgaW4gdGhlIHNhbWUgb3JkZXIgYXMgdGhlIHBhdGllbnQgbWV0YWRhdGEuIFRoaXMgd291bGQgbm9ybWFsbHkgYmUgdGhlIGNhc2UgZm9yIGRhdGEgb2J0YWluZWQgaW4gYSBwdWJsaWMgcmVwb3NpdG9yeSBzdWNoIGFzIEdlbmUgRXhwcmVzc2lvbiBPbW5pYnVzCgpgYGB7cn0KY29sbmFtZXMobm9ybWFsaXplZFZhbHVlcykKcGF0aWVudE1ldGFkYXRhJHNhbXBsZW5hbWUKCmBgYAoKVGhlcmUgaXMgYSBxdWljayBzaG9ydGN1dCB0byBjaGVjayB0aGF0IHRoZXNlIG5hbWVzIGFyZSB0aGUgc2FtZSB1c2luZyB0aGUgYGFsbGAgZnVuY3Rpb24KCmBgYHtyfQpjb2xuYW1lcyhub3JtYWxpemVkVmFsdWVzKSA9PSBwYXRpZW50TWV0YWRhdGEkc2FtcGxlbmFtZQphbGwoY29sbmFtZXMobm9ybWFsaXplZFZhbHVlcykgPT0gcGF0aWVudE1ldGFkYXRhJHNhbXBsZW5hbWUpCmBgYAoKTGV0J3Mgc2F5IHRoYXQgb3VyIG1ldGFkYXRhIGhhdmUgYmVlbiByZS1vcmRlcmVkIGJ5IEVSIHN0YXR1cyBhbmQgYWdlLCBhbmQgbm90IGJ5IHBhdGllbnQgSUQKCgpgYGB7cn0KcGF0aWVudE1ldGFkYXRhIDwtIHBhdGllbnRNZXRhZGF0YVtvcmRlcihwYXRpZW50TWV0YWRhdGEkZXIscGF0aWVudE1ldGFkYXRhJGFnZSksXQpwYXRpZW50TWV0YWRhdGEKYGBgCgotIElmIHdlIHJ1biB0aGUgc2FtZSBjb2RlIGFzIGJlZm9yZSB0byBwcm9kdWNlIHRoZSBib3hwbG90IGFuZCBwZXJmb3JtIHRoZSB0LXRlc3Qgd2Ugd291bGQgZ2V0IGEg
dmVyeSBkaWZmZXJlbnQgcmVzdWx0LgotIFRoaXMgc2hvdWxkIG1ha2UgdXNlIGltbWVkaWF0ZWx5IHN1c3BpY2lvdXMsIGFzIHRoZSBFU1IxIGdlbmUgaXMga25vd24gdG8gYmUgaGlnaGx5IGRpZmZlcmVudGlhbGx5LWV4cHJlc3NlZCBpbiB0aGUgY29udHJhc3Qgd2UgYXJlIG1ha2luZwotIFN1Y2ggc2FuaXR5IGNoZWNrcyBhcmUgaW1wb3J0YW50IHRvIGNoZWNrIHRvIHlvdXIgY29kZQoKYGBge3J9CnJvd0luZCAgICAgIDwtIG1hdGNoKCJFU1IxIiwgZ2VuZUFubm90YXRpb24kSFVHTy5nZW5lLnN5bWJvbCkKbXlQcm9iZSAgICA8LSBnZW5lQW5ub3RhdGlvbiRwcm9iZVtyb3dJbmRdCm15R2VuZUV4cHJlc3Npb24gPC0gbm9ybWFsaXplZFZhbHVlc1ttYXRjaChteVByb2JlLHJvd25hbWVzKG5vcm1hbGl6ZWRWYWx1ZXMpKSxdCgpib3hwbG90KGFzLm51bWVyaWMobXlHZW5lRXhwcmVzc2lvbikgfiBwYXRpZW50TWV0YWRhdGEkZXIpCnQudGVzdChhcy5udW1lcmljKG15R2VuZUV4cHJlc3Npb24pIH4gcGF0aWVudE1ldGFkYXRhJGVyKQpgYGAKCklmIHdlIHJ1biB0aGUgc2FtZSBjaGVjayBhcyBiZWZvcmUgb24gdGhlIGNvbHVtbiBuYW1lcyBhbmQgcGF0aWVudCBJRHMsIHdlIHNlZSB0aGF0IGl0IGZhaWxzOi0KCmBgYHtyfQphbGwoY29sbmFtZXMobm9ybWFsaXplZFZhbHVlcykgPT0gcGF0aWVudE1ldGFkYXRhJHNhbXBsZW5hbWUpCmBgYAoKQSBzb2x1dGlvbiBpcyB0byB1c2UgYG1hdGNoYCBhZ2Fpbi4gU3BlY2lmaWNhbGx5LCB3ZSB3YW50IHRvIGtub3cgd2hlcmUgZWFjaCBjb2x1bW4gaW4gdGhlIGV4cHJlc3Npb24gbWF0cml4IGNhbiBiZSBmb3VuZCBpbiB0aGUgcGF0aWVudCBtZXRhZGF0YS4gVGhlIHJlc3VsdCBpcyBhIHZlY3RvciwgZWFjaCBpdGVtIG9mIHdoaWNoIGlzIGFuIGluZGV4IGZvciBhIHBhcnRpY3VsYXIgcm93IGluIHRoZSBwYXRpZW50IG1ldGFkYXRhCgpgYGB7cn0KbWF0Y2goY29sbmFtZXMobm9ybWFsaXplZFZhbHVlcykscGF0aWVudE1ldGFkYXRhJHNhbXBsZW5hbWUpCgpgYGAKClRoZSB2ZWN0b3Igd2UgaGF2ZSBqdXN0IGdlbmVyYXRlZCBjYW4gdGhlbiBieSB1c2VkIHRvIHJlLW9yZGVyIHRoZSByb3dzIGluIHRoZSBwYXRpZW50IG1ldGFkYXRhCgpgYGB7cn0KcGF0aWVudE1ldGFkYXRhIDwtIHBhdGllbnRNZXRhZGF0YVttYXRjaChjb2xuYW1lcyhub3JtYWxpemVkVmFsdWVzKSxwYXRpZW50TWV0YWRhdGEkc2FtcGxlbmFtZSksXQpwYXRpZW50TWV0YWRhdGEKYWxsKGNvbG5hbWVzKG5vcm1hbGl6ZWRWYWx1ZXMpID09IHBhdGllbnRNZXRhZGF0YSRzYW1wbGVuYW1lKQpgYGAKCkFuZCB3ZSBjYW4gbm93IHByb2NlZWQgdG8gcGVyZm9ybSB0aGUgYW5hbHlzaXMgYW5kIGNhbiB0aGUgcmVzdWx0IHdlIGV4cGVjdAoKYGBge3J9CnJvd0luZCAgICAgIDwtIG1hdGNoKCJFU1IxIiwgZ2VuZUFubm90YXRpb24kSFVHTy5nZW5lLnN5bWJvbCkKbXlQcm9iZSAgICA8LSBnZW5lQW5ub3RhdGlvbiRwcm9iZVtyb3dJ
bmRdCm15R2VuZUV4cHJlc3Npb24gPC0gbm9ybWFsaXplZFZhbHVlc1ttYXRjaChteVByb2JlLHJvd25hbWVzKG5vcm1hbGl6ZWRWYWx1ZXMpKSxdCgpib3hwbG90KGFzLm51bWVyaWMobXlHZW5lRXhwcmVzc2lvbikgfiBwYXRpZW50TWV0YWRhdGEkZXIpCnQudGVzdChhcy5udW1lcmljKG15R2VuZUV4cHJlc3Npb24pIH4gcGF0aWVudE1ldGFkYXRhJGVyKQpgYGAKCg==