We can use survival
package to accomplish this.
Surv
function to get the data in the right format.survfit
function to compute simple Kaplan-Meier estimator.survdiff
function to test if difference between 2 survival functions is significant.# install.packages("OIsurv")
# install.packages("survsim")
# install.packages("broom")
library(OIsurv)
## Loading required package: survival
## Loading required package: KMsurv
# library(help=KMsurv)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(broom)
library(ggplot2)
library(survsim)
## Loading required package: eha
## Loading required package: statmod
# Real dataset (From KMSurv package)
# help(tongue)
data(tongue)
head(tongue)
## type time delta
## 1 1 1 1
## 2 1 3 1
## 3 1 3 1
## 4 1 4 1
## 5 1 10 1
## 6 1 13 1
summary(tongue)
## type time delta
## Min. :1.00 Min. : 1.00 Min. :0.0000
## 1st Qu.:1.00 1st Qu.: 23.75 1st Qu.:0.0000
## Median :1.00 Median : 69.50 Median :1.0000
## Mean :1.35 Mean : 73.83 Mean :0.6625
## 3rd Qu.:2.00 3rd Qu.:101.75 3rd Qu.:1.0000
## Max. :2.00 Max. :400.00 Max. :1.0000
This dataset is about 80 male subjects who were diagnosed with one of the 2 types of cancer of the tongue. Time tp death is expressed in weeks. More info on this can be found from the package documentation for this dataset using help(tongue)
.
# Analyzing just one type of tumor (Aneuploid Tumor)
tongue2 = tongue %>%
filter(type == 1)
# Converting into a Surv object
tongue2_surv <- Surv(tongue2$time, tongue2$delta)
# Getting KM estimator
tongue2_survfit = survfit(tongue2_surv ~ 1)
plot(tongue2_survfit)
+
signs represent censoring i.e. subjects for whom we don’t have any readings available after that week. The only thing we know is the subject didn’t die until that point in time.glance(tongue2_survfit)
## records n.max n.start events median conf.low conf.high
## 1 52 52 52 31 93 67 NA
The median value of 93 implies, 50% of the subjects died within 93 weeks of diagnosis.
# Graphically Comparing KM estimator for 2 tumors
tongue_survfit = survfit(Surv(time = time, event = delta) ~ type, data = tongue)
plot(tongue_survfit, lty = 2:3, xlab = "weeks", ylab = "Proporation Survival")
legend(100, .8, c("Aneuploid", "Diploid"), lty = 2:3)
ggplot2
version of the same plot alongwith the confidence intervals (shaded).# ggplot2 version of the plot
tongue_tidy = tidy(tongue_survfit)
mx = max(tongue_tidy$n.censor)
ggplot(tongue_tidy, aes(time, estimate, fill = strata)) +
geom_line() +
geom_point(aes(shape = as.factor(n.censor)), size = 3) +
scale_shape_manual(values=c(NA, 1:mx))+
geom_ribbon(aes(ymin=conf.low, ymax=conf.high), alpha=.25) +
xlab("weeks") +
ylab("Proportion Survival")
## Warning in loop_apply(n, do.ply): Removed 43 rows containing missing values
## (geom_point).
We can compare the survival curves using various statistical tests using survdiff
function. Apart from the usual formula and data arguments it takes rho
is an argument. Setting different values of rho
leads to different kinds of test. In general -
rho
> 0 implies higher weight is given to initial part of the survival curverho
< 0 implies higher weight is given to latter part of the survival curve Depending on the value of rho we can get different results as can be seen in the below example.# Statistical tests for comparison
survdiff(Surv(time = time, event = delta) ~ type, data = tongue, rho = 0) # log-rank, default
## Call:
## survdiff(formula = Surv(time = time, event = delta) ~ type, data = tongue,
## rho = 0)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## type=1 52 31 36.6 0.843 2.79
## type=2 28 22 16.4 1.873 2.79
##
## Chisq= 2.8 on 1 degrees of freedom, p= 0.0949
survdiff(Surv(time = time, event = delta) ~ type, data = tongue, rho = 1)
## Call:
## survdiff(formula = Surv(time = time, event = delta) ~ type, data = tongue,
## rho = 1)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## type=1 52 20.2 24.4 0.731 3.3
## type=2 28 15.1 10.9 1.643 3.3
##
## Chisq= 3.3 on 1 degrees of freedom, p= 0.0694
It outputs a chi-squared statistic as well as p-value which can be used to determine whether the difference is significant at certain significance level.
simple.surv.sim
function from survsim
package.Following is the same analysis using this simulated dataset. Try and see if you can make sense of code and the output.
set.seed(2365)
d1 = simple.surv.sim(n = 500, foltime = 1000, dist.ev = 'weibull', anc.ev = 1.5, beta0.ev = 2, # event dist - weibull with p = 1.5
dist.cens = 'weibull', anc.cens = 10, beta0.cens = 2.01, # censoring dist - weibull with p = 10
# z = list(c("weibull", 1)), #assuming independent observations i.e. no within subject correlation
beta = list(c(.6)), x = list(c("bern", .7)) # 1 binary covariate
)
# View(d1)
summary(d1)
##
## Number of subjects at risk
## ----------------------------
## sub.risk
## 500
##
##
## Number of events
## ----------------------------
## num.events
## 186
##
##
## Proportion of subjects with event
## ----------------------------
## mean.ep.sub
## 0.372
##
##
## Total time of follow-up
## ----------------------------
## foltime
## 2964.863
##
##
## Time of follow-up (median)
## ----------------------------
## med.foltime
## 6.612405
##
##
## Density of incidence
## ----------------------------
## dens.incid
## 0.06273477
summary(as.data.frame(d1))
## nid status start stop z
## Min. : 1.0 Min. :0.000 Min. :0 Min. :0.1452 Min. :1
## 1st Qu.:125.8 1st Qu.:0.000 1st Qu.:0 1st Qu.:4.8752 1st Qu.:1
## Median :250.5 Median :0.000 Median :0 Median :6.6124 Median :1
## Mean :250.5 Mean :0.372 Mean :0 Mean :5.9297 Mean :1
## 3rd Qu.:375.2 3rd Qu.:1.000 3rd Qu.:0 3rd Qu.:7.4102 3rd Qu.:1
## Max. :500.0 Max. :1.000 Max. :0 Max. :9.1243 Max. :1
## x
## Min. :0.000
## 1st Qu.:0.000
## Median :1.000
## Mean :0.714
## 3rd Qu.:1.000
## Max. :1.000
d1_survfit = survfit(Surv(time = start, time2 = stop, event = status) ~ x, data = d1)
d1_survfit = survfit(Surv(time = stop, event = status) ~ x, data = d1) # same as above
plot(d1_survfit, lty = 2:3, xlab = "time", ylab = "Proporation Survival")
legend(100, .8, c("x=0", "x=1"), lty = 2:3)
# ggplot2 version of the plot
d1_tidy = tidy(d1_survfit)
mx = max(d1_tidy$n.censor)
ggplot(d1_tidy, aes(time, estimate, fill = strata)) +
geom_line() +
geom_point(aes(shape = as.factor(n.censor)), size = 2) +
scale_shape_manual(values=c(NA, 1:mx))+
geom_ribbon(aes(ymin=conf.low, ymax=conf.high), alpha=.25) +
xlab("time") +
ylab("Proportion Survival")
## Warning in loop_apply(n, do.ply): Removed 186 rows containing missing
## values (geom_point).
# Statistical tests for comparison
survdiff(Surv(time = stop, event = status) ~ x, data = d1, rho = 0)
## Call:
## survdiff(formula = Surv(time = stop, event = status) ~ x, data = d1,
## rho = 0)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## x=0 143 79 45.7 24.30 32.3
## x=1 357 107 140.3 7.91 32.3
##
## Chisq= 32.3 on 1 degrees of freedom, p= 1.31e-08