This can be done using Cox Propotional Hazard model. (No time-varying covariates)
survival
package has coxph
function to accomplish this.
Surv
function to get the data in the right format, incorporating censoring information.coxph
function alongwith the covariate(s) to find out their effect.cox.zph
to validate proportional hazard assumption.library(magrittr)
{
library(OIsurv)
library(dplyr)
library(broom)
library(ggplot2)
library(survsim)
}
# %>%
# suppressMessages()
# Real dataset from KMsurv package
data(tongue)
# head(tongue)
# summary(tongue)
# help(tongue)
This dataset is about 80 male subjects who were diagnosed with one of the 2 types of cancer of the tongue. Time tp death is expressed in weeks. More info on this can be found from the package documentation for this dataset using help(tongue)
.
Shown below is coxph
model with cancer type
as covariate.
# COX PH
tongue_coxph = coxph(Surv(time = time, event = delta) ~ as.factor(type), data = tongue)
tongue_coxph_tidy = tidy(tongue_coxph)
tongue_coxph_tidy # equivalent of print()
## term estimate std.error statistic p.value conf.low
## 1 as.factor(type)2 0.4663742 0.2804476 1.662964 0.09631971 -0.08329303
## conf.high
## 1 1.016041
The confidence interval for the coefficient estimate [-0.083293, 1.0160414] contains 0, implies the relationship is not significant.
The likelihood ratio test against the null model (below) has p-value of 0.1020272 which again supports this.
glance(tongue_coxph) # equivalent of summary()
## n nevent statistic.log p.value.log statistic.sc p.value.sc
## 1 80 53 2.673567 0.1020272 2.814338 0.09342544
## statistic.wald p.value.wald r.squared r.squared.max concordance
## 1 2.77 0.09631971 0.03286732 0.9929755 0.5642941
## std.error.concordance logLik AIC BIC
## 1 0.03561883 -196.9975 395.995 -Inf
# plot(survfit(tongue_coxph, newdata = as.data.frame(cbind(type = 1)))) # for prediction, only the inputs/covariates need to be provided
Now, even though the coefficient is not significant, we can test and see if the model assumptions are satisfied.
cox.zph
function. It tests correlation between Schoenfeld residuals and (transformed) time using a chi-square test. For this test, p-value less than certain threshold (say 0.5) would imply the correlation between the residuals and (transformed) time is significant and proportional hazards assumption does not hold. (More info on schoenfeld residual)# Validating Cox PH Assumptions
validate_coxph = cox.zph(tongue_coxph, transform = "km")
validate_coxph
## rho chisq p
## as.factor(type)2 -0.0958 0.473 0.491
From the low p-value, the proportional hazards assumption does not seem to hold here.
The plot below shows the scaled Schoenfeld residuals. Ideally for PH assumption to hold, this should be a flat straight line. (Like any other residuals, these residuals should exhibit random pattern. )
plot(validate_coxph)
abline(h=0)
validate_gg = data.frame(cbind(x = validate_coxph[["x"]],
y = validate_coxph[["y"]]))
## Warning in data.row.names(row.names, rowsi, i): some row.names duplicated:
## 2,4,5,7,9,14,15,17,22,24,27,28,47,48 --> row.names NOT used
names(validate_gg) = c("transformed_t", "scaled_schoenfeld_residual")
ggplot(data = validate_gg, aes(x = transformed_t, y = scaled_schoenfeld_residual)) +
geom_point() +
geom_smooth(method = "loess")
simple.surv.sim
function from survsim
package.(The same dataset was used in prior post)
coxph
or other similar models)Following is the same analysis using this simulated dataset. Try and see if you can make sense of code and the output.
set.seed(2365)
d1 = simple.surv.sim(# event dist - weibull with p = 1.5
n = 500, foltime = 1000, dist.ev = 'weibull', anc.ev = 1.5, beta0.ev = 2,
# censoring dist - weibull with p = 10
dist.cens = 'weibull', anc.cens = 10, beta0.cens = 2.01,
# 1 binary covariate
beta = list(c(.6)), x = list(c("bern", .7))
)
# View(d1)
# summary(d1)
summary(as.data.frame(d1))
## nid status start stop z
## Min. : 1.0 Min. :0.000 Min. :0 Min. :0.1452 Min. :1
## 1st Qu.:125.8 1st Qu.:0.000 1st Qu.:0 1st Qu.:4.8752 1st Qu.:1
## Median :250.5 Median :0.000 Median :0 Median :6.6124 Median :1
## Mean :250.5 Mean :0.372 Mean :0 Mean :5.9297 Mean :1
## 3rd Qu.:375.2 3rd Qu.:1.000 3rd Qu.:0 3rd Qu.:7.4102 3rd Qu.:1
## Max. :500.0 Max. :1.000 Max. :0 Max. :9.1243 Max. :1
## x
## Min. :0.000
## 1st Qu.:0.000
## Median :1.000
## Mean :0.714
## 3rd Qu.:1.000
## Max. :1.000
# COX PH
d1_coxph = coxph(Surv(time = start, time2 = stop, event = status) ~ as.factor(x), data = d1)
d1_coxph_tidy = tidy(d1_coxph)
d1_coxph_tidy # equivalent of print()
## term estimate std.error statistic p.value conf.low
## 1 as.factor(x)1 -0.8220694 0.1486678 -5.529572 3.210132e-08 -1.113453
## conf.high
## 1 -0.5306858
glance(d1_coxph) # equivalent of summary()
## n nevent statistic.log p.value.log statistic.sc p.value.sc
## 1 500 186 28.63993 8.716611e-08 32.31914 1.308183e-08
## statistic.wald p.value.wald r.squared r.squared.max concordance
## 1 30.58 3.210132e-08 0.05567025 0.9871835 0.5959589
## std.error.concordance logLik AIC BIC
## 1 0.01658598 -1074.935 2151.87 -Inf
# plot(survfit(d1_coxph, newdata = as.data.frame(cbind(type = 1)))) # for prediction, only the inputs/covariates need to be provided
The coefficient for covariate x
is significant but is -0.8220694, quiet different from the parameter we used to simulate this data i.e. 0.6
. This is happening partly due to the fact that for coxph
we are finding the coefficients using partial likelihood without making any assumptions around the time-to-event distribution.
# Validating Cox PH Assumptions
validate_d1_coxph = cox.zph(d1_coxph, transform = "km")
validate_d1_coxph
## rho chisq p
## as.factor(x)1 0.0659 0.798 0.372
From not so low p-value (), the proportional hazards assumption holds here.
The scaled Schoenfeld residuals plot below also looks flat for the most part, seems to be increasing for higher values of time.
plot(validate_d1_coxph)
abline(h=0)
Below is a ggplot2 version using loess
fit telling the same story.
validate_d1_gg = data.frame(cbind(x = validate_d1_coxph[["x"]],
y = validate_d1_coxph[["y"]]))
names(validate_d1_gg) = c("transformed_t", "scaled_schoenfeld_residual")
ggplot(data = validate_d1_gg, aes(x = transformed_t, y = scaled_schoenfeld_residual)) +
geom_point() +
geom_smooth(method = "loess")