Survival Analysis

 

  • Learning Objectives:

 

    1. Understand Survival Analysis and identify business use cases
    2. Classify the different methods of Survival Analysis based on number of parameters used
    3. Interpret the log rank test
    4. Interpret a hazard ratio, survival function
    5. Interpret coefficients in Cox proportional hazards regression analysis

 

2. Prerequisites:

  1. Hypothesis Testing
  2. Regression

 

What is Survival Analysis?

 

Survival analysis aka time-to-event analysis is generally defined as a set of methods for analyzing data where the outcome variable is:-

event status, which records if the event of interest occurred or not AND the time until the occurrence of an event of interest

 

The event can be death, occurrence of a disease, marriage, divorce, customer churn etc.  For example, survival analysis can be used to predict if a customer will churn or not (event status) and also estimating the time (in days, months, years etc.) when the customer would churn.

 

Practical Applications of Survival Analysis:

Survival analysis is used for:

  • Engineering. It is used to perform reliability or failure analysis on parts and components.
  • Warranty Applications. It is used to estimate optimal time for part replacement versus part service.
  • Healthcare. It is used to estimate the risk of disease progression in patients, the impact of medications and treatment algorithms, and the timing of medications and procedures
  • Government and Social Services. It is used in child welfare to match children with foster parents and to optimize the length of stay of children in the program, to estimate participation time in various social programs, and to estimate the time it takes for various policies to take effect
  • Marketing Operations. It is used to assess the length of participation in loyalty programs

 

Objectives of Survival analysis:

 

There are 3 objectives of Survival analysis, let’s just get introduced to these for now and the details of each will be discussed later in this course

  • Estimate time-to-event for a group of individuals, such as time until heart-attack for a group of female patients

To describe the survival times of members of a group, use the following:-

  • Kaplan-Meier curves
  • Survival function
  • Hazard function
  • To compare time-to-event between two or more groups, such as time until heart-attach for female vs. male patients

To compare the survival times of two or more groups, use the following:-

  • Log-rank test
  • To assess the relationship of co-variables to time-to-event, such as: does weight, exercise, or cholesterol influence survival time of heart-attack patients?
    • Cox proportional hazards regression

 

Definitions of common terms in survival analysis:

The following terms are commonly used in survival analyses:

 

  1. Event: Death, disease occurrence, disease recurrence, recovery, or other experience of interest
  2. Time: The time from the beginning of an observation period (such as surgery or beginning treatment) to (i) an event, or (ii) end of the study, or (iii) loss of contact or withdrawal from the study.
  3. Censoring / Censored observation:

To understand censoring, consider this example:

Let the time-to-event be a person’s age at onset of cancer. If you stop following someone after age 65, you may know that the person did NOT have cancer at age 65, but you do not have any information after that age.

You know that their age of getting cancer is greater than 65. But you do not know if they will never get cancer or if they’ll get it at age 66, only that they have a “survival” time greater than 65 years.

 

These cases of incomplete information are called censored observations. They are censored because we did not gather information on that subject after age 65.

So one cause of censoring is merely that we can’t follow people forever. At some point you have to end your study, and not all people will have experienced the event.

But another common cause is that people are lost to follow-up during a study.

 

There are 3 main reasons why this happens:

  1. Individual does not experience the event when the study is over.
  2. Individual is lost to follow-up during the study period.
  3. Individual withdraws from the study

 

There are 3 major times of censoring: right, left and interval censoring which we will discuss below.

Right-censored

Right-censoring, the most common type of censoring, occurs when the survival time is “incomplete” at the right side of the follow-up period. Consider the follow example where we have 3 patients (A, B, C) enrolled onto a clinical study that runs for some period of time (study end – study start).

These 3 patients have three different trajectories:

  1. Patient A: Experiences a death before the study ends. We count this as an event.
  2. Patient B: Survives passed the end of the study.
  3. Patient C: Withdraws from the study.

 

 

Patient A requires no censoring since we know their exact survival time which is the time until death. Patient B however needs to be censored (indicated with the + at the end of the follow-up time) since we don’t know the exact survival time of the patient; We only know that they survived up to at least the end of the study. Patient C also needs to be censored since they withdrew before the study ended. So we only know that they survived up to the time they withdrew, but again we don’t know the exact survival time of this patient. In right censoring, the true survival times will always be equal to or greater than the observed survival time.

Survival analysis automatically takes care of right censoring. Right censoring is not considered as missing values in the analysis.

Left-censored

In contrast to right-censoring, left censoring occurs when the person’s true survival time is less than or equal to the observed survival time. An example of a situation could be for virus testing. For instance, if we’ve been following an individual and recorded an event when for instance the individual tests positive for a virus:

But we don’t know the exact time of when the individual was exposed to the disease. We only know that there was some exposure between 0 and the time they were tested:

Interval-censored

Using the virus testing example, if we have the situation whether we’ve performed testing on the individual at some timepoint (t1) and the individual was negative. But then at a timepoint further on (t2), the individual tested positive:

In this scenario, we know the individual was exposed to the virus sometime between t1 and t2, but we do not know the exact timing of the exposure.

 

 

  • Survival Probability or Survival Function: Probability of surviving until time *t* is called survival function. It is normally represented as S(t)

 

 

S(t) = No. of individuals or subjects that have survived until time t/ Total no. of subjects

 

All survivor functions follow these same 3 characteristics:

  • As t increases, the S(t) should decrease.
  • S(0)=1. That is at the start of the study, no one has the event and hence the probability of surviving at t=0 is 1.
  • S(∞)=0. If the study were to go to S(∞), then everyone will eventually experience the event and hence the survival probability must be 0

 

  • Kaplan-Meier survival estimate:

 

The Kaplan-Meier (KM) method is method used to estimate the survival probability that we just learned about from observed survival times

 

The survival probability at time ti, S(ti), is calculated as follow:

S(ti)=S(ti−1)(1−di/ni)

Where,

  • S(ti−1) = the probability of being alive at ti−1
  • ni = the number of patients alive just before ti
  • di = the number of events at ti
  • At t(0)=0, S(0) = 1

 

The estimated probability S(t) is a step function that changes value only at the time of each event. The KM survival curve, a plot of the KM survival probability against time, provides a useful summary of the data that can be used to estimate measures such as median survival time.

 

The Survival curve which represents the survival function at any point of time, ideally the curve looks like a smooth curve like shown below but more often than not, the curve looks like a staircase:-

 

 

  • Hazard Rate/Hazard function:  The potential risk of having an event given you have survived up to time t. Mathematically, h(t) is represented as follows:

 

Let’s break down this equation:

  • Lim Δt→∞: This indicates as the time interval approaches 0. This essentially gives us the instantaneous measurement at a particular time
  • P(t≤T<t+Δt | T≥t): Probability that a person’s survival time T will be between the time interval t and t+Δt given that they have survived up to t (T≥tT≥t)

So what this equation is telling us is if the time interval approaches 0 (lim Δt→∞), then we are getting the instantaneous risk of having an event at time t given they have survived up to t.

The Hazard function can be expressed as a function of Survival function

h(t) = − d/dt (log S(t)

 

Survival Modeling in R

Install and load required R package

We’ll use two R packages:

  • survival for computing survival analyses
  • survminer for summarizing and visualizing the results of survival analysis

# list all packages we’d need

list_of_packages <- c(‘survival’, ‘survminer’, ‘dplyr’)

 

# install packages, if not already installed

new_packages <- list_of_packages[!(list_of_packages %in% installed.packages()[,”Package”])]

if(length(new_packages)) install.packages(new_packages, repos = “http://cran.us.r-project.org“)

 

# Loading all required packages

lapply(list_of_packages, require, character.only = TRUE)

Example data sets

We’ll use the lung cancer data available in the survival package

 

data(“lung”)
head(lung)

 

inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1    3  306      2  74   1       1       90       100     1175      NA
2    3  455      2  68   1       0       90        90     1225      15
3    3 1010      1  56   1       0       90        90       NA      15
4    5  210      2  57   1       1       90        60     1150      11
5    1  883      2  60   1       0      100        90       NA       0
6   12 1022      1  74   1       1       50        80      513       0

 

  • inst: Institution code
  • time: Survival time in days
  • status: censoring status 1=censored, 2=dead
  • age: Age in years
  • sex: Male=1 Female=2
  • ph.ecog: ECOG performance score (0=good 5=dead)
  • ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician
  • pat.karno: Karnofsky performance score as rated by patient
  • meal.cal: Calories consumed at meals
  • wt.loss: Weight loss in last six months

Compute survival curves: survfit()

We want to compute the survival probability by sex.

 

The function survfit [in survival package] can be used to compute Kaplan-Meier survival estimate. Its main arguments include:

  • a survival object created using the function Surv
  • and the data set containing the variables.

 

To compute survival curves, type this:

fit <- survfit(Surv(time, status) ~ sex, data = lung)

Visualize survival curves

We’ll use the function ggsurvplot [in survminer R package] to produce the survival curves for the two groups of subjects.

ggsurvplot(fit)

At time 250, the probability of survival is approximately 0.55 (or 55%) for sex=1 and 0.75 (or 75%) for sex=2. From the survival curve, we can say that the probability of survival for sex=2 at any point in time is always higher than survival probability for sex=1

 

We can use ggsurvplot to show additional information in the survival curve like the following:-

  • We can show the 95% confidence limits of the survivor function by using the argument conf.int = TRUE.
  • the number and/or the percentage of individuals at risk by time using the option risk.table. Allowed values for risk.table include:
    • TRUE or FALSE specifying whether to show or not the risk table. Default is FALSE.
    • “absolute” or “percentage”: to show the absolute number and the percentage of subjects at risk by time, respectively. Use “abs_pct” to show both absolute number and percentage.
  • the p-value of the Log-Rank test comparing the groups using pval = TRUE
  • If there are more than one groups in the variable, like we have sex=1 and sex=2 in the sex variable and we need to use a different color to represent the two groups, we can use the color attribute. By default color=”strata” but we can use the argument palette to use custom colors as well.
  • legend.labs can be used to rename legends

 

 

ggsurvplot(fit,

color=”strata”,

conf.int = TRUE,

risk.table = TRUE,

risk.table.col = “strata”,

pval=TRUE,

legend.labs =  c(“Male”, “Female”))

Log-Rank test comparing survival curves: survdiff()

The log-rank test is the most widely used method of comparing two or more survival curves. The null hypothesis is that there is no difference in survival between the two groups. The log rank test is a non-parametric test, which makes no assumptions about the survival distributions. Essentially, the log rank test compares the observed number of events in each group to what would be expected if the null hypothesis were true (i.e., if the survival curves were identical). The log rank statistic is approximately distributed as a chi-square test statistic.

The function survdiff() [in survival package] can be used to compute log-rank test comparing two or more survival curves.

surv_diff <- survdiff(Surv(time, status) ~ sex, data = lung)
surv_diff

Call:
survdiff(formula = Surv(time, status) ~ sex, data = lung)
N Observed Expected (O-E)^2/E   (O-E)^2/V
sex=1 138      112     91.6      4.55                 10.3
sex=2  90       53     73.4      5.68                   10.3
Chisq= 10.3  on 1 degrees of freedom, p= 0.00131

 

The function returns a list of components, including:

  • n: the number of subjects in each group.
  • obs: the weighted observed number of events in each group.
  • exp: the weighted expected number of events in each group.
  • chisq: the chisquare statistic for a test of equality.
  • strata: optionally, the number of subjects contained in each stratum.

 

The p-value from the output here is of most importance. The log rank test for difference in survival gives a p-value of p = 0.0013, indicating that the sex groups differ significantly in survival as the null hypothesis here is that there is no difference in survival of males and females

 

Note: This information can also be extracted from a survfit object using pval=TRUE in ggsurvplot()

 

Cox Regression: Overview

Cox Proportional Hazards model is also referred by Cox Model, Cox Regression, or Proportional Hazard Model.

Cox Model is used for analysis of Survival data and finding out relationship between Survival Time and covariates (also called explanatory variables or predictors)

 

Difference in Kaplan-Meier/Logrank and Cox Regression

 

Kaplan-Meier curves and logrank tests – are examples of univariate analysis. They describe the survival according to one factor under investigation, but ignore the impact of any others.

Additionally, Kaplan-Meier curves and logrank tests are useful only when the predictor variable is categorical (e.g.: treatment A vs treatment B; males vs females). They don’t work easily for quantitative predictors such as gene expression, weight, or age.

An alternative method is the Cox proportional hazards regression analysis, which works for both quantitative predictor variables and for categorical variables. Furthermore, the Cox regression model extends survival analysis methods to assess simultaneously the effect of several risk factors on survival time.

 

Equation:

The Cox model is expressed by the hazard function denoted by h(t).

h(t)=h0(t) × exp(b1x1+b2x2+…+bpxp)

As seen from the equation, Cox Model formulation has two parts – baseline hazard function and exponential transformation of covariates.

where,

  • t represents the survival time
  • h(t) is the hazard function determined by a set of p covariates (x1,x2,…,xpx1,x2,…,xp)
  • the coefficients (b1,b2,…,bpb1,b2,…,bp) measure the impact (i.e., the effect size) of covariates.
  • the term h0 is called the baseline hazard. It corresponds to the value of the hazard if all the xi are equal to zero (the quantity exp(0) equals 1). The ‘t’ in h(t) reminds us that the hazard may vary over time

The Cox model can be written as a multiple linear regression of the logarithm of the hazard on the variables xi, with the baseline hazard being an ‘intercept’ term that varies with time.

Hazard Ratios and their significance:

Hazard ratio specifies the factor by which the risk of dying would change for one unit difference in an explanatory variable (in case of a numerical covariate)

In Cox Model, Hazard is formulated as

h(t) = h0(t) e ^(b0+b1x1 +b2x2+…)

Consider example for i and j cases

hi(t) = h0(t) e ^(b0+b1x1i +b2x2i+…)                                                         ….. (1)

hj(t) = h0(t) e ^(b0+b1x1j +b2x2j+…)                                                         ….. (2)

 

Hazard Ratio for i and j

hi(t)/hj(t) = e ^(b1(x1i – x1j ) +b2(x2i – x2j))

Hazards Ratio  hi(t)/hj(t)  does not depend on the time.

If we want to check per unit change for a variable  X1  keeping other variables constant then Hazard Ratio will be

hi(t)/hj(t) = e^b1

Hence, the quantities exp(bi) are called hazard ratios (HR).

A value of bi greater than zero, or equivalently a hazard ratio greater than one, indicates that as the value of the ith covariate increases, the event hazard increases and thus the length of survival decreases.

Put another way, a hazard ratio above 1 indicates a covariate that is positively associated with the event probability, and thus negatively associated with the length of survival.

In summary,

  • HR = 1: No effect
  • HR < 1: Reduction in the hazard
  • HR > 1: Increase in Hazard

Note that in cancer studies i.e. where the event is a negative event:

  • A covariate with hazard ratio > 1 (i.e.: b > 0) is called bad prognostic factor
  • A covariate with hazard ratio < 1 (i.e.: b < 0) is called good prognostic factor

Interpretation of Hazard Ratio for categorical variables

For a categorical variable,

Hazard Ratio (i.e. the ratio of hazards) = Hazard in the intervention group ÷ Hazard in the control group

Because Hazard Ratio is a ratio, then when:

HR = 0.5: at any particular time, half as many patients in the treatment group are experiencing an event compared to the control group.

HR = 1: at any particular time, event rates are the same in both groups,

HR = 2: at any particular time, twice as many patients in the treatment group are experiencing an event compared to the control group.

 

Cox Regression in R

In survival package in R, coxph function can be used for building Cox Proportional Hazard Model.

 

coxph(formula, data, method)

 

  • formula: is linear model with a survival object as the response variable. Survival object is created using the function Surv() as follow: Surv(time, event).
  • data: a data frame containing the variables
  • method: is used to specify how to handle ties. The default is ‘efron’. Other options are ‘breslow’ and ‘exact’. The default ‘efron’ is generally preferred to the once-popular “breslow” method. The “exact” method is much more computationally intensive.

 

 

res.cox <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

summary(res.cox)

Let’s understand the specific outputs of summary() which helps us interpret the output of the model:-

 

  1. Global Statistical significance of the model:-

The p-value for all three overall tests (likelihood, Wald, and score) are significant, indicating that the model is significant. These tests evaluate the omnibus null hypothesis that all of the betas (β) are 0. In the above example, the test statistics are in close agreement, and the omnibus null hypothesis is soundly rejected.

 

  1. P-value for each covariate – Pr(>|Z|) :- The p-value tests if the beta coefficient of the variable is significantly different from 0 or not. A beta coefficient of 0 means the impact of the variable is 0. From the P-values, we can see that covariate sex, ph.ecog and ph.karno are significant in determining the survival time while the rest of the variables are not statistically significant

 

  1. Hazards ratio or exp(coef): The Hazards ratio<1 for sex which means being female (sex=2) reduces the hazard by a factor of 0.58, or 42%. We conclude that, being female is associated with good prognostic.

Ph.ecog with a hazard ratio HR = 2.08, indicating a strong relationship between the ph.ecog value and increased risk of death. Holding the other covariates constant, a higher value of ph.ecog is associated with a poor survival.

 

Predicting survival time after the model equation is created

 

Having fit a Cox model to the data, it’s possible to visualize the predicted survival proportion at any given point in time for a particular risk group. The function survfit() estimates the survival proportion

 

 

# Create the new data for which survival needs to be predicted

predict_df <- with(lung, data.frame(sex = 1,

age = 20,

ph.ecog = 1,

ph.karno=100,

pat.karno=10,

meal.cal=120,

 

wt.loss=10))

 

head(predict_df)

 

 

# Survival curves

fit <- survfit(res.cox, newdata =  predict_df)

ggsurvplot(fit, conf.int=FALSE, data= predict_df)

 

 

References

http://tinyheero.github.io/2016/05/12/survival-analysis.html#censoring

https://www.cscu.cornell.edu/news/statnews/stnews78.pdf

https://en.wikipedia.org/wiki/Survival_analysis

https://en.wikipedia.org/wiki/Proportional_hazards_model