In survival analysis, you have three options for modeling the survival function: non-parametric (such as Kaplan-Meier), semi-parametric (Cox regression), and parametric (such as the Weibull distribution). When should you use each, and what are their tradeoffs?
The most common non-parametric technique for modeling the survival function is the Kaplan-Meier estimate. One way to think about survival analysis is as non-negative regression and density estimation for a single random variable (the time to first event) in the presence of censoring. In line with this, the Kaplan-Meier estimate is a non-parametric estimate of the survival function that accounts for censoring; without censoring, it reduces to the empirical survival function.
The advantage of this approach is that it’s very flexible: model complexity grows with the number of observations. There are two disadvantages.

a) It isn’t easy to incorporate covariates, which makes it difficult to describe how individuals differ in their survival functions. The main workaround is to fit a separate model on each subpopulation and compare them. However, as the number of characteristics and the number of values they can take grow, this becomes infeasible.

b) The survival functions aren’t smooth. In particular, they are piecewise constant. They approach a smooth function as the sample size grows, but for small samples they are far from smooth. It’s not clear that it’s realistic for the survival probability to drop in discrete ‘jumps’ at the observed death times. Further, if there are no observed deaths in the interval [0,t), the estimate assigns survival probability 1 to that period, which may not be desirable. There are ways to smooth the survival function (such as kernel smoothing), but the interpretation of the smoothed estimate can be a bit tricky.
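To make the estimator concrete, here is a minimal sketch of the Kaplan-Meier product-limit calculation in plain Python. The function names `kaplan_meier` and `survival_at` and the toy data are hypothetical, chosen only for illustration; a real analysis would normally rely on an established library rather than hand-rolled code.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed death time.

    times:  follow-up time for each subject
    events: 1 if the death was observed, 0 if the subject was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # Everyone whose follow-up time is >= t is still at risk just before t.
        at_risk = sum(1 for u in times if u >= t)
        survival *= 1.0 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve


def survival_at(curve, t):
    """Evaluate the piecewise-constant estimate at time t."""
    s = 1.0  # no observed deaths yet means estimated survival of 1
    for event_time, value in curve:
        if event_time > t:
            break
        s = value
    return s


# Hypothetical toy data: six subjects, two of them censored.
times = [5, 8, 8, 12, 15, 20]
events = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(times, events)

for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
print("S(4) =", survival_at(curve, 4))
```

The estimate only changes at observed death times, which is where the piecewise-constant shape comes from, and it stays at exactly 1 before the first observed death (here, any time before t=5).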
Let’s try Kaplan-Meier on real data. We use the ovarian dataset from the R package ‘survival.’ We borrow some code from this tutorial in order to pre-process the data and make this plot. The data has death or censoring times for ovarian cancer patients over a period of approximately 1200 days. It also has the treatment rx (1 or 2), a diagnosis on regression of tumors, and patient performance status on the ECOG scale. Here is a plot of two Kaplan-Meier fits according to treatment.
This plot has some of the issues we mentioned. Firstly, the survival probabilities ‘jump.’ Secondly, for rx=2, no one died during the first 350 or so days, and thus we see a survival probability of 1 over that period. It can be dangerous to presume that this is close to the true survival probability, particularly if the sample size for that group is small. Finally, if we want to incorporate the regression diagnosis or patient performance in addition to treatment, we’ll need to fit a separate model for each combination of values.
The most well-known semi-parametric technique is Cox regression, which addresses the problem of incorporating covariates. It decomposes the hazard, or instantaneous risk, into a non-parametric baseline shared across all patients and a relative risk that describes how individual covariates affect risk.
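Written out, the decomposition in its standard form looks like the following. The symbols are notation introduced here for illustration and do not appear in the original text: h_0(t) is the baseline hazard, x is a patient’s covariate vector, and β is the vector of regression coefficients.

$$
h(t \mid x) = h_0(t) \, \exp\left(\beta^\top x\right)
$$

The baseline h_0(t) is left unspecified (the non-parametric part), while the relative risk exp(β^T x) has a fixed parametric form, which is why the model is called semi-parametric.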
This allows for a time-varying baseline risk, as in the Kaplan-Meier estimate, while allowing patients to have different survival functions within the same fitted model. Again, though, the survival function is not smooth. Further, two assumptions now have to hold for inferences to be correct and predictions to be good:
- linearity between covariates and log-hazard
- proportional hazards: the ratio of hazards between any two patients is constant over time
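To see what these assumptions imply, here is a small sketch. It assumes the standard Cox form, in which the relative risk is the exponential of a linear combination of covariates; under that form, a patient’s survival function is the baseline survival raised to the power of their relative risk. The function names, the baseline values, and the coefficient are all hypothetical and are not fitted to any data.

```python
import math


def relative_risk(beta, x):
    """exp(beta . x): the log-hazard is linear in the covariates."""
    return math.exp(sum(b * xi for b, xi in zip(beta, x)))


def cox_survival(baseline_survival, beta, x):
    """S(t | x) = S_0(t) ** relative_risk, evaluated on the baseline's time grid."""
    rr = relative_risk(beta, x)
    return [s0 ** rr for s0 in baseline_survival]


# Hypothetical baseline survival on some time grid, and one made-up coefficient.
baseline = [1.0, 0.9, 0.7, 0.5]
beta = [0.5]

group_a = cox_survival(baseline, beta, [0.0])  # covariate = 0: baseline group
group_b = cox_survival(baseline, beta, [1.0])  # covariate = 1: higher risk

# The hazard ratio does not depend on time: that is proportional hazards.
hazard_ratio = relative_risk(beta, [1.0]) / relative_risk(beta, [0.0])
print("hazard ratio:", round(hazard_ratio, 3))
print("group b survival:", [round(s, 3) for s in group_b])
```

Both groups share the same baseline shape; the covariate only rescales the hazard by a constant factor. If the hazard ratio in your data changes over time, this structure is the wrong one.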
One can also assume that the survival function follows a parametric distribution. For instance, one can assume an exponential distribution (constant hazard) or a Weibull distribution (a hazard that rises or falls monotonically over time). There are two benefits. The first is that if you choose an absolutely continuous distribution, the survival function is smooth. The second is that choosing a parametric survival function constrains model flexibility, which may be good when you don’t have a lot of data and your choice of parametric model is appropriate. Unlike a smoothing technique applied after an initial estimate of the survival function, these parametric models tend to come with good intuition for how they behave. Further, as in Cox regression, it’s easy to incorporate covariates into the model and inference procedure.
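As a quick illustration of the two distributions mentioned above, here is a sketch of the Weibull survival and hazard functions using the common shape/scale parameterization, S(t) = exp(-(t/scale)^shape). The function names and parameter values are hypothetical. Setting shape to 1 recovers the exponential distribution with its constant hazard.

```python
import math


def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape), a smooth function of t."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Hypothetical parameter values, for illustration only.
for t in (50.0, 100.0, 200.0):
    exp_h = weibull_hazard(t, shape=1.0, scale=100.0)  # exponential: constant hazard
    wei_h = weibull_hazard(t, shape=1.5, scale=100.0)  # shape > 1: rising hazard
    s = weibull_survival(t, shape=1.5, scale=100.0)
    print(f"t={t:.0f}  exponential hazard={exp_h:.4f}  "
          f"Weibull hazard={wei_h:.4f}  Weibull S(t)={s:.3f}")
```

A shape above 1 gives a hazard that increases with time, and a shape below 1 gives one that decreases. With only two parameters, the model is far more constrained than a Kaplan-Meier curve.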
The downside is that the parametric model needs to actually be a good description of your data. This may or may not be true, and you need to check it, either by formal hypothesis testing or by visualization procedures.
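One visualization procedure of this kind, sketched here for the Weibull case only: if S(t) = exp(-(t/scale)^shape), then log(-log S(t)) is a straight line in log t with slope equal to the shape parameter. Plotting a non-parametric survival estimate on those axes and looking for a roughly straight line is therefore an informal check of the Weibull assumption. The helper names below are hypothetical, and the input is synthetic: exact Weibull survival values stand in for what would in practice be (time, survival) pairs from a Kaplan-Meier fit.

```python
import math


def weibull_plot_points(curve):
    """Map (t, S(t)) pairs to (log t, log(-log S(t))).

    Points with S(t) equal to 0 or 1 are skipped because the transform
    is undefined there.
    """
    return [
        (math.log(t), math.log(-math.log(s)))
        for t, s in curve
        if t > 0 and 0.0 < s < 1.0
    ]


def least_squares_line(points):
    """Ordinary least-squares slope and intercept through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic input: exact Weibull survival values with made-up parameters.
shape, scale = 1.5, 100.0
curve = [(t, math.exp(-((t / scale) ** shape))) for t in (10, 50, 100, 200)]

slope, intercept = least_squares_line(weibull_plot_points(curve))
print("fitted slope (estimate of the shape):", round(slope, 3))
```

On real data the points will not fall exactly on a line; clear curvature suggests that the Weibull model is a poor description, and a different parametric family or a semi-parametric model may be more appropriate.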
Questions to Ask
When deciding which type of model to fit, ask yourself the following questions:
- Do you need covariates? Lean towards parametric or semi-parametric. Make sure assumptions are satisfied.
- Do you need your survival function to be smooth? Lean towards parametric, or apply a smoothing technique.
- Does your data appear to follow a parametric distribution? Lean towards parametric if it does. Otherwise, lean towards semi-parametric or non-parametric.