Disease progression modeling is an important area of applied and methodological statistics, as well as of machine learning. There are arguably six main questions that we aim to answer with disease progression models. Each model class has its own strengths and limitations for each of these questions, and these also vary across datasets and problems.
- What will future values of health measurements be, given their history (forecasting)? This is the most obvious question. For instance, in HIV/AIDS progression, one would like to forecast future values of CD4 white blood cell counts. In glaucoma, one would like to forecast the visual field index and thickness of the retinal nerve fiber layer.
- Can we describe intuitively how these health measurements evolve, both generally and for specific individuals? For instance, we may stratify CD4 counts in HIV/AIDS into states, and then describe how quickly patients move between those states (levels). Alternatively, we may say that on average, CD4 count decreases by some amount each month. Different model classes allow us to make different sorts of statements about how health measurements evolve. For instance, the first statement can be made very easily with a continuous-time Markov chain (see [1]), while the second is easy to make with a linear mixed effects model. Of course, if the assumptions underlying the model are violated, then the intuitive statements that come out of fitting it to data may be wrong.
- When will this person experience some event (death, dropout, next appointment, readmission)? Given a patient’s history of health measurements, can we say something about their risk of death or their risk of dropping out of a study? This is often the domain of survival analysis, but it’s very relevant to disease modeling as well. Can we predict when their next appointment will be, or whether they’ll have a hospital readmission?
- Is there a difference between two groups in progression? This is tricky in disease modeling. In a standard two-sample t-test, you assume iid data within each group and approximately Gaussian sample means, and test whether the two group means differ significantly. When you want to test a difference in progression, it’s really a difference in health measurement trajectories, so we first have to decide how we want to describe trajectories: do we assume some parametric form, that they follow some stochastic process, etc.?
- What can covariates tell us about progression? For instance, say we find that there is a difference between the treatment group and the control group in progression. We would like to describe the difference. What does the choice of treatment tell us about the health measurement dynamics in different regions of the data? This requires stronger assumptions than the previous question (testing for a group difference), because we must specify how the model parameters depend on treatment. If those assumptions hold, though, we can describe the difference rather than simply saying that there is one.
- What will an outcome variable be? You may want to do classification or regression of trajectories. For instance, given some sequence of measurements, can we predict whether this person will eventually be diagnosed with a specific disease (without modeling when)?
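
To make the two kinds of descriptive statement from the second question concrete, here is a minimal sketch in Python on a tiny hand-made dataset. Everything in it is an assumption for illustration: the patient data, the CD4 cut-offs, and the function names are hypothetical. It also swaps in cruder techniques than the ones named above. The "average decrease per month" is computed by averaging per-patient least-squares slopes (a two-stage stand-in for a linear mixed effects model), and the transition rates use the simple "number of transitions divided by time spent in the state" estimate, which treats each patient as staying in a state until the next visit. A proper fit to intermittently observed data would instead maximise a likelihood built from the matrix exponential of the rate matrix.

```python
import numpy as np

# Hypothetical toy data: patient -> (months since baseline, CD4 counts).
patients = {
    "A": ([0, 6, 12, 18], [520, 480, 430, 400]),
    "B": ([0, 6, 12, 18], [350, 340, 290, 260]),
    "C": ([0, 12, 24], [610, 560, 500]),
}


def average_monthly_slope(data):
    """Mean of per-patient least-squares slopes (CD4 change per month).

    A two-stage approximation, not a fitted linear mixed effects model.
    """
    slopes = []
    for times, values in data.values():
        slope, _intercept = np.polyfit(times, values, deg=1)
        slopes.append(slope)
    return float(np.mean(slopes))


def cd4_state(count):
    """Map a CD4 count to a state index using illustrative cut-offs."""
    if count >= 500:
        return 0
    if count >= 350:
        return 1
    return 2


def transition_rates(data, n_states=3):
    """Crude rate estimates: transitions i -> j per month spent in state i.

    Each interval between visits is attributed to the state observed at
    the start of the interval, which is an assumption of this sketch.
    """
    counts = np.zeros((n_states, n_states))
    time_in_state = np.zeros(n_states)
    for times, values in data.values():
        states = [cd4_state(v) for v in values]
        for k in range(len(states) - 1):
            i, j = states[k], states[k + 1]
            time_in_state[i] += times[k + 1] - times[k]
            if i != j:
                counts[i, j] += 1
    exposure = time_in_state[:, None]
    # Leave rates at zero for states that were never occupied.
    return np.divide(counts, exposure, out=np.zeros_like(counts),
                     where=exposure > 0)


slope = average_monthly_slope(patients)
rates = transition_rates(patients)
print(f"Average change in CD4 per month: {slope:.2f}")
print("Estimated transition rates per month (row = from, column = to):")
print(rates)
```

The first number supports a statement of the form "on average, CD4 count drops by this much each month", while the off-diagonal entries of the rate matrix support statements about how quickly patients move from one CD4 level to another. With so little data neither estimate means anything, and both inherit the caveat above: if the linear or Markov assumption is wrong, so is the intuitive statement.
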
[1] Shoko, Claris, and Delson Chikobvu. “Time-homogeneous Markov process for HIV/AIDS progression under a combination treatment therapy: cohort study, South Africa.” Theoretical Biology and Medical Modelling 15, no. 1 (2018): 3.