We present SurvivEHR, a foundation model for time-to-event prediction using Electronic Health Records (EHR), based on the Generative Pre-trained Transformer (GPT) architecture. The model is trained on 23 million patient records from the UK Clinical Practice Research Datalink (CPRD), encompassing longitudinal primary care data. In total, 7.6 billion recorded events across patient timelines are used, each represented as a tuple comprising: (i) a categorical event index (a unique combination of ICD-10 codes), (ii) an associated numerical value (e.g. a measurement), and (iii) the event time (days to/since birth).
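The event tuple described above can be illustrated with a minimal sketch. This is not the SurvivEHR implementation: the names `Event`, `build_timeline`, `vocab`, and all field names are hypothetical, and the toy records are made up rather than drawn from CPRD. It only shows one plausible way to hold an event index, an optional numerical value, and a time in days relative to birth, ordered along a patient timeline.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One recorded event on a patient timeline (hypothetical field names)."""
    event_index: int        # (i) categorical event index into a vocabulary
    value: Optional[float]  # (ii) associated numerical value, None if absent
    age_days: int           # (iii) event time in days relative to birth


def build_timeline(records, vocab, birth_date):
    """Convert (code, value, date) records into a time-ordered list of Events."""
    events = [
        Event(vocab[code], value, (when - birth_date).days)
        for code, value, when in records
    ]
    return sorted(events, key=lambda e: e.age_days)


# Toy, made-up data for illustration only.
vocab = {"DIAGNOSIS_A": 0, "MEASUREMENT_B": 1}
records = [
    ("MEASUREMENT_B", 7.2, date(2020, 3, 1)),
    ("DIAGNOSIS_A", None, date(2015, 6, 15)),
]
timeline = build_timeline(records, vocab, birth_date=date(1960, 6, 15))
```

Storing time as an integer offset from birth, rather than a calendar date, keeps each timeline on a common age axis; whether the actual model encodes time this way beyond what the text states is an assumption here.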
SurvivEHR follows a pretrain-finetune paradigm: it first learns generalisable clinical representations from large-scale EHR data, and is then fine-tuned for specific prediction tasks. This enables SurvivEHR to perform time-to-event forecasting, providing personalised risk forecasts for future diagnoses, measurements, tests, and death. Using a number of case studies, we further demonstrate that SurvivEHR supports strong transfer learning and can be used as a foundation model for clinical prediction modelling.
This work is motivated by the growing burden of Multiple Long-Term Conditions (MLTCs), also referred to as multimorbidity, as the number of individuals living with two or more chronic conditions continues to rise. This rise is largely driven by an ageing population and by advances in medical care that have extended life expectancy, resulting in more people living longer with chronic diseases. MLTCs are associated with poorer health outcomes, reduced quality of life, increased healthcare costs, and higher rates of hospitalisation and mortality.