Background
One of the cornerstones of health and clinical research is to identify individuals who have a high risk of developing an adverse outcome over a specific time period, so that they can be targeted for early preventative strategies and possibly treatment. For example individuals who are seemingly healthy but are found to have a high risk of developing cardiovascular disease could be recommended to modify their lifestyle and behaviour (e.g. smoking, exercise, eating habits) to reduce their future risk. They may also be prioritised for clinical investigation, which could lead to early diagnosis of an underlying condition (e.g. diabetes, high blood pressure) and preventative treatment (e.g. statins or aspirin) to manage it.
For the purpose of prognostic risk assessment there is a growing interest in risk prediction modelling [1-3], where a statistical model is used to estimate the risk of future outcomes for individuals based on one or more underlying characteristics. When considering future outcomes in patients, a risk prediction model is often referred to as a prognostic model (typically used for outcome risk for a defined disease) or more generally a clinical prediction model (used for both diseased and non-diseased settings). Similarly the word ‘model’ is often replaced with ‘score’, ‘tool’, ‘index’, or ‘rule’. However, the same principle remains: to accurately predict the risk of future occurrence of an outcome in an individual by utilising the values or levels of multiple individual characteristics. We refer here to such characteristics simply as predictors, but they are also termed prognostic factors, risk factors, prognostic variables, and prognostic markers [4]. They often include standard features such as age, sex, smoking and family history, but also increasingly include more complex clinical measures such as biomarkers, relating to a diverse range of measurable biological (including genomic), pathological, imaging, clinical, and physiological variables.
Diagnostic risk prediction models also exist, where the risk of already having a disease is calculated; however, the focus in this article is on predicting the risk of future outcomes. Unless the outcome prediction relates to the very near future (e.g. risk of hypocalcaemia within 48 hours after thyroidectomy [5]), single predictors usually do not provide accurate predictions at the individual-level [4]. For this reason risk prediction models usually utilise multiple predictors in combination. For example, in healthy women the probability of developing breast cancer can be estimated from the Gail model, which is a risk prediction model combining information on family history, age, age at first live birth, age at menarche, breast biopsy number, and menopause [6,7]. In women with newly diagnosed breast cancer, a well-known risk prediction model is the Nottingham Prognostic Index (NPI) [8], which gives a score that relates to the survival probability and is based on a combination of tumour grade, number of involved lymph nodes, and tumour size.
Before evaluation of its impact in daily practice [3,9,10], risk prediction model research has two main phases: model development (including internal validation using the same data or data source) and external validation (using new data from a different data source) [2,11,12]. Validation requires demonstrating that the model is accurate in the population of individuals for whom it is intended. It must ascertain the model’s ability to distinguish between patients with different outcomes (‘discrimination’) and show the agreement between predicted and observed risks in groups of individuals with similar risk predictions (‘calibration’) [1]. Importantly, validation must go beyond the set of data and individuals that were used to develop the model, because predictive performance when estimated on the development data is often optimistic, related to multiple testing with a limited sample size [1,13,14]. Validation is therefore needed in individuals not used in the development process and preferably selected from different settings (external validation) [15].
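As a minimal illustration of the two performance measures described above, the following Python sketch computes a concordance (C) statistic for discrimination and an observed/expected ratio as a crude overall summary of calibration. The function names and the toy data are illustrative assumptions and are not taken from any cited article; in practice calibration would usually also be examined within groups of individuals with similar predicted risk.

```python
def c_statistic(risks, outcomes):
    """Discrimination: probability that a randomly chosen individual with the
    outcome has a higher predicted risk than one without (ties count 0.5)."""
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    non_events = [r for r, y in zip(risks, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                score += 1.0
            elif e == n:
                score += 0.5
    return score / (len(events) * len(non_events))


def observed_expected_ratio(risks, outcomes):
    """Overall calibration: observed events divided by expected events,
    where expected is the sum of the predicted risks (1.0 = agreement on average)."""
    return sum(outcomes) / sum(risks)


# Hypothetical validation data: predicted risks and observed binary outcomes.
predicted = [0.1, 0.2, 0.3, 0.6, 0.8]
observed = [0, 0, 1, 0, 1]

print("C statistic:", round(c_statistic(predicted, observed), 3))
print("O/E ratio:", round(observed_expected_ratio(predicted, observed), 3))
```

A C statistic of 0.5 indicates no discrimination beyond chance and 1.0 indicates perfect separation; an observed/expected ratio above 1 suggests that the model under-predicts risk on average in the validation data.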
Unfortunately most publications on risk prediction models describe model development, and only a small number report external validation studies [3]. This might be a key reason why, despite many being developed, relatively few models are actually being adopted in practice. The collation and synthesis of individual participant data (IPD) from multiple studies offers a novel and natural opportunity to overcome this current lack of validation [16]. For example, models could be developed using data from a subset of studies and assessed on data from the remaining studies [17]. Variation in model accuracy across studies and its causes could also be explored. The approach would also unite researchers, increasing sample sizes and encouraging a consensus towards a single well-developed and validated prognostic model, rather than a number of competing and non-validated models for the same clinical question. For example, the IMPACT (International Mission for Prognosis and Analysis of Clinical Trials) consortium developed a prediction model for mortality and unfavourable outcome in traumatic brain injury by sharing IPD from 11 studies (8509 patients), with successful external validation using IPD from another large study (6681 patients) [18].
IPD meta-analysis in this context can also go beyond using IPD from multiple studies, and more broadly consider synthesising IPD from any relevant clusters in the wider population of interest. For example, large electronic databases and registries are increasingly available that contain routinely collected patient records and risk factor measurements, which can be linked to health outcomes using, for example, Hospital Episode Statistics (HES) linkage. An example is the THIN database [19], which contains anonymised patient records and risk factor information from millions of patients collected from over 500 general practices in the UK [20]. Such databases inevitably contain clustering of patients, for example within practices, hospitals and countries, and so an IPD meta-analysis could account for such clustering, for example by developing a model using data from a subset of the clusters (e.g. hospitals, practices), followed by external validation on the remainder.
The aim of this article is to perform a qualitative review to examine how researchers are developing and validating risk prediction models when IPD from multiple studies are sought and then combined for this purpose. The aim is to identify the current research standards and techniques; the role of IPD meta-analysis methods toward development and validation; and the common challenges and methodological problems researchers face. This allows us to generate a set of recommendations for how research in this area can be improved, and to flag those methodological techniques and issues researchers should recognise when modelling risk prediction using multiple sources of IPD.
Methods
Our review aimed to identify and then evaluate published articles that developed and/or validated a risk prediction model using IPD from multiple studies. We now describe our review methods in detail.
Identifying potentially relevant articles
To identify potentially relevant articles, we used an existing database of 385 IPD meta-analysis articles that was formed using a systematic review to identify all IPD meta-analyses (on any topic) published up to March 2009 [21]. The review searched Medline, Embase and the Cochrane library using a search strategy described elsewhere [22], and defined an ‘IPD meta-analysis article’ as one seeking, obtaining and then synthesising raw patient-level data across multiple studies or multiple collaborating groups. The articles in the database were published from 1991 to 2009, and it forms the largest collection of IPD meta-analyses currently available.
Note that our aim was not to review an exhaustive set of all risk prediction research using IPD from multiple studies, but rather to identify the main methods, limitations and challenges therein. Qualitatively we felt we would achieve saturation with the existing database, and therefore we did not consider it necessary to update our review with articles published since 2009.
Inclusion and exclusion criteria
A relevant article was defined as one which sought and then used IPD from multiple studies to develop and/or validate a risk prediction model based on one or more predictors. There were no restrictions on the type of outcome being predicted or baseline disease/health of the patients under investigation, or the types of study (observational studies, randomised trials etc.) being utilised. We use ‘studies’ here loosely to refer to different research sources, and so it could therefore relate to different research centres or collaborating groups. However, there needed to have been a clear step for obtaining IPD from the multiple sources. We did not include articles that used an existing database already containing the multiple sources (e.g. practices). Though the analysis within such articles has similar issues, we wanted to focus on the broader picture of firstly obtaining and then analysing IPD from the multiple sources (the typical framework for an IPD meta-analysis).
Similarly ‘model’ is loosely used for any developed equation, tool, or classification approach that allowed an individual’s risk to be predicted. Articles that evaluated one or more factors for their association with outcome but not their ability to predict individual outcome risk were excluded; for example prognostic factor, risk factor and causal factor studies were excluded if they only considered factors in relation to relative risk (e.g. hazard ratios, odds ratios) and not also absolute risk (e.g. probability of death by one year).
IA screened the abstracts and titles of each of the 385 articles and classified them with regard to their risk prediction model status as either ‘yes’, ‘unsure’, or ‘no’. TD then also independently classified each article as ‘yes’, ‘unsure’ or ‘no’. Finally RR checked all the ‘yes’ and ‘unsure’ articles and a random 10% of the ‘no’ articles, along with any ‘no’ article containing the words ‘prognostic model’ or ‘prognostic index’ or ‘prediction rule’ or ‘prediction model’ or ‘risk model’. Any discrepancies between the three reviewers were resolved through discussion and by obtaining the full papers. Any article deemed ‘yes’ or ‘unsure’ after this screening was then obtained and read in full by IA and TD, and a final set of relevant articles decided upon. Any discrepancies were checked by RR and a final decision was then made.
Data extraction and in-depth evaluation of articles
Each article finally classed as a ‘yes’ was used for in-depth evaluation. A data extraction form was developed that included over 70 questions (see Additional file 1). These questions covered the rationale, conduct, analysis, reporting, and feasibility of a project developing and/or validating a risk prediction model using IPD from multiple studies. A summary of the questions used is as follows:
Background and objectives: e.g. researchers’ location, year of publication, research aim, the baseline condition of patients, and the outcome to be predicted.
Identifying IPD studies: e.g. how the relevant studies for inclusion were identified, what types of studies were included, whether the targeted number of patients or studies was explained or justified statistically, etc.
Obtaining IPD: e.g. how authors asked for and obtained IPD, what proportion of IPD requested was actually obtained, whether study quality was considered, etc.
Missing data: e.g. whether there were any missing data, either at the patient-level or study-level, and if so how they were handled.
Model development: e.g. the statistical methods used to develop the risk prediction model, how data coming from multiple studies was handled, whether heterogeneity between studies was considered, how continuous predictors were handled, and whether the final model was fully presented.
Model validation: e.g. the statistical methods and criteria used for (internally or externally) validating the prediction model, how multiple studies were handled in this process, etc.
Potential for bias: e.g. potential impact of studies not willing/able to provide IPD on the estimates of model performance (e.g. in terms of calibration and discrimination) for the intended model’s use and target population.
Conclusions: e.g. the key conclusions and recommendations, and the limitations and problems discussed.
IA read each article in full and extracted information that answered these 77 questions, and then TD also independently answered these questions for each article. Any discrepancies in responses were resolved with RR.
Discussion
Risk prediction models have the potential to inform strategies for disease prevention, early diagnosis, patient counselling and therapeutic care [1]. As for all clinical practice, their use should be evidence-based. In particular, there must be consistent evidence that a model is reliable and applicable to the intended populations of individuals [3]. This ideally requires the model to be successfully validated in multiple datasets external to the model development phase. This can often take many years to achieve. However, with increasing access to the IPD from existing studies or large databases, there is a growing opportunity to both develop and validate risk prediction models simultaneously, within an IPD meta-analysis framework [16].
In this article we have reviewed 15 articles that each used IPD from multiple studies to develop and potentially validate a risk prediction model for the development of future outcomes. This has allowed us to identify good practice, useful statistical methods, methodological challenges (Table 2), and some limitations in current reporting and methodology to be addressed (Table 3). We recognise that our review also has limitations. Firstly, it only covers articles published up to 2009, which was a restriction on the database we used; however we feel that our findings are unlikely to be different if the review were updated, and that qualitative saturation of issues and concepts had been achieved with our sample. Secondly, we focused on articles that utilise IPD obtained from multiple studies (or sources), and did not consider articles which used a single database containing clustering (e.g. by practice); such articles raise similar issues for the analysis, but we wanted to focus on the typical IPD meta-analysis scenario where IPD studies are obtained and then synthesised. Thirdly, by evaluating published articles we recognise our findings are clearly dependent on the reporting standards within the articles; thus any apparent research deficiencies or methodological gaps may just reflect poor reporting standards. Nevertheless, we believe our review and its findings will help inform those who wish to develop or validate a model using IPD from multiple studies in the future. In particular, our work allows us to provide some key recommendations for improving the design, conduct and reporting of future research in this field (Table 3), which echo those for IPD meta-analysis of prognostic factor [37] and modelling [16] studies, and concur with previous [38,39] and ongoing [40] work toward improved reporting of risk prediction model articles in general. For brevity we do not discuss all of these now, but rather focus in detail on the two most important recommendations.
Table 3
Recommendations for improved research when developing and validating risk prediction models from multiple studies
Rationale and initiation
• Produce a protocol for the project, detailing rationale, conduct and statistical analysis, and reference this
Obtaining IPD
• Report how the primary study authors were approached for their IPD
• Report strategy used to identify relevant studies, e.g. literature review/collaborative group
• If literature review performed, then report search strategy, including keywords and databases used
• Provide a flowchart showing the search strategy, classification of identified articles, and retrieval of IPD from relevant studies
• Report any prior sample size considerations used, such as the number of IPD studies deemed necessary and the number of patients and events required. If no sample size requirements were considered, report this also
Details of IPD
• Report the number of patients and events for each study used in model development and/or validation
• Report the missing data for each study (e.g. whether predictors were missing entirely, or how many patients had predictor values missing), and whether some patients or studies were entirely excluded for this reason
• Detail the reasons why IPD was unavailable in some desired studies (if applicable), and report the number of patients and events from these studies
• If any studies were excluded after IPD was obtained, provide the number of studies excluded and explain why they were removed (e.g. missing predictors, different outcome definition, different methods of measurement)
• Compare and report the quality of studies for which IPD was obtained
Statistical methods for model development
• Account for clustering of patients within studies, for example by allowing for a separate intercept per study
• Report the selection criteria and procedure used to decide which predictors are included in the final model
• Assess and report any between-study heterogeneity in the effects of included predictors
• If large heterogeneity does exist in particular predictors, then try to reduce it by including more predictors or simply focus on including homogeneous or weakly heterogeneous factors
• Where possible model continuous predictors on their continuous scale, unless it is important to categorise with good clinical or statistical reason
• Report the final developed model in original format with alpha (baseline risk) and beta estimates, so that others can ascertain how to apply the model in practice
• Detail how missing patient-level data and missing study-level factors were dealt with in the analysis
Model validation and implementation
• Validate the developed model using internal-external cross-validation, although we tentatively suggest that at least four studies are required for this approach
• Explain the choice of intercept (baseline hazard) to be used when implementing the model in the excluded study
• Report validation statistics for each study excluded in the internal-external cross-validation method
• Report clearly whether there is evidence the model performs consistently well during the internal-external validation
• If it performs consistently well, clearly report the final overall prediction model to be used in practice, and emphasise again how the intercept should be chosen upon application
• If it does not perform consistently well, clearly flag those populations for which the model cannot be applied and draw attention to the model’s lack of generalisability
Impact of missing IPD studies
• If possible, compare the populations of those studies not providing IPD to those studies providing IPD, to be able to understand whether the developed model may need further generalisation in such populations in the future
Recommendation 1: allow for different baseline risks in each of the IPD studies
In our review, 10 of the 15 articles did not account for clustering of patients within different IPD studies and therefore their developed prediction model did not allow for any study differences in baseline risk. Although such models can still perform adequately on average (that is, across all studies combined), when applied in practice to particular populations the model performance may deteriorate considerably if the population’s baseline risk is very different from the average estimated during model development. In other words, the developed model may require re-calibration in particular populations. Statistically it is also known that omission of an important predictor (i.e. study) can lead to biased effect estimates and reduced power [36]. To address this, Debray et al. [16] recommend that the prediction model should be developed with a separate intercept (baseline risk) per study, and then the model’s performance can be examined using internal-external validation alongside a strategy for choosing the intercept upon application to the excluded study. Such strategies may include using external knowledge of the intercept in the excluded study population; using the intercept as estimated from the IPD from the excluded study; or taking the intercept estimated from a study used in the model development that contains a similar population to the excluded study. The latter strategy is recommended within Steyerberg et al. [18], where they propose others apply their model using the intercept for one particular trial in their analysis, as this trial is reflective of the population intended.
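The intercept strategies listed above can be sketched for a logistic prediction model as follows. This is a minimal illustration under stated assumptions: the study names, intercepts, predictor effects and data are all hypothetical, and the helper functions are written for this example rather than taken from Debray et al. or Steyerberg et al. The sketch holds the predictor effects fixed and either borrows the intercept of a similar development study or re-estimates the intercept from the new population's own data, which for a logistic model amounts to choosing the intercept that makes the expected number of events equal the observed number.

```python
import math


def expit(z):
    """Inverse logit: convert a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_predictor(betas, x):
    return sum(b * v for b, v in zip(betas, x))


def predict_risk(intercept, betas, x):
    return expit(intercept + linear_predictor(betas, x))


def reestimate_intercept(betas, X, y, lo=-20.0, hi=20.0, tol=1e-10):
    """Re-estimate only the intercept, keeping the predictor effects fixed.
    Bisection solves: sum of predicted risks == observed number of events."""
    lps = [linear_predictor(betas, x) for x in X]
    observed = sum(y)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        expected = sum(expit(mid + lp) for lp in lps)
        if expected < observed:
            lo = mid  # intercept too low: model under-predicts
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical developed model: one intercept per study, common predictor effects.
study_intercepts = {"study_A": -2.5, "study_B": -1.8, "study_C": -3.1}
betas = [0.04, 0.7]  # assumed: years of age above 50; current smoker (1/0)

# Hypothetical IPD from a new (excluded) population.
X_new = [[5, 0], [10, 1], [0, 0], [20, 1], [15, 0], [2, 1]]
y_new = [0, 1, 0, 1, 0, 0]

# Strategy: borrow the intercept of a development study with a similar population.
borrowed = study_intercepts["study_B"]
# Strategy: estimate the intercept from the new population's own IPD.
updated = reestimate_intercept(betas, X_new, y_new)

print("risk with borrowed intercept:", round(predict_risk(borrowed, betas, X_new[1]), 3))
print("risk with updated intercept:", round(predict_risk(updated, betas, X_new[1]), 3))
```

The remaining strategy, using external knowledge of the baseline risk, would simply supply the intercept directly in place of `borrowed`.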
Where the intercept can be well-matched to the excluded study, Debray et al. [16] show that their framework allows an IPD meta-analysis to produce a single, integrated prediction model that can be implemented in practice and has improved model performance and generalisability. This echoes other recommendations to account for clustering in an IPD meta-analysis [36]. For survival data, this means that the baseline hazard should be modelled during model development and so researchers should move away from using Cox regression (a common approach in the articles in our review, but one which does not estimate the baseline hazard) and rather use other approaches such as flexible parametric methods, like the Royston-Parmar model that estimates the baseline hazard using restricted cubic splines [41,42].
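To make the contrast concrete, the two model forms can be written in generic notation as below. This is a sketch only: the subscript j for study and the symbols h_{0j}, H_{0j}, s_j and gamma_j are notational assumptions introduced here for illustration, and allowing a separate baseline per study is one possible way to account for clustering rather than a specification taken from the reviewed articles.

```latex
% Cox model stratified by study j: the baseline hazard h_{0j}(t) is left unspecified
\[ h_j(t \mid x) = h_{0j}(t)\,\exp\!\left(x^{\top}\beta\right) \]

% Flexible parametric (Royston-Parmar) proportional hazards model:
% the log cumulative baseline hazard is a restricted cubic spline s_j in ln t
\[ \ln H_j(t \mid x) = \ln H_{0j}(t) + x^{\top}\beta
   = s_j(\ln t;\,\gamma_j) + x^{\top}\beta \]

% Absolute risk of the outcome by time t, available once s_j has been estimated
\[ 1 - S_j(t \mid x) = 1 - \exp\!\left\{-\exp\!\left[s_j(\ln t;\,\gamma_j) + x^{\top}\beta\right]\right\} \]
```

The last line shows why an estimate of the baseline is needed for risk prediction: the predictor effects alone give only relative risks, whereas an individual's absolute risk by time t also depends on the baseline cumulative hazard.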
A similar recommendation is for researchers to examine between-study heterogeneity in predictor effects. This is currently rarely done. Debray et al. [16] show that a model’s performance is likely to be more consistent if there is little or no heterogeneity in the effect of the predictors included. Researchers should examine heterogeneity and prioritise inclusion of homogeneous and only weakly heterogeneous predictors, or attempt to include interaction terms or additional predictors that reduce heterogeneity in others. The choice of predictors and their specification in the model (e.g. with transformation, and/or with linear or non-linear trends) is a complex issue, and statistical software procedures to integrate such decisions in the context of an IPD meta-analysis would be very helpful.
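One simple way to examine between-study heterogeneity in a predictor effect is a two-stage approach: estimate the effect (e.g. a log odds ratio) separately in each study, then summarise the variation across studies. The Python sketch below computes Cochran's Q, the I-squared statistic and the DerSimonian-Laird estimate of the between-study variance. It is an illustrative assumption about how such a check might be implemented, not the procedure used by Debray et al., and the estimates and standard errors shown are hypothetical.

```python
def heterogeneity(estimates, std_errors):
    """Summarise between-study heterogeneity in one predictor effect.

    estimates  : study-specific effect estimates (e.g. log odds ratios)
    std_errors : their standard errors
    """
    weights = [1.0 / se ** 2 for se in std_errors]  # inverse-variance weights
    total_w = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_w
    # Cochran's Q: weighted squared deviations from the fixed-effect pooled estimate
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))
    df = len(estimates) - 1
    # I-squared: proportion of total variation attributed to heterogeneity
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    # DerSimonian-Laird moment estimate of the between-study variance
    c = total_w - sum(w ** 2 for w in weights) / total_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    return {"pooled": pooled, "Q": q, "df": df, "I2": i2, "tau2": tau2}


# Hypothetical log odds ratios for one predictor, estimated in four studies.
log_odds_ratios = [0.35, 0.42, 0.10, 0.61]
standard_errors = [0.12, 0.15, 0.10, 0.20]

summary = heterogeneity(log_odds_ratios, standard_errors)
for name, value in summary.items():
    print(name, round(value, 3))
```

A predictor with an I-squared close to zero would be a natural candidate for inclusion, whereas a large value might prompt a search for interaction terms or additional predictors that explain the variation.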
Recommendation 2: implement a framework that uses internal-external cross-validation
A major finding from our review is that, despite the availability of multiple studies, most researchers develop their model by using the IPD from all available studies, and so then perform an internal validation (on the same set of data) rather than an external validation (on different data). Only two of the 15 articles used a form of external validation, and so most models require further validation in order to investigate their true performance. One plausible reason why researchers choose not to use IPD for external validation is that they want to maximise the data available for model development; this is understandable, especially when faced with a large set of candidate predictors and possible non-linear relationships. Furthermore, even if researchers do decide to hold-back some IPD for external validation, it is not easy to decide how much IPD (and how many studies) should be removed.
For these reasons, the internal-external cross-validation approach is highly appealing [16,17], yet seemingly under-utilised. It was used by Steyerberg et al. [18] and Yap et al. [35] in the articles we reviewed up to 2009. We also performed a citation search of the Royston et al. [17] article that proposed the method. By the end of 2012 (based on abstracts of citations identified) there were still only nine citing articles (including the aforementioned Yap and Steyerberg) that developed a prediction model.
The internal-external cross-validation approach involves removing just one study from the development phase of the model, fitting the model on the remaining IPD, and then testing performance in the excluded study. This process is then repeated by rotating the omitted study and assessing the validation in all the possible scenarios. Model estimates are therefore always based on the majority of IPD, and the model’s fit and predictive ability can be appraised across all the studies simultaneously. Where performance is consistently adequate across all combinations of the omitted study, a final step can be to utilise the IPD from all studies to produce the final specification of the model. In situations where model fit appears inadequate in some excluded studies, this identifies a lack of generalisability and highlights populations (studies) for which the model is not currently suitable. It also signals the need to improve the current model specification, for example by including additional predictors with homogeneous effects across studies.
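The rotation described above can be sketched in a few lines of Python. To keep the example self-contained, the "model" here is deliberately simplistic (the observed outcome proportion within each category of a single binary predictor) and performance is summarised by an observed/expected ratio; both are illustrative assumptions standing in for a real regression model and a fuller set of validation statistics. All function names and data are hypothetical.

```python
def fit_risk_table(rows):
    """Toy 'model': outcome proportion within each predictor category.
    rows is a list of (predictor_category, outcome) pairs."""
    totals, events = {}, {}
    for x, y in rows:
        totals[x] = totals.get(x, 0) + 1
        events[x] = events.get(x, 0) + y
    overall = sum(events.values()) / sum(totals.values())
    table = {x: events[x] / totals[x] for x in totals}
    return table, overall


def observed_expected(model, rows):
    """Validation statistic: observed events / sum of predicted risks."""
    table, overall = model
    # Categories unseen during development fall back to the overall risk.
    expected = sum(table.get(x, overall) for x, _ in rows)
    observed = sum(y for _, y in rows)
    return observed / expected


def internal_external_cv(studies, fit, evaluate):
    """Omit each study in turn, fit on the rest, validate in the omitted study."""
    results = {}
    for held_out in studies:
        development = [row for name, rows in studies.items()
                       if name != held_out for row in rows]
        model = fit(development)
        results[held_out] = evaluate(model, studies[held_out])
    return results


# Hypothetical IPD: (binary predictor, binary outcome) per patient in four studies.
studies = {
    "study_1": [(0, 0), (0, 0), (0, 1), (1, 1), (1, 0), (1, 1)],
    "study_2": [(0, 0), (0, 0), (0, 0), (1, 1), (1, 0), (1, 0)],
    "study_3": [(0, 0), (0, 1), (0, 0), (1, 1), (1, 1), (1, 0)],
    "study_4": [(0, 1), (0, 1), (0, 0), (1, 1), (1, 1), (1, 1)],
}

performance = internal_external_cv(studies, fit_risk_table, observed_expected)
for name, oe in performance.items():
    print(name, "O/E =", round(oe, 2))

# If performance is consistently adequate, refit using the IPD from all studies.
final_model = fit_risk_table([row for rows in studies.values() for row in rows])
```

Passing the fitting and evaluation steps as functions keeps the rotation itself separate from the modelling choices, so the same loop could in principle wrap a logistic or flexible parametric survival model together with study-specific validation statistics.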
Royston et al. [17] originally proposed the internal-external validation approach for survival data, within a framework to construct and validate a prognostic survival model from an IPD meta-analysis [12]. This framework allowed them to evaluate whether derived models have good prognostic separation in independent studies and whether the baseline survival distribution is heterogeneous across studies. Afterwards, a single final model was derived from all available IPD using flexible parametric proportional hazards (PH) modelling techniques. This article was cited only 26 times between 2004 and 2012, again showing the general lack of uptake of this method. We hope this will change as researchers become more aware of its usefulness. Debray et al. [16] recently extended the methodological framework to binary outcomes where logistic regression models are used to develop the risk prediction model.
How many IPD studies are needed and should all available IPD be used?
We recognise that implementing these two recommendations may be difficult given only a few IPD studies and/or when the number of patients per study is small. In particular, there is a need for methodological research on the necessary sample size to implement the internal-external approach, in terms of both the total number of IPD studies needed (Debray et al. [16] tentatively suggest at least four or five) and the sample size (number of patients and events) within those studies that are excluded from model development. With small sample sizes, internal-external validation may not be plausible; for example, Nieder et al. [28] only obtained IPD for 40 patients in total across eight case series reports. However, such small sample sizes are perhaps less likely to be an issue when using data from trials, prospectively planned cohort studies, or large electronic databases.
Given a set of available IPD studies, researchers should begin by identifying those that are relevant and most reliable for the clinical question at hand. It may be entirely sensible to exclude some available IPD for a variety of reasons [16]. For example, researchers should evaluate the outcome definitions used in each study, and exclude studies that are not consistent (and cannot be made consistent) with others. They should also check whether important predictors are recorded in each study, and evaluate the amount of missing data for available predictors; though multiple imputation can limit these issues, studies with multiple missing predictors or large proportions of missing values might be best excluded for robustness (unless imputation assumptions can be justified). Studies may also be removed if they have crucial differences in aspects such as: the method of measuring an important predictor; the treatment and healthcare patients received during follow-up; and the start-point (baseline) entry of patients to the study (e.g. one year after treatment, rather than at the time of diagnosis). If ignored, such differences might limit clinical interpretability of any model produced and reduce its performance. The number of studies excluded, and the decisions that lead to this, should always be transparently reported upon publication (Table 3).
A further issue arises when IPD are not available for all desired studies. This raises the threat of availability bias, where studies not providing their IPD are potentially different to those that do provide IPD [43]. For a traditional meta-analysis of treatment effects, this can cause bias in the summary meta-analysis result [43]. In the context of risk prediction models, it might cause the developed model to be unreliable in those populations (studies) which did not provide their IPD, and therefore reduce its usefulness in practice until further validation studies can be undertaken in these populations.
Finally, an issue rarely considered is whether ethical approval is required to collate IPD from multiple studies. Only six of the 15 articles mentioned that they had ethical approval. One might argue that ethical approval is not required when IPD is being used in accordance with the original objectives of the studies involved (that is, to understand and improve the prognosis of patients). We hope that most ethics committees will support this view, but it should at least be checked that the studies providing IPD actually had ethics approval themselves.
Conclusions
It is paramount to consider statistical and methodological issues when planning to develop and/or validate a risk prediction model from multiple datasets, in order to avoid poorly generalisable and poorly performing models. Our review highlights that the IPD meta-analysis approach is highly appealing, as it allows the use of internal-external cross-validation to develop a model and simultaneously evaluate its performance across multiple populations. However, researchers are faced with numerous challenges when the IPD is collated, in particular missing data and heterogeneity in study quality and methods of measurement. Perhaps an ideal way forward is a prospective IPD meta-analysis, where researchers agree at their study onset to use set quality standards and record particular variables in a common way so that, upon their own study completion, they can supply their IPD to those developing/validating a risk prediction model. Heterogeneity can then be limited by researchers agreeing, before data collection, to standardise predictor definitions, measurement methods, and outcome recording.
Our review has only considered articles that developed or validated a prediction model. For risk prediction models to become more common in practice, research also needs to show they have a positive impact on health outcomes. Such impact studies are currently rare [3], but they should follow any IPD meta-analysis that develops and validates an accurate risk prediction model.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
RR and KM conceived the study. IA developed the questions for the data extraction form, which were then refined and extended by all authors. IA and TD independently extracted information from the articles to answer each question. IA collated responses, and resolved any disagreements with TD and RR. IA wrote the results up in full for his thesis, including tables and figures. RR converted this chapter into the form presented in this article, and this was edited and added to by KM, TD and IA. All authors contributed to revisions of the article in response to reviewers’ comments. All authors read and approved the final manuscript.