Comparison with previous literature
Our findings are in agreement with previous studies, which also found that the Framingham prediction models overestimate the risk of CHD in the general population [12, 13] and that (overall) calibration is better in the USA than in European populations [14]. Furthermore, the overestimation of risk was more pronounced in more recent study populations than in earlier ones [12, 13]. A recent study found that the PCE overestimate the risk of CVD [15], and in another study, the better discrimination seen in women was attributed to a stronger association between risk factors and CVD in women than in men [72]. In addition to these previous reviews, we compared the predictive performance of the three prediction models that are currently included in the medical guideline and that are most often evaluated in external validation studies, and found that these models have similar performance. Further, we performed an extensive evaluation of factors possibly associated with heterogeneity in predictive performance and found that predictive performance improved when these existing models were updated.
Reasons for overprediction
There could be several reasons for the observed overprediction. These have also been discussed extensively before with regard to the PCE [15, 73]. First, differences in eligibility criteria (e.g. the exclusion of participants with previous CVD events) across validation studies may have affected calibration. Second, the three prediction models were (partly) developed using data from the 1970s, and treatment of people at high risk of a CVD event has changed considerably since then, for example with the introduction of statins in 1987 [74]. The increased use over time of effective treatments aimed at preventing CVD events will have lowered the observed number of events in more recent validation studies, resulting in overestimation of risk in these validation populations [39, 75, 76]. This would also explain why overprediction was most pronounced in high-risk individuals and why we found more overprediction in studies with higher mean total cholesterol levels. We hypothesized that the degree of overprediction would increase over the years [12, 13, 77]; however, this could not be confirmed statistically. About one third of validations of the PCE excluded participants receiving treatment to lower CVD risk at baseline, but we found no difference in performance between validations that did and did not exclude these participants. However, as the use of risk-lowering medication during follow-up was rarely reported in these studies, we cannot rule out an effect of incident treatment use on model performance [76]. Following the recently issued Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) guideline [78, 79] and the guidance on adjusting for treatment use in prediction modelling studies [75, 76, 80], we strongly recommend that investigators of future prediction model studies record the use of treatment during follow-up. Third, in agreement with previously published reviews [12-15], we found more overestimation of risk in European populations than in those of the USA, whereas underestimation was seen in some Asian populations. Both findings suggest that differences between these populations in, for example, unmeasured CVD risk factors and the use of preventive CVD strategies (e.g. medical treatment or lifestyle programs) are responsible. Unfortunately, not enough information was available to study the role of ethnicity in the heterogeneity of the models' predictive accuracy. Finally, rather than overprediction by the models, there could also be issues in the design of the external validation studies that lower the number of identified events. Underascertainment or misclassification of outcome events, unusually high rates of people receiving treatment, short follow-up duration, and inclusion of ethnicities not represented in the development of the models have all been mentioned as reasons for the overprediction we observed [57, 81-84]. Outcome definitions can also differ between the development and validation studies. This seems to be a particular problem for the Framingham Wilson model, as half of the validation studies did not include angina in their outcome definition. Researchers have shown, however, that overestimation can often not be fully explained by treatment use and misclassified outcome events [39, 85].
Implications for practice and research
According to the ACC-AHA guidelines [2], risk-lowering treatment is considered in people 40–75 years old, without diabetes, with LDL cholesterol levels between 70 and 189 mg/dl and with a 10-year predicted risk of CVD ≥ 7.5%. After a discussion between clinician and patient about adverse effects and patient preferences, it is decided whether risk-lowering treatment is initiated. The overprediction observed in this review is problematic as it might change the population eligible for risk-lowering treatment. Unfortunately, this is true for all three CVD risk prediction models. As the meta-analysis indicates that overprediction does not occur consistently across different settings and populations, there is no simple solution to this problem. From the studies that provided data on calibration in subgroups, we found that overestimation was more pronounced in high-risk individuals. When the (over)estimated absolute risk is already beyond the treatment probability threshold, it will not influence treatment decisions. However, overestimation of CVD risk might still influence the intensity (dose and frequency) of administered treatments and affect the patient's behaviour. Although excessive risk estimates could stimulate patients to adopt a healthier lifestyle (similarly to patients with more risk factors [86]), they could also cause unnecessary anxiety about future cardiovascular events. For people at lower risk, however, overestimation might push their predicted risk across the treatment probability boundary even though their actual risk is below it. If we assume that the observed risks reported in the validation studies were not influenced by factors that changed during follow-up, such as treatment use, then in 82% of individuals the average predicted and observed percentages were both below or both above the treatment boundary of 7.5% (i.e. concordant points); in 1% of individuals, the predicted risk was below 7.5% and the observed risk above 7.5%, which would, on average, lead to undertreatment; and in 17% of individuals, the average predicted risk was above 7.5% while the average observed risk was below 7.5%, which would, on average, lead to overtreatment. Although this does not mean that 17% of individuals will receive unnecessary treatment, it is important that doctors realize that some patients may receive unnecessary treatment, resulting in adverse effects, patient burden, and extra costs of treatment and monitoring.
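The concordance tally above can be illustrated with a minimal sketch. The threshold is the ACC-AHA value of 7.5% quoted above; the (predicted, observed) risk pairs are hypothetical, not data from this review:

```python
# Classify (predicted, observed) 10-year risk pairs against the
# ACC-AHA treatment threshold of 7.5%. The example pairs below are
# hypothetical illustrations.
THRESHOLD = 0.075

def classify(predicted, observed, threshold=THRESHOLD):
    """Return 'concordant', 'undertreatment' or 'overtreatment'."""
    pred_high = predicted >= threshold
    obs_high = observed >= threshold
    if pred_high == obs_high:
        return "concordant"      # same treatment decision either way
    if pred_high and not obs_high:
        return "overtreatment"   # model says treat, true risk is below threshold
    return "undertreatment"      # model says don't treat, true risk is above it

pairs = [(0.05, 0.04), (0.10, 0.09), (0.12, 0.06), (0.06, 0.08)]
print([classify(p, o) for p, o in pairs])
# → ['concordant', 'concordant', 'overtreatment', 'undertreatment']
```

The 82%/1%/17% split reported above corresponds to the fractions of individuals falling into each of these three categories.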
The clinical implications of a given c-statistic are even more difficult to predict. A low c-statistic (i.e. close to 0.5) indicates that the model discriminates no better than a coin toss, while a high c-statistic (close to 1.0) indicates that the model can perfectly separate people who develop an event from people who do not. If we have four pairs of individuals visiting the doctor, in which one individual will develop CVD in the future (the case) and one will not (the noncase), a c-statistic of 0.75 means that for three of these pairs the case would indeed receive a higher predicted risk than the noncase (a risk which could still be too low or too high), while in one pair the case would receive a lower predicted risk than the noncase.
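This pairwise interpretation of the c-statistic can be written out in a few lines. The predicted risks are hypothetical; tied pairs receive half credit, as is conventional:

```python
# The c-statistic as the fraction of case/noncase pairs in which the
# case receives the higher predicted risk (ties count for 0.5).
def c_statistic(pairs):
    """pairs: list of (case_risk, noncase_risk) tuples; risks are hypothetical."""
    score = 0.0
    for case_risk, noncase_risk in pairs:
        if case_risk > noncase_risk:
            score += 1.0
        elif case_risk == noncase_risk:
            score += 0.5
    return score / len(pairs)

# Four hypothetical case/noncase pairs; the case is ranked higher in 3 of 4.
pairs = [(0.12, 0.04), (0.09, 0.06), (0.20, 0.10), (0.05, 0.08)]
print(c_statistic(pairs))  # 0.75
```

Note that the c-statistic reflects only ranking: it is unaffected by systematic over- or underprediction, which is why a model can discriminate well yet be badly calibrated.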
In general, the performance of prediction models tends to vary substantially across different settings and populations, owing to differences in case mix and health care systems [87]. Hence, one external validation may not be sufficient to claim adequate performance, and multiple validations are necessary to gain insight into the generalizability of prediction models [71]. Primary external validation studies have already shown the need for recalibration to improve a model's calibration and better tailor it to the setting at hand. In this systematic review, we now show that this also holds when all studies are pooled together in a meta-analysis. We found that none of the models offers reliable predictions unless (at least) its baseline risk or hazard (and, if applicable, the population means of the predictors in the model) is recalibrated to the local setting. Studies that reported the performance of a model before and after updating showed that performance indeed improves after updating; however, the necessary adjustments vary from setting to setting and thus need to be evaluated locally [8, 11, 36, 47, 48, 54]. As previously emphasized, more extensive revision methods are often not needed [24, 25, 88]. Hence, it appears that conventional predictors, such as age, smoking, diabetes, blood pressure and cholesterol, are still relevant indicators of 10-year CHD or CVD risk, and their associations with CVD events have largely remained stable. The need for updating CVD risk prediction models was discussed more than 15 years ago [11, 89], but nothing has changed since. We believe this should change now, especially as applying simple model updating is becoming increasingly feasible, owing to improvements in the storage of the information required to update a model. Clinical guidelines should acknowledge that the performance of these models is not adequate in every patient population and that recalibration is necessary before the models are used. Good examples of tailoring CVD risk prediction models to specific populations are the Globorisk prediction model, which can easily be tailored to different countries using country-specific data on the population prevalence of outcomes and predictors [90], and the SCORE model, which has been tailored to many European countries using national mortality statistics [21, 91-93].
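The simple updating recommended above, re-estimating only the baseline risk while keeping the original coefficients, can be sketched as follows for a Framingham-style 10-year risk equation. The functional form (risk = 1 − S0^exp(lp − mean lp)), the coefficient values and the data are all hypothetical; the point is only that the linear predictor is retained and the baseline survival S0 is re-estimated so that the mean predicted risk matches the locally observed event rate:

```python
import math

# Framingham-style 10-year risk: risk = 1 - S0 ** exp(lp - lp_mean),
# where lp is the model's linear predictor. Recalibration keeps the
# original coefficients (the lp values) and re-estimates S0 from local
# data. All numbers below are hypothetical.

def predicted_risk(lp, s0, lp_mean):
    return 1.0 - s0 ** math.exp(lp - lp_mean)

def recalibrate_s0(lps, observed_rate, lp_mean, lo=1e-6, hi=1 - 1e-6):
    """Find S0 so the mean predicted risk equals the locally observed
    event rate (simple bisection; assumes 0 < observed_rate < 1)."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        mean_pred = sum(predicted_risk(lp, mid, lp_mean) for lp in lps) / len(lps)
        if mean_pred > observed_rate:
            lo = mid  # a larger S0 lowers predicted risk
        else:
            hi = mid
    return (lo + hi) / 2.0

lps = [2.1, 2.8, 3.4, 1.9, 3.0]  # hypothetical linear predictors
lp_mean = 2.6
s0_orig = 0.90                   # original baseline survival (overpredicts here)
observed_rate = 0.06             # locally observed 10-year event rate

s0_new = recalibrate_s0(lps, observed_rate, lp_mean)
```

With these hypothetical numbers the original model predicts a mean risk well above 6%, so recalibration raises the baseline survival (s0_new > s0_orig) while leaving the ranking of individuals, and hence the c-statistic, unchanged.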
Furthermore, it is very important to gain insight into the factors that cause the observed overestimation of events. This can be studied in empirical data, as in the study by Cook et al. [38], where hypothetical scenarios are created in which a number of events are missed or prevented by risk-lowering drugs. Simulation studies might also give more insight into the reasons for the overestimation of CVD risk.
These suggestions, however, offer no short-term solution for practitioners currently using the three reviewed prediction models. For now, we advise practitioners in the USA to continue using the PCE, while remaining aware of the potential miscalibration, especially in high-risk individuals. The ACC-AHA clinical guidelines currently advise use of the PCE, which is understandable because alternatives have not been studied in sufficient detail. However, the actual value of the PCE in clinical practice is limited by their calibration. We therefore advise recalibrating the model to specific populations in the near future. Fortunately, systematic reviews have shown that the prevalence of common CVD risk factors decreases (e.g. cholesterol levels drop) in populations where CVD risk prediction models and their corresponding treatment guidance are used [94, 95]. Furthermore, statins have been proven effective with limited adverse events [96]. Finally, we advise practitioners to choose a model that predicts a clinically relevant outcome (for example, according to the AHA, CVD rather than only CHD, since stroke and CHD share pathophysiological mechanisms [10, 97]), consists of predictors available in their situation, and was developed or updated in a setting that closely resembles their own.
Limitations
This study has several limitations. First, we focused on the three most validated and most used prediction models in the USA, while in Europe many more prediction models are currently used for predicting cardiovascular risk, such as QRISK3 [22] and SCORE [21]. The differences between all these models are, however, limited, as most models include the same core set of predictors. We therefore believe our results can be generalized to other prediction models. Second, we had to rely on what was reported by the authors of the primary validation studies and unfortunately had to exclude relevant validations from our meta-analyses because of unreported information that we could not obtain from the authors. Only 19 of 61 authors were able to provide us with additional information, and we had to exclude 9 validations for the c-statistic and 18 for the OE ratio. Sometimes, incomplete reporting made it difficult to judge whether a study should be included in our analyses, for example when there was no explicit reference to the validated model or when it was not clear whether changes had been made to the model before validation. We decided to include one specific study (D'Agostino et al. 2001 [14]) because the validated model was similar in predictors and outcome definition. Although its eligibility could be debated, excluding this study would not alter our conclusions regarding the performance of the Framingham Wilson model. Owing to poor reporting, many validations scored an unclear risk of bias, especially for the predictors and outcome domains. Many other validations scored a high risk of bias, which may weaken our conclusions. Unfortunately, evidence on the impact of these issues on model performance is still limited [98], which makes it difficult to argue in which direction the predictive performance of the models would change if all validations had a low risk of bias. Third, the total OE ratio, while commonly reported, provides only an overall measure of calibration. To overcome this problem, we extracted information on the OE ratio within categories of predicted risk, which showed that there was more overestimation of risk in the highest categories of predicted risk. Based on this information, we calculated the calibration slope, which suggested that miscalibration of the Framingham Wilson and ATP III models and of the PCE men model was mostly related to heterogeneity in baseline risk, while the PCE women model is overfitted or does not transport well to new populations. In addition, more clinically relevant measures, such as net benefit, could not be considered in this meta-analysis because these measures were rarely reported [5]. Fourth, because of the low number of external validation studies, we did not perform meta-regression analyses for the ATP III model. Unfortunately, the relatively small sample size makes it difficult to draw firm conclusions about the sources of the observed heterogeneity. Fifth, the exclusion of non-English studies could have influenced the geographical representation. However, since only one full-text article was excluded for this reason, we believe the effect on our results is limited.
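The point that a total OE ratio can mask miscalibration within risk strata can be made concrete with a small sketch. The event counts per predicted-risk category below are hypothetical, chosen only to mimic the pattern we observed (more overestimation, i.e. OE further below 1, at higher predicted risk):

```python
# Overall vs category-wise OE ratio (observed events / expected events).
# All counts are hypothetical illustrations, not data from this review.
categories = {
    "<5%":     {"observed": 40, "expected": 38},
    "5-7.5%":  {"observed": 55, "expected": 60},
    "7.5-10%": {"observed": 48, "expected": 70},
    ">=10%":   {"observed": 52, "expected": 95},
}

def oe_ratio(observed, expected):
    return observed / expected

overall = oe_ratio(
    sum(c["observed"] for c in categories.values()),
    sum(c["expected"] for c in categories.values()),
)
per_category = {name: oe_ratio(c["observed"], c["expected"])
                for name, c in categories.items()}
```

Here the overall OE ratio (about 0.74) hides the fact that the lowest-risk category is nearly perfectly calibrated while the highest-risk category overpredicts by almost a factor of two, which is exactly why we also extracted OE ratios within categories of predicted risk.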