Background
Models that calculate the risk of disease are widely used to aid diagnosis and prognosis [1]. Examples of commonly used models include the Framingham Risk Score for cardiovascular disease (CVD) and the Gail model for breast cancer [2, 3]. However, the predictions provided by these models are not perfect, and ways to improve them are frequently proposed. One such method is to include additional predictors in the model [4]. Whether the additional predictors provide better predictions, and how this is evaluated, has been the subject of numerous articles in recent years [5-8].
If a new predictor is to be added to a prediction model then the benefits of including this new predictor must outweigh the costs; the new predictor must demonstrate clinical utility. Measures of clinical utility that have been proposed include the net benefit and event free life years (EFLYs) [6, 9]. Several authors have suggested that such measures of clinical utility be calculated after the new predictor has demonstrated incremental predictive ability in terms of an increase in the c-statistic, the continuous version of the net reclassification improvement (NRI(>0)) or the categorical NRI [5, 10-12]. This staged approach implies that the predictive ability results provide an indication of the likely clinical utility results.
The c-statistic has been criticised as being insensitive to the effect of important new predictors [13]. It is therefore questionable whether such an insensitive measure is of use in determining which new predictors should then be assessed in terms of clinical utility. The NRI(>0) has been proposed as a better measure of discrimination than the c-statistic when comparing predictors, but how the NRI(>0) relates to the clinical utility measures has not been examined [5]. If the measures of predictive ability do not correlate with the measures of clinical utility then it is doubtful whether they would be helpful in deciding which new predictors should be investigated further.
An additional concern is that these measures may behave differently as the mean risk of the population being studied changes. For example, the c-statistic is largely unaffected by the mean risk in the population, whereas measures of reclassification may be affected by where the reclassification cutpoint is set in relation to the distribution of risk in the population [14]. The categorical NRI also implicitly weights the reclassification of cases and non-cases by the prevalence in the sample population [15]. The impact of changing cutpoints on net benefit has been examined recently [16]; however, less attention has been paid to the situation where the cutpoint is fixed but the mean risk varies across the populations studied, a common situation when new cardiovascular risk predictors are assessed.
In the cardiovascular setting, the application of risk thresholds for treatment has been widely promoted for a number of years in guidelines across the world [17-20]. New predictors of cardiovascular disease have then been assessed using these fixed thresholds but in a wide variety of populations. For example, the Emerging Risk Factors Collaboration has brought together 104 prospective population-based studies across a number of countries, several of which are from North America [21]. The mean age of these North American cohort studies ranges from 54 to 78, and the percentage of males from 0% to 100% [6]. This indicates a range of mean risks, and differences in the distribution of risks, across studies in which the same threshold for treatment would be applied.
In this paper we examine how the measures of predictive ability (c-statistic, binary c-statistic, NRI(>0), NRI (with two cutpoints) and binary NRI (at the upper cutpoint)) are related to the measures of clinical utility, net benefit and event free life years (EFLY), when assessing the effect of adding a new predictor to a model. We investigate how differences in the mean risk between populations affect these measures using simulated data and also using data from the Framingham Study, where the mean risk of CVD differs for men and women [22].
Methods
Measures of predictive ability
We calculated each of the measures at a common, fixed follow-up time of ten years, consistent with the UK guidelines that apply CVD risk prediction models [17].
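Sensitivity and specificity at ten years, assuming a fixed cutpoint, were calculated using the following formulas:
$$\text{sensitivity} = \frac{\text{number of true positives}}{\text{number with an event}}, \qquad \text{specificity} = \frac{\text{number of true negatives}}{\text{number without an event}}$$
where the numbers of true positives, true negatives and those with and without an event were estimated using the Kaplan-Meier estimates of the proportion surviving at ten years.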
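One way to read the Kaplan-Meier based calculation described above is sketched here in Python. This is an illustration under stated assumptions, not the authors' code: the function names `km_event_prob` and `sens_spec`, the convention that "high risk" means a calculated risk strictly greater than the cutpoint, and the step of multiplying group sizes by Kaplan-Meier event probabilities to obtain expected counts are all choices made for this sketch.

```python
def km_event_prob(times, events, horizon=10.0):
    """Kaplan-Meier estimate of the probability of an event by `horizon`."""
    surv = 1.0
    # Step through the distinct event times up to the horizon, in order.
    for t in sorted({u for u, e in zip(times, events) if e and u <= horizon}):
        at_risk = sum(1 for u in times if u >= t)
        failed = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1.0 - failed / at_risk
    return 1.0 - surv


def sens_spec(times, events, risks, cutpoint=0.20, horizon=10.0):
    """Sensitivity and specificity at `horizon` for a fixed risk cutpoint."""
    n = len(times)
    high = [i for i in range(n) if risks[i] > cutpoint]   # classified high risk
    low = [i for i in range(n) if risks[i] <= cutpoint]   # classified low risk

    def group_prob(idx):
        # Event probability within a subgroup (0 if the subgroup is empty).
        if not idx:
            return 0.0
        return km_event_prob([times[i] for i in idx],
                             [events[i] for i in idx], horizon)

    p_all = km_event_prob(times, events, horizon)
    true_pos = len(high) * group_prob(high)            # expected events among high risk
    true_neg = len(low) * (1.0 - group_prob(low))      # expected non-events among low risk
    sensitivity = true_pos / (n * p_all)
    specificity = true_neg / (n * (1.0 - p_all))
    return sensitivity, specificity
```

With no censoring before the horizon, the Kaplan-Meier estimates reduce to simple proportions, so the function returns the usual sensitivity and specificity.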
Harrell’s c-statistic is a measure of discrimination calculated using the formula
$$c = \frac{\text{number of concordant pairs} + 0.5 \times \text{number of tied pairs}}{\text{number of usable pairs}}$$
Each individual with an event before ten years is paired with every other person, irrespective of their event status. A pair is usable if their observed survival times differ, and the paired person had an event or their censoring time was greater than the survival time for the individual with the event. A usable pair is concordant if the predicted survival time is less for the member of the pair with the shorter observed survival time. A pair is tied if they have the same predicted survival time. People with events after ten years are considered censored at ten years [23].
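The pairing rules described above can be sketched in a few lines of Python. The function name `harrell_c` is hypothetical. The sketch assumes that a higher calculated risk corresponds to a shorter predicted survival time and that tied predictions count as half a concordant pair, which is the usual convention for Harrell's c.

```python
def harrell_c(times, events, risks, horizon=10.0):
    """Harrell's c-statistic with administrative censoring at `horizon`."""
    # Events after the horizon are treated as censored at the horizon.
    t = [min(u, horizon) for u in times]
    e = [bool(ev) and u <= horizon for u, ev in zip(times, events)]

    concordant = tied = usable = 0
    n = len(t)
    for i in range(n):
        if not e[i]:
            continue  # the member with the shorter time must have had an event
        for j in range(n):
            if t[j] <= t[i]:
                continue  # partner must be observed beyond the event time
            usable += 1
            if risks[i] > risks[j]:
                concordant += 1  # shorter survivor was given the higher risk
            elif risks[i] == risks[j]:
                tied += 1
    if usable == 0:
        raise ValueError("no usable pairs")
    return (concordant + 0.5 * tied) / usable
```

The double loop is quadratic in the number of people, which is adequate for illustration but slow for datasets of 10,000 observations.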
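The binary area under the ROC curve (or binary c-statistic) is a measure of discrimination at a particular cutpoint. It is calculated by averaging the sensitivity and specificity at that cutpoint. We calculated the difference in binary c-statistic when the new marker is added to the model using the formula
$$\Delta\,\text{binary c-statistic} = \frac{\text{sensitivity}_{\text{new}} + \text{specificity}_{\text{new}}}{2} - \frac{\text{sensitivity}_{\text{old}} + \text{specificity}_{\text{old}}}{2}$$
where the subscripts new and old denote the models with and without the new marker.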
Measures of reclassification
NRI(>0) and NRI(with two cutpoints) measure the amount of reclassification that occurs when the new predictor is added to a model [24]. The proportions of events and non-events correctly reclassified (reclassified up and down, respectively) are adjusted by the proportions of events and non-events incorrectly reclassified.
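We calculated the NRI(>0) using the formula
$$\text{NRI}(>0) = \frac{P(\text{event} \mid U)\, n_U - P(\text{event} \mid D)\, n_D}{n\, P(\text{event})} + \frac{[1 - P(\text{event} \mid D)]\, n_D - [1 - P(\text{event} \mid U)]\, n_U}{n\, [1 - P(\text{event})]}$$
where n is the total number of people and the subscripts U and D indicate those reclassified up and down. The Kaplan-Meier estimates at ten years among all people, and those reclassified up or down, provide the probabilities (P).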
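As a worked illustration, the NRI(>0) can be computed from the group sizes and the three Kaplan-Meier probabilities named above. The Python sketch assumes the usual survival-data form of the measure, in which expected event counts are obtained by multiplying group sizes by Kaplan-Meier event probabilities. The function name `nri_continuous` and its argument names are hypothetical.

```python
def nri_continuous(n, n_up, n_down, p_event, p_event_up, p_event_down):
    """NRI(>0) from Kaplan-Meier event probabilities at ten years.

    n            total number of people
    n_up, n_down numbers whose calculated risk moved up / down
    p_event      event probability among all people
    p_event_up   event probability among those reclassified up
    p_event_down event probability among those reclassified down
    """
    events = n * p_event
    non_events = n * (1.0 - p_event)
    # Events should move up: net proportion of events moving up.
    event_nri = (p_event_up * n_up - p_event_down * n_down) / events
    # Non-events should move down: net proportion of non-events moving down.
    non_event_nri = ((1.0 - p_event_down) * n_down
                     - (1.0 - p_event_up) * n_up) / non_events
    return event_nri + non_event_nri


# Example: 1,000 people, 100 expected events, 60 of them among the 400 moved up.
example = nri_continuous(1000, 400, 600, 0.10, 60 / 400, 40 / 600)
```

In this example the event component is (60 − 40)/100 = 0.20 and the non-event component is (560 − 340)/900 ≈ 0.24.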
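We calculated the NRI(with two cutpoints) using the formula
$$\text{NRI} = [P(\text{up} \mid \text{event}) - P(\text{down} \mid \text{event})] + [P(\text{down} \mid \text{non-event}) - P(\text{up} \mid \text{non-event})]$$
where each $P(\cdot \mid \cdot)$ is the proportion of events, or non-events, that are reclassified up, or down. As the data are censored, the numbers of events and non-events are estimated from the Kaplan-Meier estimates at ten years for each of the cells in the reclassification table [25].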
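The categorical calculation can be sketched from a reclassification table whose cells already hold the estimated numbers of events and non-events (these may be fractional when derived from Kaplan-Meier estimates). The names `risk_category` and `nri_categorical`, the dictionary layout of the table and the example counts are all hypothetical and serve only as an illustration.

```python
from bisect import bisect_left


def risk_category(risk, cutpoints=(0.10, 0.20)):
    """Category index: 0 = low, 1 = intermediate, 2 = high.

    A risk is placed in the higher category only if it is strictly greater
    than the cutpoint (an assumption of this sketch).
    """
    return bisect_left(cutpoints, risk)


def nri_categorical(table):
    """Categorical NRI from {(old_category, new_category): (events, non_events)}."""
    ev_up = ev_down = ne_up = ne_down = ev_total = ne_total = 0.0
    for (old_cat, new_cat), (ev, ne) in table.items():
        ev_total += ev
        ne_total += ne
        if new_cat > old_cat:      # reclassified up
            ev_up += ev
            ne_up += ne
        elif new_cat < old_cat:    # reclassified down
            ev_down += ev
            ne_down += ne
    return (ev_up - ev_down) / ev_total + (ne_down - ne_up) / ne_total


# Made-up table: 100 events and 900 non-events in total.
example_table = {
    (0, 0): (10, 500), (0, 1): (5, 20),
    (1, 0): (2, 40), (1, 1): (20, 200), (1, 2): (8, 15),
    (2, 1): (3, 25), (2, 2): (52, 100),
}
example_nri = nri_categorical(example_table)
```

For the made-up table, 13 events move up and 5 move down, while 65 non-events move down and 35 move up, giving 8/100 + 30/900.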
The binary NRI was also calculated. This has a single cutpoint which was set as the upper cutpoint from the NRI(with two cutpoints). The binary NRI is directly related to the binary c-statistic.
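The relationship between the two binary measures can be made explicit. With a single cutpoint, an event is reclassified up exactly when it becomes a true positive and a non-event is reclassified down exactly when it becomes a true negative, so the net proportions in the NRI are the changes in sensitivity and specificity. The notation (subscripts new and old for the models with and without the new predictor) is introduced here for illustration.

```latex
\begin{align*}
\text{binary NRI}
  &= (\text{sensitivity}_{\text{new}} - \text{sensitivity}_{\text{old}})
   + (\text{specificity}_{\text{new}} - \text{specificity}_{\text{old}}) \\
  &= 2 \left[ \frac{\text{sensitivity}_{\text{new}} + \text{specificity}_{\text{new}}}{2}
            - \frac{\text{sensitivity}_{\text{old}} + \text{specificity}_{\text{old}}}{2} \right] \\
  &= 2 \times \Delta\,\text{binary c-statistic}
\end{align*}
```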
Measures of clinical utility
The net benefit provides a measure of the number of people correctly classified as having the outcome, adjusted for the number of people incorrectly classified as having the outcome, where the number of people incorrectly classified as having the outcome is weighted by the relative importance of a correct classification compared to an incorrect classification [26]. This weight is determined by the threshold probability at which people are classified as having the outcome. We calculated the net benefit using the formula
$$\text{Net benefit} = \frac{\text{true positives}}{n} - \frac{\text{false positives}}{n} \times \frac{p_t}{1 - p_t}$$
where the numbers of true positives and false positives are estimated using the Kaplan-Meier estimates of the percentage surviving at ten years among those with calculated risks greater than the threshold probability, n is the total number of people and $p_t$ is the threshold that defines high risk.
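A short numerical sketch may help to show how the threshold sets the weight. It assumes the standard decision-curve form of the net benefit, in which false positives are weighted by the odds of the threshold probability. The function name `net_benefit` and the counts in the example are hypothetical.

```python
def net_benefit(true_pos, false_pos, n, threshold):
    """Net benefit per person at a given risk threshold.

    true_pos and false_pos are the (Kaplan-Meier estimated) numbers of people
    above the threshold with and without an event by ten years.
    """
    weight = threshold / (1.0 - threshold)  # odds of the threshold probability
    return true_pos / n - (false_pos / n) * weight


# At a 20% threshold the weight is 0.25: one true positive offsets four false positives.
old_model = net_benefit(true_pos=80, false_pos=200, n=1000, threshold=0.20)
new_model = net_benefit(true_pos=85, false_pos=190, n=1000, threshold=0.20)
difference = new_model - old_model
```

Here the old model gives 0.08 − 0.20 × 0.25 = 0.03, that is, a net 30 true positives per 1,000 people, and the difference between the two models is the quantity compared with the other measures.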
The number of event free life years (EFLYs) was estimated using the methods described by Rapsomaniki and colleagues [6] and is based on a formula in which P is the proportion of those evaluated who are treated, B(T) is the benefit in terms of event free life years gained among those treated and C(T) is the cost for those treated, also measured relative to event free life years. Briefly, individuals with a calculated risk above a treatment threshold are assumed to have their risk reduced by treatment. This reduction in risk leads to a reduction in events and an increase in the total number of event free life years for the population within a given time period (here 10 years). Each gain in EFLY is assumed to have a monetary value. However, there are costs involved in treatment, particularly for those who would not have experienced an event within the time period. Assuming a particular cost per EFLY, Rapsomaniki’s method deducts these costs, in terms of EFLYs, from the benefit obtained from the gain in EFLY of those treated.
As in Rapsomaniki’s paper, we set the reduction in risk due to treatment at 20%, which was based on results from a meta-analysis [27]. The cost of treatment, in terms of EFLYs, is calculated assuming that the threshold for treatment is the optimal cutpoint in that benefits match costs at this point. Rapsomaniki and colleagues put a monetary value on this cost by relating it to the cost of one EFLY (£20,000) as proposed by the National Institute for Health and Clinical Excellence (NICE) [28].
Our main analysis focused on the upper cutpoint of 20% risk at ten years, which is used in the UK CVD prevention guidelines [17]. We also repeated our analyses using upper cutpoints of 10% and 50%. The lower cutpoints in the calculation of the categorical NRI for these analyses were arbitrarily set at 5% and 25%, respectively.
Simulated data
For our simulations we generated survival times that followed a Cox-exponential survival model using the formula [29]:
$$T = -\frac{\log(U)}{\lambda \exp(\beta_1 x_1 + \beta_2 x_2)}$$
where $U$ is a uniform random number between 0 and 1 and $\lambda$ is the baseline hazard rate, which was varied to produce datasets with mean risks at ten years distributed between 0% and 100%. The variables $x_1$ and $x_2$ each had standard Normal distributions and were independent of each other. We carried out separate series of simulations by varying the coefficient $\beta_1$ from a hazard ratio of 1.5 per 1 standard deviation increase (weak baseline model) to 3 (medium baseline model) to 6 (strong baseline model) and by varying the coefficient $\beta_2$ of the second covariate to produce hazard ratios of 1.2 (weak predictor), 2 (medium predictor) and 3 (strong predictor). Note that the hazard ratios derived from the Framingham dataset for the traditional CVD risk factors of age, SBP and total cholesterol ranged from 1.25 to 2.04 per one standard deviation increase (Additional file 1: Table S1). Since we have assumed a constant hazard ratio across simulated datasets that have different mean risks, the odds ratio calculated for events occurring before ten years will not be constant across the datasets [30]. The estimated odds ratio is similar to the hazard ratio if the mean risk is small, but the odds ratio increasingly overestimates the hazard ratio as the mean risk increases.
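The divergence between odds ratio and hazard ratio can be illustrated with a short calculation. The Python sketch compares two individuals whose hazards differ by a fixed hazard ratio under an exponential model. This is a simplification: it is the odds ratio for a one-unit difference at the reference covariate value, not the odds ratio a logistic regression would estimate from a simulated dataset. The function name `conditional_odds_ratio` is hypothetical.

```python
import math


def conditional_odds_ratio(hazard_ratio, reference_risk):
    """Ten-year odds ratio implied by a hazard ratio at a given reference risk.

    With cumulative hazard H, risk = 1 - exp(-H), so odds = exp(H) - 1.
    """
    cumulative_hazard = -math.log(1.0 - reference_risk)
    odds_reference = math.expm1(cumulative_hazard)
    odds_exposed = math.expm1(hazard_ratio * cumulative_hazard)
    return odds_exposed / odds_reference


# Odds ratio implied by a hazard ratio of 2 at increasing reference risks.
for risk in (0.01, 0.10, 0.50, 0.90):
    print(f"reference risk {risk:.2f}: odds ratio "
          f"{conditional_odds_ratio(2.0, risk):.2f}")
```

For example, at a reference risk of 50% a hazard ratio of 2 gives an exposed risk of 75%, so the odds ratio is 3; as the reference risk approaches zero the odds ratio approaches the hazard ratio of 2.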
We also generated censoring times that followed an exponential distribution with a 10% risk of being censored at ten years. If the censoring time was less than the survival time the observation was considered censored at the censoring time. The proportion censored decreased from 10% to approximately 2.5% as the mean risk increased. Each simulation dataset contained 10,000 observations. We simulated 1,000 datasets for each combination of baseline model and additional predictor.
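The data-generating steps described above can be sketched with the Python standard library. The function name `simulate`, the seed handling and the tuple layout are hypothetical. The sketch also assumes that "a 10% risk of being censored at ten years" means an exponential censoring rate chosen so that the probability of a censoring time before ten years is 10%, and it uses inverse-transform sampling for the Cox-exponential event times.

```python
import math
import random


def simulate(n, baseline_hazard, hr1, hr2, censor_risk_10y=0.10, seed=0):
    """Simulate (x1, x2, observed_time, event) tuples from a Cox-exponential model."""
    rng = random.Random(seed)
    beta1, beta2 = math.log(hr1), math.log(hr2)  # log hazard ratios per 1 SD
    # Exponential censoring rate giving the requested ten-year censoring risk.
    censor_rate = -math.log(1.0 - censor_risk_10y) / 10.0

    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)  # independent standard Normal covariates
        x2 = rng.gauss(0.0, 1.0)
        u = 1.0 - rng.random()    # uniform on (0, 1], avoids log(0)
        hazard = baseline_hazard * math.exp(beta1 * x1 + beta2 * x2)
        t_event = -math.log(u) / hazard
        t_censor = rng.expovariate(censor_rate)
        # Censored if the censoring time comes before the survival time.
        rows.append((x1, x2, min(t_event, t_censor), t_event <= t_censor))
    return rows


# One dataset: medium baseline model (HR 3) with a medium new predictor (HR 2).
dataset = simulate(n=10_000, baseline_hazard=0.01, hr1=3.0, hr2=2.0, seed=1)
```

Varying `baseline_hazard` across calls changes the mean ten-year risk while the hazard ratios stay fixed, which is the design the simulations rely on.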
For each of the simulated datasets the measures of predictive ability and clinical utility were calculated comparing models without and with the second variable. We then plotted the proportion of people classified as high risk (above the upper cutpoint) by each of the two models classified by the mean calculated risk of an event at ten years for that dataset. The mean risk was calculated from the model containing both covariates. We plotted the measures of predictive ability and clinical utility against the mean risk and applied a cubic spline smoother.
Empirical data
We obtained data from the Framingham Heart Study on the people included in the analysis that resulted in the 2008 Framingham risk equation [22]. At the initial visit, blood pressure, serum total cholesterol, HDL, smoking status, diabetes status and use of anti-hypertensive medication were recorded using standard methods. All study participants were free of prevalent CVD at the initial visit and were under continuous surveillance for the development of cardiovascular events and death. Maximum follow-up was 12 years.
We fitted two Cox proportional hazards models to the Framingham dataset consisting of the variables that were included in the proposed general CVD risk prediction model [22]. The first model included age, total cholesterol, high density lipoprotein, smoking status, diabetes status and use of antihypertensive medication. The second model included all of these variables, plus systolic blood pressure (SBP). We carried out separate analyses for men and women as the risk of CVD differs between men and women.
We compared the models without SBP and with SBP using the following measures: change in c-statistic, binary c-statistic, NRI(>0), NRI(10%, 20%), binary NRI (20%), net benefit and event free life years (EFLY). Ninety-five percent confidence intervals were calculated for these measures using 2000 bootstrap samples. We used a treatment cutpoint of 20% for the calculation of net benefit and EFLYs (and cutpoints of 10% and 20% for the NRI(10%, 20%)) to match current CVD prevention guidelines [17]. We assumed that for treated people their risk of CVD would be reduced by 20%, based on the meta-analysis reported in the paper by Rapsomaniki that introduced the EFLY [6].
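A generic bootstrap interval can be sketched as follows. The text does not say which type of bootstrap interval was used, so the percentile method shown here is an assumption, as are the function name `bootstrap_ci` and its arguments. In practice `statistic` would refit both models on each resample and return the difference in the measure of interest.

```python
import random


def bootstrap_ci(data, statistic, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for `statistic(data)`."""
    rng = random.Random(seed)
    n = len(data)
    # Resample people with replacement and recompute the statistic each time.
    estimates = sorted(statistic(rng.choices(data, k=n)) for _ in range(n_boot))
    alpha = (1.0 - level) / 2.0
    lower = estimates[round(alpha * n_boot)]
    upper = estimates[round((1.0 - alpha) * n_boot) - 1]
    return lower, upper


def mean(values):
    return sum(values) / len(values)


# Toy usage with a simple statistic; real use would pass per-person records.
interval = bootstrap_ci([0.1, 0.4, 0.2, 0.3, 0.5, 0.25, 0.35], mean)
```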
Discussion
We have described how the measures of predictive ability, reclassification and clinical utility used to assess a new predictor in a model depend upon the mean risk of the population. We have also demonstrated that the reclassification measures exhibit a different relationship with the mean risk than the clinical utility measures. The continuous NRI increases with increasing mean risk; the categorical NRI with two cutpoints often peaks at two points; whereas the Net Benefit and EFLY peak once, close to the cutpoint, and then generally decrease to zero as the mean risk increases.
In the Framingham Study the mean risk of CVD was higher for men than for women, and also closer to the upper cutpoint of 20%. Based on this, we might have expected the measures of predictive ability, reclassification and clinical utility to be higher among men. However, the hazard ratio for systolic blood pressure when it was added to the model was higher for women than for men, and this compensated for the lower mean risk among women. In a recent review of several new predictors of cardiovascular disease, Paynter and colleagues have also highlighted that results may differ between men and women due to differences in effect sizes of new predictors as well as the strength of the baseline model and the mean risk in the study sample [31].
In our simulations we observed that as the mean risk increased the NRI(>0), and the change in the c-statistic, also increased. In the paper that introduced the NRI(>0), Pencina suggested that one of the benefits of this measure was that it was not affected by the event rates in the population [24]. Our simulations, where we assumed a constant hazard ratio, indicate that the NRI(>0) increases as the event rate (the mean risk) increases for event rates above the cutpoint. The NRI(>0), as with the change in the c-statistic, is unaffected by event rates only if the odds ratio does not vary. However, as we have demonstrated, if the hazard ratio is assumed to be the same in populations with different event rates (a common assumption in cohort studies of cardiovascular outcomes) then the NRI(>0) will increase with increasing event rate.
In our simulations, when the mean risk in the population was less than the cutpoint the measures of reclassification and clinical utility were generally consistent with each other and increased as the mean risk increased. However, beyond this cutpoint the measures diverged. The reclassification measures continued to increase while the clinical utility measures decreased, although the binary NRI and NRI(with two cutpoints) did eventually decrease. Similar patterns were also observed by Van Calster and others when they varied the cutpoint and assumed a fixed mean risk; as the cutpoint moved away from the mean risk the reclassification measures provided a more optimistic view of the new predictor than that provided by the difference in net benefit [16].
The clinical utility measures, difference in EFLY and difference in Net Benefit, achieved a maximum value at approximately the point where the threshold for treatment equaled the mean risk in the population, as expected [32]. However, we observed a divergence in the clinical utility measures in our simulations as the mean risk increased. This is attributable to differences between the two measures in terms of how benefits and costs are counted and the weights given to benefits and costs in populations with different mean risks.
When a new predictor is added to a model, the difference in EFLY is measured in terms of event free life years. An event free life year gained has the same value whether it occurs in a high risk or low risk population. In contrast, the difference in Net Benefit is measured in units of true positives, adjusted for false positives, with the weighting of false positives relative to true positives determined by the cutpoint defining high risk. However, the actual value of a true positive will differ in populations with different mean risks since the number of event free life years gained will be greater for an individual from a high risk population compared to a low risk population. Also, a false positive will have a greater cost in a low risk population than a high risk population as the survival time, and hence, treatment time, will be greater.
Although there are issues in using the Net Benefit when accounting for costs and benefits over a specific time period, there are also issues in the calculation of costs and benefits for the EFLY. Possible heterogeneity in treatment effects across patient subgroups is not accounted for in the EFLY. Also, the calculation of the EFLY assumes that the chosen cutpoint is the ‘optimal’ cutpoint in that costs equal benefits at this point; the cost of treatment, in terms of event free life years, is then calculated based on this assumption. Rapsomaniki and colleagues acknowledge that many factors, other than the costs and benefits they account for in their EFLY calculations, are considered when a particular cutpoint is chosen [6]. However, their assumption avoids the problem of an irrational choice of cutpoint resulting in a poorer model being favoured [6].
In previous papers the relationship between choice of cutpoint and the measures of reclassification and the difference in Net Benefit has been described when the mean risk in the population is fixed [14, 16, 33]. We observed similar results when the mean risk in the population varies but the cutpoint is fixed. The scenario we have described is the one more commonly encountered in the evaluation of new predictors of cardiovascular events. For example, the Emerging Risk Factors Collaboration (ERFC) brings together several cohort studies from the same country which have different mean risks but where the same guidelines and cutpoints for defining high risk would apply. As each of the measures we have examined is in some way affected by the mean risk in the study population, this must be taken into account when comparisons are made between different studies whose mean risk varies, or when the mean risk in the study population differs from that of the population in which a new predictor will ultimately be implemented.
A number of methods have been proposed to allow for these differences. Where the study data arise from a matched case control study, Pepe has proposed a method for calculating an adjusted c-statistic that takes into account the greater similarity in risk between cases and controls that arises from matching [34]. The ERFC applied age-sex specific measures of reclassification observed in their study population to the standard European population to estimate the amount of reclassification that would occur in this standard population [35, 36]. However, this relies upon having a large enough study population to provide reliable estimates of reclassification in each age-sex stratum. If the data arise from a case control study, Rousson suggests reweighting the proportions of cases and controls to match the proportions found in the parent population [37].
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
KM designed the study, carried out the analysis, interpreted the results and drafted the manuscript. The other authors, PM, LI and PB, all contributed to the interpretation of results, reviewed the manuscript and revised it for important intellectual content. All authors read and approved the final manuscript.