We considered all known Australian cohorts for inclusion in this study. An expert steering committee was established, which agreed, a priori, that the aim was to develop a 5-year CVD risk score using Australian data that included, unless there was evidence otherwise, measures of socioeconomic status, family history of CVD and markers for renal disease, in addition to the classical Framingham risk factors. The 5-year time frame was chosen for the following reasons: i) to reflect current absolute risk guidelines in Australia, which are based on the 5-year risk of a CVD event [1]; ii) focus group testing has shown that Australian consumers prefer a shorter 5-year time frame over a 10-year time frame for risk prediction [6]; and iii) to enable modelling of treatment effects in RCTs, which are of relatively short duration. Cohorts were included if they had data on CVD outcomes, on traditional CVD risk factors (age, sex, diabetes, systolic blood pressure (SBP), total cholesterol (TC), high-density lipoprotein cholesterol (HDLC) and smoking) and on socioeconomic deprivation, measured by the Australian Socio-Economic Indexes for Areas (SEIFA) postcode-based score, for some or all participants [7]. Cohorts were excluded if they were derived from a high-risk CVD population, if all participants were aged less than 40 years or older than 74 years, or if information on prior CVD was unavailable. Six prospective cohorts, whose investigators were willing and able to contribute individual participant data, were subsequently identified (Table 1). Data were pooled and the relevant variables were harmonised across studies. These cohorts contributed to the Australian and New Zealand Diabetes and Cancer Collaboration [8]. This study was approved by the Alfred Health Human Research Ethics Committee (HREC; 310/14) and the Australian Institute of Health and Welfare (AIHW) HREC (2015/1/142).
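Table 1. Summary data (mean (standard deviation) or number (%)) and number (%) missing for putative risk factors, by study
Risk factor | Study 1 | Missing | Study 2 | Missing | Study 3 | Missing | Study 4 | Missing | Study 5 | Missing | Study 6 | Missing | All studies | Missing |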
n | 7417 | | 3558 | | 894 | | 1747 | | 38897 | | 2316 | | 54829 | |
Age | 54.06 (9.49) | nil | 61.14 (6.91) | nil | 55.00 (9.72) | nil | 65.86 (4.3) | nil | 55.02 (8.61) | nil | 54.61 (9.77) | nil | 55.62 (8.94) | nil |
Women, n (%) | 4103 (55) | nil | 2063 (58) | nil | 510 (57) | nil | 987 (57) | nil | 23430 (60) | nil | 1241 (54) | nil | 32582 (59) | nil |
Systolic blood pressure | 130.44 (18.19) | 30 (0.4) | 142.37 (19.94) | 17 (0.48) | 132.11 (21.70) | nil | 144.17 (22.19) | nil | 136.55 (19.0) | 88 (0.23) | 130.23 (17.86) | 1 (0.04) | 136.01 (19.34) | 136 (0.25) |
Total cholesterol | 5.79 (1.05) | 1 (0.01) | 5.92 (1.05) | 442 (12.42) | 5.42 (0.98) | nil | 6.57 (1.23) | 4 (0.23) | 5.53 (1.06) | 148 (0.38) | 5.45 (1.03) | 17 (0.73) | 5.61 (1.08) | 612 (1.16) |
HDL cholesterol | 1.43 (0.39) | 2 (0.03) | 1.45 (0.43) | 445 (12.51) | 1.45 (0.39) | nil | 1.38 (0.38) | 7 (0.40) | 1.41 (0.40) | 33815 (86.93) | 1.37 (0.39) | 18 (0.72) | 1.42 (0.40) | 34286 (62.5) |
Diabetes, n (%) | 458 (6) | 1 (0.01) | 204 (6) | 3 (0.08) | 65 (7.3) | nil | 85 (5) | nil | 1478 (4) | 6 (0.02) | 142 (6) | nil | 2432 (4) | 10 (0.02) |
FPG (mmol/L) | 5.60 (1.22) | 1 (0.01) | 5.28 (1.50) | 342 (9.10) | 5.27 (1.50) | nil | 5.32 (1.68) | 11 (0.60) | 5.65 (1.10) | 12659 (32.5)* | 5.40 (1.50) | nil | 5.70 (1.51) | 13013 (23.4) |
SEIFA fifth, n (%) | | 17 (0.22) | | 6 (0.16) | | 6 (0.75) | | nil | | 128 (0.33) | | 8 (0.32) | | 165 (0.30) |
1st (most disadvantaged) | 625 (8.43) | | nil | | 558 (62.42) | | 1747 (100) | | 5537 (14.28) | | 843 (36.40) | | 9310 (16.98) | |
2nd | 1310 (17.66) | | 2466 (69.31) | | 237 (26.51) | | nil | | 8023 (20.69) | | 537 (23.19) | | 12573 (22.93) | |
3rd | 2151 (29.00) | | nil | | 93 (10.40) | | nil | | 7149 (18.44) | | 371 (16.02) | | 9764 (17.81) | |
4th | 1460 (19.68) | | 1085 (30.49) | | nil | | nil | | 7978 (20.58) | | 473 (20.42) | | 10996 (20.06) | |
5th (least disadvantaged) | 1854 (25.00) | | 1 (0.03) | | nil | | nil | | 10082 (26.01) | | 84 (3.63) | | 12021 (21.92) | |
Current smoker, n (%) | 1050 (14) | 131 (1.77) | 558 (16) | 116 (3.26) | 137 (15) | nil | 309 (18) | 18 (1.03) | 4378 (11) | 10 (0.03) | 453 (20) | 16 (0.69) | 6885 (13) | 291 (0.53) |
eGFR (ml/min/1.73 m²) | 94.11 (14.04) | 33 (0.44) | 66.96 (14.78) | 795 (22.34) | 87.15 (15.73) | nil | n/a | 1747 (100) | n/a | 38897 (100) | n/a | 2316 (100) | 86.75 (18.46) | 43788 (79.86) |
Albumin-creatinine ratio | 1.61 (7.22) | 37 (0.50) | n/a | 3558 (100) | n/a | 894 (100) | n/a | 1747 (100) | n/a | 38897 (100) | n/a | 2316 (100) | 1.61 (7.22) | 47449 (86.55) |
Family history CVD, n (%) | n/a | 7417 (100) | n/a | 3558 (100) | n/a | 894 (100) | 603 (35) | nil | 19830 (51) | nil | 1551 (67) | 34 (1.47) | 21984 (40) | 11903 (27.71) |
BMI | 27.26 (4.97) | 64 (0.86) | 26.99 (4.86) | 40 (1.22) | 28.18 (5.19) | 2 (0.22) | 26.12 (4.25) | 2 (0.11) | 26.86 (4.42) | 27 (0.07) | 28.50 (5.48) | 1 (0.04) | 26.99 (4.60) | 136 (0.25) |
High school +, n (%) | 2692 (36) | 4 (0.05) | 2130 (60) | 222 (6.24) | 211 (24) | nil | 490 (29) | 42 (2.40) | 12797 (33) | 9 (0.02) | 1204 (52) | 56 (2.42) | 19524 (36) | 333 (0.61) |
CVD death, n (%) | 87 (1.17) | nil | 246 (6.91) | nil | 6 (0.67) | nil | 320 (18.32) | nil | 691 (1.80) | 606 (1.56) | 25 (1.08) | nil | 1375 (2.51) | 606 (1.11) |
Years of follow-up, mean | 11.77 | n/a | 15.17 | n/a | 9.82 | n/a | 18.09 | n/a | 18.04 | n/a | 10.46 | n/a | 16.55 | nil |
Cardiovascular disease risk factors
We collected data on baseline age (years), sex, TC (mmol/L), HDLC (mmol/L), SBP (mm Hg), smoking status, diabetes status, body mass index (BMI; kg/m²), SEIFA, educational attainment, estimated glomerular filtration rate (eGFR; ml/min/1.73 m²), urinary albumin to creatinine ratio (ACR), and family history of CVD. However, ACR was omitted from predictive risk modelling because it was measured in only one study, and family history was omitted because it was collected inconsistently across studies (e.g. self-reported cause of death for mother or father; mother, father, sister, or brother having experienced a CVD event (with no upper age limit); mother, father, sister, or brother having experienced a CHD event prior to age 60 years). TC, HDLC and SBP were measured using standard procedures. Smoking status was dichotomised as current or not current smoking. Diabetes status was defined as a fasting plasma glucose (FPG) ≥126 mg/dl (7.0 mmol/L), where available. When data on FPG were missing (Table 1) we used self-reported diabetes status; participants who were missing FPG and self-reported as not having diabetes were recorded as not having diabetes. eGFR was estimated from creatinine, measured with an enzymatic assay, using the CKD-EPI equation [8]. The SEIFA score was categorised by national fifths, indexed as 1–5. BMI was derived from objectively measured height and weight. Educational attainment was dichotomised as completed high school or not.
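The coding rules for diabetes and smoking described above can be sketched as follows. This is a minimal illustration rather than the study's own code: the function and argument names are hypothetical, FPG is assumed to be recorded in mmol/L (as in Table 1), and the 126 mg/dl threshold is taken as approximately 7.0 mmol/L.

```python
from typing import Optional

# 126 mg/dl expressed in mmol/L (roughly 126 / 18); an assumed conversion.
FPG_THRESHOLD_MMOL_L = 7.0


def diabetes_status(fpg_mmol_l: Optional[float],
                    self_reported: Optional[bool]) -> Optional[bool]:
    """Return True/False for diabetes, or None if no information is available."""
    if fpg_mmol_l is not None:
        # FPG takes priority whenever it was measured.
        return fpg_mmol_l >= FPG_THRESHOLD_MMOL_L
    # FPG missing: fall back to self-reported status (may itself be missing).
    return self_reported


def current_smoker(smoking_category: str) -> bool:
    """Dichotomise smoking as current versus not current (never or former)."""
    return smoking_category.strip().lower() == "current"
```

Under these assumptions, a participant with no FPG value who reports no diabetes is coded as not having diabetes, matching the rule stated in the text.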
Statistical methods
Participants were included in the analysis if they were between 40 and 74 years of age and free of CVD at baseline. All continuous variables were tested for log-linear associations with the risk of CVD mortality by graphical means. The only violation found was for eGFR, which had a curvilinear association. To reduce the chance of bias from missing data, multiple imputation by chained equations with 30 imputations was used [9]. Covariates included in our imputation models were baseline age, sex, SBP, TC, HDLC, SEIFA fifth, BMI, eGFR, eGFR², family history of CVD, diabetes status, smoking status, highest level of education and follow-up data on CVD mortality, mortality from any cause and days to censoring or death. As we decided, a priori, that age and sex were likely to be effect modifiers for other risk factors, the imputation model was stratified by sex and by age (in thirds). Analyses were conducted on the complete pooled data set.
Cox proportional hazards regression models were used to quantify the associations between baseline factors and the risk of CVD mortality. When estimating CVD mortality, all other causes of death were ignored. The proportional hazards assumption was tested for all covariates included in the model using Schoenfeld’s global test and by graphical inspection of a plot of the scaled Schoenfeld residuals against a function of time. As an initial exploratory analysis, a model was fitted with only the traditional risk factors: age, sex, SBP, TC, HDLC, diabetes and smoking. For the primary prediction model, all available exposure variables were considered as potential prognostic factors, together with all interactions between sex and the other variables and between age and the other variables; all predictors (risk factors, and their interaction terms with sex or age) that were significant (p < 0.05) in multiple-adjusted models were retained. We additionally constructed, in an identical way, a low information model, which omitted all clinical variables collected via blood tests, for potential use in non-clinical settings.
From general theory [9, 10], the 5-year risk prediction from a Cox model is approximated as:
$$ \widehat{p}=1 - S{\left(5,\overline{x}\right)}^{\exp (w)} $$
where \( S\left(5,\overline{x}\right) \) is the probability of survival (without a CVD death) for a 5-year period for the average person (someone with mean values of each risk factor) at baseline (the start of the 5-year period) in the sample data. Also,
$$ w = \sum_{i} b_i\left(x_i-\overline{x}_i\right) = b_1\left(x_1-\overline{x}_1\right)+b_2\left(x_2-\overline{x}_2\right)+b_3\left(x_3-\overline{x}_3\right)+\dots $$
where the {x} are the values taken by any given individual for the risk factors included in the model, the \( \left\{\overline{x}\right\} \) are their mean values (in the sample data) and the {b} are the regression coefficients (log hazard ratios) from the Cox model.
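The two formulas above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and any coefficients, means and baseline survival passed to it would need to come from a fitted model; none of the study's actual estimates are reproduced here.

```python
import math
from typing import Mapping


def linear_predictor(x: Mapping[str, float],
                     means: Mapping[str, float],
                     coefs: Mapping[str, float]) -> float:
    """w = sum over risk factors of b_i * (x_i - mean_i)."""
    return sum(b * (x[name] - means[name]) for name, b in coefs.items())


def five_year_risk(baseline_survival: float, w: float) -> float:
    """p = 1 - S(5, x_bar) ** exp(w), with S the 5-year baseline survival."""
    return 1.0 - baseline_survival ** math.exp(w)
```

For a person whose risk factors all equal the sample means, w is zero and the predicted risk reduces to one minus the baseline survival; a positive w raises the risk above that level and a negative w lowers it.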
To obtain a primary risk score, using only sample data, \( S\left(5,\overline{x}\right) \) was taken as the mean value after fitting the Cox model for the primary risk model in each of the 30 imputations. Similarly, w was taken as the mean over the 30 imputations, but with the {b} values taken from the multiple imputation process (thus fixed at each iteration).
Recalibration
This primary risk score, obtained from the pooled Australian data, may be poorly calibrated for current national purposes for at least two reasons. First, the sample used in each study may be healthier than ‘the average’ at the time of sampling because of the voluntary nature of study participation or the exclusion of subjects who are hard to recruit. Second, there has been a considerable annual decrease in ‘background’ CVD mortality rates in Australia since the studies used to create the primary score were inaugurated (Additional file 1: Figure S1). The primary score was thus recalibrated [10] using the most current (2013) national data on mortality [11] and risk factors [12], using similar methodology to the GLOBORISK project [13] and an earlier, unadopted, Australian risk score [14] that was recalibrated from European Systematic COronary Risk Evaluation (SCORE) estimates of risk [15].
In our recalibrated score we replaced, for each 5-year age/sex group, \( S\left(5,\overline{x}\right) \) with the 5-year survival probability implied by the estimated national 5-year CVD death rate for Australians in 2016. This estimate was based on the most recent national death statistics, which give annual CVD mortality rates by 5-year age/sex group up to 2013 [11]. Also, we replaced \( \left\{\overline{x}\right\} \) by the mean values from the most recent (2011–13) comprehensive national health survey [12], obtained by request from the Australian Bureau of Statistics. See Additional file 2: Table S1 for a comparison between Australian national data and the pooled cohorts. Using these sources of data incurs a minor error due to their inclusion of those with prevalent CVD (6.9% in the six datasets used in this paper).
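In the notation of the primary score, the substitution described in this paragraph can be written as below. The subscript "nat", the age-group index \( a \) and the sex index \( g \) are labels introduced here for illustration and are not the authors' notation; the coefficients \( \{b\} \) are assumed to be carried over unchanged from the primary model.

$$ \widehat{p}_{\mathrm{recal}} = 1 - S_{\mathrm{nat}}\left(5; a, g\right)^{\exp \left(w_{\mathrm{nat}}\right)}, \qquad w_{\mathrm{nat}} = \sum_{i} b_i\left(x_i-\overline{x}_i^{\mathrm{nat}}\right) $$

Here \( S_{\mathrm{nat}}\left(5; a, g\right) \) is read as one minus the projected national 5-year risk of CVD death for age group \( a \) and sex \( g \), and \( \overline{x}_i^{\mathrm{nat}} \) is the national survey mean of risk factor \( i \).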
Single-year mortality projections for 2016 were derived by fitting Poisson regression models to annual data specific to 5-year age/sex groups, for ages 40–79 years, from 2000 to 2013. This model provided a good fit to the data (Additional file 3: Figure S2). Using standard life-table (‘compound interest’) methods, these projections were used to obtain estimated 5-year risks for each 5-year age/sex group, for someone aged at the mid-range of the particular age group. Transition to the next highest age group after 2.5 years was accounted for by taking the single-year estimate of risk in year three of follow-up as the geometric mean of the estimates in the original and next age groups, stratified by sex. Similarly, the value of an individual’s age was rounded to the mid-range of her or his specific 5-year age group when evaluating the w component of the recalibrated risk score in each 5-year age group.
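One possible reading of the life-table step is sketched below. It assumes, as an interpretation rather than a documented detail, that years one and two of follow-up use the projected annual risk of the person's current age group, year three uses the geometric mean of the current and next groups' annual risks, and years four and five use the next group's annual risk. The function name and inputs are hypothetical.

```python
import math


def five_year_risk_from_annual(q_current: float, q_next: float) -> float:
    """Compound five single-year risks of CVD death into a 5-year risk.

    q_current: projected annual risk in the person's current age/sex group.
    q_next:    projected annual risk in the next highest age group, same sex.
    """
    # Year three straddles the two age groups, so use their geometric mean.
    q_year3 = math.sqrt(q_current * q_next)
    annual_risks = [q_current, q_current, q_year3, q_next, q_next]

    survival = 1.0
    for q in annual_risks:
        survival *= 1.0 - q  # 'compound interest': multiply yearly survival
    return 1.0 - survival
```

When the two age groups have the same annual risk q, this reduces to the familiar 1 - (1 - q)^5.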
Evaluating the scores
We tested the discrimination of the primary risk score by evaluating its performance in the multiple imputation model using Harrell’s c-statistic [9]. Additionally, we found the corresponding c-statistic in each of the 30 imputation sets and obtained a pooled estimate from a fixed effect meta-analysis [9]. We also compared the discrimination of the primary, low information and traditional risk factor models. Finally, we evaluated discrimination in an external dataset, the Scottish Heart Health Extended Cohort Study [4], approximating SEIFA fifths with the postcode-based deprivation fifths in that study. Although the calibration of the primary risk score does not require evaluation, given that recalibration has been performed, it was nevertheless useful to check that the primary risk score is well calibrated within the sample data. To do so, a calibration plot [9] was constructed for a pre-specified arbitrary imputation set (i.e. the sample data from the combined six Australian cohorts with missing data ‘filled in’): the first set generated. In addition, the Hosmer-Lemeshow test for survival data [9] was applied to equal tenths of predicted risk. For comparison with existing scores for CVD mortality, calibration plots were also produced, using the same imputation set, for the SCORE models for low- and high-risk European populations [15]. The published 10-year risks from SCORE were transformed to 5-year risks using ‘compound interest’ calculations.
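Two of the calculations in this paragraph are simple enough to sketch. The first function is a plain pairwise version of Harrell's c-statistic that skips pairs with tied follow-up times, a simplification that statistical packages may handle differently. The second assumes that the ‘compound interest’ conversion means a constant annual survival over the 10 years, so that 5-year survival is the square root of 10-year survival. Both function names are hypothetical and neither is the authors' code.

```python
from typing import Sequence


def harrell_c(time: Sequence[float], event: Sequence[bool],
              risk: Sequence[float]) -> float:
    """Share of usable pairs in which the earlier death has the higher risk."""
    concordant = 0.0
    usable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # the shorter follow-up in a usable pair must be a death
        for j in range(n):
            if time[i] < time[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    if usable == 0:
        raise ValueError("no usable pairs: c-statistic is undefined")
    return concordant / usable


def ten_year_to_five_year(p10: float) -> float:
    """Convert a 10-year risk to a 5-year risk, assuming constant annual survival."""
    return 1.0 - (1.0 - p10) ** 0.5
```

A c-statistic of 0.5 corresponds to predictions no better than chance, and 1.0 to perfect ranking of those who died earlier above those who survived longer.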
Although external validation would be ideal [10], there is no meaningful way to validate the final, recalibrated model as, by definition, this is a projection into an unknown future Australia. Instead, we compared the primary and recalibrated models with each other and with the two SCORE predictions. We computed the estimates from all four algorithms for a woman and a man who did or did not smoke, had or did not have diabetes and had average values of all the other risk factors according to the Australian risk factor survey [12].
Analyses were undertaken using SAS and Stata software; a p value of 0.05 or less was considered significant. All analyses and reporting of the prediction model development and validation were conducted in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines.