Background
The Global initiative for chronic Obstructive Lung Disease (GOLD) classification for severity of chronic obstructive pulmonary disease (COPD) is used to classify individual patients, describe study populations, monitor disease progression, and guide individual treatment decisions.
Consensus has grown that the previous GOLD classification, which was based entirely on forced expiratory volume in 1 second as a percentage of the predicted value for someone of the same gender, age and height (FEV1%-predicted), was an insufficiently reliable predictor of the variety of manifestations of the disease [1-4]. For example, frequent exacerbators are also found among patients with relatively mild airway obstruction [5]. This is important, since exacerbations not only predict future exacerbations but are also a risk factor for faster disease progression and mortality [6]. There have been pleas for a more explicit recognition of the variety of COPD phenotypes, which should improve understanding of the impact of the disease and, more importantly, provide prognostic information and guide the selection of more appropriate therapies [7].
In 2011, GOLD presented a new classification system, which was adapted slightly in 2013 [8]. This new classification distinguishes four groups of patients, based on symptoms and exacerbation risk. The assessment of the latter can be based on either exacerbation history or degree of airflow limitation, whichever results in the higher risk. Symptoms are to be assessed using either the modified British Medical Research Council questionnaire (mMRC) [9], which measures breathlessness, or the COPD Assessment Test (CAT) [10], which provides a more comprehensive assessment of the symptomatic impact of COPD.
Recently, several studies investigated the prognostic value of the old and the new (2011/2013) systems with regard to a number of outcome measures. Mortality was predicted equally well by both systems in studies by Soriano et al., Agustí et al. and Johannessen et al. [11-13], whereas Leivseth et al. found that the old classification performed better [14]. Exacerbations and hospitalisations were predicted better by the new system according to Lange et al. and Agustí et al. [12,15], but Johannessen et al. saw no difference in performance between the systems [13]. So far, only one study has examined the ability of the new system to predict lung function decline [12]. It did not find differences in predicted lung function decline across severity stages. However, that study made no comparison with the old system.
It is important that results from these studies are replicated or contradicted in different populations.
Data from the “Understanding Potential Long-term Impacts on Function with Tiotropium” (UPLIFT) trial [16-18] provide the opportunity to investigate the prognostic performance of the new classification system with four years of follow-up. This trial is especially suitable for this purpose, not only because of its duration, but also because of its size (almost 6,000 patients randomized), international origin, and high-quality, quality-controlled lung function data.
The aim of this study, therefore, was to compare the ability of the old and the new (i.e. 2013) COPD classification to predict future decline in lung function, mortality, the total number of exacerbations and the number of severe exacerbations.
Methods
Data
The UPLIFT trial was a multinational, randomized, double-blind, placebo-controlled trial investigating the effect of tiotropium on the yearly rate of decline in FEV1 in current or former smokers (≥10 pack-years) aged ≥40 years with moderate to very severe COPD according to the old GOLD classification system (stages 2 to 4, post-bronchodilator FEV1 of 70% or less of the predicted value) [16,17]. Key exclusion criteria were a history of asthma, a COPD exacerbation or respiratory infection within 4 weeks before screening, a pulmonary resection, use of supplemental oxygen for more than 12 hours per day, and coexisting illnesses that could preclude participation in the study or interfere with the study results.
Patients received either 18 μg of tiotropium or a matching placebo once daily. All respiratory medications, except other inhaled anticholinergic drugs, were permitted during the trial. Smoking cessation programs were offered to all patients before randomization.
Patients were recruited from 2003 to 2004 at 487 centres in 37 countries. The study protocol was approved by the ethics committee at each centre, and all patients provided written informed consent [17]. The follow-up period was four years, during which lung function, exacerbations, St. George's Respiratory Questionnaire (SGRQ) scores [19] and mortality were recorded. Exacerbations were defined as an increase in or the new onset of more than one respiratory symptom (cough, sputum, sputum purulence, wheezing, or dyspnoea) lasting three days or more and requiring treatment with an antibiotic or a systemic corticosteroid, and/or a hospitalisation. Patients were assessed at randomisation, after one month, after six months, and every six months thereafter. For the base case analyses the data from the two treatment groups (tiotropium and control) were combined.
Data from 5630 patients were used in the analysis.
Mortality
Time to death was analysed in Weibull regression models, with either the old or new GOLD classification as covariates as well as other prognostic factors. These were selected in an iterative backward selection process, in which the covariate with the highest p-value was excluded until all p-values were below 0.20. Candidates for inclusion in the model were age, gender, body mass index (BMI), smoking status and the presence or absence of several co-morbidities (coronary heart disease, arrhythmia, vascular disease, nervous disease, diabetes, depression and anaemia).
The regression results were used to construct average adjusted survival curves, following the procedure proposed by Hernán [20]. First, the model coefficients were used to fit multiple individual survival curves for each patient. Each curve assumed a different GOLD stage, irrespective of the actual classification of the patient. The other baseline characteristics were kept constant within patients. After this, mean survival probabilities per 6-month interval were calculated over all patients for each stage and each point in time. These probabilities were then used for constructing survival curves per stage. This was done to ensure that differences in the curves would be due to different severity stage assignments only, and not to other differences (e.g. demographic differences) between the groups.
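The averaging procedure described above can be sketched as follows. This is a minimal illustration, not the analysis code used in the study (which was run in Stata): it assumes a proportional-hazards Weibull form, S(t) = exp(-exp(xb) * t^p), and all coefficients, covariate names and patients are hypothetical.

```python
import math


def weibull_survival(t, linear_predictor, shape):
    """S(t) = exp(-exp(xb) * t**shape), a proportional-hazards Weibull form."""
    return math.exp(-math.exp(linear_predictor) * t ** shape)


def adjusted_survival_curves(patients, intercept, stage_coefs, other_coefs, shape, times):
    """Average, over all patients, the survival predicted under each stage."""
    curves = {}
    for stage, stage_coef in stage_coefs.items():
        curve = []
        for t in times:
            total = 0.0
            for patient in patients:
                # Keep the patient's own covariates, but overwrite the stage.
                lp = intercept + stage_coef + sum(
                    coef * patient[name] for name, coef in other_coefs.items()
                )
                total += weibull_survival(t, lp, shape)
            curve.append(total / len(patients))
        curves[stage] = curve
    return curves


# Hypothetical coefficients and patients, for illustration only.
patients = [{"age": 60}, {"age": 70}, {"age": 75}]
curves = adjusted_survival_curves(
    patients,
    intercept=-6.0,
    stage_coefs={"2": 0.0, "3": 0.4, "4": 0.9},
    other_coefs={"age": 0.03},
    shape=1.2,
    times=[0.0, 0.5, 1.0, 1.5, 2.0],  # years, in 6-month steps
)
```

Because every stage-specific curve is averaged over the same set of patients, the curves can differ only through the stage coefficient, which is the purpose of the procedure.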
The models’ performance was compared by visually inspecting the ranges over which 4-year mortality differed across stages, by using the Akaike Information Criterion (AIC) for model fit [21] and by Harrell’s c-statistic as the measure of discrimination across stages [22,23]. A c-statistic of 0.5 means that a model has no predictive discrimination, in other words, that it has a 50% chance of correctly predicting which of two subjects in different risk categories has the higher probability of experiencing the event. There is no universally used interpretation of the value of the c-statistic. In the context of logistic regression, Hosmer et al. consider values of 0.7 to 0.8 to indicate acceptable discrimination, while discrimination is considered excellent between 0.8 and 0.9 and outstanding when the c-statistic is ≥0.9 [24].
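To make the interpretation of the c-statistic concrete, the sketch below computes Harrell's c for right-censored data by counting concordant pairs. It is a simplified illustration under stated assumptions: pairs with tied event times are ignored, ties in predicted risk count as one half, and the function name and inputs are hypothetical rather than taken from the study.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's c: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i had the event strictly
    before subject j's observed time. Ties in risk count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up times (years), event flags and risk scores.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 1, 0]
c_perfect = harrell_c(times, events, [4, 3, 2, 1])
c_no_information = harrell_c(times, events, [1, 1, 1, 1])
```

A model that assigns the same risk to everyone yields 0.5, the "no discrimination" value mentioned above.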
The AIC is a measure to compare the goodness-of-fit of different statistical models. Its absolute value has no interpretation. A difference in AIC of ≥4 is often considered an indication that the model with the higher AIC fits the data less well [25].
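As a small worked illustration of this comparison, the sketch below uses the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L the maximised likelihood, together with the threshold of 4 mentioned above. The log-likelihoods and parameter counts are hypothetical and do not correspond to the models fitted in this study.

```python
def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion: lower values indicate better fit."""
    return 2 * n_parameters - 2 * log_likelihood


def compare_fit(aic_a, aic_b, threshold=4.0):
    """Return which model fits better, or 'similar' if the gap is < threshold."""
    if abs(aic_a - aic_b) < threshold:
        return "similar"
    return "a" if aic_a < aic_b else "b"


# Hypothetical log-likelihoods and parameter counts, for illustration only.
aic_old = aic(log_likelihood=-1500.0, n_parameters=8)
aic_new = aic(log_likelihood=-1504.0, n_parameters=9)
better = compare_fit(aic_old, aic_new)
```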
Exacerbations
Negative binomial regression with adjustment for treatment exposure was applied to analyse the total rate of exacerbations. The regression model contained either the old or the new GOLD stages, as well as other prognostic factors if necessary. In an iterative backward selection process, the covariate with the highest p-value was excluded from the model unless this led to a 10% change in the estimate of the annual exacerbation rate [26].
The regression results were used to estimate mean rates per GOLD stage. For each patient, the number of exacerbations per year was predicted for each stage, given the patient’s characteristics but irrespective of the actual classification of the patient, and assuming 365.25 days per year. The individual predictions per disease stage were then averaged over all patients.
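The prediction step above can be sketched as follows, assuming a log link with exposure measured in days, so that the predicted rate per day is exp(xb) and the annual rate is that value times 365.25. The function name, coefficients and patient records are hypothetical; the study's own estimates came from Stata.

```python
import math

DAYS_PER_YEAR = 365.25


def mean_annual_rate_per_stage(patients, intercept, stage_coefs, other_coefs):
    """Predict each patient's annual rate under every stage, then average."""
    rates = {}
    for stage, stage_coef in stage_coefs.items():
        total = 0.0
        for patient in patients:
            # Patient covariates are kept; only the stage is overwritten.
            lp = intercept + stage_coef + sum(
                coef * patient[name] for name, coef in other_coefs.items()
            )
            total += math.exp(lp) * DAYS_PER_YEAR  # events per day -> per year
        rates[stage] = total / len(patients)
    return rates


# Hypothetical model: baseline of one exacerbation per year in group A.
rates = mean_annual_rate_per_stage(
    patients=[{"female": 0}, {"female": 1}],
    intercept=math.log(1.0 / DAYS_PER_YEAR),
    stage_coefs={"A": 0.0, "D": math.log(2.0)},
    other_coefs={"female": 0.0},
)
```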
The performance of the new model was compared with that of the old model by visually inspecting the ranges over which rates differed across stages and by using the AIC for model fit. This was repeated for severe exacerbations, which were defined as COPD exacerbations requiring a hospital admission.
Lung function decline
Lung function decline, expressed as the deteriorating course of post-bronchodilator FEV1, was analysed in a linear random effects model. This analysis started at day 30 in order to take into account the fact that many patients experienced an initial post-randomisation improvement in lung function. Covariates were days since randomisation and interactions of GOLD stage and days. These interactions were used to describe decline for each stage. The intercepts and the slope for time since randomisation were assumed to be random with an unstructured covariance matrix and the interactions were modelled as fixed effects. Patients with at least three measurements from day 30 were included. The regression results were used to estimate mean annual lung function decline per GOLD stage. The annual rate of decline per disease stage was determined by multiplying the regression coefficient for this stage by 365.25. The selection of covariates took place along the same lines as for exacerbations. The models’ performance was compared by visually inspecting the ranges over which rates differed across severities and by using the AIC for model fit.
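The conversion from a per-day slope to an annual rate of decline is a single multiplication, sketched below. The slopes are hypothetical values in mL per day, chosen only to illustrate the arithmetic; they are not results from this study.

```python
DAYS_PER_YEAR = 365.25


def annual_decline(slope_per_day_by_stage):
    """Convert stage-specific FEV1 slopes per day into slopes per year."""
    return {
        stage: slope * DAYS_PER_YEAR
        for stage, slope in slope_per_day_by_stage.items()
    }


# Hypothetical fixed-effect slopes (mL per day) for each old GOLD stage.
decline_ml_per_year = annual_decline({"2": -0.12, "3": -0.10, "4": -0.08})
```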
Classification
Patients were classified into GOLD stage 2 to 4, based on post-bronchodilator FEV1% predicted (50-70%, 30-50%, <30%), and into GOLD stage A to D, based on the 2013 GOLD classification [8]. Patients were considered at high risk of an exacerbation if they had an FEV1% predicted <50%, or experienced at least two exacerbations in the previous year, or had been admitted to the hospital with an exacerbation at least once during the previous year. The number of exacerbations in the year before randomization was defined as the number of courses of oral corticosteroids or antibiotics or the number of hospitalisations, whichever was the highest.
Since the dataset did not contain CAT or mMRC scores, on which the symptom dimension of the classification is supposed to be based, the St. George's Respiratory Questionnaire (SGRQ) score was used instead. The SGRQ measures perceived well-being in COPD patients and the impact of the disease on their activities. Patients with an SGRQ score ≥25 were placed in the ‘high level’ symptoms category. This threshold value was found by Han et al. to have the strongest correspondence with the CAT threshold ≥10 [27].
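The classification rules can be summarised in a short sketch. The function and parameter names are hypothetical, the handling of values exactly on a boundary (30% and 50%) is an assumption, and the mapping of letters to risk and symptom levels (A: low risk, low symptoms; B: low risk, high symptoms; C: high risk, low symptoms; D: high risk, high symptoms) follows the GOLD 2013 document rather than being spelled out in the text above.

```python
def gold_old_stage(fev1_pct_predicted):
    """Old GOLD stage (2 to 4) from post-bronchodilator FEV1 % predicted."""
    if fev1_pct_predicted < 30:
        return 4
    if fev1_pct_predicted < 50:
        return 3
    return 2  # trial population: FEV1 of 70% predicted or less


def gold_2013_group(fev1_pct_predicted, exacerbations_prev_year,
                    hospitalised_prev_year, sgrq_score, sgrq_threshold=25):
    """GOLD 2013 group (A to D), with SGRQ standing in for CAT/mMRC."""
    high_risk = (
        fev1_pct_predicted < 50
        or exacerbations_prev_year >= 2
        or hospitalised_prev_year
    )
    high_symptoms = sgrq_score >= sgrq_threshold
    if high_risk:
        return "D" if high_symptoms else "C"
    return "B" if high_symptoms else "A"


# Hypothetical patient: FEV1 45% predicted, no exacerbations, SGRQ 20.
example_stage = gold_old_stage(45)
example_group = gold_2013_group(45, 0, False, 20)
```

Passing `sgrq_threshold=39` reproduces the alternative symptom cut-off used in the sensitivity analyses.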
Substages
All analyses with the new GOLD classification were repeated with substages of C and D. Patients were assigned to substages based on the reason for being considered high-risk: FEV1% predicted <50% but no history of frequent exacerbations (C1 and D1), history of frequent exacerbations but FEV1% predicted ≥50% (C2 and D2), or FEV1% predicted <50% combined with a history of frequent exacerbations (C3 and D3).
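A minimal sketch of the substage assignment is given below. It assumes that "history of frequent exacerbations" means at least two exacerbations or at least one hospital admission for an exacerbation in the previous year, in line with the high-risk definition used for the main classification; the function name is hypothetical.

```python
def high_risk_substage(fev1_pct_predicted, exacerbations_prev_year,
                       hospitalised_prev_year):
    """Return 1, 2 or 3 for high-risk patients, or None for low-risk ones."""
    low_lung_function = fev1_pct_predicted < 50
    # Assumption: a prior admission also counts as frequent exacerbations.
    frequent_exacerbations = (
        exacerbations_prev_year >= 2 or hospitalised_prev_year
    )
    if low_lung_function and frequent_exacerbations:
        return 3
    if frequent_exacerbations:
        return 2
    if low_lung_function:
        return 1
    return None  # groups A and B have no substages


# A group C or D patient is labelled by appending the substage, e.g. "D3".
label = "D" + str(high_risk_substage(40, 3, False))
```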
All analyses were performed in Stata 12.1 [28]. Confidence intervals were calculated by bootstrapping with 1000 replications [29,30].
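The idea behind a bootstrap confidence interval can be sketched as below. The text does not state which type of bootstrap interval was used, so the percentile method shown here is an assumption, as are the function name, the seed and the example data.

```python
import random
import statistics


def bootstrap_percentile_ci(values, statistic, n_replications=1000,
                            alpha=0.05, seed=12345):
    """Percentile bootstrap CI: resample with replacement, take quantiles."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_replications)
    )
    lower_index = max(0, round(alpha / 2 * n_replications))
    upper_index = min(n_replications - 1,
                      round((1 - alpha / 2) * n_replications) - 1)
    return estimates[lower_index], estimates[upper_index]


# Hypothetical per-patient exacerbation counts, for illustration only.
counts = [0, 1, 0, 2, 3, 0, 1, 1, 4, 0, 2, 1]
ci_low, ci_high = bootstrap_percentile_ci(counts, statistics.mean)
```

In the analyses described here the resampled unit would be the patient, with the full model refitted in each replication; the sketch resamples a simple statistic to keep the mechanics visible.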
Sensitivity analyses
All analyses were repeated with a different threshold for symptom severity: SGRQ ≥39. This value was found by Han et al. to have the strongest correspondence with the mMRC threshold of 2 [27].
Furthermore, the analyses with the SGRQ ≥25 threshold were repeated in the control group separately.
Discussion
This study compared the prognostic performance of the old and new GOLD classifications for COPD regarding mortality, exacerbations and decline in lung function. The findings depend on the outcome measure.
As for mortality, both classification systems discriminated equally well, but the old model performed better in terms of model fit. The loss of information on lung function, which was grouped into fewer categories in the new system, does not appear to have been completely mitigated by the added information on symptoms and exacerbation history in the new system.
With regard to (severe) exacerbations, all three dimensions of the new GOLD classification strongly contributed to the predictions. This led to a much better performance for the new classification system than for the old system.
With regard to lung function decline, however, the predictive power of the old system was much better. Information on symptom level and exacerbation history did not improve the ability to predict decline of FEV1.
Our study is the first to compare the old and new systems’ ability to predict lung function decline. Agustí et al. did assess the decline across the new stages [12] but did not compare the two classification systems. Furthermore, they did not find significant differences in decline, whereas patients with a worse lung function in our data showed a slower decline. This pattern was less clear in the new system than in the old system, but still clear and statistically significant. Combining patients with a low lung function and patients with a history of frequent exacerbations into the same stages hides the major differences between these patients. This was also observed in earlier studies [12,15]. Dividing the stages into substages, depending on the reason for which patients are considered high-risk, is very informative and could improve recommendations in individual treatment decisions and in the preparation of treatment guidelines.
The aim of the new guidelines is to enhance the understanding of the impact of COPD on individual patients by combining ‘the symptomatic assessment with the patient’s spirometric classification and/or risk of exacerbations’ [8]. Although lung function in itself does not have a direct impact on patients – it only does so through symptoms, exacerbation risk and mortality risk – it still is an important aspect of disease severity, and hence of the new classification system, because it is a better predictor of mortality than symptoms and exacerbations.
Using trial data for a study like this has advantages and disadvantages. Among the advantages is the high quality of the spirometry data, because a good quality control system was in place. A disadvantage is that a trial population shows less variation in patient and disease characteristics than a real-life population because of the inclusion and exclusion criteria. Furthermore, the exacerbation rate in the UPLIFT trial was relatively low. Despite this, we found that the new classification system was clearly better at predicting exacerbations than the old classification system.
For all analyses we combined the data from the two treatment groups in the UPLIFT trial. We performed additional analyses with treatment as a covariate. This did not lead to different conclusions.
A limitation of this study is that our data contained no information on the mMRC or CAT scores, which are the recommended ways of establishing symptom severity in COPD patients. However, SGRQ and CAT are highly correlated [31]. According to the authors of the new GOLD guidelines, ‘the crucial aspect is to consider whether the patient has only trivial symptoms or feels significantly limited by them’ [32]. Several scales can be used for that purpose. In fact, the authors note that updates of the guidelines may include other scales.
Nevertheless, different scales may lead to different categorisations. The currently proposed cut-off points of the CAT and mMRC do not lead to exactly the same classification of patients [27,33]. More specifically, patients were 25% less likely to be classified as C instead of D when the mMRC criterion was applied [27]. The current CAT cut-off point of 10 or more appears to be more in line with an mMRC score of 1 instead of 2 [27,34].
Earlier studies based the categorisation on mMRC ≥2 [11-15]. Overall, their findings are in line with ours, which used SGRQ ≥25 as a surrogate for CAT ≥10. Furthermore, we found similar results in the comparison of the old and new classification when we used a higher SGRQ threshold as a surrogate for mMRC ≥2. This is consistent with the guideline statement, which does not attach particular importance to the choice of a specific symptom scale. Nevertheless, the model fit was better when the higher SGRQ threshold was used.
In summary, in the UPLIFT population of moderate to very severe COPD patients, the 2013 GOLD classification performed better than the old classification when predicting future exacerbations, whereas the old classification system performed equally well or better when predicting mortality and lung function decline.
Competing interests
This study was funded by Boehringer Ingelheim GmbH.
Authors’ contributions
LG: design, data analysis and interpretation, manuscript writing. IL, NM: design, interpretation, manuscript revision. KB: interpretation, manuscript revision. MPMHR: conception and design, interpretation, manuscript writing. All: final approval of the manuscript.