Introduction
Economic evaluations in health care involve the comparison of the costs and the benefits of different health technologies [
1]. Cost-effectiveness analysis is a widely accepted form of economic evaluation. Cost-utility analysis (CUA) is a specific form of cost-effectiveness analysis in which the benefits of health technologies are measured in terms of quality adjusted life years (QALYs) [
1]. The QALY is a composite measure of both quantity and quality of life.
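As a minimal illustration of how a QALY combines quantity and quality of life, the sketch below approximates QALYs as the area under a utility-over-time curve using the trapezoidal rule. The function name `qalys_area_under_curve` and the example utility values are hypothetical and are not taken from this study.

```python
def qalys_area_under_curve(times_in_years, utilities):
    """Approximate QALYs as the area under the utility-time curve (trapezoidal rule)."""
    if len(times_in_years) != len(utilities) or len(times_in_years) < 2:
        raise ValueError("need at least two matching time points and utilities")
    total = 0.0
    for i in range(1, len(times_in_years)):
        interval = times_in_years[i] - times_in_years[i - 1]   # length of life (years)
        mean_utility = (utilities[i] + utilities[i - 1]) / 2   # quality of life weight
        total += interval * mean_utility
    return total


# Hypothetical example: utility 0.80 at baseline, 0.70 at 6 months, 0.60 at 12 months.
example_qalys = qalys_area_under_curve([0.0, 0.5, 1.0], [0.80, 0.70, 0.60])
```

In this hypothetical example, one year lived at an average utility of 0.70 yields roughly 0.70 QALYs.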
EQ-5D is a generic measure of health-related quality of life (HrQoL) which can be used in clinical and economic studies, and is the recommended measure in National Institute of Health and Care Excellence (NICE) guidelines for calculating QALYs in cost-utility analysis in England and Wales [
2]. EQ-5D consists of 5 dimensions of health i.e. mobility, self-care, usual activities, pain/discomfort and anxiety/depression [
3]. It also includes a visual analogue scale (EQ-VAS), which asks participants to rate their overall health on a scale from 0 to 100. In the original version of the EQ-5D (EQ-5D-3L), each dimension of health included 3 response options (levels) to measure whether participants were experiencing no problems, some/moderate problems or severe/extreme problems [
3]. However, there were concerns that three levels were too broad, so that the EQ-5D-3L offered only limited information on the degree to which respondents’ health was impaired and was less sensitive to changes in respondents’ health status over time. A 5-level version of the EQ-5D (EQ-5D-5L) was therefore developed and introduced in 2009, providing two additional levels for each dimension so that a more nuanced profile of an individual’s health status can be elicited. In the EQ-5D-5L, each dimension of health includes 5 levels to measure whether participants are experiencing no problems, slight problems, moderate problems, severe problems, or extreme problems/inability [
4]. Henceforth in this article, we refer to the EQ-5D-3L and EQ-5D-5L instruments as 3L and 5L respectively.
For the purposes of economic evaluation, responses to the 5 EQ-5D dimensions can be converted into a single index score using a valuation algorithm (value set) based on the social preferences of the general population. Such valuation algorithms are country-specific and are currently available for a number of countries.
The measurement properties of any HrQoL instrument, such as distributional properties, consistency, reliability and validity, should be evaluated in order to assess its appropriateness for use in a specific patient population [
5]. A measurement instrument exhibits good distributional properties if ceiling and floor effects are low (that is, responses are not concentrated within the highest and/or lowest levels of the instrument). The 3L and 5L are consistent with each other if participants’ responses to the 5L match the corresponding levels of the 3L when both measures are administered at the same time point [
6]. Reliability analysis assesses the ability of an instrument to provide reproducible measurements, whereas validity analysis involves assessing the extent to which an instrument measures what it purports to measure [
7]. Convergent (and discriminant) validity and responsiveness are two types of validity analysis. An instrument exhibits convergent validity if it is highly correlated with a related instrument, whereas an instrument exhibits discriminant validity if it has a comparatively low correlation with an unrelated instrument [
5].
Responsiveness may be described as ‘longitudinal validity’, and assesses the degree to which an instrument is able to respond to a meaningful or clinically important external change over time [
5,
8]. Given the most common function of the EQ-5D is to detect changes in HrQoL over time in clinical trials, it is particularly important to evaluate the responsiveness of EQ-5D. An anchor-based analysis may be performed to assess responsiveness. The objective of an anchor-based analysis is to assess whether scores on the measure of interest (i.e. 3L or 5L) change in the expected direction when compared with changes in the scores of a related construct or measure (the ‘anchor’ measure) [
9,
10]. For an anchor-based responsiveness analysis to be undertaken, it is necessary that the anchor measure is responsive in the study population.
We are aware of two previous studies which have compared the 3L and 5L versions of EQ-5D for people of any age with multimorbidity (defined in these studies as ≥ 2 chronic conditions) [
11,
12]. To our knowledge, our study is the first to examine the responsiveness of the 3L or 5L in a population with substantial multimorbidity (our sample presented with a mean of 11.5 chronic conditions upon entry into our study) and polypharmacy (defined for this study as 5 or more different regular drugs taken for more than 30 days). The definition of multimorbidity used in this study was based on the inclusion criteria of the underlying clinical trial that provided the data for the present study (presence of ≥ 3 concurrent chronic conditions), and is stricter than the definition typically used in the clinical field (presence of ≥ 2 concurrent chronic conditions) [
13,
14]. Studying the measurement properties of the 3L and 5L in this population is of significant interest, because this population is increasing in prevalence over time [
15]. Our study is also the first head-to-head study we are aware of (i.e. the same individual completing both the 3L and 5L) that has been undertaken for this population. In terms of studies comparing the measurement properties of 3L and 5L versions of EQ-5D, many studies have been carried out across other populations [
16]. Most of these studies showed that the 5L is highly consistent with 3L responses, as well as offering a better level of performance in terms of reduced ceiling effects and better informativity compared to the 3L [
16‐
18]. Ceiling effects occur when a high proportion of subjects have maximum scores on the measurement of interest. A smaller number of studies, which have applied modern test theory through Rasch analysis, have also indicated improved performance of the 5L compared to the 3L in terms of demonstrating greater sensitivity [
19,
20]. Furthermore, we are aware of only six studies comparing the responsiveness of the 5L and 3L. Of these, three studies found that the 5L was more responsive than the 3L [
21‐
23], two found that the measures exhibited similar responsiveness [
24,
25], and one study of 112 stroke patients indicated that the 3L was more responsive than the 5L [
26].
The main objectives of this study were to:
a. Assess discriminant validity, informativity and responsiveness of the 3L and 5L versions of EQ-5D in an older adult population with substantial multimorbidity and polypharmacy.
b. Assess consistency of the 3L and 5L in an older adult population with substantial multimorbidity and polypharmacy. Consistency is the extent to which responses based on the 3L correspond to those based on the 5L.
Results
Descriptive statistics
At the 6-month follow-up in the OPERAM study, 256 (83%) patients self-reported the EQ-5D measures, 45 (15%) had the measures reported by proxy by their next of kin, and 8 (2%) had the measures reported by proxy by another (unspecified) individual. Of the 256 self-reporting participants, 224 also self-reported the EQ-5D measures at 12 months and fully completed all 3L and 5L items at both 6 and 12 months. This sample of 224 participants was used for all analyses and included participants who reported inconsistent responses. The age, gender, education level and comorbidity characteristics of the analysed sample were broadly similar to those of the overall OPERAM trial population (described in the methods section).
Summary statistics are provided in Table 1, showing that 56% of participants were male, 28% were university educated, 46% had completed high school as their highest level of education, 26% had not completed high school, and 4% had spent some time living in a nursing home in the 6 months before the trial started. Participants had a median of 10 coexistent chronic conditions upon entering the OPERAM trial. A small reduction in index score of 0.01 (rounded) was observed between 6 and 12 months for both the 3L and 5L.
Table 1
Summary statistics (n = 224)
Variable | N | Mean | SD |
Age (years) | 224 | 77.5 | 5.35 |
Number of coexistent chronic conditionsa | 224 | 11.5 (median = 10) | 6.01 |
Number of medicationsa | 224 | 9.3 | 3.34 |
Barthel Index 6 months | 224 | 95.2 | 8.81 |
EQ-5D-3L index score 6 months | 224 | 0.83 | 0.21 |
EQ-5D-3L index score 12 months | 224 | 0.82 | 0.22 |
EQ-5D-5L index score 6 months | 224 | 0.81 | 0.19 |
EQ-5D-5L index score 12 months | 224 | 0.80 | 0.21 |
EQ-VAS score 6 months | 224 | 69.9 | 15.71 |
Characteristic | n (%) |
Gender (male) | 126 (56.2%) |
Highest level of education—university | 62 (27.6%) |
Highest level of education—high school | 102 (45.5%) |
Highest level of education—less than high school | 59 (26.3%) |
Highest level of education—not applicable/reported | 1 (0.4%) |
Spent some time in the 6 months before trial in nursing home | 10 (4.4%) |
Country of residence—Switzerland | 55 (24.5%) |
Country of residence—Ireland | 48 (21.4%) |
Country of residence—Belgium | 58 (25.8%) |
Country of residence—Netherlands | 63 (28.1%) |
In this sample at 6 months, 41 unique health states were represented using the EQ-5D-3L, and 99 states using the EQ-5D-5L. Spearman’s rho at 6 months between the 3L and 5L index scores was 0.88 (95% CI: 0.84 to 0.90), between the 3L index scores and 3L-VAS was 0.41 (95% CI: 0.30 to 0.52), and between the 5L index score and 5L-VAS was 0.44 (95% CI: 0.32 to 0.54).
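The count of unique health states can be related to the size of each descriptive system: with 5 dimensions, the 3L can describe 3^5 = 243 states and the 5L can describe 5^5 = 3,125 states. The sketch below shows one way such counts could be obtained. The function names and the example profiles are hypothetical; only the figures 41 and 99 come from this study.

```python
def count_observed_states(profiles):
    """Number of distinct health-state profiles actually reported."""
    return len({tuple(profile) for profile in profiles})


def count_possible_states(levels_per_dimension, dimensions=5):
    """Number of states the descriptive system can express."""
    return levels_per_dimension ** dimensions


# Hypothetical profiles: (mobility, self-care, usual activities, pain, anxiety)
example_profiles = [(1, 1, 1, 1, 1), (1, 1, 2, 2, 1), (1, 1, 1, 1, 1), (2, 1, 1, 3, 1)]
observed = count_observed_states(example_profiles)

# Share of each descriptive system used, from the counts reported in the text.
share_3l = 100 * 41 / count_possible_states(3)
share_5l = 100 * 99 / count_possible_states(5)
```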
The extent of missing data was similar for the two instruments (see footnotes of Appendix Tables 7 and 8).
With both the 3L and 5L, there was a small reduction between the 6- and 12-month time points in the proportion of participants reporting "no problems" with usual activities [from 73 to 68% with the 3L (Appendix Table
7), and from 64 to 61% with the 5L (Appendix Table
8)]. There were no statistically significant changes at the 5% level in responses between 6 and 12 months, for any of the 3L and 5L dimensions (Appendix Tables
7 and
8). Whilst the pattern of change indicated by the 3L and 5L between the two time points was broadly similar, there were important differences. Notably, for mobility and anxiety/depression, the direction of change differed between the 3L and 5L (positive for 3L and negative for 5L for both items; see Appendix Table
9).
Consistency
We assessed the presence of inconsistent responses between the 3L and 5L, i.e. 5L responses that differed by ≥ 2 levels from the same person’s 3L response (highlighted in Appendix Table
10). There were 28 (3%) inconsistent responses between the 3L and 5L reported across items (7 (3%) inconsistent responses for the mobility item, 4 (2%) for the self-care item, 7 (3%) for the usual activities item, 4 (2%) for the pain/discomfort item and 4 (2%) for the anxiety/depression item). The 28 inconsistent responses were elicited from 26 participants in total.
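The inconsistency rule can be sketched as follows. This is an illustration under a stated assumption: 3L levels are first placed on the 5L scale as 1 → 1, 2 → 3 and 3 → 5 before the difference is taken. That mapping is a common convention in 3L/5L comparisons but is not spelled out in this section, and the names `THREE_TO_FIVE`, `is_inconsistent` and `count_inconsistent` are hypothetical.

```python
# Assumed placement of 3L levels on the 5L scale (not stated in this section).
THREE_TO_FIVE = {1: 1, 2: 3, 3: 5}


def is_inconsistent(level_3l, level_5l):
    """True if the 5L response is at least 2 levels away from the mapped 3L response."""
    return abs(THREE_TO_FIVE[level_3l] - level_5l) >= 2


def count_inconsistent(paired_responses):
    """Count inconsistent (3L level, 5L level) pairs for one dimension."""
    return sum(1 for level_3l, level_5l in paired_responses
               if is_inconsistent(level_3l, level_5l))


# Hypothetical paired responses for a single dimension.
example_pairs = [(1, 1), (1, 2), (2, 3), (1, 3), (3, 3), (2, 5)]
example_count = count_inconsistent(example_pairs)
```

Under this assumed mapping, a 3L response of "no problems" paired with a 5L response of "slight problems" counts as consistent, whereas a pairing with "moderate problems" does not.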
Ceiling effects
A pronounced ceiling effect was observed for the self-care item with both instruments (84% of participants reported "no problems" for self-care with the EQ-5D-3L and 83% with the EQ-5D-5L). The ceiling effect was substantially higher for the EQ-5D-3L index score (29%) than for the EQ-5D-5L index score (22%), a statistically significant difference (p < 0.001) (Table 2).
Table 2
Percentage of patients with a ceiling effect for each dimension of the completed EQ-5D-3L and EQ-5D-5L instruments and for the overall measures at 6 months (n = 224)
Dimension | EQ-5D-3L (%) | EQ-5D-5L (%) | p-value |
Mobility | 50.8 | 39.2 | < 0.001 |
Self-care | 84.3 | 83.4 | 0.48 |
Usual activities | 72.7 | 63.8 | < 0.001 |
Pain/discomfort | 50.4 | 44.6 | 0.002 |
Anxiety/depression | 83.4 | 79.0 | 0.003 |
Index score | 29.4 | 22.3 | < 0.001 |
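The ceiling percentages in Table 2 are proportions of respondents at the best possible level. A minimal sketch of that calculation is given below, assuming that a ceiling on the index score corresponds to reporting "no problems" on all five dimensions. The function names and example profiles are hypothetical, and the sketch does not reproduce the significance test behind the reported p-values.

```python
def ceiling_percentage(levels, best_level=1):
    """Percentage of respondents who chose the best level of one dimension."""
    if not levels:
        raise ValueError("no responses supplied")
    at_ceiling = sum(1 for level in levels if level == best_level)
    return 100.0 * at_ceiling / len(levels)


def full_health_percentage(profiles, best_level=1):
    """Percentage of respondents reporting the best level on every dimension."""
    if not profiles:
        raise ValueError("no profiles supplied")
    at_ceiling = sum(1 for profile in profiles
                     if all(level == best_level for level in profile))
    return 100.0 * at_ceiling / len(profiles)


# Hypothetical profiles: (mobility, self-care, usual activities, pain, anxiety)
example_profiles = [(1, 1, 1, 1, 1), (2, 1, 1, 3, 1), (1, 1, 2, 2, 1), (1, 1, 1, 1, 1)]
self_care_ceiling = ceiling_percentage([profile[1] for profile in example_profiles])
index_ceiling = full_health_percentage(example_profiles)
```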
For comparison, 4 participants (2%) reported a VAS score of 100 at 6 months (indicating they have the ‘best health they can imagine’). All 4 of these participants also reported full health with both the 3L and 5L at 6 months. 81 participants (36%) reported a Barthel Index score of 100 at 6 months (indicating they have no dependency in ADLs). Of these, 58 participants reported full health with the 3L, and 44 participants reported full health with the 5L at 6 months.
Validity
For discriminant validity, we assessed the correlation between the EQ-5D items and Barthel Index (Table
3). There were no statistically significant differences at the 5% level between the 3L and 5L items in how strongly they correlated with the Barthel Index, as indicated by the 95% confidence interval for each 3L item overlapping with that of the corresponding 5L item. Although the difference was not statistically significant, the negative correlation between the mobility dimension and the Barthel Index was larger in magnitude for the 5L. Of all the 3L and 5L items, the pain/discomfort and anxiety/depression items had the weakest correlation with the Barthel Index.
Table 3
Correlation coefficients of dimensions of the EQ-5D measures with the Barthel Index (n = 224)
Dimension | EQ-5D-3L coefficient | 95% CI | EQ-5D-5L coefficient | 95% CI |
Mobility | − 0.37 | − 0.48 to − 0.25 | − 0.42 | − 0.52 to − 0.30 |
Self-care | − 0.57 | − 0.65 to − 0.48 | − 0.58 | − 0.66 to − 0.48 |
Usual activities | − 0.45 | − 0.55 to − 0.34 | − 0.42 | − 0.52 to − 0.31 |
Pain/discomfort | − 0.15 | − 0.27 to − 0.02 | − 0.16 | − 0.28 to − 0.03 |
Anxiety/depression | − 0.14 | − 0.27 to − 0.01 | − 0.13 | − 0.26 to − 0.003 |
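The coefficients in Table 3 are negative because higher EQ-5D levels indicate worse health, whereas higher Barthel Index scores indicate less dependency. The sketch below shows a rank-based (Spearman) correlation in plain Python. It assumes that a Spearman coefficient was used for Table 3, as it was for the index scores reported earlier; this section does not confirm that. All function names and example data are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: worse mobility level paired with lower Barthel Index score.
mobility_levels = [1, 2, 3, 4, 5]
barthel_scores = [100, 95, 90, 80, 60]
example_rho = spearman_rho(mobility_levels, barthel_scores)
```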
Responsiveness
142 participants (64%) reported no change in their total Barthel Index score between 6 and 12 months. Responsiveness analysis demonstrated both EQ-5D measures were responsive to changes in the Barthel Index from 6 to 12 months (Table
4). This responsiveness was evident both in the overall sample (Table
4), as well as in each of the country-specific subgroups (Appendix Tables
11,
12,
13,
14). Furthermore, the 3L and 5L were similar to each other in their responsiveness to changes in the Barthel Index. For both measures, Cohen’s D effect sizes ranged from a moderate positive effect when the Barthel Index improved to a small negative effect when the Barthel Index worsened.
Table 4
Assessment of responsiveness of the EQ-5D-3L and the EQ-5D-5L measures to changes in the Barthel Index (n = 224)
EQ-5D-3L, by change in Barthel Index |
Improved | 0.68 | 0.83 | 0.16 (0.07 to 0.25) | 0.66** | 31 | 55 |
No change | 0.89 | 0.87 | − 0.02 (− 0.05 to 0.02) | − 0.09 | 142 | 25 |
Worsened | 0.74 | 0.66 | − 0.08 (− 0.16 to 0.00) | − 0.30* | 50 | 28 |
EQ-5D-5L, by change in Barthel Index |
Improved | 0.68 | 0.81 | 0.13 (0.07 to 0.20) | 0.66*** | 31 | 59 |
No change | 0.87 | 0.86 | − 0.01 (− 0.04 to 0.02) | − 0.05 | 142 | 32 |
Worsened | 0.73 | 0.65 | − 0.09 (− 0.14 to − 0.03) | − 0.36** | 50 | 32 |
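The anchor-based approach summarised in Table 4 can be sketched as follows: participants are grouped by the direction of change in the anchor (here the Barthel Index), and a standardised effect size is computed for the EQ-5D index score within each group. The sketch assumes that the effect size is the mean change divided by the standard deviation of baseline scores; this section does not state which standard deviation was used for Cohen's D, so the definition, the function names and the example records are all assumptions.

```python
from statistics import mean, stdev


def anchor_group(anchor_change):
    """Classify a participant by the direction of change in the anchor measure."""
    if anchor_change > 0:
        return "improved"
    if anchor_change < 0:
        return "worsened"
    return "no change"


def effect_size(baseline_scores, followup_scores):
    """Mean change divided by the SD of baseline scores (assumed definition)."""
    changes = [after - before for before, after in zip(baseline_scores, followup_scores)]
    return mean(changes) / stdev(baseline_scores)


def effect_size_by_anchor(records):
    """records: iterable of (anchor_change, baseline_index, followup_index)."""
    grouped = {}
    for anchor_change, baseline, followup in records:
        before, after = grouped.setdefault(anchor_group(anchor_change), ([], []))
        before.append(baseline)
        after.append(followup)
    return {group: effect_size(b, a) for group, (b, a) in grouped.items()}


# Hypothetical records; each group needs at least two participants for an SD.
example_records = [
    (5, 0.60, 0.80), (10, 0.70, 0.80), (5, 0.80, 0.90),
    (-5, 0.90, 0.80), (-10, 0.70, 0.60),
]
example_effects = effect_size_by_anchor(example_records)
```

A responsive instrument would be expected to show a positive effect size in the improved group and a negative one in the worsened group, which is the pattern described for both the 3L and 5L.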
Both the 3L and 5L demonstrated some degree of responsiveness to changes in the VAS from 6 to 12 months (Table
5). The two measures demonstrated similar responsiveness to changes in the VAS. There was a small improvement in 3L and 5L scores among patients whose VAS scores improved. This improvement in the overall sample appeared to be driven by improvements in 3L and 5L scores in the Netherlands and Ireland (Appendix Tables
17,
18). However, there was no statistically significant change in 3L and 5L scores for the patients whose VAS scores worsened from 6 to 12 months.
Table 5
Assessment of responsiveness of the EQ-5D-3L and the EQ-5D-5L measures to changes in the VAS (n = 224)
3L-VAS |
Improved | 0.80 | 0.88 | 0.08 (0.02 to 0.14) | 0.40* | 52 | 44 |
No MCID | 0.85 | 0.81 | − 0.04 (− 0.10 to 0.01) | − 0.19 | 101 | 30 |
Worsened | 0.81 | 0.79 | − 0.02 (− 0.06 to 0.02) | − 0.07 | 71 | 21 |
5L-VAS |
Improved | 0.78 | 0.84 | 0.06 (0.01 to 0.10) | 0.31* | 55 | 45 |
No MCID | 0.83 | 0.81 | − 0.03 (− 0.07 to 0.02) | − 0.13 | 98 | 36 |
Worsened | 0.80 | 0.78 | − 0.02 (− 0.06 to 0.02) | − 0.11 | 71 | 28 |
Shannon’s evenness indices indicated that the 3L and 5L were informative for mobility, usual activities and pain/discomfort dimensions, although less informative for self-care and anxiety/depression dimensions (Table
6). The EQ-5D-3L was slightly more informative with respect to self-care (EQ-5D-3L J′ = 0.48; EQ-5D-5L J′ = 0.42), and the EQ-5D-5L was substantially more informative with respect to mobility (EQ-5D-3L J′ = 0.69; EQ-5D-5L J′ = 0.86).
Table 6
Shannon’s index (H′) and Shannon’s evenness index (J′) values for EQ-5D-3L and EQ-5D-5L measures at 6 months
Dimension | EQ-5D-3L n | EQ-5D-3L H′ | EQ-5D-3L J′ | EQ-5D-5L n | EQ-5D-5L H′ | EQ-5D-5L J′ |
Mobility | 224 | 1.09 | 0.69 | 224 | 1.99 | 0.86 |
Self-care | 224 | 0.75 | 0.48 | 224 | 0.96 | 0.42 |
Usual activities | 224 | 0.94 | 0.59 | 224 | 1.51 | 0.65 |
Pain/discomfort | 224 | 1.27 | 0.80 | 224 | 1.85 | 0.80 |
Anxiety/depression | 224 | 0.73 | 0.46 | 224 | 1.05 | 0.45 |
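The values in Table 6 are consistent with the usual definitions of Shannon's index, H′ = −Σ p_i log2(p_i) over the response levels, and Shannon's evenness index, J′ = H′ / log2(L), where L is the number of levels (3 or 5). For example, for mobility 1.09 / log2(3) ≈ 0.69 and 1.99 / log2(5) ≈ 0.86. The sketch below implements these definitions; the function names and the example responses are hypothetical.

```python
from math import log2


def shannon_index(responses):
    """H': -sum(p * log2(p)) over the levels that were actually chosen."""
    n = len(responses)
    counts = {}
    for level in responses:
        counts[level] = counts.get(level, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())


def shannon_evenness(responses, number_of_levels):
    """J': H' divided by its maximum, log2(number of levels)."""
    return shannon_index(responses) / log2(number_of_levels)


# Hypothetical responses spread evenly over the three 3L levels.
even_responses = [1, 2, 3, 1, 2, 3]
example_evenness = shannon_evenness(even_responses, 3)

# Evenness implied by the mobility H' values reported in Table 6.
mobility_j_3l = 1.09 / log2(3)
mobility_j_5l = 1.99 / log2(5)
```

J′ reaches 1 when responses are spread evenly across all levels and falls towards 0 when they cluster in a single level, which is why dimensions with strong ceiling effects, such as self-care, score low.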
Discussion
In this study, we investigated the measurement properties of the EQ-5D-3L and EQ-5D-5L in older adults with substantial multimorbidity. From our analyses, we found a low proportion of inconsistent responses between the EQ-5D-3L and EQ-5D-5L, which was also found in the majority of previous studies comparing the 3L and 5L [
16]. This indicates that 5L responses are distributed logically in relation to 3L responses. The EQ-5D-3L represented 41 unique health states out of a possible 243 states (17%), and the EQ-5D-5L represented 99 unique health states out of a possible 3,125 states (3%). This shows that a larger share of the descriptive space of the 3L was used. Both the EQ-5D-3L and EQ-5D-5L exhibited discriminant validity with the Barthel Index, which was also found in a previous study [
36]. The occurrence of missing data at 12 months was also similar between the two measures. Almost all missing data resulted from participants not being available at 12 months to provide, through telephone interview, the information needed for the secondary outcome measures of the main OPERAM trial (e.g. due to trial drop-out), and should not be considered reflective of the performance of the EQ-5D measures themselves.
We observed high rates of ‘no problems’ with 3L and 5L self-care and anxiety/depression items, which could suggest that the EQ-5D description of levels excludes the type of self-care or anxiety/depression problems encountered by the patient population studied. Alternatively, it may be the case that patients genuinely do not have such problems, or that care settings are working well to enable self-care.
Consistent with most other studies [
16,
41,
42] including an assessment of the subgroup of multimorbid patients in a study by Thompson et al. [
11], in our sample we observed a reduction in ceiling effects from using the EQ-5D-5L (22%) compared to the EQ-5D-3L (29%). The EQ-5D-5L therefore appears to better capture variability in health status among those who have a high level of health, compared with the EQ-5D-3L. Also consistent with all the studies identified in a systematic review by Buchholz et al. in 2018 [
16] was our finding of an overall improvement in informativity from using the 5L compared to the 3L. This was the consequence of a substantially higher Shannon evenness index score for the mobility item of the 5L compared with the 3L, which was also observed in a study of multimorbid adults by Thompson et al. [
11]. However, informativity in our study was higher with the 3L than with the 5L for self-care.
We observed similar responsiveness to change over time for the EQ-5D-3L and EQ-5D-5L. Several studies evaluating responsiveness have reported an improvement in responsiveness from using the EQ-5D-5L compared with the EQ-5D-3L [
21‐
23], but other studies have reported either no difference in responsiveness [
24,
25] or a reduced responsiveness from using the EQ-5D-5L compared with the EQ-5D-3L [
26]. Given the mixed findings across these responsiveness studies, there is currently no clear evidence that using the 5L instead of the 3L to collect utility data for economic evaluations will lead to systematically different incremental QALY estimates. This contrasts with the suggestion by Hernandez-Alava et al. (2018) that using the 5L instead of the 3L will lead to systematically lower estimates of incremental QALYs [
43]. In our study, both the 3L and 5L were more responsive to the Barthel Index than they were to the VAS. This may be because the VAS measures a broader underlying construct of health, whereas the Barthel Index is a disability-specific measure. Feng et al. also previously observed a weak correlation between EQ-VAS change scores and 5L change scores [
44].
This is the first study to investigate measurement properties of the EQ-5D-3L and EQ-5D-5L in older adults with substantial multimorbidity, through a head-to-head comparison. The 5L and 3L were not administered directly after each other, which probably reduced the possibility of a patient’s 3L response being directly influenced by the 5L response immediately beforehand. Further separation was not possible given the set-up of the OPERAM trial. We were able to carry out a robust assessment of responsiveness through analysis of a sample of 224 participants who we assessed over a 6-month follow-up period. We assessed responsiveness within a clinical trial, and observed for our sub-sample, only a very small reduction in 3L and 5L scores between 6 and 12 months. The findings of our study may to an extent be relevant to other clinical trials during which small changes in health are occurring (particularly in trials with a similar population to our own), and inform the decision of whether to select the 3L or 5L in such trials. A limitation of our study is that we only investigated responsiveness of the instruments to changes in the Barthel Index and the EQ-VAS. Investigation of responsiveness of the instruments to other variables predicted to correlate with HRQoL in older multimorbid patients would have been desirable but these were not available. In our analyses, we assessed responsiveness of the EQ-5D-3L to change in the 3L-VAS measure, and responsiveness of the EQ-5D-5L to change in the 5L-VAS measure. We did this to prevent results from being biased in favour of one instrument over the other. This was due to our concern that an “order effect” might be induced [
45], in which 5L-VAS responses were influenced by responding to the EQ-5D-5L directly beforehand and 3L-VAS responses were influenced by responding to the EQ-5D-3L directly beforehand.
Furthermore, because the number of proxy EQ-5D responses gathered was too small, and the participants for whom proxy responses were elicited had more physical health impairments than self-reporting participants, we removed proxy responders from our analyses. It was not feasible under these circumstances to investigate the measurement properties of proxy EQ-5D responses for older multimorbid adults. However, a separate analysis comparing the patient and proxy responses that we collected is planned for a future publication. Another limitation is that there may be differences between countries which, given the sample size and the heterogeneity of the sample, could not be confirmed or assessed in detail. Larger studies would be required for this. Finally, our sample presented with a notably large number of chronic conditions (mean of 11.5; median of 10 concurrent chronic conditions); hence, caution should be exercised in generalising our results to older adults with fewer comorbidities.
Another possible limitation is that we decided to use the German crosswalk method to calculate 5L scores instead of the German 5L value set. We did this because using the crosswalk method instead of a national value set is still a recognized standard in some major guidelines for calculating 5L scores for economic evaluations [
2], indicating this is currently best practice. The implications of this decision on the results from our study are not known.
One potential area of future research is to compare test–retest reliability of the EQ-5D-3L and the EQ-5D-5L. Investigation of this property was beyond the scope of this study and few prior studies comparing the EQ-5D-3L and EQ-5D-5L have investigated this property, which relates to how strongly correlated repeated EQ-5D scores are [
16,
46].
Both the EQ-5D-3L and EQ-5D-5L demonstrated satisfactory performance in this study, thus justifying their use as measures for HRQoL studies and cost-utility analyses of older people with multimorbidity. However, prominent guidelines recommend using the EQ-5D-5L consistently across all diseases and populations [2], and the overall consensus of the literature comparing the measurement properties of the 3L and 5L in different patient populations is that the 5L exhibits better measurement properties than the 3L [16]. Nevertheless, the 3L may be slightly less burdensome to complete than the 5L because it has fewer response options. In addition, appropriate value sets are currently more widely available for the 3L than for the 5L.
We conclude that both the EQ-5D-3L and EQ-5D-5L exhibit a reasonably high level of performance for measuring the health of older adults with substantial multimorbidity and associated polypharmacy who are able to self-complete the questionnaires.