1 Introduction
In the economic evaluation of health interventions, the quality-adjusted life year (QALY) is a commonly used metric that combines length and quality of life into a single figure. The quality, or utility, weight used in the estimation of QALYs is anchored on a full health (1) to dead (0) scale, with negative values assigned to health states considered worse than dead. Utility values for health states associated with a particular condition or disease can be derived in several ways, one of which is via the use of preference-based measures (PBMs) of health. Of currently available PBMs, the EQ-5D [1, 2] is the most widely used.
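As a minimal sketch of how length and quality of life combine into a single figure, the QALY total for a patient can be computed as the sum of utility multiplied by time spent in each health state. The function name `qalys` and the example profile below are hypothetical illustrations, not values from any value set, and discounting is deliberately omitted.

```python
from typing import Sequence, Tuple


def qalys(periods: Sequence[Tuple[float, float]]) -> float:
    """Sum utility * duration (in years) over consecutive periods.

    Utilities are anchored at 1 (full health) and 0 (dead); negative
    values represent states considered worse than dead.
    """
    total = 0.0
    for utility, years in periods:
        if utility > 1.0:
            raise ValueError("utility cannot exceed 1 (full health)")
        if years < 0:
            raise ValueError("duration cannot be negative")
        total += utility * years
    return total


# Hypothetical profile: 2 years at 0.8, 1 year at 0.5, 1 year at -0.1.
example_total = qalys([(0.8, 2.0), (0.5, 1.0), (-0.1, 1.0)])
print(round(example_total, 3))
```

Because the utility weight multiplies duration directly, any change in the value set used to score the same health state feeds straight through to the QALY estimate.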
EQ-5D classifies health on five dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. The original version of the EQ-5D (described as EQ-5D-3L) included three severity levels (none, some, extreme/unable to)¹, thereby describing 3⁵ = 243 health states. In the UK, utility values for EQ-5D-3L health states were derived using the time trade-off (TTO) preference elicitation technique [3]. The resulting ‘value set’ has been widely influential, and is preferred by the National Institute for Health and Care Excellence (NICE) for use in the cost-utility analysis of health interventions [4]. EQ-5D-3L values are also accepted by reimbursement agencies worldwide, including the Pharmaceutical Benefits Advisory Committee (PBAC) in Australia [5] and the Canadian Agency for Drugs and Technologies in Health (CADTH) in Canada [6]. The instrument itself is also used in a wide range of settings, including population health surveys and routine clinical practice [7].
Notwithstanding the widespread use of the EQ-5D-3L descriptive system and value set, research has suggested that both have a number of limitations. Regarding the descriptive system, it has been shown that the EQ-5D-3L is not sensitive to the important quality of life impacts of all conditions [8, 9]. It may also not be sensitive to smaller changes in health, as it has only three response levels in each dimension. In addition, in general public and some patient samples, a substantial proportion of respondents report themselves as being in the best health state, i.e. no problems on any dimension (11111). This is known as a ceiling effect [10]. Regarding the value set, the procedure and modelling used to elicit values for worse than dead health states have been criticised [11]. Furthermore, the EQ-5D-3L valuation data were collected in 1993, and population preferences for different aspects of health and quality of life may have changed since then given advances in treatment and care. Social and environmental changes may also be important.
In an effort to improve the instrument’s sensitivity and reduce the ceiling effect, a five-level descriptive system, the EQ-5D-5L [12], was developed. The new instrument includes five response levels (none, slight, moderate, severe, extreme/unable to). The wording was also standardised across dimensions so that the worst level of mobility was changed from ‘confined to bed’ to ‘unable to walk about’, which is in line with the severity indicators used for the other functioning dimensions (self-care and usual activities). The intermediate severity level was also standardised to be ‘moderate’. The EQ-5D-5L increases the number of states described to 5⁵ = 3125. Research has shown improved measurement properties of the EQ-5D-5L descriptive system across a number of patient samples when compared to the EQ-5D-3L [13].
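The state counts for the two descriptive systems follow directly from five dimensions, each with three or five levels. A short sketch that enumerates the profiles makes this concrete; the helper name `health_states` is an illustrative assumption, and states are written in the usual five-digit notation (e.g. 11111 for no problems on any dimension).

```python
from itertools import product
from typing import List

DIMENSIONS = (
    "mobility",
    "self-care",
    "usual activities",
    "pain/discomfort",
    "anxiety/depression",
)


def health_states(levels: int) -> List[str]:
    """Return every profile as a digit string, one digit per dimension."""
    return [
        "".join(str(level) for level in combo)
        for combo in product(range(1, levels + 1), repeat=len(DIMENSIONS))
    ]


states_3l = health_states(3)
states_5l = health_states(5)

print(len(states_3l), len(states_5l))  # 243 3125
print(states_3l[0], states_5l[-1])     # 11111 55555
```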
One consequence of this initiative was the need to develop value sets for the new descriptive system that reflect more up-to-date preferences of the population for health and quality of life, and this resulted in two separate developments. Firstly, an interim ‘crosswalk’ value set was developed so that EQ-5D-3L values could be used to predict EQ-5D-5L values [14]. Secondly, in order to elicit values for health states generated by the EQ-5D-5L descriptive system, a new valuation protocol combining TTO and discrete choice experiment (DCE) methods was developed [15]. This protocol used a ‘composite’ TTO approach combining standard and ‘lead time’ TTO [15-17]. In England, health states generated by the EQ-5D-5L were valued during 2012 and 2013 using this protocol and subsequently modelled using newly developed techniques that combined TTO and DCE data in a hybrid model to produce an EQ-5D-5L value set [18, 19].
Three EQ-5D value sets are therefore now available for use in cost-utility analysis in the UK and/or England: the EQ-5D-3L value set, the crosswalk value set mapping the EQ-5D-5L descriptive system onto the EQ-5D-3L value set, and the EQ-5D-5L value set. The first two of these were developed based on valuations from respondents in the UK, while the last was based on valuations from respondents in England only. However, this is only one way in which they differ. As noted, they are also based on different descriptive systems, valuation protocols, and modelling methods. Given widespread and increasing use of the EQ-5D-5L in decision making, it is important to systematically assess the differences between the value sets, and the implications of the new values. For example, in recent work, it has been found that quality of life changes are valued less using the EQ-5D-5L value set [20]. At the end of 2017, NICE released a position statement regarding the use of the EQ-5D-5L stating that “the mapping function developed by van Hout et al. [14] [i.e. the crosswalk value set] should be used for reference-case analyses” until its position is reviewed in 2018 [21]. This means that the UK crosswalk is currently important in health technology assessment (HTA) carried out by NICE, and the results of studies comparing the new EQ-5D-5L value set with the crosswalk and EQ-5D-3L will inform future decisions about which measure to use. Therefore, the aim of this paper is to add to the literature in this area by comparing the UK EQ-5D-3L and English EQ-5D-5L value sets, and the EQ-5D-5L and crosswalk value sets.
4 Discussion
We have compared three EQ-5D value sets that can be used to support HTA in the UK. The comparison firstly investigated differences in the ‘theoretical’ values possible from the value sets for health states matched across the EQ-5D-3L and EQ-5D-5L descriptive systems and secondly compared values observed in patient data.
Regarding the theoretical values, the results demonstrate that there are differences between the EQ-5D-3L and EQ-5D-5L value sets: the EQ-5D-5L values for matched states are higher, and the overall range, and therefore the change between adjacent states, is smaller than for the EQ-5D-3L. The distribution of values also differs. There are similar differences between the EQ-5D-5L value set and the crosswalk tariff, given that the latter is linked to the EQ-5D-3L value set. However, it is also worth noting that some underlying features of the preferences, and therefore of the utility scales, are similar. For example, the relative importance of each dimension to the overall value is similar: the rank order of the dimensions is the same apart from two dimensions, mobility and anxiety/depression, which swap positions in the EQ-5D-5L value set. The relative distance between the levels for different dimensions is also similar.
Regarding the observed values from the patient data, the EQ-5D-5L value set produces higher values overall and across all of the conditions included, and the differences are generally significant. This is expected given the overall increase in the values of matched states and reduction in the overall utility scale. There is some evidence that the value sets rank different health conditions in a similar order, particularly the most and least severe conditions as measured by the descriptive systems. However, this requires further exploration across a larger range of conditions.
There are a number of possible reasons why the EQ-5D-3L and EQ-5D-5L value sets differ. These include differences in the samples used in terms of demographics and country. The EQ-5D-3L value set was based on a representative sample of England, Scotland, and Wales, whereas the EQ-5D-5L value set was based on an English sample only. This may have implications for decision making in the jurisdictions that are not represented. However, the project team has since collected EQ-5D-5L valuation data for the other countries in the UK and so will be able to make comparisons using a more representative sample (albeit one that is smaller than that used for the EQ-5D-3L). Potential changes in population demographics and preferences over time (from 1993 to 2013) are another possible reason why the value sets demonstrate differences. For example, the population is getting older [24], and this might impact on preferences for different health dimensions. One indication of change in preferences over time might be the increased magnitude of the anxiety/depression dimension given increased focus on the detrimental aspects of mental health conditions in policy [25] and reduction in stigma surrounding conditions such as depression [26]. Even without the development of the EQ-5D-5L, the currently used EQ-5D-3L value set is outdated and would therefore require updating. Overall, the dimension preference structure between the EQ-5D-3L and EQ-5D-5L is similar, with only one inversion (anxiety/depression and mobility), which is encouraging given the differences between the studies. This suggests that the order of preferences for the five areas of health described by the EQ-5D may be generally consistent over time.
Other reasons why the value sets may differ relate to the descriptive system and the valuation method used. Firstly, regarding the descriptive system, the EQ-5D-5L uses more consistent wording, particularly for the more severe levels, and it is possible that the change in labelling of the mobility dimension (from ‘confined to bed’ to ‘unable to walk about’) has impacted the values, where mobility has a smaller weighting in the EQ-5D-5L than in the EQ-5D-3L. The increase in levels and associated sensitivity also may impact the magnitude of the difference and transition between the intermediate levels and therefore the overall value set.
Secondly, the valuation method differs, particularly regarding the process used to value states worse than dead, which was problematic for the EQ-5D-3L [11]. The methodological change to a new approach to eliciting values < 0, the lead time TTO, meant that the lowest possible value for an EQ-5D-5L health state in the protocol used was −1 [15, 27]. In contrast, the minimum value was −39 in the Dolan study [3], which was rescaled to −1. This therefore led to a reduction in the overall scale. The inclusion of DCE tasks in the EQ-5D-5L valuation also provides a different type of valuation data focusing on the choices between states rather than measuring direct values for states, as is the case with TTO. The development of innovative modelling methods combining TTO and DCE data in one model [28, 29] provides further reasons for differences in the value sets. The modelling process for the EQ-5D-5L data also developed heterogeneous models for the TTO data only [19], and further work is underway to model the EQ-5D-3L valuation data applying the methods developed for the EQ-5D-5L [30]. It is also worth noting that a partial replication of the original EQ-5D-3L valuation study was carried out by Macran and Kind [31]. In this study, the authors used a smaller health state design, but a similar TTO process to Dolan [3], and estimated an EQ-5D-3L value set with quite different characteristics. For example, the value for the worst state was substantially higher (−0.126 vs −0.594), and the proportion of negative states was substantially lower (12.3% vs 34.6%). This value set is more in line with other EQ-5D-3L value sets developed internationally [32], and provides a useful counterpoint for comparisons between the value sets included in this study.
There are also large differences in the proportion of states valued as worse than dead (i.e. with a negative value) and the associated values assigned to these states, which has resulted in a smaller range for the EQ-5D-5L. One of the key criticisms of the EQ-5D-3L value set was the process used to value and subsequently model states worse than dead, which led to the large range observed [11], which may not realistically reflect population preferences. The protocol for the development of the EQ-5D-5L value set introduced a new method for the valuation of states worse than dead, which bounded all observed values on a −1 to 1 scale [15, 17]. This has reduced the overall proportion of negative values and moved the anchor value of 0 (i.e. the state equivalent to dead). Further work could compare the characteristics of the health states that have values close to zero across different value sets.
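The difference in lower bounds between the two valuation approaches can be sketched numerically. The formulas below are assumptions based on how these TTO tasks are commonly described (a 10-year time frame, a 10-year lead time in the composite TTO, and a worse-than-dead value of −x/(10 − x) in the original 3L task with a smallest time unit of 0.25 years); they are not stated in this paper, and the function names are hypothetical. The subsequent rescaling of the 3L worse-than-dead values is not implemented here.

```python
def conventional_tto(years_full_health: float, time_frame: float = 10.0) -> float:
    """Better-than-dead value: x years in full health judged equivalent
    to `time_frame` years in the state being valued (assumed formula)."""
    return years_full_health / time_frame


def lead_time_tto(years_full_health: float,
                  lead_time: float = 10.0,
                  time_frame: float = 10.0) -> float:
    """Worse-than-dead value under lead time TTO (assumed formula):
    x years in full health judged equivalent to `lead_time` years in
    full health followed by `time_frame` years in the state."""
    return (years_full_health - lead_time) / time_frame


def unbounded_wtd_value(years_full_health: float, time_frame: float = 10.0) -> float:
    """Raw worse-than-dead value in the original 3L task before any
    rescaling (assumed formula)."""
    return -years_full_health / (time_frame - years_full_health)


print(conventional_tto(10.0))     # 1.0, upper bound of the scale
print(lead_time_tto(0.0))         # -1.0, lowest value under lead time TTO
print(unbounded_wtd_value(9.75))  # -39.0, raw minimum before rescaling
```

Under these assumptions the lead time approach bounds elicited values at −1 by construction, whereas the earlier task produces a highly skewed raw scale that has to be transformed after data collection.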
However, the impact of the change in negative values on HTA is unclear, as it is not well established how often states that are worse than dead actually appear in cost-effectiveness models. There are differences in the proportions of negative states in different conditions: the proportion is similar across the EQ-5D-3L and EQ-5D-5L in Parkinson’s disease, but quite different for multiple sclerosis and COPD, for example. This might be due to changes in the magnitude of the decrement associated with the key dimensions for each condition. As the overall range of EQ-5D-5L values is smaller, the change in QALYs (for estimates generated from quality of life changes) might be reduced across the whole scale for states both better and worse than dead. However, this depends on the descriptive data: respondents could show no change on the 3L (i.e. ‘some’ problems both before and after) whilst showing a change on the 5L (a move from ‘moderate’ to ‘slight’), leading to higher QALY gains.
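The interaction between descriptive sensitivity and utility range described above can be illustrated with a toy calculation. All utility values below are hypothetical, chosen only for illustration, and relate to a single dimension; they are not taken from the EQ-5D-3L, crosswalk, or EQ-5D-5L value sets.

```python
# Hypothetical single-dimension utilities; NOT from any published value set.
UTILITY_3L = {"none": 1.00, "some": 0.80, "extreme": 0.30}
UTILITY_5L = {
    "none": 1.00,
    "slight": 0.95,
    "moderate": 0.85,
    "severe": 0.60,
    "extreme": 0.40,
}


def qaly_gain(utility_before: float, utility_after: float, years: float) -> float:
    """QALY gain from a change in utility sustained for `years`."""
    return (utility_after - utility_before) * years


# The same patient reports 'some' problems before and after on the 3L,
# but moves from 'moderate' to 'slight' on the 5L.
gain_3l = qaly_gain(UTILITY_3L["some"], UTILITY_3L["some"], years=1.0)
gain_5l = qaly_gain(UTILITY_5L["moderate"], UTILITY_5L["slight"], years=1.0)

print(gain_3l)            # 0.0
print(round(gain_5l, 3))
```

In this sketch the 5L registers a gain that the 3L misses entirely, even though the hypothetical 5L scale has the narrower range (0.40 to 1.00 against 0.30 to 1.00).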
It is also useful to compare the scale of the English EQ-5D-5L value set with those from other countries that were developed using the same valuation protocol [15]. For example, the Dutch value set has a minimum value of −0.446, with around 15% of states valued negatively [33]. The Spanish EQ-5D-5L value set has a minimum value of −0.224 [34]. Differences between countries could be due to cultural differences in preferences as well as the use of different modelling approaches. Further work should compare EQ-5D-5L value sets from different countries in more detail.
It is unclear how the differences between the value sets identified in both the theoretical values and the patient data will impact the HTA process. This is because the utility values will be applied to both treatments and their comparators, and therefore to some extent the differences may cancel out, and the estimates of improvements in quality of life between arms of a clinical trial could be similar using the EQ-5D-3L or EQ-5D-5L value sets. The increased sensitivity of the EQ-5D-5L, arising from the addition of two extra response levels and the change possible across the levels, may also favour QALY gains even if the changes in utility are smaller. An added complexity is whether the gain is linked to improving quality of life or extending length of life, and the interaction between the two. This requires further investigation on clinical trial data, which is a key part of this programme of research, and has also been investigated by other researchers, who found different cost-effectiveness estimates based on the value set used [20].
There are also implications for the NICE reference case and further decision making based on its recently released position statement regarding the use of the EQ-5D-5L. The improvement in the methods used to both collect and model the valuation data and the increased use of the improved descriptive system make a strong case for the use of the new EQ-5D-5L value set. The EQ-5D-3L value set has benefits if the instrument is still being used in trials and other settings, but is based on societal preferences from decades ago. The crosswalk draws on the EQ-5D-3L values and so is prone to the same issues as that value set. There is also the potential for ‘gaming’, where the crosswalk may be used instead of the EQ-5D-5L value set to potentially inflate QALY gains (as the utility range, and therefore change between states, is larger). One important question is how to compare results of cost-utility analyses using the EQ-5D-5L against those using the EQ-5D-3L and establish the cost per QALY thresholds that should be used. Further work is required to explore this.
The main limitation of this study is that we have not tested the impact of the value sets on any clinical trial data, which would have enabled us to directly compare QALY estimations. Doing so would allow us to test some of the issues raised using data previously used for cost-utility analysis, and is the next planned stage of this programme of research. It will also be important to compare the psychometric performance, and impact on cost-utility analysis, of the EQ-5D-5L descriptive system and value set with those of other widely used generic measures. In particular, comparisons with version two of the SF-6D (SF-6Dv2) [35], which has been valued using DCE with duration methods, would be useful.