Background
Measurement of health utility is an important part of cost-effectiveness analysis in health care. Health utility can be measured by several preference-based utility measures, of which the EuroQol (EQ-5D) [1, 2] and the Health Utilities Index [3] are the most widely used. Recently, a new index score, called the SF-6D, has been developed [4]. This instrument produces a summary score based on an algorithm using a subset of 11 questions from the SF-36 health status measure [5]. The major reason for developing the SF-6D was to enlarge the basis for economic evaluations, while retaining the descriptive richness and sensitivity to change of the SF-36 [6]. This reasoning is based on observations that the EQ-5D has poorer descriptive ability and is less sensitive to change than individual SF-36 domains [7-10]. These potential advantages of the SF-6D over alternative instruments should be substantiated in additional studies. A further point of interest is the difference in the methodology applied in deriving a utility score, which could imply that utilities with different "meaning" are obtained, resulting in confusion when interpreting results from studies that use different instruments [11]. Potentially, policy decisions could be compromised by using utilities that are not equivalent. Therefore, we sought to assess the equivalency of the SF-6D and the EQ-5D cross-sectionally, in domain content, in scoring distribution, and in the amount of change measured after intervention. We addressed these questions by comparing the SF-6D and EQ-5D qualitatively and quantitatively, using data from two randomised controlled trials of patients with symptomatic coronary stenosis.
Discussion
We compared the measurement properties of the EQ-5D and the SF-6D in a group of patients undergoing coronary revascularisation. We found clear differences between these utility measures: in concept, in baseline scores and in sensitivity to change. First of all, the number of domains differs: 5 versus 6. However, the contribution of the SF-6D vitality domain, which has no counterpart in the EQ-5D, is small. Therefore, one could expect that domains tapping similar areas of health make roughly equal contributions to the total score. This is the case for the domains pain and mood/mental health. However, the content and weights of the other domains show considerable differences, with the EQ-5D giving more weight to physical functioning and the SF-6D to social functioning. A second difference is that the recall periods of the instruments differ: today for the EQ-5D, versus the last four weeks (or one week in the acute version of the SF-36) for the SF-6D. The third difference is that the scoring range of the EQ-5D is twice that of the SF-6D. The location of the baseline median scores in the scoring range was also quite different: in the top quarter for the EQ-5D, halfway for the SF-6D. A fourth difference was that the distributions were significantly different from each other, although the mean values appeared to be similar. Both the difference between the median values and the limits of agreement in the Bland-Altman plot exceed the minimal clinically important difference of both the SF-6D and the EQ-5D [20, 21]. The lack of agreement is further exemplified by the low ICC.
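To illustrate the agreement analysis referred to above, the following minimal sketch computes the mean difference (bias) and limits of agreement used in a Bland-Altman plot for paired utility scores. It is not the analysis code of this study: the function name is hypothetical, the paired values are invented for illustration, and the 1.96 multiplier assumes the conventional 95% limits.

```python
from statistics import mean, stdev


def bland_altman_limits(scores_a, scores_b, z=1.96):
    """Return (bias, lower_limit, upper_limit) for paired scores.

    bias is the mean of the pairwise differences a - b; the limits are
    bias -/+ z times the sample standard deviation of those differences.
    z = 1.96 is an assumption (conventional 95% limits of agreement).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired (equal length)")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    bias = mean(diffs)
    spread = z * stdev(diffs)
    return bias, bias - spread, bias + spread


# Invented paired utilities for six patients (not trial data).
eq5d = [0.80, 0.69, 1.00, 0.52, 0.76, 0.85]
sf6d = [0.70, 0.66, 0.80, 0.60, 0.68, 0.72]

bias, lower, upper = bland_altman_limits(eq5d, sf6d)
print(f"bias={bias:.3f}, limits of agreement=({lower:.3f}, {upper:.3f})")
```

In such an analysis, agreement would be judged by comparing the width of the interval between the two limits with the minimal clinically important difference, rather than by the bias alone.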
A fifth difference is found in the sensitivity to change. Both measures recorded change in the PTCA group, but differed in the CABG groups: EQ-5D scores improved significantly, but SF-6D scores did not change. The SF-6D recorded greater change than the EQ-5D in the PTCA group, despite its narrower scoring range. In the CABG groups, the change in the EQ-5D was caused by change in anxiety/depression and mobility. There was however no corresponding improvement in the SF-6D physical functioning domain. The significant deterioration in social functioning and role limitations cancelled out the improvement in mental health, resulting in no change in the overall SF-6D score. Another important reason for the difference in amount of change after CABG may lie in the differing recall periods: with a post-intervention assessment at one month, the 4-week recall period of the SF-36 encompasses both the intervention and recovery period, as compared to today's health status in the EQ-5D. However, the difference between SF-6D and EQ-5D remains at the subsequent measurements. This cannot be fully explained by different recall periods, as patients are stable by 6 months, and today's health should not differ that much from that over the last 4 weeks.
Both measures display non-normal distributions, both at baseline and in change over time. The EQ-5D is skewed towards good health, which creates a ceiling effect. The SF-6D is highly centred on the middle of the scoring range (see figure 1). The difference in scoring range may be explained by differences in the reference state for the valuation task and in the valuation technique. Two-thirds of the respondents valued the worst possible health state of the SF-6D as better than dead, causing the lower limit of the SF-6D to be well above zero [4]. The EQ-5D valuation study used dead as the lower anchor, resulting in negative scores for the worst health states [16]. The valuation studies of the two instruments also used different valuation techniques. The standard gamble method, used for the SF-6D, generally gives somewhat higher valuations than time trade-off (used for the MVH-A1 tariff) [22, 23], but these differences are not large enough to explain the narrower scoring range of the SF-6D. The difference in scoring range implies that apparently similar baseline scores and change scores are not equivalent, prohibiting direct comparisons between utility scores obtained by different instruments. More detailed discussions of the differences in valuation methods and scoring algorithms are given by Brazier and coworkers [24] and Bryan and Longworth [25].
A substantial part of the missing SF-6D scores was caused by incompletely filled-in questionnaires. The algorithm of the SF-6D requires that all relevant questions are answered. However, the algorithm of the domain scores of the SF-36 allows a certain amount of missing scores, which are imputed with the mean value of the completed items of that domain [5]. We used that technique to reduce the number of missing scores in the SF-6D, imputing a value for missing items in an SF-36 domain using the mean value for that domain. This way, the amount of missing scores in the SF-6D due to incomplete questionnaires was halved, from about 12% to 6% of the total number of SF-6D scores. Imputation did not affect the median values. Note, however, that this solution would not be viable if the SF-6D were administered without the other SF-36 questions.
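The imputation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring code used in this study: the function name is hypothetical, the items of a domain are assumed to be already recoded to a common scale, and the threshold of at least half of the items answered is an assumption reflecting the usual SF-36 scoring convention. How an imputed, possibly non-integer value is then mapped to an SF-6D response level is not specified here.

```python
def impute_domain_items(items, min_answered_fraction=0.5):
    """Fill missing items (None) with the mean of the answered items.

    items: item scores of one SF-36 domain, with None for a missing answer.
    Returns a new list with missing items replaced by the domain mean, or
    None when too few items were answered (the threshold is an assumption).
    """
    answered = [value for value in items if value is not None]
    if not items or len(answered) / len(items) < min_answered_fraction:
        return None  # too little information to impute
    domain_mean = sum(answered) / len(answered)
    return [domain_mean if value is None else value for value in items]


# Invented example: a four-item domain with one missing answer.
filled = impute_domain_items([3, None, 5, 4])
print(filled)
```

Because the replacement value is taken from the other items of the same domain, the approach only works when the full SF-36 is administered, which is the limitation noted above.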
Several recent studies have compared the EQ-5D and the SF-6D, as we did [21, 24-29]. In a study comparing seven patient groups, Brazier and coworkers found overall similar mean scores for the two measures in patients with mild diseases [24], but baseline values clearly differ in more severely ill patients such as liver transplant patients [26] and patients with a recent stroke [21]. These studies confirmed some of the disagreements we found: differing descriptive content and differing scoring range [24, 26, 29]. The pattern of correlations between domains we found was similar to that in the Brazier study, except that the magnitude of the correlations was much lower. Despite the strong correlation between the utility scores, these data do not support construct validity, as the correlation structure was rather diffuse with only moderate correlations. Only the mood and mental health domains behaved as expected (i.e. a strong correlation with each other, and low correlations with other domains).
The sensitivity to change of the SF-6D remains unclear: Pickard and colleagues found that the SF-6D was as sensitive as the EQ-5D in stroke patients, although the SF-6D also changed in patients who reported themselves as unchanged [21]. Other studies, including ours, found no change in SF-6D after intervention, compared to significant changes in the EQ-5D [26, 29].
These differences at baseline and in change over time imply that changes in utility and/or quality adjusted life years based on different instruments cannot be directly compared. Furthermore, these differences are larger than the minimal clinically important difference, which will influence conclusions of cost-effectiveness analysis and clinical decision-making.
Conclusion
In conclusion, the EQ-5D and SF-6D are not equivalent, despite some resemblances. Although the mean utility scores appear to be similar, the differences in median values, scoring range and sensitivity to change after intervention, together with the low agreement, show that the EQ-5D and SF-6D yield scores that are not comparable. Even within a group of patients with the same diagnosis, the EQ-5D and SF-6D yield different scores, while sensitivity to change seems to be influenced by the type of intervention. The SF-6D has better distributional properties than the EQ-5D, but that did not result in improved sensitivity to change. However, it cannot be said which instrument is correct. Clearly, the SF-6D measures something different from the EQ-5D, and these instruments cannot be used interchangeably.
Currently, there is no clear benefit in using the SF-6D in clinical studies instead of the EQ-5D, as the SF-6D is not clearly better. As the EQ-5D presently is generally accepted, it may be preferred, thus obtaining results comparable with previous studies.
Authors' contributions
HFvS participated in the design of the study, performed the statistical analysis and drafted the manuscript. EB conceived of the study and participated in its design. Both authors read and approved the final manuscript.