1 Introduction
Health technology assessment is developing rapidly in Central and Eastern Europe (CEE), e.g. in Bulgaria, Czechia, Hungary and Croatia [1–4]. In Poland, it is compulsory when applying for drug reimbursement [5]. The Polish Health Technology Assessment Agency (AOTMiT) has issued approximately 1800 recommendations since its foundation in 2006, and has assessed nearly 500 health technology assessment reports since the introduction of the current Reimbursement Act in 2012 [6]. Under this regulation, cost-utility analysis is the preferred form of pharmacoeconomic analysis, with the official threshold for the cost per quality-adjusted life-year updated yearly [7]. AOTMiT recommends EQ-5D for valuing health states and calculating quality-adjusted life-years [8].
The EQ-5D questionnaire consists of a descriptive system and a visual analog scale [9]. The descriptive system contains five dimensions: mobility (MO), self-care (SC), usual activities (UA), pain/discomfort (PD) and anxiety/depression (AD). In the original version (EQ-5D-3L), each dimension has three levels: no, some or severe problems; whereas there are five levels in the new version (EQ-5D-5L): no, slight, moderate, severe or extreme problems [10, 11]. The two questionnaires define 243 and 3125 health states, respectively. By attaching a disutility to each level of each dimension, it is possible to calculate a single value (EQ-Index) for every health state, forming a value set [12]. EQ-5D-5L demonstrates better measurement properties than EQ-5D-3L [13, 14].
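As a minimal sketch of how a value set follows from level disutilities, the Python snippet below enumerates the health states of both descriptive systems and computes an EQ-Index as 1 minus the summed disutilities. The disutility numbers are invented for illustration and are not the Polish tariff; the purely additive form (main effects, no constant term) is an assumption of this sketch.

```python
from itertools import product

DIMENSIONS = ("MO", "SC", "UA", "PD", "AD")


def enumerate_states(n_levels):
    """Return every health state as a 5-digit string, e.g. '11111'."""
    levels = range(1, n_levels + 1)
    return ["".join(map(str, combo))
            for combo in product(levels, repeat=len(DIMENSIONS))]


# HYPOTHETICAL disutilities (not the Polish value set), identical for each dimension.
ILLUSTRATIVE_DISUTILITY = {
    dim: {1: 0.00, 2: 0.05, 3: 0.10, 4: 0.20, 5: 0.30} for dim in DIMENSIONS
}


def eq_index(state, disutility=ILLUSTRATIVE_DISUTILITY):
    """EQ-Index = 1 - sum of the disutilities of the state's levels (assumed additive)."""
    return 1.0 - sum(disutility[dim][int(level)]
                     for dim, level in zip(DIMENSIONS, state))


# Number of states defined by EQ-5D-3L and EQ-5D-5L.
print(len(enumerate_states(3)), len(enumerate_states(5)))  # 243 3125
```

Applying `eq_index` to every string returned by `enumerate_states(5)` would yield a complete (illustrative) value set of 3125 entries.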
There are only two published EQ-5D-3L value sets in CEE countries: Slovenian [15] and Polish [16], and no value set for EQ-5D-5L. Although it has been possible to use EQ-5D-5L in Poland [17–20] with the mapping-based cross-walk value set [21, 22], the lack of a directly measured EQ-5D-5L value set has limited its implementation in the decision-making process.
Our objective was to derive a Polish tariff for the EQ-5D-5L descriptive system, using a standardised approach developed by the EuroQol Group. Such a value set could also be used by other CEE countries that are too small to finance valuation studies, yet are culturally similar and are likely to have congruent health preferences.
2 Methods
The methods and analyses reported in this paper comply with the CREATE guidelines for reporting valuation studies of multi-attribute utility-based instruments [23].
2.1 Study Design
Quota-based sampling was applied using Polish census data from November 2014, based on the personal identification number registry (PESEL) and Central Statistical Office data on education [24]. A representative sample in terms of age, sex, education, geographical region and the size of the locality was obtained from Polish residents aged 18+ years. Individuals were recruited through a mixed strategy (public locations, personal contact). Interviews were conducted in public venues or at participants’ homes. Respondents received a financial incentive (a voucher worth the equivalent of €8).
The study design followed a valuation protocol: EuroQol Valuation Technology (EQ-VT 2.0). It includes software for conducting computer-assisted personal interviews, an interviewer script, standardised training materials, data quality-control procedures and an Excel-based quality-control tool enabling monitoring of protocol compliance, interviewer effects and the validity of the collected data [25].
2.2 Valuation Interview
Computer-assisted personal interviews consisted of four main parts: introduction, composite time trade-off (TTO) valuation, discrete choice experiment (DCE) valuation and country-specific background questions. After a general introduction and explanation of the purpose of the study, the respondents self-reported their health using the EQ-5D-5L questionnaire and answered basic background questions (about age, sex and experiences of severe illness).
The TTO valuation used a composite approach: starting with the standard TTO (to find the number of years in full health equivalent to 10 years in an impaired EQ-5D-5L state) and shifting to a ‘lead time’ TTO when participants considered the state to be worse than dead (see detailed descriptions [26–29]). The resulting TTO values range from −1 to 1 in 0.05 increments (the smallest tradable unit being 6 months in duration).
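The mapping from a respondent's answer to a TTO value can be sketched as follows. This is an illustration only: it assumes the commonly described lead-time layout of 10 years in full health followed by 10 years in the impaired state, which this section does not spell out, and the function name and arguments are hypothetical.

```python
def composite_tto_value(years_full_health, worse_than_dead):
    """Convert an indifference point into a composite TTO value.

    years_full_health: years in full health at which the respondent is indifferent
                       (0 to 10, in 6-month steps).
    worse_than_dead:   True if the lead-time TTO was used for this state.
    """
    if not 0 <= years_full_health <= 10:
        raise ValueError("indifference point must lie between 0 and 10 years")
    if (2 * years_full_health) % 1 != 0:
        raise ValueError("answers move in 6-month steps")
    if worse_than_dead:
        # ASSUMED lead-time layout: 10 lead years + 10 years in the state.
        return (years_full_health - 10) / 10
    # Standard TTO: x years in full health ~ 10 years in the state.
    return years_full_health / 10


# Every reachable value, rounded to two decimals.
reachable = sorted({round(composite_tto_value(h / 2, wtd), 2)
                    for h in range(21) for wtd in (False, True)})
print(reachable[0], reachable[-1], len(reachable))
```

Under these assumptions, a 6-month step changes the value by 0.5/10 = 0.05, which matches the increment reported in the text.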
The TTO part of the interview consisted of an explanation of the TTO procedure (the ‘being in a wheelchair’ example and three practice states: mild, severe and difficult to imagine), the actual TTO valuation of ten EQ-5D-5L health states, a structured TTO debriefing and the TTO feedback module. In the feedback module, each respondent was shown the rank ordering of health states derived from their previous responses and could indicate states whose ranking they were not happy with (though there was no possibility of re-evaluation).
The TTO experimental design included 86 EQ-5D-5L health states distributed into ten blocks that were balanced in terms of severity of states. The health states used in EQ-VT 2.0 were selected using a Monte Carlo simulation [30]. Each block included one of five very mild states (only one dimension at level 2 and all others at 1), the most severe state (‘55555’) and eight intermediate states. Respondents were randomised into one of the ten blocks; the health states were presented in a random order.
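The counts in this design can be checked with a few lines of Python. The sketch assumes that the intermediate states are not shared between blocks, which is consistent with the reported total of 86 but is not stated explicitly in the text.

```python
N_DIMENSIONS = 5
N_BLOCKS = 10
INTERMEDIATE_PER_BLOCK = 8
PIT_STATE = "55555"

# Very mild states: exactly one dimension at level 2, all others at level 1.
very_mild_states = [
    "".join("2" if i == d else "1" for i in range(N_DIMENSIONS))
    for d in range(N_DIMENSIONS)
]

# ASSUMPTION: each intermediate state belongs to exactly one block.
n_design_states = len(very_mild_states) + 1 + N_BLOCKS * INTERMEDIATE_PER_BLOCK

# One very mild state + the pit state + the intermediate states.
states_per_block = 1 + 1 + INTERMEDIATE_PER_BLOCK

print(very_mild_states, n_design_states, states_per_block)
```

The result (86 states overall, ten per block) agrees with the ten TTO valuations each respondent completed.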
In the DCE valuation task, participants were presented with a pair of EQ-5D-5L health states with no duration specified (labelled A and B) and asked to indicate which they considered ‘better’ [30–32]. This part of the interview consisted of instructions regarding the task, the valuation of seven pairs and a structured debriefing.
The DCE experimental design included 196 pairs of states randomly divided into 28 blocks, which were identified using an efficient Bayesian design. The blocks were similar in terms of severity, assessed by the sum of the level scores of the health states (i.e. the misery index). Participants were randomly assigned to one of the blocks. The question order and left-right positioning of states were randomised.
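The misery index used to compare block severity is simply the sum of the five level digits. A short sketch (function name hypothetical) also confirms that 196 pairs split evenly into 28 blocks of seven pairs:

```python
def misery_index(state):
    """Sum of the level scores of a health state, e.g. '21345' -> 15."""
    return sum(int(level) for level in state)


N_PAIRS = 196
N_BLOCKS = 28
pairs_per_block = N_PAIRS // N_BLOCKS  # pairs valued by each respondent

print(misery_index("11111"), misery_index("55555"), pairs_per_block)  # 5 25 7
```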
The set of Polish country-specific questions covered: priorities in TTO valuations (length or quality of life), general health using an SF-1 question from the SF-36 questionnaire [33], comorbidities, potential concerns during severe illness, religiosity and beliefs, relationship status, childcare responsibilities, professional status and financial situation. According to the EQ-VT protocol, the minimum recommended sample size for EQ-5D-5L valuation studies is N = 1000 (see the detailed description [30]). Given a planned experimental arm of our research, we set the basic target sample size at N = 1250 (the methods and results of the experimental substudy will be reported elsewhere).
2.3 Quality Control and Data Analysis
We excluded (1) interviews of suspicious quality (‘flagged’ interviews; for a detailed description of quality-control procedures see Electronic Supplementary Material [ESM] 1), (2) the first ten interviews conducted by an interviewer not meeting the minimum quality criteria (at least seven unflagged interviews) and (3) individual TTO valuations marked by the respondent in the feedback module as not adequately representing their health preferences. No individual DCE valuations were excluded. Descriptive statistics were used to summarise the respondents’ characteristics and responses to the TTO and DCE tasks.
2.4 Modelling
2.4.1 General Approach
Below, we present the general approach (dependent/independent variables, model-selection criteria, estimation technique and the building blocks of the model specification under consideration). The formal specification is presented in ESM 1, Online Resource 2.
We based the final model on data from both elicitation techniques (often referred to as a hybrid approach). In the recent literature, all three approaches are used: TTO only [34], DCE only [35] or both [36–39]. As it remains unknown if one clearly outperforms the other, we deemed it safest to have both of them impact the value set (which necessarily worsens the model fit). Therefore, there are two dependent variables: the reported utility of a state (for TTO) and the choice made from a pair of states (for DCE). The states’ dimensions are taken as independent variables. In the process of constructing the final model, several specifications were tested: the choices were based on statistical criteria, pragmatic reasons (what the estimation results are used for) or our beliefs concerning how the elicitation tasks work.
In the estimation process, we used a Bayesian approach [40], as we find it more intuitive and flexible to work with code (a JAGS model run from within R; the code is in ESM 2) that directly describes the data generation process. To let the data speak, we used non-informative priors. In the estimation, we used a Markov-chain Monte Carlo simulation with, respectively, 2000, 30,000 and 20,000 adaptive, burn-in and actual iterations (2000, 20,000 and 10,000 for the intermediate models), no thinning and four chains. The medians of posterior distributions were used as point estimates, and the 2.5th and 97.5th percentiles to construct 95% credible intervals. The model fit was assessed based on deviance and penalised deviance (deviance information criterion [DIC]). Potential scale reduction factors were monitored to diagnose convergence for individual parameters [41].
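The posterior summaries described here can be illustrated with a small, self-contained Python sketch. It is not the authors' R/JAGS code (that is in ESM 2): the draws are simulated, the function names are hypothetical, and the potential scale reduction factor is the basic (non-split) Gelman-Rubin form.

```python
import random
import statistics


def summarise_posterior(draws):
    """Return the median and the 2.5th and 97.5th percentiles of the draws."""
    # n=40 yields cut points every 2.5%; the first and last bound the 95% interval.
    cuts = statistics.quantiles(draws, n=40, method="inclusive")
    return statistics.median(draws), cuts[0], cuts[-1]


def potential_scale_reduction(chains):
    """Basic Gelman-Rubin statistic for equally long chains of one parameter."""
    n = len(chains[0])
    within = statistics.mean([statistics.variance(c) for c in chains])
    # Sample variance of the chain means equals B / n.
    between_over_n = statistics.variance([statistics.mean(c) for c in chains])
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5


# SIMULATED draws standing in for one disutility parameter, four chains.
rng = random.Random(0)
chains = [[rng.gauss(0.3, 0.02) for _ in range(2000)] for _ in range(4)]
pooled_draws = [x for chain in chains for x in chain]

print(summarise_posterior(pooled_draws))
print(potential_scale_reduction(chains))
```

For well-mixed chains the statistic is close to 1; values clearly above 1 suggest that the chains have not converged to the same distribution.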
We only used main effects, i.e. no interactions between dimensions. This was a pragmatic decision, made to ensure the final model may also be useful when only partial information is available (e.g. marginal distributions of levels for each dimension separately) [42]; for similar reasons, models with no constant term were preferred (also supported by results).
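The practical benefit of a main-effects model with no constant term is that the mean utility of a group follows, by linearity, from the marginal distribution of levels in each dimension. A minimal sketch, with invented disutilities (not the Polish value set) and a hypothetical function name:

```python
DIMENSIONS = ("MO", "SC", "UA", "PD", "AD")

# HYPOTHETICAL disutilities, identical for each dimension.
DISUTILITY = {
    dim: {1: 0.00, 2: 0.05, 3: 0.10, 4: 0.20, 5: 0.30} for dim in DIMENSIONS
}


def mean_utility_from_marginals(marginals, disutility=DISUTILITY):
    """marginals[dim][level] is the share of the group reporting that level."""
    expected_loss = 0.0
    for dim in DIMENSIONS:
        shares = marginals[dim]
        if abs(sum(shares.values()) - 1.0) > 1e-9:
            raise ValueError(f"shares for {dim} must sum to 1")
        expected_loss += sum(share * disutility[dim][level]
                             for level, share in shares.items())
    return 1.0 - expected_loss


# Example: half of the group in state 11111 and half in 55555.
marginals = {dim: {1: 0.5, 5: 0.5} for dim in DIMENSIONS}
print(mean_utility_from_marginals(marginals))
```

With interaction terms, the same calculation would require the joint distribution of levels across dimensions, which published summaries often do not report.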
We tested (and utilised in the final model) the random parameters approach: the disutilities of dimensions/levels differ between individuals. Not only do we find this assumption intuitive but in addition the usefulness of random parameters (and the choice of specific distribution) was confirmed by DIC. Nevertheless, to limit random noise and the number of parameters, and also to avoid technical assumptions (the logical ordering of levels), we assumed it is the importance of each dimension (the disutility of level 5) that is distinctive for each individual, while the relative importance of each level is fixed across individuals (somewhat resembling the idea of simplifying how relative level importance is modelled [43]).
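One way to write this structure down is sketched below. The notation is ours and may differ from the formal specification in ESM 1; here $\theta_{i,d}$ denotes the disutility of level 5 in dimension $d$ for respondent $i$ (random across individuals), $w_{d,l}$ the relative importance of level $l$ in dimension $d$ (fixed across individuals), and $s_d$ the level of state $s$ in dimension $d$.

```latex
u_i(s) \;=\; 1 \;-\; \sum_{d \in \{\mathrm{MO},\,\mathrm{SC},\,\mathrm{UA},\,\mathrm{PD},\,\mathrm{AD}\}}
\theta_{i,d}\, w_{d,s_d},
\qquad w_{d,1} = 0, \quad w_{d,5} = 1 .
```

Because only $\theta_{i,d}$ varies between respondents, the ordering of levels within a dimension is governed by the shared weights $w_{d,l}$ and need not be imposed separately on each individual's parameters.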
It is not possible in TTO to report a utility lower than −1. Hence, we tested (and used in the final model) censoring: the observed −1s are treated as ≤−1. Some authors use censoring at 0 (where TTO is changed for lead time TTO) or at 1 (in TTO, a value greater than 1 cannot be reported) [38], which we find unconvincing. Regarding censoring at 0, negative values are possible in the protocol used, and modelling an endogenous self-censoring process would require assumptions (is a given zero the true utility or the effect of censoring?). Being unable to decide if a state is worse than dead is not equivalent to being unable to report <0 utility. Regarding censoring at 1, values above 1 are impossible, not only owing to the protocol but also because of the logical construction of the descriptive system and how the utility values were normalised.
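The effect of censoring at −1 on the likelihood can be sketched as follows. For brevity the sketch uses a normal error term instead of the t-distribution adopted in the paper, and all names are hypothetical: an observed −1 contributes the probability that the latent utility is at or below −1, whereas any other response contributes the usual density.

```python
import math


def normal_logpdf(y, mean, sd):
    z = (y - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2 * math.pi))


def normal_cdf(y, mean, sd):
    return 0.5 * (1 + math.erf((y - mean) / (sd * math.sqrt(2))))


def tto_loglik(observed, latent_mean, sd, lower_bound=-1.0):
    """Log-likelihood of one TTO response with left-censoring at lower_bound."""
    if observed <= lower_bound:
        # Censored: we only know the latent utility is <= lower_bound.
        return math.log(normal_cdf(lower_bound, latent_mean, sd))
    return normal_logpdf(observed, latent_mean, sd)


# An observed -1 is better explained by a latent mean below -1 than above it.
print(tto_loglik(-1.0, -1.5, 0.3), tto_loglik(-1.0, -0.5, 0.3))
```

This is why censoring can pull the estimated utility of very severe states below the raw mean of the observed responses.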
Typically (and in our dataset), there is more variability in responses to more severe states (with lower utility, on average). This may be explained by the random parameters approach, as used in the present paper. Nonetheless, we find it plausible that for a given individual (the importance of dimensions known) there is an additional error term in TTO responses, and that this error tends to be larger for more severe health states (intuitively, for a state whose true utility for a given individual is close to 1, there is little room for a larger error). Therefore, we assumed that the scale parameter of the distribution increases with the theoretical disutility. Specifically, we used a generalised t-Student distribution with the scale and the number of degrees of freedom treated as parameters, allowing for fat tails (but also having a normal distribution as an asymptote).
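A possible formalisation of this error structure is given below. It is a sketch only: the linear link between the scale and the theoretical disutility, and the symbols $\sigma_0$, $\sigma_1$ and $\nu$, are assumptions made here for illustration; the specification actually estimated is the one in ESM 1.

```latex
y_{ij} \;=\; u_i(s_j) + \varepsilon_{ij},
\qquad
\varepsilon_{ij} \sim t_{\nu}\!\left(0,\; \sigma_{ij}\right),
\qquad
\sigma_{ij} \;=\; \sigma_0 + \sigma_1 \bigl(1 - u_i(s_j)\bigr), \quad \sigma_1 > 0 .
```

Here $y_{ij}$ is the TTO response of respondent $i$ to state $s_j$ and $1 - u_i(s_j)$ is the theoretical disutility. Small values of $\nu$ give fat tails, and as $\nu \to \infty$ the error distribution approaches the normal.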
In the DCE part, we assumed the probability of one state being chosen is a function of the difference in utilities, as is typically done. In the standard approach, this dependence is given by the cumulative distribution function of the logistic distribution (the logit model). Instead, based on the previous findings [44] and the DIC, we used the Cauchy distribution.
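The two link functions can be compared with a short sketch. The `scale` argument and the function names are hypothetical, and the snippet only illustrates the shapes of the two cumulative distribution functions, not the estimated model.

```python
import math


def choice_prob_logistic(utility_a, utility_b, scale=1.0):
    """P(A chosen) under a logistic link on the utility difference."""
    return 1.0 / (1.0 + math.exp(-(utility_a - utility_b) / scale))


def choice_prob_cauchy(utility_a, utility_b, scale=1.0):
    """P(A chosen) under a Cauchy link on the utility difference."""
    return 0.5 + math.atan((utility_a - utility_b) / scale) / math.pi


for difference in (0.0, 0.2, 1.0, 2.0):
    print(difference,
          round(choice_prob_logistic(difference, 0.0, scale=0.2), 3),
          round(choice_prob_cauchy(difference, 0.0, scale=0.2), 3))
```

Both links give 0.5 for equally valued states, but the Cauchy link approaches 0 and 1 much more slowly, so it leaves more room for choices that contradict a large utility difference.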
Previous research suggests that people with religious beliefs may misrepresent their preferences in TTO tasks, owing to an unwillingness to trade life-years, which is interpreted as a reporting bias rather than a difference in preference [45]. For this reason, we introduced a parameter that scaled down the disutilities for religious respondents (separately for TTO and DCE), to disentangle the underlying and the reported preferences. In the final model, no such scaling was found in the DCE part, which supports the above interpretation.
We constructed several models sequentially, introducing additional building blocks in succession, and controlling for the DIC improvement, potential scale reduction factors and for whether the 95% credible interval contained a neutral value (i.e. a form of statistical significance). In this paper, we present the results of some of the intermediate steps (all based solely on TTO data):
- M1: panel random-effects approach, with heteroscedasticity-robust standard errors;
- M2: fixed parameters Bayesian model, with no constant term;
- M3: random parameters Bayesian model;
- M4: as M3, with error depending on the theoretical disutility via a t-Student distribution;
- M5: as M4, with scaling as a result of religiosity.
We decided not to present the intermediate steps of the DCE-only part, as the parameters would require some anchoring (for more details on this issue, see [46]). However, in the DCE part too, we monitored the impact of modelling assumptions on the DIC.
2.5 Value Set Comparison
There are three EQ-5D value sets available for Poland: EQ-5D-3L [16], EQ-5D-5L mapping-based cross-walk [22] and the present, directly measured EQ-5D-5L value set. To compare the utility values, we used three methods. First, we estimated the kernel density function of the utility values. Second, we identified the median and the worst levels between the EQ-5D-3L and EQ-5D-5L systems and we presented the utilities for all states. Third, in ESM 1, Online Resource 6, we present a scatter plot to illustrate the relationship between the EQ-5D-5L value set and the other value sets.
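The first comparison method, a kernel density estimate of the utility values, can be sketched in a few lines. The kernel (Gaussian) and bandwidth rule (normal reference rule of thumb) are assumptions of this sketch, as the text does not specify them, and the utilities below are invented.

```python
import math
import statistics


def gaussian_kde(values, bandwidth=None):
    """Return a function estimating the density of `values` at a point x."""
    n = len(values)
    if bandwidth is None:
        # Normal reference rule of thumb (an assumed default).
        bandwidth = 1.06 * statistics.stdev(values) * n ** (-1 / 5)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2)
                   for v in values) / norm

    return density


# HYPOTHETICAL utility values standing in for a value set.
utilities = [1.0, 0.95, 0.9, 0.8, 0.6, 0.3, 0.0, -0.2, -0.5]
density = gaussian_kde(utilities)

# Riemann-sum check that the estimated density integrates to about 1.
step = 0.01
grid = [-3 + step * k for k in range(701)]
area = sum(density(x) for x in grid) * step
print(round(area, 3))
```

Evaluating one such density per value set on a common grid allows the distributions of the three Polish value sets to be overlaid in a single plot.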
4 Discussion
In this study, we followed an official EQ-VT protocol, performed over 1200 computer-assisted face-to-face interviews, collected TTO values for 86 EQ-5D-5L health states and DCE choices for 196 pairs of states, and estimated the Polish EQ-5D-5L value set using both elicitation tasks. Our final model accounted for random parameters (respondent heterogeneity), error scaling (greater noise for more severe states), censoring at −1, unwillingness to trade in TTO by religious participants and a non-logit distribution in DCE. All these elements of the model were added in response to statistical considerations. To the best of our knowledge, two elements are novel: the impact of religiosity and error scaling. We find the latter rather intuitive; the variance of noise increasing with severity may partially explain why there is a weak relationship between the misery index and the disutility for the negative utility values [47]. The former element is probably the most controversial assumption in our model, and our decision to use it followed the reasoning presented in [45]. It is important to stress that correcting for the impact of religiosity does not aim at neglecting the preferences of religious individuals, but at correcting for how they may be biased in the TTO task (and for how the elicitation task differs from what the resulting utilities are used for: not to actually shorten an individual’s life but to trade off benefits between different individuals).
There are two more arbitrary decisions we made in the modelling. First, we decided to combine TTO and DCE data. We believe that, as long as there is no consensus on whether one method is clearly better (not in terms of cost or ease of application but the quality of the results), using both is the safest approach. Second, we decided to use a simple model with no constant and no interaction terms. As mentioned above, that makes the final results more applicable to situations where only limited information is available (e.g. only marginal distributions of levels in individual dimensions). To represent respondents’ answers more accurately (in the sense of predictive validity), a more complex model would probably have to be used (e.g. accounting for a non-linear time preference [44]). In this sense, there is a trade-off between trying to represent the data faithfully and using a specification that can subsequently be used easily.
The assumptions resulted in the theoretical value of u(55555) = −0.590, visibly lower than the average utility elicited in TTO, i.e. −0.408. This difference stems from three elements of our model. First, censoring leads to interpreting observed −1s as effectively possibly much lower than −1 (33.5% of TTO tasks for 55555 ended by assigning −1). Second, introducing the impact of religiosity in TTO tasks results in effectively assuming that the true disutility is larger than the observed one. Third, by considering the random noise as having a larger variance for severe states, we make the parameters less driven by the actual observations for the severe states. Nevertheless, the final utility for the pit state is similar to the one in the EQ-5D-3L value set (hence, the cross-walk), and the slight decrease is intuitive in view of the larger number of levels.
Regarding the final value set, although it describes many more possible health states (3125 vs. 243), it is similar to the Polish EQ-5D-3L value set in terms of the minimum utility, the range of values and the order of the three most important dimensions [16]. The resemblance between the general characteristics of both value sets should support the comparability of health state utility values obtained with these two types of EQ-5D questionnaire, and consequently the comparability of the results from economic analyses and the reimbursement decisions made, which has been questioned in some other countries, such as the UK [48, 49]. What differentiates our study from the previous Polish valuation is greater attention to sampling, which resulted in a study group similar to Polish society as a whole in terms of a higher number of demographic features (geographical spread in the first instance, but also employment status and size of locality).
Similar to some other EQ-5D-5L valuations performed in developed countries, we noted a relative increase in the importance of the anxiety/depression dimension in comparison to former EQ-5D-3L valuation studies. We suppose that this is primarily a consequence of a change in health state preferences over the one or two decades separating the valuation studies, rather than the effect of different wording in the EQ-5D-5L questionnaire. This phenomenon can be observed in England, Germany, the Netherlands, Spain and Japan [34, 36–38, 50, 51], whereas in lower income countries, such as Uruguay or The Philippines, anxiety/depression remains the least important domain [52, 53]. In addition to this observation, the predominance of the mobility dimension in Asian countries (China, Hong Kong, Indonesia, Japan, South Korea and Thailand) merits further investigation [39, 54–57]. Some changes in the dimension weightings may also result from changes in the descriptive system: in the Polish version, the wording for mobility has been changed from ‘confined to bed’ to ‘extreme problems’.
Given the number of CEE countries (20) and their relatively low gross domestic product, the search for simpler and less expensive valuation protocols gains further importance. Discrete choice experiment-based valuations performed online constitute a potential solution, although certain methodological challenges still have to be dealt with [58, 59]. In the meantime, researchers from the CEE region frequently face the dilemma: ‘what EQ-5D value set should I use in the absence of a national value set?’ According to the results of a recent review, in the case of EQ-5D-3L, CEE researchers mostly prefer the UK Measurement and Valuation of Health study tariff [60, 61]. In the case of EQ-5D-5L, the choice will be harder, as the EQ-5D-5L value set for England has faced criticism and is still not supported by the National Institute for Health and Care Excellence [48]. Slovenian researchers may use the cross-walk approach based on their visual analog scale-based EQ-5D-3L value set, but recommendations for scientists from other CEE countries are far from straightforward [21]. Nevertheless, they should at least consider using either the Polish or the forthcoming Hungarian EQ-5D-5L value sets, as CEE countries share some common cultural and historical background.