Eligibility
Using clinical and administrative data derived from the electronic health record and membership enrollment files, we identified individuals who were 60 years of age or older, had a history of hypertension, had one or more non-cardiovascular comorbidities, and had a score of 3 or more based on the Quan adaptation of the Elixhauser comorbidity index (Quan score) [
14]. The non-cardiovascular comorbidities we considered were HIV/AIDS, alcohol abuse, anemia, chronic pulmonary disease, depression, dementia, drug abuse, liver disease, paralysis and other neurological disorders, cirrhosis, osteoarthritis, osteoporosis, peptic ulcer disease, psychoses, pulmonary circulation disorders, renal failure and rheumatoid arthritis.
We excluded patients who were not fluent in spoken English and patients who were visually impaired (e.g., legally blind). We included patients who were mildly cognitively impaired but excluded those with a diagnosis of dementia in the 365 days prior to cohort creation.
Sample recruitment
A random sample of eligible individuals was identified administratively using the Kaiser Permanente Colorado Virtual Data Warehouse, a quality-controlled common data model derived from multiple Kaiser Permanente Colorado data sources [
15]. We recruited random samples of eligible participants in waves of 50 until we reached the target of 200 completed surveys. Potential participants received a recruitment mailing that included an invitation letter, a Study Information Sheet, an opt-out postcard, the paper survey with a postage-paid return envelope, and a $10 gift card incentive. After 2 to 4 weeks, potential participants received follow-up telephone calls that served as reminders and offered assistance with survey completion if needed.
There is no established sample size calculation for best-worst scaling [
16,
17]. In a review of best-worst scaling surveys in health care [
17], the median sample size among object case surveys was 180. We defined a target sample size of 200.
Development of the best-worst scaling survey
We designed the survey as best-worst scaling tasks (case 1), a method introduced by Finn and Louviere [
18]. In this design, respondents are asked to choose the best and the worst of three or more “objects”. The main advantage of this method is that it discriminates better than, for example, discrete choice experiments, because it elicits not only which object is the best but also which is the worst. It can thereby yield complete rather than partial ranking information [
17]. Best-worst scaling is assumed to decrease the cognitive burden placed on respondents by asking them to compare only a few outcomes at a time instead of all at once. We chose this method to minimize cognitive burden, because we also included respondents with mild cognitive impairment and because it allowed us to compare many outcomes. We used a balanced incomplete block design (generated using SAS version 9.4); the survey consisted of 11 blocks of five outcomes each. As all outcomes had a negative impact on health, we phrased the question as: “If one of the following health problems were to happen to you, which would worry you most and which would worry you least?” The survey is shown in Additional file
1.
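For illustration, a comparable design can also be generated in R; the following is a minimal sketch using the crossdes package (an assumption for illustration, as the study design was generated in SAS):

```r
## Minimal sketch of a balanced incomplete block design in R, using the
## crossdes package (the actual design was generated in SAS version 9.4)
library(crossdes)

set.seed(1)
## 11 outcomes arranged in 11 blocks of 5: each outcome appears in 5 blocks
## and each pair of outcomes appears together in 2 blocks
design <- find.BIB(trt = 11, b = 11, k = 5)
isGYD(design)  # verifies that the resulting design is balanced
```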
Based on previous input from patient and caregiver focus groups [
2] and a literature review of outcomes that have been used in relevant clinical trials, we identified 12 patient-important outcomes (death, myocardial infarction, stroke, chronic heart failure, end-stage renal disease (with dialysis), chronic kidney disease, acute kidney injury, hypotension with dizziness, syncope, cognitive impairment, injurious falls and treatment burden). We included all except death in the survey. Based on another study [
19], we assumed that death would almost always be considered the most worrisome outcome. We described symptomatic outcomes in lay language, with expected severities based on input from clinicians and our patient and caregiver co-investigators. We specified expected severities in order to decrease cognitive burden, so that respondents would not need to consider probabilities. For example, we chose a mild scenario for myocardial infarction, a mild to moderate scenario for stroke, and a severe scenario for chronic kidney disease (outcome descriptions in Additional file
1). We did not specify which outcomes were side effects from medications and which were outcomes related to hypertension.
Researchers at Johns Hopkins University pilot tested the questionnaires with our patient and caregiver co-investigators in order to assess whether the instructions, the descriptions of outcomes and the best-worst scaling tasks were clear and understandable.
Analysis
All analyses were preplanned and performed using R version 3.3.1 unless stated otherwise. Best-worst scaling surveys can be analyzed in several ways [
17,
20]; we therefore used three different analyses to suggest how to weight the different outcomes related to hypertension. The main analysis was conditional logit regression, because it is grounded in random utility theory and therefore in real-world choice behaviour [
17] and can be used to calculate utilities based on econometric models [
21] (although “utility” is sometimes used to refer only to preference elicitation under uncertainty). In sensitivity analyses, we compared this to mean best-minus-worst scores and surface under the cumulative ranking curve (SUCRA) scores. Best-minus-worst scores are simple count scores and can be calculated for each individual; thus, they also lend themselves to exploring variability and potential associations with baseline characteristics. SUCRA scores are attractive because they have a natural scale from 0 to 1 and can therefore be readily used as weights, for example in quantitative benefit-harm assessments [
22,
23]. Furthermore, because both the mean best-minus-worst scores and the SUCRA scores lie in a closed range (whereas conditional logit parameters are unbounded), their minimum and maximum values can indicate whether an outcome is not worrisome (i.e., most respondents always chose it as least worrisome) or whether it dominates (i.e., most respondents always chose it as most worrisome).
In the conditional logit regression, the model outcome was defined as −1 if it was the most worrisome outcome and +1 if it was the least worrisome outcome, with strata defined by respondent and block. We set the least worrisome outcome as the reference so that all conditional logit coefficients were positive, with higher values indicating more worrisome outcomes.
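For illustration, a minimal sketch of one way to fit such a model in R with the survival package, here simplified to the “most worrisome” choices; the data frame bws and all column names are hypothetical:

```r
## Minimal sketch, assuming a hypothetical long-format data frame `bws` with
## one row per outcome shown per respondent and block: columns id, block,
## outcome (a factor whose reference level is the least worrisome outcome)
## and y (+1 if chosen as least worrisome, -1 if most worrisome, 0 otherwise)
library(survival)

bws$most <- as.integer(bws$y == -1)  # 1 if chosen as most worrisome
## simplified to the "most worrisome" choices; the full analysis also used
## the "least worrisome" choices via the -1/+1 coding described above
fit <- clogit(most ~ outcome + strata(id, block), data = bws)
summary(fit)  # positive coefficients = more worrisome than the reference
```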
Best-minus-worst scores count how many times an outcome was selected as best (least worrisome) minus how many times it was selected as worst (most worrisome), averaged across respondents. Scores ranged from −5 to 5, as each outcome appeared in five of the eleven blocks.
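A minimal sketch of these scores, reusing the hypothetical bws data frame from above:

```r
## Individual best-minus-worst scores: summing the +1/-1 coding over the
## five appearances of each outcome gives (#least worrisome) minus
## (#most worrisome) per respondent and outcome
score <- with(bws, tapply(y, list(id, outcome), sum))
colMeans(score)  # mean best-minus-worst score per outcome, range -5 to 5
```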
We calculated SUCRA scores using Stata version 13.1, based on the mean differences in best-minus-worst scores between outcomes estimated with a network meta-analysis model. The cumulative ranking curve of an outcome describes the probability that the outcome attains a given rank or a better one. If an outcome was always ranked as the least worrisome, it would receive a SUCRA score of 0; if it was always ranked as the most worrisome, it would receive a score of 1. The analysis is analogous to a network meta-analysis: each block represents a trial, and each outcome in a block represents a treatment arm. The methodology was originally developed to rank treatments in network meta-analyses of clinical trials [
24]. The SUCRA analysis considered only the best-minus-worst scores of outcomes that were chosen as least or most worrisome [
22]. Because not choosing an outcome is also informative about its rank, this analysis could be considered less powerful than the other scores. While SUCRA scores directly reflect differences in the probability of choosing an outcome, conditional logit parameters need to be transformed for this purpose [
17,
21].
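For illustration, the SUCRA score can be computed directly from the rank probabilities estimated by such a model; a minimal sketch, assuming a hypothetical matrix p of rank probabilities:

```r
## Sketch of the SUCRA computation; p[k, j] is the estimated probability
## that outcome k has rank j, with rank 1 = most worrisome
sucra <- function(p) {
  a <- ncol(p)                    # number of outcomes (and ranks)
  cum <- t(apply(p, 1, cumsum))   # cumulative ranking curve of each outcome
  rowSums(cum[, -a, drop = FALSE]) / (a - 1)
}
## an outcome always ranked most worrisome scores 1; always least worrisome, 0
```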
To assess the variability in preferences, we calculated individual best-minus-worst scores. Furthermore, to explore potential associations of preferences with baseline characteristics, we performed preplanned (hypothesis-driven) subgroup analyses and (preference data-driven) hierarchical cluster analysis. Subgroup analyses were defined by age, current antihypertensive treatment (yes/no), the Quan adaptation of the Elixhauser comorbidity index [
14], self-reported number of pills per day, and self-reported life expectancy stratified by age. We performed hierarchical cluster analysis using a variant of Ward’s minimum variance criterion that uses Euclidean distances [
25]. This kind of cluster analysis defines clusters such that the variance of choices within clusters is minimized; that is, respondents within a cluster gave similar answers. We selected the number of clusters such that each cluster included at least 25 respondents.
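A minimal sketch of this cluster analysis in base R, applied to the hypothetical respondent-by-outcome matrix score of individual best-minus-worst scores from above:

```r
## Ward's minimum variance criterion on Euclidean distances ("ward.D2")
hc <- hclust(dist(score), method = "ward.D2")
cluster <- cutree(hc, k = 4)  # k = 4 is illustrative only
table(cluster)  # choose k such that every cluster has at least 25 respondents
```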
We excluded respondents with more than two missing or invalid choices. The answer to each block counted as two choices (least and most worrisome). We counted choices as invalid if respondents chose more than one outcome as least or most worrisome, or if they chose the same outcome as both most and least worrisome. We counted choices as missing if respondents did not choose a least or most worrisome outcome.
Sometimes the other choices a respondent made indicated a consistent ranking that allowed us to assign the missing choice. If no such ranking could be deduced, or if the choices were inconsistent (for example, outcome A more worrisome than B, B more worrisome than C, and C more worrisome than A), we assigned the choice according to what the respondent with the most similar choices had selected.
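A minimal sketch of these validity rules, assuming hypothetical logical vectors marking which of the five outcomes shown in a block were ticked:

```r
## Validity of a single choice (least or most worrisome) within a block;
## `marks` is the selection being checked, `other` the opposite selection
check_choice <- function(marks, other) {
  if (!any(marks))        return("missing")  # no outcome selected
  if (sum(marks) > 1)     return("invalid")  # more than one outcome selected
  if (any(marks & other)) return("invalid")  # same outcome as least and most
  "valid"
}

## each block contributes two choices (least and most worrisome):
check_choice(marks = c(TRUE, FALSE, FALSE, FALSE, FALSE),   # least worrisome
             other = c(FALSE, FALSE, FALSE, TRUE, FALSE))   # most worrisome
#> "valid"
```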
In order to investigate whether the survey was well understood, we analyzed the overall consistency of the respondents’ answers by calculating, for each respondent, a measure of variance in the best-minus-worst scores: the sum of the squared scores across outcomes [
20]. The higher this variance, the more consistent the answers: if a respondent answered with full consistency, one outcome had a score of −5, another a score of +5, and the remaining outcomes had scores between −3 and 3. If, however, a respondent was unsure how the outcomes should be ranked, the scores were more similar and the variance was lower. The analysis of consistency considered only respondents without missing or invalid choices, as the way we replaced missing choices was expected to increase consistency. Inconsistent answers could imply either that the survey was not fully understood or that the respondent perceived the outcomes to be similarly worrisome. Distributions observed from best-worst scaling tasks are typically skewed, consistent with a gamma or a log-normal distribution; thus, most respondents answer with high consistency [
20].
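A minimal sketch of this consistency measure, computed on the hypothetical respondent-by-outcome matrix score from above (restricted to respondents without missing or invalid choices):

```r
## Consistency per respondent: the sum of squared best-minus-worst scores
## across outcomes; a fully consistent ranking yields a high value
## (e.g., scores of -5 and +5 alone contribute 50)
consistency <- rowSums(score^2)
hist(consistency)  # distribution of consistency across respondents (typically skewed)
```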