Introduction
In recent years, the use of patient-reported outcome measures (PROMs) in routine diabetes care has significantly increased. PROMs are questionnaires completed by patients that measure perceived symptoms and the impact of symptoms on physical function, mental function, social function, and general health perceptions (often referred to as (aspects of) health-related quality of life (HRQOL)). PROMs have the potential to harness the voice of patients. They provide clinically important and complementary predictive information regarding effects of interventions, risk of hospitalization, and medication needs, can help clinicians with treatment decision support and monitoring, and help prioritize the use of healthcare resources for optimal public health benefit [1].
Many different PROMs are used in care and research in people with type 2 diabetes, yet no consensus exists regarding which PROMs to use in research or clinical practice. In our recent systematic review, we identified 108 unique PROMs for measuring HRQOL in people with type 2 diabetes, addressing a variety of constructs [2]. The harmonization of PROMs for use in diabetes care and research has been challenged by a lack of conceptual clarity and consensus regarding the core domains and constructs to be measured, such as “diabetes-related quality of life” [1]. This heterogeneity hampers the usefulness of PROMs to inform value-based health care and is a serious threat to comparative effectiveness research, despite recent initiatives to standardize PRO measurements, such as those from the International Consortium for Health Outcomes Measurement (ICHOM) and the American Diabetes Association (ADA) [3, 4].
A good-quality PROM is developed in collaboration with patients to ensure that it measures what is most important to patients. Furthermore, the PROM should have good measurement properties, which means it is valid (it measures what it aims to measure), reliable (it gives the same scores on repeated measurements in stable patients), and responsive (it is able to measure change in the PRO over time) (Appendix 1) [5].
A key part of validity is content validity, which is considered the most important measurement property and refers to the relevance, comprehensiveness, and comprehensibility of a PROM (Table 1) [5–8]. Relevance means that all questions (also called items) of a PROM measure things that are relevant for the outcome (also called construct) that the PROM aims to measure. It also means that the PROM does not measure things that are not related to the outcome of interest. For example, if a PROM aims to measure “physical function”, the questions should ask about the capability to perform, or perceived limitations in, relevant activities. The PROM should not include questions about other constructs, such as pain or fatigue. Comprehensiveness means that the PROM should measure all important aspects of the construct of interest; no key aspects should be missing. Comprehensibility means that the questions are understood as intended by the people who complete them. To be able to test whether a PROM has good content validity, the PROM should have a clear definition of the construct that it aims to measure. If a PROM does not have good content validity, wrong conclusions may be drawn when using that PROM [6].
Table 1
Criteria for good content validity [6]
Relevance |
1 | Are the included items relevant for the construct of interest? |
2 | Are the included items relevant for the target population of interest? |
3 | Are the included items relevant for the context of use of interest? |
4 | Are the response options appropriate? |
5 | Is the recall period appropriate? |
Comprehensiveness |
6 | Are all key concepts included? |
Comprehensibility |
7 | Are the PROM instructions understood by the population of interest as intended? |
8 | Are the PROM items and response options understood by the population of interest as intended? |
9 | Are the PROM items appropriately worded? |
10 | Do the response options match the question? |
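The grouping of the ten criteria in Table 1 under the three aspects of content validity can be sketched as a small data structure. The Python snippet below is an illustration only. The function name `summarize_aspects` and the all-criteria-must-be-met roll-up are assumptions made for this example. They are not the rating rules of the COSMIN methodology, and the example answers are hypothetical.

```python
# Criterion numbers from Table 1, grouped by aspect of content validity.
ASPECTS = {
    "relevance": [1, 2, 3, 4, 5],
    "comprehensiveness": [6],
    "comprehensibility": [7, 8, 9, 10],
}


def summarize_aspects(answers):
    """Roll up yes/no answers to the Table 1 criteria per aspect.

    `answers` maps a criterion number (1-10) to True (yes) or False (no).
    Assumption for illustration only: an aspect counts as met when every
    one of its criteria is answered yes. This is not the COSMIN rating rule.
    """
    missing = [n for nums in ASPECTS.values() for n in nums if n not in answers]
    if missing:
        raise ValueError(f"unanswered criteria: {missing}")
    return {aspect: all(answers[n] for n in nums) for aspect, nums in ASPECTS.items()}


# Hypothetical PROM: every criterion is met except criterion 6
# (a key concept is missing from the questionnaire).
example = {n: True for n in range(1, 11)}
example[6] = False
print(summarize_aspects(example))
```

Keeping the three aspects separate in the result mirrors the point made in the text: a PROM can be relevant and comprehensible yet still lack comprehensiveness, and reporting only a single overall verdict would hide that gap.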
High-quality systematic reviews are needed that evaluate and compare the measurement properties of PROMs to select the best PROMs for research or care. At least 16 systematic reviews of PROMs have been published in the field of diabetes [9–24]. However, only seven reviews evaluated content validity of the included PROMs to some extent [10, 12, 13, 18–20, 22]. Five of these reviews did not provide a comprehensive overview of content validity but only evaluated whether people with diabetes were involved in the PROM development [10, 13, 18–20]. One review did not take the quality of the PROM development into account, and results for relevance, comprehensiveness, and comprehensibility were not presented separately, which limits its usefulness for identifying gaps and further development of the PROMs [22]. One review evaluated relevance, comprehensiveness, and comprehensibility separately, but this was only done for PROMs relevant for differentiating the effects of oral hypoglycaemic agents [12].
The aim of the present study was to systematically evaluate the content validity of PROMs that have specifically been developed to measure (aspects of) HRQOL in people with type 2 diabetes. We included PROMs that measured perceived symptoms, physical function, mental function, social function/participation, and general health perceptions and that were validated to at least some extent. We aimed to provide evidence-based recommendations for the most suitable PROMs for use as outcome measures in research and clinical practice.
Discussion
We systematically evaluated the content validity of PROMs specifically developed to measure (aspects of) HRQOL in people with type 2 diabetes. We found evidence for sufficient content validity for only 41 out of the 150 (27%) included PROM subscales. For each aspect of HRQOL, we identified one to 11 (subscales of) PROMs with sufficient content validity, except for depressive symptoms, for which we found no PROM (subscale) with sufficient content validity. However, the quality of evidence was generally low to very low. The highest-quality evidence was found for the Diabetes Questionnaire subscales worries (measuring anxiety), barriers (measuring social function), and how you feel (measuring general health perceptions), for the DSSCI measuring symptom experience, and for the IWADL measuring the ability to participate in daily activities.
Our results and conclusions differ from previous reviews [9–22] because these reviews did not provide a comprehensive overview of content validity, did not take the quality of the PROM development into account, or did not consider evidence for relevance, comprehensiveness, and comprehensibility separately. Strikingly, some of the PROMs with the best evidence for content validity based on our review (Diabetes Questionnaire, DFS, and IWADL) were not included in the most recent review, by Wee et al. [22], indicating that their review was likely incomplete.
We found moderate evidence for the comprehensibility of many PROMs, indicating that the questions seem well understood by people with type 2 diabetes across different languages. However, the quality of the evidence for relevance and comprehensiveness of most PROMs was very low. More high-quality research is warranted to determine whether these PROMs measure the most relevant aspects of HRQOL for people with type 2 diabetes.
The quality of the PROM development studies was considered inadequate for 85% of the included PROMs. Only half of the PROMs were developed with (some) input from people with type 2 diabetes. This is a major limitation because it is well known that patients and healthcare professionals may have different opinions about important outcomes to measure. Also, many PROMs are modified versions of previously developed PROMs. Items were often removed based on statistical analyses without addressing the relevance of these items for people with type 2 diabetes. Likewise, the decision to add new items was often not discussed with people with type 2 diabetes. Furthermore, for more than half of the PROMs, it was not clearly described what exactly the PROM (subscales) aimed to measure. Undefined names are used, such as “physical health”, “emotional burden”, “dependence”, “impact”, or “how you feel”. The content of the (subscales of) PROMs is often very different (even though they claim to measure the same construct), and a rationale for the questions within scales is not provided. If what is being measured is unclear or not based on what is most relevant for the target population, this may affect other measurement properties, such as responsiveness. Furthermore, it will impede the identification of the best PROM for a specific context of use, it will hamper interpretation and comparison of PROM results in studies, and it will limit the usefulness of PROMs in clinical practice.
Another striking finding of this review is that many PROMs that claim to measure (aspects of) HRQOL in fact (partly) measure other things, such as contextual factors or patient experiences (Appendix 3). Examples of contextual factors are behaviour (diet adherence, self-management), attitudes, stigma, support, or financial worries [2]. These are important factors that influence HRQOL, but they are not aspects of HRQOL. Examples of patient experiences are treatment satisfaction, treatment burden or barriers, and doctor–patient relationship. These are patient-reported experience measures (PREMs), not PROMs [61]. It should be noted, however, that many of the included PROMs were developed many years ago, when the methodology of PROM development and validation was not yet as strongly developed as it is today.
The large number of available (versions of) PROMs (and subscales) and the variety in content being measured with these PROMs suggest a lack of consensus on which aspects of HRQOL are most relevant to measure in people with type 2 diabetes and how to measure them. Recent initiatives towards standardization of outcomes may improve this situation. Harman et al. recently established international consensus among a large group of people with type 2 diabetes and healthcare providers on the most important outcomes to be measured in clinical trials in people with type 2 diabetes. They identified global quality of life and activities of daily living as two core patient-reported outcomes [62]. We did not include PROMs for measuring global quality of life in our review, but we found sufficient content validity of the IWADL [32] for measuring activities of daily living. A second initiative, the International Consortium for Health Outcomes Measurement (ICHOM), developed a standard set of outcomes to be measured in all type 1 and type 2 diabetes patients in clinical practice. They included psychological well-being, depression, and distress as core outcomes and recommend the WHO5, the PHQ9, and the PAID for measuring these outcomes, respectively [63]. The WHO5 and PHQ9 were not included in this review because they are not diabetes-specific. A recent review of the WHO5 concluded that this PROM has adequate validity [64]. It should be noted that the WHO5 is often used to measure depression but actually measures well-being. Another systematic review identified good measurement properties of the PHQ9, although evidence on the content validity for people with type 2 diabetes is lacking [13]. We found sufficient content validity for the PAID [48] to measure distress, although with very low quality of evidence.
Unfortunately, these two sets do not contain the same outcomes, although there is no justification for why the most important outcomes to measure in clinical trials would differ from those in clinical practice. Skovlund et al. reviewed recent evidence and key opportunities and challenges for the clinical use of PROMs to support person-centred diabetes care. They recommended measuring most of the above-mentioned outcomes (quality of life, self-reported health, depression, anxiety, and distress) in routine diabetes care [1]. Finally, there is increasing evidence that the same patient-reported health outcomes are important across adults with different kinds of diseases, such as fatigue, sleep disturbances, anxiety, depression, physical function, and the ability to participate in social roles and activities [65–67]. All these studies provide important input for what to measure routinely in people with type 2 diabetes.
Recommendations for Further Research
This review shows the need for more high-quality content validity studies on diabetes-specific HRQOL PROMs. Furthermore, the evidence on other measurement properties of those PROMs with sufficient content validity should be summarized in a subsequent review, or evidence from previous reviews [9–21] should be updated. Wee et al. recently performed such a review [22], but their review was likely incomplete.
In addition, we recommend considering the Patient-Reported Outcomes Measurement Information System (PROMIS) for future validation studies in people with type 2 diabetes [67]. PROMIS is a set of generic, high-quality, and efficient PROMs, based on modern psychometric methods (item response theory) [68], that measure relevant outcomes such as fatigue, sleep disturbances, anxiety, depression, physical function, and the ability to participate in social roles and activities. PROMIS measures have been extensively validated and are increasingly being used across different (patient) populations [69]. PROMIS measures are especially suitable for people with multiple medical conditions who would otherwise need to complete multiple PROMs for different health care providers. PROMIS measures are already used in routine care for people with diabetes [70] but, as far as we know, have not yet been validated in people with diabetes.
Recommendations for the Use of PROMs in Research and Clinical Practice
We recommend that researchers and clinicians first consider carefully which aspects of HRQOL are most relevant to measure in their specific context. We recommend involving people with type 2 diabetes in this selection process. We also recommend considering outcomes that have been shown to be relevant for many (patient) populations, such as fatigue, sleep disturbances, anxiety, depression, physical function, and the ability to participate in social roles and activities. We recommend using (subscales of) PROMs with sufficient content validity (presented in green in Table 3), such as the DSSCI for measuring disease-specific symptoms, the Diabetes Questionnaire subscales for measuring worries and general health perceptions, and the IWADL for measuring the ability to participate in daily activities. As an alternative, high-quality generic PROMs, such as the WHO5, PHQ9, and PROMIS, may be considered. We recommend against using the 61 PROM subscales identified in this review with evidence for insufficient relevance, insufficient comprehensiveness, or insufficient comprehensibility.
Limitations
This review has some limitations. First, we identified PROMs by screening studies on PROM development or content validity. However, additional (versions of) PROMs may have been developed, for example, based on factor analysis, and published in papers on other measurement properties. Not all of these papers were identified through our screening approach, which means that this review may not include all existing (versions of) PROMs. However, PROMs based on statistical methods only would not be rated as having sufficient content validity, so we are quite confident that we did not miss PROMs with good content validity.
Second, we could not rate five PROMs because we were unable to find full copies of them. In addition, it was not always possible to distinguish between different versions of a PROM, and it was sometimes difficult to distinguish PROM development studies from content validity studies. This, as well as poor reporting of development and validation studies, may have led to underestimation of some of our ratings. Third, we found many papers by reference checking, which may indicate a lack of comprehensiveness of the original search strategy. However, we were not able to identify additional search terms that would have identified these papers. It is likely that these papers were not retrieved by the search because content validity details were poorly reported in the abstracts.
The strengths of our review were the extensive search strategy, with more than 13,000 papers screened and extensive reference checking, and the detailed and transparent assessment of all aspects of content validity, using the consensus-based COSMIN methodology [6].