Background
Depression is the most common mental health disorder in both clinical practice and the general population, a major contributor to disability and health care costs, and an important cause of morbidity as well as early mortality [1]. Because the assessment and monitoring of depression relies principally on patient-reported symptoms, reliable and valid scales are essential for both research and clinical practice. The National Institutes of Health has made substantial investments in developing and testing the Patient-Reported Outcomes Measurement Information System (PROMIS) measures to assess symptoms and functional domains that cut across a number of medical and psychological conditions [2].
Initially developed and validated in the general population, PROMIS measures are increasingly being tested in clinical settings. However, there are substantial gaps in understanding the performance of PROMIS measures in patients. One particularly important psychometric characteristic is a scale’s responsiveness (also called sensitivity to change), which refers to a measure’s ability to detect change over time [3]. A responsive measure is essential for clinical trials and other longitudinal studies to minimize the risk of false negative conclusions as well as to potentially reduce sample size and study costs. Responsiveness is also critical in clinical practice, where the purpose is to detect clinically meaningful change over time in order to monitor and, if necessary, adjust treatment.
PROMIS measures draw upon item banks that are calibrated using item response theory and include large numbers of questions that collectively represent a well-defined, unidimensional construct. Individual questions from these large banks can then be extracted, using various strategies, to create unique short forms of that measure [2]. These short forms can be static (i.e., the same items used in a fixed-length scale), or they can be constructed adaptively in real time based on the respondent’s answers to previous questions, known as computer adaptive testing (CAT). Although CAT may require slightly fewer items than fixed-length forms to obtain comparable precision, the small gain in efficiency may not be sufficient to justify the added technical requirements of CAT administration.
Four fixed-length PROMIS depression scales are the focus of this study: one with 4 items, one with 6 items, and two with 8 items. Fixed-length scales were chosen rather than CAT administration because in many clinical and research settings fixed-length scales are more feasible to administer and produce approximately comparable results to CAT. For this reason, fixed-length scales have been offered as a viable option by PROMIS developers [4].
Only a few studies have examined PROMIS depression scale responsiveness. These studies have several limitations, including studying only a single sample [5‐8], no comparison to a legacy or other anchor measure [7], and focusing only or principally on CAT rather than fixed-length PROMIS measures [5, 7, 9]. Given the limitations of previous studies, our purpose was to evaluate the responsiveness of the four fixed-length PROMIS depression scales and to compare their responsiveness to legacy depression measures using three clinical samples. It should be noted that scores on these self-report scales represent depressive symptom severity rather than a depressive disorder diagnosis; the latter requires a clinical assessment.
Measures
PROMIS Depression Scales
We evaluated four fixed-length PROMIS depression scales: the original 8-item depression Short Form (8b), and the 4-item (4a), 6-item (6a) and 8-item (8a) depression scales from the PROMIS profiles (a collection of short forms containing a fixed number of items from key PROMIS domains). Items are nested in the latter three scales: the 6a scale adds two items to the 4a scale, and the 8a scale adds two items to the 6a scale. The 8a and 8b scales share 7 items in common, with 1 unique item each. For each scale, respondents are asked how often in the past 7 days they have experienced specific depression symptoms, using a 5-point ordinal rating scale of “Never,” “Rarely,” “Sometimes,” “Often,” and “Always.” Raw score totals are converted to item response theory-based T-scores. A T-score of 50 is the average for the United States general population, with a standard deviation (SD) of 10. A higher T-score represents greater depression severity. Cronbach’s alphas for baseline PROMIS raw scores in the three trials ranged from 0.89 to 0.95.
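The internal-consistency statistic reported above (Cronbach’s alpha) can be computed directly from item-level responses. The sketch below uses hypothetical 5-point item scores, not trial data, and implements the standard formula: alpha = k/(k − 1) × (1 − Σ item variances / variance of total scores).

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha for k item-score columns.

    items: list of k lists, each holding one item's scores
    across the same n respondents.
    """
    k = len(items)
    # Variance of each individual item across respondents
    item_vars = [pvariance(col) for col in items]
    # Variance of the total (sum) score for each respondent
    totals = [sum(scores) for scores in zip(*items)]
    total_var = pvariance(totals)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical responses to a 4-item scale (0-4) from five respondents
items = [
    [0, 1, 2, 3, 4],
    [0, 1, 2, 3, 3],
    [1, 1, 2, 4, 4],
    [0, 2, 2, 3, 4],
]
alpha = cronbach_alpha(items)
```

Because these made-up respondents answer all four items consistently, the resulting alpha is high, in the same general range as the values reported for the trials.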
Patient Health Questionnaire 9-item (PHQ-9) and 2-item (PHQ-2) Depression Scales
The PHQ-9 is among the best-validated and most widely used depression scales in both clinical practice and research [11, 12]. The PHQ-9 [13] includes 1 item for each of the 9 DSM-5 criterion symptoms used in diagnosing major depression. Respondents are asked how much in the past 2 weeks they have been bothered by each symptom, with the response options being “Not at all”, “Several days”, “More than half the days”, and “Nearly every day.” Scores range from 0 to 27, with higher scores indicating greater depression severity. Cronbach’s alphas for baseline PHQ-9 scores in the three trials ranged from 0.76 to 0.85. The PHQ-2 comprises the first two items of the PHQ-9, which capture depressed mood and anhedonia. It is scored 0 to 6 and has been validated as an ultra-brief screening tool [12], with some evidence of responsiveness [13, 14].
SF-36 Mental Health Scale
The SF-36 Mental Health scale was administered only in Sample 1 (CAMEO trial). The scale consists of five items, each scored from 1 (not at all) to 5 (extremely) with reference to the past four weeks. Responses to the five items are summed and then transformed to a 0–100 scale, where a lower number represents more severe symptoms. The scale has demonstrated good operating characteristics as a depression screener as well as sensitivity to change in longitudinal studies [15, 16].
Prospective global rating of change
The prospective global rating of change is the difference between an individual’s cross-sectional global rating of mood at two time points (baseline minus follow-up) [17]. Because the cross-sectional global rating is on a 5-point scale ranging from 0 (“Not unhappy or down at all”) to 4 (“Very severely unhappy or down”), change scores have a possible range of − 4 to + 4, where negative numbers indicate worsening mood and positive numbers improved mood. For example, a patient who reported being “severely unhappy or down” at baseline and “mildly unhappy or down” at follow-up would have a + 2 change (3 − 1), whereas a patient who reported being “moderately unhappy or down” at baseline and “severely unhappy or down” at follow-up would have a − 1 change (2 − 3). Change scores were collapsed into three categories: better (+ 1 to + 4), same (0), and worse (− 1 to − 4). We used this prospective anchor to overcome potential recall and reconstruction bias related to the retrospective global rating of change [18]. A few studies have suggested that, compared to the retrospective global rating of change, the prospective global rating of change may be less influenced by post-treatment status [18, 19].
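As a minimal illustration (hypothetical ratings, not trial data), the prospective change score and its three-category collapse described above can be sketched as:

```python
def prospective_change(baseline, follow_up):
    """Prospective global rating of change: baseline minus follow-up.

    Ratings are on the 0-4 scale (0 = "Not unhappy or down at all",
    4 = "Very severely unhappy or down"), so change scores range
    from -4 (worsened) to +4 (improved).
    """
    return baseline - follow_up

def categorize(change):
    """Collapse a change score into better / same / worse."""
    if change > 0:
        return "better"
    if change < 0:
        return "worse"
    return "same"

# Worked examples from the text: severe (3) -> mild (1) is a +2 change;
# moderate (2) -> severe (3) is a -1 change.
improved = prospective_change(3, 1)   # +2, "better"
worsened = prospective_change(2, 3)   # -1, "worse"
```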
Retrospective global rating of change
The retrospective global rating of change assesses overall clinical response from the participant’s perspective [20]. At follow-up, participants were asked to rate their mood change compared to their mood at the baseline assessment. Change in mood is rated on a 7-point scale with the following response options: − 3 (much worse), − 2 (moderately worse), − 1 (a little worse), 0 (no change), + 1 (a little better), + 2 (moderately better), or + 3 (much better). Based on this rating, participants were further categorized into three groups: improved (+ 1 to + 3), unchanged (0), and worsened (− 1 to − 3). The retrospective global rating of change has been widely used to assess the responsiveness of patient-reported outcome measures [3, 16].
Statistical analysis
We evaluated comparative responsiveness for all four PROMIS scales and legacy measures (i.e., PHQ-9, PHQ-2, and SF-36 Mental Health). Data from each of the three trials were analyzed separately rather than pooled, because the three trials involved different clinical populations, study interventions, and follow-up timeframes. We used both prospective and retrospective global ratings of change for mood as the anchors (i.e., criteria) to identify patients who had changed since baseline. Specifically, patients were categorized into three groups based on global ratings of mood change: better, same, and worse. Both within-group and between-group responsiveness to change were evaluated.
Within-group responsiveness
For within-group responsiveness, we estimated the amount of change over time within each global rating of depression change group (i.e., better, same, and worse). The standardized response mean (SRM) was used as the effect size measure of within-group responsiveness to change. The SRM is the ratio of the mean change to the standard deviation (SD) of change, calculated as (mean baseline score − mean follow-up score)/(SD of change scores). We also calculated 95% confidence intervals for the SRMs with a bootstrapping procedure. SRM values of 0.2, 0.5, and 0.8 represent thresholds for small, moderate, and large changes, respectively [3, 21]. Some researchers suggest an absolute SRM value ≥ 0.3 indicates responsiveness [22].
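A minimal sketch of the SRM and a percentile-bootstrap 95% CI, using hypothetical baseline and follow-up T-scores for one anchor group (the resampling unit is the patient; the exact bootstrap settings used in the study are not specified here, so the replicate count and seed below are illustrative):

```python
import random
from statistics import mean, stdev

def srm(baseline, follow_up):
    """Standardized response mean: mean change / SD of change,
    with change defined as baseline minus follow-up (so a positive
    SRM indicates improvement on scales where higher = worse)."""
    changes = [b - f for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)

def bootstrap_ci(baseline, follow_up, n_boot=2000, seed=1):
    """Percentile-bootstrap 95% CI for the SRM, resampling patients."""
    rng = random.Random(seed)
    n = len(baseline)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(srm([baseline[i] for i in idx],
                             [follow_up[i] for i in idx]))
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]

# Hypothetical T-scores for 12 patients rated "better" on the anchor
baseline  = [62, 60, 58, 65, 55, 63, 59, 61, 57, 64, 56, 60]
follow_up = [57, 57, 52, 57, 56, 59, 57, 54, 57, 59, 53, 54]
point = srm(baseline, follow_up)          # mean change 4.0 T-score points
lo, hi = bootstrap_ci(baseline, follow_up)
```

By the thresholds cited above, the point estimate here would count as a large change; the CI conveys how precisely a sample of this size pins it down.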
Between-group responsiveness
For between-group responsiveness, we compared the amount of change between global rating of change groups. First, we used omnibus ANOVA tests to compare mean change scores across global rating of change groups (i.e., improved, unchanged, and worsened). For this analysis, both retrospective and prospective rating of change groups were used as anchors. We used post-hoc Tukey–Kramer pairwise tests to compare the three groups and controlled the family-wise Type I error rate at 0.05.
Second, we used receiver operating characteristic curve analyses to further quantify a measure’s ability to detect improvement. The area under the curve (AUC) is the probability of correctly discriminating between patients who have improved and those who have not; values range from 0.5 (the same as chance) to 1.0 (perfect discrimination). We calculated the AUC for each depression measure using retrospective and prospective global ratings of change as the anchors. For the retrospective anchor, we evaluated each measure’s ability to detect any improvement (“a little better”, “moderately better”, or “very much better”) as well as moderate improvement (“moderately better” or “very much better”). For the prospective anchor, we evaluated each measure’s ability to detect any improvement (+ 1 to + 4) as well as moderate improvement (+ 2 to + 4). To determine whether depression scales differed in their ability to detect improvement, we also statistically compared AUC values between measures [20, 23].
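The AUC has an equivalent rank-based (Mann–Whitney) interpretation: the probability that a randomly chosen improved patient shows a larger score change than a randomly chosen non-improved patient, with ties counted as half. A minimal sketch with hypothetical change scores:

```python
def auc(improved, not_improved):
    """AUC via the Mann-Whitney formulation: the fraction of
    (improved, not-improved) pairs in which the improved patient
    has the larger change score; tied pairs count 0.5."""
    wins = 0.0
    for x in improved:
        for y in not_improved:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(improved) * len(not_improved))

# Hypothetical change scores (baseline minus follow-up) by anchor group
improved_changes = [6, 4, 8, 5, 3]
not_improved_changes = [1, -2, 2, 0, 3]
a = auc(improved_changes, not_improved_changes)
```

With these made-up groups, 24.5 of the 25 pairs favor the improved patient, so the AUC is 0.98; identical distributions in the two groups would yield 0.5, the chance value.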
Discussion
Using data from three clinical trials, we found PROMIS depression scales were responsive to change using both prospective and retrospective global change anchors as well as AUC analysis. Responsiveness was similar among all four fixed-length PROMIS scales and comparable to the responsiveness of the PHQ-9 and PHQ-2. In general, the measures were better able to detect depression improvement than worsening. A strength of our study compared to previous research on responsiveness of PROMIS depression measures is the triangulation of results from three patient samples using three measures of responsiveness.
Only a few prior studies have explored the responsiveness of PROMIS depression scales. In an observational study of 234 patients undergoing inpatient treatment in four psychosomatic rehabilitation centers, the pre-post treatment effect size was similar for the PROMIS depression item bank scale (using all 28 items) and the Center for Epidemiological Studies Depression scale (CES-D) (1.16 vs. 1.09) [7]. In a second observational study of 194 patients with depression treated for 12 weeks, the PROMIS CAT was similar to the PHQ-9 and CES-D in terms of treatment effect size: 0.84, 0.98, and 1.06, respectively [5]. However, depression recovery defined in several different ways was less frequent with the PHQ-9 compared to PROMIS and CES-D. In contrast, the PHQ-9 and PROMIS 8-item short form had similar responsiveness in identifying depression recovery in a longitudinal study of 701 patients with neurological or psychiatric disorders [8]. In a longitudinal study of 903 patients with 5 diverse diseases (4 medical conditions and major depressive disorder), two-thirds of patients completed PROMIS by CAT and one-third with an 8-item short form [9]. The average SRM using a retrospective global anchor was 0.71 for the improved group and − 0.49 for the group that worsened. In a longitudinal study of 150 patients with depression, SRMs in those experiencing recovery were 0.82 and 0.79 for the PROMIS 28-item bank and 8-item short form depression scales, respectively, and 1.00 for the PHQ-9 [6]. Unlike these previous studies, which used either an observational design, a single sample, or PROMIS administration by CAT or the entire item bank, we used data from three RCTs and evaluated four PROMIS short forms of varying lengths. In addition, we evaluated responsiveness by triangulating several methods. Thus, our study substantially strengthens the evidence regarding the responsiveness of PROMIS depression scales.
Responsiveness was not symmetric with respect to improvement and worsening. SRMs for improvement averaged a moderate positive effect size and were roughly twice the SRMs for worsening, which averaged a small negative effect size. Also, the 3 to 6 point improvement in PROMIS depression T-scores was above the minimally important difference. This greater sensitivity of symptom scales for detecting improvement has been previously reported for depression [5, 16, 24], pain [20, 22, 25‐28] and anxiety [24].
The Consensus-based Standards for the selection of health Measurement Instruments (COSMIN) guidelines consider SRMs and other effect size metrics an imperfect approach to assessing responsiveness [29] and also discuss the limitations of transition anchors such as the global rating of change. Objections to these opinions [30] as well as the COSMIN rationale [31] have subsequently been articulated. Suffice it to say, SRMs and effect sizes as well as global rating of change anchors have been widely used to assess responsiveness both before [3, 30, 32‐34] and since [20, 35‐42] publication of the COSMIN guidelines; only a small number of representative studies are cited here.
The AUCs in Table 4 represent modest rather than strong differentiation between patients whose depression had improved and those who were the same or worse. However, AUCs in a similar range have been reported in other studies using the retrospective global rating of change as an anchor [16, 20, 27, 43], in which AUCs tend to be lower than in studies of diagnostic tests for which there is a criterion (“gold”) standard to determine the presence of a disease. Retrospective global ratings of change may be influenced by recall bias as well as the current state of symptoms [19, 44]. Some experts recommend an AUC ≥ 0.70 as a threshold for responsiveness when using a criterion standard anchor, but also acknowledge that criterion standards often do not exist for patient-reported outcomes (PROs) [29, 45]. Thus, AUCs for scales measuring symptoms and other PROs have been < 0.70 not only when using retrospective global change anchors but also in some studies using other anchors [32, 46, 47]. Ours is the first study to also use prospective global change anchors to assess AUCs for PRO scales. Although this anchor led to more AUC estimates ≥ 0.70, the sample size of those with moderate change by this anchor was small, yielding wide confidence intervals. For all these reasons, the similarity of AUCs when using global change anchors is more salient than their absolute values [48].
Scale length did not have a strong effect on responsiveness. The four PROMIS depression scales, ranging from 4 to 8 items, had similar responsiveness, a finding previously reported for PROMIS pain scales [20]. The PROMIS fixed-length scales for a specific domain share some items in common, which may explain in part their comparable responsiveness. Also, the average responsiveness of the PHQ-9 and PHQ-2 did not differ substantially, as has been shown in only one previous study [13]. Short measures may be more desirable for studies with many outcome measures, particularly where depression is a secondary rather than primary outcome, or in busy clinical practice settings with time constraints or the need to assess multiple patient-reported outcome measures.
Methodologically, our study is distinctive in using both retrospective and prospective global change anchors, allowing assessment of responsiveness with two different global anchors. Notably, two of the trials showed only fair agreement beyond chance between these two anchors in classifying individuals as better, same, or worse, and one trial showed poor to no agreement beyond chance. It is possible that the two anchors provide different perspectives on change over time. Alternatively, one anchor may be superior to the other, or both anchors may have limitations, but determining this would require additional research comparing both anchors to a third, independent anchor. However, as already discussed, criterion standard anchors for patient-reported outcomes are lacking. Moreover, the global rating of change is among the most commonly used anchors for assessing responsiveness.
Our study has several limitations. First, depression was generally mild in all three samples, thereby restricting the range in which depression improvement could be detected. Responsiveness needs to be further studied in more clinically depressed samples in which treatment is warranted and a responsive measure is especially important. Second, because the samples included predominantly male veterans with either chronic pain or stroke, findings need to be replicated in populations with more women and a broader range of medical and mental health conditions. Third, one legacy measure (SF-36 Mental Health) was used only in one trial (CAMEO). Although its responsiveness has been demonstrated in prior studies, its comparative responsiveness to the PHQ-9 and PROMIS scales requires additional testing. Fourth, because we made multiple statistical comparisons between depression measures, the differences between measures should be interpreted cautiously unless highly significant (i.e., p < 0.001). Fifth, the nested nature of the PROMIS scales (i.e., sharing many items in common) as well as the PHQ-2 items being included in the PHQ-9 would lead to some convergence of responsiveness within the same family of scales. Sixth, studies using additional responsiveness metrics besides SRMs anchored to global ratings of change are warranted. Finally, our findings are derived from secondary analyses of data from clinical trials rather than a primary hypothesis-driven psychometric study.