Elsevier

Journal of Psychiatric Research

Volume 56, September 2014, Pages 112-119
Validation of the depression item bank from the Patient-Reported Outcomes Measurement Information System (PROMIS®) in a three-month observational study

https://doi.org/10.1016/j.jpsychires.2014.05.010

Highlights

  • PROMIS Depression demonstrates strong convergent validity with the CESD and PHQ-9.

  • A PROMIS score of 60 suggests depression of some clinical significance.

  • PROMIS scores are more normally distributed than those from the other 2 measures.

  • PROMIS Depression and the CESD classify more patients as recovered than the PHQ-9.

  • The PROMIS computerized adaptive test for depression requires a median of 4 items.

Abstract

The Patient-Reported Outcomes Measurement Information System (PROMIS®) is an NIH Roadmap initiative devoted to developing better measurement tools for assessing constructs relevant to the clinical investigation and treatment of all diseases—constructs such as pain, fatigue, emotional distress, sleep, physical functioning, and social participation. Following creation of item banks for these constructs, our priority has been to validate them, most often in short-term observational studies. We report here on a three-month prospective observational study with depressed outpatients in the early stages of a new treatment episode (with assessments at intake, one-month follow-up, and three-month follow-up). The protocol was designed to compare the psychometric properties of the PROMIS depression item bank (administered as a computerized adaptive test, CAT) with two legacy self-report instruments: the Center for Epidemiological Studies Depression scale (CESD; Radloff, 1977) and the Patient Health Questionnaire (PHQ-9; Spitzer et al., 1999). PROMIS depression demonstrated strong convergent validity with the CESD and the PHQ-9 (with correlations in a range from .72 to .84 across all time points), as well as responsiveness to change when characterizing symptom severity in a clinical outpatient sample. Identification of patients as “recovered” varied across the measures, with the PHQ-9 being the most conservative. The use of calibrations based on models from item response theory (IRT) provides advantages for PROMIS depression both psychometrically (creating the possibility of adaptive testing, providing a broader effective range of measurement, and generating greater precision) and practically (these psychometric advantages can be achieved with fewer items—a median of 4 items administered by CAT—resulting in less patient burden).

Introduction

The Patient-Reported Outcomes Measurement Information System (PROMIS®) is an NIH Roadmap initiative devoted to developing better measurement tools for assessing constructs relevant to the clinical investigation and treatment of all diseases—constructs such as pain, fatigue, emotional distress, sleep, physical functioning, and social participation (Buysse et al., 2010, Cella et al., 2010, Cella et al., 2007b, Fries et al., 2009, Fries et al., 2014, Pilkonis et al., 2011, Revicki et al., 2009). PROMIS has created and refined a comprehensive methodology for developing item banks of these health-related constructs using both qualitative and quantitative techniques and modern psychometric methods (item response theory, IRT) (Cella et al., 2007a, Cella et al., 2010, Hilton, 2011, Reeve et al., 2007). These item banks encompass physical, mental, and social health, consistent with the World Health Organization's tripartite framework (Cella et al., 2007a, World Health Organization, 2007).

The use of models from IRT to calibrate items not only results in greater precision at the item and test levels but also promotes greater flexibility in test administration. For example, items can be administered as computerized adaptive tests (CATs), or static short forms can be created and tailored for samples with different levels of severity of the construct being assessed. Analyses of potential differential item functioning due to gender, age, and educational attainment were performed during the development of the item banks to ensure that items performed comparably regardless of variations in these background characteristics. In general, experience with CAT suggests that the PROMIS depression item bank provides excellent precision with 4–6 items (Choi et al., 2010). A generic 8-item short form is also available, and this short form was one of the cross-cutting dimensional measures used in the DSM-5 field trials, where its feasibility was established and where it performed well with regard to test-retest reliability (Narrow et al., 2013). Following creation of the item banks, our priority has been to validate them, most often in short-term observational studies. These studies allow us to examine the psychometric properties of the item banks, their responsiveness to change, their relationships to clinically significant benchmarks of improvement, and their similarities and differences when compared with other commonly used instruments.

We report here on a prospective observational study with depressed outpatients in the early stages of a new treatment episode. For this purpose, all participants completed study assessments at three points: baseline (T1, as close to the beginning of treatment as possible but no later than four months after its start), one month following baseline (T2), and three months following baseline (T3). The protocol was designed to compare the psychometric properties of the PROMIS depression item bank (administered as a CAT) with two legacy self-report instruments: the Center for Epidemiological Studies Depression scale (CESD; Radloff, 1977) and the Patient Health Questionnaire (PHQ-9; Spitzer et al., 1999).

The study was not intended to evaluate treatment effectiveness. Rather, the main consideration was to conduct a study involving established treatments that would allow us to investigate the operating characteristics of the different measures of depression over a time frame (three months) consistent with the design of clinical trials and comparative effectiveness research. Regardless of their impact in the aggregate, treatments for depression generate considerable variability in individual outcomes, and this variability was desirable for examining psychometric issues. In our setting, the most common form of outpatient treatment for depression is a combination of antidepressant medication and supportive psychotherapy (both individual and group therapies), with smaller proportions of patients receiving medication only or psychotherapy only. No untreated or control group was included.

There have been other attempts to link PROMIS depression to legacy measures for depression. The PROsetta Stone project (Choi et al., 2012) was designed specifically to create “cross-walks” between PROMIS measures in several domains and commonly used measures (most often developed using classical test theory) in those same domains. A PROsetta Stone report provides a conversion table from raw CESD scores to PROMIS depression scores (Choi et al., 2013a). The PROMIS depression equivalent for the CESD threshold of 16 is 56.2; for the CESD threshold of 21, it is 59.1. (Note that PROMIS depression is scored with a T-score metric in which the mean of the general population is 50, with a standard deviation of 10.)
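The T-score metric noted above can be illustrated with a short sketch. It assumes the conventional linear mapping between a standardized score (mean 0, SD 1 in the reference population) and a T-score (mean 50, SD 10); the function names are illustrative and are not part of any PROMIS software. The two CESD thresholds and their PROMIS equivalents are the values quoted in the text.

```python
def theta_to_t(theta: float) -> float:
    """Map a standardized score (mean 0, SD 1) onto the T-score metric (mean 50, SD 10)."""
    return 50.0 + 10.0 * theta


def t_to_theta(t_score: float) -> float:
    """Inverse mapping: express a T-score as SD units from the population mean."""
    return (t_score - 50.0) / 10.0


# CESD thresholds and their linked PROMIS depression T-scores,
# as quoted in the text (Choi et al., 2013a).
CESD_TO_PROMIS = {16: 56.2, 21: 59.1}

for cesd, promis_t in CESD_TO_PROMIS.items():
    sd_units = t_to_theta(promis_t)
    print(f"CESD {cesd} -> PROMIS T {promis_t} ({sd_units:+.2f} SD from the mean)")
```

Read this way, both CESD thresholds fall between roughly half an SD and one SD above the general-population mean.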

Another PROsetta Stone report provides a conversion table from raw PHQ-9 scores to PROMIS depression scores (Choi et al., 2013b). The PROMIS depression equivalent for the PHQ-9 threshold of 5 (mild depression) is 52.5; for the threshold of 10 (moderate depression), 59.9; for the threshold of 15 (moderately severe depression), 65.8; and for the threshold of 20 (severe depression), 71.5. Gibbons et al. (2011) also reported analyses linking PROMIS depression and the PHQ-9 in a sample of HIV patients. Their results were generally comparable to the PROsetta Stone linkages. However, there was some discrepancy at the mild end of the PHQ-9 where they found rather low PROMIS depression scores to be equivalent: “Mild depression (PHQ-9 score of 5–9) corresponds to scores of 42–51 on the PROMIS metric, moderate depression [10–14] to 52–63, moderately severe [15–19] to 64–72, and severe [20+] to scores of 73 and higher” (figure caption, p. 1353). In general, thresholds suggesting depression of some clinical significance (CESD = 21, PHQ-9 = 10) have been linked to a PROMIS score of about 60, the usual threshold used clinically with the T-score metric (1 SD above the mean).
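As a hedged illustration of how the PROsetta Stone linkage quoted above might be used, the sketch below places a PROMIS depression T-score into a PHQ-9 severity band. The cut points are the linked values reported in the text (Choi et al., 2013b); treating them as hard classification boundaries, and the function and label names, are assumptions made for this example only.

```python
from bisect import bisect_right

# PROMIS T-score equivalents of PHQ-9 thresholds 5, 10, 15, and 20,
# as quoted in the text (Choi et al., 2013b).
PROMIS_CUTS = [52.5, 59.9, 65.8, 71.5]

# Band labels; "below mild" is a hypothetical label for scores under the first cut.
BANDS = ["below mild", "mild", "moderate", "moderately severe", "severe"]


def phq9_band_from_promis(t_score: float) -> str:
    """Return the PHQ-9 severity band whose linked PROMIS range contains t_score.

    A score exactly equal to a cut point is assigned to the higher band.
    """
    return BANDS[bisect_right(PROMIS_CUTS, t_score)]


for t in (50.0, 55.0, 60.0, 68.0, 75.0):
    print(t, phq9_band_from_promis(t))
```

Under these cut points a T-score of 60 falls just inside the moderate band, consistent with the observation that PHQ-9 = 10 links to a PROMIS score of about 60.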

Finally, in a study using two different IRT linking methods, Olino et al. (2013) compared the Beck Depression Inventory (Beck et al., 1961), the CESD, and the PROMIS depression item banks in a community sample of adolescents. Among the three measures, PROMIS depression provided information over the widest range of symptom severity while demonstrating the highest level of precision. This result was especially true for the full PROMIS depression item bank of 28 items, but it also applied to the PROMIS depression short form of 8 items, which is considerably briefer than either the BDI or the CESD.

Section snippets

Inclusion criteria

Men and women 18 years and older who were able to read and understand English and able and willing to give informed consent were enrolled in the protocol. They were required to be within the first four months of outpatient treatment for major depressive disorder (MDD) at Western Psychiatric Institute and Clinic (WPIC) and its affiliates. To ensure that participants were not too close to the floor for depression when beginning the protocol (and thus unable to show further change), we required a

Descriptive statistics

Cronbach's alpha was used to compute the reliabilities of the legacy measures at baseline, which were .86 for the CESD and .81 for the PHQ-9. For measures derived from IRT models, test information (and its converse, standard error, SE) varies along the spectrum of severity of the construct being assessed. The reliability of PROMIS depression was .92 when calculated as

$$\text{Reliability} = 1 - \frac{SE_{\text{baseline}}^{2}}{SD_{\text{baseline}}^{2}}$$

where $SE_{\text{baseline}}$ is the median of the SE of PROMIS depression in a range from −3 to +3
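The reliability calculation described in this snippet (one minus the ratio of the squared standard error to the squared standard deviation) can be sketched as follows. The function name and the input values are hypothetical and chosen only to show the arithmetic; they are not the study's data.

```python
def irt_reliability(se: float, sd: float) -> float:
    """Reliability = 1 - SE^2 / SD^2, with SE and SD expressed on the same metric."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return 1.0 - (se ** 2) / (sd ** 2)


# Hypothetical inputs on the T-score metric (not taken from the study):
# a median standard error of 3.0 and a sample standard deviation of 10.0.
example = irt_reliability(se=3.0, sd=10.0)
print(round(example, 2))
```

Because SE varies along the severity continuum for IRT-based scores, a summary value such as the median SE has to be chosen before a single reliability figure can be reported.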

Discussion

We report here on a prospective observational study with depressed outpatients in the early stages of a new treatment episode which was designed to compare the psychometric properties of the PROMIS depression item bank (administered as a CAT) with two legacy self-report instruments: the CESD and the PHQ-9. The study allowed us to examine the psychometric properties of the measures (frequency distributions, reliabilities), their convergent validity (correlations, linkages to commonly used

Role of funding source

PROMIS® was funded with cooperative agreements from the National Institutes of Health (NIH) Common Fund Initiative (Northwestern University, PI: David Cella, PhD, U54AR057951, U01AR052177; Northwestern University, PI: Richard C. Gershon, PhD, U54AR057943; American Institutes for Research, PI: Susan (San) D. Keller, PhD, U54AR057926; State University of New York, Stony Brook, PIs: Joan E. Broderick, PhD and Arthur A. Stone, PhD, U01AR057948, U01AR052170; University of Washington, Seattle, PIs:

Contributors

Paul A. Pilkonis, PhD, contributed to study conception and design and took responsibility for drafting the manuscript. Lan Yu, PhD, provided data analysis and interpretation. Nathan E. Dodds, BS, Kelly L. Johnston, MPH, Catherine C. Maihoefer, MS, LPC, and Suzanne M. Lawrence, MS, contributed to study implementation (preparation of the protocol in the PROMIS Assessment Center; recruitment, testing, and interviewing of participants) and manuscript preparation (literature reviews, preparation of

Conflict of interest

There are no conflicts of interest for any authors.

Acknowledgments

We acknowledge the contributions of our colleagues in Behavioral Health Services at the DuBois (PA) Regional Medical Center, who assisted in the identification and assessment of patients: Scott Turkin, MD, DFAPA; Michelle L. Hetrick, MA, NCC, LPC; Betsy Lingle, BS; and Sherry L. Murphy, MN, CNS. Angela Stover, MA, a former program coordinator at the University of Pittsburgh, was instrumental in study implementation and data collection activities in the early stages of the project. Ms. Stover is

References (25)

  • S.W. Choi et al. Efficiency of static and computer adaptive short forms compared to full-length measures of depressive symptoms. Qual Life Res (2010)

  • J. Crawford et al. Percentile norms and accompanying interval estimates from an Australian general adult population sample for self-report mood scales (BAI, BDI, CRSD, CES-D, DASS, DASS-21, STAI-X, STAI-Y, SRDS, and SRAS). Aust Psychol (2011)