Original Article
Using Patient Health Questionnaire-9 item parameters of a common metric resulted in similar depression scores compared to independent item response theory model reestimation

https://doi.org/10.1016/j.jclinepi.2015.10.006

Abstract

Objectives

To investigate the validity of a common depression metric in independent samples.

Study Design and Setting

We applied a common metrics approach based on item response theory for measuring depression to four German-speaking samples that completed the Patient Health Questionnaire (PHQ-9). We compared the PHQ item parameters reported for this common metric to reestimated item parameters derived from fitting a generalized partial credit model solely to the PHQ-9 items. We calibrated the new model on the same scale as the common metric using two approaches (estimation with shifted prior and Stocking–Lord linking). By fitting a mixed-effects model and using Bland–Altman plots, we investigated the agreement between latent depression scores resulting from the different estimation models.

Results

We found different item parameters across samples and estimation methods. Although differences in latent depression scores between different estimation methods were statistically significant, these were clinically irrelevant.

Conclusion

Our findings provide evidence that it is possible to estimate latent depression scores by using the item parameters from a common metric instead of reestimating and linking a model. Using common metric parameters is simple, for example via a Web application (http://www.common-metrics.org), and offers a long-term perspective for improving the comparability of patient-reported outcome measures.

Introduction

The vast number of patient-reported outcome (PRO) measurement instruments assessing identical constructs hampers comparability in the field of psychotherapy research [1]. Because the scores of different PROs are not scaled on the same metric, communication among researchers and clinicians using different instruments is complicated: it is unclear to users how to equate a score measured on one instrument to a score measured on another. Furthermore, the pooling of study results based on different PROs in systematic reviews and meta-analyses may also be biased [2], [3]. For measuring depressive symptom severity alone, more than 100 scales are available [4]. Most of these instruments have been developed within the framework of classical test theory (CTT).

The standardization of PRO measures is urgently required to ensure the comparability of research results across different medical fields, diseases, languages, and study sites [5]. Compared to CTT, item response theory (IRT) is a more flexible approach to the analysis of PROs and offers a range of promising methods for calibrating items measuring the same construct on a standardized metric [6], [7]. Although IRT methodology is common in the field of educational psychology and well established in the testing of ability and achievement, its application to clinical assessments is comparatively new [8], [9]. Nonetheless, in recent years, IRT methods have been increasingly applied in PRO measurement [6], [10]. In this context, the Patient-Reported Outcomes Measurement Information System (PROMIS) is worth mentioning [11]. PROMIS is one of the most extensive PRO initiatives and provides a broad range of instruments. The uniqueness of PROMIS is that individual domains, such as depression, are measured through a large number of items that are calibrated to a standardized metric, so-called item banks [7], [12].

Despite efforts such as the PROMIS initiative, however, many researchers and clinicians may continue to use existing legacy instruments. Therefore, calibrating these instruments to a common metric is highly desirable to ensure comparability of results [13]. Using IRT to estimate common models for different instruments intended to measure the same construct, such a common metric can generally be developed in one of two ways. The first option is to link legacy instruments to an already existing metric of a well-established standardized instrument [14], [15], whereas the second option is to establish a new standardized metric by modeling a common IRT model for the items of several instruments measuring a specific health domain. Using the latter approach, Wahl et al. [16] recently published such a common metric for 11 depression measures, including the Patient Health Questionnaire (PHQ-9), based on data from 33,844 German adults. Research to enhance the comparability of PROs by means of IRT has also been carried out in further health domains, for instance, in the measurement of pain [17], headache [18], anxiety [15], depression [19], physical function [13], fatigue [20], and health-related quality of life [21].

Common IRT models have frequently been used to develop crosswalk tables between measures [14], [15], [20], [21], [22], [23], [24], as it is possible to derive latent trait estimates solely from the sum score [25], [26]. However, it is also possible to estimate the latent trait directly from the response pattern to each item. This approach, which was also used to develop the depression metric by Wahl et al. [16], has advantages not only over the inherent limitations of CTT but also over the use of crosswalk tables: individual pattern scoring yields more accurate person parameter estimates than sum scores and allows scoring in the presence of missing data [6], [14], [15].
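Pattern scoring under a generalized partial credit model (GPCM) with fixed item parameters can be sketched as an expected a posteriori (EAP) estimate over a standard normal prior. The item parameters below are purely illustrative placeholders, not the published Wahl et al. [16] calibration; the sketch only demonstrates why pattern scoring handles missing responses, which a sum-score crosswalk cannot.

```python
import numpy as np

def gpcm_probs(theta, a, b):
    """Category probabilities for one GPCM item.
    a: discrimination; b: step difficulties (len = categories - 1)."""
    # cumulative sums of a*(theta - b_v); the zero term is category 0
    steps = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b)))))
    z = np.exp(steps - steps.max())          # numerically stabilised
    return z / z.sum()

def eap_score(responses, items, n_quad=61):
    """EAP estimate of theta from a response pattern (None = missing item)."""
    nodes = np.linspace(-4.0, 4.0, n_quad)
    prior = np.exp(-0.5 * nodes**2)          # N(0, 1) prior, unnormalised
    like = np.ones(n_quad)
    for resp, (a, b) in zip(responses, items):
        if resp is None:                     # pattern scoring skips missing items
            continue
        like *= np.array([gpcm_probs(t, a, b)[resp] for t in nodes])
    post = like * prior
    return float(np.sum(nodes * post) / np.sum(post))

# Hypothetical parameters for three 4-category PHQ-style items (illustration only)
items = [(1.2, [-0.5, 0.3, 1.1]),
         (0.9, [-1.0, 0.0, 0.8]),
         (1.5, [-0.2, 0.6, 1.4])]
print(eap_score([2, None, 1], items))        # scored despite one missing response
```

In practice such scoring would use the published common-metric parameters for all PHQ-9 items; the point of the sketch is that the estimate uses the full response pattern rather than the sum score.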

The development of common metrics as conducted by Wahl et al. [16] is a relatively new technique in the field of PRO measurement, and so far there is little experience with their practical application. Peculiarities of the respective research designs raise the question of whether the resulting common PRO construct is still equivalent to the specific constructs defined by the individual measures included. In the case of the depression metric by Wahl et al. [16], only 89 of the initial 143 items of the instruments showed sufficient fit for the proposed unidimensional depression model. For instance, a third of the items of the PHQ-9, which had previously been shown to be a unidimensional measure [27], had to be excluded. Consequently, only the remaining well-fitting items were used to estimate the definitive common depression metric, and the excluded items were fitted to this model afterward. It remains unclear how this posterior fitting of items affected the construct validity of the common metric.

Several methods are available for using common metrics in data analysis. A classic approach is to estimate an IRT model for each new sample and subsequently link those item parameters to the common metric using the Stocking–Lord or Haebara method [14], [28]. Another approach to obtaining comparable person estimates is simply to apply the item parameters from the common metric to the new sample; in this case, no reestimation of item parameters is necessary.
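The Stocking–Lord criterion in the first approach can be sketched as finding a slope A and intercept B that minimize the squared difference between the test characteristic curves (TCCs) of the reference calibration and the rescaled new calibration; for a GPCM, discriminations transform as a/A and step difficulties as A*b + B. The parameter values below are hypothetical and chosen only so that the known transformation can be recovered; this is a minimal sketch, not the estimation procedure used in the article.

```python
import numpy as np
from scipy.optimize import minimize

def gpcm_expected(theta, a, b):
    """Expected item score under the GPCM at a given theta."""
    steps = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b)))))
    p = np.exp(steps - steps.max())
    p /= p.sum()
    return np.dot(np.arange(len(p)), p)

def tcc(theta, items):
    """Test characteristic curve: sum of expected item scores."""
    return sum(gpcm_expected(theta, a, b) for a, b in items)

def stocking_lord(new_items, ref_items, nodes=np.linspace(-4, 4, 41)):
    """Find (A, B) placing new_items on the reference scale by
    minimising squared TCC differences (Stocking-Lord criterion)."""
    def loss(par):
        A, B = par
        rescaled = [(a / A, A * np.asarray(b) + B) for a, b in new_items]
        return sum((tcc(t, ref_items) - tcc(t, rescaled)) ** 2 for t in nodes)
    return minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead").x

# Illustrative reference calibration, and a "new" calibration that is the
# reference shifted/scaled by a known transformation (A=1.3, B=0.4)
ref = [(1.2, [-0.5, 0.3, 1.1]), (0.9, [-1.0, 0.0, 0.8])]
new = [(1.3 * a, [(x - 0.4) / 1.3 for x in b]) for a, b in ref]
A, B = stocking_lord(new, ref)
print(A, B)  # recovers approximately 1.3 and 0.4
```

Because the synthetic "new" calibration differs from the reference by an exact linear transformation, the minimizer recovers it; with real reestimated parameters the TCCs match only approximately.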

Using specifically the latter approach, we recently developed a Web application for researchers to facilitate IRT score estimation from such models (www.common-metrics.org). This Web site allows researchers to score their data on several published IRT models comprising different PROs, for example, the depression metric by Wahl et al. [16], the PROMIS depression metric [14], or the PROMIS anxiety metric [15]. This approach is explicitly suggested by the developers of those models [14], [15] and could be carried out in standard IRT software as well. However, we aim to lower the barriers to IRT scoring for those not familiar with modern test theory. Using the aforementioned Web application, researchers can directly upload a spreadsheet of item responses, and the scoring is done automatically. A more detailed explanation of the underlying estimation techniques can be found on the Web site itself.

Following from the above, the aim of this article is to examine whether depressive symptom severity estimated by simply using the item parameters from the common metric by Wahl et al. [16] differs from that obtained by reestimating and linking a model. This is especially important because some items of the PHQ-9 were excluded in the primary estimation of the depression metric by Wahl et al. [16], whereas they would contribute to the latent trait definition in the case of reestimation. Therefore, we investigate potential differences between the depression scores resulting from applying the Wahl et al. [16] parameters and the depression scores resulting from reestimation using two different item parameter linking methods, in four samples that did not contribute to the development of the common depression metric [16].

Section snippets

Samples

In a secondary data analysis, we applied the depression metric by Wahl et al. [16] to four samples (n = 3,315) that had answered the depression scale of the German version of the PHQ-9/PHQ-8 as part of studies [29], [30], [31] or as part of clinical routine diagnostics. The included samples differed in treatment setting, country (two from Germany and two from Austria), medical conditions, and depressive symptom severity as measured by the PHQ-9/PHQ-8 (Table 1).

Sample 1 (n = 1,049) includes

Results

In each sample, CFA showed a comparable pattern of high correlations (0.62–0.89) between each item and one single latent factor (Table 2). Confidence intervals showed high overlap of loadings between samples. Significant chi-square tests indicated deviation from the model-implied covariance matrix in all samples, while CFI was above the widely used criterion of 0.95, indicating appropriate model fit. In contrast, RMSEA exceeded common criteria of 0.08 in three of four samples. No residual

Discussion

We compared three different methods to analyze data collected with two versions of the widely used depression measure of the PHQ-9/PHQ-8 on a previously published common metric. Our main finding is that latent depression scores of the different approaches are remarkably similar. The estimated depression score differences between the methods are about Δ = 0.1, which is not clinically relevant on a metric that is fixed to a German general population mean of 50 and a standard deviation of 10. When

References (66)

  • G. Liegl et al.

    Prevalence of common mental disorders in primary care in Austria

    J Psychosom Res

    (2014)
  • B. Löwe et al.

    Measuring depression outcome with a brief self-report instrument: sensitivity to change of the Patient Health Questionnaire (PHQ-9)

    J Affect Disord

    (2004)
  • K. Kroenke et al.

    The patient health questionnaire somatic, anxiety, and depressive symptom scales: a systematic review

    Gen Hosp Psychiatry

    (2010)
  • K. Kroenke et al.

    The PHQ-8 as a measure of current depression in the general population

    J Affect Disord

    (2009)
  • I. Razykov et al.

    The PHQ-9 versus the PHQ-8–is item 9 useful for assessing suicide risk in coronary artery disease patients? Data from the Heart and Soul Study

    J Psychosom Res

    (2012)
  • J.M. Bland et al.

    Comparing methods of measurement: why plotting difference against standard method is misleading

    Lancet

    (1995)
  • I. Wahl et al.

    Die Erfassung der Lebensqualität in der Psychotherapieforschung

    Klin Diagnostik Und Eval

    (2010)
  • M.A. Puhan et al.

    Combining scores from different patient reported outcome measures in meta-analyses: when is it justified?

    Health Qual Life Outcomes

    (2006)
  • G. Liegl et al.

    Guided self-help interventions for irritable bowel syndrome: a systematic review and meta-analysis

    Eur J Gastroenterol Hepatol

    (2015)
  • R.-D. Stieglitz

    Diagnostik und Klassifikation in der Psychiatrie

    (2008)
  • S.E. Embretson et al.

    Item response theory for psychologists

    (2000)
  • M.L. Thomas

    The value of item response theory in clinical assessment: a review

    Assessment

    (2010)
  • S.P. Reise et al.

    Handbook of item response theory modeling

    (2014)
  • S.P. Reise et al.

    Item response theory and clinical measurement

    Annu Rev Clin Psychol

    (2009)
  • P.A. Pilkonis et al.

    Item banks for measuring emotional distress from the Patient-Reported Outcomes Measurement Information System (PROMIS®): depression, anxiety, and anger

    Assessment

    (2011)
  • B.D. Schalet et al.

    Establishing a common metric for physical function: linking the HAQ-DI and SF-36 PF Subscale to PROMIS(®) physical function

    J Gen Intern Med

    (2015)
  • S.W. Choi et al.

    Establishing a common metric for depressive symptoms: linking the BDI-II, CES-D, and PHQ-9 to PROMIS depression

    Psychol Assess

    (2014)
  • J.B. Bjorner et al.

    Using item response theory to calibrate the Headache Impact Test (HIT) to the metric of traditional headache scales

    Qual Life Res

    (2003)
  • L.E. Gibbons et al.

    Migrating from a legacy fixed-format measure to CAT administration: calibrating the PHQ-9 to the PROMIS depression measures

    Qual Life Res

    (2011)
  • P.M. ten Klooster

    Development and evaluation of a crosswalk between the SF-36 physical functioning scale and Health Assessment Questionnaire disability index in rheumatoid arthritis

    Health Qual Life Outcomes

    (2013)
  • N.J. Dorans

    Linking scores from multiple health outcome instruments

    Qual Life Res

    (2007)
  • D. Thissen et al.

    Item response theory for scores on tests including polytomous items with ordered responses

    Appl Psychol Meas

    (1995)
  • F.M. Lord et al.

    Comparison of IRT true-score and equipercentile observed-score “Equatings”

    Appl Psychol Meas

    (1984)

Funding: The authors did not receive any financial support for the secondary data analysis based on the four included samples. Original data collection for sample 1, sample 3, and sample 4 was not financially supported, as it was part of clinical routine diagnostics. Data collection for sample 2 was supported by a limited grant from Pfizer GmbH, Karlsruhe.

    Conflict of interest: None.
