Original Article

Using Patient Health Questionnaire-9 item parameters of a common metric resulted in similar depression scores compared to independent item response theory model reestimation
Introduction
The vast number of patient-reported outcome (PRO) measurement instruments assessing identical constructs hampers comparability in the field of psychotherapy research [1]. Because the scores of different PROs are not scaled on the same metric, communication among researchers and clinicians using different instruments is complicated: it is unclear how to equate a score measured on one instrument to a score measured on another. Furthermore, pooling study results based on different PROs in systematic reviews and meta-analyses may be biased [2], [3]. For measuring depressive symptom severity alone, more than 100 scales are available [4]. Most of these instruments have been developed within the framework of classical test theory (CTT).
The standardization of PRO measures is urgently required to ensure the comparability of research results across medical fields, diseases, languages, and study sites [5]. Compared to CTT, item response theory (IRT) is a more flexible approach to the analysis of PROs and offers a range of promising methods to calibrate items measuring the same construct on a standardized metric [6], [7]. Although IRT methodology is common in educational psychology and well established in the testing of ability and achievement, its application to clinical assessment is comparatively new [8], [9]. Nonetheless, in recent years, IRT methods have been increasingly applied in PRO measurement [6], [10]. In this context, the Patient-Reported Outcomes Measurement Information System (PROMIS) is worth mentioning [11]. PROMIS is one of the most extensive PRO initiatives and provides a broad range of instruments. Its distinctive feature is that individual domains, such as depression, are measured through large sets of items that are calibrated to a standardized metric, so-called item banks [7], [12].
Despite efforts such as the PROMIS initiative, many researchers and clinicians may continue to use existing legacy instruments. Therefore, calibrating these instruments to a common metric is highly desirable to ensure comparability of results [13]. Using IRT-based methods for estimating common models for different instruments aimed at measuring the same construct, such a common metric can generally be developed in one of two ways. The first option is to link legacy instruments to the already existing metric of a well-established standardized instrument [14], [15]; the second is to establish a new standardized metric by modeling a common IRT model for the items of several instruments measuring a specific health domain. Using the latter approach, Wahl et al. [16] recently published such a common metric for 11 depression measures, including the Patient Health Questionnaire (PHQ-9), based on data from 33,844 German adults. Research to enhance the comparability of PROs by means of IRT has also explored further health domains, for instance, the measurement of pain [17], headache [18], anxiety [15], depression [19], physical function [13], fatigue [20], and health-related quality of life [21].
Common IRT models have frequently been used to develop crosswalk tables between measures [14], [15], [20], [21], [22], [23], [24], as latent trait estimates can be derived solely from the sum score [25], [26]. However, the latent trait can also be estimated directly from the response pattern across items. This approach, which was also used to develop the depression metric by Wahl et al. [16], has advantages not only over the inherent limitations of CTT but also over the use of crosswalk tables: individual pattern scoring yields more accurate person parameter estimates than sum scores and allows scoring in the presence of missing data [6], [14], [15].
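To illustrate individual pattern scoring, the sketch below computes an expected a posteriori (EAP) trait estimate under a graded response model directly from a response pattern, skipping missing items. This is a generic illustration of the technique, not the authors' implementation; the item parameters used in the example are hypothetical stand-ins for calibrated values such as those of a common metric.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities of one graded-response item at ability theta.
    a: discrimination; b: ordered thresholds (length = categories - 1)."""
    # P(X >= k | theta) for k = 1..m (two-parameter logistic form)
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b))))
    upper = np.concatenate(([1.0], p_star))
    lower = np.concatenate((p_star, [0.0]))
    return upper - lower  # P(X = k | theta) for k = 0..m

def eap_score(responses, items, nodes=61):
    """EAP ability estimate from a response pattern; None marks a missing item.
    items: list of (a, b) parameter tuples, one per item."""
    theta_q = np.linspace(-4.0, 4.0, nodes)
    weights = np.exp(-0.5 * theta_q**2)  # standard-normal prior (unnormalized)
    like = np.ones_like(theta_q)
    for resp, (a, b) in zip(responses, items):
        if resp is None:  # missing responses simply drop out of the likelihood
            continue
        like *= np.array([grm_category_probs(t, a, b)[resp] for t in theta_q])
    post = like * weights
    return float(np.sum(theta_q * post) / np.sum(post))
```

For example, with two hypothetical four-category items, `eap_score([3, 3], items)` yields a higher depression estimate than `eap_score([0, 0], items)`, and `eap_score([2, None], items)` still returns a score despite the missing response.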
The development of common metrics as conducted by Wahl et al. [16] is a relatively new technique in the field of PRO measurement, and so far there is little experience with their practical application. Peculiarities of the respective research designs raise the question of whether the resulting common PRO construct is still equivalent to the specific constructs defined by the individual measures included. In the case of the depression metric by Wahl et al., only 89 of the initial 143 items of the instruments showed sufficient fit for the proposed unidimensional depression model. For instance, a third of the items of the PHQ-9, previously shown to be a unidimensional measure [27], had to be excluded. Consequently, only the remaining well-fitting items were used for estimating the definitive common depression metric; the excluded items were fitted to this model subsequently. It remains unclear how this post hoc fitting of items affected the construct validity of the common metric.
Several methods can be used to apply common metrics in data analysis. A classic approach is to estimate an IRT model for each new sample and subsequently link those item parameters to the common metric, using the Stocking–Lord or Haebara method [14], [28]. Another approach to obtain comparable person estimates is to simply apply the item parameters from the common metric to the new sample; in this case, no reestimation of item parameters is necessary.
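The Stocking–Lord procedure finds linking constants A and B that make the test characteristic curve (TCC) of the transformed new calibration match the reference calibration as closely as possible over a grid of trait values. The sketch below illustrates the criterion for dichotomous 2PL items; it is a minimal, generic illustration (the parameter values in the usage example are hypothetical), not the software used in the study.

```python
import numpy as np
from scipy.optimize import minimize

def tcc(theta, a, b):
    """Test characteristic curve of a 2PL scale: expected sum score at each theta."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(a) * (theta[:, None] - np.asarray(b))))
    return p.sum(axis=1)

def stocking_lord(a_new, b_new, a_ref, b_ref):
    """Find constants (A, B) mapping the new calibration onto the reference
    metric via a -> a / A, b -> A * b + B (Stocking-Lord criterion)."""
    theta = np.linspace(-4.0, 4.0, 41)  # quadrature grid over the trait range
    target = tcc(theta, a_ref, b_ref)
    def loss(x):
        A, B = x
        return np.sum((target - tcc(theta, np.asarray(a_new) / A,
                                    A * np.asarray(b_new) + B)) ** 2)
    res = minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead")
    return res.x  # estimated (A, B)
```

If, for instance, the new calibration's metric differs from the reference by a true transformation of A = 1.2, B = 0.5, minimizing this criterion recovers those constants, after which person estimates from the new sample can be placed on the common metric.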
Using specifically the latter approach, we recently developed a Web application for researchers to facilitate IRT score estimation from such models (www.common-metrics.org). This Web site allows researchers to score their data on several published IRT models comprising different PROs, for example, the depression metric by Wahl et al. [16], the PROMIS depression metric [14], or the PROMIS anxiety metric [15]. This approach is explicitly suggested by the developers of those models [14], [15] and could also be carried out in standard IRT software. However, we aim to lower the barriers to IRT scoring for those not familiar with modern test theory. Using the aforementioned Web application, researchers can directly upload a spreadsheet of item responses, and the scoring is done automatically. A more detailed explanation of the underlying estimation techniques can be found on the Web site itself.
Following from the above, the aim of this article is to examine whether depressive symptom severity estimated by simply using the item parameters from the common metric by Wahl et al. [16] differs from that obtained by reestimating and linking a model. This is especially important because some items of the PHQ-9 were excluded from the primary estimation of the depression metric by Wahl et al., whereas they would contribute to the latent trait definition in the case of reestimation. Therefore, we investigate potential differences between the depression scores resulting from applying the item parameters of the depression metric by Wahl et al. [16] and the depression scores resulting from reestimation using two different item parameter linking methods, in four samples that did not contribute to the development of the common depression metric [16].
Samples
By secondary data analysis, we applied the depression metric by Wahl et al. [16] to four samples (n = 3,315) that had answered the depression scale of the German version of the PHQ-9/PHQ-8 as part of studies [29], [30], [31] or as part of routine clinical diagnostics. The included samples differed in treatment settings, country (two from Germany and two from Austria), medical conditions, and depressive symptom severity as measured by the PHQ-9/PHQ-8 (Table 1).
Sample 1 (n = 1,049) includes
Results
In each sample, CFA showed a comparable pattern of high correlations (0.62–0.89) between each item and a single latent factor (Table 2). Confidence intervals showed high overlap of loadings between samples. Significant chi-square tests indicated deviation from the model-implied covariance matrix in all samples, while the CFI was above the widely used criterion of 0.95, indicating appropriate model fit. In contrast, the RMSEA exceeded the common criterion of 0.08 in three of four samples. No residual
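For reference, the two fit indices used above can be computed from the model and baseline (independence-model) chi-square statistics with their degrees of freedom. The sketch below uses the standard formulas; the numeric values in the usage example are hypothetical, not the study's results.

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation from the model chi-square
    (df degrees of freedom, sample size n)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative fit index: improvement of the target model (m) over the
    baseline independence model (b), each measured as chi-square minus df."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, 0.0)
    return 1.0 - d_m / max(d_m, d_b) if max(d_m, d_b) > 0 else 1.0
```

For example, with a hypothetical model chi-square of 200 on 27 degrees of freedom in a sample of n = 1,049, `rmsea(200.0, 27, 1049)` is about 0.08, right at the cutoff mentioned above, while a baseline chi-square of 5,000 on 36 degrees of freedom gives a CFI above 0.95.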
Discussion
We compared three different methods to analyze data collected with two versions of the widely used depression measure of the PHQ-9/PHQ-8 on a previously published common metric. Our main finding is that latent depression scores of the different approaches are remarkably similar. The estimated depression score differences between the methods are about Δ = 0.1, which is not clinically relevant on a metric that is fixed to a German general population mean of 50 and a standard deviation of 10. When
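On such a standardized metric, latent trait estimates on the standard-normal scale are reported as T-scores anchored at the reference population (here mean 50, standard deviation 10), so the observed difference of Δ = 0.1 amounts to one-hundredth of a standard deviation. A minimal sketch of this standard linear transformation (the numbers are illustrative):

```python
def to_t_score(theta, mean=50.0, sd=10.0):
    """Map a latent trait estimate on a standard-normal metric to a T-score
    anchored at the reference population mean and standard deviation."""
    return mean + sd * theta

# A between-method difference of 0.1 T-score points equals 0.01 SD units:
delta_sd = 0.1 / 10.0
```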
References (66)
- et al. Evaluation of a preliminary physical function item bank supported the expected advantages of the Patient-Reported Outcomes Measurement Information System (PROMIS). J Clin Epidemiol (2008)
- et al. The PROMIS Physical Function item bank was calibrated to a standardized metric and shown to improve measurement efficiency. J Clin Epidemiol (2014)
- et al. The Patient-Reported Outcomes Measurement Information System (PROMIS) developed and tested its first wave of adult self-reported health outcome item banks: 2005-2008. J Clin Epidemiol (2010)
- et al. Establishing a common metric for self-reported anxiety: linking the MASQ, PANAS, and GAD-7 to PROMIS Anxiety. J Anxiety Disord (2014)
- et al. Standardization of depression measurement: a common metric was developed for 11 self-report depression measures. J Clin Epidemiol (2014)
- et al. Linking pain items from two studies onto a common scale using item response theory. J Pain Symptom Manage (2009)
- et al. Linking fatigue measures on a common reporting metric. J Pain Symptom Manage (2014)
- et al. Correspondence between the RAND-negative impact of asthma on quality of life item bank and the Marks asthma quality of life questionnaire. Clin Ther (2014)
- et al. Linking the activity measure for post acute care and the quality of life outcomes in neurological disorders. Arch Phys Med Rehabil (2011)
- et al. Influence of depression, expectation of therapy effectiveness, and self-efficacy on the treatment outcome in patients with multiple somatic symptoms (MSS). J Psychosom Res (2014)
- Prevalence of common mental disorders in primary care in Austria. J Psychosom Res
- Measuring depression outcome with a brief self-report instrument: sensitivity to change of the Patient Health Questionnaire (PHQ-9). J Affect Disord
- The patient health questionnaire somatic, anxiety, and depressive symptom scales: a systematic review. Gen Hosp Psychiatry
- The PHQ-8 as a measure of current depression in the general population. J Affect Disord
- The PHQ-9 versus the PHQ-8: is item 9 useful for assessing suicide risk in coronary artery disease patients? Data from the Heart and Soul Study. J Psychosom Res
- Comparing methods of measurement: why plotting difference against standard method is misleading. Lancet
- Die Erfassung der Lebensqualität in der Psychotherapieforschung [Assessment of quality of life in psychotherapy research]. Klin Diagnostik und Eval
- Combining scores from different patient reported outcome measures in meta-analyses: when is it justified? Health Qual Life Outcomes
- Guided self-help interventions for irritable bowel syndrome: a systematic review and meta-analysis. Eur J Gastroenterol Hepatol
- Diagnostik und Klassifikation in der Psychiatrie [Diagnostics and classification in psychiatry]
- Item response theory for psychologists
- The value of item response theory in clinical assessment: a review. Assessment
- Handbook of item response theory modeling
- Item response theory and clinical measurement. Annu Rev Clin Psychol
- Item banks for measuring emotional distress from the Patient-Reported Outcomes Measurement Information System (PROMIS®): depression, anxiety, and anger. Assessment
- Establishing a common metric for physical function: linking the HAQ-DI and SF-36 PF Subscale to PROMIS® physical function. J Gen Intern Med
- Establishing a common metric for depressive symptoms: linking the BDI-II, CES-D, and PHQ-9 to PROMIS depression. Psychol Assess
- Using item response theory to calibrate the Headache Impact Test (HIT) to the metric of traditional headache scales. Qual Life Res
- Migrating from a legacy fixed-format measure to CAT administration: calibrating the PHQ-9 to the PROMIS depression measures. Qual Life Res
- Development and evaluation of a crosswalk between the SF-36 physical functioning scale and Health Assessment Questionnaire disability index in rheumatoid arthritis. Health Qual Life Outcomes
- Linking scores from multiple health outcome instruments. Qual Life Res
- Item response theory for scores on tests including polytomous items with ordered responses. Appl Psychol Meas
- Comparison of IRT true-score and equipercentile observed-score "Equatings". Appl Psychol Meas
Funding: The authors did not receive any financial support for the secondary data analysis based on the four included samples. Original data collection for sample 1, sample 3, and sample 4 was not financially supported, as it was part of routine clinical diagnostics. Pfizer GmbH, Karlsruhe, supported the data collection for sample 2 with a limited grant.
Conflict of interest: None.