Background
Depression is one of the most prevalent and burdensome mental health disorders worldwide. The World Health Organization (WHO) calls it one of the top risk factors for health and predicts depression and affective disorders will be the second most frequent widespread disease worldwide by 2020 [1]. Standardized clinical interviews such as the Composite International Diagnostic Interview (CIDI, [2]) are valid and reliable instruments to assess depression [3-6]. However, their administration is time-consuming and requires trained interviewers. Therefore, shorter self-report measures are often used instead of clinical interviews in population-based surveys to screen for depression. The Patient Health Questionnaire-9 (PHQ-9, [7]) is a nine-item self-report measure of depressive symptoms that has been used in clinical and general population samples [8-10]. The questionnaire has been translated into several languages for widespread international use (e.g., [11-13]). The nine items represent the nine clinical criteria for depression from the Diagnostic and Statistical Manual of Mental Disorders, fifth edition (DSM-5, [14]): anhedonia, depressed mood, sleep disturbance, fatigue, appetite changes, low self-esteem, concentration problems, psychomotor disturbances, and suicidal ideation. Thus, the PHQ-9 screens for affective, cognitive, and somatic aspects of depression. In intervention studies, the PHQ-9 is frequently used as a measure of changes in depression severity [15-17]. The PHQ-9 has been validated as a self-administered questionnaire [7, 11] and as a telephone interview [18]. It may be used in clinical and non-clinical samples [10]. Another widely used version of this questionnaire is the PHQ-8 [19]. It is a shortened version of the PHQ-9 that omits the ninth item on self-injurious or suicidal ideas. Data revealed that this item was often superfluous for assessments because thoughts of self-harm are rather uncommon even in samples of clinically depressed patients [20, 21]. Furthermore, some studies suggest that this item shows a notably low discriminatory power [8] and often indicates passive thoughts about death rather than suicidal or self-harm intentions [22]. This supports the suitability of the PHQ-8, which has shown good validity and reliability as a measure of different levels of depression. Still, most research on psychometric properties has been done using the PHQ-9.
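The relation between the two versions can be illustrated with a minimal scoring sketch. It assumes the usual PHQ response format, in which each item is rated from 0 to 3 and the suicidal ideation item is administered last; neither detail is taken from the text above, and the function name `phq8_total` is hypothetical.

```python
from typing import Sequence


def phq8_total(responses: Sequence[int]) -> int:
    """Return the PHQ-8 sum score from eight or nine item responses.

    Assumption: items are rated 0-3 and, if nine responses are given,
    the ninth is the suicidal ideation item, which is dropped.
    """
    if len(responses) not in (8, 9):
        raise ValueError("expected 8 (PHQ-8) or 9 (PHQ-9) item responses")
    items = list(responses[:8])  # keep only the eight shared items
    if any(r not in (0, 1, 2, 3) for r in items):
        raise ValueError("each item response must be 0, 1, 2, or 3")
    return sum(items)


# Hypothetical PHQ-9 response pattern; the last value is item 9.
example = [1, 2, 0, 3, 1, 0, 2, 1, 0]
print(phq8_total(example))  # 10
```

Under these assumptions the PHQ-8 sum score ranges from 0 to 24, and a PHQ-9 record can be rescored as a PHQ-8 record without re-administering the questionnaire.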
Research has been undertaken to assess whether the PHQ-9 includes different subscales that indicate different symptom domains. For this purpose, its factor structure has been analyzed. The findings provide support for a one-factor model [23-26], a two-factor model [27-29] or, albeit less frequently, a three-factor model [30]. Overall, the results regarding the factor structure are still inconsistent. In their systematic review, Lamela, Soreira [29] provide an overview of the heterogeneity in the factor structure of the PHQ-9. Their own results support the two-factor structure of the questionnaire. Similarly, Mattsson, Sandqvist [31] found a two-factor structure for the PHQ-8. A two-factor structure was also found in a sample of patients with chronic heart failure [32]. However, using exploratory factor analysis, Schantz, Reighard [33] found a one-factor structure of the PHQ-8.
Measurement invariance is a crucial prerequisite for comparisons between groups of individuals and between points of time in measurement. If measurement invariance is evidenced, we can conclude that the same construct is measured across groups and that observed group differences reflect true group differences. Failure to obtain measurement invariance renders group comparisons ambiguous because observed differences might merely be caused by psychometric differences related to item responses instead of differences in the underlying construct. There are studies on the measurement invariance of the PHQ-9, especially with regard to gender-specific measurement invariance [9, 34]. However, measurement invariance also needs to be examined for the groups compared in studies with experimental designs. To interpret differences between an intervention and a control group as true group differences, we first have to provide evidence for measurement invariance.
Measurement invariance analyses can also refer to different points of time. This is essential for longitudinal analyses because researchers should ensure that their measurement instruments are equivalent over time. Changes in PHQ scores over different points of time can only be meaningfully interpreted if measurement invariance can be assumed. However, evidence of measurement invariance over time is scarce. For example, Downey, Hayduk [35] examined longitudinal measurement invariance of the PHQ for family members of patients in intensive care units. They were unable to show invariance for either the PHQ-9 or the PHQ-8 and concluded that the questionnaire might not be adequate for the assessment of depression in this specific population. However, the authors only examined the fit of a constrained model without comparison to an unconstrained baseline model. A step-wise approach, in which increasingly constrained models are compared with less constrained ones, could be more adequate to analyze measurement invariance.
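The logic of such a step-wise comparison can be sketched as follows. This is an illustration only and not the analysis reported in this study: the fit values are hypothetical, the names `ModelFit` and `highest_supported_level` are invented for the example, and the decision rule (a drop in the comparative fit index, CFI, of more than .01 between adjacent nested models) is a commonly cited rule of thumb that is assumed here. The sketch also assumes that the configural baseline model itself fits acceptably.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelFit:
    level: str  # "configural", "metric", "scalar", or "strict"
    cfi: float  # comparative fit index of the fitted model


def highest_supported_level(
    fits: List[ModelFit], max_cfi_drop: float = 0.01
) -> Optional[str]:
    """Walk through nested models from least to most constrained.

    Stops at the first step whose CFI drops by more than `max_cfi_drop`
    relative to the preceding, less constrained model (assumed cutoff).
    """
    if not fits:
        return None
    supported = fits[0].level  # baseline model, assumed to fit acceptably
    for previous, current in zip(fits, fits[1:]):
        if previous.cfi - current.cfi > max_cfi_drop:
            break  # added constraints worsen fit too much
        supported = current.level
    return supported


# Hypothetical fit values, ordered from least to most constrained.
fits = [
    ModelFit("configural", 0.970),
    ModelFit("metric", 0.940),
    ModelFit("scalar", 0.935),
    ModelFit("strict", 0.930),
]
print(highest_supported_level(fits))  # configural
```

In contrast to testing a single fully constrained model, this procedure shows at which level of constraints the fit deteriorates.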
The aims of the current study were 1) to compare a one-factor structure to a two-factor structure for the PHQ-8 at one point of time (baseline assessment), 2) to provide evidence for measurement invariance across five points of time, including a baseline assessment and 2-, 4-, 6-, and 12-month follow-up assessments, separately for participants in the two study groups, and 3) to provide evidence for longitudinal measurement invariance between the intervention group and the control group.
Discussion
A two-factor structure with a somatic and a non-somatic factor showed the best model fit for all measurement models in our analyses. Full measurement invariance was only achieved across the 2-, 4-, 6-, and 12-month assessments. Including the baseline assessment in the model resulted in a substantial deterioration of the model fit at the metric invariance level. Thus, only configural invariance (i.e., the same factor structure) could be assumed across all five assessments.
So far, studies on the measurement invariance of the PHQ-9 have consistently shown invariance across sociodemographic variables [29]. Although the number of studies is still small, this suggests that PHQ-9 scores can be meaningfully compared across sociodemographic groups. However, far less is known about the longitudinal measurement invariance of both the PHQ-8 and the PHQ-9. For example, Downey, Hayduk [35] reported non-invariance of one-factor models for both questionnaires, while Schuler, Strohmayer [48] found at least partial scalar invariance for a one-factor model of the PHQ-9. Gonzalez-Blanch, Medrano [49] even found strict invariance for a one-factor model of the PHQ-9 between two assessments. These differences might be due to different methodological approaches (i.e., one-step or four-step analysis) or differences in the sample populations (e.g., clinical or non-clinical populations). Our results demonstrate that longitudinal invariance can also be established for a two-factor model of the PHQ-8 (for four of five assessments). They further include measurement invariance between experimental groups, which is crucial to show that differences between intervention and control group reflect differences in the underlying construct.
The reported lack of invariance between the baseline assessment and all other assessments could have several explanations, one of which is the different modes of presentation of the PHQ-8 (self-administered questionnaire versus telephone interview). Effects of presentation modes have been investigated for several tests and questionnaires. For the PHQ-9, there is evidence that the telephone version is comparable to a paper-pencil version of the questionnaire [18]. However, to our knowledge, no such examination has been conducted for the PHQ-8 so far. Furthermore, no data exist on how a computerized assessment may differ from telephone assessments. This could have important implications for the PHQ. Future research could examine whether different modes of presentation require different cut-off points for screening depressive symptoms with the PHQ.
It is also possible that the different timeframes for the items (i.e., the past 12 months at baseline, the past 2 months for the 2- and 4-month assessments, and the past 6 months for the 6- and 12-month assessments) contributed to the lack of longitudinal invariance. Our results could suggest that the retrospective assessment of depressive symptoms is biased for longer periods of time such as the 12-month interval. This seems reasonable considering that an accurate recall of symptoms becomes increasingly difficult over longer periods. Possibly, our results did not show longitudinal invariance with the baseline assessment because participants were only asked to think about such a long timeframe at the baseline assessment. However, the strict invariance across the follow-up assessments suggests that smaller differences in the timeframes for the PHQ-8 might not be a problem for longitudinal analyses.
Finally, completing the initial screening and agreeing to participate in a study focusing on depressive symptoms could have resulted in a heightened self-awareness of participants regarding their mental health. This might have led participants to perceive the items about depression differently at the follow-up assessments than at the initial assessment, at which the majority of participants may not have thought about depressive symptoms before. It is important to note that measurement invariance was shown between the intervention and the control group. Therefore, it is highly unlikely that the application of questions about health behaviors caused biased responses only in the intervention group [50]. Although the control group did not receive the intervention, the mere participation in the study and the baseline assessment may have been sufficient to change participants’ self-awareness about their own mental health states at the follow-up assessments, for participants in the intervention as well as the control group. This result shows that group comparisons between intervention and control group at the 6-month and 12-month follow-up assessments and, for separate analyses, at the baseline assessment for the PHQ-8 mean score are meaningful [51].
Conclusions
The configural invariance for all five points of time shows that the PHQ-8 reliably captures the same conceptual framework (i.e., the same factor structure) when measured over time. However, the lack of metric invariance (i.e., factor loadings cannot be assumed to be equal across time) means that the associations between the items and the factors cannot be assumed to be equal across the baseline and the follow-up assessments. Furthermore, we cannot conclude that the PHQ-8 has the same operational definition across time due to the lack of scalar invariance (i.e., item thresholds cannot be assumed to be equal across time). Nevertheless, we were able to establish strict longitudinal invariance across the 2-, 4-, 6- and 12-month assessments and between groups across the 6- and 12-month assessments. This emphasizes the influence on our results of the factors that varied between the baseline and the follow-up assessments, such as the different modes of presentation (self-administered vs. telephone interview). It is very likely that the lack of invariance was caused by these factors rather than by the longitudinal design. Altogether, the results indicate that PHQ-8 scores can be compared across time and between groups, at least when the questionnaire is used under similar conditions (presentation mode, timeframe of the items, assessment setting). However, researchers interested in longitudinal measurements of the PHQ-8 should be careful with varying conditions between measurement points. Future research should investigate the validity and possible differences of a self-administered paper-pencil version, the digital version, and the telephone interview of the PHQ-8.
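For reference, the invariance levels named above can be written as nested sets of equality constraints. The notation is an assumption introduced for illustration and is not taken from the analyses: $y_{jt}$ is the observed response to item $j$ at time $t$ with categories $c = 0, \dots, 3$, $y^{*}_{jt}$ is the corresponding latent response, $\boldsymbol{\eta}_{t}$ contains the two factors, $\boldsymbol{\lambda}_{jt}$ the factor loadings, $\tau_{jt,c}$ the item thresholds (with $\tau_{jt,0} = -\infty$ and $\tau_{jt,4} = \infty$), and $\varepsilon_{jt}$ the residual. Each level is understood to add its constraints to those of the preceding level.

```latex
\begin{align*}
&\text{Measurement model:} & y^{*}_{jt} &= \boldsymbol{\lambda}_{jt}^{\top}\boldsymbol{\eta}_{t} + \varepsilon_{jt},
  \qquad y_{jt} = c \iff \tau_{jt,c} < y^{*}_{jt} \le \tau_{jt,c+1} \\
&\text{Configural:} & &\text{same pattern of zero and non-zero loadings at every } t \\
&\text{Metric:} & \boldsymbol{\lambda}_{jt} &= \boldsymbol{\lambda}_{j} \quad \text{for all } t \\
&\text{Scalar:} & \tau_{jt,c} &= \tau_{j,c} \quad \text{for all } t \\
&\text{Strict:} & \operatorname{Var}(\varepsilon_{jt}) &= \theta_{j} \quad \text{for all } t
\end{align*}
```

Read this way, the present results correspond to the configural constraint holding across all five assessments, whereas the metric, scalar, and strict constraints could only be supported once the baseline assessment was excluded.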