Review Article
Item reduction based on rigorous methodological guidelines is necessary to maintain validity when shortening composite measurement scales

https://doi.org/10.1016/j.jclinepi.2012.12.015

Abstract

Objective

To review current practice and update guidelines for the methodology of shortening composite measurement scales (CMSs).

Study Design and Setting

A literature review gathered data on 91 shortening processes from 1995 to 2009. The validity of the initial CMS, the shortening methods, and the validity of the derived short-form scales were examined. The results were compared with those from a previous literature review (articles from 1985 to 1995) to develop updated guidelines for CMS shortening.

Results

The literature review revealed a persisting lack of rigorous methodology for CMS shortening. Of the 91 cases of CMS shortening, 36 (40%) combined a content approach and a statistical approach, 45 (49%) used only a statistical approach, and 10 (11%) used only a content approach. The updated guidelines deal with the validity and conceptual model of the initial CMS, the preservation of content and psychometric properties during shortening, the selection of items, and the validation of the short form.

Conclusion

Item reduction based on a rigorous methodology is necessary if the short-form instrument aims to maintain the validity and other measurement properties of the parent instrument, which in turn supports application in research and clinical practice.

Introduction

Many health constructs are too complex to be captured by direct measurement. When these constructs need to be examined, one of the most widely used approaches is the composite measurement scale (CMS). A CMS generally consists of items or questions, each scored on a scale, that together assess one or several attributes.
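To make this structure concrete, the following is a minimal sketch, using a hypothetical five-item scale with 1-5 Likert scoring that is not taken from the article, of how a CMS composite score is typically obtained by summing item scores.

```python
# Hypothetical example: a five-item CMS in which each item is answered on a
# 1-5 Likert scale and the composite score is the simple sum of item scores.
item_scores = {
    "item_1": 4,
    "item_2": 3,
    "item_3": 5,
    "item_4": 2,
    "item_5": 4,
}

composite = sum(item_scores.values())
minimum, maximum = len(item_scores) * 1, len(item_scores) * 5
print(f"Composite score: {composite} (possible range {minimum}-{maximum})")
```

Real CMSs may weight items, group them into subscales for several attributes, or use more elaborate scoring models; simple summation is only the most basic case.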

Over the years, the measurement of health constructs has led to the production of a large number of scales, often with a high number of items [1].

The burden of long scales and the increasing need for multiple instruments in the same study have logically created a strong need to shorten CMSs. Shortening a CMS consists of reducing its number of items while trying to preserve or improve its psychometric properties. The methodology for developing new CMSs is well documented [2] and includes consideration of different psychometric properties, but it has limited applicability as a means of informing the shortening of instruments. Guidelines for shortening existing CMSs are scarce.

In 1997, Coste et al. [3] noted that most articles reporting on scale shortening lacked rigorous methodology: shortening processes were often inadequately conceptualized, and excessive credit was given to statistical techniques. The authors recommended carefully choosing the original scale according to its content, its potential for shortening, and its psychometric properties; focusing on criterion validity if the original scale can be considered the gold standard and, if not, preferring an expert-based approach to content validity, possibly supported by statistical considerations; and finally, performing a validation study in an independent sample. In 2000, Smith et al. [4] also noted methodological pitfalls in CMS shortening in psychology and recommended using a validated original scale, clarifying the intended use of the short form, estimating a priori the properties of the short form to balance resource or time savings against the loss of validity, preserving content, and using an independent sample to validate the short form. In 2002, Stanton et al. [5] likewise observed a lack of methodological recommendations for scale reduction. They argued that an exclusive focus on internal consistency should be avoided and proposed a set of item "quality indices" to help conceptualize the competing issues that influence item retention decisions.

New approaches and new statistical methods, such as item response theory (IRT), are now used for scale shortening [6], but published recommendations integrating these methods are lacking. Another shortfall in current guidelines concerns content validity: retaining content during the shortening process is important, but only methodological guidelines for developing new CMSs address content analysis [7], [8].
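As an illustration of the purely "statistical approach" discussed above, the sketch below, which uses simulated data and is not a procedure endorsed by the guidelines, computes Cronbach's alpha, corrected item-rest correlations, and "alpha if item deleted" values: the kind of internal-consistency screening that Stanton et al. [5] warn should not drive item retention on its own.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def item_rest_correlations(scores: np.ndarray) -> np.ndarray:
    """Correlation of each item with the sum of the remaining items."""
    total = scores.sum(axis=1)
    return np.array(
        [np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]
         for j in range(scores.shape[1])]
    )

# Simulated data: 200 respondents answering 8 Likert-type items (1-5)
# driven by a single latent trait plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
scores = np.clip(np.rint(3 + latent + rng.normal(scale=1.0, size=(200, 8))), 1, 5)

print(f"alpha for the full 8-item scale: {cronbach_alpha(scores):.2f}")
for j, r in enumerate(item_rest_correlations(scores)):
    alpha_if_deleted = cronbach_alpha(np.delete(scores, j, axis=1))
    # Items with a low item-rest correlation whose removal barely lowers (or even
    # raises) alpha are the usual statistical candidates for deletion.
    print(f"item {j + 1}: item-rest r = {r:.2f}, alpha if deleted = {alpha_if_deleted:.2f}")
```

In practice, such statistics would at most flag candidate items; content considerations and validation in an independent sample should decide the final short form, as the guidelines cited above recommend.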

This article aims to describe the methodology currently used to shorten CMSs through a literature review and to compare the findings with those of a previous review, in order to propose updated and structured guidelines for CMS shortening.

Section snippets

Search and identification of articles

Articles reporting on the development of a short form of an existing CMS in the health or psychology domains, published between January 1, 1995, and December 31, 2009, and written in English, were selected from MEDLINE and PsycINFO. The following query was used to search the title, abstract, and keyword fields: ("short form" or "brief form" or "short version" or "brief version") and (questionnaire or scale or instrument) and (development or validation or reduction or shortening or "item reduction

Results

In total, 103 articles met the inclusion criteria (list available in the Appendix at www.jclinepi.com), corresponding to 91 shortening processes. The article selection process is shown in Fig. 1. The number of published articles describing CMS shortening increased over the study period (Fig. 2). Only 17 (19%) of the shortening processes cited at least one of the three articles proposing guidelines for CMS shortening [3], [4], [5].

The main domains covered by the involved CMSs were psychology (n = 

Discussion

The analysis of the literature on CMS shortening points to a persistent lack of rigorous methodology for such shortening. The original CMSs from which short forms were developed were not sufficiently described. Their psychometric properties were described no better after the publication of recommendations in 1997, 2000, and 2002 [3], [4], [5] than before. A shortening process aims to preserve the psychometric properties of a CMS in a short form; thus, a shortening study must be started by

Document the validity of the original CMS and the objective of its shortening

The short form of an existing CMS may benefit from the quality, the validity, and possibly the popularity of the original CMS. However, the original CMS must have well-documented measurement properties.

Although the original CMS must have satisfactory measurement properties, the shortening process may be an opportunity to improve some of these. A short form should definitely not be developed from a CMS with insufficiently documented properties. The evaluation of a CMS includes content,

Conclusions

The methodology of CMS shortening too often lacks rigor, so the resulting short forms may be at risk of poorer validity than the original long forms. The guidelines we propose can help researchers identify the key steps to follow when developing a short-form CMS from an existing one. Following these guidelines to shorten a CMS will not lead to a perfect short-form CMS but will guarantee that the main pitfalls have been avoided. It will also let future users of the short-form CMS examine what choices

Acknowledgment

The authors thank F. Bonnetain, B. Fautrel, and T. Conroy for their careful review of the guidelines.

References (19)

  • J. Coste et al. Methodological approaches to shortening composite measurement scales. J Clin Epidemiol (1997)
  • C.B. Terwee et al. Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol (2007)
  • A. Garratt et al. Quality of life measurement: bibliographic study of patient assessed health outcome measures. BMJ (2002)
  • D.L. Streiner et al. Health measurement scales: a practical guide to their development and use (2003)
  • G.T. Smith et al. On the sins of short-form development. Psychol Assess (2000)
  • J.M. Stanton et al. Issues and strategies for reducing the length of self-report scales. Pers Psychol (2002)
  • M.O. Edelen et al. Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement. Qual Life Res (2007)
  • R. Sartori et al. Quality and quantity in test validity: how can we be sure that psychological tests measure what they have to? Qual Quant (2007)
  • Standards for educational and psychological tests (1985)
There are more references available in the full text version of this article.


Conflict of interest statement: The authors declare no conflict of interest.
