Introduction
Chronic obstructive pulmonary disease (COPD) is one of the major causes of mortality worldwide and is associated with a high level of disability [
1]. COPD is a respiratory disease involving irreversible damage to the lungs and bronchial tubes, and it is characterized by chronic airflow limitation [
2]. It not only causes physiological discomfort but also has a psychosocial influence on individuals. The clinical assessment of COPD often involves measuring lung function parameters (e.g., FEV1) and a patient’s exacerbation level to evaluate disease progression and therapeutic effect [
3]. However, the overall impact of COPD on individuals is multi-faceted and not entirely reflected by these clinical parameters. For this reason, it is now recognized that no single measure can adequately reflect the nature or severity of COPD, and that clinical measures often need to be supplemented by indicators from the patient’s perspective, such as patient-reported outcomes (PROs) or health-related quality of life (HRQOL). To date, evaluation of treatment effect has emphasized improvement in quality of life rather than small gains in survival rate or physiological indicators [
4]. PROs have gradually become an important element and a crucial source of information for monitoring disease condition and assessing the effectiveness of treatment, especially for health problems such as subjective discomfort and psychological distress [
5]. Accordingly, since 2006 the U.S. Food and Drug Administration (FDA) has recommended that objective indicators combined with PROs be considered a more comprehensive form of outcome evaluation [
6]. However, the measurement of PROs relies primarily on questionnaires. Clinicians and researchers are therefore concerned with how well a questionnaire has been developed, so that PROs can be measured accurately with minimal error and integrated into clinical practice to improve the quality of clinical service.
The St. George’s Respiratory Questionnaire (SGRQ) is a widely used PRO tool for assessing disease impact in patients with obstructive airway diseases, such as asthma, COPD and bronchiectasis, and it has been translated and adopted in many countries [
7‐
9]. The SGRQ can provide a psychosocial impact profile of these patients that cannot be identified by lung function tests. Clinically, it has been shown to be a valuable tool in quantifying the impact of chronic obstructive airway diseases on symptoms, functional measures and well-being [
10,
11] and in evaluating the effectiveness of health care [
12].
Despite the demonstrated acceptable reliability and validity of the SGRQ, it has mostly been validated using classical test theory (CTT) procedures. Although the CTT approach has been widely adopted in psychological measurement, it has some recognized shortcomings, such as test and sample dependence [
13]. That is, within CTT a person’s test score may easily vary depending on which test is being administered and, in turn, the difficulty of the same item depends on which sample is being assessed.
In contrast, models based on modern test theory, such as item response theory (IRT), can overcome these potential disadvantages. IRT, also known as latent trait theory, uses a probabilistic model to construct a questionnaire based on the relationship between a person’s response to a question and his or her level on the construct (symbolized by θ) being measured by the scale. This relationship is conditional in that people with higher levels on the underlying construct will have a higher probability of endorsing response categories that are consistent with higher trait levels [
13,
14]. A questionnaire constructed using IRT is superior to one based on traditional CTT because the IRT model takes into account both the subject’s ability and the difficulty of the test questions. As a result, the estimates of item location (difficulty) and person measures (ability) are independent of the respondents’ backgrounds and of the particular items in a test [
14].
Additionally, CTT and IRT differ in that CTT gives equal weight to all items even though, in reality, items differ in their degree of difficulty. For instance, CTT gives the same one point to both mountain climbing and walking on a flat surface, although these two activities clearly differ in difficulty. The appropriateness of the unweighted total score as a way to characterize a person therefore cannot be taken for granted. IRT, on the other hand, weights each item according to the difficulty of the question [
14]. That is, IRT allows responses (raw scores) to different items to represent different levels of severity. Under an IRT model, an individual’s response to any given item thus reveals a level of ability on the trait being measured.
Several studies have highlighted the advantages of Item Response Theory (IRT) over Classical Test Theory (CTT) methods [
15,
16]. The Rasch model is one of the family of IRT-based models. It uses a logistic function that relates the respondent’s underlying trait (or ability) and the item difficulty to the probability of endorsing an item [
17]. Rasch models have been applied in many fields, such as health science, social psychology and education [
15,
18].
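For reference, the logistic relationship described above is commonly written, for a dichotomous item, in the form below. The symbols b_i (difficulty of item i) and X_{ni} (response of person n to item i) are notation assumed here for illustration; only θ appears in the text, and polytomous items would require an extended form of the model.

```latex
\[
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)},
\qquad
\ln\frac{P(X_{ni} = 1)}{1 - P(X_{ni} = 1)} = \theta_n - b_i
\]
```

The second expression shows that the log-odds of endorsement depend only on the difference between person ability and item difficulty.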
In addition, the Rasch model has been increasingly applied to identify measurement issues not easily detected by CTT [
15,
16,
18]. The Rasch model measures a single latent trait and provides sufficient statistics for estimating the parameters of item difficulty and person ability [
17]. Because of these sufficient statistics, the total raw scores obtained by counting the observed responses can be summed and used to construct an item hierarchy, which shows how person ability and item difficulty interact to determine the probability of endorsing an item along the construct continuum being measured. Furthermore, the Rasch model provides a proper method for converting ordinal raw scores into interval measures (logits). Through this nonlinear transformation, the Rasch model can place person ability and item difficulty jointly on the same interval scale [
14] to allow for meaningful comparisons.
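The nonlinearity of the logit scale can be illustrated with a minimal Python sketch. This is not the estimation procedure used in the study: the function names are hypothetical, and the first function is only the simple log-odds of a proportion, shown to make the point that equal raw-score steps do not correspond to equal logit steps.

```python
import math


def raw_score_to_logit(raw_score, max_score):
    """Log-odds of the proportion of the maximum score obtained.

    Illustrative only: a full Rasch analysis estimates measures
    iteratively rather than applying this formula directly.
    """
    if not 0 < raw_score < max_score:
        raise ValueError("extreme scores (0 or max) have no finite logit")
    proportion = raw_score / max_score
    return math.log(proportion / (1 - proportion))


def rasch_probability(theta, difficulty):
    """Probability of endorsing a dichotomous item under the Rasch model.

    theta and difficulty are both expressed in logits, which is why
    persons and items can be compared on one scale.
    """
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


# One raw-score point near the middle of a 10-point scale versus near the top.
step_middle = raw_score_to_logit(6, 10) - raw_score_to_logit(5, 10)
step_top = raw_score_to_logit(9, 10) - raw_score_to_logit(8, 10)
```

In this sketch the step near the top of the scale spans more logits than the step in the middle, and a person whose ability equals the item difficulty has a probability of 0.5 of endorsing the item.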
Although CTT-based methods have generally supported the construct validity and internal consistency reliability of the SGRQ, such methods cannot evaluate whether items function equivalently for different individuals. Lack of measurement equivalence may lead to incorrect estimates of effects in research and decision making [
19]. One approach to understanding scale equivalence across different groups or conditions is to use item response theory (IRT)-based models [
19,
20]. Differential item functioning (DIF) is defined as the situation in which subjects from different groups who have the same level of the attribute differ in their probability of endorsing an item [
21]. The purpose of DIF analysis is to determine whether item difficulty differs when different groups are measured. Scales containing DIF items have reduced validity for between-group comparisons because their scores are influenced by attributes other than those intended [
19]. To date, most attention has been given to investigations of DIF associated with age [
20,
22], sex [
20,
22], culture [
23] or disease [
24‐
26], but few studies have examined DIF related to disease severity.
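One common way to screen for DIF in a Rasch framework is to compare item difficulty estimates calibrated separately in each group. The Python sketch below illustrates this idea only; the item names and difficulty values are hypothetical, and the 0.5-logit threshold is an assumed rule of thumb, not the criterion used in this study.

```python
def dif_contrast(difficulty_group_a, difficulty_group_b):
    """Difference in item difficulty (logits) between two groups, per item."""
    return {
        item: difficulty_group_a[item] - difficulty_group_b[item]
        for item in difficulty_group_a
        if item in difficulty_group_b
    }


def flag_dif(contrasts, threshold=0.5):
    """Items whose absolute contrast reaches the (assumed) threshold."""
    return sorted(item for item, c in contrasts.items() if abs(c) >= threshold)


# Hypothetical difficulty estimates for two subgroups (not study data).
younger = {"item_1": -1.20, "item_2": 0.10, "item_3": 1.40}
older = {"item_1": -0.40, "item_2": 0.05, "item_3": 1.30}

contrasts = dif_contrast(younger, older)
flagged = flag_dif(contrasts)
```

Here only item_1 is flagged, because its difficulty differs by 0.8 logits between the two hypothetical groups. In practice a statistical test would usually accompany the size of the contrast.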
This study aimed to apply the Rasch model to rigorously evaluate the psychometric properties of the SGRQ in COPD patients at both the item and scale levels, in terms of dimensionality analysis and item fit evaluation. Specifically, item gaps along the construct continuum and the match between item difficulty and person ability (or trait) were examined to explore possible scale modification. Finally, DIF analysis was performed across age groups and levels of COPD severity.
Discussion
One advantage of Rasch model analysis is that it places person abilities and item difficulties jointly on the same interval scale, which can serve as guidance for revising or refining a questionnaire or its items. This study applied the Rasch model to rigorously examine the psychometric properties of the SGRQ in patients with COPD at both the domain and item levels. The results showed that most items fit well within their respective domains, supporting unidimensionality. These findings were similar to those reported by Meguro et al. [
36]. Moreover, each domain of the SGRQ showed good person reliability and separation, except the Symptom domain, which is consistent with the results of a CTT analysis in a previous study [
37] and of an IRT analysis [
36]. As the sample of this study covered a wide range of disease severity (from the ‘at risk’ to the ‘severe’ group), the patients presented a greater variety of symptoms, leading to low person reliability and separation in this domain. Unexpectedly, most items in the Symptom domain exhibited disordered thresholds, a finding similar to that of Meguro et al.’s study [
36]. One possible explanation for this phenomenon is that symptoms varied considerably among patients due to the nature of COPD [
37]. Furthermore, the wording of response options might lead to disordered thresholds [
18,
36]. These authors suggested that the scaling properties of the ordered response options for the Symptom domain could be improved by combining two or more ambiguous categories [
18,
36]. We first revised our scaling according to their suggested modification; however, the disordered thresholds of the Symptom domain were not completely resolved. We then collapsed some of the response options from 5 response choices to 3 or 4, as described below, which resolved the disordered thresholds in our data. For items S_a1 to S_a4, we combined “a few days a month” and “several days a week” into one category (denoted as “several days”) to form 4 response choices: “not at all”, “only with chest infection”, “several days” and “most days”. For item S_a5, “how many severe or very unpleasant attacks have you had”, the 5 response choices were combined into 3: “no attacks”, “1 or 2 attacks” and “3 or more attacks”. For item S_a7, “how many good days have you had”, the 5 response choices were combined into 3: “no days”, “some or a few days” and “every day”. The thresholds in the Symptom domain after revision are shown in the Table and Figure (see
Appendix).
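The recoding described above for items S_a1 to S_a4 can be sketched in Python as follows. The category labels are taken from the text; the numeric codes, their ordering and the function name are assumptions made for illustration and are not the scoring used in the study.

```python
# Mapping from the 5 original response choices to 4 collapsed categories.
# Numeric codes are hypothetical; only the merge of the two middle
# categories into "several days" comes from the text.
S_A1_TO_S_A4_RECODE = {
    "not at all": 0,
    "only with chest infection": 1,
    "a few days a month": 2,   # merged into "several days"
    "several days a week": 2,  # merged into "several days"
    "most days": 3,
}

COLLAPSED_LABELS = [
    "not at all",
    "only with chest infection",
    "several days",
    "most days",
]


def collapse_response(response):
    """Return the collapsed category code for one original response label."""
    return S_A1_TO_S_A4_RECODE[response.strip().lower()]


original = ["Most days", "a few days a month", "several days a week", "not at all"]
recoded = [collapse_response(r) for r in original]
recoded_labels = [COLLAPSED_LABELS[code] for code in recoded]
```

After recoding, the two merged responses share one code, so threshold estimates are computed over 4 ordered categories instead of 5.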
When a scale has a clear gradient of difficulty across a set of items, the Rasch model, compared with CTT, can display its psychometric properties, such as item hierarchy, item redundancy and gaps in the scale, in a more structured way [
38]. The results showed that item difficulty in the Activity domain of the SGRQ followed a remarkably clear gradient from low-exertion activities (e.g., sitting or lying) to high-exertion activities (e.g., running). In the Activity domain, there are two groups of items: “what activities make you feel breathless (group of A_c)” and “how activities may be affected by your breathing (group of A_g)”. An analysis of the estimates of item difficulty in these two sets showed that some items may be redundant (Table
2 and Fig.
2). For example, “A_c2 Getting washed or dressed” was similar to “A_g1 Take a long time to get washed or dressed”, and “A_c6 Walking up hills” was similar to “A_g7 Walk up hills, carry things up stairs or play golf”. Consequently, some items could be considered as possible candidates for item removal in order to improve tool efficiency. Moreover, there were apparent gaps between some items (Table
2 and Fig.
2), especially between items A_c1 and A_c2, as well as items A_g5 and A_c7. These gaps indicated that some new items may be necessary to fill them and cover the continuum in order to better differentiate the respondents’ abilities.
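The idea of locating gaps along the item hierarchy can be sketched in Python as below. The item names and difficulty values are hypothetical (they are not the estimates in Table 2), and the 0.5-logit minimum gap is an assumed cut-off chosen for illustration.

```python
def find_gaps(item_difficulties, min_gap=0.5):
    """Return adjacent item pairs whose difficulties differ by at least min_gap.

    item_difficulties maps item name -> difficulty in logits.
    Each result is (easier_item, harder_item, gap_in_logits).
    """
    ordered = sorted(item_difficulties.items(), key=lambda pair: pair[1])
    gaps = []
    for (easy_item, easy), (hard_item, hard) in zip(ordered, ordered[1:]):
        if hard - easy >= min_gap:
            gaps.append((easy_item, hard_item, round(hard - easy, 2)))
    return gaps


# Hypothetical item difficulties in logits (not study data).
example = {"item_a": -2.0, "item_b": -0.9, "item_c": -0.7, "item_d": 0.6}
gaps = find_gaps(example)
```

In this example, gaps are reported between item_a and item_b and between item_c and item_d, while item_b and item_c lie close together. Items that sit very close together in this way are also the kind of pairs one might review for redundancy.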
The Rasch model places person measures and item difficulties on the same metric, allowing the match between item difficulty and person ability to be identified. Our results showed that the targeting and the ceiling effect were high and the percentage of scale coverage was low in the Impact domain compared with the other domains. This showed that the items of the Impact domain were too easy to discriminate among respondents with high ability (such as those at stages 0 and I). In the Impact domain, most items calibrated at the difficult end were related to the impact on daily life caused by activities requiring more effort. However, most COPD patients in the early stages are generally not frail, which caused the high ceiling effect in our results. When a revision of the SGRQ is considered, it is imperative to increase the difficulty of some items and to add more items related to psychosocial adjustment, such as sense of control, to the Impact domain in order to reflect the psychosocial impact of the early stage of the illness. This would better discriminate the impact of COPD at different stages.
Establishing measurement equivalence is important because lack of measurement equivalence may lead to incorrect estimates of effects in research [
19]. DIF was examined to identify whether the item parameters were invariant across subgroups. The results of this study showed that many SGRQ items presented age-related or severity-related DIF, indicating that they were somewhat unstable across groups with different characteristics. In terms of age-related DIF, age had little effect on the Symptom and Impact domains of the SGRQ, but there were many DIF items in the Activity domain, which implies that age-related differences in underlying physical function may cause items to function differently to some degree. Likewise, many items in the Activity and Impact domains of the SGRQ showed severity-related DIF, indicating that the impact of the disease differs across disease stages in COPD patients.
Although the proportion of DIF was higher in the Activity domain of the SGRQ, DIF occurred mostly at the easiest and hardest ends of the scale. Further investigation would likely find a similar item hierarchy across the different subgroups. The greater amount of DIF in the Activity domain may be caused by the obvious difficulty gradient of the underlying physical function. Furthermore, DIF analysis is affected by the response options: items with multiple response options can differentiate respondents better. The items in the Activity domain of the SGRQ, however, have dichotomous response options, so they were prone to presenting DIF. Severity-related DIF was more common than age-related DIF. This may be explained by the fact that the SGRQ was developed for specific diseases, a design that may make DIF more apparent. Although disease severity and age produced some DIF, the existence of DIF within a health assessment can be considered a sign of a sensitive measurement that differentiates the impact on quality of life of disease severity or age across subgroups [
26,
39]. Although the results showed a high proportion of DIF, this does not mean that the questionnaire is not applicable; rather, it suggests that these items may be suitable for developing a computerized adaptive test. With such a test, a questionnaire developer can use a few items to obtain almost the same accuracy as that obtained from the original questionnaire with more items.
There are a few limitations in this study. First, this study was cross-sectional, so responsiveness to changes at different time points could not be assessed. Second, the study population included only male COPD outpatients, who were predominantly in GOLD stages II and III. In Taiwan, smoking was prevalent (approximately 54 %, including ex-smokers) among males over 50 years of age in 2001, compared with only about 4 % among females in the same age group [
40]. There are relatively few female patients with COPD compared with males in the clinical setting. Thus, we focused our analysis on male COPD patients. Consequently, the results of this study may not be applicable to female, hospitalized, or more severe patients with COPD. Furthermore, results were obtained only from patients whose conditions were stable enough for them to complete the questionnaires and tolerate the interview; thus, the final sample might have excluded patients with severe conditions. The domain scores might, therefore, have been better than they would have been had these patients been included in this study.
Competing interests
The authors declare that they have no competing financial interests.
Authors’ contributions
CHC and WML acquisition and interpretation of data, critical revision of manuscript for important intellectual content; CL and TCW conception, analysis and interpretation of data, drafting the manuscript; WML also designed the study’s analytic strategy; LWH direction of its implementation, including supervision of the field activities, quality assurance and control; TCW and YJC implementation of study and organization of data. All authors read and approved the final manuscript.