Background
Because there is no gold standard for assessing disease severity and impact in most rheumatic conditions, it is common practice to administer multiple outcome measures to patients. Initially, the severity and impact of most rheumatic conditions were typically evaluated with clinical measures (CMs) [
1,
2] such as laboratory measures of inflammation like the erythrocyte sedimentation rate [
3] and physician-based joint counts [
4,
5]. Since the 1980s, however, rheumatologists have increasingly used patient-reported outcomes (PROs) [
1,
2]. As a result, a wide variety of PROs are currently in use, ranging from single-item visual analogue scales (e.g. pain or general health) to multi-item scales such as the health assessment questionnaire (HAQ) [
6], which measures a patient’s functional status, and the 36-item short form health survey (SF-36), which measures eight dimensions of health-related quality of life [
7].
Statistical methods are essential for the development and evaluation of all outcome measures. The large majority of health outcome measures have been developed using methods from classical test theory (CTT). In recent years, however, the use of statistical methods based on item response theory (IRT) has increased in health status assessment [
8‐
10]. Extensive and detailed descriptions of IRT can be found in the literature [
11‐
14]. In short, IRT is a collection of probabilistic models, describing the relation between a patient’s response to a categorical question/item and the underlying construct being measured by the scale [
11,
15]. IRT supplements CTT methods, because it provides more detailed information on the item level and on the person level. This enables a more thorough evaluation of an instrument’s psychometric characteristics [
15], including its measurement range and measurement precision. The evaluation of the contribution of individual items facilitates the identification of the most relevant, precise, and efficient items for the assessment of the construct being measured by the instrument. This is very useful for the development of new instruments, but also for improving existing instruments and developing alternate or short form versions of existing instruments [
16]. Additionally, IRT methods are particularly suitable for equating different instruments intended to measure the same construct [
17] and for cross-cultural validation purposes [
18]. Finally, IRT provides the basis for developing item banks and patient-tailored computerized adaptive tests (CATs) [
9,
19,
20].
Although IRT appears to be increasingly used within health care research in general, a comprehensive overview of the frequency and characteristics of IRT analyses within the rheumatic field is lacking. The Outcome Measures in Rheumatology (OMERACT) network recently initiated a special interest group aimed at promoting the use of IRT methods in rheumatology [
21]. An overview of the use and application of IRT in rheumatology to date may give insight into future research directions and highlight new possibilities for the improvement of outcome assessment in rheumatic conditions. Therefore, the aim of this study was to systematically review the application of IRT to clinical and patient-reported outcome measures within rheumatology.
Discussion
IRT offers a powerful framework for evaluating existing outcome measures and developing new ones. This is the first study to systematically review the extent to which IRT has been applied to outcome measures in rheumatology. Results showed a marked increase in IRT applications within the rheumatic field from the late 1980s to the present. Even though most research focussed on PROs, IRT also appeared to be useful for application to CMs. Some opportunities for further IRT applications and improvements in the analysis and reporting of IRT studies were also identified.
IRT can be applied for various purposes. First, IRT analysis is useful for the development and evaluation of new measures [
22]. For instance, Helliwell et al. [
32] developed a foot impact scale to assess foot status in RA patients. Rasch modeling was used to facilitate item reduction by selecting items that were free of DIF and fitted model expectations. Whereas CTT methods often discard items at the extremes of the measurement range because too few patients answer them affirmatively, IRT retains these items, since they provide important information at the extremes of the measurement range [
61].
IRT is also suitable for the evaluation of existing (ordinal) outcome measures. For example, an evaluation of an instrument’s response categories can determine whether they perform as intended or whether categories should be collapsed into fewer options or expanded into more options [
22]. Furthermore, it can be evaluated whether the items in the outcome measure form a unidimensional scale as expected or whether item deletion is necessary [
22].
Another favourable feature of IRT is that its models are specified at the item level rather than at the test level, as in CTT [
11]. By evaluating the performance of individual items, alternate or short form versions of existing measures can be developed. For example, Wolfe et al. [
62] developed an alternate version of the HAQ [
6,
63], known as the HAQ-II, specifically targeted at patients with relatively high physical functioning.
Another commonly used feature of modeling at the item level is the robust assessment of DIF, as reflected in the high proportion of studies that performed DIF analyses. Nevertheless, the full potential of modeling at the item level is not yet being used, given the low percentage of studies that evaluated the items’ performance (i.e. measurement precision and local reliability) along the scale.
When the studies focusing on RA patients were compared with those focusing on OA patients, the measurement intentions of the analysed instruments and the applied IRT models were highly comparable. However, a notable difference was found in the main goals of these studies. Whereas the RA studies pursued widely varying main goals, including the development of new instruments, the evaluation of existing instruments, the comparison of different instruments, and cross-cultural validation, the studies on OA patients generally focused on the evaluation of existing instruments only.
There are several IRT applications which have not yet been (frequently) used within rheumatology. One IRT application which appears to be still in its infancy within rheumatology, but which is likely to gain importance in the future, is the development of computerized adaptive tests (CATs) [
2]. When testing by means of a CAT, every patient receives a test which is tailored (adapted) to his or her level on the underlying construct being measured. Consequently, each patient can be administered different sequences and numbers of items, drawn from a large item bank. By applying CATs, tests can be shortened without any loss of measurement precision, reducing measurement burden for both the patient and the rheumatologist [
1,
2,
9‐
11,
16].
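The adaptive logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any CAT from the reviewed studies: it assumes dichotomous 2PL items, selects the next item by maximum information at the current estimate, updates the estimate with a grid-based posterior mean under a standard normal prior, and stops at a target standard error or a maximum test length. The item bank, the stopping values, and the simulated patient are all hypothetical.

```python
import math

def prob(theta, a, b):
    """2PL probability of an affirmative response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info(theta, a, b):
    """Fisher information of a dichotomous 2PL item at theta."""
    p = prob(theta, a, b)
    return a * a * p * (1.0 - p)

GRID = [g / 10.0 for g in range(-40, 41)]  # theta grid from -4 to 4

def eap_estimate(responses, bank):
    """Posterior mean and SD of theta given responses {item_index: 0 or 1}."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalised standard normal prior
        for idx, x in responses.items():
            a, b = bank[idx]
            p = prob(theta, a, b)
            w *= p if x == 1 else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, se_target=0.4, max_items=10):
    """Administer items until the posterior SD reaches se_target or max_items."""
    responses = {}
    theta, se = 0.0, float("inf")
    while len(responses) < min(max_items, len(bank)) and se > se_target:
        remaining = [i for i in range(len(bank)) if i not in responses]
        # Pick the unused item that is most informative at the current estimate.
        nxt = max(remaining, key=lambda i: info(theta, *bank[i]))
        responses[nxt] = answer(nxt)
        theta, se = eap_estimate(responses, bank)
    return theta, se, list(responses)

# Hypothetical bank: 13 items, equal discrimination, difficulties from -3 to 3.
BANK = [(1.5, k / 2.0) for k in range(-6, 7)]

def simulated_patient(true_theta):
    """Deterministic stand-in for a patient: affirms items easier than true_theta."""
    return lambda idx: 1 if true_theta > BANK[idx][1] else 0

theta_hat, se, administered = run_cat(BANK, simulated_patient(1.2))
```

Because item selection depends on the running estimate, two patients at different levels receive different item sequences from the same bank, and the test ends as soon as the chosen precision criterion is met.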
Cross-calibration is another IRT application whose potential advantages have not yet been recognized within rheumatology. In contrast to CTT methods, IRT regresses the item responses on separate item and person parameters [
11]. This means that the definition of item parameters is independent of the sample receiving the test and the definition of person parameters is independent of the test items given. This separation of parameters facilitates the cross-calibration of various outcome measures based on the same underlying construct [
11,
64], making their scores comparable with each other.
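One simple way to see how separate item parameters allow two calibrations to be placed on a common scale is a mean-shift linking under the Rasch model. The sketch below assumes that the two calibrations share a set of common items and that a constant shift is adequate; this is only one of several possible linking designs and is not the method of any specific reviewed study. The item names and difficulty values are hypothetical.

```python
from statistics import mean

def linking_shift(b_ref, b_new, common):
    """Constant that moves the new calibration onto the reference scale.

    b_ref, b_new: dicts mapping item name -> estimated Rasch difficulty
    common: names of the items that appear in both calibrations
    """
    return mean(b_ref[i] for i in common) - mean(b_new[i] for i in common)

def to_reference_scale(b_new, shift):
    """Apply the shift to every difficulty of the new calibration."""
    return {item: b + shift for item, b in b_new.items()}

# Hypothetical difficulty estimates from two separate calibrations.
B_REF = {"walk": -1.0, "dress": 0.0, "climb": 1.0}
B_NEW = {"walk": -0.5, "dress": 0.5, "climb": 1.5, "run": 2.5}

SHIFT = linking_shift(B_REF, B_NEW, ["walk", "dress", "climb"])
LINKED = to_reference_scale(B_NEW, SHIFT)
```

After linking, the item that appears only in the second calibration ("run") has a difficulty expressed on the reference scale, so scores based on either item set refer to the same metric.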
As discussed earlier, it is important to test the assumptions of unidimensionality, local independence, and model appropriateness when analysing data by means of IRT methods. Items which violate one or more of these assumptions should be combined, rephrased, or deleted [
22,
23], since they complicate the interpretation of model outcomes. An encouraging observation was that the majority of the studies tested the assumption of unidimensionality and the appropriateness of the IRT model, although some studies did not report any fit statistics. Although comparisons between unidimensional and multidimensional IRT models provide a much more rigorous test of unidimensionality than factor analyses, such comparisons were not made. Analyses of model fit mainly involved overall fit statistics or item fit statistics, and to a lesser extent the evaluation of person fit. Person fit is also important, however, since deviant response patterns may seriously affect item fit. The removal of patients with such response patterns from the analysis may improve the scale’s internal construct validity significantly [
22]. Most studies, however, did not check the assumption of local independence. The importance of local independence has only more recently been recognized and, consequently, only some of the most recent studies (from the year 2007) did evaluate this assumption. Future studies should continue to pay attention to this assumption, since locally dependent items could cause parameter estimates to be biased, which may lead to wrong decisions concerning item selection when constructing a certain outcome measure [
15].
The results also showed room for improvement in the reporting of the choices made and the rationale for specific decisions. For instance, the applied IRT model is often not specified and, if it is specified, the reasons for selecting that model and the estimation methods are often not clearly explained. This complicates the quality appraisal and replication of the analyses.
Whereas Belvedere and de Morton [
8] examined the application of Rasch analysis only, this study included the whole spectrum of IRT models. A notable finding of this review was that Rasch models dominate within rheumatology, and that two-parameter IRT models were applied in only a few studies. This may be due to the ease of use of a Rasch model and the ease with which its results can be interpreted. However, this advantage of Rasch modeling comes with the strict assumption that every item of the measure is equally discriminative. Whether this assumption is appropriate can be tested by comparing the Rasch model fit with the two-parameter model fit. Since the studies of Pham et al. [
65] and Siemons et al. [
54] are the only studies in which such a comparison was made, this is a point of interest for future studies.
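The comparison between the Rasch model and the two-parameter model can be written out explicitly. The following is a sketch for dichotomous items, assuming both models are fitted to the same data by marginal maximum likelihood so that they are nested; the symbols ($a_i$ for discrimination, $b_i$ for difficulty, $\theta_j$ for the person parameter, $k$ for the number of items) are introduced here for illustration.

```latex
% Two-parameter logistic model for item i and person j
P(X_{ij} = 1 \mid \theta_j) =
  \frac{\exp\{a_i(\theta_j - b_i)\}}{1 + \exp\{a_i(\theta_j - b_i)\}}

% Rasch model: the same expression with equal discrimination for all items
a_1 = a_2 = \dots = a_k

% Likelihood-ratio statistic for the nested comparison
G^2 = -2\left[\ln L_{\mathrm{Rasch}} - \ln L_{\mathrm{2PL}}\right]
  \;\sim\; \chi^2_{k-1} \quad \text{(approximately, if the Rasch model holds)}
```

A large value of $G^2$ relative to the $\chi^2_{k-1}$ reference distribution suggests that the items differ in discrimination and that the equal-discrimination assumption is too strict for the instrument at hand.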
Although IRT is becoming increasingly popular in health status assessment, it is quite complex to understand and is not yet a mainstream technique for most researchers and rheumatologists. To increase common understanding and to improve the interpretation of outcomes resulting from IRT analyses, (bio)statisticians, rheumatologists, and researchers should collaborate closely. Clear guidelines on the quality appraisal of IRT analyses might increase the use and understanding of IRT in rheumatology even further. Currently, there are no clear guidelines available for rating the methodological quality of IRT analyses. Although standardized tools like the COSMIN (COnsensus-based Standards for the selection of health status Measurement INstruments) checklist [
66] can be used for evaluating the methodological quality of studies on measurement properties, this checklist only contains a few questions regarding IRT analyses and is, therefore, more suitable for analyzing the quality of classical test theory analyses. Even though the quality checklist used in this study was based on both expert input and important issues from the literature, it was not exhaustive and, consequently, it might have some limitations. For example, when the sample size was considered, only the absolute number was reported. It was not checked whether the authors also justified the sample size for the analyses they wanted to perform. The variation in sample size between studies might be due to the absence of clear guidelines regarding sample size requirements. It has been argued that even the simplest Rasch analyses require a minimum sample size of 50–100 persons [
15,
23]. However, many issues are involved in determining the right sample size for a certain study, including the model choice, the number of response categories, and the purpose of the study [
15,
23]. These issues should be carefully considered to determine the sample size which is minimally needed to achieve reliable model estimates. Consensus and clear guidelines on quality aspects concerning IRT analyses might guide the choice of an adequate sample size and might also stimulate the development of uniform guidelines for performing and reporting IRT studies, and the development of a checklist for evaluating the quality of the performed and reported IRT analyses.
The formulation of such guidelines will provide a strong foundation for future IRT studies. Tennant et al. already provided such guidelines for performing Rasch analyses [
22]. However, given the large diversity of approaches, models, and software used in the field of IRT, it is difficult to recommend a single set of guidelines for all types of studies, and an expansion or modification of their guidelines might be needed. To obtain sufficient support for these guidelines, it is important to first attempt to reach a more global consensus about recommendations. This article could provide input for such attempts and the COSMIN checklist [
66] can serve as an example of how such an international approach can lead to the development of a consensus-based checklist. Agreement should be reached on the minimum number of assumptions which should be met (e.g. unidimensionality, model fit, and DIF analysis) and on the best ways of testing these assumptions. Additionally, this review showed that IRT methods are rarely applied to evaluate an instrument’s local reliability and measurement precision along the scale of the underlying construct, or to construct item banks and CATs, all unique features of IRT. Therefore, it is recommended that more emphasis be placed on these features in the guidelines and in future studies.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
LS was responsible for the conceptualization of the manuscript. LS and PTK were responsible for the screening and identification of studies and the extraction of relevant data. PTK, ET, CG and MVDL supervised the whole study and the interpretation of the results. All authors critically evaluated the manuscript, contributed to its content, and approved the final version.