Background
Patients' assessment of physical function (PF) is a core outcome domain of disease status in rheumatoid arthritis (RA) [1, 2]. Physical function scales are used in the majority of clinical trials to assess the effectiveness of treatment and have become established instruments for assessing health outcomes in clinical practice and observational studies as well [3–5].
A number of efforts have been undertaken to compare the variety of disease-specific and generic PF scales that have been validated for use in patients with RA over the years [6–11]. However, previous efforts have been limited to descriptive reviews of well-known instruments or non-systematic selections of the available literature on their measurement properties. To date, no comprehensive studies have systematically evaluated the evidence for the quality of the measurement properties of all PF scales that are validated for patients with RA. Furthermore, until recently there was no comprehensive conceptual framework available to define physical function in RA and with which to judge the relevance and comprehensiveness of the items of PF scales. Therefore, content validity could only be evaluated indirectly in previous efforts, for example by evaluating whether patients were included in the item selection process. The International Classification of Functioning, Disability and Health (ICF) now provides a comprehensive frame of reference, which allows the relevance and comprehensiveness of the items of PF scales to be examined directly by linking them to their respective ICF codes. Within the ICF classification, the 'activity' dimension constitutes the individual's perspective on functioning and is defined as 'difficulties an individual may have in executing activities' [12]. This dimension consists of the chapters domestic life, self-care and mobility, which respectively coincide with (instrumental) activities of daily living (IADL and ADL) and mobility, terms that are traditionally used in the literature on physical functioning [13].
The most relevant ICF categories for a particular condition are summarized in a core set. The ICF Core Set for RA is a list of the ICF categories that represent the typical functional problems experienced by patients with RA [14]. The Outcome Measures in Rheumatology (OMERACT) group accepts the ICF core set for RA as the best currently available external standard of functioning and recognizes its utility for assessing the content validity of existing measurement instruments [15].
The aim of this study was to systematically review the content validity and other measurement properties of all PF scales that have been validated for use in patients with RA. To this end, we linked the content of the scales to the ICF and appraised the currently available evidence for the quality of their measurement properties, in order to offer recommendations for the use of PF scales for various purposes and settings.
Discussion
This study systematically reviewed the literature on measurement properties of PF scales that are validated for use in patients with RA. The results of this review provide a comprehensive assessment of the available evidence for the utility of existing scales for patients with RA and may inform the appropriate selection of self-reported PF scales for various purposes in clinical practice and research.
PROs are commonly classified as disease-specific or generic. In this systematic review, a pragmatic classification was employed based on the intended target population of the included questionnaires. However, it should be noted that although developed for use in arthritic populations, PF scales that were classified as disease-specific do not necessarily have content that is exclusively relevant in these populations. In fact, some scales, such as the HAQ, which is often referred to as a disease-specific measure, assess physical disability in general and do not focus on specific disease-associated impairments. As a result, the HAQ has been used across a wide range of general and clinical populations [3].
Of the disease-specific scales that were rated positively for both aspects of validity, the HAQ received the most favorable overall evaluation. Owing to its longstanding and extensive use in RA, the measurement properties of the HAQ have been exhaustively studied. This review showed that it has predominantly favorable measurement properties that have been studied with adequate methodological rigor. The HAQ met the standards we set for responsiveness and its test-retest reliability was found to be very high in a sample of stable patients, indicating that the scale is appropriate for evaluative purposes (i.e., to track physical functioning over time), both at the group level and at the individual level. However, one important limitation of the HAQ is that multiple studies noted a considerable group of patients obtaining the best possible score. Therefore, it may not be the most appropriate scale for use in patient populations with relatively good functional capacity, since it cannot measure improvement in a substantial proportion of patients. Both the MDHAQ (14 ADL) and the HAQ-II were rated favorably for all aspects of validity as well and were specifically developed to address the ceiling effects of the original HAQ [40, 41]. Both scales indeed demonstrated substantially smaller ceiling effects in direct comparison with the original HAQ, indicating that these scales might be more appropriate than the original HAQ for use in relatively well functioning groups. Another advantage of these scales is that they contain only 14 and 10 items, respectively, making them more feasible for use in clinical practice or when administering multiple PROs simultaneously. However, the measurement properties of HAQ-II and MDHAQ (14-ADL) have been less extensively studied. In particular, before recommending their use in evaluative studies, the responsiveness of these scales should be compared to that of the HAQ and their reproducibility in stable patients should be established. The revised CSHQ-RA and AIMS2 were also rated favorably for validity, but no information is available about their distributional properties, and the evidence testifying to the responsiveness of the revised CSHQ-RA is limited to methods that rely on statistical significance. Further research is required before a comprehensive evaluation of the quality of the revised CSHQ-RA is possible. The AIMS2 might be the most comprehensive disease-specific questionnaire. Its items were linked to 31 relevant ICF categories and issues such as fine hand use and arm use and domestic life are addressed in more detail than in the HAQ, which was also noted by Stucki et al. [14]. However, with its 28 items it is also the lengthiest questionnaire and much of the work on its measurement properties is outdated. Further psychometric testing is therefore desirable. Finally, the short AIMS was also rated favorably for all aspects of validity, but it contains scales that lack internal consistency, perhaps because some subscales consist of only 2 items or because the response format is often yes/no. Therefore, we would not recommend it for use or for further testing.
The CSHQ-RA and ROAD are among the most recently developed disease-specific scales and the methodology of the work on their measurement properties conforms to the rigorous methodological standards of COSMIN, enhancing the interpretability of their psychometric quality in this review. Regrettably, however, these scales contain irrelevant content. Therefore, their use cannot be recommended for the assessment of PF, despite generally favorable evaluations for their other measurement properties.
Although it is well known that measurement properties are context-specific attributes that can differ across populations, previous studies have paid no attention to verifying the content validity of the included generic scales for use in RA patient groups. Therefore, by linking their content to the comprehensive ICF core set for RA, this review provides the first assessment of the content validity of included generic scales for assessing physical functioning of patients with RA.
The SF-36 PF scale is probably the most frequently used generic scale in patients with RA. However, although all of its items are relevant, it measures predominantly mobility and has no content relevant to the assessment of domestic life, which was already recognized as an important shortcoming by its developers [42]. Another limitation of the scale is that it has been associated with substantial floor effects (i.e., patients scoring the worst possible score). Most of its measurement properties have been studied in patients with RA, but studies of more rigorous methodological quality are desirable. For instance, no studies were found reporting on the dimensionality of the original version and its reproducibility has been studied in small patient groups (n < 25) only. On the other hand, the SF-36 PF-10 is the only generic PF scale that was rated positively for responsiveness.
Except for the MHIQ, the other health profiles (SIP and NHP) demonstrated limited content coverage as well. Because health profiles intend to cover all major areas of health, it might be expected that content coverage within their components is less comprehensive. The GARS, on the other hand, is a dedicated PF instrument, which is reflected in the finding that its content covers the overall PF domain more comprehensively. Therefore, the GARS may be well suited when the primary outcome of interest is physical function rather than overall health. However, as with most generic scales in this review, its measurement properties are currently poorly understood. More research is required to establish its performance in longitudinal settings before its use can be recommended.
With the inclusion of items of the participation chapters of the ICF, the WHODAS-II covers a wider spectrum of disability than just physical function. The same applies to the BI and SIP. These measures include multiple items belonging to ICF categories E120 (Products and technology for personal use in daily living), E30 (support and relationships) and B5253 and B6202 (fecal/urinary incontinence). Therefore, they might be better thought of as measures of dependence rather than physical function per se. This interpretation is further strengthened by the observation that the SIP and BI were evaluated negatively for construct validity. In particular, both scales correlated only moderately with other PF instruments.
With respect to rating the measurement properties of the included scales, it was notable that in one-third of the studies that assessed reliability, samples of fewer than 25 patients were used. Although observed ICCs were generally well above the commonly accepted cut-off point of 0.70, it is important that reliability is studied in sufficiently large samples. Simulation studies have shown that even when a value as high as 0.80 is observed, a sample size of 60 patients is required to reliably conclude that ICC > 0.70 in the population the sample was drawn from [43, 44]. Furthermore, for most scales, information on reproducibility of scores was limited to reports on test-retest reliability. For evaluative purposes, especially when monitoring functional status of individual patients, it is informative to report on the absolute agreement of test-retest scores for patients with unchanged functional status as well. Representative values of the limits of agreement (LOA) or standard error of measurement (SEM) can serve as benchmarks for distinguishing real change in functional status from changes due to random measurement error [17]. Finally, minimally important change scores have not been widely reported and should be addressed in future research, as they greatly enhance the interpretability of change scores. Instruments should be administered longitudinally before and after treatment known to improve PF, and health transition questions should be included as external criteria of change [26]. A point worth mentioning is that this systematic review is limited to traditional static questionnaires.
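As an illustration of how such benchmarks for real change can be derived from test-retest data, the following minimal sketch computes the standard error of measurement, the smallest detectable change (SDC) at the individual level, and Bland-Altman limits of agreement. It assumes the commonly used formulas SEM = SD × √(1 − ICC) and SDC = 1.96 × √2 × SEM, which are not taken from the studies reviewed here. All function names and input values are hypothetical.

```python
import math
import statistics


def sem_from_icc(sd_pooled, icc):
    """Standard error of measurement, assuming SEM = SD * sqrt(1 - ICC)."""
    return sd_pooled * math.sqrt(1.0 - icc)


def smallest_detectable_change(sem):
    """Individual-level SDC, assuming SDC = 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2.0) * sem


def limits_of_agreement(test, retest):
    """Bland-Altman 95% limits of agreement for paired test-retest scores."""
    diffs = [b - a for a, b in zip(test, retest)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD of the differences
    return mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff


# Hypothetical example: a 0-3 disability score with SD 0.6 and ICC 0.91
# in stable patients (illustrative numbers, not results of this review).
sem = sem_from_icc(0.6, 0.91)
sdc = smallest_detectable_change(sem)
print(f"SEM = {sem:.2f}, SDC = {sdc:.2f}")

# Hypothetical paired scores of four stable patients.
low, high = limits_of_agreement([1.0, 1.5, 2.0, 0.5], [1.1, 1.4, 2.2, 0.5])
print(f"LOA = ({low:.2f}, {high:.2f})")
```

Under these assumptions, an individual change score smaller than the SDC, or one that falls within the LOA, cannot be distinguished from random measurement error.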
Recently, item response theory (IRT) based item banking has been receiving increasing attention in PRO assessment. Of special relevance to PF assessment in RA populations is the Patient-Reported Outcomes Measurement Information System (PROMIS) initiative. PROMIS is an NIH initiative aimed at revising instruments in many domains including PF, using IRT calibrations and computerized adaptive testing (CAT) [45]. The PROMIS PF item bank contains 124 calibrated items, and CAT algorithms allow for the adaptive selection of the most relevant item for a particular patient in terms of relative difficulty, based on previous answers given by that patient [46]. The main advantage of using these modern psychometric approaches is that the use of extensive item banks potentially eliminates floor and ceiling effects, while the CAT algorithm ensures that patients only need to answer a minimum number of questions [47, 48]. Short forms can also be developed from the PROMIS item banks. For example, the PROMIS HAQ has been developed from the PROMIS PF item bank [46]. Unfortunately, none of the PROMIS studies met the inclusion criterion for this review of at least 50% RA patients. However, the PROMIS PF item bank is likely to become a prominent measurement system in RA, and future research should study its psychometric properties specifically in RA populations.
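The adaptive item selection described above can be sketched as follows. This is a simplified illustration and not the PROMIS algorithm: it assumes dichotomous items under a two-parameter logistic (2PL) IRT model, selects the next item by maximum Fisher information at the current ability estimate, and updates that estimate with a grid-based expected a posteriori (EAP) method using a standard normal prior. The item names and parameters are hypothetical.

```python
import math


def prob_2pl(theta, a, b):
    """Probability of endorsing an item (2PL: discrimination a, difficulty b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability level theta."""
    p = prob_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(theta, bank, administered):
    """Return the not-yet-administered item that is most informative at theta."""
    candidates = [name for name in bank if name not in administered]
    return max(candidates, key=lambda name: item_information(theta, *bank[name]))


def estimate_theta(responses, bank):
    """Grid-based EAP estimate of theta with a standard normal prior."""
    grid = [g / 10.0 for g in range(-40, 41)]
    weights = []
    for t in grid:
        w = math.exp(-0.5 * t * t)  # unnormalized prior density
        for name, answer in responses.items():
            p = prob_2pl(t, *bank[name])
            w *= p if answer == 1 else 1.0 - p
        weights.append(w)
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)


# Hypothetical item bank: name -> (discrimination, difficulty).
bank = {
    "walk_one_block": (1.5, -1.0),
    "climb_stairs": (1.5, 0.0),
    "run_one_mile": (1.5, 1.5),
}

theta = 0.0  # start at the prior mean
first = select_next_item(theta, bank, administered=set())
responses = {first: 1}  # patient reports being able to do the activity
theta = estimate_theta(responses, bank)
second = select_next_item(theta, bank, administered=set(responses))
print(first, round(theta, 2), second)
```

Because each new item is chosen close to the patient's current estimated level of functioning, a well-functioning patient is quickly routed to more demanding items, which is how adaptive testing can reduce ceiling effects while keeping the number of questions small.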
There are some limitations to our study that deserve attention. First, we used the ICF as an external standard to evaluate the content validity of the included scales, as have a number of previous similar systematic reviews [49, 50]. The ICF aims to provide a common language for functional status assessment in clinical practice and research. However, most included scales were developed before the ICF was available. Moreover, concerns have been voiced regarding the exhaustiveness of the ICF as a comprehensive classification of disability [51], and several validation studies of the ICF core set for RA have found some omissions from the perspective of patients and physicians that future research should address [52, 53]. Therefore, some caution must be taken when interpreting the results of the analysis of content validity. Still, the ICF is frequently recommended for assessing the content validity of health status instruments [15], and 95% of all PF items included in this systematic review could be linked to at least one ICF code. Moreover, the items that were linked to ICF categories other than mobility, self-care or domestic life were all clearly irrelevant to the assessment of PF. Our results therefore seem to indicate that the ICF is a useful taxonomic tool for assessing the relevance of disability items, such as those included in this systematic review. Second, for most scales, the work on their measurement properties was predominantly or exclusively performed with the original language versions. However, the majority of the studies on the measurement properties of the AIMS2 and AIMS2-SF concerned translated versions. Users of translated versions are therefore advised to check whether a validation study is available for their language version, rather than depending solely on the results of this review. For several translations, individual items were omitted, changed, or added in order to adapt a questionnaire for use in a different culture. Since in some instances up to 10% of items were changed, it is unclear to what degree measurement properties are generalizable across versions and cultures [54, 55].
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
MOV was responsible for the search strategy and conceptualisation of the manuscript. MOV and PTK reviewed the included papers. PTK, ET and MVDL supervised the study and the interpretation of the results. All authors critically reviewed, contributed to and approved the final manuscript.