Introduction
Individuals suffering from chronic nonspecific musculoskeletal pain (CMP), such as back and neck pain, are often restricted in performing activities of daily living and work [1, 2]. The financial burden of CMP on society arises mainly from indirect costs due to temporary or permanent work disability. Work disability due to CMP may be associated with reduced activity levels and work performance [3, 4]. Functional capacity evaluation (FCE), in addition to self-reported measures, has been recommended for a comprehensive assessment of physical work performance in persons with CMP [5–8].
Functional capacity evaluation employs physical performance tests such as lifting, postural tolerance tests, repetitive movements, and ambulation to assess work-related functioning [9]. Discrepancies between FCE outcomes and the physical workload of a patient may be addressed in rehabilitation to redress this imbalance [10–12]. Moreover, FCEs are used to evaluate the effects of rehabilitation and to determine fitness for work; as such, FCEs may facilitate the return-to-work process or precede case closure [13–17].
To determine physical capacity during the FCE, the patient must perform to his or her maximum level of physical ability. The level of physical effort during FCE is estimated by the evaluator, based on observational criteria applied during material and non-material handling tests [9, 18]. Submaximal effort is assumed when a person stops an FCE test before the criteria indicative of maximal effort are observed. Because clinical decision-making is based on the results of FCE, sound clinimetric properties of the observational criteria used to determine physical effort are required. Acceptable reliability of physical effort determination in FCE tests such as lifting has been reported [19, 20]. However, the reliability of non-material handling tests such as kneeling and forward bending has rarely been studied [21–25]. Moreover, most studies on lifting tests were performed by FCE experts, which limits the generalizability and applicability of the results to less experienced raters [25–27].
The aim of this study was to determine the intra- and inter-rater reliability of physical effort determination of FCE tests in patients with CMP. A second aim was to investigate whether an increase in rater experience would alter the reliability of physical effort determination.
Discussion
When applying cut-off scores of agreement ≥80 % and κ > 0.60, the overall reliability of physical effort determination (PED) and submaximal effort determination (SED) was acceptable for less than half (46 %) of all FCE observations. For SED, reliability was acceptable in the majority (67 %) of the FCE tests; for PED, however, it was acceptable in only 38 % of tests. Inter- and intra-rater reliability varied considerably between the FCE tests. The increase in mean reliability scores from session 1 to session 2 was on average higher for PED than for SED.
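The cut-off logic above (agreement ≥80 % and κ > 0.60) can be illustrated with a short sketch that computes both statistics for a pair of raters; the ratings and category labels below are hypothetical, not study data.

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Proportion of observations on which both raters chose the same category."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(r1)
    po = percent_agreement(r1, r2)                     # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    cats = set(r1) | set(r2)
    pe = sum((c1[c] / n) * (c2[c] / n) for c in cats)  # agreement expected by chance
    return (po - pe) / (1 - pe)

# Hypothetical ratings on a categorical effort scale (illustrative only)
rater_a = ["max", "max", "heavy", "light", "heavy", "max", "light", "heavy", "max", "heavy"]
rater_b = ["max", "heavy", "heavy", "light", "heavy", "max", "light", "max", "max", "heavy"]

po = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"agreement = {po:.2f}, kappa = {kappa:.3f}")
print("acceptable:", po >= 0.80 and kappa > 0.60)
```

For these hypothetical ratings, both cut-offs are met (agreement 0.80, κ ≈ 0.69), so the observation would count as acceptably reliable under the scheme above.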
SED during FCE tests can be reliably detected in the majority of cases. However, the overall results of this study are disappointing, as raters reached the required reliability cut-off values for both PED and SED in less than half of the observations. This finding has clinical relevance for four reasons. First, some FCEs claim to support fitness-for-work determination through an extrapolation of FCE results to job demands [14, 34]. The job demands and their frequencies during a working day (occasional, 1–33 %; frequent, 34–66 %; constant, 67–100 %) are matched to the PED categories “maximum”, “heavy”, and “light to moderate”. Good reliability of PED is needed to enable adequate matching between FCE performance and work demands. Second, FCEs have been reported to describe physical capacity accurately only if a person exerts “maximal” voluntary effort [23, 35]. Good reliability of effort determination is a prerequisite for such a clinical interpretation. Third, FCE reports are used by third parties to inform decisions on insurance claims. Some interpret submaximal physical effort as ‘unmotivated’. Whether this interpretation is valid is beyond the scope of this paper, but it highlights the relevance of the psychometric properties of this determination. Fourth, whether the FCE score represents maximal or submaximal capacity, and the reasons for performing submaximally, are relevant for designing individualized vocational rehabilitation aimed at improving functional capacity.
Compared to three previous reliability studies on material handling tests, our values are clearly lower [22, 23, 26]. Some of these previous studies with high reliability values used two-point scales for the determination of physical effort, which increases the a priori probability of agreement compared to the multi-item scale used in our study. In our study, agreement on the dichotomous scale (submaximal effort determination) was also substantially higher. Moreover, the results show on average an increase in agreement and reliability on both the PED and SED scales when the ratings were administered 10 months apart, indicating a “learning” effect. Our data support the assumption that postural tolerance tests may be difficult to rate using the FCE observational methods, but that experience can substantially improve reliability. The average agreement and Kappa values for the inter-rater reliability of PED increased by 0.40 over the 10-month period. This may be partly attributed to experience. The raters participating in this study used 1-day FCEs for the standard assessment of most in-patients. In addition, they received one-to-one supervision from an FCE expert once a year, and their superiors reviewed each FCE report as part of regular quality control. Based on the observation in this study that experience and basic training increased reliability scores, we suggest that novice raters using the observational criteria be supervised more intensively than in our study. To what extent observational criteria for effort determination can be improved by additional training remains unknown.
The only slight increase in the agreement and reliability of SED might be due to the high scores already obtained in the first observation session. When tests were grouped according to type of task, the reliability of the physical effort determination scale was generally lower when applied to postural tolerance tests, such as overhead working and kneeling, than when used with material handling tests. This is consistent with results from studies reporting on forward bending, standing, and crouching [25, 35, 36]. Moreover, observational criteria seem to be less reliable when applied to ambulation tests such as walking and stair climbing than to material handling tests [25, 36]. However, the results may be influenced by the fact that postural tolerance tests were not part of the regular 1-day FCE used for most in-patients, but were administered only when indicated. Thus, raters gathered more test experience with the observation of material handling tests than with postural tolerance tests. Other possible reasons for the lower reliability of the postural tolerance and ambulation tests could be a ceiling effect due to the predefined maximal time limit of the test, or muscular use at submaximal rates. It is theoretically infeasible to judge maximal effort when only submaximal muscular effort is requested; in the overhead work test, for example, the 5-min duration is not the requested maximum performance but a time limit. The results of this study underscore this problem. We suggest that the observational criteria for physical effort in postural tolerance and ambulation tests need further refinement. To our knowledge, no study has been conducted to determine the validity of observational criteria for postural tolerance and ambulation tests in FCE.
In two videos in which a patient performed the one-handed carrying test, ratings showed low agreement. After rating, we discussed these two videos with the raters and asked them where the difficulty lay. Almost half of the raters responded that these videos were debatable because of the pain behavior of the patient. The maximum performance of a patient is determined by the individual’s ability, motivation, and other psychosocial factors [37, 38]. However, physical effort determination cannot be used interchangeably with the non-organic signs described by Waddell et al., despite some important overlap between the two measurement methods [38]. It has been questioned whether lay persons and health care providers can accurately classify effort during a lifting task performed by actors [39]. In line with our results, this underscores the challenge of determining effort using a categorical rating scale.
Strengths and Weaknesses of the Study
The strengths of the study were that the inter- and intra-rater reliability measures were based on a large sample of raters and on multiple observations of patient videos. Compared to most other studies on the reliability of PED, we included postural tolerance and ambulation tests in addition to the material handling tests. Furthermore, this is to our knowledge the first study on the reliability of observational criteria used in FCE tests based on two ratings taken 10 months apart, excluding the risk of recall bias. We used 18 videos instead of real patients to test the reliability of the observers. The results may therefore only partly reflect an FCE performed live with the patient. One may argue that several clinical parameters, such as respiration, may not have been visible on video, and that the raters did not benefit from three-dimensional vision. Observing videos without sound and communication also differs markedly from a clinical setting. In clinical practice, FCE raters observe the same patient at different levels of effort when performing the same FCE test, which might allow them to compare their ratings with their previous observations. Studies should be performed to analyze whether the availability of such additional information would have changed the results. This study was performed with a sample of four patients; we might therefore not have seen all types of movement patterns of patients with back pain. Because the study was designed to measure the reliability of the raters observing the performance, rather than the reliability of that performance, this may have been adequate. The Kappa statistic has an advantage over percentage of agreement because it corrects for chance [31]. In some tests, high agreement between raters was observed while Kappa values were extremely low; this phenomenon may occur when the variation in row and column totals is low [40]. Furthermore, it is debatable whether the cut-off score of κ > 0.60 for acceptable reliability used in our study is rigorous enough when decisions must be made at the individual patient level [41]. The results should therefore be interpreted accordingly. Category 5, “not classifiable”, was excluded from the analysis for two reasons. First, “not classifiable” relates to a different dimension than the categories related to effort and therefore cannot be analyzed in the effort domain. Second, only a few ratings were “not classifiable”, indicating its minor influence.
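The paradox of high agreement accompanied by an extremely low Kappa, which arises when row and column totals vary little, can be reproduced with a small sketch; the ratings below are hypothetical, not study data.

```python
from collections import Counter

def agreement_and_kappa(r1, r2):
    """Return (observed agreement, Cohen's kappa) for two raters; assumes
    chance agreement pe < 1."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n          # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum((c1[c] / n) * (c2[c] / n) for c in set(r1) | set(r2))
    return po, (po - pe) / (1 - pe)

# Skewed marginals: both raters call almost every performance "maximal",
# and they disagree on the two remaining observations.
rater_a = ["maximal"] * 18 + ["submaximal", "maximal"]
rater_b = ["maximal"] * 18 + ["maximal", "submaximal"]

po, kappa = agreement_and_kappa(rater_a, rater_b)
print(f"agreement = {po:.0%}, kappa = {kappa:.2f}")
```

Here agreement is 90 % while Kappa is slightly negative (≈ −0.05), because near-constant marginals drive the chance-expected agreement close to 1. This is the situation flagged above in which Kappa alone understates rater consistency.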
Future Studies
Although there have been some advances in the study of the reliability of physical effort determination, major gaps remain. For example, what are valid and practical reference standards for determining maximal physical effort during FCE tests? Some experimental studies have used measures of muscle activity such as surface EMG, superimposed electrical stimulation, and lactate concentration, but these lack practicality for clinical use [42, 43]. How should evidence-based reliability cut-off scores be defined that are useful for the various purposes of FCE? Future studies should address these unresolved questions and promote the development of a reliable tool for the determination of physical effort, above all for postural tolerance tests.
Conclusions
The reliability of observing physical effort varied substantially between FCE tests, ranging from unacceptable to good. The dichotomous rating of submaximal effort was more reliable than the categorical rating of physical effort determination. However, with both rating scales, acceptable reliability values were reached on average in only every second observation, which limits their utility for clinical decision-making. Regular education and training may improve the reliability of observational criteria for effort determination. Further research is needed to develop reliable observation scales.
Acknowledgments
The authors thank the physiotherapists of the Department of Work Rehabilitation, Rehaklinik Bellikon who participated in this study. We also thank Doug Gross and Dee Delay for the fruitful discussions on the criteria for physical effort determination. Part of the study was funded by the Swiss Accident Insurance Fund, SUVA (Schweizerische Unfallversicherungsanstalt).