Reliability
The test–retest reliability for 11 of the 12 WAD FCE test items was good to excellent. Healthy volunteers [15], patients with chronic LBP [14] and patients with osteoarthritis of the hip and/or knee [16] showed smaller variability in FCE tests than was found for the WAD FCE. The following reasons may explain these differing results. In healthy volunteers, who are less affected by pain, less variability in the test results is to be expected. The functional capacity of patients with chronic low back pain is unlikely to change between two sessions because they are in a relatively stable, i.e. chronic, phase of the illness. In the study of osteoarthritis patients [16], the retest was conducted 1 day after the first test session; the lower variability may therefore be explained by recall bias due to the limited time between the two test sessions. As expected for WAD patients suffering from pain in the neck region, larger LoA scores were observed in the tests affecting the upper body regions, i.e. “overhead work” and “lifting waist to overhead”.
Lifting from waist to overhead had a moderate ICC (0.66), with significantly different values recorded between the first and second session. This result was in part due to one participant who refused to lift any weight overhead in the first session, but lifted 15 kg in the second session. A post hoc sensitivity analysis was performed by excluding that participant from the analysis. The ICC value then increased to 0.80, which indicates good reliability.
Regarding the overhead work test, with an ICC of 0.83, the larger LoA ratios may also be partly explained by the longer duration of the test (5 min) compared to the maximum of 90 s in the material handling tests. The longer a test lasts, the greater the chance that the patient will perform differently in another test session. For example, in the study of Brouwer et al. [14], the reliability of a 15 min overhead work test, expressed as an ICC, was 0.36. To prevent ceiling effects, other researchers have modified the overhead work test by having the patients wear two cuff-weights of 1 kg around their forearms [36]. This procedure results in a reduction of endurance in the overhead work test in healthy participants, and an ICC of 0.90 [17].
The results of the hand grip force test (in position 2 of the Jamar hand dynamometer) showed good to excellent reliability, similar to the findings of previous studies on hand grip force [37], underlining its clinical use in the evaluation of grip strength in several musculoskeletal disorders. In the repetitive reaching test, ICC values were slightly higher in WAD patients than in healthy participants, while the LoA ranged from −21.5 to 32.0 in WAD patients and from −9.0 to 12.6 in healthy participants [17]. Test results of the 3 min step test and the 50 m walking test did not change significantly between the two sessions compared to the material handling tests. It is very unlikely that endurance and gait speed would improve in the time between the two sessions.
Our participants were a sample of patients with sub-acute WAD, whose health status was still subject to possible change (improvement). The time interval between the two sessions therefore had to be long enough to avoid fatigue, learning or memory effects, but short enough to make a change in health status unlikely. We therefore chose a time interval of 7 days. This interval was shorter than in previous reliability studies, which had time intervals of 10–21 days [14, 17, 38].
Clinically, the measurement error of the test under investigation lies within the 95 % LoA. This means that, at the individual level, a patient’s performance can be considered to have changed when the change exceeds the LoA. For example, in “lifting floor to waist”, a patient’s performance can be considered improved if it increased by more than 6.7 kg.
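The individual-level interpretation of the LoA can be illustrated with a minimal sketch. It assumes the standard Bland-Altman definition (mean difference ± 1.96 × SD of the paired differences); the function names and the numbers used in the usage note are hypothetical and are not data from this study.

```python
from statistics import mean, stdev


def limits_of_agreement(session1, session2):
    """Return (bias, lower, upper) for the 95 % limits of agreement.

    Assumes the Bland-Altman definition: bias +/- 1.96 * SD of the
    paired differences (second session minus first session).
    """
    diffs = [b - a for a, b in zip(session1, session2)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)  # sample SD (n - 1 denominator)
    return bias, bias - half_width, bias + half_width


def exceeds_loa(change, lower, upper):
    """True if an individual's change lies outside the limits of agreement."""
    return change < lower or change > upper
```

With hypothetical lifting results of 10, 20, 30 and 40 kg in the first session and 12, 18, 33 and 41 kg in the second, `limits_of_agreement` gives a bias of 1.0 kg and limits of roughly −3.2 and 5.2 kg. Under this sketch, a later change of 6 kg in one of these patients would count as real change, whereas a change of 4 kg would not.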
Large limits of agreement in health outcome measures are common in pain patients [33, 39, 40]. As already stated, there are no established cut-off points for LoA [41]. However, Keller et al. [42], who calculated the LoA for the Astrand bicycle test and other back strength tests in LBP patients, judged a test with an LoA of ≥42 % as unreliable. Based on this arbitrary cut-off value, 2 out of the 12 tests of the WAD FCE would be classified as unreliable. This large within-patient variance may be attributed to measurement and random errors of the test procedure, evaluator inconsistencies, and patient behavior being influenced by motivation or pain. As hypothesized by others [14, 43], but not tested in this study, we argue that a large part of the variance can be attributed to variation within the patients.
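A percentage cut-off of this kind can be applied as in the following sketch. It assumes that the LoA percentage is 1.96 × SD of the paired differences divided by the grand mean of both sessions; that definition is an assumption for illustration, since the text above does not state how the percentage was derived. The function names and data are hypothetical.

```python
from statistics import mean, stdev

CUTOFF_PERCENT = 42.0  # arbitrary cut-off value quoted in the text


def loa_percent(session1, session2):
    """LoA half-width as a percentage of the grand mean (assumed definition)."""
    diffs = [b - a for a, b in zip(session1, session2)]
    grand_mean = mean(list(session1) + list(session2))
    return 100.0 * 1.96 * stdev(diffs) / grand_mean


def is_unreliable(session1, session2, cutoff=CUTOFF_PERCENT):
    """Flag a test whose LoA percentage reaches or exceeds the cut-off."""
    return loa_percent(session1, session2) >= cutoff
```

For hypothetical scores of 10, 20, 30 and 40 in the first session and 12, 18, 33 and 41 in the second, the LoA percentage is about 16.6 %, so the test would not be flagged under this cut-off.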
Safety
In a Delphi survey of FCE experts, safety was defined as: “a situation that, given the known characteristics of the person, the procedure should not be expected to lead to injury” [12]. We controlled for safety by using self-report measures such as the NRS with a diary questionnaire, the PRQ, and measurements taken by the physiotherapist (e.g. heart rate, observation criteria). Based on our PRQ results, as reported in Fig. 1, we conclude that the WAD FCE temporarily increased pain to a similar extent as FCE did in healthy volunteers [24] and in patients with LBP [21]. As in both of those studies, symptoms in WAD patients decreased within a week. No safety problems were encountered, and heart rate increased only moderately, with only one patient reaching the 85 % heart rate limit in the WAD FCE tests. Of the 71 eligible patients, 4 refused further participation because of a temporary pain increase directly after the first FCE session. Neither these patients nor any other participant filed a formal complaint, and no serious adverse effects were reported. We therefore believe that safety was not compromised.
Limitations and Strengths of the Study
A limitation of this study was that only 45 % of the 71 eligible participants were willing to participate in the second session. The main reason was lack of time (most had already returned to work, others were on holiday or lived a long distance away). The same phenomenon was found in an FCE test–retest study by Brouwer et al. [14], where approximately 100 patients were eligible during 1 year, but only 30 patients were willing to participate. In most instances, the reason for not participating was that testing would take too much time, which is similar to the Brouwer et al. study. It is unknown how non-participants would have influenced the reliability of the WAD FCE tests. As learning effects influence test–retest reliability [44, 45], we did not inform participants of the detailed test results, and minimized the memory effect by maintaining a sufficiently long time interval between test occasions. Additionally, all test protocols from the first session were collected immediately after the test procedure by an independent person who was not involved in the testing procedure. Test protocols remained inaccessible to the testers involved. Results of paired t-tests between the two test occasions showed a general trend towards slightly increased performance on the second occasion. This is in line with test results of healthy volunteers, who on average scored higher in the second test session [15, 17]. Although we did not expect test effects such as increased strength and mobility after the first testing session, other effects, such as increased self-efficacy or reassurance, may have occurred, creating a consistent change within participants. Such a systematic effect will not necessarily affect reliability coefficients [44].
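The point that a consistent shift need not change a reliability coefficient can be sketched as follows. The example uses the Pearson correlation as a simple stand-in for a consistency-type coefficient; this is an assumption for illustration, not the ICC model used in this study, and an absolute-agreement coefficient would be affected by such a shift. All data and names are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between paired scores of two sessions."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_difference(x, y):
    """Mean paired difference (second minus first), the quantity a paired t-test examines."""
    return mean(b - a for a, b in zip(x, y))


# Hypothetical scores: every participant gains a further 2 units at retest.
session1 = [10.0, 20.0, 30.0, 40.0]
retest = [12.0, 18.0, 33.0, 41.0]
shifted_retest = [value + 2.0 for value in retest]
```

Here `mean_difference` grows by exactly 2 units for the shifted retest, while `pearson_r` is the same for both retests apart from floating-point rounding.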
In our study, 30 % of the participants were non-native Swiss patients, which is a slight overrepresentation compared to the general Swiss population, of which 23 % are non-native citizens [46]. This is in contrast to previous FCE reliability studies [14, 16, 38], in which mainly native citizens participated. Results of interventions may vary considerably between native and non-native patients [47], but to our knowledge, this has never been studied in a setting similar to ours (performance testing, reliability, agreement, safety). We therefore think that the results, although taken from a small study sample, might support the utility of the WAD FCE in non-native patients.
Secondly, our testers were selected from a sample of 24 physiotherapists whose clinical experience covered the wide range (from very low to extensive) encountered in daily clinical practice. In contrast to previous reliability studies, in which very experienced clinicians performed the FCE tests [6, 16, 37], our sample of assessors covered a wider range of working experience and age. This might strengthen the generalizability of the results of this study. Our study was conducted in a “real world” environment where patients with delayed recovery were sent to the WAD FCE, in contrast to some previous FCE reliability studies that were based on video analysis [43, 48].
Participants were referred by physicians and case managers from the German-speaking part of Switzerland; to what extent this referral resulted in a population different from other WAD populations is unknown. Because neither the clinical characteristics nor the majority of test results differed between non-participants and participants, we assume that the selection procedure did not introduce bias relevant to the outcomes of this study (i.e. reliability, agreement, safety). Since the majority of WAD patients suffer from WAD Grade 1 and 2 [49], the results of this study may be applied to patients with WAD Grade 1 and 2 who are still suffering from WAD 9–12 weeks after injury and are not working due to WAD.