Background
The individual and societal burden of musculoskeletal injuries, in particular of bone fractures, remains underestimated. Proximal humeral fractures (PHF) are among the leading causes of functional impairment in patients after trauma, resulting in limitations in basic, instrumental and advanced activities of daily living. PHFs and wrist fractures are recognized as the most common fractures of the upper extremities, accounting for more than 20% of hospital admissions caused by a fracture [1]. In patients over 40 years of age, the proportion of PHFs increases to 76% [2]. In Germany, the incidence of PHFs has risen since 2000 from 178 to 246 per 100,000 inhabitants per year [3]. In addition, an analysis by Bauer and colleagues showed that one third of the patients are still integrated into the work process [4]. Due to demographic change, a further increase in the number of PHFs is expected [5, 6]. This will lead to a significant increase in PHFs requiring operative or non-operative treatment as well as post-trauma hospitalization and rehabilitation.
To date, there is no robust evidence-based consensus on rehabilitation after PHF regarding standardisation of content, duration, intensity or frequency [7–9]. One essential requirement for controlled studies on surgical and rehabilitation interventions is the availability of objective, reliable and valid assessments. If possible, these assessments should be carried out blinded to treatment allocation. To assess functional capacity and task performance after PHF, at least two types of measures are required: patient-reported outcomes using questionnaires to assess activities of daily living, and a supervised clinical-based assessment to measure the functional capacity of the patients [10, 11]. The Disabilities of the Arm, Shoulder and Hand (DASH) questionnaire is the most commonly used questionnaire for assessing activities of daily living after shoulder and arm injuries [10]. A clinically administered assessment of the functional capacity, including the quality of movement, of PHF patients is the Wolf-Motor-Function-Test-Orthopaedic (WMFT-O), which has previously been assessed regarding retest reliability, inter-rater reliability and internal consistency [12]. A further property required of an outcome measure is sensitivity to change, understood as the ability to capture changes occurring during a treatment or observation period. This property is particularly important because positive change over a given period is the classic therapeutic goal [13]. Beyond capturing functional change over time with sufficient sensitivity to change [14–16], clinical-based assessment tools also need high responsiveness in order to decide whether a change over time is clinically meaningful [17, 18]. No studies have examined the sensitivity to change of the WMFT or the WMFT-O, which shows the need to also consider the change over time in the functional capacity and quality of movement ratings of the WMFT-O.
The aims of this study were (1) to test the inter-rater reliability of the videotaped WMFT-O, (2) to describe the correlation between the functional capacity assessed by the WMFT-O and the activities of daily living from a patient perspective assessed by the DASH questionnaire, and (3) to describe the sensitivity to change and the responsiveness of the WMFT-O and the DASH in a group of patients with PHFs.
Discussion
Our study of patients with fractures of the proximal humerus demonstrated high sensitivity to change and good responsiveness of the orthopaedic Wolf-Motor-Function-Test, indicating its usefulness as a functional capacity assessment tool in operative and rehabilitation studies.
This study also found moderate inter-rater reliability for the videotaped version of the WMFT-O (Table 2). According to Landis and Koch [36], the calculated Fleiss’ Kappa values for the functional capacity and quality of movement can be interpreted as fair to substantial agreement. The observed agreement was weaker than in a previous clinical study using a non-videotaped assessment of the WMFT-O in similar patients [12]. Another publication investigating the reliability of the neurologically based WMFT-N version [37] also reported higher inter-rater agreement. The lower agreement of the videotaped measurement could have several reasons. Possibly, the training session for the video raters was insufficient. A further explanation could be an inadequate positioning of the video camera; a different camera position might lead to better results. The performance of some tasks was not easy to identify on the videos (e.g. task 16, “Turn key in lock frontal”: Fleiss’ Kappa for functional capacity = 0.33 and for quality of movement = 0.27; Table 2), which resulted in more diverse ratings between the raters. Moreover, an inadequate description of the correct end position in task 1 (“Forearm to table lateral”: Fleiss’ Kappa for functional capacity = 0.41 and for quality of movement = 0.30; Table 2) led to different opinions about whether the task had been fulfilled. A possible solution could be to reduce the number of prescribed tasks to improve rater agreement. Tasks with low inter-rater agreement could potentially be omitted and a future short version of the WMFT-O could be provided. According to Landis and Koch [36], an inter-rater reliability of 0.01 to 0.20 is considered slight agreement and 0.21 to 0.40 fair agreement. If these values were used as a basis to decide which items of the WMFT-O could be deleted in a short version, tasks 1 and 16 would be deleted due to the low inter-rater reliability of the functional capacity and quality of movement, as well as tasks 2ab, 12, 13 and 14 due to the low inter-rater reliability of the quality of movement (Table 2).
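The Landis and Koch bands referred to above can be written down as a small lookup. The following Python sketch is illustrative only: the function name `landis_koch_label` is hypothetical, the bands above 0.40 are taken from the commonly cited Landis and Koch scale rather than from this text, and the four Kappa values are the ones quoted above for tasks 16 and 1.

```python
def landis_koch_label(kappa: float) -> str:
    """Map a Kappa value to a Landis and Koch agreement category.

    Bands: < 0 poor, up to 0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate,
    0.61-0.80 substantial, 0.81-1.00 almost perfect.
    Assumption: a value of exactly 0 is grouped with "slight".
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Fleiss' Kappa values quoted in the text for tasks 16 and 1
reported = {
    "task 16, functional capacity": 0.33,
    "task 16, quality of movement": 0.27,
    "task 1, functional capacity": 0.41,
    "task 1, quality of movement": 0.30,
}
labels = {name: landis_koch_label(value) for name, value in reported.items()}

for name, label in labels.items():
    print(f"{name}: {label}")
```

Applied to these values, three of the four ratings fall into the fair band, while the functional capacity rating of task 1 (0.41) lies just above it.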
A third aspect is the use of different statistical measures. Inter-rater agreement between two raters, as calculated by Oberle and colleagues in 2018 [12], should be determined by weighted Cohen’s Kappa (κw) statistics [38]. For three or more raters, Fleiss’ Kappa is recommended as the method of choice. Fleiss’ Kappa assumes that the examiners were randomly selected from a group of available examiners. Cohen’s Kappa, on the other hand, assumes that examiners have been specifically selected and trained. Therefore, the probability of chance agreement is estimated differently in Fleiss’ Kappa and Cohen’s Kappa. In some cases, Fleiss’ Kappa may produce lower values even if the agreement is actually high, as described before [39]. That in turn may be a possible explanation why the inter-rater reliability values of the video-based WMFT-O calculated by Fleiss’ Kappa (Table 2) were lower than the inter-rater reliability values of the clinically administered WMFT-O reported by Oberle and colleagues [12] and calculated by Cohen’s Kappa.
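To make the statistic discussed above concrete, the following Python sketch computes Fleiss’ Kappa from its standard textbook definition. It is a minimal illustration, not the analysis code of this study: the function name `fleiss_kappa` and the ratings are hypothetical, and the calculation treats the rating categories as unordered (unweighted).

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' Kappa for a list of subjects, each rated by the same number of raters.

    `ratings` is a list of lists: one inner list per subject (e.g. per patient
    video), holding one category label per rater.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    counts = [Counter(row) for row in ratings]
    categories = sorted({label for row in ratings for label in row})

    # Observed agreement: proportion of agreeing rater pairs, averaged over subjects
    per_subject = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Chance agreement: based on the overall share of each category
    shares = [
        sum(c[cat] for c in counts) / (n_subjects * n_raters) for cat in categories
    ]
    p_chance = sum(share ** 2 for share in shares)

    if p_chance == 1.0:
        raise ValueError("Kappa is undefined when only one category is used")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical scores from three raters for four videotaped performances
example = [
    [5, 5, 5],
    [4, 4, 4],
    [5, 5, 4],
    [4, 4, 5],
]
kappa = fleiss_kappa(example)
print(round(kappa, 3))
```

In this made-up example, two of the four performances are rated identically by all raters, yet the chance correction reduces the coefficient to one third, which illustrates how Kappa values can appear low despite substantial raw agreement.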
We found a very strong correlation between the WMFT-O clinical baseline rating and the video baseline ratings for the functional capacity and a strong correlation for the quality of movement according to the standards of Evans [40].
To assess treatment effects of interventions for musculoskeletal conditions, functional capacity and personal activity need to be evaluated. Therefore, the WMFT-O has to be augmented by other methods to evaluate levels of disability, activity and participation. A widely applied method is the DASH. The correlation between activity levels and functional capacity is often weaker than anticipated. The correlation between the clinical-based measures (WMFT-O clinical rating of the functional capacity and the quality of movement) and the patient-reported questionnaire (DASH) at baseline was indeed weak [40]. One aspect is that patients do not return to their activity levels due to psychological problems such as insufficient self-efficacy. Other aspects are methodological problems. The DASH is not explicitly designed for the affected arm. This means that for tasks that are mostly carried out with the dominant hand (e.g. turning a key in a lock), restrictions in the non-dominant arm, shoulder and hand are not necessarily captured by the DASH. They are captured only if the affected hand is also the dominant hand. This misjudgement could be avoided if care is taken that, when answering the questions of the DASH, the assessment of the restriction always relates to the performance of the activity with the affected shoulder, arm or hand. If it is not possible to carry out the activity of daily living with the affected shoulder, arm or hand, the patient must imagine the execution of the activity and the possible restrictions as well as possible and then answer the question.
One study by Wu and colleagues [41] developed the streamlined WMFT, which includes the performance rating of 6 timed tasks for neurological patients. They found a low effect size for the streamlined WMFT and the original WMFT [41]. In comparison to that study, the WMFT-O had a large effect size for both the functional capacity and the quality of movement and can thus be regarded as sensitive to change over time according to the published standards [29, 32]. This result was confirmed by the absolute values of the standardized response mean, which were higher than the standardized effect sizes for both the functional capacity and the quality of movement. This is a relevant finding for clinical use, as it provides a more objective and sensitive measure for assessing functional capacity and quality of movement of the upper extremities in orthopaedic rehabilitation and may improve current assessments, which are mostly subjective. Compared to the WMFT-O, the sensitivity to change of the DASH was lower. These findings indicate that the WMFT-O is a more sensitive outcome measure for assessing functional change over time through rehabilitation in patients with PHF. MacDermid and colleagues [42] and Westphal and colleagues [11] determined the sensitivity to change of the DASH for patients after wrist fractures. They found higher values for the effect size of the DASH in patients with wrist fractures after 0–3 months. After 3–12 months, the values were observed to be somewhat lower [11]. This indicates that the standardized effect sizes and the standardized response mean depend on the timing of the baseline assessment and the reassessment. In the first 12 weeks after baseline, a large treatment effect can be expected. The following three quarters of the year might lead to further improvements, but the effect sizes will be smaller. In our study, the baseline assessment was conducted approximately one month after surgery. In order to find out whether the WMFT-O can also be used to assess long-term therapy results, a future study should include longer therapy intervals and assess upper extremity functional capacity with the WMFT-O after three months. A similar relation was found for the DASH questionnaire. In conclusion, both the WMFT-O and the DASH are responsive and complementary assessment instruments for measuring functional change in patients with PHF.
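The two indices of sensitivity to change discussed above can be illustrated with their common textbook definitions: the standardized effect size divides the mean change by the standard deviation of the baseline scores, and the standardized response mean divides it by the standard deviation of the change scores. The Python sketch below assumes these definitions; the function names and the scores are hypothetical and are not data from this study.

```python
from statistics import mean, stdev


def standardized_effect_size(baseline, follow_up):
    """Mean change divided by the standard deviation of the baseline scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the standard deviation of the change scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Hypothetical scores of five patients (higher = better function).
# For instruments where lower scores are better, the sign is negative
# and the absolute values are compared.
baseline_scores = [40, 50, 60, 45, 55]
follow_up_scores = [55, 62, 70, 58, 71]

es = standardized_effect_size(baseline_scores, follow_up_scores)
srm = standardized_response_mean(baseline_scores, follow_up_scores)
print(f"effect size: {es:.2f}, standardized response mean: {srm:.2f}")
```

In this made-up data set all patients improve by a similar amount, so the change scores vary less than the baseline scores and the standardized response mean exceeds the standardized effect size.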