Background
Physical examination tests of the shoulder (PETS) aim to reproduce specific symptoms and signs as an aid for clinicians in diagnosing the painful shoulder. However, more than 180 different single PETS have been described in the literature [1], making the choice of which tests to use challenging. In addition, confusion arises because different names are used for the same test (e.g. Supraspinatus test = Empty can test = Jobe’s test [2–4]). Also, different criteria of positivity have been used for the same test (e.g. ‘weakness’ [2] and/or ‘pain’ [3] as criterion of positivity for the supraspinatus test). Last but not least, several of the single PETS have been used for several different shoulder diagnoses (e.g. Yergason’s test, originally published as a test of biceps pathology [5], is also used as a test of glenoid labral pathology [6]). At present, therefore, there is a need to clarify the basis for an evidence-based approach [7].
Meta-analytic evidence on the validity of PETS in primary care settings is scarce because the primary studies are of insufficient quality [8]. However, several meta-analyses on PETS have been published in the specialty care setting. In one of these, a meta-analysis limited to PETS for subacromial impingement syndrome [9], the authors concluded that the diagnostic validity of the ‘Hawkins’, ‘Supraspinatus’, ‘Drop arm’ and ‘Lift-off’ tests was limited by low pooled likelihood ratios (LR), but that the ‘Lift-off’ test could be used to rule in a subscapularis tear. A more recent meta-analysis on rotator cuff tear recommended the ‘External rotation lag sign’ and ‘Painful arc’ tests based on findings of the highest pooled estimate of positive likelihood ratio and the narrowest confidence interval [10]. However, there was no overlap between the two meta-analyses regarding the studies finally retained for statistical pooling. Two additional meta-analyses have been published on PETS for superior labral anterior posterior (SLAP) lesions. In the first, the ‘Active compression’, ‘Anterior slide’, ‘Crank’ and ‘Speed’ tests were included in the meta-analysis and assessed by estimated receiver operating characteristic curves [11]. ‘Anterior slide’ was concluded to perform worse than the other three tests, but there were otherwise no significant differences [11]. The second meta-analysis on SLAP lesions [12] assessed the Compression-rotation, Crank, Relocation, Speed and Yergason tests by pooled positive likelihood ratios and concluded that only the Yergason test showed statistically significant validity, based on a positive likelihood ratio of 2.29 [1.21, 4.33]. In the update [13] of the only previous meta-analysis that has analyzed single PETS for all shoulder diagnoses (not limited to a specific diagnosis) [14], the conclusion was that no single PETS was pathognomonic for any specific diagnosis and that the performance of PETS in general was low.
Given that the previous meta-analyses included different PETS and came to different conclusions, there is still a lack of robust evidence guiding clinicians on which tests to use in clinical practice, and there is a need to assess whether they are useful at all. The previous meta-analyses [9–14] all aimed to pool data for single PETS, assuming they were based on different biomechanical rationales. Only one of them included PETS for all shoulder diagnoses. It is therefore reasonable to suggest a different approach to meta-analysis of PETS.
In this systematic review we initially include PETS for all shoulder diagnoses commonly seen in specialty shoulder clinics, but limit the meta-analysis to high-quality primary studies with a low risk of bias. Furthermore, we will try to pool different PETS that are based on similar biomechanical rationales in order to evaluate the validity of PETS in general.
This meta-analysis aims to use the diagnostic odds ratio (DOR) [15] to evaluate how much PETS shift overall probability and to rank the test performance of single PETS in order to aid the clinician’s choice of which tests to use.
Discussion
This meta-analysis found statistical evidence for diagnostic validity of PETS when different tests for SLAP lesions were pooled (DOR = 1.38). Among the single PETS included in the meta-analysis, the highest DOR (9.24) overall was obtained for the Supraspinatus test in diagnosing any full thickness rotator cuff tear. The Compression-Rotation test was ranked highest of the SLAP tests (DOR 6.36) and the Hawkins test was ranked highest for subacromial impingement syndrome (DOR 2.86) (see Table 2 for details). However, the high risk of bias in primary studies, and the fact that single PETS were performed and interpreted in diverging ways, limited the number of single PETS available for meta-analysis.
What constitutes superior clinical performance of a clinical test? In line with previous findings [13], no single PETS in this meta-analysis showed superior diagnostic validity when pooled test performance was assessed. An ideal test should have the ability to discriminate between subjects with and without the condition in question, i.e. a concurrent high sensitivity and specificity is sought. LR and DOR both convey a measurement for this concurrency (LR+ = sensitivity/(1 - specificity); LR- = (1 - sensitivity)/specificity; and DOR = LR+/LR-), of which DOR is the most sensitive single indicator of test performance [15]. For instance, when sensitivity and specificity both rise above 0.91, LR+ rises above 10 and DOR rises above 100. When reaching perfect test performance, DOR rises to infinity. Nevertheless, LR may be more intuitive to the clinician when assessing clinical performance. According to Jaeschke et al. [43], LRs >10 (LR+) or <0.1 (LR-) are needed to generate clinically conclusive changes in probability, and moderate shifts are generated by an LR+ of 5–10 or an LR- of 0.1–0.2.
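The relationship between these measures can be checked numerically. The following is a minimal sketch of the formulas given above; the function name `diagnostic_ratios` is chosen for illustration only, and inputs are restricted to values strictly between 0 and 1 so that no division by zero occurs.

```python
def diagnostic_ratios(sensitivity, specificity):
    """Return (LR+, LR-, DOR) for a given sensitivity and specificity.

    Illustrative helper, not taken from the review itself.
    """
    if not (0.0 < sensitivity < 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    lr_pos = sensitivity / (1.0 - specificity)   # LR+ = sensitivity / (1 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity   # LR- = (1 - sensitivity) / specificity
    dor = lr_pos / lr_neg                        # DOR = LR+ / LR-
    return lr_pos, lr_neg, dor


# Sensitivity and specificity of 0.91 put LR+ just above 10 and DOR just above 100.
lr_pos, lr_neg, dor = diagnostic_ratios(0.91, 0.91)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}, DOR = {dor:.1f}")
```

As sensitivity and specificity approach 1, LR- approaches 0 and the DOR grows without bound, which is why this sketch rejects the boundary values rather than returning infinity.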
When Walton et al. [12] recommended the Yergason test for SLAP lesions, this was based on a pooled LR+ of 2.29. We found a similar LR+ (2.50) for the Yergason test and a slightly higher LR+ (3.91) for the Compression-Rotation test. However, when ranked by DOR, the Yergason test ranked second to the Compression-Rotation test in our results (Table 2). None of the pooled results for single PETS yielded an LR+ above the range of 2–5, which represents a small shift in probability [43].
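To illustrate what a small shift in probability means in practice, a likelihood ratio can be applied to a pre-test probability by converting to odds, multiplying, and converting back. The sketch below uses the LR+ values quoted above; the pre-test probability of 0.30 is a purely hypothetical value chosen for illustration and is not taken from the included studies.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre-test probability must lie strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood ratio cannot be negative")
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


PRE_TEST = 0.30  # hypothetical pre-test probability, for illustration only

examples = [
    ("Yergason, pooled LR+ in [12]", 2.29),
    ("Yergason, this meta-analysis", 2.50),
    ("Compression-Rotation, this meta-analysis", 3.91),
    ("Threshold for a conclusive shift [43]", 10.0),
]
for label, lr_pos in examples:
    probability = post_test_probability(PRE_TEST, lr_pos)
    print(f"{label}: LR+ {lr_pos:.2f} -> post-test probability {probability:.2f}")
```

Under this assumed pre-test probability, the gap between the post-test probabilities for the observed LR+ values and for an LR+ of 10 shows why values in the 2–5 range are regarded as producing only small shifts.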
The original study of the validity of a single PETS tends to report much better performance than later, less biased attempts to replicate the results. Despite the high sensitivity and specificity reported in the first study on Biceps load II [30], outlier characteristics led to its exclusion from our meta-analysis (Fig. 3b). This decision is supported by previous reports about extensive bias in original studies and is in line with the exclusion of the original study on the Active Compression Test in a previous meta-analysis [13].
The forest plot (Fig. 3) visualizes the variation in the estimated performance of presumably different PETS. The estimated performance tends to vary more between studies than between the different tests, with a possible exception for the anterior slide test, which was also found inferior to other SLAP tests in a previous meta-analysis [11]. Most PETS aimed at detecting SLAP lesions are designed to manipulate the superior labrum by stressing the glenohumeral joint, often in combination with pulling on the biceps tendon (e.g. the Yergason test or O’Brien test). This could be one of the reasons that the performances of different tests vary relatively little, but it cannot explain why the general validity of PETS is poor. However, the pathoanatomical/biomechanical rationale on which most PETS are based has recently been debated. For example, in subacromial impingement syndrome, the rationale for PETS (e.g. Hawkins and Neer’s sign tests) is that the greater tuberosity is rotated up underneath the acromion to force pinching of the bursa and supraspinatus tendon to reproduce impingement pain. The evidence for this postulated biomechanical explanation for the pain elicited is lacking [44]. Moreover, the fact that the interplay between genetics and psychological factors predicts shoulder pain in experimental and postoperative settings [45] also challenges the idea of a sole biomechanical explanation of shoulder pain.
In some of the previous meta-analyses of PETS, hierarchical statistical modeling has been used to estimate receiver operating characteristic curves [9, 13]. No optimal curves for any single PETS have been documented, apart from one possible exception for the Lift-off test, though there was great uncertainty in the estimated curve. Hierarchical and bivariate random effects modeling were also attempted in our review but were not found feasible due to the low number of articles with acceptable risk of bias included for each single PETS. As heterogeneity was not significant, a fixed effect model was used.
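For readers unfamiliar with fixed effect pooling, one common approach is inverse-variance weighting of log DORs, with Cochran's Q as a heterogeneity statistic. The sketch below is an illustration under that assumption only: the text does not state which fixed effect estimator was used in this review, the function names are hypothetical, and the 2 × 2 tables are invented numbers rather than data from the included studies.

```python
import math


def log_dor(tp, fp, fn, tn):
    """Log diagnostic odds ratio and its approximate variance for one 2 x 2 table."""
    cells = [float(tp), float(fp), float(fn), float(tn)]
    if min(cells) == 0.0:
        # Continuity correction so that empty cells do not cause division by zero.
        cells = [c + 0.5 for c in cells]
    tp, fp, fn, tn = cells
    estimate = math.log((tp * tn) / (fp * fn))
    variance = 1.0 / tp + 1.0 / fp + 1.0 / fn + 1.0 / tn
    return estimate, variance


def pool_fixed_effect(tables):
    """Inverse-variance fixed effect pooling of DORs from (tp, fp, fn, tn) tuples."""
    estimates = [log_dor(*table) for table in tables]
    weights = [1.0 / variance for _, variance in estimates]
    total_weight = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, estimates)) / total_weight
    standard_error = math.sqrt(1.0 / total_weight)
    # Cochran's Q: weighted squared deviations of study estimates from the pooled one.
    q = sum(w * (est - pooled) ** 2 for w, (est, _) in zip(weights, estimates))
    return {
        "dor": math.exp(pooled),
        "ci_low": math.exp(pooled - 1.96 * standard_error),
        "ci_high": math.exp(pooled + 1.96 * standard_error),
        "q": q,
        "df": len(tables) - 1,
    }


# Invented 2 x 2 tables (tp, fp, fn, tn), for illustration only.
hypothetical_tables = [(30, 12, 18, 40), (22, 9, 15, 35), (41, 20, 25, 60)]
result = pool_fixed_effect(hypothetical_tables)
print(f"Pooled DOR {result['dor']:.2f} "
      f"[{result['ci_low']:.2f}, {result['ci_high']:.2f}], "
      f"Q = {result['q']:.2f} on {result['df']} df")
```

A Q value that is small relative to its degrees of freedom is consistent with low heterogeneity, which is the condition under which a fixed effect model is usually considered defensible.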
Despite the meticulous procedure to ensure high-quality input with an acceptable risk of bias, 9 of the 20 studies identified as eligible could not be included in the meta-analysis. In some, 2 × 2 tables could not be reliably reconstructed because of substantial errors, such as test performance reported in the text of the results section that differed from that reported in the tables [24], or labels of several tables that had been switched [28]. Unfortunately, some of these results have been included in previous systematic reviews [13].
Due to the low quality of primary studies and strict selection criteria, we were only able to pool data for PETS within three shoulder diagnoses (SLAP lesions, subacromial impingement syndrome and, for different degrees of rotator cuff tears, only the supraspinatus test). Since gold standard reference tests have not been established for all shoulder diagnoses (e.g. multidirectional instability [46]), the accuracy study design itself may also present a challenge for the complete review of PETS, as the validity of some PETS cannot be compared to a gold standard reference test. This may partially explain why no single PETS for multidirectional instability, adhesive capsulitis or other glenohumeral pathologies could be included in this meta-analysis. However, these and other shoulder diagnoses should still be assessed by the clinician as part of the general clinical examination.
The lack of uniform diagnostic labeling used in randomized controlled trials has led Schellingerhout et al. [7] to argue for abolishing diagnostic labels in shoulder pain patients altogether. Hence, there is a need for a new approach in future research on the validity of PETS and shoulder diagnoses. The GRADE initiative [47] suggests that validity of different diagnostic subgrouping strategies should be evaluated in a randomized design providing direct comparison of effects on patient-important outcomes (e.g. pain and shoulder function) for different diagnostic strategies, rather than the indirect evidence provided by the accuracy design. We therefore suggest that future research on the validity of PETS consider using such a randomized design.
Limitations and strengths
This study adhered to state-of-the-art methodology for systematic reviews and diagnostic meta-analysis. A broad scope without limitation to any specific shoulder diagnosis was chosen to strengthen the potential clinical applicability of results. In the meta-analysis, a clear description of inclusion criteria was made mandatory for primary studies to ensure that applicability in other clinical settings can be assessed for all studies included. The chosen QUADAS cutoff in this study was in line with that used in several previous reviews [14, 48], and particularly strong selection criteria were used for the meta-analysis to ensure inclusion of only high-quality primary studies with a low risk of bias. However, with strong selection criteria, there is a risk that relevant primary studies were excluded from the meta-analysis and that this may have biased our conclusions. In addition, the application of a QUADAS cutoff score has been advised against by its developers [49], and our choice may have induced a selection bias of primary studies. Also, due to the small number of primary studies available for pooling, hierarchical or bivariate random effects modeling was not feasible. However, since heterogeneity was low, a fixed effects approach could be used. A revised edition of the original QUADAS tool has been published [50]. Implementation was not possible in this review as QUADAS scoring had already started with the original tool. This was a meta-analysis of single PETS, but in clinical practice a combination of tests is commonly used. Several of the included primary studies reported diagnostic performance when different tests were combined [3, 26, 34, 35, 37]. However, as test combinations differ, meaningful statistical pooling was not feasible, and assessment of test combinations was beyond the specific scope of this meta-analysis. Another important limitation regarding the conclusions and recommendations of this meta-analysis is the designated context of specialist care with a high prevalence of shoulder pathology and co-morbidity. Care should be taken to assess applicability of results to any specific clinical context. To enable clinicians to assess transferability of primary research findings to their own specific spectrum of patients, we only included studies where inclusion criteria had been clearly described. The raw data extracted from the included primary studies have been provided for clinicians’ own scrutiny (Additional file 5).
Acknowledgements
The authors wish to extend their gratitude to Research Librarian Solveig Isabel Taylor (University Library, NTNU) for designing and executing the electronic database searches, to Kari Skinningsrud for help with preparation of figures, tables and the manuscript, and to Dr. Ulrich Schattel who facilitated this project through continuous support and contributions in discussions with SG.