Selection of physical examination tests
We identified physical examination tests for shoulder pathology through a systematic review of the literature, which yielded 74 tests. We used a modified Delphi process to determine which of these 74 tests to include in our study. We administered an online survey, using Survey Monkey (©2005 SurveyMonkey.com), to the five participating surgeons, all with expertise in shoulder physical examination and surgery, who were asked to indicate their preference to include or exclude each test. The survey included the original description of each test and any subsequent modifications, along with the original and modified instructions for scoring each test. Next, we tallied the results of this survey: we included tests that the majority of surgeons indicated should be included, excluded tests that the majority indicated should not be included, and produced a second survey for tests for which no majority was reached. In each case, ‘majority’ was defined as at least 4 of 5 surgeons.
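The tallying rule described above can be sketched in a few lines of Python. This is only an illustration of the 4-of-5 majority rule; the function name `classify_test`, the vote labels and the example votes are hypothetical and are not taken from the study data.

```python
from collections import Counter

def classify_test(votes, threshold=4):
    """Apply the majority rule to one test's votes ('include' or 'exclude')."""
    counts = Counter(votes)
    if counts["include"] >= threshold:
        return "include"
    if counts["exclude"] >= threshold:
        return "exclude"
    # No majority either way: the test goes forward to the next survey round.
    return "next round"

# Hypothetical votes from five surgeons for three hypothetical tests.
round_one = {
    "Test A": ["include"] * 5,
    "Test B": ["exclude", "exclude", "exclude", "exclude", "include"],
    "Test C": ["include", "include", "include", "exclude", "exclude"],
}
decisions = {name: classify_test(votes) for name, votes in round_one.items()}
print(decisions)
```

With five voters and a threshold of four, a 3 to 2 split in either direction leaves the test undecided, which is why such tests carry over to the second survey.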
All surgeons completed all rounds of the Delphi process. Following the first round of the modified Delphi survey, 14 tests were marked as include and 28 tests were marked as exclude. There was a discrepancy for 32 tests; these were included in the second survey.
The second survey presented the results of the first survey and identified the tests on which the surgeons disagreed. This survey asked each surgeon to present arguments for why each test should or should not be included in the study and to reaffirm their decision. If, following this second survey, any test was still without a majority decision, a document reproducing the arguments for and against including each test was created and circulated, and a meeting with the surgeons was held until consensus was reached.
Following the second survey, there were 11 tests without a majority decision. Following the third survey round, in which surgeons provided free-text arguments for or against the inclusion of the remaining 11 tests and voted again, consensus was reached; nine tests were included and two were excluded. Therefore, a total of 32 tests will be included in the study. Included tests are presented in Table 1 by shoulder pathology.
Category | Tests
ROM | Forward Flexion; External Rotation; Internal Rotation
Strength | External Rotation; Internal Rotation
General | Transdeltoid Palpation
Tendinosis | Painful Arc; Hawkins Kennedy; Neer’s Impingement
Supraspinatus | Jobe’s Test; Full Can Test
Infraspinatus | External Rotation Lag
Subscapularis | Lift Off Test; Belly Press Test; Internal Rotation Lag
SLAP | Speed’s Test; Anterior Slide Test; Active Compression; Compression Rotation; Biceps Load Test I; Biceps Load Test II; Resisted Supination External Rotation
Other Labral | Kim’s Test
Anterior | Load and Shift; Apprehension Test; Relocation Test; Surprise/Release Test
Posterior | Posterior Apprehension; Modified Barlow Test
Multidirectional | Sulcus Sign
AC joint | O’Brien’s Test; Cross Body Adduction Stress Test
Clinical examination testing
Richardson, Wilson and Guyatt [12] have identified two underlying steps to differential diagnosis. The first step involves arriving at a list of diagnostic possibilities and their relative likelihood of being responsible for the patient’s complaints. The first attempt at listing the possible diagnoses comes from listening to the patient describe the history behind the symptoms. The relative likelihood, termed the pretest probability, is the probability that the patient has the disease of interest based on the physician’s experience with the disease and the signs and symptoms presented by the patient [12].
In the second step, diagnostic tests are performed or administered by the physician and the results of those tests are used to revise the initial pretest probability to a posttest probability. The posttest probability is the probability that the patient has the disease of interest following the results of a diagnostic test [12]. It follows then that the diagnostic process involves a continuum of probabilities between two thresholds (Figure 2), where a probability of 0.50 (a 50:50 chance of having the disease) represents the greatest amount of uncertainty, probabilities less than 0.50 indicate greater certainty that the disease is not the cause of the patient’s symptoms, and probabilities greater than 0.50 indicate greater certainty that the disease is contributing to the symptoms. In fact, the clinician’s perception about the probability of having a specific disease may become sufficiently high that it surpasses the treatment threshold, such that the physician recommends therapy without further testing. On the other hand, the clinician’s perception about the probability of having a particular disease may become sufficiently low that it falls below the test threshold, at which point no further testing is recommended and the clinician rules out the disease.
The more accurate the diagnostic test, the greater the reduction in uncertainty about the diagnosis either toward dismissing a particular diagnosis from the list of possibilities or toward offering treatment for a highly probable disease. Less powerful diagnostic tests are unlikely to sufficiently change the degree of uncertainty, sometimes necessitating more invasive or expensive tests to further reduce uncertainty and reach a final diagnosis. For example, if physical examination tests cannot differentiate between a significant SLAP lesion and a rotator cuff tear, the surgeon whose expertise is insufficient to perform an arthroscopic SLAP repair has essentially just performed a risky, invasive and expensive diagnostic test by performing the arthroscopic examination without being able to offer treatment.
In our study, therefore, the physician will take a thorough history, including mechanism of injury, duration of symptoms, history of shoulder injuries and patient characteristics such as age, occupation and daily activities. Following the history, the physician will indicate the pretest probability of eight common shoulder pathologies using a 100 mm visual analogue scale (VAS). These will include rotator cuff tendinopathy, rotator cuff tear, AC joint pathology, SLAP lesion, other labral lesions and instability (anterior, posterior, or multi-directional, each represented by a separate scale).
Patients for whom the physician feels some uncertainty in the diagnosis (i.e. the mark on the scale falls between the two thresholds) will undergo the physical examination tests for those diagnoses only. For example, if the physician is certain that the patient has instability without AC joint pathology, though he or she remains uncertain about the direction of instability, this patient will undergo physical examination tests for instability but will not undergo the tests for AC joint pathologies. Similarly, the clinician may be certain that the patient does not have instability (i.e. the pretest probability that the patient has instability is below the test threshold) but is uncertain whether the diagnosis is tendinosis or more severe rotator cuff pathology, a labral lesion or AC joint pathology. This patient would undergo physical examination tests for tendinosis, rotator cuff tears, labral lesions and AC joint osteoarthritis but tests for instability would not be performed.
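The selection rule in the preceding paragraph can be illustrated with a short Python sketch. The protocol does not state numeric values for the two thresholds, so the values 0.10 and 0.90 below, the function name `pathologies_to_test` and the VAS marks are all hypothetical and serve only to show the logic.

```python
def pathologies_to_test(vas_marks_mm, test_threshold, treatment_threshold):
    """Return the pathologies whose pretest probability lies between the thresholds."""
    selected = []
    for pathology, mark_mm in vas_marks_mm.items():
        probability = mark_mm / 100.0  # mark on a 100 mm VAS -> probability in [0, 1]
        if test_threshold < probability < treatment_threshold:
            selected.append(pathology)
    return selected

# Hypothetical VAS marks (in mm) for one patient, one scale per pathology.
marks = {
    "rotator cuff tendinopathy": 55,
    "rotator cuff tear": 40,
    "AC joint pathology": 35,
    "SLAP lesion": 30,
    "other labral lesions": 20,
    "anterior instability": 3,
    "posterior instability": 2,
    "multidirectional instability": 4,
}

# Threshold values are assumptions for illustration only.
selected = pathologies_to_test(marks, test_threshold=0.10, treatment_threshold=0.90)
print(selected)
```

In this hypothetical case the three instability scales fall below the assumed test threshold, so only the rotator cuff, AC joint and labral tests would be performed, mirroring the second example in the text.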
To standardize the technique and scoring for each test, we constructed a glossary (Additional file 1) that will be provided to clinicians. Each clinician is required to review the glossary and ensure that their method of application matches the description provided.
To assist with standardization, we included pictures that illustrate the technique. Further, we will use a standardized data collection form that includes the description of how each test is performed and scored. Finally, the graduate student will be trained to perform all physical examination tests and familiarized with alternative techniques so that she can provide correction if the clinician is performing a test in a manner other than as described in the protocol. Tests will be listed according to the position of the patient during the test (e.g. seated, supine, standing), although the clinician will be free to order the tests as he or she sees fit. A research assistant will be present to ensure that all tests are completed and to record the results of each test on the data collection form.
The research assistant will remove any imaging studies, reports or other test results from the patient’s chart so that the clinician performing the tests is not biased in their interpretation of the physical examination tests. All imaging and other tests including any reports will be made available to the clinician after the physical examination tests are complete.
Choice of reference standards
One of the most common methodological flaws within the literature of diagnostic validity studies for shoulder physical examination tests is the exclusion of patients who did not undergo surgery. Obviously not all patients who present to an orthopaedic practice are recommended for surgery or elect to undergo recommended surgery. The sample formed by excluding these two subpopulations from the greater population of patients with shoulder pain or disability is no longer representative of typical clinical practice. Further, we might expect that estimates of the accuracy of physical examination tests that are restricted to patients who ultimately undergo surgical treatment are overly optimistic since the sample is made up of a non-representative proportion of (perhaps) more severely affected individuals.
Thus, this study will include two comparable reference standards. We will use arthroscopic examination as the reference standard for patients who undergo surgical treatment within eight months of physical examination, and magnetic resonance imaging with arthrogram (MRA) for patients who do not undergo surgery within this timeframe.
We developed a standardized arthroscopic examination and reporting protocol to minimize differences between surgeons in diagnoses due to variations in methods of examination (Additional file 2) and to minimize any detection bias should the clinician recall the physical examination or results of imaging or other special tests at the time of interpreting the surgical examination.
MRA was chosen as the reference standard over MRI due to its ability to diagnose disorders of the internal soft tissue structures such as the labrum. The literature has shown that MRI is not as accurate as MRA for diagnosing SLAP tears, with reported sensitivities for MRI ranging from 43% to 75% [13-17] and specificities between 58% and 70% [14, 15, 17]. MRA has been shown to be highly sensitive and specific for detecting both rotator cuff pathology and labral injuries [18, 19]. In some cases patients will undergo both surgery and an MRA. For these cases we will calculate the agreement between these two standards to further justify the use of MRA as a second reference standard.
Plan for statistical analyses
We will calculate sensitivity and specificity for each test individually, including 95% confidence intervals around these estimates. Sensitivity is calculated by dividing the number of patients with the disease who had a positive test (true positives) by the total number of patients with the disease. Specificity is calculated by dividing the number of patients without the disease who had a negative test (true negatives) by the total number of patients without the disease. We will use these values to calculate positive and negative likelihood ratios (LR). A positive likelihood ratio is the likelihood that a positive test result is elicited in a patient with the target disorder compared to the likelihood that a positive test result is elicited in a patient without the target disorder (sensitivity/(1 – specificity)). A negative likelihood ratio is the likelihood that a negative test result is elicited in a patient with the target disorder compared to the likelihood that a negative test result is elicited in a patient without the target disorder ((1 – sensitivity)/specificity). LRs have advantages over sensitivity and specificity because they can be calculated for several levels of the symptom/sign or test, they can be used to combine the results of multiple diagnostic tests, and they can be used to estimate a posttest probability for a target disorder, all of which are more useful in a clinical setting.
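As an illustration of these definitions, the following Python sketch computes sensitivity, specificity and both likelihood ratios from a 2×2 table, and converts a pretest probability to a posttest probability through odds. The protocol does not name a confidence interval method, so the Wilson score interval used here is an assumption; the function names and the counts are hypothetical.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (about 95% when z = 1.96)."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

def accuracy_measures(tp, fp, fn, tn):
    """Accuracy measures from true/false positive and negative counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Undefined (division by zero) when specificity is exactly 1 or 0.
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity_ci": wilson_interval(tn, tn + fp),
    }

def posttest_probability(pretest, likelihood_ratio):
    """Revise a pretest probability with a likelihood ratio via odds."""
    pretest_odds = pretest / (1 - pretest)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

# Hypothetical counts: 50 patients with the disease, 50 without.
measures = accuracy_measures(tp=40, fp=10, fn=10, tn=40)
revised = posttest_probability(0.50, measures["lr_positive"])
print(measures, revised)
```

With these hypothetical counts, sensitivity and specificity are both 0.80, the positive likelihood ratio is approximately 4, and a pretest probability of 0.50 is revised to approximately 0.80 after a positive test.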
We will divide the tests into groups according to the disease they test for. We will then dummy code these sets of tests to indicate whether one test, two tests or all tests are positive. We will test whether combinations of the tests improve the ability to diagnose disease. We will calculate the sensitivity, specificity and likelihood ratios if all tests are positive, if exactly one test is positive, if at least one test is positive, and so on. Additionally, we will assess whether particular tests can be removed from the set of tests without losing any diagnostic ability for each disease. Poor indicators of disease will be removed from the analysis and the change in accuracy measures will be evaluated. This analysis will determine the appropriate number and combinations of tests for each disease category that will provide the greatest clinical yield.
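One way to picture the combination analysis is to count the positive tests per patient and evaluate rules of the form "at least k tests positive". The Python sketch below does this for a hypothetical group of three tests; the function name `accuracy_at_least` and the patient data are illustrative assumptions, not the study's analysis code.

```python
def accuracy_at_least(patients, k):
    """Sensitivity and specificity of the rule 'at least k tests positive'.

    patients: list of (results, has_disease), where results is a list of
    booleans (one per test in the group) and has_disease comes from the
    reference standard.
    """
    tp = fp = fn = tn = 0
    for results, has_disease in patients:
        rule_positive = sum(results) >= k  # number of positive tests vs. k
        if has_disease and rule_positive:
            tp += 1
        elif has_disease:
            fn += 1
        elif rule_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical results for three tests in eight patients.
patients = [
    ([True, True, True], True),
    ([True, True, False], True),
    ([True, False, False], True),
    ([False, False, False], True),
    ([True, False, False], False),
    ([False, False, False], False),
    ([False, False, False], False),
    ([True, True, False], False),
]

for k in (1, 2, 3):
    print(k, accuracy_at_least(patients, k))
```

In this made-up data set, requiring more positive tests lowers sensitivity and raises specificity, which is the trade-off the combination analysis is meant to quantify.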
Estimation of sample size
To address our first two hypotheses, we assumed a sensitivity and specificity of at least 0.85 with a 95% confidence interval with bounds of ±0.10. This boundary was selected because the authors felt that if the uncertainty around the estimate of validity included the possibility of a sensitivity or specificity of less than 0.75, the conclusions about the usefulness of the test would change. Using these parameters, a sample size of 50 patients tested at each disease state (AC joint pathology, rotator cuff pathology, SLAP lesions, other labral lesions, and anterior instability) is required [20]. Since some of these patients may be lost to follow-up or drop out, we inflated this sample size by 10% for a total of 55 patients tested in each disease category.
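The arithmetic behind these figures can be approximated with the usual normal-approximation formula for estimating a proportion, n = z² p(1 − p) / d². This is a sketch under that assumption; the method in reference [20] is not reproduced here and may differ in detail. The function names are hypothetical.

```python
import math

def sample_size_for_proportion(expected, half_width, z=1.96):
    """Patients needed so a 95% CI around `expected` has the given half-width."""
    return math.ceil(z**2 * expected * (1 - expected) / half_width**2)

def inflate_for_dropout(n, dropout_percent):
    """Add a whole-number allowance for loss to follow-up."""
    return n + math.ceil(n * dropout_percent / 100)

approx_n = sample_size_for_proportion(expected=0.85, half_width=0.10)
recruited = inflate_for_dropout(50, dropout_percent=10)
print(approx_n, recruited)
```

Under this formula the requirement comes to 49 patients, close to the 50 stated above, and a 10% allowance on 50 gives the 55 patients per disease category.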
Since maintaining the distribution of disease severity is crucial to the validity of our study, we will recruit consecutive patients until the required 55 patients are recruited for the slowest recruiting disease category. We anticipate that some patients will have multiple diagnoses (e.g. rotator cuff tear and SLAP lesion), which means that they will be counted as disease positive in more than one analysis; thus our sample size for each disease group is likely to be larger than the required 55 patients tested per disease group.