To our knowledge, this is the first reliability study covering eight common cervical MRI findings. The overall inter-rater reliability was substantial for all variables except zygapophyseal osteoarthritis where moderate reliability was found. Intra-rater reliability was substantial for the majority of variables and almost perfect for kyphosis. These reliability estimates reflect that the observed agreement notably exceeds the agreement that can be expected by chance.
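The relation between observed agreement (OA), chance-expected agreement (AC), and Κ can be sketched for a dichotomous finding. The ratings below are purely illustrative, not data from the study:

```python
# Unweighted Cohen's kappa for two readers rating a dichotomous MRI finding
# (0 = absent, 1 = present). Hypothetical ratings for illustration only.
from collections import Counter

reader_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
reader_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

n = len(reader_a)
observed = sum(x == y for x, y in zip(reader_a, reader_b)) / n  # OA

# Chance-expected agreement (AC): product of the readers' marginal
# proportions, summed over the categories.
pa, pb = Counter(reader_a), Counter(reader_b)
chance = sum((pa[c] / n) * (pb[c] / n) for c in set(reader_a) | set(reader_b))

kappa = (observed - chance) / (1 - chance)
print(round(observed, 2), round(chance, 2), round(kappa, 2))  # 0.9 0.5 0.8
```

Here OA (0.9) notably exceeds AC (0.5), giving Κ = 0.8, i.e. "almost perfect" on the conventional benchmark scale.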
For disc degeneration, other studies [9, 12] reported higher reliability estimates than the disc height estimates in the current study. Although the use of the intraclass correlation coefficient in the study by Jacobs et al. [12] does not allow for direct comparison, possible explanations for the reliability differences are the use of a ubiquitously accessible reference image of a normal disc [12] and the notable experience among readers with the same educational background [9].
For disc contour, the reliability estimates were similar to those of other studies, despite the fact that we used a three-category classification rather than the previously reported dichotomous classifications [8, 30, 31] and that those studies compared more experienced readers [30, 31].
For spinal canal stenosis, the current study's unweighted reliability estimates exceeded those previously reported using weighted kappa statistics [13, 32], although the use of weights is expected to yield higher estimates. A higher number of readers (six [13] and nine [32]) could explain this difference, but even when compared with the three most experienced readers in these studies, the current study still achieved better reliability estimates. The most probable reason appears to be the limited introduction of their classification [13, 32]. When both written and visual descriptions were used, our moderate to almost perfect reliability among readers with considerable differences in experience suggests good applicability of this classification of spinal canal stenosis.
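Why weighted kappa tends to exceed the unweighted estimate for ordinal gradings can be illustrated with hypothetical stenosis grades (the data and grade labels below are assumptions for illustration, not from the study):

```python
# Unweighted vs linearly weighted kappa for hypothetical ordinal stenosis
# grades (0 = none, 1 = moderate, 2 = severe) from two readers. Linear
# weights only partly penalize near-misses (e.g. 1 vs 2), which is why the
# weighted estimate is usually higher.
a = [0, 1, 2, 1, 0, 2, 1, 1, 0, 2]
b = [0, 2, 2, 1, 0, 1, 1, 2, 0, 2]
n, k = len(a), 3  # n ratings, k ordinal categories

def kappa(a, b, weighted):
    # Disagreement weight: 0/1 for unweighted, |i - j|/(k - 1) for linear.
    d = (lambda i, j: abs(i - j) / (k - 1)) if weighted else (lambda i, j: float(i != j))
    obs = sum(d(x, y) for x, y in zip(a, b)) / n          # observed disagreement
    p = [a.count(c) / n for c in range(k)]                # reader A marginals
    q = [b.count(c) / n for c in range(k)]                # reader B marginals
    exp = sum(d(i, j) * p[i] * q[j] for i in range(k) for j in range(k))
    return 1 - obs / exp

print(round(kappa(a, b, weighted=False), 2))  # 0.55 unweighted
print(round(kappa(a, b, weighted=True), 2))   # 0.66 linearly weighted
```

With identical ratings, the weighted estimate (0.66) exceeds the unweighted one (0.55), which makes the current study's higher unweighted estimates all the more notable.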
For zygapophyseal osteoarthritis, both the intra- and inter-rater reliability estimates were better than previously reported [11], which is most likely explained by the use of a dichotomous variable in the current study compared to a classification with four severity categories [11].
For neural foraminal stenosis, this study achieved higher reliability estimates than studies with more experienced readers [30, 31]. The inferior estimates in those studies may be explained by unclear definitions [30] and by low prevalence estimates together with images obtained at a 0.5 T field strength [31]. Compared with the study from which we modified the classification of neural foraminal stenosis [10], the current study was unable to reach the same almost perfect reliability estimates (Κ > 0.9). Nevertheless, we consider the substantial to almost perfect reliability satisfactory, bearing in mind the differences in reader experience and the heterogeneous image material (i.e. images with different field strengths and available sequences). The modified classification (dichotomous versus the original four categories) proved reliable, and its association with clinical findings has previously been reported [33].
Methodological considerations
A limitation of the study is that it was not preceded by a power calculation. However, the confidence intervals for the Κ estimates spanned more than two levels (e.g. from moderate to almost perfect for spinal canal stenosis) in only a minority of cases. A larger sample would have narrowed the confidence intervals but would probably not have changed the reliability estimates substantially.
Another limitation is the involvement of only reader A in the intra-rater analysis. Two considerations explain this: 1) previous reliability studies found higher [7–9, 12, 14, 21] or similar to higher [10, 11, 13] intra-rater reliability than inter-rater reliability, and 2) involvement of reader A was necessary since a future prognostic study will involve MRI assessments performed by reader A. As for the inter-rater reliability, the study included three readers, only one of whom was a radiologist. However, the results suggest that our method is applicable among other health care professionals (i.e. rheumatologists and chiropractors) in a controlled research setting. Involvement of other relevant health care professionals, e.g. spine surgeons, would have been desirable but was unfortunately not possible.
Owing to the properties of Κ, the measure does not disentangle systematic and random misclassification [28]. Therefore, we provided prevalence tables, which give no indication of systematic misclassification.
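How a prevalence table can expose systematic misclassification that Κ alone conceals can be sketched with hypothetical contingency counts (all numbers below are illustrative, not study data):

```python
# Two hypothetical 2x2 tables with the SAME observed agreement (90/100):
# rows = reader A (absent, present), columns = reader B (absent, present).
random_err = [[40, 5],   # disagreements scattered in both directions
              [5, 50]]
systematic = [[40, 10],  # reader B reports the finding more often (bias)
              [0, 50]]

def marginals(table):
    # Each reader's "present" prevalence; diverging marginals suggest
    # systematic rather than random misclassification.
    n = sum(sum(row) for row in table)
    prev_a = sum(table[1]) / n                    # reader A row marginal
    prev_b = (table[0][1] + table[1][1]) / n      # reader B column marginal
    return prev_a, prev_b

print(marginals(random_err))   # (0.55, 0.55): similar prevalences, no bias
print(marginals(systematic))   # (0.5, 0.6): reader B systematically higher
```

Both tables yield 90% observed agreement, but only the second shows the asymmetric marginals that a prevalence table makes visible.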
The prevalence table discloses a notable difference in the number of disc levels assessed for disc contour at levels C2/C3, C3/C4 and C7/T1: reader A assessed fewer levels than readers B and C owing to the lack of axial images at those disc levels. This discrepancy suggests a difference among the readers, and it cannot be ruled out that this partly explains why higher reliability estimates were not achieved for disc contour.
Another potential limitation is that all MRIs were derived from individuals with neck pain. However, since cervical spine MRI is seldom performed in patients without neck pain, and since the future use of the evaluation manual applies to patients with neck pain, we consider the current sample appropriate for its purpose.
Finally, a potential limitation of the study is the heterogeneous image material (MRIs were performed at five different hospitals, with different field strengths and available sequences). Yet, as this resembles everyday clinical practice, it was an intended challenge, and an attempt was made to manage the heterogeneity by using a standardized evaluation manual. The differences between OA and AC (Tables 2 and 3) reflect that both inter- and intra-rater agreement notably exceed the agreement that can be expected by chance. Furthermore, the high levels of observed agreement reflect only a minor degree of misclassification. Based on these observations, our interpretation is that the evaluation manual and the standardized procedures, rather than pure chance, explain the high levels of agreement when assessing heterogeneous images.
Ultimately, the heterogeneous image material and the use of three different health care professionals both add to the generalizability and thus constitute strengths of the study. The blinding of the readers, the use of simple and easily comprehensible classifications along with regular encouragement to follow the evaluation manual, are other important strengths of the study.
In contrast to the controlled settings of the current study, a study comparing narrative MRI reports demonstrated considerable variability [34]. In that study [34], a patient with low back pain and right L5 radicular symptoms underwent lumbar spine MRI at 10 different MRI centers within 3 weeks. Comparison of the 10 narrative reports revealed considerable variability; none of the 49 described findings occurred in all 10 reports, and only one finding occurred in nine reports. Even if this amount of variability is unusually large [34], it supports our clinical experience that variability also prevails in the interpretation of cervical spine MRIs. A possible way to overcome this is by using classifications sufficiently comprehensible to be applied 1) by different health care professionals and 2) when assessing heterogeneous images from different MRI scanners. Such classifications were presented in the current study. Confirmatory studies will be needed. If those studies were to involve experienced radiologists, provide proper training for less experienced MRI readers, and use an evaluation manual, better reliability might be achieved in clinical settings. So far, the results suggest that the evaluation of MRI findings can be used in controlled research settings studying individuals with neck pain. Suggestions for future research include comparison of reliability with and without the use of an evaluation manual. Also, including more than one of each type of health care professional would allow for comparison of experience levels both among and within different types of health care professionals.