Despite the large number of CT scans of the cervical spine performed to assess degenerative changes, standardized clinical rating models are still lacking. Cervical spondylosis is a common radiological finding, and its association with disability and pain remains unclear. This may be partly due to the lack of consensus on grading models for degeneration, which makes it important to obtain reliable assessment models. The aim of this study was to contribute to establishing such a scoring system and to validate it with respect to inter-rater and intra-rater reliability. Emphasis was placed on creating a user-friendly system for clinical implementation.
Inter-rater reliability
The kappa value for the overall degree of degeneration showed substantial agreement. However, this value represents the agreement between the raters after their degeneration scores on the separate variables were added and the subjects were divided into three categories (no degeneration, moderate degeneration or severe degeneration). When the variables are grouped together, the rate of disagreement on the separate variables is masked, and the agreement with only three categories is presumably higher than it would be if more than three categories of degeneration were available. This becomes apparent when analyzing the separate variables, where kappa values are considerably lower. For two of the variables (endplate sclerosis and facet joint osteophytes), the null hypothesis could not be rejected. The only variable for which the strength of agreement was substantial was anterior osteophytes. This variable is weighted to contribute less to the disc degeneration score than height loss, which only reached a moderate strength of agreement. The fact that agreement for the total level of degeneration was higher than for the separate variables could be explained by compensation mechanisms of the individual rater. For example, a borderline case of facet joint osteophytes could have been neglected, with a compensatory affirmation of borderline irregularity of the articular surface.
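The masking effect described above can be illustrated with a minimal sketch. It computes unweighted Cohen's kappa for two raters, first on fine-grained summed scores and then after collapsing those scores into three categories. The ratings, the 0 to 9 score range and the category cut-offs are all hypothetical and chosen only for illustration; they are not the study data or the study's thresholds, and the study's own kappa calculation may have differed (for example, by using a weighted kappa).

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same subjects."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed proportion of subjects on which both raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


def collapse(score):
    """Map a summed score to one of three categories (hypothetical cut-offs)."""
    if score <= 2:
        return "none"
    if score <= 6:
        return "moderate"
    return "severe"


# Hypothetical summed degeneration scores from two raters for 12 subjects.
rater_a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 4]
rater_b = [1, 1, 3, 2, 4, 6, 5, 8, 8, 9, 0, 5]

kappa_fine = cohen_kappa(rater_a, rater_b)
kappa_collapsed = cohen_kappa(
    [collapse(s) for s in rater_a],
    [collapse(s) for s in rater_b],
)

print(f"kappa on fine-grained scores: {kappa_fine:.2f}")
print(f"kappa on three categories:    {kappa_collapsed:.2f}")
```

In this invented example the raters often differ by a single point, which counts as disagreement on the fine-grained scale but usually falls within the same category after collapsing, so the three-category kappa comes out higher.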
Our results were similar to those of Walraevens et al. [16] concerning facet joint degeneration, even though our classification criteria differed: the strength of agreement was low for osteophytes and irregularity of the articular surface but slightly higher for joint space narrowing.
However, when the radiograph-based scoring system for disc degeneration was applied to CT, there appeared to be a slight loss of reliability compared to Walraevens et al. [16]. They showed “good” or “excellent” agreement on the disc degeneration variables, apart from endplate sclerosis, for which agreement was low in both studies, whereas our results ranged from “moderate” to “substantial”, with a slightly lower level of agreement overall. Nevertheless, the trend is clear: assessing endplate sclerosis, facet joint osteophytosis and irregularity of facet joint articular surfaces is more complex than assessing the three other variables.
Considering a cut-off limit of 0.40 for strength of agreement, which is arbitrarily set, many of the kappa values obtained indicate an acceptable or good level of agreement. However, several kappa values were below 0.40. A few reasons for the relatively low values must be considered. First, the relatively small sample size could have affected the level of agreement. Another factor might have been the multi-segment assessment. Determining the spinal segment with the highest level of degeneration is an assessment in itself. It is plausible that the raters were in fact reviewing different segments and consequently assessing them differently. Lack of training among the raters might also have affected the level of agreement. In this study, the raters deliberately had no joint training session on the scoring system prior to the assessment procedure. This was done to simulate a clinical setting as closely as possible.
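As a reading aid for the strength-of-agreement labels used in this discussion, the sketch below maps a kappa value to the verbal categories of the widely used Landis and Koch scale. That this scale is the one underlying the labels here is an assumption: it is inferred from the terms "moderate" and "substantial" and the 0.40 cut-off, and is not stated in this section. The function name is hypothetical.

```python
def agreement_label(kappa):
    """Return the Landis and Koch strength-of-agreement label for a kappa value.

    Assumption: the labels in the text follow this scale; other studies
    (for example, those reporting "good" or "excellent") may use different bands.
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0.0:
        return "poor"
    # Upper bound of each band (inclusive) paired with its label.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label


# Hypothetical kappa values, not results from this study.
for value in (0.35, 0.55, 0.70):
    print(value, agreement_label(value))
```

Under this scale, a value above the 0.40 cut-off is labelled at least "moderate", which is consistent with treating 0.40 as the lower limit of acceptable agreement.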
The goal of developing a scoring system that is easily applicable and independent of experience and discipline is important. However, we believe minor modifications could be made to improve the scoring system while still keeping it user-friendly. For example, one source of disagreement on the height loss variable may have been the presence of endplate compression affecting the disc height.
Intra-rater reliability
The ICC values obtained all indicated fair, good or excellent intra-rater agreement, with total degeneration scores having the strongest agreement for both raters. However, the confidence intervals were wide, and the true ICC values are thus hard to discern. The wide intervals are interpreted as originating from the variation between examiners using an ordinal scale on a relatively small sample. Only two of the raters participated in the intra-rater reliability part of the study. As in the inter-rater analysis, the agreement was higher for the total degeneration score, obtained by summing the disc degeneration and facet joint degeneration scores.
Compared with other scoring scales in the field, the agreement is regarded as equivalent. For the assessment of disc degeneration, inter-rater reliability of previous scales varies from 0.41 to 0.78 [16, 28, 29] and intra-rater reliability from 0.71 to 0.86 [16, 29]. In the material reviewed, inter-rater agreement for facet joint degeneration varied from 0.43 to 0.49 [15, 16] and intra-rater agreement from 0.57 to 0.72 [15, 16]. When making comparisons, one must consider the different radiological modalities used in previous studies.
In summary, our results indicate an acceptable level of agreement regarding both inter-rater and intra-rater reliability of a CT-based scoring system, especially for facet joint degeneration and overall degeneration. The findings support a role for this scoring system in both future research and clinical practice. However, when the individual parameters in the scores were analyzed, agreement was lower than for the total scores. Hence, we recommend that the system be applied clinically in its aggregated form to assess disc degeneration, facet joint degeneration and overall degeneration.
This study has a few limitations. First, the sample size is rather small, and before wide clinical application, future studies with larger samples are required to confirm the results.
Second, the study population in this material consists exclusively of post-traumatic patients. This makes it less representative of the general population, and it should be considered neither an asymptomatic cohort nor a cohort with non-specific neck pain. We welcome further investigations in a different clinical setting to validate the scoring system.