Introduction

Currently, there is no objective way to measure the effectiveness of surgery for a degenerative lumbar spine. The dominant indications for this surgery are pain and back-specific dysfunction in relation to pain [1].

Although prospective patient-reported outcome measures (PROMs), completed both preoperatively and at follow-up, are currently accepted as outcome measures in both clinical studies and clinical practice [2,3,4], there are limitations and questions regarding the interpretation of the results. Many studies testify to the difficulty of determining how large a change in a given score needs to be before it can be considered relevant for the patient and/or the caregiver [5, 6]. Sometimes it is even difficult to tell whether a score change reflects a true change or simply measurement error [7]. These extensive questionnaires are also administratively cumbersome for the patient, as well as for the health care system [8]. PROMs are further burdened by bias in the shape of response shift, as described by Schwartz et al. [9]. A response shift is a subtle change in the way patients perceive their health status, and in how they interpret and respond to questions, based on cognitive psychological mechanisms. It has also been pointed out that patients may comprehend the questions listed in the form in different ways, or even consider them irrelevant, which increases the difficulty of interpreting the responses [10].

The frequently used PROMs are validated against some kind of basic reference, a so-called transition question (TQ), where the patient is asked to evaluate the overall result of the surgical intervention [11, 12]. The meaning of this type of single-item PROM (i.e., face validity) instantly becomes obvious to the patient, who then can state directly if the surgery was worthwhile or not. However, over the last decades the performance of single-item PROMs has been called into question mainly because of recall bias [11, 13]. Currently, TQs are recommended as independent measures of construct validation of PROMs designed to measure change over time, and as a strategy to interpret such questionnaires.

The extensive use of retrospective single-item measures indicates their important role in interpreting the outcome of spinal surgery. Ward et al. [14] demonstrated higher correlations with changes in clinical measures for domain-specific TQs than for general health TQs in a population with rheumatoid arthritis. Since the domain of interest for this type of surgery is pain, it is reasonable to evaluate how a pain-specific TQ works as an overall outcome measure in a population with degenerative spine conditions.

In the Swedish Spine Registry (Swespine), several PROMs are reported, including the pain-specific TQ global assessment (GA) of outcome. The majority of patients undergoing elective spine surgery in Sweden are registered in this national register, and today more than 90,000 individuals operated on for degenerative lumbar spine conditions are included in the database.

GA is based on a single question about treatment effect (“How is your back/leg pain today as compared to before the surgery?”) with six response options in a Likert-type format. It should, however, be considered a five-point asymmetric score, since the first response option (“I had no pain prior to surgery”) is excluded from the analyses. GA has previously been evaluated within a Swedish RCT [15], and within the context of that efficacy study it was found to be a feasible substitute for prospective multi-item PROMs. In the Hägg et al. study, the population (n = 289) consisted of patients undergoing surgery for degenerative disc disease (DDD). GA was compared with two of the PROMs used in Swespine (VAS and ODI), but not with the remaining two, EQ-5D and SF-36. In that setting, GA also had four response options as opposed to five in the current study.

The aim of the present study was to explore the utility of using the simple transition question GA as an overall PROM based on the large real-life database in the Swedish spine registry (Swespine). The three most common diagnoses for surgical treatment in the degenerative lumbar spine—spinal stenosis (LSS), disc herniation (LDH), and degenerative disc disease (DDD)—were included in the analyses. Further, we specifically studied the DDD group by ROC curve analysis to be able to compare the results to the previous study by Hägg et al. [15]. In this study, we propose how GA can be used to assess the outcome of spine surgery.

Patients and methods

Characteristics of the population

Retrospective analyses were performed on prospectively collected data from 94,132 patients registered in Swespine between 1997 and 2015 who underwent surgery for degenerative lumbar spinal disorders. Table 1 describes the baseline characteristics of the three most common diagnostic groups in Swespine: Lumbar Spinal Stenosis (LSS, N = 50,194), Lumbar Disc Herniation (LDH, N = 30,102) and Degenerative Disc Disease (DDD, N = 13,836). The response rates for the various questionnaires at baseline were in the range of 71–85%. At 1 year post-surgery the response rates had fallen to 55–68% (Fig. 1).

Table 1 Baseline characteristics of three common diagnosis groups in Swespine from 1997 to 2015
Fig. 1 Study flow chart

Swespine is considered to have nationwide coverage. The coverage, i.e., the proportion of units conducting spine surgery that report to Swespine, is approximately 90% [16]. Questionnaires with pre- as well as post-operative data were completed by the patients after informed consent. The protocol included anthropometrics, co-morbidities, smoking status, usage of analgesics, sick-leave, and PROMs. Follow-up questionnaires were sent to the patients by mail at 1, 2, 5 and 10 years after surgery, with one reminder. Reporting of surgical and perioperative data was performed by the surgeon.

Outcome variables

Global assessment (GA)

The single-item transition question global assessment for back pain (GABACK) and leg pain (GALEG) was used at follow-up: “How is your back/leg pain today as compared to before the surgery?” where 0 represents no back/leg pain before the surgery, 1 completely pain free, 2 much better, 3 somewhat better, 4 unchanged, 5 worse [15].

Patient-reported outcome measures (PROMs)

The following PROMs were used: the Visual Analogue Scale for back (VASBACK) and leg (VASLEG) pain, measured on a scale from 0 (no pain) to 100 (worst imaginable pain); the disease-specific Oswestry Disability Index (ODI) [17], which aims to capture important aspects of low back pain that can influence quality of life, with the index score ranging from 0 (best) to 100 (worst); the quality-of-life measure EuroQol-5-Dimensions (EQ-5D) [18], translated into a preference-based single index score on a scale from 0.000 (death) to 1.000 (perfect health); and the Short Form 36 (SF-36) [19], which is subdivided into 8 subscales and 2 composite domains.

Statistics

Descriptive data are presented as mean ± SD, or number (%). The final scores as well as changes in score (Δ-scores) of the prospective PROMs in relation to the different response options in GA are visualised by box-plots.

A non-responder analysis was performed where mean differences in gender, age, and PROMs at baseline were compared between patients undergoing surgery who filled in the PROM questionnaires sent out by the Swespine register and those who did not fill in these forms using the independent samples t test.

The behavior of GA in relation to the prospective PROMs was studied using Spearman rank correlation coefficients. Correlations were calculated between GA and the Δ-scores, and between GA and the postoperative scores, for each of the four PROMs. The three diagnosis groups LDH, LSS and DDD were analyzed. The correlation coefficient assesses in a general way whether the variables tend to rise (or fall) together. According to Di Fabio [20], the descriptive significance of the magnitude of the absolute value of the coefficient is judged arbitrarily. The recommended interpretation is that a correlation coefficient from 0.00 to 0.24 indicates little or no association. Coefficients from 0.25 to 0.49 are considered weak, from 0.50 to 0.74 moderate and from 0.75 to 1.00 strong. The correlation between GABACK/LEG and the absolute score at 1 year follow-up is expected to be positive for ODI and VASBACK/LEG and negative for SF-36 and EQ-5D. For example, the lower the postoperative ODI value, the lower the GA value. As for the change scores, negative correlations for all PROMs are expected. The 95% confidence intervals for the correlation coefficient point estimates were previously estimated for the same diagnoses registered until 2014. The widest interval was 0.09 (for the change score of EQ-5D vs GABACK). The Swespine registry expanded by more than 15,000 patients within the three examined diagnostic groups during 2014–2015. An even narrower confidence interval can therefore be expected in the current population, and this analysis is left out. P values are not presented, since they lack clinical relevance in these correlation analyses.
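The correlation analysis described above can be illustrated with a minimal Python sketch. It is not the SPSS procedure used in the study: it computes a Spearman coefficient as the Pearson correlation of average ranks and labels its magnitude with the Di Fabio bands quoted in the text. The function names and the toy data are hypothetical and serve only as an illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman coefficient computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def di_fabio_label(rho):
    """Label |rho| with the bands quoted in the text (Di Fabio [20])."""
    r = abs(rho)
    if r < 0.25:
        return "little or none"
    if r < 0.50:
        return "weak"
    if r < 0.75:
        return "moderate"
    return "strong"


# Invented toy data: GA response codes (1-5) and post-operative ODI scores.
ga = [1, 2, 2, 3, 4, 5]
odi = [4, 10, 18, 30, 42, 60]
rho = spearman_rho(ga, odi)
print(round(rho, 3), di_fabio_label(rho))
```

In this toy example a lower GA code goes with a lower post-operative ODI, so the coefficient is positive, which matches the direction postulated for the absolute ODI and VAS scores.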

ROC curve analyses were used to generate indexes (the area under the curves, AUC) to see how the discriminative ability of GA varied according to different PROMs. GA was dichotomized into ‘success’ (i.e. the patients that responded as “pain free” or “much better”) and ‘failure’ (i.e. “somewhat better”, “unchanged” and “worse”). An AUC below 0.70 was interpreted as low discriminative ability, between 0.70 and 0.90 as good and above 0.90 as excellent [21].
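The dichotomization and the AUC interpretation described above can be sketched as follows. This is an illustrative example under stated assumptions, not the SPSS analysis used in the study: the AUC is computed through its rank interpretation (the probability that a randomly chosen 'failure' patient has a worse score than a randomly chosen 'success' patient, with ties counted as one half), and a higher score is assumed to be worse, as for ODI and VAS. The function names and the data are hypothetical.

```python
def ga_success(ga_code):
    """Dichotomize GA: 1 (pain free) or 2 (much better) counts as success,
    3-5 as failure. Code 0 (no pain before surgery) is excluded: None."""
    if ga_code == 0:
        return None
    return ga_code in (1, 2)


def auc(scores_success, scores_failure):
    """Probability that a failure scores higher (worse) than a success.
    Ties contribute one half. Assumes higher score = worse outcome."""
    wins = 0.0
    for s in scores_success:
        for f in scores_failure:
            if f > s:
                wins += 1.0
            elif f == s:
                wins += 0.5
    return wins / (len(scores_success) * len(scores_failure))


def auc_label(value):
    """Interpretation bands quoted in the text [21]."""
    if value < 0.70:
        return "low"
    if value <= 0.90:
        return "good"
    return "excellent"


# Invented toy data: (GA code, post-operative ODI) per patient.
patients = [(1, 6), (2, 14), (2, 30), (3, 24), (4, 40), (5, 52), (0, 10)]
success_scores = [odi for ga, odi in patients if ga_success(ga) is True]
failure_scores = [odi for ga, odi in patients if ga_success(ga) is False]
value = auc(success_scores, failure_scores)
print(round(value, 3), auc_label(value))
```

For a measure where a higher score is better, such as the EQ-5D index, the scores would need to be negated (or the comparison reversed) before calling `auc`.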

The ROC analyses included only the DDD group. In the ROC analyses the Δ-scores as well as the absolute scores at 1 year post-operatively of VAS, ODI and EQ-5D were analyzed.

The SPSS software package version 24 was used for the statistical analysis.

Ethical considerations

Informed consent was obtained from all participants in Swespine. The use of anonymised data, that cannot be traced to any individual patient, has been approved by the holders of the registry for this research project. This research project was approved by the Regional Ethics Committee.

Results

Non-responder analysis

At the 1-year follow-up differences between responders and non-responders in gender, age, and mean PROM baseline values were significant at the 0.001 level although the differences in values were very small and the standard error of difference did not exceed 0.2 for any of the examined variables. In the non-responder group 49% were female as opposed to 51% in the responder group. The mean difference in age was 2.9 years, and in VASBACK and VASLEG 1 point, in ODI 1.5, in EQ-5D 0.043, and in SF-36PCS and SF-36MCS 1 point, respectively (additional data are given in Online Resource 1).

Correlations between GA and PROM scores

In Fig. 2a–f, box-plots of the Δ-scores and the final scores for patients who completed all of the PROM forms at baseline and follow-up are presented according to the response options in GABACK. In this analysis complete data were available for 37,781 patients. In the diagrams of the final scores of VAS and ODI, the boxes (representing 50% of the cases) for the response options “pain free” and “much better” appear to be separated from the other boxes, as demonstrated by the cut-off lines. Separation for the EQ-5D and for the Δ-scores is not as evident. The analyses were also made for GALEG; they showed similar results and are, therefore, not presented.

Fig. 2 Box-plots of the final scores at one year post-operatively (a–c) and the Δ-scores, i.e. score changes (d–f), of PROMs according to the response options in GABACK. The boxes represent 50% of the study population and the line inside each box is the median. The whiskers represent 1.5 times the interquartile range (IQ). Outliers with more than 1.5 times the IQ are marked as circles and those with 3 times the IQ as lines. The horizontal lines inside panels a–c highlight a possible cut-off point between success (“pain free” and “much better”) and failure (“somewhat better”, “unchanged”, “worse”) in outcome. Only patients that completed all the PROMs at baseline and at follow-up are included. N total = 37,781. N “pain free” = 7061; “much better” = 16,115; “somewhat better” = 7618; “unchanged” = 4078; “worse” = 2909. GA BACK Global Assessment of back pain, VAS BACK Visual Analogue Scale of back pain, ODI Oswestry Disability Index, EQ-5D Euroqol-5-Dimensions index score

As shown in Table 2, GA presented stronger correlations with the final scores than with the Δ-scores in all three diagnosis groups. The number of responders varied among the PROMs within the range of 12,710–19,660 in the LDH group, 25,507–34,824 in the LSS group, and 7581–9624 in the DDD group. The lower numbers correspond to the Δ-scores, for which completion of both the pre-operative and the post-operative PROM questionnaire is needed. GA correlated somewhat more strongly with domain-specific PROMs (VAS, ODI) than with the quality-of-life measure EQ-5D. In the LDH group, GALEG vs. VASLEG presented the strongest correlations. In both the LSS and the DDD groups, GABACK vs. VASBACK yielded the highest correlation coefficients.

Table 2 Spearman rank correlations between GA, Δ-scores, and 1 year post-operative scores, for four PROMs in patients who underwent surgery for LDH, LSS and DDD, respectively

Correlations between GA and item-specific domains within PROMs

Since GA seemed to correlate well with the pain-specific VAS, we examined if that finding was consistent with the bodily pain (BP) domain in SF-36 as well as the pain-specific items in EQ-5D and ODI as opposed to the other, less similar domains/items. The analyses were performed on the whole study population. The number of respondents to each domain or item varied within the range of 50,212–65,823.

The SF-36-domain (indicated in bold in Table 3) that had the strongest correlation with GA was Bodily Pain. Similarly in Tables 4 and 5, GA correlated best to the pain-specific items in both the EQ-5D (item 4) and ODI (item 1). The correlations to dissimilar domains or items were weaker.

Table 3 Spearman rank correlations between GA and domains within the SF-36 at 1 year post-operatively
Table 4 Spearman rank correlations between GA and specific items of EQ-5D at 1 year post-operatively
Table 5 Spearman rank correlations between GA and specific items in ODI at 1 year postoperatively

Discriminative ability of PROMs with GABACK as reference criterion

In Fig. 3 the AUCs for DDD patients for VASBACK, ODI, and EQ-5D are visualized with GABACK as the reference criterion defining success and failure. Only DDD patients that completed all the PROMs at baseline and at follow-up are included (n = 7372). At 1 year post-operatively (Fig. 3a) the AUC for VASBACK was 0.924 (95% CI 0.918–0.931), for ODI 0.912 (95% CI 0.906–0.919) and for EQ-5D 0.882 (95% CI 0.874–0.890). The 95% confidence intervals do not overlap, indicating that the EQ-5D has a statistically significantly lower AUC than the other two outcome measures. When using the Δ-scores (Fig. 3b), the AUC for VASBACK was 0.864 (95% CI 0.855–0.873), for ODI 0.877 (95% CI 0.868–0.885) and for the EQ-5D 0.752 (95% CI 0.739–0.764).

Fig. 3 Receiver operating characteristic (ROC) curve plots for (a) the absolute scores at 1 year post-operation and (b) the score changes of VASBACK, ODI and EQ-5D with GASUCCESS/FAILURE as the reference criterion. The analysis was performed on patients diagnosed with degenerative disc disease. N in Success group = 4591. N in Failure group = 2781. Missing = 6464. GA Global Assessment for back pain, VAS BACK Visual Analogue Scale for back pain, ODI Oswestry Disability Index, EQ-5D Euroqol-5-Dimensions Index Score

Discussion

The results suggest that GA mainly measures and reflects the patient's pain status, as it correlates well with the condition-specific VAS and ODI, as well as with pain-specific items within the quality-of-life PROMs. This was expected given that GA specifically asks about the change in perceived pain. This finding is consistent with the results of a previous study [14]. As also expected, the degree of association with EQ-5D, and with domains other than “Bodily Pain” within SF-36, was lower. The directions of the correlations were as postulated.

GA further appears to be an appropriate discriminator between a successful outcome and failure, since there were clear cut-off points in VAS and ODI between patients assessing themselves as “pain free or much better” and those assessing themselves as “somewhat better, unchanged or worse”. This was not seen with EQ-5D. Our results indicate that GA, used as a reference criterion (for instance, in the calculation of minimal important change thresholds), can increase the interpretability of pain-specific PROMs. It is debatable whether GA is applicable as an anchor for generic quality-of-life measures, since it specifically asks about pain. However, GABACK demonstrated similar correlation coefficients with EQ-5D to those for ODI for all three diagnosis groups. This illustrates the importance of pain as a determinant of quality of life in individuals with degenerative spinal disorders. To increase the accuracy and decrease the bias of a transition question, it is recommended that it should address a specific construct that is anchored to a specific point in time [22]. We believe that these criteria are fulfilled for GA.

Our study further suggests that the respondents, irrespective of diagnosis, base their answers primarily on their present symptom state and, to a lesser degree, reflect on their pain in a longitudinal way, which should be the case if GA worked as an ideal transition question. During the last decades, several reports have been published that support this finding [11, 13, 23]. Two theories that are frequently used to explain this phenomenon are recall bias, as described by Ross in his paper on implicit theory of change [24], and response shift, as defined by Sprangers and Schwartz [9]. Consequently, we may question the validity of this type of question. The latter form of bias, i.e., response shift, however, applies to all PROMs, and one can argue that neither a PROM that measures two points in time with an interval of ≥ 1 year, nor a retrospective PROM, is an appropriate outcome measure. Due to these limitations, the concept of a patient acceptable symptom state (PASS) [25] was introduced, suggesting that a satisfactory outcome of surgery should not be expressed in terms of change but rather as an absolute score at follow-up beyond which the patients express an acceptable symptom state. Van Hooff et al. [26] proposed the threshold for such a state for ODI as being 22 in a population of lumbar spine surgery patients. GA could serve as the reference criterion for calculations of acceptable symptom states for VAS and ODI and other condition-specific PROMs. The corresponding thresholds for ODI and VASBACK using GABACK/SUCCESS as the reference for the PASS were 27 and 32.5, respectively, in the DDD group. For the whole population the PASS for ODI was also 27 and for VASBACK 31.5.
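How a PASS-type threshold can be derived from a dichotomous reference criterion is sketched below. The text does not state which cut-off rule was applied, so this example assumes, purely for illustration, the cut-off that maximizes sensitivity + specificity − 1 (the Youden index) on a scale where a lower score is better. The function name and the data are hypothetical, and the result is unrelated to the thresholds reported in the study.

```python
def pass_threshold(scores, is_success):
    """Return (cut-off, Youden index) for the rule 'score <= cut-off'
    meaning an acceptable symptom state. Assumes lower score = better.

    Assumption: Youden-index maximization; the source does not specify
    the rule that was actually used.
    """
    n_pos = sum(is_success)
    n_neg = len(is_success) - n_pos
    best_cut, best_j = None, -1.0
    for c in sorted(set(scores)):
        # True positives: successes at or below the candidate cut-off.
        tp = sum(1 for s, ok in zip(scores, is_success) if ok and s <= c)
        # True negatives: failures above the candidate cut-off.
        tn = sum(1 for s, ok in zip(scores, is_success) if not ok and s > c)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_cut, best_j = c, j
    return best_cut, best_j


# Invented toy data: post-operative ODI and dichotomized GA (True = success).
odi = [6, 14, 20, 30, 24, 40, 52, 28]
success = [True, True, True, True, False, False, False, False]
cut, youden = pass_threshold(odi, success)
print(cut, round(youden, 2))
```

Other rules, such as the point closest to the top-left corner of the ROC curve or a percentile of the scores in the success group, can give different thresholds, which is one reason why PASS values from different studies are not always directly comparable.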

The discriminative ability of ODI and VASBACK with GABACK as the reference criterion for DDD, tested in ROC curve analyses, corresponded to AUCs above 0.90 (interpreted as “excellent”) for the post-operative scores and above 0.86 (“good”) for the Δ-scores. Compared to the Hägg et al. study [15], which demonstrated a sensitivity and specificity for the Δ-scores of 75%, the current real-life database analysis strengthens the case for GA as a reference criterion.

There are several limitations to the present study. The percentage of responders 1 year postoperatively in Swespine, and, therefore, in our study, was 55–68%, which would be unacceptable in a randomized controlled study. In registry-based studies, a large number of non-responders is common. In the analyses where comparisons of PROMs are made, completion of all PROMs before as well as after surgery is required. As a consequence, there is a substantial decrease in eligible patients for these analyses and there is, therefore, a risk of selection bias. The results of the non-responder analysis, however, suggest this risk to be low. There is evidence that a high number of missing values does not distort the outcome data in a registry [27]. However, the implications of non-response bias need further investigation.

If there were a gold standard for the assessment of effectiveness in spine surgery, it would of course be better to correlate GA to it. The fact that outcome instruments, such as GA, are used as proxies for a gold standard prevents the use of common psychometric methods for validity testing.

Overall the results indicate that GA, used as a reference criterion, can increase the interpretability of pain-specific PROMs. However, its applicability as an anchor for the quality-of-life PROMs is less obvious. The results on the one hand support the use of GA as an outcome measure of the present state of pain and physical function, while on the other hand demonstrate its limitations as a measure of change.

The findings in this study are exploratory and need to be confirmed in a study that examines potential confounders. Further studies that explore how different variables, such as gender, age, smoking habits, and social status, affect the response pattern of GA are needed. It has also not been explored how GA works in a non-surgically treated population. Since GA appears to have the same function as a PASS, a preliminary threshold for ODI of 27 was suggested, which is rather high in comparison with 22, as demonstrated in the Van Hooff et al. study [26]. This difference needs further investigation.

Conclusion

The single-item outcome measure global assessment (GA) shows a stronger correlation to postoperative scores in multi-item patient-reported outcome measures (PROMs) than to their corresponding change scores and a somewhat stronger correlation to pain-specific PROMs than to quality-of-life PROMs. GA can be an appropriate tool in the measurement of effectiveness of degenerative lumbar spine surgery and may work as a discriminator of success and failure in the interpretation and validation of pain-specific PROMs.