Introduction
Locally advanced rectal cancer (LARC) is typically treated with neoadjuvant chemoradiotherapy (CRT) followed by surgery [
1]. In up to 15–27% of the cases a complete tumor remission is achieved as a result of CRT [
2]. This has contributed to the recent paradigm shift in rectal cancer treatment towards organ preservation (e.g., “watch-and-wait” or local treatment of small tumor remnants) for selected patients with clinical evidence of a very good or complete tumour response after CRT. For these organ-preservation approaches, the morbidity and mortality risks associated with major surgery are avoided, with good reported clinical outcomes regarding quality of life and overall survival [
3,
4]. Predicting the response to CRT and thus the chance of achieving organ preservation before the start of treatment, i.e., at baseline, may open up new possibilities to further personalize neoadjuvant treatment strategies depending on the anticipated treatment benefit, particularly for smaller tumors that do not necessarily require CRT for oncological reasons.
Recent studies have suggested a possible role for imaging in this setting [
5‐
9]. Promising results have been reported for clinical staging variables (MRI-based TN-stage) [
6,
7], tumor volume [
10‐
12], and functional parameters derived from diffusion-weighted imaging (DWI) [
8,
9] or dynamic contrast-enhanced MRI (DCE) [
13] to predict rectal tumor response on baseline MRI, and more recently also for more advanced quantitative variables derived using modern post-processing tools such as radiomics [
5]. However, the available evidence mainly comes from single-center studies and comprehensive multicenter studies incorporating clinical, functional as well as advanced quantitative imaging data are scarce [
14,
15]. Moreover, the effects of multicenter data variations and diagnostic staging differences between observers so far remain largely uninvestigated. Prediction studies on larger multicenter patient cohorts with imaging data acquired and analyzed as part of everyday clinical routine are therefore urgently needed to develop a more realistic view of the potential role of image-based treatment prediction models in general clinical practice.
In this retrospective multicenter study, we therefore set out to develop and validate a model to predict response to neoadjuvant treatment in rectal cancer using rectal MRIs acquired for baseline staging in 9 different centers in the Netherlands, intended to be a representative sample of rectal imaging performed in everyday clinical practice.
Discussion
This multicenter study shows that when combining clinical baseline variables with image-based staging quantitative variables, overall model performance to predict neoadjuvant treatment response in rectal cancer is disappointing, with externally validated AUCs ranging between 0.60 and 0.65 to predict either a complete response (ypT0) or a good response (TRG1-2). Best model performance was achieved when combining clinical baseline information (e.g., time to surgery) and image-based staging variables (e.g., cT-stage). Quantitative imaging features had no added value. Notably, model performance was considerably better when including modern staging parameters such as cT-substage, extramural invasion depth, and EMVI, compared to more traditional staging including only simplified cTN-stage and MRF involvement. Moreover, model performance seemed to be affected by staging variations between observers with better performance when staging was performed by a dedicated expert compared to the original staging reports acquired by a multitude of readers.
Previous studies typically included staging variables such as cTN-stage as part of the “baseline patient variables,” which implies that these are “objective” variables with little variation between readers [
31,
32]. While measurement variations are commonly considered when analyzing quantitative imaging data, our results demonstrate that interobserver variation is also an important issue to take into account for the more basic staging variables. The improved model performance when including also modern staging variables such as cT-substage and EMVI in the expert re-evaluations further highlights the importance of high-quality diagnostic staging using up-to-date guidelines. The clinical impact of ‘state-of-the-art’ staging was also demonstrated by Bogveradze et al, who showed in a retrospective analysis of 712 patients that compared to “traditional” staging methods, advanced staging according to recent guideline updates would have led to a change in risk classification (and therefore potentially in treatment stratification) up 18% of patients [
33]. The fact that our cohort dates back as far as 2008 and covers a 10-year inclusion period explains why many of these advanced staging variables could not be derived from the original reports. The use of older data will likely also have impacted the quality of the images and thus the quantitative imaging features derived from the data. Following developments in acquisition guidelines and software and hardware updates, the image quality will have evolved over time. This is also reflected by the large number of different imaging protocols including 112 unique T2W and 94 unique DWI protocols. The question, therefore, remains if and how model performance would have improved using only state-of-the-art and/or more harmonized (prospectively acquired) MRI data. In our current dataset, quantitative imaging features showed no added benefit to predict response. This contradicts previous single-center and smaller bi- and tri-institutional studies that achieved more encouraging AUCs ranging from 0.63 to as high as 0.97 [
5,
14,
15]. These previous results are likely at least in part an overestimation of how such models would perform in everyday practice, as especially earlier pilot studies are hampered by limitations in methodological design (e.g., small patient cohorts, re-using of training data for testing, and multiple testing) as also outlined in several review papers reporting on the quality and/or reproducibility of image biomarker studies [
5,
19,
34‐
38]. The fact that most previous studies have been single-center reports will likely have also played an important role. Though reflective of data acquired in everyday practice, our results confirm the known difficulties of building generally applicable prediction models using heterogeneous retrospectively collected multicenter data. While some data variations are necessary to identify robust features to vendor and acquisition differences, too much variation will negatively impact model generalizability. Attempting to directly compare and investigate the effects of multicenter (heterogeneous) versus single-center (homogeneous) modelling using our own data, we mimicked a single-center comparison by repeating our study analyses on a homogeneous single-center subset within our cohort. Though results have to be interpreted with caution considering the wider confidence intervals and lack of external validation in the single-center arm, this comparison suggests that the best-performing model indeed appeared to be better for the homogeneous single-center subset (AUC 0.79) than for the multicenter (AUC 0.69) cohort. Though full data harmonization will likely never be achieved in daily clinical practice, these findings do support a need for further protocol guidelines and standardization to benefit future multicenter research.
There are some limitations to our study design. As mentioned above, data was acquired over the time span of a decade including scans acquired using outdated protocols dating back as far as 2008. A detailed analysis of the impact of these spectrum effects was outside the scope of this study, but a preliminary analysis (results not reported) showed that the impact of temporal changes was negligible. All segmentations were performed on high b-value DWI and then copied to T2W-MRI and ADC maps. Although care was taken to include anatomical information from T2W-MRI during segmentation, ideally a separate segmentation would have been performed. Finally, the comparison between the original basic staging reports and the advanced staging performed as part of this study was influenced by the fact that all re-evaluations were done by a single reader. In contrast, original staging reports were performed by a multitude of readers with varying levels of expertise. Due to the time-consuming nature of the expert re-evaluations (and segmentations), it was unfortunately not deemed feasible to include an independent extra reader.
In conclusion, this multicenter study combining clinical information and MRIs acquired as part of everyday clinical practice over the time span of a decade rendered disappointing performance to predict response to neoadjuvant treatment in rectal cancer. The best results were obtained when combining clinical baseline information with state-of-the-art image-based staging variables, highlighting the importance of good quality staging according to current guidelines and staging templates. No added value was found for quantitative imaging features in this multicenter retrospective study setting. This is likely at least in part the result of acquisition variations, which is a major problem for feature reproducibility and thus model generalizability. To benefit from quantitative imaging features—assuming a predictive potential—further optimization and harmonization of acquisition protocols will be essential to reduce feature variation across centers. For future research, it would also be interesting to see how model performance may improve when combining the information that can be derived from imaging with other clinical biomarkers such as molecular markers (e.g., DNA mutations, gene expression, microRNA) [
39,
40], blood biomarkers (e.g., CEA, circulating tumor DNA) [
39,
41], metabolomics (e.g., metabolites, hormones, and other signaling molecules) [
42], organoids [
43], and immune profiling [
44].
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.