Background
Imaging is a fundamental tool in medicine and especially in personalized medicine [
1]. Medical imaging in oncology is important for diagnosis, staging, treatment and response assessment. However, data extracted from radiological imaging have traditionally largely been qualitative, limiting the role of imaging in precision medicine. Only recently, imaging has been recognized as non-invasive biomarkers by extracting a large amount of data using mathematical image-analysis methodologies [
2,
3]. Imaging-based biomarkers have therefore found their way into prognostic models to predict clinical outcome in investigative settings [
4].
Radiomics refers to the extraction of a large number of quantitative features from medical images [
5]. In addition to using standard imaging tumour characteristics, such as tumour volume, contrast enhancement, or maximum standardized uptake value (SUV
max), numerous other parameters, which may not be visible to the naked eye, may be extracted with radiomics [
6,
7]. Radiomic features quantitatively describe different tissue characteristics, such as grey-value distribution or inter-pixel relationships. They can be categorized into shape, intensity, texture and filter-based (wavelet) features [
5]. Radiomics has been applied to a variety of imaging, including computed tomography (CT), magnetic resonance imaging (MRI) and positron emission tomography (PET), with good prognostic power for different entities in a research setting [
8‐
13].
Biomarkers are objective, quantifiable characteristics of a biological process [
14]. Biomarkers are commonly seen as molecular markers, measured in biological samples, such as blood or tissue. However, medical parameters and indexes, such as heart rate or blood oxygen saturation, and imaging-based biomarkers can function as biomarkers following the same principles. Radiomics, as image-based biomarkers, has emerged as a novel approach in precision medicine, as it allows for thorough multi-modality image assessment accounting for intratumoural heterogeneity and change over time in a non-invasive, fast and affordable way by extracting a large number of phenotypic tumour characteristics from routine imaging [
3,
15‐
18]. Radiomic features are currently being used in research studies, but before becoming part of clinical decision making, further validation and qualification are needed [
19]. Addressing variability of PET imaging and radiomics methodology has been called for by several studies [
20,
21].
One of the major strengths of radiomic biomarkers is that they can be extracted non-invasively, from routinely acquired imaging, which makes it cost-effective [
5]. However, the protocols in these routinely acquired images have been mostly optimized for qualitative assessment. Therefore, image quality often varies among centres or among scanners, as well as over time. In PET imaging, several studies showed high instability rates of radiomic features depending for example on tumour motion, delineation, image reconstruction or image resampling [
4,
22‐
27]. These robustness studies were based on phantom investigations or repeated imaging of the same patients. They are however often limited to investigation of one single cause of feature instability. Moreover, the robustness studies mostly focused on the stability of the features without its implication on the prognostic power of PET radiomics.
This study aims to investigate the value of preselection of robust radiomic features in clinical routine PET images, which are subject to the aforementioned variations, to predict clinical outcomes in locally advanced non-small cell lung cancer (NSCLC). If indeed prognostic models based on robust PET-based radiomic features alone are feasible, image acquisition standardization may not be necessary, allowing for wider implementation of the methodology and higher generalizability of the results. On the other hand, it might be that robust preselection removes features with high prognostic values and image standardization is therefore the preferred option. Patient-based datasets were used to evaluate three sources for radiomic feature instability: tumour delineation, attenuation correction and tumour motion. Consecutively, throughout stable features for tumour delineation, attenuation correction and tumour motion were tested for prediction of event-free survival (EFS) and overall survival (OS) using standardized image datasets. For comparability, two sets of models using standardized datasets were built, with the first set using all available features, independent of their robustness, and the second one using robust features only, to test the hypothesis whether EFS and OS prediction using robust features in locally advanced NSCLC using a real-life highly heterogenous dataset is feasible, or if image standardization is required.
Discussion
Event-free survival prediction of locally advanced NSCLC based on radiomics models was successfully performed using multi-centre clinical routine FDG-PET datasets and validated using an independent internal VC. The 12-month EFS model using LHL coefficient of variation (wavelet feature), LHH neighbouring grey-level dependence matrix high dependence high grey-level emphasis (wavelet feature) and HLH skewness (wavelet feature) resulted in the highest AUC in validation (AUC = 0.74). Increased LHL coefficient of variation as well as decreased LHH neighbouring grey-level dependence matrix high dependence high grey-level emphasis and HLH skewness were associated with no events at 12 months, suggesting that more heterogeneous FDG uptake pattern is associated with worse prognosis, which has been observed in other studies [
37,
38]. Except for 12-month OS, models using preselected robust features only, could not be built using our feature selection scheme. The 12-month OS model with robust feature preselection yielded poor performance (AUC = 0.53). This indicates that robust feature preselection excluded a prohibitive large number of important radiomic features from the prognostic models. It can therefore be extrapolated that a clean, standardized dataset with similar imaging quality, especially in terms of attenuation correction and motion compensation, is required for generalizable and transferable EFS and OS prediction allowing for all available features to be potentially included.
While a majority of radiomics studies in locally advanced NSCLC assessed the prognostic value of CT-based features, our study is adding to the emerging body of the literature on PET-based models [
4,
16,
22,
23,
37‐
44]. Although it is challenging to systematically compare NSCLC PET-based radiomics studies at this time due to different cohorts in terms of stage and treatment, as well a different radiomic, biological and clinical features tested, varied statistical methodology employed and common scarcity of model validation, some studies indicate a link between higher heterogeneity and worse prognosis [
37,
38]. Interpretation of radiomic features as biomarkers may be challenging for clinicians, given their large number and not yet well-understood association with tumour characteristics and biological processes. Imaging variations not only affect the robustness of features but may also influence the association between imaging features and the underlying tumour activity distribution. In a phantom study, the association between PET radiomic features in the lung and underlying intratumoural heterogeneity was shown to be strongly influenced by image acquisition and PET imaging reconstruction [
45]. While these results need additional validation for wavelet features, they point to an inherent problem in radiomics. In addition to modelling strategies, where our results indicate improved modelling performance with standardized imaging, feature association with tumour activity distribution can be maintained across all patients when using standardized imaging settings. On the other hand, simpler imaging-based biomarkers have already found their way into clinical practice, such as the SUV
max [
4]. However, conventional PET-parameters, such as SUV
max, SUV
peak or SUV
mean were not selected into our final models, as they showed lower prognostic value than more complex radiomic features. While this is in agreement with previous publications, other studies have shown some prognostic value of SUV descriptors [
37,
38,
42,
43,
46]. In our study, a comprehensive number of radiomic features (
n = 1404) were tested for their predictive value, including shape, texture, intensity and wavelet features. This is in contrast to other studies, where a more selective number of features was assessed [
16,
38‐
40,
42‐
44]. This comprehensive approach allowed to identify the prognostic value of wavelet features, which is different from other studies, where mostly textural features were included in predictive models. The majority of PET-based outcome prediction studies in NSCLC used single-institution imaging, thereby minimizing heterogeneity within datasets. Ohri et al. tested 43 textural features, metabolic tumour volume (MTV) and SUV
max, using a multi-centre trial dataset, and identified one texture feature, SumMean, an indicator of homogeneity, as prognostic for OS among patients with large primary tumours [
40]. The authors hypothesize that the strong association may indicate feature robustness in their multi-institutional dataset, thereby increasing generalizability [
40]. A second multi-centre study by Arshad et al. claimed that slice thickness and matrix size did not significantly affect the predictive feature vector discovered (FVX), concluding that robustness of FVX permits this variable to be applied in multi-institutional studies [
42]. In our study, standardized multi-centre PET imaging was used to allow for generalizable comparison of prediction models, using robust feature preselection and all available features.
After the initial success of radiomic models for prediction of different outcomes, robustness of the features started to be frequently discussed in the context of multi-centre validation [
3,
4]. By its design, PET imaging characterizes tumour biology by the use of different radiopharmaceuticals, which makes it a perfect modality for prognosis assessment. However, an inherent complex nature of image acquisition (processes linked to radiotracer uptake, signal acquisition, image reconstruction and postprocessing) makes it a challenging modality to analyse with quantitative methods. Several initiatives exist worldwide to improve the comparability between images acquired in different institutions [
47‐
51]. Rapid development of detector technology and reconstruction methods makes collection of large and homogenous datasets difficult. Recently, in the context of quantitative texture analysis, specialized PET radiomics phantoms have been investigated to depict heterogeneity of PET tracer uptake [
4]. While phantoms may facilitate the analysis of a larger number of confounding factors in a single study, to date, most radiomics robustness studies have focused on a single factor only.
The secondary investigation focus of our study was the robustness of radiomic features in presence of multiple confounding factors (delineation variability, attenuation correction and tumour motion). Only 13% of the features were robust against all three studied factors. In the individual studies, a higher percentage of stable features was observed for delineation variability (78%) than for attenuation correction (31%) and motion difference (25%). This translated into a larger number of stable and prognostic features (AUC > 0.6) in the presence of interobserver delineation variability than different attenuation correction and motion compensation. However, the types of stable features differed considerably between robustness studies. For delineation variability, shape features were found to be the least stable ones, which is expected since shape is directly affected by different delineations. On the other hand, wavelet and texture features, which are less dependent on boundary definition, displayed a higher number of robust features. In the case of attenuation correction and motion studies, findings were the opposite. Shape features were less affected by these factors, but the number of stable texture and wavelet features was lower. Texture and wavelet features constitute a majority of the studied features, and thus, the overall percentage of features stable against attenuation correction and motion was low. Impact of delineation variability and motion on robustness of PET radiomic features was also studied by other groups and results were comparable to ours. For delineation variability, a study by Leijenar et al. reported robustness of 91% of the features for NSCLC patients [
23]. The value is 13% higher than in our study; however, they used a less strict criterium of ICC > 0.8 [
23]. A study conducted by Takeda et al. found 86% (ICC > 0.8) robust features for interobserver variability, but only seven radiomic features were investigated [
39]. The impact of motion on feature robustness was studied by Oliver et al. finding that the percentage of stable features between respiration-gated images and averaged images over all phases was 26.2% [
22]. This value is very close to the one obtained in our study; however, a detailed comparison is not possible as stability was not defined using ICC.
Strengths of our study include a prospective multi-centre training dataset, a large number of tested radiomic features, radiomic feature robustness assessment, model validation on a separate dataset and stratification by disease stage and treatment. This study adds to our recent publication on CT-based radiomics to predict OS of locally advanced NSCLC and shows that in contrast to CT-based radiomics, prognostic PET-based radiomics models require harmonized PET imaging, as robust feature preselection excluded a prohibitive large number of important radiomic features from the prognostic models [
52]. Generalizability of our models is therefore restricted to patients who underwent similar PET imaging. The importance of using a clean imaging dataset was further illustrated by a recent phantom study by Ger et al. which found that most radiomic feature values showed good reliability when PET imaging protocol parameters were within clinically used ranges, but that interscanner variability was similar to interpatient variability, leading the authors to caution radiomics analyses on patients scanned on different PET scanners [
53]. While our results support the need for harmonized PET imaging and we advocate for standardization of protocols, the impact assessment of different PET scanners was not the thrust of our study. Often heterogeneity by different PET scanners is unavoidable in a clinical setting. Another limitation of our study may be the restricted reproducibility of our results, as they depend on an in-house software. However, our software was benchmarked within the Image Biomarker Standardization Initiative (IBSI) [
18]. While the TC consisted of prospectively acquired data following a strict trial protocol, the VC consisted of a retrospective dataset potentially allowing for introduction of patient selection bias. Further, while our study stratified for stage and surgical treatment, it did not include other clinical or molecular outcome predictors, such a smoking habits or epidermal growth factor receptor (EGFR) status, which may influence EFS and OS and could hypothetically improve the models’ predictive power. However, inclusion of clinical parameters was out of scope of this study, as we aimed to investigate the optimal modelling strategy for robust multi-centre PET radiomics models. Similarly, only the primary tumour was taken into account in our study, as it is commonly the case in radiomics studies and in accordance with the study objectives. While we were able to categorize the patient cohort into low-risk and high-risk groups, biological correlation of these groups and individual radiomic factors used in the prognostic models remain unknown. In addition, sample size is another limitation of our study, which is related to availability of data. However, methodological steps were taken to address potential related statistical issues such as using PCA to reduce dimensionality, excluding correlated features and evaluating model performance in cross-validation using the TC as well as testing the models in a completely separate VC as recommended by the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement [
54]. Additional data is needed to further validate the study findings. Further, although this study investigated more than one robustness factor, the set of factors was still limited. However, two factors (motion and attenuation correction) are linked to new developments in the field, motion correction (with the recent advent of data-driven deviceless gating techniques) and PET/MR hybrid devices, and thus make them relevant factors to be studied. A recent study showed that randomization of voxel intensities had an impact on model prognostication [
55]. Since the goal of this study was to simulate a real-world environment, this aspect remained outside the scope of this work. Another limitation is the fact that all the robustness studies were conducted on different datasets with a small number of patients. This may be solved in the future with introduction of highly specialized heterogenous PET phantoms. Only one setting of radiomics calculation parameters was investigated (bin size, voxel size, HU range). The results of robustness studies might be influenced by this choice but considering the low number of patients in the robustness studies and moderate size of datasets in the prognostic modelling step, it was deemed important to limit the number of extracted features.
In conclusion, a PET-based radiomics model using multi-centre datasets to predict EFS in locally advanced NSCLC was successfully established. However, PET acquisition standardization is necessary, as prediction models using robust features alone could not be built or showed poor performance. Therefore, a standardized dataset with similar image acquisition and reconstruction is required for EFS prediction based on PET-based radiomics models.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.