Introduction

Radiomics is the high-throughput quantitative analysis of medical imaging to facilitate model-based treatment decisions1,2. A prevalent approach relies on the computation of image biomarkers (features) within a region of interest (ROI). In this approach, features quantify different aspects of the ROI, such as mean intensity, volume and texture heterogeneity. Variations in patient positioning, image acquisition and segmentation affect each feature to varying degrees3,4. If radiomic models use features that are not robust against such influences, they will perform poorly when applied to new data5. Assessing feature robustness is thus recommended to improve the generalisability of radiomic models.

Non-robust image features are commonly identified using test-retest imaging6,7,8,9,10. In test-retest imaging, the same region of interest is imaged twice within a time interval of minutes to days, usually with the same acquisition protocol. Consequently, these two images are similar, but not identical, which allows the identification of non-robust features. After identification, non-robust features are excluded from further analysis.

Although the identification of robust features is important, implementing test-retest imaging for every radiomic study has been difficult to achieve for several reasons. First, feature robustness depends on the phenotype of interest as well as the imaging modality. This means that information concerning feature robustness cannot be transferred between studies on different phenotypes11 and modalities7. Furthermore, feature values depend on multiple factors, including the voxel size and discretisation used12,13,14. Thus, even if a previous study determined feature robustness for a particular phenotype and modality, the results may not be transferable due to the use of different computational settings. Second, test-retest imaging may generally be difficult to obtain, as it is not part of the clinical routine. Acquiring test-retest imaging would thus require additional resources in terms of personnel and imaging time, and, potentially, an increased patient radiation dose. An alternative would be to use a suitable publicly available test-retest data set, but such data are likewise scarce.

It would therefore be convenient if feature robustness against perturbations could be assessed from single images. To do so, we can draw on methods prevalent in deep learning-based computer vision, where networks are constructed to be invariant to various perturbations, e.g. noise, rotation and translation15. To achieve such invariance, perturbations are applied on purpose to generate distorted images, which are subsequently used as input data to develop deep learning models. The same principle may apply to the hand-crafted features considered in this work. We hypothesise that perturbations of single images may successfully identify the majority of features that are not robust in test-retest imaging. The aim is thus to identify perturbations that minimise the number of false positive robust features, using robustness in test-retest imaging as the reference.

Results

Two test-retest data sets of computed tomography (CT) images were assessed, namely: (I) a publicly available non-small cell lung cancer (NSCLC) cohort of 31 patients; and (II) an in-house head and neck squamous cell carcinoma (HNSCC) cohort of 19 patients.

After delineating the gross tumour volume (GTV), the CT images were perturbed by rotation (R), Gaussian noise addition (N), translation (T), volume adaptation (growth/shrinkage of the ROI mask; V) and supervoxel-based contour randomisation (C), see Fig. 1 and Table 1. Eighteen combinations of perturbations were created by chaining perturbation operations. All chains involved repetition with different settings or randomisation. Morphological, statistical and texture features (4032 in total) were computed from the GTV ROI in each distorted image.

Figure 1

Perturbation examples. To perturb an image (blue) and the region of interest mask (orange overlay), the original image is translated, rotated, has noise added, and has its mask adapted and randomised. Translation and rotation change both the image and its mask, whereas noise only distorts the image. Volume adaptation and contour randomisation change the mask by adding (green overlay) and removing voxels (red overlay). Note that translation and rotation require additional interpolation (not shown).

Table 1 List of perturbations, with their abbreviation and the number of different images generated by each perturbation.

Robustness of each feature was measured by the intraclass correlation coefficient ICC(1,1)16. We computed the ICC of a feature either between the test and retest images (test-retest ICC) or between the perturbed images of each perturbation chain (perturbation ICC), see Fig. 2. The 95% confidence interval (CI) of the ICC was then compared with a threshold of 0.90 to determine robustness17. A feature was considered robust if its entire CI was at or above 0.90, non-robust if its entire CI was below 0.90, and of indeterminate robustness if the CI overlapped with the threshold.

Figure 2

Workflow to determine the test-retest and perturbation intraclass correlation coefficients (ICC) for each feature. The test-retest ICC was calculated directly between the same features in both images. To derive the perturbation ICC, an ICC was first calculated between feature values in perturbations of image 1 (ICC 1) and then again in perturbations of image 2 (ICC 2). The perturbation ICC is the average of ICC 1 and 2.

A table containing all estimated ICC values and their 95% confidence intervals for all features and both cohorts is provided as supplementary data.

Comparison between NSCLC and HNSCC cohorts

To validate the basic premise that feature robustness is dependent on the phenotype, we compared feature robustness based on the test-retest ICC in both cohorts.

In the NSCLC cohort 2310 (57.3%) features were found to be robust, 597 (14.8%) were non-robust and 1125 (27.9%) had an indeterminate robustness. In the HNSCC cohort 582 (14.4%) features were robust, 1369 (34.0%) were non-robust and 2081 (51.6%) had an indeterminate robustness.

In total, 454 (11.3%) and 280 (6.9%) features were robust and non-robust, respectively, in both cohorts. Additionally, 656 (16.3%) features were robust in the NSCLC cohort, but not in the HNSCC cohort, and 35 (0.9%) features were robust in the HNSCC cohort, but not in the NSCLC cohort. The remainder could not be compared due to indeterminate robustness in the NSCLC cohort (526; 13.0%), the HNSCC cohort (1482; 36.8%) or both cohorts (599; 14.9%).

Robustness under image perturbations

The fraction of robust features for test-retest imaging and image perturbations is shown in Fig. 3. In both cohorts, the N perturbation yielded the highest fraction of robust features (NSCLC: 95.0%; HNSCC: 97.4%), which was higher than the fraction of robust features determined by test-retest imaging (NSCLC: 57.3%; HNSCC: 14.4%). The lowest fraction of robust features in the NSCLC cohort was identified by the TVC perturbation chain (32.9%), followed by RVC (33.3%), NTVC (33.7%), RNVC (34.2%) and RC (38.3%). In the HNSCC cohort, TVC (16.6%), NTVC and RNVC (both 16.7%), RVC (16.8%), VC (17.8%) and V (30.8%) identified the fewest robust features.

Figure 3

Overall robustness of features for test-retest and perturbation conditions. Robustness was determined using the 95% confidence interval (CI) of the intraclass correlation coefficient. Features with CI ≥ 0.90 were considered to be robust (+), CI < 0.90 non-robust (−), and indeterminate (0) otherwise. Perturbations are abbreviated, see Table 1: R: rotation; N: noise addition; T: translation; V: volume adaptation; C: contour randomisation.

Feature-wise comparison of perturbation and test-retest robustness

Test-retest and perturbation robustness were also compared directly for the same feature. In this comparison, a feature may be robust under both test-retest and perturbation conditions, non-robust under both, robust under test-retest conditions only, robust under perturbation conditions only, or of indeterminate robustness. Using test-retest robustness as a reference, these conditions represent true positive, true negative, false negative, false positive and indeterminate cases, respectively. The direct feature-wise comparison of robustness is presented in Fig. 4.
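These categories can be expressed as a simple decision rule; the sketch below is illustrative only (the function name is ours), with state labels '+', '-' and '0' matching the robust, non-robust and indeterminate states used in Fig. 4.

```python
# Illustrative mapping of test-retest (t_state) and perturbation (p_state)
# robustness states to the comparison categories in Fig. 4.
# States: '+' robust, '-' non-robust, '0' indeterminate.
def compare_states(t_state: str, p_state: str) -> str:
    if t_state == '0' or p_state == '0':
        return 'indeterminate'      # e.g. T0P+, T+P0 or T0P0
    if t_state == '+' and p_state == '+':
        return 'true positive'      # robust under both conditions
    if t_state == '-' and p_state == '-':
        return 'true negative'      # non-robust under both conditions
    if t_state == '-' and p_state == '+':
        return 'false positive'     # only robust under perturbations
    return 'false negative'         # only robust under test-retest conditions
```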

Figure 4

Feature-wise comparison of robustness under test-retest and perturbation conditions. Robustness was determined using the 95% confidence interval (CI) of the intraclass correlation coefficient. Features with CI ≥ 0.90 were considered to be robust (+), CI < 0.90 non-robust (−), and indeterminate (0) otherwise. By comparing robustness states between test-retest (T) and perturbation (P) conditions, a feature was either robust under both conditions (T+P+; true positive), non-robust under both conditions (T−P−; true negative), only robust under perturbations (T−P+; false positive), or only robust under test-retest conditions (T+P−; false negative). The state of the remaining features is either indeterminate due to overlap of the test-retest CI with the threshold (T0P−, T0P+), overlap of the perturbation CI with the threshold (T+P0, T−P0) or both (T0P0). Test-retest robustness was used as reference, and the corresponding column therefore only contains true positives and negatives, as well as indeterminate robustness. Perturbations are abbreviated, see Table 1: R: rotation; N: noise addition; T: translation; V: volume adaptation; C: contour randomisation.

No perturbation identified every feature that was non-robust under test-retest conditions in both cohorts. The number of false positives differed between perturbations and cohorts. On average, perturbation chains yielded fewer false positives in the NSCLC cohort than in the HNSCC cohort (2.0% vs. 9.4%).

In the NSCLC cohort, the RC perturbation chain produced the lowest fraction of false positives (0.0%), followed by RVC (0.2%), RNVC (0.5%) and NTVC (0.7%). The lowest false positive fraction in the HNSCC cohort was produced by the RNVC perturbation chain (1.7%), followed by RVC (1.8%), and TVC and NTVC (both 1.9%). In the HNSCC cohort, the RC perturbation chain led to 5.7% false positives.

Discussion

We compared several methods for perturbing images to determine feature robustness. The perturbation chains that combine rotation or translation with volume adaptation and contour randomisation (RVC, RNVC, TVC, NTVC) led to a low number of false positives in both cohorts, using test-retest robustness as the reference, and were otherwise comparable. Hence, any of these chains may be used as an alternative to test-retest imaging to assess feature robustness.

Other perturbation methods performed poorly, particularly if only a single kind of perturbation was used, such as noise addition, rotation or translation alone. The combination of rotation and translation was not better than rotation or translation alone. Chaining perturbation methods that primarily alter the intensity content (noise, translation, rotation) with methods that update the region of interest mask (volume adaptation and contour randomisation) improved results, i.e. produced fewer false positives with regard to test-retest imaging.

A considerable difference in overall robustness was observed between the NSCLC and HNSCC cohorts. Specific image processing parameters or contributions of particular feature families are unlikely to cause this difference (Supplementary Notes 7 and 8). The difference is more likely caused either by inherent differences between tumour phenotypes11 or by limitations inherent to test-retest imaging in patients. As only two test-retest images are usually acquired in patients, the number of possible acquisition options that can be assessed is constrained. The lack of access to raw imaging data to assess different reconstruction settings compounds this limitation. In this study, two different image acquisition and reconstruction protocols were used in the HNSCC cohort, whereas only one protocol was used for test-retest imaging in the NSCLC cohort. In the HNSCC cohort, exposure and reconstruction kernels differed between protocols (Supplementary Note 1). The exposure between both HNSCC images differed by a factor of 4 on average, whereas exposure in the NSCLC set was similar between images. The HNSCC test-retest set may thus have captured differences in exposure. However, the effect of exposure and tube current on feature robustness is contested: Larue et al. and Mackin et al. both found that exposure had a marginal effect on feature robustness18,19, whereas Midya et al. found a more pronounced effect20. The HNSCC test-retest set may also have been affected by the difference in reconstruction kernels. Though both kernels in the HNSCC cohort produce smooth images, differences in reconstruction kernels may strongly affect feature values21,22.

Aside from the overall difference in robustness between the NSCLC and HNSCC cohorts, a large difference in the fraction of features with indeterminate robustness can be observed between both cohorts. This is reflected in the 95% confidence interval of the ICC value of each feature: the average width of the 95% confidence interval of the test-retest ICCs was 0.12 (NSCLC) and 0.35 (HNSCC). This indicates that feature values in the HNSCC cohort were less consistent between both images of the test-retest set, which may be related to the aforementioned difference in acquisition and reconstruction protocols. Yet, the decreased consistency between test and retest images may also be related to delineation uncertainties. The potential role of delineation uncertainties may be observed by comparing, in both cohorts, the single volume adaptation and contour randomisation perturbations with perturbations that only affect intensities. In the NSCLC cohort, these delineation-like perturbations affect feature robustness less than in the HNSCC cohort, which was also found by Pavic et al.23.

Image perturbation allows repeated measurements to be performed without the actual acquisition of multiple images, which could be considered an advantage over test-retest imaging. We consider three methods for incorporating repeated measurements into radiomics modelling. The first, straightforward, method is to include only robust features in the modelling process and to omit indeterminate and non-robust features. This method is commonly used when robustness is determined using test-retest imaging, and its implementation into a modelling workflow should therefore be easy5. Moreover, this method is useful when only a subset of the development cohort is perturbed, or when a separate data set is used for robustness analysis.

It should be noted that the number of features with indeterminate robustness decreases as the number of perturbations increases, because the 95% confidence interval of the ICC shrinks with an increasing number of repeated measurements. It is thus possible to increase the number of robust and non-robust features by increasing the number of perturbations, albeit with diminishing returns. Many studies sidestep this issue entirely by applying a threshold to the estimated ICC24 instead of its confidence interval. This criterion is less stringent than comparison against the confidence interval and may lead to the inclusion of features that have a probability of between 2.5 and 50% of actually not meeting the criterion. This is particularly risky if the confidence intervals are wide and overlap with ICC values < 0.50 (poor robustness) or 0.50 ≤ ICC < 0.75 (moderate robustness)17. Thus, if a confidence interval is provided with an ICC value, it would be preferable to use this interval instead of the estimated ICC for selecting robust features.
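To illustrate the difference between the two criteria, consider a hypothetical feature with a high ICC point estimate but a wide 95% confidence interval; all numbers and variable names below are invented for illustration.

```python
# Hypothetical feature with an ICC point estimate of 0.93 but a wide 95% CI
# of [0.71, 0.98]; all numbers are invented for illustration.
icc_estimate, ci_lower, ci_upper = 0.93, 0.71, 0.98

# Criterion based on the point estimate alone: the feature is accepted as robust.
robust_by_estimate = icc_estimate >= 0.90    # True

# Criterion based on the confidence interval: the lower bound falls below the
# threshold, so the feature is of indeterminate robustness and is not selected.
robust_by_ci = ci_lower >= 0.90              # False
```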

The second way to use repeated measurements for radiomics modelling is to average the measurements for each feature. Averaging suppresses noise and, as a consequence, the corresponding panel ICC is always higher than that of a single measurement16 and its 95% confidence interval narrower. The mean values of the features that are robust according to the panel ICC are then included in the modelling process. This method requires that all images in the development cohort are perturbed, and may thus be computationally more expensive than the first.
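This follows from the standard one-way random-effects formulation. Writing \(MS_B\) for the between-patient mean square, \(MS_W\) for the within-patient mean square and \(k\) for the number of repeated measurements (notation introduced here for illustration), the single-measurement and panel ICCs are

\[
\mathrm{ICC}(1,1)=\frac{MS_B-MS_W}{MS_B+(k-1)\,MS_W},\qquad \mathrm{ICC}(1,k)=\frac{MS_B-MS_W}{MS_B},
\]

so that, for positive ICC values, the smaller denominator of the panel ICC makes it at least as large as the single-measurement ICC.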

The final method builds upon the second, and is conceptually close to the use of image perturbations for deep learning. Instead of averaging values and selecting robust features prior to modelling, all values are included in the model development process. One advantage of this method is that information concerning the distribution of feature values within and across samples is not lost, and may be exploited during the model development process. Another advantage is that an explicit robustness threshold is not required. However, this method does require that all images in the development cohort are perturbed and may add complexity to radiomics modelling frameworks. A future study should compare the three methods and their effect on the performance of radiomic models.

One limitation of the current study is that we only assessed test-retest imaging based on computed tomography, as test-retest data sets for other modalities were not available to us. The proposed methodology should be assessed for other modalities, e.g. positron emission tomography (PET) and magnetic resonance imaging (MRI). Some image perturbation parameters, such as the volume of supervoxels, may require revision for other modalities.

Another limitation of the current study is that we did not directly assess the effect of expert delineation uncertainties. As mentioned before, delineation uncertainties also cause variability in feature values23. The volume adaptation and contour randomisation perturbations attempt to mimic this uncertainty, but a comparison against a data set with multiple delineations should be performed in the future.

In conclusion, we investigated the use of image perturbations to determine the robustness of radiomic features, using test-retest imaging as reference. Our findings indicate that perturbation methods that distort image intensities and deform the ROI mask (NTVC, TVC, RNVC and RVC) may be used as an alternative to test-retest imaging to determine feature robustness.

Methods

Test-retest cohorts

Two patient cohorts with test-retest computed tomography imaging were used: a publicly available non-small cell lung cancer cohort of 31 patients25,26 and an in-house cohort (DRKS 00006007) of 19 patients with locally advanced head and neck squamous cell carcinoma27. The NSCLC cohort is available from the Cancer Imaging Archive28. For the NSCLC cohort, two separate images were acquired within 15 minutes of each other, using the same scanner and acquisition protocol. Images in the HNSCC cohort were acquired within 4 days of each other using different protocols: one CT image was acquired for attenuation correction of 18F-Fludeoxyglucose positron emission tomography (PET), and the other for attenuation correction of 18F-Fluoromisonidazole PET. Image acquisition parameters for both cohorts are shown in Supplementary Note 1.

Informed consent was obtained from all patients. Approval for analysis of the in-house data set was provided by the local ethics committee (Ethikkommission an der TU Dresden: EK 177042017). This study was conducted according to relevant guidelines and regulations.

The GTV was delineated by experienced radio-oncologists (L.A., K.P., E.G.C.T) using the Raystation 4.6 treatment planning system software (RaySearch Laboratories AB, Stockholm, Sweden), and subsequently used as the region of interest.

Image processing

Image processing was conducted using the scheme and recommendations provided by the Image Biomarker Standardisation Initiative (IBSI)29. An overview of the processing steps is provided in Fig. 5, and further details may be found in the IBSI documentation. A complete overview of the image processing parameters, excluding perturbation-related parameters, may be found in Table 2, and are reported in compliance with the preliminary IBSI reporting guidelines29,30.

Figure 5

Image processing scheme with perturbations. A computed tomography (CT) image and a segmented gross tumour volume (GTV) are used as the input image data and the region of interest (ROI) respectively. The CT and ROI are processed to compute image features. Rotation, translation, noise addition, volume adaptation and contour randomisation are optional perturbation steps. Other image processing steps are detailed in the documentation of the image biomarker standardisation initiative (IBSI)29. IH: intensity histogram; IVH: intensity-volume histogram; GLCM: grey level co-occurrence matrix; GLRLM: grey level run length matrix; GLSZM: grey level size zone matrix; GLDZM: grey level distance zone matrix; NGTDM: neighbourhood grey tone difference matrix; NGLDM: neighbouring grey level dependence matrix. This figure is based on the image processing scheme in the IBSI document.

Table 2 Image processing parameters for both NSCLC and HNSCC data sets.

In short, after loading a CT image, DICOM RTSTRUCT polygons were used to generate a voxel-based segmentation mask for the GTV ROI. The image and mask were then both rotated over a set angle θ (optional). Gaussian noise, based on the noise levels present in the original image, was added to the image (optional). Subsequently, both image and mask were translated with a sub-voxel shift η (optional) and interpolated with prior Gaussian anti-aliasing (Supplementary Note 2). After interpolating to isotropic voxel dimensions, the image intensity values were rounded to the nearest integer Hounsfield unit, and the mask was re-labelled based on the partial voxel volume threshold. The mask was then grown or shrunk to alter the volume by a fraction τ (optional), before being perturbed by supervoxel-based contour randomisation31 (optional). The mask was subsequently copied to generate an intensity mask and a morphological mask. The intensity mask was re-segmented to an intensity range which includes only soft-tissue voxels. Voxels with intensities deviating more than three standard deviations from the mean of the ROI were excluded from the intensity mask as well32,33. The image and both masks were subsequently used to compute radiomic features, with several feature families requiring additional discretisation (Supplementary Note 3).
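To make the re-segmentation step concrete, a minimal numpy sketch is given below; the soft-tissue intensity range used here is an example only (the actual range is listed in Table 2), and the function and variable names are illustrative rather than taken from the in-house framework.

```python
import numpy as np

def resegment_intensity_mask(image: np.ndarray, mask: np.ndarray,
                             intensity_range=(-150.0, 180.0)) -> np.ndarray:
    """Example re-segmentation of the intensity mask.

    The intensity range is illustrative; the actual soft-tissue range used
    in this study is given in Table 2.
    """
    # Keep only ROI voxels within the soft-tissue intensity range.
    in_range = (image >= intensity_range[0]) & (image <= intensity_range[1])
    intensity_mask = mask & in_range

    # Additionally exclude voxels deviating more than three standard
    # deviations from the mean intensity of the remaining ROI voxels.
    roi_values = image[intensity_mask]
    mu, sigma = roi_values.mean(), roi_values.std()
    within_3_sigma = np.abs(image - mu) <= 3.0 * sigma
    return intensity_mask & within_3_sigma
```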

Image perturbations

Five basic image perturbation methods were implemented in the image processing scheme described above: rotation (R), noise addition (N), translation (T), volume adaptation (V) and contour randomisation (C). Examples are shown in Fig. 1. Rotation perturbs the image and mask by performing an affine transformation that rotates both in the axial (x, y) plane, i.e. around the z-axis, by a specified angle \(\theta \in [-13^{\circ}, 13^{\circ}]\). Noise addition perturbs image intensities by adding random noise drawn from a normal distribution with mean 0 and a standard deviation equal to the estimated standard deviation of the noise present in the image. Translation perturbs the image and mask by performing an affine transformation that shifts both by specified fractions \(\eta \in [0.00, 0.75]\) of the isotropic voxel spacing along the x, y and z axes. Volume adaptation grows or shrinks the mask by a specified fraction \(\tau \in [-0.28, 0.28]\). Contour randomisation is based on simple linear iterative clustering31 and perturbs the mask by randomly selecting supervoxels based on their overlap with the original mask. The algorithmic implementation of these perturbations is described in Supplementary Note 4.
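As an illustration, a simplified sketch of these perturbations using scipy and scikit-image could look as follows. This is not the implementation used in this work (which is described in Supplementary Note 4): the volume adaptation is reduced to a one-voxel dilation or erosion, and the supervoxel parameters and selection rule are assumptions for illustration only.

```python
import numpy as np
from scipy import ndimage
from skimage.segmentation import slic  # channel_axis requires scikit-image >= 0.19

def rotate(image, mask, angle_deg):
    # Rotate image and mask in the axial (x, y) plane, i.e. around the z-axis,
    # assuming arrays are ordered (z, y, x).
    rot_image = ndimage.rotate(image, angle_deg, axes=(1, 2), reshape=False, order=3)
    rot_mask = ndimage.rotate(mask.astype(float), angle_deg, axes=(1, 2),
                              reshape=False, order=1) >= 0.5
    return rot_image, rot_mask

def add_noise(image, noise_sd, rng=None):
    # Add Gaussian noise with the estimated noise standard deviation of the image.
    rng = rng or np.random.default_rng()
    return image + rng.normal(0.0, noise_sd, size=image.shape)

def translate(image, mask, shift_voxels):
    # Shift image and mask by sub-voxel distances along the z, y and x axes,
    # given here directly in voxel units.
    trans_image = ndimage.shift(image, shift_voxels, order=3)
    trans_mask = ndimage.shift(mask.astype(float), shift_voxels, order=1) >= 0.5
    return trans_image, trans_mask

def adapt_volume(mask, grow=True):
    # Simplified volume adaptation: grow or shrink the mask by one voxel layer.
    # The actual method adapts the ROI volume by a specified fraction tau.
    return ndimage.binary_dilation(mask) if grow else ndimage.binary_erosion(mask)

def randomise_contour(image, mask, n_segments=500, rng=None):
    # Supervoxel-based contour randomisation: segment the image into supervoxels
    # and randomly keep those overlapping the original mask, here with a selection
    # probability equal to the overlap fraction (an assumed rule for illustration).
    rng = rng or np.random.default_rng()
    supervoxels = slic(image, n_segments=n_segments, compactness=0.1, channel_axis=None)
    new_mask = np.zeros_like(mask, dtype=bool)
    for label in np.unique(supervoxels):
        overlap = mask[supervoxels == label].mean()
        if overlap > 0.0 and rng.uniform() < overlap:
            new_mask[supervoxels == label] = True
    return new_mask
```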

Perturbations were chained using the settings documented in Supplementary Note 5. Each rotation angle and volume adaptation fraction led to the generation of a new image. Noise addition and contour randomisation could be repeated multiple times, with each repetition producing a new perturbed image as well. The translation fractions were permuted over the different directions. For example, translation fractions \(\eta \in \{0.25, 0.5\}\) yield \(2^3 = 8\) permutations over the x, y and z axes, each of which generates a new image. When chaining perturbations, all provided parameters were permuted.
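This permutation of translation fractions over the three axes can be sketched as follows (a minimal illustration; variable names are ours):

```python
from itertools import product

# Translation fractions eta = {0.25, 0.5} permuted over the x, y and z directions
# yield 2**3 = 8 distinct sub-voxel shifts, each of which generates a new image.
translation_fractions = [0.25, 0.5]
shifts = list(product(translation_fractions, repeat=3))

print(len(shifts))  # 8
print(shifts[0])    # (0.25, 0.25, 0.25)
```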

An overview of the perturbation chains and the number of perturbed images created is shown in Table 1. All perturbation chains produced between 27 and 40 perturbed images.

Features

All features defined in the IBSI documentation were implemented29, resulting in a set of 182 base features used to assess morphological, statistical and texture characteristics of the ROI. These base features belong to the morphological, local intensity, intensity-based statistical, intensity-histogram, intensity-volume histogram, grey level co-occurrence matrix-based texture, grey level run length matrix-based texture, grey level size zone matrix-based texture, grey level distance zone matrix-based texture, neighbourhood grey tone difference matrix-based texture, and neighbouring grey level dependence matrix-based texture feature families. All base features were computed at multiple scales, namely for isotropic voxel spacings of 1, 2, 3 and 4 mm34. Of these base features, 118 required discretisation. Both fixed bin number and fixed bin size discretisation algorithms were used, each with four settings. Thus, a total of 4032 features were computed in each image. Supplementary Note 3 contains further details with regard to feature computation.
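The total follows from these settings, assuming that each of the 118 discretised base features is computed for all 2 × 4 = 8 discretisation configurations and that the remaining 64 base features are computed once per voxel spacing, an interpretation consistent with the stated total:

\[
\bigl[(182-118) + 118\times(2\times 4)\bigr]\times 4 \;=\; 1008\times 4 \;=\; 4032.
\]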

Both image processing and feature computation were conducted using our IBSI-compliant in-house framework based on Python 3.635.

Robustness analysis

Feature robustness was assessed using the intraclass correlation coefficient ICC(1,1)16, based on the assumption that test-retest images, as well as perturbations, possess no consistent bias. The highest possible ICC value is 1.00, which indicates that feature values are fully repeatable between test-retest images or perturbations. Lower values denote an increasing measurement variance relative to the variance between patients, and thus lower repeatability.

The test-retest ICC was determined between both CT images, see Fig. 2. Perturbation ICCs were first computed separately for the test and retest images. Subsequently, perturbation ICCs were averaged over test and retest images to facilitate comparison with the test-retest ICC, as no consistent bias toward higher ICC values for one image set could be established (see Supplementary Note 6). The boundary values of the 95% confidence interval for perturbations were likewise averaged between test and retest images.

The 95% confidence interval of the ICC was compared with a threshold of 0.90 to determine robustness17. A feature was considered robust if its entire CI was at or above 0.90, non-robust if its entire CI was below 0.90, and of indeterminate robustness if the CI overlapped with the threshold.
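A minimal sketch of this criterion, assuming the one-way random-effects ICC(1,1) with the F-distribution-based confidence interval of Shrout and Fleiss (the study itself used code adapted from the psych R-package37), could look as follows; the `values` array collects one feature across patients (rows) and repeated images or perturbations (columns).

```python
import numpy as np
from scipy import stats

def icc_1_1(values: np.ndarray, alpha: float = 0.05):
    """ICC(1,1) with its confidence interval; values has shape (n_subjects, k_measurements)."""
    n, k = values.shape
    grand_mean = values.mean()
    subject_means = values.mean(axis=1)

    # One-way random-effects ANOVA mean squares.
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((values - subject_means[:, None]) ** 2) / (n * (k - 1))

    icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

    # Confidence interval based on the F distribution (Shrout & Fleiss).
    f_obs = ms_between / ms_within
    f_lower = f_obs / stats.f.ppf(1 - alpha / 2, n - 1, n * (k - 1))
    f_upper = f_obs * stats.f.ppf(1 - alpha / 2, n * (k - 1), n - 1)
    ci_low = (f_lower - 1) / (f_lower + k - 1)
    ci_high = (f_upper - 1) / (f_upper + k - 1)
    return icc, ci_low, ci_high

def robustness_state(ci_low: float, ci_high: float, threshold: float = 0.90) -> str:
    # A feature is robust if its entire CI lies at or above the threshold,
    # non-robust if the entire CI lies below it, and indeterminate otherwise.
    if ci_low >= threshold:
        return 'robust'
    if ci_high < threshold:
        return 'non-robust'
    return 'indeterminate'
```

For the perturbation ICC, this computation would be applied separately to the perturbed versions of the test and retest images, after which the ICC and its confidence interval bounds are averaged, as described above.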

Feature robustness was assessed using R 3.4.236. ICCs and their confidence intervals were computed using code adapted from the psych R-package37.