Introduction
Diffuse large B-cell lymphoma (DLBCL) is the most common type of non-Hodgkin lymphoma and is associated with an aggressive disease course. Adding the monoclonal antibody rituximab to treatment regimens has significantly improved outcome [1–3]. However, approximately 30% of patients with DLBCL still experience disease progression or relapse, resulting in poor outcome [4]. In rituximab-treated DLBCL patients, neither the original International Prognostic Index (IPI) score [5] nor other IPI variants, such as the revised IPI and the National Comprehensive Cancer Network IPI, identify a subgroup with poor long-term survival (e.g., < 50%) [6]. This underlines the need for new biomarkers that can accurately identify a subgroup with poor outcome after standard chemo-immunotherapy.
Recent studies have shown that radiomics features extracted from baseline [18F]-fluorodeoxyglucose positron emission tomography/computed tomography ([18F]FDG PET/CT) scans have promising predictive value in DLBCL [7–11]. Radiomics features provide detailed quantitative information on tumor morphology, texture, dissemination, and intensity, reflecting tumor biology. In solid cancers, radiomics features are usually extracted from the primary lesion. Radiomics analysis in lymphoma is more challenging, however, because most patients have no single primary lesion and the disease is often disseminated throughout the body across many nodal and extranodal sites. Given the high inter- and intra-tumor heterogeneity within patients, the metabolic tumor volume (MTV) at patient level best reflects disease burden. Therefore, some studies calculated radiomics features at patient level [7, 12]. Because texture features become hard to interpret at patient level, other studies calculated radiomics features only for the lesion with the highest metabolic activity (highest maximum standardized uptake value, SUVmax) [8, 9] or for the lesion with the largest volume [10, 11].
We previously demonstrated that radiomics features at patient level are more predictive than radiomics features of the hottest and the largest lesion [12]. In the present study, we investigated how information from multiple individual lesions in a patient can be aggregated and whether this improves the prediction of progression after 2 years. We compared different lesion selection approaches and combined radiomics features from individual lesions with patient-level radiomics features. Moreover, we explored the influence of different data reduction methods on model performance and investigated the importance of individual features within the models.
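The lesion selection approaches compared in this study (hottest lesion, largest lesion, or a median or maximum of each feature over all lesions) can be illustrated with a minimal Python sketch. The `Lesion` record and the `hottest`, `largest`, and `aggregate` helpers are hypothetical names chosen for illustration; the sketch assumes per-lesion features have already been extracted and does not reproduce the study's actual pipeline.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Lesion:
    """One segmented lesion (hypothetical record for illustration)."""
    suv_max: float
    volume_ml: float
    features: dict[str, float]  # feature name -> value, e.g. texture features


def hottest(lesions: list[Lesion]) -> Lesion:
    """Lesion with the highest SUVmax."""
    return max(lesions, key=lambda les: les.suv_max)


def largest(lesions: list[Lesion]) -> Lesion:
    """Lesion with the largest metabolic volume."""
    return max(lesions, key=lambda les: les.volume_ml)


def aggregate(lesions: list[Lesion], how: str) -> dict[str, float]:
    """Combine each per-lesion feature across all lesions ("median" or "max")."""
    reducer = {"median": median, "max": max}[how]
    names = lesions[0].features
    return {name: reducer([les.features[name] for les in lesions]) for name in names}
```

The median and maximum approaches need features for every lesion, whereas the hottest and largest approaches need features for only one lesion per patient.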
Discussion
This study showed that radiomics features at patient level had the highest predictive value. Prediction models based on more complex radiomics features containing information from multiple lesions had no added predictive value over our previously published selection of simpler radiomics features for predicting progression after 2 years, regardless of the feature reduction method applied. Dissemination features showed high predictive ability and improved outcome prediction when added to radiomics features extracted from the hottest lesion, the largest lesion, or the patient-level MTV, although the improvement was not statistically significant.
Historically, the hottest lesions have been used to measure response during or after treatment [24, 25], and parameters quantifying uptake, such as SUVmax and SUVpeak, have been shown to be predictive in DLBCL [12, 26–28]. It is therefore surprising that radiomics features extracted from the hottest lesion have limited predictive value. In 75% of the patients in our dataset, the largest lesion was also the hottest lesion, somewhat lower than the 84% reported in a recent study [10]. In our data, a mismatch occurred more frequently in smaller lesions. However, we found no clear correlation between this mismatch and PET or clinical parameters, which makes it hard to hypothesize an explanation for it. We previously reported a lower cross-validated AUC (CV-AUC) for the hottest lesion than for the largest lesion and for patient-level radiomics features [12].
Currently, there is no valid approach to test whether the CV-AUCs of the various models differ significantly, because there is no method to quantify the correlation between models trained within cross-validation, and the training and test data are additionally correlated across CV iterations [29]. Therefore, we cannot definitively state which lesion selection approach has the highest predictive value. To be able to compare AUCs, we calculated the median AUC for each model for each fold. The disadvantage of this approach is that the p value is based on the data of a single fold (20% of the data), which results in low power to detect true differences. To obtain a reliable estimate of the p value, the procedure was therefore repeated 50 times, each time with a random 20% sample of the data, and the median p value over the 50 repeats was used. Nevertheless, there were no significant differences between individual models, suggesting that they are interchangeable. However, patient-level dissemination features appear to play an important role: when dissemination features were included, the predictive value consistently increased for all models. A combination of MTV, SUVpeak, and Dmaxbulk showed the best predictive ability. Nonetheless, this CV-AUC is only marginally higher than that of other models that included dissemination features and patient-level conventional PET features. Still, the model based on MTV, SUVpeak, and Dmaxbulk might be preferred for clinical translation, as these features are easy to understand and relate to disease characteristics that can readily be recognized by eye on the PET image.
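The idea of summarizing a low-power, single-fold comparison by its median p value over 50 random 20% samples can be sketched in Python. This section does not name the test used to compare two models' AUCs, so the sketch assumes a paired permutation test on out-of-fold predicted probabilities; `auc`, `paired_permutation_p`, and `median_p_over_subsamples` are hypothetical helpers, and the sketch is not the study's exact procedure.

```python
import random
from statistics import median


def auc(labels, scores):
    """AUC as the probability that a random positive case outranks a random
    negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_permutation_p(labels, scores_a, scores_b, n_perm, rng):
    """Two-sided p value for AUC(a) - AUC(b). Under the null hypothesis the two
    models are exchangeable, so their scores are swapped per patient at random.
    ASSUMPTIONS: stands in for the unspecified test used in the study, and both
    models output probabilities on a comparable scale."""
    observed = abs(auc(labels, scores_a) - auc(labels, scores_b))
    hits = 0
    for _ in range(n_perm):
        a, b = [], []
        for sa, sb in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                sa, sb = sb, sa
            a.append(sa)
            b.append(sb)
        if abs(auc(labels, a) - auc(labels, b)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def median_p_over_subsamples(labels, scores_a, scores_b,
                             n_repeats=50, frac=0.2, n_perm=200, seed=0):
    """Run the test on n_repeats random subsamples (frac of the patients each)
    and return the median p value."""
    rng = random.Random(seed)
    n = len(labels)
    k = max(2, round(frac * n))
    p_values = []
    for _ in range(n_repeats):
        idx = rng.sample(range(n), k)
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # AUC needs both classes; skip degenerate draws
            continue
        p_values.append(paired_permutation_p(
            ys, [scores_a[i] for i in idx], [scores_b[i] for i in idx],
            n_perm, rng))
    return median(p_values)
```

Taking the median over many subsamples stabilizes the p value, but each individual test still sees only 20% of the patients, so power to detect small differences remains limited.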
There is growing interest in using radiomics features to predict outcome or to select patients for innovative treatment options, as more and more studies show their predictive value independent of well-established clinical predictors [8, 10–12, 30, 31]. For radiomics to be implemented in a clinical setting, user-friendliness is important. After extensively testing lesion and feature selection approaches combined with different data reduction methods, we found no added value for textural and morphological radiomics features. Moreover, textural features are known to have reproducibility and repeatability issues in a clinical setting [17], which makes prediction models based on textural features less feasible to apply in clinical practice. Since dissemination features combined with conventional PET features had the highest predictive value, it is advisable to calculate these features. Unlike morphological and textural radiomics features, dissemination features are easy to interpret because they quantify what can be seen on PET/CT scans. They are also relatively simple to calculate and relatively insensitive to differences in acquisition, reconstruction, and delineation method [17, 32]. From an ease-of-use perspective, the median and maximum lesion selection methods are more time-consuming and therefore not preferred, because every lesion has to be processed separately to calculate its radiomics features. Moreover, the median prediction model showed limited discriminative power compared with the other lesion selection methods. Radiomics features extracted from the patient-level MTV (Model 4) are predictive of outcome. However, multi-cluster radiomics features are complex to interpret, both mathematically and clinically, so features from this model might not be suitable for a clinical setting. Currently, there is no consensus on the optimal segmentation method in DLBCL, although the SUV4.0 method has been suggested [33]. However, we recently showed that the segmentation method does not influence the discriminative power of dissemination features [32].
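The SUV4.0 method delineates tumor as all voxels with an SUV of at least 4.0. A minimal NumPy sketch of how the MTV follows from such a fixed-threshold segmentation, assuming an SUV-converted 3D array and a known voxel volume; `suv4_mtv` and its `roi` argument are hypothetical, and clinical pipelines additionally require removal of physiological uptake (for example brain or bladder) before the MTV is meaningful.

```python
import numpy as np


def suv4_mtv(suv, voxel_volume_ml, roi=None, threshold=4.0):
    """MTV in mL: count voxels with SUV >= threshold (4.0 by default) and
    multiply by the voxel volume.

    roi: optional boolean mask restricting the count to lesion regions
    (ASSUMPTION: physiological uptake has already been excluded by whoever
    builds this mask).
    """
    mask = np.asarray(suv) >= threshold
    if roi is not None:
        mask &= np.asarray(roi, dtype=bool)
    return float(np.count_nonzero(mask)) * voxel_volume_ml
```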
Several other studies have evaluated the predictive value of baseline radiomics features in DLBCL. Aide et al. showed that, for the largest lesion, nine of 19 textural features were significant in univariate analysis [11]. Parvez et al. calculated 42 features for the 1–3 hottest lesions and reported that 3 textural features significantly predicted disease-free survival [9]. Decazes et al. showed that IPI, chemotherapy, MTV, and the total volume surface ratio were all significant in multivariate analysis [31]. Two studies extracted the metabolic heterogeneity from the hottest lesion [8] or the largest lesion [10]. Both showed that patients with high MTV and high metabolic heterogeneity had significantly lower survival rates than patients with only one of these risk factors. Nonetheless, MTV was the only significant predictor of outcome in multivariate analysis. Because these studies extracted different (numbers of) features, they are hard to compare directly. Overall, our results confirm that radiomics features are predictive of outcome and have added value compared with MTV. Moreover, we extend these findings by showing that dissemination features are very important and that adding complex textural radiomics features provides no additional predictive ability beyond dissemination features.
Dissemination, expressed as the distance between lesions, was first introduced by Cottereau et al., who showed that it was a predictor of outcome independent of MTV [7, 30]. In our study, Dmaxbulk consistently had the highest feature importance, suggesting that dissemination is more important than MTV when predicting outcome. Our study adds to their findings by showing that dissemination features quantifying differences in uptake or volume between lesions also have high predictive value.
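Dmaxbulk is commonly defined as the distance between the bulkiest lesion and the lesion farthest from it, and Dmax as the largest distance between any two lesions. A minimal sketch, assuming each lesion is summarized by its centroid in millimetres and its volume; `dmax_bulk` and `dmax_patient` are hypothetical helpers, and the exact distance definition used in the study (centroid-to-centroid or otherwise) may differ.

```python
import math
from itertools import combinations


def dmax_bulk(centroids_mm, volumes_ml):
    """Distance (mm) from the bulkiest lesion (largest volume) to the lesion
    farthest from it, using Euclidean distance between lesion centroids."""
    if len(centroids_mm) < 2:
        return 0.0  # a single lesion has no inter-lesion distance
    bulk = max(range(len(volumes_ml)), key=lambda i: volumes_ml[i])
    return max(math.dist(centroids_mm[bulk], c)
               for i, c in enumerate(centroids_mm) if i != bulk)


def dmax_patient(centroids_mm):
    """Largest distance (mm) between any two lesion centroids."""
    return max((math.dist(a, b) for a, b in combinations(centroids_mm, 2)),
               default=0.0)
```

Both features need only lesion locations and volumes, not texture matrices, which is consistent with their relative insensitivity to acquisition and delineation differences.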
By applying different lesion selection approaches to the same patient samples, we could directly compare their predictive value using progression after 2 years as outcome. Because our aim was to compare the predictive value of radiomics features across lesion selection approaches, we did not add any clinical predictors. Adding clinical predictors to radiomics features has been shown to improve outcome prediction in DLBCL [12, 30] and in other types of lymphoma [34, 35]. A limitation of this study is that we did not validate our findings in a separate external cohort, which makes them exploratory, although we applied internal validation using cross-validation. In particular, most patients included in this study had advanced-stage disease; our results therefore need to be validated in cohorts of patients with limited-stage DLBCL. Lastly, the majority of patients included in this study did not experience progression, resulting in class imbalance. We corrected for this by creating synthetic samples. CV-AUCs of datasets with and without synthetic samples were comparable for all models, yet an effect of class imbalance cannot be ruled out.
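This section does not name the technique used to create synthetic samples; a widely used option is SMOTE, which interpolates between a minority-class sample and one of its nearest minority-class neighbours. A minimal pure-Python sketch of that idea, assuming numeric feature vectors; `smote` is a hypothetical helper, not the study's implementation.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating between a random
    minority sample and one of its k nearest minority-class neighbours
    (SMOTE-style oversampling)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: math.dist(x, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic
```

Within cross-validation, synthetic samples should be generated from the training folds only, so that no information from the test fold leaks into model training.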
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.