Paper

Multi-view radiomics and dosiomics analysis with machine learning for predicting acute-phase weight loss in lung cancer patients treated with radiotherapy


Published 2 October 2020 © 2020 Institute of Physics and Engineering in Medicine
Citation: Sang Ho Lee et al 2020 Phys. Med. Biol. 65 195015. DOI 10.1088/1361-6560/ab8531


Abstract

We propose a multi-view data analysis approach using radiomics and dosiomics (R&D) texture features for predicting acute-phase weight loss (WL) in lung cancer radiotherapy. Baseline weight of 388 patients who underwent intensity modulated radiation therapy (IMRT) was measured between one month prior to and one week after the start of IMRT. Weight change between one week and two months after the commencement of IMRT was analyzed and dichotomized at 5% WL. Each patient had a planning CT and contours of gross tumor volume (GTV) and esophagus (ESO). A total of 355 features, including clinical parameter (CP), GTV and ESO (GTV&ESO) dose-volume histogram (DVH), GTV radiomics, and GTV&ESO dosiomics features, were extracted. R&D features were categorized as first- (L1), second- (L2) and higher-order (L3) statistics, and as three combined groups: L1 + L2, L2 + L3 and L1 + L2 + L3. Multi-view texture analysis was performed to identify optimal R&D input features. In the training set (194 earlier patients), feature selection was performed using the Boruta algorithm followed by collinearity removal based on the variance inflation factor. Machine-learning models were developed using Laplacian kernel support vector machine (lpSVM), deep neural network (DNN) and their averaged ensemble classifiers. Prediction performance was tested on an independent test set (194 more recent patients) and compared among seven different input conditions: CP-only, DVH-only, R&D-only, DVH + CP, R&D + CP, R&D + DVH and R&D + DVH + CP. Combined GTV L1 + L2 + L3 radiomics and GTV&ESO L3 dosiomics were identified as optimal input features, which achieved the best performance with an ensemble classifier (AUC = 0.710), having statistically significantly higher predictability compared with DVH and/or CP features (p < 0.05). When this performance was compared to that with full R&D-only features, which reflect traditional single-view data, there was a statistically significant difference (p < 0.05). Using optimized multi-view R&D input features is beneficial for predicting early WL in lung cancer radiotherapy, leading to improved performance compared with conventional DVH and/or CP features.


1. Introduction

Weight loss (WL) is known as an independent prognostic factor in patients with lung cancer (Mytelka et al 2018). Lung cancer patients often experience loss of appetite and unintentional WL. The loss of appetite resulting in the inability to eat is known as anorexia, and the weakness and emaciation resulting from ill health, malnutrition and associated WL are known as cachexia. These symptoms constitute the cancer anorexia-cachexia syndrome, which causes additional complications and negatively affects the health and quality of life of lung cancer patients. In addition, they are associated with a higher incidence of postoperative complications and increased side effects from chemotherapy (Ross et al 2004). Approximately half of all cancer patients experience anorexia and cachexia, and the associated WL typically worsens as disease progresses, with around 60% of lung cancer patients reporting significant WL in their last few months of life (Ross et al 2004). WL in patients with lung cancer is thus of symptomatic and prognostic relevance.

Cancer-related WL is a common comorbid condition that is best described for patients with advanced malignancy who receive systemic therapy, but its relationship with patients who undergo radiation therapy (RT) is less well described (Lau and Iyengar 2017). Over half of cancer patients receive RT during their course of treatment (Barton et al 2014). In addition, radiation-induced side effects can augment or prompt WL for thoracic malignancy, where cachexia is most prevalent. For example, acute radiation esophagitis is a common toxicity related to thoracic RT during treatment for lung cancer (Rose et al 2009). The clinical manifestation of acute esophagitis, which appears as dysphagia or odynophagia, can adversely affect the ability to take proper nutrition and thus leads to WL and malnutrition (Bovio et al 2009). When an evaluation of malnutrition is absent, WL is often used as a surrogate measure of nutritional status (Buskermolen et al 2012). Therefore, identifying early predictors of WL is key to developing personalized treatment, enabling clinicians to identify patients at high nutritional risk, provide timely and appropriate nutritional intervention before the onset of malnutrition, and optimize patient outcomes.

To date, many studies in lung cancer patients have sought to identify dosimetric factors of RT-induced esophagitis (Rose et al 2009, Werner-Wasik et al 2010), but few studies have investigated factors associated with WL during RT in lung cancer patients (Kiss et al 2014). Moreover, in most RT dose-outcome studies, the 3D dose distribution is reduced to a dose-volume histogram (DVH), which describes the volume of an organ that receives each dose level. However, DVH-based measures of treatment plan quality have long been understood to have limited predictive value (Deasy et al 2002), because information on spatial relationships between voxel doses is completely lost when 3D dose distributions are summarized with the DVH. Furthermore, adjacent dose levels in the DVH are generally highly correlated, and the predictive power for clinical outcomes often remains moderate (Carillo et al 2014, D'Avino et al 2015, Fiorino et al 2016).

Radiomics plays an important role in characterizing gray-level distributions in radiographic images and deriving quantitative imaging biomarkers to improve predictive modeling (Aerts et al 2014). The emergence of radiomics has provided new opportunities to develop more informative outcome models for RT (El Naqa et al 2018). However, while radiomics has been applied to various standard-of-care clinical images such as CT, MRI or PET, only a limited number of recent studies have used dosiomics, which describes texture features of 3D dose distributions treated as images (Rossi et al 2018, Gabrys et al 2018).

Predicting WL is particularly important in patients receiving RT, because RT may preferentially be used for cachectic patients and radiation-induced toxicity may precipitate WL (Lau and Iyengar 2017), which may in turn be associated with worse prognosis (Sanders et al 2016). So far, the use of radiomics and dosiomics (R&D) features associated with clinically important WL (⩾5%) (Fearon et al 2011, Tuca et al 2013) has not yet been investigated. The R&D texture features are high-dimensional by nature. Even though multiple R&D feature groups can be extracted to obtain more useful information, a large number of features with a limited sample size could hinder the predictive power of a classifier model. To date, most radiomics studies have concatenated the multiple feature groups together to form a single-view feature vector, without a systematic strategy for analyzing high-dimensional data. Hence, radiomics feature selection and classifier performance have been based on the single-view data analysis scenario (Parmar et al 2015). However, the drawbacks of this method are that over-fitting may arise on comparatively small training sets and the specific statistical properties of each view are ignored. To overcome such limitations of single-view analysis, multi-view learning approaches have been increasingly adopted to handle machine learning problems with high-dimensional data represented by multiple distinct feature sets (Sun 2013, Cao et al 2019), demonstrating the potential of a well-designed multi-view data analysis strategy to improve performance.

A simple way to convert from a single view to multiple views is to split the original feature set into different views. The R&D texture features can be represented by heterogeneous feature spaces in the form of multiple views for two main reasons: (1) they come from different imaging sources, i.e. CT images for radiomics and RT dose volume for dosiomics, and (2) they are composed of different statistics-based phenotypes, as they contain first- (L1), second- (L2), and higher-order (L3) statistics (Gillies et al 2016). In contrast with existing studies that treat radiomics features as single-view concatenated data, we make use of different statistics-based imaging phenotypes to explore the performance of classification models with multi-view R&D texture features. In this study, we hypothesized that a multi-view analysis strategy for R&D texture features improves the accuracy of predicting early radiation-induced WL. The objective of this study was to identify predictors associated with ⩾5% WL during the two months from the commencement of RT in lung cancer patients receiving RT, and to compare the predictive performance of conventional approaches based on DVH and/or clinical parameters (CPs) with that of R&D texture features.

2. Materials and methods

2.1. Patients

Lung cancer patients who underwent intensity modulated radiation therapy (IMRT) at the Johns Hopkins Hospital from July 23, 2009, through March 22, 2019 were identified. Patients were eligible for inclusion if they had a primary diagnosis of non-small cell or small cell lung cancer; had body weight recorded at baseline and during the RT phase; and were treated with definitive RT with or without concurrent chemotherapy. Patients with multiple treatments, reirradiation or palliative RT, or missing weight measurements at baseline or during the RT phase were excluded. A total of 388 patients met the study inclusion criteria. For temporally independent validation, patients who underwent RT prior to April 2015 were assigned to a training cohort (n = 194), and the subsequent patients were assigned to a test cohort (n = 194). The institutional review board approved this retrospective study, and the requirement for informed consent was waived because of the retrospective nature of the study.

2.2. Outcome

WL during RT was the primary outcome of the study. Baseline patient weight was measured between one month prior to and one week after the start of RT. In addition, weight at any time point between one week and two months after the commencement of RT was recorded. The percent change in weight was calculated using each patient's lowest weight during the time period compared to weight at baseline. Patients were divided into two groups: those who lost greater than or equal to 5% body weight and those who lost less than 5% body weight. Critical WL (CWL) was defined as involuntary WL of 5% or more. The study cohort included 169 patients with CWL (⩾5%) and 219 patients with no CWL (<5%). The training cohort consisted of 93 patients with CWL and 101 patients with no CWL, and the test cohort had 76 patients with CWL and 118 patients with no CWL.
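The outcome definition above can be expressed as a short sketch; the function and variable names are illustrative, not taken from the study's code:

```python
# Hypothetical sketch of the outcome definition: percent weight change is
# the lowest weight recorded between one week and two months after the
# start of RT, relative to baseline, dichotomized at 5% loss (CWL).

def critical_weight_loss(baseline_kg, on_treatment_kg, threshold=0.05):
    """Return (percent_loss, has_cwl) given the baseline weight and the
    list of weights recorded during the follow-up window."""
    lowest = min(on_treatment_kg)
    percent_loss = (baseline_kg - lowest) / baseline_kg
    return percent_loss, percent_loss >= threshold

# Example: 80 kg at baseline, lowest recorded weight 75 kg -> 6.25% loss
loss, cwl = critical_weight_loss(80.0, [78.2, 75.0, 76.1])
```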

2.3. CT image acquisition and segmentation

Planning CT images were acquired with lung cancer patients in free breathing on 16-slice Philips Brilliance Big Bore CT scanners (Philips Medical Systems, Andover, MA, USA) using the following parameters: tube voltage, 120 kVp; exposure, 200 mAs; in-plane spatial resolution, 1.2 × 1.2 mm2; and slice thickness, 3 mm. An expert radiation oncologist manually contoured the primary gross tumor volume (GTV) and esophagus (ESO) for each patient for treatment planning purposes. Patients with severe artifacts in CT were excluded from the image analysis to avoid unduly influencing the image features and analysis.

2.4. Clinical parameters

A total of 10 baseline CPs were included in the analysis: age, gender, race, marital status, serum albumin level, RT with or without concurrent chemotherapy, height, body weight, body mass index (BMI) and Karnofsky performance score (KPS). All CPs were recorded from the electronic medical record prior to RT. Missing values were imputed using multivariate imputation by chained equations (Buuren and Groothuis-Oudshoorn 2011). Eligible patients were aged 31 to 90 years (mean age, 65.44 (standard deviation (SD), 10.65) years) and comprised 200 male and 188 female patients. Patient race included Caucasian (n = 218), African American (n = 104), Asian (n = 7), American Indian (n = 2), other (n = 12), and unknown (n = 45) race. Patient marital status was categorized into partnered (n = 217), non-partnered (n = 157), and unknown (n = 14) status, where partnered patients consisted of both married and unmarried ones, and non-partnered patients were single, widowed or divorced ones. Patient serum albumin level ranged from 2 to 4.9 g dl−1 (mean ± SD serum albumin level, 3.69 ± 0.54 g dl−1). A total of 347 patients received full (n = 217) or sensitizing (n = 130) dose RT with concurrent chemoradiotherapy, while the remaining 41 patients received full dose RT without concurrent chemotherapy. Patient height, body weight, BMI and KPS ranged from 149.9 to 198.1 cm (mean ± SD height, 169.3 ± 9.977 cm), 38.8 to 149.9 kg (mean ± SD body weight, 76.65 ± 17.44 kg), 16.49 to 51.49 kg m−2 (mean ± SD BMI, 26.69 ± 5.462 kg m−2) and 50 to 100 (mean ± SD KPS, 86.57 ± 9.31), respectively. More details about CPs are summarized for each of the training and test cohorts with and without CWL in table 1.

Table 1. Baseline population characteristics of clinical parameters in the training and test sets (n = 388). Mean and SD are calculated for continuous variables and number (n) and percentage (%) for categorical variables in the training or test set, respectively.

  Training (n = 194) Test (n = 194)
CP feature No CWL (n = 101) CWL (n = 93) No CWL (n = 118) CWL (n = 76)
Age (mean ± SD) 65.47 ± 10.91 64.84 ± 10.74 64.88 ± 10.64 67.01 ± 10.24
Gender (n (%))
Male 54 (27.84) 46 (23.71) 65 (33.51) 35 (18.04)
Female 47 (24.23) 47 (24.23) 53 (27.32) 41 (21.13)
Race (n (%))
African American 30 (15.46) 30 (15.46) 27 (13.92) 17 (8.763)
American Indian 0 (0) 0 (0) 1 (0.515) 1 (0.515)
Asian 2 (1.031) 2 (1.031) 2 (1.031) 1 (0.515)
Caucasian 60 (30.93) 57 (29.38) 63 (32.47) 38 (19.59)
Others 7 (3.608) 1 (0.515) 4 (2.062) 0 (0)
Unknown 2 (1.031) 3 (1.546) 21 (10.82) 19 (9.794)
Body weight (mean ± SD) 75.27 ± 16.67 78.16 ± 19.75 77.42 ± 17.43 75.45 ± 15.47
Serum albumin level (mean ± SD, g dl−1) 3.705 ± 0.489 3.980 ± 0.547 3.559 ± 0.482 3.525 ± 0.533
Marital status (n (%))
Partnered 59 (30.41) 54 (27.84) 66 (34.02) 38 (19.59)
Non-partnered 38 (19.59) 38 (19.59) 46 (23.71) 35 (18.04)
Unknown 4 (2.062) 1 (0.515) 6 (3.093) 3 (1.546)
Height (mean ± SD) 169.8 ± 9.964 169.0 ± 9.607 169.5 ± 9.495 168.4 ± 11.23
RT with or without chemotherapy (n (%))
Full-dose RT w/chemotherapy 54 (27.84) 50 (25.77) 68 (35.05) 45 (23.20)
Chemotherapy-sensitized RT 33 (17.01) 31 (15.98) 41 (21.13) 25 (12.89)
Full-dose RT w/o chemotherapy 14 (7.216) 12 (6.186) 9 (4.639) 6 (3.093)
KPS (mean ± SD) 85.35 ± 9.855 86.56 ± 9.498 87.88 ± 9.044 86.18 ± 8.636
BMI (mean ± SD, kg m−2) 26.02 ± 4.483 27.26 ± 5.987 26.84 ± 5.608 26.63 ± 5.751

2.5. DVH features

For all patients, the normalized volume of GTV or ESO irradiated to x Gy or higher (Vx), with x ranging from 5 to 80 Gy in 5 Gy steps, was included in the analysis. In addition, the dose delivered to x percent of the GTV or ESO volume (Dx), with x ranging from 10 to 90% in 5% steps, as well as the minimum (Dmin), maximum (Dmax), mean (Dmean) and variance of dose (Dvariance), were considered. Therefore, a total of 74 DVH features consisting of 32 Vx and 42 Dx parameters were extracted from GTV and ESO (GTV&ESO). A representative subset of the DVH features is summarized for each of the training and test cohorts with and without CWL in table 2.

Table 2. Summary of representative DVH features (mean ± SD) in the training and test sets (n = 388). For each structure, five (out of 16) Vx features and seven (out of 21) Dx features are shown.

  Training (n = 194) Test (n = 194)
DVH feature No CWL (n = 101) CWL (n = 93) No CWL (n = 118) CWL (n = 76)
GTV Vx
V50 0.988 ± 0.100 0.962 ± 0.158 0.930 ± 0.253 0.974 ± 0.161
V55 0.899 ± 0.284 0.914 ± 0.255 0.870 ± 0.334 0.965 ± 0.171
V60 0.780 ± 0.404 0.897 ± 0.264 0.852 ± 0.352 0.916 ± 0.272
V65 0.479 ± 0.481 0.558 ± 0.461 0.430 ± 0.459 0.559 ± 0.448
V70 0.043 ± 0.136 0.107 ± 0.251 0.075 ± 0.211 0.034 ± 0.139
GTV Dx (Gy)
D10 64.30 ± 5.520 65.43 ± 8.656 64.44 ± 6.746 65.37 ± 4.670
D50 63.43 ± 5.419 64.16 ± 8.935 63.24 ± 8.540 64.68 ± 4.491
D90 62.09 ± 7.109 61.70 ± 11.87 61.96 ± 8.874 63.99 ± 4.344
Dmin 60.20 ± 7.247 57.37 ± 15.17 60.00 ± 9.890 61.77 ± 5.842
Dmax 65.60 ± 5.706 67.27 ± 8.333 65.63 ± 6.940 66.64 ± 4.972
Dmean 63.33 ± 5.421 63.76 ± 9.338 63.25 ± 7.646 64.68 ± 4.489
Dvariance 4.394 ± 38.73 14.19 ± 90.15 7.985 ± 51.26 0.489 ± 0.758
ESO Vx
V5 0.508 ± 0.136 0.553 ± 0.141 0.479 ± 0.163 0.521 ± 0.138
V10 0.444 ± 0.135 0.493 ± 0.143 0.403 ± 0.161 0.451 ± 0.140
V20 0.353 ± 0.144 0.420 ± 0.151 0.304 ± 0.160 0.360 ± 0.162
V30 0.277 ± 0.153 0.359 ± 0.149 0.229 ± 0.162 0.297 ± 0.162
V40 0.214 ± 0.152 0.298 ± 0.150 0.180 ± 0.153 0.246 ± 0.158
ESO Dx (Gy)
D10 48.49 ± 15.88 55.98 ± 15.82 44.72 ± 18.08 52.05 ± 16.87
D50 9.592 ± 10.54 15.07 ± 14.10 7.773 ± 9.398 11.28 ± 13.06
D90 0.983 ± 0.698 1.430 ± 3.706 1.033 ± 1.201 1.033 ± 1.027
Dmin 0.580 ± 0.401 0.664 ± 0.646 0.559 ± 0.475 0.581 ± 0.520
Dmax 57.03 ± 13.73 63.29 ± 10.50 55.70 ± 15.62 60.37 ± 12.71
Dmean 18.56 ± 7.553 22.79 ± 7.668 16.40 ± 8.359 20.18 ± 8.588
Dvariance 415.5 ± 221.3 533.4 ± 217.4 363.4 ± 229.7 482.8 ± 247.4
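The Vx and Dx quantities summarized above can be computed directly from a structure's voxel doses; the following is a hypothetical sketch (assuming equal voxel volumes), not the treatment planning system's implementation:

```python
import numpy as np

# Illustrative sketch of DVH feature computation from voxel doses of a
# single structure (assumption: all voxels have equal volume).

def Vx(doses, x):
    """Fraction of the structure volume receiving at least x Gy."""
    doses = np.asarray(doses, dtype=float)
    return float(np.mean(doses >= x))

def Dx(doses, x):
    """Minimum dose received by the hottest x percent of the volume,
    i.e. the (100 - x)th percentile of the voxel dose distribution."""
    doses = np.asarray(doses, dtype=float)
    return float(np.percentile(doses, 100.0 - x))

doses = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
v25 = Vx(doses, 25.0)   # 3 of 5 voxels receive >= 25 Gy
d50 = Dx(doses, 50.0)   # median of the voxel dose distribution
```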

2.6. R&D features

In this study, we performed statistical texture analysis of 3D images to analyze the spatial distribution of gray values, by computing local features at each point in the image and deriving a set of statistics from the distributions of the local features. Depending on the number of pixels defining the local feature, the R&D texture features were categorized into three different feature types: L1, L2 and L3. The L1 statistics estimate properties of individual pixel values, ignoring the spatial interaction between image pixels, whereas L2 and L3 statistics estimate properties of two or more pixel values occurring at specific locations relative to each other (Ojala and Pietikainen 2004). Both the radiomics and dosiomics feature sets included 18 L1 (computed from the gray level frequency histogram), 38 L2 (24 from the gray level co-occurrence matrix (GLCM) (Haralick et al 1973); 14 from the gray level dependence matrix (GLDM) (Julesz and Bergen 1983)) and 37 L3 texture features (16 from the gray level run-length matrix (GLRLM) (Galloway 1975); 16 from the gray level size-zone matrix (GLSZM) (Thibault et al 2009); 5 from the neighboring gray tone difference matrix (NGTDM) (Amadasun and King 1989)). In line with the texture analysis literature, CT numbers or doses are referred to as gray levels in the remainder of this paper. Table 3 summarizes the R&D texture features used in this study. Pyradiomics, an open-source Python package, was used for R&D feature extraction from CT images and 3D dose maps (van Griethuysen et al 2017).

Table 3. R&D texture features used in this study.

Feature statistics Feature class Feature name No. of features
Higher-order (L3) Neighboring gray tone difference matrix (NGTDM) Coarseness, Complexity, Strength, Contrast, Busyness 5
   
  Gray level size zone matrix (GLSZM) GrayLevelVariance, SmallAreaHighGrayLevelEmphasis, GrayLevelNonUniformityNormalized, SizeZoneNonUniformityNormalized, SizeZoneNonUniformity, GrayLevelNonUniformity, LargeAreaEmphasis, ZoneVariance, ZonePercentage, LargeAreaLowGrayLevelEmphasis, LargeAreaHighGrayLevelEmphasis, HighGrayLevelZoneEmphasis, SmallAreaEmphasis, LowGrayLevelZoneEmphasis, ZoneEntropy, SmallAreaLowGrayLevelEmphasis 16
   
  Gray level run length matrix (GLRLM) ShortRunLowGrayLevelEmphasis, GrayLevelVariance, LowGrayLevelRunEmphasis, GrayLevelNonUniformityNormalized, RunVariance, GrayLevelNonUniformity, LongRunEmphasis, ShortRunHighGrayLevelEmphasis, RunLengthNonUniformity, ShortRunEmphasis, LongRunHighGrayLevelEmphasis, RunPercentage, LongRunLowGrayLevelEmphasis, RunEntropy, HighGrayLevelRunEmphasis, RunLengthNonUniformityNormalized 16
Second-order (L2) Gray level dependence matrix (GLDM) GrayLevelVariance, HighGrayLevelEmphasis, DependenceEntropy, DependenceNonUniformity, GrayLevelNonUniformity, SmallDependenceEmphasis, SmallDependenceHighGrayLevelEmphasis, DependenceNonUniformityNormalized, LargeDependenceEmphasis, LargeDependenceLowGrayLevelEmphasis, DependenceVariance, LargeDependenceHighGrayLevelEmphasis, SmallDependenceLowGrayLevelEmphasis, LowGrayLevelEmphasis 14
   
  Gray level co-occurrence matrix (GLCM) JointAverage, SumAverage, JointEntropy, ClusterShade, MaximumProbability, Idmn, JointEnergy, Contrast, DifferenceEntropy, InverseVariance, DifferenceVariance, Idn, Idm, Correlation, Autocorrelation, SumEntropy, MCC, SumSquares, ClusterProminence, Imc2, Imc1, DifferenceAverage, Id, ClusterTendency 24
First-order (L1) Gray level histogram InterquartileRange, Skewness, Uniformity, Median, Energy, RobustMeanAbsoluteDeviation, MeanAbsoluteDeviation, TotalEnergy, Maximum, RootMeanSquared, 90Percentile, Minimum, Entropy, Range, Variance, 10Percentile, Kurtosis, Mean 18
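The second-order (L2) statistics in table 3 can be illustrated with a minimal GLCM sketch. This simplified version assumes a 2D image and a single pixel offset, whereas Pyradiomics aggregates symmetric 3D offsets over many angles:

```python
import numpy as np

# Minimal GLCM sketch (assumption: 2D image, one offset, no symmetrisation).
# Gray levels are assumed to be integers in [0, levels).

def glcm(image, offset=(0, 1), levels=4):
    """Count co-occurrences of gray-level pairs at the given pixel offset
    and normalise the counts to a joint probability matrix."""
    P = np.zeros((levels, levels))
    dr, dc = offset
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                P[image[r, c], image[r2, c2]] += 1
    return P / P.sum()

def glcm_contrast(P):
    """Contrast: expected squared gray-level difference of co-occurring pairs."""
    i, j = np.indices(P.shape)
    return float(np.sum(P * (i - j) ** 2))

img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [2, 2, 3, 3],
                [2, 2, 3, 3]])
P = glcm(img)          # 12 horizontal pairs in a 4x4 image
contrast = glcm_contrast(P)
```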

2.6.1. Radiomics features.

A total of 93 radiomics features were extracted from the GTV. These radiomics features quantified tumor phenotypic characteristics on CT images. The ESO was excluded from the radiomics analysis because it generally contains an air-filled lumen (Goldwin et al 1977, Halber et al 1979), which does not carry much textural information and could hamper prediction performance. Radiomic feature computation was performed at the original CT voxel dimensions with a fixed bin width of 25 HU for discretization of CT numbers.
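The fixed-bin-width discretization can be sketched as follows; this follows the common radiomics convention of binning from the region-of-interest minimum, and may differ in edge handling from the Pyradiomics implementation:

```python
import numpy as np

# Sketch of fixed-bin-width discretization (e.g. 25 HU for CT numbers,
# 25 cGy for doses) applied before texture-matrix computation.

def discretize(values, bin_width=25.0):
    """Map continuous gray values inside a region of interest to integer
    bins of fixed width, starting from the minimum value in the region."""
    values = np.asarray(values, dtype=float)
    return np.floor((values - values.min()) / bin_width).astype(int) + 1

hu = np.array([-30.0, -10.0, 0.0, 24.0, 25.0, 60.0])
bins = discretize(hu)   # gray levels relative to the ROI minimum
```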

2.6.2. Dosiomics features.

DVH features do not fully capture all available information in the dose distribution. For example, the DVH curve contains no spatial information and is typically summarized by discrete data points such as Vx or Dx. For this reason, dosiomics features were extracted to describe the spatial pattern of local variations in the 3D dose distribution (dose grids) within the GTV and ESO, separately. These dose grids represent the distribution of radiation to be delivered during treatment. A 3D dose map was created by resampling the dose grids with trilinear interpolation to the same spatial resolution as the corresponding CT, where each voxel held a dose value. The radiomic approach was also used to extract dosiomics features. A total of 186 dosiomics features were extracted from the GTV and ESO on the computed 3D dose map. Dosiomic feature computation was performed at the resampled 3D dose voxel dimensions with a fixed bin width of 25 cGy for discretization of radiation doses.

2.7. Univariate analysis

Univariate logistic regression analysis was performed to evaluate the predictive value of each single feature by measuring the area under the receiver operating characteristic curve (AUC) as well as its statistical association with CWL. The univariate regression model was built using the training set, and its prediction performance was measured on the test set. A Bonferroni-corrected p < 0.05/355 ≈ 0.00014 was considered statistically significant. The univariate analysis was not used for feature selection.
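The AUC used throughout can be read as the probability that a randomly chosen CWL patient receives a higher predicted score than a randomly chosen no-CWL patient (ties counting one half); a sketch with illustrative names:

```python
# Rank-based AUC sketch: fraction of (positive, negative) score pairs in
# which the positive (CWL) case is ranked above the negative (no-CWL) case.

def auc(scores_pos, scores_neg):
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count one half
    return wins / (len(scores_pos) * len(scores_neg))

# Perfectly separated scores give AUC = 1; uninformative scores give ~0.5.
a = auc([0.9, 0.8], [0.2, 0.8, 0.1])
```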

2.8. Multivariate analysis

An overview of multivariate feature selection and classification strategies for predicting CWL is illustrated in figure 1. First, the integration of the Boruta algorithm (Kursa et al 2010, Kursa and Rudnicki 2010) with a stepwise regression using variance inflation factor (VIF) (Belsley et al 1980) was used to select significantly important and uncorrelated features in the training set. Then, by using features selected through the Boruta and VIF, classification models for predicting CWL were built using two different classifiers, support vector machine (SVM) (Schölkopf and Smola 2002, Kuhn 2008) and deep neural network (DNN) (LeCun et al 2015, Github), as well as an ensemble classifier averaged from both classifiers.

Figure 1. Feature selection and classification algorithm pipeline for predicting WL. On the right, the DNN model architecture and data processing pipeline are summarized. 'None' in output shape for every step means 'unspecified' and that it can vary with the size of the input. 'Feature dimension' in output shape for the first three steps stands for the number of input features to the DNN that are selected by the Boruta and VIF.


2.8.1. Feature selection: Boruta.

The Boruta algorithm is designed as a wrapper around the random forest (RF) algorithm to capture all features important with respect to an outcome variable within a classification framework. The Boruta approach has been used for feature selection in many studies, including omics data sets from gene expression and microbiome analysis (Saulnier et al 2011, Guo et al 2014). In this study, we used the Boruta algorithm implemented in R (Kursa and Rudnicki 2010) with a maximum of 1000 importance-source runs and a significance level of 0.01, with multiple-comparisons adjustment using the Bonferroni method.

2.8.2. Feature selection: variance inflation factor.

Features selected by the Boruta algorithm might be correlated with each other, i.e. there may be collinearities among the features. A simple but effective approach for eliminating multicollinearity is the use of the VIF (Graham 2003). The VIF for a single explanatory feature is obtained from the r-squared (R2) value of the regression of that feature against all other explanatory features: $VI{F_j} = 1/\left( {1 - R_j^2} \right)$, i.e. the VIF for feature j is the reciprocal of one minus the R2 from that regression. This is repeated using a stepwise feature elimination procedure until all VIF values are below a desired threshold. In this study, collinearity was removed by eliminating features with a VIF > 10 (Hair et al 2013).
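The VIF screening step can be sketched as follows; this is an illustrative NumPy version, whereas the study used a stepwise procedure in R:

```python
import numpy as np

# Sketch of VIF-based collinearity removal: regress each feature on the
# others, compute VIF_j = 1 / (1 - R_j^2), and iteratively drop the
# feature with the largest VIF until all values fall below the threshold.

def vif(X, j):
    """VIF of column j of feature matrix X (rows = patients)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

def drop_collinear(X, threshold=10.0):
    """Indices of columns retained after stepwise VIF elimination."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        vifs = [vif(X[:, keep], k) for k in range(len(keep))]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        keep.pop(worst)
    return keep

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
X = np.column_stack([x1, x2, x1 + 0.01 * rng.normal(size=200)])
kept = drop_collinear(X)   # the near-duplicate of x1 is removed
```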

2.8.3. Classification model: support vector machine.

The SVM is a supervised classifier that has proven highly effective in solving a wide range of pattern recognition and computer vision problems (El-Naqa et al 2002, Lee et al 2010). In this study, we used a Laplacian kernel SVM (lpSVM) model because of its potential to achieve the highest level of classification accuracy (Hasan et al 2016, Fadel et al 2016). The lpSVM has two hyperparameters: the soft margin constant $C$, which balances the classification error against the maximum margin, and the kernel parameter $\sigma $, both of which require problem-specific tuning. The lpSVM hyperparameters $C$ and $\sigma $ were tuned with grid search and bootstrapping conducted with 200 replicates to identify the best combination of hyperparameters. Because of the imbalance in the frequencies of the observed cases between the CWL and no-CWL groups, we applied a synthetic minority oversampling technique (SMOTE) (Chawla et al 2002) in the analysis to resolve this class imbalance problem. The SMOTE was conducted repeatedly during the bootstrap resampling to minimize biased results. All machine learning procedures for training the lpSVM classifier were performed using the R caret package (Kuhn 2008).
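The core SMOTE idea (Chawla et al 2002) can be sketched in a few lines; each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its k nearest minority neighbors. This is illustrative only; the study used an R implementation inside bootstrap resampling:

```python
import numpy as np

# Minimal SMOTE sketch: interpolate between a minority sample and one of
# its k nearest minority-class neighbours to create synthetic samples.

def smote(X_min, n_new, k=3, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # k nearest minority neighbours of sample i (excluding itself)
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        gap = rng.random()   # random position along the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote(X_min, n_new=5, k=2)   # stays inside the minority hull
```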

2.8.4. Classification model: deep neural network.

The DNN is an artificial neural network with multiple layers of nonlinear processing units between the input and output layers, which learns the parameter values that result in the best prediction of outcome. We explored a range of combinations of batch size, layer depth and layer size, and determined the optimal architecture for this DNN model empirically after testing numerous variants. The resulting DNN was a 5-layer dense feedforward model with an adaptive gradient-based (Adagrad) optimizer (Duchi et al 2011). Adagrad adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters; for this reason, it is well suited to sparse data. For the DNN, we developed the final models by randomly and manually tuning hyperparameters such as the number of layers and hidden units, learning rate, dropout rate, batch size and epochs, using the R Keras package (Github). To minimize potential overfitting when developing the DNN model, we used dropout (Srivastava et al 2014, Cao et al 2018), L2 regularization (Pavlou et al 2015), and batch normalization (Ioffe and Szegedy 2015). Batch normalization was used to normalize activations in intermediate layers of the DNN, and dropout was used to prevent over-fitting; the dropout fraction was decreased from 0.6 to 0.3 with a step size of 0.1. The DNN model was trained using a cross-entropy loss over 3000 epochs with a batch size (number of samples per gradient update) of 100, a learning rate of 0.01 and a validation split (fraction of the training data used as validation data) of 0.3. A rectified linear unit (ReLU) was used for activation in the hidden layers. The final output of the network was obtained by applying the softmax function to the last hidden layer.

2.8.5. Classification model: ensemble classifier.

After the lpSVM and DNN classifiers were individually trained with the same training set, a simple ensemble technique known as soft voting (Kittler et al 1998) was used to combine the outputs from both classifiers on the test set. Through this technique, the output probabilities from the individual classifiers were averaged and then used as the final predicted probability.
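The soft-voting ensemble is simply the average of the two classifiers' predicted probabilities; a sketch with illustrative variable names:

```python
# Soft voting (Kittler et al 1998): average the predicted probabilities of
# the two base classifiers and threshold the mean to obtain the label.

def soft_vote(p_svm, p_dnn, threshold=0.5):
    """p_svm, p_dnn: per-patient predicted CWL probabilities."""
    p = [(a + b) / 2.0 for a, b in zip(p_svm, p_dnn)]
    labels = [int(pi >= threshold) for pi in p]
    return p, labels

probs, labels = soft_vote([0.9, 0.2, 0.6], [0.7, 0.4, 0.2])
```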

2.8.6. Multi-view input data analysis.

An overview of the single- and multi-view strategies for handling R&D features is shown in figure 2. Combined GTV L1 + L2 + L3 radiomics and GTV&ESO L1 + L2 + L3 dosiomics were set as the traditional single-view concatenated features, as they constitute the full R&D feature set used in this study. Multi-view R&D analysis was performed to identify optimal R&D input features based on multivariate analysis using six different groupings of radiomics or dosiomics features: L1, L2, L3, L1 + L2, L2 + L3 and L1 + L2 + L3. The single-view features and the optimal R&D input features identified by the multi-view paradigm were separately combined with CP and/or DVH features to generate seven different input configurations: CP-only, DVH-only, R&D-only, DVH + CP, R&D + CP, R&D + DVH and R&D + DVH + CP. It should be noted that DVH features are essentially equivalent to the gray level frequency histogram that belongs to L1 statistics. Because Dmin, Dmax, Dmean and Dvariance among the DVH features were the same as Minimum, Maximum, Mean and Variance among the L1 dosiomics features, respectively (see table 3), their duplication was avoided when DVH and L1 dosiomics features were merged. Accordingly, when the single-view full R&D and DVH features were combined, 66 DVH features excluding the eight redundant GTV&ESO Dmin, Dmax, Dmean and Dvariance features were used. Therefore, the total number of features used in this study is 355 = 10 CPs + 66 (= 74 − 8) DVH features + 93 GTV radiomics features + 186 GTV&ESO dosiomics features.
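The seven input configurations can be assembled programmatically; a sketch with purely illustrative feature names (the real views hold 10, 66 and 279 features, respectively):

```python
# Hypothetical sketch of the seven input configurations compared in the
# study, built as name lists over three feature views.

def build_configurations(cp, dvh, rd):
    """cp, dvh, rd: lists of feature names for each view."""
    return {
        'CP-only': cp,
        'DVH-only': dvh,
        'R&D-only': rd,
        'DVH+CP': dvh + cp,
        'R&D+CP': rd + cp,
        'R&D+DVH': rd + dvh,
        'R&D+DVH+CP': rd + dvh + cp,
    }

configs = build_configurations(
    ['age', 'bmi'],                                  # illustrative CP names
    ['gtv_v60', 'eso_dmean'],                        # illustrative DVH names
    ['gtv_glcm_contrast', 'eso_ngtdm_coarseness'],   # illustrative R&D names
)
```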

Figure 2. Two strategies for handling R&D features. (a) The single view concatenating strategy: converting multi-view features into single-view features by concatenating heterogeneous feature spaces into one homogeneous feature space. (b) The multi-view separation strategy: building a classifier model for each feature view separately and then selecting a best performance classifier trained with an optimal view input feature vector. Note that every classifier model is built through the proposed machine learning algorithm pipeline shown in figure 1.


2.8.7. Performance evaluation.

For both univariate and multivariate analyses, the training data were standardized to zero mean and unit variance, and the test data were standardized using the statistics of the corresponding training data. In the task of predicting CWL, classifiers were evaluated with receiver operating characteristic (ROC) analysis, with the AUC as the performance metric. The AUC was calculated for the lpSVM, DNN and ensemble classifiers on the test set. Differences between AUCs were compared using the DeLong test (DeLong et al 1988). Cohen's kappa coefficient (Cohen 1960) was calculated to measure the agreement between the outputs of the best lpSVM model and those of the best DNN model. Statistical analyses were performed using the R packages pROC and fmsb (Robin et al 2011, Nakazawa 2018).
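Two details of this evaluation are easy to get wrong and worth making explicit: the test set must be standardized with the training-set statistics (never its own), and the AUC is equivalent to the Mann-Whitney probability that a positive case outscores a negative one. A minimal Python sketch of both (the study itself used the R packages pROC and fmsb):

```python
import math

def zscore_fit(train_rows):
    """Per-column mean/SD computed from the TRAINING data only; the test set
    must reuse these statistics rather than its own."""
    stats = []
    for col in zip(*train_rows):
        m = sum(col) / len(col)
        sd = math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
        stats.append((m, sd))
    return stats

def zscore_apply(rows, stats):
    """Standardize any dataset (train or test) with the fitted statistics."""
    return [[(x - m) / sd for x, (m, sd) in zip(row, stats)] for row in rows]

def auc(scores, labels):
    """Mann-Whitney AUC: probability a positive outscores a negative
    (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```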

2.8.8. Model-agnostic explanations.

A machine-learning algorithm such as lpSVM is often straightforward to interpret: one can inspect which parameters were chosen in the trained model. In contrast, a deep learning algorithm such as a DNN is usually complex and is considered a 'black box' without an explicit interpretation of its outputs. To provide a quantitative, interpretable explanation of the predictions of the lpSVM and DNN classifiers on the test set, we used the local interpretable model-agnostic explanations (LIME) algorithm (Ribeiro et al 2016), which reveals the importance of features and their contribution to each predictive decision. This is accomplished by locally approximating the larger model with simpler models (such as linear or decision-tree models) that are conceptually easier to understand and interpret. In this study, we present case studies interpreting the prediction results and the underlying reasoning of our lpSVM and DNN models with the LIME algorithm on the test set. For local interpretation, the features supporting and contradicting a CWL prediction were plotted for each patient. Model-agnostic feature importance was also measured, by the increase in cross-entropy loss after feature dropout, to provide insight into the global behavior of our lpSVM and DNN models on the test set. The local and global model-agnostic procedures were implemented using the R packages lime and DALEX (Ribeiro et al 2016, Biecek 2018).
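The local-surrogate idea behind LIME can be sketched in a few lines. This illustrative Python fragment (the study used the R lime package; this is a simplified variant, not that implementation) perturbs an instance, weights the perturbations by proximity, and reads local feature contributions off a weighted linear fit. `predict_fn` is a hypothetical black-box model.

```python
import numpy as np

def lime_local_weights(predict_fn, x, n_samples=500, kernel_width=1.0, seed=0):
    """LIME-style local surrogate (simplified sketch): perturb the instance,
    weight perturbations by proximity to x, and fit a weighted linear model
    whose coefficients approximate local feature contributions."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))  # perturbations
    y = predict_fn(Z)                                        # black-box outputs
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)                # proximity kernel
    # Weighted least squares via row-scaling by sqrt(w)
    A = np.c_[np.ones(n_samples), Z] * np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A, y * np.sqrt(w), rcond=None)
    return coef[1:]  # per-feature local weights (intercept dropped)
```

For a model that is exactly linear near the instance, the recovered weights coincide with the model's own coefficients, which is the sanity check one would expect of any local surrogate.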

3. Results

Table 4 shows the results of the univariate logistic regression analysis. Only 4 GTV dosiomics features (2 L3 and 2 L2) and 9 ESO dosiomics features (4 L3, 3 L2 and 2 L1) were statistically significantly associated with CWL in the training set. Of the 13 significant features, GTV_D_GLDM_DependenceEntropy (OR = 1.951, 95% CI 1.413–2.786, adjusted p = 0.038, training AUC = 0.656 (95% CI 0.579–0.733)) showed the highest test AUC of 0.589 (95% CI 0.504–0.674), followed by GTV_D_GLRLM_RunEntropy (OR = 2.144, 95% CI 1.510–3.195, adjusted p = 0.038, training AUC = 0.670 (95% CI 0.594–0.745)) with a test AUC of 0.579 (95% CI 0.492–0.666) and ESO_D_GLSZM_SmallAreaHighGrayLevelEmphasis (OR = 2.002, 95% CI 1.460–2.823, adjusted p = 0.012, training AUC = 0.690 (95% CI 0.615–0.765)) with a test AUC of 0.573 (95% CI 0.488–0.658). No single feature achieved a test AUC higher than 0.6.

Table 4. Results of univariate logistic regression analysis. Only features that are statistically significantly associated with CWL in the training set are shown and their performances are evaluated on the training and test sets. Bonferroni adjusted p-values are used for multiple comparison correction, which are uncorrected p-values multiplied by the number of comparisons (n = 355).

Type ROI Class Name OR (95% CI) Adjusted p-value Training AUC (95% CI) Test AUC (95% CI)
D GTV GLSZM ZoneEntropy 1.909 (1.393, 2.687) 0.038 0.659 (0.582, 0.735) 0.522 (0.438, 0.607)
D GTV GLRLM RunEntropy 2.144 (1.510, 3.195) 0.022 0.670 (0.594, 0.745) 0.579 (0.492, 0.666)
D GTV GLDM DependenceEntropy 1.951 (1.413, 2.786) 0.038 0.656 (0.579, 0.733) 0.589 (0.504, 0.674)
D GTV GLCM Idn 1.893 (1.383, 2.660) 0.043 0.664 (0.588, 0.740) 0.552 (0.467, 0.637)
D ESO GLSZM SmallAreaHighGrayLevelEmphasis 2.002 (1.460, 2.823) 0.012 0.690 (0.615, 0.765) 0.573 (0.488, 0.658)
D ESO GLSZM HighGrayLevelZoneEmphasis 2.088 (1.517, 2.958) 0.005 0.695 (0.621, 0.769) 0.549 (0.464, 0.633)
D ESO GLRLM ShortRunHighGrayLevelEmphasis 1.992 (1.457, 2.792) 0.011 0.684 (0.609, 0.759) 0.552 (0.466, 0.638)
D ESO GLRLM HighGrayLevelRunEmphasis 1.960 (1.438, 2.737) 0.014 0.679 (0.603, 0.754) 0.556 (0.470, 0.642)
D ESO GLDM HighGrayLevelEmphasis 1.854 (1.367, 2.568) 0.041 0.667 (0.591, 0.743) 0.552 (0.466, 0.638)
D ESO GLDM SmallDependenceHighGrayLevelEmphasis 1.905 (1.402, 2.647) 0.023 0.664 (0.588, 0.741) 0.556 (0.469, 0.643)
D ESO GLCM Autocorrelation 1.848 (1.363, 2.557) 0.043 0.666 (0.590, 0.742) 0.554 (0.468, 0.640)
D ESO Histogram Energy 1.856 (1.366, 2.581) 0.046 0.674 (0.598, 0.750) 0.536 (0.451, 0.622)
D ESO Histogram TotalEnergy 1.955 (1.433, 2.731) 0.015 0.683 (0.608, 0.758) 0.552 (0.466, 0.639)

Table 5 compares the AUC performances of the lpSVM, DNN and ensemble classifiers on the training and test sets among the top 17 R&D input feature sets, sorted in descending order by the best test AUC among the three classifiers for each input feature set. The first to sixteenth entries are multi-view R&D input feature sets configured from the statistical texture feature categories (see table 3), while the seventeenth is the traditional single-view concatenated R&D input feature set (i.e. the full R&D features). Notably, L3 dosiomics features appear consistently in all of the top 17 R&D input feature sets. The best performance for the single-view R&D input feature set was obtained with the ensemble classifier, achieving a test AUC of 0.653 (95% CI 0.576–0.731). An optimal R&D (R&Dopt) input feature set was identified by the multi-view paradigm: the highest performance was achieved by the ensemble classifier for the GTV L1 + L2 + L3 radiomics and GTV&ESO L3 dosiomics input feature set, with a test AUC of 0.710 (95% CI 0.637–0.782) and a statistically significant improvement over the best classifier performance achieved with the full R&D (R&Dfull) input features (p < 0.05). The lpSVM classifier for the GTV L1 + L2 radiomics and GTV&ESO L2 + L3 dosiomics input feature set, with a test AUC of 0.702 (95% CI 0.628–0.776), also showed a statistically significant improvement over that baseline (p < 0.05).
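The multi-view search behind table 5 amounts to a small grid search over view combinations. A hedged Python sketch follows; in the actual pipeline (figure 1), the hypothetical `train_and_score` stand-in would wrap Boruta/VIF selection, classifier training and validation AUC scoring, and a "none" view can be included for dosiomics-only sets such as the '- / L3' row.

```python
from itertools import product

# Candidate views per feature family, as defined in the paper.
RAD_VIEWS = ["L1", "L2", "L3", "L1+L2", "L2+L3", "L1+L2+L3"]
DOS_VIEWS = ["L1", "L2", "L3", "L1+L2", "L2+L3", "L1+L2+L3"]

def best_view(train_and_score):
    """Train one model per (radiomics view, dosiomics view) pair and return
    the pair with the best validation score. `train_and_score(r, d)` is a
    hypothetical callable standing in for the full selection/training
    pipeline; it must return a scalar score such as a validation AUC."""
    results = {(r, d): train_and_score(r, d)
               for r, d in product(RAD_VIEWS, DOS_VIEWS)}
    return max(results, key=results.get)
```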

Table 5. Comparison of performances among top 17 R&D input feature sets as evaluated with the AUC values of lpSVM, DNN and ensemble classification models on the test set. The last row represents full R&D input features, which reflect traditional single-view concatenated data.

All values are AUC (95% CI).
Radiomics Dosiomics lpSVM Training lpSVM Testing DNN Training DNN Testing Ensemble Testing
L1 + L2 + L3 L3 0.705 (0.698, 0.711) 0.702 (0.627, 0.776) 0.829 (0.796, 0.854) 0.702 (0.628, 0.775) 0.710a (0.637, 0.782)
L1 + L2 L2 + L3 0.714 (0.708, 0.721) 0.702a (0.628, 0.776) 0.829 (0.798, 0.860) 0.669 (0.593, 0.745) 0.690 (0.616, 0.764)
L2 L3 0.702 (0.696, 0.709) 0.689 (0.614, 0.765) 0.820 (0.779, 0.861) 0.660 (0.581, 0.739) 0.687 (0.613, 0.762)
L2 + L3 L3 0.700 (0.693, 0.706) 0.689 (0.614, 0.764) 0.791 (0.742, 0.841) 0.665 (0.587, 0.744) 0.687 (0.611, 0.763)
L1 + L2 + L3 L2 + L3 0.693 (0.686, 0.699) 0.688 (0.612, 0.764) 0.829 (0.796, 0.862) 0.624 (0.543, 0.704) 0.676 (0.601, 0.751)
L2 + L3 L2 + L3 0.703 (0.697, 0.710) 0.685 (0.609, 0.762) 0.795 (0.746, 0.844) 0.659 (0.580, 0.737) 0.676 (0.600, 0.752)
L3 L1 + L2 + L3 0.700 (0.694, 0.707) 0.679 (0.604, 0.755) 0.794 (0.751, 0.836) 0.635 (0.555, 0.715) 0.674 (0.598, 0.749)
L2 + L3 L1 + L2 + L3 0.702 (0.695, 0.708) 0.678 (0.603, 0.754) 0.788 (0.746, 0.829) 0.600 (0.518, 0.681) 0.658 (0.582, 0.734)
L1 + L2 L3 0.718 (0.711, 0.724) 0.671 (0.593, 0.748) 0.792 (0.743, 0.841) 0.639 (0.558, 0.719) 0.672 (0.595, 0.749)
L1 L3 0.706 (0.700, 0.713) 0.668 (0.590, 0.745) 0.793 (0.742, 0.843) 0.623 (0.542, 0.704) 0.654 (0.576, 0.732)
L2 L2 + L3 0.696 (0.689, 0.704) 0.666 (0.589, 0.742) 0.811 (0.775, 0.848) 0.621 (0.541, 0.702) 0.653 (0.575, 0.730)
- L3 0.697 (0.690, 0.704) 0.665 (0.587, 0.742) 0.828 (0.789, 0.867) 0.636 (0.557, 0.715) 0.662 (0.585, 0.739)
L3 L2 + L3 0.709 (0.703, 0.716) 0.647 (0.568, 0.726) 0.779 (0.729, 0.829) 0.645 (0.564, 0.725) 0.663 (0.585, 0.741)
L1 L2 + L3 0.696 (0.688, 0.703) 0.662 (0.585, 0.740) 0.856 (0.823, 0.888) 0.628 (0.548, 0.708) 0.657 (0.580, 0.734)
L3 L3 0.709 (0.702, 0.715) 0.660 (0.581, 0.738) 0.796 (0.746, 0.847) 0.645 (0.564, 0.727) 0.660 (0.581, 0.738)
L2 L1 + L2 + L3 0.698 (0.690, 0.706) 0.660 (0.583, 0.737) 0.807 (0.773, 0.841) 0.633 (0.553, 0.712) 0.656 (0.579, 0.734)
L1 + L2 + L3 L1 + L2 + L3 0.714 (0.706, 0.721) 0.651 (0.574, 0.729) 0.795 (0.759, 0.832) 0.638 (0.559, 0.718) 0.653 (0.576, 0.731)

aStatistically significant performance improvement over the best AUC (=0.653) achieved with full R&D input features in testing (p < 0.05).

Table 6 compares the AUC performances of the lpSVM, DNN and ensemble classifiers on the training and test sets among the different input feature sets consisting of R&Dfull, DVH and/or CP features. The lpSVM classifier performed best with the combined R&Dfull and DVH (R&Dfull + DVH) feature set, with a test AUC of 0.655 (95% CI 0.576–0.733). The DNN and ensemble classifiers performed best with the R&Dfull feature set, with test AUCs of 0.638 (95% CI 0.559–0.718) and 0.653 (95% CI 0.576–0.731), respectively. The best performance of each of the three classifiers was statistically significantly higher than its performance with the CP input feature set (p < 0.05 for the lpSVM and ensemble classifiers; p < 0.01 for the DNN classifier), but not significantly higher than its performance with the DVH + CP or DVH input feature sets.

Table 6. Comparison of performances among different input feature sets consisting of full R&D, DVH or CP features, as evaluated with the AUC values of the lpSVM, DNN and ensemble classification models on the test set. Bold numbers indicate the highest AUC value for each classifier.

All values are AUC (95% CI).
Input feature sets lpSVM Training lpSVM Testing DNN Training DNN Testing Ensemble Testing
R&Dfull 0.714 (0.706, 0.721) 0.651 (0.574, 0.729) 0.795 (0.759, 0.832) 0.638 (0.559, 0.718) 0.653 (0.576, 0.731)
R&Dfull + DVH 0.718 (0.712, 0.725) 0.655 (0.576, 0.733) 0.792 (0.745, 0.840) 0.603 (0.522, 0.684) 0.648 (0.569, 0.726)
R&Dfull + DVH + CP 0.746 (0.739, 0.752) 0.649 (0.572, 0.726) 0.851 (0.815, 0.886) 0.598 (0.516, 0.680) 0.650 (0.573, 0.727)
R&Dfull + CP 0.741 (0.735, 0.747) 0.649 (0.572, 0.726) 0.859 (0.827, 0.891) 0.611 (0.531, 0.690) 0.645 (0.568, 0.722)
DVH 0.699 (0.693, 0.706) 0.620 (0.539, 0.701) 0.727 (0.679, 0.775) 0.617 (0.536, 0.698) 0.623 (0.543, 0.704)
DVH + CP 0.741 (0.736, 0.747) 0.595 (0.514, 0.677) 0.833 (0.794, 0.872) 0.605 (0.523, 0.686) 0.598 (0.517, 0.679)
CP 0.657 (0.649, 0.665) 0.535* (0.450, 0.620) 0.626 (0.577, 0.675) 0.503** (0.441, 0.565) 0.534* (0.449, 0.619)

Asterisks indicate statistically significant difference in comparison with the highest test AUC values for each classifier: * p < 0.05, ** p < 0.01.

Table 7 compares the AUC performances of the lpSVM, DNN and ensemble classifiers on the training and test sets among the different input feature sets consisting of R&Dopt, DVH and/or CP features. Each classifier performed best with the R&Dopt input feature set, with test AUCs of 0.702 (95% CI 0.627–0.776), 0.702 (95% CI 0.628–0.775) and 0.710 (95% CI 0.637–0.782) for the lpSVM, DNN and ensemble classifiers, respectively. The best performances of both the lpSVM and ensemble classifiers were statistically significantly higher than their performances with the CP (p < 0.01), DVH + CP (p < 0.01) and DVH (p < 0.05) input feature sets. The DNN classifier's best performance was statistically significantly higher than its performances with the CP (p < 0.001), DVH + CP (p < 0.05), DVH (p < 0.05), R&Dopt + DVH + CP (p < 0.05) and R&Dopt + DVH (p < 0.05) input feature sets. Thus, with the R&Dopt input feature set, the best performance of every classifier was statistically significantly higher than its performance with CP and/or DVH input feature sets. Note that all classifier models were built through our proposed machine learning pipeline (figure 1) using the different input feature sets. Figure 3 shows ROC curves demonstrating the statistically significant performance improvement in predicting CWL when the ensemble classifier with the R&Dopt input feature set was used.

Figure 3. Fitted binomial ROC curves comparing the performance of an lpSVM model for the CP input feature set, a DNN model for the combined DVH and CP (DVH + CP) input feature set, and three ensemble classifier models for the R&Dopt, R&Dfull and DVH input feature sets.

Table 7. Comparison of performances among different input feature sets consisting of optimal R&D, DVH or CP features, as evaluated with the AUC values of the lpSVM, DNN and ensemble classification models on the test set. Bold numbers indicate the highest AUC value for each classifier. Performances for the DVH, DVH + CP and CP feature sets are the same as in table 6.

All values are AUC (95% CI).
Input feature sets lpSVM Training lpSVM Testing DNN Training DNN Testing Ensemble Testing
R&Dopt 0.705 (0.698, 0.711) 0.702 (0.627, 0.776) 0.804 (0.753, 0.854) 0.702 (0.628, 0.775) 0.710 (0.637, 0.782)
R&Dopt + DVH 0.696 (0.690, 0.703) 0.695 (0.620, 0.770) 0.821 (0.782, 0.859) 0.631* (0.550, 0.711) 0.675 (0.600, 0.751)
R&Dopt + DVH + CP 0.745 (0.739, 0.751) 0.678 (0.603, 0.753) 0.876 (0.841, 0.911) 0.634* (0.554, 0.714) 0.677 (0.602, 0.752)
R&Dopt + CP 0.744 (0.738, 0.751) 0.649 (0.571, 0.726) 0.834 (0.791, 0.877) 0.627 (0.544, 0.710) 0.649 (0.571, 0.727)
DVH 0.699 (0.693, 0.706) 0.620* (0.539, 0.701) 0.727 (0.679, 0.775) 0.617* (0.536, 0.698) 0.623* (0.542, 0.704)
DVH + CP 0.741 (0.736, 0.747) 0.595** (0.514, 0.677) 0.833 (0.794, 0.872) 0.605* (0.523, 0.686) 0.598** (0.517, 0.679)
CP 0.657 (0.649, 0.665) 0.535** (0.450, 0.620) 0.626 (0.577, 0.675) 0.503*** (0.441, 0.565) 0.534** (0.449, 0.619)

Asterisks indicate statistically significant difference in comparison with the highest AUC values for each classifier: * p < 0.05, ** p < 0.01, *** p < 0.001.

Our feature selection procedure (Boruta followed by VIF) selected 13 R&D features from the R&Dopt input feature set. Figure 4 shows feature importance scores on the training set for the lpSVM model trained with these 13 features. Because the lpSVM model has no built-in importance score, each feature's AUC was computed as an alternative measure of feature importance (Kuhn 2008). The 13 selected R&D features, ranked in descending order of importance, were: ESO_D_glszm_SmallAreaHighGrayLevelEmphasis (AUC = 0.690), GTV_D_glrlm_RunEntropy (AUC = 0.670), ESO_D_glszm_SmallAreaLowGrayLevelEmphasis (AUC = 0.660), GTV_D_glszm_ZoneEntropy (AUC = 0.659), GTV_D_glszm_SizeZoneNonUniformity (AUC = 0.647), GTV_D_glszm_GrayLevelNonUniformityNormalized (AUC = 0.646), ESO_D_glrlm_LongRunHighGrayLevelEmphasis (AUC = 0.643), GTV_D_glszm_LowGrayLevelZoneEmphasis (AUC = 0.642), GTV_D_glszm_GrayLevelVariance (AUC = 0.638), GTV_R_gldm_LargeDependenceHighGrayLevelEmphasis (AUC = 0.628), GTV_R_gldm_LowGrayLevelEmphasis (AUC = 0.616), GTV_D_ngtdm_Strength (AUC = 0.610) and GTV_R_glszm_LowGrayLevelZoneEmphasis (AUC = 0.580). The majority of the selected features were dosiomics features (77%), which tended to rank higher in importance than radiomics features; ESO_D_glszm_SmallAreaHighGrayLevelEmphasis and GTV_D_glrlm_RunEntropy were the first and second most important features for the lpSVM model.
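The AUC-based importance used for the lpSVM model can be sketched directly. This illustrative Python fragment (in R, caret's `filterVarImp` plays this role; the direction-free flip `max(a, 1 - a)` is an assumption of this sketch, not a detail stated in the paper) scores each feature by its univariate AUC and ranks the features.

```python
def feature_auc(values, labels):
    """Univariate Mann-Whitney AUC of one feature, used as an importance
    score when the classifier has no built-in importance (sketch)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    a = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))
    return max(a, 1 - a)  # direction-free discrimination (assumed variant)

def rank_features(feature_matrix, labels, names):
    """Rank features (columns of feature_matrix) by univariate AUC."""
    scores = {n: feature_auc(col, labels)
              for n, col in zip(names, zip(*feature_matrix))}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```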

Figure 4. Feature importance scores on the training set for the lpSVM model trained with 13 R&D features selected by our feature selection procedure with use of the preidentified optimal R&D input feature set (R&Dopt). Each feature's AUC was computed as the measure of feature importance. ESO = esophagus, R = radiomics, D = dosiomics, glszm = gray level size-zone matrix, glrlm = gray level run-length matrix, ngtdm = neighboring gray tone difference matrix, and gldm = gray level dependence matrix.

Table 8 shows the values (mean ± SD) of the 13 R&D features selected by our feature selection procedure from the R&Dopt input feature set, for the training and test sets. In both datasets, ESO_D_glszm_SmallAreaHighGrayLevelEmphasis, GTV_D_glrlm_RunEntropy, GTV_D_glszm_ZoneEntropy, ESO_D_glrlm_LongRunHighGrayLevelEmphasis and GTV_R_gldm_LargeDependenceHighGrayLevelEmphasis were higher in the CWL group, whereas ESO_D_glszm_SmallAreaLowGrayLevelEmphasis, GTV_D_glszm_LowGrayLevelZoneEmphasis, GTV_R_gldm_LowGrayLevelEmphasis and GTV_R_glszm_LowGrayLevelZoneEmphasis were lower. In contrast, GTV_D_glszm_SizeZoneNonUniformity, GTV_D_glszm_GrayLevelNonUniformityNormalized, GTV_D_glszm_GrayLevelVariance and GTV_D_ngtdm_Strength did not show consistent trends between the training and test sets. ESO_D_glszm_SmallAreaLowGrayLevelEmphasis, GTV_D_glszm_SizeZoneNonUniformity, GTV_D_glszm_GrayLevelVariance, GTV_R_gldm_LowGrayLevelEmphasis and GTV_D_ngtdm_Strength had a standard deviation (SD) greater than the mean (i.e. a wide range of variation among patients).

Table 8. Parameter values (mean ± SD) of 13 R&D features selected by our feature selection procedure with use of the optimal R&D input feature set (R&Dopt).

  Training (n = 194) Test (n = 194)
R&D feature No CWL (n = 101) CWL (n = 93) No CWL (n = 118) CWL (n = 76)
GTV radiomics
GLDM
LargeDependenceHighGrayLevelEmphasis 168 990 ± 104 524 221 082 ± 122 364 117 512 ± 91 759.8 159 189 ± 103 795
LowGrayLevelEmphasis (0.294 ± 0.336) × 10⁻² (0.281 ± 0.472) × 10⁻² (1.240 ± 4.900) × 10⁻² (0.726 ± 1.580) × 10⁻²
GLSZM
LowGrayLevelZoneEmphasis (0.459 ± 0.445) × 10⁻² (0.429 ± 0.424) × 10⁻² (1.416 ± 5.995) × 10⁻² (0.747 ± 1.411) × 10⁻²
GTV dosiomics
GLRLM
RunEntropy 4.777 ± 0.468 5.141 ± 0.639 4.624 ± 0.563 4.660 ± 0.530
GLSZM
ZoneEntropy 6.555 ± 0.987 7.139 ± 0.941 6.221 ± 1.199 6.374 ± 1.121
SizeZoneNonUniformity 29.22 ± 116.4 170.5 ± 551.8 110.7 ± 906.0 41.76 ± 135.7
GrayLevelNonUniformityNormalized 0.084 ± 0.034 0.066 ± 0.034 0.105 ± 0.050 0.112 ± 0.061
LowGrayLevelZoneEmphasis 0.047 ± 0.043 0.030 ± 0.033 0.072 ± 0.069 0.071 ± 0.071
GrayLevelVariance 72.72 ± 510.8 146.1 ± 434.5 124.2 ± 724.9 30.00 ± 81.52
NGTDM
Strength 2.083 ± 16.27 12.67 ± 49.39 16.08 ± 115.7 0.862 ± 3.275
ESO dosiomics
GLRLM
LongRunHighGrayLevelEmphasis 25 329 ± 16 787 34 307 ± 18 623 21 611 ± 16 944 32 068 ± 21 166
GLSZM
SmallAreaHighGrayLevelEmphasis 15 626 ± 7521 20 418 ± 7091 14 059 ± 7790 17 825 ± 7910
SmallAreaLowGrayLevelEmphasis (1.422 ± 1.638) × 10⁻⁴ (0.932 ± 0.963) × 10⁻⁴ (1.719 ± 2.863) × 10⁻⁴ (1.232 ± 1.319) × 10⁻⁴

Figure 5 shows the loss and classification accuracy curves tracing the learning progress of the DNN model on the training and validation datasets over 3000 epochs, using the 13 R&D features selected from the R&Dopt input feature set. The DNN model converged gradually during training, reaching about 70% accuracy on both the training and validation datasets toward the end of training. The loss was also comparable on both datasets, although smaller on the training dataset than on the validation dataset. The validation loss reached a plateau toward the end of training, and the final training and validation losses averaged about 0.6.

Figure 5. Training history of classification loss and accuracy for the DNN model on the training and validation datasets with 13 selected R&D features among the R&Dopt input feature set.

The ensemble classifier of lpSVM and DNN outperformed each single classifier in predicting CWL when the R&Dopt input feature set was used. Figure 6 shows the agreement between the two classifiers. Although the level of agreement measured by Cohen's kappa was substantial (kappa = 0.676, p < 0.0001) according to the criteria of Landis and Koch (1977), it fell below the threshold for almost perfect agreement (kappa ⩾ 0.81), leaving moderate disagreement between the lpSVM and DNN classifiers; this residual disagreement likely explains why their fusion improves predictive performance.
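Both components of this analysis are compact enough to state explicitly: the averaged-probability ensemble and Cohen's kappa for two binary raters. A minimal Python sketch (the study computed kappa with the R package fmsb):

```python
def ensemble_prob(p_svm, p_dnn):
    """Averaged-probability ensemble of two classifiers' outputs (sketch)."""
    return [(a + b) / 2 for a, b in zip(p_svm, p_dnn)]

def cohens_kappa(y1, y2):
    """Cohen's kappa for two binary raters: observed agreement corrected
    for the agreement expected by chance."""
    n = len(y1)
    po = sum(a == b for a, b in zip(y1, y2)) / n   # observed agreement
    p1 = sum(y1) / n                               # rater 1 positive rate
    p2 = sum(y2) / n                               # rater 2 positive rate
    pe = p1 * p2 + (1 - p1) * (1 - p2)             # chance agreement
    return (po - pe) / (1 - pe)
```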

Figure 6. (a) Diagonal classifier agreement plot for the estimated probability of CWL in the test set, between the lpSVM and DNN classifiers trained with the 13 selected R&D features. Each point represents a patient for whom predictions were made. Points near the diagonal from bottom left to top right indicate high classifier agreement, while points far from the diagonal indicate low agreement. (b) Bland-Altman plot illustrating classifier agreement between the lpSVM and DNN classifiers. The x- and y-axes represent the averaged output and the difference (DNN output minus lpSVM output) of the two classifiers for the test set, respectively.

Figure 7 shows facetted heatmap-style plots of case-feature combinations for all test set samples (n = 194) for the lpSVM and DNN classifiers. All test cases are arranged horizontally (case numbers omitted), and the categorized features are listed vertically. The two outcome events for CWL are denoted 'No' and 'Yes' in the left and right panels, respectively. In each panel, the dashed line divides cases into CWL (left) and no-CWL (right) groups. Feature weights are indicated by color: positive (green) weights indicate a feature supports the outcome, and negative (red) weights indicate it contradicts the outcome. LIME explains the predictions of the lpSVM and DNN classifiers by approximating them locally with interpretable models whose predictions can be summarized by simple if-else rules, and visualizes which features drive those predictions. ESO_D_glszm_SmallAreaHighGrayLevelEmphasis and GTV_D_glrlm_RunEntropy showed relatively distinguishable patterns between the two outcomes for both the lpSVM and DNN models. For the lpSVM model, ESO_D_glszm_SmallAreaHighGrayLevelEmphasis ⩽ −0.529 tended to appear in green for no-CWL cases in the left panel, and ESO_D_glszm_SmallAreaHighGrayLevelEmphasis > 0.723 for CWL cases in the right panel. For the DNN model, GTV_D_glrlm_RunEntropy ⩽ −0.595 tended to appear in green for no-CWL cases and in red for CWL cases, whereas GTV_D_glrlm_RunEntropy > 0.404 tended to appear in green for CWL cases. Note that the feature values in the plots are standardized to zero mean and unit variance.

Figure 7. Heatmaps showing how 13 selected R&D features influence each case in the test set (194 cases), by means of local interpretable model-agnostic explanations (LIME). (a) lpSVM. (b) DNN. All test sample cases are visualized horizontally (case numbers omitted), and categorized features are displayed vertically.

Figure 8 shows the three major supporting or contradicting features in the CWL predictions for four selected cases in the test set, by means of LIME for the lpSVM and DNN classifiers. The upper left and lower right panels show example CWL cases on which lpSVM and DNN disagree (DNN is correct for the upper left case, lpSVM for the lower right). The upper right and lower left panels show example CWL and no-CWL cases, respectively, on which the two classifiers agree and correctly predict the outcome. Although the feature importance differed for each individual case, the case studies demonstrate that ESO_D_glszm_SmallAreaHighGrayLevelEmphasis and GTV_D_glrlm_RunEntropy play an important role in predicting in advance whether a patient will experience CWL during or after the course of RT.

Figure 8. Model-agnostic feature importance plots of the three major supporting or contradicting features for four selected cases in the test set, by means of local interpretable model-agnostic explanations for the lpSVM and DNN classifiers. Features in green support the outcome and features in red contradict it. The length of each bar is proportional to the feature's weight. The zoomed-out scatter plot is the diagonal classifier agreement plot shown in figure 6(a).

Figure 9 shows model-agnostic feature importance, measured by the cross-entropy loss from feature dropout in the test set, for our lpSVM and DNN models with the 13 selected R&D features. The bar edges correspond to the difference between the loss after permuting a single feature and the loss of the full model. In the lpSVM model, ESO_D_glszm_SmallAreaHighGrayLevelEmphasis was deemed most important, followed by ESO_D_glrlm_LongRunHighGrayLevelEmphasis. In the DNN model, GTV_D_glrlm_RunEntropy was deemed most important, followed by GTV_D_glszm_GrayLevelNonUniformityNormalized. In both models, dosiomics features tended to be more important than radiomics features. In the lpSVM model, ESO dosiomics features tended to be more important than GTV dosiomics features, whereas the opposite held in the DNN model.
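The dropout-importance measure of figure 9 can be sketched as permutation importance under cross-entropy loss. This is an illustrative Python fragment (the study used the R package DALEX); `predict_fn` is a hypothetical fitted model returning class-1 probabilities.

```python
import math, random

def log_loss(y, p, eps=1e-12):
    """Mean binary cross-entropy between labels y and probabilities p."""
    return -sum(yi * math.log(max(pi, eps)) + (1 - yi) * math.log(max(1 - pi, eps))
                for yi, pi in zip(y, p)) / len(y)

def dropout_importance(predict_fn, X, y, n_features, seed=0):
    """Importance of each feature as the increase in cross-entropy loss after
    permuting that feature's column (a DALEX-style 'feature dropout' sketch)."""
    rng = random.Random(seed)
    base = log_loss(y, predict_fn(X))
    scores = {}
    for j in range(n_features):
        Xp = [row[:] for row in X]
        col = [row[j] for row in Xp]
        rng.shuffle(col)                 # break the feature-outcome link
        for row, v in zip(Xp, col):
            row[j] = v
        scores[j] = log_loss(y, predict_fn(Xp)) - base
    return scores
```

A feature the model ignores (here, a constant column) yields exactly zero loss increase, while an informative feature's permutation can only degrade, never improve, an already-aligned fit.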

Figure 9. Model-agnostic feature importance in the test set compared between the DNN and lpSVM models. Feature importance is estimated by the cross-entropy loss from feature dropout; bar edges correspond to the difference between the loss after permuting a single feature and the loss of the full model.

4. Discussion

Our results showed that L3 dosiomics signatures play a major role in predicting CWL in lung cancer patients treated with RT. Although features generated from L1 statistics provide information about the gray level distribution of an image, they carry no information about the relative positions of gray levels within the image. L1 features therefore cannot indicate whether all low-value gray levels are clustered together or interleaved with high-value gray levels. When features based on L1 statistics do not suffice, L2 statistics can improve texture discrimination. Because all L1 statistical information of an image is captured in the histogram signatures, an obvious extension is to analyze features that depend on the spatial relationships between the gray levels of the image, describing its L2 statistics. L2 statistical features are considered important because the human visual system can identify different textures only if their L2 statistics differ: textures with similar L1 statistics but different L2 statistics are easily discriminated, whereas textures that differ in L3 statistics but share the same L1 and L2 statistics cannot be recognized by the human visual system (Julesz 1975). Because human visual perception does not necessarily use L3 statistics to discriminate iso-L2 textures, L3 features may detect subtle statistical differences that a human observer cannot identify. The measurement of L3 statistics might therefore help to quantitatively account for the differences in dose distribution pattern between RT patients with and without CWL in this study.
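A toy example makes the L1-versus-L2 distinction concrete: the two 1-D 'images' below share an identical gray-level histogram (identical L1 statistics) yet differ in spatial arrangement, and a second-order co-occurrence statistic such as GLCM contrast separates them. This is a minimal illustrative sketch, not code from the study.

```python
def glcm_contrast(img):
    """Contrast of the offset-1 co-occurrence matrix of a 1-D image:
    sum over gray-level pairs (i, j) of (i - j)^2 weighted by the
    normalized co-occurrence frequency."""
    counts = {}
    for a, b in zip(img, img[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    total = sum(counts.values())
    return sum(((i - j) ** 2) * c / total for (i, j), c in counts.items())

clustered   = [0, 0, 0, 0, 1, 1, 1, 1]   # low gray levels grouped together
interleaved = [0, 1, 0, 1, 0, 1, 0, 1]   # same histogram, but alternating
```

Both images contain four 0s and four 1s, so every L1 (histogram) feature is identical; the co-occurrence contrast, however, is low for the clustered image and maximal for the interleaved one.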

IMRT treatments deliver complex dose distribution shapes that lead to numerous regions containing steep dose gradients, even within target volumes as well as near critical structures, in an optimized 3D configuration (Low et al 2011). Understanding the dosiomics features that measure these dose distribution patterns is therefore critical for safe IMRT implementation. Our results showed that ESO_D_glszm_SmallAreaHighGrayLevelEmphasis and GTV_D_glrlm_RunEntropy were the most important dosiomics features for predicting CWL in patients treated with RT (see figures 4 and 9), and both features were higher in the CWL group than in the no-CWL group (see table 8). The small area high gray level emphasis (SAHGLE) measures the proportion of the joint distribution of smaller zones with higher gray-level values within a volume of interest. A smaller number of connected voxels sharing the same high gray-level intensity yields a higher SAHGLE value; thus, patients with CWL might have received a higher radiation dose (see table 2) to a smaller spot of the ESO during RT than those without CWL. The run entropy (RE), on the other hand, measures the uncertainty or randomness in the distribution of run lengths and gray levels, with a higher RE value indicating more heterogeneous texture patterns; thus, patients with CWL might have received a more heterogeneous radiation dose to the GTV during RT. In general, dose inhomogeneity increases as the required dose gradient between the target and an adjacent critical structure increases, the concavity of the required dose distribution increases, the distance between the target and a critical structure decreases, and the number of available beam directions decreases (Deye et al 2005).
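The two dosiomics features highlighted here have standard texture definitions. Assuming the feature names follow the pyradiomics convention (an assumption; the extraction software is specified elsewhere in the paper), they can be written as:

```latex
% GLSZM: P(i,j) counts zones of gray level i and zone size j; N_z = total zones
\mathrm{SAHGLE} \;=\; \frac{1}{N_z}\sum_{i}\sum_{j}\frac{P(i,j)\,i^{2}}{j^{2}}

% GLRLM: p(i,j) = P(i,j)/N_r is the normalized run-length matrix
\mathrm{RE} \;=\; -\sum_{i}\sum_{j} p(i,j)\,\log_{2}\!\bigl(p(i,j)+\epsilon\bigr)
```

SAHGLE grows when high gray levels (here, high doses) concentrate in small zones, while RE grows as the run-length distribution becomes more heterogeneous, which is why both behave as markers of small hot spots and dose inhomogeneity in this analysis.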
In addition, IMRT doses are often delivered by dividing beams into smaller sections, called beamlets, through the use of a multileaf collimator, and the dose intensity of each beamlet can be modulated by its duration of exposure. IMRT thus makes it possible to construct a heterogeneous dose deposition within the target volume, so that tumors containing heterogeneous cell populations can receive a higher dose in areas of increased malignant-cell density or in pockets of radiation-resistant cells. Although an inhomogeneous dose with a higher central dose may improve local control in such cases, the increased local control may come with an increased risk of esophageal toxicity (Rose et al 2009). A trade-off between a highly conformal, uniform dose to the target and lower doses to organs at risk is often required, and these decisions are guided by the risks associated with different doses to certain volumes of organs at risk and by the possibility of different treatment outcomes (Craft et al 2005). To reach a good compromise, it is important to balance target dose conformity against dose homogeneity for each patient in IMRT planning (Mundt and Roeske 2005). Because this balancing is imperfect in clinical practice, cold or hot spots can arise in unexpected locations that are not easily appreciated on DVHs (Deye et al 2005). As demonstrated in this study, dosiomics signatures could potentially help to unfold these geometric uncertainties. For example, one might conjecture from our results that patients with CWL received a more inhomogeneous dose to the GTV, thereby creating inconspicuous small hot spots outside the target in the adjacent ESO, given that the CWL group had a higher SAHGLE on the ESO dose distribution map as well as a higher RE on the GTV dose distribution map in the RT planning. To mitigate the incidence of CWL during RT, improved target-volume dose homogeneity and better sparing of critical structures may be required.

The present study demonstrated that multi-view R&D analysis can improve the predictive performance for CWL in patients treated with RT compared with single-view R&D analysis. In the multi-view paradigm, R&D features were represented by multiple distinct feature groups configured from the statistical texture feature categories, and each feature group was referred to as a particular view. To select an optimal R&D view, we compared the classification performances of feature selection solutions by exploiting information from different views of the same input data. Our results showed that the R&Dopt input feature set (L1 + L2 + L3 radiomics features for GTV and L3 dosiomics features for GTV&ESO) led to a significant performance improvement over the R&Dfull input feature set (L1 + L2 + L3 radiomics features for GTV and L1 + L2 + L3 dosiomics features for GTV&ESO). To increase prediction accuracy, one typically measures more features to describe the characteristics of an object in finer detail. In pattern recognition systems based on machine learning algorithms, however, this approach helps only up to a point, beyond which it becomes counterproductive: additional features increase the complexity of the recognition system and introduce more parameters to be estimated. The corresponding estimation errors can harm the achievable classification performance once the feature size reaches the point where the peaking phenomenon occurs, i.e. the error of a designed classifier first decreases and then increases as the number of features grows (Trunk 1979, Hughes 1968). This peaking phenomenon can affect all classifiers, in a manner that depends on the feature-outcome distribution, and motivates the need for feature selection.
Although the steepness of the peaking phenomenon can be mitigated by feature selection, the peaking behavior in the presence of feature selection can vary widely depending on the combination of feature selection algorithm and classifier model (Sima and Dougherty 2008). Hence, a combinatorial search for optimal feature subsets across multiple views, which also indicates which view provides the best classification performance as input, may be of substantial benefit for optimizing a feature-outcome model in high-throughput image data analysis.
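The multi-view search described above can be sketched in a few lines: enumerate the six candidate views used in this study (L1, L2, L3, L1 + L2, L2 + L3 and L1 + L2 + L3) and keep the view whose selected features score best under some evaluation criterion. This is only a structural sketch, not the study's implementation; the feature names and the toy scoring function are hypothetical placeholders.

```python
def build_views(l1, l2, l3):
    """Group features into the six candidate views evaluated in the
    multi-view search: the three single-order views plus the three
    combined groups."""
    return {
        "L1": l1,
        "L2": l2,
        "L3": l3,
        "L1+L2": l1 + l2,
        "L2+L3": l2 + l3,
        "L1+L2+L3": l1 + l2 + l3,
    }

def best_view(views, score_fn):
    """Evaluate each candidate view with a user-supplied scoring
    function (in practice, e.g. the validation AUC of a classifier
    trained on that view's selected features) and return the winner."""
    scores = {name: score_fn(features) for name, features in views.items()}
    winner = max(scores, key=scores.get)
    return winner, scores

# Hypothetical feature names, for illustration only.
views = build_views(
    l1=["GTV_R_firstorder_Mean"],
    l2=["GTV_R_glcm_Contrast"],
    l3=["GTV_D_glrlm_RunEntropy",
        "ESO_D_glszm_SmallAreaHighGrayLevelEmphasis"],
)
```

In the actual pipeline, `score_fn` would wrap the Boruta-plus-VIF selection and classifier training applied to each view's features.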

In a high-dimensional feature space, feature selection algorithms need to be used alongside classifier design. The most widely accepted benefit of feature selection is that it helps improve accuracy and reduce model complexity: it removes redundant and irrelevant features to reduce the input dimensionality, and aids understanding of the underlying mechanism that connects effective features with the outcome. To avoid an exhaustive search, which is feasible only when very few features are available, many suboptimal algorithms have been proposed for feature selection. The feature selection methods most commonly used in radiomic studies can be divided into three broad categories: filter, wrapper, and embedded. Among these, most radiomic studies have made use of filter methods (Parmar et al 2015, Wu et al 2016, Sun et al 2018), because they are simple to implement and computationally fast. Filter methods are feature ranking techniques that evaluate the relevance of features by examining the intrinsic properties of the data, independent of the classification algorithm (Saeys et al 2007). A suitable ranking criterion is used to score the features, and features scoring below a threshold are removed. However, filter methods ignore the interaction with the classifier, i.e. the search in the feature subset space is separated from the search in the hypothesis space. They may also easily filter out information useful for the classification task, especially when a high-dimensional feature vector is fed to the feature selection algorithm: if only a small subset of features is chosen, a substantial amount of information is lost and the heterogeneity of the data cannot be well represented. On the other hand, some radiomic studies have performed feature selection based on embedded methods (Krafft et al 2018, Lafata et al 2019). These methods are a catch-all group of techniques that perform feature selection as part of the model construction process.
A representative of these approaches is the least absolute shrinkage and selection operator (LASSO) method for constructing a linear model (Tibshirani 1996). Although the LASSO method can efficiently find a sparse model among a large number of features, a drawback of deriving insight from LASSO-selected features is that it arbitrarily selects a few representatives from groups of correlated features, and the number of features selected depends heavily on the regularization strength. The resulting models are thus unstable, because changes in the training subset can result in different selected features. In this study, we used the Boruta feature selection algorithm, a wrapper-based all-relevant feature selection method. Wrapper methods use the classifier as a black box to score subsets of features according to their predictive power (Guyon and Elisseeff 2003). They consider candidate subsets and evaluate their merit by building a classification model with only the selected features and assessing the performance of that model. Instead of defining a subset of features with minimal error, the Boruta algorithm tries to find all features relevant for prediction. The advantage of the Boruta strategy is that it clearly decides whether a feature is important or not and selects features that are statistically significant, through multiple repetitions of the RF procedure. In addition, the Boruta approach is sensitive in detecting causal features while controlling the number of false-positive findings at a reasonable level, and can be applied in both high- and low-dimensional data settings (Degenhardt et al 2019). It therefore has the ability to distinguish truly important features from those that gain importance through random correlations in the data.
It should also be recognized that, in this study, the ability of the Boruta algorithm to identify important features was refined by subsequently applying the VIF-based collinearity filter, and was reinforced by handling feature heterogeneity through the multi-view data analysis.
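The VIF screening step that follows Boruta can be sketched with a plain-NumPy implementation: the VIF of feature j is 1/(1 − R²_j), where R²_j comes from regressing column j on the remaining columns, and features are iteratively dropped while the largest VIF exceeds a cut-off. Note that the paper does not report its VIF cut-off; the threshold of 10 below is a common rule of thumb, not the study's setting, and this sketch replaces the study's actual (unstated) implementation.

```python
import numpy as np

def vif_scores(X):
    """Variance inflation factor for each column of X:
    VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j
    on all remaining columns (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    scores = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])     # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        scores[j] = 1.0 / max(1.0 - r2, 1e-12)        # guard near-perfect fits
    return scores

def drop_collinear(X, names, threshold=10.0):
    """Iteratively drop the feature with the largest VIF above threshold."""
    keep = list(range(np.asarray(X).shape[1]))
    while len(keep) > 1:
        v = vif_scores(np.asarray(X)[:, keep])
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        keep.pop(worst)
    return [names[i] for i in keep]
```

Dropping the single worst offender and re-scoring, rather than removing all high-VIF features at once, matters because the VIFs of the remaining features change after each removal.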

Regarding feature classification, our results showed that the ensemble classifier for the R&Dopt input feature set, which combined the outputs from the lpSVM and DNN classifiers, yielded the best classification performance. This may be because the information provided by the two classifiers is complementary, an observation supported by the level of agreement, measured by the kappa coefficient, between their outputs. It should be noted that, although the kappa coefficient (=0.676) indicated substantial agreement between the two classifiers according to the criteria of Landis and Koch, it pointed to unacceptable agreement according to the norm of Spitzer and Fleiss (Spitzer and Fleiss 1974). In addition, the Automotive Industry Action Group (AIAG) (2010) suggests that a kappa value of at least 0.75 indicates good agreement, with larger values, such as 0.90, preferred. The main advantage of classifier fusion is the exploitation of the specific strengths of different classifiers in terms of their suitability for different forms of classification problems (Ho et al 1994). The achievable performance of the fusion method depends mainly on the selection of the most diverse and accurate single base classifiers (Ho et al 1994). No classifier is best for all problems, and the performance of an individual classifier depends on characteristics inherent to the classified data set (Ali and Smith 2006). A major drawback of decision fusion is the possibility of impeding the overall system performance by combining different single classifiers. This adverse effect of classifier fusion occurred in our results for some input feature sets (see tables 5–7), which demonstrates that the best-performing classifier can change according to the type of input features used.
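The averaged ensemble and the agreement check described above amount to soft-voting over the two classifiers' predicted CWL probabilities and computing Cohen's kappa between their hard labels. A minimal pure-Python sketch follows; the 0.5 decision threshold and the example probabilities are illustrative assumptions, not values from the study.

```python
def cohen_kappa(y1, y2):
    """Cohen's kappa between two label sequences:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(y1)
    po = sum(a == b for a, b in zip(y1, y2)) / n
    labels = set(y1) | set(y2)
    pe = sum((y1.count(c) / n) * (y2.count(c) / n) for c in labels)
    return (po - pe) / (1.0 - pe)

def averaged_ensemble(p_svm, p_dnn, threshold=0.5):
    """Soft-vote: average the two classifiers' predicted probabilities,
    then threshold to obtain the ensemble's hard labels."""
    probs = [(a + b) / 2.0 for a, b in zip(p_svm, p_dnn)]
    labels = [int(p >= threshold) for p in probs]
    return labels, probs
```

A near-zero or negative kappa between two base classifiers indicates high diversity; a kappa near 1 means the second classifier adds little complementary information to the fusion.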

Our comparative results showed that the predictive ability of the R&Dopt features significantly outperformed the CP, DVH and combined (DVH + CP) features (see table 7). Meanwhile, the DVH features performed better than the CP or DVH + CP features. Considering that dosiomics features were more important than radiomics features in our study, these results demonstrate that predictive ability can improve with the use of dose distribution information. The application of the dosiomics method is not limited to our current CWL prediction. Indeed, recent studies on other RT outcomes, such as radiation pneumonitis (Liang et al 2019) and gastrointestinal and genitourinary toxicities in prostate cancer RT (Rossi et al 2018), have also shown that dosiomics features effectively improve prediction performance compared to DVH features. Because DVH features carry no information about the spatial relationship of voxels, the same Vx (or Dx) value may be obtained from spatially separated or connected voxels. Therefore, enhanced performance in predicting RT outcome may be achieved by characterizing the spatial relationships of the dose distribution. On the other hand, when R&Dfull features were used instead of R&Dopt features, the best predictive performance was achieved with R&Dfull + DVH features (see table 6), which exceeded the performance of R&Dfull features alone. From these results, one might mistakenly conclude that DVH features should be combined with R&D features to achieve the best performance. As such, using traditional single-view concatenated R&Dfull features may lead to suboptimal conclusions.
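The point that DVH metrics discard spatial arrangement can be made concrete: Vx (the percent of the structure volume receiving at least x Gy) depends only on the multiset of voxel doses, so any spatial permutation of the same voxel values yields an identical Vx. The sketch below implements the two standard DVH metrics; the dose values in the test are illustrative only.

```python
import math

def v_x(dose_voxels, x):
    """DVH Vx: percent of the structure volume receiving at least x Gy."""
    return 100.0 * sum(d >= x for d in dose_voxels) / len(dose_voxels)

def d_x(dose_voxels, x):
    """DVH Dx: minimum dose received by the hottest x% of the volume."""
    ranked = sorted(dose_voxels, reverse=True)
    k = max(math.ceil(len(ranked) * x / 100.0) - 1, 0)
    return ranked[k]
```

Two dose maps whose high-dose voxels are connected in one case and scattered in the other produce identical Vx and Dx values, whereas dosiomics texture features computed on the 3D dose map would distinguish them.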

Model interpretation is important in the context of human decision-making based on outcome prediction with a machine learning method. Because of the black-box nature of complex models, many machine learning methods (especially deep learning) are limited in providing meaningful interpretations that enhance understanding of how particular decisions are made. To mitigate this problem, we sought to explain the lpSVM and DNN models using model-agnostic methods, which explain the predictions of arbitrary machine learning models independent of their implementation; disclosing clinically trusted inference in this way would be valuable for patient management. In this study, we used the LIME algorithm to provide an explanation for each individual prediction result in the test set (see figures 7 and 8). Feature importance can differ from case to case because LIME approximates the classifier's decision boundary around the neighborhood of a given input via linear regression (Ribeiro et al 2016). Thus, the LIME model does not have to work well globally; it only needs to approximate the black-box model locally. To capture the non-linear behavior of a model, the LIME framework is constructed using if-then rules (Marco et al 2018). In our study, interpreting the classifier's behavior was still not easy even with LIME, because the 13 selected R&D features still formed a high-dimensional input. Of those 13 features, a couple of the most important ones appeared clinically intuitive. Certainly, quantitative prediction should be performed using a fixed model built with all selected features.
Nevertheless, a further study may be necessary to develop a clinically intuitive quality assurance method that reports only a few key metrics relevant to the potential impact on improved tumor control and normal tissue protection, keeping in mind that a physician typically makes an intuitive judgment based on his or her experience and various clinical and test data. From our case studies, ESO_D_glszm_SmallAreaHighGrayLevelEmphasis ⩽ −0.529 and GTV_D_glrlm_RunEntropy ⩽ −0.595 may be suggestive of no CWL based on their standardized values, while ESO_D_glszm_SmallAreaHighGrayLevelEmphasis > 0.723 and GTV_D_glrlm_RunEntropy > 0.404 may imply a potential for CWL. In the training set, the actual mean values of ESO_D_glszm_SmallAreaHighGrayLevelEmphasis and GTV_D_glrlm_RunEntropy were 17 923.24 and 4.951311, and their SDs were 7683.277 and 0.5843308, respectively. On the raw scale, these cut points therefore correspond to ESO_D_glszm_SmallAreaHighGrayLevelEmphasis ⩽ 13 858.79 and GTV_D_glrlm_RunEntropy ⩽ 4.603634 for no CWL, and ESO_D_glszm_SmallAreaHighGrayLevelEmphasis > 23 478.25 and GTV_D_glrlm_RunEntropy > 5.187381 for CWL. These dosiomics features also had strong global influence on predicting CWL in the test set (see figure 9). Note that the global feature importance may look very different from the local feature importance for individual predictions because of the nonlinearity of the prediction model. The global feature importance measure works by calculating the increase in the model's prediction error after permuting a feature (Fisher et al 2018). A feature is important if permuting its values increases the model error, because the model depends on that feature for the prediction; conversely, a feature is unimportant if permuting its values leaves the model error unchanged, because the model ignores the feature.
This model-agnostic approach enabled the exploration and interpretation of a complex black-box model such as a DNN, which is described by a large number of parameters that make it hard to understand. Such a technique may increase the transparency of, and trust in, a complex model by providing easily understandable explanations of the contribution of important features that are consistent with medical understanding, and may promote data-driven personalized medicine from a machine learning perspective. Furthermore, it is of great interest in the machine learning community how to present the selected important features in a more easily interpretable form, e.g. by visualizing differences in images, to help clinicians intuitively interpret them and infer potential reasons. Our interpretations of the selected features may lead to new hypothesis generation worth testing through larger clinical studies, which is the topic of our future research.
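The conversion from the standardized LIME cut points to the raw-scale feature values quoted above is simply raw = mean + z × SD, using the training-set mean and SD of each feature. As an arithmetic check (all numbers are the paper's reported values):

```python
def destandardize(z, mean, sd):
    """Map a standardized (z-score) threshold back to the raw feature scale."""
    return mean + z * sd

# ESO_D_glszm_SmallAreaHighGrayLevelEmphasis (training mean 17923.24, SD 7683.277)
eso_no_cwl = destandardize(-0.529, 17923.24, 7683.277)   # ~13858.79 -> no CWL
eso_cwl    = destandardize( 0.723, 17923.24, 7683.277)   # ~23478.25 -> CWL

# GTV_D_glrlm_RunEntropy (training mean 4.951311, SD 0.5843308)
gtv_no_cwl = destandardize(-0.595, 4.951311, 0.5843308)  # ~4.603634 -> no CWL
gtv_cwl    = destandardize( 0.404, 4.951311, 0.5843308)  # ~5.187381 -> CWL
```

Note that this inverse mapping is only valid with the training-set statistics; applying test-set statistics would shift the cut points.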

In this study, the univariate logistic regression analysis does not consider interactions between features, and thus cannot assist in determining which feature combination is optimal. By contrast, the optimal features selected by the multi-view analysis were obtained with a multivariate method, the Boruta algorithm, which evaluates subsets of features and allows possible interactions between features to be detected. In addition, the VIF step could exclude highly collinear features selected by Boruta. The features selected by the univariate logistic regression analysis could therefore differ from the optimal subset selected by the multi-view analysis. Although the two feature sets differed from each other, we found that dosiomics features had a relatively stronger influence on CWL than radiomics features, because all features selected by the univariate logistic regression analysis were dosiomics features. Of these, features selected in both the univariate and multivariate analyses, such as ESO_D_glszm_SmallAreaHighGrayLevelEmphasis, GTV_D_glrlm_RunEntropy and GTV_D_glszm_ZoneEntropy, may be relatively robust compared to those selected in only one of the two analyses.

Our study has several potential limitations. First, the training and test cohorts were temporally disjoint, which may cause problems if there were temporal changes in image acquisition protocols, treatment strategy, or patient characteristics. However, this study was conducted without any changes in the CT scan protocol used for RT planning and delivery between the training and test cohorts, and no significant difference was found in clinical conditions between the two cohorts. In addition, temporal validation is considered stronger than a random split according to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guideline (Moons et al 2015). Second, choosing a classifier model and optimizing the hyperparameters with a single split of the data into training and test sets might yield biased results that rely solely on that single split. Nested cross-validation (NCV) may provide a more reliable criterion for choosing the best model and may reduce this bias. However, the NCV procedure does not yield a single best set of hyperparameter values, because each training set of the outer CV may select a different set of optimal hyperparameters. The method provides an unbiased estimator for choosing the best classifier system, but it does not determine the operational hyperparameter values (Wainer and Cawley 2018), and it remains vulnerable to overfitting if the entire dataset is repartitioned into new training and test sets (Skocik et al 2016). In addition, NCV procedures are probably not needed when selecting among a set of classifier algorithms (e.g. RF, SVM, etc), provided they have only a limited number of hyperparameters to be tuned (Wainer and Cawley 2018). Moreover, NCV is rarely used for evaluating DNN models because of its greater computational expense.
A resampling method such as CV or bootstrapping is a model selection mechanism mainly used for selecting hyperparameters. Changing hyperparameters affects the number of model parameters (weights and biases) in a DNN; for instance, increasing the number of layers, one of the DNN hyperparameters, can introduce thousands of additional parameters, depending on the number of nodes per layer. The number of layers can in theory take any value from one upward, which means that an unbounded number of models can be generated just by changing the hyperparameters. Because resampling has nothing to do with the DNN model parameters themselves, it has limits in optimizing a DNN model; optimizing the DNN parameter values is handled by the backpropagation algorithm. Thus, instead of performing CV or bootstrapping, we used a random subset of the training set as a holdout for validation while training the DNN model, and evaluated its performance on the test set unseen by the model, which is the most common practice in the deep learning community. Third, the hyperparameters (e.g. batch size, number of epochs, learning rate, dropout rate, etc) of the DNN were tuned empirically. There are various ways to optimize hyperparameters, from manual trial and error to sophisticated algorithmic methods, but there is no consensus on what works best. Fourth, we acknowledge that our R&D feature analysis was limited to a single bin width for both CT (25 HU) and the 3D dose map (25 cGy). Although feature values change with bin width, bin width is known to have only a marginal effect on the total number of stable radiomic features of CT images when comparing different scanners, slice thicknesses or exposures (Larue et al 2017). Concerning the bin width for the 3D dose map, to the best of our knowledge, no study has yet investigated the influence of gray level discretization on dosiomic feature stability.
A previous dosiomics study on predictive modeling of gastrointestinal and genitourinary toxicities in prostate cancer RT discretized dose into 70 levels from 0 to 70 Gy, i.e. a bin width of 1 Gy. In our study, a smaller bin width of 25 cGy was applied to reveal subtle differences between 3D dose map images. Although further research is needed to investigate how the prognostic value of R&D features for CWL prediction depends on bin width, our multi-view analysis results successfully demonstrated the importance of dosiomics features for predicting CWL in lung cancer RT patients. Fifth, this study did not include R&D features extracted from filtered images. Given the limited number of patients available to train and test the model, adding filtered-image features to the already large R&D feature set made the prediction model markedly unstable and did not improve the prediction performance. We therefore tried to minimize the number of extracted features and focused only on features from the original images, as the goal of the present study was to demonstrate that the multi-view strategy outperforms the conventional single-view concatenating strategy in the analysis of R&D features. Nonetheless, there may be informative features among those extracted from filtered images that could improve prediction performance, which will be a topic of our future research as more patient data are added to the study. Sixth, the handcrafted R&D features used in this study may not fully reflect the unique characteristics of a particular RT structure, although they provide domain knowledge that is necessary either to explain the scientific outcome or to derive scientific findings from the model. Combining knowledge obtained from handcrafted features with intrinsic features extracted by a convolutional neural network may enhance the performance of a predictive model (Li et al 2019).
Finally, although this study was carried out with a multi-view feature analysis strategy, our machine learning method ultimately learns from the single view that was optimally chosen by the proposed multi-view analysis. Future studies should explore multi-view learning, which learns from multiple views simultaneously to improve generalization performance by means of data fusion or integration of multiple feature sets (Sun 2013, Cao et al 2019).
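The validation scheme discussed in these limitations, a temporal train/test split with a random holdout carved from the training pool for monitoring DNN training, can be sketched as follows. The 20% holdout fraction is an assumption for illustration; the paper states only that a random subset of the training set was used.

```python
import random

def temporal_split_with_holdout(n_patients, n_train=194, val_frac=0.2, seed=0):
    """Earlier-treated patients form the training pool and later ones the
    independent test set (temporal validation); a random slice of the
    training pool is then held out to monitor DNN training."""
    idx = list(range(n_patients))          # assumed ordered by treatment date
    train_pool, test = idx[:n_train], idx[n_train:]
    rng = random.Random(seed)
    shuffled = train_pool[:]
    rng.shuffle(shuffled)
    n_val = int(len(train_pool) * val_frac)
    val, train = shuffled[:n_val], shuffled[n_val:]
    return sorted(train), sorted(val), test
```

The key property is that the test set is never shuffled into the pool: every test patient was treated after every training patient, which is what makes the evaluation a temporal validation in the TRIPOD sense.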

5. Conclusion

We performed R&D feature analysis with multiple views by splitting the original R&Dfull feature set into different views according to statistical texture feature categories, classified based on the spatial distribution properties of the local patterns of image pixel values. We demonstrated that the multi-view R&D feature analysis strategy could significantly improve performance for predicting early CWL in lung cancer RT patients, compared to the traditional single-view R&Dfull feature analysis strategy. The predictive performance of the R&Dopt features identified by the multi-view paradigm was also significantly higher than that of the conventional CP and/or DVH features. An ensemble classifier averaging the lpSVM and DNN classifiers achieved the best predictive performance when using the R&Dopt input features. Of the 13 selected R&D features (10 dosiomics and 3 radiomics) used in our machine learning model, ESO_D_glszm_SmallAreaHighGrayLevelEmphasis and GTV_D_glrlm_RunEntropy were the most important for predicting CWL in lung cancer patients treated with RT. Both dosiomics features were higher in patients with CWL, reflecting that these patients received a more inhomogeneous dose to the GTV, leading to potential small hot spots in the ESO that might not readily be detected by the human eye during RT planning. R&D-based quantitative RT planning for improved target dose homogeneity and better critical structure sparing is recommended to mitigate CWL incidence in lung cancer patients treated with RT. We believe that our proposed R&D-based prediction model could help in identifying prognosis associated with early CWL following RT and in developing personalized RT planning to mitigate RT-induced toxicities.

Acknowledgments

This study was supported by Canon Medical Systems Corp.
