Background
Lung cancer, one of the highest risky cancers, is the leading cause of cancer death with a high mortality rate of 82.3% in 5 years after diagnosis (National Cancer Institute). [Noone AM, Howlader N, Krapcho M, et al. SEER Cancer Statistics Review, 1975–2015,
https://seer.cancer.gov/csr/1975_2015/] Non-small cell lung cancer is a subtype lung cancer, which accounts for 85% among lung cancers [
1]. The 5-year survival rate decreases dramatically as the cancer entering advanced, from 40% for stage I to only 1% for stage IV [
2,
3]. It was reported that CT texture analysis could be helpful to further classification of treatment as it provided information on the intratumor heterogeneity [
4] which might be the reason for disparate outcomes of patients. Thibaud P. Coroller and etc. used 7 radiomic features to predict pathological response after chemoradiation [
5].
In the past 10 years, medical digital image analysis has grown dramatically as advancement of the pattern recognition tools and increase of the data collection. Radiomics offers unlimited imaging biomarkers which are promising to help cancer detection, prognosis and prediction of treatment response [
6,
7]. With the high-throughput computing, it’s possible to quickly extract various quantitative features from digital images such as MRI and CT. Since cancers are more likely to be temporal and spatial heterogeneous, the use of biopsy might be limited. Furthermore, medical digital imaging could give a whole picture of the tumor shape, texture and volume, and it is also a noninvasive way to get comprehensive tumor information [
8]. Some researches indicated that there was a relationship between radiomic features and tumor grades, histology, metabolism, and patient survival and clinical outcomes [
9‐
11]. Kitty Huang et al. also found high risk CT features were significantly associated with local recurrence [
12]. Chintan et al. studied the prognostic characteristics of radiomic features between lung cancer and head & neck cancer and found association among 11–13 features and prognosis, histology and stage [
13]. Jiangdian et al. investigated the prognostic and predictive ability of phenotypic CT features in non-small cell lung cancer patients and reported an overall clinical stage prediction accuracy of 80.33% [
14]. Those previous studies have shown that medical image analysis has a promising ability in improving cancer diagnosis, detection, prognostic prediction on oncology [
8].
With those antecedent studies, radiomics displayed its hopeful and cost-effective potential in the area of precision oncology. Even though there have been already numerous researches on the prediction of cancer diagnosis or stage classification, most of them used default parameter or manual selection which might not efficient enough. Most of the time default parameters could give us great result, but the ability of the model would be maximized by parameters optimization when we conduct the training stage [
15]. This study intended to construct an automatic grid search [
16] hyper-parameters tuning classifier to make detections on the survival status of non-small cell lung cancer patients based on the radiomics features. The dataset was randomized into training set and validation set. A random forest classifier with hyperparameters tuning was used to make classification of survival status of non-small cell lung cancer patients in training set. The model was assessed on the validation data by ROC curve as well as the prediction accuracy.
Methods
Data sets
The study included 186 non-small cell lung cancer (NSCLC) patients from two merged datasets R01 and AMC. The patient characteristics and CT images were obtained from the cancer imaging archive (TCIA) database (
https://doi.org/10.7937/K9/TCIA.2017.7hs46erv). Clinical data of all 186 NSCLC patients are provided in Table
1, including the gender, smoking history, histology, treatment, and overall survival data. Additional file
1: Table S1 shows detailed staging information.
Table 1
Demographic characteristics
Gender |
Male | 120 (64.5%) |
Female | 66 (35.5%) |
Smoking Status |
Nonsmoker | 39 (20.9%) |
Former smoker | 117 (62.9%) |
Current smoker | 30 (16.2%) |
Histology |
Adenocarcinoma | 154 (82.7%) |
Squamous cell carcinoma | 29 (15.6%) |
NOS | 3 (1.7%) |
Treatment |
Surgery | 33 (17.7%) |
Chemotherapy | 40 (21.5%) |
Radiotherapy | 19 (10.2%) |
Adjuvant Treatment | 40 (21.5%) |
Unknown | 54 (29.1%) |
Overall Survival |
Dead | 37 (19.9%) |
Alive | 149 (80.1%) |
One thousand, two hundred eighteen tumor characteristics were quantified by extracted features from the lesion segmented from patients’ CT images. The radiomic features can be categorized into four types such as intensity, shape, texture and wavelet. An open-source package in Python, Pyradiomics was used for various features extraction from CT images [
17]. A list of 50 quantitative features including first order features, shape features, Gray Level Size Zone Matrix (GLSZM) features, Features Gray Level Run Length Matrix (GLRLM) features and etc. were extracted. The shape descriptors were extracted from the label mask and also not associated with gray value.
Data balance and data splitting
Extracted features were then weighted differently as a result of data balancing. In machine learning, algorithms assume the distributions of groups are similar. In practice, when the disproportion of classes happens, the learning algorithms tend to be biased towards the majority class. But in this study, we are more interested in the minority class with more adverse events take place [
18]. Due to sample imbalance (the number of being alive is much less than that of being dead), rather than simply applying over-sampling with replacement for data balance, we conducted a synthetic minority over-sampling (SMOTE) method to increase the size of the minority group. SMOTE can be used when the number of the category is larger than 6 since it generated the new examples by taking samples of the feature space for the target class and its 5 nearest neighbors. SMOTE has an advantage of making the decision region of the minority class more general [
19,
20]. The final size of the dataset was
n = 298. Because the minority group was oversampled thus the dataset was randomized into training set (
n = 223) and validation set (
n = 75). The distribution of being alive and dead between training and validation set was also plotted.
Classifier construction
Based on the radiomics features, we aimed to build a radiomics-based survival status prediction model using random forest classifier and hyperparameters tuning in Python [
21]. Random forest creates multiple decision trees by randomly choosing subsets of features to make the classification based on the mode (for classification) or the mean (for regression) from all the smaller trees [
22,
23]. It has the advantage of being less vulnerable to overfitting problem compared to decision-trees. A generic random forest classifier was constructed first. Parameter estimation using grid search with 10-fold cross-validation was applied to the training data for parameters tunings such as the number of features to consider for the best split (max_feature), the number of trees in the forest (n_estimators), the maximum depth of the tree (max_depth), the minimum number of samples required to split an internal node (min_sample_split). Precision and recall score were used as evaluation standard for parameters tuning respectively. The two best models with different optimal hyperparameters can be acquired by the top mean precision or recall score respectively. The final preferred model was selected by comparing their performance on the validation data.
Decision threshold adjustment
Instead of directly adopting the absolute predictions, this study also applied for decision value tuning to balance the trade-off between precision and recall. A function of decision values was used to determine the decision threshold of the chosen model to maximize the precision with high recall.
Radiomics model assessment
Model performance was evaluated in terms of the operator characteristic curve (ROC), the area under the curve (AUC) and accuracy, which could quantify the prediction performance of the classifier model.
Discussion
Radiomics has gained great attention as a potential method to promote personal medicine. Its image signatures derived from digital images are promising to help diagnostics and prognostics [
24]. It has been shown that features such as texture, shape and intensity had prognostic power in independent data of lung and head-and-neck cancer patients since they were able to capture the intratumor heterogeneity [
25]. Applying machine learning techniques on the output data from radiomics has also become a hot topic in oncology, personalized medicine and computer aided diagnosis since its compatibility with the big data generated from digital images [
26].
This study intended to predict the non-small cell lung cancer survival status with radiomics features. A total of 1218 features were obtained after feature derivation using Pyradiomics. And those features captured the information about tumor shape, intensity and texture. Data imbalance is always a common problem in classification problem since most interested events like disease, network intrusion and etc. are rare. When sample size is large enough, slight or medium imbalance is not a big problem for training since there is enough information for learning from the minority class. However, when sample size is small, especially for decision tree the leaves that predict the minority class are likely to be pruned [
27]. Thus, in this study, a SMOTE oversampling method was necessary to decrease the training fit error. The model was trained with automatic hyperparameters tuning aiming to make use of the best potential of our model. One common problem with hyperparameter tuning is overfitting, which means that the model performs well on the training data but poorly on the test data. This issue can be amended by using cross-validation where the model performance is evaluated by averaging k models. This study skipped the part of feature selection for two reasons: one was that the sample feature ratio was not low to introduce overfitting and another one was that random forest with parameters tuning was powerful since it optimized the number of tress and selected the best feature at each node. The final result of our study also proved that the model without feature selection had a great generalization on the test data.
Moreover, we studied how the adjustment of decision values impacted the precision and accuracy because a good classification model was not only evaluated on the accuracy. Precision and recall are further standards for model evaluation but there is a trade-off between them. Precision decreases as recall increases [
28]. For survival status prediction, it is important to differentiate the death to stratify the patient into high risk group automatically. Thus, it might be more cost expensive to misidentify the high-risk patients. Without further test data validation, the result after decision threshold adjustment could be optimistic. However, this study gave a reference for the decision threshold of a non-small cell patient not being alive based on the radiomic feature. This may indicate that radiomics is promising into automatic computer aided patient risk stratification in a non-invasive way. For hyperparameters tuning, this study used grid search, which performed well in low dimensional space. When the dimension space is large or unknown, random search could be considered [
2].
Despite of the satisfactory model performance, the study had a few limitations. First, since all training and testing data were acquired from one study, it may not be generalized to all cases [
29] with a more heterogeneous dataset, and the accuracy might be lower than the current study. The extents to which the data we used are representative to the real situation also affect how well the trained model could perform in the practical use. Thus, different CT images from different sources are needed to construct a more rigorous and general classification model. The second limitation of the study is disproportion of patients at different disease stages. According to Table
2, the number of people with specific characteristic was far more than the rest, which means patient with a certain stage of the cancer might have more images or more pathologic images. For instance, there were 154 Adenocarcinoma patients, 29 Squamous cell carcinoma patients, and 3 not otherwise specified (NOS) patients.
For future research, more data from diverse patients’ background, different databases, and multiple image modalities should be utilized for further testing and validation. Other mathematical model can be developed to improve feature extraction. Our model can be adopted to improve the performance of classifier. The most relevant features can provide useful information for future exploration to develop a better detection method. Also, with larger dataset, different classification criteria can be tuned according to the different types of lung cancer and disease stages.
Conclusion
To conclude, this study intended to construct a survival status classifier with automatic hyperparameters tuning. In order to optimize classification outcomes, the tuning of decision threshold can serve as a reference for future work. Our classification methods has the potential to contribute to a survival prediction model, which is beneficial to better palliative care and treatment decision.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.