Background
Low-dose CT screening has been widely accepted as a means of early detection and mortality reduction in lung cancer [1–3]. It is a long-term rather than one-off effort because large numbers of indeterminate pulmonary nodules may require diagnostic workups, follow-up scans, or annual repeat screenings [4, 5]. Currently, recommendations on the time targets for follow-up are based on nodules’ diameter and solidity [6, 7]. However, it has been shown that these features are insufficient to capture nodules’ complex appearance [8], and visual interpretations of solidity are prone to inter-rater variability [9, 10]. The timeliness of follow-up after positive screening remains suboptimal because of limited evidence [11]. Improving the way we help patients with nodules make subsequent decisions is important, as we must weigh the benefits of early cancer diagnosis against the harms and costs of over-investigating unaggressive nodules.
In this study, we present a radiomics pipeline to select, synthesize, and recode radiomics data extracted from CT images and produce follow-up schedules to facilitate timely management of screening-detected nodules. We show the potential clinical impact of this approach by comparing its performance against that of five existing protocols specified in current guidelines.
Methods
Study participants
The study’s subjects were from an institute-based lung cancer screening cohort. The participants were those who underwent low-dose CT screening in August 2014–August 2018 and had at least one noncalcified pulmonary nodule detected. The inclusion criteria for a baseline screening were age 40–80 years and nodule diameter ≥ 3 mm (defined as the mean of the major and minor axis lengths, rounded to the nearest integer). The exclusion criteria were: (1) pregnancy; (2) severe illness of the brain, heart, or kidney; (3) other conditions not suitable for CT examination, determined by radiologists; (4) already hospitalized or transferred from hospitals for further workup; (5) distant residence that prevented timely follow-up.
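The diameter definition and the baseline inclusion criteria above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the tie-handling rule (values ending in .5 round up) is an assumption, since the text only says "rounded to the nearest integer". The exclusion criteria are clinical judgments and are not modelled here.

```python
import math


def mean_diameter_mm(major_axis_mm: float, minor_axis_mm: float) -> int:
    """Mean of the major and minor axis lengths, rounded to the nearest integer.

    Assumption: ties (x.5) are rounded up; the source does not specify this.
    """
    return math.floor((major_axis_mm + minor_axis_mm) / 2 + 0.5)


def meets_baseline_inclusion(age_years: float,
                             major_axis_mm: float,
                             minor_axis_mm: float) -> bool:
    """Check the two stated inclusion criteria: age 40-80 years and diameter >= 3 mm."""
    diameter = mean_diameter_mm(major_axis_mm, minor_axis_mm)
    return 40 <= age_years <= 80 and diameter >= 3


if __name__ == "__main__":
    # Hypothetical participant: 55 years old, nodule axes of 4.0 mm and 3.0 mm.
    print(meets_baseline_inclusion(55, 4.0, 3.0))
```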
By the end of May 2019, we had included 61 cases diagnosed with lung cancer (including 52 with histopathologically confirmed adenocarcinoma, 7 with adenocarcinoma in situ, 1 with squamous cell carcinoma, and 1 with metastatic carcinoma of the prostate), and we retrospectively selected 31 cancer-free patients with nodules who met any of the following conditions: (1) histopathologically confirmed benign lesion (n = 24, including 17 with hamartoma, 3 with pneumocytoma, 3 with inflammation, and 1 with carcinoid); (2) nodule disappeared or decreased in size in follow-up screening (n = 3); (3) no sign of malignancy during follow-up for at least 2 years (n = 4). Cancer-free status was cross-validated through medical records to minimize the effects of missed detection by histopathology tests.
Data collection
We used a site-based research database to collect, store, and perform quality control of the following data: (1) demographic information, including age at baseline, sex, personal and family (first-degree) cancer history, and smoking status (current smoking defined as ≥ 10 pack-years; quit smoking defined as ≥ 5 years’ cessation); (2) outcome of follow-up, including date of lung cancer diagnosis (analyzed as time/status outcome), specific pathological type, and cancer stage at diagnosis; (3) semantic phenotypes of the nodules, recorded as categorical variables, including nodule type (solid, part-solid, or non-solid), lobulation, spiculation, juxtapleural location, and pleural tag.
Baseline and follow-up CT images were acquired according to standardized protocols using a SOMATOM Definition Flash scanner, a SOMATOM Force scanner, and others. Images were reconstructed with a slice thickness of up to 5.0 mm and a spacing of no more than 1.5 mm, stored as DICOM files, and retrieved from our picture archiving and communication system.
Radiomics data generation
For each patient with nodules, one baseline CT image with the maximum nodule area in the transaxial plane was selected for primary analysis. The same rule was applied if there was more than one pulmonary nodule. Temporal changes in radiomic features were analyzed among patients with nodules who had one or more repeat CT scans during follow-up.
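As an illustration of the slice-selection rule, the sketch below picks the transaxial slice with the largest nodule area. It assumes that a binary mask (1 = nodule pixel) is available for each slice, which is a hypothetical input format; the text does not describe how the area was measured. Ties are resolved by taking the first slice with the maximum area, which is also an assumption.

```python
from typing import List, Sequence

Mask2D = Sequence[Sequence[int]]  # one transaxial slice; nonzero entries mark nodule pixels


def slice_area(mask: Mask2D) -> int:
    """Nodule area in pixels for a single slice."""
    return sum(1 for row in mask for value in row if value)


def max_area_slice_index(masks: Sequence[Mask2D]) -> int:
    """Index of the slice with the largest nodule area (first one on ties)."""
    if not masks:
        raise ValueError("at least one slice is required")
    areas: List[int] = [slice_area(mask) for mask in masks]
    return max(range(len(areas)), key=areas.__getitem__)


if __name__ == "__main__":
    # Three hypothetical 2 x 2 slices of the same nodule.
    slices = [
        [[0, 1], [0, 0]],
        [[1, 1], [1, 0]],
        [[1, 0], [0, 0]],
    ]
    print(max_area_slice_index(slices))
```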
Regions-of-interest were delineated following the multi-step interactive process detailed in Additional file 1: Method S1. Four groups of region-of-interest-based radiomic features, which have been extensively used in radiomics studies [12, 13], were extracted for quantitative characterization of the nodules: 21 shape features (Euclidean and fractal), 8 intensity features (histogram-based statistics), 41 texture features (gray-level co-occurrence matrix and run-length matrix), and 240 wavelet features (Additional file 1: Method S2).
Biomarker development
The proposed follow-up schedules were based on a composite radiomic biomarker developed using the random survival forest (RSF) method [14] that discriminates between the time-to-diagnosis distributions of patient subgroups. To increase interpretability and avoid over-fitting, only a few predictive, noise-robust, clinically meaningful, non-redundant radiomic features were selected as inputs to the RSF (see technical details about feature selection and biomarker development in Additional file 1: Method S3–4). This feature selection process was performed in a training set composed of 67% of the participants; a radiomics biomarker was then trained in this training set and tested in the rest of the participants. A cross-validation approach (using approximately equal-sized, mutually exclusive folds; no stratification variable applied) was used to evaluate the robustness of the results. The biomarker’s performance was also examined after being combined with demographic and semantic phenotype variables to investigate whether the addition of such information is necessary.
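The data partitioning described above can be sketched as follows. This is a minimal illustration, not the study's code: the function names, the random seed, and the number of folds (five here) are assumptions, as the text states only the 67% training fraction and that the folds were approximately equal-sized, mutually exclusive, and unstratified. Fitting the RSF itself is outside the scope of the sketch.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(ids: Sequence[int],
                     train_fraction: float = 0.67,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly assign about `train_fraction` of participants to the training set."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def make_folds(ids: Sequence[int], k: int = 5, seed: int = 0) -> List[List[int]]:
    """Unstratified, mutually exclusive folds whose sizes differ by at most one."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


if __name__ == "__main__":
    participants = list(range(92))  # 61 cancer cases + 31 cancer-free patients
    train, test = split_train_test(participants)
    folds = make_folds(participants)
    print(len(train), len(test), [len(fold) for fold in folds])
```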
Schedule design
The radiomics biomarker was used to stratify the patients with nodules in a way that resembles previously published nodule management protocols [6, 7, 15–17]. To minimize delayed cancer diagnosis, a “low” cutoff value was selected to provide high time-dependent sensitivity for the decision about early diagnostic workup (within 3 months). Similarly, to reduce unnecessary follow-up, a “high” cutoff was selected to provide high time-dependent specificity for the decision about repeat screening (after 12 months). Nodule management plans were then made for the low-, middle-, and high-risk patient subgroups, defined as having biomarker values below, between, and above the cutoffs, respectively. The proposed schedules’ performance was visually assessed with a time-to-diagnosis plot and benchmarked in a contingency table against the protocols recommended in five expert consensus-based guidelines.
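The three-way stratification can be written as a short sketch. The cutoff values below are placeholders rather than the study's cutoffs, the function name is hypothetical, and assigning values exactly equal to a cutoff to the middle group is an assumption, since the text defines the groups only as below, between, and above the cutoffs.

```python
from collections import Counter
from typing import Iterable


def risk_group(biomarker_value: float, low_cutoff: float, high_cutoff: float) -> str:
    """Assign a subgroup from the biomarker value and the two cutoffs.

    Assumption: a value equal to either cutoff falls in the middle group.
    """
    if low_cutoff >= high_cutoff:
        raise ValueError("low_cutoff must be smaller than high_cutoff")
    if biomarker_value < low_cutoff:
        return "low"
    if biomarker_value > high_cutoff:
        return "high"
    return "middle"


def group_counts(values: Iterable[float], low_cutoff: float, high_cutoff: float) -> Counter:
    """Number of patients in each subgroup."""
    return Counter(risk_group(v, low_cutoff, high_cutoff) for v in values)


if __name__ == "__main__":
    # Placeholder biomarker values and cutoffs, for illustration only.
    print(group_counts([0.1, 0.35, 0.5, 0.8, 0.95], low_cutoff=0.3, high_cutoff=0.7))
```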
Statistical analysis
Because our research goal concerns the timing of diagnosis and follow-up rather than binary classification, more than one time point of interest (e.g., 3 months, 1 year) was selected to reclassify each study participant as a “cumulative case” (diagnosed with lung cancer before the time of interest) or “dynamic control” (not diagnosed with cancer by the time of interest, including lung cancers diagnosed later and patients with cancer-free nodules). This time-dependent definition was based on Heagerty’s analysis framework [18] and allows us to evaluate the performance of potential nodule descriptors and the composite biomarker at predicting lung cancer diagnosis within several time intervals. A time-dependent version of the area under the curve metric (termed AUCt) was then calculated [18]. The bootstrap method (resampling 200 times) was used to estimate 95% confidence intervals (CIs) where indicated.
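The cumulative case and dynamic control definitions, and a percentile bootstrap with 200 resamples, can be illustrated with the simplified sketch below. It is not the estimator of the cited framework, which accounts for censoring. Here, participants censored at or before the time of interest are simply dropped, a higher marker value is assumed to indicate higher risk, and all names and data are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Record = Tuple[float, float, bool]  # (marker value, follow-up time in months, diagnosed?)


def label_at(records: Sequence[Record], t: float) -> List[Tuple[float, bool]]:
    """Label each participant at time t as cumulative case (True) or dynamic control (False)."""
    labelled = []
    for marker, time, diagnosed in records:
        if diagnosed and time <= t:
            labelled.append((marker, True))    # diagnosed by time t
        elif time > t:
            labelled.append((marker, False))   # still undiagnosed at time t
        # Censored at or before t: status at t unknown, dropped (simplification).
    return labelled


def auc_t(records: Sequence[Record], t: float) -> Optional[float]:
    """Probability that a case has a higher marker than a control (ties count 0.5)."""
    labelled = label_at(records, t)
    cases = [m for m, is_case in labelled if is_case]
    controls = [m for m, is_case in labelled if not is_case]
    if not cases or not controls:
        return None
    wins = sum(1.0 if c > d else 0.5 if c == d else 0.0 for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


def bootstrap_ci(records: Sequence[Record], t: float,
                 n_boot: int = 200, seed: int = 0) -> Tuple[float, float]:
    """Percentile 95% CI from resampling participants with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(records) for _ in records]
        value = auc_t(sample, t)
        if value is not None:  # skip resamples lacking cases or controls
            estimates.append(value)
    if not estimates:
        raise ValueError("no usable bootstrap resamples")
    estimates.sort()
    lower = estimates[int(0.025 * (len(estimates) - 1))]
    upper = estimates[int(0.975 * (len(estimates) - 1))]
    return lower, upper


if __name__ == "__main__":
    data = [(0.9, 2, True), (0.8, 5, True), (0.5, 4, True), (0.6, 20, True),
            (0.3, 24, False), (0.2, 30, False), (0.4, 2, False)]
    print(auc_t(data, t=6), bootstrap_ci(data, t=6))
```

Note that the participant diagnosed at 20 months counts as a control at 6 months and as a case at 24 months, which is the reclassification the time-dependent definition is meant to capture.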
In the calculation of sample size, a ratio ranging from 1:2 to 2:1 was applied to allow the numbers of “cumulative cases” and “dynamic controls” to vary according to different time points of interest. For an expected AUCt that ranges from 0.7 to 0.9, we need 6 to 60 cases and 60 to 6 controls at different time points to achieve a power of 0.9 at a significance level of 0.050 (two-sided). With the available sample, the statistical powers of a significance test for AUCt values of 0.7 or above at 3 months, 6 months, and 12 months are ≥ 0.90, 0.92, and 0.93, respectively.
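The text does not state which formula was used for the power calculation. As a rough illustration only, the sketch below uses the Hanley-McNeil approximation to the standard error of an ordinary (not time-dependent) AUC and a normal approximation for a two-sided test against an AUC of 0.5. Both choices are assumptions, so the output is not expected to reproduce the figures reported above.

```python
from math import sqrt
from statistics import NormalDist


def hanley_mcneil_se(auc: float, n_cases: int, n_controls: int) -> float:
    """Hanley-McNeil approximate standard error of an AUC estimate."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (auc * (1 - auc)
                + (n_cases - 1) * (q1 - auc ** 2)
                + (n_controls - 1) * (q2 - auc ** 2)) / (n_cases * n_controls)
    return sqrt(variance)


def power_vs_chance(auc: float, n_cases: int, n_controls: int, alpha: float = 0.05) -> float:
    """Approximate power to detect `auc` against 0.5 (two-sided alpha, upper tail only)."""
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    se_null = hanley_mcneil_se(0.5, n_cases, n_controls)
    se_alt = hanley_mcneil_se(auc, n_cases, n_controls)
    return normal.cdf((auc - 0.5 - z_alpha * se_null) / se_alt)


if __name__ == "__main__":
    # Case/control counts taken from the ranges quoted in the text.
    for cases, controls in [(6, 60), (60, 6), (30, 30)]:
        print(cases, controls, round(power_vs_chance(0.9, cases, controls), 3))
```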
All statistical tests were two-sided with a significance level of 0.050 and were performed with R version 3.5.2.
Discussion
In this study, we developed a radiomics biomarker on the basis of eight predictive, noise-robust, non-redundant radiomic features. The clinical usefulness of those features as nodule descriptors was justified by their high relevance to semantic phenotypes, higher discriminative value, and greater temporal sensitivity than nodule diameter. The biomarker had high time-dependent predictive accuracy for lung cancer and could well differentiate subgroups of patients with nodules according to their distinct times to cancer diagnosis. When benchmarked against five current guideline protocols, the proposed approach performed best at reducing both delayed and over-diagnosis rates, suggesting the great potential of applying radiomics to secure a timely cancer diagnosis as well as sparing patients with unaggressive nodules from unnecessary diagnostic testing in lung cancer screening.
Automatic detection of pulmonary nodules and prediction of their malignancy and benignity have been extensively investigated [19–21]. Our study differs from these works in that we applied radiomics to schedule the timing of nodule management and used a new analysis method that incorporates the time dimension. This change has several advantages. First, some cancers can be diagnosed immediately but some cannot (e.g., diagnosis took as long as 33.5 months in this study). Compared with treating them as one group for prediction, it is more clinically meaningful to determine whether the patient can wait a while (e.g., 6 months, 12 months) to make a judgment through follow-up. The time-dependent definitions of case and control are more pertinent to the longitudinal nature of lung cancer screening. Second, challenges with defining a disease-free group using a “gold standard” arise in the screening setting because few cancer-free participants undergo histopathology tests [5]. Their time to lung cancer diagnosis is censored, as it should be viewed from a lifetime horizon. The time-dependent analysis can properly employ this censored information, whereas simply ignoring it or treating the cancer-free group as non-diseased would result in bias. Third, by incorporating the temporal information, the proposed method can contribute to more precise risk assessment of lung cancer. The method can also address screening-related issues such as the harms associated with over-diagnosis (e.g., repeat exposure to radiation, invasive diagnostic procedures) and delayed diagnosis and intervention [2], all of which are core to the interests of screening participants. According to a recent review from the Population-based Research to Optimize the Screening Process Consortium [11], timely follow-up for positive cancer screening results remains suboptimal because of the low quality of available evidence across cancers. The proposed approach could be an important step in addressing these challenging issues.
One of the major concerns with radiomics is whether radiomic features are as reliable as has been reported [22, 23]. In view of this, we adopted very stringent feature selection criteria. Among the reasons for exclusion listed in the flowchart, non-robustness to image noise was a particular consideration in our study beyond the level that was applied in other studies [18, 24, 25]. Noise sensitivity was important in our study because it is a unique issue in low-dose CT and could affect the stability of the results if different modality parameters or reconstruction algorithms are adopted [23]. We found that the majority of sophisticated radiomic features, such as wavelet-based features, are very sensitive to image noise and less relevant to semantic phenotypes, despite the fact that some have high predictive value. This finding indicates that there may be a trade-off among complexity, interpretability, and suitability in the search for new nodule descriptors. For this reason, we did not resort to 3D features, given that the results with 2D features were satisfactory and saved computing time for easier clinical uptake. Further, in lieu of semantic phenotypes, which are subject to moderate-to-high inter-rater variation [9, 10], the selected radiomic features could provide automatic (and thus more reliable) quantification of nodule characteristics. Clinical confidence in the use of these radiomic features may be improved by considering the following. First, as shown by our results, they were naturally associated with the semantic phenotypes commonly used by radiologists. Second, the addition of semantic phenotype variables did not improve predictive performance, meaning that the radiomic features already carry such qualitative information and thus may substantially reduce human labor. Third, the selected radiomic features’ clinical value has been suggested in other studies. For instance, kurtosis and energy, as measures of the “tailedness” and homogeneity of the intensity distribution, showed high variable importance in our model (Additional file 1: Figure S1) and have been reported to be useful for discrimination between benign and malignant nodules [24, 26], helpful for prediction of prognosis, and associated with gene expression in lung cancer [27].
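To make the two intensity features named above concrete, the following sketch computes kurtosis and a histogram-based energy from a list of pixel intensities. The exact definitions used in the study are given in its additional file and are not reproduced here. The formulas below (kurtosis as the fourth standardized moment, energy as the sum of squared histogram probabilities) are common conventions adopted as assumptions, and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def kurtosis(values: Sequence[float]) -> float:
    """Fourth standardized moment; larger values indicate heavier tails."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant intensities")
    return m4 / m2 ** 2


def histogram_energy(values: Sequence[int]) -> float:
    """Sum of squared histogram probabilities; equals 1.0 for a perfectly uniform region."""
    n = len(values)
    counts = Counter(values)
    return sum((count / n) ** 2 for count in counts.values())


if __name__ == "__main__":
    homogeneous = [5, 5, 5, 5]    # hypothetical region with a single intensity level
    heterogeneous = [1, 2, 3, 4]  # hypothetical region with four distinct levels
    print(histogram_energy(homogeneous), histogram_energy(heterogeneous))
    print(kurtosis([1, 2, 3, 4, 5]))
```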
Among existing protocols, Lung-RADS has been widely accepted as a reliable tool, and it is especially accurate when previous images are available [28]. However, the performance of Lung-RADS has been shown to deteriorate at the baseline screening, when no prior images are available [29]. The proposed radiomics approach performed much better than Lung-RADS and other protocols at decision making following the baseline screen. However, the low frequency of repeated screens prevented us from planning subsequent decisions. After this proof-of-concept study, we plan to apply the proposed time-dependent analysis framework to serial image data (recently termed delta-radiomics [30]) with a large sample size. This will hopefully contribute to refining dynamic management.
This study is limited in several aspects. First, its external validity is limited by the narrow spectrum of diseases investigated (particularly, most of the cancers were adenocarcinoma, a finding similar to other reports from China [20]). Second, the observed temporal data could have been affected by delayed or over-diagnosis. Third, the extraction of radiomic features is intrinsically repeatable, but variability may be introduced by the semi-automatic segmentation method. Fourth, the cancer-free group received significantly fewer follow-up screenings than the cancer group, and thus, detection bias may exist.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.