Introduction
Lung cancer is the most commonly diagnosed cancer and the leading cause of cancer death worldwide, with non–small-cell lung cancer (NSCLC) comprising more than 85% of these cases [1, 2]. NSCLC typically metastasises to the hilar and mediastinal lymph nodes, and the presence of metastases significantly affects staging, prognosis, and patient management. Five-year survival rates are 54% for patients without any metastases and 27% for patients with mediastinal metastases [3]. Tumour progression, prognosis evaluation, and decisions on treatment plans are mainly dependent on this staging, and thus, it is critically important to accurately detect the mediastinal cancer nodes.
Currently, an 18F-labelled fluoro-2-deoxyglucose ([18F]FDG) positron emission tomography/computed tomography ([18F]FDG-PET/CT) scan is acquired, and the nuclear medicine physician/radiologist examines all the slices. The sensitivity and specificity of these lesion-based analyses have been shown to be 0.59 and 0.97, respectively, meaning many nodes remain undetected [4]. Additionally, agreement between observers is limited. Inter-observer agreement, measured by the kappa score (κ), has been shown to range from 0.48 to 0.88, depending on the type of node (agreement was lower for aortopulmonary nodes (κ = 0.48–0.55) but higher for inferior and superior nodes (κ = 0.71–0.88)) [5]. We hypothesise that an artificial intelligence–based system could improve the sensitivity and reproducibility of mediastinal lymph node staging while also saving radiologists time.
In recent years, the use of machine learning algorithms to assist with and automate medical tasks has increased massively. Programmes have been built to automate tasks across the gamut of clinical oncology, from tumour detection and segmentation to therapy decisions [6–9]. Convolutional neural networks (CNNs), a subset of machine learning, have emerged as a leading tool for classifying images [10]. In contrast to hand-crafted radiomic features, CNNs self-learn optimised features through a series of layers. Successive layers learn increasingly higher-level features from the images using non-linear mappings, eventually using these features to make predictions. The network is trained so that the features become optimised for the task. CNNs have shown promise in a wide range of image recognition and classification tasks, in some cases challenging the accuracy of medical experts [11–13]. One important issue impeding the widespread adoption of machine learning models is their limited ability to generalise to data from different sources (domain shift) [14, 15]. A solution is transfer learning, which involves further training the model on data from the second domain, often with some CNN layers frozen or with a smaller learning rate (also called fine-tuning). This has been shown to improve inter-scanner performance [16, 17].
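The fine-tuning strategy can be sketched schematically. Here the model is reduced to a dictionary of per-layer weights, with layer names and values purely illustrative (a real implementation would use a deep learning framework, freezing early convolutional layers and updating later layers with a reduced learning rate):

```python
def fine_tune_step(weights, grads, frozen, lr):
    """One gradient-descent update that skips frozen layers.

    weights/grads: per-layer values keyed by layer name (illustrative).
    frozen: set of layer names left untouched (the generic early layers).
    lr: the reduced fine-tuning learning rate.
    """
    return {
        name: w if name in frozen else w - lr * grads[name]
        for name, w in weights.items()
    }

# Hypothetical three-layer model: freeze conv1, fine-tune conv2 and fc
weights = {"conv1": 1.0, "conv2": 1.0, "fc": 1.0}
grads = {"conv1": 0.5, "conv2": 0.5, "fc": 0.5}
updated = fine_tune_step(weights, grads, frozen={"conv1"}, lr=0.01)
# conv1 unchanged; conv2 and fc nudged by the small fine-tuning rate
```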
For lung cancer, machine learning models have been built to detect pulmonary nodules in CT scans at an accuracy comparable to or better than that of physicians [18, 19]. Ardila et al. used a cohort of 14,851 patients from the publicly available National Lung Screening Trial to predict the risk of lung cancer, achieving an AUC of 0.94 and outperforming radiologists [20].
However, there has been less progress on the automated detection of pathological mediastinal lymph nodes. Roth et al. used a CNN to automatically detect enlarged lymph nodes (indicating disease) from CT scans [21]. The input data consisted of 2.5D patches (three orthogonal slices) centred on the lymph node locations. With a cohort of 86 patients, they achieved a sensitivity of 0.70 at three false positives/patient.
Several [18F]FDG-PET thresholding methods have also been proposed, as described in the meta-analysis by Schmidt-Hansen et al. [22]. These studies used features such as [18F]FDG-PET SUVmax and SUVmean to classify nodes as pathological. Using an SUVmax ≥ 2.5 criterion gave an average sensitivity and specificity of 0.81 and 0.79, respectively. However, the meta-analysis showed high inter-study heterogeneity. Additionally, these studies all required the nodes to first be located manually by an expert.
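The SUVmax thresholding criterion amounts to a one-line classifier over manually located nodes; a minimal sketch (the SUVmax values below are illustrative):

```python
def classify_nodes_by_suvmax(suvmax_values, threshold=2.5):
    """Flag a node as pathological when its SUVmax meets the threshold."""
    return [suv >= threshold for suv in suvmax_values]

# Illustrative SUVmax readings for three manually located nodes
labels = classify_nodes_by_suvmax([1.8, 2.5, 6.2])  # [False, True, True]
```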
Given the diversity of mediastinal node shapes, sizes, and locations, using solely the CT or the [18F]FDG-PET is limiting. Wu et al. showed that using [18F]FDG-PET/CT for nodal staging of NSCLC confers significantly higher sensitivity and specificity than contrast-enhanced CT alone and higher sensitivity than [18F]FDG-PET alone [23]. With a cohort of 168 patients, Wang et al. used a 2.5D CNN (using six axial patches of size 51 × 51 mm²) to diagnose mediastinal lymph node metastasis, achieving a sensitivity and specificity of 0.84 and 0.88, respectively [24].
While these models achieved good results, they all required the locations of the mediastinal nodes as inputs, and locating these nodes is itself a time-consuming task. Furthering these studies, our aim was to build a model that could automatically classify mediastinal nodes directly from whole-body [18F]FDG-PET/CT scans, without the need for prior annotation. This is akin to other works that have used CNNs to go directly from whole-body [18F]FDG-PET/CT scans to the localisation of suspicious [18F]FDG-PET/CT foci [25, 26]. Such a model could then be used by a physician to quickly identify high-risk lymph nodes, saving time and improving diagnostic consistency.
Discussion
The performance of our model on the first test set was good, better than the reported performance of physicians (a meta-analysis of physician performance showed sensitivity and specificity values of 0.59 and 0.97, respectively, for lesion-based labelling [4]). However, without a ground truth, these results are not directly comparable. Additionally, this meta-analysis showed a wide range in these results (sensitivity from 0.46 to 0.90 and specificity from 0.65 to 0.98). One factor shown to affect the results was geography (some studies were conducted in regions with high tuberculosis prevalence, which led to more false positives), but other factors such as scanner type, centre protocols, and physician experience may also have made a difference. This variation highlights the difficulty of comparing results across studies, so any comparison must be treated with care. Taking this into account, our model achieved a sensitivity comparable to previous automated mediastinal node detection studies, with the CNN used in [24] achieving a sensitivity of 0.84. Most importantly, unlike all previous reports on the identification of mediastinal cancer nodes in lung cancer patients, our approach only requires the full [18F]FDG-PET/CT scan as an input and does not rely on preliminary manual identification of patches containing suspicious uptake, representing a significant improvement in the overall automation of the process. This complete automation increases the consistency and reproducibility of mediastinal lymph node labelling. As demonstrated in Supplementary Fig. 1, the model could thus be used to aid physicians by quickly guiding them to potential positive node sites.
Comparing the results to those between two physicians, the agreement between the algorithm and the first physician was similar to that between the two physicians (κ = 0.77 [0.68, 0.87] vs κ = 0.66 [0.54, 0.77]). The agreement between our model and the second physician was slightly lower (κ = 0.71 [0.60, 0.82]). This is not surprising, as our model was trained using the first physician's labelling. Only one study could be found reporting inter-physician agreement in identifying pathological mediastinal nodes on [18F]FDG-PET/CT scans; it reported an inter-observer agreement ranging from 0.48 to 0.88, depending on the type of mediastinal node [5]. These results are consistent with ours. As mentioned, however, comparisons to other studies should be treated with caution due to the potentially large variations.
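The agreement statistic used throughout (Cohen's kappa) corrects the observed agreement for the agreement expected by chance from each rater's label frequencies. A minimal sketch with made-up node labels (not data from this study):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # chance agreement from each rater's marginal label frequencies
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical node labels (1 = pathological, 0 = benign) for two raters
kappa = cohens_kappa([1, 1, 0, 0, 1, 0, 1, 0, 1, 1],
                     [1, 1, 0, 0, 0, 0, 1, 0, 1, 1])  # ≈ 0.8
```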
When applying our model directly to data from the second scanner, the detection sensitivity dropped (Table 2). More work would be needed to determine which parameters are most relevant to this performance drop (scanner hardware, reconstruction parameters, etc.). Fine-tuning the model using transfer learning made it possible to raise the sensitivity up to that observed on the scanner 1 test data, but at the cost of more FPs/patient.
As is usually the case with CNNs, there is no sure-fire way to predict which parameters will yield the best performance, so a range of hyperparameter combinations was tested to optimise the model. In addition, single-phase approaches (going directly from the scans to node locations) were tested but found not to be discriminative enough, leaving too many false positives. Models with [18F]FDG-PET–only and CT-only inputs were also tested, but these produced poor results. This is probably because the two modalities are complementary: the CT provides anatomical information to locate the nodes, and the PET provides metabolic information to determine whether the nodes are positive. Full details of the parameters tested for all phases of model creation are summarised in the Supplementary Material.
One limitation of this study, and thus a caveat to the above comparisons, was the lack of a gold standard. Not all pathological nodes are detectable on [18F]FDG-PET/CT scans (e.g. [18F]FDG-negative cases), so without histological analysis, it is impossible to know which nodes are truly positive. However, histological analysis also has limitations in this application. In standard clinical practice, only suspicious nodes are biopsied, and if a patient has several suspicious nodes, not all will be biopsied. Additionally, some nodal stations are easier to access and thus more often biopsied, leading to sampling errors. These problems may explain the wide range in the number of nodes per patient seen in the meta-analysis [4]. A consensus judgement would have improved our study, making our model less physician-specific.
Our dataset was quite small, and more data would undoubtedly improve these results. In a broader sense, this is the key advantage of a machine learning approach: as clinical practice adapts and training datasets grow, a CNN can learn from a broader range of nodes, allowing it to become more accurate. A misclassified node may simply have had no close counterpart in the training set; with a larger training set, such cases would become rarer.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.