Published in: BMC Medicine 1/2021

Open Access 01.12.2021 | Research article

Deep learning-based six-type classifier for lung cancer and mimics from histopathological whole slide images: a retrospective study

Authors: Huan Yang, Lili Chen, Zhiqiang Cheng, Minglei Yang, Jianbo Wang, Chenghao Lin, Yuefeng Wang, Leilei Huang, Yangshan Chen, Sui Peng, Zunfu Ke, Weizhong Li

Abstract

Background

Targeted therapy and immunotherapy place higher demands on accurate lung cancer classification and on discrimination between benign and malignant disease. Digital whole slide images (WSIs) have enabled the transition from traditional histopathology to computational approaches, spurring a surge of interest in deep learning methods for histopathological analysis. We aimed to explore the potential of deep learning models for identifying lung cancer subtypes and cancer mimics from WSIs.

Methods

We initially obtained 741 WSIs from the First Affiliated Hospital of Sun Yat-sen University (SYSUFH) for deep learning model development, optimization, and verification. An additional 318 WSIs from SYSUFH, 212 from Shenzhen People’s Hospital, and 422 from The Cancer Genome Atlas were collected for multi-centre verification. EfficientNet-B5- and ResNet-50-based deep learning methods were developed and compared using recall, precision, F1-score, and area under the curve (AUC). A threshold-based tumour-first aggregation approach was proposed and implemented for label inference on WSIs with complex tissue components. Four pathologists of different levels from SYSUFH blindly reviewed all the testing slides, and their diagnoses were used for quantitative comparison with the best-performing deep learning model.

Results

We developed the first deep learning-based six-type classifier for histopathological WSI classification of lung adenocarcinoma, lung squamous cell carcinoma, small cell lung carcinoma, pulmonary tuberculosis, organizing pneumonia, and normal lung. The EfficientNet-B5-based model outperformed ResNet-50 and was selected as the backbone of the classifier. Tested on 1067 slides from four cohorts of different medical centres, the classifier achieved AUCs of 0.970, 0.918, 0.963, and 0.978, respectively. It showed high consistency with the ground truth and with attending pathologists, with intraclass correlation coefficients above 0.873.

Conclusions

Multi-cohort testing demonstrated that our six-type classifier achieved consistent performance comparable to that of experienced pathologists and gained advantages over existing computational methods. Visualizing the prediction heatmaps made the model more interpretable. The classifier with the threshold-based tumour-first label inference method exhibited excellent accuracy and feasibility in classifying lung cancers and easily confused non-neoplastic tissues, indicating that deep learning can resolve complex multi-class tissue classification that conforms to real-world histopathological scenarios.

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1186/s12916-021-01953-2.
Huan Yang, Lili Chen, and Zhiqiang Cheng contributed equally as first authors.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Abbreviations
AI: Artificial intelligence
AUC: Area under the curve
CI: Confidence interval
CNN: Convolutional neural network
FLOP: Floating-point operation
FNR: False-negative rate
FPR: False-positive rate
GPU: Graphics processing unit
H&E: Haematoxylin and eosin
ICC: Intraclass correlation coefficient
LUAD: Lung adenocarcinoma
LUSC: Lung squamous cell carcinoma
NL: Normal lung
NSCLC: Non-small cell lung carcinoma
OP: Organizing pneumonia
PTB: Pulmonary tuberculosis
ROC: Receiver operating characteristic curve
ROI: Region of interest
SCLC: Small cell lung carcinoma
SYSU1: Sun Yat-sen University dataset 1
SYSU2: Sun Yat-sen University dataset 2
SYSUFH: First Affiliated Hospital of Sun Yat-sen University
SZPH: Shenzhen People’s Hospital dataset
TCGA: The Cancer Genome Atlas dataset
TME: Tumour microenvironment
TPR: True-positive rate
WHO: World Health Organization
WSI: Whole slide image

Background

Lung cancer is the leading cause of cancer death worldwide and is customarily classified as either non-small cell lung cancer (NSCLC) or small cell lung cancer (SCLC). With the emergence of targeted therapy and immunotherapy, accurate morphological classification is urgently needed [1]. Visual examination under an optical microscope by pathologists remains the routine for establishing a diagnosis and determining cancer subtypes. However, the scarcity of pathologists and the time-consuming procedure widen the gap between clinical demand and actual productivity. Moreover, inter- and intra-observer variations introduce additional bias and risk into histopathology analysis [2, 3]. Fortunately, the digitization of histopathological slides is changing the way pathologists work and allowing artificial intelligence (AI) to integrate with traditional laboratory workflows.
Over the past few years, deep learning approaches have shown promise in tumour histopathology evaluation [4]. Labour-intensive tasks such as region of interest (ROI) detection or segmentation [5, 6], element quantification [7], and visualization [8] can be executed well by deep learning approaches. Experience-dependent problems, including histological grading [9], classification or subclassification [10, 11], and prognosis inference [12], have also been solved to some extent with AI approaches. Furthermore, research on imaging genomics, covering biomarker prediction or discovery [13, 14] and tumour microenvironment (TME) characterization [15] from digital histopathological slides, has been explored and shown to be feasible.
Several deep learning approaches for lung cancer histopathological classification have been successful, in a supervised or weakly supervised manner, using single or multiple convolutional neural network (CNN) models [16–21] (Table 1). Computational tools have been developed for viewing, annotating, and data mining of whole slide images (WSIs) [22–26] (Table 1). Notably, QuPath [22], DeepFocus [23], ConvPath [24], HistoQC [25], and the ACD model [26] are listed in Table 1 as general WSI analysis tools that are not specific to lung cancer. Additionally, the relationships between molecular genotypes and morphological phenotypes have been explored in several pioneering studies [16, 17] (Table 1). However, existing advances have been confined to NSCLC, to a single cohort, or to a small number of cases, and remain far from clinical impact. Furthermore, pulmonary tuberculosis (PTB) cases with atypical radiographic features require surgical inspection to be differentiated from cancer because of their potential infectiousness [27]. Organizing pneumonia (OP) is also difficult to distinguish from bronchogenic carcinoma, and patients therefore often undergo surgical resection on high suspicion of a malignant tumour [28, 29].
Table 1
Overview of deep learning-based lung cancer histological classification algorithms and general slide image analysis tools
| Research | Year | Objective | Cohort | AUC | Architecture | Framework | Language |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Coudray et al. [16] | 2018 | Classification between LUAD, LUSC, and NL; mutation prediction (STK11, EGFR, FAT1, SETBP1, KRAS, and TP53) | TCGA (1634 slides); NYU (340 slides) | 0.970 (classification); 0.733–0.856 (mutation) | Inception-V3 | TensorFlow | Python |
| Yu et al. [17] | 2020 | Identification of histological types and gene expression subtypes of NSCLC | ICGC (87 LUAD patients, 38 LUSC patients); TCGA (427 LUAD patients, 457 LUSC patients) | 0.726–0.864 | AlexNet; GoogLeNet; VGGNet-16; ResNet-50 | Caffe | Python |
| Gertych et al. [18] | 2019 | Histologic subclassification of LUAD (5 types) | CSMC (50 cases); MIMW (38 cases); TCGA (27 cases) | Accuracy, 0.892 (patch-level) | GoogLeNet; ResNet-50; AlexNet | Caffe | MATLAB |
| Wei et al. [19] | 2019 | Histologic subclassification of LUAD (6 types) | DHMC (422 LUAD slides) | 0.986 (patch-level) | ResNet-18 | PyTorch | Python |
| Kriegsmann et al. [20] | 2020 | Classification between LUAD, LUSC, SCLC and NL | 80 LUAD, 80 LUSC, 80 SCLC and 30 controls from NCT | 1.000 (after strict QC) | Inception-V3 | Keras (TensorFlow) | R |
| Wang et al. [21] | 2020 | Classification between LUAD, LUSC, SCLC, and NL | SUCC (390 LUAD; 361 LUSC; 120 SCLC; and 68 NL slides); TCGA (250 LUAD and 250 LUSC slides in good quality) | 0.856 (for TCGA cohort) | Modified VGG-16 | TensorFlow | Python |
| QuPath [22] | 2017 | Tumour identification, biomarker evaluation, batch-processing, and scripting | Specimens of 660 stage II/III colon adenocarcinoma patients from NIB | / | / | / | JAVA |
| DeepFocus [23] | 2018 | Detection of out-of-focus regions in WSIs | 24 slides from OSU | / | CNN | TensorFlow | Python |
| ConvPath [24] | 2019 | Cell type classification and TME analysis | TCGA (LUAD); NLST; SPORE; CHCAMS | / | CNN | / | MATLAB; R |
| HistoQC [25] | 2019 | Digitization of tissue slides | TCGA (450 slides) | / | / | / | HTML5 |
| ACD model [26] | 2015 | Colour normalization for H&E-stained WSIs | Camelyon-16 (400 slides); Camelyon-17 (1000 slides); Motic-cervix (47 slides); and Motic-lung (39 slides) | 0.914 (for classification) | ACD | TensorFlow | Python |
Abbreviations: LUAD, lung adenocarcinoma; LUSC, lung squamous cell cancer; NL, normal lung; TCGA, the Cancer Genome Atlas; NYU, New York University; ICGC, International Cancer Genome Consortium; CSMC, Cedars-Sinai Medical Center; MIMW, Military Institute of Medicine in Warsaw; DHMC, Dartmouth-Hitchcock Medical Center; NCT, National Center for Tumor Diseases; QC, quality control; SUCC, Sun Yat-sen University Cancer Center; NIB, Northern Ireland Biobank; OSU, Ohio State University; NLST, National Lung Screening Trial; SPORE, Special Program of Research Excellence; CHCAMS, Cancer Hospital of Chinese Academy of Medical Sciences; H&E, haematoxylin and eosin; WSIs, whole slide images; ACD, adaptive colour deconvolution
Here, we developed a deep learning-based six-type classifier for the identification of a wider spectrum of lung lesions, including lung cancer, PTB, and OP. EfficientNet [30] and a graphics processing unit (GPU) were used for better efficacy. We also implemented a threshold-based tumour-first aggregation method for slide label inference, which was inspired by clinical routine and proved effective in multi-centre validation. Extended comparison experiments and statistical analyses were conducted to verify model efficiency, efficacy, generalization ability, and pathologist-level qualification. We aimed to test the hypothesis that deep learning methods can identify lung cancer and its mimics histologically with high accuracy and good generalization ability.

Methods

The study workflow is illustrated in Fig. 1. First, specimens were scanned and digitized into pyramid-structured WSIs. Second, WSIs were reviewed and annotated by pathologists. Third, ROIs were extracted and cropped into tiles for model development. Fourth, the deep convolutional neural network was trained and optimized for the best classification performance. Fifth, tile-level predictions were aggregated into slide-level predictions. Finally, multi-centre tests were conducted to validate the model’s generalization ability.

WSI datasets

The initial dataset consisted of 741 haematoxylin and eosin (H&E)-stained lung tissue slides with a confirmed diagnosis of either LUAD, LUSC, SCLC, PTB, OP, or NL from the First Affiliated Hospital of Sun Yat-sen University (SYSUFH) (Table 2). The inclusion criterion was that each slide should show typical lesions indicative of one of the aforementioned diagnostic categories. Before the WSI annotation, two pathologists reviewed all the histological slides of each case microscopically, including the immunohistochemistry and histochemical staining slides used for auxiliary diagnosis, and consulted patients’ medical reports when necessary. Cases with a confirmed diagnosis (one slide per case) were included in this study. The slides were then scanned with a KF-PRO-005-EX scanner (KFBIO, Ningbo, China) at × 40 equivalent magnification (0.25 μm per pixel) and digitized into KFB format. To ensure an unbiased assessment, the diagnostic annotations were reviewed by pathologists with at least 7 years of clinical experience from the Department of Pathology of SYSUFH according to the 2015 World Health Organization (WHO) classification criteria of lung tumours [1].
Table 2
Details of SYSU1 dataset for the development of six-type classifier
Values are number of slides (tiles).

| Subsets | LUAD | LUSC | SCLC | PTB | OP | NL | SUM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Training | 210 (179,402) | 77 (51,949) | 65 (17,342) | 43 (22,617) | 46 (17,987) | 70 (65,143) | 511 (354,440) |
| Validation | 45 (43,153) | 18 (14,552) | 16 (1077) | 11 (3047) | 10 (4170) | 15 (12,526) | 115 (78,525) |
| Testing | 43 | 16 | 22 | 10 | 10 | 14 | 115 (276,247) |
| SUM | 298 | 111 | 103 | 64 | 66 | 99 | 741 (709,212) |

Data pre-processing

The raw gigabyte-scale multi-layer WSIs from SYSUFH were converted from KFB to TIFF format with the KFB_Tif_SVS2.0 tool (provided by the scanner vendor KFBIO) for compatibility with mainstream computer vision tools. To retain both the global overview and local details, images at × 20 equivalent magnification (0.5 μm per pixel) were used throughout the processing procedure. The TIFF-format WSIs were manually annotated by the pathologists using the ASAP platform [31], with separate coloured irregular polygons each marking a certain histopathological lung tissue type. Tumorous and inflammatory regions were obtained by masking annotated areas, and normal regions were retrieved by excluding the background of normal lung slides with Otsu’s method [32]. The annotation guaranteed that no non-lesion tissue was included in the annotated area; as a result, some lesion areas that were difficult to mark clearly may have been lost. The outlined areas were annotated with their respective categories, including LUAD, LUSC, SCLC, PTB, and OP. Normal lung slides were derived from normal tissues adjacent to the lesions in cases with the above diseases. Only WSIs in which the whole slide was normal, without any lesions, were selected as normal lung WSIs. Specifically, unannotated regions of neoplastic slides were not considered normal, because the rigorous labelling method excluded minor areas of tumour tissue surrounded by mostly normal tissue. ROIs were traversed and cut into non-overlapping tiles of 256 × 256 pixels with a sliding window (stride = 256) to match the input scale of CNNs and avoid overfitting. Tiles with over 50% background space were removed to reduce noise and redundancy. The tile distributions are detailed in Table 2.
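The tiling and background filtering described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' pipeline: the function names are hypothetical, the background mask is assumed to come from an upstream step such as Otsu thresholding, and partial tiles at the region border are assumed to be dropped.

```python
from typing import List, Sequence, Tuple

TILE = 256            # tile edge in pixels, as stated in the text
STRIDE = 256          # stride equal to the tile size gives non-overlapping tiles
MAX_BACKGROUND = 0.5  # tiles with more than 50% background are removed


def tile_origins(width: int, height: int,
                 tile: int = TILE, stride: int = STRIDE) -> List[Tuple[int, int]]:
    """Top-left (x, y) corners of all tiles that fit entirely inside the region.

    Assumption: partial tiles at the right and bottom borders are dropped.
    """
    return [(x, y)
            for y in range(0, height - tile + 1, stride)
            for x in range(0, width - tile + 1, stride)]


def background_fraction(mask: Sequence[Sequence[bool]], x: int, y: int,
                        tile: int = TILE) -> float:
    """Fraction of background pixels in one tile; mask[row][col] is True for background."""
    background = sum(1 for row in mask[y:y + tile]
                     for is_bg in row[x:x + tile] if is_bg)
    return background / (tile * tile)


def keep_tiles(mask: Sequence[Sequence[bool]], tile: int = TILE,
               stride: int = STRIDE,
               max_background: float = MAX_BACKGROUND) -> List[Tuple[int, int]]:
    """Origins of tiles whose background fraction does not exceed the limit."""
    height, width = len(mask), len(mask[0])
    return [(x, y) for (x, y) in tile_origins(width, height, tile, stride)
            if background_fraction(mask, x, y, tile) <= max_background]
```

With `stride` equal to `tile`, no pixel is shared between tiles; a smaller stride would produce the overlapping tiles used later for the testing slides.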

Deep neural networks

We sought a CNN that combined high accuracy with low tuning costs. The EfficientNet networks benefit from compound scaling and automated architecture search, achieving state-of-the-art accuracy on ImageNet [33] with fewer floating-point operations (FLOPs). PyTorch supported the EfficientNet network up to the B5 version at the time this study was conducted. Hence, EfficientNet-B5 was adopted for the histopathological classification task, with its last fully connected layer replaced by a softmax layer that outputs a six-dimensional vector. To train and optimize the networks, we randomly divided the slides at the slide level into disjoint training, validation, and testing sets (Table 2). ResNet is another popular CNN architecture that appears frequently in research articles. Therefore, we also fine-tuned a ResNet-50 network using the same data and settings as EfficientNet-B5 and evaluated it on the same testing slides for a fair comparison between the two networks.

Network training

Because of strict privacy policies and non-uniform medical management systems, most medical samples are inaccessible, especially labelled samples [34, 35]. Hence, transfer learning was employed to train the EfficientNet-B5 network given our moderately sized training dataset. The training process comprised two steps. First, we initialized the network with default weights transferred from the ImageNet dataset, froze all the layers except the last fully connected layer, and trained it with our data. Second, we unfroze the frozen layers and fine-tuned the whole network to best fit the target. The parameters of the trainable layers were optimized with respect to the cross-entropy between the predictions and the ground truths. The initial learning rate was 0.0005, and the optimizer was Adam [36], with both momentum and decay set as 0.9. On-the-fly data augmentations, including rotating between 0 and 30°, flipping horizontally or vertically, random brightness or contrast or gamma, zooming in or out, shifting, optical or grid distortion, and elastic transformation, were performed to increase data variety. Except for horizontal flipping, all the other augmentation operations were conducted with a certain probability, either 0.3 or 0.5. To improve convergence, pixel values were rescaled from the range 0–255 to 0–1 by dividing by 255, and then Z-score-normalized with mean (0.485, 0.456, 0.406) and std. (0.229, 0.224, 0.225). Training lasted for 60 epochs, and the model with the minimum loss was saved and adopted.
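The pixel rescaling and Z-score normalization stated above amount to a per-channel affine transform. The following minimal sketch uses the mean and standard deviation values given in the text; the function name is hypothetical, and in practice the same operation would normally be applied to whole image tensors by the deep learning framework rather than pixel by pixel.

```python
from typing import Tuple

# Per-channel (R, G, B) statistics quoted in the text
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb: Tuple[int, int, int]) -> Tuple[float, ...]:
    """Rescale an 8-bit RGB pixel to 0-1, then Z-score-normalize each channel."""
    return tuple((channel / 255.0 - mean) / std
                 for channel, mean, std in zip(rgb, MEAN, STD))


# A black pixel maps to -mean/std per channel; a white pixel to (1 - mean)/std.
black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
```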

Whole-slide label inferencing with threshold-based tumour-first aggregation

The network outputs tile-level predictions, which must be aggregated into slide-level diagnoses. Traditionally, a tile is assigned the class with the maximum prediction probability. Classical aggregation approaches for slide-level inference usually fall into two categories. The first is majority voting, which counts the tiles per class and assigns the slide the label of the most numerous class; the second is mean pooling, which averages the probabilities of each class and takes the slide label from the class with the maximum mean probability. In our datasets, multiple tissue components may coexist in one slide. For example, normal, inflammatory, and neoplastic components may be scattered across different regions of a tumorous slide; in this study, however, only one major type of neoplastic component appears in the label of a tumorous slide. Accordingly, we proposed a two-stage threshold-based tumour-first aggregation method that fuses the majority voting and probability threshold strategies. Pathologists often encounter cases in which multiple lesions coexist; for example, lung cancer and PTB or OP may coexist in one H&E slide. If all lesion types were treated equally and the type with the highest prediction probability were taken as the slide-level diagnosis, the model might miss a cancer lesion because of its small size, which could be much more harmful to patients. Therefore, we aimed to improve the diagnostic sensitivity for cancer and proposed the tumour-first approach. Our method prioritizes tissue types according to disease severity and reports the most threatening tissue type, especially tumorous types.
It is reasonable to set different thresholds for different lesion types. For inflammatory diseases, the threshold range of PTB was initially set slightly lower than that of OP, because PTB has a more characteristic morphology under the microscope. The threshold range of normal lung tissue was set as high as possible to improve diagnostic precision. Because LUAD, LUSC, and SCLC are all tumour types, their thresholds should be the same. In addition, the thresholds should be roughly inversely related to disease severity in order to improve sensitivity. Consequently, the thresholds were set to satisfy the criterion Tumour_threshold < PTB_threshold < OP_threshold < NL_threshold.
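The two classical baselines mentioned above, majority voting and mean pooling, can disagree on the same slide. The sketch below is illustrative only: the function names and the toy probabilities are assumptions, and each tile is represented as a list of six class probabilities in a fixed class order.

```python
from typing import List, Sequence

CLASSES = ["LUAD", "LUSC", "SCLC", "PTB", "OP", "NL"]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first index wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def majority_vote(tile_probs: List[List[float]]) -> str:
    """Label each tile by its top class, then pick the most frequent tile label."""
    counts = [0] * len(CLASSES)
    for probs in tile_probs:
        counts[argmax(probs)] += 1
    return CLASSES[argmax(counts)]


def mean_pool(tile_probs: List[List[float]]) -> str:
    """Average each class probability over all tiles, then pick the largest mean."""
    n_tiles = len(tile_probs)
    means = [sum(probs[i] for probs in tile_probs) / n_tiles
             for i in range(len(CLASSES))]
    return CLASSES[argmax(means)]


# Toy slide: two tiles weakly favour NL, one tile strongly favours LUAD.
toy_slide = [
    [0.45, 0.0, 0.0, 0.0, 0.0, 0.55],
    [0.45, 0.0, 0.0, 0.0, 0.0, 0.55],
    [0.95, 0.0, 0.0, 0.0, 0.0, 0.05],
]
```

On this toy slide, majority voting returns NL (two of three tiles) while mean pooling returns LUAD (mean probability about 0.62 against 0.38), which shows that neither rule encodes any clinical priority between tissue types.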
Our expert pathologists agreed with the threshold-based tumour-first idea and suggested the following threshold ranges based on clinical experience: Tumour = [0.1, 0.5], PTB = [0.2, 0.5], OP = [0.3, 0.5], and NL = [0.7, 0.95]. We adopted these threshold principles and ranges and applied a grid search with a step of 0.05 to obtain the optimal threshold settings on the first testing dataset SYSU1 (Sun Yat-sen University dataset 1). This yielded 450 groups of thresholds, for each of which we calculated the micro-average and macro-average AUCs. When the groups were sorted by micro-average AUC in descending order, with macro-average AUC in descending order as a secondary criterion, the combination of Tumour (LUAD, LUSC, or SCLC) = 0.1, PTB = 0.3, OP = 0.4, and NL = 0.9 satisfied the aforementioned principles and ranked top for the SYSU1 testing cohort (Additional file 2: Table S1); therefore, it was selected as the threshold setting for the following work.
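The count of 450 threshold groups follows from enumerating the stated ranges at a step of 0.05 and keeping only the combinations that satisfy the ordering Tumour < PTB < OP < NL. The sketch below reproduces that count; the function names are hypothetical, thresholds are enumerated in hundredths to avoid floating-point drift, and the ranking helper assumes each candidate has already been scored with its micro- and macro-average AUC.

```python
from itertools import product
from typing import Dict, List, Tuple

STEP = 5  # 0.05 expressed in hundredths

# Inclusive ranges from the text, in hundredths
RANGES = {"tumour": (10, 50), "PTB": (20, 50), "OP": (30, 50), "NL": (70, 95)}


def candidate_thresholds() -> List[Dict[str, float]]:
    """All grid points that satisfy Tumour < PTB < OP < NL."""
    axes = [range(lo, hi + 1, STEP) for lo, hi in RANGES.values()]
    return [
        {"tumour": t / 100, "PTB": p / 100, "OP": o / 100, "NL": n / 100}
        for t, p, o, n in product(*axes)
        if t < p < o < n
    ]


def pick_best(scored: List[Tuple[Dict[str, float], float, float]]) -> Dict[str, float]:
    """Top candidate by micro-average AUC, with macro-average AUC as tie-breaker.

    Each entry is (thresholds, micro_auc, macro_auc).
    """
    return max(scored, key=lambda entry: (entry[1], entry[2]))[0]


candidates = candidate_thresholds()
```

Here `len(candidates)` is 450, matching the number reported above, and the selected setting (0.1, 0.3, 0.4, 0.9) is one of the enumerated grid points.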
After the thresholds were defined, the two-stage aggregation was implemented. In the first stage, the aggregation principle was applied to assign each tile’s label as follows (Additional file 1: Figure S1): (i) if the prediction probability of NL exceeded 0.9, the tile was inferred as NL; (ii) otherwise, if the probability of any neoplastic category was greater than 0.1, the tile was assigned the neoplastic class with the maximum probability; (iii) otherwise, if the prediction value of PTB or OP exceeded its threshold, the corresponding class label was assigned; and (iv) if none of the above conditions was met, the tile was labelled as the class with the maximum probability. In the second stage, a similar protocol was applied, with the tile number per class divided by the total tile number used as the input vector (Additional file 1: Figure S2). That is, we took each tile’s label from the first stage and counted the supporting tiles in each class; each count was then divided by the total number of tiles to obtain the slide-level proportion of each class; finally, these proportions served as the input of the second stage to infer the slide-level label. As a result, the tile-level predictions were aggregated into human-readable slide-level diagnoses in accordance with medical knowledge.
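The two-stage rule can be sketched as one decision function applied twice, first to tile probabilities and then to slide-level class proportions. This is a hedged reading of the description, not the authors' code: the function names are hypothetical, and where the text leaves the order open the sketch assumes that PTB is checked before OP when both exceed their thresholds.

```python
from typing import Dict, List, Tuple

CLASSES = ("LUAD", "LUSC", "SCLC", "PTB", "OP", "NL")
TUMOURS = ("LUAD", "LUSC", "SCLC")
# Threshold setting selected on SYSU1, as reported in the text
THRESHOLDS = {"tumour": 0.1, "PTB": 0.3, "OP": 0.4, "NL": 0.9}


def tumour_first_label(scores: Dict[str, float],
                       thresholds: Dict[str, float] = THRESHOLDS) -> str:
    """Apply rules (i)-(iv) to a mapping from class name to a value in [0, 1]."""
    # (i) confidently normal
    if scores["NL"] > thresholds["NL"]:
        return "NL"
    # (ii) any tumour class above the tumour threshold: report the strongest one
    top_tumour = max(TUMOURS, key=lambda name: scores[name])
    if scores[top_tumour] > thresholds["tumour"]:
        return top_tumour
    # (iii) inflammatory classes; assumption: PTB takes priority over OP
    for name in ("PTB", "OP"):
        if scores[name] > thresholds[name]:
            return name
    # (iv) fall back to the class with the maximum value
    return max(CLASSES, key=lambda name: scores[name])


def slide_label(tile_scores: List[Dict[str, float]]) -> Tuple[str, Dict[str, float]]:
    """Stage 1 labels every tile; stage 2 applies the same rule to class proportions."""
    tile_labels = [tumour_first_label(scores) for scores in tile_scores]
    proportions = {name: tile_labels.count(name) / len(tile_labels)
                   for name in CLASSES}
    return tumour_first_label(proportions), proportions


def make_scores(**values: float) -> Dict[str, float]:
    """Helper for toy examples: unspecified classes get probability 0."""
    return {name: values.get(name, 0.0) for name in CLASSES}


# Toy slide: 7 confidently normal tiles, 2 LUAD-leaning tiles, 1 OP-leaning tile
toy_tiles = ([make_scores(NL=0.95, LUAD=0.05)] * 7
             + [make_scores(LUAD=0.6, NL=0.4)] * 2
             + [make_scores(OP=0.5, NL=0.5)])
toy_label, toy_proportions = slide_label(toy_tiles)
```

In this toy slide, 70% of tiles are labelled NL, which is below the NL threshold of 0.9, and 20% are labelled LUAD, which exceeds the tumour threshold of 0.1, so the slide is reported as LUAD even though a plain majority vote would call it normal.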

Multi-centre model testing

To explore the generalization ability of our classifier, further validations were conducted on four independent cohorts, including two internal cohorts, SYSU1 and SYSU2 (Sun Yat-sen University dataset 2), and two external cohorts, SZPH (Shenzhen People’s Hospital dataset) and TCGA (The Cancer Genome Atlas dataset) (Table 2, Table 3, and Additional file 3: Table S2). Both the SYSU1 and SYSU2 datasets came from the First Affiliated Hospital of Sun Yat-sen University. All the slides were anonymized to protect patients’ privacy. The testing slides did not overlap with the slides used to develop the model; they carried clinical diagnosis labels only and received an inferred diagnosis from the model. Tiles were extracted exhaustively from the whole slide excluding the background, with 10% overlap between adjacent tiles, and tiles with a tissue proportion of less than 20% were filtered out for computational efficiency. Recall, precision, F1-score, accuracy, and AUC were computed to quantify and compare the models’ performance across the four testing cohorts.
Table 3
Multi-centre cohorts collected for model validation
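| Cohorts | LUAD | LUSC | SCLC | PTB | OP | NL | SUM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SYSU2 | 56 | 64 | 52 | 30 | 25 | 91 | 318 |
| SZPH | 60 | 75 | 43 | 0 | 0 | 34 | 212 |
| TCGA | 141 | 134 | 0 | 0 | 0 | 147 | 422 |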

Comparison between the deep learning model and stratified pathologists

Four pathologists of different professional levels independently and blindly diagnosed the WSIs with ASAP in a single session and documented the time they spent. We then collected their diagnoses for performance evaluation and comparison with our six-type classification model.

Visualization of the predictions

Heatmaps are widely used for visualization because of their rich colours and expressive display. In this work, heatmaps were plotted as overlays on the tiles, with colours corresponding to the tile-level class probability, which ranged from 0 to 1. A more saturated colour indicated a larger probability. Where appropriate, the coordinate axes marking tile locations were omitted for visual clarity. Receiver operating characteristic curves (ROCs) were plotted to show how sensitivity varied with specificity. Bar plots and Cleveland dot plots were used to illustrate tile distributions within slides and across cohorts. A Sankey diagram was drawn to compare our deep learning model with the most experienced pathologist.

Statistical analysis

To evaluate the performance of our model and the pathologists, precision, recall, F1-score, AUC, micro-average AUC, and macro-average AUC were calculated in Python with the scikit-learn [37] library, using functions including classification_report, auc, and roc_curve. Micro- and macro-AUCs were computed as sample- and class-average AUCs, respectively. 95% CIs were estimated for categorical AUC, micro-average AUC, and macro-average AUC by bootstrap [38] resampling of the samples 10,000 times. Intraclass correlation coefficients (ICCs) were calculated with the ‘irr’ package [39] in R using the ‘oneway’ model, and the corresponding 95% CIs were also obtained by 10,000-fold bootstrapping. ICC ranges from 0 to 1, and a high ICC denotes good consistency. Conventionally, ICC > 0.75 with P < 0.05 indicates high reliability, repeatability, and consistency [40].
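To make the difference between micro- and macro-average AUC and the bootstrap interval concrete, the following sketch implements them with the Python standard library only. It is not the scikit-learn code used in the study: the AUC is computed through its rank (Mann-Whitney) interpretation, the function names are hypothetical, the percentile interval is one common bootstrap variant assumed here, and only the micro-average AUC is bootstrapped to keep the example short.

```python
import random
from typing import List, Sequence, Tuple


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def micro_auc(y_true: Sequence[int], y_score: Sequence[Sequence[float]],
              n_classes: int) -> float:
    """Pool every (sample, class) pair into one binary problem."""
    flat_labels = [int(k == y) for y in y_true for k in range(n_classes)]
    flat_scores = [s for row in y_score for s in row]
    return binary_auc(flat_labels, flat_scores)


def macro_auc(y_true: Sequence[int], y_score: Sequence[Sequence[float]],
              n_classes: int) -> float:
    """Unweighted mean of the one-vs-rest AUC of each class."""
    per_class = [binary_auc([int(y == k) for y in y_true],
                            [row[k] for row in y_score])
                 for k in range(n_classes)]
    return sum(per_class) / n_classes


def bootstrap_micro_auc_ci(y_true: Sequence[int], y_score: Sequence[Sequence[float]],
                           n_classes: int, n_boot: int = 1000,
                           alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI: resample slides with replacement, recompute the AUC."""
    rng = random.Random(seed)
    n = len(y_true)
    stats: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(micro_auc([y_true[i] for i in idx],
                               [y_score[i] for i in idx], n_classes))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

The micro-average weights every slide equally and is therefore dominated by frequent classes, whereas the macro-average weights every class equally; this is why the two values can differ on imbalanced cohorts.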

Hardware and software

The raw WSIs were viewed with K-Viewer (provided by the scanner vendor, KFBIO). OpenSlide [41] (version 1.1.1) and OpenCV [42] (version 4.1.1) in Python (version 3.6.6) were used for image extraction and analysis. The main working platform was a high-performance computing node equipped with dual NVIDIA P100 16GB Volta GPUs, and the deep learning model was constructed, trained, and validated with PyTorch [43] (version 1.2.0) on a single GPU. Scikit-learn (version 0.21.2) and Matplotlib [44] (version 2.2.2) in Python were used for most of the estimation and visualization work. The ‘gcookbook’ and ‘tidyverse’ packages in R (version 3.6.1) were used to draw bar plots and Cleveland dot plots.

Results

Internal cohort testing

A total of 741 lung-derived digital WSIs, consisting of 512 tumorous tissues, 130 inflammatory tissues, and 99 normal tissues from the SYSUFH, constituted the initial dataset and were randomly divided into the training (n = 511 slides), validation (n = 115 slides), and internal testing (SYSU1) (n = 115 slides) subsets (Table 2). The WSIs for training and validation were annotated by experienced pathologists and reviewed by the head of the Pathology Department at SYSUFH, and only ROIs were extracted and tessellated into small 256 × 256-pixel tiles at × 20 magnification as inputs to the EfficientNet-B5 network. The testing slides were assigned diagnostic labels only, and the whole slide excluding the background was used and pre-processed in the same fashion as the annotated slides. In total, 709,212 tiles were generated, of which 432,965 were used for training and validation and 276,247 were used to evaluate the classification performance of the model. The tile distributions are detailed in Table 2. With the training and validation datasets, we developed a deep learning-based six-type classifier that can identify histopathological lung lesions of LUAD, LUSC, SCLC, PTB, OP, and normal lung (NL).
Tested on the internal independent cohort of 115 WSIs, the classifier achieved micro- and macro-average AUCs of 0.970 (95% CI, 0.955–0.984) and 0.988 (95% CI, 0.982–0.994), respectively (Fig. 2a). AUCs for all tissue types were above 0.965, and the results for SCLC (0.995), PTB (0.994), and OP (0.996) suggested that the model was competent in distinguishing cancerous from noncancerous lung diseases. Precision, recall, and F1-score were adopted for static assessment (Table 4). Encouragingly, SCLC and noncancerous tissues tended to obtain high precision, and SCLC, NL, and OP even reached 1. This meant fewer false positives for SCLC and for the milder conditions, and thus a lower risk of the serious consequences of a missed diagnosis. Meanwhile, cancerous tissues obtained high recall, in line with the goal of high sensitivity for malignant tissues. In brief, the deep learning-based six-type classifier exhibited substantial predictive power in the internal independent testing. The whole-slide-level confusion matrices (Additional file 1: Figure S3) for each testing cohort illustrate the misclassifications made by our method.
Table 4
Model performances across SYSU1, SYSU2, SZPH, and TCGA testing sets
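| Metrics | Cohorts | LUAD | LUSC | SCLC | PTB | OP | NL | Macro-avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Precision | SYSU1 | 0.80 | 0.75 | 1.00 | 0.89 | 1.00 | 1.00 | 0.91 |
| Precision | SYSU2 | 0.85 | 0.88 | 0.79 | 0.80 | 0.88 | 0.96 | 0.86 |
| Precision | SZPH (a) | 0.97 | 0.84 | 0.94 | – | – | 1.00 | 0.94 |
| Precision | TCGA (b) | 0.82 | 0.70 | – | – | – | 1.00 | 0.84 |
| Precision | Macro-avg | 0.86 | 0.79 | 0.91 | 0.85 | 0.94 | 0.99* | 0.89 |
| Recall | SYSU1 | 1.00 | 0.75 | 0.77 | 0.80 | 0.60 | 0.93 | 0.81 |
| Recall | SYSU2 | 0.84 | 0.72 | 0.94 | 0.93 | 0.84 | 0.95 | 0.87 |
| Recall | SZPH (a) | 0.93 | 0.97 | 0.67 | – | – | 0.91 | 0.87 |
| Recall | TCGA (b) | 0.68 | 0.94 | – | – | – | 0.78 | 0.80 |
| Recall | Macro-avg | 0.86 | 0.85 | 0.79 | 0.87 | 0.72 | 0.89* | 0.84 |
| F1-score | SYSU1 | 0.89 | 0.75 | 0.87 | 0.84 | 0.75 | 0.96 | 0.84 |
| F1-score | SYSU2 | 0.85 | 0.79 | 0.86 | 0.86 | 0.86 | 0.95 | 0.86 |
| F1-score | SZPH (a) | 0.95 | 0.90 | 0.78 | – | – | 0.95 | 0.90 |
| F1-score | TCGA (b) | 0.74 | 0.80 | – | – | – | 0.88 | 0.80 |
| F1-score | Macro-avg | 0.86 | 0.81 | 0.84 | 0.85 | 0.81 | 0.94* | 0.85 |

(a) For the SZPH dataset, no PTB or OP WSIs were available
(b) For the TCGA dataset, only LUAD, LUSC, and NL WSIs were available
*Maximum Macro-avg value across the datasets of different diseases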
Bold font: Maximum value of specific metrics across different data cohorts
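As a reading aid for Table 4, the F1-score is the harmonic mean of precision and recall, so each F1 entry can be checked against the two rows above it. The short sketch below does this for three SYSU1 entries; the function name is hypothetical, and the inputs are the rounded values printed in the table, so agreement is only expected to two decimal places.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# SYSU1 entries from Table 4: (precision, recall)
sysu1 = {"LUAD": (0.80, 1.00), "SCLC": (1.00, 0.77), "OP": (1.00, 0.60)}
sysu1_f1 = {name: round(f1_score(p, r), 2) for name, (p, r) in sysu1.items()}
```

The recomputed values are 0.89 for LUAD, 0.87 for SCLC, and 0.75 for OP, which agree with the F1-score row for SYSU1.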

Multi-cohort testing

Another batch of specimens from SYSUFH (SYSU2) (n = 318 slides), an external validation dataset from Shenzhen People’s Hospital (SZPH) (n = 212 slides), and a randomly selected subset of The Cancer Genome Atlas (TCGA) (n = 422 slides) were collected for further multi-cohort testing (Table 3 and Additional file 3: Table S2). Notably, because of the limitations of the external data resources, data for PTB and OP were unavailable for SZPH and TCGA, and data for SCLC were unavailable for TCGA as well. As before, AUC, precision, recall, and F1-score were computed to evaluate classification performance (Table 4).
Our classifier attained micro-average AUCs of 0.918 (95% CI, 0.897–0.937) (Fig. 2b) and 0.963 (95% CI, 0.949–0.975) (Fig. 2c) for SYSU2 and SZPH, respectively, showing consistent performance on data from different medical centres. For the publicly available TCGA subset, the micro-average AUC was 0.978 (95% CI, 0.971–0.983) (Fig. 2d), which surpassed those obtained from both the internal and external cohorts. In terms of precision, recall, and F1-score (Table 4), the model performed best on the SZPH dataset, followed by SYSU2, and NL was the most accurately distinguished tissue type, with a macro-average F1-score of 0.94 across the four cohorts, followed by LUAD with a macro-average F1-score of 0.86. The composition of TCGA and SZPH limited the corresponding experiments to a subset of the lung lesion categories in this study; nevertheless, the results demonstrated our method’s robustness and insensitivity to class imbalance. Overall, the histopathological six-type classifier delivered consistent results in multi-cohort testing, and its tolerance of data bias and applicability at a wider scale narrowed the gap between artificial intelligence and actual clinical use. It is reasonable to believe that the model holds promise for relieving the workload of pathologists and covering more extensive clinical scenarios.

Comparison between EfficientNet-B5 and ResNet-50

Table 5 shows that ResNet-50 performed comparably with EfficientNet-B5 on the SYSU1 cohort and was slightly less accurate on SYSU2. However, EfficientNet-B5 showed clear advantages on the SZPH and TCGA cohorts. ResNet-50 was competent in common tasks but inferior in generalization, as shallower networks are naturally weaker at learning abstract features that may be crucial for distinguishing slides from multiple sources. Hence, EfficientNet-B5 outperformed ResNet-50 in multi-cohort testing and was selected as the backbone model.
Table 5
EfficientNet-B5 outperformed ResNet-50 across four testing cohorts
| Cohort | Model | Micro-AUC | Macro-AUC | Accuracy | Weighted-F1-score |
| --- | --- | --- | --- | --- | --- |
| SYSU1 | ResNet-50 | 0.966 | 0.985 | 0.860 | 0.860 |
| SYSU1 | EfficientNet-B5 | 0.970 | 0.988 | 0.860 | 0.860 |
| SYSU2 | ResNet-50 | 0.887 | 0.953 | 0.780 | 0.770 |
| SYSU2 | EfficientNet-B5 | 0.918 | 0.968 | 0.870 | 0.870 |
| SZPH | ResNet-50 | 0.713 | 0.733 | 0.540 | 0.520 |
| SZPH | EfficientNet-B5 | 0.963 | 0.971 | 0.890 | 0.900 |
| TCGA | ResNet-50 | 0.967 | 0.973 | 0.690 | 0.680 |
| TCGA | EfficientNet-B5 | 0.978 | 0.962 | 0.800 | 0.810 |

Visualizing predictions with heatmaps

To show the landscape of whole-slide-level predictions, heatmaps were plotted as overlays on the tiles, with different colours standing for the predicted tissue types. One representative of each tissue type was randomly selected and is visualized in Fig. 3. The first row displays the WSIs with ROI annotations, and the second row shows the corresponding probability heatmaps. From left to right are the sample heatmaps for LUAD, LUSC, SCLC, PTB, OP, and NL, respectively. In Fig. 3, the predictions of tiles and subregions can be clearly observed and mapped to the in situ tissues. The whole-slide landscapes of predictions were generally a mix of tissue components, among which the predominant component of the same priority contributed more to the final diagnostic conclusion. Figure 3 also illustrates that the regions suggested by our six-type classifier were highly consistent with the ROIs annotated by the pathologists. For example, the highlighted regions of the SCLC, PTB, and OP heatmaps perfectly matched their corresponding ROI annotations in the upper row, and the predicted region of LUAD coincided with the main ROI, although it missed about 30% of the actual lesions. Notably, cancerous components rarely appeared in noncancerous slides, and the prominent components tended to appear as a contiguous clump. In addition, the margins of noncancerous slides tended to be predicted as OP. We also generated heatmaps (Additional file 1: Figure S4) to present the false-positive prediction cases. In these cases, cancer cases were predicted as other types of cancer, and NL cases were predicted as PTB or OP. In brief, the heatmaps provide an intuitive overview of whole-slide predictions, help to reveal the underlying histopathological patterns, and simplify the interpretation of results.

Contesting with pathologists

To compare our model with pathologists in the diagnosis of lung lesions, four pathologists at three different training levels (senior attending, junior attending, and junior) were invited to independently and blindly review all the H&E-stained slides from the four testing cohorts by manual inspection alone. The true-positive rate (TPR) and false-positive rate (FPR) were calculated for each pathologist and plotted on the ROCs as five-pointed stars of different colours (Fig. 2). The NL curves (cyan) lie above some of the stars, and the LUAD curves (hot pink) lie under or close to the stars in most cases. Pathologist3 ranked highest in SYSU1, SYSU2, and TCGA, despite being a junior attending. Disparities between attending pathologists existed but did not make much difference. Roughly speaking, our model achieved performance comparable to that of the pathologists, and even better in some cases.
To quantify the consistency of performance, ICCs with 95% CIs between the pathologists and our model were calculated. As listed in Table 6, our method achieved its highest ICCs of 0.946 (95% CI, 0.715, 0.992) with the ground truth in TCGA, 0.945 (95% CI, 0.709, 0.992) with Pathologist3 in SYSU1, 0.960 (95% CI, 0.783, 0.994) with Pathologist2 in SYSU2, and 0.928 (95% CI, 0.460, 0.995) with Pathologist3 in SZPH. All the ICCs were above 0.75 (P < 0.05), and the model behaved closest overall to Pathologist3, who was the best-performing pathologist in blind inspection of the four cohorts.
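A pathologist's diagnoses yield a single operating point rather than a curve, which is why each reader appears as one star on the ROC plot. The sketch below shows one way such a point could be computed for a given class in a one-vs-rest fashion. The function name and the one-vs-rest treatment are assumptions for illustration, since this section does not state exactly how the per-pathologist rates were derived.

```python
def one_vs_rest_rates(y_true, y_pred, positive):
    """Return (TPR, FPR) for one class, treating all other classes as negative."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Rates are undefined when a cohort has no positives or no negatives for the class.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr
```

For a cohort that lacks a class entirely (for example PTB and OP in SZPH and TCGA), the TPR for that class is returned as NaN instead of raising an error.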
Table 6
High ICCs between the model and pathologists across four independent testing cohorts indicate high consistency and comparable performance
Six-type classification model (ICC^a with 95% CI^b)
| Raters | SYSU1 | SYSU2 | SZPH | TCGA |
| --- | --- | --- | --- | --- |
| Ground truth | 0.941 (0.691, 0.991) | 0.959 (0.776, 0.994) | 0.927 (0.453, 0.995) | 0.946 (0.715, 0.992) |
| Pathologist1 (+++)^c | 0.938 (0.677, 0.991) | 0.957 (0.767, 0.994) | 0.878 (0.215, 0.991) | 0.918 (0.592, 0.988) |
| Pathologist2 (++)^c | 0.873 (0.422, 0.981) | 0.960 (0.783, 0.994) | 0.909 (0.356, 0.994) | 0.928 (0.633, 0.989) |
| Pathologist3 (++)^c | 0.945 (0.709, 0.992) | 0.945 (0.709, 0.992) | 0.928 (0.460, 0.995) | 0.922 (0.608, 0.988) |
| Pathologist4 (+)^c | 0.944 (0.707, 0.992) | 0.800 (0.200, 0.969) | 0.905 (0.538, 0.986) | 0.754 (0.086, 0.961) |
| P value^d | < 0.05 | < 0.05 | < 0.05 | < 0.05 |
^a ICCs were computed with the ‘irr’ package for R v3.6.1 using the ‘oneway’ model to measure the reliability and consistency of diagnoses among raters
^b CIs were given by bootstrapping the samples 10,000 times
^c ‘+’ symbols indicate the levels of pathologists: + means junior, ++ means junior attending, and +++ means senior attending
^d ICC ranges from 0 to 1, and a high ICC suggests good consistency. Conventionally, ICC > 0.75 with P < 0.05 indicates high reliability, repeatability, and consistency
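For readers who want to see what a one-way ICC measures, the following minimal sketch computes the single-rater, one-way random-effects ICC from the between-subject and within-subject mean squares. It is not the study's R code. The function name, the choice of the single-rater form, and the layout of the input (one row per rated subject, one column per rater) are assumptions for this example.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC for single ratings.

    ratings: list of rows, one row per rated subject, one value per rater.
    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW), with k raters.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two subjects")
    k = len(ratings[0])
    if k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-subject mean square: spread of the subject means around the grand mean.
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-subject mean square: disagreement among raters on the same subject.
    msw = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    denominator = msb + (k - 1) * msw
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (msb - msw) / denominator
```

When two raters agree exactly on every subject, the within-subject term vanishes and the function returns 1.0; increasing disagreement on the same subjects lowers the value.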
For further insight into the relationships among the resulting predictions, a Sankey diagram (Fig. 4) was built to illustrate the differences among the ground truth, the most experienced pathologist (Pathologist1 in Table 6), and our six-type classifier. Taking the ground truth (the middle column) as the benchmark, the spanning curves on the left and right indicate misjudgements by Pathologist1 and our classifier, respectively. The model’s overall performance was comparable with that of the pathologist and highly consistent with the ground truth. Comparatively, our model made fewer mistakes for LUSC, though more mistakes for LUAD and SCLC. Further, the model was so tumour-sensitive that it tended to produce false positives by predicting NL as suspicious lesions, whereas the expert pathologist was much more confident in confirming a disease-free tissue. Table 7 summarizes the cases that were misjudged by at least one pathologist; over half of these mistakes were corrected by the model. Therefore, our model achieved performance comparable to that of experienced pathologists.
Table 7
Misjudgements by pathologists that were corrected by the six-type classifier
| Cohorts | SYSU1 | SYSU2 | SZPH | TCGA |
| --- | --- | --- | --- | --- |
| Error(s)^a | 31 | 84 | 21 | 120 |
| Correction(s)^b | 22 | 59 | 18 | 90 |
^a Errors denote the number of slides misjudged by at least one of the pathologists
^b Corrections denote the number of those misjudged slides corrected by our six-type classifier
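The statement that over half of the pathologists' mistakes were corrected can be checked directly from the counts in Table 7. The short calculation below is only a worked example of that arithmetic; the variable names are illustrative.

```python
# Counts copied from Table 7.
errors = {"SYSU1": 31, "SYSU2": 84, "SZPH": 21, "TCGA": 120}
corrections = {"SYSU1": 22, "SYSU2": 59, "SZPH": 18, "TCGA": 90}

# Fraction of misjudged slides that the classifier labelled correctly, per cohort.
correction_rate = {cohort: corrections[cohort] / errors[cohort] for cohort in errors}

# Pooled fraction across the four cohorts.
overall_rate = sum(corrections.values()) / sum(errors.values())

for cohort, rate in correction_rate.items():
    print(f"{cohort}: {rate:.1%}")
print(f"Overall: {overall_rate:.1%}")
```

The per-cohort rates come out at roughly 71.0% (SYSU1), 70.2% (SYSU2), 85.7% (SZPH), and 75.0% (TCGA), with a pooled rate of about 73.8% (189 of 256 slides).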
Manual inspection is clearly labour-intensive. For example, a full inspection of the TCGA cohort took a pathologist 6 to 10 h, whereas the model completed the entire analysis within approximately an hour. Additionally, inter-rater and intra-rater variability in manual inspection influenced the final consensus. Compared with manual inspection by pathologists, our six-type classifier is a more stable and cost-effective choice.

Discussion

Histopathological evaluation has until now been the cornerstone of final cancer diagnosis, directing further examination, treatment, and prognosis. The transition from glass slides under an optical microscope to virtual slides viewed on computers has enabled the automation of inspection and quantitative assessment. Medical AI has been shown to be favourable for improving healthcare quality and lessening the inequality between urban and rural health services [45]. Lung cancer threatens millions of lives every year. Although important discoveries have been made in recent years, accurate histopathological classification remains challenging in clinical practice. Certainly, distinguishing LUAD from LUSC is necessary; however, SCLC deserves more attention for its high malignancy and poor survival rate [46]. In addition, a tumour generally appears as a mixture of neoplastic and inflammatory lesions, and extensive inflammatory lesions may mask local tumour changes, thus contributing to false-negative diagnoses. Conversely, mistaking nonneoplastic tissue for neoplastic tissue gives rise to the risk of overdiagnosis and overtreatment. Therefore, to tackle real clinical problems, we designed the six-type classifier for wider coverage of lung diseases, including lung cancers as well as inflammatory lung diseases.
The histological assessment of lesions involves different staining techniques to reach a final diagnosis. In all histological diagnoses, H&E staining comes first and is indispensable. In the routine clinicopathological diagnostic procedure, pathologists first evaluate whether a lesion is benign or malignant using H&E-stained sections. If malignancy is suspected, subsequent typing is conducted using immunohistochemistry or molecular detection. If the lesion is benign and an inflammatory lesion is suspected, for example in the evaluation of mycobacterial infection, Ziehl-Neelsen (ZN) staining is needed to confirm the diagnosis. It is true that H&E staining cannot directly identify pathogens such as mycobacteria; however, lung tissue infected with mycobacteria shows characteristic histological changes, including the formation of granulomas that consist of epithelioid macrophages and multinucleated giant cells, often with central caseous necrosis. Therefore, we believe that morphology is the first step in recognizing disease microscopically. Based on morphological characteristics, our model performed the six-type classification for diagnostic purposes using H&E-stained tissue.
Our six-type classifier was comparable to other relevant state-of-the-art tools (Table 1) and had advantages in dealing with complex application scenarios. For example, DeepPath [16] achieved micro- and macro-average AUCs of 0.970 (95% CI, 0.950–0.986) and 0.976 (95% CI, 0.949–0.993), respectively, for the classification of NSCLC, which were not significantly different from ours. Notably, our model performed better in distinguishing NL (0.999 versus 0.984) and LUSC (0.974 versus 0.966) and comparably in LUAD (0.965 versus 0.969). Yu et al. [17] also implemented multiple network architectures to subclassify NSCLC using the TCGA data and achieved an AUC of 0.864, which was 0.114 lower than our TCGA result. Moreover, Kriegsmann et al. [20] adopted Inception-V3 to classify LUAD, LUSC, SCLC, and NL, achieving an AUC of 1.000; however, that was achieved after strict quality controls in their data pre-processing phase. Furthermore, Wang et al. [21] conducted a similar classification task without PTB and OP using different feature aggregation methods and compared their efficiencies. Their CNN-AvgFea-Norm3-based RF method achieved an AUC of 0.856 and an accuracy of 0.820 on the TCGA dataset, that is, an AUC 0.122 lower and an accuracy 0.020 higher than those of our classifier. Notably, the input dataset in Wang’s study was manually selected from TCGA and composed only of LUAD (n = 250 slides) and LUSC (n = 250 slides). These comparisons suggest that our classifier could adapt to more complicated situations in real clinical scenarios.
Moreover, we overcame some challenges in data pre- and post-processing. The first challenge was to reduce the class imbalance of the initial dataset, which required proper separation at the slide and tile levels. The ROIs varied in size, and a slide can have different numbers of ROIs. Hence, we divided the slides into training, validation, and testing sets according to the ROI areas per slide per class, roughly following a ratio of 4:1:1. Nevertheless, some tiles were filtered out for low tissue coverage before model training. We examined the distribution of ROI areas by counting the number of tiles per slide (Additional file 1: Figure S5). The general trend was that the number of slides declined as the tile number increased in both the training and validation sets. The majority of slides had ROIs of fewer than 2000 tiles, and the largest tile number was no more than 4000, which suggested cautious annotation strategies and a low chance of over-representation of any single slide, thereby avoiding overfitting in the model training phase to some extent.
Then came the challenge of aggregating tile-level predictions into slide-level inference. Note that multiple tissue components usually coexist in a slide. Therefore, the slide-level label should not be determined only by the tissue type with the most supporting tiles, and a tumorous class should be reported first even when it has fewer tiles. Recently, researchers have appended secondary classifiers (e.g., logistic regression, random forest, and support vector machine) that take as input features based on the tile probability scores generated by CNNs [47, 48]. Campanella et al. applied a random forest algorithm to select the most suspicious tiles and then trained an RNN model to draw slide-level predictions [49]. However, the feature engineering and extra optimization procedures complicated the classification work and introduced some degree of uncertainty. In this study, we preferred to test whether a more convenient AI solution could be adapted to clinical use. Accordingly, a set of thresholds advised by expert pathologists and conforming to clinical experience was defined and integrated with the majority voting method for slide-level label inference. Validated on both the internal and external testing datasets, the thresholds proved feasible and effective.
Finally, we tried to interpret the differences in prediction performance observed in the multi-centre validation experiments. First, we checked and compared the distributions of ROIs across testing cohorts (Additional file 1: Figure S5). Although the testing slides showed a similar pattern of tile agglomeration, quite a few slides fell into the interval of 0–500 tiles, especially in the SYSU2 and SYSU1 cohorts. The tile distribution of misjudged slides was plotted as a Cleveland dot plot grouped by cohort (Additional file 1: Figure S6). Not surprisingly, errors were concentrated in slides with fewer than 500 tiles, because small slides were the most susceptible to individual tile errors. A closer inspection of the SYSU1 testing set showed that approximately 24.3% of the slides were small specimens, which may partially explain the relatively lower AUCs in SYSU1. The SYSU2 cohort was collected because of the substantial number of small-sample slides and the imbalance in the SYSU1 testing cohort. As a result, the model obtained improved performance on SYSU2. The SZPH cohort was best predicted, which may be attributable to a relatively even distribution of tiles. When reviewing the TCGA slides, we found some obvious artefacts such as pen marks, margin overlap, and defocus. In addition, staining differences were observed between TCGA and the other three cohorts, which also contributed to the differences in performance.
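One way an area-balanced split of this kind could be carried out is sketched below. It is a hypothetical greedy procedure, not the authors' documented one. It uses the tile count per slide as a proxy for ROI area, keeps every slide in exactly one set, and would be applied separately to each class. The function name and the tie-breaking rule are assumptions.

```python
def split_slides_by_tiles(tile_counts, ratios=(4, 1, 1)):
    """Greedily assign whole slides to train/val/test so tile totals approach `ratios`.

    tile_counts: dict mapping slide id -> number of ROI tiles (for one class).
    Returns (assignment, totals), where assignment maps slide id -> set name.
    """
    names = ("train", "val", "test")
    weight = dict(zip(names, ratios))
    totals = {name: 0 for name in names}
    assignment = {}

    # Place large slides first; smaller ones then even out the remaining imbalance.
    ordered = sorted(tile_counts.items(), key=lambda item: (-item[1], item[0]))
    for slide_id, n_tiles in ordered:
        # Pick the set that is currently least filled relative to its target share.
        target = min(names, key=lambda name: totals[name] / weight[name])
        assignment[slide_id] = target
        totals[target] += n_tiles
    return assignment, totals
```

Splitting at the slide level rather than the tile level keeps tiles from one slide out of both the training and testing sets, which would otherwise inflate the measured performance.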
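The idea of a threshold-based, tumour-first aggregation can be sketched as follows. This is an illustration under stated assumptions, not the published rule set. The priority tiers (tumour classes, then inflammatory classes, then normal lung) follow the description above, but the threshold values in `EXAMPLE_THRESHOLDS` are placeholders invented for this example and the function name is hypothetical; the actual thresholds were advised by expert pathologists and are not given in this section.

```python
from collections import Counter

# Classes are checked tier by tier; a higher tier wins even with fewer tiles.
PRIORITY_TIERS = (
    ("LUAD", "LUSC", "SCLC"),  # tumour classes are reported first
    ("PTB", "OP"),             # inflammatory lesions
    ("NL",),                   # normal lung
)

# Placeholder minimum tile fractions (assumed values, not from the study).
EXAMPLE_THRESHOLDS = {
    "LUAD": 0.10, "LUSC": 0.10, "SCLC": 0.10,
    "PTB": 0.20, "OP": 0.20,
    "NL": 0.0,
}


def aggregate_slide_label(tile_labels, thresholds=EXAMPLE_THRESHOLDS):
    """Infer a slide-level label from tile-level predicted labels."""
    if not tile_labels:
        raise ValueError("a slide needs at least one predicted tile")
    counts = Counter(tile_labels)
    total = len(tile_labels)

    for tier in PRIORITY_TIERS:
        # Keep the classes in this tier whose tile fraction reaches the threshold.
        candidates = [
            label for label in tier
            if counts[label] > 0 and counts[label] / total >= thresholds.get(label, 0.0)
        ]
        if candidates:
            # Majority vote within the tier decides between same-priority classes.
            return max(candidates, key=lambda label: counts[label])

    # No class reached its threshold: fall back to a plain majority vote.
    return counts.most_common(1)[0][0]
```

With these placeholder values, a slide with 80 NL tiles, 15 LUAD tiles, and 5 OP tiles is labelled LUAD, whereas a slide with 97 NL tiles and 3 LUAD tiles stays NL because the tumour fraction does not reach its threshold. The thresholds guard against a handful of misclassified tiles turning a normal slide into a cancer call.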

Conclusions

Our work highlights the possibility of predicting a wider spectrum of easily confused lung diseases from WSIs using a deep learning model coupled with a threshold-based tumour-first aggregation method. With its broad coverage of lung diseases, rigorous validation on multi-centre cohorts, improved interpretability, and consistency comparable to that of experienced pathologists, our classifier exhibited excellent accuracy, robustness, efficiency, and practicability as a promising assistive protocol suited to complex clinical pathology scenarios.

Acknowledgements

We acknowledge the pathologists and assistants participating in this study for specimen collection, preparation, and quality control at the Sun Yat-sen University.

Declarations

This study was approved by the Ethics Committee of First Affiliated Hospital of Sun Yat-sen University, approval number [2013] C-084.
Not applicable.

Competing interests

To maximize the impact of this study, Sun Yat-sen University submitted a patent application to the State Intellectual Property Office of China (SIPO).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References
1.
2.
Zurück zum Zitat Stang A, Pohlabeln H, Müller KM, Jahn I, Giersiepen K, Jöckel KH. Diagnostic agreement in the histopathological evaluation of lung cancer tissue in a population-based case-control study. Lung Cancer. 2006;52:29–36.PubMedCrossRef Stang A, Pohlabeln H, Müller KM, Jahn I, Giersiepen K, Jöckel KH. Diagnostic agreement in the histopathological evaluation of lung cancer tissue in a population-based case-control study. Lung Cancer. 2006;52:29–36.PubMedCrossRef
3.
Zurück zum Zitat Grilley-Olson JE, Hayes DN, Moore DT, Leslie KO, Wilkerson MD, Qaqish BF, et al. Validation of interobserver agreement in lung cancer assessment: hematoxylin-eosin diagnostic reproducibility for non-small cell lung cancer: the 2004 World Health Organization classification and therapeutically relevant subsets. Arch Pathol Lab Med. 2013;137:32–40.PubMedCrossRef Grilley-Olson JE, Hayes DN, Moore DT, Leslie KO, Wilkerson MD, Qaqish BF, et al. Validation of interobserver agreement in lung cancer assessment: hematoxylin-eosin diagnostic reproducibility for non-small cell lung cancer: the 2004 World Health Organization classification and therapeutically relevant subsets. Arch Pathol Lab Med. 2013;137:32–40.PubMedCrossRef
4.
Zurück zum Zitat Srinidhi CL, Ciga O, Martel AL. Deep neural network models for computational histopathology: a survey. Med Image Anal. 2021;67:101813.PubMedCrossRef Srinidhi CL, Ciga O, Martel AL. Deep neural network models for computational histopathology: a survey. Med Image Anal. 2021;67:101813.PubMedCrossRef
5.
Zurück zum Zitat Chen H, Qi X, Yu L, Dou Q, Qin J, Heng PA. DCAN: deep contour-aware networks for object instance segmentation from histology images. Med Image Anal. 2017;36:135–46.PubMedCrossRef Chen H, Qi X, Yu L, Dou Q, Qin J, Heng PA. DCAN: deep contour-aware networks for object instance segmentation from histology images. Med Image Anal. 2017;36:135–46.PubMedCrossRef
6.
Zurück zum Zitat Pham HHN, Futakuchi M, Bychkov A, Furukawa T, Kuroda K, Fukuoka J. Detection of lung cancer lymph node metastases from whole-slide histopathologic images using a two-step deep learning approach. Am J Pathol. 2019;189:2428–39.PubMedCrossRef Pham HHN, Futakuchi M, Bychkov A, Furukawa T, Kuroda K, Fukuoka J. Detection of lung cancer lymph node metastases from whole-slide histopathologic images using a two-step deep learning approach. Am J Pathol. 2019;189:2428–39.PubMedCrossRef
7.
Zurück zum Zitat Li X, Tang Q, Yu J, Wang Y, Shi Z. Microvascularity detection and quantification in glioma: a novel deep-learning-based framework. Lab Investig. 2019;99:1515–26.PubMedCrossRef Li X, Tang Q, Yu J, Wang Y, Shi Z. Microvascularity detection and quantification in glioma: a novel deep-learning-based framework. Lab Investig. 2019;99:1515–26.PubMedCrossRef
8.
Zurück zum Zitat Ortega S, Halicek M, Fabelo H, Camacho R, Plaza ML, Godtliebsen F, et al. Hyperspectral imaging for the detection of glioblastoma tumor cells in H&E slides using convolutional neural networks. Sensors (Basel). 2020;20:1911.CrossRef Ortega S, Halicek M, Fabelo H, Camacho R, Plaza ML, Godtliebsen F, et al. Hyperspectral imaging for the detection of glioblastoma tumor cells in H&E slides using convolutional neural networks. Sensors (Basel). 2020;20:1911.CrossRef
9.
Zurück zum Zitat Jansen I, Lucas M, Bosschieter J, de Boer OJ, Meijer SL, van Leeuwen TG, et al. Automated detection and grading of non-muscle-invasive urothelial cell carcinoma of the bladder. Am J Pathol. 2020;190:1483–90.PubMedCrossRef Jansen I, Lucas M, Bosschieter J, de Boer OJ, Meijer SL, van Leeuwen TG, et al. Automated detection and grading of non-muscle-invasive urothelial cell carcinoma of the bladder. Am J Pathol. 2020;190:1483–90.PubMedCrossRef
10.
Zurück zum Zitat Hekler A, Utikal JS, Enk AH, Berking C, Klode J, Schadendorf D, et al. Pathologist-level classification of histopathological melanoma images with deep neural networks. Eur J Cancer. 2019;115:79–83.PubMedCrossRef Hekler A, Utikal JS, Enk AH, Berking C, Klode J, Schadendorf D, et al. Pathologist-level classification of histopathological melanoma images with deep neural networks. Eur J Cancer. 2019;115:79–83.PubMedCrossRef
11.
Zurück zum Zitat Ambrosini P, Hollemans E, Kweldam CF, Leenders GJLHV, Stallinga S, Vos F. Automated detection of cribriform growth patterns in prostate histology images. Sci Rep. 2020;10:14904.PubMedPubMedCentralCrossRef Ambrosini P, Hollemans E, Kweldam CF, Leenders GJLHV, Stallinga S, Vos F. Automated detection of cribriform growth patterns in prostate histology images. Sci Rep. 2020;10:14904.PubMedPubMedCentralCrossRef
12.
Zurück zum Zitat Yao J, Zhu X, Jonnagaddala J, Hawkins N, Huang J. Whole slide images based cancer survival prediction using attention guided deep multiple instance learning networks. Med Image Anal. 2020;65:101789.PubMedCrossRef Yao J, Zhu X, Jonnagaddala J, Hawkins N, Huang J. Whole slide images based cancer survival prediction using attention guided deep multiple instance learning networks. Med Image Anal. 2020;65:101789.PubMedCrossRef
13.
Zurück zum Zitat Echle A, Grabsch HI, Quirke P, van den Brandt PA, West NP, Hutchins GGA, et al. Clinical-grade detection of microsatellite instability in colorectal tumors by deep learning. Gastroenterology. 2020;159:1406–16.e11.PubMedCrossRef Echle A, Grabsch HI, Quirke P, van den Brandt PA, West NP, Hutchins GGA, et al. Clinical-grade detection of microsatellite instability in colorectal tumors by deep learning. Gastroenterology. 2020;159:1406–16.e11.PubMedCrossRef
14.
Zurück zum Zitat Sha L, Osinski BL, Ho IY, Tan TL, Willis C, Weiss H, et al. Multi-field-of-view deep learning model predicts nonsmall cell lung cancer programmed death-ligand 1 status from whole-slide hematoxylin and eosin images. J Pathol Inform. 2019;10:24.PubMedPubMedCentralCrossRef Sha L, Osinski BL, Ho IY, Tan TL, Willis C, Weiss H, et al. Multi-field-of-view deep learning model predicts nonsmall cell lung cancer programmed death-ligand 1 status from whole-slide hematoxylin and eosin images. J Pathol Inform. 2019;10:24.PubMedPubMedCentralCrossRef
15.
Zurück zum Zitat Wang S, Rong R, Yang DM, Fujimoto J, Yan S, Cai L, et al. Computational staining of pathology images to study the tumor microenvironment in lung cancer. Cancer Res. 2020;80:2056–66.PubMedPubMedCentralCrossRef Wang S, Rong R, Yang DM, Fujimoto J, Yan S, Cai L, et al. Computational staining of pathology images to study the tumor microenvironment in lung cancer. Cancer Res. 2020;80:2056–66.PubMedPubMedCentralCrossRef
16.
Zurück zum Zitat Coudray N, Ocampo PS, Sakellaropoulos T, Narula N, Snuderl M, Fenyö D, et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat Med. 2018;24:1559–67.PubMedCrossRef Coudray N, Ocampo PS, Sakellaropoulos T, Narula N, Snuderl M, Fenyö D, et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat Med. 2018;24:1559–67.PubMedCrossRef
17.
Zurück zum Zitat Yu KH, Wang F, Berry GJ, Ré C, Altman RB, Snyder M, et al. Classifying non-small cell lung cancer types and transcriptomic subtypes using convolutional neural networks. J Am Med Inform Assoc. 2020;27:757–69.PubMedCrossRefPubMedCentral Yu KH, Wang F, Berry GJ, Ré C, Altman RB, Snyder M, et al. Classifying non-small cell lung cancer types and transcriptomic subtypes using convolutional neural networks. J Am Med Inform Assoc. 2020;27:757–69.PubMedCrossRefPubMedCentral
18.
Zurück zum Zitat Gertych A, Swiderska-Chadaj Z, Ma Z, Ing N, Markiewicz T, Cierniak S, et al. Convolutional neural networks can accurately distinguish four histologic growth patterns of lung adenocarcinoma in digital slides. Sci Rep. 2019;9:1483.PubMedPubMedCentralCrossRef Gertych A, Swiderska-Chadaj Z, Ma Z, Ing N, Markiewicz T, Cierniak S, et al. Convolutional neural networks can accurately distinguish four histologic growth patterns of lung adenocarcinoma in digital slides. Sci Rep. 2019;9:1483.PubMedPubMedCentralCrossRef
19.
Zurück zum Zitat Wei JW, Tafe LJ, Linnik YA, Vaickus LJ, Tomita N, Hassanpour S. Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks. Sci Rep. 2019;9:3358.PubMedPubMedCentralCrossRef Wei JW, Tafe LJ, Linnik YA, Vaickus LJ, Tomita N, Hassanpour S. Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks. Sci Rep. 2019;9:3358.PubMedPubMedCentralCrossRef
20.
Zurück zum Zitat Kriegsmann M, Haag C, Weis CA, Steinbuss G, Warth A, Zgorzelski C, et al. Deep learning for the classification of small-cell and non-small-cell lung cancer. Cancers (Basel). 2020;12:1604.CrossRef Kriegsmann M, Haag C, Weis CA, Steinbuss G, Warth A, Zgorzelski C, et al. Deep learning for the classification of small-cell and non-small-cell lung cancer. Cancers (Basel). 2020;12:1604.CrossRef
21.
Zurück zum Zitat Wang X, Chen H, Gan C, Lin H, Dou Q, Tsougenis E, et al. Weakly supervised deep learning for whole slide lung cancer image analysis. IEEE T Cybern. 2020;50:3950–62.CrossRef Wang X, Chen H, Gan C, Lin H, Dou Q, Tsougenis E, et al. Weakly supervised deep learning for whole slide lung cancer image analysis. IEEE T Cybern. 2020;50:3950–62.CrossRef
22.
Zurück zum Zitat Bankhead P, Loughrey MB, Fernández JA, Dombrowski Y, McArt DG, Dunne PD, et al. QuPath: open source software for digital pathology image analysis. Sci Rep. 2017;7:16878.PubMedPubMedCentralCrossRef Bankhead P, Loughrey MB, Fernández JA, Dombrowski Y, McArt DG, Dunne PD, et al. QuPath: open source software for digital pathology image analysis. Sci Rep. 2017;7:16878.PubMedPubMedCentralCrossRef
23.
Zurück zum Zitat Senaras C, Niazi MKK, Lozanski G, Gurcan MN. DeepFocus: detection of out-of-focus regions in whole slide digital images using deep learning. PLoS One. 2018;13:e205387.CrossRef Senaras C, Niazi MKK, Lozanski G, Gurcan MN. DeepFocus: detection of out-of-focus regions in whole slide digital images using deep learning. PLoS One. 2018;13:e205387.CrossRef
24.
Zurück zum Zitat Wang S, Wang T, Yang L, Yang DM, Fujimoto J, Yi F, et al. ConvPath: a software tool for lung adenocarcinoma digital pathological image analysis aided by a convolutional neural network. Ebiomedicine. 2019;50:103–10.PubMedPubMedCentralCrossRef Wang S, Wang T, Yang L, Yang DM, Fujimoto J, Yi F, et al. ConvPath: a software tool for lung adenocarcinoma digital pathological image analysis aided by a convolutional neural network. Ebiomedicine. 2019;50:103–10.PubMedPubMedCentralCrossRef
25.
Zurück zum Zitat Janowczyk A, Zuo R, Gilmore H, Feldman M, Madabhushi A. Histoqc: an open-source quality control tool for digital pathology slides. JCO Clin Cancer Inform. 2019;3:1–7.PubMedCrossRef Janowczyk A, Zuo R, Gilmore H, Feldman M, Madabhushi A. Histoqc: an open-source quality control tool for digital pathology slides. JCO Clin Cancer Inform. 2019;3:1–7.PubMedCrossRef
26.
Zurück zum Zitat Zheng Y, Jiang Z, Zhang H, Xie F, Shi J, Xue C. Adaptive color deconvolution for histological WSI normalization. Comput Methods Prog Biomed. 2019;170:107–20.CrossRef Zheng Y, Jiang Z, Zhang H, Xie F, Shi J, Xue C. Adaptive color deconvolution for histological WSI normalization. Comput Methods Prog Biomed. 2019;170:107–20.CrossRef
27.
Zurück zum Zitat Rolston KVI, Rodriguez S, Dholakia N, Whimbey E, Raad I. Pulmonary infections mimicking cancer: a retrospective, three-year review. Support Care Cancer. 1997;5:90–3.PubMedCrossRef Rolston KVI, Rodriguez S, Dholakia N, Whimbey E, Raad I. Pulmonary infections mimicking cancer: a retrospective, three-year review. Support Care Cancer. 1997;5:90–3.PubMedCrossRef
28.
Zurück zum Zitat Kohno N, Ikezoe J, Johkoh T, Takeuchi N, Tomiyama N, Kido S, et al. Focal organizing pneumonia: CT appearance. Radiology. 1993;189:119–23.PubMedCrossRef Kohno N, Ikezoe J, Johkoh T, Takeuchi N, Tomiyama N, Kido S, et al. Focal organizing pneumonia: CT appearance. Radiology. 1993;189:119–23.PubMedCrossRef
29.
Zurück zum Zitat Chen SW, Price J. Focal organizing pneumonia mimicking small peripheral lung adenocarcinoma on CT scans. Australas Radiol. 1998;42:360–3.PubMedCrossRef Chen SW, Price J. Focal organizing pneumonia mimicking small peripheral lung adenocarcinoma on CT scans. Australas Radiol. 1998;42:360–3.PubMedCrossRef
32.
Zurück zum Zitat Liu D, Yu J. Otsu method and k-means. In: 2009 ninth international conference on hybrid intelligent systems. Shenyang: Conference; 2009. p. 344–9.CrossRef Liu D, Yu J. Otsu method and k-means. In: 2009 ninth international conference on hybrid intelligent systems. Shenyang: Conference; 2009. p. 344–9.CrossRef
33.
Zurück zum Zitat Deng J, Dong W, Socher R, Li L, Li K, Li FF. ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. Miami: Conference; 2009. p. 248–55.CrossRef Deng J, Dong W, Socher R, Li L, Li K, Li FF. ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. Miami: Conference; 2009. p. 248–55.CrossRef
35.
Zurück zum Zitat Arellano AM, Dai W, Wang S, Jiang X, Ohno-Machado L. Privacy policy and technology in biomedical data science. Annu Rev Biomed Data Sci. 2018;1:115–29.PubMedPubMedCentralCrossRef Arellano AM, Dai W, Wang S, Jiang X, Ohno-Machado L. Privacy policy and technology in biomedical data science. Annu Rev Biomed Data Sci. 2018;1:115–29.PubMedPubMedCentralCrossRef
37.
Zurück zum Zitat Swami A, Jain R. Scikit-learn: machine learning in Python. J Mach Learn Res. 2013;12:2825–30. Swami A, Jain R. Scikit-learn: machine learning in Python. J Mach Learn Res. 2013;12:2825–30.
38.
Zurück zum Zitat Efron B. Bootstrap methods: another look at the jackknife. Ann Stats. 1979;7:1–26.CrossRef Efron B. Bootstrap methods: another look at the jackknife. Ann Stats. 1979;7:1–26.CrossRef
39.
Zurück zum Zitat Eliasziw M, Young SL, Woodbury MG, Fryday-Field K. Statistical methodology for the concurrent assessment of interrater and intrarater reliability: using goniometric measurements as an example. Phys Ther. 1994;74:777–88.PubMedCrossRef Eliasziw M, Young SL, Woodbury MG, Fryday-Field K. Statistical methodology for the concurrent assessment of interrater and intrarater reliability: using goniometric measurements as an example. Phys Ther. 1994;74:777–88.PubMedCrossRef
40.
Zurück zum Zitat Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420–8.PubMedCrossRef Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420–8.PubMedCrossRef
41.
Goode A, Gilbert B, Harkes J, Jukic D, Satyanarayanan M. OpenSlide: a vendor-neutral software foundation for digital pathology. J Pathol Inform. 2013;4:27.
42.
Culjak I, Abram D, Pribanic T, Dzapo H, Cifrek M. A brief introduction to OpenCV. In: 2012 Proceedings of the 35th International Convention MIPRO. Opatija: Conference; 2012. p. 1725–30.
43.
Ketkar N. Introduction to PyTorch. In: Ketkar N, editor. Deep learning with Python: a hands-on introduction. Berkeley: Apress; 2017. p. 195–208.
44.
Suwabe K, Suzuki G, Takahashi H, Katsuhiro S, Makoto E, Kentaro Y, et al. Separated transcriptomes of male gametophyte and tapetum in rice: validity of a laser microdissection (LM) microarray. Plant Cell Physiol. 2008;49:1407–16.
45.
Guo J, Li B. The application of medical artificial intelligence technology in rural areas of developing countries. Health Equity. 2018;2:174–81.
46.
Govindan R, Page N, Morgensztern D, Read W, Tierney R, Vlahiotis A, et al. Changing epidemiology of small-cell lung cancer in the United States over the last 30 years: analysis of the surveillance, epidemiologic, and end results database. J Clin Oncol. 2006;24:4539–44.
47.
Hou L, Samaras D, Kurc TM, Gao Y, Davis JE, Saltz JH. Patch-based convolutional neural network for whole slide tissue image classification. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: Conference; 2016. p. 2424–33.
48.
Bychkov D, Linder N, Turkki R, Nordling S, Kovanen PE, Verrill C, et al. Deep learning based tissue analysis predicts outcome in colorectal cancer. Sci Rep. 2018;8:3395.
49.
Campanella G, Hanna MG, Geneslaw L, Miraflor A, Werneck Krauss Silva V, Busam KJ, et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med. 2019;25:1301–9.
Metadata
Title
Deep learning-based six-type classifier for lung cancer and mimics from histopathological whole slide images: a retrospective study
Authors
Huan Yang
Lili Chen
Zhiqiang Cheng
Minglei Yang
Jianbo Wang
Chenghao Lin
Yuefeng Wang
Leilei Huang
Yangshan Chen
Sui Peng
Zunfu Ke
Weizhong Li
Publication date
1 December 2021
Publisher
BioMed Central
Published in
BMC Medicine / Issue 1/2021
Electronic ISSN: 1741-7015
DOI
https://doi.org/10.1186/s12916-021-01953-2
