Background
Immunohistochemistry (IHC) or in situ hybridization (ISH) assays are well-established methods used to distinguish subtypes in breast cancer (BC) based on hormone receptor statuses [1]. Increasingly, RNA-based signature assays such as MammaPrint, Oncotype DX, and Prediction Analysis of Microarray 50 (PAM50) [2] are being employed as supplementary prognostic indicators because studies have demonstrated more significant differential survival between the subtypes they identify than between groups defined by standard clinicopathological factors [3–5]. In particular, PAM50 intrinsic molecular subtyping, as part of the NanoString Prosigna [6] and the Agendia BluePrint panel [7], is becoming more widely used in early-stage breast cancers to determine the likelihood of responding to chemotherapy. The PAM50-defined intrinsic molecular subtype (IMS) classifications include Luminal A (LumA), Luminal B (LumB), HER2-enriched (HER2), Basal-like (Basal), and Normal-like (Normal); while there is some correlation between receptor status and IMS, the latter is determined by consideration of gene expression beyond receptors. The molecular signature-based tests are not, however, as ubiquitously employed as IHC-based receptor subtyping, in part due to their high cost, extended processing times, and requirement for appropriate tissue samples. Thus, a practical, cost-effective method for tumor classification that goes beyond receptor subtyping, approximates PAM50 subtyping, and utilizes readily available samples could be of great utility.
Unlike the samples needed for molecular signature assays, hematoxylin and eosin (H&E)-stained biopsy slides are routinely collected during pathological examination and are often digitally recorded as whole-slide images (WSIs) [8].
Machine learning approaches can extract knowledge from WSIs beyond that of which a human is capable, as evidenced by the many computer-assisted diagnosis (CAD) software solutions created to augment pathological inspection workflows [8]. It has been demonstrated previously that even genetic subtyping can be approximated using WSIs as input to relatively simple machine learning algorithms [9].
Deep learning methods are an emerging set of influential machine learning technologies well suited to these image-based classification tasks [10]. Recent advances in both computational power and convolutional network architectures have greatly increased the applicability of these techniques to several new domains in biology, including omics analysis, biomedical signal processing, and biomedical imaging [11]. Specifically, deep learning has greatly improved detection of regions of interest in BC WSIs [12], and impressive progress has been made in applying deep learning to BC diagnosis from images [13–15].
Of particular interest in WSI analysis is the use of multiscale patch representations that allow concurrent use of high-zoom patches that capture cellular level information with lower-zoom patches that capture global interdependence of tissue structures [16–18]. Bejnordi et al. used multiscale patch representation of WSIs to build highly accurate context-aware stacked convolutional neural networks (CNNs) for distinguishing between invasive ductal carcinomas (IDC) and benign ductal carcinoma in situ (DCIS) [19]. Similarly, Liu et al. used this same approach to accurately detect whether biopsy samples from nearby lymph node tissue were positive for metastases [20].
While use of multiscale patch representations may increase performance in WSI-based classification tasks, the computational complexity of training on all possible multiscale patches from gigapixel WSIs is substantial. As such, previous studies have employed strategies that limit the analyzed patches to a subset of the total image. For example, in a study of subtypes in BC, Verma et al. used a minimum filter on the blue–yellow channel at × 20 magnification to select patches rich in epithelial cells [21]. Similarly, in a study of non-small cell lung cancer WSIs, Yu et al. successfully used only the top ten cell-dense 1000 × 1000 pixel (250 × 250 μm) patches at × 40 magnification. However, both of these strategies leveraged tissue-specific knowledge of cell morphology in their respective indications [22]. Generalizable methods for focusing on information-rich image patches are an area of ongoing research.
Here, we present a method for minimizing manual work required to identify cancer-rich patches among all multiscale patches in H&E-stained WSIs that can be generalized to any indication. A minimal number of such cancer-rich WSI patches were then used to classify tumors into IMS, i.e., PAM50 WSI-based subtypes.
Similarly to the method presented here, Couture et al. [23] recently applied deep learning to image analysis to predict BC grade, ER status, and both histologic and intrinsic subtype when modeled as binary classifiers (i.e., Basal-like vs. non-Basal-like) and achieved > 75% accuracy, supporting development of such classifiers. They used 1 mm cores from pathologist-marked areas (1–4 per WSI) for tissue microarray (TMA) construction, and the authors noted that cores taken from a single slide often classify as different intrinsic subtypes, which may be evidence of heterogeneity. However, characterizing the extent of intrinsic subtype heterogeneity from TMAs would be extremely difficult even with multiple small cores from a single WSI.
One distinct advantage of the patch-based WSI-based IMS classifier described here is retention of the ability to observe intratumoral heterogeneity directly without resorting to numerical deconvolution methods. We leveraged this patch-based system to identify tumors presenting at least two molecular subtypes within the same tissue section, and support these cases as mixed populations using independent data including overall survival. Others have previously used image-based measures of heterogeneity as prognostic biomarkers [24], but to our knowledge, this is the first study of prognostic intrinsic subtype heterogeneity identified in diagnostic WSIs.
Methods
Constructing multiscale patch representations
All diagnostic WSIs of H&E-stained sections from formalin-fixed paraffin-embedded (FFPE) blocks collected from 1097 patients with invasive BC were obtained from The Cancer Genome Atlas (TCGA) data sources [25], resulting in a collection of 1142 diagnostic WSIs. WSIs were tiled into 1600 × 1600 pixel (800 × 800 μm) patches at the × 20 zoom level. All 1600 × 1600 pixel patches were filtered for a minimum color variance to eliminate empty (background) patches from further processing. Each 1600 × 1600 pixel × 20 patch was converted into 400 × 400 pixel patches at × 5, × 10, and × 20 magnification scales centered on the same point by down-sampling and cropping to the center 400 × 400 pixels. Next, a deep CNN was used to transform 2D color patches into classifiable 1D descriptive vectors as follows: 2D patches were input into a version of the Inception v3 network (Google) [26] pre-trained on the ImageNet database of images to classify a wide variety of objects. The representations at the final layer of the network (the logits layer) were then extracted. This process maps each 400 × 400 pixel color patch into a highly descriptive vector with 2048 dimensions at each zoom level. Principal component analysis (PCA) was used to reduce dimensions while retaining > 96% variance. Finally, vectors for all three zoom levels were concatenated into one multiscale patch representation.
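The crop geometry implied by the tiling step can be worked out with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the function and constant names are illustrative, and it assumes that down-sampling is by an integer factor relative to the × 20 tile and that the crop is taken symmetrically about the tile center.

```python
# Sketch of the multiscale crop geometry. All names are illustrative
# assumptions; only the sizes (1600 px tile, 400 px output, x5/x10/x20)
# come from the text.
BASE_MAGNIFICATION = 20   # tiles are cut at the x20 zoom level
TILE_SIZE = 1600          # pixels per tile side at x20
OUT_SIZE = 400            # pixels per side of every output patch


def multiscale_regions(tile_size=TILE_SIZE, out_size=OUT_SIZE,
                       magnifications=(5, 10, 20)):
    """For each magnification, return the down-sampling factor and the
    square region of the x20 tile (start/stop, in x20 pixels) that ends up
    inside the out_size x out_size patch after down-sampling and cropping."""
    regions = {}
    for mag in magnifications:
        factor = BASE_MAGNIFICATION // mag   # 4, 2, 1 for x5, x10, x20
        covered = out_size * factor          # side length in x20 pixels
        if covered > tile_size:
            raise ValueError("tile too small for this magnification")
        start = (tile_size - covered) // 2   # symmetric center crop
        regions[mag] = {"factor": factor, "start": start,
                        "stop": start + covered}
    return regions


regions = multiscale_regions()
for mag, r in sorted(regions.items()):
    print(f"x{mag}: downsample by {r['factor']}, "
          f"covers x20 pixels {r['start']}..{r['stop']}")
```

Under these assumptions, and using the stated 800 μm per 1600 pixels (0.5 μm per pixel at × 20), the × 5 view spans the full 800 × 800 μm tile, the × 10 view the central 400 × 400 μm, and the × 20 view the central 200 × 200 μm, all sharing one center point.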
Enriching for cancer patches
For training, 238,728 multiscale patch representations were randomly selected. These representations were grouped using k-means clustering; the number of clusters was determined empirically. Clusters with sufficient cellularity were investigated further. A pathologist evaluated 336 representations for tumor content. The clusters were assessed for cancer enrichment by observing the percentage of patches within said clusters that were also positive for tumor content. For each WSI, up to 80 patches that fell within the cancer-rich clusters were used for further analysis. If a WSI contained more than 80 cancer-rich patches, only 80 were selected at random.
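The enrichment check and the per-slide cap described above can be sketched as follows. This is an illustration under stated assumptions rather than the published pipeline: cluster assignments are taken as given (the k-means step is not reproduced), the toy data are invented, and `ENRICHMENT_CUTOFF` is a hypothetical value because the text does not report a numeric cutoff for calling a cluster cancer-rich.

```python
import random
from collections import defaultdict

MAX_PATCHES_PER_WSI = 80  # cap stated in the text


def cluster_enrichment(reviewed):
    """reviewed: iterable of (cluster_id, is_tumor) pairs from pathologist
    review. Returns the fraction of reviewed patches per cluster that were
    positive for tumor content."""
    counts = defaultdict(lambda: [0, 0])   # cluster_id -> [tumor, total]
    for cluster_id, is_tumor in reviewed:
        counts[cluster_id][0] += int(is_tumor)
        counts[cluster_id][1] += 1
    return {c: tumor / total for c, (tumor, total) in counts.items()}


def select_patches(patch_clusters, cancer_rich, rng, cap=MAX_PATCHES_PER_WSI):
    """patch_clusters: {patch_id: cluster_id} for one WSI. Keeps patches in
    cancer-rich clusters and draws `cap` of them at random if there are more."""
    eligible = sorted(p for p, c in patch_clusters.items() if c in cancer_rich)
    if len(eligible) <= cap:
        return eligible
    return sorted(rng.sample(eligible, cap))


# Toy review results (invented for illustration only).
reviewed = [(0, True), (0, True), (0, False), (1, False), (1, False), (2, True)]
enrichment = cluster_enrichment(reviewed)

ENRICHMENT_CUTOFF = 0.6   # hypothetical; no numeric cutoff is given in the text
cancer_rich = {c for c, frac in enrichment.items() if frac >= ENRICHMENT_CUTOFF}

# Toy slide with 300 patches spread evenly over three clusters.
wsi = {f"patch_{i}": i % 3 for i in range(300)}
selected = select_patches(wsi, cancer_rich, random.Random(0))
print(len(selected), "patches kept from clusters", sorted(cancer_rich))
```

Passing an explicit `random.Random` instance keeps the random draw reproducible, which is useful when the same patch subset must be reused across experiments.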
PAM50 classification
Both PAM50 expression-based molecular subtyping and survival data were available for 789 out of 1097 BC patients used for our WSI-based IMS classifier development; the subtypes comprised 50.4% LumA, 21.7% LumB, 16.9% Basal, 8.1% HER2, and 2.9% Normal.
Because a large number of example patches are generated from each patient, it is important to ensure that the majority of patches from each training patient are of one subtype. To remove likely heterogeneous patients from the training pool, so that training used only the most strictly defined subtypes, patients were assessed for how closely their gene expression, as determined by RNAseq, associated with that of other patients from their assigned subtype. Gene expression values (as RSEM transcripts per million values) for the 50 PAM50 genes were obtained from TCGA sources (https://gdac.broadinstitute.org). These expression profiles were used to cluster all 789 patients in PCA space; PCA is an unsupervised analytical method for gene expression data that provides a picture of the overall distribution of the analyzed dataset [27]. Patients were deemed low-confidence if the Euclidean distance to their assigned subtype centroid was > 33% larger than the distance to the nearest subtype centroid (Additional file 1: Figure S1). A total of 104 patients were assigned the low-confidence (i.e., likely heterogeneous) label; elimination of such cases from training is a method used by others [23].
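The low-confidence rule can be expressed as a single comparison of distances. The sketch below is a hedged illustration: the function name, the two-dimensional toy centroids, and the example points are assumptions made for readability, whereas the source works in the PCA space of the 50 PAM50 genes with centroids for every subtype.

```python
import math


def is_low_confidence(point, assigned, centroids, margin=0.33):
    """Flag a patient whose distance to the assigned subtype centroid is more
    than `margin` (33% in the text) larger than the distance to the nearest
    subtype centroid. `centroids` maps subtype name -> coordinate tuple."""
    d_assigned = math.dist(point, centroids[assigned])
    d_nearest = min(math.dist(point, c) for c in centroids.values())
    return d_assigned > (1.0 + margin) * d_nearest


# Toy 2-D centroids, invented for illustration.
centroids = {"LumA": (0.0, 0.0), "Basal": (10.0, 0.0)}

near_own = is_low_confidence((1.0, 0.0), "LumA", centroids)     # own centroid is nearest
far_from_own = is_low_confidence((8.0, 0.0), "LumA", centroids)  # 8 vs 2: well past the margin
borderline = is_low_confidence((4.5, 0.0), "Basal", centroids)   # 5.5 vs 4.5: within the margin
print(near_own, far_from_own, borderline)
```

The third example shows why a margin is used: a patient can sit slightly closer to another centroid and still be retained, so only clearly ambiguous cases are excluded from training.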
The normal-like subtype (tumor tissue with gene expression similar to normal breast tissue) was deemed insufficiently represented for multiclass classification (n = 23) and dropped from training, resulting in a 4-way classification task. The 766 non-normal-like patients were split into training (n = 443; 58%) and validation (n = 323; 42%) datasets. All 101 non-normal-like low-confidence patients were assigned to the validation set. Patients in the training dataset were further split into 5 pairs of training and testing datasets (i.e., fivefold cross-validation). Within each fold, 60 multiscale cancer-enriched patches were selected per training WSI and used to train a multiclass one-vs-rest support vector machine (SVM) with radial basis function (RBF) kernel. Trained models were used to classify 80 multiscale cancer-enriched patches from each testing WSI, then aggregated via majority voting to classify at the tumor level. In the few cases where a patient had multiple diagnostic slides, a voting mechanism was used to assign the patient’s overall IMS label. A final multiclass one-vs-rest SVM with RBF kernel classifier was trained on all 443 training WSIs and analyzed for subtyping accuracy in the validation set of unseen patients.
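The aggregation from patch-level predictions to a tumor-level label can be sketched as a plain majority vote. The SVM itself is not reproduced here; the patch labels below are invented, the alphabetical tie-break is an assumption (the text does not say how ties were handled), and voting over slide-level labels is only one possible reading of the patient-level "voting mechanism".

```python
from collections import Counter


def patch_fractions(patch_labels):
    """Fraction of patches assigned to each subtype for one slide."""
    counts = Counter(patch_labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def majority_vote(labels):
    """Most frequent label; ties are broken alphabetically (an assumption,
    chosen only to make the result deterministic)."""
    counts = Counter(labels)
    top = max(counts.values())
    return sorted(label for label, n in counts.items() if n == top)[0]


# Toy patch-level predictions for two slides from one patient (80 patches each).
slide_a = ["LumA"] * 45 + ["LumB"] * 20 + ["Basal"] * 15
slide_b = ["LumA"] * 50 + ["Basal"] * 30

slide_labels = [majority_vote(slide_a), majority_vote(slide_b)]
patient_label = majority_vote(slide_labels)   # assumed patient-level vote
fractions_a = patch_fractions(slide_a)
print(patient_label, fractions_a)
```

Keeping the per-subtype fractions alongside the voted label preserves the patch-level mixture that a single label would otherwise hide.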
Detecting subtype heterogeneity
Analysis of heterogeneity focused on detecting the two subtypes with the most dissimilar survival characteristics (LumA and Basal). To label patients as LumA by image (LumA_IMG), a threshold for the minimum percentage of patches classified as LumA was determined using Youden’s analysis [28] in the training set as follows: patient tissue was assigned “LumA” or “not LumA” labels according to PAM50 molecular subtyping, then all percentages of LumA patches were considered and the threshold that maximizes Youden’s index, the difference between the true positive rate (TPR) and the false positive rate (FPR), was selected. A threshold for calling a WSI Basal by image (Basal_IMG) was found similarly. Patients were categorized as heterogeneous (HET), LumA_IMG, or Basal_IMG using these pre-trained thresholds. Heterogeneity was supported by Mann-Whitney U tests of HR expression and by Kaplan-Meier survival curves with Cox proportional hazard analysis.
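Threshold selection of this kind can be sketched as an exhaustive scan over the observed patch percentages. The example below assumes the conventional definition of Youden's index, J = TPR − FPR, and assumes that a patient is called positive when the patch fraction is at or above the threshold; the data are invented, and the rule that combines the two thresholds into the HET category is not reproduced because the text does not spell it out.

```python
def youden_threshold(fractions, is_positive):
    """Scan every observed fraction as a candidate threshold and return the
    (threshold, J) pair that maximizes J = TPR - FPR. A patient is called
    positive when fraction >= threshold; the lowest maximizing threshold
    is kept when several candidates tie."""
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both positive and negative patients")
    best_t, best_j = None, -1.0
    for t in sorted(set(fractions)):
        tp = sum(1 for f, y in zip(fractions, is_positive) if y and f >= t)
        fp = sum(1 for f, y in zip(fractions, is_positive) if not y and f >= t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy data: fraction of LumA-classified patches per patient, with the
# PAM50 label ("LumA" -> True, "not LumA" -> False). Invented values.
luma_fraction = [0.9, 0.8, 0.7, 0.35, 0.4, 0.3, 0.2, 0.1]
is_luma = [True, True, True, True, False, False, False, False]

threshold, j = youden_threshold(luma_fraction, is_luma)
print(threshold, j)
```

In this toy set two thresholds reach the same index; the scan keeps the lower one, which favors sensitivity. Which tie-break, if any, the original analysis used is not stated.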
Binary classification: Basal vs. non-Basal
Based on the findings from the above initial establishment of the classifier, we sought to increase its prognostic utility by re-defining subtyping as Basal or non-Basal (HER2, LumA, LumB, and Normal). To do this, we used the top 60 multiscale patches from 582 WSIs (92 Basal and 490 non-Basal) to train a binary linear SVM classifier with C = 1.0. The training data had 34,745 multiscale patches (15.84% Basal and 84.16% non-Basal) and resulted in patch-level train accuracy of 90.58% (with sensitivity = 64.72% and specificity = 95.44%).
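The reported patch-level figures can be cross-checked with a short calculation, since overall accuracy is the class-weighted average of sensitivity and specificity. This is a consistency check on the numbers quoted above, not a reproduction of the classifier; the function name is illustrative.

```python
def expected_accuracy(positive_fraction, sensitivity, specificity):
    """Overall accuracy implied by sensitivity, specificity, and the
    fraction of examples in the positive (Basal) class."""
    return (positive_fraction * sensitivity
            + (1.0 - positive_fraction) * specificity)


TOTAL_PATCHES = 34745     # training patches reported in the text
BASAL_FRACTION = 0.1584   # 15.84% Basal

accuracy = expected_accuracy(BASAL_FRACTION, 0.6472, 0.9544)
n_basal = round(TOTAL_PATCHES * BASAL_FRACTION)
print(f"implied accuracy: {accuracy:.2%}, Basal patches: {n_basal}")
```

The implied accuracy is about 90.57%, which agrees with the reported 90.58% to within rounding of the inputs. The same arithmetic gives roughly 5.5K Basal patches, so the high accuracy is driven largely by the majority non-Basal class, which is why sensitivity is the weaker figure.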
In addition, to improve the sensitivity of the Basal vs. non-Basal classifier, we balanced the classes of the training patches and trained another “balanced” linear SVM (C = 1.0) classifier. That is, a set of 5.5K multiscale patches was randomly selected from the Basal and non-Basal classes.
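Class balancing by random selection can be sketched as drawing the same number of patches from each class. The sketch assumes that the 5.5K figure applies to each class separately and that sampling is without replacement; neither detail is stated explicitly, and the toy class sizes are scaled down for illustration.

```python
import random


def balanced_sample(patches, labels, per_class, rng):
    """Draw up to `per_class` patches from every class without replacement
    and return them as shuffled (patch, label) pairs."""
    by_class = {}
    for patch, label in zip(patches, labels):
        by_class.setdefault(label, []).append(patch)
    sampled = []
    for label in sorted(by_class):
        pool = by_class[label]
        k = min(per_class, len(pool))   # never ask for more than exist
        sampled.extend((patch, label) for patch in rng.sample(pool, k))
    rng.shuffle(sampled)
    return sampled


# Toy set with roughly the 16/84 imbalance described in the text.
labels = ["Basal"] * 160 + ["non-Basal"] * 840
patches = [f"patch_{i}" for i in range(len(labels))]

balanced = balanced_sample(patches, labels, per_class=150, rng=random.Random(0))
n_basal_sampled = sum(1 for _, label in balanced if label == "Basal")
print(len(balanced), n_basal_sampled)
```

Down-sampling the majority class in this way trades some specificity for sensitivity, because the classifier no longer benefits from predicting the majority label by default.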
Discussion
Presented here is a system for breast cancer molecular subtype classification using deep learning patch representations of H&E-stained WSIs. Conventionally, such classification is only accomplished using gene expression signatures such as those generated by PAM50; however, the proposed WSI-based IMS classifier methodology presented herein shows promising performance with overall concordance with molecular-based classification of 65.92% and the ability to detect the most aggressive subtype, Basal, with 87% accuracy. Absence of concordance does not necessarily mean the tumor is inaccurately subtyped, as our results on key receptor expression suggest. Rather, application of our methodology may prompt further investigation of subtype accuracy as based on PAM50.
To increase veracity of training examples, we eliminated low-confidence samples; however, even with purified training labels, many of the expression-based characteristics defined by PAM50 genes may not result in visually discernible image features [29], leading in part to the observed reduction in concordance between our image-based classifier and expression-based PAM50 subtyping. Specifically, our analysis shows that this WSI-based IMS classifier is highly sensitive to expression of the key hormone receptors ERα/ESR1 and PR/PGR (see Fig. 3c, Fig. 3d, and Additional file 1: Figure S4). Despite some disparity in subtyping results between expression-based PAM50 subtyping and this image-based analysis that utilizes morphological characteristics, the WSI-based IMS classifier is not inferior to PAM50 in prognostic capability: in fact, in this cohort, the image-based classifier is more prognostic for differential survival between LumA and Basal patients than molecular PAM50 subtyping.
Intratumoral heterogeneity, which is common in breast tumors and especially in triple-negative breast cancer [30], may play a role in reducing concordance between our WSI-based IMS classifier and expression-based subtyping. The methodology presented here summarizes patches into a patient-level classification by majority area, whereas expression profiles are summaries based on total transcript counts. As such, concordance of the deep learning classifier with expression-based subtyping may be improved in the future by increasing weight given to cell-dense or transcriptionally overactive patches.
Many tumor heterogeneity models exist, such as cancer stem cells (CSCs) and the clonal evolution model; recently developed lineage-tracing and cell-ablation methods have furthered understanding of the role of the former in cancer [31]. Figure 4a shows that tumor heterogeneity can occur on a small (~ 100 μm) or large (~ 10 mm) scale; the solid tumor heterogeneity model must take this spatial information into account.
Because of its sensitivity to subclonal diversity, our WSI-based IMS classifier may have novel application as a method for detecting intratumoral heterogeneity. Inspection of tumor biopsy tissues that were misclassified revealed patterns of discordant subtypes at the patch level. Further evidence that these tumors are in fact heterogeneous populations was found in hormone-receptor expression levels and survival characteristics. Specifically, patients with tumors that were classified as LumA subtype but had Basal subclones have poorer survival compared to those with homogeneous LumA tumors. The specific regions identified by this classifier could be further confirmed as molecularly distinct by laser microdissection followed with separate molecular characterization of subclones.
While survival differences between HET and LumA or Basal were not significant, the trend of the HET group having intermediate survival is complementary to image- and expression-based evidence for heterogeneity. One limitation of the TCGA BC cohort is the higher proportion of prospective samples resulting in relatively short follow-up times, which reduces the number of events available to power the Kaplan-Meier analysis. Nonetheless, the intermediate survival of the HET group supports the merit of further studies on the effects of tumor heterogeneity as revealed by the WSI-based method here on survival.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.