Background
Breast cancer is a disease that has been thoroughly profiled on various levels revealing heterogeneity that is manifest at the clinical, histopathological and molecular level. At each level, separation of breast tumors into different groups has been used to identify subgroups of the disease, which assists patient management. The two major groups of breast cancer at the histopathological level are the estrogen receptor (ER)-positive and ER-negative tumors, encompassing all other molecular subgroups. Further, at the gene expression level, five main subgroups have been identified [
1,
2], and combining gene expression with copy number data further refined breast cancer into 10 integrated subgroups with different genomic and transcriptomic profiles and prognosis [
3].
Recently, mutation data coupled to these 10 subgroups showed how functional mutations in the PIK3CA gene were associated with different survival times in ER-positive breast cancer when stratifying by these integrated subgroups [
4]. Integrating classifications extracted from four different levels (mRNA, microRNA (miRNA) expression, DNA copy number and methylation) revealed new insights into the biology and immune profile of pre-invasive and invasive breast cancers [
5], while metabolic analyses have revealed three naturally occurring clusters with distinct metabolic profiles [
6]. Exploring the causes and consequences of breast cancer at a higher level may lead to refined therapeutic strategies.
Tumor development and progression is a dynamic evolutionary process involving genomic and epigenetic aberrations, cellular context, influence from the surrounding environment and patient-specific characteristics. Furthermore, cancer is increasingly being understood as a disease with alterations at the network level where multiple different changes can engender a similar cancer phenotype or outcome [
7]. Integration of molecular data is needed to uncover these alterations in single tumors and further link them across patients to understand the effects on network levels. Also, integrative analyses may generate explanatory power that one data type alone cannot provide [
8]. The long-term goal of this approach is further stratification of patients into subgroups for improved tailored therapy. The information content in integrated analyses is higher than in any of the separate molecular-level studies; however, the availability of all these layers of data from the same patients is often limited.
Using data from five molecular platforms (mRNA expression, protein expression, miRNA expression, DNA copy number and methylation), The Cancer Genome Atlas (TCGA) performed a multiplatform integrative analysis on 348 breast tumors [
9]. The subtypes (clusters) defined from each of the molecular levels were subjected to unsupervised consensus clustering revealing four major patient groups. These “higher-order” subtypes corresponded well with the mRNA expression-defined PAM50 subtypes and as such did not identify new subgroups within the subtypes. The same cluster-of-clusters approach was also applied to the corresponding molecular data from 12 different cancer types [
10], revealing 11 major subtypes. Interestingly, although most of the multiplatform subtypes correlated with tissue of origin, some of the tumor types coalesced into one subtype, while, for example, breast cancer was split into two subtypes and bladder cancer into three different subtypes [
10].
The Oslo2 study is a multicenter study initiated in 2006, in which patients with breast cancer were enrolled from Oslo University Hospital. So far, 2000 patients have been enrolled into the study, and here we present an analysis of the first 355 patients in addition to 70 patients from a similar study performed at Akershus University Hospital. In this study, we integrated seven different classifications extracted from five molecular levels: DNA copy number, mRNA, miRNA and protein expression, and tumor metabolic profiles. The aim was to identify higher-order integrated subtypes. Whenever possible, we used existing clustering schemes, as developed and tested in the literature, including the cluster-of-clusters analysis (COCA) methods previously described [
9,
10]. In this way we provided comparable evidence on how the population-based Oslo2 cohort represents the previously described subtypes and on how these compare to newly identified subtypes.
Methods
The Oslo2 clinical cohort
Oslo2 is a multicenter study in which patients with primary operable breast cancer (cancer tumor stage (cT)1–cT2) were consecutively enrolled at Oslo University Hospital, Norway (including the Radium Hospital and Ullevål Hospital, Vestre Viken and Østfold Hospital in southeast Norway). Patients were included from 2006, at the time of primary surgery after giving written informed consent. Here we present an analysis of the first subset of 355 patients. The Regional Committee for Medical and Health Research Ethics for southeast Norway has approved the study (approval number 1.2006.1607 and 1.2007.1125, 2009/615, 2009/4935).
Experienced breast pathologists macroscopically evaluated the surgical specimens before parts of the tumor were fresh frozen (-80 °C). Patients were followed according to national guidelines for follow up after breast cancer treatment. A fresh-frozen biopsy sample from the primary tumor and peripheral blood samples were collected at the time of surgery. In addition, bone marrow and lymph nodes were collected. Tumor tissue and blood specimens were collected from patients who experienced relapse.
Samples from the Oslo2 study were coupled with tumor tissue collected from 70 patients from a similar study conducted at the Akershus University Hospital, Norway (approval number 429-04148) from 2003 to 2010. In total, tumor tissue from 425 patients was used in the current analyses and the patient cohort is collectively described as the Oslo2 cohort.
Clinical data
Clinical parameters were collected from patient records and from pathology reports. Hormone receptor status for estrogen receptor (ER) and progesterone receptor (PR) was obtained by standard immunohistochemical assessment (IHC). Amplification of the human epidermal growth factor receptor 2 (HER2) gene was assessed by a combination of IHC and chromogenic in situ hybridization (CISH) following standard guidelines. Experienced pathologists assessed tumor size, morphology, histological grade and axillary lymph node involvement as part of standard diagnostic routine. Age, mode of detection, surgical procedure and presence of metastatic disease at the time of diagnosis were collected from hospital records. Following national guidelines for breast cancer, screening for metastases is not standard procedure in asymptomatic patients at diagnosis. Thus, most patients did not undergo magnetic resonance imaging (MRI) or computed tomography (CT) to detect metastases.
Tumor preparation
Biopsies from Oslo University Hospital were taken at the time of surgery and fresh-frozen. The tumor was cut into three pieces. Frozen sections were taken from the flanking pieces facing the middle piece, stained with hematoxylin and eosin and evaluated by a pathologist to estimate the tumor cell percentage. The average tumor cell percentage was 53% (range 0–90%). A tumor tissue sample from the middle piece was taken for high-resolution magic-angle spinning magnetic resonance spectroscopy (HR MAS MRS). Following this, the three tumor pieces were merged, cut into smaller pieces by scalpel, mixed and split into dedicated vials for DNA, RNA and protein isolation. Biopsies from Akershus University Hospital were first placed in RNAlater (Thermo Fisher Scientific, Waltham, MA, USA) overnight before being frozen at -80 °C. The preparation was otherwise performed as described above, but without dedicated vials for HR MAS MRS and protein isolation.
DNA and RNA isolation
DNA isolation was performed using the Maxwell® 16 instrument (Promega, Fitchburg, WI, USA) and the Maxwell® 16 tissue DNA Purification Kit (Promega). DNA was isolated according to the manufacturer’s protocol. In brief, tumor tissue was transferred into the Maxwell cartridge cassettes predispensed with magnetic beads, lysis buffer, and wash buffers of isopropanol and ethanol. The isolation procedure is automated, starting with sample lysis and tissue homogenization, followed by bead-based isolation of DNA and, finally, washing steps. The DNA was eluted in 200–600 μl TE-buffer (pH 8.5). DNA was stored at -20 °C.
Total RNA was isolated by phenol-chloroform extraction using the TRIzol reagent (Invitrogen, Carlsbad, CA, USA) following the manufacturer’s instructions, as described previously [
11]. The NanoDrop spectrophotometer (Thermo Fisher Scientific) was used to assess the concentration and RNA purity by measuring absorbance at different wavelengths. The quality and integrity of the RNA were assessed by chip-based capillary electrophoresis using a 2100 Bioanalyzer instrument (Agilent Technologies, Santa Clara, CA, USA). The resulting average RNA integrity number (RIN) of all samples was 5.6 (range 1.0–9.7).
TP53 sequencing
TP53 mutation analysis of exons 2–11 was performed by Sanger sequencing using the 3730 DNA Analyzer (Applied Biosystems, Life Technologies Corporation, Carlsbad, CA, USA). PCR amplification with the BigDye Direct Cycle Sequencing Kit (Applied Biosystems) was performed using 5 ng tumor DNA, followed by BigDye XTerminator Purification Kit (Applied Biosystems). The sequences were read in SeqScape v.2.7 (Applied Biosystems) by two independent investigators.
PIK3CA mutation detection
Mutations in the PIK3CA gene were detected using a mass spectrometry-based approach in addition to Sanger sequencing. In total, 314 tumors were evaluated for ten known PIK3CA mutations using the Sequenom MassArray MALDI-TOF system (Sequenom, San Diego, CA, USA) as previously described [12]: PIK3CA_C420R_T1258C, PIK3CA_E110K_G328A, PIK3CA_E542KQ_G1624AC, PIK3CA_E545KQ_G1633AC, PIK3CA_G1049R_G3145C, PIK3CA_H1047RL_A3140GT, PIK3CA_K111N_G333C, PIK3CA_N345K_T1035A, PIK3CA_P539R_C1616G and PIK3CA_Q546LPR_A1637TCG.
A subset of the tumor samples (n = 275) was sequenced for detection of mutations in PIK3CA exons 9 and 20. A touchdown PCR with HotStarTaq DNA polymerase (Qiagen, Hilden, Germany) was performed using 10 ng of DNA. The PCR products were visualized on a 1.5% agarose gel, and the products were cleaned with EpMotion 5075 (Eppendorf AG, Hamburg, Germany). For the sequencing reactions, 3 μl of the purified PCR product and BigDye Terminator v1.1 reaction mix were used. Sequencing reactions were performed on an MJ Research Tetrad DNA Engine (MJ Research, Bio-Rad Laboratories Inc., Hercules, CA, USA), and cleaned on Sephadex mini-columns (GE Healthcare Life Sciences, Little Chalfont, UK). Sequencing was performed using a 3730 DNA Analyzer (Applied Biosystems). Mutation scoring was performed in SeqScape v.2.7 (Applied Biosystems) by two independent investigators. For tumor samples evaluated by both approaches, the results were combined by identifying a tumor sample as PIK3CA-mutated if at least one of the methods detected a mutation.
Copy number aberration analysis, segmentation and complex arm aberration index
Tumor DNA was hybridized to Affymetrix SNP 6.0 arrays (Affymetrix, Santa Clara, CA, USA) at Aros Applied Biotechnology (Aarhus, Denmark) following the manufacturer’s recommendations. Tumor samples collected at the Akershus University Hospital were stored on RNAlater. DNA extracted from a majority of these samples did not pass the quality control assessment for single nucleotide polymorphism (SNP) arrays. Affymetrix CEL-files were processed using the PennCNV-Affy library [
13], with the HapMap samples as the reference set [
14]. Correction for GC-related binding bias was performed [
15]. The resulting GC-adjusted LogRs were segmented into regions of constant copy number using the R package “copynumber” [
16] in the Comprehensive R Archive Network (CRAN) [
17]. The complex arm aberration index (CAAI) was computed as described previously [
18]. Note that in the original paper CAAI values were thresholded only at the value 0.5 to produce a dichotomous variable, but here we also distinguished between the number of arms with a CAAI event: zero arms, one arm or at least two arms.
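The three-level CAAI variable described above can be illustrated with a minimal sketch. The function name, the per-arm input dictionary and the strict comparison against the 0.5 threshold are assumptions made for illustration; they are not taken from the original analysis code.

```python
def caai_category(arm_caai, threshold=0.5):
    """Classify a tumor by the number of chromosome arms with a CAAI event.

    arm_caai: dict mapping a chromosome arm (e.g. "17q") to its CAAI value.
    An arm counts as a CAAI event when its value exceeds the threshold;
    the strict ">" comparison is an assumption of this sketch.
    """
    n_events = sum(1 for value in arm_caai.values() if value > threshold)
    if n_events == 0:
        return "zero arms"
    if n_events == 1:
        return "one arm"
    return "at least two arms"


# Hypothetical tumor with one arm above the threshold
print(caai_category({"1p": 0.1, "17q": 0.9}))
```

Collapsing "one arm" and "at least two arms" into a single level recovers the dichotomous variable of the original CAAI publication.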
Gene expression and PAM50 subtypes
mRNA expression was measured using SurePrint G3 Human GE 8x60K one-color microarrays from Agilent (Agilent Technologies) according to the manufacturer’s protocol and using 100 ng of RNA as input for amplification. The array includes 42,405 unique 60-mer probes, targeting 27,958 Entrez genes and 7419 long intergenic non-coding RNAs (lincRNAs). Scanning was performed with Agilent Scanner G2565A, and signals were extracted using Feature Extraction v.10.7.3.1 (Agilent Technologies). Non-uniform spots were excluded and missing data were imputed using local least squares imputation (LLSimpute from the R package “pcaMethods” [
19]). Arrays were log2-transformed, quantile-normalized and hospital-adjusted by subtracting from each probe value the mean probe value of the samples from that same hospital. To have a single expression value per gene per sample, the values corresponding to probes with identical Entrez ID were averaged. A cutoff was applied on the RIN value to exclude samples with an RIN value below 2.5. mRNA expression data have been submitted to the Gene Expression Omnibus (GEO) database [GEO:GSE80999].
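The hospital adjustment and the collapsing of probes to genes can be sketched as follows. This is a simplified illustration in plain Python, assuming log2-transformed and quantile-normalized values as input; the function names and data layout are hypothetical, and the original analysis was performed in R.

```python
from collections import defaultdict
from statistics import mean


def hospital_adjust(values, hospitals):
    """Subtract from each probe value the mean of that probe across
    samples from the same hospital.

    values: dict sample -> list of probe values (same probe order for all)
    hospitals: dict sample -> hospital label
    """
    by_hospital = defaultdict(list)
    for sample, hospital in hospitals.items():
        by_hospital[hospital].append(sample)
    adjusted = {}
    for samples in by_hospital.values():
        n_probes = len(values[samples[0]])
        probe_means = [mean(values[s][j] for s in samples) for j in range(n_probes)]
        for s in samples:
            adjusted[s] = [v - m for v, m in zip(values[s], probe_means)]
    return adjusted


def average_by_gene(probe_values, probe_to_gene):
    """Collapse one sample's probe values to a single value per gene ID."""
    grouped = defaultdict(list)
    for value, gene in zip(probe_values, probe_to_gene):
        grouped[gene].append(value)
    return {gene: mean(vals) for gene, vals in grouped.items()}


# Toy data: two samples from hospital H1, one from H2
toy = {"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [10.0, 10.0]}
print(hospital_adjust(toy, {"a": "H1", "b": "H1", "c": "H2"}))
```

Note that a hospital contributing a single sample is centered to zero for every probe, so this adjustment is only informative when each hospital contributes several samples.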
The PAM50 subtype algorithm [
20] was used to assign a gene expression subtype label to each sample. For each sample, a 50-dimensional vector was found by extracting the gene expression values for the 50 genes in the PAM50 gene list. A 50-dimensional centroid vector was then calculated by averaging the gene expression vectors for all the ER-positive samples, and likewise a 50-dimensional centroid vector was calculated by averaging the gene expression vectors for all the ER-negative samples. A combined centroid was then defined as a weighted average of the ER-negative and the ER-positive centroids, the weights being c and 1-c, where c is the proportion of ER-negative samples in the original dataset (the training data set) used in Parker et al. [
20] to define the PAM50 centroids. The samples to be subtyped were then centered by aligning the combined centroid with the centroid of the training dataset. This was achieved by subtracting from the expression vector of each sample the combined centroid and then adding the centroid of the training dataset. Finally, one of the subtype labels luminal A, luminal B, basal, normal-like or HER2 was assigned to each sample by calculating the Spearman correlation between the centered expression vector of the sample and each of the five PAM50 centroids, and selecting the one with the strongest correlation.
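The centering and assignment steps described above can be sketched in a few lines. This is an illustration only, using toy vectors with four genes instead of 50; the function names are hypothetical, and the real PAM50 centroids and the training-set centroid from Parker et al. are assumed to be supplied by the user.

```python
from statistics import mean


def ranks(x):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def combined_centroid(er_pos, er_neg, c):
    """Weighted average of the ER-negative (weight c) and ER-positive
    (weight 1 - c) centroids; c is the ER-negative proportion in the training set."""
    pos = [mean(col) for col in zip(*er_pos)]
    neg = [mean(col) for col in zip(*er_neg)]
    return [c * n + (1 - c) * p for n, p in zip(neg, pos)]


def assign_subtype(sample, combined, training_centroid, subtype_centroids):
    """Center the sample, then pick the subtype centroid with the highest
    Spearman correlation."""
    centered = [s - m + t for s, m, t in zip(sample, combined, training_centroid)]
    return max(subtype_centroids, key=lambda k: spearman(centered, subtype_centroids[k]))
```

In this sketch the combined centroid is computed once per cohort and then reused for every sample, so the subtype call for a given tumor depends on the ER composition of the cohort only through the weight c.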
Protein expression and subtypes
Protein levels were determined using reverse phase protein array (RPPA), a platform whereby single protein levels can be measured across a series of samples simultaneously [
21]. Altogether 148 primary antibodies were used to detect cancer-related proteins. Frozen tumor samples from patients with sufficient material (from Oslo University Hospital) were lysed by homogenization in lysis buffer containing proteinase inhibitors and phosphatase inhibitors. The tumor lysates were diluted to 1.33 mg/ml concentration as assessed by bicinchoninic acid assay and boiled in 1% SDS and 2-mercaptoethanol.
Supernatants were manually diluted in five serial twofold dilutions with lysis buffer. The samples were spotted onto and immobilized on nitrocellulose-coated FAST slides. The slides were probed with 105 primary highly validated antibodies in the appropriate dilution. The signal intensity was captured by a biotin-conjugated secondary antibody and was amplified using the DakoCytomation-catalyzed system (Dako, Glostrup, Denmark). Slides were scanned, analyzed and quantitated using MicroVigene software (VigeneTech Inc., Carlisle, MA, USA) to generate spot signal intensities. These were then processed by the R package “SuperCurve” (version 1.01), available at
http://bioinformatics.mdanderson.org/OOMPA [
22]. The protein concentrations were derived from the supercurve for each sample by curve fitting, log2-transformed, and the relative concentrations were normalized by median centering of the samples for each of the antibodies.
RPPA subtypes were obtained using non-negative matrix factorization as done in [
9]. Consensus clustering of the samples was performed with an option for four or five groups using Pearson correlation coefficient-based distance and Ward’s minimum variance-based agglomeration method. The best fit on consensus clustering identified five groups: luminal, HER2, basal and reactive I and reactive II, as defined in the TCGA dataset [
9]. The RPPA data can be found in Additional file
1a (parts of the data (i.e. total protein antibodies) have been published previously [
11]).
miRNA expression and clusters
miRNA expression was measured using the one-color microarray Human miRNA Microarray Kit (V2) design ID 029297 (Agilent Technologies) according to the protocol supplied by the manufacturer (miRNA Microarray System v2.3). This array contains 887 human miRNAs and is based on miRBase release 14.0. Each array is spotted by 14,907 features (60-mers) including 715 control probes; hence each miRNA is on average replicated approximately 16 times. For labeling and hybridization to the array, 100 ng total RNA was used as input. Scanning was performed on the Agilent Scanner G2565A. Samples were processed using Feature Extraction version 10.7.3.1 (Agilent Technologies). All except two tumors that did not pass array quality control were included in downstream analysis. The data were log2-transformed and centered on the 90th percentile using GeneSpring GX v.11.0 (Agilent Technologies). In total, 421 miRNAs were considered to be expressed in the Oslo2 cohort, after filtering out miRNAs detected in fewer than 10% of samples. The miRNA expression data have been submitted to the GEO database [GEO:GSE81000].
In order to identify patient clusters based on miRNA expression, the partitioning algorithm using recursive thresholding (PART) method available in the R package “clusterGenomics” [
23] was used. The PART method determines the number of clusters by recursive partitioning of the samples into subgroups. This means that it first attempts to split the data into an optimal number of subgroups by a flat cut of the dendrogram. It then applies the same procedure to each of the clusters identified, to see if any of these can be further split into subgroups. The benefit of this method is that it allows the dendrogram to be split into clusters occurring at different heights in the dendrogram, thus circumventing the limitation of using only flat cuts of the dendrogram to define clusters. The parameter Kmax is a technical parameter defining the maximum number of clusters to be identified at each stage of this procedure. PART was applied with the Pearson correlation coefficient-based distance and complete linkage and parameters Kmax = 4, minSize = 41 and B = 1000.
Metabolic profiles and clusters
Tumor samples from the Oslo University Hospital of sufficient size to obtain a biopsy sample for high-resolution magic-angle spinning magnetic resonance spectroscopy (HR MAS MRS) were cut to fit into 30-μl inserts containing 3.0 μl of 24.29 mM sodium formate (VWR BDH Prolabo, France) in D2O (Armar Chemicals, Switzerland). Each insert was set tightly into a 4 mm outer diameter (o.d.) MAS zirconium rotor. Samples weighed 7.26 mg on average (range 2.10–15.60 mg). HR MAS MRS spectra were acquired on a Bruker Avance DRX600 spectrometer equipped with a 1H/13C MAS probe (Bruker, BioSpin GmbH, Germany). Samples were spun at 5 kHz while kept at a temperature of 5 °C to minimize degradation. A spin-echo one-dimensional experiment with presaturation (cpmgpr1d, Bruker, BioSpin GmbH, Germany) was performed on all samples.
The spectral region between 1.40 and 4.70 parts per million (ppm) containing the major information from low-molecular-weight metabolites, excluding lipid-containing regions at 4.36–4.27, 2.88–2.70, 2.30–2.20, 2.09–1.93 and 1.67–1.50 ppm, was mean normalized and used for unsupervised hierarchical cluster analysis using the Euclidean distance and Ward’s minimum variance-based agglomeration method (Statistical toolbox, Matlab R2013b, The Mathworks, Inc., USA). The dendrogram was cut to give three metabolic clusters (1–3). Relative intensities from integration of spectral regions were used to measure metabolite levels [
6]. The clusters were tested for differences in expression using the Kruskal-Wallis test and corrected for multiple testing with the Benjamini-Hochberg false discovery rate (FDR) [
24]. The metabolic data and cluster assignments have been published previously [
25].
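The testing scheme for metabolite differences between the three metabolic clusters can be sketched as follows. This is an illustration under simplifying assumptions: the Kruskal-Wallis function handles exactly three groups without tied values and uses the chi-squared approximation with two degrees of freedom (for which the upper tail probability is exp(-H/2)); the function names are hypothetical, and a statistical library would normally be used instead.

```python
import math


def kruskal_wallis_3groups(groups):
    """H statistic and approximate p-value for exactly three groups.

    Assumes no tied values. With three groups the chi-squared reference
    distribution has 2 degrees of freedom, so p = exp(-H / 2).
    """
    pooled = sorted((v, g) for g, vals in enumerate(groups) for v in vals)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, g) in enumerate(pooled, start=1):
        rank_sums[g] += rank
    h = 12.0 / (n * (n + 1)) * sum(
        rs ** 2 / len(vals) for rs, vals in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    return h, math.exp(-h / 2)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# One p-value per metabolite would be collected and adjusted together
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```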
Classification of the Oslo2 samples in the ten integrative clusters
Samples in the Oslo2 cohort were assigned to the ten integrative clusters (IntClust) identified in the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) cohort [
3] using integrative clustering (iClustering) [
26]. With this aim, the same pipeline as used by Curtis et al. [
3] was employed: the pipeline assigns the METABRIC samples in the validation set to the ten clusters obtained from the discovery set. The METABRIC cohort includes 1980 patients (997 in the discovery set, and 983 in the validation set). Integrative clustering is used on the discovery set in order to estimate cluster centroids and the most relevant features (genes) for clustering. METABRIC finally selects ten clusters based on 754 features (39 are segmented copy number aberrations (CNAs) and 715 gene expressions). For assigning the Oslo2 data to these ten clusters, we used the Oslo2 samples for which both CNAs and gene expressions were available (n = 291). We also used the same 754 features as in the METABRIC case. As the platforms were different, in order to make the gene expressions and CNAs of the Oslo2 cohort comparable to the METABRIC ones, the Oslo2 data were normalized to have the same mean and standard deviation as the METABRIC data.
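The cross-platform normalization can be illustrated with a minimal sketch that rescales the values of one feature so that they match a reference mean and standard deviation. Applying the rescaling feature by feature and using the population standard deviation are assumptions of this sketch; the function name is hypothetical.

```python
from statistics import mean, pstdev


def match_mean_sd(values, ref_mean, ref_sd):
    """Rescale values to have the given reference mean and standard deviation.

    values: measurements of one feature across the samples to be adjusted
    ref_mean, ref_sd: mean and standard deviation of the same feature
    in the reference cohort
    """
    m, s = mean(values), pstdev(values)
    return [(v - m) / s * ref_sd + ref_mean for v in values]


# Toy feature measured in three samples, mapped onto a reference scale
print(match_mean_sd([1.0, 2.0, 3.0], ref_mean=10.0, ref_sd=2.0))
```

The transformation is linear, so the ordering of samples within a feature is preserved; only location and scale change.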
In the analysis of Curtis et al. [
3] the assignment of the samples in the validation set to the clusters obtained from the discovery set was performed using nearest shrunken centroid (NSC; see [
27]), a supervised classification method where cluster centroids are also shrunken. NSC has two phases: in the training phase, ten new shrunken centroids were estimated starting from the original centroids (given in [
3]), using the within-cluster standard deviation of each of the 754 features. In the testing phase, each selected Oslo2 sample was assigned to one of the ten shrunken centroids, thus identifying cluster membership.
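The testing phase can be sketched as a nearest-centroid assignment with a standardized squared distance. This is a deliberately simplified illustration: the shrinkage of the centroids and the class prior term of the full nearest shrunken centroid method are omitted, and the function name, centroid labels and toy values are hypothetical.

```python
def assign_to_centroid(sample, centroids, feature_sd):
    """Assign a sample to the closest centroid.

    sample: list of feature values
    centroids: dict cluster label -> list of centroid values
    feature_sd: within-cluster standard deviation of each feature,
        used to standardize the squared distance
    """
    def distance(centroid):
        return sum(
            ((x - m) / s) ** 2 for x, m, s in zip(sample, centroid, feature_sd)
        )

    return min(centroids, key=lambda label: distance(centroids[label]))


# Toy example with two features and two centroids
toy_centroids = {"IntClust1": [0.0, 0.0], "IntClust2": [2.0, 2.0]}
print(assign_to_centroid([1.6, 1.4], toy_centroids, [1.0, 1.0]))
```

Dividing by the within-cluster standard deviation down-weights noisy features, so a feature with large within-cluster variability contributes less to the assignment than a tightly distributed one.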
Pathway recognition algorithm using data integration on genomic models
The pathway recognition algorithm using data integration on genomic models (PARADIGM) infers distinct biological pathway activity from multiple genomic data (here, mRNA expression and copy number) in a patient sample [
5,
28]. The pathway concepts (genes, complexes and abstract processes) are derived from the Pathway Interaction Database [
29], BioCarta (
http://cgap.nci.nih.gov/Pathways/BioCarta_Pathways), and Reactome databases [
30]. Copy number is estimated from AROMA CRMA v2 followed by circular binary segmentation (non-paired CBS) to gene level measurement [
31,
32]. The normalized mRNA expression and the 25% most variable copy number data were used for calculating inferred pathway levels (IPL) at five3genomics.com. Clustering was performed by HOPACH 2.10 [
33] in R version 3.0.0 [
17] and visualized using cluster 3.0 [
34] and Java TreeView ver.1.1.6r3 [
35]. For the HOPACH hierarchical clustering algorithm, we used correlation distance “cor” and clustering criteria function “med” (median split silhouette) in the mss parameter (mss determines the number of children at each node, to decide what collapsing should be performed at each level, and to determine the main clusters).
Consensus clustering across the multiple classifications
Consensus Clustering [
36] is a method to represent the consensus across multiple classifications. We used the R package “ConsensusClusterPlus” [
37] to estimate a final clustering of the Oslo2 cohort starting from multiple classifications. Only the samples that were clustered with at least two methods were included in the consensus clustering analysis. Consensus clustering was run using hierarchical clustering with normalized Manhattan distance and Ward linkage, and the final choice resulted in six clusters. The number of COCA clusters was selected by inspecting the average silhouette value [
38] associated with the final grouping, which showed a global maximum when selecting six consensus clusters. The cluster selection criterion suggested by the authors proposing the COCA method [
36] was also examined to support this choice.
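The input to such a cluster-of-clusters analysis can be sketched as a binary encoding of each sample's labels across the classifications, combined with a Manhattan distance that accounts for missing classifications. The exact normalization used in the analysis is not specified here; dividing by the number of indicators observed in both samples is an assumption of this sketch, as are the function names and toy labels.

```python
def encode(sample_labels, classes_per_layer):
    """One 0/1 indicator per (classification, class); None where the
    sample has no label for that classification."""
    vector = []
    for layer, classes in classes_per_layer.items():
        label = sample_labels.get(layer)
        for c in classes:
            vector.append(None if label is None else int(label == c))
    return vector


def normalized_manhattan(u, v):
    """Manhattan distance over indicators observed in both samples,
    divided by the number of such indicators (assumed normalization)."""
    shared = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not shared:
        return None
    return sum(abs(a - b) for a, b in shared) / len(shared)


# Toy example with two classifications
layers = {"PAM50": ["LumA", "Basal"], "miRNA": ["1", "2"]}
s1 = encode({"PAM50": "LumA", "miRNA": "1"}, layers)
s2 = encode({"PAM50": "Basal"}, layers)  # miRNA cluster missing
print(normalized_manhattan(s1, s2))
```

The resulting pairwise distance matrix would then be passed to hierarchical clustering with Ward linkage within the consensus clustering procedure.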
Association of COCA clusters to clinical parameters and correlation to molecular subtype levels
To associate the identified COCA clusters with clinical parameters, a Chi-squared association test was used. To assess the correlation between COCA clusters and molecular subtype levels, the Pearson correlation coefficient was calculated. This was done by first coding each sample as 1 or 0 according to whether or not it belonged to each molecular subtype and each COCA cluster, and then correlating a given 0/1 COCA cluster vector with a given 0/1 molecular subtype vector.
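The correlation between a COCA cluster and a molecular subtype level can be sketched as the Pearson correlation of two binary membership vectors. The function name and toy labels are hypothetical, and the sketch assumes that every sample has a label in both classifications.

```python
from statistics import mean


def indicator_correlation(cluster_labels, subtype_labels, cluster, subtype):
    """Pearson correlation between the 0/1 membership vector of one COCA
    cluster and the 0/1 membership vector of one molecular subtype."""
    x = [int(c == cluster) for c in cluster_labels]
    y = [int(s == subtype) for s in subtype_labels]
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


# Toy example: cluster 1 coincides exactly with the LumA subtype
clusters = [1, 1, 2, 2]
subtypes = ["LumA", "LumA", "Basal", "Basal"]
print(indicator_correlation(clusters, subtypes, 1, "LumA"))
```

For two binary vectors this quantity equals the phi coefficient, so a value near 1 means that cluster and subtype membership largely coincide, and a value near -1 means that they are nearly mutually exclusive.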
Additional file
2a contains further methodological description. All computational analyses were performed in R [
17] unless otherwise specified.
Discussion
The input to the COCA analysis was seven different classifications of breast tumors: the PAM50 subtype, RPPA subtype, metabolic cluster, miRNA cluster, CAAI, PARADIGM and IntClust. The five former were single-molecular-level classifications, while the IntClust and PARADIGM classifications were based on the combined analysis of copy number and expression data in two different ways: iClustering assigned each tumor to one of ten IntClusts derived from the METABRIC cohort, while PARADIGM identified patient clusters based on inferred pathway activity levels. The distance in the consensus clustering method was normalized so that the different layers would be comparable in terms of number of missing values associated with each layer. Furthermore, a strength of the current work is the processing of the tumors, where cutting and blending the tissue before dividing it for DNA, RNA and protein isolation ensured representative and comparable molecular data.
We identified six COCA clusters in our analysis. Considering the ranking of the molecular levels based on correlation with the COCA clusters (Table
1), PAM50 subtypes and miRNA clusters were the most strongly correlated; all of the five subtypes and four miRNA clusters were present among the strongest correlations. The PAM50 subtypes have been recognized as a robust classifier [
9]. miRNA expression has previously been associated with both gene expression-based subtypes and with clinical parameters [
45,
46], but subtypes have not yet been “formally” established based on miRNA expression. Interestingly, on ranking all the molecular levels, COCA clusters 1 and 5 were most strongly correlated with miRNA clusters 2 and 4, respectively, suggesting an important role for miRNAs in the separation of breast tumors.
In the TCGA breast study [
9], seven miRNA expression-defined subtypes were identified by consensus non-negative matrix factorization clustering. Except for two of the clusters, each of the clusters was a mixture of the PAM50-defined subtypes. As the four identified consensus clusters mainly recapitulated the PAM50 subtypes, the miRNA clusters were not given a dominant role in the TCGA study. Importantly, this particular study contained very few normal-like samples (1%), and thus the tumor distribution was different from the Oslo2 cohort, which consisted of 11% normal-like tumors. As in the TCGA study, the basal-like COCA cluster 3 had the most distinct signature with the strongest associations with several levels: IntClust 10, PAM50 basal subtype, PARADIGM cluster 2, RPPA basal subtype, and miRNA cluster 3 were all strongly correlated with this cluster.
There was also correlation between the COCA clusters and some of the RPPA subtypes, PARADIGM clusters and IntClusts; however, neither the metabolic clusters nor the CAAI subtypes were strongly correlated with any of the COCA clusters, suggesting that groupings based on metabolic profiles and complexity of DNA rearrangements are less strongly associated with the molecular subtypes driven by the other platforms.
Luminal A tumors represent the most frequent breast cancer subtype (approximately 40% of all cases). Although considered to have the best prognosis, the luminal A subtype is also characterized as the most heterogeneous group, both clinically and molecularly [
9,
47]. Some patients with this disease subtype suffer from relapse and may benefit from adjuvant treatment, while others risk unnecessary over-treatment with adverse side effects. Furthermore, survival curves for patients with luminal A tumors suggest that the risk of delayed local relapse and/or distant metastasis persists over long time periods compared to other subtypes [
48].
Heterogeneity at the molecular level was found for luminal A tumors in terms of mRNA expression, mutation spectrum and copy number changes in the TCGA breast cancer study [
9]. In the METABRIC study, which identified ten integrative clusters across breast cancers, luminal A tumors were separated mainly into three distinct subgroups which were found to be driven by specific genomic aberrations [
3]. Ciriello et al. [
47] analyzed copy number and mutation profiles in luminal A tumors and identified four major subtypes with distinct alterations and clinical outcomes.
Being able to distinguish subgroups of luminal A tumors is an important task and may potentially improve the choice of therapeutic approaches and prediction of clinical outcomes. In this respect, the split of the Oslo2 luminal A samples into COCA clusters 1 and 4 may suggest a novel refinement of this group. It was interesting to see that the two luminal A clusters were associated with different miRNA clusters and that overexpression of 13 of the 71 differentially expressed miRNAs in the luminal cell line MCF-7 directly showed functional effects that are important for cancer cell survival. The putative tumor-suppressor miRNA miR-1226*, which was more highly expressed in COCA cluster 1 and for which overexpression resulted in decreased proliferation, cell viability, ER and p-AKT levels, and in increased apoptosis, has previously been found to target and downregulate expression of the MUC1 oncoprotein and induce cell death [
49].
Although long-term follow up of the Oslo2 patients is not yet available, it was intriguing to see that in patients from four other cohorts, luminal A tumors formed two separate clusters when clustered on the same miRNAs differentially expressed in Oslo2. Furthermore, there was a prognostic difference between the patient clusters in the TCGA and DBCG cohorts. From the other molecular differences identified between those clusters, it may seem that the tumors in COCA cluster 4 are more “core” luminal, as they were more frequently assigned to the luminal protein-based subtype and with higher protein expression of the luminal marker GATA3 [
50].
The majority of the luminal A tumors belonging to COCA cluster 1 were of the RPPA-defined reactive I and II subgroups, which were characterized as being highly differentiated tumors with high expression of stromal proteins due to high numbers of stromal cells, lower levels of GATA3 protein compared to other tumors classified as luminal A/B from gene expression and with a favorable clinical outcome [
39]. This difference in association between the RPPA subtype and the luminal A clusters separated by miRNA expression was also seen in the TCGA cohort, in which RPPA subtypes were available. Coupled with the outcome data, this suggests that the cluster with more tumors classified as the reactive subtype is associated with a better prognosis. The 71-miRNA signature would need further development to serve as a diagnostic test for patients with luminal A tumors. miRNA-based tests may be beneficial as miRNA molecules are short and relatively stable [
51] and can be successfully applied to, for example, formalin-fixed paraffin-embedded tissue. Further studies are needed to better understand the underlying biology and the possible role of miRNAs as markers to separate luminal tumors with different clinical outcomes or responses to therapy, and these will be exciting to follow up.
Acknowledgements
We would like to acknowledge Inger R Bergheim, Phuong Vu, Dagim Shiferaw, Veronica Skarpeteig, Tone Olsen, Anja Valen, Anita Halvei, Yiling Lu and Jovana Klajic for assisting in the performance of mutation and array analyses and in RNA extraction. We would like to thank Daniel Nebdal for excellent technical assistance in producing the figures and Thomas Fleischer for technical assistance in processing the TCGA follow-up data. We would also like to thank Rami Mäkelä and Merja Perälä for their contributions to the analysis of the cell line functional data.
Oslo Breast Cancer Research Consortium (OSBREAC) members (additional members not listed in the main author list): Elin Borgen, Department of Pathology, Division of Diagnostics and Intervention, Oslo University Hospital, Oslo, Norway; Olav Engebråten, Department of Tumor Biology, Institute for Cancer Research, Oslo University Hospital, Oslo, Norway; Department of Oncology, Division of Surgery and Cancer and Transplantation Medicine, Oslo University Hospital, Oslo, Norway; Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway; Øystein Fodstad, Department of Tumor Biology, Institute for Cancer Research, Oslo University Hospital, Oslo, Norway; Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway; Britt Fritzman, Østfold Hospital, Østfold, Norway; Øystein Garred, Department of Pathology, Oslo University Hospital, Oslo, Norway; Gry A Geitvik, Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Oslo, Norway; Solveig Hofvind, Cancer Registry of Norway, Oslo, Norway; Oslo and Akershus University College of Applied Sciences, Faculty of Health Science, Oslo, Norway; Anita Langerød, Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Oslo, Norway; Hege G Russnes, Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo, Norway; Department of Pathology, Oslo University Hospital, Oslo, Norway; Helle Kristine Skjerven, Department of Breast and Endocrine Surgery, Vestre Viken Hospital, Drammen, Norway; Therese Sørlie, Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, Norway.
Authors’ contributions
Conception and design: MRA, VV, SJ, EUD, OSBREAC, BN, ES, TS, RK, GMM, AF, VNK, ALBD and KKS. Development of methodology: MRA, VV, SJ, SK, MK, EUD, THH, HKMV, TL, CJV, TFB, GBM, OCL and AF. Acquisition of data: MRA, SJ, MK, EUD, THH, HKMV, SKL, TL, TFB, CC, TT, JA, JO, JG, IRKB, OSBREAC, BN, ES, TS, RK, GMM, VNK, ALBD and KKS. Analysis and interpretation of data: MRA, VV, SJ, SK, MK, THH, SKL, HKMV, ER, WZ, EKM, SN, GFG, JO, GBM, OCL, AF, VNK, ALBD and KKS. Writing, review and/or revision of the manuscript: MRA, VV, SJ, SK, MK, EUD, THH, HKMV, OCL, AF, VNK, ALBD and KKS. Administrative, technical, or material support: EUD, TL, CJV, BN, ES, TS, RK, GMM, VNK, ALBD and KKS. All authors read and approved the final manuscript.