Background
Breast cancer is a heterogeneous disease. Patients at the same stage or with the same molecular subtype can have different clinical prognoses or derive different benefit from systemic therapy. Personalized medicine is urgently needed for optimal breast cancer care. For this reason, multi-gene signatures have been extensively studied to provide prognostic and predictive information for breast cancer treatment. To date, more than 30 different signatures have been reported [
1‐
42]. Several signatures have become commercially available including the 70-gene signature (MammaPrint) [
1], the 21-gene signature (OncotypeDx) [
4], the 97-gene genomic grade index (GGI) [
15], the EndoPredict assay [
33], the breast cancer index [
21], and the PAM50 assay [
3]. The 70-gene signature [
1] and the 21-gene signature [
4] can distinguish patients with different risks of relapse; high-risk patients benefit more from adjuvant chemotherapy (CT) than low-risk patients. The 97-gene genomic grade index [
15] divides classic histologic grade into low- and high-risk patients. The breast cancer index [
21] divides patients into groups with different risks of recurrence, and low-risk patients are highly responsive to adjuvant tamoxifen therapy. The EndoPredict assay [
33] assigns patients to high-risk or low-risk groups for relapse, indicating whether CT is warranted. The PAM50 assay [
3] is a classifier that assigns breast cancers to five subtypes: luminal A (LumA), luminal B (LumB), HER2-enriched (Her2), basal-like (Basal) and normal-like (Normal). The PAM50 assay [
3] also assesses a patient’s risk of distant recurrence and the likelihood of benefit from neoadjuvant CT. These commercialized signatures have been shown to work well in hormone receptor (HR)-positive breast cancers. Several signatures have also been reported to define patients with a good prognosis within ER-negative tumour cohorts [
18,
26]. A number of signatures were derived to predict clinical outcome for triple-negative breast cancers (TNBC) or basal breast cancers [
31,
35,
36,
38,
41]. Signatures derived from HER2-positive cohorts are usually used for predicting trastuzumab response [
28,
39]. Several signatures have been developed for predicting docetaxel response [
2,
6,
16].
However, signatures derived from similar tumour cohorts for similar purposes share few overlapping genes [
20,
40]. For example, the 70-gene signature shares only three genes with the 64-gene signature [
20] although both signatures were mainly derived from ER-positive patients. A previous study using functional enrichment analysis of only six gene signatures showed little overlap of functional categories among them [
24]. In contrast, another study showed prognostic concordance among several gene expression signatures, suggesting potential equivalence between the signatures [
43]. We hypothesized that although these signatures do not have many overlapping genes, they could share common functions or pathways.
The genes of previously reported signatures could be valuable resources for new approaches because they have been shown, to varying degrees, to be associated with clinical outcomes in the original studies. Some studies have pooled previously reported datasets to develop a new signature [
24,
44], and one study reported that combining several signature gene sets improved prediction of breast cancer survival [
37]. In addition, different breast cancer datasets can be considered as resamplings from the underlying breast cancer population, and the genes most frequently identified (common genes) in the separate resamplings have been put forward as a ‘gold standard’ [
24]. Therefore, the common genes identified across these signatures could be valuable resources for new signature development.
Recently, we developed a 16-gene Yin Yang mean expression ratio (YMR-16) signature for ER-positive/node-negative breast cancer, based on the hypothesis that two opposing effects (Yin and Yang) could determine cancer initiation and progression [
42]. These 16 genes were identified among all human genes on the Illumina gene expression microarray platform. In this study, we attempted to understand why there are so few overlapping genes among the previously reported multi-gene signatures, and to test our hypothesis that signatures share common functions or pathways [
42,
45,
46]. We also evaluated the set of genes common to multiple signatures in our Yin Yang gene Mean expression Ratio (YMR) model [
42] and as a subtyping classifier.
Discussion
The finding that there is little overlap of constituent genes amongst molecular signatures generated by different researchers using different patient cohort data has long been a source of concern and the reasons for it are still the focus of ongoing debate [
49]. Several explanations have been proposed [
24]. One is that researchers have used different platform technologies and supervised protocols for signature derivation. Another focuses on the heterogeneity of samples included in the different datasets. Small sample size is another possible contributor to the differences seen [
24]. However, it is possible that groups of distinct genes actually support the same function, and few studies have focused on such an analysis of the genes within different signatures [
24].
Breast cancer is well known to be heterogeneous. Differences in the clinical composition of the existing datasets may partly explain the limited overlap found amongst different signatures. However, even the number of overlapping genes within the known ER-positive/negative subgroups is still small [
21], and a one-size-fits-all signature is unlikely to be possible. In our study, signatures for the same currently defined type of breast cancer (ER-positive or ER-negative) tended to share more genes with each other than with signatures for different breast cancer subtypes (Fig.
3a). However, known and unknown forms of heterogeneity within the same ER-status subtype could still contribute to the limited gene overlap.
Many gene signatures were developed by selecting genes whose expression levels correlate with clinical outcomes, without any consideration of gene function. We hypothesized that the genes in these reported signatures are functionally associated with the disease, either directly or indirectly. Thirty-three signatures were included in this study. Of the 2239 genes in total, 238 (10.6%) were shared by at least two signatures. In contrast, 429 of the 1979 function terms (21.7%) derived from the signatures were common to at least two signatures. These shared function terms included cell cycle, cell death, response to wounding, response to organic substance and intracellular signaling cascade. The fact that signatures shared proportionally more function terms than genes supports our hypothesis that the signature genes represent similar functions or pathways despite being different individual genes.
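The counting behind this comparison can be illustrated with a minimal sketch. The function name, the toy signatures and the placeholder gene symbols below are hypothetical and are not taken from the study; only the four totals (238 of 2239 genes, 429 of 1979 function terms) come from the text above.

```python
from collections import Counter


def shared_items(signatures, min_signatures=2):
    """Return the items (genes or function terms) found in at least
    `min_signatures` of the given signatures."""
    counts = Counter()
    for items in signatures.values():
        counts.update(set(items))  # count each item once per signature
    return {item for item, n in counts.items() if n >= min_signatures}


# Hypothetical toy signatures; the gene symbols are placeholders only.
toy = {
    "SigA": {"G1", "G2", "G3"},
    "SigB": {"G3", "G4"},
    "SigC": {"G5", "G6", "G1"},
}
shared = shared_items(toy)
all_items = set().union(*toy.values())
toy_fraction = len(shared) / len(all_items)

# Proportions computed from the totals reported in the text.
gene_fraction = 238 / 2239   # genes shared by at least two signatures
term_fraction = 429 / 1979   # function terms shared by at least two signatures
print(f"genes: {gene_fraction:.1%}, function terms: {term_fraction:.1%}")
```

Applied to the reported totals, the shared proportion of function terms is roughly twice that of genes.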
In this study we found that most of the signatures from ER-positive breast cancers had common function terms focused on cell proliferation, although they did not share common genes. Functions affecting cell proliferation, such as cell cycle process, mitotic cell cycle, DNA replication and nuclear division, were shared by signatures mostly used for ER-positive disease, such as Mamma, RS, Endo, and GGI97. Most of the signatures from ER-negative tumours shared common function terms focused on immune response, though they did not share common genes. In particular, signatures derived from ER-negative tumours, such as Novel1, Novel2 and MBC, shared functions associated with immune response, including lymphocyte activation, leukocyte activation and T cell activation. It has been reported that activation of complement and immune response pathways is associated with good prognosis in a subclass of basal tumors [
18].
One point worth noting is that there were also common function terms between the signatures from ER-positive and ER-negative breast cancers, including cell death, regulation of cell proliferation, response to organic substance, intracellular signaling cascade, response to hormone stimulus, response to oxygen levels, bone development, DNA packaging, response to hypoxia, ossification and skeletal system development. This indicates that the genes annotated with these function terms are generally associated with prognosis and treatment response prediction in different breast cancer types, and implies that different subtypes undergo similar biological processes during tumour progression. For example, the Endo and MS14 signatures, both developed for ER+ prognosis on the same platform (qRT-PCR), share the top pathway regulation of cell cycle process (GO: 0010564), yet the two signatures have no genes in common within this pathway, again consistent with the concept that the pathway is more important than the actual genes. Our study strongly implies that the prognosis of different subtypes may be determined by similar biological processes or pathways. However, different subtypes may have specific pathways, such as those involving ER or HER2, that add to or act on the common pathways. For example, OncotypeDx was developed for ER+ disease and MBC for TNBC; the pathway response to wounding (GO: 0009611) is common to both, but the pathway leukocyte activation (GO: 0045321) is unique to MBC. It was previously reported that breast cancer patients whose tumors expressed wound-response genes had significantly poorer overall survival and distant metastasis-free survival than patients whose tumors did not express wound-response genes [
5].
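How two signatures can share a function term without sharing a gene can be shown with a small sketch. The annotation table, gene names and signature contents below are hypothetical placeholders, and simple annotation membership is used here in place of the statistical enrichment analysis on which the reported results are based.

```python
def term_sets(signature_genes, annotation):
    """Collect every function term annotated to any gene in a signature."""
    return set().union(*(annotation.get(g, set()) for g in signature_genes))


# Hypothetical gene-to-term annotation; names are placeholders only.
annotation = {
    "A1": {"cell cycle"},
    "A2": {"cell cycle", "DNA replication"},
    "B1": {"cell cycle"},
    "B2": {"immune response"},
}
sig_x = {"A1", "A2"}
sig_y = {"B1", "B2"}

shared_genes = sig_x & sig_y
shared_terms = term_sets(sig_x, annotation) & term_sets(sig_y, annotation)
print("shared genes:", sorted(shared_genes))
print("shared terms:", sorted(shared_terms))
```

In this toy case the two signatures have no gene in common but both map to the term "cell cycle", which is the pattern described above for Endo and MS14.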
In addition to proliferation pathways, we also found that cytoskeleton organization pathways are common among ER+ signatures (MammaPrint, 97-gene GGI, BCI, MS14) and that organelle fission pathways are common to 97-gene GGI, BCI and MS14. Common functions were also found among signatures for different subtypes. MammaPrint is used for ER+ disease and shared no genes with several signatures derived from ER-negative breast cancers (HDPP, Tcell, and GCN), yet it shared the pathways intracellular signalling cascade (GO: 0007242), pathways in cancer (hsa05200) and response to oxygen levels (GO:0070482) with HDPP, and the pathway positive regulation of signal transduction (GO:0009967) with Tcell. We also found that functions such as regulation of protein modification and regulation of cell death are enriched in signatures used for different subtypes (ER+, TNBC, HER2+).
Another argument is that a statistical association between multi-gene signatures and clinical outcomes does not necessarily imply biological significance [
50]. For example, Miller and colleagues [
7] developed a 32-gene expression signature which indicated p53 status (mutant and wild-type). However, none of the 32 genes were known transcriptional targets of p53 or known to be involved in the p53 pathway. A potential explanation could be that most of these signatures were identified using Cox regression, which simply selected the top-ranked genes using a Cox score [
51]. In this study, we hypothesized that two opposing effects, called Yin and Yang, determine the fate of tumour cells. We used Yin to represent the effects leading to cancer progression and Yang to represent the effects that maintain the normal healthy state. In this context, we tried to develop signatures that could reflect the biological mechanisms of breast cancer progression.
Interestingly, many signature genes do not show a difference in expression level between tumour and normal breast tissue samples, at least at the RNA level. We selected 71 Yin genes and 20 Yang genes from the signature genes. Functional annotation showed that most of the Yin genes functioned in the cell cycle, while the enriched function terms of the 20 Yang genes were more diverse, including secreted extracellular region, signal peptide, disulfide bond, regulation of apoptosis and blood circulation. In this study we used GO biological process terms and KEGG pathways, which may differ from the function terms and pathways used in other studies.
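As an illustration of the idea behind a Yin Yang mean expression ratio, the following sketch computes the ratio of the mean expression of Yin genes to the mean expression of Yang genes for one sample. This is an assumption based only on the name of the model: the exact formula, normalization and risk cutoff are given in the original YMR publication [42], not in this section, and the gene names and expression values here are hypothetical.

```python
from statistics import mean


def ymr_score(expression, yin_genes, yang_genes):
    """Ratio of mean Yin-gene expression to mean Yang-gene expression.

    Assumes `expression` maps gene name to a positive, non-log expression
    value for a single sample (an assumption, not the published protocol).
    """
    yin = [expression[g] for g in yin_genes if g in expression]
    yang = [expression[g] for g in yang_genes if g in expression]
    if not yin or not yang:
        raise ValueError("sample lacks measurements for Yin or Yang genes")
    yang_mean = mean(yang)
    if yang_mean == 0:
        raise ValueError("mean Yang expression is zero")
    return mean(yin) / yang_mean


# Hypothetical sample; gene names and values are placeholders only.
sample = {"YIN1": 8.0, "YIN2": 6.0, "YANG1": 4.0, "YANG2": 3.0}
score = ymr_score(sample, ["YIN1", "YIN2"], ["YANG1", "YANG2"])
print(score)
```

Under this reading, a higher ratio means that progression-associated (Yin) expression outweighs expression associated with the normal state (Yang).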
Urgent work is needed on personalized care for TNBC because TNBC is more heterogeneous and more aggressive than ER-positive breast cancer. Current signatures developed for TNBC have not been used clinically, mostly because of a lack of further validation and/or poor reproducibility. We found that five signatures were derived from TNBC (Multigene, Bcell, Novel1, Novel2 and MAGEA), one from basal tumours (MBC), two from ER-negative tumours (IR7 and Tcell), and three from invasive/metastatic/tumorigenic breast cancers (SDPP, LM and IGS). Interestingly, these signatures shared more genes with each other than with signatures derived from ER+ and/or Her2-enriched subtypes. We selected 64 genes common to at least two of these 11 signatures to classify 127 TNBC patients from the METABRIC dataset (Additional file
14: Figure S6) and 107 TNBC patients from the GSE58812 dataset (Additional file
15: Figure S7). Two clusters were identified in each dataset. Most of these common genes were highly expressed in one cluster and expressed at low levels in the other. The cluster with higher expression had better overall survival, although the difference did not reach statistical significance (
p = 0.14 for METABRIC,
p = 0.13 for GSE58812). This suggests that the expression level of these genes could be associated with the progression of aggressive breast cancers. The top pathway enriched in the 64 genes from these 11 signatures was “regulation of immune response” (
p = 1.8E-4, FDR = 8.5E-2).
One limitation of this study is that we used the same p-value cutoff to evaluate the functional groups of all signatures. This indicates which functional groups a signature may be involved in, but not how significant the involvement is. However, it is challenging to compare significance across signatures with gene lists of different sizes. A second limitation is that we did not optimize the YMR signature model built from the common genes or validate it on large datasets, although this was not the focus of this study.
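The dependence of enrichment significance on gene list size can be illustrated with a short sketch. This section does not state which enrichment test was used, so a one-sided hypergeometric over-representation test is assumed here as a common choice; the background size, term size and gene list sizes are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from a background of N genes,
    K of which are annotated to the function term."""
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical numbers: 20,000 background genes, 500 annotated to the term.
N, K = 20000, 500
p_small = hypergeom_pvalue(4, 20, K, N)     # 4 of 20 genes (20%) in the term
p_large = hypergeom_pvalue(40, 200, K, N)   # 40 of 200 genes (20%) in the term
print(f"small list: p = {p_small:.2e}; large list: p = {p_large:.2e}")
```

Both hypothetical lists have 20% of their genes in the term, yet the larger list yields a much smaller p-value, which is why a single cutoff identifies enriched terms but does not make their significance comparable across signatures of different sizes.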