Background
Analyses of mammalian transcriptomes have shown that more than 50% of transcripts have no protein-coding potential. A subset of these noncoding transcripts, termed long non-coding RNAs (lncRNAs), range from 200 nucleotides to multiple kilobases in length [1]. These long, polyadenylated RNAs do not code for proteins but function directly as RNAs. Many lncRNAs have already been associated with various disease processes, and cancer features prominently among these. Beyond the classic protein-coding mRNAs, recent studies have revealed that lncRNAs can act as proto-oncogenes, tumor suppressor genes, and drivers of metastatic transformation at the transcriptional, post-transcriptional, and epigenetic levels [2]-[5]. Accumulating evidence indicates that lncRNAs are linked to a diverse range of functions in cellular development, and their misregulation has been implicated in various types of cancer [6]-[8]. In most cases, these transcripts are aberrantly expressed in cancers, which suggests that they may serve as biomarkers predictive of clinical outcome.
Breast cancer is a heterogeneous disease characterized by multiple molecular alterations. Molecular differences between histologically similar tumors make clinical outcomes difficult to predict and treatment imperfectly adapted [9]. Breast cancers of varying histological subtypes and risk stratification are traditionally diagnosed based on their histopathological features, including tumor size, grade and lymph node status. Over the past decade, the “intrinsic” molecular subtypes of breast cancer (luminal A and B, basal, ERBB2 and normal-like) have been shown to exhibit different histo-clinical features and treatment sensitivity [10],[11]. Given the heterogeneity of breast cancer and the multitude of variables influencing clinical evolution, multi-gene signatures provide further prognostic and predictive information. One example is a 21-gene classifier (Oncotype DX), which classifies breast tumors into low-, intermediate- and high-risk groups to guide decisions on adjuvant chemotherapy, which is advised for patients in the high-risk group [12],[13]. Such gene signatures may have clinical potential to predict patient outcome and aid in treatment choice [14].
In breast cancer, several lncRNA transcripts have been implicated in the biology of tumorigenesis. Furthermore, certain lncRNAs exhibit distinct expression patterns between primary tumors and metastases. HOTAIR, a 2.2 kb lncRNA, has been shown to be an independent predictor of breast cancer survival. Elevated HOTAIR expression levels correlate with breast cancer and are linked to poor prognosis and metastasis [3]. This lncRNA may induce metastases by remodeling the epigenetic machinery to repress metastasis suppressor genes (e.g., HOXD10). Another lncRNA, MALAT-1 (metastasis associated lung adenocarcinoma transcript 1), is overexpressed in many different cancer types, including lung, breast, colon, prostate, pancreatic, and hepatocellular carcinomas [15]-[17]. This highly conserved 8 kb lncRNA is upregulated in invasive breast carcinomas and correlates with tumor grade [18]. GAS5 (growth arrest-specific 5) was found to be downregulated in breast cancer tissues, and overexpression of this lncRNA in the MCF-7 breast cancer cell line promoted growth arrest and apoptosis [19]. LSINCT5, a stress-regulated lncRNA, is overexpressed in breast and ovarian cancer cell lines and tumor tissues. In addition, LSINCT5 has been shown to play a role in cellular proliferation and in the development of breast and ovarian cancers [20]. Transcriptional profiling has revealed highly aberrant lncRNA expression in human cancers [21]. However, the prognostic significance of lncRNAs in breast cancer has not been investigated.
Recently, the methodology of repurposing microarray data to probe lncRNA expression has become well established [22]-[24]. For instance, Du et al. used a large dataset of microarrays to build a resource of clinically relevant lncRNAs for the development of lncRNA biomarkers and the identification of lncRNA therapeutic targets [25]. Zhang et al. correlated lncRNA expression profiles with malignancy grade and histological differentiation in human gliomas by re-annotating the Affymetrix HG-U133 Plus 2.0 array [26],[27]. Furthermore, several studies have discovered new biomarkers that predict survival by re-annotating previous microarray data. A six-lncRNA signature has been identified that predicts survival of patients with glioblastoma multiforme [27], while a three-lncRNA signature has been shown to be associated with the prognosis of patients with oesophageal squamous cell carcinoma [28].
In this study, we aimed to profile lncRNA expression signatures by analyzing a cohort of previously published breast cancer gene expression profiles from the Gene Expression Omnibus (GEO), with another three independent data sets serving as testing sets. We identified a four-lncRNA signature associated with survival and then established a risk score formula using the expression of these four lncRNAs. The prognostic value of the signature was further confirmed in the testing cohorts. Our findings suggest that lncRNA signatures can be predictive of clinical outcome and may be useful as biomarkers.
Discussion
The discovery of multiple functional regulatory lncRNAs has led to genome-wide searches in multiple species, as well as searches for transcripts that are aberrantly expressed in various types of cancer. Similar to protein-coding genes and miRNAs, lncRNAs play key roles in tumorigenesis. They are involved in a number of fundamental processes associated with cancer, including cell cycle regulation, apoptosis, the DNA damage response, and metastasis [3],[46]. The expression of highly conserved lncRNAs is also altered in breast cancers [17]. Our study performed lncRNA profiling by mining existing microarray gene expression data, as previously reported [47],[48]. Although several recent studies have examined the roles of lncRNAs in breast cancer, the prognostic value of lncRNA signatures has not been investigated. To our knowledge, this is the first report of an lncRNA expression signature predicting breast cancer patient survival.
In this study, we have identified a four-lncRNA expression signature that is associated with survival of breast cancer patients. We further revealed that the four-lncRNA signature is an independent predictor of breast cancer patient survival.
Of the four lncRNAs, overexpression of U79277 was correlated with shorter survival, while the other three (AK024118, BC040204, AK000974) were downregulated in the high-risk group compared to the low-risk group. No functional studies of these genes in cancer have been reported so far. Nevertheless, our present study demonstrated associations between the expression of these genes and survival time. Interestingly, the loci of these putative lncRNAs overlap with many transcripts, including some well-known oncogenes and tumor suppressor genes. AK024118 is located within an intron of BCL2, a known driver of lymphoma. U79277 is transcribed from the minus strand of human chromosome 8 and overlaps with the YWHAZ 3’UTR. AK000974 overlaps with many transcripts, including the CCNJ mRNA. We note that such overlap is very common for ncRNAs. LncRNAs are categorized as intergenic or genic; genic lncRNAs are further classified as exonic, intronic or overlapping, and as sense or antisense, according to their relation to neighboring protein-coding genes [49],[50]. Although some lncRNAs overlap with neighboring protein-coding genes, most of them have their own function. Some lncRNAs regulate the transcription of nearby genes in cis, while others act in trans [28]. A concrete example is HOTAIR, a well-studied lncRNA located within the HOXC cluster, which was shown to help silence HOXD cluster genes in trans [51]. It may be worthwhile to investigate these lncRNAs further to better understand their roles in determining breast cancer prognosis.
The median risk score was used as the cutoff point for two reasons. First, a previous lncRNA risk score formula used the median as the cutoff point for classifying patients into two groups [27]. Second, in the absence of a prior cutpoint, the most common approach for dichotomizing a continuous variable is to take the sample median [52].
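The risk score and median split described above can be sketched as follows. This is a minimal illustration, assuming the risk score is a linear combination of the four lncRNA expression values weighted by Cox regression coefficients. The coefficient values, the patient expression values, and the function names `risk_score` and `split_by_median` are hypothetical and are not taken from this study; only the coefficient signs follow the text (positive for U79277, negative for the other three).

```python
from statistics import median

# Hypothetical coefficients for illustration only. The signs follow the text:
# higher U79277 expression means higher risk, the other three are protective.
COEFFICIENTS = {
    "U79277": 0.40,
    "AK024118": -0.30,
    "BC040204": -0.20,
    "AK000974": -0.25,
}


def risk_score(expression):
    """Weighted sum of the four lncRNA expression values for one patient."""
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


def split_by_median(scores):
    """Assign each patient to 'high' or 'low' risk using the sample median."""
    cutoff = median(scores.values())
    groups = {
        patient: ("high" if score > cutoff else "low")
        for patient, score in scores.items()
    }
    return cutoff, groups


# Made-up expression values for four patients (not real data).
patients = {
    "P1": {"U79277": 2.0, "AK024118": 1.0, "BC040204": 1.0, "AK000974": 1.0},
    "P2": {"U79277": 1.0, "AK024118": 2.0, "BC040204": 2.0, "AK000974": 2.0},
    "P3": {"U79277": 3.0, "AK024118": 1.0, "BC040204": 1.0, "AK000974": 1.0},
    "P4": {"U79277": 1.0, "AK024118": 1.0, "BC040204": 1.0, "AK000974": 1.0},
}

scores = {pid: risk_score(expr) for pid, expr in patients.items()}
cutoff, groups = split_by_median(scores)

for pid in sorted(scores):
    print(pid, round(scores[pid], 2), groups[pid])
```

Patients whose score equals the median are assigned to the low-risk group here; that tie-breaking rule is a choice made for this sketch.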
By performing multivariable Cox regression analysis that included age and subtype (when available) as covariables, we analyzed whether the prognostic value of the four-lncRNA signature was independent of age and subtype. Age at diagnosis exerts a complex influence on breast cancer prognosis. Young age at diagnosis negatively influences prognosis [53]-[55], whereas breast cancer in elderly women is associated with an inferior prognosis compared to that of middle-aged women [54]. Observational data in breast cancer patients suggest an increased risk of disease-specific mortality with increasing age [56],[57]. In contrast, several observations suggest that the percentage of deaths attributed to breast cancer decreases with age [58],[59]. These inconsistent findings could explain why age was not a significant prognostic factor in the univariable Cox regression analysis in our study. Nonetheless, we conclude that the risk score derived from the four-lncRNA signature was independent of age in the present study.
Breast cancer is clinically heterogeneous due to molecular differences between histologically similar tumors. Luminal, HER2-enriched and basal-like (triple-negative) subgroups have been identified and shown to have different long-term survival [10],[11],[60],[61]. There have been few reports on the correlation between lncRNAs and the molecular subtypes of breast cancer. A newly identified lncRNA, LOC554202, is abundantly expressed in non-invasive breast cancer cell lines of the luminal subtype, but its expression is lost in more aggressive triple-negative breast cancer cell lines of the basal subtype [62]. It was therefore of interest to determine whether our four-lncRNA signature was associated with this strong prognostic factor. As data on molecular subtype were available only in GSE21653, we performed multivariable Cox regression including risk score and subtype in this testing group. Because of the small sample size in some subgroups, we did not observe a significant difference in either univariable or multivariable Cox regression analysis.
Further ROC analysis demonstrated that the four-lncRNA signature was comparable with Oncotype DX (p = 0.0837). Although Oncotype DX is the most widely accepted test in clinical practice for deciding on the advisability of adjuvant chemotherapy for breast cancer patients [12], the test is not financially feasible for every patient in developing countries. As shown in this study, a small number of genes (four) could be sufficient to predict prognosis using simple reverse transcription polymerase chain reaction (RT-PCR). Clinically, the risk score may provide clues to the biological behavior as well as the prognostic characteristics of tumors. Patients in the high-risk group may need more effective adjuvant therapy in addition to the standard treatment protocol. As a complement to current prognostic models, the four-lncRNA signature may be developed into an easy-to-use prognostic model to facilitate further stratification of patients.
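To illustrate what an ROC comparison measures, the area under the ROC curve (AUC) of a risk score can be computed as the probability that a randomly chosen patient with an event has a higher score than a randomly chosen patient without one, with ties counted as one half. The sketch below is a simplification under stated assumptions: it treats the outcome as binary at a fixed follow-up time, and it does not reproduce the ROC procedure or the statistical test behind the p-value reported in this study. The function name `auc` and all data values are hypothetical.

```python
def auc(scores, events):
    """Rank-based AUC: fraction of (event, non-event) pairs ranked correctly.

    scores: risk scores, one per patient.
    events: 1 if the patient had the event (e.g. death), 0 otherwise.
    """
    with_event = [s for s, e in zip(scores, events) if e == 1]
    without_event = [s for s, e in zip(scores, events) if e == 0]
    if not with_event or not without_event:
        raise ValueError("need at least one patient in each outcome class")

    concordant = 0.0
    for p in with_event:
        for n in without_event:
            if p > n:
                concordant += 1.0   # event patient ranked higher: correct
            elif p == n:
                concordant += 0.5   # tie counts as half
    return concordant / (len(with_event) * len(without_event))


# Made-up risk scores and outcomes for six patients (not real data).
example_scores = [0.45, 0.05, -0.35, -1.10, 0.20, -0.60]
example_events = [1, 1, 0, 0, 0, 1]
signature_auc = auc(example_scores, example_events)
print(round(signature_auc, 3))
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation. Comparing two signatures would additionally require a test for the difference between their AUCs, which is not shown here.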
Moreover, gene set enrichment analysis (GSEA) was performed to analyze coordinated expression changes at the pathway level. The associated molecular pathways, namely epithelial-mesenchymal transition (EMT) [44], cell cycle signaling and DNA replication, suggest that the four-lncRNA signature might be involved in cancer metastasis. Hence, these findings may have implications for the development of new targeted anti-cancer therapies. In breast cancer, it has been shown that knock-down of the lncRNA HOTAIR with specific siRNAs may limit the metastatic potential of breast cancer cells [3]. The therapeutic potential of targeting regulatory lncRNAs in order to increase the expression of specific genes has also recently emerged [1],[63]. The four prognostic lncRNAs may have therapeutic potential as novel molecular targets.
Several limitations of this study need to be acknowledged. First, only a fraction of human lncRNAs (5635 out of 15000+) were included in the analysis, so the prognostic lncRNA genes identified here may not represent all the lncRNA candidates that are potentially correlated with breast cancer overall survival. Second, we lack information on the mechanisms behind the prognostic value of these four lncRNAs in breast cancer; experimental studies on these lncRNAs might provide important information to further our understanding of their functional roles. Finally, although we recapitulated our findings in three published datasets to the extent possible based on data availability, the signature has not yet been tested prospectively in a clinical trial. Despite these drawbacks, the significant and consistent correlation of our four-lncRNA signature with overall survival in several independent data sets indicates that it is a potentially powerful prognostic marker for breast cancer.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JM, PL and QZ identified all of the public datasets. JM, PL, QZ and ZY carried out all of the biostatistical and informatics analyses and helped develop the method. JM formulated the study conclusions and drafted the manuscript. SF initiated and coordinated the project, guided the study design, supervised all data curation and analysis, and finalized all study conclusions and manuscript writing. All coauthors reviewed and approved the final manuscript.