Background
Although great advances have been made in the management of breast cancer over the past decade, patient outcomes still merit consideration due to the high rate of tumour-specific death [1, 2]. The molecular mechanisms of tumourigenesis in breast cancer remain unclear. Thus, identifying new genes related to tumourigenesis and patient prognosis, and elucidating the molecular mechanisms underlying these oncogenic processes, is urgently required.
It is estimated that less than 2% of the genome encodes proteins, but at least 75% is transcribed into noncoding RNAs [3]. Generally, long noncoding RNAs (lncRNAs) are defined as transcripts that are longer than 200 nucleotides and have no protein-coding capacity [4]. More than 120,000 transcripts are annotated as lncRNAs in the human genome, as released by the Encyclopedia of DNA Elements (ENCODE) Project Consortium (release 28) [5]. LncRNAs share common features with mRNAs: many of them are transcribed by RNA polymerase II and undergo 5′-capping, polyadenylation and splicing [3]. In addition, the histone profiles of lncRNA genes are similar to those of protein-coding genes for the active histone markers H3K4me2, H3K4me3, H3K9ac and H3K27ac [6]. On the other hand, lncRNAs have several features that distinguish them from protein-coding mRNAs. Generally, lncRNAs lack coding potential due to fewer exons [7]. However, as lncRNA research has progressed, some exceptions have emerged. LINC00961 regulates mTORC1 and muscle regeneration by encoding the SPAR polypeptide [8]. The functional polypeptides myoregulin, encoded by LINC00948, and Dworf, encoded by NONMMUG026737, have been reported to modulate SERCA pump activity [9, 10]. The lncRNA CRNDE encodes a nuclear peptide, CRNDEP, which is overexpressed in highly proliferating tissues and is involved in cell turnover [11]. Moreover, many investigations have shown that lncRNAs are expressed at lower levels than protein-coding genes but exhibit a more cell type-specific pattern [12-14]. Unlike mRNAs, which are found mainly in the cytoplasm, most lncRNAs are localized to the nucleus [6]. Finally, lncRNAs belong to evolutionarily conserved families that evolve faster than mRNAs, with sequence similarity likely to be preserved mainly in regions that form secondary structures [6, 12].
LncRNAs play an important role in regulating gene expression at various levels, including alternative splicing, regulation of protein activity, and alteration of protein localization, as well as chromatin modification, transcription, and posttranscriptional processing [3, 15-21]. The mechanisms by which lncRNAs contribute to the regulatory networks that underpin cancer development are diverse [22, 23]. Unlike microRNAs or piwi-interacting RNAs (piRNAs), lncRNAs drive many important cancer phenotypes through their interactions with other cellular macromolecules such as DNA, protein, and RNA [22-26]. Regarding their role in malignancy, lncRNAs are associated with immortality-associated characteristics of cancer cells, including cell cycle regulation, survival, immune response and pluripotency [27-32]. Several lncRNAs have been widely investigated in cancers. For example, HOTAIR promotes cancer metastasis by reprogramming the chromatin state in a manner dependent on PRC2 [33]. SChLAP1 contributes to the development of lethal cancer by antagonizing the tumour-suppressive functions of the SWI/SNF complex [34]. Another lncRNA, SAMMSON, increases pro-oncogenic function by interacting with p32 in melanoma [35]. Despite growing knowledge regarding the molecular mechanisms of lncRNA functions in malignancy, the modes of action of most lncRNAs in breast cancer remain unclear. Aberrant expression of oncogenic lncRNAs may confer capacities for tumour initiation, growth, and metastasis in breast cancer cells, thus leading to a worse prognosis for patients. Consistent with this, antisense oligonucleotides successfully inhibited the expression of the candidate therapeutic target Malat1 in an MMTV-PyMT mouse model of breast cancer [36]. Thus, unravelling the landscape of these oncogenic lncRNAs in breast cancer is essential and urgent.
The aim of this study was to identify the predictive capability of oncogenic lncRNAs for the tumourigenesis and prognosis of breast cancer. In this study, 1088 breast cancer patients from GEO data were selected and analysed to investigate the lncRNAs associated with tumourigenesis and outcomes. The aberrant long noncoding RNAs were then validated using TCGA data and high-throughput sequencing in our cohort. Genome-wide in silico analysis and in vitro assays revealed the potential biological functions of these lncRNAs, including their mechanisms of regulation and roles in tumourigenesis. Moreover, identification and validation of breast cancer subtype-specific lncRNAs and the determination of their association with clinical outcomes were also investigated.
Methods
Public data access and analysis
GEO data (GSE21653, GSE31448, GSE10810, GSE29431, GSE23177, GSE42568, and GSE48391) were downloaded and processed (http://www.ncbi.nlm.nih.gov/geo/). The genome-wide lncRNA expression profiles for breast cancer, renal cancer, and lung cancer were downloaded from TCGA (https://tcga-data.nci.nih.gov/). For the microarray analysis, we adjusted the signal values for low-abundance genes: a signal value lower than log3 was set to log3. Moreover, invariant genes (i.e., those with the same expression value across all samples) and low-variation genes were filtered out. Genes that were detected in less than 50% of the profiled samples were also filtered out. The significance analysis of microarrays (SAM) method was applied to estimate the significance of the difference and the false discovery rate for each gene retained after filtering, as previously described [37]. The procedure was implemented as follows:
(1) Calculate the exchange factor \( s_0 \). First, we calculated the standard deviation of gene expression \( s_i \) for every gene i and denoted by \( s^{\alpha} \) the α percentile of the \( s_i \) values. The relative difference in gene expression (d score) at the α percentile is calculated as \( d_i^{\alpha} = r_i / (s_i + s^{\alpha}) \), where \( r_i \) is the fold change of gene expression for gene i between the two conditions. Next, for each interval between the percentile values \( q_1 < q_2 < \dots < q_{100} \) of the \( s_i \), the mean absolute deviation \( v_j = \mathrm{mad}\{ d_i^{\alpha} \mid s_i \in [q_j, q_{j+1}) \} \) was calculated. Finally, we selected the α (denoted \( \widehat{\alpha} \)) that minimizes the coefficient of variation (CV) of the \( v_j \), and set the exchange factor \( s_0 = s^{\widehat{\alpha}} \).
(2) Calculate the statistical value (d score) for every gene i: \( d_i = r_i / (s_i + s_0) \).
(3) Calculate the order statistics of the \( d_i \): \( d_{(1)} \le d_{(2)} \le \dots \le d_{(i)} \le \dots \le d_{(p)} \).
(4) Perform 1000 permutations to estimate the expected distribution of the d score. For permutation b, we denote the ordered statistical values by \( d_{(1)}^{\ast b} \le d_{(2)}^{\ast b} \le \dots \le d_{(i)}^{\ast b} \le \dots \le d_{(p)}^{\ast b} \).
(5) Obtain the expected order statistic under the permutations: \( \bar{d}_{(i)} = \frac{\sum_{b=1}^{1000} d_{(i)}^{\ast b}}{1000} \).
(6) By calculating the maximum distance between the order statistic \( d_{(i)} \) and the expected order statistic \( \bar{d}_{(i)} \), we constructed a series of rejection regions for the q-value.
(7) For a fixed delta value, we computed the difference \( \Delta_{(i)} = d_{(i)} - \bar{d}_{(i)} \) and found the nearest \( \Delta_{(i)} \) for gene i. The cut-up was defined as \( \min\{\Delta_{(i)} \ge \mathrm{delta}\} \) for positive genes and the cut-down as \( \max\{\Delta_{(i)} \le -\mathrm{delta}\} \) for negative genes. Genes whose difference was larger than the cut-up value were called significantly positive, and genes whose difference was smaller than the cut-down value were called significantly negative.
(8) Estimate the false discovery rate (FDR): \( FDR = \frac{V_{(p)}}{R_{(p)}} \), where \( V_{(p)} \) is the median number of genes falsely called across the 1000 permutations and \( R_{(p)} \) is the number of genes called significant in the observed data.
(9) Obtain the q-value for gene i by selecting the minimum FDR over the 50 delta values evaluated in step (7).
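Steps (2) to (8) can be illustrated with a short, standard-library Python sketch. It is not the authors' implementation: the function names (`gene_stats`, `d_scores`, `sam`) are hypothetical, \( r_i \) is taken as the difference of group means on a log scale, \( s_i \) as a pooled standard error, and \( s_0 \) is simply fixed at the median \( s_i \) rather than chosen by the CV-minimizing search of step (1). The cut-offs follow the commonly described SAM rule of walking outwards from the centre of the ordered statistics.

```python
import math
import random
import statistics


def gene_stats(values, labels):
    """Return (r_i, s_i): mean difference and pooled standard error for one gene."""
    a = [v for v, g in zip(values, labels) if g == 0]
    b = [v for v, g in zip(values, labels) if g == 1]
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    ss = sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
    s = math.sqrt((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2))
    return mean_b - mean_a, s


def d_scores(matrix, labels, s0):
    """Step 2: d_i = r_i / (s_i + s0) for every gene (row)."""
    scores = []
    for row in matrix:
        r, s = gene_stats(row, labels)
        scores.append(r / (s + s0))
    return scores


def sam(matrix, labels, delta, n_perm=1000, seed=0):
    rng = random.Random(seed)
    # Assumption: s0 is the median s_i, not the CV-minimising percentile of step 1.
    s0 = statistics.median([gene_stats(row, labels)[1] for row in matrix])
    d = d_scores(matrix, labels, s0)
    order = sorted(range(len(d)), key=d.__getitem__)  # step 3
    d_sorted = [d[i] for i in order]

    # Steps 4-5: permute labels, sort each null d vector, average rank by rank.
    perm_sorted = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        perm_sorted.append(sorted(d_scores(matrix, shuffled, s0)))
    expected = [statistics.fmean(col) for col in zip(*perm_sorted)]

    # Step 7: first rank, moving outwards from the centre, that exceeds delta.
    cut_up, cut_down = math.inf, -math.inf
    for obs, exp in zip(d_sorted, expected):
        if exp >= 0 and obs - exp >= delta:
            cut_up = obs
            break
    for obs, exp in zip(reversed(d_sorted), reversed(expected)):
        if exp < 0 and obs - exp <= -delta:
            cut_down = obs
            break
    up = [i for i in order if d[i] >= cut_up]
    down = [i for i in order if d[i] <= cut_down]

    # Step 8: FDR = median number of null calls / number of observed calls.
    null_calls = [sum(1 for x in p if x >= cut_up or x <= cut_down)
                  for p in perm_sorted]
    called = len(up) + len(down)
    fdr = min(1.0, statistics.median(null_calls) / called) if called else 0.0
    return {"up": up, "down": down, "fdr": fdr}


# Simulated data: 20 genes, 5 vs 5 samples, gene 0 shifted upwards in group 1.
rng = random.Random(1)
labels = [0] * 5 + [1] * 5
matrix = [[rng.gauss(0, 1) for _ in labels] for _ in range(20)]
matrix[0] = [v + 8 * g for v, g in zip(matrix[0], labels)]
result = sam(matrix, labels, delta=0.5, n_perm=200)
```

Repeating the call over a grid of delta values and taking, for each gene, the smallest FDR at which it is still called would give a q-value in the sense of step (9).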
Welch’s t-test (unequal variances) and analysis of variance were also applied for two-group and multiple-group analyses, respectively. For multiple-comparison analysis, the q-value was used to control the false discovery rate. Clustering heatmap analysis was performed as previously described [38].
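The microarray preprocessing filters described at the start of this subsection can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `filter_genes` is a hypothetical helper, "log3" is read here as \( \log_2 3 \), a gene is treated as "detected" in a sample when its signal exceeds the floor, and the low-variation threshold `min_sd` is a free parameter because the text does not give one.

```python
import math
import statistics


def filter_genes(expr, floor, min_sd=0.0, min_detected=0.5):
    """Floor low signals, then drop rarely detected and (near-)invariant genes."""
    kept = {}
    for gene, values in expr.items():
        # Fraction of samples in which the raw signal exceeds the floor.
        detected = sum(1 for v in values if v > floor) / len(values)
        if detected < min_detected:
            continue
        # Signals below the floor are raised to the floor.
        floored = [max(v, floor) for v in values]
        # Invariant or low-variation genes carry no differential signal.
        if statistics.pstdev(floored) <= min_sd:
            continue
        kept[gene] = floored
    return kept


floor = math.log2(3)  # assumed reading of "log3"
expr = {
    "geneA": [1.0, 5.2, 6.1, 7.3],  # one value is floored; gene is kept
    "geneB": [4.0, 4.0, 4.0, 4.0],  # invariant; dropped
    "geneC": [0.5, 0.7, 1.0, 6.0],  # detected in 25% of samples; dropped
}
kept = filter_genes(expr, floor)
```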
Guilt-by-association analysis
To identify genes positively or negatively correlated with each target lncRNA, TCGA data were used to compute pairwise Pearson correlations between the expression of the target lncRNA and that of all genes. Only associated genes with an absolute correlation coefficient |r| ≥ 0.4 and a significant correlation (P < 0.05) were retained. Gene ontology (GO) term enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses were performed using DAVID as previously described [39, 40].
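The correlation filter can be sketched in a few lines of standard-library Python. The function names and the toy expression vectors are hypothetical, and the P value is obtained here with a permutation test, which is a stand-in chosen to keep the example dependency-free; the text does not state which significance test was used.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p(x, y, n_perm=2000, seed=0):
    """Two-sided permutation P value for the correlation (assumed test)."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(pearson(x, y_perm)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def associated_genes(target, genes, r_min=0.4, alpha=0.05):
    """Keep genes with |r| >= r_min and P < alpha against the target lncRNA."""
    retained = {}
    for name, values in genes.items():
        r = pearson(target, values)
        if abs(r) >= r_min and permutation_p(target, values) < alpha:
            retained[name] = r
    return retained


lncrna = [1, 2, 3, 4, 5, 6, 7, 8]            # toy target lncRNA expression
genes = {
    "geneX": [2, 4, 6, 8, 10, 12, 14, 16],   # perfectly co-expressed
    "geneY": [16, 14, 12, 10, 8, 6, 4, 2],   # perfectly anti-correlated
    "geneZ": [5, 1, 4, 2, 5, 1, 4, 2],       # weak correlation, filtered out
}
hits = associated_genes(lncrna, genes)
```

The retained gene list would then be passed to an enrichment tool such as DAVID.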
Patients and clinical samples
Thirty invasive breast cancer tissues and adjacent non-cancerous tissues were obtained from patients who had not received either chemotherapy or radiotherapy and who were treated at the Department of Breast Surgery at Harbin Medical University Cancer Hospital in 2014. This study was approved by the research ethics committee of the Harbin Medical University Cancer Hospital. Written informed consent was obtained from all the patients who participated in the study. The clinicopathological characteristics of the patients are presented in Additional file 1: Table S1.
Library preparation for lncRNA sequencing
A total of 3 μg of RNA per sample was used for downstream RNA sample preparations. Ribosomal RNA was removed using the Ribo-Zero™ Gold kit (Epicentre, Wisconsin, USA). Subsequently, sequencing libraries were generated according to the manufacturer’s recommendations. The libraries were sequenced on an Illumina HiSeq 2500 platform, and 100-bp paired-end reads were generated. Raw sequencing and processed RNA-Seq data from this study have been deposited in the NCBI Gene Expression Omnibus database under accession number GSE71651 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?token=obcxosaurxoppwx&acc=GSE71651).
Cell culture experiments
MDA-MB-453 and MCF-7 cells were cultured in DMEM (Invitrogen, Carlsbad, CA) containing 10% foetal bovine serum and 100 units/ml penicillin/streptomycin at 37 °C in an atmosphere containing 5% CO2. UACC812 and T47D cells were cultured in RPMI 1640 medium (Invitrogen, Carlsbad, CA) under the same conditions. All the cell lines were obtained from the Chinese Type Culture Collection, Chinese Academy of Sciences. Cells were used during their logarithmic growth phase.
Small interfering RNA (siRNA) and qRT-PCR
Breast cancer cells were grown in complete medium before transfection with siRNAs using Lipofectamine 2000 (Life Technologies, cat. no. 11668-019) according to the manufacturer’s protocol. Two siRNAs each were designed to target TINCR (siR_1: 5′-GCAUGAAGUAGCAGGUAUUUU-3′ and siR_2: 5′-GAUCCCGAGUGAGUCAGAAUU-3′), HOTAIR (siR_1: 5′-CCACAUGAACGCCCAGAGAUU-3′ and siR_2: 5′-GAACGGGAGUACAGAGAGAUU-3′) and DSCAM-AS1 (siR_1: 5′-ACUCAUCCAUGUACCCAUUUCUUAA-3′ and siR_2: 5′-CCUCCUCCAACUGCCAUUUAUUUAU-3′). qRT-PCR analysis was performed using the SYBR-Green method, and the specific sequences of the primers used were as follows: 5′-TGTGGCCCAAACTCAGGGATACAT-3′ (forward) and 5′-AGATGACAGTGGCTGGAGTTGTCA-3′ (reverse) for TINCR; 5′-GTCCCTAATATCCCGGAGGT-3′ (forward) and 5′-GCAGGCTTCTAAATCCGTTC-3′ (reverse) for HOTAIR; 5′-GATCCTTGTTTGGTCTCACTCC-3′ (forward) and 5′-ATGCCTATGTGGGTGATTGG-3′ (reverse) for DSCAM-AS1; 5′-TTTGATGGTGACCTGGGAAT-3′ (forward) and 5′-GAACATCTGGCTGGTTCACA-3′ (reverse) for ERBB2; and 5′-ACCACAGTCCATGCCATCAC-3′ (forward) and 5′-TCCACCCTGTTGCTGTA-3′ (reverse) for GAPDH. Primers for miR-125b and U6 were obtained as previously described [41]. Quantitative normalization of target cDNA was performed for each sample using GAPDH/U6 expression as an internal control. The relative levels of TINCR, HOTAIR, DSCAM-AS1, ERBB2 and miR-125b vs. GAPDH/U6 were determined by the comparative CT (\( 2^{-\Delta\Delta CT} \)) method.
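As a worked illustration of the comparative CT calculation, the sketch below computes a relative expression level from four CT values. The function name and the CT values are hypothetical and chosen only to make the arithmetic easy to follow; they are not data from this study.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Relative expression by the comparative CT (2^-ddCT) method."""
    # dCT: target normalised to the internal reference (e.g. GAPDH or U6).
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    # ddCT: treated sample relative to the control sample.
    delta_delta = delta_sample - delta_control
    return 2 ** -delta_delta


# Hypothetical knockdown: target CT rises from 25 to 27, reference stays at 18.
fold = relative_expression(27.0, 18.0, 25.0, 18.0)  # ddCT = 2, so 2^-2 = 0.25
```

A value of 0.25 would correspond to the target transcript being reduced to a quarter of its control level.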
Cell proliferation assays
Cell proliferation assays were performed using the Cell Counting Kit-8 according to the manufacturer’s instructions (Beyotime, Shanghai, China). Briefly, \( 2 \times 10^{3} \) cells were seeded in a 96-well plate. Cell proliferation was assessed at 24, 48, and 72 h. After the addition of 20 μl of WST-1 reagents per well, cultures were incubated for 2 h, and the absorbance was measured at 450 nm using a microplate reader (BioTek, VT, United States).
Flow cytometry
An Annexin-PE Apoptosis detection kit (BD Biosciences, San Jose, CA) was used to examine cell apoptosis according to the manufacturer’s instructions. Briefly, cells were washed twice in cold PBS and then harvested and resuspended in 1× binding buffer. Next, 100 μl of the cell solution (\( 1 \times 10^{5} \) cells) was transferred to a 5-ml culture tube, and 5 μl of annexin V-PE and 5 μl of 7-AAD were added to the tube. The cells were gently vortexed and incubated for 15 min at room temperature (25 °C) in the dark. Next, 400 μl of 1× binding buffer was added to each tube, and apoptosis analysis was performed using a FACScan instrument (Becton Dickinson, Mountain View, CA, USA). For cell cycle analysis, the CycleTEST™ PLUS DNA Reagent Kit (BD, Cat. No. 340242) was used, and the experiment was performed as previously described [42].
TUNEL assays
Apoptosis-induced DNA fragmentation was assessed using the terminal deoxynucleotidyl transferase-mediated deoxyuridine triphosphate (dUTP)-digoxigenin nick end-labelling (TUNEL) assay. UACC812 and MDA-MB-453 cells were plated in 24-well flat-bottomed plates at a density of \( 1 \times 10^{5} \) cells per well, and cells were fixed in 4% (w/v) paraformaldehyde at 4 °C for 25 min. TUNEL staining was performed using the in situ cell death detection kit (Roche), and the nuclei were stained with DAPI for 10 min according to the manufacturer’s instructions. TUNEL-positive cells were imaged with a fluorescence microscope (Olympus), and the ratio of apoptotic cells was determined with ImagePro Plus software.
Statistical analyses
OS and RFS were calculated as the time from surgery until death or relapse, respectively. The expression of each lncRNA was dichotomized using the study-specific median expression as the cut-off, with “high” defined as at or above the median and “low” as below the median. The differences between the groups in our in vitro experiments were analysed using Student’s t-test. Spearman correlation coefficients were calculated for correlation analysis. All the experiments were performed in triplicate, and SPSS 16.0 software (SPSS, Chicago, IL) was used for statistical analysis. All statistical tests were two-sided, and P < 0.05 was considered to be statistically significant.
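The median-based dichotomization can be shown with a minimal Python sketch. The helper name `dichotomize` and the patient identifiers and expression values are hypothetical; the only rule taken from the text is that values at or above the study-specific median are labelled "high".

```python
import statistics


def dichotomize(expression):
    """Label each patient 'high' (>= median) or 'low' (< median)."""
    cutoff = statistics.median(expression.values())
    return {pid: ("high" if value >= cutoff else "low")
            for pid, value in expression.items()}


# Hypothetical lncRNA expression values for four patients; the median is 4.0.
expression = {"P1": 2.0, "P2": 5.0, "P3": 3.0, "P4": 8.0}
groups = dichotomize(expression)
```

The resulting two groups are what a log-rank comparison of OS or RFS would then be run on.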
Discussion
Oncogenic lncRNAs in breast cancer were comprehensively identified and validated via genome-wide in silico analysis and biological experiments in this study. To our knowledge, only a few lncRNAs that regulate tumourigenesis and are associated with prognosis in breast cancer have been identified. In this study, the identified lncRNAs were found to be associated with essential biological functions in cancer, including the regulation of immune system activation, cell adhesion, angiogenesis, ABC transporter activity, and TGF-beta and Jak-STAT signalling. Moreover, TINCR, LINC00511, and PPP1R26-AS1 were identified as HER-2, triple-negative and luminal B subtype-specific lncRNAs, respectively. In addition, BCPALs, including HOTAIR, LINC00115, MCM3AP-AS1, TINCR, PPP1R26-AS1, and DSCAM-AS1, were identified and confirmed via log-rank analysis. Furthermore, gene amplification appeared to be the main mechanism underlying the upregulation of these oncogenic lncRNAs. Finally, TINCR, HOTAIR and DSCAM-AS1, which showed the most significant differential expression in cohort I and cohort II, were selected, and their oncogenic roles were examined in vitro. Knockdown of each of these lncRNAs inhibited proliferation, increased apoptosis and inhibited cell cycle progression in breast cancer cell lines. Together, these results indicate that the lncRNAs identified and validated in this study play oncogenic roles in breast cancer.
Among the 30 aberrantly upregulated lncRNAs identified in this study, only 7 have previously been reported to be associated with malignancy, and the potential functions of the other 23 lncRNAs remain unknown. Increased SNHG3 expression is associated with malignant status and worse overall survival, recurrence-free survival and disease-free survival in hepatocellular carcinoma patients [50]. PCAT6 may play an oncogenic role in lung cancer progression, and its expression negatively correlates with the overall survival of lung cancer patients [51]. MIAT may promote the development of lung adenocarcinoma via the MIAT-miR-106-MAPK signalling pathway loop [52]. Oestrogen increases HOTAIR levels via GPER-mediated miR-148a inhibition, and HOTAIR is an independent prognostic marker of metastasis in breast cancer [53, 54]. Moreover, increased expression of DSCAM-AS1 was associated with worse overall survival in breast cancer patients according to our in silico analysis, and knockdown of DSCAM-AS1 inhibited breast cancer cell proliferation, increased apoptosis and inhibited cell cycle progression in this study. Similar results were obtained in a previous report [55]. Antisense lncRNAs can regulate their corresponding sense mRNAs at different levels, including transcriptional interference, imprinting, alternative splicing, translation and RNA editing [56]. However, the sense mRNA DSCAM was not regulated by DSCAM-AS1 [55]. Thus, trans-regulation may be associated with the role of DSCAM-AS1 in breast cancer. The roles of DLEU2 and TINCR appear to be controversial, as both lncRNAs act differently in various malignancies. DLEU2 inhibits cell proliferation and tumour progression through the regulation of miR-15a/miR-16-1 in chronic lymphocytic leukaemia [57, 58]. However, our in silico analysis suggests that, in breast cancer, it may be oncogenic, in contrast to its reported role as a tumour suppressor in chronic lymphocytic leukaemia. TINCR is induced by SP1 and promotes cell proliferation in gastric cancer and oesophageal squamous cell carcinoma, a finding that is in accordance with our results in breast cancer [59, 60]. However, in colorectal cancer, the loss of TINCR expression promotes proliferation and metastasis by activating EpCAM cleavage [61]. The functional diversity of DLEU2 and TINCR in various malignancies may explain this discrepancy, and subsequent research is necessary to reveal the potential roles of these two lncRNAs.
The identification of these oncogenic lncRNAs may have important implications for clinical practice. This study addresses three issues that have not been addressed in other studies. First, a panel of oncogenic lncRNAs was identified via genome-wide in silico analysis. This approach provides a more efficient research methodology and expands the scope of studied lncRNAs that are associated with patient outcome; hence, the drawbacks of a single-lncRNA study model were avoided. Second, the breast cancer prognosis-associated lncRNAs identified via data integration and analysis are clearly defined in this study, and they can be applied to evaluate cancer characteristics and patient survival potential. In this way, clinical examination of HOTAIR, LINC00115, MCM3AP-AS1, TINCR, PPP1R26-AS1, and DSCAM-AS1 may be beneficial for comprehensive management. Third, breast cancer subtype-specific lncRNAs were also obtained via high-throughput sequencing. The clinical significance of such oncogenic lncRNAs should not be underestimated. From a clinical viewpoint, owing to the varying objectivity of pathologists and the differing quality of the antibodies to ER, PR, HER-2 and Ki-67 used in immunohistochemical methods, determination of the molecular subtype of an individual’s breast cancer may be influenced by subjective and objective factors. Many obstacles, including economic and technical factors, must be overcome to achieve the standardization of immunohistochemical methods among laboratories and greater accessibility of quality assurance programmes. In our sequencing results and TCGA data, TINCR, LINC00511, and PPP1R26-AS1 were found to represent the HER-2, triple-negative and luminal B subtypes of breast cancer, respectively. Thus, examination of these BCSPLs may complement the use of the aforementioned parameters, and this approach may decrease the misdiagnosis rate and optimize costs in clinical practice.
We acknowledge several limitations of our study. First, not all the GEO datasets that include lncRNA expression and follow-up information could be accessed. Thus, the oncogenic lncRNA landscape of breast cancer described in this study may not be sufficiently comprehensive. Second, the biological functions and detailed mechanisms of these oncogenic lncRNAs were not elucidated. Additionally, further in-depth work with large sample sizes and a focus on molecular mechanisms will be needed to investigate oncogenic lncRNAs other than TINCR, HOTAIR and DSCAM-AS1.