Background
Lung cancer is the leading cause of cancer-related mortality worldwide, with a particularly low 5-year survival rate for patients suffering from this disease at its advanced stages. In the US, lung cancer is estimated to account for approximately one quarter (26%) of all cancer-related deaths in the year 2017 [
1]. In China, which currently hosts the largest population in the world, 730,000 new cases of lung cancer were estimated for the year 2015, along with more than 610,000 deaths [
2]. Across the globe, as incidence and mortality generally continue to rise, lung cancer has become a major public health problem and is therefore the subject of intensive biomedical and clinical research.
Breakthroughs in ‘omics’ technologies, such as genomics, transcriptomics, and proteomics, have opened avenues for a systematic approach to understanding and treating cancer [
3,
4]. In particular, a flurry of recent cancer profiling studies has focused on RNA sequencing (RNA-Seq), a rapidly maturing application of next-generation sequencing technology. Compared with microarray analysis, RNA-Seq profiling offers a larger dynamic range as well as higher sensitivity and throughput [
5]. As a result, RNA-Seq profiling has been used in several recent studies of lung cancer molecular pathogenesis, including discovery of novel mutations in key oncogenes and genomic rearrangements in squamous cell lung cancer [
6] and adenocarcinoma [
7], identification of potential biomarkers in non-small cell lung cancer (NSCLC) [
8], and quantification of expression of marker genes [
9].
One revelation largely enabled by high-throughput sequencing analysis was that non-coding RNAs make up the majority (approx. 85%) of the transcriptome. Based on transcript length, non-coding RNAs can be divided into short non-coding RNAs (sncRNAs, < 200 nucleotides) and long non-coding RNAs (lncRNAs, > 200 nucleotides) [
10]. Deregulation of lncRNAs has been well recognized in cancer, and has been suggested to modulate tumor development at chromosomal, transcriptional, and post-transcriptional levels [
10,
11]. In lung cancer, the list of implicated lncRNAs is expanding rapidly [
11]. However, much remains unknown about the mechanisms and significance of lncRNAs in many aspects of this disease, such as carcinogenesis, development, metastasis, response to anti-cancer treatment, and prognosis.
In this study, we took advantage of large-scale expression profiles and a systems biology strategy to identify lncRNAs that were significantly regulated in lung cancer specimens and strongly co-expressed with a large pool of protein-coding genes (PCGs). To detect co-expression patterns among the lncRNAs and PCGs in our TCGA datasets, weighted gene co-expression network analysis (WGCNA) was applied. WGCNA has been established as an effective data mining method for finding clusters or modules of highly correlated biomolecules and identifying intramodular “hubs”, including genes [
12], miRNAs [
13], and metabolites [
14]. Consequently, WGCNA has been successfully applied in several lung cancer profiling investigations, such as identification of differential mRNA expression [
12] and an lncRNA expression profile signature [
15] in lung squamous cell carcinoma.
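As a rough illustration of the idea behind WGCNA, the sketch below builds an unsigned weighted adjacency by raising the absolute Pearson correlation between two expression profiles to a soft-threshold power. The power `beta = 6`, the toy expression values, and all function and gene names are assumptions made for illustration only; they are not parameters or data reported in this study, and a real analysis would typically rely on the WGCNA R package.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def soft_threshold_adjacency(profiles, beta=6):
    """Unsigned weighted adjacency: a_ij = |cor(x_i, x_j)| ** beta.

    beta is a hypothetical soft-threshold power chosen for this sketch.
    """
    names = list(profiles)
    adjacency = {}
    for a in names:
        for b in names:
            if a != b:
                adjacency[(a, b)] = abs(pearson(profiles[a], profiles[b])) ** beta
    return adjacency

# Toy expression profiles across five samples (invented values).
profiles = {
    "lncRNA_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "PCG_1": [2.1, 3.9, 6.2, 8.0, 9.8],   # tracks lncRNA_A closely
    "PCG_2": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly anti-correlated with lncRNA_A
}
adj = soft_threshold_adjacency(profiles)
```

Raising the correlation to a power suppresses weak correlations far more than strong ones, which is why the weakly correlated pair ends up with a connection strength close to zero while the strongly correlated pair stays close to one.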
In the present study, we used RNA-Seq datasets from The Cancer Genome Atlas (TCGA) database to identify novel lncRNAs associated with lung cancer. LncRNA and protein-coding transcript expression profiles of lung cancer were extracted from TCGA. These datasets were then subjected to a battery of analyses, including differential expression analysis, co-expression network and cluster analyses, KEGG pathway enrichment, and survival analysis. After several rounds of screening, two largely uncharacterized lncRNAs, CTD-2510F5.4 and CTB-193M12.5, were identified. Both lncRNAs were significantly upregulated in cancerous specimens and co-expressed with 304 protein-coding genes, suggesting that a wide spectrum of PCGs may be modulated by these two lncRNAs. More importantly, expression levels of CTB-193M12.5 also showed a significant negative correlation with the prognosis of the patients from whom the RNA-Seq datasets were derived. Together, our results provide a promising lncRNA candidate for further validation and characterization by “wet bench” and clinical research.
Discussion
Recent investigations have begun to shed light on the largely unknown roles of lncRNAs, which are estimated to account for approximately 85% of the genome. More than 3000 lncRNAs have been identified so far; however, functions and biological roles have been proposed for only 1% of them, and far fewer have been characterized [
10]. Insights into the functions of the few characterized lncRNAs suggest involvement in a surprisingly diverse variety of cellular processes, from chromatin modification, transcription, splicing, and translation to cellular differentiation, cell cycle regulation, and stem cell reprogramming [
10].
The recent emergence and maturation of RNA sequencing technology have greatly facilitated the identification of lncRNAs associated with various diseases. Traditional hybridization-based approaches such as DNA microarrays suffer from several limitations, including reliance on sequenced genomes, high background levels, and a relatively narrow dynamic range. More importantly, comparison of expression profiles across different experiments is often difficult and requires complex data processing. In contrast, RNA-Seq enjoys a number of advantages, including very low background signal and a large dynamic range of detection. Furthermore, RNA-Seq enables high-throughput sequencing of transcriptomes at single-base resolution, and the resulting measurements can be compared across experiments with simple normalization algorithms. Together, these factors have made RNA-Seq an ideal choice for screening for lncRNAs with clinical significance.
Consequently, databases of publicly available RNA-Seq profiles have been constructed and continue to grow, although many of the datasets remain to be mined with comprehensive bioinformatics tools in order to reveal the identities of potential master regulators that could provide leads for validation and clinical application. In this study, we used transcriptome datasets collected with RNA-Seq to screen for potential lncRNA markers associated with lung cancer. The expression profiles were analyzed with a series of analytical tools. As a first step, lncRNAs and protein-coding genes that showed significant up- or down-regulation were identified (Fig.
1). From 592 specimens (59 normal and 533 cancerous specimens), 679 lncRNAs and 12,040 PCGs were selected for differential expression analysis, and 119 lncRNAs and 1934 PCGs were found to be differentially expressed. The large number of differentially expressed lncRNAs is consistent with the versatile roles and regulatory mechanisms of lncRNAs unveiled thus far, and suggests a vast uncharted territory regarding the roles of these biomolecules in lung cancer biology [
10,
11,
30].
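To make the differential expression step concrete, the following minimal sketch scores one transcript by its log2 fold change between tumor and normal samples together with Welch's t statistic. This section does not state which statistical test or cutoffs were used, so the test choice, the pseudocount, the thresholds (`min_abs_lfc`, `min_abs_t`), and the toy values are all assumptions for illustration; a real analysis would also correct for multiple testing.

```python
import math
from statistics import mean, variance

def log2_fold_change(normal, tumor, pseudocount=1.0):
    """log2 ratio of mean tumor to mean normal expression (pseudocount avoids log of 0)."""
    return math.log2((mean(tumor) + pseudocount) / (mean(normal) + pseudocount))

def welch_t(normal, tumor):
    """Welch's t statistic for two groups with possibly unequal variances."""
    se = math.sqrt(variance(normal) / len(normal) + variance(tumor) / len(tumor))
    return (mean(tumor) - mean(normal)) / se

def is_candidate(normal, tumor, min_abs_lfc=1.0, min_abs_t=2.0):
    """Flag a transcript as differentially expressed under hypothetical cutoffs."""
    return (abs(log2_fold_change(normal, tumor)) >= min_abs_lfc
            and abs(welch_t(normal, tumor)) >= min_abs_t)

# Invented expression values for two transcripts.
up_normal, up_tumor = [3.0, 4.0, 5.0, 4.0], [15.0, 17.0, 16.0, 18.0, 14.0]
flat_normal, flat_tumor = [10.0, 11.0, 9.0, 10.0], [10.0, 9.0, 11.0, 10.0, 10.0]

up_flag = is_candidate(up_normal, up_tumor)        # strongly upregulated in tumor
flat_flag = is_candidate(flat_normal, flat_tumor)  # unchanged between groups
```

Requiring both an effect size and a test statistic guards against flagging transcripts that differ consistently but only marginally, or that differ greatly in only a few noisy samples.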
The next step was to detect similar patterns of expression among these differentially expressed lncRNAs and PCGs. This analysis had two purposes: to identify lncRNAs and PCGs that may function in the same pathways or cellular processes, and to identify lncRNAs (hubs) that potentially play central roles in modulating the expression of targets within the co-expression module [
17,
31].
Unlike sncRNAs, lncRNAs are poorly conserved and highly versatile in modulating biomolecules. A plethora of mechanisms by which lncRNAs regulate gene expression have been reported [
10]. Due to their large size and therefore the ability to adopt complex conformations, lncRNAs can bind to DNAs, RNAs, and proteins. These interactions, in turn, enable lncRNAs to act as activators, blockers, and scaffolds of their interacting partners, including DNA, mRNAs, miRNAs, transcription factors, and chromatin regulators [
11]. At the transcriptional level, transcription of an lncRNA upstream of a target gene can facilitate or impede transcription of the target by modulating DNA conformation, RNA Pol III activity, or the association of transcription factors with promoters. In addition, lncRNAs regulate alternative splicing, or serve as mRNA stabilizers and as a reservoir of sncRNAs. Furthermore, lncRNAs can modulate genome activity by affecting histone modification, DNA methylation, and chromatin structure [
10,
11,
32].
From the 119 lncRNAs and 1934 PCGs that showed differential expression between normal and cancerous specimens, six co-expression modules were detected with WGCNA. Among these modules, the Blue module showed the strongest positive correlation with lung cancer (Fig.
4d). The five lncRNAs in this module, despite being briefly mentioned among significantly regulated genes in a handful of previous reports [
33‐
38], remain almost entirely uncharacterized. Interestingly, all five lncRNAs showed upregulation in lung cancer specimens, suggesting potential tumor-promoting roles.
Similar to protein-coding genes, lncRNAs can be classified into two major groups, tumor suppressor lncRNAs and onco-lncRNAs [
39]. Several lncRNAs have been proposed as oncogenic in lung cancer, including MALAT1 (a diagnostic and prognostic biomarker in NSCLC) [
40], AK126698 (mediates cisplatin resistance in NSCLC) [
41], and lncRNA-DQ786227 (implicated in chemical carcinogenesis) [
42]. All three onco-lncRNAs showed upregulation in lung cancer, similar to the five lncRNAs in the Blue module. Conceivably, these lncRNAs may be novel onco-lncRNAs of clinical relevance to lung cancer, although further research is warranted for validation.
As for the 304 PCGs, KEGG pathway analysis showed that they were enriched in processes closely related to lung cancer biology, such as p53 signaling, cellular senescence, DNA replication, and metabolism [
24,
25]. These enriched pathways may be used as a basis for gaining deeper insights into the five lncRNAs.
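Pathway over-representation of this kind is commonly assessed with a hypergeometric test, which asks how likely it is to see at least the observed overlap between a gene set and a pathway by chance. The sketch below is one such calculation under stated assumptions: the function name and every number in the examples are hypothetical, and this section does not specify which enrichment tool or statistic was actually used.

```python
from math import comb

def enrichment_p_value(background, pathway_size, module_size, overlap):
    """P(X >= overlap) when module_size genes are drawn without replacement
    from a background containing pathway_size pathway genes."""
    total = comb(background, module_size)
    upper = min(pathway_size, module_size)
    tail = sum(
        comb(pathway_size, i) * comb(background - pathway_size, module_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total

# Small case that can be checked by hand: (36 + 4) / 120 = 1/3.
p_small = enrichment_p_value(background=10, pathway_size=4, module_size=3, overlap=2)

# Larger, entirely hypothetical case: 10 of 300 module genes fall in a
# 150-gene pathway, against a background of 20,000 genes.
p_large = enrichment_p_value(background=20000, pathway_size=150, module_size=300, overlap=10)
```

In the larger example the expected overlap by chance is about 2.25 genes, so observing 10 yields a small p-value; in practice such p-values would be adjusted for the number of pathways tested.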
Following detection of co-expression networks, we chose the Blue module because of its strong correlation with lung cancer, and determined the hub genes in this module. To suppress background noise, 178 PCGs showing strong overall correlation with the five lncRNAs (Pearson correlation coefficient, PCC > 0.6) were selected and subjected to regulatory network analysis. Two lncRNAs, namely CTD-2510F5.4 and CTB-193M12.5, were identified as hubs of the resulting network. In addition, both lncRNAs showed the strongest overall correlation with all 304 PCGs in the Blue module, further supporting their centrality in this co-expression module. Moreover, survival analysis showed a significant correlation between high expression of either lncRNA and poor overall survival, suggesting CTD-2510F5.4 and CTB-193M12.5 as potential prognostic indicators.
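The filtering and hub-selection logic described above can be sketched as follows: keep lncRNA-PCG pairs whose PCC exceeds 0.6, then rank lncRNAs by how many PCG partners they retain. The 0.6 cutoff comes from the text; ranking hubs by partner count (network degree) is an assumption made here, since this section does not state how hubs were scored, and the names and expression values are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank_hubs(lncrnas, pcgs, threshold=0.6):
    """Count PCG partners with PCC > threshold for each lncRNA and
    return (lncRNA, degree) pairs sorted from most to least connected."""
    degrees = {}
    for lnc, lnc_expr in lncrnas.items():
        partners = [g for g, g_expr in pcgs.items()
                    if pearson(lnc_expr, g_expr) > threshold]
        degrees[lnc] = len(partners)
    return sorted(degrees.items(), key=lambda kv: kv[1], reverse=True)

# Toy profiles across five samples (hypothetical names and values).
lncrnas = {
    "lncRNA_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "lncRNA_B": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pcgs = {
    "PCG_1": [2.0, 4.0, 6.0, 8.0, 10.0],  # PCC with lncRNA_A is 1.0
    "PCG_2": [1.0, 3.0, 2.0, 5.0, 4.0],   # PCC with lncRNA_A is 0.8
    "PCG_3": [5.0, 1.0, 4.0, 2.0, 3.0],   # PCC with lncRNA_B is 1.0
}
ranked = rank_hubs(lncrnas, pcgs)
```

Here `lncRNA_A` retains two partners and `lncRNA_B` one, so `lncRNA_A` would be the stronger hub candidate under this degree-based criterion.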
Currently, very little is known about either lncRNA, and neither has an official HUGO Gene Nomenclature Committee (HGNC) symbol. CTD-2510F5.4 (GenBank accession AC099850.7) is the transcript of the gene ENSG00000265415, which is located on chromosome 17 (chromosome 17: 59,065,973–59,264,225). CTD-2510F5.4 has been reported to show consistently increased expression in relation to p53 mutations in lung adenocarcinomas [
33]. Moreover, CTD-2510F5.4 was also found to be differentially expressed in another study that used RNA-seq data from TCGA and two independent experiments of more than 60 lung adenocarcinoma specimens, which supports the validity of our results.
Functions of CTD-2510F5.4 remain to be characterized. Proline rich 11, a gene neighboring ENSG00000265415, was recently suggested as a weak prognostic factor in non-mucinous invasive lung adenocarcinoma [
43], suggesting a possible mechanism by which elevated CTD-2510F5.4 expression contributes to poor prognosis. As suggested by KEGG pathway analysis, CTD-2510F5.4 may also be implicated in key lung cancer-related cellular processes such as senescence. Upon induction of cellular senescence with overexpression of oncogene B-RAF, CTD-2510F5.4 was shown to be downregulated as compared with control cells [
34]. Since oncogene-induced senescence (OIS) is an important defense mechanism against lung cancer initiation [
44], a hypothesis could be proposed in which aberrant overexpression of CTD-2510F5.4 contributes to the survival of cells overexpressing the tumor-promoting B-RAF despite OIS, and thereby exerts oncogenic functions. More research is clearly needed to validate this hypothetical mechanism.
The other hub lncRNA, CTB-193M12.5 (GenBank accession AC026401.7), is the product of the gene ENSG00000280206, which is located on chromosome 16 (chromosome 16: 15,570,622–15,708,653). CTB-193M12.5 was found to be upregulated in lung squamous cell carcinomas in a recent report analyzing RNA-Seq profiles [
37], which is consistent with our finding of the overexpression of this lncRNA in lung cancer specimens. In addition, expression of this lncRNA was reported to be dramatically increased in gastric cancer tissues [
37] and in triple negative breast cancer cell lines and primary tumors (Cancer RNA-seq Nexus database, analysis title GSE58135) [
45]. We also sought to gain insights into the potential functions of CTB-193M12.5 by predicting its target PCGs and enriched pathways. The most significantly enriched terms suggest roles for CTB-193M12.5 in DNA and/or glycoprotein metabolism, both of which are known to be crucial in cancer progression [
29].
In summary, starting from TCGA gene transcript profiles collected from 592 specimens (59 normal and 533 cancerous), through integrated bioinformatics analyses, we identified two largely unknown lncRNAs, CTD-2510F5.4 and CTB-193M12.5. Expression levels of both lncRNAs were significantly increased in lung cancer specimens, and showed strong correlation with those of more than 300 differentially expressed protein-coding genes. Moreover, further analysis placed these lncRNAs at the center of the regulatory network consisting of the lncRNAs and PCGs in the co-expression module that showed the strongest positive correlation with lung cancer. Most importantly, high expression of CTD-2510F5.4 and CTB-193M12.5 correlated significantly with poor overall patient survival, and the prognostic value of the latter was further supported by an independent validation.
Altogether, these results provide evidence that, for the first time, correlates CTB-193M12.5 with the prognosis of lung cancer patients, and can thereby serve as a basis for further investigation towards elucidating its biological significance and clinical applications.