Background
Genetic structural variation in the human genome can take many forms, ranging from single nucleotide polymorphisms (SNPs) to large chromosomal aberrations [
1]. In the past, SNPs were regarded as the predominant form of structural variation and were thought to account for much phenotypic variation [
2,
3]. However, recent studies have shown that copy number variation (CNV) is widespread among individuals, and these observations have since been widely recognized and extended [
4‐
6]. In general, CNV is defined as a gain or loss in the number of copies of a DNA segment that is 1 kb or larger in the human genome [
1,
4,
5], and it accounts for an important part of genetic structural variation. Considerable effort in the scientific community is currently directed at cataloging and characterizing somatic CNV comprehensively [
7,
8], which provides key knowledge on how CNVs affect biological function, evolution and human disease at the genomic level.
It is generally accepted that somatic CNV is highly associated with the development and progression of numerous cancers through its impact on gene expression levels [
9‐
19]. Samulin Erdem et al. [
13] found that the Neurofascin (NFASC) gene is significantly amplified and overexpressed in non-small cell lung cancer (NSCLC) patients and identified a novel role for NFASC in the regulation of cell motility and NSCLC migration. Dong et al. [
16] analyzed copy number alterations and differentially transcribed genes in esophageal cancer and observed a noteworthy association between CNV and differential gene expression for FAM60A, TFDP1, CDC25B and MCM2. Subsequently, FAM60A was identified as a potential prognostic factor with a striking correlation to overall survival and clinical-pathological parameters. These lines of evidence suggest that differential gene expression may be a vital intermediate mechanism through which CNV affects downstream phenotypes.
Although a number of studies have explored CNV and differential gene expression for several classical oncogenes or tumor suppressor genes in different cancers [
12,
14,
17,
18,
20], there has been no systematic study of the relationship between CNV and differential gene expression across a broader spectrum of cancer types and cell lines. It is unclear to what extent expression levels are affected by CNV across the whole genome. Previous observations from a single gene or a single cancer type may not be representative of other genes or other cancer types. Here we aimed to systematically investigate the relationship between somatic CNV and differential gene expression across cell lines and different cancer types for known genes. This study may help us better understand the correlation between CNV and differential gene expression and provide new insights into the mechanisms of cancer development and progression.
Discussion
In this study, we provide strong evidence for a high correlation between CNV and differential gene expression. This finding reveals the qualitative relationship between genetic variation and its downstream effect, especially for oncogenes and tumor suppressor genes, which is of critical importance for the prevention, diagnosis and treatment of cancer. First, integrated analysis of CNV and differential gene expression in CCLE, NCI-60 and TCGA revealed a positive association between copy number and expression level, with a high Pearson’s r of fitting, positive ρ and significant p values. Moreover, in both cell lines and patients, genes with copy number amplification showed strikingly higher expression levels than genes with copy number deletion. Second, we investigated the relationship between CNV and differential gene expression for every gene across 9139 tumor samples and 1025 cell lines, respectively. Our results showed that for the majority of genes copy number had a positive linear influence on gene expression, indicating that genetic variation has a direct effect on the transcriptional level. In addition, we validated against the literature 10 genes with a significant correlation between CNV and differential gene expression (Table
1). A strong correlation was confirmed by ρ or Pearson’s r of fitting for 9 of these genes; the exception was NFASC, for which the evidence was weak, possibly because of differences in analytical method.
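To make the two statistics discussed in this paragraph concrete, the following is a minimal sketch of how Pearson’s r and Spearman’s ρ could be computed for a single gene from paired copy number and expression values. The function names and the toy values are hypothetical and are not taken from the study; the significance test (p value) is omitted, and the sketch is not a description of the pipeline actually used.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical toy data for one gene: copy number value and expression
# Z score for six samples (illustrative only, not from CCLE/NCI-60/TCGA).
copy_number = [-1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
expression_z = [-2.0, -0.4, 0.1, 0.3, 1.5, 6.0]

r = pearson_r(copy_number, expression_z)
rho = spearman_rho(copy_number, expression_z)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

In this toy example expression rises monotonically but not linearly with copy number, so ρ reaches its maximum while r stays below 1, which illustrates why the two statistics can be reported side by side.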
A recent study by the GTEx consortium associated genetic variants with gene expression levels across 44 healthy human tissues and, based on eQTL analysis, found that expression levels are affected by local genetic variation for most genes [
24]. In addition, integration of predicted copy number and corrected expression levels from 77,840 expression profiles showed a strong positive correlation between copy number and expression level for 99% of abundantly expressed human genes [
35]. Moreover, copy number has been widely reported to correlate strongly with protein expression for genes such as FGFR1 [
36], HER2 [
37,
38], MET [
39], FADD [
40], EGFR [
37]. Messenger RNA, as the intermediate between genes and functional proteins, plays a vital role in protein production. Thus, we speculate that gene expression is also correlated with protein expression, given the high correlation and concordance between CNV and differential gene expression across cell lines and TCGA datasets, and between CNV and differential protein expression reported for these five genes in the literature (Additional file
2: Table S9, Table S10). Notably, copy number amplification of FGFR1 (fibroblast growth factor receptor 1) has been found to be strikingly correlated with upregulation of both the FGFR1 gene and the FGFR1 protein in tumor samples [
18,
36]. This indicates that protein dysregulation may be attributable to the underlying copy number aberration, acting through concordant differential gene expression.
However, the expression level of a fraction of genes was unrelated to copy number and remained stable across copy number states. These genes may be involved in maintaining basal cellular functions such as metabolism and signal transduction, as suggested by significant KEGG pathway enrichment for retinol metabolism, olfactory transduction, calcium signaling pathway, neuroactive ligand-receptor interaction, etc. (Additional file
2: Table S2 and Table S3). Furthermore, it has been documented that, of 575 housekeeping (HK) genes, 24% encode metabolic proteins and 19% encode RNA-interacting proteins [
41]. We therefore examined all small nucleolar RNA genes and found that most were indeed expressed very stably regardless of CNV (Additional file
2: Table S11 and S12). Third, our results revealed very few highly discordant genes, that is, genes with copy number amplification and expression downregulation or with copy number deletion and expression upregulation (Additional file
1: Figure S4), which indicates that copy number amplification rarely causes gene expression downregulation and copy number deletion rarely promotes gene expression upregulation. Furthermore, among the highly concordant genes (copy number amplification with expression upregulation, or copy number deletion with expression downregulation), amplification with upregulation was evidently more frequent than deletion with downregulation (Fig.
3b), possibly as a result of selection acting on deletions, whereas the selective pressures on amplifications remain unknown [
1]. We also checked ten highly concordant genes against the literature (9 AUGs, 1 DDG; Table
2), and our results were highly consistent with the trends reported there.
Note that although the sample sizes of CCLE, NCI-60 and the 31 cancers in TCGA differed (Additional file
2: Table S1), they still showed a similar trend in the association between CNV and differential gene expression (Fig.
1a; Additional file
1: Figure S1 and Figure S2). Moreover, we observed a high level of agreement between cell lines and TCGA datasets, which showed a consistent distribution of genes in Fig.
2a and Additional file
1: Fig.
3a, including the ρ for 16,639 shared genes (Fig.
2c) and a comparable Pearson’s r of fitting (Fig.
2b; Additional file
1: Figure S3C and Figure S3D). Our results suggest that this phenomenon is well conserved across cell lines and tissues.
In total, we identified 925 highly concordant genes, including 560 AUGs and 365 DDGs. For example, numerous studies have reported that DERL1 overexpression is significantly related to cancer cell proliferation [
26], invasion [
42,
43] and poor prognosis [
27], which might be driven by copy number amplification, because DERL1 showed predominantly copy number amplification with expression upregulation in many cancers (Fig.
4a). CNV-driven differentially expressed genes (DEGs) may broaden our insight into the mechanisms of tumorigenesis, migration, resistance, poor prognosis, etc., as a growing number of studies on CNV-driven DEGs suggest [
16,
44,
45]. In our study, a large proportion of AUGs were associated with metabolic pathways, especially oxidative phosphorylation and GPI-anchor biosynthesis (Fig.
5a), which suggests a gain of function of metabolism-related proteins in tumors that provides more energy for cancer cells [
46‐
48]. In contrast, DDGs were significantly associated with ubiquitin-mediated proteolysis and the Wnt signaling pathway (Fig.
5b), whose dysfunction tends to lead to tumorigenesis [
29], metastasis [
49], resistance [
31], etc. With respect to Wnt signaling, loss of function of DDGs such as the inhibitory SMAD4 and APC would be expected to enhance Wnt signaling, leading to tumorigenesis [
50‐
57], while attenuated ubiquitin-mediated proteolysis facilitates proliferation [
58]. Among the highly concordant genes, we found 10 with a strong relation to patient overall survival, including 5 AUGs and 1 DDG (Table
3); of these, FYTTD1 has rarely been reported to be associated with cancer. In a further integrated analysis of CNV and differential gene expression of FYTTD1 in ESCA patients, we observed that 24.73% of patients showed a high level of copy number amplification, with a median Z score of 4.15, which means that FYTTD1 was strikingly overexpressed (Additional file
1: Figure S6). AU patients made up the overwhelming majority of these 48 copy number amplified patients (84.44%), indicating a global effect of CNV on FYTTD1 gene expression; FYTTD1 may therefore be a potential driver gene or prognostic marker in ESCA. Highly concordant AUGs and DDGs may thus provide new insights into the development and progression of cancer.
We also used an independent dataset (CCLP) to revalidate the relationship between CNV and differential gene expression. Although CCLP uses a different algorithm to calculate copy number variation, it also showed a positive correlation between copy number and expression level (Fig.
6a). Expression levels of genes with copy number amplification substantially exceeded those of genes with copy number deletion (Fig.
6b). In addition, discordant events (copy number amplification with expression downregulation, or copy number deletion with expression upregulation) made up the smallest share of copy number aberrant counts (1%). Consistently, most genes showed an overwhelming level of either copy number amplification with expression upregulation or copy number deletion with expression downregulation (93%, ratio > 0.9), and hardly any genes showed high levels of both (Fig.
6c).
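The bookkeeping behind such percentages can be sketched as follows: each sample of a gene is labelled by its copy number state and expression state, and the share of each label is taken relative to the copy number aberrant counts. This is only an illustrative sketch. The cut-offs `CN_CUTOFF` and `Z_CUTOFF`, the function names, the labels other than AU, and the toy values are all hypothetical assumptions and are not the thresholds or definitions used in the study.

```python
from collections import Counter

# Hypothetical cut-offs, chosen only for illustration.
CN_CUTOFF = 1.0  # |copy number value| at or above this counts as aberrant
Z_CUTOFF = 2.0   # |expression Z score| at or above this counts as differential


def classify(cn, z, cn_cutoff=CN_CUTOFF, z_cutoff=Z_CUTOFF):
    """Label one sample: A/D for amplification/deletion, then U/D/N for
    up-regulated, down-regulated or not differentially expressed.
    Returns None for copy number neutral samples."""
    if cn >= cn_cutoff:
        cn_state = "A"
    elif cn <= -cn_cutoff:
        cn_state = "D"
    else:
        return None
    if z >= z_cutoff:
        return cn_state + "U"
    if z <= -z_cutoff:
        return cn_state + "D"
    return cn_state + "N"


def category_fractions(samples):
    """Share of each label among the copy number aberrant samples."""
    labels = [classify(cn, z) for cn, z in samples]
    counts = Counter(label for label in labels if label is not None)
    aberrant = sum(counts.values())
    if aberrant == 0:
        return {}
    return {label: n / aberrant for label, n in counts.items()}


# Hypothetical (copy number, expression Z score) pairs for one gene.
toy_gene = [
    (2.0, 4.1), (1.5, 3.0), (1.2, 2.5),  # amplified and up-regulated (AU)
    (1.1, 0.3),                          # amplified, expression unchanged
    (-1.3, -2.2),                        # deleted and down-regulated (DD)
    (0.1, 0.0),                          # copy number neutral, ignored
    (1.4, -2.5),                         # amplified but down-regulated
]

fractions = category_fractions(toy_gene)
for label in sorted(fractions):
    print(label, round(fractions[label], 3))
```

Under these assumptions, a per-gene cut-off such as a ratio above 0.9 could then be applied to the resulting fractions, although the exact ratio definition would have to follow the study's Methods.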