Background
Renal cell carcinoma (RCC) ranks among the ten most commonly diagnosed cancers worldwide, accounting for 5% and 3% of all new cancer cases in males and females, respectively [1]. According to the latest data from the World Health Organization, there are more than 140,000 RCC-related deaths per year [2]. Among the RCC subtypes, clear-cell renal cell carcinoma (ccRCC) is the most common and accounts for the majority of kidney cancer deaths [3]. Therefore, it is crucial to identify reliable prognostic tools that predict clinical outcomes and support decisions regarding observation, surgery, drug therapy and conservative options.
Biomarkers used to predict overall survival (OS) range from clinical parameters, endogenous substances and histopathological characteristics of the tumor to specific mutated genes. For example, the tumor node metastasis (TNM) classification system is the most widely used system for estimating prognosis and guiding treatment in patients with cancer [4]. In addition, a growing number of single-gene signatures have been explored to predict the OS of ccRCC patients, such as CX3CR1 [5], miR-497 [6] and lncRNA CADM1-AS1 [7]. However, predicting the survival of patients with ccRCC from a single parameter is challenging because of the wide variability of outcomes and genetic heterogeneity [8]. Thus, developing a comprehensive prognostic evaluation system that includes multiple biomarkers should improve predictive accuracy.
Gene-based prognostic models that incorporate clinical parameters to predict the OS of cancer patients, including those with ccRCC, have been investigated extensively, but they have not been widely accepted or applied in clinical practice [9-11]. Therefore, novel prognosis-related genes uncovered by different bioinformatics workflows could be used to establish prognostic models that are more accurate than conventional clinical parameters alone.
In this study, we constructed a model based on multiple prognosis-related genes and clinical parameters to predict the OS of ccRCC patients. We screened high-throughput sequencing data from The Cancer Genome Atlas (TCGA) for differentially expressed genes (DEGs) and used univariate Cox proportional hazards regression analysis, the least absolute shrinkage and selection operator (LASSO) method and best subset regression (BSR) to identify the five-gene group with the lowest Akaike information criterion (AIC) value. The risk score was calculated as the sum of the expression level of each gene multiplied by its multivariate Cox regression coefficient. External validation was performed to verify the risk score model. The risk score and clinical parameters were then combined to construct a nomogram, which was assessed by calibration plots and time-dependent receiver operating characteristic (tROC) curve analysis. Furthermore, we performed an internal validation to verify the model. Finally, functional enrichment analysis was performed to identify the potential biological pathways of the DEGs and the five novel genes.
Materials and methods
Dataset sources and processing
Raw counts of RNA-sequencing data (level 3) and the corresponding clinical information (Additional file 1: Table S1) for 533 kidney renal clear cell carcinoma (KIRC) samples and 78 paracancerous samples were obtained from The Cancer Genome Atlas (TCGA) dataset (https://portal.gdc.cancer.gov/) in April 2018; data acquisition and use complied with the relevant guidelines and policies. To ensure data completeness, patients who met either of the following criteria were excluded from subsequent analysis: (1) survival time of less than 30 days; (2) insufficient information on TNM classification, stage, grade, recurrence, age or gender. Finally, 504 tumor samples, each from a different individual, and 71 paracancerous samples were selected for this study. The patients (n = 504) were then randomly assigned to a training set and a testing set at a ratio of 7 to 3. Entrez IDs in the gene expression data were converted to gene IDs using a GTF file downloaded from GENCODE (https://www.gencodegenes.org/). Genes whose expression level summed across the samples was less than 1 were excluded, and 19,651 protein-coding genes annotated with the gene IDs above were selected for further analysis.
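As a rough illustration of the 7 to 3 random assignment described above, the following Python sketch shuffles 504 placeholder patient identifiers and splits them. It is not the procedure used in the study; the function name `split_train_test`, the seed and the identifiers are assumptions made only for this example.

```python
import random

def split_train_test(patient_ids, train_fraction=0.7, seed=2018):
    """Randomly split patient IDs into a training set and a testing set."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed (an assumption) so the split is reproducible
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

# Placeholder identifiers, not real TCGA barcodes.
patients = [f"P{i:03d}" for i in range(504)]
train, test = split_train_test(patients)
print(len(train), len(test))  # 353 151
```

With 504 patients, a 0.7 fraction rounds to 353 training patients and leaves 151 for testing, which matches the set sizes reported in the text.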
In addition, the microarray dataset GSE29609, which includes 39 KIRC patients with corresponding clinical information (Additional file 1: Table S1), was downloaded from GEO (http://www.ncbi.nlm.nih.gov/geo/) for external validation. This dataset was generated on the Agilent-012391 Whole Human Genome Oligo Microarray G4112A platform. The normalized expression matrix of the microarray data was downloaded directly from the dataset, and the probes were annotated using the corresponding annotation files from the dataset. A principal component analysis (PCA) was then used to check whether the microarray dataset showed a batch effect, and the “sva” R package was used to remove the batch effect [12].
Differential gene expression analysis of ccRCC
The raw count data of the mRNA profiles of the tumor and paracancerous groups in the TCGA ccRCC dataset were normalized and quantile filtered by the “voom” transformation, and the differentially expressed genes (DEGs) were identified using the “limma” package of R software [13]. DEGs, including significantly upregulated and downregulated genes, were retained for subsequent analysis if they had an adjusted p value < 0.05 and an absolute log2 fold change (FC) > 4.
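The two thresholds can be applied to a table of differential expression results as in the minimal Python sketch below. The gene names and statistics are made up, and `select_degs` is a hypothetical helper; in the study the statistics themselves came from limma, which is not reproduced here.

```python
def select_degs(results, p_cutoff=0.05, lfc_cutoff=4.0):
    """Split genes into up- and downregulated DEGs.

    results maps gene -> (log2 fold change, adjusted p value).
    """
    up, down = [], []
    for gene, (log2fc, adj_p) in results.items():
        # Keep a gene only if it passes both thresholds.
        if adj_p < p_cutoff and abs(log2fc) > lfc_cutoff:
            (up if log2fc > 0 else down).append(gene)
    return sorted(up), sorted(down)

# Made-up statistics for illustration only.
toy_results = {
    "geneA": (5.2, 1e-8),   # upregulated, passes both thresholds
    "geneB": (-4.6, 3e-4),  # downregulated, passes both thresholds
    "geneC": (4.5, 0.20),   # large change but not significant
    "geneD": (1.1, 1e-6),   # significant but small change
}
up_degs, down_degs = select_degs(toy_results)
print(up_degs, down_degs)  # ['geneA'] ['geneB']
```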
Selection and verification of prognosis-related genes
The raw counts of RNA-sequencing data were normalized with the transcripts per million (TPM) method and then log2-transformed (log2TPM) for subsequent survival analysis.
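For readers unfamiliar with the TPM calculation, a minimal Python sketch for a single sample follows. The counts and gene lengths are invented, and the pseudocount of 1 added before the log2 transformation is an assumption of this sketch (it avoids taking the logarithm of zero); the text does not state how zero values were handled.

```python
import math

def counts_to_log2_tpm(counts, lengths_kb, pseudocount=1.0):
    """Convert raw counts for one sample to log2(TPM + pseudocount)."""
    # Length-normalized rate per gene: reads per kilobase of transcript.
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("sample has no counts")
    # Scale so that the rates of the sample sum to one million (TPM),
    # then log2-transform. The pseudocount is an assumption of this sketch.
    return {gene: math.log2(rate / total * 1e6 + pseudocount)
            for gene, rate in rates.items()}

sample_counts = {"geneA": 100, "geneB": 300, "geneC": 0}      # made-up counts
gene_lengths_kb = {"geneA": 1.0, "geneB": 3.0, "geneC": 2.0}  # made-up lengths
print(counts_to_log2_tpm(sample_counts, gene_lengths_kb))
```

In this toy sample geneA and geneB receive the same TPM because geneB has three times the counts but is three times as long.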
The normalized expression data from the training set (n = 353) were then used to build a multi-gene signature to predict prognosis in ccRCC. First, the log2TPM expression data and the corresponding clinical information were used to screen for prognosis-related genes by univariate Cox proportional hazards regression analysis (hazard ratio (HR) ≠ 1, p < 0.05). The prognosis-related genes with HR > 1 (higher expression indicates poorer prognosis) were then intersected with the upregulated DEGs to obtain one set of candidate genes, and the prognosis-related genes with HR < 1 (lower expression indicates poorer prognosis) were intersected with the downregulated DEGs to obtain another set. Finally, these two sets of genes, together called overlapping candidate genes (OCGs), were used for subsequent analysis.
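The direction-matched intersection can be written compactly with sets. The sketch below is only illustrative: `overlapping_candidates` is a hypothetical helper, and the hazard ratios, p values and gene names are invented.

```python
def overlapping_candidates(cox, up_degs, down_degs, alpha=0.05):
    """Return the overlapping candidate genes (OCGs).

    cox maps gene -> (hazard ratio, p value) from univariate Cox regression.
    """
    risk = {g for g, (hr, p) in cox.items() if p < alpha and hr > 1}
    protective = {g for g, (hr, p) in cox.items() if p < alpha and hr < 1}
    # Upregulated DEGs must have HR > 1; downregulated DEGs must have HR < 1.
    return (risk & set(up_degs)) | (protective & set(down_degs))

# Invented univariate Cox results for illustration only.
toy_cox = {
    "g1": (1.8, 0.01),   # significant, HR > 1
    "g2": (0.6, 0.02),   # significant, HR < 1
    "g3": (1.5, 0.30),   # not significant
    "g4": (0.7, 0.001),  # significant, HR < 1, but not a downregulated DEG
}
ocgs = overlapping_candidates(toy_cox, up_degs=["g1", "g3"], down_degs=["g2"])
print(sorted(ocgs))  # ['g1', 'g2']
```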
LASSO (least absolute shrinkage and selection operator) regression was applied to construct a multi-gene signature from the OCGs for predicting prognosis in ccRCC, using the “glmnet” package of R software [14]. To improve the reliability and objectivity of the analysis, tenfold cross-validation was performed to identify the optimal lambda value, defined as the value that gave the minimum partial likelihood deviance.
The prognosis-related genes selected by the LASSO algorithm with tenfold cross-validation were then analyzed by BSR, an exploratory model-building regression analysis that compares all possible models built from an identified set of genes. The algorithm is summarized as follows:
1. Suppose that A prognosis-related genes (A = number) were selected by the LASSO algorithm.
2. Choose k genes from the A genes to construct the models C (A, k), and calculate the Akaike information criterion (AIC) of each model with the “glmulti” package of R software [15].
3. Select the model with the smallest AIC (sAIC), CsAIC (A, k), as the optimal model consisting of k genes.
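The exhaustive search in the steps above can be sketched in Python as follows. This is not the “glmulti” implementation: the log-likelihood function is supplied by the caller, and the toy version used here (a fixed gain per gene) is an assumption that stands in for a fitted Cox model. AIC is computed as 2k minus twice the log-likelihood, and subset size is capped at five, as in the study.

```python
from itertools import combinations

def best_subset_by_aic(genes, log_likelihood, max_k=5):
    """Enumerate all subsets of up to max_k genes and return (AIC, subset)
    for the subset with the smallest AIC."""
    best = None
    for k in range(1, min(max_k, len(genes)) + 1):
        for subset in combinations(genes, k):
            # AIC = 2 * (number of parameters) - 2 * log-likelihood.
            aic = 2 * k - 2 * log_likelihood(subset)
            if best is None or aic < best[0]:
                best = (aic, subset)
    return best

# Toy stand-in for a fitted model: each gene adds a fixed, made-up gain.
toy_gain = {"g1": 3.0, "g2": 0.5, "g3": 2.0}

def toy_log_likelihood(subset):
    return -100.0 + sum(toy_gain[g] for g in subset)

print(best_subset_by_aic(list(toy_gain), toy_log_likelihood))
# (194.0, ('g1', 'g3'))
```

In the toy example g2 is left out because its gain in log-likelihood (0.5) does not offset the penalty of one extra parameter.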
However, because a model with fewer biomarkers is more practical in the clinic, the maximum value of k was set to five [9, 16, 17]. Patients in the training set were then divided into two groups according to the expression of each gene in CsAIC (A, k) selected by BSR: high expression (log2TPM higher than the cutpoint, which was determined with the “survminer” package of R software [18]) and low expression (log2TPM lower than the cutpoint). Kaplan–Meier (KM) curves and log-rank tests were then generated with the R package “survival” [19] to show the relationship between the expression of candidate genes and OS in ccRCC patients.
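To make the KM estimate concrete, the sketch below computes the product-limit survival curve for one group in plain Python. It is a simplified stand-in for the R “survival” package: it returns only the survival probabilities at event times and does not perform the log-rank test. The follow-up times and event indicators are invented.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the patient died at times[i] and 0 if censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        n_at_risk -= leaving
    return curve

# Invented follow-up times (months) and event indicators.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# [(1, 0.75), (3, 0.375), (4, 0.0)]
```

The censored patient at month 2 does not lower the curve but does shrink the risk set, which is why the drop at month 3 is from 0.75 to 0.375.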
Establishment and evaluation of multi-gene prognostic signature
The regression coefficients of the five optimal prognostic genes were derived from the multivariate Cox proportional hazards regression model. Subsequently, the expression level and coefficient of each gene were combined linearly to obtain the risk score formula:
$$\text{Risk score} = \sum_{i = 1}^{5} \beta_{i} \times \mathrm{Exp}_{i}$$
where Exp is the expression level of each prognostic gene and β is the corresponding regression coefficient.
The patients in the training set were stratified into high-risk and low-risk groups using the median risk score as the cutoff. KM survival analysis with the log-rank test was used to compare survival between these two groups. Univariate Cox proportional hazards regression analysis was performed to compare the prognostic power of the risk score with that of clinical parameters, including T-stage, N-stage, M-stage, AJCC-stage, grade, gender, age, laterality and recurrence. Furthermore, multivariate Cox proportional hazards regression analysis was used to determine whether the risk score, based on risk level, was an independent prognostic factor in ccRCC patients. Clinical parameters that were statistically significant (p < 0.05) in the univariate Cox proportional hazards regression were also incorporated into this analysis.
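The risk score and the median-based stratification can be illustrated with the short Python sketch below. The coefficients and expression values are placeholders chosen for the example, not the fitted values from the study, and treating scores strictly above the cutoff as high risk is an assumption, since the text does not say how ties at the median were handled.

```python
from statistics import median

def risk_score(expression, coefficients):
    """Linear combination of expression levels and Cox coefficients."""
    return sum(coefficients[g] * expression[g] for g in coefficients)

def stratify(scores, cutoff):
    """Label each patient as high or low risk relative to the cutoff."""
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}

# Placeholder coefficients: NOT the values estimated in the study.
coefs = {"PADI1": 0.5, "ATP6V0D2": -0.25, "DPP6": -0.5,
         "C9orf135": -0.25, "PLG": -0.125}

# Placeholder log2TPM values for three hypothetical training patients.
training = {
    "P1": {"PADI1": 4.0, "ATP6V0D2": 2.0, "DPP6": 1.0, "C9orf135": 2.0, "PLG": 8.0},
    "P2": {"PADI1": 1.0, "ATP6V0D2": 4.0, "DPP6": 2.0, "C9orf135": 4.0, "PLG": 8.0},
    "P3": {"PADI1": 2.0, "ATP6V0D2": 2.0, "DPP6": 2.0, "C9orf135": 2.0, "PLG": 4.0},
}
scores = {pid: risk_score(expr, coefs) for pid, expr in training.items()}
cutoff = median(scores.values())  # median risk score of the training set
print(stratify(scores, cutoff))
```

Keeping the cutoff as a separate value makes it straightforward to reuse the training-set median when scoring patients from another cohort.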
To explore the prognostic capability of the multi-gene prognostic signature at different levels of the other clinical prognostic parameters, KM curves were used to compare high-risk and low-risk patients within subgroups defined by AJCC-stage, grade, age, gender, laterality and recurrence in the training set. In addition, tROC analysis was performed to compare the predictive accuracy of each gene with that of the risk score.
Validation of multi-gene prognostic signature
For internal and external validation, the testing set (n = 151), the whole set (n = 504) and the external validation set (n = 39) were used to validate the predictive capability and applicability of the multi-gene prognostic signature in ccRCC. In each validation set, the risk score of each patient was calculated using the coefficients of the five genes above. The patients were then stratified into high-risk and low-risk groups by the median risk score from the training set. KM survival analysis with the log-rank test and tROC analysis were used to validate the multi-gene prognostic signature.
Images of immunohistochemistry (IHC) staining of the selected prognosis-related genes in normal tissue and ccRCC tissue were retrieved from the Human Protein Atlas online database (http://www.proteinatlas.org). Moreover, the mutation types of the finally selected prognosis-related genes were explored in cBioPortal (http://cbioportal.org).
Construction and validation of gene prognostic nomogram
A composite nomogram was constructed from all independent prognostic parameters identified by the univariate and multivariate Cox proportional hazards regression analyses above to predict the probability of 1-year, 3-year and 5-year OS, using the “rms” package of R software [20].
The tROC curves were plotted with the R package “survivalROC” [21] to assess the predictive accuracy of the independent prognostic parameters, including AJCC-stage, risk level and the gene prognostic nomogram. The area under the ROC curve (AUC) was calculated to compare the discriminatory ability of these prognostic parameters. Calibration curves, generated by a bootstrap method with 1000 resamples, were then used to compare the outcomes predicted by the nomogram with the observed outcomes in the training set at the corresponding time points; the 45° line represents perfect prediction. The same methods were applied to the testing set and the whole set to validate the results.
Functional enrichment analysis of DEGs and prognosis-related genes
For the screened DEGs, gene ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were performed with the online tool Metascape [22] (http://metascape.org/gp/index.html#/main/step1). A p value < 0.05 was considered statistically significant.
For the final prognosis-related genes used for nomogram construction, Gene Set Enrichment Analysis (GSEA) was performed to identify potential biological pathways. The whole set of 504 ccRCC samples was divided into two groups based on the median expression of each prognosis-related gene. GSEA software (v3.0, http://software.broadinstitute.org/gsea/) was then run on the Java 8 platform. The annotated gene set collection c2.cp.kegg.v6.2.symbols.gmt, obtained from the Molecular Signatures Database (MSigDB), was chosen as the reference to calculate the enrichment score (ES), which estimates whether genes from a predefined gene set are enriched in the high- or low-expression group of each prognosis-related gene or are distributed randomly. The number of permutations was set to 1000. Gene sets containing fewer than 15 or more than 500 genes were excluded. A gene set was considered enriched when the nominal p value was < 0.05 and the FDR was < 0.05 [23].
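The idea behind the enrichment score can be illustrated with a simplified running-sum statistic in Python. This sketch is not the GSEA software: it omits the permutation test, normalization and FDR estimation, and the ranked list, ranking metric and gene set are invented. The weighting exponent of 1 is an assumption of this sketch.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    ranked is a list of (gene, ranking metric) sorted from the most
    positively to the most negatively associated gene.
    """
    in_set = [gene in gene_set for gene, _ in ranked]
    n_hits = sum(in_set)
    n_miss = len(ranked) - n_hits
    # Total weight of the genes in the set, used to normalize each step up.
    norm = sum(abs(m) ** weight for (_, m), hit in zip(ranked, in_set) if hit)
    if n_hits == 0 or n_miss == 0 or norm == 0:
        raise ValueError("gene set must partly overlap the ranked list")
    running, es = 0.0, 0.0
    for (_, metric), hit in zip(ranked, in_set):
        if hit:
            running += abs(metric) ** weight / norm  # step up at a set member
        else:
            running -= 1.0 / n_miss                  # step down otherwise
        if abs(running) > abs(es):
            es = running  # keep the maximum deviation from zero
    return es

# Invented ranking: positive metrics mean higher expression in one group.
toy_ranked = [("a", 3.0), ("b", 2.0), ("c", 1.0), ("d", -1.0)]
print(enrichment_score(toy_ranked, {"a", "b"}))  # set members at the top
print(enrichment_score(toy_ranked, {"c", "d"}))  # set members at the bottom
```

A gene set concentrated at the top of the ranking yields a positive score and one concentrated at the bottom yields a negative score; significance would still have to be assessed by permutation, as described in the text.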
Statistical analysis
The tumor samples were randomly divided into two groups using the “sample” function of R software. The heatmap of DEGs was plotted using the “pheatmap” R package [24] with zero-mean normalization. PCA, performed with the “ggfortify” R package [25], was used to assess the batch effect and the clustering result. Differences between the two groups shown in boxplots were assessed using the Wilcoxon test. For Kaplan–Meier curves, p values and hazard ratios (HRs) with 95% confidence intervals (CIs) were generated by log-rank tests and univariate Cox proportional hazards regression. All analyses above were performed in R software version 3.6.1 (The R Foundation for Statistical Computing, 2019). All statistical tests were two-sided, and p < 0.05 was considered statistically significant.
Discussion
During the last two decades, the incidence of renal cell carcinoma has increased significantly and mortality has remained discouraging [2, 27]. Identifying effective prognostic biomarkers and using them to construct reliable prognostic tools for predicting the survival of ccRCC patients is therefore advisable for clinical practice. At present, the TNM staging system is commonly used to predict the prognosis of ccRCC patients [28]. However, as discussed above, a single clinical parameter has limited prognostic power. Thus, combining it with other prognostic parameters should improve the accuracy of prediction.
In the current study, DEGs between ccRCC and normal tissue were identified first and were found to be principally enriched in the basolateral plasma membrane, anchored component of membrane, the PPAR signaling pathway and cell adhesion molecules (CAMs). The genes shared between the DEGs and the prognosis-related genes identified by univariate Cox regression in the training set were then entered into LASSO regression with tenfold cross-validation and BSR to screen out five novel DEGs (PADI1, ATP6V0D2, DPP6, C9orf135 and PLG). The order and the content of these screening methods differ from those used in most previous studies.
To the best of our knowledge, no previous study has used a screening strategy like ours to identify upregulated DEGs with HR > 1 and downregulated DEGs with HR < 1. This strategy excludes combinations such as upregulated DEGs with HR < 1 and downregulated DEGs with HR > 1, which are impractical in clinical practice. The five novel genes are significantly related to the OS of ccRCC patients. While PLG, DPP6, ATP6V0D2 and C9orf135 are negative prognostic genes, PADI1 is a positive prognostic gene.
PLG plays an important role in tissue remodeling during development, physical injury, inflammation and carcinogenesis. Together with matrix metalloproteases such as collagenases, gelatinases and stromelysins, it helps degrade the extracellular matrix; all of these proteases play a vital role in cancer invasion, especially in lung and breast cancer [29, 30]. However, PLG is not only a pro-tumorigenic factor but also an anti-tumorigenic factor, because proteolysis of PLG can release angiostatin, which acts against cancer progression [31]. This may explain why the expression level of PLG in ccRCC samples was lower than that in adjacent normal tissue in our study, which suggests that low expression of PLG is important for ccRCC progression. In addition to our results, downregulation of PLG in ccRCC was confirmed by Schrödter et al., who screened DEGs using a microarray and qPCR [32]. PLG was also identified as a hub gene in other studies, which suggests that it might play a major role in ccRCC [33, 34]. The association between low expression of PLG and worse OS in ccRCC patients was verified by Wang et al. using UALCAN [34]. Our GSEA showed that low expression of PLG may also negatively regulate the p53 signaling pathway to promote ccRCC progression.
DPP6 is known as a protein that modulates A-type potassium channels in the somatodendritic compartments of neurons and thereby plays a role in synaptic plasticity [35]. Nevertheless, recent research has found that DPP6 can regulate various biological functions and maintain cell-specific phenotypes, and that dysregulated expression of DPP6 can result in carcinogenesis [36, 37]. DPP6 was reported to be downregulated in acute myeloid leukemia and melanoma but upregulated in colon cancer, probably as a result of hyper- and hypomethylation, respectively [38-40]. Song et al. also found, by analyzing the GEO and TCGA databases, that DPP6 was downregulated in ccRCC samples compared with normal tissue [41]. However, few studies have addressed the role of DPP6 in ccRCC so far.
PADI1 belongs to the peptidyl arginine deiminase family, which consists of five members (PADI1-4 and PADI6) in humans. These enzymes catalyze the citrullination of proteins [42]. When this process is upregulated, it disturbs the stability of proteins and causes DNA damage, which is associated with carcinogenesis in the stomach, the liver and the large intestine, with oral squamous cell carcinoma, and with other cancers [42-44]. Interestingly, overexpression of PADI driven by MZF1 and Sp1/Sp3 binding to the promoter region can citrullinate PKM2 and stimulate glycolysis in cancer cells [45, 46]. However, to the best of our knowledge, the specific relationship between PADI1 and ccRCC remains ill-defined.
ATP6V0D2 encodes an H+-transporting protein in the plasma membrane of cells, especially osteoclasts [47]. When ATP6V0D2 is downregulated, the intracellular and extracellular acidic environment becomes dysregulated. Some research suggests that a high intracellular pH and a low extracellular pH give cancer cells a competitive growth advantage over normal cells [48]. However, the specific relationship between ATP6V0D2 dysregulation and tumor acidity remains uncertain. Downregulated ATP6V0D2 probably acts by increasing HIF-2α expression in macrophages to enhance tumor vascularization and growth [49]. Previous studies showed elevated expression of ATP6V0D2 in stomach cancer specimens but reduced expression in colorectal and renal cancer specimens, which is consistent with our findings [50, 51]. So far, however, no research on the specific mechanism linking ATP6V0D2 to ccRCC has been reported.
C9orf135 (chromosome 9 open reading frame 135) encodes a membrane-associated protein whose expression is related to pluripotency in human embryonic stem cells (hESCs). The expression of C9orf135 is regulated by OCT4 and SOX2 and decreases during hESC differentiation [52]. However, the role of C9orf135 in cancer has not been widely characterized. Ye et al. reported that its expression was downregulated in nasopharyngeal carcinoma [53]. Our GSEA suggests that low expression of C9orf135 may promote ccRCC formation by affecting the PPAR signaling pathway.
Taken together, we revealed the correlation between the expression levels of the five novel genes and the OS of ccRCC patients, and GSEA was performed to identify the potential biological pathways through which these genes act in ccRCC formation and progression. Given their involvement in carcinogenesis and their significant relevance to the prognosis of ccRCC patients, these five genes may serve as novel cancer biomarkers once their specific roles in ccRCC have been explored more widely and deeply.
After the five prognostic genes were identified, a five-gene prognostic signature was developed and its prognostic value in ccRCC patients was investigated. Patients in the high-risk group showed a significantly poorer prognosis than patients in the low-risk group. Moreover, the five-gene prognostic signature could be applied in different subgroups, namely stage I/II, stage III/IV, grade 1/2, grade 3/4, male, female, younger (≤ 65 years old), older (≥ 65 years old), left and right site, and recurrence. Prognosis differed significantly between the high-risk and low-risk levels in these subgroups, and all high-risk groups had worse OS than the corresponding low-risk groups. The novel gene model could therefore be used to stratify ccRCC patients within these subgroups into high-risk and low-risk groups and help clinicians make better-informed clinical decisions.
The univariate and multivariate Cox regression analyses then showed that the five-gene prognostic signature could be an independent factor for evaluating prognosis. Internal and external validation were also conducted to confirm its predictive value. Further, time-dependent ROC analysis of each gene showed that the sensitivity and specificity of any single gene were poorer than those of the five-gene prognostic signature, which suggests that combining multiple variables improves predictive power. However, the AUCs of the five-gene prognostic signature for 1-year, 3-year and 5-year OS were slightly smaller than those of AJCC-stage in the three sets (Fig. 9). To improve the prognostic ability of the five-gene prognostic signature, a highly accurate predictive nomogram was constructed that integrates the risk score with conventional clinical prognostic parameters, namely AJCC-stage, age and recurrence, all of which were verified as independent prognostic factors for the OS of ccRCC patients by univariate and multivariate Cox proportional hazards regression analysis. The nomogram can be used to predict an individual's 1-, 3- and 5-year OS probability from the risk score and the other conventional clinical prognostic parameters. Time-dependent ROC analysis in the three sets revealed that the nomogram had the best predictive power for 1-, 3- and 5-year OS compared with the risk score system and AJCC-stage (Fig. 9). Excellent agreement between the predicted and observed outcomes was observed in the calibration plot of our nomogram in the training set, and satisfactory agreement was also seen in the internal validation set and the whole set. Therefore, our five-gene-based prognostic nomogram may help clinicians predict the survival outcomes of ccRCC patients and may provide a better reference for therapy guidance than a single conventional clinical parameter. In addition, given the considerable clinical significance of these five prognostic genes suggested by our study, our findings highlight the need for follow-up functional experiments.
However, several limitations of our study should be acknowledged. First, our study focused only on the large-scale mRNA sequencing data from the TCGA platform. Other types of data, such as single nucleotide polymorphisms (SNPs), copy number variations (CNVs) and DNA methylation, are also provided by this public dataset. The five novel biomarkers could be analyzed further to determine whether their expression levels are related to these types of alteration. Second, the significant differences in the protein expression levels of the five genes between tumor and normal tissues were detected in the TCGA database, in which patients are mainly Asian and White. More public databases or experiments are needed to explore whether their expression levels differ geographically. Third, our study provides evidence that the five novel genes are significantly related to the survival of ccRCC patients and may become therapeutic targets for precision medicine in the future, but this evidence comes from data mining alone. Functional experiments that reveal the roles of these genes in cancer are therefore valuable and crucial.