Background
Non-small cell lung carcinoma (NSCLC) accounts for more than 80% of lung cancers; lung cancer is the second most common cancer and the most common cause of cancer-associated death worldwide [1, 2]. With the development of diagnostic techniques, more NSCLC patients will be diagnosed at an earlier stage [3, 4]. These patients have a relatively better prognosis, but some still develop recurrent cancer, and about 40% of them will die of cancer within 5 years [5, 6]. NSCLC is also highly heterogeneous: 45% of cases are lung squamous cell carcinoma (LUSC) and 30% are lung adenocarcinoma (LUAD) [7]. Histological and genetic diversity can account for some of the individual variation in NSCLC survival. Therefore, identifying survival-associated molecular subtypes for early-stage NSCLC patients will benefit early treatment and patient prognosis.
Molecular subtyping has been used to explore NSCLC heterogeneity. Gene expression subtypes of LUSC and LUAD have each been proposed by The Cancer Genome Atlas (TCGA) research network [8, 9]. A recent study also identified a multiplatform-based molecular classification comprising 9 subtypes for 1023 NSCLC patients [10]. Other lung cancer molecular subtypes have been defined from different gene sets [11, 12]. However, early-stage NSCLC has some distinct molecular characteristics. In addition, patient prognostic information was not well utilized in these subtypes and gene sets, leading to weak predictive ability for patient prognosis.
In this study, we analyzed gene expression and DNA methylation data for early-stage NSCLC and proposed prognosis-related molecular subtypes for LUSC and LUAD. We then explored the functions of the differentially expressed genes and differentially methylated genes. We also selected biomarkers and built prediction models for each subtype in a training dataset, and validated the models in a test dataset. The prediction models were evaluated by sensitivity (SE), specificity (SP) and area under the ROC curve (AUC). Furthermore, we analyzed the molecular functions of these biomarkers in cancer development.
Methods
Datasets and preprocessing
RNA-Seq data, DNA methylation data and clinical information of NSCLC patients were downloaded from the UCSC Xena website (http://xena.ucsc.edu/). The RNA-Seq data were log2-transformed RSEM normalized counts mapped to HUGO gene symbols. DNA methylation levels were represented by β-values (from 0 to 1). Methylation probes were removed if they met any of the following criteria: 1) located on the X or Y chromosome; 2) a SNP present within the probe assay; 3) not annotated to any reference gene; 4) located in CpG island shelf or open sea regions. Genes and methylation sites with missing values in more than 20% of patients were excluded, and patients without mRNA data or methylation data were also removed from further analysis. Data were centered and standardized before analysis.
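The missing-value filter and standardization step can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the function name `filter_and_standardize`, the feature names in the usage note, and the choice to leave the remaining missing values as `None` are all assumptions, since the text does not say how residual missing values were handled.

```python
import math


def filter_and_standardize(features, max_missing=0.20):
    """Drop features missing in more than `max_missing` of patients,
    then center and scale each remaining feature (z-scores).

    `features` maps a feature name to one value per patient, with None
    marking a missing value (hypothetical input layout).
    """
    kept = {}
    for name, values in features.items():
        observed = [v for v in values if v is not None]
        n_missing = len(values) - len(observed)
        if n_missing > max_missing * len(values):
            continue  # missing in more than 20% of patients
        if len(observed) < 2:
            continue  # not enough data to estimate a standard deviation
        mean = sum(observed) / len(observed)
        variance = sum((v - mean) ** 2 for v in observed) / (len(observed) - 1)
        sd = math.sqrt(variance)
        if sd == 0:
            continue  # a constant feature cannot be standardized
        # Missing entries are kept as None (assumption: no imputation here).
        kept[name] = [None if v is None else (v - mean) / sd for v in values]
    return kept
```

For example, with five patients, a feature with two missing values (40%) is dropped, while a feature with one missing value (exactly 20%) is kept, because the criterion is "more than 20%".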
For each gene and methylation site in the entire data set, we built a univariate Cox proportional hazards (Cox-PH) model and selected variables with P values less than 0.001. We then used these genes and methylation sites to cluster the patients with the Partitioning Around Medoids (PAM) clustering algorithm. The number of clusters K was varied from 2 to 5, and the optimal number of NSCLC clusters was chosen to maximize the difference in overall survival among subtypes. The Database for Annotation, Visualization and Integrated Discovery (DAVID, version 6.8) tool was used for functional annotation of Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways.
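To make the clustering step concrete, the sketch below implements PAM (a greedy BUILD phase followed by a SWAP phase) on a precomputed distance matrix in plain Python. It is an illustration under stated assumptions: Euclidean distance is assumed because the text does not name a distance measure, the function names are hypothetical, and the selection of K by comparing overall survival between clusters is not shown.

```python
import math


def euclidean_matrix(points):
    """Pairwise Euclidean distances (assumed metric; not stated in the text)."""
    return [[math.dist(p, q) for q in points] for p in points]


def total_cost(dist, medoids):
    """Sum of each patient's distance to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, k):
    """Partitioning Around Medoids on a precomputed distance matrix."""
    n = len(dist)
    # BUILD: greedily add the point that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        best = min(candidates, key=lambda c: total_cost(dist, medoids + [c]))
        medoids.append(best)
    # SWAP: exchange a medoid with a non-medoid while the cost decreases.
    improved = True
    while improved:
        improved = False
        current = total_cost(dist, medoids)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < current:
                    medoids, current, improved = trial, cost, True
    # Assign every patient to the cluster of its nearest medoid.
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
    return medoids, labels
```

In the setting described above, `pam` would be run for each K from 2 to 5 on the standardized survival-related features, and the K giving the largest separation in overall survival would be retained.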
Prediction model for molecular subtypes
We randomly divided the data set into a training set and a test set, with the training set containing 60% of patients (Table 1). We selected biomarkers and built prediction models for the molecular subtypes in the training set, and validated the models in the test set. In the biomarker selection phase, a univariate Wilcoxon test was first used to select genes and methylation sites that were differentially expressed or methylated (P < 0.001) in each subtype compared with the other subtypes in the training dataset. Then, a multivariate partial least squares (PLS) model was established, and the 10 variables with the largest variable importance in projection (VIP) values were selected as biomarkers for each subtype. Random forest (RF) and support vector machine (SVM) models were constructed with the 10 selected biomarkers in the training dataset. Model prediction ability was evaluated in the training and test datasets separately.
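The random 60/40 division can be sketched as below. The function name `split_patients`, the fixed seed, and the use of an unstratified shuffle are assumptions; the text only states that the split was random with 60% of patients in the training set.

```python
import random


def split_patients(patient_ids, train_fraction=0.6, seed=0):
    """Shuffle patient identifiers and split them into training and test sets.

    The seed is a hypothetical choice that only makes the sketch repeatable.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(train_fraction * len(ids))  # floor of 60% of the cohort
    return ids[:n_train], ids[n_train:]
```

With cohorts of 303 and 351 patients, taking the floor of 60% gives training sets of 181 and 210 patients and test sets of 122 and 141, the set sizes reported in Table 1.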
Table 1
Clinical characteristics of early-stage NSCLC patients in training and test sets
Characteristic | Category | Training set | Test set | Training set | Test set |
N | | 181 | 122 | 210 | 141 |
Age | | 68.41 ± 8.41 | 66.93 ± 9.48 | 65.43 ± 9.66 | 65.46 ± 9.98 |
Sex | Female | 48 (26.52) | 32 (26.23) | 117 (55.71) | 73 (51.77) |
 | Male | 133 (73.48) | 90 (73.77) | 93 (44.29) | 68 (48.23) |
Pathologic stage | I | 101 (55.80) | 69 (56.56) | 140 (66.67) | 99 (70.21) |
 | II | 80 (44.20) | 53 (43.44) | 70 (33.33) | 42 (29.79) |
Therapy outcome^a | CR | 128 (88.28) | 89 (85.58) | 138 (77.97) | 99 (81.15) |
 | PR | 1 (0.69) | 3 (2.88) | 1 (0.56) | 2 (1.64) |
 | SD | 5 (3.45) | 7 (6.73) | 17 (9.60) | 4 (3.28) |
 | PD | 11 (7.59) | 5 (4.81) | 21 (11.86) | 17 (13.93) |
Smoking status^b | 1 | 5 (2.84) | 5 (4.20) | 33 (16.26) | 17 (12.32) |
 | 2 | 62 (35.23) | 35 (29.41) | 48 (23.65) | 37 (26.81) |
 | 3 | 30 (17.05) | 18 (15.13) | 54 (26.60) | 37 (26.81) |
 | 4 | 77 (43.75) | 60 (50.42) | 66 (32.51) | 45 (32.61) |
 | 5 | 2 (1.14) | 1 (0.84) | 2 (0.99) | 2 (1.45) |
Pack year | | 52.20 ± 27.93 | 52.65 ± 28.92 | 39.18 ± 24.70 | 43.81 ± 28.38 |
Evaluation criteria
The prediction model performance was evaluated by sensitivity (SE), specificity (SP) and area under the ROC curve (AUC). SE and SP were defined by:
$$ SE=\frac{TP}{TP+ FN} $$
$$ SP=\frac{TN}{TN+ FP} $$
where
True Positive (TP): the patient belongs to a subtype, and the prediction model predicts the patient as this subtype;
False Positive (FP): the patient does not belong to a subtype, but the prediction model predicts the patient as this subtype;
True Negative (TN): the patient does not belong to a subtype, and the prediction model does not predict the patient as this subtype;
False Negative (FN): the patient belongs to a subtype, but the prediction model does not predict the patient as this subtype.
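The one-versus-rest counting behind these definitions can be sketched as follows. The function name and the subtype labels in the usage note are hypothetical, and the sketch assumes that the evaluated subtype has at least one member and at least one non-member, so neither denominator is zero.

```python
def sensitivity_specificity(true_labels, predicted_labels, subtype):
    """Compute SE and SP for one subtype against all other subtypes."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == subtype and predicted == subtype:
            tp += 1  # belongs to the subtype and is predicted as it
        elif truth != subtype and predicted == subtype:
            fp += 1  # does not belong, but is predicted as the subtype
        elif truth != subtype and predicted != subtype:
            tn += 1  # does not belong and is not predicted as the subtype
        else:
            fn += 1  # belongs to the subtype, but is not predicted as it
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity
```

For instance, with true labels `["C1", "C1", "C1", "C2", "C2", "C3"]` and predictions `["C1", "C1", "C2", "C2", "C1", "C3"]`, evaluating subtype `"C1"` gives TP = 2, FN = 1, FP = 1 and TN = 2, so SE = 2/3 and SP = 2/3.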
AUC was defined by:
$$ AUC=\frac{\sum {r}_i-{n}_0\left({n}_0+1\right)/2}{n_0{n}_1} $$
where $n_0$ and $n_1$ are the numbers of patients who do and do not belong to a subtype, respectively, and $r_i$ is the rank of the $i$th patient of the subtype in the ranked list.
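The rank-based AUC formula can be evaluated directly, as in the sketch below. Two points are assumptions not stated in the text: patients are ranked in ascending order of the model's predicted score for the subtype, and tied scores receive their average rank. The function name `rank_auc` is hypothetical.

```python
def rank_auc(scores, is_member):
    """AUC from the rank sum of subtype members.

    `scores` holds the predicted score for the subtype, one per patient;
    `is_member` holds True for patients who belong to the subtype.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run over patients with identical scores.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1  # 1-based average rank for ties
        for j in range(pos, end + 1):
            ranks[order[j]] = average_rank
        pos = end + 1
    n0 = sum(1 for member in is_member if member)  # patients in the subtype
    n1 = len(scores) - n0                          # patients not in the subtype
    rank_sum = sum(r for r, member in zip(ranks, is_member) if member)
    return (rank_sum - n0 * (n0 + 1) / 2) / (n0 * n1)
```

As a check, scores `[0.9, 0.8, 0.3, 0.6, 0.2]` with members `[True, True, False, True, False]` give member ranks 5, 4 and 3, so the AUC is (12 - 6) / (3 × 2) = 1, which is the expected value when every member outranks every non-member.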
Discussion
In this study, we proposed prognosis-related molecular subtypes for early-stage NSCLC, including 4 subtypes for LUSC and 2 subtypes for LUAD. These subtypes showed different trends in overall survival, gene expression patterns, and DNA methylation levels. Most subtypes showed highly expressed and hypermethylated gene regions, which facilitated biomarker selection for the subtypes. We also selected biomarkers and built prediction models with good performance, which can help with grouping new patients and selecting therapy strategies.
LUSC patients were divided into 4 clusters by 14 mRNAs and 362 methylation sites related to overall survival. These subtypes were mainly determined by DNA methylation information, and all the selected biomarkers were also methylation sites. Five methylation sites (cg00894870, cg03041700, cg08356572, cg11416447 and cg22627950) were selected as biomarkers for both LUSC-C1 and LUSC-C3, and the functions of 4 of the methylated genes were associated with cancer [13‐18]. The functions of 267 genes were mainly associated with regulation of the cell cycle and gene transcription.
In LUSC-C1, 3 hyper-methylated sites were located in the transcriptional start site (TSS) 200 region of GHSR and were weakly negatively correlated with GHSR (Supplementary Table 3), which encodes the growth hormone secretagogue receptor (GHS-R) and is related to energy metabolism. KIAA0090, ATAD3B, TRIM27 and DMTF1, regulated by hypo-methylated sites, were also associated with cancer. KIAA0090, which was positively correlated with the hypo-methylated site cg00894870, was associated with cancer metastasis and prognosis [16]. ATAD3B is expressed in cancer cells and may be related to tumorigenesis, proliferation and chemoresistance [14, 15]. TRIM27 is an oncogene [18], and DMTF1 can regulate the ARF-p53 pathway [13, 17].
In LUSC-C3, 8 hyper-methylated sites were located in 10 genes. In addition to the 4 genes shared with LUSC-C1 (KIAA0090, ATAD3B, TRIM27 and DMTF1), ACP1 and SH3YL1 also play important roles in cancer. ACP1 encodes a tyrosine phosphatase, an anti-tumorigenic factor that interacts with PDGF-R and FAK [19]. SH3YL1 can regulate cancer cell migration [20]. Two hypo-methylated sites were located in the gene bodies of the PCDH gene family (PCDHA, PCDHB and PCDHG). Aberrant methylation of these genes has also been found in breast cancer [21].
Unlike LUSC, LUAD patients were divided into 2 clusters by 143 mRNAs and 458 methylation sites, which indicated that these subtypes were determined by both mRNA expression and DNA methylation. The differentially expressed genes were mainly associated with cell cycle regulation, whereas the differentially methylated genes were involved in a variety of GO terms and KEGG pathways, such as signal transduction, cell division and apoptosis.
In LUAD-C1, all 10 selected biomarkers were down-regulated. ANLN, CCNA2, CDCA5, DLGAP5, TPX2 and KIF4A were involved in the regulation of the cell cycle (Supplementary Table 4). CKAP2L and SHCBP1 were associated with spindle formation, which is also part of the cell cycle. In previous studies, overexpression of 9 of the selected gene biomarkers (ANLN, CCNA2, CDCA5, CKAP2L, DLGAP5, KIF4A, KPNA2, SHCBP1 and TPX2) indicated poor prognosis in different cancer types, including lung cancer, colon cancer, breast cancer and bladder cancer [22‐31].
We built 2 models for subtype prediction, based on the RF and SVM algorithms. In the training dataset, the SE, SP and AUC of the RF model were all 1, larger than the values obtained with the SVM model. In the test dataset, however, the RF values were smaller than those of the SVM model. This indicated that the RF model was over-fitted and predicted new data less well than the SVM model.
Conclusions
In conclusion, we identified 6 subtypes for early-stage NSCLC, including 4 subtypes for LUSC and 2 subtypes for LUAD, by integrated analysis of gene expression and DNA methylation data. Furthermore, we selected biomarkers and built prediction models to distinguish these subtypes, and most of these biomarkers were involved in tumor-related functions.