Background
Lung adenocarcinoma (LUAD) is the most common type of non-small cell lung cancer (NSCLC), which accounts for approximately 80–90% of lung cancer cases [1]. Currently, approximately 35–75% of LUAD patients relapse or die within 5 years of receiving conventional treatments based on the National Comprehensive Cancer Network Clinical Practice Guidelines in Oncology [2]. Recently, immunotherapies, which eliminate tumours by activating the immune system [3], have shown great promise for NSCLC [4, 5]. For example, nivolumab, an immune checkpoint inhibitor that targets programmed cell death protein-1 (PD-1), can significantly increase survival in advanced-stage NSCLC by blocking the interaction between PD-1 and its ligand, programmed death-ligand 1 (PD-L1), thereby allowing cytotoxic T lymphocytes to act on tumour cells [6]. Furthermore, ipilimumab, an inhibitor of cytotoxic T lymphocyte-associated antigen 4 (CTLA-4), which suppresses immune responses, has been approved for treating NSCLC [5] and some other cancers [7]. However, the heterogeneity of the response to immune checkpoint inhibitors significantly complicates the treatment of NSCLC [3]. Therefore, it is important to identify patients who are likely to benefit from these immune checkpoint inhibitors.
Previously, PD-L1 protein expression in NSCLC patients was approved as an auxiliary predictive marker for certain PD-1/PD-L1 inhibitors, including pembrolizumab [8]. However, PD-L1 protein expression alone cannot completely account for the survival benefit in patients treated with immune checkpoint inhibitors [8–11]. Moreover, analysis of PD-L1 protein expression via immunohistochemistry (IHC) is challenging because pathologists interpret immunostaining results subjectively, using different criteria [12].
Several previous studies have reported that a high tumour mutation burden (TMB), determined through whole-exome sequencing (WES) and indicating that patients are more likely to harbour neoantigens, can predict sensitivity to immunotherapies [13, 14]. For example, a high TMB is associated with enhanced responses to nivolumab (PD-1 inhibitor) plus ipilimumab (CTLA-4 inhibitor) immunotherapy [15]. Moreover, a high TMB is more significantly associated with the response to PD-1/PD-L1 inhibitors than is PD-L1 protein expression detected via IHC [16]. However, WES, which is necessary to determine the TMB, is not routinely performed in clinical practice because it is costly, time-consuming and labour intensive, and requires a large amount of sequencing [3, 17, 18]. Previous studies have reported that the TMB can be accurately estimated using smaller gene panels encompassing several hundred genes, such as the 324-gene mutation panel (FoundationOne CDx™ assay) [6, 19–21] and the 341-gene mutation panel (MSK-IMPACT) [22, 23], which have been used clinically. The cost-effectiveness of these mutation panels permits a greater sequencing depth than WES and consequently a higher ability to detect mutations, even in genes mutated in only a subset of tumour cells [24]. However, these commercial mutation panels were assembled from cancer-related genes regardless of the cancer type, rather than being developed via a feature selection method; thus, mutation panels can still be improved. In particular, it is necessary to develop a cancer-type-specific mutation panel to estimate the TMB of LUAD samples, since different cancer types have different mutation landscapes [25]. Recently, Lyu et al. [3] constructed a LUAD-specific 24-gene model for predicting the TMB of LUAD samples. However, this panel was also based on the complete exons of the panel genes; it therefore comprises thousands of exons, most of which are unmutated and only add unnecessary sequencing cost and time.
In this study, based on the coding sequences (CDSs) with a high frequency of mutation in LUAD, we developed a CDS mutation panel to estimate the TMB of LUAD samples. Thereafter, we determined the correlation of CDSs in the mutation panel with the TMB in two independent datasets. Using progression-free survival (PFS) data from two datasets (Matthew and Rizvi) of advanced LUAD patients treated with immune checkpoint inhibitors, we evaluated the performance of the CDS mutation panel in predicting the efficacy of immunotherapy. Furthermore, the CDS mutation panel was compared with two commercial mutation panels (324-gene and 341-gene panels) and a LUAD-specific mutation panel (24-gene panel).
Methods
Data sources and pre-processing
Three LUAD somatic mutation datasets (Table 1) were used to construct and validate the mutation panel for estimating the TMB. The training mutation data, comprising 486 LUAD samples with paired mRNA expression data, were downloaded from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/). For validation, we obtained two independent somatic mutation datasets with PFS data for patients treated with immune checkpoint inhibitors: 59 LUAD samples reported by Matthew et al. [5] and 29 LUAD samples reported by Rizvi et al. [4]. The patients in the Matthew dataset were treated with nivolumab (PD-1 inhibitor) plus ipilimumab (CTLA-4 inhibitor), and those in the Rizvi dataset were treated with pembrolizumab (PD-1 inhibitor).
Table 1 Whole-exome sequencing mutation data analyzed in this study
Characteristic, n (%) | TCGA | Matthew | Rizvi |
Histology |
Adenocarcinoma | 486 | 59 | 29 |
Age (years) |
No less than 65 | 223 (46) | 29 (50) | 10 (34) |
Less than 65 | 263 (54) | 30 (50) | 19 (66) |
Sex |
Male | 222 (46) | 22 (37) | 13 (45) |
Female | 264 (54) | 37 (63) | 16 (55) |
Smoking status |
Never | – | 13 (22) | 5 (17) |
Former/light | – | 38 (64) | 18 (62) |
Current/heavy | – | 8 (14) | 6 (21) |
Stage |
I | 263 (54) | – | – |
II | 117 (24) | – | – |
IIIA | 70 (14) | – | – |
IIIB–IV | 36 (7) | 59 (100) | 29 (100) |
PFS-status |
Progression | – | 40 (68) | 20 (69) |
Progression-free | – | 19 (32) | 9 (31) |
Percentage of tumour cells |
Known | 433 (89) | – | – |
Unknown | 53 (11) | – | – |
Average percentage of tumour cells | 78.76 | – | – |
Whole-exome sequencing was previously performed for the TCGA samples using tumour tissues and matched normal tissue or blood, the latter being used to filter out germline mutations and identify somatic mutations [26]; the detailed protocol is described in the original publication [27]. Briefly, 0.5–3 µg of DNA from each sample was used for library preparation and sequenced using the Illumina HiSeq platform. The mean coverage across targeted bases was 97.63 for tumour DNA and 95.83 for germline DNA. Mutations with a variant allelic fraction of < 0.05 in tumour tissue were excluded. Only non-synonymous mutations (missense, nonsense, nonstop, frame-shift and in-frame mutations) were included, and a discrete mutation profile comprising 82,574 CDSs of 16,961 genes was generated. For the two test datasets, whole-exome sequencing was performed for tumour tissues and matched normal tissues or blood; the detailed protocols are described in the original publications [5, 28]. Finally, discrete mutation profiles comprising 18,793 CDSs of 9400 genes and 8711 CDSs of 5504 genes were generated, wherein the CDS mutation matrix was constructed using the matched human reference genome annotation files from GENCODE (https://www.gencodegenes.org/human/releases.html).
Development of the CDS mutation panel for estimating TMB
First, from the TCGA LUAD somatic mutation data, we extracted mutations in the CDSs using the human reference genome annotation file (GRCh38) and selected non-synonymous mutations to construct an \(m \times n\) CDS mutation matrix, where \(m\) represents the number of CDSs in genes and \(n\) represents the number of samples. TMB was estimated as (total mutations in CDSs / total bases of CDSs) × \(10^{6}\).
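As a minimal sketch of the TMB estimate defined above, the following Python snippet computes mutations per megabase for a single sample. The CDS identifiers, lengths and mutation counts are hypothetical values chosen only for illustration; they are not taken from the study data.

```python
def tumour_mutation_burden(mutation_counts, cds_lengths):
    """Return the TMB of one sample in mutations per megabase.

    mutation_counts: dict mapping CDS id -> number of non-synonymous mutations
    cds_lengths:     dict mapping CDS id -> CDS length in bases (all CDSs considered)
    """
    total_mutations = sum(mutation_counts.values())
    total_bases = sum(cds_lengths.values())
    if total_bases <= 0:
        raise ValueError("total CDS length must be positive")
    # (total mutations / total bases) * 10^6, reordered to limit rounding error
    return total_mutations * 1e6 / total_bases


# Hypothetical toy input: three CDSs, two of which carry one mutation each.
cds_lengths = {"CDS_A": 1200, "CDS_B": 800, "CDS_C": 2000}
sample_mutations = {"CDS_A": 1, "CDS_C": 1}

tmb = tumour_mutation_burden(sample_mutations, cds_lengths)
print(f"TMB = {tmb:.1f} mut/Mb")
```

In this toy case, 2 mutations over 4000 bases correspond to 500 mut/Mb; real exome-wide values are far lower because the denominator spans tens of megabases.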
Thereafter, Spearman’s rank correlation analysis was performed to estimate the correlation between the mutation state of each CDS and the TMB. We restricted the analysis to CDSs mutated in more than 5% of cancer samples [29, 30] to filter out ‘passenger’ genes with low-frequency mutations, as these may reflect random mutation rather than a tumorigenic advantage. p-values were adjusted for multiple testing using the Benjamini–Hochberg procedure [31] to control the false discovery rate (FDR). CDSs significantly correlated with the TMB were selected as candidates.
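The candidate-selection step can be sketched as follows, assuming a binary mutation state per CDS and sample. The function names and toy inputs are illustrative assumptions, and the p-values passed to the Benjamini–Hochberg adjustment are hypothetical placeholders, since computing the Spearman test p-value itself is left to a statistics library.

```python
def mutation_frequency_filter(matrix, min_fraction=0.05):
    """Keep CDSs mutated in more than min_fraction of samples.

    matrix maps CDS id -> list of 0/1 mutation states, one entry per sample.
    """
    return {cds: states for cds, states in matrix.items()
            if sum(1 for s in states if s) / len(states) > min_fraction}


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical toy inputs.
states = {"CDS_A": [1, 1, 1] + [0] * 37, "CDS_B": [1, 1] + [0] * 38}
kept = mutation_frequency_filter(states)            # CDS_B sits at exactly 5%
rho = spearman_rho([0, 0, 1, 1], [1.0, 2.0, 3.0, 4.0])
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])  # placeholder p-values
```

Because the filter uses a strict inequality, a CDS mutated in exactly 5% of samples is dropped, which follows the "more than 5%" wording of the text.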
Finally, a genetic algorithm (GA package) was used to select, from among the candidate CDSs, the final CDS panel whose panel-score was most correlated with the TMB. The genetic algorithm was implemented with a population size of 5000 and a crossover fraction of 0.9; it was terminated if the optimization objective of the best subset did not improve in 100 generations. Details regarding the genetic algorithm are given in Additional file 1. The correlation (\(R^{2}\)) was estimated via linear regression analysis [32]. Here, the panel-score was calculated as follows (Formula 1):
$${\text{Panel-score}} = \beta \frac{\sum\nolimits_{i = 1}^{n} k_{i}}{l \times 10^{-6}} + C$$
(1)
where \(n\) is the number of CDSs in the panel, \(l\) is the length of the panel, and \(k_{i}\) is the number of mutations in the \(i\)-th CDS; \(\beta\) and \(C\) were obtained through linear regression analysis, where \(\beta\) is a coefficient to balance the panel-score and TMB and \(C\) is a constant.
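Formula 1 and the subset search can be sketched together in Python. This is an illustrative sketch rather than the implementation used in the study: the panel length \(l\) is assumed to be in bases, the search parameters are scaled down from those reported (population size 5000, 100 generations without improvement), and tournament selection, uniform crossover, bit-flip mutation and elitism are assumptions that may differ from the GA package. All input values are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares fit of y = beta * x + c; returns (beta, c, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0, mean_y, 0.0  # a constant variable carries no correlation
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    return beta, mean_y - beta * mean_x, sxy * sxy / (sxx * syy)


def panel_density(mask, counts, lengths):
    """Per-sample sum(k_i) / (l * 1e-6) over the CDSs selected by mask."""
    chosen = [i for i, bit in enumerate(mask) if bit]
    if not chosen:
        return [0.0] * len(counts)
    panel_bases = sum(lengths[i] for i in chosen)
    return [sum(row[i] for i in chosen) * 1e6 / panel_bases for row in counts]


def fitness(mask, counts, lengths, tmb):
    """R^2 between the panel mutation density and the TMB."""
    return fit_line(panel_density(mask, counts, lengths), tmb)[2]


def select_panel(counts, lengths, tmb, pop_size=40, crossover_fraction=0.9,
                 mutation_rate=0.05, patience=15, seed=0):
    """Search for the CDS subset (0/1 mask) with the highest R^2 against TMB."""
    rng = random.Random(seed)

    def score(mask):
        return fitness(mask, counts, lengths, tmb)

    def tournament(population):
        a, b = rng.sample(population, 2)
        return a if score(a) >= score(b) else b

    population = [[rng.randint(0, 1) for _ in lengths] for _ in range(pop_size)]
    best = max(population, key=score)
    best_score, stale = score(best), 0
    while stale < patience:  # stop after `patience` generations without gain
        children = [best[:]]  # elitism: carry the best subset forward
        while len(children) < pop_size:
            p1, p2 = tournament(population), tournament(population)
            if rng.random() < crossover_fraction:
                child = [u if rng.random() < 0.5 else v for u, v in zip(p1, p2)]
            else:
                child = p1[:]
            children.append([1 - bit if rng.random() < mutation_rate else bit
                             for bit in child])
        population = children
        champion = max(population, key=score)
        if score(champion) > best_score:
            best, best_score, stale = champion, score(champion), 0
        else:
            stale += 1
    return best, best_score


# Hypothetical toy data: 6 samples x 4 candidate CDSs.
counts = [[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0],
          [1, 1, 1, 0], [1, 1, 1, 1], [2, 1, 1, 1]]
lengths = [1200, 800, 2000, 1500]          # CDS lengths in bases
tmb = [1.0, 3.0, 5.0, 8.0, 12.0, 15.0]     # exome-wide TMB per sample

best_mask, best_r2 = select_panel(counts, lengths, tmb)
density = panel_density(best_mask, counts, lengths)
beta, c, _ = fit_line(density, tmb)
panel_scores = [beta * d + c for d in density]  # Formula 1 for each sample
```

Since \(R^{2}\) is unchanged by a linear rescaling, the search can rank subsets by the raw mutation density, and \(\beta\) and \(C\) need to be fitted only once for the selected panel.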
As no clinical data regarding immunotherapy were available for patients in TCGA, we could not determine the optimal cut-point of our CDS panel for predicting the efficacy of immunotherapy. Therefore, we set the cut-point of our CDS panel at the median panel score in the TCGA dataset.
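A brief sketch of the median cut-point rule is given below. The panel scores are hypothetical, and assigning scores equal to the cut-point to the high-TMB group is an assumption, as the text does not specify how ties are handled.

```python
from statistics import median


def classify_by_cutpoint(scores, cutpoint):
    """Label each panel score as 'high' or 'low' TMB.

    Assumption: a score equal to the cut-point is labelled 'high'.
    """
    return ["high" if s >= cutpoint else "low" for s in scores]


training_scores = [2.1, 4.8, 6.2, 9.5, 13.0]  # hypothetical training panel scores
cutpoint = median(training_scores)            # fixed on the training data only
test_scores = [3.0, 6.2, 11.4]                # hypothetical test panel scores
labels = classify_by_cutpoint(test_scores, cutpoint)
```

The cut-point is derived from the training data alone and then applied unchanged to the test samples, so no outcome information from the test datasets enters the threshold.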
Survival analysis
PFS was defined as the period during and after the treatment of a disease in which a patient lives with the disease without it worsening. The survival curve was estimated using the Kaplan–Meier method and compared using the log-rank test (survival package: ‘survdiff’) [33]. The univariate Cox proportional hazards regression model (survival package: ‘coxph’) was used to evaluate the predictive performance of the mutation panels. Furthermore, the multivariate Cox model (survival package: ‘coxph’) was used to evaluate the independent prognostic value of our CDS mutation panel after adjusting for clinical factors including age, sex, and smoking. Hazard ratios (HRs) and 95% confidence intervals (CIs) were generated using the Cox proportional hazards model (survival package: ‘coxph’).
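For readers unfamiliar with these estimators, the Kaplan–Meier curve and the two-group log-rank test can be sketched in plain Python as below. This is an illustrative re-implementation, not the survival package code; the follow-up times and event indicators are hypothetical, and the Cox model is not covered.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    events: 1 = progression observed, 0 = censored.
    """
    curve, survival = [], 1.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        progressed = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1 - progressed / at_risk
        curve.append((t, survival))
    return curve


def log_rank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    times = times_a + times_b
    events = events_a + events_b
    observed_a = expected_a = variance = 0.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        n_a = sum(1 for u in times_a if u >= t)       # at risk in group A
        n = n_a + sum(1 for u in times_b if u >= t)   # at risk overall
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e)
        d = d_a + sum(1 for u, e in zip(times_b, events_b) if u == t and e)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # hypergeometric variance of the event count in group A
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 degree of freedom
    return chi2, p_value


# Hypothetical PFS data (months) for predicted low-TMB and high-TMB groups.
low_times, low_events = [1, 2], [1, 1]
high_times, high_events = [3, 4], [1, 1]

curve = kaplan_meier([2, 4, 4, 6, 8], [1, 1, 0, 1, 0])
chi2, p = log_rank_test(low_times, low_events, high_times, high_events)
```

The sketch raises a division error when no events are observed at all, a case that a production routine would handle explicitly.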
Functional enrichment analysis
Functional pathways for enrichment analysis were downloaded from Gene Ontology (GO) in November 2018. First, we performed Student’s t-test with a 5% FDR control to select differentially expressed genes (DE genes) between the high-TMB and low-TMB groups classified by the CDS panel. Here, 17,680 genes were used for differential expression analysis. Thereafter, the hypergeometric distribution model was used to determine whether the number of DE genes observed in a functional term was significantly greater than that expected through random chance.
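The hypergeometric enrichment test can be sketched as follows. The function name and the gene counts in the toy example are illustrative assumptions; only the background size of 17,680 genes comes from the text.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_de, n_term, n_overlap):
    """P(X >= n_overlap) for the number of DE genes falling in a functional term.

    n_universe: genes tested for differential expression (17,680 in the text)
    n_de:       differentially expressed genes
    n_term:     genes annotated to the functional term
    n_overlap:  DE genes annotated to the term
    """
    upper = min(n_de, n_term)
    total = comb(n_universe, n_de)
    tail = sum(comb(n_term, k) * comb(n_universe - n_term, n_de - k)
               for k in range(n_overlap, upper + 1))
    return tail / total


# Toy example that can be checked by hand: 10 genes, 4 in the term, 3 DE, 2 shared.
p_toy = hypergeom_enrichment_p(10, 3, 4, 2)

# Hypothetical term sizes against the 17,680-gene background.
p_term = hypergeom_enrichment_p(17680, 500, 200, 15)
```

For the toy case the tail probability is (36 + 4) / 120 = 1/3. In practice, the resulting p-values across all GO terms would then be adjusted for multiple testing, as stated for the other analyses.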
All statistical analyses were performed using R software, version 3.4.2 (http://www.r-project.org/). Significance was defined as p < 0.05, or FDR < 0.05 for multiple testing.
Discussion
This study describes the generation of a mutation panel comprising 106 CDSs of 100 genes spanning 0.34 Mb. Previous studies have reported that sequencing panels comprising more than 300 cancer-related genes can help predict the TMB, whereas performance drops markedly when the panel contains fewer than 150 genes [37]. However, these commercial mutation panels (such as the 324-gene and 341-gene panels) were not selected through any feature selection method; thus, their high correlations with the TMB primarily resulted from the large number of genes included in the panels. In contrast, our 106-CDS mutation panel, which was developed using a genetic algorithm and contains more of the major variables associated with the TMB, is expected to estimate the TMB reliably, and its performance was validated in the two independent test datasets. The correlations between our 106-CDS panel and the TMB differed somewhat between the two test datasets, potentially because of their different sample sizes or sample collections; these correlations require further validation in a large-scale clinical trial.
The present results show that the 106-CDS panel with a cut-point of 6.20 mut/Mb effectively predicted the efficacy of immunotherapy among advanced-stage LUAD patients. Among patients treated with nivolumab plus ipilimumab, the 1-year PFS rate was 0.67 for those predicted as high-TMB by the 106-CDS panel at this cut-point, markedly higher than the rate (0.25) for those predicted as low-TMB. Similarly, after pembrolizumab treatment, the 1-year PFS rate of the predicted high-TMB patients was 0.61, markedly higher than that (0.13) of the predicted low-TMB patients. However, the cut-point of the 106-CDS panel, which was set at the median panel score in the training dataset, may not be the optimal threshold for predicting the efficacy of various immunotherapy drugs. To assess the effect of the cut-point on predicting the efficacy of immunotherapy, we additionally set the cut-point of our CDS panel at the upper tertile (9.17 mut/Mb) and the upper quartile (12.13 mut/Mb) of the panel scores in the training dataset and evaluated each in the two test datasets. For the patients treated with nivolumab plus ipilimumab in the Matthew dataset, the univariate survival analyses revealed that the 106-CDS panel with the upper-quartile cut-point (12.13 mut/Mb) had better predictive performance (log-rank p = 0.0079, HR = 3.81, 95% CI 1.33–10.93, Additional file 7: Figure S1A) than with the median (log-rank p = 0.0018, HR = 3.35, 95% CI 1.51–7.42, Fig. 3a) or upper-tertile (log-rank p = 0.0298, HR = 2.59, 95% CI 1.07–6.27, Additional file 7: Figure S1B) cut-points. In contrast, for the patients treated with pembrolizumab in the Rizvi dataset, the upper-quartile cut-point had the weakest performance (log-rank p = 0.1258, HR = 2.58, 95% CI 0.72–9.21, Additional file 7: Figure S1C) compared with the median (log-rank p = 0.0020, HR = 5.06, 95% CI 1.63–15.69, Fig. 3c) and upper-tertile (log-rank p = 0.0081, HR = 5.82, 95% CI 1.33–25.51, Additional file 7: Figure S1D) cut-points. These results suggest that the 106-CDS panel with a cut-point of 6.20 mut/Mb can effectively identify patients who may benefit from immunotherapies, but the optimal cut-point for a specific immunotherapy drug needs further exploration in a large-scale clinical trial.
The larger the number of genes included in the mutation panel, the higher the expected correlation with the TMB. Our results show that although the 106-CDS panel contains fewer than one-third as many genes as the 324-gene [19] and 341-gene [22] panels, it displayed better performance in predicting the efficacy of immunotherapy. Moreover, although the 106-CDS panel (0.34 Mb) is longer than the 24-gene panel (0.18 Mb), it performed markedly better in predicting the efficacy of immunotherapy. These results suggest that the mutations captured by the 106-CDS panel may have higher antigenicity, which needs further confirmation.
Functional annotation revealed that several genes in the 106-CDS panel, including TP53 [38], AMER1 [39], and TEX15 [40], are involved in DNA repair and cell cycle arrest, playing a key role in genomic instability. DE genes between the two groups classified using the 106-CDS panel with a cut-point of 6.20 mut/Mb were significantly enriched in several pathways associated with genomic instability, such as DNA repair [34], DNA replication [35], and chromosome segregation [36]. These functional analyses indicate that, compared with low-TMB patients predicted using the 106-CDS panel, the high-TMB patients potentially have higher genomic instability and are more likely to harbour neoantigens.