Background
Lung cancer (LC) is the most common cause of cancer-related mortality worldwide, responsible for more than 1.4 million deaths per year [1]. The two major subtypes of LC, lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LSCC), are classified as non-small cell lung cancers (NSCLC) [2]. Despite this common classification, the two subtypes are likely to have drastically different clinical outcomes: LSCC and LUAD patients have overall survival rates of 18% and 65%, respectively, when treated with tailored therapy [3, 4]. However, patients can receive tailored therapy only after typification, i.e., after the kind of LC they have has been identified. Despite advances in genomic characterization, associating genomic information (e.g., mutational profiles) with clinical outcomes in NSCLCs remains an open challenge, given its complexity and heterogeneity [5].
Together with the mutational signatures characteristic of each LC subtype, certain clinical variables can help in the typification of a tumour. For example, smoking has been recognised as the leading risk factor for LC, especially for the LUAD subtype [6, 7]. Specific genes are affected in these patients depending on whether they are smokers or not. For example, non-smoker LUAD patients typically present driver mutations in EGFR, KRAS, and TP53, and fusions involving the ROS, EML4-ALK, and RET genes [8]. On the other hand, smoker LUAD patients commonly have KRAS mutations [9]. The tumour mutational burden (TMB), defined as the total number of somatic mutations per coding area of a tumour genome, encodes some of the information above. As tumours with high TMB are likely to express more neoantigens that may sensitise them to immunotherapy [10, 11], TMB has been used as a predictor of immunotherapy response and effectiveness across various tumour types [12, 13]. Therefore, including the number of mutations could further characterise tumour progression in NSCLCs.
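As a worked illustration of the TMB definition given above, the following minimal sketch divides a mutation count by the size of the sequenced coding region. The function name, the megabase unit, and the example numbers are assumptions made for illustration and are not taken from the study.

```python
def tumour_mutational_burden(n_somatic_mutations: int, coding_megabases: float) -> float:
    """Return the number of somatic mutations per megabase of coding region."""
    if coding_megabases <= 0:
        raise ValueError("coding_megabases must be positive")
    return n_somatic_mutations / coding_megabases


# Hypothetical tumour: 300 somatic mutations over a 30 Mb coding target.
example_tmb = tumour_mutational_burden(300, 30.0)
print(example_tmb)
```

Normalising by the size of the coding area makes values comparable across assays that cover different portions of the genome.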
Recently, large-scale sequencing techniques have led to the accumulation of genomic information in cancer research. This comprehensive mapping of the mutational signatures of tumours has allowed researchers to use machine learning models to solve classification problems or predict relevant clinical outcomes. One of these clinical outcomes is whether a patient will develop metastasis, which is the leading cause of death in cancer patients [14]. Therefore, finding which clinical and genomic factors are most informative in these models, and are thus better predictors of metastasis development, is crucial for identifying the risk of developing metastasis in the early stages of cancer. Ultimately, this information would help medical practitioners adapt their therapeutic strategies when treating LC patients.
In this study, we put forward a new variable to quantify the accumulation of missense mutations in the whole exome: the Total Mutational Load (TML). Through the TML, we account for the potential effects of the accumulation of missense mutations on metastasis development, as these may impair tumour-suppressing proteins or promote the activation of proto-oncogenes, thus favouring cancer cell proliferation [15]. First, we studied the distribution of the TML and clinical variables across patients with different LCs and clinical categories. Then, using Random Forest (RF) machine learning models, we evaluated how informative the TML and other clinical variables (e.g., age, tumour stage, and smoking status) were for classifying metastasis development in 1144 Pan-Lung Cancer samples. Finally, we compared different data preprocessing and processing alternatives to identify the one that produces the best-performing models. Altogether, we provide new insights into the factors that could allow early identification of patients at risk of developing metastasis, and improve the understanding of the relationship between genomics and clinical variables in NSCLC patients.
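The TML described above, a per-sample count of missense mutations in the exome, can be sketched as follows. This is a minimal illustration and not the study's pipeline: the column names follow the common MAF convention (`Tumor_Sample_Barcode`, `Variant_Classification`) and, together with the toy records, are assumptions.

```python
# Hypothetical MAF-like mutation records; the field names are an assumption.
records = [
    {"Tumor_Sample_Barcode": "P1", "Variant_Classification": "Missense_Mutation"},
    {"Tumor_Sample_Barcode": "P1", "Variant_Classification": "Silent"},
    {"Tumor_Sample_Barcode": "P1", "Variant_Classification": "Missense_Mutation"},
    {"Tumor_Sample_Barcode": "P2", "Variant_Classification": "Nonsense_Mutation"},
]


def total_mutational_load(rows):
    """Count missense mutations per sample; samples with none are kept at 0."""
    tml = {}
    for row in rows:
        sample = row["Tumor_Sample_Barcode"]
        is_missense = row["Variant_Classification"] == "Missense_Mutation"
        tml[sample] = tml.get(sample, 0) + int(is_missense)
    return tml


tml_by_sample = total_mutational_load(records)
print(tml_by_sample)
```

Keeping samples with zero missense mutations in the result avoids silently dropping patients when the TML is later merged with clinical variables.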
Discussion
In this study, we first analysed the association between the number of missense mutations, i.e., the Total Mutational Load (TML), and clinical variables in a cohort of 1144 LUAD and LSCC patients. Then, we used these results to interpret metastasis status classification models, which we evaluated with a benchmarking strategy based on Random Forest (RF) models.
Regarding clinical parameters and TML, we found that age, smoking history, and cancer type are significantly associated with the TML. In other words, patients belonging to different categories in these clinical variables had significantly different mean TMLs. Younger patients (≤ 60 years old) showed a higher TML than older patients (> 60 years old), while lifelong non-smokers had lower TMLs than current smokers. On closer inspection, we found that, among LUAD patients, young current smokers have higher TMLs than old current smokers (cf. Figure 2C, D). Thus, we analysed the dataset to assess whether younger patients in our cohort were heavier smokers than older ones. We found no significant difference in the number of packs of cigarettes smoked per year between age groups in LUAD patients. However, among LSCC patients, older patients have a higher consumption of cigarettes (p = 0.0296) (Additional file 1: Fig. S1). This suggests that variables other than smoking habits and the number of cigarettes smoked are needed to explain this difference.
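A comparison of mean TML between two clinical categories, such as the age groups above, can be illustrated with a two-sided permutation test. This is a stand-in chosen because it needs only the standard library; it is not necessarily the statistical test used in the study, and the TML values below are invented for the example.

```python
import random
from statistics import mean


def permutation_test_mean_diff(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical TML values for two age groups (not data from the study).
younger_tml = [310, 280, 350, 400, 295, 330]
older_tml = [150, 210, 180, 240, 170, 200]
p_value = permutation_test_mean_diff(younger_tml, older_tml, n_permutations=2000)
print(p_value)
```

A small p-value indicates that a difference in means this large would rarely arise if the group labels were unrelated to the TML.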
Regarding smoking status, reformed smokers have a higher TML than lifelong non-smokers, suggesting that cigarette consumption has long-term effects on missense mutations. Considering cancer type, LUAD patients have a higher TML than LSCC patients. We found that smoking is strongly associated with the TML in LUAD, consistent with the literature [10]. LUAD never-smoker patients have a lower TML than those who have actively smoked during their lifetime. Therefore, smoking seems to be a relevant factor in explaining the increase in mutations in patients with this disease. On the contrary, LSCC patients who have never smoked have a larger number of missense mutations than never-smoker LUAD patients, indicating that other factors contribute most to the development of this pathology (cf. Figure 2C, D). Consistently, previous studies showed that LSCC patients accumulate numerous passenger mutations and suggested that LSCC is no longer a smoker’s-only disease, since 14.7% (95% CI, 12.1%–17.4%) of their patients were never-smokers [22].
Interestingly, the association between TML and smoking status that we found is supported by recent experimental findings linking smoking with an increased risk of LC and a higher frequency of somatic mutations [23]. This association was also suggested by previous preliminary studies [24]. However, given the observational nature of our study, our results could also be explained by the following confounders. First, we found that the TML was higher in younger than in older patients, which could reflect different health-seeking behaviours between age categories. For example, as young individuals do not perceive themselves to be at risk of developing cancer, they do not undergo screening until they present symptoms, which typically appear in advanced stages of cancer. As the TML accounts for how much an individual has been exposed to mutagenic factors (endogenous and external), it is reasonable to expect a specific relation between tumours and the accumulation of mutations (deletion of tumour-suppressing genes or incorporation of proto-oncogenes). However, when combined with the explicit label given by clinical categorization, we found that the TML added information not contained in that label, confirming the non-redundancy of the variable we put forward. Factors such as socioeconomic status, access to healthcare, and typical comorbidities should also be homogeneous across the cohort and could explain potential differences with other studies. Given the high costs related to tumour typification, economic status and whether governments mediate access to healthcare are determinants of correct classification within the cohort and can further increase class imbalance in cohorts from countries with lower average incomes.
Understandably, there are several ways to quantify the effects of mutational processes. In our case, we counted all missense mutations in the tumour exome to define the TML. Although this modelling choice can be considered somewhat arbitrary, we aim neither to provide a general tool for characterizing mutagenesis nor to compare alternatives for doing so, but rather to generate a method that answers our research questions. In this sense, our study opens new research directions to assess whether other calculation methodologies for the TML could further improve the classification performance of RF models trained with it.
Although we have a large cohort of patients, the main difficulty in this work (and related works) is the class imbalance in the metastatic status. Notwithstanding the technical challenges of dealing with class imbalance, we found that models trained on datasets without PCA and without data resampling achieved the best classification performance. Regarding PCA, the lower performance of the corresponding models might be due to information loss induced by the reduction in dimensionality. Regarding resampling, not performing any led to higher performance, but might also lead to overfitting, especially when using low values of k in k-fold cross-validation. Finally, we verified that models trained using clinical variables and TML obtained the best performance metrics. The tumour stage, redefined to also include early stages as described in Methods, is a very relevant variable in the classification model. These results indicate that tumour stage II and III samples could be reclassified as metastatic samples, which could help pathologists classify samples using this information. On the other hand, TML is also ranked as a highly informative variable, suggesting that the information contained in it is not redundant with clinical variables.
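To make the interplay between class imbalance, resampling, and k-fold cross-validation discussed above more concrete, the following minimal sketch builds stratified folds and applies random oversampling to the training portion only. The resampling methods compared in the study are not specified in this section, so random oversampling is used here as a stand-in; the function names and the toy label vector are assumptions.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def oversample_minority(train_idx, labels, seed=0):
    """Randomly duplicate smaller-class indices until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for idx in train_idx:
        by_class.setdefault(labels[idx], []).append(idx)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    return balanced


# Hypothetical imbalanced cohort: 1 = metastatic, 0 = non-metastatic.
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, k=5)
test_idx = folds[0]
train_idx = [i for fold in folds[1:] for i in fold]
balanced_train = oversample_minority(train_idx, labels)
print(len(test_idx), len(train_idx), len(balanced_train))
```

Resampling only the training indices keeps duplicated minority samples out of the held-out fold, so the evaluation still reflects the original class proportions.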
Altogether, the findings in this work may contribute to the development of diagnostic tools able to classify metastasis status at an early stage using clinical information, such as the cancer type, the smoking history, and the age. For example, smoking is known to have a critical relationship with the development of LC. However, according to our results, when both variables are combined, TML contributes more to predicting metastasis in patients with this disease than cigarette smoking. Therefore, we highlight the benefits of including it as a predictive feature in classification models driven by machine learning.