Background
Colorectal cancer (CRC) is a common and lethal disease. Although its mortality has been declining since 1990, its mortality rate is currently approximately 1.7–1.9% [1]. CRC remains the third most common cancer according to the World Health Organization. The disease can be classified by embryological origin [2]. Right-sided colon cancer (RCC) arises in midgut-derived segments: the cecum, ascending colon, and hepatic flexure. In contrast, left-sided colon cancer (LCC) arises in hindgut-derived segments: the splenic flexure, descending colon, and sigmoid colon [3]. Over the past few years, the differences between LCC and RCC have received increasing attention because the two differ in prognosis, outcome, and clinical response to chemotherapy. Many publications have reported significant differences between RCC and LCC in mutations, epidemiology, survival, pathology, and clinical presentation [4, 5]. Compared with LCC, RCC was reported to occur more often in older patients and in females, and to have a poorer prognosis [6]. RCC tumors were also reported to be more poorly differentiated, larger, and at a more advanced stage [4, 7, 8]. However, some conflicting results have been reported, and whether tumor location itself has a significant impact on prognosis remains a topic of considerable debate [7]. Furthermore, the differences in molecular features between LCC and RCC remain unclear [9]. Studies have found that BRAF is preferentially mutated in RCC, while epidermal growth factor receptor (EGFR) is generally amplified in LCC [10]. Several studies have also reported that mutations and protein expression of p53 differ significantly between LCC and RCC [11–13]. However, another study found no significant difference in p53 protein expression between LCC and RCC [14].
In light of this background, there is a need to comprehensively survey the differences in gene mutations and expression levels between LCC and RCC. Knowledge of the differences at the molecular level would give us a deeper understanding of LCC and RCC and help improve their diagnostic and treatment strategies in clinical practice. The rapid development of high-throughput sequencing technologies has made it possible to characterize the diverse array of genomic changes found within each cancer type. Projects such as The Cancer Genome Atlas (TCGA) have compiled mutation, gene expression, methylation, and copy number data across cancer types [15].
As many researchers have demonstrated, machine learning (ML) is becoming increasingly important in cancer prognosis and prediction. In this context, we designed a study that applies ML methods to gene mutation and expression data from TCGA to infer the molecular differences between LCC and RCC.
Discussion
It has been hypothesized that RCC and LCC differ significantly in their molecular features, which might be the cause of their clinicopathological differences [24]. However, the differences in molecular features between RCC and LCC patients have remained unclear. Using ~300 LCC and RCC samples from TCGA, we applied ML methods to characterize the differences between LCC and RCC in more detail. It has been reported that RCC has a higher incidence of KRAS mutation than LCC (57.3% vs. 40.4%; P-value < 0.0001) [25], and a higher frequency of BRAF mutation (18.4–22.4% vs. 1.3–7.8%) [26]. However, other studies found no significant differences in BRAF and KRAS mutation rates [27]. In our study, RCC also had higher incidences of KRAS mutation (49.7% vs. 33.0%, P-value = 0.007) and BRAF mutation (23.4% vs. 3.6%, P-value = 2.8e-6) than LCC. However, we found no significant difference in the expression of BRAF (FDR = 1, log2FC = 0.1) or KRAS (FDR = 0.92, log2FC = −0.04), which suggests that these mutations may have no impact on the transcription of KRAS and BRAF. The four genes with the highest mutation rates in LCC were APC (84.8%), TP53 (68.8%), TTN (54.5%), and KRAS (33.0%). In RCC, the top four again included APC as the most common (63.2%), followed by TTN (63.2%), KRAS (49.7%), and TP53 (49.7%).
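As an illustration of how a difference in mutation rates between two groups can be tested, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of mutated and wild-type counts. The text above does not state which test produced the reported P-values, so the choice of Fisher's exact test is an assumption, and the function name and the counts in the usage lines are hypothetical and not taken from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups (for example RCC and LCC); columns are the numbers of
    mutated and wild-type samples. Assumed test, not the study's stated method.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x mutated samples in row 1
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Hypothetical counts for illustration only (not the TCGA data):
# group 1: 40 mutated, 40 wild-type; group 2: 30 mutated, 60 wild-type
p_value = fisher_exact_two_sided(40, 40, 30, 60)
print(f"two-sided P-value = {p_value:.4f}")
```

An exact test is used here because it needs no large-sample approximation; with a few hundred samples a chi-square test would typically give a similar answer.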
Using ML methods, we selected 30 mutations to build an XGBoost classifier that achieved an AUC of 0.80 on the test dataset. The feature with the highest score in the model was rs113488022 in BRAF. The top seven mutations scored by XGBoost were in the BRAF, KRAS, APC, and TP53 genes. APC and TP53 are tumor suppressor genes, while KRAS and BRAF are oncogenes. The differences in the frequencies of these mutations may underlie the clinicopathological differences between the two types of colon cancer.
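For readers less familiar with the AUC metric quoted above, the sketch below computes it directly from its definition: the probability that a randomly chosen positive sample receives a higher classifier score than a randomly chosen negative one, with ties counted as one half. The function name, the label encoding (1 = RCC, 0 = LCC), and the scores are hypothetical; this is not the study's evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from pairwise comparisons.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample from each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = RCC, 0 = LCC) and model scores
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why 0.80 for mutations and 0.96 for expression features indicate different degrees of separability.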
The genes that were upregulated in RCC relative to LCC were particularly associated with immunity-related processes. Using the differentially expressed genes (DEGs), we constructed a model with an AUC of 0.96 on the test set using only 17 features, implying large differences between LCC and RCC at the level of gene expression. Among these features, PRAC1 (FDR < 0.001, log2FC = −4.1), which encodes a small nuclear protein and was highly expressed in LCC, was the most important. The higher expression of PRAC1 in LCC than in RCC was also identified in other studies [5]. However, the function of PRAC1 in colon cancers remains elusive. Mutations in this gene have been found to be associated with a predisposition to prostate cancer, and it is a candidate for the hereditary prostate cancer 1 (HPC1) allele. The second most important feature in the model, HOXC6 (FDR < 0.001, log2FC = 1.03), was highly expressed in RCC; it belongs to the homeoprotein family of transcription factors, members of which play important roles in morphogenesis and cellular differentiation during embryonic development [28]. The higher expression of HOXC6 in RCC than in LCC was also described in another study [5]. Furthermore, overexpression of HOXC6 has been detected in several human carcinomas, including breast, gastrointestinal, and lung cancers, as well as in leukemia [29]. High expression levels of HOXC6 have also been found to be associated with lymph node metastasis [30]. The differential expression of PRAC1, HOXC6, and other genes may contribute to the different characteristics of LCC and RCC, which warrants more attention in further study.
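The log2FC and FDR values quoted above can be read with the help of the following minimal sketch. It assumes that log2FC is computed as RCC relative to LCC (consistent with the negative value for PRAC1, which is higher in LCC, and the positive value for HOXC6, which is higher in RCC) and that FDR refers to Benjamini-Hochberg adjusted P-values. The text does not name the tool or the correction method, so both function names, the pseudocount, and all input numbers are illustrative assumptions.

```python
from math import log2


def log2_fold_change(mean_rcc, mean_lcc, pseudocount=1.0):
    """log2 ratio of mean expression, RCC over LCC (assumed sign convention).

    The pseudocount (an assumption) avoids division by zero for silent genes.
    """
    return log2((mean_rcc + pseudocount) / (mean_lcc + pseudocount))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical mean expression values and raw P-values for four genes
print(log2_fold_change(3.0, 1.0))    # positive: higher in RCC
print(log2_fold_change(0.0, 3.0))    # negative: higher in LCC
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Under this convention a log2FC of −4.1 corresponds to roughly 17-fold higher expression in LCC, and a log2FC of 1.03 to roughly 2-fold higher expression in RCC.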
In the correlation network, some of the mutated genes, such as TP53 and KRAS, were hub nodes. The LR analysis also showed a close relationship between the BRAF V600E mutation and the expression of four genes, including HOXC6 and CA8. These findings suggest that the differences in gene mutations and expression, and the associations between them, may be key reasons for the differences in clinical features between LCC and RCC.