Background
Colorectal cancer (CRC) is one of the leading causes of cancer-related deaths worldwide [1]. Approximately 1.8 million new CRC cases and > 860,000 CRC-related deaths occurred globally in 2018, making CRC the third most frequent cancer worldwide [1, 2]. CRC develops through a multistep process characterized by accumulated genetic and epigenetic abnormalities that cause genomic instability and mutations in tumor-suppressor genes and oncogenes [3]. Most CRC lesions show little sensitivity to immune-checkpoint inhibitor-based therapies, although immunologic parameters may have prognostic value [4]. Therefore, further research on the tumor immunity of CRC will provide a theoretical basis for developing CRC immunotherapeutics.
Interferons were first discovered as antiviral proteins, and subsequently, interferon regulatory factors (IRFs) were discovered and studied intensively. IRFs are transcription factors participating in interferon gene regulation [5]. The amino termini of IRFs contain a DNA-binding domain (DBD) composed of 115 amino acids (similar to the DBD of Myb) that can bind promoter regions in DNA. The carboxyl termini of IRFs have a variable region that serves various biological functions [6]. Ten IRFs (IRF1 to IRF9 and viral IRF) have been discovered. IRFs are found in various tissues and play important roles in cell-cycle regulation, cell differentiation, apoptosis, and tumor immune regulation [6]. Further studies on IRFs will provide a theoretical basis for their mechanistic roles in tumor development and tumor immunity and for choosing drug therapies.
The roles of IRFs in CRC have not been investigated using bioinformatics analysis. Here, we used public databases to analyze IRF expression levels and mutations in CRC patients to determine distinct prognostic values, study tumor immunity regulation, and identify potential functions of IRFs in CRC. We verified these results via immunohistochemistry (IHC) analysis with our own cohort of CRC patients.
Methods
Data acquisition
Fragments per kilobase of transcript per million mapped reads (FPKM) values and microRNA (miRNA)-expression levels of patients with CRC were downloaded from the COAD/READ datasets of The Cancer Genome Atlas (TCGA) Genomic Data Commons website (https://portal.gdc.cancer.gov/) and used as the training dataset. FPKM values were converted to transcripts per million (TPM) values and divided into mRNA- and long noncoding RNA (lncRNA)-expression groups. “Masked Somatic Mutation” data of patients with CRC were downloaded, pre-processed using VarScan software, and visualized using the R software package, maftools [7]. The clinicopathological features and prognoses of patients with CRC, such as gender, age, and stage, were downloaded from the UCSC Xena website (http://xena.ucsc.edu/). After removing samples with missing clinical information, 597 tumor samples and 51 normal tissue samples were obtained. Table 1 and Additional file 5: Table S6 show the baseline clinical information of patients with CRC from TCGA-COAD/READ datasets. The likelihood of response to immunotherapy was predicted for each patient using the Tumor Immune Dysfunction and Exclusion (TIDE) algorithm (http://tide.dfci.harvard.edu) [8]. Gene expression data from different tissues and cell lines were downloaded from TCGA and the Cancer Cell Line Encyclopedia (CCLE) databases (https://portals.broadinstitute.org/ccle/about) to compare IRF expression levels between tumor and normal tissues.
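The FPKM-to-TPM conversion mentioned above can be illustrated with a minimal sketch. It assumes the standard per-sample rescaling (each FPKM value divided by the sample's FPKM total and multiplied by one million); the function name and the toy expression values are hypothetical and are not taken from the study's pipeline.

```python
def fpkm_to_tpm(fpkm):
    """Convert one sample's FPKM values (gene -> FPKM) to TPM.

    Assumed formula: TPM_i = FPKM_i / sum_j(FPKM_j) * 1e6.
    """
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("sample has no expressed genes")
    return {gene: value / total * 1e6 for gene, value in fpkm.items()}


# Toy sample with made-up FPKM values (illustration only).
sample = {"IRF3": 30.0, "IRF7": 10.0, "ACTB": 160.0}
tpm = fpkm_to_tpm(sample)
```

After conversion, the TPM values of every sample sum to one million, which makes expression levels comparable across samples.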
Table 1
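Baseline information of patients with colorectal cancer (CRC) from The Cancer Genome Atlas (TCGA) COAD/READ datasets, grouped by the interferon regulatory factor (IRF) family score from the random forest algorithm
Characteristic | All patients (n = 597) | IRF risk group 1 (n = 298) | IRF risk group 2 (n = 299) | P value |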
Gender | | | | 0.904 |
Female | 277 (46.4%) | 139 (46.6%) | 138 (46.2%) | |
Male | 320 (53.6%) | 159 (53.4%) | 161 (53.8%) | |
Age | | | | 0.378 |
< 60 | 170 (28.5%) | 80 (26.8%) | 90 (30.1%) | |
≥ 60 | 427 (71.5%) | 218 (73.2%) | 209 (69.9%) | |
Pathologic stage | | | | < 0.001 |
I | 108 (18.1%) | 69 (23.1%) | 39 (13.1%) | |
II | 225 (37.7%) | 120 (40.3%) | 105 (35.1%) | |
III | 177 (29.6%) | 86 (28.9%) | 91 (30.4%) | |
IV | 87 (14.6%) | 23 (7.7%) | 64 (21.4%) | |
T | | | | 0.002 |
T1 | 19 (3.2%) | 11 (3.7%) | 8 (2.7%) | |
T2 | 105 (17.6%) | 65 (21.8%) | 40 (13.4%) | |
T3 | 408 (68.3%) | 201 (67.5%) | 207 (69.2%) | |
T4 | 65 (10.9%) | 21 (7.0%) | 44 (14.7%) | |
N | | | | < 0.001 |
N0 | 342 (57.3%) | 191 (64.1%) | 151 (50.6%) | |
N1 | 145 (24.3%) | 71 (23.8%) | 74 (24.7%) | |
N2 | 110 (18.4%) | 36 (12.1%) | 74 (24.7%) | |
M | | | | < 0.001 |
M0 | 453 (75.9%) | 249 (83.6%) | 204 (68.2%) | |
M1 | 85 (14.2%) | 22 (7.4%) | 63 (21.1%) | |
MX | 59 (9.9%) | 27 (9.0%) | 32 (10.7%) | |
Gene expression data in GSE17536 [9] and GSE39582 [10] and the corresponding clinicopathological patient characteristics were downloaded as validation datasets from the Gene Expression Omnibus (GEO) database. Both datasets comprise Homo sapiens samples profiled on the GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array platform. GSE17536 included 177 colon cancer tissue samples, and GSE39582 included 566 colon cancer tissue samples and 19 colon non-tumor tissue samples.
Genetic characteristics of the IRF family and validation by constructing clinical prediction models
We incorporated the expression levels of IRF family genes into a random forest model. The randomForest package of R [11] was used to develop an IRF-based risk-assessment model for patients with CRC. Patients were divided into high- and low-IRF risk groups, based on the median risk score.
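The median-based grouping can be sketched as follows. This is only an illustration: the patient identifiers and scores are invented, the risk scores are assumed to come from an already fitted model, and assigning scores equal to the median to the low-risk group is an assumption not stated in the text.

```python
from statistics import median


def split_by_median(risk_scores):
    """Label each patient 'high' or 'low' relative to the cohort median score."""
    cutoff = median(risk_scores.values())
    # Assumption: scores equal to the median are placed in the low-risk group.
    groups = {
        patient: ("high" if score > cutoff else "low")
        for patient, score in risk_scores.items()
    }
    return cutoff, groups


# Hypothetical risk scores (illustration only).
scores = {"P1": 0.2, "P2": 0.8, "P3": 0.5, "P4": 0.9}
cutoff, groups = split_by_median(scores)
```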
To assess patient prognosis by combining the IRF risk score with clinicopathological features, univariate and multivariate Cox proportional-hazards analyses were used to analyze the independent predictive power of risk scores for overall survival (OS) and disease-free survival (DFS). Subsequently, a survival-prediction nomogram was constructed for patients in TCGA dataset and was validated for patients in the GEO dataset. To quantify discrimination performance, Harrell’s concordance index (C-index) was measured. A calibration curve was generated to evaluate the performance of the nomogram by comparing the nomogram-predicted OS rate with the observed OS rate. In the calibration curve, the abscissa shows the survival rate predicted by the model, and the ordinate shows the observed survival rate. For a perfectly calibrated model, the predictions equal the observations and the points fall on the diagonal line; in practice, some deviation remains. The closer the calibration line lies to the dashed diagonal, the better the agreement between prediction and observation. We used the above methods to evaluate the quality of the model.
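As an illustration of the discrimination measure, the following minimal sketch computes Harrell's C-index by direct pair counting. It is a simplified version written for clarity (pairs with tied survival times are ignored, and tied risk scores count as half-concordant); it is not the routine used in the study, and the example data are invented.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the higher-risk patient fails first.

    events[i] is 1 for an observed event and 0 for censoring.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event
            # strictly before the follow-up time of patient j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Invented follow-up data (illustration only).
c_index = harrell_c_index(
    times=[5, 10, 15, 20],
    events=[1, 1, 0, 1],
    risk_scores=[0.9, 0.4, 0.6, 0.1],
)
```

A value of 0.5 corresponds to random ranking, and a value of 1.0 corresponds to perfect ranking of patients by risk.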
Differentially expressed genes (DEGs) and clinical correlation analysis
Data of patients with CRC were downloaded from TCGA and the GEO databases, and the patients were divided into high- and low-IRF score groups. The DESeq2 package of R [12] was used to identify DEGs between the two groups; a log fold-change (logFC) greater than 1.0 and a P value less than 0.05 were used as the thresholds for statistically significant DEGs.
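The thresholding step can be sketched as a simple filter over a table of per-gene results. The gene names and values below are placeholders. The text states the cutoff as logFC > 1.0; whether the absolute value of logFC was intended is not stated, so the sketch exposes this as an optional, assumed flag.

```python
def select_degs(results, logfc_cutoff=1.0, p_cutoff=0.05, absolute=False):
    """Return genes passing the logFC and P value thresholds.

    results maps gene -> (logFC, P value). With absolute=True, strongly
    downregulated genes are kept as well (an assumption, not from the text).
    """
    selected = []
    for gene, (logfc, p_value) in results.items():
        effect = abs(logfc) if absolute else logfc
        if effect > logfc_cutoff and p_value < p_cutoff:
            selected.append(gene)
    return selected


# Placeholder differential-expression results (illustration only).
results = {
    "GENE_A": (2.3, 0.001),
    "GENE_B": (0.4, 0.0001),
    "GENE_C": (-1.8, 0.01),
    "GENE_D": (1.5, 0.2),
}
up_only = select_degs(results)
both_directions = select_degs(results, absolute=True)
```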
We compared the expression levels of IRF family genes at different TNM stages. The Human Protein Atlas (HPA, https://www.proteinatlas.org) provides immunohistochemical expression data for nearly 20 different cancers [13] and enables the identification of tumor type-specific differential protein expression patterns. Using the HPA, we compared the protein expression levels of all IRF family members between normal and CRC tissues.
Functional enrichment analysis and gene set enrichment analysis (GSEA)
Gene Ontology (GO) analysis is commonly used for large-scale functional enrichment research of biological processes (BPs), molecular functions, and cellular components. The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a widely used database containing information regarding genomes, biological pathways, diseases, and drugs. GO and KEGG pathway-enrichment analyses were performed with signature genes using the clusterProfiler R package [14]. A false-discovery rate of < 0.05 was considered statistically significant.
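Over-representation analyses of this kind are commonly based on a hypergeometric test per term followed by false-discovery rate control. The sketch below shows both steps under that assumption, using the Benjamini-Hochberg procedure for the adjustment; it is an illustration with invented P values, not a reimplementation of clusterProfiler.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k term genes among n signature
    genes when K of the N background genes belong to the term."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that adjusted
    # values never decrease as the raw P value increases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Invented raw P values for four terms (illustration only).
raw = [0.01, 0.04, 0.03, 0.5]
fdr = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(fdr) if q < 0.05]
```

In this toy example, only the first term passes the FDR < 0.05 cutoff even though three raw P values are below 0.05.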
To investigate differences in BPs among different subgroups, GSEA was performed using the gene expression profiles of patients with CRC. GSEA evaluates whether a predefined gene set shows statistically significant differences between two groups and estimates changes in pathway and BP activities [15]. The gene set “c2.cp.kegg.v6.2.symbols” [15] was downloaded from the Molecular Signatures Database for GSEA. An adjusted P value of < 0.05 was considered statistically significant.
Constructing a protein–protein interaction (PPI) network and screening hub genes
We used the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database [16], which predicts PPIs, to construct PPI networks for the selected genes. Interactions with scores > 0.4 were selected to construct a network model, which was visualized with Cytoscape V3.7.2 [17]. Within this network, hub nodes were identified with the maximal clique centrality (MCC) algorithm, which is reported to be the most effective method for this purpose. The MCC of each node was calculated using the cytoHubba plugin [18] in Cytoscape, and the eight genes with the highest MCC values were selected as hub genes.
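The hub-gene selection can be sketched as follows, assuming the usual MCC definition in which the score of a node is the sum of (|C| - 1)! over all maximal cliques C that contain it. The sketch enumerates maximal cliques with a basic Bron-Kerbosch recursion, which is adequate for small networks only; the node names, interaction scores, and function names are hypothetical, and this is not the cytoHubba implementation.

```python
from math import factorial


def maximal_cliques(adj):
    """Enumerate maximal cliques with the Bron-Kerbosch algorithm (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def mcc_scores(edges):
    """MCC(v) = sum of (|C| - 1)! over maximal cliques C containing v."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    scores = {node: 0 for node in adj}
    for clique in maximal_cliques(adj):
        for node in clique:
            scores[node] += factorial(len(clique) - 1)
    return scores


def top_hubs(edges, k=8):
    """Return the k nodes with the highest MCC (ties broken alphabetically)."""
    scores = mcc_scores(edges)
    return sorted(scores, key=lambda node: (-scores[node], node))[:k]


# Hypothetical scored interactions; only those with score > 0.4 are kept.
interactions = [("A", "B", 0.9), ("A", "C", 0.7), ("B", "C", 0.6),
                ("C", "D", 0.5), ("D", "E", 0.2)]
edges = [(a, b) for a, b, score in interactions if score > 0.4]
scores = mcc_scores(edges)
hubs = top_hubs(edges, k=2)
```

When none of a node's neighbors are connected to each other, every maximal clique containing it is a single edge, so its MCC reduces to its degree.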
Constructing a competing endogenous RNA (ceRNA) network based on miRNA-mRNA and miRNA-lncRNA interactions
LncRNA-miRNA interaction data were downloaded from the miRcode database, and miRNA-mRNA interaction data were downloaded from the miRTarBase, miRDB, and TargetScan databases. The DESeq2 package of R [12] was used to analyze miRNA and lncRNA expression differences between the high-IRF and low-IRF risk groups. LogFC > 1.0 and P < 0.05 were set as criteria for a statistically significant difference. Cytoscape (V3.7.2) was used to construct a ceRNA network by linking the lncRNAs and mRNAs that interact with the same miRNAs.
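One way to assemble such a network is to join the two interaction tables on their shared miRNAs, keeping only differentially expressed molecules. The sketch below illustrates this join; all identifiers are placeholders, and the filtering rule (every member of a triple must be differentially expressed) is an assumption rather than a detail given in the text.

```python
def build_cerna_triples(lnc_mirna, mirna_mrna, de_lncs, de_mirnas, de_mrnas):
    """Return (lncRNA, miRNA, mRNA) triples that share the same miRNA.

    lnc_mirna: iterable of (lncRNA, miRNA) pairs
    mirna_mrna: iterable of (miRNA, mRNA) pairs
    de_*: sets of differentially expressed molecules used as filters
    """
    targets_by_mirna = {}
    for mirna, mrna in mirna_mrna:
        if mirna in de_mirnas and mrna in de_mrnas:
            targets_by_mirna.setdefault(mirna, []).append(mrna)

    triples = []
    for lnc, mirna in lnc_mirna:
        if lnc in de_lncs and mirna in targets_by_mirna:
            triples.extend((lnc, mirna, mrna) for mrna in targets_by_mirna[mirna])
    return triples


# Placeholder interaction tables (illustration only).
lnc_pairs = [("LNC_1", "miR-a"), ("LNC_2", "miR-b"), ("LNC_3", "miR-a")]
mrna_pairs = [("miR-a", "GENE_X"), ("miR-a", "GENE_Y"), ("miR-b", "GENE_Z")]
triples = build_cerna_triples(
    lnc_pairs,
    mrna_pairs,
    de_lncs={"LNC_1", "LNC_2"},
    de_mirnas={"miR-a"},
    de_mrnas={"GENE_X", "GENE_Y", "GENE_Z"},
)
```

Each triple corresponds to two edges (lncRNA-miRNA and miRNA-mRNA) that can be exported as an edge list for visualization.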
Tumor immune estimation resource (TIMER) database analysis and comparison of immune-correlation scores between both groups
The TIMER database (https://cistrome.shinyapps.io/timer/) enables users to estimate B-cell, CD4+ T cell, CD8+ T cell, macrophage, neutrophil, and dendritic-cell infiltration into different tumor types [19]. We used the TIMER database to analyze correlations between the expression levels of different IRF genes and immune cell infiltration in COAD/READ samples.
The estimate package of R [20] infers the levels of infiltrating immune and stromal cells in tumor samples from gene expression profiles and was used to calculate the immune and stromal scores of each tumor sample. Immune cell infiltration levels between both groups were compared using the Mann–Whitney U test.
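For illustration, the U statistic underlying the Mann–Whitney test can be computed by direct pair counting, as in the minimal sketch below. The scores are invented, and the sketch returns only the statistic; in practice, the P value would be obtained from a statistics library.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) by counting pairs; ties contribute 0.5 to each side."""
    u_a = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in group_a
        for y in group_b
    )
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b


# Invented immune scores for two risk groups (illustration only).
high_risk = [0.8, 0.9, 0.7]
low_risk = [0.2, 0.75, 0.1]
u_high, u_low = mann_whitney_u(high_risk, low_risk)
```

The two statistics always sum to the number of pairs, and the smaller of the two is conventionally reported as U.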
Analysis of anticancer therapy sensitivity
The Genomics of Drug Sensitivity in Cancer (GDSC) database (https://www.cancerrxgene.org/) enables exploration of gene mutations and targeted cancer therapies. We downloaded cell line gene expression data and half-maximal inhibitory concentration (IC50) values to analyze correlations between differentially expressed IRF genes and anticancer drug sensitivities.
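A correlation of this kind can be sketched with the Pearson coefficient between a gene's expression and a drug's IC50 values across cell lines. The choice of Pearson correlation for this step is an assumption (the text specifies it only for gene-gene correlations), and the numbers below are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Invented expression and IC50 values for four cell lines (illustration only).
expression = [1.0, 2.0, 3.0, 4.0]
ic50 = [8.0, 6.0, 4.0, 2.0]
r = pearson_r(expression, ic50)
```

A negative coefficient means that higher expression is associated with lower IC50 values, that is, with greater drug sensitivity.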
Calculating tumor-mutation load fractions and analyzing genetic variations of IRF family members in CRC
The tumor mutational burden (TMB) of each tumor sample was defined as the number of somatic mutations identified, excluding silent mutations. Patients with CRC were divided into high-TMB and low-TMB groups according to the median TMB value. The Wilcoxon rank-sum test was used to compare the risk scores of IRF family genes between both groups.
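The TMB definition above can be sketched as a per-sample count of non-silent mutations. The sketch assumes that each mutation record carries a MAF-style variant classification in which silent mutations are labeled "Silent"; the sample identifiers and records are invented.

```python
from statistics import median


def tumor_mutational_burden(samples, mutations):
    """Count non-silent somatic mutations per sample.

    samples: all sample IDs (samples without non-silent mutations get 0)
    mutations: iterable of (sample_id, variant_classification) records
    """
    counts = {sample: 0 for sample in samples}
    for sample, variant_class in mutations:
        # Assumption: silent mutations are labeled "Silent", as in MAF files.
        if sample in counts and variant_class != "Silent":
            counts[sample] += 1
    return counts


# Invented mutation records (illustration only).
records = [
    ("S1", "Missense_Mutation"),
    ("S1", "Silent"),
    ("S1", "Nonsense_Mutation"),
    ("S2", "Silent"),
    ("S3", "Missense_Mutation"),
]
tmb = tumor_mutational_burden(["S1", "S2", "S3"], records)
tmb_cutoff = median(tmb.values())
high_tmb = [sample for sample, count in tmb.items() if count > tmb_cutoff]
```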
Patients and specimens in the validation cohort
Tumor specimens were obtained from 114 CRC patients who underwent treatment at Zhongshan Hospital (Fudan University) between 2008 and 2016. The inclusion criteria were as follows: (a) a clear pathological diagnosis of CRC, (b) complete follow-up data until December 2019, (c) suitable formalin-fixed and paraffin-embedded tissues, and (d) agreement to participate in the study with signed informed consent. CRC diagnosis was based on the World Health Organization criteria, and tumor stages were classified according to the 7th edition of the TNM classification of the Union for International Cancer Control (UICC). Ethical approval was obtained from the Research Ethics Committee of Zhongshan Hospital. The clinical characteristics of the 102 patients with follow-up data are presented in Additional file 5: Table S1.
IHC staining evaluation
Cancer and adjacent normal tissues were formalin-fixed, paraffin-embedded, and prepared as tissue microarrays (TMAs) after hematoxylin and eosin staining and histopathology-guided location. Five-micron-thick TMA sections were deparaffinized and rehydrated in 0.1 M citrate buffer (pH 6.0), followed by high-temperature antigen retrieval in a microwave for 15 min. The sections were incubated overnight at 4 °C with primary antibodies against IRF3 and IRF7 (Abcam, Cambridge, UK), CD4 (Servicebio Technology, Wuhan, China), CD8 (Servicebio Technology), CD19 (Servicebio Technology), CD68 (Servicebio Technology), MPO (Servicebio Technology) and CD21 (Servicebio Technology). The sections were incubated for 30 min with a secondary antibody at room temperature and immunostained using the avidin–biotin complex method with 3,3′-diaminobenzidine as the chromogen. Hematoxylin was used as a counterstain.
Antigen–antibody complexes in whole samples were detected using a panoramic slice scanner (3DHISTECH, Hungary) and viewed with CaseViewer 2.2 (3DHISTECH). H-scores were calculated to evaluate protein expression levels using Quant Center 2.1 (3DHISTECH): H-score = Σ (PI × I) = (% of weakly stained cells × 1) + (% of moderately stained cells × 2) + (% of strongly stained cells × 3), where PI is the proportion of the positive area, and I is the staining intensity.
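The H-score formula above translates directly into a short function. The sketch below is a worked example with invented percentages; the function name and the input validation are illustrative additions.

```python
def h_score(pct_weak, pct_moderate, pct_strong):
    """H-score = 1 x (% weak) + 2 x (% moderate) + 3 x (% strong); range 0 to 300."""
    percentages = (pct_weak, pct_moderate, pct_strong)
    if any(p < 0 for p in percentages) or sum(percentages) > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    return 1 * pct_weak + 2 * pct_moderate + 3 * pct_strong


# Example: 20% weakly, 30% moderately, and 10% strongly stained cells.
example_score = h_score(20, 30, 10)
```

For this example, the score is 20 × 1 + 30 × 2 + 10 × 3 = 110; a section in which all cells stain strongly reaches the maximum of 300.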
Statistical analysis
The data were analyzed with R software (V4.0.2). For comparisons of continuous variables between two groups, the independent Student t test was used for normally distributed variables, and the Mann–Whitney U test was used for non-normally distributed variables. The chi-squared test or Fisher exact test was used to compare categorical variables between two groups. Correlation coefficients between different genes were calculated via Pearson correlation analysis. The survival package of R was used for survival analysis: Kaplan–Meier curves were used to display survival differences, and the log-rank test was used to evaluate the significance of differences in survival times between two groups. Univariate and multivariate Cox analyses were used to determine independent prognostic factors. The pROC package of R [21] was used to draw receiver operating-characteristic (ROC) curves, and area under the curve (AUC) values were calculated to assess the accuracy of risk scores for prognosis estimations. All P values were two-sided, and P < 0.05 was considered statistically significant.
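As an illustration of the survival analysis, the Kaplan–Meier estimator can be written in a few lines. The sketch below computes the product-limit estimate at each distinct event time from invented follow-up data; it is a didactic version and does not reproduce the survival package of R.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 for an observed event and 0 for censoring.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    curve = []
    survival = 1.0
    for t in event_times:
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        # Product-limit step: multiply by the fraction surviving this time point.
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Invented follow-up times (months) and event indicators (illustration only).
km_curve = kaplan_meier(times=[2, 3, 3, 5, 8], events=[1, 1, 0, 1, 0])
```

Censored patients contribute to the number at risk until their last follow-up but never trigger a drop in the curve.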
Discussion
Differential expression of IRF genes has been reported in many cancers [6], and IRFs play important roles in CRC tumorigenesis and prognosis. To our knowledge, this study is the first to explore IRF expression in CRC at both the mRNA and protein levels and to determine the prognostic value, effects on immune cells, and potential molecular pathways of IRFs in CRC.
IRF3 and IRF7 are closely related, and unlike other IRFs, they are considered key for evading innate immune responses to virulence factors [22]; thus, they may play crucial roles in anticancer immunity. IRF3 plays important roles in DNA damage responses (DDRs) in cancer [23]. During chemotherapy with DDR agents and immunotherapy involving checkpoint blockade, IRF3 expression is upregulated via cGAS–STING pathway activation [24, 25]. IRF3 activation in response to DDR promotes its role in upregulating RAE1 [26], which is the tumor-cell ligand for NKG2D on NK cells. Together, RAE1 and NKG2D stimulate NK cell-effector function. IRF3 overexpression inhibits tumor-cell growth by increasing p53 activity in vitro [27]. Additionally, IRF3 may be involved in STING activity [28]. Increased PD-L1 expression following treatment with DDR inhibitors is mainly IRF3-dependent [25], and tumor-growth inhibition and immune-checkpoint blockade with DDR inhibitors are completely dependent on the cGAS–STING–IRF3 axis. Our current findings further suggest an additional benefit of cGAS–STING–IRF3 axis activation owing to increased expression of the CXCL10 and CCL5 chemokines, leading to T cell tumor infiltration. Previously, we found that IRF3 and IRF7 could mediate the acquisition of new anti-tumor effector functions in macrophages [29]. In the present study, we observed that high IRF3 and IRF7 expression was related to CD4+ T cell, CD8+ T cell, B-cell, and macrophage activation, indicating that IRF3 and IRF7 could promote the anticancer effect of immune cells.
Interestingly, among all IRFs, the mRNA and protein expression levels of IRF3 and IRF7 were significantly upregulated in tumor tissues and associated with poor OS in CRC patients. As IRFs are transcription factors, they may also influence tumor cell development by regulating the transcription of oncogenes, although the related mechanisms require further investigation. To examine why increased IRF3 and IRF7 expression promotes immune cell recruitment without killing tumors, we further assessed the relationship between IRF risk scores and immune and stromal scores in cancer patients. We found that a high IRF family score was associated with a high TIDE score and a high TMB. Two primary mechanisms are believed to result in tumor immune evasion: dysfunction of T cells in tumors with a high level of T cell infiltration, and exclusion of T cells from infiltrating tumors. TIDE was constructed to quantify these effects. Meanwhile, TMB reflects the number of mutant proteins produced by the neoplasm as well as the immunogenic neoantigen load in the microenvironment. We therefore speculate that the IRF family might be involved in an imbalance or even a disorder of the immune microenvironment in CRC, rather than merely attenuating the level of tumor immune infiltration. Moreover, the immune score is calculated by integrating the expression of different immune genes, whereas the IRF family is involved in the interferon response, which represents only one type of immune response. Further experimental work is needed to resolve these contradictory results.
Although we have previously demonstrated that the translation of the IRF2 protein is repressed by microRNA-18 binding to the 3′ UTR of the IRF2 mRNA [30], in the present study, we found that the IRF family exhibits a high frequency of genetic variations in the COAD cohort. We therefore constructed a ceRNA network containing differentially expressed miRNAs, lncRNAs, and mRNAs to uncover the underlying regulatory relationships among them. Noncoding RNAs are widely considered to function at every layer of genetic regulation, including duplication, transcription, and translation, especially during cancer development [31]. Fan performed an integrated investigation, constructing a lncRNA-miRNA-mRNA ceRNA network specific to CRC, and identified components related to the prognosis of CRC patients [32]. Qi provided a comprehensive overview of ceRNA crosstalk in CRC [33]. We have also tried to provide novel insights into the connection between coding and noncoding RNAs based on the IRF family; our results indicate that HOXC8 and HOXC13, which belong to a highly conserved homeobox family, are regulated by some miRNAs and lncRNAs when mediating the transcription of members of the IRF family. These bioinformatic results warrant subsequent experimental validation.
There are still several limitations in our study. Clinical studies with large sample sizes are required to verify the predictive value of the risk score established from the IRF family. The cellular functions and molecular mechanisms of IRF3 and IRF7 in CRC warrant confirmation with in vitro and in vivo animal experiments. As biomarker candidates, IRF family members should be evaluated in the context of tumor immunotherapy.