Background
Beyond clonally-derived tumor cells, abundant and heterogeneous cells that harbor these tumor cells constitute the tumor microenvironment (TME) [
1]. The TME plays an essential role in tumor differentiation, growth, and invasion [
2]. The TME comprises a spectrum of cell types responsible for immune and angiogenic responses [
2]. When antitumor immune responses are triggered, inflammatory cells populate the TME, including natural killer (NK) cells, active cytotoxic CD8 T cells, memory CD4 T cells, pro-inflammatory macrophages, and dendritic cells (DC). In contrast, a TME that contributes to functional evasion of tumor immune response includes Foxp3+ regulatory T cells (Tregs), exhausted CD8 T cells, inactive macrophages, and myeloid-derived suppressor cells (MDSCs) [
1]. Non-tumor stromal cells and endothelial cells remodel the angiogenic microenvironment to support tumor growth and invasion [
3]. Also, the plasticity of epithelial cells plays a critical role in tumor progression [
4]. The dynamic interactions between tumor cells and other cells in their microenvironment can promote tumor progression [
3].
Tumor immune subtypes can be identified based on immunological gene expression profiling [
5]. Tumors that are highly characterized by pro-inflammatory cytokines and T cell infiltration, i.e., immunologically hot tumors, have a better response rate to immune checkpoint inhibitors compared to immunologically cold tumors, which have a relatively low level of immune cell infiltration [
6]. However, the binary classification of hot and cold tumors oversimplifies the broader underlying immune landscape in TME. In the angiogenic microenvironment, tumors that are inclined to promote endothelial cell proliferation by producing vascular endothelial growth factor (VEGF) to develop new blood vessels can be targeted by angiogenesis inhibitors [
7], e.g., cancers of the lung, kidney, breast, colon, and rectum [
8]. Thus, understanding the heterogeneity of the TME can inform predictions of therapy response and prognosis [
1].
Both gene expression and DNA methylation data have been used to estimate cell composition in complex mixtures, using either reference-based or reference-free methods. CIBERSORT is a prominent reference-based method developed for deconvolving immune cell types using mRNA expression data [
9]. The accuracy of cell composition estimates using gene expression approaches is limited by variability in cell-specific gene expression across cells and the feature-space of gene expression data. DNA methylation is an epigenetic modification associated with gene regulation and is essential to lineage specification in development to establish and preserve cellular identity [
10]. There are three notable advantages to reference-based DNA methylation methods compared with RNA-based approaches in estimating cell composition. First, DNA is more stable than RNA. Second, the covalent addition of a methyl group to a cytosine is binary, tracking with cell count. Third, using standard measurement approaches, the feature space to define reference profiles of cell-specific DNA methylation is at least 40-fold that of the typical gene expression feature space and can be up to 2000-fold higher [
11]. We have established and created extended libraries for reference-based DNA methylation deconvolution that result in improved accuracy and performance for peripheral blood immune cell deconvolution [
12,
13]. Tissue-specific reference-based libraries have also been developed to infer cell-type composition in the brain, breast, and skin [
14,
15].
Initial approaches to deconvolve the TME using DNA methylation have been described.
MethylCIBERSORT and
MethylResolver have succeeded in resolving 10 and 12 cell types, respectively [
16,
17]. However, due to the complexity and heterogeneity of the cell types in the TME, existing methods are limited in accuracy, specificity, and cell-type resolution. Both the
MethylCIBERSORT and
MethylResolver methods used data from cancer cell lines rather than data from primary cancer cells. This is potentially problematic for deconvolution, as cancer cell lines harbor additional epigenetic alterations compared to primary tumors [
18]. Also, instead of using organ-specific epithelial cell type DNA methylation signatures,
MethylResolver used a universal standard reference for tumor purity estimation in all tumor types.
To address the limitations of existing methods and to enhance the accuracy and utility of TME deconvolution, we developed a novel DNA methylation-based algorithm that employs a tumor-type-specific hierarchical model and broadens the number of immune cell types that are deconvolved. Our method, called Hierarchical Tumor Immune Microenvironment Deconvolution (HiTIMED), uses deconvolution libraries specific to tumor type, identifying the most cell-discriminatory CpG sites for each cell type in each tumor type context, resulting in 12 libraries per tumor type. Our method also organizes deconvolution into the three major tumor microenvironment components (tumor, angiogenic, immune), resulting in the ability to resolve a total of 17 cell types in the TME: tumor, epithelial, endothelial, stromal, basophil, eosinophil, neutrophil, monocyte, dendritic cell (DC), B naïve (Bnv), B memory (Bmem), CD4T naïve (CD4nv), CD4T memory (CD4mem), CD8T naïve (CD8nv), CD8T memory (CD8mem), T regulatory (Treg), and natural killer (NK) cells, in 20 carcinoma types. HiTIMED's ability to resolve tumor cellular composition with high resolution promises a better understanding of cell heterogeneity in the TME and offers new opportunities to study more complex relationships of the TME with etiologic exposures, patient outcomes, and response to treatment.
Discussion
Previous gene expression and DNA methylation-based deconvolution approaches for TME cell composition have had some success for major cell types [
16,
17,
35]. However, due to the across-tumor-type diversity and within-tumor-type heterogeneity of the TME, substantial gaps still exist in tumor type specificity, cell projection accuracy, and cell-type resolution for TME deconvolution. Here, we present
HiTIMED, optimized to more accurately, specifically, and exhaustively deconvolve the TME.
HiTIMED has three major advantages compared to the existing algorithms: high cell-type resolution, tumor-specific libraries, and cell-projection accuracy optimization. Firstly,
HiTIMED provides high-resolution profiling of the cell types in TMEs. Seventeen cell types in total among 3 TME components (tumor, immune, angiogenic) are projected by
HiTIMED. In the immune microenvironment, closely related lymphocyte subtypes, including subtypes of CD4T and CD8T cells, and granulocyte subtypes are captured by
HiTIMED. In the angiogenic/non-immune microenvironment, epithelial, endothelial, and stromal cells are profiled by
HiTIMED separately as their roles in TME could be functionally very different. Furthermore, numerous variables from
HiTIMED predicted cell types offer more opportunities to study the associations between TMEs and clinically relevant outcomes. For instance, studies have demonstrated CD8mem to Treg ratio as an indicator of the immune balance between cytotoxic and regulatory immunity, corresponding to the immunotherapy response [
36‐
38]. Also, DC to NK ratio was studied in a mouse colon cancer model to enhance the antitumor effect as DC plays a crucial role in NK cell activation [
39]. The high resolution of
HiTIMED projection provides novel opportunities to exploit the cellular composition of the TME to discern patient prognosis and response to therapy. Although it can be argued that single-cell RNA sequencing technologies can offer a similar resolution of cell profiling in TME, DNA methylation-based deconvolution is immensely more cost-effective, less laborious, and is amenable to archival biospecimens where cells are no longer intact. Secondly,
HiTIMED uses DNA methylation signatures that are specific to tumor type. Most of the existing methods developed a universal reference library for all types of tumors [
16,
40]. Although it is possible to estimate tumor purity with a signature that captures generalizable DNA methylation changes across all tumor types, the use of tumor-specific DNA methylation signatures maximizes the power to detect the most differentially methylated CpGs, as tumors differ greatly, both genetically and epigenetically, by tumor type. Although one algorithm has developed multiple libraries based on tumor type, cell lines were used rather than primary tumors [
17]. Studies have shown consistently differential DNA methylation profiles between cancer cell lines and primary tumor samples [
18,
41]. Finally,
HiTIMED optimizes cell projection accuracy by employing a novel hierarchical model for deconvolution. At high deconvolution resolution, inevitable noise can bias estimates for cells of the same or closely related lineages. The hierarchical model enhances the projection of the primary cell types in the specific lineage niche in a stepwise manner. For example, Library L3A in
HiTIMED was designed to target angiogenic microenvironment deconvolution. As a result, the library collapsed all immune cells into one group but separated epithelial, endothelial, and stromal cells for optimal discernment. Although tumor purity and major immune cells were validated for accuracy in the previously existing methods, unlike
HiTIMED, extensive deconvolution of immune cell types has not been validated in other methods [
16,
17].
Understanding the TME with a standardized and cost-effective approach enables precision medicine. Studies have demonstrated TME's association with chemo- and immunotherapy responses and prognosis [
1,
42,
43]. The balance between cytotoxic and regulatory immunity dictates tumor behavior in the immune microenvironment [
36]. When the balance favors cytotoxic immunity, tumor elimination is promoted. On the contrary, tumor escape is facilitated when the balance tips toward regulatory immunity. CD8T cells are one of the cytotoxic representatives, whereas Tregs are a proxy for regulatory immunity [
36]. Studies have shown the CD8T to Treg ratio as a significant biomarker for chemo- and immunotherapy responses [
36,
38]. Our analyses with
HiTIMED on TCGA showed better 5-year survival rates with higher CD8T memory cell levels in lung adenocarcinoma and better long-term survival in liver hepatocellular carcinoma, head and neck squamous cell carcinoma, and endocervical adenocarcinoma, which are consistent with its cytotoxic role in anti-tumoral activities. In kidney clear cell renal cell carcinoma, a higher level of Treg is associated with a worse survival outcome, indicating its role in immunosuppression [
36]. Interestingly, in endometrial carcinoma, we observed significantly better survival with a higher level of Treg. This finding is consistent with a previous report on Treg being beneficial for survival in endometrial carcinoma [
44]. The impact of Treg in cancer survival varies greatly by tumor site, suggesting differential physiological functions and roles of Tregs in different tumor types [
45]. Based on TME composition, immune hot tumors are defined as tumors with a high level of immune cell infiltration and, thus, more likely to respond to immunotherapy [
6,
42]. In our analysis, the unsupervised dichotomous classification of TCGA tumors by
HiTIMED immune projection demonstrated the potential identification of immune hot and cold tumors. Future supervised training on paired data on immunotherapy response with
HiTIMED immune projections could allow tumors to be rated systematically for their likelihood of immunotherapy response.
The angiogenic microenvironment supports tumor proliferation and metastasis [
46]. The formation of new blood vessels relies heavily on endothelial and stromal cell proliferation [
7]. In our study, a higher level of endothelial and stromal cells identified by
HiTIMED was associated with worse survival rates in multiple cancers. Interestingly, in kidney clear cell renal cell carcinoma, a higher level of endothelial cells is beneficial for survival. This result is consistent with a single-cell analysis on kidney clear cell carcinoma, showing a better survival outcome in tumors with more endothelium [
47]. A unique role of endothelial cells in prognostication of survival and immunotherapy response in kidney clear cell renal cell carcinoma patients has been hypothesized [
47]. Worse 5-year survival outcomes were observed in multiple cancers for angiogenic hot tumors compared to angiogenic cold tumors in our analyses. Interestingly, immune hot and cold tumors were not significantly associated with 5-year survival after adjusting for age, gender, and tumor stage. Taken together, these data lead us to hypothesize that prognosis is more closely related to the angiogenic microenvironment than to the immune microenvironment.
The cell type heterogeneity in TME complicates epidemiological analyses of TME and clinical outcomes. The association between cell type prevalence in TME and patient survival has previously been studied primarily by counting certain cells in TME using immunohistochemical quantification [
29]. However, the cells in TME are dynamically interactive, making such analysis susceptible to other cell type confounders. The high resolution of
HiTIMED makes it possible to adjust for such cell type confounders. Further, traditional epigenome-wide association study (EWAS) analyses are susceptible to confounding by cell type heterogeneity. For instance, EWAS can identify valuable epigenetic biomarkers for early cancer detection and prognosis [
52]. However, the sensitivity and precision of identifying such biomarkers are compromised when the tissue cell heterogeneity is ignored [
53].
HiTIMED-projected cell composition in TME provides new opportunities for EWAS to unveil cell-type-independent epigenetic biomarkers in cancer. Our results clearly show that much of the vast DNA methylation dysregulation previously observed in tumors is attributable to cell heterogeneity. Further application of
HiTIMED cell estimates to models that identify tumor-specific DNA methylation is poised to enable a clearer understanding of early driver DNA methylation alterations in carcinogenesis and disease progression.
While
HiTIMED is a valid method for estimating cell proportions in the TME, with potential applications in cancer research, we recognize some limitations. First,
HiTIMED shows modest over-prediction for CD8T cells and under-prediction for CD4T cells, especially Tregs, in artificial mixtures. The
HiTIMED libraries were developed to optimize the deconvolution in specific tumor microenvironments. We posited that the bias we observed in artificial mixtures would be minimized in specifically targeted tumor types. However, the hypothesis is hard to examine without known cell composition in TME. When collapsing the subtypes of T cells,
HiTIMED is highly accurate, even in artificial mixtures. Also, immune cells are possibly reprogrammed by interaction with the TME. Thus, using normal cells as a reference may generate noise. Second, stomach adenocarcinoma showed the least methylation distinction between tumor and normal tissues compared to other carcinomas. This may be attributable to the heterogeneity of stomach tumor cell subtypes. Future work on stomach cancer subtype-specific libraries may be necessary for stomach TME deconvolution. Third, macrophages were not included in
HiTIMED. Macrophages are a highly heterogeneous cell type in the TME [
48,
49]. Our initial effort to include macrophages generated substantial noise that confounded estimates of other mononuclear cells. Future efforts to epigenetically define tumor-specific macrophages may help to address the issue. Fourth, only carcinomas are currently included in
HiTIMED, and future work is needed to add other complex tumor types. Fifth, tumor subclones are not captured by any existing deconvolution method. Future epigenetic data on tumor subclones may enable a more extensive deconvolution of the TME. Finally, the angiogenic/non-immune microenvironment profiled by
HiTIMED cannot distinguish normal and tumor-impacted epithelial, endothelial, and stromal cells. However, our
HiTIMED-profiled angiogenic microenvironment reflects angiogenesis globally, providing relevant information. Differentiating TME affected cells and normal cells may provide additional research avenues beyond the scope of this method.
Methods
Discovery data sets
For the discovery of our tumor TME deconvolution libraries, we used nine publicly available data sets from TCGA, Gene Expression Omnibus (GEO), and ArrayExpress, and two data sets from our laboratories available through GEO (GSE193297, GSE167998) that contain DNA methylation microarray data on 20 types of carcinomas and their matched normal, 12 types of purified immune cell, and three types of angiogenic cell (Additional file
1: Table S1) [
50‐
53]. Purified basophils, eosinophils, neutrophils, monocytes, B naïve cells, B memory cells, CD4 naïve cells, CD4 memory cells, T regulatory cells, CD8 naïve cells, and CD8 memory cells were isolated by cytometric and magnetic sorting and confirmed by flow cytometry. The artificial mixtures were generated from MACS-isolated and FACS-verified cells. The cells were purchased from AllCells® corporation (Alameda, CA, USA), StemExpress (Folsom, CA, USA), and STEMCELL Technologies (Vancouver, BC, Canada). The donors were anonymous and healthy and included 41 males and 15 females, with a mean age of 32.2 years (SD = 12.2) and multiple ethnicities, including African-American, East Asian, Indo-European, and multiple/admixed. For more details on sample information and preparation, please refer to our previous publication [
12]. Dendritic cells used in this study were monocyte-derived dendritic cells from healthy human blood donors. First, PBMCs were isolated from buffy coat cells by Ficoll density gradient centrifugation. Next, the CD14 cells were purified using immunomagnetic purification. Finally, the cells were incubated for 5 days with 500 U/ml human granulocyte-macrophage colony-stimulating factor (hGM-CSF) (PeproTech, Rocky Hill, NJ) and 1,000 U/ml human interleukin 4 (hIL-4) (PeproTech, Rocky Hill, NJ). More details on the protocol and procedure can be found in [
54] and [
55]. Because the discovery data sets contain either Illumina HumanMethylation450k or HumanMethylationEPIC array data, we retained only CpGs common to both platforms to ensure the applicability of the library. Furthermore, cross-reactive probes, SNP-related probes, sex chromosome probes, and non-CpG probes were masked in the analysis. After this process, 384,640 CpGs were retained. The
SeSAMe pipeline from Bioconductor was used to preprocess the data, including data normalization and quality control [
56]. Probes with low-quality data (pOOBAH > 0.05) in more than 20% of samples within a tissue type were removed for quality control.
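The probe-level quality filter described above can be sketched as follows. This is a minimal illustration in plain Python and not the SeSAMe implementation; the function name `filter_probes` and the dictionary input layout are assumptions made only for this example.

```python
def filter_probes(pvals_by_probe, p_cutoff=0.05, max_bad_fraction=0.20):
    """Keep probes whose share of low-quality samples does not exceed the limit.

    pvals_by_probe: hypothetical layout, dict mapping a probe ID to the list of
    detection p-values (one per sample) within a single tissue type.
    """
    kept = []
    for probe, pvals in pvals_by_probe.items():
        # A measurement counts as low quality when its p-value exceeds the cutoff.
        n_bad = sum(1 for p in pvals if p > p_cutoff)
        # Remove the probe only when MORE than 20% of samples are low quality.
        if n_bad / len(pvals) <= max_bad_fraction:
            kept.append(probe)
    return kept
```

A probe with exactly 20% low-quality samples is retained under this reading of "over 20%".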
HiTIMED development
Due to the complexity and cell heterogeneity of TME, we propose a novel, tumor-type-specific hierarchical model to develop libraries with optimized accuracy for cell projection. In each tumor type, six layers of libraries were developed to hierarchically project cell proportions in first, tumor; second, angiogenic; and third, immune microenvironments (Fig.
1). The InfiniumPurify pipeline was employed to estimate tumor purity [
19]. The method identifies the top 1000 informative differentially methylated CpG (iDMC) sites between tumor and normal samples by rank-sum test and requires that the variances of their beta values be greater than 0.005 in tumor samples. The number 1000 was selected by comparing performance across different numbers of iDMCs (50, 100, 200, 500, 1000, 3000, 5000, 10,000, 15,000, 20,000, 30,000, 40,000). Performance was evaluated by correlating iDMC-estimated purity with ABSOLUTE purity [
25], a somatic copy-number-based tumor purity estimate, in lung adenocarcinoma [
19]. iDMCs were separated into hyper- and hypo-methylated groups based on their mean beta values in tumor and normal samples. The beta values for hypermethylated iDMCs remained unchanged, whereas the hypomethylated iDMC beta values were transformed to 1-beta. Density estimation with a Gaussian kernel was applied to the transformed iDMC beta values. The estimated purity is the mode of the density function. More details on the
InfiniumPurify pipeline can be found in [
19]. In our study, we updated the pipeline by identifying tumor-type-specific iDMCs. Briefly, instead of using a universal set of iDMCs to estimate tumor purity for all tumor types, we developed iDMCs specifically for each carcinoma type included in the study. Epithelial, endothelial, stromal, basophil, eosinophil, neutrophil, monocyte, dendritic, B naïve, B memory, CD4 naïve, CD4 memory, T regulatory, CD8 naïve, and CD8 memory cell proportions were estimated using the constrained projection/quadratic programming approach developed by Houseman et al. [
21]. Libraries for specific cell types were developed using
limma linear regression with empirical Bayes adjustment statistics in
Meffil [
20] to reduce methylation profiles to the top 100 cell-type-specific hyper- and hypo-methylated CpGs. The number 100 was selected by comparing performance across different numbers of cell-type-specific CpGs (50, 100, 200, 500, 1000). Performance was evaluated by calculating cell-type-specific absolute error and overall absolute error in colon adenocarcinoma (Additional file
2: Figure S17). The overall absolute error was minimal when using the 50-CpG library; however, that library had the worst performance for CD4 memory cells and eosinophils. To balance performance across all cell types, we decided to use the 100-CpG library. The overall absolute error for the 100-CpG library was only 0.2% higher than that of the 50-CpG library, and, unlike the 50-CpG library, the 100-CpG library did not have the worst performance for any cell type. More details on the hierarchical library construction can be found in the Results section and Fig.
1.
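The purity step described above (keep hypermethylated iDMC beta values, transform hypomethylated ones to 1-beta, then take the mode of a Gaussian kernel density estimate) can be sketched as below. This is an illustrative sketch and not the InfiniumPurify code: the function names, the normal-reference bandwidth rule (1.06 · SD · n^(-1/5)), and the evaluation grid on [0, 1] are assumptions of the example.

```python
import math
from statistics import stdev

def kde_mode(values, bandwidth=None, grid_size=512):
    """Mode of a Gaussian kernel density estimate evaluated on a grid over [0, 1].

    Requires at least two values. The bandwidth rule is an assumption of this sketch.
    """
    n = len(values)
    if bandwidth is None:
        bandwidth = 1.06 * stdev(values) * n ** (-1 / 5)
    if bandwidth <= 0:
        bandwidth = 0.01  # fallback when all values are identical

    def density(x):
        # Unnormalised density; the constant factor does not change the mode.
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=density)

def estimate_purity(tumor_betas, idmc_is_hyper):
    """Purity of one tumor sample from its beta values at the selected iDMCs.

    idmc_is_hyper: one boolean per iDMC, True if hypermethylated in tumor.
    """
    transformed = [b if hyper else 1.0 - b
                   for b, hyper in zip(tumor_betas, idmc_is_hyper)]
    return kde_mode(transformed)
```

For example, a sample with hypermethylated iDMCs near 0.7 and hypomethylated iDMCs near 0.3 yields a purity estimate close to 0.7, because both groups map to the same region after the transformation.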
Validation of HiTIMED projections
HiTIMED predicted tumor cell proportions were compared to the estimated tumor purity from major existing methods, including methylation-based InfiniumPurify [19], MethylCIBERSORT [17], MethylResolver [16], and LUMP [23]; gene expression-based ESTIMATE [24]; somatic copy-number-based ABSOLUTE [25]; image stain-based IHC [26]; and a consensus measurement of purity estimations (CPE) [26], using TCGA tumor data. One additional data set of high-grade serous ovarian cancer was added due to the limited ovarian cancer sample size in TCGA (Additional file
1: Table S2) [
57]. Tumor type stratified comparison between
HiTIMED tumor proportion and
InfiniumPurify tumor purity was conducted with Pearson's correlation coefficient, and the p-value was reported. Pairwise pan-cancer comparisons of tumor projections between methods were performed across
HiTIMED, MethylCIBERSORT, MethylResolver, CPE, ESTIMATE, LUMP, IHC, and
ABSOLUTE, with r and p-value reported. We applied
HiTIMED to 12 artificial mixture samples with 12 predefined immune cell proportions (Additional file
1: Table S2). Root mean square error (RMSE), R, and p-value were calculated for each of the 12 immune cell types by contrasting the
HiTIMED cell estimates versus each sample's known ground truth proportion. To validate the angiogenic/non-immune microenvironment projection,
HiTIMED was applied to publicly available normal human intestinal epithelium [
27] and human umbilical vein endothelial cells [
28] (Additional file
1: Table S2). Mean and standard deviation of
HiTIMED predicted epithelial proportion and endothelial proportion were reported for normal human intestinal epithelium and human umbilical vein endothelial cells, respectively.
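The agreement metrics used for the artificial mixtures can be illustrated with a short sketch. The helper names `rmse` and `pearson_r` are hypothetical, and the p-value is omitted because it requires a t-distribution routine outside the Python standard library.

```python
import math

def rmse(predicted, truth):
    """Root mean square error between predicted and known proportions."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, truth)]
    return math.sqrt(sum(squared) / len(squared))

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up proportions (%) for one cell type across four mixture samples.
truth = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 41.0]
print(round(rmse(predicted, truth), 3), round(pearson_r(predicted, truth), 3))
```

In the study design described above, such metrics would be computed once per immune cell type across the 12 mixture samples.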
HiTIMED deconvolution compared to MethylCIBERSORT and MethylResolver
A Venn diagram was used to compare the cell types in the tumor microenvironment that can be captured by HiTIMED, MethylCIBERSORT, and MethylResolver. All three methods were applied to the 12 immune cell artificial mixture samples for performance comparison. For cell types that can be estimated by all three methods, performance was compared by cell type and with all cells pooled. The error rate was calculated as \(\text{Predicted proportion}\,(\%) - \text{True proportion}\,(\%)\). The absolute error rate was calculated as \(\left|\text{Predicted proportion}\,(\%) - \text{True proportion}\,(\%)\right|\).
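A small sketch of the two error definitions, applied per cell type and pooled across cell types, is given below. The dictionary layout and function names are assumptions for illustration, and the numbers in the usage note are invented.

```python
def error_rates(predicted, truth):
    """Signed and absolute errors (percentage points) per cell type.

    predicted, truth: hypothetical layout, dict mapping a cell type to a list of
    proportions (%) across the same mixture samples.
    """
    signed = {c: [p - t for p, t in zip(predicted[c], truth[c])] for c in truth}
    absolute = {c: [abs(e) for e in errs] for c, errs in signed.items()}
    return signed, absolute

def mean_absolute_error(absolute, cell_types=None):
    """Mean absolute error for selected cell types, or pooled over all of them."""
    selected = list(absolute) if cell_types is None else cell_types
    pooled = [e for c in selected for e in absolute[c]]
    return sum(pooled) / len(pooled)
```

With predicted NK proportions of 12% and 8% against a true 10% in both samples, the signed errors are +2 and -2 and the mean absolute error is 2 percentage points.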
Statistical analysis of the variation of TMEs and survival in TCGA samples
In TCGA samples, variances of immune and angiogenic microenvironments were calculated per tumor type. Tumor types were ranked by the variance of the immune microenvironment and angiogenic microenvironment, respectively, to demonstrate the across-tumor-type variation of TMEs. Ovarian cancer was removed from this analysis due to the limited sample size with survival information. Major immune cells (Bmem, CD8mem, DC, Tregs) and angiogenic cells (epithelial, endothelial, stromal) were investigated for 5-year survival outcomes by comparing the higher-than-median group with the lower-than-or-equal-to-median group across tumors, using Cox proportional hazard models adjusted for age, gender, tumor proportion, tumor stage, and other cell-type proportions (Treg, Bmem, DC, CD8mem, epithelial, endothelial, stromal). Two Cox models, with and without cell-type adjustment, were compared in clear cell renal cell carcinoma as sensitivity analyses. Gender-specific cancer types and cancer types without tumor stage information were excluded from the survival analysis. Schoenfeld residuals were used to test the proportional hazard assumption for Cox models. To ensure that the proportional hazard assumption was not violated in the Cox models, tumor stage was stratified into high stage and low stage in lung adenocarcinoma, and age was stratified into ten groups in the bladder carcinoma data set.
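Two preparatory steps from this paragraph, ranking tumor types by variance and dichotomizing a cell proportion at its median, can be sketched as follows. The Cox model itself is not shown because it needs a survival analysis library; the function names and input layout are assumptions of the example.

```python
from statistics import median, variance

def rank_by_variance(proportions_by_tumor_type):
    """Tumor types ordered from the highest to the lowest sample variance.

    proportions_by_tumor_type: hypothetical layout, dict mapping a tumor type
    to the list of a microenvironment proportion across its samples.
    """
    return sorted(proportions_by_tumor_type,
                  key=lambda t: variance(proportions_by_tumor_type[t]),
                  reverse=True)

def median_split(values):
    """1 for values above the median, 0 for values at or below it."""
    m = median(values)
    return [1 if v > m else 0 for v in values]
```

The resulting 0/1 indicator would then enter the survival model as the exposure of interest, alongside the adjustment covariates listed above.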
Classification of immune and angiogenic hot/cold tumors and survival in TCGA samples
With the high resolution of
HiTIMED predicted cell types, immune hot and cold tumors were classified using the consensus partitioning around medoids (PAM) clustering method based on
HiTIMED projected granulocyte, mononuclear, T cell, B cell, and NK cell proportions in TCGA samples. Similarly, consensus PAM clustering was used to classify angiogenic hot and cold tumors based on
HiTIMED projected epithelial, endothelial, and stromal cell proportions. Multivariable linear regression adjusting for age, gender, and tumor type, was used to compare
HiTIMED projected cell proportions between immune/angiogenic hot and cold tumors. Cox proportional hazard models adjusted for age, gender, and tumor stage were applied to investigate the survival outcomes in immune hot vs. cold tumors and angiogenic hot vs. cold tumors. Gender-specific cancer types and cancer types without tumor stage information were excluded from this analysis. The proportional hazards assumption of all models was checked using the Schoenfeld residuals test. Log-rank tests were used to test survival differences among the four groups of tumor clusters generated by combining the immune and angiogenic hot and cold classifications. The Student's t-test was used to compare
HiTIMED immune cells between immune subtyped C2 and C6 tumors [
30].
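To make the two-group clustering idea concrete, the sketch below finds the pair of medoids that minimizes the k-medoids objective targeted by PAM. It is a deliberate simplification: it uses exhaustive search (feasible only for small inputs) rather than PAM's build and swap steps, it omits the consensus resampling step, and the Euclidean distance and the rule that the cluster with the higher mean total proportion is called "hot" are assumptions of this example rather than details stated in the text.

```python
import math
from itertools import combinations

def two_medoid_clusters(points):
    """Exhaustively pick the two medoids minimising total distance to the nearest medoid."""
    best = None
    for i, j in combinations(range(len(points)), 2):
        cost = sum(min(math.dist(p, points[i]), math.dist(p, points[j]))
                   for p in points)
        if best is None or cost < best[0]:
            best = (cost, i, j)
    _, i, j = best
    # Assign every sample to its nearest medoid (ties go to the first medoid).
    labels = [0 if math.dist(p, points[i]) <= math.dist(p, points[j]) else 1
              for p in points]
    return (i, j), labels

def label_hot_cold(points, labels):
    """Call the cluster with the higher mean total proportion 'hot' (assumed rule).

    Assumes both clusters are non-empty, which holds when the points are not all identical.
    """
    totals = {0: [], 1: []}
    for p, lab in zip(points, labels):
        totals[lab].append(sum(p))
    hot = max(totals, key=lambda lab: sum(totals[lab]) / len(totals[lab]))
    return ["hot" if lab == hot else "cold" for lab in labels]
```

Each point here would be one tumor's vector of projected proportions, for example granulocyte, mononuclear, T cell, B cell, and NK cell proportions for the immune classification.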
Models comparing methylation profile between colon adenocarcinoma and adjacent normal samples
Three models were generated to identify differentially methylated CpGs (DMCs) between colon adenocarcinoma and normal adjacent tissues. Model 1 adjusted for age and gender. Model 2 adjusted for age, gender, and HiTIMED-projected tumor purity. Model 3 adjusted for age, gender, HiTIMED-projected tumor purity, DC, CD8mem, Bmem, Treg, epithelial, endothelial, and stromal cell proportions. A delta beta larger than 0.3 and an FDR smaller than 0.01 were used as the cut-offs for identifying statistically significant DMCs. For each model, heatmaps were generated with Manhattan distance clustering and annotated by colon cancer CIMP subtype.
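The DMC cut-offs can be illustrated with the sketch below. Two points are assumptions of the example and are not stated in the text: the FDR is computed with the Benjamini-Hochberg procedure, and the delta-beta threshold is applied to the absolute value so that both hyper- and hypomethylated CpGs can pass. The function names are hypothetical.

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_dmcs(cpg_ids, delta_betas, pvals, delta_cutoff=0.3, fdr_cutoff=0.01):
    """CpGs passing both the effect-size and the FDR threshold."""
    fdr = benjamini_hochberg(pvals)
    return [cpg for cpg, d, q in zip(cpg_ids, delta_betas, fdr)
            if abs(d) > delta_cutoff and q < fdr_cutoff]
```

Running the same selection on the p-values and delta betas from each of the three models would give the per-model DMC sets that are compared in the analysis.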
Statistical analysis on TMEs in drug-sensitive and resistant mCRC and recurrent TNBC
TMEs in drug-sensitive and resistant metastatic colorectal cancer (mCRC) and recurrent triple-negative breast cancer (TNBC) were deconvolved using HiTIMED. Student's t-tests were used to compare the means of the 17 HiTIMED projected cell types between first-line chemotherapy drug-sensitive and resistant mCRC tumors. Similarly, Student's t-tests were used to compare the means of the 17 HiTIMED projected cell types between recurrent and non-recurrent TNBC tumors in the chemotherapy-treated arm and the non-chemotherapy-treated arm after locoregional therapy, respectively.
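For reference, the test statistic of a two-sample Student's t-test with pooled variance can be sketched as below. The function name is hypothetical, and the p-value is left out because it requires a t-distribution routine that the Python standard library does not provide.

```python
import math
from statistics import mean, variance

def student_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df
```

In the comparisons described above, `group_a` and `group_b` would hold one projected cell-type proportion in, for instance, drug-sensitive and drug-resistant tumors, and the calculation would be repeated for each of the 17 cell types.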