Background
Hepatocellular carcinoma (HCC) is the most common primary liver cancer, accounting for 75–85% of cases. In 2018, HCC was also the sixth most commonly diagnosed cancer and the fourth leading cause of cancer-related death globally [1]. Only hepatic resection and liver transplantation are considered potentially curative treatments for HCC. However, most patients are diagnosed at a late stage, and the treatment rate for early-stage patients is disappointingly low [2]. The tumour immune microenvironment (TIME) plays an essential role in tumorigenesis, tumour development, and treatment outcome [3–5]. In recent years, immunotherapy has emerged as a promising strategy for cancer treatment, but only a few HCC patients respond to it. Therefore, systematic analysis of the function of the various types of intratumour immune cells might contribute to the development of novel biomarkers of prognosis and therapeutic effectiveness for patients with HCC.
With the rapid development of next-generation sequencing technologies, an increasing number of studies have examined gene expression in HCC using RNA sequencing (RNA-seq). However, RNA-seq is typically performed in “bulk”, so the data represent the average gene expression of a large number of cells [6]. In contrast, single-cell RNA sequencing (scRNA-seq) characterizes individual immune cells or tumour cells [7]. scRNA-seq reveals intratumour heterogeneity and distinct subpopulations, and makes it possible to enumerate and quantify immune infiltration in tumour tissues [8, 9]. Importantly, the heterogeneous make-up of immune cell infiltrates is a key factor for therapy response and prognosis in HCC and other tumour types [10–14]. Unfortunately, scRNA-seq is relatively expensive, so only a limited number of datasets are available. Nevertheless, the information from scRNA-seq can be used to explore the characteristics of each cell subpopulation in bulk samples and the interactions between cells in the TIME [15–17].
In the present study, distinct cell subpopulations in tumour tissues and normal control tissues were identified from HCC scRNA-seq datasets in the Gene Expression Omnibus (GEO) database. The weighted gene coexpression network analysis (WGCNA) algorithm was used to explore the coexpression network and the key modules most closely related to tumours, based on The Cancer Genome Atlas (TCGA) expression profiles of tumour and normal samples. By integrating the scRNA-seq and bulk RNA-seq data, we screened the key genes related to immune cell subsets in HCC. Next, we employed univariate Cox and least absolute shrinkage and selection operator (LASSO) Cox regression to construct a risk model, which showed potential as a prognostic biomarker and as a predictor of immunotherapeutic efficacy for patients with HCC.
Methods
Data source and preprocessing
The HCC scRNA-seq dataset GSE149614 was downloaded from the GEO database and included samples from 10 primary tumours (PT), 2 portal vein tumour thrombi (PVTT), 1 metastatic lymph node (MLN) and 8 normal liver tissues (NLT). The original data contained a total of 25,479 genes and 71,915 cells. The percentages of mitochondrial and rRNA transcripts were calculated with the PercentageFeatureSet function. Cells were retained if they expressed more than 500 and fewer than 8000 genes, had a mitochondrial content below 30%, and contained at least 500 UMIs; the selection criteria are shown in Additional file 2: Figure S1. After filtering, 71,139 cells remained.
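The filtering criteria above can be sketched as a simple per-cell predicate. This is a minimal illustration in plain Python, not the Seurat code used in the study; the `CellQC` structure and the toy cells are hypothetical, and only the four thresholds are taken from the text.

```python
from dataclasses import dataclass

# Thresholds stated in the text.
MIN_GENES = 500        # a cell must express more than this many genes
MAX_GENES = 8000       # ... and fewer than this many genes
MAX_MITO_PCT = 30.0    # mitochondrial percentage must be below this value
MIN_UMIS = 500         # a cell must contain at least this many UMIs


@dataclass
class CellQC:
    """Per-cell QC summary (hypothetical structure, for illustration only)."""
    barcode: str
    n_genes: int
    n_umis: int
    mito_pct: float


def passes_qc(cell: CellQC) -> bool:
    """Return True if the cell satisfies all filtering criteria."""
    return (
        MIN_GENES < cell.n_genes < MAX_GENES
        and cell.mito_pct < MAX_MITO_PCT
        and cell.n_umis >= MIN_UMIS
    )


# Toy cells (invented values) exercising each criterion.
cells = [
    CellQC("cell_a", n_genes=1200, n_umis=3500, mito_pct=4.2),
    CellQC("cell_b", n_genes=450, n_umis=900, mito_pct=2.0),      # too few genes
    CellQC("cell_c", n_genes=2500, n_umis=7000, mito_pct=41.0),   # mitochondrial content too high
    CellQC("cell_d", n_genes=9100, n_umis=60000, mito_pct=3.0),   # too many genes
]

kept = [c.barcode for c in cells if passes_qc(c)]
print(kept)
```

In this toy input only `cell_a` satisfies all four conditions.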
Public clinical data and gene expression information were retrieved from the TCGA database (https://portal.gdc.cancer.gov/) and the International Cancer Genome Consortium (ICGC) database (https://xena.ucsc.edu/). In total, 366 samples in the TCGA-LIHC cohort and 232 samples in the ICGC-JP cohort were used for further analysis.
scRNA-seq data clustering and dimension reduction
First, we normalized the merged data by log-normalization and identified the top 2000 highly variable genes with the FindVariableFeatures function, using the variance-stabilizing transformation (“vst”) method. All genes were then scaled with the ScaleData function, and principal component analysis (PCA) was run on the 2000 highly variable genes with the RunPCA function. We set dim = 50 and clustered the cells with the “FindNeighbors” and “FindClusters” functions (resolution = 0.1). Next, the top 50 principal components were used for further dimensionality reduction with the uniform manifold approximation and projection (UMAP) method. UMAP assumes that the available data samples are uniformly distributed on a manifold, and that this manifold can be approximated from the finite samples and projected into a low-dimensional space. In principle it is similar to t-SNE, which maps a high-dimensional probability distribution to a low-dimensional space; the resulting low-dimensional representation can also serve as input features for other classification models. Finally, we used the FindAllMarkers function to screen the marker genes of the 25 subgroups with logfc = 0.5 (minimum log fold change) and Minpct = 0.35 (minimum fraction of cells expressing the gene), and retained the marker genes with a corrected p < 0.05.
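As an illustration of the first step, the log-normalization of a single cell can be sketched as follows. The sketch assumes the usual Seurat "LogNormalize" definition (counts divided by the cell total, multiplied by a scale factor, then transformed with log1p); the scale factor of 10,000 is the Seurat default and is an assumption here, since the text does not state the value used. The function name and the toy counts are hypothetical.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Log-normalize the raw counts of one cell.

    Each count is divided by the total count of the cell, multiplied by
    scale_factor, and transformed with the natural log1p.
    scale_factor=10_000 is an assumed default, not a value from the text.
    """
    total = sum(counts)
    if total == 0:
        # An empty cell has no expression to normalize.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]


# Toy UMI counts for four genes in one cell (invented values).
cell_counts = [0, 3, 1, 6]
normalized = log_normalize(cell_counts)
print([round(v, 3) for v in normalized])
```

Genes with zero counts stay at zero, and the ordering of genes within a cell is preserved, which is why the transformation is safe to apply before selecting highly variable genes.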
Cell-type identification by estimating relative subsets of RNA transcripts (CIBERSORT)
CIBERSORT is a method that estimates the relative proportions of various cell subsets in tissues from an input gene expression matrix [18, 19]. Here, we used CIBERSORT analysis to compare the abundance of various immune cells between groups. Spearman correlation analysis was performed to explore the association between the risk score and infiltrating immune cells. The “ggplot2” package was used to visualize the differences in immune cell abundance and the results of the correlation analysis.
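The Spearman correlation between the risk score and an immune cell fraction is the Pearson correlation of their ranks. A minimal from-scratch sketch is given below; it is not the code used in the study, the function names are hypothetical, and the risk scores and cell fractions are invented toy values rather than CIBERSORT output.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (invented): risk scores and one immune cell fraction per sample.
risk_scores = [0.2, 0.5, 0.9, 1.4, 2.0]
cell_fraction = [0.05, 0.08, 0.07, 0.15, 0.30]

rho = spearman(risk_scores, cell_fraction)
print(round(rho, 3))
```

Because only ranks enter the calculation, the coefficient is insensitive to the scale of the risk score and to monotonic transformations of the cell fractions.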
Least absolute shrinkage and selection operator (LASSO)
To construct the prognostic model, univariate Cox regression was used to identify the genes that were significantly correlated with survival outcome, and LASSO analysis was then employed to select reliable predictors [20]. The risk score of each patient in the TCGA and ICGC databases was calculated using the formula $\text{risk score} = \sum_{n} \text{coefficient}_{\text{mRNA}_n} \times \text{expression level}_{\text{mRNA}_n}$. Then, the correlation between the risk score and the prognosis of patients was further analysed.
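The risk score formula is a weighted sum of expression levels, which can be sketched as follows. The coefficients and expression values below are hypothetical placeholders, since the fitted LASSO Cox coefficients are not reported in this section, and splitting patients at the median score is an assumption made only for illustration.

```python
from statistics import median

# Hypothetical coefficients for illustration only; the fitted LASSO Cox
# coefficients are not given in this section.
coefficients = {"CSTB": 0.10, "TALDO1": 0.20, "CLTA": 0.30}


def risk_score(expression, coefs):
    """Sum of coefficient * expression level over the model genes."""
    return sum(coefs[gene] * expression[gene] for gene in coefs)


# Toy expression profiles (invented values) for three patients.
patients = {
    "P1": {"CSTB": 1.0, "TALDO1": 2.0, "CLTA": 3.0},
    "P2": {"CSTB": 4.0, "TALDO1": 1.0, "CLTA": 0.5},
    "P3": {"CSTB": 2.0, "TALDO1": 2.0, "CLTA": 2.0},
}

scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}

# Assumed cutoff: patients above the median score form the high-risk group.
cutoff = median(scores.values())
groups = {pid: "high" if s > cutoff else "low" for pid, s in scores.items()}

print(scores)
print(groups)
```

The same function can be applied unchanged to an independent cohort, which is how a score trained on one dataset is typically evaluated on another.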
Weighted gene co-expression network analysis (WGCNA)
WGCNA is an R software package for weighted correlation network analysis, including module identification, network generation, gene screening, calculation of network properties, and data visualization [21]. Here, we used the CIBERSORT algorithm to score each TCGA bulk RNA-seq sample with respect to each cell subgroup, since each bulk sample is itself a mixture of multiple cell types. We then performed WGCNA on these scores to screen for the gene modules most highly correlated with a given subgroup [22]. Highly similar modules were identified by cluster analysis, and the association between each module and intratumour cell subgroup abundance was assessed.
Tumour immune dysfunction and exclusion (TIDE)
To predict the response to immune checkpoint blockade, Jiang et al. developed the TIDE method, which models two mechanisms of tumour immune evasion, T cell dysfunction and T cell exclusion [23, 24]. The software is freely available online at http://tide.dfci.harvard.edu. In this study, we employed TIDE to evaluate patients' response to immunotherapy. A higher TIDE score indicates a higher likelihood of immune escape and a poorer response to immunotherapy for patients with HCC.
Cell culture and quantitative real-time PCR (qRT-PCR)
The HCC cell line SK-Hep-1 and the healthy human liver cell line L02 were obtained from the Chinese Academy of Sciences. The cells were maintained in Dulbecco’s modified Eagle’s medium supplemented with 10% foetal bovine serum (Wisent, Ottawa, ON, Canada) and 1% penicillin in a humidified atmosphere at 37 °C with 5% CO2. RNA was extracted from the L02 and SK-Hep-1 cell lines using TRIzol reagent (Invitrogen), and the RevertAid First-Strand cDNA Synthesis Kit (Thermo Fisher Scientific, Inc.) was used to synthesize cDNA. qRT-PCR analysis was performed using SYBR Green (Takara). The primer sequences are listed in Additional file 1: Table S1.
Statistical analysis
Prism 7.0 (GraphPad Software, CA, USA) and R version 3.5.2 were used for statistical analysis. Kaplan–Meier survival curves were generated for survival analysis with the survminer R package version 2.43–3. Student’s t-test was used to assess the significance of differences between groups. The glmnet R package was used for LASSO Cox regression analysis. A P-value < 0.05 indicated statistical significance (*P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001).
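For readers unfamiliar with the Kaplan–Meier method, the product-limit estimator behind the survival curves can be sketched in a few lines. This is a from-scratch illustration in plain Python and not the R code used in the study; the function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate.

    times  : follow-up time of each patient
    events : 1 if the event (death) was observed, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set.
        n_at_risk -= removed
    return curve


# Toy follow-up data (invented): months and event indicators.
times = [5, 8, 8, 12, 15]
events = [1, 1, 0, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients contribute to the risk set until their last follow-up but never produce a drop in the curve, which is the property that distinguishes this estimator from a simple proportion of survivors.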
Discussion
scRNA-seq has emerged as a useful tool for the transcriptional classification of cell types in various cancers. Here, we analysed HCC scRNA-seq data from the GEO database to define the cell subpopulations in tumours, and we found multiple subgroups of 5 cell types, including liver bud hepatic cells, CD4+ cytotoxic T cells, dendritic cells, Kupffer cells, and liver progenitor cells. Specifically expressed genes might serve as markers to identify cell subgroups in a large set of samples. In addition, we screened the key genes related to immune cell subsets in HCC and constructed a three-gene risk model that had excellent prognostic efficiency and might serve as a biomarker for immunotherapy response. Similarly, Liang et al. [22] used scRNA-seq to analyse the heterogeneity of tumour immune cells and established a risk model for predicting the prognosis of ovarian cancer patients. Zheng et al. [17] screened six hub genes related to prognosis from GEO oesophageal squamous cell carcinoma (ESCC) datasets and TCGA ESCC datasets, and the results of scRNA-seq showed that the expression of hub genes was significantly higher in normal tissues and cells. Further Kaplan–Meier survival analysis and immune infiltration analysis indicated that the hub genes were promising biomarkers for ESCC diagnosis and prognosis [17]. scRNA-seq was also adopted to decipher the cell-to-cell interactions inside gliomas, and the identified autocrine ligand-receptor signal pairs were found to significantly affect the prognosis of glioma patients [25]. Taken together, these findings indicate that scRNA-seq technology could help to effectively dissect the TIME and identify potential prognostic biomarkers.
Here, we performed differential analysis on gene expression data from the TCGA database. Three upregulated differentially expressed genes (DEGs), cystatin B (CSTB), transaldolase 1 (TALDO1) and clathrin light chain A (CLTA), which were marker genes of the monocyte cluster (C21), might be applied as potential biomarkers for immunotherapy. CSTB, a member of the cystatin superfamily, is an inhibitor of cysteine proteases. Dysregulated expression of CSTB has been reported to be involved in various cancers. For example, the expression of CSTB was increased in serum and might be an early-stage diagnostic biomarker for HCC [26] and ovarian epithelial tumours [27]. CSTB has also been reported to serve as a prognostic biomarker for bladder cancer [28], lung cancer and colorectal cancer [29, 30]. Wu et al. reported that the expression of TALDO1 was increased in upper tract urothelial carcinoma tissues and that upregulated TALDO1 expression was correlated with large tumour size, advanced stage, and distant metastases [31]. In addition, genetic polymorphisms in TALDO1 were closely correlated with squamous cell carcinoma of the head and neck [32]. A better understanding of the molecular mechanisms through which the three genes contribute to HCC pathogenesis is needed to validate the clinical applications of the model and to support the further development of novel diagnostic and prognostic biomarkers.
In this work, we jointly analysed scRNA-seq data and bulk RNA-seq gene expression profiles. The results both improve our understanding of the heterogeneity of the TIME at the single-cell level and provide a 3-gene model based on prognosis-related genes. Additionally, the research strategies used in this study might also be suitable for other cancers. However, this study has several limitations. First, the sample size was relatively small. Second, functional experiments are needed to clarify the underlying molecular mechanisms of the 3 genes. Third, the model was generated from HCC tissues, so it cannot be used to diagnose tumours at an early stage. In future studies, we plan to detect the expression of the three genes in circulating immune cells, which might help to increase the early diagnosis rate for HCC.
Conclusion
By integrating bulk RNA-seq and scRNA-seq data, we analysed the heterogeneity of the TIME at the single-cell level and constructed a 3-gene model that could accurately evaluate the survival outcome and immunotherapy response of patients with HCC.