Introduction
In December 2019, coronavirus disease 2019 (COVID-19) broke out in Wuhan, China, and rapidly spread throughout the world, inducing a worldwide panic [1]. The novel coronavirus was isolated from human airway epithelial cells and was named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2); it is highly infectious and causes high fatality [2–5]. The underlying pathogenic mechanism of SARS-CoV-2 has since been broadly explored [6]. Similar to SARS-CoV-1, SARS-CoV-2 uses angiotensin-converting enzyme 2 (ACE2) as the receptor for viral entry. After receptor binding, proteases that prime the spike (S) protein, such as cell surface transmembrane serine proteases (TMPRSSs) and endosomal cathepsins, mediate membrane fusion [2, 7]. However, these proteases are not prognostic predictors for COVID-19 patients, and such predictors may be important for therapeutic decision-making.
Generally, patients’ characteristics, nutritional status, clinical symptoms, comorbidities, inflammatory biomarkers and chest CT findings differ according to patient outcome; however, whether these factors can serve as prognostic predictors for COVID-19 pneumonia is not clear. Regarding chest CT images, consolidation, emphysema and residual healthy lung parenchyma are regarded as independent predictors in COVID-19 patients [8]. Additionally, the high-sensitivity C-reactive protein–albumin ratio (HsCAR), a low prognostic nutritional index (PNI) and the ratio of interleukin (IL)-6 to IL-10 were reported to be related to the prognosis of COVID-19 patients [9, 10]. Moreover, carcinoembryonic antigen (CEA) is a glycoprotein generated in the colonic epithelium during the embryonic period and has been widely used as a biomarker for tumorigenesis and tumor progression. CEA has also been reported to be associated with the prognosis of COVID-19 patients [11, 12]. However, the mechanism underlying the predictive roles of these factors is unknown, and other candidate predictors have yet to be identified.
The aim of this study is to identify novel predictors and their hypothetical mechanism in the infection and progression of COVID-19. We systematically collected and analyzed clinical information from hospitalized adult patients with COVID-19 pneumonia, including demographic, disease, treatment and outcome information, to identify potential prognostic indicators for COVID-19 pneumonia. Based on the identified predictors, a prognostic nomogram was established to guide clinical decision-making. Furthermore, to explore the underlying mechanism of candidate biomarkers, single-cell transcriptomics of bronchoalveolar lavage fluid (BALF) from patients with or without COVID-19 were analyzed with integrated bioinformatics methods.
Materials and methods
Patient selection and data extraction
This study was approved by the Ethics Committee of Jinyintan Hospital (KY-2020-58.01) and followed the Standards for Reporting of Diagnostic Accuracy Studies statement and the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement [13, 14]. A total of 300 hospitalized adult COVID-19 patients diagnosed by reverse transcription polymerase chain reaction (RT-PCR) in Wuhan Jinyintan Hospital from January 1, 2020, to April 30, 2020, were included in this retrospective study. The exclusion criteria were: (1) patients younger than 18 years; (2) non-hospitalized patients; (3) patients with a follow-up period of less than 60 days; (4) patients admitted for a reason other than COVID-19-related respiratory failure (because patients with specific malignancies could have increased CEA levels unrelated to COVID-19, all patients with primary malignancy were excluded from the study); (5) patients whose survival time, endpoint (overall survival), demographic information or treatment data were unknown; and (6) patients whose carcinoembryonic antigen (CEA) level at admission was unknown.
The clinical data in the study were retrieved from the electronic medical record system of Wuhan Jinyintan Hospital at initial admission, including demographic information (age at diagnosis and gender), symptoms (fever, cough, expectoration, shortness of breath and diarrhea), comorbidities (diabetes mellitus, hypertension, cardiovascular disease and cerebral infarction hypertension) and therapeutic information (use of glucocorticoid, imaging score, nasal catheter, high flow oxygen intake, ventilation). Additionally, laboratory indexes were collected, including CEA (ng/ml), albumin (g/l), hemoglobin (g/l), neutrophils (× 10^9/l), lymphocytes (× 10^9/l), C-reactive protein (CRP, mg/l), hypokalemia, hypocalcemia, hyponatremia, hyperkalemia and hypernatremia. The cutoff values of laboratory and imaging indexes were determined according to the normal values stipulated by the laboratory and imaging departments of Wuhan Jinyintan Hospital.
The survival time and overall survival status of each patient were retrieved. The endpoint of the present study was death from any cause, which represented the outcome and prognosis of patients in this study. Patients who were diagnosed after April 30, 2020, were excluded from the study.
Epidemiological statistical analysis
The retrospective study started with descriptive statistics: categorical variables were presented as percentages, while continuous variables were described as mean ± standard deviation (SD) if normally distributed and as median (range) otherwise. Two statistical approaches were applied to explore potentially significant predictors. As initial parametric or nonparametric tests, the Chi-square test was used to compare outcomes across categorical variables, and normally distributed continuous variables with homogeneous variance were compared by Student's t-test; otherwise, the Mann–Whitney U-test or Kruskal–Wallis H-test was used. In addition, Kaplan–Meier survival analysis was used to determine the prognostic value of each variable. Predictors that were statistically significant in both the parametric or nonparametric tests and the Kaplan–Meier survival analysis were then selected to construct the multivariate Cox proportional hazard model. The nomogram was established based on the multivariate model to predict the prognosis of COVID-19 patients. The significant prognostic factors in the multivariate Cox model were marked with asterisks (*) in the nomogram (*: P < 0.05; **: P < 0.01). The receiver operating characteristic (ROC) curve and the calibration curve were drawn to evaluate the discrimination and calibration of the nomogram.
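As an illustration of the Kaplan–Meier step described above, the following minimal Python sketch computes the product-limit survival estimate from follow-up times and event indicators. It is not the code used in the study (the analyses were run in R); the function name `kaplan_meier` and the toy cohort are hypothetical and serve only to show the calculation.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times  : follow-up time of each patient (e.g. days)
    events : 1 if death was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)   # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            # Product-limit update: multiply by the fraction surviving time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]  # remove deaths and censored patients at time t
    return curve


# Hypothetical toy cohort, not study data
times = [5, 8, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0, 0]
km_curve = kaplan_meier(times, events)
for t, s in km_curve:
    print(f"day {t}: S(t) = {s:.3f}")
```

In practice, the curves of two groups (for example, patients above and below the CEA cutoff) would be compared with a log-rank test before a variable is carried forward to the multivariate Cox model.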
Processing of single-cell RNA-seq data
Single-cell RNA sequencing (scRNA-seq) data of bronchoalveolar lavage fluid (BALF, accession no. GSE145926) [15] and peripheral blood mononuclear cells (PBMCs, accession no. GSE150728) [16] from COVID-19 patients and healthy volunteers were downloaded from the Gene Expression Omnibus (GEO). All BALF and PBMC samples were taken at the initial admission.
Preliminary processing of the single-cell RNA-seq data started with the Cell Ranger Single Cell Software Suite 3.3.1 (http://10xgenomics.com/). The paired-end FASTQ files were trimmed to remove the template switch oligo (TSO) sequence and the poly-A tail sequence. Then, the “cellranger count” command was used to quantify the clean reads, which were aligned to the hg38 human genome. The Seurat method was applied for integrated data analysis [17].
For quality control (QC), genes with an average read count greater than one that were expressed in at least three single cells were retained for further analysis. Cells with either fewer than 100,000 transcripts or fewer than 1,500 genes were filtered out.
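The QC rules above can be sketched as a simple filter over a raw count matrix. The Python example below is only an illustration under stated assumptions: the function name `qc_filter`, the dictionary layout and the toy counts are hypothetical, per-cell totals are assumed to be computed on the unfiltered counts, and the study itself performed this step within its R workflow. The default thresholds mirror the values quoted in the text; the toy call overrides the cell-level thresholds so that the small example is informative.

```python
def qc_filter(counts, min_cells=3, min_mean_count=1.0,
              min_transcripts=100_000, min_genes=1_500):
    """Apply gene-level and cell-level QC to raw counts.

    counts : dict mapping cell id -> dict mapping gene -> raw count
    Returns (kept_genes, kept_cells).
    """
    n_cells = len(counts)
    if n_cells == 0:
        return set(), []

    # Gene filter: mean count > min_mean_count and expressed in >= min_cells cells
    genes = {g for profile in counts.values() for g in profile}
    kept_genes = set()
    for g in genes:
        values = [profile.get(g, 0) for profile in counts.values()]
        n_expressing = sum(v > 0 for v in values)
        if n_expressing >= min_cells and sum(values) / n_cells > min_mean_count:
            kept_genes.add(g)

    # Cell filter: drop cells below either the transcript or the gene threshold
    # (assumption: totals are taken over the unfiltered counts)
    kept_cells = [
        cell for cell, profile in counts.items()
        if sum(profile.values()) >= min_transcripts
        and sum(v > 0 for v in profile.values()) >= min_genes
    ]
    return kept_genes, kept_cells


# Hypothetical toy matrix with relaxed cell thresholds, for illustration only
toy_counts = {
    "c1": {"A": 5, "B": 1, "C": 0},
    "c2": {"A": 4, "B": 0, "C": 0},
    "c3": {"A": 6, "B": 0, "C": 9},
    "c4": {"A": 1, "B": 0},
}
kept_genes, kept_cells = qc_filter(toy_counts, min_transcripts=5, min_genes=2)
print(kept_genes, kept_cells)
```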
In data processing, the variance stabilizing transformation (VST) method was first used to identify variable genes, which were input as initial features for principal component analysis (PCA) [17]. Then, the principal components (PCs) with P values < 0.05 in the jackstraw analysis were selected and used for uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE) to identify cell subclusters (resolution = 0.50) [18]. Only genes with |log2 fold change (FC)| > 0.5 and a false discovery rate (FDR) < 0.05 were identified as differentially expressed genes (DEGs) among cell subclusters. Feature plots and violin plots were used to illustrate the distribution and expression of DEGs, respectively. Additionally, scMatch [19], SingleR [20] and CellMarker [21] were used as references to define each cluster. Cell trajectory and pseudo-time analyses were performed with Monocle 2 [22]. Furthermore, 50 hallmark gene sets were retrieved from the Molecular Signatures Database (MSigDB) version 7.1 (https://www.gsea-msigdb.org/gsea/msigdb/index.jsp), and the gene set variation analysis (GSVA) algorithm was used to absolutely quantify the activity of signaling pathways in each single cell [23, 24]. Finally, the CellPhoneDB algorithm was used to identify cellular communication between pneumocytes and immune cells [25].
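The DEG criterion above (|log2 FC| > 0.5 and FDR < 0.05) can be illustrated with a short Python sketch. The text does not state which FDR procedure was applied; the Benjamini–Hochberg adjustment used here is an assumption, and the function names and per-gene statistics are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(stats, lfc_cutoff=0.5, fdr_cutoff=0.05):
    """stats: dict mapping gene -> (log2 fold change, raw P value)."""
    genes = list(stats)
    fdr = benjamini_hochberg([stats[g][1] for g in genes])
    return [g for g, q in zip(genes, fdr)
            if abs(stats[g][0]) > lfc_cutoff and q < fdr_cutoff]


# Hypothetical per-gene statistics for one cluster, for illustration only
toy_stats = {
    "GENE_A": (1.2, 0.001),
    "GENE_B": (-0.8, 0.01),
    "GENE_C": (0.3, 0.0005),   # significant but below the fold-change cutoff
    "GENE_D": (0.9, 0.2),      # large fold change but not significant
}
degs = select_degs(toy_stats)
print(degs)
```

Requiring both conditions keeps genes whose change is statistically supported and large enough to be of interest, which is why GENE_C and GENE_D are dropped in the toy example.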
Identification of the mechanism of abnormal CEA expression in COVID-19 patients
First, the distribution and expression of CEA-related genes (CRGs), including CEACAM1, CEACAM3, CEACAM4, CEACAM5, CEACAM6, CEACAM7, CEACAM8, CEACAM16, CEACAM18, CEACAM19, CEACAM20, CEACAM21, CEACAMP1, CEACAMP2, CEACAMP3, CEACAMP4, CEACAMP5, CEACAMP6, CEACAMP7, CEACAMP8, CEACAMP9, CEACAMP10, CEACAMP11 and CEACAM22P, were visualized by feature plots and violin plots in the BALF and PBMC scRNA-seq data. Then, co-expression (correlation) analysis was performed between CRGs and the 50 hallmark gene sets to identify potential downstream pathways. The CellPhoneDB algorithm was used to characterize the cellular communication between cells with high CRG expression and other cells. In addition, two further datasets, scRNA-seq data of acute lung injury (ALI) mouse lung (GSE134383) and idiopathic pulmonary fibrosis (IPF) mouse lung (E-HCAD-14), were downloaded to evaluate the distribution and expression of CRGs, key receptor–ligand pairs of cellular communication and potential downstream pathways [26–30].
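The co-expression step above amounts to correlating the per-cell expression of a CRG with per-cell pathway activity scores and ranking the pathways by the strength of that correlation. The Python sketch below illustrates this under assumptions: the text does not specify the correlation coefficient, so Pearson correlation is used here as one possible choice, and the function names, pathway labels and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # a constant vector has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def rank_pathways(gene_expr, pathway_scores):
    """Rank pathways by absolute correlation with one gene's expression."""
    corr = {}
    for name, scores in pathway_scores.items():
        r = pearson(gene_expr, scores)
        if r is not None:
            corr[name] = r
    return sorted(corr.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical per-cell values (four cells), for illustration only
crg_expression = [0.0, 1.0, 2.0, 3.0]
toy_pathway_scores = {
    "PATHWAY_A": [0.1, 0.2, 0.3, 0.4],
    "PATHWAY_B": [0.9, 0.1, 0.8, 0.2],
}
ranked = rank_pathways(crg_expression, toy_pathway_scores)
for name, r in ranked:
    print(f"{name}: r = {r:.3f}")
```

A rank-based coefficient such as Spearman's could be substituted when expression values are sparse or strongly skewed, as is common in single-cell data.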
Statistical analysis
Only two-sided P values lower than 0.05 were considered statistically significant. All statistical analyses were performed with R version 3.6.1 (Institute for Statistics and Mathematics, Vienna, Austria; www.r-project.org).
Discussion
COVID-19 has caused a worldwide epidemic with high infectivity and mortality [35]. Identification of predictive biomarkers may assist clinicians in decision-making. However, the candidate predictors of COVID-19 remain unclear. In this study, we identified CEA as a potential biomarker for COVID-19 patients. To further explore the underlying mechanism, we used single-cell transcriptomics of BALF from patients with or without COVID-19, along with scRNA-seq data of ALI and IPF mouse lungs. We found that developing neutrophils/neutrophil progenitors can cross-talk with type II pneumocytes via CEACAM8-CEACAM6 in COVID-19 but not in ALI or IPF.
Predictive biomarkers are important for clinical decision-making; thus, many efforts have been made to identify them in patients with COVID-19 pneumonia. Previously, inflammatory biomarkers (IL-6, IL-8, IL-10 and the ratio of IL-6 to IL-10), patients’ characteristics (age) and chest CT findings (consolidation, emphysema and residual healthy lung parenchyma) have been reported to predict the prognosis of COVID-19 patients [8, 36]. In addition, machine learning methods have also been used to evaluate COVID-19 pneumonia precisely [37].
In this study, based on the clinical information of hospitalized adult COVID-19 patients, we identified CEA as an independent prognostic indicator for COVID-19 patients. Additionally, a prognostic nomogram including CEA was constructed with good applicability (AUC = 0.776). CEA, initially considered an oncofetal protein, is an epithelial cell glycoprotein with a molecular mass of 180–200 kDa. At present, CEA is viewed as a normal epithelial molecule whose abnormal expression is generally found in tumors [38].
In COVID-19, we also found that CEACAM8 is highly expressed in developing neutrophils/neutrophil progenitors, while CEACAM5 and CEACAM6 are highly expressed in type II pneumocytes. In humans, CEA and CEA subfamily members (CEACAMs) are heavily glycosylated cell surface proteins. In bacterial or viral infection, CEA and CEACAM1 participate in the adherence of enteric bacteria to the apical membrane of colonic M cells in the human gut mucosa [39]. Moreover, in the human respiratory tract, CEACAM1 and CEACAM5 increase host susceptibility to bacterial infection upon viral challenge [40].
COVID-19 can lead to fatal complications, especially acute respiratory distress syndrome (ARDS), which is mainly caused by injury to the alveolar epithelial cells [41]. It has also been reported that the major risk factors for severe COVID-19 are shared with IPF, namely increasing age, male sex and comorbidities such as hypertension and diabetes [42]. Given the close correlation between CEA and ALI/IPF, we initially speculated that the poor prognosis of COVID-19 patients mediated by CEA might be pathophysiologically related to ALI and IPF. Because there are no natural models for IPF and ALI, the use of animal models that reproduce key known features of the diseases is warranted. Direct lung infection is the leading cause of ARDS/ALI and can be modeled in mice using live pathogens and sterile models of inflammation, while the bleomycin mouse model has identified many of the molecular and cellular mechanisms recognized as important in the pathogenesis of IPF [43, 44]. Therefore, to test this hypothesis, scRNA-seq data of ALI mouse lungs and IPF mouse lungs were also downloaded to evaluate the distribution and expression of CRGs, key receptor–ligand pairs of cellular communication and potential downstream pathways [26–30]. Cross talk via CEACAM8-CEACAM6 was found between developing neutrophils and type II pneumocytes in COVID-19 but not in ALI or IPF, suggesting that during COVID-19 infection, the differentiated developing neutrophils might regulate some biological processes of type II pneumocytes. A previous study reported that some CEACAMs act as receptors that facilitate entry of Middle East respiratory syndrome coronavirus [45]. CEACAMs are also involved in cell–cell recognition and modulate cellular processes that range from the shaping of tissue architecture and neovascularization to the regulation of insulin homeostasis and T-cell proliferation [46]. However, the role of CEACAMs in the pathophysiology of COVID-19-associated ARDS remains hypothetical.
In COVID-19, developing neutrophils were found to cross-talk with type II pneumocytes via CEACAM8-CEACAM6. Generally, CEACAMs can engage in cellular communication, which may affect various signal transduction processes related to cell activation, differentiation and apoptosis [47, 48]. In this process, the CEACAM8-CEACAM6 regulation network may promote the differentiation of developing neutrophils, which are newly annotated cells in patients with ARDS and represent neutrophils at various developmental stages [16]. The developing neutrophils may further drive COVID-19 progression and induce ARDS. In addition, the network may also regulate the proliferation of type II pneumocytes, which highly express ACE2 and serve as the major cell type infected by SARS-CoV-2 [49].
CEA level has been reported to be correlated with the severity of several lung diseases [28, 50–52]. The close association between respiratory epithelial damage and the release of CEA in IPF has been validated by a study based on BALF and serum measurements of CEA [50]. Acute exacerbation of IPF is pathologically manifested as diffuse acute lung injury (DALI) on the basis of pulmonary interstitial fibrosis [28]. Because COVID-19 pneumonia is a form of interstitial pneumonia and IPF is the end result of fibrosis in interstitial pneumonia, we initially speculated that the poor prognosis of COVID-19 patients mediated by CEA might be pathophysiologically related to ALI and IPF. However, abnormal expression of CRGs was not found in either the ALI or the IPF scRNA-seq samples, and no developing neutrophils were annotated in them. Thus, the abnormal expression of CRGs in COVID-19 patients was COVID-19-specific and not related to CEA involvement in ALI and IPF.
To the best of our knowledge, the present study is the first to systematically evaluate the prognostic role of CEA in COVID-19 patients and to suggest a potential mechanism in BALF and PBMCs. The results imply potential for clinical application. However, this study has several limitations. First, the study was retrospective rather than prospective. Second, the generalizability of the nomogram has not been validated externally with multicenter data, nor has the potential mechanism of CRGs been verified by wet-lab experiments. Third, the number of cases in the BALF scRNA-seq data was limited (three patients with moderate COVID-19, six patients with severe or critical infection and three healthy controls). Fourth, smokers could have increased CEA levels unrelated to COVID-19, yet smoking status was unknown for more than 25% of the patients, so any conclusion on this point is debatable. Given the limited number of smoking patients in this study (11/300) and the large number of patients with unknown smoking status (79/300), the relationship between smoking status and CEA in COVID-19 patients needs further study. Last but not least, the limited sample size may be a major source of bias in this study. Subsequent work should focus on clinical studies with larger sample sizes and higher levels of evidence and on basic studies exploring the molecular mechanisms of key biomarkers.