Background
Lymph node metastasis (LNM) is one of the most important factors affecting the prognosis of breast cancer [
1]. The accurate assessment of lymph node status can predict patients’ outcomes and guide the choice of treatment options [
2]. Axillary lymph node dissection (ALND) is the gold standard for evaluating axillary lymph node (ALN) status, but it causes considerable harm to patients. Although the less invasive sentinel lymph node biopsy (SLNB) has become routine, it still carries surgical risk, considerably increases anesthesia time and expense, and causes multiple complications in 3.5–10.9% of patients [
3,
4]. Therefore, developing a low-cost non-invasive method to evaluate the status of ALN would be of great benefit to breast cancer patients.
Cell-free DNA (cfDNA) has become an important biomarker in many cancer applications, such as early detection and outcome prediction [
5]. At present, the most commonly used features are cfDNA level and its sequence information. Previous studies have described the close relationship between abnormal cfDNA levels and ALN metastasis [
6,
7], which indicates that cfDNA may be used to assess ALN status. However, cfDNA levels are affected by many pathological processes, such as infection and inflammation [
8‐
10]. In addition, some studies have sought to identify circulating tumor DNA (ctDNA) mutations or ctDNA hypermethylation related to ALN metastasis [
1,
11‐
13]; however, no relationship was found between them [
2]. Thus, novel disease-specific features of cfDNA with high predictive efficacy are needed for predicting LNM.
Recently, cfDNA coverage at gene promoters has been found to carry gene expression information from its tissues of origin [
14,
15]. Plasma cfDNA is mainly released by apoptotic cells after enzymatic processing of chromatin [
16]. The DNA bound to the nucleosomes is retained, while the exposed DNA between the nucleosomes is digested. Analysis of cfDNA fragments derived from cancers showed that the promoter regions of active genes exhibited depleted coverage, indicating lower nucleosome occupancy in these regions together with increased gene expression [
15]. In cancer patients, cfDNA is mainly derived from tumor and hematopoietic cells [
16]. More importantly, studies on breast cancer have shown that many gene expression signatures can be used to estimate the risk of distant relapse, and some of them, such as PAM50, have been commercialized. In addition, immune cells have been shown to play an important role in tumor metastasis, and the peripheral blood immunome of breast cancer patients is influenced by the presence and stage of cancer [
17,
18]. Therefore, we hypothesized that cfDNA coverage at gene promoters has the potential to assess ALN status.
In this study, we first compared the nucleosome footprint around the transcriptional start sites (TSS) of ALN-positive and ALN-negative breast cancer patients to identify genes with differential coverage. In order to further evaluate the potential of promoter profiling for evaluating ALN status, we developed a classifier for distinguishing ALN-positive and ALN-negative patients by using multiple machine learning models. Finally, we incorporated some clinicopathological characteristics in our classifier to test whether its performance would improve.
Methods
Participants and study design
From January 2018 to December 2019, plasma samples were prospectively collected before cancer therapy from 330 breast cancer patients, including 162 ALN-positive and 168 ALN-negative patients. We excluded patients who: (1) were pregnant or lactating, (2) had metastatic breast cancer or histologically non-infiltrating tumors, (3) had hematopoietic system disease or inflammatory breast disease, and (4) were classified as ALN-negative by fine needle aspiration biopsy. We reviewed all tumor specimens histopathologically and staged them according to the seventh edition of the American Joint Committee on Cancer (AJCC) staging system for breast cancer. All plasma samples were obtained under protocols approved by the institutional review board of The First People's Hospital of Foshan (ID: L[2021]-7), with written informed consent from all participants for research use. Table
1 summarizes the characteristics of patients, including age, T stage, estrogen- (ER) and progesterone-receptor (PR) status, expression of human epidermal growth factor receptor 2 (Her2), proliferative fraction (Ki-67 labeling index), and histological grade.
Table 1
Patient characteristics
Characteristic | ALN-positive (n = 113) | ALN-positive (n = 49) | P | ALN-negative (n = 118) | ALN-negative (n = 50) | P |
Age | | | | | | |
Years [range] | 51.6 [31–83] | 49.1 [35–83] | 0.124a | 51.7 [28–88] | 52.9 [26–79] | 0.611a |
T stage | | | | | | |
T1 | 22 (19.5) | 13 (26.5) | 0.355b | 77 (65.3) | 28 (56.0) | 0.288c |
T2 | 69 (61.1) | 24 (49.0) | | 39 (33.1) | 22 (44.0) | |
T3/T4 | 22 (19.4) | 12 (24.5) | | 2 (1.6) | 0 (0) | |
ER | | | | | | |
Positive | 90 (79.6) | 43 (87.8) | 0.311b | 99 (83.9) | 40 (80.0) | 0.698b |
Negative | 23 (20.4) | 6 (12.2) | | 19 (16.1) | 10 (20.0) | |
PR | | | | | | |
Positive | 94 (83.2) | 39 (79.6) | 0.745b | 98 (83.1) | 41 (82.0) | 1b |
Negative | 19 (16.8) | 10 (20.4) | | 20 (16.9) | 9 (18.0) | |
Her2 | | | | | | |
Positive | 97 (85.8) | 38 (77.6) | 0.284b | 97 (82.2) | 42 (84.0) | 0.953b |
Negative | 16 (14.2) | 11 (22.4) | | 21 (17.8) | 8 (16.0) | |
Ki67 | | | | | | |
< 20 | 41 (36.3) | 21 (42.9) | 0.539b | 63 (53.4) | 21 (42.0) | 0.238b |
≥ 20 | 72 (63.7) | 28 (57.1) | | 55 (46.6) | 29 (58.0) | |
Histological grade | | | | | | |
1 | 3 (2.7) | 4 (8.2) | 0.174c | 14 (11.9) | 4 (8.0) | 0.640c |
2 | 82 (72.6) | 37 (75.5) | | 80 (67.8) | 38 (76.0) | |
3 | 28 (24.7) | 8 (16.3) | | 24 (20.3) | 8 (16.0) | |
ALN surgery
The ALN status was ascertained clinically by fine needle aspiration biopsy, ALND or SLNB. Because fine needle aspiration biopsy samples only a limited number of lymph nodes, some positive lymph nodes may be missed, which may increase the false positive rate of the evaluation model. Therefore, patients found to be ALN-negative by fine needle aspiration biopsy were excluded from this study. Indocyanine green with a carbon nanoparticle suspension was used for SLNB, and more than three LNs were examined for cancer.
Extracting and sequencing cfDNA
In total, 1 mL of peripheral blood was collected from each patient in EDTA tubes and immediately subjected to two-step centrifugation to obtain the plasma: 1600g for 10 min, followed by 16,000g for 10 min at 4 °C. The plasma was then stored at − 80 °C until use. Each sample yielded at least 1 ng of total cfDNA for sequencing. cfDNA was extracted from plasma with the QIAamp DNA Blood Mini Kit (Qiagen). Approximately 1–5 ng of DNA was used for library construction with the Life Sciences Ion Xpress™ Plus Fragment Library Kit. The number of PCR cycles was set to 12. The DNA size distribution of the libraries was analyzed on a Bioanalyzer instrument (Agilent Technologies, Singapore). Sequencing was performed with the Ion PI™ Hi-Q™ OT2 200 Kit and the Ion PI™ Hi-Q™ Sequencing 200 Kit on the Ion Proton platform (ThermoFisher Scientific, USA) with 520 flows. The mean sequencing depth of the samples was approximately 0.3×.
Sequencing data processing
After sequencing, the raw reads were aligned to the human reference genome (hg19) using bwa (ver. 0.7.5). Then, the SAMtools rmdup function (ver. 0.1.18) was used to remove polymerase chain reaction duplicates [
19]. GC-bias correction was performed using deepTools (ver. 3.5.0) with default settings. Tumor fraction estimation and copy number bias correction were performed using the ichorCNA algorithm [
20].
The calculation of promoter profiling was similar to that used in our previous study [
15,
21]. Briefly, gene information was downloaded from RefSeq of the University of California Santa Cruz [
22]. The region from − 1 KB to + 1 KB around the transcriptional start site of each transcript was first identified and defined as the primary transcription start site (pTSS). The read counts for each base at the pTSS were calculated using DANPOS with default settings [
23]. After read alignment, the read coverage at the pTSS was extracted from the aligned BAM files using bedtools (ver. 2.17.0). Then, the read coverage was normalized by the reads per kilobase per million mapped reads (RPKM)-like method. The normalized value of promoter profiling was calculated by the following formula:
$$\mathrm{Normalized \, promoter \, profiling}=\frac{\mathrm{cfDNA \, coverage \, around \, TSS}\times \mathrm{1,000,000}}{\mathrm{Total \, mapped \, reads}\times \mathrm{length}},$$
where the length is equal to 2000 for every transcript because the pTSS region spans − 1 KB to + 1 KB around each transcriptional start site.
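The window definition and the normalization formula above can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the names `Transcript`, `ptss_window` and `normalized_promoter_profiling` are hypothetical, and the coverage value is assumed to have been extracted already from the aligned reads.

```python
from dataclasses import dataclass
from typing import Tuple

PTSS_FLANK = 1_000  # bp on each side of the TSS (the -1 KB to +1 KB region)


@dataclass(frozen=True)
class Transcript:
    """Hypothetical container for one annotated transcript."""
    name: str
    chrom: str
    tss: int  # position of the transcriptional start site


def ptss_window(t: Transcript, flank: int = PTSS_FLANK) -> Tuple[str, int, int]:
    """Return (chrom, start, end) of the pTSS region around the TSS."""
    return t.chrom, t.tss - flank, t.tss + flank


def normalized_promoter_profiling(coverage: float,
                                  total_mapped_reads: int,
                                  length: int = 2 * PTSS_FLANK) -> float:
    """RPKM-like value: coverage * 1,000,000 / (total mapped reads * length)."""
    if total_mapped_reads <= 0 or length <= 0:
        raise ValueError("total_mapped_reads and length must be positive")
    return coverage * 1_000_000 / (total_mapped_reads * length)


# Illustrative numbers only: 40 reads over a pTSS in a sample with 10 million mapped reads.
example = Transcript(name="GENE_X", chrom="chr1", tss=5_000)
print(ptss_window(example))
print(normalized_promoter_profiling(40, 10_000_000))
```

Because the length is fixed at 2000 for every transcript, the normalization mainly corrects for differences in sequencing yield between samples.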
Models for evaluating lymph node status
To develop the evaluation classifier, the patients were first divided into three cohorts: discovery, training and validation. In the discovery cohort, we identified the genes with differential promoter coverage. The plasma samples were then divided into training and validation cohorts in a ratio of 7:3. Based on the training cohort data, we developed classifiers using three models, namely support vector machine (SVM), logistic regression (LR), and linear discriminant analysis (LDA), to distinguish ALN-positive from ALN-negative tumors. The importance of the features was assessed with the sigFeature package in R, and the top 100 features were selected for further classifier construction. The SVM classifier was constructed with a linear kernel in the e1071 package using default settings. To identify the gene combination with the largest area under the curve (AUC), a backward selection method was adopted. To avoid potential bias and over-fitting in the training cohort, leave-one-out cross-validation was used to evaluate the robustness of these classifiers. Briefly, each subject in the training cohort was withheld in turn, and the remaining subjects were used to train the model. The trained model was then used to determine the class of the withheld subject. This procedure was repeated until all subjects in the training cohort had been classified. Finally, the performance of the selected classifiers was evaluated using the validation cohort data.
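The leave-one-out procedure described above can be sketched in a few lines of Python. This is an illustration only: a simple nearest-centroid scorer stands in for the linear SVM from e1071, which is not reproduced here, and all function names are hypothetical. The AUC is computed as the fraction of positive/negative pairs that are ranked correctly, with ties counted as one half.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[Sequence[float]], float]


def _centroid(rows: List[List[float]]) -> List[float]:
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_centroid_scorer(X: List[List[float]], y: List[int]) -> Scorer:
    """Stand-in model (NOT an SVM): score by relative distance to class centroids."""
    pos = _centroid([x for x, label in zip(X, y) if label == 1])
    neg = _centroid([x for x, label in zip(X, y) if label == 0])

    def score(x: Sequence[float]) -> float:
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
        return d_neg - d_pos  # larger means closer to the positive centroid

    return score


def leave_one_out_scores(X: List[List[float]], y: List[int],
                         fit: Callable[[List[List[float]], List[int]], Scorer]
                         = fit_centroid_scorer) -> List[float]:
    """Withhold each subject in turn, train on the rest, score the withheld one."""
    scores = []
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        scores.append(model(X[i]))
    return scores


def auc(scores: Sequence[float], y: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, label in zip(scores, y) if label == 1]
    neg = [s for s, label in zip(scores, y) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not study data: one feature, three ALN-negative (0) and three ALN-positive (1) subjects.
X_toy = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y_toy = [0, 0, 0, 1, 1, 1]
print(auc(leave_one_out_scores(X_toy, y_toy), y_toy))
```

Passing a different `fit` function swaps in another model without changing the cross-validation loop, which is how the three model families could be compared on equal terms.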
Statistical analysis
The Wilcoxon rank-sum test or the chi-square test was used for analyses that compared the two groups. The Benjamini–Hochberg method was used to adjust the raw
P-values to the false discovery rate (FDR). Variables with fold change ≥ 1.5 and FDR ≤ 0.05 were considered statistically significant. The genes with differential promoter coverage were used to plot uniform manifold approximation and projection (UMAP) and heat map using uwot package and pheatmap package in R (version 3.0.1), respectively. Receiver operating characteristic (ROC) curves were plotted and differences in the AUC were compared using the pROC package [
24]. GO enrichment analysis was performed using Metascape with default settings [
25]. Housekeeping genes and non-constitutive genes were downloaded from the additional material of a previous study [
14].
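As an illustration of the significance filter described above, the following Python sketch applies the Benjamini–Hochberg adjustment to a list of raw P-values and then keeps variables with fold change ≥ 1.5 and FDR ≤ 0.05. The function names are hypothetical, and the sketch assumes the fold changes have already been computed; how the direction of the fold change is handled is left to the caller.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_differential(fold_changes: Sequence[float],
                        p_values: Sequence[float],
                        min_fold_change: float = 1.5,
                        max_fdr: float = 0.05) -> List[int]:
    """Indices of variables passing both the fold-change and FDR thresholds."""
    fdr = benjamini_hochberg(p_values)
    return [i for i, (fc, q) in enumerate(zip(fold_changes, fdr))
            if fc >= min_fold_change and q <= max_fdr]


# Toy values, not study data.
raw_p = [0.01, 0.04, 0.03, 0.005]
fold = [2.0, 1.2, 1.6, 1.5]
print(benjamini_hochberg(raw_p))
print(select_differential(fold, raw_p))
```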
Discussion
We found that there was a significant difference in promoter profiling between ALN-positive and ALN-negative breast cancer patients (Fig.
3). The PPCNM classifier, based on promoter profiling and the SVM model, produced the highest AUC (0.897 [0.865–0.930]) for distinguishing these two groups of patients, and its performance was significantly better than that of the classifiers based on the LR and LDA models (Fig.
4c; all
P < 0.05). In addition, the AUC increased slightly with the incorporation of clinical characteristics. These findings indicate that PPCNM may be a promising non-invasive tool for evaluating ALN status.
The PPCNM comprises forty-eight genes (Additional file
1: Table S2). These genes are closely associated with tumor metastasis. For instance, a large number of studies have reported a close relationship between the NF-κB signaling pathway and tumor metastasis [
28,
29]. The NF-κB signaling pathway regulates the expression of its downstream target genes, including MMP9, TNFα, uPA and IL8, thus promoting the invasion and metastasis of breast cancer cells [
29]. Besides, BHLHE40 confers a pro-survival and pro-metastatic phenotype to breast cancer cells by modulating HBEGF secretion [
30]. BHLHE40 also facilitates cancer cell invasion by interacting with SP1 [
31]. In addition, USP20 can promote breast cancer metastasis by stabilizing SNAI2 [
32].
ALN status is an essential factor for the prognosis of breast cancer patients and the choice of treatment [
2]. Although the less invasive SLNB has become more widespread, LN surgery for evaluating ALN status still brings various side effects to patients. Therefore, developing a non-invasive method to predict ALN status may be beneficial to breast cancer patients. At present, some studies show that increased cfDNA levels are related to ALN metastasis [
6,
7]. However, cfDNA levels are affected by various physiological and pathological processes [
8‐
10]. More specific features of cfDNA are therefore needed for assessing ALN status. Previous studies have reported that cell-free DNA promoter profiling and TF profiling can predict tumor subtypes in prostate cancer and detect early-stage colorectal cancer [
14,
33]. Therefore, we hypothesized that promoter profiling could be used to evaluate ALN status. In this study, we identified distinct promoter profile signatures of cfDNA in ALN-positive and ALN-negative patients (Fig.
3e). The classifier (PPCNM) based on these differential variables achieved high performance, with an AUC of 0.897 [0.865–0.930]. We developed a non-invasive method based on plasma cfDNA to assess ALN status, which could be used to monitor lymph node status dynamically. More importantly, our method could avoid the tumor heterogeneity that affects tissue-based detection. Nevertheless, our research has some limitations. Although our classifier achieved an AUC of 0.897 and WGS data from 330 patients were used in this study, more prospective samples and samples from external centers are needed to confirm and improve its predictive value before clinical application.
In summary, our data suggest that PPCNM is a promising tool based on promoter profiling for evaluating ALN status in breast cancer. PPCNM is a non-invasive technique that requires only low-coverage DNA sequencing and is not affected by cancer heterogeneity. Therefore, the PPCNM classifier may help patients and clinicians to choose appropriate cancer treatment, thus improving the curative effect and the quality of life of patients with cancer.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (
http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.