Background
Biliopancreatic cancers, including cholangiocarcinoma (CCA), gallbladder cancer (GBC), and pancreatic cancer (PAC), are a group of malignancies characterized by extremely poor prognosis, which is mainly attributable to late diagnosis and limited therapeutic intervention [1, 2]. To better manage these diseases and improve patient survival, accurate detection, especially early detection, is of critical importance. With the rapid development of liquid biopsy, cell-free DNA (cfDNA) has provided a noninvasive approach for the diagnosis of solid malignancies [3]. Normally, cfDNA originates mostly from hematopoietic cells, whereas in cancer patients a fraction of cfDNA molecules can be released from malignant cells [4]. Previous studies have demonstrated that plasma cfDNA carries genetic and epigenetic information from its tissue of origin [5, 6]. However, because the concentration of tumor-related cfDNA in plasma is low, detectable variations are limited among patients, especially at early stages [7]. To improve the practicality of cfDNA, recent research has focused more on its detailed physical and molecular features, which hold great promise for clinical application [8].
Plasma cfDNA fragmentomics is an emerging field that covers features such as fragment size, end point, and nucleosome footprint. Many studies have demonstrated that significant differences in these features can be observed between cancer patients and healthy individuals [9–12], even at early stages [13–16]. Cristiano et al. introduced an approach focusing on the fragment size ratio in a multi-cancer cohort, in which the machine learning model achieved detection sensitivities ranging from 57% to > 99% among seven cancer types at 98% specificity, with an overall AUC of 0.94 [9]. Fragment-size-based models have also been reported for the early detection of primary liver cancer, showing excellent sensitivity in detecting early-stage cancer (stage I: 95.9%, stage II: 97.9%) and small tumors (≤ 3 cm: 98.2%) [15]. In addition, the end motif feature has been widely studied for the construction of stacked machine learning models for the early detection of multiple cancers, such as colorectal adenocarcinoma and lung adenocarcinoma [14, 16]. Because the genomic distribution of nucleosomes is considered to be cell-type specific [17], nucleosome footprint is another important feature that can indicate the tissues contributing to cfDNA. A recent study on hepatocellular carcinoma revealed that nucleosome footprint was the best individual diagnostic feature for differentiating hepatocellular carcinoma from liver cirrhosis in both the validation (AUC = 0.971) and test sets (AUC = 0.973) [13]. However, detection models based on these fragmentomic features remain under-investigated in biliopancreatic cancers.
In this study, we constructed a fragmentomics-based machine learning model for detecting the closely located biliopancreatic cancers, including CCA, GBC, and PAC, using low-coverage whole-genome sequencing (WGS) data. The stacked model combining fragment size, end motif, and nucleosome footprint features showed excellent performance in detecting biliopancreatic cancers, and its sensitivity and specificity were higher when it was integrated with carbohydrate antigen 19-9 (CA19-9). Moreover, our method demonstrated the potential utility of cfDNA fragmentomics in differentiating biliopancreatic cancer subtypes.
Methods
Patient enrollment and sample collection
Patients with treatment-naïve, clinically diagnosed biliopancreatic malignancies were enrolled from the Changhai prospective database (Changhai Hospital, Shanghai, China). Blood samples were collected before surgery or biopsy for cfDNA extraction and CA19-9 examination (Roche Diagnostics GmbH, 11,776,193; cut-off value: 39 IU/mL). Exclusion criteria were as follows: (1) patients without biliopancreatic cancers confirmed by pathological examination; (2) patients with inconclusive pathology results; (3) patients simultaneously suffering from other tumors; (4) blood specimens with hemolysis levels > grade 4. Patients were staged according to the 8th edition of the American Joint Committee on Cancer (AJCC) tumor-node-metastasis (TNM) staging system. All participants provided signed written informed consent to use their blood samples and clinical data, and were randomly divided into a training cohort and a validation cohort. This study was conducted in accordance with the national guideline and approved by the Ethics Committee of Changhai Hospital (CHEC2018-112).
Blood samples were collected in BEAVER Cell Free DNA Tubes (Beaver, 43,803). To harvest plasma, the blood samples were centrifuged at 1600 g for 10 min at 4 °C, after which the hemolysis level was determined and recorded. The samples with hemolysis level ≤ grade 3 were used for further experiments. The collected supernatant was centrifuged at 16,000 g for 15 min at 4 °C to remove the cell debris. Then, the supernatant was stored in 1 mL aliquots at − 80 °C prior to DNA extraction. Plasma samples were thawed in a 37 °C water bath and centrifuged at 1600 g for 10 min at 4 °C. QIAamp Nucleic Acid Kit (Qiagen) was used to isolate plasma cfDNA. Qubit 3.0 fluorometer (Thermo Fisher) and Agilent 2100 bioanalyzer (Agilent Technologies Inc.) were used to detect the cfDNA concentration, purity, and fragment distributions.
Plasma cfDNA library construction and low-coverage WGS
The qualified cfDNA samples were used for library construction followed by low-coverage WGS. Libraries were prepared using the KAPA DNA Hyper Prep Kit (KAPA, KK8504) in accordance with the manufacturer’s instructions. The input amount of each cfDNA sample was 10 ng. The fragment ends were repaired and an “A” tail was added. Adapters were then ligated and the products purified, and the libraries were enriched by seven cycles of PCR amplification. After purification and elution in 25 µL of eluent, the concentration and fragment distribution of the plasma cfDNA library were determined using Qubit (Thermo Fisher) and the Agilent 2100 bioanalyzer (Agilent Technologies), respectively. WGS was performed on the NovaSeq 6000 platform (Illumina) with a 2 × 150 bp sequencing strategy and a sequencing volume of ~ 10 G (~ 3 × coverage).
Sequencing alignment and quality control
Data processing
After removing adapters, the sequencing data in FASTQ format were aligned to the hg19 reference genome. Low-quality and repetitive reads were removed. Only reads meeting the following criteria were kept: (1) Aligned to autosomes; (2) Quality score greater than 20; (3) Insertion size between 150 and 600 bp; (4) Properly paired; (5) Reference region without degenerate bases.
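The five retention criteria above can be read as a single predicate applied to each aligned read pair. The following is a minimal sketch of that predicate, not the authors' pipeline: the `ReadPair` record and its field names are hypothetical, "quality score" is assumed to mean mapping quality, the 150 to 600 bp bounds are assumed to be inclusive, and autosomes are assumed to be named `chr1` to `chr22` as in hg19.

```python
from dataclasses import dataclass

# Assumption: hg19-style autosome names chr1..chr22.
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}


@dataclass
class ReadPair:
    """Hypothetical summary of one aligned read pair (names are illustrative)."""
    chrom: str             # chromosome the pair aligned to
    mapq: int              # quality score (assumed here to be mapping quality)
    insert_size: int       # insertion size in bp
    properly_paired: bool  # aligner's proper-pair flag
    ref_has_n: bool        # True if the reference span contains degenerate (N) bases


def passes_filters(pair: ReadPair) -> bool:
    """Return True only if the pair meets all five criteria listed in the text."""
    return (
        pair.chrom in AUTOSOMES                # (1) aligned to autosomes
        and pair.mapq > 20                     # (2) quality score greater than 20
        and 150 <= pair.insert_size <= 600     # (3) insertion size between 150 and 600 bp
        and pair.properly_paired               # (4) properly paired
        and not pair.ref_has_n                 # (5) no degenerate bases in the reference region
    )


kept = [p for p in [ReadPair("chr1", 30, 170, True, False),
                    ReadPair("chrX", 30, 170, True, False)] if passes_filters(p)]
```

In this toy call only the `chr1` pair is retained, because the second pair maps to a sex chromosome.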
Identifying fragmentomics features
Following the “DNA evaluation of fragments for early interception” (DELFI) approach [9], the genome was divided into 504 5-Mb bins. The coverage of short and long fragments in each bin was calculated, where short fragments were defined as those with a length in [130, 177] bp and long fragments as those with a length in [177, 237] bp. The z-score-standardized short and total (short plus long) fragment coverage in the 504 bins was used as the input for machine learning, giving 1008 (504 × 2) features in total.
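To make the feature layout concrete, the sketch below builds the 1008-element vector for one sample from a list of (bin index, fragment length) pairs. It is an illustration under stated assumptions rather than the published implementation: the z-score is assumed to be taken across the bins of a single sample, a 177 bp fragment (which falls in both stated intervals) is assigned to the short class, and the function and variable names are hypothetical.

```python
from statistics import mean, pstdev

N_BINS = 504       # number of 5-Mb bins stated in the text
SHORT = (130, 177)  # short-fragment length interval from the text
LONG = (177, 237)   # long-fragment length interval from the text


def zscore(values):
    """Standardize a list to mean 0 and unit (population) standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


def fragment_size_features(fragments, n_bins=N_BINS):
    """fragments: iterable of (bin_index, length_bp). Returns 2 * n_bins features."""
    short = [0] * n_bins
    long_ = [0] * n_bins
    for bin_index, length in fragments:
        if SHORT[0] <= length <= SHORT[1]:
            short[bin_index] += 1
        elif LONG[0] < length <= LONG[1]:  # assumption: 177 bp counts as short only
            long_[bin_index] += 1
    total = [s + l for s, l in zip(short, long_)]
    # Short-fragment coverage first, then total coverage, each z-scored across bins.
    return zscore(short) + zscore(total)


features = fragment_size_features([(0, 150), (0, 200), (1, 177), (2, 300)], n_bins=3)
```

Fragments outside both intervals (the 300 bp fragment in the toy call) contribute to neither count.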
For end motif features, we defined the 6th bp from the 5’ end of a cfDNA fragment as the 1st position because our sequence data carried unique molecular identifiers (UMIs). UMIs are commonly introduced into sequencing to increase accuracy, but we found that they affected the frequencies of end motifs. A comparison of samples with and without UMIs showed that the motif frequencies of UMI-attached samples matched those of UMI-unattached samples in previous studies when counting from the 6th position rather than the 1st position (Figure S1). 3-bp motifs starting at the 1st, 4th, and 7th positions of the 5’ end of the cfDNA fragments were counted for each sample. Each 3-bp motif has 64 possible sequences. Their frequencies were combined into a 9-bp codon motif with 192 (64 × 3) features.
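The counting scheme can be sketched as follows, assuming each read is available as a plain 5’-to-3’ string whose first five bases are to be skipped. The function name, the handling of reads containing non-ACGT bases (they are dropped here), and the normalization by the number of usable reads are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

UMI_OFFSET = 5      # skip 5 bases so that the 6th base becomes position 1
STARTS = (1, 4, 7)  # 1-based start positions of the three 3-bp motifs
KMERS = ["".join(p) for p in product("ACGT", repeat=3)]  # 64 possible 3-mers


def end_motif_features(reads):
    """Return 192 frequencies: 64 per start position, in STARTS order."""
    counts = [Counter() for _ in STARTS]
    n_used = 0
    for seq in reads:
        core = seq[UMI_OFFSET:UMI_OFFSET + 9].upper()  # the 9-bp window after the offset
        if len(core) < 9 or set(core) - set("ACGT"):
            continue  # assumption: skip short reads and reads with ambiguous bases
        n_used += 1
        for counter, start in zip(counts, STARTS):
            counter[core[start - 1:start + 2]] += 1
    return [c[k] / n_used if n_used else 0.0 for c in counts for k in KMERS]


features = end_motif_features(["NNNNNACGTTTGGA", "NNNNNACGAAACCC"])
```

Because every usable read contributes exactly one motif per start position, each block of 64 frequencies sums to 1.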
To compute nucleosome footprint features, we defined the central region of a gene as ± 250 bp around transcription start site (TSS) and the reference region of the gene as the sum of the upstream [− 2000, − 1000] bp region and the downstream [1000, 2000] bp region around its TSS. The nucleosome footprint of a gene was then defined as the mean coverage of the central region divided by the mean coverage of the reference region. A total of 24,639 genes were selected as features based on their nucleosome footprint profiles across samples.
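As a worked illustration of this ratio, the sketch below takes per-base sequencing depth for the window from 2000 bp upstream to 2000 bp downstream of one TSS and returns the footprint value. The list-based input layout, the function names, and the choice to return NaN when the reference region has no coverage are assumptions; interval ends are treated as inclusive.

```python
FLANK = 2000  # depth list covers offsets -2000..+2000 relative to the TSS


def window(depth, lo, hi):
    """Slice depth values for TSS-relative offsets lo..hi (inclusive)."""
    return depth[lo + FLANK: hi + FLANK + 1]


def nucleosome_footprint(depth):
    """Mean central coverage divided by mean reference coverage for one gene."""
    if len(depth) != 2 * FLANK + 1:
        raise ValueError("expected one depth value per base from -2000 to +2000")
    central = window(depth, -250, 250)
    # The reference region is the upstream and downstream flanks pooled together.
    reference = window(depth, -2000, -1000) + window(depth, 1000, 2000)
    ref_mean = sum(reference) / len(reference)
    if ref_mean == 0:
        return float("nan")  # assumption: undefined when the reference has no coverage
    return (sum(central) / len(central)) / ref_mean


# Toy profile: uniform depth of 10 with a dip to 5 around the TSS.
depth = [10.0] * (2 * FLANK + 1)
for offset in range(-250, 251):
    depth[offset + FLANK] = 5.0
ratio = nucleosome_footprint(depth)
```

Since the two reference flanks are symmetric around the TSS and pooled, the value does not depend on which strand the gene lies on.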
Machine learning
Ten-fold cross-validation on the training cohort was used to select models and optimize criteria. The cutoff was selected as the value that minimized Gini impurity in the training cohort. LinearSVC was selected as the machine learning model for fragment size and nucleosome footprint features. RandomForest was selected as the model for end motif features. A stacked model trained on the three cfDNA features was built using stacking with logistic regression as the meta-learner. When training models for a specific cancer type, only non-cancer samples and samples of the specific cancer type were used.
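One way to read the cutoff rule is as a one-dimensional decision stump: every candidate threshold splits the training scores into two groups, and the threshold with the lowest size-weighted Gini impurity is kept. The sketch below implements that reading in plain Python. The candidate set (midpoints between adjacent distinct scores) and the function names are assumptions, and the scores in the example are invented for illustration only.

```python
import math


def gini(labels):
    """Gini impurity of a list of binary labels (0 = non-cancer, 1 = cancer)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_cutoff(scores, labels):
    """Return the threshold minimizing the size-weighted Gini impurity, or None."""
    pairs = list(zip(scores, labels))
    uniq = sorted(set(scores))
    best_impurity, best_threshold = math.inf, None
    for low, high in zip(uniq, uniq[1:]):
        threshold = (low + high) / 2  # assumption: midpoints as candidate cutoffs
        below = [y for s, y in pairs if s < threshold]
        above = [y for s, y in pairs if s >= threshold]
        weighted = (len(below) * gini(below) + len(above) * gini(above)) / len(pairs)
        if weighted < best_impurity:
            best_impurity, best_threshold = weighted, threshold
    return best_threshold


cutoff = best_cutoff([0.1, 0.2, 0.35, 0.6, 0.8, 0.9], [0, 0, 0, 1, 1, 1])
```

For these perfectly separated toy scores, the selected cutoff falls midway between the highest non-cancer score and the lowest cancer score.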
To construct the CF (CA19-9 and fragmentomics) model, which is based on CA19-9 and stacked model scores, the log2(x + 1) transformation of CA19-9 and the ten-fold cross-validated estimates of the stacked model were computed in the training cohort and used as input scores. LinearSVC was selected as the machine learning classifier for the final model.
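The two-level design described above (three feature-specific base learners, a logistic regression meta-learner, and a final LinearSVC on the stacked score plus transformed CA19-9) could be assembled with scikit-learn roughly as follows. This is a sketch under several assumptions, not the authors' code: the data are random placeholders, the three feature blocks are shrunk to 10 columns each, hyperparameters are arbitrary, and "ten-fold cross-validated estimates" is interpreted as out-of-fold predicted probabilities.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic placeholder data (NOT study data): 80 samples, three 10-column blocks
# standing in for fragment size, end motif, and nucleosome footprint features.
rng = np.random.default_rng(0)
n = 80
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, 30)) + 0.8 * y[:, None]
ca199 = rng.lognormal(mean=3.0 + y, sigma=1.0)  # placeholder CA19-9 values


def block(columns, model):
    """Pipeline that feeds only the given column slice to the given model."""
    select = ColumnTransformer([("select", "passthrough", columns)])
    return make_pipeline(select, model)


stacked = StackingClassifier(
    estimators=[
        ("fragment_size", block(slice(0, 10), LinearSVC(max_iter=5000))),
        ("end_motif", block(slice(10, 20),
                            RandomForestClassifier(n_estimators=25, random_state=0))),
        ("nucleosome_footprint", block(slice(20, 30), LinearSVC(max_iter=5000))),
    ],
    final_estimator=LogisticRegression(),  # meta-learner named in the text
    cv=10,
)

# Out-of-fold stacked scores on the training cohort (assumed interpretation).
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
stacked_scores = cross_val_predict(stacked, X, y, cv=folds, method="predict_proba")[:, 1]

# CF model inputs: log2(CA19-9 + 1) and the cross-validated stacked score.
cf_inputs = np.column_stack([np.log2(ca199 + 1), stacked_scores])
cf_model = LinearSVC(max_iter=5000).fit(cf_inputs, y)
cf_scores = cf_model.decision_function(cf_inputs)
```

Using out-of-fold scores as inputs keeps the final classifier from being trained on stacked predictions that have already seen the same sample's label.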
Statistics
The P-values of the risk scores were computed using the Mann–Whitney U test function in the Python SciPy package. The heatmaps were constructed using the Python Seaborn package (https://seaborn.pydata.org/citing.html). P-values less than 0.05 were considered statistically significant.
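For reference, a comparison of this kind can be run with `scipy.stats.mannwhitneyu`. The sketch below uses invented risk scores purely to show the call; the two-sided alternative is an assumption, since the text does not state which alternative hypothesis was used.

```python
from scipy.stats import mannwhitneyu

# Invented risk scores for illustration only (NOT study data).
cancer_scores = [0.91, 0.85, 0.78, 0.96, 0.88]
healthy_scores = [0.12, 0.30, 0.25, 0.08, 0.41]

# Assumption: a two-sided test; the source does not specify the alternative.
statistic, p_value = mannwhitneyu(cancer_scores, healthy_scores, alternative="two-sided")
significant = p_value < 0.05  # significance threshold stated in the text
```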
Ethics statement
The study was conducted in accordance with the Declaration of Helsinki. This study was approved by the Ethics Committee of Changhai Hospital (CHEC2018-112). All participants provided signed written informed consent to use their blood samples and clinical data.
Role of the funding source
The sponsors did not have any role in the study design, data collection, data analysis, interpretation, or writing of the manuscript.
Discussion
Liquid biopsy, particularly cfDNA analysis, has been widely studied to facilitate noninvasive cancer screening and improve patient prognosis [8]. With recent advances in sophisticated technologies such as machine learning, cfDNA fragmentomics has exhibited great potential for distinguishing cancer-derived cfDNA and determining its tissue of origin [28]. For common malignancies such as liver cancer, colorectal adenocarcinoma, and lung adenocarcinoma, machine learning models based on cfDNA fragmentomics have been well established to provide accurate assays for cancer detection and early screening [14–16, 29]. However, for biliopancreatic cancers, which have very dismal outcomes, the characteristics of cfDNA fragmentomics and its diagnostic performance remain largely unknown. To the best of our knowledge, this study provides the first systematic characterization of cfDNA fragmentomics in biliopancreatic cancers. We investigated the potential application of cfDNA fragmentomics in detecting biliopancreatic cancers and found that the stacked machine learning model of fragment size, end motif, and nucleosome footprint showed high prediction accuracy in differentiating biliopancreatic cancer patients from healthy volunteers, and performed better when combined with CA19-9. In addition, our method could also be used to distinguish biliopancreatic cancer subtypes. This study offers valuable support for the clinical application of cfDNA fragmentomics in improving the diagnosis of biliopancreatic cancers.
The very short half-life of cfDNA (~ 2 h) makes it an ideal biomarker, as it carries the genetic and epigenetic information of its tissue of origin and reflects the current state of a disease [28]. We previously demonstrated that preoperative detection of main driver mutations in plasma cfDNA could be used to inform the prognosis of resectable PAC patients and help optimize surgical selection, but this requires prior knowledge of critical alterations in tumors [30]. Our recent research on cfDNA methylation signatures found that genome-wide methylation profiles provided potential utility for noninvasive detection of early PAC [31]. Moreover, methylation-based approaches have an advantage over mutation detection in identifying organ-specific cfDNA fragments, as they reduce false-positive test results due to clonal haematopoiesis [32]. Recently, Ben-Ami and colleagues showed that the combination of a 9-locus cfDNA methylation panel, CA19-9, and the serum protein marker TIMP1 discriminated early-stage pancreatic ductal adenocarcinoma (PDAC) better than CA19-9 alone [33], and Hartwig et al. developed a methylation biomarker panel with greater discriminatory power for pancreato-biliary cancers than CA19-9 [34]. However, high cost and complexity, as well as the relatively small number of detectable epigenetic alterations, could limit further clinical application [31]. To increase the sensitivity of cfDNA-based cancer detection, many genome-wide approaches for analyzing cfDNA fragmentation profiles have been developed, which have shown the potential to identify a large number of tumor-derived changes in the circulation. For PAC, it has been reported that, compared with healthy controls, plasma cfDNA in patient samples had a significantly shorter fragment size and a higher concentration, which were associated with worse outcomes [35]. Similar results have been observed in CCA and GBC, where the concentration of plasma cfDNA in patient samples was markedly higher than that in non-cancer controls and increased with tumor stage or tumor size [36, 37]. In this study, we found that fragment size was a powerful biomarker for differentiating biliopancreatic cancer patients from healthy volunteers. In addition, end motif and nucleosome footprint were identified as two further critical features for biliopancreatic cancer detection. Furthermore, the stacked model based on these fragmentomics features exhibited excellent sensitivity and specificity in biliopancreatic cancer diagnostics, even at an extremely low sequencing depth of 0.5 ×, providing new insights for revolutionizing blood-based cancer detection.
Since nucleosome occupancy patterns differ among tissues [12], it is possible to use cfDNA fragmentation patterns to infer the characteristics of the epigenome and thereby determine the tissue of origin. Based on the fragment size feature alone, the DELFI approach demonstrated the potential to distinguish between seven different cancers, with an accuracy of 61% (95% CI 53%–67%) that increased to 75% (95% CI 69%–81%) when assigning circulating tumor DNA (ctDNA) to one of two sites of origin [9]. Later, a multidimensional model developed by Bao et al., based on five distinct fragmentomics features covering cfDNA fragmentation size, motif sequence, and copy number variation, showed promising results in detecting cancers from distant anatomical locations (97.4%, 94.3% and 85.6% for primary liver cancer, colorectal adenocarcinoma and lung adenocarcinoma, respectively) [16]. In addition, cfDNA fragmentomics can be used to classify cancer subtypes. A recent study on cfDNA fragmentomics of primary liver cancer revealed that the fragmentomics-based machine learning model showed potential for distinguishing intrahepatic cholangiocarcinoma from hepatocellular carcinoma (AUC: 0.776) [15]. In this study, we focused on the three closely located biliopancreatic cancers, CCA, GBC, and PAC, which may be more difficult to differentiate because of their similar developmental backgrounds. As expected, neither any single fragmentomics feature nor the stacked model differentiated CCA from GBC satisfactorily, whereas it was easier to differentiate PAC from biliary tract cancers (CCA and GBC) (Figure S4). In particular, we noticed that nucleosome footprint (NF) patterns carried more cell-type-specific information than fragment size and end motif, which was consistent with other studies [38]. According to the gene set enrichment analysis of the top 500 genes contributing to the prediction of PAC against CCA/GBC, olfactory-related pathways were the most significantly enriched. It has been reported that, in addition to the olfactory system, olfactory receptors are also expressed in other human tissues such as the pancreas and might play a crucial role in the initiation of different cancers [39]. In pancreatic cancers, many somatic mutations have been found in olfactory receptor genes [40], and gene expression in the olfactory pathway was most significantly affected in pancreatic cancer [41]. Moreover, the differentially methylated and expressed genes of pancreatic cancer were also found to be mainly related to olfactory transduction [42]. With these results, our study provides important evidence supporting the potential utility of cfDNA fragmentomics in identifying the tissue of origin.
As the most widely used biomarker in biliopancreatic cancers, CA19-9 (also called sialyl Lewis A antigen) plays an indispensable role in cancer diagnosis and prognostic prediction [43]. However, CA19-9 is not ideal. In addition to false-positive results caused by biliary tract obstruction and inflammation, pancreatitis, and other digestive cancers, a major shortfall for its clinical application is false-negative results in Lewis antigen-negative individuals, who make up 5% to 10% of the population and secrete very little or no CA19-9 [44]. Therefore, the development of new biomarkers to complement CA19-9 in biliopancreatic cancers is highly necessary. Previous studies have proposed that the traditional tumor biomarkers carcinoembryonic antigen (CEA) and CA125 could be applied in Lewis-negative patients with pancreatic cancer [44]. Our recent research found that genome-wide methylation profiles could help accurately identify CA19-9-negative PAC cases [31]. In this study, we showed that the fragmentomics-based stacked model achieved high sensitivity (18 out of 19, 94.7%) in differentiating biliopancreatic cancer patients from healthy volunteers in the CA19-9-negative group. Furthermore, after integrating the stacked model with CA19-9, the final CF model showed great performance in biliopancreatic cancer detection (AUC = 0.995). Taken together, our study suggests that cfDNA fragmentomics could be an important complement to CA19-9 in biliopancreatic cancer detection, paving the way for the development of new diagnostic strategies, especially for patients in the CA19-9-negative group.
Although our preliminary results on using cfDNA fragmentomics to facilitate noninvasive detection of biliopancreatic cancers are encouraging, several limitations should be acknowledged. The relatively small size of the study cohort might impair the performance of our model, which needs to be validated in a larger population. In addition, the sample sizes of several key categories, such as stage I PAC, CCA, and GBC, were limited. Therefore, model sensitivity might be overestimated, and the model may need optimization for early detection or screening purposes in biliopancreatic cancers. Finally, our models were constructed based only on cancer and healthy cases. Further inclusion of samples from patients with benign biliopancreatic diseases would help improve the performance of our model in distinguishing patients with biliopancreatic cancers and promote its clinical application.