Background
In 2019, the projected numbers of new cases of, and deaths due to, breast cancer (BrCa) in the United States are 271,270 and 42,260, respectively [
1]. Worldwide, the corresponding numbers (2018 estimate) are 2,088,849 and 626,679, respectively [
2]. BrCa is the second-leading cause of cancer death in women, one in 8 of whom will develop the disease in her lifetime. Although genetic predisposition (e.g. BRCA1/2 mutations) is an important contributing factor (5–10%) [
3,
4], most BrCa cases have no clear genetic link (although some may still be due to unknown genetic risk and are thus considered familial). While Stage I cases have a 5-year survival rate close to 100%, those diagnosed at Stage IV have a 5-year relative survival rate of only 22%, accounting for 6–10% of new BrCa cases and 20–30% of all recurrent disease [
5]. The early detection of BrCa saves lives and reduces the morbidity associated with the aggressive treatments required for late-stage cancers. Nevertheless, the primary diagnostic screening method, mammography, has high rates of false positives and false negatives, can result in over-diagnosis, uses harmful radiation, and is an uncomfortable process for patients [
6,
7]. This necessitates the pursuit of molecular markers more indicative of a tumor’s biological characteristics translatable to a reliable, non-invasive diagnostic assay. Over the years, there have been numerous reports indicating that either blood serum, plasma, or whole blood can harbor molecular biomarkers indicative of a progressing BrCa [
3,
8,
9]. These markers include: secreted proteins (e.g. CA15–3, trefoil factors 1, 2, and 3), auto-antibodies (e.g. antibodies against human endogenous retrovirus-K(HML-2) and heterogeneous nuclear ribonucleoprotein F), lipids (e.g. C16:1, C18:3, C18:2), and microRNAs (e.g. miR-21, miR-221, miR-145). In addition to the blood-based markers mentioned above, there is a growing field exploring the use of DNA fragments released by cancer cells into the patient’s bloodstream (referred to as circulating tumor DNAs or ctDNAs) as an indicator of cancer [
10,
11]. Previous studies showed that genomic signatures (e.g. mutation, copy number variation, CpG methylation) found in cancer tissues are largely concordant with those identified in ctDNAs [
12‐
18]. Already marketed are early cancer diagnostic tests based on interrogating site-specific CpG hypermethylation in ctDNAs isolated from patient plasma. These include: a) Epi proColon, ColoVantage, Realtime mS9, all of which detect methylation in the
SEPT9 gene for colon cancer detection [
19]; b) Epi proLung, which detects methylation of
SHOX2 for lung cancer detection [
20], and c) Colvera, which detects methylation at
BCAT1 and
IKZF1 for colon cancer recurrence [
21].
There are important considerations in the development of methylation-based early detection assays for BrCa (or any other cancer type). Although the levels of plasma-derived cell free DNA (cfDNA) in serum from cancer patients are indeed abnormally high in early- to late-stage cancers [
22‐
24], only a small percentage are ctDNAs (most cfDNAs are hematological in origin). Another important concern is the selection of appropriate markers. At the very least, the selected CpG sites should be highly methylated in breast primary tumors (PTs) and practically unmethylated in peripheral blood. However, for a marker to be highly specific to BrCa PTs, it also needs to have very low levels of methylation in normal breast tissues and in many other tumor types. In this report, we demonstrate a new and more sensitive assay for methylated CpG detection (incorporating various steps including ligase detection reaction), and a comprehensive approach to biomarker discovery using integrated public genomic datasets.
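The selection criteria described above can be sketched as a simple filter over per-tissue mean methylation (beta) values. This is a minimal illustration and not the pipeline used in this study: the function name, the dictionary layout, the toy CpG identifiers, and all four cutoffs are assumptions chosen for the example. The sketch also requires low methylation in every other tumor type, which is stricter than the "many other tumor types" criterion stated in the text.

```python
from typing import Dict, List

# Hypothetical cutoffs on beta values (0 = unmethylated, 1 = fully methylated).
# These numbers are illustrative assumptions, not values from the study.
MIN_BREAST_TUMOR = 0.6
MAX_BLOOD = 0.1
MAX_NORMAL_BREAST = 0.2
MAX_OTHER_TUMOR = 0.2


def select_candidate_cpgs(mean_beta: Dict[str, Dict[str, float]]) -> List[str]:
    """Return CpG IDs that satisfy all four criteria.

    mean_beta maps a CpG ID to mean beta values keyed by
    'breast_tumor', 'blood', 'normal_breast', and 'other_tumor_max'
    (the highest mean beta across the non-breast tumor types).
    """
    selected = []
    for cpg_id, beta in mean_beta.items():
        if (
            beta["breast_tumor"] >= MIN_BREAST_TUMOR
            and beta["blood"] <= MAX_BLOOD
            and beta["normal_breast"] <= MAX_NORMAL_BREAST
            and beta["other_tumor_max"] <= MAX_OTHER_TUMOR
        ):
            selected.append(cpg_id)
    return selected


# Toy data with made-up CpG identifiers and beta values.
toy = {
    "cpg_A": {"breast_tumor": 0.82, "blood": 0.03, "normal_breast": 0.10, "other_tumor_max": 0.15},
    "cpg_B": {"breast_tumor": 0.75, "blood": 0.40, "normal_breast": 0.12, "other_tumor_max": 0.10},
    "cpg_C": {"breast_tumor": 0.70, "blood": 0.02, "normal_breast": 0.08, "other_tumor_max": 0.65},
}
candidates = select_candidate_cpgs(toy)
print(candidates)
```

In the toy data, cpg_B fails the blood criterion and cpg_C fails the other-tumor criterion, so only cpg_A is retained.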
Methods
Public genomic datasets
Analyzed for this study are various publicly available genomic datasets (Additional file
1: Supplement 1) such as those released by the TCGA project (
https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga) [
25] and those deposited in the Gene Expression Omnibus (
https://www.ncbi.nlm.nih.gov/geo/). The TCGA datasets, generated primarily with the Illumina 450 K methylation array, were previously compiled (and processed) on the UCSC Cancer Genomics website (
https://genome-cancer.ucsc.edu/) [
26,
27]. The TCGA cohorts included in our analyses are: breast invasive carcinoma [BRCA], adrenocortical carcinoma [ACC], bladder urothelial carcinoma [BLCA], cervical squamous cell carcinoma and endocervical adenocarcinoma [CESC], cholangiocarcinoma [CHOL], colon adenocarcinoma [COAD], lymphoid neoplasm diffuse large B-cell lymphoma [DLBC], esophageal carcinoma [ESCA], glioblastoma multiforme [GBM], head and neck squamous cell carcinoma [HNSC], kidney chromophobe carcinoma [KICH], kidney renal clear cell carcinoma [KIRC], kidney renal papillary cell carcinoma [KIRP], brain lower grade glioma [LGG], liver hepatocellular carcinoma [LIHC], lung adenocarcinoma [LUAD], lung squamous cell carcinoma [LUSC], mesothelioma [MESO], pancreatic adenocarcinoma [PAAD], pheochromocytoma and paraganglioma [PCPG], prostate adenocarcinoma [PRAD], rectum adenocarcinoma [READ], sarcoma [SARC], skin cutaneous melanoma [SKCM], stomach adenocarcinoma [STAD], testicular germ cell tumors [TGCT], thymoma [THYM], thyroid carcinoma [THCA], uterine corpus endometrial carcinoma [UCEC], uterine carcinosarcoma [UCS], and uveal melanoma [UVM]. Also crucial to our biomarker identification is the integration of various GEO datasets such as: GSE65820 (ovarian cancer PTs and matching normals) [
28], GSE46306 (normal tissues of the cervix) [
29], GSE99553 (gastric mucosa), GSE74104 (testis) [
30], GSE77871 (adrenal tissues), GSE51954 (dermis and epidermis) [
31], GSE64509 (various brain tissues) [
32], GSE42861 (peripheral blood) [
33], and GSE59250 (various immune cells from healthy individuals) [
34]. The methylation data for BrCa cell lines were extracted from the GEO datasets GSE57342 [
35], GSE68379 [
36], GSE78875 [
37], and GSE94943.
Cell lines and genomic DNAs
The BrCa cell lines SKBr3, MDA-MB-134VI, and MCF7, which would serve as sources of cancer genomic DNAs (gDNAs), were grown according to culture conditions recommended by ATCC (
https://www.atcc.org/). At 80–90% confluence, the cells were washed with Phosphate Buffered Saline (× 3), and collected by centrifugation (500 x g). gDNAs were isolated using the DNeasy Blood & Tissue Kit (Qiagen; Valencia, CA). gDNA (> 50 kb size) isolated from blood (buffy coat) of healthy individuals was purchased from Roche (Indianapolis, IN) (also referred to as “Roche DNA”). Quant-iT Picogreen Assay (Life Technologies/Thermo Fisher; Waltham, MA) was used to determine gDNA concentration. The isolated gDNAs were then fragmented (50 bp to 1 kb size) using an ultra-sonicator from Covaris (Woburn, Massachusetts). The fragmentation size was assessed using the Agilent Bioanalyzer System.
Enrichment of methylated genomic DNA
The gDNA fragments containing methylated CpG sites were enriched using the EpiMark® Methylated DNA Enrichment Kit (New England BioLabs, Ipswich, MA). This approach uses selective binding of double-stranded methyl-CpG DNA to the methyl-CpG binding domain of human MBD2 protein fused to the Fc tail of human IgG1. The fusion protein (MBD2-Fc) is coupled to paramagnetic hydrophilic protein A beads. The enrichment procedure was carried out according to the manufacturer’s instructions.
Bisulfite conversion of digested genomic DNA
Bisulfite conversion of cytosine bases was accomplished using the EZ DNA Methylation-Lightning kit from Zymo Research Corporation (Irvine, CA). In brief, 130 μl of Lightning Conversion Reagent was added to 20 μl of previously enriched gDNA fragments. Subsequent protocol steps (according to the manufacturer’s instructions) led to elution of bisulfite converted DNA fragments in 10 μl of elution buffer.
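For readers less familiar with the chemistry, bisulfite treatment deaminates unmethylated cytosines to uracil (read as thymine after amplification), while 5-methylcytosines are protected. The sketch below mimics this on a single strand. The function name, the example sequence, and the methylated positions are hypothetical and serve only to illustrate why methylated and unmethylated templates become distinguishable by sequence.

```python
from typing import Iterable


def bisulfite_convert(seq: str, methylated_positions: Iterable[int] = ()) -> str:
    """Simulate bisulfite conversion followed by PCR on one DNA strand.

    Unmethylated C is read as T; a C at a 0-based index listed in
    methylated_positions is protected and stays C.
    """
    protected = set(methylated_positions)
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in protected:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


# Hypothetical fragment with two CpG sites (C at indices 2 and 7)
# and one non-CpG cytosine (index 6).
fragment = "ATCGTACCGA"
unmethylated = bisulfite_convert(fragment)
methylated = bisulfite_convert(fragment, methylated_positions=[2, 7])
print(unmethylated)  # ATTGTATTGA
print(methylated)    # ATCGTATCGA
```

The two outputs differ only at the CpG cytosines, which is the sequence difference that downstream primers and probes can exploit.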
PCR-LDR-qPCR
The assay we developed for detection of plasma-based BrCa methylation markers is divided into several steps described in the following subsections. All primers (Additional file
1: Supplement 2) were purchased from Integrated DNA Technologies Inc. (Coralville, IA).
Linear amplification
In a 25 μl reaction volume, the linear amplification step was carried out by mixing: 5.0 μl of the corresponding bisulfite-converted DNA template (out of 50 μl of eluted DNA after bisulfite conversion), 5 μl of 5X GoTaq Flexi buffer (no magnesium) (Promega, Madison, WI), 2.5 μl of 25 mM MgCl2 (Promega, Madison, WI), 0.5 μl of 10 mM dNTPs (dATP, dCTP, dGTP and dTTP) (Promega, Madison, WI), 2.5 μl of the reverse primer (or primers in the case of a multiplex reaction) (1 μM), 0.625 μl of 20 mU/μl RNAseH2 (IDT; diluted in RNAseH2 dilution buffer from IDT), and 0.55 μl of KlenTaq1 polymerase (DNA Polymerase Technology, St. Louis, MO) mixed with Platinum Taq Antibody (Invitrogen/Thermo Fisher, Waltham, MA). The reactions were run in a ProFlex PCR system thermocycler (Applied Biosystems/ThermoFisher, Waltham, MA) with the following program: 2 min at 94 °C, 40 cycles of (20 s at 94 °C, 40 s at 60 °C, and 30 s at 72 °C), and a final hold at 4 °C. After the reaction, Platinum Taq antibodies were added to the reaction mixture to inhibit the KlenTaq1 DNA polymerase. The KlenTaq1/Platinum Taq Antibody mixture was prepared by adding 0.02 μl of KlenTaq1 polymerase at 50 U/μl to 0.2 μl of Platinum Taq Antibody at 5 U/μl.
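As a quick arithmetic check on the polymerase/antibody mixture described above, the sketch below computes the units contributed by each component and the amount delivered in the 0.55 μl used per reaction. Volumes and stock concentrations are taken from the text. The helper name is hypothetical, and the reading that the mixture is prepared at the stated ratio and then dispensed by volume is an assumption.

```python
def units(volume_ul: float, stock_u_per_ul: float) -> float:
    """Enzyme units contained in a given volume of stock."""
    return volume_ul * stock_u_per_ul


# One batch of mixture at the ratio stated in the text.
polymerase_u = units(0.02, 50.0)   # polymerase stock at 50 U/ul
antibody_u = units(0.2, 5.0)       # antibody stock at 5 U/ul
batch_volume_ul = 0.02 + 0.2       # 0.22 ul per batch

# Volume of mixture added to each 25 ul linear amplification reaction.
used_ul = 0.55
batches = used_ul / batch_volume_ul  # assumes the ratio is preserved on scale-up

print(f"polymerase per batch: {polymerase_u:.2f} U")
print(f"antibody per batch:   {antibody_u:.2f} U")
print(f"per reaction: {polymerase_u * batches:.2f} U polymerase, "
      f"{antibody_u * batches:.2f} U antibody")
```

Under these assumptions the mixture contains equal units of polymerase and antibody, and each 25 μl reaction receives about 2.5 U of each.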
PCR
For the PCR reaction, 10 μl of linear amplification product (previous step) was mixed with 2 μl of 5X GoTaq Flexi buffer without magnesium, 1 μl of 25 mM MgCl2, 0.4 μl of dNTPs (10 mM each of dATP, dCTP, dGTP and dUTP), 2 μl of 0.5 μM forward primer (or primers in the case of a multiplex reaction), 0.4 μl of Antarctic Thermolabile UDG (1 U/μl) (New England Biolabs, Ipswich, MA), 0.25 μl of 20 mU/μl RNAseH2, and 0.44 μl of KlenTaq1 polymerase mixed with Platinum Taq Antibody (Invitrogen/Thermo Fisher, Waltham, MA). The KlenTaq1/Platinum Taq Antibody mixture was prepared by adding 0.02 μl of 50 U/μl KlenTaq1 polymerase to 0.2 μl of 5 U/μl Platinum Taq Antibody. The 20 μl reactions were run in a ProFlex PCR system thermocycler, using the following program: 10 min at 37 °C, 40 cycles of (20 s at 94 °C, 40 s at 60 °C, and 30 s at 72 °C), 10 min at 99.5 °C, and a final hold at 4 °C.
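The programmed hold time of this PCR profile can be tallied as follows. The step durations come from the text; ramping between temperatures is instrument-dependent and is ignored here, so the result is a lower bound on the actual run time. The data layout and step labels are hypothetical.

```python
# Each entry: (step label, seconds per occurrence, number of occurrences)
pcr_program = [
    ("initial hold at 37 C", 10 * 60, 1),
    ("94 C step", 20, 40),
    ("60 C step", 40, 40),
    ("72 C step", 30, 40),
    ("hold at 99.5 C", 10 * 60, 1),
]

total_seconds = sum(seconds * repeats for _, seconds, repeats in pcr_program)
total_minutes = total_seconds / 60
print(f"programmed hold time: {total_minutes:.0f} min")  # 80 min
```

The 40 cycles account for 60 of the 80 minutes, with the two 10-minute holds making up the rest.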
LDR
The LDR step was performed in a 20 μl reaction prepared by combining: 5.82 μl of nuclease-free water (IDT), 2 μl of 10X AK16D ligase reaction buffer, 0.5 μl of 40 mM DTT (Sigma-Aldrich, St. Louis, MO), 0.25 μl of 40 mM NAD+ (Sigma-Aldrich, St. Louis, MO), 0.5 μl of 20 mU/μl RNAseH2, 0.4 μl of 500 nM LDR upstream probes, 0.4 μl of 500 nM LDR downstream probes, 0.57 μl of purified AK16D ligase (at 0.88 μM), and 4 μl of PCR reaction products from the previous step. The AK16D ligase reaction buffer (at 1X) contains the following: 20 mM Tris-HCl at pH 8.5, 5 mM MgCl2, 50 mM KCl, 10 mM DTT, and 20 μg/ml of BSA (all components purchased from Sigma-Aldrich, St. Louis, MO). LDR reactions were run in a ProFlex PCR system thermocycler using the following program: 20 cycles of (10 s at 94 °C, and 4 min at 60 °C) followed by a final hold at 4 °C.
Taqman real-time qPCR
The qPCR reaction was performed in a 10 μl reaction mixture prepared by mixing: 1.5 μl of nuclease-free water (IDT), 5 μl of 2X TaqMan® Fast Universal PCR Master Mix (Fast AmpliTaq, UDG and dUTP) (Applied Biosystems/ThermoFisher; Waltham, MA), 1 μl of 2.5 μM forward primer, 1 μl of 2.5 μM reverse primer, 0.5 μl of 5 μM probe, and 1 μl of LDR reaction products from the previous step. All qPCR reactions were run in a ViiA7 real-time thermocycler (Applied Biosystems/ThermoFisher; Waltham, MA), using MicroAmp® Fast-96-Well Reaction 0.1 ml plates sealed with MicroAmp™ Optical adhesive film (Applied Biosystems/ThermoFisher; Waltham, MA). The run settings were as follows: fast block, standard curve as experiment type, ROX as passive reference, TAMRA as reporter, and NFQ-MGB as quencher; the program was 2 min at 50 °C, followed by 40 cycles of (1 s at 95 °C, and 20 s at 60 °C).
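The component volumes listed for the qPCR mixture can be checked against the stated 10 μl total, and the final primer and probe concentrations follow from simple dilution (stock concentration × volume added / total volume). The sketch below does this arithmetic; the volumes and stock concentrations are from the text, while the variable names and data layout are hypothetical.

```python
# (component, volume in ul, stock concentration in uM or None if not applicable)
qpcr_mix = [
    ("nuclease-free water", 1.5, None),
    ("2X TaqMan Fast Universal PCR Master Mix", 5.0, None),
    ("forward primer", 1.0, 2.5),
    ("reverse primer", 1.0, 2.5),
    ("probe", 0.5, 5.0),
    ("LDR product", 1.0, None),
]

total_volume_ul = sum(volume for _, volume, _ in qpcr_mix)

# Final concentration after dilution into the full reaction volume.
final_conc_um = {
    name: stock * volume / total_volume_ul
    for name, volume, stock in qpcr_mix
    if stock is not None
}

print(f"total volume: {total_volume_ul} ul")
for name, conc in final_conc_um.items():
    print(f"{name}: {conc:.2f} uM final")
```

The volumes sum to 10 μl, the 2X master mix is diluted to 1X, and both primers and the probe end up at 0.25 μM.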
Taqman digital qPCR
For each digital PCR reaction, a 20 μl mixture was prepared in each well of a 96-well digital PCR microplate. The mixture included 2 μl of diluted LDR product (Step 3), 1X Luna Universal Probe qPCR Master Mix, 0.1% Tween 20, 0.4 mU RNAseH2, 0.025 U Antarctic Thermolabile UDG, 5 μM each of forward and reverse primers, and Taqman probe. 12 μl of the reaction mixture was loaded into the Constellation Digital PCR System (originally Formulatrix, Bedford, MA; currently Qiagen), and run with the following conditions: 37 °C for 10 min, 95 °C for 20 s, and 45 cycles of 5 s at 94 °C and 20 s at 60 °C.
Discussion
The limitations of mammography drive the persistent efforts towards developing non-invasive screening approaches for early BrCa detection. Falling under the term “liquid biopsy”, many of the methods under investigation are technologies which aim to detect blood-based molecular markers originating from BrCa cells. The molecular markers can include cfDNA fragments, exosome-enclosed or naked RNA molecules, secreted proteins and metabolites [
8,
9].
Of particular interest in the early-cancer detection field are circulating tumor DNAs (ctDNAs), which apoptotic and necrotic cancer cells release into patient plasma [
56]. As expected, ctDNA fragments possess the same molecular signatures (somatic mutations, methylation, copy number variation/aberration, SNPs) present in gDNAs isolated from the tumor tissue samples. Hence, molecular characterization tools normally used to investigate cancer gDNAs (such as exome or genome-wide sequencing, PCR, DNA arrays, methylation arrays) have also been applied in ctDNA analysis [
57]. What makes ctDNA analysis especially challenging is the fact that when isolated from patient plasma, ctDNAs are mixed with an overwhelming amount of DNA fragments that are hematopoietic in origin [
58‐
60]. All of the fragments are collectively referred to as cell-free DNAs (or cfDNAs). According to a recent study, the ctDNAs originating from BrCa cells are just a small fraction of total cfDNAs [
24]. This is based on the observation that the mutant allele fraction (MAF; from sequencing 58 cancer-related genes) of cfDNAs isolated from BrCa patients is less than 1% [
24]. It is imperative that the assay employed to analyze cfDNAs is capable of distinguishing between the positive (several copies of ctDNAs) and mostly negative (from non-cancer cfDNAs) signals. This limitation of ctDNA analyses can be circumvented through the identification of more appropriate molecular biomarkers, along with the modification of assay biochemistry towards higher sensitivity and specificity. Although plasma-based ctDNA markers may include markers for mutations, methylation states, or copy number variations (most reports interrogate methylation and mutation markers), methylation markers have several inherent advantages. First, methylation changes are tissue-specific [
61], which makes them well suited as markers for distinguishing one cancer type from another. Another advantage of CpG methylation over mutation is that the methylation changes at adjacent CpG sites in promoter regions are often concordant. Methylation-dependent procedures (such as the use of methyl DNA-binding antibodies) would then be more effective in enriching the fragments containing the highly methylated markers.
Identification of appropriate methylation markers (i.e. particular CpG sites) is crucial. To pinpoint the specific CpG sites that can easily distinguish BrCa tissues from peripheral blood and other types of cancer, we took advantage of the availability of various genome-wide methylation datasets. As previously pointed out, these calculations resulted in the identification of 229 potential CpG markers, which included CpG sites at the locus of
RASSF1A (
Ras association domain-containing protein 1), which happens to be the most frequently reported blood-based methylation marker for breast cancer [
62‐
69]. Additional statistical inspections and assay design considerations then pointed to the selection of the 3 CpG markers we focused on for this manuscript. Two of the CpG sites (m_NR5A2 and m_PRKCB) are located in promoter regions of genes with reported links to breast cancer. NR5A2 (or LRH1) is a zinc finger transcription factor which can regulate CDKN1A expression in BrCa [
70], and has been positively associated with BrCa proliferation [
71], drug resistance [
72], aggressiveness [
73], high grade, and poor outcome [
74]. On the other hand, the role of PRKCB in breast cancer progression is still not clearly defined. While there are reports indicating that PRKCB can promote mammary tumorigenesis [
75], enhance breast cancer cell growth and cyclin D1 expression [
76], and has potential as a therapeutic target [
77], there is also a report indicating it may inhibit tumor growth and metastasis [
78]. The third CpG site interrogated by our assay (m_ncr1) is located less than 8000 bp upstream of exon 1 (according to GENCODE v31 annotation) of the protein-coding gene
EFNA3, a member of the ephrin (EPH) family. Whether this particular CpG site influences the expression of EFNA3 protein, or the hypoxia-related EFNA3 lncRNA [
79,
80] is not clear at this point.
Interestingly, the methylation level at m_NR5A2 and m_PRKCB did not correlate with the transcription of the corresponding genes (Additional file
1: Supplement 10). However, it is important to note that CpG methylation (at the promoter region) is not the only factor that influences gene transcription. It is quite possible that histone modification [
81], regulatory miRNA or ncRNAs [
82], as well as transcription factors can supersede CpG methylation in influencing transcription. The competition between mRNA transcription and mRNA degradation is a dynamic process that can determine the transcript level of a gene at any given time [
83]. Regardless of the influence of the m_NR5A2 and m_PRKCB CpG sites on their respective transcript levels, their association with BrCa progression is quite clear. This is further demonstrated through comparative genome-wide transcription analyses (which is essentially what GSEA is) of BrCa samples that are highly and lowly methylated at each CpG site. As shown in our analyses, the methylation level at each of the three methylation markers (m_ncr1, m_NR5A2, and m_PRKCB) is positively associated with genes, processes, and pathways indicative of BrCa progression. These include processes (and many of the component genes) associated with the retinoid nuclear receptor, PTEN, p53, p27, RB, and MTOR signaling pathways.
A great majority of reports on the interrogation of CpG methylation in cfDNA for BrCa detection employed the methylation-specific PCR (MSP) approach. Aside from
RASSF1, other genes whose CpG sites were observed to be hypermethylated in BrCa patient-derived cfDNAs (through MSP approach) are:
AKR1B1,
ARHGEF7,
BRCA1,
BRMS1,
COL6A2,
CST6,
CDKN2A,
CCND2,
DKK3,
ESR1,
GATA3,
GPX7,
GSTP1,
HOXD13,
HIST1H3C,
HOXB4,
ITIH5,
KLK10,
MSH2,
MLH1,
NBPF1,
P16,
PCDHGB7,
RARB,
RASGRF2,
SOX17,
SLIT2,
SFN,
SFRP1,
TM6SF1,
TMEFF2,
TRIM9, and
WNT5A [
84,
62,
64‐
69,
85‐
92]. The aforementioned CpG markers were selected because the genes have known roles in BrCa progression (primarily as tumor suppressors), or were previously identified using an earlier, much less dense version of the Illumina methylation array (27 K).
Bisulfite conversion is perhaps the most crucial step in MSP. However, bisulfite conversion can cause the degradation of around 84–96% of the input cfDNA, and is thus a significant contributing factor to MSP’s limitations in liquid biopsy [
93]. This is not an issue in analyzing gDNAs extracted from tissues and cell lines, for which the MSP assay was originally intended. In some reports, BrCa patient cfDNAs were analyzed through methylation-sensitive restriction digestion (e.g. with the BstUI enzyme), followed by qPCR, with no bisulfite conversion step in the protocols [
94‐
96]. However, results using this approach are not reliable (higher rates of false positives) if there is incomplete digestion of unmethylated CpG sites.
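To give a sense of scale for the bisulfite-related losses discussed above, the sketch below estimates how many tumor-derived template copies might remain after conversion. The 84–96% degradation range and the tumor fraction below 1% come from the text. The 10 ng cfDNA input, the figure of roughly 3.3 pg per haploid human genome, the use of 1% as an upper-bound tumor fraction, and the assumption that degradation affects tumor and non-tumor fragments equally are all illustrative assumptions.

```python
PG_PER_HAPLOID_GENOME = 3.3  # approximate mass of one haploid human genome (assumption)


def surviving_tumor_copies(input_ng: float, tumor_fraction: float, loss_fraction: float) -> float:
    """Expected tumor-derived haploid genome copies left after bisulfite conversion."""
    total_copies = input_ng * 1000.0 / PG_PER_HAPLOID_GENOME
    tumor_copies = total_copies * tumor_fraction
    return tumor_copies * (1.0 - loss_fraction)


input_ng = 10.0        # hypothetical cfDNA input
tumor_fraction = 0.01  # upper bound suggested by the <1% MAF reported in the text

best_case = surviving_tumor_copies(input_ng, tumor_fraction, loss_fraction=0.84)
worst_case = surviving_tumor_copies(input_ng, tumor_fraction, loss_fraction=0.96)
print(f"{worst_case:.1f} to {best_case:.1f} tumor copies")
```

Under these assumptions only about one to five tumor-derived copies per locus would survive conversion, which illustrates why template loss is a serious constraint for cfDNA assays.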
The assay we are proposing incorporates several features which can collectively improve on the MSP approach. These include the following: a) selective enrichment of methylated DNA through methylated CpG capture using a methyl-DNA binding protein, b) signal amplification of the targeted CpG site through successive steps of bisulfite PCR and LDR, c) prevention of non-specific primer extension by incorporating RNaseH2-targeted ribose bases at the 3′ end of PCR and LDR primers, d) prevention of carryover contamination by PCR products originating from previous positive samples, through the use of UDG enzyme, e) multiple primer binding regions for orthogonal amplification of a region containing the targeted CpG site, and f) multiplex format of the assays.
Bisulfite sequencing is capable of interrogating more CpG markers compared to site-specific bisulfite conversion assays [
97‐
102]. However, we can only assume that the primary problems in MSP assays (the low abundance of cfDNAs and of target methylated CpG markers) are also encountered in bisulfite sequencing approaches. These factors, along with high cost, limit the recovery of information from bisulfite sequencing of cfDNA fragments [
103].