Background
Breast cancer is the most common cancer and one of the leading causes of cancer death among women. The disease is heterogeneous, both clinically and molecularly. A large number of molecular studies have characterized breast cancer on the basis of data derived from one or two genome-wide measurement platforms, typically using gene expression or DNA copy number platforms [1, 2]. Arguably the most influential finding to emerge from these studies is the robust identification of five gene expression–based molecular subtypes of breast cancer: two estrogen receptor (ER)-positive subtypes separated mainly by relatively low (luminal A) and high (luminal B) expression of proliferation-related genes, a subtype enriched for ERBB2-amplified tumors [human epidermal growth factor receptor 2 (HER2)-enriched], a subtype associated with triple-negative [lacking expression of ER, progesterone receptor (PR), and HER2] tumors (basal-like), and a subtype with an expression profile similar to that of normal breast tissue (normal-like) [3]. Later studies using multiple different platforms, including exome sequencing, DNA copy number arrays, DNA methylation arrays, and gene expression arrays, have highlighted the importance of integrating information across platforms to identify key characteristics of the molecular subtypes of breast cancer [4].
DNA methylation patterns and chromatin states are epigenetic features often found to be altered in cancer cells [5]. The breast cancer molecular subtypes have been found to be associated with characteristic DNA methylation patterns on the basis of limited panels of CpG sites [6–8]. Typically, three major DNA methylation subtypes of breast tumors have been identified. One group is characterized by the lowest levels of DNA methylation and is associated with basal-like tumors. A second group is characterized by hypermethylation of promoter CpG sites and is associated with luminal B tumors. A third group is associated with luminal A tumors, whereas the HER2-enriched and normal-like gene expression–based subtypes have been found to have limited association with DNA methylation subtypes. Later, these observations were confirmed using genome-wide sets of CpG sites located primarily in promoter regions [4] as well as across the entire genome [9].
There are many links between chromatin states and DNA methylation [5]. In cancer, widespread correlated changes in DNA methylation patterns and chromatin states have been observed [10]. As these features collectively are associated with whether genes are transcriptionally active or inactive, they may underlie phenotypic changes observed in cancer cells. Furthermore, recent sequencing efforts have identified mutations of genes leading to altered epigenetic patterns for many tumor types [10]. In breast cancer specifically, a number of links between DNA methylation and chromatin state have been observed. For example, promoters that are hypermethylated are often in lineage-commitment genes that in embryonic stem cells are in a transcription-ready bivalent chromatin state characterized by both active and repressive marks [11, 12]. Another example is the observation of extensive chromatin state changes upon loss of DNA methylation in breast cancer coupled with maintaining these hypomethylated regions as transcriptionally silent [13].
However, less is known about how DNA methylation patterns and epigenetic states on a genome-wide scale are coupled with breast cancer heterogeneity as reflected in the breast cancer subtypes. The development of platforms for genome-wide characterization of cells at many levels, together with large public datasets of normal and malignant breast samples, has provided opportunities to address this question. In the present study, we investigated breast cancer heterogeneity on the basis of genome-wide DNA methylation profiles of human tumors and integrated our findings with various types of molecular data, including chromatin states in both embryonic stem cells and human mammary epithelial cells (HMECs) generated in the ENCODE project [
14]. In a discovery cohort with DNA methylation profiles from 188 samples, we identified seven epitypes of breast cancer that were validated in 669 independent samples from The Cancer Genome Atlas (TCGA) breast cancer project [
4]. By integrating analyses across multiple platforms, we show that the epitypes are associated with specific gene expression subtypes, mutations, and DNA copy number aberrations (CNAs). To characterize epitype-specific hyper- and hypomethylation patterns, we identified sets of CpG sites that display differential methylation status between normal breast tissue and tumors of an epitype. These analyses revealed that DNA hypermethylation in luminal and basal-like tumors occurs in different chromatin contexts with different underlying regulatory potential in stem and mammary epithelial cells. Moreover, hypomethylation in luminal tumors was associated with DNA repeats and subtelomeric regions. Our results highlight links between breast cancer subtypes and the epigenome that could improve understanding of biological mechanisms underlying breast cancer heterogeneity and could eventually contribute to diagnostics and therapeutic interventions.
Methods
Sample material for methylation analysis
Fresh frozen breast tumor tissues (n = 188) obtained from the Southern Sweden Breast Cancer Group tissue bank at the Department of Oncology and Pathology, Skåne University Hospital (Lund, Sweden), and from the Department of Pathology, Landspitali University Hospital (Reykjavik, Iceland), were used as a discovery cohort. The 188 breast tumor tissues were from 181 unique female patients (183 primary tumor samples, 2 metastatic samples, and 3 locoregional recurrences; for 3 patients a primary and a recurrent sample were included, and for 4 patients 2 different primary tumors were included). The study was approved by the regional ethics committee in Lund, which waived the requirement for informed consent for the study (numbers LU240-01 and 2009/658), as well as by the Icelandic Data Protection Committee and the National Bioethics Committee of Iceland. For Icelandic patients, written informed consent was obtained according to Icelandic national guidelines.
Breast invasive carcinomas from the TCGA project with 450K methylation data available (based on TCGA update 27 September 2013) were used as a validation cohort [
4]. Replicated tumors were removed and female patients selected, resulting in a validation cohort consisting of 669 breast carcinomas from 666 unique female patients (665 primary tumor samples and 4 metastatic samples; for 3 patients a primary and a metastatic sample were included). For the normal cohort, 96 normal specimens originating from normal breast tissue from 96 different female patients from the TCGA project were used (90 of these patients also have a tumor sample in the validation cohort).
DNA from human mammary fibroblasts, HMECs, human mammary endothelial cells (ScienCell Research Laboratories, Carlsbad, CA, USA), and peripheral blood leukocytes (Promega, Madison, WI, USA) was used to generate a cohort of normal cell types. DNA methylation data from subpopulations of human blood cells generated by Reinius et al. [
15] were downloaded from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) [
16] accession number [GEO:GSE35069].
DNA methylation analysis
Genome-wide methylation data for the discovery cohort and the cohort of normal cell types were generated at SCIBLU Genomics, Lund University, using the Illumina Infinium HumanMethylation450 BeadChip Array (Illumina, San Diego, CA, USA) according to the manufacturer’s instructions. For the discovery cohort, DNA was extracted as previously described [
6]. DNA was treated with bisulfite using the EZ DNA Methylation Kit (Zymo Research, Irvine, CA, USA) according to the manufacturer’s instructions. DNA methylation data for the discovery cohort and the cohort of normal cell types are available in the NCBI GEO [
16] under accession numbers [GEO:GSE75067] and [GEO:GSE74877], respectively.
The 450K methylation data were processed similarly for all cohorts. Methylated and unmethylated signal intensities were obtained from GenomeStudio (Illumina) for the discovery and normal cell type cohorts, and from TCGA methylation level 2 data for the validation and normal breast tissue cohorts. Signal intensities were converted into β values [β = methylated/(methylated + unmethylated)] representing the methylation levels. CpG sites with detection p values greater than 0.05 or the number of beads for a channel fewer than 3 were considered missing measurements, and β values were set to “NA” (with the exception that the number of beads was not available for the TCGA cohorts). No sample had more than 10,000 missing values (discovery cohort range 835–9438, validation cohort range 214–4746, normal breast tissue cohort range 258–2700, normal cell type cohort range 758–1278). For the blood subpopulation data, β values were obtained as processed in the NCBI GEO.
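The β value definition and the missing-value rules above can be sketched as follows. This is a minimal illustration in Python rather than the authors' pipeline: the function name, the argument layout, and the handling of a zero total intensity are assumptions, while the two thresholds are the ones stated in the text.

```python
from typing import Optional

DETECTION_P_MAX = 0.05  # detection p value threshold stated in the text
MIN_BEADS = 3           # minimum bead count per channel stated in the text


def beta_value(methylated: float, unmethylated: float, detection_p: float,
               min_beads: Optional[int] = None) -> Optional[float]:
    """Return beta = M / (M + U), or None (the text's "NA") for a failed probe.

    min_beads is the smaller bead count of the two channels; pass None when
    bead counts are unavailable (as for the TCGA cohorts) to skip that check.
    """
    if detection_p > DETECTION_P_MAX:
        return None
    if min_beads is not None and min_beads < MIN_BEADS:
        return None
    total = methylated + unmethylated
    if total == 0:
        # Assumption: an undefined ratio is treated as a missing measurement.
        return None
    return methylated / total
```

For example, `beta_value(300.0, 100.0, 0.01, 5)` gives 0.75, whereas the same intensities with a detection p value of 0.2 or with only 2 beads give a missing value.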
Adjustment for bias between Infinium I and II assay CpG probes was performed by using a peak normalization algorithm. Briefly, for each sample, we performed a peak-based correction of Illumina I and II chemical assays inspired by Dedeurwaerder et al. [
17] as previously described [
18]. For each chemical assay separately, we smoothed the β values (Epanechnikov smoothing kernel) to estimate unmethylated and methylated peaks. The unmethylated peak was moved to 0 and the methylated peak to 1 using linear scaling, with β values in between stretched accordingly. β values less than 0 were set to 0 and values greater than 1 were set to 1.
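The peak-based correction described above can be sketched as follows for the β values of one sample and one chemical assay. The bandwidth, the evaluation grid, and the rule of searching for the unmethylated peak below 0.5 and the methylated peak at or above 0.5 are assumptions made for illustration; the published procedure [17, 18] may differ in these details.

```python
def epanechnikov_density(values, x, bandwidth):
    """Kernel density estimate at x using the Epanechnikov kernel."""
    total = 0.0
    for v in values:
        u = (x - v) / bandwidth
        if abs(u) < 1.0:
            total += 0.75 * (1.0 - u * u)
    return total / (len(values) * bandwidth)


def find_peaks(values, bandwidth=0.05, grid_size=201):
    """Locate the unmethylated and methylated density peaks on a grid over [0, 1].

    Assumption: the unmethylated peak lies below 0.5 and the methylated peak
    at or above 0.5.
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    density = [epanechnikov_density(values, x, bandwidth) for x in grid]
    low = [i for i, x in enumerate(grid) if x < 0.5]
    high = [i for i, x in enumerate(grid) if x >= 0.5]
    unmeth_peak = grid[max(low, key=lambda i: density[i])]
    meth_peak = grid[max(high, key=lambda i: density[i])]
    return unmeth_peak, meth_peak


def rescale(values, unmeth_peak, meth_peak):
    """Move the unmethylated peak to 0 and the methylated peak to 1, then clip."""
    span = meth_peak - unmeth_peak
    if span <= 0:
        raise ValueError("methylated peak must lie above unmethylated peak")
    return [min(1.0, max(0.0, (v - unmeth_peak) / span)) for v in values]
```

In use, `find_peaks` would be run separately on the Infinium I and Infinium II probes of a sample, and `rescale` applied to each set with its own pair of peaks.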
A DNA hypermethylation score was calculated as described elsewhere [19]. The hypermethylation score was calculated for two sets of CpG sites: a global score to which all CpG sites on the platform contributed, and a promoter CpG island score to which only CpG sites with both an Illumina annotation of TSS1500 or TSS200 and an Illumina CpG island annotation contributed.
Identification and validation of breast cancer epitypes
Unsupervised bootstrap consensus clustering was performed to identify DNA methylation subgroups of tumors using 2000 bootstrap iterations as described elsewhere [20]. The ward.D agglomerative method in the R function hclust, with Pearson correlation–based distance, was used for both the inner clustering (based on methylation patterns) and the outer clustering (based on bootstrap coclustering frequencies). DNA methylation centroids for an epitype were constructed by taking the average β value for each CpG site across the tumors in the epitype in the discovery cohort. Pearson correlations between tumors in the validation cohort and the centroids were calculated. Each tumor in the validation cohort was classified into an epitype on the basis of the centroid to which the correlation was largest. Principal component analysis was used to verify that no technical artifacts influenced the methylation data or the epitypes and that the epitypes were associated with the dominant variation in genome-wide methylation data [21].
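The centroid-based classification of validation tumors can be illustrated with a short sketch. The function names and the toy four-site profiles are hypothetical, and missing β values are assumed to have been removed beforehand; only the rule itself (average β per CpG site, then assignment to the centroid with the largest Pearson correlation) comes from the text.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of beta values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def build_centroid(tumors):
    """Average beta value per CpG site across the tumors of one epitype."""
    n_sites = len(tumors[0])
    return [sum(t[i] for t in tumors) / len(tumors) for i in range(n_sites)]


def classify(tumor, centroids):
    """Return the name of the centroid with the largest Pearson correlation."""
    return max(centroids, key=lambda name: pearson(tumor, centroids[name]))


# Toy discovery data: two hypothetical epitypes, four CpG sites each.
centroids = {
    "ET_a": build_centroid([[0.1, 0.2, 0.9, 0.8], [0.2, 0.1, 0.8, 0.9]]),
    "ET_b": build_centroid([[0.9, 0.8, 0.1, 0.2], [0.8, 0.9, 0.2, 0.1]]),
}
label = classify([0.2, 0.3, 0.7, 0.9], centroids)
```

Because the assignment uses correlation rather than distance, a validation tumor is matched on the shape of its methylation profile and not on its absolute methylation level.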
Gene expression data analysis
Gene expression data from oligonucleotide microarrays were available for 158 of the tumors in the discovery cohort as part of accession number [GEO:GSE25307], which encompasses 577 breast tumors [
22]. The normalized gene expression values (mean-centered across 577 tumors) in accession number [GEO:GSE25307] were used. Probes were mapped to Entrez Gene IDs, and the probe with the largest variation in expression across the 577 tumors was selected for each gene, resulting in relative gene expression levels for 7499 genes in the discovery cohort. TCGA RNAseq v2 level 3 data were available for a total of 994 tumors and 106 normal breast tissue samples, including 661 of the 669 tumors in the validation cohort. The gene-normalized RSEM count estimates were offset by a pseudocount of 1, log2-transformed, and mean-centered across the 994 tumor samples to generate relative gene expression levels for 20,531 genes in the validation cohort. For some analyses, we were interested in comparing estimates of the expression levels of different genes and therefore could not use relative expression levels across tumors. In these analyses, we took the effective transcript length into account by using the gene RSEM scaled estimates (tau) from the TCGA data transformed into transcripts per million (TPM), and used log2(TPM + 1) as a measure of gene expression [
23]. The R package genefu was used to assign expression-based molecular subtype to tumor samples on the basis of PAM50 using relative expression levels in both the discovery and validation cohorts [
24]. Expression data for 35 and 50 of the 50 PAM50 genes were available in the discovery and validation cohorts, respectively. The R package iC10 was used to assign IntClust groups to tumor samples in both the discovery and validation cohorts [
25]. For each cohort, the iC10 package was run with the following settings: expression data only, probe mapping based on gene symbols, and normalizing each probe to a Z-score. Expression data for 346 and 584 of the 612 iC10 genes were available in the discovery and validation cohorts, respectively. The activity of eight gene modules, representing transcriptional programs in breast cancer, was calculated in each tumor in both the discovery and validation cohorts as the average relative expression level of the genes in a module [
26]. Genes in modules were mapped to genes in expression data based on Entrez Gene ID.
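The two expression measures used for the validation cohort can be sketched as follows. The function names are hypothetical, and the conversion TPM = tau × 10^6 is the usual RSEM convention, assumed here rather than quoted from the text.

```python
import math


def relative_expression(counts_across_tumors):
    """log2(count + 1) for one gene, mean-centered across tumors."""
    logged = [math.log2(c + 1.0) for c in counts_across_tumors]
    mean = sum(logged) / len(logged)
    return [v - mean for v in logged]


def log2_tpm(scaled_estimate):
    """log2(TPM + 1) from an RSEM scaled estimate (tau).

    Assumption: TPM = tau * 1e6, the standard RSEM convention.
    """
    return math.log2(scaled_estimate * 1e6 + 1.0)
```

The first measure compares one gene across tumors, whereas the second is comparable between genes within a tumor because tau already accounts for effective transcript length.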
Correlation between DNA methylation and gene expression
We calculated correlations between methylation and relative gene expression levels using the validation cohort because the number of genes was limited on the expression platform used in the discovery cohort. When matched on gene symbol, 324,991 CpG sites were associated with a unique gene in the TCGA gene expression data and displayed variation in methylation levels across the validation cohort. Pearson correlations of 0.2 and −0.2 between gene expression and methylation levels were associated with p values much less than 10^−6. Hence, after correction for multiple hypothesis testing, fewer than one CpG site with a Pearson correlation greater than 0.2 or less than −0.2 is expected by chance across the 661 tumors.
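The reasoning behind the ±0.2 cutoff can be checked with a short calculation. This sketch assumes that the expected number of chance findings is the number of CpG sites multiplied by the per-site p value, and it approximates the t distribution by a normal distribution, which is reasonable at 659 degrees of freedom; neither detail is stated in the text.

```python
import math

N_TUMORS = 661         # tumors with both methylation and expression data
N_CPG_SITES = 324_991  # CpG sites matched to a unique gene


def pearson_p_value_normal_approx(r, n):
    """Two-sided p value for a Pearson correlation r observed in n samples.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2) with a normal approximation
    to the t distribution (an assumption, adequate for large n).
    """
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    return math.erfc(abs(t) / math.sqrt(2.0))


p_value = pearson_p_value_normal_approx(0.2, N_TUMORS)
expected_by_chance = N_CPG_SITES * p_value
```

Under these assumptions the p value falls below 10^−6 and the expected number of chance correlations beyond the cutoff stays below one.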
Functional classification of gene sets
Enrichment of functional classification of genes in identified gene sets was analyzed using the DAVID Functional Annotation Tool [
27] with the default Homo sapiens background and the false discovery rate (FDR) option to correct for multiple hypothesis testing. Gene set enrichment analysis was used to investigate the overlap of genes in identified gene sets with genes in 10,348 gene sets collected in the Molecular Signatures Database (MSigDB) [
28]. In these analyses, CpG annotation data obtained from Illumina were used to map CpG sites to genes, and only CpG sites mapping to a unique gene were included.
Processing of human genome data
Chromatin states in human embryonic stem cells (H1hESCs) and HMECs, as well as peak calls for DNase I hypersensitive sites and EZH2 binding sites in HMECs from the uniform pipeline, all generated by the ENCODE consortium, were obtained using the UCSC Genome Browser [
14,
29]. CpG sites were mapped to chromosome regions with information from ENCODE using the R package GenomicRanges [
30]. CpG sites were mapped to DNA repeat regions using the repeats_rmsk_hg19.txt table in the UCSC Genome Browser.
BRCA1 promoter methylation analysis was performed using the 450K methylation data. To identify informative CpG sites, we screened all 44 CpG sites on the platform located within BRCA1 transcripts or 1 kb upstream for negative correlation (Pearson correlation less than −0.2) with BRCA1 gene expression levels using the validation cohort. We identified 21 informative CpG sites. All informative CpG sites were located within 1 kb centered on the BRCA1 transcription start site. Tumors were classified as BRCA1 promoter methylated if the average β value for the informative CpG sites was greater than 0.2. The average β value for the informative CpG sites ranged from 0.004 to 0.03 across the 96 normal tissue samples in the normal cohort. To validate BRCA1 promoter status, we used data available for 71 tumors in the discovery cohort from a previous study, obtained with a PSQ HS 96 pyrosequencing system (Biotage, Uppsala, Sweden) as described [22].
HORMAD1 promoter methylation analysis was performed in the same way as it was for BRCA1. We identified seven HORMAD1 informative CpG sites among nine CpG sites located within HORMAD1 transcripts or 1 kb upstream. All informative CpG sites were located within 1 kb centered on the HORMAD1 transcription start site. Tumors were classified as HORMAD1 promoter unmethylated if the average β value for the informative CpG sites was less than 0.8. The average β value for the informative CpG sites ranged from 0.91 to 0.99 across the 96 normal tissue samples in the normal cohort.
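The two promoter calls reduce to thresholding an average β value, as in the sketch below. The function names are hypothetical, and skipping missing β values when averaging is an assumption; the thresholds of 0.2 and 0.8 are those given in the text.

```python
BRCA1_METHYLATED_ABOVE = 0.2     # threshold stated in the text
HORMAD1_UNMETHYLATED_BELOW = 0.8  # threshold stated in the text


def mean_beta(betas):
    """Average beta over informative CpG sites, skipping missing values (None).

    Assumption: missing measurements are simply left out of the average.
    """
    observed = [b for b in betas if b is not None]
    return sum(observed) / len(observed)


def brca1_promoter_methylated(betas):
    return mean_beta(betas) > BRCA1_METHYLATED_ABOVE


def hormad1_promoter_unmethylated(betas):
    return mean_beta(betas) < HORMAD1_UNMETHYLATED_BELOW
```

Both thresholds sit well outside the ranges reported for normal breast tissue (0.004 to 0.03 for BRCA1 and 0.91 to 0.99 for HORMAD1), so a call indicates a clear departure from the normal state.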
Somatic mutation analysis
Somatic mutations from exome sequencing were available from TCGA for 645 of the 669 tumors in the validation cohort [mutation annotation format (MAF) file, curated level 2 data, version 2.1.1.0]. For some tumors, the MAF file contained mutations called from multiple exome sequencing experiments with different reference samples or different tumor aliquots. (For 34 of the tumors mutations were from 2 experiments, and for 1 tumor mutations were from 3 experiments.) We called gene mutations when genes were mutated in at least one experiment for the tumor, and the total number of single-base substitutions for each tumor was calculated as the average for the multiple experiments.
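The rule for tumors with several exome sequencing experiments can be sketched as follows. The record layout (one dictionary per experiment) is a hypothetical simplification of the MAF content; the union of mutated genes and the averaged substitution count follow the text.

```python
def summarize_tumor(experiments):
    """Combine mutation calls from one or more experiments on the same tumor.

    experiments: list of dicts with keys "genes" (set of mutated gene symbols)
    and "n_substitutions" (number of single-base substitutions); this layout
    is an assumption made for illustration.
    """
    # A gene counts as mutated if it is mutated in at least one experiment.
    mutated_genes = set().union(*(e["genes"] for e in experiments))
    # The substitution count is averaged over the experiments.
    mean_substitutions = (
        sum(e["n_substitutions"] for e in experiments) / len(experiments)
    )
    return mutated_genes, mean_substitutions
```

Taking the union for gene calls favors sensitivity, while averaging the substitution counts avoids inflating the mutation rate of tumors that happened to be sequenced more than once.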
Copy number analysis
Copy number estimates and CNAs obtained from bacterial artificial chromosome (BAC) arrays were available for 180 of 188 tumors in the discovery cohort from previous studies [
22,
31,
32]. Affymetrix Genome-Wide Human SNP Array 6.0 (Affymetrix, Santa Clara, CA, USA) level 3 data were available from TCGA for 660 of 669 tumors in the validation cohort and were used to generate copy number estimates and CNAs as described elsewhere [
33]. The fraction of the genome altered (FGA) by copy number alterations was estimated as the number of probes with copy number gain or loss divided by the total number of probes for the platform. Amplifications were identified using a previously defined set of significant DNA CNAs in breast cancer [
31]. This set was identified using GISTIC [
34]. GISTIC regions with an average copy number estimate of probes in the region greater than 0.8 were called as amplifications in both the discovery and validation cohorts. Complex arm-wise aberration index (CAAI) scores were calculated for each tumor as described by Russnes et al. [
35]. A case was classified as CAAI-positive if one or more chromosome arms were affected by complex alterations with a CAAI score greater than 2 for samples in the discovery cohort or greater than 4 in the TCGA cohort. The cutoffs differ between the cohorts because the copy number data were generated on different platforms (Affymetrix Genome-Wide Human SNP Array 6.0 for TCGA, BAC arrays for the discovery cohort). The platforms respond differently (platform-related characteristics) to copy number change (amplitude), and this amplitude is an important variable in the CAAI calculation.
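The copy number summaries described in this section can be sketched as simple threshold rules. The function names, the coding of probe calls as −1, 0, and 1, and the platform labels are hypothetical; the cutoffs (0.8 for amplifications, 2 and 4 for CAAI) are those given in the text, and the CAAI scores themselves are taken as given.

```python
AMPLIFICATION_CUTOFF = 0.8
CAAI_CUTOFF = {"BAC": 2.0, "SNP6": 4.0}  # discovery and TCGA platforms


def fraction_genome_altered(probe_calls):
    """Fraction of probes called as gain (1) or loss (-1); 0 means no change."""
    altered = sum(1 for call in probe_calls if call != 0)
    return altered / len(probe_calls)


def is_amplified(region_estimates):
    """Call a GISTIC region amplified if its mean copy number estimate exceeds 0.8."""
    return sum(region_estimates) / len(region_estimates) > AMPLIFICATION_CUTOFF


def caai_positive(arm_scores, platform):
    """CAAI-positive if any chromosome arm exceeds the platform-specific cutoff."""
    return any(score > CAAI_CUTOFF[platform] for score in arm_scores)
```

A tumor with a maximum arm score of 3, for instance, would be CAAI-positive on the BAC platform but not on the SNP array platform, which is the intended effect of the platform-specific cutoffs.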
Statistical analysis
Wilcoxon tests, Kruskal-Wallis tests, χ² tests, t tests, and Fisher’s exact tests were performed in R. The p values from these tests were adjusted for multiple testing using p.adjust in R with the Benjamini-Hochberg method to control the FDR [
36]. Survival analysis was performed in R using the survival package. Survival functions for patients stratified by epitypes were estimated using the Kaplan-Meier estimator and compared using the log-rank test. In the survival analysis, 169 samples (first primary tumor with available survival data) were included for the discovery cohort and 654 samples (primary tumor with available survival data) were included for the validation cohort.
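The Benjamini-Hochberg adjustment applied to the test p values can be written out in a few lines. This is an illustrative Python reimplementation of the standard procedure, not the R code used in the study; the function name is hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order.

    For the p value of rank k (1 = smallest) among n tests, the adjusted
    value is the minimum of p * n / rank over all ranks >= k, capped at 1.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down so the running minimum enforces
    # monotonicity of the adjusted values.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted
```

For the p values 0.01, 0.04, 0.03, and 0.005, the adjusted values are 0.02, 0.04, 0.04, and 0.02, so all four tests would pass an FDR threshold of 0.05.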
Discussion
DNA methylation of CpG sites in the genome is a normal developmental process that is of interest in cancer because many sites become aberrantly methylated or demethylated in the disease state. Moreover, it is often claimed that DNA methylation processes are of importance for tumor initiation and progression. We conducted a comprehensive analysis of genome-wide DNA methylation profiles of 188 breast tumor samples. Our overarching goal was to gain insights into how DNA methylation patterns on a genome-wide scale are associated with breast cancer heterogeneity. The findings were extensively validated in an independent cohort from TCGA encompassing 669 breast tumor samples. Previously, TCGA identified five epitypes—essentially corresponding to two luminal A epitypes, two luminal B epitypes, and a basal-like epitype—in an analysis of a large tumor set (
n = 466) restricted to CpG sites in promoters [
4]. The epitypes identified by TCGA provide a direct extension from the three epitypes typically identified in smaller studies [
6‐
9]. In the present study, we identified seven epitypes of breast cancer using unsupervised analysis of genome-wide DNA methylation levels not restricted to CpG sites in promoters. Our epitypes give independent support to the five epitypes identified by TCGA and add a normal-like epitype, ET1 (normal-like tumors were very few in the original TCGA analysis), and an epitype, ET6, enriched for HER2-enriched tumors. We performed an integrative analysis of genomic data at multiple levels to characterize the breast cancer epitypes.
To a large extent, the four luminal epitypes we identified (ET2–ET5) were characterized by a gradual increase of many variables from ET2 to ET5. For example, proliferative rate, fraction of luminal B tumors, promoter CpG island methylation levels, overall mutation rate,
TP53 mutation frequency, number and complexity of CNAs, number of amplifications, and tumor size all increased from ET2 to ET5. These findings are consistent with observations based on gene expression–based analyses suggesting that the separation of tumors into luminal A and luminal B is not well-defined, but rather reflects an arbitrary cutpoint in a continuous distribution of expression levels of proliferation-related genes [
37‐
39]. Importantly, the luminal epitypes also displayed specific epigenetic characteristics, in particular the two more luminal B-like epitypes, ET4 and ET5. ET4 displayed a global hypomethylation phenotype and hypomethylation of subtelomeric regions, and was enriched for tumors with
BRCA2 germline mutations. However, the association between
BRCA2 germline mutations and ET4 remains to be validated in an independent dataset. ET5 displayed a global hypermethylation phenotype and was associated with older patients. The different global methylation patterns of ET4 and ET5 illustrate the value of going beyond analyses restricted to promoter CpG islands. In contrast, the more luminal A-like epitypes ET2 and ET3 seemed to reflect more of a continuum, and the separation of these tumors into groups is likely cohort size–dependent. Indeed, in an unsupervised analysis of the large validation cohort (n = 669), there was support for separating ET3 into two groups (Additional file 2: Figure S2D).
HER2-enriched tumors are typically found to display heterogeneous DNA methylation patterns not associated with a specific epigenetic subtype [
4,
6,
9]. In a previous study based on CpG sites in promoter regions, researchers identified a subtype associated with HER2-enriched tumors with a methylation pattern of infiltrating lymphocytes [
54]. Such a subtype shows similarities to our epitype ET1 that contains relatively many HER2-enriched tumors and is characterized by high expression of immune response genes (Additional file
2: Figure S3). In the present study, we identified, for the first time to our knowledge, a breast cancer epitype associated with HER2-enriched tumors not displaying a methylation pattern similar to normal cells (ET6). ET6 contains only a fraction of the HER2-enriched or
ERBB2-amplified tumors (around 20 %), and it is likely that our use of tumor sets containing many HER2-enriched tumors (Table
1) was essential to identifying this HER2-associated epitype. ET6 tumors were characterized by multiple amplifications beyond HER2 (the epitype with most amplifications per sample), the most complex genomes,
TP53 mutations, and poor overall survival.
We identified only a few associations between somatic mutations and epitypes in a screen taking multiple testing into account. As expected, PIK3CA and CDH1 were frequently mutated in the luminal epitypes and TP53 was frequently mutated in the basal-like and HER2-enriched epitypes. Many genes were mutated in relatively few samples, and it may be worthwhile to investigate whether mutations in sets of functionally related genes underlie specific epitypes. BRCA1 mutations were significantly associated with the basal-like epitype (ET7). However, we did not identify any methylation differences within the basal-like epitype when stratified according to BRCA1 status (either germline or somatic), with the exception that BRCA1 alone displays promoter methylation in a subset of tumors with the basal-like epitype. These observations are consistent with findings reported by Prat et al., who observed very minor molecular differences at multiple levels (gene, protein, miRNA, and DNA methylation) according to BRCA1 status in basal-like breast cancer [55].
Analyses of whole tumor tissues have revealed that DNA methylation patterns are heavily influenced by surrounding or infiltrating stromal cells [
18,
54]. We identified an epitype with a methylation pattern similar to that of normal cells (ET1). By collecting tumors with normal-like methylation patterns into a separate epitype, the characteristics of the other epitypes are likely to become clearer. The reproducibility of identified subtypes is often assessed by showing that the proportion of cases assigned to each subtype is similar across different cohorts [
51]. However, it is important to keep in mind that some methods have a bias toward keeping the proportions of subtypes similar [
56]. In the present study, we analyzed retrospective tumor cohorts essentially generated by collecting as many tumors as possible, which may have resulted in cohorts with different characteristics. We found the proportions of samples assigned to the epitypes somewhat different for the discovery and validation cohorts. ET1 (23 % vs. 14 %), ET6 (6 % vs. 4 %), and ET7 (24 % vs. 15 %) contained larger fractions of samples in the discovery cohort, whereas ET3 (30 % vs. 15 %) and ET5 (12 % vs. 4 %) contained larger fractions of samples in the validation cohort. Reassuringly, these differences reflect differences in the composition of the cohorts. On one hand, the discovery cohort is enriched for HER2-enriched tumors [many of which likely are infiltrated by immune cells (Fig.
1a, Additional file
2: Figure S3)] and tumors from patients with
BRCA1 germline mutations. On the other hand, the validation cohort contains a larger fraction of ER-positive luminal tumors and more tumors from older patients (Table
1). These interpretable connections between cohort composition and epitype proportions add support to the reproducibility and generalizability of our epitypes.
Traditionally, epigenetic reprogramming has been thought to contribute to tumor progression by silencing tumor suppressor genes. This model has been challenged by the finding that most cancer-associated methylation occurs in genes that are already repressed in the normal tissue from which the cancer derives [
57,
58]. We identified two different patterns of cancer-associated DNA methylation in breast tumors. One set of CpG sites was methylated in both luminal and basal-like breast tumors and was therefore considered constitutive, whereas a second set was specifically methylated in luminal breast cancer. We found that the set of CpG sites with constitutive methylation matched the paradigm of being repressed in normal breast epithelial cells and displaying no correlation between expression and methylation levels. In contrast, the set of CpG sites methylated specifically in luminal breast cancer was associated with genes expressed in normal breast epithelial cells and displayed negative correlation between expression and methylation levels. Similar observations have been made in pediatric acute lymphoblastic leukemia for CpG sites with constitutive and subtype-specific methylation patterns, respectively [
59]. Moreover, differentially methylated regions associated with bladder cancer subtypes have been found to separate into patterns with substantial differences with respect to expression–methylation correlations [
48]. As proposed by Sproul et al., the aberrant constitutive methylation in breast tumors may be a marker of their epithelial cell lineage rather than of tumor progression [
60]. However, the CpG sites specifically methylated in luminal breast cancer do influence gene expression levels and may contribute to tumor progression. Methylation of these CpG sites was associated with epitypes enriched for luminal B tumors. This finding is consistent with our previously proposed model in which luminal differentiation is partially blocked by aberrant methylation in luminal B tumors [
6].
Constitutive methylation in breast cancer and methylation specific to luminal cancer occurred in regions in different chromatin contexts in normal mammary epithelial cells. Constitutive methylation occurred primarily in regions in a Polycomb-repressed state, consistent with this methylation not being the original cause of repression of gene expression. Luminal-specific methylation was enriched in regions in active promoter states in normal cells, adding support to the picture in which aberrant methylation contributes to a block to keep some luminal cancers more undifferentiated. Because breast cancer–specific methylation to a large extent is associated with chromatin states and thus with aberrant methylation of very many genes, often already repressed in precancerous tissue, it is not straightforward to identify potential epigenetic driver genes. It may be that epigenetically deregulated driver genes are rare and that most methylation in cancer is a passenger event of general epigenetic deregulation in cancer. Perhaps the methylated genes are prone to methylation merely because they are repressed in a tissue-specific fashion [
58]. Moreover, we observed that genes unmethylated in breast cancer were associated with subtelomeric regions and DNA repeats and showed limited influence on gene expression levels. Hence, identification of candidate tumor suppressor genes or oncogenes based solely on methylation data will likely result in numerous false-positive findings.
We focused our analyses on genome-wide screens for CpG sites that display changes in methylation state between macrodissected tumor tissue and normal breast tissue. There are limitations with use of this approach, although its utility in identifying and characterizing robust epitypes is clear. Directions for future improved characterization of breast cancer epigenetic heterogeneity include using different normal cell subpopulations separately instead of normal breast tissue, and investigating CpG sites that display varying or intermediate methylation in normal cell populations. Another limitation of the present study is that we restricted our analyses to epitype-specific methylation patterns. These analyses revealed very low numbers of CpG sites with specific hyper- and hypomethylation across the basal-like epitype. However, directed analyses showed that BRCA1 and HORMAD1 are clear candidates for driver genes directly regulated by aberrant methylation in some basal-like breast cancers. Taken together, our results suggest that the dominant patterns of breast cancer–specific hyper- and hypomethylation are associated with their genomic contexts, but also that there may be epigenetically deregulated driver genes for subsets of samples.
The gene expression–based molecular subtypes of breast cancer have been included in international guidelines for breast cancer treatment [
61]. The epitypes of breast cancer described in this report reflect, to a large extent, the gene expression–based subtypes and may not add independent prognostic value. Nevertheless, DNA methylation measurement could still provide a technically simpler and more robust clinical subtyping tool. Systemic treatment decisions for luminal breast cancer are partly dependent on differences in proliferative rates used to separate these tumors into luminal A and B. Our characterization of luminal epitypes opens up new opportunities to evaluate connections between chemotherapy response and molecular characteristics of luminal tumors. For example, the identification of a luminal group of patients with very few relapses who could be spared chemotherapy may potentially be improved by integrating methylation data with other molecular information. Because aberrant methylation in breast cancer affects large numbers of CpG sites, there are likely very many individual CpG sites that correlate with prognostic information. It has been found that most genes methylated in breast cancer cell lines cannot be derepressed by using the demethylating agent 5-aza-2′-deoxycytidine [
60]. However, genes already repressed in normal epithelial cells dominated the evaluated genes. Hence, it may still be worthwhile to evaluate whether demethylating agents have an effect on the subset of genes in luminal tumors with expression levels clearly associated with promoter methylation. Potentially, demethylating agents could result in further differentiation of luminal tumors with extensive promoter methylation and could benefit patient outcomes.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
MR conceived the study. KH, JS, and MR designed the study. KH, JS, DL, POB, JVC, RBB, ÅB, GJ, and MR contributed to acquisition of data. JS, ML, and MR performed data analysis. KH, JS, MA, DL, MH, GJ, and MR contributed to data interpretation. MR drafted the manuscript with help from all of the other authors. All authors read and approved the final manuscript.