Background
Breast cancer is the most common cancer in women and the leading cause of cancer-related mortality in women worldwide [1, 2]. Owing to mammographic screening and advances in chemotherapy, breast cancer mortality rates have decreased in developed countries since 1990 [3, 4]. Nonetheless, axillary node-negative patients treated by surgery showed a ten-year recurrence rate of approximately 20% [5]. The five-year survival rate of stage I and II breast cancer patients is reported to be approximately 80% to 88% [6-8]. This means that 10-20% of early-stage breast cancer (EBC) patients have poor clinical outcomes. Given the large impact that breast cancer has on public health, it is worth investigating the genetic mechanisms underlying the poor clinical outcomes of some EBCs.
Genomic instability is one of the hallmarks of breast cancer. DNA copy number aberrations, which are commonly detected in cancer lesions, are thought to be involved in tumorigenesis and to affect cancer phenotypes [9]. Different patterns of copy number alterations are associated with distinct gene expression patterns and clinical characteristics of breast cancer [10]. A number of chromosomal alterations and subsequent expression changes have been investigated to determine their implications for clinical phenotypes or prognosis. These investigations have resulted in the identification of some cancer-related genes in breast cancer [11]. For example, HER2 amplification/overexpression is known to occur at an early developmental stage of ductal carcinoma in situ (DCIS). Loss of 16q, where potential tumor suppressor genes such as E-cadherin (CDH1) and CDH13 are located, is also known to be a major event in low-grade invasive ductal carcinoma [12, 13]. In particular, recent large-scale studies have elucidated the molecular complexity of breast cancer and suggested novel genetic subgroups [14-18]. However, since most of these studies examined Caucasians or Hispanics, the profiles of chromosomal alterations and their biological implications in Asians are less well characterized.
In this study, we aimed to describe commonly occurring chromosomal alterations in EBC (stage I and II) and to explore the implications of recurrently altered regions (RAR) on patient prognosis. For this purpose, we analyzed DNA copy number alterations (CNAs) across the whole genome using oligoarray-comparative genomic hybridization (CGH) in a discovery set of EBC patients. RARs in the discovery set that were found to be significantly associated with prognosis were validated in an independent replication set. Our results will contribute to a better understanding of early tumorigenesis in breast cancer and will help to predict the prognosis of EBC patients.
Methods
Patients and tumor specimens
As a discovery set for the whole-genome array-CGH analysis, frozen tumor tissues were obtained from 48 EBC patients who underwent surgical resection at Dankook University Hospital in Cheonan, Korea (from 1998 to 2002). As an independent replication set, 97 formalin-fixed, paraffin-embedded (FFPE) EBC tissue samples (from 1996 to 2002) were obtained from Seoul St. Mary’s Hospital, Korea. Patient survival status was obtained in 2010 from the Korean Central Cancer Registry, Ministry of Health and Welfare, Korea. All breast cancers were stage I, IIA, or IIB. This study was performed under approval from the Institutional Review Board of the Catholic University Medical College of Korea (CUMC06U015). Tumor stage was determined according to the standard AJCC guidelines for tumor-node-metastasis classification (sixth edition). Clinicopathologic characteristics of the study subjects are summarized separately for the discovery and replication sets in Table 1. Receptor status for ER, PR, and HER2 was obtained through a medical record review; for cases without recorded receptor status, immunohistochemical (IHC) staining for ER, PR, and HER2 was performed. Based on the IHC measurement, breast cancer cases were categorized into four molecular subtypes as described elsewhere: luminal type A (ER + and/or PR +, HER2 -: Luminal A), luminal type B (ER + and/or PR +, HER2 +: Luminal B), HER2-overexpressed (ER - and PR -, HER2 +: HER2), and triple negative (ER -/PR -/HER2 -: TNBC) [19]. For array-CGH analysis, 10-μm-thick frozen sections of tumor cell-rich areas (>60%) were microdissected. Genomic DNA was extracted from these sections using a DNeasy Blood & Tissue Kit (Qiagen, Hilden, Germany). For genomic real-time quantitative PCR (qPCR) analysis, 10-μm-thick paraffin sections of tumor cell-rich areas (>60%) in the replication set were microdissected. After paraffin removal, genomic DNA was extracted using a DNeasy Blood & Tissue Kit (Qiagen). Genomic DNA extracted from the blood of a healthy Korean female individual without breast cancer was used as the universal normal reference for all array-CGH experiments.
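The IHC-based subtype categorization described in this subsection can be sketched as a small function. This is only an illustrative reading of the four definitions given in the text; the function name `molecular_subtype` and the boolean encoding of receptor status are assumptions introduced here, not part of the study's pipeline.

```python
def molecular_subtype(er_positive: bool, pr_positive: bool, her2_positive: bool) -> str:
    """Map ER/PR/HER2 IHC status to one of the four subtypes named in the text.

    Hypothetical helper for illustration; the boolean inputs are an assumed encoding.
    """
    # "ER + and/or PR +" defines the luminal groups.
    hormone_receptor_positive = er_positive or pr_positive
    if hormone_receptor_positive:
        return "Luminal B" if her2_positive else "Luminal A"
    # ER - and PR -: split by HER2 status.
    return "HER2" if her2_positive else "TNBC"


if __name__ == "__main__":
    for status in [(True, False, False), (False, True, True),
                   (False, False, True), (False, False, False)]:
        print(status, "->", molecular_subtype(*status))
```

Every combination of the three markers falls into exactly one category, which is why the subtype counts in each set sum to the set size.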
Table 1
General characteristics of the study subjects
Characteristic | Discovery set (n = 48) | Replication set (n = 97) |
Age group | | |
< 50 years | 26 (54.2%) | 48 (49.5%) |
≥ 50 years | 22 (45.8%) | 49 (50.5%) |
Stage | | |
Stage I | 11 (22.9%) | 25 (25.8%) |
Stage II | 37 (77.1%) | 72 (74.2%) |
Stage IIA | 26 | 50 |
Stage IIB | 11 | 22 |
ER status | | |
Positive | 25 (52.1%) | 54 (55.7%) |
Negative | 23 (47.9%) | 43 (44.3%) |
PR status | | |
Positive | 35 (72.9%) | 59 (60.8%) |
Negative | 13 (27.1%) | 38 (39.2%) |
HER2 status | | |
Positive | 11 (22.9%) | 23 (23.7%) |
Negative | 37 (77.1%) | 74 (76.3%) |
Subtype | | |
Luminal A | 29 (60.4%) | 53 (54.6%) |
Luminal B | 10 (20.8%) | 16 (16.5%) |
HER2 | 1 (2.1%) | 7 (7.2%) |
TNBC | 8 (16.7%) | 21 (21.6%) |
Array-CGH and data processing
For array-CGH analysis, 30K whole-genome human oligoarrays (Human OneArray™; Phalanx Biotech, Palo Alto, CA) were used. Oligoarray-CGH was performed as described elsewhere [20]. In brief, 1 μg of genomic DNA from tumor tissue was labeled with Cy3-dCTP. The reference DNA was labeled with Cy5-dCTP (GeneChem, Daejon, Korea). Dye-labeled DNA was purified with BioPrime spin columns (Invitrogen, Carlsbad, CA) and precipitated with 100 μg of human Cot-1 DNA (ConnectaGen, Seoul, Korea). The labeled DNA pellet was dissolved in 50 μl of DIG hybridization buffer (Roche, Mannheim, Germany), to which 600 μg of yeast tRNA (Invitrogen) was added. The labeled DNA solution was applied to the array and incubated for 48 hours at 37°C in a MAUI hybridization machine (BioMicro, Salt Lake City, UT). After the slides were washed, the arrays were scanned using a GenePix 4000B scanner (Axon Instruments, Sunnyvale, CA), and feature extraction was performed using GenePix Pro 6.0. Normalization and re-alignment of the raw array-CGH data were performed using the web-based array-CGH analysis interface ArrayCyGHt [21]. A print-tip Loess normalization method was used, and each probe was mapped according to its genomic location in the UCSC genome browser (Human NCBI36/hg18). In total, 24,107 of the initial 26,616 probes were processed. Array-CGH data for all 48 cancers are available through GEO (accession no. GSE37839).
Detection of recurrent copy number alterations
The rank-segmentation statistical algorithm in NEXUS software v3.1 (BioDiscovery Inc., El Segundo, CA) was used to define the CNAs of each sample. To optimize the algorithmic parameters for calling CNAs, 11 independent normal-to-normal hybridizations were performed (10 self-to-self and 1 male-to-female hybridization). The parameters for defining CNAs were as follows: significance threshold = 5.0E-4; maximum contiguous probe spacing (Kbp) = 1000; minimum number of contiguous probes per CNA segment = 5; threshold of signal intensity ratio >0.2 on the log2 scale for gains and < −0.3 on the log2 scale for losses. After CNAs were defined, an RAR was defined as the chromosomal segment covering overlapping CNAs that appeared in at least 30% of the samples with P < 0.05 in the discovery set (NEXUS software v3.1). High-level amplification (amplification hereafter) was defined as a probe signal intensity ratio of 1.5 or higher on the log2 scale. Likewise, a homozygous deletion (HD) was defined as a ratio of −1.5 or lower on the log2 scale.
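To make the calling thresholds concrete, the following minimal sketch classifies an already-segmented log2 ratio and checks the 30% recurrence criterion. It is not the NEXUS rank-segmentation algorithm: segmentation, the significance threshold, the probe-count and spacing parameters, and the P < 0.05 test are all omitted. The function names and the per-sample list of calls are assumptions made for illustration.

```python
from typing import List

# Thresholds on the log2 scale, as stated in the text.
GAIN_THRESHOLD = 0.2
LOSS_THRESHOLD = -0.3
AMPLIFICATION_THRESHOLD = 1.5
HOMOZYGOUS_DELETION_THRESHOLD = -1.5


def classify_log2_ratio(log2_ratio: float) -> str:
    """Label one segment's log2 ratio (hypothetical helper, not NEXUS)."""
    if log2_ratio >= AMPLIFICATION_THRESHOLD:
        return "amplification"
    if log2_ratio <= HOMOZYGOUS_DELETION_THRESHOLD:
        return "homozygous_deletion"
    if log2_ratio > GAIN_THRESHOLD:
        return "gain"
    if log2_ratio < LOSS_THRESHOLD:
        return "loss"
    return "neutral"


def meets_recurrence_cutoff(calls: List[str], direction: str,
                            min_fraction: float = 0.30) -> bool:
    """Check whether a region is altered in at least `min_fraction` of samples.

    `calls` holds one label per sample for the same region. Amplifications are
    counted as gains and homozygous deletions as losses (an assumption here).
    """
    if direction == "gain":
        altered = {"gain", "amplification"}
    elif direction == "loss":
        altered = {"loss", "homozygous_deletion"}
    else:
        raise ValueError("direction must be 'gain' or 'loss'")
    n_altered = sum(1 for call in calls if call in altered)
    return n_altered / len(calls) >= min_fraction
```

With 48 discovery samples, the 30% cutoff corresponds to at least 15 samples carrying an overlapping alteration in the same direction.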
Genomic quantitative PCR analysis
qPCR validation of the significant RARs was performed using genomic DNA extracted from the FFPE samples of 97 EBCs. As a diploid internal control, a genomic region on chromosome 13 (13q32.1) that showed no genomic alteration in the array-CGH data was used. Details, including primer information for the targets and the diploid control locus, are available in Additional file 1. Genomic qPCR was performed using the Mx3000P qPCR system (Stratagene, La Jolla, CA), as described elsewhere [22]. In brief, a 10-μl real-time qPCR mixture containing 10 ng of genomic DNA, SYBR Premix Ex Taq II™ (TaKaRa Bio, Japan), 1× ROX, and 5 pmol of each primer was prepared. Thermal cycling conditions consisted of one cycle of 30 sec at 95°C followed by 45 cycles of 5 sec at 95°C, 10 sec at 55–61°C, and 20 sec at 72°C. All qPCR experiments were repeated three times, and relative quantification was performed by the ΔΔCT method. When the mean genomic dosage ratio of the region between the target sample and female control DNA (ΔΔCT of target and internal control) was above two, the region was defined as a copy number gain.
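As a rough illustration of the relative quantification step, the sketch below computes a genomic dosage ratio by the standard ΔΔCT calculation, assuming approximately 100% amplification efficiency for both the target and the 13q32.1 control amplicons. The function name and argument layout are hypothetical, and the CT values in the usage example are invented. How the "above two" cutoff in the text maps onto this ratio is not spelled out here, so the sketch stops at the ratio.

```python
from statistics import mean
from typing import Sequence


def relative_dosage_ratio(ct_target_tumor: Sequence[float],
                          ct_control_tumor: Sequence[float],
                          ct_target_reference: Sequence[float],
                          ct_control_reference: Sequence[float]) -> float:
    """Standard delta-delta-CT ratio from replicate CT values.

    Assumes ~100% PCR efficiency (a doubling per cycle); this is an
    assumption of the sketch, not a statement about the study's assays.
    """
    # Delta CT: target locus minus diploid internal control, within each DNA.
    delta_ct_tumor = mean(ct_target_tumor) - mean(ct_control_tumor)
    delta_ct_reference = mean(ct_target_reference) - mean(ct_control_reference)
    # Delta-delta CT: tumor relative to the normal reference DNA.
    delta_delta_ct = delta_ct_tumor - delta_ct_reference
    return 2.0 ** (-delta_delta_ct)


if __name__ == "__main__":
    # Invented triplicate CT values: the target amplifies one cycle earlier
    # in the tumor, which corresponds to a twofold dosage ratio.
    ratio = relative_dosage_ratio([24.0, 24.1, 23.9], [25.0, 25.0, 25.0],
                                  [25.0, 25.1, 24.9], [25.0, 25.0, 25.0])
    print(round(ratio, 2))
```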
Association rule mining
Association rule mining is used to find interesting relations among variables in a database. In bioinformatics, the information metric has commonly been used to assess the degree of “surprise” when a pattern actually occurs [23]. We used the CPAR (Classification based on Predictive Association Rules) algorithm [24], adopting the information metric, as implemented by the LUCS-KDD research group (http://www.csc.liv.ac.uk/~frans/KDD/Software). In CPAR, Laplace accuracy is used to measure the accuracy of rules. Given a rule $r$, Laplace accuracy is defined as follows:
$$\mathrm{LaplaceAccuracy}(r) = \frac{N_c + 1}{N_{total} + m} \qquad (1)$$
where $m$ is the number of classes and $N_{total}$ is the total number of examples that satisfy the rule’s body, among which $N_c$ examples belong to the predicted class, $C$, of the rule.
Through the CPAR algorithm, association rules were generated between RAR markers and the survival status in the discovery set. Each RAR marker was coded as 0 or 1 based on the copy number status; 0 indicates no copy number variation and 1 indicates copy number change in the marker region. Likewise, the survival status was coded as 0 (dead) or 1 (alive).
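The scoring of a single rule can be illustrated with a short sketch. It assumes the standard CPAR definition of Laplace accuracy, (N_c + 1) / (N_total + m), and the 0/1 coding described above; it does not reproduce CPAR's rule generation or the LUCS-KDD implementation. The function names, the data layout, and the toy samples are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, int], int]  # (RAR marker coding, survival status)


def laplace_accuracy(n_c: int, n_total: int, m: int) -> float:
    """Laplace accuracy of a rule: (N_c + 1) / (N_total + m)."""
    return (n_c + 1) / (n_total + m)


def rule_laplace_accuracy(samples: List[Sample], body: Dict[str, int],
                          predicted_class: int, m: int = 2) -> float:
    """Score the rule `body -> predicted_class` on binary-coded samples.

    m defaults to 2 because survival status has two classes (0 dead, 1 alive).
    """
    # Survival statuses of the samples that satisfy every condition in the body.
    covered = [status for markers, status in samples
               if all(markers.get(name) == value for name, value in body.items())]
    n_total = len(covered)
    n_c = sum(1 for status in covered if status == predicted_class)
    return laplace_accuracy(n_c, n_total, m)


if __name__ == "__main__":
    # Invented toy data, not study results.
    toy = [({"RAR-G12": 1, "RAR-G13": 1}, 0),
           ({"RAR-G12": 1, "RAR-G13": 1}, 0),
           ({"RAR-G12": 1, "RAR-G13": 1}, 1),
           ({"RAR-G12": 0, "RAR-G13": 1}, 1),
           ({"RAR-G12": 0, "RAR-G13": 0}, 1)]
    print(rule_laplace_accuracy(toy, {"RAR-G12": 1, "RAR-G13": 1}, predicted_class=0))
```

The added 1 and m act as a smoothing term, so a rule covering very few samples cannot reach an accuracy of 1.0 on the strength of one or two cases.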
Statistical analysis
To examine the clinicopathologic implications of RARs, five clinical parameters were used as categorical variables: age at diagnosis (<50 vs. ≥50 years), stage, ER status, PR status, and HER2 status. Differential distributions of RARs in each category were tested by a two-sided Fisher’s exact test. The false discovery rate (FDR) was used for multiple comparison correction. In univariate survival analysis, cumulative overall survival was calculated according to the Kaplan-Meier method. Differences in survival curves were assessed with the log-rank test. Cox regression was performed to identify RARs associated with prognosis after adjusting for age, stage, ER, PR, and HER2. SAS version 9.1 (SAS Institute Inc., NC) was used and P-values less than 0.05 were considered significant in all statistical analyses.
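The association testing step can be sketched in plain Python as a two-sided Fisher’s exact test on a 2×2 table followed by an FDR adjustment. The study used SAS, so this is only an illustration of the technique. The text does not name the FDR procedure; the Benjamini-Hochberg step-up method is assumed here, and the function names are hypothetical.

```python
from math import comb
from typing import List


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def probability(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = probability(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(probability(x) for x in range(lowest, highest + 1)
                  if probability(x) <= p_observed * (1 + 1e-7))
    return min(1.0, p_value)


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (assumed FDR method)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

In this setting, each table would cross one RAR (altered vs. not altered) with one clinical category such as ER-positive vs. ER-negative, and the adjustment would be applied across all RAR-by-parameter tests.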
Discussion
In this study, we analyzed the genome-wide copy number alteration profiles of 48 EBCs using 30K oligoarray-CGH. We delineated RARs under the assumption that commonly altered chromosomal segments in EBCs may contain driver genes essential for the initiation or early progression of breast tumorigenesis. It is also possible that some RARs have prognostic implications in EBC. To explore this possibility, we defined RARs in a discovery set of EBCs and examined their associations with prognosis. A total of 23 RARs were defined, and all of them overlapped at least one of the recently reported CNAs in breast cancer, including EBC, which supports the reliability of our data [14-17]. The nature of the RARs (gain or loss) was also largely consistent with previous observations. For example, RAR-L3 (8p21.2) and RAR-L5 (17p12), where PPP2R2A and MAP2K4 are located, respectively, and RAR-G13 (17q12), where ERBB2 is located, were consistently detected in a recent large-scale breast cancer genetic subgroup study [14]. In particular, 21 of the 23 RARs overlap recurrent copy number alterations identified in EBCs (stage I and II) from whites, blacks, and Hispanics in a recent study by Thompson et al. [15]. However, the recurrent gain on 14q11.2 in Thompson et al.’s report was not detected in our array-CGH analysis. This difference, which requires further investigation, may be due to a Korean EBC-specific feature or to the probe design of the array platform used in this study. We validated the association of RARs with prognosis in the larger independent replication set of 97 EBCs. In addition to RARs, changes of entire chromosomal arms were also commonly observed (> 30% of the samples) in this study (Additional file 1: Table S7), and these are largely consistent with previous observations in breast cancer of diverse ethnic groups [11, 25].
Of the RARs identified in this study, 15 were commonly detected in both stages I and II, which suggests that these copy number alterations were acquired at an earlier stage of EBC. In particular, 6 of the 15 earlier-event RARs, RAR-G2 (1q21.2-q21.3), RAR-G7 (8q24.13), RAR-G8 (8q24.13-21), RAR-G9 (8q24.3), RAR-G10 (8q24.3), and RAR-L1 (8p23.1-p22), appeared in over 50% of cases. Some genes located in these six RARs have been suggested to be involved in early breast tumorigenesis. For instance, the PTK2 gene located on 8q24.3 (RAR-G9) is a member of the focal adhesion kinase (FAK) subfamily of protein tyrosine kinases. Overexpression of FAK was suggested to be an early event in DCIS tumorigenesis [26]. Although the protein levels of potential cancer-related genes in these six highly common loci were not examined in this study, our data suggest that the six alterations may be commonly occurring genetic events in the initial stage of breast cancer development. Based on our findings, two RARs on 17q25 can be considered relatively late events in breast tumorigenesis, since the RARs on 17q25 (RAR-G14 and -G15) were scarcely observed in stage I (<10%) but were quite frequent (>45%) in stage II. Interestingly, a copy number gain on 17q25.3 was reported to be one of the recurrence-associated chromosomal alterations in a previous report on Korean women with breast cancer [27].
When we assessed the prognostic implications of RARs, RAR-G12 (16p11.2) and RAR-G13 (17q12) were significantly associated with poorer prognosis in the discovery set. A number of cancer-related genes are located in these two RARs: NUPR1, MVP, MAPK3, FUS, and PYCARD are located in RAR-G12, while ERBB2, GRB7, and PPP1R1B are located in RAR-G13. Among these potential cancer-related genes, Nupr1 is known to interact with various molecules involved in cell cycle regulation, programmed cell death, and transcription activity. For these reasons, Nupr1 is a potential molecular target in the development of anticancer drugs [28]. Although the NUPR1 gene has been suggested to be responsible for the growth and progression of many cancers, including breast cancer [29, 30], the prognostic implications of the NUPR1 gene in EBC have not been reported. Amplification and overexpression of the ERBB2 oncogene in RAR-G13 (17q12) is known to be associated with high recurrence rates and reduced breast cancer survival [31-33]. The frequent copy number gains (38%) and amplification (29%) of ERBB2 in this study are consistent with previous studies on breast cancer [11, 34].
In a replication analysis by genomic qPCR, the prognostic implication of ERBB2 gain (RAR-G13) was successfully replicated in the larger replication set, but that of the NUPR1 gain (RAR-G12) was not. We hypothesized that the NUPR1 gain itself might not be an influential alteration, but that EBC prognosis is more strongly affected by the co-occurrence of NUPR1 with a strong driver mutation (ERBB2). Association-rule mining results also supported the predictive power of their co-occurrence for poor prognosis. As expected, when these two RARs were combined and used as an independent factor, the hazard ratio increased in an additive manner. A stronger significance level was also achieved on Cox regression analysis compared with when only ERBB2 was used, which may reflect the multigenic nature of cancer.
In this study, 191 high-level CNAs (158 amplifications and 33 HDs) were detected by array-CGH, and 5 of them were detected in more than 10% of the samples. A substantial number of the high-level CNAs overlap entries in the Database of Genomic Variants (DGV, http://projects.tcag.ca/variation/) and the copy number variants (CNVs) identified in Koreans [35]. Although the limitations of DGV in terms of accuracy and overestimation are well known, we cannot rule out the possibility that some high-level CNAs identified in this study are CNVs, because we used DNA from a single individual as a universal reference. All five of the common amplifications (observed in >10% of the samples) also overlap CNV loci in DGV. However, four of them, all except one very small (0.02 Mb) amplification on 16p11.2, were reported to be amplifications or copy number gains in breast cancer by recent high-resolution array-CGH analyses [15-17, 36], suggesting that these four common amplifications are likely CNAs. The amplification frequency of ERBB2 in this study was largely similar to that in previous studies, including those of Koreans [37-39].
There are several limitations in this study. First, due to the limited sample size of each subtype, we could not properly assess the prognostic implications of the RARs within the four molecular subtypes. Second, we did not examine the molecular mechanisms of the synergistic effect of the ERBB2-NUPR1 co-occurrence. Further studies will be required to delineate the roles of NUPR1 gain and the simultaneous ERBB2-NUPR1 gains in early breast tumorigenesis. Third, we used a single reference DNA in this study, so it is possible that some of the CNAs identified in this study are CNVs, especially small CNAs overlapping previously reported CNVs.
Competing interests
The authors declare that they have no competing interests.
Authors’ contribution
SHJ executed most experiments and drafted the manuscript. AWL collected the patient specimens and was involved in data analysis. SHY participated in the design of this study, performed statistical analysis, and wrote the manuscript. HJH performed association rule mining and drafted the manuscript. CC was involved in data analysis and preparing the figures. YJC proposed this study, organized the research team, interpreted all the data, and participated in writing the manuscript. All authors read and approved the final manuscript.