Introduction
Breast cancer heterogeneity may obscure etiologic risk factor associations if tumor subtypes are inadequately or incorrectly classified [1]. Etiologic studies generally group breast cancer into two or more protein-based subtypes using immunohistochemical expression of estrogen receptor (ER), progesterone receptor (PR), and HER2 [2]. On the other hand, efforts to classify breast cancer into four genomic-intrinsic subtypes have focused on determining targeted therapies and cancer-specific clinical outcomes [3]. However, for cancer prevention efforts, optimizing subtype classification for etiologic subtypes is key to understanding risk factor associations.
There is emerging evidence, based on bimodal age frequency distributions at diagnosis, that breast cancer can be divided into just two etiologically distinct subtypes [4]. Breast cancer bimodality has been observed across categories of ER status, tumor characteristics and histologic subtypes [5]. Bimodality has also been observed in different populations, for example, in both black and white breast cancer cases in the US [6] and South Africa [7]. However, prior evidence for breast cancer bimodality has been based on national cancer registries, which lack detailed molecular and genomic data. No studies, to our knowledge, have comprehensively explored evidence for bimodal age distribution at diagnosis across quantitative protein-based (i.e., percent ER-positivity) or RNA-based (i.e., ESR1 and PAM50) tumor characteristics.
Using data from the Carolina Breast Cancer Study, we visualized age distributions at diagnosis and applied two-component mixture models across categories of breast cancer cases defined by molecular and genomic characteristics. We also sought to identify molecular or genomic features that could separate etiologically distinct breast cancers into single or unimodal age distributions at diagnosis.
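As a rough illustration of what comparing a single density with a two-component mixture involves, the sketch below fits both models to a list of ages and reports the difference in AIC. It is a minimal sketch under stated assumptions, not the study's implementation: the normal component densities, the expectation-maximization fitting routine, the quartile-based starting values, the `min_sigma` floor, and all function names are illustrative choices, and the ages are simulated rather than taken from the Carolina Breast Cancer Study.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_single_normal(ages):
    """Maximum-likelihood fit of one normal density (2 free parameters)."""
    n = len(ages)
    mu = sum(ages) / n
    sigma = math.sqrt(sum((a - mu) ** 2 for a in ages) / n)
    loglik = sum(math.log(max(normal_pdf(a, mu, sigma), 1e-300)) for a in ages)
    return {"mu": mu, "sigma": sigma, "loglik": loglik, "n_params": 2}


def fit_two_component_mixture(ages, n_iter=300, min_sigma=1.0):
    """EM fit of a two-component normal mixture (5 free parameters).

    Assumption: normal components, started at the lower and upper quartiles.
    """
    n = len(ages)
    ordered = sorted(ages)
    mean = sum(ages) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in ages) / n)
    mu1, mu2 = ordered[n // 4], ordered[(3 * n) // 4]
    s1 = s2 = max(sd, min_sigma)
    w = 0.5  # mixing proportion of the first (early-onset) component
    for _ in range(n_iter):
        # E-step: posterior probability that each case belongs to component 1.
        resp = []
        for a in ages:
            p1 = w * normal_pdf(a, mu1, s1)
            p2 = (1.0 - w) * normal_pdf(a, mu2, s2)
            resp.append(p1 / max(p1 + p2, 1e-300))
        # M-step: update weight, means and standard deviations.
        n1 = min(max(sum(resp), 1e-9), n - 1e-9)
        n2 = n - n1
        mu1 = sum(r * a for r, a in zip(resp, ages)) / n1
        mu2 = sum((1.0 - r) * a for r, a in zip(resp, ages)) / n2
        s1 = max(math.sqrt(sum(r * (a - mu1) ** 2 for r, a in zip(resp, ages)) / n1), min_sigma)
        s2 = max(math.sqrt(sum((1.0 - r) * (a - mu2) ** 2 for r, a in zip(resp, ages)) / n2), min_sigma)
        w = n1 / n
    if mu1 > mu2:  # report the younger mode first
        mu1, mu2, s1, s2, w = mu2, mu1, s2, s1, 1.0 - w
    loglik = sum(
        math.log(max(w * normal_pdf(a, mu1, s1) + (1.0 - w) * normal_pdf(a, mu2, s2), 1e-300))
        for a in ages
    )
    return {"weight_early": w, "mu_early": mu1, "sigma_early": s1,
            "mu_late": mu2, "sigma_late": s2, "loglik": loglik, "n_params": 5}


def aic(fit):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * fit["n_params"] - 2 * fit["loglik"]


def delta_aic(ages):
    """AIC(single) - AIC(mixture); larger values favor the two-component model."""
    return aic(fit_single_normal(ages)) - aic(fit_two_component_mixture(ages))


def simulate_ages(seed, n_early, n_late, mode_early=45.0, mode_late=65.0, sd=5.0):
    """Simulated (hypothetical) ages drawn from two normal components."""
    rng = random.Random(seed)
    early = [rng.gauss(mode_early, sd) for _ in range(n_early)]
    late = [rng.gauss(mode_late, sd) for _ in range(n_late)]
    return early + late


if __name__ == "__main__":
    ages = simulate_ages(seed=0, n_early=500, n_late=500)
    print(fit_two_component_mixture(ages))
    print("Delta AIC:", delta_aic(ages))
```

Fitting within each molecular category (for example, each PAM50 subtype) would amount to calling `delta_aic` on the ages of the cases in that category. Small categories give the comparison less power, so a modest ΔAIC there is weaker evidence than the same value in a large category.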
Discussion
The identification of at least four distinct intrinsic breast cancer subtypes [24] has guided the development of targeted therapy and contributed to improved breast cancer survival rates. However, it has been posited that breast cancer is composed of just two etiologically distinct groups [4, 25], with ER status currently serving as the most widely used surrogate of these two subtypes [26]. Though not optimized for this purpose, classifying tumors by ER status has advanced our understanding of breast cancer risk factors. For example, increasing parity is inversely associated with risk of ER-positive breast cancer but positively associated with risk of ER-negative breast cancer, an effect that can be partially offset by breastfeeding [27, 28]. Under this proposed model, breast cancer is a two-component mixture of ER-positive and ER-negative tumor populations [4], with differences in quantitative levels of ER expression reflecting enrichment for one or the other population. Building on this hypothesis, our work in the Carolina Breast Cancer Study shows that breast cancer bimodality is a robust characteristic observed across molecular and genomic tumor features.
Prior research using publicly available registry data from the US [5], as well as data from Europe [29], Africa [7, 30], and Asia [31], has established bimodal age at diagnosis as a universal feature of female breast cancer. Breast cancer bimodality has also been observed within categories defined by ER status [4], high-risk and low-risk tumor characteristics [32] and histologic subtype [5]. A notable exception to this bimodal age distribution at diagnosis is medullary carcinoma [5], a rare early-onset sporadic breast cancer that is linked to ER-negative and triple negative cancers, Basal-like tumor features [33], and the BRCA1 mutation [34]. While developments in molecular and genomic tumor profiling technologies have advanced the field of breast cancer subtyping for prognosis and prediction, national cancer registries are limited to tumor characteristics reported in medical records and therefore lack these data. In the present study, we used quantitative ER expression and RNA data from the Carolina Breast Cancer Study to explore evidence for bimodality within groups defined by molecular and genomic features. We report that although certain molecular and genomic tumor characteristics enriched for either early-onset or late-onset breast cancer, we were unable to separate early-onset from late-onset breast cancer using existing molecular or genomic classifications, or any combinations thereof.
Interpretation of quantitative immunohistochemistry-based ER levels has been subject to some controversy. Replacement of the radioligand-binding assay with immunohistochemistry for measuring ER status was accompanied by observations that ER expression was bimodally distributed [35]. Rimm and others have argued that the bimodal distribution of ER expression is an artifact of immunohistochemistry staining methods, which have been optimized to maximize the sensitivity of the assay [36, 37]. Indeed, we have observed a greater dynamic range of ESR1 expression compared to that of immunohistochemistry-based ER expression, which becomes saturated at higher levels of ESR1 [38]. However, several studies have since shown evidence for bimodal distribution not only of quantitative immunohistochemistry-based ER expression [39] but also of ESR1 levels [39, 40], which are not subject to concerns regarding immunohistochemistry methodology. Herein, we build on evidence for breast cancer bimodality by showing bimodal age-at-incidence across categories of immunohistochemistry-based ER expression, ESR1 levels, and PAM50 intrinsic subtype. As such, this manuscript bolsters evidence for bimodal age distribution at diagnosis as a universal characteristic of female breast cancer.
Breast cancer bimodality is consistent with tumors being derived from two distinct progenitor cell types, basal/myoepithelial versus luminal [4]. Large-scale genomic analyses have recently challenged the classification of cancers according to their tissue-of-origin. Using multi-platform genomic analyses, Hoadley et al. found that although most tumor types were classified by tissue-of-origin, several distinct cancer types converged into common subtypes regardless of tissue-of-origin, while others diverged into multiple subtypes within the tissue-of-origin classifications [25]. Breast cancer provided one of the most striking examples of this divergence, with Luminal/HER2-enriched and Basal-like breast cancers forming separate clusters as distinct from each other as from other tissue-of-origin cancer types (e.g., lung). Moreover, this integrated analysis revealed marked molecular differences between epithelial tumors arising from basal cell versus secretory cell types, suggesting that cell type-of-origin dominates molecular taxonomy of breast and other cancer types [25]. Shared etiology across cancers with different tissue-of-origin but shared cell type-of-origin (e.g., smoking as a shared risk factor for squamous bladder, head and neck and non-small cell lung cancers) may highlight the importance of classifying breast cancer according to the cell type-of-origin for understanding breast cancer etiology. Future studies should pursue the identification of molecular characteristics that can separate etiologically distinct subtypes of breast cancer.
Our findings should be considered in the context of several limitations. First, though a population-based study, the Carolina Breast Cancer Study oversampled for young and African American women. Our analysis did not account for this sampling scheme and thus our population distribution is shifted toward younger ages relative to national distributions of breast cancer incidence. This is highlighted by our finding that modal ages for early and late-onset distributions lie at ages 45 and 65, whereas data from SEER breast cancer cases show that modes are closer to 50 and 70 years of age [5]. Restricting SEER data to the age range of the Carolina Breast Cancer Study produced similar bimodal age distributions at diagnosis (data not shown), suggesting that the slightly younger modes in the Carolina Breast Cancer Study may be due to the restricted age range at breast cancer diagnosis in the Carolina Breast Cancer Study (20–74) versus SEER (currently 10–117). However, rather than the absolute modal age, which depends on the age distribution in the underlying population, the key attribute of these modes is that they are stable across molecular categories. Second, lower numbers of cases, particularly in ER-borderline (ER 1–10%) and HER2-enriched categories, may have limited our ability to discriminate between single-density and two-component mixture models, as evidenced by ΔAIC values between 4 and 10. However, ΔAIC values in this range still provide support for a bimodal age distribution at diagnosis for these subgroups, albeit with slightly lower certainty than when ΔAIC is greater than 10 [22]. Third, although we had insufficient sample size to perform race-stratified analysis for each of the molecular subtypes, we were able to perform race-stratified analyses both overall and by ER status. Indeed, our race-stratified results are in agreement with findings from SEER analyses [5, 6], showing a larger proportion of early-onset cases among black women. Strengths of this study include the large number of African American women, a racial group disproportionately affected by high-risk breast cancer [12], as well as access to high quality molecular and genomic data for a large number of breast cancer cases.
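For readers less familiar with the model comparison referred to in the second limitation above, the quantities involved can be written out as follows. This is a sketch under assumptions rather than a restatement of the study's methods: the component densities f_1 and f_2 are left generic, k denotes the number of free parameters of a model, and the sign convention for ΔAIC is chosen so that larger values favor the mixture, which matches the way ΔAIC is interpreted in the text.

```latex
% Two-component mixture density for age at diagnosis x
% (assumption: \pi is the proportion of early-onset cases; f_1, f_2 are generic component densities)
f(x) = \pi \, f_{1}(x \mid \theta_{1}) + (1 - \pi) \, f_{2}(x \mid \theta_{2}), \qquad 0 \le \pi \le 1

% Akaike information criterion for a model with k free parameters and maximized likelihood \hat{L}
\mathrm{AIC} = 2k - 2 \ln \hat{L}

% Comparison of the single-density and mixture models (assumed sign convention)
\Delta\mathrm{AIC} = \mathrm{AIC}_{\text{single}} - \mathrm{AIC}_{\text{mixture}}
```

Under this convention, the ranges discussed above correspond to ΔAIC between 4 and 10 (support for bimodality, with slightly lower certainty) and ΔAIC greater than 10 (stronger support).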