Background
Percentage mammographic density (PD), the proportion of the breast that appears white in a mammogram and reflects the relative amount of fibroglandular tissue, is a well-established risk factor for breast cancer [
1]. PD is the most predictive marker of breast cancer for women after familial causes and polygenic markers when adjusted for age and body mass index (BMI) [
2]. For area-based PD, fibroglandular and fatty tissues may be segmented by thresholding, and this is usually achieved by a semi-automatic approach where the threshold is chosen by the investigator using software such as Cumulus [
3]. Recent evidence suggests that raising the conventional brightness threshold might better predict breast cancer risk: this has been demonstrated in Korean women with “for presentation” (processed) full-field digital mammograms [
4,
5], and Australian women with digitised film mammograms [
6].
In addition to subjective visual assessment, another approach for PD estimation using digital mammograms is volumetric density measurement via a fully automated system. Commercial volumetric PD systems including Volpara [
7] and Quantra [
8] have shown good agreement with semi-automated thresholding and an association with risk of breast cancer [
9]. In Volpara, pixel values are calibrated so that the height (amount) of dense tissue at any given point in a mammogram can be estimated, and based on these heights and the estimated breast volume, volumetric density can be determined. By default all dense tissue, regardless of the height at any pixel position, is included to compute the dense volume. However, there appear to be no published studies that have looked at whether applying a threshold to dense tissue heights, effectively excluding some less dense tissue as well as possibly thin sheets or strands of tissue that have similar attenuation coefficients to glandular tissue, could result in better prediction of breast cancer risk.
The aim of this paper is to investigate whether volumetric or area-based PD can be adjusted by varying dense tissue height thresholds so as to better predict breast cancer risk. In previous research [
4‐
6] thresholding was based on pixel brightness from visual assessment, whereas here thresholds on dense tissue heights from volumetric density estimation are used. This allows the calculation of breast density and the application of a chosen threshold to be fully automated (i.e. without manual visual assessment) on digital mammograms. In addition, our thresholding analysis is based on Western women with digital raw mammograms, and to our knowledge this has not been previously examined. An important benefit of using raw images compared to processed images is that it could reduce the discrepancies between different machines due to manufacturers’ proprietary processing algorithms.
Methods
Setting and study design
Two case-control studies were designed as a part of the Predicting Risk Of breast Cancer At Screening (PROCAS) cohort, in Manchester, UK [
10]. The first case-control study had 317 cases and 947 controls while the second had 318 cases and 935 controls. A detailed description of the data in the two studies has been reported previously [
11,
12] (the sample used for analysis differs slightly; see
Appendix). Briefly, in the first case-control study, cases comprised women with cancer detected at first screen on entry into the PROCAS cohort, and we refer to this dataset as study 1. As in our previous study [
11], the craniocaudal (CC) views of the contralateral breast for cases and the left breast for controls were used. In the second case-control study, each woman had a normal screening mammogram (no cancer detected) on entry into the PROCAS cohort, but an interval or screen-detected cancer arose subsequently, and we refer to this dataset as study 2. Similar to our previous study [
11], the CC views of the contralateral breast for cases and the same side for controls were used. The mammograms were obtained on average three years prior to diagnosis of breast cancer and from the same cohort as study 1. In both studies women were matched approximately 3:1 (controls vs cases) by age, BMI, hormone replacement therapy (HRT) use and menopausal status.
Mammograms
All digital raw (“for processing”) mammograms were acquired using a GE Senographe system. Volumetric density, and in particular the height of dense tissue at each point in the mammogram, was assessed using Volpara 1.5.2 (Volpara Health Technologies, Wellington, New Zealand).
Density measurements
One output from the Volpara software is a “density map”, which contains data on dense tissue height at every point in the mammogram, based on an analysis of pixel values and imaging parameters. Whilst no thresholding is applied in the default output of the software, different threshold values can be tested such that only densities with a height greater than a certain threshold value are included when computing total dense volume. For instance, when a threshold level of 5 mm is used, only those density heights greater than 5 mm are used to calculate the total dense volume. We refer to this approach to computing PD as volumetric PD (VPD) in this paper, and specifically to the default volumetric PD output by Volpara as VPD0 (i.e. the threshold level is 0 mm).
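The thresholded VPD described above can be sketched as follows. This is a minimal illustration, not the Volpara implementation: the function name `thresholded_vpd`, the nested-list density map, the breast mask, the pixel area and the breast volume argument are all hypothetical inputs, and the sketch assumes that the thresholded dense volume is divided by the unthresholded breast volume.

```python
def thresholded_vpd(height_map_mm, breast_mask, pixel_area_mm2,
                    breast_volume_mm3, threshold_mm=0.0):
    """Percent volumetric density using only dense heights above threshold_mm.

    Assumption: dense volume is the sum of (height x pixel area) over breast
    pixels whose dense tissue height exceeds the threshold.
    """
    dense_volume_mm3 = 0.0
    for heights_row, mask_row in zip(height_map_mm, breast_mask):
        for height, in_breast in zip(heights_row, mask_row):
            if in_breast and height > threshold_mm:
                dense_volume_mm3 += height * pixel_area_mm2
    return 100.0 * dense_volume_mm3 / breast_volume_mm3


# Toy 2 x 3 density map (heights in mm); every pixel lies inside the breast.
heights = [[0.0, 2.0, 6.0],
           [8.0, 4.0, 10.0]]
mask = [[True] * 3, [True] * 3]

# Threshold 0 mm corresponds to VPD0, threshold 5 mm to VPD5.
vpd0 = thresholded_vpd(heights, mask, pixel_area_mm2=1.0,
                       breast_volume_mm3=300.0)
vpd5 = thresholded_vpd(heights, mask, pixel_area_mm2=1.0,
                       breast_volume_mm3=300.0, threshold_mm=5.0)
print(vpd0, vpd5)
```

In this toy example the 2 mm and 4 mm heights are dropped at the 5 mm threshold, so the thresholded value is lower than VPD0.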
The aforementioned approach focuses on percentage of volumetric density as the end point. An alternative approach is to look at the two-dimensional area of dense tissue within the breast: here this is defined as the number of pixels with dense tissue heights greater than a chosen threshold. This is then divided by the total number of pixels in the breast and expressed as a percentage area of dense tissue. As with the volumetric approach, a series of threshold values can be considered. We refer to this as areal PD (APD) in this paper. Note that although APD is an areal measurement, the underlying basis is still volumetric density because dense tissue height (or effectively volume) at each point in the mammogram was used.
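The areal measure can be sketched in the same way. Again this is only an illustration under stated assumptions: `thresholded_apd`, the nested-list density map and the breast mask are hypothetical names and inputs, not part of the Volpara output format.

```python
def thresholded_apd(height_map_mm, breast_mask, threshold_mm=0.0):
    """Percent of breast pixels whose dense tissue height exceeds threshold_mm."""
    breast_pixels = 0
    dense_pixels = 0
    for heights_row, mask_row in zip(height_map_mm, breast_mask):
        for height, in_breast in zip(heights_row, mask_row):
            if not in_breast:
                continue  # pixels outside the breast are ignored entirely
            breast_pixels += 1
            if height > threshold_mm:
                dense_pixels += 1
    if breast_pixels == 0:
        raise ValueError("breast mask is empty")
    return 100.0 * dense_pixels / breast_pixels


# Toy 2 x 4 density map (heights in mm); the last column is outside the breast.
heights = [[0.0, 2.0, 6.0, 1.0],
           [8.0, 4.0, 10.0, 3.0]]
mask = [[True, True, True, False],
        [True, True, True, False]]

# Evaluate APD at a series of thresholds, as in the analysis described above.
apd_by_threshold = {t: thresholded_apd(heights, mask, t) for t in (0, 5, 10)}
print(apd_by_threshold)
```

Unlike the volumetric measure, each pixel above the threshold contributes equally here, regardless of how far its height exceeds the threshold.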
Statistical analysis
PDs at various threshold levels, ranging from 0 to 25 mm, were evaluated using conditional logistic regression, based on the pooled data (study 1 and 2 combined) and on study 1 and 2 separately. The Akaike information criterion (AIC) and matched concordance index (mC) [
13] were calculated to measure prediction performance. AIC is a likelihood-based statistic derived from information theory and is a well-established method for model comparison [
14]. A lower AIC value indicates better model performance. mC is a modification of the concordance index (or area under the receiver operating characteristic curve, AUC) for matched case-control studies, and gives an average concordance index within matched groups. Bootstrapping with 10,000 replications was used to assess whether the difference in mC between models was statistically significant. All
p values are two-sided.
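The following sketch illustrates how these quantities might be computed for a single PD predictor. It is not the analysis code used in this study. It assumes each matched set is stored as a hypothetical `(case_value, [control_values])` pair, fits the conditional logistic coefficient by a plain Newton-Raphson iteration, takes mC to be the mean within-set concordance (ties counted as 0.5), and summarises the bootstrap by a percentile interval; the exact mC weighting and bootstrap test used here may differ.

```python
import math
import random


def cond_loglik(beta, sets):
    """Conditional log-likelihood for 1:M matched sets with one predictor."""
    total = 0.0
    for case_x, controls in sets:
        xs = [case_x] + list(controls)
        m = max(beta * x for x in xs)  # stabilise the log-sum-exp
        total += beta * case_x - (m + math.log(sum(math.exp(beta * x - m) for x in xs)))
    return total


def fit_beta(sets, iters=50):
    """Newton-Raphson estimate of the coefficient (assumes a finite maximum)."""
    beta = 0.0
    for _ in range(iters):
        grad = hess = 0.0
        for case_x, controls in sets:
            xs = [case_x] + list(controls)
            m = max(beta * x for x in xs)
            w = [math.exp(beta * x - m) for x in xs]
            s = sum(w)
            mean = sum(wi * x for wi, x in zip(w, xs)) / s
            grad += case_x - mean
            hess -= sum(wi * (x - mean) ** 2 for wi, x in zip(w, xs)) / s
        if hess == 0.0:
            break
        step = grad / hess
        beta -= step
        if abs(step) < 1e-10:
            break
    return beta


def aic(sets, beta, n_params=1):
    """AIC = 2k - 2 log L; lower values indicate a better model."""
    return 2 * n_params - 2 * cond_loglik(beta, sets)


def matched_concordance(sets):
    """Mean over matched sets of P(case value > control value), ties = 0.5."""
    per_set = []
    for case_x, controls in sets:
        wins = sum(1.0 if case_x > c else 0.5 if case_x == c else 0.0 for c in controls)
        per_set.append(wins / len(controls))
    return sum(per_set) / len(per_set)


def bootstrap_mc_diff(sets_a, sets_b, n_boot=10000, seed=0):
    """Percentile 95% interval for mC(a) - mC(b), resampling matched sets."""
    rng = random.Random(seed)
    n = len(sets_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(matched_concordance([sets_a[i] for i in idx])
                     - matched_concordance([sets_b[i] for i in idx]))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[min(n_boot - 1, int(0.975 * n_boot))]


# Toy data: the same four matched sets measured with two hypothetical PD measures.
toy_a = [(3.0, [1.0, 2.0, 4.0]), (2.0, [1.0, 3.0, 0.5]),
         (5.0, [2.0, 1.0, 3.0]), (1.0, [2.0, 3.0, 1.5])]
toy_b = [(2.0, [1.0, 2.5, 4.0]), (2.0, [1.0, 3.0, 2.5]),
         (4.0, [2.0, 1.0, 3.0]), (1.0, [2.0, 3.0, 0.5])]

beta_hat = fit_beta(toy_a)
print(beta_hat, aic(toy_a, beta_hat), matched_concordance(toy_a))
print(bootstrap_mc_diff(toy_a, toy_b, n_boot=200))
```

Resampling whole matched sets, rather than individual women, keeps the matching structure intact in each bootstrap replicate.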
Because the biological phenotypes of screen-detected and interval cancers differ, a further analysis was conducted to test whether there was any significant difference between screen-detected and interval breast cancers. In addition to the fixed threshold level for every woman, a sensitivity analysis was conducted in which the threshold varied according to a woman’s characteristics based on a linear model, using age, BMI, thickness and total volume of the breast, to explore the difference between varying and fixed thresholds.
Discussion
This paper explores the impact of various levels of density thresholding on the performance of breast cancer prediction using digital mammograms. To achieve this, a range of threshold levels from 0 to 25 mm was tested. For VPD, the threshold was varied so that only dense tissue with heights greater than a given value was included when calculating the total dense volume of the breast. For APD, we counted the number of dense pixels above the threshold level and compared this with the total number of pixels in the breast to derive the areal PD.
Results from both case-control studies and from the pooled data confirm that a threshold level of 5 mm or 6 mm, either volumetric or areal, improves cancer risk prediction compared to original VPD without thresholding. However, the improvement with VPD at the higher thresholds was relatively small. This is not surprising given the strong correlation between VPD0 and VPD5 (Spearman ρ approximately 0.95 in both studies). On the other hand, APD at a threshold of 6 mm (APD6) achieved the best results across all models tested, including VPD and APD at various threshold levels, with ΔAIC = 14.52 for the pooled data compared to VPD0. It is worth noting that APD6 was also highly correlated with VPD0 (Spearman ρ approximately 0.90 in both studies), which is not surprising given that both APD and VPD measure relative dense tissue, albeit from different perspectives. In addition to fixed threshold levels, varying threshold levels were also examined, with the threshold level based on a woman’s characteristics such as age, BMI and breast volume; however, the AIC did not improve, so a fixed threshold is preferred.
We also explored the impact of thresholding by visualising mammograms after areas with less dense tissue were excluded. As illustrated in Fig.
4, thresholding at 5 mm filtered out a large portion of lower-density areas, and was roughly comparable to
Altocumulus presented by previous research [
6]. Further thresholding at higher levels of 10 and 15 mm seems to exclude too much information, so no further improvement in prediction was observed at these levels. It appears that by introducing a suitable threshold level (e.g. 5–6 mm), much of the “noise” present in the mammograms (including fine structures with low attenuation) is removed, resulting in a more predictive PD estimate.
It is also interesting that whilst APD performed much worse than VPD initially when the level of thresholding was low, APD became better than VPD when a threshold level of 4 mm or above was applied, as shown in Fig.
1. This suggests that VPD is relatively insensitive to the “noise” present in mammograms compared to APD, since VPD is essentially a weighted sum (i.e. if all dense tissue heights were the same then VPD would be equivalent to APD). However, after exclusion of the noise component, the weights (dense tissue heights) became less relevant, resulting in APD being a better predictor. This is notable because it suggests that once the density at each point in the mammogram reaches some threshold, the measures are equally informative in terms of cancer risk despite local differences in density.
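The weighted-sum relationship can be written out explicitly. The notation below is introduced only for illustration and is an assumption rather than the definition used by Volpara: h_i is the dense tissue height at pixel i, B is the set of N breast pixels, a is the pixel area, V is the breast volume and t is the threshold.

```latex
\mathrm{VPD}_t = 100 \cdot \frac{a}{V} \sum_{i \in B} h_i \, \mathbf{1}[h_i > t],
\qquad
\mathrm{APD}_t = 100 \cdot \frac{1}{N} \sum_{i \in B} \mathbf{1}[h_i > t].

% If every pixel above the threshold has the same height h:
\mathrm{VPD}_t = 100 \cdot \frac{a h}{V} \sum_{i \in B} \mathbf{1}[h_i > t]
             = \frac{a h N}{V} \, \mathrm{APD}_t .
```

Under this notation, VPD weights each retained pixel by its height, whereas APD weights all retained pixels equally; with a constant height the two differ only by the factor ahN/V for a given breast.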
In terms of the biological plausibility for these findings, the major component of dense breast tissue is stroma [
15], and pathways for breast cancer risk associated with dense tissue are likely to involve the stromal cells, extracellular matrix proteins and the epithelial component. It has also been shown that local density is associated with the location where cancer would develop [
16]. However, the causal route between dense tissue and breast cancer is unknown, and research is ongoing in this important area [
15]. For these reasons we do not speculate further on how this measure of breast density might better capture the biological mechanism for risk due to dense breast tissue. From a measurement accuracy point of view, however, an increased threshold may remove the areas of fat that look slightly grey on the image, which might reduce measurement error. Another possible explanation is that setting an appropriate threshold removes thin sheets or strands of tissue which have similar attenuation coefficients to glandular tissue, and exclusion of this type of tissue might contribute to better density estimation.
Consistent with previous studies [
4‐
6], our results show that once the APD at the optimal threshold level is accounted for, conventional VPD0 no longer adds information; in fact, models with multiple PD measurements (M4 and M5) performed worse than the model with only APD6 as a predictor (M3). While the standardised OR and mC, including those based on the original VPD estimated by Volpara (M1), might seem relatively low compared with some previous studies [
6,
9], the results are broadly consistent with a body of previous research [
4,
17,
18]. For example, Brandt et al. [
17] compared VPD with BI-RADS using a large case-control sample (1911 cases and 4170 controls) and identified a similar discriminatory ability for Volpara VPD (AUC = 0.58, 95% CI 0.56–0.59) as in our study. It is also worth noting that the studies that have directly compared VPD by Volpara with established visual-based assessment such as BI-RADS and Cumulus have shown broadly similar ability for risk prediction [
12,
17,
19], and so differences in predictive ability between studies might be due to other characteristics of the data. It is plausible that the predictive ability of a density measure differs across subgroups of women and types of cancer, such as screen-detected and interval cancers, as demonstrated here and by others [
18]. This means the predictive ability likely depends on the composition of the study population, which may explain some of the differences between studies.
Previous studies have demonstrated that breast density adds accuracy to established breast cancer risk models such as the Tyrer-Cuzick and Gail models [
20,
21], including in combination with single-nucleotide polymorphism risk panels [
22]. It is therefore expected that this study will be of clinical importance, as an improved automated density measure is likely to help identify women who require additional screening and to help devise a risk-based screening/prevention strategy.
The strength of our approach, compared to previous studies [
4‐
6], is that the process is fully automated without any human intervention. Also, by using raw (“for processing”) digital mammograms, differences due to manufacturers’ proprietary processing algorithms are reduced. Our approach, however, would benefit from testing in a wider range of settings. For example, the majority of women in our datasets were white and parous, so it would be important to validate our approach amongst other groups of women. Finally, the mammograms used in our study were acquired on a GE system. Nguyen et al. [
5] found that prediction performance may vary considerably between different mammographic machines based on visual assessment. It would be interesting to further explore the impact of thresholding using different systems in which the image properties may differ, and how the method can be calibrated for mammograms from different systems and the resulting discriminatory power in different settings.