We have constructed a decision model of screening mammography using population-based parameters that allows reasonable prediction of baseline primary events and secondary performance measures. We found that the recall percentage is generally stable with age at about 15% (range 7–24%). Including short-interval follow-up imaging, women face an overall recall of about 18%, with a 14% chance of an immediate false-positive exam. Our decision model predicts that the CDR and PBF generally track the increasing prevalence of breast cancer with age: a 50-year-old woman has 3.4 times the prevalence of a 40-year-old woman, while the predicted CDR and PBF are 3.8 and 3.3 times as high, respectively.
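If the baseline CDR equals sensitivity times prevalence (the identity used for the CADe analysis in this paper), the reported ratios can be decomposed. The short derivation below assumes that identity holds exactly for the model outputs; Se and P are labels introduced here for sensitivity and prevalence at a given age.

```latex
\frac{\mathrm{CDR}_{50}}{\mathrm{CDR}_{40}}
  = \frac{\mathrm{Se}_{50}}{\mathrm{Se}_{40}} \cdot \frac{P_{50}}{P_{40}}
\quad\Longrightarrow\quad
\frac{\mathrm{Se}_{50}}{\mathrm{Se}_{40}} = \frac{3.8}{3.4} \approx 1.12
```

Under this assumption, most of the age gradient in CDR comes from prevalence, with only a modest implied gain in sensitivity between ages 40 and 50.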
Our model predicts that starting screening at age 45 or 55 would considerably improve overall baseline secondary performance measures of mammography, simply because breast cancer is more common and easier to detect as women age, while recall rates remain stable. The increasing CDR means that the absolute benefit from baseline screening increases with age, given the potential life-extending benefit of earlier detection and treatment through delaying or preventing advanced disease. In contrast, the recall rate, which consists mostly of unwanted (collateral) false-positive exams in healthy women (95% at age 50, that is, 1 − PPVS), is relatively stable with age. The ratio of potential benefit to collateral cost therefore gradually increases with age [6], as shown by the increasing PPVS, PPVD and PBF values.
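A minimal sketch of why recall stays flat while PPVS rises with prevalence, assuming fixed sensitivity and specificity. The function name `baseline_measures` and all numeric inputs are hypothetical placeholders, not the model's age-specific parameters.

```python
# Illustrative sketch: recall, CDR and PPVS as prevalence rises while
# sensitivity and specificity are held fixed. All inputs are hypothetical.

def baseline_measures(prevalence: float, sensitivity: float, specificity: float) -> dict:
    """Return per-screen CDR, recall rate and PPVS for a single baseline exam."""
    cdr = sensitivity * prevalence                        # true-positive recalls
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # false-positive recalls
    recall = cdr + false_pos
    ppvs = cdr / recall                                   # share of recalls that are cancers
    return {"cdr": cdr, "recall": recall, "ppvs": ppvs, "fp_share": 1.0 - ppvs}

if __name__ == "__main__":
    # Hypothetical prevalences chosen only to show the trend.
    for prevalence in (0.002, 0.004, 0.008):
        m = baseline_measures(prevalence, sensitivity=0.80, specificity=0.90)
        print(f"prevalence={prevalence:.3f}  recall={m['recall']:.3f}  "
              f"CDR/1000={1000 * m['cdr']:.1f}  PPVS={m['ppvs']:.3f}")
```

Because the false-positive term (1 − specificity) × (1 − prevalence) dominates at low prevalence, quadrupling prevalence barely moves the recall rate but roughly quadruples PPVS.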
In reality, the major screening problem of overdiagnosis, the detection of nonprogressive or nonlethal pseudodisease, complicates this simplified economic analysis [21]. Overdiagnosis leads to harmful overtreatment and associated anxiety without any potential benefit to the woman [45,46]. The absolute benefit from screening comes from the earlier detection of, and effective intervention for, some otherwise lethal tumors. The absolute risk reduction equals the relative risk reduction times the absolute risk of death. If the relative risk reduction from screening is constant, the absolute death risk, which increases with age, determines the absolute benefit [47]. The breast cancer deaths prevented by screening, translated into age-dependent discounted years of life saved, will ultimately reflect the absolute benefit [48]. Consequently, the CDR as a performance measure directly reflects the risk of developing breast cancer, but only indirectly reflects the absolute benefit of screening mammography.
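The relationship between relative and absolute risk reduction can be made concrete. This sketch assumes a hypothetical constant 20% relative risk reduction and invented age-specific death risks; it illustrates only the arithmetic ARR = RRR × absolute death risk, not our model's inputs.

```python
# Sketch of absolute vs. relative risk reduction.
# Assumption (hypothetical): constant RRR applied to made-up absolute death risks.

def absolute_risk_reduction(relative_risk_reduction: float, absolute_death_risk: float) -> float:
    """ARR = RRR x absolute risk of breast cancer death without screening."""
    if not 0.0 <= relative_risk_reduction <= 1.0:
        raise ValueError("relative risk reduction must be a proportion")
    return relative_risk_reduction * absolute_death_risk

if __name__ == "__main__":
    rrr = 0.20  # hypothetical, held constant across ages
    hypothetical_death_risk = {40: 0.002, 50: 0.004, 60: 0.006}
    for age, risk in hypothetical_death_risk.items():
        arr = absolute_risk_reduction(rrr, risk)
        print(f"age {age}: ARR = {arr:.4f} ({1000 * arr:.1f} per 1000 women)")
```

With the relative effect fixed, the absolute benefit simply scales with the baseline death risk, which is why it grows with age.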
Adjusting the CDR to reflect the proportion of lives saved among women with mammography-detected cancers may present a more realistic measure of benefit. We have estimated this proportion to be under 5% [49,50]. Breast cancer is a heterogeneous disease, with varying metastatic potential and sojourn times [51]. Screening also has a limited window of possible effectiveness: it may work only when detection occurs before the critical point (the development of metastases), and only if both events occur during the sojourn time [21]. Earlier detection will not extend lives if the critical point occurs after the onset of breast symptoms, or before the breast cancer is detectable by imaging. In less lethal cancers (including pseudodisease), the critical point may be delayed or never occur. Detecting a cancer early can also cause harm: earlier intervention associated with mammography appears to activate dormant metastases in some cases, especially in younger women [52].
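The timing conditions in this paragraph can be written as a small predicate. This is a hypothetical sketch: `TumorTimeline` and `screening_can_help` are illustrative names, and event times are placeholders rather than outputs of our model. It also ignores the possible harm of activating dormant metastases.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TumorTimeline:
    """Hypothetical event times (years) for one tumor."""
    detectable: float          # start of preclinical sojourn (imaging-detectable)
    symptomatic: float         # end of sojourn (symptoms appear)
    critical: Optional[float]  # onset of metastases; None if it never occurs

def screening_can_help(t: TumorTimeline, screen_time: float) -> bool:
    """True only if the screen falls in the sojourn time, the critical point
    also falls in the sojourn time, and the screen precedes the critical point."""
    if t.critical is None:
        return False  # pseudodisease: no death to prevent, only overdiagnosis
    screen_in_sojourn = t.detectable <= screen_time < t.symptomatic
    critical_in_sojourn = t.detectable <= t.critical <= t.symptomatic
    return screen_in_sojourn and critical_in_sojourn and screen_time < t.critical

if __name__ == "__main__":
    tumor = TumorTimeline(detectable=0.0, symptomatic=3.0, critical=2.0)
    print(screening_can_help(tumor, screen_time=1.0))  # screen before metastases
    print(screening_can_help(tumor, screen_time=2.5))  # screen after metastases
```

A critical point after symptom onset fails `critical_in_sojourn`, because clinical detection would already have been in time; a critical point before the tumor is detectable fails for the same reason.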
Clinical implications
Stratifying screening recommendations so that only higher-risk women are encouraged to screen earlier or more often would increase the prevalence in the screened population and therefore improve performance measures [53,54]. Recent research on risk stratification for breast cancer shows that this approach may be a practical way to increase screening efficiency [55]. Notably, performance measures are worse in women under age 50: 39% of all diagnostic mammograms after screening occur in this group [40], but fewer than 25% of invasive cancers do [53]. A proxy for screening "cost-effectiveness" is the negative biopsy fraction (1 − PBF), since a negative biopsy incurs high financial and emotional costs due to its invasive nature, without the corresponding potential benefit of cancer detection. Our model predicts that the negative biopsy fraction is generally over 90% (range 84–93%) for women under 45 and does not approach the performance benchmarks of 60–75% [27,40] until after age 55. As shown in Figure 5, even the best United States radiologists will not achieve a level of 75% until women reach age 49.
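A minimal sketch of the negative biopsy fraction and a benchmark check. The biopsy and cancer counts are hypothetical; the 60–75% range is the benchmark cited above, and treating "at or below 75%" as meeting it is an assumption of this sketch.

```python
# Sketch: negative biopsy fraction (1 - PBF) from hypothetical counts,
# compared with the 60-75% benchmark range cited in the text.

BENCHMARK_LOW, BENCHMARK_HIGH = 0.60, 0.75

def negative_biopsy_fraction(biopsies: int, cancers: int) -> float:
    """1 - PBF, where PBF = cancers diagnosed / biopsies performed."""
    if biopsies <= 0 or not 0 <= cancers <= biopsies:
        raise ValueError("need biopsies > 0 and 0 <= cancers <= biopsies")
    return 1.0 - cancers / biopsies

def meets_benchmark(nbf: float) -> bool:
    """Lower is better; a fraction at or below the upper benchmark counts as met."""
    return nbf <= BENCHMARK_HIGH

if __name__ == "__main__":
    for biopsies, cancers in ((100, 9), (100, 30)):  # hypothetical counts
        nbf = negative_biopsy_fraction(biopsies, cancers)
        print(f"{cancers}/{biopsies} cancers: 1 - PBF = {nbf:.2f}, meets benchmark: {meets_benchmark(nbf)}")
```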
Consequently, United States organizations [2] and other countries [56,57] hold varied opinions regarding the appropriate age to begin screening and how to practice it. For instance, the United Kingdom screens women aged 50 to 70 at 3-year intervals, with about half the recall rate of the United States but similar cancer detection rates for baseline and subsequent exams [42,58]. United Kingdom performance measures should therefore be higher. The European desirable recall rate is under 5% for a baseline exam (and under 3% for a subsequent exam), which is also half the 10% United States guideline for all mammograms [31]. Analysis of BCSC data shows that the United States desirable recall rate should be no more than 10% for baseline exams and 6.7% for subsequent exams [25]. These recommendations and practices may reflect differences in practice environments and radiologist skill. From a broader economic perspective, however, these targets should reflect an analysis of the opportunity cost of resources devoted to screening and diagnosis versus the absolute net benefit of lives saved minus screening harms. Unfortunately, screening advocates have exaggerated this net benefit by emphasizing relative rather than absolute risk reduction and, in many scientific articles, by failing to discuss the harms of mammography [11].
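The step from "half the recall rate, similar cancer detection" to "higher performance measures" follows from PPVS = CDR / recall rate. This sketch holds a hypothetical CDR of 5 per 1000 fixed and applies the two baseline recall targets quoted above; it is illustrative arithmetic, not a comparison of real UK and US data.

```python
# Sketch: same hypothetical CDR under two baseline recall targets from the text.

CDR = 0.005  # hypothetical cancers detected per screen, held fixed

RECALL_TARGETS = {
    "European baseline target (<5%)": 0.05,
    "BCSC-derived US baseline target (<=10%)": 0.10,
}

def ppvs(cdr: float, recall: float) -> float:
    """Positive predictive value of a recall: cancers detected / women recalled."""
    if recall < cdr:
        raise ValueError("recall rate cannot be below the cancer detection rate")
    return cdr / recall

def false_positive_recalls_per_1000(cdr: float, recall: float) -> float:
    """Recalled women without cancer, per 1000 screens."""
    return 1000.0 * (recall - cdr)

if __name__ == "__main__":
    for label, recall in RECALL_TARGETS.items():
        print(f"{label}: PPVS = {ppvs(CDR, recall):.3f}, "
              f"false-positive recalls = {false_positive_recalls_per_1000(CDR, recall):.0f} per 1000")
```

Halving the recall rate at a fixed CDR doubles PPVS and cuts false-positive recalls by more than half.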
The principle of IMDM decentralizes this debate and helps each woman, instead of a "policymaker" or a potentially biased "expert," decide which starting age for screening is best for her [59]. However, the woman must understand the benefits and harms, as well as the true opportunity cost, of screening mammography [5,60,61]. Although age is the most important risk factor for breast cancer development, most women are unaware of this fact [62]. Women tend to identify genetics as paramount, yet only 10% of all breast cancer patients have a family history of cancer in a first-degree relative (mother, sister, or daughter) [53]. To stress the importance of age, decision aids could perhaps present risk as an equivalent risk-adjusted age for breast cancer development [63]. In this way, a higher-risk woman could anticipate that her mammography performance would be equivalent to that of an older woman.
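One way a decision aid might compute an equivalent risk-adjusted age is to scale a woman's age-specific prevalence by her relative risk and invert an average-risk table. This is a hypothetical sketch: the table values are invented (chosen only so that the 40-to-50 ratio matches the 3.4-fold prevalence ratio reported at the start of this section), and `risk_adjusted_age` is an illustrative name.

```python
from bisect import bisect_left

# HYPOTHETICAL average-risk baseline prevalence (per 1000) by age; invented values.
AGES = [40.0, 45.0, 50.0, 55.0, 60.0]
PREVALENCE = [2.0, 4.0, 6.8, 9.0, 11.0]  # must be strictly increasing to invert

def _interp(x: float, xs: list, ys: list) -> float:
    """Piecewise-linear interpolation of ys over increasing xs, clamped at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def risk_adjusted_age(age: float, relative_risk: float) -> float:
    """Age at which an average-risk woman has this woman's risk-scaled prevalence."""
    target = relative_risk * _interp(age, AGES, PREVALENCE)
    return _interp(target, PREVALENCE, AGES)  # inverse lookup; clamps to 40-60

if __name__ == "__main__":
    for rr in (1.0, 2.0, 3.4):
        print(f"40-year-old with relative risk {rr}: equivalent age {risk_adjusted_age(40, rr):.1f}")
```

The clamping means ages outside the table are reported at its boundaries; a real aid would need a wider, validated table.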
Since the appropriate contents of a screening decision guide are debatable, research is needed on applying these results and on making decision guides easier for physicians to present and for women and the public to understand [53,64,65]. For example, one decision guide for younger women uses a simplified decision tree of breast cancer screening [66,67]. This type of decision guide would be improved by including a discussion of DCIS overdiagnosis and of the accuracy limitations of mammography, which result in false-positive follow-up testing and associated psychological costs [68]. Explaining that "peace of mind" after a negative mammogram [15] is more a function of low prevalence than of the sensitivity of the baseline exam would also promote realistic expectations.
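The "peace of mind" point can be checked with the negative predictive value. In this sketch the specificity (0.902) and one sensitivity (0.804) are borrowed from the BCSC CADe comparison quoted in this section purely as round inputs; the lower sensitivity and the prevalence values are hypothetical.

```python
def npv(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Negative predictive value of a single negative screen."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)

if __name__ == "__main__":
    p = 0.004  # hypothetical baseline prevalence
    for se in (0.804, 0.50):
        print(f"sensitivity={se:.3f}  NPV={npv(p, se, 0.902):.4f}")
```

At low prevalence, even a sensitivity of 50% leaves the NPV above 99.7%, so a reassuring negative result mostly reflects how rare cancer is rather than how good the exam is.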
Model applications
Besides serving as an educational tool, our model can be used to predict the effect of new technology on performance measures [69]. For example, the true value of computer-aided detection (CADe) for mammography depends on its overall marginal effect on both the CDR (sensitivity × prevalence) and the recall rate, which equals CDR + (1 − specificity) × (1 − prevalence) [49,70]. Our model shows how both the CDR and the FP cases/1000 screens influence the other performance measures. For example, the BCSC CADe study showed a relative increase in sensitivity (84.0% with CADe versus 80.4% without) and a decrease in specificity (87.2% versus 90.2%). Using these same relative changes, our model predicts for 50-year-old women a 29% increase in the recall rate (32% actual), a 19% drop in the PPVS (22% actual), a 26% increase in the TIR (20% actual), and a 17% drop in the PBF [71]. The secondary performance measures will remain stable only if the CDR and the FP cases/1000 screens change by equal relative percentages.
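A minimal sketch of the CADe calculation using the recall-rate identity above. The sensitivity and specificity pairs are the BCSC values quoted in the text; the prevalence of 0.006 is a hypothetical stand-in for a 50-year-old woman, not our model's input.

```python
def recall_and_ppvs(prevalence: float, sensitivity: float, specificity: float):
    """Recall rate = CDR + (1 - specificity)(1 - prevalence); PPVS = CDR / recall."""
    cdr = sensitivity * prevalence
    recall = cdr + (1.0 - specificity) * (1.0 - prevalence)
    return recall, cdr / recall

def relative_change(prevalence, sens0, spec0, sens1, spec1):
    """Relative change in recall rate and PPVS when moving from (sens0, spec0) to (sens1, spec1)."""
    r0, p0 = recall_and_ppvs(prevalence, sens0, spec0)
    r1, p1 = recall_and_ppvs(prevalence, sens1, spec1)
    return r1 / r0 - 1.0, p1 / p0 - 1.0

def ppvs_from_rates(cdr: float, fp: float) -> float:
    """PPVS from cancers and false positives per 1000 screens."""
    return cdr / (cdr + fp)

if __name__ == "__main__":
    d_recall, d_ppvs = relative_change(0.006, 0.804, 0.902, 0.840, 0.872)
    print(f"recall change: {d_recall:+.1%}, PPVS change: {d_ppvs:+.1%}")
    # Equal relative changes in CDR and FP leave PPVS unchanged.
    print(ppvs_from_rates(5, 95), ppvs_from_rates(6, 114))
```

With this assumed prevalence the sketch gives changes close to the model predictions quoted above; because the false-positive term dominates at low prevalence, the result is insensitive to the exact prevalence chosen.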
Receiver operating characteristic analysis quantifies the inverse relationship and inherent trade-off between sensitivity (TPF) and specificity (1 − FPF), given the natural overlap between diseased and healthy individuals, independent of prevalence. In theory, when prevalence is incorporated and the costs of decision consequences are computed, the optimal threshold, or sensitivity/specificity pair, on any receiver operating characteristic curve can be calculated [30]. We used our model to predict how the recommended threshold for screening mammography varies with age or risk. The optimal operating points appear closer to European than to United States recommendations, but they depend on multiple assumptions, including the resource costs of false-positive exams and the value assigned to a year of life saved [72].
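To illustrate the threshold calculation in the abstract, this sketch swaps in a textbook equal-variance binormal ROC model (healthy scores ~ N(0, 1), diseased scores ~ N(d, 1)) rather than our mammography model. With true results assigned zero cost, minimizing expected cost gives the standard condition that the ROC slope at the optimum equals ((1 − p)/p) × (C_FP/C_FN). The separation d, prevalence and relative costs are hypothetical.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def operating_point(t: float, d: float):
    """(FPF, TPF) for calling 'positive' above threshold t."""
    return 1.0 - norm_cdf(t), 1.0 - norm_cdf(t - d)

def expected_cost(t: float, d: float, prevalence: float, cost_fp: float, cost_fn: float) -> float:
    """Expected cost per screen; true negatives and true positives cost 0."""
    fpf, tpf = operating_point(t, d)
    return prevalence * (1.0 - tpf) * cost_fn + (1.0 - prevalence) * fpf * cost_fp

def optimal_threshold(d: float, prevalence: float, cost_fp: float, cost_fn: float) -> float:
    """Closed form: the ROC slope exp(d*t - d^2/2) equals ((1-p)/p) * (cost_fp/cost_fn)."""
    m = ((1.0 - prevalence) / prevalence) * (cost_fp / cost_fn)
    return math.log(m) / d + d / 2.0

if __name__ == "__main__":
    d, cost_fp, cost_fn = 2.0, 1.0, 100.0  # hypothetical separation and relative costs
    for p in (0.003, 0.005, 0.010):
        t = optimal_threshold(d, p, cost_fp, cost_fn)
        fpf, tpf = operating_point(t, d)
        print(f"p={p:.3f}: threshold={t:.2f}, sensitivity={tpf:.3f}, specificity={1 - fpf:.3f}")
```

Raising prevalence lowers the optimal threshold and moves the operating point toward higher sensitivity and more recalls; raising the assumed false-positive cost does the opposite.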
Limitations
Our analysis is limited to the decision to undergo a baseline exam, the first step in screening mammography. Subsequent mammograms, for which prior mammograms are available for comparison, have the obvious benefit of higher specificity and fewer false-positive recalls, although specificity also drops as sensitivity increases with less frequent screening [26]. A complete evaluation of age-related screening performance would require modeling the effects of both initial and repeat screening over multiple periods and frequencies of screening [73]. However, such models must choose an arbitrary period for repeated screening as well as mortality benefit assumptions [74].
Furthermore, a comprehensive model may be less useful for educational purposes. Our single-mammogram model, which focuses on performance measures, helps quantify the primary importance of age. A woman makes the decision to screen independently each time; she cannot buy a coupon worth seven screens over the next decade. In fact, many women undergo sporadic screening and ignore recommendations. For instance, mammography registry data and models show that although 75% of women aged 40 to 50 have obtained a baseline mammogram, over 43% have a gap of more than 2.5 years between mammograms, and only 24% get an annual mammogram [75,76].
The cancer detection rates for the baseline exam predicted by our model are higher than the actual BCSC cancer detection rates because the model's input prevalence is higher; both sets of sensitivity values come from the same BCSC data. However, these BCSC and NBCCEDP first mammograms may not be truly baseline exams and may underestimate true prevalence, as shown by the differences in cancer rates. The fact that the actual 9–15-month BCSC data are close to the SEER incidence rates implies either that the BCSC data for first mammograms are biased, especially for younger women, or that the sojourn time estimates are biased. Our sojourn time estimates could be too long, which could occur with overdiagnosis [22]. Furthermore, SEER incidence data are an average based on all cancer diagnoses, and screened and nonscreened women could have different incidence rates. The higher predicted CDR from our model is conservative with respect to our conclusions, because it yields higher predicted PPV performance measures in Figures 4 and 5.
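A common steady-state approximation in screening models relates preclinical prevalence to incidence and mean sojourn time (prevalence ≈ incidence × mean sojourn time). We use it here only as an assumption to show why an overlong sojourn estimate would inflate predicted baseline prevalence and CDR; we do not claim it is exactly how our model combines the SEER and sojourn inputs. The incidence, sojourn times and sensitivity below are hypothetical.

```python
def baseline_prevalence(annual_incidence: float, mean_sojourn_years: float) -> float:
    """Steady-state approximation: preclinical prevalence ~ incidence x mean sojourn time."""
    return annual_incidence * mean_sojourn_years

def predicted_cdr(sensitivity: float, prevalence: float) -> float:
    """CDR = sensitivity x prevalence."""
    return sensitivity * prevalence

if __name__ == "__main__":
    incidence = 0.0015   # hypothetical: 1.5 cancers per 1000 women per year
    sensitivity = 0.80   # hypothetical
    for sojourn in (2.0, 3.0):  # hypothetical mean sojourn times (years)
        prev = baseline_prevalence(incidence, sojourn)
        print(f"sojourn {sojourn:.0f} y: prevalence {1000 * prev:.1f}/1000, "
              f"predicted CDR {1000 * predicted_cdr(sensitivity, prev):.1f}/1000")
```

Under this approximation, a sojourn estimate that is 50% too long inflates both the predicted prevalence and the predicted CDR by the same 50%.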
Consequently, we could improve our model with better estimates of risk-adjustable prevalence and sojourn time, as well as of baseline total intervention rates. Although our model also uses older accuracy data that do not include recent digital technology, the effect should be minimal. In one large trial, digital mammography had no overall accuracy benefit compared with film, although digital mammography did show increased sensitivity and receiver operating characteristic curve accuracy in the single subset of women under 50 with dense breasts (17% of participants; 1.25-year sensitivity 59% for digital versus 27% for film) [77]. But since most digital units have CADe technology [78], any claim of accuracy gain for combined digital/CADe mammography is problematic, because CADe decreases receiver operating characteristic curve accuracy [71]. The excess false-positive recalls caused by CADe's influence on the radiologist calling a normal mammogram abnormal ("computer-aided deception") outweigh the benefit of extra cancer detection. As our model predicts, false-positive recall exams should increase substantially with combined digital/CADe mammography [79].
Finally, the CDR is a limited performance measure for radiologists because of the influence of age-related prevalence: as Figure 2 shows, the 90th percentile (highly skilled) radiologist reading mammograms for women aged 50 will find fewer cancers than the 10th percentile radiologist reading mammograms for women aged 55. Emphasizing the CDR also reinforces the naïve but widely held notion that breast cancer is a homogeneous disease [15], and therefore the belief that earlier detection always helps. For instance, in one cross-sectional survey, 94% of women thought that a woman with screen-detected cancer "may have benefited from the mammograms" [14]. Consequently, to support IMDM and professional honesty [80,81], radiologists should continually emphasize to the public that the correct scenario is "earlier cancer detection saves some lives, but screening harms healthy women" or "mammography might extend your life, but often misses cancers."
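The Figure 2 comparison follows directly from CDR = sensitivity × prevalence. This sketch uses hypothetical percentile sensitivities and age-specific prevalences (not the values behind Figure 2) to show how a more skilled reader can still post a lower CDR.

```python
def cdr_per_1000(sensitivity: float, prevalence_per_1000: float) -> float:
    """Cancers detected per 1000 baseline screens."""
    return sensitivity * prevalence_per_1000

if __name__ == "__main__":
    # Hypothetical inputs: percentile sensitivities and age-specific prevalences.
    skilled_at_50 = cdr_per_1000(0.90, 5.0)  # 90th-percentile reader, age-50 panel
    weaker_at_55 = cdr_per_1000(0.70, 7.0)   # 10th-percentile reader, age-55 panel
    print(f"90th percentile, age 50: {skilled_at_50:.1f} per 1000")
    print(f"10th percentile, age 55: {weaker_at_55:.1f} per 1000")
    # Dividing by prevalence recovers the skill ranking that raw CDR hides.
    print(f"implied sensitivities: {skilled_at_50 / 5.0:.2f} vs {weaker_at_55 / 7.0:.2f}")
```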