Assessing mammographers' accuracy: A comparison of clinical and test performance
Introduction
Screening mammography is an effective method of detecting early stage breast cancer. However, the diagnostic value of a mammogram depends on both the technical quality of the film and a mammographer's ability to interpret that film. In the last decade mammographic technology has been relatively stable, allowing researchers to focus on the subjective interpretation of mammograms (e.g., [1], [2]).
The Mammography Quality Standards Act recognized the effect of mammographers' interpretations on screening assessments and encouraged medical audits of mammographers' clinical assessments. Evaluating mammographers' performance using clinical assessments is intuitively appealing, because this is “real life” performance. For many researchers, the medical audit is the gold standard measure of performance [3]. However, our ability to draw conclusions about the performance of particular mammographers from these clinical assessments is limited because each mammographer reviews a different set of films. The difficulty of films varies with characteristics of the women evaluated (e.g., breast density), characteristics of lesions (e.g., size), and characteristics of technical film quality (e.g., positioning). Variability in film difficulty results in chance differences among mammographers. Systematic differences in the difficulty of films reviewed can also occur, for example, when mammographers tend to send difficult cases to a particular colleague. Differences in the number of films reviewed also affect comparisons between mammographers through the variability of estimated performance. Because performance estimates based on fewer patients tend to be more variable, and therefore more extreme, comparisons that ignore differences in variability can be misleading. Statistical models have a limited ability to adjust for differences in the films read by each mammographer [4], [5].
Estimation of clinical accuracy is further complicated by the influence that clinical assessments have on the probability of detecting breast cancer. When estimating screening accuracy, we focus on the correspondence between a mammographer's clinical interpretation and a woman's true disease state. Because most women undergo biopsy only if a mammographer finds an abnormality, undetected breast cancer cases emerge symptomatically or during a second screening exam. Thus, undetected breast cancer can only be identified when follow-up information exists. A one-year follow-up is generally used, with women classified as disease positive at the time of a screening mammogram if breast cancer is diagnosed within one year [3].
Estimation and comparison of clinical screening performance is also hampered by the relatively low incidence of breast cancer. The one-year incidence of invasive breast cancer is approximately 3.5 per 1,000 among American women who are over 49 years old [6]. Low incidence rates make it difficult to precisely estimate a mammographer's rate of cancer detection, since most mammographers will evaluate very few cancers in a single year.
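To make this imprecision concrete, the short Python sketch below works through the arithmetic. Only the 3.5 per 1,000 incidence comes from the text; the annual film volume of 500 is an assumption chosen for illustration.

```python
# Hypothetical illustration of why low incidence limits precision.
# Only the incidence (3.5 invasive cancers per 1,000 women per year)
# comes from the text; the annual film volume is an assumption.

incidence = 3.5 / 1000      # one-year invasive breast cancer incidence
films_per_year = 500        # assumed screening volume for one mammographer

expected_cancers = films_per_year * incidence
print(f"Expected cancers seen in one year: {expected_cancers:.2f}")  # 1.75

# With only one or two cancers per year, detecting one more or one fewer
# case swings an annual sensitivity estimate by tens of percentage points.
```

At this volume a mammographer expects fewer than two cancers per year, so a sensitivity estimate from a single year of clinical data rests on a denominator of one or two cases.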
Standardized testing of mammographers is an alternative way to estimate their accuracy. Using standardized film sets removes many of the problems with clinical data. Each mammographer views the same films in the same setting and with the same patient information. Test sets exclude films from women without necessary follow-up information, so that true disease state is known with a high degree of certainty. Test sets can also include more films from women with breast cancer than would be seen in clinical practice, allowing more precise estimation of sensitivity. In summary, use of a test film set controls for film difficulty, film quality and the information presented during film evaluation, offering a relatively simple method of estimating mammographers' accuracy under standardized conditions.
Although estimating accuracy from assessments of standardized film sets avoids many of the problems with clinical data, the artificial conditions introduce other problems. Mammographers know that in the test setting their decisions will not affect patient care. The test itself may be burdensome given time constraints. There is also evidence suggesting that the higher prevalence of disease in test film sets introduces bias. Egglin [7] found that radiologists were more likely to interpret arteriograms as positive for pulmonary emboli when viewed in a higher prevalence film set, regardless of true disease state. When this “context bias” exists, sensitivity increases with increasing prevalence while specificity decreases.
Studies describing mammographer variability based on test film sets (e.g., [1], [2]) implicitly assume a strong correlation between mammographers' performance estimated from test sets and mammographers' performance in clinical practice. However, this assumption has never been tested. In this article we directly compare mammographers' clinical and test performance.
Section snippets
Data
We analyzed data from 27 mammographers practicing at a large staff model not-for-profit health maintenance organization (HMO). The mammographers included in this study were voluntary participants, though this group essentially included all of the mammographers practicing with the HMO at the time of the study.
Both clinical and test data sets use films from women who remained enrolled in the HMO for at least two years after their index mammogram. Women with breast cancer were identified using the
Methods
We are primarily interested in the degree of correlation between mammographers' accuracy measured in a clinical setting and accuracy measured in a test setting. The accuracy measures we focused on are sensitivity and specificity. Sensitivity is the proportion of women with breast cancer who had a positive mammogram assessment. Specificity is the proportion of women without breast cancer who had a negative mammogram assessment.
Calculation of sensitivity and specificity requires definition of a
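The two measures defined above can be sketched as simple proportions computed from a 2×2 table of assessments versus true disease state. This is a minimal illustration; the counts and function names are ours, not taken from the study's data system.

```python
# Minimal sketch of the two accuracy measures, computed from counts of
# mammogram assessments cross-tabulated with true disease state.
# The counts below are illustrative, not from the study.

def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of women with breast cancer whose mammogram was read positive."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of women without breast cancer whose mammogram was read negative."""
    return true_neg / (true_neg + false_pos)

# Illustrative counts: 12 of 15 cancers called positive,
# 1700 of 1875 non-cancers called negative.
print(sensitivity(12, 3))      # 0.8
print(specificity(1700, 175))  # about 0.907
```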
Results
There was wide variability in the amount of clinical data available for each mammographer (Table 2). The 27 mammographers clinically evaluated an average of 1890 films during the four-year period (range 232 to 3818), and saw an average of 15 mammograms from women with breast cancer (range 1 to 32). The average clinical prevalence rate across mammographers was 8 cancers per 1,000 mammograms.
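The reported prevalence rate can be checked against the per-mammographer averages given above; a quick arithmetic sketch (15 cancers among 1,890 films, figures from the text):

```python
# Quick check of the reported clinical prevalence using the
# per-mammographer averages given in the text.
cancers, films = 15, 1890
per_1000 = 1000 * cancers / films
print(round(per_1000, 1))  # 7.9, i.e. roughly 8 cancers per 1,000 mammograms
```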
Plots of the sensitivity and specificity suggest moderate positive correlation between clinical and test
Discussion
These results represent a comprehensive comparison of mammographers' assessments in test and clinical settings. The clinical data were based on automated collection of mammographers' interpretations and recommendations. The data systems also allowed two-year follow-up of each woman screened. The test data included a relatively large set of 113 mammograms, 30 of which showed cancer. Finally, our statistical model allowed for differences in the number of films each mammographer assessed during
Acknowledgements
This study was supported by grants CA63731 from the National Cancer Institute and BC962461 from the U.S. Department of Defense.
We wish to acknowledge the careful work of Kari Rosvik and Deb Seger, who made this study possible, and the many mammographers who gave their time to this study. We want to especially thank Mary Kelly, MD, and Donna White, MD, who provided valuable leadership.
References (11)
- et al. Variability in the interpretation of screening mammograms by US radiologists. Archives of Internal Medicine (1996)
- et al. Variability in radiologists' interpretations of mammograms. New England Journal of Medicine (1994)
- et al. The mammography audit: A primer for the Mammography Quality Standards Act (MQSA). American Journal of Radiology (1995)
- et al. Improving the statistical approach to health care provider profiling. Annals of Internal Medicine (1997)
- et al. Comparing risk-adjustment methods for provider profiling. Statistics in Medicine (1997)