Assessing mammographers' accuracy: A comparison of clinical and test performance
Introduction
Screening mammography is an effective method of detecting early stage breast cancer. However, the diagnostic value of a mammogram depends on both the technical quality of the film and a mammographer's ability to interpret that film. In the last decade mammographic technology has been relatively stable, allowing researchers to focus on the subjective interpretation of mammograms (e.g., [1], [2]).
The Mammography Quality Standards Act recognized the effect of mammographers' interpretations on screening assessments and encouraged medical audits of mammographers' clinical assessments. Evaluating mammographers' performance using clinical assessments is intuitively appealing, because this is “real life” performance. For many researchers, the medical audit is the gold standard measure of performance [3]. However, our ability to draw conclusions about the performance of particular mammographers from these clinical assessments is limited because each mammographer reviews a different set of films. The difficulty of films varies with characteristics of the women evaluated (e.g., breast density), characteristics of lesions (e.g., size), and characteristics of technical film quality (e.g., positioning). Variability in film difficulty results in chance differences among mammographers. Systematic differences in the difficulty of films reviewed can also occur, for example, when mammographers tend to send difficult cases to a particular colleague. Differences in the number of films reviewed also affect comparisons between mammographers through the variability of estimated performance. Because performance estimates based on fewer patients tend to be more variable, and therefore more extreme, comparisons that ignore differences in variability can be misleading. Statistical models have a limited ability to adjust for differences in the films read by each mammographer [4], [5].
Estimation of clinical accuracy is further complicated by the influence that clinical assessments have on the probability of detecting breast cancer. When estimating screening accuracy, we focus on the correspondence between a mammographer's clinical interpretation and a woman's true disease state. Because most women undergo biopsy only if a mammographer finds an abnormality, undetected breast cancer cases emerge symptomatically or during a second screening exam. Thus, undetected breast cancer can only be identified when follow-up information exists. A one-year follow-up is generally used, with women classified as disease positive at the time of a screening mammogram if breast cancer is diagnosed within one year [3].
Estimation and comparison of clinical screening performance is also hampered by the relatively low incidence of breast cancer. The one-year incidence of invasive breast cancer is approximately 3.5 per 1,000 among American women who are over 49 years old [6]. Low incidence rates make it difficult to precisely estimate a mammographer's rate of cancer detection, since most mammographers will evaluate very few cancers in a single year.
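To make this imprecision concrete, the short Python sketch below works through the arithmetic. Only the 3.5 per 1,000 incidence comes from the text; the annual film volume of 500 is an assumption chosen for illustration.

```python
# Hypothetical illustration of why low incidence limits precision.
# Only the incidence (3.5 invasive cancers per 1,000 women per year)
# comes from the text; the annual film volume is an assumption.

incidence = 3.5 / 1000      # one-year invasive breast cancer incidence
films_per_year = 500        # assumed screening volume for one mammographer

expected_cancers = films_per_year * incidence
print(f"Expected cancers seen in one year: {expected_cancers:.2f}")  # 1.75

# With only one or two cancers per year, detecting one more or one fewer
# case swings an annual sensitivity estimate by tens of percentage points.
```

At this volume a mammographer expects fewer than two cancers per year, so a sensitivity estimate from a single year of clinical data rests on a denominator of one or two cases.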
Standardized testing of mammographers is an alternative way to estimate their accuracy. Using standardized film sets removes many of the problems with clinical data. Each mammographer views the same films in the same setting and with the same patient information. Test sets exclude films from women without necessary follow-up information, so that true disease state is known with a high degree of certainty. Test sets can also include more films from women with breast cancer than would be seen in clinical practice, allowing more precise estimation of sensitivity. In summary, use of a test film set controls for film difficulty, film quality and the information presented during film evaluation, offering a relatively simple method of estimating mammographers' accuracy under standardized conditions.
Although estimating accuracy from assessments of standardized film sets avoids many of the problems with clinical data, the artificial conditions introduce other problems. Mammographers know that in the test setting their decisions will not affect patient care. The test itself may be burdensome given time constraints. There is also evidence suggesting that the higher prevalence of disease in test film sets introduces bias. Egglin [7] found that radiologists were more likely to interpret arteriograms as positive for pulmonary emboli when viewed in a higher prevalence film set, regardless of true disease state. When this “context bias” exists, sensitivity increases with increasing prevalence while specificity decreases.
Studies describing mammographer variability based on test film sets (e.g., [1], [2]) implicitly assume a strong correlation between mammographers' performance estimated from test sets and mammographers' performance in clinical practice. However, this assumption has never been tested. In this article we directly compare mammographers' clinical and test performance.
Section snippets
Data
We analyzed data from 27 mammographers practicing at a large staff model not-for-profit health maintenance organization (HMO). The mammographers included in this study were voluntary participants, though this group essentially included all of the mammographers practicing with the HMO at the time of the study.
Both clinical and test data sets use films from women who remained enrolled in the HMO for at least two years after their index mammogram. Women with breast cancer were identified using the
Methods
We are primarily interested in the degree of correlation between mammographers' accuracy measured in a clinical setting and accuracy measured in a test setting. The accuracy measures we focused on are sensitivity and specificity. Sensitivity is the proportion of women with breast cancer who had a positive mammogram assessment. Specificity is the proportion of women without breast cancer who had a negative mammogram assessment.
Calculation of sensitivity and specificity requires definition of a
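The two measures defined above can be sketched as simple proportions computed from a 2×2 table of assessments versus true disease state. This is a minimal illustration; the counts and function names are ours, not taken from the study's data system.

```python
# Minimal sketch of the two accuracy measures, computed from counts of
# mammogram assessments cross-tabulated with true disease state.
# The counts below are illustrative, not from the study.

def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of women with breast cancer whose mammogram was read positive."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of women without breast cancer whose mammogram was read negative."""
    return true_neg / (true_neg + false_pos)

# Illustrative counts: 12 of 15 cancers called positive,
# 1700 of 1875 non-cancers called negative.
print(sensitivity(12, 3))      # 0.8
print(specificity(1700, 175))  # about 0.907
```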
Results
There was wide variability in the amount of clinical data available for each mammographer (Table 2). The 27 mammographers clinically evaluated an average of 1890 films during the four-year period (range 232 to 3818), and saw an average of 15 mammograms from women with breast cancer (range 1 to 32). The average clinical prevalence rate across mammographers was 8 cancers per 1,000 mammograms.
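The reported prevalence rate can be checked against the per-mammographer averages given above; a quick arithmetic sketch (15 cancers among 1,890 films, figures from the text):

```python
# Quick check of the reported clinical prevalence using the
# per-mammographer averages given in the text.
cancers, films = 15, 1890
per_1000 = 1000 * cancers / films
print(round(per_1000, 1))  # 7.9, i.e. roughly 8 cancers per 1,000 mammograms
```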
Plots of the sensitivity and specificity suggest moderate positive correlation between clinical and test
Discussion
These results represent a comprehensive comparison of mammographers' assessments in test and clinical settings. The clinical data were based on automated collection of mammographers' interpretations and recommendations. The data systems also allowed two-year follow-up of each woman screened. The test data included a relatively large set of 113 mammograms, 30 of which showed cancer. Finally, our statistical model allowed for differences in the number of films each mammographer assessed during
Acknowledgements
This study was supported by grants CA63731 from the National Cancer Institute and BC962461 from the U.S. Department of Defense.
We wish to acknowledge the careful work of Kari Rosvik and Deb Seger, who made this study possible, and the many mammographers who gave their time to this study. We want to especially thank Mary Kelly, MD, and Donna White, MD, who provided valuable leadership.
References (11)
- et al. Variability in the interpretation of screening mammograms by US radiologists. Archives of Internal Medicine (1996)
- et al. Variability in radiologists' interpretations of mammograms. New England Journal of Medicine (1994)
- et al. The mammography audit: A primer for the Mammography Quality Standards Act (MQSA). American Journal of Radiology (1995)
- et al. Improving the statistical approach to health care provider profiling. Annals of Internal Medicine (1997)
- et al. Comparing risk-adjustment methods for provider profiling. Statistics in Medicine (1997)