Original articles
Assessing mammographers' accuracy: A comparison of clinical and test performance

https://doi.org/10.1016/S0895-4356(99)00218-8

Abstract

Direct estimation of mammographers' clinical accuracy requires the ability to capture screening assessments and correctly identify which screened women have breast cancer. This clinical information is often unavailable, and when it is available its observational nature can cause analytic problems. Problems with clinical data have led some researchers to evaluate mammographers using a single set of films. Research based on these test film sets implicitly assumes a correspondence between mammographers' accuracy in the test setting and their accuracy in a clinical setting. However, there is no evidence supporting this basic assumption. In this article we use hierarchical models and data from 27 mammographers to directly compare accuracy estimated from clinical practice data to accuracy estimated from a test film set. We found moderate positive correlation [ρ̂ = 0.206, 95% credible interval (−0.142 to 0.488)] between clinical and test estimates of mammographers' overall tendency to call a mammogram positive. However, we found no evidence of correlation between clinical and test accuracy [ρ̂ = −0.025, 95% credible interval (−0.484 to 0.447)]. This study is limited by the relatively small number of mammographers evaluated, by the somewhat restricted range of observed sensitivities and specificities, and by differences in the types of films evaluated in the test and clinical datasets. Nonetheless, these findings raise important questions about how mammographer accuracy should be measured.

Introduction

Screening mammography is an effective method of detecting early-stage breast cancer. However, the diagnostic value of a mammogram depends on both the technical quality of the film and a mammographer's ability to interpret that film. In the last decade, mammographic technology has been relatively stable, allowing researchers to focus on the subjective interpretation of mammograms (e.g., [1,2]).

The Mammography Quality Standards Act recognized the effect of mammographers' interpretations on screening assessments and encouraged medical audits of mammographers' clinical assessments. Evaluating mammographers' performance using clinical assessments is intuitively appealing, because this is “real life” performance. For many researchers, the medical audit is the gold standard measure of performance [3]. However, our ability to draw conclusions about the performance of particular mammographers from these clinical assessments is limited because each mammographer reviews a different set of films. The difficulty of films varies with characteristics of the women evaluated (e.g., breast density), characteristics of lesions (e.g., size), and characteristics of technical film quality (e.g., positioning). Variability in film difficulty results in chance differences among mammographers. Systematic differences in the difficulty of films reviewed can also occur, for example, when mammographers tend to send difficult cases to a particular colleague. Differences in the number of films reviewed also affect comparisons between mammographers, because they determine the variability of the estimated performance. Because performance estimates based on fewer patients tend to be more variable, and therefore more extreme, comparisons that ignore differences in variability can be misleading. Statistical models have only a limited ability to adjust for differences in the films read by each mammographer [4,5].
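
To make the variability point concrete, the following is a minimal simulation sketch (ours, not drawn from the article's analysis). It assumes every reader has the same true specificity of 0.90, an arbitrary value, and shows that readers who assess fewer films produce estimates that are, by chance alone, more spread out and therefore more often extreme.

    # Illustrative simulation: identical readers, differing only in the number
    # of films assessed. The true specificity of 0.90 is an arbitrary assumption.
    import numpy as np

    rng = np.random.default_rng(0)
    true_specificity = 0.90

    for n_films in (100, 500, 2000):
        # 1,000 hypothetical readers, each assessing n_films cancer-free women
        correct_negatives = rng.binomial(n_films, true_specificity, size=1000)
        estimates = correct_negatives / n_films
        low, high = np.percentile(estimates, [2.5, 97.5])
        print(f"{n_films:4d} films: 95% of estimated specificities fall in "
              f"{low:.3f}-{high:.3f}")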

Estimation of clinical accuracy is further complicated by the influence that clinical assessments have on the probability of detecting breast cancer. When estimating screening accuracy, we focus on the correspondence between a mammographer's clinical interpretation and a woman's true disease state. Because most women only undergo biopsy if a mammographer finds an abnormality, undetected breast cancer cases emerge symptomatically or during a second screening exam. Thus, undetected breast cancer can only be identified when follow-up information exists. A one-year follow-up is generally used, with women classified as disease-positive at the time of a screening mammogram if breast cancer is diagnosed within one year [3].

Estimation and comparison of clinical screening performance is also hampered by the relatively low incidence of breast cancer. The one-year incidence of invasive breast cancer is approximately 3.5 per 1,000 among American women over 49 years old [6]. Low incidence rates make it difficult to estimate a mammographer's cancer detection rate precisely, since most mammographers will evaluate very few cancers in a single year.
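
As a rough illustration of this point (our arithmetic, using a hypothetical annual volume rather than any figure from the study), a reader screening 2,000 women per year at the quoted incidence would expect to see only about seven cancers, which makes any single-year estimate of sensitivity extremely imprecise.

    # Hypothetical volume of 2,000 screens per year; incidence taken from the text.
    from scipy.stats import beta

    screens_per_year = 2000
    incidence = 3.5 / 1000
    expected_cancers = screens_per_year * incidence
    print(f"expected cancers per year: {expected_cancers:.0f}")       # about 7

    # Exact (Clopper-Pearson) 95% CI for sensitivity if 6 of 7 cancers are detected
    detected, cancers = 6, 7
    lower = beta.ppf(0.025, detected, cancers - detected + 1)
    upper = beta.ppf(0.975, detected + 1, cancers - detected)
    print(f"sensitivity {detected / cancers:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
    # roughly (0.42, 1.00): far too wide to distinguish one reader from another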

Standardized testing of mammographers is an alternative way to estimate their accuracy. Using standardized film sets removes many of the problems with clinical data. Each mammographer views the same films in the same setting and with the same patient information. Test sets exclude films from women without necessary follow-up information, so that true disease state is known with a high degree of certainty. Test sets can also include more films from women with breast cancer than would be seen in clinical practice, allowing more precise estimation of sensitivity. In summary, use of a test film set controls for film difficulty, film quality and the information presented during film evaluation, offering a relatively simple method of estimating mammographers' accuracy under standardized conditions.

Although estimating accuracy from assessments of standardized film sets avoids many of the problems with clinical data, the artificial conditions introduce other problems. Mammographers know that in the test setting their decisions will not affect patient care. The test itself may be burdensome given time constraints. There is also evidence suggesting that the higher prevalence of disease in test film sets introduces bias. Egglin [7] found that radiologists were more likely to interpret arteriograms as positive for pulmonary emboli when the films were presented in a higher-prevalence film set, regardless of true disease state. When this “context bias” exists, sensitivity increases with increasing prevalence while specificity decreases.
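
A simple signal-detection sketch (ours; the 1.5 standard-deviation separation between diseased and non-diseased films is an arbitrary assumption, not a quantity from Egglin's study) shows why a shift in the reader's decision criterion, with no change in underlying skill, produces exactly this pattern of rising sensitivity and falling specificity.

    # Binormal model: non-diseased scores ~ N(0, 1), diseased scores ~ N(1.5, 1).
    # A film is called positive when its score exceeds the reader's criterion.
    from scipy.stats import norm

    separation = 1.5                       # assumed diseased/non-diseased separation
    for criterion in (1.0, 0.5, 0.0):      # lower criterion = more positive calls
        sensitivity = 1 - norm.cdf(criterion - separation)
        specificity = norm.cdf(criterion)
        print(f"criterion {criterion:.1f}: sensitivity {sensitivity:.2f}, "
              f"specificity {specificity:.2f}")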

Studies describing mammographer variability based on test film sets (e.g., [1,2]) implicitly assume a strong correlation between mammographers' performance estimated from test sets and mammographers' performance in clinical practice. However, this assumption has never been tested. In this article we directly compare mammographers' clinical and test performance.


Data

We analyzed data from 27 mammographers practicing at a large, staff-model, not-for-profit health maintenance organization (HMO). The mammographers included in this study were voluntary participants, though this group comprised essentially all of the mammographers practicing with the HMO at the time of the study.

Both clinical and test data sets use films from women who remained enrolled in the HMO for at least two years after their index mammogram. Women with breast cancer were identified using the

Methods

We are primarily interested in the degree of correlation between mammographers' accuracy measured in a clinical setting and their accuracy measured in a test setting. The accuracy measures we focus on are sensitivity and specificity. Sensitivity is the proportion of women with breast cancer who had a positive mammogram assessment. Specificity is the proportion of women without breast cancer who had a negative mammogram assessment.
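
As a minimal sketch of these definitions (the counts below are hypothetical, not the study's data):

    # Sensitivity and specificity from 2x2 counts of assessments vs. true disease state
    def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
        sensitivity = true_pos / (true_pos + false_neg)   # positive calls among women with cancer
        specificity = true_neg / (true_neg + false_pos)   # negative calls among women without cancer
        return sensitivity, specificity

    # Hypothetical reader: 12 of 15 cancers called positive,
    # 1,700 of 1,875 cancer-free films called negative
    sens, spec = sensitivity_specificity(true_pos=12, false_neg=3,
                                         true_neg=1700, false_pos=175)
    print(f"sensitivity {sens:.2f}, specificity {spec:.2f}")          # 0.80, 0.91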

Calculation of sensitivity and specificity requires definition of a

Results

There was wide variability in the amount of clinical data available for each mammographer (Table 2). The 27 mammographers clinically evaluated an average of 1,890 films during the four-year period (range 232 to 3,818), and saw an average of 15 mammograms from women with breast cancer (range 1 to 32). The average clinical prevalence across mammographers was 8 cancers per 1,000 mammograms.

Plots of the sensitivity and specificity suggest moderate positive correlation between clinical and test

Discussion

These results represent a comprehensive comparison of mammographers' assessments in test and clinical settings. The clinical data were based on automated collection of mammographers' interpretations and recommendations. The data systems also allowed two-year follow-up of each woman screened. The test data included a relatively large set of 113 mammograms, 30 of which showed cancer. Finally, our statistical model allowed for differences in the number of films each mammographer assessed during

Acknowledgements

This study was supported by grants CA63731 from the National Cancer Institute and BC962461 from the U.S. Department of Defense.

We wish to acknowledge the careful work of Kari Rosvik and Deb Seger, who made this study possible, and the many mammographers who gave their time to this study. We want to especially thank Mary Kelly, MD, and Donna White, MD, who provided valuable leadership.

References (11)

  • C.A. Beam et al.

    Variability in the interpretation of screening mammograms by US radiologists

    Archives of Internal Medicine

    (1996)
  • J.G. Elmore et al.

    Variability in radiologists' interpretations of mammograms

    New England Journal of Medicine

    (1994)
  • M.N. Linver et al.

    The mammography audit: A primer for the Mammography Quality Standards Act (MQSA)

    American Journal of Roentgenology

    (1995)
  • C.L. Christiansen et al.

    Improving the statistical approach to health care provider profiling

    Annals of Internal Medicine

    (1997)
  • E.R. DeLong et al.

    Comparing risk-adjustment methods for provider profiling

    Statistics in Medicine

    (1997)
There are more references available in the full text version of this article.
