Ophthalmology

Volume 125, Issue 8, August 2018, Pages 1264-1272

Original article
Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy

https://doi.org/10.1016/j.ophtha.2018.01.034

Purpose

Use adjudication to quantify errors in diabetic retinopathy (DR) grading based on individual graders and majority decision, and to train an improved automated algorithm for DR grading.

Design

Retrospective analysis.

Participants

Retinal fundus images from DR screening programs.

Methods

Images were each graded by the algorithm, U.S. board-certified ophthalmologists, and retinal specialists. The adjudicated consensus of the retinal specialists served as the reference standard.

Main Outcome Measures

For agreement between different graders as well as between the graders and the algorithm, we measured the (quadratic-weighted) kappa score. To compare the performance of different forms of manual grading and the algorithm for various DR severity cutoffs (e.g., mild or worse DR, moderate or worse DR), we measured area under the curve (AUC), sensitivity, and specificity.
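
To make the agreement metric concrete, the sketch below computes a quadratic-weighted kappa between two graders on the 5-point DR severity scale. It uses scikit-learn's cohen_kappa_score as a convenient stand-in rather than the authors' tooling, and the grades shown are illustrative placeholders, not study data.

    # Minimal sketch: quadratic-weighted kappa between two graders assigning
    # 5-point DR severity grades (0 = none ... 4 = proliferative DR).
    # The grades below are illustrative placeholders, not study data.
    from sklearn.metrics import cohen_kappa_score

    grader_a = [0, 1, 2, 2, 3, 4, 0, 1]
    grader_b = [0, 1, 1, 2, 3, 3, 0, 2]

    # weights="quadratic" penalizes a disagreement by the squared distance
    # between grades, so a 2-step error costs 4x as much as a 1-step error.
    kappa = cohen_kappa_score(grader_a, grader_b, weights="quadratic")
    print(f"quadratic-weighted kappa: {kappa:.3f}")

The quadratic weighting matters for an ordinal scale like DR severity: confusing moderate with severe DR is a smaller error than confusing no DR with proliferative DR, and the statistic reflects that.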

Results

Of the 193 discrepancies between adjudication by retinal specialists and majority decision of ophthalmologists, the most common were missed microaneurysms (MAs) (36%), artifacts (20%), and misclassified hemorrhages (16%). Relative to the reference standard, the kappa for individual retinal specialists, ophthalmologists, and the algorithm was 0.82 to 0.91, 0.80 to 0.84, and 0.84, respectively. For moderate or worse DR, the majority decision of ophthalmologists had a sensitivity of 0.838 and specificity of 0.981. The algorithm had a sensitivity of 0.971, specificity of 0.923, and AUC of 0.986. For mild or worse DR, the algorithm had a sensitivity of 0.970, specificity of 0.917, and AUC of 0.986. By using a small number of adjudicated consensus grades as a tuning dataset and higher-resolution images as input, the algorithm improved in AUC from 0.934 to 0.986 for moderate or worse DR.
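
To illustrate how a 5-point grade is evaluated at a severity cutoff such as "moderate or worse DR", the sketch below binarizes reference grades at grade >= 2 and scores hypothetical model outputs. The arrays and the 0.5 operating point are assumptions for illustration only.

    # Minimal sketch: sensitivity, specificity, and AUC at one severity cutoff.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    ref_grades  = np.array([0, 1, 2, 3, 4, 0, 2, 1])               # adjudicated reference (illustrative)
    model_score = np.array([0.1, 0.3, 0.8, 0.9, 0.95, 0.2, 0.7, 0.6])

    y_true = ref_grades >= 2          # binarize at "moderate or worse DR"
    y_pred = model_score >= 0.5       # hypothetical operating point

    sensitivity = (y_pred & y_true).sum() / y_true.sum()
    specificity = (~y_pred & ~y_true).sum() / (~y_true).sum()
    auc = roc_auc_score(y_true, model_score)
    print(sensitivity, specificity, auc)

Note that the AUC summarizes performance across all operating points, whereas the reported sensitivity/specificity pairs correspond to a particular chosen threshold.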

Conclusions

Adjudication reduces the errors in DR grading. A small set of adjudicated DR grades allows substantial improvements in algorithm performance. The resulting algorithm's performance was on par with that of individual U.S. board-certified ophthalmologists and retinal specialists.

Section snippets

Development Dataset

In this work, we built upon the datasets used by Gulshan et al14 for algorithm development and clinical validation. A summary of the various datasets and grading protocols used for this study is shown in Figure S1 (available at www.aaojournal.org). The development dataset consists of images obtained from patients who presented for DR screening at EyePACS-affiliated clinics, 3 eye hospitals in India (Aravind Eye Hospital, Sankara Nethralaya, and Narayana Nethralaya), and the publicly…

Grading and Adjudication

The baseline characteristics of the development and clinical validation datasets are described in Table 1. The training portion of the development set consisted of more than 1.6 million fundus images from 238 610 unique individuals. The tune portion of the development set consisted of adjudicated consensus grades for 3737 images from 2643 unique individuals. The clinical validation set consisted of 1958 photos from 998 unique individuals. Compared with the clinical validation set, the development set…
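
One way a small adjudicated tune set can be exploited is to select the model's operating threshold on it. The sketch below picks the threshold that maximizes Youden's J statistic; this is a generic illustration of tuning against adjudicated labels, not the authors' exact procedure, and pick_threshold is a hypothetical helper.

    # Minimal sketch, assuming boolean tune_labels with both classes present:
    # choose the operating threshold with the best Youden's J on the tune set.
    import numpy as np

    def pick_threshold(tune_scores, tune_labels):
        """Return the threshold maximizing sensitivity + specificity - 1."""
        best_t, best_j = 0.5, -1.0
        for t in np.linspace(0.0, 1.0, 101):
            pred = tune_scores >= t
            sens = (pred & tune_labels).sum() / max(tune_labels.sum(), 1)
            spec = (~pred & ~tune_labels).sum() / max((~tune_labels).sum(), 1)
            j = sens + spec - 1.0
            if j > best_j:
                best_t, best_j = t, j
        return best_t

    # Illustrative usage with placeholder scores and adjudicated labels:
    scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2])
    labels = np.array([False, False, True, True, True, False])
    print(pick_threshold(scores, labels))

Because the tune set is small (3737 images here), using the highest-quality labels available for it, i.e., adjudicated consensus grades, pays off disproportionately relative to its size.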

Discussion

Deep learning has garnered attention recently because of its ability to create highly accurate algorithms from large datasets of labeled data without feature engineering. However, great care must be taken to evaluate these algorithms against “ground truth” labels of the highest possible quality. Previous work14 used majority decision to generate ground truth, without an adjudication process. The present work suggests that an adjudication process yielding a consensus grade from multiple retina…
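
To see why majority decision can be a weaker reference standard than adjudication, consider how a vote aggregates three graders' labels. The helper below is a hypothetical illustration: ties must be broken by some arbitrary convention, whereas adjudication resolves each discrepancy by discussion among the graders.

    # Minimal sketch contrasting majority decision with adjudication.
    from collections import Counter

    def majority_grade(grades):
        # Most common grade wins; ties fall to the lower severity here,
        # one arbitrary convention, and exactly the kind of silent choice
        # that an adjudicated consensus avoids.
        counts = Counter(grades)
        top = max(counts.values())
        return min(g for g, c in counts.items() if c == top)

    print(majority_grade([2, 2, 3]))   # -> 2 (clear majority)
    print(majority_grade([1, 2, 3]))   # -> 1 (no majority; convention decides)

A majority vote also discards the reasoning behind each grade, so systematic errors shared by several graders (e.g., uniformly missed microaneurysms) propagate directly into the reference standard.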

Acknowledgments

From Google Research: Yun Liu, PhD, Derek Wu, BS, Katy Blumer, BS, Philip Nelson, BS.

From EyePACS: Jorge Cuadros, OD, PhD.

References (29)

  • P. Ruamviboonsuk et al. Thai Screening for Diabetic Retinopathy Study Group. Interobserver agreement in the interpretation of single-field digital fundus images for diabetic retinopathy screening. Ophthalmology (2006)
  • R. Gargeya et al. Automated identification of diabetic retinopathy using deep learning. Ophthalmology (2017)
  • Grading diabetic retinopathy from stereoscopic color fundus photographs–an extension of the modified Airlie House classification. ETDRS report number 10. Ophthalmology (1991)
  • Diabetic Retinopathy Screening Services in Scotland: A Training Handbook – July 2003: page 17
  • International clinical diabetic retinopathy disease severity scale, detailed table
  • I.U. Scott et al. Agreement between clinician and reading center gradings of diabetic retinopathy severity level at baseline in a phase 2 study of intravitreal bevacizumab for diabetic macular edema. Retina (2008)
  • H.K. Li et al. Digital versus film fundus photography for research grading of diabetic retinopathy severity. Invest Ophthalmol Vis Sci (2010)
  • S. Gangaputra et al. Comparison of standardized clinical classification with fundus photograph grading for the assessment of diabetic retinopathy and diabetic macular edema severity. Retina (2013)
  • J.G. Elmore et al. Variability in radiologists' interpretations of mammograms. N Engl J Med (1994)
  • J.G. Elmore et al. Diagnostic concordance among pathologists interpreting breast biopsy specimens. JAMA (2015)
  • Y. LeCun et al. Deep learning. Nature (2015)
  • A. Esteva et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature (2017)
  • Y. Liu et al. Detecting cancer metastases on gigapixel pathology images. arXiv [cs.CV] (2017)
  • B.E. Bejnordi et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA (2017)

    Financial Disclosure(s): The author(s) have made the following disclosure(s): J.K.: Employee and stock ownership – Google Inc.

    K.W. and G.S.C.: Employees and stock ownership – Google Inc.

    V.G., L.P., and D.R.W.: Employees, stock ownership, and patent submitted – Google Inc.

    E.R.: Consultant – Google and Allergan.

    P.K.: Consultant – Google.

HUMAN SUBJECTS: Human subjects were part of this study protocol. All images were deidentified according to the HIPAA Safe Harbor standard before transfer to study investigators. Ethics review and institutional review board exemption were obtained through Quorum Review IRB.

    No animal subjects were used in this study.

    Author Contributions:

    Conception and design: Krause, Gulshan, Rahimy, Karth, Corrado, Peng, Webster

    Data collection: Krause, Gulshan, Rahimy, Karth, Widner, Peng, Webster

    Analysis and interpretation: Krause, Gulshan, Rahimy, Karth, Peng, Webster

    Obtained funding: N/A

    Overall responsibility: Krause, Gulshan, Rahimy, Karth, Widner, Corrado, Peng, Webster

    Supplemental material available at www.aaojournal.org.

    Equal contribution was provided by L.P. and D.R.W.
