Original article
Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy
Section snippets
Development Dataset
In this work, we built upon the datasets used by Gulshan et al14 for algorithm development and clinical validation. A summary of the datasets and grading protocols used for this study is shown in Figure S1 (available at www.aaojournal.org). The development dataset consists of images obtained from patients who presented for DR screening at EyePACS-affiliated clinics, 3 eye hospitals in India (Aravind Eye Hospital, Sankara Nethralaya, and Narayana Nethralaya), and the publicly
Grading and Adjudication
The baseline characteristics of the development and clinical validation datasets are described in Table 1. The training portion of the development set consisted of more than 1.6 million fundus images from 238 610 unique individuals. The tune portion of the development set consisted of adjudicated consensus grades for 3737 images from 2643 unique individuals. The clinical validation set consisted of 1958 photos from 998 unique individuals. Compared with the clinical validation set, the development set
Discussion
Deep learning has garnered attention recently because of its ability to create highly accurate algorithms from large datasets of labeled data without feature engineering. However, great care must be taken to evaluate these algorithms against the highest-quality "ground truth" labels possible. Previous work14 used a majority decision to generate ground truth, without an adjudication process. The present work suggests that an adjudication process yielding a consensus grade from multiple retina
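The difference between the two reference-standard strategies can be sketched in code. The helper below is illustrative only, not the study's actual protocol: `aggregate_grades` takes one image's grades from several graders, returns the simple majority label, and flags any disagreement so the image could be routed to adjudication rather than accepting the majority vote as ground truth.

```python
from collections import Counter

def aggregate_grades(grades):
    """Aggregate per-image DR grades from multiple graders.

    Returns (label, needs_adjudication): the modal grade, plus a flag
    set whenever graders disagree, so the image can be resolved by
    adjudicated consensus instead of a bare majority decision.
    """
    counts = Counter(grades)
    label, votes = counts.most_common(1)[0]
    needs_adjudication = votes < len(grades)  # any disagreement at all
    return label, needs_adjudication

# Example: three graders on a 0-4 severity scale (0 = no DR).
print(aggregate_grades([2, 2, 3]))  # (2, True): majority exists, but adjudicate
print(aggregate_grades([0, 0, 0]))  # (0, False): unanimous, no adjudication
```

Under a pure majority-decision scheme the first image would simply be labeled grade 2; the adjudication flag is what distinguishes the consensus-based reference standard described here.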
Acknowledgments
From Google Research: Yun Liu, PhD, Derek Wu, BS, Katy Blumer, BS, Philip Nelson, BS.
From EyePACS: Jorge Cuadros, OD, PhD.
References (29)
- et al. Thai Screening for Diabetic Retinopathy Study Group. Interobserver agreement in the interpretation of single-field digital fundus images for diabetic retinopathy screening. Ophthalmology (2006)
- et al. Automated identification of diabetic retinopathy using deep learning. Ophthalmology (2017)
- Grading diabetic retinopathy from stereoscopic color fundus photographs – an extension of the modified Airlie House classification. ETDRS report number 10. Ophthalmology (1991)
- Diabetic Retinopathy Screening Services in Scotland: A Training Handbook – July 2003: page 17
- International clinical diabetic retinopathy disease severity scale, detailed table
- et al. Agreement between clinician and reading center gradings of diabetic retinopathy severity level at baseline in a phase 2 study of intravitreal bevacizumab for diabetic macular edema. Retina (2008)
- et al. Digital versus film fundus photography for research grading of diabetic retinopathy severity. Invest Ophthalmol Vis Sci (2010)
- et al. Comparison of standardized clinical classification with fundus photograph grading for the assessment of diabetic retinopathy and diabetic macular edema severity. Retina (2013)
- et al. Variability in radiologists' interpretations of mammograms. N Engl J Med (1994)
- et al. Diagnostic concordance among pathologists interpreting breast biopsy specimens. JAMA (2015)
- Deep learning. Nature
- Dermatologist-level classification of skin cancer with deep neural networks. Nature
- Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA
Financial Disclosure(s): The author(s) have made the following disclosure(s): J.K.: Employee and stock ownership – Google Inc.
K.W. and G.S.C.: Employees and stock ownership – Google Inc.
V.G., L.P., and D.R.W.: Employees, stock ownership, and patent submitted – Google Inc.
E.R.: Consultant – Google and Allergan.
P.K.: Consultant – Google.
HUMAN SUBJECTS: Human subjects were part of this study protocol. All images were deidentified according to the HIPAA Safe Harbor method before transfer to study investigators. Ethics review and institutional review board exemption were obtained using Quorum Review IRB.
No animal subjects were used in this study.
Author Contributions:
Conception and design: Krause, Gulshan, Rahimy, Karth, Corrado, Peng, Webster
Data collection: Krause, Gulshan, Rahimy, Karth, Widner, Peng, Webster
Analysis and interpretation: Krause, Gulshan, Rahimy, Karth, Peng, Webster
Obtained funding: N/A
Overall responsibility: Krause, Gulshan, Rahimy, Karth, Widner, Corrado, Peng, Webster
Supplemental material available at www.aaojournal.org.
∗ Equal contribution was provided by L.P. and D.R.W.