Introduction

Today’s clinical trials for cancer therapies regularly depend upon the interpretation of medical imaging to establish efficacy. Of the 20,682 cancer trials registered in clinicaltrials.gov over the last decade, 67% feature endpoints hinging upon medical imaging. Image readers for cancer trials, typically radiologists or nuclear medicine experts, can be on-site in the clinical setting or comprise teams of selected experts performing an off-site Blinded Independent Central Review (BICR). The BICR serves as the backdrop for this paper.

The purpose of the BICR is to provide an image assessment independent of the bias and potential functional unblinding that may exist when patient health information is known. The FDA’s current guidance for industry affirms that the BICR “enhances the credibility” and “better ensures the consistency” of imaging assessment [1]. The BICR is employed for most, but not all, cancer trials of all phases; it is deemed especially important in Phases 2 and 3 and necessary in single-arm or unblinded studies. The most common BICR design is referred to as “2 + 1,” in which two independent readers assess the same images for each study subject and one adjudicator settles endpoint disparities between the two outcomes. The duplication of two readers with the addition of the adjudicator enhances the reliability of endpoint assessment and makes the measurement of reader performance possible. Though site readers are expected to have the same sources of variability, measurement of that variability is most often impractical, if possible at all. However, any two readers will inevitably differ in their interpretation to some degree. Research over the last 70 years shows that disagreement between two radiologists is remarkably consistent throughout the decades and comparable to the inherent variability between two physicians within any field. Necessarily, the variability that leads to disagreement is present in all radiologists, whether on-site or at a central reading facility. While differences in interpretation do not inevitably or inherently equate to error, understanding the sources of these differences and their significance will help guide appropriate action: minimizing and controlling these differences when they can be controlled and understanding them when they cannot. The question of whether an independent read using multiple readers provides greater benefit than using individual site investigators is discussed in detail elsewhere [2,3,4,5,6,7]. Methods to measure and monitor these differences and their relationship with sources of variability are presented in the companion paper to this manuscript [8].
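To make the 2 + 1 design concrete, the following minimal sketch (purely illustrative; the response categories and the example adjudication rule are assumptions, not any specific core lab’s procedure) shows where the adjudicator enters the workflow.

```python
# Schematic sketch of a "2 + 1" read: two independent readers assess the same
# case; the adjudicator is invoked only when their endpoint-level assessments
# differ and must choose one of the two original reads.

def final_assessment(reader1: str, reader2: str, adjudicate) -> str:
    """Return the endpoint assessment for one case.

    reader1, reader2: independent assessments, e.g. "PR", "SD", "PD".
    adjudicate: callable used only on disagreement; it selects one of the
    two original reads rather than producing a third, independent opinion.
    """
    if reader1 == reader2:
        return reader1                      # concordant: no adjudication needed
    return adjudicate(reader1, reader2)     # discordant: adjudicator picks one

# Illustrative adjudication rule (an assumption for this example only).
pick_second = lambda r1, r2: r2
print(final_assessment("SD", "SD", pick_second))  # "SD" (no adjudication)
print(final_assessment("SD", "PD", pick_second))  # "PD" (adjudicated)
```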

Reader Disagreements are Consistent with Diagnostic Disagreements in Other Areas of Medicine

The breadth of literature supports the broader medical community’s view that disagreement between experts is inevitable, and at times necessary, in all fields of medicine, including radiology and nuclear medicine. The Society to Improve Diagnosis in Medicine [9] was established in 2011 to reduce misdiagnoses in the clinic, including those made by radiologists. Concern over misdiagnosis has been the main motivation behind 70 years of exhaustive research into the causes of perceptual error in imaging [10]. Contemporary medical imaging associations, such as the Medical Image Perception Society (MIPS), have made extensive contributions over many years to the study of medical image interpretation [11]. Since BICR evaluations include a diagnostic component of lesion assessment, as well as evaluation of disease status over time, the variability between independent clinical trial readers will also necessarily include some of the challenges seen in the clinic [12,13,14,15,16,17,18,19,20].

Disagreement among physicians is an integral part of medicine. Interestingly, disagreement rates are remarkably consistent across the decades for different types of image evaluations, across medical specialties and different technologies, falling between 20 and 40% [20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35] (see Table 1 for a summarized sample of research on evaluator agreement).

Table 1 A Sample of Evaluator Agreements for Different Specialties Since 1947.

Earlier research on radiological discordance focused on disagreements in diagnoses. Other research specifically dedicated to radiological image perception concluded that disagreements are as expected for radiologists as they are for any physician [36,37,38,39,40,41,42,43,44]. In a landmark article on radiological error from 1959, L. Henry Garland [45] stated:

Even experienced physicians are found to have a measurable degree of ‘observer error’ due apparently to the so-called human equation… In evaluating pairs of serial roentgenograms, (two physicians are) apt to disagree … in about one-third of the cases and with (themselves) in one-fifth of them.

Remarkably, even when considering technological advancements in scanners, little has changed in the rate of reader discordance over time. Radiological disagreement rates of approximately 30–40% reported by several papers since 1959 are consistent with the rates shown by Ford in 2016 across a variety of oncologic indications [45,46,47], holding steady across different response criteria, whether quantitative or subjective in form [26, 35, 45,46,47,48,49]. This consistency may be due to the fact that only 5–10% of the information for visual perception comes from the retina, while 90–95% comes from different regions of the brain, including the cortex and brain stem [50]. Therefore, the majority of the inputs that affect visual perception are resident in the brain at the time the images are evaluated.

Why Expert Readers Disagree

The key to taking appropriate action to minimize reader disagreement in a clinical trial setting is understanding what sources of variability are involved. Assessing medical images demands cognitive tasks such as reasoning, problem-solving, and visual perception. Within these tasks, clinical trial readers must not only identify and determine the state of the disease but also determine when the disease has changed enough to cross a criteria threshold. The greatest contributor to inter-reader variability is the radiologist’s own expertise in applying the essentially subjective aspects of the response criteria.

Factors that affect reader performance can be roughly categorized as controllable (e.g. experience/expertise, fatigue, and environment), less controllable (e.g. daily disposition, stress, and internal biases), and not controllable (e.g. random measurement variability and biological heterogeneity). The less controllable factors comprise 89% of radiological disagreement [31, 51]. Factors such as experience level and reader fatigue can be controlled to some degree, are recognized by regulatory authorities as factors to plan for and monitor, and are important enough to be addressed in the FDA Guidance for Industry [1].

Controllable Factors in Reader Performance

Aside from the ambiguity inherent in imaging complex anatomy, sources of reader variability that can be controlled include expertise, training, and the reading environment and setting, including the risk of fatigue [52,53,54]. Radiologists and other readers each have unique levels and types of training and experience that can also contribute to discordance and that can be controlled to an extent by the choice of readers for the study. Familiarity with specific disease indications and with clinical trial review, the extent of specific reader training and knowledge, and familiarity with the response criteria can also sway interpretation. Moreover, controllable factors unrelated to the individual may also have a considerable impact. These include the imaging technique’s quality and limitations, the number of response categories, tumor growth rates relative to image sampling rates, and specific tumor feature characteristics.

Here, we provide additional detail on the major discordance categories to explain why they occur and to lay the groundwork for the reader performance monitoring methods described by Raunig et al. [8] in a companion paper.

Image Interpretation and Experience

Reading medical images requires the detection, interpretation, and appropriate labeling of visible information of interest. Visual information that includes the complex shape, texture, and intensity of the entire image is processed by the visual pathways in a manner that is strongly influenced by experience and by higher functions including learning and memory [41, 55, 56]. Borradaile et al. authored a review of 40 oncology clinical trials across 12 different indications with 12,299 participants and concluded that differences in expert visual interpretation, commonly referred to as “medical image perception,” comprised 77% of the disagreements [57]. Their figure is remarkably consistent with the estimate of 80% independently reported by Kim and Mansfield for radiological diagnostic errors [58].

The importance of clinical trial experience, in addition to clinical experience, has been noted in the industry, as exemplified by Massachusetts General Hospital’s advertisement of its central reading services:

With over 20 years of clinical trial experience, our radiologists understand the unique needs of CROs [Contract Research Organizations] and pharmaceutical companies and are well equipped to handle even the most challenging of trials [59].

Radiological experience plays a particularly critical role when new findings may represent benign or unrelated conditions mistaken for new metastasis, or disease detected in less common manifestations or locations. For example, a pulmonary embolus can appear to be a new lung lesion, mimicking a new pulmonary metastasis. Experience and training on the specific implementation of the clinical trial criteria are also critical. The proliferation of response criteria, over 20 in oncology, increases the chance that the readers, site or central, will misinterpret the criteria and, therefore, commit the same procedural error. For example, in a trial involving prostate cancer and the newly released PCWG3 criteria [60], several readers evaluated according to the older criteria, PCWG2 [61]. As a result of these errors, the scans were re-opened and re-read, the readers were required to undergo refresher training, and a plan of corrective and preventive actions was created and implemented, including a diary entry into the trial master file.

Though there is widespread agreement that more experienced radiologists have better diagnostic sensitivity and specificity than less experienced radiologists, there is no defined threshold for the number of years of experience needed to successfully read in a clinical trial. Some research indicates that between 5 and 10 years of experience as a practicing radiologist may be a useful guideline for recruiting candidate readers [62,63,64]. Additionally, Tucker et al. reported that fewer than 80,000 cases read was an apparent threshold for decreased diagnostic accuracy. A search of clinicaltrials.gov for Phase 2 and 3 studies using RECIST for PFS returned 3429 studies over the last 10 years, with an average of 175 subjects per trial, which may be used to approximate the number of clinical trial cases read when the reader’s curriculum vitae indicates only the number of clinical trials and not the total number of cases (see the sketch below). Interestingly, experience and fatigue interact, with fatigue having a greater influence on performance for readers with less experience [29, 30, 47]. However, this may not always be true since, at times, information on newer scanning techniques that were previously unavailable may compensate for a reader’s lack of experience [65].
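Where a curriculum vitae lists only the number of trials read, the averages quoted above can be used for a rough conversion to case counts. The short sketch below is only a back-of-the-envelope approximation and assumes the reader saw every subject in each listed trial.

```python
# Rough approximation of clinical-trial cases read, using the figures cited in
# the text (3429 RECIST PFS studies over ~10 years, ~175 subjects per trial).
# Assumes the reader read every subject in each trial listed on the CV, so the
# result is best treated as an upper bound.

AVG_SUBJECTS_PER_TRIAL = 175  # average from the clinicaltrials.gov search above

def approx_cases_read(trials_on_cv: int,
                      subjects_per_trial: int = AVG_SUBJECTS_PER_TRIAL) -> int:
    return trials_on_cv * subjects_per_trial

print(approx_cases_read(12))  # a reader listing 12 trials -> ~2100 cases
```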

The following are recommended as considerations when choosing either blinded independent readers for central reads or sites and radiology departments when using site readers:

  • Experience

    • Clinical experience in radiological or nuclear medicine evaluations of the specific indication measured in both years as a practicing clinician and number of patients evaluated;

    • Criteria experience or training;

    • Experience with reading for a clinical trial, including the phase of the trial which may indicate experience with timelines, criteria, and reader workload.

  • Fatigue

    • The number of readers used in a pool to offset the workload on the readers at different parts of the clinical trial (e.g. interim analyses) and at the end of the study.

    • Monitoring or restricting the number of cases read in a single read session or over a longer period (e.g. week or month).

Selecting and Measuring Target Lesions

For most metastatic cancers, response criteria are generally concerned with measuring the change in the patient’s tumor burden. However, measuring all visible lesions in all patients is simply impractical. Therefore, following only a sampled subset of lesions over time is the basis behind most objective response criteria. Radiologists must be able to select “relevant” lesions at baseline that they believe represent the disease burden of the patient and that will continue to be accurately measurable throughout treatment. A radiologist’s ability to select suitable target lesions also depends on their experience-based interpretation of what constitutes a suitable target lesion. For example, a lesion that meets measurability criteria may later coalesce with others at subsequent timepoints (e.g. hilar or mediastinal lymphadenopathy), typically leading to changes in measurements inconsistent with the change in the disease state. In these cases, a radiologist who is experienced in reading for a clinical trial may be more likely to designate that lesion as non-target at baseline. Sridhara et al. point out that a target lesion that cannot be followed by at least one reader can result in missing data and a not-evaluable assessment [66]. Members of PINTAD have observed examples of this in several studies in which the site chose the target lesions for the central readers.

Complicating matters, the specific target lesions readers select often differ, especially in patients with numerous lesions. Because the percentage change will vary, and will meet certain thresholds at different times, depending on the set of lesions selected, these differences in selection have been identified as a major reason for reader disagreement [24, 67] (see the sketch below). Nevertheless, studies show that allowing readers to independently select target lesions does not affect the overall study result and increases the overall reliability of the result by reducing the sampling error of the target lesions that might occur by leaving the target lesion response up to a single reader [68, 69].
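To make the sampling effect concrete, the sketch below uses hypothetical lesion diameters and a RECIST 1.1-style progression threshold (a 20% or greater increase in the sum of diameters; the additional 5 mm absolute rule and nadir tracking are simplified away). Two readers selecting different, equally valid target lesion subsets from the same patient arrive at different percentage changes and different response calls.

```python
# Hypothetical example: two readers choose different target-lesion subsets at
# baseline; the percentage change in the sum of diameters, and therefore the
# threshold crossing, differs even though both assess the same patient.

baseline = {"liver_1": 34.0, "liver_2": 22.0, "lung_1": 15.0,
            "node_1": 18.0, "adrenal_1": 12.0}     # diameters in mm (made up)
follow_up = {"liver_1": 46.0, "liver_2": 24.0, "lung_1": 15.5,
             "node_1": 18.5, "adrenal_1": 13.0}

targets = {"Reader A": ["liver_1", "liver_2", "lung_1"],
           "Reader B": ["node_1", "adrenal_1", "lung_1"]}

def pct_change(lesions):
    base = sum(baseline[name] for name in lesions)
    now = sum(follow_up[name] for name in lesions)
    return 100.0 * (now - base) / base

for reader, lesions in targets.items():
    change = pct_change(lesions)
    call = "PD" if change >= 20.0 else "not PD"   # simplified 20% rule
    print(f"{reader}: {change:+.1f}% -> {call}")
# Reader A: +20.4% -> PD
# Reader B:  +4.4% -> not PD
```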

Detecting New Lesions

New lesions that are still small may escape detection or, even when detected, reader comments in many clinical trials indicate that readers often prefer to wait to confirm that a finding is a growing lesion and not a non-malignant finding. Disagreements on whether a “new lesion” existed at baseline for disease-free survival endpoints can lead to the casewise exclusion of that patient from analysis. Interestingly, researchers from MIPS point out that radiologists can miss or misdiagnose lesions even when directed to the location of interest [11].

The detection of new lesions is not only a matter of perception but also of signal versus noise: small lesions in inherently noisy images. To help increase signal, supplemental imaging modalities may also assist the readers when their acquisition is specified in the study design [70, 71]. Errors by a single reader in new lesion detection account for approximately 10% of all discordances [57].

Recognizing Non-target Lesion Progression

Unequivocal non-target progression should reflect growth in which the “overall tumor burden has increased sufficiently to merit discontinuation of therapy” [72]. Unsurprisingly, disagreements between independent readers regarding non-target lesion progression occur [73,74,75,76,77]. Disagreement on non-target lesion progression comprises about 10% of all disagreements [78] and, while this constitutes a small percentage of all disagreements, these perceptual disagreements can be a source of controversy when discussing patient care. Objective evaluation of non-target lesion response to treatment is chiefly dependent on noticeable morphological degrees of growth, reader experience, and the readers’ internal thresholds for calling unequivocal tumor progression (see also reader bias below). The likelihood and degree of this kind of disagreement can be reduced by ensuring that all readers are jointly trained in the indication, the modality, and the criteria and, most importantly, that the training covers the scenarios that constitute unequivocal progression.

Lesion Measurement

Well-defined, oval or round lesions with clear lesion-to-background discrimination are easier to measure and result in less variability than complex lesions with diffuse or spiculated borders or those with poor discrimination from the background [79, 80]. Some indications, such as hepatocellular carcinoma or ovarian cancers, are particularly challenging to measure. A lesion’s shape, conspicuity, and diffuse or infiltrating borders require varying degrees of subjectivity in its measurement [46]. Lesions are also subject to slight movements and deformations due to patient positioning, breathing, or swallowing; furthermore, the same lesion can demonstrate changes in its diameter even upon immediate re-imaging [79].

In addition, tumor size is typically measured in a single plane by the longest and/or shortest (i.e. widest) diameter, depending on the lesion type and criteria. RECIST measures solid, non-lymphatic tumors along the longest diameter. Therefore, measurement differences among readers due to differences in the determination of the measured edge or the longest axis can affect the target lesion response [81]. Measurement variability between readers for the same lesion, for most tumor types, can be considered only a minor contributor to overall reader variability (intraclass correlation coefficient = 0.991) [82].

Measurement variability can and does lead to different response categories across readers. In one example seen in a clinical trial, one reader assessed stable disease based on a 19.7% increase in the sum of diameters while the other reader assessed progressive disease based on a 20.1% increase in a single brain metastatic tumor. The adjudicator chose stable disease, which resulted in no progression event at that visit. The actual difference in percentage change was small, but the difference in response category highlights the disagreements that are inevitable with criteria that rely on thresholds (see the sketch below).
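A minimal sketch of the threshold arithmetic behind that example follows; the percent-change cut-offs are RECIST 1.1-style values (20% increase for progression, 30% decrease for response), with the nadir assumed equal to baseline and the 5 mm absolute-increase rule omitted for brevity.

```python
# Minimal sketch of threshold-based categorization. The cut-offs follow the
# familiar RECIST 1.1 percentages; the nadir is assumed equal to baseline and
# the separate 5 mm absolute-increase requirement for PD is omitted.

def category_from_change(pct_change: float) -> str:
    if pct_change >= 20.0:
        return "PD"   # progressive disease
    if pct_change <= -30.0:
        return "PR"   # partial response
    return "SD"       # stable disease

print(category_from_change(19.7))  # "SD" -> reader 1: no progression event
print(category_from_change(20.1))  # "PD" -> reader 2: progression event
```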

Image Quality

Inherent in the accuracy of any reader assessment is the quality of the image itself, tempered by the reader’s ability to perceive the true disease state from poorly acquired images. Alpert and Hillman reported that 10 to 24% of diagnostic errors in the clinic are associated with low image quality [83]. Certainly, this also applies to the central review. Low-quality bone scans, computed tomography (CT) scans with motion artifacts, differences in contrast enhancement, inadequate contrast levels, and incomplete data are all image quality issues that make reliable assessments difficult and can increase reader differences. These nuances demand careful attention. Trialists and trial sponsors need clear guidance on the expected image acquisitions and an understanding of the consequences of adjustments made through oversight or, for example, because of patient condition (e.g. a patient who cannot tolerate a full dose of intravenous contrast). Pre-study training should prospectively discuss these scenarios, in particular in the context of their implications for the assessment criteria (e.g., the timing of contrast in mRECIST for HCC).

Missing Clinical Information

Clinical information may direct the visual search to specific anatomical locations or provide context on the nature of a particular finding. Unlike clinical practice, in which the practice of medicine integrates objective disease progression with patient-related medical care factors such as toxicity, medications, incidental findings, and overall clinical health, objective assessment of a patient’s response to treatment by imaging is fundamentally based on the reader having no information about the patient other than what is pre-determined as part of the reader assessment. Accordingly, patient-reported health status, typically available to the clinical radiologist, is withheld from the clinical trial reader to ensure an objective assessment and to reduce the possibility that biases influence the assessment. For example, a reader who knows that the patient’s health status is deteriorating may be biased toward confirming that knowledge by assessing progressive disease (i.e. confirmation bias). The reverse, anchoring bias, might occur if the reader is anchored by the patient’s health status and then fails to adjust their assessment in light of contradictory radiological information. A complete list of the 10 biases that radiologists are prone to was compiled by Busby et al. [54]. To mitigate the risk of bias while still including clinical information critical to the assessment (e.g. biopsy), a careful and prospective determination of what clinical data are helpful in the context of the disease, and how they are to be integrated with the criteria, is strongly recommended and should be included in the image review charter.

Discussion

In 1996, the Clinton-Kessler Oncology Initiative accelerated cancer drug development. With this, centralized imaging rose in importance, becoming a prominent fixture in oncology clinical trials. Accordingly, imaging core labs set out to establish specific read processes for clinical trials. Under the scrutiny of both regulators and sponsors, these processes evolved to establish controls to ensure the reliability of blinded assessments by measuring and monitoring central reader performance, specifically through disagreements between paired readers on teams of two. Though medical imaging is common to most oncology trials, the familiarity of study personnel with what constitutes reasonable disagreement and with when or what corrective measures should be taken can vary greatly. We present in this paper a review of the literature, as well as the experience of PINTAD members, on why readers disagree and what kind of disagreement is to be expected. Fortunately, while errors are bound to occur when multiple readers read a case, the errors are almost always limited to one of the readers, since the chance of multiple readers committing the same error is small. Therefore, multiple-reader paradigms allow for both an overall measure of disagreement and methods to eventually identify which of the readers contributed to that disagreement, provided in great detail in the companion to this paper [8].
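The intuition that both readers rarely make the same mistake can be quantified under the simplifying assumption that reader errors are independent; the error rate used below is purely illustrative.

```python
# If each reader independently errs on a given case with probability p, both
# err on that case with probability p**2, and committing the *same* error is
# rarer still. The value p = 0.10 is illustrative only.

p = 0.10
print(f"a given reader errs:        {p:.3f}")                   # 0.100
print(f"both readers err:           {p ** 2:.3f}")              # 0.010
print(f"at least one reader errs:   {1 - (1 - p) ** 2:.3f}")    # 0.190
```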

A trend seen both by PINTAD members and within the literature on imaging core lab (ICL) reader disagreements is an increasing desire to reduce reader disagreement using methods that also reduce the independence of the two readers. One example is to have a single reader choose the target lesions. This particular read design places target lesion response almost entirely on the abilities of the lesion-selecting reader and reduces the adjudication of disagreement to a choice of how the readers measure lesions rather than of the best set of target lesions. The common use of computer-aided measuring tools would reduce target lesion disagreement to near zero but at the risk of a less-than-optimal choice of target lesions. Another example of misplaced effort to reduce reader disagreement is to allow the readers to discuss the cases before the assessment, which constitutes, essentially, a consensus assessment and which is specifically not accepted by the regulatory authorities as a multiple-reader assessment. These and other examples demonstrate the risk of methods that, in essence, force the readers to agree without regard to the statistical impact on endpoint accuracy.

Many factors that shape reader variability are common to both on-site and off-site reviews. However, the literature comparing imaging core lab and site reads is incomplete. The seminal meta-analysis of 27 studies that concluded equivalence between central and site readers was limited in that over half of the studies had sites and core labs communicating with each other (i.e. not independent) or had protocol amendments that required mitigation of site-related bias [84,85,86,87]. Nevertheless, the processes unique to each setting introduce different degrees of control over reader variability. The largest and most impactful differences include (1) the central read’s consistent use of two independent radiologists and (2) the ability of the central review process to monitor readers for both short-term and long-term trends. Monitoring site readers for bias, drift, and errors is typically impractical, if possible at all, and periodic retraining of all site readers may be needed in long follow-up studies to help ensure the reliability of the site-assessed results.

In studies that use both site and central readers in a hybrid model of reader teams, the natural discordance between two radiologists can have an additional impact. For example, when enrollment requires a measurable lesion, which RECIST also requires for an assessment of Partial Response (PR), a disagreement by one reader on the presence of a measurable lesion will preclude that reader from assessing a PR. These incidents have been noted by the authors and other PINTAD members, and the following recommendations are made to mitigate the impact of site versus BICR discordance:

  • In alignment with planned endpoints, consider central confirmation of baseline requirements, specifically the requirement for at least one measurable lesion to ensure that response is possible, and, in studies that measure relapse for a disease-free survival (DFS) endpoint, central confirmation of the absence of disease at baseline.

  • Require central confirmation of progression or central adjudication of site-central discordance to reduce the possibility of informative censoring.

Independent of the read setting, when the observed disagreement rate is significantly above or below the expected rate, an evaluation should be performed to assess whether the observed rate is justified. In the context of well-trained expert readers working under controlled conditions, a higher disagreement rate may reflect the challenges of the interpretation, such as mixed responses due to the specific choice of target lesions, “borderline cases” that hug the thresholds in slow-growing or visually ill-defined disease, or simply poor image quality. However, if the investigation into unexpected disagreement rates does not suggest the presence of such justifiable differences in interpretation, then inadequate training, fatigue, or other performance-related factors may be the cause. Understanding which disagreements are justified and which can and should be limited and managed greatly adds to the effectiveness of any conclusions and possible remedial actions. Most typically, these remedial actions consist of additional reader training, which, depending on the conclusions drawn, can be as simple as a “Read and Acknowledge” or more involved, such as a discordant case review conducted remotely or in person. Reviewing discordant cases, though not specifically meant to reduce the disagreement rate, does so by resolving the reason for discordance. It may also be helpful to review agreed-upon cases to identify reasons for agreement. In most cases, the review is likely to be most beneficial in identifying any misinterpretations of the criteria, which may greatly reduce subsequent disagreements.
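As one generic way to operationalize such a flag (this is not the monitoring method of the companion paper [8], and the benchmark rate and tallies below are assumptions), an observed pairwise disagreement rate can be compared against an expected benchmark with a simple two-sided binomial test.

```python
# Generic sketch of a disagreement-rate flag: compare the observed pairwise
# disagreement rate against an expected benchmark with a two-sided binomial
# test. The benchmark and counts are hypothetical.
from scipy.stats import binomtest

expected_rate = 0.30                 # assumed benchmark within the 20-40% range
n_cases, n_discordant = 120, 54      # hypothetical interim tally (45% observed)

result = binomtest(n_discordant, n_cases, expected_rate, alternative="two-sided")
observed_rate = n_discordant / n_cases
print(f"observed rate = {observed_rate:.2f}, p-value = {result.pvalue:.4f}")
if result.pvalue < 0.05:
    print("Flag: disagreement rate differs from the expected benchmark; investigate.")
```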

In recent years, clinical researchers have looked to artificial intelligence (AI) to support radiologists’ reads and reduce reader variability [69,70,71,72], in addition to potentially replacing the current assessment criteria. A search of the term “AI” in the 2019 RSNA Annual Meeting program returned 310 tutorials, classes, talks, or posters. Large international research collectives such as PRIMAGE, with access to adequate big data sets, are pursuing deep machine learning that could facilitate personalized imaging biomarkers [88]. We consider these efforts highly promising, despite earlier research in computer-aided radiology suggesting that computer-aided detection had not substantially reduced the incidence of human error [44, 89,90,91,92], and we look forward to further validation. For the present, at least, imaging interpretation relies on human expertise, and reader variability remains an unavoidable reality.

A concern that has arisen recently, and that is sure to become integral to future clinical trials, is the presence of intercurrent events (ICEs) such as those experienced during the COVID-19 pandemic, particularly ICEs that may make it necessary for oncology patients to be scanned at different imaging centers or even with different modalities [93]. While the guidance recommends consulting with the FDA about the impact of alternative imaging centers on efficacy endpoints and on type I and type II error rates, study sponsors may also want to consider the impact of the pandemic on any studies using local evaluations. If site radiologists are unavailable or overworked, sponsors should consider the consistency and availability offered by central readers.

Conclusion: The Significance of Reader Variability in Clinical Trials

It is quite clear that two independent experts will always disagree to some extent. Disagreement rates of 25% to 40% on the interpretation of an image are a reasonable benchmark, based on seven decades of consistent findings. Importantly, variability among readers does not necessarily indicate inadequate performance; instead, it often reflects natural and expected differences and may reveal where multiple interpretations are reasonable. In controlled and monitored reader environments, unexpected levels of disagreement should be flags for further investigation, and changes in disagreement rates can be reliable indicators of some type of change in performance. Correctly identifying the reasons for reader variability may become even more important in the near future as immuno-oncology studies become the standard and different sources of image data noise and confounders become more and more important to mitigate.

The procedures and methods presented in the companion to this article recommend ways to monitor and interpret imaging reviewer performance in most clinical trials [8].