Elsevier

European Journal of Cancer

Volume 53, January 2016, Pages 5-15

Review
Observer variability in RECIST-based tumour burden measurements: a meta-analysis

https://doi.org/10.1016/j.ejca.2015.10.014

Highlights

  • Relative measurement difference between observers may exceed 20% for a single lesion.

  • Variability decreased when measurements were made by a single observer or summed over multiple lesions.

  • Studies on interval change for a single observer or with multiple lesions were lacking.

Abstract

Background

Response Evaluation Criteria in Solid Tumours (RECIST)-based tumour burden measurements involve observer variability, the extent of which ought to be determined.

Methods

A literature search identified studies on observer variability during manual measurements of tumour burdens via computed tomography according to the RECIST guideline. The 95% limit of agreement (LOA) values of relative measurement difference (RMD) were pooled using a random-effects model.
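As a rough illustration of the pooling step, the sketch below combines per-study estimates of one LOA bound with a random-effects model. The abstract does not state which estimator was used, so the DerSimonian-Laird method here is an assumption, as are the function name `pool_random_effects` and the example inputs, which are made-up numbers rather than data from the included studies.

```python
import math


def pool_random_effects(estimates, variances, z=1.96):
    """Pool per-study estimates with a DerSimonian-Laird random-effects model.

    Returns (pooled estimate, CI lower bound, CI upper bound, tau^2).
    """
    if len(estimates) != len(variances) or len(estimates) < 2:
        raise ValueError("need at least two studies with matching variances")
    k = len(estimates)
    # Fixed-effect (inverse-variance) weights and weighted mean.
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    # Cochran's Q quantifies between-study heterogeneity.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, pooled - z * se, pooled + z * se, tau2


# Hypothetical upper LOA bounds (%) and their variances from three studies.
pooled, lower, upper, tau2 = pool_random_effects([22.0, 27.0, 25.0], [4.0, 9.0, 6.25])
print(f"pooled upper LOA = {pooled:.1f}% (95% CI, {lower:.1f}% to {upper:.1f}%)")
```

The lower and upper LOA bounds would each be pooled separately in this way, which matches the form in which the Results report them (a pooled bound with its own confidence interval).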

Results

Twelve studies were included. Pooled 95% LOAs of RMD in measuring unidimensional longest diameters of single lesions ranged from −22.1% (95% confidence interval [CI], −30.3% to −14.0%) to 25.4% (95% CI, 17.2% to 33.5%) between observers and −17.8% (95% CI, −23.6% to −11.9%) to 16.1% (95% CI, 10.1% to 21.8%) for a single observer. Pooled 95% LOAs of RMD in measuring the sum of multiple lesions ranged from −19.2% (95% CI, −23.7% to −14.9%) to 19.5% (95% CI, 15.2% to 23.9%) between observers, and −9.8% (95% CI, −19.0% to −0.3%) to 13.1% (95% CI, 3.6% to 22.6%) for a single observer. Pooled 95% LOA of RMD in calculating the interval change of tumour burden with a single lesion ranged from −31.3% (95% CI, −46.0% to −16.5%) to 30.3% (95% CI, 15.3% to 44.8%) between observers. Studies on calculating the interval change of tumour burden for a single observer or with multiple lesions were lacking.

Conclusion

Interobserver RMD in measuring single tumour burden and calculating its interval change may exceed the 20% cut-off for progression. Variability decreased when tumour burden was measured by a single observer or assessed by the sum of multiple lesions.

Introduction

Assessment of tumour burden is essential in phase II and III oncology trials in order to evaluate the therapeutic effect of anti-cancer drugs. Since the Response Evaluation Criteria in Solid Tumours (RECIST) guideline was introduced in 2000 [1] and revised in 2009 [2] for the standardised assessment of tumour burden, it has been widely implemented in cancer trials. According to this guideline [2], unidimensional longest diameters of target lesions are measured and summed on cross-sectional imaging modalities, principally computed tomography (CT), to capture tumour burdens. The interval change of the measured tumour burden between pre- and post-treatment with an anti-cancer agent is then calculated to determine tumour response. There are four categories of response: complete response, partial response (−30% or less), stable disease (−30% to 20%), and progressive disease (20% or more). Of these, precise determination of progressive disease is especially important as ‘progression-free survival’ is increasingly replacing ‘overall survival’ as a primary end-point in current cancer trials [3].
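The categorisation described above can be sketched in a few lines. This is a simplified illustration using only the percentage thresholds given in the text; it leaves out other RECIST rules (for example new lesions, non-target lesions, and the choice of reference measurement). The function names and the measurements are hypothetical.

```python
def tumour_burden_mm(longest_diameters_mm):
    """Tumour burden as the sum of target-lesion longest diameters (mm)."""
    return float(sum(longest_diameters_mm))


def interval_change_pct(baseline_mm, follow_up_mm):
    """Percentage change in tumour burden from baseline to follow-up."""
    if baseline_mm <= 0:
        raise ValueError("baseline tumour burden must be positive")
    return 100.0 * (follow_up_mm - baseline_mm) / baseline_mm


def categorise(change_pct, all_lesions_disappeared=False):
    """Map an interval change to CR, PR, SD or PD using the thresholds in the text."""
    if all_lesions_disappeared:
        return "CR"
    if change_pct <= -30.0:
        return "PR"
    if change_pct >= 20.0:
        return "PD"
    return "SD"


# Hypothetical case: two observers read the same follow-up scan of one lesion.
baseline = tumour_burden_mm([50.0])
for observer, follow_up in (("A", 59.0), ("B", 61.0)):
    change = interval_change_pct(baseline, tumour_burden_mm([follow_up]))
    print(observer, f"{change:+.1f}%", categorise(change))
```

In this made-up case a 2 mm disagreement on a 50 mm lesion is enough to move the result across the 20% threshold, which is the kind of situation the following paragraph is concerned with.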

The RECIST guidelines depend on measurements of tumour burden being reproducible [1], [2], and response categorisation has been regarded as relatively consistent [4], [5], [6]. However, because target lesions are selected and unidimensional longest diameters are measured manually, discrepancies between repeated readings or between different observers can cause inconsistency in response categorisation [7], [8]. The impact of measurement variability on response categorisation might be underappreciated in cases where changes in measured tumour burden are close to the 20% cut-off value for progressive disease (i.e. 10–30%) and are therefore vulnerable to miscategorisation [9]. Observer variability in measurements tends to attenuate estimates of treatment effect and increase type II errors; this reduces statistical power and requires larger sample sizes, which affects the design of clinical trials [10], [11]. Determining the extent of observer variability in manual tumour measurements may also be a prerequisite for defining cut-off values for semi- or fully automated measurements [12] and for introducing promising alternative continuous end-points, such as time to tumour growth, in clinical trials [13].

To assess the reproducibility and repeatability of tumour size measurements, relevant studies commonly compute the relative measurement difference (RMD) between two measurements as the difference between the two measurements divided either by the first measurement or by the average of the first and second measurements. Agreement between the two measurements is then shown visually using Bland–Altman plots with 95% limits of agreement (LOA), in which the RMDs are plotted against their denominators [14], [15], [16]. Although a number of studies have investigated observer variability in RECIST-based tumour burden measurement, mainly in terms of the RMD, no overall recommendation based on a comprehensive summary of their results has been available so far.
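A minimal sketch of these two quantities follows. The sign convention (second measurement minus first), the function names, and the paired measurements are assumptions made for illustration. The 95% LOA are computed in the usual Bland–Altman way, as the mean difference plus or minus 1.96 sample standard deviations.

```python
from statistics import mean, stdev


def rmd_pct(first_mm, second_mm, denominator="mean"):
    """Relative measurement difference (%) between two measurements.

    denominator="first" divides by the first measurement;
    denominator="mean" divides by the average of the two.
    """
    if denominator == "first":
        base = first_mm
    elif denominator == "mean":
        base = (first_mm + second_mm) / 2.0
    else:
        raise ValueError("denominator must be 'first' or 'mean'")
    return 100.0 * (second_mm - first_mm) / base


def limits_of_agreement(rmds, z=1.96):
    """Return (lower LOA, mean difference, upper LOA) for a list of RMDs."""
    bias = mean(rmds)
    sd = stdev(rmds)  # sample standard deviation
    return bias - z * sd, bias, bias + z * sd


# Hypothetical longest diameters (mm) of five lesions read by two observers.
observer_1 = [22.0, 35.0, 48.0, 15.0, 60.0]
observer_2 = [24.0, 33.0, 50.0, 17.0, 57.0]
rmds = [rmd_pct(a, b) for a, b in zip(observer_1, observer_2)]
lower, bias, upper = limits_of_agreement(rmds)
print(f"mean RMD {bias:.1f}%, 95% LOA {lower:.1f}% to {upper:.1f}%")
```

The choice of denominator matters for interpretation: dividing by the first measurement mirrors how interval change is computed, whereas dividing by the average treats both readings symmetrically.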

Hence, we conducted a meta-analysis to investigate inter- and intra-observer variability in manual measurements of tumour burdens on CT scanning according to the RECIST guidelines.

Search strategy

Two authors (S.H.Y. and K.W.K.) independently performed OVID/MEDLINE and EMBASE database literature searches to identify relevant publications on observer-related variability of RECIST measurements using CT scans. Search terms included keywords relating to ‘tumor’, ‘measurement’, ‘variability’, and ‘CT’. Searches were limited to English-language publications and human studies (Supplemental Table 1). The search was current as of March 2014 and supplemented by screening the bibliographies of

Literature search

Of 9231 references identified during our initial database search, 11 studies were ultimately included in our analysis [15], [16], [20], [21], [22], [23], [24], [25], [26], [27], [28] (Table 1, Fig. 1 and Supplemental Table 2).

There were nine studies (eight inter-observer and five intra-observer) [15], [16], [20], [21], [22], [23], [24], [25], [26] evaluating observer variability in measuring the unidimensional longest diameter of a single lesion and three studies (three inter-observer and two

Discussion

Our meta-analysis revealed that inter-observer variability in both measuring the tumour burden and calculating the interval change for single lesions can be large enough to exceed the 20% RECIST cut-off value for progressive disease. Measurement variability decreased when a single observer measured tumour burden and when multiple lesions were measured. The single-observer results are consistent with those of previous studies, while the decrease in variability with multiple lesions

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (no. 2010-0028631).

Conflict of interest statement

Seokyung Hahn has received a consulting honorarium from Roche and research funding from Novartis, neither of which is related to this study. The other authors have no conflicts of interest to declare.

References (43)

  • J.S. Gill et al.

    Relation between initial blood-pressure and its fall with treatment

    Lancet

    (1985)
  • P. Therasse et al.

    New guidelines to evaluate the response to treatment in solid tumors

    J Natl Cancer Inst

    (2000)
  • R.L. Korn et al.

    Overview: progression-free survival as an endpoint in clinical trials with solid tumors

    Clin Cancer Res

    (2013)
  • C. Suzuki et al.

    Interobserver and intraobserver variability in the response evaluation of cancer therapy according to RECIST and WHO-criteria

    Acta Oncol

    (2010)
  • C. Huanyu et al.

    Evaluation of blinded independent central review of tumor progression in oncology clinical trials: a meta-analysis

    Ther Innov Regul Sci

    (2013)
  • P. Thiesse et al.

    Response rate accuracy in oncology trials: reasons for interobserver variability. Groupe Francais d'Immunotherapie of the Federation Nationale des Centres de Lutte Contre le Cancer

    J Clin Oncol

    (1997)
  • J.J. Erasmus et al.

    Interobserver and intraobserver variability in measurement of non-small-cell carcinoma lung lesions: implications for assessment of tumor response

    J Clin Oncol

    (2003)
  • T. Shao et al.

    Use and misuse of waterfall plots

    J Natl Cancer Inst

    (2014)
  • E.L. Korn et al.

    Measurement error in the timing of events: effect on survival analyses in randomized clinical trials

    Clin Trials

    (2010)
  • S. Hong et al.

    Attenuation of treatment effect due to measurement variability in assessment of progression-free survival

    Pharm Stat

    (2012)
  • D.C. Sullivan et al.

    The imaging viewpoint: how imaging affects determination of progression-free survival

    Clin Cancer Res

    (2013)