Introduction
In young patients with diffuse large B cell lymphoma (DLBCL), a large maximum tumor diameter is an indicator of poor prognosis [1]. Recent progress in lymphoma care has led to recommendations to explore the prognostic value of volumetric tumor bulk measured on staging 2-deoxy-2-[18F]fluoro-d-glucose ([18F]FDG) PET/CT, with methods combining metabolic activity and volume [2]. In lung cancer patients, studies have focused on finding the most reliable tumor segmentation method [3–5]. However, compared with lung cancer, lymphoma segmentation is more challenging due to the higher number of lesions, multiple anatomical locations, and inter- and intratumoral [18F]FDG uptake heterogeneity.
Preliminary data suggest that baseline metabolic tumor volume (MTV) has prognostic value in DLBCL [6–9] and predicts outcome better than bulky disease measured by maximum tumor diameter [7]. Total lesion glycolysis (TLG), defined as SUVmean in a volume multiplied by the corresponding MTV, seems to perform similarly [7] or inferiorly [6, 8] in predicting outcome of DLBCL patients. Various segmentation methods to measure MTV and TLG are being used in clinical lymphoma studies [10]: most use a fixed SUV threshold (e.g., SUV≥2.5 [7, 9] or SUV≥4.0 [11]) or a percentage of SUVmax (e.g., 41 % of SUVmax [6, 8, 12]) to define MTV. An important finding from earlier studies in DLBCL is that optimal MTV cutoff values range widely (220–550 ml), probably because of different methodologies, small patient cohorts, and differences in patient risk factors and therapies [13]. Moreover, these data-driven cutoff values should be interpreted with caution, as they depend highly on acquisition and reconstruction protocols.
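To illustrate how the threshold definitions above translate into numbers, the following minimal Python sketch computes MTV and TLG for a single lesion. The function name mtv_and_tlg, the voxel volume, and the SUV values are hypothetical and chosen only for illustration; voxel connectivity is ignored, so this is not the segmentation used in any of the cited studies.

```python
def mtv_and_tlg(suv_values, voxel_volume_ml, threshold):
    """Return (MTV in ml, TLG) for all voxels with SUV >= threshold.

    MTV = number of selected voxels * voxel volume
    TLG = SUVmean of the selected voxels * MTV
    """
    selected = [s for s in suv_values if s >= threshold]
    if not selected:
        return 0.0, 0.0
    mtv = len(selected) * voxel_volume_ml
    suv_mean = sum(selected) / len(selected)
    return mtv, suv_mean * mtv


# Hypothetical lesion: SUVs of 8 voxels, each assumed to be 0.5 ml.
lesion = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0]
voxel_ml = 0.5

fixed_25 = mtv_and_tlg(lesion, voxel_ml, 2.5)               # fixed SUV>=2.5
fixed_40 = mtv_and_tlg(lesion, voxel_ml, 4.0)               # fixed SUV>=4.0
pct_41 = mtv_and_tlg(lesion, voxel_ml, 0.41 * max(lesion))  # 41 % of SUVmax

for name, (mtv, tlg) in [("SUV>=2.5", fixed_25),
                         ("SUV>=4.0", fixed_40),
                         ("41% SUVmax", pct_41)]:
    print(f"{name}: MTV = {mtv:.1f} ml, TLG = {tlg:.1f}")
```

In this toy lesion the three thresholds select six, five, and four voxels, respectively, which shows why cutoff values derived with one method cannot be transferred directly to another.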
Segmentation methods in these studies are generally derived from phantom experiments [4, 12] or from correlation with pathological specimens in lung cancer [4]. Limited data are available on the ease of use of these tumor segmentation methods in the clinical lymphoma setting and on their interobserver reliability [10]. Previous studies in DLBCL [10], T cell lymphoma [14], and Hodgkin lymphoma [15] showed that different segmentation methods, despite having different cutoff values, have comparable accuracy for predicting survival. Therefore, future use in practice and clinical trials requires a segmentation workflow that is robust, reliable, and easy (i.e., requiring the least observer interaction). To the best of our knowledge, this is the first pilot study in DLBCL that compares interobserver reliability and ease of use of three workflows for measuring MTV and TLG, and that assesses the effect of manual modification on interobserver reliability.
Materials and Methods
Study Population
Twelve baseline [18F]FDG PET/CT scans from newly diagnosed DLBCL patients with wide variation in the number and size of involved organs and lymph node lesions were selected from the international PETRA database (http://www.petralymphoma.org). The use of all data within the PETRA imaging database has been approved by the Medical Ethics Review Committee of the VU University Medical Center (JR/20140414) after patients’ consent to participate in the studies included in the database.
Image Analysis Workflows A and B
Two semi-automated workflows (Workflows A and B) were performed in the same week by three independent observers using the ACCURATE software tool [16]. Manual modifications of the semi-automatically generated volumes of interest (VOIs) were not allowed initially. The workflow with the best interobserver reliability and ease of use was selected as starting point for manual modification in Workflow C.
Workflow A comprised a user-defined selection of individual lesions. The observers had to select individual lesions (by a single mouse click in the “hottest” part of each lesion), followed by automated segmentation in the tool using four separate frequently published segmentation methods:
1. 41 % of SUVmax (41%MAX)
2. A50% of SUVpeak, i.e., 50 % of SUVpeak with local background correction [17] (A50%P)
3. fixed SUV threshold of 2.5 (SUV≥2.5)
4. fixed SUV threshold of 4.0 (SUV≥4.0).
The four segmentation methods were initiated from one single click by the observer, to avoid introduction of extra variability by repeated clicking. Moreover, the tool first calculated a robust local maximum (using a region growing method applying a 70 % threshold of the point clicked) in order to be less dependent on the exact point clicked by the observer. Generated VOIs were summed for all lesions selected by each observer to calculate MTV and TLG according to each of the four segmentation methods.
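The idea of a click-independent starting point can be sketched as follows. This is one possible reading of the description above and not the implementation of the software tool: the function name robust_local_max, the dictionary-based image representation, and the use of 6-connectivity are assumptions made for illustration.

```python
from collections import deque

# 6-connected neighbourhood (an assumption; the source does not specify it).
NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def robust_local_max(suv, seed, fraction=0.70):
    """Grow a region from the clicked voxel `seed` over connected voxels whose
    SUV is at least `fraction` of the SUV at the seed, and return the position
    and value of the highest SUV inside that region.

    `suv` maps (x, y, z) voxel indices to SUV values.
    """
    limit = fraction * suv[seed]
    seen = {seed}
    queue = deque([seed])
    best = seed
    while queue:
        x, y, z = queue.popleft()
        if suv[(x, y, z)] > suv[best]:
            best = (x, y, z)
        for dx, dy, dz in NEIGHBOURS:
            n = (x + dx, y + dy, z + dz)
            if n in suv and n not in seen and suv[n] >= limit:
                seen.add(n)
                queue.append(n)
    return best, suv[best]


# Hypothetical one-voxel-thick row with a lesion peaking at x = 2.
suv = {(x, 0, 0): v for x, v in enumerate([1.0, 7.0, 10.0, 8.0, 2.0])}

click_a = robust_local_max(suv, (1, 0, 0))  # observer clicks left of the peak
click_b = robust_local_max(suv, (3, 0, 0))  # observer clicks right of the peak
print(click_a, click_b)
```

Both clicks end at the same voxel, so any threshold derived from this local maximum would be identical for the two observers.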
To explore the use and performance of consensus methods, two methods were added afterwards, which use the delineations found with the above four standard methods as input for a majority vote (MV) approach [18]. MV volumes were defined by all voxels included in the MTV or TLG by at least two (MV2) or three (MV3) of the input methods.
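The majority vote definition can be sketched in a few lines of Python. The function name majority_vote and the representation of each delineation as a set of voxel indices are assumptions made for this illustration; the input sets below are invented and do not correspond to real segmentations.

```python
from collections import Counter


def majority_vote(masks, min_votes):
    """Return the voxels that are included by at least `min_votes` of the
    input delineations. Each delineation is a collection of voxel indices."""
    votes = Counter(v for mask in masks for v in set(mask))
    return {v for v, n in votes.items() if n >= min_votes}


# Hypothetical delineations of one lesion by four segmentation methods.
method_1 = {1, 2, 3}
method_2 = {2, 3, 4}
method_3 = {1, 2, 3, 4, 5, 6}
method_4 = {2, 3}
masks = [method_1, method_2, method_3, method_4]

mv2 = majority_vote(masks, 2)  # voxels selected by at least two methods
mv3 = majority_vote(masks, 3)  # voxels selected by at least three methods
print(sorted(mv2), sorted(mv3))
```

By construction MV3 is always a subset of MV2, so the MV3 volume can never exceed the MV2 volume for the same lesion.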
Workflow B consisted of a fully automated preselection of [18F]FDG-avid structures defined by an SUV≥4.0 and a volume threshold of ≥3 ml. These preselected regions resulted in an identical starting point for all observers but could include non-tumor regions with normal increased [18F]FDG uptake, such as the brain or bladder. From this starting point, the observers decided on the removal of non-tumor regions by using a clearing option (i.e., single click(s)) or spatial limits to reduce the analyzed field of view (e.g., using a slider option to exclude superior slices including the brain or inferior slices including the bladder), so that only lymphoma lesions remained. In this workflow, a region is therefore defined as any preselected 3D VOI with uptake meeting the SUV≥4.0 threshold, whereas a lesion is defined as a 3D VOI identified by the observer as lymphoma.
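A preselection of this kind can be sketched as thresholding followed by connected-component labeling and a volume filter. The sketch below is not the implementation of the software tool: the function name preselect_regions, the dictionary-based image, the 6-connectivity, and the 1 ml voxel volume are assumptions made for illustration.

```python
from collections import deque

# 6-connected neighbourhood (an assumption; the source does not specify it).
NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def preselect_regions(suv, voxel_volume_ml, suv_threshold=4.0, min_volume_ml=3.0):
    """Return connected regions (sets of voxel indices) with SUV >= threshold
    and a volume of at least `min_volume_ml`.

    `suv` maps (x, y, z) voxel indices to SUV values.
    """
    hot = {p for p, v in suv.items() if v >= suv_threshold}
    regions = []
    while hot:
        seed = hot.pop()
        region = {seed}
        queue = deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBOURS:
                n = (x + dx, y + dy, z + dz)
                if n in hot:
                    hot.remove(n)
                    region.add(n)
                    queue.append(n)
        # Keep only regions that reach the minimum volume.
        if len(region) * voxel_volume_ml >= min_volume_ml:
            regions.append(region)
    return regions


# Hypothetical row of voxels (1 ml each): a 3 ml region, a single hot voxel
# that is too small, and a 4 ml region.
row = [5.0, 5.0, 5.0, 1.0, 6.0, 1.0, 4.0, 4.0, 4.0, 4.0]
suv = {(x, 0, 0): v for x, v in enumerate(row)}

regions = preselect_regions(suv, voxel_volume_ml=1.0)
print(sorted(len(r) for r in regions))
```

Because this step involves no observer input, every observer starts from the same set of regions; variability can only enter when regions are subsequently cleared or the field of view is restricted.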
To determine ease of use for both workflows, each observer noted the total analysis time per patient (including loading of the scan, performing the analysis, and saving results).
In addition, the success of all semi-automatically generated VOIs was rated by each observer according to the following definitions:
- Failed: generated VOI is unrealistic or does not contain complete lesion
- Poor: generated VOI takes into account physiological uptake or contains a lot of background and manual modification is needed
- Acceptable: only minimal manual modification needed for good VOI
- Good: generated VOI is comparable to what you consider to be lymphoma
A mean “success rate” (all acceptable and good ratings) was calculated for each method. Finally, observers had to choose one “preferred segmentation” for the generated VOIs. The MV2 and MV3 consensus methods were rated by one experienced observer according to the same success definitions. As these MV methods were assessed afterwards, they could not be chosen as “preferred segmentation.”
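As a small worked example of the success rate, the sketch below counts acceptable and good ratings per observer and averages the resulting fractions. The ratings are invented, the function name success_rate is hypothetical, and averaging over observers is an assumption about how the mean was taken.

```python
def success_rate(ratings):
    """Fraction of VOIs rated 'acceptable' or 'good'."""
    ok = sum(1 for r in ratings if r in ("acceptable", "good"))
    return ok / len(ratings)


# Hypothetical ratings of four VOIs by three observers for one method.
ratings_by_observer = {
    "observer_1": ["good", "good", "poor", "failed"],
    "observer_2": ["acceptable", "good", "good", "poor"],
    "observer_3": ["good", "acceptable", "acceptable", "good"],
}

rates = [success_rate(r) for r in ratings_by_observer.values()]
mean_success_rate = sum(rates) / len(rates)
print(f"Mean success rate: {mean_success_rate:.0%}")
```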
Image Analysis Workflow C
The observers used the fully automated method as in Workflow B for the analyses on the same twelve scans (Workflow C1). These analyses were performed 3 months later to minimize recall bias. In addition to the interactive deletion of physiological uptake regions as in Workflow B, the observers were allowed in Workflow C to manually modify the generated VOIs by adding missed lesions (with the A50%P option or manually) and by removing physiological uptake with an “eraser” tool. The manually modified MTVs and TLGs were checked for correct delineation and identification of tumor sites (and changed if needed) by independent nuclear medicine physicians (NM, one per observer) with more than 10 years of experience with [18F]FDG PET/CT evaluation in lymphoma (OSH, SFB, SM; Workflow C2).
Statistical Analysis
Success rates of generated VOIs were analyzed descriptively. Interobserver reliability was expressed as intraclass correlation coefficients (ICCs) and coefficients of variation (CoVs). ICC estimates and their 95 % confidence intervals (95%CIs) were calculated with a two-way random-effects model for absolute agreement [19]. The 95%CIs of the ICC values were interpreted as poor (< 0.5), moderate (0.5–0.75), good (0.75–0.9), and excellent (> 0.9) [20, 21]. CoV was calculated per patient as the standard deviation of the MTVs or TLGs (over the three observers) divided by their mean. Mean CoVs are presented, i.e., CoVs averaged over all patients. Bland-Altman plots were drawn to visually assess potential bias of the mean differences between the workflows and to estimate 95 % limits of agreement [22]. Normality of MTV and TLG differences before and after manual modification was checked with the Shapiro-Wilk (SW) test, in which P < 0.05 was an indication of a non-normal distribution. Statistical analyses were performed using SPSS Statistics (IBM, v.20).
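The reliability measures described above can be sketched in Python as follows. This is an illustration under stated assumptions and not a reproduction of the SPSS analysis: the ICC is computed in its single-measures form, ICC(2,1), which the text does not specify; the sample standard deviation (n − 1) is used for the CoV; the limits of agreement use the usual mean ± 1.96 SD; and all function names and data are hypothetical.

```python
from statistics import mean, stdev


def icc_2_1(table):
    """Two-way random-effects, absolute-agreement, single-measures ICC.

    table[i][j] is the measurement of patient i by observer j.
    """
    n, k = len(table), len(table[0])
    grand = mean(v for row in table for v in row)
    row_means = [mean(row) for row in table]
    col_means = [mean(row[j] for row in table) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in table for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-patient mean square
    msc = ss_cols / (k - 1)               # between-observer mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def mean_cov(table):
    """Per-patient SD over observers divided by the per-patient mean,
    averaged over all patients."""
    return mean(stdev(row) / mean(row) for row in table)


def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [x - y for x, y in zip(a, b)]
    bias, sd = mean(diffs), stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical MTVs (ml): 3 patients, 2 observers; the second observer
# measures systematically 2 ml more than the first.
mtv = [[10.0, 12.0], [20.0, 22.0], [30.0, 32.0]]

icc = icc_2_1(mtv)
cov = mean_cov(mtv)
ba = bland_altman([r[0] for r in mtv], [r[1] for r in mtv])
print(f"ICC(2,1) = {icc:.3f}, mean CoV = {cov:.3f}, Bland-Altman = {ba}")
```

In this toy data set the constant offset between observers lowers the absolute-agreement ICC below 1 even though the two observers rank the patients identically.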
Discussion
We assessed the interobserver reliability and ease of use of three workflows for measuring MTV and TLG in 12 DLBCL patients and found that both improved when using a fully automated preselection approach to measure MTV and TLG (using SUV≥4.0 and volume ≥3 ml).
Ilyas et al. [10] compared three MTV segmentation methods (SUV≥2.5, 41%MAX, and PERCIST) in patients with DLBCL and concluded that data-driven optimal cutoff values for separation of patients into a good and a poor prognosis group were largely dependent on the method used, but that these data-driven cutoff values had comparable prognostic accuracy. In a subset of 50 patients evaluated by two observers, they found that interobserver reliability was excellent (ICC > 0.98). They further reported a mean analysis time ranging between 2.7 and 6.2 min for the three methods [10]. The data analysis in our study took more time, possibly because the observers had less experience with the software and because datasheets had to be completed, which was not included in the time per patient reported in the study by Ilyas et al. Yet, in our study too, when total metabolic tumor volume was derived using the preselection and when unwanted normal tissue uptake could be removed and missed lesions could be added by single mouse clicks, the overall processing time was typically less than 5 min. In cases where manual corrections or manual definitions of the VOIs were needed, processing time could well exceed 20 min.
Another important finding in the Ilyas study and our study is that the SUV≥2.5 method showed the highest interobserver reliability. Interestingly though, the observers in our study considered that SUV≥2.5 often overestimated the volume compared with other methods, and it was almost never chosen as their preferred method at the patient level.
However, a recent study (partly by the same authors) showed that a slightly higher threshold (SUV≥4.0) outperformed SUV≥2.5 in terms of success rate [23].
A recent phantom and patient study in primary mediastinal B cell lymphoma that compared four different MTV methods found that SUV≥2.5 resulted in an overestimation, particularly at high SUV values, and that 41%MAX underestimated MTV when there were high levels of heterogeneity [24].
In a publication by Meignan et al. [12], two observers used two percentage-based methods for MTV assessment in DLBCL (41%MAX and a variable SUVmax threshold that visually resulted in optimal segmentations). According to Lin’s concordance correlation coefficient, they found substantial reliability (0.99) for the 41%MAX threshold and poor reliability (0.86) for the variable percentage of SUVmax. This study also suggests that reliability decreases with an increasing level of user interaction.
Based on the ratings of individual lesions, it could be argued that no single semi-automated segmentation method performed well for every patient and for every lesion within a patient. Lymphoma sites can be difficult to segment because of heterogeneity within and between lesions. Some patients have many lesions, making it almost impossible to delineate each lesion. In addition, a visual check of the generated segmentation by an experienced nuclear medicine physician or radiologist is necessary if a semi-automated method is applied, as was illustrated by the outliers in this pilot study. For example, patient 10 showed a large difference between the three workflows (Figs. 1 to 3). The main reason for the large differences in the assessments appeared to be the decision whether the liver was involved or not. Neither the observers nor the NM physicians reached agreement on whether the liver was involved. In clinical practice, access to additional clinical information (e.g., physical examination or lab results) may help to support the decision whether a site is involved or not. This situation illustrates the importance of the development of clear clinical criteria, definitions, and guidelines for lesion selection in PET/CT studies of patients with different lymphoma types [25].
We also compared the results of the observers (who were clinicians, but not NM physicians) before and after the check by the NM physician. It appeared that only small lesions were added, and in a few patients, physiological uptake was erroneously included in MTV, again supporting the need for checking of results by an NM physician.
This study has strengths and limitations that should be taken into account when interpreting the results. First, we deliberately selected patients with a large variation in number and size of lesions. This might be a strength because it represents examples of different challenges that can occur when analyzing MTV in lymphoma, but this could give a higher prevalence of difficult cases compared with the general DLBCL cohort. However, according to the three experienced nuclear medicine physicians, the dataset was representative of a general DLBCL cohort, even though we selected a relatively small number of patients.
Another strength is the comparison of different workflows for MTV and TLG assessment and their impact on interobserver reliability. Most studies acknowledge the difficulties in the assessment of multiple lymphoma lesions. Some used boxes or VOIs to constrain individual tumors [6, 8, 12], or limited segmentation to a representative maximum of 5 lymphoma lesions [26], but none of these studies compared such strategies with another workflow.
A limitation is the dependency of the ICC values on the range of MTV values in the population [21]. This is present in other MTV studies as well and hampers comparability of ICCs within and between studies. Therefore, we also presented CoVs and Bland-Altman plots, which are not dependent on the variability of MTV values among patients.
Finally, a preselection strategy as suggested in this study is not yet widely available in other commercially available (clinical) software tools but could be implemented relatively easily after validation in a larger patient cohort.
Future research should focus on the comparison of a preselection strategy in a larger patient cohort with different segmentation methods, their success rates, and the effect on the prognostic value of MTV and TLG measurements. A possible solution for the problem that none of the methods will be satisfactory in each patient and for each lesion could be the use of an MV approach, which should be investigated further. In addition, the effect of reconstruction settings, different uptake times, and the addition of small lesions on the accuracy of MTV and TLG measurements should be addressed.