In this study, the main objective was to explore the performance dependence of several (semi-)automatic delineation methods [1–3, 14, 17] as a function of different image characteristics in the case of [18F]FDG scans. For all methods substantial variation in bias was observed, but the different methods showed different sensitivities to variations in sphere size, TBR, reconstruction settings, image resolution and noise levels. Secondly, the paper intends to examine the potentially large errors that may occur when using these methods in a non-standardized or non-calibrated manner. We also explored VOISchaefer without calibration and observed very high bias in measured volume (i.e. >38% bias for a 30-mm diameter sphere in the lung), which was strongly reduced after calibration (<7%). Therefore, in the present paper, in line with the recommendations pointed out in [1, 18], only VOISchaefer with calibration was used. An alternative approach would be to harmonize the image quality (i.e. spatial resolution, TBR and quantitative accuracy) across various sites as attempted by the recently published European Association of Nuclear Medicine (EANM) guidelines [19]. This approach would only be required when using methods that cannot be calibrated for specific imaging parameters, e.g. threshold-based methods, either with or without background corrections, in order to ensure inter-institute comparability of PET-based tumour volume assessments.
Not all delineation methods could define tumour volume accurately for all sphere sizes, e.g. SUV2.5 showed large bias in estimating tumour volume (i.e. >25% bias for a 30-mm diameter sphere in the lung). There are no ‘normal’ values of SUV that can be applied to every situation, and it has been shown previously [5] that SUV2.5 can often fail to produce accurate tumour volumes, e.g. when the physiological background activity lies above the fixed threshold. The remaining methods (GradWT, VOI41, VOIA41, VOIRTL and VOISchaefer) provided acceptable accuracy, i.e. for spheres >20 mm they showed biases smaller than 18 and 23% for lung and mediastinum, respectively.
Fixed threshold-based methods (i.e. 41–70% of maximum voxel value) strongly depended on the threshold level chosen. Delineated volumes for higher thresholds are necessarily smaller, resulting in underestimation of volumes. Advanced adaptive threshold-based methods (e.g. VOISchaefer) do not use a fixed threshold level, but instead adapt the threshold to background activity and to tumour volume or mean tumour intensity. The presented results showed minor dependence on noise, spatial resolution, acquisition parameters and reconstruction settings for VOISchaefer, as was expected when calibrating the method. Overall, VOISchaefer seems to perform well over various simulated imaging characteristics.
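The dependence of a fixed percentage threshold on the chosen level can be illustrated with a minimal sketch. It is not the implementation used in this study: the noiseless sphere phantom, the 2 mm voxel size, the TBR of 8, the global thresholding without region growing, and all function names are assumptions made for illustration only.

```python
import numpy as np

def fwhm_to_sigma(fwhm):
    """Convert a Gaussian FWHM to its standard deviation."""
    return fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))

def gaussian_smooth(img, fwhm_mm, voxel_mm):
    """Separable Gaussian smoothing (zero-padded at the image borders)."""
    sigma = fwhm_to_sigma(fwhm_mm) / voxel_mm  # sigma in voxel units
    radius = int(np.ceil(4 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    out = img.astype(float)
    for axis in range(img.ndim):
        out = np.apply_along_axis(np.convolve, axis, out, kernel, mode="same")
    return out

def make_sphere(n=40, voxel_mm=2.0, diameter_mm=30.0, tbr=8.0):
    """Hypothetical phantom: uniform sphere (value tbr) in a background of 1."""
    c = (n - 1) / 2.0
    z, y, x = np.ogrid[:n, :n, :n]
    r_mm = voxel_mm * np.sqrt((z - c) ** 2 + (y - c) ** 2 + (x - c) ** 2)
    mask = r_mm <= diameter_mm / 2.0
    return np.where(mask, tbr, 1.0), mask

def threshold_volume(img, fraction):
    """Number of voxels at or above `fraction` of the maximum voxel value."""
    return int((img >= fraction * img.max()).sum())

def percent_bias(measured, true):
    """Signed bias of a measured volume relative to the true volume, in %."""
    return 100.0 * (measured - true) / true

if __name__ == "__main__":
    img, mask = make_sphere()
    smooth = gaussian_smooth(img, fwhm_mm=7.0, voxel_mm=2.0)
    for f in (0.41, 0.50, 0.70):
        v = threshold_volume(smooth, f)
        print(f"{int(f * 100)}% of max: bias {percent_bias(v, mask.sum()):+.1f}%")
```

Because the voxel sets selected by increasing thresholds are nested, the delineated volume can only shrink as the threshold level rises, which is the behaviour described above.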
Simulation studies
Based on the initial results, only five methods (GradWT, VOI41, VOIA41, VOIRTL and VOISchaefer) were evaluated further in relation to various imaging parameters. The accuracy of these methods was affected by tumour size, TBR, image resolution and noise level. By optimizing the imaging parameters the accuracy of the delineated volume estimates increased for all VOI methods investigated.
There was a large difference in accuracy of delineated volume between unsmoothed and smoothed images and/or at various noise levels. All VOI methods tested showed a poor performance for non-smoothed data, which is likely caused by the high noise levels in the computer-generated images. There are several possible causes for the noise dependence of various VOI methods. First of all, methods which use a percentage of maximum uptake to define the final contour are likely to be more sensitive to noise as noise may result in an upward bias of the maximum value. Consequently, the upward bias in the maximum value may result in higher isocontour values and thus in smaller volumes. Secondly, noise will impact the accuracy and precision of any 3-D region growing technique. Therefore, noise will directly impact the granularity of the observed contours and thereby the accuracy of the observed VOI. When noise levels become too high, 3-D region growing algorithms may fail to generate a meaningful VOI. However, the difference in accuracy of delineated volume between smoothed (additional 5 mm FWHM) and more heavily smoothed (additional 7 mm FWHM) images was much smaller (Table 1 and Supplementary Fig. 2). In general, good accuracy (bias <12%) for the delineation methods was found when using 7 mm FWHM smoothed images. However, smoothing with 7 mm FWHM could induce partial volume effects and loss of detail [14]. The latter effect also explains why most methods have difficulty in providing accurate tumour volumes for small spheres. A lower resolution will also degrade the gradient between tumour and non-tumour tissue and, consequently, it will be more difficult for any VOI method to delineate the tumour boundaries. In the presence of lower gradients, small uncertainties in the actual threshold being used by the VOI method for tumour delineation (as is the case for most VOI methods used in this study) could result in larger ‘displacements’ of the generated contour. In the case of gradient-based methods, lower gradients will result in less accurate assessments of the position of the steepest gradient and thus in increased uncertainty and reduced accuracy of this method at lower resolutions.
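The upward bias of the maximum value under noise, mentioned above as the first cause of noise dependence, can be sketched numerically. The example below assumes a uniform region with additive Gaussian noise, which is a simplification and not the noise model of the simulations in this study; the function name and parameter values are hypothetical.

```python
import numpy as np

def max_bias_from_noise(true_value, noise_sd, n_voxels, n_trials=200, seed=0):
    """Mean excess of the maximum voxel value over the true uniform value.

    Assumption: independent additive Gaussian noise in every voxel.
    """
    rng = np.random.default_rng(seed)
    samples = true_value + noise_sd * rng.standard_normal((n_trials, n_voxels))
    return samples.max(axis=1).mean() - true_value

if __name__ == "__main__":
    for sd in (0.05, 0.10, 0.20):
        excess = max_bias_from_noise(1.0, sd, n_voxels=500)
        # A 41% threshold derived from the maximum rises by the same factor.
        print(f"noise SD {sd:.2f}: maximum overestimated by {100 * excess:.0f}%")
```

Since a percentage threshold scales with the maximum, any overestimation of the maximum raises the isocontour value by the same relative amount, which shrinks the delineated volume.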
The results obtained by changing noise levels and degree of smoothing indicate that there is a sensitive trade-off between noise and resolution. Ideally, images should have high spatial resolution and very low noise levels. However, in clinical practice some filtering is applied to reduce noise levels. As explained above, elevated noise levels may also hamper (semi-)automated tumour delineation and, especially when expected tumour sizes are large and have high FDG uptake, some filtering may be helpful to generate reliable tumour volume estimates. Yet, filtering degrades image resolution which in turn hampers tumour delineation for smaller tumours (e.g. <15 mm diameter) with lower uptakes (TBR <4). Therefore, in practice the trade-off between noise and resolution should be carefully considered and optimization of imaging parameters in combination with calibrating the VOI method (when possible for the envisioned method) is needed depending on the scanner, tracer, VOI method and tumour type and location.
Effects of an edge-preserving bilateral filter for denoising images were also investigated (Table 1 and Supplementary Fig. 3b). After applying the filter to data sets at two noise levels, the accuracy of all methods, except for GradWT, improved. Again this may illustrate the sensitivity of most VOI methods to noise. The reason for the lack of improvement of GradWT is not fully clear, but a possible explanation for overestimation of tumour volume could be that in our implementation a voxel will be assigned to tumour in case two watersheds are competing for the same voxel, i.e. border voxels are assigned as tumour. Further work is ongoing to enhance the performance of this method, e.g. by allowing for fractional voxels and/or using a higher image matrix size (upsampling). In addition, in this paper we explored the effects of noise reduction using Gaussian and bilateral filtering. It should be noted that neither of these filters takes the Poisson nature of noise into account, i.e. that the variance is proportional to the underlying signal. Possibly, tumour delineations will benefit from more sophisticated filtering approaches that include an estimate of local variance.
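The signal dependence of Poisson noise referred to above can be checked with a short sketch. It draws raw Poisson counts only; noise in reconstructed PET images is more complex, so this illustrates the statistical property rather than modelling the images in this study. The function name and sample sizes are hypothetical.

```python
import numpy as np

def poisson_variance_ratio(mean_counts, n=100_000, seed=0):
    """Ratio of sample variance to sample mean for Poisson-distributed counts."""
    rng = np.random.default_rng(seed)
    counts = rng.poisson(mean_counts, size=n)
    return counts.var() / counts.mean()

if __name__ == "__main__":
    for mean_counts in (5, 50, 500):
        ratio = poisson_variance_ratio(mean_counts)
        print(f"mean {mean_counts}: variance/mean = {ratio:.3f}")
```

The ratio stays close to one at every count level, so the variance grows with the signal, whereas a Gaussian filter with a fixed kernel applies the same amount of smoothing everywhere.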
When using an iterative reconstruction algorithm, both quantitative accuracy and noise level depend on the number of iterations. A higher number of iterations not only improves convergence and image contrast, but also increases image noise. Only small differences in bias (<3% lower) were observed when varying the number of iterations for each VOI method. This indicates that the chosen reconstruction setting does not have a large effect on the accuracy of measured tumour volumes. Similar results were shown in a previous study [20] that more extensively evaluated the effects of various reconstruction algorithms and settings. It was shown that accuracy of measured volume varies only slightly with image reconstruction algorithm and that smaller spheres (i.e. <2 ml) were affected more than larger spheres. The latter was also seen in the present study, i.e. accuracy of tumour volume was better for larger (>30 mm) than for smaller spheres.
Using the cross-shaped pattern to identify an averaged maximum or peak value and its location provided results similar to those based on the maximum (single) voxel value. Accuracy of the GradWT and VOIRTL methods was similar for all spheres compared to using a single voxel maximum value. This can easily be understood as neither method uses the maximum (or peak) voxel value. On the other hand, as can be expected, VOI41 and VOIA41 showed a small improvement of 3–6% (Table 1). In addition, the SD of these methods improved slightly when using the cross-shaped pattern, probably because the effects of noisy voxels are reduced by using an average value. Using a cross-shaped pattern did improve performance of percentage threshold-based methods and therefore it is recommended to use this approach for initialization, especially when percentage threshold-based methods are used.
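A peak value based on a cross-shaped pattern can be sketched as follows. The exact pattern is not specified in this section, so the sketch assumes a 3-D cross consisting of the centre voxel and its six face neighbours, with edge replication at the image borders; the function name is hypothetical.

```python
import numpy as np

def cross_peak(img):
    """Return (peak_value, location) using the mean over a 3-D cross.

    Assumption: the cross is the centre voxel plus its six face neighbours;
    borders are handled by replicating the edge voxels.
    """
    p = np.pad(np.asarray(img, dtype=float), 1, mode="edge")
    cross_sum = (
        p[1:-1, 1:-1, 1:-1]
        + p[:-2, 1:-1, 1:-1] + p[2:, 1:-1, 1:-1]
        + p[1:-1, :-2, 1:-1] + p[1:-1, 2:, 1:-1]
        + p[1:-1, 1:-1, :-2] + p[1:-1, 1:-1, 2:]
    )
    cross_mean = cross_sum / 7.0
    loc = np.unravel_index(np.argmax(cross_mean), cross_mean.shape)
    return float(cross_mean[loc]), tuple(int(i) for i in loc)
```

Averaging over seven voxels dampens the influence of a single noisy voxel on the value from which a percentage threshold is derived, which is consistent with the slightly lower SD reported above.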
Phantom studies
Similar to what was observed in the simulation studies there was a limitation in defining volumes for the smaller spheres (diameter <15 mm, Fig. 2). Therefore, the smallest sphere gave large biases for all methods (sometimes >70%). For all delineation methods, the best performance was observed for sphere sizes larger than 15 mm diameter. For the HR+ and the GEMINI TF, VOISchaefer seemed to be the best method on average.
Moreover, this study showed that effective threshold-based methods that correct for local background activity (i.e. VOIA41 and VOIRTL), contrast-oriented methods (i.e. VOISchaefer) as well as gradient-based methods are useful for defining tumour volume. However, optimal percentage threshold level and/or optimal settings strongly depend on imaging parameters. Likewise, VOISchaefer needs reassessment of the method’s parameters as a function of image characteristics (mainly image resolution). This implies that calibration of VOI methods and/or (in combination with) optimization of PET procedures is required when PET images are used for tumour delineation [19].
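The benefit of correcting a relative threshold for local background can be shown with a small sketch. The formula below is one common way to write a background-adapted threshold and is assumed here for illustration; this section does not give the exact definitions of VOIA41 or VOIRTL, and the function names and values are hypothetical.

```python
def fixed_threshold(max_value, fraction=0.41):
    """Threshold as a fixed fraction of the maximum voxel value."""
    return fraction * max_value

def background_adapted_threshold(max_value, background, fraction=0.41):
    """Assumed form: fraction of the contrast above the local background."""
    return background + fraction * (max_value - background)

if __name__ == "__main__":
    background = 1.0
    for max_value in (8.0, 4.0, 2.0):  # decreasing tumour-to-background ratio
        fixed = fixed_threshold(max_value)
        adapted = background_adapted_threshold(max_value, background)
        note = "below background" if fixed < background else "above background"
        print(f"TBR {max_value:.0f}: fixed {fixed:.2f} ({note}), adapted {adapted:.2f}")
```

At low contrast the fixed threshold can drop below the background level, so that background voxels would be included in the volume, whereas the adapted threshold always lies between background and maximum.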
Limitations
Firstly, tumours in both experiments were represented by homogeneous 3-D spheres, thereby excluding effects of tumour shape and heterogeneity. Therefore, even methods that showed good performance in the present paper should be used with care and need to be supervised in the case of (non-spherical) tumours showing heterogeneous tracer uptake. Widely available methods that can accurately deal with variation in imaging characteristics and tracer uptake heterogeneity are needed. In this respect the fuzzy locally adaptive Bayesian method published by Hatt et al. [21] appears to be very promising. Secondly, in the phantom experiments, but not in the simulations, background activity was uniform around the tumour. This is usually not the case in actual human PET studies and higher local uptake (e.g. due to inflammation) may result in errors when defining tumour contours. On the other hand, for the phantom experiments the wall of the spheres, resulting in a shell of ‘zero’ activity around the spheres, may have affected performance evaluation [22]. Yet, phantom study results were similar to those seen in simulation results and vice versa. Finally, this study focused on tumours located in the thorax. Therefore, all methods should be evaluated further for other body regions and using clinical data. Even with these ‘simple’ conditions, however, it is clear that differences in image characteristics, caused by differences in reconstruction settings, image filtering and noise levels, can have a pronounced effect on performance of the (semi-)automatic delineation methods investigated, although magnitude and direction of those effects may be different among (semi-)automatic delineation methods.