To develop gold standard atlases for BP contouring, 12 cadavers (age and gender randomized) were used. The cadavers were embalmed according to the Thiel method because Thiel-embalmed cadavers offer optimal image quality and movement capacity [22, 23]. The latter allowed for the required standardization of the scan position. Magnetic resonance imaging (MRI) of the head-and-neck region was performed to generate high-quality BP delineations that were anatomically validated by dissection. These anatomically validated, MRI-based BP delineations were then rigidly fused to the corresponding CT to obtain BP gold standard delineations that could be used in the radiation therapy planning system. A detailed description was provided by Van de Velde et al. [24]. This study was approved by the ethics committee of University Hospital Ghent (reference number: B67020142069) and complied with the Declaration of Helsinki.
For label fusion, 2 different algorithms in ADMIRE® were compared: STAPLE label fusion [16] and Patch label fusion [17]. The STAPLE algorithm uses a statistical framework that simultaneously estimates the underlying ‘truth’ segmentation and the accuracy of each individual atlas [18]. It ignores the image data and uses only the segmentations when computing the label fusion. In contrast, the Patch algorithm takes the accuracy of the initial image registration into account by comparing the intensity similarity between the atlas and the patient after alignment, in order to improve the label fusion result. This process is called ‘intensity weighting’.
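The idea of jointly estimating a consensus segmentation and per-atlas accuracy can be illustrated with a minimal binary STAPLE-style expectation-maximization loop. This is a simplified sketch, not the ADMIRE® implementation: it assumes a single global prior, a fixed number of iterations and a 0.5 threshold, and the function name `staple_binary` and the toy votes are hypothetical.

```python
def staple_binary(decisions, n_iter=20, init=0.9):
    """Simplified binary STAPLE sketch (illustrative, not ADMIRE's code).

    decisions[j][i] is 1 if atlas j labels voxel i as structure, else 0.
    Returns the thresholded consensus and each atlas's estimated
    sensitivity and specificity.
    """
    n_atlases = len(decisions)
    n_voxels = len(decisions[0])
    # Assumption: global prior = overall fraction of foreground votes.
    prior = sum(map(sum, decisions)) / (n_atlases * n_voxels)
    sens = [init] * n_atlases  # sensitivity of each atlas
    spec = [init] * n_atlases  # specificity of each atlas
    weights = [prior] * n_voxels  # P(true label = 1) per voxel

    for _ in range(n_iter):
        # E-step: posterior probability that each voxel is foreground.
        for i in range(n_voxels):
            a, b = prior, 1.0 - prior
            for j in range(n_atlases):
                if decisions[j][i]:
                    a *= sens[j]
                    b *= 1.0 - spec[j]
                else:
                    a *= 1.0 - sens[j]
                    b *= spec[j]
            weights[i] = a / (a + b) if a + b > 0 else prior
        # M-step: re-estimate each atlas's performance parameters.
        fg = sum(weights)
        bg = n_voxels - fg
        for j in range(n_atlases):
            if fg > 0:
                sens[j] = sum(w for w, d in zip(weights, decisions[j]) if d) / fg
            if bg > 0:
                spec[j] = sum(1.0 - w for w, d in zip(weights, decisions[j]) if not d) / bg

    consensus = [1 if w >= 0.5 else 0 for w in weights]
    return consensus, sens, spec


# Hypothetical votes of 3 atlases on 6 voxels.
atlas_votes = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],  # under-segments
    [1, 1, 1, 1, 0, 0],  # over-segments
]
consensus, sens, spec = staple_binary(atlas_votes)
```

Only the label maps enter this computation, which mirrors the statement that STAPLE ignores image intensities. An intensity-weighted method would additionally scale each atlas's vote by its local similarity to the patient image.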
Procedure
The present study aimed to determine the optimal number of atlases and to compare the STAPLE with the Patch label fusion algorithm for multi-atlas-based BP contouring in ADMIRE® software.
For this purpose, a leave-one-out strategy was followed. One of the 12 available cadaver CT datasets was selected as the patient, and the remaining CT datasets, which contained the anatomically validated BP segmentation, served as atlases. All of the atlases were first registered separately onto the patient using the ‘General’ registration algorithm in ADMIRE®. Next, label fusion was performed with both STAPLE and Patch, first using every possible combination of 2 atlases. Subsequently, label fusion was repeated with a gradually increasing number of atlases, until every possible combination of 11 atlases was reached. This process was repeated with each dataset in turn serving as the patient. This resulted in 24432 combinations across the different numbers of atlases. A power analysis was executed (power π = 80 %) to calculate the minimum sample size required for a 90 % confidence interval.
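The size of this leave-one-out design can be checked by enumeration. The sketch below assumes only what is described above (12 datasets, each used once as the patient, with atlas subsets of size 2 to 11 drawn from the remaining 11); the function and variable names are hypothetical.

```python
from itertools import combinations
from math import comb

N_DATASETS = 12  # cadaver CT datasets available in the study


def leave_one_out_combinations(n_datasets, k):
    """Yield (patient, atlas_subset) pairs for atlas subsets of size k."""
    datasets = range(n_datasets)
    for patient in datasets:
        atlases = [d for d in datasets if d != patient]
        for subset in combinations(atlases, k):
            yield patient, subset


# Closed form: each of the 12 patients has C(11, k) atlas subsets of size k.
counts = {k: N_DATASETS * comb(N_DATASETS - 1, k) for k in range(2, N_DATASETS)}
total = sum(counts.values())  # 24432, matching the number reported in the text
```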
Next, for every generated ‘label fused’ autosegmentation, 3 similarity indices with the gold standard contour were calculated to quantify the accuracy (Table 1):
Table 1
Average Dice similarity coefficient, Jaccard index and True positive rate per number of atlases
Number of atlases | Number of combinations | STAPLE DSC | STAPLE JI | STAPLE TPR | Patch DSC | Patch JI | Patch TPR |
2 | 660 | 0.247 (0.179) | 0.154 (0.131) | 0.188 (0.158) | 0.400 (0.157) | 0.262 (0.124) | 0.416 (0.178) |
3 | 660 | 0.397 (0.184) | 0.265 (0.151) | 0.373 (0.187) | 0.454 (0.157) | 0.307 (0.136) | 0.439 (0.163) |
4 | 3960 | 0.472 (0.171) | 0.325 (0.147) | 0.473 (0.184) | 0.477 (0.161) | 0.328 (0.141) | 0.445 (0.165) |
5 | 5544 | 0.482 (0.153) | 0.331 (0.132) | 0.534 (0.166) | 0.465 (0.149) | 0.316 (0.128) | 0.435 (0.150) |
6 | 5544 | 0.519 (0.138) | 0.362 (0.128) | 0.616 (0.155) | 0.501 (0.146) | 0.347 (0.133) | 0.465 (0.150) |
7 | 3960 | 0.514 (0.129) | 0.356 (0.117) | 0.658 (0.147) | 0.492 (0.144) | 0.339 (0.131) | 0.446 (0.142) |
8 | 1980 | 0.501 (0.120) | 0.343 (0.106) | 0.686 (0.143) | 0.501 (0.140) | 0.346 (0.127) | 0.466 (0.140) |
9 | 660 | 0.532 (0.102)a | 0.369 (0.940)a | 0.726 (0.127) | 0.530 (0.117)a | 0.370 (0.112)a | 0.466 (0.125) |
10 | 132 | 0.510 (0.100) | 0.349 (0.900) | 0.742 (0.127) | 0.524 (0.124) | 0.365 (0.116) | 0.468 (0.121) |
11 | 12 | 0.506 (0.940) | 0.344 (0.840) | 0.760 (0.126)a | 0.530 (0.122) | 0.370 (0.115) | 0.471 (0.115)a |
First, the Dice similarity coefficient (DSC) was calculated between these 2 segmentations. The DSC measures the spatial overlap between the gold standard A and the registered image B, and is defined as DSC(A,B) = 2|A∩B|/(|A| + |B|), where |A∩B| is the intersection volume and |A| and |B| are the volumes of the two delineations. The DSC ranges from 0 to 1, with 0 indicating no agreement and 1 indicating perfect agreement.
We also calculated the Jaccard index (JI) as the ratio of the intersection volume to the entire union volume of the delineations: JI(A,B) = |A∩B|/|A∪B|. The JI also ranges from 0 to 1, with 0 indicating no agreement and 1 indicating perfect agreement.
Lastly, the true positive rate (TPR) was measured between the gold standard BP (A) and the registered BP (B). The TPR is the intersection volume of the two, divided by the volume of the gold standard BP: TPR(A,B) = |A∩B|/|A|. The TPR ranges from 0 to 1, with 0 indicating no inclusion and 1 indicating total inclusion of A by B.
Finally, for each number of atlases, average DSC, JI and TPR were calculated over the different combinations.
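The three indices defined above can be computed directly from the voxels of the two delineations. The following is a minimal sketch that represents each delineation as a set of voxel coordinates; the function name `overlap_indices` and the toy voxel sets are hypothetical and not taken from the study data.

```python
def overlap_indices(gold, auto):
    """Return (DSC, JI, TPR) for a gold standard and an autosegmentation.

    Both arguments are iterables of hashable voxel identifiers,
    for example (x, y, z) index tuples.
    """
    gold, auto = set(gold), set(auto)
    if not gold:
        raise ValueError("gold standard delineation is empty")
    intersection = len(gold & auto)
    dsc = 2 * intersection / (len(gold) + len(auto))
    ji = intersection / len(gold | auto)
    tpr = intersection / len(gold)
    return dsc, ji, tpr


# Toy example: 4 gold standard voxels, 3 autosegmented voxels, 2 shared.
gold_voxels = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
auto_voxels = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
dsc, ji, tpr = overlap_indices(gold_voxels, auto_voxels)
# dsc = 4/7, ji = 2/5, tpr = 1/2
```

The false positive voxel (1, 1, 1) lowers the DSC and the JI but leaves the TPR unchanged, since the TPR denominator contains only the gold standard volume.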
To determine the clinically relevant optimal number of atlases, an equivalence trial was conducted [25, 26]. An equivalence trial is used to demonstrate similarity between compared groups. Equivalence is claimed when the confidence interval of the difference in outcome between the compared groups lies within a predetermined equivalence margin. This equivalence margin represents a clinically acceptable range of differences. For this study, an equivalence margin of 10 % was predetermined.
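The decision rule of such an equivalence test can be sketched from summary statistics. The example below makes several assumptions that the text does not specify: the 10 % margin is read as an absolute difference of 0.10 on the index scale, the confidence interval uses a large-sample normal approximation, and the two groups are treated as independent. The function name `equivalence_by_ci` and the input numbers are illustrative, not results from the study.

```python
from math import sqrt
from statistics import NormalDist


def equivalence_by_ci(mean_ref, sd_ref, n_ref, mean_new, sd_new, n_new,
                      margin=0.10, confidence=0.90):
    """Claim equivalence if the CI of the mean difference lies within +/- margin."""
    diff = mean_new - mean_ref
    se = sqrt(sd_ref ** 2 / n_ref + sd_new ** 2 / n_new)
    # Two-sided interval: a 90 % CI uses the 95th percentile of the normal.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower, upper = diff - z * se, diff + z * se
    return (-margin < lower and upper < margin), (lower, upper)


# Illustrative values: reference group versus a group with slightly lower mean.
equivalent, (ci_low, ci_high) = equivalence_by_ci(0.53, 0.10, 660, 0.51, 0.10, 132)

# A much lower mean falls outside the margin and is not equivalent.
not_equivalent, _ = equivalence_by_ci(0.53, 0.10, 660, 0.25, 0.18, 660)
```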
Only the DSC and JI were appropriate as a reference for the equivalence trial, because for those indices the most accurate segmentation is associated with the highest index values: both indices penalize false positive delineation area. The TPR, by contrast, was not adequate for the equivalence trial because the highest TPR value does not necessarily imply the most accurate segmentation [27], since false positive delineation area is not penalized in this index.
The DSC was chosen over the JI for the equivalence trial because the DSC increases linearly with the correctly delineated volume, whereas the JI does not. Thus, a 10 % (= equivalence margin) increase or decrease in DSC always corresponds to the same change in correctly delineated volume [27]. With the JI, conversely, the amount of correctly delineated volume associated with a 10 % increase or decrease in JI varies with the starting value of the JI, because this index has a non-linear course. For example, an increase in JI value from 0.8 to 0.9 will result in a larger increase in percentage of correctly delineated volume than an increase from 0.2 to 0.3 [27].
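The linear and non-linear behaviour of the two indices can be made explicit with a short derivation. It assumes, for illustration only, that the volumes |A| and |B| are held fixed while the overlap |A∩B| varies; the symbol S is introduced here and does not appear in the original text.

```latex
\begin{align}
S &= |A| + |B| \\
\mathrm{DSC}(A,B) &= \frac{2\,|A \cap B|}{S} \\
\mathrm{JI}(A,B) &= \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{S - |A \cap B|} \\
\mathrm{JI} &= \frac{\mathrm{DSC}}{2 - \mathrm{DSC}}, \qquad
\mathrm{DSC} = \frac{2\,\mathrm{JI}}{1 + \mathrm{JI}}
\end{align}
```

Under this assumption the DSC is proportional to the overlap volume, whereas the JI contains the overlap in both numerator and denominator and therefore does not change by a constant amount per unit of overlap.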
Starting from the number of atlases with the maximal DSC values (reference group), the number of atlases was first gradually increased by one. If, with each increase relative to the reference group, the decrease in DSC (90 % CI) fell within the equivalence margin of 10 %, the groups were considered to be equivalent. This procedure was performed for the two label fusion groups separately [26]. The autosegmentation result was considered more accurate only when equivalent DSC values were combined with significantly higher TPR values, because in that case the equivalence of the DSC values indicates that the increase in false positive delineation area, which is not penalized by the TPR, was kept within bounds.
Next, the number of atlases was gradually decreased by one, starting from the reference group. If, by decreasing the number of atlases each time starting from the reference group, the decrement of the DSC values fell within the equivalence margin, the calculation time could be reduced by using a lower number of atlases without clinically relevant loss in accuracy.
Thereafter, the difference between STAPLE and Patch label fusion was determined using an independent-samples t-test. To this end, the similarity indices of the 2 label fusion groups were compared at their respective clinically relevant optimal number of atlases.
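The test statistic of an independent-samples t-test can be computed from group summaries. The text does not state whether a pooled-variance or an unequal-variance form was used; the sketch below assumes the classic pooled-variance (Student) form and returns only the t statistic and degrees of freedom. The function name `pooled_t_statistic` and the input numbers are illustrative, not results from the study.

```python
from math import sqrt


def pooled_t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Student's two-sample t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Assumes equal population variances.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df


# Illustrative comparison of one similarity index between two fusion methods.
t_value, dof = pooled_t_statistic(0.53, 0.10, 660, 0.50, 0.12, 660)
```

A p-value would then be read from the t distribution with `dof` degrees of freedom, for example with a statistics package.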