Introduction
Multiple sclerosis (MS) is a common chronic neuroinflammatory and neurodegenerative disease [
1]. Demyelinating lesions in the brain and spinal cord, the pathological hallmarks of MS, are detectable in vivo with magnetic resonance imaging (MRI). MRI has therefore become an essential tool for the diagnosis and monitoring of disease activity in MS [
1,
2]. In MS, the lesion volume reflects the inflammatory burden while atrophy measures quantify neurodegenerative aspects of the disease, which play an important role in all disease stages [
3]. Volumetry is therefore commonly used as a secondary endpoint in clinical trials [
4]. Furthermore, volumetry can be helpful in improving our understanding of the disease since atrophy patterns have been shown to be different in MS compared to other demyelinating disorders [
5].
Obtaining robust imaging biomarkers in MS for assessment of the inflammatory and neurodegenerative burden of disease is, however, challenging [
3]. Brain volumetry is influenced by several subject-related factors such as hydration status, inflammation and clinical therapy [
6]. MS lesions can specifically affect tissue segmentations since white matter (WM) lesions can be misclassified as grey matter (GM) or cerebrospinal fluid (CSF) [
7,
8]. Brain volumetry is also impacted by technical factors such as MRI field strength and scanner model, as well as post-processing related issues [
8‐
10]. Understanding the effect and magnitude of technical factors is important when planning MRI studies [
8].
There are several freely available tools for automated brain volumetry that are commonly applied in MS. Popular choices include FreeSurfer [
11], Structural Image Evaluation with Normalisation of Atrophy Cross-sectional (SIENAX) [
12] and Statistical Parametric Mapping (SPM) [
13]. These software packages can automatically pre-process and segment T1-weighted images of the brain. FreeSurfer is computationally demanding and is based on a combined volumetric- and surface-based segmentation that aims to reduce partial volume effects from the convoluted shape of the cortical ribbon [
11]. FreeSurfer uses a template-driven approach to provide a detailed parcellation and segmentation of the cortex and subcortical structures. SIENAX, part of the FMRIB Software Library (FSL), is computationally less demanding but only provides measurements of the gross tissue volumes (WM, GM and CSF) [
12]. FSL-SIENAX relies on registration to the Montreal Neurological Institute 152 template for skull stripping and then performs intensity-based segmentation; the template registration step provides a scaling factor that can be used for normalisation. SPM is based on non-linear registration of the brain to a template and segments brain tissues by assigning tissue probabilities per voxel [
13]. Computational Anatomy Toolbox (SPM-CAT) is an extension for SPM that uses a different segmentation approach based on spatial interpolation, denoising, additional affine registration steps, local intensity correction, adaptive segmentation and partial volume segmentation [
14]. Like FSL-SIENAX, the SPM-based methods are less computationally demanding than FreeSurfer and only provide gross brain tissue volumes.
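As a rough illustration of how per-voxel tissue probabilities translate into a gross tissue volume, the Python sketch below sums the probabilities of one tissue class and multiplies by the voxel volume. This is a generic approach assumed here for illustration, not a description of the internal implementation of SPM or any other tool; the function name, probability values and voxel size are hypothetical.

```python
def tissue_volume_ml(probabilities, voxel_volume_mm3):
    """Sum per-voxel tissue probabilities and convert mm^3 to millilitres."""
    return sum(probabilities) * voxel_volume_mm3 / 1000.0


# Hypothetical grey matter probabilities for five voxels of 1 mm^3 each.
gm_probabilities = [1.0, 0.8, 0.5, 0.2, 0.0]
print(tissue_volume_ml(gm_probabilities, voxel_volume_mm3=1.0))
```

In this scheme, voxels with intermediate probabilities contribute fractionally to the volume rather than being assigned wholly to one tissue class.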
The primary purpose of this study was to compare the repeatability on the same scanner and the reproducibility on different scanners for brain tissue segmentations in FreeSurfer, FSL-SIENAX, SPM and SPM-CAT. A secondary aim was to study the effect of automated lesion filling to reduce MS lesion-related brain tissue segmentation bias.
Discussion
We present a prospective head-to-head comparison of the robustness of four of the most popular freely available brain segmentation tools in a representative real-life MS cohort scanned twice on three different scanners on the same day. New versions of the tested software have recently been released. An important contribution of the current study is therefore that we provide an up-to-date evaluation of the intra- and inter-scanner variability of brain tissue measurements in MS, facilitating an appropriate choice of software for volumetric studies.
We found that the volumetric output differed between the software packages, which is expected given their large technical differences [
11‐
13]. Previous studies of earlier versions of the software have indeed also found significant differences in the output, both numerically and topographically [
24‐
26]. While most previous studies have focused on differences and similarities in the segmentation results [
24‐
26], the current study mainly focused on the robustness of the segmentation tools. Overall, we report that the variability in volumetrics was lower on the same scanner than between scanners, supporting recommendations to follow individuals on the same scanner [
27,
28]. Although brain atrophy rates can be double that of normal aging in untreated MS patients [
29], treated MS patients have atrophy rates around 0.5%/year [
30]. To accurately capture atrophy, the measurement variability therefore needs to be lower than the expected volume change. Our reported coefficients of variation (CoVs) for intra-scanner (0.17–0.92%) and inter-scanner (0.65–5.0%) variability suggest that measurements are feasible within 1–2 years for the most robust methods on the same scanner. In contrast, several years need to pass to be able to capture atrophy on different scanners, even with normalisation.
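The comparison between measurement variability and annual atrophy can be made concrete with a short calculation. The Python sketch below computes a coefficient of variation from repeated volume measurements and makes the simplifying assumption that atrophy becomes detectable once the accumulated, linear volume loss exceeds one CoV. The function names and example volumes are hypothetical; only the 0.5%/year rate and the CoV ranges come from the text, and the CoV estimator used in the study may differ.

```python
from statistics import mean, stdev


def coefficient_of_variation(volumes):
    """Return the CoV in percent: sample standard deviation over the mean."""
    return 100.0 * stdev(volumes) / mean(volumes)


def years_to_exceed(cov_percent, atrophy_percent_per_year=0.5):
    """Years until accumulated atrophy exceeds the measurement CoV.

    Simplifying assumption: linear volume loss and a detection
    threshold equal to one CoV.
    """
    return cov_percent / atrophy_percent_per_year


# Hypothetical scan-rescan brain volumes in millilitres.
rescans = [1500.0, 1503.0]
print(round(coefficient_of_variation(rescans), 2))

# CoV ranges reported above, with the 0.5%/year rate for treated patients.
for label, cov in [("intra low", 0.17), ("intra high", 0.92),
                   ("inter low", 0.65), ("inter high", 5.0)]:
    print(label, round(years_to_exceed(cov), 2))
```

Under these assumptions, the upper intra-scanner CoV of 0.92% corresponds to just under two years, and the upper inter-scanner CoV of 5.0% to ten years.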
SPM-based methods overall had the best repeatability and reproducibility of the four software packages (except for WM segmentations, where FreeSurfer was more robust) and are therefore particularly suitable for cross-sectional MS studies. This is in line with a previous international study of two MS patients scanned at multiple sites and a segmentation challenge in persons with diabetes mellitus and cardiovascular risk factors [
31,
32]. We also found that the whole-brain volume was the most robust volumetric, consistent with previous results [
31,
33]. This could be explained by lower variability with a large volume of interest and a larger contrast difference of CSF versus brain parenchyma compared to GM/WM segmentations. In studies with differences in the MRI protocols, it can therefore be recommended to primarily focus on the brain volume. Interestingly, there was no significant difference in the intra-scanner robustness of the software for the brain volume, meaning that any of the studied software packages is suitable for cross-sectional MS studies of the brain volume.
The current study focuses on some of the most commonly used freely available automated segmentation tools for brain volumetrics in MS, but there are several other segmentation tools available, such as AFNI and BrainSuite. While we provide information on the robustness of the studied software, the choice of software must also take other factors into account, such as which types of images are available, user skills and technical requirements [
8]. In this study, we only provided the T1-weighted images for segmentation, which is the only image contrast that FSL-SIENAX and SPM-CAT are optimised for [
12,
14]. Previous results with segmentation based on multiple contrasts or multi-parametric maps have shown especially good robustness [
32‐
34]. Evaluating such approaches is therefore an interesting avenue for future studies. From a technical standpoint, full functionality of SPM requires a MATLAB license [
13]. A standalone version of SPM or FreeSurfer could be suitable alternatives; FreeSurfer provided more robust normalised measurements between scanners than FSL-SIENAX, consistent with previous results [
35]. While FreeSurfer is computationally more intense than the other software, it also provides more detailed regional morphometry.
Normalisation of the brain volumetrics to the intracranial volume generally improved the comparability of results between scanners, in line with previous recommendations [
8]. This is likely due to a reduction of scaling effects between scanners [
8]. However, using the scaling factor in FSL-SIENAX did not improve the robustness, suggesting that such normalisation may not be sufficient. Overall, normalisation also did not improve the repeatability within scanners for any of the three software packages. This finding likely reflects that normalisation procedures are less critical if measurements are produced on the same scanner. In clinical practice and longitudinal studies it is, however, important to consider that the variability in measurements is likely to be higher than that presented in this study, where all measurements were performed on the same day [
31].
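Why normalisation to the intracranial volume (ICV) reduces scaling effects can be illustrated with a toy calculation. The Python sketch below assumes, hypothetically, that a second scanner inflates all measured volumes by a common factor; dividing by the ICV measured on the same scan then cancels that factor. All values and names are illustrative and are not study data.

```python
def normalise_to_icv(brain_volume_ml, icv_ml):
    """Return the brain volume as a fraction of the intracranial volume."""
    return brain_volume_ml / icv_ml


# Hypothetical true volumes for one subject, in millilitres.
true_brain, true_icv = 1150.0, 1500.0

# Assumed global scaling error per scanner (hypothetical values).
scanner_scale = {"A": 1.00, "B": 1.02}

for name, scale in scanner_scale.items():
    brain = true_brain * scale
    icv = true_icv * scale
    print(name, round(brain, 1), round(normalise_to_icv(brain, icv), 4))
```

In this idealised case the raw volumes differ by 2% between the two scanners while the normalised fraction is the same. Real inter-scanner differences are unlikely to be purely global scaling, which is consistent with normalisation improving but not eliminating inter-scanner variability.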
In terms of the effect of MS lesion filling, we found that lesion filling affected the volumetric results mainly for SPM and SPM-CAT, but also for FSL-SIENAX. These results are consistent with a previous MS study showing increased accuracy of SPM8 segmentations after lesion filling [
36]. Of note, lesion filling had no effect on the FreeSurfer volumes, likely because FreeSurfer specifically segments WM T1-hypointensities and thus takes these into account during the WM segmentation [
11].
This study has some limitations. First, the sample size is small. However, each patient was scanned twice on three scanners, giving 54 measurements in total, and the study showed statistically significant differences in the robustness of the software. Second, the MRI scanners were all from the same manufacturer, while higher inter-scanner variability would be expected with multiple vendors [
31]. Third, although the results of the study could change by adjusting acquisition or processing parameters, these results reflect the standard procedures for MRI in MS at Karolinska University Hospital and we used recommended post-processing options [
11,
13,
20]. There was a difference in the resolution between the FLAIR volumes, which could affect the lesion filling but this difference was consistent for the input of all software. Lastly, the current study focused solely on cross-sectional segmentation methods while the robustness of segmentations can be improved by including a priori knowledge of several time-points [
19,
35,
37]. We therefore recommend that future studies also compare the robustness of longitudinal segmentation methods.
In conclusion, the results highlight the importance of consistently using the same scanner and normalising to the intracranial volume when multiple scanners are used. The output from FreeSurfer, FSL-SIENAX and SPM differs, but all three software packages provide cross-sectional brain volume segmentations with similar intra-scanner robustness. SPM-based methods overall produced the most consistent results, while FreeSurfer had less variability in WM volume segmentations across scanners and was less affected by WM lesions.