Standardizing lesion and atrophy measurement
Volumetric quantification of changes in lesion load and cerebral atrophy depends crucially on tissue-type segmentation, which is influenced by both acquisition- and disease-related factors. Focusing on the disease-related factors, several recent studies have shown that the extent of WM lesions can influence GM atrophy measurements, because WM lesions have MR properties similar to those of GM [43–48]. An interesting approach that has been proposed to counter this problem is lesion inpainting, whereby the signal intensities of lesion voxels are substituted with those observed in normal-appearing WM prior to further analysis [43, 45, 48]. Although this approach appears promising and yields seemingly improved atrophy measurements [44, 46], the effect of the correction may change with the lesion load and with the specific algorithms used for correction and segmentation. For the FAST segmentation software from FSL [37], the choice of partial volume modeling algorithm was shown to exert a clear influence [43]. An obvious limitation of lesion inpainting is that the lesion voxels still have to be identified and correctly segmented before new intensities can be assigned to them prior to GM segmentation.
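To make the inpainting idea concrete, the following is a minimal sketch assuming a T1-weighted volume, a binary lesion mask, and a WM mask are available as co-registered NIfTI files; the file names and the simple global sampling scheme are illustrative assumptions, not the procedure of any of the cited methods:

```python
import numpy as np
import nibabel as nib

# Hypothetical file names; any co-registered T1 image, lesion mask,
# and WM mask in the same space would do.
t1 = nib.load("t1.nii.gz")
lesion_mask = nib.load("lesion_mask.nii.gz").get_fdata() > 0
wm_mask = nib.load("wm_mask.nii.gz").get_fdata() > 0

data = t1.get_fdata()

# Normal-appearing WM: WM voxels not belonging to any lesion.
nawm = wm_mask & ~lesion_mask

# Replace each lesion voxel with an intensity drawn from the NAWM
# distribution, so that the subsequent GM/WM segmentation is not biased
# by the GM-like intensities of the lesions.
rng = np.random.default_rng(0)
data[lesion_mask] = rng.choice(data[nawm], size=int(lesion_mask.sum()))

nib.save(nib.Nifti1Image(data, t1.affine, t1.header), "t1_inpainted.nii.gz")
```

Published methods use more sophisticated local intensity models (e.g., sampling from the NAWM surrounding each lesion rather than from the global distribution), which is one reason why the effect of the correction depends on the inpainting and segmentation algorithms used.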
Ideally, however, tissue segmentation methods for longitudinal studies of MS should tackle these issues automatically, and we recommend that this should be done by concurrently analyzing all tissue classes. Indeed, an attempt at integrated segmentation including both lesion and atrophy assessments for a single timepoint has already been reported [47, 49]. The inclusion of all timepoints available for a patient in a single segmentation process is another step that might improve quantification. Such concurrent analysis of multiple timepoints for one patient has been implemented in the CLADA software for longitudinal cortical atrophy measurement [50] and in the FreeSurfer software package for cortical thickness measurement and deep GM volumetry [51], while another paper demonstrated how difference images, obtained by subtracting co-registered images from two timepoints, may be used in the automated quantification of lesion volume change [52].
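As an illustration of such concurrent multi-timepoint processing, FreeSurfer's longitudinal stream builds an unbiased within-subject template and then initializes each timepoint from it. A minimal sketch of the usual sequence of calls (subject IDs and file names are hypothetical; FreeSurfer must be installed and configured):

```python
import subprocess

# Step 1: independent cross-sectional run for each timepoint.
for subj, image in [("pat01_tp1", "tp1.nii.gz"), ("pat01_tp2", "tp2.nii.gz")]:
    subprocess.run(["recon-all", "-s", subj, "-i", image, "-all"], check=True)

# Step 2: unbiased within-subject template built from all timepoints.
subprocess.run(["recon-all", "-base", "pat01_base",
                "-tp", "pat01_tp1", "-tp", "pat01_tp2", "-all"], check=True)

# Step 3: longitudinal runs, initialized from the common template so that
# within-subject variability is reduced relative to independent processing.
for subj in ["pat01_tp1", "pat01_tp2"]:
    subprocess.run(["recon-all", "-long", subj, "pat01_base", "-all"], check=True)
```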
Development of this type of integrated analysis may take a substantial amount of time, and not all issues may be solvable. It would therefore be prudent to investigate alternative approaches; such approaches could be informed by a detailed analysis of the errors that occur when applying current methods to data already collected in longitudinal studies of MS. While the “holy grail” of a comprehensive segmentation method accessible to all researchers in the field should still be pursued, improvement of existing techniques may be a useful alternative approach.
Most frequent sources of errors
Errors in image analysis in MS studies can be grouped into two main categories: poor registration quality and poor tissue segmentation. In many analyses, the final tissue segmentation is preceded by an algorithm to (approximately) find the intracranial cavity [53, 54]; in that case, a third category is the incorrect inclusion of extracranial tissue in the final segmentation. Errors in each of these categories are often the result of one of a few main causes:
- pathological changes, such as severe atrophy or large WM lesion load;
- image acquisition-related factors, such as incomplete head coverage, inadequate spatial resolution (leading to substantial partial volume effects), poor tissue contrast, limited SNR, and artifacts;
- inherent limitations of the algorithm, possibly aggravated by image acquisition-related factors.
Beyond the obvious (partial) solutions of optimizing the image acquisition for the desired analysis (e.g., using full-head coverage whenever possible) and optimizing the analysis algorithms, there are several additional steps that allow relatively easy correction or prevention of such errors and can substantially improve the quality of the analyses. For group studies, registration errors due to the presence of severe pathology may be limited by using disease group-specific templates rather than standard healthy control templates, together with appropriate regularization of the registration [55]. However, when there are large pathological changes within a single patient, adequate non-linear matching between timepoints remains challenging. Errors in segmentation may be limited by using information from more than one image type, ideally in an integrated segmentation approach as recommended above. For both these issues, challenges remain, and solving them might be facilitated by the standardized test dataset discussed under recommendation (3).
Progress has recently been made in the initial segmentation of the intracranial cavity, often referred to as “brain extraction”. Brain extraction is often imperfect, leaving tissue around the eyes and optic nerves or removing part of the brain tissue, thus potentially introducing large errors in atrophy measurements by tools that rely on brain extraction accuracy. A previous study showed that, for 2D images, manual correction of the brain extraction used by SIENA (BET) increases sensitivity to disease effects in MS [56], but this solution is not feasible for high-resolution 3D images because of the high workload that would be generated. In this case, the brain extraction option settings should be optimized until the best compromise is obtained across all the images to be analyzed. However, a recent paper showed that a single combination of option settings yielded quantitatively very good results across a range of 3D T1-weighted image types in MS patients [57], obviating the need for further adjustment.
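In practice, such an optimization can be scripted. The sketch below runs FSL's BET over a small grid of its fractional intensity threshold (-f), with bias field and neck cleanup (-B) and mask output (-m) enabled, so that the resulting masks can be reviewed and one setting chosen for the whole study; the parameter grid is illustrative and is not the combination recommended in [57]:

```python
import subprocess

# Candidate BET settings to compare on a few representative scans.
# -f: fractional intensity threshold; -B: bias field and neck cleanup;
# -m: also write the binary brain mask. The grid is illustrative only.
for f in ["0.1", "0.3", "0.5"]:
    out = f"brain_f{f}"
    subprocess.run(["bet", "t1.nii.gz", out, "-f", f, "-B", "-m"],
                   check=True)
    # The resulting masks (out + "_mask.nii.gz") can then be inspected
    # visually or compared against a few manually corrected references.
```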
Lesions
Many lesion segmentation algorithms have been proposed, and a useful recent review is given in [58]. We restrict the scope here to fully automated methods and those that require minimal user intervention. The methods are based on several different principles, such as intensity thresholding (e.g., [59, 60]), intensity gradient features (e.g., [61]), intensity histogram modeling of the expected tissue classes (e.g., [49, 62, 63]), identification of nearest neighbors (from a training data set) in a feature space (e.g., [64–66]), or fuzzy connectedness (e.g., [67, 68]), often using several of these in combination. In some cases, spatial (anatomical) information is included in addition to intensities (e.g., [49, 67, 69]). Algorithmic approaches to segmentation optimization include Bayesian methods, expectation maximization, support vector machines (e.g., [70]), k-nearest neighbor majority voting (e.g., [64, 65]), and artificial neural networks (e.g., [71]).
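As a deliberately simplified illustration of the nearest-neighbor family, the sketch below classifies voxels as lesion or non-lesion from multi-contrast intensities plus spatial coordinates using scikit-learn; the synthetic data, feature choice, and threshold are assumptions, and real methods add careful intensity normalization, spatial priors, and postprocessing:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for training data: one row per voxel, with features
# (T1 intensity, FLAIR intensity, x, y, z in a common space) and
# lesion/non-lesion labels from expert-segmented scans.
X_train = rng.normal(size=(5000, 5))
y_train = (X_train[:, 1] > 1.0).astype(int)  # toy rule: bright on FLAIR

knn = KNeighborsClassifier(n_neighbors=15)  # majority vote over 15 neighbors
knn.fit(X_train, y_train)

# Classify the voxels of a new scan; predict_proba gives the fraction of
# lesion neighbors, which can be thresholded or kept as a fuzzy membership.
X_new = rng.normal(size=(1000, 5))
lesion_prob = knn.predict_proba(X_new)[:, 1]
lesion_mask = lesion_prob > 0.5
```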
Although promising results are often reported for images from a single scanner, performance on diverse datasets can be poor because of tissue contrasts that may be unknown to the algorithm. This can result in large fractions of both false positives and false negatives; these misclassifications have proved to be a barrier to widespread adoption, especially in longitudinal studies, if image quality varies over time and the level of misclassification is inconsistent. Incorporation of “domain knowledge”, i.e., prior knowledge of the distribution of MS lesions in the brain, improves the segmentation of lesions [67] but, in our experience, still does not deliver segmentations that are acceptable to researchers in the field. Because of this unreliability, practical lesion segmentation methods are generally not fully automated, and operator intervention is still needed at the level of individual lesions, usually by some form of feature selection based on the local maximum intensity gradient, followed by contour following, e.g., [72–75]. Intra- and inter-observer reproducibilities of such semi-automated contouring are better than those of manual outlining [76, 77], but the method is still labor-intensive. In order to handle the large volumes of imaging data emanating from large therapeutic trials, it would seem appropriate to strive for further, if not complete, automation.
Regarding automated quantification of lesion load change, a recent review by Lladó et al. [78] highlights the state of the art and the remaining challenges for application in a clinical or clinical trial setting. This review includes a table that shows the lack of consistency in the quantitative performance metrics used in the literature, clearly illustrating the need for standardized reporting methods. Lladó et al. classify methods for change quantification as intensity-based analysis, temporal analysis, and deformation-based analysis. An intensity-based approach to the detection of change in lesions over time could exploit a combination of registration and subtraction, as used by Moraal et al. [32, 79, 80]. If an expert reviewer is available, the registration–subtraction approach allows easy identification of change, provided that the changes between timepoints due to atrophy are not too large, or a registration method is used that can deal with the resulting brain shape deformations. It was shown for 2D images that the number of changing T2 lesions observed from the beginning to the end of a trial is statistically more powerful than the number of gadolinium-enhancing lesions from monthly scans [80]. Duan et al. [52] showed the feasibility of automatically quantifying these changes in lesions from the difference images.
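A minimal sketch of the registration–subtraction idea, assuming the two FLAIR timepoints have already been rigidly co-registered and intensity-normalized; the file names and the fixed threshold are illustrative, and published pipelines are considerably more careful about normalization, masking, and artifacts:

```python
import numpy as np
import nibabel as nib

# Two co-registered, intensity-normalized FLAIR timepoints (hypothetical
# file names). Registration and normalization quality dominate the result.
tp1 = nib.load("flair_tp1_reg.nii.gz")
tp2 = nib.load("flair_tp2_reg.nii.gz")
diff = tp2.get_fdata() - tp1.get_fdata()

# Positive differences suggest new or enlarging lesions, negative ones
# shrinking or resolving lesions; the threshold is illustrative.
increase = diff > 0.2
decrease = diff < -0.2

voxel_volume = float(np.prod(tp1.header.get_zooms()[:3]))  # mm^3 per voxel
print("apparent lesion volume increase:", increase.sum() * voxel_volume, "mm^3")
print("apparent lesion volume decrease:", decrease.sum() * voxel_volume, "mm^3")
```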
The methods that Lladó et al. refer to as temporal methods typically handle image series with a large number of timepoints, which is an advantage over subtraction image analysis, which can only handle two timepoints at once. The method proposed by Ait-Ali et al. [81] uses expectation maximization to first estimate non-lesion tissues and then adds lesions to the model. Gerig and colleagues [82] first perform segmentation of GM and WM, and then identify active lesions based on voxel mean and variance over the course of the timepoints. Although the method by Gerig et al. leaves room for improvement, most clearly regarding between-timepoint registration (assumed to be perfect) and the model for the temporal signal evolution of MS lesions (assumed to be highly similar between lesions), it does present a feasible approach to the multiple-timepoint analysis of lesions.
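A toy version of the temporal criterion used by Gerig et al., flagging voxels whose signal fluctuates strongly across timepoints; the synthetic data and the threshold are assumptions, and the published method additionally models the temporal signal course of lesions:

```python
import numpy as np

# series: 4D array (x, y, z, t) of co-registered, intensity-normalized
# scans over many timepoints; synthetic data used here as a stand-in.
rng = np.random.default_rng(0)
series = rng.normal(size=(32, 32, 32, 8))

std_t = series.std(axis=-1)  # per-voxel temporal standard deviation

# Voxels whose intensity fluctuates much more than typical tissue are
# candidate "active lesion" voxels; stable tissue has low temporal variance.
active = std_t > 3.0 * std_t.mean()  # illustrative threshold

print("candidate active-lesion voxels:", int(active.sum()))
```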
Deformation-based methods for lesion change quantification use the local volume change, as calculated through deformable registration, to quantify the change in lesion volume. Two viable methods have been presented, i.e., that by Rey et al. [83], which is based on Thirion and Calmon [84], and that by Pieperhoff et al. [85], but both require additional modeling or operator intervention to indicate the lesion areas whose volume change should be quantified. The lesion segmentation problem therefore still needs to be solved in these approaches.
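The computation underlying this family of methods is integration of the Jacobian determinant of the deformation field over a region of interest; a minimal sketch with a synthetic displacement field and unit voxel spacing (both assumptions):

```python
import numpy as np

# disp: displacement field of shape (x, y, z, 3) mapping timepoint 1 to
# timepoint 2, as produced by a deformable registration; synthetic here.
rng = np.random.default_rng(0)
disp = 0.01 * rng.normal(size=(32, 32, 32, 3))

# Jacobian of the mapping x -> x + disp(x): J = I + grad(disp).
grads = [np.gradient(disp[..., i]) for i in range(3)]  # d(disp_i)/d(x_j)
J = np.zeros(disp.shape[:3] + (3, 3))
for i in range(3):
    for j in range(3):
        J[..., i, j] = grads[i][j]
    J[..., i, i] += 1.0

jac_det = np.linalg.det(J)  # local volume change factor per voxel

# Integrating (det - 1) over a lesion mask gives its volume change
# (in voxels here; multiply by the voxel volume for mm^3).
lesion_mask = np.zeros((32, 32, 32), dtype=bool)
lesion_mask[12:18, 12:18, 12:18] = True
volume_change = float((jac_det[lesion_mask] - 1.0).sum())
print("lesion volume change (voxels):", volume_change)
```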
Three-dimensional imaging with isotropic resolution and multiple image contrasts can be expected to further increase the specificity with which change in lesions can be characterized, both in terms of their spatial location and for distinguishing and interrelating changes in different lesion types. For all these methods, several choices remain to be made, such as the type of registration and whether and how to include prior information on expected lesion- and atrophy-related change; these choices should be informed in part by comparing results against expert manual analysis.
Atrophy
Just as analysis of MS lesions in longitudinal studies is affected by concomitant atrophy, so too does atrophy quantification deteriorate when there are large changes in the lesion load. For example, large changes in atrophy or in lesion volumes may disrupt the accuracy of registration, which is used by many atrophy measurement methods [86, 87].
In normal aging and Alzheimer’s disease (AD), Smith et al. directly compared two whole-brain atrophy measurement techniques, i.e., the (brain) boundary shift integral (BSI [88]) and SIENA, and showed that the methods gave very comparable results [89]. Sample size calculations in RRMS showed that similar sample sizes were required for BSI and SIENA [90]. Using images with simulated atrophy in AD, Camara et al. [91] confirmed the good agreement between BSI and SIENA. More recently, Durand-Dubief et al. [6] selected seven methods for measuring whole-brain atrophy and assessed their reproducibility across different MRI platforms. This study of nine patients, scanned on three occasions over 1 year, each time on two MRI scanners, showed that registration-based methods, i.e., those in which the registration is performed within-subject between timepoints, particularly an optimized BSI method using k-means clustering (KNBSI) and Jacobian integration, gave the best agreement of whole-brain atrophy measures between the two MRI scanners.
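For orientation, the quantity at the heart of BSI is an integral of clipped intensity differences over the brain boundary region of two registered scans; the schematic version below uses synthetic data, and the intensity window and boundary-region construction are simplified relative to the original method [88]:

```python
import numpy as np

# Synthetic stand-in for two co-registered, intensity-normalized T1 scans.
rng = np.random.default_rng(0)
base = rng.uniform(0.0, 1.0, size=(64, 64, 64))
follow = base + 0.01 * rng.normal(size=base.shape)

# Boundary region: voxels near the brain edge at either timepoint. A
# hypothetical precomputed mask stands in for the morphological
# construction used by the original method.
boundary = np.zeros(base.shape, dtype=bool)
boundary[20:44, 20:44, 20:24] = True

# Clip intensities to a window [i_low, i_high] straddling the brain/CSF
# transition, so that only shifts of the boundary contribute.
i_low, i_high = 0.25, 0.75
base_c = np.clip(base, i_low, i_high)
follow_c = np.clip(follow, i_low, i_high)

# Normalized integral of the intensity difference over the boundary
# region: an estimate of brain volume change (in voxels here).
bsi = (base_c[boundary] - follow_c[boundary]).sum() / (i_high - i_low)
print("BSI estimate of volume change (voxels):", float(bsi))
```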
Also in MS, but focusing on local change instead, Battaglini et al. [92] performed a qualitative comparison between two different methods for measuring local changes in atrophy over time. By directly comparing longitudinal VBM (using FSL) and the voxelwise SIENA-R method in the same longitudinal image set from MS patients, who were scanned twice with a 3-year interval between the scans, they showed that the cortical regions in which significant atrophy was observed were roughly similar, but that the extent was very different. This result was perhaps to be expected given the different mechanisms of the two methods: VBM quantifies local GM density and its change over time, while SIENA-R measures displacement of the local brain-non-brain boundary. Nevertheless, this study demonstrates the influence that the choice of analysis method has on the results. Both this difference between SIENA-R and longitudinal VBM and the superiority of (within-subject) registration-based techniques may be explained by the design of the methods: methods that analyze within-subject change over time directly, by concurrently analyzing multiple timepoints, exploit the fact that intra-subject variability is generally smaller than inter-subject variability. These inherently longitudinal methods may therefore be better at quantifying change than methods that treat each timepoint separately.
As indicated in the section on image acquisition, results are also influenced by the choice of imaging parameters, and so tissue contrast and spatial resolution should be optimized. Nevertheless, the CLADA method proposed by Nakamura et al. [50] achieved both accurate measurement of cortical thickness and reliable measurement of cortical thickness change in the low-resolution 2D images that are (still) typical for clinical trials. Accuracy may also differ between local atrophy measurement techniques, as shown quantitatively in the simulated AD atrophy study by Camara et al. [91]: deviations from ground truth atrophy differed between two Jacobian integration methods. Moreover, the mean absolute deviation was up to 93 % of the ground truth volume change for the hippocampus, indicating the need for further method improvement. Partly simulated image data in which the true change is known, as used in their study, may also facilitate such developments in MS, especially when based on representative images from MS patients and made widely available, as recommended below.
In healthy subjects with a mean age of 56.5 years, Takao et al. [93] investigated the effect of scanner performance on whole-brain and local volume change measurements. They showed that scanner drift and inter-scanner variability can produce large apparent volumetric changes in VBM (using SPM), including both increases and decreases. In contrast, a recent paper on MS demonstrated that, with a standardized imaging protocol and identical longitudinal VBM analysis methods, the differences between centers in the longitudinal VBM changes observed in MS patients are much smaller than the disease-related changes, indicating that pooling of data from different centers may be feasible for longitudinal VBM analysis in MS [94]. Such scanner effects are important issues in most large-scale studies in MS, and the discrepancy between these findings merits further investigation.