Introduction
Alzheimer’s disease (AD) is one of the most prevalent health conditions in aging, affecting over 32 million people worldwide [
1,
2]. It is a neurodegenerative disease characterized by an initial asymptomatic stage, which occurs about 20 years before symptom onset, and during which neuronal damage takes place [
3]. The subsequent early symptomatic stage is characterized by a cognitive decline, which is referred to as “mild cognitive impairment due to AD” (herein referred to as MCI) when the clinicians consider the cognitive decline to be the result of the prodromal stage of AD (as opposed to other types of dementia, medication, depression, or other causes) [
4]. As a diagnosis, MCI is useful in predicting future AD, with about 15% of MCI patients converting to AD every year [
5]. Although there is no cure for AD, an early diagnosis of the disease is important to start therapies that can slow down its progression and improve its management [
6]. However, the diagnosis of MCI is difficult, as cognitive decline is also present in healthy aging. This calls for diagnostic biomarkers for both MCI and AD. For both diagnoses, one of the most widely used measures to support the clinical decision is medial temporal atrophy (MTA) as detected by visual inspection of a structural magnetic resonance imaging (MRI) scan [
7,
8]. However, this biomarker is not particularly useful for the
early diagnosis of MCI, as MTA does not occur until late into the disease progression [
9]. Other common biomarkers for MCI and AD, such as fluorodeoxyglucose or amyloid positron emission tomography and cerebrospinal fluid measures, are less accessible, more expensive, and more invasive compared to MRI ones [
7], which would make the latter more appealing for a first-line, early diagnosis.
Given the limitations of the current diagnostic biomarkers, it is necessary to develop a diagnostic tool that can detect early manifestations and accurately diagnose both MCI and AD. Recently, there have been efforts to develop such a tool using machine learning (ML) methodologies applied to structural MRI, with high performance in distinguishing “healthy controls (HC) vs. AD” (80–100% accuracy), “MCI vs. AD” (50–85% accuracy), “HC vs. MCI” (60–90% accuracy), and “HC vs. MCI vs. AD” (59–77% accuracy) [
10,
11]. Importantly, these studies often use the diagnosis at the time of scanning as the “ground-truth” label to train the ML algorithm, limiting the classifier to be only as good as the clinician. However, clinical accuracy is limited at this initial stage, with 20% of MCI patients transitioning to AD and 16% reverting to normal cognition after approximately 1 year [
12]. Therefore, the latest possible diagnosis by the clinician should be used as ground-truth in classifier design, despite in vivo diagnosis still being imperfect even when at later stages (with 10% of patients diagnosed with probable AD not meeting pathological criteria in autopsy [
13]). An imprecise ground-truth limits the quality of the statistical model and its clinical usefulness when an early diagnosis is sought. Some studies consider progression from MCI to AD during a follow-up period, differentiating stable from progressive MCI [
14]. However, to our knowledge, no study using machine learning for MCI and AD diagnosis has considered other conversions from initial diagnosis, specifically MCI reversion to HC and HC progression to MCI. It is unclear whether studies that consider progression from MCI to AD include subjects with other conversions in their sample and, if so, whether the baseline or the current diagnosis is used as a label. Not including these subjects could inflate the performance of classifiers aiming at future diagnosis and also limits their applicability. Thus, to increase classifier quality, in the current study, we considered all types of transitions and used ground-truth diagnostic labels at a minimum of 1-year follow-up and a maximum of 3-year follow-up.
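To make this labeling strategy concrete, the following minimal sketch (the visit encoding and field layout are hypothetical, not the study's actual data format) assigns each subject the latest diagnosis available within the 1- to 3-year follow-up window, regardless of transition direction:

```python
# Hedged sketch of the labeling strategy described: use the latest available
# diagnosis within a 1-3-year follow-up window as the ground-truth label,
# keeping MCI-to-HC reverters and HC-to-MCI progressors alike.
def follow_up_label(visits, min_months=12, max_months=36):
    """visits: list of (months_from_baseline, diagnosis) tuples."""
    window = [(m, dx) for m, dx in visits if min_months <= m <= max_months]
    if not window:
        return None  # subject excluded: no diagnosis within the follow-up window
    return max(window)[1]  # latest diagnosis within the window

# An MCI-to-HC "reverter" keeps the follow-up label, not the baseline one:
print(follow_up_label([(0, "MCI"), (12, "MCI"), (24, "HC")]))  # HC
```

Note that under this scheme the baseline diagnosis is used only as an input feature of the clinical context, never as the training target.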
Importantly, in the present study, we aimed to develop an MRI-based
multi-diagnostic classification biomarker. This classifier is multi-diagnostic in the sense that it differentiates HC, MCI, and AD simultaneously, which the vast majority of studies that consider HC, MCI, and AD do not [
11], and which more closely approximates the decision clinicians face. Furthermore, we used data from multiple datasets and MRI acquisition protocols. Specifically, surpassing a common limitation in the literature, we assessed our classifiers’ generalizability across two publicly available independent datasets [
10]. Second, we assessed generalizability across the two most common T1-weighted acquisition protocols: a magnetization-prepared rapid gradient echo (MPRAGE) sequence and a spoiled gradient echo with inversion recovery preparation (IR-SPGR) sequence. The acquisition protocol has also been shown to impact brain atrophy rate measurements [
15] and image signal-to-noise and contrast-to-noise ratios [
16], but its impact on ML classification has not been studied, and, more importantly, different protocols are often distributed unevenly across diagnostic groups, which may severely bias (in particular, inflate) classification performance.
ML-based diagnosis classification biomarkers should stem from a careful trade-off between complexity and interpretability, such that they exploit complexity to achieve statistical power but retain pathophysiological interpretability, which can be informative and reassuring to the clinician. Most previously published classifiers largely lack interpretability and do not provide any insight into how the classification decision was reached [
11,
17]. We exclusively used simple linear and tree-based ML algorithms, so that feature importances could be extracted from the classifiers and some level of interpretability retained, while still combining classifiers into a more complex ensemble. To further increase complexity beyond raw morphometric features extracted from the MRI scan, previous studies have built ML classifiers using structural T1-weighted MRI graph theory (GT) features for AD diagnosis. Morphometry-derived GT features allow the classifiers to account for complex inter-dependence between brain regions, thus potentially increasing predictive power. Studies using GT features have obtained good performance in “HC vs. AD” classification (87.0–92.4% area under the receiver operating characteristic curve (AUC)) [
18‐
21] and in multi-diagnostic “HC vs. MCI vs. AD” classification (72.9% AUC) [
18]. However, only one of these studies [
21] reported the performance difference between using GT vs. morphometric features, having found that thickness-based GT features result in very modest improvement when compared to raw thickness measures (92.4% from 91.6% AUC, respectively, in “HC vs. AD”). Since GT metrics may reduce classifier interpretability, these should only be used if the improvement obtained outweighs the loss of interpretability. In the present study, we build classifiers with and without GT metrics to determine whether including such metrics is preferable. We used structural MRI GT features which have all been previously associated with AD (degree [
22], clustering coefficient [
23], node betweenness centrality [
24], and eigenvector centrality [
25]). However, unfortunately, we could not herein predict the direction of these effects, as they can depend on edge definition [
26] and vary across studies [
27].
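As an illustration of how such GT features can be derived from morphometry, the minimal sketch below (region names and correlation values are hypothetical; this is not the study's pipeline) binarizes an inter-regional correlation matrix at a chosen threshold and computes two of the node-level metrics listed above, degree and clustering coefficient:

```python
# Sketch: graph-theory (GT) features from a morphometry-derived network.
# Edges: binarize the inter-regional correlation matrix at a threshold,
# then compute per-node metrics on the resulting binary graph.

def binarize(corr, threshold):
    """Adjacency dict from a symmetric correlation dict {(i, j): r}."""
    adj = {}
    for (i, j), r in corr.items():
        if i != j and abs(r) >= threshold:
            adj.setdefault(i, set()).add(j)
            adj.setdefault(j, set()).add(i)
    return adj

def degree(adj, node):
    return len(adj.get(node, ()))

def clustering(adj, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    nbrs = adj.get(node, set())
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj.get(u, set()))
    return 2.0 * links / (k * (k - 1))

# Toy 4-region example (hypothetical correlations between regional measures):
corr = {("hip", "ent"): 0.9, ("hip", "cing"): 0.7,
        ("ent", "cing"): 0.6, ("cing", "front"): 0.4}
adj = binarize(corr, threshold=0.5)
print(degree(adj, "hip"))      # 2
print(clustering(adj, "hip"))  # 1.0 (its neighbours ent and cing are connected)
```

Note how the result depends on the binarization threshold (the cing-front edge is dropped at 0.5), which is one reason effect directions can vary with edge definition.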
The ML biomarker literature is also rife with unclear reporting, which hinders comparability between findings and obscures their clinical applicability [
28]. We aimed to report classifier performance transparently and comprehensively. Along with commonly reported metrics such as sensitivity, specificity, and AUC, we also report more robust metrics such as the Matthews correlation coefficient (MCC) [
29,
30], as well as negative and positive predictive values (NPV and PPV) which are standardized (for better comparability) [
31] and prevalence-adjusted (for better clinical context interpretation). We also report performance stratified by sex, which may be relevant in clinical practice, as well as confusion matrices where possible, allowing readers to calculate additional (unreported) metrics. Finally, we sought to evaluate our classifier using our own biomarker evaluation framework published in 2014 [
32], to inform on our most inclusive classifiers’ potential clinical applicability.
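For concreteness, these metrics can be computed from a binary confusion matrix as sketched below. The counts and the prevalence value are made-up illustration values, and the prevalence adjustment shown is the standard Bayes-rule formulation, which may differ in detail from the standardization procedure of the cited work:

```python
# Illustrative sketch (not the paper's code): reported metrics from a
# binary confusion matrix, with prevalence-adjusted predictive values.
import math

def metrics(tp, fn, fp, tn, prevalence):
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    bac = (sens + spec) / 2  # balanced accuracy
    # Matthews correlation coefficient: robust to class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # Prevalence-adjusted predictive values: replace the sample's class mix
    # with a clinically realistic prevalence (Bayes' rule on sens/spec).
    ppv = sens * prevalence / (sens * prevalence + (1 - spec) * (1 - prevalence))
    npv = spec * (1 - prevalence) / (spec * (1 - prevalence) + (1 - sens) * prevalence)
    return {"sens": sens, "spec": spec, "BAC": bac, "MCC": mcc,
            "PPV": ppv, "NPV": npv}

# Hypothetical test-set counts with an assumed 15% clinical prevalence:
m = metrics(tp=40, fn=10, fp=5, tn=45, prevalence=0.15)
```

The adjustment matters because PPV and NPV computed directly from a case-enriched research sample can be badly misleading in a screening setting where disease prevalence is much lower.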
In sum, we report an ML-based diagnostic tool for MCI and AD which tackles limitations of previous studies, specifically by being (1) multi-diagnostic; (2) trained and tested across 2 independent data sources with multiple acquisition protocols; (3) with tests of generalizability across datasets and protocols, the latter being completely novel in the context of AD; (4) based on baseline scans and a follow-up diagnosis, regardless of progression; (5) with transparently reported performance; and (6) with an evaluation of potential clinical applicability.
Discussion
Early diagnosis of MCI and AD is essential for early therapies and better disease management [
6]. For this, we developed an ML-based tool which is (1) multi-diagnostic; (2) trained and tested across 2 independent data sources with multiple acquisition protocols; (3) with tests of generalizability across datasets and protocols, the latter being completely novel in the context of AD; (4) based on baseline scans and a follow-up diagnosis, regardless of progression; (5) with transparently reported performance; and (6) with an evaluation of potential clinical applicability. Results showed our tool performed well in differentiating AD, MCI, and HC (0.438 MCC; 62.1% balanced accuracy (BAC); 77.7% AUC), with an accuracy above chance level (i.e., 33.3% BAC given the 3 possible diagnoses), regardless of diagnosis at the time of scanning, when tested against a follow-up diagnosis of at least 1 year and at most 3 years. Our tool also performed equally well when tested on data from datasets and acquisition protocols not used in training, meaning it is likely to generalize well to independent data. Additionally, since our classifiers performed equally well without GT features, which add complexity to the classifiers, we found no advantage in including GT features. Finally, we reported the relative importance of different brain regions and morphometric feature types for the diagnostic classification, which might aid the clinical interpretability of the classifiers.
An MRI-based ML classifier: performance comparison with previous reports
Several recent studies claim classifier interpretability or report feature weights for the classifiers. Those which developed classifiers for AD diagnosis (“HC vs. AD”) obtained accuracies of 85.0% [
51] and 87.2% [
52], similar to ours of 90.6% BAC on the combined ADNI+OASIS dataset (experiment B5); MCC of 0.666 [
53], lower than our ours of 0.811 MCC (experiment B5); and AUC of 98.0% [
54], 95.1% [
55], and 90.6% [
56], comparable to ours of 97.4% (experiment B5). Importantly, all these studies, except [
52] and [
56], report feature importances at the level of the individual, which we do not. Only one study [
53] used an additional AD dataset besides ADNI. Another study [
57] using simple ML classifiers and both the ADNI and OASIS datasets obtained a performance similar to ours of 87% BAC on the ADNI dataset, but a lower performance of 70% on the OASIS dataset. While feature weights were not reported, the authors mention having extracted them from the classifiers. Finally, a study [
58] reporting landmarks with high discriminative power, when using ADNI-1 for classifier training, obtained an MCC of 0.819 testing in ADNI-2 and 0.839 testing in MIRIAD, which is better than both our experiments where test data was from a different dataset than training data (experiment B8, trained on OASIS and tested on ADNI, had an MCC of 0.739; experiment B9, trained on ADNI and tested on OASIS, had an MCC of 0.674). This study, however, might have suffered from data leakage in the form of biased transfer learning [
59].
Our achieved performance on the ADNI dataset alone for the multi-diagnostic “HC vs. MCI vs. AD” (experiment A5: 62.1% BAC, whereas a random classifier would have a BAC of 33%) is comparable to that of recent studies which also report feature weights for the classifiers or claim classifier interpretability. These range from 51.9 to 71.2% in accuracy [
54,
60‐
62]. One study [
54] achieved a higher multi-diagnostic “HC vs. MCI vs. AD” accuracy (albeit unbalanced) of 71.2% using a region abnormality score obtained using a deep neural network which, despite providing individual-level abnormality scores, does not make evident how the characteristics of the scan contribute to a region’s abnormality score. Furthermore, this study only considered baseline diagnosis for this classifier, with MCI subjects who had transitioned to AD being labeled as MCI. A second study [
60] obtained a “HC vs. MCI vs. AD” multi-diagnostic BAC similar to ours of 62.5% on the ADNI dataset using an ensemble of random forest classifiers, which is highly interpretable. A third study [
61] reporting feature importances from a random forest classifier obtained an “HC vs. MCI vs. AD” multi-diagnostic BAC of 51.9%, which was lower than ours. A fourth study [
62] obtained a “HC vs. MCI vs. AD” multi-diagnostic BAC of 61.9%, similar to ours, using an ensemble of linear SVMs, but does not report feature importances, only the frequency with which each feature was selected by the ensemble. Finally, the latter three studies only considered MCI-to-AD transitions when labeling subjects [
60‐
62].
Hippocampal and cingulate, frontal, and other temporal changes contribute most
Extracted feature contributions by brain regions obtained from the “HC vs. AD” classifiers (experiments A5 and B5, in Figs.
3A and
4A, respectively) are in accordance with our current etiological and neuroscientific understanding of AD. Hippocampal features were the strongest contributors to the classification, at approximately 25–45%, which is consistent with the understanding that hippocampal atrophy is a structural hallmark of AD in the brain [
63]. Temporal regions were the next most highly weighted, at approximately 13%, followed by cingulate and frontal regions, each contributing approximately 8–12% for the classifier decision, and all have been independently associated with AD and its progression [
64‐
66]. The remaining regions combined contribute approximately 25–40%, and all regions make meaningful contributions to the decision, speaking to the fact that AD is a disease with brain-wide effects.
Volume changes contribute most
Extracted feature contributions by feature type obtained from the same “HC vs. AD” classifiers as above (experiments A5 and B5, in Figs.
3B and
4B, respectively) reveal that volumes are the strongest contributors to the classification, at approximately 45–55%, followed by cortical thicknesses, at approximately 17–21%, which is in line with evidence of brain atrophy [
39] and cortical thinning in AD [
40]. The combination of gyrification measures (mean curvature, Gaussian curvature, folding index, and curvature index) contributed 12–17% to the decision, which is consistent with previous evidence associating cortical gyrification with AD [
41]. While we did not investigate the performance gained from introducing different types of measures, we show that all types of measures used had an important contribution to the classification decision. Furthermore, care should be taken not to interpret these contributions outside of the classification context, as the predictive value of features does not necessarily reflect biological contribution to the disease [
49].
MRI morphometric-based graph theory complexity may be unnecessary
When introducing complex features, such as GT-based ones, it is important to determine whether they bring an improvement, as complexity sacrifices clinical interpretability in ML. In fact, we did not observe a statistically significant improvement when GT features were used as input along with morphometric features (p = 0.060–0.998). However, the fact that neither the classifier built to distinguish protocols (“ADNI MPRAGE vs. ADNI IR-SPGR”) nor the classifier built to distinguish datasets (“OASIS vs. ADNI”) “chose” GT-based metrics, even when these were given as input, indicates that GT metrics may be more robust to protocol and dataset differences, which could have implications for classifier generalizability. It is important to note, as a limitation, that since models were trained exclusively on either morphometric or graph theory data, complementarity between these data might have been lost. The same is true for complementarity across GT binarization thresholds. This loss of complementarity could be particularly relevant when using larger datasets and more complex models. Altogether, the fact that no improvement is obtained from inputting GT metrics suggests that these may be unnecessary when small sample sizes and simple models are used, and that the benefit of their inclusion (which might be observed with larger samples and more complex algorithms) should be weighed against the additional complexity they introduce in the models.
Generalizability across the two most common acquisition protocols
Comparing the results of experiments A1 and A2 with those of experiment A5 from Table
2 shows that combining IR-SPGR and MPRAGE scans in training and testing results in similar performance when compared to building classifiers on a single scanning protocol (“HC vs. MCI vs. AD”: 0.350 MCC for MPRAGE-only (
p = 0.285) and 0.263 MCC for IR-SPGR-only (
p = 0.160) vs. 0.438 MCC for the combination). Importantly, the considerably larger training set for the combination classifier (
n = 379 for the combination vs.
n = 106 for the IR-SPGR-only classifier and
n = 295 for the MPRAGE-only classifier) did not result in any improvement, suggesting that the disease effects are fairly large and can be detected with small sample sizes using simple ML classifiers, and thus that more complex models for AD diagnosis may be unnecessary.
Moreover, the results from experiments A3 and A4 demonstrate that despite the protocols being almost perfectly distinguishable by the classifiers, a classifier trained on one protocol can classify scans from another protocol with similar performance to a classifier trained on that other protocol (“HC vs. MCI vs. AD”: 0.350 MCC on experiment A1 vs. 0.372 MCC on experiment A4 (p = 0.766); 0.262 MCC on experiment A2 vs. 0.424 MCC on experiment A3 (p = 0.207)). This similarity in performance shows that a biomarker using our features and algorithm, although trained on one protocol, is well adapted to classify patients whose scans were acquired using other MRI acquisition protocols. This suggests that the underlying pathophysiology dominates the algorithm’s decision, rather than incidental sequence or image signal differences. Finally, we demonstrated that MPRAGE and IR-SPGR protocols are almost perfectly distinguishable using only morphometric features (0.968 MCC; 99.2% BAC), indicating that these protocols should not be merged without balancing them across diagnoses, as otherwise a classifier might be biased if it learns to use disease-irrelevant, protocol-related differences in features to attribute labels.
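The exact statistical procedure behind the p-values above is not restated here; purely as an illustration of how two classifiers evaluated on the same test set can be compared, the sketch below implements an exact McNemar test on their discordant predictions (the counts are hypothetical and do not correspond to any experiment in this study):

```python
# Illustrative exact McNemar test: are two classifiers' error rates on the
# same test set significantly different? Only discordant cases matter.
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.
    b = cases only classifier 1 got right; c = cases only classifier 2 got right."""
    n = b + c
    k = min(b, c)
    # Two-sided binomial tail with p = 0.5 under the null of equal error rates.
    p = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * p)

# Hypothetical: 8 cases only classifier 1 classified correctly, 5 only classifier 2.
print(round(mcnemar_exact(8, 5), 3))  # 0.581
```

With so few discordant cases the test is underpowered, which mirrors why wide error margins are expected for small subgroups in the experiments discussed here.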
Generalizability across independent patient datasets
Similarly to what was observed in the acquisition protocol comparison, comparing the results of experiments B1 and B2 with those of experiment B5 from Table
5 shows that combining ADNI MPRAGE and OASIS MPRAGE scans in training and testing can match or significantly improve performance when compared to building classifiers on a single dataset (“HC vs. AD”: 0.814 MCC for ADNI-only (
p = 0.955) and 0.564 MCC for OASIS-only (
p = 0.028) vs. 0.811 MCC for the combination). These results demonstrate that combining data from different sources can be preferable to training a classifier for each data source: not only can performance significantly increase, but the resulting classifier is also more robust, as it can classify subjects from multiple sources. Importantly, the OASIS-only classifier from experiment B2 had a larger training sample than the dataset combination classifier from experiment B5 (
n = 365 and
n = 295, respectively), meaning that the increase in performance observed cannot be attributed to a larger training set. Furthermore, by showing that the ADNI and OASIS datasets are not perfectly distinguishable (MCC = 0.680; 83.6% BAC), we also show that this holds for the comparison between 2-fold accelerated MPRAGE scans (used in OASIS) and non-accelerated MPRAGE scans (used in ADNI). This contrasts with the “MPRAGE vs. IR-SPGR” distinction (0.968 MCC; 99.2% BAC), which was nearly perfect, and is consistent with previous evidence showing that acceleration has little impact on brain atrophy metrics, with effects being dominated by MPRAGE vs. IR-SPGR differences [
15]. Finally, results from experiments B3 and B4 demonstrate that a classifier trained on one dataset can classify scans of another dataset with similar performance as a classifier trained on that other dataset (“HC vs. AD”: 0.814 MCC on experiment B1 vs. 0.722 MCC on experiment B3 (
p = 0.211); 0.564 MCC on experiment B2 vs. 0.641 MCC on experiment B4 (
p = 0.461)). These results are in line with previous evidence showing that a classifier trained on ADNI data can be used to classify OASIS data with similar performance as a classifier trained on OASIS data [
57], although we obtained a slightly higher performance of 87.2% BAC compared to 75.6% BAC when training with ADNI and testing with OASIS.
Sex-stratified results
Sex-stratified data showed no difference in performance between males and females within the ADNI dataset for the multi-diagnostic classifier (experiment A5) (0.424 MCC for females, 0.454 MCC for males, 0.438 MCC overall, p = 0.438), nor in the combined ADNI and OASIS datasets for the “HC vs. AD” task (experiment B5) (0.875 MCC for females, 0.749 MCC for males, 0.811 MCC overall, p = 0.219). Despite no differences being observed in our biomarkers, reporting sex-stratified data remains important, as differences in performance between sexes have implications for the potential translation of the classifiers into clinical practice. Balancing test sets for sex is essential to ensure that sex proportions in the training set do not inflate test results, especially when diagnostic groups differ significantly in sex composition, as in our datasets. While sex proportions could instead be adjusted in the training set, this could mean reducing the size of the test set, and it would not allow sex-stratified results to be reported with the same confidence for both sexes.
Our classifier from experiment A5 performed comparably on stable (0.420 MCC) and transition cases (0.327 MCC), which suggests that imaging changes might precede diagnostic changes. Importantly, the classifier performed poorly on the HC-to-MCI progression and MCI-to-HC regression transitions. Given the small number of these transitions in the test set (7 MCI-to-HC and 6 HC-to-MCI), their estimated performance has a wide error margin. Nonetheless, the poor performance estimated for these cases demonstrates that excluding transitions we cannot reliably predict may artificially inflate model performance. This is because, when using the classifier as intended, there is no way to exclude these transitions, as the future state of the patient is unknown (which is exactly what the classifier is being used to predict). Therefore, it is important to train and test the classifier on a population that best resembles its intended use (i.e., including all possible transition cases), even if including these transitions reduces classifier performance. Furthermore, while many studies differentiate stable from progressing MCI [
14], failure to account for other transitions reduces their potential applicability, as predicting a progression from MCI to AD is not necessarily more useful than predicting a progression from HC to MCI, or a regression from MCI to HC. For example, accurately predicting future MCI in currently HC patients indicates that their brain already shows some pattern of structural change before symptom onset. Early intervention in these cases (recommending mentally stimulating activities, monitoring the appearance of symptoms, etc.) could result in better outcomes and delayed onset of symptoms. Similarly, predicting regression from MCI to HC could be a good indicator that the current clinical intervention and therapies are working.
Clinical applicability and usefulness
Both of the classifiers we evaluated obtained a clinical applicability score of 6 (“HC vs. MCI vs. AD” from experiment A5 and “HC vs. AD” from experiment B5). Despite the better performance of the second classifier, we consider that the multi-class classifier has more potential to be clinically useful, as it more closely approximates the decision clinicians face. Given this evidence, we consider that the present MRI biomarker should undergo evaluation in a future randomized controlled trial (RCT) comparing it to current practices in terms of real-world clinical usefulness, which should include, at least, measures of risk, convenience, cost, number-needed-to-assess, and outcome relevance [
32], as it may have potential to improve clinicians’ performance as a first-line biomarker — before invasive, less available, and more expensive ones such as PET or CSF readings are used — at an initial stage.
Limitations and future work
The present study was limited by the fact that only one of the data sources included patients with MCI. Furthermore, the testing set is not identical to the clinical situation faced by physicians in the clinical decision process. This or any future classifier built using the proposed approach on a larger, longitudinal, multi-site dataset would have to be thoroughly evaluated in the clinical setting before it can be adopted. As such, as mentioned in the previous section, future work should assess the clinicians’ performance with and without the information from the algorithm so its potential impact on patients can be understood. Also, our classifiers did not make use of cognitive features or scores, which could be of high predictive value and can be easily acquired. Finally, while we provide some insight into the classifier decision with the goal of improving interpretability, whether classifier outputs would be correctly interpreted by clinicians without a good understanding of the statistical rationale behind ML algorithms remains to be verified.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.