Background
Amyotrophic lateral sclerosis (ALS) is a fatal disease with inherited (familial) forms and sporadic subtypes arising spontaneously from gene-environment interactions. Mutations in superoxide dismutase (
SOD1) were the first to be associated with ALS [
1], but in recent decades additional susceptibility genes have been identified (e.g.,
TDP43,
C9orf72), reflecting a complex genetic basis for most forms of the disease. Ongoing epidemiologic studies have also uncovered environmental risk factors, which appear to include smoking, low body mass index, poor dietary antioxidant intake, vigorous physical activity, head injury, and occupational exposures to heavy metals or pesticides [
2]. At present, few ALS treatment options are available; these include the glutamate antagonist riluzole (Rilutek/Teglutik) and the antioxidant edaravone (Radicava/Radicut), along with dextromethorphan/quinidine (Nuedexta) for pseudobulbar affect. However, despite frequent clinical trial setbacks [
3,
4], research towards new ALS treatments has pressed forward, and promising candidate therapies are now at various stages of development and clinical testing (e.g., methylcobalamin, masitinib and NP001) [
5]. In this setting, the lack of ALS biomarkers has been cited as a factor limiting the identification, development and testing of new drug candidates [
6,
7]. Investigators have thus worked to expand the set of available ALS biomarkers, which now includes clinical performance measures, genetic risk factors, and measures derived from biological fluids (CSF, blood and urine) and neurophysiology or neuroimaging studies [
8,
9]. Despite this progress, ALS biomarkers selected for use in clinical trials have varied from study to study, reflecting the absence of definitive “gold standard” biomarkers widely agreed upon by ALS researchers [
8,
9].
Fluid-based ALS biomarkers have been suggested from studies of CSF, blood, urine and saliva, and in principle would offer objective, quantitative, and potentially multi-dimensional tools for investigators [
6,
10]. CSF biomarkers have been viewed as the most promising due to direct contact between CSF and central nervous system tissues [
11], but a drawback is that CSF sampling requires lumbar puncture, which is time-consuming, cannot be performed in all patients, and may cause adverse effects (e.g., headache). As an alternative, peripheral blood is easily sampled and a promising biomarker source [
12]. Although ALS is primarily a disease of motor neurons, the rationale for blood-based biomarkers is supported by factor exchange at the blood-CSF barrier [
6], which may be enhanced in ALS patients due to barrier damage and loss of pericytes [
13,
14]. Experimental evidence also supports a role for immune cells in disease progression [
15‐
17], with both protective and deleterious immune responses reported in ALS patients [
18,
19]. Blood-based biomarkers with clinical utility for ALS appear to include phosphorylated neurofilament heavy chain (pNFH), neurofilament light chain (NFL), microRNAs (e.g., miRNA-1234-3p), inflammatory markers (e.g., IL-6, IL-8, IL-5 and IL-2), TDP-43, and metabolites (e.g., glutamate and lysine) [
6]. Serum and plasma NFL levels, for example, distinguished ALS patients from healthy control (CTL) subjects with a sensitivity of 89–90% and a specificity of 71–75% [
20]. If sufficiently validated, such biomarkers could be used for ALS diagnosis, prognosis of clinical course, prediction of treatment response, and pharmacodynamic monitoring [
8]. Blood-based biomarkers can also be used to screen drug responses in humans or mice to identify compounds warranting investigation as new ALS drug candidates [
21]. Finally, given that development and validation of ALS mouse models remain a longstanding research challenge [
22,
23], blood-based biomarkers could be used to assess whether ALS-like mouse phenotypes have biomarker profiles similar to the human disease [
24,
25].
Gene expression profiling has previously been used to comprehensively analyze mRNA abundance to identify neurodegenerative disease biomarkers [
26,
27]. Along these lines, prior studies have used microarray or RNA-seq expression profiling of whole blood or peripheral blood mononuclear cells (PBMCs) to compare gene expression in ALS patients and control (CTL) subjects (Additional file
1) [
28‐
32]. This has led to the identification of differentially expressed mRNAs with altered expression between ALS patients and CTL subjects. In these studies, however, sample sizes have been limited (
n ≤ 43 individuals in ALS and CTL groups), which may be insufficient for a heterogeneous disease such as ALS [
33] and may increase the risk of type I and II errors and of findings with poor repeatability [
34]. Recently, however, a large microarray dataset from peripheral blood of ALS patients and controls was generated [
31] with sample sizes far exceeding those in prior work (
n = 1117 participants). These data represent the best resource now available for identifying ALS blood biomarkers, although an initial analysis was challenged by technical issues related to batch effects and the combination of data from two cohorts with expression evaluated using different microarray platforms [
31]. Using several classification modeling approaches, expression-based models from this initial study could discriminate between ALS and CTL subjects (0.87 ≤ area under curve (AUC) ≤ 0.90), but were less effective at discriminating ALS patients from those with ALS-mimic diseases (MIM) (0.65 ≤ AUC ≤ 0.68) [
31]. It was also concluded that prediction of ALS patient survival using blood gene expression markers was poor [
31]. These results raise questions regarding the clinical utility of blood-based ALS biomarkers, although it remains possible that alternative analysis approaches may resolve technical variability in these data to generate new insights.
This study provides an independent analysis of the large-cohort microarray dataset generated by van Rheenen et al. [
31]. We apply an alternative data normalization strategy [
35] and implement a series of analyses not applied previously. Our results provide new insights into processes and pathways associated with differentially expressed genes (DEGs) [
36,
37], expression of DEGs in exosomes [
38,
39], overlap between DEGs and genes near ALS GWAS loci [
40], shifts in immune cell abundance or activity in ALS patients [
41], and patient subgroups based upon expression profile heterogeneity among patients [
42,
43]. We utilize multiple machine learning approaches [
44] to generate diagnostic models (ALS vs. CTL/MIM), and use Cox proportional hazards (PH) models to generate a multivariate expression signature that predicts ALS patient survival.
Discussion
ALS primarily affects motor neurons in the brain and spinal cord, but peripheral blood analysis may provide a source for non-invasive biomarkers. This study provides a comprehensive analysis of blood gene expression in the largest expression profiling study of ALS patients performed to date [
31]. Expression shifts in ALS blood frequently involved genes specifically expressed by certain immune cell types (e.g., neutrophils), and we show that expression of such genes can be used to identify patient subgroups (Fig.
7). The value of blood expression markers in ALS, however, is not limited to inflammation monitoring alone. Our results suggest that blood gene expression may additionally provide insights into sub-clinical hypoxia and respiratory function [
98,
99], as well as pathogenetic mechanisms suggested by GWAS findings [
88]. Our findings support the potential of blood-derived expression markers as tools for prediction of ALS diagnosis and patient survival. Sensitivity and specificity estimates from this study were better than those reported previously in large cohort studies of blood proteins, and were competitive with estimates from CSF protein studies (Fig.
6j, k). At present, no major technical barriers exist that would prevent investigators from exploiting blood gene expression biomarkers in ALS more fully. It is nearly certain that further progress can be achieved through analysis of larger patient cohorts and leveraging of high-throughput sequencing technology. Ultimately, we expect that the most useful clinical tools will emerge from predictive “hybrid models” that combine blood-derived biomarkers with those obtained from other biofluids (e.g., CSF, urine and saliva) and/or validated clinical measures (e.g., FVC and the revised ALS functional rating scale (ALSFRS-R)) [
44].
Transcriptomic datasets can be analyzed from multiple perspectives and the purpose of online data repositories (e.g., GEO) is to facilitate re-analysis of data using alternative methods [
74]. This study focused on data originally generated by van Rheenen et al. [
31] and demonstrates challenges that arise when combining data across batches and microarray platforms [
100]. ALS DEGs identified in our study overlapped significantly with those from van Rheenen et al. [
31], but only 30% of DEGs identified in our study were identified as differentially expressed by van Rheenen et al. [
31]. This limited overlap may be explained by differences in batch correction and differential expression analysis approaches. We used a different batch correction algorithm (ComBat) [
35] and avoided simultaneous correction for batch and platform-specific variability, with DEGs identified by “late stage” meta-analysis [
46] to ensure DEGs exhibited consistent patterns in both cohorts. Outlier removal may also be an important difference between analyses. The analysis by van Rheenen et al. [
31] identified and removed 87 samples as outliers, whereas our analysis removed only 1 outlying sample. The lack of prominent outliers in our analysis may reflect improved resolution of technical variability using the ComBat algorithm [
101]. Despite these differences, some key trends were similar between our analysis and that of van Rheenen et al. [
31]. For instance, ALS-increased DEGs we identified frequently encoded RNA-binding proteins and ribosome components (Additional file
11), consistent with findings from van Rheenen et al. [
31]. As noted by van Rheenen et al. [
31], genes linked to RNA processing have been identified in ALS genetic association studies (e.g.,
TARDBP,
FUS) and there is evidence for RNA-mediated toxicity in cells harboring
C9orf72 repeat expansions [
102]. Both our study and that of van Rheenen et al. [
31] therefore appear to have identified dysregulation of RNA processing genes in the ALS transcriptome, which has increasingly been viewed as a component of ALS pathogenesis [
102].
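One common way to judge whether two DEG lists overlap more than expected by chance is a one-sided hypergeometric test. The sketch below is illustrative only: the text does not state which overlap test was applied, and the function name `overlap_pvalue` and all gene counts are hypothetical.

```python
from math import comb


def overlap_pvalue(universe, list_a, list_b, shared):
    """One-sided hypergeometric tail probability P(X >= shared).

    universe: number of genes tested in both analyses (hypothetical)
    list_a:   number of DEGs from analysis A
    list_b:   number of DEGs from analysis B
    shared:   number of DEGs found by both analyses
    Illustrative helper, not the authors' code.
    """
    upper = min(list_a, list_b)
    # Count the ways to draw list_b genes containing i of the list_a genes,
    # summed over every overlap size at least as large as the observed one.
    tail = sum(
        comb(list_a, i) * comb(universe - list_a, list_b - i)
        for i in range(shared, upper + 1)
    )
    return tail / comb(universe, list_b)


# Made-up counts: 1000 genes tested, 100 and 50 DEGs, 30 shared.
p = overlap_pvalue(universe=1000, list_a=100, list_b=50, shared=30)
print(f"overlap p-value: {p:.3g}")
```

With these made-up counts only 5 shared genes would be expected by chance, so an overlap covering 30% of one list can still be highly significant, which is the pattern described above.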
An immunological component to ALS pathophysiology has also been recognized [
19,
103], and prior studies have identified alterations in immune cell abundance and activity in ALS patients [
104]. Blood gene expression is partly determined by the fractional abundance of constituent immune cells, and in silico approaches have therefore been developed to “deconvolute” aggregate expression signals and allow inferences regarding fractional cell abundance [
70‐
73]. Using this approach, the strongest trend we identified was over-representation of neutrophil-specific genes among ALS-increased DEGs (Fig.
3a). This result is consistent with neutrophilia in ALS patients as previously demonstrated in smaller patient cohorts [
91,
105‐
108] and confirmed by flow cytometry [
91]. In ALS, neutrophilia and associated low-grade inflammation [
107] may be reactive and secondary to motor neuron degeneration, although some evidence supports a direct and causal role in disease pathogenesis [
109]. Recently, for example, heavy neutrophil infiltration surrounding motor axons was reported in SOD1G93A rats, and neuron degeneration and myofibril loss in this model were prevented by the tyrosine kinase inhibitor masitinib [
109]. Notably, increased expression of neutrophil-specific genes in ALS patients was not mirrored in patients with ALS mimic diseases (Additional file
18B), suggesting that neutrophilia may be an ALS-specific phenotype and potentially useful for excluding differential diagnoses. In prior work, degree of neutrophilia has been correlated with functional decline as measured by ALSFRS-R [
91,
104]. This trend was weakly supported by our analysis, since high neutrophil signature scores were marginally associated with decreased survival (HR = 1.26; P = 0.069; Fig.
7a).
ALS patients in this study had lower expression of genes specifically expressed by RBC lineage cell types (i.e., erythroblasts and reticulocytes; Fig.
3d). This trend was less robust but was bolstered by associations of ALS-decreased DEGs with anemia and blood diseases (Additional file
13F). One possible interpretation is that ALS patients develop reduced RBC numbers, potentially leading to sub-clinical anemia at certain stages of the disease course. Early studies demonstrated increased mechanical fragility of erythrocytes from ALS patients, with enhanced sensitivity to hemolysis following lead exposure [
110,
111]. ALS patient erythrocytes were also reported to have greater sensitivity to oxidative stress along with reduced activity of antioxidant defense enzymes such as glutathione peroxidase [
112‐
114]. Other unique RBC phenotypes have been described in ALS patients as well, such as increased erythrocyte deformability and acetylcholinesterase activity [
115], accumulation of altered aspartyl residues [
116], and decreased nitric oxide efflux and intraerythrocytic nitrite [
115]. In this study, ALS patients not only had reduced expression of RBC lineage-specific genes in blood, but those with higher expression of such genes had improved survival (HR = 0.71; P = 0.002; Fig.
7a). In agreement with this finding, a retrospective cohort study of 1.8 million young men enlisted in the Swedish military recently showed that a 1% increase in erythrocyte volume fraction (EVF) was associated with 4% decreased risk of later developing ALS (P = 0.05) [
117]. In view of these results, it is interesting to note that blood-CSF barrier defects have been documented in ALS patients and rodent models [
13], which may provide a route for erythrocyte extravasation into spinal cord regions with high motor neuron density [
13], where deposition of hemoglobin may be toxic [
118]. Our findings therefore augment evidence for unique RBC phenotypes in ALS patients, which may be useful as disease biomarkers [
115] or potentially play a direct role in disease onset and/or progression [
13,
118].
The idea that ALS patients can be divided into “high” and “low” inflammatory groups has previously been suggested based upon gene expression analysis of PBMCs from small patient cohorts (i.e.,
n ≤ 9 patients) [
42,
43]. Based upon our large cohort analysis (
n = 396 patients), using whole blood immune cell expression signatures, we identified two patient subgroups with myeloid- and lymphoid-dominant expression patterns, respectively (Fig.
7b). These groups approximate the “high” versus “low” inflammation groups suggested previously (Fig.
7g, h) [
42], although this distinction may not be fully applicable since certain cytokine mRNAs were elevated in both groups (Fig.
7d). The significance of these findings is that blood gene expression signatures may provide tools to screen patients prior to clinical trial enrollment. In recent years, an objective in clinical trial design has been to enroll more homogeneous ALS patient groups, based upon biomarkers and/or measures that reflect disease progression or physiological function [
95,
119]. For example, a recent phase 2 study of the macrophage activation inhibitor NP001 [
120] only enrolled ALS patients with plasma C-reactive protein (CRP) concentration greater than or equal to 0.113 mg/dL (NCT02794857) [
96]. Likewise, a phase 2 study of the IL-6 receptor antibody tocilizumab enrolled ALS patients with a “high inflammatory profile” based upon PBMC gene expression analysis (NCT02469896) [
42,
121]. In our study,
IL6 expression was slightly lower in the myeloid group (P = 1.89e−02), but expression of
IL6R was elevated with greater statistical significance (P = 2.3e−25), raising the hypothesis that myeloid group patients may respond more favorably to tocilizumab. Myeloid and lymphoid groups also differed in the expression of mRNAs encoding IL-1, IL-17 and IL-23 pathway components (Fig.
7d), which have been less studied in ALS but can be targeted using biologic therapies now available [
122,
123]. Myeloid and lymphoid subgroups identified here may thus represent sub-cohorts enriched for patients more likely to respond to immunomodulatory agents targeting IL-1, IL-6, IL-17 or IL-23.
Blood gene expression has most often been incorporated into ALS clinical trials as an inflammation biomarker [
121], but our results suggest a broader role that extends beyond inflammation monitoring alone. Notably, gene expression shifts in ALS blood showed striking correspondence with those observed during acute high altitude stress (Fig.
4). Respiratory decline is expected with ALS progression and may have physiological effects at early disease stages, prior to the onset of overt dyspnea or measurable FVC decline [
98,
99]. A recent study, for example, documented poor sleep quality in most ALS patients (63%) at the time of diagnosis, likely due to nocturnal hypoventilation and respiratory muscle weakness [
124]. In some ways, this respiratory dysfunction may be mimicked by the reduced oxygen pressures associated with high altitude stress, which are responsible for symptoms of “acute mountain sickness” (e.g., headache, nausea, pulmonary hypertension, cerebral edema) [
125,
126]. Similar to ALS respiratory decline, moreover, physiological responses to high altitude are broad and may involve release of hypoxia inducible factor (HIF) and activation of renal compensatory mechanisms [
125‐
127]. Our results therefore suggest the novel possibility that peripheral blood gene expression can provide a sensitive indicator of respiratory dysfunction at early and late stages of the disease, which may be of clinical value since standard tests such as vital capacity may not be fully indicative of diaphragm atrophy or function [
128]. Interestingly, altitude stress scores among patients were not themselves associated with survival in covariate-adjusted models (HR = 1.06; P = 0.47; Additional file
21A). This may indicate that the high altitude blood signature is not a “marker” of severe respiratory distress, emergent at later disease stages, but may instead reflect degrees of sub-clinical hypoxia and respiratory muscle weakness, which can be present even at the time of diagnosis [
124].
Early ALS symptoms may go unnoticed or mimic other diseases, which can lead to diagnostic delay and compromised quality of care [
129‐
131]. One study reported a median time to diagnosis of 11 months with one-third of ALS patients having initially been misdiagnosed [
130]. Such delays hinder care plan formulation, slow initiation of treatments that might improve survival or quality of life, reduce time available for clinical trial enrollment, and add to the frustration and expense of ALS patients and caregivers [
129‐
131]. Diagnostic delays are due in part to the complex process of ALS diagnosis, which requires full exclusion of other conditions and therefore multiple referrals and rounds of testing [
132]. Development of an accurate biomarker test, however, would simplify the process and likely lead to earlier diagnosis [
12]. In our analyses, mRNAs encoding proteins previously identified as possible blood biomarkers (e.g., pNFH and NFL) [
6,
90] weakly discriminated ALS from CTL and MIM patients (Additional file
16A–C). We identified individual genes that could diagnose ALS patients with 62–63% accuracy (Additional file
17H, I), but ultimately best results were obtained using an SVM classifier with PC scores as predictors, which yielded 87% accuracy (sensitivity: 86%, specificity: 87%) (Fig.
6i). These sensitivity and specificity estimates are better than those previously reported for blood biomarkers in large cohort studies (≥ 60 patients per group) and comparable to CSF protein biomarkers (Fig.
6k). Our estimates also compare well to those reported in studies using other technologies such as transcranial magnetic stimulation (sensitivity: 73%, specificity: 81%) [
133], conventional MRI (sensitivity: 48%, specificity: 76%) [
134], and diffusion tensor imaging (sensitivity: 65%, specificity: 67%) [
135]. Our findings thus support the idea that blood-derived mRNA biomarkers can be superior to some diagnostic approaches and competitive with CSF protein analysis [
11,
136‐
138].
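For readers comparing the diagnostic figures quoted above, sensitivity, specificity and accuracy all follow from the four cells of a confusion matrix. The sketch below is a minimal illustration; the function name `diagnostic_metrics` and the counts are hypothetical and are not taken from the study data.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    tp: ALS patients classified as ALS     fn: ALS patients classified as non-ALS
    tn: controls classified as non-ALS     fp: controls classified as ALS
    """
    sensitivity = tp / (tp + fn)  # fraction of true ALS cases detected
    specificity = tn / (tn + fp)  # fraction of non-ALS subjects correctly excluded
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical test set: 100 ALS patients and 100 control subjects.
sens, spec, acc = diagnostic_metrics(tp=86, fn=14, tn=87, fp=13)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.3f}")
```

Accuracy depends on the ratio of cases to controls in the test set, whereas sensitivity and specificity do not, which is one reason the latter two are the usual basis for comparing methods across studies.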
Exosomes are extracellular vesicles ranging in size from 40 to 100 nm, which are generated from endosome membranes with cargos consisting of proteins and nucleic acid species (e.g., mRNAs and microRNAs) [
64]. Because of their unique role in intercellular transport and communication, and their ability to cross the blood-CSF barrier [
139], exosomes have drawn interest as candidate neurodegenerative disease biomarkers [
39,
140]. ALS-increased DEGs had higher expression in blood exosomes from normal subjects (Additional file
16D, E), and exosome-associated mRNAs [
66,
67] were disproportionately increased in ALS blood samples (Additional file
16B, C). ALS-increased DEGs with highest blood exosome expression frequently encoded ribosomal subunits and other translation-associated proteins (Additional file
16G, H). These observations connect well with those from a recent study demonstrating an increased diameter of exosomes extracted from ALS patient plasma, which further showed that such exosomes have increased abundance of disease-related proteins (e.g., SOD1, TDP-43 and FUS) [
38]. The ability to quantify exosome mRNA strengthens the biomarker value of ALS patient blood samples, particularly since exosome mRNA may originate from motor neurons and transit the blood-CSF barrier to enter peripheral circulation [
139]. It is important to note, however, that our study analyzed oligonucleotide microarray data, which, compared to RNA-seq, have a more limited dynamic range and lower sensitivity for quantifying expression of low-abundance transcripts [
141]. Since extracellular vesicles passing into blood from CSF may be present in low abundance, RT-PCR or RNA-seq would likely provide improved quantification [
141], potentially leading to biomarkers with improved diagnostic accuracy. To our knowledge, a comprehensive comparison of exosome mRNAs in blood of ALS patients and controls has not been performed, but our findings, combined with other recent data [
38], provide rationale for such work using a deep sequencing approach.
Superoxide dismutase 1 (
SOD1) mutations were the first to be associated with ALS [
1], a finding since replicated in multiple cohorts [
142], and may contribute to disease by causing SOD1 destabilization and mitochondrial accumulation [
143,
144]. Among 11,480 protein-coding genes evaluated, expression of copper chaperone for superoxide dismutase (
CCS) was most strongly associated with survival (P = 1.84e−05; FDR = 0.14), with increased expression favoring improved survival (HR = 0.77) (Additional file
20C).
CCS encodes a copper chaperone for SOD1 and interacts with SOD1 to facilitate copper uptake, promote structural maturation, and control intracellular localization [
145]. Consistent with this, CCS and SOD1 proteins are co-localized in human cortical pyramidal neurons, cerebellar Purkinje cells, and spinal cord motor neurons [
146]. In principle, CCS-SOD1 interactions may enhance SOD1 stability and possibly protect against misfolding and aggregation of the mutated variant [
147]. Paradoxically, however, when
CCS is overexpressed in G93A-SOD1 mice,
CCS-G93A-SOD1 transgenics die 8 times more quickly with increased G93A-SOD1 mitochondrial localization [
148]. This seems difficult to reconcile with improved survival in ALS patients having elevated blood
CCS mRNA levels (Additional file
20C). Notably, however, SOD1 aggregates are a key ALS hallmark but are absent in spinal cords from CCS/G93A-SOD1 mice [
149], and mortality in these mice may be explained entirely by copper deficiency during the first 10 days of life [
150], in some ways resembling Menkes disease more closely than adult-onset ALS [
151]. Further investigation may therefore be needed to understand the significance of
CCS in ALS pathophysiology apart from the copper deficiency phenotype observed in
CCS-G93A-SOD1 transgenic mice.
The ability to predict ALS patient survival has been an ongoing challenge [
152‐
154], and biomarkers for this purpose would be especially valuable as early indicators of efficacy in ALS clinical trials [
7]. Although
CCS and other individual genes (e.g.,
JAK1,
CEBPE, KEL) were marginally associated with survival (FDR < 0.30), such genes, taken individually, only modestly improved survival forecasts when combined with baseline clinical data in Cox PH models (Additional file
20). We therefore developed a multivariate model with 61 predictor genes and showed that adding these genes to a Cox PH model with clinical data substantially improved survival forecasts, yielding an overall concordance index of 0.74 (Fig.
8h). In comparison, a recent study reported a concordance index of 0.78 using prediction models based upon 8 variables, including 2 used in our analyses (age at onset and site of onset) and 6 others not included in our analyses (FVC, definite versus probable or possible ALS, diagnostic delay, progression rate, frontotemporal dementia, and presence of a
C9orf72 repeat expansion) [
92]. Including these clinical variables along with others in our model would likely have improved performance, although new datasets will be needed to confirm this expectation. Important survival-associated genes in our signature include
ZNF429,
ELAC1 and
MIS18A (Fig.
8f, g), although given our heuristic variable selection approach, we do not expect the 61 genes to be optimal, and other gene sets with similar predictive performance could potentially be identified. Our findings, however, in contrast to an earlier report [
31], provide proof-of-principle to support the use of blood-derived expression biomarkers as predictors of ALS patient survival. Although further validation of our signature is needed, development of a prognostic blood biomarker panel would alter the landscape of tools now available to ALS researchers and clinicians [
7,
152‐
154].
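The concordance index used above to summarize survival forecasts can be read as the fraction of comparable patient pairs in which the patient with the higher predicted risk has the shorter observed survival. The sketch below implements Harrell's version on toy data. It is an assumption-laden illustration rather than the authors' pipeline: the function name, survival times, event flags and risk scores are all hypothetical, and pairs with tied survival times are simply skipped.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times:       observed follow-up time per patient
    events:      1 if death was observed, 0 if the patient was censored
    risk_scores: model output, higher meaning shorter predicted survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only when the shorter time ended in an event.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data for five hypothetical patients (times in months).
times = [5, 10, 15, 20, 25]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.4, 0.6, 0.3, 0.1]
c_index = concordance_index(times, events, risk)
print(f"concordance index: {c_index:.3f}")
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported values of 0.74 and 0.78 both indicate useful but imperfect ranking of patients by survival.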