Introduction
Amyotrophic lateral sclerosis (ALS) is characterized by loss of motor neurons in the brain, brainstem and spinal cord, with concurrent muscle atrophy, and is typically fatal within 2–5 years from diagnosis [1, 45]. The worldwide incidence of ALS is 1–3 cases per 100,000 individuals per year. However, considerable heterogeneity is associated with the disease at both the clinical and molecular levels, with variable sites of disease onset, variable rates of clinical disease progression, complex genetics, and a multitude of cell types involved in the disease process. Pathogenic cellular mechanisms are similarly multi-factorial, and include mitochondrial dysfunction, excitotoxicity, oxidative stress and the presence of ubiquitinated neuronal and glial intracellular inclusions [52].
Approximately 10% of ALS cases are familial, and genetic alterations in one of over 30 ALS genes have been linked to the disease [5, 7, 45, 52]. These familial ALS genes regulate a multitude of cellular processes, including cytoskeletal dynamics and membrane trafficking (DCTN1, PFN1, VAP), cellular proteostasis and autophagy (SQSTM1, UBQLN2, OPTN) and RNA metabolism (TARDBP, FUS, MATR3, hnRNPA1, hnRNPA2B1, TAF15), while the most common genetic factor is the hexanucleotide repeat expansion in the C9orf72 gene [20, 56].
Mutations or variants in the genes of 11 RNA-binding proteins (RBPs) or proteins that function in RNA processing are associated with ALS, including TARDBP, FUS, hnRNPA1, hnRNPA2B1, MATR3, SETX, ELP3, ATXN2, ANG, SMN1, and SMN2 [7, 45, 52]. In addition, a number of other RBPs exhibit altered subcellular distribution in neurons and/or glia in ALS patients, but lack any known mutations that cause ALS [10, 37], suggesting that RBPs even without genetic alterations contribute to a disruption of RNA homeostasis in ALS. While mutations in the TARDBP gene are associated with only ~4% of familial ALS patients, TDP-43 protein mislocalization and inclusions are detected in 97% of all ALS patients [35]. This indicates that cytoplasmic and nuclear inclusions of RBPs are common in ALS, even without associated mutations. The number of RBPs currently associated with ALS represents a small fraction of the total RBPs, as a recent report identified 1542 putative RBPs in the human genome [21]. Given the large number of RBPs in the human genome and the number of RBPs that have already been linked to ALS, we hypothesized that additional RBPs contribute to and/or are mis-localized in ALS, and used IBM Watson to predict new potential candidates. IBM Watson was previously used to identify novel kinases that phosphorylate p53 and has contributed to the oncology field [48]. Since IBM Watson uses text-based information from published abstracts for its computational analysis [8], we were limited to RBPs that have been reported in the literature. A total of 1478 RBPs were mentioned in at least one abstract published before the end of 2015, and these were included in our study.
To test the predictive modeling capability of IBM Watson, we first limited IBM Watson’s knowledge base to publications prior to 2013, and asked Watson to use this available information to predict other RBPs associated with ALS. Watson highly ranked the four RBPs with disease causing mutations identified between 2013 and 2017, demonstrating the validity of our approach. We then used IBM Watson to screen all known RBPs and predict RBPs likely to be associated with ALS based on their similarity to all known RBPs mutated in ALS. We validated Watson’s top-ten predictions by performing immunohistochemistry (IHC), protein and RNA expression analyses in brain and spinal cord tissues from ALS and non-neurologic disease controls, as well as RNA levels in motor neurons derived from induced pluripotent stem cells (iPSC-MNs) from ALS and controls. We also performed similar experiments for three RBPs near the bottom of the list that were predicted to not be altered in ALS as negative controls.
Eight of the top-ten RBPs predicted by Watson to be associated with ALS were altered in ALS by at least two validation methods listed above. During the course of this study, one of the RBPs predicted to be linked to ALS, Caprin-1, was shown to be altered in ALS patients [4]. As anticipated, RBPs ranked near the bottom of the list were not altered in ALS patients. Our results validate the IBM Watson predictions and identified novel RBPs altered in ALS. These findings further highlight the multitude of RBPs that contribute to the disruption of RNA homeostasis during ALS, and the strength of computer-based artificial intelligence approaches to accelerate wet lab scientific discoveries.
Materials and methods
Tissue samples
ALS and non-neurologic disease control post-mortem tissue samples were obtained from the University of Pittsburgh ALS Tissue Bank, the Barrow Neurological Institute ALS Tissue Bank, and the Target ALS Human Postmortem Tissue Core. All tissue samples were collected after informed consent from the subjects or the subjects’ next of kin, complying with all relevant ethical regulations. The protocol and consent process were approved by the University of Pittsburgh Institutional Review Board (IRB) and the Dignity Health Institutional Review Board. Clinical diagnoses were made by board certified neuropathologists according to consensus criteria for ALS. Subject demographics are listed in Suppl. Table 5.
Immunohistochemistry
Paraffin-embedded post-mortem tissue sections from spinal cords and cerebellum were used for this study. All sections were deparaffinized, rehydrated and antigen retrieval performed using Target Antigen Retrieval Solution, pH 9.0 (DAKO) or a citrate buffer (pH 6) for 20 min in a steamer. After cooling to room temperature, non-specific binding sites were blocked using Super Block (Scytek), supplemented with Avidin (Vector Labs). Primary antibodies used for immunohistochemistry were incubated overnight in Super Block with biotin (antibodies listed in Suppl. Table 3). Slides were then washed and incubated for 1 h in the appropriate biotinylated IgG secondary antibodies (1:200; Vector Labs) in Super Block. Slides were washed in PBS and immunostaining visualized using the Vectastain Elite ABC reagent (Vector Labs) and Vector NovaRED peroxidase substrate kit (Vector Labs). Slides were counterstained with hematoxylin (Sigma Aldrich). Sections were visualized using a Leica AperioScope microscope, and analyzed using the Aperio eSlide manager image analysis.
For color intensity analysis, regions of interest (ROIs; motor neuron nuclei or Purkinje nuclei) were delineated by a blinded user. Slides were deconvolved into the hematoxylin (blue channel) and antibody staining (red channel) components using the Leica Aperio ImageScope color deconvolution algorithm, and the intensity value was measured for each pixel within the ROI. These values were used to set intensity scales for each color from 0 to 255 (0 = black and 255 = white) prior to the analysis, and the same intensity thresholds were used across each antibody analysis. For hnRNPU, the negative threshold was set to intensities ranging from 210 to 255; weak positive staining ranged from 145 to 210, medium positive staining from 90 to 145, and strong staining from 0 to 90. All neurons were selected for each spinal cord section (20 to 50 neurons per section), and ROIs were defined. For Syncrip, the negative threshold was set to intensities ranging from 180 to 255; weak positive staining ranged from 155 to 180, medium positive staining from 95 to 155, and strong staining from 0 to 95. We selected 50 Purkinje cells from different areas of each section.
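The threshold scheme described above can be illustrated with a short Python sketch. This is not the Aperio ImageScope implementation; it only shows how a deconvolved pixel intensity (0 to 255) maps onto the four staining classes. The function names `classify_pixel` and `roi_profile` are hypothetical, and because adjacent ranges in the text share their boundary value (for example 145 for hnRNPU), the sketch assumes that a boundary value falls into the weaker staining class.

```python
from bisect import bisect_right
from collections import Counter

# Exclusive upper bounds of the strong, medium and weak classes, taken from
# the ranges quoted in the text. Anything at or above the last bound is negative.
THRESHOLDS = {
    "hnRNPU": (90, 145, 210),
    "Syncrip": (95, 155, 180),
}
LABELS = ("strong", "medium", "weak", "negative")


def classify_pixel(intensity, antibody):
    """Map one deconvolved intensity (0 = black, 255 = white) to a staining class.

    Assumption: a value equal to a shared boundary (e.g. 145 for hnRNPU)
    is assigned to the weaker class.
    """
    if not 0 <= intensity <= 255:
        raise ValueError("intensity must lie between 0 and 255")
    return LABELS[bisect_right(THRESHOLDS[antibody], intensity)]


def roi_profile(pixels, antibody):
    """Return the fraction of ROI pixels falling into each staining class."""
    counts = Counter(classify_pixel(p, antibody) for p in pixels)
    total = len(pixels)
    return {label: counts[label] / total for label in LABELS}


if __name__ == "__main__":
    # Toy ROI with made-up pixel values, for illustration only.
    print(roi_profile([10, 100, 200, 250], "hnRNPU"))
```

Keeping the thresholds in a per-antibody table mirrors the statement that the same thresholds were applied across all sections stained with a given antibody.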
Laser-capture microscopy, RNA extraction and real-time PCR analysis
Total RNA was prepared from frozen lumbar spinal cord and cerebellum tissue from control and ALS cases. Samples were homogenized in Trizol (Invitrogen), and RNA was extracted using the Ambion PureLink™ RNA Mini Kit. RNA quality was determined by RIN (RNA integrity number) using a Tapestation, and all samples showed RIN values > 5. cDNA was synthesized using Superscript VILO (Invitrogen) and real-time RT-PCR was performed using the FastStart Universal SYBR Green master mix (Roche). Primer sequences used are listed in Suppl. Table 7.
For laser-capture microscopy, fresh-frozen cerebellum was sectioned at 20 μm, and slides were fixed for 2 min in 70% ethanol (in nuclease-free water), washed and stained for 6 min with the RNA/DNA stain Methyl Green Pyronin (Abcam, ab150676) supplemented with SUPERase In RNAse inhibitor (AM2694, ThermoFisher). Slides were consecutively dipped in nuclease-free water and 100% ethanol, then air-dried for 2 min before capture. We used the Zeiss Axiovert Zoom, fitted with a PALM system, to capture at least 120 Purkinje cells per slide. Capture time was limited to 1 h to minimize RNA degradation. Two slides from each sample were used for a total of 250 neurons, and the cells were combined for subsequent processing. RNA was extracted using the RNAqueous micro total RNA isolation kit from Ambion (AM1931), cDNA was synthesized using Superscript VILO and real-time RT-PCR was performed.
Statistical analysis
Statistical analysis was performed using Student’s t test, or one-way ANOVA with Bonferroni’s multiple comparisons test when comparing multiple groups, in GraphPad Prism 5. Fisher’s exact test and the Wilcoxon rank sum test were used for cross-validation studies.
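For readers unfamiliar with the Bonferroni correction mentioned above, the following Python sketch shows the general idea: each raw p-value is multiplied by the number of comparisons and capped at 1. This is a generic illustration, not GraphPad Prism's internal procedure, and the function name `bonferroni_adjust` and the example p-values are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p-values for a family of comparisons.

    Each raw p-value is multiplied by the number of comparisons m and
    capped at 1.0, which controls the family-wise error rate.
    """
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    # Made-up raw p-values for three pairwise comparisons.
    raw = [0.01, 0.04, 0.5]
    for p, adj in zip(raw, bonferroni_adjust(raw)):
        print(f"raw p = {p:.3f} -> adjusted p = {adj:.3f}")
```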
Data and code availability
All data generated or analyzed during this study are included in the published article and its supplementary information files (Suppl. Tables 2, 3 and 4). The pseudo-code used to generate our analysis by IBM Watson is included in the supplementary information files.
Discussion
The use of machine learning algorithms and other artificial intelligence technologies is impacting medical care and research, and offers new approaches to analyze complex biological datasets to provide new insight into human disease. We used IBM Watson to screen and rank order RBPs to identify additional RBPs involved in ALS. Using a set of 11 RBPs with known mutations that cause ALS and a candidate set comprising 1467 RBPs with at least one published abstract up to the end of 2015, IBM Watson text mined published abstracts in the literature, and ranked all candidate RBPs by their semantic similarity to the known RBPs with ALS-causing mutations. We then validated the top-ten candidates for potential alterations in ALS using a combination of immunohistochemistry, RNA and protein analysis in tissues from ALS and non-neurologic disease controls, and RNA analysis of iPSC-derived motor neurons. These results are summarized in Table 5. The top-three ranked RBPs (hnRNPU, Syncrip and RBMS3) exhibited alterations in ALS by multiple methods, including protein distribution, RNA and protein levels in ALS compared to controls. Two other RBPs ranked in the top-ten by Watson, NUPL2 and Caprin-1, also exhibited alterations by multiple validation methods (Table 5). As noted above, Caprin-1, subsequent to our Watson analysis, was shown to localize to TDP-43 and FUS positive inclusions in ALS patients with TDP-43 or FUS mutations [4]. Our criterion for successful validation was a significant RBP alteration in more than one assay. Therefore, neither hnRNPH2 nor RBM6 passed our validation criterion, whereas the five other top-ten Watson-ranked RBPs did. This top-ten list also included three other RBPs that were previously associated with ALS (RBM45, SC-35 and MTHFSD) but have no known mutations linked to familial forms of ALS. Overall, eight of the top ten ranked RBPs were altered in ALS. All RBPs tested from the bottom of the IBM Watson list showed no alterations in ALS.
One question is whether Watson could have randomly rank ordered all RBPs to generate a top-ten list that would fulfill our validation criteria. The actual number of RBPs altered in ALS is not known, so we cannot precisely determine the accuracy of Watson predictions at ranking RBPs linked to ALS. Instead, we used Fisher’s exact test to calculate the probability of Watson correctly identifying eight of the top ten RBPs as altered in ALS. Using results from the leave-one-out (LOO) analysis, we could assume that 5% of the total RBPs used in this study (73 out of 1467 RBPs) are altered in ALS. Under this assumption, Fisher’s exact test generates p = 1.07 × 10⁻⁹ for Watson correctly predicting eight of the top ten to be altered in ALS. If we make a very conservative estimate and assume that 20% of all RBPs (293 out of 1467) are altered, then the significance of the Watson predictions is p = 7.21 × 10⁻⁵. Therefore, the probability that Watson randomly selected RBPs and correctly predicted eight of the top ten by chance is quite low. While we could not perform extensive validation of all Watson RBP predictions due to time and cost, we focused validation efforts on the top ten and selected RBPs at the bottom of the list for which there were commercially available antibodies. These negative controls are all involved in tRNA metabolism, which Watson semantically ranked as most dissimilar to the known ALS-RBPs that function predominantly in mRNA metabolism. Other RBPs that function in tRNA metabolism were also ranked near the bottom of the list, suggesting that this pathway does not significantly contribute to ALS.
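The chance calculation above can be sketched as a one-sided hypergeometric tail, which is the quantity a one-sided Fisher's exact test evaluates for this kind of top-list enrichment. The sketch below is an assumption about how the test was framed, since the exact contingency table is not given in the text: it treats the top ten as a random draw from 1467 RBPs and asks how likely it is to see eight or more altered RBPs among them. The function name `top_list_enrichment_p` is hypothetical. Under this framing the two scenarios give values close to those reported.

```python
from math import comb


def top_list_enrichment_p(population, altered, drawn, hits):
    """One-sided hypergeometric tail P(X >= hits).

    population: total number of candidate RBPs
    altered:    assumed number of RBPs truly altered in ALS
    drawn:      size of the top-ranked list that was tested
    hits:       number of tested RBPs found to be altered
    """
    upper = min(drawn, altered)
    # Count the draws containing exactly k altered RBPs, for every k >= hits.
    favourable = sum(
        comb(altered, k) * comb(population - altered, drawn - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(population, drawn)


if __name__ == "__main__":
    # Scenario 1: 5% of RBPs altered (73 of 1467), as assumed from the LOO analysis.
    print(f"5% altered:  p = {top_list_enrichment_p(1467, 73, 10, 8):.3g}")
    # Scenario 2: conservative 20% of RBPs altered (293 of 1467).
    print(f"20% altered: p = {top_list_enrichment_p(1467, 293, 10, 8):.3g}")
```

Using exact integer binomial coefficients avoids floating-point underflow for the very small tail probabilities involved.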
Even though hnRNPU, Caprin-1, SRSF2 and Syncrip can be found within supplemental tables of unbiased proteomic screens for potential interacting proteins of TDP-43, FUS and Ataxin2 (Table 4), these supplemental data were not available to Watson’s analysis that focused on published abstracts. Such global proteomic analyses typically generate hundreds of potential hits, though without further validation studies these remain putative protein interactions and it is difficult to rank order which candidate proteins should be further explored. The use of computer-based approaches such as IBM Watson to mine text and/or data can focus subsequent experimental validation efforts to those putative interacting proteins highly ranked by Watson.
The top-ranked RBP, hnRNPU, co-localized to cytoplasmic TDP-43 positive inclusions and showed significant protein increases in motor neurons, as well as in cerebellum and spinal cord protein lysates from ALS compared to non-neurologic disease controls. Yet, hnRNPU transcript was significantly downregulated in ALS cerebellum. Similarly, Syncrip also showed altered subcellular distribution and increased protein expression in the cerebellum, along with modest increases in protein levels in ALS spinal cord, yet its RNA transcript was downregulated in ALS cerebellum. However, Syncrip mRNA expression was increased in C9-iPSC-derived motor neurons, suggesting the analysis of total tissue extracts may mask changes within individual cell types. Nevertheless, we did note discordance between protein and RNA expression levels of multiple RBPs within the same tissue, similar to prior results described in aging human brain [55], and perhaps attributable to pathological changes in mRNA translation or microRNA regulation that occur in ALS.
While the use of IBM Watson in ALS and the neurosciences was novel and we successfully identified new RBPs that exhibit alterations in ALS, there remain limitations to our approach. Watson relies on gene annotations of the published literature for its text-based analysis. In our study, hnRNPH2, ranked number 6 by Watson, exhibited few alterations in ALS (Table 5), but was found to have a similar annotation nomenclature within the published literature as hnRNPH/hnRNPH1, which has been linked to ALS [11]. This example of common annotations likely led Watson to infer that hnRNPH2 was equivalent to hnRNPH and hnRNPH1, generating a false positive in our analysis. While we used a rigorous disambiguation of gene annotations for our study (see Supplemental Methods), continued work on gene annotations will aid future gene-based studies using IBM Watson. Another limitation of Watson’s analysis is the fact that it is based on semantic similarity to a known set of proteins. For example, DDX58 was identified in 2016 as an RBP altered in ALS tissue [37]. However, in our study Watson ranked DDX58 number 769, making it a false negative result. Since the most common function of DDX58, a cytoplasmic sensor of viral infection, is vastly dissimilar to the function of RBPs used in our known training set, Watson assigned DDX58 a low score in its model. The addition of neuroscience-specific knowledge and complex biologic datasets generated by neuroscience laboratories into the IBM Watson system will benefit future Watson-based neuroscience studies.
It is noteworthy that from the transcriptional analysis of RBP changes in ALS tissues, more changes were observed in cerebellum when compared to spinal cord; four genes were significantly altered in ALS vs control in cerebellum, while only one gene (RBMS3) was altered in ALS spinal cord. Such a trend towards more robust transcriptomic changes in cerebellum compared to other brain regions was recently reported by Prudencio et al. [44], when comparing cerebellum to frontal cortex of C9-ALS and SALS by RNA-sequencing analysis.
Cerebellar involvement in ALS has recently gained acceptance by the field. Cerebellar atrophy, namely loss of Purkinje cells in the cerebellar vermis region, was reported in ALS cases with ATXN2 gene expansions, but not C9-ALS or SALS cases [51]. C9-ALS cases are associated with p62-positive, phospho-TDP43 negative cytoplasmic inclusions in the granular and molecular layers, as well as in Purkinje cells of the cerebellum [2]. Structural changes in ALS cerebellar integrity have been demonstrated as white and grey matter alterations by 3D-MRI [29]. More recently, similar imaging analyses have shown ALS cerebellar atrophy in the inferior cerebellum and vermis, areas typically associated with motor tasks, while the cerebellum of ALS-bvFTD subjects shows atrophy both in the superior and inferior cerebellum [50]. One RBP identified by Watson and validated as significantly altered in ALS cerebellum was NUPL2. NUPL2 specifically marked ALS astrocytes in the cerebellum and spinal cord. A prior study in transgenic SOD1-G93A mice identified phospho-ERK in cerebellar astrocytes, highlighting ALS-specific changes within astrocytes in the cerebellum [9]. NUPL2 is a nucleoporin-like protein that regulates nuclear export of protein and mRNA, yet can localize to both the nucleus and the cytoplasm. NUPL2 was also contained in the cytoplasm of control spinal motor neurons, but in many ALS cases, NUPL2 was redistributed to the nucleolus of motor neurons, although the significance of this redistribution is unknown.
A novel ALS phenotype is the increased expression of RBMS3 and RBM6 in cerebellar interneurons. Spinal cord and cortical interneuron alterations in GABA-A receptor and parvalbumin levels have been reported in ALS patients and animal models of ALS [38, 42, 43]. In addition, reduced GABAergic transmission, hyperexcitability and excitotoxicity of layer 5 pyramidal neurons were observed in TDP43-A315T mice, while a low copy-number model of SOD1-G93A mice showed reduced GABAergic and glycinergic spinal interneurons, along with interneuron ubiquitinated inclusions prior to disease onset [26, 57]. Our results thus highlight alterations of interneurons in ALS.
Whole exome sequencing recently identified NEK1 as a risk factor for ALS [31], though we were unable to identify any genetic alterations linked to ALS for our Watson top-ten RBPs using publicly available exome sequencing data. Additional genetic analyses of RBPs ranked in the top 5–10% of the Watson list are necessary to determine if Watson can use its algorithms to identify new gene mutations linked to ALS using only comparisons to the known RBPs with mutations that cause ALS. Although Syncrip did show a trend for a distinct phenotype in the cerebellum of C9-ALS compared to SALS patients, further studies are needed to expand the group size and include additional familial forms of ALS to confirm these findings.
In conclusion, we used IBM Watson to leverage published literature and semantic similarity to known ALS-RBPs to find additional RBPs altered in ALS. This approach is a valuable addition to the usual candidate screening approaches, and can be used to sift through hundreds of potential hits generated from -omics based experimental approaches and make literature-based rank-ordering of targets worthy of further validation studies. IBM Watson identified, and we validated, alterations in five of the seven RBPs previously unlinked to ALS, including novel alterations of RBMS3 within cerebellar interneurons. The top-ten list included three other RBPs that were previously associated with ALS (RBM45, SC-35 and MTHFSD), while RBPs ranked near the bottom of the list failed to exhibit changes in ALS. Further studies are required to determine if RBPs ranked high by Watson contain any genetic alterations that can be linked to ALS. The continued and future use of IBM Watson and other machine learning computing tools will likely accelerate scientific discovery in ALS and other complex neurological disorders.