Discussion
The use of hSCs in medicine is limited by the abundance and accessibility of donor somatic cells and by histocompatibility issues in donor/recipient transplants. These two factors largely determine the reliability of hSCs for drug development and developmental studies. Nevertheless, the development of iPSCs from donor somatic cells has proven to be somewhat successful. The histocompatibility issues with donor/recipient transplants that have been reported with hESCs and adult stem cells (ASCs) can be avoided. Additionally, information gathered from the reprogramming process that results in iPSCs is very promising for drug development research on rare diseases and for developmental studies [31]. Unfortunately, the application of iPSCs is also hindered by the highly variable efficiency of SC induction protocols and by significant costs, which lead to uncertainty because of reduced reproducibility and long-term maintenance of iPSCs. In this study, we introduced an efficient, accurate, cost-effective and highly customizable computational platform to enable aiPSC model generation.
An increasing number of studies have employed computational, statistical, and mathematical approaches for modelling and analyzing the underlying factors that regulate cellular reprogramming. These efforts have largely focused on specific elements of cellular reprogramming. Examples of this previous work include (1) a Bayesian network model (i.e., a probabilistic model) that provided conditional analysis of random signaling network interactions [32], (2) a Boolean network model (i.e., a quantitative model) that was used to study the logical interactions of network components [33], (3) a multi-scale model, in which a framework of combined algorithms was used to mathematically predict effects of factors/genes on other factors/genes [34], (4) a clustering algorithm, in which multiple algorithms were used to organize data points into groups that share certain similarities to enable mathematical modeling and simulation of cellular events [35] and (5) a Support Vector Machine (SVM) learning model, in which a fully supervised computational approach was used to classify datasets into pre-defined categories to enable phenotypic profiling of cellular subsets [36, 37]. A more in-depth review of computational tools used in stem cell research has been published recently [38].
Unlike previous and largely supervised models focused on various aspects of cellular reprogramming, the unsupervised DeepNEU platform provides a novel high-dimensional and nonlinear approach for simulating simple aiPSCs and for qualitatively assessing stem cell regulatory mechanisms and pathways, using a literature-validated set of reprogramming factors in the context of a fully connected hybrid RNN. Once validated against the results of peer-reviewed wet-lab experiments, DeepNEU aiPSC models provide an efficient, programmable, and cost-effective tool for empowering rare disease and other researchers.
In this research work, the performance of the DeepNEU platform (Version 3.2) was evaluated extensively through simulation of several experimentally validated iPSC models including iPSCs, iNSCs, iCMCs and a Rett syndrome model using aiNSC with MeCP2 deficiency.
DeepNEU simulation of aiPSCs showed that the gene expression profiles of the simulated cells were consistent with those of iPSCs. aiPSCs express many factors that are consistent with the signature of undifferentiated human ES cells. These factors include OCT3/4, SOX2, NANOG, growth and differentiation factor 3 (GDF3), reduced expression 1 (REX1), fibroblast growth factor 4 (FGF4), embryonic cell-specific gene 1 (ESG1/DPPA5), developmental pluripotency-associated 2 (DPPA2), DPPA4, and telomerase reverse transcriptase (hTERT) [6, 29]. Additionally, the unsupervised DeepNEU successfully simulated embryoid body-mediated differentiation (see Table 1) to confirm line-specific differentiation identified by immunocytochemistry and/or RT-PCR in Takahashi et al. [5, 6].
The unsupervised aiNSC model (Fig. 3) showed that the gene/protein expression profile was consistent with the hiNSC cellular model. The aiNSC simulation also expressed several NSC-specific markers including PAX6, NESTIN, VIMENTIN and SOX2.
In the study conducted by Yu et al. [27], the expression levels of miR-9-5p, miR-9-3p, and miR-124 were upregulated in the hiNSCs, but other miRNAs, namely miR-302/miR-367, were not detected in their system. Interestingly, in our simulated aiNSC model miR-9-5p was also upregulated while miR-124 was downregulated. Unlike the hiNSC, the aiNSC expressed miR-302/miR-367, which were also “abundantly” expressed in human embryonic stem cells (hESC) (Fig. 4).
On the other hand, PCR analysis revealed expression of dopaminergic neuron markers, dopa-decarboxylase (AADC) and member 3 (DAT); ChAT; LIM homeobox transcription factor 1 beta (LMX1B); and the mature neuron marker, MAP2 (Takahashi et al., 2007). However, the astrocyte marker GFAP was not expressed in their system. All markers identified by Takahashi et al. [5, 6] plus GFAP were expressed in the aiNSC simulation (Fig. 6).
All the cardiomyocyte markers that were reported to be expressed by iCMCs were also expressed in the unsupervised aiCMC system (Fig. 7), entirely consistent with the data provided by Takahashi et al. [5, 6]. Five additional cardiomyocyte markers identified in Rajala et al. (2012), including GATA-4, Isl-1, Tbx-5, Tbx-20 and cardiac Troponin I, were also expressed by the aiCMC system.
DeepNEU to simulate rare disease: aiNSC for simulating RETT syndrome (MeCP2 deficiency)
To validate the efficiency of the DeepNEU platform in modeling a rare disease, RETT syndrome was simulated using the aiNSC protocol with the MeCP2 gene locked off. Interestingly, the upregulated genes were BDNF, FKBP5, IGF2, DLX5, DLX6, SGK1, MPP1, GAMT and FXYD1, while genes UBE3A and GRID1/GluD1 were both downregulated. All up- and downregulated genes in the aiNSC-RETT neuron simulation are entirely consistent with the expression data presented in Ehrhart et al. [26] (Fig. 8).
To the best of our knowledge, this is the first time that computer simulations of intact and functioning iPSCs have been successfully used to accurately reproduce the landmark experimental results reported by Takahashi et al. (2007) and other studies cited above. The technology itself has limited overlap with some features of neutrosophic cognitive maps, evolutionary systems, neural networks and SVMs, applied here to create a novel unsupervised machine learning platform. The papers referenced above were the source for the reprogramming and media factors used to construct the input vector for the simulations. These papers were also used here to validate, in an unsupervised manner, the genotypic and phenotypic output features of the simulation at the new stable state.
Conclusion/Significance
Stem cell research will inevitably be transformed by computer technologies. The results of the initial DeepNEU project indicate that currently available stem cell data, computer software and hardware are sufficient to generate basic artificially induced pluripotent stem cells (aiPSC). These initial DeepNEU stem cell simulations accurately reproduced gene and protein expression results from several peer reviewed publications.
The application of this computer technology to generate disease-specific aiPSCs has the potential to improve (1) disease modeling, (2) rapid prototyping of wet-lab experiments, (3) grant application writing and (4) specific biomarker identification in a highly cost-effective manner. Further development and validation of this promising new technology are ongoing, with the current focus on modelling rare genetic diseases.
Methods
DeepNEU platform: We have developed a novel and powerful deep machine learning platform employing a fully connected recurrent neural network (RNN) architecture, in which each input is connected to its output nodes (feedforward neurons) and each output node is also connected back to its input nodes (feedback neurons). There are at least two major benefits of using this network architecture. First, an RNN can use the feedback neuron connections to store information over time and develop “memory”. Second, RNNs can handle sequential data of arbitrary length [39]. For example, an RNN can be programmed to simulate the relationship of a specific gene/protein to another gene/protein (one to one), of a gene/protein to multiple genes/proteins (one to many), of multiple genes/proteins to one gene/protein (many to one) and of multiple genes/proteins to different multiple genes/proteins (many to many). Our novel RNN DeepNEU network was developed with one network processing layer for each input to promote complex learning and analysis of how different genes and pathways are potentially regulated in embryonic and reprogrammed somatic cells in key signaling pathways. Here we have used DeepNEU to simulate aiPSCs by using defined sets of reprogramming factors (genes/proteins were turned on or off based on the modeled iPSCs).
Dataset
We have incorporated into the DeepNEU database key genes/proteins that were reported to be involved in regulating and maintaining signaling pathways in human embryonic stem cells (hESCs) and induced human pluripotent stem cells (hiPSCs). We have gathered genes/proteins based on literature reports that extensively studied cellular pathways of hESC and/or hiPSC [40-49]. Abundant data were available. For example, a PubMed (PMC) search of the literature with “stem cells” returned more than 435,000 hits. A more focused query using “stem cell signaling” returned more than 261,000 hits. Nevertheless, data that were included in the DeepNEU database were selected with a preference for (1) human stem cell data, (2) recency of peer-reviewed English-language publications and (3) highest impact factors of the journals under consideration.
To that end, the data were used to create a list of important genes/proteins (data not shown) based on their documented contributions to human stem cell signaling pathways. The current version of the database includes 3589 genes/proteins (inputs) involved in hESC cellular pathways and 27,566 gene/protein regulatory relationships important in hESC that were used for aiPSC system modelling. Importantly, this simple data representation permits complex relationships, including both the positive and negative feedback loops that are common in biological systems.
Entry of data to DeepNEU database
All data (genes/proteins and relationships) were entered, formatted and stored as a large CSV (comma separated values) file in Delimit Professional (v3.7.5, Delimitware, 2017). This database manager was chosen because it can efficiently handle very large CSV files where data can be represented as an NxN (an array of values with N rows and N columns) relationship matrix. In addition, built-in data entry and file scan functions help to ensure and maintain data integrity. This software can also import and export multiple data file types, facilitating two-way interaction with a wide range of data analysis tools. Finally, the software scales easily to NxN or NxM (an array of values with N rows and M columns) databases having millions of rows and columns (http://delimitware.com, 2017).
The DeepNEU platform uses a novel but powerful neutrosophic logical (NL) framework to represent relationships between signaling genes/proteins. NL was originally created by Florentin Smarandache in 1995. In NL, every logical variable X is described by an ordered triple, X = (T, I, F), where T is the degree of truth, “I” is the degree of indeterminacy, and F is the degree of falsity. The strength of any relationship can have any real value between −1 and +1, or “I” if the relationship is considered indeterminate. Positive or stimulatory causal relationships are represented by +1 in the database unless a fractional value > 0 and ≤ +1 is provided. Similarly, negative or inhibitory causal relationships are represented by −1 in the database unless a fractional value < 0 and ≥ −1 is provided. Relationships are considered indeterminate and represented by an “I” if multiple sources report conflicting data or if the relationship is labelled with a question mark in an associated process flow diagram. A value of zero is used when no relationship between nodes is known or suspected [50]. NL is an extension and generalization of Fuzzy Logic and can be easily converted to Fuzzy Logic by replacing all indeterminate (I) relationships with zeros (i.e., by assuming there is no causal relationship).
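The encoding rules above can be illustrated with a minimal Python sketch. It is not the DeepNEU implementation: the three node names, the tiny CSV layout (a header row plus a label column) and the helper names `parse_relationship_matrix` and `to_fuzzy` are assumptions made only for illustration. It reads a relationship matrix whose cells are signed strengths in [−1, +1] or the marker “I”, and then converts it to a fuzzy matrix by replacing every “I” with zero.

```python
import csv
import io

INDETERMINATE = "I"  # marker for an indeterminate relationship


def parse_relationship_matrix(csv_text):
    """Parse a square relationship matrix from CSV text.

    Assumed layout (hypothetical): first row holds target names, first
    column holds source names, each cell is a number in [-1, 1] or "I".
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    names = rows[0][1:]
    matrix = {}
    for row in rows[1:]:
        source = row[0]
        matrix[source] = {}
        for target, cell in zip(names, row[1:]):
            cell = cell.strip()
            if cell == INDETERMINATE:
                value = INDETERMINATE
            else:
                value = float(cell)
                if not -1.0 <= value <= 1.0:
                    raise ValueError(f"strength out of range: {source}->{target}")
            matrix[source][target] = value
    return names, matrix


def to_fuzzy(matrix):
    """Replace every indeterminate entry with 0.0 (no causal relationship assumed)."""
    return {
        source: {t: (0.0 if v == INDETERMINATE else v) for t, v in row.items()}
        for source, row in matrix.items()
    }


# Toy 3x3 matrix with hypothetical nodes A, B and C (not DeepNEU data).
EXAMPLE_CSV = """,A,B,C
A,0,1,I
B,-0.5,0,1
C,0,-1,0
"""

names, nl_matrix = parse_relationship_matrix(EXAMPLE_CSV)
fuzzy_matrix = to_fuzzy(nl_matrix)
```

Keeping “I” as a distinct marker until the final conversion step preserves the difference between a relationship that is unknown (zero) and one that is reported inconsistently (indeterminate).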
DeepNEU network architecture
The NxN relationship matrix is the core data for an unsupervised fully connected RNN. A learning system is referred to as supervised when each data pattern is associated with a specific numerical (i.e., regression) or category (i.e., classification) outcome. Unsupervised learning is used to draw inferences from datasets consisting of input data patterns that do not have labeled outcomes [50]. DeepNEU is a complex learning system in that every (gene/protein) node in the multilayered network is connected to every other node in the network. Traditional neural networks have one or a few hidden or processing layers between the input layer and the output layer. Advanced deep-learning neural networks can have more than a dozen processing layers [51, 52]. DeepNEU has one processing layer for each input variable. Taken together, the input variables and their declared initial values constitute an N-dimensional initial input vector. Vector-matrix multiplication uses this N-dimensional input vector and the NxN relationship matrix to produce an N-dimensional output or new state vector. The new state vector becomes the new input vector for the next iteration, and this iterative process continues until a new system-wide steady state is achieved. In general terms, the DeepNEU network architecture is similar to Neutrosophic and Fuzzy Cognitive Maps (NCMs/FCMs; used to represent causal relationships between concepts (genes/proteins)), which are also examples of fully connected and recurrent neural networks [53, 54].
The DeepNEU simulations
The initial goal of this project was to first create a computer simulation of a hiPSC and then validate the model using the results published by Takahashi et al. in 2007 and others as described above. Briefly, the input or initial state vector of dimension N was set to all zeros except for transcription factors OCT3/4, SOX2, KLF4 and CMYC. These four factors were given a value of + 1 indicating that they were turned on for the first iteration. These values were not locked on so that all subsequent values were determined by system behavior.
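The iteration described here can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the DeepNEU code: the six-node network, its weights and the `LINEAGE_X` marker are hypothetical, and the matrix is stored as a nested dictionary so that zero entries are simply absent. The Elliott squashing function and the MSE < 0.001 stopping rule are those named in the simulation protocol; comparing each new state with the previous state (the input vector on the first pass) is one plausible reading of that rule.

```python
def elliott(x):
    """Elliott squashing function: maps any real value into (-1, 1)."""
    return x / (1.0 + abs(x))


def step(state, weights):
    """One sparse vector-matrix multiplication followed by squashing.

    weights[source][target] is the signed influence of source on target;
    missing entries are treated as zero.
    """
    raw = {name: 0.0 for name in state}
    for source, value in state.items():
        if value == 0.0:
            continue  # zero entries contribute nothing (sparse input vector)
        for target, w in weights.get(source, {}).items():
            raw[target] += value * w
    return {name: elliott(v) for name, v in raw.items()}


def mse(a, b):
    """Mean squared error between two state vectors with the same keys."""
    return sum((a[k] - b[k]) ** 2 for k in a) / len(a)


def simulate(initial, weights, tol=1e-3, max_iter=1000):
    """Iterate until the MSE between successive states drops below tol."""
    state = dict(initial)
    for iteration in range(1, max_iter + 1):
        new_state = step(state, weights)
        error = mse(new_state, state)
        state = new_state
        if error < tol:
            return state, iteration
    raise RuntimeError("no steady state reached within max_iter iterations")


# Hypothetical toy weights, chosen only so the example runs; not DeepNEU data.
WEIGHTS = {
    "OCT3/4": {"NANOG": 1.0, "SOX2": 1.0, "LINEAGE_X": -1.0},
    "SOX2": {"NANOG": 1.0, "OCT3/4": 1.0},
    "KLF4": {"NANOG": 1.0},
    "CMYC": {"KLF4": 1.0},
    "NANOG": {"OCT3/4": 1.0, "SOX2": 1.0, "LINEAGE_X": -1.0},
}

# All zeros except the four reprogramming factors, which start at +1
# and are not locked on, so later values follow from system behavior.
initial = {name: 0.0 for name in
           ["OCT3/4", "SOX2", "KLF4", "CMYC", "NANOG", "LINEAGE_X"]}
for factor in ("OCT3/4", "SOX2", "KLF4", "CMYC"):
    initial[factor] = 1.0

final_state, iterations = simulate(initial, WEIGHTS)
```

In this toy network the factors that receive no incoming signal decay to zero after the first passes, while the mutually reinforcing nodes settle at a positive steady state, which mirrors the "turned on but not locked on" behavior described above.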
DeepNEU simulation protocol
1. The machine learning process began with vector matrix multiplication (VMM). The NxN relationship matrix was multiplied by the N-dimensional input vector with OCT3/4, SOX2, KLF4 and CMYC turned on. Because both the input vector and the relationship matrix consist mostly of zeros, both were treated as sparse. To minimize the computational burden, sparse vector matrix multiplication algorithms were employed at each iteration during model generation.
2. At each iteration the sparse VMM operation produces an N-dimensional output vector with variable components, many of which have large positive or negative values. To avoid computational explosion, a squashing or activation function was used to map these values between a minimum of −1 and a maximum of +1. After initial evaluation of several activation functions, the Elliott function was selected based on rapidity of system convergence and outcome reproducibility [55]. At the end of the activation process, the squashed N-dimensional output vector becomes the new input vector for the next iteration. This cycle is repeated until system convergence occurs, indicating that a new system-wide steady state has been achieved.
3. The goal of the learning system is to minimize error. In this case the error being considered is the mean squared error (MSE) between a given output vector and the previous output vector. During model development several error functions, including adjusted R2, SVM/Vapnik loss and MSE, were evaluated. The MSE function was selected because its use consistently resulted in faster system convergence and more reproducible results. While the MSE function has been widely used, it has also been widely criticized because squaring can make it perform poorly in the presence of outliers. In the current project, the error function was applied after the raw system output was “squashed” between values of −1 and +1 using a sigmoid-type function. This squashing effectively mitigates the problem of potential outliers. As learning continues, the MSE converges towards zero. For this project system convergence was defined as MSE < 0.001, at which point model generation stops. The system output is then saved as a CSV data file for further analysis.
4. The final output from the aiPSC model regarding the expression or repression of genes and proteins was directly compared with published expression profiles [6]. Model prediction values > 0 were classified as expressed or upregulated while values < 0 were classified as not expressed or downregulated. Statistical analysis of the aiPSC predictions and the published data used the Binomial Test. This test provides an exact probability, can compensate for prediction bias and is ideal for determining the statistical significance of experimental deviations from an actual distribution of observations that fall into two outcome categories (e.g., agree vs disagree). A p-value < 0.05 is considered significant and is interpreted to indicate that the observed relationship between aiPSC predictions and actual outcomes is unlikely to have occurred by chance alone.
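The comparison in step 4 can be sketched as follows. The text does not state the null probability or whether the test was one- or two-sided, so this sketch assumes a chance agreement rate of 0.5 and a one-sided exact test; the marker names and values are hypothetical placeholders, not results from the study.

```python
from math import comb


def classify(value):
    """Sign rule from step 4: > 0 is expressed/up, < 0 is not expressed/down."""
    if value > 0:
        return "up"
    if value < 0:
        return "down"
    return "undetermined"  # exactly zero is not covered by the stated rule


def binomial_p_value(successes, trials, p0=0.5):
    """Exact one-sided P(X >= successes) for X ~ Binomial(trials, p0).

    p0 = 0.5 and the one-sided alternative are assumptions of this sketch.
    """
    return sum(
        comb(trials, k) * p0 ** k * (1.0 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical model outputs and published calls for ten placeholder markers.
predictions = {
    "marker_01": 0.62, "marker_02": 0.48, "marker_03": 0.51, "marker_04": 0.33,
    "marker_05": -0.41, "marker_06": 0.57, "marker_07": -0.29, "marker_08": 0.44,
    "marker_09": 0.38, "marker_10": -0.12,
}
published = {
    "marker_01": "up", "marker_02": "up", "marker_03": "up", "marker_04": "up",
    "marker_05": "down", "marker_06": "up", "marker_07": "down", "marker_08": "up",
    "marker_09": "up", "marker_10": "up",  # the single disagreement
}

agreements = sum(classify(predictions[m]) == published[m] for m in published)
p_value = binomial_p_value(agreements, len(published))
significant = p_value < 0.05
```

With a different assumed chance rate (for example, one that reflects a bias toward predicting "up"), only the `p0` argument changes; the agreement count is computed the same way.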
Acknowledgements
The author wishes to thank Dr. Sally Esmail, postdoctoral Associate at Biochemistry Department, Schulich Medicine and Dentistry, Western University, London, Canada for her expert assistance with document preparation and critical review. The author also wishes to thank Dr. James Koropatnick, Distinguished Oncology Scientist, London Regional Cancer Program, Professor, The UWO Departments of Oncology, Microbiology and Immunology, Physiology and Pharmacology, and Pathology, Director, Strategic Training Initiative in Cancer Research and Technology Transfer (CaRTT) London Regional Cancer Program, Victoria Research Laboratories, for his critical review and insightful edits that resulted in a much improved manuscript.