More than 20% of children in the United States will experience a traumatic event before they are 16 years old [1, 2]. Of those who experienced a trauma, between 10 and 40% [3–5] will develop Posttraumatic Stress Disorder (PTSD) [6], a disorder that results in significant functional impairment and may have deleterious consequences for brain development [7–9]. The early identification of a child’s level of risk – and specific vulnerabilities – opens the possibility of preventative intervention tailored to the child’s specific needs. Therefore, the ability to predict risk for PTSD from the time of the trauma is extremely important. Unfortunately, the extant research literature has been unsuccessful in reliably identifying a set of risk factors for PTSD common to all traumatized children or specific sets of risk factors that may allow the individualized treatment of a child based on their risk [10, 11]. This limited progress in the field points to the need to identify and apply new methods that might provide improved ways to conduct research towards the reliable and accurate identification of risk factors for childhood PTSD.
This article describes a study of risk for PTSD in acutely traumatized children that – for the first time – employs Machine Learning (ML) computational methods to determine whether these methods can identify variable sets and models that predict the development of PTSD. As will be detailed, we believe ML may offer much-needed advantages for advancing risk factor research for childhood PTSD because of its track record for these purposes in other fields [12–26] and its success in its first application to adult PTSD in the pages of this journal [27].
The current state-of-the-science for child PTSD risk factor research
Over the last twenty years, a sizeable literature on childhood PTSD risk factors has accumulated. Unfortunately, the literature has not converged on a set of risk factors that accurately identify risk or inform care. Meta-analytic studies have concluded that many of these risk and protective factors have small effect sizes for traumatic stress, and that these effect sizes are not consistent between studies [10, 11]. Trickey and colleagues published the most definitive meta-analysis to date, examining 64 studies of risk factors for traumatic stress in 32,238 children (aged 6–18 years) over a 20-year period (1990–2009). Of note, only 25 risk factors were examined, as these were the only ones reported in more than one study, and only six risk factors were assessed in more than 10 studies. Ten risk factor variables showed medium to large effect sizes, but four of these were examined in only two studies and three were found to have inconsistent effect sizes across studies. Only one risk factor (post-trauma psychological problem) was found to have a large effect size in a large number of studies [10].
The fit between the complexity of childhood PTSD and the data analytic methods used to determine risk
A key observation relevant to our study is that, from a mathematical perspective, a risk factor is a variable that conveys statistical information about the likelihood of the phenotypic response of interest. The discovery of accurate risk factors from data critically depends on the choice of data analytic approach. A large literature on feature selection methods developed and applied in various fields over the last several decades shows that different data analysis methods will select different features (i.e., risk factors in our study) and different models using those features. For a broad introduction to modern feature selection, see Guyon and Elisseeff [28]. Tsamardinos and Aliferis showed that there cannot be a uniformly “best” feature selection method, and that feature selection methods must be designed for specific requirements [29]. For example, if a maximally compact risk factor set is desired among the sets that are maximally predictive, a feature selector that discovers Markov Boundaries has to be employed (more on this later in the manuscript). This property of Markov Boundaries is intrinsic to the system under study and does not change with the method a researcher chooses to study the system.
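To make the difference between univariate association and Markov Boundary membership concrete, the following minimal Python sketch simulates a toy linear-Gaussian system. It is an illustration only, not the algorithm or the data used in this study: the variable names (`cause`, `proxy`, `outcome`), the coefficients, and the use of partial correlation as a stand-in for a conditional independence test are all assumptions made for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after linearly adjusting both for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Hypothetical toy system (assumed for illustration):
#   cause -> outcome   and   cause -> proxy
# The proxy has no direct link to the outcome.
rng = random.Random(0)
n = 5000
cause = [rng.gauss(0, 1) for _ in range(n)]
proxy = [c + 0.5 * rng.gauss(0, 1) for c in cause]
outcome = [c + 0.5 * rng.gauss(0, 1) for c in cause]

# Univariate screen: the proxy looks like a strong "risk factor".
r_proxy = pearson(proxy, outcome)
# Conditional view: given the cause, the proxy adds essentially nothing.
r_proxy_given_cause = partial_corr(proxy, outcome, cause)
# The cause stays informative even when the proxy is known.
r_cause_given_proxy = partial_corr(cause, outcome, proxy)

print(f"corr(proxy, outcome)         = {r_proxy:.2f}")
print(f"corr(proxy, outcome | cause) = {r_proxy_given_cause:.2f}")
print(f"corr(cause, outcome | proxy) = {r_cause_given_proxy:.2f}")
```

In this toy system the proxy is strongly associated with the outcome on its own, yet it carries essentially no additional information once the cause is known, so only the cause belongs to the outcome's Markov Boundary. A univariate screen would flag both variables.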
In the majority of traditional data analysis in psychiatry and in PTSD research, including all 64 studies described in Trickey et al., risk factors are discovered using either univariate association or stepwise procedures within various forms of the General Linear Model (GLM) family of multivariate analysis methods [30–32]. The GLM refers to a broad category of established statistical models based on regression that includes Analysis of Variance, Linear Regression, Logistic Regression, and Poisson Regression, among other types of classical multivariate analysis. These approaches, however, do not guarantee predictive optimality and do not guarantee parsimony in a data analysis-independent manner. Very importantly, the results are tied to the specific analysis method used: they are essentially an artifact of that method and not an intrinsic property of the system under study. In addition, robust understandings of childhood PTSD, in all likelihood, involve the influence of a great many variables from a diversity of modalities (e.g., genomic, neurologic, physiologic, social, developmental) and, most importantly, the interactions between these variables. Traditional data analytic methods (e.g., classical regression and sister methods from the GLM family, but also clustering and decision tree methods) impose considerable restrictions on the number of variables that can be used in a given analysis and, especially, on the analysis of interactions. Another major problem with older methods concerns their limited ability to shed light on causality when the data do not come from randomized experimental designs. Experimental designs, however, are unethical in (non-animal) risk factor research related to trauma, since they would require assigning a child to a trauma exposure condition. It is for this reason that all human risk factor research for childhood PTSD is correlational. The essentially correlational nature of this research has considerable implications for prevention. An identified risk factor can represent a promising target of preventative intervention if – and only if – it represents a cause of the phenomenon it is thought to influence. ML computational techniques offer advantages for each of the above limitations of older/classical data analytic methods [33–36]. For example, the aforementioned Markov Boundary feature selection methods will also find the local causal neighborhood of the response variable in most distributions [29, 37].
Newer advances in data analysis contributed by the field of Machine Learning also greatly extend researchers’ ability to make meaningful discoveries by:
1. Enabling accurate and reliable prediction using data with very large numbers of variables and small sample sizes.
2. Avoiding the significant hurdles of estimating accurate variable coefficients and of modeling the data-generating process, by directly building predictive classification models for the phenomena of interest and testing the reliability and accuracy of these models.
3. Enabling causal inference within non-experimental data sets.
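The first two points can be sketched with a small, self-contained Python example: a toy data set with many more variables than subjects, a simple classifier that requires no regression coefficients, and k-fold cross-validation to estimate accuracy on held-out cases. Everything here is a hypothetical illustration (the simulated data, the nearest-centroid classifier, and the function names); it is not the modeling pipeline of this study.

```python
import random


def make_toy_data(n_per_class=20, n_features=200, n_informative=10,
                  shift=2.0, seed=0):
    """Simulate 2 classes; only the first n_informative features differ."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                for j in range(n_informative):
                    row[j] += shift
            rows.append(row)
            labels.append(label)
    return rows, labels


def centroid(rows):
    """Feature-wise mean of a list of rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cross_validated_accuracy(rows, labels, k=5, seed=0):
    """Estimate accuracy with k-fold cross-validation.

    Centroids are computed from the training folds only, so each case
    is classified by a model that never saw it.
    """
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    correct = 0
    for fold in range(k):
        test_idx = set(order[fold::k])
        train_idx = [i for i in order if i not in test_idx]
        centroids = {
            c: centroid([rows[i] for i in train_idx if labels[i] == c])
            for c in (0, 1)
        }
        for i in test_idx:
            predicted = min(
                centroids,
                key=lambda c: squared_distance(rows[i], centroids[c]),
            )
            correct += predicted == labels[i]
    return correct / len(rows)


rows, labels = make_toy_data()
accuracy = cross_validated_accuracy(rows, labels)
print(f"5-fold cross-validated accuracy: {accuracy:.2f}")
```

The design choice that matters is that the model is refit inside each fold. Any step that uses the outcome, including feature selection, would also need to happen inside the training folds; otherwise the held-out accuracy estimate is optimistically biased.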
The conceptual foundations for these ML algorithms rest on a thoroughly validated body of work exemplified by the Nobel and Turing award-winning work of Herbert Simon, of Turing award winner Judea Pearl, and of Nobel laureate Clive Granger, among other pioneers of non-experimental causal discovery [38–42]. On a purely empirical and practical level, research using ML methods has met exceptional success in a wide range of scientific and technological fields [34], and it is beginning to penetrate the domain of clinical science, including the fields of psychiatry and pediatrics. ML has demonstrated utility in a variety of applications, including the accurate classification of pediatric disorders such as epilepsy, asthma, heart disease, and head injury [20–23]. Within psychiatry, ML has been successfully used in the predictive classification of autism, attention deficit hyperactivity disorder, and schizophrenia [24–26]. ML has recently been used to predict PTSD in acutely traumatized adults [27, 43]. It has not yet been used, however, to predict PTSD in children or to identify causal processes for PTSD. The possibility of using ML to identify causal processes – initiated shortly after exposure to trauma – has important implications for prevention, because the detection of such causal processes may identify promising targets for preventative intervention.
The current study addresses two broad hypotheses:
Hypothesis 1: ML methods can identify an accurate and reliable predictive classification model for childhood PTSD, from variables measured around the time of trauma.
Hypothesis 2: ML methods can identify variables that not only have predictive value for childhood PTSD but also have causal influence on it.