Background
Autism spectrum disorder (ASD) is a clinical diagnosis defined by neurodevelopmental impairments in two domains: persistent deficits in social reciprocity and communication across multiple contexts, together with restricted, repetitive patterns of behavior [
1]. Individuals with ASD can display a broad clinical profile, with varying severity of the core symptoms that is often accompanied by medical comorbidities. With onset in the first years of life, ASD is a lifelong condition with diverse outcomes in adulthood [
2]. The prevalence of ASD has been estimated as high as 1 in 68 children [
3], yet the biological mechanisms underlying ASD remain unclear, hampering attempts to develop specific molecular diagnostics or targeted therapeutics.
A multifactorial etiological model for ASD is increasingly recognized. Several epidemiological studies have firmly established a genetic component underlying ASD, with heritability estimates ranging from 50 to 90 % depending on the study parameters [
4‐
6]. Consequently, numerous efforts to identify genes associated with ASD risk have been undertaken in hopes of inferring molecular pathways or surrogate markers associated with clinical manifestations of ASD. The ability to screen large cohorts using high-throughput genomic technologies has led to the discovery of hundreds of candidate genes containing thousands of variants, highlighting enormous genetic heterogeneity in ASD [
7‐
10]. Although the significance of the vast majority of identified variants remains unresolved, a subset of genes has been found to be highly penetrant for ASD based on recurrent findings of rare, de novo, damaging variants in probands [
11]. While initial estimates suggested between 350 and 400 autism susceptibility genes [
12], more recent statistical models predict that well over 1000 genes may eventually be associated with ASD [
13,
14]. Despite the considerable insight into the molecular genetics of ASD that these studies have provided, the diversity in study design, the significant variance in sample sizes and replication cohorts, and the use of different statistical models have resulted in a large set of candidate genes that is difficult to compare on a single platform. Moreover, within any given ASD candidate gene, multiple variants may be found, each with its own associated risk [
15], further complicating a clear understanding of their relevance with respect to autism. To address these challenges, databases of ASD risk genes have been established in attempts to aggregate the ever-increasing number of candidate genes implicated in this disorder [
16,
17]. However, only recently have strides been made towards developing methodologies for quantitative assessment of ASD risk genes [
13,
18‐
20]. For example, transmission and de novo association (TADA) analysis was developed to identify risk-conferring genes by integrating rare de novo and inherited genetic variations from high-throughput, whole exome sequencing (WES) studies of large ASD cohorts such as the Autism Sequencing Consortium (ASC) and the Simons Simplex Collection (SSC) [
11,
21]. While TADA analysis has proven to be a critical first step, further assessment strategies are required to fully integrate the complete spectrum of ASD genetic variations and consider all potential attributes that are likely to be encountered in patients evaluated in ASD clinics.
The Gene Scoring module (
https://gene.sfari.org/autdb/GS_Home.do) of the Simons Foundation Autism Research Initiative (SFARI) was created to evaluate candidate genes on a discrete, categorical scale that takes into account the strength of genetic evidence linking a gene to ASD [
22]. A set of scoring criteria was developed to assess different types of evidence, methodologies, and variability reported in the genetic studies of ASD [
22]. Here, we have extended this initial work to incorporate a systematic evaluation of diverse types of genetic variants implicated in ASD. Our approach is based on assessment of multiple attributes of an ASD variant, including mode of inheritance, effect size, and variant frequency in the general population. In this study, we report a consolidated gene score obtained by summing the evidence scores generated for each individual variant of an ASD-implicated gene. We then compared the gene scores generated in this study with the expert-led SFARI Gene Scoring module as well as the top-ranking ASD genes identified in simplex families [
11,
23]. We found strong concordance between our ASD gene ranking strategy and the other three approaches [
11,
23]. Using our model, we prioritized a larger set of genes including
SHANK3,
CHD8,
ADNP,
MET,
CNTNAP2, and others derived from the most complete collection of genetic variations associated with ASD originating from simplex, multiplex, multigenerational, and consanguineous families.
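The consolidated gene score described above (per-variant evidence scores summed for each gene) can be illustrated with a minimal Python sketch. The weights, the frequency threshold, the `Variant` fields, and the gene names below are hypothetical placeholders chosen only for illustration; they are not the scoring criteria used in this study.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical weights, for illustration only (not the study's criteria).
INHERITANCE_WEIGHT = {"de_novo": 3.0, "recessive": 2.0, "inherited": 1.0}
DAMAGING_BONUS = 2.0        # stand-in for a large effect size
RARE_BONUS = 1.0            # bonus for variants rare in the general population
RARE_THRESHOLD = 0.01       # assumed allele-frequency cutoff for "rare"


@dataclass
class Variant:
    gene: str
    inheritance: str         # one of the keys of INHERITANCE_WEIGHT
    damaging: bool           # crude proxy for effect size
    population_freq: float   # allele frequency in the general population


def variant_score(v: Variant) -> float:
    """Score one variant from its inheritance, effect, and frequency."""
    score = INHERITANCE_WEIGHT[v.inheritance]
    if v.damaging:
        score += DAMAGING_BONUS
    if v.population_freq < RARE_THRESHOLD:
        score += RARE_BONUS
    return score


def gene_scores(variants):
    """Sum the variant scores of each gene into one gene score."""
    totals = defaultdict(float)
    for v in variants:
        totals[v.gene] += variant_score(v)
    return dict(totals)


def rank_genes(variants):
    """Return (gene, score) pairs sorted from highest to lowest score."""
    return sorted(gene_scores(variants).items(),
                  key=lambda item: item[1], reverse=True)


# Made-up example data with placeholder gene names.
example = [
    Variant("GENE_A", "de_novo", True, 0.0),
    Variant("GENE_A", "inherited", False, 0.05),
    Variant("GENE_B", "recessive", True, 0.001),
]
ranking = rank_genes(example)
```

Because the gene score is a plain sum, a gene with many modestly scored variants can outrank a gene with a single high-scoring variant under this sketch.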
Discussion
Given the accelerated pace of ASD candidate gene discovery, it is critical that resources be available to the research community that not only catalog the identified variants in detail but also provide tools to evaluate the potential risk conferred by each individual variant. In this report, we describe a systematic variant scoring strategy for candidate gene prioritization that utilizes the autism gene database AutDB, which encompasses detailed annotation of both rare and common genetic variants associated with ASD. The large set of variants analyzed here was extracted from studies that varied in size, from single case reports to analyses of large cohorts such as the Simons Simplex Collection. Additionally, our dataset included variants identified by a variety of methodologies ranging from targeted sequencing to whole genome-based screening. While a number of other recent analyses of ASD genes have focused on rare damaging de novo mutations in simplex ASD cases, our study design also allowed the inclusion of inherited autosomal recessive variants and variants observed in multiplex and multigenerational families.
This scoring approach identified three ASD risk genes (
SHANK3,
CHD8, and
ADNP) that exhibited significantly higher scores than all other genes.
SHANK3 was first reported as an ASD candidate gene based on identification of heterozygous mutations in ASD probands from three unrelated families [
24]. Subsequently, additional variants in
SHANK3 have been identified by targeted sequencing in multiple ASD cohorts [
15,
25]. In contrast, the risk conferred by functional variants in
CHD8 and
ADNP has only recently been described by WES studies of large ASD cohorts [
9], followed by smaller studies focused exclusively on the identification of variants within these two genes [
25,
26]. However, comparable WES studies of large ASD cohorts have failed to identify a large number of functional variants in
SHANK3, due in part to the high GC content of this gene, which complicates WES approaches. These findings clearly indicate the importance of considering genetic evidence from multiple sources and multiple experimental methodologies in accurately prioritizing ASD candidate genes.
A comparison of the prioritized gene list generated by our scoring model with three other recently published ASD-related gene lists [
11,
22,
23] demonstrated strong agreement in all three instances, supporting the validity of our approach. Differences in the ranking of autism candidates between our approach and these previous studies are largely due to our exclusive focus on the role of a variant or gene in autism rather than in other neurodevelopmental disorders. For example, in our approach, a candidate gene’s score depends entirely on the attributes of its ASD-specific genetic variants; we excluded variants from scoring when they were associated with a neurodevelopmental disorder without an accompanying diagnosis of ASD. By comparison, SFARI Gene Scoring takes into consideration the broader involvement of an ASD gene in related neurodevelopmental/neuropsychiatric disorders as well as its biological role in relation to ASD. These differences in scoring approaches account at least in part for the discrepancies in scores for genes such as RBFOX1 (Fig.
4b), a gene for which considerable functional evidence exists including its role in regulating other ASD genes [
27,
28] and its involvement in ASD pathogenesis as manifested by differential expression in postmortem brain of ASD individuals [
29]. Because ASD is itself a heterogeneous diagnosis, we built our model on confirmed cases of ASD only, so that the resulting prioritization scheme is as specific to ASD as possible. We believe this specificity is critical to the use of such lists by basic science researchers and especially by clinicians.
An important aspect of our study is the inclusion of common variations associated with ASD (Additional file
4: Table S4). As previously indicated,
MET had the highest common variant score (CVS = 85) based on replicated genetic association studies. Similarly, a higher evidence category was assigned to
MET in the expert-mediated scoring in SFARI Gene. Multiple lines of research indicate an important functional role for
MET in ASD [
30]. However, the role of common variants with small effect size remains poorly understood in ASD as compared to their role in other neuropsychiatric disorders such as schizophrenia and bipolar disorder. In these other disorders, a number of common variants have reached genome-wide significance across multiple studies; common variants in ASD have by and large failed to show similar replication across independent cohorts [
31,
32].
Of note is the concern that more commonly studied genes will have more variants in the database simply by virtue of having been assessed more often and will therefore rank higher in any prioritization scheme. In fact, we did find significant correlations between the total variant scores and the number of publications from which variants for a gene were extracted. This represents something of a “winner’s curse” phenomenon, reflecting heightened attention from the ASD research community for select genes. Nevertheless, the number of reported variants per gene (which partially reflects the scientific interest in these genes) explained only ~50 % of the variation in scores, highlighting the comprehensive nature of our scoring algorithm. As more unbiased whole exome and whole genome sequencing studies are undertaken and added to this database, this effect should continue to diminish. Furthermore, future development of our algorithm will attempt to correct for such effects.
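One common way to express how much of the variation in gene scores is accounted for by the number of reported variants is the squared Pearson correlation, r², from a simple linear fit. The Python sketch below shows that calculation under the assumption that a statistic of this kind underlies the ~50 % figure; the text does not specify the exact method used, and the function names are hypothetical. For reference, r² = 0.5 corresponds to r ≈ 0.71.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def variance_explained(xs, ys):
    """Fraction of variation in ys explained by a linear fit on xs (r squared)."""
    return pearson_r(xs, ys) ** 2
```

Here `xs` would hold the number of reported variants (or publications) per gene and `ys` the corresponding gene scores; a value well below 1 indicates that the scores are not determined by study intensity alone.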
Conclusions
In conclusion, we describe the most comprehensive database to date of both common and rare DNA variants associated with ASD. Using our novel scoring and ranking algorithm that considers both genetic and biologic data, we systematically characterized all classes of variants implicated in ASD on one platform and provide a summary score for each ASD-associated gene (Additional file
4: Table S4), which for the first time allows for a fair comparison of ASD-associated gene relevance irrespective of the type, number, or quality of study in which the underlying variant(s) were identified. In addition to strong ASD genes such as
CHD8,
ADNP, and
SCN2A recurrently identified by WES, our prioritized gene set includes
SHANK3,
MET, and
CNTNAP2, which are supported by multiple lines of genetic evidence but were missed by WES. This database and ranking system represent an important step in moving from simply cataloging ASD genes to using unbiased, data-driven approaches to determine the relative strength of association of each gene with ASD. This resource, which is free to access and will be continually updated, will serve as an important tool for both basic scientists and clinicians working with ASD patients.
Acknowledgements
We thank Catherine C. Swanwick for proofreading the manuscript.