Introduction

Schizophrenia is a chronic and debilitating brain disorder that affects 1% of the population.1 It is characterized by delusional beliefs, hallucinations, disordered speech and deficits in emotional and social behavior (see, for example, Mowry et al.2) and is highly familial with heritability estimates of 81%.3

Genome-wide association (GWA) studies have so far explained only a small proportion of the genetic variance in schizophrenia. They are limited in power because of the many tests performed, and they do not necessarily lead to knowledge about the molecular mechanisms of a clinical trait. In addition, the strongest associated variants, when part of a pathway, might not be the best drug target for therapeutic intervention, and identifying variants in the same cellular pathway or functionally related gene group may help in finding additional drug targets. Recent GWA studies for schizophrenia have implicated the major histocompatibility complex on 6p21.2–22.1, neurogranin (NRGN) and transcription factor 4 (TCF4).4, 5, 6 In addition, they have provided molecular genetic evidence for a substantial polygenic component, implicating a large number of single-nucleotide polymorphisms (SNPs) of very small effect in the etiology of schizophrenia.4 Some of these may exceed genome-wide significance in larger samples, but the currently available sample sizes are insufficient to detect these effects.7 Therefore, SNPs in the P-value band between 5 × 10⁻⁸ and 1 × 10⁻⁶ are a mix of SNPs with a true effect and false positives. The exact ratio of true to false positives is currently unknown. Gene set or pathway analysis involves testing for the combined effect of multiple SNPs, which individually may have a very small effect that does not reach significance. By using a competitive testing scheme, associations of gene sets with a disease are corrected for false positives.

It seems likely that the substantial polygenic component involves SNPs that are not distributed randomly across the genome but are distributed across genes that share a common biological function or pathway.8, 9, 10, 11 Recent pathway analyses provided evidence for the importance of the cell adhesion molecule pathway in schizophrenia,12 as well as the glutamate metabolism, transforming growth factor-β signaling and tumor-necrosis factor receptor-1 pathways.10

Pathway analysis is predicated upon accurate pathway definitions and validated assignment of genes to pathways. However, many of the available databases used for pathway definitions are not optimally annotated and the same pathways can be differentially defined across databases. Classically defined pathways are usually not independent, as the same genes, especially the end points, are often active in different pathways. Consequently, genetic variation that affects the expression or function of genes in different pathways may have similar consequences, have similar impact on pathogenesis and show similar disease association. Genes may also be grouped according to shared cellular function (see, for example, Ruano et al.11). Such ‘functional gene grouping’ goes across the traditionally defined biological pathways as it groups genes based on similar cellular function, and not based on a cascade of induced events, as in biological pathways. We recently proposed such a functional gene grouping approach to test for the combined effect of genetic variants in genes with shared cellular function in the synapse using a manually curated database of gene function based on both experimentation and data mining.11 Using this approach, it was found that one relay element involved in many pathways (G proteins) was associated with cognitive traits, a strong association, which had remained unnoticed by traditional single-marker analysis.11

Numerous statistical methods are available to evaluate the enrichment of selected pathways or functional gene groups for selected traits, including, for example, the Gene Set Enrichment Analysis,13, 14 testing for overrepresentation of categories of genes,9 the SNP ratio test,12 hypergeometric tests (see, for example, Jia et al.10) or the Σ-log(P) method combined with permutation.11 Most of these methods correct for linkage disequilibrium between SNPs, the number of SNPs per gene, gene size and multiple testing of independent pathways. Permutation is generally used to determine how likely a given result is if the null hypothesis of no association is true. However, the more genes are present in the defined gene group, the more likely it becomes to observe smaller P-values. In addition, generally these methods do not test how unique a certain result is given the polygenic nature of many studied traits. The latter involves testing whether a given pathway is significantly associated with a trait because it (1) includes a lot of genes and the trait is polygenic in nature or (2) because of the biological function of the pathway. This can be resolved by testing for association of matched-control gene groups in comparison with the targeted gene groups or pathways.

The purpose of the current study is to apply a functional gene group approach to detect well-annotated functional gene groups that are important to the risk of schizophrenia. The synaptic hypothesis of schizophrenia15, 16, 17, 18 is one of the leading hypotheses in the field of schizophrenia. Recent genetic findings underscore the importance of synaptic dysfunction in schizophrenia.6, 12 Therefore, we formally tested whether the group of genes involved in pre- and post-synaptic functioning is related to schizophrenia, and whether this group is more strongly related than randomly drawn matched sets of genes, using a ‘competitive’ control method.19 Apart from testing all synaptic genes for an association with the risk of schizophrenia, we also tested 17 subgroups of synaptic functioning, defined based on data mining and experimentation.11 We used the data genotyped within the International Schizophrenia Consortium (ISC) case–control sample4 and the Genetic Association Information Network (GAIN) schizophrenia data set.

Materials and methods

Participants and genotyping

The ISC case–control sample includes 3322 cases and 3587 controls (European ancestry), derived from seven different collection sites and is described in detail elsewhere.4 Subjects were genotyped on Affymetrix 5.0 or 6.0 SNP arrays (Affymetrix, Santa Clara, CA, USA), and 300 523 SNPs passed quality control for the ISC_affy5 sample (3353 subjects) and 717 126 SNPs for the ISC_affy6 sample (3556 subjects).

The GAIN schizophrenia case–control sample has been described elsewhere5, 20 and was downloaded from dbGAP (phs000021.v2.p1). Briefly, this sample included 1351 cases and 1378 controls of European ancestry. All individuals were genotyped on the Affymetrix 6.0 array, and 727 872 SNPs were available for analysis. The quality control procedures followed those described in Shi et al.5 The GAIN sample and the two ISC samples (affy5 and affy6) are independent and non-overlapping, and together contain 9638 individuals (4673 cases/4965 controls) of European ancestry.

Defining functional gene groups

Synaptic functional gene group definition was based on cellular function as determined by previous protein identification and data mining for synaptic genes and gene function.11 Genes were considered ‘synaptic’ based on proteomic analysis of synaptic preparations.21 In case of presynaptic genes, an additional expert curation was performed because only few analyses of highly purified preparations are currently available for the presynaptic proteome, except synaptic vesicles.22 Hence, presynaptic genes not covered by Takamori et al.22 were manually curated using published functional data and a cumulative scoring paradigm with the following set of weighted criteria: null mutation produces a synaptic phenotype; activation of the gene product (for example, receptor) or blockade thereof directly modulates synaptic function; and immunoelectron microscopy detects gene product in the synapse. More than 500 PubMed entries were manually screened. Although this approach introduces a bias toward well-studied genes, this is inherent to creating functional gene groups, as functional grouping is by definition limited to those genes for which functional data are available. Synaptic genes were subdivided into 17 functional groups based on shared cellular function (a full listing of genes assigned to functional groups is provided in the Supplementary Material, Table S4).

SNP assignment

All SNPs that survived quality control in the ISC and GAIN samples were mapped to genes on the basis of NCBI (National Center for Biotechnology Information) human assembly build 36.3 and dbSNP release 129 (following Holmans et al.9). For the definition of the gene boundaries we downloaded the ‘seq_gene.md’ file from the FTP website of NCBI. From this list of records we deleted genes coded as pseudo in the column ‘feature_type’. Subsequently, we selected the records with gene as ‘feature_type’ and reference as ‘group_label’. For these records, we assigned SNPs to genes when annotated between ‘chr_start’ (transcription start site) and ‘chr_stop’ (transcription stop site).
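The record selection and boundary-based assignment described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the column names `chromosome` and `feature_name`, the exact strings `GENE` and `reference`, and the inclusive treatment of both boundaries are assumptions, and the helper names are hypothetical.

```python
def select_reference_genes(records):
    """Keep gene records from the reference assembly.

    Requiring feature_type == GENE also drops records coded as pseudo.
    The literal values compared here are assumptions about the file contents.
    """
    return [
        r for r in records
        if r["feature_type"].upper() == "GENE"
        and r["group_label"].lower() == "reference"
    ]


def assign_snps_to_genes(snps, genes):
    """Map each SNP to every gene whose boundaries contain its position.

    snps  : iterable of (snp_id, chromosome, position) tuples
    genes : list of dicts with chromosome, chr_start, chr_stop, feature_name
    Boundaries are treated as inclusive (an assumption).
    """
    assignment = {}
    for snp_id, chrom, pos in snps:
        hits = [
            g["feature_name"] for g in genes
            if g["chromosome"] == chrom and g["chr_start"] <= pos <= g["chr_stop"]
        ]
        if hits:  # SNPs outside all gene boundaries remain unassigned (nongenic)
            assignment[snp_id] = hits
    return assignment
```

A SNP that falls within overlapping genes is assigned to each of them. The linear scan is adequate for illustration only; hundreds of thousands of SNPs would call for a sorted or interval-based lookup.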

Association analysis

SNP association analyses were carried out using additive models of allele counts. For the ISC data set, a correction for clustering within stratum (collection site) was performed.4 Cochran–Mantel–Haenszel tests implemented in PLINK were used for the association analyses (PLINK, Boston, MA, USA). All analyses were carried out separately for the ISC_affy5, ISC_affy6 and the GAIN data sets. Empirical P-values from the three data sets were combined by Stouffer's weighted Z-transform method23 to obtain an overall P-value.
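As an illustration of how per-sample P-values can be combined with a weighted Z-transform, the sketch below converts each one-sided P-value to a Z-score, forms a weighted sum and converts back. The weighting scheme is not specified in this section; weighting by the square root of sample size in the usage note is an assumption, as are the function name and the clipping constant `eps`.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_weighted(p_values, weights, eps=1e-12):
    """Combine one-sided P-values with Stouffer's weighted Z-transform.

    eps clips P-values away from 0 and 1 so the inverse normal CDF is defined
    (an empirical P-value from permutation can be exactly 0).
    """
    norm = NormalDist()
    z_scores = [norm.inv_cdf(1.0 - min(max(p, eps), 1.0 - eps)) for p in p_values]
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / sqrt(
        sum(w * w for w in weights)
    )
    return 1.0 - norm.cdf(combined_z)
```

For the three data sets used here, a call might look like `stouffer_weighted([p_affy5, p_affy6, p_gain], [sqrt(3353), sqrt(3556), sqrt(2729)])`, where the three P-values are placeholders for the per-sample results.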

Evaluating the combined effect of all SNPs in a functional gene group: the Σ-log(P) method

We summed the logarithm of the reciprocal of the P-values (denoted as Σ-log(P) method)24, 25 as previously applied to gene group analysis11 to determine the significance of the combined effect of SNPs annotated to genes in a functional group. The Σ-log(P) method takes the P-values from the association analyses of all genetic variants in a group, calculates the –log10 of each P-value and sums these values over the group to obtain the Σ-log(P) test statistic. To allow unbiased interpretation of the Σ-log(P) test statistic, 10 000 permutations were conducted, which are implicitly conditional on linkage disequilibrium, sample size, gene size, the number of SNPs per gene and the number of genes per group, by permuting affection status over genotypes. With this permutation procedure, only the relation between any genetic variant and affection status was disconnected, whereas linkage disequilibrium structure was kept intact. In addition, each group of genetic variants included the same (numbers of) SNPs and genes and had the same sample size as the original data set. For each permutation, we obtained the Σ-log(P) for each functional group. We then compared the observed Σ-log(P) of a group with its empirical distribution under permutation and calculated the empirical P-value as the proportion of permuted data sets in which the Σ-log(P) was higher than the observed Σ-log(P).
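The statistic and its permutation-based P-value can be sketched in a few lines. This is a simplified illustration under stated assumptions: the per-SNP test is a plain 1-df chi-square test on allele counts, used here only as a stand-in for the stratified Cochran–Mantel–Haenszel test run in PLINK, and all function names are hypothetical. Genotypes are coded as allele counts (0, 1 or 2) and affection status as 1 (case) or 0 (control).

```python
import math
import random


def allelic_p_value(genotypes, status):
    """1-df chi-square test on allele counts (stand-in for the real test)."""
    n_case = 2 * sum(status)              # number of alleles in cases
    n_ctrl = 2 * len(status) - n_case     # number of alleles in controls
    case_alt = sum(g for g, s in zip(genotypes, status) if s == 1)
    ctrl_alt = sum(genotypes) - case_alt
    total, alt = n_case + n_ctrl, case_alt + ctrl_alt
    if n_case == 0 or n_ctrl == 0 or alt == 0 or alt == total:
        return 1.0                        # test undefined: no evidence
    chi2 = 0.0
    for row_n, row_alt in ((n_case, case_alt), (n_ctrl, ctrl_alt)):
        for col_n, obs in ((alt, row_alt), (total - alt, row_n - row_alt)):
            expected = row_n * col_n / total
            chi2 += (obs - expected) ** 2 / expected
    # Upper tail of chi-square with 1 df; floor avoids log10(0) downstream.
    return max(math.erfc(math.sqrt(chi2 / 2.0)), 1e-300)


def sum_neg_log_p(p_values):
    """Sigma-log(P): sum of -log10(P) over all SNPs in the group."""
    return sum(-math.log10(p) for p in p_values)


def group_statistic(snp_genotypes, status):
    return sum_neg_log_p(allelic_p_value(g, status) for g in snp_genotypes)


def empirical_p(snp_genotypes, status, n_perm=10_000, seed=0):
    """Permute affection status over individuals; genotypes stay intact,
    so linkage disequilibrium between SNPs is preserved."""
    rng = random.Random(seed)
    observed = group_statistic(snp_genotypes, status)
    shuffled = list(status)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if group_statistic(snp_genotypes, shuffled) > observed:
            exceed += 1
    return observed, exceed / n_perm
```

Because whole status labels are shuffled while each individual's genotypes are left untouched, group size, SNP count and correlation between SNPs are identical in every permuted data set, which is the conditioning described above.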

Controlling for known polygenic effect on schizophrenia

The permutation approach described above provides information on how likely a given value of the combined effect of all SNPs in a group of genes is under the null hypothesis of no association of any SNP included in the functional group with the risk of schizophrenia (that is, self-contained testing).19 We additionally applied matched-control methods (that is, competitive testing) that test whether randomly drawn groups of SNPs/genes would provide a combined empirical P-value that is equally or more significant than the combined P-value from the group of synaptic genes. We created control gene groups matched for the number of genes (method 1) and groups that were matched for the effective number of SNPs, which could be drawn from all genic and nongenic SNPs (method 2), from genic SNPs only (method 3), from nongenic SNPs only (method 4) or from genic SNPs in brain-expressed genes only (method 5). The effective number of SNPs denotes the number of independent SNPs that is consistent with the empirical mean and variance of the distribution of the test statistic under the null hypothesis of no association26 (see Supplementary Material, section 2). Matching for the number of genes as well as for the effective number of SNPs in a functional gene group would be ideal but is highly limited, as only a few (<5) sets of gene groups can be created when the original group of genes is large (1026 in our case). We thus created matched control groups following the five methods described above, each testing a slightly different null hypothesis (see Table 1).

Table 1 Five applied competitive control methods to test whether synaptic genes are more strongly associated with the risk for schizophrenia than any other set of randomly grouped genes or single-nucleotide polymorphisms (SNPs)

For each of the five competitive test designs, 100 matched control groups were drawn. For each draw we carried out an association analysis of all SNPs in the matched control group in each of the three data sets, calculated the Σ-log(P) and then conducted 10 000 permutations of the data set to determine the empirical P-value of each of the 100 matched control groups of genes in each data set, similar to the actual analysis with the group of synaptic genes. These empirical P-values were combined across the three data sets using Stouffer's weighted Z-transform method.23 For the five control designs, we thus obtained five sets of 100 combined empirical P-values. We then calculated how often the true combined empirical P-value (from the synaptic gene group) was higher than the combined empirical P-value from the matched control groups and divided that by the number of draws. As there were 100 draws, the lowest empirical P-value of the combined empirical P-value that could be obtained was <0.01, when none of the combined empirical P-values from the random draws was equal or more significant than the combined empirical P-value from the synaptic gene group (see Figure 1 for a graphical overview of the steps in data analysis).
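The final counting step of the competitive test can be written compactly. The sketch below assumes the combined empirical P-values of the real group and of the matched control draws have already been computed; the function names and the reporting convention for zero hits are illustrative.

```python
def competitive_p(observed_p, control_ps):
    """Fraction of matched control draws whose combined empirical P-value is
    equal to or more significant (smaller) than that of the real gene group."""
    hits = sum(1 for p in control_ps if p <= observed_p)
    return hits / len(control_ps)


def report_competitive_p(observed_p, control_ps):
    """Format the result; with zero hits only an upper bound can be stated."""
    value = competitive_p(observed_p, control_ps)
    if value == 0.0:
        return f"<{1.0 / len(control_ps):.2f}"
    return f"{value:.2f}"
```

With 100 draws the resolution of this test is 0.01, which is why a result with no equally or more significant control group is reported as <0.01 rather than as 0.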

Figure 1

Overview of steps in data analysis. The arrows in red represent the flow for the real data whereas the blue arrows represent the flow for the control methods.

Enrichment tests of previously implicated genes

To test whether synaptic functional groups contained previously implicated genes more often than by chance, we retrieved all SNPs with P<1.0 × 10⁻⁵ from all significant loci reported in GWA studies for schizophrenia that were published before 14 February 2011, using the GWAS catalog,27 and mapped these loci to protein-coding genes (NCBI build v36.3). In addition, we added genes implicated from genome-wide copy number variation studies. Fisher's exact tests were used to determine the presence of enrichment.
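For a single functional group, the enrichment question reduces to a 2 × 2 table of genes (in group or not, previously implicated or not). The sketch below computes a one-sided Fisher's exact P-value from the hypergeometric distribution; whether the tests in this study were one- or two-sided is not stated here, so the one-sided form is an assumption, and the function and argument names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(implicated_in_group, group_size,
                        total_implicated, total_genes):
    """One-sided Fisher's exact test for over-representation.

    Returns P(X >= implicated_in_group), where X is the number of previously
    implicated genes in a random draw of group_size genes from total_genes,
    of which total_implicated are implicated.
    """
    upper = min(group_size, total_implicated)
    denominator = comb(total_genes, group_size)
    numerator = sum(
        comb(total_implicated, k)
        * comb(total_genes - total_implicated, group_size - k)
        for k in range(implicated_in_group, upper + 1)
    )
    return numerator / denominator
```

The choice of background (`total_genes`) matters: all protein-coding genes and all brain-expressed genes would give different P-values for the same group.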

Results

Are synaptic genes significantly associated with the risk of schizophrenia?

No individual SNP reached the threshold for genome-wide significance in any of the three data sets using a genome-wide association analysis for each data set separately (see Supplementary Material, Section 1). Functional gene group analysis of all 1026 synaptic genes jointly resulted in a significant association of the total group of pre- and post-synaptic genes with the risk of schizophrenia. This was true in all three samples separately and highly significant when combined across samples, with a combined P-value of 7.6 × 10⁻¹¹. For each sample, the Σ-log(P) obtained from the original analysis with all synaptic genes was in the higher end of the empirical distribution and highly significant with only one of 10 000 permutations exceeding the observed Σ-log(P) for the ISC_affy5, ISC_affy6 and GAIN data sets (see Figure 2).

Figure 2

Empirical distribution under the null hypothesis of no association between any single-nucleotide polymorphism (SNP) and the risk of schizophrenia. The Σ-log(P) obtained in the original analysis of all synaptic genes is indicated in red.

Are synaptic genes more significantly associated with the risk of schizophrenia than randomly drawn groups of genes/genetic variants?

Results from the five control methods show that SNPs in synaptic genes are more strongly associated with the risk of schizophrenia than any other set of randomly drawn genes. For none of the control methods did we find a combined empirical P-value that was more significant than the combined empirical P-value from the synaptic genes (see Figure 3). The ‘empirical P-value of the combined empirical P-value’ was <0.01 in all methods, suggesting that the group of synaptic genes is generally more strongly associated with schizophrenia than other groups of genes that either include the same number of genes, the same effective number of nongenic or genic SNPs, nongenic SNPs only, genic SNPs only, or the same effective number of SNPs drawn from brain-expressed genes only (see Supplementary Material, Section 2 and 3).

Figure 3

Overview of combined empirical P-values from the total group of synaptic genes and the three subgroups that were significant after correction for multiple testing, obtained from the analysis based on the actual functional gene groups (‘real’, red bars), the five most significant results from the 100 draws for each control method (green bars) as well as the average combined empirical P-value (blue bars) obtained from five control methods across 100 draws. Note that the combined empirical P-values for the real group analysis as well as those for each of the 100 draws in the 5 control methods are obtained from 10 000 permutations of the data and are the combined P-values across the three samples. For the ‘all synaptic genes’ group, none of the control methods resulted in a lower P-value than the real analysis (that is, all empirical P-values of the empirical P-values <0.01). For the intracellular signal transduction group, control methods 1, 2, 3, 4 and 5 resulted in empirical P-values of the empirical P-values of 0.02, 0.04, 0.03, 0.04 and 0.03, respectively. For Excitability, this was 0.02, 0.03, <0.01, <0.01 and 0.03, respectively, and for cell adhesion and trans-synaptic molecules signaling this was 0.03, 0.04, 0.04, 0.06 and 0.05, respectively. For a description of the competitive control methods and the different null hypotheses tested, see Table 1.

Which synaptic subgroups are most strongly related to the risk of schizophrenia?

We tested 17 synaptic subgroups and one group of synaptic genes that did not share a known function for association with schizophrenia. We found that three synaptic subgroups were significantly associated with increased risk of schizophrenia under the null hypothesis that none of the SNPs in these groups were associated with schizophrenia: the intracellular signal transduction group (P=0.0002), genes related to excitability (P=0.0009) and genes involved in cell adhesion and trans-synaptic signaling (CAT signaling; P=0.0024) (see Table 2). The matched control methods for these subgroups resulted in P-values between 0.02 and 0.04 for the intracellular signal transduction group, P-values between <0.01 and 0.03 for the excitability group and between 0.03 and 0.06 for the CAT signaling group (see Figure 3 and Supplementary Material, Section 4).

Table 2 Association of synaptic functional gene groups with schizophrenia in the three data sets

The signal of the most significant functional group (intracellular signal transduction) was mainly derived from the two ISC samples, whereas the GAIN sample contributed less to the overall evidence of significance of this group but contributed most to the association with the CAT signaling group. For the group of excitability genes, however, all three samples independently showed nominally significant or suggestive evidence. Quantile-quantile (Q-Q) plots (Supplementary Figures S3a–c) of the significant functional groups in the three samples show that for each functional group a multitude of SNPs in multiple genes, each of small effect, contribute to the overall significance, suggesting that the association cannot be explained by only a few genes in the group but rather by the joint effect of many genes in the functional group.

Synaptic subgroups include genes associated previously with schizophrenia

We tested whether the synaptic functional groups included genes for schizophrenia previously implicated from GWAS or copy number variation studies. The intracellular signal transduction group includes NRGN that was one of the most significant genes identified in the SGENE+-based GWAS,6 but was not below the genome-wide threshold in the ISC or MGS GWAS.5 The excitability group contains CACNA1C, which was one of the two most significant genes identified in a recent GWAS for bipolar disorder,28 and was recently also associated with schizophrenia.29 From the group of genes involved in CAT signaling, four genes were implicated previously in schizophrenia. Enrichment analysis of previously implicated genes in schizophrenia from GWAS and copy number variation studies indicated significant, although moderate, enrichment of previously associated genes in the total group of synaptic genes (P=0.02) using Fisher's exact test (see Supplementary Material, Section 5). This enrichment was mainly because of enrichment in the CAT signaling group (P=0.0002). However, three out of the four genes in CAT signaling group that were implicated previously were very large genes. As significant results from GWAS studies may be biased toward large genes, the enrichment test for the CAT signaling group needs to be interpreted with caution.

Discussion

Our overarching goal was to test whether genetic variation associated with schizophrenia risk accumulates in functional gene groups operating in the synapse. We showed that the total group of genes encoding proteins in the synapse was highly associated with the risk of developing schizophrenia with a combined P-value of 7.6 × 10⁻¹¹. In addition, the group of synaptic genes was more strongly associated with schizophrenia than any of the matched-control groups of genes (P<0.01). The functional gene group approach is a novel approach in which genes are grouped according to cellular function, and which goes across traditionally defined biological pathways, also referred to as horizontal versus vertical grouping.11 We used a manually curated database of functional gene groups, which tends to include more up-to-date annotation of gene function, especially for genes expressed in brain, than some of the databases available online. We do note, however, that gene function annotation is an ongoing endeavor and that annotation of functional gene groups is therefore continuously improved.

Apart from testing all genetic variants in synaptic genes as a group, we tested subgroups of synaptic functioning and found that three subgroups of synaptic functioning mainly drive the association of the synaptic gene group with schizophrenia: intracellular signal transduction (P=0.0002), excitability (P=0.0009) and CAT signaling (P=0.0024). In general, these associations were stronger than associations with matched-control groups of genes, except for CAT signaling (method 5, P=0.06), indicating that at least some groups of similar size to the CAT signaling group and consisting of SNPs in brain-expressed genes are more significantly associated with schizophrenia than the CAT signaling group. We do note, however, that the CAT signaling group overlaps with the cell adhesion molecule pathway from the KEGG (Kyoto Encyclopedia of Genes and Genomes) database that was previously associated with schizophrenia in the ISC.12

The group of intracellular signal transduction was most strongly associated with the risk of schizophrenia and includes the NRGN gene, which was one of the most significant loci identified in the (independent) SGENE+ GWAS,6 but—as a single marker effect—not below the threshold of significance in the individual samples on which the current analysis was based. In the samples included in our study, each individual SNP in the intracellular signal transduction group contributed very little to the risk of schizophrenia. However, combining their contributions resulted in a significant association.

Intracellular signal transduction in neurons and synapses is characterized by a high degree of crosstalk. A great variety of initial steps, such as activation of many different cell membrane receptors, leads to changes in a rather limited number of enzymes that generate second messengers (adenylyl cyclase, phospholipases) and a limited number of second messengers inside the cell (calcium, cyclic adenosine-monophosphate, cyclic guanosine monophosphate, inositol 1,4,5-triphosphate; reviewed in de Jong and Verhage30). Hence, it is plausible that genetic variation in the genes encoding these factors has similar biological consequences and additive contributions to pathogenesis.

The second most significant functional group (Excitability) regulates steady-state and action potential-induced ionic currents and membrane potential. Many different channels can contribute but they all allow a limited number of types of biologically relevant ions to pass. Hence, as for the group of intracellular signal transduction genes, it is plausible that genetic variation in the genes encoding these channels has similar biological consequences for cellular excitability, and thus additive contributions to pathogenesis.

For complex traits with evidence for large numbers of variants of small effect size contributing to disease risk—such as schizophrenia,4, 31 multiple sclerosis32 and type 1 diabetes mellitus33—it is of crucial importance to test whether a reported association with a group of genes is merely because of the polygenic nature of the disease or the biological function of that group of genes. Any large group of genes is likely to emerge from pathway or functional gene group analysis merely because of background polygenic effects to the risk of disease. Reporting a significant association with the group of synaptic genes may therefore seem rather trivial, as it merely confirms that synaptic genes are included in the multitude of genes related to schizophrenia. A more interesting question is thus whether the group of synaptic genes is more strongly related to schizophrenia than other randomly drawn groups of genes. To test this, we designed five methods in which we created matched-control groups of genetic variants. As the genetic variants were drawn from different pools, every control method tested slightly different null hypotheses, providing insight into how important an observed association with a group of genes is under a polygenic model of inheritance. We propose that such competitive tests for pathways or functional groups need to be included in any future pathway or functional gene group analysis.

In this study we investigated whether the accumulated effects of genetic variants in multiple genes may cause dysfunction of a biological system (for example, intracellular signal transduction), while a single genetic variant is not sufficient to cause disease. The functional gene groups we defined are characterized by redundancy, which is most likely accomplished by previous gene duplication. Over time, genetic mutations may arise causing different or less optimal protein function, which may thrive in a gene pool, thus leading to diversity or genetic heterogeneity. To some extent genes in the same functional group may functionally replace each other when others function suboptimally. Such redundancy and heterogeneity provide for fail-safe mechanisms, which render functional gene groups, like most other biological systems, robust. Robustness is a property that allows a system to maintain its functions against internal and external perturbations.34, 35

Typically, in different individuals a different set of mutations may be responsible for dysfunction. As a consequence, individuals with the same disease may have completely different genetic backgrounds, which is consistent with both a polygenic model of disease and a threshold model of disease but seriously hampers single-marker GWAS analysis, as it decreases the effect sizes of single SNPs/genes. When focusing on a functional gene group, it becomes less relevant which particular genes carry a mutation, whereas the number of genes carrying a mutation before the system becomes dysfunctional is much more important. Robustness, inherent to, for instance, synaptic protein networks and their underlying genes, may thus provide biologically meaningful ways to interpret the notion that ‘thousands of genes underlie complex traits’ and may provide important insights into the biological systems important in disease etiology (see Supplementary Material, Section 6 and 7).

Our current results suggest that multiple genes involved in synaptic functioning are important for schizophrenia, provide support for the synaptic hypothesis of schizophrenia15, 16, 17, 18 and provide tentative evidence for the involvement of the biological mechanisms involved in intracellular signal transduction, excitability and cell adhesion and trans-synaptic signaling molecules in schizophrenia.