Introduction
Bridging the gap between genes and brain structural connectivity is of the utmost importance to make further progress in neuroscience. One important reason for doing so is to unravel the internal wiring diagram of the brain often referred to as the connectome, given its generation by specific spatiotemporal patterns of gene expression during development and its fine-tuning by neural activity beyond that period (Kang et al.
2011; Henry and Hohmann
2012). Hence, gene expression has been suggested to explain aspects of the connectome that cannot be fully explained by its spatial constraints, such as heritability, the optimization of network cost-efficiency and the overdispersion of projections (Smit et al.
2008; Glahn et al.
2010; Fornito et al.
2011; van den Heuvel et al.
2013; Gǎmǎnuţ et al.
2018; Wang
2020b). Additionally, finding correlations between gene expression patterns and changes in endophenotypes such as cortical thickness has been used to understand aspects of neurodegenerative diseases, such as autism, Huntington’s disease, schizophrenia and Alzheimer’s disease (Rittman et al.
2016; Romme et al.
2017; Lein et al.
2017b; McColgan et al.
2018; Grothe et al.
2018; Romero-Garcia et al.
2019).
In Sperry (
1963), it was first suggested that there are correlations between connected neurons and their transcriptional profiles, which was termed the
chemoaffinity hypothesis. Since then, classical candidate-variant and GWAS studies have been used to search for relations between genetic variants and interindividual phenotypical variance related to a brain network of interest (Lein et al.
2017a; Luo et al.
2018). As a subsequent step, the emergence of brain-wide gene expression atlases paved the way for new types of hypothesis testing. In particular, these studies centered on investigating associations between the spatial organization of gene expression and properties related to brain structure or function (Lein et al.
2007; Hawrylycz et al.
2009; Keil et al.
2018). In tandem with this new era of spatial transcriptomics, Roy et al. and Zhu et al. investigated its proteomic counterpart. Specifically, they found postsynaptic protein profiles of excitatory synapses to be markers of synaptic diversity patterns across brain regions that account for different brain networks (Roy et al.
2018; Zhu et al.
2018).
The first studies to establish structural links between gene expression and connectivity were carried out in Caenorhabditis elegans (C. elegans), applying computational models to predict synaptic connections between neurons from their gene expression profiles (Kaufman et al.
2006; Baruch et al.
2008). Afterwards, multiple studies focused on the rodent brain and applied statistical and computational analyses finding relationships between gene expression and structural connectivity at the mesoscale level (French and Pavlidis
2011a,
2011b; Wolf et al.
2011; Rubinov et al.
2015; Fulcher and Fornito
2016). The common denominator between these studies was the finding of significant correlations across brain areas between network properties of the mesoconnectome, such as the number and strength of ingoing and outgoing connections, and correlated gene expression (CGE) patterns.
Fakhry et al. (
2015) applied the partial Mantel test to find relationships between gene expression and the projection target specificity of different source brain areas. This was the first analysis to be done on a volumetric level instead of using a graph representation of the connectome-based network properties, thus retaining a level of information that was closer to the original experimental data than before.
However, the partial Mantel test faces a number of limitations. First, it computes the correlation between multiple distance matrices, with the pairwise distance being taken across a shared dimension (Castellano and Balletto
2002). In Fakhry et al. (
2015), the shared dimension was at the level of brain areas and the original matrices used for estimating the distance matrices were the projection density, injection density and gene expression datasets, respectively.
Given that genes do not correspond to the shared dimension, the effects of genes on connectivity patterns can only be accounted for in a subsequent analysis. Second, a subsequent gene ranking strategy does not highlight modules of gene co-expression, modules of heavily interconnected areas or interactions between the two types of modules, whose importance in brain structure and function has been highlighted in multiple studies and which can serve as a dimensionality reduction strategy (Langfelder and Horvath
2008; Grange et al.
2014; Li et al.
2017; Kobak et al.
2019).
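To make the shared-dimension constraint concrete, the following is a minimal sketch of a partial Mantel test on toy data, using only the Python standard library. It is not the implementation used by Fakhry et al. (2015): the choice of Euclidean distances, the decision to control for the injection density distances, the permutation scheme, and all names (`partial_mantel`, `distance_matrix`, the toy matrices) are assumptions made for illustration. Note that every matrix has one row per brain area, so genes only enter through the area-by-area distances.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def distance_matrix(rows):
    """Euclidean distance between every pair of rows (one row per brain area)."""
    return [[math.dist(a, b) for b in rows] for a in rows]


def upper_triangle(d, order=None):
    """Flatten the upper triangle of d, optionally after relabelling the areas."""
    n = len(d)
    idx = list(range(n)) if order is None else order
    return [d[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]


def partial_r(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def partial_mantel(dx, dy, dz, n_perm=999, seed=0):
    """One-sided partial Mantel test of dx against dy, controlling for dz."""
    rng = random.Random(seed)
    y, z = upper_triangle(dy), upper_triangle(dz)
    observed = partial_r(upper_triangle(dx), y, z)
    order = list(range(len(dx)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)  # permute rows and columns of dx together
        if partial_r(upper_triangle(dx, order), y, z) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 12 areas, projection patterns that follow expression.
toy = random.Random(1)
n_areas = 12
expression = [[toy.gauss(0, 1) for _ in range(5)] for _ in range(n_areas)]
projection = [[g + 0.1 * toy.gauss(0, 1) for g in row] for row in expression]
injection = [[toy.gauss(0, 1) for _ in range(3)] for _ in range(n_areas)]

r, p = partial_mantel(
    distance_matrix(projection),
    distance_matrix(expression),
    distance_matrix(injection),
)
print(f"partial Mantel r = {r:.3f}, permutation p = {p:.3f}")
```

The test returns a single statistic for the whole gene set, which illustrates why the contribution of individual genes has to be assessed in a separate step.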
In this study we simultaneously identify links between the gene expression and the axonal projection density in the mouse brain, using volumetric data from the Allen Institute for Brain Science and applying a modified version of the Linked ICA method (Groves et al.
2011) to identify independent sources of information that link both modalities at the voxel level. This approach overcomes the limitations of post-hoc correlation strategies by providing multiple implicit linkages between groups of gene expression and projection density patterns, whose functional context can be validated by comparison with literature and ontology enrichment analysis.
Discussion
In this work we searched for links between gene expression and axonal projection densities in the mouse brain. Specifically, we used a modified version of the Linked ICA method (Groves et al.
2011) to link volumetric gene expression and axonal projection data, which were provided by the Allen Institute for Brain Science. In doing so, we identified independent components that account for shared spatial variance across both data modalities.
Initially, we created projection subsets from the three most densely sampled brain areas, namely the visual cortex, midbrain reticular nucleus and caudoputamen injection groups (see “
Projection Density”). For each group, we performed a local analysis and identified independent components whose spatial patterns exhibited high shared variance in brain areas related to the injected location (source) and its long-range projections. These results were validated against the literature, including known cortico-midbrain and cortico-striatal projections as well as connections within the cortex, brainstem and subcortical nuclei. Moreover, the results were largely preserved when the complete dataset of 498 injections was included in the analysis, indicating that Linked ICA preserves independent components under increasing data variance and size (see “
Local and Global Independent Components”, Table
3). The validity of these results was enhanced by consistency with previous studies and the well-established
Org.Mm.eg.db and
KEGG databases (see Table
2). This consistency was related to a number of detected white-matter tracts and to identified gene groups with functional annotations relative to neurotransmitter-relevant pathways, neuronal function and cell-type specific markers (Tasic et al.
2016,
2018a).
To our knowledge, this is the first study that identified data-driven links between volumetric gene expression and projection density in the mouse brain, instead of links between simplified graph representations at the level of brain areas. Thus, this work expands spatial transcriptomic-based and connectomic-based analyses to high-dimensional data.
We compared the local analyses with the global one because the local analyses rely on smaller connectivity sample sizes than the global analysis. Note that the gene expression sample size is fixed across all analyses. While the advantage of a larger sample size is clear, including different injected areas strongly increases the connectivity variance, so the larger sample might yield less specific results. Since it is not clear a priori which approach is optimal, we explored both and found that the spatial maps of the local analyses were significantly reproduced in the global analysis.
As an additional validation, we compared the components from Linked ICA with dictionaries from the DLSC technique, which explained exclusive variance from each data modality (
exclusive-DLSC) and shared variance (
concat-DLSC). We observed that a pairwise correlation between the spatial maps and the coefficients of both approaches revealed significant links between components and dictionaries that indicated high variance in the same brain regions. Therefore, these patterns of shared spatial variance were captured by multiple decomposition methods. A comparison of their reconstruction accuracy revealed that Linked ICA was superior to
concat-DLSC but slightly inferior to
exclusive-DLSC. Hence, Linked ICA was better suited to data fusion than to reconstruction, which is reasonable given that it focuses on explaining the variance of multiple modalities (see “
Comparison with Dictionary Learning and Sparse Coding”).
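The pairwise comparison described above can be sketched as follows. This is an illustrative example on synthetic maps, not the analysis code of this study: the function name `best_matches`, the toy data and the use of the absolute Pearson correlation are assumptions. The absolute value is used because the sign of a decomposition component is arbitrary.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_matches(components, dictionaries):
    """For each component map, return (index of best dictionary, |correlation|)."""
    matches = []
    for comp in components:
        scores = [abs(pearson(comp, d)) for d in dictionaries]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches


# Toy data (hypothetical): three spatial maps over 200 voxels, and dictionaries
# that are reordered, sign-flipped, noisy copies of those maps.
toy = random.Random(0)
n_voxels = 200
components = [[toy.gauss(0, 1) for _ in range(n_voxels)] for _ in range(3)]
source_of_dictionary = [2, 0, 1]  # dictionary k was built from this component
signs = [-1, 1, -1]
dictionaries = [
    [s * v + 0.2 * toy.gauss(0, 1) for v in components[k]]
    for k, s in zip(source_of_dictionary, signs)
]

matches = best_matches(components, dictionaries)
for i, (j, score) in enumerate(matches):
    print(f"component {i} best matches dictionary {j} (|r| = {score:.2f})")
```

A one-to-one matching of this kind cannot capture the case in which a component is better described by a mixture of several dictionaries, which would require a regression rather than a pairwise correlation.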
These findings suggest that relating both types of dictionaries using pairwise correlations is not trivial, since a gene dictionary might be more accurately represented as a mixture of projection dictionaries and vice versa. This points to the need for post-hoc regression analyses to identify the optimal mixtures of dictionaries. In Ji et al. (
2014) and Timonidis et al. (
2020), predictions of projection patterns as sparse linear combinations of gene expression patterns were shown to be significant when representing both modalities at the level of brain areas. However, Linked ICA provides an advantage in terms of interpretation, since reconstructing both data modalities is implicitly modelled by the method instead of requiring post-hoc analyses.
We acknowledge some limitations. Unlike the Diffusion Tensor Imaging technique that uses seeds to directly represent source locations (Le Bihan and Breton
1985), the injected locations were indirectly represented by the feature space of the projection matrix. This made it difficult to relate the identified components to axonal pathways. To augment the source representation, incorporating single-neuron morphological data could shed light on projection motifs that have not been covered by tract-tracing data, as shown in (Han et al.
2018). An exemplary repository was made available by the
Mouselight project, where they have provided reconstructions of long-range projections from
\(\sim \)1000 individual neurons in the mouse brain (Gerfen et al.
2016; Hooks et al.
2018; Economo et al.
2018,
2019; Winnubst et al.
2019). Such data could be fused together with the bulk tracing data and the gene expression data using Linked ICA, with the resulting components linking genes to previously unidentified projection motifs. A preliminary evaluation of this strategy can be found in Supplementary Material Section
1.10, where we have linked single-neuron morphology data with the other two modalities using Linked ICA. We show that the resulting spatial patterns highlight brain regions shown in previous studies (Winnubst et al.
2019), and that they can be used to complement tract-tracing data from less sampled brain regions in the Allen Mouse Brain Connectivity Atlas, such as the motor cortex.
Second, the cell-type specificity of components was evaluated through ontology enrichment analysis and comparison with literature. Note that we could not present direct evidence of cell-type specificity, since the 200
μm³ spatial resolution of the data is insufficient to resolve the cellular level. The relation to cytoarchitecture is important, since it has been shown in the literature that connected brain areas have similar synaptic and protein profiles (Sperry
1963; Roy et al.
2018; Zhu et al.
2018). Therefore, relating cell-type-specific densities or expression patterns to connectome-based data is crucial for understanding the causative factors that link molecules to brain structure. A pivotal step would be to incorporate single-cell RNA sequencing data with the use of tools such as
SEURAT (Satija et al.
2015) for identifying cell-types with less bias and imputing missing data that were caused by the 200
μm thick sections of the original ISH volumes along the posterior-anterior axis (Lein et al.
2007). Important single-cell RNA-seq sources can be found in Tasic et al. (
2016,
2018b), and Mancarci et al. (
2017).
A question that can arise is whether the observed links can be attributed to spatial autocorrelation, meaning an increased connection likelihood and correlated gene expression between nearby brain regions. It is well known that spatial gene expression patterns have a strong spatial autocorrelation that reflects the mouse brain cytoarchitecture (French et al.
2011a,
2011b). Previous studies have shown that highly correlated gene expression patterns both exhibit strong global spatial autocorrelation and spatially overlap with connectivity networks (Richiardi et al.
2015; Pantazatos and Schmidt
2020). Linked ICA automatically estimates spatial degrees of freedom that are included in the cost function (Groves et al.
2011). Therefore, spatial autocorrelation is carried downstream by our analysis, since there is no explicit correction for it in the output spatial maps. However, previous studies have identified significant correlations between connectivity and gene expression when correcting for spatial autocorrelation by regressing the correlations on distance and assessing the significance of the residuals (French and Pavlidis
2011a). Thus, we acknowledge that there might be relevant statistical links between structural connectivity and gene expression beyond spatial autocorrelation that need further characterization (Fulcher and Fornito
2016; Fornito et al.
2019). This could be exemplified by components exhibiting high variance in brain areas distal to each other and the injected region, with a relatively balanced contribution between both modalities suggesting a strong linkage beyond spatial autocorrelation. Exemplar cases include
vis ICAs 4,7,8 and
cp ICA 0 (see Fig.
5a-c for the modality contributions and Fig.
3 for the spatial maps). For instance, the major highlighted areas in
vis ICA 4 were striatum, both dorsal and ventral regions, and cerebellum-related fiber tracts (see Supplementary Material Table
4), which cannot be fully attributed to spatial proximity. Moreover, the identification of glutamatergic markers in the gene modules of these components (see Supplementary Material Table
3) could explain the presence of long-range projection patterns between the distal areas, given that glutamatergic neurons from one area are known to project to different brain areas (Tasic et al.
2018b). Determining the underlying cellular subtypes of these markers could shed light on the regional and layer-specific projection preferences of these areas. Consequently, links between connectivity and genes might be more localized than presented in this work; the 200
μm resolution at the source level smooths out cell-type-specific patterns and their potential links, hence retrieving data at a resolution of 25
μm or finer could resolve this issue (Chevée et al.
2018; Han et al.
2018; Huang et al.
2020; Kim et al.
2020).
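The distance correction mentioned above (regressing on distance and examining the residuals) can be illustrated with a small synthetic example. This is a simplified sketch and not the procedure of French and Pavlidis (2011a): the linear distance model, the function names and the toy data are assumptions, and a full analysis would additionally assess the significance of the residual correlation.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residuals(y, x):
    """Residuals of an ordinary least-squares fit of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


def residual_correlation(connectivity, coexpression, distance):
    """Correlation between two measures after regressing both on distance."""
    return pearson(residuals(connectivity, distance), residuals(coexpression, distance))


# Toy data (hypothetical): 300 region pairs whose connectivity and correlated
# gene expression both decay with distance.
toy = random.Random(2)
n_pairs = 300
distance = [toy.uniform(0, 1) for _ in range(n_pairs)]
shared = [toy.gauss(0, 0.2) for _ in range(n_pairs)]  # distance-independent link

# Case 1: the two measures share a component beyond distance.
connectivity = [-d + s + toy.gauss(0, 0.1) for d, s in zip(distance, shared)]
coexpression = [-d + s + toy.gauss(0, 0.1) for d, s in zip(distance, shared)]

# Case 2: distance is the only thing the two measures have in common.
null_connectivity = [-d + toy.gauss(0, 0.1) for d in distance]
null_coexpression = [-d + toy.gauss(0, 0.1) for d in distance]

raw_null = pearson(null_connectivity, null_coexpression)
corrected_null = residual_correlation(null_connectivity, null_coexpression, distance)
corrected_linked = residual_correlation(connectivity, coexpression, distance)
print(f"distance only: raw r = {raw_null:.2f}, corrected r = {corrected_null:.2f}")
print(f"shared component: corrected r = {corrected_linked:.2f}")
```

In the second case the raw correlation is high but vanishes after the correction, whereas in the first case a clear residual correlation remains. This is the sense in which a link can extend beyond spatial autocorrelation.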
A follow-up question to the spatial autocorrelation issue is whether the underlying causal factors connecting both modalities can be uncovered through Linked ICA. The linked patterns are spatial: the expression density of a particular gene can be expressed as a pattern across regions, which matches the patterning of the fluorescent labels in the connectivity data. Thus, the link could be an epiphenomenon. There are two approaches for validating the causality of the link. The first is a gene ontology analysis that establishes a functional role of the involved genes in generating the projection, for instance, by being expressed in the subset of neurons that make up the projection. For that reason, integrating single-cell RNA-seq data is useful. The second is experimental manipulation of the identified genes (Polleux
2005; Miller et al.
2010; De la Rossa et al.
2013; Daimon et al.
2015; Razoux et al.
2017; Goodman and Bonni
2019). The goal of this paper is to provide a toolbox that can generate such hypotheses and be used to design experimental studies to validate them. It has to be borne in mind that the developmental processes generating the projections have finished by the time the gene expression patterns are quantified (French and Pavlidis
2011a), so a direct link is difficult to establish and must rely on the turnover of molecules at the synapse.
While the main focus of this study was to find links between genes and projection patterns in the mouse mesoconnectome, we aim to go beyond qualitative descriptions of such links and move towards quantitative tests. We intend to do so by manipulating the expression of genes of interest according to their functional gene modules and then predicting the brain-wide changes in projection densities. This would make it possible to test a number of neurodegenerative disease-related hypotheses in silico. For two preliminary test cases highlighting the predictive capabilities of Linked ICA, readers are referred to Supplementary Material Section
1.11 in which controlled manipulations of gene expression patterns lead to changes in projection patterns.
Subsequently, the activity of the resulting structural patterns could be tested in frameworks such as the Virtual Mouse Brain (Sanz-Leon et al.
2013; Ritter et al.
2013; Woodman et al.
2014) or it could be validated by electrophysiology-based experiments. These approaches could be useful for translational neuroscientists.
Taken together, we have built and validated a novel paradigm for linking gene expression and structural projection patterns in the mouse mesoconnectome, based on volumetric data from the Allen Institute and using a modified version of the Linked ICA method. A comparison with the DLSC technique and the preservation of the results under increasing data volume suggest robustness of the method in capturing independent components of shared variance across both modalities. Finally, our method provides a relevant framework, illustrated through a number of use cases, which could support studies aiming to relate genes to brain function.
A Matlab implementation of the original Linked ICA method was provided online by Llera et al. (
2019) on GitHub. Additionally, the meta-analytic steps of our analysis have been designed and tested in the form of a Jupyter Notebook and have been published online with their descriptions on GitHub. The Notebook has been incorporated in the Connectomic-Composition-Predictor (CCP), a neuroinformatics tool that we developed in our previous work (Timonidis et al.
2020). Regarding the modifications made to Linked ICA in this work, a potential user can consult the steps described in Supplementary Material Section
1.1. See Main Table
2 for links to the Matlab code and to the Notebook mentioned here.