Mapping gene expression in the brain
Computational analysis of spatial and temporal gene expression data in the brain
Analyzing the expression patterns of genes in the brain
Gene expression visualization
Summary statistics and visualization-based methods
Box 1 | Gene Sets
Complex biological functions and disorders usually involve several genes rather than a single gene. Gene sets are groups of genes that share common biological functions and that can be defined based on either prior knowledge (e.g. about biochemical pathways or diseases) or experimental data (e.g. transcription factor targets identified using ChIP-seq). Gene set databases organize existing knowledge about these groups of genes by arranging them in sets that are associated with a functional term, such as a pathway name or a transcription factor that regulates the genes. Gene sets can be classified into five types:
Gene Ontology (GO)
The Gene Ontology project (Ashburner et al. 2000) developed three hierarchically structured vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions. Genes annotated with the same GO term(s) constitute a gene set.
Biological Pathways
Biological pathways are networks of molecular interactions underlying biological processes. Pathway databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Ogata et al. 1999) and REACTOME (Croft et al. 2014), catalog physical entities (proteins and other macromolecules, small molecules, complexes of these entities and their post-translationally modified forms), their subcellular locations and the transformations they can undergo (biochemical reactions, association to form a complex, and translocation from one cellular compartment to another).
Transcription
Transcription databases include information on the regulation of genes by transcription factors (TFs) binding to the DNA, or on post-transcriptional regulation by microRNAs binding to the mRNA. These physical interactions can be determined either in silico using computational inference (motif enrichment analysis) or from experimental data (such as ChIP-seq and microRNA binding data). For motif enrichment analysis, position weight matrices (PWMs) from databases such as TRANSFAC (Matys et al. 2006) and JASPAR (Portales-Casamar et al. 2010) can be used to scan the promoters of genes in the region around the transcription start site (TSS). ChIP-seq data, such as the large collection of experiments from the Encyclopedia of DNA Elements (ENCODE) project (Bernstein et al. 2012b) and the Roadmap Epigenomics consortium (Consortium 2015a), are used to identify genes targeted by the TFs. Similarly, microRNA targets can be extracted from databases such as TargetScan (Lewis et al. 2003).
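As a rough illustration of motif scanning, the sketch below converts a count matrix into log-odds weights and scores every window of a promoter fragment. The 4-bp count matrix, the uniform background, the pseudocount and the sequence are all invented for illustration; real matrices would be taken from a motif database.

```python
import math

# Hypothetical count matrix for a 4-bp motif (one dict per motif position).
MOTIF_COUNTS = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]
BACKGROUND = 0.25   # assumed uniform base frequencies
PSEUDOCOUNT = 1.0   # assumed smoothing to avoid log(0)


def to_log_odds(counts):
    """Convert per-position counts into log2-odds weights (the PWM)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * PSEUDOCOUNT
        pwm.append({base: math.log2(((n + PSEUDOCOUNT) / total) / BACKGROUND)
                    for base, n in column.items()})
    return pwm


def scan(sequence, pwm):
    """Return (offset, score) for every window of the motif's width."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        hits.append((start, sum(pwm[i][base] for i, base in enumerate(window))))
    return hits


promoter = "TTGACGTAACGTCC"  # made-up promoter fragment
pwm = to_log_odds(MOTIF_COUNTS)
best_offset, best_score = max(scan(promoter, pwm), key=lambda hit: hit[1])
best_site = promoter[best_offset:best_offset + len(pwm)]
```

In practice, a score threshold (or an enrichment test against background sequences) would decide which windows count as putative binding sites.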
Cell-type markers
Cell type-specific transcriptional data provide a rich source of cell type marker genes. A gene is identified as a marker of a cell type if it is up-regulated in that cell population compared to other cell populations. Several studies have used microarrays and RNA-seq to profile the transcriptomes of a number of neuronal cell types (Cahoy et al. 2008; Zhang et al. 2014). More recently, studies have used single-cell sequencing to capture the transcriptomes of individual neuronal cells (Darmanis et al. 2015; Zeisel et al. 2015).
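The up-regulation criterion can be sketched as a simple fold-change filter. The expression values, the pseudocount and the 4-fold threshold below are assumptions chosen for illustration; published marker lists typically also apply statistical testing.

```python
import math

# Toy expression values per gene and cell population; all numbers are invented.
expression = {
    "GeneA": {"neuron": 120.0, "astrocyte": 4.0, "microglia": 3.0},
    "GeneB": {"neuron": 10.0, "astrocyte": 95.0, "microglia": 8.0},
    "GeneC": {"neuron": 30.0, "astrocyte": 28.0, "microglia": 33.0},
}
MIN_LOG2_FC = 2.0  # assumed threshold: at least 4-fold enrichment


def marker_genes(expression, min_log2_fc=MIN_LOG2_FC, pseudo=1.0):
    """Map each cell type to the genes enriched in it relative to all others."""
    markers = {}
    for gene, by_type in expression.items():
        for cell_type, value in by_type.items():
            others = [v for t, v in by_type.items() if t != cell_type]
            mean_other = sum(others) / len(others)
            log2_fc = math.log2((value + pseudo) / (mean_other + pseudo))
            if log2_fc >= min_log2_fc:
                markers.setdefault(cell_type, []).append(gene)
    return markers


markers = marker_genes(expression)
```

Here GeneC is expressed at similar levels in all populations and is therefore not assigned to any cell type.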
Disease
Genes can be grouped into sets based on their association with the same disease. Public databases, such as OMIM (2015a) and DisGeNET (Pinero et al. 2015), contain curated information from the literature and public sources on gene-disease associations. Disease-related gene sets can also be obtained by collecting genes that harbor variants identified using GWAS (Simón-Sánchez and Singleton 2008; Welter et al. 2014), exome sequencing (2015b), or whole-genome sequencing.
Identifying genes with localized expression patterns
Spatial and temporal gene co-expression
Box 2 | Dimensionality reduction
The high dimensionality of transcriptomes and other biological data (e.g. proteomes and epigenomes) poses a challenge for visualization as well as for selecting informative features for clustering and classification. Dimensionality reduction approaches aim to find a smaller number of features that adequately represent the original high-dimensional data in a lower-dimensional space. Principal component analysis (PCA) is the most commonly used dimensionality reduction method. Despite its utility, PCA can only capture linear relationships, whereas non-linear relationships are inherent in many biological applications. Several non-linear dimensionality reduction techniques have been proposed, e.g. Isomap (Tenenbaum et al. 2000); see (Lee and Verleysen 2005) for an extensive review. The t-distributed stochastic neighbor embedding (t-SNE) method (Maaten and Hinton 2008) has been widely used to visualize biological data in two dimensions; it primarily preserves the local neighborhood relationships between data points in the high-dimensional space, while retaining some of the global structure (Saadatpour et al. 2015).
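As a minimal sketch of linear dimensionality reduction, the code below computes the first principal component of a small samples-by-genes matrix. Power iteration is used only to keep the example free of dependencies (a library SVD routine would be the usual choice), and the data are invented: two genes co-vary strongly and a third is nearly constant.

```python
import math


def first_principal_component(samples, iterations=200):
    """Return (direction, scores) of the first PC using power iteration."""
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in samples]
    # Sample covariance matrix between features (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    vec = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting vector
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Project each centered sample onto the component to get 1-D coordinates.
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centered]
    return vec, scores


# Rows are samples, columns are genes (toy values).
samples = [
    [1.0, 2.1, 5.0],
    [2.0, 3.9, 5.1],
    [3.0, 6.2, 4.9],
    [4.0, 7.8, 5.0],
    [5.0, 10.1, 5.1],
]
direction, scores = first_principal_component(samples)
```

The component loads mainly on the two co-varying genes, so a single coordinate per sample captures most of the variance in this toy dataset.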
Box 3 | Clustering
Clustering is the unsupervised learning process of identifying distinct groups of objects (clusters) in a dataset (Duda et al. 2000). There are two main types of clustering: hierarchical and partitional. Hierarchical (agglomerative) clustering algorithms start by calculating all pair-wise similarities between samples and then build a dendrogram by iteratively merging the most similar samples or groups of samples. Cutting the tree at an appropriate height groups the samples into clusters. Partitional clustering, on the other hand, divides the data directly into a set of clusters, typically by fitting a number of simple models (e.g. cluster centers) to the data. Examples of partitional clustering include k-means, Gaussian mixture models (GMMs), density-based clustering, and graph-based methods.
To cluster the samples hierarchically, the pair-wise similarities between all samples Si and Sj are calculated. Samples are then grouped iteratively based on the calculated similarities (the most similar are grouped first). Once the full dendrogram is built, a cut-off (dashed line) is used to divide the samples into groups. For k-means, the number of clusters is set based on the data heatmap. K-means groups samples by minimizing the within-cluster sum of squared distances between each point in a cluster and the cluster center.
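The k-means objective described above can be sketched with Lloyd's algorithm. Starting the centers at the first k samples is a simplification made here for reproducibility (random or k-means++ initialization is more common), and the two-gene samples are invented.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm; centers start at the first k points (a simplification)."""
    centers = [list(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest cluster center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def within_cluster_ss(points, labels, centers):
    """Within-cluster sum of squared distances, the quantity k-means minimizes."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


# Toy samples measured on two genes, forming two visible groups.
samples = [[1.0, 1.2], [5.0, 5.1], [0.8, 1.0], [1.1, 0.9], [5.2, 4.8], [4.9, 5.0]]
labels, centers = kmeans(samples, k=2)
wcss = within_cluster_ss(samples, labels, centers)
```

Because the result depends on the initial centers, k-means is usually run several times and the solution with the lowest within-cluster sum of squares is kept.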
Box 4 | Classification
Classification is the supervised learning process of labeling unseen objects (test set) given a set of labeled objects (training set) (Duda et al. 2000). Classification approaches can be divided into Bayesian methods and prediction error minimization methods. The former group is based on Bayesian decision theory and uses statistical inference to find the best class for a given object. Bayesian methods can be further divided into parametric classifiers (e.g. the nearest-mean classifier and hidden Markov models) and non-parametric classifiers (e.g. the Parzen window or k-nearest neighbor classifier). Alternatively, classifiers can be designed to minimize a measure of the prediction error. Well-known classifiers in this category include regression classifiers (e.g. Lasso regression), support vector machines, decision trees and artificial neural networks. Neural networks (in particular deep learning) have become very successful in solving problems in a wide range of applications, including bioinformatics (Xiong et al. 2014; Alipanahi et al. 2015; Engelhardt and Brown 2015).
A low-dimensional embedding of the samples is generated using two features (genes). A Bayesian classifier assigns each sample to one of the two classes (Disease or Healthy) based on statistical inference. A prediction error-minimization classifier iteratively updates the classification boundary (dashed line) based on the prediction error and terminates when a stopping criterion is met.
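As an example of a parametric classifier, the sketch below implements a nearest-mean classifier on two features (genes). It coincides with the Bayes decision rule only under the assumptions of equal class priors and identical spherical Gaussian class distributions; the training values are invented.

```python
def fit_class_means(samples, labels):
    """Estimate one mean vector per class from the training set."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def predict(sample, class_means):
    """Assign the sample to the class whose mean is closest (squared distance)."""
    def distance(label):
        return sum((x - m) ** 2 for x, m in zip(sample, class_means[label]))
    return min(class_means, key=distance)


# Toy training set: expression of two genes per sample.
train_samples = [[1.0, 1.1], [0.8, 1.3], [1.2, 0.9], [4.0, 4.2], [3.8, 3.9], [4.3, 4.1]]
train_labels = ["Healthy", "Healthy", "Healthy", "Disease", "Disease", "Disease"]

class_means = fit_class_means(train_samples, train_labels)
prediction_a = predict([3.5, 4.2], class_means)
prediction_b = predict([0.9, 1.4], class_means)
```

The decision boundary of this classifier is the line equidistant from the two class means, which is one simple instance of the dashed boundary described in the figure legend.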
Gene co-expression networks
Box 5 | Co-expression Measurements
Gene co-expression is widely used for functional annotation, pathway analysis, and the reconstruction of gene regulatory networks. Co-expression measurements assess the similarity between a pair of gene expression profiles by detecting bivariate associations between them. These co-expression measurements can be summarized in the following categories (Kumari et al. 2012; Allen et al. 2012; Song et al. 2012; Wang et al. 2014):
Correlation
The most widely used co-expression measure is Pearson correlation, due to its straightforward conceptual interpretation and computational efficiency. However, Pearson correlation can only capture linear relationships between variables. Alternatively, Spearman correlation is a nonparametric, rank-based measure that captures monotonic associations, including non-linear ones. Other correlation-based methods include Renyi correlation, Kendall rank correlation, and biweight midcorrelation.
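The difference between the two measures can be illustrated with a monotonic but non-linear relationship. The sketch below computes Spearman correlation as the Pearson correlation of ranks (ties receive their average rank); the two profiles are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


gene_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_y = [v ** 3 for v in gene_x]  # monotonic but non-linear in gene_x

r_pearson = pearson(gene_x, gene_y)
r_spearman = spearman(gene_x, gene_y)
```

For these profiles the rank-based measure reports a perfect association, whereas the Pearson coefficient is lower because the relationship is not a straight line.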
Partial correlation
Partial correlation measures the direct relationship between a pair of variables after removing the effect of the other variables, thereby excluding indirect relationships. In Gaussian graphical models, conditional dependencies between variables correspond to the non-zero off-diagonal entries of the precision matrix (the inverse of the covariance matrix).
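A small sketch with three genes shows how partial correlation removes an indirect relationship. It uses the first-order formula, which for three variables gives the same value as the precision-matrix expression -p_ij / sqrt(p_ii p_jj); the profiles are invented, with a hypothetical regulator driving two targets.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical regulator and two targets that both follow it, plus small noise.
regulator = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noise_a = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3]
noise_b = [-0.1, 0.3, -0.3, 0.2, 0.4, -0.4, 0.1, -0.2]
target_a = [r + e for r, e in zip(regulator, noise_a)]
target_b = [r + e for r, e in zip(regulator, noise_b)]

r_direct = pearson(target_a, target_b)
r_partial = partial_correlation(target_a, target_b, regulator)
```

The two targets are strongly correlated with each other, but the partial correlation given the regulator is close to zero, indicating that their association is indirect.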
Mutual information
Mutual information-based methods measure general statistical dependence between two variables. Based on information theory, mutual information does not assume linear or even monotonic relationships and hence can capture non-linear dependencies.
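A minimal sketch of a histogram-based mutual information estimate is given below. Equal-width binning with three bins is an assumption made for illustration (other estimators, such as kernel- or nearest-neighbor-based ones, exist), and the profiles are invented so that the relationship is non-monotonic.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def discretize(values, bins):
    """Assign each value to one of `bins` equal-width bins (values must vary)."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    return [min(bins - 1, int((v - low) / width)) for v in values]


def mutual_information(x, y, bins=3):
    """Mutual information (in bits) estimated from binned values."""
    bx, by = discretize(x, bins), discretize(y, bins)
    n = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log2(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())


gene_x = list(range(-5, 6))
gene_y = [v * v for v in gene_x]  # U-shaped, hence non-monotonic, dependence

r = pearson(gene_x, gene_y)
mi = mutual_information(gene_x, gene_y)
```

For this U-shaped relationship the Pearson correlation is zero, while the mutual information is clearly positive. Note that the estimate depends on the number of bins and on the sample size.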
Other measures
Euclidean distance, cosine similarity, Kullback-Leibler divergence, Hoeffding’s D, distance covariance, and probabilistic measures (as used in Bayesian networks).
Co-expression of disease-related genes
Box 6 | Co-expression Networks
Gene co-expression networks provide a framework to uncover the molecular mechanisms underlying biological processes based on gene expression data. A co-expression network consists of nodes that represent genes and edges that encode the co-expression between pairs of genes. A weighted network is a network in which the edges have continuous values that indicate the strength of co-expression. Networks with binary edges (an edge either exists or not) are termed binary networks. Analysis of co-expression networks can be summarized in four main steps:
Network Construction
The first step in building a co-expression network is to construct a similarity matrix by quantifying the similarity between the expression profiles of each pair of genes (i.e. their co-expression). Several methods to measure gene co-expression are discussed in Box 5. For non-regularized estimations of co-expression, all off-diagonal elements of this similarity matrix will be nonzero. These similarities can be taken as edge weights in the network, but this yields a fully connected network (each gene is connected to every other gene). An additional step is to threshold the similarity matrix, either to prune weak edges or to binarize the similarities (absent/present) to obtain an adjacency matrix. In the latter case, pairs of genes with co-expression values above a threshold are connected in a binary network. In the weighted gene co-expression network analysis (WGCNA) framework, the similarity matrix undergoes a power transformation and a weight diffusion step to optimize the topological properties and stability of the network (Zhang and Horvath 2005).
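The construction step can be sketched as follows: an absolute Pearson similarity matrix is computed and then either binarized with a hard threshold or raised to a power (soft thresholding). The threshold of 0.8, the power of 6 and the expression profiles are assumptions for illustration, and the weight diffusion step mentioned above is not included here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def similarity_matrix(profiles):
    """Unsigned co-expression: |Pearson r| for every pair of genes."""
    genes = sorted(profiles)
    sim = [[abs(pearson(profiles[a], profiles[b])) for b in genes] for a in genes]
    return genes, sim


def hard_threshold(sim, tau):
    """Binary adjacency: connect gene pairs whose similarity is at least tau."""
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= tau else 0 for j in range(n)]
            for i in range(n)]


def soft_threshold(sim, beta):
    """Weighted adjacency: raise similarities to the power beta."""
    n = len(sim)
    return [[sim[i][j] ** beta if i != j else 0.0 for j in range(n)]
            for i in range(n)]


# Toy profiles: four genes measured in six samples.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "g2": [2.0, 4.0, 6.0, 8.0, 10.0, 12.5],
    "g3": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    "g4": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}
genes, sim = similarity_matrix(profiles)
adjacency = hard_threshold(sim, tau=0.8)   # assumed threshold
weights = soft_threshold(sim, beta=6)      # assumed power
```

The power transformation keeps all edges but pushes weak similarities towards zero, so no information is discarded by an arbitrary cut-off. Using the absolute correlation treats anti-correlated genes (g1 and g3 here) as co-expressed; a signed network would handle them differently.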
Network Characterization
The obtained networks can be analyzed in a number of ways. Topological measures characterize the structure of the network, and quantify the importance of genes in their network context. These measures have been extended to weighted networks (Zhang and Horvath 2005), and can capture topology on different levels of scale (Hulsman et al. 2014). Sets of networks can also be aligned and compared (Przulj 2007; Hayashida and Akutsu 2010; Fionda 2011). Network comparison can be used either to assess changes between different conditions, or to replicate a network in an independent dataset for validity assessment.
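A few basic topological measures can be sketched on a small binary network. The adjacency matrix is invented, and the measures shown (degree, local clustering coefficient and density) are only a small selection of those in use.

```python
def degrees(adjacency):
    """Number of connections per node (sum of edge weights if weighted)."""
    return [sum(row) for row in adjacency]


def clustering_coefficient(adjacency, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = [j for j, a in enumerate(adjacency[node]) if a]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(adjacency[u][v]
                for i, u in enumerate(neighbors) for v in neighbors[i + 1:])
    return 2.0 * links / (k * (k - 1))


def density(adjacency):
    """Fraction of all possible edges that are present (undirected network)."""
    n = len(adjacency)
    return sum(degrees(adjacency)) / (n * (n - 1))


# Toy undirected binary network with four genes.
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
node_degrees = degrees(adjacency)
hub = max(range(len(adjacency)), key=lambda i: node_degrees[i])
cc_hub = clustering_coefficient(adjacency, hub)
network_density = density(adjacency)
```

Applied to a weighted adjacency matrix, `degrees` returns the sum of edge weights per gene, which is one common way of extending connectivity to weighted networks.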
Module Identification
To interpret a network, it can be divided into sub-networks, or gene modules. To do this, the network edges are often treated as similarities in a clustering approach (see Box 3). Alternatively, graph properties, such as topological overlap or modularity, can be used to divide a network into modules (Blondel et al. 2008).
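As one example of such a graph property, the sketch below computes the topological overlap between all pairs of nodes in a binary network, using the commonly used form (shared neighbors + direct link) / (min(degree) + 1 - direct link). The six-gene network, made of two triangles joined by a single edge, is invented.

```python
def topological_overlap(adjacency):
    """Pairwise topological overlap for an undirected binary network."""
    n = len(adjacency)
    k = [sum(row) for row in adjacency]
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Count the neighbors that genes i and j have in common.
            shared = sum(adjacency[i][u] * adjacency[u][j]
                         for u in range(n) if u not in (i, j))
            a = adjacency[i][j]
            tom[i][j] = (shared + a) / (min(k[i], k[j]) + 1 - a)
    return tom


def build_adjacency(n, edges):
    """Symmetric binary adjacency matrix from an undirected edge list."""
    adjacency = [[0] * n for _ in range(n)]
    for u, v in edges:
        adjacency[u][v] = adjacency[v][u] = 1
    return adjacency


# Two triangles (genes 0-2 and genes 3-5) connected by one bridging edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = build_adjacency(6, edges)
tom = topological_overlap(adjacency)
```

Gene pairs inside a triangle obtain a high overlap, whereas the bridging pair (genes 2 and 3) obtains a low one even though the two genes are directly connected. Using 1 - overlap as a dissimilarity in hierarchical clustering (see Box 3) would therefore tend to separate the two triangles into modules.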
Module Characterization
Finally, modules can be characterized using a wide range of approaches. The expression profile of genes within the same module can be summarized using the average or the first principal component (also called the eigengene (Oldham et al. 2006)). Alternatively, one can characterize a module according to its hub genes: the genes with the largest number of connections within the module. Another option is to assess the association of a module with external data by testing for statistical enrichment of various gene sets (see Box 1 for different types of gene sets). In addition, modules can be characterized based on changes between conditions (e.g. health and disease) in their summary statistics (average expression profile), their topological measures (inter-connectivity), or the number of differentially expressed genes they include.
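Gene set enrichment of a module is often assessed with a one-sided hypergeometric (Fisher-type) test, sketched below. The background size, module size, gene set size and overlaps are hypothetical numbers, and in practice the resulting p-values would be corrected for the number of gene sets tested.

```python
from math import comb


def enrichment_p_value(overlap, module_size, set_size, background_size):
    """P(X >= overlap) when module_size genes are drawn without replacement
    from a background that contains set_size gene set members."""
    total = comb(background_size, module_size)
    upper = min(module_size, set_size)
    tail = sum(comb(set_size, i) * comb(background_size - set_size, module_size - i)
               for i in range(overlap, upper + 1))
    return tail / total


# Hypothetical numbers: 1000 background genes, a module of 20 genes and a
# gene set of 50 genes; about one shared gene is expected by chance.
p_enriched = enrichment_p_value(overlap=8, module_size=20, set_size=50,
                                background_size=1000)
p_chance = enrichment_p_value(overlap=1, module_size=20, set_size=50,
                              background_size=1000)
```

An overlap of eight genes is far larger than expected by chance and yields a very small p-value, whereas an overlap of a single gene does not indicate enrichment.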