Uncovering the causal mechanisms underlying T2D risk loci is not exclusively a matter of finding causal genes: a complete picture of disease pathology also requires identifying the causal variant(s) and the affected tissue(s). Moreover, this additional information is often a prerequisite for performing functional follow-up studies in an appropriate model system.
Causal Variants
In GWAS, the variant most strongly associated with disease risk is reported for each locus, though such ‘lead SNPs’ may only serve as surrogate markers for other genetic perturbations that directly contribute to disease pathology. Identifying the true causal variants can provide a direct functional link between genotype and the observed disease phenotype, especially in cases where the variant is protein altering. To identify a causal variant, or a set of likely causal variants, several strategies have been developed, including fine-mapping of disease-associated regions, experimental prioritisation, and in silico prediction tools.
Fine-mapping of a locus involves analysing SNPs in a defined region of the genome for disease association and is used to refine a GWAS association signal from the surrogate lead SNP to the actual causal variant(s). The SNPs are assayed by deep sequencing, or custom array-genotyping based on GWAS variants and imputation from extensive sequencing efforts such as the 1000 Genomes Project [
14,
15]. To achieve sufficient statistical power to detect the association of the true causal variant, large sample sizes are required and the studies often include populations drawn from diverse ancestries to exploit differences in LD patterns [
16].
Even so, most fine-mapping efforts uncover a large number of variants that, between them, are likely to be driving a particular association signal—a so-called credible set. In some exceptional cases, however, it is possible to narrow down the credible set to a single variant, as is the case for the melatonin receptor 1B gene (
MTNR1B) [
17•]. The
MTNR1B locus has previously been implicated in T2D risk and the identification of the single causal variant revealed a likely, direct functional link to the causal gene [
18]. The risk allele creates a binding site for the transcription factor NEUROD1 and is associated with preferential binding in human pancreatic beta cells. This additional transcription factor binding event is in turn linked to increased FOXA2-bound enhancer activity and
MTNR1B expression.
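To make the notion of a credible set concrete, the following minimal sketch ranks variants by posterior probability and keeps the smallest set that reaches a chosen coverage. It assumes a single causal variant per locus, equal prior probability for every variant, and that a log Bayes factor for association is already available for each variant. The function name, variant identifiers and values are hypothetical, and the procedure is a commonly used convention rather than the specific method of the studies cited above.

```python
import math


def credible_set(log_bayes_factors, coverage=0.99):
    """Return (variant, posterior) pairs forming the smallest set whose
    posterior probabilities sum to at least `coverage`.

    Assumes exactly one causal variant in the region and equal priors.
    """
    # Normalise in log space so that large Bayes factors do not overflow.
    max_log = max(log_bayes_factors.values())
    weights = {v: math.exp(lbf - max_log) for v, lbf in log_bayes_factors.items()}
    total = sum(weights.values())
    posteriors = {v: w / total for v, w in weights.items()}

    # Add variants from most to least probable until coverage is reached.
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for variant, prob in ranked:
        selected.append((variant, prob))
        cumulative += prob
        if cumulative >= coverage:
            break
    return selected


# Hypothetical locus dominated by one variant (natural-log Bayes factors).
sharp_locus = {"snp_a": 12.0, "snp_b": 4.0, "snp_c": 3.5, "snp_d": 0.1}
# Hypothetical locus where four variants are statistically indistinguishable.
flat_locus = {"snp_a": 1.0, "snp_b": 1.0, "snp_c": 1.0, "snp_d": 1.0}

sharp_set = credible_set(sharp_locus)
flat_set = credible_set(flat_locus)
```

In this toy input the first locus resolves to a single variant, whereas the second retains all four, mirroring the contrast between an exceptional single-variant locus and the more typical multi-variant credible set.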
Another way to approach the search for causal variants at GWAS loci is by experimentally testing prioritised SNPs. This strategy was, for example, pursued at the
JAZF1 and
CDC123/
CAMK1D loci [
19–21]. Variants in high LD (r² > 0.8) with the lead GWAS SNP were selected for functional analysis based on maps of open chromatin. Effects on gene expression were tested in luciferase reporter assays, and DNA binding capability was analysed through electrophoretic mobility shift assays. The identified potential causal variants at the
JAZF1 and
CDC123/
CAMK1D loci appear to act as part of cis-regulatory modules (CRMs). These regions harbour combinatorial transcription factor binding sites (TFBS), and the variants affect binding of PDX1 and FOXA1/FOXA2, respectively. However, due to practical limitations, experimental studies of this type typically analyse only a subset of regional variants, so true causal variants may be missed. Further, the evidence generated is only circumstantial, since establishing functionality is necessary but not sufficient to prove causation. New experimental lines of evidence may change which variants are prioritised, and prioritisation should ideally integrate different types of analyses (see section on “
Integrative approach”).
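The selection step described above (variants in high LD with the lead SNP that also lie in open chromatin) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the cited studies: LD is approximated as the squared Pearson correlation of genotype dosages (0/1/2) rather than estimated from phased haplotypes, open chromatin is given as half-open coordinate intervals on a single chromosome, the lead SNP itself is left out of the output, and all names, positions and genotypes are invented.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic variant: correlation undefined, treat as no LD
    return cov * cov / (var_x * var_y)


def prioritise(lead, genotypes, positions, open_chromatin, threshold=0.8):
    """Return variants with r^2 > threshold to the lead SNP that fall
    inside at least one open-chromatin interval [start, end)."""
    hits = []
    for variant, dosages in genotypes.items():
        if variant == lead:
            continue  # the lead SNP is the reference, not a proxy
        ld = r_squared(genotypes[lead], dosages)
        pos = positions[variant]
        in_open = any(start <= pos < end for start, end in open_chromatin)
        if ld > threshold and in_open:
            hits.append(variant)
    return hits


# Hypothetical dosages for eight individuals.
genotypes = {
    "lead": [0, 1, 2, 1, 0, 2, 1, 0],
    "proxy_in_peak": [0, 1, 2, 1, 0, 2, 1, 0],
    "proxy_outside": [0, 1, 2, 1, 0, 2, 1, 0],
    "weak_ld": [2, 0, 1, 1, 0, 1, 2, 0],
}
positions = {"lead": 1000, "proxy_in_peak": 1500, "proxy_outside": 5000, "weak_ld": 1600}
open_chromatin = [(1400, 1700)]

candidates = prioritise("lead", genotypes, positions, open_chromatin)
```

Only the proxy that is both in high LD and inside the open-chromatin interval survives the filter. The example also shows the limitation noted in the text: any causal variant that fails either criterion is never carried forward to the functional assays.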
To overcome the practical limitations of functional approaches for identifying causal variants, in silico prediction tools offer an alternative method based on specific assumptions about the properties of causal variants. A recent study, for example, leveraged phylogenetic conservation of TFBS within CRMs to predict causal variants at the
PPARG and
FTO T2D risk loci [
22,
23•]. This computational approach, termed phylogenetic module complexity analysis (PMCA), identified a clustering of homeobox TFBS at T2D risk loci, and initially proposed a potential causal variant at the
PPARG locus, which allowed for a subsequent functional interpretation [
22]. The risk allele at
PPARG2 leads to enhanced binding of the repressive homeobox transcription factor PRRX1, and thus reduced
PPARG2 expression, defective lipid handling, and impaired insulin sensitivity. PMCA was also successfully applied to identify the causal variant and a potential disease mechanism at the obesity-associated
FTO locus, a region showing the strongest genetic association in GWAS for obesity and body mass index traits [
24,
25]. The proposed causal allele was shown to alter an ARID5B repressor motif, leading to activation of the distant
IRX3 and
IRX5 in adipocyte precursor cells, and pro-obesity consequences for adipocyte thermogenesis regulation [
23•]. This work also highlights the additional complexity arising from having multiple causal genes for disease-associated haplotypes. Though post-GWAS efforts have tended to focus on the idea of a single causal gene per locus, causal variant(s) may influence any number of regional genes, and not necessarily in the same manner across different contexts.
Causal Contexts
An important aspect of the prioritisation of causal genes and variants at GWAS loci is to consider the appropriate tissue(s) and developmental stage(s), so that any functional follow-up studies can be performed in a disease-relevant model. As the majority of T2D association signals are located in non-coding regions and exert regulatory effects, their influence on gene expression may be subject to context-specific activity [
26]. Thus, studies analysing the implicated variants and genes need to consider the surrounding genomic context and expression patterns. A notable example is provided by work on the
PTF1A gene, where a disease-relevant model, human pancreatic progenitor cells, was critical to elucidating a mechanism for isolated pancreatic agenesis [
27•]. The identified mutations were found to disrupt an enhancer region that is selectively active in pancreatic progenitor cells and, importantly, shows no activity in corresponding adult cell lines.