Background
Genome-wide association studies (GWAS) have been widely used as a reliable method for identifying genetic variants associated with a trait or complex disease. A high density of SNPs increases the chance of finding either a causal mutation for the trait or SNPs close enough to the mutation to confidently suggest a gene or another sequence feature underlying the trait. One way to overcome this problem is using imputation, a process in which samples are genotyped using a low-density SNP array and imputed with information from a reference panel genotyped on a high-density SNP array. This method will also recover genotypes that are missing because of technical issues.
Imputation has successfully helped to identify genetic susceptibilities to various diseases and phenotypes that were not recognized in a genotyped panel [
1,
2]. The method relies on the number of SNPs being shared between the two panels and the amount of linkage disequilibrium (LD) between genotyped and non-genotyped SNPs [
3]. A low average LD will reduce the accuracy and might require more typed SNPs. The quality of imputation also depends on the choice of reference [
1]. If the reference contains genetic variants not present in the actual sample population, it will increase the noise in the data and reduce the usefulness of the imputation. One study of malaria resistance in Gambian children only identified a previously known hemoglobin S variant in the hemoglobin-β gene when a Gambian-specific reference was used [
1]. Although this problem is more likely to occur in Africa, where there is a considerably lower LD compared to Europe and Asia [
4], determining how to choose the best reference is relevant for any study performing imputation with publicly available reference sets.
Many studies have validated the accuracy and reliability of imputation [
5‐
7], but most of these studies focused on populations of European descent [
5,
7]. One study showed that the accuracy of using a publicly available database varied across human populations with Europeans having the highest accuracy and Africans having the lowest [
6]. Because Asian populations have some unique genetic characteristics [
8], it is not always possible to directly adapt information about genetics or genomics from studies in Caucasian populations [
9].
Several types of software are currently available for performing genotype imputation [
10‐
13]. Similarly, many publicly available genetics databases are accessible for public use [
14,
15]. One of these is the Pan-Asian SNP genotyping database (PanSNPdb), which collects SNPs and copy number variations from 1719 samples in 71 populations from China, India, Indonesia, Japan, Malaysia, the Philippines, Singapore, South Korea, Taiwan, and Thailand [
16,
17]. The genotyping process was performed using the Affymetrix GeneChip Human Mapping 50 K Xba Array.
Most of the studies on imputation have looked at the overall outcome of all SNPs [
5,
18], and a few have focused on a particular region within a gene, not the whole genome [
19,
20]. We proposed two objectives for the current study. The first was to identify the most preferred reference for imputation in Southeast Asian populations. Using two publicly available haplotype databases, the International HapMap Project (HMII) and the 1000 Genomes project (1000G), we compared the accuracy and yield of imputation in several Southeast Asian populations. Additionally, we looked at imputed results using genotyped samples from a study of a Thai genome cohort. The second objective was to evaluate the imputation results of different regions in the human genome using a real dataset from the Thai dengue study as a model. This is the first extensive study of imputation in Southeast Asian populations and the first illustration of imputation differences between SNPs in different regions of the genome.
Methods
This study was divided into two parts. The first part aimed at showing the difference in imputation accuracy by using different criteria for selecting a reference database. Additionally, using data from populations within the Southeast Asian region illustrated the variation in accuracy when going from one population to another. The second part used real genotype data in all autosomes to classify SNPs into different groups according to their location within genes. Imputation accuracy, GWAS significance and allele frequency were then correlated with the classification.
Sample datasets
We performed the first part of our analysis using data from PanSNPdb [
16]. To illustrate the imputation accuracy in Southeast Asian populations, we selected all available samples from Indonesia (ID,
n = 288), Malaysia (MY,
n = 217), the Philippines (PI,
n = 125), Singapore (SG,
n = 90), and Thailand (TH,
n = 245). Only SNPs that were polymorphic in all populations were used in this study (
n = 52,160).
The second part was imputation accuracy and yield in a patient dataset in which we had access to phenotypes because the phenotypes allowed us to observe the effect of imputation on subsequent association tests. The subjects were 609 Thai dengue patients who were 1–15 years-old from Siriraj, Ramathibodi, and Khon Kaen hospitals. A total of 468,987 SNPs from Illumina Human Hap610 array (Illumina Inc., San Diego, CA) passed the quality control requirements (QC). The accuracy of imputation was tested for each SNP from the dengue dataset by first randomly choosing half of the SNPs from the genotyping panel to create a mutually exclusive set of SNPs: Set 1 and Set 2. Then, Set 1 SNPs were used to impute Set 2 to create a complete SNP panel. Set 2 was also used to impute Set 1 to create a complete SNP panel. Based on our results in the previous section, HMII data were used as a reference for imputing the SNPs from the dengue dataset. The total number of SNPs after imputation was 1,417,081. Post-imputation QC reduced these numbers to 858,480.
Quality control and multidimensional scaling
For all of the sample datasets, QC was performed in PLINK v1.07 [
12] using standard procedures for GWAS [
21]. We included all markers with a call rate > 0.95, a minor allele frequency (MAF) > 0.01, and a Hardy–Weinberg equilibrium (HWE) > 10
− 7. Samples with call rates < 0.95 were excluded from the analysis along with samples that had first-degree relationship agreement, as evaluated by expected IBD sharing in PLINK v1.07. Multidimensional scaling (MDS) of Southeast Asian populations from PanSNPdb was performed in PLINK v1.07. This method allowed for visualization of principle components in the admixed population. Plotting of the MDS was conducted in R version 3.0.2 (
http://www.r-project.org/).
Imputation procedure
In the first part of the study, each population from PanSNPdb was analyzed independently. Five percent of SNPs were randomly selected and removed. The same SNPs set of the removed SNPs were applied to all populations. SHAPEIT version 2 software was used to pre-phase the SNPs [
22]. Imputation was accomplished with IMPUTE2 to recover the removed SNPs [
23]. Each population was phased and imputed using both references in turn. According to guidelines from IMPUTE2, we imputed each chromosome separately and used windows of 5 Mb with an additional 250 kb buffer region on both sides of the analysis interval. The options used in the program were -buffer 1000, −iter 30, −burnin 10, and -k 80. The processes for random removal, phasing, and imputation were repeated five times.
The second part of our study used all the autosomal SNPs from the dengue dataset. Half of the autosomal SNPs from Thai dengue patients were removed by every second SNP (Set 1). Another dataset (Set 2) was then created in which the other half of the SNPs (Set 1) were removed. In this way, all SNPs were imputed once. Imputation was performed with HMII as described above. After imputation, SNPs were filtered using a QC process similar to the initial filtering of raw genotypes. Post-imputation QC excluded SNPs with MAF < 0.01, call rate < 0.95, and HWE < 5 × 10− 7. These datasets were used in the GWAS analyses. We then selected only the imputed SNPs from the two datasets and merged them into a single dataset in which all SNPs had been imputed. This dataset was used to compare imputation accuracies of SNPs according to their location relative to known genes.
References used for the imputation were downloaded prior to the imputation process from the Impute website (
http://mathgen.stats.ox.ac.uk/impute/impute.html). The references were labeled on the website as International HapMap project phase II release #22 (HMII) and 1000G phase I. A total of 1,417,081 SNPs from 90 Chinese and Japanese samples were used from HMII with an additional 39,343,900 SNPs from 1092 worldwide sample populations in the combined reference from 1000G. PanSNPdb shared 47,870 SNPs with the HapMap reference and 51,849 with the 1000G reference. The Thai dengue dataset shared 493,846 SNPs with the HapMap reference and 565,912 with the 1000G reference.
Imputation yield and accuracy
IMPUTE2 gives each imputed genotype a posterior probability score (info score) between zero and one. A higher threshold cut off for the probability score will usually result in higher accuracy but a lower yield. In this study, the posterior probability threshold was set to 0.9 to gain results with a high confidence of accuracy [
24]. Genotypes with posterior probabilities < 0.9 were set to missing. Yield was reported as the percentage of non-missing genotypes within the removed SNPs and accuracy as the percentage of imputed, non-missing genotypes that matched the original genotypes.
SNP selection and annotation
All SNPs from the dengue dataset were grouped based on their location in genes. Gene annotation was collected from Illumina sample sheets (Illumina Inc., San Diego, CA) and the NCBI database of genetic variation [
25]. Targeted gene regions included coding, intergenic, intronic, and untranslated regions (UTR). Some SNPs mapped to more than one location and were marked as being in a complex region. There were 53,277 SNPs that were not in any of these 5 groups and were subsequently discarded from further analysis.
The difference between imputed and genotyped data in the dengue dataset was evaluated by looking at several properties of the SNPs. Differences in accuracy and yield for each gene region were measured by varying the posterior probability threshold from 0.5 to 1.0. MAFs and
p-values from chi-square tests for each SNP were compared between imputed and genotyped datasets. Coefficients of determination (r
2) were calculated for each comparison to estimate the concordance of imputed and genotyped SNPs. Pairwise LD, which is measured as r-squared, was calculated within a region of < 1 Mb around each SNP. The calculation was performed in PLINK v1.07 using the option --r2 with --ld-window-r2 0 and --ld-window-kb 1000. Average R-squared values for each region were calculated and plotted in Microsoft Excel. R-squared values were also plotted against distance and averaging over 1 kb bins using R software (
http://www.r-project.org/).
Discussion
Imputation of genotyped datasets is a common practice when performing genome-wide association studies. This technique is used to fill in missing genotypes and to increase the density by adding information from SNPs that are not present in the original dataset [
5,
26]. Imputation of SNPs that are not available in the dataset serves several purposes. First, if available SNP arrays are designed based on a specific population, such as Europeans, the SNPs may not cover the areas of interest for another population. SNPs important to populations from Southeast Asia might therefore be underrepresented or missing from these arrays. This situation has been reported for populations from Africa [
27] and Mexico [
28]. Second, if data are collected independently between groups of case and control populations, the datasets might have been genotyped on different SNP sets. Third, GWAS usually requires a high number of SNPs to increase chance to detect association signals. Although current genotyping arrays could contain more than a million markers, imputation still adds more SNPs for denser full genome coverage.
The choice of reference panels can affect the accuracy of imputation through the genetic variation of the samples and the genetic relationship between the samples in the reference panel and the imputed references [
6,
27]. We studied these effects in five populations from Southeast Asia. The 1000G reference provided the highest yield, while the HMII reference had the highest accuracy. These results were the same for all Southeast Asian populations. IMPUTE2 software provides the posterior probability score for each imputed genotype. There was increasing accuracy with a decreasing yield when the probability threshold increased (Additional file
1: Figure S2). At the threshold 1.0, the result showed a large drop in yield but only a limited increase in accuracy. At the threshold 0.9, the slope of the yield and accuracy were significantly changed. This threshold might be a good starting point for using IMPUTE2 for Southeast Asian populations.
Genotype imputation of the Thai population had the highest accuracy in the current study. Previous work has shown that in the PanSNPdb database, the Thai population had the highest relationship to the Chinese and Japanese populations out of the other four study populations [
16]. The increased accuracy can therefore be explained by this closer relationship. To determine if population diversity influenced the average imputation accuracy, classical MDS was used to display the variation within Southeast Asian populations. The main variation in the MDS plot (C1) was related to the sub-populations in 4 of the 5 main populations, and the only exception was the Thai population (Fig.
3). The Thai samples form a more homogenous group compared to other populations, and this outcome can help explain why they had the most accurate results. Eighteen individuals from the Thai Mlabri group, which is a hunter-gatherer group in Northern Thailand, clustered away from other Thai samples, which is consistent with the findings of previous studies [
17,
29]. Nevertheless, further investigation is necessary to understand the effects of population stratification on imputation results.
Plotting the imputation accuracy and yield for each chromosome revealed that the variation within populations was the highest for the smaller chromosomes, especially chromosomes 19 and 22. The same results were also observed for yield. In particular, the population from the Philippines showed a large increase in variability for these chromosomes when using the 1000G reference. This result might indicate a specific issue with the SNP selection for these chromosomes, such as the LD structure being different or the relationship to the reference being lower in these regions. One possible reason for the increased variability was found by looking at the long distance LD in each chromosome. Chromosome 19 and 22 was shown to have the fewest SNPs connected by LD above 0.2 when comparing inter-SNP distances above 10 kb (Additional file
1: Figure S1). The accuracy of resulting imputation for these two chromosomes will therefore be more dependent on which SNPs are removed and will show more variance between replications with randomly selected SNPs. The results also clearly show that only taking the average numbers for accuracy or yield into account will result in overlooking potentially important information.
Coding sequences have been the favored area to search for functional mutations because these sequences are more informative and easier to interpret due to the direct link to a protein and the possibility of functional changes [
30‐
32]. This approach made it useful to compare the imputation results between coding and other SNP regions. Our results show that a higher percentage of SNPs in coding regions passed the post-imputation QC than in other regions. SNPs in coding regions also had the highest accuracy. This outcome correlates well with the results showing coding regions to have the highest LD with the surrounding SNPs (Additional file
1: Figure S8). This is an important factor to consider when discussing significant results because imputed SNPs in coding regions will have a higher accuracy compared to SNPs in less conserved areas. This trend does not mean that we can omit non-coding SNPs because they have been shown to be associated with phenotypes in more than one-third of GWAS [
33‐
35].
Our study also compared GWAS results from a dataset from Thai dengue fever patients to see how imputation affected the reported results. Imputation appeared to systematically reduce the allele frequency of the initially minor allele for most imputed SNPs (Additional file
1: Figure S5). This effect was more pronounced for SNPs with an initial low MAF and will make imputing low frequency alleles difficult. Our results demonstrated that imputation tended to increase the common allele. This outcome was especially problematic if the genotypes are very rare variants [
26]. Even if the most significant SNPs had higher significance in the imputed dataset, this trend was not seen when comparing
p-values from GWAS before and after imputation.
The imputed genotypes were also subjected to QC-filtering, which is similar to the QC being performed on the raw genotype data. It was previously shown that post-imputation QC did not influence the imputation outcome [
36]. However, we observed an improvement in the correlation between measured MAF and
p-values from imputed data and genotyped data. Without the SNPs failing the QC, the correlation between imputed data and genotyped data was close to one for allele frequency and had improved
p-values. Even if over half the SNPs were removed in this step, the remaining data had higher quality and were more trustworthy. This difference in post-QC improvement could be due to the initial imputation accuracy. Post-imputation QC might be more important if the initial imputation results are less accurate. This result is also supported by a previous experiment that similarly demonstrated Hardy–Weinberg disequilibrium is a crucial step for post-imputation filtering [
37].
This study demonstrates that the expected accuracy and yield of imputation in various Southeast Asian populations varies between populations. Our reference comparison of HMII and 1000G in imputation in Thai GWAS showed that using a larger reference provided a higher yield but caused a reduction in accuracy compared to a smaller but more related reference. We also extensively showed the imputation results with respect to SNP localization near genes using Thai genome-wide genotypes as a model. This study provides crucial information for investigators undertaking imputation, especially in Southeast Asian populations.
Acknowledgements
We thank all team members of the Pan-Asian SNP Genotyping Database especially Dr. Sissades Tongsima and Dr. Chumpol Ngamphiw for collecting and curating PanSNPdb data.