Introduction

Small molecular regions (for example, spanning <10 KB) comprised of three or more SNPs that define multi-allelic haplotype loci (minihaps) have the potential to convey more identity and ancestry-related information than a like number of single SNPs would convey. We defined and have advocated developing such multi-SNP haplotype systems as one type of forensic DNA marker, lineage informative SNPs, LISNPs (Pakstis et al,1 Butler et al;2 see also Ge et al3). The multiple alleles (haplotypes) available in these more complex systems can serve to identify relatives with higher probabilities than simple di-allelic SNPs. By restricting the molecular extent to under 10 KB in regions with no recombination hot spot, recombination among the SNPs will be so rare that the possibility of recombination within a kindred approaches the mutation rate for SNPs. Depending on allele frequency variation among populations, minihaps could also be useful in ancestry inference. Direct comparison testing could also benefit from minihaps, but match probabilities will in general be population specific.

Ge et al3 described a strategy for identifying sets of SNPs with nearly complete to complete linkage disequilibrium (LD) among them. Although we agree that the objective of identifying haplotypes for forensic purposes is valid, we think their primary criterion of complete LD results in loci with lower than optimal heterozygosity. Based on our interest in haplotypes and their global patterns, we have recently begun pursuing the same objective of finding forensically useful haplotype systems for familial identification, because such systems can have high heterozygosity, the relevant property for familial identification. Our strategy is very different from that of Ge et al.3 We focus on finding very closely spaced SNPs in a region that is not a ‘hot spot’ for recombination but also does not show strong LD. Our criteria are to find regions in which we can (1) clearly see multiple haplotypes that show conservation of evolutionary lineages that have evolved from the ancestral human haplotype and (2) show no evidence of recurring recombination (though there may be some historic recombinants that have drifted to sufficiently high frequency to become common haplotypes). Such regions are not necessarily ‘LD blocks’ or ‘haploblocks’, as those terms are usually used,3, 4 and are not necessarily in strong LD. This approach of haplotype lineage identification is justified by and illustrated in our recent work on other projects in which we have identified gene segments with multiple haplotypes (alleles) that show no evidence of historic recombination or at most a single historic recombinant: ADH7;5 POLB;6 SLITRK1;7 CYP2E1;8 CYP2C8;9 ADH1B;10 OCA2;11 and TAS2R16.12

For minihaps to be useful in forensics, haplotype frequencies must be known and ideally there must be at least moderate heterozygosity in most populations of relevance in forensics. Given the highly diverse ancestries of the US population, and increasingly of European populations, our collection of population samples of diverse global origins (Table 1) is particularly appropriate for identifying forensically useful minihaps. We have applied our empirical screening to identify potentially useful mini-haplotypes in a subset of data that we have accumulated in past and ongoing research projects. In these genomic regions, we have accumulated dense SNP data; such dense SNP data are not available for most other global studies. This report presents eight minihaps that have emerged as examples. These exclude the examples referred to in the papers cited above.

Table 1 The 45 population samples studied (descriptions of the populations and samples can be found in ALFRED)

Methods

We have routinely studied the 45 population samples that were available for this screen. Table 1 displays the sample sizes (averaging 53 individuals) for each of the 45 populations (organized by geographical region of the world). For these populations, we have sufficient typing data to estimate haplotype frequencies with generally acceptable standard errors, at least for the common haplotypes. We restricted our search to SNPs that we have typed for all individuals in all 45 populations and that meet Hardy–Weinberg expectations in all populations. Our typing methods are described in detail elsewhere.5, 6, 7, 8, 9, 10, 11 As noted in those studies, an occasional typing failure persists after two attempts to obtain an acceptable result. Experience has shown that the yield from additional attempts at typing is extremely low. Those persistent failures are randomly distributed among individuals and populations and appear to reflect idiosyncratic chance or individual- and SNP-specific aspects of the DNA that are unrelated at the population level to the markers involved. Overall, the missing data rate (‘typing failure rate’) for the SNPs in the eight minihaps presented here was 1.83%.

We identified mini-haplotypes with the following properties: clusters of three to five SNPs with overall molecular extents below 10 000 bp and median heterozygosities ≥0.5 for the 45 population samples studied. We chose 10 KB to provide an upper limit of about 10⁻⁴ for the recombination rate, assuming a rough genome average of 1% per megabase and the absence of a recombination hot spot. We sought SNP clusters in which the average pairwise LD values across the populations were weak to moderate (that is, r² ≤0.60) to help maximize the informativeness (heterozygosity) of the haplotypes.
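The arithmetic behind the 10 KB limit, and the screening thresholds themselves, can be sketched in a few lines of Python. This is an illustration only and not the software used in this study: the function names `recombination_upper_bound` and `passes_screen` are hypothetical, and the calculation assumes a uniform recombination rate across the region.

```python
from statistics import median

BP_PER_MB = 1_000_000

def recombination_upper_bound(extent_bp, rate_per_mb=0.01):
    """Approximate per-meiosis recombination fraction across a region.

    Assumes a uniform rate (default 1% per megabase) and no hot spot.
    """
    return rate_per_mb * extent_bp / BP_PER_MB

def passes_screen(n_snps, extent_bp, heterozygosities, mean_pairwise_r2):
    """Apply the screening thresholds described in the text to one SNP cluster.

    heterozygosities: expected heterozygosity of the haplotype system in
    each population sample; mean_pairwise_r2: average pairwise r^2.
    """
    return (
        3 <= n_snps <= 5
        and extent_bp < 10_000
        and median(heterozygosities) >= 0.5
        and mean_pairwise_r2 <= 0.60
    )

# A 10 000 bp region at 1% per megabase gives a bound of about 1e-4.
print(recombination_upper_bound(10_000))
```

A shorter region lowers the bound proportionally, which is why markers in a region of very low recombination could in principle be spaced further apart.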

Haplotype frequencies were estimated using the HAPLO program13 that implements the EM algorithm and calculates jackknife standard errors as well as binomial standard errors. Individual multi-SNP phenotypes can be unambiguously resolved into the haplotype-based genotype by observation whenever none or only one of the SNPs is heterozygous; only phenotypes heterozygous for more than one SNP have an ambiguous mapping to genotype. Larger percentages of unambiguous genotypes make the maximum likelihood estimates of the haplotype frequencies more accurate. The haplotype frequencies were used to estimate the linkage disequilibrium (LD) between the alleles at each pair of SNPs. Linkage disequilibrium was calculated using the commonly used r² (Δ² in Devlin et al14). Such descriptive statistics as the expected heterozygosities and linkage disequilibrium characteristics of the haplotypes were calculated for each minihap system for all the populations studied.
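The rule for deciding whether a multi-SNP phenotype resolves unambiguously can be written down directly. The following Python sketch is illustrative and is not part of the HAPLO program; the function names and the representation of a phenotype as a list of unordered allele pairs are assumptions made for the example.

```python
def count_heterozygous_sites(phenotype):
    """Count heterozygous SNPs in a phenotype.

    phenotype: list of unordered allele pairs, one per SNP,
    for example [("A", "A"), ("C", "T"), ("G", "G")].
    """
    return sum(1 for a, b in phenotype if a != b)

def is_unambiguous(phenotype):
    """True when no SNP, or exactly one SNP, is heterozygous."""
    return count_heterozygous_sites(phenotype) <= 1

def n_compatible_haplotype_pairs(phenotype):
    """Number of distinct unordered haplotype pairs consistent with the phenotype."""
    k = count_heterozygous_sites(phenotype)
    return 1 if k <= 1 else 2 ** (k - 1)

def percent_unambiguous(phenotypes):
    """Percentage of individuals whose phenotype resolves by direct examination."""
    resolved = sum(1 for ph in phenotypes if is_unambiguous(ph))
    return 100.0 * resolved / len(phenotypes)

# Hypothetical 3-SNP phenotypes for three individuals.
sample = [
    [("A", "A"), ("C", "T"), ("G", "G")],  # one heterozygous site: resolvable
    [("A", "G"), ("C", "T"), ("G", "G")],  # two heterozygous sites: 2 possible pairs
    [("A", "G"), ("C", "T"), ("A", "G")],  # three heterozygous sites: 4 possible pairs
]
print([n_compatible_haplotype_pairs(ph) for ph in sample])
print(round(percent_unambiguous(sample), 1))
```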

Results and Discussion

We examined over a dozen gene regions with dense SNP coverage and identified eight mini-haplotype systems that meet our screening criteria, seven regions defined by 3-SNPs and one defined by 4-SNPs. Table 2 presents the basic characteristics for these eight minihaps, including the molecular extent. For the eight minihaps we identified, Figure 1 presents the frequencies (in stacked bar format) of the individual haplotypes in each population. Allele frequencies have been deposited in ALFRED (with keyword ‘minihap’) for each of the SNPs for the populations studied and for the minihaps; detailed results for each mini-haplotype in each population are also given in Supplemental Materials.

Table 2a Characteristics of mini-haplotypes
Figure 1

Haplotype frequencies (proportional to colored bar lengths) in 45 populations from around the world. Each mini-haplotype consists of three LISNPs except for KRAS which has four LISNPs. Populations are ordered geographically on the x axis as in Table 1.

As can be seen from the minihap characteristics shown in Tables 2 and 3 and the allele frequencies for each of the eight minihaps across the populations studied, the screening was very successful. Each 3-SNP haplotype system has four to seven multi-SNP alleles (out of the eight possible haplotypes given three SNPs) with commonly occurring frequencies. The 4-SNP minihap has 2 to 4 haplotypes with common frequencies of 5% or more out of the 16 possible haplotypes. The median heterozygosity among the 45 populations studied ranges from 0.54 (PAH) to 0.72 (DBH) for the eight minihaps. Table 2b shows, for each of the eight minihaps, the percentage of the populations with expected heterozygosities better than that of an optimally informative SNP (that is, ≥0.5). Although only one of the eight mini-haplotypes studied came close to having all population samples with an expected heterozygosity better than that of an optimally heterozygous SNP (the GRAMD1C minihap has 44 of 45 populations with heterozygosity >0.5), five of the eight minihaps had over 87% (39 or more) of the population samples with heterozygosities exceeding that of an optimal SNP. For familial identification, heterozygosity is the most important variable as less common alleles are more informative in identifying a likely relative. For a diallelic SNP, the average allele frequency is 0.5; but for a multiallelic locus with n alleles, the average allele frequency is 1/n. Although the informativeness of these minihaps tends to be good across all world regions, it would be desirable to have more stringent criteria that provide for a high minimum average heterozygosity for each world region. However, some of the smaller and especially the relatively isolated and inbred populations, such as the Karitiana and Nasioi, will naturally have lower average heterozygosities and require more effort if it is important to find good minihaps in those populations.

Table 2b Characteristics of mini-haplotypes

Table 2b also shows that a high proportion of the individuals with these 3-site and 4-site phenotypes in our eight mini-haplotype examples can be resolved unambiguously into haplotypic genotypes by direct examination because no more than one SNP is heterozygous. The median percentage of individual 3-SNP phenotypes resolvable by direct examination (that is, unambiguous) in the populations studied ranges from 50 to 80%. This high percentage of resolvable genotypes will usually allow high probabilities for statistical estimation of the genotypes of the individuals with ambiguous phenotypes.
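How an ambiguous phenotype can still map to one genotype with high probability is illustrated by the sketch below. It enumerates the haplotype pairs consistent with a phenotype and weights each by its Hardy–Weinberg expectation given the population haplotype frequencies. The function names and the haplotype frequencies are hypothetical, and Hardy–Weinberg proportions are an assumption of the example.

```python
from itertools import product

def compatible_haplotype_pairs(phenotype):
    """Enumerate unordered haplotype pairs consistent with a multi-SNP phenotype.

    phenotype: list of unordered allele pairs, one per SNP.
    Returns a set of sorted 2-tuples of haplotype strings.
    """
    het_sites = [i for i, (a, b) in enumerate(phenotype) if a != b]
    pairs = set()
    for choice in product((0, 1), repeat=len(het_sites)):
        flip = dict(zip(het_sites, choice))
        h1, h2 = [], []
        for i, (a, b) in enumerate(phenotype):
            if flip.get(i, 0):
                a, b = b, a
            h1.append(a)
            h2.append(b)
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return pairs

def genotype_posteriors(phenotype, hap_freqs):
    """Posterior probability of each compatible haplotype pair.

    Assumes Hardy-Weinberg proportions within the population.
    """
    weights = {}
    for h1, h2 in compatible_haplotype_pairs(phenotype):
        p1, p2 = hap_freqs.get(h1, 0.0), hap_freqs.get(h2, 0.0)
        weights[(h1, h2)] = p1 * p2 * (1 if h1 == h2 else 2)
    total = sum(weights.values())
    if total == 0:
        raise ValueError("phenotype is incompatible with the haplotype frequencies")
    return {pair: w / total for pair, w in weights.items()}

# Hypothetical 2-SNP haplotype frequencies in one population.
freqs = {"AC": 0.50, "GT": 0.30, "AT": 0.15, "GC": 0.05}
double_het = [("A", "G"), ("C", "T")]
for pair, prob in sorted(genotype_posteriors(double_het, freqs).items()):
    print(pair, round(prob, 3))
```

With these made-up frequencies the double heterozygote is far more likely to carry the two common haplotypes than the two rare ones, which is the sense in which an ambiguous phenotype can map to a single genotype with large probability.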

Each of the minihaps has different properties in each of the 45 populations. Figure 2 plots these 360 different situations by the heterozygosity of the system and the percent of individuals with unambiguous resolution of genotype. Both higher heterozygosity and a higher percentage of unambiguous resolution of genotype are better for familial inference. It is clear that the different loci differ in general and with respect to specific populations. Some minihaps, such as those at GRAMD1C, TAS2R1 and RASGEF1A, have both high heterozygosity (>0.5) and high resolvability (>0.5) in most populations (upper right quadrant). Others, especially at AGT, have high heterozygosity but low resolvability (<0.5) in most populations. Still others, such as at KRAS, show considerable variation among populations. Thus, it is obvious that the value of any minihap for familial inference is a function of the population in which it is studied. At the same time, the minihap data themselves can often help identify the relevant population to the degree that allele frequencies vary among populations. The informativeness of a locus for ancestry inference is a function of the allele frequency variation among populations, frequently measured as Fst. In the Supplementary Table S1, we present the Fst values for these eight minihaps and for the component individual SNPs. These loci were selected to be multiallelic in most populations biasing against high levels of allele frequency variation and show a range of Fst values distributed across the range we have seen for a random selection of individual SNPs in these populations.

Figure 2

A scatterplot of each of the eight minihaps in each of the 45 populations plotting the heterozygosity and the percentage of individuals with an unambiguous genotype. The eight different loci are indicated with different symbols, as indicated. One PAH point (73% heterozygosity and 10% resolvable) falls outside the range visualized.

For comparison, we have also carried out analyses on the 24 ‘haploblocks’ identified by Ge et al3 in the HapMap data set; the ‘haploblock’ frequency bar graphs are shown in Figure 3 for three typical blocks (the details for the other 21 ‘haploblocks’ can be found in the Supplemental Material). Based on this result, we conclude that the definition and the search procedure they described were not optimal, and identified many haplotypes that are only slightly better than a single SNP with maximum allele frequencies of 0.5 (Table 3, Figure 3, and Supplemental Data). This falls short of the potential informativeness for familial identification of molecularly short multi-SNP haplotype systems and diminishes the advantage of using haplotypes. Our definition and search procedures, as shown by our minihap examples, result in higher heterozygosity on average than is possible with an individual SNP, demonstrating the validity of the minihap concept and its distinction from the ‘haploblock’ concept. Although heterozygosity is a function of actual haplotype frequencies, a hypothetical 3-SNP haplotype system with five of eight possible haplotypes at equal frequencies will have a heterozygosity of 0.80 (if all eight possible haplotypes occur at equal frequencies, the heterozygosity is 0.875).
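The heterozygosity values quoted above follow from the standard expected heterozygosity, one minus the sum of squared allele (haplotype) frequencies. A minimal Python sketch, with a hypothetical function name, reproduces the arithmetic:

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared frequencies.

    freqs: allele or haplotype frequencies that sum to 1.
    """
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)

# Best case for a di-allelic SNP: two alleles at 0.5 each.
print(round(expected_heterozygosity([0.5, 0.5]), 3))

# Five of the eight possible 3-SNP haplotypes at equal frequency.
print(round(expected_heterozygosity([0.2] * 5), 3))

# All eight possible 3-SNP haplotypes at equal frequency.
print(round(expected_heterozygosity([0.125] * 8), 3))
```

For n equally frequent alleles the expression reduces to 1 − 1/n, so each additional common haplotype raises the ceiling on informativeness.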

Figure 3

Frequency bar graphs for 3 of the 24 ‘haploblock’ haplotypes identified by Ge et al3 using HapMap information. Compare to bar graphs in Figure 1 for the eight minihaps.

Table 3 Multi-SNP haplotypes: heterozygosity and pairwise LD values (r2) for all SNP pairs.

For use in familial searching or ancestry inference, it is important that the ‘mutation rate’ be very low. In the case of haplotypes of SNPs, that translates into a very low recombination rate since we already know mutation rates for individual SNPs are on the order of 10⁻⁸ (Reich et al15). This is significantly lower than the estimated mutation rates at STRPs.16, 17 Thus, we expect that the probability of a new mutation within a family at one of the relevant SNPs (or affecting a PCR primer function) will be very low, even considering the multiple meioses that may be involved. Recurring recombination in the minihaps we have identified is very rare to absent in human populations globally. The selection criteria and the data support the validity of assuming that the haplotypes identified and being studied are identical by descent over most of recent human evolution. Obviously, it is necessary to avoid regions that encompass a recombination hot spot, such as those we have identified at other loci.5, 8 However, high frequency of a recombinant haplotype does not necessarily indicate a hot spot for recombination. A single historic crossover gamete may have drifted to high frequency over many human generations. Thus, the actual haplotype frequencies depend on the pattern of accumulated mutations and any historical crossovers plus the historical random genetic drift making the new haplotypes sufficiently common.

It is important to recognize that measures of LD are not measures of recombination rate but of recombinant haplotypes in the population. Thus, the presence of all four combinations of the alleles at two di-allelic SNPs gives an LD value less than 1.0, but that does not mean there is recurring recombination. One check for the relative frequency of recombination is to examine the flanking SNPs. Occurrence of many different combinations among the SNPs on either side of an interval, as is seen at ADH7,5 is an indication of frequent recombination. In contrast, very similar flanking combinations (allowing for mutation) relative to the targeted genomic interval, as is seen at ADH1B,10 likely indicate a single origin from a historic crossover. For each pairwise comparison there should be at least three of the four possible allelic combinations present in order for r² to yield a value less than 1.0 (though D' will be exactly 1.0 if only three combinations occur). If both r² and D' equal 1.0 then only two combinations occur and one of the two SNPs is completely redundant (that is, uninformative). This is apparently the case for the ‘haploblocks’ identified by Ge et al.3
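The behavior of the two LD measures with two, three, or four haplotypes present can be checked numerically. The sketch below uses the standard definitions of D, D' and r² for two di-allelic SNPs; the function name and the example haplotype frequencies are hypothetical.

```python
def pairwise_ld(p_ab, p_aB, p_Ab, p_AB):
    """Return (|D'|, r^2) for two di-allelic SNPs.

    Arguments are the frequencies of the four two-SNP haplotypes,
    with alleles a/A at the first SNP and b/B at the second.
    """
    p_a = p_ab + p_aB              # frequency of allele a at SNP 1
    p_b = p_ab + p_Ab              # frequency of allele b at SNP 2
    d = p_ab - p_a * p_b           # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    r2 = d * d / denom
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d_prime, r2

# Two haplotypes only: the second SNP is redundant.
print(pairwise_ld(0.6, 0.0, 0.0, 0.4))

# Three haplotypes: D' stays at its maximum while r^2 falls below 1.
print(pairwise_ld(0.5, 0.2, 0.0, 0.3))

# All four haplotypes at equal frequency: no LD by either measure.
print(pairwise_ld(0.25, 0.25, 0.25, 0.25))
```

Neither statistic measures how often recombination occurs; both describe only which haplotypes are present and how frequent they are.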

As seen in Figure 1, and our studies of various genes cited earlier, some requisite recombination events appear to have occurred and the resulting haplotype became common early in human dispersal. These recombinant haplotypes may have persisted as identical by descent (IBD) lineages since their origin. Thus, these historical recombination events can generate haplotypic heterozygosity but do not necessarily indicate high recombination rates, per se, just high frequency of one or a few crossover products. We note that assumption of equilibrium and use of frequency of recombinant chromosomes to estimate recombination rates can give different rates for different populations, none of which may be correct. The distinction needs to be considered in any search for forensically useful minihaps.

We think that this report substantiates the feasibility of finding a panel of multi-allelic mini-haplotypes that would be useful in routine forensic applications in many different populations from around the world. Substantial SNP resources exist (for example, the Human Genome Diversity Project 650Y Illumina chip data set among others) to search for more and better mini-haplotypes. We note the caveat that SNPs in the HGDP data set are often not sufficiently dense for minihap forensic purposes, although there is an inverse relationship with actual recombination rate—in a region of very low recombination the markers can be more distant. SNP data in the HapMap data sets are much more dense, but the representation of global human diversity is much poorer. The increasing amount of genomic sequence data provides another resource for identification of minihaps, but as yet global human diversity is poorly represented.

The population data we provide in this paper are essential for any use of minihaps in forensic applications: calculating the probability of a match between an individual and crime-scene DNA, inference of the likely ancestry of the source of a DNA sample and inference of the likelihood that a relative of a known person might match a similar crime-scene sample (that is, familial searching). This last case is also applicable in mass disaster and missing persons cases in which data on an individual (or on ‘remains’) need to be assigned to families. All of these calculations will depend on estimating the population-specific haplotype-based genotype from the multisite phenotype when the phenotype is ambiguous. Nonetheless, we note that most phenotypes are not ambiguous (cf., Figure 2) and even some of the ambiguous phenotypes have very large probabilities of logically mapping to a single genotype. In all cases, but especially in the familial searching (LISNP) context, it is necessary to calculate likelihood ratios considering the probabilities of a match, of an ancestry, or of a family member identification relative to the probabilities of the phenotype occurring by chance in each specific relevant population. In the mass disaster case, the relative likelihoods of the assignment to different families would need to be calculated. All of these estimates are population dependent and can be estimated from the population haplotype frequencies presented here. Basic Mendelian transition probabilities18 can be used in combination with the phenotype to genotype probabilities to determine the numerator in the familial searching and mass disaster situations.
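As a simplified illustration of such a likelihood ratio, the sketch below treats a parent–child comparison at a single minihap with unambiguous genotypes. The numerator uses the Mendelian transition probability (one parental haplotype transmitted with probability 1/2, the other drawn from the population), and the denominator is the Hardy–Weinberg frequency of the child's genotype. The function names and haplotype frequencies are hypothetical, and the sketch assumes no recombination, no mutation and Hardy–Weinberg proportions; an ambiguous phenotype would additionally require weighting by the phenotype-to-genotype probabilities.

```python
def hw_genotype_prob(genotype, freqs):
    """Hardy-Weinberg probability of an unordered haplotype pair."""
    x, y = genotype
    return freqs[x] ** 2 if x == y else 2 * freqs[x] * freqs[y]

def parent_child_transition(child, parent, freqs):
    """P(child genotype | parent genotype) under Mendelian transmission.

    One of the parent's two haplotypes is transmitted with probability 1/2;
    the child's other haplotype is drawn from the population frequencies.
    """
    x, y = child
    prob = 0.0
    for transmitted in parent:
        if x == y:
            if transmitted == x:
                prob += 0.5 * freqs[x]
        else:
            if transmitted == x:
                prob += 0.5 * freqs[y]
            if transmitted == y:
                prob += 0.5 * freqs[x]
    return prob

def parent_child_lr(child, parent, freqs):
    """Likelihood ratio: parent-child relationship versus unrelated."""
    return parent_child_transition(child, parent, freqs) / hw_genotype_prob(child, freqs)

# Hypothetical 3-SNP haplotype frequencies in one population.
freqs = {"AAG": 0.40, "GAG": 0.30, "GCA": 0.20, "ACA": 0.10}
print(round(parent_child_lr(("GCA", "ACA"), ("ACA", "AAG"), freqs), 3))
print(round(parent_child_lr(("AAG", "AAG"), ("GAG", "GCA"), freqs), 3))
```

Sharing a rare haplotype yields a larger ratio than sharing a common one, and a pair with no haplotype in common gives a ratio of zero under these assumptions.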

The eight examples we present here are statistically independent both in terms of LD within populations and in terms of linkage between minihaps, even the two on chromosome 12 (75 MB apart on separate arms), allowing simple multiplication among the values calculated for the individual minihaps. However, we are not advocating immediate attempts to implement this panel as we consider these eight to be insufficient in general for highly significant statistical results in any of the applications above. Nonetheless, a likelihood ratio could be quite large in a specific population for some specific combination of genotypes among relatives. These eight minihaps do illustrate the concept and provide the conceptual basis for actual application in forensics considered broadly to include identification of family relationships and ancestry. Ultimately, they can be used along with other minihaps to be identified and documented in future studies. These examples are also centered around known genes, though none of the sites is part of an expressed protein. For some researchers/ethicists this could raise an issue of whether privacy and personal health information are compromised. We argue that these SNPs are all normal variation as documented by their high heterozygosity and the multiple haplotypes common in all populations. Hence, none could be strongly associated with a highly deleterious trait. Thus, until and unless some exceedingly high risk of disease/disorder is demonstrated–a very unlikely event given the population genetics of these minihaps–we consider these to be ethically acceptable forensic markers.

In summary, we have shown how, by using certain selective criteria, mini-haplotypes can provide the informativeness of multiallelic loci, while using SNP genotyping technology. Moreover, these minihaps have the evolutionary stability that allows haplotypes to be equated with alleles basically identical by descent in broader studies. Because of their high potential heterozygosity and the population differences in haplotype frequencies, we expect that mini-haplotypes are likely to be very useful markers for connecting an individual to an extended family or clan in forensic work. Thus, they fall into the lineage-informative category of genetic markers. They can also provide information on ancestry, as shown by the AAA haplotype at AGT (most frequent in Europeans) and by the GAA haplotype at PAH (most frequent in East Asians). Information from such markers may be especially valuable for identification purposes in the case of mass disasters.