Building a forensic ancestry panel from the ground up: The EUROFORGEN Global AIM-SNP set

https://doi.org/10.1016/j.fsigen.2014.02.012

Abstract

Emerging next-generation sequencing technologies will enable DNA analyses to add pigmentation predictive and ancestry informative (AIM) SNPs to the range of markers detectable from a single PCR test. This prompted us to re-appraise current forensic and genomics AIM-SNPs and from the best sets, to identify the most divergent markers for a five population group differentiation of Africans, Europeans, East Asians, Native Americans and Oceanians by using our own online genome variation browsers. We prioritized careful balancing of population differentiation across the five group comparisons in order to minimize bias when estimating co-ancestry proportions in individuals with admixed ancestries. The differentiation of European from Middle East or South Asian ancestries was not chosen as a characteristic in order to concentrate on introducing Oceanian differentiation for the first time in a forensic AIM set. We describe a complete set of 128 AIM-SNPs that have near identical population-specific divergence across five continentally defined population groups. The full set can be systematically reduced in size, while preserving the most informative markers and the balance of population-specific divergence in at least four groups. We describe subsets of 88, 55, 28, 20 and 12 AIMs, enabling both new and existing SNP genotyping technologies to exploit the best markers identified for forensic ancestry analysis.

Introduction

The prospects for typing 200–300 single nucleotide polymorphisms (SNPs) in one multiplexed sequencing analysis are now much more realistic with the emergence of fast, compact next-generation sequencing (NGS) systems, such as Life Technologies Ion Torrent and Illumina MiSeq [1], [2]. SNPs have the benefit of complementing conventional forensic STR analysis by providing information about the DNA donor that can progress a criminal investigation lacking any leads beyond knowledge of gender. Principal amongst the complementary data generated by SNP analysis is the inference of genetic ancestry and prediction of common physical traits, with SNP-based analysis of pigmentation now established as a viable investigative tool [3], [4], [5]. Until the development of compact NGS approaches, forensic ancestry analysis centered on small-scale multiplexes of carefully chosen SNPs and Indels, exemplified by a 34-SNP SNaPshot multiplex and a 46-Indel dye-labeled PCR multiplex [6], [7], [8]. Once these tests were optimized, we successfully applied them to a variety of challenging DNA cases [9], [10], [11], [12], and their combination into 80-marker profiles provides good data depth, short-amplicon PCRs sensitive to degraded DNA and complementary features, including the enhanced ability of Indels to detect mixed DNA. However, the original choice of ancestry informative markers, particularly components of the 34-plex SNP test, reflected the state of knowledge of human SNP variation some nine years ago. Much more extensive SNP catalogs can now be screened for suitable candidate markers, with major human genome initiatives including HapMap, 1000 Genomes and Complete Genomics publicly releasing project data to allow identification of the best markers for ancestry inference purposes.

We decided to build, from a completely refreshed list of candidates, a new ancestry SNP (AIM-SNP) panel using our own bioinformatics search tools [13], [14] that front-end public genome data. Reconfiguring a forensic AIM-SNP set allows several characteristics to be prioritized: (i) identifying the most powerful differentiators for each population comparison; (ii) finding alternative loci with near-identical frequency distributions due to LD-block correlations [15] when SNP multiplexing problems arise; and (iii) carefully balancing marker combinations to give equivalent levels of differentiation between the population groups: Africans, Europeans, East Asians, Native Americans and Oceanians. The third characteristic is the most desirable for ensuring less biased assessments of admixture proportions in individuals with detectable co-ancestry, a significant demographic feature of urban populations and regions with histories of population movement (see Chapter 14 of [16]). However, population differentiation balance is also the most challenging characteristic to achieve, since, of the above five groups, Native American and Oceanian variation is not represented in any of the full human SNP catalogs. Fortunately, more than 650,000 SNPs have been characterized for the CEPH Human Genome Diversity Panel (HGDP-CEPH), which includes two Oceanian populations and five American populations [17], so suitable SNPs can be identified for differentiating these two groups, albeit from much smaller sample sizes.
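As an illustration of what "most powerful differentiators" can mean in practice, the sketch below computes the informativeness for assignment statistic (I_n) introduced by Rosenberg et al. (2003), which appears in the reference list. This excerpt does not state which statistic the authors used to rank markers, so the choice of I_n here is an assumption, and the function name and example frequencies are hypothetical.

```python
import math

def informativeness(freqs_by_pop):
    """Rosenberg's I_n for a single marker.

    freqs_by_pop: list of K lists; each inner list holds the allele
    frequencies of one population and sums to 1.
    """
    k = len(freqs_by_pop)
    n_alleles = len(freqs_by_pop[0])
    total = 0.0
    for j in range(n_alleles):
        # Average frequency of allele j across the K populations.
        p_j = sum(pop[j] for pop in freqs_by_pop) / k
        if p_j > 0:
            total -= p_j * math.log(p_j)
        for pop in freqs_by_pop:
            if pop[j] > 0:  # treat 0 * log(0) as 0
                total += (pop[j] / k) * math.log(pop[j])
    return total

# Hypothetical markers: one fixed for different alleles in two
# populations, one with identical frequencies in both.
fixed_difference = informativeness([[1.0, 0.0], [0.0, 1.0]])
no_difference = informativeness([[0.3, 0.7], [0.3, 0.7]])
```

A marker fixed for different alleles in two populations reaches the two-population maximum of ln 2, while a marker with identical frequencies scores zero.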

This paper outlines the AIM-SNPs chosen to construct a set of 128 markers suitable for inclusion in forensic NGS tests. The set maintains near-identical population differentiation balance between admixture contributors originating from the five main continentally defined population groups. Therefore, the AIM-SNPs together allow analysis of admixed individuals, provided the co-ancestry contributors themselves are not admixed. The AIMs are applicable to a large proportion of the worldwide distribution of human populations, including regions where populations meet and admixture contributors are not necessarily confined to Europeans, Africans or East Asians, e.g. American contributors in the USA and South America or Oceanians in Australia. However, differentiation of European from Middle East or South Asian sub-groups of Eurasia was ignored in favor of ensuring Oceanian differentiation comparable to the other groups. Allele frequency bias in the populations used to select AIM-SNPs remains possible, so we attempted to minimize it by using at least two geographically separated populations per group. Four populations likely to be divergent from those used for selection were also tested to gauge the degree of allelic heterogeneity they exhibited for the same SNPs. Because size constraints can still apply to PCR multiplexes in all technologies (forensic NGS tests may include STRs as main components), we also reduced the SNP set to smaller-scale subsets while maintaining the population differentiation balance at each stage of reduction. Lastly, we describe Sequenom iPLEX® MALDI-TOF genotyping tests used to validate additional population variation in the AIM-SNPs chosen and to assess each SNP's multiplex performance ahead of porting them to larger-scale NGS chemistries.
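The idea of shrinking a panel while keeping differentiation balanced across groups can be sketched as a greedy removal loop. The authors' actual reduction procedure is not given in this excerpt; the rule below (drop the marker whose removal leaves the smallest gap between the best and worst cumulative per-group score) is an illustrative stand-in, and the marker names, group labels and scores are all made up.

```python
def cumulative(panel, scores, groups):
    """Sum the per-group divergence scores over all markers in the panel."""
    return {g: sum(scores[s][g] for s in panel) for g in groups}

def spread(panel, scores, groups):
    """Gap between the best and worst differentiated group (0 = balanced)."""
    totals = cumulative(panel, scores, groups)
    return max(totals.values()) - min(totals.values())

def reduce_panel(scores, groups, target_size):
    """Greedily drop markers until target_size remain, keeping balance.

    Ties are broken by dropping the least informative marker first,
    then by marker name so the result is deterministic.
    """
    panel = set(scores)
    while len(panel) > target_size:
        drop = min(
            panel,
            key=lambda s: (
                round(spread(panel - {s}, scores, groups), 12),
                sum(scores[s].values()),
                s,
            ),
        )
        panel.remove(drop)
    return sorted(panel)

# Hypothetical toy scores: each marker informs on one group only.
groups = ["AFR", "EUR", "EAS"]
scores = {
    "a1": {"AFR": 2, "EUR": 0, "EAS": 0},
    "a2": {"AFR": 2, "EUR": 0, "EAS": 0},
    "e1": {"AFR": 0, "EUR": 2, "EAS": 0},
    "e2": {"AFR": 0, "EUR": 2, "EAS": 0},
    "s1": {"AFR": 0, "EUR": 0, "EAS": 2},
    "s2": {"AFR": 0, "EUR": 0, "EAS": 2},
    "extra": {"AFR": 1, "EUR": 0, "EAS": 0},
}
six = reduce_panel(scores, groups, 6)
three = reduce_panel(scores, groups, 3)
```

In this toy case the surplus African-informative marker is dropped first, and the three-marker subset keeps one marker per group. A real panel would need a rule that also handles markers informative for several groups at once.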

Section snippets

Sources of AIM-SNPs and allele frequencies in the five main global population groups

Candidate AIM-SNPs were compiled from three sources: (i) SNP sets previously developed for a range of forensic ancestry test initiatives at Santiago (USC); (ii) allele frequency screens of the Stanford HGDP-CEPH 650 K SNP dataset [17], [18] – identifying SNPs with the highest divergence between targeted population comparisons by finding the top 5% most differentiated in each case, and (iii) AIM-SNP lists published both before and after availability of whole genome scan (WGS) high-density SNP
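The top 5% screen described in source (ii) can be sketched as a simple ranking over allele frequencies. The excerpt does not specify the divergence measure, so the absolute allele frequency difference used here is an assumption, as are the function name, the SNP identifiers and the toy frequencies.

```python
def top_fraction_by_delta(freqs, pop_a, pop_b, fraction=0.05):
    """Return SNP ids in the top `fraction` ranked by |freq_a - freq_b|.

    freqs: dict mapping SNP id -> dict of population label -> frequency
    of the same reference allele in that population.
    """
    deltas = {snp: abs(f[pop_a] - f[pop_b]) for snp, f in freqs.items()}
    # Keep at least one SNP even for very small candidate lists.
    n_keep = max(1, int(len(deltas) * fraction))
    ranked = sorted(deltas, key=deltas.get, reverse=True)
    return ranked[:n_keep]

# Made-up frequencies for 20 hypothetical SNPs in two groups.
toy_freqs = {
    f"snp_{i:02d}": {"AFR": i / 40, "EUR": 1 - i / 40} for i in range(20)
}
selected = top_fraction_by_delta(toy_freqs, "AFR", "EUR")
```

Running the same function once per targeted population comparison yields one shortlist per comparison, which can then be merged into a candidate pool.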

Characteristics of the ancestry informative SNPs selected

A final set of 122 bi-allelic and 6 tri-allelic SNPs were selected from a total candidate pool of 189 loci (and 12 tri-allelic) and are detailed in Table 1 and Supplementary Table S2A. All candidate SNPs from sources detailed in Section 2.1 are listed in Supplementary Table S2B. Global AIM-SNP allele frequency distributions in five population groups are summarized in Fig. 2. The cumulative PSD values in each group required a smaller number of AFR-informative SNPs and for 28 candidates, Oceanian

Discussion

This study shows that the current extensive human genome variation catalogs can be easily accessed and their allele frequency data used to select highly differentiating ancestry informative SNPs. We were able to build sets with a range of sizes that meet the statistical power demands of forensic analysis, while focusing on the key characteristic of population differentiation balance. Although prompted by the previous study of Galanter, which addressed AFR-EUR-AME populations [22], for all but the

Acknowledgements

This work was funded by the EUROFORGEN Network of Excellence (Grant Agreement No. 285487). Studies leading to the reported results were financially supported by the Austrian Science Fund (FWF, P22880-B12). CS is supported by funding awarded by the Portuguese Foundation for Science and Technology (FCT) and co-financed by the European Social Fund (Human Potential Thematic Operational Program SFRH/BD/75627/2010).

References (37)

  • N.A. Rosenberg et al., Informativeness of genetic markers for inference of ancestry, Am. J. Hum. Genet. (2003)
  • C. Phillips, Ancestry informative markers
  • T. Bersaglieri et al., Genetic signatures of strong recent positive selection at the lactase gene, Am. J. Hum. Genet. (2004)
  • P. Gill et al., An evaluation of potential linkage disequilibrium between the STRs vWA and D12S391 with implications in criminal casework, Forensic Sci. Int. Genet. (2012)
  • S.B. Seo et al., Single nucleotide polymorphism typing with massively parallel sequencing for human identification, Int. J. Legal Med. (2013)
  • R. Pereira et al., Straightforward inference of ancestry and admixture proportions through ancestry-informative insertion deletion multiplexing, PLoS One (2012)
  • C. Phillips, Applications of autosomal SNPs and Indels in forensic analysis, Forensic Sci. Rev. (2012)
  • C. Phillips et al., A 34-plex autosomal SNP single base extension assay for ancestry investigations, Methods Mol. Biol. (2012)