Short communication
STRait Razor v2s: Advancing sequence-based STR allele reporting and beyond to other marker systems
Introduction
STRait Razor [1] initially was developed to capture and characterize variation in the repeat motifs of target short tandem repeat (STR) markers from next generation sequencing or, the more aptly named, massively parallel sequencing (MPS) data. Since its inception, STRait Razor has been used for length-based (LB) analysis and repeat motif variation of population data [2], [3], [4], assessment of novel MPS multiplexes [5], [6], alignment-free characterization of insertion-deletion (InDel) polymorphisms [7] and other novel applications [8], [9], [10]. While STRait Razor v2 [11] improved allelic reporting and expanded target loci, the flanking regions of STR loci (and other marker types such as single nucleotide polymorphisms (SNPs) and InDels) remained largely unreported without modification to the included configuration files [12] or use of alternative software [13], [14], [15], [16].
Variation in the flanking regions of STRs has been used to study human evolution and migration patterns [17], [18], [19] and to increase the probability of exclusion [20]. These combinations of SNP and STR (SNPSTR) loci phased within the same amplicon provide finer granularity to better discriminate individuals. In an effort to standardize nomenclature, the International Society for Forensic Genetics (ISFG) published a set of considerations [21] regarding the reporting of STR alleles, covering both repeat and flanking regions, for comprehensive and consistent nomenclature.
The proposed ISFG nomenclature [21] is not currently implemented in commercial or third-party software. Researchers must convert their data manually, which may be error prone. Traditional alignment software [22], [23], while effective for genotyping SNPs, produces inconsistent results on a per-read basis. For example, the Burrows-Wheeler Aligner [22], widely used for read mapping, places insertions/deletions relative to the reference at different points within the repeat region depending on sequence variants and/or the end point(s) of each read. In contrast, direct haplotype capture, used by various software tools [16], [24] including STRait Razor, allows users to extract phased data for flanking region variants as well as the target locus. As data analysis pipelines for this task are still nascent in the field of forensic genetics, bioinformatics concordance with operationally distinct methodologies is critical to ensure that results are as complete as possible. Amplified products separated by capillary electrophoresis (CE) provide analysts with a comprehensive allele determination with respect to size; however, sequence characterization, thus far, has been relegated to the repeat region of the amplicons [5], [6], [25], [26], [27], [28]. While effective, this approach to allele reporting has been shown to be limited in its informative value [4], [14], [20] and, in some cases, in its backwards compatibility with CE data [2], [4], [14]. Therefore, characterization of the flanking regions of loci is necessary to realize the full potential of MPS systems.
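The idea of direct haplotype capture can be illustrated with a minimal sketch. It assumes that capture works by locating two fixed flanking "anchor" sequences in each read and reporting everything between them; the anchors, reads, and function name below are invented for illustration, and exact string matching is used for simplicity. It is not the STRait Razor implementation.

```python
from collections import Counter


def capture_haplotypes(reads, left_anchor, right_anchor):
    """Count the sequences found between two fixed anchors across reads.

    Exact string matching only: a read is skipped if either anchor is
    missing, so mismatches inside the anchors are not tolerated here.
    """
    counts = Counter()
    for read in reads:
        start = read.find(left_anchor)
        if start == -1:
            continue
        start += len(left_anchor)
        end = read.find(right_anchor, start)
        if end == -1:
            continue
        counts[read[start:end]] += 1
    return counts


# Hypothetical anchors and toy reads (not taken from any real kit).
LEFT, RIGHT = "GGTCA", "CCTAG"
reads = [
    "AAGGTCA" + "T" + "GATA" * 3 + "A" + "CCTAGTT",  # flank base T
    "AAGGTCA" + "C" + "GATA" * 3 + "A" + "CCTAGTT",  # flank base C
    "AAGGTCA" + "T" + "GATA" * 3 + "A" + "CCTAGTT",
    "TTTTTTTTTTTT",                                  # no anchors, ignored
]

haplotypes = capture_haplotypes(reads, LEFT, RIGHT)
for seq, n in haplotypes.most_common():
    print(len(seq), n, seq)
```

Both captured haplotypes in this toy example have the same length, so a length-based call would report a single allele, whereas the captured strings differ at the flanking base and remain phased with the repeat.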
Bi-allelic loci (e.g., SNPs and InDels) are particularly useful in challenged samples. Kidd et al. [29] have shown the utility of combining closely linked SNPs (<200 bp) into microhaplotype loci phased within a single amplicon. However, interpretation of these small amplicon markers has been limited to the target SNP(s) of interest and has ignored potential variation in the surrounding region of the amplicon. Variation along the entire amplicon in the form of InDels-SNPs [7], DIP-STRs [30], SNPSTRs [17], [18], [19], [20], [31], or microhaplotypes [29], [32], [33], [34], [35] may be present at some level for every forensic locus but is ignored, as yet, by first-party software. More recently, these additional data have been shown to increase the discrimination power of commercially available forensic-genomics assays from 8.54 × 10⁻³⁴ to 1.31 × 10⁻³⁹ for LB and sequence-based (SB) STRs and from 7.66 × 10⁻⁵⁸ to 5.49 × 10⁻⁶³ when identity SNPs/microhaplotypes are included [4], [36].
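To put the quoted figures in perspective, the short calculation below expresses each pair as a fold change. It assumes the values are match-probability-style statistics, where a smaller number means greater discrimination; the text does not name the exact statistic, so the labels and the helper function are illustrative only.

```python
import math

# Values quoted in the text; treated here as probabilities where a
# smaller value means greater discrimination (an assumption).
pairs = {
    "STRs only": (8.54e-34, 1.31e-39),
    "STRs with identity SNPs/microhaplotypes": (7.66e-58, 5.49e-63),
}


def fold_improvement(before, after):
    """Ratio of the earlier value to the later, smaller value."""
    return before / after


for label, (before, after) in pairs.items():
    fold = fold_improvement(before, after)
    orders = math.log10(fold)
    print(f"{label}: {fold:.3g}-fold ({orders:.2f} orders of magnitude)")
```

Under that assumption, both comparisons correspond to a gain of roughly five orders of magnitude.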
The two primary sequencing chemistries currently being considered for forensic applications are those of the Illumina MiSeq FGx™ Forensic Genomics System (Illumina, San Diego, CA, USA) and the Ion Torrent PGM™ and S5™ (Thermo Fisher Scientific, San Francisco, CA, USA). Each chemistry has a distinct detection method, which generates unique interpretation considerations. Substitution errors (SBEs) are the primary source of error within the Illumina sequencers [37], with a relatively small proportion of insertion/deletion errors typically associated with specific sequences (sequence-specific error; SSE) [37], [38], [39]. Schirmer et al. [38] used 16S rRNA amplicons to estimate the source and distribution of sequencing error with the MiSeq 2 × 250 sequencing chemistry. They concluded that the accumulation of phasing and pre-phasing errors over the course of the sequencing read complicated base calling and led to a concomitant increase in sequence errors or miscalls. Sequencing error was observed to increase at approximately positions 200–225 within the sequencing reads using MiSeq v2 2 × 250 sequencing chemistry. Data from the Ion Torrent, however, have a relatively high false InDel rate associated with homopolymeric stretches in conjunction with SBEs [40], [41], [42], [43]. Bragg et al. [40] also observed a marked increase in SBE rate with increasing read length, at approximately 75 and 150 bp for the 100 and 200 bp chemistries, respectively. These sequencing errors may either complicate interpretation of the data by producing relatively high-abundance, non-target haplotypes or affect the ability of the configuration files to capture the desired haplotypes. With these limitations in mind, optimization of the bioinformatics of each system on a per-marker basis may be necessary to ensure application beyond single-source reference samples.
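Because false InDels are associated with homopolymeric stretches, one simple check when reviewing an amplicon is to flag such runs. The sketch below is an illustration only: the function name, the minimum run length of 4, and the example sequence are assumptions, not values from this work.

```python
import re


def homopolymer_runs(sequence, min_length=4):
    """Return (start, base, length) for each run of a single base
    that is at least min_length long (0-based start positions)."""
    # ([ACGT]) captures one base; \1{n,} requires n or more repeats of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)))
        for m in pattern.finditer(sequence.upper())
    ]


# Hypothetical amplicon built from labelled parts for readability.
amplicon = "ACG" + "T" * 6 + "GC" + "A" * 4 + "GATAGATA" + "G" * 5 + "C"
for start, base, length in homopolymer_runs(amplicon):
    print(f"{base} x {length} at position {start}")
```

Positions flagged this way could then be weighed alongside the other factors that affect how reliably a haplotype is captured at a given marker.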
With the advent of commercial MPS multiplexes, forensic practitioners now have the capacity to interrogate a large number of markers and marker types (e.g., STRs, SNPs, InDels, and microhaplotypes) within a single analysis [4], [5], [6], [27], [44], [45], [46], [47], [48]. However, orthogonal bioinformatics solutions for phased data, including microhaplotypes, have been limited to traditional alignment-based software [49], [50], which provides only computationally derived phasing. STRait Razor v2s (SRv2s; the “s” designation refers to the addition of SNP loci) is a freely available update that provides a direct haplotype capture approach with physical phasing to determine the diplotype (i.e., the specific combination of haplotypes at a particular site in an individual, analogous to the genotype of alleles) at each locus (STRs, SNPs, microhaplotypes, and InDels) from current (or soon to be available) commercial forensic multiplex assays.
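As a rough illustration of how a diplotype might be derived from captured haplotype counts at one locus, the sketch below takes the two most abundant haplotypes and applies a coverage threshold and a balance ratio. The function name and both thresholds are hypothetical, a single-source sample is assumed, and this is not the interpretation scheme of STRait Razor v2s.

```python
def call_diplotype(haplotype_counts, min_reads=10, min_balance=0.5):
    """Call a diplotype from {haplotype: read count} at one locus.

    Returns a tuple of two haplotypes (the same one twice when the locus
    looks homozygous), or None when coverage is below min_reads.
    Both thresholds are illustrative, not recommended values.
    """
    # Rank by descending count, breaking ties alphabetically.
    ranked = sorted(haplotype_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if not ranked or ranked[0][1] < min_reads:
        return None
    top_seq, top_n = ranked[0]
    if len(ranked) > 1:
        second_seq, second_n = ranked[1]
        if second_n >= min_reads and second_n / top_n >= min_balance:
            return tuple(sorted((top_seq, second_seq)))
    return (top_seq, top_seq)


# Toy counts: two balanced haplotypes plus a low-level artefact.
counts = {"TGATAGATAGATAA": 480, "CGATAGATAGATAA": 450, "TGATAGATAA": 40}
print(call_diplotype(counts))
print(call_diplotype({"TGATAGATAGATAA": 900, "TGATAGATAA": 30}))
print(call_diplotype({"TGATAGATAGATAA": 3}))
```

Because each haplotype string is read from a single molecule, any variants it contains are physically phased, which is the distinction drawn above from computationally derived phasing.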
New features
The STRait Razor v2s suite of tools features kit-specific locus-configuration files for the Applied Biosystems™ Precision ID GlobalFiler™ Mixture ID panel (Thermo Fisher Scientific), Illumina® ForenSeq™ DNA Signature Prep Kit (Illumina), and Promega PowerSeq™ Systems (Auto, Y, and Mito) (Promega Corporation, Madison, WI, USA). In addition to these multiplexes developed expressly for MPS, configuration files were included for forensically relevant InDels and the repeat-region configuration files for
Results and discussion
Each large MPS multiplex presents unique challenges for bioinformatics development. When considering the placement of the anchors for each assay, multiple factors (e.g., size of the amplicon, presence of SSEs, sequencing platform, flanking region SNPs, etc.) should be considered. Previous studies have used the PCR primers for anchor assignment [24], [54]. Anchors based on PCR primers allow for capture of the maximum amount of information from an amplicon in the form of both repeat region and
Conclusion
STRait Razor v2s uses a direct haplotype capture approach to extract phased data from entire amplicons of all marker types within current (or soon to be available) commercially available forensic multiplex assays. This software update provides users a database of, currently, ∼2500 unique sequences matched with nomenclature in the format recommended by Parson et al. [21]. The sequences contained within this database are based on the anchors defined as reported herein; as primers define the
Conflict of interest
The authors declare that they have no conflicts of interest.
Acknowledgements
This work was supported in part by award no. 2015-DN-BX-K067, awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice. The opinions, findings, and conclusions or recommendations expressed are those of the authors and do not necessarily reflect those of the U.S. Department of Justice. The authors would like to thank Nicole Novroski, Jennifer Churchill, Lisa Borsuk, Lilliana Moreno, Ryan England, and Katherine Gettings for their contributions and
References (61)
- et al. STRait Razor: a length-based forensic STR allele-calling tool for use with second generation sequencing data. Forensic Sci. Int. Genet. (2013)
- et al. Sequence variation of 22 autosomal STR loci detected by next generation sequencing. Forensic Sci. Int. Genet. (2016)
- et al. Genetic analysis of the Yavapai Native Americans from West-Central Arizona using the Illumina MiSeq FGx™ forensic genomics system. Forensic Sci. Int. Genet. (2016)
- et al. Characterization of genetic sequence variation of 58 STR loci in four major population groups. Forensic Sci. Int. Genet. (2016)
- et al. High sensitivity multiplex short tandem repeat loci analyses with massively parallel sequencing. Forensic Sci. Int. Genet. (2015)
- et al. An evaluation of the PowerSeq Auto System: a multiplex short tandem repeat marker kit compatible with massively parallel sequencing. Forensic Sci. Int. Genet. (2015)
- et al. Massively parallel sequencing of 68 insertion/deletion markers identifies novel microhaplotypes for utility in human identity testing. Forensic Sci. Int. Genet. (2016)
- et al. Sequence-based analysis of stutter at STR loci: characterization and utility. Forensic Sci. Int. Genet. Suppl. Ser. (2015)
- et al. Massively parallel sequencing of 17 commonly used forensic autosomal STRs and amelogenin with small amplicons. Forensic Sci. Int. Genet. (2016)
- et al. The next dimension in STR sequencing: polymorphisms in flanking regions and their allelic associations. Forensic Sci. Int. Genet. Suppl. Ser. (2015)
- Introduction of the Python script STRinNGS for analysis of STR regions in FASTQ or BAM files and expansion of the Danish STR sequence database to 11 STRs. Forensic Sci. Int. Genet.
- A DNA sequence searching tool for massively parallel sequencing data. Forensic Sci. Int. Genet.
- FDSTools: a software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise. Forensic Sci. Int. Genet.
- Massively parallel sequencing of forensic STRs: considerations of the DNA commission of the International Society for Forensic Genetics (ISFG) on minimal nomenclature requirements. Forensic Sci. Int. Genet.
- My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing. Forensic Sci. Int. Genet.
- Second-generation sequencing of forensic STRs using the ion torrent HID STR 10-plex and the ion PGM. Forensic Sci. Int. Genet.
- Evaluation of the Illumina® beta version ForenSeq DNA signature prep kit for use in genetic profiling. Forensic Sci. Int. Genet.
- Current sequencing technology makes microhaplotypes a powerful new type of genetic marker for forensics. Forensic Sci. Int. Genet.
- Microhaplotype loci are a powerful new type of forensic marker. Forensic Sci. Int. Genet. Suppl. Ser.
- Genetic markers for massively parallel sequencing in forensics. Forensic Sci. Int. Genet. Suppl. Ser.
- Flanking region variation of ForenSeq™ DNA signature prep kit STR and SNP loci in Yavapai Native Americans. Forensic Sci. Int. Genet.
- Evaluation of next generation mtGenome sequencing using the Ion Torrent Personal Genome Machine (PGM). Forensic Sci. Int. Genet.
- Inter-laboratory evaluation of the EUROFORGEN Global ancestry-informative SNP panel by massively parallel sequencing using the Ion PGM. Forensic Sci. Int. Genet.
- Inter-laboratory evaluation of SNP-based forensic identification by massively parallel sequencing using the Ion PGM. Forensic Sci. Int. Genet.
- Revealing the challenges of low template DNA analysis with the prototype Ion AmpliSeq Identity panel v2.3 on the PGM Sequencer. Forensic Sci. Int. Genet.
- D5S2500 is an ambiguously characterized STR: identification and description of forensic microsatellites in the genomics age. Forensic Sci. Int. Genet.
- Massively parallel sequencing of short tandem repeats-Population data and mixture analysis results for the PowerSeq system. Forensic Sci. Int. Genet.
- Massively parallel sequencing for the forensic scientist: sequencing archived amplified products of AmpFlSTR Identifiler and PowerPlex Y multiplex kits to capture additional information. Aust. J. Forensic Sci.
- STRait Razor v2.0: the improved STR allele identification tool–razor. Forensic Sci. Int. Genet.
- TSSV: a tool for characterization of complex allelic variants in pure and mixed genomes. Bioinformatics