Short communication
STRait Razor v2s: Advancing sequence-based STR allele reporting and beyond to other marker systems

https://doi.org/10.1016/j.fsigen.2017.03.013

Highlights

  • Perl script has been modified to report reverse complement sequence to facilitate merger of read strands.

  • Configuration files have been generated for all commercially available MPS multiplexes.

  • Preloaded database of ∼2500 haplotypes allows easy conversion of string to ISFG-recommended comprehensive nomenclature.

  • STRait Razor Profile Viewer allows all loci to be viewed simultaneously for better mixture evaluation.

Abstract

STRait Razor has provided the forensic community with a free-to-use, open-source tool for short tandem repeat (STR) analysis of massively parallel sequencing (MPS) data. STRait Razor v2s (SRv2s) allows users to capture physically phased haplotypes within the full amplicon of both commercial (ForenSeq) and “early access” panels (PowerSeq, Mixture ID). SRv2s may be run in batch mode to facilitate population-level analysis and is supported by all Unix distributions (including macOS). Data are reported in tables in string (haplotype), length-based (e.g., vWA allele 14), and International Society for Forensic Genetics (ISFG)-recommended (e.g., vWA [CE 14]-GRCh38-chr12:5983950-5984049 (TAGA)10 (CAGA)3 TAGA) formats. SRv2s currently contains a database of ∼2500 unique sequences, which it uses to match strings to the appropriate allele in ISFG-recommended format. In addition to STRs, SRv2s has the configuration files necessary to capture and report haplotypes from all marker types included in these multiplexes (e.g., SNPs, InDels, and microhaplotypes). To facilitate mixture interpretation, data from all markers may be displayed in a format similar to that of electropherograms displayed by traditional forensic software. The download package for SRv2s may be found at https://www.unthsc.edu/graduate-school-of-biomedical-sciences/molecular-and-medical-genetics/laboratory-faculty-and-staff/strait-razor.

Introduction

STRait Razor [1] was initially developed to capture and characterize variation in the repeat motifs of target short tandem repeat (STR) markers from next-generation sequencing, more aptly named massively parallel sequencing (MPS), data. Since its inception, STRait Razor has been used for length-based (LB) analysis and characterization of repeat motif variation in population data [2], [3], [4], assessment of novel MPS multiplexes [5], [6], alignment-free characterization of insertion-deletion (InDel) polymorphisms [7], and other novel applications [8], [9], [10]. While STRait Razor v2 [11] improved allelic reporting and expanded the set of target loci, the flanking regions of STR loci (and other marker types such as single nucleotide polymorphisms (SNPs) and InDels) remained largely unreported without modification of the included configuration files [12] or use of alternative software [13], [14], [15], [16].

Variation in the flanking regions of STRs has been used to study human evolution and migration patterns [17], [18], [19] and to increase the probability of exclusion [20]. These combinations of SNP and STR (SNPSTR) loci phased within the same amplicon provide finer granularity for discriminating between individuals. In an effort to standardize nomenclature, the International Society for Forensic Genetics (ISFG) published a set of considerations [21] on the reporting of STR alleles, covering both repeat and flanking regions, for comprehensive and consistent nomenclature.
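
As a rough illustration of the repeat-region part of such nomenclature, the sketch below collapses a repeat-region string into the bracketed notation used in the example from the abstract, (TAGA)10 (CAGA)3 TAGA. It is a minimal sketch, not the SRv2s implementation: the function name `bracket_repeats`, the fixed motif length of 4, and the assumption that the string starts in frame with the repeat are all illustrative choices, and the chromosome coordinates and CE allele prefix of the full format are not handled.

```python
def bracket_repeats(seq: str, motif_len: int = 4) -> str:
    """Collapse runs of identical consecutive motifs into '(MOTIF)n' notation.

    Hypothetical helper: assumes the sequence starts in frame with the repeat
    and that every motif has the same length. A trailing partial unit, if any,
    is emitted unchanged.
    """
    # Split the sequence into fixed-length units.
    units = [seq[i:i + motif_len] for i in range(0, len(seq), motif_len)]
    parts = []
    i = 0
    while i < len(units):
        # Extend j to the end of the current run of identical units.
        j = i
        while j < len(units) and units[j] == units[i]:
            j += 1
        count = j - i
        # Single units are written without brackets, as in the abstract's example.
        parts.append(f"({units[i]}){count}" if count > 1 else units[i])
        i = j
    return " ".join(parts)


# Repeat region matching the vWA example quoted in the abstract.
vwa_repeat = "TAGA" * 10 + "CAGA" * 3 + "TAGA"
print(bracket_repeats(vwa_repeat))   # (TAGA)10 (CAGA)3 TAGA
print(len(vwa_repeat) // 4)          # 14 complete repeat units
```

The count of complete units (14 here) corresponds to the length-based allele in this example only because the string contains nothing but whole repeat units; real loci with partial repeats or non-counted flanking motifs would need locus-specific rules.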

The proposed ISFG nomenclature [21] is not currently implemented in commercial or third-party software. Researchers must convert their data manually, which is error prone. Traditional alignment software [22], [23], while effective for genotyping SNPs, produces inconsistent results on a per-read basis. For example, the Burrows-Wheeler Aligner [22], widely used for read mapping, places insertion/deletion points relative to the reference at different positions within the repeat region, depending on the sequence variants present and/or the end point(s) of each read. In contrast, the direct haplotype capture used by various software tools [16], [24], including STRait Razor, allows users to extract phased data for flanking region variants as well as the target locus. As data analysis pipelines for this task are still nascent in the field of forensic genetics, bioinformatic concordance between operationally distinct methodologies is critical to ensure that results are as complete as possible. Amplified products separated by capillary electrophoresis (CE) provide analysts with a comprehensive allele determination with respect to size; however, sequence characterization has thus far been relegated to the repeat region of the amplicons [5], [6], [25], [26], [27], [28]. While effective, this approach to allele reporting has been shown to be limited in its informative value [4], [14], [20] and, in some cases, in its backward compatibility with CE data [2], [4], [14]. Therefore, characterization of the flanking regions of loci is necessary to realize the full potential of MPS systems.
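
The idea of direct haplotype capture can be sketched as follows: each read is searched for a pair of flanking anchor sequences, and the sequence between them is taken as the haplotype, so that all variants inside the captured span stay phased on a single read. This is a minimal sketch under stated assumptions, not the STRait Razor code: the function names, the use of exact string matching, and the anchor sequences and reads in the usage example are all hypothetical. Reads from the opposite strand are reverse complemented so that both strands report the same string.

```python
from collections import Counter

# Translation table for complementing DNA bases (N is left as N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def capture(read: str, left_anchor: str, right_anchor: str):
    """Return the sequence between the two anchors, or None if not found.

    Both the read and its reverse complement are tried, so a read from either
    strand yields the haplotype in the same orientation. Exact matching is an
    illustrative simplification.
    """
    for strand in (read, reverse_complement(read)):
        start = strand.find(left_anchor)
        if start == -1:
            continue
        start += len(left_anchor)
        end = strand.find(right_anchor, start)
        if end != -1:
            return strand[start:end]
    return None


def count_haplotypes(reads, left_anchor: str, right_anchor: str) -> Counter:
    """Tally the captured haplotype strings across a collection of reads."""
    counts = Counter()
    for read in reads:
        haplotype = capture(read, left_anchor, right_anchor)
        if haplotype is not None:
            counts[haplotype] += 1
    return counts


# Hypothetical anchors and reads, for illustration only.
LEFT, RIGHT = "GGATC", "CCTGA"
forward_read = "AA" + LEFT + "TAGATAGATAGA" + RIGHT + "TT"
reads = [forward_read, reverse_complement(forward_read), "ACGTACGTACGT"]
print(count_haplotypes(reads, LEFT, RIGHT))
```

In this toy input the forward read and its reverse complement are counted under the same haplotype string, while the third read contains neither anchor and is skipped.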

Bi-allelic loci (e.g., SNPs and InDels) are particularly useful in challenged samples. Kidd et al. [29] have shown the utility of combining closely linked SNPs (<200 bp) into microhaplotype loci phased within a single amplicon. However, interpretation of these small-amplicon markers has been limited to the target SNP(s) of interest and has ignored potential variation in the surrounding region of the amplicon. Variation along the entire amplicon in the form of InDels-SNPs [7], DIP-STRs [30], SNPSTRs [17], [18], [19], [20], [31], or microhaplotypes [29], [32], [33], [34], [35] may be present at some level for every forensic locus but is, as yet, ignored by first-party software. More recently, these additional data have been shown to increase the discrimination power of commercially available forensic-genomics assays from 8.54 × 10⁻³⁴ to 1.31 × 10⁻³⁹ for LB and sequence-based (SB) STRs, respectively, and from 7.66 × 10⁻⁵⁸ to 5.49 × 10⁻⁶³ when identity SNPs/microhaplotypes are included [4], [36].
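
To put the quoted figures on a common scale, the short calculation below takes the ratio of each pair of values reported in the text. It is only arithmetic on the numbers as quoted; the function name `fold_change` and the variable names are illustrative, and no interpretation beyond the ratio is implied.

```python
import math


def fold_change(before: float, after: float) -> float:
    """Ratio of the value before to the value after."""
    return before / after


# Values as quoted in the text: LB versus SB STRs, and the same comparison
# once identity SNPs/microhaplotypes are included.
strs_only = fold_change(8.54e-34, 1.31e-39)
with_snps = fold_change(7.66e-58, 5.49e-63)

for label, ratio in (("STRs only", strs_only), ("STRs + SNPs", with_snps)):
    print(f"{label}: {ratio:.2e}-fold, about {math.log10(ratio):.1f} orders of magnitude")
```

Both ratios fall between five and six orders of magnitude, which is the size of the change attributable to the sequence-level data in the figures quoted above.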

The two primary sequencing chemistries currently being considered for forensic applications are those of the Illumina MiSeq FGx™ Forensic Genomics System (Illumina, San Diego, CA, USA) and the Ion Torrent PGM™ and S5™ (Thermo Fisher Scientific, San Francisco, CA, USA). Each chemistry has a distinct detection method, which generates unique interpretation considerations. Substitution errors (SBEs) are the primary source of error within the Illumina sequencers [37], with a relatively small proportion of insertion/deletion errors typically associated with specific sequences (sequence-specific error; SSE) [37], [38], [39]. Schirmer et al. [38] used 16S rRNA amplicons to estimate the source and distribution of sequencing error with the MiSeq 2 × 250 sequencing chemistry. They concluded that the accumulation of phasing and pre-phasing errors over the course of the sequencing read complicated base calling and led to a concomitant increase in sequence errors or miscalls. An increase in sequencing error was observed at approximately positions 200–225 within the sequencing reads using the MiSeq v2 2 × 250 sequencing chemistry. Data from the Ion Torrent, however, have a relatively high false InDel rate associated with homopolymeric stretches, in conjunction with SBEs [40], [41], [42], [43]. Bragg et al. [40] also observed a marked increase in SBE rate with increasing read length, at approximately 75 and 150 bp for the 100 and 200 bp chemistries, respectively. These sequencing errors may either complicate interpretation of the data by producing relatively high-abundance, non-target haplotypes or affect the ability of the configuration files to capture the desired haplotypes. With these limitations in mind, bioinformatic optimization of each system on a per-marker basis may be necessary to ensure application beyond single-source reference samples.

With the advent of commercial MPS multiplexes, forensic practitioners now have the capacity to interrogate a large number of markers and marker types (e.g., STRs, SNPs, InDels, and microhaplotypes) within a single analysis [4], [5], [6], [27], [44], [45], [46], [47], [48]. However, orthogonal bioinformatics solutions for phased data, including microhaplotypes, have been limited to traditional alignment-based software [49], [50], which provides only computationally derived phasing. STRait Razor v2s (SRv2s; the “s” designation refers to the addition of SNP loci) is a freely available update that provides a direct haplotype capture approach with physical phasing to determine the diplotype (i.e., a specific combination of haplotypes at a particular site in an individual, analogous to the genotype of alleles) at each locus (STRs, SNPs, microhaplotypes, and InDels) from current (or soon to be available) commercial forensic multiplex assays.

Section snippets

New features

STRait Razor v2s suite of tools features kit-specific locus-configuration files for the Applied Biosystems™ Precision ID GlobalFiler™ Mixture ID panel (Thermo Fisher Scientific), Illumina® ForenSeq™ DNA Signature Prep Kit (Illumina), and Promega PowerSeq™ Systems (Auto, Y, and Mito) (Promega Corporation, Madison, WI, USA). In addition to these multiplexes developed expressly for MPS, configuration files were included for forensically relevant InDels and the repeat-region configuration files for

Results and discussion

Each large MPS multiplex presents unique challenges for bioinformatics development. When considering the placement of the anchors for each assay, multiple factors (e.g., size of the amplicon, presence of SSEs, sequencing platform, flanking region SNPs, etc.) should be considered. Previous studies have used the PCR primers for anchor assignment [24], [54]. Anchors based on PCR primers allow for capture of the maximum amount of information from an amplicon in the form of both repeat region and

Conclusion

STRait Razor v2s uses a direct haplotype capture approach to extract phased data from entire amplicons of all marker types within current (or soon to be available) commercially available forensic multiplex assays. This software update provides users a database of, currently, ∼2500 unique sequences matched with nomenclature in the format recommended by Parson et al. [21]. The sequences contained within this database are based on the anchors defined as reported herein; as primers define the

Conflict of interest

The authors declare that they have no conflicts of interest.

Acknowledgements

This work was supported in part by award no. 2015-DN-BX-K067, awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice. The opinions, findings, and conclusions or recommendations expressed are those of the authors and do not necessarily reflect those of the U.S. Department of Justice. The authors would like to thank Nicole Novroski, Jennifer Churchill, Lisa Borsuk, Lilliana Moreno, Ryan England, and Katherine Gettings for their contributions and

References (61)

  • S.L. Friis et al. Introduction of the Python script STRinNGS for analysis of STR regions in FASTQ or BAM files and expansion of the Danish STR sequence database to 11 STRs. Forensic Sci. Int. Genet. (2016)
  • J.C.-I. Lee et al. A DNA sequence searching tool for massively parallel sequencing data. Forensic Sci. Int. Genet. (2017)
  • J. Hoogenboom et al. FDSTools: a software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise. Forensic Sci. Int. Genet. (2017)
  • W. Parson et al. Massively parallel sequencing of forensic STRs: considerations of the DNA commission of the International Society for Forensic Genetics (ISFG) on minimal nomenclature requirements. Forensic Sci. Int. Genet. (2016)
  • C. Van Neste et al. My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing. Forensic Sci. Int. Genet. (2014)
  • S.L. Fordyce et al. Second-generation sequencing of forensic STRs using the Ion Torrent HID STR 10-plex and the Ion PGM. Forensic Sci. Int. Genet. (2015)
  • J.D. Churchill et al. Evaluation of the Illumina® beta version ForenSeq DNA signature prep kit for use in genetic profiling. Forensic Sci. Int. Genet. (2016)
  • K.K. Kidd et al. Current sequencing technology makes microhaplotypes a powerful new type of genetic marker for forensics. Forensic Sci. Int. Genet. (2014)
  • K. Kidd et al. Microhaplotype loci are a powerful new type of forensic marker. Forensic Sci. Int. Genet. Suppl. Ser. (2013)
  • K.K. Kidd et al. Genetic markers for massively parallel sequencing in forensics. Forensic Sci. Int. Genet. Suppl. Ser. (2015)
  • F.R. Wendt et al. Flanking region variation of ForenSeq™ DNA signature prep kit STR and SNP loci in Yavapai Native Americans. Forensic Sci. Int. Genet. (2017)
  • W. Parson et al. Evaluation of next generation mtGenome sequencing using the Ion Torrent Personal Genome Machine (PGM). Forensic Sci. Int. Genet. (2013)
  • M. Eduardoff et al. Inter-laboratory evaluation of the EUROFORGEN Global ancestry-informative SNP panel by massively parallel sequencing using the Ion PGM. Forensic Sci. Int. Genet. (2016)
  • M. Eduardoff et al. Inter-laboratory evaluation of SNP-based forensic identification by massively parallel sequencing using the Ion PGM. Forensic Sci. Int. Genet. (2015)
  • S. Elena et al. Revealing the challenges of low template DNA analysis with the prototype Ion AmpliSeq Identity panel v2.3 on the PGM Sequencer. Forensic Sci. Int. Genet. (2016)
  • C. Phillips et al. D5S2500 is an ambiguously characterized STR: identification and description of forensic microsatellites in the genomics age. Forensic Sci. Int. Genet. (2016)
  • K.J. van der Gaag et al. Massively parallel sequencing of short tandem repeats-Population data and mixture analysis results for the PowerSeq system. Forensic Sci. Int. Genet. (2016)
  • R. England et al. Massively parallel sequencing for the forensic scientist—sequencing archived amplified products of AmpFlSTR Identifiler and PowerPlex Y multiplex kits to capture additional information. Aust. J. Forensic Sci. (2016)
  • D.H. Warshauer et al. STRait Razor v2.0: the improved STR allele identification tool–razor. Forensic Sci. Int. Genet. (2014)
  • S.Y. Anvar et al. TSSV: a tool for characterization of complex allelic variants in pure and mixed genomes. Bioinformatics (2014)