  • Technical Report

Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications

Abstract

High-throughput DNA sequencing technology has transformed genetic research and is starting to make an impact on clinical practice. However, analyzing high-throughput sequencing data remains challenging, particularly in clinical settings where accuracy and turnaround times are critical. We present a new approach to this problem, implemented in a software package called Platypus. Platypus achieves high sensitivity and specificity for SNPs, indels and complex polymorphisms by using local de novo assembly to generate candidate variants, followed by local realignment and probabilistic haplotype estimation. It is an order of magnitude faster than existing tools and generates calls from raw aligned read data without preprocessing. We demonstrate the performance of Platypus in clinically relevant experimental designs by comparing with SAMtools and GATK on whole-genome and exome-capture data, by identifying de novo variation in 15 parent-offspring trios with high sensitivity and specificity, and by estimating human leukocyte antigen genotypes directly from variant calls.
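To make the trio analysis mentioned above concrete, the following is a minimal sketch of how candidate de novo variants can be flagged from genotype calls: a site is a candidate when the child's genotype cannot be formed from one maternal and one paternal allele. This is a toy Mendelian-consistency check and not the Platypus method, which works on haplotype likelihoods; the function names, site labels and genotypes are hypothetical and chosen only for illustration.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Return True if the child's unordered diploid genotype can be formed
    by taking one allele from the mother and one from the father."""
    possible = {tuple(sorted((m, f))) for m, f in product(mother, father)}
    return tuple(sorted(child)) in possible


def candidate_de_novo_sites(trio_calls):
    """Return the sites whose child genotype violates Mendelian inheritance."""
    return [
        site
        for site, (child, mother, father) in trio_calls.items()
        if not is_mendelian_consistent(child, mother, father)
    ]


# Hypothetical genotype calls (child, mother, father);
# 0 = reference allele, 1 = alternate allele.
trio_calls = {
    "chr1:1000": ((0, 1), (0, 0), (0, 1)),  # alternate allele inherited from father
    "chr2:2000": ((0, 1), (0, 0), (0, 0)),  # neither parent carries the allele
    "chr3:3000": ((1, 1), (0, 1), (0, 1)),  # one alternate allele from each parent
}

print(candidate_de_novo_sites(trio_calls))  # prints ['chr2:2000']
```

In practice a hard genotype check like this is very sensitive to calling errors in any of the three samples, which is why sensitivity and specificity on trios are a useful benchmark for a variant caller.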

Figure 1: Simplified flow diagram of the integrated calling algorithm.
Figure 2: Size distribution of indel calls in the NA12878 trio.
Figure 3: Genotypes of the HLA-A, HLA-B and HLA-C loci at two- and four-digit resolution.

Acknowledgements

This study was funded by Biotechnology and Biological Sciences Research Council (BBSRC) grant BB/I02593X/1 (G.L., G.M., A.R. and H.P.), by Wellcome Trust grants 102731/Z/13/Z (A.O.M.W. and S.R.F.T.), 089250/Z/09/Z (I.M.) and 090532/Z/09/Z (G.M., G.L. and A.R.), and by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre Programme. The views expressed are those of the authors and not necessarily those of the National Health Service (NHS), NIHR or the UK Department of Health.

Author information

Contributions

A.R. developed Platypus. A.R., H.P., I.M., Z.I. and G.L. contributed code and algorithms. A.R., H.P. and G.L. analyzed data. H.P., S.R.F.T. and A.O.M.W. performed validation experiments. WGS500 contributed data. A.O.M.W., G.M. and G.L. wrote the manuscript. G.L. initiated and led the project.

Corresponding author

Correspondence to Gerton Lunter.

Ethics declarations

Competing interests

G.M. and G.L. are cofounders and shareholders of Genomics, Ltd. A.R. is currently employed by Genomics, Ltd. The other authors declare no competing financial interests.

Additional information

A list of members and affiliations appears in the Supplementary Note.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–5, Supplementary Tables 1–6 and Supplementary Note. (PDF 13583 kb)

About this article

Cite this article

Rimmer, A., Phan, H., Mathieson, I. et al. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat Genet 46, 912–918 (2014). https://doi.org/10.1038/ng.3036

  • DOI: https://doi.org/10.1038/ng.3036
