Using MLST to study bacterial variation: prospects in the genomic era
Abstract
Multilocus sequence typing (MLST) indexes the sequence variation present in a small number (usually seven) of housekeeping gene fragments located around the bacterial genome. Unique alleles at these loci are assigned arbitrary integer identifiers, which effectively summarizes the variation present in several thousand base pairs of genome sequence information as a series of numbers. Comparing bacterial isolates using allele-based methods corrects for the effects of lateral gene transfer present in many bacterial populations and is computationally efficient. This ‘gene-by-gene’ approach can be applied to larger collections of loci, such as the ribosomal protein genes used in ribosomal MLST (rMLST), up to and including the complete set of coding sequences present in a genome, whole-genome MLST (wgMLST), providing scalable, efficient and readily interpreted genome analysis.
Multilocus sequence typing
Since its introduction in 1998, multilocus sequence typing (MLST) has proven to be an effective and widely used method for characterizing bacterial isolates. MLST indexes the diversity of nucleotide sequences of fragments of housekeeping genes (loci), with most bacterial MLST schemes employing seven loci of approximately 400–500 bp each, a length initially chosen as achievable with dideoxy sequencing technology [1]. Each novel sequence at each locus is assigned a number in order of discovery (adk-1, adk-2… and so on) and the numbers for all the loci characterized in a particular scheme are incorporated into an allelic profile (e.g., 2–3–4–3–8–4–6). Each profile, or unique combination of alleles, is assigned an arbitrary sequence type (ST; e.g., ST-11). By the simple expedient of having look-up tables that relate STs to allelic profiles and alleles to allele sequences, individual sequences of thousands of base pairs in length can be uniquely associated with a bacterial isolate using a single number [2].
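The look-up table idea can be illustrated with a minimal Python sketch. The class name `MLSTScheme`, its methods and the two short toy sequences are hypothetical and chosen only for illustration; in curated databases numbers are issued by curators, whereas here they are simply issued in order of first appearance.

```python
from typing import Dict, List, Tuple

Profile = Tuple[int, ...]


class MLSTScheme:
    """Toy look-up tables: sequence -> allele number, allelic profile -> ST.

    Hypothetical illustration only; not the data model of any real database.
    """

    def __init__(self, loci: List[str]) -> None:
        self.loci = list(loci)
        # One table per locus, mapping each distinct sequence to its number.
        self.alleles: Dict[str, Dict[str, int]] = {locus: {} for locus in self.loci}
        # Table mapping each distinct allelic profile to its ST number.
        self.sts: Dict[Profile, int] = {}

    def allele_number(self, locus: str, sequence: str) -> int:
        """Return the allele number, issuing the next integer for a novel sequence."""
        table = self.alleles[locus]
        seq = sequence.upper()
        if seq not in table:
            table[seq] = len(table) + 1
        return table[seq]

    def sequence_type(self, fragments: Dict[str, str]) -> Tuple[int, Profile]:
        """Return (ST, allelic profile) for one isolate's locus fragments."""
        profile = tuple(self.allele_number(l, fragments[l]) for l in self.loci)
        if profile not in self.sts:
            self.sts[profile] = len(self.sts) + 1
        return self.sts[profile], profile

    def sequences_for_st(self, st: int) -> Dict[str, str]:
        """Reverse look-up: recover the locus sequences behind an ST number."""
        profile = next(p for p, number in self.sts.items() if number == st)
        result: Dict[str, str] = {}
        for locus, allele in zip(self.loci, profile):
            table = self.alleles[locus]
            result[locus] = next(s for s, n in table.items() if n == allele)
        return result
```

Typing two isolates that differ at one locus yields two allele numbers at that locus and two STs, and `sequences_for_st` shows how a single integer leads back to the full sequences.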
MLST & bacterial isolate characterization
One of the principal reasons for the success of MLST was that it was developed in the light of an improved appreciation of the population and evolutionary biology of the bacteria, specifically the role of lateral gene transfer and the consequences of this process [3]. The concept of examining the genome at multiple housekeeping genes, that is, core genome loci presumed to be under neutral or ‘nearly neutral’ selection pressures, was previously established with multilocus enzyme electrophoresis (MLEE) [4]. MLST introduced the innovation of indexing nucleotide sequence variation in these genes, providing dramatically better resolution than had been possible with MLEE, which inferred the presence of distinct alleles from differential migration of proteins during starch gel electrophoresis. A further advantage of using nucleotide sequences was their reproducibility and portability, with the availability of curated reference data sets via the internet [2,48,49].
The use of allele designations as units of analysis, rather than nucleotide sequences themselves, addresses many of the problems associated with employing phylogenetic methods to assess isolate relationships among recombining bacteria. These problems arise from the fact that individual recombination events, which are common, introduce multiple polymorphisms, while point mutations, which are often relatively rare, change only single nucleotides. In MLST, these changes (i.e., point mutations or recombination events that change many nucleotides in one event) are effectively weighted the same, as both represent an allelic change [5]; however, ST and allele designations can be used when required to retrieve the relevant nucleotide sequences from the look-up tables for sequence-based analysis.
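The equal weighting of point mutation and recombination can be made concrete with a small sketch. The sequences below are invented 12-bp toy fragments, not real alleles, and the function names are hypothetical; the point is only that a nucleotide count separates the two events while an allele count does not.

```python
from typing import Dict

Isolate = Dict[str, str]  # locus name -> fragment sequence


def nucleotide_distance(a: Isolate, b: Isolate) -> int:
    """Total number of differing sites across all loci (equal-length fragments)."""
    return sum(
        sum(x != y for x, y in zip(a[locus], b[locus]))
        for locus in a
    )


def allelic_distance(a: Isolate, b: Isolate) -> int:
    """Number of loci whose sequences differ at all, as counted in MLST."""
    return sum(a[locus] != b[locus] for locus in a)


# Toy data (assumed for illustration): one locus is altered in each variant.
reference = {"adk": "ACGTACGTACGT", "gdh": "GGCCGGCCGGCC"}
point_mutant = {"adk": "ACGTACGTACGA", "gdh": "GGCCGGCCGGCC"}  # one site changed
recombinant = {"adk": "ATGTTCGAACCT", "gdh": "GGCCGGCCGGCC"}   # four sites changed

if __name__ == "__main__":
    for name, isolate in (("point mutant", point_mutant), ("recombinant", recombinant)):
        print(
            name,
            "nucleotide distance:", nucleotide_distance(reference, isolate),
            "allelic distance:", allelic_distance(reference, isolate),
        )
```

In this toy case the recombinant is four times as distant as the point mutant by nucleotide count, yet both are a single allelic change from the reference.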
For numerous bacteria, MLST schemes comprising as few as seven housekeeping loci have proved to be highly discriminatory [2]. A survey of the current publicly available MLST schemes reveals a staggering level of diversity among allelic profiles that represent only a fraction of the genome in question (usually less than 0.2%) [50]. This demonstrates the importance of having a straightforward and infinitely expandable means for summarizing and comparing data on bacterial diversity, such as the allele and ST number definitions. As the allele and ST designations are arbitrary, they can be grouped into different higher order groups as improved understanding of the biological structure of the diversity they catalogue emerges, without the need to rewrite the fundamental nomenclature. MLST data are amenable to such analyses of population structure since the presence of shared alleles among MLST loci can be used to infer ancestral lineages by various clustering methods and Bayesian techniques such as BAPS [6], Structure [7] and ClonalFrame [8].
The extent of genetic exchange in many bacteria is evident from the fact that the number of STs observed frequently exceeds the number of alleles observed per locus by more than an order of magnitude [9]. Nevertheless, even in highly recombining organisms certain genotypes, for which STs are effective surrogates, dominate and the analysis of seven-locus MLST data resolves numerous bacterial populations into ‘clonal complexes’. While the majority of STs in a given data set are rare and transitory, certain STs are both high frequency and stable over time and during geographic spread [2]. These STs are markers for persistent consensus genotypes, which are variously referred to as a ‘central genotype’ or, with less accuracy, as a ‘founder’ or ‘ancestor’: there is rarely, if ever, any evidence that central genotypes represent either [10]. A striking example of such persistence is the spread of carbapenem-resistant ST-235 Pseudomonas aeruginosa across Eastern Europe [11]. The linking of specific STs with consensus genotypes has led to them acting as markers for ‘high-risk clones’ in some cases; for example, the aforementioned ST-235, along with ST-111 and ST-175 P. aeruginosa, are associated with extensive drug resistance in healthcare settings [12]. This association is clinically useful in the absence of specific information on the chromosomal resistance mechanisms involved.
The clonal complex
Clonal complexes can be detected in population data sets with various heuristic algorithms including split decomposition [13], NeighborNet [14,15], minimum spanning-trees, eBURST [16] and the related goeBURST [17,18] (Figure 1). The clonal complexes, conventionally named after the consensus ST (e.g., the meningococcal ST-11 clonal complex or CC11), are frequently associated with important phenotypes: in pathogenic bacteria, for example, they are often associated with properties such as the propensity to cause disease, the type of disease caused, particular vaccine antigens, antimicrobial resistance or host association. Consequently, the MLST-defined clonal complex has become a principal unit of analysis, which has facilitated functional investigations by enabling complex phenotypes, such as host association, pathogenicity or antimicrobial resistance, to be associated with bacterial genotype [2]. By condensing sequence information into a series of numbers, comparisons among isolates can be simplified to a matter of counting the number of loci that vary. When performed for a collection of profiles, the resultant distance matrix can be graphically represented with various algorithms to identify relationships among isolates (see Box 1).
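As a minimal sketch of this counting step, the following Python builds a pairwise distance matrix from four of the allelic profiles listed in Table 1. The function names are hypothetical, and the matrix is only the input that clustering or network algorithms would work from; it is not itself an implementation of eBURST, goeBURST or NeighborNet.

```python
from itertools import combinations
from typing import Dict, Tuple

Profile = Tuple[int, ...]

# Allelic profiles copied from Table 1
# (locus order: abcZ, adk, aroE, fumC, gdh, pdhC, pgm).
PROFILES: Dict[int, Profile] = {
    32: (4, 10, 5, 4, 6, 3, 8),
    33: (8, 10, 5, 4, 6, 3, 8),
    34: (8, 10, 5, 4, 5, 3, 8),
    749: (8, 10, 77, 4, 6, 3, 8),
}


def locus_differences(a: Profile, b: Profile) -> int:
    """Count the loci at which two allelic profiles carry different alleles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(profiles: Dict[int, Profile]) -> Dict[int, Dict[int, int]]:
    """Symmetric matrix of locus differences between every pair of STs."""
    sts = sorted(profiles)
    matrix = {st: {st: 0} for st in sts}
    for s, t in combinations(sts, 2):
        d = locus_differences(profiles[s], profiles[t])
        matrix[s][t] = d
        matrix[t][s] = d
    return matrix


if __name__ == "__main__":
    matrix = distance_matrix(PROFILES)
    sts = sorted(PROFILES)
    print("ST".ljust(6) + "".join(str(st).rjust(6) for st in sts))
    for s in sts:
        print(str(s).ljust(6) + "".join(str(matrix[s][t]).rjust(6) for t in sts))
```

Because only integers are compared, the cost of each comparison is independent of the length of the underlying sequences.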
Limitations of MLST
The metabolic diversity of the bacterial domain has prevented the development of a universal MLST scheme for all bacteria based on metabolic housekeeping genes, as even these genes are either not widely shared or are too diverse. There is a paradox in that any sequence-based typing scheme relies on variation for discrimination; however, as conventional MLST employs the amplification of these loci by PCR, their variability makes the design of reliable amplification and sequencing primers difficult [2]. In addition, the variability of the content of the core genome among different bacteria makes it impossible to use the same metabolic genes other than within quite closely related organisms. Consequently, even within genera, such as the genus Streptococcus, it is frequently necessary to have more than one scheme, each with different target loci [19–22]. At the other end of the diversity scale, MLST on its own cannot provide discrimination among very closely related organisms – this includes the recently evolved single-clone asexual pathogens such as Bacillus anthracis [23] or isolates of more diverse pathogens that belong to the same clone. In these cases it has been necessary to use additional typing schemes that index rapidly evolving loci; for example, those encoding antigen genes [24,25] or variable number tandem repeats [26]. More recently, whole-genome single-nucleotide polymorphisms (SNPs) have been used for this type of analysis in such pathogens [27]. Notwithstanding these limitations at the whole domain and sub-strain typing levels, MLST has proven to be very successful in describing population diversity and structure for a wide range of bacteria.
Conclusion & future perspective
MLST has been highly successful as an approach to the description, archiving and unambiguous cataloging of the diversity of a broad range of bacteria and, in many cases, groups of related bacteria defined by MLST have been used as the basis for functional studies. However, MLST has not proved to be a complete solution to the characterization problem at two levels: there has been no single universal MLST scheme applicable to all bacteria; and seven-locus MLST lacks the very high resolution required for some applications.
The advent of rapid and inexpensive sequencing has removed the practical constraints that have framed the design of MLST approaches [28]. However, while the torrent of genome sequence data now available can potentially overcome the shortcomings of MLST, it also threatens the ordered investigation of bacterial diversity by swamping the field with ‘too much information’, or at least too much data without the organization and understandable nomenclature framework that were major reasons for the success of MLST [29]. This mirrors the multiple incompatible molecular typing methods, or YATMs (‘Yet Another Typing Method’) [30], developed in the 1990s, which in large degree stimulated the proposal of MLST as a general approach [1,2]. The principles behind MLST can, however, be applied to whole-genome analysis, with schemes consisting of increasing numbers of loci, up to and including the entire complement of coding sequences within the genome (whole-genome MLST [wgMLST]). This approach has been termed ‘gene-by-gene’ genomic analysis [31–33] and is the philosophy behind the design of the Bacterial Isolate Genome Sequence Database (BIGSdb) platform [34] that is currently used to host most of the MLST, and increasingly now genome, databases on PubMLST [48]. One such MLST scheme that offers universal bacterial species identification and typing is ribosomal MLST (rMLST) [35,36]. This uses 53 ribosomal protein genes, the products of which come together to form the ribosome, the essential translation machinery of the cell, which is found throughout the bacterial domain [37]. Initial validation of rMLST with selected species indicates that it provides higher resolution than standard seven-locus MLST, and it has been used to resolve species groups within the Neisseria [38] and the lineage structure of Campylobacter [39].
Even though whole-genome data are becoming ubiquitous, MLST is still relevant. It provides the overall clonal frame of the organism and allows genomic data to be related to legacy data sets collected over the past 15 years. Allele designations for MLST can be readily extracted from whole-genome data [40–43], and the costs of sequencing a genome and of generating seven-locus MLST data by Sanger sequencing are comparable. Gene-by-gene methods, such as rMLST and wgMLST, provide scalable means of studying the sequence variation encoded in the genome.
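The extraction of allele designations from assembled genome data can be sketched, under strong simplifying assumptions, as an exact-match search of known allele sequences against contigs on both strands. The function names and the 8-bp toy sequences are hypothetical; published tools [40–43] use alignment- or read-mapping-based searches and handle novel or partial alleles, which this sketch does not attempt.

```python
from typing import Dict, List, Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def call_allele(contigs: List[str], alleles: Dict[int, str]) -> Optional[int]:
    """Return the number of the first known allele found exactly in any contig.

    Both strands are searched. None means no known allele matched exactly,
    which in practice could indicate a novel allele or an incomplete assembly.
    """
    for allele_id, allele_seq in alleles.items():
        forward = allele_seq.upper()
        reverse = reverse_complement(forward)
        for contig in contigs:
            contig = contig.upper()
            if forward in contig or reverse in contig:
                return allele_id
    return None


def call_profile(
    contigs: List[str], scheme: Dict[str, Dict[int, str]]
) -> Dict[str, Optional[int]]:
    """Allele designation (or None) for every locus in the scheme."""
    return {locus: call_allele(contigs, alleles) for locus, alleles in scheme.items()}


# Toy scheme and assembly (assumed for illustration only).
SCHEME = {
    "adk": {1: "AAACCCGG", 2: "AAACTCGG"},
    "gdh": {1: "GATTACAG", 2: "GATTTCAG"},
    "pgm": {1: "CATCATGG"},
}
CONTIGS = [
    "TTTT" + "AAACTCGG" + "TTTT",                      # adk-2, forward strand
    "CCCC" + reverse_complement("GATTACAG") + "CCCC",  # gdh-1, reverse strand
]

if __name__ == "__main__":
    print(call_profile(CONTIGS, SCHEME))
```

A locus reported as `None` would, in a curated database, be passed on for assignment of a new allele number rather than discarded.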
ST | abcZ | adk | aroE | fumC | gdh | pdhC | pgm | Frequency |
---|---|---|---|---|---|---|---|---|
32 | 4 | 10 | 5 | 4 | 6 | 3 | 8 | 11 |
34 | 8 | 10 | 5 | 4 | 5 | 3 | 8 | 8 |
33 | 8 | 10 | 5 | 4 | 6 | 3 | 8 | 5 |
749 | 8 | 10 | 77 | 4 | 6 | 3 | 8 | 4 |
259 | 4 | 10 | 5 | 40 | 6 | 3 | 8 | 3 |
290 | 8 | 3 | 5 | 4 | 1 | 3 | 8 | 2 |
2931 | 4 | 5 | 5 | 4 | 6 | 3 | 8 | 2 |
8049 | 8 | 10 | 5 | 4 | 6 | 3 | 15 | 2 |
639 | 8 | 10 | 5 | 9 | 6 | 3 | 8 | 1 |
1096 | 4 | 10 | 5 | 26 | 6 | 3 | 8 | 1 |
6083 | 4 | 10 | 5 | 4 | 6 | 416 | 8 | 1 |
7460 | 4 | 10 | 48 | 4 | 6 | 3 | 8 | 1 |
9890 | 4 | 10 | 5 | 630 | 6 | 3 | 8 | 1 |
10285 | 4 | 10 | 5 | 24 | 6 | 3 | 9 | 1 |
10286 | 4 | 10 | 5 | 8 | 6 | 3 | 8 | 1 |
Box 1. Anatomy of a clonal complex.
The Neisseria ST-32 complex (previously identified as ET-5 complex) has been responsible for epidemics of meningococcal disease in Europe [44,45] and the Americas [46,47]. To demonstrate how MLST approaches can be used to analyze related strains at different levels of resolution the whole genomes of ST-32 complex isolates recovered from all cases of disease in England and Wales in two recent epidemiological years were investigated (44 isolates). ST-32 was the most frequently isolated genotype, followed by ST-34 and ST-33 (Table 1) – double- and single-locus variants of ST-32, respectively. A Neighbor-Net comparison at the MLST loci was performed using the BIGSdb Genome Comparator tool [34] (Figure 2A). The vertices of the network can be readily annotated to identify the locus changes represented, clearly showing that ST-32 itself has the largest number of variants that differ at a single locus, with ST-33 also possessing a smaller set of its own variants. Scaling the analysis up to use wgMLST (1548 variable loci) shows that while ST-32 isolates are largely clustered together, there are some present in other parts of the network (Figure 2B). Likewise, ST-33 isolates are dispersed in the network. This shows that while standard MLST is useful for comparisons at the level of the clonal complex, strains with identical ST numbers may not be genetically closer to each other than to other members of the complex.
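The single-locus relationships described here can be checked directly from Table 1. The sketch below, with hypothetical function names, lists the single-locus variants of a chosen central ST together with the locus that differs; it covers only the seven-locus comparison, not the Neighbor-Net or wgMLST analyses shown in Figure 2.

```python
from typing import Dict, List, Tuple

LOCI = ("abcZ", "adk", "aroE", "fumC", "gdh", "pdhC", "pgm")

# ST -> (allelic profile, number of isolates), copied from Table 1.
TABLE_1: Dict[int, Tuple[Tuple[int, ...], int]] = {
    32: ((4, 10, 5, 4, 6, 3, 8), 11),
    34: ((8, 10, 5, 4, 5, 3, 8), 8),
    33: ((8, 10, 5, 4, 6, 3, 8), 5),
    749: ((8, 10, 77, 4, 6, 3, 8), 4),
    259: ((4, 10, 5, 40, 6, 3, 8), 3),
    290: ((8, 3, 5, 4, 1, 3, 8), 2),
    2931: ((4, 5, 5, 4, 6, 3, 8), 2),
    8049: ((8, 10, 5, 4, 6, 3, 15), 2),
    639: ((8, 10, 5, 9, 6, 3, 8), 1),
    1096: ((4, 10, 5, 26, 6, 3, 8), 1),
    6083: ((4, 10, 5, 4, 6, 416, 8), 1),
    7460: ((4, 10, 48, 4, 6, 3, 8), 1),
    9890: ((4, 10, 5, 630, 6, 3, 8), 1),
    10285: ((4, 10, 5, 24, 6, 3, 9), 1),
    10286: ((4, 10, 5, 8, 6, 3, 8), 1),
}


def changed_loci(st_a: int, st_b: int) -> List[str]:
    """Names of the loci at which two STs carry different alleles."""
    profile_a, _ = TABLE_1[st_a]
    profile_b, _ = TABLE_1[st_b]
    return [locus for locus, x, y in zip(LOCI, profile_a, profile_b) if x != y]


def single_locus_variants(central: int) -> Dict[int, str]:
    """Map each single-locus variant of `central` to the locus that differs."""
    variants: Dict[int, str] = {}
    for st in TABLE_1:
        diff = changed_loci(central, st)
        if len(diff) == 1:
            variants[st] = diff[0]
    return variants


if __name__ == "__main__":
    for central in (32, 33):
        slvs = single_locus_variants(central)
        print(f"ST-{central}: {len(slvs)} single-locus variants")
        for st, locus in sorted(slvs.items()):
            print(f"  ST-{st} differs at {locus}")
```

On the Table 1 data, ST-32 has eight single-locus variants and ST-33 has five (one of which is ST-32 itself), in line with the description of the network, while ST-34 differs from ST-32 at abcZ and gdh.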
EXECUTIVE SUMMARY
Multilocus sequence typing methodology
• A molecular typing method that indexes the sequences of fragments of housekeeping genes at (usually) seven loci.
• Each unique sequence for a given locus is given an arbitrary allele number.
• Each unique combination of alleles (or allelic profile) is given an arbitrary sequence type (ST) number.
• Allele-based methods such as multilocus sequence typing (MLST) correct for the effects of lateral gene transfer.
Clonal complexes
• These are groups of related STs that may be associated with particular phenotypes.
• Can be defined by various methods that involve counting locus differences to a ‘central’ ST or to other members of the complex.
Ribosomal MLST
• A universal MLST scheme that uses the 53 ribosomal protein genes.
• Suitable for use with whole-genome data.
• Higher resolution than conventional (seven locus) MLST and works for all bacteria.
Gene-by-gene analysis: whole-genome MLST
• MLST-type approach to analyzing genomic variation.
• Alleles for all coding sequences in the genome are indexed.
• Highly scalable and computationally efficient analysis of whole genome data using the same methods as used for MLST.
Financial & competing interests disclosure
MCJM is a Wellcome Trust Senior Fellow in Basic Biomedical Sciences. This publication made use of the Meningitis Research Foundation Meningococcus Genome Library (http://www.meningitis.org/research/genome) developed by Public Health England, the Wellcome Trust Sanger Institute and the University of Oxford as a collaboration. The project is funded by Meningitis Research Foundation. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.
No writing assistance was utilized in the production of this manuscript.
Open access
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/
Papers of special note have been highlighted as: • of interest; •• of considerable interest
References
- 1 Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc. Natl Acad. Sci. USA 95(6), 3140–3145 (1998). • Describes the development of the first multilocus sequence typing (MLST) scheme.
- 2 Multilocus sequence typing of bacteria. Annu. Rev. Microbiol. 60, 561–588 (2006).
- 3 How clonal are bacteria? Proc. Natl Acad. Sci. USA 90(10), 4384–4388 (1993).
- 4 Methods of multilocus enzyme electrophoresis for bacterial population genetics and systematics. Appl. Environ. Microbiol. 51, 837–884 (1986).
- 5 The influence of recombination on the population structure and evolution of the human pathogen Neisseria meningitidis. Mol. Biol. Evol. 16(6), 741–749 (1999).
- 6 Bayesian semi-supervised classification of bacterial samples using MLST databases. BMC Bioinformatics 12, 302 (2011).
- 7 Inferring weak population structure with the assistance of sample group information. Mol. Ecol. Resour. 9(5), 1322–1332 (2009).
- 8 Inference of bacterial microevolution using multilocus sequence data. Genetics 175(3), 1251–1266 (2007).
- 9 Carried meningococci in the Czech Republic: a diverse recombining population. J. Clin. Microbiol. 38(12), 4492–4498 (2000).
- 10 Small change: keeping pace with microevolution. Nat. Rev. Microbiol. 2(6), 483–495 (2004).
- 11 Spread of extensively resistant VIM-2-positive ST235 Pseudomonas aeruginosa in Belarus, Kazakhstan, and Russia: a longitudinal epidemiological and clinical study. Lancet Infect. Dis. 13(10), 867–876 (2013).
- 12 Genetic markers of widespread extensively drug-resistant Pseudomonas aeruginosa high-risk clones. Antimicrob. Agents Chemother. 56(12), 6349–6357 (2012).
- 13 Split decomposition: a new and useful approach to phylogenetic analysis of distance data. Mol. Phylogenet. Evol. 1(3), 242–252 (1992).
- 14 Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol. Biol. Evol. 21(2), 255–265 (2004).
- 15 Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. 23(2), 254–267 (2006).
- 16 eBURST: inferring patterns of evolutionary descent among clusters of related bacterial genotypes from multilocus sequence typing data. J. Bacteriol. 186(5), 1518–1530 (2004).
- 17 PHYLOViZ: phylogenetic inference and data visualization for sequence based typing methods. BMC Bioinformatics 13, (2012).
- 18 Global optimal eBURST analysis of multilocus typing data using a graphic matroid approach. BMC Bioinformatics 10, 152 (2009).
- 19 Population structure of Streptococcus oralis. Microbiology 155(Pt 8), 2593–2602 (2009).
- 20 Development of an unambiguous and discriminatory multilocus sequence typing scheme for the Streptococcus zooepidemicus group. Microbiology 154(Pt 10), 3016–3024 (2008).
- 21 First insights into the evolution of Streptococcus uberis: a multilocus sequence typing scheme that enables investigation of its population biology. Appl. Environ. Microbiol. 72(2), 1420–1428 (2006).
- 22 A multilocus sequence typing scheme for Streptococcus pneumoniae: identification of clones associated with serious invasive disease. Microbiology 144(11), 3049–3060 (1998).
- 23 Population structure and evolution of the Bacillus cereus group. J. Bacteriol. 186(23), 7959–7970 (2004).
- 24 Molecular typing of meningococci: recommendations for target choice and nomenclature. FEMS Microbiol. Rev. 31(1), 89–96 (2007).
- 25 Extended sequence typing of Campylobacter spp., United Kingdom. Emerg. Infect. Dis. 14(10), 1620–1622 (2008).
- 26 Diversity in a variable-number tandem repeat from Yersinia pestis. J. Clin. Microbiol. 38(4), 1516–1519 (2000).
- 27 Phylogenetic understanding of clonal populations in an era of whole genome sequencing. Infect. Genet. Evol. 9(5), 1010–1019 (2009).
- 28 Microbiology in the post-genomic era. Nat. Rev. Microbiol. 6(6), 419–430 (2008).
- 29 Pathogen typing in the genomics era: MLST and the future of molecular epidemiology. Infect. Genet. Evol. 16, 38–53 (2013). • Discusses the use of whole-genome sequencing for high-throughput typing with multilocus methods, along with a detailed comparison of allele-based and sequence-based methods of comparison.
- 30 A surfeit of YATMs? J. Clin. Microbiol. 34(7), 1870 (1996). • An historic reference that highlights the dangers of multiple incompatible typing methods. This is still pertinent today with the advent of whole-genome sequencing.
- 31 A gene-by-gene approach to bacterial population genomics: whole genome MLST of Campylobacter. Genes 3(2), 261–277 (2012).
- 32 MLST revisited: the gene-by-gene approach to bacterial genomics. Nat. Rev. Microbiol. 11(10), 728–736 (2013). •• A review of MLST in the genome age and how the same methodology can be applied to whole-genome analysis.
- 33 Evolutionary and genomic insights into meningococcal biology. Future Microbiol. 7(7), 873–885 (2012).
- 34 BIGSdb: scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics 11(1), 595 (2010).
- 35 Ribosomal multi-locus sequence typing: universal characterization of bacteria from domain to strain. Microbiology 158, 1005–1015 (2012). • Describes a universal MLST scheme that can speciate and type any bacteria since it uses the sequences of genes that encode the ribosomal proteins that are universally present throughout the bacterial domain.
- 36 Two novel methods for using genome sequences to infer taxonomy. Microbiology 158(Pt 6), 1414 (2012).
- 37 Phylogenomics of prokaryotic ribosomal proteins. PLoS ONE 7(5), (2012).
- 38 A genomic approach to bacterial taxonomy: an examination and proposed reclassification of species within the genus Neisseria. Microbiology 158(Pt 6), 1570–1580 (2012).
- 39 Evidence for phenotypic plasticity amongst multi-host Campylobacter jejuni and C. coli lineages using ribosomal MLST and Raman spectroscopy. Appl. Environ. Microbiol. 79(3), 965–973 (2013).
- 40 Short read sequence typing (SRST): multi-locus sequence types from short reads. BMC Genomics 13, 338 (2012).
- 41 Automated extraction of typing information for bacterial pathogens from whole genome sequence data: Neisseria meningitidis as an exemplar. Euro Surveill. 18(4), 20379 (2013).
- 42 Ion torrent personal genome machine sequencing for genomic typing of Neisseria meningitidis for rapid determination of multiple layers of typing information. J. Clin. Microbiol. 50(6), 1889–1894 (2012). • A practical demonstration of the ease of extracting molecular typing information, such as MLST, from whole-genome data.
- 43 Multilocus sequence typing of total-genome-sequenced bacteria. J. Clin. Microbiol. 50(4), 1355–1361 (2012).
- 44 Population genetic and evolutionary approaches to the analysis of Neisseria meningitidis isolates belonging to the ET-5 complex. J. Bacteriol. 181(18), 5551–5556 (1999).
- 45 Meningococcal carriage during a clonal meningococcal B outbreak in France. Eur. J. Clin. Microbiol. Infect. Dis. 32(11), 1451–1459 (2013).
- 46 The genetic structure of Neisseria meningitidis populations in Cuba before and after the introduction of a serogroup BC vaccine. Infect. Genet. Evol. 10(4), 546–554 (2010).
- 47 Molecular epidemiology of Neisseria meningitidis serogroup B in Brazil. PLoS ONE 7(3), e33016 (2012).
- 48 PubMLST website hosted at the University of Oxford, UK. http://pubmlst.org/
- 49 MLST website hosted at Imperial College, UK. http://www.mlst.net/
- 50 Comprehensive list of all MLST databases. http://pubmlst.org/databases.shtml