Background
The World Health Organization has shown that the worldwide tuberculosis (TB) epidemic is larger than was previously estimated [1]. Tuberculosis caused an estimated 1.4 million deaths in 2015, with an estimated 10.4 million new cases. TB rates in the United States continue to steadily drop, yet TB rates in the State of Hawaii remain steady [2]. Hawaii experienced an average of 120 incident cases per year from 2006 to 2017, ranging from a low of 114 in 2006 to a high of 136 in 2014. Hawaii currently displays the highest incidence rate of TB in the US, at 8.1 per 100,000 in 2017. Comparing this rate to the median US state rate of 1.8 per 100,000 illustrates the public health burden of TB in Hawaii. Of the 119 incident TB cases in Hawaii in 2016, 100 (84%) were non-US born, well above the national average of 68.5%. Furthermore, of those 100 cases, 69 were in persons born in the Philippines.
Thus, it is not surprising that Hawaii perennially experiences among the highest rates of TB in the United States, given the continuous influx of immigrants from the Western Pacific and Asian regions. As a result of this immigration pattern, TB in Hawaii comprises a unique distribution of genetic lineages relative to the continental United States or Europe, but one similar to that of the United States Affiliated Pacific Islands [3–5]. The Beijing and Manila families of Mycobacterium tuberculosis (Mtb) account for over two-thirds of the TB cases in Hawaii [6, 7]. These families are defined by spoligotyping (reverse-line hybridization of 43 sequences complementary to CRISPR spacers), mycobacterial interspersed repetitive units–variable number of tandem repeats (MIRU-VNTR) patterns, and whole-genome single nucleotide polymorphism (SNP) phylogenies [6–10]. The Manila family has been shown to comprise the majority of Mtb lineage 1 and has spread into the Pacific islands with Filipino migration, while the Beijing family comprises the majority of lineage 2 and is the dominant family in East Asia [10]. In contrast, lineage 4, whose members are the most commonly found among TB cases in Europe and North America, contains a larger set of spoligotyping clades [4].
Potentially long latency periods in tuberculosis cases make molecular epidemiological tools an essential part of its control. IS6110 restriction fragment length polymorphism (RFLP) typing historically represented the “gold standard” for Mtb genotyping [11]. However, IS6110 typing is time-consuming and labor-intensive, and provides limited resolving power for clusters composed of isolates with low IS6110 copy numbers [12, 13]. Two other methods, spoligotyping and MIRU-VNTR fingerprinting, are currently the standard employed by the Centers for Disease Control and Prevention (CDC) in the United States [8, 12, 14]. However, these fingerprinting methods still perform poorly when used to identify actual transmission. One study conducted in the English Midlands found that the positive predictive value (PPV) of identical MIRU-VNTR fingerprints for actual recent transmission between two cases was only 18.6% [15]. The same study found that this PPV varied by lineage, with lineage 4 displaying a PPV of 30.6%, while lineage 1 displayed a PPV of only 8.0% and lineage 2 only 13.8%. Earlier work has also demonstrated that these genetic fingerprinting methods perform poorly for identifying actual transmission of Beijing family isolates, showing that MIRU-VNTR fingerprinting is superior to IS6110 when only lineage 4 isolates are being typed, but performs poorly when Beijing family isolates are being typed [16]. Multiple other studies have also indicated that 12-loci MIRU-VNTR is insufficient to resolve suspected Beijing family clusters, and that 24-loci MIRU-VNTR is similarly ineffective when Beijing family isolates are present [17–19]. However, similar studies are not available for the Manila family.
Attempts to optimize VNTR typing for the Beijing family have been proposed and implemented through the switch from 12-loci to 24-loci typing but, as we further demonstrate in this study, have failed to produce a comprehensively effective solution [12, 20]. The need for effective epidemiological tracking of the Beijing family is highlighted by this family’s association with drug resistance. The population structure of TB in areas of high drug resistance has been shown to be rapidly shifting towards the Beijing family, which has a significantly higher rate of developing rifampin resistance [21]. Alarmingly, the Beijing family has been shown to manifest increased transmission fitness relative to a non-Beijing lineage while streptomycin resistant [22]. However, limited research has been performed on the Manila family, despite its dominance in Hawaii and the Philippines, and despite the prediction that rates of multidrug-resistant (MDR) TB in the Philippines will continue to increase [23].
As a result of the predominance of these two Mtb families in Hawaii and the Pacific, identifying autochthonous Mtb transmission is especially difficult both in Hawaii and throughout the Western Pacific Region. Although extensive TB screening is implemented in Hawaii (including requiring tuberculin skin tests prior to enrolment in education or prior to employment as a food-handler), frequent travel of Hawaii residents to visit family in high-incidence areas throughout the Pacific, combined with insufficient existing molecular fingerprinting methods, prevents TB controllers in Hawaii from developing a comprehensive understanding of local TB transmission. In this study, we examined the ability of CDC-standard genetic fingerprinting for Mtb (spoligotyping plus 24-loci MIRU-VNTR fingerprinting) to identify Beijing and Manila family transmission clusters, and attempted to identify why its genotyping resolution is lower for these families than for lineage 4. We previously observed that the Beijing and Manila families demonstrated lower allelic Shannon evenness at most MIRU-VNTR loci [J.T. Douglas unpublished data]. The Shannon diversity index is a measurement of diversity in a community that considers both richness (in our case, the total number of alleles at each MIRU-VNTR locus) and evenness (the relative abundance of each of those alleles). Our study uses this measurement to determine whether certain Mtb genetic lineages are dominated by specific alleles (indicated by reduced Shannon evenness values) at any MIRU-VNTR loci, which may explain why MIRU-VNTR performs poorly when used for molecular epidemiology on these lineages. Here, we used a dataset of all fully fingerprinted Mtb isolates recorded in Hawaii from 2002 through 2016 to further investigate this apparent cause of MIRU-VNTR’s poor ability to resolve apparent lineage 1 and 2 clusters relative to its considerably greater ability for lineage 4 clusters.
Our previous cooperation with the State of Hawaii Department of Health Tuberculosis Control Branch revealed that the CDC’s standard Mtb fingerprinting methodology was of limited epidemiological use for Hawaii’s TB clinicians. Large numbers of epidemiologically unrelated Beijing and Manila family isolates frequently shared identical fingerprints, and nearly all suspected transmission clusters also fingerprinted identically within those suspected clusters, preventing fingerprinting results from being a useful tool for confirming or disproving suspected transmission events.
Whole genome sequencing (WGS) has been shown to be able to identify specific transmission chains within fingerprinting clusters [24]. Advances in next-generation sequencing have resulted in the cost of WGS decreasing to the point where it is feasible for many laboratories to sequence most or all clustered isolates [25]. WGS is increasingly being employed for tuberculosis epidemiology, including identifying the transmission chains of a TB outbreak in British Columbia, Canada, verifying contact investigation-based links in an outbreak in San Francisco, California, and use in a large, retrospective observational study in the UK Midlands [26–28]. For this study, we selected 19 apparent TB transmission clusters that were identified by fingerprinting or epidemiological data in Hawaii from 2003 to 2017 and conducted Illumina whole genome sequencing to determine if WGS could be used to further resolve these clusters and identify the transmission connections among isolates.
Making full use of the resulting WGS dataset, we further examined isolates from clusters that WGS identified to represent actual transmission events and investigated which genes or regions were developing mutations that differentiated individual isolates in a cluster. Our previous work has identified virulence factor mutations in the Beijing and Manila families that may be involved in virulence or latency, and this work seeks to help us further characterize these historically under-studied families [29, 30].
Methods
Identification of clusters for WGS
Records of all genotyped tuberculosis cases processed by the Hawaii State Department of Health Tuberculosis Control Program from 2004 to 2016, as well as partial data from 2002, 2003, and 2017, were analyzed to identify fingerprinting clusters that possibly represented actual transmission clusters. One thousand sixty-one isolate records were available for analysis. Names were assigned to spoligotypes using the SpolDB4 database [31]. Genetic fingerprints, dates and locations, patient histories, and nursing contact investigation records were all considered in the selection of these clusters. Four large historic Mtb fingerprinting clusters in Hawaii were selected for investigation (Table 1).
Table 1
Sequenced Mtb Fingerprinting or Epidemiological Clusters
Large Clusters Identified by Identical Genetic Fingerprints |
Manila Cluster 1 | Manila | 23 | 3 | 178 | 73–148 | No |
Manila Cluster 2 | Manila | 24 | 2 | 161 | – | No |
Beijing Cluster 1 | Beijing | 11 | 4 | 63 | 0–52 | Partial |
Beijing Cluster 2 | Beijing | 7 | 7 | 0 | – | Yes |
Clusters Identified by Shared Uncommon Spoligotypes |
Manila-like Cluster 1 | Manila | 2 | 2 | 3 | – | Yes |
Manila-like Cluster 2 | Manila | 2 | 2 | 4 | – | Yes |
Beijing Cluster 5 | Beijing | 2 | 2 | 1 | – | Yes |
Manila-like Cluster 3 | Manila-like | 3 | 3 | 3 | 1–3 | Yes |
H3 Cluster 1 | H3 | 3 (1) | 3 | 1230 | 3–1230 | Partial |
Epidemiologically Identified Putative Clusters |
Manila Cluster 3 | Manila | 2 | 2 | 90 | – | No |
Manila Cluster 4 | Manila | 2 | 2 | 0 | – | Yes |
Manila Cluster 5 | Manila | 2 | 2 | 192 | – | No |
Manila Cluster 6 | Manila | 2 | 2 | 229 | – | No |
Beijing Cluster 3 | Beijing | 2 | 2 | 3 | – | Yes |
Mixed Cluster 2 | U/Beijing | 2 | 2 | 1153 | – | No |
U Cluster 1 | U | 4 | 4 | 131 | 1–117 | Partial |
Mixed Cluster 1 | Beijing/Manila | 3 | 3 | 1762 | 1–1762 | Partial |
Manila Cluster 7 | Manila | 2 | 2 | 142 | – | No |
Beijing Cluster 4 | Beijing | 2 | 2 | 0 | – | Yes |
We hypothesized that these large fingerprinting clusters did not represent actual transmission clusters, given their relatively high number of cases, geographic distribution throughout the state, and chronological diversity. We therefore selected five further clusters with spoligotypes that were less common in Hawaii, and thus had a greater suspected likelihood of being transmission-derived: two clusters with “Manila-like” patterns, one cluster with an uncommon Beijing family pattern (000000000003751 versus the common 000000000003771), one cluster with no spoligotype match in SpolDB4, and one H3 cluster (H3 is common globally, but uncommon in Hawaii). Nineteen isolates were selected for WGS from those clusters in order to maximize chronological diversity for the largest clusters and to fully sequence the smaller clusters.
We further worked with staff at the State of Hawaii Tuberculosis Control Program’s Lanakila Tuberculosis Clinic, including doctors, nurses, and Tuberculosis Epidemiological Studies Consortium (TBESC) staff, to identify 17 epidemiologically-derived possible transmission clusters, of which ten had two or more isolates sent to CDC-contracted laboratories for genetic fingerprinting (Table 1). Twenty-one isolates from these clusters were selected for sequencing.
Recall of state of Hawaii Mtb isolates
Twenty isolates were requested from the Michigan Department of Community Health, where they had been previously sent by the State of Hawaii for contracted fingerprinting, and where they had been archived. We received extracted DNA from those isolates. Sixty-one isolates were sent from the California Department of Public Health State Laboratory as “double-killed” sample preps using a treatment of immersion in 70% ethanol followed by heating at 80 °C for 1 h.
DNA extraction and whole genome sequencing
DNA extraction was performed as previously described by the National Institute of Public Health and Environmental Protection (RIVM), Bilthoven, The Netherlands (Isolation of Genomic DNA from Mycobacteria Protocol), or according to the source state laboratory’s standard protocol. In brief, Mtb cultures were harvested and lysed with lysozyme followed by a SDS/proteinase K mix. Non-nucleic acid cell debris was precipitated using a CTAB/NaCl solution and removed using a chloroform/isoamyl alcohol extraction. Finally, DNA was precipitated using isopropanol. DNA was quantified with the Qubit 2.0 dsDNA Broad Range Assay. Isolate libraries were prepared using the Illumina Nextera XT DNA Library Kit using manual normalization and sequenced on the Illumina MiSeq Platform with v3 Chemistry for 300 bp paired-end reads.
Data analysis
SNP matrices were produced using a modification of the NASP pipeline [32], with Bowtie2 used for alignment [33] and GATK used for SNP calling [34]. SNPs were filtered for ten-fold read coverage and 75% read consensus as previously described [28]. Repetitive regions were removed by the NASP pipeline, which uses MUMmer to perform a self-self comparison with a minimum match length of 20. When two compared isolates presented with 30 SNPs or fewer between them during analysis of the pipeline output, those differentiating SNP loci were compared against their alignment’s annotated scaffold genomes in NCBI GenBank (https://www.ncbi.nlm.nih.gov/nuccore/) to identify and discard any SNPs in repetitive regions that were not automatically excluded by the NASP pipeline. Relatedness of isolates was determined by the method developed by Walker et al. [28], with the stepwise 95% prediction interval from the mean rate of change between their paired isolates being used as our baseline. Identification of SNPs among members of clusters, or between isolate pairs, was performed by importing the SNP matrices produced by the NASP pipeline into a database and performing custom SQL queries. Minimum spanning trees were produced for selected clusters with PHYLOViZ 2.0 using goeBURST Full MST [35, 36].
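The coverage and consensus filter described above, together with a pairwise SNP count, can be sketched as follows. This is a minimal illustration only, not the NASP pipeline or the authors' SQL queries: the `Call` record, the dictionary layout, and the function names are hypothetical, and only the two thresholds (ten-fold coverage, 75% consensus) come from the text.

```python
from dataclasses import dataclass
from itertools import combinations

MIN_COVERAGE = 10      # ten-fold read coverage, as stated in the text
MIN_CONSENSUS = 0.75   # 75% read consensus, as stated in the text


@dataclass(frozen=True)
class Call:
    """Hypothetical record: one base call for one isolate at one position."""
    base: str        # called base, e.g. "A"
    depth: int       # total reads covering the position
    supporting: int  # reads supporting the called base


def passes_filter(call):
    """Return True if the call meets both the coverage and consensus cut-offs."""
    if call.depth < MIN_COVERAGE:
        return False
    return call.supporting / call.depth >= MIN_CONSENSUS


def pairwise_snp_distance(calls_a, calls_b):
    """Count shared positions where both calls pass the filter and the bases differ."""
    distance = 0
    for pos in calls_a.keys() & calls_b.keys():
        a, b = calls_a[pos], calls_b[pos]
        if passes_filter(a) and passes_filter(b) and a.base != b.base:
            distance += 1
    return distance


def distance_matrix(isolates):
    """Map each unordered isolate pair to its SNP distance."""
    return {
        (x, y): pairwise_snp_distance(isolates[x], isolates[y])
        for x, y in combinations(sorted(isolates), 2)
    }
```

Positions where either isolate fails the filter are skipped rather than counted as differences, which is one possible design choice; a real pipeline would also exclude repetitive regions before this step.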
Analysis of the resolving capability of CDC-standard Mtb fingerprinting (spoligotyping plus 24-loci MIRU-VNTR typing) was conducted on all 562 fully fingerprinted Mtb isolates recorded in the State of Hawaii from 2002 through 2016. Only isolates with the “EAI2_MANILLA” designation in SpolDB4 were utilized as “lineage 1” isolates, as we have previously shown that other spoligotypes with “EAI” designations can span diverse evolutionary lineages [37]. All isolates with “BEIJING” or “BEIJING-LIKE” spoligotypes were placed in “lineage 2.” All isolates with LAM, H, S, T, U, and X spoligotypes were grouped into “lineage 4.” MIRU-VNTR loci were individually analyzed using the Shannon diversity index. Evenness of allelic distribution at each locus was calculated by dividing the Shannon diversity index by the maximum possible Shannon diversity index for that locus, assuming that all alleles could possibly be observed at each locus. Statistical significance of differences between the mean diversity indices across all 24 MIRU-VNTR loci was calculated in Microsoft Excel using the t-Test: Two-Sample Assuming Unequal Variances, with p-values < 0.05 being considered significant. Sensitivity and specificity for genetic fingerprinting were calculated using VassarStats Clinical Calculator 1 [38].
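The per-locus diversity and evenness calculation can be sketched as below. This is an illustrative reading of the description above, not the authors' spreadsheet: the function names are hypothetical, the natural logarithm is an assumption (the choice of base cancels in the evenness ratio), and the maximum possible index is taken to be the logarithm of the number of alleles that could be observed at the locus, supplied here as the assumed parameter `n_possible_alleles`.

```python
import math
from collections import Counter


def shannon_index(alleles):
    """Shannon diversity H = -sum(p_i * ln p_i) over observed alleles at one locus."""
    if not alleles:
        raise ValueError("at least one allele observation is required")
    counts = Counter(alleles)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def shannon_evenness(alleles, n_possible_alleles):
    """Evenness = H / H_max, with H_max = ln(number of possible alleles).

    n_possible_alleles is an assumed input: how many distinct repeat counts
    could be observed at this locus.
    """
    if n_possible_alleles < 2:
        raise ValueError("evenness is undefined for fewer than two possible alleles")
    return shannon_index(alleles) / math.log(n_possible_alleles)
```

Under these assumptions, a locus where nine of ten isolates share one repeat count scores far lower evenness than a locus where the alleles are equally common, which is the pattern of single-allele dominance the text describes.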
Isolates discussed in this paper are identified by their one or two-digit University of Hawaii DNA extraction number. Gene names are presented as annotated in their respective genomes hosted in GenBank (see accession numbers below).
Discussion
This work demonstrated that established standard molecular fingerprinting methods for Mtb (spoligotyping plus 24-loci MIRU-VNTR typing) are insufficient for epidemiological investigation of TB in Hawaii. Our study is not alone in such findings. One study that utilized 1,999 consecutive Mtb isolates processed by a laboratory in the English Midlands from 2012 to 2015 found that the performance of MIRU-VNTR profiles for identifying genomic relatedness in Mtb differed by lineage [15]. Notably, when they modeled the number of SNPs between paired isolates assuming a linear relationship over 1–3 MIRU-VNTR locus differences, they found that while paired lineage 4 isolates with identical MIRU-VNTR profiles displayed a median of 10 SNPs, lineages 1 and 2 displayed 122 and 159 SNPs, respectively. However, this study also showed that the number of pairwise SNPs between isolates was significantly higher when one or both isolates were from a recent immigrant, suggesting that the study’s specific conclusions partially represented trends in domestic versus foreign transmission associated with different lineages. Regardless, it further illustrates the necessity of WGS over MIRU-VNTR for investigation of Mtb transmission.
With WGS serving as our “gold standard,” we demonstrated the specificity of CDC-standard fingerprinting (spoligotyping plus MIRU-VNTR) in our geographic region with high levels of Beijing and Manila family Mtb to be only 28.6% (Additional file 5). Such a low specificity gives clinicians and epidemiologists very little confidence that a purported transmission cluster identified by standard fingerprinting represents an actual transmission cluster. Note that these data are not intended to propose that WGS be considered the gold standard for Mtb epidemiological analysis; rather, they are intended to illustrate how high prevalence of certain Mtb families exposes shortcomings in presently-employed Mtb genetic fingerprinting methods. However, although IS6110 has previously been considered the “gold standard” for Mtb molecular epidemiology, isolates with as many as 130 SNPs between them have been shown to have identical IS6110 fingerprints, supporting the view that WGS has become the de facto “gold standard” for Mtb molecular epidemiology [39, 40].
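For readers unfamiliar with how such a specificity figure is derived, the sketch below computes sensitivity, specificity, and positive predictive value from paired determinations, treating the WGS result as the reference. It is illustrative only: the function name and input layout are hypothetical, the unit of comparison (cluster or isolate pair) is left open, and no counts from this study are reproduced.

```python
def diagnostic_metrics(results):
    """Compute test metrics from (fingerprint_match, wgs_linked) boolean pairs.

    fingerprint_match: the fingerprinting method called the unit a cluster.
    wgs_linked: the reference (WGS) determination of actual transmission.
    Returns None for any metric whose denominator is zero.
    """
    tp = fp = tn = fn = 0
    for fingerprint_match, wgs_linked in results:
        if fingerprint_match and wgs_linked:
            tp += 1
        elif fingerprint_match and not wgs_linked:
            fp += 1
        elif not fingerprint_match and wgs_linked:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # linked units correctly matched
        "specificity": ratio(tn, tn + fp),  # unlinked units correctly separated
        "ppv": ratio(tp, tp + fp),          # matched units that are truly linked
    }
```

Specificity falls whenever unrelated isolates share a fingerprint, because each such unit adds a false positive to the denominator without adding a true negative.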
Our previous work illustrated that even with the full set of 24 MIRU-VNTR loci, potential Beijing and Manila family transmission clusters are poorly resolved by this method of fingerprinting [29]. Here, we identified that MIRU-VNTR’s lack of resolving ability results from the Beijing and Manila families both having a greater number of loci dominated by either one allele or a small set of alleles than lineage 4 has. While the Shannon diversity index itself does not indicate how much of its diversity is derived from allelic richness versus allelic evenness, evenness can be easily calculated using values from the Shannon diversity index. Figure 3 shows that most of the reduction in Shannon diversity demonstrated by the Beijing and Manila families is due to a decrease in allelic evenness rather than a decrease in allelic richness. However, it should be noted that lineage 4 contains multiple major clades, compared to one clade each for lineages 1 and 2, and thus higher allelic evenness should generally be expected from lineage 4 overall. Regardless, this work illustrates why MIRU-VNTR fingerprinting is less effective at identifying actual transmission when applied to Beijing and Manila family isolates.
These data help demonstrate why CDC-standard molecular fingerprinting of Mtb is insufficient for areas of the world where the Beijing and Manila families are dominant. Thus, this study investigated in detail the ability of whole genome sequencing-based analysis to compensate for MIRU-VNTR’s shortcoming by resolving fingerprinting-derived clusters from those two families in order to identify actual transmission.
Combining epidemiology with whole genome sequencing for cluster resolution
Of the 19 possible transmission clusters we investigated, definitive verdicts of recent transmission, partial transmission, or non-transmission were reached for all clusters. Epidemiological investigation was used to further strengthen or disprove the determinations of transmission or non-transmission. Although WGS analysis was able to disprove the apparent transmission that was initially suspected based on epidemiological connections for several apparent clusters, there were no cases where epidemiological information was sufficient to call WGS-derived transmission determinations into question.
Genes containing cluster-informative SNPs
In order to explore which genes could be experiencing rapid mutation and producing the SNPs that distinguished isolates within individual transmission clusters, isolates from those clusters were aligned against GenBank genome CP003248.2, which was selected due to its manually-curated annotation at TubercuList. These informative SNPs that distinguished isolates in actual transmission clusters are contained in a broad range of genes (Additional file 7). The genes identified in this study differ from those identified by a previous study examining an outbreak in San Francisco with the H1 spoligotype [27]. The genes where intra-cluster SNPs were located did not appear to demonstrate any lineage association, and included an ATPase, an ABC transporter membrane protein, a PHOH-like protein PhoH2 phosphate starvation-inducible protein, a PSIH-like sequence-specific RNA helicase, an RNAse, and several hypothetical proteins, among others [41, 42].
The selection of cut-off points for a SNP’s required read coverage and read consensus (allele frequency) is of interest for developing a system for applied WGS epidemiology. Previous studies have required either 75% read consensus, or 10x read coverage and 80% read consensus, and found mutation rates of ~0.5 and 0.4 SNPs per genome per year [16, 28]. At the extreme ends of the range proposed by Walker et al. for identifying transmission-linked or possibly linked isolates (0–1 SNPs and 6–12 SNPs), this information may suggest to tuberculosis controllers whether two isolates were likely the result of recent, direct transmission, or whether the transmission occurred in the more distant past (allowing time for divergent accumulation of SNPs in each infection) or through an intermediate host [28]. However, with several transmission clusters investigated in this work displaying 3–4 SNPs distinguishing their isolates, we cannot determine whether they represent direct transmission, only that they represent recent transmission.
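As a simplified illustration of how pairwise SNP distances map onto these categories, the sketch below applies fixed cut-offs. It is an assumption-laden stand-in for the prediction-interval method used in this study: the upper bound of 12 SNPs for possibly linked isolates follows the range quoted above, while the boundary of 5 SNPs for linked isolates is inferred from that range and is an assumption, as are the function and constant names.

```python
LINKED_MAX_SNPS = 5     # assumed boundary, inferred from the 6-12 "possibly linked" range
POSSIBLE_MAX_SNPS = 12  # upper end of the "possibly linked" range quoted in the text


def classify_pair(snp_distance):
    """Assign a pairwise SNP distance to a coarse transmission category."""
    if snp_distance < 0:
        raise ValueError("SNP distance cannot be negative")
    if snp_distance <= LINKED_MAX_SNPS:
        return "linked"
    if snp_distance <= POSSIBLE_MAX_SNPS:
        return "possibly linked"
    return "not linked"
```

A pair separated by 3 or 4 SNPs falls in the linked category under these cut-offs, but the category alone does not say whether transmission was direct or passed through an intermediate host.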