Background
Tuberculosis (TB) is found in every population of the world today and kills 1.1–1.6 million people globally each year [
1]. There is also significant geographic variation in the prevalence, incidence, and mortality of TB [
1]. The factors that contribute to individual and geographic variation in TB infection and disease are incompletely understood. An intact immune response is required to prevent infection and progression to active disease, as conditions that weaken the immune system, including HIV co-infection, type II diabetes mellitus, undernutrition, and immunosuppressive medications such as anti-tumor necrosis factor (TNF) therapy, are strongly associated with TB [
2]. Environmental factors likely also play a role in infection and disease progression, including population density, indoor and outdoor air pollution, and health care quality and access [
2]. However, these risk factors are insufficient to explain the current burden of TB [
3].
An additional driver of variation may be human and bacterial genetic variation [
4]. There are human genetic polymorphisms associated with susceptibility to latent TB infection and progression to active disease [
5], as well as polymorphisms in the
Mycobacterium tuberculosis complex (MTBC) associated with the ability to cause disease [
6] and with transmissibility [
7]. The host-pathogen relationship in TB is sympatric [
8], i.e., the host and pathogen tend to share a common ancestral geographic origin [
8]. When patients are infected with an allopatric strain, i.e., a strain that originates from a different geographic region than the patient, they may be at risk for greater pulmonary impairment [
9]. Similarly, there is evidence for associations between human leukocyte antigen (HLA) type and susceptibility to TB disease caused by particular MTBC strains [
10,
11]. However, there is considerable variation in studies that test for associations between MTBC genotypes and clinical characteristics [
12,
13].
A better understanding of MTBC molecular epidemiology could improve our ability to treat and control TB. Genetic data are already being used by epidemiologists as tools for outbreak investigations to identify sources of mycobacterial infection [
14] and as tools in surveillance to identify the strains most likely to spread rapidly through new human populations [
3]. Additionally, understanding the risk factors associated with MTBC genetic data could help direct the development of biomarker-based diagnostic tests for the early identification of patients who are infected with strains associated with higher risk of treatment failure, relapse, drug resistance, or death [
15]. Finally, there is accumulating evidence for variation in the immune response to distinct MTBC strains [
16–21]. Therefore, understanding the global variation in MTBC strains will be important as new vaccines, biomarkers, and host-directed therapies are developed [
13].
The objective of this study was to systematically synthesize all available information on MTBC genotypes in order to (1) map the global distribution of genotypes that cause TB disease and (2) determine whether any epidemiologically relevant clinical characteristics were associated with those genotypes. Previous systematic reviews that mapped MTBC genotype distribution focused on MTBC Beijing family strains and their association with drug resistance [
22,
23]. We expanded on this previous work by considering data for all MTBC lineages, making this the most comprehensive synthesis of MTBC genotypes that has been conducted to date.
Discussion
To our knowledge, this study represents the most comprehensive dataset on MTBC lineages created by systematically assembling genotyping data from studies that used representative sampling techniques. The data show geographic variation in MTBC genotypes, which is consistent with previously published studies that used convenience samples and much smaller datasets. We find some evidence for clinical variation between genotypes, although we also show significant variation between studies, which highlights the need for additional data.
Global variation in bacterial strains that cause TB disease
The results presented in this study are consistent with previously published maps that showed that MTBC strains that evolved more recently in human history—lineage 2, lineage 3, and lineage 4 strains—tend to be more widely distributed around the world [
22,
35,
47,
48]. We also showed that lineage 1, lineage 2, and lineage 3 are more prevalent in Europe and in North and South America than shown in previously published maps [
35,
47,
48]. Moreover, we show that lineage 3 strains may be increasing in prevalence in Europe, while lineage 1 strains may be decreasing in prevalence in West Asia. These patterns in genotype distribution likely reflect both historical and recent movement of strains with people from East Asia and the Indian subcontinent to Europe and the American continent. The dominance of lineage 4 globally, and in particular in South American countries, also supports the hypothesis that European colonialists aided in the dispersion of this lineage in the mid-sixteenth to nineteenth centuries [
32,
48,
49]. If the first inhabitants of the American continent brought early forms of lineage 2 strains with them when they migrated from north-eastern Asia, these strains may have been eliminated with the arrival of strains from European colonialists.
Human migration is likely not the only determinant of MTBC genotype distribution. Lineages 5 and 6 are prevalent only in West Africa [
35,
47,
48]. The reasons for this geographic restriction are largely unknown but may have to do with clinical characteristics of the patients infected with these strains. Patients infected with lineage 6 are more likely than patients infected with other strains to be older, HIV-infected, and severely malnourished [
50]. In addition, we showed that lineage 5 and 6 strains may be less likely to cause transmission chains than lineage 4 strains and that these findings were more consistent in Europe and the Americas than in Africa, which may reflect biological differences and/or social mixing patterns that prevent these strains from spreading through non-West African populations. We also found that lineage 3 strains were associated with reduced risk of transmission chains in Europe and the Americas, which is consistent with the findings from a household contact study in Montreal [
51]. In contrast, we found that Beijing family strains may be more likely to cause transmission chains, which could reflect the ability of Beijing strains to spread quickly through human populations [
46,
52,
53]. These findings are not consistent with previous work that showed no differences between lineages in transmission from household contacts [
46,
54,
55]. Thus, further studies would be required to confirm our findings.
Several studies included in our analysis showed that treatment failure was associated with lineage 2 Beijing family strains [
43,
44]. Beijing family strains are also associated with drug resistance [
56], which has been reviewed previously [
12,
22,
23]. Additionally, lineage 1 strains have been associated with more rapid response to treatment in drug-susceptible TB cases in the USA [
57]. Thus, there is evidence for a relationship between bacterial genotype and treatment outcome, at least in certain populations or contexts. Future studies that carefully control for potential confounders that may impact treatment failure are required to confirm these findings. This type of information could be particularly important to clinicians if it could inform the development of novel diagnostic tools that test for bacterial genotypes associated with poor response to treatment and development of drug resistance.
Variation between studies and implications for variation in MTBC genotypes
There was variation in the sampling methods and representativeness of the studies included in this systematic review. The majority of studies were representative of much smaller geographic locations than the national level, and despite the large number of bacterial isolates included in this study, they represented only a small fraction of the total estimated TB cases. While the goal of this study was to summarize the MTBC genotyping data available, not to make nationally representative estimates, it is important to note that this variation was not distributed evenly throughout the world. There was less information available about MTBC genotype distribution in South America and Sub-Saharan Africa than in other regions, and the data in Central and Eastern Asia represented a smaller proportion of all estimated TB cases than elsewhere. Thus, the genetic diversity shown in the map in Fig.
3 for these regions is likely less representative of the underlying populations.
Another source of variation that may impact representativeness is whether studies were biased towards including either rural or urban populations. There is likely greater MTBC genetic diversity in patients from urban populations than patients from rural areas since urban areas experience higher rates of travel and migration. Most studies included in this analysis did not report the urban/rural composition of their sample, and the bias towards one or the other would likely vary depending on study location. For example, the majority of the studies included in our systematic review used samples collected from public hospitals or reference laboratories. Therefore, in countries such as India, where people in urban areas may be more likely to seek care from private health clinics [
58], the urban population may be underrepresented and we may have underestimated genetic diversity. On the other hand, in countries such as Uganda, where the rural population has limited access to public health facilities [
59], the rural population may be underrepresented and we may have overestimated genetic diversity. This highlights the importance of data from prevalence surveys that use active surveillance techniques to reach a broader subset of the population.
We also identified a significant amount of heterogeneity between studies in the meta-analysis of genetic clustering associated with genotypes. One source of this heterogeneity is likely methodological differences between the studies, such as genotyping method, sampling method, and study duration, which have been shown to impact genetic clustering [
27,
28]. For example, duration of sampling ranged from 2 months to 9 years, and genotyping methods ranged from the use of either spoligotyping or MLVA typing to the use of both methods (Additional file
1: Table S4). Studies that used shorter sampling durations may have missed transmission chains and underestimated clustering, while studies that used spoligotyping only may have overestimated clustering [
60]. An additional source of heterogeneity may be confounders that impact genetic clustering and transmission, such as social mixing, immigration, age structure, comorbidities, and underlying TB incidence [
27,
28]. These confounders likely also varied between these studies but were often not reported. For example, only 14 of the studies reported HIV prevalence (range 0 to 91%), only 6 reported proportion of immigrants (range 0 to 78%), and only 14 reported mean age of patients (range 25 to 50) included in the sample (Additional file
1: Table S4). If social mixing was high in each of the studies, this could have led us to overestimate the impact of genotype on transmission chains, while if migration was high, this could have led us to underestimate the presence of transmission chains.
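The between-study heterogeneity discussed above can be made concrete with a short calculation. The following is a minimal sketch, assuming that each study contributes a log odds ratio for genetic clustering (for example, a given lineage versus lineage 4) together with its variance, and assuming that heterogeneity is summarized with Cochran's Q, the I² statistic, and the DerSimonian-Laird estimate of between-study variance. These are common choices in meta-analysis but are not stated in this section to be the methods used here. The function name `random_effects_summary` and the input values are hypothetical and are not taken from the included studies.

```python
import math


def random_effects_summary(log_ors, variances):
    """Summarize study-level log odds ratios (illustrative sketch only).

    Returns Cochran's Q, its degrees of freedom, I^2 (percent), the
    DerSimonian-Laird tau^2, and the random-effects pooled log odds
    ratio with its standard error.
    """
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    weights = [1.0 / v for v in variances]
    w_sum = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, log_ors)) / w_sum

    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, log_ors))
    df = len(log_ors) - 1

    # I^2: share of total variation attributed to between-study heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # DerSimonian-Laird estimate of the between-study variance tau^2.
    c = w_sum - sum(w * w for w in weights) / w_sum
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    re_sum = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, log_ors)) / re_sum
    se = math.sqrt(1.0 / re_sum)

    return {"q": q, "df": df, "i2": i2, "tau2": tau2,
            "pooled_log_or": pooled, "se": se}


# Hypothetical inputs: two studies with equal precision but different effects.
summary = random_effects_summary([0.0, 2.0], [0.5, 0.5])
```

A high I² in such a calculation indicates only that study estimates differ by more than sampling error would predict. It does not distinguish methodological sources (genotyping method, sampling duration) from unreported confounders, which is why the study-level characteristics listed above matter for interpretation.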
Study limitations
A limitation of this study is that we grouped strains into seven lineages, which masks within-lineage variation. Distinct sub-lineages of the Beijing family are associated with differences in transmissibility in human populations [
61,
62], and lineage 4 contains both geographically widespread and restricted sub-lineages [
49]. However, we consider this the most suitable approach because it allowed us to (1) include a broad range of studies, including those that did not report sub-lineages, and (2) synthesize studies that used WGS- or PCR-based typing together with studies that used methods more common in resource-limited settings, such as spoligotyping and MLVA typing.
Another limitation is that we did not include data from WGS databases. A challenge of incorporating WGS data is identifying study meta-data, such as sampling methods and demographic characteristics of patients, linked with genomes. In addition, many of the available WGS data are well suited to phylogeographic studies and to examining the presence of specific mutations [
32,
49,
56], but are less representative of the populations they are isolated from. These data are often from outbreaks or studies of specific sub-populations, which we excluded in this analysis. As WGS data linked with meta-data become more available (through prevalence surveys [
63] and endeavors such as ReSeqTB), including these data would be an important extension of our study. Our study supports these future studies by illustrating the importance of using genome sequences to determine phylogenetic lineages or sub-lineages. The dataset we have created could be used to fill geographic gaps in future WGS-based maps, particularly in regions where WGS technology is unavailable, and to verify results from convenience-based samples.
Conclusions
The evidence gathered in this systematic review supports a role for bacterial genetic diversity in understanding global variation in TB disease. However, there are aspects of the studies that restrict our ability to confidently attribute clinical characteristics to genotypes. In order to address these limitations in the future, there will need to be a shift in the design of MTBC strain diversity studies such that data are collected in a way that is clinically and epidemiologically informative, wherever possible. We encourage future studies to carefully consider potential confounding variables in study design and analysis and to make all genotypes and study meta-data publicly available upon publication. We also encourage the analysis of less-studied strains from lineages 1 and 3 in order to increase comparability with the relative abundance of data on lineage 2 and lineage 4 strains. The evidence presented in this study demonstrates that these types of data could potentially be used to create tools to inform the clinical diagnosis and treatment of TB and improve our understanding of the epidemiology of this disease.
Acknowledgements
We thank Diana Louden (University of Washington, Seattle, WA) for assistance with the methods employed in the systematic review. We also thank Ian Pollock (Institute for Health Metrics and Evaluation, Seattle, WA) and Emilie Maddison (Institute for Health Metrics and Evaluation, Seattle, WA) for assistance in organizing and indexing the literature collected in this study. We thank Brent Bell (Institute for Health Metrics and Evaluation, Seattle, WA) for assistance with preparing the data for publication, and we thank Nicole Weaver (Institute for Health Metrics and Evaluation, Seattle, WA) and Laurie Marczak (Institute for Health Metrics and Evaluation, Seattle, WA) for editorial assistance.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.