Background
Mycobacterium tuberculosis (Mtb) transmission often occurs within a household or small community because prolonged duration of contact is typically required for infection to occur, creating the potential for localised clusters to develop [1]. However, geospatial TB clusters are not always due to ongoing person-to-person transmission but may also result from reactivation of latent infection in a group of people with shared risk factors [1, 2]. Spatial analysis and identification of areas with high TB rates (clusters), followed by characterisation of the drivers of the dynamics in these clusters, have been promoted for targeted TB control and intensified use of existing TB control tools [3, 4].
TB differs from other infectious diseases in several ways that are likely to influence apparent spatial clustering. For example, its long latency and prolonged infectious period allow for significant population mobility between serial cases [5]. Thus, Mtb infection acquired in a given location may progress to TB disease in an entirely different region, such that clustering of cases may not necessarily indicate intense transmission but could rather reflect aggregation of population groups at higher risk of disease, such as migrants [6]. Similarly, Mtb infection acquired from workplaces and other congregate settings can be wrongly attributed to residential exposure, as only an individual’s residence information is typically recorded on TB surveillance documents in many settings [7, 8].
Identifying heterogeneity in the spatial distribution of TB cases and characterising its drivers can help to inform targeted public health responses, making it an attractive approach [9]. However, there are practical challenges in appropriate interpretation of spatial clusters of TB. Of particular importance is that the observed spatial pattern of TB may be affected by factors other than genuine TB transmission or reactivation, including the type and resolution of data and the spatial analysis methods used [10]. For instance, use of incidence data versus notification data could give considerably different spatial patterns [11], as the latter misses a large number of TB cases and could be skewed towards areas with better access to health care in high-burden settings [12, 13]. Thus, spatial analysis using notification data alone in such settings could result in misleading conclusions.
Similarly, the type of model used and the spatial unit of data analysis are important determinants of the patterns identified and their associations [14–16]. That is, different spatial resolutions could lead to markedly different results for the same dataset regardless of the true extent of spatial correlation [15, 17, 18] and the effect observed at a regional level may not hold at the individual level (an effect known as the ecological fallacy) [19]. Therefore, we aimed to review methodological approaches used in the spatial analysis of TB burden. We also considered how common issues in data interpretation were managed, including sparse data, false-positive identification of clustering and undetected cases.
Discussion
While a range of methodologies has been employed in divergent contexts, we found that essentially all geospatial studies of TB have demonstrated significant heterogeneity in spatial distribution. Spatial analysis was applied to improve understanding of a range of TB-related issues, including the distribution and determinants of TB, the mechanisms driving the local TB epidemiology, the effect of interventions and the barriers to TB service uptake. Recently, geospatial methods have been combined with genotypic clustering techniques to understand the drivers of local TB epidemiology, although most such studies remain limited to low-endemic settings.
In almost all reviewed studies, retrospective program data (notifications) were used. Notification data, especially from resource-scarce settings, suffer from the often large proportion of undetected cases and are heavily dependent on the availability of diagnostic facilities [12]. None of the spatial studies of TB that used notification data accounted for undetected cases, such that the patterns in the spatial distribution and clustering could be heavily influenced by case detection performance [11]. Thus, the true incidence pattern has rarely been distinguished from the detection pattern, despite the importance of this distinction for interpretation.
The problems of undetected cases could be compounded in the spatial analysis of drug-resistant forms of TB, especially in resource-scarce settings where testing for drug-resistant TB is often additionally conditional on the individual’s risk factors for drug resistance [75]. However, recently, there have been some attempts to account for under-detection in the spatial analysis of TB. A Bayesian geospatial modelling approach presented a framework to estimate TB incidence and case detection rate for any spatial unit and identified previously unreported spatial areas of high burden [11]. Another approach is to estimate incidence using methods such as capture-recapture [76, 77] and mathematical modelling [78]. If the case detection rate is truly known for a defined region, incidence can be calculated as notifications divided by the case detection rate, although this is rarely if ever the case. Spatial analysis using prevalence data could also be considered in areas where such data are available.
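The relationship between notifications, case detection rate and incidence described above can be illustrated with a minimal sketch. The district names, notification counts and detection rates below are hypothetical values chosen only for illustration, and the function name `estimate_incidence` is our own; in practice the case detection rate is itself uncertain and rarely known at a local level.

```python
def estimate_incidence(notifications, case_detection_rate):
    """Estimate incident cases as notifications / case detection rate."""
    if not 0 < case_detection_rate <= 1:
        raise ValueError("case detection rate must lie in (0, 1]")
    return notifications / case_detection_rate


# Hypothetical districts: (notified cases per year, assumed case detection rate).
districts = {"A": (120, 0.80), "B": (90, 0.45)}

estimated = {
    name: estimate_incidence(notified, cdr)
    for name, (notified, cdr) in districts.items()
}

for name, value in estimated.items():
    notified = districts[name][0]
    print(f"District {name}: notified {notified}, estimated incident cases {value:.0f}")
```

In this illustration, district B notifies fewer cases than district A but has the higher estimated burden once its lower assumed detection rate is taken into account, which is the sense in which a map of notifications can reflect the detection pattern rather than the incidence pattern.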
In relation to the data problems outlined above, spatial analysis of TB could benefit from the use of model-based geostatistics, which is commonly used in other infectious diseases [79], although there are few studies that consider Mtb [80]. In particular, measuring TB prevalence at multiple locations is impractical for logistical reasons. Model-based geostatistics can instead be used to predict disease prevalence in unsampled areas from prevalence values at nearby locations, at low or no additional cost, producing smooth continuous surface estimates.
Mapping of notification rates was the most commonly used data visualisation technique, in which TB cases were categorised at a particular administrative spatial level. This approach has the advantage of easy interpretability, although it can introduce bias because the size of the regions and the locations of their boundaries typically reflect administrative requirements, which may not reflect the spatial distribution of epidemiological factors [19, 22]. In addition, patterns observed across regions may depend on the spatial scale chosen, an effect known as the modifiable areal unit problem (MAUP) [17]. Because the choice of spatial scale mainly depends on the limitations of available data [81], only one study was able to provide a systematic evaluation of the effect of scale on spatial patterns, demonstrating improved performance of Kulldorff’s spatial scan statistic method at a high geographic resolution [25]. Different spatial resolutions could lead to markedly different results for the same dataset regardless of the true extent of correlation, due to averaging (aggregation effect) or other spatial processes operating at different scales [15, 17, 18]. Assessing the presence of this effect should be a priority for future studies using aggregated data in spatial TB studies.
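The dependence of observed patterns on how areal units are drawn can be shown with a small sketch. It uses an entirely hypothetical 4 x 4 grid of case counts with equal populations, and it illustrates only the zoning (boundary placement) facet of the MAUP at a fixed number of zones; the function name `zone_rates` and all values are assumptions made for illustration.

```python
def zone_rates(cases, population, zone_of):
    """Aggregate cell counts into zones; return cases per 100,000 for each zone."""
    totals = {}
    for r, row in enumerate(cases):
        for c, count in enumerate(row):
            zone = zone_of(r, c)
            zone_cases, zone_pop = totals.get(zone, (0, 0))
            totals[zone] = (zone_cases + count, zone_pop + population[r][c])
    return {zone: 1e5 * zc / zp for zone, (zc, zp) in totals.items()}


# Hypothetical 4 x 4 grid of cells; rows run north to south, columns west to east.
cases = [
    [8, 7, 1, 0],
    [9, 6, 0, 1],
    [7, 8, 1, 1],
    [6, 9, 0, 0],
]
population = [[1000] * 4 for _ in range(4)]

# Two alternative ways of drawing two zones over the same underlying cells.
east_west = zone_rates(cases, population, lambda r, c: "west" if c < 2 else "east")
north_south = zone_rates(cases, population, lambda r, c: "north" if r < 2 else "south")

print("East/west zoning:  ", east_west)
print("North/south zoning:", north_south)
```

With these made-up counts, the east/west zoning shows a large difference in rates between zones, whereas the north/south zoning of the same cells shows none, so a comparison of several zonings or resolutions is one simple way of assessing whether a reported pattern is sensitive to the choice of unit.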
Bayesian smoothing techniques can mitigate the problems of stochastically unstable rates from areas with small populations [81], although such techniques were not widely used in the included studies and so false spatial clustering remains an important consideration. The less frequent use of rate smoothing techniques in the spatial analysis of TB could have various explanations, including a lack of software packages that are easily accessible to a wider range of users (although GeoDa spatial software currently provides an accessible platform to people with limited statistical or mathematical backgrounds [82]). It may also be that most spatial analyses of TB are based on data aggregated over larger geographic areas from several years, such that stochastic instability may not be a major problem, although this was not explicitly discussed in the included studies.
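To make the idea of rate smoothing concrete, the following is a minimal sketch of one common variant, global empirical Bayes smoothing with method-of-moments estimates, in which each area's raw rate is shrunk towards the overall rate by an amount that depends on its population. This is an illustrative choice rather than the method of any particular study or software package; the case counts and populations are hypothetical, and the function name `eb_smooth` is our own.

```python
def eb_smooth(cases, population):
    """Global empirical Bayes rate smoothing using method-of-moments estimates."""
    total_pop = sum(population)
    global_rate = sum(cases) / total_pop
    if global_rate == 0:
        return [0.0] * len(cases)
    raw = [y / n for y, n in zip(cases, population)]

    # Population-weighted variance of the raw rates around the global rate.
    weighted_var = sum(
        n * (r - global_rate) ** 2 for r, n in zip(raw, population)
    ) / total_pop
    mean_pop = total_pop / len(population)
    # Estimated between-area variance, truncated at zero.
    prior_var = max(weighted_var - global_rate / mean_pop, 0.0)

    smoothed = []
    for r, n in zip(raw, population):
        sampling_var = global_rate / n
        # Shrinkage weight: close to 1 for large populations, smaller for small ones.
        weight = prior_var / (prior_var + sampling_var)
        smoothed.append(weight * r + (1 - weight) * global_rate)
    return smoothed


# Hypothetical areas: two with very small populations, three with large ones.
cases = [3, 0, 40, 120, 15]
population = [500, 400, 20000, 30000, 25000]

raw_rates = [y / n for y, n in zip(cases, population)]
smoothed_rates = eb_smooth(cases, population)

for n, raw, smooth in zip(population, raw_rates, smoothed_rates):
    print(f"population {n:>6}: raw {1e5 * raw:7.1f}, smoothed {1e5 * smooth:7.1f} per 100,000")
```

In this sketch the rates of the two small areas move substantially towards the overall rate, while those of the large areas change little, which is the behaviour that reduces the chance of a small-population area being flagged as a spurious hotspot.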
In all studies that applied spatial cluster identification tools, TB cases were clustered irrespective of whether the setting was low or high endemic. However, in studies that incorporated more than one cluster identification method, areas identified as hotspots were not identical, with the extent of agreement between the alternative methods highly variable. This could be partly attributable to different methods testing separate hypotheses, such that these results may correctly support one hypothesis while refuting another. However, there is no consensus on how to interpret these findings appropriately and consistently [82, 83], and method selection did not typically appear to be based on such considerations [84, 85]. Thus, caution is required when considering interventions based on clusters identified with one method only, as is frequently undertaken in TB spatial analysis [22].
Use of multiple cluster detection methods and requiring their overlap to represent a truly high-risk area is increasingly recommended [82, 84, 86]. However, this approach could also increase the risk of false-positive spatial clustering when different methods are used serially until significant clusters are observed [85]. Sensitivity analysis of spatial clustering [87, 88] and cluster validation using geostatistical simulations [23, 89, 90] can help identify robust clusters. While methods that adjust for confounding are generally preferred [91], further investigative strategies including data collection and cluster surveillance are required to validate an observed spatial cluster before introducing interventions [84, 85]. Although the focus of this study is TB, several methodological considerations outlined here would remain true for many infectious diseases.
In several studies, presence of spatial clustering or spatial autocorrelation in TB distribution was considered to reflect ongoing TB transmission, while its absence was taken to indicate reactivation [58]. Recently, molecular techniques have been combined with geospatial methods to understand the drivers of local TB epidemiology, although findings from these studies vary by country and the subset of the population studied. While spatial clustering of genotypically related cases was reported in several studies and likely reflected intense local TB transmission [61, 65], spatial clusters were dominated by genotypically unique strains in some studies, implying that reactivation was the dominant process [47, 72]. Hence, the combination of genotypic and geospatial techniques can improve understanding of the relative contribution of reactivation and transmission and other local contributors to burden.
Notwithstanding the general principles outlined above, not all spatial clusters of genotypically related cases will necessarily result from recent transmission, as simultaneous reactivation of remotely acquired infection and limited genetic variation in the pathogen population can also lead to genotypic similarity of spatially clustered cases [2, 92]. In some studies, the time between the first and last diagnosis of the cases in the genetic cluster ranged from 1 to more than 8 years [1, 72], suggesting that genotypic clustering could occur from spatially clustered reactivation. Similarly, limited spatial aggregation of genotypically clustered cases [72, 93, 94] and lack of epidemiological links between genotypically clustered cases in some studies may reflect migration of the human population over the extended time scale over which TB clusters occur [95], although casual transmission creating spatially diffuse clusters is an alternative explanation.
The extent of genotypic similarity between cases also depends on the discriminatory power of the genotyping method and the diversity of the pathogen population. Compared to whole genome sequencing, standard molecular genotyping (spoligotyping, MIRU-VNTR and IS6110) methods generally overestimate TB transmission with a false-positive clustering rate of 25 to 75% based on strain prevalence in the background population [92, 96]. The accuracy of these tests in distinguishing ongoing transmission from genetically closely related strains is very low among immigrants from high TB incidence settings with limited pathogen diversity [92, 97]. Thus, care should be taken when interpreting the genotypic similarity of cases among immigrant groups, as independent importation of closely related strains is possible. The frequent finding of more extensive genotypic than spatial clusters [71, 94] may reflect overestimation by the genotypic methods [98]. On the other hand, TB transmission might not result in apparent spatial clustering due to reasons that include population movement, poor surveillance and unmeasured confounding.
Regression models used for spatial analysis of TB were either conventional regression models or models that incorporated spatial effects. Although the former was more commonly employed, the majority of models incorporating spatial effects confirmed that accounting for spatial correlation improved model fit [11, 33, 44, 58, 99–101]. Conventional regression models assume spatial independence of model residuals and so ignore the potential presence of spatial autocorrelation, such that non-spatial models may lead to false conclusions regarding covariate effects.
The use of the conventional regression models described above may be appropriate for spatial analysis and spatial prediction, provided that spatial dependence in the residuals has been ruled out. Under this approach, the standard procedure is to start with classical ordinary least squares (OLS) regression models and then look for spatial dependence in the residuals, the presence of which implies the need for a spatially explicit regression model [82]. Several of the models reviewed here did not appear to adopt this approach, and so caution is required when interpreting the findings from such analyses.
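The two-step procedure described above (fit OLS, then check the residuals for spatial dependence) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any reviewed study: Moran's I with binary adjacency weights is used as the test statistic, significance is approximated by random relabelling of areas, and the ten areas, their covariate values and the spatially structured error term are all hypothetical.

```python
import random


def ols_residuals(x, y):
    """Fit y = a + b*x by ordinary least squares and return the residuals."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
        (xi - x_mean) ** 2 for xi in x
    )
    intercept = y_mean - slope * x_mean
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def morans_i(values, neighbours):
    """Moran's I with binary weights; neighbours[i] lists the areas adjacent to i."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    cross = sum(z[i] * z[j] for i in range(n) for j in neighbours[i])
    total_weight = sum(len(adj) for adj in neighbours)
    return (n / total_weight) * cross / sum(zi ** 2 for zi in z)


def permutation_p_value(values, neighbours, n_perm=999, seed=0):
    """One-sided pseudo p-value: share of random relabellings with I >= observed."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbours)
    shuffled = list(values)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbours) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)


# Hypothetical data: ten areas along a line, each adjacent to its immediate neighbours.
neighbours = [[j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)]
covariate = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
spatial_error = [2, 2, 2, 2, 2, -2, -2, -2, -2, -2]  # unmodelled spatial structure
tb_rate = [10 + 2 * x + e for x, e in zip(covariate, spatial_error)]

residuals = ols_residuals(covariate, tb_rate)
moran = morans_i(residuals, neighbours)
p_value = permutation_p_value(residuals, neighbours)
print(f"Moran's I of OLS residuals: {moran:.2f} (pseudo p = {p_value:.3f})")
```

A clearly positive Moran's I in the residuals, as produced by the structured error term in this example, would suggest that the independence assumption of the conventional model is violated and that a model with an explicit spatial component should be considered.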
Most regression models treat the association between TB rates and ecological factors as global and are unable to capture local variation in the estimates of the association. However, geographically weighted regression (GWR) estimates coefficients for all spatial units included [22] and has often found the effect of risk factors on TB incidence to be spatially variable [16, 102–104], implying that global models may be inadequate to consider locally appropriate interventions. Few studies were able to perform explicit Bayesian spatial modelling incorporating information from nearby locations, thereby producing stable and robust estimates for areas with small populations and robust estimates of the effects of covariates [91].
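The contrast between a global coefficient and the local coefficients produced by GWR can be illustrated with a minimal sketch. It assumes a single covariate, locations along one dimension, a Gaussian distance-decay kernel and a fixed bandwidth chosen by hand; real applications would use two-dimensional coordinates, several covariates and a data-driven bandwidth. All data values and function names here are hypothetical.

```python
import math


def weighted_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (intercept, slope)."""
    sw = sum(w)
    x_mean = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_mean = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_mean) * (yi - y_mean) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_mean) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def gwr_slopes(coords, x, y, bandwidth):
    """Local slope at each location using Gaussian distance-decay weights."""
    slopes = []
    for ci in coords:
        # Observations near location ci receive more weight than distant ones.
        w = [math.exp(-0.5 * ((ci - cj) / bandwidth) ** 2) for cj in coords]
        slopes.append(weighted_fit(x, y, w)[1])
    return slopes


# Hypothetical data: eight areas along a line. The risk factor is positively
# associated with the TB rate in the first four areas and negatively in the rest.
coords = list(range(8))
risk_factor = [1, 2, 1, 2, 1, 2, 1, 2]
tb_rate = [7, 9, 7, 9, 3, 1, 3, 1]

global_slope = weighted_fit(risk_factor, tb_rate, [1.0] * 8)[1]
local_slopes = gwr_slopes(coords, risk_factor, tb_rate, bandwidth=1.5)

print(f"Global slope: {global_slope:.2f}")
print("Local slopes:", [round(s, 2) for s in local_slopes])
```

In this constructed example the global model reports no association because the opposing local effects cancel, whereas the local fits recover a positive slope at one end of the study area and a negative slope at the other.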
While our review focused on methodological issues, several consistent observations were noted. Most importantly, all studies included in this review demonstrated that TB displayed a heterogeneous spatial pattern across various geographic resolutions. This reflects the underlying tendency for spatial dependence that can be caused by person-to-person transmission, socio-economic aggregation [49] and environmental effects [58, 93]. However, in nearly all included studies, spatial analyses of TB were based on the individual’s residence, although considerable TB infection is acquired from workplaces and other social gathering sites [8, 54]. Such studies could wrongly attribute TB acquired from such sites to residential exposure, leading to resource misallocation.
Several models have shown significant associations between TB rates and demographic, socioeconomic and risk-factor variables, although it is difficult to rule out publication bias favouring studies with positive findings. However, associations observed at the population level between TB rates and factors such as population density, unemployment and poverty varied across studies. Because these factors are recognised as important individual-level risk factors, this inconsistency highlights the potential for ecological fallacy.
We did not perform individual study-level analysis of bias in this review. Analyses in the reviewed studies involved counts and proportions across different spatial distributions, rather than comparisons across different treatment/exposure groups. Standard tools of bias analysis predominantly focus on different treatment groups within cohorts (absent from our included studies) and hence are not applicable to this review. We have, however, discussed many potential sources of bias in the studies included in our review.
Most of the reviewed studies were from high-income settings, which may either reflect publication bias or a focus of research efforts on such settings. In high-incidence settings, the more limited use of spatial analysis methods could reflect a lack of access to resources (e.g. georeferenced data and spatial software packages) or insufficient expertise in these settings. However, it is these high-transmission settings which stand to gain the most from an improved understanding of TB spatial patterns and also these settings in which geospatial clustering may be most important epidemiologically.