Data
We obtained health-related data on women of childbearing age in Bangladesh from the three most recent available Demographic and Health Surveys (DHS), conducted in 2007, 2011 and 2014. The DHS, which began in 1984, are nationally representative household surveys that provide data on a wide range of issues in developing countries, including birth rates, mortality, migration, family planning, maternal and child health, nutrition, family living conditions and education [21]. The Bangladesh DHS (BDHS) has been conducted every three or four years since 1993. The most recent survey took place in 2017–2018; however, the corresponding data have not yet been published. A two-stage cluster sampling design was used. First, the country was divided into enumeration areas (EAs), and a number of EAs were sampled in proportion to the urban and rural areas. EAs, known as survey clusters, are city blocks or apartment buildings in urban areas, or villages or groups of villages in rural areas, and cover an average of 100–120 households. Second, 30 randomly selected households from each cluster were surveyed, yielding indicators that are representative not only nationally but also at the lower level of DHS regions and urban/rural residence [22]. Household-, women- (aged 15–49) and health-related information, such as housing conditions, reproductive history and marriage, was collected using standard questionnaires.
Coordinates of the clusters were available as geo-located data. However, to protect the privacy of respondents, the geo-located data are displaced, so that the coordinates of all clusters contain random errors: (i) urban clusters contain between 0 and 2 km of error; (ii) rural clusters contain between 0 and 10 km of error [4]. These errors may lead to misclassification of predictor variables in geostatistical analysis. Nevertheless, according to the DHS guidelines on GPS data, point extraction provides adequately unbiased estimates for most surface types, except for highly non-smooth surfaces [22, 23]. We fitted simultaneous autoregressive regression models, as proposed in the DHS guidelines (see Additional file 1 for details), to test the smoothness of the predictor surfaces in our study.
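To illustrate the size of the positional error described above, the following minimal sketch displaces a cluster coordinate by a random distance and direction. It assumes a uniform distance within the stated limits (2 km urban, 10 km rural) and a uniform bearing; the actual DHS displacement procedure may differ in its details, and the function names and the example coordinate are hypothetical.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0


def displace(lat, lon, urban, rng):
    """Shift a point by a random offset (assumed uniform in distance and bearing)."""
    max_km = 2.0 if urban else 10.0          # limits quoted in the text
    distance_km = rng.uniform(0.0, max_km)
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    # Small-offset (flat-earth) approximation, adequate for a few kilometres.
    dlat = distance_km * math.cos(bearing) / EARTH_RADIUS_KM
    dlon = distance_km * math.sin(bearing) / (
        EARTH_RADIUS_KM * math.cos(math.radians(lat)))
    return lat + math.degrees(dlat), lon + math.degrees(dlon)


def offset_km(lat1, lon1, lat2, lon2):
    """Approximate distance between two nearby points (equirectangular)."""
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2.0))
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_KM * math.hypot(x, y)


rng = random.Random(1)
true_lat, true_lon = 23.8, 90.4              # illustrative coordinate only
new_lat, new_lon = displace(true_lat, true_lon, urban=False, rng=rng)
print(round(offset_km(true_lat, true_lon, new_lat, new_lon), 2), "km")
```

A covariate extracted at the displaced point can differ from its value at the true location, which is why the smoothness of each predictor surface matters.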
Consistent with other research, we defined a woman as having RTI symptoms if she reported abnormal genital discharge or a genital sore/ulcer during the last 12 months [4, 24]. This outcome and the potential influencing factors (i.e., household wealth index, women’s years of education, husbands’ years of education, rate of births attended by a skilled provider, number of children in a household, and coverage of improved toilet facilities) were extracted from the DHS datasets at the individual level. Because surveys prior to 2007 only investigated women’s specific gynecological health problems during the 6 months preceding the survey, whereas the 2007 and subsequent surveys asked about abnormal genital discharge or genital sore/ulcer during the last 12 months, we used only the 2007, 2011 and 2014 survey data in order to have a uniform definition of RTI symptoms for the outcome variable.
In addition, we introduced a suite of environmental, climatic, socioeconomic and demographic data with high spatial resolution, such as the human influence index (HII) and elevation. Such data are often related to the social, health, accessibility and demographic factors that underlie geographic change [25]; they help explain the spatial variation of outcome variables and contribute to the accuracy of predictions. Detailed information on the data sources, data periods, and temporal and spatial resolutions is listed in Table S1 in Additional file 1.
Model fitting and validation
Bayesian geostatistical logistic regression models were applied to obtain spatially explicit RTI risk estimates. We denote by \( {p}_{it} \) and \( {n}_{it} \) the probability of reporting RTI symptoms and the total number of surveyed individuals at location \( i \) (\( i=1,2,\dots, L \)) in survey year \( t \), respectively. For cluster \( i \) at survey time \( t \), \( {n}_{it} \) independent individuals were randomly sampled from the population (\( {n}_{it} \) is not very large in our study), and each individual had a binary outcome. We therefore assumed that the number of positive individuals \( {Y}_{it} \) follows a binomial distribution, that is, \( {Y}_{it}\sim Binomial\left({p}_{it},{n}_{it}\right) \). The covariates were modelled with the logit link function as follows:
$$ logit\left({p}_{it}\right)={\beta}_0+{\boldsymbol{X}}_{it}^T\boldsymbol{\beta} +{\delta}_{it}+{\lambda}_i $$
Here \( {\beta}_0 \) is the intercept, \( {\boldsymbol{X}}_{it}^T \) is the vector of covariates for location \( i \) in year \( t \), and \( \boldsymbol{\beta} \) is the vector of the corresponding coefficients. \( {\lambda}_i \) is a location-specific exchangeable random effect assumed to follow a zero-mean normal distribution, \( {\lambda}_i\sim N\left(0,{\sigma}_{nonsp}^2\right) \). \( {\delta}_{it} \) is a spatiotemporal effect term, assumed to follow a stationary spatiotemporal Gaussian process.
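As a minimal sketch of how the linear predictor and the binomial likelihood fit together, the code below evaluates \( {p}_{it} \) and the log-likelihood contribution of a single cluster. All numerical values and function names are hypothetical and serve only to illustrate the model structure; they are not estimates from this study.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def linear_predictor(beta0, beta, x, delta, lam):
    """logit(p_it) = beta0 + x' beta + delta_it + lambda_i."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x)) + delta + lam


def binomial_log_lik(y, n, p):
    """Log-probability of y positives among n sampled women with risk p."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(y + 1)
                  - math.lgamma(n - y + 1))
    return log_choose + y * math.log(p) + (n - y) * math.log(1.0 - p)


# Hypothetical cluster: two covariates, small random effects.
eta = linear_predictor(beta0=-1.5, beta=[0.3, -0.2], x=[1.0, 0.5],
                       delta=0.1, lam=-0.05)
p = inv_logit(eta)
print(round(p, 3), round(binomial_log_lik(y=6, n=30, p=p), 3))
```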
To avoid the “big \( n \) problem” that arises when working with the dense covariance matrix of a Gaussian field (GF), the continuous spatial process was approximated by a Gaussian Markov random field (GMRF), a discretely indexed spatial random process with a sparse precision matrix, using the computationally efficient stochastic partial differential equation (SPDE) approach [26, 27]. A triangulated mesh was constructed over the study region to represent the spatial domain. We denote by \( \boldsymbol{\xi} ={\left({\boldsymbol{\xi}}_1,\dots, {\boldsymbol{\xi}}_T\right)}^{\prime } \) the \( T\times G \)-dimensional GMRF, where \( T \) is the total number of discrete time points and \( G \) the number of vertices of the mesh. The joint distribution of \( \boldsymbol{\xi} \) is \( \boldsymbol{\xi} \sim N\left(\mathbf{0},{\boldsymbol{Q}}^{-1}\right) \) with \( \boldsymbol{Q}={\boldsymbol{Q}}_s\otimes {\boldsymbol{Q}}_t \). \( {\boldsymbol{Q}}_t \) is the \( T \)-dimensional precision matrix for the temporal effect, defined either as an autoregressive process of order 1 (AR1) or as exchangeable. \( {\boldsymbol{Q}}_s \) is the sparse precision matrix of the spatial GMRF, arising from the SPDE representation of the GF with a Matérn covariance function, constructed with spatial correlation \( C(d)={\left(\kappa d\right)}^{\upsilon }{K}_{\upsilon}\left(\kappa d\right) \) and spatial variance \( {\sigma}_{sp}^2=1/\left(4\pi {\kappa}^2{\tau}^2\right) \). Here \( d \) is the Euclidean distance between a pair of locations, \( \kappa \) a scaling parameter, \( \upsilon \) a smoothing parameter usually kept fixed, \( \tau \) the precision parameter and \( {K}_{\upsilon } \) the modified Bessel function of the second kind of order \( \upsilon \) (the same value as the smoothing parameter). The spatial range is defined as \( R=\sqrt{8\upsilon }/\kappa \), regarded as the distance at which the spatial correlation becomes negligible (approximately 0.1).
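The relation between the range \( R \) and the Matérn correlation can be checked numerically. The sketch below is an illustration only: it evaluates \( {K}_{\upsilon } \) with a simple trapezoidal rule on its integral representation (a stand-in for a library Bessel routine), and it includes the usual normalizing constant \( {2}^{1-\upsilon }/\Gamma \left(\upsilon \right) \), which equals 1 for \( \upsilon =1 \). The value of \( \kappa \) is hypothetical.

```python
import math


def bessel_k(nu, x, t_max=20.0, steps=20000):
    """K_nu(x) = integral over t >= 0 of exp(-x cosh t) cosh(nu t).

    Trapezoidal rule; intended for moderate x (not extremely close to zero).
    """
    h = t_max / steps
    total = 0.5 * math.exp(-x)               # half-weight term at t = 0
    for k in range(1, steps):
        t = k * h
        total += math.exp(-x * math.cosh(t)) * math.cosh(nu * t)
    return total * h                         # the term at t_max is negligible


def matern_correlation(d, kappa, nu):
    """Normalized Matern correlation at distance d."""
    if d == 0.0:
        return 1.0
    x = kappa * d
    return 2.0 ** (1.0 - nu) / math.gamma(nu) * x ** nu * bessel_k(nu, x)


def spatial_range(kappa, nu):
    """R = sqrt(8 nu) / kappa, as defined in the text."""
    return math.sqrt(8.0 * nu) / kappa


kappa = 0.2                                  # hypothetical scaling parameter
for nu in (1.0, 0.5):
    r = spatial_range(kappa, nu)
    print(nu, round(r, 2), round(matern_correlation(r, kappa, nu), 3))
```

Under this approximation the correlation at \( d=R \) is close to 0.1 (slightly above it) for both smoothness values considered in the study.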
For a given time point \( t \), we have \( {\boldsymbol{\xi}}_t=\rho {\boldsymbol{\xi}}_{t-1}+{\boldsymbol{w}}_t \) under the AR1 temporal process, where \( {\boldsymbol{\xi}}_1\sim N\left(\mathbf{0},{\boldsymbol{Q}}_s^{-1}/\left(1-{\rho}^2\right)\right) \) and \( {\boldsymbol{w}}_t \) is assumed temporally independent with \( {\boldsymbol{w}}_t\sim N\left(\mathbf{0},{\boldsymbol{Q}}_s^{-1}\right) \) and \( \mathrm{Cov}\left({w}_{it},{w}_{j{t}^{\prime }}\right)=\left\{\begin{array}{ll}0& \mathrm{if}\ t\ne {t}^{\prime}\\ {}{\sigma}_{sp}^2& \mathrm{if}\ t={t}^{\prime}\end{array}\right. \). By approximating the GF with a GMRF using the SPDE approach, for a given location \( i \) at time point \( t \), the spatial random effect \( {\delta}_{it} \) can be expressed as \( {\delta}_{it}=\sum \limits_{g=1}^G{a}_{ig}{\xi}_{gt} \), where \( {a}_{ig} \) is the generic element of the sparse weight matrix \( \boldsymbol{A} \), which maps the GMRF \( \boldsymbol{\xi} \) from the \( G \) triangulation vertices of the mesh to the observation locations [28]. Because the spatial random effect at any survey location and any study time can be expressed through the GMRF \( \boldsymbol{\xi} \) and the weight matrix \( \boldsymbol{A} \), the survey locations need not be the same across survey years.
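The mapping \( {\delta}_{it}=\sum_g{a}_{ig}{\xi}_{gt} \) can be illustrated for a single observation location. With piecewise-linear basis functions on a triangulated mesh, a row of the weight matrix has at most three non-zero entries, namely the barycentric coordinates of the location within its enclosing triangle. The sketch below assumes the enclosing triangle is already known; the triangle, the point and the vertex values are hypothetical.

```python
def barycentric_weights(p, a, b, c):
    """Weights of point p with respect to triangle vertices a, b, c."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb


def project(weights, vertex_values):
    """delta = sum over the triangle vertices of a_ig * xi_g."""
    return sum(w * xi for w, xi in zip(weights, vertex_values))


# Hypothetical triangle and GMRF values at its three vertices.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
xi = [1.0, 2.0, 4.0]
weights = barycentric_weights((0.25, 0.25), *triangle)
print(weights, project(weights, xi))
```

Because the weights depend only on where a location falls in the mesh, locations surveyed in different years can all be projected from the same set of vertices.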
We first defined \( {\boldsymbol{Q}}_t \) (the precision matrix for the temporal effect) as an AR1 process with autoregressive coefficient \( \mid \rho \mid <1 \). To handle the irregularly spaced survey times (i.e., 2007, 2011 and 2014), we set equally spaced time knots (i.e., 2007, 2010.5 and 2014) and built the GMRF on the knots, that is, \( \boldsymbol{\xi} ={\left({\boldsymbol{\xi}}_{t=2007},{\boldsymbol{\xi}}_{t=2010.5},{\boldsymbol{\xi}}_{t=2014}\right)}^{\prime } \). The latent field for survey year 2011, \( {\boldsymbol{\xi}}_{t=2011} \), is approximated by the projection of \( \boldsymbol{\xi} \) based on B-spline basis functions of degree one [28]. If the 95% Bayesian credible interval (BCI) of the posterior distribution of \( \rho \) included zero, suggesting no statistically significant temporal dependence between survey years, we set \( {\boldsymbol{Q}}_t=\boldsymbol{I} \) (the identity matrix) for an exchangeable time effect of the spatiotemporal process in the final model.
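The temporal part of the construction can be sketched in a few lines. Degree-one B-spline basis functions amount to linear interpolation between neighbouring knots, and an AR1 process with unit innovation variance has a tridiagonal precision matrix. The function names are hypothetical, and the value of \( \rho \) is chosen purely for illustration.

```python
def linear_bspline_weights(t, knots):
    """Weights of the degree-one B-spline basis functions at time t."""
    weights = [0.0] * len(knots)
    for k in range(len(knots) - 1):
        left, right = knots[k], knots[k + 1]
        if left <= t <= right:
            w_right = (t - left) / (right - left)
            weights[k] = 1.0 - w_right
            weights[k + 1] = w_right
            return weights
    raise ValueError("t lies outside the knot range")


def ar1_precision(rho, n_times):
    """Precision matrix of a stationary AR1 process, unit innovation variance."""
    q = [[0.0] * n_times for _ in range(n_times)]
    for k in range(n_times):
        interior = 0 < k < n_times - 1
        q[k][k] = 1.0 + rho ** 2 if interior else 1.0
        if k + 1 < n_times:
            q[k][k + 1] = q[k + 1][k] = -rho
    return q


knots = [2007.0, 2010.5, 2014.0]
print(linear_bspline_weights(2011.0, knots))  # weights on the three knots
print(ar1_precision(0.5, len(knots)))         # rho = 0.5 is illustrative
```

For the 2011 survey the weight falls mostly on the 2010.5 knot, with the remainder on the 2014 knot, so the latent field for 2011 is a weighted average of the fields at those two knots.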
Furthermore, to avoid spatial confounding between the spatially structured effects and the fixed-effect covariates [29, 30], we restricted the spatial random effect to the orthogonal complement of the fixed-effect covariates by imposing the constraint \( \boldsymbol{B}{\boldsymbol{\xi}}^{\prime }={\mathbf{0}}^{\prime } \), where \( \boldsymbol{B} \) is the orthogonal matrix from the QR decomposition of the covariate matrix \( \boldsymbol{X} \) [31]. To address identifiability issues between the fixed effects and the random effects, we imposed a sum-to-zero constraint on the location-specific effect and an integrate-to-zero constraint on the spatiotemporal random effect [32].
We adopted a Bayesian inferential framework to estimate the parameters and hyperparameters, fitting the model with the INLA package in R (version 3.5.0) using the INLA-SPDE approach [28, 33]. We did not have much information with which to specify precise prior distributions for the parameters. To avoid subjectivity in the choice of priors and to keep inferences within a reasonable range, we used minimally informative priors. These were set as distributions with large variances, covering a wide range of reasonable values according to the characteristics of the parameters; they thus provide a good representation of genuine ignorance about the parameters and do not strongly affect the posteriors. Priors were set for the parameters and hyperparameters as follows: \( {\beta}_0,\boldsymbol{\beta} \sim N\left(0,1000\right) \), \( \log \left(\left(1+\rho \right)/\left(1-\rho \right)\right)\sim N\left(0,0.15\right) \), \( \log \left(\tau \right)\sim N\left(0.378,10\right) \), \( \log \left(\kappa \right)\sim N\left(-1.64,10\right) \) and \( 1/{\sigma}_{nonsp}^2\sim gamma\left(1,0.00005\right) \). In addition, the default setting for the smoothing parameter, \( \upsilon =1 \), was adopted, as in many previous studies. To assess model sensitivity to the setting of \( \upsilon \), we also tried the alternative value \( \upsilon =0.5 \).
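Several of these priors are placed on transformed scales. As a minimal sketch, the code below converts between \( \rho \) and its internal scale \( \log \left(\left(1+\rho \right)/\left(1-\rho \right)\right) \), and evaluates the spatial variance and range implied by the prior means of \( \log \left(\tau \right) \) and \( \log \left(\kappa \right) \) with \( \upsilon =1 \). The function names are hypothetical, and the sketch does not depend on how the second argument of each normal prior is parameterized.

```python
import math


def rho_to_internal(rho):
    """theta = log((1 + rho) / (1 - rho)); maps (-1, 1) to the real line."""
    return math.log((1.0 + rho) / (1.0 - rho))


def internal_to_rho(theta):
    """Inverse transform: rho = (exp(theta) - 1) / (exp(theta) + 1)."""
    return math.tanh(theta / 2.0)


def spatial_variance(kappa, tau):
    """sigma_sp^2 = 1 / (4 pi kappa^2 tau^2), as defined for the GMRF."""
    return 1.0 / (4.0 * math.pi * kappa ** 2 * tau ** 2)


# Values at the prior means of log(tau) and log(kappa) quoted in the text.
tau = math.exp(0.378)
kappa = math.exp(-1.64)
print(internal_to_rho(0.0))                  # prior mean of theta gives rho = 0
print(spatial_variance(kappa, tau))          # implied spatial variance
print(math.sqrt(8.0) / kappa)                # implied range, in mesh units
```

Working on the internal scale keeps \( \rho \) inside \( \left(-1,1\right) \) and keeps \( \tau \) and \( \kappa \) positive without explicit bounds.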
To identify the best set of predictors, we carried out Bayesian variable selection. The following variables were considered: women’s years of education, the proportion of births attended by a skilled provider, number of children, the proportion of improved toilet facilities, wealth, normalized difference vegetation index, land surface temperature (LST) in the daytime, LST at night, elevation, moisture, human influence index, urban extents, and the distance to the nearest fresh water body. First, to identify the best functional form (i.e., linear or categorical) of the continuous potential predictors, we converted continuous variables to three-level categorical ones according to preliminary exploratory graphical analysis. For each continuous potential predictor, we fitted two univariate Bayesian geostatistical models with the predictor as the only fixed-effect independent variable, one with it in linear form and the other in categorical form. The deviance information criterion (DIC) and marginal predictive likelihood (MPL) were both recorded, and the functional form of the model with the smaller MPL and DIC was selected. Second, we fitted geostatistical models with all possible combinations of potential predictors as covariates and selected the combination with the smallest DIC and MPL as the best set of predictors for the final model.
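The exhaustive search over predictor combinations can be sketched as follows. This is an illustration only: the scoring function is a hypothetical stand-in for fitting a geostatistical model and reading off its DIC, a single criterion is used instead of both DIC and MPL, and only three predictor names are used to keep the example small.

```python
from itertools import combinations


def all_subsets(predictors):
    """Yield every non-empty combination of the candidate predictors."""
    for size in range(1, len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield subset


def select_best(predictors, score):
    """Return the subset with the smallest score (e.g. a model's DIC)."""
    return min(all_subsets(predictors), key=score)


def toy_score(subset):
    """Hypothetical stand-in for model fitting; smaller is better."""
    value = 100.0 + 2.0 * len(subset)        # penalty for model size
    if "elevation" in subset:
        value -= 10.0
    if "wealth" in subset:
        value -= 5.0
    return value


candidates = ["wealth", "elevation", "moisture"]
print(len(list(all_subsets(candidates))))    # 2**3 - 1 candidate models
print(select_best(candidates, toy_score))
```

With k candidate predictors the search fits 2^k − 1 models, so its cost doubles with each predictor added.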
We carried out 10-fold cross-validation to assess model performance. Survey locations were randomly divided 10 times into 90% (training set) and 10% (validation set) splits, and the following performance indicators were calculated: (i) mean error (the mean of the observed prevalence minus the predicted prevalence), and (ii) the percentage of observations covered by the 95% Bayesian credible intervals (BCI) of the posterior predicted prevalence [17].
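The two performance indicators can be written down directly. The sketch below assumes that, for each validation location, the observed prevalence, the predicted prevalence and the limits of the 95% BCI are already available; the numbers and the random split helper are hypothetical.

```python
import random


def split_locations(location_ids, rng, validation_fraction=0.1):
    """Randomly split locations into training and validation sets."""
    shuffled = list(location_ids)
    rng.shuffle(shuffled)
    n_val = max(1, round(validation_fraction * len(shuffled)))
    return shuffled[n_val:], shuffled[:n_val]


def mean_error(observed, predicted):
    """Mean of observed minus predicted prevalence."""
    return sum(o - p for o, p in zip(observed, predicted)) / len(observed)


def coverage_percent(observed, lower, upper):
    """Percentage of observations inside their 95% credible interval."""
    covered = sum(1 for o, lo, hi in zip(observed, lower, upper)
                  if lo <= o <= hi)
    return 100.0 * covered / len(observed)


train, validation = split_locations(range(100), random.Random(0))
observed = [0.10, 0.20, 0.30, 0.40]          # hypothetical validation values
predicted = [0.12, 0.18, 0.33, 0.35]
lower = [0.05, 0.10, 0.31, 0.20]
upper = [0.20, 0.30, 0.45, 0.50]
print(len(train), len(validation))
print(mean_error(observed, predicted), coverage_percent(observed, lower, upper))
```

A mean error near zero indicates little systematic over- or under-prediction, while coverage close to 95% indicates well-calibrated credible intervals.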
RTI risk for each survey year was estimated using the INLA package over a grid of 7137 pixels across Bangladesh at 5 × 5 km spatial resolution. The pixel-level number of infected women was also calculated (Additional file 1). The results were mapped using ArcGIS 10.0. Furthermore, we calculated the population-adjusted prevalence (median and 95% BCI) for all 8 divisions (administrative divisions of level one, ADM1) and 64 districts (administrative divisions of level two, ADM2) in Bangladesh by aggregating the pixel-level estimates to the corresponding ADM1 or ADM2 level. Changes in risk over time at the ADM2 level were mapped to show the temporal trend clearly.
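One way to aggregate pixel-level estimates to an administrative unit is a population-weighted average, sketched below. This is an assumption about the aggregation step rather than a description of the exact procedure used; the unit names, populations and prevalences are hypothetical. In a Bayesian workflow the same calculation would typically be applied to each posterior sample, with the median and 95% BCI taken across samples.

```python
def population_adjusted_prevalence(pixels):
    """Aggregate (unit, population, prevalence) pixels to unit-level prevalence."""
    totals = {}
    for unit, population, prevalence in pixels:
        cases, people = totals.get(unit, (0.0, 0.0))
        totals[unit] = (cases + population * prevalence, people + population)
    return {unit: cases / people for unit, (cases, people) in totals.items()}


# Hypothetical pixels: (administrative unit, women in pixel, predicted risk).
pixels = [
    ("A", 100.0, 0.10),
    ("A", 300.0, 0.20),
    ("B", 200.0, 0.05),
]
print(population_adjusted_prevalence(pixels))
```

Weighting by population means that densely populated pixels contribute more to the unit-level prevalence than sparsely populated ones.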