Background
Tuberculosis (TB) continues to challenge the international community. It is estimated that there are about 1.7 billion people with potential TB infection, accounting for 23% of the world’s population, are at risk of developing TB disease during their lifetime [
1]. Moreover, the global burden was estimated by the World Health Organization (WHO) at 10.0 million incident cases in 2017. It is also one of the top 10 causes of death which caused an estimated 1.6 million deaths in 2017, and has killed more people than other infectious diseases in the past few decades [
1,
2].
China has always been one of the countries with the most serious Tuberculosis epidemic in the world [
3‐
6]. There were 866,000 patients with infection of TB in China, 2018 [
7]. Due to the continuous attention to public health and increasing investment in resources, China’s Tuberculosis epidemic has significantly improved in recent years. However, due to the large number of people infected with TB, the epidemic situation of Tuberculosis is still not optimistic, so further long-term research on the incidence of it in China is needed.
Currently, China has conducted five national epidemiological investigations to find the epidemiological characteristics of Tuberculosis. However, the spatiotemporal distributions of Tuberculosis cannot be evaluated continuously, and the survey was unable to measure other important indicators of the severity of the epidemic. The mathematical models may help us better understand the epidemiological characteristics of Tuberculosis. Some of the studies mainly focused on the seasonality impact on the transmission of Tuberculosis [
5,
8,
9], while others focused on the spatial distributions [
10,
11]. There is no model that assesses the spatiotemporal characteristics and the epidemiology of Tuberculosis among the whole population in China over 14 years.
The aim of this study was to observe the Spatial-temporal characteristics and the epidemiology of Tuberculosis in China from 2004 to 2017. The incidence trend of the TB was observed by the Joinpoint regression analysis. The Seasonal autoregressive integrated moving average (SARIMA) model was used to predict the monthly incidence. Geographic clusters was employed to analyze the spatial autocorrelation. The relative importance component of TB was detected by the multivariate time series model. These models additively divided TB risks into spatiotemporal, autoregressive, and endemic components.
Methods
The data collection
Tuberculosis incidence data were extracted from the Chinese Center for Disease Control and Prevention (
http://www.phsciencedata.cn/Share/edtShareNew.jsp?id=39208) in 31 provinces of China from 2004 to 2017. The data were aggregated to 168 monthly counts across the 14 years. Population data came from the website of the statistical yearbook of the National Bureau of Statistics (
http://www.stats.gov.cn/tjsj/ndsj/). The population size was easy to find in the website, and it represented the average population each year.
Joinpoint regression
From 2004 to 2017, the continuous change of the TB incidence trend was analyzed using Joinpoint software. The grid search method was applied to find significant trends, and multiple permutation tests were applied to detect the Joinpoint points for each trend [
12‐
14]. The overall time trend was calculated by the annual average rate of change (AAPC). If the final model was 0 Joinpoint model, the average percent change (APC) was considered equal to AAPC. We used the Joinpoint regression model to find the long-term trend of the TB incidence.
Time-series estimation
The SARIMA model was used to predict the future trends in many disease incidences [
13,
15‐
17]. In our study, A SARIMA model was applied to predict the incidence of TB epidemics in China. The SARIMA model can be written as the form of (p, d, q) (P, D, Q) [s], which P, D, and Q indicate seasonal SAR terms, seasonal differencing, and seasonal SMA terms, respectively; p, d, and q indicate non-seasonal AR terms, non-seasonal differencing, and non-seasonal MA, respectively; s indicated the seasonal period (s = 12 in our study).
The construction of the SARIMA model can be divided into the following steps. First, an augmented Dickey-Fuller (ADF) test was performed to test the stationary status of time series. Second, model parameters (p, d, q, P, D, and Q) were determined by autocorrelation function (ACF) plot, partial autocorrelation function (PACF) plot, and inverse autocorrelation function (IACF) plot. An alternative SARIMA model was constructed by transforming the parameters of model. Lastly, the Akaike information criterion (AIC) and Schwartz Bayesian Criterion (SBC) were used to determine the fitness of different SARIMA models. An optimal model was considered to have the lowest AIC and SBC values, and the residuals of the final model were tested by the Box-Ljung test to know whether they were time independent. The mean square error (MSE), mean absolute percentage error (MAPE), mean absolute error (MAE), and root mean square error (RMSE) were used to see the predictive validity of the models. We use year 2004–2016 to construct the SARIMA model, and year 2017 to testify the forecast of the model. The SARIMA model is used to forecast the short-term incidence of TB to testify the accuracy of model. We also decompose the monthly data into the overall trend, seasonal trend, and random noise with a goal to identify the truly long-term trend.
Spatial autocorrelation analysis
Spatial analysis was used to identify the clustering regions and observe geographic variation [
18,
19]. Global Moran’s I of reported TB cases was computed to detect the spatial clustering pattern. A Moran’s I value is between − 1 and 1, whereas the value near 1 means positive spatial autocorrelation, the value near − 1 means negative spatial autocorrelation, and 0 means random distribution. Local Moran’s Index was calculated and a hotspot analysis was performed to determine the location of clusters. Local Moran’s Index was applied to determine the spatial autocorrelation, which detects some spatial clusters with similar adjacent features and exception values. When the incidences rate had similar low values or high values, these areas were deemed as having positive autocorrelation (low-low or high-high autocorrelation). If not, they were defined as having a negative autocorrelation (low-high or high-low autocorrelation) [
10].
The multivariate time series model
A multivariate time-series model for disease counts Y
i,t during periods t = 1, T from units i = 1, I was first established by Held et al [
20] and was extended and applied in some papers [
21‐
23]. The Y
i,t denoted the number of TB cases which were considered to be a negative binomial distribution Yit|Y
i,t-1 ~ NegBin (u
it, ψ), with an additively decomposed mean:
$$ {\mathrm{u}}_{\mathrm{i}\mathrm{t}}={\upsilon}_{\mathrm{i}\mathrm{t}}{e}_{\mathrm{i}\mathrm{t}}+{\lambda}_{\mathrm{i}\mathrm{t}}\;{\mathrm{Y}}_{\mathrm{i},\mathrm{t}\hbox{-} 1}+{\upphi}_{\mathrm{i}\mathrm{t}}\sum \limits_{j\ne i}{w}_{ji}{Y}_{j,t=1}, $$
Where ψ is an over-dispersed parameter that the conditional variance of Y
i,t is
\( {\mu}_{it}\left(1+\psi {u}_{it}\right) \). υ
itℯ
it is the endemic component, and the autoregressive component λ
itY
i,t-1 reflects the patient numbers at previous time. The spatiotemporal component ϕ
it ∑j ≠
iwjiYj,
t − 1 reflects the transmission among different units. Each parameter υ
it, λ
it, and ϕ
it follow the form of log-linear:
$$ \log \left({\upsilon}_{\mathrm{i}\mathrm{t}}\right)={a}^{\left(\mathrm{v}\right)}+{{\mathrm{b}}_{\mathrm{i}}}^{\left(\mathrm{v}\right)}+\sum \limits_{s=1}^S\left\{\gamma \sin \left({w}_st\right)+\delta \cos \left({w}_st\right)\right\}, $$
$$ \log \left({\lambda}_{\mathrm{i}}\right)={\alpha}^{\left(\lambda \right)}+{{\mathrm{b}}_{\mathrm{i}}}^{\left(\lambda \right)}, $$
$$ \log \left({\upphi}_{\mathrm{i}}\right)={\alpha}^{\left(\upphi \right)}+{{\mathrm{b}}_{\mathrm{i}}}^{\left(\upphi \right)}, $$
Where α(v), α(λ) and α(ϕ) are intercepts and bi(v), bi(λ) and bi(ϕ) are random effects accounting for heterogeneity among different regions. The endemic υit contains a sinusoidal frequency wave (ws = 2π/12 for monthly data), and S is the seasonal parameters. The population fraction ei can be used as a multiplicative offset for the regional specific measure for the incidence of infectious disease.
The weights ω
ji describe the transmission from district j to district I. Considering that most regions are very large, higher-order neighbourhood are not that relevant as we only constructed our model with first-order neighbourhood. The score rule of the Dawid-Sebastiani score (“dss”) was applied to identify the optimal model with random effects. The optimal model corresponds to lower scores with better predictions [
22,
24]. All the multivariate time analysis used the R package Surveillance.
Statistical analysis
The incidence trend of TB from 2004 to 2017 was observed by the Joinpoint software (version 4.7.0.0). The Seasonal autoregressive integrated moving average (SARIMA) model was used to predict the monthly incidence of TB by SAS9.4 (SAS Institute Inc., Cary, NC). Geographic clusters was employed to analyze the spatial autocorrelation with ArcGIS software (version 10.2, ESRI Inc.; Redlands, CA, USA). The relative importance component of TB was detected by the multivariate time series model with R software (version 3.6.0, package = surveillance). P value < 0.05 was considered as statistically significant for all the tests.
Discussion
According to our research, there were 13,991,850 TB cases from January 2004 to December 2017, with a yearly average morbidity of 999,417 cases which was a huge burden for the public health of China. Understanding the epidemiology patterns of TB may help China to reduce the number of TB cases which ranked second in 2017 according to the WHO report [
1]. The incidence of TB from 74.58 (/100,000) cases in 2004 to 60.08 (/100,000) cases in 2017 which was a 19.4% reduction of TB incidence. The annual average percent change (AAPC) was − 3.3, which is better than the world average of 2% [
1]. The reason for the decline of TB incidence is the rising GDP (Gross Domestic Product) (China ranked second in 2019), high urbanization, and the widespread modern control strategy. Previous studies demonstrated that the TB incidence of China decreased with the rising of GDP and better healthy treatment and management [
25,
26], which was also found in other countries [
27,
28].
Consistent with previous research [
9], we found two peaks in January and March every year for TB incidence in China, with close numbers in these two peaks. The low number of confirmed cases in February may probably attribute to the Chinese traditional Spring Festival holiday. The average time from disease onset to confirmation of the diagnosis was 72 days when some infected persons develop active Tuberculosis, and patients were most likely to be diagnosed 2–3 months after symptom onset [
8]. So, we should enhance patient control and the prevention of susceptible population in the autumn and winter, and the detection of TB in spring.
For a long time, the hotspots were distributed in the northwest areas such as Xinjiang Province and Tibet. Xinjiang Province has been at a high incidence level in the 14 years, while the incidence in Tibet increased since 2012. Except for 2004, 2010, and 2014, Guizhou Province has been at a high incidence level in the later 11 years. Some provinces such as Hainan, Guangxi, and Chongqing were at a high incidence level before 2009, but have been at a low level since 2009. More attention is needed in these high incidence areas, especially in Xinjiang, Tibet and Guizhou, which may need more financial assistance. It should be noted that some High-High spatial autocorrelation including Hainan, Guizhou, Guangxi, and Chongqing Province have disappeared since 2009, while Xinjiang Province and Tibet have become new H-H regional areas since 2009. The possible explanation is the unbalanced economic development in these areas [
29]. Some studies [
30‐
32] have demonstrated that there has positive correlations between the poverty level of regions, families or individuals and the incidence of Tuberculosis.
At the average level of the province component over the 14 years, autoregressive components dominated all the provinces which can explain 81.5–84.5% of the incidence, while the spatiotemporal component was mainly located in the well-developed provinces. For some provinces such as Beijing, Jiangsu and other well-developed economic provinces which were partly affected by the spatiotemporal component, it is recommended to monitor TB infection of the floating population form the neighbouring areas. For example, individuals who work in Beijing but become infected with TB in their hometowns should stay at home before anti-Tuberculosis treatment and maintain the treatment for a couple of weeks, avoiding going to public places or having close contact with others. We also did an analysis for the provinces with a high incidence (> 70 cases per 100,000 over 14 years) of the three components. For the autoregressive component which dominated all the high incidence provinces, early protective implementation 2–3 months ahead of the peak could help us reduce the number of TB patients [
8]. For the endemic parts, most infected patients could be explained by living conditions, ecological and climatological changes, and socioeconomic activities. Active treatment for TB patients and cutting off the pathway of transmission may be the most effective way to prevent TB [
8,
33]. Another important method is increasing the public awareness, especially among old people and children, and enhancing their physical exercise, immunity, and general hygiene. In addition, the spatial-temporal component can also affect the transmission of TB. Guangxi and Guangdong Provinces, which are in the south-east coastal area, were partly influenced by the spatial-temporal component, indicating that these regions may have imported TB from adjacent country with high incidence such as Philippines [
34,
35] or the neighbouring province Guizhou. Alarmingly, although there was no clear evidence that Tibet and Xinjiang had a high value of spatial-temporal component, we still need to pay attention to transmission from India [
33,
34] which was ranked first in global TB patients.
Our study had several limitation. First, the monthly data from 2004 to 2017 did not collect some risk factors including socioeconomic status, climatic factors, gender, age, and human activities. The relationship between the incidence of TB and these factors was still unknown. These factors should be included in the future studies in order to get an accurate multivariate time series model. Second, we included TB patients reported from the passive surveillance system which inevitably underestimated the total number of TB cases. Further researches could consider the level of reporting, including some subclinical and mild individuals not accessing healthcare. Lastly, the level of diagnosis in some provinces can lead to an underestimation of the TB incidence. We should think over the diagnostic level in the future studies to correct the incidence.
Acknowledgements
We thank all the clinical personnel for investigating and reporting the information about the TB cases. Thanks to Dr. Sebastian Meyer, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Institute for Medical informatics, Biometry and Epidemiology (IMBE), for help with the R package surveillance. Thanks to Dr. Edward C. Mignot, Shandong University, for linguistic advice.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.