Background

Tuberculosis (TB) is a chronic respiratory infectious disease caused by the pathogen Mycobacterium tuberculosis. Infected people can spread TB germs when they cough or sneeze. If patients with TB do not receive timely and thorough treatment, the disease can seriously threaten their health and may even leave them completely unable to work; untreated patients may also infect others [1]. Although great progress has been made worldwide in the prevention and control of TB, many countries, especially low-income and middle-income countries, are still afflicted by a chronic TB epidemic that causes huge economic losses [2]. Moreover, TB remains one of the top 10 causes of death worldwide: it is estimated that there were 10.0 million new cases of TB globally in 2017, with 1.3 million deaths directly attributable to TB, and TB has killed more people than any other infectious disease in the past few decades [2, 3]. China is one of the countries with a high burden of TB; its number of TB patients ranks second in the world, accounting for a quarter of the world’s patients, and about 250 thousand patients die of TB every year in China [2].

Guangxi is a province in the south of China, located between latitudes 20°54′ and 26°24′ N and longitudes 104°26′ and 112°04′ E. It covers a total area of about 236,700 km², had a population of over 49.26 million in 2018, and is one of the Chinese provinces most affected by TB. From 2015 to 2017, the annual incidences of TB (per 100,000 population) in China were 63.42, 61 and 60.53, respectively, while those in Guangxi were 96.41, 86.27 and 87.86, respectively. The incidence of TB in Guangxi was thus much higher than the national level, so it is necessary to pay more attention to the prevention and control of TB in this area.

Understanding the patterns of infectious diseases requires analyzing the epidemic situation with existing surveillance data and then predicting future trends, which can provide a scientific reference for disease prevention and control. The Box-Jenkins method is a representative time series analysis and prediction method that can take into account trend changes, periodic changes, and random disturbances in a time series. It is very useful for modeling the temporal dependence structure of a time series. This method has been widely used in the prediction of infectious diseases with successful results. For instance, Tian C W et al. [4] successfully forecasted monthly cases of hand-foot-mouth disease in China; Wang T et al. [5] suggested that the ARIMA(3,1,1)(2,1,1)12 model was reliable and highly valid, and could be used to predict the incidence of hemorrhagic fever with renal syndrome in Zibo; Myriam Gharbil et al. [6] predicted the dengue incidence in Guadeloupe based on time series analysis; López-Montenegro LE [7] predicted dengue cases in Colombia from 2018 to 2022 based on the Auto-Regressive Integrated Moving Average (ARIMA) model; and Zheng Y-L et al. [8] and Liao Z [9] successfully forecasted TB incidence using the SARIMA model, among others [10,11,12,13,14,15,16,17].

The incidence of TB in Guangxi is very high, but few related prediction studies have been conducted so far. To support better prevention and control, we carried out a prediction study. First, we briefly analyzed the trend of the TB incidence in Guangxi over the years. Then, based on the characteristics of the TB incidence data in Guangxi, China, we established the best SARIMA model for prediction. Finally, we predicted the future TB incidence, which can provide a scientific reference for the prevention and control of TB in Guangxi.

Methods

Data source

The data on TB cases in Guangxi from January 2012 to June 2019 were obtained from the Guangxi Center for Disease Control and Prevention, China. Population data were obtained from the official website of the Guangxi Bureau of Statistics. Based on the population data and the reported number of TB cases, we calculated the monthly incidence of TB (per 100,000 population). The data used in this study are provided as Additional file 1.
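The incidence calculation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is hypothetical, the example numbers are invented, and it assumes that a single population figure is used as the denominator for each month.

```python
def incidence_per_100k(cases, population):
    """Return the incidence per 100,000 population for one reporting period."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases / population * 100_000


# Hypothetical example values, not taken from the study data.
example = incidence_per_100k(cases=4_926, population=49_260_000)
print(round(example, 2))
```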

SARIMA model descriptions

The Box-Jenkins method is a well-known time series prediction method proposed by Box and Jenkins in the early 1970s. It includes the ARIMA(p,d,q) model, the Autoregressive Integrated Moving Average model, where AR stands for autoregression, p is the number of autoregressive terms, MA stands for moving average, and q is the number of moving average terms [18, 19]. If the time series contains a seasonal cycle, a seasonal difference is often needed to establish a SARIMA model. A SARIMA model with s observations per period is denoted by SARIMA(p, d, q)(P, D, Q)s. Generally, the standard statistical methodology for constructing a SARIMA(p, d, q)(P, D, Q)s model includes four steps:

First step: data stationarity test. Usually, the data set is divided into two subsets, a training set and a testing set. The training set needs to be a stationary time series. If the original training data are not stationary, ordinary or seasonal differencing is required, where d is the order of the ordinary difference and D is the order of the seasonal difference. The Augmented Dickey-Fuller (ADF) test can determine whether the time series is stationary; the significance level of the test is 0.05 (if the Prob value of the test is less than 0.05, the data are considered stationary).
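The differencing operations in this step can be illustrated with a short sketch. The function names and the toy series are assumptions made for illustration; the ADF test itself is not implemented here and would normally be taken from a statistics package.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for t = lag, ..., len(series) - 1."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def sarima_difference(series, d=0, D=0, s=12):
    """Apply D seasonal differences of period s, then d ordinary differences."""
    out = list(series)
    for _ in range(D):
        out = difference(out, lag=s)
    for _ in range(d):
        out = difference(out, lag=1)
    return out


# Toy monthly series (hypothetical): linear trend plus a 12-month pattern.
toy = [t + (t % 12) for t in range(36)]
# One seasonal difference removes the repeating pattern and leaves a constant.
print(sarima_difference(toy, d=0, D=1, s=12))
```

Each seasonal difference shortens the series by s observations, and each ordinary difference shortens it by one.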

Second step: based on the stationary time series, plot the autocorrelation function (ACF) and the partial autocorrelation function (PACF). From the analysis of the ACF and PACF, we can determine the possible values of p, q, P and Q; this process requires both skill and experience. Generally, more than one tentative model is chosen in this step.
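As a rough illustration of what the ACF and PACF graphs display, the sketch below computes the sample autocorrelations and derives the partial autocorrelations from them with the Durbin-Levinson recursion. The function names are hypothetical, and statistical packages may use slightly different estimators for the PACF.

```python
def acf(series, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(v * v for v in dev)
    if c0 == 0:
        raise ValueError("series is constant")
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]


def pacf_from_acf(r):
    """Partial autocorrelations for lags 1..len(r)-1 via Durbin-Levinson.

    r is a list of autocorrelations with r[0] == 1.
    """
    pacf = []
    prev = []  # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, len(r)):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        prev.append(phi_kk)
        pacf.append(phi_kk)
    return pacf
```

For a pure autoregressive process of order p, the theoretical PACF is zero beyond lag p, which is why the PACF graph is typically used to suggest p and the ACF graph to suggest q.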

Third step: estimate the parameters of all tentative SARIMA models by the least squares method and test their significance. Models whose parameters pass the test are feasible and proceed to diagnostic checking of their residuals: if the residuals are approximately white noise according to the Box-Jenkins Q test (Prob > 0.05), the SARIMA model performs well. The best SARIMA model is then selected by the Akaike information criterion (AIC) and the Schwarz criterion (SC); the preferred model is the one with the lowest AIC and SC values.
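The quantities used in this step can be sketched as follows, under two assumptions that the text does not confirm: that the Q test is the Ljung-Box form of the portmanteau statistic, and that AIC and SC are written in their common unscaled forms (some packages divide both by the number of observations, which does not change the ranking of models fitted to the same data). The function names are hypothetical.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion for a model with k estimated parameters."""
    return -2.0 * log_likelihood + 2.0 * k


def schwarz_criterion(log_likelihood, k, n):
    """Schwarz (Bayesian) criterion for k parameters and n observations."""
    return -2.0 * log_likelihood + k * math.log(n)


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic computed from residual autocorrelations."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    c0 = sum(v * v for v in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total
```

Under the white-noise hypothesis, Q is compared with a chi-square distribution whose degrees of freedom equal the number of lags minus the number of estimated ARMA parameters; a large Q (small Prob value) indicates remaining autocorrelation. The p-value is not computed here because it needs a chi-square distribution function from a statistics library.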

Fourth step: predict the TB incidence with the preferred SARIMA model, and then calculate forecast accuracy indexes such as the root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE). The smaller the RMSE, MAE and MAPE, the better the fitting and prediction performance of the SARIMA model.
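The three accuracy indexes follow their standard definitions and can be sketched as below. The function names are hypothetical, and the MAPE is assumed to be expressed in percent.

```python
import math


def _check(observed, predicted):
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")


def rmse(observed, predicted):
    """Root mean square error."""
    _check(observed, predicted)
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def mae(observed, predicted):
    """Mean absolute error."""
    _check(observed, predicted)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed values must be non-zero)."""
    _check(observed, predicted)
    return 100.0 * sum(
        abs((o - p) / o) for o, p in zip(observed, predicted)
    ) / len(observed)
```

RMSE and MAE are in the units of the series (here, cases per 100,000 population), whereas MAPE is scale-free, which makes it easier to compare across series.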

Data processing and analysis

All analyses were performed using ArcGIS 10.4, EViews 7.2, R 3.6.2 and MATLAB 2012b.

Results

From January 2012 to June 2019, a total of 587,344 TB cases and 879 TB deaths were reported in Guangxi. Fig. 1 shows that the TB incidence decreased year by year and exhibited a certain seasonality: the incidence in the second and third quarters was higher than that in the first and fourth quarters.

Fig. 1 The TB incidence in Guangxi from January 2012 to June 2019

We used R 3.6.2 to decompose the TB incidence data and found that the data showed an obvious trend, seasonality and randomness (see Fig. 2), so a SARIMA model is suitable for prediction analysis.
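The text does not say which R function was used for the decomposition. As an illustration only, the sketch below implements a classical additive decomposition with a centered moving average, which is one common way to split a monthly series into trend, seasonal and random components. The function name is hypothetical.

```python
def decompose_additive(x, period=12):
    """Classical additive decomposition: x = trend + seasonal + remainder.

    The trend is a centered moving average (half weights at both ends),
    so it is undefined (None) for the first and last period // 2 points.
    """
    if period % 2 != 0:
        raise ValueError("this sketch handles even periods only")
    n = len(x)
    if n < 2 * period:
        raise ValueError("need at least two full periods")
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        w = x[t - half:t + half + 1]
        trend[t] = (0.5 * w[0] + sum(w[1:-1]) + 0.5 * w[-1]) / period
    # Average the detrended values position by position within the period.
    means = []
    for m in range(period):
        vals = [x[t] - trend[t] for t in range(m, n, period) if trend[t] is not None]
        means.append(sum(vals) / len(vals))
    centre = sum(means) / period
    means = [v - centre for v in means]  # seasonal effects sum to zero
    seasonal = [means[t % period] for t in range(n)]
    remainder = [
        x[t] - trend[t] - seasonal[t] if trend[t] is not None else None
        for t in range(n)
    ]
    return trend, seasonal, remainder
```

A clear, repeating seasonal component together with a smooth trend is the kind of pattern that motivates a seasonal ARIMA model rather than a non-seasonal one.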

Fig. 2 Time series decomposition of TB incidence from January 2012 to June 2019

The data from January 2012 to June 2019 were divided into two parts: the data from January 2012 to December 2018 were used to construct the SARIMA(p,d,q)(P,D,Q)s model, and the data from January 2019 to June 2019 were used to test its prediction performance.
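The chronological split described above corresponds to 84 training months and 6 testing months. A minimal sketch follows, with hypothetical helper names and month labels standing in for the incidence values:

```python
def month_labels(start_year, start_month, count):
    """Return `count` consecutive month labels in YYYY-MM form."""
    labels = []
    year, month = start_year, start_month
    for _ in range(count):
        labels.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return labels


def chronological_split(items, n_test):
    """Keep the last n_test observations for testing; never shuffle a time series."""
    if not 0 < n_test < len(items):
        raise ValueError("n_test must be between 1 and len(items) - 1")
    return items[:-n_test], items[-n_test:]


months = month_labels(2012, 1, 90)  # January 2012 to June 2019
train, test = chronological_split(months, n_test=6)
print(len(train), train[-1], test[0], test[-1])
```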

The SARIMA(p,d,q)(P,D,Q)s modeling method requires the data to be stationary; otherwise, neither backcasts nor forecasts of the series can be obtained. First, the ADF test was used to test the stationarity of the original series. The Prob value of the test was 0.94, greater than 0.05, which showed that the series was not stationary. Because there was obvious seasonality in the TB incidence series in Guangxi (see Fig. 2), we applied a first-order seasonal difference with period 12 to the original series and repeated the ADF test on the seasonally differenced data. The Prob value was less than 0.01, so the data were stationary after the first-order seasonal difference, giving d = 0, D = 1 and s = 12. The test results are shown in Table 1.

Table 1 The ADF tests of the modeling data

Second, we drew the ACF and PACF graphs of the stationary data (see Fig. 3). Based on the analysis of the ACF and PACF graphs, we established eight tentative models: SARIMA(1,0,1)(0,1,0)12, SARIMA(1,0,(2))(0,1,0)12, SARIMA((2),0,1)(0,1,0)12, SARIMA((2),0,(2))(0,1,0)12, SARIMA(2,0,(2))(0,1,0)12, SARIMA(2,0,1)(0,1,0)12, SARIMA(1,0,2)(0,1,0)12, and SARIMA(2,0,2)(0,1,0)12. Then, the least squares method was used to estimate and test the parameters of the eight models, and their AIC and SC values were calculated; the results are shown in Table 2. Only the SARIMA((2),0,(2))(0,1,0)12 model, which had the lowest AIC and SC, passed the parameter test (all Prob values were less than 0.05).
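The notation can be written out explicitly. The first equation is the standard multiplicative SARIMA form with backshift operator B; the second is an assumed reading of the parenthesized orders, in which (2) means that only the lag-2 term is included (a subset model), with no constant term and with the moving average polynomial written with plus signs. These conventions are assumptions and are not stated in the text.

```latex
% General SARIMA(p,d,q)(P,D,Q)_s model
\phi_p(B)\,\Phi_P(B^{s})\,(1-B)^{d}\,(1-B^{s})^{D}\,x_t
  = \theta_q(B)\,\Theta_Q(B^{s})\,\varepsilon_t

% Assumed reading of SARIMA((2),0,(2))(0,1,0)_{12}
(1-\phi_2 B^{2})\,(1-B^{12})\,x_t = (1+\theta_2 B^{2})\,\varepsilon_t

% Equivalent recursive form
x_t = x_{t-12} + \phi_2\,(x_{t-2}-x_{t-14}) + \varepsilon_t + \theta_2\,\varepsilon_{t-2}
```

Under this reading, the model needs 14 earlier observations before it can produce a fitted value, so a series starting in January 2012 would have its first fitted value in March 2013.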

Fig. 3 The ACF and PACF graphs of the stationary series

Table 2 The parameter estimates of the tentative models with their AIC and SC

Finally, we performed diagnostic checking of the residuals of the SARIMA((2),0,(2))(0,1,0)12 model using the Box-Jenkins Q test. The Prob value of the test was greater than 0.05. According to these analyses, the SARIMA((2),0,(2))(0,1,0)12 model was therefore feasible for the prediction of TB incidence in Guangxi.

We used the SARIMA((2),0,(2))(0,1,0)12 model to fit the TB incidence data from March 2013 to December 2018; the RMSE, MAE, and MAPE were 0.98, 0.77 and 5.80, respectively. We then used the model to predict the TB incidence from January 2019 to June 2019; the RMSE, MAE, and MAPE were 0.62, 0.45 and 3.77, respectively. Both the fitting errors and the prediction errors were small, which indicated that the model fitted the data well and had high prediction accuracy. Based on the SARIMA((2),0,(2))(0,1,0)12 model, we predicted the TB incidence in Guangxi from July 2019 to December 2020. The predicted values are shown in Table 3, and the fitted and predicted incidences are compared with the observed incidence in Fig. 4.
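To show how multi-step forecasts arise from such a model, the sketch below iterates the recursion x_t = x_{t-12} + phi2 (x_{t-2} - x_{t-14}) + e_t + theta2 e_{t-2}, setting future errors to zero. This is an assumed reading of SARIMA((2),0,(2))(0,1,0)12 with no constant term; the function name, the coefficient values and the toy series are hypothetical, and the estimated coefficients from the study are not reproduced here.

```python
def forecast_subset_sarima(history, last_two_residuals, phi2, theta2, steps, s=12):
    """Iterated forecasts for x_t = x_{t-s} + phi2*(x_{t-2} - x_{t-2-s})
    + e_t + theta2*e_{t-2}, with all future errors e_t set to zero.

    last_two_residuals holds the in-sample residuals for the last two
    observed time points, oldest first.
    """
    if len(history) < s + 2:
        raise ValueError("need at least s + 2 observations")
    if len(last_two_residuals) != 2:
        raise ValueError("exactly two residuals are required")
    x = list(history)
    forecasts = []
    for k in range(steps):
        t = len(x)
        # Only the first two forecasts can still see an observed residual.
        e_lag2 = last_two_residuals[k] if k < 2 else 0.0
        x_hat = x[t - s] + phi2 * (x[t - 2] - x[t - 2 - s]) + theta2 * e_lag2
        x.append(x_hat)  # later steps reuse earlier forecasts
        forecasts.append(x_hat)
    return forecasts


# Hypothetical toy series and coefficients, for illustration only.
toy_history = [float(v) for v in range(1, 15)]
print(forecast_subset_sarima(toy_history, (0.0, 0.0), phi2=0.5, theta2=0.0, steps=2))
```

With phi2 = 0 and theta2 = 0 the recursion reduces to a seasonal naive forecast, in which each month repeats the value from twelve months earlier.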

Table 3 The observed TB incidence and predicted TB incidence by SARIMA((2),0,(2))(0,1,0)12 model from January 2019 to December 2020

Fig. 4 The fitted and predicted TB incidence by the SARIMA((2),0,(2))(0,1,0)12 model

Discussion

Currently, the annual TB incidence in Guangxi is much higher than the national level, although it has been decreasing slightly each year. The potential gains are diminished by an increasingly large transient population, the emergence of MDR-TB, and comorbid conditions such as AIDS and non-communicable diseases, which have led to a resurgence of TB in recent years [20,21,22]. Additionally, the WHO initiated the End TB Strategy with the target of a 90% reduction in new TB cases by 2035 compared with 2015, and a milestone of reducing the TB incidence by 50% by 2025 relative to 2015 [2]. To accelerate progress towards such a daunting task, corresponding measures and actions are expected at both the national and international levels. At the national level, every province should make efforts, especially provinces with high incidence, such as Guangxi. Appropriate plans cannot be properly formulated without a clear view of the past, current and future levels of this disease; therefore, advanced detection and early response systems for epidemics form an integral part of effective precautions against TB and the reasonable allocation of available health resources.

In this study, the historical trend of TB incidence in Guangxi was carefully analyzed, and a prediction model of TB incidence in Guangxi was then established using the Box-Jenkins method, one of the most widely used time series forecasting techniques because of its structured modeling basis and acceptable forecasting performance. Through the analysis of the trend and the decomposition graph of the original TB incidence, we found that the data had obvious seasonality, trend and randomness, so a SARIMA model was suitable for prediction analysis. Monthly TB incidence from January 2012 to December 2018 was used to construct the SARIMA model, and TB incidence from January 2019 to June 2019 was used to test its predictive ability.

SARIMA models require the data to be stationary. Table 1 showed that the Prob value of the ADF test was 0.94, greater than 0.05, indicating that the original data were not stationary. Considering the seasonal variation of the TB incidence data, we applied a first-order seasonal difference with a period of 12 and then used the ADF test to check the stationarity of the seasonally differenced data. The Prob value of the test was less than 0.01 (see Table 1), which indicated that the differenced data were stationary and could be used to build the SARIMA model. Then, to determine p, q, P and Q in the SARIMA(p,0,q)(P,1,Q)12 model, the ACF and PACF graphs were drawn, and eight tentative models were established from their analysis. The parameters of these tentative models were tested, and the models' performance was compared by AIC and SC. Table 2 showed that the SARIMA((2),0,(2))(0,1,0)12 model had the smallest AIC and SC, all the Prob values of its parameter tests were less than 0.01, and the Prob value of the Box-Jenkins Q test was greater than 0.05, which indicated that the model was feasible for predicting the TB incidence in Guangxi. When the model built on the data from January 2012 to December 2018 was used to fit the original TB incidence, the RMSE (0.98), MAE (0.77), and MAPE (5.80) were small; when it was used to predict TB incidence from January 2019 to June 2019, the RMSE (0.62), MAE (0.45), and MAPE (3.77) were also small, which indicated that the model performed well and its prediction accuracy was high. We predicted the TB incidence in Guangxi from July 2019 to December 2020 based on the SARIMA((2),0,(2))(0,1,0)12 model (see Table 3 and Fig. 4). The results suggested that the trend of the predicted TB incidence was similar to the trend in the previous two years and that TB incidence will decrease slightly. The predicted results can provide a scientific reference for the prevention and control of TB in Guangxi, China.

Conclusions

The incidence of tuberculosis in Guangxi is high, but few prediction studies of the disease have been conducted in recent years, even though advanced detection and early response systems form an integral part of effective precautions against TB and the reasonable allocation of available health resources. In view of this, we used the Box-Jenkins method to establish the SARIMA((2),0,(2))(0,1,0)12 model for predicting the TB incidence in Guangxi. The RMSE, MAE and MAPE of the SARIMA((2),0,(2))(0,1,0)12 model were small, which indicated that the model fitted the data well and that its prediction accuracy was high. Based on the SARIMA((2),0,(2))(0,1,0)12 model, we predicted the TB incidence in Guangxi from July 2019 to December 2020. The results suggested that the TB incidence will decrease slightly and that its trend will be similar to that of previous years. The prediction results can help with reallocating resources to improve the control and prevention of TB in Guangxi, China.