Background

Hepatitis E is a liver disease caused by hepatitis E virus (HEV), a non-enveloped, positive-sense, single-stranded RNA virus that is transmitted mainly through contaminated drinking water or uncooked/undercooked food [1]. Since the earliest report of this water-borne disease in New Delhi, India during 1955 to 1956, it has been epidemic in many developing countries [2]. Every year there are 20 million hepatitis E infections, over 3 million acute cases of hepatitis E, and 70,000 hepatitis E-related deaths in the world. The prevalence is highest in Eastern and Southern Asia [3]. Sporadic hepatitis E has also become an important public health concern in developed countries, causing over 50% of acute viral hepatitis cases in recent years [4–7].

Shanghai is the largest metropolis in China with a permanent population of over 23.8 million. About 14 million are officially registered residents and 9.7 million are migrants. In order to control the spread of HEV, a surveillance system was established and a series of studies of HEV genotype, transmission route, and risk factors for infection have been conducted in Shanghai since 1997 [8–10]. According to surveillance data from the Shanghai Municipal Center for Disease Control and Prevention, hepatitis E has been far more common than hepatitis A since 2004. Many researchers have developed mathematical models to forecast the incidence of hepatitis E.

Few mathematical models are suitable for this task, because time series data of hepatitis E infection have both linear and nonlinear characteristics. The autoregressive integrated moving average (ARIMA) model has become one of the most popular and convenient linear models in time series forecasting [11–14]. Its advantages lie in its statistical properties and in the Box-Jenkins methodology used in the model building process [15]. Although the ARIMA model can fit several different types of time series data, its major limitation is the pre-assumed linearity of the model [16]. In contrast, artificial neural networks (ANNs) can learn and describe highly nonlinear and strongly coupled relationships between multi-input and multi-output variables [17], and do not require a detailed model to be specified. However, ANNs cannot handle both linear and nonlinear patterns equally well [18]. We therefore designed a combined model using an ARIMA model and a neural network to forecast the incidence of hepatitis E in Shanghai.

Methods

Data source

Hepatitis E is one of the Nationally Notifiable Infectious Diseases in China. Upon laboratory confirmation, hospital physicians register each patient’s information in the China Information System for Disease Control and Prevention within 24 hours. Community physicians then conduct an epidemiological investigation, health education, and a three-month follow-up of each patient and their family members. The morbidity data of hepatitis E from 2000 to 2012 were released from the China Information System for Disease Control and Prevention by the Shanghai Municipal Center for Disease Control and Prevention. The annual average population data from 2000 to 2012 were obtained from the Shanghai Public Security Bureau.

The model

The ARIMA-BPNN combined model consisted of an ARIMA model and a back propagation artificial neural network (BPNN). The model was developed to forecast the incidence of hepatitis E in Shanghai. The model was trained using 144 months of morbidity data from January 2000 to December 2011, validated with 12 months of morbidity data from January 2012 to December 2012, and finally employed to forecast the incidence of hepatitis E from January 2013 to December 2013 in Shanghai. The whole process was divided into three steps:

The first step was to determine the best-fitting ARIMA model and to predict the values of each time point. The Box-Jenkins approach was applied to seasonal ARIMA (p,d,q)×(P,D,Q)n modeling of time series data. The model was defined with an autoregressive part of order p, a moving average part of order q, a seasonal autoregressive part of order P, a seasonal moving average part of order Q, differencing and seasonal differencing orders d and D, and a seasonal period n. This model building process was designed to take advantage of associations in the seasonally and sequentially lagged relationships that usually exist in periodically collected data. Model parameters were estimated using the conditional least squares method. Residual analysis, Root Mean Square Error (RMSE), normalized Bayesian Information Criterion (BIC), and stationary R square were used to compare the goodness-of-fit among ARIMA models.
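The role of the differencing orders d and D can be illustrated with a minimal sketch. It assumes d = D = 1 and a seasonal period of 12 months, and it uses a base-10 logarithm; the function names and the synthetic series are hypothetical and are not the Shanghai data or the software used in the study.

```python
import math

def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both terms exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def seasonal_arima_transform(rates, season=12):
    """Log-transform, then apply one non-seasonal and one seasonal difference."""
    logged = [math.log10(r) for r in rates]  # assumption: base-10 logarithm
    return difference(difference(logged, lag=1), lag=season)

# Synthetic monthly rates (illustrative only): a slow upward trend
# multiplied by a 12-month seasonal cycle, 144 months in total.
rates = [
    (0.20 + 0.001 * t) * (1.0 + 0.5 * math.cos(2 * math.pi * t / 12))
    for t in range(144)
]
stationary = seasonal_arima_transform(rates)

# One observation is lost to the non-seasonal difference and twelve more
# to the seasonal difference.
print(len(rates), len(stationary))
```

Under these assumptions, each differencing pass shortens the series by its lag, so a 144-month training series yields fewer usable points once both differences have been applied.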

The second step was to train the BPNN. The neuron model and network architectures of the BPNN have been previously reviewed [19]. In our study, the BPNN architecture consisted of three layers. In the input layer, 2 neurons received the predicted morbidity values from ARIMA and the corresponding time values; in the hidden layer, 3 neurons estimated the actual morbidity values as targets and made a simulation; and in the output layer, 1 neuron delivered the forecasted incidence. The neurons in the hidden layer had a hyperbolic tangent sigmoid transfer function and the neuron in the output layer had a linear transfer function (Figure 1). A Bayesian regularization back-propagation algorithm was used to train the network and provide a unifying approach for dealing with issues of model complexity and overfitting [20].
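The forward pass of a network with this 2-3-1 layout can be sketched as follows. This is only an illustration of the architecture described above: the weights and biases are hypothetical placeholders rather than trained values, input scaling is not shown, and the Bayesian regularization training step is omitted.

```python
import math

def bpnn_forward(arima_pred, time_value, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a 2-3-1 network: tanh hidden layer, linear output."""
    inputs = (arima_pred, time_value)
    # Hidden layer: hyperbolic tangent sigmoid transfer function.
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    # Output layer: linear transfer function.
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

# Hypothetical parameters, chosen only to make the sketch runnable.
w_hidden = [[0.8, 0.1], [-0.5, 0.3], [0.2, -0.4]]  # 3 hidden neurons x 2 inputs
b_hidden = [0.0, 0.1, -0.1]
w_out = [0.6, -0.2, 0.3]
b_out = 0.05

forecast = bpnn_forward(0.25, 0.5, w_hidden, b_hidden, w_out, b_out)
print(forecast)
```

Because the output neuron is linear, the network output is not confined to the range of the hidden-layer activations, which suits a regression target such as a morbidity rate.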

Figure 1

The combination of ARIMA and BPNN models. The ARIMA-BPNN combined model consisted of three layers: 2 neurons collected predicted morbidity values from ARIMA and corresponding time values in the input layer, 3 neurons estimated the actual morbidity values as targets and made a simulation in the hidden layer, and 1 neuron transferred the forecasted incidence to the output layer.

The third step was to validate the combined model with 12 months of morbidity data from January 2012 to December 2012 and to further forecast the incidence of hepatitis E in 2013.

The mean error rate (MER) was used to compare the predicted and actual values of the single ARIMA and ARIMA-BPNN combined models in 2012.
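The text does not give a formula for the MER. The sketch below assumes one common definition, the mean absolute error divided by the mean of the actual values; this definition, the function name, and the sample numbers are assumptions for illustration and may differ from the calculation used in the study.

```python
def mean_error_rate(actual, predicted):
    """Assumed MER: mean absolute error divided by the mean actual value."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    mean_abs_error = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    mean_actual = sum(actual) / len(actual)
    return mean_abs_error / mean_actual

# Illustrative monthly rates per 100,000 population (not the study data).
actual = [0.30, 0.20, 0.10]
predicted = [0.27, 0.22, 0.11]
print(mean_error_rate(actual, predicted))
```

Under this definition a lower MER means that the predictions lie closer, on average, to the observed values relative to the overall level of the series.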

Data processing and analysis

An augmented Dickey-Fuller test and the X-12-ARIMA seasonal adjustment program of Eviews 5.0 (http://www.eviews.com) were employed to assess the stationarity and seasonality of the time series data [21]. All analyses were performed using SPSS 17.0 (Chicago, IL, USA) and MATLAB 7.0 (Natick, USA).

Ethical review

The study protocol and utilization of hepatitis E morbidity data were reviewed by Shanghai Municipal Center for Disease Control and Prevention and no ethical issues were identified. Therefore, no ethics approval was required by our Investigation Review Board.

Results

General patterns of hepatitis E

A total of 7,489 sporadic hepatitis E cases were reported in Shanghai from 2000 to 2012. This included registered residents and the immigrant population. The annual incidence rate declined to 2.307 per 100,000 population in 2012 and then fluctuated from 2.307 to 4.240 per 100,000 population (Table 1). The male morbidity was significantly higher than that of females (t=8.951, P<0.001). The X-12-ARIMA seasonal adjustment program showed that the monthly morbidity data of hepatitis E from 2000 to 2012 had seasonal variations with a peak during January-March and a nadir during August-October (F=40.02, P<0.001) (Figure 2).

Table 1 The morbidity of hepatitis E in Shanghai from 2000 to 2012 (per 100,000 population)
Figure 2

Comparison of actual, predicted and forecasted morbidity rates of hepatitis E (2000–2013) in Shanghai, China. The x-axis represents calendar time from 2000 to 2013. The y-axis represents actual morbidity rates and predicted/forecasted morbidity values of hepatitis E (per 100,000 population). From January 2001 to December 2012, morbidity values were predicted using the best-fitting ARIMA model or the ARIMA-BPNN model. From January 2013 to December 2013, morbidity values were forecasted using the best-fitting ARIMA model or the ARIMA-BPNN model. Forecast values for the two models were 0.259 and 0.372 (Jan), 0.305 and 0.356 (Feb), 0.301 and 0.315 (Mar), 0.259 and 0.290 (Apr), 0.215 and 0.256 (May), 0.161 and 0.216 (Jun), 0.138 and 0.163 (Jul), 0.123 and 0.120 (Aug), 0.114 and 0.095 (Sep), 0.118 and 0.101 (Oct), 0.134 and 0.146 (Nov), 0.158 and 0.187 (Dec), respectively. 95% confidence intervals are presented.

The best-fitting ARIMA model

Since the time series data of hepatitis E morbidity had both seasonal and non-seasonal trends, a logarithmic transformation and non-seasonal and seasonal first-order differencing were employed to stabilize the series (augmented Dickey-Fuller test: t = −13.23, P<0.001). The goodness-of-fit (stationary R2 = 0.531, RMSE = 0.084, BIC = −4.768, Ljung-Box Q statistic = 15.59, P = 0.482) and parameter estimates (Table 2) determined the best-fitting ARIMA model to be ARIMA (0,1,1)×(0,1,1)12. The equation was $\lg Y_t = \epsilon_t - 0.678\,\epsilon_{t-1} - 0.679\,\epsilon_{t-12} + 0.460\,\epsilon_{t-13}$, where the left-hand side denotes the log-transformed series after the non-seasonal and seasonal first-order differencing.
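A one-step-ahead forecast from the reported coefficients can be sketched as follows. The sketch assumes that the left-hand side of the equation is the log series after both first-order differences, so that lg Y_t = lg Y_{t−1} + lg Y_{t−12} − lg Y_{t−13} plus the moving average terms, with the unknown current error set to zero. The function name and the way residuals are supplied are hypothetical.

```python
THETA = 0.678           # non-seasonal MA coefficient from the reported equation
SEASONAL_THETA = 0.679  # seasonal MA coefficient from the reported equation
CROSS_TERM = 0.460      # reported lag-13 coefficient, close to 0.678 * 0.679

def one_step_forecast(log_rates, residuals):
    """Forecast lg Y for the next month from at least 13 past months."""
    if len(log_rates) < 13 or len(residuals) != len(log_rates):
        raise ValueError("need at least 13 log rates and matching residuals")
    z, e = log_rates, residuals
    # Undo the differencing: lags 1, 12 and 13 relative to the forecast month.
    differenced_part = z[-1] + z[-12] - z[-13]
    # Moving average part; the error of the forecast month itself is taken as 0.
    ma_part = -THETA * e[-1] - SEASONAL_THETA * e[-12] + CROSS_TERM * e[-13]
    return differenced_part + ma_part

# Illustrative inputs only (not the study data).
log_rates = [0.1 * i for i in range(13)]
residuals = [0.0] * 13
print(one_step_forecast(log_rates, residuals))
```

The forecast on the original scale would then be obtained by raising 10 to the returned value, assuming that lg denotes the base-10 logarithm.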

Table 2 Parameters for the final seasonal ARIMA (0,1,1)×(0,1,1) 12 model

The predicted values from the best-fitting ARIMA model in 2012 fluctuated from 0.135 to 0.362 per 100,000 population, with the same seasonal variation as the actual values. The MER of the best-fitting ARIMA model was 0.250 (Table 3, Figure 2).

Table 3 Predicted and error rates of the single ARIMA model and ARIMA-BPNN combined model in 2012

ARIMA-BPNN combined model

To construct the ARIMA-BPNN combined model, the predicted morbidity values from the best-fitting ARIMA model and the corresponding time values were used as input (2×131 matrix), while the actual morbidity values were used as target data (1×131 matrix) (Figure 1). The values fitted by the combined model for 2012 fluctuated from 0.117 to 0.345 per 100,000 population. The MER of the ARIMA-BPNN combined model was 0.176, lower than the 0.250 MER of the single ARIMA model. This indicates that the combined model tracked the 2012 validation data more closely than the single ARIMA model.
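The shapes of the input and target matrices can be illustrated with a short sketch. The encoding of the time values is not specified in the text; the sketch assumes a simple running month index, and the function name and placeholder values are hypothetical. One plausible reading of the 131 columns is that 13 of the 144 training months are lost to the non-seasonal and seasonal differencing.

```python
def build_training_matrices(arima_predictions, actual_rates):
    """Return (inputs, targets) as row-major lists: 2 x N inputs, 1 x N targets."""
    if len(arima_predictions) != len(actual_rates):
        raise ValueError("predictions and actual rates must have the same length")
    # Assumption: time is encoded as a running month index starting at 1.
    time_values = list(range(1, len(actual_rates) + 1))
    inputs = [list(arima_predictions), time_values]
    targets = [list(actual_rates)]
    return inputs, targets

# Placeholder values only; 131 = 144 training months minus 13 lost to differencing.
n_months = 144 - 1 - 12
inputs, targets = build_training_matrices([0.20] * n_months, [0.25] * n_months)
print(len(inputs), len(inputs[0]), len(targets), len(targets[0]))
```

Each column of the input matrix pairs one ARIMA prediction with its time value, and the matching column of the target matrix holds the observed rate that the network is trained to reproduce.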

The combined model was then used to forecast the incidence of hepatitis E in 2013. The forecast showed continued fluctuation within a narrow range from 0.095 to 0.372 per 100,000 population, with a peak during winter (January-March) and a nadir during autumn (August-October) (Figure 2).

Discussion

Hepatitis E is generally regarded as a disease predominantly restricted to areas with poor sanitation and polluted drinking water supplies [22]. However, more cases due to zoonotic spread and unclear transmission routes are occurring in non-endemic areas including Shanghai, China [10, 23, 24]. A total of 7,489 hepatitis E cases were reported in Shanghai from 2000 to 2012. The incidence fluctuated between 2.307 and 4.240 per 100,000 population, with seasonal variations. This has led to a major shift in the understanding of the epidemiology of hepatitis E and warrants further study.

Compared to blood-borne infectious diseases (e.g. hepatitis B and C, AIDS), hepatitis E is more affected by environmental and natural factors. These factors lead to a seasonal variation in incidence, and the multiple factors involved make modeling difficult. Time series analysis has the advantage of forecasting the incidence without focusing on specific risk factors; however, a linear time series model cannot describe a nonlinear trend in incidence data. ANNs have been widely accepted as a potentially useful means of modeling complex nonlinear and dynamic systems, which removes the need for model builders to specify the precise functional form of the relationship that the model seeks to represent. However, they still require knowledge of, and prior information about, the systems of interest [25–27]. It has been argued that combining multiple models for forecasting may provide better estimates than single time series models, by taking advantage of each model’s capabilities [18, 28]. Accordingly, we constructed a hybrid architecture comprising an ARIMA model and a neural network for forecasting hepatitis E incidence and validated its efficacy. The MERs of the single ARIMA model and the ARIMA-BPNN combined model were 0.250 and 0.176, respectively. The combined model forecasted that the incidence of hepatitis E in Shanghai in 2013 would be similar to that of previous years, and that there would be a seasonal variation with a peak during winter and a nadir during autumn.

We determined that an ARIMA-BPNN combined model fit the time series data of hepatitis E morbidity in Shanghai better than a single ARIMA model. However, this combined method cannot be applied to all time series data, because it assumes that the relationship between the linear and nonlinear components is additive. If the relationship is different (e.g. multiplicative), the performance of the combined method would be reduced [29]. The morbidity of hepatitis E is influenced by many environmental and natural factors, which are dynamic and may evolve over time. Thus, the parameters of an ARIMA-BPNN combined model should be periodically re-assessed against continuously updated data to maintain long-term sustainability and precision.

Conclusions

Time series analysis demonstrated a seasonal pattern of hepatitis E infection in Shanghai, China. An ARIMA-BPNN combined model was used to describe the linear and nonlinear patterns of the time series data. This model effectively forecasts hepatitis E infection. We focused on the ARIMA-BPNN combined model because single ARIMA and BPNN models had been intensively studied. The construction and interpretation of other combined analyses should be explored.