Prediction mapping of human leptospirosis using ANN, GWR, SVM and GLM approaches
Recent reports of the National Ministry of Health and Treatment of Iran (NMHT) show that Gilan has a higher annual incidence rate of leptospirosis than other provinces across the country. Despite several efforts by the government and NMHT to eradicate leptospirosis, it remains a public health problem in this province. Modelling and predicting this disease may play an important role in reducing its prevalence.
This study aims to model and predict the spatial distribution of leptospirosis utilizing Geographically Weighted Regression (GWR), Generalized Linear Model (GLM), Support Vector Machine (SVM) and Artificial Neural Network (ANN) as capable approaches. Five environmental parameters (precipitation, temperature, humidity, elevation and vegetation) are used for modelling and predicting the disease. Data of 2009 and 2010 are used for training, and data of 2011 for testing and evaluating the models.
Results indicate that the approaches utilized in this study can model and predict leptospirosis at a high significance level. To evaluate the efficiency of the approaches, MSE (GWR = 0.050, SVM = 0.137, GLM = 0.118 and ANN = 0.137), MAE (0.012, 0.063, 0.052 and 0.063), MRE (0.011, 0.018, 0.017 and 0.018) and R2 (0.85, 0.80, 0.78 and 0.75) are used, with values listed in the same model order.
Results indicate the practical usefulness of the approaches for spatial modelling and prediction of leptospirosis. The efficiency of the models is as follows: GWR > SVM > GLM > ANN. In addition, temperature and humidity are identified as the most influential parameters. Moreover, the suitable habitat of leptospirosis is mostly within the central rural districts of the province.
Keywords: Leptospirosis, GIS, ANN, GWR, SVM, GLM, Machine learning, Prediction
Abbreviations
AIC: Akaike Information Criterion
ANN: Artificial Neural Network
ARIMA: Autoregressive Integrated Moving Average
BIC: Bayesian Information Criterion
ENVI: Environment for Visualizing Images
FFNN: Feed-Forward Neural Networks
GIS: Geographical Information System
GLM: Generalized Linear Model
GWR: Geographically Weighted Regression
IDW: Inverse Distance Weighting
MAE: Mean Absolute Error
MAUP: Modifiable Areal Unit Problem
MODIS: The Moderate Resolution Imaging Spectroradiometer
MRE: Mean Relative Error
MSE: Mean Square Error
NASA: National Aeronautics and Space Administration
NDVI: Normalised Difference Vegetation Index
NHCN: National Health Care Network
NMHT: National Ministry of Health and Treatment of Iran
RBF: Radial Basis Function
SRTM: Shuttle Radar Topography Mission
SVM: Support Vector Machine
VIF: Variance Inflation Factor
Since the discovery of Leptospira in the bodies of Japanese mine workers over a hundred years ago, human leptospirosis has been treated as a “neglected tropical disease” worldwide. Reports of the World Health Organization show that the annual incidence rate of leptospirosis per 100,000 people varies from 0.1 to 1 in temperate regions and from 10 to 100 in humid regions, and exceeds 100 in tropical areas. A global report of the disease reveals that over 1 million severe cases occur annually, with approximately 60,000 fatalities. As a zoonotic disease, it occurs in tropical and sub-tropical areas with high humidity. The disease is caused by Leptospira bacteria, which live in the urine of mammals such as rodents. Human infection occurs through direct or indirect contact with infected animals or a contaminated environment. Several factors are considered to contribute to the incidence of leptospirosis, including geographical location with frequent rainfall and floods, proximity to mammal reservoirs and human activities. One of the most important reasons for leptospirosis mortality is its resemblance to other diseases such as influenza and dengue fever. Indeed, underestimating its infectiousness and failing to diagnose it in time give rise to fatality.
Rafyi and Magami confirmed the first report of human leptospirosis in Iran in 1968, but no definite report has been made about the current status of human leptospirosis distribution in the country. Human leptospirosis, an endemic disease in the Caspian region, is more widespread in Gilan Province because of its humid and wet climate. In addition, the high population densities of rural districts, farmlands (often paddy fields) and fishing activities help propagate leptospirosis in Gilan. Amongst the provinces, the annual incidence rate of leptospirosis in Gilan is always the highest. In this region, most farmers keep domestic animals in their houses and irrigate their farms using river resources, where the population of leptospirosis-contaminated rodents is abundant. Hence, modelling and predicting leptospirosis will help policy makers to better understand the disease, prioritise regions and budgets for early prevention or treatment and plan accurately. It will also help government policy makers ease the burden of medical and health care expenditure on the province.
Several studies have modelled leptospirosis worldwide [10, 11, 12, 13]. Because its prevalence depends strongly on environmental factors, many studies have elucidated the effect of drivers such as precipitation [14, 15], temperature [16, 17], humidity [18, 19], elevation and vegetation [10, 21] on the distribution of leptospirosis. However, most studies focused on clinical aspects of the disease and on leptospirosis in animals. Based on the literature review and to the best of our knowledge, few papers have addressed spatial modelling and prediction of human leptospirosis utilising Geographical Information System (GIS) and its approaches [11, 12].
GIS is a powerful tool whose capabilities have already been proven in various fields of study such as disease [22, 23, 24] and environment [25, 26, 27]. In disease problems, GIS can play a major role in showing how a disease propagates and in finding the parameters that affect its prevalence. The advantages of GIS have been demonstrated in developed countries, but it is rarely employed for health issues in developing countries such as Iran [29, 30].
Given the heterogeneous relationship between the disease and the effective parameters, methods that account for heterogeneity should be utilized. Geographically weighted regression (GWR) is a common approach that addresses heterogeneity by allowing coefficients to vary across locations in the study area. An advantage of GWR is that it takes the location of parameters as input to improve spatial prediction capability and reduce the effect of heterogeneity. GWR is an efficient approach for modelling in various fields of study [33, 34, 35], especially disease modelling and prediction. However, GWR is a linear method that cannot consider the nonlinear behaviour of the phenomenon. Owing to its high capability in solving nonlinear problems, Artificial Neural Network (ANN), a widely used approach in disease prediction, is selected to predict leptospirosis [36, 37, 38, 39]. Another approach used in this study is the Generalized Linear Model (GLM), a statistical model commonly used in modelling and predicting diseases. It utilizes polynomial regression to investigate the relationship between dependent and independent variables. Finally, Support Vector Machine (SVM), a supervised machine learning method that can be used for both classification and regression analysis, is employed. The SVM classifier takes a set of input data and predicts the class of each input; it is used in various medical applications [30, 43].
This study aims to model and predict human leptospirosis in Gilan Province of Iran, using the capabilities of the GWR, GLM, SVM and ANN approaches. The Background section provides knowledge about leptospirosis and the reasons for its prevalence based on previous studies. The Methods section explains how the data are prepared and presents the fundamentals of the utilized approaches. The Results section presents the results of the models. The Discussion section interprets the data and analyses in detail the information that can be obtained from the results of the models. The final section describes the conclusions of the study and indicates future work.
Data acquisition and preparation
Topographic and vegetation data
Gilan shows remarkable topographic variation, with almost 3700 m of altitude difference between the lowest and highest locations and an average altitude of 1800 m above sea level. Elevation continually decreases from south to north. Owing to the significant variability of elevation, climate and vegetation differ across the study area. The elevation map is obtained from NASA’s 90 m resolution SRTM data. All parameters, such as elevation, are assigned to the centroids of rural districts for further analysis. The ArcGIS tool ‘Extract to Points’ is employed to assign the elevation data to the centroids.
Input parameters and their characteristics
Output of model: Positive reported cases of human leptospirosis across Gilan province
Input of model: Monthly average temperature of rural districts
Input of model: Monthly average rainfall of rural districts
Input of model: Monthly average humidity of rural districts
Input of model: Average height of rural districts
Input of model: Average NDVI of rural districts
GWR model for leptospirosis prediction
To predict leptospirosis, a model is established based on environmental parameters utilising the GWR approach. Five parameters (temperature, precipitation, humidity, elevation and vegetation) in 2009 and 2010, together with disease data, are used as inputs of the model. The model is then used for predicting leptospirosis in 2011.
Fixed and adaptive kernel functions are applicable for the GWR model. A fixed kernel uses a constant bandwidth (distance to neighbours in metres) across the study area, which is the main deficiency of this kernel, whereas an adaptive kernel applies a variable bandwidth (number of neighbours) in each rural district according to the local density of neighbours. In addition to the type of kernel, a bandwidth selection criterion must be defined in the GWR model. Three criteria, AIC, CV and BIC, are available. The adaptive kernel and the AIC criterion are utilized in this study owing to their better performance. Notably, all steps are performed using GWR 4.0 software.
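The adaptive-kernel idea described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the GWR 4.0 implementation: the bisquare kernel shape, the fixed neighbour count `k` (the study selects the bandwidth by AIC instead), the function names and the synthetic data are all hypothetical.

```python
import numpy as np

def adaptive_bisquare_weights(dists, k):
    """Bisquare weights; the bandwidth is the distance to the k-th nearest point."""
    bandwidth = np.sort(dists)[k - 1]
    w = (1.0 - (dists / bandwidth) ** 2) ** 2
    w[dists >= bandwidth] = 0.0          # points at or beyond the bandwidth get no weight
    return w

def gwr_fit_predict(coords, X, y, coords_new, X_new, k):
    """Fit one weighted least-squares model per target location and predict there."""
    Xd = np.column_stack([np.ones(len(X)), X])            # add intercept column
    Xn = np.column_stack([np.ones(len(X_new)), X_new])
    preds = np.empty(len(X_new))
    betas = np.empty((len(X_new), Xd.shape[1]))
    for i, (c, x) in enumerate(zip(coords_new, Xn)):
        d = np.linalg.norm(coords - c, axis=1)            # distances to all training points
        sw = np.sqrt(adaptive_bisquare_weights(d, k))
        beta, *_ = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)
        betas[i] = beta                                   # local coefficients
        preds[i] = x @ beta
    return preds, betas

# Synthetic demo (hypothetical data): 25 centroids on a grid, one predictor.
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.arange(5.0), np.arange(5.0))
coords_demo = np.column_stack([gx.ravel(), gy.ravel()])
X_demo = rng.normal(size=(25, 1))
y_demo = 2.0 + 3.0 * X_demo[:, 0]
preds_demo, betas_demo = gwr_fit_predict(
    coords_demo, X_demo, y_demo, coords_demo, X_demo, k=8
)
```

Because the synthetic relationship is the same everywhere, each local model recovers the same intercept and slope; with spatially varying data the rows of `betas_demo` would differ from one location to the next, which is what the coefficient maps in a GWR study display.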
ANN is a nonlinear model that determines the dependence between input and output parameters by simulating the highly connected processing units (neurons) of the human nervous system. It consists of three layers (input, hidden and output) joined by weighted connections between the inputs and outputs. A major characteristic of ANN is its capability to learn to solve complex problems. Another advantage of ANN is its proper description of nonlinear dependences. However, its black box mechanism is its major shortcoming.
A particular form of ANN is the Multilayer Perceptron (MLP), which consists of multiple layers of nodes in a directed graph. MLPs are Feed-Forward Neural Networks (FFNN) that stream information in one direction from the input to the output layer. MLPs are the most popular FFNNs owing to their efficient training processes.
ANN model for leptospirosis prediction
MLP, a class of FFNN, is utilized for leptospirosis prediction. MATLAB 2018 is used for the MLP implementation. Based on a trial and error approach (Additional file 1), one hidden layer is selected. The final MLP architecture consists of five nodes in the input layer (temperature, precipitation, humidity, elevation and vegetation), one hidden layer with five nodes and one node in the output layer, which represents the incidence rate of leptospirosis. Data of 2009 and 2010 and the Levenberg–Marquardt algorithm are used for training the model to predict the disease in 2011. Weights are randomly initialised, and training stops when the error difference between two consecutive runs of the model is negligible. Notably, under this stopping condition, the maximum number of epochs is 36. The total number of sample points for 2009 and 2010 is 969, of which 290 are selected as the validation set. The learning rate, obtained by trial and error, is 0.01.
SVM, first introduced by Vapnik, is a supervised classifier based on statistical learning theory. In the linear case, the basic SVM tries to maximize the distance between the closest samples of two classes by creating an optimal hyperplane. However, most real-world problems do not behave in a linear manner. To deal with non-linear datasets, SVM utilizes kernel functions to map the data into a higher dimensional space in which the data are linearly separable.
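To make the 5-5-1 architecture and the stopping rule more concrete, the following NumPy sketch builds such a network and trains it on synthetic data. It is not the MATLAB model used in the study: plain gradient descent is swapped in for Levenberg–Marquardt to keep the example short, and the tanh hidden activation, linear output, tolerance, epoch cap and all names are assumptions made for illustration.

```python
import numpy as np

def init_mlp(n_in=5, n_hidden=5, seed=0):
    """Randomly initialise a 5-5-1 network (weights random, biases zero)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.5, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.5, (n_hidden, 1)),
        "b2": np.zeros(1),
    }

def forward(p, X):
    """Return hidden activations and the single linear output per sample."""
    H = np.tanh(X @ p["W1"] + p["b1"])       # assumed tanh hidden layer
    return H, (H @ p["W2"] + p["b2"]).ravel()

def train(p, X, y, lr=0.01, tol=1e-8, max_epochs=5000):
    """Gradient descent on MSE; stop when the error change between epochs is negligible."""
    prev, n = np.inf, len(y)
    for epoch in range(1, max_epochs + 1):
        H, out = forward(p, X)
        err = out - y
        mse = float(np.mean(err ** 2))
        if abs(prev - mse) < tol:            # negligible difference between consecutive runs
            break
        prev = mse
        g_out = (2.0 / n) * err[:, None]               # dMSE/d(output)
        g_hid = (g_out @ p["W2"].T) * (1.0 - H ** 2)   # back-propagate through tanh
        p["W2"] -= lr * (H.T @ g_out)
        p["b2"] -= lr * g_out.sum(axis=0)
        p["W1"] -= lr * (X.T @ g_hid)
        p["b1"] -= lr * g_hid.sum(axis=0)
    return p, mse, epoch

# Synthetic demo (hypothetical data standing in for the five environmental inputs).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 5))
y_demo = 0.5 * X_demo[:, 0] - 0.3 * X_demo[:, 2]
params = init_mlp()
mse_before = float(np.mean((forward(params, X_demo)[1] - y_demo) ** 2))
params, mse_after, epochs_run = train(params, X_demo, y_demo)
```

A second-order method such as Levenberg–Marquardt typically converges in far fewer epochs than this first-order loop, which is consistent with the small epoch count reported above.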
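The kernel idea can be illustrated with a small pure-Python sketch. The polynomial and radial basis function (RBF) kernels below follow their textbook definitions; the parameter defaults (`coef0`, `gamma`) and the explicit feature map `phi2` are illustrative assumptions, not settings taken from the study.

```python
import math

def dot(x, z):
    """Ordinary inner product (the linear kernel)."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """k(x, z) = (x . z + coef0) ** degree"""
    return (dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def phi2(x):
    """Explicit feature map for the degree-2 polynomial kernel with coef0 = 0 in 2-D."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

x, z = [1.0, 2.0], [3.0, 4.0]
implicit = polynomial_kernel(x, z, degree=2, coef0=0.0)   # computed in the input space
explicit = dot(phi2(x), phi2(z))                          # computed in the mapped space
```

Here `implicit` and `explicit` agree, which shows that the kernel evaluates an inner product in the higher dimensional space without ever constructing the mapped vectors.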
SVM model for leptospirosis prediction
Efficiency of different kernel functions
Polynomial (degree 2)
Polynomial (degree 3)
ANN and SVM function as black boxes, so the relative importance of input parameters cannot be investigated directly. However, sensitivity analysis can be used to examine the contribution of each input parameter to modelling and prediction. To perform sensitivity analysis, one parameter is excluded from the model in each run, and the effect of that parameter on model performance is determined based on the evaluation criteria. A larger decrease in performance indicates a greater influence of the parameter.
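The leave-one-parameter-out procedure can be sketched as follows. To keep the example self-contained, an ordinary least-squares model stands in for the ANN or SVM, R2 is the only criterion tracked, and the data are synthetic; the function names and the coefficients used to generate the data are hypothetical.

```python
import numpy as np

def ols_fit_predict(X_train, y_train, X_test):
    """Stand-in model: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ beta

def r2_score(y, y_hat):
    """Coefficient of determination."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def leave_one_out_sensitivity(fit_predict, X_train, y_train, X_test, y_test, names):
    """Refit without each parameter in turn and record the drop in test R2."""
    base = r2_score(y_test, fit_predict(X_train, y_train, X_test))
    drops = {}
    for j, name in enumerate(names):
        keep = [c for c in range(X_train.shape[1]) if c != j]
        reduced = fit_predict(X_train[:, keep], y_train, X_test[:, keep])
        drops[name] = base - r2_score(y_test, reduced)
    return base, drops

# Synthetic demo: the response depends strongly on column 0 and weakly on column 2.
rng = np.random.default_rng(0)
names = ["temperature", "precipitation", "humidity", "elevation", "vegetation"]
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 1.0 * X[:, 2] + 0.1 * rng.normal(size=200)
base_r2, drops = leave_one_out_sensitivity(
    ols_fit_predict, X[:150], y[:150], X[150:], y[150:], names
)
```

Any model exposing the same `fit_predict(X_train, y_train, X_test)` signature could be passed in place of `ols_fit_predict`, so the same loop applies to a neural network or a support vector regressor.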
GLM model for leptospirosis prediction
Pearson correlation coefficients among parameters
Coefficients of parameters using GWR model
Coefficients of parameters using GLM model
Results of sensitivity analysis in ANN model
Results of sensitivity analysis in SVM model
During 2009–2011, reports of leptospirosis in Gilan revealed that it occurs in definite months and disappears for the remainder of the year. This periodic prevalence reflects the relationship between leptospirosis cases and the paddy season: when workers begin to plant or harvest rice, their contact with contaminated water or soil increases, and so does the possibility of disease. In Gilan, rice farming and livestock keeping are popular amongst farmers because the suitable climate contributes to the fertility of the soil, which is essential for farming, and the many rural regions covered by grasslands and forests facilitate feeding animals. Considering that this job is physically demanding, the ratio of men to women workers is approximately 2 to 1 in 2009–2011, which suggests that men are more vulnerable to this disease and deserve more attention (Fig. 7.b). This fact prompted decision makers to carry out prevention programmes, such as boosting the knowledge of workers by explaining the advantages of using gloves during work or bandaging a wound as soon as it occurs. Knowledge and literacy are at low levels in rural districts, so such programmes led to a great decrease in disease reports (almost 1/3) in 2011 (Fig. 7.a).
Spatial modelling of leptospirosis would better clarify different aspects of this phenomenon. Before modelling the disease, the correlation between input parameters should be investigated to check the assumption of independence. Absolute correlation values vary from 0 (no correlation between two parameters) to 1 (maximum correlation between two parameters), and the closer the values are to 0, the more reliable the parameters are as inputs to the model. Based on statistical studies of the assumption of independence, a correlation of less than 0.70 is acceptable. Thus, two-tailed Pearson correlation, a common approach, is used in this study to calculate the correlation amongst all parameters. According to the obtained values, the maximum correlation is between elevation and temperature (0.33) at the 0.005 significance level, and the minimum is between vegetation and temperature (0.11) at the 0.1 significance level. The results show that all values are below the critical threshold (0.70), so the parameters can be reliably utilized in spatial modelling of leptospirosis (Table 3).
In addition to the assumption of independence, multicollinearity should be considered in spatial modelling. Severe multicollinearity increases the variance of the estimated coefficients and decreases the reliability of the model. VIF measures the intensity of multicollinearity amongst independent parameters. According to statistical studies, input parameters with VIF values of less than 10 are acceptable for entering the model. Table 3 shows that the maximum calculated VIF belongs to the vegetation parameter (2.71) and the minimum to the precipitation parameter (1.17). All VIF values are less than 10, which indicates acceptable multicollinearity amongst the input parameters. According to the assumption of independence and the VIF values, the input parameters can be fed to the GWR, GLM, SVM and ANN models for predicting the leptospirosis distribution in this study.
The values of the coefficients calculated for each parameter using GWR and GLM are presented in Table 4 and Table 5. GWR fits a different model for each rural district, so the coefficients of the parameters vary across the study area. The small ranges for elevation (D2009 = 0.17, D2010 = 0.73 and D2011 = 0.13) and vegetation (D2009 = 0.09, D2010 = 0.14 and D2011 = 0.16) reveal an almost uniform and constant effect of these parameters. The large ranges for temperature, precipitation and humidity (1740.69, 321.64 and 812.94, respectively) show that their effects vary across rural districts. Unlike the GWR and GLM models, ANN and SVM operate as black boxes. The coefficients of their parameters cannot be calculated, but sensitivity analysis can be utilized instead. The results of the sensitivity analysis are presented in Table 6 and Table 7, which show the effect of the parameters on spatial modelling of the leptospirosis distribution. According to the four evaluation criteria, omitting the temperature and humidity parameters decreases the fitness of the models, which confirms their importance in modelling the disease. Temperature and humidity do not directly affect the leptospirosis distribution but provide appropriate conditions for the survival of Leptospira and thus indirectly affect the prevalence of leptospirosis. Paddy fields are almost always located in rural districts with higher values of these parameters, and these districts are more vulnerable to disease occurrence, as shown in Fig. 10, where the coefficients are mapped for better understanding of the effect of the parameters on different rural districts. The maps of the humidity and temperature coefficients are closer to the prediction maps and to the reported leptospirosis data in 2011. This finding indicates that these two parameters play more important roles in modelling and predicting leptospirosis.
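The pairwise screening described above can be sketched in pure Python. The Pearson formula is standard; the helper that flags pairs at or above the 0.70 threshold, the column names and the toy values are hypothetical and are not the study data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pairs_above_threshold(columns, threshold=0.70):
    """Return (name_a, name_b, r) for every pair whose |r| reaches the threshold."""
    names = list(columns)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson_r(columns[names[i]], columns[names[j]])
            if abs(r) >= threshold:
                flagged.append((names[i], names[j], r))
    return flagged

# Toy values only (hypothetical), chosen so that some pairs exceed the threshold.
toy = {
    "param_a": [1.0, 2.0, 3.0, 4.0],
    "param_b": [1.0, 3.0, 2.0, 4.0],
    "param_c": [2.0, 1.0, 2.0, 1.0],
}
flagged_pairs = pairs_above_threshold(toy)
```

The absolute value is taken because a strong negative correlation violates the independence assumption just as a strong positive one does. An empty result from `pairs_above_threshold` corresponds to the situation reported in the study, where every pairwise value lies below 0.70.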
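A VIF check of this kind can be sketched with NumPy. The definition used here, VIF_j = 1 / (1 - R_j^2) with R_j^2 obtained by regressing parameter j on the remaining parameters, is the standard one; the function name and the synthetic columns are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(target)), others])   # intercept + other columns
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic demo: column c is almost a copy of column a, column b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)
vifs = vif(np.column_stack([a, b, c]))
acceptable = vifs < 10.0     # the threshold quoted in the text
```

In this demo the near-duplicate columns receive very large VIF values and would be rejected, while the independent column stays close to 1.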
Prediction maps of GWR, GLM, SVM and ANN
The models clarify that the disease occurs more in the central rural districts. The remarkable number of paddy fields and livestock activities there, which bring people into more contact with the contaminated environment, may be the major reasons for this pattern. Given that leptospirosis is an occupational water-borne disease and there are no paddy fields in the southeast of the province, the probability of disease prevalence there is negligible. Visual comparison of the prediction maps shows that the GWR, SVM and GLM models predict high disease prevalence in the central rural districts, while the prediction of the ANN model is less consistent with the reported cases of disease across the study area. Although SVM and GLM give satisfactory results, the GWR prediction map for 2011 is more similar to the map of leptospirosis data in 2011. Model predictions are statistically discussed in the “prediction evaluation” section.
A major advantage of the GWR model is the presentation of local variability and local collinearity, which are not available in modelling with ANN, SVM and GLM. Local variability for each rural district shows the power of the model in different locations across the study area. Figure 9a demonstrates that the GWR model performs more accurately in some rural districts with high local R2. The maximum value is 0.96 and the minimum is 0.16, but the overall R2 is 0.85 for the entire study area (Fig. 9a). The other issue is local collinearity, which is unavoidable in modelling and has adverse effects on the estimation of coefficients. GWR reports local collinearity by measuring the condition number for each location. The condition number measures how much the output value of the model can change for a small variation in its input; according to many studies, condition numbers over 30 indicate decreased reliability of results and are cause for serious concern. Figure 9b indicates that the obtained condition number for each rural district is less than 20, so local collinearity is negligible for the prediction of leptospirosis.
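As a rough illustration of a local condition number, the sketch below computes the ratio of the largest to the smallest singular value of a weighted design matrix whose columns are scaled to unit length. The column scaling, the function name and the synthetic data are assumptions; the exact computation inside GWR 4.0 is not claimed here.

```python
import numpy as np

def local_condition_number(X, weights):
    """Condition number of the weighted, column-scaled design matrix at one location."""
    A = np.column_stack([np.ones(len(X)), X]) * np.sqrt(weights)[:, None]
    A = A / np.linalg.norm(A, axis=0)            # scale each column to unit length
    s = np.linalg.svd(A, compute_uv=False)       # singular values
    return s.max() / s.min()

# Synthetic demo with uniform weights (hypothetical data).
rng = np.random.default_rng(0)
n = 200
w = np.ones(n)
X_ok = rng.normal(size=(n, 2))                               # unrelated predictors
x1 = rng.normal(size=n)
X_bad = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n)])  # nearly collinear predictors
cn_ok = local_condition_number(X_ok, w)
cn_bad = local_condition_number(X_bad, w)
```

In a GWR setting, `weights` would be the kernel weights of the observations around one rural district, so the function would be evaluated once per location and the resulting values mapped.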
Evaluation results of GWR, GLM, SVM and ANN in modelling Leptospirosis
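The four evaluation criteria used to compare the models can be sketched in pure Python. MSE and MAE follow their usual definitions; the form of MRE (mean of absolute error divided by the absolute observed value) and the use of the coefficient of determination for R2 are assumptions, since the exact formulas are not reproduced in this text. The sample values are made up.

```python
def mse(y, y_hat):
    """Mean square error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def mre(y, y_hat):
    """Mean relative error (assumed form); observed values must be non-zero."""
    return sum(abs(a - b) / abs(a) for a, b in zip(y, y_hat)) / len(y)

def r2(y, y_hat):
    """Coefficient of determination (assumed form of R2)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Made-up observed and predicted values.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.0, 5.0, 6.0, 7.0]
scores = {
    "MSE": mse(observed, predicted),
    "MAE": mae(observed, predicted),
    "MRE": mre(observed, predicted),
    "R2": r2(observed, predicted),
}
```

Lower MSE, MAE and MRE and higher R2 indicate a better fit, which is the basis for ranking the models.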
Spatial autocorrelation (Moran’s I) of residuals and significance level
Spatial autocorrelation in the residuals of a model reveals weakness in some parts of the model. In this study, weak but significant spatial autocorrelation is found in the residuals. The environmental parameters model and predict the disease well, but the power of the models is lower in some regions. Moran’s I has been shown to be suitable for the investigation of residuals, so it is used in this study.
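Global Moran’s I for a set of residuals can be sketched in pure Python from its standard definition, I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where S0 is the sum of all spatial weights. The four-unit chain of neighbours and the residual values below are hypothetical, and no significance test is included.

```python
def morans_i(values, weights):
    """Global Moran's I for values with a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)        # sum of all weights
    num = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    den = sum(d * d for d in dev)
    return (n / s0) * num / den

# Four hypothetical districts in a row; adjacent districts are neighbours.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
i_trend = morans_i([1.0, 2.0, 3.0, 4.0], chain)          # similar values are adjacent
i_alternating = morans_i([1.0, -1.0, 1.0, -1.0], chain)  # dissimilar values are adjacent
```

Positive values indicate that similar residuals cluster together, negative values that neighbouring residuals tend to differ, and values near zero indicate no spatial pattern.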
Results of Moran’s I for GWR, GLM and ANN residuals in 2009–2011
Spatial clusters of GWR, GLM and ANN residuals obtained from Moran’s I approach are presented in Fig. 11. It illustrates the performance of models for prediction in various areas. High–High (HH) shows rural districts surrounded by neighbours with high spatial autocorrelation. Low–High (LH) indicates rural districts that have low spatial autocorrelation of residuals, but their neighbours have high values. Low–Low (LL) presents rural districts surrounded by neighbours with low spatial autocorrelation. Given the high spatial autocorrelation in residuals, HH clusters illustrate the rural districts where the models have lower performance in prediction of leptospirosis.
Leptospirosis is predicted in this study utilizing GWR, SVM, GLM and ANN models with five input parameters: temperature, precipitation, humidity, elevation and vegetation. Model predictions are investigated statistically and visually to understand the efficiency of the approaches. According to the results, the performance of the models is as follows: GWR > SVM > GLM > ANN. Spatial autocorrelation of residuals is also used to investigate the deficiencies of the models. The results show that GWR presents the least deficiency in modelling and predicting leptospirosis. Additionally, based on the coefficients of the GWR and GLM parameters and the sensitivity analysis of SVM and ANN, temperature and humidity have greater effects on the leptospirosis distribution. Moreover, analysis of the coefficients shows that higher temperature and humidity coincide with higher disease occurrence in central regions. In contrast, the southeast rural districts have the lowest outbreaks owing to the lack of occupations conducive to leptospirosis propagation. In a nutshell, suitable approaches for predicting leptospirosis can provide health managers and governments with sufficient information to set proper measures for controlling disease prevalence across the study area.
Many studies, including ours, are limited by their data and models. As an analytical shortcoming of many disease studies, the Modifiable Areal Unit Problem (MAUP) shows that the scale of study is crucial in spatial analysis. In this study, the results of leptospirosis prediction are acceptable at the rural district level, but the disease should be examined at other scales to better understand the fitness of the models. Disease data used in this study are based on the addresses of patients, whereas the exact locations of disease occurrence are paddy fields. The paddy fields should be considered as the base level for more accurate analysis, but such data are not available in Iran. More social and epidemiologic parameters should be considered for more accurate prediction.
As future work, the model will be developed by considering socioepidemiologic parameters. Time series models such as Autoregressive Integrated Moving Average (ARIMA) and their comparison with geographically and temporally weighted regression are also considered as future work.
The authors are very grateful for the comments and suggestions of the editor and reviewers, which helped us to revise the manuscript.
AM collected the data and implemented the GWR and GLM approaches; BS implemented the ANN models for prediction; ZG performed the SVM method; BP edited, revised and improved the manuscript as expert professor in this field and also arranged the funding for the publication fees. All authors analysed the data, read and revised the manuscript, and approved the final version.
This research is funded by the Centre for Advanced Modelling and Geospatial Information Systems (CAMGIS), University of Technology Sydney (UTS) under grant numbers 321740.2232335, 323930, and 321740.2232357.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 2. de Vries SG, et al. Travel-related leptospirosis in the Netherlands 2009–2016: an epidemiological report and case series. Travel Med Infect Dis. 2018;24:44–50.
- 3. Rafiei A, et al. Review of leptospirosis in Iran. J Mazandaran Univ Med Sci. 2012;22(94):102–10.
- 20. Ferreira M, Ferreira M. Influence of topographic and hydrographic factors on the spatial distribution of leptospirosis disease in São Paulo County, Brazil: an approach using geospatial techniques and GIS analysis. Germany: International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences; 2016. p. 41.
- 25. Saeidian B, Mesgari MS, Ghodousi M. Optimum allocation of water to the cultivation farms using Genetic Algorithm. Germany: International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences; 2015. p. 40.
- 45. Brunsdon C, Fotheringham S, Charlton M. Geographically weighted regression. J R Stat Soc Ser A. 1998;47(3):431–43.
- 46. Bidanset PE, Lombard JR. Optimal kernel and bandwidth specifications for geographically weighted regression. Applied Spatial Modelling and Planning; 2017.
- 49. Zhang Z. Artificial neural network. In: Multivariate Time Series Analysis in Climate and Environmental Research. Springer; 2018. p. 1–35.
- 51. Walczak S. Artificial neural networks. In: Encyclopedia of Information Science and Technology, Fourth Edition. Finland: IGI Global; 2018. p. 120–31.
- 52. Da Silva IN, et al. Artificial Neural Networks. Switzerland: Springer; 2017.
- 53. Moreira MW, et al. In: International Conference on Frontier Computing. Singapore: Springer; 2017.
- 55. Chatterjee S, et al. Cuckoo search coupled artificial neural network in detection of chronic kidney disease. In: Electronics, Materials Engineering and Nano-Technology (IEMENTech), 2017 1st International Conference on. India: IEEE; 2017.
- 56. Reddy VR, Reddy VV, Mohan VCJ. Speed control of induction motor drive using artificial neural networks-Levenberg-Marquardt Backpropogation algorithm. Int J Appl Eng Res. 2018;13(1):80–5.
- 57. Vapnik VN. Statistical learning theory, vol. 2. New York: Wiley; 1998.
- 63. Nieto PG, et al. A SVM-based regression model to study the air quality at local scale in Oviedo urban area (northern Spain): a case study. Appl Math Comput. 2013;219(17):8923–37.
- 78. Nguyen Q-H. Understanding factors affecting the outbreak of malaria using locally-compensated ridge geographically weighted regression: case study in DakNong, Vietnam. In: Advances and Applications in Geospatial Technology and Earth Resources: Proceedings of the International Conference on Geo-Spatial Technologies and Earth Resources 2017. Vietnam: Springer; 2017.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.