Background

With rapidly expanding urban populations, recent refugee movements, internal displacement, and widespread civil strife, parts of the Middle East are becoming increasingly vulnerable to vector-borne diseases (VBDs). Urban growth, in particular, has expanded faster than the supply of available housing and supporting infrastructure, resulting in a large proportion of the population living in unsanitary conditions, which may increase human exposure to bites of infected vectors that proliferate in urban environments [14].

Prior to the recent period of civil conflict, the region has experienced the emergence and resurgences of two deadly diseases. Rift Valley Fever Virus caused an epidemic in Egypt in 1977, 1993 and 2003, and hit Saudi Arabia and Yemen in 2000 and 2001 [57]. In addition, West Nile virus, a nearly “forgotten” disease re-emerged in a severe country-wide epidemic in Israel in 2000, with a fatality rate of 8.4% [8]. Between 2010 and 2012, 200 West Nile cases were reported in Israel [9]. While the causes for the recent resurgence of neglected tropical diseases like West Nile Virus, Rift Valley Fever Virus, and Leishmaniasis are not fully understood, it is cause for renewed interest and increased attention to disease dynamics in the region [2, 10, 11].

Molecular and epidemiological studies have shown that Cx. pipiens is a vector of both Rift Valley Fever virus and West Nile Virus [6, 1216]. It has a preference for living near areas of high human population density, and breeding in the artificial containers abundant in close proximity to human settlements [17, 18] or polluted pools of water associated with human activities [1922]. The species also feeds opportunistically from a wide variety of blood hosts [23, 24], and is active nearly year round [21]. This vast ecological plasticity makes it a potentially significant vector. As such, understanding the distribution of Cx. pipiens throughout the region is a fundamental requirement to understanding transmission dynamics.

Habitat suitability models for Cx. pipiens in general have identified rainfall, temperature, and vegetation to be major drivers of population distribution [25, 26]. Density of hosts has also been found to be significantly correlated with habitat suitability in related species [27]. However, these studies were largely conducted in different climatic regions than those dominant in the MENA region, and therefore the identified predictors would not necessarily explain vector distributions in arid and semi-arid regions. Environmental predictors will be most informative at scales and across regions where they exhibit sufficient heterogeneity to accurately discern suitable from unsuitable habitat. The relative information values of predictors shift as underlying climatic conditions change, and so does the nature and identity of the most significant vector-habitat relationship in the model. In Egypt, climatic variables such as ambient temperature, relative humidity, and wind speed did not significantly distinguish sites with high and low filariasis transmission vectored by Cx. pipiens, which was positively correlated with temperature and negatively correlated with rainfall [28]. In Australia, the critical climate factors for predicting outbreaks of Ross River virus vary significantly between different environmental regions within a single state [29]. Simulation models of Cx. quinquefasciatus across the southern United States revealed significantly different sensitivities of mosquito populations to temperature and precipitation in arid and humid habitats [30].

The aim of the present work is to investigate the relationship between the distribution of Cx. pipiens and the environment in the Middle East and subsequently derive vector distribution surfaces. Accordingly, we used data collected from a wide range of habitats in the region and two different but robust presence-only species distribution models (SDMs), Maxent and boosted regression trees (BRT). The models examined the role of human populations, climatic, hydrological, and vegetation parameters in predicting habitat suitability.

Methods

Selecting modeling approaches

In order to strengthen confidence in predicting Cx. pipiens distributions, we incorporated two modeling approaches, Maximum entropy implemented in Maxent software, and boosted regression trees. Regression models are powerful tools for selecting relevant predictors and modeling complex interactions, while boosting avoids misclassification problems inherent in a single tree model. Thus our combined models would enable powerful selection of relevant predictors and accurate modeling of complex interactions. The previously discussed non-uniform relationships between other Culex species and critical climate factors [30] suggest such interactions may possibly contribute significantly to Cx. pipiens distributions as well. Thus, the use of model intercomparisons leads to greater confidence in the predictions of the distribution. Using Maxent as our alternative method allows for comparisons with the predictions of one of the most familiar and frequently applied presence only modeling techniques [3134].

Modeling species distribution using maximum entropy

Maximum entropy is a machine learning algorithm that produces predictions of habitat suitability by comparing the conditional density of predictors at presence sites with the marginal density of predictors across the study area. The raw output of Maxent is a probability of habitat suitability [35, 36]. To make model output more interpretable, Maxent converts the exponential values of its raw output estimate of habitat suitability into a logistic output that represents an estimate of probability of presence [37].

Modeling species distribution using boosted regression trees

Boosted regression trees predict the value of a target variable based on the value of several input variables. Boosting uses a machine learning algorithm to produce a final prediction model that is an ensemble of weak prediction models, in this case, an ensemble of regression trees. The model is built in a stage wise fashion; a regression tree is fit to the original data, then the residuals of that model become the new data values, to which a second tree is fit, and so forth. By fitting each subsequent tree in the model to the residuals of the previous tree, the data become re-weighted in each iteration. Points that were misclassified by the previous model will now have more weight than values that were classified correctly. As a result, the subsequent tree will focus on fitting these misclassified points. The process is conditioned by the learning rate, which controls the contribution of each tree to the final model.

The final model can consist of thousands of trees, however, over-fit models will exaggerate minor fluctuations in the data, making them poor predictors. Boosted regression uses cross validation to minimize over-fitting by determining when adding additional trees no longer improves predictive performance, and selecting that optimum number of trees.

Predictive error is measured as the Bernoulli residual deviance between the predicted values of the model and the observed values of the test data. All BRT models are fitted in R using the ‘gbm’ and ‘dismo’ libraries [3840].

Target species and occurrence data

Our study area of interest (Figure 1) focuses on an approximately 5 million square kilometer region of the MENA that encompasses Egypt, Israel, Lebanon, Syria, Jordan as well as substantial portions of Turkey, Saudi Arabia, and Iraq.

Figure 1
figure 1

Sampling locations of Culex pipiens .

The Cx. pipiens complex are the most widely distributed mosquito species in the world and two biotypes of Cx. pipiens L. are present in our study area, the anautogenous and autogenous biotypes. These mosquito biotypes breed in overlapping niches and readily hybridize in areas where they coexist [41]. In Egypt, both autogenous and anautogenous Cx. pipiens individuals were encountered in the progeny of autogenous or anautogenous female parents and more than 75% of field caught females produced mixed progenies [42].

Cx. pipiens is the most abundant mosquito in Lebanon, collected both indoors and outdoors. Active and abundant year round, it is anthropophilic, endophagic, and endophillic [21]. The species is highly behaviorally plastic, while Cx. pipiens in the Al Sharqyia governorate of Egypt’s Nile delta also displays endophagic and primarily anthropophiliic behavior [24], mosquitoes in the neighboring governorate have been observed to switch from indoor to outdoor feeding with seasonal shifts in temperature and wind speed [43]. Females will feed from a diverse range of hosts, including horses, cows, sheep, dogs, cats, humans and rats [23].

Cx. pipiens breeds in water with high organic content [41]. In Lebanon breeding sites for Cx. pipiens include containers which range in size from small cans filled with water to garden pools and irrigation ditches [21]. In the urban habitat of Cairo, the common breeding habitats for Cx. pipiens are cesspits, drainage canals, springs, cesspools, and irrigation ditches [22].

Training data from both Egypt and Lebanon were included to maximize sampling of the species’ environmental range, from the hot and hyperarid to cool and subhumid. This range of samples maximizes the combinations of environmental factors used in calibrating the models, increasing their predictive power [44, 45].

Occurrence records for Cx. pipiens were obtained from previously unpublished vector surveys in Egypt and Lebanon (Figure 1). Adult samples from Lebanon were collected in the summer of 2009. Methods included CDC light traps with a dry ice attractant, and BG-sentinel traps with pheromone attractant. Traps were placed in outdoor and indoor habitats at 5 pm and collected the following morning. In addition to adult sampling, larval stages of mosquitoes were sampled in the summers of 2002 and 2003. Samples were collected by the classical dipping method using a white tray and/or a ladle from both artificial and natural breeding sites including swamps, ponds, riverbanks, irrigation tanks, and water storage tanks. Coordinates for sampling locations were obtained from Google Earth and GPS. Adult specimens were pinned and 4th instar larval specimens were mounted on slides and identified morphologically using local and regional identification keys [46]. Occurrence records from Egypt (Gad, unpublished data) were collected using similar methods as part of a nation-wide survey.

Environmental layers

We considered twenty-four environmental variables as potential predictors of Cx. pipiens habitat (Table 1). Temperature and precipitation were represented by 19 variables of the WorldClim dataset [47]. Quality of vegetative cover was described by the standard deviation, mean, and maximum values of the enhanced vegetation index (EVI), as derived from Moderate Resolution Image Spectroradiometer (MODIS) imagery from the Terra Satellite for 2001. The EVI provides a measure of photosynthetic activity or landscape greenness and hence captures vegetation features such as leaf area, canopy cover, sugar resources that may provide resting sites and alternate food sources for this vector species [4850]. As available soil water limits primary productivity in the study region, mean and maximum EVI are generally correlated with annual rainfall, whereas the standard deviation of EVI provides a measure of vegetation seasonality, which is controlled by a variety of factors including temperature, day length, insolation, irrigation, and biotic factors [51]. An additional covariate related to potential vector breeding habitat was derived from topographic data, the topographical wetness index (TWI), which describes the tendency of water to collect in areas of topographic minima [52]. Because information regarding the distribution of hosts greatly improves models of mosquito distributions [53] and Cx. pipiens is adapted to breeding and feeding in human-altered landscapes, we also included human population density as a predictor using data from Landscan™ 2010. Environmental layers were gridded to a spatial resolution of approximately 1 km (926.63 m), a scale that captured as much environmental heterogeneity as possible within the limits of computer processing capability.

Table 1 Data sources of environmental predictors used in species distribution models

Data processing

Presence Data: All point locations recorded as positive for Cx. pipiens from Egypt (n = 239) and Lebanon (n = 83) surveys were included as presence locations in all models (Figure 1).

Background Data: Both boosted regression and maximum entropy methods predict areas with a high probability of presence of conditions suitable for the target species by comparing the values of environmental predictors at presence locations to the values of the same predictors at a set of random “background” locations, which represent the range of available environmental values [54]. Because our data, like many presence-only data sets [55, 56], showed a strong bias in sampling effort, we selected background points with the same geographical bias as the occurrence data (Additional file 1).

An independent data set of 79 Cx. pipiens locations, 23 from Israel and 56 from Egypt [57], was used to assess accuracy of model predictions (Additional file 2: Figure S1). These data were used in all Maxent test data calculations.

Assessing the contribution of human population density to Cx. pipiens distribution models

Human population density has the potential to be a strong predictor of Cx. pipiens habitat [53, 58]. However, because the survey locations of the original studies were all in close proximity to populated areas, population density is also a strong predictor of sampling effort. To control for this we ran the entire modeling process twice; once on the complete set of environmental variables (N = 24), and once using all environmental variables except population density (N = 23). Including population density allows us to examine the relative influence of an important socioeconomic environmental factor, while excluding human populations allows us to examine the bio-physical environmental factors on their own.

To increase the interpretability of the models both predictor sets were reduced using the ‘gbm.simplify’ function, which creates BRT models with every possible combination of the initial predictors, and based on minimizing predictive error, ranks predictors from most to least influential, and identifies the optimal predictor set [59].

Creating species distribution models

BRT models using the reduced predictor sets were fitted using a learning rate of 0.05, a tree complexity of 5, a bag fraction of 0.75, 10-fold cross validation, and a Bernoulli loss function. All other parameters were left at default settings. Predictions in geographic space were made in R using the “raster” package [60].

Maxent models using the reduced predictor sets were fit using Maxent v 3.3.3 k [61]. Spatial predictions of the model were "clamped", a Maxent feature which reduces the value of the predictions when extrapolating into areas of parameter space that exceeded the range of values present in the training data. The final model was the average logistic output of 10 repetitions of the modeling process. Additional specifics on program specifications can be found in Additional file 1.

Evaluating and comparing model predictions

Maxent models were evaluated using the area under the receiver operating characteristic curve (AUC), the corrected Aikaike information criterion (AICc), and the omission rate. BRT models were evaluated using AUC, point biserial correlation (COR), and the Bernoulli deviance. Because the two techniques use different test points for the intrinsic measures of model performance, the point biserial correlation was also calculated for all models at a common set of 158 points made of the 79 test presence locations and 79 random background points (Additional file 2: Figure S1). Pearson’s correlation coefficient was used to measure how closely predictions agreed between the two methods. Agreement was compared in the “training region”, using the common 158 points, as well as an “expanded region” of 158 points drawn from the entire model extent (Additional file 2: Figure S1).

Results

Cx. pipiens habitat suitability

The predicted maps of suitable Cx. pipiens habitat are presented in Figures 2, 3, 4 and 5. Overall, maximum entropy models predicted a broader distribution of suitable habitat than boosted regression methods, which produced much more conservative predictions extent of suitable habitat and a lower probability of presence. However, relative suitability of habitats was consistent between models: areas of highest suitability identified by BRT were also the areas of highest suitability selected by maximum entropy.

Figure 2
figure 2

Maxent Pop. Probability of Culex pipiens presence based on a species distribution model generated using maximum entropy and a set of environmental predictors (N = 9) including human population density. Habitat suitability has been converted to probability of presence, assuming a prevalence of 0.5. Bottom: Enlarged to show population centers.

Figure 3
figure 3

Maxent No Pop. Probability of Culex pipiens presence based on a species distribution model generated using maximum entropy and a set of environmental predictors (N = 14) that do not include human population density. Habitat suitability has been converted to probability of presence, assuming a prevalence of 0.5. Bottom: Enlarged to show population centers.

Figure 4
figure 4

BRT Pop. Probability of Culex pipiens presence based on a species distribution model generated using boosted regression trees and a set of environmental predictors (N = 9) that includes include human population density. Bottom: Enlarged to show population centers.

Figure 5
figure 5

BRT No Pop: Probability of Culex pipiens presence based on a species distribution model generated using boosted regression trees and a set of environmental predictors (N = 14) that does not include human population density. Bottom: Enlarged to show population centers.

All four models agree in predicting a moderate to high probability of Cx. pipiens presence in several distinct habitats: along the banks of the Nile and throughout the Nile Delta, throughout the coastal plains of Israel, Lebanon, and Syria, as well as the valleys of east Lebanon and along Israel’s Jordanian border (Figures 2, 3, 4, and 5). In models built using human population data (Figures 2 and 4), habitats with the highest probability of presence were found near large population centers. Both BRTPop and MaxentPop models also indicate moderate probability of presence along the banks of the Tigris and Euphrates rivers. Models built without population data (Figures 3 and 5) predicted greater probability of presence in the semi-arid Central Anatolia region of Turkey, the Jazirah plain in North East Syria and Northern Iraq, and the marshes of southern Iraq than the population-inclusive models. However, much of Iraq and Eastern Turkey occupy a region of environmental space that falls outside the range of the training data (Additional file 2: Figures S3 and S4), so predictions in these areas should be interpreted cautiously.

Model accuracy

Evaluation metrics for Maxent and BRT models are presented in Tables 2 and 3. Among Maxent models, both sets of predictors performed well, with AUC values significantly greater than the null model. The test AUC values for both predictor sets were smaller than the training AUC values, which indicates slight over-fitting of the models. This difference was smaller for the model created without population data. The “NoPop” model also had a higher test AUC value, greater entropy, and a smaller omission rate than the model that included population. The population-inclusive model had a lower AICc than the “NoPop” model.

Table 2 Evaluation parameters for species distribution models built using human population data
Table 3 Evaluation parameters for species distribution models excluding human population data

The boosted regression models also performed well, and also displayed a degree of over-fitting similar to the Maxent models. Among the BRT models, population-inclusive models performed slightly better than the “NoPop” models on test data- with higher cross-validated AUC values, higher cross-validated COR values, and lower cross validated deviance scores.

When evaluated over a standard set of points, the Maxent “NoPop” model had the strongest correlation (COR = 0.534) between model predictions and presence/pseudo-absence, the BRT NoPop and Maxent population-inclusive model performed equally well, and the BRT population-inclusive model had the weakest correlation (COR = 0.413) of the four models evaluated.

Model agreement

When evaluated over the training region, the predictions of Maxent and BRT models were very strongly correlated, with models including population data more closely correlated than models which excluded it (Figure 6). When evaluated over the entire modeled region, the correlation between models was weaker.

Figure 6
figure 6

Correlation between predictions of Culex pipiens distribution models created using boosted regression trees (BRT) and maximum entropy (Maxent) algorithms. “Training region” consists of 158 points taken from within a similar geographical area to the samples used building the model, 79 background points and 78 independent occurrence points. “Expanded Region” measures model agreement throughout the entire modeled extent, and compares values at 158 random points taken with equal probability from the entire model extent. Left panels describe models built using human population density as a parameter (N = 9). Panels on the right describe models excluding human population density (N = 14). COR = Pearson correlation coefficient.

Contribution of environmental predictors to Cx. pipiens distribution models

Parameter reduction of the full predictor set (n = 24) resulted in a simplified set of 9 significant predictors. (Table 4). Parameter reduction of the full predictor set excluding human population density (n = 23) resulted in a simplified set of 14 significant predictors (Table 5).

Table 4 Contribution of environmental parameters to SDMs including human population data
Table 5 Contribution of environmental parameters to SDMs excluding human population data

Both modeling techniques selected similar variables as the most important environmental factors driving species distribution. Population was the largest single contributor to model predictions, ranked as most influential under both BRT (34.9%) and Maxent (60.6%).

In models created without human population density, the variable with the single largest contribution was the standard deviation of the enhanced vegetation index (EVI). Mean EVI was also highly ranked in all models and all data sets.

Discussion

Model agreement and predictions

Our models showed strong agreement between boosted regression and maximum entropy methods in selection of habitat with the highest probability of occurrence of Cx. pipiens. High AUC values for all four models indicated that occupancy sites were highly likely to be assigned a higher probability of presence than background sites regardless of method. The strong correlation between BRT and Maxent outputs in areas with a high proportion of known Cx. pipiens habitat indicates that the models strongly agree on areas of potentially high risk. These areas included populated areas within the Nile delta, the valleys of Israel, Lebanon, and Jordan, and southwestern Saudi Arabia.

Our model predictions are supported by agreement between predicted areas of highest probability of Cx. pipiens occurrence, and recent positive vector surveys, or outbreaks of vector-associated diseases. The regions of central Israel, predicted by our models to be areas of high probability of Cx. pipiens presence, corresponded to the areas of highest incidence of West Nile virus during the 2000 outbreak [8]. Likewise, the region of highest probability of occupancy in Saudi Arabia corresponds with recent collections of Cx. pipiens[6264] and the incidence of infection during the 2000 Rift Valley Fever virus outbreak [65]. The distribution of suitable Cx. pipiens habitat in Saudi Arabia is of special interest because of the high potential for repeated import of diseases into the country via large numbers of pilgrims travelling through for the annual Hajj [66]. The city of Jeddah has twice been hit by epidemics of dengue directly following the Hajj [67]. Our results indicate that this same area is a suitable habitat for supporting Cx. pipiens, and potentially capable of transmitting West Nile virus, should it be re-introduced.

Agreement between model predictions was less correlated when evaluated over the full region, including areas where models were required to extrapolate. This is a result of the different behavior of each algorithm in extreme environments [68]. Maxent, when clamped, extrapolates in a horizontal line from the fit at the most extreme value in the training data. BRT, which does not use clamping, extrapolates at a constant value from the last known site. This difference in extrapolation can also be seen in the spatial predictions of the population-exclusive models when predicting into the most dissimilar areas of the modeling region. BRT assigns a higher relative probability of occupancy to habitat in northern Iraq and southern Syria, classifying the area with a probability of occupancy similar to that of the river valley along the Israel-Jordan border. The Maxent model, which constrains its predictions more severely, predicts a much more moderate probability of occupancy. The same relative over-estimation of the BRT algorithm is not as apparent in the population-inclusive model, because the environmental predictor that most exceeds its training values in this region is temperature seasonality, which was not as influential a predictor in the population-inclusive models as it was in the population-exclusive models. The effect of clamping can still be seen in the population-inclusive Maxent model. An advantage of using two different modeling methods is our ability to detect regions of parameter space where choice of the underlying modeling algorithm has the greatest influence on strength of predictions. Results in these areas need to be interpreted more cautiously than areas where both models are in agreement.

Presence-only modeling methods, such as those used in this study, make the assumption that a target species is equally detectable across all habitats [35]. However, if sampling detection probability varies with one or more environmental predictors, our model will not distinguish between habitat with a higher probability of occupancy and a habitat with greater detectability. Absolute values of the predictions should be interpreted with caution with this limitation in mind. Predictions should also be evaluated with the consideration that presence-only models treat densely and sparsely occupied sites the same, as the input data are binary.

In order to efficiently predict species distributions across the study area, we were required to coarsen the resolution of environmental predictors to 1 km. It is important to keep in mind that this may be a larger than optimal scale for examining the relationship between vectors and certain predictors with very fine scale heterogeneity [25, 69].

Relationship between Cx. pipiens and environmental predictors

Our study is unique in its aim of identifying the underlying environmental predictors driving Cx. pipiens distribution across a climate gradient that encompasses hyper arid to sub humid habitats. The novel aspects of our findings are the importance of human population density and seasonality of vegetation indices as powerful predictors of Cx. pipiens occurrence.

Human population density was identified by both modeling methods as the highest contributing predictor to habitat suitability. Interpreting this result is complex, because areas of dense human habitation may be favorable to mosquitoes for several reasons. Humans are a host species, their dwellings also provide a sheltered resting area, and in rural areas human activities such as irrigation and agriculture may also provide favorable breeding habitat and non-human hosts. Breeding conditions for mosquitoes are also favorable in slums, where high human population density is combined with poor infrastructure and inadequate services [70]. Characterization of mosquito breeding habitat in urban Cairo found that 93.5% of breeding sites were found in slum areas, characterized by incomplete construction, disorganized roads, and crowded, dense conditions [71]. Socioeconomic conditions were also significantly related to Cx. pipiens breeding sites in Washington DC, although in the northern temperate region the relationship was reversed, breeding sites were positively associated with presence of functional containers, like flower pots and garbage pails, which were found primarily in areas of higher socioeconomic status [27].

Our results indicate that quality and seasonality of vegetation is a powerful predictor of Cx. pipiens in arid and semi-arid habitats. The relationship between Cx. pipiens occurrence and vegetation quality was strongly positively associated with the maximum value of the enhanced vegetation index (EVI), and negatively associated with the mean and standard deviation (seasonality) of the EVI. This suggests that in our study region, high quality Cx. pipiens habitat consists of areas of high primary productivity, which maintain that quality of habitat with very little change throughout the year. In rural areas of Egypt, year round irrigation maintains precisely this kind of habitat. Vegetation indices are not always the most informative predictors in the region, in a study examining another Cx. pipiens vectored infection, filariasis, the Normalized Difference Vegetation Index did not distinguish between Egyptian villages at high and low risk of infection [72]. The positive relationship between Cx. pipiens and stable areas of high primary productivity suggested by our results is the opposite of the relationship seen in the forested northeastern United States, where the most accurate model of Cx. pipiens abundance was negatively correlated with forest cover and did not include the vegetative index at all [25].

Topographic wetness index has been a significant predictor of anopheline abundance and malaria risk among households in Thailand [73] and Kenya [74, 75]. Despite this, in our models it was a significant predictor only in the BRT "NoPop" model. It is possible that in arid climates with low humidity and little rainfall, the potential for pooling water, as predicted by TWI, does not translate into pools of water with enough frequency to make TWI a stronger indicator of vector breeding habitat. It is also possible that, even if TWI does accurately predict the distribution of pools of clear groundwater in the region, those pools of clear groundwater may not be the breeding habitat most favored by Cx. pipiens. In habitats where mosquitos chose to oviposit more frequently in artificial or natural containers, rather than ground pools, this will not be reflected in the TWI. It is also notable that none of our models found rainfall to be a very powerful predictor. In this context, the vegetation indices may provide better information on available soil moisture than hydrology or rainfall parameters.

Conclusions

Our study provides insight into the drivers of Cx. pipiens distribution in an understudied region at growing risk from the re-emergence of several arboviruses. The most dominant predictors in our species distribution model, human population density and the seasonality of the enhanced vegetation index, are also both environmental factors with potential correlations with urbanization and agricultural practices. By understanding the relationship between these predictors and disease vector distributions, we gain a better understanding of how changes in land use or shifting patterns of human settlement might influence disease transmission. Given the resurgence of vector borne diseases like West Nile and Rift Valley Fever [10] understanding how choices may influence disease ecology through agriculture, urbanization, and population growth, is a topic of vital importance.