Abstract
Background
Several malaria risk maps have been developed in recent years, many from the prevalence of infection data collated by the MARA (Mapping Malaria Risk in Africa) project, and using various environmental data sets as predictors. Variable selection is a major obstacle due to analytical problems caused by overfitting, confounding and nonindependence in the data. Testing and comparing every combination of explanatory variables in a Bayesian spatial framework remains unfeasible for most researchers. The aim of this study was to develop a malaria risk map using a systematic and practicable variable selection process for spatial analysis and mapping of historical malaria risk in Botswana.
Results
Of 50 potential explanatory variables from eight environmental data themes, 42 were significantly associated with malaria prevalence in univariate logistic regression and were ranked by the Akaike Information Criterion. Those correlated with higherranking relatives of the same environmental theme, were temporarily excluded. The remaining 14 candidates were ranked by selection frequency after running automated stepwise selection procedures on 1000 bootstrap samples drawn from the data. A nonspatial multiplevariable model was developed through stepwise inclusion in order of selection frequency. Previously excluded variables were then reevaluated for inclusion, using further stepwise bootstrap procedures, resulting in the exclusion of another variable. Finally a Bayesian geostatistical model using Markov Chain Monte Carlo simulation was fitted to the data, resulting in a final model of three predictor variables, namely summer rainfall, mean annual temperature and altitude. Each was independently and significantly associated with malaria prevalence after allowing for spatial correlation. This model was used to predict malaria prevalence at unobserved locations, producing a smooth risk map for the whole country.
Conclusion
We have produced a highly plausible and parsimonious model of historical malaria risk for Botswana from pointreferenced data from a 1961/2 prevalence survey of malaria infection in 1–14 year old children. After starting with a list of 50 potential variables we ended with three highly plausible predictors, by applying a systematic and repeatable staged variable selection procedure that included a spatial analysis, which has application for other environmentally determined infectious diseases. All this was accomplished using generalpurpose statistical software.
Similar content being viewed by others
Background
Recent years have seen widespread application of geographic information systems and spatial statistical methods in modelling and mapping the distribution of vector borne diseases, including malaria. In subSaharan Africa the Mapping Malaria Risk in Africa (MARA) project has been working towards a malaria risk atlas for rational and targeted control of the disease [1]. To this end historical and current survey data have been collated of the prevalence of infection with human Plasmodium parasites.
A number of malaria risk maps, at country and regional level, have been produced by analysing georeferenced prevalence data against environmental data to predict prevalence at localities where it was not recorded [2–6]. Different analytical approaches of varying sophistication have been explored. Multiple variable logistic regression analysis, commonly used to assess the odds of infection against potential risk factors, has been employed, and the spatial dependence in the response data has been modelled most successfully using Bayesian spatial modelling. One outstanding issue, which can greatly affect the predictions, remains the variable selection procedure, particularly when there are a large number of potential risk factors.
In regression analysis and predictive/prognostic statistics, model validity is an important aspect [7], both the internal validity, or accuracy, i.e. the model explains the observed data well, and external validity, or generalizability, i.e. the model predicts new data well. In this context we furthermore aim for parsimony (model contains a few strong predictors that are easily interpretable) and plausibility, both of the covariates (association with the disease are etiologically explainable) and of the predictions (believable in view of what is generally known). Taking account of the spatial correlation structure in the data is important for "geographic transportability", i.e. when predicting malaria prevalence to unobserved locations [8].
Selecting a few predictors for spatial modelling from among a large number of potential candidates is a major challenge and can easily become arbitrary. Ideally every possible combination of variables would be tested and compared in a Bayesian spatial framework. However, this would be extremely computingintensive and unfeasible, if not impossible, for most users. The most practical route is to reduce the list of potential explanatory variables using nonspatial selection methods, before moving to a spatial context.
Neither manual nor automated stepwise selection procedures are advised, because of frequent overfitting, and because of the resulting "phantom degrees of freedom" [9] pg 416: testing and rejecting many variables increases the probability of finding a significant predictor by chance, but since this sifting remains undeclared, standard errors in the final model are underestimated. Babyak [9], citing Harrell [10] and others, recommend shorter lists of candidate predictor variables, which are not strongly correlated, as well as bootstrapping, as a form of simulation. Austin and Tu [11], working on heart attack data, developed their model by running repeated stepwise selection procedures on bootstrap samples of their data, to identify the most consistent predictors.
The aim of this study was to develop a map of historical malaria risk for Botswana by analysing malaria prevalence data against a number of environmental variables from different data themes, using a systematic and repeatable staged process of variable elimination, including the stepwise bootstrap method described by Austin and Tu [11]. The resulting small subset of variables, each independently associated with the response, but possibly spurious because the condition of spatial independence was not satisfied, was tested in a Bayesian geostatistical model. We used the spatial model derived from the observed locations, to predict prevalence of malaria infection in children 1–14 years old at unobserved map locations across the whole country.
Methods
Study area
Botswana is semiarid to arid with few permanent water bodies. The country is flat, mostly between 900 and 1200 m altitude. The rainy season is from November to March. Vegetation ranges from desert scrubland in the SouthWest, where annual rainfall is <300 mm, through grassland, to wooded savannah in the North, which receives >500 mm rain annually. Mean annual temperatures are between 18 and 23°C. Botswana today has a total population of about 1.6 million; population density over two thirds of the country being <1 per square km [12]. The population according to the 1971 census was 630379 with an approximate 3.1% annual increase [13] which if extrapolated back in time translates to around 470000 in 1961/62. In 1975 80% of the population lived in the eastern part of the country,
Malaria risk is highest in the tropical North (figure 1). Indoor residual spraying was introduced in 1946 on a limited scale. Coverage was gradually improved culminating in a comprehensive vector control program in the 1980's [14], but even by 1953 indoor residual spraying for mosquito control was a "regular feature" in risk areas, apparently mainly in towns, along rivers and apparently excluding rural areas "remote from regular medical supervision", but with good results [15]. Larval control was also implemented when mosquito breeding was detected. Malaria prevalence decreased markedly after 1944, again between 1961/62 and 1974, and further thereafter [14]. By 1960 no prevalence above 70% was measured, suggesting mesoto hypoendemic conditions. Further South, transmission is hypoendemic and epidemic, and over large areas entirely absent. Incidence, like the climate, is strongly seasonal, peaking around March/April [16]. The gradient in malaria broadly follows the environmental gradients described before.
Malaria data
Archived malaria prevalence data were collated within the MARA project, as described by Omumbo et al [17]. In Botswana geographical coordinates could be obtained for 613 out of a total of 1063 agespecific prevalence surveys. Of these, 20 did not report sample sizes and were excluded. Here we used only the 1961/62 national survey (figure 1) to develop a historical malaria risk map. For the 1–14 year age group, 122 prevalence results were available, for 118 unique locations across the country, progressively from August 1961 to May 1962. Surveys in different regions were carried out during different months (figure 2). The total number examined was 17149; the mean sample size was 141 per survey (range 2–831). The design effect was calculated in Stata [18].
Environmental data
Fortynine variables representing different summaries and transformations of the eight environmental data themes (see table 1), were included in the study: elevation [19], surface water [20], land cover [21], longterm monthly mean rainfall, temperature [22], vapour pressure [23], and normalized difference vegetation index (NDVI) at 8 km [24] and 1 km [25] resolution.
Themes with monthly values (rainfall, temperature, NDVI and vapor pressure) were plotted against logittransformed malaria prevalence, logit(p). Based on observed temporal patterns in the scatter plots, months were aggregated for "summer" (December to March) and "winter" (April to October). Different annual summary indices were also calculated for each theme. Calculations of some of the variables are shown in the appendix.
Distance from water bodies was calculated by projecting maps of perennial and nonperennial water bodies onto a 200 × 200 m grid and calculating for each grid cell the euclidian distance to the nearest water body. Values were transformed by adding a value of 100 m to each pixel and deriving the natural logarithm.
For land cover, the thirteen United States Geological Survey land cover classes occurring in Botswana were regrouped into two categories, broadly corresponding to drier and moister land cover types. Most data points were found in "grassland" and "savannah" with only isolated surveys in the other land cover types. Prevalence was generally higher in "savannah" than in "grassland" areas. Other obviously drier and lower risk land cover types ("barren or sparsely vegetated", "shrub land", "urban or builtup") were therefore included with "grassland" in a "low risk" category, while other clearly moister classes ("herbaceous wetland", "water bodies", "evergreen broadleaf forest") were included with the higherrisk "savannah" category. Other minor land cover types were included in the category alongside which they mostly commonly occurred ("grassland/crop land mosaic" was mainly found scattered among "grassland"; "dryland crop land and pasture" and "mixed" among "savannah").
Values were extracted from the data grids for each geographical location where a malaria survey result was available.
Variable selection and model development
We carried out a staged approach during model formulation. A flow chart of the variable selection procedure is shown in figure 3.
Stage 1
The malaria prevalence database was split randomly into derivation (n = 81) and validation (n = 41) subsets. To identify the best univariate predictors, univariate logistic regression analysis against the derivation data was carried out on all 50 potential predictors. We allowed for clustering by survey location using the HubertWhite sandwich estimator in Stata [18].
Stage 2
To reduce confounding arising from correlated variables, and also to reduce the variables to data ratio, we ranked the variables significant in univariate analysis by the Akaike Information Criterion [26] (AIC), and excluded those that were strongly correlated (Spearman's r > 0.85) with a higherranking variable belonging to the same environmental theme. Scatter plots against logit(p) were prepared of the remaining variables (figure 4).
Stage 3
Following the approach of Austin and Tu [11], we drew 1000 bootstrap samples from the derivation data, and ran automated backward exclusion procedures on each sample. Since it was not possible in Stata to allow for clustering within the stepwise procedure, which resulted in the explanatory power of variables being overestimated, we used stringent entry and removal thresholds (p = 0.02 and 0.05 respectively). We recorded the coefficients and the number of times each candidate variable was selected in the 1000 models.
Stage 4
A nonspatial multiplevariable model was derived in a manual stepwise fashion, starting with the most frequently selected variable, and adding further variables in order of selection frequency, as long as all entered variables remained significant at the 5% probability level. If a previously entered variable became nonsignificant with the addition of another, we retained the one more frequently selected in Stage 3 in favour of the other.
Stage 5
Back in Stage 2 variables had been excluded based on their univariate predictive power. To identify the best representative(s) of a theme in a multiple variable context, correlated variables excluded in Stage 2 were allowed to compete against each other for entry into the model in further stepwisebootstrap procedures. The variables in the Stage 4 model constituted the basic candidate list. Working themebytheme, we reintroduced into the candidate list also those variables that had been excluded in Stage 2 on account of their high correlation with any variable of the same theme that had survived to Stage 4. Each time we ran a stepwisebootstrap procedure as described above, recording which of the competitors was most frequently selected. This variable then replaced the original variable in the model. Details, in the form of an example, are provided in an annotation to table 2. Using the modified model, prevalence was predicted for all 122 observations. The accuracy of the predictions for both derivation and validation data was assessed using the concordance correlation coefficient [27, 28].
Stage 6
To account for spatial correlation in the survey data, a generalized geostatistical spatial model using Markov Chain Monte Carlo (MCMC) simulation was fitted on all 122 observed prevalence data points [29–32]. The covariates of the Stage 5 model were included as potential explanatory variables. Spatial modeling was carried out using the package geoRglm in the statistical software system R [30]. Detailed methods are included in the appendix. For each model parameter the median and 2.5 and 97.5 percentiles were calculated from the MCMC simulations. Prevalence and its 95% CI was predicted and mapped for a grid of 2300 locations based on the covariates and the spatial structure in the data.
Results
The design effect in the data was 52 before adjusting for covariates. The clustered survey data thus only had the same power as 330 (17149/52) individuals randomly sampled over the entire country.
Of the 50 potential explanatory variables, 42 were significantly associated with malaria prevalence in univariate logistic regression in Stage 1 (table 1). Scatter plots of logit(p) against the 14 variables that were selected for further analysis in Stage 2, are shown in figure 4.
The selection frequency of the 14 candidate variables in the 1000 stepwisebootstrap models of Stage 3, are shown in table 2. Figure 5 shows the frequency distribution of coefficients for each variable. Some variables were unstable, having positive coefficients in some models and negative coefficients in others. Five variables were selected into the Stage 4 model, namely annual maximum rainfall, winter mean temperature, proportional SD temperature, elevation and land cover (marked in table 2).
The results of the additional three stepwisebootstrap procedures of Stage 5 are shown in table 2. In the rainfall theme, annual maximum was outperformed and replaced by summer total. For temperature theme, annual mean outperformed winter mean. With annual mean in the model, standard deviation became nonsignificant. Since standard deviation ranked lower in Stage 3 than winter mean, it was removed, reducing the number of variables in the Stage 5 model to four. Results of the Stage 5 model are shown in table 3.
Figure 6 shows the scatter plot of observed versus predicted logit(p), for the derivation and validation data of the nonspatial Stage 5 model. The concordance correlation coefficient (ρ_{C}) [27, 28] for the derivation data, weighted by sample size, was 0.851, n (individuals examined) = 11182 in 66 nonzero prevalence surveys, the 95% confidence interval (CI) = 0.846 to 0.856. The unweighted ρ_{C} = 0.834, n = 66, CI = 0.760 to 0.908. For the validation data weighted ρ_{C} = 0.835, n = 4467, CI = 0.826 to 0.843; unweighted ρ_{C} = 0.776, n = 30, CI = 0.635 to 0.917. The difference between observed and predicted logit(p) did not vary with prevalence.
After adjusting for spatial random effects, only three covariates remained significant. Land cover (median = 0.515; 95% CI = 1.099 and 0.059) was removed. The predictions (median and CI) from the spatial Stage 6 model are also shown in figure 6. It contained three covariates namely summer rainfall, annual mean temperature and elevation, each independently significantly associated with prevalence of infection after allowing for spatial correlation in the data (table 4).
Discussion
This study was concerned with finding the best predictors of malaria prevalence in terms of plausibility, parsimony and reliability. One important question was how to summarize the environmental data in a meaningful way. We determined to explore a range of alternative summaries of the monthly climate data, believing one appropriate summary indicator to be better for prediction than individual months [5], quarterly aggregates [3], or principal components [4], the last of which are difficult to interpret. However, as more and more variables are tested against a certain data set, the risk increases that some will explain the data merely by chance, but will fail to explain new data.
In an initial attempt to derive a wellfitting and plausible model through automated stepwise variable selection (results not shown), arbitrary factors such as entry and removal threshold settings, how many variables were included in the list of candidates, and which datasubset was used for model derivation, affected which variables got selected. The bestfitting models did not produce the most plausible risk maps, and visa versa. The majority of maps resulting from these models strongly contradicted expert opinion. A more systematic selection procedure was called for.
Identification of consistent predictors is compromised by correlation among predictors. A strong, reliable predictor may ultimately be selected less frequently than a weaker predictor, if several strongly correlated alternatives compete for entry into the model so that each has a low selection frequency [11]. For this reason it was important to include in the candidate list only littlecorrelated variables. This was ensured in Stage 2, where the candidate list was reduced from 42 to 14.
Reliable predictors would not only explain a particular data set, but would be associated consistently with the response. The bootstrapping of Stage 3 helped identify such predictors, because those that consistently explain different subsets of the data, are more likely to explain new data. In the stepwise bootstrap procedures, variables that explained the most observations would be selected most frequently while those that explained only some of the observations, would be selected only when these observations appeared in the bootstrap sample. The effect of individual observations on variable selection, especially of outliers, was thus reduced.
In the process of univariate ranking (Stage 1 and 2) we became guilty of "data peeking" [9]. Using our data to assemble a candidate list of predictors set up the analysis for success. Such undeclared testing and discarding of variables may lead to illegitimately high model fit. Another problem of Stage 2 was that variables were excluded on the grounds of low univariate correlation with the response, while their predictive power may be quite different once other variables are accounted for. Stage 5 was an attempt to redress both these problems at once, by giving each variable excluded in Stage 2, whose relative had survived up to Stage 4, a fair chance to outperform and supplant its competitors in a multiplevariable context, at the same time, through the bootstrap subsampling, to reduce the influence of the data set on this process.
A further benefit of the Stage 3 bootstrapstepwise procedures was the information provided by the frequency distributions of coefficients in the 1000 stepwise models (figure 5). A variable that has a widely varying coefficient, or one that is sometimes positive and sometimes negative, is clearly not reliable and should be considered with suspicion [33]. An example was summer vapour pressure, the strongest univariate predictor, but selected least frequently in multiplevariable regression (figure 5J ). Altitude on the other hand, a weak univariate predictor, became an important predictor in a multiplevariable context, with a stable positive coefficient (figure 5G ). In fact, the most frequently selected variables (table 2) had stable coefficients (figure 5), whereas the most unstable coefficients were found among the least frequently selected variables, confirming the relative importance of predictors.
The strong association found between malaria prevalence and selected environmental data (figure 7) is biologically plausible since high malaria infections have been shown to coincide with conditions that favour vector and parasite development in a given location [3]. However, over small distances environmental conditions vary only slightly due to the relatively simple flat Botswanan topography, while malaria prevalence showed substantial local variation, for example contemporal measures of 67% (n = 48, Maun) versus 24% (n = 557, Maun suburb), or 3% (n = 219) versus 17% (n = 116) in Matangwane. Such local variation is perhaps partly caused by the distribution of small breeding sites. Yet in studies where detailed breeding site information was available, much of the variation in incidence [34], prevalence and entomologic inoculation rate [35] nevertheless remained unexplained. Localized factors, such as individual, household and village characteristics, as well as the effect of sampling procedure and size, may further contribute to the unexplained variability in prevalence.
Summer rainfall and annual mean temperature, retained in the final multiplevariable model, were highly plausible predictors. The same variables – summer rain and mean temperature over the preceding year – were also found to explain interseasonal variation in malaria incidence in KwaZuluNatal [36]. Summer rainfall also explained much of the variation in interannual variation in malaria incidence in Botswana [16]. High rainfall during the hot summer months allows rapid breeding and population expansion of the mosquito vectors, while high mean temperatures maximize the maturation rate of the parasite in its exothermic arthropod host [37]. Warmer winters reduce the dieback of mosquitoes and parasites, thereby increasing the reservoir for the following season.
The strong positive association of elevation with malaria prevalence (an increase in logit(p) of 1 every 160 m, table 4) was surprising, as prevalence on its own, as it usually tends to be, was higher in lowlying areas (figure 4G). This positive association was difficult to explain, but may be connected with the malaria control that was ongoing at the time. It appears from early reports [15] that vector control operations were widespread and intensive along rivers and the main populated areas.
The nonspatial model of Stage 5 predicted the data fairly well but the predictions achieved by the spatial model of Stage 6 were more accurate (figure 6). The map corresponding to the Stage 5 model (figure 7A) had an implausible discontinuity, caused by the negative coefficient of landcover. Landcover was the most frequently selected variable in the bootstrap procedures, but was not significant in the spatial model. This binary variable may simply have approximated the spatial division between high and low prevalence areas, which was ultimately described more correctly through the geospatial approach of Stage 6 (figure 7B).
A good number of locations with observed zero prevalence had predicted prevalence of 5%, i.e. logit(p) of 3, and above (figure 6). In these cases sampling error may have played an important role, as large sample sizes are needed to measure very low prevalence rates confidently. Conversely, nonzero observations were more often lower than the predictions based on environmental factors. By 1961/2 malaria prevalence in the North of Botswana was already much below the level measured in 1944 [14], probably due to the limited use of indoor residual spraying which had been ongoing since the 1940's. This highlights the fact that not only environmental, but also anthropogenic factors, especially malaria control need to be considered. This furthermore highlights the need to monitor control coverage and effectiveness, as well as other potential cofactors, in order to understand the situation more accurately.
Evidence from elsewhere in Africa suggests that prevalence rates in the dry/low transmission season may differ substantially from those in the wet/high transmission season [38, 39]. In this study month of survey was a significant predictor of prevalence in a univariate setting only, but not while accounting for other variables. Prevalence by month (figure 4N) was confounded by where surveys were carried out when, and thus did not reflect the seasonality of malaria risk. The highest incidence months for example (March to May) would not be the lowest prevalence months, as figure 4N suggests. Rather, surveys were carried out during these months in the lowrisk South (figure 2). To measure intraannual variation in prevalence we would have required data from the same localities in different months.
The spatial risk map (figure 7B) presents a smoothed picture of malaria risk in Botswana prior to intensive malaria control, which was highly plausible based on expert opinion and the mean incidence at district level [40]. The wide CI (figure 7C) in predicted prevalence highlights the uncertainty remaining after accounting for all explained variation in the data. The confidence level needs to be taken into account when using the map for planning and evaluating control interventions, to avoid overinterpretation of the map.
Conclusion
A continuous map of malaria risk is more useful than pointprevalence rates for several reasons. First, the variability in individual observations may hide underlying patterns that have epidemiological importance. Further, it is not possible to deduce from a pointreferenced map what prevalence you may expect to see in areas that have not been sampled, whereas a model such as the one developed here gives a likely range of prevalence for the entire region. A continuous prevalence map can also be combined with underlying population data to estimate the number of people at risk of – or infected with – malaria. Finally, the spatial statistical methods employed here distinguish between the correlation among observations that can be ascribed to their spatial proximity (neighbouring villages affecting each other), and that which can be explained by environmental factors (thereby avoiding overestimating the explanatory power of the covariates).
Though malaria risk has been reduced substantially through intense malaria control, a malaria risk map nevertheless remains highly useful from the control perspective in knowing historical prevalence levels. We have furthermore demonstrated a systematic procedure for variable selection and model formulation in developing a geostatistical risk model from pointreferenced malaria prevalence data, which has relevance to a broad range of environmentally determined infectious diseases. The failure take account of spatial correlations during the entire variable selection procedure remained a major weakness. As computing power increases and statistical software packages are further developed, variable selection within a spatial framework may end up being within the means of the average researcher.
The staged process of variable elimination employed here proved to be practical, though not necessarily the optimal solution. Stepwise variable selection on multiple bootstrap samples drawn from the data allowed us to identify the most consistent and stable explanatory variables. Selection frequency provided an objective rationale for choosing one variable above another, and to choose between similar and strongly correlated indicators. Spatial analysis was the final stage in the variable elimination process, after which we remained with a parsimonious, highly plausible model, which produced a smooth, plausible map of malaria risk.
Appendix 1
Standard deviation (SD)
where y_{m} = monthly value and \widehat{y} = mean of all y_{m}.
Proportional SD (based on monthly proportions)
where py_{m} = y_{m}/y_{tot}; y_{tot} = ∑y_{m}, and 0.0833 is the mean of all py_{m} (= 1/12)
Effective temperature [41]
Concentration of rainfall
Monthly rainfall is expressed as a vector (r_{ m }, θ_{ m }), rainfall being the magnitude (r) of the vector and the month its angle (θ) expressed in units of arc:
where m is the month, so that January = 1 and December = 12.
The twelve monthly vectors are added to calculate the total vector (r_{ t }, θ_{ t }):
The concentration index C is calculated as:
Concentration is 100% if all the rain falls in one month and 0% if all months have equal amount of rain.
θ_{ t }is the mean peak month around which rainfall is concentrated.
Generalized spatial logistic regression analysis
Bayesian geostatistical model formulation has been described by a number of authors [29–32]. Following these authors, the model is specified as follows:
Y_{ji} represents the binary response corresponding to the infection status of child j at site i (the survey site) taking value 1 if the child tested positive and 0 otherwise. The Y_{ji} are conditionally independent Bernoulli variables with infection probability p_{i} at location i.
The p_{i} are defined via a generalised linear mixed model, to take account of spatial dependence:
where β represents the regression coefficients for a set of known covariates X at all locations ℓ_{i} of the study area;
S = (S(ℓ_{1}),...., S(ℓ_{n}))^{T} denotes the values of the (unobserved) Gaussian spatial process S(·) at sample locations ℓ_{i};
σ^{2} = Var{S(ℓ)}, and Φ is a parameter of the correlation function ρ(d_{ij}, Φ), in our case exp(d_{ij}/Φ), where d_{ij} is the distance between locations ℓ_{ i }and ℓ_{ j }.
For β flat priors were specified respectively (defaults in geoRglm) and for σ^{2} a ScaledInverse chisquare distribution(χ^{2}_{ScI}) with five degrees of freedom and a mean of 0.5. For Φ a discrete exponential prior with mean of 0.04 and 1000 discretisation points in the interval 0.0001 to 2 was specified.
Convergence was assessed by inspecting plots of traces of simulations for individual parameters. The first 50,000 iterations were discarded; thereafter simulations were run for 250,000 iterations. Every 50th sample was retained. For each model parameter the median and 2.5 and 97.5 percentiles were calculated from the 5,000 MCMC simulations.
Models were compared by calculating the deviance information criterion (DIC) for each model [42]. Spatial prediction using Bayesian kriging was carried out for a grid of 2300 locations which correspond to the entire surface of Botswana. For each prediction location a posterior sample of MCMC simulations was generated taking account of the estimates of regression coefficients and the spatial effects at each location, and of the uncertainty of each parameter. This process is described in detail elsewhere [29, 31, 32], and was carried out using geoR [30].
Abbreviations
 CI:

Confidence interval/credible interval
 logit(p):

Logittransformed malaria prevalence
 MCMC:

Markov Chain Monte Carlo
 NDVI :

Normalized difference vegetation index
 SD :

Standard deviation
References
Snow RW, Marsh K, Le Sueur D: The need for maps of transmission intensity to guide malaria control in Africa. Parasitol Today. 1996, 12: 455457. 10.1016/S01694758(96)30032X.
Kleinschmidt I, Bagayoko M, Clarke GP, Craig M, Le Sueur D: A spatial statistical approach to malaria mapping. Int J Epidemiol. 2000, 29: 355361. 10.1093/ije/29.2.355.
Kleinschmidt I, Omumbo J, Briet O, Van De GN, Sogoba N, Mensah NK, Windmeijer P, Moussa M, Teuscher T: An empirical malaria distribution map for West Africa. Trop Med Int Health. 2001, 6: 779786. 10.1046/j.13653156.2001.00790.x.
Omumbo JA, Hay SI, Snow RW, Tatem AJ, Rogers DJ: Modelling malaria risk in East Africa at highspatial resolution. Trop Med Int Health. 2005, 10: 557566. 10.1111/j.13653156.2005.01424.x.
Snow RW, Gouws E, Omumbo JA, Rapuoda B, Craig MH, Tanser FC, Le Sueur D, Ouma J: Models to predict the intensity of Plasmodium falciparum transmission: applications to the burden of disease in Kenya. Trans R Soc Trop Med Hyg. 1998, 92: 601606. 10.1016/S00359203(98)907817.
Gemperli A, Vounatsou P, Sogoba N, Smith T: Malaria mapping using transmission models: application to survey data from Mali. Am J Epidemiol. 2006, 163: 289297. 10.1093/aje/kwj026.
Justice AC, Covinsky KE, Berlin JA: Assessing the generalizability of prognostic information. Ann Intern Med. 1999, 130: 515524.
Diggle P, Moyeed R, Rowlingson B, Thomson M: Childhood malaria in the Gambia: a casestudy in modelbased statistics. Applied Statistics. 2002, 51: 493506.
Babyak MA: What you see may not be what you get: a brief, nontechnical introduction to overfitting in regressiontype models. Psychosom Med. 2004, 66: 411421. 10.1097/01.psy.0000127692.23278.a9.
Harrell FE: Regression modeling strategies: with applications to linear models, logistic regression and survival analysis. 2001, New York: Springer
Austin PC, Tu JV: Bootstrap methods for developing predictive models. Am Stat. 2004, 58: 131137.
Deichmann U: Population Density for Africa in 1990, 3. Internet. 1997, NCGIA, UCSB, Santa Barbara, [http://grid.cr.usgs.gov/datasets/mapservices.php]
Chayabejara S, Sobti SK, Payne D, Braga F: Malaria situation in Botswana. Report AFR/MAL/144. World Health Organization, Regional Office for Africa
Mabaso ML, Sharp B, Lengeler C: Historical review of malarial control in southern African with emphasis on the use of indoor residual housespraying. Trop Med Int Health. 2004, 9: 846856. 10.1111/j.13653156.2004.01263.x.
Freedman ML: Malaria Control. Report. The Botswana National Archives and Records Services, Gaborone
Thomson MC, Mason SJ, Phindela T, Connor SJ: Use of rainfall and sea surface temperature monitoring for malaria early warning in Botswana. Am J Trop Med Hyg. 2005, 73: 214221.
Omumbo J, Ouma J, Rapuoda B, Craig MH, Le Sueur D, Snow RW: Mapping malaria transmission intensity using geographical information systems (GIS): an example from Kenya. Ann Trop Med Parasitol. 1998, 92: 721. 10.1080/00034989860120.
Anon: Stata Statistical Software: Release 7.0. 2001, Stata Corporation, College Station, Texas
Anon: GTOPO30 global digital elevation model. Internet. 1998, Center for Earth Resources Observation and ScienceUnited States Geological Survey, Sioux Falls, South Dakota, [http://edc.usgs.gov/products/elevation/gtopo30/gtopo30.html]
Anon: Africa Data Sampler, 1. CDROM. 1995, World Resources Institute, Washington, DC
Anderson JR, Hardy EE, Roach JT, Witmer RE: A land use and land cover classification system for use with remote sensor data. U S Geological Survey Professional Paper. 1976, 964:
Hutchinson MF, Nix HA, McMahan JP, Ord KD: Africa – A topographic and climatic database, 1. CDROM, Canberra. 1995
Mitchell TD, Hulme M, New M: CRU TS 2.0 highresolution gridded climate data, 1. Internet. 2003, Climate Research Unit, University of East Anglia, Norwich., [http://www.cru.uea.ac.uk/cru/data/hrg.htm]
Anon: Pathfinder Advanced Very High Resolution Radiometer (AVHRR) data. Internet. 2001, National Oceanic and Atmospheric Administration (NOAA), Goddard Distributed Active Archive Center, [http://daac.gsfc.nasa.gov/data/dataset/]
Anon: Global Land 1KM AVHRR Project. Internet. 2007, Center for Earth Resources Observation and Science, United States Geologic Survey, Sioux Falls, [http://edcsns17.cr.usgs.gov/1KM/1kmhomepage.html]
Akaike H: Information theory and an extension of the maximum likelihood principle. Second international symposium on information theory. Edited by: Petrov BN, Csaki F. 1973, Budapest: Akademiai Kiado, 267281.
Lin LIK: A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989, 45: 255268. 10.2307/2532051.
Lin LIK: A note on the concordance correlation coefficient. Biometrics. 2000, 56: 324325. 10.1111/j.0006341X.2000.00324.x.
Diggle PJ, Tawn JA, Moyeed R: Modelbased geostatistics. J Roy Stat Soc C. 1998, 47: 299350. 10.1111/14679876.00113.
Christensen OF, Ribeiro PJJ: geoRglm – a package for generalised linear spatial models. R News. 2002, 2: 2628.
Gemperli A, Vounatsou P: Fitting generalized linear mixed models for pointreferenced spatial data. J Modern Appl Stat Meth. 2003, 2: 497511.
Gemperli A, Vounatsou P, Kleinschmidt I, Bagayoko M, Lengeler C, Smith T: Spatial patterns of infant mortality in Mali: the effect of malaria endemicity. Am J Epidemiol. 2004, 159: 6472. 10.1093/aje/kwh001.
Concato J, Feinstein AR, Holford TR: The risk of determining risk with multivariable models. Ann Intern Med. 1993, 118: 201210.
Van Der Hoek W, Konradsen F, Amerasinghe PH, Perera D, Piyaratne M, Amerasinghe FP: Towards a risk map of malaria for Sri Lanka: the importance of house location relative to vector breeding sites. Int J Epidemiol. 2003, 32: 280285. 10.1093/ije/dyg055.
Hightower AW, Ombok M, Otieno R, Odhiambo R, Oloo AJ, Lal AA, Nahlen BL, Hawley WA: A geographic information system applied to a malaria field study in western Kenya. Am J Trop Med Hyg. 1998, 58: 266272.
Craig MH, Kleinschmidt I, Nawn JB, Le Sueur D, Sharp BL: Exploring 30 years of malaria case data in KwaZuluNatal, South Africa, Part I: the impact of climatic factors. Trop Med Int Health. 2004, 9: 12471257. 10.1111/j.13653156.2004.01340.x.
Molineaux L: The epidemiology of human malaria as an explanation of its distribution, including some implications for its control. Malaria: Principles and Practice of Malariology. Edited by: Wernsdorfer WH, McGregor I. 1988, Edinburgh: Churchill Livingstone, 2: 913998.
Lindsay SW, Wilkins HA, Zieler HA, Daly RJ, Petrarca V, Byass P: Ability of Anopheles gambiae mosquitoes to transmit malaria during the dry and wet seasons in an area of irrigated rice cultivation in The Gambia. J Trop Med Hyg. 1991, 94: 313324.
Molineaux L, Gramiccia G: The Garki project: research on the epidemiology and control of malaria in the Sudan savanna of West Africa. 1980, Geneva: World Health Organization
Craig MH, Snow RW, Le Sueur D: A climatebased distribution model of malaria transmission in Africa. Parasitol Today. 1999, 15: 105111. 10.1016/S01694758(99)013964.
Stuckenberg BR: Effective temperature as an ecological factor in southern Africa. Zool Afr. 1969, 4: 145197.
Spiegelhalter DJ, Best NG, Carlin BP, Linde AVD: Bayesian measures of model complexity and fit. J Roy Stat Soc B. 2002, 64: 583639. 10.1111/14679868.00353.
Acknowledgements
We thank the Botswana Ministry of Health for contributing their data to the efforts of the MARA project. We acknowledge Tom Smith and Penelope Vounatsou of the Swiss Tropical Institute for revising the manuscript critically for important intellectual content and thank them for their valuable comments. We thank the South African Medical Research Council and the Rudolf Geigy Stiftung zu Gunsten des Schweizerischen Tropeninstituts for supporting this study, as well as the various funders of the larger MARA project (particularly MIM/TDR and RBM).
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The author(s) declare that they have no competing interests.
Authors' contributions
All authors critically reviewed several versions of the manuscript. MC carried out the bulk of the analysis and drafted the manuscript. IK participated in its design, carried out the spatial analysis and helped draft the manuscript. MM collated the malaria data. MC, IK and MM read and approved the final manuscript. BS recently passed away and we hereby wish to express our deepest appreciation of his leadership, his personal and professional support over the years and his tireless efforts towards malaria control.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Craig, M.H., Sharp, B.L., Mabaso, M.L. et al. Developing a spatialstatistical model and map of historical malaria prevalence in Botswana using a staged variable selection procedure. Int J Health Geogr 6, 44 (2007). https://doi.org/10.1186/1476072X644
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/1476072X644