We use a spatial ABM designed to explore cholera diffusion in Kumasi, Ghana in 2005 in combination with survey data. Two online surveys were run to gather data on people’s risk perception for cholera: the MOOC survey (Geohealth online course) and the Google survey (an online survey). While most of the questions were identical in the two surveys (Appendices 1 and 2), there was one difference. In the MOOC survey, participants chose whether or not to use river water for drinking by judging its quality from its visual appearance (pictures shown). The Google survey collected information on the influence of individual risk factors on the willingness to use the river water without visuals, using only a textual description of the water quality situation. Participants’ risk perception was elicited for single factors and for combinations of factors.
Spatial agent-based model
To quantitatively assess the impacts of different workflows, we implemented different BNs within an existing spatial model that studies cholera diffusion in Ghana. The Cholera ABM (CABM) was originally developed by Augustijn et al. (2016) using NetLogo for a 2005 epidemic in Kumasi, Ghana. CABM simulates two different cholera infection pathways: via the environment (lower infectiousness) and human-environment-human infection (hyper-infectious). When passing through the digestive system, cholera bacteria transition to a hyper-infectious state. When faecal materials from cholera patients are deposited at open dumpsites, runoff during heavy rains can carry the infection to nearby rivers, and as people use the river water for domestic purposes, this runoff can contribute to the diffusion of the disease. The original model does not include any risk awareness of agents: even when the number of disease cases increases, agents keep on using river water.
A total of four agent types are present in the CABM: households, individuals, media, and rain particles. The model consists of three sub-models: a hydrological model, an activity model (including learning), and a disease model. The study area covers 19 km² and consists of 21 communities (Fig. 1 left), eleven of which are completely included. There are no administrative boundaries for these communities; for this model, the developers determined the boundaries using Thiessen polygons. The spatial environment of CABM consists of: (1) elevation surface data (DEM), used to construct the hydrological environment and determine the flow direction and flow accumulation of the rain drops; (2) the dumpsites, with actual locations gathered using a global positioning system (GPS); (3) the house layer with different income levels (high, medium, and low); (4) the river; and (5) the centre and ID of the communities (Fig. 1).
A two-stage ML algorithm was implemented in the model to simulate the intelligent processes of risk perception and agent decision making outlined by Abdulkareem et al. Protection Motivation Theory is one of the dominant approaches in behavioural science and was used as the theoretical framework in CABM. This theory suggests that risky decisions consist of two stages: threat appraisal (risk perception) and coping appraisal. Accordingly, two BNs were implemented to represent these two appraisals using R software (connected to NetLogo via the R extension). All agents in CABM except media (i.e., households, individuals, and rain particles) are associated with a particular location at each moment. Household agents that use BNs also have spatial intelligence: they sense and understand their spatial environment and make intelligent decisions based on changes in it. Here, household agents using BNs perceive the environment where they go to fetch water, evaluate its visual pollution, and combine this information with other factors to evaluate their risk perception using BNs.
High-income agents in our case study always buy bottled water; they have therefore been excluded and are assumed to be protected from cholera. Low- and middle-income household agents, who have no access to tap water, use the BNs to decide whether to fetch water from the river. They perceive and evaluate risk using the first BN (BN1) and, when risk is perceived, they cope by deciding what type of water to use by employing a second BN (BN2). The structure of the network and its priors are the same for all agents at initialization, but as agents learn, the conditional probabilities depend on the evidence each agent receives and therefore change differently for each agent. Risk assessment is based on four criteria: visual pollution of water collection points, media attention, communication with neighbours, and individual memory. During the coping process, agents can choose between buying water, boiling river water, or fetching water from another water collection point. Both BN1 and BN2 were designed on the basis of expert knowledge.
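The two-stage decision described above can be sketched as follows. This is a minimal illustration only: the functions `p_risk` and `choose_water`, the option labels, and every probability value are hypothetical placeholders, and the uniform choice among coping options merely stands in for BN2.

```python
import random

# All numbers below are illustrative placeholders, not the CPT values used in CABM.
VP_BASE_RISK = {"no": 0.05, "low": 0.35, "high": 0.70}
COPING_OPTIONS = ("buy_water", "boil_river_water", "other_collection_point")


def p_risk(vp, media, neighbours, memory):
    """Stand-in for BN1 (threat appraisal): assumed P(risk | evidence)."""
    boost = 0.10 * sum((media, neighbours, memory))  # each active factor adds 0.10
    return min(1.0, VP_BASE_RISK[vp] + boost)


def choose_water(income, vp, media, neighbours, memory, rng):
    """Two-stage decision: BN1 decides whether risk is perceived, BN2 picks a coping action."""
    if income == "high":
        return "buy_water"  # high-income households always buy bottled water
    if rng.random() >= p_risk(vp, media, neighbours, memory):
        return "use_river_water"  # no risk perceived, so the river is still used
    return rng.choice(COPING_OPTIONS)  # stand-in for BN2 (coping appraisal)


rng = random.Random(42)
print(choose_water("low", "high", True, False, True, rng))
```

The point of the sketch is the control flow: coping is only evaluated once the threat appraisal has flagged a risk.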
In the present study, we alter the ability of BN1 to guide individual risk perception by using new BNs to run our experiments. In particular, our original BN1 that used expert opinions to define both BN structure and parameters is re-implemented in this work to serve as a benchmark. The coping appraisal decision determining which water source agents use remained unchanged (BN2 in ).
To derive a BN from data, micro-level behavioural datasets describing the relation between risk perception and the model variables are needed, including visual pollution, media, neighbour, and memory.
This data was collected during two surveys: a survey among international participants of the Massive Open Online Course (MOOC) Geohealth and an online survey (Google survey). The MOOC Geohealth was organised by the Faculty of Geo-Information Science and Earth Observation (ITC) of the University of Twente, in the Netherlands during 2016 and 2017 with 194 and 235 participants from 92 countries (54% were from Africa, including Ghana) completing the survey. MOOC participants were split randomly into four equal subgroups, which were shown pictures of rivers with different levels of visual pollution (Fig. 2). All subgroups answered the same set of survey questions testing their willingness to use the river water for drinking and cooking purposes (Appendix 1). In every subsequent question, additional information on other factors such as memory, media attention, and communication with neighbours was provided.
Information on the influence of individual risk factors on water use was collected during a separate survey implemented using a Google form (Appendix 2). The importance of factors including ‘Visual pollution’, ‘Media’, and ‘Contact with neighbours’ was surveyed both individually and in combination with other factors. This survey was distributed to students enrolled in Master of Science courses in Geoinformatics and urban planning and management at the Faculty of ITC, University of Twente. In total, this led to 125 survey participants from 33 countries (35% of them were from Africa including Ghana).
We combine the data gathered from these two surveys into one dataset to ensure that BNs are constructed and trained with all possible combinations of factor states and risk perception responses.
Integration of empirically-driven BNs in the spatial ABM
All BNs consist of a number of nodes connected by links in the form of a directed acyclic graph (DAG). When integrated into a spatial ABM to enhance agent intelligence, each node represents a variable in the agent decision-making process simulated in that model. In the present study, these variables include: the observation of visual pollution at water collection points (VP), the reporting of media on cholera cases (Media), communication with neighbours that may or may not have cholera in their household (Neighbours), and updating and retrieving memory representing a household’s previous use of the current water source (Memory). The BN supports agent decisions on assessing water infection level (i.e. the Risk node) based on VP, Media, Neighbours, and Memory. The Media, Neighbours, and Memory nodes have a Boolean value (true or false) indicating the presence or absence of their relationship to risk. The VP node has three states: no, low, and high. These states indicate the level of visual pollution at water collection points. In the survey there were four different states, which were mapped as follows: clean water (no risk), brown water or a small amount of garbage around water collection points (low risk), and garbage on the river banks and in the river (high risk). In this paper, we explore four combinations for the specification of the BN network structure and CPT. Namely, we run our spatial ABM with BNs designed: (i) based on data only (both structure and CPT are derived from data); (ii) based on data complemented with expert knowledge (structure is data-driven but CPT is expert-driven); (iii) based on a structure that is expert-driven with a CPT that is data-driven; and (iv) based on expert knowledge only (both structure and CPT are derived from expert knowledge).
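To make the node states concrete, the sketch below encodes the survey-to-VP mapping described above and enumerates every evidence configuration an agent could present to the Risk node. The dictionary keys and the helper `evidence_configurations` are illustrative names, and the enumeration assumes, for illustration, that Risk is queried with all four factors observed.

```python
from itertools import product

VP_STATES = ("no", "low", "high")
BOOLEAN_NODES = ("Media", "Neighbours", "Memory")

# Four survey picture categories collapsed into three VP states (labels are shorthand).
SURVEY_TO_VP = {
    "clean water": "no",
    "brown water": "low",
    "small amount of garbage": "low",
    "garbage on banks and in river": "high",
}


def evidence_configurations():
    """Yield every combination of VP state and Boolean factor values."""
    flags_per_node = [(False, True)] * len(BOOLEAN_NODES)
    for vp, *flags in product(VP_STATES, *flags_per_node):
        yield {"VP": vp, **dict(zip(BOOLEAN_NODES, flags))}


configs = list(evidence_configurations())
print(len(configs))  # 3 VP states x 2^3 Boolean combinations
```

Under this assumption, a CPT for the Risk node conditioned on all four factors would need one row per configuration, which is why the combined survey dataset has to cover every combination.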
Any node can be updated upon new evidence, even when it is related to multiple variables. The evidence acquired about a state variable should propagate to update states in the rest of the network, and this process requires network training (learning). BNs can be trained either by using data or by eliciting expert knowledge. BN training is performed via a flow of information through the network, and it can take place prior to using the network (i.e., before implementation within the ABM) when data are available, or continue during the simulation runs (when it is an expert-driven network). In the first case, the training process ends with final probabilities (posterior probabilities) that the network will continue to produce every time it is consulted by the ABM. In the second case, the BN model needs improvement since it is not fully trained at the start of the simulation, but it will be trained using data (agent decisions) generated during the simulation. This process is called sequential learning. Usually, when no data are available to construct BNs, the adjustment of parameters (nodes) takes place when the network performs identification based on new evidence.
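One common way to realise sequential learning is to treat the expert's initial probability as pseudo-counts and add one count per new observation. The sketch below shows this for a single CPT entry; it illustrates the idea and is not the update rule of the R implementation used with CABM. The class name `SequentialCPT` and the `strength` parameter are assumptions.

```python
class SequentialCPT:
    """One CPT entry, P(risk = true | a fixed parent configuration), learned online."""

    def __init__(self, prior_true, strength=10.0):
        # The expert prior is encoded as `strength` imaginary observations.
        self.true_count = prior_true * strength
        self.false_count = (1.0 - prior_true) * strength

    def update(self, risk_observed):
        """Add one piece of evidence generated during the simulation."""
        if risk_observed:
            self.true_count += 1.0
        else:
            self.false_count += 1.0

    def probability(self):
        return self.true_count / (self.true_count + self.false_count)


entry = SequentialCPT(prior_true=0.5, strength=10.0)
for _ in range(10):
    entry.update(True)
print(entry.probability())  # prior 0.5 pulled towards the observed evidence
```

A larger `strength` makes the expert prior harder to overturn, whereas a pre-trained network would simply keep its probabilities fixed when consulted.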
We compare four different BNs (DDBN, DEBN, EDBN, and EEBN). The first letter in the BN acronyms refers to the information source, data-driven (D) or expert-driven (E), for their derived structure, while the second letter refers to the estimation of probabilistic parameters. When a BN is expert-driven, we designed the structure and/or retrieved the parameters based on census data of the case study area or literature dealing with risk perception of waterborne diseases (e.g., [33,34,35]). An overview of the networks is provided below.
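The naming scheme can be decoded mechanically, as in the short sketch below. The helper `describe_bn` is a hypothetical name; the training timing it reports follows the experimental setup in this paper, where networks with data-driven parameters are trained before the simulation and those with expert-driven parameters are trained during it.

```python
SOURCES = {"D": "data-driven", "E": "expert-driven"}


def describe_bn(acronym):
    """Decode a BN acronym: first letter = structure, second letter = parameters."""
    structure, parameters = acronym[0], acronym[1]
    return {
        "structure": SOURCES[structure],
        "parameters": SOURCES[parameters],
        # Data-driven parameters are fitted up front; expert-driven ones adapt online.
        "training": "before simulation" if parameters == "D" else "during simulation",
    }


for name in ("DDBN", "DEBN", "EDBN", "EEBN"):
    print(name, describe_bn(name))
```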
In DDBN, both the BN structure and its parameter values were driven by survey data. The score-based algorithm “Tabu search” is used to construct the BN. This algorithm makes use of a goodness-of-fit score function for evaluating graphical structures with regard to a dataset. Tabu search is a metaheuristic algorithm that uses short-term memory to ensure that the search explores new areas without getting trapped in a local optimum. In this algorithm, the fitting function is used to score a network/DAG with respect to the training data, and a search method is used to determine the highest-scoring network structure. The algorithm iteratively improves the score until it converges on the best-scoring structure found. DDBN was trained prior to the simulation.
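The following toy implementation illustrates how a score-based tabu search over DAGs operates. It is a simplified sketch and not the R routine used for DDBN: the BIC score, the tabu list of recently visited structures, the function names, and the synthetic three-variable dataset are all assumptions made for illustration.

```python
import math
from collections import Counter, deque
from itertools import permutations


def bic(data, nodes, arcs):
    """Decomposable BIC of a discrete BN (log-likelihood minus complexity penalty)."""
    total = 0.0
    for node in nodes:
        parents = sorted(p for p, c in arcs if c == node)
        joint = Counter((tuple(r[p] for p in parents), r[node]) for r in data)
        marg = Counter(tuple(r[p] for p in parents) for r in data)
        loglik = sum(k * math.log(k / marg[cfg]) for (cfg, _), k in joint.items())
        levels = len({r[node] for r in data})
        parent_cfgs = math.prod(len({r[p] for r in data}) for p in parents)
        total += loglik - 0.5 * math.log(len(data)) * (levels - 1) * parent_cfgs
    return total


def is_acyclic(nodes, arcs):
    """Repeatedly strip nodes with no remaining parents; failure means a cycle."""
    remaining = set(nodes)
    while remaining:
        roots = {v for v in remaining
                 if not any(c == v and p in remaining for p, c in arcs)}
        if not roots:
            return False
        remaining -= roots
    return True


def neighbours(nodes, arcs):
    """All structures reachable by adding, deleting, or reversing one arc."""
    for a, b in permutations(nodes, 2):
        if (a, b) in arcs:
            yield arcs - {(a, b)}
            yield (arcs - {(a, b)}) | {(b, a)}
        elif (b, a) not in arcs:
            yield arcs | {(a, b)}


def tabu_search(data, nodes, tabu_size=10, max_iter=50):
    current = frozenset()
    best, best_score = current, bic(data, nodes, current)
    tabu = deque([current], maxlen=tabu_size)  # short-term memory of visited DAGs
    for _ in range(max_iter):
        candidates = [g for g in map(frozenset, neighbours(nodes, current))
                      if g not in tabu and is_acyclic(nodes, g)]
        if not candidates:
            break
        # Move to the best non-tabu neighbour even if it scores worse than `current`.
        current = max(candidates, key=lambda g: bic(data, nodes, g))
        tabu.append(current)
        score = bic(data, nodes, current)
        if score > best_score:
            best, best_score = current, score
    return best, best_score


# Synthetic data: B copies A, while C is independent of both.
nodes = ("A", "B", "C")
data = [{"A": i % 2, "B": i % 2, "C": (i // 2) % 2} for i in range(40)]
best, best_score = tabu_search(data, nodes)
print(sorted(best), round(best_score, 2))
```

Accepting the best non-tabu move even when it lowers the score is what lets the search leave a local optimum, while the separately tracked `best` ensures the highest-scoring structure seen is the one returned.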
The structure of EDBN follows the same approach as in the MOOC surveys. In the survey, we first showed participants a picture of water with a specific level of visual pollution, followed by questions related to the other factor(s). In this approach, VP is assumed to be the parent of the other factors. We then derive the probabilities of node states and the CPT from the survey data and train the network prior to simulation.
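Deriving a CPT from survey responses amounts to computing conditional relative frequencies once the parent of each node is fixed. The sketch below does this for one parent-child pair; the function `conditional_table` and the four example records are invented for illustration and are not survey data.

```python
from collections import Counter


def conditional_table(records, child, parent):
    """Estimate P(child | parent) as relative frequencies in the records."""
    joint = Counter((r[parent], r[child]) for r in records)
    parent_totals = Counter(r[parent] for r in records)
    return {(p, c): n / parent_totals[p] for (p, c), n in joint.items()}


# Invented responses: VP state shown to the respondent and whether risk was perceived.
records = [
    {"VP": "high", "Risk": True},
    {"VP": "high", "Risk": True},
    {"VP": "high", "Risk": True},
    {"VP": "high", "Risk": False},
]
cpt = conditional_table(records, child="Risk", parent="VP")
print(cpt[("high", True)])
```

Parent-child combinations that never occur in the records get no entry at all, which is one reason to pool the surveys so that every combination of factor states is observed.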
The structure of DEBN is identical to that of DDBN. However, an expert assigned probabilities to node states and the related CPTs to formulate logical scenarios. These values were derived from the literature and used in the BNs.
EEBN is a fully expert-driven network adopted from Abdulkareem et al. (2018). The probability values of the network variables were derived from the available literature and census data for Kumasi, Ghana. Here, the EEBN settings reproduce the original setup and serve as a benchmark against which to compare the three alternative combinations of survey and expert data for BNs.
Goodness of fit of BNs and model output
BN validation was conducted in two steps: validation of the network structure using scoring functions and validation of the learned parameters (CPT) (Fig. 3). We also compared the outcome of integrating the four BNs into the CABM with the survey data to validate the realism of agent risk perception.
There are two approaches that are commonly used to measure the goodness of fit of BN models. The first is to test whether the conditional independence assertions implied by the structure of the BN model are satisfied by the training dataset. The second is to evaluate the degree to which the resulting structure describes the data. To achieve this, we use scoring functions. Many scoring functions exist, and the most popular are AIC (Akaike information criterion), BIC (Bayesian information criterion), and Bayesian Dirichlet with likelihood equivalence (BDe). The primary issue with scoring functions is the absence of an objective method to determine which function is optimal. AIC provides a relative measure of the information lost when a given BN model is used to represent reality, while BIC is a penalised-likelihood criterion intended to select the true model underlying the data. Moreover, BDe calculates the joint probability of a BN model for a given dataset. Overall, the optimal model from the set of BN models is the one with the higher absolute AIC, BIC, and BDe values.
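As a worked illustration of how penalised scores trade fit against complexity, the sketch below computes AIC and BIC on the log-likelihood scale, a convention in which larger (less negative) values indicate a better model. The two candidate models, their log-likelihoods, and their parameter counts are hypothetical numbers chosen only to show the mechanics.

```python
import math


def aic(loglik, n_params):
    """AIC on the log-likelihood scale: fit minus one unit per free parameter."""
    return loglik - n_params


def bic(loglik, n_params, n_obs):
    """BIC on the log-likelihood scale: the penalty grows with log(sample size)."""
    return loglik - 0.5 * n_params * math.log(n_obs)


# Hypothetical candidates: (log-likelihood, number of free parameters).
candidates = {"sparse": (-420.0, 6), "dense": (-415.0, 14)}
n_obs = 500

for name, (loglik, n_params) in candidates.items():
    print(name, round(aic(loglik, n_params), 2), round(bic(loglik, n_params, n_obs), 2))
```

In this made-up case the denser network fits slightly better, but both criteria still prefer the sparse one because the gain in likelihood does not cover the extra parameters.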
To address the main research questions, we follow a number of steps (Fig. 4). The primary elements of this workflow are explained in the following sections. We conduct a total of four experiments, in which we run the CABM with each of the four BNs for 100 random-seed runs, creating a new synthetic population every 5 runs. We provide the mean values across 100 sets of runs for all output metrics. In the first set of experiments, we run the CABM with DDBN and EDBN, training them prior to the simulations. Then we run the CABM with DEBN and EEBN, training them during the simulations to adjust the initial CPT values of both networks proposed by expert knowledge. Since DDBN and EDBN are trained prior to the simulation, the test data for the goodness of fit of these two BNs come from the ABM. Conversely, since DEBN and EEBN are trained while running the simulation, the survey data serve as the goodness-of-fit data. The sample size of all validation datasets is equal (i.e., the size of the empirical data) to balance the scores.
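The run design above implies 20 distinct synthetic populations per experiment (100 runs, with a new population every 5 runs). A minimal sketch of such a schedule and of the pointwise averaging of output curves is given below; the function names and the toy curves are assumptions, and the actual simulations are run in NetLogo.

```python
from statistics import mean

N_RUNS = 100
RUNS_PER_POPULATION = 5


def run_schedule():
    """Pair each run index with the index of the synthetic population it uses."""
    return [(run, run // RUNS_PER_POPULATION) for run in range(N_RUNS)]


def mean_curve(curves):
    """Pointwise mean of equally long output curves (one curve per run)."""
    return [mean(step) for step in zip(*curves)]


schedule = run_schedule()
print(len({population for _, population in schedule}))  # number of distinct populations
print(mean_curve([[0, 2, 4], [2, 4, 6]]))
```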
To compare model outcomes with the survey results, we calculated the average percentage of agents that perceived risk during the simulation. These percentages were computed for each risk factor and for combinations of factors. In addition, mean epidemic curves and risk perception curves were obtained and compared.
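The percentages per factor combination could be tabulated from a log of agent decisions as sketched below. The log format (a set of active factors plus a Boolean outcome), the function `risk_share_by_factors`, and the example entries are hypothetical and not taken from the model output.

```python
from collections import Counter


def risk_share_by_factors(log):
    """Percentage of decisions with perceived risk, grouped by active factor set."""
    totals, risky = Counter(), Counter()
    for factors, perceived in log:
        key = "+".join(sorted(factors))
        totals[key] += 1
        risky[key] += int(perceived)
    return {key: 100.0 * risky[key] / totals[key] for key in totals}


# Invented log entries: (active factors, risk perceived?).
log = [
    ({"VP"}, True),
    ({"VP"}, False),
    ({"VP", "Media"}, True),
    ({"Media", "VP"}, True),
]
print(risk_share_by_factors(log))
```

Sorting the factor names before joining them makes the grouping independent of the order in which factors were recorded.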
BN models and resulting spatial patterns
The implementation of different BN models may impact the behaviour of agents based on their location in space. To evaluate this impact, we present a set of maps that show the spatial distribution of risk perception variables. The following risk perception factors are displayed per community for each BN model: VP alone, VP with media, VP with contact with neighbours, VP combined with media and contact with neighbours, and the combination of media and contact with neighbours. Risk perception based on media alone or on contact with neighbours alone does not occur.