To demonstrate the parallel implementation, four different cases were examined (see Table 1). These demonstration cases are not intended to provide any substantive analysis of evacuation behaviour but to demonstrate the performance improvements possible using the parallel implementation of buildingEXODUS. The first two cases are theoretical in nature and have been designed to gain a greater understanding of the performance of the parallel implementation. The last two cases are based on realistic scenarios: a high-rise building and a large public gathering in a large urban space. Earlier results produced by a prototype implementation of the parallel buildingEXODUS have been improved using the latest version of the parallel implementation described in this paper.
The cases demonstrated here were tested on a 64-bit Windows cluster consisting of 10× Intel Core 2 Duo (dual core) 3.16 GHz computers connected via a 1 Gbit Ethernet switch.
The software is stochastic in nature and would normally be run a number of times, generating a distribution of predicted evacuation times, which would also result in a small distribution of runtimes. However, for the purposes of this demonstration work only, the software was configured to operate in a deterministic manner, so that movement decisions and conflicts that are sometimes determined by random selection were instead resolved deterministically. This eliminated variability between simulations and ensured that the runtimes of the simulations were consistent. Each case study configuration was run three times and it was found that there was less than a second difference in runtime for any particular configuration. The predicted evacuation time for each case study was identical across the serial and parallel configurations.
The results have been split into single core and dual core results. It has been found in previous work that better speedups are obtained on two single processors connected via a network compared to two processors (cores) on a shared memory machine [38, 39]; typically dual cores operate at a computational speedup of 1.7–1.9 compared to a single core. This effect is due to memory bus contention on the shared memory computer with both processes trying to simultaneously access the single memory bus within the computer. However, in some cases this memory bus contention is so high that the use of two processors in a shared memory computer produces little, if any, speedup compared to a single processor.
In the remainder of this paper two types of speedup will be discussed: computational speedup and real-time speedup. Computational speedup compares the run time of the parallel implementation against the serial or single processor run time. Real-time speedup, or speedup over real time, compares the run time of the parallel or serial implementation of the software against the time required to complete the evacuation as predicted by the software. More detailed tables of results are available in Appendix B.
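The two measures defined above are simple ratios. The following is a minimal sketch only; the function names are hypothetical and are not part of buildingEXODUS. The example values are the Case 3 figures reported later in this paper (8000 agents on 8 single core computers).

```python
def computational_speedup(serial_runtime_s: float, parallel_runtime_s: float) -> float:
    """Serial (single processor) run time divided by the parallel run time."""
    return serial_runtime_s / parallel_runtime_s


def real_time_speedup(predicted_evacuation_s: float, runtime_s: float) -> float:
    """Predicted evacuation time divided by the wall-clock run time.

    A value above 1 means the simulation runs faster than real time.
    """
    return predicted_evacuation_s / runtime_s


# Case 3, 8000 agents: predicted evacuation 50 min 7 s, run time 2 min 48 s.
evacuation_s = 50 * 60 + 7   # 3007 s
runtime_s = 2 * 60 + 48      # 168 s
case3_real_time = real_time_speedup(evacuation_s, runtime_s)
print(round(case3_real_time, 1))  # 17.9
```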
Case 1: Idealised Large Geometry
This test case is intended to represent an ideal case for the parallel implementation. It has been designed so that there is no sub-domain boundary interaction and that the problem is well load balanced throughout the entire simulation. This test was devised to explore the upper limits of speedup potential possible with parallel buildingEXODUS.
The geometry is 4000 m long and 25 m wide, producing an area of 100,000 m². There are twenty 5 m wide exit points located along one side of the geometry, each separated by 200 m, with the first and last exits being 100 m from each end of the geometry. A population of 100,000 agents (1 person/m²) is uniformly distributed throughout the domain and moves towards the nearest exit point.
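The exit layout described above can be checked with a short sketch. It assumes, as an illustration only, that the domain is cut along its length into twenty 200 m slices; the names `EXIT_X`, `slice_of` and `nearest_exit` are hypothetical. Because all exits lie on the same side, the nearest exit depends only on the position along the length.

```python
LENGTH_M = 4000.0
N_EXITS = 20
SPACING_M = LENGTH_M / N_EXITS  # 200 m between neighbouring exits

# Exit centres: first and last exits are 100 m from each end.
EXIT_X = [SPACING_M / 2 + i * SPACING_M for i in range(N_EXITS)]


def slice_of(x_m: float) -> int:
    """Index of the assumed 200 m slice that contains position x_m."""
    return min(int(x_m // SPACING_M), N_EXITS - 1)


def nearest_exit(x_m: float) -> int:
    """Index of the exit closest to position x_m along the length."""
    return min(range(N_EXITS), key=lambda i: abs(EXIT_X[i] - x_m))


# Sample agent positions at 1 m cell centres, which avoids the exact
# slice boundaries where two exits are equidistant.
samples = [i + 0.5 for i in range(int(LENGTH_M))]
stays_in_slice = all(slice_of(x) == nearest_exit(x) for x in samples)
print(stays_in_slice)
```

Under this assumed slicing, every sampled agent's nearest exit lies inside its own slice, which is the property that removes sub-domain boundary interaction in this case.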
To ensure an optimal decomposition, the domain is split into 20 equal sub-domains (each sub-domain has 5000 agents) and these are allocated to processors by taking the modulus of the sub-domain number with the total number of processors used. Due to the layout of the geometry, the population does not have to cross between processor boundaries to reach an exit point.
The following configurations were used to test the speedup potential: 1, 2, 3, 4, 5, 7 and 10 single cores; 1, 2, 3, 4, 5 and 10 dual cores. Configurations that are not represented here would not give any speedup improvement over the configurations that have been simulated. For example, with 5 (single core) computers each computer would have 4 sub-domains to process; if 6 (single core) computers were utilised then 4 computers would have 3 sub-domains but 2 computers would have 4 sub-domains, and this would limit the speedup to being the same as that of 5 (single core) computers. The analytical computational speedup in this instance is calculated by dividing the total number of sub-domains by the maximum number of sub-domains residing on any individual processor. The predicted evacuation time for this scenario was 14 min 26 s and was consistent across all the parallel simulations and the serial version of buildingEXODUS. The computational speedups for the various permutations are presented in Fig. 10.
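The modulus allocation and the resulting analytical speedup can be reproduced with a few lines. This is a sketch under stated assumptions: the function names are hypothetical, sub-domains are numbered from 0 (numbering from 1 gives the same counts per processor), and a dual core computer is treated as two processors.

```python
from collections import Counter


def allocate(n_subdomains: int, n_processors: int) -> list[int]:
    """Processor index for each sub-domain: sub-domain number modulo processor count."""
    return [s % n_processors for s in range(n_subdomains)]


def analytical_speedup(n_subdomains: int, n_processors: int) -> float:
    """Total sub-domains divided by the largest number held by any one processor."""
    load = Counter(allocate(n_subdomains, n_processors))
    return n_subdomains / max(load.values())


for p in (5, 6, 10, 20):
    print(p, analytical_speedup(20, p))
# 5 5.0
# 6 5.0
# 10 10.0
# 20 20.0
```

With 6 processors the two busiest processors still hold 4 sub-domains each, so the analytical speedup is the same as with 5 processors.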
An impressive computational speedup of 10.9 was achieved using 10 computers compared to a single computer. An examination of all the single core results shows that a super-linear speedup, a speedup greater than the number of processors, is achieved for all the configurations. It is difficult to say precisely why a super-linear speedup has resulted, but it could be due to hardware effects [45, 46], such as the increase in overall processor cache size.
The maximum computational speedup of 13.9 achieved with 10 dual core computers is less than the analytical speedup of 20. The analytical speedup is based on the premise that doubling the processor/core count would double the speedup. This ignores the inter-processor communication cost and also the effect of cores sharing the memory bus when the computer is utilising dual cores. These speedups are good but are unlikely to be achieved for most practical scenarios. They do, however, demonstrate the peak performance that can be achieved from this type of application.
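The reported speedups can also be read as parallel efficiencies. The sketch below uses the conventional definition (speedup divided by the number of processing units), which is an assumption here since the paper does not define efficiency explicitly; the function name is hypothetical.

```python
def parallel_efficiency(speedup: float, n_processing_units: int) -> float:
    """Speedup divided by the number of processors or cores used."""
    return speedup / n_processing_units


# Case 1 figures quoted above.
single_core_eff = parallel_efficiency(10.9, 10)  # 10 single core computers
dual_core_eff = parallel_efficiency(13.9, 20)    # 10 dual core computers, 20 cores

print(f"{single_core_eff:.2f}")  # 1.09, i.e. super-linear
print(f"{dual_core_eff:.3f}")    # 0.695
```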
In a real scenario near peak performance will be achieved in the initial stages of the simulation as there is likely to be a good load balance due to the initial population distribution being reasonably uniform. However, as the simulation progresses it is expected that the parallel efficiency will decrease as the load becomes unbalanced due to the movement of people.
It is also worth noting that on a single processor the run time for this particular simulation was some 3.5 times longer than the time required for the actual evacuation, i.e. the simulation was some 3.5 times slower than real time. However, using 10 dual core processors produced a real-time speedup of 4.
Case 2: Long Open Area
This scenario consists of 100,000 agents in an open rectangular geometry measuring 100 m by 1000 m, producing an area of 100,000 m² and a crowd density of 1 person/m². The crowd moves to the left to exit the geometry. The left side of the geometry is completely open, creating a 100 m wide exit. This case can be considered as a realisation of the analytical study in Sect. 2.3.
The geometry was sliced into N equally sized sub-domains in the same fashion as illustrated in Fig. 9. For this case the domain was split into 20, 50 and 100 sub-domains respectively. Each of these partitions was examined using 1–10 computers in both single and dual core processor modes and this resulted in 60 total permutations for this one case.
The predicted evacuation time for this geometry is 14 min 20 s. This predicted time was consistent across the various parallel implementations and with the serial version of buildingEXODUS.
This particular simulation was designed to represent a non-congested exit flow, thus leading to a short overall predicted evacuation time with minimal computational requirements. The computational speedups for single and dual core configurations are depicted in Fig. 11. The analytical speedup is calculated using Eq. 2.
Theoretically, 100 sub-domains should give the best speedup performance for all the possible permutations because the best load balance is maintained throughout the simulation. In practice other factors influence the performance, including the cost of communication, which increases with additional sub-domains, and hardware effects such as increased cache size and, in the case of multi-cores, a shared memory bus. The use of a shared memory bus and increasing communication cost both reduce the actual performance from the analytical prediction. Conversely, the increased cache size made available by increasing the number of computers improves the performance beyond the analytical prediction. The actual speedup performance is therefore a combination of the load balance and these other factors. When two computers are used, either in single or dual core mode, the best partition is 20 sub-domains for this particular problem. This is because the analytical load balance is comparable to that of the other partitions while the communication cost is far lower than for the partitions with a higher number of sub-domains. As the number of processors/cores is increased, the analytical speedups of the partitions, and therefore their load balances, diverge further. When 10 dual core computers are used, the disadvantage of increased communication for the 100 sub-domain partition compared to the 20 sub-domain partition is far outweighed by the improved load balance achieved.
The run time for a single core is 1.2 times faster than the predicted evacuation time. The real-time speedup for 10 dual core processors using the 100 sub-domain partition is 17.2. Additional processors should further improve the speedup.
Case 3: High-Rise Building
The high-rise building scenario consists of 50 floors with four emergency exit staircases and a floor area of 1800 m² per floor. The staircases are not equally spaced within the building core, leading to some of the staircases attracting more people than others. This test case was run with two population sizes, 8000 agents (160 people per floor) and 16,000 agents (320 people per floor), uniformly distributed throughout the building on each floor. Unlike the previous case, this scenario involves a great deal of congestion as agents attempt to gain access to the stairs and as they traverse down the stairs. The predicted evacuation time for this case was 50 min for 8000 agents and 1 h 30 min for 16,000 agents.
The partitioning strategy adopted for this high rise building was based on dividing the building into 25 sub-domains vertically i.e. two floors per sub-domain. These sub-domains are then divided horizontally into four quadrants of equal floor area with a staircase associated with each quadrant. In total there were 200 sub-domains used in the analysis.
The computational speedups obtained for both the single core and dual core processors are presented in Fig. 12. We note that the computational speedup produced for the 16,000 agent population is significantly higher than that for the 8000 agent population. This is a well-known phenomenon in most parallel processing problems: speedup performance improves with increasing problem size. It is due to the relative decrease of communication time with increasing computation size; this decrease is partly attributable to the fact that network communications have a fixed start-up cost in addition to the cost of sending the actual data. For this particular problem it is also related to the fact that increasing the population size by a factor of two increases the communication cost by approximately a factor of two, whereas the computational cost increases by a factor of 4.5 due to the increased congestion.
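The two effects described above can be illustrated with a simple cost model. This is a sketch only: the function names are hypothetical, the 50 µs start-up cost is an assumed illustrative value rather than a measurement, and the per-byte cost of 8 ns corresponds to the nominal 1 Gbit/s line rate.

```python
def message_time_s(n_bytes: int, startup_s: float = 50e-6, per_byte_s: float = 8e-9) -> float:
    """Fixed start-up cost plus a cost proportional to the data sent."""
    return startup_s + n_bytes * per_byte_s


def relative_overhead_change(comm_factor: float, compute_factor: float) -> float:
    """Factor by which the communication/computation ratio changes."""
    return comm_factor / compute_factor


# The fixed start-up cost is amortised over larger messages.
cost_per_byte_small = message_time_s(1_000) / 1_000
cost_per_byte_large = message_time_s(100_000) / 100_000

# Doubling the population: communication x2, computation x4.5.
change = relative_overhead_change(2.0, 4.5)
print(round(change, 2))  # 0.44
```

In this model the share of run time spent on communication falls to roughly 44% of its previous value when the population doubles, which is consistent with the higher speedup observed for the larger population.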
We also note from Fig. 12 that the performance for the 8000 population has tailed off as more processors are added to the cluster, with no improvements being derived from adding more than eight processors to the cluster. However, for the 16,000 population further improvements in performance could be derived by adding additional computers (single core) to the cluster beyond 10, but the return gained from adding additional computers is diminishing.
Using 8 single core computers, the evacuation of the 8000 population can be simulated in 2 min 48 s. The predicted evacuation time is 50 min 7 s. This represents a real-time speedup of 17.9. The 16,000 population requires 1 h 30 min to evacuate and using 10 processors the evacuation can be simulated in just over 7 min, representing a real-time speedup factor of 12.9.
From Fig. 12 it is noted that the dual core simulations can return poorer performance than the single core simulations. Furthermore, unlike the single core simulations, the speedup reaches a peak, with 7 processors for the 8000 population and with 8 processors for the 16,000 population. Adding computers beyond these critical values actually diminishes performance. As noted earlier, the performance of the parallel implementation improves with increasing problem size. As the dual core computers have a greater computational performance, they require a larger problem size to make better use of the number of computers available in the cluster.
Case 4: Large Urban Space
This case study is a rough approximation of the Trafalgar Square public area in central London, UK. The usable pedestrian area measures approximately 46,000 m² and there are 14 exit routes from the domain (see Fig. 13).
Two scenarios were created by populating the domain with 60,000 agents (1.3 p/m²) and 120,000 agents (2.6 p/m²). Unlike the other test cases, the agents were assigned exit points prior to the evacuation, leading to more congestion than might be expected from a more usual scenario where individuals leave via their nearest exit point. The exit allocation is roughly proportional to the size of each exit and the individuals allocated to a particular exit are randomly located in the domain. The exit allocation used in this study is not intended to be an accurate representation of actual behaviour in large urban places but is simply used to test the parallel implementation when agents will be contra-flowing and not simply exiting by their ‘nearest’ exit.
The predicted overall evacuation time for 60,000 people is 10 min 27 s, and for 120,000 people is 21 min 49 s. It should be noted that, owing to the lack of real data and the hypothetical exiting conditions assigned to the individuals, no firm conclusions should be drawn concerning the evacuation of Trafalgar Square from these simulations. This test case is purely designed to investigate the performance of the parallel implementation on a large urban space.
For this case the Metis partitioning algorithm was utilised along with multiple sub-domains per processor. Metis is generally used for partitioning computational grids used in Computational Fluid Dynamics (CFD) and Finite Element Analysis (FEA) for parallel processing. Metis divides the domain into a number of sub-domains whilst minimising the size of the sub-domain boundaries. For this case 10, 20, and 32 sub-domains were created. An example partition is illustrated in Fig. 13, where each coloured region represents one of 20 sub-domains within the partition and each coloured sub-domain was allocated to one of five computers.
A visualisation of the evacuation of 60,000 agents is provided in Fig. 14 at three time slices and the congestion at a number of junctions is clearly shown. The computational speedups for parallel buildingEXODUS on this test case are depicted in Fig. 15 for population sizes of 60,000 and 120,000 agents. Results for dual cores are limited to 10 dual core processors due to time constraints and hardware availability.
While the performance derived from using 20 partitions with single cores appears erratic it does produce the overall best single core performance for 60,000 agents returning a speedup of 4.73 from 10 processors. Clearly, as in Case 2, the performance is dependent on the number of partitions used. Using the dual core processors and 20 partitions a speedup of 5.3 from 10 processors is achieved.
It should be noted that the single processor performance is essentially equivalent to real time i.e. it takes as long to compute the evacuation as it does to actually perform the evacuation. Using 10 single core processors with 20 partitions, the run time is some 4.8 times faster than real time and runs in just over 2 min.
From Fig. 15 we note for the single core simulations, as the number of processors increases, the speedup also increases. Using 10 dual core processors with 20 partitions, the run time is some 5.31 times faster than real time and runs in just under 2 min.
Using the best partition suggested by the study involving the population of 60,000 individuals, the 20 sub-domain case was re-run using a population size of 120,000 agents.
As is expected, the larger problem size produces a greater computational speedup. Using single cores, the run time on a single processor is more than twice as long as the predicted evacuation time. However, using 10 single core processors produces a speedup of 5.94 resulting in 10 single core processors running 2.6 times faster than real time. Using 10 dual core processors the speedup is a factor of 6.89 on the single core processor time. This results in the 10 dual core processors running 3 times faster than real time.
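The figures quoted above for the 120,000 agent case are mutually consistent, since the real-time speedup equals the computational speedup divided by the factor by which the single processor run is slower than real time. A minimal check follows; the function names are hypothetical and, because the quoted values are rounded, the results are approximate.

```python
def serial_slowdown(computational_speedup: float, real_time_speedup: float) -> float:
    """Single processor run time as a multiple of the predicted evacuation time."""
    return computational_speedup / real_time_speedup


def real_time_from_computational(computational_speedup: float, slowdown: float) -> float:
    """Real-time speedup implied by a computational speedup and the serial slowdown."""
    return computational_speedup / slowdown


# 10 single core processors: speedup 5.94, 2.6 times faster than real time.
slowdown = serial_slowdown(5.94, 2.6)

# 10 dual core processors: speedup 6.89 relative to the single core run time.
dual_core_real_time = real_time_from_computational(6.89, slowdown)

print(round(slowdown, 2))             # 2.28, i.e. more than twice real time
print(round(dual_core_real_time, 1))  # 3.0
```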
In Fig. 15 there is an uncharacteristic drop off in performance for nine processors when 20 sub-domains are used. This is due to the nature of the distribution of sub-domains on the computers. It is likely that one or two computers are substantially busier than the rest of the computers due to a poor load balance at some stage of the simulation. Although this result is notable, it can be seen that the use of multiple sub-domains per processor is at least partially successful in maintaining speedup in this non-idealised scenario, where it is not possible to determine a priori how the population load will be distributed.