Determining Confidence Intervals, and Convergence, for Parameters in Stochastic Evacuation Models

An issue when using stochastic egress models is how many simulations are required to accurately represent the modelled scenario? Engineers are mostly interested in a representative Total Evacuation Time (TET). However, the convergence of the TET may not ensure that the full range of evacuation dynamics has been adequately represented. The average total egress curve (AC) has been suggested as an improved measure. Unfortunately, defining a confidence interval (CI) for the AC is problematic. CIs can robustly quantify the precision of many statistics and have been used to define convergence in egress modelling and other research fields. This paper presents a novel application of bootstrapping, functional analysis measures (FAMs), and a bisection algorithm, to derive three FAM-based CIs representing the precision of the AC. These CIs were tested using a theoretical model to demonstrate the consistency of the coverage probability, the actual percentage of CIs that contain the theoretical parameter, with the nominal 95% confidence level (NCL). For two of the FAM-based CIs, it was found that the coverage probability was between 94.2% and 95.6% for all tested sample sizes between 10 and 4000 simulations. The third FAM-based CI’s coverage probability was always greater than the NCL and was a conservative estimate, but this presented no problems in practice. A FAM-based CI may suggest if there is more or less variability in an earlier phase of the evacuation. A convergence scheme based on statistical precision, CI widths, is proposed and verified. The method can be extended to other statistics.

Degrees of freedom (n -1) T -1 (p, df) The inverse CDF of Student's t-distribution for p P(x) The probability that x occurs U(z) The CDF of the normal distribution up to z U -1 (p) The inverse CDF of the normal distribution for p bxc Integer floor of x dxe Integer ceiling of x For all agents k = 1 to m Subscripts l Lower CI limit u Upper CI limit

Introduction
A stochastic approach to evacuation simulation [1,2] is employed in many models [3][4][5][6][7][8] to reflect uncertainty in human behaviour [9]. If an (experimental or simulated) evacuation trial is repeated using the same population and the same initial conditions, it is possible that the evacuation will progress differently and may result in differences in total evacuation time (TET) or other statistics of interest. In addition to that uncertainty there is also uncertainty related to the initial conditions where the distribution of agents (e.g. number of agents and starting locations) and their attributes (e.g. walking speeds and response times) may also cause differences between trials. These uncertainties can be simulated by randomly sampling suitable distribution functions for those attributes. A simple random approach is adopted in Monte Carlo egress simulation although other possible sampling methods could be utilised [10].
Many simulations are needed to represent the range of potential behaviours which may be expected for a particular scenario. However, this is set against the potentially significant computational cost of running a stochastic egress model which an end user would wish to minimise. Thus, a major issue facing stochastic egress model users is determining when enough simulations have been performed to give an accurate representation of the modelled scenario and the statistics of interest. In evacuation modelling literature [11][12][13][14][15][16][17][18][19][20] that examines this issue there are broadly two approaches suggested. The first approach is based on running a fixed number of simulations. The second approach is based on some dynamic assessment of the behaviour of the statistic(s) of interest.
When the first approach is used, a wide range for the required number of simulations has been suggested. For high speed train evacuation analysis, Li et al. [11] suggested that 10 simulations would suffice provided certain conditions were met. For large passenger ship analysis, the International Maritime Organisation (IMO) recommends that a minimum of 500 simulations should be used to determine the 95th percentile TET [12]. They additionally stated that a lower number of simulations would be allowable provided a suitable convergence methodology was followed [13]. For aircraft evacuation analysis, Galea [14] noted that 1000 simulations could typically be utilised to determine the 95th percentile TET. Meacham et al. [15] utilised 2000 simulations to construct a Cumulative Density Function (CDF) of the TETs for building scenarios. There is no consensus on the number of simulations used and there is generally little justification for the number of simulations proposed. The disadvantage of this approach is that it is unknown whether the number of simulations is sufficient to represent the range of behaviours that may exist. Furthermore, the number of simulations performed could be excessive, for the intended purpose, which could limit the number of potential scenarios or alternative designs that could be explored. This approach does have the advantage of requiring little additional modeller expertise but is also limited as some models may be very sensitive to the chosen number of simulations [16].
In the second approach, additional egress simulations are performed until some indicators of convergence have been satisfied. With this approach, the number of simulations performed is unknown a priori. The eventual number of simulations performed is based on the behaviour of the statistics of interest and should ensure that sufficient, and not excessive, computation being performed. Compared to running an arbitrary number of simulations this requires some expertise in determining suitable convergence tolerances for the scenario being run. Two types of indicators have been proposed, the first type is based on confidence intervals (CIs) [21,22] and the second type is based on examining successive differences in the statistics of interest with additional simulation.
CI based techniques have been successively used in egress modelling [13,20], and various other fields of study [23][24][25][26][27][28], to define convergence. For large passenger ship analysis, Grandison et al. [13] calculate the CI of the 95th percentile TET, using a binomial distribution, and incrementally test, with additional simulations, the limits of the CI against a pass/fail criterion [12]. This technique was found to accurately determine whether a ship design had passed or failed, whilst minimising the computational effort required. Furthermore, they noted that this technique could be adapted to determine convergence based on a required precision and that the methodology could be extended to other suitable statistics of interest and types of egress analysis. A disadvantage of this method is that it can be difficult to determine CIs for some statistics. However, an advantage of this scheme is that the convergence indicators are directly linked to the precision of the statistics of interest.
A successive difference technique has been proposed by Ronchi et al. [17] and Ronchi and Nilsson [18]. Ronchi et al. [17] define convergence (i.e. sufficient repeat simulations have been performed) when their five statistics of interest (MT, SD and three functional analysis measures (FAMs) of the average total egress curve (AC)) were all changing by less than their specified tolerances per simulation over a set number of simulations. It was suggested that these tolerances could be based on the uncertainty of the available safe egress time from a fire modelling analysis. Additionally, Lovreglio et al. [19] proposed that those tolerances could be based on values determined from a set of repeated evacuation experiments when used to validate a stochastic egress model. An advantage of the successive difference approach is the simplicity in calculating the convergence indicators. However, Ronchi et al. noted that 'The first limitation of the method is that it uses the concepts of convergence in mean and the central limit theorem rather than a statistical estimation of the expected values…'. This means that their convergence indicators do not directly relate to standard errors or CIs of the statistics of interest.
From the previous examples, the parameter of most interest is a representative TET. However, an analysis based solely on the TET may miss important details of the evacuation [16][17][18][19][20]29], e.g., a set of egress trials could have similar representative TETs, but the individual evacuation dynamics could be very different. A more detailed analysis of the evacuation can be performed by additionally examining the exit time of certain percentages of the agents [18,20]. However, an even greater level of detail can be achieved by examining the entire agent egress curve (an ordered vector of the individual agent's exit time) via FAMs [16,17,19,29]. Furthermore, egress curves are a typical output from many evacuation models. The convergence behaviour of the AC was used to represent behavioural uncertainty, the uncertainty associated with the stochastic nature of human behaviour, by Ronchi et al. CIs have been demonstrated to accurately and efficiently determine convergence for statistics of interest in egress modelling and other fields of study. Ronchi et al.'s use of the AC was an important innovation in studying egress variability. It is therefore desirable to extend their original method by applying CIs to the AC, thereby giving a more standard statistical interpretation of behavioural uncertainty. However, it is not intuitively obvious how to determine the CIs for this curve and no solution has been previously published. The key contribution of this paper is the development of FAM-based CIs for the AC (Sect. 3) with an accurate overall confidence level. This was achieved using the novel application of bootstrapping (see Sect. 2.1), FAMs (see Sect. 2.2), and a bisection algorithm (see Sect. 3.1). This addresses the first limitation of Ronchi et al.'s method and allows the benefits of the CI approach to convergence to be combined with the AC within a common framework. It is also advantageous to use the FAMs, to form the CIs, as they are somewhat familiar to the fire and evacuation research communities. FAMs were introduced to the fire safety field in 1999 [30] and the first reported use for evacuation was in 2012 [29]. A further contribution of this paper is the development of a convergence scheme. The convergence scheme is based on the same variables used by Ronchi et al. [17], i.e. AC, MT, and SD, but utilises CIs. The CI for the MT is based on the standard t-distribution expression (see Sect. 2) and the CI for the SD is obtained using a standard application of bootstrapping (see Sect. 2.1.1). This convergence scheme will efficiently determine when all the statistics of interest have reached their required statistical precisions. Utilising convergence of the AC helps to ensure that variability, between simulations, which is not expressed by examining TETs alone, is also captured.
The CIs and the convergence scheme are tested and verified using a simple theoretical model. Apart from demonstrating the correctness of the methods, that work also feeds into a discussion on how to determine appropriate tolerances for the FAM-based CIs of AC.

Statistical and Mathematical Background
This section provides a brief background on the necessary terminology and methods used in this paper. Initially, some standard statistical terminology will be described; this initial section can be skipped over by readers with a grounding in statistics. This is followed by an overview of ''bootstrapping'' with some worked examples (Sect. 2.1). Finally, a summary of functional analysis, used to compare egress curves, is given (Sect. 2.2).
A sample statistic is the value of an estimator based on a sample from the population of results. A population parameter can be considered as the value of an estimator based on the entire population of results. Examples of estimators include the mean TET (MT) (Eq. 1), the 95th percentile TET, and the standard deviation (SD) of TET. Ideally, one would want to know the parameter but practically it is (generally) only possible to obtain the statistic of a sample of results and a range of potential values for the parameter. The sample statistic would display variability depending on the nature of the sample obtained although the sample statistic tends to the value of the parameter as the size of the sample increases (Law of Large Numbers [22]).
The ''expected value'' of a sample statistic is the mean value of a series of infinitely repeated independent instances of that sample statistic with the same sample size [22]. This will equate to the corresponding population parameter if the estimator is unbiased. This is the case for many estimators although the sample SD is slightly biased. Whilst the standard variance bias correction has been applied (Eq. 2), i.e. dividing by (n -1) rather than n, there is still some residual bias. It is reasonable to neglect this bias given the large scale of the variability associated with the SD compared to the small size of the bias and the complexity of calculating that bias. The small bias that does exist also diminishes as the sample size increases.
Although it is generally impossible to exactly know the population parameter it is possible to estimate a range of possible values for the population parameter based on the sample of results. This is a CI, which will differ between different samples but will frequently include the parameter of interest. Ideally, this frequency is determined by the confidence level (CL). However, the actual frequency, known as the coverage probability, may differ in practice. A set of CIs for 50 different samples is depicted in Fig. 1. This is illustrated for a 95% CL in Fig. 2 where the vast majority, 47 ($ 95%), of the sample CIs contain MT (¥) but 3 ($ 5%) of the sample CIs do not. The superscript brackets indicate the number of samples used to calculate the statistic, when the number is ¥ the statistic is equivalent to the population parameter. In general, only a single sample of results will be generated, although that single sample could contain hundreds of simulations, so only one CI will be generated per statistic of interest. The calculation of the CI for certain parameters is straightforward due to the availability of standard expressions. For the MT, the CI with a 100(1 -a)% CL can be calculated using Eq. 3 where T -1 (p, df) is the inverse t-distribution for percentile p with df degrees of freedom [22]. This expression assumes that the population SD of the TETs is unknown, which is generally the case in egress models/experiments, and is estimated with the sample SD. If the original data is normally distributed, then the expression is valid for n > 1. For non-normally distributed data, the expression is generally valid due to the central limit theorem for a larger n but is dependent on the nature of the original data distribution. Some sources advocate a minimum n of 30 although for skewed data a minimum n of 40 [31] has been suggested. It is further assumed that the data is independent and identically distributed.
In Eq. 3, the CI is quoted as a lower limit and an upper limit of the population parameter value, e.g. 150.0 with a 95% CI [135.0, 165.0]. The CI could also be quoted as the difference between the limits and the sample statistic, e.g. 150.0 with a 95% CI [-15.0, + 15.0]. Those differences could also be quoted as percentages, e.g. 150.0 with a 95% CI [-10%, + 10%]. If the CI is symmetric then it can be expressed in shortened form with a plus-minus limit, e.g. 150 ± 10% (95% CI). A CI is a trade-off between its width, which should ideally be as narrow as possible, and its CL, which should ideally be as high as possible. Ideally, one would want a 100% confidence, but this leads to an infinitely wide CI which is not very useful. For a statistic calculated from a fixed number of data points, the CI width will increase with increasing CL. Having a very high CL will result in a very wide CI whilst a ''low'' CL with a correspondingly narrow CI is also unhelpful as there is little confidence that the CI contains the parameter of interest. A 95% CL is typically chosen as it is very likely to contain the population parameter whilst maintaining a relatively narrow width. When calculating the CI MT (Eq. 3) with varying CLs (see Table 1), a 99% CL gains an extra 4% in confidence but requires a 33% increase in CI width compared to a 95% CL. A higher 99.9% CL gains an extra 4.9% in confidence but requires a 71% increase in CI width compared to a 95% CL. Using a lower 68% CL (equivalent to 2 9 standard error widths) reduces the CI width by 49% but there is a reduction in confidence of 27% compared to a 95% CL. From Eq. 3, the width (W MT ) of the CI MT with a fixed CL will also tend to narrow with increasing n (see Eq. 4).
In order to further discuss CIs, it is necessary to describe sampling distributions. A sampling distribution is the probability distribution of a statistic for a fixed sample size for all the possible samples that could be selected [22]. For example, a set of five TETs are obtained from repeated trials and the statistic of interest is the mean of those TETs. If this set of five trials is repeated, a further set of TETs can be obtained, and another mean of those values is calculated. Each time the process is repeated a different estimate for the mean value will be obtained, see Table 2. If this could be repeated a very large number of times, then it would be possible to produce a distribution of the sample mean values, i.e. the sampling distribution. Assuming the TETs are normally distributed with an unknown population SD then it has been shown theoretically that the sampling distribution for the example can be represented by a t-distribution [22]. The mean of the sampling distribution is equal to the expected value and the population mean of the original TET distribution. From the central limit theorem, the standard deviation of the sampling distribution, also known as the standard error, is the standard deviation of the original TET distribution divided by Ön.
From this sampling distribution, the interval that contains the innermost 100(1 -a)% of sample means can be described by Eq. 5 (see Fig. 2).
Equation 5 is not particularly useful as in general the value of the MT (n) is known and the MT (¥) is unknown. However, Eq. 5 can be algebraically recast [22] to make the MT (¥) the subject of the inequality leading to the probability that the MT (¥) lies within a CI with a CL of 100(1 -a)% (Eq. 6). This is a probabilistic statement of the CI given in Eq. 3.
For some statistics, such as the mean value (Eq. 3) and the 95th percentile value [13], a theoretical solution exists for determining the CI. However, for many statistics a theoretical solution may not exist or is difficult to derive. For those cases, it may be possible to use bootstrapping to approximate the sampling distribution and hence derive the CI.

Bootstrapping Concepts
Bootstrapping [32][33][34][35][36] is a widely used computational statistical technique. The premise of bootstrapping is that inference about a population from sample data can be modelled by resampling the sample data and performing inference about a sample from resampled data. Generally, the population distribution is unknown and only the sample of data is known. Bootstrapping uses this sample of data as an approximation to the true population distribution. The bootstrapping technique is applicable to many estimators although there are known difficulties associated with the modal value [37]. Care is also required when applying the technique to heavy-tailed distributions particularly those with infinite variance [38,39]; this is not anticipated to be a major issue in the egress modelling context. A simple worked example, to determine the CI of the mean, follows. In practice, bootstrapping is not generally used to determine the CI of a mean value due to the wide applicability of Eq. 3. The primary intent of this example is to demonstrate the methodology. A set of n values are sampled from some unknown distribution and leads to the set of values in the ''Original'' row of Table 3. In this example, n is set to five. This data could be expensive to obtain, e.g. these could be TETs from a set of egress trials. The mean of the original sample is 139.5. The original sample is now resampled to create a bootstrap sample, which is a very inexpensive process. Each value in the original sample has the same, 1/n, chance of being selected. Values are randomly copied from the original sample n times to  Table 3). This implies that a particular value from the original sample may be represented once, more than once, or not at all in a bootstrap sample. Furthermore, only values from the original sample can be represented in the bootstrap sample. For example, it can be seen for bootstrap 1 that 99.7 is represented twice and 188.5 is not represented at all and the other values are represented once each. When n values have been selected, the mean value of the bootstrap sample is calculated. This process is repeated B times, where B ) n, to create a bootstrap sampling distribution of the mean (cf. Table 2). In this example B is set to 19. This is a far lower number of bootstrap samples than would be used in practice but is used here for demonstration purposes. The CI, with a 90% CL (the maximum CL that can be represented with 19 bootstraps), is obtained by taking the percentile values that bound the innermost 90% of the sampling distribution which are the 5th and 95th percentile values. The bth bootstrap index, of the vector of ascending ordered bootstrap means, representing the (100p)th percentile is given by Eq. 7.
This equates to the 1st and 19th values of the ordered bootstrap sampling distribution of the mean (see Table 4). The bootstrap CI is not particularly close to the standard approach although is not unreasonable given the limited sample size and small number of bootstraps.
The accuracy of the method significantly improves as n and B increase, and as more sophisticated methods are used to select the required percentiles from the bootstrap sampling distribution. The above worked example was repeated using a sample size of 100 values and 1999 bootstraps. In this case the mean of the sample was 156.5. The bootstrap  Fig. 3. The bootstrap mean sampling distribution is similar in shape to a normal distribution and approximates the true sampling distribution. The simplest approach for selecting a centred CI with a 100(1 -a)% confidence level is to assume it is bounded by the 100(a/2)th and 100(1 -a/2)th percentile bootstrap samples. This is the approach used for the worked example above. However, this simple percentile approach has two notable problems. Firstly, it tends to give narrower CIs than expected when ''small'' sample sizes were used. Secondly, it does not account for skew and bias that may exist in the samples.
For the first problem, Hesterberg proposed a correction based on the normal distribution and the t-distribution [40]. This is based on correcting the percentile limits of the bootstrap sampling distribution of the mean. The ''corrected'' (100 p)th percentile based on the original (100p)th percentile of sample size n is given by Eq. 8, where U(z) is the CDF of the standard normal distribution func-  tion. This correction always tends to make the CI more conservative but could result in an over-correction or insufficient correction of the required percentile dependent on the actual, but unknown, form of the sampling distribution. In the event of an over-correction, the CI will be wider than it should be but represents a more conservative estimate. If the CI is insufficiently corrected, then at least some correction is applied compared to no correction thus bringing the coverage probability of the CI closer to the nominal confidence level. Table 5 illustrates the correction (Eq. 8) applied to the 5th percentile for various sample sizes. The effect of the correction is significant for small n but diminishes as n increases.
For the second problem associated with possible skew and bias in the samples, there are a number of advanced bootstrapping methods for determining appropriate limits for the CI, including the Bias Corrected and accelerated (BCa) method [32,33], ABC, studentized, and double bootstrap [35]. The BCa bootstrap was selected due to its wide range of applicability [41], its relatively straightforward implementation, and its comparatively low computational requirements.
The BCa corrected 100(a/2)th percentile bootstrap index b l and 100(1 -a/2)th percentile bootstrap index b u , incorporating the small sample size correction (Eq. 8), are given by Eqs. 9 and 10 respectively, where bxc is the integer floor and dxe is the integer ceiling of x. Equation 11 is the bias correction factor where U -1 is the inverse CDF of the normal distribution, h is the calculated statistic, and h Ã j is the jth bootstrap of the distribution of h * . Equation 13 is the acceleration factor derived from the original sample data.
h Ài is h calculated using the original n simulations performed barring the ith simulation.
2.1.1. Bootstrapping the CI for the SD If the TETs are normally distributed, the CI SD can be determined using a Chi squared distribution with n -1 degrees of freedom [22]. However, in general the TETs will not be normally distributed and those theoretically derived CIs could be inappropriate. The CI SD is therefore determined using bootstrapping. For n simulations there are n TETs that are used to compute SD (n) . A single bootstrap SD * can be derived by randomly resampling, with replacement, those n TETs n times and calculating the SD of that bootstrap sample. This process is repeated B times, where B ) n, and leads to a bootstrap sampling distribution of the SD * . The CI of this ordered bootstrap sampling distribution of SD *,o approximates the CI for the true sampling distribution of SD (n) . The BCa method is used to obtain the upper and lower limits of the CI.
The specific equations for calculating the BCa limits for the SD are obtained by substituting SD into Eqs. 9-14 leading to Eqs. 15-21.
The CI SD is [SD Ã;o b l;SD , SD Ã;o b u;SD ] where the bootstrap index b l,SD is given by Eq. 20 and b u,SD is given by Eq. 21.

Functional Analysis
The final area of mathematics used in this paper to define CIs for the AC is functional analysis, a branch of mathematics used to compare vectors or times series data. An egress curve is a vector of agent egress times. Unlike a scalar quantity such as the TET where the difference between two TETs is merely a difference in magnitude, it is less obvious how to express the difference between vectors. The difference between the curves can be expressed using a set of FAMs.
The FAMs compare one curve (a) against another curve (b), using different criteria, that returns a scaler value that meaningfully represents the similarity (or difference) between the two curves. The egress curve a = {a 1 , a k ,…., a m }, where a k is the time of the kth agent to exit and a m is the exit time of the last (mth) agent to exit; egress curve b is similarly defined. The three FAMs used here are the Euclidean Relative Distance (ERD), the Euclidean Projection Coefficient (EPC), and the Secant Cosine (SC). These FAMs have been previously described in fire [30] and evacuation contexts [16,17,19,29] so only concise details, suitable for egress curves, are given here.
The ERD (Eq. 22) represents the ''relative distance'' between two curves and will be zero when the curves are identical and greater than zero if the curves are different. ERD varies between zero and infinity with values nearer to zero implying a closer ''distance'' between the curves. In the appendix it is shown that the ERD of two egress curves can be compared to normalised absolute difference (NAD), i.e. |TET a -TET b | ‚ TET b , between two TETs of those egress curves. If the ERD is greater than the NAD, then it indicates there is more relative difference in the egress curves compared to the TETs; this suggests there may be more variability in an earlier phase of the evacuation. If the ERD is less than the NAD, then there is less relative difference in the curve compared to the TETs; this suggests that there may be more variability at the end of the evacuation.
The EPC (Eq. 23) is the value needed to project one curve onto another curve. This would be one when the curves can be exactly ''projected'' on each other, which will be the case when the curves are the same and varies about one when not exactly projected. EPC varies between zero and infinity, assuming a k ‡ 0 and b k ‡ 0 8k, with values nearer to one implying a closer ''projection'' similarity between the curves.
The SC (Eq. 24, where Dx k-s = x k -x k-s and s is the step-size ‡ 1) compares the ''shape'' of two curves. The step-size s is used to vary the smoothing of the curve, where a step size of one implies no smoothing and the smoothing effect increases as s increases. SC equates to one when the curves have the same ''shape'', which will be the case when the curves are the same. SC varies between zero and one when the egress curves have different ''shapes'' with values nearer to one implying a closer similarity in shape. Egress curves monotonically increase, thus Da k-s ‡ 0 and Db k-s ‡ 0 8k, implying that SC(a, b) ‡ 0. The Cauchy-Schwarz [42] inequality (Eq. 25) implies that SC(a, b) £ 1.
3. Development of FAM-Based CIs for the AC The average egress curve AC (n) = {ac 1 , ac k , …, ac m } is calculated by averaging all n generated egress curves c i (Eq. 26 shows the calculation for the kth component of AC (n) ) from the simulations performed. A bootstrap average curve AC * is derived by randomly resampling, with replacement, those n generated curves n times, and then averaging. This process is repeated B times and leads to a bootstrap sampling distribution of the average egress curve. Unfortunately, it is difficult to directly assign a CI to the sampling distribution of curves. However, it is possible to create proxies that represent the precision of the AC indirectly. This can be achieved by transforming the AC sampling distribution of curves into usable sampling distributions of point values, via the FAMs, that can be assigned CIs. Each bootstrapped average curve AC * is compared to the average egress curve AC (n) with the FAMs to create sampling distributions of those measures, i.e. ERD * (AC * , AC (n) ), EPC * (AC * , AC (n) ), and SC * (AC * , AC (n) ). This approximates the (generally) unknowable sampling distributions of all possible sample average egress curves AC (n) compared to the population (or expected) average egress curve AC (¥) using the FAMs, i.e., ERD(AC (n) , AC (¥) ), EPC(AC (n) , AC (¥) ), and SC(AC (n) , AC (¥) ). The CIs of the FAM-based sampling distributions represent the precision of the AC.
The ERD sampling distribution can be represented by a semi-infinite distribution between zero to infinity as the ERD values are greater than or equal to zero. It is anticipated that the distribution would be concentrated near zero with the frequency tending to zero as the bootstrapped ERDs tend to infinity. When AC (¥) is unknown the best estimate of ERD * (AC (¥) , AC (n) ) is zero as the best estimate of AC (¥) is AC (n) . The 0th percentile of the ERD *,o sampling distribution is zero. The CI is obtained by using the 0th and 100(1-a)th percentiles of the ordered bootstrapped values as the lower and upper limits for a confidence level of 100(1a)%. The small sample size correction (Eq. 8) is applied to the 100(1-a)th percentile; this is achieved by modelling the bootstrapped ERD measures using a folded normal distribution from zero to infinity with the (unfolded and folded) mode at zero. The corrected 100(1-a)th percentile bootstrap index b ERD of the ordered bootstrapped ERD *,o values is given by Eq. 27 where Eqs. 28 and 29 map the percentile from the folded distribution to the full distribution and back again. The The SC sampling distribution can be represented by a distribution between zero and one where it is expected that the values will be concentrated near one and the fre-quency will tend to zero as the SC values tend to zero. The best estimate of SC(AC (¥) , AC (n) ) is one and the 100th percentile of the SC sampling distribution is one. The CI, with a confidence level of 100(1-a)%, is given by the limits of the corrected (100a)th and 100th percentile of the ordered bootstrapped SC *,o measures. The corrected 100ath percentile bootstrap index b SC is modelled, in a similar fashion to b ERD , using Eq. 30 where Eqs. 31 and 32 map the percentile from the folded distribution to the full distribution and back again.
The EPC sampling distribution will vary between zero and infinity where it is expected that the values will be concentrated about one and the frequency tends to zero as EPC tends to either zero or infinity. This can be represented by a twotailed distribution and the CI of the ordered bootstrapped EPC measures is calculated using the BCa method (Sect. 2.4). The best estimate of EPC(AC (¥) , AC (n) ) is one. The specific equations for calculating the BCa limits for the EPC are obtained by substituting EPC into Eqs. 9-14 leading to Eqs. 33-37.
where AC n ð Þ Ài is an average egress curve calculated using the original egress curves barring the egress curve from the ith simulation.
Determining Confidence Intervals, and Convergence, for Parameters The CI EPC is given by [EPC Ã;o b l;EPC , EPC Ã;o b u;EPC ] where b l,EPC and b u,EPC are given by Eqs. 36 and 37 respectively.

Overall Confidence Level for the Average Egress Curve
The precision of the AC is represented by three FAM-based CIs. Each of these CIs has an individual CL of 100(1-a)%. These three CIs can be considered to form a triple-CI that has a more complex coverage probability than a single CI as it is possible that the AC (¥) may lie within zero, one, two, or three of the CIs (see Fig. 4).
The overall confidence level is equivalent to P(ERD \ EPC \ SC), i.e. all FAM-based CIs are simultaneously satisfied, and is likely to be less than the individual CL. Although it is easy to control the individual CLs, what is sometimes required is the ability to set the overall CL. This is a well-known problem with multiple CIs and could be addressed by applying the Bonferroni correction to the individual CIs [43]. If the overall CL is 100(1 -a overall )% then the confidence level required for each of the individual CIs is given be Eq. 38 where N CI is the number of CIs (three for the average egress curve).

2154
Fire Technology 2020 However, this leads to a conservative estimate of the individual CLs, as the correction assumes the worst-case multi-CI probability coverage, e.g. the right-hand Venn diagram of Fig. 4, and therefore the CIs are potentially much wider than necessary. When three CIs are used, an overall 95% CL corresponds to each CI having an individual 98.3% CL. Fortunately, as the ERD * (AC * , AC (n) ), EPC * (AC * , AC (n) ), and SC * (AC * , AC (n) ) sampling distributions are available, for the bootstrapped egress curves, it is possible to determine a more accurate overall confidence level. This is achieved by adjusting the individual confidence level of the CIs, using the bisection algorithm below, until the number of AC * bootstrap curves that are simultaneously within all three FAM-based CIs matches the required overall CL.
1. The required number (R AC ) (Eq. 39) of AC * curves that needs to be contained simultaneously by all three CIs, for an overall CL of 100(1 -a overall )%, is calculated by taking the maximum of the number of AC * that need to be held individually by CI ERD (Eq. 40), CI EPC (Eq. 41), and CI SC (Eq. 42) at the overall CL.
2. The lower individual confidence level (LICL) and upper individual confidence level (UICL) for bisection algorithm are set to Eqs. 43 and 44 (the Bonferroni correction) respectively. The actual individual CL, that satisfies the overall CL, lies between these two limits.
3. The bisection is now performed by setting the middle individual confidence level (MICL) Eq. 45. Set N AC to zero If the difference between UICL and LICL is less than the bisection tolerance, 0.1% is suggested, then the overall CL convergence has been achieved (continue to step 8). Otherwise, the overall CL convergence has not been achieved (go to step 3).

CI Convergence Scheme
Now that CIs have been developed for the statistics it is possible to determine convergence w.r.t. the estimate of the population parameters. The CI MT is calculated using Eq. 3, the CI SD is calculated using bootstrapping (Sect. 2.1.1), and the FAM-based CIs of the AC are calculated using the methodology described in Sect. 3. The width of a CI narrows (converges) with increasing sample size, e.g. Eq. 4. This enables a convergent approach based on incremental testing where the number of trials can be progressively increased until the width of the CI is sufficiently reduced that it is less than the specified precision tolerance. The convergence of the MT, the SD, and the FAMs are determined by comparing the normalised width (W) of the CIs against a specified tolerance (Tol) (Eqs. 46-50) for a 100(1 -a)% CL.
The convergence algorithm is depicted as a flowchart in Fig. 5. The user needs to specify the above tolerances (Tol MT , Tol SD , Tol ERD , Tol EPC , and Tol SC ) as well as the minimum number of simulations (MIN_SIM), the maximum number of simulations (MAX_SIM), the number of bootstrap-resamples (B), the CL, and the number of simulations performed between testing (K). Convergence is determined when all the CI widths are less than their specified tolerances. Convergence may not be achieved within the MAX_SIM simulations, in that case the MAX_SIM simulations will be performed with the final CIs reported. If the user has no specific convergence requirements, then a set of values for the convergence attributes are suggested in Table 6. The tolerances are discussed in Sect. 6.

Case Study
A theoretical case study model, similar to Ronchi et al. [17], was used to create multiple fictitious egress curves of 120 agents. The data is generated by pseudorandomly [44] sampling a log-normal distribution with a mean of 12 s and standard deviation of 13.4 s (Ö180 s). A set of 120 values are generated {l 1 , …, l 120 } with the egress time (t) of the kth agent given by Eq. 51. Figure 6 depicts ten randomly generated egress curves using the case study model.
Using the case study model has two major advantages over using simulated or real data. Firstly, the generally unknowable population parameters are defined for the case study model so the methods can be tested and verified. The population MT (¥) is 1440 s (120 9 12 s), as the mean of the sum is the sum of the means. The population SD (¥) is 147.0 s (Ö120Ö180 s), as the variance of the sum is the sum of the variances and the standard deviation is the square root of the variance. The exit time of the kth agent of AC (¥) is given by Eq. 52. Secondly, there is very little computational cost in generating this data allowing a vast number of curves to be generated in a short timeframe. Virtually all the computational cost of the case study is incurred from the bootstrapping process rather than curve generation. In practice, it is anticipated that the computational cost of egress simulations (curve generation) will dwarf the computational costs of the bootstrapping process. In this test case study, the step size (s) used for the SC (Eq. 24) is one. This gives the most conservative value for the difference in shape between the curves. The testing of the CIs is performed in three parts. The first part examines the behaviour of the CIs, within a single set of simulations, relative to the sample estimates of the statistics and the population parameters (Sect. 5.1). The second part examines the average behaviour of the CIs over multiple sets of simulations including the measured coverage probability and average CI width for a parameter at a specified sample size (Sect. 5.2). The third part examines the behaviour of the convergence scheme based on the CIs (Sect. 5.3).  Figure 6. Ten randomly generated fictitious egress curves for 120 agents from the case study model.

CI and Error Behaviour within a Single Set of Simulations
The behaviour of the statistics (MT, SD, ERD, EPC, SC) and their CIs were examined over a set of two hundred simulations. This is an example set of simulations and the behaviour will not be the same every time but highlights the expected behaviour. The statistics are compared to their equivalent population parameters and normalised leading to the true errors (e). The CIs are also normalised, in a similar fashion to e, leading to normalised CI limits (v). The normalised error (e MT ) and CI limits (v l;MT , v u;MT ) for the MT are given by Eqs. 53, 54, and 55.
The normalised error (e SD ) and CI limits (v l;SD ; v u;SD ) for the SD are given by Eqs. 56, 57, and 58.
The normalised error (e ERD ) and CI limit (v ERD ) for the ERD are given by Eqs. 59 and 60.
2160 Fire Technology 2020 The normalised error (e EPC ) and CI limits (v l;EPC ; v u;EPC ) for the EPC are given by Eqs. 61, 62, and 63.
The normalised error (e SC ) and CI limit (v SC ) for the SC are given by Eqs. 64 and 65.
In Figs. 7, 8, 9, 10 and 11 the normalised errors and CIs are plotted for MT (Fig. 7), SD (Fig. 8), ERD (Fig. 9), EPC (Fig. 10), and SC (Fig. 11). In Figs. 7, 8, 9, 10 and 11 it is apparent that for this set of simulations the CIs bounds the true error. This is expected as the CI is an estimate interval of the population parameter, meaning that the population parameter is unlikely to be outside of the CI. In Fig. 7, the SD has a significant error, $ 0.1 ($ 10%), after 150 simulations, although this error is bounded by the CI. There is no obvious relationship between e SD and the other error measures.  There is an unexpectedly close similarity between the behaviour of e MT (Fig. 7) and the e EPC (Fig. 10), both errors are near zero at 50 and 100 simulations, have a similar local peak value, approximately 0.02, at 67 simulations, and have similar values of approximately -0.01 at 200 simulations. There is also a similarity  between e ERD (Fig. 9) and e MT j j, both errors are near zero at 50 and 100 simulations, have a local peak value at 67 simulations, and have similar values of 0.01 at 200 simulations. Similarly, the CI limits for EPC and ERD have a similar trend and magnitude as the corresponding CI limits of MT. A brief algebraic analysis of e EPC and e ERD is given in the appendix. That analysis demonstrates that e EPC and e ERD are approximately equivalent to e MT under certain conditions. This is useful as the ERD and EPC tolerances, in the convergence scheme, can be related to the MT tolerance (see Sect. 6). It is also shown, in the appendix, that if e ERD is greater than e MT j j then there is more relative difference exhibited between the egress curves than the difference between the TETs of those egress curves.
The convergence behaviour of SC (Fig. 11) is somewhat different, and shows no obvious correlation, to the other statistics as the error and CI change comparatively smoothly with an increasing number of simulations. The error reduces more rapidly than the other statistics with the normalised CI less than 0.01 after 100 simulations and less than 0.005 after 200 simulations. The convergence behaviour unexpectantly appears to be 1/n while the convergence of the other variables appears to be 1/Ön. Although the general behaviour is different to the other CIs the normalised CI SC still bounds e SC .

Coverage Probability and Average Width of CIs
Ten thousand sets of simulations of size n were generated to estimate the actual coverage probability of the CIs and the average normalised widths (Eqs. 46-50) of the CIs. This was repeated using eight sample sizes (10,20,30,40, 100, 400, 1000, and 4000). The coverage probability is the percentage of CIs that ''contain'' the population parameter, and ideally this should match the confidence level of the CI. For the MT and SD, the coverage probability is determined by counting the number of times the population value lies within the CI. For the average egress curve, the FAMs are used to compare the sample average egress curve to the population average egress curve, e.g., ERD(AC (¥) , AC (n) ), EPC(AC (¥) , AC (n) ), and SC(AC (¥) , AC (n) ), and those values were checked to see if they lie within the CI or not.
In Table 7, the coverage probability of the CIs is generally good, the CI MT and CI ERD coverage probability nearly perfectly matching the confidence level within the error band. The CI EPC coverage probability is within the error band for sample sizes of 30 or more; even at a sample size of 10 the discrepancy is not too great. However, the CI SC coverage probability is unexpectedly higher than the nominal confidence level until a sample size of 4000 is reached. This implies that the CI SC may not be as narrow as they could be, but at least the CI is a conservative estimate. The CI SD has the worst coverage probability behaviour of all the statistics but is still reasonable (> 90%) with sample sizes over 20. Even for a small sample size of 10 the coverage probability is 85.2% and is perhaps better than no error estimate. The CI SD coverage probability tends to the nominal CL with increasing sample size, but even for small sample sizes the method is usable. The coverage probability could perhaps be improved for the CI SD using a double bootstrap [35], but this will result in a significant additional computational cost.
In Table 8, the widths of the MT, SD, ERD, and EPC CIs are approximately inversely proportional to the square root of the number of simulations performed. This is a typical relationship between the CI width and the number of simulations performed and is similarly observed for CI widths of other parameters [13]. The SC's width has a different convergence behaviour and is approximately inversely proportional to the number of simulations performed.
The normalised width of the SD is notable as it is much wider than the other statistics. For example, after 1000 simulations the W SD is still about 10% of the SD, approximately ± 5% error. For 40 simulations, the W SD is 0.49 which is equivalent to a [-19%, + 30%] error at a 95% CL. For 10 simulations, the W SD is 0.84 is equivalent to a [-24%, + 60%] error. If an end user requires a precise estimate of the SD, then a high number of simulations will generally be required.
When the multi-CI correction scheme (Sect. 3.1) is used it can be seen in Table 9 that the average individual CL (97.6%) is significantly lower than the conservative Bonferroni correction (98.3%), but the overall coverage probability is still above 95%. It should be noted that the SC coverage probability is much greater than its CL, thus significantly contributing to the overall coverage probability being over 95% below 4000 simulations. However, with the 4000-simulation sized sample the overall coverage probability is still above 95% when the SC coverage probability is in line with the expectation of the computed CL.
In Table 10 the corrected widths for an overall nominal 95% CL are given. These widths are slightly wider ($ 14%) than the widths in Table 8. For 40 simulations, the CI ERD is [0, 0.042], CI EPC is [0.958, 1.042], and CI SC is [0.975, 1] with an overall confidence level of 95%. These could be reported as errors with the ERD error of 4.2%, the EPC error of ± 4.2% and the SC error of 2.5%.

Convergence Testing
A sample of simulations is generated until convergence was achieved for the tested statistic at the stated tolerance. A minimum of 40 simulations [41] were performed before testing for convergence. This was repeated 10,000 times for each measure to estimate the actual coverage probability of the CI. For this case study each convergence criteria were tested in isolation, i.e., convergence is based solely on the tested statistic of interest. This is a reasonable way to test the convergence as ultimately convergence will be dependent on a single variable as the other variables must be individually converged before the final convergence is determined.
The results in Table 11 illustrate that the methodology can generate CIs that have an actual coverage close to the nominal 95% confidence level. The theory of CIs is based on a variable CI limits with a set sample size. This is somewhat different to converging to a set CI width with a variable sample size. Intuitively, it would be expected that the coverage probability would be close to the nominal CL and Ross [25] noted that the coverage probability would match the nominal CL for sufficiently large n.
As would be anticipated from Sects. 5.1 and 5.2, it can also be seen that as the tolerance decreases the number of samples required for convergence increases. This increase is approximately quadratic for all the measures apart from the SC. For the SC, as the tolerance decreases there is a linear increase in the number of samples required for convergence.

Discussion
The advantage of the convergence scheme developed here compared to Ronchi et al.'s scheme is the convergence is directly based on the required precision of the statistics of interest. The most intuitive tolerance to set is Tol MT . It has been previously suggested that the tolerance for Ronchi et al.'s original method could be based on uncertainty from either a fire modelling study determining available safe egress time or uncertainty based on experimental egress trials [19]. Those techniques could also be applied to the CI based method provided those uncertainties are calculated using CIs. In lieu of that type of information, a typical choice for Tol MT could be 0.02, the equivalent of ± 1% error, but the choice depends on the user's requirements. However, the choice of tolerances for the other statistics SD, ERD, EPC and, SC are less apparent. Prescribing appropriate tolerances for the ERD, EPC and SC FAM-based CIs is problematic, but the case study suggests a potential way forward. It was found that the W EPC , with a 95% confidence level, and W MT have a similar magnitude, W EPC is approximately 10% wider than W MT , for the same sample size (see Table 10) and the behaviour of the corrected CI limits of MT and EPC (see Figs. 7 and 10) are also similar. It was also shown that under certain conditions the true error measured by EPC and MT are approximately equivalent (see appendix). The ERD also has a behaviour and magnitude related to MT under certain circumstances. W ERD is roughly half the magnitude of W MT for the same sample size (see Table 10) and the upper confidence limit of the ERD and MT CIs are similar in magnitude and trend (see Figs. 7 and 10). It was also shown (see appendix) that the true error measured by ERD and MT are also equivalent under certain circumstances. In lieu of any other requirements, it is reasonable to base the tolerances of the ERD and EPC CIs on the specified tolerance of MT (Eqs. 66 and 67). Although there is no obvious relationship between MT and SC convergence behaviour, it is also suggested that Tol SC is also based on the specified tolerance of MT (Eq. 68) unless the user has a specific requirement.
Setting the Tol SD in relation to Tol MT is not recommended. From Table 8, even with 4000 simulations the average W SD is 0.049. By assuming that the TETs follow a normal distribution it can be calculated [22] that it would take $ 10,000 simulations to achieve convergence for a Tol SD of 0.02. Unless the user specifically requires a certain precision for the SD then it is suggested setting Tol SD between 0.4 and 0.2.
There are circumstances where a user would wish to set the MT tolerance based on time, e.g. ± 10 s, Tol MT,seconds = 20 s. The tolerance of the FAM-based CIs cannot be directly expressed in seconds, but this is not a particular problem as the Tol MT,seconds is easily converted to a decimal (Eq. 69). This needs to be calculated at runtime as MT is generally unknown a priori.
In the previous case study, W EPC /2 and W ERD are slightly larger than W MT /2. As W ERD is greater than W MT /2 then there is a greater relative variability expressed in the AC compared to the MT (see appendix). This is due to the nature of the theoretical case study model. The coefficient of variation (CV) of the exit time of the kth agent, for the case study model, decreases with increasing k (see Eq. 70). Thus, the earlier exiting agents exhibit the most relative variability. However, the actual variability of the kth agent, i.e. SD k , increases with increasing k. The ERD gives greater weight to larger differences due to the squaring term (Eq. 22) leading to a small increase in W ERD compared to W MT /2. When performing practical egress simulations, there is potentially larger variation in the exit time of agents other than the last agent to exit which would be reflected in the FAM-based CIs of the AC.
The CI based convergence scheme has a higher computational cost than Ronchi et al.'s original method. This includes generating the bootstrap samples, ordering the values, and calculation of the BCa coefficients. However, it is anticipated that the overall cost will be small compared to the computational cost of the evacuation simulations. If bootstrapping takes a significant amount of time, relative to the evacuation simulations, then the algorithm can be optimised by increasing the number of simulations (K) performed between convergence checks. The calculations have been ordered in the convergence scheme so that the most computationally expensive bootstrapping, the FAM-based CIs, are not calculated until the MT and SD have been deemed to have converged. The CI method also requires additional memory compared to their original method, although this is not expected to be a serious limitation. The largest memory requirement is the need to store all n egress curves. For example, a 'large' problem consisting of a scenario with 100,000 agents that is repeated 1000 times would require less than 500 MB of RAM. This is a modest memory requirement for a modern computer.
Although the use of CIs addresses Ronchi et al.'s first noted limitation, the other limitations they discussed still apply to the work performed here. For instance, it is possible that egress curves can appear identical even when the dynamics of the evacuation are different. Furthermore, differing egress curves compared to the AC can return the same ERD, EPC or SC measures. The convergent behaviours of the FAM-based CIs of the AC are therefore imperfect proxies for ensuring all the variability in an egress simulation is adequately represented. However, these measures are superior to using the convergence of the MT alone when trying to determine whether enough simulations have been performed to represent the true variability of the evacuation scenario.

Concluding Comments
Ronchi et al. [17] pioneered the use of the convergence behaviour of the AC to better represent the convergence of the evacuation scenario in general. In this paper, their original approach was modified to use CIs. The CI approach has been shown to give reliable convergence with reference to the estimate interval of the population parameters when applied to the mean TET, standard deviation of TETs and the AC. The FAM-based CIs for the AC have been calculated using a novel application of bootstrapping, FAMs, and bisection algorithm. Although the methods described in this paper are more complex than Ronchi et al.'s original convergence indicators, the resultant CIs have a more standard statistical interpretation than their original convergence indicators. The choice of convergence tolerances is therefore more straightforward as it is just the required statistical precision of the statistics of interest. The case study and algebraic study (see appendix) of the ERD and EPC true errors showed there can be equivalency between those errors and the error in MT. These studies provided guidance for setting the tolerances of ERD and EPC.
The convergence scheme is easily adapted to include other statistics of interest. There are many more parameters that may be of interest, e.g. measures of congestion, that may not have a known CI. For example, Smedburg [45] suggested several statistics, such as egress exit flow rates, which were represented by time series data; FAM-based CIs and convergence for those statistics could be determined using the methods described in this paper. The calculation of the CIs and convergence should be implemented in software so that end users can have easy access to the technology.
The CIs can also be easily applied to experimental or simulated trial data without the convergence scheme suggested here where the number of trials performed is limited by other criteria.
When Eq. 75 is satisfied then e EPC will be approximately equal to e MT (Eq. 76).
There are two specific examples, of when Eq. 75 is satisfied, provided here. The first example is when Eq. 77 is satisfied. The second example is when both Eqs. 78 and 79 are satisfied.
The e ERD (Eq. 80) can be expressed as Eq. 81. equal (0.2). The second set of curves satisfy Eqs. 78 and 79 and, Eqs. 86 and 87 and e ERD and e EPC are approximately equal to e MT (0.19 % 0.2). By extending the analysis of the ERD relationship it can also be seen, from Eq. 81, that if Eq. 88 is true then e ERD will be greater than e MT j j, indicating there is more relative difference in the rest of the egress curve compared to the TET. One specific circumstance of this occurs when Eq. 89 is true. Similarly, if Eq. 90 is true then e ERD will be less than e MT j j, showing there is less relative difference in the rest of the egress curve compared to the TET. One specific circumstance of this occurs when Eq. 91 is true.  Table 12 MT, EPC and ERD Error for Two Sets of Contrived Curves (Fig. 11)