Uncertainties of Sub-Scaled Supply and Demand in Agent-Based Mobility Simulations with Queuing Traffic Model

Agent-based models for dynamic traffic assignment simulate the behaviour of individual agents, or groups of agents, and the simulation outcomes are then observed on the scale of the system. As large-scale simulations require substantial computational power and have long run times, most often a sample of the full population and downscaled road capacities are used as simulation inputs, and then the simulation outcomes are scaled up. Using a massively parallelized mobility model on a large-scale test case of the whole of Switzerland, which includes 3.5 million private vehicles and 1.7 million users of public transit, we have systematically quantified, from 6 105 simulations of a weekday, the impacts of scaled input data on simulation outputs. We show, from simulations with population samples ranging from 1% to 100% of the full population and corresponding scaling of the traffic network, that the simulated traffic dynamics are driven primarily by the flow capacity, rather than the spatial properties, of the traffic network. Using a new measure of traffic similarity, which is based on the chi-squared test statistic, it is shown that the dynamics of the vehicular traffic and the occupancy of the public transit are adversely impacted when population samples of less than 30% of the full population are used. Moreover, we present evidence that the adverse impact of population sampling is determined mostly by the patterns of the agents' behaviour rather than by the traffic model.


Introduction
The complexity of modern transportation systems continues to increase in order to efficiently serve the demand from the increasing population in urban areas and to ensure compliance with policies that support the energy transition (Litman 2013;Speranza 2018). Emerging technologies such as electric vehicles (EVs) and automated vehicles (AVs) are among the main drivers of the ongoing transformations in the transportation sector (Burns 2013;Hars 2015). EVs and AVs require new infrastructure (such as, chargers for EVs and intelligent traffic management systems for AVs) to be built and maintained (Huang et al. 2015). The increased penetration of EVs and AVs is also paralleled by changes in the behaviour of people, for example due to the fact that EVs require additional time for charging, or that EVs have limited range compared to internal combustion engine vehicles (Dijk et al. 2013), or that cheaper AVs result in a different perception of the value of time (Childress et al. 2015). As further reductions in mobility-related CO2 emissions are targeted by many countries across the globe (Biresselioglu et al. 2018), it is anticipated that the transition to EVs will intensify in the near future. Furthermore, traditional and free-floating car-sharing services are expected to reduce the cost of car ownership (Prettenthaler and Steininger 1999) and to have a positive impact on sustainable travel (Cervero et al. 2007) by changing patterns of behaviour. Simultaneously, there is a trend towards a closer integration of different transportation modes into multi-modal networks with the idea that mobility-as-a-service (Jittrapirom et al. 2017) may potentially improve the overall inter-modal interaction of service providers, and especially make public transit more attractive to existing car owners.
The above-mentioned changes in mobility require substantial investments in the upgrade of infrastructure in the transportation, electricity and telecommunication sectors (Schroeder and Traber 2012). Moreover, new policies are being formulated and implemented in order to facilitate the upcoming changes and to accelerate the adoption of new technologies (Bakker and Trip 2013). However, uncertainties in such complex systems, which involve both human behaviour and policy regulation, increase investment risks (De Palma et al. 2012; Pagani et al. 2019) and slow down the required improvements to infrastructure. Thus, there is a need for a holistic planning tool, especially for large-scale metropolitan areas, with which the anticipated changes in demand and supply of mobility can be simulated for plausible "what-if" scenarios, and the required adaptations in infrastructure and the needed financing can be determined in advance.
Agent-based modelling (ABM) is one approach to model the behaviour and interaction of people in complex urban environments (Zhang and Levinson 2004; Eppstein et al. 2011; Fagnant and Kockelman 2014; Zhuge and Shao 2018). In ABM, agents are considered as individuals with their own logic of behaviour and rules of interaction; thus, ABM simulations are more realistic, and provide a framework to understand causal effects in the complex system as a whole. ABM is not limited to humans, but covers a broad range of disciplines, from particles in physics to living species in biology (Abar et al. 2017). Typically, in mobility ABM, each agent represents a human being or a vehicle within a simulated environment that comprises roads, the public transit system, and other mobility services, as well as the policies and regulations that influence the decision-making of agents. During simulation, agents can react to events and adapt their behaviour. For example, if a road is blocked because of an accident, then agents can re-route to reach their intended destinations.
On the one hand, ABM yields transport simulations with a higher level of detail and complexity in which behaviour and decision-making are driven by the same factors as in the real world (for example, congestion, road accidents, public transport delays, social networks, locations of places of activity, etc.). Thus, very realistic scenarios can be formulated and the impacts of potential changes in policies, infrastructure and transport services can be assessed at the spatial and temporal resolution of each simulated agent. On the other hand, large-scale and detailed agent-based models, which integrate complex behaviour and rules of interaction, impose a substantial computational burden on the simulation process. ABM applied to large metropolitan areas may encounter bottlenecks in computational performance with correspondingly long simulation run times; these challenges reduce the attractiveness of such models for transportation planners and policy makers. Moreover, the cost of the required hardware is also a burden when high performance is required.
One approach to get feasible run times is to improve the computational performance of the simulation itself using either multi-threading and distributed computing (Cameron and Duncan 1996;Rickert and Nagel 2001;Nökel and Schmidt 2002;Cetin et al. 2003;Klefstad et al. 2005), efficient algorithms (Goldberg and Harrelson 2005;Sanders and Schultes 2005) or dedicated acceleration hardware (Strippgen and Nagel 2009). Despite the improvements made with such an approach in recent years, it seems that the complexity of simulated transportation systems is increasing at a rate faster than that at which performance improvements come into place.
Therefore, another approach adopted by researchers is to first downscale the demand and supply of mobility in the simulated scenario: a fraction of agents is randomly sampled from the full population, while the capacity of the road network (that is, the rate at which vehicles pass through a transportation link) and other infrastructure are also downscaled. The simulation outcomes are then scaled up, and these results are considered to be a good approximation of the scenario with the full population and the actual network. For example, if a 10% sub-sample of the whole population is used, then the capacity of the roads is also downscaled to approximately 10% (some variation is possible due to calibration). Then, the simulation outcomes are scaled up by a factor of 10. In effect, each agent in the downscaled scenario represents an aggregate group of real-world individuals rather than a single individual.
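The downscaling and upscaling steps described above can be illustrated with a minimal Python sketch. It is not taken from any simulator: the function names, the uniform random sampling, and the strictly proportional capacity scaling are assumptions made for illustration only (in practice the capacity factor may deviate from the sample fraction due to calibration).

```python
import random


def downscale_scenario(agents, link_capacities, fraction, seed=0):
    """Sample a fraction of the agents and scale link flow capacities to match.

    Assumption: capacities are scaled by exactly the sample fraction.
    """
    rng = random.Random(seed)
    sample_size = round(len(agents) * fraction)
    # Uniform random sample without replacement.
    sampled_agents = rng.sample(agents, sample_size)
    scaled_capacities = {link: cap * fraction for link, cap in link_capacities.items()}
    return sampled_agents, scaled_capacities


def upscale_counts(link_counts, fraction):
    """Scale simulated vehicle counts back up to the full population."""
    return {link: count / fraction for link, count in link_counts.items()}


agents = list(range(1000))          # hypothetical agent identifiers
capacities = {"link_a": 1800.0}     # hypothetical capacity, vehicles per hour
sample, scaled = downscale_scenario(agents, capacities, fraction=0.1)
# A count of 12 vehicles in the 10% scenario is read as about 120 vehicles.
estimated = upscale_counts({"link_a": 12}, fraction=0.1)
```

Each of the 100 sampled agents here stands in for roughly ten real travellers, which is the aggregation that the following paragraphs identify as a source of error.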
It is evident that such an aggregated behaviour in a simulated scenario can lead to errors. For example, a scaled road network introduces rounding errors (such as when the physical length of a street becomes too short under large downscaling factors), or there may be an over- and under-supply of travel demand on links and for public transit due to the skew of the sampled population on certain routes and at certain locations. Depending on the context of the scenario, distortions in the simulation outcomes may lead to a biased interpretation of results. In light of the ongoing shift to e-mobility one can also consider the impact on electricity distribution grids: in a simulation with a sub-sampled population, the demand for charging would be concentrated at certain aggregated points and may lead to the overloading of power lines, while in reality the demand may actually be distributed more uniformly in space and time without causing line overloading issues. Another open question is how one should properly scale the charging infrastructure of EVs in such a simulation with a sub-sampled population.
It should be noted that in general ABM introduces uncertainties into simulations (Ševčíková et al. 2007; Beykaei and Miller 2017) because input data, such as synthetic population and travel demand, are typically derived from surveys and aggregated statistics, and furthermore, the modelled patterns of behaviour are also mostly based on surveys. In contrast to the almost unavoidable uncertainties that result from input data, the uncertainties that result from the use of downscaled input data are introduced artificially because of the lack of required computational power, and these uncertainties can be avoided either by employing a more efficient simulation approach or by running the simulations on more powerful and expensive hardware. Until now, relatively little effort has been made to understand and quantify the uncertainties that are introduced by sub-sampling the demand and supply in mobility simulations. Moreover, there is strong evidence (see Section 2 below) that, in comparison to a simulation with the full-scale input data, sub-sampling distorts the spatio-temporal characteristics of the simulated travel demand and of the traffic externalities.
In this paper, the impacts of using sampled populations and scaled input in agent-based scenarios, with a mesoscopic queueing model for traffic propagation, are quantified. The contributions of the paper are the following:
- A systematic and quantitative assessment of errors that arise in agent-based transport simulations with downscaled input data is presented; specifically, the sensitivities of predicted traffic flow dynamics and public transit occupancy rates are quantified. While the public transit system does not share network links with cars and itself was not scaled in this study, the impacts of scaled travel demand on the occupancies of vehicles in a public transit system are captured.
- A GPU-accelerated mesoscopic queueing traffic model is described, and its sensitivity to variations in traffic dynamics under a range of scale factors is evaluated.
- A large-scale scenario for Switzerland that includes 3.5 million cars and 1.7 million users of public transit is used in the assessment.
- A novel similarity measure, based on the chi-squared statistic, is proposed to assess the quality of simulated spatio-temporal traffic dynamics.
The rest of the paper is organized as follows. In Section 2 an overview of previous related work is given. In Section 3, a large-scale scenario for Switzerland and a mobility model are presented, and the approach that is used to assess the impacts of sub-scaled inputs on the simulated traffic flows and public transit utilization is described. The results of the study are then presented in Section 4; the conclusions and a discussion are presented in Section 5.

Related Work
There are only a few multi-agent mobility simulators capable of running truly large-scale and multi-modal scenarios with millions of agents and millions of links and nodes in the network: TRANSIMS (Smith et al. 1995), MATSim (Horni et al. 2016), and GEMSim (Saprykin et al. 2019a). MATSim, as one of the most widely used, provides the main evidence for issues with sub-sampled populations.
It is common among researchers to use small population samples (1% to 10%) to run large-scale scenarios with scaled input data. Hülsmann et al. (2014) used 1% of the population to run a Munich scenario while studying traffic-related air pollution. The speed of computations was the major reason for downscaling; the simulation outputs were therefore scaled back up to 100% of the population. Zhang et al. (2013) used 1% of the population to run a large-scale scenario for the city of Shanghai. The memory constraints of the available hardware were stated as the main reason for downscaling. Bekhor et al. (2011) integrated activity-based and agent-based models for the Tel Aviv metropolitan area using 10% of the population to avoid long running times. Kickhofer et al. (2016) used a population sample of 0.65% to run a large-scale scenario for the city of Santiago de Chile; the authors conclude that the congestion patterns of the simulated scenario do not match the real patterns well and recommend expanding the population sample to the range of 10% to 100%.
More scenarios and case studies with population samples are available in the MATSim book (Horni et al. 2016). Furthermore, in that book the MATSim developers present the use of a 1% to 10% sample of the population to obtain reliable results with acceptable run times. Therefore, the most popular sample sizes for large-scale scenarios run with MATSim are within this range.
In recent years, ABM simulations with the use of public transit and other emerging modes of transportation have increased. As a consequence, the required computational power has also increased and researchers have continued to use small population samples so that the run times of their simulations are kept within a reasonable timeframe. However, there is evidence that small population samples are the source of some discrepancies in traffic simulations.
Ben-Dor et al. (2017) faced issues using network links shared by private cars and public transit vehicles in MATSim simulations of the Tel Aviv metropolitan area with a 10% population sample. In the simulations, buses tended to get stuck in long waiting queues because the link flow capacity was over-used by cars, resulting in disruptions of public transit services. The study showed that when a coarse population sample is used, the traffic flows on links are not scaled by the same ratio. Furthermore, while for private cars it is possible to calibrate the predicted road traffic flows in order to reduce the adverse effects of scaling, public transit has to run according to schedule and is thus more sensitive to changes in network capacity. Bischoff and Maciejewski (2016), in a study of autonomous taxis in the Berlin area, showed that use of a 10% population sample leads to 11% of the demand for autonomous taxis compared to a simulation with the full population; that is a 10% relative error. On the other hand, changes in fleet occupancy statistics were considered acceptable, and the deviations in the durations of both pickup trips and trips with customers were no more than 3% of the length. Simoni et al. (2015) compared the accumulation-production relationship for the links in a MATSim simulation of central Zurich using different sized population samples (10%, 20% and 50%). The simulation results were seen to vary with the size of the population sample. It was also noted that flows on links decreased faster with the increased flow density of the larger population samples. However, only qualitative graphical comparisons were presented, without any quantitative results. Erath et al. (2012) observed artefacts of overcrowded buses in a MATSim simulation of the public transit in Singapore using a 10% population sample. In order to improve the simulation, a 25% population sample was used. No further details on the issues with public transit were presented.
Bösch et al. (2016), in the Switzerland baseline scenario for MATSim, drew attention to the usage of population sub-samples by providing an example from car-sharing simulations: using a smaller population sample means that a reduction in shared vehicles is required to prevent over-supply, and at the same time the reduction of the number of shared vehicles leads to reduced availability in the area, that is, under-supply. Thus, it is not recommended to use population samples of less than 5%-10% for scenarios with shared cars. Issues with the scaling of public transit were also mentioned: because of the inability to scale a public transit schedule itself (that is, the fleet size and the frequency of operation), it is suggested to scale down the size of public transit vehicles (and, as a consequence, the link flow capacity consumed by a vehicle) proportionally to the size of the population sample. However, as shown by Ben-Dor et al. (2017), this approach does not completely solve the issues with public transit when using population sub-samples. Kwak et al. (2012) made probably the first attempt to systematically study the errors that arise in traffic simulations due to the use of samples of the full population. In that work, a macroscale static traffic assignment model was used for different periods of the day. This traffic model was not an agent-based model, but rather used origin-destination (OD) matrices with zones. The OD matrices were scaled up to match the flows of the full population. While not an ABM simulation, the work nevertheless showed that the use of samples of the full population affects the predicted traffic flows even at the macro-scale.
Recently, Llorca and Moeckel (2019) studied the effects of scaled-down populations in agent-based traffic simulations of the Munich metropolitan area. The study focused mostly on the impacts on average travel time and the distribution of travel times. The results showed that the average travel time depends on the size of the population sample and is at a minimum with a sample that is 10%-20% of the full population. Travel time distributions for the 5% and 100% population samples were observed to be very similar. Another interesting finding was that the scale factor for the spatial length of the links (streets) did not have a strong influence on the average travel time of agents. The authors concluded that a scale factor of 5% seems reasonable for simulations where only analysis of highly aggregated results is required. However, the authors emphasized that for traffic flow analysis of a single corridor a full population sample is likely required.

Methodology
To evaluate the impact of scaled input data on simulation outputs, our bottom-up in-house agent-based simulation framework, EnerPol (Eser et al. 2016;Marini et al. 2018), is used. EnerPol provides integrated scenario-based assessments of energy and transportation infrastructure, land use, and urban development. EnerPol's GPU-Enhanced Mobility Simulator, GEMSim (Saprykin et al. 2019a), is used in this study. GEMSim utilizes a massively parallel mobility model and allows one to assess, in a reasonable amount of time, "what-if" scenarios from city to country scales.

Travel Demand
The travel demand part of a scenario consists of a set of individual agents derived from the EnerPol's agent-based population model. The population model generates a synthetic population for the whole of Switzerland (approximately 8.3 million agents) using data from the Interactive Statistical Atlas of Switzerland (Federal Statistical Office 2019). The agents are then linked to geo-referenced households based on the Swiss Federal Register of Buildings and Dwellings (Federal Statistical Office 2018b), and the Swiss Business and Enterprise Register (Federal Statistical Office 2018a) is used to identify the locations of jobs of employed agents. On this basis, each agent is assigned an individual daily plan that is based on the Swiss Transportation Survey 2010 (Federal Statistical Office / Federal Office for Spatial Development 2012). The calibration was performed in the population model to match census data, and the validation with real world traffic counts is presented in prior work (Saprykin et al. 2019a).
An individual daily plan has a set of activities that an agent must perform, and the legs connecting the activities. Each activity has a spatial location and a type (that is, work, shopping, leisure, etc.), and each agent performs an activity by spending a certain amount of time at the corresponding location. A leg describes the path that an agent should take to move from the location of one activity to the location of another using either a car, public transit, or walking. The Switzerland test case that is used in this paper consists of approximately 3.5 million agents with private cars and approximately 1.7 million agents using public transit (inhabitants below the age of 6, people staying home, and car passengers are excluded from the scenario). The walk mode is a sub-mode of public transit, and is employed when it is faster to walk than to wait for a public transit vehicle.
Travel demand is executed repeatedly in a loop, where agents can learn from their previous experience and adapt their plans between iterations. In this scenario, only the re-routing strategy is allowed; agents neither switch their mode of transport nor shift their departure times between simulated iterations.

Travel Supply
The supply part of the scenario consists of a road network and a public transit schedule. The network is represented as a directed graph, where each edge represents a road segment that spans from an upstream to a downstream node (vertex), and each node may have multiple upstream or downstream links, for example at an intersection. The Switzerland test case has a network with 513 770 nodes and 1 127 775 road links, and is shown in Fig. 1a. The road network was constructed using data from OpenStreetMap (2018).

Traffic Model
GEMSim utilizes a spatial queue model, based on Gawron (1998) and Cetin et al. (2003), and adapted for massively parallel processing on GPUs with additional improvements and optimizations. In this section, a brief description of the traffic queueing model is provided, while a more detailed description is available in Saprykin et al. (2019a). The principle of this queue model is presented in Fig. 2. A link in the network is represented by two buffers: one buffer corresponds to the physical length of a link (spatial buffer), and another buffer corresponds to the designed traffic flow (capacity buffer). The spatial buffer limits the number of vehicles that can simultaneously queue on a road segment, while the capacity buffer limits the number of vehicles that can leave the link in a time period (that is, the number of vehicles per hour). The size of a spatial buffer $N_l$ is calculated as
$$N_l = \left[ k_l \, \frac{L_{link} \, N_{lanes}}{L_{veh}} \right] \tag{1}$$
where $k_l$ is a spatial scaling coefficient, $L_{link}$ is the physical length of the link, $L_{veh}$ is the length of the space occupied by a vehicle (gross space), $N_{lanes}$ is the number of lanes on the link, and $[\,\cdot\,]$ denotes conversion to an integer. The size of a capacity buffer $N_f$ is calculated as
$$N_f = \left[ k_f \, q \, \frac{t_{cycle}}{t_{period}} \right] \tag{2}$$
where $k_f$ is a capacity scaling coefficient, $q$ is the maximum allowed traffic flow through the link in terms of the number of vehicles per time period, $t_{cycle}$ is the duration of a single simulation time step, and $t_{period}$ is the duration of the time period used to define the allowed traffic flow. Both coefficients $k_l$ and $k_f$ are typically in the range (0; 1] and are used to scale the network for the size of the population sample. When full-scale input data are used, the coefficients are equal or close to 1, while for downscaled input data the coefficients are typically equal or close to the fraction of the full population that is sampled. The reason why these coefficients do not necessarily scale in direct proportion to the size of the sample is that the simulation process is highly nonlinear and stochastic. The dependence of these coefficients on the scale that is used is discussed later in the paper.
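As an illustration of how the two buffer sizes respond to the scaling coefficients, the following Python sketch evaluates both definitions. It is not GEMSim code: the function names, the 7.5 m gross vehicle length, and the use of truncation for the integer conversion are assumptions, since the text only states that the result is an integer.

```python
def spatial_buffer_size(k_l, link_length_m, n_lanes, veh_length_m=7.5):
    """N_l: number of vehicles that can queue on the link at once.

    Assumptions: truncation towards zero for the integer conversion and a
    hypothetical gross vehicle length of 7.5 m.
    """
    return int(k_l * link_length_m * n_lanes / veh_length_m)


def capacity_buffer_size(k_f, flow_veh_per_period, t_cycle_s, t_period_s=3600.0):
    """N_f: number of vehicles that can leave the link in one time step."""
    return int(k_f * flow_veh_per_period * t_cycle_s / t_period_s)


full = spatial_buffer_size(1.0, 310.0, 2)    # full-scale network: 82 vehicles
tenth = spatial_buffer_size(0.1, 310.0, 2)   # 10% scenario: 8 vehicles
short = spatial_buffer_size(0.1, 40.0, 1)    # short street, 10% scenario: 0 vehicles
flow = capacity_buffer_size(1.0, 9000.0, 1.0)  # 9000 veh/h, 1 s step: 2 vehicles
```

Under these assumptions the short street truncates to a zero-size spatial buffer, which is the kind of rounding error mentioned in the Introduction; a practical implementation would presumably need to guard against that case.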
A vehicle moves between links in a two-stage procedure. First, the vehicle moves from the spatial buffer to the capacity buffer if the following criteria are met:
- The vehicle has stayed on the link for the minimum required time, defined as $t_{link} = L_{link} / v_{link}$, where $v_{link}$ is the allowed free speed.
- The vehicle is at the front of the queue in the spatial buffer.
- Enough flow capacity has been accumulated: during each simulation step $t_{cycle}$ a link accumulates $q \cdot t_{cycle} / t_{period}$ of flow capacity, so that the physical constraints of the designed traffic flow are respected.
Second, when a vehicle is in the capacity buffer, it moves into the spatial buffer of the downstream link if there is enough free space. When a downstream link has multiple upstream links connected through the same node, the order in which the upstream links are processed is drawn at random, with the probability of selecting a link proportional to its flow capacity:
$$p_i = \frac{q_i}{\sum_k q_k} \tag{3}$$
where $q_i$ and $q_k$ are the flow capacities of the $i$-th and $k$-th upstream links, respectively, and $p_i$ is the probability of an upstream link being selected. This selection process introduces additional stochastic perturbation into the simulation, and multiple runs of the same scenario can produce slightly different results. The order of upstream links is updated in each simulation time step. The same queueing model is used to simulate the movement of public transit vehicles. Each public transit vehicle is assigned a driver agent who follows a route from the schedule, while passenger agents embark and disembark at transit stops.
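The capacity-proportional ordering of upstream links can be sketched in Python as follows. This is one possible reading rather than the GEMSim implementation: the text only specifies the selection probability of a link, so drawing the links sequentially without replacement, and the function name itself, are assumptions.

```python
import random


def order_upstream_links(capacities, rng):
    """Return the upstream link ids in a random processing order.

    Each draw picks one of the remaining links with probability
    q_i / sum_k(q_k), taken over the links not yet placed (assumed scheme).
    """
    remaining = dict(capacities)
    order = []
    while remaining:
        links = list(remaining)
        weights = [remaining[link] for link in links]
        chosen = rng.choices(links, weights=weights, k=1)[0]
        order.append(chosen)
        del remaining[chosen]
    return order


rng = random.Random(42)
# Hypothetical node with a main road (3000 veh/h) and a side road (1000 veh/h).
capacities = {"main": 3000.0, "side": 1000.0}
order = order_upstream_links(capacities, rng)
```

With these capacities the main road is expected to be processed first in about three quarters of the time steps, so it gets earlier access to free space on the downstream link more often, while the side road is not starved.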

Similarity Measure
In order to evaluate the impacts of using scaled input data, a similarity measure is required. This similarity measure must not only be able to compare the outputs from two simulations in relation to each other, but must also provide a quantitative measure of how close the predictions from a simulation with scaled input data are to a scenario with the full-scale population. As we also want to compare simulation outputs both spatially and temporally, the notion of a spatio-temporal point is utilized for the generalization of a similarity measure. A spatio-temporal point represents a measure taken at a certain location over a certain period of time. For traffic flow, the measure may be a set of counts from inductive loops throughout a day, and for public transit the measure may be vehicle occupancy at each stop along a route.
As the problem of uncertainty quantification of sub-sampled populations has received little attention in prior works, there are no specific measures suggested in the literature. However, one can see that a comparison of the outputs of two simulations is very close to the procedures that are used in the calibration and validation of traffic models. While these procedures qualitatively compare outputs from simulations using a common set of measures of goodness-of-fit, most of these measures have a notable limitation: that is, one has to specify a threshold that indicates whether the goodness-of-fit is considered to be acceptable or not.
Below we briefly describe the most commonly used measures of goodness-of-fit, and discuss their limitations and applicability to the present study. Some of these measures are subsequently evaluated to determine whether they qualitatively capture the same trends in the goodness-of-fit as the newly proposed measure, which is described later.
The mean absolute error (MAE), mean absolute normalized error (MANE), root mean squared error (RMSE) and root mean squared normalized error (RMSNE) are used by many researchers for calibration purposes (Hourdakis et al. 2003; Balakrishna et al. 2007; Hollander and Liu 2008). MAE and MANE are insensitive to large errors, and the non-normalized RMSE may give a biased assessment as different roads have different traffic volumes throughout a day; notwithstanding, RMSNE is a good candidate for further evaluation:
$$RMSNE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - y_i}{y_i} \right)^2} \tag{4}$$
where $x_i$ is the prediction at the $i$-th spatio-temporal point, $y_i$ is an empirical or observed value at the same point, and $N$ is the total number of evaluated spatio-temporal points. While the RMSNE can show system-wide relative differences in the outputs of scenarios, the RMSNE does not show the nature of the differences. In that regard, Theil's inequality coefficient (Theil 1958) is a preferred measure and has been used in numerous studies (Hourdakis et al. 2003; Brockfeld et al. 2004):
$$U = \frac{\sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2}}{\sqrt{\frac{1}{N} \sum_{i=1}^{N} x_i^2} + \sqrt{\frac{1}{N} \sum_{i=1}^{N} y_i^2}} \tag{5}$$
An inequality coefficient of 0 indicates a perfect fit, while an inequality coefficient of 1 indicates the worst possible fit. Theil's inequality coefficient can be decomposed into three parts:
$$U_m = \frac{N (\bar{x} - \bar{y})^2}{\sum_{i=1}^{N} (x_i - y_i)^2} \tag{6}$$
$$U_s = \frac{N (\sigma_x - \sigma_y)^2}{\sum_{i=1}^{N} (x_i - y_i)^2} \tag{7}$$
$$U_c = \frac{2 (1 - r) N \sigma_x \sigma_y}{\sum_{i=1}^{N} (x_i - y_i)^2} \tag{8}$$
where $\bar{x}$ and $\bar{y}$ are the averages of the predicted and the observed quantities $X$ and $Y$, $\sigma_x$ and $\sigma_y$ are the standard deviations of the quantities, and $r$ is the Pearson correlation coefficient of the quantities. The three parts, the bias proportion $U_m$, the variance proportion $U_s$, and the covariance proportion $U_c$, indicate the sources of the differences: respectively, the systematic bias, the distribution mismatch between the predicted and the observed quantities, and the non-systematic bias. A value of 0 for $U_m$ and $U_s$ and a value of 1 for $U_c$ indicate a perfect match between the predicted and the observed quantities. Additionally, the sum of the three parts of Theil's inequality coefficient is unity:
$$U_m + U_s + U_c = 1 \tag{9}$$
While the RMSNE and Theil's coefficient are generally applicable to any type of simulation model, the GEH measure of goodness-of-fit has been adopted by many highway agencies around the world (in the USA (Dowling et al. 2004), the UK (Great Britain Highways Agency 1996), Australia (Roads and Maritime Services 2013), New Zealand (NZ Transport Agency 2014), etc.) for the validation of traffic models:
$$GEH(x_i, y_i) = \sqrt{\frac{2 (x_i - y_i)^2}{x_i + y_i}} \tag{10}$$
$GEH(x_i, y_i)$ values less than 5 are considered to be indicative of a good fit; values in the range of 5 to 10 indicate that the model's outcomes are still acceptable but some inconsistency in the model's predictions is present; and values greater than 10 require that the inconsistencies be explained in order that the model's outcomes be accepted. Typically, at least 85% of spatio-temporal points must yield $GEH(x_i, y_i)$ values less than 5 in order that a model is considered to be properly validated. Although $GEH(x_i, y_i)$ is not a real statistic, it is very similar to the chi-squared two-sample test statistic introduced by Pearson, which for a pair of measurements $x_i$ and $y_i$ is defined as:
$$\chi^2(x_i, y_i) = \frac{(x_i - y_i)^2}{x_i + y_i} \tag{11}$$
where $\chi^2(x_i, y_i)$ approaches a $\chi^2$ distribution with one degree of freedom. The tested null hypothesis $H_0$ is that $x_i$ and $y_i$ come from the same distribution, and $H_0$ is accepted if the value of the $\chi^2(x_i, y_i)$ test statistic is less than the critical value $\chi^2_{1,\alpha}$ of the distribution for a chosen significance level $\alpha$ (typically 0.05) and one degree of freedom; otherwise the null hypothesis $H_0$ is rejected. As predicted outputs that are inherently variable must be compared, it seems reasonable to use the GEH measure of goodness-of-fit to assess similarity with a selected significance level. However, even though GEH has been adopted by many highway agencies for the validation of predictions compared to field data, as discussed below, we do not consider GEH to be a good similarity measure for two predicted outputs.
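For concreteness, the measures discussed above can be written as a short Python sketch. The function names are illustrative, the standard textbook forms of RMSNE, Theil's coefficient and its decomposition, GEH, and the two-sample chi-squared statistic are assumed, and population (not sample) standard deviations are used so that the three Theil proportions sum to one.

```python
import math


def rmsne(x, y):
    """Root mean squared normalized error; requires all y_i != 0."""
    return math.sqrt(sum(((a - b) / b) ** 2 for a, b in zip(x, y)) / len(x))


def theil_u(x, y):
    """Theil's inequality coefficient: 0 is a perfect fit, 1 the worst fit."""
    n = len(x)
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / n)
    den = math.sqrt(sum(a * a for a in x) / n) + math.sqrt(sum(b * b for b in y) / n)
    return num / den


def theil_proportions(x, y):
    """Bias, variance and covariance proportions (U_m, U_s, U_c).

    Requires x != y and non-constant x and y (non-zero standard deviations).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)
    sse = sum((a - b) ** 2 for a, b in zip(x, y))
    u_m = n * (mx - my) ** 2 / sse
    u_s = n * (sx - sy) ** 2 / sse
    u_c = 2.0 * (1.0 - r) * n * sx * sy / sse
    return u_m, u_s, u_c


def geh(a, b):
    """GEH value for a single pair of counts; requires a + b > 0."""
    return math.sqrt(2.0 * (a - b) ** 2 / (a + b))


def chi2_stat(a, b):
    """Two-sample chi-squared statistic for a single pair of counts."""
    return (a - b) ** 2 / (a + b)


# Hypothetical hourly counts at four spatio-temporal points.
predicted = [120.0, 340.0, 560.0, 80.0]
reference = [100.0, 360.0, 500.0, 90.0]
u_m, u_s, u_c = theil_proportions(predicted, reference)
```

Note that for any pair of counts the squared GEH value is exactly twice the chi-squared statistic, which is the relation exploited in the discussion that follows.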
From (10) and (11) one can obtain:
$$\chi^2_{GEH}(x_i, y_i) = \frac{GEH(x_i, y_i)^2}{2} = \frac{(x_i - y_i)^2}{x_i + y_i} \tag{12}$$
where $\chi^2_{GEH}(x_i, y_i)$ approaches a $\chi^2$ distribution with one degree of freedom. It can be shown that a given critical value $GEH_{th}$ used to test $GEH(x_i, y_i)$, with its corresponding value of $\chi^2_{GEH,th}$, acts as a non-linear scaling factor for the critical value $\chi^2_{1,\alpha}$ that is used to test the chi-squared statistic for the same pair of quantities $x_i$ and $y_i$:
$$K_{GEH} = \frac{\chi^2_{GEH,th}}{\chi^2_{1,\alpha}} = \frac{GEH_{th}^2}{2 \chi^2_{1,\alpha}} \tag{13}$$
In other words, the use of the GEH formula is similar to using the chi-squared test statistic, but with the critical value $\chi^2_{1,\alpha}$ corresponding to a different (and shifted) significance level $\alpha$, and $K_{GEH}$ establishes the relation between these two different critical values of the $\chi^2_1$ distribution. Essentially, a value of $K_{GEH} > 1$ decreases (and shifts) the initial significance level $\alpha$ (that is, the rate of type I errors) in the chi-squared test, therefore increasing the rate of type II errors significantly. Moreover, the same initial (and unshifted) value of $\alpha$ is used to make the decision on the goodness-of-fit for the model that is being validated: the number of spatio-temporal points that need to satisfy the scaled critical value of $\chi^2_{1,\alpha}$ is not increased to reflect the decreased $\alpha$. It is for this reason that we consider that GEH is not a real statistic.
For example, the recommended value of GEH = 5 with an initial critical value of $\chi^2_{1,\alpha} = 2.07$ for a significance level $\alpha = 0.15$ (that is, the requirement that at least 85% of the spatio-temporal points pass the GEH test) gives approximately $K_{GEH} = 6$, thus shifting $\alpha$ to 0.00041 while still requiring only that 85% of points pass the GEH test. Similarly, a threshold value of GEH = 4 with at least 95% of the points passing the GEH test (that is, initial $\alpha = 0.05$) gives approximately $K_{GEH} = 2$ and a shifted $\alpha = 0.00468$.
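The two numerical examples above can be checked with a short script. This is a minimal sketch, assuming the standard GEH definition, under which $GEH^2 / 2$ equals the pairwise chi-squared statistic; the function names are illustrative and not taken from the study.

```python
from math import erfc, sqrt
from statistics import NormalDist


def chi2_critical_1dof(alpha):
    """Critical value of the chi-squared distribution with one degree of freedom.

    With one degree of freedom the statistic is a squared standard normal,
    so the critical value is the squared two-sided normal quantile.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * z


def chi2_tail_1dof(value):
    """Upper-tail probability P(X > value) for chi-squared with one degree of freedom."""
    return erfc(sqrt(value / 2.0))


def geh_scaling(geh_threshold, alpha):
    """Return (K_GEH, shifted alpha) for a GEH threshold and an initial alpha."""
    chi2_geh_threshold = geh_threshold ** 2 / 2.0  # assumes GEH^2 = 2 * chi^2
    k_geh = chi2_geh_threshold / chi2_critical_1dof(alpha)
    return k_geh, chi2_tail_1dof(chi2_geh_threshold)


for threshold, alpha in ((5, 0.15), (4, 0.05)):
    k_geh, shifted = geh_scaling(threshold, alpha)
    print(f"GEH={threshold}, alpha={alpha}: K_GEH={k_geh:.2f}, shifted alpha={shifted:.5f}")
```

Under these assumptions the script yields scaling factors of roughly 6 and 2, in line with the values quoted in the text.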
Another interesting interpretation of the GEH formula can be obtained from the following:
$$\chi^2(x_i, y_i) = \frac{(x_i - y_i)^2}{x_i + y_i} = \frac{2\sigma^2_{xy,i}}{E_{xy,i}} \qquad (14)$$
where $E_{xy,i}$ and $\sigma^2_{xy,i}$ are the average and variance of $x_i$ and $y_i$. The variance $\sigma^2_{xy,i}$ is multiplied by a factor of 2 because both $x_i$ and $y_i$ are random variables. Combining (13) and (14) gives:
$$K_{GEH} = \frac{\sigma^2_{GEH,th}}{\sigma^2_{1,\alpha}} \qquad (15)$$
where $\sigma^2_{GEH,th}$ and $\sigma^2_{1,\alpha}$ are the variances for the corresponding critical values of $\chi^2_{GEH,th}$ and $\chi^2_{1,\alpha}$, conditional on the equality of the corresponding means $E_{GEH,th}$ and $E_{1,\alpha}$. Equation (15) shows that GEH with $K_{GEH} > 1$ accepts model predictions with a higher variance than is allowed with the chi-squared statistic, and this over-dispersion is exactly defined by $K_{GEH}$.
The lack of clarity in how the $GEH_{th}$ values that are suggested as critical for traffic model validation are determined casts doubt on the general applicability of the GEH formula as a good similarity measure for the outcomes from simulations. Neither of these $GEH_{th}$ values can be considered to be a reliable statistical measure. Instead, the chi-squared test is considered to be more suitable for assessing similarity, and its statistic is perhaps the most widely used statistic with well-known properties. Moreover, the chi-squared test allows one to choose a significance level $\alpha$, making this measure of goodness-of-fit more widely applicable. Nevertheless, it is worthwhile to see if the GEH measure is able to capture the trends in similarity as well.
Considering the above, we propose a similarity measure $S_\alpha$ for a given significance level $\alpha$, where $S_\alpha$ has a minimum of 0 when none of the spatio-temporal points pass the chi-squared test, and a maximum of 1 when at least $(1 - \alpha) \cdot 100\%$ of the spatio-temporal points pass the test. Thus, $S_\alpha$ represents a straightforward statistical and quantitative approach to measure the similarity of outputs from mobility simulations. This similarity measure is generic and can be applied to both traffic flows and the occupancy of public transit vehicles. Below, we therefore evaluate the following four measures: RMSNE, Theil's inequality coefficient, GEH and the proposed $S_\alpha$.
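A measure with the two endpoints described above can be sketched as follows. This is only an illustration: the linear scaling of the pass rate by $1/(1 - \alpha)$ between the endpoints, the treatment of empty pairs, and all function names are assumptions rather than the exact definition used in the study.

```python
from statistics import NormalDist


def chi2_critical_1dof(alpha):
    """Critical value of chi-squared with one degree of freedom (squared normal quantile)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * z


def chi2_pair(x, y):
    """Pairwise chi-squared statistic; an empty pair (x + y == 0) is treated as identical."""
    total = x + y
    return 0.0 if total == 0 else (x - y) ** 2 / total


def similarity(xs, ys, alpha=0.05):
    """Share of points passing the chi-squared test, scaled so that a pass
    rate of (1 - alpha) or more maps to 1 (assumed linear scaling)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    critical = chi2_critical_1dof(alpha)
    passed = sum(chi2_pair(x, y) < critical for x, y in zip(xs, ys))
    return min(1.0, passed / ((1.0 - alpha) * len(xs)))


# Hypothetical scaled-up counts (xs) against reference counts (ys).
print(similarity([100, 10], [100, 100]))
```

In this sketch `xs` would hold the scaled-up counts from a sampled run and `ys` the counts from the reference run, one entry per spatio-temporal point.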

Scaling Scenario Input and Output
To investigate the impact of scaling one's input data, the following set of population samples is used: 1%, 2%, 5%, and 10% to 90% with 10% intervals. Uniform random sampling of the full population was applied, and the simulation outputs with the full population (that is, the 100% population sample) are used as the reference (that is, quantity $Y$). Hence, all simulated population samples are compared against the full population, which is simulated as the reference case. A 30-hour period, starting from midnight, is simulated. As the dependence of the network parameters $k_l$ and $k_f$ on the size of the population sample is unknown, a calibration (impedance matching) was performed for each of the samples to match the simulation outputs of the reference case as closely as possible. Two sets of spatio-temporal points were chosen for the calibration of car traffic: counts on all car-related network links during the morning (07:00-08:00) and the evening (17:00-18:00) peak hours. During the simulation, the traffic counts are aggregated over 15-minute time intervals, and therefore for each peak hour each network link has four distinct spatio-temporal points which are used in the assessment of similarity. The use of multiple spatio-temporal points on each link allows the temporal dynamics of traffic flows to be compared, rather than limiting the comparison to time-averaged states only. In addition to the peak hours, two other sets of spatio-temporal points were used to evaluate the similarity measures with the calibrated coefficients: the noon hour (12:00-13:00), when the traffic is relaxed, and daily (00:00-24:00) aggregated values. Network links where no cars pass in the reference case are excluded from the analysis. As public transit is not scaled (neither the network links nor the schedule), it need not be calibrated. The occupancy of each public transit vehicle after its departure from each stop is used as the set of spatio-temporal points to assess the similarity of simulated public transit.
The traffic and occupancy counts from simulations with population samples are scaled up by the inverse of the sampling fraction size to match the counts from the reference case.
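The aggregation and up-scaling steps described above can be illustrated with a small sketch. The event format (link identifier and entry time in seconds since midnight) and the function name are hypothetical and chosen only for this example.

```python
from collections import Counter

BIN_SECONDS = 15 * 60  # counts are aggregated over 15-minute intervals


def scaled_counts(link_entries, sampling_fraction):
    """Aggregate (link_id, time_in_seconds) events into 15-minute bins and
    scale the counts up by the inverse of the sampling fraction.

    Returns a dict mapping (link_id, bin_index) to the scaled count.
    """
    if not 0.0 < sampling_fraction <= 1.0:
        raise ValueError("sampling_fraction must be in (0, 1]")
    counts = Counter((link, int(t // BIN_SECONDS)) for link, t in link_entries)
    factor = 1.0 / sampling_fraction
    return {key: n * factor for key, n in counts.items()}


# Hypothetical events from a 10% sample: two cars on link "a" shortly after
# 07:00, one car on the same link after 07:15, and one car on link "b".
events = [("a", 7 * 3600 + 10), ("a", 7 * 3600 + 20), ("a", 7 * 3600 + 900), ("b", 0)]
print(scaled_counts(events, 0.1))
```

Each key of the returned dictionary corresponds to one spatio-temporal point, so a peak hour contributes four keys per link.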
The optimal values $k_l^*$ and $k_f^*$ of the calibrated parameters $k_l$ and $k_f$ are determined by solving an optimization problem over the similarity measures $S_\alpha^m$ and $S_\alpha^e$ for the morning and evening peak hours, respectively, where a significance level $\alpha$ of 0.05 was chosen. To reduce the number of required simulations for the calibration, the search space for the values $k_l^*$ and $k_f^*$ was adjusted depending on the sampling fraction size as follows:
- $\{k_l, k_f\} \in [0.1; 1]$ with a step of 0.1 for population samples from 10% to 90%;
- $\{k_l, k_f\} \in [0.005; 0.015]$ with a step of 0.001 for the 1% population sample;
- $\{k_l, k_f\} \in [0.01; 0.1]$ with a step of 0.01 for the population samples of 2% and 5%.
Additionally, the small population samples (less than 10%) were examined with a broader range of $k_l$ and $k_f$, but the variation in the optimization objective was in the range of 1%. Similarly, larger population samples did not show significantly different behaviour for intermediate scaling ratios. In total, 1 221 simulations were performed for the whole set of population samples. After calibration, the $k_l^*$ and $k_f^*$ that were determined for each population sample were used to calculate the similarity measures of car traffic for that population sample. This procedure was repeated five times with different sets of random population samples, and the results of the similarity measures were averaged to smooth out fluctuations. Thus, a total of 6 105 simulation runs were performed.
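The simulation counts follow directly from the search ranges if every combination of $k_l$ and $k_f$ is simulated for each population sample. The sketch below assumes such a full factorial grid; the names are illustrative.

```python
def grid_size(start, stop, step):
    """Number of grid points from start to stop (inclusive) with the given step."""
    return round((stop - start) / step) + 1


# Sample size in percent -> (start, stop, step) of the k_l and k_f search range.
SEARCH_RANGES = {
    1: (0.005, 0.015, 0.001),
    2: (0.01, 0.1, 0.01),
    5: (0.01, 0.1, 0.01),
}
SEARCH_RANGES.update({percent: (0.1, 1.0, 0.1) for percent in range(10, 100, 10)})


def runs_per_sample_set():
    """One simulation per (k_l, k_f) pair, summed over all population samples."""
    return sum(grid_size(*bounds) ** 2 for bounds in SEARCH_RANGES.values())


RUNS_PER_SET = runs_per_sample_set()
TOTAL_RUNS = 5 * RUNS_PER_SET  # five independent sets of random samples
print(RUNS_PER_SET, TOTAL_RUNS)  # prints "1221 6105"
```

The 1% sample contributes 11 x 11 combinations and each of the other eleven samples contributes 10 x 10, which reproduces the totals stated above.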
To determine the number of iterations to run for each of the population samples, the average score of the agents between iterations was examined for different sample sizes. Figure 3a shows that larger population samples require more iterations to converge the average score, while samples smaller than 10% require only a few iterations. The same behaviour, when the number of iterations required for a simulation to converge depends on the population sample size, was previously reported by Llorca and Moeckel (2019). Moreover, use of larger population samples may lead to oscillations of the score between iterations. Figure 3b shows a zoomed-in part of the previous plot, after 80 iterations, when agents stopped innovating their daily plans and only choose one of the previously memorized daily plans with the best scores. After 100 iterations, the difference of the average score between 1% and 100% samples is less than 1%. However, after 20 iterations, there is only marginal improvement for the average score even for large sub-samples. Hence, sub-scaled scenarios were run for 20 iterations with re-routing strategy followed by 5 iterations with memorized plans only.

Results
The mean, standard deviation (shaded area) and 95% confidence intervals using the t-distribution (vertical bars) for the optimal values of the scaling coefficients $k_l^*$ and $k_f^*$ are shown in Fig. 4. The standard deviation and confidence intervals are obtained from five runs for each of the combinations of $k_l$ and $k_f$ during the calibration process, as described above. It is interesting to note that the capacity coefficient $k_f^*$ matches exactly the scale of the input data for almost all population samples; thus, it is evident that the flow capacity is the main driver of the traffic dynamics in the traffic model. For almost all population samples, the standard deviation of $k_f^*$ is either zero or very close to zero; therefore, the flow capacity coefficient is rather stable and is not affected by traffic fluctuations between simulation runs. The spatial coefficient $k_l^*$ is close to 1 for population samples in the range of 20%-50%, and decreases for larger population samples. However, for very small population samples, less than 10%, $k_l^*$ decreases substantially, and does not match the overall input scale factor. This can be explained as follows: the more the network is scaled down, the more the performance of spatial buffers degrades due to rounding errors when smaller values of $k_l$ are applied; on the other hand, a larger $k_l$ better matches the impedance of the reference case. The same behaviour is reported by Llorca and Moeckel (2019), where spatial buffers are scaled with factors larger than the corresponding size of the population sample. The standard deviation of $k_l^*$ is larger than the standard deviation of $k_f^*$, indicating that spatial buffers are more sensitive to stochastic fluctuations in the simulated traffic.
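For reference, the summary statistics plotted for each population sample can be computed as sketched below. The input values are made up for illustration, and the use of the sample standard deviation (with n - 1 in the denominator) is an assumption; the constant 2.776 is the two-sided 95% quantile of Student's t-distribution with four degrees of freedom, which applies to five runs.

```python
from math import sqrt
from statistics import mean, stdev

T_CRIT_95_DF4 = 2.776  # two-sided 95% t quantile for 4 degrees of freedom


def mean_std_ci(values, t_crit=T_CRIT_95_DF4):
    """Return (mean, sample standard deviation, 95% confidence interval)
    for exactly five values, matching the hard-coded t quantile."""
    if len(values) != 5:
        raise ValueError("the hard-coded t quantile is only valid for five values")
    centre = mean(values)
    spread = stdev(values)
    half_width = t_crit * spread / sqrt(len(values))
    return centre, spread, (centre - half_width, centre + half_width)


# Hypothetical optimal coefficients from five sets of samples (not real data).
print(mean_std_ci([0.8, 0.9, 1.0, 0.9, 0.9]))
print(mean_std_ci([0.4, 0.4, 0.4, 0.4, 0.4]))
```

A coefficient that takes the same value in all five runs, as in the second call, yields a zero standard deviation and a confidence interval of zero width.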
To show the impact of rounding errors on the simulated traffic flow, consider the following example with a network link of $N_l = 8$ (about 60 meters long) and $N_f = 1$ (flow capacity of 900 cars/hour), which is very typical for the city of Zurich. For the sake of simplicity, we assume that the flow of this link is fully matched with downstream links such that there is always free space available in downstream spatial buffers. Further, we consider a stable flow of 900 cars/hour from upstream links, or 1 car every 4 seconds of simulated time. In a non-scaled scenario, this link should never overflow, and no spill-over should occur under the assumed conditions. In the first scenario, a 40% sample is used, and, according to (1)-(2), link buffers are scaled to $N_{l,0.4} = 3$ and $N_{f,0.4} = 1$, while the link accumulates 0.4 cars of flow capacity every 4 seconds. In the second scenario, a 70% sample is used, and link buffers are scaled to $N_{l,0.7} = 5$ and $N_{f,0.7} = 1$, while the link accumulates 0.7 cars of flow capacity every 4 seconds. As soon as the link has accumulated a flow capacity of at least 1.0, it can release 1 car into the capacity buffer and subsequently downstream. Table 1 shows the state of the spatial buffer of the link during a simulation when the upstream flow is kept constant (1 car every 4 seconds). Since a flow of a single car cannot be downscaled, there exists a probability that for a certain time period a flow . . . of 900 cars/hour is kept. In Table 1, a simulation step equals 4 seconds to keep the calculations simple. Table 1 shows that the spatial buffer of the link overflows in only 6 steps (24 simulated seconds) when using a 40% population sample with the network scaled down accordingly. However, when using a 70% population sample, it takes 17 steps (68 simulated seconds) to overflow the link. In total, after 17 steps, the link with the smaller scaling factor of 0.4 spills over 8 cars, causing congestion in upstream links.
Hence, rounding errors in downscaled network links can severely affect congestion patterns and traffic dynamics during the simulation.
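The worked example can be replayed with a few lines of code. This is a simplified sketch of a single link, not the simulator's implementation: the order of operations within a step (accumulate capacity, release at most one car, then admit the arriving car) is an assumption chosen because it reproduces the step counts quoted above, and exact fractions are used to avoid floating-point drift.

```python
from fractions import Fraction


def simulate_link(spatial_slots, capacity_per_step, steps):
    """Replay a single link fed by one arriving car per simulation step.

    Returns (first step at which an arriving car finds the link full,
    number of arrivals that could not enter the link).
    """
    occupancy = 0
    credit = Fraction(0)  # accumulated, not yet used flow capacity
    first_overflow = None
    spilled = 0
    for step in range(1, steps + 1):
        credit += capacity_per_step
        if credit >= 1 and occupancy > 0:
            credit -= 1
            occupancy -= 1  # one car leaves towards the downstream link
        if occupancy < spatial_slots:
            occupancy += 1  # the arriving car enters the spatial buffer
        else:
            spilled += 1  # the arriving car is blocked upstream
            if first_overflow is None:
                first_overflow = step
    return first_overflow, spilled


print(simulate_link(3, Fraction(2, 5), 17))   # 40% sample
print(simulate_link(5, Fraction(7, 10), 17))  # 70% sample
```

Under these assumptions the 40% case first overflows at step 6 and blocks 8 arrivals within 17 steps, while the 70% case first overflows at step 17.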
Additional insight into the nature of the impacts of $k_l$ and $k_f$ is given in Fig. 5, where the optimization objective for one of the sets of samples (not averaged across the five sets) is shown. The characteristics of the optimization objective, given in (18), in the $k_l$-$k_f$ plane are sensitive to the size of the population sample that is used. For the very small population sample of 1%, the objective is low (that is, the traffic is dissimilar to the reference case) and its surface is flat. With this flat surface, there is a high likelihood that many different combinations of $k_l$ and $k_f$ may be optimal. Thus, with the very small population sample, the scaled flow capacity of the network does not drive the traffic dynamics, and the model generates mostly noise rather than proper traffic dynamics. When the size of the population sample is increased to 10%, the model is more similar to the reference case in terms of the traffic dynamics, and the flow capacity of the buffers generates more realistic traffic flows. With a 50% population sample, the model is almost fully driven by the flow capacity and generates traffic that is quite similar to the reference case. It is also clear that, at this sample size, the $k_l$ coefficient does not significantly affect the simulation outputs. With a 90% population sample, the model is sensitive to both the flow capacity and the spatial length of the links; in this regard, $k_l$ effectively fine-tunes the traffic dynamics.
As the size of the population sample increases, the objective (traffic similarity) improves, and the surface of the objective in the $k_l$-$k_f$ plane becomes more concave. The concave surface indicates that the objective is more sensitive to changes of the scaling coefficients; therefore, the optimal values of $k_l$ and $k_f$ are closer to the scaling factor when the surface is concave. Nevertheless, as the simulations are highly non-linear, the optimum coefficient $k_l^*$ for spatial buffers is smaller than the corresponding scale factor for the flow capacity. Optimum coefficients for spatial and capacity buffers, evaluated with different sized population samples, are presented in Table 2 (Appendix A).
The measures of goodness-of-fit for the different population samples are presented in Figs. 6-7. In the legend, GEH 5 and GEH 10 indicate that the links satisfy, respectively, the conditions $5 \le GEH(x_i, y_i) < 10$ and $GEH(x_i, y_i) \ge 10$, and HW indicates that the similarity measure is only applied to a subset of network links consisting of motorway and expressway links with a minimum speed of 80 km/h. The GEH measure is given as a percentage of the total number of links that are analyzed. As expected, the GEH measure satisfies the validity condition GEH < 5 for at least 85% of the links even with the 5% population sample during hourly evaluated periods (morning, noon and evening), while the traffic similarity $S_\alpha$ based on the chi-squared statistic yields statistically significant results only with a population sample of at least 30% in the morning, 60% at noon and 40% in the evening. The lower similarity in the noon and evening hours is only observed for motorways and expressways, while for the whole set of links a statistically significant result is already achieved with a population sample of at least 30%. Motorways and expressways are thus also affected by discrepancies in traffic dynamics when small population samples are used. One may not intuitively expect this result, considering that links with smaller volumes are affected more by traffic fluctuations, but the simulations show that both road types are affected by the size of the population samples. Nevertheless, lower traffic volumes on highways at noon could be the reason for higher fluctuations and lower similarity. The same trend is also observed for daily aggregated counts in Fig. 7, where neither GEH nor $S_\alpha$ reaches a statistically significant result, as the error is accumulated throughout the day for each of the links.
It is also noteworthy that as the size of the population sample is increased from 1% to 30%, the GEH measure decreases sharply from more than 30% to less than 1% (except for the daily evaluated period, where errors are accumulated), and then decreases only slightly up to a population sample of 90%. Therefore, GEH captures qualitatively a critical point with a population sample of 30%, while $S_\alpha$ captures this point quantitatively. The curves for the GEH measure with the GEH 10 condition decrease more rapidly than the curves for the GEH measure with the GEH 5 condition, as the more heavily penalized links fit better to the reference distribution as the size of the population sample is increased. All other measures of goodness-of-fit qualitatively capture the impacts of the size of the population sample on traffic dynamics. However, Theil's inequality coefficient highlights a critical point when the size of the population sample is in the range of 10%-20%, and further increases in the size of the population sample lead to only minor improvements. One can also note that for the noon hour the errors are higher, as was previously mentioned. More detailed data on the three component parts of Theil's inequality coefficient are presented in Tables 3-6 (Appendix B) for cars and in Table 7 (Appendix B) for public transit. These data show that there is a larger systematic bias and a larger mismatch in the distribution for small population samples. Such errors can be expected when small traffic volumes are scaled up by large multiples. For large population samples, non-systematic errors contribute more than the systematic bias and the mismatch of distribution, as the overall match of traffic dynamics and transit occupancy is better.
The RMSNE also shows that there is a tendency for the error to reduce with larger population samples, but it is very difficult to interpret these results. For example, the RMSNE is below 2 for cars when a 5% population sample is used, as the simulation results are averaged over a large number of links when the RMSNE is evaluated.
Although the public transit infrastructure is not scaled using the $k_l$ and $k_f$ coefficients and public transit vehicles do not share the roads with other cars, in general, all similarity measures indicate that public transit has higher uncertainties compared to the peak hours; however, the similarity stays almost unaffected in a daily evaluated period. Public transit vehicles always run according to the schedule, and therefore are unaffected by the queueing model (that is, no congestion forms, and vehicles can always move forward). Thus, we can infer that discrepancies in the predicted traffic dynamics that result from the use of small population samples do not depend mainly on the traffic model that is used, but rather are a consequence of the sub-scale population itself, which distorts the demand. Figure 8 and Table 8 show the vehicle-hours travelled (VHT) and vehicle-kilometres travelled (VKT) during hourly and daily evaluated periods; the performance is normalized relative to the reference case. The simulations with a 5% population sample give errors in the range of 5%-6%, and the errors are 2%-3% for simulations with 10%-20% population samples. It is worthwhile to note that VKT varies much less than VHT and has an error of no more than 6%, even with a 1% population sample, while VHT has errors in the range of 7%-30% for small population samples. A possible explanation is that the distribution of trip distances performed by the agents is narrower and concentrated around the mean value; therefore, it is easier to approximate the distribution with a few agents. On the other hand, the distribution of trip durations is broader, and the use of a small population sample gives a larger error. However, it is also evident that discrepancies in the measures of goodness-of-fit depend on the patterns of behaviour of the simulated population, and these patterns can be different in other scenarios.
The runtime performance of simulations with different population sub-samples was evaluated on a cluster node with two Intel Xeon E5-2620 v4 CPUs clocked at 2.60 GHz, 256 GB of RAM and four NVIDIA P100 (16 GB RAM onboard) GPUs. In total, 20 CPU threads were used to run a scenario. The runtime of one simulated daily iteration for each of the population sub-samples is shown in Fig. 9. The runtime scales non-linearly: for example, when the size of a sub-sample increases from 20% to 40%, the runtime increases only from 220 to 350 seconds. The reason for this behaviour is the structure of the GPU-accelerated simulation loop. As shown in prior work (Saprykin et al. 2019b), there are two main performance bottlenecks: (i) data transfers between the GPU and the host side, and (ii) re-routing of the agents between iterations. Both bottlenecks depend almost linearly on the size of the simulated population, but the traffic propagation part, which is accelerated with a GPU, has a non-linear runtime dependency (Saprykin et al. 2019a). Thus, the overall runtime of a simulated iteration has a non-linear dependency on the size of the population sample.

Conclusions and Discussion
The GPU-enhanced spatial queuing model is seen to be more sensitive to the traffic flow coefficient $k_f$ than to the spatial coefficient $k_l$. Thus, in the calibration of a simulation that uses a sampled population, the optimal flow and spatial coefficients can be determined more efficiently: specifically, the search space for the optimum $k_f$ can be reduced by using a narrower search range for $k_f$. As the optimum flow coefficient is within a known narrow band, only a few steps are required in the $k_f$ dimension. On the other hand, as the spatial coefficient does not have a strong impact on traffic dynamics, coarser step sizes can be used in the $k_l$ dimension. All commonly used measures of goodness-of-fit, as well as our proposed measure, capture the same trends in regard to the impacts of using sampled populations: namely, the smaller the population sample, the more dissimilar the traffic dynamics are to those in a simulation that uses the full population. Our proposed measure of goodness-of-fit $S_\alpha$ shows that there is a critical size of approximately 30% of the full population, below which the simulated traffic dynamics are markedly different. Therefore, as a trade-off between the quality of the simulated traffic dynamics and the required computational power, a 30% population sample will give predictions that are not significantly different from a simulation that uses the full population. The dynamics of the occupancy of public transit are also accurately captured in simulations with a 30% population sample. However, if high accuracy in the predicted traffic dynamics and in the occupancy of public transit is not required, and the main interest is only in aggregated parameters such as VHT and/or VKT, with errors in the range of 2%-6% being acceptable, then smaller population samples, in the range of 5%-10% of the full population, can be used.
Nevertheless, as emerging technologies continue to increase the complexity of modern transportation systems, the impact of the size of the population sample always needs to be considered. One example is the transition to electric mobility, where simulations with a 30% population sample will yield accurate predictions of the traffic dynamics, but may not necessarily capture accurately the spatio-temporal characteristics of the electricity demand for the charging of EVs. Only simulations with the full population will unambiguously yield accurate predictions related to both the transportation and electricity infrastructures.
In the simulations with sampled populations, whereas for private traffic both the supply and demand are scaled, in the case of public transit only the demand is scaled. The measures of goodness-of-fit show the same trends for both private traffic and public transit in regards to the impacts of using sampled populations. A public transit system can be seen here as a case of an ideally scaled road network which does not impact travel times of the agents. It is thus evident that it is the scaling of demand which is the main reason for distortions in the simulated traffic dynamics. While a scaled traffic queueing model may also impact traffic dynamics, the impacts can be reduced during the calibration process. We expect that scenarios with other types of agent-based traffic models will also be adversely impacted by the scaling of input data.
In the study, five samples from the full population for each of the scaling factors were used to smooth out fluctuations caused by the probabilistic traffic model (that is, the order in which upstream links are processed is randomized in each simulation step). However, each sample may itself introduce uncertainties in the simulation results. Hence, in future research, it would be interesting to quantify the impact of a single population sample on such uncertainties. To do that, one can simulate a scenario for each population sample multiple times with different random seeds and then average the simulation results. While this would require even more computing power, one would be able to separate per-sample uncertainties from the other uncertainties due to the use of scaling factors.
The results presented in the paper are obtained using the large-scale scenario of Switzerland. First, it would be interesting to see similar studies for other areas to better understand the impacts of different network topologies, for example how a grid network structure affects the results. Second, as the public transit vehicles were not mixed with the rest of the traffic and were not delayed in this study, further analysis of scalability issues is possible in scenarios where public transit vehicles do share roads with other cars and trucks. Finally, re-planning strategies other than re-routing may also impact the scaling process differently. For example, departure time mutation or a switch of transport mode would be interesting to study in the context of the scaling issues.