Theoretical background on the validation of AB models
In this section, we move from the general to the specific in discussing the validation of AB models in the light of the existing literature. First, we introduce the types of validation techniques (stages) for AB models in general terms. We follow a procedure for validating AB models introduced by Klügl (2008) and discuss the validation stages ordered within that procedure. Then, we discuss the last of these stages, called empirical validation, in detail, since we introduce a novel method for that stage in this paper.
One of the major benefits of AB models is that they allow a real-world phenomenon to be explained and understood when analyzing it directly (e.g., through field experiments or laboratory experiments) would be costly and sometimes difficult (Xiang et al. 2005). As Farmer and Foley (2009, p. 686) state, “AB models allow for the creation of a kind of artificial (virtual) universe in which many players act in complex and realistic ways”. Thus, such models enable the analysis, in silico, of the future status of the original system under novel conditions. Assessing how well the artificial universe (i.e., the AB model) represents a portion of the original system (i.e., the part of the real world that is to be modeled) makes the modeling results more credible (Klügl 2008). This assessment is called validation in the literature (Windrum et al. 2007; Bianchi et al. 2007). If the model is validated, the answers derived from the model can be used to answer questions directed at the original system (Klügl 2008).
Klügl (2008) introduced a framework (see Fig. 1) that places different validation stages in an order for validating AB models. Some stages in the framework are also discussed separately (i.e., not as part of a framework) in Balci (1994). The framework starts with face validity. In that stage, modelers consult domain experts to assess whether the model behaves reasonably; the experts provide subjective judgments on the accuracy of the model. Sensitivity analysis comes next, where the impact of different parameters on the model output is assessed. It is assumed that the relationship between a parameter and the output that occurs in the model should occur similarly in the original system. Once such impacts are analyzed, appropriate parameter values are assigned during calibration. Calibration aims to find the “optimal” parameter set, i.e., the one that makes the model output resemble the output of the original system. In general, AB model parameters are calibrated to aggregated (macro) patterns (Guerini and Moneta 2017). The plausibility check comes after calibration, where human experts assess the plausibility of the model outcome (e.g., dynamics and trends of the different output values of model runs). It is technically the same as the previously discussed face validity, as Klügl (2008) states. Finally, in the stage named empirical validation, statistical tests are applied to compare model-generated data with real data.
Empirical validation is the last stage of the procedure in Fig. 1 and aims to compare the data coming from the rwDGP and the mDGP statistically. Assume that we have real data generated by the rwDGP, containing data points in a time series. The data points can be at the micro-level, as the expression in (1) denotes (Pyka and Fagiolo 2007; Windrum et al. 2007), where I represents the population of individuals whose heterogeneous behaviors are observed and contained in the vector z over a finite time series of length n. For instance, for a mobility mode-choice model, z would contain individual-level mobility mode-choice behavior:
$$\begin{aligned} (z)_i = \{ z_{i,t},\ t = t_0,\ldots , t_n\}\quad i \in I,\ n \in {\mathbb {N}} \end{aligned}$$
(1)
$$\begin{aligned} (Z) = \{ Z_{t},\ t = t_0,\ldots , t_n\}\quad n \in {\mathbb {N}}. \end{aligned}$$
(2)
The data points that the rwDGP generates at the micro-level can be aggregated to obtain macro-data points, as denoted in (2) (Pyka and Fagiolo 2007; Windrum et al. 2007), where the vector Z contains macro-data points of a population (i.e., I) over a time series. For instance, a household’s consumption behavior is represented by a micro-level data point, while the aggregation over all households in a population I is represented by a macro-level data point, which can then be used as a component of GDP. Modelers aim to approximate the values in the vector z or Z, which requires finding the optimal micro (\( \theta \), e.g., agent preferences) and macro (\( \Theta \), e.g., the environment) parameters during calibration. Once the optimal parameters are set, which is the step preceding empirical validation in Fig. 1, the output of the model can be compared empirically with real data from the original system (Fagiolo et al. 2007; Guerini and Moneta 2017). As Klügl (2008, p. 6) states, “calibration and validation must use different data sets for ensuring that the model is not merely tuned to reproduce given data, but may also be valid for inputs that it was not given to before”. However, having two data sets from the same original system is often not possible. In such cases, the available data can be used on all available levels, as Klügl (2008) asserts. For instance, a model can use micro-data as input, be calibrated at the macro-level, and be validated at the meso-level. In this way, the same data set can be exploited at different levels without over-fitting.
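For concreteness, a hedged illustration of how the micro-level series in (1) relates to the macro-level series in (2): if the macro variable is obtained by simple summation over the population (as in the consumption example above), then

$$\begin{aligned} Z_t = \sum _{i \in I} z_{i,t}, \quad t = t_0,\ldots , t_n. \end{aligned}$$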
Related works
In this section, we first discuss recently introduced validation methods. Then, we explain how our method relates to the discussed methods and how it could extend them.
Lamperti (2018b) has offered an information-theoretic criterion called General Subtracted L-divergence (GSL-div) as a validation method for AB models. The method measures the similarity between model-generated and real-world time series. It assesses the extent of a model’s capability to mimic patterns (e.g., distributions of time-series features such as changes of values from one point in time to another) occurring in real-world time series. It is related to our method, because our method also aims to compare the similarity between patterns occurring in real data and model-generated data. However, GSL-div focuses only on aggregated time-series data, as Fagiolo et al. (2019) indicate, while our method focuses on meso-level behavioral patterns constructed from micro attributes. We discuss the advantages of the meso-level approach later. The author states that the GSL-div can overcome certain shortcomings of the method of simulated moments (MSM), e.g., it does not need to resort to any likelihood function and provides a better representation of the behavior of complex time series. Technically, the method could be applied to any AB model that produces time-series data. A detailed explanation of the method, illustrative examples, and case studies can be found in Lamperti (2018a, 2018b).
Barde (2020, 2016) has introduced another information-theoretic criterion as a validation method for AB models, called the Markovian information criterion (MIC). It follows the minimum description length (MDL) principle, which hinges on the efficiency of data compression to measure the accuracy of a model’s output (Grünwald and Grunwald 2007). The method first uses model-generated data to create a Markov transition matrix for the model, and then uses the real data to produce a log score for the model on those data. It uses the Kullback–Leibler (KL) divergence to measure the distance between real and model-generated data; thus, the accuracy of the mDGP is assessed. As the author states, the method does not include estimation; instead, it is applied to already calibrated models to assess their output. It is related to our method in that respect. However, similar to GSL-div, the application level of our method differs from that of MIC, as we explain in detail in the following section.
Grazzini and Richiardi (2015) discuss estimation methods for dynamic stochastic general equilibrium (DSGE) models and analyze whether such methods can also be applied to AB models. The authors mention simulated minimum distance (SMD) methods, such as the method of simulated moments (MSM), as natural approaches to the estimation of AB models. Such methods estimate model parameters by minimizing the distance between aggregates of the model output and of the real data. Our approach differs from these methods, because it focuses on the last step of the procedure of Klügl (2008) (see Fig. 1). In other words, it is applied to already calibrated models, similarly to the method of Barde (2016). Thus, estimation methods in the SMD class can only be complementary to our method. As discussed in the future work section, in a future paper, we plan to couple an SMD method with our method and apply them together to an AB model.
Unlike the previously discussed methods, Guerini and Moneta (2017) offer a method that compares causal relationships in model-generated data and real-world data to validate AB models. The method hinges on estimating Structural Vector Autoregressive (SVAR) models on real and artificial time series and comparing them to obtain a validation score. Our method does not rely on time series; instead, it compares relationships at the meso-level, while the method of Guerini and Moneta (2017) focuses only on aggregate time series.
To conclude, as Fagiolo et al. (2019, p. 14) state in their critical review, “all these recently developed validation methods focus only on aggregate time-series, while most of AB models have been able to replicate both micro and macro stylized facts”. Some of the discussed methods could in principle be applied at the micro-level, but there is no “proof-of-concept” yet. Besides, applying such methods at the micro-level could lead to over-fitting if a model gets micro-data as input and its parameters are estimated to fit individual behavior one-to-one (e.g., fitting the behavior of an artificial agent to its real counterpart). Given the increasing availability of micro-data, the number of AB models using micro-data as input is increasing (Macal and North 2014; Hamill and Gilbert 2016). Therefore, in this paper, we offer a meso-level validation method for models drawing on micro-data. The method involves an unsupervised machine-learning algorithm, along the lines suggested by Fagiolo et al. (2019) and Barde (2016). These works represent contributions involving machine learning on the estimation side (van der Hoog 2019); however, such involvement is still lacking on the validation side. Our method could extend the existing validation methods in the direction of machine learning and encourage future contributions. The remainder of the text is structured as follows: we discuss the overall concept of our method in detail in the next section. We also discuss the kinds of AB models to which the method could be applied and provide example models from recent research in “An overview of AB models that might be validated with our method”.
The overall concept of the meso-level validation method
This section introduces a meso-level empirical validation method for AB models drawing on micro-data, first as a broad methodological choice and then in detail. In broad terms, we sharply distinguish the different phases and goals of the relationship between real (empirical) data and the model in the following way: the meso-level is used exclusively for validation, whereas the micro-level is used to feed micro-data into the agents as parameters (not as outcomes of their decision-making process, because this could lead to over-fitting), and the macro-level is used for calibration. Through this distinction, we radically eliminate any overlap between what is given to the model as input, what is used for calibrating its overall results and macro- and micro-parameters, and what is used for validation. More specifically, our method consists of sequential steps, for which we created the overall concept shown in Fig. 2. We explain each step in turn, following its sequence in the concept. The main goal of the concept is to compare model-generated data and real data at the meso-level to understand how well the mDGP can reproduce the behavioral patterns that occur in real data. The method produces a quantitative score on a spectrum, according to which we can assess validity.
The overall concept takes two data sets as input. The first data set contains information on the ex-ante characteristics of artificial individuals (i.e., agents) and their ex-post behavior, generated by the mDGP. The second data set contains information on the ex-ante characteristics of real individuals and their ex-post behavior, generated by the rwDGP. Both data sets contain information at the individual level, since the mDGP of AB models produces data at the individual level (i.e., the micro-level). Individuals are clustered according to their characteristics and behavior in the data sets, and these clusters are compared quantitatively at the meso-level. An essential point for the comparison is that the real data should be the same data used to initialize the model. In that case, individuals in the real data are mapped to artificial agents one-to-one; thus, the numbers of real and artificial individuals are equal, which is a prerequisite for applying the validation method. The data sets can differ in what, model-wise, is the ex-post behavior, because an artificial agent might behave differently than a real individual with the same characteristics. The variables constituting the ex-ante characteristics should ideally be the ones influencing the ex-post behavior. As a result, the clusters capture a combination of the variables describing individuals’ characteristics and their consequent behavior. Hence, by comparing clusters, we can study the behavioral patterns (e.g., the relationship between the characteristics and the behavior) in model-generated data and real data.
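For illustration, a minimal sketch in Python of how such a merged data set could be organized; the column names and values are hypothetical (drawn from the mobility example), not the data used in this paper. Each real individual has a one-to-one artificial counterpart that shares its ex-ante characteristics, while the ex-post behavior may differ.

```python
import pandas as pd

# Hypothetical merged data set (illustrative values only).
# Ex-ante characteristics are identical for a real individual and its
# artificial counterpart; only the ex-post behavior (mode_choice) may differ.
merged = pd.DataFrame({
    "income_level": ["low",  "low",        "high", "high"],        # ex-ante, categorical
    "age":          [34,     34,           57,     57],            # ex-ante, numerical
    "mode_choice":  ["bus",  "car",        "car",  "car"],         # ex-post behavior
    "source":       ["real", "artificial", "real", "artificial"],  # rwDGP vs. mDGP
})
```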
Instead of clustering the artificial and real data sets separately, we merge them as indicated in Fig. 2 and cluster them together to analyze the balance in the clusters (i.e., how many real and how many artificial individuals are in each cluster). Individuals in the merged data are placed in a multidimensional latent space based on their attributes (i.e., ex-ante characteristics and ex-post behavior). The latent space is represented by a symmetric distance matrix.Footnote 2 Several metrics exist to create that matrix, such as Euclidean, Manhattan, Gower, etc. (Bektas and Schumann 2019a). In the overall concept, we utilize the Gower distance metric, since it can handle different column typesFootnote 3 (e.g., categorical, numerical, ordinal, etc.) to place instances in the latent space (Gower 1971). For instance, the merged data might contain household attributes that are categorical, such as income level, or numerical, such as age. The Gower distance can determine the positions of individuals in the latent space based on these columns without any transformation, whereas other metrics, such as the Euclidean distance, accept only numerical ones (Bektas and Schumann 2019a).
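As a minimal sketch of this step, the following function computes a Gower dissimilarity matrix for the hypothetical merged data above, assuming only numerical and categorical columns, no missing values, and equal column weights; packaged implementations (e.g., R’s cluster::daisy or the Python gower package) cover more cases.

```python
import numpy as np
import pandas as pd

def gower_matrix(df: pd.DataFrame) -> np.ndarray:
    """Pairwise Gower dissimilarity: range-normalized absolute differences for
    numerical columns, simple matching for categorical columns, averaged over columns."""
    n = len(df)
    dist = np.zeros((n, n))
    for col in df.columns:
        x = df[col]
        if pd.api.types.is_numeric_dtype(x):
            value_range = x.max() - x.min()
            part = np.abs(x.to_numpy()[:, None] - x.to_numpy()[None, :])
            part = part / value_range if value_range > 0 else np.zeros((n, n))
        else:
            # Simple matching for categorical columns: 0 if equal, 1 otherwise.
            part = (x.to_numpy()[:, None] != x.to_numpy()[None, :]).astype(float)
        dist += part
    return dist / df.shape[1]

# Distances are computed on the attributes only, not on the real/artificial flag.
distance_matrix = gower_matrix(merged.drop(columns="source"))
```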
$$\begin{aligned} \text {Sil}(i) = \frac{b_i - a_i}{\max \{a_i,b_i\}}. \end{aligned}$$
(3)
As the clustering algorithm, we utilize the k-medoids algorithm, since it is compatible with the latent space created by the Gower distance metric (Bektas and Schumann 2019a). However, k-medoids is an unsupervised algorithm that requires the number of clusters k as an input; thus, we need to find the optimal number of clusters ex-ante. There are goodness-of-fit metrics in the literature [e.g., the Average Silhouette Width (ASW), the Calinski and Harabasz Index (CH), and the Pearson version of Hubert’s \(\Gamma \) (PH) (Campello and Hruschka 2006)] that provide quantitative scores for the quality of clusterings with different numbers of clusters. The ASW is one of the most widely used approaches; it measures how well an instance is matched with its own cluster (Maulik and Bandyopadhyay 2002; Bektas and Schumann 2019a). As a goodness-of-fit measure, it reflects how well intra-cluster homogeneity and inter-cluster dissimilarity are maximized (Rousseeuw 1987). The idea for pre-specifying the optimal number of clusters is to try different values of k in an interval and select the one with the highest ASW value as the optimal number of clusters. For each value of k, the ASW of the resulting clustering is calculated from Eq. (3), which gives the Silhouette value of instance i. The term \(a_i\) represents the average dissimilarity of i to all other objects in its own cluster (the smaller the value, the better the assignment). The term \(b_i\) is the smallest average dissimilarity of instance i to the objects of any other cluster (i.e., its dissimilarity to the closest cluster other than its own). Equation (3) returns values between \(-1\) and 1; values close to 1 indicate that instance i is assigned to the proper cluster. The average Silhouette value over all instances (the ASW) indicates the quality of the clustering (Rousseeuw 1987).
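A sketch of this selection step, assuming the precomputed Gower matrix from the sketch above and the KMedoids implementation of the third-party scikit-learn-extra package (any PAM-style implementation that accepts a precomputed dissimilarity matrix would do), could look as follows.

```python
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids  # assumed third-party implementation

def optimal_k(distance_matrix, k_values=range(2, 11), seed=0):
    """Return the number of clusters with the highest ASW, plus the ASW per k."""
    asw = {}
    for k in k_values:
        labels = KMedoids(n_clusters=k, metric="precomputed",
                          random_state=seed).fit_predict(distance_matrix)
        asw[k] = silhouette_score(distance_matrix, labels, metric="precomputed")
    best = max(asw, key=asw.get)
    return best, asw
```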
After the instances are placed in the latent space and the optimal number of clusters is found, the k-medoids algorithm (see Algorithm 1) partitions the instances into k (the optimal number of) clusters. To understand how well clusters from real and artificial data overlap, we compare the numbers of artificial and real individuals in the clusters according to the indicator (4). In the formulation of the indicator (4), \(R_k\) represents the number of real instances in cluster k, \(A_k\) the number of artificial instances in cluster k, and N the optimal number of clusters. The indicator measures the dissimilarity in the balance of artificial and real instances for each cluster and returns a normalized score between zero and one. The indicator uses the L1 norm (i.e., least absolute deviation), similarly to the Manhattan distance, since it gives equal importance to all clusters, which might have different dissimilarities (i.e., balance differences).Footnote 4 Besides, the L1 norm is preferable for high-dimensional data applications (Aggarwal et al. 2001):
$$\begin{aligned} \frac{\sum _{k=1}^{N}\frac{\mid R_k - A_k \mid }{R_k + A_k}}{N}. \end{aligned}$$
(4)
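A minimal sketch of the indicator (4) in Python, assuming the cluster labels produced by the k-medoids step and the real/artificial flag from the merged data (the variable names are hypothetical):

```python
import numpy as np

def balance_score(labels, is_real):
    """Indicator (4): average per-cluster imbalance between real and artificial
    instances; 0 means perfectly balanced clusters, 1 fully separated data sets."""
    labels = np.asarray(labels)
    is_real = np.asarray(is_real, dtype=bool)
    per_cluster = []
    for c in np.unique(labels):
        in_cluster = labels == c
        r = int(np.sum(is_real & in_cluster))   # R_k
        a = int(np.sum(~is_real & in_cluster))  # A_k
        per_cluster.append(abs(r - a) / (r + a))
    return float(np.mean(per_cluster))

# Hypothetical usage with the objects sketched above:
# k, _ = optimal_k(distance_matrix)
# labels = KMedoids(n_clusters=k, metric="precomputed").fit_predict(distance_matrix)
# score = balance_score(labels, merged["source"] == "real")
```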
If an artificial agent behaves in a way that is observationally equivalent to the real individual with whom it shares the same characteristics, the two are placed at the same position in the latent space; thus, they are expected to end up in the same cluster. If all artificial agents behave observationally equivalently to the real individuals with whom they share the same characteristics, each cluster is expected to contain fifty percent artificial and fifty percent real instances (as in the simple experiment in Online Appendix A). In this case, the outcome of the indicator (4) becomes zero, which indicates a perfect match. In other words, a zero score demonstrates that the behavioral patterns in the real data perfectly overlap with those in the artificial data. Conversely, if an artificial agent produces different ex-post behavior than its real counterpart, the two are placed at different positions in the latent space and are therefore expected to end up in different clusters. That leads to unbalanced clusters and, consequently, to a weak validation score according to the indicator in (4).
The overall concept is completed by determining where the obtained score lies in the distribution of all scores it could theoretically take, which allows us to interpret it. To determine a meaningful threshold, we obtain all possible scores and their frequencies in the exhaustive list of all possible cases, i.e., the state space. The state space contains all possible alternative ways in which a total can be distributed,Footnote 5 and Page (2012) demonstrates a Java algorithm to obtain them under a broad variety of restrictions. In the case at hand, we study the scores that the indicator (4) generates over all possible subdivisions of the total number of artificial agents and of the total number of real individuals into the clusters. Accordingly, we obtain the distribution of possible scores, which allows us to judge the specific score that a model achieves in the previous steps against all other possible scores.
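For small totals, this score distribution can also be obtained by brute-force enumeration. The following sketch (a toy alternative to the Java algorithm of Page (2012) referenced above, and feasible only for small numbers of instances and clusters) enumerates every allocation of real and artificial instances to the clusters and collects the resulting scores.

```python
import numpy as np

def compositions(total, parts):
    """All ordered ways to split `total` into `parts` non-negative integers."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def score_distribution(n_real, n_artificial, n_clusters):
    """Scores of indicator (4) over all allocations with no empty cluster."""
    scores = []
    for real_split in compositions(n_real, n_clusters):
        for art_split in compositions(n_artificial, n_clusters):
            if any(r + a == 0 for r, a in zip(real_split, art_split)):
                continue  # k-medoids does not produce empty clusters
            scores.append(np.mean([abs(r - a) / (r + a)
                                   for r, a in zip(real_split, art_split)]))
    return np.array(scores)

# Toy example: where would a model's score of 0.2 fall among all possible scores?
all_scores = score_distribution(n_real=6, n_artificial=6, n_clusters=3)
share_better = np.mean(all_scores < 0.2)  # fraction of possible scores below 0.2
```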
Overall, this procedure builds on the idea that a validated model should produce results that are “indistinguishable” from real data. Going beyond the inter-personal qualitative procedure proposed in Piana (2013), we deliver a method with a quantitative indicator of “goodness-of-validation,” taking values from zero to one. The method can be used for AB models that take micro-data as input and produce results accordingly. We discuss such models and provide examples in “Discussion”. The method offers these models two advantages: it avoids the over-fitting risk of micro-level validation and provides more detailed validation than the macro-level, as recommended by Fagiolo et al. (2019). In the next section, we apply the method to a specific model in the personal mobility domain, implemented on a specific simulation platform.