Abstract
This study proposes an efficient parameter tuning method for multiagent simulation (MAS) based on deep reinforcement learning. MAS is a useful tool in the social sciences, but realistic simulations are hard to achieve because parameter tuning is computationally expensive. To improve compatibility with the tuning task, our proposed method employs actor-critic-based deep reinforcement learning, such as deep deterministic policy gradient (DDPG) and soft actor-critic (SAC). In addition to customized versions of DDPG and SAC for our task, we propose three additional components to stabilize learning: an action converter (DDPG only), a redundant full neural network actor, and a seed fixer. For experimental verification, we employ a parameter tuning task in an artificial financial market simulation and compare our proposed model, its ablations, and a Bayesian-estimation-based baseline. The results demonstrate that our model outperforms the baseline in terms of tuning performance and indicate that the additional components of the proposed method are essential. Moreover, the critic of our model works effectively as a surrogate model, that is, as an approximate function of the simulation, which allows the actor to tune the parameters appropriately. We also found that the SAC-based method exhibits the best and fastest convergence, which we assume results from the high exploration capability of SAC.
1 Introduction
Multiagent simulation (MAS), which primarily simulates social phenomena by compiling agent behaviors, is widely used in social sciences. For instance, MAS has been utilized in analyses of the COVID-19 pandemic [1], financial markets [2, 3], demographic movements [4], and evacuation or massive pedestrian control planning [5, 6]. Edmonds et al. [7] argued that MAS can be used to validate wider possibilities in social sciences. Generally, dominant equations do not exist in social phenomena, and interactions between people are unknown but important; therefore, MAS enables us to understand and analyze metaphenomena generated by complex micro–macro interactions among agents (people).
Parameter tuning is an essential procedure in MAS. Generally, parameter tuning is used for two reasons. The first is to adjust the parameters of the simulation model such that the simulation results reflect real-world phenomena; this tuning is essential because agents’ and environmental parameters cannot be directly observed in the real world. The second reason is to determine the parameter values for optimizing social phenomena or resolving social problems; for instance, because massive pedestrian simulations aim to create a more efficient flow of people, various solutions have been proposed to determine a better flow.
However, parameter tuning is a challenging task because of the large dimensions of the parameters and the computational cost of the simulations. As MAS typically employs many agents in its simulations, the computational cost for each simulation is comparatively high. Moreover, the parameter space to be tuned is also high-dimensional or large, as has been indicated by [8, 9].
To the best of our knowledge, a fundamental solution to this massive computational burden has not yet been proposed. For instance, to reduce the computational burden, Yamashita et al. [9] replaced part of the MAS with neural networks (NNs) to obtain the best solution. Also, Bayesian optimization, such as the tree-structured Parzen estimator (TPE) [10], has frequently been used in the parameter tuning of NNs. However, these solutions failed to fully resolve the dimensional problem because the dimensions of the parameter exploration were not compressed.
Because MAS frequently exhibits chaotic behavior owing to complex agent interactions, the exploration of optimized parameters is difficult. MAS aims to reproduce chaotic phenomena that result from complex interactions. Therefore, it is often difficult to determine optimal parameters from global estimates.
To address these issues, this study attempts to utilize deep reinforcement learning. Deep reinforcement learning has recently been developed to handle high-dimensional problems. For instance, Baker et al. [11] showed that it can learn complex hide-and-seek strategies using tools that make the game more complex and high-dimensional. Therefore, we believe that the utilization of deep reinforcement learning for MAS parameter tuning is promising.
Thus, this study proposes a parameter-tuning method that uses reinforcement learning for MAS. As a first step toward its introduction, we focus only on low-dimensional parameter tuning. This is because low-dimensional tuning is also feasible with traditional parameter tuning, which enables us to compare against it and estimate the capability of deep reinforcement learning. In this study, we demonstrate the capability of deep reinforcement learning for MAS parameter tuning.
Consequently, we show that the proposed model is promising. The contributions of this study are as follows:

1. A reinforcement-learning-based parameter tuning model for MAS, which uses some additional components to stabilize learning, is proposed.

2. Our model outperforms the baseline model. Furthermore, the proposed additional components are successful and necessary for our proposed model.

3. In our proposed model, actor-critic type reinforcement learning is employed. The results demonstrate that the critic works as a surrogate model of the simulations, leading to the actor being able to learn a better parameter.

4. We propose SAC- and DDPG-based models and confirm that the SAC-based method exhibits the best and fastest convergence owing to its high exploration capability.
2 Related work
Edmonds et al. [7] argued that MAS is useful in social sciences, which have complex interactions. MAS aims to imitate the real world by creating an imaginary world using agents on computers. Simulations are beneficial because they enable the exploration of hypothetical situations or the prediction of phenomena under certain conditions, such as new regulations. As mentioned earlier, MAS has been utilized for the analysis of several domains, such as COVID-19 [1], financial markets [2, 3], demographic movement [4], and evacuation or massive pedestrian control planning [5, 6]. The importance of agent-based simulations has been debated, particularly in the context of financial markets [12, 13]. For instance, Lux et al. [14] showed that interactions between agents in financial market simulations are necessary to replicate stylized facts therein. Cui et al. [15] also showed that the trader model used in artificial financial market simulations required intelligence to replicate stylized facts in financial markets. Furthermore, Mizuta [16] demonstrated that the MAS of a financial market can contribute to the implementation of rules and regulations in actual financial markets. This study used artificial financial markets as an example of the application of the proposed method.
Several practical studies have used artificial financial market simulations. Torii et al. [17] revealed how a price shock was transferred to other stocks using an artificial financial market based on the trader model proposed in [18]. The model from that study is also used as the parameter-tuning target in this study. Mizuta et al. [2] tested the effect of tick size, that is, the price unit for orders, on trading shares, which led to a discussion on tick size devaluation in the Tokyo Stock Exchange Market. Hirano et al. [3] assessed the effect of the regulation of the capital adequacy ratio (CAR), such as the Basel regulatory framework, and observed the risk of market price shocks and depressions due to CAR regulation. Some studies also focused on flash crashes using artificial financial market simulations [19, 20]; as an example of one such platform, Torii et al. [21] proposed “Plham” [22]. In this study, we use the “PlhamJ” platform [23], the updated version of “Plham.” Meanwhile, other artificial financial market simulators also exist, such as U-Mart [24], the Santa Fe artificial stock market [25], and the agent-based interactive discrete event simulation [26].
However, as mentioned in the introduction, MAS, including financial market simulations, has a high computational burden; thus, several workarounds have been proposed. One solution is parallelization of simulations using simulation management software. Murase et al. [27] proposed organizing assistants for comprehensive and interactive simulations (OACIS). Murase et al. [28] subsequently proposed CARAVAN, a framework for comprehensive simulations of massive parallel machines, to optimize MAS parameters by parameter sampling. These frameworks are mainly aimed at automating the parameter-tuning process and do not address the high computational burden. Another method to reduce the computational burden is introducing NNs. Yamashita et al. [9] replaced part of the MAS with NNs to obtain the best solution and reduce the computational burden. The NN operates as an approximate function of MAS; this approach is known as a surrogate model. The potential of the surrogate MAS model was investigated by Angione et al. [29]. In this study, part of our model can be understood as a surrogate model of MAS.
This study employed reinforcement learning to reduce the computational burden of MAS parameter tuning. In the context of reinforcement learning, numerous research and development initiatives have been undertaken. Q-learning is a well-known off-policy reinforcement learning method based on the Q-table [30]. The underlying learning theory, that is, temporal-difference learning, was proposed in [31]. Tesauro proposed a method for applying temporal-difference learning to backgammon [32]. SARSA (state–action–reward–state–action) is another example of a simple reinforcement learning method, proposed in [33]. There are numerous models of reinforcement learning. The most significant development in reinforcement learning is its combination with NNs. Mnih et al. [34] demonstrated that deep Q-learning (deep Q-network, DQN) can outperform humans in the Atari learning environment [35]. These neural-network-based models are known as deep reinforcement learning and were also utilized in this study. These improvements were made possible by advances in NNs and deep learning, such as convolutional NNs for image classification [36]. Consequently, several reinforcement learning methods using deep learning have been developed. As an extension of DQN, double DQN (DDQN) was proposed in [37], which uses two networks and has better performance than DQN. These two networks were also used in the proposed method. Moreover, the dueling DDQN was proposed using a new state value function to improve learning performance [38]. As an extension of Q-learning, the asynchronous advantage actor-critic [39] method uses deep learning asynchronously. This is based on the actor-critic method proposed in [40], which is used in the deep reinforcement learning of our proposed model. Subsequently, because the asynchronous component turned out not to be essential, the advantage actor-critic (A2C) method was proposed [41]. Hessel et al. proposed Rainbow [42] as a method that combined the aforementioned methods.
Ape-X DQN [43], which used prioritized experience replay, and R2D2 [44], which used long short-term memory [45], were proposed as better-performing methods. Silver et al. proposed a reinforcement learning algorithm based on self-play, which achieved excellent performance in some games, such as chess, shogi, and Go [46]; this algorithm is based on their previous study, which is well known as “AlphaGo Zero” [47]. In this study, we used the deep deterministic policy gradient (DDPG) [48] and the soft actor-critic (SAC) [49, 50], which are actor-critic reinforcement learning methods for continuous actions.
3 Proposed methods
In the proposed method, we employed actor-critic-based deep reinforcement learning methods with continuous action spaces, such as DDPG [48] and SAC [49, 50]. We selected these for the following two reasons. First, the actor-critic framework fits the structure of parameter tuning. Figure 1 shows the typical parameter tuning scheme and our proposed scheme. In each tuning trial of the usual scheme, the parameter tuner provides a parameter set for the parameter-tuning task. Subsequently, the simulator runs using the parameter set and returns the results, and the analyzer (objective function) returns the final objective value. Finally, the parameter tuner receives this feedback and attempts to modify the parameter set to maximize the objective value. When this scheme is mapped onto the reinforcement learning scheme, the actor-critic framework exhibits a good fit, as illustrated in Figure 1. Second, typical parameter-tuning tasks and parameter sets frequently include continuous values; therefore, support for continuous values is required. The illustration in Figure 1 represents the basic outline of our methods; however, the details are not exactly the same as illustrated and are explained hereafter.
Using either DDPG or SAC with only minor customization for our task failed. Thus, we proposed three special components to enable their application to MAS parameter tuning. The three components are as follows:

1. Action converter (AC)

2. Redundant full neural network actor (FNNA)

3. Seed fixer (SF)
An AC is employed to enforce action bounds; in SAC, action bounds are realized by the squashing function for Gaussian sampling. Therefore, the AC is only used for DDPG. The patterns of applicable components are listed in Table 1 and all the models are listed in Table 2.
We first explain the base models except for the baseline, that is, DDPG and SAC with small customization, for MAS parameter tuning and thereafter describe the three components.
3.1 Customized DDPG
A customized DDPG functions similarly to the original DDPG; however, the input/output is different because of the task settings.
First, the actor in our model (A in Figure 2) does not accept any states. Typically, the actor accepts the current state to calculate the best policy based on the current state. However, the parameter-tuning task does not include any representation of the current state; moreover, the output of our actor is a parameter set, because the parameter set is similar to an action. Thus, our actor does not accept any state, and it only generates a policy for the parameter sets as follows: \(P_{i} = A(),\) where \(P_{i}\) denotes the i-th trial parameter set for the simulator. If the parameter is N-dimensional, \(P_i = \left( p_{1,i}, p_{2,i}, \cdots , p_{N,i}\right) \), and A denotes the actor.
Second, the critic also differs from the original DDPG. A typical critic of DDPG accepts the state and action, but our critic only accepts a parameter set (action), similar to our actor. Moreover, although the usual prediction target of the critic is the Q-value (the sum of the discounted future returns), the prediction target of the proposed critic is only an objective value of parameter tuning, as shown in Figure 2; this is because parameter tuning is a one-shot trial and not a Markov process. In typical reinforcement learning tasks, every action is supposed to have semi-permanent effects, and the value of each action must be evaluated based on the future expected returns. However, the selection of one parameter does not affect the future results of parameter-tuning tasks; thus, consideration of the discounted future expected returns is not required. The loss function for our critic is defined as follows:
where \(o_i\) denotes the objective value of the ith trial from the simulation results, C denotes our critic, \(P_i\) denotes the ith trial parameter set given by the actor, and \(C(P_i)\) denotes the critic output for the given parameter set \(P_i\).
Therefore, actor loss is defined as,
because our actor learns to maximize the objective value estimated by our critic. Thus, by minimizing \(L_{C}\) and \(L_{A}\) alternately, our critic pursues the correct evaluation for a given parameter set and our actor pursues a better parameter set to obtain higher objective values. These procedures are similar to those used for the original DDPG.
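For reference, one form of the two losses that is consistent with the definitions above can be written as follows. The squared-error form of the critic loss and the average over a mini-batch \(B\) of stored trials are assumptions made for illustration; the actor loss is the negated critic estimate for the actor's current output, so minimizing it maximizes the estimated objective value.

$$\begin{aligned} L_{C} &= \frac{1}{|B|}\sum _{i \in B}\left( o_i - C(P_i)\right) ^2, \\ L_{A} &= - C\left( A()\right) . \end{aligned}$$

In this form, the gradient of \(L_{A}\) reaches the actor only through the critic, which is why the critic needs to respond smoothly to changes in the parameter set.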
Similar to the original DDPG, we employed the Ornstein–Uhlenbeck process [51] as exploration noise, a replay buffer [52], and a soft target update.
Moreover, in the case of a simulation failure, a penalty was applied in place of the objective value because the objective value could not be calculated. However, if the parameters fall into a region where all simulations fail, only penalties are obtained; because the penalty is constant, it provides no gradient. Thus, if learning drives the parameters into such a region, a new parameter set is drawn at random.
3.2 Customized SAC
Based on the original SAC [49, 50], we developed a customized SAC. Similar to the customized DDPG, the most important customizations are the input and output. However, SAC has a slightly more complex architecture than DDPG.
SAC is similar to DDPG in terms of supporting continuous action spaces. However, unlike DDPG, SAC employs a stochastic policy and an entropy term of the policy in its objective function, which enables high exploration capability. Figure 3 shows the outline of the implementation of the customized SAC.
First, in our actor, the action is generated using the reparameterization trick as follows:
where A() denotes the SAC actor NN, whose output lies in \(\mathbb {R}^{2\times N}\); \(\mathcal {N}(\mu , \sigma )\) is a Gaussian distribution whose mean and variance are \(\mu \in \mathbb {R}^{N}\) and \(\sigma ^2 \in \mathbb {R}^{N}\), respectively; and \(P_i = \left( p_{1,i}, p_{2,i}, \cdots , p_{N,i}\right) \) are the N-dimensional simulation parameters for the i-th iteration.
In our SAC critic, unlike the customized DDPG, the model comprises two networks: \(Q(P_i)\) and V(). V() does not accept any inputs and the target value is theoretically defined as follows:
where \(\textrm{Pr}(P_i)\) denotes the probability that \(P_i\) is sampled from \(\mathcal {N}(\mu , \sigma )\) in (4), and \(\alpha \) is a hyperparameter which has been set to 0.2 in our experiments. In the usual reinforcement learning method, \(Q(P_i)\) and V() denote the action value and state value function, respectively. However, in our study, no such state existed. Thus, if we interpret them, \(Q(P_i)\) and V() are considered the parameter value and exploration value function, respectively, because the second term in V() denotes policy entropy. Moreover, although \(P_i\) appears on the right side of (5), V() does not depend on \(P_i\) because it evaluates the current state in the original theory of reinforcement learning, and no state is available in this task.
The loss functions of the critic are defined as follows:
In our experiments, we employed target networks. Therefore, \(Q(P_i)\) in (7) and (8) is the fixed network and no gradient is passed.
However, the loss function of the actor is obtained as follows:
Here, backpropagation is realized using the aforementioned reparameterization trick for some samples.
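Assuming the standard SAC formulation [49, 50] with the state removed, the quantities described in this subsection can be sketched as follows. The noise variable \(\epsilon \), the squared-error form of the critic losses, and the use of the objective value \(o_i\) as the regression target of \(Q\) are assumptions made for illustration.

$$\begin{aligned} (\mu , \sigma ) &= A(), \quad P_i = \mu + \sigma \odot \epsilon , \quad \epsilon \sim \mathcal {N}(0, I), \\ \hat{V} &= \mathbb {E}_{P_i}\left[ Q(P_i) - \alpha \ln \textrm{Pr}(P_i)\right] , \\ L_{Q} &= \left( o_i - Q(P_i)\right) ^2, \\ L_{V} &= \left( V() - \left( Q(P_i) - \alpha \ln \textrm{Pr}(P_i)\right) \right) ^2, \\ L_{A} &= \alpha \ln \textrm{Pr}(P_i) - Q(P_i). \end{aligned}$$

Under these assumptions, \(Q\) is held fixed in \(L_{V}\) and \(L_{A}\), and the gradient of \(L_{A}\) reaches the actor through \(P_i\) and \(\textrm{Pr}(P_i)\) via the reparameterization.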
In the original study on SAC [49, 50], squashing of the action space (enforcing action bounds in the original context) was also employed. In our experiments, the parameters had positive value restrictions. Therefore, we formulated a squashed Gaussian policy. In our model, we employed the function \(f(x) = \ln (1+\exp (x))\). Thus, the Gaussian policy is converted to,
where \(P_i = \ln (1+\exp (\textbf{u}))\), \(\textbf{u} \sim \mathcal {N}(\mu , \sigma )\) and \(\textrm{Pr}_\mathcal {N}(\textbf{u})\) denotes the sampling probability. This definition differs from that in (4) because of the use of action squashing.
In addition, in a customized SAC, the replay buffer [52], soft target update, and penalty for invalid simulations are employed.
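The squashed Gaussian policy can be illustrated with a minimal sketch. It assumes independent Gaussian components and applies the change of variables for \(f(x) = \ln (1+\exp (x))\), whose derivative is \(e^x/(1+e^x)\); the function names and the example values of \(\mu \) and \(\sigma \) are hypothetical and not taken from the experiments.

```python
import math
import random


def softplus(x):
    """Numerically stable ln(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_sigmoid(x):
    """ln(e^x / (1 + e^x)), i.e. the log of the softplus derivative."""
    return -softplus(-x)


def gaussian_log_pdf(u, mu, sigma):
    """Log density of a univariate Gaussian with mean mu and std sigma."""
    z = (u - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def sample_squashed(mu, sigma, rng):
    """Draw one positive parameter set and its log-probability."""
    params, log_prob = [], 0.0
    for m, s in zip(mu, sigma):
        eps = rng.gauss(0.0, 1.0)
        u = m + s * eps  # reparameterization: u depends smoothly on m and s
        params.append(softplus(u))  # squashing keeps the parameter positive
        # Change of variables: ln Pr(P) = ln Pr_N(u) - ln f'(u)
        log_prob += gaussian_log_pdf(u, m, s) - log_sigmoid(u)
    return params, log_prob


rng = random.Random(0)  # illustrative seed
params, log_prob = sample_squashed([0.5, -1.0], [0.3, 0.3], rng)
```

Because the correction term \(-\ln f'(u)\) is added per dimension, the log-probability remains a differentiable function of \(\mu \) and \(\sigma \), which is what the entropy term of SAC requires.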
3.3 Action converter (AC)
We propose an AC to introduce parameter restrictions into a customized DDPG. The AC converts the output of the actor into a restricted parameter space, like action squashing in the SAC algorithm. In our study, we mainly used this for non-negative constraints on parameters, using the function \(f(x) = \ln (1+\exp (x))\) as in SAC.
This is similar to the aforementioned enforcing of action bounds in SAC [49, 50]; however, only the aim (enforcing action bounds) is shared with our AC, because SAC is not a deterministic reinforcement learning method. Although SAC squashes its action probability, DDPG does not have an action probability but only employs exploration noise. Thus, DDPG requires action space squashing instead of probability squashing to bound the action space.
Thus, we assumed that a more relaxed squashing than that of SAC should be applied as an AC. Excessive changes should be avoided, particularly in areas that do not require constraints, to avoid disturbing DDPG exploration. Thus, for the non-negative constraint, we employed \(f(x) = \ln (1+\exp (x))\). This is because \(\frac{\partial f(x)}{\partial x} = \frac{e^x}{1+e^x} = 1 - \frac{1}{1+e^x}\), which is almost one when x is sufficiently large.
However, it is also possible to consider other functions. In addition, we neither discuss nor examine the AC for other parameter constraints.
In practice, the AC is applied to the constrained parameters in the actor output. For instance, suppose that the parameter space is two-dimensional, and all parameters have non-negative constraints. In this case, \(P_i = \left( p_{1,i}, p_{2,i}\right) \), and we define \(p^\star _{1,i} = \ln (1+\exp (p_{1,i}))\) and \(p^\star _{2,i} = \ln (1+\exp (p_{2,i}))\). Subsequently, the final parameter set \(P^{\star }_i = \left( p^{\star }_{1,i}, p^{\star }_{2,i}\right) \) is obtained and used instead of \(P_i\).
Note that AC is applied only to the customized DDPG; it is not necessary for SAC because it has enforcing action bounds.
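The two-dimensional case above can be sketched in a few lines. The raw actor outputs used here are hypothetical values chosen to show both regimes of the converter: a strongly negative input is mapped to a small positive value with a small slope, whereas a large positive input is left almost unchanged with a slope close to one.

```python
import math


def action_converter(x):
    """Map an unconstrained actor output to a positive value via ln(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def action_converter_slope(x):
    """Derivative e^x / (1 + e^x) of the converter, computed stably."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


raw_output = (-3.0, 8.0)  # hypothetical P_i produced by the actor
converted = tuple(action_converter(p) for p in raw_output)  # P*_i used by the simulator
slopes = tuple(action_converter_slope(p) for p in raw_output)
```

The slope is the factor by which the converter scales gradients flowing back to the actor, so a slope near one in the unconstrained region means the converter barely interferes with DDPG exploration there.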
3.4 Redundant full neural network actor (FNNA)
For a simple implementation of our models, that is, the customized DDPG and SAC, the actor is only required to return gradient-enabled parameters, as in the left panel of Figure 4. Thus, the minimum implementation is for the actor to have only N parameters when the dimension of the parameter set is N, because the actors in our models do not accept an input.
However, we propose an actor with a redundant NN called a redundant full NN actor (FNNA). As the right panel of Figure 4 shows, the FNNA architecture contains a multilayer perceptron (MLP), and the actor accepts a dummy vector all of whose components are one.
Although this architecture seems redundant and meaningless, it is related to the lottery ticket hypothesis [53]. We assume that FNNA worked better than the minimum implementation of our models because an NN behaves similarly to collective intelligence or collective learning, which stabilizes training in line with the lottery ticket hypothesis.
Moreover, when we employed SAC and FNNA, the FNN was also applied to the V-net (V()).
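The forward pass of an FNNA can be sketched without any deep learning framework. The class below is a hypothetical illustration: the hidden size, the weight initialization, and the class name are assumptions, and training (which would update the weights by backpropagation through the critic) is omitted.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def relu(x):
    return [max(v, 0.0) for v in x]


class FullNNActor:
    """MLP actor that is always fed the same all-ones dummy vector."""

    def __init__(self, n_params, hidden=16, n_hidden_layers=3, seed=0):
        rng = random.Random(seed)
        sizes = [hidden] * (n_hidden_layers + 1) + [n_params]
        self.layers = []
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
            scale = 1.0 / math.sqrt(fan_in)
            w = [[rng.uniform(-scale, scale) for _ in range(fan_in)] for _ in range(fan_out)]
            b = [rng.uniform(-scale, scale) for _ in range(fan_out)]
            self.layers.append((w, b))
        self.dummy = [1.0] * hidden  # dummy input: every component is one

    def __call__(self):
        x = self.dummy
        for w, b in self.layers[:-1]:
            x = relu(layer_norm(linear(x, w, b)))  # all layers but the last
        w, b = self.layers[-1]
        return linear(x, w, b)  # N-dimensional parameter set


actor = FullNNActor(n_params=2)
parameter_set = actor()
```

Since the input never changes, the output is a function of the weights alone; the network simply provides many more trainable degrees of freedom than the N free parameters of the minimum implementation.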
3.5 Seed fixer (SF)
MAS frequently exhibits various behaviors depending on random seeds. Random variables are used in many MAS routines to realize the probabilistic decisions of agents. However, because of the complex interactions between agents, slight state differences cause chaotically significant differences. Thus, differences in random variables could cause significant differences, as illustrated in the left panel of Figure 5.
The effects of random seeds are significant in learning. A critic was introduced to learn the relationship between the simulation parameter sets and simulation results. The actor was trained using the gradients obtained from the critic. The smoothness of the critic gradient, that is, the smooth response of the critic outputs to the input parameter, is necessary. Thus, the difference in the objective values caused by small changes in the parameter set must be greater than the effect of the random seeds.
Although a larger number of simulations generally eases the effect of random seeds, it is insufficient to erase their effects for critic gradients. Moreover, it is difficult to increase the number of simulations when the computational cost is large.
The SF provides one solution and fixes the seeds used in each set of simulations, as illustrated in the right panel of Figure 5. The objective values obtained through the simulations illustrated in Figure 1 are calculated by the mean of M trials for stabilization. Thus, the SF fixes M random seeds in the simulations; this implies that the set of random seeds for M simulations is always the same.
Introducing an SF removes the seed-induced variation between trials and smooths the gradient of the critic. Although the simulations can still exhibit chaotic behavior, its effect on learning is reduced.
However, this component has a tradeoff, in that the learning results are biased. As the random seeds are fixed to only M patterns, and M is generally excessively small to cover all possible states, the results are biased by these M patterns. Thus, there is a tradeoff between the gradient smoothness of the critic and the possibility of bias. Although employing a larger M could be a solution, it could also increase the computational burden. Fortunately, in our experiments, \(M=100\) did not appear to cause a significant bias.
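The idea of the SF can be sketched with a toy stand-in for the simulator. Everything in this example is hypothetical: `toy_simulation` is not the market simulation, and its noise is simply added to the objective, whereas in a real MAS the random variables interact with the parameters.

```python
import random
import statistics


def toy_simulation(param, seed):
    """Hypothetical stand-in for one MAS run; returns an objective value."""
    rng = random.Random(seed)
    noise = sum(rng.gauss(0.0, 1.0) for _ in range(10))
    return -(param - 1.0) ** 2 + 0.1 * noise


class SeedFixer:
    """Draws M seeds once and reuses the same set for every trial."""

    def __init__(self, m, master_seed=0):
        rng = random.Random(master_seed)
        self.seeds = [rng.randrange(2**31) for _ in range(m)]

    def evaluate(self, param):
        """Mean objective over the M fixed-seed simulations."""
        return statistics.fmean(toy_simulation(param, s) for s in self.seeds)


fixer = SeedFixer(m=100)
first = fixer.evaluate(0.5)
second = fixer.evaluate(0.5)          # identical to `first`: same seeds, same runs
nearby = fixer.evaluate(0.5 + 1e-3)   # differs only through the parameter change
```

With the seeds fixed, the difference between `nearby` and `first` reflects the parameter change alone, which is the smooth signal the critic needs; the price is that all evaluations share the same M noise realizations, which is the source of the bias discussed above.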
4 Experiments
4.1 Task setting
We employed an artificial financial market simulation as a MAS for the experiments. Recently, particularly after the 2007–2008 financial crisis, financial market simulations have gained focus. For instance, Bookstaber [54] argued that MAS for financial markets was important for avoiding future crashes caused by the complex interaction of financial market factors. Mizuta [16] argued that MAS could contribute to the discussion of regulations in financial markets.
An artificial financial market simulation seems an ideal task for testing our method. This is because statistical indicators used in actual financial markets can serve as tuning targets, which enables us to set a reasonable tuning target.
Our experiments employ the stylized financial market simulation proposed in [17], which is widely used as an artificial financial market simulation. This simulation is available via PlhamJ, a platform for large-scale, high-frequency artificial financial markets (Java version) [23].
Moreover, our proposed method can be applied to other simulations as long as Key Performance Indicators (KPIs) for the output are defined. Therefore, although this study only tests our method on an artificial financial market simulation, it is also applicable to other MAS tasks and simulation models.
4.1.1 Simulation
The simulation contains only 100 stylized trader agents based on [17] and one continuous double-auction market. At time t, the stylized trader agent i determines its trading actions using the following criteria: fundamentals, chartists (trends), and noise. Initially, the agents calculate the following three factors:

Fundamental factor:
$$\begin{aligned} F_t^i = \frac{1}{\tau ^{*i}} \ln {\left\{ \frac{p_t^*}{p_t}\right\} }, \end{aligned}$$(11)
where \(\tau ^{*i}\) denotes the mean-reversion time constant of agent i, \(p_t^*\) denotes the fundamental price at time t, and \(p_t\) denotes the price at time t.

Chartist factor:
$$\begin{aligned} C_t^i = \frac{1}{\tau ^i}\sum _{j=1}^{\tau ^i} r_{(t-j)} = \frac{1}{\tau ^i}\sum _{j=1}^{\tau ^i}\ln {\frac{p_{(t-j)}}{p_{(t-j-1)}}}, \end{aligned}$$(12)
where \(\tau ^i\) denotes the time window size of agent i and \(r_t\) denotes the logarithmic return at time t.

Noise factor:
$$\begin{aligned} N_t^i \sim \mathcal {N} (0, \sigma ), \end{aligned}$$(13)
which denotes that \(N_t^i\) obeys a normal distribution with zero mean and variance \(\sigma ^2\).
Subsequently, the agents calculate the weighted averages of these three factors as follows:
where \(w_F^i, w_C^i\), and \(w_N^i\) denote the weights of agent i for each factor.
Next, the expected price of agent i is calculated using the following equation:
Subsequently, using a fixed margin of \(k^i \in [0,0.1]\), we determine the actual order prices using the following rules:

If \(\widehat{p_t^i} > p_t\), agent i places a bid (buy order) at the following price:
$$\begin{aligned} \min {\left\{ \widehat{p_t^i} (1-k^i), p_{t}^{\textrm{ask}}\right\} }. \end{aligned}$$(16)
If \(\widehat{p_t^i} < p_t\), agent i places an ask (sell order) at the following price:
$$\begin{aligned} \max {\left\{ \widehat{p_t^i} (1+k^i), p_{t}^{\textrm{bid}}\right\} }. \end{aligned}$$(17)
Here, \(p_{t}^{\textrm{bid}}\) and \(p_{t}^{\textrm{ask}}\) denote the best bid and ask prices, respectively.
The parameters employed for this type of trader are as follows: \(p_t^*=300, w_N^i \sim Ex(1.0), \sigma =0.001, \tau ^* \in [50,100]\), and \(\tau \in [100,200]\), which were mainly determined based on [17]. Here, \(Ex(\lambda )\) indicates an exponential distribution with an expected value of \(\lambda \). The values of \(w_F, w_C\) in \(w_F^i \sim Ex(w_F) \) and \( w_C^i \sim Ex(w_C)\) were the tuning targets in this experiment.
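The decision rule of one stylized trader can be sketched as follows. The fundamental factor, the chartist factor, and the order-price rules follow the definitions above. The normalized weighted average in `expected_return` and the exponential update in `expected_price` are assumptions based on the common formulation of this kind of trader model, and all numerical values in the example are hypothetical.

```python
import math


def fundamental_factor(price, fundamental_price, tau_star):
    """F = ln(p* / p) / tau*."""
    return math.log(fundamental_price / price) / tau_star


def chartist_factor(past_prices, tau):
    """Mean of the last tau log returns; past_prices ends with p_{t-1}."""
    if len(past_prices) < tau + 1:
        raise ValueError("need at least tau + 1 past prices")
    window = past_prices[-(tau + 1):]
    returns = [math.log(b / a) for a, b in zip(window[:-1], window[1:])]
    return sum(returns) / tau


def expected_return(f, c, n, w_f, w_c, w_n):
    """Assumption: weighted average normalized by the sum of the weights."""
    return (w_f * f + w_c * c + w_n * n) / (w_f + w_c + w_n)


def expected_price(price, exp_return, tau):
    """Assumption: p_hat = p_t * exp(r_hat * tau)."""
    return price * math.exp(exp_return * tau)


def order(expected, price, k, best_bid, best_ask):
    """Return (side, order price), or None when expected equals the price."""
    if expected > price:
        return "buy", min(expected * (1.0 - k), best_ask)
    if expected < price:
        return "sell", max(expected * (1.0 + k), best_bid)
    return None


# Hypothetical market state with a short window for readability.
history = [288.0, 289.0, 290.0]
f = fundamental_factor(price=290.0, fundamental_price=300.0, tau_star=50)
c = chartist_factor(history, tau=2)
r_hat = expected_return(f, c, n=0.0, w_f=1.0, w_c=1.0, w_n=1.0)
p_hat = expected_price(290.0, r_hat, tau=2)
decision = order(p_hat, 290.0, k=0.01, best_bid=289.5, best_ask=290.5)
```

In this example the price is below the fundamental price and has been rising, so both factors are positive and the agent places a buy order below its expected price.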
4.1.2 Objective value
The objective values are the skewness and kurtosis of the log-returns of the market prices. The values are calculated as follows:
where T denotes the total number of simulation steps, which was 10,000 in this study.
According to the results in [55], we set the targets of skewness and kurtosis to 0.0 and 6.0, respectively; these values are approximate because they differ across financial markets. However, this was not important because we focused only on the tuning efficiency of the proposed method in this study.
Subsequently, the total mean square error (MSE) of both the skewness and kurtosis of the targets was calculated. Thus,
If the simulations fail, the MSE is replaced by the penalty, which was set to 1,000 for this study. Simulation failures could occur because the price increased to infinity owing to inappropriate parameters.
Then, the final objective values were calculated as the negative MSE because the objective values are maximized.
In the tuning phase, all simulations were performed 100 times and the mean of the objective values was used as the final objective value to stabilize these objective values. Moreover, in the final evaluation phase after tuning, we performed 1,000 simulations to evaluate the performance of the tuners.
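The objective calculation can be sketched as follows. The targets and the penalty are the values stated above; the moment-based (population, non-excess) definitions of skewness and kurtosis, the per-run squared error that is then averaged over runs, and the treatment of a constant price series as a failure are assumptions made for this illustration.

```python
import math

PENALTY = 1000.0
TARGET_SKEWNESS = 0.0
TARGET_KURTOSIS = 6.0


def log_returns(prices):
    return [math.log(b / a) for a, b in zip(prices[:-1], prices[1:])]


def central_moment(values, mean, order):
    return sum((v - mean) ** order for v in values) / len(values)


def objective_for_run(prices):
    """Negative squared error of one run, or the negative penalty on failure."""
    if prices is None or not all(math.isfinite(p) and p > 0 for p in prices):
        return -PENALTY  # failed simulation, e.g. the price diverged
    returns = log_returns(prices)
    mean = sum(returns) / len(returns)
    m2 = central_moment(returns, mean, 2)
    if m2 == 0.0:
        return -PENALTY  # assumption: a constant price series counts as a failure
    skewness = central_moment(returns, mean, 3) / m2 ** 1.5
    kurtosis = central_moment(returns, mean, 4) / m2 ** 2
    error = (skewness - TARGET_SKEWNESS) ** 2 + (kurtosis - TARGET_KURTOSIS) ** 2
    return -error  # negated because the tuner maximizes the objective


def objective(runs):
    """Mean objective over repeated simulations of one parameter set."""
    return sum(objective_for_run(p) for p in runs) / len(runs)


# Hypothetical price path whose log returns alternate between +0.01 and -0.01.
up = 100.0 * math.exp(0.01)
alternating = [100.0, up, 100.0, up, 100.0]
value = objective([alternating, None])
```

For the alternating path the skewness is 0 and the kurtosis is 1, so the squared error is 25; averaging with one failed run gives an objective of about -512.5, which shows how strongly the penalty dominates the mean.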
4.2 Models
As mentioned in the task settings, our parameter-tuning task aims to obtain a better \(w_F\) and \(w_C\) to provide a better objective value. As the baseline model representing the typical parameter-tuning scheme shown in Figure 1, we employed a TPE (Optuna). In the following, we explain this baseline as well as the detailed settings of our proposed models.
4.2.1 Baseline model: tree-structured Parzen estimator (Optuna)
We employed a TPE [10] as a baseline model for MAS parameter tuning. This estimator is frequently used for parameter tuning in machine learning and is implemented in Optuna [56]. TPE is a Bayesian optimization method that searches for the parameter set with the highest objective value by treating the task as a black-box optimization problem.
4.2.2 Our customized DDPG
All schemes and learning theories are explained in Section 3. Here, we describe the settings in detail, including the model parameters in the customized DDPG used in our experiments.

Actor (FNNA): fourlayered MLP with 100dimension hidden layers returning twodimension output. Each layer, except for the last layer, uses layer normalization and ReLU activations. This implies that a 100dimensional dummy vector filled with ones is accepted as the input.

Actor (not FNNA): always returns only two gradientenabled parameters.

Critic: fourlayered MLP with 100dimension hidden layers accepting twodimensional input (parameter set) and retuning onedimensional output (estimated objective value). Each layer, except for the last layer, uses layer normalization and ReLU activations.

Soft target update ratio of DDPG: 0.1

Actor learning rate: \(10^{-4}\)

Critic learning rate: \(10^{-3}\)

Batch size: 100

Buffer size of experience replay: 1,000
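As a rough illustration of how three of the settings above interact (soft target update ratio, batch size, and replay buffer size), the following standard-library sketch may help. It is not the authors' implementation: storing experiences as (parameter set, objective value) pairs and representing network weights as plain lists of floats are simplifying assumptions.

```python
import random
from collections import deque

TAU = 0.1           # soft target update ratio (from the list above)
BATCH_SIZE = 100    # from the list above
BUFFER_SIZE = 1000  # from the list above

# Replay buffer; once full, the oldest experiences are discarded automatically.
buffer = deque(maxlen=BUFFER_SIZE)

def store(params, value):
    """Store one (parameter set, objective value) experience (assumed format)."""
    buffer.append((params, value))

def sample_batch(rng):
    """Sample a batch uniformly without replacement (smaller if the buffer is short)."""
    return rng.sample(list(buffer), min(BATCH_SIZE, len(buffer)))

def soft_update(target_weights, online_weights, tau=TAU):
    """Move each target weight a fraction tau toward the online weight."""
    return [(1.0 - tau) * t + tau * w for t, w in zip(target_weights, online_weights)]

rng = random.Random(0)
for _ in range(1500):  # more experiences than the buffer can hold
    store((rng.random(), rng.random()), -rng.random())
batch = sample_batch(rng)
print(len(buffer), len(batch))  # 1000 100
print(soft_update([0.0, 1.0], [1.0, 1.0]))
```

With a ratio of 0.1, the target weights track the online weights with a lag of roughly ten updates, which is a comparatively fast-moving target.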
4.2.3 Our customized SAC
All schemes and learning theories are explained in Section 3. Here, we explain the settings in detail, including the model parameters in the customized SAC used in our experiments.

Actor (FNNA): three-layered common MLP with 100-dimensional hidden layers and final linear layers returning two-dimensional outputs for \(\mu \) and \(\sigma \), respectively. All linear layers use layer normalization, and each layer in the three-layered common MLP uses ReLU activation. The input is a 100-dimensional dummy vector filled with ones.

Actor (not FNNA): always returns two two-dimensional gradient-enabled parameters for \(\mu \) and \(\sigma \), respectively.

Q-net (\(V(P_i)\)): four-layered MLP with 100-dimensional hidden layers accepting a two-dimensional input (parameter set) and returning a one-dimensional output (estimated objective value). Each layer, except for the last layer, uses layer normalization and ReLU activation.

V-net (V() with FNNA): four-layered MLP with 100-dimensional hidden layers accepting a 100-dimensional dummy vector filled with ones and returning a one-dimensional output (V()). Each layer, except for the last layer, uses layer normalization and ReLU activation.

V-net (V() without FNNA): always returns a one-dimensional gradient-enabled parameter for V().

Soft target update ratio of SAC: 0.1

Actor learning rate: \(10^{-4}\)

Critic learning rate: \(10^{-3}\)

Batch size: 100

Buffer size of experience replay: 1,000
4.3 Evaluation
For a fair evaluation, the number of simulations available for training was limited to 100,000 in total. Because each epoch of training performed 100 simulations, 1,000 epochs were performed.
After the training phase, the final evaluation test was performed with 1,000 simulations using the best-performing model during training (based on objective values for training simulations).
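The budget arithmetic and the model-selection rule can be written down in a few lines. The helper name `select_best` and the example objective values are hypothetical and only illustrate the rule of picking the epoch with the highest training objective.

```python
TOTAL_SIMULATIONS = 100_000  # training budget (from the text)
SIMS_PER_EPOCH = 100         # simulations per epoch (from the text)
EPOCHS = TOTAL_SIMULATIONS // SIMS_PER_EPOCH

def select_best(training_objectives):
    """Return (epoch index, objective) of the epoch with the highest training objective."""
    best_epoch = max(range(len(training_objectives)),
                     key=training_objectives.__getitem__)
    return best_epoch, training_objectives[best_epoch]

print(EPOCHS)                            # 1000
print(select_best([-5.0, -1.0, -3.0]))   # (1, -1.0)
```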
5 Results
Table 3 summarizes the results of each of the ten experiments. This table lists the loss (negative objective value), kurtosis, and skewness of the tuning results of all models, as well as the mean (and standard deviation) and median values.
The loss is defined by (21), and a smaller loss indicates better parameter tuning. As mentioned earlier, the loss is the negative of the objective value because the actor-critic framework generally maximizes the objective value. In our settings, the MSE was employed for this loss; thus, the loss was always non-negative.
According to the results in Table 3, the SAC + FNNA + SF setup exhibits the best performance in terms of loss. Moreover, SAC + FNNA + SF shows not only the best mean loss but also the best standard deviation and median of the loss compared with those of the baseline and other models.
The SAC-based model achieved the best results, but the DDPG-based model also outperformed the baseline. In addition, SAC-based models appear to be more stable than DDPG-based models.
Comparing the full DDPG-based model with the DDPG-based models lacking individual components confirms that all three components (AC, FNNA, and SF) are essential for our proposed DDPG-based model: if even one of the components is missing, performance degrades significantly. In particular, the AC is necessary for tuning because learning does not work at all if the AC is missing. When the AC is missing, the non-negativity constraints on the parameters in our task are not always satisfied, which causes invalid simulations. An invalid simulation always returns a fixed penalty, so the gradient for parameter tuning vanishes. Thus, at least in our task, the AC is a necessary feature for DDPG-based models to run the tuner in practice.
Comparing the effects of each component of the DDPG-based models, the order of effect size was AC > FNNA > SF. The effect of the SF seems comparatively small, but if it is missing, our DDPG-based models do not outperform the baseline.
In contrast, for SAC-based models, only the SF is a necessary component. FNNA improves performance only when the SAC-based model also employs the SF. Unlike for DDPG-based models, the AC is not applicable to SAC-based models. Thus, the effects of the additional components differ from those in DDPG-based models.
Figures 6–14 show the evaluation losses of all models for each epoch. These loss values were calculated using an additional 1,000 simulations used only for evaluation. Figure 6 shows the results for the baseline (TPE: Optuna). In this figure, a type of overfitting to the training samples can be observed. In contrast, in Figure 7, the DDPG-based model (DDPG + AC + FNNA + SF) exhibits slower convergence. Unlike these two models, according to Figure 11, the SAC-based model (SAC + FNNA + SF) exhibits faster and more stable convergence in its loss. In addition, from these figures, we can assume that the SAC-based model outperforms the other models. Moreover, as shown in Figures 11–14, we can observe that the SF leads to faster convergence.
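The vanishing gradient caused by the fixed penalty can be illustrated with a toy objective. Everything in this sketch is hypothetical: `toy_objective` merely stands in for the simulation, and softplus is used as an assumed placeholder for a converter that keeps the weights non-negative (the actual AC is defined in Section 3).

```python
import math

PENALTY = 1000.0  # fixed penalty returned by an invalid simulation

def toy_objective(w_f, w_c):
    """Hypothetical stand-in for the simulation objective (negative loss).

    Negative weights make the run invalid, which returns the fixed penalty.
    """
    if w_f < 0.0 or w_c < 0.0:
        return -PENALTY
    return -((w_f - 1.0) ** 2 + (w_c - 2.0) ** 2)

def finite_diff_grad(f, x, y, eps=1e-4):
    """Central finite-difference gradient of f at (x, y)."""
    gx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    gy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return gx, gy

def softplus(z):
    """Smooth map from any real number to a positive one (assumed converter)."""
    return math.log1p(math.exp(z))

def converted_objective(a, b):
    """Objective evaluated on converted (always valid) weights."""
    return toy_objective(softplus(a), softplus(b))

# Raw output in the invalid region: the objective is flat, so the gradient is zero.
print(finite_diff_grad(toy_objective, -0.5, -0.5))  # (0.0, 0.0)

# With the converter, the same raw output yields valid weights and a usable slope.
print(finite_diff_grad(converted_objective, -0.5, -0.5))
```

In the flat penalty region the tuner receives no directional information, whereas the converted objective points toward the optimum from any raw output.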
6 Discussion
Our proposed model exhibited better tuning performance than the baseline model. In our evaluation, the SAC-based models exhibited the best performance, particularly with the SF. As mentioned earlier, our task setting was low-dimensional; therefore, the baseline method has several advantages. Although we assume that our proposed method shows its advantages mainly in higher-dimensional parameter tuning, the results demonstrated that it works well even in this setting and seems promising.
Moreover, our three proposed components for DDPG and one component (SF) for SAC were found to be essential. If any one of the three components was missing, the DDPG-based models performed worse than the baseline. Likewise, if the SF was missing, the SAC-based models performed worse than the baseline. Faster convergence caused by the SF was also clearly observed, as shown by comparing Figures 11 and 13 and Figures 12 and 14. According to these results, the SF is the most important factor for parameter tuning in MAS. In proposing the SF, we assumed that differences in the random variables could cause significant differences in outcomes because of the chaotic behavior of MAS as a complex system. Our results demonstrated that this assumption was valid.
According to the results, SAC-based models outperformed DDPG-based models. The first possible reason is that SAC is not deterministic. Because of the deterministic policy of DDPG, DDPG-based models only achieved parameter tuning by continuous exploration. The action space, that is, the parameter space in this task, is extremely broad; thus, it is difficult to find a sweet spot through continuous exploration. However, SAC enables stochastic exploration, which causes a difference in results. The second possible reason is the SAC policy entropy. Because of the policy entropy term (the second term in (5)), SAC has a rich and wide exploration capability. Thus, compared to DDPG, SAC has a higher exploration capability, which led to better results in our task.
The fact that our actor-critic-based methods could tune the parameters of the simulation implies that the critic works as an approximate function of the simulations and analysis in our model. As explained and illustrated in Figure 1, our actor did not receive feedback directly from the simulation output. During the actor learning procedure, the actor received feedback on the evaluation score of the critic for the output parameter of the actor. This implies that the critic works sufficiently as a simulation approximation, and that the simulations are unnecessary for actor learning. This further implies that the critic successfully works as an end-to-end surrogate model. Owing to the approximation by the critic, the actor successfully obtained a gradient for parameter tuning.
Our experiments should also be extended for a more accurate evaluation. Our experimental task simply tuned kurtosis and skewness, which is far from practical use. Therefore, the task settings should be updated for more practical tuning.
Because our models showed potential, we should test them on high-dimensional parameter-tuning tasks. In the context of deep reinforcement learning, our proposed method can be expected to perform well on high-dimensional tasks because deep reinforcement learning has performed well on such tasks in previous studies. By contrast, in the context of surrogate models, when the parameter dimension is large, the approximation of simulations by the critic may become more complex and require more data to learn. Moreover, for higher-dimensional tasks, the evaluation criteria (objective function) should also be enriched. For tuning a small number of parameters, a low-dimensional KPI is sufficient. However, for higher-dimensional tasks, a low-dimensional KPI, such as the kurtosis and skewness used in our experiment, is not sufficient to define a tuning task without multiple local optima. Therefore, testing higher-dimensional parameter-tuning tasks also requires building better evaluation criteria, which is not easy. Thus, it remains unclear whether our reinforcement-learning-based models work well for high-dimensional tasks, and this needs to be addressed in future studies.
Finally, applicability to other tasks is also possible future work. The only requirement of our proposed method is a KPI on the simulation outputs. Although building a KPI for tuning is difficult, as discussed above, if a KPI exists, our method seems applicable to any simulation parameter-tuning task. However, if the KPI is inappropriate, the tuning will fail. Therefore, for simulations in other fields as well, both the construction of KPIs and experiments with our method are necessary.
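The role of seed-induced variation can be illustrated with a toy comparison of two candidate parameters. This sketch assumes that fixing seeds amounts to evaluating every candidate on the same set of seeds (the SF itself is defined in Section 3), and it uses a hypothetical simulation with simple additive noise, which is far milder than the chaotic behavior of a real MAS.

```python
import random

def noisy_simulation(w, seed):
    """Hypothetical stochastic simulation: a loss depending on w and on the seed."""
    rng = random.Random(seed)
    return (w - 1.0) ** 2 + rng.gauss(0.0, 0.5)

def mean_loss(w, seeds):
    """Average loss of parameter w over the given seeds."""
    return sum(noisy_simulation(w, s) for s in seeds) / len(seeds)

fixed_seeds = list(range(100))

# Same seeds for both candidates: the noise terms are identical and cancel out.
diff_fixed = mean_loss(1.1, fixed_seeds) - mean_loss(1.0, fixed_seeds)

# Different seeds per candidate: the comparison is contaminated by sampling noise.
diff_free = mean_loss(1.1, range(100, 200)) - mean_loss(1.0, range(200, 300))

print(diff_fixed)  # close to 0.01, the true difference in loss
print(diff_free)   # the true difference plus noise of comparable or larger size
```

With shared seeds, small differences between nearby parameter sets remain visible, which is the kind of signal a gradient-based tuner needs.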
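The difference between a deterministic output and a stochastic Gaussian policy can be sketched as follows. The diagonal Gaussian form and the function names are assumptions for illustration, and details of the customized SAC (for example, any squashing of the sampled values) are omitted. The entropy expression is the standard differential entropy of a diagonal Gaussian.

```python
import math
import random

def sample_action(mu, sigma, rng):
    """Draw one parameter set from a diagonal Gaussian policy."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def gaussian_entropy(sigma):
    """Differential entropy of a diagonal Gaussian: sum of 0.5 * ln(2*pi*e*sigma_i^2)."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in sigma)

rng = random.Random(0)
mu = [1.0, 2.0]
narrow, wide = [0.1, 0.1], [1.0, 1.0]

print(sample_action(mu, [0.0, 0.0], rng))  # [1.0, 2.0]: zero sigma is deterministic
print(sample_action(mu, wide, rng))        # a random point scattered around mu
print(gaussian_entropy(narrow) < gaussian_entropy(wide))  # True
```

Rewarding entropy therefore discourages the policy from collapsing to a narrow distribution too early, which keeps the search over the parameter space wide.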
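The idea that the actor learns only through the critic can be shown with a one-dimensional toy example. This is a simplified sketch under stated assumptions, not the proposed model: `simulate` is a hypothetical noise-free objective, and a quadratic least-squares fit stands in for the MLP critic.

```python
def simulate(p):
    """Hypothetical expensive simulation; the actor never calls this function."""
    return -(p - 2.0) ** 2

def fit_quadratic(xs, ys):
    """Least-squares fit of y ~ c0 + c1*x + c2*x^2 via the normal equations."""
    # Augmented 3x4 system (A^T A | A^T y) for the polynomial features.
    rows = [[sum(x ** (i + j) for x in xs) for j in range(3)]
            + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(3):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][3] / rows[i][i] for i in range(3)]

# 1) Critic learning: fit the surrogate to (parameter, objective) pairs.
samples = [0.5 * k for k in range(9)]  # parameters tried so far
c0, c1, c2 = fit_quadratic(samples, [simulate(p) for p in samples])

# 2) Actor learning: ascend the critic's gradient; no simulation calls here.
p = 0.5
for _ in range(100):
    p += 0.1 * (c1 + 2.0 * c2 * p)  # derivative of the critic with respect to p
print(round(p, 3))  # 2.0
```

The simulation is queried only to produce training pairs for the critic, while the parameter update uses the critic's derivative alone, mirroring the feedback path described above.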
7 Conclusion
This study proposed a method for tuning the simulation parameters for MAS using a customized reinforcement learning method. In our proposed method, actor-critic-type reinforcement learning methods such as DDPG and SAC were modified for MAS parameter tuning. Moreover, we proposed three additional components: AC, FNNA, and SF. For the experiments, we employed an artificial financial market simulation for the tuning task. The objective function in tuning is the negative MSE between the target and simulations such that the skewness and kurtosis are close to realistic values. In our experiments, we compared our proposed method with a baseline known as TPE (Optuna), which is based on Bayesian estimation. The results demonstrated that the proposed method outperformed the baseline method. In particular, our SAC-based models outperformed other models, including the baseline. These results indicate that the proposed method is promising. Moreover, it was also indicated that AC, FNNA, and SF for DDPG-based models and SF for SAC-based models were essential components. Interestingly, the results demonstrated that the critic of our proposed model worked well as a surrogate model for the simulations. Consequently, owing to the critic, the actor could be assumed to learn better parameters. In future work, we plan to address the learning stability and the evaluation of other tasks, such as high-dimensional parameter-tuning tasks, in which the method based on reinforcement learning can fully demonstrate its advantages.
Availability of data and materials
No data is used in this study.
References
Kurahashi, S.: Estimating Effectiveness of Preventing Measures for 2019 Novel Coronavirus Diseases (COVID-19). Proceeding of 2020 9th Int. Congress Adv. Appl. Inf. 487–492 (2020). https://doi.org/10.1109/IIAIAAI50415.2020.00103
Mizuta, T., Kosugi, S., Kusumoto, T., Matsumoto, W., Izumi, K., Yagi, I., Yoshimura, S.: Effects of Price Regulations and Dark Pools on Financial Market Stability: An Investigation by Multi-agent Simulations. Intell. Syst. Account. Finance Manag. 23(1–2), 97–120 (2016). https://doi.org/10.1002/isaf.1374
Hirano, M., Izumi, K., Shimada, T., Matsushima, H., Sakaji, H.: Impact Analysis of Financial Regulation on Multi-Asset Markets Using Artificial Market Simulations. J. Risk Financial Manag. 13(4), 75 (2020). https://doi.org/10.3390/jrfm13040075
Sajjad, M., Singh, K., Paik, E., Ahn, C.W.: A data-driven approach for agent-based modeling: Simulating the dynamics of family formation. J. Art. Soc. Soc. Simul. 19(1), 9 (2016). https://doi.org/10.18564/jasss.2988
Nonaka, Y., Onishi, M., Yamashita, T., Okada, T., Shimada, A., Taniguchi, R.I.: Walking velocity model for accurate and massive pedestrian simulator. IEEJ Trans. Electron. Inf. Syst. 133(9), 1779–1786 (2013). https://doi.org/10.1541/ieejeiss.133.1779
Shigenaka, S., Onishi, M., Yamashita, T., Noda, I.: Estimation of Large-Scale Pedestrian Movement Using Data Assimilation. IEICE Trans. Inf. Syst. D. J. 101(9), 1286–1294 (2018). https://doi.org/10.14923/transinfj.2017SAP0014
Moss, S., Edmonds, B.: Towards Good Social Science. J. Art. Soc. Social Simul. 8(4), 13 (2005). http://jasss.soc.surrey.ac.uk/8/4/13.html
Matsushima, H., Uchitane, T., Tsuji, J., Yamashita, T., Ito, N., Noda, I.: Applying Design of Experiment based Significant Parameter Search and Reducing Number of Experiment to Analysis of Evacuation Simulation. Trans. Japanese Society Art. Intell. 31(6), 1–9 (2016). https://doi.org/10.1527/TJSAI.AGE
Yamashita, Y., Shigenaka, S., Oba, D., Onishi, M.: Estimation of Large-scale Multi Agent Simulation Results Using Neural Networks [in Japanese]. In: 39th Japanese Special Interest Group on Society and Artificial Intelligence (SIGSAI), p. 05 (2020). https://doi.org/10.11517/JSAISIGTWO.2020.SAI039_05
Ozaki, Y., Tanigaki, Y., Watanabe, S., Onishi, M.: Multiobjective tree-structured Parzen estimator for computationally expensive optimization problems. In: Proceedings of 2020 Genetic and Evolutionary Computation Conference, pp. 533–541 (2020). https://doi.org/10.1145/3377930.3389817
Baker, B., Kanitscheider, I., Markov, T., Wu, Y., Powell, G., McGrew, B., Mordatch, I.: Emergent Tool Use From Multi-Agent Autocurricula. In: Proceedings of the International Conference on Learning Representations (2020). https://doi.org/10.48550/arxiv.1909.07528
Farmer, J.D., Foley, D.: The economy needs agent-based modelling. Nature 460(7256), 685–686 (2009). https://doi.org/10.1038/460685a
Battiston, S., Farmer, J.D., Flache, A., Garlaschelli, D., Haldane, A.G., Heesterbeek, H., Hommes, C., Jaeger, C., May, R., Scheffer, M.: Complexity theory and financial regulation: Economic policy needs interdisciplinary network analysis and behavioral modeling. Science 351(6275), 818–819 (2016). https://doi.org/10.1126/science.aad0299
Lux, T., Marchesi, M.: Scaling and criticality in a stochastic multi-agent model of a financial market. Nature 397(6719), 498–500 (1999). https://doi.org/10.1038/17290
Cui, W., Brabazon, A.: An agent-based modeling approach to study price impact. In: Proceedings of 2012 IEEE Conference on Computational Intelligence for Financial Engineering and Economics, pp. 241–248 (2012). https://doi.org/10.1109/CIFEr.2012.6327798
Mizuta, T.: An agent-based model for designing a financial market that works well. arXiv (2019). https://doi.org/10.48550/arXiv.1906.06000
Torii, T., Izumi, K., Yamada, K.: Shock transfer by arbitrage trading: analysis using multi-asset artificial market. Evol. Inst. Econ. Rev. 12(2), 395–412 (2015). https://doi.org/10.1007/s408440150024z
Chiarella, C., Iori, G.: A simulation analysis of the microstructure of double auction markets. Quantitative Finance 2(5), 346–353 (2002). https://doi.org/10.1088/14697688/2/5/303
Leal, S.J., Napoletano, M.: Market stability vs. market resilience: Regulatory policies experiments in an agent-based model with low- and high-frequency trading. J. Econ. Behav. Organ. 157, 15–41 (2019). https://doi.org/10.1016/j.jebo.2017.04.013
Paddrik, M., Hayes, R., Todd, A., Yang, S., Beling, P., Scherer, W.: An agent based model of the E-Mini S&P 500 applied to flash crash analysis. In: Proceedings of 2012 IEEE Conference on Computational Intelligence for Financial Engineering and Economics, pp. 257–264 (2012). https://doi.org/10.1109/CIFEr.2012.6327800
Torii, T., Kamada, T., Izumi, K., Yamada, K.: Platform Design for Large-scale Artificial Market Simulation and Preliminary Evaluation on the K Computer. Art. Life Robotics 22(3), 301–307 (2017). https://doi.org/10.1007/s100150170368z
Torii, T., Izumi, K., Kamada, T., Yonenoh, H., Fujishima, D., Matsuura, I., Hirano, M., Takahashi, T.: Plham: Platform for Large-scale and High-frequency Artificial Market (2016). https://github.com/plham/plham
Torii, T., Izumi, K., Kamada, T., Yonenoh, H., Fujishima, D., Matsuura, I., Hirano, M., Takahashi, T., Finnerty, P.: PlhamJ (2019). https://github.com/plham/plhamJ
Sato, H., Koyama, Y., Kurumatani, K., Shiozawa, Y., Deguchi, H.: U-Mart: a test bed for interdisciplinary research into agent-based artificial markets. In: Evolutionary Controversies in Economics, pp. 179–190 (2001). https://doi.org/10.1007/9784431679035_13
Arthur, W.B., Holland, J.H., LeBaron, B., Palmer, R., Tayler, P.: Asset pricing under endogenous expectations in an artificial stock market. The Economy as an Evolving Complex System II, 15–44 (1997). https://doi.org/10.1201/97804294966392
Byrd, D., Hybinette, M., Hybinette Balch, T., Morgan, J.: ABIDES: Towards High-Fidelity Multi-Agent Market Simulation. In: Proceedings of the 2020 Conference on Principles of Advanced Discrete Simulation, pp. 11–22 (2020). https://doi.org/10.1145/3384441.3395986
Murase, Y., Uchitane, T., Ito, N.: A Tool for Parameter-space Explorations. Phys. Proced. 57(C), 73–76 (2014). https://doi.org/10.1016/J.PHPRO.2014.08.134
Murase, Y., Matsushima, H., Noda, I., Kamada, T.: CARAVAN: A Framework for Comprehensive Simulations on Massive Parallel Machines. Massively Multi-Agent Systems II, 130–143 (2019). https://doi.org/10.1007/9783030209377_9
Angione, C., Silverman, E., Yaneske, E.: Using machine learning as a surrogate model for agent-based simulations. PLOS ONE 17(2), 0263150 (2022). https://doi.org/10.1371/JOURNAL.PONE.0263150
Watkins, C.J.C.H., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992). https://doi.org/10.1007/bf00992698
Sutton, R.S.: Learning to predict by the methods of temporal differences. Mach. Learn. 3(1), 9–44 (1988). https://doi.org/10.1007/BF00115009
Tesauro, G.: Temporal Difference Learning and TD-Gammon. Commun. ACM 38(3), 58–68 (1995). https://doi.org/10.1145/203330.203343
Rummery, G.A., Niranjan, M.: On-line Q-learning Using Connectionist Systems. University of Cambridge, Department of Engineering Cambridge, England (1994)
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015). https://doi.org/10.1038/nature14236
Bellemare, M.G., Veness, J., Bowling, M.: The Arcade Learning Environment: An Evaluation Platform for General Agents. J. Art. Intell. Res. 47, 253–279 (2013). https://doi.org/10.1613/jair.3912
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Adv. Neural Inf. Process. Syst. 2, 1097–1105 (2012). https://doi.org/10.1145/3065386
Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: Proceedings of 30th AAAI Conference on Artificial Intelligence, pp. 2094–2100 (2016). https://doi.org/10.1609/aaai.v30i1.10295
Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., De Freitas, N.: Dueling Network Architectures for Deep Reinforcement Learning. In: Proceedings of 33rd International Conference on Machine Learning, pp. 2939–2947 (2016)
Fortunato, M., Azar, M.G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., Blundell, C., Legg, S.: Noisy Networks for Exploration. arXiv (2017). https://doi.org/10.48550/arXiv.1706.10295
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT press, USA (2018)
OpenAI: OpenAI Baselines: ACKTR & A2C (2017). https://openai.com/blog/baselinesacktra2c/ Accessed 2019-11-06
Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., Silver, D.: Rainbow: Combining improvements in deep reinforcement learning. In: Proceedings of 32nd AAAI Conference on Artificial Intelligence, pp. 3215–3222 (2018). https://doi.org/10.1609/aaai.v32i1.11796
Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., Silver, D.: Distributed Prioritized Experience Replay. arXiv (2018). https://doi.org/10.48550/arXiv.1803.00933
Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., Dabney, W.: Recurrent Experience Replay in Distributed Reinforcement Learning. In: Proceedings of International Conference on Learning Representations, pp. 1–15 (2019)
Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., Hassabis, D.: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362(6419), 1140–1144 (2018). https://doi.org/10.1126/science.aar6404
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., Van Den Driessche, G., Graepel, T., Hassabis, D.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017). https://doi.org/10.1038/nature24270
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. In: Proceedings of 4th International Conference on Learning Representations (2015). https://doi.org/10.48550/arxiv.1509.02971
Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., Levine, S.: Soft Actor-Critic Algorithms and Applications. arXiv (2018). https://doi.org/10.48550/arxiv.1812.05905
Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Proc. 35th Int. Conf. Mach. Learn. 2976–2989 (2018). https://doi.org/10.48550/arxiv.1801.01290
Uhlenbeck, G.E., Ornstein, L.S.: On the theory of the Brownian motion. Phys. Rev. 36(5), 823 (1930). https://doi.org/10.1103/PhysRev.36.823
Wawrzyński, P., Tanwani, A.K.: Autonomous reinforcement learning with experience replay. Neural Netw. 41, 156–167 (2013). https://doi.org/10.1016/j.neunet.2012.11.007
Frankle, J., Carbin, M.: The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. Proceedings of 7th International Conference on Learning Representations (2018). https://doi.org/10.48550/arxiv.1803.03635
Bookstaber, R.M.: The End of Theory: Financial Crises, the Failure of Economics, and the Sweep of Human Interaction. Princeton University Press, USA (2017)
Corsi, F.: Measuring and modelling realized volatility: from tick-by-tick to long memory. PhD thesis, Università della Svizzera italiana (2005)
Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M.: Optuna: A next-generation hyperparameter optimization framework. In: Proceedings of the 25th International Conference on Knowledge Discovery & Data Mining, pp. 2623–2631 (2019). https://doi.org/10.1145/3292500.3330701
Acknowledgements
This work was supported by JSPS KAKENHI Grant Number JP 21J20074 (Grant-in-Aid for JSPS Fellows).
Funding
Open access funding provided by The University of Tokyo. This work was supported by JSPS KAKENHI Grant Number JP 21J20074 (Grant-in-Aid for JSPS Fellows).
Author information
Contributions
M.H. conceived and designed the study, developed the software, conducted the experiments and analysis, and wrote the main manuscript text. All authors reviewed the manuscript.
Ethics declarations
Ethical Approval
Not applicable
Competing interests
The authors declare no conflicts of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article belongs to the Topical Collection: Special Issue on Fairness-driven User Behavioral Modelling and Analysis for Online Recommendation
Guest Editors: Jianxin Li, Guandong Xu, Xiang Ren, and Qing Li.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Hirano, M., Izumi, K. Neural-network-based parameter tuning for multi-agent simulation using deep reinforcement learning. World Wide Web 26, 3535–3559 (2023). https://doi.org/10.1007/s11280023011975
DOI: https://doi.org/10.1007/s11280023011975