A reinforcement learning-based parameter adaptation method for particle swarm optimization

Particle swarm optimization (PSO) is a well-known optimization algorithm that shows good performance in solving many different optimization problems. However, PSO usually suffers from slow convergence. In this article, a reinforcement learning-based online parameter adaptation method (RLAM) is developed to enhance the convergence of PSO by designing a network that controls the coefficients of PSO. Moreover, based on RLAM, a new RLPSO algorithm is designed. To investigate the performance of RLAM and RLPSO, experiments on the 28 CEC 2013 benchmark functions are carried out in comparison with other online adaptation methods and PSO variants. The reported computational results show that the proposed RLAM is efficient and effective and that the proposed RLPSO is superior to several state-of-the-art PSO variants.


INTRODUCTION
In recent years, the research community has witnessed an explosion of literature in the area of swarm and evolutionary computation [1]. Hundreds of novel optimization algorithms have been reported over the years and applied successfully in many applications, e.g., system reliability optimization [2], DNA sequence compression [3], systems of boundary value problems [4], solving mathematical equations [5,6], object-level video advertising [7], and wireless networks [8].
Particle swarm optimization (PSO), which was first proposed in 1995 by Kennedy and Eberhart [9], is one of the most famous swarm-based optimization algorithms.
Due to its simplicity and high performance, a multitude of enhancements have been presented on PSO during the last few decades, which can be simply categorized into three types: parameter selection, topology, and hybridization with other algorithms [10].
When solving different optimization problems, appropriate parameters need to be configured for PSO and its variants. The performance of PSO heavily depends on the parameter settings, which shows the importance of parameter adaptation. However, it is very hard for humans to conduct a laborious preprocessing phase to detect promising parameter values and to design a strategy that monitors the running state of PSO and adjusts the running parameters [11]. Therefore, this paper focuses on parameter adaptation.
Current adaptation algorithms are mainly divided into two categories: offline adaptation and online adaptation. Offline adaptation is relatively simple to implement, but because all of its design decisions are fixed before the optimization algorithm runs, it loses some adaptability to the problem during the run. In contrast, online adaptation monitors the behavior of the optimization algorithm in order to control its operation. Common online adaptation methods include control based on history, control based on test experiments, control based on fuzzy logic, and control based on reinforcement learning. In existing online adaptation methods, almost all control rules need to be learned while the algorithm is running. Although fuzzy logic has pre-configured rules, they must be manually configured item by item, which makes current adaptation algorithms inefficient. In addition, these adaptation algorithms are usually designed for one particular optimization algorithm and cannot be applied to various optimization algorithms. Meanwhile, in the image domain [12] and the NLP domain [13], pre-trained models are now very mature and greatly improve the performance of downstream tasks. This inspired us to use reinforcement learning to improve particle swarm performance through pre-training.
In this paper, we propose a reinforcement-learning-based parameter adaptation method (RLAM) by embedding the deep deterministic policy gradient (DDPG) algorithm [14] into the process of PSO. In the proposed method, there are two neural networks: an actor network and an action-value network. The actor network is trained to help the particles in PSO choose their best parameters according to their states. The action-value network is trained to evaluate the performance of the actor network and to provide gradients for its training. The input of the actor network consists of three parts: the percentage of iterations, the number of non-improving iterations, and the diversity of the swarm. All particles are divided into several groups, and each group has its own action generated by the actor network. Typically the action controls w, c1, and c2 in PSO, but it can control any parameter if needed. A reward function is needed to train the two networks; its design is very simple and aims to encourage PSO to obtain a better solution in every iteration. The whole process of training and using the two networks is described in Sect. 4.
To evaluate the performance of the proposed method, three sets of experiments are designed in this paper: (1) six PSO variants are selected, and their performance combined with RLAM is compared with their original performance; (2) RLAM is combined with the original PSO and compared with other online adaptation methods; (3) a new RLPSO is designed based on this method, and its performance is compared with five PSO variants and several advanced algorithms proposed in recent years. These experiments verify the effectiveness of RLAM.

The main contributions of this paper
1. The reinforcement learning algorithm DDPG is introduced into self-adaptation, which greatly improves the parameter adaptation ability.
2. The concept of pre-training is introduced into self-adaptation, so that the particle swarm algorithm can adapt its parameters not only to the current situation but also from past experience, which improves the intelligence of the algorithm.
3. Based on the above improvements, a self-adaptive particle swarm optimization algorithm based on reinforcement learning (RLPSO) is proposed, which greatly improves the performance of particle swarm optimization.

The structure of the paper
The rest of this paper is organized as follows. Related work on self-adaptation is introduced in Sect. 2. The definitions of PSO and DDPG are given in Sect. 3. The implementation of the proposed RLAM and RLPSO algorithms is described in Sect. 4. Experimental studies are presented in Sect. 5. A conclusion summarizing the contributions of this paper is given in Sect. 6.

RELATED WORK
The original particle swarm algorithm was designed to imitate the behavior of birds and fish, but in fact the behavior of birds and fish is far more complicated than that of the particle swarm algorithm, and many adaptations are made when the environment changes. Moreover, the performance of a particle swarm variant largely depends on its parameter configuration. The characteristics of natural organisms and the dependence of particle swarms on parameters have prompted more and more researchers to study how to make particle swarms more intelligent, that is, more adaptive, in order to meet various optimization problems. The essence of this approach can be understood as an optimizer for an optimizer.
Each adaptive algorithm consists of three parts: the adaptation target, the adaptation source, and the adaptation method. The adaptive algorithm receives information from the adaptation source and guides the adaptation target to change through the adaptation method. The classification is shown in Figure 1.
This article focuses on the adaptive-method part of adaptive algorithms.
Adaptive methods fall into two categories, offline tuning and online tuning, depending on whether they tune the parameters prior to or during the algorithm's execution.

Offline adaptation
Offline adaptation refers to those adaptive methods whose parameters are determined before the actual optimization problem is solved. It can be divided into three categories: fine tuning, and linear and nonlinear parameter-change schemes driven by the optimization progress.

Fine tuning
The particle swarm algorithm has many parameters that need to be configured, and these parameters determine the quality of the final optimization result. Therefore, the parameters can be tested in groups: candidate parameter groups are evaluated on all test problems, and the group with the best effect is selected. Design of Experiments [15], F-Race [16], and ParamILS [17] are part of the state of the art in this field.

Linear strategies
Initially, all parameters in the particle swarm algorithm were set to fixed values [9], but researchers found that this was not efficient and made it difficult to balance exploration and exploitation. Therefore, linear strategies were proposed. A linear strategy pre-specifies a linear expression and determines some of the running parameters from the current running state during execution.
In the paper [18], the researchers propose a linearly decreasing inertia weight w, which improves the fine-tuning capability of PSO in the later stages. The value of w decreases linearly from an initial value w_max to w_min as the iterations proceed:
w(t) = w_max - (w_max - w_min) * t / t_max.
Then, in the paper [19], the researchers proposed linear changes for three parameters: in addition to the linear decrease of w, c1 increases linearly and c2 decreases linearly, with maximum and minimum values preset for each parameter. This makes the PSO algorithm more inclined to search near the best values found by each particle in the early stage, which improves the particles' exploration ability, and improves the fine-tuning ability in the later stage, so that the swarm can quickly converge to the global optimum. The changes of c1 and c2 are shown in Eq. 1, and the change of w is consistent with the formula above. In the paper [20], the researchers constructed a linearly growing w, which also achieves better performance on some of the problems given in the paper. Considering that both the linear growth and the linear decline of w have advantages on some problems, in the paper [21] the researchers constructed a w that first increases linearly and then decreases linearly. In addition to the above-mentioned linear adjustment according to the running progress of the algorithm, some papers linearly adjust the parameters according to other quantities.
For example, in the paper [22], the researchers adjust the value of w according to the average distance between particles, where D(t) denotes the average distance amongst particles and ω_max and ω_min are predefined.

Nonlinear strategies
Inspired by the many linear strategies, many researchers have turned their attention to nonlinear strategies. These methods make parameter changes more flexible and closer to the needs of the problem.
In the paper [23], the authors propose an exponential rule to update w, which improves the convergence speed of the algorithm on some problems.
In the paper [24], the authors use a quadratic function to update the parameters. Test results show that this method outperforms linear-transformation algorithms on most continuous optimization problems.
In the paper [25], the authors combine the sigmoid function with the linear transformation, so that the algorithm can converge quickly during the search.
In the papers [26] and [27], the authors apply a chaotic model (the logistic map) to the parameter transformation, which gives the algorithm stronger search ability.
Similarly, in the paper [28], the authors set the parameters to random numbers and also obtain good results.

Online adaptation
Online tuners are techniques for the dynamic adaptation of an algorithm's parameters during its execution. They are typically based on performance feedback from the algorithm on the considered problem instance.

History based
Some papers divide the whole run into many small stages, use different parameters or strategies in each stage, and judge whether the parameters or strategies are good enough according to the performance of the algorithm at the end of the stage. Strategies or parameters that perform well are chosen more often in subsequent runs. In the paper [29], the authors build a parameter memory. In each run, the particles are assigned different parameter groups, and the parameters used by the better-performing particles at the end of a stage are saved in the parameter memory. The parameters chosen by subsequent particles then tend to be close to the average of the parameter memory. In the paper [30], the authors design five particle swarm operation strategies based on past high-performing particle swarm variants and keep a record of the success rate of each strategy. Initially, all success rates are set to 50%; then, in each stage, a strategy is randomly selected for execution with probability weighted by its past success rate, and the success-rate memory is updated according to how many particles improved. In the paper [31], the authors improved the original EPSO by updating the designed strategies, dividing the particle swarm into multiple subgroups, and evaluating the strategies separately, which further improved the performance of the algorithm.

Small test period based
This method divides the entire optimization process into multiple sub-processes, each consisting of two parts: one part is used to test the performance of parameters or strategies, generally taking fewer than 10% of the evaluations, and the other part is the normal optimization process. In the paper [32], the authors divide the parameters into many parameter groups according to a grid. In each performance test, the population is divided into multiple subgroups, and the parameter groups on the grid around the group of the previous process are tested. After several iterations, the best-performing parameter group is selected as the group to use. In the paper [Tatsis and Parsopoulos, 2019], the authors divide the test sub-process into two phases. In the first phase, the neighbors of the currently selected point are evaluated and the probability of success is calculated; in the second phase, the direction with the highest probability of success is taken as the exploration direction, multiple steps are explored in that direction, and the best parameter group found is selected as the parameter configuration for the next stage. This method further improves performance.

Fuzzy rules
In the paper [33], the authors presented a new method for dynamic parameter adaptation in PSO, improving the convergence and diversity of the swarm using interval type-2 fuzzy logic. The experimental results were compared with the original PSO, showing that the proposed approach improves the performance of PSO. In the paper [34], the authors presented a work to improve the convergence and diversity of the swarm in PSO using type-1 fuzzy logic, applied to classification problems.

Table 1. Summary of reinforcement-learning-based adaptation methods.

paper | input (adaptation source)     | output (adaptation target)   | reward            | single particle | algorithm
[35]  | topological structure         | next topological structure   | diversity, gbest  | N               | Q
[36]  | distance with the best / rank | c1, c2, w                    | gbest             | Y               | Q
[37]  | previous strategy             | strategy                     | gbest             | Y               | Q
[38]  | particle position             | velocity                     | evaluated fitness | Y               | Q
[39]  | particle index / position     | strategy                     | gbest             | Y               | Q
[40]  | position, pbest               | c1, c2                       | gbest             | Y               | PG

Reinforcement learning based
At present, most reinforcement-learning-based optimization algorithms use Q-learning to adaptively control parameters and strategies. In the paper [35], the authors combine a variety of topological structures, using the diversity of the particle swarm and the topology of the previous step as the state, and select the topology with the largest Q-value from the Q-table as the topology for the next optimization step. In the paper [36], the authors use the Q-learning algorithm, with the distance from the optimal value and the rank of the particle's evaluation value among all particles as the state input, and different parameter groups as the adaptation targets. The reward is determined by whether the whole optimization objective improves, the current optimization progress, and the currently selected action. In the paper [37], the authors use the Q-learning algorithm; the strategy selected in the previous step is the state input, the output actions control different strategies such as jump or fine-tune, and the reward is determined by whether the whole optimization objective improves. In the paper [38], the authors use the Q-learning algorithm, take the particle position as the state input, and let the output action control the predicted velocity of different particle strategies; the reward is determined by the increase or decrease of the particle evaluation value. In the paper [39], the authors use the Q-learning algorithm, with the index and position of each particle as the state input; the output action chooses between two different strategies, and the reward is determined by whether the whole optimization objective improves. In the paper [40], the authors adopt a policy gradient algorithm, which takes the particle position and pbest as the state input; the output action controls c1 and c2, and the reward is determined by the growth rate of the whole optimization objective.
A summary of these papers is given in Table 1, which lists the method used in each paper, the state input, the action output, the reward input, and whether each particle is controlled separately.

Particle swarm optimization
Particle swarm optimization is a representation of swarm behaviors in ecological systems, such as birds flying and bees foraging [41]. In classic PSO, the movement of a particle is influenced by its own previous best position and the best position of the best particle in the swarm. To describe the state of the particles, the velocity V and position X are defined as X = [x_1, ..., x_N] and V = [v_1, ..., v_N], where each x_i and v_i is a D-dimensional vector. Here, D represents the dimension of the search space and N represents the number of particles. As the search progresses, the two movement vectors are updated as follows:
v_i(t + 1) = w * v_i(t) + c_1 * r_1 * (pBest_i - x_i(t)) + c_2 * r_2 * (gBest - x_i(t)),
x_i(t + 1) = x_i(t) + v_i(t + 1).
Here, w is the inertia weight, c_1 is the cognitive acceleration coefficient, c_2 is the social acceleration coefficient, r_1 and r_2 are uniformly distributed random numbers within [0, 1], and v_i(t + 1) denotes the velocity of the i-th particle in the (t + 1)-th generation. pBest_i is the personal best position of the i-th particle, and gBest is the best position in the swarm.
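As a concrete illustration, the update rules above can be sketched in a few lines of NumPy; the coefficient defaults are illustrative, not values taken from the paper.

```python
import numpy as np

def pso_step(X, V, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One PSO iteration: update velocities and positions of an N x D swarm."""
    rng = np.random.default_rng(0) if rng is None else rng
    r1 = rng.random(X.shape)  # uniform [0, 1), drawn per particle and dimension
    r2 = rng.random(X.shape)
    V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)
    return X + V, V
```

With c1 = c2 = 0 the update degenerates to pure inertia, which makes the formula easy to check by hand.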

Reinforcement learning and deep deterministic policy gradient
This subsection briefly introduces RL and DDPG. For more details, see [14].

Reinforcement learning
Reinforcement learning (RL) is a kind of machine learning. Its purpose is to guide an agent to perform optimal actions in an environment so as to maximize the cumulative reward. The agent interacts with the environment in discrete timesteps. At each timestep t, the agent receives an observation s_t ∈ S, takes an action a_t ∈ A, and receives a scalar reward r_t(s_t, a_t). Here S is the state space and A is the action space. The agent's behavior is controlled by a policy π : S → A, which maps each observation to an action.

Deep deterministic policy gradient
In DDPG, four neural networks are designed to obtain the best policy π: the actor network µ(s_t|θ^µ), the target actor network µ′(s_t|θ^µ′), the action-value network Q(s_t, a_t|θ^Q), and the target action-value network Q′(s_t, a_t|θ^Q′). µ and µ′ are used to choose actions according to the state, while Q and Q′ are used to evaluate the actions chosen by the actor network. θ^µ, θ^µ′, θ^Q, and θ^Q′ are the weights of the corresponding networks. At the beginning, θ^µ′ is a copy of θ^µ and θ^Q′ is a copy of θ^Q. During training, the weights of the target networks are updated softly:
θ^Q′ ← τ θ^Q + (1 − τ) θ^Q′,  θ^µ′ ← τ θ^µ + (1 − τ) θ^µ′,
where τ ≪ 1. To train the action-value network, the following loss function is minimized:
L(θ^Q) = E[(y_t − Q(s_t, a_t|θ^Q))^2],  with  y_t = r_t + γ Q′(s_{t+1}, µ′(s_{t+1}|θ^µ′)|θ^Q′).
The action-value network is then used to train the actor network with the policy gradient:
∇_{θ^µ} J ≈ E[∇_a Q(s, a|θ^Q)|_{a=µ(s)} ∇_{θ^µ} µ(s|θ^µ)].
The data flow of the training process is shown in Fig. 2.
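The soft target update θ′ ← τθ + (1 − τ)θ′ can be sketched as follows, treating a network's weights as a list of arrays; τ = 0.005 is an illustrative value, not one given in the text.

```python
import numpy as np

def soft_update(target, online, tau=0.005):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online, target)]
```

Setting tau = 1 recovers the hard copy used at initialization, when the target networks start as copies of the online ones.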

Comprehensive learning particle swarm optimizer (CLPSO)
For more details, see [42].
The velocity updating equation used in CLPSO is
v_i^d(t + 1) = w * v_i^d(t) + c * r^d * (pbest_{f_i(d)}^d − x_i^d(t)),
where f_i = [f_i(1), ..., f_i(D)] defines which particle's pbest the i-th particle should follow: the d-th dimension of the i-th particle follows the d-th dimension of the f_i(d)-th particle's pbest.
To determine f_i(d), every particle has its own learning probability Pc_i, generated by the following equation:
Pc_i = a + b * (e^{10(i−1)/(ps−1)} − 1) / (e^{10} − 1),
where ps is the population size, a = 0.05, and b = 0.45. When a particle updates its velocity in one dimension, a random value in [0, 1] is generated and compared with Pc_i. If the random value is larger than Pc_i, the particle follows its own pbest in this dimension; otherwise, it follows another particle's pbest in that dimension, with the target particle chosen by tournament selection. Moreover, to avoid wasting function evaluations in the wrong direction, CLPSO defines a refreshing gap of m evaluations. While a particle follows a target particle, the number of times the particle ceases improving is recorded as flag_clpso. If flag_clpso exceeds m, the particle obtains a new target particle by tournament selection again.
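The learning-probability rule can be sketched as follows (0-indexed particles, so i runs from 0 to ps − 1; the constants a = 0.05 and b = 0.45 are from the text).

```python
import math

def clpso_pc(i, ps, a=0.05, b=0.45):
    """Learning probability Pc_i for the i-th particle (0-indexed) of a swarm of size ps."""
    return a + b * (math.exp(10.0 * i / (ps - 1)) - 1.0) / (math.exp(10.0) - 1.0)
```

The first particle gets Pc = a = 0.05 and the last gets Pc = a + b = 0.5, so later-indexed particles learn from others more often.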

PROPOSED ALGORITHM
In this section, an efficient parameter adaptation method is proposed. A particle swarm variant based on this method is then introduced.

Parameters self-adaption based on DDPG
In this section, we first introduce the input and output of the actor network applied to the particle swarm algorithm. We then describe how the reward function scores changes in state. The third part explains how to train the networks, and the fourth describes how to run the particle swarm algorithm with the trained network. Finally, the proposed new particle swarm algorithm is introduced.

State of PSO (input of actor network)
The running state designed in this paper has three parts: the current iteration progress, the current particle diversity, and the duration for which the global best has not improved. For the neural network to work well, all running states are eventually mapped to the interval −1 to 1 before being input to the network.
The iteration input is the percentage of iterations completed during the execution of PSO; its values are in the range from 0 to 1. At the start of the algorithm the iteration input is 0, and it increases until it reaches 1 at the end of the execution. The values are obtained from Equation 7:
Iteration = Fe_num / Fe_max,
where Fe_num is the number of function evaluations performed so far and Fe_max is the maximum number of function evaluations set for the run.
The diversity input is defined by Equation 8, which measures the degree of dispersion of the swarm: lower diversity means the particles are closer together, and higher diversity means they are more spread out. The diversity is the average Euclidean distance between each particle and the swarm's mean position:
diversity(t) = (1/N) Σ_{i=1}^{N} sqrt( Σ_{j=1}^{D} (x_ij(t) − x̄_j(t))^2 ),
where x_ij(t) is the j-th coordinate of the i-th particle in iteration t and x̄_j(t) is the average j-th coordinate over all particles in iteration t.
The stagnation duration input indicates whether the current particle swarm is running efficiently. It is defined as
Iteration_no-improve = (Fe_num − Fe_num_last-improve) / Fe_max,
where Fe_num_last-improve is the number of evaluations at the last update of the global optimum.
To map all features into [0, 1] and make the information more salient, the above quantities are encoded with the sin-encoding method of [43]. Given an input x, which can be Iteration, diversity, or Iteration_no-improve, the encoding produces five features state_i, i = 0, 1, 2, 3, 4, where state_i is the i-th feature generated from x; each x thus yields five new parameters. The reason for this operation is that some of the above quantities are very small, and their changes are even smaller, which would prevent the actor network from capturing their changes and producing effective action outputs.
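The three raw state features can be sketched as follows, before encoding. Normalizing the stagnation term by Fe_max is an assumption here, since the exact equation is not reproduced above.

```python
import numpy as np

def raw_state(fe_num, fe_max, X, fe_last_improve):
    """Three raw RLAM state features for an N x D swarm X."""
    iteration = fe_num / fe_max                                # Eq. 7
    mean_pos = X.mean(axis=0)                                  # x_bar_j(t)
    diversity = np.mean(np.linalg.norm(X - mean_pos, axis=1))  # Eq. 8
    stagnation = (fe_num - fe_last_improve) / fe_max           # assumed normalization
    return iteration, diversity, stagnation
```

Each of these scalars would then be expanded into five sin-encoded features before entering the actor network.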

Action of PSO (output of actor network)
The actions designed in this paper control the operating parameters of PSO, such as w, c1, and c2; the parameters to be controlled can be chosen as required. The actor network does not control these parameters for each particle separately; instead, it divides the particles into 5 subgroups and generates 5 sets of parameters to control the different groups.
For the traditional PSO algorithm, the obtained action vector a_t is 20-dimensional and is divided into 5 groups, each aimed at one sub-swarm. For a sub-swarm, the action vector is 4-dimensional: a[0] to a[3]. The w, c1, and c2 required for each round of the optimization algorithm are generated from a[0] to a[3]; in the generating formula, scale is an optional parameter that helps normalize c1 and c2.
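A sketch of the grouping step follows. The reshape into five 4-dimensional sub-actions matches the description above, but the affine mapping from a[0..3] to (w, c1, c2) is a hypothetical stand-in for the paper's generating formula, which is not reproduced here.

```python
import numpy as np

def group_coefficients(a_t, scale=2.0):
    """Split a 20-dim action into 5 sub-swarm groups and map each to (w, c1, c2)."""
    groups = np.asarray(a_t).reshape(5, 4)  # one 4-dim action per sub-swarm
    coeffs = []
    for a0, a1, a2, a3 in groups:
        # Hypothetical mapping: actions lie in [-1, 1]; shift/scale into plausible ranges.
        w = 0.5 + 0.4 * a0             # roughly [0.1, 0.9]
        c1 = scale * (a1 + 1.0) / 2.0  # [0, scale]
        c2 = scale * (a2 + 1.0) / 2.0
        coeffs.append((w, c1, c2))
    return coeffs
```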
For some PSO variants, the parameters have already been studied carefully and the original parameter set is already quite good. To take advantage of the original parameter set, the new w, c1, and c2 are configured as adjustments of w_origin, c1_origin, and c2_origin, the original parameters of the algorithm. In all subsequent experiments, each algorithm uses one of these two action-to-parameter configuration methods.
If more parameters in the algorithm need to be configured, the number of output parameters can be increased and the new parameters configured in the same way as c1 or w.

Reward function
The reward function is used to calculate the reward after an action is executed. Its goal is to encourage PSO to obtain a better Gbest, so the reward function is designed in terms of Gbest(t), the best solution in the t-th iteration.
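Since the exact reward equation is not reproduced above, the following is only one plausible shape consistent with the description: reward any improvement of Gbest, assuming minimization.

```python
def rlam_reward(gbest_prev, gbest_new, eps=1e-12):
    """Assumed reward shape: positive when the new global best improves on the
    previous one (minimization), zero otherwise. Not the paper's exact formula."""
    return max(0.0, (gbest_prev - gbest_new) / (abs(gbest_prev) + eps))
```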

Training
This section introduces the training process for the traditional PSO algorithm; other particle swarm variants follow essentially the same process.
In the training process, each iteration of the particle swarm algorithm corresponds to the agent performing one action in the environment, that is, one epoch of training. The particle swarm completing the optimization of one objective function corresponds to one full interaction between the agent and the environment, that is, one episode of training.
The training process is as follows. First, the actor network and the action-value network are randomly initialized, and the target actor network and target action-value network are created as copies. A replay buffer R is then initialized to store running states, actions, and rewards. The third step initializes the environment, including the initialization of PSO and of the objective function to be optimized. The fourth step obtains the running state parameters from the particle swarm. The fifth step inputs the running state s_t into the actor network, obtains the action a_t, and adds random noise drawn from a normal distribution with mean 0 and variance 0.5. The sixth step converts the resulting actions into the required w, c1, and c2 using formula 11. The seventh step performs one iteration of particle swarm optimization with these parameters to obtain a new reward r_{t+1} and a new state s_{t+1}. The eighth step saves the state, action, and reward to the buffer R. The ninth step randomly selects a batch of experiences from R, updates the weights of the action-value network by minimizing the loss function (formula 4), and updates the weights of the actor network by formula 3. The tenth step updates the weights of the target action-value network and the target actor network by the soft-update formula (Eq. 2). If the particle swarm has not finished iterating, t is increased by 1 and the process returns to step 4; if training is not complete, it returns to step 3. The pseudocode is given in Algorithm 1.
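The second, eighth, and ninth steps revolve around the replay buffer; a minimal sketch follows, with the capacity chosen arbitrarily.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience buffer: initialize (step 2), save transitions
    (step 8), and sample a random batch for the network updates (step 9)."""
    def __init__(self, capacity=100000):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buf, min(batch_size, len(self.buf)))
```

The deque's maxlen makes the buffer discard the oldest experiences automatically once it is full.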
If the objective function used in training is the same function to be optimized later, the effect is better. Training with a set of test functions instead works on all objective functions, but not as well as the first method.
After training, a trained actor network model is obtained for subsequent runs.

Running
In the running process, compared to the training process, many steps are removed, and the overall process is very simple.
The running process for the traditional particle swarm algorithm is introduced below; the process for other particle swarm variants is essentially the same.
The flow chart of the PSO algorithm combined with RLAM is shown in Figure 3.

Algorithm 1: Training of RLAM
1: Randomly initialize θ^Q and θ^µ in the actor network µ(s|θ^µ) and the action-value network Q(s, a|θ^Q)
2: Initialize the target networks Q′ and µ′, with weights copied from Q and µ
3: Initialize the replay buffer R
4: for episode = 1 : EpisodeMax do
5:     Initialize the environment (PSO and evaluation function)
6:     for t = 1 : Tmax do
7:         Get observation s_t from the environment
8:         Choose an action based on s_t, the network µ, and exploration noise {Eq. 14}
9:         Perform the action a_t in the environment and observe the reward r_t and the new state s_{t+1}
10:        Save (s_t, a_t, r_t, s_{t+1}) to the buffer R
11:        Update the action-value network by minimizing the loss function {Eq. 3}
12:        Update the actor network through the sampled policy gradient {Eq. 4}
13:        Update the weights of the target networks {Eq. 2}
14:    end for
15: end for

In addition to the original process of PSO, the new content is that, before the particle velocities are updated, the running state of the particle swarm is calculated and a new parameter group is generated to guide the particle update.

Network structure
This section introduces the network structure of the actor network and the action-value network in detail.
Since the action-value network is only used during pre-training, while the actor network must be used during actual operation, a smaller actor network and a larger action-value network are designed to avoid excessive computation during the optimization process.
Schematic diagrams of the two networks are shown in Figures 4 and 5.
The action-value network is a 6-layer fully connected network with LeakyReLU activation functions between layers. The actor network is a 4-layer fully connected network, also with LeakyReLU activations; a tanh activation is added after the last layer to map the actions into the required range of −1 to 1.
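A forward pass through such an actor can be sketched in NumPy as follows. The layer widths (64) and the input size (15 = 3 raw features × 5 sin-encoded values each) are assumptions, since the paper does not list exact widths.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def actor_forward(state, weights, biases):
    """4-layer fully connected actor: LeakyReLU between hidden layers,
    tanh on the output so actions land in [-1, 1]."""
    h = np.asarray(state)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = leaky_relu(h @ W + b)
    return np.tanh(h @ weights[-1] + biases[-1])

# Illustrative shapes: 15 encoded state features -> 20-dim action.
rng = np.random.default_rng(0)
sizes = [15, 64, 64, 64, 20]
Ws = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
action = actor_forward(rng.normal(size=15), Ws, bs)
```

The final tanh guarantees every component of the action lies in (−1, 1), matching the mapping described above.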

RLPSO
To better demonstrate the parameter adjustment ability of RLAM, this paper designs a new RLPSO algorithm based on RLAM. Its velocity update equation is given in Eq. 15. Here, f_i(d) is as introduced in the previous section, pbest is the particle's own best experience, and gbest is the best experience in the swarm. According to the current running state, w, c1, c2, c3, and c4 are coefficients generated by the actor network, as introduced in Sect. 4.1.2. r1, r2, and r3 are all uniformly distributed random numbers between 0 and 1.
To prevent particles from being trapped in a local optimum, there is a mutation stage after the velocity update. In this stage, a random number r4 between 0 and 1 is generated and compared with c4 * 0.01 * flag_clpso. If r4 is smaller, the mutation is performed and the particle position is reinitialized in the solution space.
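The mutation stage can be sketched as follows; the solution-space bounds of −100 to 100 match the CEC 2013 domain used later.

```python
import numpy as np

def maybe_mutate(x, c4, flag_clpso, lo=-100.0, hi=100.0, rng=None):
    """Reinitialize the particle uniformly in the solution space with
    probability c4 * 0.01 * flag_clpso."""
    rng = np.random.default_rng(0) if rng is None else rng
    if rng.random() < c4 * 0.01 * flag_clpso:
        return rng.uniform(lo, hi, size=np.shape(x))
    return x
```

The longer a particle has stagnated (larger flag_clpso), the more likely it is to be thrown back into the search space.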
At the end of one period, particles will move according to their velocity, and then particles' fitness and history best experience will be updated.
The pseudocode is proposed in Algorithm 2.

(Algorithm 2, recoverable steps: calculate the new velocity with Eq. 15, calculate the evaluation value of all particles, and update the parameters in the particle operation.)

EXPERIMENTS
To verify the performance of the proposed algorithm, three sets of experiments are conducted. In the first experiment, RLAM is fused with various PSO variants, and the results are compared with those of the original variants. The second experiment compares the performance of RLAM with other online adaption methods on the classical particle swarm algorithm. Finally, RLPSO, designed based on RLAM, is compared with other current state-of-the-art algorithms.
In the following experiments, in order to test algorithm performance more comprehensively, we selected the CEC2013 test set [44]. The test set includes 28 test functions that simulate a wide variety of optimization problems; they are listed in Table 2. The domain of all functions is -100 to 100. Because of the relatively large overall testing volume, the end condition for all algorithms in all runs is the completion of 10,000 evaluations. The proposed algorithm has been implemented in Python 3.9 under the 64-bit Ubuntu 16.04.7 LTS operating system. Experiments are conducted on a server with an Intel Xeon Silver 4116 2.1 GHz CPU and 128 GB of RAM.

Improvement of PSO variants after combining with RLAM
In this experiment, we compare the performance of various PSO variants incorporating RLAM with their original versions. The combination of PSO and RLAM is as described above. For each problem, each algorithm is run 50 times in 30 dimensions, and the final results are shown in Table 3.
In Table 3, the result is calculated as follows: improve is the percentage of improvement, which is the data in the table; Gbest origin is the optimal value obtained by the original version of the PSO variant; Gbest train is the optimal value obtained by the RLAM version of the PSO variant; benchmark best is the optimal value of the current test function. Each row represents the result on a different evaluation function.
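The metric can be reproduced with the following sketch. The exact normalization is an assumption read off from the quantities defined above: the gain of the RLAM version over the original, scaled by the original's gap to the function's known optimum:

```python
def improvement(gbest_origin, gbest_train, benchmark_best):
    """Assumed reconstruction of the table's improvement percentage.

    Positive values mean the RLAM version got closer to the
    benchmark optimum than the original version did.
    """
    gap = gbest_origin - benchmark_best
    if gap == 0.0:
        # Original already reached the optimum: no room to improve.
        return 0.0
    return 100.0 * (gbest_origin - gbest_train) / gap
```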
To show the improvement more intuitively, Figure 6 shows a heatmap of the improvement of the various PSO algorithms after adding RLAM.
From Table 3 and Figure 6 we can see that almost all algorithms improve significantly on all test functions after incorporating RLAM. The average improvement ranges from 13% to 29.2%, and the overall average improvement is 19.2%.
On the unimodal test functions (F01-F05), most of the PSOs improve considerably after combining with RLAM. On the multimodal test functions (F06-F20), although the overall improvement is not as large as on the unimodal functions, most algorithms still improve significantly. On the composition test functions (F21-F28), the overall improvement is smaller than in the first two categories, but it is still noticeable. These results show that RLAM generally achieves a very good improvement on simple problems; the improvement rate decreases on complex problems, but the effect remains significant.
In addition, for a more comprehensive statistical analysis, this paper carries out nonparametric statistical tests. Table 4 presents the results of the pairwise comparison, including the number of increases and decreases and the p-value of the Wilcoxon test [35,45]. We normalized the results on every function to [0, 1] according to the best and worst results obtained by all the algorithms [46]. As can be seen from the table, out of the 6 tested PSO methods, two algorithms improve on all test functions after combining RLAM. The worst of the other algorithms decreases on 4 of the 28 test functions. Overall, among the 168 combinations of test algorithms and test functions, 11 combinations decrease and the rest improve, so the proportion of effective results is as high as 93.5%. According to the results of the Wilcoxon test, RLAM is significantly effective at a significance level of α = 0.05.
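The per-function normalization used before the statistical comparison can be sketched as min-max scaling of each function's row across all algorithms (the signed-rank test itself would then be applied to the normalized columns, e.g. with scipy.stats.wilcoxon):

```python
import numpy as np

def normalize_per_function(results):
    """Min-max normalize a (functions x algorithms) result matrix.

    On every function (row), the best result maps to 0 and the
    worst to 1, so functions with different scales are comparable.
    """
    results = np.asarray(results, dtype=float)
    lo = results.min(axis=1, keepdims=True)
    hi = results.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard rows with identical results
    return (results - lo) / span
```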

Comparison with other online adaption methods
To weigh the pros and cons of RLAM against other online adaption methods, RLAM is compared with four of them in this section: the adaptation algorithm based on type 1 fuzzy logic [47], the adaptation algorithm based on type 2 fuzzy logic [33], the adaptation algorithm based on success rate history [48], and the adaptation algorithm based on Q-learning [11].
In the results, PSO with RLAM is labeled PSOtrain, PSO with type 1 fuzzy logic is labeled TF1PSO, PSO with type 2 fuzzy logic is labeled TF2PSO, PSO with the adaptation algorithm based on success rate history is labeled SuccessHistoryPSO, and PSO with the adaptation algorithm based on Q-learning is labeled QLPSO.
The best results obtained by PSO with RLAM and by PSOs with the other online tuning methods are listed in Table 5, with the best results shown in bold. In some cases the data appear identical but only one entry is marked in bold; this is due to insufficient numerical resolution, and finer results would show the differences. In this table, the fuzzy-logic-based adaptations performed the worst: the type 1 fuzzy logic adaptation won first place only on F7, while the type 2 fuzzy logic adaptation won first place on no function. The adaptation based on success history won first place on F16 and F23, while the adaptations based on reinforcement learning performed best in the test. Among them, the adaptation based on Q-learning won first place on F4, F5, F18, F20, F25 and F26, and RLAM won first place on the other 19 functions. Figure 7 depicts the ranking of these algorithms for a multiple comparison. From the figure we can see that RLAM has the lightest overall color and the highest average ranking over the entire function test set. The adaptation based on Q-learning also works well, but it came last on the 6 test functions F14, F15, F16, F22, F23 and F27. The remaining algorithms differ little from one another; among them, the adaptation algorithm based on type 1 fuzzy logic ranks better overall, with almost no worst results, indicating better robustness.
Table 6 shows pairwise comparisons between the PSOs with online tuning methods. It is clear that PSO with RLAM is significantly better than TF1PSO, TF2PSO, SuccessHistoryPSO and QLPSO at a significance level of α = 0.05.
Overall, RLAM performs excellently on all test functions when compared with the other online tuning methods.

Comparison of RLPSO with other state-of-the-art PSO variants
Since most current state-of-the-art particle swarm algorithms combine many other optimization methods, adjusting parameters alone does not allow a fair comparison. To test the strength of RLAM applied to particle swarm optimization, we compare the performance of RLPSO, designed based on RLAM, with other improved particle swarm optimization algorithms. The compared algorithms include some widely used particle swarm algorithms: CLPSO [42], FDR-PSO [49], LIPS [50], PSO [9] and SHPSO [51], as well as several state-of-the-art variants of PSO, including EPSO [30], AWPSO [52] and PPPSO [53]. All algorithms are executed 50 times for each problem, and the reported results are the averages.
The best results obtained by RLPSO and the other PSO variants are listed in Table 7, with the best results shown in bold. RLPSO took first place on 18 of the 28 test functions, far exceeding the other PSO algorithms. SHPSO took first place on two functions, AWPSO on five, PPPSO on two, and EPSO on one.
Figure 8 shows the specific ranking of each algorithm on the different test problems. It can be seen that RLPSO ranks very high on almost all problems, reflecting its excellent performance.
To show the stability of the proposed algorithm, convergence curves are depicted in Fig. 9. Table 8 presents pairwise comparisons between RLPSO and the other PSO variants. It is clear that RLPSO is significantly better than the other PSOs at a significance level of α = 0.05.
In general, RLPSO stands out among the many particle swarm algorithms, which shows that with suitable design the RLAM method can yield an excellent particle swarm variant, and this again highlights the power of RLAM.

CONCLUSION
In this paper, a reinforcement learning-based parameter adaption method, RLAM, and an RLAM-based RLPSO are proposed. In RLAM, at each generation the particles run with the coefficients returned by the actor network. The actor network is trained before the run; it can be trained on the target function or on some test functions. A combination of the iteration count, the number of non-improving iterations and the swarm diversity forms the state, and the reward is calculated from the change of the best result after an update. In RLPSO, in addition to RLAM, CLPSO and mutation mechanisms are added to the algorithm. Furthermore, this paper carries out comprehensive experiments to compare the new method with other online adaption methods, to investigate the effect of RLAM on different PSO variants, and to compare RLPSO with other state-of-the-art PSO variants.
The proposed method is incorporated into multiple PSO variants and tested on the CEC2013 test set; almost all PSO variants achieve improved final optimization accuracy. This result shows that RLAM is beneficial, and rarely harmful, across almost all problems and optimization algorithms, and it removes the cumbersome burden of manual parameter adjustment. The proposed method has also been compared with other online adaption methods, including adaptive particle swarm optimization based on reinforcement learning, on fuzzy logic, and on success rate history. The final results show that the proposed adaptive algorithm significantly outperforms the other adaptive algorithms. Since most current state-of-the-art particle swarm algorithms combine many other methods, adjusting parameters alone does not allow a fair comparison. Therefore, this paper designs RLPSO based on RLAM, combining CLPSO with some mutation and grouping operations. The final algorithm is compared with a variety of top particle swarm algorithms on the CEC2013 test set, and the results show that the proposed algorithm is at the leading level.
In the future, based on RLAM, the selection of states, the control of actions and the applicability of other optimization algorithms will be further studied to further improve the optimization performance.

Figure 2 :
Figure 2: The training process of DDPG

Figure 3 :
Figure 3: The running process of PSO with RLAM

Figure 7 :
Figure 7: Multiple comparison between RLAM and other online tuning methods

Table 1 :
Reinforcement learning based self-adaption methods.

Table 2 :
All test functions and their optimal values in CEC2013.

Table 3 :
Improvement of PSO variants after combining with RLAM.

Table 4 :
Pairwise comparison between PSO and PSO with RLAM

Table 5 :
Convergence accuracy comparison between RLAM and other online adaption methods.

Table 6 :
Pairwise comparison between PSO with RLAM and other online tuning method

Table 7 :
Comparison between RLPSO and other PSO variants

Table 8 :
Pairwise comparison between RLPSO and other state-of-the-art PSO variants