Design and application of adaptive PID controller based on asynchronous advantage actor–critic learning method

To address the slow convergence and inefficiency of existing adaptive PID controllers, we propose a new adaptive PID controller based on the asynchronous advantage actor–critic (A3C) algorithm. First, the controller trains multiple actor–critic agents in parallel, exploiting the multi-thread asynchronous learning of the A3C structure. Second, to achieve the best control effect, each agent uses a multilayer neural network to approximate the policy function and the value function and to search for the best parameter-tuning policy in a continuous action space. The simulation results indicate that the proposed controller achieves faster convergence and stronger adaptability than conventional controllers.


Introduction
The PID controller is a control-loop feedback mechanism that is widely used in industrial control systems [1]. Building on the conventional PID controller, the adaptive PID controller adjusts its parameters online according to the state of the system and therefore adapts better to the system. The fuzzy PID controller [2] adopts the idea of matrix estimation [3,4]. To meet the requirement of self-tuning PID parameters, this method adjusts the parameters by querying a fuzzy matrix table. Its limitation is that it needs much more prior knowledge. Moreover, it has a large number of parameters that need to be optimized [5].
The neural network adaptive PID controller [6,7] approximates the nonlinear structure with neural networks, which can achieve effective control without identifying the complex nonlinear controlled object. However, it is difficult to obtain teacher signals in the supervised learning process. The evolutionary adaptive PID controller [8] requires less prior knowledge, but it has difficulty achieving real-time control [9]. The adaptive PID controller based on reinforcement learning [10] avoids the problem of obtaining a teacher signal because it learns without supervision, and the optimization of its control parameters is simple. The actor-critic (AC) adaptive PID controller [11,12] is the most widely used reinforcement learning controller. However, its convergence speed is affected by the correlation of the learning data in the AC algorithm [13].
Google's DeepMind team proposed the asynchronous advantage actor-critic (A3C) learning algorithm [14,15]. This algorithm trains multiple agents in parallel with multiple policies, so each agent experiences different learning states. As a result, the correlation between learning samples is broken while the computational efficiency improves [16]. The algorithm has been applied in many fields [17,18].
The proposed method aims to improve the convergence and adaptive ability of the PID controller. To this end, we use the A3C algorithm, which speeds up learning, to train agents in parallel threads, and we use two BP neural networks to approximate the policy function and the value function separately. The experiments show that the proposed algorithm outperforms conventional PID control algorithms. The rest of the paper is arranged as follows. After a brief description of PID controllers in Sects. 2 and 3, we introduce our new approach in Sect. 4 and show experimental results in Sect. 5. We conclude the paper in Sect. 6.

Related work
The conventional PID control algorithms can be roughly classified into three categories: the fuzzy PID controller, the neural network PID controller and the reinforcement learning PID controller.

Fuzzy PID controller
Tang [19] proposed a method that combined fuzzy mathematics with PID control. However, this method had limitations: it required a lot of manual experience to establish the rule table, and the rule table was often suited to only one specific application scenario. To address these issues, Sun [20] developed a fuzzy PID controller based on an improved genetic algorithm, which used the genetic algorithm to adjust the parameters of multiple fuzzy control rules. The controller did away with much of the manual work and established rules specific to the environment. Inspired by this work, Zhu [21] added a normalized velocity parameter that reflects the response of the system, based on the adjusting factor of the fuzzy rules. The method changed the mapping between the input and output variables and the fuzzy subsets so that the controller could divide the error and the rate of error into multiple control stages.

Adaptive controller based on neural network
Liao [22] proposed a method that uses a neural network to improve the performance of the PID controller for nonlinear systems. Although the initial parameters of the neural network could be determined by manual testing, the reliability of the manual result could not be ensured. For this reason, Li [23] adopted a genetic algorithm to obtain the optimal initial parameters of the network. However, the genetic algorithm easily falls into a local optimum. To solve this problem, Patel [24] added an immigration mechanism, in which 10% of the elite population and of the inferior population were selected as the variant population, to the neural network adaptive PID controller (MN-PID). In addition, Nie [25] presented an adaptive chaos particle swarm optimization for tuning the parameters of the PID controller (CSP-PID) to avoid local minima.

Reinforcement learning adaptive controller
Aziz Khater [26] proposed a PID controller that combines the ASN reinforcement learning network with fuzzy mathematics. Although this method did not need as many accurate training samples as the neural network PID controller, its structure was too complex to guarantee real-time performance. In view of this, Adel [10] designed an adaptive PID controller based on the AC algorithm. This controller had a simple structure with one RBF network. However, its convergence was slow owing to the correlation between the learning samples of the AC algorithm.

Basic structure of PID controller
Incremental PID is a PID control algorithm that acts on the increment of the control variable. The typical control system structure is shown in Fig. 1, and the control law is as follows:

$$u(t) = u(t-1) + \Delta u(t) = u(t-1) + K_p \Delta e(t) + K_i e(t) + K_d \Delta^2 e(t) \quad (1)$$

$$e(t) = y'(t) - y(t), \quad \Delta e(t) = e(t) - e(t-1), \quad \Delta^2 e(t) = e(t) - 2e(t-1) + e(t-2)$$

Here $y'(t)$, $y(t)$, $e(t)$, $\Delta e(t)$ and $\Delta^2 e(t)$ represent the current reference signal value, the current output value of the system, the system output error, the first-order difference of the error and the second-order difference of the error respectively. In Eq. (1), the incremental PID drops the integral summation, which saves computation time. It also has only a slight influence on the system when a fault occurs. Taking these factors together, the incremental PID is the best choice for practical applications.

The A3C structure consists of multiple AC structures, which are executed and trained in parallel by creating multiple agents in instances of the same environment. The central network is responsible for updating and storing the AC network parameters. Each agent has its own AC structure. The different agents transfer their learning data to the central network to update the parameters of the AC network. Furthermore, the Actor network is responsible for policy learning, while the Critic network is responsible for estimating the value function.
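As a minimal sketch of the incremental PID law in this section, the update can be written in a few lines of Python. The class name `IncrementalPID`, the zero initial conditions and the fixed gains in the usage lines are assumptions made for illustration; in the proposed controller the gains would be supplied by the Actor at every step.

```python
class IncrementalPID:
    """Incremental PID: u(t) = u(t-1) + Kp*de(t) + Ki*e(t) + Kd*d2e(t)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.e1 = 0.0  # e(t-1), assumed zero before the first sample
        self.e2 = 0.0  # e(t-2), assumed zero before the first sample
        self.u = 0.0   # u(t-1), assumed zero before the first sample

    def step(self, reference, output):
        e = reference - output                 # e(t) = y'(t) - y(t)
        de = e - self.e1                       # first-order difference
        d2e = e - 2.0 * self.e1 + self.e2      # second-order difference
        self.u += self.kp * de + self.ki * e + self.kd * d2e
        self.e2, self.e1 = self.e1, e          # shift the error history
        return self.u


# Hypothetical gains, chosen only to make the arithmetic easy to follow.
pid = IncrementalPID(kp=1.0, ki=0.5, kd=0.0)
u1 = pid.step(reference=1.0, output=0.0)
u2 = pid.step(reference=1.0, output=0.0)
```

Because only the increment is computed, no running integral has to be stored; the controller keeps just the last two errors and the last control value.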

Structure of A3C-PID controller
The design of the A3C adaptive PID controller combines the asynchronous learning structure of A3C with the incremental PID controller. Its structure is shown in Fig. 2. The whole process is as follows:
Step 1: For each thread, the initial error $e_m(t)$ enters the state converter, which calculates $\Delta e_m(t)$ and $\Delta^2 e_m(t)$ and outputs the state vector $S_m(t) = [e_m(t), \Delta e_m(t), \Delta^2 e_m(t)]^T$.
Step 2: The Actor(m) maps the state vector $S_m(t)$ to the three parameters of the PID controller, $K_p$, $K_i$ and $K_d$.
Step 3: The updated controller acts on the environment and receives the reward $r_m(t)$. After $n$ steps, Critic(m) receives $S_m(t+n)$, the state vector of the system. It then produces the value function estimate $V(S_{t+n}; W'_v)$ and the n-step TD error $\delta_{TD}$, which are the basis for updating the parameters. The reward function is given by Formula (2). In the next step, the Actor(m) and the Critic(m) send their own parameters $W'_{am}$, $W'_{vm}$ and the generated $\delta_{TD}$ to the Global Net to update $W_a$ and $W_v$ with the policy gradient and gradient descent. The Global Net then passes $W_a$ and $W_v$ back to Actor(m) and Critic(m) so that they continue learning with the new parameters.
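The state converter of Step 1 can be sketched as follows. This is an illustrative helper rather than the authors' implementation: the name `StateConverter` and the assumption that the error history starts at zero are introduced here.

```python
class StateConverter:
    """Builds the state (e, delta e, delta^2 e) from a stream of tracking errors."""

    def __init__(self):
        self.e1 = 0.0  # e(t-1), assumed zero at start
        self.e2 = 0.0  # e(t-2), assumed zero at start

    def convert(self, e):
        state = (e, e - self.e1, e - 2.0 * self.e1 + self.e2)
        self.e2, self.e1 = self.e1, e  # shift the history for the next call
        return state


converter = StateConverter()
s1 = converter.convert(1.0)
s2 = converter.convert(0.5)
s3 = converter.convert(0.25)
```

Each worker thread would own one such converter, so that the error histories of different agents do not mix.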

A3C learning with neural networks
The multilayer feed-forward neural network [27,28], also known as the BP neural network, is a multilayer feed-forward network trained with the back-propagation algorithm. It has a strong ability for nonlinear mapping and is suitable for problems with complex internal mechanisms. Therefore, the method uses two BP neural networks to learn the policy function and the value function respectively. The network structure is as follows.
As shown in Fig. 3, the Actor network has three layers. The first layer is the input layer. The input vector $S = [e_m(t), \Delta e_m(t), \Delta^2 e_m(t)]^T$ represents the state vector. The second layer is the hidden layer. Its input is the weighted sum of the input-layer values plus a bias, where $k$ indexes the neurons in the hidden layer, $w_{ik}$ is the weight connecting the input layer and the hidden layer, and $b_k$ is the bias of the $k$-th neuron. The output of the hidden layer is obtained by applying the activation function to this input. The third layer is the output layer. Its input is the weighted sum of the hidden-layer outputs plus a bias, where $o$ indexes the neurons in the output layer, $w_{ho}$ is the weight connecting the hidden layer and the output layer, and $b_o$ is the bias of the $o$-th neuron. The Actor network does not output the values of $K_p$, $K_i$ and $K_d$ directly, but outputs the mean and variance of the three parameters. The actual values of $K_p$, $K_i$ and $K_d$ are then drawn from the corresponding Gaussian distribution.

The Critic network structure is similar to that of the Actor network. As shown in Fig. 4, the Critic network is also a three-layer BP neural network. Its first two layers are the same as those of the Actor network. Its output layer has only one node, which outputs the value function $V(S_t; W'_v)$ of the state.

In the A3C structure, the Actor and Critic networks use the n-step TD error method [29,30] to learn the action probability function and the value function. In this method, the n-step TD error $\delta_{TD}$ is calculated as the difference between the estimated value $V(S_t; W'_v)$ of the initial state and the estimated value after $n$ steps. The discount factor $0 < \gamma < 1$ determines the ratio of delayed returns to immediate returns, and $W'_v$ is the weight of the Critic network. The TD error $\delta_{TD}$ reflects the quality of the actions selected by the Actor network, and the performance index of the system learning is defined from it. After calculating the TD error, each AC network in the A3C structure does not update its own network weights directly, but uses its own gradient to update the AC network parameters of the central network (Global Net).
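The n-step TD error can be illustrated with a short Python sketch. It assumes the standard n-step form (discounted rewards over the $n$ steps, plus the discounted bootstrap value $V(S_{t+n}; W'_v)$, minus $V(S_t; W'_v)$); the function name, the sign convention and the sample numbers are assumptions, not values from the experiments.

```python
from typing import Sequence


def n_step_td_error(rewards: Sequence[float], v_start: float,
                    v_end: float, gamma: float = 0.9) -> float:
    """Return sum_k gamma^k * r_k + gamma^n * V(S_{t+n}) - V(S_t)."""
    n = len(rewards)
    # Discounted sum of the n rewards collected between S_t and S_{t+n}.
    n_step_return = sum((gamma ** k) * r for k, r in enumerate(rewards))
    # Bootstrap with the Critic's estimate of the state reached after n steps.
    n_step_return += (gamma ** n) * v_end
    return n_step_return - v_start


# Hypothetical numbers: two rewards of 1.0, V(S_t) = 2.0, V(S_{t+2}) = 1.0.
delta = n_step_td_error([1.0, 1.0], v_start=2.0, v_end=1.0, gamma=0.5)
```

A positive error means the sampled gains did better than the Critic expected, so the Actor should make them more likely; a negative error has the opposite effect.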
In the update formulas, Eqs. (9) and (10), $W_a$ is the weight of the Actor network stored by the central network, $W'_a$ represents the weights of the Actor network in each AC structure, $W_v$ is the weight of the Critic network in the central network, and $W'_v$ represents the Critic network weights of each AC structure. $\alpha_a$ is the learning rate of the Actor, and $\alpha_c$ is the learning rate of the Critic.
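As a hedged sketch, the standard A3C updates written with the symbols of this section would take the following form. The policy symbol $\pi$ and the sampled action $a_t$ are notation assumed here, and the exact expressions used by the authors may differ.

```latex
\begin{aligned}
W_a &\leftarrow W_a + \alpha_a \, \delta_{TD} \, \nabla_{W'_a} \log \pi\left(a_t \mid S_t; W'_a\right) \\
W_v &\leftarrow W_v - \alpha_c \, \nabla_{W'_v} \left( \tfrac{1}{2} \, \delta_{TD}^{2} \right)
\end{aligned}
```

The first line is a policy-gradient ascent step weighted by the TD error, and the second is a gradient-descent step on the squared TD error, which matches the description of updating $W_a$ with the policy gradient and $W_v$ with gradient descent.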

The network initialization of A3C-PID controller
The initial parameters of the network directly affect the stability of the closed-loop control system. However, a PID controller based on a neural network has difficulty obtaining the teacher signal, so the network parameters have to be determined by experience or manual trial. The unsupervised learning characteristics of reinforcement learning enable the controller to obtain the optimal initial parameters of the network through iterative learning. However, the AC-PID controller converges slowly because of the correlation between the learning samples obtained by the AC algorithm. The A3C-PID controller learns the network parameters asynchronously in multiple threads, which breaks the correlation between samples and improves the convergence rate. The learning process of the A3C-PID network parameters is similar to that described in Sect. 3.1; the difference is that A3C-PID sets $m$ to the number of CPU cores during iterative learning and sets $m$ to one during online control.

Fig. 4 Critic network structure

Working process of A3C-PID controller
Based on the asynchronous learning architecture and the learning mode that takes the n-step TD error as the performance index, the working process of the A3C-PID controller is as follows:
(a) Set the sampling period $t_s$, the number of threads $m$ of the A3C algorithm and the update period $n$, and initialize the network parameters of each AC structure through $K$ iterations of learning;
(b) Calculate the system errors and construct the state vectors as inputs to Actor(m) and Critic(m);
(c) Critic(m) outputs $V(S_t; W'_v)$;
(d) Actor(m) outputs the values of $K_p$, $K_i$ and $K_d$. The system then observes the error $e_m(t+1)$ at the next sampling time and calculates the reward $r_m(t)$ according to Eq. (2);
(e) Determine whether to update the parameters of Actor(m) and Critic(m). If the update period $n$ has been reached, the Critic outputs the state value $V(S_{t+n}; W'_v)$ and the system updates the Global Net parameters $W_a$ and $W_v$ according to Eqs. (9) and (10); otherwise, return to step (d);
(f) The Global Net transmits the new parameters $W'_{am}$ and $W'_{vm}$ to each Actor(m) and Critic(m);
(g) Determine whether the end condition is satisfied. If so, exit the control process; otherwise, update $S_m(t)$ and return to step (c).
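The asynchronous exchange in steps (e) and (f) can be sketched with Python threads. This is a toy illustration, not the authors' implementation: `GlobalNet`, `worker` and `toy_gradient` are hypothetical names, and the actor and critic gradients are replaced by the gradient of a simple quadratic loss so that the example stays self-contained and runnable.

```python
import threading


class GlobalNet:
    """Central parameter store shared by all workers (hypothetical helper)."""

    def __init__(self, weights):
        self._weights = list(weights)
        self._lock = threading.Lock()

    def pull(self):
        # Step (f): a worker copies the current global parameters.
        with self._lock:
            return list(self._weights)

    def push(self, grads, lr):
        # Step (e): a worker applies its own gradient to the global parameters.
        with self._lock:
            self._weights = [w - lr * g for w, g in zip(self._weights, grads)]


def toy_gradient(weights, target):
    # Stand-in for the actor/critic gradients: gradient of sum((w - target)^2).
    return [2.0 * (w - t) for w, t in zip(weights, target)]


def worker(global_net, target, updates, lr):
    for _ in range(updates):
        local = global_net.pull()            # synchronise the local copy
        grads = toy_gradient(local, target)  # computed on the local copy only
        global_net.push(grads, lr)           # asynchronous update of the Global Net


def train(m=4, updates=200, lr=0.05):
    target = [1.0, -2.0]                     # arbitrary optimum of the toy loss
    net = GlobalNet([0.0, 0.0])
    threads = [threading.Thread(target=worker, args=(net, target, updates, lr))
               for _ in range(m)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return net.pull(), target


final_weights, target = train()
```

The workers never wait for each other: each one may compute its gradient on slightly stale parameters, which is the trade-off A3C accepts in exchange for parallel, decorrelated experience.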

Simulation experiment of nonlinear signal
To verify the effectiveness and superiority of this algorithm, a nonlinear object is simulated and analyzed under PID, CSP-PID, MN-PID, AC-PID and A3C-PID control respectively. The discrete model of the object is as follows:

$$y(k+1) = f(y(k), y(k-1), y(k-2), u(k), u(k-1)) \quad (11)$$

The input signal is denoted $r_{in}$, and the controllers are compared in Table 1.
The simulation results show that the A3C-PID controller attains the lowest root mean square error (RMSE) and mean absolute error (MAE). Compared with the other controllers, the control accuracy of A3C-PID is higher. This shows not only that the design of the new PID controller is reasonable but also that the controller has better control performance for the nonlinear system.
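For reference, the two accuracy measures can be computed from a reference trajectory and a system output as in the sketch below. The function names and the sample data are made up for illustration and are not taken from Table 1.

```python
import math


def rmse(reference, output):
    """Root mean square of the tracking error."""
    n = len(reference)
    return math.sqrt(sum((r - y) ** 2 for r, y in zip(reference, output)) / n)


def mae(reference, output):
    """Mean absolute tracking error."""
    n = len(reference)
    return sum(abs(r - y) for r, y in zip(reference, output)) / n


# Hypothetical trajectories of four samples.
reference = [1.0, 1.0, 1.0, 1.0]
output = [1.0, 0.0, 1.0, 3.0]
rmse_value = rmse(reference, output)
mae_value = mae(reference, output)
```

RMSE penalises large deviations more heavily than MAE, so reporting both separates occasional large excursions from a uniformly small error.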

Simulation experiment of inverted pendulum
The control of a single inverted pendulum is a classic problem in control studies. The control process applies a force $F$ to the bottom of the car so that the car stays at the set position and the angle between the rod and the vertical line stays within a deviation range. Figure 10 shows a single inverted pendulum. As shown in Fig. 10, the mass of the car is $M$, the mass of the pendulum is $m$, the position of the car is $x$ and the angle of the pendulum is $\theta$. The equations of the single inverted pendulum are given as Eqs. (13) and (14),
where $I = \frac{1}{12} m L^2$, $l = \frac{1}{2} L$, and $F$ represents the force acting on the car, which takes continuous values in $[-10, 10]$. The sampling period is 20 ms. The single inverted pendulum has four control indexes: the pendulum angle, the swing speed, the position of the car and the speed of the car. The simulation specifies the initial conditions, the desired final state and the parameters of the inverted pendulum, where $\mu_c$ denotes the friction coefficient of the car relative to the guide rail and $\mu_p$ denotes the friction coefficient of the rod relative to the car. The parameters of the A3C-PID controller are set as follows: $m = 4$, $\alpha_a = 0.002$, $\alpha_c = 0.01$, $\varepsilon = 0.001$, $\gamma = 0.9$, $n = 50$.

The results of the simulation are shown in Figs. 11 and 12. Figure 11 shows the response of the four control indexes of the inverted pendulum over 10 s. It can be seen from Fig. 11 that under A3C-PID control the four control indexes of the inverted pendulum quickly reach a stable state. Figure 12 shows the outputs of A3C-PID, AC-PID and traditional PID control. It can be seen that the A3C-PID controller has better tracking performance than the traditional PID and AC-PID controllers.

Closed loop control structure of stepping motor
The stepper motor is a low-speed permanent magnet synchronous motor. It is not driven merely as a pulse-sequence input device; rather, in a digital control system it acts as an angle-actuating element by changing its excitation state. For high-precision positioning in closed-loop control, the stepper motor is usually fitted with a photoelectric encoder, a rotary transformer or another measuring feedback element. The block diagram of the closed-loop servo control system is shown in Fig. 13. As Fig. 13 shows, the inner loop includes the current loop and the speed loop. The current loop tracks the current of the two-phase hybrid stepping motor so that the motor outputs torque smoothly under micro-stepping. The speed loop makes the load track the set speed and thus achieves speed control. The outer loop is the position loop, which makes the load output track a given position. The position loop controller usually adopts PID control. Therefore, we added the A3C-PID controller to the position loop to test its validity.

Modeling and simulation of two-phase hybrid stepping motor
In this paper, a two-phase hybrid stepping motor is the controlled object in the simulation experiment. First, a mathematical model must be established. However, the two-phase hybrid stepping motor is a highly nonlinear electromechanical device, so it is difficult to describe accurately. Therefore, this paper studies a simplified mathematical model under the following assumptions: (a) the flux linkage of the permanent magnet in the phase windings varies sinusoidally with the rotor position; (b) magnetic hysteresis and eddy current effects are not considered, and only the mean and fundamental components of the air-gap permeance are considered; (c) the mutual inductance between the two phase windings is ignored. Under these assumptions, the mathematical model of the two-phase hybrid stepping motor can be described by Eqs. (17)-(21).

$$T_e = -k_e i_a \sin(N_r \theta) + k_e i_b \cos(N_r \theta) \quad (19)$$

In the above formulas, $u_a$, $u_b$ and $i_a$, $i_b$ are the voltages and currents of phases A and B respectively. $R$ is the winding resistance. $L$ is the winding inductance. $k_e$ is the torque coefficient. $\theta$ and $\omega$ are the rotation angle and angular velocity of the motor respectively. $N_r$ is the number of rotor teeth. $T_e$ is the electromagnetic torque of the hybrid stepping motor. $T_L$ is the load torque. $J$ and $B$ are the load moment of inertia and the viscous friction coefficient respectively. The mathematical model shows that the two-phase hybrid stepping motor is still a highly nonlinear and coupled system even under this series of simplifying conditions. The simulation model of the two-phase hybrid stepping motor servo control system is built with Simulink in MATLAB. The simulation is shown in Fig. 14, and the simulation parameters are listed in Table 2.
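The motor model can be sketched as a state-derivative function in Python. Only the torque expression of Eq. (19) is taken from the text; the voltage and motion equations below follow the commonly used form of the simplified two-phase hybrid stepper model and are an assumption here, as are the function name and all parameter values (they are not the values of Table 2).

```python
import math


def stepper_derivatives(state, u_a, u_b, load_torque, p):
    """Return (d i_a, d i_b, d omega, d theta) and the electromagnetic torque."""
    i_a, i_b, omega, theta = state
    s = math.sin(p["Nr"] * theta)
    c = math.cos(p["Nr"] * theta)
    # Phase voltage equations with back-EMF terms (assumed standard form).
    di_a = (u_a - p["R"] * i_a + p["ke"] * omega * s) / p["L"]
    di_b = (u_b - p["R"] * i_b - p["ke"] * omega * c) / p["L"]
    # Electromagnetic torque, Eq. (19).
    torque = -p["ke"] * i_a * s + p["ke"] * i_b * c
    # Mechanical equations (assumed standard form).
    d_omega = (torque - p["B"] * omega - load_torque) / p["J"]
    d_theta = omega
    return (di_a, di_b, d_omega, d_theta), torque


# Hypothetical parameters, for illustration only.
params = {"R": 1.0, "L": 0.002, "ke": 0.5, "Nr": 50, "J": 1e-4, "B": 1e-3}
derivs, torque = stepper_derivatives(
    state=(0.0, 2.0, 0.0, 0.0), u_a=0.0, u_b=0.0, load_torque=0.0, p=params)
```

The products of currents with $\sin(N_r\theta)$ and $\cos(N_r\theta)$ are what make the system nonlinear and coupled, even after the simplifying assumptions.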
The dynamic performance of the A3C, BP and AC adaptive PID controllers is shown in Fig. 15. Early in the simulation (20 cycles), the BP-PID controller has a faster response and a shorter rise time (12 ms), but a higher overshoot of 2.1705%. In contrast, the AC-PID and A3C-PID controllers have smaller overshoots of 0.1571% and 0.1021% respectively. However, the adjustment time of AC-PID is long (48 ms), and its rise time is 21 ms. By comparison, the A3C-PID controller has better stability and rapidity.

Fig. 13 The closed-loop servo control system of hybrid stepping motor
Fig. 14 The simulation of servo system

Figure 16 shows the adaptive adjustment of the A3C-PID controller parameters. As can be seen from Fig. 16, the A3C-PID controller adjusts the PID parameters according to the errors in different periods. At the beginning of the simulation, the tracking error of the system is large. To ensure a fast response, $K_p$ increases continuously while $K_d$ decreases. To prevent a high overshoot, the increase of $K_i$ is limited. As the error decreases, $K_p$ begins to decrease and $K_i$ gradually increases to eliminate the cumulative error, which at the same time causes a small amount of overshoot. Since $K_d$ has a large influence on the system at this stage, it tends to be stable. When the tracking error finally reaches zero, $K_p$, $K_i$ and $K_d$ reach a steady state.
Simulation results show that the A3C-PID controller has good adaptive capabilities.
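The overshoot, rise time and adjustment time discussed for Fig. 15 can be extracted from a sampled step response as in the sketch below. The definitions used (10% to 90% rise time, a 2% settling band, a positive target) are common conventions assumed here; the text does not state which definitions the authors used, and the sample response is invented.

```python
def step_metrics(times, outputs, target, band=0.02):
    """Return (overshoot in percent, rise time, settling time) of a step response."""
    # Overshoot: how far the peak exceeds the target, relative to the target.
    overshoot = max(0.0, (max(outputs) - target) / target * 100.0)
    # Rise time: from first reaching 10% of the target to first reaching 90%.
    t10 = next(t for t, y in zip(times, outputs) if y >= 0.1 * target)
    t90 = next(t for t, y in zip(times, outputs) if y >= 0.9 * target)
    # Settling time: first sample after the last excursion outside the band.
    last_outside = -1
    for idx, y in enumerate(outputs):
        if abs(y - target) > band * target:
            last_outside = idx
    settled = last_outside + 1 < len(times)
    settling = times[last_outside + 1] if settled else None
    return overshoot, t90 - t10, settling


# Hypothetical step response sampled once per millisecond.
times = [0, 1, 2, 3, 4, 5, 6]
outputs = [0.0, 0.5, 0.95, 1.1, 1.01, 1.0, 1.0]
overshoot, rise_time, settling_time = step_metrics(times, outputs, target=1.0)
```

With these conventions a controller can have a short rise time and still settle late if it overshoots, which is the trade-off visible between BP-PID and the two reinforcement learning controllers.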
The reward value curves of AC-PID and A3C-PID are shown in Fig. 17. The goal of reinforcement learning is to learn the best policy to maximize the reward value $U$, which is calculated as in Eq. (22). From the analysis of Fig. 17, we can conclude that the A3C-PID controller has a higher $U$ value after 3000 iterations than the AC-PID controller. In addition, the $U$ value of A3C-PID becomes stable after about 1800 iterations, while AC-PID converges only after 2500 iterations. Thus, A3C-PID converges faster than AC-PID.

Conclusions
Machine learning and intelligent algorithms have been well applied in many industrial fields [31-36]. The purpose of this paper has been to present our efforts to improve the convergence and adaptability of the adaptive PID controller. In this paper, a new PID controller based on the A3C algorithm is proposed. The controller uses BP neural networks to approximate the policy function and the value function. BP neural networks have a strong ability for nonlinear mapping, which enhances the adaptive ability of the controller. The learning speed of the A3C-PID controller is accelerated by parallel training on multiple CPU threads. Asynchronous multi-thread training reduces the correlation of the training data and makes the controller more stable and adaptable. Our experiments on the nonlinear signal and the inverted pendulum demonstrate that the A3C-PID controller has higher control accuracy than the other PID controllers. The experiments on the position control of the two-phase hybrid stepping motor show that the A3C-PID controller performs well in overshoot, rise time, steady-state error and adjustment time. These results confirm the effectiveness and practical significance of the new method. Our future aim is to apply the controller to multi-axis motion control and to actual industrial production.
Acknowledgements This work was supported by National Science and Technology Major Project of China (Grant Number 2017ZX05009-001).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.