Reinforcement learning-based particle swarm optimization for sewage treatment control

To address the high energy consumption of activated sludge wastewater treatment, a reinforcement learning-based particle swarm optimization (RLPSO) is proposed to optimize the control set-points of the sewage process. The algorithm exploits valid historical information to guide the behavior of particles through a reinforcement learning strategy. First, an elite network is constructed by selecting elite particles and recording their successful search behavior. The network is then trained and evaluated to effectively predict the particle velocity. In the periodic wastewater treatment process, RLPSO runs repeatedly according to the optimization cycle. Finally, RLPSO was tested on Benchmark Simulation Model 1 (BSM1) of sewage treatment, and the simulation results showed that it can effectively reduce energy consumption while ensuring qualified water quality. Furthermore, the performance of RLPSO was analyzed on higher-dimensional benchmark functions, which verifies the effectiveness of the algorithm and opens the possibility of applying RLPSO to a wider range of problems.


Introduction
The activated sludge method is a biological treatment method commonly used in wastewater treatment processes (WWTP) [1,2]. Through biochemical reactions, the pollutants in the sewage are adsorbed, decomposed and oxidized, so that they are degraded and separated from the sewage to achieve purification [3][4][5][6]. To ensure that the effluent quality reaches the standard, the aeration tank must be supplied with appropriate oxygen through a blower to maintain the dissolved oxygen concentration (S_O) in the aerobic zone, and a reflux pump is used to maintain the nitrate nitrogen concentration (S_NO) in the anoxic zone [7]. However, the operation of the blower and reflux pump consumes a large amount of energy, which inevitably increases the operating cost. At the same time, from the perspective of the biochemical reaction mechanism, suitable S_O and S_NO help to ensure the successful progress of the nitrification and denitrification reactions [8,9]. Therefore, it is necessary to dynamically optimize S_O and S_NO and construct a control strategy aiming to reduce the energy consumption (EC) of the sewage treatment process on the premise of ensuring qualified effluent quality (EQ).
With their characteristics of nonlinearity, time variation and strong coupling, the control issues in the WWTP have been extensively investigated. The main challenge of the WWTP is to construct an optimal control strategy that reduces EC while ensuring qualified EQ. For example, Vrečko presented a PI-based control strategy including feedforward control and a step-feed procedure, which was applied to the WWTP [10]. Furthermore, Vrečko et al. presented a model predictive controller (MPC) for ammonia nitrogen, which gives better results in terms of ammonia removal and aeration energy consumption than a PI controller [11]. Mulas proposed a dynamic matrix-based predictive control algorithm, which is able to decrease energy costs while reducing ammonia peaks and the nitrate concentration [12]. Han et al. proposed an efficient self-organizing sliding-mode controller (SOSMC) to suppress the disturbances and uncertainties of the WWTP [13]. However, in the above algorithms, the set values of the key variables in the sewage process are fixed or changed according to preset trajectories, without considering the real-time influence of sewage quality and flow rate.
Sewage treatment is a complex dynamic reaction process. To reduce EC while meeting EQ standards, more and more intelligent algorithms have been presented to dynamically optimize the set values of key variables in the WWTP. For example, Hakanen et al. designed an interactive multiobjective wastewater treatment software based on differential evolution (DE), using variables such as the S_O set-point in the last aerobic zone and the methanol dose as decision variables [14]. Han et al. proposed a Hopfield neural network method (HNN) based on Lagrange multipliers for the optimal control of a pre-denitrification WWTP [15]. Yang used an artificial immune network-based combinatorial optimization algorithm (Copt-aiNet) to determine the optimal set values of S_O and S_NO [16]. In [17], an adaptive multi-objective evolutionary algorithm based on decomposition (AMOEA/D) is developed, with EC and EQ as the objectives to be optimized.
However, sewage treatment is a cyclical process: optimization calculations must be performed at intervals, which results in a high fitness evaluation (FEs) cost. In the above intelligent control algorithms, sewage treatment information is not fully utilized: subsequent optimizations do not extract useful information from previous optimization processes, and previous optimizations play no guiding role for subsequent ones.
In the cycle optimization process, information storage and reuse can improve computing efficiency and the sewage treatment effect. Inspired by reinforcement learning as discussed in [18,19], and considering the simple operation and fast convergence of particle swarm optimization (PSO) [20][21][22][23], we propose a wastewater treatment control method based on reinforcement learning particle swarm optimization (RLPSO). This method introduces a reinforcement learning strategy into the particle update. First, elite particles are selected, their concentration set values and adjustment trends are recorded, and an elite particle set is constructed. Then an elite network is trained and used as the strategy function to predict the particle velocity. Finally, a simplified evaluation method is used to calculate the state value function, which in turn is used to update the elite network model.
The remainder of the paper is organized as follows. The next section introduces the international Benchmark Simulation Model 1 (BSM1) of WWTP and optimization objective function. The subsequent section describes RLPSO in detail. Then the experiment results and analysis are shown. The final section provides the conclusion and outlook.

Wastewater treatment processes optimization
In the WWTP, the main reactions are carried out in a biological reactor and a secondary sedimentation tank. The biological reactor consists of five units: the first two are anoxic zones, which mainly complete the denitrification reaction, while the last three are aerobic zones, which mainly complete the nitrification reaction. To evaluate and compare different optimal control strategies, the Benchmark Simulation Model 1 (BSM1) [24][25][26] was developed by the IWA (International Water Association) and COST (European Cooperation in the Field of Science and Technology), as shown in Fig. 1. In BSM1, there are two control loops, for S_O and S_NO. The first control loop tunes the dissolved oxygen concentration S_O in the fifth unit by changing the oxygen transfer coefficient K_La5. The second control loop tunes the nitrate nitrogen level S_NO in the second unit by changing the internal recirculation flow rate Q_a. Both control loops adopt proportional-integral (PI) controllers. However, due to the influence of weather and users, the sewage quality keeps changing. If S_O or S_NO is set to a constant value, it is difficult to maintain the optimal balance between EQ and EC. Therefore, it is necessary to dynamically optimize the set values of S_O and S_NO and construct an optimized control strategy aiming to reduce EC on the premise of ensuring qualified EQ.

Fig. 1 The architecture of the BSM1
Aeration energy (AE) and pumping energy (PE) account for more than 70% of the total energy consumption, so the EC of the optimization problem is defined as the sum of AE and PE:

EC = AE + PE.  (1)

According to the BSM1 mechanism model, AE and PE are defined as follows, respectively [27]:

AE = (S_O,sat / (T · 1.8 · 1000)) ∫_t^(t+T) Σ_(i=1)^5 V_i · K_La,i(t) dt,  (2)

PE = (1/T) ∫_t^(t+T) (0.004 · Q_a(t) + 0.008 · Q_r(t) + 0.05 · Q_w(t)) dt,  (3)

where K_La,i is the oxygen transfer coefficient and V_i is the volume of the ith biological reactor, respectively; S_O,sat is the saturation concentration of oxygen; T is the evaluation cycle; Q_a, Q_r, and Q_w denote the internal recycle flow rate, the return sludge recycle flow rate and the waste sludge flow rate, respectively; and Z_a, Z_r, and Z_w are the corresponding component concentrations.
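As a quick numerical check, AE and PE can be approximated from sampled K_La and flow-rate trajectories by trapezoidal integration, using the standard BSM1 definitions above. This is only an illustrative sketch: the constant profiles and flow rates below are made-up inputs, not BSM1 simulation data (only the reactor volumes are the BSM1 values).

```python
def _trapz(y, x):
    """Trapezoidal integral of samples y over abscissae x."""
    return sum((y[i] + y[i + 1]) / 2.0 * (x[i + 1] - x[i])
               for i in range(len(x) - 1))

def aeration_energy(t, kla, volumes, so_sat=8.0):
    """AE = S_O,sat / (T*1.8*1000) * integral of sum_i V_i*K_La,i(t) dt.
    t: sample times in days; kla: per-sample lists of K_La,i (1/day)."""
    T = t[-1] - t[0]
    integrand = [sum(v * k for v, k in zip(volumes, row)) for row in kla]
    return so_sat / (T * 1.8 * 1000.0) * _trapz(integrand, t)

def pumping_energy(t, qa, qr, qw):
    """PE = (1/T) * integral of (0.004*Q_a + 0.008*Q_r + 0.05*Q_w) dt."""
    T = t[-1] - t[0]
    integrand = [0.004 * a + 0.008 * r + 0.05 * w
                 for a, r, w in zip(qa, qr, qw)]
    return _trapz(integrand, t) / T

# Hypothetical constant profiles over a 7-day horizon (not BSM1 data)
t = [i * 0.25 for i in range(29)]               # 0 .. 7 days
kla = [[0.0, 0.0, 240.0, 240.0, 84.0]] * 29     # only aerobic units aerated
V = [1000.0, 1000.0, 1333.0, 1333.0, 1333.0]    # BSM1 reactor volumes (m^3)
ae = aeration_energy(t, kla, V)
pe = pumping_energy(t, [55338.0] * 29, [18446.0] * 29, [385.0] * 29)
ec = ae + pe                                     # Eq. (1)
```

In a real evaluation the K_La and flow-rate profiles would come from the BSM1 simulator sampled over the evaluation cycle T.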
EQ represents the fine to be paid for discharging pollutants into the receiving water body. According to the definition of BSM1, the equation of EQ is [27]

EQ = (1/(T · 1000)) ∫_t^(t+T) [2 · SS(t) + COD(t) + 30 · S_NKj(t) + 10 · S_NO(t) + 2 · BOD_5(t)] · Q_e(t) dt,  (4)

where SS, COD, S_NO, S_NKj and BOD_5 are the effluent suspended solids concentration, chemical oxygen demand, nitrate nitrogen concentration, Kjeldahl nitrogen concentration, and biochemical oxygen demand, respectively, and Q_e is the effluent flow rate. The EQ value will impact the operating cost of the WWTP if the effluent discharge fee is executed strictly.
In addition to EQ, five effluent parameters should meet the following standards specified in BSM1 [28]:

N_tot < 18 g N/m^3, COD < 100 g COD/m^3, S_NH < 4 g N/m^3, SS < 30 g SS/m^3, BOD_5 < 10 g BOD/m^3,  (5)

where N_tot = S_NO + S_NKj and S_NH denotes the effluent ammonium concentration.
In summary, the constrained objective optimization function of the WWTP is

min f = c · EC + EQ,  (6)

subject to the effluent standards above, where c is a weight coefficient and the set values of S_O and S_NO are the decision variables. Since sewage treatment is a dynamic and periodic optimization process, we propose an RLPSO control strategy that minimizes the objective function (6) by dynamically adjusting the set values of S_O and S_NO, so as to improve the sewage treatment efficacy and reduce the operating cost.
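One common way to fold the effluent standards into the scalar objective (6) is a penalty term. The sketch below illustrates this; the weight c and the penalty factor are assumptions for illustration, since the paper does not spell out its constraint-handling scheme.

```python
def wwtp_fitness(ec, eq, s_nh, n_tot, ss, bod5, cod, c=0.5, penalty=1e6):
    """f = c*EC + EQ, plus a large penalty for any violated BSM1 effluent
    limit (N_tot<18, COD<100, S_NH<4, SS<30, BOD5<10). The weight c and
    the penalty factor are illustrative, not values from the paper."""
    limits = [(n_tot, 18.0), (cod, 100.0), (s_nh, 4.0),
              (ss, 30.0), (bod5, 10.0)]
    violation = sum(max(0.0, value - limit) for value, limit in limits)
    return c * ec + eq + penalty * violation

# Feasible operating point vs. one that exceeds the S_NH limit of 4 mg/L
feasible = wwtp_fitness(3680.0, 6100.0, 3.1, 15.2, 21.0, 8.0, 85.0)
infeasible = wwtp_fitness(3650.0, 6000.0, 4.19, 15.2, 21.0, 8.0, 85.0)
```

With this formulation, a solution such as PSO's S_NH = 4.19 mg/L (reported later in the BSM1 experiments) would be heavily penalized even though its raw EC is low.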

Reinforcement learning-based particle swarm optimization
Particle swarm optimization

PSO originated from the study of the preying behavior of bird flocks; its basic idea is that the whole flock tends to follow the bird that has found the best path to food [29]. To search for an optimum, PSO defines a swarm of particles to represent potential solutions to an optimization problem. Each particle begins at a random initial position and flies through the D-dimensional solution space. The flying behavior of each particle is described by its velocity and position as follows:

v_id(k+1) = ω · v_id(k) + c_1 · r_1 · (p_id − x_id(k)) + c_2 · r_2 · (p_gd − x_id(k)),  (7)

x_id(k+1) = x_id(k) + v_id(k+1),  (8)

where X_i = (x_i1, x_i2, …, x_id, …, x_iD) is the position vector of the ith particle; P_i = (p_i1, p_i2, …, p_id, …, p_iD) is the best position found by the ith particle; P_g = (p_g1, p_g2, …, p_gd, …, p_gD) is the global best position found by the whole swarm; c_1 and c_2 are two learning factors, usually c_1 = c_2 = 2 [29]; r_1 and r_2 are random numbers in (0, 1) [30]; and ω is the inertia weight controlling the velocity, which may decrease linearly from 0.9 to 0.4 or be fixed in (0, 1) [30].
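Eqs. (7) and (8) translate directly into code. A minimal per-particle update sketch (the function name and list-based representation are mine, not the paper's):

```python
import random

def pso_step(x, v, pbest, gbest, w=0.7, c1=2.0, c2=2.0, rng=random):
    """One standard PSO update: Eq. (7) for the velocity, Eq. (8) for the
    position. x, v, pbest are length-D lists; gbest is the swarm best."""
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        vd = (w * v[d]
              + c1 * r1 * (pbest[d] - x[d])     # cognitive (self) term
              + c2 * r2 * (gbest[d] - x[d]))    # social (swarm) term
        new_v.append(vd)
        new_x.append(x[d] + vd)
    return new_x, new_v
```

Note that when a particle sits at both its personal and the global best with zero velocity, both attraction terms vanish and the particle stays put, which is why diversity at initialization matters.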
WWTP optimization is a periodic process: the WWTP is a complex system with large lag that is difficult to optimize in real time, so a cycle time is set and an optimization calculation is carried out in each cycle. However, PSO uses random initialization to improve diversity and considers only the individual optimum and the global optimum during the search, ignoring the inherent properties of the system. If PSO is applied to the WWTP directly, information from previous cycles provides no guidance for subsequent optimization, which leads to low efficiency. To improve the treatment effect, it is necessary to record the influence of the set values of S_O and S_NO on the sewage parameters and reuse this information in the optimization process, providing reference data for the next optimization calculation. We therefore add a prediction item to Eq. (7):

v_id(k+1) = ω · v_id(k) + c_1 · r_1 · (p_id − x_id(k)) + c_2 · r_2 · (p_gd − x_id(k)) + r · ṽ_id,  (9)

where ṽ_id is the dth dimensional predicted velocity of particle i given by the strategy function μ, and r is the prediction coefficient. According to Eq. (9), the velocity of a particle is determined by four parts: the inertial velocity, the individual historical optimum, the global optimum, and the prediction item. On the one hand, this retains the advantages of PSO, namely self-cognition and social learning; on the other hand, the prediction item infuses PSO with historical information, making it more suitable for repeated cycle optimization problems. To determine the prediction item ṽ_id, we introduce a reinforcement learning (RL) [31] strategy into PSO.
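The modified update of Eq. (9) adds the prediction term to the same loop. In the sketch below, `predict` stands in for the strategy function μ; the parameter values are placeholders:

```python
import random

def rlpso_velocity(x, v, pbest, gbest, predict,
                   w=0.4, c1=2.0, c2=2.0, r=0.3, rng=random):
    """Eq. (9): standard PSO velocity update plus the prediction item
    r * mu(x), where `predict` maps a position to a predicted velocity."""
    v_pred = predict(x)                 # strategy function mu
    new_v = []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        new_v.append(w * v[d]
                     + c1 * r1 * (pbest[d] - x[d])
                     + c2 * r2 * (gbest[d] - x[d])
                     + r * v_pred[d])   # historical-information term
    return new_v
```

Unlike plain PSO, a particle resting at the global best still receives a nudge of r·ṽ from the prediction term, so stored history keeps steering the swarm.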

Reinforcement learning strategy
Reinforcement learning interacts with the environment through a trial-and-error mechanism and learns optimal strategies by maximizing cumulative rewards. Reinforcement learning agent mainly includes four basic elements: environment, state (s), action (a) and reward (R) [31]. During operation, the agent determines an action a according to the current state s through the strategy function μ, executes the action, and enters the next state. At the same time, the system returns the value R to reward or punish the action. The process runs repeatedly to maximize the expected benefits of the agent.
In a similar way, reinforcement learning-based PSO (RLPSO) includes the four basic elements shown in Fig. 2. In this paper, the agent is a particle in the population and the environment is the WWTP. The state s is the position X of each particle in the population; the action a is the velocity prediction, determined by the strategy function μ. The reward value R is related to the fitness value f of the optimization problem. Therefore, to obtain the predicted particle velocity, we need to establish the strategy function μ according to the reward value R.
In RLPSO, the particle agent predicts the velocity according to the strategy function μ:

ṽ_id = μ(x_id).  (10)

In this paper, the strategy function μ is realized as an elite network model, which is trained by learning the information of elite particles. The process is divided into three steps: elite particle set construction, strategy function training, and elite network model evaluation. The details are described as follows.

Elite particle set construction
The elite network model is trained with elite particle information to guide the search of the offspring population. The first step in the kth iteration is to select elite particles based on the reward value R(k). During the iteration process, the reward value R(k) is determined according to the variation of the fitness value:

R(k) = 1 if f(k+1) < f(k); R(k) = −1 otherwise,  (11)

where f(k) is the fitness value at the kth iteration, k = 0, …, K − 1, and K is the maximum number of iterations per run. Only particles with reward value R(k) = 1 are selected as elite particles; the position X_i(k) before the update and the velocity V_i(k) after the update of each elite particle are then saved to construct the elite particle set Ω_e.
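The construction step amounts to keeping (pre-update position, post-update velocity) pairs for exactly those particles whose fitness improved. A minimal sketch (function and variable names are mine):

```python
def select_elites(old_positions, new_velocities, f_before, f_after):
    """Elite selection: reward R(k) = 1 iff fitness improved after the
    update; each elite stores the position before the update paired with
    the velocity after the update."""
    elites = []
    for x, v, fb, fa in zip(old_positions, new_velocities, f_before, f_after):
        if fa < fb:                     # R(k) = 1: this move was successful
            elites.append((list(x), list(v)))
    return elites
```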

Strategy function training
The elite particle set Ω_e stores the position x of each elite particle before the update and its velocity v after the update. RLPSO uses a limited-capacity elite particle set. Suppose Ω_e currently contains N_e elite particles, and Ω′_e is the newly generated elite particle set with N′_e particles. If N_e + N′_e exceeds the capacity N_em, all elite particles in Ω_e ∪ Ω′_e are sorted according to fitness value, only the best N_em items are stored back into Ω_e, and the original data are overwritten. The elite particle set Ω_e serves as the training data set: the particle position is the input and the velocity is the output, and a neural network is trained on it to obtain the elite network model Φ. The trained elite network model Φ is used as the strategy function μ to guide the particles. With the elite network model Φ, the particle velocity can be predicted from the particle position X_i:

ṽ_id = Φ(x_id).  (12)
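A compact sketch of the capacity-limited elite set and the trained predictor follows. The paper trains a neural network Φ; as a self-contained stand-in, the predictor below simply returns the stored velocity of the nearest elite position (a 1-nearest-neighbour "model" — an assumption for illustration, not the paper's architecture):

```python
def merge_elites(elite_set, new_elites, fitness, capacity):
    """Merge old and new elites, sort by fitness of the stored position,
    and keep only the best `capacity` entries (the limited-capacity set)."""
    merged = sorted(elite_set + new_elites, key=lambda e: fitness(e[0]))
    return merged[:capacity]

def make_predictor(elite_set):
    """Stand-in for the elite network Phi: given a position x, return the
    stored velocity of the nearest elite position."""
    def phi(x):
        d2 = lambda p: sum((pi - xi) ** 2 for pi, xi in zip(p, x))
        return min(elite_set, key=lambda e: d2(e[0]))[1]
    return phi
```

In the paper's setting, `make_predictor` would be replaced by fitting a neural network that maps positions to velocities over the same data set.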

Elite network model evaluation
As the elite particle set Ω_e is continuously updated, the elite network model is evaluated after each training. During the evaluation process, the new model and the original model are each used to guide the particle optimization process. To better reflect the influence of the strategy function μ on the particle velocity update, the RLPSO velocity update equation is simplified to keep only the inertia and prediction terms:

v_id(k+1) = ω · v_id(k) + r · ṽ_id.  (13)

When the termination condition k ≥ K is satisfied, the optimal fitness value obtained under the guidance of the new elite network model is denoted f*_1, and that obtained with the original network model is denoted f*_2. If f*_1 < f*_2, the prediction effect of the new network model is better, and the new network reward value is set to R(K) = 1 after the iteration; otherwise R(K) = −1.
Considering the randomness of the particles, the above evaluation process is repeated M times to estimate the state value function

V̂(X) = (1/M) Σ_(m=1)^M R_m(K),  (14)

where V̂(X) represents the average reward obtained after particle X moves under the strategy function μ. If V̂(X) > 0, the new model is considered better than the original model and replaces it; otherwise, the original network is retained. By comparing the two models, we determine the prediction model required by the subsequent algorithm.
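The repeated paired evaluation can be sketched as follows. Here `run_trial` stands for one simplified optimization run returning the best fitness under a given model; it is an assumed callback for illustration, not an API from the paper.

```python
def state_value(run_trial, new_model, old_model, M=5):
    """Estimate the state value: average of M paired rewards, +1 when the
    new model's best fitness beats the old model's (minimization), else -1."""
    total = 0
    for _ in range(M):
        f1 = run_trial(new_model)       # f*_1 under the new elite network
        f2 = run_trial(old_model)       # f*_2 under the original network
        total += 1 if f1 < f2 else -1
    return total / M

def pick_model(run_trial, new_model, old_model, M=5):
    """Replace the original model only when the estimated value is positive."""
    if state_value(run_trial, new_model, old_model, M) > 0:
        return new_model
    return old_model
```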

Algorithm procedure
The algorithm procedure of RLPSO is described below.

RLPSO
1. Initialize the particle positions X_i and velocities V_i, i = 1, 2, ..., N.
2. Let Run = 1. Update the particle positions and velocities according to Eqs. (7) and (8). During the iterations (k < K), select the particles with reward value R(k) = 1 as elite particles, establish the elite particle set Ω_e, and train the elite network model Φ.
3. Randomly generate N particles. Let r ≠ 0; use the elite network model Φ to predict the particle velocities, and update the particle positions and velocities according to Eqs. (8) and (9).
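Putting the steps together, below is a runnable toy version of the procedure, with a nearest-neighbour stand-in for the elite network Φ and a simple sphere objective instead of the BSM1 fitness; all parameter values and names are illustrative assumptions:

```python
import random

def rlpso(f, dim=2, n=10, k_max=40, runs=3, w=0.4, c1=2.0, c2=2.0, r=0.3,
          bounds=(-5.0, 5.0), n_em=50, seed=0):
    """Toy RLPSO: the first run is plain PSO that fills the elite set;
    later runs add the prediction term r*Phi(x). The 'elite network' is a
    nearest-neighbour stand-in, not a trained neural network."""
    rng = random.Random(seed)
    lo, hi = bounds
    elites = []                                   # (position, velocity) pairs

    def mu(x):                                    # stand-in strategy function
        if not elites:
            return [0.0] * dim
        d2 = lambda e: sum((pi - xi) ** 2 for pi, xi in zip(e[0], x))
        return min(elites, key=d2)[1]

    best_per_run = []
    for run in range(runs):
        X = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
        V = [[0.0] * dim for _ in range(n)]
        P = [x[:] for x in X]
        Pf = [f(x) for x in X]
        g = min(range(n), key=lambda i: Pf[i])
        G, Gf = P[g][:], Pf[g]
        pred = 0.0 if run == 0 else r             # no prediction in run 1
        for _ in range(k_max):
            for i in range(n):
                vp = mu(X[i])
                f_old, x_old = f(X[i]), X[i][:]
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    V[i][d] = (w * V[i][d] + c1 * r1 * (P[i][d] - X[i][d])
                               + c2 * r2 * (G[d] - X[i][d]) + pred * vp[d])
                    X[i][d] += V[i][d]
                fi = f(X[i])
                if fi < f_old:                    # reward R(k)=1: save elite
                    elites.append((x_old, V[i][:]))
                if fi < Pf[i]:
                    P[i], Pf[i] = X[i][:], fi
                    if fi < Gf:
                        G, Gf = X[i][:], fi
            elites.sort(key=lambda e: f(e[0]))    # capacity-limited elite set
            del elites[n_em:]
        best_per_run.append(Gf)
    return best_per_run

sphere = lambda x: sum(xi * xi for xi in x)
results = rlpso(sphere)
```

With the first run acting as plain PSO, the elite set built there seeds the prediction term of later runs, mirroring the cycle-to-cycle information reuse that distinguishes RLPSO from restarting PSO from scratch each cycle.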

Simulation experiment of RLPSO based on BSM1
The proposed RLPSO is simulated on the BSM1 platform and compared with a PI controller, CPSO [32], SLPSO [33], PSO [34], APSO [35], DE [36], HNN [15], Copt-aiNet [16] and AMOEA/D [17]. The simulation conditions are based on the good (dry) weather data in BSM1. The parameters of the HNN, Copt-aiNet and AMOEA/D algorithms are taken from their original papers; the parameters of the other algorithms are set as follows.
The simulation data cover 14 days, the sampling interval is 15 min, and the optimization period is 2 h, so a total of 168 runs are conducted for each algorithm. One difficulty in employing RLPSO on BSM1 is the huge time cost of fitness evaluations (FEs): the algorithm must not only satisfy the optimization accuracy but also converge quickly. Therefore, for RLPSO, the population size N is set to 10, r = 0.3, ω = 0.4, D = 2, and K_max = 40; c_1 and c_2 are 2. The total number of FEs is thus 67,200 (10 particles × 40 iterations × 168 runs). Early experiments show that RLPSO with these parameter settings is nearly convergent by the 40th iteration and meets the requirements of EQ and EC. For the convenience of comparison, the inertia weights of the other PSO-based algorithms are all set to 0.7. In the DE algorithm, the mutation rate is 0.5 and the crossover probability is 0.9. The population size and iteration number of these algorithms are the same as for RLPSO. Table 1 shows the comparison of EQ and EC under the various strategies. As can be seen from Table 1, compared with the PI strategy, all of the intelligent algorithms reduce EC by optimizing the set values of S_O and S_NO. Among them, the EC obtained by the PSO algorithm, 3652.40 kWh/d, is lower than that of RLPSO. However, the S_NH concentration obtained by PSO is 4.19 mg/L, which exceeds the limit of 4 mg/L; the S_NH concentrations obtained by DE and APSO also exceed the standard. Besides, the EC obtained by CPSO, SLPSO, HNN, and Copt-aiNet is obviously higher than that of RLPSO, which shows that RLPSO is superior to these algorithms.
We can also see from Table 1 that the EC obtained by AMOEA/D is slightly lower than that of RLPSO, but its EQ is higher; the performance of the two algorithms is comparable. It should be noted, however, that in the AMOEA/D strategy the population size N is 100 and K_max = 300, so in each optimization cycle the FEs of RLPSO amount to just 1/75 of those of AMOEA/D (10 × 40 versus 100 × 300). RLPSO obtains an EC similar to AMOEA/D with significantly fewer FEs, which proves that RLPSO is more suitable for the sewage treatment process.

RLPSO simulation experiment based on benchmark functions
To further study the performance of RLPSO, the algorithm is analyzed on high-dimensional general benchmarks. Six benchmark functions of different types (Rastrigin, Griewank, Ellipsoid, Rosenbrock, Sphere, and Ackley) are used to compare RLPSO with CPSO, SLPSO, PSO, APSO and DE. The population size is N = 10, and the dimension D is set to 10 and 20, respectively. Each algorithm runs 50 times, with a maximum of 200 iterations per run. The other parameters of the algorithms are the same as in the BSM1 experiment. In the experiments, f*_j denotes the optimal fitness value obtained in the jth run, j = 1, 2, …, 50. Figures 3, 4, 5, 6, 7 and 8 show boxplot comparisons of f* obtained by the various algorithms; as can be seen from these figures, the f* values of the comparison algorithms fluctuate significantly across runs. However, as can be seen from Figs. 9a-12a, the f* of RLPSO tends to converge. This is because RLPSO relies on the elite network to transmit information between different runs, so earlier optimization guides later optimization. It should be noted that, as can be seen from Figs. 13a-14a and 9b-14b, RLPSO still fluctuates, because an elite network with a fixed structure was used in training, which degrades RLPSO performance on more complex or higher-dimensional benchmarks. Nevertheless, the fluctuation range of RLPSO is significantly smaller than that of CPSO or DE. Tables 2 and 3 list the best, worst, mean and standard deviation of f* for RLPSO, CPSO, SLPSO, PSO, APSO and DE. The tables show that the best value of RLPSO is weaker than that of DE on the Rastrigin function, but its mean and standard deviation are better. For the other benchmarks, all performance statistics of RLPSO are optimal, which demonstrates the accuracy, robustness and effectiveness of the RLPSO algorithm.

Conclusions
In this paper, we proposed an RLPSO algorithm to solve the WWTP optimization problem. On the one hand, the method is based on the theory of reinforcement learning: through continual trial-and-error interaction between environment and action, it adjusts the strategy according to feedback information and finally determines the optimal concentration set values under various conditions. On the other hand, the method is based on the swarm intelligence algorithm PSO, which helps improve the diversity of solutions and find the globally optimal concentration set values in the WWTP application. Besides, the method has an elite network with a memory function: it records the influence of the set values of S_O and S_NO on the sewage parameters and reuses this information, providing reference data for the next optimization calculation.
In summary, the RLPSO algorithm proposed in this paper can not only meet the effluent standard, but also reduce the operating cost, providing a feasible solution for actual sewage treatment plants. In the future, we will continue to study the sewage treatment system, carry out data mining [37,38], and seek a better optimal control method. In addition, we will use RLPSO to solve more practical problems, such as robot control [39,40] and sEMG-based human-machine interaction [41].