Introduction

The activated sludge method is a biological treatment method commonly used in wastewater treatment processes (WWTP) [1, 2]. Through biochemical reactions, the pollutants in the sewage are adsorbed, decomposed and oxidized, so that they are degraded and separated from the water, thereby purifying the sewage [3,4,5,6]. To ensure that the effluent meets discharge standards, a blower must supply appropriate aeration to maintain the dissolved oxygen concentration (SO) in the aerobic zone, and a recirculation pump must maintain the nitrate nitrogen concentration (SNO) in the anoxic zone [7]. However, operating the blower and the recirculation pump consumes a large amount of energy, which inevitably increases the operating cost. At the same time, from the perspective of the biochemical reaction mechanism, suitable SO and SNO concentrations help ensure that the nitrification and denitrification reactions proceed successfully [8, 9]. Therefore, it is necessary to dynamically optimize SO and SNO and to construct a control strategy that reduces the energy consumption (EC) of the treatment process while ensuring qualified effluent quality (EQ).

Because WWTP are nonlinear, time-varying and strongly coupled, their control problems have been extensively investigated. The main challenge is to construct an optimal control strategy that reduces EC while ensuring qualified EQ. For example, Vrečko presented a PI-based control strategy including feedforward control and a step-feed procedure, which was applied to WWTP [10]. Furthermore, Vrečko et al. presented a model predictive controller (MPC) for ammonia nitrogen, which gives better results in terms of ammonia removal and aeration energy consumption than a PI controller [11]. Mulas proposed a dynamic matrix-based predictive control algorithm that decreases energy consumption costs and, at the same time, reduces ammonia peaks and the nitrate concentration [12]. Han et al. proposed an efficient self-organizing sliding-mode controller (SOSMC) to suppress the disturbances and uncertainties of WWTP [13]. However, in the above algorithms, the concentration set values of the key process variables are either fixed or changed along preset trajectories, without considering the real-time influence of influent quality and flow rate.

Sewage treatment is a complex dynamic reaction process. To reduce EC while meeting EQ standards, an increasing number of intelligent algorithms have been presented to dynamically optimize the set values of key variables in WWTP. For example, Hakanen et al. designed an interactive multi-objective wastewater treatment software based on differential evolution (DE), using variables such as the SO set point in the last aerobic zone and the methanol dose as decision variables [14]. Han et al. proposed a Hopfield neural network (HNN) method based on Lagrange multipliers for the optimal control of a pre-denitrification WWTP [15]. Yang used an artificial immune network-based combinatorial optimization algorithm (Copt-ai Net) to determine the optimal set values of SO and SNO [16]. In [17], an adaptive multi-objective evolutionary algorithm based on decomposition (AMOEA/D) was developed, using EC and EQ as the objectives to be optimized.

However, sewage treatment is a cyclical process: optimization must be performed at regular intervals, which can result in a high fitness evaluation (FE) cost. In the above intelligent control algorithms, the information generated during treatment is not fully utilized: each optimization cycle neither extracts useful information from the previous cycles nor provides guidance for the subsequent ones.

In this cyclic optimization process, storing and reusing information can improve computational efficiency and the sewage treatment effect. Inspired by reinforcement learning [18, 19], and considering the simple operation and fast convergence of particle swarm optimization (PSO) [20,21,22,23], we propose a wastewater treatment control method based on reinforcement learning particle swarm optimization (RLPSO). This method introduces a reinforcement learning strategy into the particle update. First, elite particles are selected, their concentration set values and adjustment trends are recorded, and an elite particle set is constructed. Then an elite network is trained and used as the strategy function to predict particle velocities. Finally, a simplified evaluation method is used to estimate the state value function, which determines whether the elite network model is updated.

The remainder of the paper is organized as follows. The next section introduces the international Benchmark Simulation Model 1 (BSM1) of WWTP and optimization objective function. The subsequent section describes RLPSO in detail. Then the experiment results and analysis are shown. The final section provides the conclusion and outlook.

Wastewater treatment process optimization

In a WWTP, the main reactions take place in the biological reactor and the secondary sedimentation tank. The biological reactor consists of five units: the first two are anoxic zones, which mainly carry out the denitrification reaction, while the last three are aerobic zones, which mainly carry out the nitrification reaction. To evaluate and compare different optimal control strategies, the Benchmark Simulation Model 1 (BSM1) [24,25,26] was developed by the IWA (International Water Association) and COST (European Cooperation in the Field of Science and Technology), as shown in Fig. 1. In BSM1, there are two control loops, for SO and SNO. The first loop regulates the dissolved oxygen concentration SO in the fifth unit by manipulating the oxygen transfer coefficient KLa5. The second loop regulates the nitrate nitrogen concentration SNO in the second unit by manipulating the internal recirculation flow rate Qa. Both loops use proportional-integral (PI) controllers. However, because of the influence of weather and users, the sewage quality keeps changing; if SO or SNO is kept at a constant set value, it is difficult to maintain the optimal balance between EQ and EC. Therefore, it is necessary to dynamically optimize the set values of SO and SNO and to construct an optimized control strategy that reduces EC while ensuring qualified EQ.

Fig. 1 The architecture of the BSM1

Aeration energy (AE) and pumping energy (PE) account for more than 70% of the total energy consumption, so the EC term of the optimization problem is defined as the sum of AE and PE:

$$ EC = AE + PE. $$
(1)

According to the BSM1 mechanistic model, AE and PE are defined as follows [27]:

$$ AE = \frac{S_{O,\mathrm{sat}}}{T \times 1.8 \times 1000}\int_{t}^{t + T} \sum\limits_{i = 1}^{5} V_{i} \cdot K_{\mathrm{La}i} (t)\,\mathrm{d}t, $$
(2)
$$ PE = \frac{1}{T}\int_{t}^{t + T} \bigl( 0.004\,Q_{a} (t) + 0.05\,Q_{w} (t) + 0.008\,Q_{r} (t) \bigr)\,\mathrm{d}t, $$
(3)

where KLai is the oxygen transfer coefficient and Vi is the volume of the ith biological reactor. SO,sat is the oxygen saturation concentration and T is the evaluation period. Qa, Qr and Qw denote the internal recycle flow rate, the return sludge recycle flow rate and the waste sludge flow rate, respectively; Za, Zr and Zw are the corresponding component concentrations.

EQ represents the fine to be paid for discharging pollutants into the receiving water body. According to the definition in BSM1, EQ is given by [27]

$$ EQ = \frac{1}{T \times 1000}\int_{t}^{t + T} \bigl( 2\,SS(t) + COD(t) + 30\,S_{NO} (t) + 10\,S_{Nkj} (t) + 2\,BOD_{5} (t) \bigr)\,\mathrm{d}t, $$
(4)

where SS, COD, SNO, SNkj and BOD5 are the suspended solids concentration, chemical oxygen demand, nitrate nitrogen concentration, Kjeldahl nitrogen concentration and five-day biochemical oxygen demand, respectively. The EQ value affects the operating cost of the WWTP when the effluent discharge fee is enforced strictly.
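For concreteness, the following Python sketch shows how Eqs. (1)–(4) could be evaluated numerically from sampled trajectories over one evaluation cycle. The array layout, the trapezoidal integration and the default saturation value are illustrative assumptions, not part of BSM1 itself.

```python
import numpy as np

def energy_and_quality(t, KLa, V, Qa, Qw, Qr, SS, COD, SNO, SNkj, BOD5,
                       SO_sat=8.0):
    """Evaluate EC (Eq. 1) and EQ (Eq. 4) from sampled trajectories.

    t          : sample times [d] over one evaluation cycle of length T
    KLa        : array (len(t), 5), oxygen transfer coefficients of the 5 reactors
    V          : array (5,), reactor volumes
    Qa, Qw, Qr : arrays (len(t),), internal recycle, waste and return sludge flows
    SS, COD, SNO, SNkj, BOD5 : effluent concentration trajectories
    SO_sat     : oxygen saturation concentration (illustrative default)
    """
    T = t[-1] - t[0]

    # Eq. (2): aeration energy
    integrand_ae = (KLa * V).sum(axis=1)          # sum_i V_i * KLa_i(t)
    AE = SO_sat / (T * 1.8 * 1000) * np.trapz(integrand_ae, t)

    # Eq. (3): pumping energy
    PE = np.trapz(0.004 * Qa + 0.05 * Qw + 0.008 * Qr, t) / T

    # Eq. (4): effluent quality index
    eq_integrand = 2 * SS + COD + 30 * SNO + 10 * SNkj + 2 * BOD5
    EQ = np.trapz(eq_integrand, t) / (T * 1000)

    return AE + PE, EQ                            # EC (Eq. 1) and EQ
```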

In addition to EQ, the five effluent parameters should meet the following standards specified in BSM1 [28]:

$$ \begin{gathered} N_{\mathrm{tot}} \le 18\ \mathrm{mg/L},\quad COD \le 100\ \mathrm{mg/L},\quad S_{NH} \le 4\ \mathrm{mg/L}, \\ SS \le 30\ \mathrm{mg/L},\quad BOD_{5} \le 10\ \mathrm{mg/L}, \end{gathered} $$
(5)

where Ntot = SNO + SNkj, and SNH denotes the effluent ammonium (ammonia nitrogen) concentration.

In summary, the constrained objective optimization function of the WWTP is

$$ \min \; f = c \cdot EC + EQ, $$
(6)

where c is a weight coefficient and the set values of SO and SNO are the decision variables. Since sewage treatment is a dynamic and periodic optimization process, we propose an RLPSO control strategy that minimizes the objective function (6) by dynamically adjusting the set values of SO and SNO, thereby improving the treatment efficacy and reducing the operating cost.
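The sketch below illustrates how a candidate pair of set values might be scored with the objective of Eq. (6) while respecting the limits of Eq. (5). The `run_bsm1` routine is a hypothetical stand-in for a BSM1 simulation over one optimization cycle, and the penalty handling of constraint violations is an illustrative choice, since the text does not specify how infeasible candidates are treated.

```python
def fitness(so_set, sno_set, run_bsm1, c=0.1, penalty=1e6):
    """Objective f = c*EC + EQ (Eq. 6) with the effluent limits of Eq. (5).

    run_bsm1 : hypothetical callable that simulates one optimization cycle and
               returns (EC, EQ, effluent dict) for the given set values.
    """
    EC, EQ, eff = run_bsm1(so_set, sno_set)

    # Effluent limits of Eq. (5); Ntot = SNO + SNkj
    limits = {"Ntot": 18.0, "COD": 100.0, "SNH": 4.0, "SS": 30.0, "BOD5": 10.0}
    violation = sum(max(0.0, eff[name] - lim) for name, lim in limits.items())

    return c * EC + EQ + penalty * violation      # penalized objective (assumption)
```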

Reinforcement learning-based particle swarm optimization

Particle swarm optimization

PSO originated from the study of the foraging behavior of bird flocks; its basic idea is that the whole flock tends to follow the bird that has found the best path to food [29]. To search for an optimum, PSO defines a swarm of particles to represent potential solutions to an optimization problem. Each particle starts from a random initial position and flies through the D-dimensional solution space. The flight behavior of each particle is described by its velocity and position as follows:

$$ v_{id} (k + 1) = \omega v_{id} (k) + c_{1} r_{1} (p_{id} (k) - x_{id} (k)) + c_{2} r_{2} (p_{gd} (k) - x_{id} (k)), $$
(7)
$$ x_{id} (k + 1) = x_{id} (k) + v_{id} (k + 1), $$
(8)

where Vi = (vi1, vi2,…, vid,…, viD) is the velocity vector of the ith particle; Xi = (xi1, xi2,…, xid,…, xiD) is the position vector of the ith particle; Pi = (pi1, pi2,…, pid,…, piD) is the best position found by the ith particle; and Pg = (pg1, pg2,…, pgd,…, pgD) is the global best position found by the whole swarm. c1 and c2 are two learning factors, usually c1 = c2 = 2 [29]; r1 and r2 are random numbers uniformly distributed in (0, 1) [30]; \(\omega\) is the inertia weight used to control the velocity, which may decrease linearly from 0.9 to 0.4 or be fixed with \(\omega \in (0,1)\) [30].
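A minimal NumPy sketch of the standard update of Eqs. (7) and (8) for a swarm stored as row vectors; the optional velocity clipping is a common practical safeguard and an assumption here, not part of the equations.

```python
import numpy as np

def pso_step(X, V, P, Pg, w=0.7, c1=2.0, c2=2.0, v_max=None):
    """One standard PSO update, Eqs. (7)-(8).

    X, V : (N, D) current positions and velocities
    P    : (N, D) personal best positions
    Pg   : (D,)  global best position
    """
    N, D = X.shape
    r1, r2 = np.random.rand(N, D), np.random.rand(N, D)
    V = w * V + c1 * r1 * (P - X) + c2 * r2 * (Pg - X)   # Eq. (7)
    if v_max is not None:
        V = np.clip(V, -v_max, v_max)                    # optional velocity limit
    X = X + V                                            # Eq. (8)
    return X, V
```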

WWTP optimization is carried out periodically because the plant is a complex system with a large lag, which makes real-time optimization difficult. Therefore, a cycle time is set and the optimization calculation is performed in each cycle. However, PSO uses random initialization to maintain diversity and considers only the individual and global optima during the search, ignoring the inherent properties of the system. If PSO is applied directly to WWTP, information from previous cycles provides no guidance for the subsequent optimization, which leads to low efficiency. To improve the treatment effect, it is necessary to record the influence of the SO and SNO set values on the sewage parameters and to reuse this information during optimization, providing reference data for the next optimization calculation. We therefore add a prediction term to Eq. (7), as shown below:

$$ v_{id} (k + 1)\, =\, \omega v_{id} (k) + c_{1} r_{1} (p_{id} (k) - x_{id} (k)) + c_{2} r_{2} (p_{gd} (k) - x_{id} (k)) + r_{\mu } v_{id\mu } (k + 1), $$
(9)

where \(v_{id\mu }\) is the dth-dimensional velocity of particle i predicted by the strategy function μ, and \(r_{\mu }\) is the prediction coefficient. According to Eq. (9), the velocity of a particle is determined by four parts: the inertial velocity, the individual historical optimum, the global optimum and the prediction term. On the one hand, this retains the advantages of PSO, namely self-cognition and social behavior; on the other hand, the prediction term infuses PSO with historical information, making it better suited to repeated cyclic optimization problems. To determine the prediction term \(v_{id\mu }\), we introduce a reinforcement learning (RL) [31] strategy into PSO.
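Under the same array conventions as the previous sketch, the RLPSO update of Eq. (9) only adds the predicted term; `predict_velocity` stands for the strategy function μ introduced below and is a placeholder at this point.

```python
import numpy as np

def rlpso_velocity(X, V, P, Pg, predict_velocity, w=0.7, c1=2.0, c2=2.0, r_mu=0.3):
    """RLPSO velocity update, Eq. (9): standard PSO terms plus a predicted term."""
    N, D = X.shape
    r1, r2 = np.random.rand(N, D), np.random.rand(N, D)
    V_mu = predict_velocity(X)              # v_mu from the strategy function mu
    return w * V + c1 * r1 * (P - X) + c2 * r2 * (Pg - X) + r_mu * V_mu
```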

Reinforcement learning strategy

Reinforcement learning interacts with the environment through a trial-and-error mechanism and learns optimal strategies by maximizing the cumulative reward. Reinforcement learning involves four basic elements: environment, state (s), action (a) and reward (R) [31]. During operation, the agent determines an action a for the current state s through the strategy function μ, executes the action, and enters the next state; at the same time, the environment returns the value R to reward or punish the action. This process is repeated to maximize the expected return of the agent.

Similarly, reinforcement learning-based PSO (RLPSO) includes the four basic elements shown in Fig. 2. In this paper, the agent is a particle in the population and the environment is the WWTP. The state s is the position X of each particle in the population; the action a is the velocity V prediction strategy, which is determined by the strategy function μ; and the reward value R is related to the fitness value f of the optimization problem. Therefore, to obtain the predicted particle velocity \(v_{id\mu }\), we need to establish the strategy function μ according to the reward value R.

Fig. 2 Schematic diagram of interaction between particle agent and environment

In RLPSO, the particle agent predicts its velocity according to the strategy function μ:

$$ v_{id\mu } (k + 1) = \mu (X_{i} (k)). $$
(10)

In this paper, the strategy function μ is represented by an elite network model, which is trained on the information of the elite particles. The process consists of three steps: elite particle set construction, strategy function training and elite network model evaluation. The details are described as follows.

Elite particle set construction

The elite network model is trained with elite particle information to guide the search of the offspring population. The first step in the kth iteration is to select elite particles based on the reward value R(k), which is determined by the change in fitness, as shown in the following equation:

$$ R(k) = \begin{cases} 1, & f(k + 1) - f(k) > 0, \\ - 1, & f(k + 1) - f(k) \le 0, \end{cases} $$
(11)

where \(f(k)\) is the fitness value at the kth iteration, \(k = 0,...,K - 1\), and K is the maximum number of iterations per run. Only the particles with reward value \(R(k) = 1\) are selected as elite particles; their positions Xi(k) before the update and velocities Vi(k + 1) after the update are saved to construct the elite particle set \(\Omega_{e}\).
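A sketch of the elite selection step, following the sign convention of Eq. (11) exactly as stated; each stored entry keeps the position before the update, the velocity after the update and the corresponding fitness value, the latter kept here only as a convenience for the later sorting of the elite set.

```python
import numpy as np

def select_elite(X_prev, V_new, f_prev, f_new):
    """Elite particles: (position before update, velocity after update, fitness).

    Reward per Eq. (11) as written: R = 1 if f(k+1) - f(k) > 0, else -1;
    only particles with R = 1 are kept.
    """
    mask = (f_new - f_prev) > 0
    return list(zip(X_prev[mask], V_new[mask], f_new[mask]))   # entries of Omega_e
```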

Strategy function training

The elite particle set \(\Omega_{e}\) stores the position x of each elite particle before the update and its velocity v after the update. RLPSO uses an elite particle set of limited capacity. Suppose the current size of \(\Omega_{e}\) is \(N_{e}\), and \(\Omega_{e}^{\prime}\) is the newly generated elite particle set of size \(N_{e}^{\prime}\). If \(N_{e} + N_{e}^{\prime}\) exceeds the capacity \(N_{em}\), all elite particles in \(\Omega_{e} \cup \Omega_{e}^{\prime}\) are sorted according to their fitness values, only the best \(N_{em}\) entries are retained in \(\Omega_{e}\), and the original data are overwritten. The elite particle set \(\Omega_{e}\) is then used as a training data set in which the particle position is the input and the velocity is the output, and a neural network is trained on it to obtain the elite network model Φ. The trained elite network model Φ is used as the strategy function μ to guide the particles: with Φ, the particle velocity can be predicted from the particle position Xi:
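The following sketch maintains the bounded elite set and trains the elite network Φ of Eq. (12). The text does not specify the network architecture or training library, so a small multilayer perceptron regressor from scikit-learn is used here purely as an illustrative stand-in, and the hidden-layer size is an arbitrary choice.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def update_elite_set(elite_set, new_elites, capacity):
    """Merge new elites into the bounded set Omega_e (capacity N_em),
    keeping the entries with the best stored fitness values."""
    merged = elite_set + new_elites
    merged.sort(key=lambda e: e[2])           # sort by the stored fitness value
    return merged[:capacity]

def train_elite_network(elite_set):
    """Fit the elite network Phi: particle position -> updated velocity (Eq. 12)."""
    X = np.array([p for p, v, f in elite_set])
    y = np.array([v for p, v, f in elite_set])
    phi = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000)  # illustrative MLP
    return phi.fit(X, y)
```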

$$ v_{id\mu } (k + 1) = \Phi (X_{i} (k)). $$
(12)

Elite network model evaluation

Since the elite particle set \(\Omega_{e}\) is continuously updated, the elite network model is re-evaluated after each training. During the evaluation, the new model and the original model are each used to guide the particle optimization process. To better reflect the influence of the strategy function μ on the velocity update, the RLPSO velocity update equation is simplified to

$$ v_{id} (k + 1) = \omega v_{id} (k) + r_{\mu } v_{id\mu } (k + 1). $$
(13)

When the termination condition \(k \ge K\) is satisfied, the optimal fitness value obtained under the guidance of the new elite network model is denoted \(f_{1}^{*}\), and that obtained under the original network model is denoted \(f_{2}^{*}\). If \(f_{1}^{*} > f_{2}^{*}\), the prediction effect of the new network model is considered better, and the new network is assigned the reward value \(R(K) = 1\) after the iteration; otherwise \(R(K) = - 1\).

Considering the randomness of the particles, the above evaluation process is repeated M times to estimate the state value function \(\hat{V}_{\mu } (X)\):

$$ \hat{V}_{\mu } (X) = \sum\limits_{m = 1}^{M} R^{m} (K), $$
(14)

where \(\hat{V}_{\mu } (X)\) reflects the average reward obtained over the M repetitions when the particles move according to the strategy function μ. If \(\hat{V}_{\mu } (X) > 0\), the new model is considered better than the original model and replaces it; if \(\hat{V}_{\mu } (X) \le 0\), the original network is kept. By comparing the two models, we determine the prediction model used in the subsequent optimization.
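A sketch of the model evaluation described by Eqs. (13) and (14); `run_simplified_search` is a hypothetical routine that performs K iterations of the simplified update (13) under a given model and returns the best fitness found, and the comparison rule follows the text as stated.

```python
def evaluate_new_model(phi_old, phi_new, run_simplified_search, M=5):
    """Decide whether the new elite network replaces the old one (Eq. 14)."""
    rewards = []
    for _ in range(M):
        f1 = run_simplified_search(phi_new)   # best fitness under the new model
        f2 = run_simplified_search(phi_old)   # best fitness under the original model
        rewards.append(1 if f1 > f2 else -1)  # R(K) per the stated comparison rule
    V_hat = sum(rewards)                      # state value estimate, Eq. (14)
    return phi_new if V_hat > 0 else phi_old
```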

Algorithm procedure

The algorithm procedure is described below; a compact code sketch follows the listing.

RLPSO

1. Initialize particle position Xi and velocity Vi, \(i = 1,2,...,N\)

2. Let Run = 1. Update the particle positions and velocities according to Eqs. (7) and (8). During the iterations (\(k < K\)), select the particles with reward value \(R(k) = 1\) as elite particles, establish the elite particle set \(\Omega_{e}\) and train the elite network model Φ

3. Randomly generate N particles. Let \(r_{\mu } \ne 0\), use the elite network model Φ to predict the particle velocity \(v_{id\mu }\), and update the particle position and velocity according to Eqs. (8) and (9). At the same time, continue to select particles with reward value \(R(k) = 1\) as elite particles. If the number of elite particles exceeds the limited capacity, establish a new elite particle set \(\Omega_{e}^{^{\prime}}\) and train a new elite network model Φ′

4. Evaluation of the elite network model. According to Eqs. (8) and (13), the original model Φ and the new model Φ′, respectively, guide the particle swarm for M runs to calculate the estimate \(\hat{V}_{\mu } (X)\). If \(\hat{V}_{\mu } (X) > 0\), the new model Φ′ replaces the original model Φ

5. Run = Run + 1. If the termination condition is not satisfied, return to step 3
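The compact skeleton below composes the earlier sketches into the outer loop of the procedure above. The elite-set capacity, the `simplified_search` helper and the details of swarm re-initialization are assumptions made for illustration only.

```python
import numpy as np

def simplified_search(phi, fitness, low, high, N, K, w, r_mu):
    """Hypothetical helper: K iterations of the simplified update of Eq. (13)
    guided by model phi; returns the best fitness found."""
    X = low + np.random.rand(N, len(low)) * (high - low)
    V = np.zeros_like(X)
    best = np.inf
    for _ in range(K):
        V = w * V + r_mu * phi.predict(X)                  # Eq. (13)
        X = X + V
        best = min(best, min(fitness(x) for x in X))
    return best

def rlpso(fitness, bounds, N=10, K=40, cycles=168, capacity=100,
          w=0.4, r_mu=0.3, M=5):
    """Compact RLPSO outer loop, composing the sketches above.

    fitness  : objective of Eq. (6) evaluated at a position (SO, SNO set values)
    bounds   : (low, high) arrays bounding the decision variables
    capacity : assumed elite-set capacity N_em (not stated explicitly in the text)
    """
    low, high = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    D = low.size
    elite_set, phi = [], None

    for cycle in range(cycles):
        X = low + np.random.rand(N, D) * (high - low)      # new swarm each cycle
        V = np.zeros((N, D))
        f = np.array([fitness(x) for x in X])
        P, pf = X.copy(), f.copy()
        Pg = P[pf.argmin()]
        new_elites = []

        for k in range(K):
            X_prev, f_prev = X.copy(), f.copy()
            if phi is None:                                # step 2: plain PSO
                X, V = pso_step(X, V, P, Pg, w=w)
            else:                                          # step 3: guided update
                V = rlpso_velocity(X, V, P, Pg, phi.predict, w=w, r_mu=r_mu)
                X = X + V
            f = np.array([fitness(x) for x in X])
            new_elites += select_elite(X_prev, V, f_prev, f)   # Eq. (11)

            better = f < pf                                # personal / global bests
            P[better], pf[better] = X[better], f[better]
            Pg = P[pf.argmin()]

        elite_set = update_elite_set(elite_set, new_elites, capacity)
        phi_new = train_elite_network(elite_set)
        if phi is None:
            phi = phi_new
        else:                                              # step 4: model evaluation
            run = lambda m: simplified_search(m, fitness, low, high, N, K, w, r_mu)
            phi = evaluate_new_model(phi, phi_new, run, M)

    return Pg, pf.min()
```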

Experiments

Simulation experiment of RLPSO based on BSM1

The proposed RLPSO is simulated on the BSM1 platform and compared with the PI controller, CPSO [32], SLPSO [33], PSO [34], APSO [35], DE [36], HNN [15], Copt-ai Net [16] and AMOEA/D [17]. The simulation conditions are based on the fine-weather influent data of BSM1. The parameters of the HNN, Copt-ai Net and AMOEA/D algorithms are taken from the original papers. The parameters of the other algorithms are set as follows.

The simulation data cover 14 days. The sampling interval is 15 min and the optimization period is 2 h, so a total of 168 optimization runs are conducted for each algorithm. In the PI strategy, the set values are SO = 2 mg/L and SNO = 1 mg/L. In RLPSO, CPSO, SLPSO, PSO, APSO and DE, the ranges of SO and SNO are 0.5–2 \({\text{mg/L}}\) and 0.8–2 \({\text{mg/L}}\), respectively. In the objective function of Eq. (6), c = 0.1.

One difficulty in employing RLPSO on BSM1 is the large time cost of fitness evaluations (FEs): the algorithm must not only achieve the required optimization accuracy but also converge quickly. Therefore, for RLPSO, the population size N is set to 10, \(r_{\mu } = 0.3\), \(\omega = 0.4\), D = 2, and Kmax = 40; c1 and c2 are both 2. The resulting total number of FEs is 67,200, as shown below. Early experiments show that with these settings RLPSO is nearly convergent by the 40th iteration and meets the requirements on EQ and EC. For ease of comparison, the inertia weights of the compared PSO-based algorithms are all set to 0.7. In the DE algorithm, the mutation rate is 0.5 and the crossover probability is 0.9. The population size and number of iterations of these algorithms are the same as for RLPSO.
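This total follows directly from the parameter settings above:

$$ \mathrm{FEs} = N \times K_{\max } \times \text{(optimization cycles)} = 10 \times 40 \times 168 = 67{,}200. $$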

Table 1 compares EQ and EC under the different strategies. As can be seen from Table 1, compared with the PI strategy, all of the intelligent algorithms reduce EC by optimizing the set values of SO and SNO. Among them, the EC obtained by the PSO algorithm, 3652.40 kWh/d, is lower than that of RLPSO. However, the SNH concentration obtained by PSO is 4.19 mg/L, which exceeds the 4 mg/L limit; the SNH concentrations obtained by DE and APSO also exceed the standard. In addition, the EC obtained by CPSO, SLPSO, HNN and Copt-ai Net is clearly higher than that of RLPSO, which shows that RLPSO is superior to these algorithms.

Table 1 Comparison of EQ and EC of different control strategies in fine weather

We can also see from Table 1 that the EC obtained by AMOEA/D is slightly lower than that of RLPSO, but its EQ is higher, so the performance of the two algorithms is comparable. It should be noted, however, that the AMOEA/D strategy uses a population size of N = 100 and Kmax = 300, so in each optimization cycle the FEs of RLPSO amount to just 1/75 of those of AMOEA/D, as shown below. RLPSO obtains an EC similar to that of AMOEA/D with significantly fewer FEs, which shows that RLPSO is better suited to the sewage treatment process.
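The 1/75 ratio follows from the per-cycle evaluation budgets of the two algorithms:

$$ \frac{\mathrm{FEs}_{\mathrm{RLPSO}}}{\mathrm{FEs}_{\mathrm{AMOEA/D}}} = \frac{10 \times 40}{100 \times 300} = \frac{400}{30{,}000} = \frac{1}{75}. $$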

RLPSO simulation experiment based on benchmark functions

To further study the performance of RLPSO, the algorithm is analyzed on general high-dimensional benchmarks. Six benchmark functions of different types (Rastrigin, Griewank, Ellipsoid, Rosenbrock, Sphere and Ackley) are used to compare RLPSO with CPSO, SLPSO, PSO, APSO and DE. The population size is N = 10, and the dimension D is set to 10 and 20, respectively. Each algorithm is run 50 times with a maximum of 200 iterations per run; the other parameters are the same as in the BSM1 experiment. In the experiments, \(f_{j}^{*}\) denotes the optimal fitness value obtained in the jth run, j = 1, 2, …, 50.

Figures 3, 4, 5, 6, 7 and 8 show boxplot comparisons of f* obtained by the various algorithms. As can be seen from the figures, the median and interquartile range of RLPSO are clearly better than those of SLPSO, PSO and APSO, which shows that RLPSO outperforms these algorithms. In addition, RLPSO produces almost no outliers, which also demonstrates its stability.

Fig. 3 Boxplot comparison of various algorithms on the Sphere benchmark

Fig. 4 Boxplot comparison of various algorithms on the Ellipsoid benchmark

Fig. 5 Boxplot comparison of various algorithms on the Rosenbrock benchmark

Fig. 6 Boxplot comparison of various algorithms on the Ackley benchmark

Fig. 7 Boxplot comparison of various algorithms on the Griewank benchmark

Fig. 8 Boxplot comparison of various algorithms on the Rastrigin benchmark

To further compare RLPSO with CPSO and DE, Figs. 9, 10, 11, 12, 13 and 14 show the trend of f* for these three algorithms over the 50 runs. For CPSO and DE, there is no data connection between runs and each run is randomly initialized, so the f* trend fluctuates significantly. In contrast, as can be seen from Figs. 9a–12a, the f* of RLPSO tends to converge, because RLPSO relies on the elite neural network to transmit information between runs, so earlier optimization runs guide the subsequent ones. It should be noted that, as shown in Figs. 13a, 14a and 9b–14b, RLPSO still exhibits fluctuations, because an elite network with a fixed structure is used during training, which degrades RLPSO performance on more complex or higher-dimensional benchmarks. Nevertheless, the fluctuation range of RLPSO is significantly smaller than that of CPSO or DE.

Fig. 9 Curve plot comparison of various algorithms on the Sphere benchmark

Fig. 10 Curve plot comparison of various algorithms on the Ellipsoid benchmark

Fig. 11 Curve plot comparison of various algorithms on the Rosenbrock benchmark

Fig. 12 Curve plot comparison of various algorithms on the Ackley benchmark

Fig. 13 Curve plot comparison of various algorithms on the Griewank benchmark

Fig. 14 Curve plot comparison of various algorithms on the Rastrigin benchmark

Tables 2 and 3 list the best, worst, mean and standard deviation of f* for RLPSO, CPSO, SLPSO, PSO, APSO and DE. The tables show that the best value of RLPSO is weaker than that of DE on the Rastrigin function, but its mean and standard deviation are better. For the other benchmarks, all performance statistics of RLPSO are the best, which demonstrates the accuracy, robustness and effectiveness of the RLPSO algorithm.

Table 2 Optimal fitness values comparison of various algorithms on the ten-dimensional benchmarks
Table 3 Optimal fitness values comparison of various algorithms on the twenty-dimensional benchmarks

Conclusions

In this paper, we proposed an RLPSO algorithm to solve the WWTP optimal control problem. On the one hand, the method is based on reinforcement learning: through continuous interaction between environment and action, it adjusts its strategy according to the feedback information and ultimately determines the optimal concentration set values under various conditions. On the other hand, the method is based on the swarm intelligence algorithm PSO, which helps maintain a diverse distribution of solutions and find the globally optimal concentration set values in the WWTP application. In addition, the method has an elite network with a memory function, which records the influence of the SO and SNO set values on the sewage parameters and reuses this information as reference data for the next optimization calculation, thereby improving the treatment effect.

In summary, the RLPSO algorithm proposed in this paper not only meets the effluent standards but also reduces the operating cost, providing a feasible solution for actual sewage treatment plants. In the future, we will continue to study the sewage treatment system, carry out data mining [37, 38], and seek better optimal control methods. In addition, we will use RLPSO to solve more practical problems, such as robot control [39, 40] and sEMG-based human–machine interaction [41].