
Assessment of reinforcement learning algorithms for nuclear power plant fuel optimization

Published in: Applied Intelligence

Abstract

The nuclear fuel loading pattern optimization problem belongs to the class of large-scale combinatorial optimization. It is also characterized by multiple objectives and constraints, which makes it impossible to solve explicitly. Stochastic optimization methodologies, including Genetic Algorithms and Simulated Annealing, are used by different nuclear utilities and vendors, but hand-designed solutions remain the prevalent method in the industry. To improve on the state of the art, Deep Reinforcement Learning (RL), in particular Proximal Policy Optimization, is leveraged. This work presents a first-of-a-kind approach to utilizing deep RL to solve the loading pattern problem, and it could be leveraged for any engineering design optimization. This paper is also, to our knowledge, the first to propose a study of the behavior of several hyper-parameters that influence the RL algorithm. The algorithm is highly dependent on multiple factors, such as the shape of the objective function derived for the core design, which behaves as a fudge factor affecting the stability of the learning, but also on an exploration/exploitation trade-off that manifests through different parameters, such as the number of loading patterns seen by the agents per episode, the number of samples collected before a policy update, and an entropy factor that increases the randomness of the policy during training. We found that RL must be applied similarly to a Gaussian Process, in which the acquisition function is replaced by a parametrized policy. Then, once an initial set of hyper-parameters is found, reducing the number of patterns per episode and the number of samples collected per update until no more learning is observed results in the highest sample efficiency, robustly and stably. This resulted in an economic benefit of 535,000 - 642,000 $/year/plant. Future work must extend this research to multi-objective settings and compare it to state-of-the-art implementations of stochastic optimization methods.


Data availability statement

The sponsor of this study did not give written consent for their data to be shared publicly; therefore, due to the sensitive nature of the research, supporting data are not available.

Abbreviations

PPO:

Proximal Policy Optimization

RL:

Reinforcement Learning

\(c_1\) (vf_coef):

Value Function Coefficient

\(c_2\) (ent_coef):

Entropy Coefficient

n_steps:

Number of steps per agent before parameter update

learning_rate:

Learning Rate

cliprange or \(\epsilon \):

Clipping range

noptepochs:

Number of epochs to run the gradient ascent

NF:

Numframes: Number of patterns generated per episode

LP:

Loading Pattern

\(N_{FA}\) :

Number of fuel assemblies in the core

FAs:

Fuel Assemblies

BP:

Burnable Poison

CRD:

Control Rod

NPP:

Nuclear Power Plant

CO:

Combinatorial Optimization

SO:

Stochastic Optimization

NT:

Nemenyi Test

FT:

Friedman Test

SE:

Sample Efficiency

HT:

Hypothesis Testing

MDP:

Markov Decision Process

FOM:

Figure of Merit

References

  1. NEI (2020) Nuclear costs in context. Nuclear Energy Institute

  2. Kropaczek DJ (2011) COPERNICUS: A multi-cycle optimization code for nuclear fuel based on parallel simulated annealing with mixing of states. Progress Nuclear Energy 53(6):554–561


  3. Park TK, Joo HG, Kim CH (2009) Multiobjective Loading Pattern Optimization by Simulated Annealing Employing Discontinuous Penalty Function and Screening Technique. Nuclear Sci Eng 162:134–147


  4. Parks GT (1996) Multiobjective pressurized water reactor reload core design by nondominated genetic algorithm search. Nuclear Sci Eng 124(1):178–187


  5. de Moura Meneses AA, Araujo LM, Nast FN, Vasconcelos da Silva P, Schirru R (2019) Optimization of Nuclear Reactors Loading Patterns with Computational Intelligence Methods. In: Platt G, Yang XS, Silva Neto A (eds) Computational intelligence, optimization and inverse problems with applications in engineering. Springer, Cham


  6. de Moura Meneses A, Schirru R (2015) A cross-entropy method applied to the In-core fuel management optimization of a Pressurized Water Reactor. Progress Nuclear Energy 83:326–335


  7. de Lima AMM, Schirru FC, da Silva R, Medeiros JACC (2008) A nuclear reactor core fuel reload optimization using artificial ant colony connective networks. Ann Nuclear Energy 35:1606–1612


  8. Wu SC, Chan TH, Hsieh MS, Lin C (2016) Quantum evolutionary algorithm and tabu search in pressurized water reactor loading pattern design. Ann Nuclear Energy 94:773–782


  9. Lin S, Chen YH (2014) The max-min ant system and tabu search for pressurized water reactor loading pattern design. Ann Nuclear Energy 71:388–398


  10. Erdoğan A, Geçkinli M (2003) A PWR reload optimisation (Xcore) using artificial neural network and genetic algorithm. Ann Nuclear Energy 30:35–53


  11. Li Z, Huang J, Wang J, Ding M (2022) Comparative study of meta-heuristic algorithms for reactor fuel reloading optimization based on the developed BP-ANN calculation method. Ann Nuclear Energy 165:108685


  12. Ortiz JJ, Requena I (2004) Using a multi-state recurrent neural networks to optimize loading patterns in BWRs. Ann Nuclear Energy 31:789–803


  13. Yamamoto A (2003) Application of Neural Network for Loading Pattern Screening of In-Core Optimization Calculations. Nuclear Technol 144(1):63–75


  14. Gozalvez JM, Yilmaz S, Alim F, Ivanov K, Levine SH (2006) Sensitivity study on determining an efficient set of fuel assembly parameters in training data for designing of neural networks in hybrid genetic algorithms. Ann Nuclear Energy 33:457–465


  15. Bello I, Pham H, Le QV, Norouzi M, Bengio S (2016) “Neural combinatorial optimization with reinforcement learning." arXiv:1611.09940

  16. Khalil E, Dai H, Zhang Y, Dilkina B, Song L (2017) “Learning combinatorial optimization algorithms over graphs." In NIPS’17: Proceedings of the 31st international conference on neural information processing systems, pp 6348–6358

  17. Li K, Zhang T, Wang R (2021) Deep Reinforcement Learning for Multi-Objective Optimization. IEEE Trans Cybernet 51(6):3103–3114


  18. Nissan E, Siegelmann H, Galperin A, Kimhi S (1997) Upgrading Automation for Nuclear Fuel In-Core Management: from the Symbolic Generation of Configurations, to the Neural Adaptations of Heuristics. Eng Comput 13:1–19


  19. Radaideh MI, Wolverton I, Joseph J, Tusar JJ, Otgonbaatar U, Roy N, Forget B, Shirvan K (2021) Physics-informed reinforcement learning optimization of nuclear assembly design. Nuclear Eng Des 372:110966


  20. Rempe KR, Smith KS, Henry AF (1989) SIMULATE-3 pin power reconstruction: methodology and benchmarking. Nuclear Sci Eng 103(4):334–342


  21. Seurin P, Shirvan K (2022) “PWR Loading Pattern Optimization with Reinforcement Learning." International conference on physics of reactors (PHYSOR 2022), pp 1166–1175

  22. Seurin P, Shirvan K (2023) “Pareto Envelope Augmented with Reinforcement Learning: Multi-objective reinforcement learning-based approach for Pressurized Water Reactor optimization." In The international conference on mathematics and computational methods applied to nuclear science and engineering (M &C 2023). Niagara Falls, Ontario, Canada, August 13-17

  23. Bertsekas DP, Tsitsiklis JN (1996) Neuro-Dynamic Programming. Athena Scientific, Belmont, MA


  24. Bengio Y, Lodi A, Prouvost A (2018) “Machine Learning for Combinatorial Optimization: a Methodological Tour d’Horizon.". Eur J Oper Res 290(2):405–421


  25. Hessel M, Modayil J, van Hasselt H, Schaul T, Ostrovski G, Dabney W, Horgan D, Piot B, Azar M, Silver D (2018) Rainbow: Combining improvements in deep reinforcement learning. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI'18), pp 3215–3222

  26. Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) “Proximal policy optimization algorithms." arXiv:1707.06347

  27. Schulman J, Levine S, Moritz P, Jordan M, Abbeel P (2017) “Trust Region Policy Optimization." arXiv:1502.05477v5

  28. Botvinick M, Ritter S, Wang JX, Kurth-Nelson Z, Blundell C, Hassabis D (2019) Reinforcement Learning: Fast and Slow. Trends Cognit Sci 23(5):408–422


  29. Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8(3):229–256


  30. Wu Y, Mansimov E, Liao S, Grosse R, Ba J (2017) “Scalable trust-region method for deep reinforcement learning using the Kronecker-factored approximation." NIPS’17: Proceedings of the 31st International conference on neural information processing systems, pp 5285–5294

  31. Kakade S, Langford J (2002) Approximately Optimal Approximate Reinforcement Learning. ICML 2:267–274


  32. Kakade S (2001) A natural policy gradient. NIPS’01: Proceedings of the 14th International conference on neural information processing systems: natural and synthetic, pp 1531–1538

  33. Hill A, Raffin A, Ernestus M, Gleave A, Kanervisto A, Traore R, Dhariwal P, Hesse C, Klimov O, Nichol A, Plappert M, Radford A, Schulman J, Sidor S, Wu Y (2018) “Stable Baselines." https://github.com/hill-a/stable-baselines

  34. Mazyavkina N, Sviridov S, Ivanov S, Burnaev E (2021) Reinforcement Learning for Combinatorial Optimization: A survey. Comput Oper Res 134:105400


  35. Bertsimas D, Tsitsiklis JN (2008) Introduction to Linear Optimization. Athena Scientific, Dynamic Ideas


  36. Dai H, Dai B, Song L (2016) “Discriminative embeddings of latent variable models for structured data." ICML’16: Proceedings of the 33rd International conference on international conference on machine learning 48:2701–2711

  37. Vinyals O, Fortunato M, Jaitly N (2015) “Pointer networks." NIPS’15: Proceedings of the 28th International Conference on Neural Information Processing Systems 2:2692–2700

  38. Nazari M, Oroojlooy A, Snyder LV, Takac M (2018) “Deep Reinforcement Learning for Solving the Vehicle Routing Problem." NIPS’18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 9861–9871

  39. Mnih V, Badia AP, Mirza M, Graves A, Lillicrap TP, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. ICML’16: Proceedings of the 33rd International Conference on International Conference on Machine Learning 48:1928–1937

  40. Xing R, Tu S, Xu L (2020) “Solve Traveling Salesman Problem by Monte Carlo Tree Search and Deep Neural Network." arXiv:2005.06879v1

  41. Emami P, Ranka S (2018) “Learning Permutations with Sinkhorn Policy Gradient." arXiv:1805.07010v1

  42. Kool W, Van Hoof H, Welling M (2018) “Attention Solves Your TSP, Approximately." arXiv:1803.08475v2

  43. Solozabal R, Ceberio J, Takáč M “Constrained Combinatorial Optimization with Reinforcement Learning." arXiv:2006.11984

  44. Delarue A, Anderson R, Tjandraatmadja C (2020) “Reinforcement Learning with Combinatorial Actions: An Application to Vehicle Routing." NIPS’20: Proceedings of the 34th International conference on neural information processing systems 52:609–620

  45. Radaideh MI, Forget B, Shirvan K (2021) Large-scale design optimisation of boiling water reactor bundles with neuroevolution. Ann Nuclear Energy 160:108355


  46. Kerkar N, Paulin P (2007) Exploitation des coeurs REP. EDP SCIENCES, 17, avenue du Hoggar, Parc d’activités de Courtaboeuf, BP 112, 91944 Les Ulis Cedex A, France

  47. del Campo CM, François JL, Avendano L, Gonzalez M (2004) Development of a BWR loading pattern design system based on modified genetic algorithms and knowledge. Ann Nuclear Energy 31:1901–1911


  48. Castillo A, Alonso G, Morales LB, del Campo CM, François JL, del Valle E (2004) BWR fuel reloads design using a Tabu search technique. Ann Nuclear Energy 31:151–161


  49. Radaideh MI, Shirvan K (2021) Rule-based reinforcement learning methodology to inform evolutionary algorithms for constrained optimization of engineering applications. Knowl-Based Syst 217:106836


  50. Nijimbere D, Zhao S, Gu X, Esangbedo MO (2021) Tabu search guided by reinforcement learning for the max-mean dispersion problem. J Indust Manag Optimizat 17:3223–3246


  51. Saccheri JGB, Todreas NE, Driscoll MJ (2004) “A tight lattice, Epithermal Core Design for the Integral PWR." In Proceedings of ICAPP ’04, p 4359. Pittsburgh, PA, USA

  52. “0523 - 0504P - Westinghouse Advanced Technology - 03.4 - Analysis of Technical Specifications Unit 4." nrc.gov/docs/ML1121/ML11216A087.pdf

  53. Liu Y, Halev A, Liu X (2021) “Policy Learning with Constraints in Model-free Reinforcement Learning: A Survey." International Joint Conferences on Artificial Intelligence Organization, Survey Track, pp 4508–4515. https://doi.org/10.24963/ijcai.2021/614

  54. Li Z, Wang J, Ding M (2022) A review on optimization methods for nuclear reactor fuel reloading analysis. Nuclear Eng Des 397:111950


  55. Kropaczek DJ, Turinsky PJ (1991) In-core nuclear fuel management optimization for pressurized water reactors utilizing simulated annealing. Nuclear Technol 95(1):9–32


  56. François JL, Ortiz-Sevrin JJ, Martin-del Campo C, Castillo A, Esquivel-Estrada J (2012) Comparison of metaheuristic optimization techniques for BWR fuel reloads pattern design. Ann Nuclear Energy 51:189–195


  57. Ivanov BD, Kropaczek DJ (2021) Assessment of parallel simulated annealing performance with the NEXUS/ANC9 core design code system. EPJ Web of Conferences 247:02019. https://doi.org/10.1051/epjconf/202124702019


  58. de Moura Meneses AA, Machado MD, Schirru R Particle Swarm Optimization applied to the nuclear reload problem of a Pressurized Water Reactor. Progress Nuclear Energy 51:319–326

  59. Derrac J, Garcia S, Molina D, Herrera F (2011) A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol Comput 1:3–18


  60. Schlünz E, Bokov P, van Vuuren J (2016) A comparative study on multiobjective metaheuristics for solving constrained in-core fuel management optimisation problems. Comput Oper Res 75:174–190


  61. Casella G, Berger RL (2002) Statistical Inference Second Edition. Pacific Grove

  62. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ, Brett M, Wilson J, Millman KJ, Mayorov N, Nelson ARJ, Jones E, Kern R, Larson E, Carey CJ, Polat İ, Feng Y, Moore EW, VanderPlas J, Laxalde D, Perktold J, Cimrman R, Henriksen I, Quintero EA, Harris CR, Archibald AM, Ribeiro AH, Pedregosa F, van Mulbregt P, SciPy 1.0 Contributors. (2020) SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat Methods 17:261–272

  63. Terpilowski M (2019) scikit-posthocs: Pairwise multiple comparison tests in Python. J Open Sour Softw 4(36):1169


  64. Yilmaz BG, Yilmaz ÖF (2022) Lot streaming in hybrid flowshop scheduling problem by considering equal and consistent sublots under machine capability and limited waiting time constraint. Comput Indust Eng 173:108745


  65. Yilmaz ÖF, Yazici B (2022) Tactical level strategies for multi-objective disassembly line balancing problem with multi-manned stations: an optimization model and solution approaches. Ann Oper Res 319:1793–1843. https://doi.org/10.1007/s10479-020-03902-3


  66. Awad NH, Ali MZ, Suganthan PN, Liang JJ, Qu BY (2017) Problem Definitions and Evaluation Criteria for the CEC 2017 Special Session and Competition on Single Objective Real-Parameter Numerical Optimization. Technical report, Nanyang Technological University, Singapore

  67. Radaideh MI, Seurin P, Du K, Seyler D, Gu X, Wang H, Shirvan K (2023) NEORL: NeuroEvolution Optimization with Reinforcement Learning—Applications to carbon-free energy systems. Nuclear Eng Des 112423

  68. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533


  69. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A, Chen Y, Lillicrap T, Hui F, Sifre L, van den Driessche G, Graepel T, Hassabis D (2017) “Mastering the game of Go without human knowledge." Nature 550

  70. Konak A, Coit DW, Smith AE (2006) Multi-objective optimization using genetic algorithms: A tutorial. Reliability Eng Syst Safety 91:992–1007


  71. Alim F, Kostadin I, Levine S (2008) New genetic algorithms (GA) to optimize PWR reactors: Part I: Loading pattern and burnable poison placement optimization techniques for PWRs. Ann Nuclear Energy 35(1):93–112

  72. Verhagen F, Van der Schaar M, De Kruijf W, Van de Wetering T, Jones R (1997) ROSA, a utility tool for loading pattern optimization. Proc of the ANS Topical Meeting-Advances in Nuclear Fuel Management II 1:8–31


  73. Frazier P (2018) A Tutorial on Bayesian Optimization. arXiv:1807.02811

  74. Van Hasselt H, Guez A, Silver D (2016) “Deep reinforcement learning with double q-learning." In AAAI’16: Proceedings of the Thirtieth AAAI conference on artificial intelligence 2094–2100

  75. Schulman J, Moritz P, Levine S, Jordan M, Abbeel P (2018) High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438v6


Acknowledgements

This work is sponsored by Constellation (formerly known as Exelon) Corporation under award 40008739. Feedback from Neil Huizenga and Antony Damico of Constellation was helpful in setting up the problem constraints.

Author information


Contributions

Methodology, software, formal analysis and investigation, visualization, writing - original draft preparation: Paul Seurin; conceptualization, methodology, funding acquisition, supervision, writing - review and editing: Koroush Shirvan. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Koroush Shirvan.

Ethics declarations

Conflict of Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: PPO hyper-parameters studies

In this appendix we provide the results for the major hyper-parameters alluded to in Section 2.3: the shape of the reward via the constraint weight, the length of an episode via NF, the number of steps per agent before a parameter update (n_steps), and the entropy coefficient.

1.1 A.1 Constraint weight study

Crafting the reward so that the agents behave in accordance with the designer's goals, rather than exploiting it in unintended ways (so-called reward hacking), is known as reward shaping. This study explores the influence of the weights set on the design constraints. Weights of 1,000, 5,000, 15,000, 25,000, 40,000, 50,000, 75,000, and 100,000 are considered. Moreover, all constraints are assigned the same weight, since good-quality results could not be found when we manually experimented with varying the individual weights. To compare the obtained patterns against the same baseline, the penalization of each pattern is normalized as follows [70]:

$$\begin{aligned} D(x,\mathcal {C}) = \sqrt{\sum _{i \in \mathcal {C}}(\frac{c_i(x) - c_i}{c_{i_{max}} - c_{i_{min}}})^2} \end{aligned}$$
(12)

where \(D(x,\mathcal {C})\) is the distance of the core x to the space of feasible cores \(\mathcal {C}\) in the objective space, and \(c_i, i \in \{1,...,5\}\) are the constraints defined in Section 4.2.2 (\(c_6,c_7\) are always satisfied), whose ranges are given in Table 2.
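For readability, the normalization of (12) can be sketched in a few lines of Python; the constraint names and numbers below are illustrative placeholders rather than the actual limits and ranges of Table 2.

```python
import math

def constraint_distance(c_values, c_limits, c_ranges):
    """Normalized distance D(x, C) of a candidate core to the feasible space, per Eq. (12)."""
    total = 0.0
    for name, value in c_values.items():
        c_min, c_max = c_ranges[name]
        total += ((value - c_limits[name]) / (c_max - c_min)) ** 2
    return math.sqrt(total)

# Illustrative call with placeholder numbers (not the actual Table 2 data).
d = constraint_distance(
    c_values={"F_dh": 1.62, "boron": 1350.0},
    c_limits={"F_dh": 1.58, "boron": 1300.0},
    c_ranges={"F_dh": (1.30, 1.80), "boron": (800.0, 1600.0)},
)
print(f"D(x, C) = {d:.3f}")
```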

Table 3 Statistics for each weight value over 10 experiments

Figure 5 showcases the evolution of the mean score (left) and max objective (right) per generation. A generation is defined such that the experiments span 100 generations; therefore, one generation contains between 400 and 500 SIMULATE3 evaluations. As expected, the mean score is constantly increasing, meaning that the agents are learning. The intricate behavior starting around the 60th generation occurs once a feasible pattern has been found and the agents are only optimizing the LCOE. This happens at about the same time for all weights. It is interesting to note that it happens for both the mean score and the max objective, signifying that the agents start behaving well at almost the same time as they hit a feasible pattern and hence receive a boost in reward.

Table 4 Mean Objective SE and the \(\sigma \)SE for the different cases

Additionally, the seemingly less smooth behavior compared to the other hyper-parameter studies (e.g., see Figs. 7 and 9) is due to the normalization of the objective with (12). In all cases, the max objective keeps increasing late in the search, as confirmed by the high IR provided in Table 3, and the standard deviations are in the same range. Therefore, no significant differences can be observed in terms of absolute performance. On top of that, the p-value for the FT is 0.351 for the max score and 0.876 for the mean; hence these tests failed. However, we decided to stop the experiments, as we did not observe significant differences in performance across the different weights, in contrast to the hyper-parameters studied in the next sections. We nevertheless defined more metrics at the beginning of this section, and Tables 3 and 4 provide more information and better-informed pathways to decide which weight to pursue the study with.

Table 5 Best patterns found for each case throughout all experiments
Fig. 6

Lower right quadrant of the best pattern found for a constraint weight equal to 5,000. On the left, from top to bottom: Assembly group number, fuel average enrichment, number of IFBA rods, and number of WABA rods. Blue is for twice-burned FAs, green for once-burned, and red for fresh (color pattern also utilized in ROSA [72]). In violet, locations where WABA cannot be placed. On the right, from top to bottom: Beginning of cycle \(k_{\infty }\) and assembly average burnup, peak pin burnup at EOC, and \(F_{\Delta h}\) when the maximum over the cycle is reached. R means reflector, where the water is

In light of Table 3, the best average behavior for the max objective is obtained with a weight of 75,000, which also achieves a very competitive best pattern overall. In the RL literature, it is sometimes advised to normalize the reward, otherwise the magnitudes of the gradients will be too different. Here, however, one small change in the core with a weight of 75,000 may result in a blowout of the cumulative reward, which could be counter-intuitive. The small values used for the number of patterns per episode and the number of steps per update (2 and 4, respectively), together with the use of a clipping term, mitigate this effect. On top of that, the best pattern was found with 5,000, while the case 1,000 achieves the best mean score per episode, with 25,000 very close on this metric. The behavior across weights is not linear, however: a weight of 25,000 achieves better results than 40,000, 50,000, and 100,000 overall, a better mean score than 75,000, and an almost as good max objective. Additionally, case 25,000 achieves the second-best performance in terms of mean score, with a mean objective SE of -5.7360, the highest one. Even though the differences do not look significant, the case 25,000 has the lowest \(\sigma \)SE at the last generation in Table 4, signifying greater stability. Hence, if the input variables were to change and time constraints did not allow 10 experiments (e.g., in case of an emergency core redesign), utilizing a weight of 25,000 could help find a feasible pattern with a higher probability.

The breakdown of the FOMs of the best patterns found in each case is given in Table 5. The constraints \(c_6\) and \(c_7\) turned out to be usually satisfied and were omitted from the table, which will be the case in the remainder of the study. It should be noted that for the best pattern, some constraints are close to being tight (e.g., \(F_{\Delta h}\), \(C_b\), and Pin Peak Bu), in contrast to the one found with 25,000, for instance. This may show that, to optimize the LCOE further, penalties on some constraints may need to be accepted.

Fig. 7

Error bar plots averaged over 10 experiments for the Max (left) and Mean (right) Objective per episode for different NF values. The Mean was obtained by averaging the mean score obtained per episode within a window of X episodes, where X depends on the total number of samples generated (usually around 400 samples for NF equal to 1 to obtain 100 generations). The length of the bars is two \(\sigma \)

The 2D projection of the best pattern found (i.e., with a weight equal to 5,000) is given in Fig. 6. This case presents a high IR and a low \(\sigma \)SE throughout the 10 experiments. It means that the agents are constructing high-quality patterns at each episode, which is a testament to the agents' understanding of the objective space. The agents employed several classical designer tactics even though no prior knowledge of the physics involved was given before the search started: the inner part of the core is mostly arranged like a checkerboard between fresh and once-burned FAs, a higher BP concentration is given to the FAs closer to the center, and FAs with a higher \(k_{\infty }\) are located near the periphery, which helps reduce the peaking factor and boron concentration [9]. This is worth noting, as many papers instead impose that these expert heuristics be satisfied in the decision process, as in [8, 9, 71]. However, designers usually put more WABA on the diagonal and in the ring of fire, because the power is shifted from the center of the core (where low enrichment and high IFBA are placed) toward it. Here the agent decided to also utilize high WABA in-board but not in the ring of fire. This may be explained by the fact that WABA costs money, hence reducing its amount is pivotal, and removing it from the ring of fire may be the path of least resistance. Other expert tactics for twice-burned FAs include choosing the once-burned FAs with the highest \(k_{\infty }\) to obtain enough carry-over energy for the next cycle, and FAs with low average burnup to limit the risk of high peak pin burnup at the periphery. Here the average burnup of the twice-burned FAs at EOC is not high, but not particularly low either, for the ones on both sides of the diagonal (48.5 GWd/tHm at BOC). These FAs will then peak at 61.0 GWd/tHm. The peak pin burnup is also close to 62 GWd/tHm (61.3 GWd/tHm for the fourth FA starting from the left or the top). This may be explained by the fact that we are not only optimizing safety parameters: there is no penalty as long as the peak pin burnup is within 62 GWd/tHm, but the LCOE is decreased by extracting more energy out of these FAs.

Table 6 Statistics for each NF value over 10 experiments

To conclude this section, we can say that the constraint weight does not participate in the exploration/exploitation trade-off, which makes sense as it affects only the magnitude of the reward. It remains a very important parameter nonetheless: it behaves as a fudge factor that can improve both the economics and the stability of the agents. Comparing the LCOE of the patterns found with 5,000 and 25,000, a gain of about 180,000 and 220,000 $/year/plant for a 1,000 and a 1,200 MWe NPP, respectively, could be achieved by tuning the weight appropriately. However, we demonstrated with a multi-measure analysis that looking at the maximum objective is not sufficient, specifically because it does not necessarily yield a better mean objective SE, and hence does not offer high certainty of finding a feasible pattern rapidly with a different seed. The latter becomes instrumental if different input variables were given and a new problem had to be solved rapidly, which is achieved with more certainty with a weight of 25,000. All the observations made in this section motivated us to pursue the study with a weight of 25,000, as it presented the best trade-off among our multiple performance metrics.

Table 7 Mean Objective SE and \(\sigma \)SE for the different cases
Table 8 p-value for the NT between the instances for the maximum score

1.2 A.2 Numframes study

A prior study [67] showed that NF should not vary beyond 100. Here, the values are 1, 2, 7, 10, 25, 75, and 100. Figure 7 showcases the evolution of the mean score and max objective over 100 generations. For all NF values there is continuous learning; however, it is clear in this figure that very low values (i.e., 1 and 2) outperform higher ones, which is confirmed by Table 6 for the Avg (Me). Regarding the right-hand side of Fig. 7, one could imagine that the performance of the case NF equal to 1 saturates after the 50th generation. Nevertheless, this is simply an effect of scaling. The IR is high, above 87%, meaning that the agents are still improving their learning of the system, and the high \(\sigma \)SE of Table 7 confirms this hypothesis. More generally, for all cases the IR is high for both the mean and the max objective, meaning that the agents' decisions are indeed improving, except for the max objective of case 100, which may hint that this solution was found randomly. The very low mean objective SE for case 100 given in Table 7 confirms this hypothesis.

Ten experiments were run for each case. The p-values for the FT tests are all 0.00. Cases 1, 2, and 7 were not statistically significantly different, as shown in Tables 8 and 9. However, the p-value of case 1 versus case 7 is lower than that of case 2 versus case 7 for both the max and the mean score, which may hint that case 1 is superior. The differences are sharper than those of the weight study. This should mean that NF equal to 1 outperforms the other cases, which aligns with our previous observations, and we decided to pursue the following studies with this value.

Overall, it is clear that the case NF equal to 1 outperforms the others, and Tables 6 and 7 show a very sharp learning improvement when NF is decreased below 7, while the difference is less obvious above 7. In addition, we can observe in Table 6 that a lower value of NF usually yields a lower variance between the results at each generation and always a better average behavior, promising better stability. To first order, therefore, NF is a manifestation of the exploration/exploitation trade-off. A higher NF yields more exploration and a lower mean objective but sometimes a better max objective (e.g., the max objective with NF equal to 75 exceeds that of 2 to 25), which is of interest to the core designer. For both low and high values of NF, the \(\sigma \)SE decreases, as seen in Table 7. This column shows that something else is happening. A high NF does not necessarily involve slower convergence, but in light of the mean objective SE column it does imply a lower mean objective. Therefore, a high NF should be avoided, even if the optimization were run longer, due to the risk of being stuck in a local optimum.

Table 9 p-value for the NT between the instances for the mean score per episode
Fig. 8

Lower right quadrant of the best pattern found with NF equal to 1. On the left, from top to bottom: Assembly group number, fuel average enrichment, number of IFBA rods, and number of WABA rods. Blue is for twice-burned FAs, green for once-burned, and red for fresh (color pattern also utilized in ROSA [72]). In violet, locations where WABA cannot be placed. On the right, from top to bottom: Beginning of cycle \(k_{\infty }\) and assembly average burnup, peak pin burnup at EOC, and \(F_{\Delta h}\) when the maximum over the cycle is reached. R means reflector, where the water is

It could be surprising to observe such a low mean objective SE even for the case NF equal to 2 compared to 1. Having NF equal to 1, with the restart state being the best pattern ever found, can be thought of as a gradient ascent problem: at each episode, based on the understanding of the curvature of the objective space, the agents try to find the steepest ascent direction, similar to SO-based methods, where the steps are, however, taken randomly. Indeed, in SO the agents follow a rule of thumb (i.e., a heuristic) in the decision process; it is a guided random walk. In RL, we have a policy that generates solutions, while a critic learns and evaluates the quality of these solutions: the heuristic is learned. In this sense, it is similar to a Gaussian Process [73] in which the acquisition function is replaced by a policy we learn, and the policy converges close to the optimum solution found. To the authors of this work, there is no obvious mathematical intuition behind greater values of NF. We initially thought that asking the agent to solve multiple LPs at the same time would help, as the state of one pattern is the restarting state for the action choosing the next pattern. This information could have helped the agent draw better conclusions from one state to the next. Nevertheless, it is clear that we are asking the agents to solve multiple loading designs at the same time; therefore, more samples must be drawn to fully comprehend how an episode is constructed, which might be more difficult and explains why it is seemingly easier with the lowest NF. This study may also explain why value-based methods (e.g., Deep Q-Learning (DQN) [68] or Double Q-Learning [74]) were not performing well. One explanation might be that they require one- or multiple-step look-aheads to optimize the Q-function [25]. However, as described in Section 4.1, each action is a created core. Since a full core is created for each action, each one fundamentally should not influence the generation of the next, and the Q-update might be ineffective.
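To make the episode structure concrete, a minimal sketch of an environment with this restart-from-best behavior is given below; the class and the `evaluate_core` callback are hypothetical stand-ins for the actual SIMULATE3-backed implementation, not the authors' code.

```python
import numpy as np

class LoadingPatternEnv:
    """Minimal sketch (hypothetical API) of the episode structure described above.

    Each action proposes a full loading pattern; an episode contains NF patterns, and
    with NF = 1 every episode restarts from the best pattern found so far, so the search
    behaves like a stochastic ascent from the incumbent solution.
    """

    def __init__(self, evaluate_core, n_fa, num_frames=1):
        self.evaluate_core = evaluate_core   # e.g., a core-simulator wrapper returning a scalar score
        self.n_fa = n_fa                     # number of fuel assemblies placed per pattern
        self.num_frames = num_frames         # NF: patterns generated per episode
        self.best_pattern = np.zeros(n_fa, dtype=int)
        self.best_score = -np.inf
        self.steps_in_episode = 0

    def reset(self):
        self.steps_in_episode = 0
        return self.best_pattern.copy()      # restart state = best pattern ever found

    def step(self, action):
        pattern = np.asarray(action, dtype=int)   # one action = one full core
        score = self.evaluate_core(pattern)       # reward from the shaped objective
        if score > self.best_score:
            self.best_score, self.best_pattern = score, pattern.copy()
        self.steps_in_episode += 1
        done = self.steps_in_episode >= self.num_frames
        return pattern, score, done, {}

# Toy usage with a dummy evaluator (the real one calls the core simulator):
env = LoadingPatternEnv(evaluate_core=lambda p: -float(np.sum(p)), n_fa=56, num_frames=1)
obs = env.reset()
```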

Table 10 Breakdown of the FOMs for the best patterns found for each case throughout all experiments
Fig. 9

Error bar plots averaged over 10 experiments for the Max (left) and Mean (right) Objective per episode for different n_steps values. The Mean was obtained by averaging the mean score obtained per episode within a window of X episodes, where X depends on the total number of samples generated (usually around 400 samples for NF equal to 1 to obtain 100 generations). The length of the bars is two \(\sigma \)

Figure 8 showcases the 2D projection of the best pattern found in this study (corresponding to the top row of Table 10). There, in contrast to Fig. 6, rings of fire are observed from the periphery to the center, which is also a typical designer strategy. Lastly, by comparing the solution of equal to 1 and 2 in Table 10, the constraints are tighter for this design, which helped again to reduce the LCOE.

Because the case NF equal to 2 presented a higher variance with a good Avg (Me), hence learning potential, we also tried restarting the experiment with the trained policy from this case for another day of optimization. We also increased NF to 250 and stopped at 6 experiments because the \(\sigma \) values were very low. It turned out that 75 was the best case on average, but no parametric tests succeeded, as the differences were not significant, and the results are not presented for conciseness. The better performance of 75 (max and average max objectives of -4.558 and -4.573, respectively, with \(\sigma \sim 0.0120\) averaged over 6 experiments) corresponds to a boost in exploration. In consequence, depending on the context of the optimization, there is a trade-off between exploration and exploitation. When preparing the next core cycle, which can take weeks, starting multiple runs with a higher NF might lead to a better solution than a lower one, but we should be wary of the faster convergence if NF is increased.

To conclude this section, we can say that NF plays a role in the exploration/exploitation trade-off. A higher value will yield a lower mean objective. Nevertheless, a lower value may lead to faster convergence, hence reaching and being trapped in an unacceptable local optimum (see case 75). This study also highlighted the importance of fine-tuning: a very small change in NF may drastically change the optimum found. In terms of LCOE, between NF equal to 1 and 2 for a 1,000 or 1,200 MWe NPP, we get a gain of 79–95 k$/yr. On top of that, case 7 did not even find a feasible design within one day of optimization, which is fundamental for certain applications, as we alluded to in Section 5.1.

1.3 A.3 n_steps study

The n_steps parameter also plays a role in the exploration/exploitation trade-off, as presented in Section 2.3. The chosen values for the study are 1, 2, 4, 6, 8, 10, 25, 35, and 75, which were run over 10 experiments. Figure 9 showcases the evolution of the max and mean objective over 100 generations. It should be observed that there is smooth learning for all values of n_steps.

Similar to the NF study, large values of n_steps yield poor results, with an average lower than -5.00. The best patterns found are not even feasible for cases 35 and 75, as highlighted by Table 11. The mean objective columns of Table 11 are omitted, as they are similar to those of the max objective with n_steps equal to 1. All cases have a high IR except case 75, but its low mean objective SE and \(\sigma \)SE hint that the solution was found randomly. Moreover, the variance in the results and the \(\sigma \)SE also increase with n_steps for large values. However, this behavior is no longer linear below 4, where cases 1 and 2 yield worse performance than 4, 6, 8, and 10, with a lower mean objective SE and a higher \(\sigma \)SE, as shown in Table 12. This may be a manifestation of the exploration/exploitation trade-off, as too few samples are recorded with n_steps equal to 1 or 2, which yields sub-optimal behavior. The \(\sigma \)SE column helps confirm the significance of n_steps regarding the trade-off: the higher the value, the higher the \(\sigma \), except below 4. Moreover, cases 1 and 2 present a high \(\sigma \) (MO), showing that they are also unstable.

Multiplied by the number of agents, n_steps defines the number of samples utilized to estimate the gradient of the loss (see (3)) that updates \(\theta \), and hence its quality. We hypothesize that the previous behavior occurs because low values (e.g., 1) result in a very poor local approximation of the gradient (i.e., high bias). On the other hand, larger values (e.g., 4, 6, 8...) may decrease the bias of the estimator but increase the variance due to the presence of more terms. In the beginning, the loss is constructed with many samples of poor quality, impeding the speed of learning. Once better samples are obtained, learning happens, and good solutions can be found.
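The sample count behind each gradient estimate can be made concrete with a short sketch; the agent count of 32 is an illustrative assumption, not a value taken from the paper.

```python
# Each of the parallel agents contributes n_steps transitions before theta is updated,
# so the gradient of the PPO loss is estimated from n_agents * n_steps samples.
n_agents = 32  # illustrative number of parallel agents, not taken from the paper
for n_steps in (1, 2, 4, 6, 8, 10, 25, 35, 75):
    print(f"n_steps={n_steps:>2} -> {n_agents * n_steps:>4} samples per policy/value update")
```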

Table 11 Statistics for each n_steps value over 10 experiments

The mean objective SE and \(\sigma \)SE increase with n_steps above 4, which confirms that this hyper-parameter directly controls the level of exploration and exploitation in the search. With an average max objective of -4.599, the best average behavior is observed with 4, while 6 is very close and obtained a better max objective. Looking at the mean objective SE in Table 12, we can see, however, that case 4 produces better patterns on average at the last generation, with a lower \(\sigma \)SE, indicating higher stability and a higher probability of generating a good pattern.

The NTs on the max objective did not succeed between 2, 4, 6, 8, and 10, as shown in Table 13, which is expected considering the few differences observed in Table 11. Case 6 succeeded against every value above 10, while cases 4, 8, and 10 succeeded against two out of the three values; therefore, they might be considered the best cases. However, the considerations summarized above show that case 4 is superior on the other metrics. For these reasons, we decided to keep n_steps equal to 4 for the remainder of the study.

Table 12 Mean Objective SE for the different cases with the \(\sigma \)SE

Additionally, the characteristics of the best patterns found are given in Table 14. The dissimilarities are not as high as for the study of Appendix A.2, but fine-tuning at a very fine level could have significant consequences. Setting n_steps equal to 6 allows an economy of 0.033 $/MWh compared to the best solution found with 4, which amounts to about 290,000 - 350,000 \(\$/yr\) in benefit for a 1,000 or 1,200 MWe NPP, respectively. In this case, the boron constraint is tight but the \(F_q\) constraint has more leeway, and agents continuing the optimization, or a designer taking over the core found, could try to take penalties on the latter to improve the LCOE.

Lastly, the 2D projection of the best pattern found is given in Fig. 10. The fuel placement is similar to that of Fig. 8, but the LCOE is lower. The average enrichment of this core is 4.35% with an average burnup per assembly of 35.742 GWd/tHm, against 4.4% and 35.414 GWd/tHm for the previous core. Even though the \(F_{\Delta h}\) was lower for the previous core, the \(F_q\) was higher. Combined with a higher cycle length and a lower pin peak Bu, this hints that the power profile is spread out more evenly, hence increasing the average burnup per FA and the energy extracted at a lower enrichment cost, reducing the LCOE.

Table 13 p-value for the NT between the instances
Table 14 Best patterns found for each case throughout all experiments
Fig. 10

Lower right quadrant of the best pattern found with n_steps equal to 6. On the left, from top to bottom: Assembly group number, fuel average enrichment, number of IFBA rods, and number of WABA rods. Blue is for twice-burned FAs, green for once-burned, and red for fresh (color pattern also utilized in ROSA [72]). In violet, locations where WABA cannot be placed. On the right, from top to bottom: Beginning of cycle \(k_{\infty }\) and assembly average burnup, peak pin burnup at EOC, and \(F_{\Delta h}\) when the maximum over the cycle is reached. R means reflector, where the water is

Fig. 11

Error bar plots averaged over 10 experiments for Max (left) and Mean (right) Objective per episode for different entropy coefficient values. The Mean was obtained by averaging the mean score obtained per episode within a window of X episodes, where X depends on the total number of samples generated (usually around 400 samples for NF equal to 1 to obtain 100 generations). The length of the bars is two \(\sigma \)

To conclude this section, we saw that n_steps is a hyper-parameter that controls exploration/exploitation by playing with the bias/variance trade-off in the estimation of the gradient of the PPO loss. A very low value may result in instability of the agents and in being stuck in local optima. A higher value may result in slow learning and a poorer quality of the best pattern found. Due to the scope of the study (optimization over one day), favoring a lower value is the path forward. For a longer run, it might be preferable to utilize larger values, up to 10.

1.4 A.4 Entropy coefficient study

Lastly, in this section we study the influence of the entropy coefficient. The entropy term is the entropy (a notion originating from information theory) of the distribution defined by the policy; this corresponds to the terms \(H(\theta )\) and \(c_2\) in (3). The values 0, 0.00001, 0.0001, 0.001, and 0.01 (the default in stable-baselines) are compared. Values greater than 0.01 yielded very poor results and were omitted.
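A generic sketch of how this coefficient scales the entropy bonus of a categorical policy is given below; the action probabilities are illustrative and the helper is not the library's internal implementation.

```python
import numpy as np

def categorical_entropy(probs):
    """Shannon entropy H (in nats) of a categorical action distribution."""
    probs = np.clip(probs, 1e-12, 1.0)
    return float(-np.sum(probs * np.log(probs)))

# The entropy bonus added to the PPO objective is ent_coef * H(pi(.|s)):
# a larger coefficient rewards keeping the policy random, i.e., more exploration.
policy_probs = np.array([0.7, 0.2, 0.1])  # illustrative action probabilities
for ent_coef in (0.0, 1e-5, 1e-4, 1e-3, 1e-2):  # the values compared in this study
    print(f"ent_coef={ent_coef:g} -> entropy bonus {ent_coef * categorical_entropy(policy_probs):.6f}")
```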

Table 15 Statistics for each entropy coefficient value over 10 experiments

The analysis of Fig. 11 demonstrates the role of the entropy in the exploration/exploitation trade-off. Lower entropy coefficients yield faster convergence, sometimes to lower optima, which is confirmed by the average max objective and mean objective SE columns of Tables 15 and 16, respectively. The case without entropy has a higher \(\sigma \) (MO) between the experiments even though it is supposed to have the highest exploitation. This could be explained by the fact that this entropy coefficient is too low, reaching a different sub-optimal solution each time, depending on the seed that steers the initial random actions in different directions. The low mean objective SE and \(\sigma \)SE confirm this hypothesis. It converged sometimes to good and sometimes to poor local optima, demonstrating signs of instability.

Table 16 Mean Objective SE for the different cases with the \(\sigma \)SE

With a low mean objective SE but a high \(\sigma \)SE, value 0.01 is not converging and is still learning but has found the second best pattern overall behind 0.001. The standard deviations (\(\sigma \) (MO)) for the cases 0.001 and 0.01 throughout the experiments (Table 15) are low compared to the others meaning that they are more stable. Moreover, their \(\sigma \)SE (Table 16) is high, meaning that they are both learning. The sweet spot is therefore located between 0.001 and 0.01. However, the case 0.001 is on average reproducing patterns of high quality (highest Avg (MO) above - 5.00 in Table 15) and found an excellent pattern overall with the highest Best (MO).

The FT failed with a value of 0.267 and therefore no post-hoc tests were computed. More experiments are needed to reach statistical significance, but due to the reasons stated above, we could safely conclude that lowering the entropy coefficient to 0.001 is the way forward.

Table 17 Best patterns found for each case throughout all experiments
Fig. 12

Lower right quadrant of the best pattern found in the entropy coefficient study. On the left, from top to bottom: Assembly group number, fuel average enrichment, number of IFBA rods, and number of WABA rods. Blue is for twice-burned FAs, green for once-burned, and red for fresh (color pattern also utilized in ROSA [72]). In violet, locations where WABA cannot be placed. On the right, from top to bottom: Beginning of cycle \(k_{\infty }\) and assembly average burnup, peak pin burnup at EOC, and \(F_{\Delta h}\) when the maximum over the cycle is reached. R means reflector, where the water is

Looking at Table 17, we see again that, generally, a better LCOE yields tighter constraints, except for the cycle length, which could be higher. The 2D projection on the lower right quadrant of the best pattern found overall (with an entropy coefficient equal to 0.001) is given in Fig. 12. Similar to the best patterns showcased in Figs. 8 and 10, we can see typical rings of fire.

To conclude this section, we saw that the entropy coefficient is a hyper-parameter that controls the exploration/exploitation trade-off. Lower values yield faster convergence but are prone to local optima, because the final results may differ depending on the seeding, as different search directions are preferentially locked in early on. On the other hand, a high value will result in slower convergence. A sweet spot between the speed of convergence, the quality of the pattern found, and the stability of the learning of the agent (i.e., a low standard deviation throughout the experiments) was found with a value of 0.001. The latter helps find a pattern rapidly but also provides the possibility of finding an even better pattern with a longer optimization, thanks to the reasonably high \(\sigma \)SE at the last generation.

Fig. 13

Evolution of the Mean Score per episode (left) and Max Objective (right) overall found by the agents for the learning rate study

Table 18 Figures of merit for the learning rate study

Appendix B: Alternate hyper-parameters studies

In this appendix we provide the results of the alternate hyper-parameter studies alluded to in Section 2.3. The entropy coefficient is set at 0.01. The constraint weight, NF, and n_steps were set at 25,000, 1, and 4, respectively.
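For reference, a minimal sketch of how this baseline configuration maps onto the Stable Baselines PPO2 interface [33] is shown below; the environment is a placeholder, the parameter names are those of that library, and the mapping is our reading of the setup rather than the authors' exact code.

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

# Placeholder environment: the real one wraps the SIMULATE3-based loading-pattern
# problem with a constraint weight of 25,000 and NF = 1 pattern per episode.
env = gym.make("CartPole-v1")

model = PPO2(
    MlpPolicy, env,
    n_steps=4,              # steps per agent before a parameter update (Appendix A.3)
    ent_coef=0.01,          # entropy coefficient used as the baseline in Appendix B
    learning_rate=0.00025,  # default learning rate discussed in Appendix B.1
    cliprange=0.02,         # clipping range epsilon, the default of Appendix B.2
    vf_coef=0.5,            # value function coefficient (library default; varied in Appendix B.7)
    max_grad_norm=0.5,      # gradient-norm limit, the base case of Appendix B.3
    noptepochs=10,          # epochs of gradient ascent per update (Appendix B.4)
    nminibatches=4,         # number of minibatches per update (Appendix B.5)
)
model.learn(total_timesteps=43_000)  # roughly the one-day sample budget used in these studies
```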

1.1 B.1 Learning rate study

The learning rate is connected to the size of the steps taken by the gradient ascent algorithm that optimizes the loss of (3). Figure 13 showcases the evolution of the mean reward per episode for the different learning rate values studied. The behavior of the extreme cases is easily understood. The case 0.25 (green curve in the figure) reaches a relatively high mean score early on, probably because the learning step is large. However, no further learning occurs because the steps remain large and the agents cannot manage to improve the pessimistic lower bound. The cases 0.000025 and 0.0000025 (purple and brown in the figure, respectively) are really unstable and are not learning. This may be explained by the very small learning step: after one generation of samples, barely any learning happens and the policy remains rather random. Then, new samples are generated randomly, and the distribution of data is probably very different from that of the previous generation, on which again barely any learning occurs. This continues, which is why we see such oscillations with a very low mean score. The case 0.00025 is then the case with the smoothest learning. As highlighted in Table 18, the closest in performance would be case 0.0025, but its mean objective SE is really low in comparison with the default 0.00025. Therefore, we concluded that no further study needed to be performed.

Fig. 14

Evolution of the Mean Score per episode (left) and Max Objective (right) overall found by the agents for the \(\epsilon \) study

Table 19 Figures of merit for the \(\epsilon \) study

1.2 B.2 Clipping study

The clipping parameter \(\epsilon \) ensures that the probability ratio \(r_t(\theta ) = \frac{\pi _{\theta }}{\pi _{\theta _{old}}}\) does not lie outside \([1-\epsilon ,1+\epsilon ]\) in (2). Figure 14 showcases the evolution of the mean reward per episode for the different values of \(\epsilon \) studied. In all cases the learning appears smooth. Except for case 0.08, the higher the \(\epsilon \), the faster the learning early on, although the best mean objective is reached for 0.02. From 0.0005 to 0.02, there seems to be a linear improvement in the final mean score. The lower values enforce smaller policy updates, which seem to penalize the learning by converging to local optima, in other words by lacking exploration (recall that the learning is based on the samples generated by the policy at the previous step). Table 19 confirms this hypothesis. The case 0.08 seems to prevent learning of the policy, with an almost flat mean score from generation 2 to 45. Then it seems to improve linearly and ends up outperforming all cases from 0.0005 to 0.01. This could be a manifestation of the exploration/exploitation trade-off: 0.08 exhibits high exploration compared to the other cases, and after generation 45 it seems to start exploiting more. However, its performance remains very poor compared to the cases 0.02 to 0.05, and it even finds the worst pattern overall. The case 0.01 found a relatively good pattern, but with regard to its mean objective SE and \(\sigma \)SE, it is likely to have found it randomly.
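A minimal NumPy sketch of this clipped surrogate term is given below; the ratio and advantage values are illustrative.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon):
    """PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.minimum(unclipped, clipped)

# With a positive advantage the objective stops rewarding ratios beyond 1 + eps,
# so a small epsilon keeps each policy update close to the previous policy.
ratios = np.array([0.90, 1.01, 1.05, 1.30])
for eps in (0.0005, 0.02, 0.08):
    print(eps, clipped_surrogate(ratios, advantage=1.0, epsilon=eps))
```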

The only cases that compete with the default of 0.02 are 0.03 and 0.05. However, the differences are not significant and 0.02 still exhibits a better behavior. Therefore, we concluded that we did not need to perform a study on this hyper-parameter.

1.3 B.3 Maximum gradient norm study

The maximum gradient norm is such that \(||\nabla _{\theta }L^{final}|| \le \) max_grad_norm, where \(L^{final}\) is the loss of (3). This parameter controls the upper bound on the gradient of the loss in the gradient ascent algorithm. It is a classical technique to avoid catastrophic updates of the weights of a deep learning model, in the context of stochastic gradient descent for instance. Indeed, the estimation of the loss is always local, not global; therefore, the steps taken during training must be controlled, which is what limiting the gradient does. The clipping parameter \(\epsilon \) was initially derived to cover such a possibility in PPO [26], as discussed in Appendix B.2. If too many samples separated the different experiments, the mean objective SE and \(\sigma \)SE could differ too much and we would not be able to fairly compare them. Therefore, we limited the number of samples run during a day to a similar value of 43,000 to obtain a fairer comparison.
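Gradient-norm limiting of this kind can be sketched generically as follows; this is an illustration of the principle, not the library's internal code.

```python
import numpy as np

def clip_grad_by_global_norm(grads, max_grad_norm):
    """Rescale a list of gradient arrays so that their global L2 norm is <= max_grad_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= max_grad_norm:
        return grads
    scale = max_grad_norm / (global_norm + 1e-8)
    return [g * scale for g in grads]

# An oversized update is scaled down rather than applied as-is, preventing one noisy,
# purely local loss estimate from wrecking the policy/value network weights.
grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = 13
print(clip_grad_by_global_norm(grads, max_grad_norm=0.5))
```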

Fig. 15

Evolution of the Mean Score per episode (left) and Max Objective (right) overall found by the agents for the maximum gradient norm study

Table 20 Figures of merit for the maximum gradient norm study

Figure 15 showcases the evolution of the mean reward per episode for the different values of the maximum gradient norm studied. In all cases the learning appears smooth, and the differences are less drastic than for the \(\epsilon \) study. Most cases reached a feasible pattern, as displayed in Table 20. There is no linear behavior with this parameter, and the best cases in terms of SE metrics were 0.50 (the base case), 1.00, and 1.50. However, for 1.00 and 1.50 the \(\sigma \)SE is rather low, and because of our choice of a low entropy coefficient (see Appendix A.4), the final combination, if we were to study this hyper-parameter further, might lead to early convergence to worse local optima. Case 1.20 could also be competitive, as its SE values are close to those of case 0.5. Therefore, we decided to also study three cases where the entropy coefficient was divided by 10 (last three rows of Table 20). As expected, the \(\sigma \)SE is low, as is the mean objective SE, for cases 1.0 and 1.50. For case 1.20, the agent converged to a local optimum with a very low \(\sigma \)SE. These values are to be compared with Table 16, where the best case presented a higher mean objective SE and a higher \(\sigma \)SE. Therefore, we decided not to delve into more studies of the maximum gradient norm.

1.4 B.4 Number of epochs study

The number of epochs (noptepochs) controls the number of steps taken by the gradient ascent algorithm to maximize the surrogate loss of (3). Figure 16 showcases the evolution of the mean reward per episode for the different values studied. There is smooth learning for all cases except case 1 (purple on the left-hand side of Fig. 16). This might be due to the fact that a single step prevents much learning, so the weights of the neural architecture representing the policy do not converge to anything. The cases 5 and 10 largely outperform the others, which is confirmed by Table 21.
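The roles of the number of epochs (and of the number of minibatches studied in Appendix B.5) can be summarized by the generic PPO update loop sketched below; `ppo_gradient_step` is a hypothetical helper standing in for one gradient-ascent step on the surrogate loss.

```python
import numpy as np

def ppo_update(rollout, noptepochs, nminibatches, ppo_gradient_step):
    """Reuse one rollout for several epochs of gradient ascent on the PPO surrogate loss."""
    n_samples = len(rollout)
    batch_size = n_samples // nminibatches
    for _ in range(noptepochs):                      # B.4: passes over the same data
        order = np.random.permutation(n_samples)     # reshuffle samples each epoch
        for start in range(0, n_samples, batch_size):
            minibatch = [rollout[i] for i in order[start:start + batch_size]]
            ppo_gradient_step(minibatch)             # one ascent step on the clipped surrogate

# Example: 128 samples, 10 epochs, 4 minibatches -> 40 gradient steps per policy update.
ppo_update(list(range(128)), noptepochs=10, nminibatches=4, ppo_gradient_step=lambda mb: None)
```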

Fig. 16

Evolution of the Mean Score per episode (left) and Max Objective (right) overall found by the agents for the number of epochs study

Table 21 Figures of merit for the number of epochs study

On the other hand, early on, a lower number of epochs leads to faster learning. This may be counter-intuitive, as a higher value means more exploitation of the information in the samples generated by the policy. However, one could imagine that the original samples generated by the untrained policy are random, or at least represent low-quality patterns; therefore, the cases with a higher number of epochs exploit random or low-quality actions, resulting in poor learning. From case 10 to 30 the \(\sigma \)SE increases, indicating that more epochs lead to more exploration. The same holds for the Best (MO), and almost the same for the mean objective SE, with the exception of case 20. However, the figure on the left of Fig. 16 seems to indicate a random local drop in performance at the last generation for case 20, which may be due to the randomness of the policy, because it is clear from Fig. 16 that it has not converged yet. Table 22 seems to hint at why a higher number of epochs leads to a higher \(\sigma \)SE: the change between successive policies increases with it. This might be explained by the fact that the policy differs more after more steps of gradient ascent, hence increasing the likelihood of changing the policy more. Therefore, the steps taken are less controlled, yielding instability in learning. As a consequence, the entropy remains high (see Table 22) and the policy remains more random. The sweet spot is then between 5 and 10.

Only case 5 competed with the default case 10, but the differences were not significant enough, and we decided not to delve into any more studies.

Table 22 Loss related Figure of Merits for

1.5 B.5 study

This parameter (line 16 of Algorithm 1) is the number of mini-batches used in training during a step of the algorithm. Essentially, the size of the training batches is equal to the total number of samples collected per update divided by the number of mini-batches. As showcased in Fig. 17, all cases exhibit smooth learning. We stopped at 64 because no policy learning was observed for case 128, and greater values would not be possible since the total number of samples collected per update is equal to the latter.
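As a minimal illustration of this mechanic (the sample counts below are illustrative, not the exact configuration used in this work), the rollout can be shuffled and sliced into equally sized mini-batches as follows:

import numpy as np

def minibatch_indices(n_samples, n_minibatches, rng=None):
    # Shuffle one rollout of n_samples transitions and slice it into
    # n_minibatches mini-batches of n_samples // n_minibatches samples each.
    rng = rng if rng is not None else np.random.default_rng()
    assert n_samples % n_minibatches == 0, "rollout size must divide evenly into mini-batches"
    batch_size = n_samples // n_minibatches
    perm = rng.permutation(n_samples)
    return [perm[i * batch_size:(i + 1) * batch_size] for i in range(n_minibatches)]

# Illustrative numbers only: 128 samples split into 4 mini-batches of 32 indices each.
batches = minibatch_indices(n_samples=128, n_minibatches=4)
assert len(batches) == 4 and all(len(b) == 32 for b in batches)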

Overall, a higher value of this parameter leads to slower learning. However, there seems to be a sweet spot at the case equal to 4. There, the size of the batches is equal to the number of cores. Both SE measures are better, even though case 2 found a high-quality pattern with a difference in magnitude of only about 0.01 in the LCOE. It could be interesting to see what happens when the ratio is kept constant; we leave that for future work. Above 4, the behavior seems to remain linear. Only case 16 found a worse best pattern than case 32, but given the high \(\sigma \)SE in Table 23, it must have been found randomly.

Fig. 17

Evolution of the Mean Score per episode (left) and Max Objective (right) overall found by the agents for the study

Table 23 Figure of Merits for the study
Fig. 18

Evolution of the Mean Score per episode (left) and Max Objective (right) overall found by the agents for the study

Table 24 Figure of Merits for the study

1.6 B.6 \(\lambda \) and \(\gamma \) studies

In PPO, the advantage function is estimated via Generalized Advantage Estimation (GAE) [26]. The latter represents an exponentially weighted average of the advantage function with parameter \(\lambda \). This estimate suffers from the bias-variance trade-off, and the \(\lambda \) parameter helps control it. Greater values of \(\lambda \) decrease the bias of the estimator by allowing greater contributions from later states (up to max( , ) for each agent) at the cost of increasing the variance due to the presence of more terms. Because the length of the episode is 1, neither \(\lambda \) nor \(\gamma \) should influence the algorithm. The only differences observed when running different values must be due to the randomness of the policies, resulting in slightly different SE measures. Indeed, the advantage estimation becomes [75]:

$$\begin{aligned} A_{\pi _{\theta }}(s_0,a_0;\theta ) = r_0 + \gamma \mathbb {E}(V(s_1)) - V(s_0) = r_0 - V(s_0)\end{aligned}$$
(13)

\(\gamma \) is the discount factor in the cumulative reward, which is a sum over one term. Therefore, no studies are presented here.
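A short numerical sketch (written here with NumPy and generic variable names, not the authors' code) makes the point of (13) explicit: for a trajectory of length one with a terminal value of zero, GAE returns \(r_0 - V(s_0)\) regardless of \(\gamma \) and \(\lambda \).

import numpy as np

def gae_advantages(rewards, values, last_value, gamma, lam):
    # Generalized Advantage Estimation for one trajectory.
    # rewards: r_0..r_{T-1}; values: V(s_0)..V(s_{T-1}); last_value: V(s_T).
    values = np.append(np.asarray(values, dtype=float), last_value)
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# For a single-step episode the terminal value V(s_1) is zero, so the estimate
# collapses to r_0 - V(s_0) for any gamma and lambda, consistent with (13).
print(gae_advantages(rewards=[1.0], values=[0.4], last_value=0.0, gamma=0.99, lam=0.95))  # [0.6]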

1.7 B.7 study

This parameter is the weight assigned to the value function term in the PPO loss of (3). The algorithm was very sensitive to this term. It turned out that, for this particular case in the configuration studied, the number of samples had a big impact on the figures of merit found. We used 43,000 as the maximum number of samples. Figure 18 showcases the evolution of the mean score per episode and max objective for the study.
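To clarify where this weight enters, the sketch below shows a generic form of the combined PPO objective (a common formulation, not necessarily the exact loss of (3)); the function and argument names are illustrative.

import torch
import torch.nn.functional as F

def ppo_total_loss(policy_loss, values, returns, entropy, vf_coef, ent_coef):
    # Generic combined PPO objective (to be minimized): the clipped policy
    # surrogate, plus the value-function regression term weighted by vf_coef,
    # minus an entropy bonus weighted by ent_coef.
    value_loss = F.mse_loss(values, returns)        # critic fitted to the observed returns
    return policy_loss + vf_coef * value_loss - ent_coef * entropy.mean()

Under this form, increasing the coefficient shifts optimization effort toward fitting the critic rather than improving the policy surrogate, which is one plausible reading of the sensitivity discussed below.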

Fig. 19

Evolution of the Mean Score per episode (left) and Max Objective (right) overall found by the agents for the depth and breadth of the architecture studies

In all cases, learning is smooth. Case 2 finds a feasible pattern more slowly than the others, as if it spent more time exploring. It is interesting to see that, except for case 1.5, the lower the value of this parameter, the higher the mean objective SE and the lower the \(\sigma \)SE, as displayed in Table 24. Reducing the impact of the value function in the loss seems to increase exploitation. We first hypothesized that reducing the impact of the value function would speed up the convergence of the policy loss. Therefore, we tried to find an explanation by looking at the loss FOMs, as in Table 22, via Table 25. Nevertheless, no clear behavior emerged. The impact of this parameter cannot be explained by looking only at loss-related performance metrics, especially given the peculiar case 1.5. Several interesting observations can nevertheless be made if case 1.5 is set aside. Comparing the entropy of cases 0.2 to 1.0 versus case 1.2, we can say that a lower entropy does not result in a higher mean objective SE or \(\sigma \)SE. Similarly, for cases 0.2 to 0.8, a better policy or value function loss results in a higher mean objective SE and lower \(\sigma \)SE. Cases 1.0 and 2.0 both have a better policy loss but a much worse value function loss, resulting in a lower mean objective SE and higher \(\sigma \)SE. It seems that there is a trade-off between the losses but that the value loss has a greater impact. This is confirmed by comparing cases 2.0 and 1.2: case 2.0 exhibits worse SEs even though its policy loss improves about twice as much.

Table 25 Loss related Figure of Merits for

A nice trade-off could be to utilize a value of 0.8 for this parameter, because the mean objective SE is higher than for case 1.0 and the best pattern found is very close. Nevertheless, the \(\sigma \)SE is at 15.4157, and we made the choice of reducing the entropy in Appendix A.4, which already reduced the \(\sigma \)SE with a better mean objective SE. We would risk reaching a worse local optimum if we combined it with a lower value. To verify that, we also ran a case with the value function coefficient and the entropy coefficient equal to 0.8 and 0.001, respectively (last row of Table 24). There, the best pattern found was worse than in the base case, and the \(\sigma \)SE is much lower. Keep in mind that these hyper-parameters may be used to pursue further study, and the latter value is too low, meaning that the algorithm was already too close to convergence. Therefore, we decided not to pursue deeper studies with this parameter or another combination with the entropy coefficient.

1.8 B.8 Depth and breadth of the architecture studies

The depth of the neural architecture representing the policy corresponds to the number of hidden layers in the neural network. The breadth is the number of hidden nodes per layer, which is equal to 64 or 32. Figure 19 showcases the evolution of the mean score and max objective per episode for the different values of depth and breadth studied. In all cases learning is smooth. The effect of the number of hidden layers is small compared to that of the other hyper-parameters studied in this appendix. The breadth has an impact on learning, as for each case the mean objective SE is lower and the \(\sigma \)SE higher with the smaller breadth, as shown in Table 26. The best patterns found also seem to be better for a smaller number of nodes.
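The sketch below (PyTorch, with hypothetical input and action dimensions) shows how depth and breadth parameterize a policy network of the kind described here; it is an illustration, not the architecture used in this work.

import torch.nn as nn

def build_policy_mlp(obs_dim, n_actions, depth, breadth):
    # Policy network whose depth (number of hidden layers) and breadth
    # (hidden nodes per layer, e.g. 64 or 32) are the quantities varied here.
    layers, in_dim = [], obs_dim
    for _ in range(depth):
        layers += [nn.Linear(in_dim, breadth), nn.Tanh()]
        in_dim = breadth
    layers.append(nn.Linear(in_dim, n_actions))     # logits over the discrete action set
    return nn.Sequential(*layers)

# Hypothetical dimensions, for illustration only: a 3-layer, 32-node policy head.
policy_net = build_policy_mlp(obs_dim=100, n_actions=10, depth=3, breadth=32)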

Another interesting behavior is that the best patterns are found faster when the number of hidden nodes is divided by two, as confirmed by Fig. 19. However, the mean objective SE is smaller and the \(\sigma \)SE higher. What might have happened is that it took more samples for the cases with 64 hidden nodes to learn to reproduce better patterns, but they managed to produce patterns closer to a better optimum because of a better understanding of the search space. The number of hidden nodes therefore plays a role in the exploration/exploitation trade-off.

No case is close to convergence except case 4. There, the mean objective SE is lower than -5.00, meaning that a feasible pattern is not always found on average. What might have happened is that it overfitted toward lower optima due to its larger number of trainable parameters. On top of that, the behavior with respect to the number of hidden layers is opposite for 64 versus 32 nodes. For cases with 64 nodes, cases 2 and 3 have a lower mean objective SE and a higher \(\sigma \)SE, while the opposite holds for cases with 32 hidden nodes. Moreover, the case 4 + 32 is far from convergence yet found an excellent best pattern very rapidly. It is as if removing nodes per layer smoothed out its learning and enabled it to escape local optima. The default value of 2 exhibits a lower mean objective SE and higher \(\sigma \)SE, promising more potential for improvement and stability if utilized for further optimization. Overall, most of the cases competed with the default case 2, but the differences are not large enough, and we concluded that we did not need to delve into further studies.

Table 26 Figure of Merits for the depth and breadth of the architecture studied

Cite this article

Seurin, P., Shirvan, K. Assessment of reinforcement learning algorithms for nuclear power plant fuel optimization. Appl Intell 54, 2100–2135 (2024). https://doi.org/10.1007/s10489-023-05013-5
