Abstract
Existing methods for optimal control struggle to deal with the complexity commonly encountered in real-world systems, including dimensionality, process error, model bias and data heterogeneity. Instead of tackling these system complexities directly, researchers have typically sought to simplify models to fit optimal control methods. But when is the optimal solution to an approximate, stylized model better than an approximate solution to a more accurate model? While this question has largely gone unanswered owing to the difficulty of finding even approximate solutions for complex models, recent algorithmic and computational advances in deep reinforcement learning (DRL) might finally allow us to address it. DRL methods have to date been applied primarily in the context of games or robotic mechanics, which operate under precisely known rules. Here, we demonstrate the ability of DRL algorithms using deep neural networks to successfully approximate solutions (the “policy function” or control rule) in a non-linear three-variable model for a fishery without knowing or ever attempting to infer a model for the process itself. We find that the reinforcement learning agent discovers a policy that outperforms both constant escapement and constant mortality policies—the standard family of policies considered in fishery management. This DRL policy has the shape of a constant escapement policy whose escapement values depend on the stock sizes of other species in the model.
Notes
A repository with all the relevant code to reproduce our results may be found at https://github.com/boettiger-lab/approx-model-or-approx-soln in the “src” subdirectory. The data used is found in the “data” subdirectory, but the user may use the code provided to generate new data sets.
As will be explained later, all our models are stochastic. If we set stochasticity to zero in Model 1, CMort matches the performance of the other management strategies.
In our mathematical formulation of the decision problem, we have assumed for simplicity that the fishing effort cost is zero and that fish price is stable over time. This way, we equate economic output with harvested biomass.
In this sense, it is important to note that the classical management strategies we compare against have a similar flow of information: data are used to estimate a dynamical model, and that model is used to generate a policy function. The difference from our approach lies in *how* the model is used to optimize a policy. Because of this difference, RL-based approaches can produce good heuristic solutions for complex problems.
Transition operators are commonly discussed without having a direct time-dependence for simplicity, but the inclusion of t as an argument to T does not alter the structure of the learning problem appreciably.
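To make this concrete, the following minimal sketch (not the paper's actual simulation code; the drift coefficient, carrying capacity and noise scale are assumptions chosen for illustration) shows a one-step transition `T(s, a, t)` for a single stock, where the explicit time argument encodes non-stationarity. Folding `t` into the state vector recovers the standard time-independent formulation, which is why the learning problem is not appreciably altered.

```python
import numpy as np

def transition(state, action, t, rng):
    """Hypothetical one-step transition T(s, a, t) for a 1-D stock in [0, 1].

    The time argument encodes non-stationarity via a slowly drifting
    growth rate; all parameter values here are illustrative assumptions.
    """
    r = 0.8 - 0.001 * t           # assumed linear drift in growth rate
    K = 1.0                       # assumed carrying capacity
    s = max(state - action, 0.0)  # harvest is removed first
    # Logistic growth plus multiplicative process noise:
    s_next = s + r * s * (1.0 - s / K) + 0.05 * s * rng.normal()
    return float(np.clip(s_next, 0.0, 1.0))
```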
Policies are, in general, functions from state space to policy space. In our paper, these are \(\pi :[0,1]^{\times 3}\rightarrow {\mathbb {R}}_+\) for the single fishery case, and \(\pi :[0,1]^{\times 3}\rightarrow {\mathbb {R}}_+^2\) for two fisheries. The space of all such functions is highly singular, spanning a non-separable Hilbert space. Even restricting ourselves to continuous policy functions, we end up with a set of policies which span the infinite dimensional space \(L^2([0,1]^{\times 3})\). One way to avoid optimizing over an infinite dimensional ambient space is to discretize state space into a set of bins. This approach runs into tractability problems: First, the dimension of policy space scales exponentially with the number of species. Second, even for a fixed number of species (e.g., 3), the dimension optimized over can be prohibitively large—for example, if one uses 1000 bins for each population in a three-species model, the overall number of parameters being optimized over is \(10^9\). Neural networks with a much smaller number of parameters, on the other hand, can be quite expressive and sufficient to find a rather good (if not optimal) policy function.
All our agents were trained on a local server with two commercial GPUs. The training time was between 30 min and one hour in each case.
As noted before, here we equate economic profit with biomass caught. This is done as an approximation to convey the conceptual message more clearly, and we do not expect our results to significantly change if, e.g., “effort cost” is included in the reward function. When we refer to “large differences” in profit, or “paying dearly,” we mean that the ratio between average rewards is considerable—e.g. a 15% loss in profit.
The raw dataset is found at the data/results_data/2FISHERY/RXDRIFT sub-directory in the repository with the source code and data linked above. Scatter plots visualizing this policy are shown in Appendix B.
Acknowledgements
The title of this piece references a mathematical biology workshop at NIMBioS organized by Paul Armsworth, Alan Hastings, Megan Donahue, and Carl Towes in 2011 which first sought to emphasize ‘pretty darn good’ control solutions to more realistic problems over optimal control to idealized ones. This material is based upon work supported by the National Science Foundation under Grant No. DBI-1942280.
Appendices
Appendix: Results for Stationary Models
In the main text we focused on the non-stationary model (“three species, two fisheries, non-stationary” in Table 1) for the sake of space and because our results were most compelling there. Here we present the reward distributions for the other models considered—the three stationary models, rows 1–3 of Table 1. These results are shown in Figs. 11, 12 and 13.
Appendix: PPO Policy Function for Non-Stationary Model
In the main text, Fig. 7, we presented a visualization of the PPO+GP policy function obtained for the “three species, two fisheries, non-stationary” model. This policy function is a Gaussian process regression of scatter data of the PPO policy function. In Fig. 14 we present a representation of this scatter data in a similar format as Fig. 7.
Appendix: Gaussian Process Interpolation
Here we summarize the procedure used to interpolate the PPO policy (visualized in Fig. 14). We use the GaussianProcessRegressor object of the sklearn Python library with a kernel given by the sum of an RBF component, with a fixed length scale, and a white-noise component, with a fixed noise level.
This interpolation method is applied to scatter data of the PPO policy evaluated on 3 different grids on \((X,Y,Z)\) states: \(G_X\), a \(51\times 5 \times 5\) grid; \(G_Y\), a \(5\times 51 \times 5\) grid; and \(G_Z\), a \(5\times 5 \times 51\) grid. This combination of grids was used instead of a single dense grid in order to reduce the computational intensity of the interpolation procedure. For \(G_X\), the 5 values for \(Y\) and \(Z\) were varied in a “popular window,” i.e. episode time-series data was used to determine windows of \(Y\) and \(Z\) values which were most likely. The grids \(G_Y\) and \(G_Z\) were generated in a similar fashion, mutatis mutandis. The length scale and noise level values of this kernel were chosen arbitrarily—no hyperparameter tuning was needed to produce satisfactory interpolation, as will be shown in the results section.
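The interpolation step can be sketched as follows. This is not the paper's code (see the linked repository for that): the scatter data is replaced by a toy escapement-like rule, and the length scale and noise level are placeholder values standing in for the arbitrarily chosen ones mentioned above. Passing `optimizer=None` keeps the kernel hyperparameters fixed, matching the statement that no tuning was performed.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Stand-in for the PPO policy queried on the grids G_X, G_Y, G_Z:
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))     # (X, Y, Z) states in [0, 1]^3
y = np.clip(X[:, 0] - 0.3, 0.0, None)        # toy escapement-like policy

# RBF term (fixed length scale) plus white-noise term (fixed noise level);
# optimizer=None disables hyperparameter fitting, as in the text.
kernel = RBF(length_scale=0.3) + WhiteKernel(noise_level=1e-3)
gp = GaussianProcessRegressor(kernel=kernel, optimizer=None).fit(X, y)

query = np.array([[0.5, 0.2, 0.2]])
print(gp.predict(query))  # smoothed harvest value at the queried state
```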
Cite this article
Montealegre-Mora, F., Lapeyrolerie, M., Chapman, M. et al. Pretty Darn Good Control: When are Approximate Solutions Better than Approximate Models. Bull Math Biol 85, 95 (2023). https://doi.org/10.1007/s11538-023-01198-5