
Pretty Darn Good Control: When are Approximate Solutions Better than Approximate Models

  • Original Article
  • Published in: Bulletin of Mathematical Biology (2023)

Abstract

Existing methods for optimal control struggle to deal with the complexity commonly encountered in real-world systems, including dimensionality, process error, model bias and data heterogeneity. Instead of tackling these system complexities directly, researchers have typically sought to simplify models to fit optimal control methods. But when is the optimal solution to an approximate, stylized model better than an approximate solution to a more accurate model? This question has largely gone unanswered owing to the difficulty of finding even approximate solutions for complex models, but recent algorithmic and computational advances in deep reinforcement learning (DRL) might finally allow us to address it. DRL methods have to date been applied primarily in the context of games or robotic mechanics, which operate under precisely known rules. Here, we demonstrate the ability of DRL algorithms using deep neural networks to approximate solutions (the “policy function” or control rule) in a non-linear three-variable model for a fishery without knowing or ever attempting to infer a model for the process itself. We find that the reinforcement learning agent discovers a policy that outperforms both constant escapement and constant mortality policies, the standard family of policies considered in fishery management. This DRL policy has the shape of a constant escapement policy whose escapement values depend on the stock sizes of other species in the model.


Notes

  1. A repository with all the relevant code needed to reproduce our results may be found at https://github.com/boettiger-lab/approx-model-or-approx-soln in the “src” subdirectory. The data used are in the “data” subdirectory, and the code provided can also be used to generate new data sets.

  2. As will be explained later, all our models are stochastic. If we set stochasticity to zero in Model 1, CMort matches the performance of the other management strategies.

  3. In our mathematical formulation of the decision problem, we have assumed for simplicity that the fishing effort cost is zero and that fish price is stable over time. This way, we equate economic output with harvested biomass.

  4. In this sense, it is important to note that the classical management strategies we compare against have a similar flow of information: data are used to estimate a dynamical model, and that model is used to generate a policy function. The difference from our approach lies in *how* the model is used to optimize a policy. Because of this difference, RL-based approaches can produce good heuristic solutions for complex problems.

  5. Transition operators are commonly discussed without a direct time dependence for simplicity, but including t as an argument of T does not appreciably alter the structure of the learning problem.

  6. Policies are, in general, functions from state space to policy space. In our paper, these are \(\pi :[0,1]^{\times 3}\rightarrow {\mathbb {R}}_+\) for the single fishery case, and \(\pi :[0,1]^{\times 3}\rightarrow {\mathbb {R}}_+^2\) for two fisheries. The space of all such functions is highly singular, spanning a non-separable Hilbert space. Even restricting ourselves to continuous policy functions, we end up with a set of policies which span the infinite-dimensional space \(L^2([0,1]^{\times 3})\). One way to avoid optimizing over an infinite-dimensional ambient space is to discretize state space into a set of bins. This approach runs into tractability problems: First, the dimension of policy space scales exponentially with the number of species. Second, even for a fixed number of species (e.g., 3), the dimension optimized over can be prohibitively large—for example, if one uses 1000 bins for each population in a three-species model, the overall number of parameters being optimized over is \(10^9\). Neural networks with a much smaller number of parameters, on the other hand, can be quite expressive and sufficient to find a rather good (if not optimal) policy function; a schematic example of such a network is sketched after these notes.

  7. All our agents were trained on a local server with two commercial GPUs. The training time was between 30 min and one hour in each case; a schematic of such a training setup is sketched after these notes.

  8. https://docs.ray.io/

  9. As noted before, here we equate economic profit with biomass caught. This is done as an approximation to convey the conceptual message more clearly, and we do not expect our results to significantly change if, e.g., “effort cost” is included in the reward function. When we refer to “large differences” in profit, or “paying dearly,” we mean that the ratio between average rewards is considerable—e.g. a 15% loss in profit.

  10. The raw dataset is found in the data/results_data/2FISHERY/RXDRIFT subdirectory of the repository with the source code and data linked above. Scatter plots visualizing this policy are shown in Appendix B.
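To make the parameter-count comparison in Note 6 concrete, the sketch below builds a small multilayer perceptron mapping the three population states to two non-negative harvest controls and counts its parameters. The layer widths and output activation are illustrative choices, not the architecture used in our experiments.

```python
import torch.nn as nn

# Illustrative policy network for the two-fishery case:
# (X, Y, Z) in [0, 1]^3  ->  two non-negative harvest controls.
policy = nn.Sequential(
    nn.Linear(3, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, 2),
    nn.Softplus(),  # keeps both outputs non-negative
)

n_params = sum(p.numel() for p in policy.parameters())
print(n_params)  # 4546 parameters, versus ~10^9 for a 1000-bin-per-species grid
```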
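For orientation, a minimal RLlib training script in the spirit of Notes 7 and 8 might look as follows. The environment class ThreeSpeciesFisheryEnv and its module are hypothetical placeholders, the iteration count is arbitrary, and configuration method names differ somewhat across Ray versions; the actual setup is in our repository.

```python
import ray
from ray.rllib.algorithms.ppo import PPOConfig

from fishery_envs import ThreeSpeciesFisheryEnv  # hypothetical module defining the gym environment

ray.init()

config = (
    PPOConfig()
    .environment(env=ThreeSpeciesFisheryEnv)  # the harvest-decision environment
    .framework("torch")
    .resources(num_gpus=2)                    # two local GPUs, as in Note 7
)

algo = config.build()
for _ in range(300):        # placeholder number of training iterations
    algo.train()

checkpoint_path = algo.save()  # persist the learned PPO policy for evaluation
```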


Acknowledgements

The title of this piece references a mathematical biology workshop at NIMBioS, organized by Paul Armsworth, Alan Hastings, Megan Donahue, and Carl Towes in 2011, which first sought to emphasize ‘pretty darn good’ control solutions to more realistic problems over optimal control to idealized ones. This material is based upon work supported by the National Science Foundation under Grant No. DBI-1942280.

Author information

Correspondence to Carl Boettiger.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix: Results for Stationary Models

In the main text we focused on the non-stationary model (“three species, two fisheries, non-stationary” in Table 1) for the sake of space and because our results were most compelling there. Here we present the reward distributions for the other models considered, the three stationary models in rows 1–3 of Table 1. These results are shown in Figs. 11, 12 and 13.

Fig. 11: Reward distributions for the four strategies considered, based on 100 evaluation episodes of Model 3 in Table 1. CEsc denotes constant escapement, CMort constant mortality, PPO the output policy of the PPO optimization algorithm, and PPO GP the Gaussian process interpolation of the PPO policy

Fig. 12: Reward distributions for the four strategies considered, based on 100 evaluation episodes of Model 2 in Table 1. CEsc denotes constant escapement, CMort constant mortality, PPO the output policy of the PPO optimization algorithm, and PPO GP the Gaussian process interpolation of the PPO policy

Fig. 13: Reward distributions for the four strategies considered, based on 100 evaluation episodes of Model 1 in Table 1. CEsc denotes constant escapement, CMort constant mortality, PPO the output policy of the PPO optimization algorithm, and PPO GP the Gaussian process interpolation of the PPO policy

Appendix: PPO Policy Function for Non-Stationary Model

In the main text, Fig. 7, we presented a visualization of the PPO+GP policy function obtained for the “three species, two fisheries, non-stationary” model. This policy function is a Gaussian process regression of scatter data from the PPO policy function. In Fig. 14 we present this scatter data in a format similar to that of Fig. 7.

Fig. 14: Plots of the PPO policy \(\pi _{\textrm{PPO}}\) along several relevant axes. Here \(M_X\) and \(M_Y\) are the X and Y components of the policy function. The plotted values are generated as follows: for each variable X, Y, and Z, the time series of the evaluation episodes are used to determine a window of typical values that the variable attains when controlled by \(\pi _{\textrm{PPO}}\). Then, for each plot, either X or Y is evaluated at 100 values in [0, 1] along the x axis, while the other two variables (resp. Y and Z, or X and Z) are varied over 5 values within the typical window computed before. The value of one of these latter two variables is visualized as color. This scatter data was used as input to generate \(\pi _{\mathrm{PPO+GP}}\), visualized in the main text
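A schematic version of this scatter-generation procedure is sketched below. The policy object, the typical-value windows, and the example numbers are placeholders rather than the exact values used to produce Fig. 14.

```python
import numpy as np

def policy_scatter(pi, typical_windows, n_sweep=100, n_window=5):
    """Evaluate a policy pi(state) -> (M_X, M_Y) on a sweep of X in [0, 1]
    while Y and Z vary over their 'typical' windows (as estimated from
    evaluation-episode time series)."""
    xs = np.linspace(0.0, 1.0, n_sweep)
    ys = np.linspace(*typical_windows["Y"], n_window)
    zs = np.linspace(*typical_windows["Z"], n_window)
    rows = []
    for x in xs:
        for y in ys:
            for z in zs:
                m_x, m_y = pi(np.array([x, y, z]))
                rows.append((x, y, z, m_x, m_y))
    return np.array(rows)  # columns: X, Y, Z, M_X, M_Y

# Dummy policy standing in for pi_PPO, with placeholder windows:
dummy_pi = lambda s: (0.1 * s[0], 0.05 * s[1])
data = policy_scatter(dummy_pi, {"Y": (0.2, 0.4), "Z": (0.1, 0.3)})
print(data.shape)  # (100 * 5 * 5, 5)
```

The analogous sweeps over Y and Z follow the same pattern, mutatis mutandis.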

Appendix: Gaussian Process Interpolation

Here we summarize the procedure used to interpolate the PPO policy (visualized in Fig. 14). We use the GaussianProcessRegressor class of the sklearn Python library with a kernel given by

$$\begin{aligned} \text {RBF}(\text {length scale = 10}) + \text {WhiteNoise}(\text {noise level = 0.1}). \end{aligned}$$

This interpolation method is applied to scatter data of the PPO policy evaluated on three different grids of \((X,Y,Z)\) states: \(G_X\), a \(51\times 5 \times 5\) grid; \(G_Y\), a \(5\times 51 \times 5\) grid; and \(G_Z\), a \(5\times 5 \times 51\) grid. This combination of grids was used instead of a single dense grid in order to reduce the computational cost of the interpolation procedure. For \(G_X\), the 5 values for \(Y\) and \(Z\) were varied within a “popular window,” i.e., episode time-series data was used to determine the windows of \(Y\) and \(Z\) values which were most likely. The grids \(G_Y\) and \(G_Z\) were generated in a similar fashion, mutatis mutandis (see Note 10). The length scale and noise level values of this kernel were chosen arbitrarily—no hyperparameter tuning was needed to produce satisfactory interpolation, as shown in the results section.
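A minimal sketch of this interpolation step with scikit-learn is shown below. Note that scikit-learn's white-noise kernel is named WhiteKernel; the scatter data here is random placeholder data standing in for the PPO policy evaluated on the grids \(G_X\), \(G_Y\), and \(G_Z\).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Placeholder scatter data: rows of `states` are (X, Y, Z) grid points,
# rows of `actions` are the two harvest outputs of the PPO policy there.
states = rng.uniform(size=(3 * 51 * 5 * 5, 3))
actions = rng.uniform(size=(states.shape[0], 2))

kernel = RBF(length_scale=10.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, optimizer=None)  # keep the stated hyperparameters fixed (no tuning)
gp.fit(states, actions)

# The fitted regressor acts as the smooth interpolated policy pi_{PPO+GP}.
print(gp.predict(np.array([[0.3, 0.5, 0.2]])))
```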

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


Cite this article

Montealegre-Mora, F., Lapeyrolerie, M., Chapman, M. et al. Pretty Darn Good Control: When are Approximate Solutions Better than Approximate Models. Bull Math Biol 85, 95 (2023). https://doi.org/10.1007/s11538-023-01198-5
