Abstract
Equation learning methods present a promising tool to aid scientists in the modeling process for biological data. Previous equation learning studies have demonstrated that these methods can infer models from rich datasets; however, their performance in the presence of common challenges from biological data has not been thoroughly explored. We present an equation learning methodology composed of data denoising, equation learning, model selection, and post-processing steps that infers a dynamical systems model from noisy spatiotemporal data. We investigate the performance of this methodology in the face of several common challenges presented by biological data, namely, sparse data sampling, large noise levels, and heterogeneity between datasets. We find that this methodology can accurately infer the correct underlying equation and predict unobserved system dynamics from a small number of time samples when the data are sampled over a time interval exhibiting both linear and nonlinear dynamics. Our findings suggest that equation learning methods can be used for model discovery and selection in many areas of biology when an informative dataset is used. We use glioblastoma multiforme modeling as a case study to highlight how these results can inform data-driven, model-based predictions of tumor invasion.
References
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723. https://doi.org/10.1109/TAC.1974.1100705
Akaike H (1998) Information theory and an extension of the maximum likelihood principle. In: Parzen E, Tanabe K, Kitagawa G (eds) Selected papers of Hirotugu Akaike. Springer series in statistics. Springer, New York, pp 199–213. https://doi.org/10.1007/978-1-4612-1694-0_15
Baldock AL, Ahn S, Rockne R, Johnston S, Neal M, Corwin D, Clark-Swanson K, Sterin G, Trister AD, Malone H, Ebiana V, Sonabend AM, Mrugala M, Rockhill JK, Silbergeld DL, Lai A, Cloughesy T, Ii GMM, Bruce JN, Rostomily RC, Canoll P, Swanson KR (2014) Patient-specific metrics of invasiveness reveal significant prognostic benefit of resection in a predictable subset of gliomas. PLoS ONE 9(10):e99057. https://doi.org/10.1371/journal.pone.0099057
Banks HT, Sutton KL, Clayton Thompson W, Bocharov G, Roose D, Schenkel T, Meyerhans A (2011) Estimation of cell proliferation dynamics using CFSE data. Bull Math Biol 73(1):116–150. https://doi.org/10.1007/s11538-010-9524-5
Banks HT, Hu S, Thompson WC (2014) Modeling and inverse problems in the presence of uncertainty. Chapman and Hall, Boca Raton
Banks HT, Catenacci J, Hu S (2016) Use of difference-based methods to explore statistical and mathematical model discrepancy in inverse problems. J Inverse Ill Posed Probl 24(4):413–433
Boninsegna L, Nüske F, Clementi C (2018) Sparse learning of stochastic dynamical equations. J Chem Phys 148(24):241723
Bortz DM, Nelson PW (2006) Model selection and mixed-effects modeling of HIV infection dynamics. Bull Math Biol 68(8):2005–2025. https://doi.org/10.1007/s11538-006-9084-x
Brunton SL, Proctor JL, Kutz JN (2016) Discovering governing equations from data by sparse identification of nonlinear dynamical systems. PNAS 113(15):3932–3937. https://doi.org/10.1073/pnas.1517384113
Bühlmann P (2012) Bagging, boosting and ensemble methods. In: Gentle JE, Härdle WK, Mori Y (eds) Handbook of computational statistics: concepts and methods. Springer handbooks of computational statistics. Springer, Berlin, pp 985–1022. https://doi.org/10.1007/978-3-642-21551-3_33
Burnham KP, Anderson DR (2002) Model selection and multimodel inference: a practical information-theoretic approach, 2nd edn. Springer, New York
Dwyer G, Elkinton JS, Hajek AE (1998) Spatial scale and the spread of a fungal pathogen of gypsy moth. Am Nat 152(3):485–494. https://doi.org/10.1086/286185
Ferguson NM, Laydon D, Nedjati-Gilani G et al (2020) Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand. Pre-print. https://doi.org/10.25561/77482
Fisher RA (1937) The wave of advance of advantageous genes. Ann Eugen 7:353–369
Francis CRIC, Hurst RJ, Renwick JA (2003) Quantifying annual variation in catchability for commercial and research fishing. Fish Bull 101(2):293–304
Garcia-Ramos G, Rodriguez D (2002) Evolutionary speed of species invasions. Evolution 56(4):661–668. https://doi.org/10.1554/0014-3820(2002)056[0661:ESOSI]2.0.CO;2
Hastings A, Cuddington K, Davies KF, Dugaw CJ, Elmendorf S, Freestone A, Harrison S, Holland M, Lambrinos J, Malvadkar U, Melbourne BA, Moore K, Taylor C, Thomson D (2005) The spatial spread of invasions: new developments in theory and evidence. Ecol Lett 8(1):91–101. https://doi.org/10.1111/j.1461-0248.2004.00687.x
Hawkins-Daarud A, Johnston SK, Swanson KR (2019) Quantifying uncertainty and robustness in a biomathematical model-based patient-specific response metric for glioblastoma. JCO Clin Cancer Inform 3:1–8. https://doi.org/10.1200/CCI.18.00066
Hornik K (1991) Approximation capabilities of multilayer feedforward networks. Neural Netw 4(2):251–257. https://doi.org/10.1016/0893-6080(91)90009-T
Jin W, Shah ET, Penington CJ, McCue SW, Chopin LK, Simpson MJ (2016) Reproducibility of scratch assays is affected by the initial degree of confluence: experiments, modelling and model selection. J Theor Biol 390:136–145. https://doi.org/10.1016/j.jtbi.2015.10.040
Kaiser E, Kutz JN, Brunton SL (2018) Sparse identification of nonlinear dynamics for model predictive control in the low-data limit. Proc R Soc A 474(2219):20180335
Keskar NS, Mudigere D, Nocedal J, Smelyanskiy M, Tang PTP (2017) On large-batch training for deep learning: generalization gap and sharp minima. arXiv:1609.04836 [cs, math]
Kingma DP, Ba J (2017) Adam: a method for stochastic optimization. arXiv:1412.6980 [cs]
Kolmogoroff A, Petrovsky I, Piscounoff N (1937) Étude de l’équation de la diffusion avec croissance de la quantité de matière et son application à un problème biologique. Moscow Univ Bull Math 1:1–25
Lagergren JH, Nardini JT, Michael Lavigne G, Rutter EM, Flores KB (2020) Learning partial differential equations for biological transport models from noisy spatio-temporal data. Proc R Soc A 476(2234):20190800. https://doi.org/10.1098/rspa.2019.0800
LeVeque RJ (2007) Finite difference methods for ordinary and partial differential equations: steady-state and time-dependent problems. Society for Industrial and Applied Mathematics, Philadelphia
Lubina JA, Levin SA (1988) The spread of a reinvading species: range expansion in the California sea otter. Am Nat 131(4):526–543
Mangan NM, Kutz JN, Brunton SL, Proctor JL (2017) Model selection for dynamical systems via sparse regression and information criteria. Proc R Soc A 473(2204):20170009
Massey SC, White H, Whitmire P, Doyle T, Johnston SK, Singleton KW, Jackson PR, Hawkins-Daarud A, Bendok BR, Porter AB, Vora S, Sarkaria JN, Hu LS, Mrugala MM, Swanson KR (2020) Image-based metric of invasiveness predicts response to adjuvant temozolomide for primary glioblastoma. PLoS ONE 15(3):e0230492. https://doi.org/10.1371/journal.pone.0230492
Mori Y, Jilkine A, Edelstein-Keshet L (2008) Wave-pinning and cell polarity from a bistable reaction-diffusion system. Biophys J 94(9):3684–3697. https://doi.org/10.1529/biophysj.107.120824
Murray JD (2002) Mathematical biology I. An introduction, 3rd edn. Springer, New York
Nardini J, Bortz D (2018) Investigation of a structured Fisher’s equation with applications in biochemistry. SIAM J Appl Math 78(3):1712–1736. https://doi.org/10.1137/16M1108546
Nardini JT, Bortz DM (2019) The influence of numerical error on parameter estimation and uncertainty quantification for advective PDE models. Inverse Prob 35(6):065003. https://doi.org/10.1088/1361-6420/ab10bb
Nardini JT, Chapnick DA, Liu X, Bortz DM (2016) Modeling keratinocyte wound healing: cell-cell adhesions promote sustained migration. J Theor Biol 400:103–117. https://doi.org/10.1016/j.jtbi.2016.04.015
Neal ML, Trister AD, Ahn S, Baldock A, Bridge CA, Guyman L, Lange J, Sodt R, Cloke T, Lai A, Cloughesy TF, Mrugala MM, Rockhill JK, Rockne RC, Swanson KR (2013a) Response classification based on a minimal model of glioblastoma growth is prognostic for clinical outcomes and distinguishes progression from pseudoprogression. Cancer Res 73(10):2976–2986. https://doi.org/10.1158/0008-5472.CAN-12-3588
Neal ML, Trister AD, Cloke T, Sodt R, Ahn S, Baldock AL, Bridge CA, Lai A, Cloughesy TF, Mrugala MM, Rockhill JK, Rockne RC, Swanson KR (2013b) Discriminating survival outcomes in patients with glioblastoma using a simulation-based, patient-specific response metric. PLoS ONE 8(1):1–7. https://doi.org/10.1371/journal.pone.0051951
Ozik J, Collier N, Wozniak JM, Macal C, Cockrell C, Friedman SH, Ghaffarizadeh A, Heiland R, An G, Macklin P (2018) High-throughput cancer hypothesis testing with an integrated PhysiCell-EMEWS workflow. BMC Bioinform 19(18):483. https://doi.org/10.1186/s12859-018-2510-x
Perretti C, Munch S, Sugihara G (2013) Model-free forecasting outperforms the correct mechanistic model for simulated and experimental data. PNAS 110:5253–5257
Raissi M, Karniadakis GE (2018) Hidden physics models: machine learning of nonlinear partial differential equations. J Comput Phys 357:125–141. https://doi.org/10.1016/j.jcp.2017.11.039
Rockne RC, Trister AD, Jacobs J, Hawkins-Daarud AJ, Neal ML, Hendrickson K, Mrugala MM, Rockhill JK, Kinahan P, Krohn KA, Swanson KR (2015) A patient-specific computational model of hypoxia-modulated radiation resistance in glioblastoma using 18F-FMISO-PET. J R Soc Interface 12(103):20141174. https://doi.org/10.1098/rsif.2014.1174
Rudy SH, Brunton SL, Proctor JL, Kutz JN (2017) Data-driven discovery of partial differential equations. Sci Adv 3(4):e1602614. https://doi.org/10.1126/sciadv.1602614
Rutter EM, Stepien TL, Anderies BJ, Plasencia JD, Woolf EC, Scheck AC, Turner GH, Liu Q, Frakes D, Kodibagkar V et al (2017) Mathematical analysis of glioma growth in a murine model. Sci Rep 7(2508):1–16
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464. https://doi.org/10.1214/aos/1176344136
Ulmer M, Ziegelmeier L, Topaz CM (2019) A topological approach to selecting models of biological experiments. PLoS ONE 14(3):e0213679. https://doi.org/10.1371/journal.pone.0213679
Urban MC, Phillips BL, Skelly DK, Shine R, Wiens AEJJ, DeAngelis EDL (2008) A toad more traveled: the heterogeneous invasion dynamics of cane toads in Australia. Am Nat 171(3):E134–E148. https://doi.org/10.1086/527494
Walter E, Pronzato L (1990) Qualitative and quantitative experiment design for phenomenological models: a survey. Automatica 26(2):195–213. https://doi.org/10.1016/0005-1098(90)90116-Y
Wang CH, Rockhill JK, Mrugala M, Peacock DL, Lai A, Jusenius K, Wardlaw JM, Cloughesy T, Spence AM, Rockne R, Alvord EC, Swanson KR (2009) Prognostic significance of growth kinetics in newly diagnosed glioblastomas revealed by combining serial imaging with a novel biomathematical model. Cancer Res 69(23):9133–9140. https://doi.org/10.1158/0008-5472.CAN-08-3863
Warne DJ, Baker RE, Simpson MJ (2019) Using experimental data and information criteria to guide model selection for reaction–diffusion problems in mathematical biology. Bull Math Biol 81(6):1760–1804. https://doi.org/10.1007/s11538-019-00589-x
Zhang T (2009) Adaptive forward-backward greedy algorithm for sparse learning with linear models. In: Advances in neural information processing systems, pp 1921–1928
Zhang S, Lin G (2018) Robust data-driven discovery of governing physical laws with error bars. Proc R Soc A 474(2217):20180305
Zhang S, Lin G (2019) Robust subsampling-based sparse Bayesian inference to tackle four challenges (large noise, outliers, data integration, and extrapolation) in the discovery of physical laws from data. arXiv:1907.07788 [cs, stat]
Acknowledgements
Funding was provided by National Science Foundation (Grant Nos. 1638521, IOS-1838314), National Institute on Aging (Grant No. R21AG059099), National Institutes of Health (Grant No. U01CA220378), James S. McDonnell Foundation (Grant No. 220020264) and Engineering and Physical Sciences Research Council (Grant No. EP/N50970X/1).
This material was based upon work partially supported by the National Science Foundation under Grant DMS-1638521 to the Statistical and Applied Mathematical Sciences Institute and IOS-1838314 to KBF, and in part by National Institute on Aging Grant R21AG059099 to KBF. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. BM gratefully acknowledges Ph.D. studentship funding from the UK EPSRC (reference EP/N50970X/1). AHD, LC, and KRS gratefully acknowledge funding through the NIH U01CA220378 and the James S. McDonnell Foundation 220020264.
Appendices
Appendix A: Simulating a Learned Equation
To simulate the inferred equation represented by the sparse vector \({\hat{\xi }}\), we begin by removing all zero terms from \({\hat{\xi }}\) as well as the corresponding terms from \(\varTheta \). We can now define our inferred dynamical systems model as
$$u_t = \sum _{i} {\hat{\xi }}_i \varTheta _i.$$
We use the method of lines approach to simulate this equation, in which we discretize the right-hand side in space and then integrate along the t dimension. The SciPy integration subpackage (version 1.4.1) is used to integrate this equation over time using an explicit fourth-order Runge–Kutta method. We ensure that the simulation is stable by enforcing the CFL condition for an advection equation with speed \(2\sqrt{Dr}\), i.e., \(2\sqrt{Dr}\Delta t \le \Delta x\). Some inferred equations may not be well-posed, e.g., \(u_t=-u_{xx}\). If the time integration fails at any point, we manually set the model output to \(10^6\) everywhere to ensure that this model is not selected as a final inferred model.
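The simulation procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `fisher_kpp_rhs` and `simulate`, the grid, the parameter values, the zero-flux boundary treatment, and the use of `solve_ivp` with `method="RK45"` (SciPy's explicit Runge–Kutta pair of order 5(4)) are all choices made here for illustration. The CFL-type bound is imposed through `max_step`, which is one possible way to respect \(2\sqrt{Dr}\Delta t \le \Delta x\) with an adaptive integrator.

```python
import numpy as np
from scipy.integrate import solve_ivp


def fisher_kpp_rhs(D, r, dx):
    """Return f(t, u) for u_t = D u_xx + r u (1 - u), discretized in space."""
    def rhs(t, u):
        # Mirror the neighbouring values so the central difference
        # encodes zero-flux boundaries (an assumption of this sketch).
        up = np.concatenate(([u[1]], u, [u[-2]]))
        u_xx = (up[2:] - 2.0 * up[1:-1] + up[:-2]) / dx**2
        return D * u_xx + r * u * (1.0 - u)
    return rhs


def simulate(rhs, u0, t_eval, max_step=np.inf, fail_value=1e6):
    """Integrate in time; if integration fails, fill the output with fail_value."""
    try:
        # Raise on overflow or invalid operations instead of producing inf/nan.
        with np.errstate(over="raise", invalid="raise"):
            sol = solve_ivp(rhs, (t_eval[0], t_eval[-1]), u0,
                            method="RK45", t_eval=t_eval, max_step=max_step)
        if sol.success and np.all(np.isfinite(sol.y)):
            return sol.y
    except (FloatingPointError, ValueError):
        pass
    return np.full((u0.size, t_eval.size), fail_value)


# Hypothetical grid and parameters, chosen only for illustration.
x = np.linspace(-10.0, 10.0, 201)
dx = x[1] - x[0]
D, r = 0.5, 1.0
u0 = 0.5 * np.exp(-x**2)
t_eval = np.linspace(0.0, 5.0, 6)

# CFL-type bound from the text: 2 * sqrt(D r) * dt <= dx.
dt_max = dx / (2.0 * np.sqrt(D * r))
u_good = simulate(fisher_kpp_rhs(D, r, dx), u0, t_eval, max_step=dt_max)

# Ill-posed example from the text, u_t = -u_xx (D = -1, r = 0).
u_bad = simulate(fisher_kpp_rhs(-1.0, 0.0, dx), u0, t_eval)
```

In this sketch, `u_good` holds the simulated profiles at the requested times, while `u_bad` is filled with \(10^6\) because the backward heat equation blows up during integration, so a downstream model selection step would assign it a very large error.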
For the final inferred columns of \(\varTheta =[\varTheta _1 , \varTheta _2 , \dots , \varTheta _n]\), we define nonlinear stencils, \(A_{\varTheta _i}\), such that \(A_{\varTheta _i}u\approx \varTheta _i\). As an example, we use an upwind stencil (LeVeque 2007) for first-order derivative terms, such as \(A_{u_x}\), so that \(A_{u_x}u\approx u_x\). We use a central difference stencil for \(A_{u_{xx}}\). For multiplicative terms, we define the stencil for \(A_{uu_x}\) as \(A_{uu_x}v=u\odot (A_{u_x}v),\) where \(\odot \) denotes element-wise multiplication so that \(A_{uu_x}u \approx uu_x\). Similarly, we set \(A_{u_xu_{xx}}=A_{u_x}A_{u_{xx}}\), etc.
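The stencils above can be illustrated with a short NumPy sketch. Everything named here is an assumption for illustration: the function names, the choice of a backward difference as the upwind direction, the one-sided boundary rows, the library term labels, and the coefficient values in `xi_hat` (chosen to correspond to a Fisher–KPP equation with \(D=0.5\) and \(r=1\)). The sketch builds matrix stencils for \(u_x\) and \(u_{xx}\), forms the multiplicative term \(u\odot (A_{u_x}u)\), drops the zero entries of a sparse coefficient vector, and assembles the right-hand side from the remaining terms.

```python
import numpy as np


def upwind_first_derivative(n, dx):
    """Matrix A_{u_x}: backward difference inside, forward difference at the left edge."""
    A = (np.eye(n) - np.eye(n, k=-1)) / dx
    A[0, :2] = np.array([-1.0, 1.0]) / dx
    return A


def central_second_derivative(n, dx):
    """Matrix A_{u_xx}: central difference inside, nearest interior stencil at the edges."""
    A = (np.eye(n, k=-1) - 2.0 * np.eye(n) + np.eye(n, k=1)) / dx**2
    A[0, :3] = np.array([1.0, -2.0, 1.0]) / dx**2
    A[-1, -3:] = np.array([1.0, -2.0, 1.0]) / dx**2
    return A


n = 401
x = np.linspace(0.0, 2.0 * np.pi, n)
dx = x[1] - x[0]
A_ux = upwind_first_derivative(n, dx)
A_uxx = central_second_derivative(n, dx)

# Each library term maps the state u to an approximation of that term.
stencils = {
    "u": lambda u: u,
    "u^2": lambda u: u * u,
    "u_x": lambda u: A_ux @ u,
    "u_xx": lambda u: A_uxx @ u,
    "u*u_x": lambda u: u * (A_ux @ u),  # A_{uu_x} v = u ⊙ (A_{u_x} v) with v = u
}

# Hypothetical sparse coefficient vector (illustrative values only).
xi_hat = {"u": 1.0, "u^2": -1.0, "u_x": 0.0, "u_xx": 0.5, "u*u_x": 0.0}

# Remove zero terms, keeping only the library columns that survive.
active = {name: c for name, c in xi_hat.items() if c != 0.0}


def learned_rhs(u):
    """Evaluate the right-hand side as the weighted sum of the surviving terms."""
    return sum(c * stencils[name](u) for name, c in active.items())


# Check the stencils on a smooth test function.
u = np.sin(x)
err_ux = np.max(np.abs(A_ux @ u - np.cos(x)))
err_uxx = np.max(np.abs((A_uxx @ u + np.sin(x))[1:-1]))
```

On this grid, the first-order upwind stencil is noticeably less accurate than the second-order central stencil, which is the usual trade-off for the added stability of upwinding on advective terms.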
Appendix B: Learning the 1d Fisher–KPP Equation with 5% Noisy Data
In Table 5, we present the inferred equations for all 1d datasets considered with \(\sigma = 0.05\).
The slow simulation on the short time interval. For noisy data sampled over the short time interval for the slow simulation, our equation learning methodology does not infer the correct underlying equation for any values of N considered. Simulating the inferred equation for \(N=10\) time samples over the short time scale does not lead to an accurate description of the true underlying dynamics on the short time interval or prediction of the true dynamics on the long time interval.
The slow simulation on the long time interval. Over the long time interval, our equation learning methodology does infer the Fisher–KPP equation with \(N=10\) time samples. Simulating the inferred equation for \(N=10\) time samples over the long time scale accurately matches the true underlying dynamics on the long time interval and accurately predicts the true dynamics on the short time interval.
The diffuse simulation on the short time interval. For noisy data sampled over the short time interval for the diffuse simulation, our equation learning methodology does not infer the correct underlying equation for any values of N considered. Simulating the inferred equation for \(N=10\) time samples over the short time scale does not lead to an accurate description of the true underlying dynamics on the short time interval or prediction of the true dynamics on the long time interval.
The diffuse simulation on the long time interval. Over the long time interval, our equation learning methodology does infer the Fisher–KPP equation with \(N=3\) time samples. Simulating the inferred equation for \(N=10\) time samples over the long time scale accurately matches the true underlying dynamics on the long time interval and accurately predicts the true dynamics on the short time interval (Fig. 11 in “Appendix C”).
The fast simulation on the short time interval. For noisy data sampled over the short time interval for the fast simulation, our equation learning methodology infers the Fisher–KPP equation with \(N=10\) time samples. Simulating the inferred equation for \(N=10\) time samples over the short time scale accurately matches the true underlying dynamics on the short time interval and accurately predicts the true dynamics on the long time interval (Fig. 11 in “Appendix C”).
The fast simulation on the long time interval. Over the long time interval, our equation learning methodology does not infer the correct underlying equation for any values of N considered. Simulating the inferred equation for \(N=10\) time samples over the long time scale does not lead to an accurate description of the true underlying dynamics on the long time interval or prediction of the true dynamics on the short time interval.
The nodular simulation on the short time interval. For noisy data sampled over the short time interval for the nodular simulation, our equation learning methodology infers the Fisher–KPP equation with \(N=3\) time samples. Simulating the inferred equation for \(N=10\) time samples over the short time scale does not lead to an accurate description of the true underlying dynamics on the short time interval or prediction of the true dynamics on the long time interval.
The nodular simulation on the long time interval. Over the long time interval, our equation learning methodology infers the Fisher–KPP equation with \(N=10\) time samples. Simulating the inferred equation for \(N=10\) time samples over the long time scale accurately matches the true underlying dynamics on the long time interval and accurately predicts the true dynamics on the short time interval.
Appendix C: Fit and Predicted Dynamics
The fit and predicted system dynamics for the diffuse, fast, and nodular simulations with 1% noise and \(N=5\) time samples are depicted in Figs. 8, 9, and 10, respectively. The fit and predicted dynamics for the diffuse and fast simulations with 5% noise and \(N=10\) time samples are depicted in Fig. 11.
Cite this article
Nardini, J.T., Lagergren, J.H., Hawkins-Daarud, A. et al. Learning Equations from Biological Data with Limited Time Samples. Bull Math Biol 82, 119 (2020). https://doi.org/10.1007/s11538-020-00794-z
DOI: https://doi.org/10.1007/s11538-020-00794-z