Learning Equations from Biological Data with Limited Time Samples

Nardini, John T.; Lagergren, John H.; Hawkins-Daarud, Andrea; Curtin, Lee; Morris, Bethan; Rutter, Erica M.; Swanson, Kristin R.; Flores, Kevin B.

doi:10.1007/s11538-020-00794-z

Original Article
Published: 09 September 2020

Bulletin of Mathematical Biology, Volume 82, article number 119 (2020)

John T. Nardini¹,² (ORCID: orcid.org/0000-0002-5503-1934),
John H. Lagergren¹,
Andrea Hawkins-Daarud³,
Lee Curtin³,
Bethan Morris⁴,
Erica M. Rutter⁵,
Kristin R. Swanson³ &
Kevin B. Flores¹


Abstract

Equation learning methods present a promising tool to aid scientists in the modeling process for biological data. Previous equation learning studies have demonstrated that these methods can infer models from rich datasets; however, the performance of these methods in the presence of common challenges from biological data has not been thoroughly explored. We present an equation learning methodology comprised of data denoising, equation learning, model selection and post-processing steps that infers a dynamical systems model from noisy spatiotemporal data. The performance of this methodology is thoroughly investigated in the face of several common challenges presented by biological data, namely, sparse data sampling, large noise levels, and heterogeneity between datasets. We find that this methodology can accurately infer the correct underlying equation and predict unobserved system dynamics from a small number of time samples when the data are sampled over a time interval exhibiting both linear and nonlinear dynamics. Our findings suggest that equation learning methods can be used for model discovery and selection in many areas of biology when an informative dataset is used. We focus on glioblastoma multiforme modeling as a case study in this work to highlight how these results are informative for data-driven modeling-based tumor invasion predictions.


References

Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723. https://doi.org/10.1109/TAC.1974.1100705
Akaike H (1998) Information theory and an extension of the maximum likelihood principle. In: Parzen E, Tanabe K, Kitagawa G (eds) Selected papers of Hirotugu Akaike. Springer series in statistics. Springer, New York, pp 199–213. https://doi.org/10.1007/978-1-4612-1694-0_15
Baldock AL, Ahn S, Rockne R, Johnston S, Neal M, Corwin D, Clark-Swanson K, Sterin G, Trister AD, Malone H, Ebiana V, Sonabend AM, Mrugala M, Rockhill JK, Silbergeld DL, Lai A, Cloughesy T, Ii GMM, Bruce JN, Rostomily RC, Canoll P, Swanson KR (2014) Patient-specific metrics of invasiveness reveal significant prognostic benefit of resection in a predictable subset of gliomas. PLoS ONE 9(10):e99057. https://doi.org/10.1371/journal.pone.0099057
Banks HT, Sutton KL, Clayton Thompson W, Bocharov G, Roose D, Schenkel T, Meyerhans A (2011) Estimation of Cell Proliferation Dynamics Using CFSE Data. Bull Math Biol 73(1):116–150. https://doi.org/10.1007/s11538-010-9524-5
Banks HT, Hu S, Thompson WC (2014) Modeling and inverse problems in the presence of uncertainty. Chapman and Hall, Boca Raton
Banks HT, Catenacci J, Hu S (2016) Use of difference-based methods to explore statistical and mathematical model discrepancy in inverse problems. J Inverse Ill Posed Probl 24(4):413–433
Boninsegna L, Nüske F, Clementi C (2018) Sparse learning of stochastic dynamical equations. J Chem Phys 148(24):241723
Bortz DM, Nelson PW (2006) Model selection and mixed-effects modeling of HIV infection dynamics. Bull Math Biol 68(8):2005–2025. https://doi.org/10.1007/s11538-006-9084-x
Brunton SL, Proctor JL, Kutz JN (2016) Discovering governing equations from data by sparse identification of nonlinear dynamical systems. PNAS 113(15):3932–3937. https://doi.org/10.1073/pnas.1517384113
Bühlmann P (2012) Bagging, boosting and ensemble methods. In: Gentle JE, Härdle WK, Mori Y (eds) Handbook of computational statistics: concepts and methods. Springer handbooks of computational statistics. Springer, Berlin, pp 985–1022. https://doi.org/10.1007/978-3-642-21551-3_33
Burnham KP, Anderson DR (2002) Model selection and multimodel inference: a practical information-theoretic approach, 2nd edn. Springer, New York
Dwyer G, Elkinton JS, Hajek AE (1998) Spatial scale and the spread of a fungal pathogen of gypsy moth. Am Nat 152(3):485–494. https://doi.org/10.1086/286185
Ferguson NM, Laydon D, Nedjati-Gilani G et al (2020) Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand. Pre-print. https://doi.org/10.25561/77482
Fisher RA (1937) The wave of advance of advantageous genes. Ann Eugen 7:353–369
Francis CRIC, Hurst RJ, Renwick JA (2003) Quantifying annual variation in catchability for commercial and research fishing. Fish Bull 101(2):293–304
Garcia-Ramos G, Rodriguez D (2002) Evolutionary speed of species invasions. Evolution 56(4):661–668. https://doi.org/10.1554/0014-3820(2002)056[0661:ESOSI]2.0.CO;2
Hastings A, Cuddington K, Davies KF, Dugaw CJ, Elmendorf S, Freestone A, Harrison S, Holland M, Lambrinos J, Malvadkar U, Melbourne BA, Moore K, Taylor C, Thomson D (2005) The spatial spread of invasions: new developments in theory and evidence. Ecol Lett 8(1):91–101. https://doi.org/10.1111/j.1461-0248.2004.00687.x
Hawkins-Daarud A, Johnston SK, Swanson KR (2019) Quantifying uncertainty and robustness in a biomathematical model-based patient-specific response metric for glioblastoma. JCO Clin Cancer Inform 3:1–8. https://doi.org/10.1200/CCI.18.00066
Hornik K (1991) Approximation capabilities of multilayer feedforward networks. Neural Netw 4(2):251–257. https://doi.org/10.1016/0893-6080(91)90009-T
Jin W, Shah ET, Penington CJ, McCue SW, Chopin LK, Simpson MJ (2016) Reproducibility of scratch assays is affected by the initial degree of confluence: experiments, modelling and model selection. J Theor Biol 390:136–145. https://doi.org/10.1016/j.jtbi.2015.10.040
Kaiser E, Kutz JN, Brunton SL (2018) Sparse identification of nonlinear dynamics for model predictive control in the low-data limit. Proc R Soc A 474(2219):20180335
Keskar NS, Mudigere D, Nocedal J, Smelyanskiy M, Tang PTP (2017) On large-batch training for deep learning: generalization gap and sharp minima. arXiv:1609.04836 [cs, math]
Kingma DP, Ba J (2017) Adam: a method for stochastic optimization. arXiv:1412.6980 [cs]
Kolmogoroff A, Petrovsky I, Piscounoff N (1937) Etude de l’equation de la diffusion avec croissance de la quantite de matiere et son application a un probleme biologique. Moscow Univ Bull Math 1:1–25
Lagergren JH, Nardini JT, Michael Lavigne G, Rutter EM, Flores KB (2020) Learning partial differential equations for biological transport models from noisy spatio-temporal data. Proc R Soc A 476(2234):20190800. https://doi.org/10.1098/rspa.2019.0800
LeVeque RJ (2007) Finite difference methods for ordinary and partial differential equations: steady-state and time-dependent problems. Society for Industrial and Applied Mathematics, Philadelphia
Lubina JA, Levin SA (1988) The spread of a reinvading species: range expansion in the California sea otter. Am Nat 131(4):526–543
Mangan NM, Kutz JN, Brunton SL, Proctor JL (2017) Model selection for dynamical systems via sparse regression and information criteria. Proc R Soc A 473(2204):20170009
Massey SC, White H, Whitmire P, Doyle T, Johnston SK, Singleton KW, Jackson PR, Hawkins-Daarud A, Bendok BR, Porter AB, Vora S, Sarkaria JN, Hu LS, Mrugala MM, Swanson KR (2020) Image-based metric of invasiveness predicts response to adjuvant temozolomide for primary glioblastoma. PLoS ONE 15(3):e0230492. https://doi.org/10.1371/journal.pone.0230492
Mori Y, Jilkine A, Edelstein-Keshet L (2008) Wave-pinning and cell polarity from a bistable reaction-diffusion system. Biophys J 94(9):3684–3697. https://doi.org/10.1529/biophysj.107.120824
Murray JD (2002) Mathematical biology I. An introduction, 3rd edn. Springer, New York
Nardini J, Bortz D (2018) Investigation of a structured Fisher’s equation with applications in biochemistry. SIAM J Appl Math 78(3):1712–1736. https://doi.org/10.1137/16M1108546
Nardini JT, Bortz DM (2019) The influence of numerical error on parameter estimation and uncertainty quantification for advective PDE models. Inverse Prob 35(6):065003. https://doi.org/10.1088/1361-6420/ab10bb
Nardini JT, Chapnick DA, Liu X, Bortz DM (2016) Modeling keratinocyte wound healing: cell-cell adhesions promote sustained migration. J Theor Biol 400:103–117. https://doi.org/10.1016/j.jtbi.2016.04.015
Neal ML, Trister AD, Ahn S, Baldock A, Bridge CA, Guyman L, Lange J, Sodt R, Cloke T, Lai A, Cloughesy TF, Mrugala MM, Rockhill JK, Rockne RC, Swanson KR (2013a) Response classification based on a minimal model of glioblastoma growth is prognostic for clinical outcomes and distinguishes progression from pseudoprogression. Cancer Res 73(10):2976–2986. https://doi.org/10.1158/0008-5472.CAN-12-3588
Neal ML, Trister AD, Cloke T, Sodt R, Ahn S, Baldock AL, Bridge CA, Lai A, Cloughesy TF, Mrugala MM, Rockhill JK, Rockne RC, Swanson KR (2013b) Discriminating survival outcomes in patients with glioblastoma using a simulation-based, patient-specific response metric. PLoS ONE 8(1):1–7. https://doi.org/10.1371/journal.pone.0051951
Ozik J, Collier N, Wozniak JM, Macal C, Cockrell C, Friedman SH, Ghaffarizadeh A, Heiland R, An G, Macklin P (2018) High-throughput cancer hypothesis testing with an integrated PhysiCell-EMEWS workflow. BMC Bioinform 19(18):483. https://doi.org/10.1186/s12859-018-2510-x
Perretti C, Munch S, Sugihara G (2013) Model-free forecasting outperforms the correct mechanistic model for simulated and experimental data. PNAS 110:5253–5257
Raissi M, Karniadakis GE (2018) Hidden physics models: Machine learning of nonlinear partial differential equations. J Comput Phys 357:125–141. https://doi.org/10.1016/j.jcp.2017.11.039
Rockne RC, Trister AD, Jacobs J, Hawkins-Daarud AJ, Neal ML, Hendrickson K, Mrugala MM, Rockhill JK, Kinahan P, Krohn KA, Swanson KR (2015) A patient-specific computational model of hypoxia-modulated radiation resistance in glioblastoma using 18F-FMISO-PET. J R Soc Interface 12(103):20141174. https://doi.org/10.1098/rsif.2014.1174
Rudy SH, Brunton SL, Proctor JL, Kutz JN (2017) Data-driven discovery of partial differential equations. Sci Adv 3(4):e1602614. https://doi.org/10.1126/sciadv.1602614
Rutter EM, Stepien TL, Anderies BJ, Plasencia JD, Woolf EC, Scheck AC, Turner GH, Liu Q, Frakes D, Kodibagkar V et al (2017) Mathematical analysis of glioma growth in a murine model. Sci Rep 7(2508):1–16
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464. https://doi.org/10.1214/aos/1176344136
Ulmer M, Ziegelmeier L, Topaz CM (2019) A topological approach to selecting models of biological experiments. PLoS ONE 14(3):e0213679. https://doi.org/10.1371/journal.pone.0213679
Urban MC, Phillips BL, Skelly DK, Shine R, Wiens AEJJ, DeAngelis EDL (2008) A toad more traveled: the heterogeneous invasion dynamics of cane toads in Australia. Am Nat 171(3):E134–E148. https://doi.org/10.1086/527494
Walter E, Pronzato L (1990) Qualitative and quantitative experiment design for phenomenological models: a survey. Automatica 26(2):195–213. https://doi.org/10.1016/0005-1098(90)90116-Y
Wang CH, Rockhill JK, Mrugala M, Peacock DL, Lai A, Jusenius K, Wardlaw JM, Cloughesy T, Spence AM, Rockne R, Alvord EC, Swanson KR (2009) Prognostic significance of growth kinetics in newly diagnosed glioblastomas revealed by combining serial imaging with a novel biomathematical model. Cancer Res 69(23):9133–9140. https://doi.org/10.1158/0008-5472.CAN-08-3863
Warne DJ, Baker RE, Simpson MJ (2019) Using experimental data and information criteria to guide model selection for reaction–diffusion problems in mathematical biology. Bull Math Biol 81(6):1760–1804. https://doi.org/10.1007/s11538-019-00589-x
Zhang T (2009) Adaptive forward-backward greedy algorithm for sparse learning with linear models. In: Advances in neural information processing systems, pp 1921–1928
Zhang S, Lin G (2018) Robust data-driven discovery of governing physical laws with error bars. Proc R Soc A 474(2217):20180305
Zhang S, Lin G (2019) Robust subsampling-based sparse Bayesian inference to tackle four challenges (large noise, outliers, data integration, and extrapolation) in the discovery of physical laws from data. arXiv:1907.07788 [cs, stat]


Acknowledgements

Funding was provided by National Science Foundation (Grant Nos. 1638521, IOS-1838314), National Institute on Aging (Grant No. R21AG059099), National Institutes of Health (Grant No. U01CA220378), James S. McDonnell Foundation (Grant No. 220020264) and Engineering and Physical Sciences Research Council (Grant No. EP/N50970X/1).

Author information

Authors and Affiliations

Department of Mathematics, North Carolina State University, Raleigh, NC, USA
John T. Nardini, John H. Lagergren & Kevin B. Flores
The Statistical and Applied Mathematical Sciences Institute, Durham, NC, USA
John T. Nardini
Mathematical NeuroOncology Laboratory, Precision Neurotherapeutics Innovation Program, Mayo Clinic, Phoenix, AZ, USA
Andrea Hawkins-Daarud, Lee Curtin & Kristin R. Swanson
Centre for Mathematical Medicine and Biology, University of Nottingham, Nottingham, UK
Bethan Morris
Department of Applied Mathematics, University of California, Merced, Merced, CA, USA
Erica M. Rutter


Corresponding author

Correspondence to John T. Nardini.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This material was based upon work partially supported by the National Science Foundation under Grant DMS-1638521 to the Statistical and Applied Mathematical Sciences Institute and IOS-1838314 to KBF, and in part by National Institute of Aging Grant R21AG059099 to KBF. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. BM gratefully acknowledges Ph.D. studentship funding from the UK EPSRC (reference EP/N50970X/1). AHD, LC, and KRS gratefully acknowledge funding through the NIH U01CA220378 and the James S. McDonnell Foundation 220020264.

Appendices

Simulating a Learned Equation

To simulate the inferred equation represented by the sparse vector ${\hat{\xi }}$, we begin by removing all zero terms from ${\hat{\xi }}$ as well as the corresponding terms from $\varTheta $. We can now define our inferred dynamical systems model as

$$\begin{aligned} u_t = \sum _i \xi _i \varTheta _i. \end{aligned} \tag{13}$$
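The pruning step and the evaluation of the right-hand side of Eq. (13) can be sketched as follows. This is a minimal illustration rather than the authors' code: the helper names `prune_library` and `rhs_from_library`, the four-column library, and the numerical values are all hypothetical.

```python
import numpy as np


def prune_library(xi_hat, theta, names):
    """Drop the zero entries of xi_hat and the matching columns of theta."""
    xi_hat = np.asarray(xi_hat, dtype=float)
    keep = xi_hat != 0.0  # boolean mask of the terms that survive
    kept_names = [name for name, k in zip(names, keep) if k]
    return xi_hat[keep], theta[:, keep], kept_names


def rhs_from_library(xi, theta):
    """Evaluate u_t = sum_i xi_i * Theta_i at every sample point."""
    return theta @ xi


# Hypothetical library with columns [u, u^2, u_x, u_xx] at 4 sample points.
names = ["u", "u^2", "u_x", "u_xx"]
theta = np.arange(16, dtype=float).reshape(4, 4)
xi_hat = np.array([2.0, -2.0, 0.0, 0.5])  # sparse vector, one zero term

xi, theta_kept, kept = prune_library(xi_hat, theta, names)
u_t = rhs_from_library(xi, theta_kept)
print(kept)  # names of the terms retained in the inferred model
```

Because the removed coefficients are exactly zero, the pruned product gives the same values as the full product `theta @ xi_hat`; pruning only reduces the number of terms that need to be discretized.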

We use the method of lines approach to simulate this equation, in which we discretize the right-hand side in space and then integrate along the $t$ dimension. The SciPy integration subpackage (version 1.4.1) is used to integrate this equation over time using an explicit fourth-order Runge–Kutta method. We ensure that the simulation is stable by enforcing the CFL condition for an advection equation with speed $2\sqrt{Dr}$, i.e., $2\sqrt{Dr}\Delta t \le \Delta x$. Some inferred equations may not be well-posed, e.g., $u_t=-u_{xx}$. If the time integration fails at any point, we manually set the model output to $10^6$ everywhere to ensure that this model is not selected as a final inferred model.
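A method of lines simulation of this kind might look like the sketch below, shown for the Fisher–KPP equation $u_t = Du_{xx} + ru(1-u)$. Several choices here are assumptions and are not taken from the paper: the function name `simulate_fisher_kpp`, the parameter values, the zero-flux boundary treatment, and the use of `scipy.integrate.solve_ivp` with `method="RK45"` (an adaptive explicit Runge–Kutta pair) as a stand-in for the explicit fourth-order scheme, with `max_step` set from the CFL bound.

```python
import numpy as np
from scipy.integrate import solve_ivp

FAILED_OUTPUT = 1e6  # value assigned everywhere when integration fails


def simulate_fisher_kpp(u0, x, t_eval, D, r):
    """Method of lines for u_t = D u_xx + r u (1 - u) on a uniform grid."""
    dx = x[1] - x[0]
    # CFL bound for an advection equation with speed 2*sqrt(D*r)
    dt_max = dx / (2.0 * np.sqrt(D * r))

    def rhs(t, u):
        u_xx = np.empty_like(u)
        u_xx[1:-1] = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / dx**2
        # Zero-flux boundaries via reflected ghost points (an assumption)
        u_xx[0] = 2.0 * (u[1] - u[0]) / dx**2
        u_xx[-1] = 2.0 * (u[-2] - u[-1]) / dx**2
        return D * u_xx + r * u * (1.0 - u)

    with np.errstate(all="ignore"):  # silence overflow warnings on blow-up
        sol = solve_ivp(rhs, (t_eval[0], t_eval[-1]), u0, method="RK45",
                        t_eval=t_eval, max_step=dt_max)

    # Penalize failed or non-finite runs so the model is never selected
    if not sol.success or not np.all(np.isfinite(sol.y)):
        return np.full((u0.size, t_eval.size), FAILED_OUTPUT)
    return sol.y


# Hypothetical setup: small initial bump, D = 0.01, r = 1
x = np.linspace(0.0, 10.0, 101)
t_eval = np.linspace(0.0, 5.0, 6)
u0 = 0.1 * np.exp(-(x - 5.0) ** 2)
out = simulate_fisher_kpp(u0, x, t_eval, D=0.01, r=1.0)
print(out.shape)  # one column per requested time sample
```

Returning a constant array of $10^6$ on failure gives an ill-posed candidate a very large error against the data, which is what keeps it out of the final selection.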

For the final inferred columns of $\varTheta =[\varTheta _1 , \varTheta _2 , \dots , \varTheta _n]$, we define nonlinear stencils, $A_{\varTheta _i}$, such that $A_{\varTheta _i}u\approx \varTheta _i$. As an example, we use an upwind stencil (LeVeque 2007) for first-order derivative terms, such as $A_{u_x}$, so that $A_{u_x}u\approx u_x$. We use a central difference stencil for $A_{u_{xx}}$. For multiplicative terms, we define the stencil for $A_{uu_x}$ as $A_{uu_x}v=u\odot (A_{u_x}v),$ where $\odot $ denotes element-wise multiplication so that $A_{uu_x}u \approx uu_x$. Similarly, we set $A_{u_xu_{xx}}=A_{u_x}A_{u_{xx}}$, etc.
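The stencils for $u_x$, $u_{xx}$, and $uu_x$ can be illustrated with dense matrices on a uniform grid. The sketch below is not the authors' implementation: the function names, the choice of a backward difference as the upwind direction (appropriate for a positive advection speed), and the boundary rows are assumptions made for illustration.

```python
import numpy as np


def upwind_matrix(n, dx):
    """First-order backward difference: (A u)_j = (u_j - u_{j-1}) / dx."""
    A = (np.eye(n) - np.eye(n, k=-1)) / dx
    # Forward difference in the first row, which has no left neighbor (assumption)
    A[0, :] = 0.0
    A[0, 0], A[0, 1] = -1.0 / dx, 1.0 / dx
    return A


def central_second_matrix(n, dx):
    """Central difference: (A u)_j = (u_{j+1} - 2 u_j + u_{j-1}) / dx^2."""
    A = (np.eye(n, k=1) - 2.0 * np.eye(n) + np.eye(n, k=-1)) / dx**2
    # Reuse the neighboring interior stencil at the two boundaries (assumption)
    A[0, :] = A[1, :]
    A[-1, :] = A[-2, :]
    return A


def apply_u_ux(u, v, A_ux):
    """Nonlinear stencil A_{u u_x} v = u * (A_{u_x} v), element-wise product."""
    return u * (A_ux @ v)


# Check the stencils on u = x^2, where u_x = 2x, u_xx = 2, and u u_x = 2x^3
x = np.linspace(0.0, 1.0, 201)
dx = x[1] - x[0]
u = x**2
A_ux = upwind_matrix(x.size, dx)
A_uxx = central_second_matrix(x.size, dx)

err_ux = np.max(np.abs(A_ux @ u - 2.0 * x))
err_uxx = np.max(np.abs(A_uxx @ u - 2.0))
err_uux = np.max(np.abs(apply_u_ux(u, u, A_ux) - 2.0 * x**3))
print(err_ux, err_uxx, err_uux)
```

For this quadratic test function the upwind error is of size $\Delta x$, as expected for a first-order stencil, while the central second difference is exact up to rounding. In practice a sparse matrix format would be preferable to the dense arrays used here.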

Table 5 Learned 1d equations from our equation learning methodology for all simulations with 5% noisy data


Learning the 1d Fisher–KPP Equation with 5% Noisy Data

In Table 5, we present the inferred equations for all 1d datasets considered with $\sigma = 0.05$.

The slow simulation on the short time interval. For noisy data sampled over the short time interval for the slow simulation, our equation learning methodology does not infer the correct underlying equation for any values of N considered. Simulating the inferred equation for $N=10$ time samples over the short time scale does not lead to an accurate description of the true underlying dynamics on the short time interval or prediction of the true dynamics on the long time interval.

The slow simulation on the long time interval. Over the long time interval, our equation learning methodology does infer the Fisher–KPP equation with $N=10$ time samples. Simulating the inferred equation for $N=10$ time samples over the long time scale accurately matches the true underlying dynamics on the long time interval and accurately predicts the true dynamics on the short time interval.

The diffuse simulation on the short time interval. For noisy data sampled over the short time interval for the diffuse simulation, our equation learning methodology does not infer the correct underlying equation for any values of N considered. Simulating the inferred equation for $N=10$ time samples over the short time scale does not lead to an accurate description of the true underlying dynamics on the short time interval or prediction of the true dynamics on the long time interval.

The diffuse simulation on the long time interval. Over the long time interval, our equation learning methodology does infer the Fisher–KPP equation with $N=3$ time samples. Simulating the inferred equation for $N=10$ time samples over the long time scale accurately matches the true underlying dynamics on the long time interval and accurately predicts the true dynamics on the short time interval (Fig. 11 in “Appendix C”).

The fast simulation on the short time interval. For noisy data sampled over the short time interval for the fast simulation, our equation learning methodology infers the Fisher–KPP equation with $N=10$ time samples. Simulating the inferred equation for $N=10$ time samples over the short time scale accurately matches the true underlying dynamics on the short time interval and accurately predicts the true dynamics on the long time interval (Fig. 11 in “Appendix C”).

The fast simulation on the long time interval. Over the long time interval, our equation learning methodology does not infer the correct underlying equation for any values of N considered. Simulating the inferred equation for $N=10$ time samples over the long time scale does not lead to an accurate description of the true underlying dynamics on the long time interval or prediction of the true dynamics on the short time interval.

The nodular simulation on the short time interval. For noisy data sampled over the short time interval for the nodular simulation, our equation learning methodology infers the Fisher–KPP equation with $N=3$ time samples. Simulating the inferred equation for $N=10$ time samples over the short time scale does not lead to an accurate description of the true underlying dynamics on the short time interval or prediction of the true dynamics on the long time interval.

The nodular simulation on the long time interval. Over the long time interval, our equation learning methodology infers the Fisher–KPP equation with $N=10$ time samples. Simulating the inferred equation for $N=10$ time samples over the long time scale accurately matches the true underlying dynamics on the long time interval and accurately predicts the true dynamics on the short time interval.

Fit and Predicted Dynamics

The fit and predicted system dynamics for the diffuse, fast, and nodular simulations with 1% noise and $N=5$ time samples are depicted in Figs. 8, 9, and 10, respectively. The fit and predicted dynamics for the diffuse and fast simulations with 5% noise and $N=10$ time samples are depicted in Fig. 11.

About this article

Cite this article

Nardini, J.T., Lagergren, J.H., Hawkins-Daarud, A. et al. Learning Equations from Biological Data with Limited Time Samples. Bull Math Biol 82, 119 (2020). https://doi.org/10.1007/s11538-020-00794-z


Received: 19 May 2020
Accepted: 16 August 2020
Published: 09 September 2020
DOI: https://doi.org/10.1007/s11538-020-00794-z
