Mathematical Modelling of Parasite Dynamics: A Stochastic Simulation-Based Approach and Parameter Estimation via Modified Sequential-Type Approximate Bayesian Computation

The development of mathematical models for studying newly emerging and re-emerging infectious diseases has gained momentum due to global events. The gyrodactylid-fish system, like many host-parasite systems, serves as a valuable resource for ecological, evolutionary, and epidemiological investigations owing to its ease of experimental manipulation and long-term monitoring. Although this system has an existing individual-based model, it falls short in capturing information about species-specific microhabitat preferences and other biological details for different Gyrodactylus strains across diverse fish populations. This current study introduces a new individual-based stochastic simulation model that uses a hybrid \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau $$\end{document}τ-leaping algorithm to incorporate this essential data, enhancing our understanding of the complexity of the gyrodactylid-fish system. We compare the infection dynamics of three gyrodactylid strains across three host populations. A modified sequential-type approximate Bayesian computation (ABC) method, based on sequential Monte Carlo and sequential importance sampling, is developed. Additionally, we establish two penalised local-linear regression methods (based on L1 and L2 regularisations) for ABC post-processing analysis to fit our model using existing empirical data. With the support of experimental data and the fitted mathematical model, we address open biological questions for the first time and propose directions for future studies on the gyrodactylid-fish system. The adaptability of the mathematical model extends beyond the gyrodactylid-fish system to other host-parasite systems. Furthermore, the modified ABC methodologies provide efficient calibration for other multi-parameter models characterised by a large set of correlated or independent summary statistics. Supplementary Information The online version contains supplementary material available at 10.1007/s11538-024-01281-5.

1 Supplementary S1: Pseudo-codes of exact simulation and τ -leaping for the CTMC simulation model This section provides the pseudo-codes for both the exact Stochastic Simulation Algorithm (SSA) and the hybrid τ -leaping algorithm (Algorithms 1 and 2) designed for the multidimensional Continuous-Time Markov Chain (CTMC) model, aimed at simulating the gyrodactylid infection dynamics of a single fish (based on the stochastic scheme defined by Table 1 in the main paper).All indices in the algorithms are defined in a manner consistent with the specifications outlined in the main paper.
Output: Number of parasites at each body region over time, X j (t) = 2 k=1 A (h) j,k (t) for fish h; host survival status (alive: x = 1; dead: x = 2) while t < t final and x = 1 do Set initial time t = t 0 ; state A 1,1 = A 0 = 2 & zero elsewhere; immune state I j = I 0 = 1; host survival status x = 1.
Determine the higher-level events to occur at a host's body regions using a random number u from U nif orm(0, a 0 ) at a probability equal to a δ a0 such that: if u < a 1 then update A j,k = A j,k + 1 (birth event occurs) end if a 1 < u < a 1 + a 2 then update A j,k = A j,k − 1 (death event occurs) end if a 1 + a 2 + a 3 + a 4 < u < a 1 + a 2 + a 3 + a 4 + a 5 then set I j = 2 (immune response event occurs) else set x = 2 and break (host mortality event occurs) end Generate time increment τ SSA from Exponential(a 0 ), and update the time such that t = t + τ SSA .
Output: Number of parasites at each body region over time, X j (t) = j,k (t) for fish h; host survival status (alive: x = 1; dead: x = 2) 1 while t < t final and x = 1 do 2 Set initial time t = t 0 ; state A 1,1 = A 0 = 2 & zero elsewhere; immune state I j = I 0 = 1; host survival status x = 1.

3
Calculate the total event rates a δ A for all higher-level events δ = 1, 2, set t = t + τ leap and determine the higher-level events to occur at a host's body regions using a random number u from U nif orm(0, a 0 ) such that: Set x = 2 and break (host mortality event occurs) end if a 6 < u < a 5 + a 6 then Set I j = 2 (immune response event occurs) where P ∼ P oisson(a δ τ leap ) and v δ is state-change vector (birth, death and forward movement events occur) In this section, we present results regarding the impact of the error bound, denoted as (0 < 1), on the simulation output for the hybrid τ -leaping simulation model.We explore the trade-off between simulation accuracy and computational speed under fixed parameter values (Table 1).The analysis is based on 100 different simulation realisations or repetitions, each corresponding to the nine observed parasite-fish groups (given fish sex, fish size, fish stock, and parasite strain).The simulation accuracy is quantified using the mean square error, given by equation where X(g) leap,r (t) and X(g) SSA,r (t) are the mean parasite numbers over time t from hybrid τ − leaping and exact SSA, respectively; across each of the nine parasitefish groups (g) and simulation realisation (r).The respective confidence intervals of the mean over time between the two simulation methods were also compared at 0 < < 0.1 for each parasite-fish group over time.The simulation speed was quantified by the computational time (computer's CPU time measured in seconds).
The investigation revealed a reduction in simulation accuracy in the hybrid τ -leaping algorithm as the error bound ( ) increased from = 0.002 to = 0.1 (refer to Figure 1).Notably, mean parasite numbers from the hybrid τ -leaping simulations remained relatively consistent with the exact Stochastic Simulation Algorithm (SSA) at error bounds within the range 0.002 ≤ ≤ 0.01 across the nine parasite-fish groups (see Figures 2-6), including their respective confidence intervals.However, as the error bound increased beyond = 0.02, the performance of the τ -leaping algorithm deteriorated across the parasite-fish groups, particularly approaching = 0.1 (see Figures 7-11).Figure 12 indicates that at = 0.008 or = 0.01, the τ -leaping algorithm exhibited relatively fast performance, although not significantly different from the computational time of the exact SSA.This unexpected observation may be attributed to either the number of simulation repetitions (100 repetitions) or the relatively small number of parasites over time at the pre-specified parameter values (randomly chosen) such that the leaping condition was not frequently met.Based on considerations of simulation accuracy, = 0.002 remains a reasonable choice for the error bound in subsequent simulations using the multidimensional stochastic model.

Supplementary S3: Projection of parasite numbers after fish mortality
This section presents the regression function used to project parasite numbers after host mortality in the CTMC simulation model defined in the main paper.Now, when a host dies at time t ≥ 1, it is often observed in most biological systems that the entire parasite population on the host becomes extinct before time t + 1.However, a study on survival and behaviour of Gyrodactylus turnbulli and G. bullatarudis on dead fish revealed that under certain conditions, such as high parasite abundance at the time of death, these parasites might survive on a dead host for a while [2].Thus, after premature host mortality (before the end of the infection period), we cannot assume that there is parasite population extinction in all body regions, depending on the parasite abundance at the time of death.

The projection function
During such realisations when an infected fish dies before the end of the observa- Let y t = total parasites at time t, z t = log(y t ), k be the time before fish mortality, α a regression parameter and γ ∈ (0, 1), a tuning parameter.Given we want to estimate or project for ẑk+1 , • • • , ẑ9 .Now, let the proposed least squares regression equation, denoted by m(t), be defined as given by equation 2; with the estimate of the regression parameter (α) determined such that Thus, the prediction of where i is the number of predictions across time.Then, Hence, also letting y t,u be parasite at location u and time t implies the expected projections are defined such that Remark.The estimate of the regression parameter (α) defined by equation 3 can be derived explicitly using Theorem 1 (for the theorem and its mathematical proof, see section 3.2).Due to the high computational demand in simulating data from our complex stochastic model as well as implementing our weighted-iterative ABC for model fitting (especially during the computation of the high-dimensional summary statistics), we tuned or set the value of γ at ≥ 0.9 (based on all observed fish that prematurely died), which we found to be a reasonable choice based on some heuristic technique which achieved some degree of convergence when estimating α (defined by Theorem 1 under section 3.2) at varying γ values ranging from 0 to 1 across all the different parasite-host groups.Figure 13 shows the convergence effect of varying values of the tuning parameter γ ∈ (0, 1) on the estimate of the regression coefficient α (defined by equation 7) across all parasite-fish groups (based on all observed fish that prematurely died as indicated by different colours).The exact least squares estimate of the regression parameter α based on the linear regression equation 2 is given as: Proof of Theorem 1 . Suppose whilst fixing γ, and the loss function Differentiating equation 8 with respect to α and setting to zero gives
4 Supplementary S4: Assessing the modified ABC and regression adjustment using a numerical experiment

Introduction
In this section, we present the results of a numerical experiment using our stochas- Twumasi [4] also found from the simple simulation experiment that the computational cost (or cost in time) of the modified ABC-SMC sampler increases quadratically as the number of proposal draws increases from N = 500 to N = 5000 during the ABC fitting of the toy model (run in parallel using over 20 CPU cores of a multi-core processor).Based on these findings, it suggests that it will be costeffective to fit the complex stochastic simulation (described in the main paper) at any smaller values N ≥ 500 due to the high computational cost associated with simulations from our complex stochastic model (especially during realisations of parasite population explosion), estimation of the multidimensional summary statistics (across the entire host population), and the potential cost of implementing the modified ABC algorithm at higher values of N 500.
Generally, there was no significant difference in the accuracy of the unadjusted and ridge-adjusted posterior mean estimates (for instance) at the different values of N , and the true posterior mean estimates were found in their respective estimated credible intervals.However, the mean square error (MSE) of the adjusted posterior mean based on the proposed regression correction (with L2 regularisation) resulted in a relatively smaller MSE and credible interval width most of the time (especially at N ≤ 1000).In the main paper, an ABC regression adjustment method based on an L1 regularisation procedure has also been proposed for comparison with the ridge-adjusted method.Section 4.2 presents ABC results from a different numerical experiment based on our stochastic simulation model at N = 1500 given pseudo-observed data (generated from our simulation model).

Results of the numerical experiment based on our stochastic simulation model 4.2.1 ABC fitting based on pseudo-observed data
Pseudo-observed data were simulated from our stochastic model at predefined parameter values (on a logarithmic scale).The model was then fitted to the pseudo-observed data using our modified ABC-SMC algorithm (i.e., weightediterative ABC).Figures 14-16 illustrate a consistent reduction in the quantiles of ABC distances (which quantify the discrepancy between simulated and pseudoobserved data) over ABC time steps at any predefined quantile or percentile value (at 500 ≤ N ≤ 1500).This decreasing pattern indicates an enhanced performance or convergence of the modified ABC-SMC algorithm towards a good approximation of the true posterior distribution.This improvement aligns with findings from the unadjusted ABC marginal density plot (Figure 17), suggesting that as the ABC sequential updates unfold, the simulated data progressively gets closer to the pseudo-observed data, reflecting posterior convergence over time.
The resulting posterior was further adjusted using the two proposed penalised regression-adjusted methods (which employ L1 and L2 regularisation procedures).
Figure 18 indicates the presence of high multicollinearity between some of the regression predictors (in the neighbourhood of the observed summaries); confirming why the standard Beaumont et al. [1]                 The PCA plot presented in Figure 26 illustrates the variability and hierarchical relationship between pseudo-observed and simulated data within a reduceddimensional space.Simulations derived from the unadjusted posterior exhibit spatial concentration within the pseudo-observed data.In contrast, the pseudoobserved data aligns more closely with simulations based on the ridge-adjusted posterior, and the latter set is contained within simulations derived from the lassoadjusted posterior.This observation implies that the simulated data from the ridge-adjusted posterior are either much closer to the pseudo-observed data or exhibit lower discrepancy compared to simulations derived from the lasso-adjusted posterior.Meanwhile, simulations from the unadjusted posterior appear to underestimate the pseudo-observed data, a conclusion supported by the comparison of parasite mean intensities, as depicted in Figure 27. Figure 28 confirms that there is no statistical difference between the pseudo-observed data and the simulated data derived from the ridge-adjusted posterior concerning the distribution of parasites across the observation time points.Therefore, the posterior correction method utilising ridge adjustment exhibited a superior model fit compared to the lasso adjustment method, thereby enhancing the unadjusted posterior.However, it is essential to note that this conclusion may or may not consistently align with the actual observed data during the ABC fitting process.

Fig. 1 :
Fig. 1: Mean square error from the Hybrid τ -leaping algorithm at different error bounds.
tion period in the CTMC simulation model or observed empirical data, we use an estimated linear regression function (given by equation 6) based on the parasite data before host mortality, to linearly project the parasite numbers till the end of the observation period to aid in ABC implementation (by assigning more weight to the most recent data before host death as in a case of missing value imputation assuming at least missing-at-random).Furthermore, we minimise the projection estimation error by assigning more weight to the most recent outcomes or data prior to host mortality based on equations 2-4.Thus, the parasite population projection after host mortality is used to aid in the summary statistics computation for ABC fitting of our stochastic model; specifically, during the calculation of other components of the summary statistics such as the log counts of parasites over time till day 17, and weights of the Wasserstein 1-D distance metric between body regions of the host.We propose a linear function to estimate missing values after fish mortality till the end of the observation period:

Fig. 13 :
Fig. 13: Effect of varying values of the tuning parameter γ ∈ (0, 1) on the estimate of the regression coefficient α (with different colours indicating all observed fish that prematurely died).

3. 2
Derivation of an exact least squares estimator of the regression parameter α (defined by equation 3) based on Theorem 1 Theorem 1. (Least squares estimate of the regression parameter) simulation model under predetermined parameter settings and the proposed ABC methods.The objective is to assess the effectiveness of our modified ABC-SMC sampler and explore the identifiability of the new stochastic model for the gyrodactylid-fish system.From an unrelated toy model (with multivariate normal likelihood and known analytical posterior distribution), it has already been shown from works by Twumasi[4, pp.186-203] that our modified ABC-SMC algorithm (i.e., weighted-iterative ABC) resulted in mutually compatible posterior approximations at the different values of Monte Carlo sample size or number of particle draws, N (for 500 ≤ N ≤ 5000), based on a pre-specified tolerance and ABC time steps of size 10.Thus, it was inferred that the resulting posterior from the modified ABC-SMC algorithm is independent of N , and the degree of variability in the posterior distributions is not significantly different, irrespective of the number of particles drawn from the importance distribution.

Fig. 14 :
Fig. 14: Plot of distance quantiles at the 10 ABC time steps at N=500.

Fig. 15 :
Fig. 15: Plot of distance quantiles at the 10 ABC time steps at N=1000.

Fig. 16 :
Fig. 16: Plot of distance quantiles at the 10 ABC time steps at N=1500.

Fig. 17 :
Fig. 17: Marginal density plots of the approximate posterior distributions (in black) for the 23 parameters of the complex stochastic model against the sequentially improving prior distributions at N = 1500 (on logarithmic scale).

Fig. 18 :
Fig. 18: Correlation matrix plot indicating high multicollinearity between some of the 17 regression predictors (denoted by S i , 1 ≤ i ≤ 17 in the neighbourhood of the pseudo-observed summary statistics) in the modified regression-adjusted ABC (with L2 regularisation).

Fig. 19 :
Fig. 19: Marginal density plots of the unadjusted (in black) and adjusted (in green) posterior distributions of model parameters: b 11 , b 12 , b 21 , b 22 , b 31 , and b 32 against the sequentially improving prior distributions (on logarithmic scale) based on pseudo-observed data.

Fig. 20 :
Fig. 20: Marginal density plots of the unadjusted (in black) and adjusted (in green) posterior distributions of model parameters: d 11 , d 12 , d 21 , d 22 , d 31 , and d 32 against the sequentially improving prior distributions (on logarithmic scale) based on pseudo-observed data.

Fig. 21 :
Fig. 21: Marginal density plots of the unadjusted (in black) and adjusted (in green) posterior distributions of model parameters: m, r, r 1 , r 2 , r 3 , and s against the sequentially improving prior distributions (on logarithmic scale) based on pseudo-observed data.

Fig. 22 :
Fig. 22: Marginal density plots of the unadjusted (in black) and adjusted (in green) posterior distributions of model parameters: s 1 , 1 , 2 , 3 , and κ against the sequentially improving prior distributions (on logarithmic scale) based on pseudo-observed data.

Fig. 23 :
Fig. 23: Comparative plot of the bias of the estimated model parameters based on the unadjusted and regression-adjusted posteriors with their pooled estimates of the bias, variance and MSE.

Fig. 24 :
Fig. 24: PCoA plot of the similarities between posterior samples under a lower-dimensional space (Part A) and the distribution of the average distances of the posterior samples to their posterior centriod (Part B) between the unadjusted ABC and the two penalised regression-adjusted ABC methods (based on ridge and lasso regularisations).

4. 2 . 3
Multivariate posterior predictive analysis using PCA given pseudoobserved dataWe employed principal component analysis (PCA) to examine further the distribution of simulations derived from the posterior approximations and to evaluate the unadjusted and the two regression-adjusted posteriors.Specifically, our analysis explored potential patterns between the high-dimensional pseudo-observed data and simulations based on unadjusted, ridge-adjusted, and lasso-adjusted posterior within a lower-dimensional space.Part A of Figure25provides a graphical summary of the squared cosines of contributions (Cos2) arranged in decreasing order of importance.This arrangement offers insights into the contribution of each data variable (i.e., parasite data at each observed time point) to the principal components.Additionally, Part B of Figure 25 represents a PCA-biplot describing the structure of multivariate data and illustrating the relationships between the data variables.

Fig. 25 :
Fig. 25: Quality of data representation on the PCA factor map (Part A) and PCA biplot (Part B).

Fig. 26 :
Fig. 26: PCA plot describing the variability and hierarchical relationship between the pseudoobserved and simulated data (based on the unadjusted and regression-adjusted posterior estimates) within a lower dimensional space.

Fig. 27 :
Fig. 27: Parasite mean intensities across parasite strain and fish stock stratified by the four comparable groups.

Fig. 30 :
Fig. 30: Marginal density plots of the unadjusted approximate posterior distributions (in black) for the 23 parameters of the complex stochastic model against the sequentially improving prior distributions at N = 1000 (on logarithmic scale).

Fig. 31 :
Fig. 31: Marginal density plot of the unadjusted approximate posterior distributions (in black) for the 23 parameters of the complex stochastic model against the sequentially improving prior distributions at N = 1500 (on logarithmic scale).

Table 1 :
Fixed parameter values for choosing an error bound

Table 2 :
Comparison between unadjusted ABC posterior mean estimates ( θunadj ), and true parameter values ( θ0 ) of the 23 parameters of the stochastic model across at N = 1500 (on log scale) with their respective accuracy measures.

Table 3 :
Comparison between ridge-adjusted ABC posterior mean estimates ( θridge ), and true parameter values ( θ0 ) of the 23 parameters of the stochastic model across at N = 1500 (on log scale) with their respective accuracy measures.

Table 4 :
Comparison between lasso-adjusted ABC posterior mean estimates ( θlasso ), and true parameter values ( θ0 ) of the 23 parameters of the stochastic model across at N = 1500 (on log scale) with their respective accuracy measures.