On predictive inference for intractable models via approximate Bayesian computation

Approximate Bayesian computation (ABC) is commonly used for parameter estimation and model comparison for intractable simulator-based statistical models whose likelihood function cannot be evaluated. In this paper we instead investigate the feasibility of ABC as a generic approximate method for predictive inference, in particular, for computing the posterior predictive distribution of future observations or missing data of interest. We consider three complementary ABC approaches for this goal, each based on different assumptions regarding which predictive density of the intractable model can be sampled from. The case where only simulation from the joint density of the observed and future data given the model parameters can be used for inference is given particular attention and it is shown that the ideal summary statistic in this setting is minimal predictive sufficient instead of merely minimal sufficient (in the ordinary sense). An ABC prediction approach that takes advantage of a certain latent variable representation is also investigated. We additionally show how common ABC sampling algorithms can be used in the predictive settings considered. Our main results are first illustrated by using simple time-series models that facilitate analytical treatment, and later by using two common intractable dynamic models. Supplementary Information The online version contains supplementary material available at 10.1007/s11222-022-10163-6.


Introduction
Approximate Bayesian computation has become an important technique for statistical inference in various application fields such as epidemiology and genomics (McKinley et al., 2009;Numminen et al., 2013;Kypraios et al., 2017;Järvenpää et al., 2019), systems biology (Wilkinson, 2019;Warne et al., 2019) and computational finance (Calvet and Czellar, 2014;Frazier et al., 2019) to name just a few.ABC allows to approximate the posterior distribution of unknown model parameters and the posterior probabilities of models using only forward model simulations.It hence facilitates Bayesian inference in an intuitive -though usually only in an approximate and computationally intensive -manner when the likelihood function is intractable in the sense that its analytical form is unavailable, too complex to analyze or too expensive to evaluate.Sometimes the terminology of likelihood-free inference (LFI) or simulation-based inference is used for computational methods in this space.
Despite substantive amount of research on ABC methodology and applications (see e.g.Marin et al. (2012); Sisson et al. (2019); Pesonen et al. (2021)) and the fundamental importance of prediction in statistical inference, the use of ABC in the context of predictive inference has not been widely studied.In this paper we investigate ABC as a generic technique to approximate the posterior predictive distribution of some future observations or missing data.We denote the unobserved (future or missing) data of interest by ỹ, the observed data by y and the unknown model parameters by θ.Throughout this paper we assume that the analytical forms of the likelihood function π(ỹ, y | θ) (and π(y | θ)) and the conditional predictive density π(ỹ | y, θ) are intractable.We especially study a rather general situation where one can jointly simulate from π(ỹ, y | θ) for each θ but, importantly, not necessarily from π(ỹ | y, θ).No specific assumptions of the model structure are made but we also separately analyze such models that facilitate simulation from π(ỹ | v, y, θ) where v are unobserved (finite-dimensional) latent variables.The goal is to infer the unobserved data ỹ and possibly also θ given the observations y.We consider only an offline setting where data are not obtained sequentially and where summarized data is used as common also in more conventional ABC inference tasks.
An earlier approach for ABC prediction, used in applied work e.g. by McKinley et al. (2009); Canale and Ruggiero (2016); Pesonen et al. (2021) and studied theoretically in the context of financial time-series models that facilitate partial analytical treatment by Frazier et al. (2019), consists of first using ABC to approximate π(θ | y) and then simulating directly from π(ỹ | y, θ).However, the latter step can be infeasible or its implementation may require laborious model-specific derivations.In contrary, sampling from π(ỹ, y | θ), in particular, can often be carried out easily -typically using the same algorithm as sampling from π(y | θ) which is in any case needed for ABC.The main advantage of the proposed ABC methods is that they allow predictive inference very generally -even when the model is specified as a complex computer code that jointly outputs pseudo-data from π(ỹ, y | θ).An example would be a temporal Markov model defined via an intractable transition density.If e.g.only some elements of the last state y τ were observed, then forward simulation of future data ỹ directly via π(ỹ | y, θ) = π(ỹ | y τ , θ) becomes infeasible.Applications include infectious diseases modelling and stochastic biochemical networks, among others, where (semi-)Markov and stochastic differential equation models are used (see e.g.Picchini (2014); Kypraios et al. (2017); Tancredi (2019); Wilkinson (2019); Buckwar et al. (2020)) and where incomplete measurements are often only available.See also Pesonen et al. (2021) where realistic intractable dynamic models are used for prediction, though in a simpler setting as considered here.The focus of this paper is, however, on understanding the predictive ABC approximations and not on particular applications.We also offer some new insight on the aforementioned earlier ABC prediction approach and argue that, if applicable, it can be expected to produce more accurate approximation of the posterior predictive distribution than the more generic methods we propose.
Predictive ABC methods are also useful for parameter estimation and model comparison of intractable models.For example, computation of the posterior predictive distribution for the hypothetical future data to be observed given the candidate experimental design and current data is required in sequential Bayesian experimental design.This requires ABC resolution when the model is intractable, see Hainy et al. (2016); Kleinegesse et al. (2021).Also, various methods for model assessment and comparison based on data splitting (see e.g.Vehtari and Ojanen (2012); Bürkner et al. (2020)), which are yet to be extended to the ABC setting, require the ability to predict data in a hold-out set (or in a cross-validation fold) given the rest of the data.Predictive inference with tractable models may sometimes require ABC: Bayesian inference could be based on summarized data to make the likelihood function better behaving (see e.g.Wood (2010); Fasiolo et al. (2016)) or to alleviate the effects of possible outlier observations and model misspecification (see e.g.Lewis et al. (2021)) which can render inference intractable.These potential applications are however not the focus of this work.
In this paper we only analyze generic ABC prediction methods which are widely applicable, fairly simple to implement and likely easy for a practitioner to understand.It is however worth emphasizing that, obviously, taking advantage of the model structure at hand or its partial analytical tractability, whenever feasible, might lead to more accurate or efficient model-specific algorithms.An example would be hidden Markov and state-space models (SSM) with intractable observation models which have been studied by Jasra et al. (2012); Martin et al. (2014); Calvet and Czellar (2014); Jasra (2015); Vankov et al. (2019).If ABC inference can be based on the full data (instead of summary statistics) as assumed in these works, and if the unobserved latent states are also to be estimated, the model-specific ABC algorithms in the above works are presumably most suitable if extended for the prediction task.Similar remark holds for Markov jump processes with tractable noise models that in some settings facilitate asymptotically exact particle MCMC methods, see e.g.Golightly and Wilkinson (2011); Wilkinson (2019); Warne et al. (2020).We also consider some SMMs but this is mainly for the simplicity of demonstration.See also Martin et al. (2019) where summary statistics based on auxiliary models are used in ABC parameter estimation with SMMs.
The rest of this paper is organized as follows: In Section 2 we briefly review the key ideas of Bayesian predictive inference and ABC.In Section 3 we first discuss the aforementioned common approach for predictive ABC.We then develop the new ABC prediction approaches and analyze some of their properties.We especially study the selection of summary statistics and observe that these ideally need to be minimal predictive sufficient instead of merely minimal sufficient (in the ordinary sense).In the alternative approach where an assumption of latent variable representation is made, a statistic that is jointly sufficient for both the parameters θ and the latent variables v relevant for the prediction task is appropriate.After that, in Section 4 we show how common ABC samplers such as ABC-MCMC (Marjoram et al., 2003) and PMC-ABC (Beaumont et al., 2009) can be used for predictive inference and discuss some practical aspects relevant for applications.Section 5 presents numerical experiments with M/G/1 queue and Lotka-Volterra models which illustrate the main theoretical findings and the quality of the predictive ABC approximations.Section 6 summarizes the main results and suggests topics for future work.Additional analysis and technical details can be found in the Supplementary material.

Background
We denote the observed data as y = y 1:n = (y 1 , . . ., y n ) and the unobserved data of interest as ỹ = (ỹ 1 , . . ., ỹñ ).We assume a parametric model π(ỹ, y | θ) = π(ỹ | y, θ)π(y | θ) with a tractable prior density π(θ) for the unknown model parameters θ ∈ Θ ⊂ R p .Unless explicitly stated otherwise, possible latent variables, which we typically denote by v (and ṽ), are not included to θ in our notation.The individual data points y i ∈ Y i ⊂ R di are typically neither independent nor identically distributed and are associated with known input variables x i ∈ X i with x = x 1:n = (x 1 , . . ., x n ).The same holds for unobserved data ỹ with inputs x = (x 1 , . . ., xñ ).In the case of a time-series model, each x i would typically denote time and y i related observation.We however suppress the dependence on x and x for brevity.

Predictive inference
Inference about the unobserved (future) data ỹ after observing y, which is specifically called predictive inference, is in Bayesian statistics done using the posterior distribution This quantity is called the posterior predictive distribution.It captures the uncertainty in ỹ both due to the stochastic model and the incomplete knowledge of the model parameters θ.The latter is represented by the posterior distribution where π(y) = π(y | θ)π(θ) dθ is the marginal likelihood.As mentioned in O'Hagan and Forster (2004, Section 3), θ can be seen as a nuisance parameter in predictive inference.However, both ỹ and θ might be of interest in some applications.
Although we are mainly concerned with prediction in this paper, (1) and our proposed ABC methods similarly apply when ỹ denotes missing observations to be estimated.Furthermore, in some applications latent variables ṽ are of the main interest instead of the related future data ỹ.For example, consider a SSM, where v = v 1:n denote the unobserved latent states and y = y 1:n the corresponding observations and where the transition and measurement distributions are respectively.The posterior predictive for e.g.ṽ = v n+1 is still obtained by applying (1) when ỹ is replaced by (ṽ, ỹ) and v is included to θ (or marginalized).Hence, we do not handle cases like this separately in what follows.When possible, taking the latent variable structure such as (3) into account can be useful for computational reasons as argued by Jasra (2015) and as considered in Section 3.4 of this paper.
The posterior predictive distribution (1) typically needs to be computed numerically.A common approach is to use the estimator where θ (j) , j = 1, . . ., m, are (possibly correlated) samples from the posterior π(θ | y) obtained e.g. by some Markov chain Monte Carlo (MCMC) method.If evaluating the conditional predictive density π(ỹ | θ, y) needed for ( 4) is difficult, samples from π(ỹ | y) can be obtained by simulating ỹ(j) from π(ỹ | θ (j) , y) for j = 1, . . ., m. Theoretical properties of the resulting approximations have recently been extensively studied by Krüger et al. (2021).All in all, provided that one can either evaluate the conditional predictive density or sample from it efficiently, no additional computational challenges arise in principle.

Approximate Bayesian computation
Before we study predictive inference for intractable models using ABC, we briefly review standard ABC parameter inference.By this we mean the task of approximating the posterior distribution π(θ | y) using only forward model simulations from π(y | θ).More detailed overviews of ABC than given here can be found in Marin et al. (2012); Lintusaari et al. (2017); Sisson et al. (2019).The basic ABC rejection sampler (Tavaré et al., 1997;Pritchard et al., 1999) consists of repeating the following steps for i = 1, . . ., m: A1 Simulate θ (i) from the prior π(θ), A2 Simulate pseudo-data z (i) from the model given The summary statistic1 s : i Y i → R d in A3 is used to summarize the potentially high-dimensional observed data y and the corresponding simulated data sets z (i) .The discrepancy function ∆ : R d × R d → R + and the threshold parameter h ≥ 0 are further used to compare the similarity of the summary statistics.Commonly ∆ is some (weighted) norm • in R d , that is, ∆(s y , s z ) = s y − s z , but other choices are also possible.We often denote the observed summary statistic using a short-hand notation s y := s(y) and similarly s z := s(z).
The accepted samples in A3 are considered as independent draws from the approximate posterior distribution called the ABC posterior.
The ABC rejection sampler outlined above and the more advanced samplers discussed in the context of ABC prediction in Section 4 target the ABC posterior where K h (r) ∝ 1 r≤h .Other choices of the kernel K h : R + → R + , such as the Gaussian kernel K h (r) ∝ exp(−r2 /(2h 2 )) where h > 0 is a bandwidth parameter, are also possible.The intractable likelihood π(y | θ) in (2) has essentially been replaced by the ABC likelihood π h (s y | θ) := K h (∆(s y , s z ))π(s z | θ) ds z to obtain (5).The quality of this approximation depends on the selection of the summary statistic s and the threshold (bandwidth) parameter h in particular.In principle s can be the identity mapping which produces no loss of information but, unless the data is low-dimensional, data summarization is necessary for computational efficiency.Similarly, the threshold h needs to be selected to balance the tradeoff between approximation accuracy and computational efficiency.It is possible to define ABC posterior also via multiple discrepancies ∆ (j) (s (j) (y), s (j) (z)), each based on different summary statistic s (j) and threshold h (j) , by replacing the kernel K h (∆(s y , s z )) in ( 5) with a product of kernels j K h (j) [∆ (j) (s (j) (y), s (j) (z))].Such an approach with uniform kernels K h (j) has been used by Pritchard et al. (1999);Ratmann et al. (2009); Numminen et al. (2013) for standard ABC parameter inference.The resulting alternative ABC posterior approximation differs from the more common one in that it defines a different acceptance region in the summary statistic space.Wilkinson (2013) has shown that these approaches can be interpreted as different measurement error models.

Predictive inference using ABC
In Section 3.1 we first consider the situation with tractable π(ỹ | y, θ) which typically does not pose significant additional computational or statistical challenges over standard ABC inference.We then analyze a more challenging situation where only joint simulation from π(ỹ, y | θ) is used (Section 3.2).Some special circumstances relevant for all methods are considered in Section 3.3 and, finally, taking advantage of a latent variable representation is studied in Section 3.4.Table 1 shows a summary of the ABC prediction methods.

Tractable simulation from conditional predictive density
When the exact posterior π(θ | y) in ( 1) is replaced by the ABC posterior 2 π h (θ | s y ), the following approximation for the posterior predictive distribution π(ỹ | y) results: This intuitive method has been used for applied work by McKinley et al. (2009); Canale and Ruggiero (2016); Pesonen et al. (2021) but without detailed analysis.Frazier et al. (2019) analyzed this approach theoretically and showed that under certain technical conditions the approximate posterior predictive in a sense coincidences with the exact one as the data size becomes arbitrarily large and as h → 0. Their numerical results interestingly suggest that the approximate posterior predictive can be accurate even if the corresponding ABC posterior of θ provides a poor approximation.In a typical LFI scenario one cannot evaluate π(ỹ | y, θ).Samples from π (F) h (ỹ | y; s y ) in (6) can be obtained using a similar two stage approach as in Section 2.1: One first draws θ (i) ∼ π h (θ | s y ) using some standard ABC sampler and finally draws ỹ(i) ∼ π(ỹ | θ (i) , y).We use the abbreviation ABC-F for this approach where "F"3 informs that this method is directly based on forward simulation via π(ỹ | y, θ).The ABC parameter draws can be recycled for other prediction tasks involving the same model and observed data y.
A limitation of ABC-F, mentioned already in Section 1, is that sampling from π(ỹ | θ, y) may be infeasible when the model is (truly) intractable.Frazier et al. (2019) used ABC-F to improve computational efficiency of certain time-series models which facilitate partial analytical treatment and access to π(ỹ | θ, y).Also, simulation from π(ỹ | θ, y) may be feasible when the model is intractable but is Markovian whose last measured state is fully observed.An important special case where ABC-F applies is when π(ỹ | y, θ) = π(ỹ | θ).In particular, the conditional predictive distribution of a model that produces i.i.d.data has the same form as the corresponding likelihood function π(y | θ) so that it is in principle possible to even recycle samples generated from π(y | θ) during the ABC sampling from π h (θ | s y ).A special instance of this is the standard posterior predictive checking of Gelman et al. (1996), Gelman et al. (2013, Section 6.3).In this method independent replicated data sets are drawn from the posterior predictive which are then compared against the observed data y.This can be implemented using only repeated sampling from π(y | θ) in the ABC case.

Intractable simulation from conditional predictive density
We now consider the situation where we do not necessary have access to π(ỹ | θ, y) but joint simulation from π(ỹ, y | θ) is feasible.We first write (1) as If we then replace π(ỹ, y | θ) in ( 7) with the quantity π h (ỹ, s y | θ) := K h (∆(s y , s z ))π(ỹ, s z | θ) ds z , in a similar manner as π(y | θ) is replaced by π h (s y | θ) to obtain (5), the following approximation for the posterior predictive results: We similarly obtain the following joint ABC posterior distribution These approximations depend on the simulator-based model only via π(ỹ, s z | θ) which is the joint distribution of the future (pseudo-)data ỹ and the summarized pseudo-data s z given parameter θ.Obviously, we can simulate from this density by first using the simulator-based model to draw from π(ỹ, z | θ) and then applying the mapping (ỹ, z) → (ỹ, s(z)).The rest of this section is devoted to the analysis of the approximations (8) and ( 9) which we refer as ABC-P.A related alternative, based on a "nested" ABC approximation of π(ỹ | θ, y) in the ABC-F approach, is analyzed in Supplementary material A. We first observe that marginalizing θ in (9) naturally produces (8) while marginalizing ỹ reasonably results the standard ABC posterior approximation π h (θ | s y ).We also notice the following connection between ABC-P and ABC-F: If we use the basic fact π(ỹ, 8) and then replace π(ỹ | s z , θ) with π(ỹ | s y , θ) in the resulting formula (which is a reasonable approximation in that the integrand of ( 8) is approximately zero when h is small and when s y is substantially different from s z ), we obtain a special case π (F)  h (ỹ | s y ; s y ) of the ABC-F approximation.Importantly, this connection additionally implies that the ABC-F and ABC-P target densities coincide in the case of independent data because then we have π(ỹ | s z , θ) = π(ỹ | θ).

Relation to exact posterior predictive distribution
We analyze the ABC-P target posterior predictive based on the observed summary s y when the uniform kernel K h (r) ∝ 1 r≤h is used.We adopt the setting of ABC samplers (Section 4.1) where h t is a sequence of decreasing thresholds.Lebesgue measure is here denoted as | • |.We make the following assumptions: (C1) The random vector (ỹ, θ, s y ) has a joint density function π(ỹ, θ, s y ) with respect to Lebesgue measure, (C2) The acceptance region These assumptions are similar to those used by Prangle (2017), who analyzed the convergence of the ABC posterior of θ, except that we need to consider future data ỹ in (C1).Also, we allow multiple discrepancies in (C2) which is relevant for an approach in Section 4.2.The discrepancy ∆ t in (C2) may depend on t although this dependence plays no important role in this paper.We refer to Prangle (2017, Section 2.3) for further discussion on the assumptions which also applies to our predictive ABC setting.
The following result shows that the ABC-P target posterior predictive becomes arbitrarily accurate if the size of the acceptance region shrinks to zero appropriately which is ensured by the above assumptions.The proof of this result is given in Supplementary material B.1.It is easy to verify that (C2), ( C4) and (C5) hold in a typical case where ∆(s z , s y ) is some norm and the sequence h t is such that h t → 0 as t → ∞.
Proposition 3.1.Assume that a uniform kernel is used and (C1)-(C5) are satisfied.Then, for almost all (ỹ, θ, s y ) wrt. the joint density π(ỹ, θ, s y ), it holds that We additionally obtain dy dθ, that is, the ABC posterior predictive coincides with the prior predictive when h = ∞.A curiosity is that ABC-F leads to a different prior predictive density π (F)  ∞ (ỹ | y; s y ) = π(ỹ | θ, y)π(θ) dθ which however clearly agrees with ∞ (ỹ | y; s y ).Using some results in Stein and Shakarchi (2005); Biau et al. (2015) one could likely extend Proposition 3.1 for more general kernel functions.See also Barber et al. (2015) for related analysis.In the following we however study the summary statistics selection as this analysis is more relevant for practice and differs from that of the standard ABC inference.

Selection of summary statistics
Recall that a statistic s is (classical) sufficient if the conditional density π(y | s y , θ) does not depend on θ and Bayes sufficient if π(θ | y) = π(θ | s y ) for each prior density π(θ) (and for almost all y).It can be shown that Bayes sufficiency is equivalent to classical sufficiency when θ is finite-dimensional, see e.g.Schervish (1995, Theorem 2.14).In what follows we hence consider only Bayes sufficiency4 which we call parametric sufficiency to surely distinguish it from other sufficiency conditions encountered below.
Under assumptions similar to our (C1)-(C5), the ABC posterior π ht (θ | s y ) converges to π(θ | s y ) as t → ∞, see Prangle (2017) (or Sisson et al. (2019, Chapter 1.7) for a less precise analysis).Now, when the summary statistic s is parametric sufficient, the limiting ABC posterior π(θ | s y ) further agrees with the exact posterior π(θ | y).The ABC-F posterior predictive π (F)  h (ỹ | y; s y ) in ( 6) obviously depends on h and s y only via the ABC posterior π h (θ | s y ) and hence agrees with the exact posterior predictive π(ỹ | y) in this scenario.(In Section 3.3 we nonetheless argue that parametric sufficiency is not a necessary condition for this result regarding ABC-F.)However, perhaps surprisingly, parametric sufficiency does not guarantee that an analogous result holds for either ABC-P limiting posterior predictive distributions in (10).
First, we define a summary statistic s to be predictive sufficient if If ỹ and y are conditionally independent given θ (in particular, if the data is i.i.d.) so that π(ỹ | y, θ) = π(ỹ | θ), then parametric sufficiency implies predictive sufficiency (11 . This result does not hold more generally as we see later in Example 3.1.Classical, Bayes and predictive sufficiency are however all equivalent in a setting of infinitely exchangeable observations of Bernardo and Smith (1994, Section 4.5) but this situation is not particularly relevant for ABC applications.
When the summary statistic s satisfies the predictive sufficiency condition (11), it is immediately seen that the ABC-P limiting predictive posterior approximation π(ỹ | s y ) of Proposition 3.1 coincides with the exact posterior predictive π(ỹ | y).The choice s(y) = y would trivially satisfy this condition but the summary statistic should be low-dimensional for computational reasons as in standard ABC inference.We conclude that the summary statistic should ideally be minimal predictive sufficient in the sense of ( 11) to estimate ỹ by using ABC-P.
Another, perhaps less known, form of predictive sufficiency exists (Skibinsky, 1967;Lauritzen, 1974;Bjørnstad, 1996).This stronger form of predictive sufficiency requires the following two conditions: This definition is most natural for our ABC setting but other equivalent definitions exist, see Bjørnstad (1996, Section 3.1).If s is predictive sufficient in the sense of ( 12) and ( 13), then s is obviously parametric sufficient but the converse does not hold in general.13) is satisfied and this definition of predictive sufficiency reverts to that of parametric sufficiency.When ( 12) and ( 13) both hold, we have Predictive sufficiency in the sense of ( 12) and ( 13) thus implies predictive sufficiency in the sense of ( 11).This offers a more intuitive way of finding summary statistics for ABC-P as compared to (11): One can take some parametric sufficient statistic and complement it with additional statistics that satisfy (13).This approach may however not produce minimal predictive sufficient statistic in the sense of (11).In practice, where low-dimensional sufficient statistics may not exist and are in any case difficult to find due to the intractability of the model, one can take some "usual" summary statistic deemed informative for θ and try to device additional ones that could be informative for prediction in the sense of (13).
When the summary statistic s satisfies ( 14), the joint ABC-P limiting approximation π(ỹ, θ | s y ) of Proposition 3.1, clearly coincides with π(ỹ, θ | y).This result is relevant when both ỹ and θ are of interest.Of course, s should be low-dimensional also in this case.Example 3.1 (Markov model).This example illustrates the effect of the summary statistics and threshold on the ABC posterior predictive.Consider a discrete-time process where t = 1, 2, . . ., n with n > 1 and y 0 = 0.The parameters are θ = (c, φ, σ 2 ) ∈ R × (−1, 1) × R + though φ and σ 2 are here considered known.Consider first the task of estimating c when y = y 1:n = (y 1 , . . ., y n ) ∈ R n is directly observed and the prior for c is π(c) ∝ 1.Then the weighted average ȳφ shown below in ( 16) is a parametric sufficient statistic and when a Gaussian kernel N (ȳ φ | zφ , h 2 ) is used, we have The exact posterior π(c | y) is obtained by setting h = 0 in ( 16).The derivation of ( 16) and the other results to follow are outlined in Supplementary material B.2.We next study ABC-P and ABC-F posterior predictive densities of ỹ = y n+1 .When the Gaussian kernel is again used between the summary statistics under consideration, we obtain5 While the statistic ȳφ is sufficient for estimating c, (18) shows that it is neither sufficient for y n+1 nor a good choice in practice.Namely, even when h = 0, the mean prediction of ( 18) can be very different than that of the exact posterior predictive in ( 17).Furthermore, it follows from ( 19) that the variance in ( 18) is always larger than that of (17) except for some special cases such as the i.i.d.data case φ = 0 where they are equal.
On the other hand, the predictive ABC-P posteriors based on (ȳ φ , y n ) and ẙφ := ȳφ + φy n in ( 17) and ( 20), respectively, both produce the exact posteriors when h = 0.Both summary statistics are predictive sufficient in the weaker sense.In fact, the former is sufficient for c and y n+p for any p ≥ 1 while the latter is that for y n+1 .Similarly as with ( 18), one can show that ẙφ is not a good summary statistic for estimating c.The emergence of the statistic y n is not surprising given the Markov property of the model.We also compute which shows that π (F) 0 (y n+1 | y; ȳφ ) = π(y n+1 | y), as expected.We see that even a fairly large bandwidth h may not substantially deteriorate the quality of the ABC-F posterior predictive: Heuristically comparing the ABC posterior for c in ( 16) and the ABC-F posterior predictive for y n+1 in (21) reveals that h needs to be small as compared to σ 2 /n in the former case whereas as compared to (1 + 1/n)σ 2 σ 2 /n in the latter case.We also see that ABC-F in (21) and ABC-P based on ẙφ in (20) have the same target ABC posterior predictive.These methods are however based on different summary statistics and hence require different choices of h to yield similar computational efficiency in practice.
Finally, we analyze the summary statistics s (1) (y) := ȳφ , s (2) (y) := (ȳ φ , y n ) and s (3) (y) := ẙφ empirically in a more practical setting where the common uniform kernel with threshold h is now used.We consider n = 100 observations and set φ = 0.5 and σ 2 = 1.Another choice is considered in Supplementary material C. The true value of c is 1.The inference is carried out using ABC-MCMC with 10 6 simulations (see Section 4 for details) and thresholds adjusted so that the acceptance probabilities are roughly 10% for all three cases.
The results in Figure 1 agree with the theoretical analysis above.In particular, we see that summary statistic s (3) , which is sufficient only for predicting y 101 , does not produce accurate posterior for c.While both s (1) and s (2) are sufficient for c, s (1) produces slightly more accurate approximation which is likely because matching this one-dimensional statistic is computationally more efficient than the two-dimensional s (2) .As expected, s (1) produces less accurate posterior predictive at t = 101 as the other summaries but unexpectedly works fairly well when t = 110 and t = 200.The ABC-F posterior predictive densities with s (1) and s (2) are not shown for clarity and because they both were essentially the same as the exact posterior.(Statistic s (3) was not considered as it would be a meaningless choice for ABC-F.)16) and (B.31) of Supplementary material with h = 0.

A remark on computational efficiency
Computational efficiency of ABC is often a concern and this is especially the case with ABC-P.While matching of some "global" features of the data may be enough for accurate ABC parameter estimation, ABC-P might additionally require matching of some specific "local" features.Such "local" summary statistic for prediction, even if low-dimensional, can have high variability so that a large number of model simulations even with the "true" parameter may be needed to obtain a discrepancy evaluation that is small enough to ensure accurate posterior predictive distribution.For instance, consider the Markov model in Example 3.1 but with the summary statistic s(y 1:n ) = y n alone.In Supplementary material B.3, we show that for any n ≥ 1, any fixed observations y 1:n and threshold h > 0, we have where P(•) is wrt.simulated data z given parameter c.It follows from ( 22) that the maximum probability of simulating data that matches the observation y n up to the absolute error h goes to zero as n → ∞ when |φ| ≥ 1.So, when the data size increases either substantially more simulations are needed or a larger error h must be tolerated.The result is unsurprising because the model is a random walk when |φ| ≥ 1.On the other hand, the maximum acceptance probability is bounded from below with a positive constant in the stationary case |φ| < 1.

Conditional predictive density with partial parameter dependence
We make some remarks on ABC prediction with ABC-P and ABC-F when the conditional predictive π(ỹ | y, θ) depends only partially on θ and the goal is to estimate only ỹ.We first suppose π(ỹ | y, θ) = π(ỹ | y).The first condition of predictive sufficiency (12) then becomes irrelevant and the second condition (13) is the same as the weaker form of predictive sufficiency (11).Hence, in a (hypothetical) situation where this special condition holds but one can nevertheless only simulate jointly from π(ỹ, y | θ), ABC-P is otherwise implemented as before but the summary statistic used does not need to be informative about θ.Additionally, ABC-F reverts to direct simulation from π(ỹ | y) in this case.A simple analytical example would be In the example, y n is purely predictive sufficient.
We next consider a situation more relevant for practical applications.We write θ = (ψ, η) and suppose π(ỹ | y, θ) = π(ỹ | y, ψ), that is, the conditional predictive density depends only on some of the parameters so that For instance, η could be a parameter of a noise model which is not assumed for future data ỹ.A simple analytical example would be y t | y t−1 , θ ∼ N (ψ +y t−1 , η 2 ) for t = 1, . . ., n, (ψ, η) ∈ R×R + and ỹ = y n+1 such that y n+1 | y n , θ ∼ N (y n+1 | ψ + y n , 1 2 ).When ( 23) holds, it is in principle enough that the summary statistic s is informative for ψ in ABC-F.Similarly, in the case of ABC-P, the predictive sufficiency conditions ( 12) and ( 13) need to concern only ψ instead of θ.In other words, in both cases, π(η | ψ, y) in principle does not need to be accurately approximated, only π(ψ | y) does.Design of informative low-dimensional summary statistics for ABC approximation of marginal posteriors, such as π(ψ | y) in our above case, has recently been studied by Drovandi et al. (2022).

Tractable simulation from conditional predictive given latent variables
We have now considered ABC prediction when simulation either from both π(ỹ | y, θ) and π(y | θ), or jointly from π(ỹ, y | θ), is feasible.In this section we briefly study an alternative that takes advantage of the latent variable representation where v denotes finite-dimensional, unobserved latent variables involved in the model.Combining ( 24) and (1) produces which motivates the following approximation of the posterior predictive density: This approach is called ABC-L where "L" refers to the latent variables.To implement ABC-L, the intractable model under consideration must satisfy (24) and sampling from both π(ỹ | y, v, θ) and π(y, v | θ) must be feasible.The latter assumption is evidently weaker than that of ABC-F: If one can directly sample from π(ỹ | y, θ), then one can also sample from π(ỹ | y, v, θ) by formally setting v = ∅.Note that π(v | y, θ) is not assumed tractable -if one can additionally sample from π(v | y, θ) then one can also sample from π(ỹ | y, θ) using ( 24) so that ABC-F applies.This is the case in Frazier et al. (2019, Section 4) where ABC-F was used for SMMs.ABC-L is in fact similar to ABC-F if θ is allowed to contain latent variables, that is, if θ is replaced with (v, θ).However, the asymptotic results of Frazier et al. (2019) do not apply for ABC-L as such because the latent variables v cannot typically be consistently estimated.
One can establish that π ht (v, θ | s y ) → π(v, θ | s y ) as t → ∞ but we omit the details as they are analogous to Section 3.2.1 and as a similar result is obtained as a special case of Jasra (2015, Proposition 2.1).Clearly, the summary statistic s should ideally be jointly sufficient for v and θ so that This condition resembles Bayes sufficiency.Similarly as discussed in Section 3.3, π(ỹ | y, v, θ) may depend only on some components of (v, θ) so that the sufficiency condition (28) in principle needs to hold only for them.Similarly to the case of ABC-F, the ABC-L posterior predictive (26) agrees with the exact posterior predictive A key difference between ABC-L and ABC-P is that latter method depends on the full data y only via the summary statistic s y while the former approach, similarly to ABC-F, depends directly on y via the conditional predictive π(ỹ | y, v, θ) in (26).A potential downside of ABC-L is that, even when one could use the fact that the conditional predictive depends only on some components of (v, θ), ABC inference in high-dimensional space might be needed which is notoriously a hard problem.
We end this section with a SSM example that demonstrates the selection of summary statistics for ABC-P and ABC-L.Nonetheless, in practice ABC-L should be used for truly intractable models that are different from the basic SSM defined via (3).Namely, in this case π(ỹ | y, v, θ) = π(ỹ | v n , θ) does not depend on y and it can be easily checked that ABC-P and ABC-L then in fact have the same target ABC posterior predictive.
Example 3.2 (State space model).Consider the following model where t = 1, 2, . . ., n with n > 1 and v 0 = 0.The set-up is similar to Example 3.1 except that the (latent) states are denoted by v and are not observed.Also, ω > 0 is an additional fixed parameter.We consider only the case h = 0 and investigate the effect of the summary statistics for estimating c or ỹ = y t+1 given observations y = y 1:n .Justifications for the various results below are given in Supplementary material B.4.Consider the following scalar-valued summary statistics s (1) (y 1:n ) := µ W −1 y 1:n and s (2) (y is the covariance matrix of v that depends on φ and σ 2 (but not on c) and whose formula is given in (B.24) of Supplementary material B.4, Σ n: denotes the nth row of Σ and W = Σ + ω 2 I, where I ∈ R n×n is the identity matrix.First of all, s (1) (y 1:n ) is minimal parametric sufficient for c and a linear combination of s (1) (y 1:n ) and s (2) (y 1:n ), shown in (B.53) of Supplementary material B.4, is minimal predictive sufficient for y n+1 .Hence, these would be ideal summary statistics for computing the ABC posterior of c and the ABC-P posterior predictive of y n+1 , respectively.
Access to π(y n+1 | y 1:n , v 1:n , c), which here simplifies to π(y n+1 | v n , c) = N (c + φv n , ω 2 + σ 2 ), and consequently the ABC posterior of (v n , c) would be both required for ABC-L.Now, (s (1) (y 1:n ), s (2) (y 1:n )) qualifies for a sufficient statistic for (v n , c) in the sense of (28) and is hence sufficient also for y n+1 in the ABC-L approach.This statistic is two-dimensional which is not surprising as it is sufficient for (v n , c) ∈ R 2 in a Gaussian linear system.As π(y n+1 | y 1:n , v 1:n , c) does not depend on y 1:n and depends on (v 1:n , c) only via c + φv n , it can be argued that the same scalar-valued summary statistic that is predictive sufficient for y n+1 , is also sufficient for y n+1 in the ABC-L approach.This shows that (s (1) (y 1:n ), s (2) (y 1:n )) is not minimal sufficient in this case.

Using ABC samplers for predictive inference
We next consider numerical methods for ABC-L and ABC-P.Our aim is not to develop or advocate some specific algorithm but instead concisely outline how commonly used ABC algorithms can be modified to produce samples from the ABC-L and ABC-P target densities.We also discuss some practical aspects of predictive ABC inference.

Sampling from ABC posterior predictive
We assume the observed summary statistic s y , discrepancy ∆, kernel K h and threshold h > 0 are all fixed and consider first the task of sampling from π h (θ | s y ) in (5).The ABC target density is augmented to π h (s z , θ | s y ) ∝ K h (∆(s y , s z ))π(s z | θ)π(θ) and the proposal density selected as π(s z | θ)q(θ).Self-normalized importance sampling (IS) can be then used as the resulting (unnormalized) importance weights are free of intractable terms: Weighted samples (ω (i) , θ (i) ) from π h (θ | s y ) are hence obtained by first sampling θ (i) from q(θ), then z (i) from π(z | θ (i) ), and finally computing ω(i) = K h (∆(s y , s (i) z ))π(θ (i) )/q(θ (i) ) with s (i) z = s(z (i) ) for i = 1, . . ., m. Normalized weights ω (i) can be obtained by computing ω (i) = ω(i) / m j=1 ω(j) .We similarly define the augmented target density π (L)  h (s z , v, θ | s y ) ∝ K h (∆(s y , s z ))π(s z , v | θ)π(θ) for the ABC-L approach.If the proposal is analogously set to π(s z , v | θ)q(θ), the intractable terms π(s z , v | θ) cancel out similarly as π(s z | θ) in ( 30) and the same (unnormalized) importance weight ω(θ) results.Hence, weighted samples (ω (i) , ỹ(i) ) from π (L)  h (ỹ | y; s y ) in ( 26) are obtained by repeating the following steps L1 Simulate θ (i) from q(θ), for i = 1, . . ., m.Alternatively, a two stage approach can be used where steps L1-L3 are first repeated for all i and L4 is finally run for those samples with non-zero weights.ABC rejection sampler for ABC-L results as a special case with K h (r) ∝ 1 r≤h and q(θ) = π(θ).Up to some caveats discussed later, PMC-ABC (Beaumont et al., 2009) 2012) can be similarly modified to sample from the (augmented) ABC-L target.This is also the case with ABC-MCMC (Marjoram et al., 2003).All in all, the main modifications are that also v (i) need to be stored (unless e.g. its weight is zero) and the extra prediction step L4.
Analogously to standard ABC-MCMC with the augmented target density π h (s z , θ | s y ), the detailed balance condition can be shown to hold so that π (P)  h (ỹ, s z , θ | s y ) is indeed the stationary distribution of the chain generated by the above algorithm.
The acceptance probability (31) and the corresponding IS weight do not depend on ỹ.Consequently, the computation time of ABC-P can be reduced using the latent variable representation which is the same condition as (24) assumed for ABC-L.Specifically, one then first simulates (z (i) , v (i) , θ (i) ) for each i and afterwards simulates only those of the corresponding predictions ỹ(i) using π(ỹ | z (i) , v (i) , θ (i) ) that are required to form the final estimate of π(ỹ | y).For example, those ỹ(i) that correspond to "burn-in" in the ABC-MCMC sampler or that either have zero IS weight or do not belong to the last population in the PMC-ABC algorithm, are not required.In some cases jointly sampling ỹ and z is not much slower than sampling only z.The above approach can still be useful because the samples can be reused for other related prediction tasks for which the selected summary statistic is deemed informative.Adapting some special ABC samplers for ABC-L and ABC-P may require additional modifications.Examples include the adaptive method for adjusting the discrepancy by Prangle (2017), which is further discussed in Section 4.2, and the adaptive threshold selection method by Simola et al. (2021), which could possibly be modified to account for the posterior predictive at least when ỹ is low-dimensional.

On forming the discrepancy for predictive ABC inference
A common choice for the discrepancy is the weighted Euclidean distance between the summary statistics.Its weights are often based on the variability of the summary statistic which is estimated using pilot simulations (or adjusted adaptively as in Prangle ( 2017)).However, such an approach may be unsuitable for ABC-P.For example, consider the Markov model of Example 3.1 where the summary statistic ȳφ is informative for the model parameter c and the statistic y n , the last observation, is informative for prediction.We formed a discrepancy ∆(s y , s z ) = (ȳ φ − zφ ) 2 /V(z φ ) + (y n − z n ) 2 /V(z n ) where the variances were estimated via pilot simulations.We then observed that this common approach lead to too low weight for the statistic y n due to its fairly high variability (see Section 3.2.3)and consequently noticeable ABC error in the prediction.This also happened when e.g.robust measures of variability were used and in the more realistic case of the continuous-time Markov model of Section 5.2.This difficulty can be alleviated by manual readjustment but this is a cumbersome and opaque procedure in practice.
The use of multiple discrepancies (see Section 2.2) can be helpful for appropriately weighting the summary statistics in some ABC prediction situations.Suppose that a meaningful partition of the summary statistic exists so that s(y) = (s(y), s(y)) where s(y) denotes those statistics that are deemed informative for the parameter θ, and s(y) those that are deemed informative for prediction (and ideally also interpretable).In our Markov example these would be s(y) = ȳφ and s(y) = y n .Then one could use the kernel where ht and ht denote separate thresholds and • , • the corresponding norms.If one chooses ht /a = ht /b = h t where a, b > 0 are fixed constants and h t a sequence of thresholds such that h t → 0 as t → ∞, then the acceptance region of (33) can be written as It is easy to check that A t satisfies the assumptions (C2), (C4) and (C5) so that Proposition 3.1 applies.
An important technical condition is that ht and ht decay at the same rate, otherwise the assumption (C5) would not hold.In our experiments in Section 5 where the above approach is used, || • || and ht are selected based on expert knowledge and the scale of the observed data, respectively.The weights for • can be determined using pilot runs or adaptively.The best practices likely depend on the problem at hand and a detailed investigation falls outside of the scope of this paper.

Illustrative examples
We next demonstrate the predictive ABC methods of Section 3 using two intractable dynamic models, M/G/1 queue (Section 5.1) and stochastic Lotka-Volterra (Section 5.2).The goal of these numerical experiments is two-fold: First, to show that ABC-P and ABC-L can produce reasonable approximations of the posterior predictive distribution in practice, although overestimated uncertainty often results as is common also in the standard ABC inference.Second, the results indicate that the goal of prediction should be accounted for when designing the summary statistics, although informative low-dimensional summary statistics (both for parameter estimation and prediction) are challenging to design for realistic applications.While some insight on selecting informative summaries and suitable ABC sampler configurations for the goal of prediction is provided, we do not aim for a comprehensive analysis.This is because these choices depend on the problem at hand and are active topics of ongoing research also in the context of standard ABC inference.
ABC-F and ABC-L approaches do not apply in all of our test cases unlike ABC-P which, in principle, can be implemented whenever jointly sampling from π(ỹ, y | θ) is feasible.Importantly, all three approaches produce the standard ABC posterior for θ as their marginal whenever they can be implemented.Hence, whenever we additionally computed this quantity, we only show it for ABC-P and the reader needs to notice that ABC-F and ABC-L based on the same summary statistic, discrepancy and threshold as ABC-P would also produce the same result.R-code (with fast C-implementation for model simulation) used for the experiments is available at https://github.com/mjarvenpaa/ABC-pred-inf.

M/G/1 queue model
We first consider the M/G/1 queue model where customers arrive at a single server with independent interarrival times w i ∼ Exp(θ 3 ), that is, the arrival times follow Poisson process with rate parameter θ 3 .The queue is empty before the first arrival.The service times u i are assumed to be independently distributed so that u i ∼ U([θ 1 , θ 2 ]).Only the interdeparture times y = (y 1 , . . ., y n ) are observed.This model has been used before e.g. by Fearnhead and Prangle (2012); Jiang et al. (2018); Bernton et al. (2019) to demonstrate ABC algorithms for estimating the model parameters θ = (θ 1 , θ 2 , θ 3 ).We also assume that the parameters θ are unknown but the main task is to estimate the waiting times of future customers.
The interdeparture times y i satisfy where v i := i j=1 w j are the arrival times and x i := i j=1 y j the corresponding departure times.We use the convention v 0 = x 0 = 0.While the likelihood function π(y | θ) is intractable as discussed by Heggland and Frigessi (2004), using (34) and as and hence from π(y, v | θ), where v = (v 1 , . . ., v n ).Obviously, this way one can sample also from π(ỹ, y | θ) where ỹ = (y n+1 , . . ., y n+ñ ) denote the interarrival times of the next ñ future customers.We use ω i := x i − v i , i = 1, . . ., n, to denote the waiting times of customers and similarly ωi , i = n + 1, . . ., n + ñ, denote those of the future customers.This definition includes both the time the customer waits to be served (which can be 0) and the service time.Clearly, sampling from π(ω, y | θ) is feasible so ABC-P can be easily implemented.Similarly, simulating from π(ω | y, v, θ) = π(ω | x n , v n , θ) is feasible which facilitates ABC-L.Simulating from π(ỹ | y, θ) or π(ω | y, θ), and consequently implementing ABC-F, is not straightforward due to the latent variables v involved and is hence not considered6 .

Common experimental details
We use the uniform prior In addition, we account for the constraint θ 1 ≤ min i∈{1,...,n} y i , which is is implicitly coded into the intractable likelihood function but not in the ABC procedure, as discussed by Bernton et al. (2019).Although M/G/1 model has commonly been used to study ABC algorithms, particle MCMC methods (Andrieu et al., 2010) and the special MCMC algorithm by Shestopaloff and Neil (2014) in principle also apply.We adopt the latter MCMC method and we further extend it in a straightforward way to obtain samples from π(ω | y).This allows us to compare the accuracy of the ABC methods to the ground-truth.We also consider the predictive distribution π(ω | y, v, θ true ) as this unveils the best possible prediction accuracy.Of course, in reality this baseline would be unavailable as θ true is unknown and v is unobserved.
We form a common 5-dimensional baseline summary statistic s (0) by using the α = 0.25, 0.5 and 0.75th empirical quantiles qα (y), and the range of y.For comparison, we adopt the strategy outlined in Section 4.2 and form another summary statistic s (1) (y) = (s (1) (y), s(1) (y)), where s(1) (y) = s (0) (y) and where the additional summaries s(1) (y) are given by max{i ≥ 1 : y i ≥ qα (y obs )} with convention max ∅ = 1.Note that y denotes any data realization while qα (y obs ) is computed using the observed data at hand denoted here as y obs for clarity.This tentative summary statistic captures the idea that individual large interdeparture times result when the queue is empty and the locations of such recent values provide information about the state of the queue for prediction.We use this strategy and s (1) also for ABC-L.The statistic s(1) (y) = y n is not chosen as it would be rather uninformative -short interdeparture times occur frequently irrespective of the state of the queue as seen in Figure 2.  Left plot: The case of Section 5.1.2where the queue length is varying.Occasional large interdeparture times suggest that the queue could be then empty.Right plot: The case of Section 5.1.3where the queue length tends to be growing.
We use α = 0.7, 0.8 and 0.9 so that s(1) (y) ∈ R 3 .We choose || • || for both s (0) and s(1) so that ||r|| = r Ĉ−1 r where Ĉ is the empirical covariance matrix of the summaries estimated, for simplicity and to facilitate meaningful comparison, by running the model 10 3 times at θ true .This value would be unavailable in practice but other point estimates or the approach by Prangle (2017) could be then used instead.We choose || • || as the L ∞ -norm so that ||r|| = max i∈{1,2,3} |r i |.We use ABC-MCMC with 2 • 10 6 iterations and burn-in 10 4 in both cases.In the case of s (0) , the threshold h is determined so that the acceptance probability (after burn-in) is 2%.In the case of s (1) we set h = 5 (Section 5.1.2) or h = 7 (Section 5.1.3).The threshold h is then determined to give the same acceptance rate as with s (0) to make the comparison meaningful although it is not possible to completely separate the effect of the summary statistic and choices related to computational efficiency.

Varying queue
We first consider a situation where θ true = (4, 7, 0.15).In this case the queue length tends to vary being occasionally empty and occasionally containing several customers.A simulated data set with n = 100 used in the experiments below is shown in Figure 2.
The top row of Figure 3 shows the ABC posteriors of the parameter θ.The marginal ABC posteriors for θ 1 are both fairly accurate, though the baseline summary statistic s (0) works slightly better.The likely reason is that the threshold h was set slightly larger than h because the additional summaries s(1) need also to be matched in this case and because the ABC-MCMC acceptance probability was required to be the same in both cases.On the other hand, s (1) works better for θ 3 which suggests that the additional summaries s(1) designed for prediction are informative also for this parameter.As common also in other ABC analyses in literature, poor marginal posterior approximations for θ 2 are observed.
The bottom row of Figure 3 shows the corresponding results for the prediction task.Interestingly, the ideal predictive density and the exact posterior predictive for the first future waiting time ω 101 are both bimodal.(We run several MCMC chains to ensure our observations are not caused by possible numerical issues.)An important result is that the ABC-P posterior predictive based on the baseline summary statistic s (0) has far too long right tail.This occurs because s (0) is invariant to the order of the observations and hence the trend ) and posterior predictive distributions computed using MCMC and ABC approaches for the waiting times of some future customers.The right tail of the ABC-P posterior predictive based on s (0) is truncated to ease visualization.
of the observed interdeparture times just before the prediction gets lost.This would also be the outcome with other methods in literature such as the distance functions studied by Jiang et al. (2018); Bernton et al. (2019).ABC-L posterior predictive also has a pronounced right tail but the additional summaries s(1) appear informative for prediction so that ABC-P based on s (1) produces reasonable approximations.

Growing queue
We use θ true = (8, 16, 0.15) in our second example so that the arrivals are relatively frequent as compared to the service times.The queue is mostly full in this case and the interarrival times y i have approximately the same distribution as the service times u i .We observe that ω i − ω i−1 = y i − w i then approximately follows the same distribution as u i − w i .That is, the increments of the process {ω i } i≥1 tend to be i.i.d.distributed so that {ω i } i≥1 is approximately a random walk.Furthermore, as u i − w i tends to be positive, the waiting times approximately grow linearly in expectation.The data set used is shown in Figure 2.
Figure 4 shows that the marginal ABC posterior distributions of θ 1 and θ 2 feature similar trends as before while those of θ 3 are equally good this time.The posterior uncertainty of θ 3 is however noticeable because the observed interdeparture times are essentially determined by the service times that depend only on θ 1 and θ 2 .Similarly, the observations are not very informative for estimating future waiting times and the ideal predictive density substantially benefits from its use of the true values of θ and v. Figure 4 further shows that the waiting times are growing approximately linearly as the above analysis suggests.All ABC approaches produce similar predictive densities which are comparable to the ground-truth.It is intuitive that the knowledge of the order of the interdeparture times or the additional summaries in s (1) are not highly useful when the queue is growing fairly steadily.The fourth plot illustrates the predictive densities for ω (solid lines: the median, dot-dashed lines: 75% credible interval (CI), dashed lines: 90% CI).Note that these lines are also drawn for the observed customers in the case of ABC-P.The vertical gray line shows the first future customer.The black rings show the unobserved waiting times not used for inference.

Stochastic Lotka-Volterra model
We consider the Lotka-Volterra model which is a Markov jump process that describes the continuous-time evolution of a population of predators interacting with a prey population.This model also serves as a basic example of stochastic kinetics networks used to model the evolution of interacting molecules, see e.g.Wilkinson (2019).The size of the prey population at time t is y 1 t and similarly y 2 t is the predator population.We also denote y t := (y 1 t , y 2 t ).The model dynamics are Markovian with the transition probabilities where γ θ (w) := θ 1 w 1 + θ 2 w 1 w 2 + θ 3 w 2 , ∆t is a small time increment and θ = (θ 1 , θ 2 , θ 3 ) are positive rate parameters.
While the likelihood function associated with this model is intractable (except for some special scenarios), realizations can be simulated given any initial state and parameters θ using the Gillespie algorithm (Gillespie, 1977).The Markov property implies that one can also simulate from π(ỹ | y, θ) = π(ỹ | y τ , θ) where ỹ represent the population sizes at some future time points, y the observed populations and y τ the observed populations at the last observation time τ .However, if e.g.only the prey population is observed then ABC-F approach cannot be directly implemented via Gillespie algorithm because y τ is not fully known.Particle MCMC methods can be in principle used for inference in the noisy case, see Wilkinson (2019); Warne et al. (2020) and references therein for details.In this paper we however limit our attention to ABC methods in the non-noisy setting.

Common experimental details
The following settings are common to all Lotka-Volterra experiments considered: The initial state is y 0 = (100, 50) and we use θ true = (1, 0.005, 0.6) which produces oscillating behavior of the populations.The parameters are log-transformed for inference and we use the prior (log θ 1 , log θ 2 , log θ 3 ) ∼ U([−6, 2] 3 ).We use similar strategies for forming the summary statistics, selecting thresholds and the same settings for the ABC-MCMC sampler as in the M/G/1 experiments.In particular, the acceptance probability of ABC-MCMC is 2% and 2 • 10 6 iterations are run.We use L ∞ -norm for || • || but this time || • || is the weighted Euclidean distance whose weights are reciprocals of the squared median absolute deviations (MAD) of the simulated summaries estimated empirically as before.Summary statistics are designed individually for each inference task as explained below.

Prediction tasks
We consider a prediction task where the population sizes at 81 equidistant times in [0, τ 1 ] are observed and the main goal is to compute the posterior predictive distribution of both populations in [τ 1 , τ 2 ].We select τ 1 = 24 and τ 2 = 45 time units.We study two cases: 1) Both populations are observed and we compare ABC-P and ABC-F and 2) Only prey populations y 1 t are observed and we compare ABC-P and ABC-L.ABC-L is implemented by setting v = y 2 τ1 in case 2. Similarly to earlier analyses (e.g.Papamakarios and Murray (2016); Wilkinson (2019)) where the goal is to infer θ, we use the following summary statistics for our case 1: the mean, standard deviation, two first autocorrelations of both populations and cross-correlation between both populations.The resulting 9dimensional baseline summary statistic is denoted by s (0) .We then form a summary statistic s (1) specifically for the prediction task at hand: We set s(1) = s (0) and s(y) = y τ1 ∈ R 2 .The baseline summary statistic s (0) for case 2 includes the mean, standard deviation and two first autocorrelations of the prey population.Summary statistic s (1) is formed so that s(1) = s (0) and7 s(y) = (y 1 τ1−iδ ) 5 i=0 ∈ R 6 where δ = 0.3 is the time between consecutive observations.We use h = 50 in case 1 and h = 30 in case 2. In case 1 we compare our ABC methods to the ideal predictive distribution π(ỹ | y, θ true ) = π(ỹ | y τ1 , θ true ) from which we directly simulate using the Gillespie algorithm.In case 2 our comparison is to an approximation of π(ỹ | y 1 , θ true ) which is obtained by first generating (ỹ (i) , z (i) ) ∼ π(ỹ, z | θ true ), i = 1, . . ., 10 6 , using the Gillespie algorithm and accepting ỹ(i) if ||s(z (i) ) − s(y)|| = max i∈{0,...,5} |z 1 (i) τ1−iδ − y 1 τ1−iδ | ≤ 7.This is a form of ABC rejection sampling (but conditionally on θ true , see Supplementary material A for discussion).
Typical results for case 1 are shown in Figure 5a.All ABC posterior predictive densities roughly capture the future model dynamics.The mean absolute error of the posterior median prediction (abbreviated as MAE and where the mean is taken over both populations and the prediction interval) of ABC-P is 40 with both s (0) and s (1) .Statistic s (1) however produces smaller standard deviation (std) of the posterior predictive in the region t ≈ τ 1 than s (0) (e.g.27 vs.70 for the predator population at t = τ 1 + δ).While both approaches produce visibly inflated 90% CI there, the accuracy with s (1) could be increased at the cost of more simulations by lowering h as this quantity directly controls the absolute error at t = τ 1 .However, we do not have such apparent guarantee with s (0) .The ABC-F posterior predictive is similar to the ideal predictive density over the whole interval [τ 1 , τ 2 ] (MAE is 5.3 and the std of the posterior predictive of the predator population at t = τ 1 + δ is 6.6 while that of the ideal baseline is 6.4) so that ABC-F is clearly preferable to ABC-P.
Figure 5b shows that, unsurprisingly, prediction in case 2 is more challenging than in case 1. ABC-P based on s (1) produces visibly more accurate posterior predictive than s (0) .MAE for ABC-P with s (0) is 72 and with s (1) it is 54.ABC-L produces the most accurate median prediction (MAE is 24) and also the least inflated posterior predictive of the prey population in the region t ≈ τ 1 as a consequence of its ability to partially use the exact value of the observation y 1 τ1 .ABC-P based on s (1) and ABC-L however both produce similar predictive posteriors for the unobserved predator population y 2 t at t ≈ τ 1 .Marginal ABC posterior distributions for θ in both cases are finally shown Figure 6.Summary statistic s (1) produces slightly more inflated posterior as s (0) similarly -and for the same reason -as in the M/G/1 experiments.

Missing data task
The goal here is to estimate the unobserved population sizes in [τ 1 , τ 2 ] based on 51 equidistant observations both in [0, τ 1 ] and [τ 2 , τ 3 ] with 0 < τ 1 < τ 2 < τ 3 .In particular, we use τ 1 = 15, τ 2 = 36 and τ 3 = 51.The summary statistics of Section 5.2.2 are not meaningful and we instead use the mean and standard deviation of both population sizes separately computed for intervals [0, τ 1 ] and [τ 2 , τ 3 ].This produces 8-dimensional summary statistic s(1) .Additionally, we choose s(1) (y) = (y τ1 , y τ2 ) ∈ R 4 and h = 50.Only ABC-P is  ABC−P, s (1) ABC−F, s (0)    where the prediction interval starts and the black circles show the observations.The solid lines show the median and the dashed lines the 90% CI.These lines are also drawn in [0, τ 1 ] to demonstrate that the observed data is not conditioned on exactly in ABC."True param."shows the ideal predictive distribution based on θ true (case 1) and its approximation (case 2).The prey population size is truncated in case 2 because the predator population can die out with a non-negligible posterior probability in which case the prey population starts to grow exponentially fast.considered as the other ABC approaches do not to apply8 .We compare it to the ideal predictive distribution π(ỹ | y, θ true ) = π(ỹ | y τ1 , y τ2 , θ true ) which is computed in a similar fashion as the one in case 2 of Section 5.2.2:We first generate 10 6 samples (ỹ (i) , z (i)  τ2 ) ∼ π(ỹ, z τ2 | y τ1 , θ true ) and then accept those of them that satisfy ||z (i)  τ2 − y τ2 || ≤ 5 (our first data set) or that match the observed y τ2 exactly (our second data set).In the latter case we obtain exact samples from π(ỹ | y τ1 , y τ2 , θ true ).
Results with two data realizations are shown in Figure 7.In a typical case shown in the left column, the approximation is good as compared to the approximate ideal predictive density (MAE is 17).As before, the uncertainty is overestimated especially near τ 1 and τ 2 as the matching to the observed y τ1 and y τ2 is only done up to the threshold h.The right column presents a special situation where both populations have become extinct.The approximation quality is fair in this case (MAE is 59).The ABC posteriors for θ are additionally shown in Supplementary material C.

Discussion
There is no single natural, general way to to approximate the posterior predictive distribution of intractable models via ABC.In this paper we have studied three such complementary approaches summarized in Table 1.These methods make different assumptions regarding which predictive density of the model can be simulated from.We provided some new insight to the earlier approach, abbreviated as ABC-F in this paper, and pointed out its limitations.We then proposed related but more general ABC-P and ABC-L methods, investigated their basic properties (especially the selection of summary statistics), discussed some ways to implement these methods in practice, and finally provided numerical examples of all three methods.
Our analysis indicates that forming summary statistics for ABC-P and weighting them appropriately requires more care than for the standard ABC inference or for ABC-F because these statistics need to be informative in the sense of predictive sufficiency instead of ordinary sufficiency.Of course, they should also be low-dimensional to avoid the curse of dimensionality.Such suitable statistics may have large variability which can lead to more pronounced computational challenges than in standard ABC inference and in some situations result substantially inflated predictive intervals.Also, suitable low-dimensional statistics may not exist which could be the case e.g. with intractable spatial data models which were however not analyzed in this paper.On the other hand, such suitable statistics can be often constructed when the model features Markovian or some other simplifying structure.ABC-P produced reasonable posterior predictive approximations in our experiments although we used a basic ABC-MCMC implementation and summary statistics designed for prediction in a rather rudimentary fashion.
Our analysis and empirical results suggest that in general ABC-F should be used instead of ABC-P.However, as we have discussed, implementing ABC-F can be difficult or infeasible which makes ABC-P and ABC-L a useful addition to statistician's toolbox.We cannot conclude whether ABC-L should be preferred to ABC-P when both approaches are feasible.This likely depends on the model at hand and the suitability of the selected summary statistics which is often hard to assess in practice.It should also be noted that when asymptotically exact (particle) MCMC or other model-specific methods can be used, they are likely more accurate than the more generic, approximate methods analyzed in this paper.
Our key aim was to give a unified overview on approximating the posterior predictive via ABC in an accessible manner.As future work, one could study more closely how to best use such methods for specific models and inference tasks.One could also investigate if LFI techniques such as regression adjustment (Beaumont et al., 2002;Blum, 2010), Bayesian synthetic likelihood (Price et al., 2018), ratio density estimation (Thomas et al., 2022) or conditional density estimation methods based on neural networks or Gaussian processes (Papamakarios and Murray, 2016;Papamakarios et al., 2019;Grazian and Fan, 2020;Järvenpää et al., 2020) can be used to improve the accuracy or computational efficiency of the ABC-P approach.Another important research direction would be to study how to construct informative summary statistics for predictive ABC inference in an automatic fashion.Such methods have been developed for standard ABC inference (see e.g.Sisson et al. (2019, Chapter 5)) and some of them could possibly be extended to the predictive setting.Also, our analysis on the summary statistics for prediction might offer new perspective for selecting them for standard ABC inference.satisfy the condition (13) while s needs to be parametric sufficient satisfying (12).Also, it is not immediately clear how to most efficiently sample from (A.2) as the normalization constant of π h(ỹ | θ, sy ) depends on θ or how to determine h which in principle could depend on θ.On the other hand, ABC-P' might facilitate more convenient implementation than ABC-P in a sense that the estimation of θ and ỹ can be separated.More detailed investigation is however left as a potential topic for future work.where the integral is computed by deducing that the result is Gaussian density whose mean (variance) follows by using the law of total expectation (variance) or, alternatively, by applying Gaussian identities.The result (21) follows immediately by setting p = 1.Similarly, the discussion below (B.20) and the result (B.31) with p = 1 and h = 0 imply (17).As the resulting formula depends on data y 1:n only via ẙφ , the summary statistic ẙφ is predictive sufficient (in the weaker sense).

Figure 1 :
Figure1: Illustration of the effect of summary statistics s (1) (y) = ȳφ , s (2) (y) = (ȳ φ , y n ) and s (3) (y) = ẙφ = ȳφ + φy n on the ABC approximation accuracy in Example 3.1.The first plot on the left shows the ABC(-P/F) posterior for c and the three other plots the ABC-P posterior predictive distribution at some future time points.The black vertical line shows the true value of c used to generate the data and the red lines show the exact Gaussian posteriors obtained using (16) and (B.31) of Supplementary material with h = 0.
and other related sequential IS methods such as Sisson et al. (2007); Toni et al. (2009); Del Moral et al. (

Figure 2 :
Figure 2: Observed interdeparture times y = (y 1 , . . ., y n ) with n = 100 used in the M/G/1 experiments.Left plot: The case of Section 5.1.2where the queue length is varying.Occasional large interdeparture times suggest that the queue could be then empty.Right plot: The case of Section 5.1.3where the queue length tends to be growing.

Figure 3 :
Figure 3: Results for the M/G/1 experiments of Section 5.1.2.Top row: Marginal posterior distributions computed using MCMC and ABC for parameters θ.Bottom row: Predictive distribution based on the true parameter θ true ("True param.")and posterior predictive distributions computed using MCMC and ABC approaches for the waiting times of some future customers.The right tail of the ABC-P posterior predictive based on s (0) is truncated to ease visualization.

Figure 4 :
Figure4: Typical results for the M/G/1 experiment of Section 5.1.3.The first three plots on the left show the marginal posteriors for θ.The fourth plot illustrates the predictive densities for ω (solid lines: the median, dot-dashed lines: 75% credible interval (CI), dashed lines: 90% CI).Note that these lines are also drawn for the observed customers in the case of ABC-P.The vertical gray line shows the first future customer.The black rings show the unobserved waiting times not used for inference.
Case 2: Only prey population observed.

Figure 5 :
Figure5: Typical results for the Lotka-Volterra experiments of Section 5.2.2.The vertical gray line shows where the prediction interval starts and the black circles show the observations.The solid lines show the median and the dashed lines the 90% CI.These lines are also drawn in [0, τ 1 ] to demonstrate that the observed data is not conditioned on exactly in ABC."True param."shows the ideal predictive distribution based on θ true (case 1) and its approximation (case 2).The prey population size is truncated in case 2 because the predator population can die out with a non-negligible posterior probability in which case the prey population starts to grow exponentially fast.

Figure 6 :
Figure 6: Posterior distributions for the parameters of the Lotka-Volterra experiments corresponding to Figure 5 and Section 5.2.2.Top row: Case 1 where both populations are observed.Bottom row: Case 2 where only prey population is observed.The black vertical line shows the true value of the parameter.

Figure 7 :
Figure 7: Results for the Lotka-Volterra experiment of Section 5.2.3 with two sets of observed data.The vertical gray lines indicate the unobserved time interval [15, 36] and the black circles show the observations.The solid lines show the median and the dashed lines the 90% CI. "True param."shows the ideal predictive distribution based on θ true (right column) and its approximation (left column).

Figure C. 2 :
Figure C.2: Posterior distributions for the parameters of the Lotka-Volterra experiments corresponding to Figure 7 in Section 5.2.3 of the main paper.Top row: The first realization of data.Bottom row: The second realization of data where the populations have become extinct.The black vertical line shows the true value of the parameter.

Table 1 :
A simplified overview of the ABC prediction methods considered in this paper.An alternative to ABC-P is also briefly analyzed in Supplementary material A. Numerical methods are considered in Section 4.