Local dimension reduction of summary statistics for likelihood-free inference

Approximate Bayesian computation (ABC) and other likelihood-free inference methods have gained popularity in the last decade, as they allow rigorous statistical inference for complex models without analytically tractable likelihood functions. A key component for accurate inference with ABC is the choice of summary statistics, which summarize the information in the data, but at the same time should be low-dimensional for efficiency. Several dimension reduction techniques have been introduced to automatically construct informative and low-dimensional summaries from a possibly large pool of candidate summaries. Projection-based methods, which are based on learning simple functional relationships from the summaries to parameters, are widely used and usually perform well, but might fail when the assumptions behind the transformation are not satisfied. We introduce a localization strategy for projection-based dimension reduction methods, in which the transformation is estimated in the neighborhood of the observed data instead of the whole space. Localization strategies have been suggested before, but the performance of the transformed summaries outside the local neighborhood has not been guaranteed. In our localization approach the transformation is validated and optimized over validation datasets, ensuring reliable performance. We demonstrate the improvement in the estimation accuracy for localized versions of linear regression and partial least squares, for three different models of varying complexity.


Introduction
Approximate Bayesian computation (ABC) and other likelihood-free inference (LFI) methods have gained widespread popularity in the last decade (Lintusaari et al., 2017; Sisson et al., 2018). Beginning with applications in population genetics and evolutionary biology (Tavaré et al., 1997; Pritchard et al., 1999; Beaumont et al., 2002), the methods have recently expanded to many other fields of science, ranging from financial modelling (Peters et al., 2018) to human-computer interaction (Kangasrääsiö et al., 2017), supported by open-source software such as ELFI (Lintusaari et al., 2018). A major contributor to the rise in popularity of LFI methods is that they allow connecting existing computer simulators to data in a statistically rigorous way. In their simplest form, LFI methods only require the ability to generate pseudo-datasets from a computer simulator and a way to measure the similarity between simulated and observed datasets.
A key component of any simulation-based likelihood-free inference method is choosing how to measure the similarity between the simulated and observed datasets. The similarity is usually based on low-dimensional summary statistics, which contain most of the information in the data (Prangle, 2018). The low dimensionality is crucial for the good performance of the methods, since they suffer heavily from the curse of dimensionality. For example, under optimal conditions the mean squared error of the ABC estimate scales as O_p(N^(-4/(q+4))), where N is the number of samples and q is the dimensionality of the summary statistics (Barber et al., 2015). Designing such summaries is in most cases difficult, which complicates the application of likelihood-free methods to new problems. Some methods bypass the use of summary statistics completely and work directly on the full data. For example, Gutmann et al. (2018) proposed to use classification as a measure of similarity for likelihood-free inference. However, that method is applicable only in situations where multiple exchangeable samples are available, and hence is not generally applicable.
Dimension reduction techniques offer a semi-automatic way of producing summary statistics that balances the trade-off between dimensionality and informativeness (Blum et al., 2013;Prangle, 2018). The most widely used methods are based on a large set of candidate summary statistics. Subset selection methods choose a small subset of the candidate summary statistics, which are the most informative about the parameters (Joyce and Marjoram, 2008;Nunes and Balding, 2010). Projection-based methods construct a functional relationship from summary statistics to the parameters using for example linear regression, and produce new low dimensional summaries as a combination of the candidate summaries (Wegmann, Leuenberger, and Excoffier, 2009;Aeschbacher, Beaumont, and Futschik, 2012;Fearnhead and Prangle, 2012).
Projection-based dimension reduction techniques face at least two challenges that might compromise their efficiency in producing informative and low-dimensional summary statistics. First, the relationship between the summaries and the target might be more complex than assumed by the dimension reduction method. Usually the relationship is assumed to be linear, which rarely holds globally in the whole parameter space. Second, some of the candidate summaries might be informative only in a subset of the parameter space. This could happen, for example, in dynamical models where the amount of data depends on the parameter values (Sirén et al., 2018). A large dataset then allows estimation of more detailed dependencies among the parameters, which would not be possible with a small dataset. Therefore, the optimal summaries for ABC should differ between these regions of the parameter space.
The difficulty of applying global projection based methods can be alleviated by fitting the relationship between summaries and parameters locally around the observed data. This localization may be motivated by the fact that the relationship is usually much simpler, when restricted to a smaller region, and hence easier to fit. Also, for estimating the posterior distribution of the observed data, the good performance of the summary statistics is most crucial locally around the data.
Localization of summary statistics selection has been proposed using at least three different strategies in the literature. In strategy 1, a projection-based transformation is estimated using only simulations that result in datasets close to the empirical data. Aeschbacher et al. (2012) suggested performing a pilot ABC analysis using all candidate summaries, and training the boosting with the accepted simulations. Constructing the summary statistics in the neighborhood of the observed data makes it possible to capture the relationship between summaries and parameters more accurately even with a simple model. However, such an approach could perform poorly outside the region of accepted simulations, because after the transformation even simulations outside this region could have similar summaries as the observed data (Fearnhead and Prangle, 2012; Prangle, 2018). Additionally, it is not clear how large a set of closest simulations from the pilot analysis should be used. In strategy 2, the prior support is narrowed down to the region of non-negligible posterior density. Fearnhead and Prangle (2012) suggest performing a pilot ABC run with all candidate summaries, restricting the prior range to a hypercube containing the posterior support and fitting a linear regression from the candidates to the parameters. As the prior range is narrowed down, the transformed summaries should behave well in the whole parameter space. A drawback with this approach is that the narrowing of the prior range might not provide much localization, especially in a high-dimensional setting. In strategy 3, the localization is achieved by learning a dimension reduction that performs optimally in the neighborhood of the observed data. Nunes and Balding (2010) introduced a two-stage strategy for selecting a subset of summaries.
In the first stage of their method, a number of datasets closest to the empirical one are chosen to be used as validation datasets for the second stage of selecting the optimal subset of summaries. While computationally expensive, this approach has been shown to produce well-performing summary statistics (Blum et al., 2013), but the validation strategy has not been applied to projection-based methods.
Here we introduce an algorithm for reliable localization of projection-based dimension reduction techniques. The algorithm combines the localization strategies 1 and 3 described above, and works with any projection-based technique. It is based on first choosing a number of validation datasets, and then optimizing a local projection-based dimension reduction on the validation datasets. The optimization is performed over the size of the neighborhood around the dataset and possible parameters associated with the projection technique, such as the number of components used in partial least squares (PLS, Wegmann et al., 2009). By evaluating the performance of the local transformations on the validation datasets globally, the method is able to overcome the issue of poorly performing summaries outside the local neighborhood. We show improvement over global dimension reduction in different models of varying complexity for both linear regression and partial least squares. Compared to the previously published localization approaches, the optimization of the local transformation results in higher accuracy and improved stability of the transformed summary statistics.

Methods
Rejection ABC is the simplest algorithm for performing likelihood-free computation (Sisson et al., 2018). It is based on generating N simulated pseudo-datasets from the model p(D|θ) and comparing those to the observed data. For each simulation i, a d-dimensional parameter value θ_i is sampled from the prior distribution and a pseudo-dataset D_i is generated from the model p(·|θ_i). The discrepancy between D_i and the observed data D_obs is calculated with a distance d(D_i, D_obs), and if d(D_i, D_obs) < ε for some prespecified ε > 0, simulation i is accepted. The parameters associated with the accepted simulations then constitute an ABC approximation of the posterior distribution p(θ|D_obs). As an alternative to specifying a fixed ε, many ABC applications instead accept a fixed quantile α of the closest simulations, so that the number of samples used to approximate the posterior is αN.
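For concreteness, the quantile-based rejection sampler described above can be sketched as follows (a minimal illustration in Python rather than the Matlab used for the analyses in this paper; the toy simulator, prior and distance are hypothetical stand-ins):

```python
import numpy as np

def rejection_abc(simulate, prior_sample, distance, D_obs, N=10_000, alpha=0.01):
    """Quantile-based rejection ABC: keep the alpha*N simulations closest to D_obs."""
    thetas = np.array([prior_sample() for _ in range(N)])
    dists = np.array([distance(simulate(t), D_obs) for t in thetas])
    n_accept = int(np.ceil(alpha * N))
    accepted = thetas[np.argsort(dists)[:n_accept]]
    return accepted  # samples approximating p(theta | D_obs)

# Toy example: infer the mean of a Gaussian from its sample mean.
rng = np.random.default_rng(0)
D_obs = 2.0  # observed "data" (a sample mean)
post = rejection_abc(
    simulate=lambda t: rng.normal(t, 1.0, size=100).mean(),
    prior_sample=lambda: rng.uniform(-10, 10),
    distance=lambda a, b: abs(a - b),
    D_obs=D_obs,
)
```

With N = 10,000 and α = 0.01 the accepted sample has αN = 100 draws concentrated near the true mean.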
The distance function d(·, ·) is typically defined using a q-dimensional vector of summary statistics S, which summarizes the information in the data in a lower-dimensional form (Prangle, 2018). However, in many applications q can still be large, for example in the hundreds. As discussed in the Introduction, high dimensionality of S is a challenge for accurate ABC inference. Dimension reduction techniques reduce the dimension of S by using a transformation f(S) that produces lower-dimensional summaries retaining most of the information about the model parameters. f(S) may reproduce a small number of elements of S, as in subset selection methods that aim to find an informative subset of the summaries (Joyce and Marjoram, 2008; Nunes and Balding, 2010), or f(·) could be a linear mapping, as in linear regression (Fearnhead and Prangle, 2012) and PLS (Wegmann et al., 2009), which are the most widely used projection-based dimension reduction methods. Dimension reduction techniques require that a sample of N simulations with parameters θ and summary statistics S is available for estimating the transformation f(·).

Local dimension reduction of summaries
Projection based methods of dimension reduction usually aim to find a global mapping from summaries to the model parameters, which does not lose information in the summaries and yet has a simple form. However, the actual relationship is often very complex and such a simple functional form is not possible to obtain. Localization of the transformation, i.e. estimating the projection in the neighborhood of the observed data, provides a solution to this problem, as the relationship is typically less complex within a smaller region.
Localization in the data (or summary statistics) space as proposed by Aeschbacher et al. (2012) is conceptually simple. Instead of using all N simulations for estimating the projection, the estimation is performed using the quantile α of simulations closest to the observed data D_obs. General pseudo-code for the localization is presented in Algorithm 1. The user needs to choose the quantile α, the initial transformation f_1 that is used for defining the set of closest simulations, and the transformation parameters λ if needed. The initial transformation f_1 could be the identity function, resulting in the use of the set of all candidate summaries as in Aeschbacher et al. (2012), or transformed summaries from a projection-based method applied globally to all simulations. The parameters λ for the transformation f_l could be, for example, the number of components for PLS, or criteria for choosing it.

Algorithm 1: LocalProjection
Input: Initial transformation f_1, transformation parameters λ, size of the local neighborhood α, target summaries S_obs, simulated summaries S and parameters θ
Output: Local transformation f_l
Calculate distances d(f_1(S_i), f_1(S_obs)) for all simulations i;
Select the set I_l consisting of the αN simulations with the smallest distances;
Construct transformation f_l based on the simulations in I_l using parameters λ;

The construction of the transformation f_l on the last line of Algorithm 1 depends on the projection method used with the algorithm. For example, with linear regression it amounts to fitting a linear regression of θ(I_l) on S(I_l), where S(I_l) and θ(I_l) are the subsets of S and θ corresponding to the simulations in I_l. The transformation is then obtained from a point estimate of the regression coefficients β.
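A minimal sketch of Algorithm 1 with linear regression as the projection method might look as follows (Python for illustration; the function name and the use of plain least squares with an intercept are our assumptions, not the paper's implementation):

```python
import numpy as np

def local_projection(S, theta, S_obs, f1, alpha):
    """Algorithm 1 sketch: fit a linear projection using only the alpha-quantile
    of simulations closest to the observed summaries under f1."""
    d = np.linalg.norm(f1(S) - f1(S_obs), axis=1)        # distances in transformed space
    I_l = np.argsort(d)[: int(np.ceil(alpha * len(S)))]  # local neighborhood of S_obs
    X = np.column_stack([np.ones(len(I_l)), S[I_l]])     # design matrix with intercept
    beta, *_ = np.linalg.lstsq(X, theta[I_l], rcond=None)  # local regression coefficients
    # The transformation f_l maps candidate summaries to predicted parameters.
    return lambda s: np.column_stack([np.ones(len(np.atleast_2d(s))),
                                      np.atleast_2d(s)]) @ beta
```

The returned callable plays the role of f_l: applying it to any summary vector yields the transformed, low-dimensional summaries.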

Optimized local dimension reduction
Algorithm 1 has the potential to produce more efficient summary statistics than those obtained from global projection methods, but two issues may lead to poorly performing summaries. First, the transformation is constructed in the neighborhood of the observed data and should perform well there, but nothing guarantees that the projected summaries are sensible outside this region. Second, the size of the neighborhood α used to train the projection must be set somehow, but its optimal value could be almost anywhere between 0 and 1, depending on the model and simulation setting. Aeschbacher et al. (2012) suggest setting α = 500/N, but such an arbitrary choice will probably not work in all situations.
We use an optimization strategy similar to the first step proposed by Nunes and Balding (2010) for the localization, but instead of choosing the best subset of candidate summaries, the optimization targets the size of the neighborhood α and the other transformation parameters λ of Algorithm 1. The optimization is based on N_valid validation datasets, which are chosen as the closest to the observed data under a transformation f_v and used to measure the performance of a local transformation f_l. The performance of a local transformation with parameters α = α* and λ = λ* in each step of the optimization algorithm is evaluated by constructing local transformations with α* and λ* targeting each of the validation datasets and measuring the accuracy of the obtained posteriors.

We measure the accuracy of a posterior sample using the root mean squared error (RMSE). For a posterior sample θ_j(I) of parameter component j with true value θ_obs,j, the RMSE is computed as

RMSE_j = sqrt( (1/|I|) Σ_{i ∈ I} (θ_{i,j} − θ_obs,j)² ).

For evaluating the posterior approximation in the whole parameter space, we used the summed RMSE,

SRMSE = Σ_{j=1}^{d} RMSE_j.

We used the average SRMSE over the validation datasets as the target for the minimization, although other choices, such as the maximum over the validation datasets, could be used as well.

Algorithm 2 shows pseudo-code for the optimized local dimension reduction using exhaustive search over a set Λ of candidate values for the transformation parameters λ. The transformation parameters λ in Algorithm 2 are not limited to the parameters of the local transformation f_l: λ could also include the size of the neighborhood α and parameters of the initial transformation f_1 that is used for the localization. For example, with local PLS the optimization could be over α and the numbers of components of the transformations f_1 and f_l. Algorithm 2 could also be used without localization, to optimize the parameters of a global transformation.
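As a sketch, the accuracy measure could be computed as follows (Python for illustration; this assumes the plain, unweighted form of the RMSE and SRMSE described above):

```python
import numpy as np

def srmse(theta_post, theta_obs):
    """Summed RMSE: score a posterior sample theta_post (n x d) against the
    true parameter value theta_obs (d,) by summing per-component RMSEs."""
    rmse = np.sqrt(np.mean((theta_post - theta_obs) ** 2, axis=0))  # RMSE per component
    return float(rmse.sum())
```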

Algorithm 2: LocalProjectionOptimized
Input: Initial transformations f_v and f_1, candidate transformation parameters Λ, number of validation datasets N_valid, number of samples to approximate the posterior N_post, target summaries S_obs, simulated summaries S and parameters θ
Output: Local transformation f_l
Calculate distances d(f_v(S_i), f_v(S_obs)) for all simulations i;
Select the set I_valid consisting of the N_valid simulations with the smallest distances;
for λ ∈ Λ
    for i ∈ I_valid
        Construct local transformation f_{i,λ} with transformation parameters λ targeting dataset i, using Algorithm 1 with f_1 as the initial transformation;
        Approximate the posterior for dataset i using the N_post closest simulations under f_{i,λ}, and compute SRMSE_{i,λ};
Select the value λ̂ that minimizes the average of SRMSE_{i,λ} over i ∈ I_valid;
Construct local transformation f_l with transformation parameters λ̂ targeting the observed dataset, using Algorithm 1 with f_1 as the initial transformation;
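A compact sketch of Algorithm 2, restricted for illustration to optimizing only the neighborhood size α with local least-squares regression as the projection (Python; the helper names and these simplifications are our assumptions, not the paper's Matlab implementation):

```python
import numpy as np

def _fit_local_regression(S, theta, S_target, f1, alpha):
    """Algorithm 1 as a subroutine: local least-squares projection (a sketch)."""
    d = np.linalg.norm(f1(S) - f1(S_target), axis=1)
    I_l = np.argsort(d)[: int(np.ceil(alpha * len(S)))]
    X = np.column_stack([np.ones(len(I_l)), S[I_l]])
    beta, *_ = np.linalg.lstsq(X, theta[I_l], rcond=None)
    return lambda s: np.column_stack([np.ones(len(s)), s]) @ beta

def local_projection_optimized(S, theta, S_obs, f_v, f1, alphas,
                               n_valid=20, n_post=100):
    """Algorithm 2 sketch: exhaustive search over candidate neighborhood sizes,
    scoring each candidate by the average SRMSE over validation datasets."""
    d = np.linalg.norm(f_v(S) - f_v(S_obs), axis=1)
    I_valid = np.argsort(d)[:n_valid]                # validation datasets
    def avg_srmse(alpha):
        errs = []
        for i in I_valid:
            f_il = _fit_local_regression(S, theta, S[i:i + 1], f1, alpha)
            t = f_il(S)
            dd = np.linalg.norm(t - t[i], axis=1)
            I_post = np.argsort(dd)[1:n_post + 1]    # closest sims, excluding i itself
            rmse = np.sqrt(np.mean((theta[I_post] - theta[i]) ** 2, axis=0))
            errs.append(rmse.sum())
        return float(np.mean(errs))
    alpha_hat = min(alphas, key=avg_srmse)           # exhaustive search over candidates
    return _fit_local_regression(S, theta, S_obs, f1, alpha_hat)
```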

Example cases
In this section we apply the developed methods to the analysis of simulated datasets under three different models. We compared seven different dimension reduction techniques: linear regression (Reg), local linear regression (localReg), optimized local linear regression (localRegopt), partial least squares (PLS), optimized partial least squares (PLSopt), local partial least squares (localPLS) and optimized local partial least squares (localPLSopt). The aim of the comparison was to study how much localization improves widely used projection-based dimension reduction techniques on different models, and to study the effect of optimization on the local dimension reduction techniques. We implemented the dimension reduction methods and performed the example analyses in Matlab.

Setting for the examples
In all cases we normalized the candidate summaries before applying the dimension reductions. We first normalized each non-negative candidate with a one-parameter Box-Cox transformation with parameter value 0.5, and then standardized each candidate summary to zero mean and unit variance. In all applications of Algorithm 2 we set the number of validation datasets to N_valid = 20 and the number of posterior samples used within the algorithm to N_post = 100. The posterior distributions were approximated using the 200 closest simulations in each case. In Reg we modeled the parameters θ with linear regression using the set of all candidate summaries S as covariates, and used the predictions θ̂ = Sβ̂ obtained with ordinary least squares as the transformed summaries. In localReg we used a similar linear regression, but localized the transformation with Algorithm 1 using neighborhood size α = 500/N. In localRegopt we localized linear regression with Algorithm 2, optimizing α. As candidate values for α we used α_c = (0.032, 0.045, 0.063, 0.089, 0.13, 0.18, 0.25, 0.35, 0.50, 0.71), and as the initial transformation we used global linear regression.
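The normalization step described above could be sketched as follows (Python for illustration; the column-wise treatment of non-negative candidates is our reading of the procedure):

```python
import numpy as np

def normalize_summaries(S):
    """Pre-processing sketch: one-parameter Box-Cox with lambda = 0.5 applied to
    each non-negative candidate summary, then standardization of every column
    to zero mean and unit variance."""
    S = np.asarray(S, dtype=float).copy()
    for j in range(S.shape[1]):
        col = S[:, j]
        if np.all(col >= 0):                   # Box-Cox only for non-negative summaries
            col = (np.sqrt(col) - 1.0) / 0.5   # (x^lam - 1)/lam with lam = 0.5
        S[:, j] = (col - col.mean()) / col.std()
    return S
```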
In PLS we fitted partial least squares from S to θ using the plsregress function implemented in Matlab, and used the transformed PLS components as summaries. We chose the number of components based on mean squared errors estimated with 10-fold cross-validation. We cut the number of components at the point where inclusion of the next component decreased the MSE by less than 1% of the total variation in θ, using at most 15 components. In PLSopt we fitted global PLS as in PLS, but optimized the number of components with Algorithm 2. In localPLS we performed localized PLS with Algorithm 1 using α = 500/N, and chose the numbers of components for both f_1 and f_l as in global PLS. In localPLSopt we performed localized PLS with Algorithm 2, optimizing α among α_c, the number of components of the local PLS transformation f_l, and the number of components of the initial PLS transformation f_1 used for the localization. In all PLS transformations we set the maximum number of components to 15. For selecting the validation datasets, we used the PLS transformation with 15 components as the initial transformation f_v.
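The stopping rule for the number of PLS components can be illustrated in isolation (Python; `cut_components` is a hypothetical helper that takes precomputed cross-validated MSEs rather than calling plsregress itself):

```python
def cut_components(cv_mse, total_var_theta, max_comp=15):
    """Sketch of the stopping rule: given cross-validated MSEs for 1..K
    components, keep adding components while each new one lowers the MSE
    by at least 1% of the total variation in theta, capped at max_comp."""
    n = 1
    for k in range(1, min(len(cv_mse), max_comp)):
        if cv_mse[k - 1] - cv_mse[k] < 0.01 * total_var_theta:
            break  # improvement below the 1% threshold: stop here
        n = k + 1
    return n
```

For example, with CV MSEs (10, 5, 4.9, 4.8) and total variation 100, the second component improves the MSE by 5 (above the threshold of 1) but the third only by 0.1, so two components are kept.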

Ricker map
The Ricker map is an ecological model describing the dynamics of a population over time. The model has a relatively simple form, yet it produces highly complex, nearly chaotic dynamics. Inference for such models is difficult with likelihood-based approaches, but ABC and other likelihood-free methods have been proposed as alternatives (Wood, 2010; Fearnhead and Prangle, 2012). The population size N_t changes over one time step according to

N_{t+1} = r N_t exp(−N_t + e_t),

where the e_t are independent noise terms with normal distribution N(0, σ_e²) and r is the intrinsic growth rate. Observations y_t from the model at time t are assumed to follow a Poisson(φN_t) distribution. The parameters of interest are θ = (log(r), σ_e, φ).
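A simulator for this model is short; the following sketch (Python, with hypothetical argument names) generates Poisson observations from the Ricker dynamics:

```python
import numpy as np

def simulate_ricker(log_r, sigma_e, phi, T=100, n_obs=50, N0=1.0, rng=None):
    """Simulate the Ricker map N_{t+1} = r * N_t * exp(-N_t + e_t) with
    Poisson(phi * N_t) observations, returning the last n_obs observations."""
    rng = np.random.default_rng() if rng is None else rng
    r = np.exp(log_r)
    N = N0
    ys = []
    for _ in range(T):
        N = r * N * np.exp(-N + rng.normal(0.0, sigma_e))  # latent dynamics
        ys.append(rng.poisson(phi * N))                    # observation model
    return np.array(ys[-n_obs:])
```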
In our study we followed Fearnhead and Prangle (2012) and simulated the Ricker map for 100 time steps from the initial state N_0 = 1, using data from the last 50 time steps. We created 100 test datasets using log(r) = 3.8, φ = 10 and log(σ_e) values from a uniform grid between log(0.1) and 0. We used independent uniform priors on log(r), log(σ_e) and φ with ranges (0, 10), (log(0.1), 0) and (0, 100), respectively. For the analyses we simulated a total of 1,000,000 datasets with parameter values sampled from the prior distribution. As candidate summaries we used the autocovariances and autocorrelations of y up to lag 5, the mean and variance of y, Σ_t I(y_t = k) for k = 0, ..., 4, log(Σ_t y_t^i) for i = 2, ..., 6, the logarithms of the mean and variance of y, the time-ordered observations and the magnitude-ordered observations. In total, we had 124 candidate summaries.
With the Ricker model, localization helped to achieve higher accuracy for all parameters with both linear regression and PLS, compared to the global versions of the transformations (Fig. 1). The optimized local transformations produced on average higher accuracy than the regular local transformations, but there was some variation across parameters (Fig. S1 in the Supplementary material). The average values of α and of the number of PLS components under the different transformation strategies are shown in Tables S1 and S2, respectively, in the Supplementary material.

Fig. 1 Average SRMSE for the 100 test datasets. 'Reg' and 'PLS' refer to global regression and PLS transformations for the parameters. 'PLSopt' refers to PLS with the number of components optimized using validation datasets. 'localReg' and 'localPLS' refer to local versions of the regression and PLS transformations obtained with Algorithm 1, respectively. 'localRegopt' and 'localPLSopt' refer to optimized local versions of the regression and PLS transformations obtained with Algorithm 2, respectively.

Individual-based model of bird population dynamics
We analyzed the individual-based model (IBM) developed in Sirén et al. (2018) for understanding the population dynamics of White-starred robin in Taita Hills forest network in Kenya. The posterior distribution of the model parameters as well as predictions of population state were estimated with an ABC approach in the paper for capture-recapture and genetic data spanning 13 years. We give here a brief overview of the model, but for detailed description see Sirén et al. (2018).
The model is spatially structured, with 14 habitat patches of different sizes A suitable for the species, surrounded by matrix. Each patch holds a Poisson(qA)-distributed number of territories that can be occupied by a male and a female. Mating happens once a year in territories occupied by a pair and produces Binomial(2, p_J) juveniles. After the fledgling phase, the juveniles emigrate with probability logit⁻¹(ν_j + ν_a A) and immigrate to another patch i with probability proportional to e^(−α_I d_i), where d_i is the distance to patch i. The juveniles become adults after two years and may occupy free territories. Floaters (adults not occupying a territory) emigrate to another patch with daily probability logit⁻¹(ν_f + ν_a A) and have the same immigration probabilities as the juveniles. The mortality of individuals is modeled on a daily basis with probability logit⁻¹(ζ_d + ζ_S I(female)), where I(female) is an indicator for females. Each individual carries a genotype at a number of diploid microsatellite loci, and the genotypes follow Mendelian laws with stepwise mutation occurring with probability µ. Observations are made during mist-netting sessions, with each individual in the patch having probability logit⁻¹(η_1 + η_2 L + η_3 A + η_4 I(floater)) of being observed, where L is the sampling intensity and I(floater) is an indicator for floaters. An observation includes the identity of the bird and whether it is a juvenile or an adult, and for a predefined proportion of individuals also the genotype and sex. The parameter vector of interest is θ = (log(q), ζ_d, ζ_S, ν_f, log(−ν_a), log(α_I), logit(p_J), ν_j, log(µ), η_1, log(η_2), log(−η_3), log(η_4)).
We used the same set of simulations analyzed in Sirén et al. (2018) in scenario R+G(5,0.2) (5 loci and 20 % of individuals genotyped) corresponding to the full observed data. This included a total of 100,000 simulations with parameter values drawn from uniform prior distributions and evaluated at 344 candidate summary statistics, and additional 100 test datasets that were simulated with parameters set to produce datasets similar to the observed data. Due to the formulation of the model, the number of observations in the dataset varies strongly with the parameter values, and almost 75 % (74,830) of all simulations did not produce stable population and hence the datasets contained no observations. Therefore, we used two different sets of simulations to analyze the 100 test datasets: all 100,000 simulations and those 25,170 simulations with observations.
Using all 100,000 simulations, localization of the transformations with optimization led to improved accuracy with PLS, but when restricted to only the 25,170 simulations with observations, the accuracy was similar for the global and local versions of PLS (Fig. 2). This was probably due to the low number of simulations, which made it difficult to robustly estimate the relationship from the high-dimensional set of candidate summaries to the parameters. The performance of the four PLS transformations using only simulations with observations was also similar to that of optimized local PLS using all simulations.
For linear regression the results were somewhat mixed. With all simulations, localization resulted in significantly higher SRMSE than the global regression, but when using only simulations with observations, optimized local regression improved accuracy over the global version. The difference between the two sets of simulations was that with all simulations the initial regression transformation used for the localization worked poorly. Only a few thousand of the closest simulations contained observations, with the number varying between test datasets, while the other simulations with observations had the highest distances to the observed data. This did not cause problems for the global regression, because the closest 200 simulations were similar to the test data, but it made it almost impossible to use a localized version of the regression.
The non-optimized localization provided a slight improvement over the global transformation only for PLS with all simulations, while for the other combinations it decreased the accuracy compared to the global transformation (Fig. 2). The failure of the regular localization was probably due to the size of the neighborhood (500 samples), which was too small for this high-dimensional problem. RMSE values for each parameter separately are shown in Figs S2 and S3 in the Supplementary material.
To confirm that the problems with the local linear regressions were related to the initial transformation, we reran the analyses using the identity function as the initial transformation (i.e. using all candidate summaries directly) for the regular and optimized local regressions. The alternative initial transformation led to good performance for both local regressions, and optimized local regression had the highest accuracy of all the dimension reduction techniques for both sets of simulations.

g-and-k distribution
The g-and-k distribution is a flexible univariate distribution that can be used to model skewed and heavy-tailed data with a small number of parameters (Haynes et al., 1997). The likelihood function of the distribution does not have a closed form, but its quantile function is

Q(u) = A + B (1 + c (1 − e^(−g z(u))) / (1 + e^(−g z(u)))) (1 + z(u)²)^k z(u),

where z(·) is the quantile function of the standard normal distribution. Simulation from the model is straightforward with the inversion method, making the model ideally suited for ABC, and it has been widely used as a test case for new ABC methods (Fearnhead and Prangle, 2012; Drovandi et al., 2015; Prangle, 2017; Sisson et al., 2018). Typically the parameters of interest are θ = (A, B, g, k), with restrictions B > 0 and k > −1/2, and with c fixed to the value 0.8.

We analyzed 100 simulated datasets with 10,000 samples each from the g-and-k distribution with parameters θ = (3, 1, 2, 0.5), following the study of Fearnhead and Prangle (2012). For the ABC-based inference we simulated 800,000 pseudo-datasets with parameters sampled from the uniform prior distribution on (0, 10)⁴. We used 200 evenly spaced quantiles as candidate summaries for the dimension reduction algorithms.
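Simulation via the inversion method amounts to pushing standard normal draws through the quantile function; a sketch (Python, with a hypothetical function name):

```python
import numpy as np

def simulate_gk(A, B, g, k, n, c=0.8, rng=None):
    """Draw n samples from the g-and-k distribution by the inversion method:
    z ~ N(0,1) plays the role of z(u) for u ~ Uniform(0,1)."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(n)
    return (A
            + B * (1 + c * (1 - np.exp(-g * z)) / (1 + np.exp(-g * z)))
            * (1 + z ** 2) ** k * z)
```

At z = 0 the quantile function equals A, so A is the median of the distribution; g controls skewness and k the tail heaviness.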
Additionally, we tested the effect of dimensionality to the performance of the algorithms as a function of both the number of simulations and number of candidate summaries. We considered subsets of the simulations with 25, 50, 100 or 200 candidate summaries and 25,000, 50,000, 100,000, 200,000, 400,000 or 800,000 simulated pseudo datasets. We ran the algorithms on all test datasets, with each combination of the number of candidate summaries and pseudo datasets.
Localized versions PLS provided clear improvement in accuracy compared to their global counterparts with all combinations of numbers of simulations and candidate summaries (Fig. S5c,d). Optimization of the local PLS resulted in slightly higher accuracy over the regular version, but the difference was not large (Fig. S5c in the Supplementary material). With localized PLS the decrease in SRMSE was not affected by the number of candidate summaries, and was slightly negatively correlated with the number of simulations (Fig. 3c,d).
Optimized local linear regression provided increased accuracy compared to global regression (Fig. 3b). With regular local linear regression there was in many cases an improvement over global regression, but with a high number of simulations N it sometimes resulted in poorer accuracy (Fig. 3a). With linear regression, localization improved the accuracy most when the number of candidate summaries was high (100 or 200), with high variability between test datasets and combinations. The large decrease in SRMSE in high dimensions with linear regression was caused by the failure of the global regression transformation (Fig. S6o-r,t-s in the Supplementary material). In the highest-dimensional setting (N = 800,000, n_S = 200; Fig. S6s in the Supplementary material) even the localized versions of linear regression failed to produce useful summaries. This could be due to a too-high lower limit for the size of the local neighborhood α, because α was set to the lower bound 0.0316 for all test datasets (Fig. S7a in the Supplementary material).
Overall, the size of the neighborhood α in the localized algorithms was smaller with regression than PLS, but dimensionality had only a small effect on it (Fig. S7 in the Supplementary material). Similarly, the number of components in different versions of PLS did not show a clear effect of dimensionality (Fig. S8 in the Supplementary material). The high variation between settings and datasets in the number of PLS components and quantiles for optimized local PLS indicates that there is usually not a single optimal value for these, but different combinations produced similar results. The optimized PLS did not differ significantly from the regular PLS (Fig. S5b).
The optimized local versions of both linear regression and PLS were clearly superior to the global versions for estimating g and k, but provided similar or lower accuracy for A and B (Fig. S9 in the Supplementary material). This was probably because the optimized local algorithms minimize the sum of RMSEs over parameters: since there was higher uncertainty in g and k, the poorer relative accuracy in A and B did not affect the overall accuracy as much.

Discussion
We introduced a localization strategy for projection-based dimension reduction techniques for summary statistics. The introduced algorithm creates a low-dimensional transformation of the summaries that is optimized to perform well in the neighborhood of the observed data. The proposed localization strategy is general and can be used with any projection-based dimension reduction technique to improve the efficiency of likelihood-free inference.
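To make the idea concrete, the localization step can be sketched as follows. This is not the paper's implementation but a minimal numpy illustration with a hypothetical toy simulator: a global linear regression maps candidate summaries to the parameter, the fraction α of simulations whose transformed summaries lie nearest the observed one is selected, and the regression is refitted on that neighborhood to obtain the localized transformation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy simulator: theta -> 5 candidate summaries,
# of which only the first two carry information about theta.
def simulate(theta):
    return np.array([theta + rng.normal(0, 0.1),
                     theta**2 + rng.normal(0, 0.1),
                     rng.normal(), rng.normal(), rng.normal()])

thetas = rng.uniform(0, 2, size=2000)
S = np.array([simulate(t) for t in thetas])   # candidate summaries
s_obs = simulate(1.0)                         # "observed" dataset

# Initial (global) transformation: least-squares regression of theta
# on the candidate summaries; the fitted predictor is the summary.
X = np.column_stack([np.ones(len(S)), S])
beta_global = np.linalg.lstsq(X, thetas, rcond=None)[0]

# Localize: keep the fraction alpha of simulations whose transformed
# summaries are closest to the observed one, then refit on that subset.
alpha = 0.1
t_sims = X @ beta_global
t_obs = np.concatenate([[1.0], s_obs]) @ beta_global
near = np.argsort(np.abs(t_sims - t_obs))[: int(alpha * len(S))]
beta_local = np.linalg.lstsq(X[near], thetas[near], rcond=None)[0]

# The locally refitted projection defines the transformed summary for ABC.
t_local_obs = np.concatenate([[1.0], s_obs]) @ beta_local
```

The local fit only needs to approximate the summary-parameter relationship in the retained neighborhood, which is where the linearity assumption is most plausible.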
The optimization of the localized transformation over validation datasets guarantees good performance of the transformed summaries also outside the local neighborhood. This is in contrast to the similar localization strategy suggested earlier by Aeschbacher et al. (2012), which did not validate or optimize the constructed summaries in the whole space. Our results show that the optimization improves the accuracy of summaries produced with localization, although in many cases the difference is not large. More importantly, the optimization results in more stable transformations with less variation in accuracy among datasets, whereas localization without optimization sometimes produces poorly behaving summaries. The improvements provided by the optimization were more pronounced in high-dimensional situations, such as with the White-starred robin model, for which the non-optimized localization in many cases led to inferior performance. The failure was probably caused by too small a neighborhood being used for localization. While this could be fixed by using a higher value for α, it also highlights the need to adapt the dimension reduction technique to the problem at hand, which our approach provides automatically.
The optimized localization strategy as presented in Algorithm 2 does add a potentially significant computational cost to the inference. As the optimization is based on an exhaustive search over candidate transformation parameters Λ, a total of N valid |Λ| transformations have to be constructed instead of a single global transformation. If the transformation f itself depends on parameters, such as the number of components in PLS, there are multiple parameters to optimize and hence the size of Λ has to be large. Therefore, localizing the transformation might not be sensible for models from which it is very fast to simulate new datasets, such as the g-and-k distribution and the Ricker map. For computationally heavy models, for which simulating one dataset can take minutes or even hours, localization helps achieve higher accuracy without too large an additional computational cost. Moreover, the computational cost could easily be reduced by using more sophisticated optimization algorithms such as Bayesian optimization (Snoek et al., 2012), which could find the optimal solution with fewer evaluations of the parameter values λ.
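The exhaustive search over Λ can be illustrated with a simplified sketch. This is a hypothetical toy example, not the paper's Algorithm 2: there each candidate λ is scored by the accuracy of a full ABC analysis on the validation datasets, whereas for brevity we score the local regression point estimate directly, with the neighborhood size α as the only transformation parameter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: a mildly nonlinear summary-parameter link, so the best
# neighborhood size alpha is not obvious a priori.
def simulate(theta):
    return np.array([np.sin(theta) + rng.normal(0, 0.05),
                     theta + rng.normal(0, 0.5)])

thetas = rng.uniform(0, 3, 5000)
S = np.array([simulate(t) for t in thetas])
X = np.column_stack([np.ones(len(S)), S])

# Held-out validation datasets with known parameters.
val_thetas = rng.uniform(0, 3, 50)
val_S = np.array([simulate(t) for t in val_thetas])

def local_fit_predict(s_obs, alpha):
    """Fit a linear regression on the alpha-fraction of simulations
    nearest to s_obs (Euclidean distance on standardized summaries)."""
    z = (S - S.mean(0)) / S.std(0)
    z_obs = (s_obs - S.mean(0)) / S.std(0)
    near = np.argsort(np.linalg.norm(z - z_obs, axis=1))[: int(alpha * len(S))]
    beta = np.linalg.lstsq(X[near], thetas[near], rcond=None)[0]
    return np.concatenate([[1.0], s_obs]) @ beta

# Exhaustive search over candidate transformation parameters Lambda,
# scoring each alpha by RMSE over the validation datasets.
Lambda = [0.02, 0.05, 0.1, 0.3, 1.0]    # alpha = 1.0 is the global fit
rmse = {a: np.sqrt(np.mean([(local_fit_predict(s, a) - t) ** 2
                            for s, t in zip(val_S, val_thetas)]))
        for a in Lambda}
alpha_star = min(rmse, key=rmse.get)
```

The cost scales as the number of validation datasets times |Λ|, which is why a smarter search over Λ would pay off for expensive transformations.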
The choice of the initial transformation f 1 used for localization in Algorithm 2 may have a significant impact on the performance of the transformed summaries. In most cases using the global version of the projection-based method works well, as shown by our results, but sometimes the global transformation can produce summaries that lead the localization in a wrong direction. For example, in the White-starred robin model global regression resulted in summaries under which zero simulations were closer to the test datasets than most of the positive simulations. As a result, the local transformations were based mostly on zero simulations and failed to be informative about the parameters. By using all the candidate summaries as the initial summary statistics, the local regression worked as expected and provided a clear improvement over the global regression. While this kind of pathological behavior is not expected to be common, it is still advisable to check that the summaries used for localization are reasonable. On the other hand, the significant increase in accuracy after localization using over 300 candidate summaries suggests that the initial choice does not need to be perfect.
The example cases analysed in this work show interesting results concerning localized linear regression and PLS as dimension reduction techniques. When dimensionality was high, linear regression sometimes failed to produce good transformed summaries. The failure of regression occurred with multiple models, whereas PLS seemed to work more robustly regardless of the problem. With the g-and-k distribution, increasing both the number of simulations N and the number of summaries n S caused difficulties for regression. With the White-starred robin model the cause of the problems for regression was the high number of simulations with zero observations. The latter failure could be related to the differing principles behind the two methods. The zero simulations did not influence the PLS components that were used as the transformed summaries with PLS. With linear regression, on the other hand, the transformed summaries were the linear projections, which were distorted by the high number of zero simulations. Having said that, we acknowledge that comparison between the two methods was not the main goal of the present work, and the results should be considered at most suggestive of their general performance. Our results are also somewhat in contrast to the findings of Blum et al. (2013), who found that linear regression generally outperformed PLS and that the difference was largest in high-dimensional settings. The relative performance of the methods seems to vary with the model under study, and more research would be needed to understand it better.
In this work we have focused on finding relatively simple linear transformations for the summaries. Such transformations are perhaps ideally suited for localization: the linearity assumption is not expected to hold globally in the space of possible summaries, but locally, within a restricted range, linearity is often a reasonable approximation. Additionally, linear mappings are easy to learn even in high-dimensional settings, allowing narrower localization. However, the localization algorithms presented here are suitable for any other projection-based dimension reduction method, including non-linear regression approaches such as feedforward neural networks (Blum and François, 2010) or boosting (Aeschbacher et al., 2012). With more complicated transformations localization might not lead to as large improvements, since such transformations require a larger number of samples to fit and, at the same time, might capture the true relationship better in a larger neighborhood. Whether a narrower linear transformation in general provides better summaries than a wider non-linear transformation remains an open question, although the comparisons in Blum et al. (2013) suggest that non-linear methods lead to roughly similar performance as linear methods. The transformation of the summaries produces a scaling for the candidate summaries, and for the accuracy of the ABC inference this scaling mostly matters in the neighborhood around the analyzed dataset. Even if more complex transformations provide a better scaling further away from the data, this might not have any significant impact on the inference results.
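As a small illustration of why such linear projections are cheap to learn, the following sketch (hypothetical toy data, not the paper's setup) computes two one-dimensional transformed summaries from 20 candidate summaries: the OLS fitted linear predictor, and the first PLS weight direction (the first NIPALS component, the normalized S^T y).

```python
import numpy as np

rng = np.random.default_rng(2)

# 20 candidate summaries; only the first carries signal about theta.
n, p = 1000, 20
thetas = rng.normal(0, 1, n)
S = rng.normal(0, 1, (n, p))
S[:, 0] += thetas

Sc = S - S.mean(0)            # center summaries
yc = thetas - thetas.mean()   # center parameter

# OLS projection: the transformed summary is the fitted linear predictor.
beta = np.linalg.lstsq(Sc, yc, rcond=None)[0]

# First PLS component: the weight vector maximizing covariance with
# theta, i.e. the first NIPALS direction S^T y, normalized.
w = Sc.T @ yc
w /= np.linalg.norm(w)

# Both define one-dimensional transformed summaries.
t_ols = Sc @ beta
t_pls = Sc @ w

# With a purely linear signal the two projections nearly coincide.
corr = np.corrcoef(t_ols, t_pls)[0, 1]
```

When the signal is linear and the noise summaries are uncorrelated, as here, the two projections are almost interchangeable; their behavior diverges in settings like those discussed above, e.g. under many uninformative or degenerate simulations.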
All dimension reduction techniques for summary statistics, including the one introduced in this work, are based on rejection sampling ABC, which is computationally inefficient in anything but low-dimensional problems. The main advantage of rejection sampling is that it facilitates performing multiple ABC analyses for optimizing the dimension reduction using the same set of simulations, which would not be possible with any other ABC algorithm. If the accuracy provided by rejection sampling is not sufficient, the transformed summaries may be used in another ABC analysis with a more advanced ABC algorithm, such as ABC SMC (Toni et al., 2009) or BOLFI (Gutmann and Corander, 2016). However, it might be possible to extend SMC-type ABC algorithms to simultaneously target the posterior and adapt the transformation of the summaries. Prangle (2017) introduced a population Monte Carlo ABC algorithm that adapted the weights of the summaries in the distance function at every step, and mentioned the possibility that it could be extended to adapt the summaries themselves. Ensuring convergence of such an algorithm might prove difficult, because the algorithm would at the same time be modifying the target and trying to concentrate locally around the target. Nevertheless, such an algorithm could provide a significant increase in efficiency for ABC analysis of models with a high number of candidate summaries, and hence more research on the subject would be justified.

Figure S1: Accuracy of different dimension reduction techniques with the Ricker model evaluated over simulated test datasets. Each panel shows the RMSEs for one parameter, with each distribution showing the average RMSE for the 100 test datasets. 'Reg' and 'PLS' refer to global regression and PLS transformations for the parameters. 'PLSopt' refers to PLS with the number of components optimized using validation datasets.
'localReg' and 'localPLS' refer to local versions of the regression and PLS transformations with Algorithm 1, respectively. 'localRegopt' and 'localPLSopt' refer to optimized local versions of the regression and PLS transformations with Algorithm 2, respectively.

Figure S2: Accuracy of different dimension reduction techniques with the bird population dynamics IBM evaluated over simulated test datasets using all simulations. Each panel shows the RMSEs for one parameter, with each distribution showing the average RMSE for the 100 test datasets. 'Reg' and 'PLS' refer to global regression and PLS transformations for the parameters. 'PLSopt' refers to PLS with the number of components optimized using validation datasets. 'localReg' and 'localPLS' refer to local versions of the regression and PLS transformations with Algorithm 1, respectively. 'localRegopt' and 'localPLSopt' refer to optimized local versions of the regression and PLS transformations with Algorithm 2, respectively.

Figure S4: Same as Fig. 2, but with the identity function instead of global regression as the initial transformation for the local regressions ('localReg' and 'localRegopt'). Accuracy of different dimension reduction techniques with the bird population dynamics IBM evaluated over simulated test datasets. The plots show the average SRMSE for the 100 test datasets using all simulations (a) or using only those 25,170 simulations with observations (b). 'Reg' and 'PLS' refer to global regression and PLS transformations for the parameters. 'PLSopt' refers to PLS with the number of components optimized using validation datasets. 'localReg' and 'localPLS' refer to local versions of the regression and PLS transformations with Algorithm 1, respectively. 'localRegopt' and 'localPLSopt' refer to optimized local versions of the regression and PLS transformations with Algorithm 2, respectively.
Figure S5: Reduction in RMSE from optimization and localization of the transformation with the g-and-k distribution, evaluated over simulated test datasets for different numbers of simulations and candidate summaries. The panels show the median relative SRMSEs over test datasets for optimized over regular local regression (a), optimized over regular PLS (b) and optimized over regular local PLS (c) as a function of the number of simulations (N). Each line shows the reduction for one number of candidate summaries (n S) as indicated in the legend. The dotted vertical lines indicate 90% intervals for the SRMSEs over the test datasets.

Figure S6: Accuracy of different dimension reduction techniques with the g-and-k distribution evaluated over simulated test datasets for different numbers of simulations and candidate summaries. Each panel shows the sum of RMSEs over the parameters, with each distribution showing the average RMSE for the 100 test datasets for one combination of the number of simulations (N) and number of candidate summaries (n S). 'Reg' and 'PLS' refer to global regression and PLS transformations for the parameters. 'PLSopt' refers to PLS with the number of components optimized using validation datasets. 'localReg' and 'localPLS' refer to local versions of the regression and PLS transformations with Algorithm 1, respectively. 'localRegopt' and 'localPLSopt' refer to optimized local versions of the regression and PLS transformations with Algorithm 2, respectively.

Figure S7: The size of the local neighborhood α used for learning the localized transformation with the g-and-k distribution, evaluated over simulated test datasets for different numbers of simulations and candidate summaries. The panels show the average α over test datasets for optimized local regression (a) and optimized local PLS (b). The dotted vertical lines indicate 90% intervals for α over the test datasets.
Figure S8: The number of components used for PLS with the g-and-k distribution, evaluated over simulated test datasets for different numbers of simulations and candidate summaries. The panels show the average number of components over test datasets for regular PLS (a), optimized PLS (b), the local transformation in regular local PLS (c), the initial transformation in optimized local PLS (d) and the local transformation in optimized local PLS (e). The dotted vertical lines indicate 90% intervals for the number of components over the test datasets.

Figure S9: Accuracy of different dimension reduction techniques with the g-and-k distribution, evaluated over simulated test datasets for the setting with 100 candidate summaries and 200,000 simulations. Each panel shows the RMSEs for one parameter, with each distribution showing the average RMSE for the 100 test datasets. 'Reg' and 'PLS' refer to global regression and PLS transformations for the parameters. 'PLSopt' refers to PLS with the number of components optimized using validation datasets. 'localReg' and 'localPLS' refer to local versions of the regression and PLS transformations with Algorithm 1, respectively. 'localRegopt' and 'localPLSopt' refer to optimized local versions of the regression and PLS transformations with Algorithm 2, respectively.