# Likelihood-free inference via classification


## Abstract

Increasingly complex generative models are being used across disciplines as they allow for realistic characterization of data, but a common difficulty with them is the prohibitively large computational cost to evaluate the likelihood function and thus to perform likelihood-based statistical inference. A likelihood-free inference framework has emerged where the parameters are identified by finding values that yield simulated data resembling the observed data. While widely applicable, a major difficulty in this framework is how to measure the discrepancy between the simulated and observed data. Transforming the original problem into a problem of classifying the data into simulated versus observed, we find that classification accuracy can be used to assess the discrepancy. The complete arsenal of classification methods becomes thereby available for inference of intractable generative models. We validate our approach using theory and simulations for both point estimation and Bayesian inference, and demonstrate its use on real data by inferring an individual-based epidemiological model for bacterial infections in child care centers.

## Keywords

Approximate Bayesian computation, Generative models, Intractable likelihood, Latent variable models, Simulator-based models

## 1 Introduction

The likelihood function plays a central role in statistical inference by quantifying the extent to which some values of the model parameters are consistent with the observed data. For complex models, however, evaluating the likelihood function can be computationally very costly, which often prevents its use in practice. This paper is about statistical inference for generative models whose likelihood function cannot be computed in a reasonable time.

A generative model is here defined as a parametrized probabilistic mechanism which specifies how the data are generated. It is usually implemented as a computer program that takes a state of the random number generator and some values of the model parameters \(\varvec{\theta }\) as input and that returns simulated data \(\mathbf {Y}_{\varvec{\theta }}\) as output. The mapping from the parameters \(\varvec{\theta }\) to simulated data \(\mathbf {Y}_{\varvec{\theta }}\) is stochastic, and running the computer program for different states of the random number generator corresponds to sampling from the model. Generative models are also known as simulator- or simulation-based models (Hartig et al. 2011), or implicit models (Diggle and Gratton 1984), and are closely related to probabilistic programs (Mansinghka et al. 2013). Their scope of applicability is extremely wide, ranging from genetics and ecology (Beaumont 2010) to economics (Gouriéroux et al. 1993), physics (Cameron and Pettitt 2012), and computer vision (Zhu et al. 2009).

A disadvantage of complex generative models is the difficulty of performing inference with them: evaluating the likelihood function involves computing the probability of the observed data \(\mathbf {X}\) as a function of the model parameters \(\varvec{\theta }\), which for complex models cannot be done analytically or computationally within practical time limits.

As generative models are widely used, solutions have emerged in multiple fields to perform “likelihood-free” inference, that is, inference which does not rely on the availability of the likelihood function. Approximate Bayesian computation (ABC) stems from research in genetics (Beaumont et al. 2002; Marjoram et al. 2003; Pritchard et al. 1999; Tavaré et al. 1997), while the method of simulated moments (McFadden 1989; Pakes and Pollard 1989) and indirect inference (Gouriéroux et al. 1993; Smith 2008) come from econometrics. The latter methods are traditionally used in a classical inference framework while ABC has its roots in Bayesian inference, but the boundaries have started to blur (Drovandi et al. 2011). Despite their differences, the methods all share the basic idea of performing inference about \(\varvec{\theta }\) by identifying values which generate simulated data \(\mathbf {Y}_{\varvec{\theta }}\) that resemble the observed data \(\mathbf {X}\).

The discrepancy between the simulated and observed data is typically measured by reducing each data set to a vector of summary statistics and measuring the distance between them. Both the distance function used and the summary statistics are critical for the success of the inference procedure (see, for example, the reviews by Lintusaari et al. (2017) and Marin et al. (2012)). Traditionally, researchers choose the two quantities subjectively, relying on expert knowledge about the observed data. The goal of this paper is to show that the complete arsenal of classification methods can be put at our disposal to measure the discrepancy, and thus to perform inference for intractable generative models.

The paper is based on the observation that distinguishing two data sets that were generated with very different values of \(\varvec{\theta }\) is usually easier than distinguishing two data sets that were generated with similar values. We propose to use the discriminability (classifiability) of the observed and simulated data as a discrepancy measure in likelihood-free inference.

We visualize the basic idea in Fig. 1 for the inference of the mean \(\varvec{\theta }\) of a bivariate Gaussian with identity covariance matrix. The observed data \(\mathbf {X}\), shown with black circles, were generated with mean \(\varvec{\theta }^{\circ }\) equal to zero. Figure 1a shows that data \(\mathbf {Y}_{\varvec{\theta }}\) simulated with mean \(\varvec{\theta }=(6,0)\) can be easily distinguished from \(\mathbf {X}\). The indicated classification rule yields an accuracy of 100%. In Fig. 1b, on the other hand, the data were simulated with \(\varvec{\theta }= (1/2,0)\) and distinguishing such data from \(\mathbf {X}\) is much more difficult; the best classification rule only yields 58% correct assignments. Moreover, if the data were simulated with \(\varvec{\theta }= \varvec{\theta }^{\circ }\), the classification task could not be solved significantly above chance level. This suggests that we can perform likelihood-free inference by identifying parameters which yield chance-level discriminability only.
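The example can be reproduced numerically with a minimal sketch. It assumes that both distributions are known, so that the rule "assign a point to the simulated data if its projection onto \(\varvec{\theta }\) exceeds \(\Vert \varvec{\theta }\Vert ^2/2\)" can be applied directly. The function name `midpoint_rule_accuracy`, the sample size, and the seed are illustrative choices and not taken from the paper.

```python
import random

def midpoint_rule_accuracy(theta, n=2000, seed=0):
    """Accuracy of the midpoint rule for N(0, I) versus N(theta, I) in 2-D."""
    rng = random.Random(seed)
    norm_sq = theta[0] ** 2 + theta[1] ** 2
    correct = 0
    for _ in range(n):
        # Observed point (class label 0), drawn with mean zero.
        x = (rng.gauss(0, 1), rng.gauss(0, 1))
        # Simulated point (class label 1), drawn with mean theta.
        y = (theta[0] + rng.gauss(0, 1), theta[1] + rng.gauss(0, 1))
        # Predict label 1 when the projection onto theta exceeds |theta|^2 / 2.
        correct += (x[0] * theta[0] + x[1] * theta[1]) <= norm_sq / 2
        correct += (y[0] * theta[0] + y[1] * theta[1]) > norm_sq / 2
    return correct / (2 * n)

for theta in [(6.0, 0.0), (0.5, 0.0), (0.0, 0.0)]:
    print(theta, midpoint_rule_accuracy(theta))
```

The accuracy is close to one for the distant mean, only moderately above one half for the nearby mean, and exactly one half when the two means coincide, because the rule then cannot separate the classes at all.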

## 2 Measuring discrepancy via classification

Standard classification methods operate on feature vectors that numerically represent the properties of the data that are judged relevant for the discrimination task (Hastie et al. 2009; Wasserman 2004). There is some freedom in how the feature vectors are constructed. In the simplest case, the data are statistically independent and identically distributed (iid) random variables, and the features are equal to the data points, as in Fig. 1. But the approach of using classification to measure the discrepancy is not restricted to iid data. In the paper, we will also construct features and set up classification problems for time series and matrix-valued data.

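We assume that we obtain *n* feature vectors, \(\mathbf {x}_i\) from the observed data and \(\mathbf {y}_i\) from the simulated data, \(i=1,\ldots ,n\), from each of the two data sets. The \(\mathbf {x}_i\) are then associated with class label 0 and the \(\mathbf {y}_i\) with class label 1, which yields the augmented data set \(\mathcal {D}_{\varvec{\theta }}\),

\[ \mathcal {D}_{\varvec{\theta }} = \{(\mathbf {x}_1,0),\ldots ,(\mathbf {x}_n,0),(\mathbf {y}_1,1),\ldots ,(\mathbf {y}_n,1)\}. \tag{1} \]

A classification rule is a function *h* that maps each feature vector \(\mathbf {u}\) to its class label \(h(\mathbf {u}) \in \{0,1\}\). The performance of *h* on \(\mathcal {D}_{\varvec{\theta }}\) can be assessed by the classification accuracy \(\text {CA}\),

\[ \text {CA}(h,\mathcal {D}_{\varvec{\theta }}) = \frac{1}{2n} \sum _{i=1}^{n} \left[ (1-h(\mathbf {x}_i)) + h(\mathbf {y}_i) \right] , \tag{2} \]

which is the proportion of correct assignments. The largest classification accuracy on average is achieved by the Bayes classification rule \(h^{*}_{\varvec{\theta }}\). We denote its classification accuracy by \(J_n^{*}(\varvec{\theta })\),

\[ J_n^{*}(\varvec{\theta }) = \text {CA}(h^{*}_{\varvec{\theta }},\mathcal {D}_{\varvec{\theta }}). \tag{3} \]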

In the motivating example in Fig. 1, the labels of the data points are indicated by their markers, and the Bayes classification rule by the hatched areas. The classification accuracy \(J_n^{*}(\varvec{\theta })\) decreases from 100% (perfect classification performance) toward 50% (chance-level performance) as \(\varvec{\theta }\) approaches \(\varvec{\theta }^{\circ }\), the parameter value which was used to generate the observed data \(\mathbf {X}\). While this provides an intuitive justification for using \(J_n^{*}(\varvec{\theta })\) as discrepancy measure, an analytical justification will be given in the next section where we show that \(J_n^{*}(\varvec{\theta })\) is related to the total variation distance under mild conditions.

In practice, \(J_n^{*}(\varvec{\theta })\) is not computable because the Bayes classification rule \(h^{*}_{\varvec{\theta }}\) involves the probability distribution of the data which is unknown in the first place. But the classification literature provides a wealth of methods to learn an approximation \(\hat{h}_{\varvec{\theta }}\) of the Bayes classification rule, and \(J_n^{*}(\varvec{\theta })\) can be estimated via cross-validation (Hastie et al. 2009; Wasserman 2004).

We will use several straightforward methods to obtain \(\hat{h}_{\varvec{\theta }}\): linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), \(L_1\)-regularized polynomial logistic regression, \(L_1\)-regularized polynomial support vector machine (SVM) classification, and an aggregation of the above and other methods (max-rule, see Supplementary material 1.1). These are by no means the only applicable methods. In fact, any method yielding a good approximation of \(h^{*}_{\varvec{\theta }}\) may be chosen; our approach makes the complete arsenal of classification methods available for inference of generative models.

We estimated \(J_n^{*}(\varvec{\theta })\) by *K*-fold cross-validation where the data \(\mathcal {D}_{\varvec{\theta }}\) are divided into *K* folds of training and validation sets, the different validation sets being disjoint. The training sets are used to learn the classification rules \(\hat{h}_{\varvec{\theta }}^k\) by any of the methods above, and the validation sets \(\mathcal {D}_{\varvec{\theta }}^k\) are used to measure their performances \(\text {CA}(\hat{h}_{\varvec{\theta }}^k,\mathcal {D}_{\varvec{\theta }}^k)\). The average classification accuracy on the validation sets, \(J_n(\varvec{\theta })\),

\[ J_n(\varvec{\theta }) = \frac{1}{K} \sum _{k=1}^{K} \text {CA}(\hat{h}_{\varvec{\theta }}^k,\mathcal {D}_{\varvec{\theta }}^k), \tag{4} \]

approximates \(J_n^{*}(\varvec{\theta })\) and serves as the computable discrepancy measure.

We used \(K = 5\) folds in the paper. In cross-validation, large values of *K* generally lead to approximations with smaller bias but larger variance than small values of *K*. Intermediate values like \(K=5\) are thought to lead to a good balance between the two desiderata (e.g., Hastie et al. 2009, Section 7.10).
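The cross-validated discrepancy can be sketched in a few lines. The following is a minimal illustration with a hand-written LDA (pooled covariance, equal class priors), not the implementation used for the results in the paper. The function names `lda_fit` and `classification_discrepancy`, the small ridge term, and the seed are assumptions made for the sketch.

```python
import numpy as np

def lda_fit(U, labels):
    """Fit LDA with pooled covariance; returns (w, b) of the rule w @ u + b > 0."""
    m0 = U[labels == 0].mean(axis=0)
    m1 = U[labels == 1].mean(axis=0)
    centered = np.vstack([U[labels == 0] - m0, U[labels == 1] - m1])
    # Pooled covariance with a small ridge for numerical stability (assumption).
    cov = centered.T @ centered / len(U) + 1e-6 * np.eye(U.shape[1])
    w = np.linalg.solve(cov, m1 - m0)
    b = -0.5 * w @ (m0 + m1)
    return w, b

def classification_discrepancy(X, Y, K=5, seed=0):
    """Average validation accuracy over K folds for feature arrays X and Y (n x d)."""
    rng = np.random.default_rng(seed)
    U = np.vstack([X, Y])
    labels = np.concatenate([np.zeros(len(X), dtype=int), np.ones(len(Y), dtype=int)])
    folds = np.array_split(rng.permutation(len(U)), K)  # disjoint validation sets
    accuracies = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w, b = lda_fit(U[train], labels[train])
        pred = (U[val] @ w + b > 0).astype(int)
        accuracies.append(np.mean(pred == labels[val]))
    return float(np.mean(accuracies))

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 2))
print(classification_discrepancy(X, rng.standard_normal((1000, 2)) + 4.0))
print(classification_discrepancy(X, rng.standard_normal((1000, 2))))
```

The first call compares well-separated data sets and returns a value near one; the second compares two samples from the same distribution and returns a value near one half.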

We next show on a range of different kinds of data that most of the different classification methods yield equally good approximations of \(J_n^{*}(\varvec{\theta })\) for large sample sizes. Continuous data (drawn from a univariate Gaussian distribution of variance one), binary data (from a Bernoulli distribution), count data (from a Poisson distribution), and time series data (from a zero mean moving average model of order one) are considered. For the first three data sets, the unknown parameter is the mean, and for the moving average model, the lag coefficient is the unknown quantity (see Supplementary material 1.2 for the model specifications). Unlike for the other three data sets, the data points from the moving average model are not statistically independent, as the lag coefficient affects the correlation between two consecutive time points \(x_t\) and \(x_{t+1}\). For the classification, we treated each pair \((x_t,x_{t+1})\) as a feature.

Figure 2 shows that for the Gaussian, Bernoulli, and Poisson data, all the considered classification methods perform as well as the Bayes classification rule (BCR), yielding discrepancy measures \(J_n(\varvec{\theta })\) that are practically identical to \(J_n^{*}(\varvec{\theta })\). The same holds for the moving average model, with the exception of LDA. The reason is that LDA is not sensitive to the correlation between \(x_t\) and \(x_{t+1}\), which would be needed to discover the value of the lag coefficient. In other words, the Bayes classification rule \(h^{*}_{\varvec{\theta }}\) is outside the family of possible classification rules learned by LDA.
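The behavior of LDA on the time series can be made concrete with a small sketch. It assumes a moving average model of the form \(x_t = e_t + \theta e_{t-1}\) with standard normal \(e_t\) (the exact specification is in Supplementary material 1.2 and is not reproduced here); the function names and the two lag coefficients are illustrative.

```python
import numpy as np

def simulate_ma1(theta, n, seed):
    """Zero-mean MA(1) series x_t = e_t + theta * e_{t-1} (assumed model form)."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(n + 1)
    return e[1:] + theta * e[:-1]

def pair_features(x):
    """Stack consecutive time points (x_t, x_{t+1}) as two-dimensional features."""
    return np.column_stack([x[:-1], x[1:]])

pairs_a = pair_features(simulate_ma1(0.2, 20000, seed=1))
pairs_b = pair_features(simulate_ma1(0.8, 20000, seed=2))
for name, pairs in [("theta = 0.2", pairs_a), ("theta = 0.8", pairs_b)]:
    # Feature means are near zero for both coefficients; the correlation differs.
    print(name, pairs.mean(axis=0), np.corrcoef(pairs.T)[0, 1])
```

Under this assumed model the feature means are zero for every lag coefficient, while the correlation between the two components equals \(\theta /(1+\theta ^2)\). A linear rule driven by the difference of the class means therefore has nothing to work with, whereas methods that are sensitive to second-order structure do.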

## 3 Classical inference via classification

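For point estimation, we take as estimate the parameter value \(\varvec{\hat{\theta }}_n\) for which the simulated data are least distinguishable from the observed data, that is, the minimizer of \(J_n\),

\[ \varvec{\hat{\theta }}_n = \mathop {\mathrm {argmin}}\limits _{\varvec{\theta }} J_n(\varvec{\theta }), \tag{5} \]

and we give conditions under which \(\varvec{\hat{\theta }}_n\) converges to \(\varvec{\theta }^{\circ }\) as the sample size *n* increases. Figure 3 provides motivating evidence for consistency of \(\varvec{\hat{\theta }}_n\).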

The proposition below lists two conditions. The first one is related to convergence of frequencies to expectations (law of large numbers), the second to the ability to learn the Bayes classification rule more accurately as the sample size increases. We prove the proposition in the Appendix. Some basic assumptions are made: The \(\mathbf {x}_i\) are assumed to have the marginal probability measure \({\text {P}}_{\varvec{\theta }^{\circ }}\) and the \(\mathbf {y}_i\) the marginal probability measure \({\text {P}}_{\varvec{\theta }}\) for all *i*, which amounts to a weak stationarity assumption. Importantly, the stationarity assumption does not rule out statistical dependencies between the data points; time series data, for example, are allowed. We also assume that the parametrization of \({\text {P}}_{\varvec{\theta }}\) is not degenerate, that is, there is a compact set \({\varTheta }\) containing \(\varvec{\theta }^{\circ }\) where \(\varvec{\theta }\ne \varvec{\theta }^{\circ }\) implies that \({\text {P}}_{\varvec{\theta }} \ne {\text {P}}_{\varvec{\theta }^{\circ }}\).

### Proposition 1

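Denote by \(H^{*}_{\varvec{\theta }}\) the set of features which the Bayes classification rule \(h^{*}_{\varvec{\theta }}\) classifies as being from the simulated data. The expected discriminability \({{\mathrm{E}}}(J_n^{*}(\varvec{\theta }))\) equals \(J(\varvec{\theta })\),

\[ J(\varvec{\theta }) = \frac{1}{2} + \frac{1}{2} \left( {\text {P}}_{\varvec{\theta }}(H^{*}_{\varvec{\theta }}) - {\text {P}}_{\varvec{\theta }^{\circ }}(H^{*}_{\varvec{\theta }}) \right) , \tag{6} \]

and \(\varvec{\hat{\theta }}_n\) converges to \(\varvec{\theta }^{\circ }\) in probability as the sample size *n* increases, \(\varvec{\hat{\theta }}_n \mathop {\rightarrow }\limits ^{P}\varvec{\theta }^{\circ }\), if

\[ \sup _{\varvec{\theta }\in {\varTheta }} \left| J_n^{*}(\varvec{\theta }) - J(\varvec{\theta }) \right| \mathop {\rightarrow }\limits ^{P} 0, \tag{7} \]

\[ \sup _{\varvec{\theta }\in {\varTheta }} \left| J_n(\varvec{\theta }) - J_n^{*}(\varvec{\theta }) \right| \mathop {\rightarrow }\limits ^{P} 0. \tag{8} \]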

The two conditions guarantee that \(J_n(\varvec{\theta })\) converges uniformly to \(J(\varvec{\theta })\), so that minimizing \(J_n(\varvec{\theta })\) amounts to minimizing \(J(\varvec{\theta })\) as *n* increases. Since \(J(\varvec{\theta })\) attains its minimum at \(\varvec{\theta }^{\circ }\), \(\varvec{\hat{\theta }}_n\) converges to \(\varvec{\theta }^{\circ }\). By definition of \(H^{*}_{\varvec{\theta }}\), \({\text {P}}_{\varvec{\theta }}(H^{*}_{\varvec{\theta }}) - {\text {P}}_{\varvec{\theta }^{\circ }}(H^{*}_{\varvec{\theta }})\) is one half of the total variation distance between the two distributions (Pollard 2001, Chapter 3). The limiting objective \(J(\varvec{\theta })\) thus corresponds to a well-defined statistical distance between \({\text {P}}_{\varvec{\theta }}\) and \({\text {P}}_{\varvec{\theta }^{\circ }}\).
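As a worked illustration (not part of the original derivation), consider univariate Gaussians with unit variance, \(\theta ^{\circ }=0\) and \(\theta >0\). Taking the limiting objective to be the expected accuracy of the Bayes classification rule, and writing \(\Phi \) for the standard normal distribution function, one obtains:

```latex
\begin{aligned}
H^{*}_{\theta} &= \{u : u > \theta/2\}, \\
\mathrm{P}_{\theta}(H^{*}_{\theta}) &= \Phi(\theta/2), \qquad
\mathrm{P}_{\theta^{\circ}}(H^{*}_{\theta}) = 1 - \Phi(\theta/2), \\
J(\theta) &= \tfrac{1}{2} + \tfrac{1}{2}\bigl(\mathrm{P}_{\theta}(H^{*}_{\theta})
  - \mathrm{P}_{\theta^{\circ}}(H^{*}_{\theta})\bigr) = \Phi(\theta/2).
\end{aligned}
```

In this example, \(J(\theta )\) decreases monotonically to one half as \(\theta \) approaches \(\theta ^{\circ }\) and tends to one as the two means move apart, in line with its interpretation as a distance between the two distributions.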

The condition in Eq. (7) is about convergence of sample averages to expectations. Standard convergence results apply for statistically independent features. For features with statistical dependencies, e.g., time series data, corresponding convergence results are investigated in empirical process theory (van der Vaart and Wellner 1996), which forms a natural limit of what is studied in this paper. We may only note that by definition of *J*, convergence will depend on the complexity of the sets \(H^{*}_{\varvec{\theta }}\), \(\varvec{\theta }\in {\varTheta }\), and hence the complexity of the Bayes classification rules \(h^{*}_{\varvec{\theta }}\). The condition does not depend on the classification method employed. In other words, this first condition is about the difficulty of the classification problems that need to be solved. The condition in Eq. (8), on the other hand, is about the ability to solve them: The performance of the learned rule needs to approach the performance of the Bayes classification rule as the number of available samples increases. How best to learn such rules, and under which conditions learning is guaranteed to succeed, is a research area in itself (Zhang 2004).

## 4 Bayesian inference via classification

We consider next inference of the posterior distribution of \(\varvec{\theta }\) in the framework of approximate Bayesian computation (ABC). The basic steps of ABC algorithms are:

1. Proposing a parameter value \(\varvec{\theta }'\),
2. Simulating pseudo-observed data \(\mathbf {Y}_{\varvec{\theta }'}\), and then
3. Accepting or rejecting the proposal based on a comparison of \(\mathbf {Y}_{\varvec{\theta }'}\) with the real observed data \(\mathbf {X}\).

We use \(J_n\) as the discrepancy measure for the comparison in step 3 and call the resulting approach classifier ABC.

The results reported in this paper were obtained with a sequential Monte Carlo implementation (see Supplementary material 1.3). The use of \(J_n\) in ABC is, however, not restricted to that particular algorithm.
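To illustrate how \(J_n\) enters such an algorithm, the following minimal sketch uses plain rejection sampling instead of the sequential Monte Carlo scheme used for the reported results. The toy model (a Gaussian with unknown mean and unit variance), the uniform prior on \((-5,5)\), the acceptance threshold of 0.55, and the simple midpoint-threshold classifier are all assumptions chosen to keep the example short.

```python
import numpy as np

def j_n(x, y, K=5, rng=None):
    """Cross-validated accuracy of a midpoint-threshold rule on 1-D data."""
    if rng is None:
        rng = np.random.default_rng(0)
    u = np.concatenate([x, y])
    labels = np.concatenate([np.zeros(len(x)), np.ones(len(y))])
    folds = np.array_split(rng.permutation(len(u)), K)
    accuracies = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        m0 = u[train][labels[train] == 0].mean()
        m1 = u[train][labels[train] == 1].mean()
        # Predict label 1 on the side of the midpoint where the class-1 mean lies.
        pred = np.sign(m1 - m0) * (u[val] - 0.5 * (m0 + m1)) > 0
        accuracies.append(np.mean(pred == labels[val]))
    return float(np.mean(accuracies))

def classifier_abc(x_obs, n_proposals=400, threshold=0.55, seed=0):
    """Rejection ABC with classification accuracy as the discrepancy."""
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n_proposals):
        theta = rng.uniform(-5.0, 5.0)                # 1. propose from the prior
        y = theta + rng.standard_normal(len(x_obs))   # 2. simulate data
        if j_n(x_obs, y, rng=rng) < threshold:        # 3. accept if hard to classify
            accepted.append(theta)
    return np.array(accepted)

rng = np.random.default_rng(42)
x_obs = 1.0 + rng.standard_normal(200)
samples = classifier_abc(x_obs)
print(len(samples), samples.mean())
```

The accepted values concentrate around the mean that generated the observed data. Lowering the threshold toward one half tightens the approximation at the cost of fewer accepted proposals.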

We validated classifier ABC on binary (Bernoulli), count (Poisson), continuous (Gaussian), and time series (ARCH) data (see Supplementary material 1.2 for the model details). The true posterior for the autoregressive conditional heteroskedasticity (ARCH) model is not available in closed form. We approximated it using deterministic numerical integration, as detailed in Supplementary material 1.2.

The inferred empirical posterior probability density functions (pdfs) are shown in Fig. 4. There is a good match with the true posterior pdfs or the approximation obtained with deterministic numerical integration. Different classification methods yield different results, but the overall performance is rather similar. Regarding computation time, the simpler LDA and QDA tend to be faster than the other classification methods used, with the max-rule being the slowest one. Additional examples as well as links to movies showing the evolution of the posterior samples in the ABC algorithm can be found in Supplementary material 4.

As a quantitative analysis, we computed the relative error of the posterior means and standard deviations. The results, reported as part of Supplementary material 4, show that the errors in the posterior mean are within 5% after five iterations of the ABC algorithm for the examples with independent data points. For the time series, where the data points are not independent, a larger error of 15% occurs. The histograms and scatter plots show, however, that the corresponding ABC samples are still very reasonable.

## 5 Application on real data

We next used our approach to infer an intractable model of bacterial infections in child care centers.

### 5.1 Data and model

The observed data \(\mathbf {X}\) were the presence or absence of different strains of the bacterium *Streptococcus pneumoniae* among attendees of \(M=29\) child care centers in the metropolitan area of Oslo, Norway, at single points of time \(T_m\) (cross-sectional data). On average, \(N = 53\) children attended a center. Only a subset of size \(N_m\) of all attendees of each center was sampled. The data were collected and first described by Vestrheim et al. (2008).

In the following, we represent the colonization state of individual *i* in a child care center by the binary variable \(I_{is}^{t}, s=1,\ldots ,S\), where *S* is the total number of strains in circulation. If the attendee is infected with strain *s* of the bacterium at time *t*, \(I_{is}^t=1\), and otherwise, \(I_{is}^t=0\). The observed data \(\mathbf {X}\) thus consisted of a set of \(M=29\) binary matrices of size \(N_m \times S\) formed by the \(I_{is}^{T_m}\), \(i=1,\ldots ,N_m, s=1,\ldots ,S\).

The states \(I_{is}^{0}\) were initialized for all *i* and *s*, after which the states evolved in a stochastic manner according to the following transition probabilities:

\[ {\text {P}}(I_{is}^{t+h}=0 \mid I_{is}^{t}=1) = h + o(h), \tag{9} \]

\[ {\text {P}}(I_{is}^{t+h}=1 \mid I_{is'}^{t}=0 \;\; \forall s') = R_s^t h + o(h), \tag{10} \]

\[ {\text {P}}(I_{is}^{t+h}=1 \mid I_{is}^{t}=0, \; \exists s': I_{is'}^{t}=1) = \theta R_s^t h + o(h), \tag{11} \]

where *h* is a small time interval and *o*(*h*) a remainder term satisfying \(\lim _{h\rightarrow 0} o(h)/h = 0\). Equation (9) describes the probability to clear strain *s*, Eq. (10) the probability to be infected by it when previously not infected with any strain, and Eq. (11) the probability to be infected by it when previously infected with another strain \(s'\). The rate of infection with strain *s* at time *t* is denoted by \(R_s^t\), and \(\theta \in (0,1)\) is an unknown co-infection parameter. For \(\theta = 0\), the probability for a co-infection is zero. The rate \(R_s^t\) was modeled as

\[ R_s^t = \beta E_s^t + {\varLambda } P_s, \tag{12} \]

\[ E_s^t = \sum _{j\ne i} \frac{1}{N-1} \frac{I_{js}^t}{n_j^t}, \tag{13} \]

where *N* is the average number of children attending the child care center, and \({\varLambda }\) and \(\beta \) are two unknown rate parameters that scale the static probability \(P_s\) for an infection happening outside the child care center and the dynamic probability \(E_s^t\) for an infection from within, respectively. The probability \(P_s\) and the number of strains *S* were determined by an analysis of the overall distribution of the strains in the cross-sectional data (yielding \(S=33\); for \(P_s\), see Numminen et al. 2013). The expression for \(E_s^t\) in Eq. (13) was derived by assuming that contacts happen uniformly at random [the probability for a contact is \(1/(N-1)\)], and that the strains attendee *j* is carrying are all transmitted with equal probability (with \(n_j^t\) being the total number of strains carried by attendee *j*, the probability for a transmission of strain *s* is \(I_{js}^t/n_j^t\)).

The observation model was random sampling of \(N_m\) individuals without replacement from the average number *N* of individuals attending a child care center. A stationarity assumption was made so that the exact value of the sampling time \(T_m\) was not of importance as long as it is sufficiently large so that the system is in its stationary regime.
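A discretized version of such a simulator can be sketched as follows. This is not the simulator used for the results: it assumes a fixed time step `h`, a unit clearance rate, an initial state without any colonization, a rate of the form \(\beta E_s^t + {\varLambda } P_s\), and a uniform \(P_s\). The function name `simulate_center` and the parameter values in the usage lines are arbitrary.

```python
import numpy as np

def simulate_center(beta, lam, theta, P, N=53, T=10.0, h=0.01, seed=0):
    """Discretized simulation of one center; returns an N x S binary matrix."""
    rng = np.random.default_rng(seed)
    P = np.asarray(P, dtype=float)
    S = len(P)
    I = np.zeros((N, S), dtype=int)       # assumed start: nobody colonized
    clearance = 1.0                       # assumed unit clearance rate
    for _ in range(int(round(T / h))):
        n = I.sum(axis=1, keepdims=True)  # number of strains carried per attendee
        # share[j, s] = I[j, s] / n[j], zero for attendees carrying nothing
        share = np.divide(I, n, out=np.zeros((N, S)), where=n > 0)
        # E[i, s]: transmission from a uniformly chosen other attendee j != i
        E = (share.sum(axis=0, keepdims=True) - share) / (N - 1)
        R = beta * E + lam * P            # infection rate for attendee i, strain s
        rate_in = np.where(n > 0, theta * R, R)  # co-infection is scaled by theta
        u = rng.random((N, S))
        infect = (I == 0) & (u < rate_in * h)
        clear = (I == 1) & (u < clearance * h)
        I = np.where(infect, 1, np.where(clear, 0, I))
    return I

S = 33
states = simulate_center(beta=2.0, lam=0.5, theta=0.2, P=np.full(S, 1.0 / S))
# Observation model: sample N_m attendees without replacement (here N_m = 20).
rng = np.random.default_rng(1)
sample = states[rng.choice(states.shape[0], size=20, replace=False)]
print(sample.shape, states.mean())
```

The time step trades accuracy for speed: the transition probabilities hold only up to the remainder term, so `h` must be small relative to the inverse of the largest rate.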

The model has three parameters for which uniform priors were assumed: Parameter \(\beta \in (0,11)\) which is related to the probability to be infected by someone inside a child care center, parameter \({\varLambda }\in (0,2)\) for the probability of an infection from an outside source, and parameter \(\theta \in (0,1)\) which is related to the probability to be infected with multiple strains. With a slight abuse of notation, we will use \(\varvec{\theta }=(\beta ,{\varLambda },\theta )\) to denote the compound parameter vector.

### 5.2 Reference inference method

Since the likelihood function is intractable, the model was inferred with ABC in previous work (Numminen et al. 2013). The summary statistics were chosen based on epidemiological considerations and the distance function was adapted to the specific problem at hand. The summary statistics were based on:

1. The strain diversity in the child care centers,
2. The number of different strains circulating,
3. The proportion of individuals who are infected, and
4. The proportion of individuals who are infected with more than one strain.

### 5.3 Formulation as classification problem

For likelihood-free inference via standard classification, the observed matrix-valued data were transformed to feature vectors. We used simple features which reflect the matrix structure and the binary nature of the data.

For the matrix nature of the data, the rank of each matrix and the \(L_2\)-norm of the singular values (scaled by the size of the matrix) were used. For the binary nature of the data, we counted the fraction of ones in certain subsets of each matrix and used the average of the counts and their variability as features. The set of rows and the set of columns were used, as well as 100 randomly chosen subsets. Each random subset contained 10% of the elements of a matrix. Since the average of the counts is the same for the row and column subsets (it equals the fraction of all ones in a matrix), only one average was used.
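One possible reading of this feature construction is sketched below. The text leaves some details open, so the sketch makes assumptions: "variability" is taken to be the standard deviation, the norm of the singular values is divided by the number of matrix elements, and the same random index sets are reused for all matrices of a given shape. The function names `random_subsets` and `matrix_features` are hypothetical.

```python
import numpy as np

def random_subsets(shape, n_subsets=100, frac=0.1, seed=0):
    """Index sets that each cover a fraction `frac` of the matrix elements."""
    rng = np.random.default_rng(seed)
    size = shape[0] * shape[1]
    k = max(1, int(round(frac * size)))
    return [rng.choice(size, size=k, replace=False) for _ in range(n_subsets)]

def matrix_features(I, subsets):
    """Feature vector for one binary colonization matrix."""
    I = np.asarray(I, dtype=float)
    singular_values = np.linalg.svd(I, compute_uv=False)
    rows = I.mean(axis=1)   # fraction of ones in each row
    cols = I.mean(axis=0)   # fraction of ones in each column
    flat = I.ravel()
    rand = np.array([flat[idx].mean() for idx in subsets])
    return np.array([
        np.linalg.matrix_rank(I),                  # rank of the matrix
        np.linalg.norm(singular_values) / I.size,  # scaled L2-norm of singular values
        I.mean(),                                  # shared average of row/column counts
        rows.std(),                                # variability over rows
        cols.std(),                                # variability over columns
        rand.mean(),                               # average over random subsets
        rand.std(),                                # variability over random subsets
    ])

rng = np.random.default_rng(0)
I = (rng.random((20, 33)) < 0.1).astype(int)
print(matrix_features(I, random_subsets(I.shape)))
```

Each child care center then contributes one such feature vector, for the observed and for the simulated data alike.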

### 5.4 Inference results

In ABC, the applicability of a discrepancy measure can be assessed by first performing inference on synthetic data of the same size and structure as the observed data but simulated from the model with known parameter values. Since ABC algorithms are rather time-consuming, we first tested the applicability of \(J_n\) in the framework of point estimation. We computed \(J_n(\varvec{\theta })\) varying only two of the three parameters at a time, keeping the third parameter fixed at the value which was used to generate the data. To eliminate random effects, we used for all \(\varvec{\theta }\) the same random number generator seed when simulating the \(\mathbf {Y}_{\varvec{\theta }}\). The seeds for \(\mathbf {X}\) and the \(\mathbf {Y}_{\varvec{\theta }}\) were different.
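The effect of reusing the seed can be illustrated on a toy simulator. The sketch uses a hypothetical Gaussian location model rather than the epidemic model; the function name `simulate` and the seed values are arbitrary.

```python
import numpy as np

def simulate(theta, n, seed):
    """Toy simulator: n draws from a Gaussian with mean theta and unit variance."""
    rng = np.random.default_rng(seed)
    return theta + rng.standard_normal(n)

x_obs = simulate(0.0, 5, seed=7)    # observed data use their own seed
y_a = simulate(0.0, 5, seed=123)    # all simulated data share one seed
y_b = simulate(0.1, 5, seed=123)
# With a shared seed, the simulated data differ only through theta.
print(np.allclose(y_b - y_a, 0.1))
```

With a shared seed, differences in the discrepancy between two parameter values reflect the parameters and not the simulation noise, which makes the discrepancy surface smoother in \(\varvec{\theta }\).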

Figure 6 shows the results for classification with randomly chosen subsets (top row) and without (bottom row). The diagrams in the top and bottom rows are very similar: both have well-defined regions in the parameter space for which \(J_n\) is close to one half, which corresponds to chance-level discriminability. But the features from the random subsets were helpful to discriminate between \(\mathbf {X}\) and \(\mathbf {Y}_{\varvec{\theta }}\) and produced more localized regions with small \(J_n\). The results suggest that LDA, arguably the simplest classification method, is suitable to infer the epidemic model.

We next applied classifier ABC on the synthetic data, using a sequential Monte Carlo ABC algorithm with four generations as previously done by Numminen et al. (2013).

The results on real data are shown in Fig. 8. It can be seen that the posterior distributions obtained with classifier ABC are generally similar to the expert solution. The posterior mode of \(\beta \) for classifier ABC with random subsets is slightly smaller than for the other methods. The shift could be due to stochastic variation because we only worked with 1000 ABC samples. It could, however, also be that the random features picked up some properties of the real data which the other methods are not sensitive to.

The computation time of classifier ABC with LDA was about the same as for the method by Numminen et al. (2013): On average, the total time for the data generation and the discrepancy measurement was 28.49 ± 3.45 s for LDA while it was 28.41 ± 3.45 s for the expert method; with 28.4 ± 3.45 s, most of the time was spent on generating data from the epidemic model. Altogether, classifier ABC thus yielded inference results which are equivalent to the expert solution, from both a statistical and computational point of view.

### 5.5 Compensating for missing expert statistics

So far we did not use expert knowledge about the inference problem when solving it with classifier ABC. Using discriminability in a classification task as a discrepancy measure is a data-driven approach to assess the similarity between simulated and observed data. But it is not necessarily a black-box approach. Knowledge about the problem at hand can be incorporated when specifying the classification problem. Furthermore, the approach is compatible with summary statistics derived from expert knowledge: Classifier ABC, and more generally the discrepancy measure \(J_n\), is able to incorporate the expert statistics by letting them be features (covariates) in the classification. The combined use of expert statistics and classifier ABC enables one to filter out properties of the model which are either not of interest or known to be wrong. Moreover, it makes the inference more robust, for example to possible misspecifications or insufficiencies of the summary statistics, as we illustrate next.

We selected two simple expert statistics used by Numminen et al. (2013), namely the number of different strains circulating and the proportion of infected individuals, and inferred the posteriors with this reduced set of summary statistics, using the method by Numminen et al. (2013) as before. Figure 9 shows that, as a consequence, the posterior distributions of \({\varLambda }\) and \(\theta \) deteriorated. The two expert statistics alone were insufficient to perform ABC. Combining the insufficient set of summary statistics with classifier ABC, however, led to a recovery of the posteriors. The results are for classifier ABC with random subsets, but the same holds for classifier ABC without random subsets (Supplementary material 5).

## 6 Discussion

Generative models are useful and widely applicable for dealing with uncertainty and for making inferences from data. The intractability of the likelihood function is, however, often a serious problem in the inference for realistic models. While likelihood-free methods provide a powerful framework for performing inference, a limiting difficulty is the required discrepancy measurement between simulated and observed data.

We found that classification can be used to measure the discrepancy. This finding has practical value because it reduces the difficult problem of choosing an appropriate discrepancy measure to a more standard problem where we can leverage a wealth of existing solutions; whenever we can classify, we can do likelihood-free inference. It also offers theoretical value because it reveals that classification can yield consistent likelihood-free inference, and that the two fields of research, which appear very much separated at first glance, are actually tightly connected.

### 6.1 Summary statistics versus features

In the proposed approach, instead of choosing summary statistics and a distance function between them as in the standard approach, we need to choose a classification method and the features. The reader may thus wonder whether we replaced one possibly arbitrary choice with another. The important point is that by choosing a classification method, we only decide about a function space, and not the classification rule itself. The classification rule that is finally used to measure the discrepancy is learned from data and is not specified by the user, which is in stark contrast to the traditional approach based on fixed summary statistics. Moreover, the function space can be chosen using cross-validation, as implemented with our max-rule, which reduces the arbitrariness even more. In Fig. 2, for example, the max-rule successfully chose to use other classification methods than LDA for the inference of the moving average model. The influence of the choice of features is also rather mild, because they only affect the discrepancy measurement via the learned classification rule. This property of the proposed approach allowed us to even use random features in the inference of the epidemic model.

The possibility to use random features, however, does not mean that we should not use reliable expert knowledge when available. Indeed, summary statistics derived from expert knowledge can be included by letting them be features (covariates) in the classification.
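The idea of choosing the function space by cross-validation can be sketched as follows. The details of the max-rule are given in Supplementary material 1.1 and are not reproduced here; the sketch simply assumes that the rule takes the largest cross-validated accuracy among the candidates. The two candidates (a nearest-class-mean rule on raw pairs and on quadratic expansions of the pairs), the assumed moving average model \(x_t = e_t + \theta e_{t-1}\), and all function names are illustrative.

```python
import numpy as np

def simulate_ma1(theta, n, seed):
    """Zero-mean MA(1) series x_t = e_t + theta * e_{t-1} (assumed model form)."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(n + 1)
    return e[1:] + theta * e[:-1]

def linear_features(x):
    return np.column_stack([x[:-1], x[1:]])

def quadratic_features(x):
    a, b = x[:-1], x[1:]
    return np.column_stack([a * a, b * b, a * b])

def cv_accuracy(F0, F1, K=5, seed=0):
    """K-fold cross-validated accuracy of a nearest-class-mean rule."""
    rng = np.random.default_rng(seed)
    F = np.vstack([F0, F1])
    labels = np.concatenate([np.zeros(len(F0), dtype=int), np.ones(len(F1), dtype=int)])
    folds = np.array_split(rng.permutation(len(F)), K)
    accuracies = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        m0 = F[train][labels[train] == 0].mean(axis=0)
        m1 = F[train][labels[train] == 1].mean(axis=0)
        d0 = np.sum((F[val] - m0) ** 2, axis=1)
        d1 = np.sum((F[val] - m1) ** 2, axis=1)
        accuracies.append(np.mean((d1 < d0).astype(int) == labels[val]))
    return float(np.mean(accuracies))

def max_rule(x, y, candidates):
    """Return the best candidate name and all cross-validated accuracies."""
    scores = {name: cv_accuracy(f(x), f(y)) for name, f in candidates.items()}
    return max(scores, key=scores.get), scores

x = simulate_ma1(-0.8, 5000, seed=1)
y = simulate_ma1(0.8, 5000, seed=2)
best, scores = max_rule(x, y, {"linear": linear_features, "quadratic": quadratic_features})
print(best, scores)
```

In this example the two series have the same mean and variance and differ only in the sign of their correlation, so the linear candidate stays near chance level while the quadratic candidate does not.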

### 6.2 Related work

In previous work, regression with the parameters \(\varvec{\theta }\) as response variables was used to generate summary statistics from a larger pool of candidates (Aeschbacher et al. 2012; Fearnhead and Prangle 2012; Wegmann et al. 2009). The shared characteristic of these works and our approach is the learning of transformations of the summary statistics and the features, respectively. The criteria which drive the learning are, however, rather different.

Since the candidate statistics are a function of the simulated data \(\mathbf {Y}_{\varvec{\theta }}\), we may consider the regression to provide an approximate inversion of the data generation process \(\varvec{\theta }\mapsto \mathbf {Y}_{\varvec{\theta }}\). In this interpretation, the (Euclidean) distance of the summary statistics is an approximation of the (Euclidean) distance of the parameters. The optimal inversion of the data-generating process in a mean squared error sense is the conditional expectation \({{\mathrm{E}}}(\varvec{\theta }| \mathbf {Y}_{\varvec{\theta }})\). Fearnhead and Prangle (2012) showed that this conditional expectation is also the optimal summary statistic for \(\mathbf {Y}_{\varvec{\theta }}\) if the goal is to infer \(\varvec{\theta }^{\circ }\) as accurately as possible under a quadratic loss. Transformations based on regression are thus strongly linked to the computation of the distance between the parameters. The reason we learn transformations, on the other hand, is that we would like to approximate \(J_n^{*}(\varvec{\theta })\) well, which is linked to the computation of the total variation distance between the distributions indexed by the parameters.

Classification was recently used in other work on ABC, but in a different manner. Intractable density ratios in Markov chain Monte Carlo algorithms were estimated using tools from classification (Pham et al. 2014), in particular random forests, and Pudlo et al. (2016) used random forests for model selection by learning to predict the model class from the simulated data instead of computing their posterior probabilities. This is different from using classification to define a discrepancy measure between simulated and observed data, as done here.

A particular classification method, (nonlinear) logistic regression, was used for the estimation of unnormalized models (Gutmann and Hyvärinen 2012), which are models where the probability density functions are known up to the normalizing partition function only (see Gutmann and Hyvärinen (2013a) for a review paper, and Barthelmé and Chopin (2015), Gutmann et al. (2011) and Pihlaja et al. (2010) for generalizations). Likelihood-based inference is intractable for unnormalized models, but unlike in the generative models considered here, the shape of the model-pdf is known which can be exploited in the inference.

At about the same time as we first presented our work (Gutmann et al. 2014a, b), Goodfellow et al. (2014) proposed to use nonlinear logistic regression to train a neural network such that it transforms “noise” samples into samples approximately following the same distribution as some given data set. The main difference from our work is that the method of Goodfellow et al. (2014) is a method for producing random samples while ours is a method for statistical inference.

### 6.3 Sequential inference and prediction

We did not make any specific assumptions about the model or the structure of the observed data \(\mathbf {X}\). An interesting special case occurs when \(\mathbf {X}\) are an element \(\mathbf {X}^{(t_0)}\) of a sequence of data sets \(\mathbf {X}^{(t)}\) which are observed one after the other, and the generative model is specified accordingly to generate a sequence of simulated data sets.

For inference at \(t_0\), we can distinguish between simulated data which were generated either before or after \(\mathbf {X}^{(t_0)}\) are observed: In the former case, the simulated data are predictions about \(\mathbf {X}^{(t_0)}\), and after observation of \(\mathbf {X}^{(t_0)}\), likelihood-free inference about \(\varvec{\theta }\) corresponds to assessing the accuracy of the predictions. That is, the discrepancy measurement converts the predictions of \(\mathbf {X}^{(t_0)}\) into inferences of the causes of \(\mathbf {X}^{(t_0)}\). In the latter case, each simulated data set can immediately be compared to \(\mathbf {X}^{(t_0)}\) which enables efficient iterative identification of parameter values with low discrepancy (Gutmann and Corander 2016). That is, the possible causes of \(\mathbf {X}^{(t_0)}\) can be explained more accurately with the benefit of hindsight.

### 6.4 Relation to perception and artificial intelligence

Probabilistic modeling and inference play key roles in image understanding (Gutmann and Hyvärinen 2013b), robotics (Thrun et al. 2006), and artificial intelligence (Ghahramani 2015). Perception has been modeled as (Bayesian) inference based on a “mental” generative model of the world (e.g., Vincent 2015). In most of the literature, variational approximate inference has been used for intractable generative models, giving rise to the Helmholtz machine (Dayan et al. 1995) and to the free-energy in neuroscience (Friston 2010). But other approximate inference methods can be considered as well.

The discussion about sequential inference and prediction points to similarities between perception and likelihood-free inference or approximate Bayesian computation. It is intuitively sensible that perception would involve prediction of new sensory input given the past, as well as an assessment of the predictions and a refinement of their explanations after arrival of the data. The quality of the inference depends on the quality of the generative model and the quality of the discrepancy assessment. That is, the inference results may only be useful if the generative model of the world is rich enough to produce data resembling the observed data, and if the discrepancy measure can reliably distinguish between the “mentally” generated and the actually observed data.

We proposed to measure the discrepancy via classification, being agnostic about the particular classifier used. It is an open question how to generally best measure the classification accuracy when the data are arriving sequentially. Classifiers are, however, rather naturally part of perceptual systems. Rapid object recognition, for instance, can be achieved via feedforward multilayer classifiers (Serre et al. 2007), and there are several techniques to learn representations which facilitate classification (Bengio et al. 2013). It is thus conceivable that a given classification machinery is used for several purposes, for example to quickly recognize certain objects but also to assess the discrepancy between simulated and observed data.

## 7 Conclusions and future work

In this paper, we proposed to measure the discrepancy in likelihood-free inference via classification. We focused on the principle and not on a particular classification method. Some methods may be particularly suited for certain models, where it may be possible to measure the discrepancy via the loss function that is used to learn the classification rule instead of the classification accuracy.
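
The principle can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses a nearest-centroid classifier and K-fold cross-validation, both chosen here for brevity rather than taken from the paper, and the function name `classification_discrepancy` and the toy Gaussian data are hypothetical. Any other classifier could be substituted.

```python
import numpy as np

def classification_discrepancy(x, y, n_folds=5, seed=0):
    """Cross-validated accuracy of a nearest-centroid classifier that
    separates observed data x (label 0) from simulated data y (label 1).
    Both inputs are arrays of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    data = np.vstack([x, y])
    labels = np.concatenate([np.zeros(len(x)), np.ones(len(y))])
    folds = np.array_split(rng.permutation(len(data)), n_folds)
    correct = 0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Class centroids estimated on the training folds only.
        c0 = data[train][labels[train] == 0].mean(axis=0)
        c1 = data[train][labels[train] == 1].mean(axis=0)
        d0 = np.linalg.norm(data[test] - c0, axis=1)
        d1 = np.linalg.norm(data[test] - c1, axis=1)
        pred = (d1 < d0).astype(float)
        correct += int(np.sum(pred == labels[test]))
    # Near 0.5: the data sets are hard to tell apart; near 1: easy.
    return correct / len(data)

# Toy data (assumption): observed data and two simulated data sets.
rng = np.random.default_rng(1)
x_obs = rng.normal(1.0, 1.0, size=(200, 1))
y_near = rng.normal(1.0, 1.0, size=(200, 1))   # same distribution as x_obs
y_far = rng.normal(4.0, 1.0, size=(200, 1))    # clearly different mean

j_near = classification_discrepancy(x_obs, y_near)
j_far = classification_discrepancy(x_obs, y_far)
```

In this toy setting, `j_near` should stay close to chance level while `j_far` should be close to one, so parameter values that generate data resembling the observations receive a low discrepancy.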

When working with the classification accuracy, we only use a single bit of information per data point. While this is little information, we showed that the approach yielded accurate posterior inferences and that it defines a consistent estimator. The Bayesian inference results were empirical, and it is likely that a more rigorous theoretical analysis will reveal that the single bit of information puts a limit on the possible closeness to the true posterior. While our empirical results suggest that other error sources may be more dominant in practice, the bottleneck can be avoided by using the current setup to identify the relevant summary statistics, or some transformation of them, and by computing the discrepancy by their Euclidean distance as in classical ABC. While this is a possible approach, in recent work, we chose another path by training the classifier on two simulated data sets whose size can be made as large as computationally possible (Dutta et al. 2016).

We here worked with a single simulated data set per parameter value. If multiple simulated data sets are available, they may be used to define an approximate likelihood function by, for example, averaging their corresponding discrepancies (see, e.g., Gutmann and Corander 2016, Section 3.3). The approximate likelihood function can then be maximized with respect to the parameters or used in place of the actual likelihood function in standard methods for posterior sampling.
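
A minimal sketch of the averaging step is given below. It only averages discrepancies over repeated simulations and minimizes the average over a parameter grid; it does not construct the approximate likelihood function of the cited reference. The toy Gaussian simulator, the hold-out nearest-centroid classifier, the grid, and all function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n=100):
    # Hypothetical toy simulator (assumption): n draws from N(theta, 1).
    return rng.normal(theta, 1.0, size=(n, 1))

def accuracy(x, y):
    # Hold-out accuracy of a nearest-centroid classifier (assumed classifier):
    # centroids come from the first half of each set, testing uses the rest.
    hx, hy = len(x) // 2, len(y) // 2
    cx, cy = x[:hx].mean(axis=0), y[:hy].mean(axis=0)

    def predict_y(z):
        # True where a point is closer to the simulated-data centroid.
        return np.linalg.norm(z - cy, axis=1) < np.linalg.norm(z - cx, axis=1)

    hits = np.sum(~predict_y(x[hx:])) + np.sum(predict_y(y[hy:]))
    return hits / ((len(x) - hx) + (len(y) - hy))

def averaged_discrepancy(theta, x_obs, n_rep=20):
    # Average the discrepancy over n_rep independent simulated data sets.
    return float(np.mean([accuracy(x_obs, simulate(theta)) for _ in range(n_rep)]))

x_obs = simulate(1.5)                      # stand-in for observed data
grid = np.linspace(-2.0, 5.0, 29)          # candidate parameter values
avg = np.array([averaged_discrepancy(t, x_obs) for t in grid])
theta_hat = float(grid[np.argmin(avg)])    # minimizer of the averaged discrepancy
```

Averaging reduces the simulation noise in the discrepancy at each parameter value, at the cost of `n_rep` simulator runs per value.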

Further exploration of the connection between classification and likelihood-free inference is likely to lead to practical improvements in general: Each parameter \(\varvec{\theta }\), for instance, induces a classification problem. We here treated the classification problems separately, but they are actually related. First, the observed data \(\mathbf {X}\) occur in all the classification problems. Second, the simulated data sets \(\mathbf {Y}_{\varvec{\theta }}\) are likely to share some properties if the parameters are not too different. Taking advantage of the relation between the different classification problems may lead to both computational and statistical gains. In the classification literature, leveraging the solution of one problem to solve another one is generally known as transfer learning (Pan and Yang 2010). In the same spirit, leveraging transfer learning, or other methods from classification, seems promising to further advance likelihood-free inference.

## Notes

### Acknowledgements

The work was partially done when MUG and RD were at the Department of Mathematics and Statistics, University of Helsinki, and the Department of Computer Science, Aalto University, respectively. The work was supported by ERC Grant No. 239784 and the Academy of Finland (Finnish Centre of Excellence in Computational Inference Research COIN). RD is presently funded by Swiss National Science Foundation Grant No. 105218_163196. We thank Elina Numminen for providing computer code for the epidemic model.

## References

- Aeschbacher, S., Beaumont, M., Futschik, A.: A novel approach for choosing summary statistics in approximate Bayesian computation. Genetics **192**(3), 1027–1047 (2012)
- Barthelmé, S., Chopin, N.: The Poisson transform for unnormalised statistical models. Stat. Comput. **25**(4), 767–780 (2015)
- Beaumont, M., Zhang, W., Balding, D.: Approximate Bayesian computation in population genetics. Genetics **162**(4), 2025–2035 (2002)
- Beaumont, M.A.: Approximate Bayesian computation in evolution and ecology. Ann. Rev. Ecol. Evol. Syst. **41**(1), 379–406 (2010)
- Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. **35**(8), 1798–1828 (2013)
- Cameron, E., Pettitt, A.N.: Approximate Bayesian computation for astronomical model analysis: a case study in galaxy demographics and morphological transformation at high redshift. Mon. Not. R. Astron. Soc. **425**(1), 44–65 (2012)
- Dayan, P., Hinton, G., Neal, R., Zemel, R.: The Helmholtz machine. Neural Comput. **7**(5), 889–904 (1995)
- Diggle, P., Gratton, R.: Monte Carlo methods of inference for implicit statistical models. J. R. Stat. Soc. Ser. B (Methodol.) **46**(2), 193–227 (1984)
- Drovandi, C., Pettitt, A., Faddy, M.: Approximate Bayesian computation using indirect inference. J. R. Stat. Soc. Ser. C (Appl. Stat.) **60**(3), 317–337 (2011)
- Dutta, R., Corander, J., Kaski, S., Gutmann, M.: Likelihood-free inference by penalised logistic regression. (2016) arXiv:1611.10242
- Fearnhead, P., Prangle, D.: Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. J. R. Stat. Soc. Ser. B (Stat. Methodol.) **74**(3), 419–474 (2012)
- Friston, K.: The free-energy principle: a unified brain theory? Nat. Rev. Neurosci. **11**(2), 127–138 (2010)
- Ghahramani, Z.: Probabilistic machine learning and artificial intelligence. Nature **521**(7553), 452–459 (2015)
- Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems (NIPS), vol. 27, pp. 2672–2680. Curran Associates, Inc. (2014). http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
- Gouriéroux, C., Monfort, A., Renault, E.: Indirect inference. J. Appl. Econom. **8**(S1), S85–S118 (1993)
- Gutmann, M., Corander, J.: Bayesian optimization for likelihood-free inference of simulator-based statistical models. J. Mach. Learn. Res. **17**(125), 1–47 (2016)
- Gutmann, M., Hirayama, J.: Bregman divergence as general framework to estimate unnormalized statistical models. In: Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI) (2011)
- Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res. **13**, 307–361 (2012)
- Gutmann, M., Hyvärinen, A.: Estimation of unnormalized statistical models without numerical integration. In: Proceedings of the Sixth Workshop on Information Theoretic Methods in Science and Engineering (WITMSE) (2013a)
- Gutmann, M., Hyvärinen, A.: A three-layer model of natural image statistics. J. Physiol. Paris **107**(5), 369–398 (2013b)
- Gutmann, M., Dutta, R., Kaski, S., Corander, J.: Classifier ABC. In: Fifth IMS–ISBA Joint Meeting (posters) (2014a)
- Gutmann, M., Dutta, R., Kaski, S., Corander, J.: Likelihood-free inference via classification. (2014b) arXiv:1407.4981
- Hartig, F., Calabrese, J., Reineking, B., Wiegand, T., Huth, A.: Statistical inference for stochastic simulation models—theory and application. Ecol. Lett. **14**(8), 816–827 (2011)
- Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer, New York (2009)
- Lintusaari, J., Gutmann, M., Dutta, R., Kaski, S., Corander, J.: Fundamentals and recent developments in approximate Bayesian computation. Syst. Biol. **66**(1), e66–e82 (2017)
- Mansinghka, V., Kulkarni, T.D., Perov, Y.N., Tenenbaum, J.: Approximate Bayesian image interpretation using generative probabilistic graphics programs. In: Advances in Neural Information Processing Systems (NIPS), vol. 26 (2013)
- Marin, J.M., Pudlo, P., Robert, C., Ryder, R.: Approximate Bayesian computational methods. Stat. Comput. **22**(6), 1167–1180 (2012)
- Marjoram, P., Molitor, J., Plagnol, V., Tavaré, S.: Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. **100**(26), 15324–15328 (2003)
- McFadden, D.: A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica **57**(5), 995–1026 (1989)
- Numminen, E., Cheng, L., Gyllenberg, M., Corander, J.: Estimating the transmission dynamics of *Streptococcus pneumoniae* from strain prevalence data. Biometrics **69**(3), 748–757 (2013)
- Pakes, A., Pollard, D.: Simulation and the asymptotics of optimization estimators. Econometrica **57**(5), 1027–1057 (1989)
- Pan, S., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. **22**(10), 1345–1359 (2010)
- Pham, K., Nott, D., Chaudhuri, S.: A note on approximating ABC-MCMC using flexible classifiers. STAT **3**(1), 218–227 (2014)
- Pihlaja, M., Gutmann, M., Hyvärinen, A.: A family of computationally efficient and simple estimators for unnormalized statistical models. In: Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI) (2010)
- Pollard, D.: A User’s Guide to Measure Theoretic Probability. Cambridge University Press, Cambridge (2001)
- Pritchard, J., Seielstad, M., Perez-Lezaun, A., Feldman, M.: Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol. **16**(12), 1791–1798 (1999)
- Pudlo, P., Marin, J.M., Estoup, A., Cornuet, J.M., Gautier, M., Robert, C.: Reliable ABC model choice via random forests. Bioinformatics **32**(6), 859–866 (2016)
- Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., Poggio, T.: Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell. **29**(3), 411–426 (2007)
- Smith, A.: The New Palgrave Dictionary of Economics, 2nd edn. Palgrave Macmillan, London (2008). Chap. Indirect Inference
- Tavaré, S., Balding, D., Griffiths, R., Donnelly, P.: Inferring coalescence times from DNA sequence data. Genetics **145**(2), 505–518 (1997)
- Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics. MIT Press, Cambridge (2006)
- van der Vaart, A.: Asymptotic Statistics. Cambridge University Press, Cambridge (1998)
- van der Vaart, A., Wellner, J.: Weak Convergence and Empirical Processes. Springer, New York (1996)
- Vestrheim, D.F., Høiby, E.A., Aaberge, I.S., Caugant, D.A.: Phenotypic and genotypic characterization of *Streptococcus pneumoniae* strains colonizing children attending day-care centers in Norway. J. Clin. Microbiol. **46**(8), 2508–2518 (2008)
- Vincent, B.T.: A tutorial on Bayesian models of perception. J. Math. Psychol. **66**, 103–114 (2015)
- Wasserman, L.: All of Statistics. Springer, New York (2004)
- Wegmann, D., Leuenberger, C., Excoffier, L.: Efficient approximate Bayesian computation coupled with Markov chain Monte Carlo without likelihood. Genetics **182**(4), 1207–1218 (2009)
- Zhang, T.: Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat. **32**(1), 56–85 (2004)
- Zhu, L., Chen, Y., Yuille, A.: Unsupervised learning of probabilistic grammar-Markov models for object categories. IEEE Trans. Pattern Anal. Mach. Intell. **31**(1), 114–128 (2009)

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.