A Frequency Matching Method: Solving Inverse Problems by Use of Geologically Realistic Prior Information

The frequency matching method defines a closed form expression for a complex prior that quantifies the higher order statistics of a proposed solution model to an inverse problem. While existing solution methods to inverse problems are capable of sampling the solution space while taking into account arbitrarily complex a priori information defined by sample algorithms, it is not possible to directly compute the maximum a posteriori model, as the prior probability of a solution model cannot be expressed. We demonstrate how the frequency matching method enables us to compute the maximum a posteriori solution model to an inverse problem by using a priori information based on multiple point statistics learned from training images. We demonstrate the applicability of the suggested method on a synthetic tomographic crosshole inverse problem.


Introduction
Inverse problems arising in the field of geoscience are typically ill-posed; the available data are scarce and the solution to the inverse problem is therefore not welldetermined.In probabilistic inverse problem theory the solution to a problem is given as an a posteriori probability density function that combines states of information provided by observed data and the a priori information (Tarantola 2005).The ambiguities of the solution of the inverse problem due to the lack of restrictions on the solution is then reflected in the a posteriori probability.
A priori information used in probabilistic inverse problem theory is often covariance-based a priori models.In these models the spatial correlation between the model parameters is defined by two-point statistics.In reality, two-point-based a priori models are too limited to capture curvilinear features such as channels or cross beddings.It is therefore often insufficient to rely only on the two-point statistics, and thus higher order statistics must also be taken into account in order to correctly produce geologically realistic descriptions of the subsurface.It is assumed that geological information is available in the form of a training image.This image could for instance have been artificially created to describe the expectations for the solution model or it could be information from a previous solution to a comparable inverse problem.The computed models should not be identical to the training image, but rather express a compromise between honoring observed data and comply with the information extracted from the training image.The latter can be achieved by ensuring that the models have the same multiple point statistics as the training image.Guardiano and Srivastava (1993) proposed a sequential simulation algorithm that was capable of simulating spatial features inferred from a training image.Their approach was computationally infeasible until Strebelle (2002) developed the single normal equation simulation (snesim) algorithm.Multiple point statistics in general and the snesim algorithm in particular have been widely used for creating models based on training images and for solving inverse problems, see for instance Caers and Zhang (2004), Arpat (2005), Hansen et al. (2008), Peredo and Ortiz (2010), Suzuki and Caers (2008), Jafarpour and Khodabakhshi (2011).A method called the probability perturbation method (PPM) has been proposed by Caers and Hoffman (2006).It allows for gradual deformation of one realization of snesim to another realization of snesim.Caers and Hoffman propose to use the PPM method to find a solution to an inverse problem that is consistent with both a complex prior model, as defined by a training image, and data observations.PPM is used iteratively to perturb a realization from snesim while reducing the data misfit.However, as demonstrated by Hansen et al. (2012), as a result of the probability of the prior model not being evaluated, the model found using PPM is not the maximizer of the posterior density function, but simply the realization of the multiple point based prior with the highest likelihood value.There is no control of how reasonable the computed model is with respect to the prior model.It may be highly unrealistic.
The sequential Gibbs sampling method by Hansen et al. (2012) is used to sample the a posteriori probability density function given, for example a training image based prior.However, as with the PPM it cannot be used for optimization and locating the maximum a posteriori (MAP) model, as the prior probability is not quantified.The focus of our research is the development of the frequency matching (FM) method.The core of this method is the characterization of images by their multiple point statistics.An image is represented by the histogram of the multiple point-based spatial event in the image; this histogram is denoted the frequency distribution of the image.The most significant aspect of this method, compared to existing methods based on multiple point statistics for solving inverse problems, is the fact that it explicitly formulates an a priori probability density distribution, which enables it to efficiently quantify the probability of a realization from the a priori probability.
The classical approach when solving inverse problems by the least squares methods assumes a Gaussian prior distribution with a certain expectation.Solution models to the inverse problem are penalized depending on their deviation from the expected model.In the FM method, the frequency distribution of the training image acts as the expected model and a solution image is penalized depending on how much its frequency distribution deviates from that of the training image.To perform this comparison we introduce a dissimilarity measure between a training image and a model image as the χ 2 distance between their frequency distributions.Using this dissimilarity measure for quantifying the a priori probability of a model the FM method allows us to directly compute the MAP model, which is not possible using known techniques such as the PPM and sequential Gibbs sampling methods.
Another class of methods are the Markov random fields (MRF) methods (Tjelmeland and Besag 1998).The prior probability density given by Markov methods involves a product of a large number of marginals.A disadvantage is therefore, despite having an expression for the normalization constant, that it can be computationally expensive to compute.Subclasses of the MRF methods such as Markov mesh models (Stien and Kolbjørnsen 2011) and partially ordered Markov models (Cressie and Davidson 1998) avoid the computation of the normalization constant, and this advantage over the MRF methods is shared by the FM method.Moreover, in contrast to methods such as PMM and MRF, the FM method is fully non-parametric, as it does not require probability distributions to be written in a closed form.
This paper is ordered as follows.In Sect. 2 we define how we characterize images by their frequency distributions, we introduce our choice of a priori distribution of the inverse problem and we elaborate on how it can be incorporated into traditional inverse problem theory.Our implementation of the FM method is discussed in Sect.3. In Sect. 4 we present our test case and the results when solving an inverse problem using frequency matching-based a priori information.Section 5 summarizes our findings and conclusions.

Method
In geosciences, inverse problems involve a set of measurements or observations d obs used to determine the spatial distribution of physical properties of the subsurface.These properties are typically described by a model with a discrete set of parameters, m.For simplicity, we will assume that the physical property is modeled using a regular grid in space.The model parameters are said to form an image of the physical property.
Consider the general forward problem, of computing the observations d given the perhaps non-linear forward operator g and the model parameters m.The values of the observation parameters are computed straightforwardly by applying the forward operator to the model parameters.The associated inverse problem consists of computing the model parameters m given the forward operator g and a set of observations d obs .As the inverse problem is usually severely under-determined, the model m that satisfies d obs = g(m) is not uniquely determined.Furthermore, some of the models satisfying d obs = g(m) within the required level of accuracy will be uninteresting for a geoscientist as the nature of the forward operator g and the measurement noise in d obs may yield a physically unrealistic description of the property.The inverse problem therefore consists of not just computing a set of model parameters satisfying Eq. 1, but computing a set of model parameters that gives a realistic description of the physical property while honoring the observed data.The FM method is used to express how geologically reasonable a model is by quantifying its a priori probability using multiple point statistics.Letting the a priori information be available in, for instance, a training image, the FM method solves an inverse problem by computing a model that satisfies not only the relation from Eq. 1 but a model that is also similar to the training image.The latter ensures that the model will be geologically reasonable.

2.1
The Maximum A Posteriori Model Tarantola and Valette (1982) derived a probabilistic approach to solve inverse problems where the solution to the inverse problem is given by a probability density function, denoted the a posteriori distribution.This approach makes use of a prior distribution and a likelihood function to assign probabilities to all possible models.The a priori probability density function ρ describes the data independent prior knowledge of the model parameters; in the FM method we choose to define it as follows where α acts as a weighting parameter and f is a dissimilarity function presented in Sect.2.4.Traditionally, f measures the distance between the model and an a priori model.The idea behind the FM method is the same, except we wish not to compare models directly but to compare the multiple point statistics of models.We therefore choose a traditional prior but replace the distance function such that instead of measuring the distance between models directly, we measure the dissimilarity between them.The dissimilarity is expressed as a distance between their multiple point statistics.
The likelihood function L is a probabilistic measure of how well data associated with a certain model matches the observed data, accounting for the uncertainties of the observed data, Here, C d is the data covariance matrix and the measurement errors are assumed to be independent and Gaussian distributed with mean values 0. The a posteriori distribution is then proportional to the product of the prior distribution and the likelihood The set of model parameters that maximizes the a posteriori probability density is called the maximum a posteriori (MAP) model The dissimilarity function f is a measure of how well the model satisfies the a priori knowledge that is available, for example from a training image.The more similar, in some sense, the image from a set of model parameters m is to the training image the smaller the function value f (m) is.Equivalently to the more traditional term m − m prior 2 C m , stemming from a Gaussian a priori distribution of the model parameters with mean values m prior and covariance matrix C m , f (m) can be thought of as a distance.It is not a distance between m and the training image (f (m) may be zero for other images than the training image), but a distance between the multiple point statistics of the image formed by the model parameters and the multiple point statistics of the training image.

The Multiple Point Statistics of an Image
Consider an image Z = {1, 2, . . ., N} with N voxels (or pixels if the image is only two dimensional) where the voxels can have the m different values 0, 1, . . ., m − 1.We introduce the N variables, z 1 , z 2 , . . ., z N and let z k describe the value of the kth voxel of the image.It is assumed that the image is a realization of an unknown, random process satisfying: 1.The value of the kth voxel, z k , is, given the values of voxels in a certain neighborhood N k around voxel k, independent of voxel values not in the neighborhood.Voxel k itself is not contained in N k .Let z k be a vector of the values of the ordered neighboring voxels in N k ; we then have where f Z denotes the conditional probability distribution of the voxel z k given the values of the voxels within the neighborhood.2. For an image of infinite size the geometrical shape of all neighborhoods N k are identical.This implies that if voxel k has coordinates (k x , k y , k z ), and voxel l has coordinates (l x , l y , l z ), then 3. If we assume ergodicity, that is, when two voxels, voxel k and voxel l, have the same values as their neighboring voxels, then the conditional probability distribution of voxel k and voxel l are identical Knowing the conditionals f Z (z k |z k ) we know the multiple point statistics of the image, just as a variogram would describe the two-point statistics of an image.The basis of sequential simulation as proposed by Guardiano and Srivastava (1993) is to exploit the aforementioned assumptions to estimate the conditional probabilities f Z (z k |z k ) based on the marginals obtained from the training image, and then to use the conditional distributions to generate new realizations of the unknown random process from which the training image is a realization.The FM method, on the other hand, operates by characterizing images by their frequency distributions.As described in the following section, the frequency distribution of voxel values within the given neighborhood of an image is given by its marginal distributions.This means that comparison of images is done by comparing their marginals.For now, the training image is assumed to be stationary.With the current formulation of the frequency distributions this is the only feasible approach.Discussion of how to avoid the assumption of stationarity exists in literature, see for instance the recent Honarkhah (2011).Some of these approaches mentioned here might also be useful for the FM method, but we will leave this to future research to determine.

Characterizing Images by their Frequency Distribution
Before presenting the FM method we define what we denote the frequency distribution.Given an image with the set of voxels Z = {1, . . ., N} and voxel values z 1 , . . ., z N we define the template function Ω as a function that takes as argument a voxel k and returns the set of voxels belonging to the neighborhood N k of voxel k.In the FM method, the neighborhood of a voxel is indirectly given by the statistical properties of the image itself; however, the shape of a neighborhood satisfying the assumptions from Sect.2.2 is unknown.For each training image one must therefore define a template function Ω that seeks to correctly describe the neighborhood.The choice of template function determines if a voxel is considered to be an inner voxel.An inner voxel is a voxel with the maximal neighborhood size, and the set of inner voxels, Z in , of the image is therefore defined as where |N k | denotes the number of voxels in N k .Let n denote the number of voxels in the neighborhood of an inner voxel.Typically, voxels on the boundary or close to the boundary of an image will not be inner voxels.To each inner voxel z k we assign a pattern value p k ; we say the inner voxel is the center voxel of a pattern.This pattern value is a unique identifier of the pattern and may be chosen arbitrarily.The most obvious choice is perhaps a vector value with the discrete variables in the pattern, or a scalar value calculated based on the values of the variables.The choice should be made in consideration of the implementation of the FM method.The pattern value is uniquely determined by the value of the voxel z k and the values of the voxels in its neighborhood, z k .As the pattern value is determined by the values of n + 1 voxels, which can each have m different values, the maximum number of different patterns is m n+1 .Let π i , for i = 1, . . ., m n+1 , count the number of patterns that have the ith pattern value.The frequency distribution is then defined as Let p Ω denote the mapping from voxel values of an image Z to its frequency distribution π , that is, p Ω (z 1 , . . ., z N ) = π .Figure 1 shows an example of an image and the patterns it contains for the template function that defines neighborhoods as follows Recall from Sect.2.2 that (l x , l y ) are the coordinates of voxel l in this twodimensional example image.We note that for a given template function the frequency distribution of an image is uniquely determined.The opposite, however, does not hold.Different images can, excluding symmetries, have the same frequency distribution.This is what the FM method seeks to exploit by using the frequency distribution to generate new images, at the same time similar to, and different from, our training image.

Computing the Similarity of Two Images
The FM method compares a solution image to a training image by comparing its frequency distribution to the frequency distribution of the training image.How dissimilar the solution image is to the training image is determined by a dissimilarity function, which assigns a distance between their frequency distributions.This distance reflects how likely the solution image is to be a realization of the same unknown process as the training image is a realization of.The bigger the distance, the more dissimilar are the frequency distributions and thereby also the images, and the less likely is the image to be a realization of the same random process as the training image.The dissimilarity function can therefore be used to determine which of two images is most likely to be a realization of the same random process as the training image is a realization of.
The dissimilarity function is not uniquely given but an obvious choice is the χ 2 distance also described in Sheskin (2004).It is used to measure the distance between two frequency distributions by measuring how similar the proportions of patterns in the frequency distributions are.Given two frequency distributions, the χ 2 distance estimates the underlying distribution.It then computes the distance between the two frequency distributions by computing each of their distances to the underlying distribution.Those distances are computed using a weighted Euclidean norm where the weights are the inverse of the counts of the underlying distribution, see Fig. 2. In our research, using the counts of the underlying distribution turns out to be a favorable weighting of small versus big differences instead of using a traditional p-norm as used by Peredo and Ortiz (2010).
we compute what we define as the dissimilarity function value of the image where i denotes the counts of the underlying distribution of patterns with the ith pattern value for images of the same size as the image and TI i denotes the counts of the underlying distribution of patterns with the ith pattern value for images of the same size as the training image.These counts are computed as where n Z and n TI are the total number of counts of patterns in the frequency distributions of the image and the training image, that is, the number of inner voxels in the image and the training image, respectively.

Solving Inverse Problems
We define the frequency matching method for solving inverse problems formulated as least squares problems using geologically complex a priori information as the fol- where c(π ) is the dissimilarity function value of the solution image defined by Eq. 3 and α is a weighting parameter.The forward operator g, which traditionally is a mapping from model space to data space, also contains the mapping of the categorical values z k ∈ {0, . . ., m − 1} for k = 1, . . ., N of the image into the model parameters m that can take m different discrete values.
The value of α cannot be theoretically determined.It is expected to depend on the problem at hand; among other factors its resolution, the chosen neighborhood function and the dimension of the data space.It can be thought of as playing the same role for the dissimilarity function as the covariance matrix C d does for the data misfit.So it should in some sense reflect the variance of the dissimilarity function and in that way determine how much trust we put in the dissimilarity value.Variance, or trust, in a training image is difficult to quantify, as the training image is typically given by a geologist to reflect certain expectations to model.Not having a theoretical expression for α therefore allows us to manipulate the α value to loosely quantify the trust we have in the training image.In the case where we have accurate data but only a vague idea of the structures of the subsurface the α can be chosen low, in order to emphasize the trust we have in the data and the uncertainty we have of the structure of the model.In the opposite case, where data are inaccurate but the training image is considered to be a very good description of the subsurface, the α value can be chosen high, to give the dissimilarity function more weight.
Due to the typically high number of model parameters, the combinatorial optimization problem should be solved by use of an iterative solution method; such a method will iterate through the model space and search for the optimal solution.While the choice of solution method is less interesting when formulating the FM method, it is of great importance when applying it.The choice of solution method and the definition of how it iterates through the solution space by perturbing images has a significant impact on the feasibility of the method in terms of its running time.As we are not sampling the solution space we do not need to ensure that the method captures the uncertainty of the model parameters, and the ideal would be a method that converges directly to the maximum a posteriori solution.While continuous optimization problems hold information about the gradient of the objective function that the solution method can use to converge to a stationary solution, this is not the case for our discrete problem.Instead we consider the multiple point statistics of the training image when perturbing a current image and in that way we seek to generate models which better match the multiple point statistics of the training image and thus guide the solution method to the maximum a posteriori model.

Properties of the Frequency Matching Method
The FM method is a general method and in theory it can be used to simulate any type of structure, as long as a valid training image is available and a feasible template function is chosen appropriately.If neighborhoods are chosen too small, the method will still be able to match the frequency distributions.However, it will not reproduce the spatial structures simply because these are not correctly described by the chosen multiple point statistics and as a result the computed model will not be realistic.If neighborhoods are chosen too big, CPU cost and memory demand will increase, and as a result the running time per iteration of the chosen solution method will increase.Depending on the choice of iterative solution method, increasing the size n of the neighborhood is likely to also increase the number of iterations needed and thereby increase the convergence time.When the size of neighborhoods is increased, the maximum number of different patterns, m n+1 , is also increased.The number of different patterns present is, naturally, limited by the number of inner voxels, which is significantly smaller than m n+1 .In fact, the number of patterns present in an image is restricted further as training images are chosen such that they describe a certain structure.This structure is also sought to be described in the solutions.The structure is created by repetition of patterns, and the frequency distributions will reveal this repetition by having multiple counts of the same pattern.This means, the number of patterns with non-zero frequency is greatly smaller than m n+1 resulting in the frequency distributions becoming extremely sparse.For bigger test cases, with millions of parameters, patterns consisting of hundreds of voxels and multiple categories, this behavior needs to be investigated further.
The dimension of the images, if they are two or three dimensional, is not important to the FM method.The complexity of the method is given by the maximal size of neighborhoods, n.The increase in n as a result of going from two-to threedimensional images is therefore more important than the actual increase in physical dimensions.In fact, when it comes to assigning pattern values a neighborhood is, regardless of its physical dimension, considered one dimensional where the ordering of the voxels is the important aspect.Additionally, the number of categories of voxel values m does not influence the running time per iteration.As with the number of neighbors, n, it only influences the number of different possible patterns m n+1 and thereby influences the sparsity of the frequency distribution of the training image.The higher m is, the sparser is the frequency distribution.It is expected that the sparsity of the frequency distribution affects the level of difficulty of the combinatorial optimization problem.Strebelle (2002) recommends choosing a training image that is at least twice as large as the structures it describes; one must assume this advice also applies to the FM method.Like the snesim algorithm, the FM method can approximate continuous properties by discretizing them into a small number of categories.One of the advantages of the FM method is that by matching the frequency distributions it indirectly ensures that the proportion of voxels in each of the m categories is consistent between the training image and the solution image.It is therefore not necessary to explicitly account for this ratio.Unlike the snesim algorithm, the computed solution images therefore need very little post treatment-in the current implementation the solution receives no post treatment.However, the α parameter does allow for the user to specify how strictly the frequency distributions should be matched.In the case where the data are considered very informative or the training image is considered far from reality, decreasing the α allows for the data to be given more weight and the multiple point statistics will not be as strictly enforced.
Constraints on the model parameters can easily be dealt with by reducing the feasible set {0, . . ., m − 1} for those values of k in the constraints of the problem stated in Eq. 6.The constrained voxels remain part of the image Z and when computing the frequency distribution of an image they are not distinguished from non-constrained voxels.However, when perturbing an image all constraints of the inverse problem should at all times be satisfied and conditioned to the hard data.The additional constraints on the model parameters will therefore be honored.

Implementation
This section describes the current implementation of the frequency matching method.Algorithm 1 gives a general outline of how to apply the FM method, that is, how to solve the optimization problem from Eq. 6 with an iterative optimization method.In the remainder of the section, the implementation of the different parts of the FM method will be discussed.It should be noted that the implementation of the FM method is not unique; for instance, there are many options for how the solution method iterates through the model space by perturbing models.The different choices should be made depending on the problem at hand and the current implementation might not be favorable for some given problems.The overall structure in Algorithm 1 will be valid regardless of what choices are made on a more detailed level.The current implementation is based on a Simulated Annealing scheme.Simulated Annealing is a well-known heuristic optimization method first presented by Kirkpatrick et al. (1983) as a solution method for combinatorial optimization problems.The acceptance of perturbed images is done using an exponential cooling rate and the parameters controlling the cooling are tuned to achieve an acceptance ratio of approximately 15 accepted perturbed models for each 100 suggested perturbed models.A perturbed model is generated by erasing the values of the voxels in a part of the image and then re-simulating the voxel values by use of sequential simulation.

Reformulation of the Dissimilarity Function
The definition of the dissimilarity function from Eq. 3 has one great advantage that we for computational reasons simply cannot afford to overlook.As discussed previously, the frequency distributions are expected to be sparse as the number of patterns present in an image is significantly smaller than m n+1 .This means that a lot of the terms in the dissimilarity function from Eq. 3 will be zero, yet the dissimilarity function can be simplified further.It will be shown that the dissimilarity function value of a frequency distribution, c(π ), given the frequency distribution of a training image, π , can be computed using only entries of π where π TI > 0. In other words, to compute the dissimilarity function value of an image we need only to know the count of patterns in the image that also appear in the training image.Computationally, this is a great advantage as we can disregard the patterns in our solution image that do not appear in the training image and we need not compute nor store the entire frequency distribution of our solution image, which is shown by inserting the expressions of the counts for the underlying distribution defined by Eqs. 4 and 5 This leads to the introduction of the following two subsets of I The two subsets form a partition of I as they satisfy I 1 ∪ I 2 = I and I 1 ∩ I 2 = ∅.The dissimilarity function Eq. 7 can then be written as recalling that i∈I π i = n Z and that π i = 0 for i / ∈ I .A clear advantage of this formulation of the dissimilarity function is that the entire frequency distribution π of the image does not need to be known; as previously stated, it only requires the counts π i of the patterns also found in the training image, which is for i ∈ I 1 .

Computing and Storing the Frequency Distributions
The formulation of the dissimilarity function from Eq. 3 and later Eq. 8 means that it is only necessary to store non-zero entries in a frequency distribution of a training image π TI .Algorithm 2 shows how the frequency distribution of a training image is computed such that zero entries are avoided.The algorithm also returns a list p with the same number of elements as the frequency distribution and it holds the pattern values corresponding to each entry of π TI .

Perturbation of an Image
The iterative solver moves through the model space by perturbing models and this is the part of the iterative solver that leaves the most choices to be made.An intuitive but naive approach would be to simply change the value of a random voxel.This will result in a perturbed model that is very close to the original model, and it will therefore require a lot of iterations to converge.The current implementation changes the values of a block of voxels in a random place of the image.
Before explaining in detail how the perturbation is done, let Z cond ⊂ Z be the set of voxels that we have hard data for, which means their value is known and should be conditioned to.First a voxel k is chosen randomly.Then the value of all voxels in a domain D k ⊂ (Z \ Z cond ) around voxel k are erased.Last, the values of the voxels in D k are simulated using sequential simulation.The size of the domain should be chosen to reflect how different the perturbed image should be from the current image.The bigger the domain, the fewer iterations we will expect the solver will need to iterate through the model space to converge, but the more expensive an iteration will become.Choosing the size of the domain is therefore a trade-off between number of iterations and thereby forward calculations and the cost of computing a perturbed image.
Algorithm 4 shows how an image is perturbed to generate a new image.Simulate z l given all assigned voxels in N l .end

Updating the Frequency Distribution
As a new image is created by changing the value of a minority of the voxels, it would be time consuming to compute the frequency distribution of all voxel values of the new image when the frequency distribution of the old image is known.Recall that n is the maximum number of neighbors a voxel can have; inner voxels have exactly n neighbors.Therefore, in addiction to changing its own pattern value, changing the value of a voxel will affect the pattern value of at most n other voxels.This means that we obtain the frequency distribution of the new image by performing at most n + 1 subtractions and n + 1 additions per changed voxel to the entries of the already known frequency distribution.
The total number of subtractions and additions can be lowered further by exploiting the block structure of the set of voxels perturbed.The pattern value of a voxel will be changed when any of its neighboring voxels are perturbed, but the frequency distribution need only be updated twice for each affected voxel.We introduce a set of voxels Z aff , which is the set of voxels who are affected when perturbing image Z into Z, that is, the set of voxels whose pattern values are changed when perturbing image Z into image Z How the partial frequency distribution is updated when an image is perturbed is illustrated in Algorithm 5.  pixels) are the high velocity zones.The problem is scalable and for the example we have chosen the parameters presented by Table 1.
The template is chosen, such that the neighborhood of pixel k is the following set of pixels Recall that pixel l has the coordinates (l x , l z ); the first coordinate being the horizontal distance from the left borehole and the second coordinate being the depth, both measured in pixels.To compute a perturbed image, the domain used in Algorithm 4 is defined as follows The values of all pixels l ∈ D k will be re-simulated using Sequential Simulation conditioned to the remaining pixels l / ∈ D k .We are not using any hard data in the example, which means Z cond = ∅.
This choice of template function yields n = 34 where the geometrical shape of the neighborhood of inner pixels is a 7 pixels by 5 pixels rectangle.This is chosen based on the trends in the training image, where the distance of continuity is larger horizontally than vertically.However, it should be noted that this choice of template function is not expected to meet the assumptions of conditional independence of Sect.2.2.The distance of continuity in the training image appears much larger horizontally than only seven pixels, and vertically the width of the channels is approximately ten pixels.This implies that, despite matched frequency distributions, a computed solution will not necessarily be recognized to have the same visual structures as the training image.The goal is solve the inverse problem which involves fitting the data and therefore, as our example will show, neighborhoods of this size are sufficient.The data-fitting term of the objective function guides the solution method, such that the structures from the training image are correctly reproduced.The low number of neighbors constrains the small-scale variations, which are not well-determined by the travel time data.However, the travel time data successfully determine the large-scale structures.The template function does not need to describe structures of the largest scales of the training image as long as the observed data are of a certain quality.Before elaborating on how the α value was determined, we present some of the models computed for different values of α. Figure 6 shows the computed models for four logarithmically distributed values of α between 10 −3 and 10 1 .It is seen how the model for lowest value of α is geologically unrealistic and does not reproduce the a priori expected structures from the training image as it primarily is a solution to the ill-posed, under-determined, data-fitting problem.As α increases, the channel structures of the training image are recognized in the computed models.However, for too large α values the solutions are dominated by the χ 2 term as the data have been deprioritized, and the solutions are not geologically reasonable either.As discussed, the chosen template is too small to satisfy the conditions from Sect.2.2, yielding models that do in fact minimize the χ 2 distance, but do not reproduce the structures form the training image.The data misfit is now assigned too little weight to help compensate for the small neighborhoods, and the compromise between minimizing the data misfit and minimizing the dissimilarity that before worked out well is no longer present.
We propose to use the L-curve method (Hansen and O'Leary 1993) to determine an appropriate value of α. Figure 7 shows the value of χ 2 (m FM ) versus the value of 1 2 g(m FM ) − d obs 2 C d for 13 models.The models have been computed for logarithmically distributed values of α ranging from 1 (upper left corner) to 10 −3 (lower right corner).Each of the 13 models is marked with a blue circle.The models from Fig. 6 are furthermore marked with a red circle.The model from Fig. 5 is marked with a red star.We recognize the characteristic L-shaped behavior in the figure and the model from Fig. 5 is the model located in the corner of the L-curve.The corresponding value α = 1.8 × 10 −2 is therefore considered an appropriate value of α.

Conclusions
We have proposed the frequency matching method which enables us to quantify a probability density function that describes the multiple point statistics of an image.In this way, the maximum a posteriori solution to an inverse problem using training image-based complex prior information can be computed.The frequency matching method formulates a closed form expression for the a priori probability of a given model.This is obtained by comparing the multiple point statistics of the model to the multiple point statistics from a training image using a χ 2 dissimilarity distance.
Through a synthetic test case from crosshole tomography, we have demonstrated how the frequency matching method can be used to determine the maximum a posteriori solution.When the a priori distribution is used in inversion, a parameter α is required.We have shown how we are able to recreate the reference model by choosing this weighing parameter appropriately.Future work could focus on determining the theoretically optimal value of α as an alternative to using the L-curve method.

Fig. 1
Fig. 1 Example of patterns found in an image.Notice how the image is completely described by the (ordered) patterns in every third row and column; the patterns are marked in red

Fig. 2
Fig.2Illustration of the χ 2 distance between two frequency distributions π and π TI , each containing the counts of two different pattern values, p 1 and p 2 .The difference between the frequency distributions is computed as the sum of the length of the two red line segments.The length of each line segment is computed using a weighted Euclidean norm.The counts of the underlying distribution are found as the orthogonal projection of the frequency distributions onto the a line going through the origin such that π − 2 = π TI − TI 2

Algorithm 1 :
The Frequency Matching Method Input: Training image, Z TI , Starting image Z Output: Maximum a posteriori image Z FM Compute frequency distribution of training image π TI and pattern list p (Algorithm 2) Compute partial frequency distribution of starting image π (Algorithm 3) while not converged do Compute perturbed image Z based on Z (Algorithm 4) Compute partial frequency distribution of perturbed image π (Algorithm 5) if accept the perturbed image then Set Z ← Z and π ← π end end

Algorithm 2 :
Frequency Distribution of a Training Image Input: Training Image Z TI Output: Frequency distribution π TI , list of pattern values p Initialization: empty list π TI , empty list p for each inner voxel, i.e., k ∈ Z TI in do Extract pattern k Compute pattern value p k if the pattern was previously found then Add 1 to the corresponding entry of π TI else Add p k to the list of pattern values p Set the corresponding new entry of π TI equal to 1 end end Algorithm 3 computes the partial frequency distribution π of an image that is needed to evaluate the dissimilarity function c(π) = χ 2 (π , π TI ) from Eq. 8.The partial frequency distribution only stores the frequencies of the patterns also found in the training image.Algorithm 3: Partial Frequency Distribution of an Image Input: Image Z, list of pattern values p from the training image Output: Partial frequency distribution π Initialization: all zero list π (same length as p) for each inner voxel, i.e., k ∈ Z in do Extract pattern k Compute pattern value p k if the pattern is found in the training image then Add 1 to the corresponding entry of π end end

Algorithm 4 :
Perturbation of an Image Input: Image Z, partial frequency distribution π of Z Output: Perturbed image Z Initialization: set π = π Pick random voxel k for each voxel l around voxel k, i.e., l ∈ D k do Erase the value of voxel l, i.e., z l is unassigned end for each unassigned voxel l around voxel k, i.e., l ∈ D k do

Fig. 7
Fig.7L-curve used to determine the optimal α value.Models have been computed for 13 logarithmically distributed values of α ranging from 1 (upper left corner) to 10 −3 (lower right corner).Each of the 13 models is marked with a blue circle.See the text for further explanation