Dimensional Reduction of Pattern-Based Simulation Using Wavelet Analysis
A pattern-based simulation technique using wavelet analysis (wavesim) is proposed for the simulation of categorical and continuous variables. Patterns are extracted by scanning a training image with a template and are then stored in a pattern database. The dimension of the patterns in the pattern database is reduced by wavelet decomposition at a certain scale, and the approximate sub-band is used for pattern database classification. The pattern database is classified by the k-means clustering algorithm, and each class is represented by a class prototype. For the simulation of categorical variables, the conditional cumulative distribution function (ccdf) for each class is generated based on the frequency of the individual categories at the central node of the template. During the simulation process, the similarity of the conditioning data event with the class prototypes is measured using the L2-norm. When simulating categorical variables, the ccdf of the best matched class is used to draw a pattern from that class. When continuous variables are simulated, a random pattern is drawn from the best matched class. Several examples of conditional and unconditional simulation with two- and three-dimensional data sets show that the spatial continuity of geometric features and shapes is well reproduced. A comparative study with the filtersim algorithm shows that wavesim performs better than filtersim in all examples. A full-field case study at the Olympic Dam base metals deposit, South Australia, simulates the lithological rock-type units as categorical variables. Results show that the proportions of the various rock-type units in the hard data are well reproduced when they are similar to those in the training image; when rock-type proportions differ between the training image and the hard data, the results show a compromise between the two.
Keywords: Pattern-based simulation, k-means clustering, Wavelet analysis, Conditional simulation, Training image
Simulation of spatially correlated continuous and/or categorical variables, such as the geological units and metal grades of mineral deposits, or the sedimentary facies and pertinent attributes of petroleum reservoirs and water aquifers, is a challenging task. The well-known variogram-based two-point statistical techniques (Goovaerts 1997; Deutsch and Journel 1998) are limited in their ability to adequately model spatial complexity (Journel and Alabert 1989). To address the limitations of two-point statistical models, new developments have introduced high-order spatial statistics in the form of multi-point (mp) models (Guardiano and Srivastava 1993; Tjelmeland 1998; Journel 1997). In multi-point models, a pattern is defined as a set of values spatially distributed over a given template of spatial locations (Arpat and Caers 2007; Remy et al. 2009). During simulation, multi-point conditioning data in the form of a template are compared with the patterns of the training image (a geological analogue of what is being modeled), and a pattern is selected from the training image. The selected pattern is either the most similar pattern (Arpat and Caers 2007) or a random pattern from the best matched class (Zhang et al. 2006; Wu et al. 2008). Different distance functions are used as similarity measures, including the Manhattan distance (Zhang et al. 2006), the L2-norm (Chatterjee and Dimitrakopoulos 2011), and others.
The main goal of mp simulation methods is to find the pattern in a pattern database that best matches the conditioning data event. A pattern database is generated by scanning the training image using the given template. The snesim algorithm (Strebelle 2000, 2002) generates a search tree from the training image and models the conditional cumulative distribution function (ccdf) of patterns. The main disadvantage of the snesim algorithm is that it is demanding in terms of computer random-access memory (RAM), particularly when very large training images are used. RAM requirements may limit snesim when large simulations with many categories are needed. The snesim algorithm searches for exact replicates of the conditioning data event. Since exact replicates cannot always be found in the pattern database, some conditioning data points are deleted from the conditioning data event. Arpat and Caers (2007) proposed an mp simulation algorithm termed simpat (simulation with patterns), which does not require an exact match between the training patterns and the conditioning data event but instead searches for the best possible match. Like snesim, the simpat algorithm considers the training image as a collection of patterns, from which a pattern is selected to match the conditioning data event locally as closely as possible. The main advantage of this algorithm is that no conditioning data points need to be deleted from the conditioning data event; however, its major limitation is that the entire pattern database is searched to find the best match at each simulated node, so the computational time is very high.
The filtersim (simulation using filter scores) algorithm overcomes simpat’s computing limitation (Zhang et al. 2006; Wu et al. 2008). Like simpat, the main advantage of filtersim is that no conditioning data points need to be deleted from the conditioning data event for matching with the patterns from the pattern database. In filtersim, as in snesim and other mp simulation approaches, the entire training image is scanned using a given template to obtain patterns. Different filters are applied to the patterns to obtain filter scores. The patterns in the pattern database are then grouped into classes based on their filter score values. Each class is represented by its prototype, which is the average of all patterns in the class. During simulation, the conditioning data event is compared with the class prototypes to find the closest matched class. Unlike simpat, filtersim does not need to search the entire pattern database. The algorithm looks for a ‘best match’ rather than an ‘exact match’; therefore, no conditioning data points need to be eliminated from the data event. Honarkhah and Caers (2010) introduced a distance-based simulation algorithm for efficiently classifying the pattern database. Their results show that the algorithm performs better than filtersim for pattern reproduction. However, in all of these pattern-based simulation algorithms, a pattern is drawn randomly from a class; unlike snesim, no conditional cumulative distribution function (ccdf) is generated for each class for categorical variable simulation. Therefore, the success of the technique depends on how well the patterns in the pattern database are classified. Since no ccdfs are generated, the pattern obtained from a ‘best match’ class is random; no statistics are involved in its selection.
In other approaches to mp simulation, researchers proposed a high-order spatial cumulants-based technique in which the ccdf is generated by Legendre polynomials (Dimitrakopoulos et al. 2010; Mustapha and Dimitrakopoulos 2010). In this framework, the coefficients of the Legendre polynomials are calculated using cumulant maps generated from a training image. This high-order simulation technique is data driven instead of training image driven and therefore reproduces the high-order spatial statistics of the data. The limitation of this approach is that, at present, the framework can only simulate continuous variables. Gloaguen and Dimitrakopoulos (2009) present a different technique of conditional simulation using the inter-scale dependency in the wavelet domain. The advantage of this approach is that the direct conditioning is easy, but it is difficult to fit the conditioning data in the wavelet domain.
As an alternative to other mp simulation methods, a pattern-based simulation algorithm using wavelet analysis, termed wavesim, is proposed in this paper. The pattern database is generated in a manner similar to other mp simulation techniques. The pattern database is classified using the wavelet approximate sub-band coefficients of each pattern. The wavelet approximate sub-band can capture most of the pattern variability and, at the same time, reduce the dimensionality of the pattern database. Pattern database classification is performed using the k-means clustering technique. For categorical data simulation, the ccdf of each class for the central node category of the template is developed using the probability of each individual category within the class; for continuous data simulation, a random sample is selected from the ‘best match’ class. During simulation, the similarity of the class prototypes with the conditioning data event is calculated. For categorical data, a random pattern is then generated from the ccdf of the ‘best match’ class. Unlike filtersim, wavesim does not draw a random pattern directly from a class; rather, it draws a random pattern according to the ccdf developed for that class. For continuous data, however, no ccdf is generated.
The present paper is organized as follows. Section 2 describes the wavesim method. A brief overview of pattern-based simulation is presented in Sect. 2.1, and the fundamentals of wavelet analysis and dimensional reduction techniques are presented in Sect. 2.2. The pattern database classification technique is presented in Sect. 2.3. The similarity measures between the class prototypes and the conditioning data event are presented in Sect. 2.4, while Sect. 2.5 describes the ccdf generation of a class. Section 3.1 presents unconditional simulation for two-dimensional problems using a binary training image, a three-category training image, and a continuous training image. The conditional simulations of two- and three-dimensional continuous data are presented in Sect. 3.2. The sensitivity of wavesim is presented in Sect. 4. An application simulating a three-dimensional categorical variable at the Olympic Dam base metals deposit in South Australia is presented in Sect. 5, and the discussion and conclusions follow.
2.1 Generation of a Pattern Database
Pattern-based simulation is viewed as an image reconstruction problem (Arpat 2004; Zhang et al. 2006; Wu et al. 2008). Instead of directly reproducing the multiple-point statistics of a training image, the training image patterns are reproduced in a stochastic manner, and this ultimately respects the multi-point statistics of the training image (Arpat and Caers 2007). Pattern-based simulation algorithms consist of two steps. First, a pattern database is generated by scanning the training image using a given template. Then, the pattern that provides the best match to the conditioning data is searched for in the pattern database.
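The first of these two steps can be illustrated with a minimal sketch. It assumes a two-dimensional training image stored as a list of rows and a square template that must fit entirely inside the image; the function name `build_pattern_database` and the toy image are hypothetical and are not taken from the paper.

```python
def build_pattern_database(training_image, template_size):
    """Scan a 2-D training image with a square template and collect patterns.

    training_image: list of equal-length rows of numbers.
    template_size:  side length of the square template (assumed shape).
    Returns a list of patterns, each flattened row by row.
    """
    n_rows = len(training_image)
    n_cols = len(training_image[0])
    patterns = []
    # Keep only positions where the template fits entirely inside the image.
    for top in range(n_rows - template_size + 1):
        for left in range(n_cols - template_size + 1):
            pattern = [
                training_image[top + i][left + j]
                for i in range(template_size)
                for j in range(template_size)
            ]
            patterns.append(pattern)
    return patterns


# A tiny 4x5 binary "training image" (toy data, not from the paper).
toy_ti = [
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
]
patdb = build_pattern_database(toy_ti, template_size=3)
```

With an odd template size, the central node of each flattened pattern sits at index `len(pattern) // 2`, which is the value later used for categorical simulation.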
2.2 Dimensional Reduction of a Pattern Database
After the pattern database patdbT is generated, whether from a continuous or a categorical training image, it is classified so that, during simulation, only a few representative members (the prototypes of the classes) are compared with the conditioning data event instead of the entire pattern database. However, when the template dimension is large, the dimension of patdbT is also large, and classifying this high-dimensional pattern database is computationally demanding. In previous research, patdbT was classified after reducing the dimension of the patterns to a few filter scores (Zhang et al. 2006; Wu et al. 2008). Zhang et al. (2006) and Wu et al. (2008) used 6 and 9 filters for two- and three-dimensional training images, respectively. A pattern of any dimension in patdbT is thus represented by 6×M filter scores (for a two-dimensional image), where M is the number of categories present in the training image (M=1 for a continuous image). Here, a wavelet-based representation of patterns is introduced in which the dimension of the patterns used for classification can be reduced by selecting the scale of wavelet decomposition.
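As a rough illustration of how the decomposition scale controls the dimension, the sketch below computes the Haar approximate sub-band of a two-dimensional pattern. It is written under stated assumptions: the orthonormal Haar low-pass filter is applied as a 2×2 block sum divided by 2, and odd-sized sides are extended by repeating the last row or column before each level. The boundary handling actually used in the paper is not stated in this text, and the function name `haar_approximation` is hypothetical.

```python
def haar_approximation(pattern, scale):
    """Return the flattened Haar approximate sub-band after `scale` levels.

    pattern: list of rows of numbers. Odd-sized sides are padded by edge
    replication before each level (an assumption made for this sketch).
    """
    current = [[float(v) for v in row] for row in pattern]
    for _ in range(scale):
        if len(current) == 1 and len(current[0]) == 1:
            break  # a single coefficient cannot be decomposed further
        # Pad to even dimensions by repeating the last column and row.
        if len(current[0]) % 2 == 1:
            current = [row + [row[-1]] for row in current]
        if len(current) % 2 == 1:
            current = current + [list(current[-1])]
        # Orthonormal 2-D Haar low-pass: sum of each 2x2 block divided by 2.
        current = [
            [
                (current[r][c] + current[r][c + 1]
                 + current[r + 1][c] + current[r + 1][c + 1]) / 2.0
                for c in range(0, len(current[0]), 2)
            ]
            for r in range(0, len(current), 2)
        ]
    # Flatten so that each pattern becomes one low-dimensional feature vector.
    return [value for row in current for value in row]


ones_9x9 = [[1] * 9 for _ in range(9)]
features_scale2 = haar_approximation(ones_9x9, scale=2)
```

Under this padding scheme, a 9×9 pattern is reduced to 9, 4, and 1 approximate coefficients at scales 2, 3, and 4, which agrees with the variable counts quoted for the binary example in the paper.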
2.3 Pattern Database Classification
For classification of the pattern database patdbT, the approximate sub-band of the patterns is used; its dimension is reduced according to the decomposition scale j. The k-means clustering technique (MacQueen 1967; Hartigan and Wong 1979; Ding and He 2004) is applied to classify the pattern database patdbT. The main idea of k-means clustering is to divide patdbT into a number of classes such that the within-class sum of squared distances to the class centres is minimized, which is equivalent to maximizing the separation between classes.
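A minimal sketch of this classification step follows. It uses a plain Lloyd-style iteration rather than any specific variant cited above, and it assumes that the feature vectors are the approximate sub-band coefficients while the prototypes are point-wise averages of the full patterns. The function names, the random initialisation, and the toy data are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(features, k, n_iter=50, seed=0):
    """Plain Lloyd-style k-means; returns one class label per feature vector."""
    rng = random.Random(seed)
    centers = [list(f) for f in rng.sample(features, k)]
    labels = [0] * len(features)
    for _ in range(n_iter):
        # Assignment step: nearest center in squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(f, centers[c]))
            for f in features
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(features, new_labels) if lab == c]
            if members:  # keep the old center if a class becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels


def class_prototypes(patterns, labels, k):
    """Point-wise average of the full patterns in each class (None if empty)."""
    prototypes = []
    for c in range(k):
        members = [p for p, lab in zip(patterns, labels) if lab == c]
        prototypes.append(
            [sum(col) / len(members) for col in zip(*members)] if members else None
        )
    return prototypes


# Toy one-dimensional features forming two well-separated groups.
toy_features = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
toy_labels = kmeans(toy_features, k=2)
toy_prototypes = class_prototypes(toy_features, toy_labels, k=2)
```

In practice the number of classes would be far larger than in this toy case; the paper uses 200 clusters for its continuous two-dimensional example.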
2.4 Similarity Measures Between Conditional Data and Class Prototypes
2.5 Conditional Cumulative Distribution Function (ccdf) of a Class for Categorical Image
During the simulation process, after the best matched class is found, a uniform random number is generated. The category at the central node is read from the developed ccdf at the value of that random number. Then, a random pattern is drawn from those patterns of the matched class whose central node category equals the category obtained from the ccdf. After the drawn pattern is pasted at the simulated node, the next node on the random path is visited. The same distance calculation and pattern-drawing procedure are repeated until no nodes are left unvisited. Note that, for a continuous image, a random pattern is drawn from the class; no ccdf is generated in the continuous case.
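The categorical drawing procedure described above can be sketched as follows. The sketch assumes flattened patterns and a known index for the central node; the function names `build_class_ccdf` and `draw_pattern` and the toy class are hypothetical.

```python
import random
from collections import Counter


def build_class_ccdf(class_patterns, center_index):
    """Cumulative distribution of central-node categories within one class.

    Returns (category, cumulative_probability) pairs in sorted category order.
    """
    counts = Counter(p[center_index] for p in class_patterns)
    total = sum(counts.values())
    ccdf, running = [], 0.0
    for category in sorted(counts):
        running += counts[category] / total
        ccdf.append((category, running))
    return ccdf


def draw_pattern(class_patterns, center_index, rng):
    """Draw a category from the ccdf, then a random pattern with that center."""
    ccdf = build_class_ccdf(class_patterns, center_index)
    u = rng.random()  # uniform random number in [0, 1)
    category = ccdf[-1][0]  # fallback guards against floating-point round-off
    for cat, cum_prob in ccdf:
        if u < cum_prob:
            category = cat
            break
    candidates = [p for p in class_patterns if p[center_index] == category]
    return rng.choice(candidates)


# Toy class of three-node patterns; the central node is at index 1.
toy_class = [[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 1, 1]]
toy_ccdf = build_class_ccdf(toy_class, center_index=1)
drawn = draw_pattern(toy_class, center_index=1, rng=random.Random(1))
```

In the toy class, three of the four patterns have category 1 at the central node, so the cumulative probabilities are 0.25 for category 0 and 1.0 for category 1.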
Step 1. Scan the training image ti using the given template T. Perform wavelet decomposition of the generated patterns using the selected wavelet basis function and scale. Save the wavelet coefficients and approximate coefficients in the pattern database. If the training image is categorical, generate M binary images from the M-category training image before wavelet decomposition.
Step 2. Classify the patterns into the previously defined number of clusters, based only on the approximate sub-band coefficients, and calculate each class prototype as the point-wise average of all patterns within the class.
Step 3. Define a random path that visits each unsampled node once and only once.
Step 4. Use the same template shape T at each unsampled location u. Calculate the distance from each class prototype to the conditioning data available within the template using either (7) or (9). Select the class with the minimum distance from the conditioning data. If no conditioning data are available, select a random class.
Step 5. Draw a random pattern from the selected class and paste it centred at the simulated location u. If any hard data or central node values of already simulated locations are present at nodes within the template T at location u, they are frozen before the simulated pattern is pasted. For categorical data, the random sample is drawn based on the ccdf generated for each class, as described in Sect. 2.5.
Step 6. Add the simulated value at point u to a separate file so that it can be used during distance calculation.
Step 7. Repeat Steps 4 to 6 for the next points in the random path defined in Step 3.
Step 8. Repeat Steps 3 to 7 to generate different realizations using different random paths.
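The class-selection and pattern-pasting steps described above can be sketched as follows. Equations (7) and (9) are not reproduced in this text, so the sketch substitutes a plain squared Euclidean distance restricted to the informed nodes of the template; this stand-in, the use of `None` for uninformed nodes, and the function names are all assumptions. The data event is assumed to hold only the values that must be frozen (hard data and previously simulated central nodes).

```python
import random


def best_matching_class(data_event, prototypes, rng):
    """Index of the prototype closest to the informed nodes of the data event.

    data_event: flattened template values, with None at uninformed nodes.
    prototypes: list of flattened class prototypes of the same length.
    """
    informed = [i for i, v in enumerate(data_event) if v is not None]
    if not informed:
        # No conditioning data inside the template: pick a class at random.
        return rng.randrange(len(prototypes))

    def distance(proto):
        # Squared Euclidean distance over informed nodes only (assumed form).
        return sum((data_event[i] - proto[i]) ** 2 for i in informed)

    return min(range(len(prototypes)), key=lambda c: distance(prototypes[c]))


def paste_pattern(data_event, pattern):
    """Paste a pattern over the template, keeping informed nodes frozen."""
    return [p if v is None else v for v, p in zip(data_event, pattern)]


rng = random.Random(0)
toy_prototypes = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
chosen = best_matching_class([None, 1, None], toy_prototypes, rng)
pasted = paste_pattern([None, 1, None], [0, 0, 0])
```

Here the single informed node has value 1, so the second prototype is selected, and pasting an all-zero pattern leaves the informed node unchanged.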
3 Application of the Proposed Method
The wavesim algorithm is validated by simulating known categorical and continuous two-dimensional and three-dimensional data sets. The exhaustive data sets are obtained from different sources. All runs are performed on a 3.2 GHz Intel(R) Xeon(TM) PC with 2 GB of RAM. For wavelet decomposition, the Haar basis functions are applied in all cases unless otherwise specified. The filtersim results used for comparison are generated with the SGeMS software (Remy et al. 2009). To assess the wavesim method, unconditional and conditional simulations are performed for categorical and continuous data sets and are then compared against the results from filtersim.
3.1 Unconditional Simulation
For unconditional simulation, a binary training image, a three-category training image, and a continuous training image are considered and presented hereafter.
3.1.1 Two-categories Training Image
3.1.2 Multi-Categories Training Image
3.1.3 Continuous Training Image
For unconditional simulation of continuous data, an exhaustive two-dimensional continuous horizontal slice is obtained from a three-dimensional fluvial reservoir. The exhaustive data sets used here are obtained from the Stanford V Reservoir Data Set (Mao and Journel 1999). The channel configurations and orientations are complex and vary from one slice to another in the vertical direction. The size of the domain to be simulated is 100×128. The number of clusters chosen for this analysis is 200. The sensitivity to the number of clusters is presented in Sect. 4.
3.2 Conditional Simulation
Two examples of conditional simulation with wavesim are shown and compared with the filtersim results: one two-dimensional and one three-dimensional continuous data example are presented hereafter.
3.2.1 Two-Dimensional Conditional Simulation with Continuous Data
The same Stanford V Reservoir Data Set (Mao and Journel 1999) is used for the conditional simulation example. One slice of the three-dimensional reservoir data is used as the reference image, from which the conditioning data are sampled. Another slice is used as the training image. The size of the domain to be simulated is 100×128.
3.2.2 Three-Dimensional Conditional Simulation with Continuous Data
To perform three-dimensional conditional simulation for continuous data, we have rescaled the Stanford V Reservoir Data Set (Mao and Journel 1999) to 100×100×28. One part of the three-dimensional reservoir data is used as the training image and the other part is simulated using the wavesim and filtersim algorithms. The size of the training image and reference image is 100×5×28 each.
In Sect. 3.1.1, for the binary training image, we have used the approximate sub-band of the 2-scale wavelet decomposition of a 9×9 training image. Therefore, the number of variables used for classification in our algorithm is 9, compared to 6 in filtersim. Thus, the computing time of our proposed approach is slightly higher than that of filtersim for the simulated images in Fig. 8. However, Fig. 7 presents different realizations from the same training image using 3-scale (4 variables) and 4-scale (1 variable) decomposition. Figure 7 shows that the simulations using 3-scale and 4-scale decomposition also perform better than the filtersim results (Fig. 8(c), (d)). Since the 3-scale and 4-scale decompositions use fewer variables than filtersim (4 and 1, respectively), the computing time of the proposed method is also lower than that of filtersim in these cases.
4 Sensitivity Analysis
The examples presented show that wavesim performed better than the filtersim algorithm for continuous and categorical, two- and three-dimensional problems. However, the success of the proposed method, as for filtersim, depends on several parameters. In this section, we present the sensitivity of the proposed method to these parameters. The effects on the simulated realizations of the number of clusters used for pattern database classification, the type of basis functions used, the weights assigned in the distance calculation, and the number of wavelet coefficients used for distance calculation are investigated. The sensitivity of the method is tested using conditional simulation with the same data set presented in Sect. 3.2.1. Data set 1 is used as conditioning data for the sensitivity analysis unless otherwise mentioned.
4.1 Sensitivity to the Cluster Number
4.2 Sensitivity to the Wavelet Basis
It is generally expected that the simulated maps should improve when higher-order wavelet basis functions are used. However, we have not observed such an improvement in this case. A possible reason is that the approximate sub-band coefficients obtained with the Haar basis are sufficient to capture the complexity present in the patterns of the training image. This implies that, although no improvement was seen here with increasing order of the wavelet basis, an improvement may be observed when the training image patterns are more complex.
4.3 Sensitivity to the Training Image
4.4 Sensitivity to the Number of Wavelet Coefficients
The examples presented so far were performed by calculating the distance from the conditioning data event to the cluster center using only the approximate sub-band coefficients when the conditioning data event is fully informed (9). However, the literature reports that keeping only a few wavelet coefficients together with the approximate sub-band coefficients can significantly improve the quality of the reconstructed image (Donoho et al. 1996; Vannucci and Corradi 1999). Therefore, in this example, the distance calculation was performed using a few wavelet coefficients along with the approximate sub-band coefficients when the conditioning data event is fully informed. Adding a few wavelet coefficients does not greatly increase the dimension of the data used for distance calculation, but it may increase the power of the algorithm. Four different runs were performed by changing the number of wavelet coefficients. In the first run, only the approximate sub-band is used. In the other three runs, the numbers of wavelet coefficients incorporated in the distance calculation are 40, 80, and 100.
5 Case Study
The reproduction of the directional variability is tested by calculating directional indicator variograms. The directional indicator variograms of the hard data and the simulated realizations are presented elsewhere (Chatterjee and Dimitrakopoulos 2010). The indicator variograms and cross-variograms show that the directional variability of the hard data for these rock types is reproduced by the simulated realizations.
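For readers unfamiliar with this check, the sketch below computes an experimental indicator variogram, gamma(h) = sum of (i(u) - i(u+h))^2 over all pairs divided by 2N(h), for one category along one direction. It is a toy setting only: it assumes a regular two-dimensional grid of category codes, whereas the case study is three-dimensional, and the function name and example grid are hypothetical.

```python
def indicator_variogram(grid, category, lag, direction):
    """Experimental indicator variogram of one category on a regular 2-D grid.

    grid:      list of rows of category codes.
    direction: (d_row, d_col) unit step, e.g. (0, 1) for the column direction.
    lag:       number of unit steps separating the two nodes of a pair.
    Returns None when no pair exists at the requested lag.
    """
    d_row, d_col = direction[0] * lag, direction[1] * lag
    n_rows, n_cols = len(grid), len(grid[0])
    total, n_pairs = 0.0, 0
    for r in range(n_rows):
        for c in range(n_cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < n_rows and 0 <= c2 < n_cols:
                # Indicator transform: 1 where the node belongs to the category.
                head = 1 if grid[r][c] == category else 0
                tail = 1 if grid[r2][c2] == category else 0
                total += (head - tail) ** 2
                n_pairs += 1
    return total / (2 * n_pairs) if n_pairs else None


# Toy grid with a vertical contact between rock types 1 and 2.
toy_grid = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
]
gamma_across = indicator_variogram(toy_grid, category=1, lag=1, direction=(0, 1))
gamma_along = indicator_variogram(toy_grid, category=1, lag=1, direction=(1, 0))
```

The variogram is zero along the contact and positive across it, which is the kind of directional contrast that such a comparison between hard data and realizations is meant to reveal.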
As in other multi-point simulation techniques, the patterns are obtained from the training image, so wavesim is also training image driven. Thus, when conditional simulation is performed, the simulated realization reproduces the statistics of the training image. When the amount of hard data is increased, the effect of the hard data is introduced in the resultant simulated realizations, and a clear conflict between hard data and training image statistics will be observed in the simulated realizations, as in other mp simulation algorithms. As a result, if the statistics of the hard data and the training image are distinctly different and the conditional simulation is performed using a considerable number of hard data, the simulated realizations will fail to reproduce either the training image or the hard data statistics.
A pattern-based conditional simulation algorithm, wavesim, is presented. The technique is based on pattern classification and pattern matching; the dimension of the patterns is reduced by wavelet decomposition, and the patterns are classified by the k-means clustering algorithm. The algorithm is verified by two- and three-dimensional conditional and unconditional simulation using different data sets, including binary and three-category categorical data and continuous data with complex channels. The algorithm reproduced the continuity of the channels in the two- and three-dimensional examples for both conditional and unconditional simulation. The comparative study with the filtersim algorithm showed that wavesim performed better than filtersim in reproducing the continuity of the channels in all examples.
The sensitivity of the algorithm to different parameters was also explored. The study shows that the algorithm is sensitive to the number of clusters, like other pattern-based simulation methods, and to the orientation of the training image. Therefore, optimal selection of the cluster number may help to improve the performance of the wavesim algorithm. Moreover, the algorithm was not found to be sensitive to two key parameters, namely the wavelet basis functions and the number of wavelet coefficients. However, this is a case-specific observation. An extensive sensitivity study with training images of different levels of complexity is required to show the true effects of the wavelet basis functions; this will be considered in a future study. The case study at Olympic Dam mine was presented for multi-class categorical conditional simulation. The results showed that the proportions of the rock codes are reasonably reproduced.
The major advantages of the wavesim algorithm are: (a) because the approximate sub-band of the wavelet decomposition reduces the dimensionality of the patterns while capturing most of the data variability, the classification of the high-dimensional pattern database can be performed successfully with less computational effort; and (b) since a ccdf is developed for each class for categorical simulation, a pattern is drawn from a class according to a probability law rather than by purely random drawing, which may improve the reproduction of channels.
The limits of this technique are similar to other mp simulation methods: (a) the algorithm is training image driven, therefore when the statistics of the training image and hard data are different, the algorithm will reproduce statistics in-between the hard data and training image; and (b) when the number of categories in the categorical image increases, the dimension of the pattern database will increase considerably, thus the dimensional reduction technique using the approximate sub-band after wavelet decomposition of the pattern database may not be computationally efficient.
The work in this paper was funded by NSERC Discovery Grant 239019 and the members of McGill’s COSMO Lab, AngloGold Ashanti, Barrick Gold, BHP Billiton, De Beers, Newmont Mining, and Vale. We would like to thank the management of Olympic Dam mine for giving permission to use their data. We would also like to thank our reviewers Mehrdad Honarkhah, Jef Caers, and Jianbing Wu for their valuable comments to improve our first version of the manuscript.
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
- Arpat GB (2004) Sequential simulation with patterns. PhD thesis, Stanford University, Stanford, CA
- Chatterjee S, Dimitrakopoulos R (2010) Wavelet-based indicator simulation using training images: an application at Olympic Dam Mine, South Australia. COSMO Res Rep 4(2):153–186
- Deutsch CV, Journel AG (1998) GSLIB: geostatistical software library and user’s guide. Oxford University Press, New York
- Ding C, He X (2004) K-means clustering via principal component analysis. In: Proc int’l conf machine learning (ICML 2004), pp 225–232
- Goovaerts P (1997) Geostatistics for natural resources evaluation. Oxford University Press, New York
- Hartigan JA, Wong MA (1979) Algorithm AS 136: A K-means clustering algorithm. J R Stat Soc, Ser C, Appl Stat 28(1):100–108
- Journel AG (1997) Deterministic geostatistics: a new visit. In: Baafy E, Shofield N (eds) Geostatistics Woolongong’96. Kluwer Academic, Dordrecht, pp 213–224
- Kuglin C, Hines D (1975) The phase correlation image alignment method. In: Proc IEEE int conf on cybernetics and society, pp 163–165
- MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proc of 5th Berkeley symposium on mathematical statistics and probability, vol 1, pp 281–297
- Mallat S (1998) A wavelet tour of signal processing. Academic Press, San Diego
- Mao S, Journel AG (1999) Generation of a reference petrophysical and seismic three-dimensional data set: the Stanford V reservoir; Stanford Center for Reservoir Forecasting Annual Meeting. Available at: http://ekofisk.stanford.edu/SCRF.html
- Osterholt V (2006) Simulation of ore deposit geology and an application at the Yandicoogina iron ore deposit, Western Australia. PhD thesis (Unpublished), University of Queensland, 144 p
- Strebelle S (2000) Sequential simulation drawing structures from training images. PhD thesis, Stanford University, Stanford, CA