# Improving pattern discovery and visualisation with self-adaptive neural networks through data transformations

## Authors

- Zheng, H.
- Wang, H.


DOI: 10.1007/s13042-011-0050-z

- Cite this article as:
- Zheng, H. & Wang, H. Int. J. Mach. Learn. & Cyber. (2012) 3: 173. doi:10.1007/s13042-011-0050-z


## Abstract

The ability to reveal relevant patterns in an intuitively attractive way through incremental learning makes self-adaptive neural networks (SANNs) a powerful tool to support pattern discovery and visualisation. Combining information related to both the shape and the magnitude of the data, this paper introduces a SANN which implements new similarity-matching criteria and error-accumulation strategies for network growth. The model was tested on two datasets, including a real biological gene expression dataset. The results demonstrate several significant features of the proposed SANN model for improving pattern discovery and visualisation.

### Keywords

Self-adaptive neural networks · Pattern discovery and visualisation · Similarity measure · Chi-square statistics

## 1 Introduction

### 1.1 An introduction to pattern discovery and visualisation

Pattern discovery is a fundamental operation in a data mining process [1]. The aim of pattern discovery is not merely to recognise patterns but to find useful ones; more importantly, it aims to reveal the hidden patterns intrinsic to the tasks investigated [2]. It has been well recognised that the discovery of relevant patterns hidden in the data may have important implications in many application areas. In biosciences, for example, revealing significant patterns in gene expression profiles may lead to a better understanding of structural and functional genetic roles, aid in the prevention and diagnosis of complex diseases, and improve the design of highly effective therapies [3, 4].

Pattern visualisation is an alternative approach to support data mining and knowledge discovery. It is concerned with exploring data and information in such a way as to gain understanding and insight into the data studied [5]. An effective visualisation technique can provide a qualitative overview of large and complex datasets, summarise data, and assist in the identification of interesting properties, which may provide the basis for more focused quantitative analyses [6].

It has been suggested that there is a strong connection between pattern discovery and visualisation [7]. On one hand, the search for interesting patterns, as well as the understanding of the meanings of the patterns discovered, can be accomplished with the aid of visualisation techniques. These techniques can serve as a powerful tool for identifying structures, tendencies and relationships in data in an intuitive way [8]. On the other hand, large amounts of data can often not be visualised directly, as the resulting graphical representation gets too complex to be captured by users. Therefore, pattern discovery techniques can be used to capture and aggregate significant information contained in data, and the resulting structure can then be visualised [9]. In this sense, pattern discovery paves a way for users to intuitively access a bigger picture of the problem under study.

Due to the ability to learn from data and to reveal relevant patterns in an intuitively attractive way through incremental learning, self-adaptive neural networks (SANNs) are becoming a powerful tool to support pattern discovery and visualisation [7].

### 1.2 SANNs: an overview

The modern era of SANNs started with the pioneering work of Teuvo Kohonen [10], who developed the Kohonen self-organising feature map (SOM), perhaps the most widely investigated self-organising neural network in the literature. The principal goal of the SOM algorithm is to transform high-dimensional input patterns into a one- or two-dimensional discrete map, and to perform this transformation adaptively in a topologically ordered fashion.

While demonstrating a number of features suitable for pattern discovery and visualisation, SOM-based approaches exhibit several limitations that hinder their performance. For a given application, the usefulness of a map produced by a trained SOM depends on how accurately it represents the input space. Moreover, the static grid representation of the SOM has negatively influenced its applications in pattern discovery and visualisation: for most applications, data visualisation is hard to achieve based on the raw grid map alone.

Over the last decade, there have been numerous attempts to improve the SOM neural network. Due to their ability to dynamically organise themselves according to the natural structure of the underlying data, SANNs offer significant advantages in this endeavour [7]. For example, the *Growing Self-Organizing Map* (GSOM) [11, 12] starts with a map of 2 × 2 neurons, and new neurons are incrementally added at the boundary of the map where the network exhibits a large cumulative error in representation. In the *Growing Cell Structures* (GCS) [13], the basic 2D grid of the SOM has been replaced by a network of neurons whose connectivity defines a system of triangles. By separating neurons into disconnected areas, the GCS can produce an explicit representation of cluster boundaries. The self-organising tree algorithm (SOTA) [14] is another example of SANNs. The initial topology of SOTA is composed of two terminal neurons connected by an internal neuron, and the learning process starts with the presentation of each sample from the training set. SOTA has demonstrated interesting features in pattern discovery and visualisation. For example, the nested output structure, in which neurons at each level are averages of the samples below, makes it straightforward to compare patterns at different hierarchical levels.

It is worth noting that the development of SANNs has traditionally focused on data domains that are assumed to be modelled by a Gaussian distribution. The similarity between data samples is largely measured using Euclidean distance and/or Pearson correlation-based distance metrics. It has been found, however, that the Pearson correlation coefficient is too sensitive to the shape, while the Euclidean distance mainly considers the magnitude of the changes of the data [15]. More recently, Zheng et al. [16] developed a Poisson-based SANN, which is tailored to knowledge discovery in Serial Analysis of Gene Expression (SAGE) data. While successfully addressing the probability model governing SAGE data, the Poisson-based similarity measure ignores the *direction of departure* information, as pointed out by Kim et al. [15]. They proposed a new similarity measure which considers the data magnitude when measuring the shape similarity and implemented it within a *k*-means clustering procedure, which inevitably exhibits some of the limitations found in the traditional *k*-means algorithm. For example, it requires users to specify the number of clusters in advance, and its outcome is an unorganised collection of clusters that is not conducive to proper interpretation [17].

### 1.3 Objectives of this study

This paper presents a new SANN model which takes both magnitude and shape information into account to improve pattern discovery and visualisation. The main objective of this study is to implement new strategies for determining a winning node and for network growth, based on the adaptation of a Chi-square statistic-based similarity measure computed in a transformed feature space.

The remainder of this paper is organized as follows. Section 2 presents a detailed description of the new SANN learning algorithm, followed by an introduction of two datasets used in this study. Results and a comparative analysis are presented in Sect. 4. This paper concludes with the discussion of results and future research.

## 2 The proposed SANN

### 2.1 Learning algorithm

Like GSOM [11], the proposed SANN model is randomly initialized with four neurons on a 2D grid. Such an initial structure allows the network to grow in any direction solely depending on the input data [11].

During training, each input sample *x*_{i} is sequentially presented, and each presentation involves the following two basic operations: (a) finding the winning neuron for the input sample; and (b) updating the weights associated with both the winning neuron and its topological neighbours. However, by utilizing the Chi-square statistics-based similarity measured in a transformed feature space, the proposed SANN implements new strategies for determining a winning neuron and initiating neuron growth, which distinguish it from other SANN models. The main features demonstrated by the proposed model include:

**Matching criterion for finding a winning neuron**

Determining a winning neuron for each input data sample is the fundamental process for SANN models. Euclidean distance-based approaches are perhaps the most widely used matching criterion in the development of SANN models. Being specifically sensitive to outliers and changes of magnitude, such a distance measure has demonstrated poor performance in several applications [16]. Aiming to target the situation where the magnitude information needs to be taken into account when measuring the shape similarity, the proposed SANN model introduces the following Chi-square based matching criterion.
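To illustrate the difference between the two measures, the short sketch below (illustrative Python, not from the paper) compares them on a toy expression profile: the Euclidean distance prefers a neuron with a similar magnitude but a reversed shape, while the Chi-square dispersion prefers the neuron with the same shape at a different magnitude.

```python
import math

def euclidean(x, w):
    # Plain Euclidean distance: dominated by absolute magnitude differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, w)))

def chi_square(x, w):
    # Chi-square dispersion: squared deviation scaled by the expected value,
    # so proportional (same-shape) profiles score as relatively close.
    return sum((a - b) ** 2 / b for a, b in zip(x, w))

x = [1.0, 2.0, 3.0]             # rising profile
w_same_shape = [2.0, 4.0, 6.0]  # same shape, doubled magnitude
w_same_scale = [3.0, 2.0, 1.0]  # comparable magnitude, reversed shape

# Euclidean distance picks the reversed-shape neuron as the closer one ...
assert euclidean(x, w_same_scale) < euclidean(x, w_same_shape)
# ... while the Chi-square dispersion picks the same-shape neuron.
assert chi_square(x, w_same_shape) < chi_square(x, w_same_scale)
```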

Let *x*_{i} be the input vector representing the *i*th *n*-dimensional input sample and *w*_{j} be the associated weight vector of the *j*th neuron, where *x*_{i} and *w*_{j} are denoted by the following representation:

$$x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,n}), \qquad w_j = (w_{j,1}, w_{j,2}, \ldots, w_{j,n})$$

Assuming that the weight vector of neuron *j*, *w*_{j}, coincides with the centroid vector of all the data samples assigned to this neuron, the dispersion between *x*_{i} and *w*_{j} can be estimated using the Chi-square statistic, which accounts for both the magnitude and shape information:

$$D(x_i, w_j) = \sum_{k=1}^{n} \frac{(x_{i,k} - w_{j,k})^2}{w_{j,k}}$$

The input sample *i* is assigned to the neuron *j* with the smallest dispersion. Thus, the winning neuron, *w*_{c}, signified by the subscript *c*, is determined by the following smallest-distance criterion:

$$c = \arg\min_{j} D(x_i, w_j)$$

As pointed out by Kim et al. [15], while penalizing the deviation from the expected value, such a measure ignores the direction of difference information and thus may lose some shape information. To address this, instead of calculating the deviation in the original space, we measure the deviation in a transformed feature space which is constructed using the mutual differences of the original vector components [15]. For a given *T*-dimensional vector **V**_{i} = (*V*_{i}(1), …, *V*_{i}(*T*)), the transformed vector can be represented as a *T*(*T*−1)/2-dimensional vector with components *V*_{i}(*t*_{1}) − *V*_{i}(*t*_{2}) for *t*_{1} = 1, …, *T*−1 and *t*_{2} = (*t*_{1} + 1), …, *T*. Accordingly, the transformed *x*_{i} and *w*_{j} vectors can be expressed as

$$\tilde{x}_i = \big(x_i(t_1) - x_i(t_2)\big), \qquad \tilde{w}_j = \big(w_j(t_1) - w_j(t_2)\big), \qquad 1 \le t_1 < t_2 \le n$$

and the deviation between the observed values (the transformed input sample, *x*_{i}) and the expected values (the transformed weight vector of the neuron, *w*_{j}) can be measured using the following statistic:

$$D(\tilde{x}_i, \tilde{w}_j) = \sum_{t_1 < t_2} \frac{\Big[\big(x_i(t_1) - x_i(t_2)\big) - \big(w_j(t_1) - w_j(t_2)\big)\Big]^2}{\big|w_j(t_1) - w_j(t_2)\big|}$$
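The transformed-space matching criterion can be sketched in a few lines of Python. This is our illustration, not the paper's implementation; in particular, the `eps` guard in the denominator is an assumption to handle zero expected differences, which the paper does not specify.

```python
from itertools import combinations

def transform(v):
    # Map a T-dimensional vector to its T(T-1)/2 mutual differences
    # v(t1) - v(t2) for all pairs t1 < t2.
    return [v[t1] - v[t2] for t1, t2 in combinations(range(len(v)), 2)]

def chi_square_transformed(x, w, eps=1e-9):
    # Chi-square dispersion between a sample and a neuron's weight vector,
    # measured on the difference-transformed components.  The absolute
    # value plus eps is an assumed guard: transformed "expected"
    # components can be zero or negative.
    return sum((a - b) ** 2 / (abs(b) + eps)
               for a, b in zip(transform(x), transform(w)))

def find_winner(x, weights):
    # Smallest-distance criterion: c = argmin_j D(x, w_j).
    return min(range(len(weights)),
               key=lambda j: chi_square_transformed(x, weights[j]))
```

For example, a rising profile `[1.0, 2.0, 3.0]` matches a scaled rising neuron `[2.0, 4.0, 6.0]` rather than a falling neuron `[3.0, 2.0, 1.0]`, because only the mutual differences enter the comparison.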

**Error accumulation strategy for initiating the growth of new neurons**

In conventional SANN models, a cumulative quantisation error (*E*), commonly computed from the Euclidean distance between the input sample and the winning neuron and accumulated over each learning cycle, is used to estimate when and where to insert new neurons. It has been found, however, that this measure is overly sensitive to changes of magnitude and may fail to correctly guide the growth of new neurons [16]. In this study, a deviation based on Chi-square statistics measured in the transformed space is adopted, and a cumulative deviation (*E*) is calculated for each winning neuron over each learning cycle using the following formula:

$$E_c(t) = E_c(t-1) + \sum_{k_1 < k_2} \frac{\Big[(x_{k_1} - x_{k_2}) - (w_{c,k_1} - w_{c,k_2})\Big]^2}{\big|w_{c,k_1} - w_{c,k_2}\big|}$$

where *w*_{c,k} is the *k*th feature of the winning neuron, *c*, *x*_{k} represents the *k*th feature of the input vector, *x*, in the original feature space, and *E*_{c}(*t*) represents the cumulative deviation computed at time *t*.
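The accumulation step itself is a one-line update per presentation; a minimal sketch (hypothetical helper names, assuming the transformed-space deviation described above) might look like:

```python
from itertools import combinations

def transformed_chi_square(x, w, eps=1e-9):
    # Chi-square deviation on the mutual-difference components; the
    # abs(...) + eps denominator is an assumed guard against zero or
    # negative expected differences.
    return sum(((x[a] - x[b]) - (w[a] - w[b])) ** 2 / (abs(w[a] - w[b]) + eps)
               for a, b in combinations(range(len(x)), 2))

def accumulate_error(errors, c, x, w_c):
    # E_c(t) = E_c(t-1) + D(x, w_c), accumulated only for the winning neuron c.
    errors[c] = errors.get(c, 0.0) + transformed_chi_square(x, w_c)
    return errors[c]
```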

**The mechanism that determines when to generate a new neuron**

In the original GSOM algorithm [11], the network keeps track of the highest cumulative quantisation error, *E*_{max}, during the growing phase, and periodically compares it with the growth threshold (GT) defined by the user. When the value of *E*_{max} exceeds GT, the growth of neurons is initiated. To provide effective control over the growth of a network, a spread factor (SF) was introduced. However, the introduction of SF was based on the assumption that the node growth is guided using cumulative quantisation errors measured by Euclidean distance. As a new error accumulation strategy has been introduced in this study, such a mechanism to initiate the growth of the network is arguably not suitable for the proposed SANN model.

In an attempt to reduce computational complexity, the proposed model adopts the following approach. After each learning epoch, the growth of new neurons is initiated at the neuron with the highest error. After that, the cumulative errors of all neurons are reset to zero. This, on the one hand, eliminates the need for redistributing error information; on the other hand, it gives every node the same opportunity to be a winner in the next learning cycle.
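The epoch-end growth step can be sketched as follows (our simplification: the grid is stored as a dict of positions, and new neurons inherit the parent's weights rather than following GSOM's full weight-initialisation scheme):

```python
def grow_and_reset(grid, errors):
    # Grow new neurons in all free immediate neighbouring positions of the
    # neuron with the highest cumulative deviation, then reset every error
    # to zero so all neurons compete equally in the next learning cycle.
    if not errors:
        return []
    r, c = max(errors, key=errors.get)       # highest-error neuron
    added = []
    for pos in [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]:
        if pos not in grid:                  # only free positions
            grid[pos] = list(grid[(r, c)])   # inherit parent weights (sketch)
            added.append(pos)
    errors.clear()                           # no error redistribution needed
    return added

# Example: a single-neuron grid grows four new neighbours.
grid = {(0, 0): [0.5, 0.5]}
new_nodes = grow_and_reset(grid, {(0, 0): 3.2})
```

Because the errors are cleared rather than redistributed, no bookkeeping is needed to decide how much of the parent's error each new neuron should inherit.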

**Summary of the proposed SANN algorithm**

```
 1: Initialize each neuron with random values
 2: Growing phase:
 3: For each learning cycle
 4:     Select a sample from the training set
 5:     Compute the distance between the input sample and each neuron
 6:     Find the winning neuron using the minimum-distance criterion according to (4)
 7:     Update the weights of the winning neuron and its neighbours (the same as GSOM)
 8:     Increase the error value of the winner using (8)
 9:     Find a neuron with the highest cumulative deviation and initiate the growth of new neurons
10: End for
11: Smoothing phase:
12: For each learning cycle
13:     Present each input sample
14:     Determine the winning neuron
15:     Update the weights of the winner and its immediate neighbours
16: End for
```
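Putting the summarised steps together, a toy end-to-end sketch is given below. This is our illustration under simplifying assumptions, not the paper's implementation: it uses a flat list of neurons instead of a 2D grid, adds a single new neuron per epoch, and omits the neighbourhood update.

```python
import random
from itertools import combinations

def dist(x, w, eps=1e-9):
    # Transformed-space chi-square deviation (steps 5-6); eps is an
    # assumed guard against zero expected differences.
    return sum(((x[a] - x[b]) - (w[a] - w[b])) ** 2 / (abs(w[a] - w[b]) + eps)
               for a, b in combinations(range(len(x)), 2))

def train(data, epochs=5, alpha=0.1, n_init=4):
    rng = random.Random(0)
    dim = len(data[0])
    # Step 1: initialise neurons with random values.
    neurons = [[rng.random() for _ in range(dim)] for _ in range(n_init)]
    for _ in range(epochs):                   # growing phase, steps 3-9
        errors = [0.0] * len(neurons)
        for x in data:
            # Winner by the minimum-distance criterion.
            c = min(range(len(neurons)), key=lambda j: dist(x, neurons[j]))
            # Move the winner towards the sample (no neighbourhood here).
            neurons[c] = [w + alpha * (xi - w) for w, xi in zip(neurons[c], x)]
            errors[c] += dist(x, neurons[c])  # accumulate deviation
        # Grow at the highest-error neuron; errors reset next epoch.
        worst = max(range(len(neurons)), key=lambda j: errors[j])
        neurons.append(list(neurons[worst]))
    return neurons
```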

### 2.2 Learning parameters

In this study, *N*_{0} is the size of the initial neighbourhood; *α*_{0} is the initial learning rate; and NLE stands for the number of learning epochs. For a training set, *T*, of *n* cases, a learning epoch refers to *n* learning cycles, within which the network is sequentially presented with each training sample. Unless indicated otherwise, the parameters reported in this paper are as follows. For the Iris data, *N*_{0} = 4, *α*_{0} = 0.1, and the maximum NLE (growing phase) = 10, NLE (smoothing phase) = 10. For the mouse retinal SAGE data, *N*_{0} = 6, *α*_{0} = 0.01, and the maximum NLE (growing phase) = 30, NLE (smoothing phase) = 50. The numbers beside each neuron shown on the output maps depict the order in which the neurons were added during the growth phase. The rectangular neurons illustrated in the resulting maps represent *dummy neurons*, which accumulated zero hits at the end of the learning phase. The datasets analysed in this paper are described in the following section.

## 3 The datasets under study

### 3.1 Iris dataset

The Iris dataset is one of the best-known benchmark datasets in pattern recognition. It contains 150 samples from three classes (Setosa, Versicolor and Virginica), each described by four features. The dataset is available from the *UCI Machine Learning Repository* (http://www.ics.uci.edu/~mlearn/MLRepository.html).

The statistical description of each feature in the Iris dataset

Feature | Min | Max | Mean | Standard deviation
---|---|---|---|---
1 | 4.3 | 7.9 | 5.84 | 0.83
2 | 2.0 | 4.4 | 3.05 | 0.43
3 | 1.0 | 6.9 | 3.76 | 1.76
4 | 0.1 | 2.5 | 1.20 | 0.76

Functional categorization of the 153 mouse retinal SAGE tags (125 developmental genes; 28 non-developmental genes)

Cluster | Number of SAGE tags
---|---
Early I | 32
Early II | 34
Late I | 32
Late II | 27
Non-dev. | 28

### 3.2 Mouse retinal gene expression data

This dataset was constructed using SAGE, which allows a simultaneous analysis of thousands of genes without the need for prior gene sequence information [23]. It consists of 10 SAGE libraries (38,818 unique tags with tag counts ≥2) from developing retina taken at 2-day intervals from embryonic day 12.5 (*E*12.5) to postnatal day 10.5 (*P*10.5) and adult retina [19]. Among the 38,818 tags, there are 1,467 tags with abundance levels equal to or greater than 20 in at least one of the ten libraries. In this study, a subset of 153 SAGE tags with known biological functions provided by Kim et al. [15] was used to test the algorithms. These 153 SAGE tags are divided into five classes based on their expression levels across different developmental stages, as shown in Table 3. Each tag is represented by its counts in the ten SAGE libraries, i.e. *E*12.5, *E*14.5, *E*16.5, *E*18.5, *P*0.5, *P*2.5, *P*4.5, *P*6.5, *P*10, and *Adult retina*.

## 4 Results

### 4.1 Analysis of iris data

Based on further analysis of the areas of interest with different learning parameters, hierarchical and multi-resolution clustering may be implemented. Figure 2b shows the spread-out version of Branch A produced with the learning rate set to 0.05. The map has developed in two directions, suggesting that there may be two classes in this branch. This coincides with the sample distribution over the two sub-branches: 97.6% of the samples in Sub-Branch A1 belong to the Versicolor class, and 90.4% of the samples in Sub-Branch A2 belong to the Virginica class.

### 4.2 Analysis of mouse retina gene expression data

The sample distribution over each branch

Branch | Node | Early I | Early II | Late I | Late II | Non-dev
---|---|---|---|---|---|---
A | 88 | 0 | 5 | 0 | 0 | 0
 | 86 | 5 | 1 | 0 | 0 | 0
 | 85 | 3 | 1 | 0 | 0 | 0
 | 87 | 0 | 5 | 0 | 0 | 0
 | 83 | 8 | 0 | 0 | 0 | 0
 | 77 | 6 | 0 | 0 | 0 | 0
 | 71 | 0 | 5 | 0 | 0 | 0
 | 70 | 0 | 6 | 0 | 0 | 0
 | 69 | 0 | 6 | 0 | 0 | 0
 | 57 | 0 | 3 | 0 | 0 | 0
 | 55 | 3 | 0 | 0 | 0 | 0
 | 52 | 6 | 0 | 0 | 0 | 0
 | 50 | 0 | 1 | 0 | 0 | 0
B | 66 | 0 | 0 | 0 | 3 | 0
 | 67 | 0 | 0 | 3 | 1 | 0
 | 74 | 0 | 0 | 1 | 3 | 0
 | 75 | 0 | 0 | 1 | 0 | 0
 | 80 | 0 | 0 | 1 | 1 | 0
 | 76 | 0 | 0 | 1 | 0 | 0
 | 81 | 0 | 0 | 1 | 0 | 0
 | 82 | 0 | 0 | 2 | 2 | 0
C | 1 | 0 | 0 | 0 | 0 | 6
 | 3 | 1 | 0 | 0 | 0 | 2
 | 5 | 0 | 0 | 0 | 0 | 4
 | 0 | 0 | 0 | 0 | 0 | 8
 | 2 | 0 | 1 | 0 | 0 | 1
 | 4 | 0 | 0 | 0 | 0 | 2
 | 7 | 0 | 0 | 0 | 0 | 1
 | 6 | 0 | 0 | 0 | 0 | 2
 | 10 | 0 | 0 | 0 | 0 | 2

Due to its self-adaptive properties, after training the proposed network can reveal the clusters hidden in the data through its shape, allowing users to identify relevant patterns with relatively little effort. The resulting map shown in Fig. 4a has branched out in three directions, indicating the main groups of the data. An analysis of the sample distribution over each branch (see Table 4) indicates that each branch is associated with certain expression patterns encoded in this SAGE data. For example, all the samples assigned to Branch A belong to the clusters that exhibit higher expression in the early stage of embryonic development (Early I and Early II classes). A similar observation can be made for Branch B: all the samples found in Branch B have higher mRNA expression in postnatal periods (Late I and Late II classes). Branch C, on the other hand, consists of all 28 tags unrelated to mouse retina development (Non-dev class).

To evaluate the statistical significance of the sample distribution within each region, the *hypergeometric distribution* probability [21] is calculated:

$$p = 1 - \sum_{i=0}^{k-1} \frac{\binom{n}{i}\binom{N-n}{K-i}}{\binom{N}{K}}$$

where *K* is the number of tags that fall into a region, *k* is the number of class members in the region, *N* is the total number of SAGE tags and *n* is the number of tags belonging to a specific class in the whole dataset. For example, in Region A1, 31 out of 32 Early I tags are grouped together (*p* < 10^{−20}). All the 27 samples found in Region A2 belong to the Early II class (*p* < 10^{−20}). Based on these figures, one may confidently conclude that Regions A1 and A2 are significantly associated with the Early I and Early II classes, respectively.
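The enrichment computation above can be reproduced with a few lines of standard-library Python (a sketch; `region_p_value` is our name for the helper):

```python
from math import comb

def region_p_value(N, n, K, k):
    # Hypergeometric tail probability of seeing at least k members of a
    # class of size n within a region holding K of the N tags:
    #   p = 1 - sum_{i=0}^{k-1} C(n, i) * C(N - n, K - i) / C(N, K)
    total = comb(N, K)
    return 1.0 - sum(comb(n, i) * comb(N - n, K - i) for i in range(k)) / total

# Toy check: all 5 members of a 5-tag class landing in a 5-tag region
# out of N = 10 tags has probability exactly 1 / C(10, 5).
p = region_p_value(N=10, n=5, K=5, k=5)
```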

Like GSOM [11], when there is a need to add new nodes to the network, the proposed model grows new neurons in all free immediate neighbouring positions. This may create some redundant nodes such as *dummy nodes*, i.e. nodes that accumulated ‘0’ hits during the training phase. Interestingly, such dummy nodes can be used to aid the interpretation of cluster structure. For example, in Fig. 4b, Regions A1 and A2 have been well separated from each other by the area covered by the dummy nodes 0, 4, 6, 9, 12, 15, 18, 21, 24, 27, and 32.

In summary, the GSOM model fails to achieve its clustering goals when applied to this real biological dataset: samples from the same cluster were scattered across different regions, and the resulting map did not reflect the intrinsic data structure. For example, no specific region was found to be significantly linked to the Late II class (*p* > 0.05). In contrast, the maps derived using the proposed network successfully reveal significant patterns encoded in the data: similar samples are grouped together, and, by means of the hypergeometric distribution, each branch or each region separated by the dummy nodes is found to be highly represented by certain classes (*p* < 10^{−20}). Such a map may greatly facilitate the study of significant patterns hidden in the data.

### 4.3 An interactive platform

The proposed model provides an interactive platform supporting two modes of analysis:

- *Visual inspection of the resulting map*, to gain an overall idea of the structure of the data. The proposed model can indicate the clusters by its shape, allowing the human visual system to detect and identify the patterns hidden in the data with relatively little effort.
- *Hierarchical and multi-resolution clustering analysis.* A finer analysis of the areas of interest can subsequently be carried out using different learning parameters. Based on a further spread-out version of the selected regions, hierarchical and multi-resolution clustering may be implemented.

## 5 Conclusion

Pattern discovery and visualisation are two important operations in a data mining process. The ability to reveal significant patterns hidden in the data in an intuitively attractive way through incremental learning makes SANNs a powerful tool to support pattern discovery and visualisation. Using Chi-square statistics-based approaches to measuring the dispersion between input samples and the weights associated with each neuron in a transformed feature space, this paper presented a SANN model which implements a new matching criterion for determining a winning node and new error-accumulation strategies for node growth. The results obtained in this study demonstrated several significant features exhibited by the proposed model in pattern discovery and visualisation. For example, based on its self-adaptive capability, similar data are successfully clustered. Like existing SANN models [11, 12, 16], the resulting adaptive map automatically models the data under study. Upon completion of the learning process, the proposed model can develop into different shapes to reflect significant patterns hidden in the data. By keeping a regular 2D structure at all times, the model has the appealing property of allowing a user to perform pattern discovery and visualisation at the same time with relative ease. By performing a finer analysis of areas of interest with different learning parameters, the interactive data mining platform provided in this study can be used to implement hierarchical and multi-resolution clustering.

Instead of estimating the exact position where new nodes need to be inserted, the proposed model generates new nodes in all free immediate neighbouring locations [11, 12]. This may inevitably introduce some irrelevant neurons, which may degrade the quality of the visualisation of the output maps. Moreover, at present, as there is no standard way to determine the best combination of learning parameters, the estimation of optimal learning parameters is based on conducting several experiments. Thus, the incorporation of neuron-pruning methods [22] and other machine learning techniques into the learning process, to provide better insight into the effects of node removal and initial learning parameter settings, will be a crucial part of future work.

Instead of using a GT [11, 12], the proposed network grows by adding new neurons at the end of each learning epoch. After the growth of new neurons, the cumulative errors of all neurons are reset to zero. This, on the one hand, gives each neuron the same opportunity to be a winner in each learning cycle; on the other hand, it greatly reduces computational complexity. The same strategy has been successfully applied in other SANN models [16]. Nevertheless, the development of a new version of the SF based on the calculation of Chi-square statistics to control the expansion of networks provides an important direction for future research. In addition, only two datasets were analysed in this paper; the behaviour of the proposed model in the analysis of other types of datasets deserves further investigation.

Like the SOM and other SANN models, the computational complexity of the proposed model is linear in the number of input samples. However, each training sample is presented multiple times, and the desired performance can only be achieved after gradual adaptation of the weights associated with each neuron. This may affect the performance of the model in large-scale applications; thus the scalability of the proposed model will be further investigated. Other future work includes a detailed comparison with other well-developed methods such as the SOM [10] and the *adaptive resonance theory* (ART) model [24].

It is worth noting that no single SANN algorithm can always perform best across different applications, so it is crucial to understand the key factors that may influence the selection of learning models. Given the similarity measure adopted in this study, the proposed model is most suitable when magnitude information needs to be included in measuring shape similarity. Examples include clustering analysis of gene expression data, in which shape information is the more critical factor while the absolute expression level associated with each gene under certain conditions should be taken into account as well [15].