International Journal of Machine Learning and Cybernetics, Volume 3, Issue 3, pp 173–182

Improving pattern discovery and visualisation with self-adaptive neural networks through data transformations

Authors

  • H. Zheng, School of Computing and Mathematics, Computer Science Research Institute, University of Ulster
  • Haiying Wang, School of Computing and Mathematics, Computer Science Research Institute, University of Ulster
Original Article

DOI: 10.1007/s13042-011-0050-z

Cite this article as:
Zheng, H. & Wang, H. Int. J. Mach. Learn. & Cyber. (2012) 3: 173. doi:10.1007/s13042-011-0050-z

Abstract

The ability to reveal relevant patterns in an intuitively attractive way through incremental learning makes self-adaptive neural networks (SANNs) a powerful tool to support pattern discovery and visualisation. Based on the combination of information related to both the shape and the magnitude of the data, this paper introduces a SANN which implements new similarity matching criteria and error accumulation strategies for network growth. It was tested on two datasets, including a real biological gene expression dataset. The results demonstrate several significant features of the proposed SANN model for improving pattern discovery and visualisation.

Keywords

Self-adaptive neural networks · Pattern discovery and visualisation · Similarity measure · Chi-square statistics

1 Introduction

1.1 An introduction to pattern discovery and visualisation

Pattern discovery is a fundamental operation in a data mining process [1]. The aim of pattern discovery is not merely to recognise patterns but to find useful patterns; more importantly it aims to reveal the hidden patterns intrinsic to the tasks investigated [2]. It has been well recognised that the discovery of relevant patterns hidden in the data may have important applications and implications in different application areas. In biosciences, for example, revealing significant patterns in gene expression profiles may lead to a better understanding of structural and functional genetic roles, aid in the prevention and diagnosis of complex diseases, and improve the design of highly effective therapies [3, 4].

Pattern visualisation is an alternative approach to support data mining and knowledge discovery. It is concerned with exploring data and information in such a way as to gain understanding and insight into the data studied [5]. An effective visualisation technique can provide a qualitative overview of large and complex datasets, summarise data, and assist in the identification of interesting properties, which may provide the basis for more focused quantitative analyses [6].

It has been suggested that there is a strong connection between pattern discovery and visualisation [7]. On one hand, the search for interesting patterns, as well as the understanding of the meanings of the patterns discovered, can be accomplished with the aid of visualisation techniques. These techniques can serve as a powerful tool for identifying structures, tendencies and relationships in data in an intuitive way [8]. On the other hand, large amounts of data often cannot be visualised directly, as the resulting graphical representation becomes too complex for users to take in. Pattern discovery techniques can therefore be used to capture and aggregate the significant information contained in the data, and the resulting structure can then be visualised [9]. In this sense, pattern discovery paves the way for users to intuitively access the bigger picture of the problem under study.

Due to their ability to learn from data and to reveal the relevant patterns in an intuitively attractive way through incremental learning, self-adaptive neural networks (SANNs) are becoming a powerful tool to support pattern discovery and visualisation [7].

1.2 SANNs: an overview

The modern era of SANNs started with the pioneering work of Teuvo Kohonen [10], who developed the Kohonen self-organising feature map (SOM), perhaps the most widely investigated self-organising neural network in the literature. The principal goal of the SOM algorithm is to transform high-dimensional input patterns into a one- or two-dimensional discrete map, and to perform this transformation adaptively in a topologically ordered fashion.

While demonstrating a number of features suitable for pattern discovery and visualisation, SOM-based approaches exhibit several limitations that hinder their performance. For a given application, the usefulness of a map produced by a trained SOM depends on how accurately it represents the input space. Moreover, the static grid representation of the SOM has negatively influenced its applications in pattern discovery and visualisation: for most applications, data visualisation is hard to achieve based on the raw grid map alone.

Over the last decade, there have been numerous attempts to improve the SOM. Due to their ability to dynamically organise themselves according to the natural structure of the underlying data, SANNs offer significant advantages in this endeavour [7]. For example, the Growing Self-Organizing Map (GSOM) [11, 12] starts with a map of 2 × 2 neurons, and new neurons are incrementally added at the boundary of the map where the network exhibits a large cumulative representation error. In the Growing Cell Structures (GCS) [13], the basic 2D grid of the SOM is replaced by a network of neurons whose connectivity defines a system of triangles; by separating neurons into disconnected areas, the GCS can produce an explicit representation of cluster boundaries. The self-organising tree algorithm (SOTA) [14] is another example of a SANN. The initial topology of SOTA is composed of two terminal neurons connected by an internal neuron, and the learning process starts with the presentation of each sample from the training set. SOTA has demonstrated interesting features for pattern discovery and visualisation; for example, its nested output structure, in which neurons at each level are averages of the samples below, makes it straightforward to compare patterns at different hierarchical levels.

It is worth noting that the development of SANNs has traditionally focused on data domains assumed to follow a Gaussian distribution; the similarity between data samples is largely measured using Euclidean distance and/or Pearson correlation-based distance metrics. It has been found, however, that the Pearson correlation coefficient is too sensitive to the shape of the data, while the Euclidean distance mainly considers the magnitude of changes [15]. More recently, Zheng et al. [16] developed a Poisson-based SANN, which is tailored to knowledge discovery in Serial Analysis of Gene Expression (SAGE) data. While successfully addressing the probability model governing SAGE data, the Poisson-based similarity measure ignores the direction of departure, as pointed out by Kim et al. [15]. They proposed a new similarity measure that considers data magnitude when measuring shape similarity and implemented it within a k-means clustering procedure, which inevitably exhibits some of the limitations of the traditional k-means algorithm: for example, it requires users to specify the number of clusters in advance, and its outcome is an unorganised collection of clusters that is not conducive to proper interpretation [17].

1.3 Objectives of this study

This paper presents a new SANN model, which takes both magnitude and shape information into account to improve pattern discovery and visualisation. The main objective of this study is, based on the adaptation of a Chi-square statistic-based similarity measure computed in a transformed feature space, to implement new strategies for determining a winning node and for network growth.

The remainder of this paper is organised as follows. Section 2 presents a detailed description of the new SANN learning algorithm, followed by an introduction of the two datasets used in this study. Results and a comparative analysis are presented in Sect. 4. The paper concludes with a discussion of the results and future research.

2 The proposed SANN

2.1 Learning algorithm

Like the GSOM [11], the proposed SANN model is randomly initialized with four neurons on a 2D grid. Such an initial structure allows the network to grow in any direction, depending solely on the input data [11].

Once the network has been initialized, each input sample xi is sequentially presented, and each presentation involves two basic operations: (a) finding the winning neuron for the input sample; and (b) updating the weights associated with both the winning neuron and its topological neighbours. However, by utilizing the Chi-square statistic-based similarity measure computed in a transformed feature space, the proposed SANN implements new strategies for determining a winning neuron and initiating neuron growth, which distinguish it from other SANN models. The main features of the proposed model include:
  • Matching criterion for finding a winning neuron

Determining a winning neuron for each input sample is a fundamental process in SANN models. Euclidean distance is perhaps the most widely used matching criterion in the development of SANN models. Being particularly sensitive to outliers and changes in magnitude, this distance measure has demonstrated poor performance in several applications [16]. To target situations where magnitude information needs to be taken into account when measuring shape similarity, the proposed SANN model introduces the following Chi-square-based matching criterion.

Let xi be the input vector representing the ith n-dimensional input sample and wj be the weight vector associated with the jth neuron, where xi and wj are denoted by:
$$ x_{i} = [x_{i,1}, x_{i,2}, \ldots, x_{i,n}]^{T} $$
(1)
$$ w_{j} = [w_{j,1}, w_{j,2}, \ldots, w_{j,n}]^{T} $$
(2)
Given that, after learning, each weight vector wj coincides with the centroid of all the data samples assigned to its neuron, the dispersion between xi and wj can be estimated using the Chi-square statistic, which accounts for both magnitude and shape information as shown below.
$$ d(i,j) = \sum_{k=1}^{n} \frac{(x_{i,k} - w_{j,k})^{2}}{w_{j,k}} $$
(3)
The lower the value of d(i,j), the more likely it is that sample i is assigned to neuron j. Thus, the winning neuron wc, signified by the subscript c, is determined by the following smallest-distance criterion:
$$ d(i,c) = \min_{j} \, d(i,j) $$
(4)
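As a minimal sketch (in Python, our choice; the paper prescribes no implementation language), Eqs. 3 and 4 can be realised as follows. The small eps guard against division by zero is our own addition and assumes non-negative data, such as counts.

```python
import numpy as np

def chi_square_distance(x, w, eps=1e-12):
    """Chi-square dissimilarity between input x and weight vector w (Eq. 3)."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    return np.sum((x - w) ** 2 / (w + eps))

def find_winner(x, weights):
    """Winning neuron by the smallest-distance criterion (Eq. 4)."""
    return int(np.argmin([chi_square_distance(x, w) for w in weights]))
```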
As pointed out by Kim et al. [15], while penalizing the deviation from the expected value, such a measure ignores the direction of the difference and may thus lose some shape information. To address this, instead of calculating the deviation in the original space, we measure it in a transformed feature space constructed from the mutual differences of the original vector components [15]. For a given T-dimensional vector Vi = (Vi(1),…, Vi(T)), the transformed vector is the T(T−1)/2-dimensional vector with components Vi(t1) − Vi(t2) for t1 = 1,…, T−1 and t2 = t1+1,…, T. Accordingly, the transformed xi and wj vectors can be expressed as
$$ x_{i} = [x_{i,1} - x_{i,2},\; x_{i,1} - x_{i,3},\; \ldots,\; x_{i,t_{1}} - x_{i,t_{2}},\; \ldots,\; x_{i,n-1} - x_{i,n}]^{T}, \quad t_{1} = 1, \ldots, n-1, \;\; t_{2} = t_{1}+1, \ldots, n $$
(5)
$$ w_{j} = [w_{j,1} - w_{j,2},\; w_{j,1} - w_{j,3},\; \ldots,\; w_{j,t_{1}} - w_{j,t_{2}},\; \ldots,\; w_{j,n-1} - w_{j,n}]^{T}, \quad t_{1} = 1, \ldots, n-1, \;\; t_{2} = t_{1}+1, \ldots, n $$
(6)
Thus, the deviation between the observed values (the input vector xi) and the expected values (the weight vector wj) can be measured using the following statistic:
$$ d(i,j) = \sum_{k_{1}=1}^{n-1} \sum_{k_{2}=k_{1}+1}^{n} \frac{\left( (x_{i,k_{1}} - x_{i,k_{2}}) - (w_{j,k_{1}} - w_{j,k_{2}}) \right)^{2}}{w_{j,k_{1}} + w_{j,k_{2}}} $$
(7)
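A hedged sketch of the transform (Eqs. 5, 6) and of the transformed-space dissimilarity (Eq. 7) follows; the function names and the eps guard are ours.

```python
import numpy as np

def pairwise_differences(v):
    """Map an n-vector to the n(n-1)/2 vector of mutual component
    differences v[t1] - v[t2] for t1 < t2 (Eqs. 5 and 6)."""
    v = np.asarray(v, float)
    i, j = np.triu_indices(len(v), k=1)
    return v[i] - v[j]

def transformed_chi_square(x, w, eps=1e-12):
    """Chi-square dissimilarity in the transformed space (Eq. 7),
    normalising each squared deviation by w_k1 + w_k2."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    i, j = np.triu_indices(len(x), k=1)
    return np.sum(((x[i] - x[j]) - (w[i] - w[j])) ** 2 / (w[i] + w[j] + eps))
```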
  • Error accumulation strategy for initiating the growth of new neurons.

Traditionally, a quantisation error (E), computed from the Euclidean distance between the input sample and the winning neuron and accumulated over each learning cycle, is used to estimate when and where to insert new neurons. It has been found, however, that this measure is overly sensitive to changes in magnitude and may fail to correctly guide the growth of new neurons [16]. In this study, a deviation based on the Chi-square statistic measured in the transformed space is adopted, and a cumulative deviation (E) is calculated for each winning neuron over each learning cycle using the following formula:
$$ E_{c}(t+1) = E_{c}(t) + \sum_{k_{1}=1}^{n-1} \sum_{k_{2}=k_{1}+1}^{n} \frac{\left( (x_{k_{1}} - x_{k_{2}}) - (w_{c,k_{1}} - w_{c,k_{2}}) \right)^{2}}{w_{c,k_{1}} + w_{c,k_{2}}} $$
(8)
where wc,k is the kth feature of the winning neuron c, xk is the kth feature of the input vector x in the original feature space, and Ec(t) is the cumulative deviation at time t.
After a learning epoch, new neurons are generated at all free neighbouring positions of the neuron with the highest deviation value, as illustrated in Fig. 1. In order to fit into the existing neighbourhood, a strategy similar to the one used by the GSOM for weight initialization of new neurons is implemented [11, 12].
Fig. 1

New neuron generation process for the proposed SANN model. a The topology before generation. b The accumulation of errors during the learning process; the neuron marked with a filled circle has the highest cumulative error after a learning epoch. c Neuron growth at all free neighbouring positions; the neurons marked with shaded circles are newly generated

  • The mechanism that determines when to generate a new neuron

In the original GSOM algorithm [11], the network keeps track of the highest cumulative quantisation error, Emax, during the growing phase and periodically compares it with a user-defined growth threshold (GT). When Emax exceeds the GT, the growth of neurons is initiated. To provide effective control over the growth of a network, a spread factor (SF) was introduced. However, the introduction of the SF assumes that node growth is guided by cumulative quantisation errors measured by Euclidean distance. As a new error accumulation strategy has been introduced in this study, such a mechanism for initiating network growth is arguably not suitable for the proposed SANN model.

In an attempt to reduce computational complexity, the proposed model adopts the following approach. After each learning epoch, the growth of new neurons is initiated at the neuron with the highest cumulative error; after that, the cumulative errors of all neurons are reset to zero. This, on the one hand, eliminates the need for redistributing error information; on the other hand, it gives every node the same opportunity to be a winner in the next learning cycle.
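This epoch-end step could be sketched as follows (our own Python rendering; the grid bookkeeping and the GSOM-style weight initialisation are only stubbed as a callback):

```python
def epoch_end_growth(errors, positions, grow_at):
    """errors: dict neuron_id -> cumulative deviation (Eq. 8)
    positions: dict neuron_id -> (row, col) grid coordinate
    grow_at: callback inserting a neuron at a free grid cell,
    initialising its weights GSOM-style [11, 12]."""
    winner = max(errors, key=errors.get)            # highest cumulative deviation
    r, c = positions[winner]
    occupied = set(positions.values())
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        cell = (r + dr, c + dc)
        if cell not in occupied:                    # free neighbouring position
            grow_at(cell)
    for k in errors:                                # reset: every neuron starts
        errors[k] = 0.0                             # the next epoch on equal footing
```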

Like the GSOM algorithm [11], the smoothing phase is initiated at the end of the growing process. The implementation of the smoothing phase is similar to the one used in the GSOM [11]. Table 1 summarizes the learning algorithm used by the proposed model.
Table 1

Summary of the proposed SANN algorithm

1: Initialization: Start the network with four neurons on a 2D grid

                           Initialize each neuron with random values

2: Repeat growing

3:        For each learning cycle

4:                 Select a sample, xi from the input dataset

5:                 Compute the distance between the input sample, xi, and each neuron, wj, using (7)

6:                 Find the winning neuron using the minimum-distance criterion according to (4)

7:                 Update the weights of the winning neuron and its neighbours (the same as GSOM)

8:                 Increase the error value of the winner using (8)

9:         Find a neuron with the highest cumulative deviation and initiate the growth of new neurons

10: Until a stopping criterion is satisfied (the learning process is normally stopped when computational bounds, such as the number of learning epochs, are exceeded, or when the quantisation error of the neurons in the network falls below a given threshold)

11: Repeat smoothing

12:         For each learning cycle

13:                 Present each input sample

14:                 Determine the winning neuron

15:                 Update the weights of the winner and its immediate neighbours

16: Until the error values of the neurons become very small or computational bounds are exceeded
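For illustration, the whole procedure of Table 1 can be condensed into the following hedged Python sketch. The class name, the Manhattan-neighbourhood update, the halved smoothing learning rate and the copy-based initialisation of new weights are our simplifications of the GSOM-style details the paper refers to.

```python
import numpy as np

def transformed_chi_square(x, w, eps=1e-12):
    i, j = np.triu_indices(len(x), k=1)            # Eq. 7 in the transformed space
    return np.sum(((x[i] - x[j]) - (w[i] - w[j])) ** 2 / (w[i] + w[j] + eps))

class ProposedSANN:
    """Condensed sketch of Table 1; not the authors' implementation."""

    def __init__(self, dim, alpha0=0.1, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.alpha = alpha0
        self.pos = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}   # step 1
        self.w = {k: self.rng.random(dim) for k in self.pos}      # random init
        self.err = {k: 0.0 for k in self.pos}

    def winner(self, x):                           # steps 5-6
        return min(self.w, key=lambda k: transformed_chi_square(x, self.w[k]))

    def update(self, x, c):                        # step 7 (simplified schedule)
        rc = self.pos[c]
        for k, p in self.pos.items():
            if abs(p[0] - rc[0]) + abs(p[1] - rc[1]) <= 1:
                self.w[k] += self.alpha * (x - self.w[k])

    def grow(self):                                # step 9
        c = max(self.err, key=self.err.get)
        r, q = self.pos[c]
        nid = max(self.pos) + 1
        for dr, dq in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            cell = (r + dr, q + dq)
            if cell not in self.pos.values():      # free neighbouring position
                self.pos[nid] = cell
                self.w[nid] = self.w[c].copy()     # stand-in for GSOM init
                self.err[nid] = 0.0
                nid += 1
        self.err = {k: 0.0 for k in self.err}      # reset cumulative errors

    def fit(self, X, growing_epochs=10, smoothing_epochs=10):
        X = np.asarray(X, float)
        for _ in range(growing_epochs):            # steps 2-10
            for x in X:
                c = self.winner(x)
                self.update(x, c)
                self.err[c] += transformed_chi_square(x, self.w[c])   # step 8
            self.grow()
        self.alpha *= 0.5                          # smaller rate for smoothing
        for _ in range(smoothing_epochs):          # steps 11-16
            for x in X:
                self.update(x, self.winner(x))
```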

2.2 Learning parameters

In this study, N0 is the size of the initial neighbourhood, α0 is the initial learning rate, and NLE stands for the number of learning epochs. For a training set T of n cases, a learning epoch refers to n learning cycles, within which the network is sequentially presented with each training sample. Unless indicated otherwise, the parameters reported in this paper are as follows: for the Iris data, N0 = 4, α0 = 0.1, and the maximum NLE (growing phase) = 10, NLE (smoothing phase) = 10; for the mouse retinal SAGE data, N0 = 6, α0 = 0.01, and the maximum NLE (growing phase) = 30, NLE (smoothing phase) = 50. The numbers beside each neuron shown on the output maps depict the order in which they were added during the growth phase. The rectangular neurons illustrated in the resulting maps represent dummy neurons, which accumulated zero hits at the end of the learning phase. The datasets analysed in this paper are described in the following section.
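For convenience, the reported settings can be collected into plain configuration mappings (a trivial sketch; the key names are ours, following the notation above):

```python
# Learning parameters as reported in Sect. 2.2 (key names are ours).
IRIS_PARAMS = {"N0": 4, "alpha0": 0.1,  "NLE_growing": 10, "NLE_smoothing": 10}
SAGE_PARAMS = {"N0": 6, "alpha0": 0.01, "NLE_growing": 30, "NLE_smoothing": 50}
```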

3 The datasets under study

3.1 Iris dataset

The Iris data published by Fisher [18] have been widely used as a benchmark dataset in discriminant and cluster analysis. This real-life dataset contains 150 Iris flowers from three species, each described by four measurements: petal length, petal width, sepal length and sepal width. There are three classes of 50 samples each, two of which (Versicolor and Virginica) are not linearly separable. The statistical description of each feature is given in Table 2. The dataset can be freely downloaded from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html).
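For readers who wish to reproduce the benchmark, the same data can also be obtained programmatically, e.g. via scikit-learn (our choice, not used in the paper):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target              # 150 samples x 4 features, 3 classes
print(X.min(axis=0), X.max(axis=0))        # compare against Table 2
print(X.mean(axis=0), X.std(axis=0))
```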
Table 2

The statistical description of each feature in the Iris dataset

Feature   Min   Max   Mean   Standard deviation
1         4.3   7.9   5.84   0.83
2         2.0   4.4   3.05   0.43
3         1.0   6.9   3.76   1.76
4         0.1   2.5   1.20   0.76

Table 3

Functional categorization of the 153 mouse retinal SAGE tags (125 developmental genes; 28 non-developmental genes)

Cluster    Number of SAGE tags
Early I    32
Early II   34
Late I     32
Late II    27
Non-dev.   28

3.2 Mouse retinal gene expression data

This dataset was constructed using SAGE, which allows the simultaneous analysis of thousands of genes without the need for prior gene sequence information [23]. It consists of 10 SAGE libraries (38,818 unique tags with tag counts ≥2) from developing retina taken at 2-day intervals from embryonic day 12.5 (E12.5) to postnatal day 10.5 (P10.5), plus adult retina [19]. Among the 38,818 tags, there are 1,467 tags with abundance levels equal to or greater than 20 in at least one of the ten libraries. In this study, a subset of 153 SAGE tags with known biological functions provided by Kim et al. [15] was used to test the algorithms. These 153 SAGE tags are divided into five classes based on their expression levels across the developmental stages, as shown in Table 3. Each tag is represented by its counts in the ten SAGE libraries, i.e. E12.5, E14.5, E16.5, E18.5, P0.5, P2.5, P4.5, P6.5, P10, and adult retina.
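The abundance filter described above is straightforward to express in code; the following hedged sketch assumes a (tags × libraries) count matrix and uses our own names:

```python
import numpy as np

def filter_abundant_tags(counts, threshold=20):
    """Keep tags whose count is >= threshold in at least one library.
    counts: array of shape (n_tags, n_libraries), e.g. (38818, 10)."""
    counts = np.asarray(counts)
    mask = (counts >= threshold).any(axis=1)
    return counts[mask], mask              # ~1,467 tags retained for this data
```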

4 Results

4.1 Analysis of iris data

To demonstrate the significant features exhibited by the proposed model in pattern discovery and visualisation, we first tested it on the well-known Iris dataset. As shown in Fig. 2, the resulting map (Fig. 2a) has branched out in two directions, indicating that one group is clearly separated from the other observations. In Branch B, all samples belong to the Setosa class, while the majority of the Versicolor and Virginica samples are assigned to Branch A. This is consistent with the characteristics of the Iris dataset, in which the Setosa class is linearly separable from the other two classes (Versicolor and Virginica).
Fig. 2

The resulting maps for the Iris dataset with the initial learning rate set to 0.1

Based on further analysis with different learning parameters on areas of interest, hierarchical and multi-resolution clustering may be implemented. Figure 2b is the spread-out version of Branch A produced with the learning rate set to 0.05. The output has branched in two directions, suggesting that there may be two classes in this branch. This coincides with the sample distribution over the two sub-branches, where 97.6% of the samples in Sub-Branch A1 belong to the Versicolor class, and 90.4% of the samples in Sub-Branch A2 belong to the Virginica class.

4.2 Analysis of mouse retina gene expression data

For further validation, the model was tested on a real biological dataset. This dataset consists of ten murine SAGE libraries, in which the count of each SAGE tag is approximately Poisson distributed [20]. We first applied the GSOM model [12] to these data. As shown in Fig. 3, the map branched out in several directions and samples from the same cluster were found to be scattered across different locations. For example, the tags categorised as Late II can be found in Nodes 83, 72, 79, 86, and 77. Interpreting such a map may prove rather difficult; it fails to provide an effective platform for visualising the intrinsic structure of these data.
Fig. 3

The resulting map of the mouse retina SAGE data based on the GSOM model (N0 = 4, α0 = 0.1)

The same data were used to train the proposed model. A representative map produced by the proposed SANN model is shown in Fig. 4a. The class distribution over each neuron is given in Table 4.
Fig. 4

The resulting map of the mouse retina SAGE data using the proposed SANN model

Table 4

The sample distribution over each branch

Branch   Node   Early I   Early II   Late I   Late II   Non-dev
A        88     0         5          0        0         0
A        86     5         1          0        0         0
A        85     3         1          0        0         0
A        87     0         5          0        0         0
A        83     8         0          0        0         0
A        77     6         0          0        0         0
A        71     0         5          0        0         0
A        70     0         6          0        0         0
A        69     0         6          0        0         0
A        57     0         3          0        0         0
A        55     3         0          0        0         0
A        52     6         0          0        0         0
A        50     0         1          0        0         0
B        66     0         0          0        3         0
B        67     0         0          3        1         0
B        74     0         0          1        3         0
B        75     0         0          1        0         0
B        80     0         0          1        1         0
B        76     0         0          1        0         0
B        81     0         0          1        0         0
B        82     0         0          2        2         0
C        1      0         0          0        0         6
C        3      1         0          0        0         2
C        5      0         0          0        0         4
C        0      0         0          0        0         8
C        2      0         1          0        0         1
C        4      0         0          0        0         2
C        7      0         0          0        0         1
C        6      0         0          0        0         2
C        10     0         0          0        0         2

Due to its self-adaptive properties, after training the proposed network can reveal the clusters hidden in the data through its shape, allowing users to identify relevant patterns with relatively little effort. The resulting map shown in Fig. 4a has branched out in three directions, indicating the main groups of the data. An analysis of the sample distribution over each branch (see Table 4) indicates that each branch is associated with certain expression patterns encoded in this SAGE data. For example, all the samples assigned to Branch A belong to the clusters that exhibit higher expression in the early stage of embryonic development (Early I and Early II classes). A similar observation can be made for Branch B: all the samples found there have higher mRNA expression in postnatal periods (Late I and Late II classes). Branch C, on the other hand, consists of all 28 tags unrelated to mouse retinal development (Non-dev class).

The platform proposed in this paper can be used to implement multi-resolution and hierarchical clustering on areas of interest. A spread-out version of Branch A is shown in Fig. 4b, in which the samples belonging to Early I and Early II are located in two well-differentiated regions (Region A1 is composed of 31 Early I tags and six Early II tags, and Region A2 consists of 27 Early II tags). To assess the level of cluster enrichment for each region, the following hypergeometric probability [21] is calculated:
$$ p = 1 - \sum_{i=0}^{k-1} \frac{\binom{K}{i} \binom{N-K}{n-i}}{\binom{N}{n}} $$
(9)
where K is the number of tags that fall into a region, k is the number of class members in the region, N is the total number of SAGE tags, and n is the number of tags belonging to the class in the whole dataset. For example, in Region A1, 31 out of the 32 Early I tags are grouped together (p < 10−20). All 27 samples found in Region A2 belong to the Early II class (p < 10−20). Based on these figures, one may confidently conclude that Regions A1 and A2 are significantly associated with the Early I and Early II classes, respectively.
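A minimal sketch of Eq. 9 using SciPy's hypergeometric survival function (our choice of library; the paper does not state how the p-values were computed):

```python
from scipy.stats import hypergeom

def enrichment_p(N, K, n, k):
    """Eq. 9 in the paper's notation: N = total number of SAGE tags,
    K = tags falling into the region, n = class size in the whole
    dataset, k = class members observed in the region.
    hypergeom.sf(k-1, ...) returns P(X >= k) = 1 - P(X <= k-1)."""
    return hypergeom.sf(k - 1, N, K, n)

# Region A1: 37 tags (31 Early I + 6 Early II) out of 153 tags in total;
# 31 of the 32 Early I tags fall in this region.
print(enrichment_p(N=153, K=37, n=32, k=31))   # consistent with p < 1e-20
```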

Like the GSOM [11], when new nodes need to be added to the network, the proposed model grows new neurons at all free immediate neighbouring positions. This may create some redundant nodes, such as dummy nodes, i.e., nodes that accumulated zero hits during the training phase. Interestingly, such dummy nodes can be used to aid the interpretation of the cluster structure. For example, in Fig. 4b, Regions A1 and A2 are well separated from each other by the area covered by the dummy nodes 0, 4, 6, 9, 12, 15, 18, 21, 24, 27, and 32.

In summary, the GSOM model fails to achieve its clustering goals when applied to this real biological dataset: samples from the same cluster were scattered to different regions, and the resulting map did not reflect the intrinsic data structure; for example, no specific region was found to be significantly linked to the Late II class (p > 0.05). In contrast, the maps derived using the proposed network successfully reveal significant patterns encoded in the data: similar samples are grouped together and, by means of the hypergeometric distribution, each branch or each region separated by the dummy nodes is found to be highly represented by certain classes (p < 10−20). Such a map may greatly facilitate the study of significant patterns hidden in the data.

4.3 An interactive platform

The proposed SANN-based pattern discovery and visualisation system was developed on the platform of Wang et al. [12], as shown in Fig. 5. It mainly consists of the following components: Main Toolbar, Neuron Pane, Resulting Map Pane and Status Bar. The platform can be used for various purposes; in this study, it has been used to implement the following tasks.
Fig. 5

The interface for the proposed SANN-based pattern discovery and visualisation system. It was developed based on the platform published by Wang et al. [12]

  • Visual inspection of the resulting map to gain an overall idea of the structure of the data. The proposed model can indicate the clusters by its shape, allowing the human visual system to detect and identify the patterns hidden in the data with relatively little effort.

  • Hierarchical and multi-resolution clustering analysis. A finer analysis of areas of interest can subsequently be carried out using different learning parameters. Based on the spread-out versions of selected regions produced in this way, hierarchical and multi-resolution clustering may be implemented.

5 Conclusion

Pattern discovery and visualisation are two important operations in a data mining process. The ability to reveal the significant patterns hidden in the data in an intuitively attractive way through incremental learning makes SANNs a powerful tool to support pattern discovery and visualisation. Using a Chi-square statistic to measure, in a transformed feature space, the dispersion between input samples and the weights associated with each neuron, this paper presented a SANN model that implements a new matching criterion for determining a winning node and new error accumulation strategies for node growth. The results obtained in this study demonstrate several significant features of the proposed model for pattern discovery and visualisation. For example, based on its self-adaptive capability, similar data are successfully clustered. Like existing SANN models [11, 12, 16], the resulting adaptive map automatically models the data under study. Upon completion of the learning process, the proposed model can develop into different shapes to reflect significant patterns hidden in the data. By keeping a regular 2D structure at all times, the model allows a user to perform pattern discovery and visualisation at the same time with relative ease. By performing a finer analysis of areas of interest with different learning parameters, the interactive data mining platform provided in this study can be used to implement hierarchical and multi-resolution clustering.

Instead of estimating the exact position where new nodes need to be inserted, the proposed model generates new nodes at all free immediate neighbouring locations [11, 12]. This may inevitably introduce some irrelevant neurons, which may degrade the quality of the visualisation of the output maps. Moreover, as there is at present no standard way to determine the best combination of learning parameters, optimal learning parameters are estimated by conducting several experiments. Thus, incorporating neuron pruning methods [22] and other machine learning techniques into the learning process, to provide better insight into the effects of node removal and initial learning parameter settings, will be a crucial part of future work.

Instead of using a GT [11, 12], the proposed network grows by adding new neurons at the end of each learning epoch. After the growth of new neurons, the cumulative errors of all neurons are reset to zero. This, on the one hand, gives each neuron the same opportunity to be a winner in each learning cycle; on the other hand, it greatly reduces computational complexity. The same strategy has been successfully applied in other SANN models [16]. Nevertheless, the development of a new version of the SF, based on the calculation of Chi-square statistics, to control the expansion of networks provides an important direction for future research. In addition, only two datasets were analysed in this paper; the behaviour of the proposed model on other types of datasets deserves further investigation.

Like the SOM and other SANN models, the computational complexity of the proposed model is linear in the number of input samples. However, each training sample is presented multiple times, and the desired performance can only be achieved after gradual adaptation of the weights associated with each neuron. This may affect the performance of the model in large-scale applications; thus the scalability of the proposed model will be further investigated. Other future work includes a detailed comparison with other well-developed methods such as the SOM [10] and the adaptive resonance theory (ART) model [24].

It is worth noting that no single SANN algorithm can always perform best across different applications; it is crucial to understand the key factors that may influence the selection of learning models. Given the similarity measure adopted in this study, the proposed model is most suitable when magnitude information needs to be included in measuring shape similarity. Examples include the clustering analysis of gene expression data, in which shape information is the more critical factor but the absolute expression level associated with each gene under certain conditions should be taken into account as well [15].

Copyright information

© Springer-Verlag 2011