
1 Introduction

Several topics and issues have recently emerged, such as the availability of big and/or high-dimensional data, petascale HPC systems aiming towards exascale supercomputers, and the need to process this kind of data. Large high-dimensional data collections are commonly available in areas like medicine, biology, information retrieval, web analysis, social network analysis, image processing, financial transaction analysis and many others. To process such data, unsupervised learning algorithms such as Self Organizing Maps (SOM) or Growing Neural Gas (GNG) are usually used. Various aspects of the parallel implementation of these algorithms on HPC systems have been studied, e.g. in [12, 13].

One of the persistent issues is the efficient utilization of computational resources. For SOM or GNG learning algorithms there are two challenges: the first is the fast computation of similarity in a high-dimensional space, and the second is an ideally uniform distribution of the computational load among individual CPU cores.

A parallel implementation usually allocates one CPU core to a group of neurons, evaluates the similarity of these neurons to a given input vector, finds the local best matching neuron, and then uses some form of communication to find the global best matching neuron in the whole neural network. The CPU cores are allocated regularly, using some pattern [12], regardless of the distribution of input vectors over neurons (i.e. CPU cores), which causes a bottleneck in the parallel learning algorithm. To reduce this bottleneck, the input vectors are preprocessed using a small SOM and a clustering algorithm. This allows us to improve the distribution of neurons over CPU cores and subsequently to speed up the learning algorithm itself.
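For illustration only, the following minimal sketch (not the authors' implementation) shows this local/global best-matching-neuron search pattern using mpi4py and NumPy; the array shapes and variable names are hypothetical.

```python
# Minimal sketch of the parallel best-matching-neuron search described above.
# Assumes mpi4py and NumPy; shapes and names are hypothetical.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each core owns a block of neuron weight vectors.
local_weights = np.random.rand(64, 1000)

# The current input vector is broadcast from the root process.
x = np.random.rand(1000) if rank == 0 else None
x = comm.bcast(x, root=0)

# Local best matching neuron: smallest Euclidean distance to the input vector.
dists = np.linalg.norm(local_weights - x, axis=1)
local_best = float(dists.min())

# Global best matching neuron: gather the local minima and pick the winner.
all_best = comm.allgather(local_best)
winner_rank = int(np.argmin(all_best))
if rank == winner_rank:
    winner_index = int(np.argmin(dists))  # index within this core's block
```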

The paper is organized as follows. Sect. 2 briefly describes the neural networks used: SOM and GNG. The input vector preprocessing algorithm is presented in Sect. 3. Experimental results are given in Sect. 4.

2 Artificial Neural Networks

In this section we describe two types of neural networks: the first is the Self Organizing Map and the second is Growing Neural Gas. Their combination is then presented in Sect. 3.

2.1 Self Organizing Maps

Self Organizing Maps (SOMs), also known as Kohonen maps, were proposed by Kohonen in 1982 [4]. SOM is a kind of artificial neural network that is trained by unsupervised learning. Using SOM, the input space of training samples can be represented in a lower-dimensional (often two-dimensional) space [5], called a map. Such a model is efficient for structure visualization because it preserves the topology of the input space by means of a neighbourhood function.

SOM consists of two layers of neurons: an input layer that receives and transmits the input information, and an output layer, the map that represents the output characteristics. The output layer is commonly organized as a two-dimensional rectangular grid of nodes, where each node corresponds to one neuron. Both layers are feed-forward connected; each neuron in the input layer is connected to each neuron in the output layer. A real number, or weight, is assigned to each of these connections.
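As a minimal, generic sketch of these ideas (not necessarily the exact variant used in this paper), one SOM adaptation step with a Gaussian neighbourhood function can be written as follows; the map size, dimension and learning parameters are illustrative.

```python
# Minimal sketch of one generic SOM adaptation step; sizes and parameters
# are illustrative, not those used in the experiments.
import numpy as np

rows, cols, dim = 5, 5, 10
rng = np.random.default_rng(0)
weights = rng.random((rows, cols, dim))       # one weight vector per output neuron
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

def train_step(x, lr=0.1, sigma=1.0):
    # Best matching unit: output neuron whose weight vector is closest to x.
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Gaussian neighbourhood function on the two-dimensional grid around the BMU.
    grid_d2 = ((grid - np.array(bmu)) ** 2).sum(axis=-1)
    h = np.exp(-grid_d2 / (2 * sigma ** 2))
    # Move the weights of the BMU and its neighbours towards the input vector.
    weights[:] = weights + lr * h[..., None] * (x - weights)

train_step(rng.random(dim))
```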

2.2 Growing Neural Gas

Growing Neural Gas is represented by an undirected graph which need not be connected. Generally, there are no restrictions on the topology. The graph is generated and continuously updated by competitive Hebbian learning [7, 9]. According to pre-set conditions, new neurons are automatically added, and connections between neurons age over time and can be removed. GNG can be used for vector quantization by finding the code-vectors of clusters [3], as well as for image compression and disease diagnosis.

GNG works by modifying the graph, where the operations are the addition and removal of neurons and edges between neurons.

To understand the functioning of GNG, it is necessary to define the learning algorithm. The algorithm published in [12] is based on the original algorithm [2, 3], but it is modified for better continuity with the SOM algorithm.

Remark. The notation used in the paper is briefly listed in Table 1.

Table 1. Notation used in the paper

3 Combination of SOM and GNG

In our previous paper [12], we focused on a combination of SOM and GNG. The basic idea was to pre-process the input data with SOM, which yields clusters of similar data. Subsequently, we created as many GNG networks as there are clusters and assigned each cluster to one GNG. Each GNG builds its own neural map, and after the learning process is finished, the results are merged. The parallelization scheme that speeds up the computation is shown in Fig. 1; its top layer of parallelization (SOM) is described in a previous paper [13]. In this section we describe an improved method focused on creating the clusters of input data and on optimizing the resources used.

Fig. 1. Parallel algorithm

To improve the efficiency of the parallelization, the clusters of input data for the GNG networks need to contain approximately the same number of input vectors, because a GNG that is assigned more input vectors takes longer to compute. In the past, we used a spanning tree algorithm [12] to create the clusters; it does not, however, reflect the number of input vectors in the clusters. To obtain clusters of neurons we now use two different clustering approaches. The first is agglomerative hierarchical clustering with different methods for determining the distance between clusters; practically useful methods are Ward's method, Centroid-linkage and Average-linkage (AVL), see Fig. 2. The second is the PAM algorithm [8], which, like the hierarchical methods above, operates on a matrix of distances between the neurons of the output layer. The clusters formed in this way are the input to the following algorithm, which subdivides the neurons of the SOM output layer into clusters containing numbers of input vectors that are as close to each other as possible. The clusters are created on the basis of Algorithm 1.
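As a minimal sketch of the hierarchical variants (assuming SciPy and a trained SOM whose output-layer weight vectors are stored row-wise in a NumPy array), the clusterings can be obtained as follows; the weights and cluster count below are illustrative only.

```python
# Minimal sketch: agglomerative hierarchical clustering of the SOM output-layer
# neurons from their pairwise Euclidean distances (SciPy). The random weights
# merely stand in for a trained 5 x 5 SOM; the cluster count is illustrative.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
neuron_weights = rng.random((25, 1000))     # 5 x 5 output layer, one row per neuron

# Condensed matrix of Euclidean distances between output-layer neurons
# (Ward's and Centroid methods are only well defined for Euclidean distances).
dist = pdist(neuron_weights, metric="euclidean")

# Ward's method, Centroid-linkage and Average-linkage, cf. the dendrograms in Fig. 2.
clusters = {}
for method in ("ward", "centroid", "average"):
    Z = linkage(dist, method=method)
    clusters[method] = fcluster(Z, t=8, criterion="maxclust")   # e.g. 8 clusters
```

PAM (k-medoids) can be run on the same distance matrix with any k-medoids implementation that accepts precomputed distances, for example the KMedoids estimator of scikit-learn-extra.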

Fig. 2. Dendrograms of hierarchical clustering (Average-linkage and Ward's method) of the output layer of a \(5 \times 5\) SOM

Algorithm 1

We propose an optimization of the computation of the GNG networks that aims at maximum utilization of the allocated resources, under the condition that the computing time must remain similar. The principle of the optimization is that an individual computing resource may process several GNG networks instead of only one, as it has until now. The overall functionality is described in Algorithm 2. Two facts regarding its point 4 should be mentioned. Firstly, it is a variation of the well-known Subset sum problem [6], which is NP-complete [1]; however, the output map of the SOM was defined to be small, so the maximum possible number of clusters is also small, and the complexity is therefore negligible. Secondly, the maximum limit is reduced by 10 %, because additional time for I/O operations must be added for each GNG network.

Algorithm 2
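Since the pseudocode of Algorithm 2 is not reproduced here, the following sketch only illustrates the grouping idea in a simple greedy form; the capacity rule (90 % of the largest cluster size) is an assumption made for this illustration, not necessarily the exact rule of Algorithm 2.

```python
# Illustrative sketch of grouping GNG networks (clusters of input vectors) onto
# computing resources. The capacity rule (90 % of the largest cluster size) is
# an assumption for illustration; Algorithm 2 in the paper defines the actual rule.

def group_clusters(cluster_sizes):
    capacity = 0.9 * max(cluster_sizes)     # limit reduced by 10 % for I/O overhead
    # First-fit decreasing: try to place each cluster into an existing group.
    groups, loads = [], []
    for idx in sorted(range(len(cluster_sizes)), key=lambda i: -cluster_sizes[i]):
        size = cluster_sizes[idx]
        for g, load in enumerate(loads):
            if load + size <= capacity:
                groups[g].append(idx)
                loads[g] += size
                break
        else:
            groups.append([idx])            # open a new group (computing resource)
            loads.append(size)
    return groups

# Example: eight clusters of unequal size are packed onto fewer resources.
print(group_clusters([5200, 4800, 2100, 1900, 1500, 900, 600, 300]))
```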

4 Experiments

In this section we describe the datasets used and present the experiments performed with them.

4.1 Experimental Datasets and Hardware

Two datasets were used in the experiments. The first one, Medlars, is commonly used in Information Retrieval. The second one is test data from an elementary benchmark for clustering algorithms [11].

Weblogs Dataset. To test the effectiveness of the learning algorithm on high-dimensional datasets, a Weblogs dataset was used. It contains web logs from an Apache server, namely records of two months of request activity (HTTP requests) to the NASA Kennedy Space Center WWW server in Florida. Standard data preprocessing methods were applied to the obtained dataset: records from search engines and spiders were removed, and only web site browsing was kept (without downloads of pictures, icons, stylesheets, scripts, etc.). The final dataset (input vector space) has a dimension of 90,060 and consists of 54,961 input vectors. For a detailed description, see our previous work [10], where the behaviour of a web site community was analyzed.

Medlars Dataset. The Medlars dataset consists of 1,033 English abstracts from medical science. A total of 8,567 distinct terms were extracted from the dataset; each term represents a potential dimension of the input vector space, and a term's level of significance (weight) in a particular document gives the value of the corresponding component of the input vector. Finally, the input vector space has a dimension of 8,707, and 1,033 input vectors were extracted from the dataset.
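The exact weighting scheme is not specified here; as an illustration only, a standard TF-IDF weighting (e.g. via scikit-learn) produces term-weight input vectors of this kind, with one dimension per distinct term.

```python
# Illustration only: building term-weight input vectors from abstracts with a
# standard TF-IDF weighting (scikit-learn). The authors' exact weighting scheme
# is not specified in this section; the documents below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "chronic bronchitis treated with antibiotics",
    "antibiotics and renal function in chronic disease",
]

vectorizer = TfidfVectorizer()                 # one dimension per distinct term
vectors = vectorizer.fit_transform(abstracts)  # sparse matrix: documents x terms

print(vectors.shape)                           # (2, number of distinct terms)
```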

Experimental Hardware. The experiments were performed on a Linux HPC cluster named Anselm, with 209 computing nodes, where each node has 16 processor cores and 64 GB of memory. The processors in the nodes are Intel Sandy Bridge E5-2665. The compute network is InfiniBand QDR, fully non-blocking, with a fat-tree topology. Detailed information about the hardware can be found on the web site of the Anselm HPC cluster.

Fig. 3. Quality of hierarchical clustering of the output layer of a \(k \times k\) SOM

Fig. 4. Quality of PAM clustering of the output layer of a \(k \times k\) SOM

4.2 First Part of the Experiment

The first part of the experiments was oriented towards comparing the quality of clustering obtained by agglomerative hierarchical clustering and by PAM clustering. The Weblogs dataset was used. All the experiments were carried out for 20 epochs; the random initial values of the neuron weights in the first epoch were always set to the same values. The tests were performed for SOMs with a rectangular shape of \(5 \times 5\), \(6 \times 6\), \(7 \times 7\) and \(8 \times 8\) neurons. The metric used for determining the quality of clustering is the Average Silhouette Index (ASI). The achieved quality of clustering is presented in Fig. 3 for agglomerative hierarchical clustering and in Fig. 4 for PAM clustering.
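For reference, the ASI of a clustering of the SOM output layer can be computed from the neuron distance matrix and the cluster labels; a minimal sketch with scikit-learn's silhouette_score on a precomputed Euclidean distance matrix follows, where the weights and labels are placeholders for the real SOM and clustering.

```python
# Minimal sketch: Average Silhouette Index (ASI) of a clustering of the SOM
# output layer, computed from a precomputed distance matrix (scikit-learn).
# The random weights and labels below are placeholders for the real SOM.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
neuron_weights = rng.random((25, 1000))     # 5 x 5 SOM output layer
labels = rng.integers(0, 4, size=25)        # cluster label of each neuron

distance_matrix = squareform(pdist(neuron_weights))
asi = silhouette_score(distance_matrix, labels, metric="precomputed")
print(asi)                                  # value in [-1, 1]; higher is better
```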

4.3 Second Part of the Experiment

The second part of the experiments was oriented towards comparing the time efficiency of the PAM and AVL algorithms. The GNG parameters are as follows: \(\gamma \) = 100, \(e_w\) = 0.05, \(e_n\) = 0.006, \(\alpha \) = 0.5, \(\beta \) = 0.0005, \(a_{max}\) = 160, \(N_{max}\) = 1221, T = 200. The Weblogs dataset was used. The dimensions of the SOM are \(5 \times 5\), \(6 \times 6\), \(7 \times 7\) and \(8 \times 8\). The number of cores is 32 for each group, and the groups are computed sequentially (Tables 2 and 3).

Table 2. Computing time with respect to the number of cores, standard GNG algorithm, Medlars dataset
Table 3. The quality of the algorithms (lower is better)

4.4 Third Part of the Experiment

The third part of the experiments was oriented towards speedup and resource optimization. The GNG parameters are as follows: \(\gamma \) = 200, \(e_w\) = 0.05, \(e_n\) = 0.006, \(\alpha \) = 0.5, \(\beta \) = 0.0005, \(a_{max}\) = 160, \(N_{max}\) = 1000, T = 200. The Medlars dataset was used. The tests were performed for a SOM with a rectangular shape of \(5 \times 5\) neurons. The achieved speedup is presented in Tables 4 and 5. Both setups have a similar speedup but fundamentally differ in efficiency: in the fastest case, the efficiency is 0.02 without optimization but 0.23 with optimization.
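These values follow the standard definition of parallel efficiency (assumed here): \(E = S/p\), where \(S = T_1/T_p\) is the speedup and \(p\) is the number of allocated cores. A similar speedup achieved with fewer allocated cores therefore yields a proportionally higher efficiency, which is exactly what the resource optimization targets.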

Table 4. Combination of SOM and GNGs
Table 5. Combination of SOM and GNGs with resource optimization

5 Conclusion

In this paper, we presented experiments with preprocessing of the data in the output layer of SOM. Different clustering algorithms were used and clusters of different quality were obtained. The global input data preprocessed by this SOM were used as input for the GNG neural networks. This approach allows us to distribute the data almost uniformly over the computation cores and to utilize them efficiently. The achieved speed-up is also very good. In future work we intend to focus on sparse data, on further acceleration, and on using Xeon Phi coprocessors for better speed-up.