Semi-supervised Classifying of Modelled Auditory Nerve Patterns for Vowel Stimuli with Additive Noise

  • Anton Yakovenko
  • Eugene Sidorenko
  • Galina Malykhina
Conference paper
Part of the Studies in Computational Intelligence book series (SCI, volume 799)


The paper proposes an approach to the analysis of stationary patterns of auditory neural activity from the standpoint of semi-supervised learning in self-organizing maps (SOM). The suggested approach makes it possible to classify and identify complex auditory stimuli, such as vowels, given limited prior information about the data. A computational model of the auditory periphery has been used to obtain auditory nerve fiber responses. Label propagation through the Delaunay triangulation proximity graph derived by the SOM algorithm is implemented to classify unlabeled units. In order to avoid the “dead” unit problem in the Emergent SOM and to improve the effectiveness of the method, an adaptive conscience mechanism has been realized. The study considers the influence of additive white Gaussian noise (AWGN) on the robustness of auditory stimuli identification under various signal-to-noise ratios (SNRs). The representation of acoustic signals in the form of neural activity in the auditory nerve fibers has proven more noise-robust than the representation in the form of the most common acoustic features, such as MFCC and PLP. The approach has produced high accuracy both in the case of similar sounds and under strong additive noise.


Keywords: Auditory nerve data analysis · Unsupervised learning · Neurogram · Machine hearing · Label propagation · Self-organizing maps

1 Introduction

The central nervous system receives information about the environment through peripheral coding. In the course of evolution, representations of sensory information have been formed that allow the subject to effectively recognize phenomena in various physical contexts. In particular, a listener can successfully analyze an auditory scene and detect sound events in a wide variety of acoustic conditions.

Automatic speech processing systems are considerably inferior to human listeners in recognition quality [1], particularly with regard to continuous speech and noisy environments. Their performance can be improved by introducing additional linguistic and contextual information [2], e.g. for the prediction of isolated phonemes. However, the intrinsic robustness of human speech recognition does not inherently depend on language or context [3]. Thus, further development of speech technology can involve the integration of knowledge about the physiology of auditory perception and neural coding.

The present study proposes a method for the analysis, classification, and identification of the resulting stationary patterns of evoked neural activity generated by a simulation model of the auditory nerve in response to complex tones representing vowel stimuli. Generally, such multidimensional data have a complex structure and a large number of observations, which complicates the task and limits the applicability of many methods. A practically relevant problem is classification under limited prior knowledge, when class labels are known for only some of the observations [4]. In this case, supervised methods cannot ensure the desired accuracy. On the other hand, unsupervised techniques cannot exploit the prior information that is available. To compensate for these shortcomings, the study proposes an approach based on the self-organizing map model in the context of a semi-supervised classification task. The present study builds on the results of previous work, in which processing of modelled auditory nerve fiber responses revealed a cluster structure in unlabeled neural activity data for the voices of different speakers [5]. In order to estimate robustness, the study considers the influence of additive white Gaussian noise (AWGN) on the auditory stimuli. Finally, the results obtained are compared against traditional acoustic features widely used for signal representation in automatic speech recognition, namely mel-frequency cepstral coefficients (MFCC) [6] and perceptual linear prediction coefficients (PLP) [7].

2 Methodology Background

In the auditory periphery, an input sound is converted from acoustic to mechanical oscillations. The latter, in turn, stimulate the electrical activity of the auditory nerve fibers. The signals generated are transported to the corresponding cortex areas via nerve fibers, which results in auditory perception. In this way, humans can effectively identify perceptive qualities of sound (pitch, loudness, timbre, etc.). Various areas of the cortex are characterized by neural maps that ensure spatial representation of sensory information [8]. The tonotopic map, which forms a topographic projection from the cochlea in accordance with signal frequencies, is responsible for processing auditory information.

Self-organizing feature maps (SOM) are a computational model of sensory topographic maps [9]. A unique combination of such properties as approximation of the input space, topological ordering, density matching, and feature selection makes SOM stand out among other artificial neural networks. This method has been widely used in intelligent data analysis in various areas of application.

Two types of SOM are distinguished based on the number of neurons (or nodes, units) in the self-organizing layer. The first type has a small number of neurons, each of which becomes a cluster center after learning. This is the most common interpretation of the method. However, in this case each neuron’s area of influence can be regarded as a k-means cluster, so topology preservation in the projection proves to be of little use. The second type of SOM usage is characterized by a much larger number of elements (several thousand), which makes it possible to learn and effectively visualize the data structure. Here the SOM demonstrates emergent properties that are not confined to the mere sum of individual element actions and are not found in small maps. Consequently, this approach is called Emergent SOM (ESOM) [10]. It allows one to construct a boundary between clusters of arbitrary complexity. In visualization, the structural properties of the data are represented by means of a U-Matrix, a P-Matrix, or a combination thereof (the U*-Matrix). As the processing of large multidimensional data has become increasingly relevant, a promising direction in the area of knowledge discovery [11] is the development and application of ESOM for automatic cluster identification and unsupervised classification under prior uncertainty about the data structure.
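To make the U-Matrix concrete, the following is a minimal sketch of its computation in Python. It assumes a rectangular lattice with a 4-neighbourhood for simplicity, whereas the ESOM used later in the paper is hexagonal; the function name and signature are illustrative, not from the original.

```python
import numpy as np

def u_matrix(weights, rows, cols):
    """U-Matrix sketch: mean distance from each unit's model vector to
    the model vectors of its grid neighbours (4-neighbourhood on a
    rectangular lattice). Large values mark cluster boundaries."""
    W = weights.reshape(rows, cols, -1)
    U = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(np.linalg.norm(W[r, c] - W[nr, nc]))
            U[r, c] = np.mean(dists)
    return U
```

On a trained map, ridges of high U-values separate the emergent clusters that the boundary construction mentioned above relies on.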

3 Algorithm

Let \(\mathbf {x}=[\xi _{1},...,\xi _{D}]^{T}\in X\) be an input vector. There is a set of SOM neurons \(Y_{N}=\left\{ \mathbf {y}_{1}...\mathbf {y}_{i}...\mathbf {y}_{n} \right\} ,N=\left\{ 1,...,n \right\} \), described by model vectors (or synaptic weights) \(\mathbf {y}_{i}=\left[ \xi _{i1},...,\xi _{iD} \right]^{T} \in \mathbb {R}^{D}\). According to the SOM architecture, the input vector \(\mathbf {x}\) is applied in parallel to each neuron \(\mathbf {y}\) of the output map, which represents a regular grid of interconnected units. The SOM can be considered an undirected graph \(G=(V,E)\) consisting of vertices V, the indices of the map units, and edges E, the lateral connections between neighboring neurons. In the self-organizing process driven by the input samples, the SOM performs a topological mapping \(f_{map}:X\subset \mathbb {R}^{D}\rightarrow G\subset \mathbb {R}^{M},D>M\). This mechanism is based on the adaptive competitive learning algorithm introduced by Kohonen [9].

At first, for the input vector \(\mathbf {x}\) the winner neuron \(\omega (\mathbf {x})\), or best matching unit (BMU), is determined by evaluating the Euclidean distance between \(\mathbf {x}\) and each map unit:
$$\begin{aligned} \omega (\mathbf {x})=\arg \underset{i}{\min } \left\| \mathbf {x}-\mathbf {y}_i \right\| . \end{aligned}$$
The BMU modifies the model vectors of neighboring neurons within the topological neighborhood \(h_{i,\omega }\). The adaptive process is accompanied by decreasing \(h_{i,\omega }\) with time t, so the neighborhood function is determined by the following exponential dependence:
$$\begin{aligned} h_{i,\omega }=\exp \left( -\frac{d_{i,\omega }^{2}}{2\sigma ^{2}(t) } \right) , \end{aligned}$$
where \(\sigma (t)\) is a decreasing radius of a topological neighborhood, \(d_{i,\omega }\) is a lateral distance between the BMU \(\omega \) and the neighboring unit i. Influence of the BMU is weakened with increasing \(d_{i,\omega }\). Updating the model vectors at each iteration occurs according to the following expression:
$$\begin{aligned} \mathbf {y}_i(t+1)=\mathbf {y}_i(t)+h_{i,\omega }(t)\alpha (t)\left[ \mathbf {x}(t)-\mathbf {y}_i(t) \right] , \end{aligned}$$
where \(\alpha (t)\) is a decreasing learning rate. The inputs are fed to the SOM in a sequential mode until convergence. The convergence criterion of the training step is the absence of significant changes in the map structure, according to a sufficiently small threshold value \(\delta \):
$$\begin{aligned} O(t)=\sum _{i}\left\| \mathbf {y}_i(t+1)-\mathbf {y}_i(t) \right\| ,\quad \left| O(t)-O(t-1)\right| < \delta . \end{aligned}$$
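The BMU search and the update rule above can be sketched in a few lines of NumPy. The exponential decay schedules and the constants alpha0, sigma0, and tau are illustrative assumptions, not parameters reported in the paper.

```python
import numpy as np

def som_step(x, Y, grid, t, tau, alpha0=0.5, sigma0=3.0):
    """One Kohonen iteration: find the BMU for input x, then pull every
    model vector towards x, weighted by the Gaussian neighbourhood."""
    w = int(np.argmin(np.linalg.norm(Y - x, axis=1)))   # best matching unit
    sigma = sigma0 * np.exp(-t / tau)                   # shrinking radius sigma(t)
    alpha = alpha0 * np.exp(-t / tau)                   # decaying rate alpha(t)
    d2 = np.sum((grid - grid[w]) ** 2, axis=1)          # squared lateral distance d_{i,w}
    h = np.exp(-d2 / (2.0 * sigma ** 2))                # neighbourhood h_{i,w}
    Y += alpha * h[:, None] * (x - Y)                   # update of the model vectors
    return w
```

Iterating this step over the input samples and tracking the summed weight change gives the convergence test \(|O(t)-O(t-1)|<\delta \).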
With a random initialization of the network weights, and given the large number of ESOM elements, the probability increases that some neurons fall into regions of the space with low data density. Such so-called “dead” units have a negative impact on the quality of the data interpretation. Therefore, in order to involve all map units, a conscience mechanism [12] with an adaptive activation threshold \(p_i\) for each neuron is introduced:
$$\begin{aligned} p_i(t+1)=\left\{ \begin{matrix} p_i(t)+\frac{1}{n},(i\ne \omega )\\ p_i(t)-p_{min},(i=\omega )\end{matrix}\right. \end{aligned}$$
where \(p_{min}\) is the minimal potential that determines the participation of a given neuron in the competition process. If \(p_i<p_{min}\), the neuron i is temporarily disabled, and the BMU is searched for among its nearest neighbors.
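The conscience mechanism can be sketched as follows. This is a simplified variant: instead of re-running the search among the disabled winner's nearest neighbors, units below the threshold are simply excluded from the competition, which yields the same effect on a full map; the function name is illustrative.

```python
import numpy as np

def bmu_with_conscience(x, Y, p, p_min):
    """BMU search with the conscience mechanism: units whose potential
    p_i has fallen below p_min sit out the current competition."""
    n = len(Y)
    d = np.linalg.norm(Y - x, axis=1)
    d[p < p_min] = np.inf            # temporarily disabled units
    w = int(np.argmin(d))
    p += 1.0 / n                     # p_i(t+1) = p_i(t) + 1/n for i != w
    p[w] -= 1.0 / n + p_min          # p_w(t+1) = p_w(t) - p_min
    return w
```

Frequent winners thus lose potential and eventually yield to otherwise “dead” units, spreading activity over the whole map.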
The Euclidean measure is used because the SOM units then represent the set of seed points of a Voronoi tessellation partitioning the input space. Thus, in accordance with the hexagonal structure of the lateral connections, the SOM forms a proximity graph, namely a Delaunay triangulation. Proximity graphs are effectively used in semi-supervised learning tasks by label propagation methods, which spread class labels to nearby nodes [13]. Consider this approach in the ESOM context. Upon completion of the learning process on the partially labeled data, the set \(Y_N\) of neurons consists of a subset of BMUs \(Y_{L}^{(1)}=\left\{ y_1 ... y_l \right\} \subset Y_N, L=\left\{ 1,...,l \right\} \), each with a corresponding class label \(Z_L=\left\{ z_1,...,z_l \right\} \), and a subset of map units \(Y_{U}^{(2)}=\left\{ y_{l+1}...y_{l+u} \right\} \subset Y_N, U=\left\{ l+1,...,l+u \right\} ,l<u\), for which the classification labels \(Z_U=\left\{ z_{l+1},...,z_{l+u} \right\} \) are to be determined. Suppose that the number of classes \(C_k\) is known, and consider the binary classification task \(Z_L\in C_k=\left\{ 0,1 \right\} \). Formally, the goal of semi-supervised learning is then to construct a classifier function for the finite set of partially labeled map units, \(f_C:Y_N\rightarrow \left\{ C_0,C_1 \right\} \); that is, \(Z_U\) must be estimated from \(Y_N\) and \(Z_L\). To determine the values of the edges E of the graph G, i.e. the lateral connection weights \(w_{ij}\) between neighboring neurons i and j in the SOM space, the u-distance can be used:
$$\begin{aligned} u(i,j)=\underset{(i_{1}=i,\ldots ,i_{r}=j)}{\min }\sum _{k=1}^{r-1}d(\mathbf {y}_{i_{k}},\mathbf {y}_{i_{k+1}}), \end{aligned}$$
$$\begin{aligned} w_{ij}(\mathbf {x})=exp\left( -\frac{u^{2}(i,j)}{\varsigma ^{2}} \right) , \end{aligned}$$
where the minimum is taken over all paths \((i_{1}=i,\ldots ,i_{r}=j)\) along the grid, \(r_i\in \mathbb {N}^2\) is the position of \(\mathbf {y}_i\) on the regular grid, and \(\varsigma \) is a radius that determines how far the class label information is propagated over the graph from the seed units. The value of the parameter \(\varsigma \) is crucial; the problem of determining it is discussed in [14].
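Given the edge weights \(w_{ij}\), the spreading of labels over the map can be sketched with a generic iterative scheme in the spirit of [13]: unlabeled nodes repeatedly average their neighbours' label scores while the seed BMUs are clamped to their known labels. The iteration count and the 0.5 decision threshold are illustrative assumptions.

```python
import numpy as np

def propagate_labels(W, z, seeds, n_iter=200):
    """Iterative label propagation on a weighted graph: each node takes
    the weighted average of its neighbours' label scores; seed nodes are
    clamped back to their known labels after every pass."""
    F = z.astype(float).copy()
    for _ in range(n_iter):
        F = W @ F / np.maximum(W.sum(axis=1), 1e-12)
        F[seeds] = z[seeds]          # clamp the labeled BMUs
    return (F > 0.5).astype(int)     # binary decision C_0 / C_1
```

On the ESOM, W would hold the Gaussian weights \(w_{ij}\) computed from the u-distances, so each unlabeled unit inherits the label of the nearest labeled region of the map.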

4 Results

Sound oscillations are generally divided into simple and complex. Simple oscillations follow a sinusoidal law and are called pure tones; such sounds, however, are hardly ever found in nature. A complex sound, by contrast, can be represented as a set of tones differing in frequency and amplitude. The same applies to vocal sounds, e.g. those used in speech synthesis. Accordingly, the complex acoustic signals considered in this paper are represented by a dual-tone multi-frequency model of speech vowels, synthesized as a sum of harmonic oscillations at the first two formants. For the mean formant frequencies used to generate the vowel signals, see Table 2 in [15] (age group 20–25 years). The sampling rate and the duration of each signal were 44.1 kHz and 250 ms, respectively.
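The dual-tone vowel model described above can be sketched as follows. The sampling rate and duration match the paper; the equal amplitudes and the example formant pair are illustrative assumptions, not values from Table 2 of [15].

```python
import numpy as np

FS = 44_100   # Hz, sampling rate used in the paper
DUR = 0.25    # 250 ms stimulus duration

def vowel(f1, f2, fs=FS, dur=DUR):
    """Dual-tone vowel model: the sum of two sinusoids at the first two
    formant frequencies (equal amplitudes assumed for illustration)."""
    t = np.arange(int(fs * dur)) / fs
    return np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)

# illustrative formant pair, not taken from [15]
s = vowel(700.0, 1100.0)
```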

A physiologically based computational model of the auditory periphery (MAP) [16] has been used to form the auditory nerve response to the acoustic signals. Response modeling was performed for nerve fibers with a high spontaneous rate. The results are presented as a multidimensional data matrix X of observations that defines the auditory neurogram. This is a time-frequency representation reflecting the firing rate of the modeled auditory nerve fiber ensemble in response to the input signal. The matrix columns \(\mathbf {x}\) correspond to observations in discrete time, and the rows \(\xi \) are the features representing a range of \(D = 41\) characteristic frequencies, logarithmically spaced over 250–8000 Hz. The matrices obtained for each signal were then used directly as the input data set for the ESOM. The total amount of data was an array of 486200 observations, of which only \(25\%\) have a class label.
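Assuming geometric (logarithmic) spacing, the 41-channel characteristic frequency axis of the neurogram can be reproduced in one line; the exact channel placement of the MAP model may differ.

```python
import numpy as np

# 41 characteristic frequencies, logarithmically spaced over 250-8000 Hz
cfs = np.geomspace(250.0, 8000.0, num=41)
```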
Table 1. The resulting learning parameters for Group 1, Group 2, and Group 3, including the propagation radius Sigma (\(\varsigma \)).
Table 2. Average classification error (%) for Group 1, Group 2, and Group 3 at each SNR (dB).
For data analysis, a hexagonal grid of 4096 neurons with a planar topology was used. ESOM training was carried out on partially labeled input samples for clean auditory stimuli, without AWGN. The quality of the map projection was evaluated using standard metrics, such as the final quantization error (FQE) and the final topographic error (FTE). When convergence is achieved, the BMU nodes are assigned a class label, which is then propagated to the neighboring nodes by the label propagation algorithm. The obtained values of the corresponding learning parameters are presented in Table 1.

Testing was performed using unlabeled data of clean and noisy auditory stimuli. To evaluate the average quality of the classification, the vowel phonemes were divided into three groups of sounds according to their formant frequencies: Group 1 - different, Group 2 - similar and Group 3 - very similar. The influence of AWGN was verified at 30, 20 and 10 dB SNR. Table 2 shows the average classification error for each group of sound stimuli.
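The noisy test conditions can be reproduced with a standard AWGN construction: the noise variance is chosen so that the signal-to-noise power ratio matches the target value in decibels. This is a generic sketch, not the paper's code; the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_awgn(x, snr_db):
    """Add white Gaussian noise scaled so that the resulting
    signal-to-noise ratio equals snr_db (a power ratio in dB)."""
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / 10.0 ** (snr_db / 10.0)
    return x + rng.normal(0.0, np.sqrt(p_noise), x.shape)
```

Applying this at 30, 20, and 10 dB to the clean vowel stimuli yields the test conditions summarized in Table 2.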

5 Conclusions

According to the results obtained, the proposed approach has demonstrated high accuracy in solving the task under consideration. The use of emergent self-organizing maps revealed a complex structure in the large multidimensional auditory nerve data and the corresponding linearly non-separable clusters representing sound stimuli. The introduction of the adaptive conscience mechanism prevented the appearance of “dead” units, for which it is difficult to determine class affiliation by the label propagation algorithm. Comparing the sound representation in the form of a stationary response pattern of auditory neural activity with the acoustic features gave the following results: for clean signals, classification accuracy was equally high; however, for signals with AWGN, the quality of the MFCC- and PLP-based classification decreased significantly. The weakest results, as expected, were obtained for the classification of similar vowel phonemes (Group 3) at the lowest SNR (10 dB). Under these conditions, the proposed approach was still able to provide an accuracy of about 60%, whereas the accuracy with the considered acoustic features did not exceed 30%.



The reported study was funded by the Russian Foundation for Basic Research according to the research project 18-31-00304.


  1. Meyer, B., Wächter, M., Brand, T., Kollmeier, B.: Phoneme confusions in human and automatic speech recognition. In: Proceedings of Interspeech, pp. 1485–1488 (2007)
  2. Yousafzai, J., Ager, M., Cvetkovic, Z., Sollich, P.: Discriminative and generative machine learning approaches towards robust phoneme classification. In: Proceedings of IEEE Workshop on Information Theory and Application, pp. 471–475 (2008)
  3. Miller, G.A., Nicely, P.E.: An analysis of perceptual confusions among some English consonants. J. Acoust. Soc. Am. 27(2), 338–352 (1955)
  4. Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. MIT Press, Cambridge (2006)
  5. Yakovenko, A., Malykhina, G.: Bio-inspired approach for automatic speaker clustering using auditory modeling and self-organizing maps. Procedia Comput. Sci. 123, 547–552 (2018)
  6. Huang, X., Acero, A., Hon, H.: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, Upper Saddle River (2001)
  7. Hermansky, H.: Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87(4), 1738–1752 (1990)
  8. Imai, T.: Positional information in neural map development: lessons from the olfactory system. Dev. Growth Differ. 54(3), 358–365 (2012)
  9. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Heidelberg (2001)
  10. Ultsch, A., Mörchen, F.: ESOM-maps: tools for clustering, visualization, and classification with Emergent SOM. Technical report, Department of Mathematics and Computer Science, University of Marburg, Germany, p. 46 (2005)
  11. Ultsch, A., Lötsch, J.: Machine-learned cluster identification in high-dimensional data. J. Biomed. Inform. 66, 95–104 (2017)
  12. DeSieno, D.: Adding a conscience to competitive learning. In: Proceedings of the Second Annual IEEE International Conference on Neural Networks, pp. 117–124 (1988)
  13. Zhu, X.: Semi-supervised learning with graphs. Doctoral dissertation, Carnegie Mellon University, CMU-LTI-05-192 (2005)
  14. Herrmann, L., Ultsch, A.: Label propagation for semi-supervised learning in self-organizing maps. In: Proceedings of the 6th International Workshop on Self-Organizing Maps (WSOM), Bielefeld University, Germany (2007)
  15. Hawkins, S., Midgley, J.: Formant frequencies of RP monophthongs in four age groups of speakers. J. Int. Phon. Assoc. 35(2), 183–199 (2005)
  16. Meddis, R., et al.: A computer model of the auditory periphery and its application to the study of hearing. In: Proceedings of the 16th International Symposium on Hearing, Cambridge, UK, pp. 23–27 (2012)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Anton Yakovenko, Eugene Sidorenko, Galina Malykhina
  1. Peter the Great St. Petersburg Polytechnic University, St. Petersburg, Russia