1 Introduction

Researchers in the field of speech processing have made significant strides in recent years toward optimizing automatic speech recognition (ASR) systems for noise-free environments. A central topic in speech recognition research [1] is how to make recognition systems more robust against natural variations in speech (speaker variation, accent, background noise, transmission channel, etc.). Most studies on robustness to such variations focus on three main strategies: speech augmentation, the extraction of resilient features, and the compensation of acoustic model parameters [2, 3]. However, recent studies have shown that even the best ASR systems achieve recognition rates inferior to those of the human auditory system. Studying the physiology of human hearing may therefore help improve the recognition performance of such systems [4, 5].

Speech recognition has emerged as a key technique for enhancing human–machine interaction. The growing interest in neural networks and bio-inspired systems has stimulated the deployment of new methodologies, which are needed in light of the shortcomings of present automatic speech recognition systems, such as their reliance on impractical cloud-based solutions [6, 7]. Artificial neural networks (ANNs) are now the primary acoustic modeling technique for automatic large-vocabulary speech recognition. A typical ANN imposes substantial hardware requirements and complexity due to its multi-layered architecture [8]. Although several neuron models have been proposed to mimic biological nonlinear dynamics, Izhikevich's model is among the most accurate, with properties that are strikingly close to those of human physiological nerve cells [9,10,11]. The silicon neuron is a tiny circuit built from transistors that can simulate brain activity. Izhikevich neuron models and time-delayed synaptic plasticity models are used to describe the nodes and neurons in brain-inspired spiking neural networks (SNNs), making them very similar to biological neural networks. Their spike-based computing allows them to function on low-power neuromorphic hardware [12, 13].

Spiking neural networks with leaky integrate-and-fire (LIF) neurons offer possibilities for energy-efficient neuromorphic computing, especially in edge devices [14], due to their event-driven operation and built-in states for preserving information across time. Processing and extracting meaningful information from spatio-temporal data represented as collections of spike trains over time is essential for spike-based neuromorphic applications [15]. In this paper, we use speech recognition as a case study and exploit the capabilities of the spiking neural network to identify digits from audio signals [16]. To make judgments or store information, the human brain uses complex activation patterns of neuronal populations generated by the activation of sensory neurons and the subsequent transmission of inputs to cortical neurons [17].

The primary type of nerve cell is the neuron. The job of these cells is to relay information along nerve fibers, which they accomplish by channeling electrical discharges (spikes) [18]. These cells are unable to reproduce. The parts of a neuron are the cell body, the axon, the synaptic terminals, and the dendrites [19]. The majority of neurons receive information through appendages called dendrites and convey information to other cells via axons. The cell body of a neuron is called the perikaryon. A neuron thus has three functional parts, the dendrites, the cell body, and the axon, which together are responsible for delivering nerve signals. Polarization is the term for the principle that all information processed by a neuron travels in the same direction [20].

Synapses are the junctions between neurons that allow for information exchange. Synapses are overproduced throughout neurodevelopment, only to be destroyed or modified later on; this process in the brain is called synaptic pruning. In contrast, connections in an engineered network design often start few and grow over time, since it is considered pointless to include links in constructed networks that will subsequently be eliminated [21].

Spiking neural networks have allowed for significant progress in neuroscience-inspired algorithms in recent years [22]. SNNs are computational models that encode and process information in the time domain by drawing on a variety of dynamic models of biological neurons [23]. When the membrane potential is raised by incoming spikes, the neuron emits an output spike as soon as the potential crosses a threshold; after firing, the membrane potential returns to its initial value [24]. SNNs communicate through sparse, asynchronous binary signals whose effect depends on time- and activity-dependent synaptic weights between neurons. By specifying these weights, SNNs have demonstrated their ability to process information from many sources. Additionally, for diagnostic applications, SNN-based models have successfully shown their capacity to recreate the underlying brain processes [25, 26]. An overview of biological neurons and the integrate-and-fire model is shown in Fig. 1. Biological neurons in Fig. 1a are linked to other neurons via synapses, where messages are passed along as neurotransmitters released from the axon terminal. Dendritic signals converge at the soma, which initiates the action potential (AP); the AP travels down the axon and terminates at the axon terminal. The axon is an extended fiber that emanates from the soma. Dynamic models of real neurons are depicted in Fig. 1b as integrate-and-fire neurons. The input pulses produced by presynaptic neurons are denoted Ni. These input pulses increase the state of the neuron, its membrane potential (V). The neuron emits a spike whenever the membrane potential rises above a predetermined threshold (Vth), after which the state variable is reset to its initial value (Vreset).

Fig. 1

a Schematic representation of a biological neuron with connected synapses and b a representative model for a biological neural network based on the LIF model

Deep neural networks use more than two layers of artificial neurons and are designed to mimic, in simplified form, the structure and function of the brain's cortex [27]. In addition, deep learning models can offer theories and explanations for how the brain might complete difficult tasks in ambiguous settings [28]. Although deep learning models tend to be versatile, they require sizable datasets to be trained well [29].

The properties of spiking neural networks described above, together with their simple hardware implementation on low-power processors, have enabled significant applications in artificial intelligence classification and processing. As a result, numerous attempts have been made to employ SNNs for processing and resolving such issues. However, training and optimizing this AI model is not without difficulties. One of the biggest obstacles is defining the weights of the neurons in each layer of a deep spiking neural network so that the network can be trained for different classification inputs. The DSNN training problem becomes more challenging as the number of layers and neurons in each layer rises, and with it the number of weighting parameters for each neuron. The development of an effective training model for DSNNs is therefore one of the most pressing issues in modern classification research and artificial intelligence applications. For training deep neural networks in this work, we used several meta-heuristic methods based on a fuzzy weighting system (FWS) learning rule merged with spike-timing-dependent plasticity (STDP). Our case study is the categorization and recognition of digits in audio data. In this context, sound features resistant to environmental changes and accompanying noise have been chosen using a gray wolf optimization technique. A compact spatio-temporal spike map can be created in a spike-based environment from the selected input features, and this map can then be fed into network models and data approaches. Based on the Izhikevich model for the input class, we have constructed spiking auto-encoders and trained and tested them using the TIDIGITS dataset. We also introduce a framework for speech synthesis based on encoded features, implemented using multi-layered, fully connected spiking neural networks. Spatio-temporal compressed spike maps of those properties are then generated using LIF (leaky integrate-and-fire) model neurons. The wild horse optimizer (WHO) has been used to train these neurons and to establish the weights and threshold voltages of the neurons defined in the network. The key novelty of the training is the adoption of a random weighting mechanism based on fuzzy logic that is adequate for each neuron: with the assistance of the WHO method, only two variables are calculated per neuron instead of specifying a substantial number of individual input weights. The compressed spike maps are then applied to audio samples using the trained weights, and the results of the classification are translated back into the original form using a spiking autoencoder. To the best of our knowledge, this is the first attempt to use optimization methods and a fuzzy logic system to execute speech synthesis from extracted features in a spike-based environment.

This article's primary contributions are:

  • A neuron model that converts real-valued input into spike patterns via an Izhikevich model-based encoding process;

  • A robust feature selection algorithm for the deep spiking neural network, serving as a pre-processing unit for the training samples;

  • A learning algorithm that can determine the network's structure autonomously from training examples;

  • Rules for weight updates using stochastic computations in a fuzzy logic environment.

The rest of the paper is organized as follows. In Sect. 2, we briefly review related work and fundamental concepts. In Sect. 3, we describe how the proposed methods are used to classify speech signals for various examples. In Sect. 4, the results of MATLAB simulations of the proposed design are presented and analyzed. The last section gives the article's conclusions.

2 Background Research

2.1 Related Works

It is not easy to train deep spiking neural networks (DSNNs). For quick and effective pattern identification, the authors of [30] propose a new ANN-to-SNN conversion framework and layered learning architecture called progressive cascade learning of deep SNNs. To best mimic the activation values of analog neurons, an initial network transformation technique is developed by investigating the equivalence of ANNs and SNNs in the discrete representation space. Layered learning with an adjustable training schedule is introduced to fine-tune the network weights and correct the approximation errors brought on by the initial network transformation. To classify human footsteps in the wild using basic time-domain data, the authors of [31] used a simulation-based methodology. A computationally lightweight ANN-based classifier was utilized in that study to categorize acoustic signals with an SNN. SNNs and feature extraction algorithms are necessary for low-power subthreshold analog implementations in wireless sensor systems. The conversion from ANN to SNN inevitably results in a loss of 5% classification accuracy, although the SNN allows for low-power operation at the algorithm level. The classification accuracy was subsequently enhanced and became closer to that of the ANN model.

Encoding real-world signals as spikes efficiently is crucial and has a major impact on overall system performance. Both spike density and the retention of task-relevant information are important for efficient signal-to-spike encoding. A speaker-independent digit classifier is analyzed in [32], along with four spike encoding algorithms (Send-on-Delta, Time-to-First-Spike, the Leaky Integrate-and-Fire neuron, and Ben's Spiker Algorithm). Classification accuracy is enhanced using these techniques. As a route toward human-level performance on speech processing tasks, neuromorphic audio sensors combined with sparse neural networks present an intriguing possibility. In [33], the connection weights are determined by training a convolutional neural network with certain activation functions, employing static firing-rate-based images built from spiking data collected from the cochlea. A large dataset containing the spoken directions "left" and "right" was used for training and testing, and the system ultimately attained an accuracy of 89.90%.

In [34], SNNs with enhanced intrinsic recurrence dynamics are presented that can learn long sequences efficiently. The proposed architectures have an advantage over LSTMs in that they have half as many trainable parameters. To overcome the non-differentiability of spiking neurons, a surrogate function is used; the resulting gradient mismatch problem is mitigated by the training scheme presented for the proposed architectures, which enables the SNNs to produce multi-bit outputs rather than simple binary spikes. In [35], a hierarchical spiking neural network (HSNN) is presented that mirrors several organizational hierarchies of the ascending auditory pathway. This HSNN is tuned to maximize word recognition accuracy in noisy environments. The optimized HSNN reproduces several transformations along the ascending auditory pathway, including sequential losses of temporal resolution and synchronization capacity as well as increasing dispersion and selectivity, as demonstrated by comparison with data from the auditory nerve, midbrain, thalamus, and cerebral cortex. To facilitate audio processing, the authors of [36] suggest a neural spike encoding and decoding technique. Biologically plausible auditory encoding (BAE) is a neural encoding scheme that mimics the functions of the perceptual components of the human auditory system, such as the cochlear filter bank, the inner hair cells, the auditory masking effects from psychoacoustic models, and the spike encoding performed by the auditory nerve.

A strategy for synthesizing images from multiple modalities in a spike-based environment is proposed in [37]. In that study, spiking auto-encoders are employed to compress the spatial and temporal information of video and audio inputs before being decoded to produce synthetic images. The neuron's activation function is approximated by a sigmoidal function, and a direct learning technique is employed to compute the membrane potential loss at the output layer and back-propagate it to allow for differentiation. The spiking autoencoders produce very low reconstruction losses, comparable to artificial neural networks, when tested on MNIST and Fashion-MNIST. To develop meaningful spatio-temporal representations of data in both the audio and visual modalities, spiking autoencoders are trained and assessed. In [38], an ASC framework, SOM-SNN, is developed that is more consistent with biological reality. The audio signals' inherent frequency content is represented by an unsupervised self-organizing map (SOM), and the signals' spatio-temporal spiking patterns are then classified by an event-based spiking neural network inside this framework. Compared to other deep learning-based models and traditional SNNs, experimental results on the RWCP ambient noise and TIDIGITS spoken digit datasets demonstrate competitive classification accuracy.

The acoustic modeling capabilities of SNNs are tested and evaluated in numerous large-vocabulary recognition scenarios in [39]. The experimental results reveal that the SNN-based ASR is competitive with ANN-based systems in terms of accuracy while classifying each audio frame in just 10 algorithmic time steps and using only 0.68 times the total synaptic operations.

A review of the methodologies discussed in this section shows that there are various ways to develop and train SNN networks. However, the accuracy of spiking neural networks remains a significant and under-explored aspect of this research, and numerous studies have proposed new methods and applications to address it [40,41,42]. Training time and complexity become issues when new training approaches are introduced to the network to improve accuracy; these are considered major topics for future research [43]. Motivated by these observations, this paper attempts a novel and straightforward approach to SNN training that improves training accuracy while lowering training complexity.

2.2 Introductory Concepts

2.2.1 Fuzzy Logic System

The fact that we can rarely assert with complete certainty that a data sample belongs to a class poses a serious problem for the classification of a collection of digits. The idea of a fuzzy set is introduced to allow the relative assignment of digits based on a membership function defined over a given set. The most crucial steps of fuzzy processing depend on the nature of the problem at hand and on the fuzzy method of analysis chosen to handle the data. There are three main phases: information fuzzification, fuzzy inference using predefined fuzzy rules, and information defuzzification [44, 45]. Fuzzification (data encoding) and defuzzification (result decoding) are therefore the phases that enable problem management with fuzzy rule approaches.
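
To make the three phases concrete, the short Python sketch below (our own illustration, not part of the paper's MATLAB implementation) fuzzifies a crisp input with two triangular membership functions, applies two hand-written if–then rules with Mamdani min–max inference, and defuzzifies the aggregated output with the centroid method. The membership functions and rules are arbitrary examples.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    return np.maximum(np.minimum((x - a) / (b - a + 1e-12),
                                 (c - x) / (c - b + 1e-12)), 0.0)

def fuzzy_step(x_crisp):
    # 1) Fuzzification: membership degrees of the crisp input in "low"/"high"
    mu_low  = tri(x_crisp, 0.0, 0.0, 0.5)
    mu_high = tri(x_crisp, 0.5, 1.0, 1.0)

    # 2) Inference with two illustrative rules (Mamdani min-max):
    #    IF x is low THEN y is small;  IF x is high THEN y is large
    y = np.linspace(0.0, 1.0, 201)
    out_small = np.minimum(mu_low,  tri(y, 0.0, 0.0, 0.6))
    out_large = np.minimum(mu_high, tri(y, 0.4, 1.0, 1.0))
    aggregated = np.maximum(out_small, out_large)

    # 3) Defuzzification: centroid of the aggregated output set
    return np.sum(y * aggregated) / (np.sum(aggregated) + 1e-12)

print(fuzzy_step(0.8))   # crisp output biased toward "large"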

The middle stage, the adjustment of the membership values, is where we can observe fuzzy processing in action; it can be regarded as the intelligence step because it distinguishes one approach from another. The many different membership functions used in fuzzy logic each have their own effect, and fuzzy inference is more effective when the right kind of membership function is chosen. There are many benefits to using fuzzy logic for information processing:

  • Fuzzy approaches are widely used in modern data processing.

  • It helps us deal with and make use of uncertainty.

  • Its underlying principles are relatively easy to grasp.

  • Fuzzy logic offers a lot of versatility.

  • Fuzzy logic works even when the data is inaccurate.

Fuzzy logic has the advantage that it takes structure into account while constructing its knowledge, whereas many other approaches suffer from imprecision [46].

To deal with complications such as spiking neural network training, it has been suggested that human reasoning based on if–then rules, as provided by fuzzy set theory and fuzzy logic, be used in various information processing applications. On the other hand, processing information and data might result in uncertainty for a variety of reasons, including randomness, ambiguity, and vagueness [47, 48]. In this research, we will use a random fuzzy pattern as a basis for determining the necessary weights for the neurons in an SNN.

2.2.2 Spiking Neural Network

Spiking neural networks (SNNs), which simulate the nerve cells of the brain, are made up of many densely coupled processing units called neurons that cooperate to solve a problem. These neurons communicate with one another through spikes. The layer that receives the input data is typically referred to as the input layer, and the layer that produces the output is referred to as the output layer. Any additional layers between these two are referred to as hidden layers in neural network classification. When there are many hidden layers, the network is called a deep spiking neural network (DSNN). The structure of a DSNN is shown in Fig. 2.

Fig. 2

Structure of a DSNN network

In a feedforward architecture, signals always travel from the input layer toward the output layer. Weighted inputs are applied to the input units, and their activations are propagated through several feedforward layers of the network to classify a test sample; the final output value is determined collectively by the output units.

The neural networks found in animal brains serve as the inspiration for the computational systems known as spiking neural networks. They learn by changing the weights and thresholds of their neurons. Several methods have been proposed on this basis, including Hebbian learning, the delta rule, competitive learning, and spike-timing-dependent plasticity.

2.2.3 Gray Wolf Optimization Algorithm (GWO)

This optimization meta-heuristic takes its cues from the behavior and hunting strategies of gray wolves. This population-based approach follows a straightforward procedure, making it amenable to adaptation for use on massive datasets [49, 50].

The hierarchical organization and social behavior of gray wolves during hunting are the models for the GWO algorithm. Gray wolves are apex predators at the top of the food chain. Each pack typically consists of five to twelve individuals, with a clear order of social dominance and assigned responsibilities for every member [51]. Each pack is organized into four distinct ranks (a minimal sketch of the standard GWO position update follows the list below):

  • Alpha wolves, a male and a female, are the pack leaders; they dominate the group.

  • Beta wolves support the alphas' decision-making and are the most likely candidates to replace them.

  • Delta wolves rank below the betas and include the elders, the hunters, and the caretakers responsible for young wolves.

  • Omega wolves are at the bottom of the hierarchical pyramid and have the fewest rights of all pack members; they are the last to eat and are not involved in decision-making.
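
The standard GWO position-update equations from [49, 50] are not restated in the text; the sketch below is a minimal, generic Python implementation of that update, in which the alpha, beta, and delta wolves jointly guide the rest of the pack. The objective function `sphere` is only a placeholder (in this paper the objective scores candidate feature subsets); bounds and parameters are illustrative.

```python
import numpy as np

def gwo(objective, dim, n_wolves=20, n_iter=100, lb=-1.0, ub=1.0, seed=0):
    rng = np.random.default_rng(seed)
    wolves = rng.uniform(lb, ub, size=(n_wolves, dim))
    fitness = np.array([objective(w) for w in wolves])

    for t in range(n_iter):
        # alpha, beta, delta = the three best wolves of the current pack
        order = np.argsort(fitness)
        alpha, beta, delta = wolves[order[:3]]
        a = 2.0 - 2.0 * t / n_iter            # coefficient decreasing linearly from 2 to 0

        for i in range(n_wolves):
            x_new = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A, C = 2 * a * r1 - a, 2 * r2
                D = np.abs(C * leader - wolves[i])
                x_new += leader - A * D        # candidate position w.r.t. this leader
            wolves[i] = np.clip(x_new / 3.0, lb, ub)
            fitness[i] = objective(wolves[i])

    best = np.argmin(fitness)
    return wolves[best], fitness[best]

# Placeholder objective for the sketch.
sphere = lambda x: float(np.sum(x ** 2))
print(gwo(sphere, dim=5))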

2.2.4 Wild Horse Optimization Algorithm (WHO)

An optimization algorithm can be inspired by humans, animals, plants, or even physical or chemical processes. Animal behavior has served as the inspiration for many algorithms presented in the last decade. In this research, we implement a novel optimization method named the wild horse optimizer (WHO), which takes its cues from the cooperative behavior of free-ranging horses. In the wild, a herd of horses typically consists of a stallion, several mares, and their young. Grazing, pursuit, dominance, leadership, and mating are just some of the behaviors displayed by horses, whose charming demeanor sets them apart from other animals. Because of the social nature of horse herds, young horses often move on to other herds before attaining sexual maturity; the purpose of this behavior is to prevent mating between a father and his daughters or between siblings. The suggested method draws its primary inspiration from this behavior [52, 53]. Figure 3 depicts the algorithmic flow of the wild horse optimizer implemented in this paper. The five primary steps of the wild horse optimizer are as follows:

  • Establishing the initial population, organizing the horses into groups, and selecting leaders;

  • Grazing and mating of horses;

  • Leadership of the group by its leader (the stallion);

  • Selection and exchange of leaders;

  • Saving the best solution found.

Fig. 3

WHO algorithm flowchart [53]

2.2.5 Hypothesis and Limitations of the Developed Method

This work describes a method for digit classification from speech signals that draws on the modeling, recognition, and comprehension of spatio-temporal electroencephalography data and on complex cognitive measurements of the brain during mental activity. The crucial point is that mental activity involves intricate brain processes related to digit recognition, which can only be understood by measuring an accurate data model for digit recognition. A spiking-neural-network-based method, with an architecture called NeuCube, has recently been presented as a general framework for modeling digit-recognition data based on electroencephalography recordings. Data modeling has been carried out and presented with MATLAB software. In this method, training the spiking neural network faces complications that endanger the accuracy of training. The problem of modeling spiking neurons can be addressed, on the hardware side, with the hypothesis of a simple circuit as in Fig. 5. The important issues, however, are the learning method of this hardware system and the values of the circuit components acting as weights for the training data samples. These limitations are addressed here by the integrator modeling of spike waves defined by the integrate-and-fire neuron model. The weighting and training problem runs into convergence constraints as the number of nodes and hidden layers in the network increases. To improve training speed and accuracy, this work presents a fuzzy uniform weighting scheme that simplifies the process of determining neuron input weights. In addition, the use of new meta-heuristic algorithms for calculating the weights has improved accuracy.

This technique increases the number of neurons and hidden layers in an attempt to lessen the impact of low accuracy. The main problem is that as the total number of neurons in the network grows, the number of training parameters grows with it, and this makes the convergence of training difficult.

3 Materials and Methods: Suggested Learning Method

Determining the threshold values and weights of each neuron becomes a significant barrier in the DSNN learning problem as the number of layers and neurons grows. In the example network depicted in Fig. 2, it takes 27 weighted connections between neurons and 8 threshold values to generate a "fire edge". In the proposed layout, the fuzzy weighting system (FWS) is used to bring this number down. The overall process of training a deep spiking neural network is illustrated in Fig. 4; the full interpretation of this proposed framework is discussed below.

Fig. 4

Proposed scheme. A Digit classification system with the FWS-DSNN technique. B FWS-DSNN training model. C Overall DSNN model (input, hidden, and output layers)

3.1 Spiking Neuron Model

Figure 4 depicts the process of extracting frame-based features for use in SNN-based acoustic models. These features are typically assumed to be constant over the brief duration of a segmented frame because speech signals change only moderately within it. We have used the Izhikevich model for the input-layer neurons to generate spike trains from the predetermined values of the extracted features; this model produces a set of spike trains at varying frequencies for each normalized feature. Using an integrate-and-fire (IF) neuron model that employs a pulse-train-generating strategy in the neurons of the intermediate and output layers, we can analyze these data frame by frame with low computational overhead. Although IF neurons do not replicate the temporal dynamics of biological neurons, they are well suited to this task since spike timing has little impact on the neural representation employed. The speed of the SNN analysis was further improved by bounding the maximum firing frequency of the neurons.

3.1.1 Izhikevich Neuron Model

A basic spiking model is utilized in this research that is as computationally efficient as the integrate-and-fire model and as biologically realistic as the Hodgkin–Huxley model. The mathematical study of the model is given in Izhikevich's work [54]; depending on four parameters, it faithfully reproduces the spiking and bursting activity of known types of cortical neurons. Several biophysically accurate Hodgkin–Huxley-type neuron models can be reduced to the two-dimensional system of ordinary differential equations (Eqs. 1, 2) using a variety of techniques [54].

$$v^{\prime} = 0.04 \times v^{2} + 5 \times v + 140 - u + I$$
(1)
$$u^{\prime} = a \times \left( {b \times v - u} \right)$$
(2)

With reset after auxiliary spike:

$$if\,v > V_{th} \,;\,then\,\left\{ {\begin{array}{*{20}c} {v \leftarrow c} \\ {u \leftarrow u + d} \\ \end{array} } \right.$$
(3)

where the prime denotes d/dt (t is time), and v and u are dimensionless variables; a, b, c, and d are dimensionless parameters. The variable u represents the membrane recovery variable, which accounts for the activation of K+ ion currents and the inactivation of Na+ ion currents, providing negative feedback to the variable v. The variable v reflects the membrane potential of the neuron. The membrane voltage and the recovery variable are reset by Eq. (3) after the spike reaches its peak (i.e., Vth). The variable I is used to supply injected DC currents or synaptic currents. In this study, this model is used to define the input neurons that represent the features gleaned from audio frames as distinct spike trains.
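
As an illustration of how Eqs. (1)–(3) can turn a normalized feature value into a spike train, the following Python sketch integrates the model with a simple Euler scheme using the regular-spiking parameter set from [54]. The linear mapping from the feature value to the injected current I is our assumption and is not specified in the text.

```python
import numpy as np

def izhikevich_spike_train(feature, T=500.0, dt=0.25,
                           a=0.02, b=0.2, c=-65.0, d=8.0, v_th=30.0):
    """Simulate Eqs. (1)-(3); `feature` in [0, 1] scales the injected current I."""
    I = 5.0 + 15.0 * feature            # assumed linear feature-to-current mapping
    v, u = c, b * c                      # resting initial conditions
    spikes = []
    for step in range(int(T / dt)):
        v += dt * (0.04 * v ** 2 + 5.0 * v + 140.0 - u + I)   # Eq. (1)
        u += dt * (a * (b * v - u))                           # Eq. (2)
        if v >= v_th:                                         # Eq. (3): spike and reset
            spikes.append(step * dt)
            v, u = c, u + d
    return np.array(spikes)

# A larger feature value yields a higher firing rate (rate coding of the feature).
print(len(izhikevich_spike_train(0.2)), len(izhikevich_spike_train(0.9)))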

3.1.2 Integrate-and-Fire (IF) Neurons

This is the most popular neuron model typically employed in spiking neural networks, and it is grounded in basic circuit theory. A spike traveling down the axon is modulated by a low-pass channel and turned into a brief current pulse, I(t − tj(f)), which charges the RC circuit; the resulting voltage rise contributes the postsynaptic potential ε(t − ti(f)). When the voltage rises above the threshold level, the neuron delivers a pulse [55,56,57].

$$\tau_{m} \frac{du}{dt} = - u\left( t \right) + RI\left( t \right).$$
(4)

The membrane potential u "leaks" away with the membrane time constant τm. As in the spike response model, the neuron fires when u crosses the threshold, and a brief pulse is emitted.

$${I}_{i}(t)=\sum_{j\in {\Gamma }_{i}}{c}_{ij}\sum_{{t}_{j}}\delta \left(t-{t}_{j}^{\left(f\right)}\right).$$
(5)

Because the pulses have negligible width, the input current of an IF neuron is zero most of the time. The synaptic efficacy factor cij scales each spike, creating a postsynaptic potential that charges the capacitor. Because the model is computationally straightforward, it can readily be used in large-scale hardware implementations [58].
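
One way to implement Eqs. (4) and (5) in discrete time is sketched below: each presynaptic spike injects a current weighted by cij, the membrane potential decays with τm, and a spike is emitted and the potential reset when the threshold is crossed. This is a minimal illustration; the threshold, reset value, weights, and input statistics are our own choices, not the paper's settings.

```python
import numpy as np

def lif_neuron(spike_trains, c, tau_m=20.0, R=1.0, dt=1.0,
               u_th=0.5, u_reset=0.0):
    """spike_trains: binary array (n_inputs, n_steps); c: synaptic weights (n_inputs,)."""
    n_steps = spike_trains.shape[1]
    u, out = u_reset, np.zeros(n_steps)
    for t in range(n_steps):
        I_t = np.dot(c, spike_trains[:, t])        # Eq. (5): weighted sum of input spikes
        u += dt / tau_m * (-u + R * I_t)           # Eq. (4): leaky integration
        if u >= u_th:                              # threshold crossing -> fire and reset
            out[t] = 1.0
            u = u_reset
    return out

# Three random binary input trains with random positive weights.
rng = np.random.default_rng(1)
inputs = (rng.random((3, 200)) < 0.2).astype(float)
print(int(lif_neuron(inputs, c=rng.uniform(1.0, 2.0, 3)).sum()), "output spikes")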

The integrate-and-fire neuron is depicted in simplified form in Fig. 5. The capacitor is charged by a current pulse I(t) generated by the low-pass filter shown on the left. When the voltage u across the capacitor exceeds the threshold, the schematic model on the right produces a spike.

Fig. 5

Schematic representation of the integrate-and-fire neuron model [58]

3.2 Neural Network Training Plan

SNNs process information from the input layer as spike trains. Because of this, dedicated procedures are required to encode the feature vectors of continuous-valued audio signals as spike trains, process them in the intermediate layers, and decode the classification results from the activity of the output neurons. In this study, we present a spiking neural coding scheme for this task in which the weighting is defined by a fuzzy system and a meta-heuristic algorithm is used for the search. This section details the procedures for training a deep spiking neural network with a fuzzy weighting system (FWS-DSNN).

The main contribution of this study is a method for training FWS-DSNN-based deep acoustic models using speech features that are commonly used for digit categorization. We have used an 8 kHz audio data collection featuring male and female speakers in a variety of environments and noise levels. Features such as mel-frequency cepstral coefficients (MFCC), the zero-crossing rate (ZCR), and acoustic power are derived from these signals and used to categorize digits. These input voice features are contextualized by concatenating numerous frames, so as to exploit greater temporal context, before being fed into the FWS-DSNN. Before training the SNN-based acoustic model, a feature selection stage based on the gray wolf optimization meta-heuristic is used to align the speech features with the target labels according to the dispersion of each feature across classes and the density of each feature within a class. Algorithm 1 presents the pseudo-code of the objective function used to choose the right features.

Algorithm 1

Pseudo-code of the objective function used to select and reduce the features of the audio signals

Using this function, the optimal features for classification are those that show the highest density within each class and the highest dispersion across the different classes; one plausible reading of this criterion is sketched below. The gray wolf optimization process is then used to finalize the feature selection.
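
The exact formulation of the objective appears only as pseudo-code (Algorithm 1). The Python sketch below is one plausible reading of it: a candidate feature subset is scored by the ratio of average within-class variance (density) to the variance of the class means (dispersion), so that lower values are better. The function name, formula, and synthetic data are our assumptions.

```python
import numpy as np

def feature_subset_fitness(X, y, subset):
    """X: (n_samples, n_features); y: class labels; subset: indices of selected features.

    Lower is better: features dense within each class, widely dispersed across classes."""
    Xs = X[:, subset]
    classes = np.unique(y)
    class_means = np.array([Xs[y == c].mean(axis=0) for c in classes])
    within = np.mean([Xs[y == c].var(axis=0).mean() for c in classes])  # density term
    between = class_means.var(axis=0).mean() + 1e-12                    # dispersion term
    return within / between

# Example: score a random 7-feature subset of a synthetic 77-feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 77))
y = rng.integers(0, 10, size=120)
print(feature_subset_fitness(X, y, subset=rng.choice(77, size=7, replace=False)))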

The phases of FWS-DSNN training are then carried out using the selected features. The input layer, built from Izhikevich-model neurons, generates a distinct spike train for each selected feature. The weights and thresholds must be defined at this point using the training data. With these records, a sequential learning technique can be used to refine the FWS-DSNN deep acoustic model. Using input spike trains propagated through several intermediate and output layers of spiking neurons, FWS-DSNN learns to exploit the chosen input speech features through fuzzy weighting and threshold selection optimized with the wild horse optimization algorithm.

This fuzzy optimization scheme with WHO first operates on the frame-based input feature vector X, for example MFCC or filter-bank (FBANK) features, where the feature vector [x1, x2, …, xn] is processed as follows.

$$\left[ {w_{1} ,w_{2} , \ldots ,w_{N} } \right] = {\text{weightingfunc}}\left( {Z_{1} ,Z_{2} ,N} \right)$$
$$F = \sum\limits_{i = 1}^{N} {w_{i} \times x_{i} }$$

where N is the total number of neuron inputs and Z1 and Z2 are the fuzzy model's weight-generation parameters. The weight of the i-th input is denoted by wi.

The novel aspect of this work is that it uses a fuzzy logic system to generate the many random weights. The random weight set is generated using a mathematical function of the form A·cos(t·s). As a result, only two parameters, Z1 and Z2, need to be calculated to define a neuron in a multi-layer network with numerous neurons in each layer, so the neurons of each layer are defined with far fewer parameters. The simplification of FWS-DSNN training is a direct result of this reduction in parameters. Algorithm 2 shows the implementation of the proposed fuzzy weighting function.
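
The precise mapping from (Z1, Z2) to the N weights is given only in Algorithm 2. As a concrete illustration, the sketch below assumes that the A·cos(t·s) form is sampled at the input indices, i.e. wi = Z1·cos(i·Z2), and that the neuron's drive is the weighted sum F defined above. Both the mapping and the parameter values shown are our assumptions.

```python
import numpy as np

def weighting_func(z1, z2, n):
    """Generate n pseudo-random weights from only two parameters (assumed form)."""
    i = np.arange(1, n + 1)
    return z1 * np.cos(i * z2)          # w_i = Z1 * cos(i * Z2)

def fws_neuron_drive(x, z1, z2):
    """F = sum_i w_i * x_i for one FWS neuron."""
    w = weighting_func(z1, z2, len(x))
    return float(np.dot(w, x))

x = np.array([0.2, 0.7, 0.1, 0.9, 0.5, 0.3, 0.8])   # e.g. 7 selected features
print(fws_neuron_drive(x, z1=0.6, z2=1.3))
```

With such a parameterization, a layer of M neurons needs only 2M weighting parameters (plus one threshold each) instead of M times the number of inputs, which is the source of the parameter reduction discussed below.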

Algorithm 2

Pseudo-code of the objective function that generates random weights from the parameters Z1 and Z2

The fuzzy system created to provide random weights through the objective function is shown in Fig. 6. Figure 6a is a high-level diagram of the fuzzy system, and Fig. 6b describes the proposed system's input and output membership functions. The fuzzy rules specified in Table 1 are visualized in Fig. 6c, which depicts the system's input–output characteristic. We modify the system's Z1 value to introduce dispersion among the different weights according to the stated rules.

Fig. 6

Introducing the fuzzy system of generating random weights

Table 1 The fuzzy rules of the proposed system

In this regard, we have optimized the calculation of the Z1 and Z2 parameters, as well as the threshold value (Vth), using the wild horse optimization technique. Through the fuzzy weighting of the various neurons in the various layers we obtain a random distribution of weights and threshold values, which paves the way for training the FWS-DSNN network. During the inference stage, the audio digit produced by the trained FWS-DSNN model is combined with information from the language model and pronunciation vocabulary. Rather than estimating every individual weight with a meta-heuristic algorithm, FWS-DSNN training is greatly simplified by using the FWS random fuzzy weighting system as a weight generator. The primary reasons for employing FWS-based training are (1) the reduction in the number of individual weights for the neurons in each layer achieved by the proposed random weighting function, and (2) the acceleration of the decoding process enabled by effective search algorithms over the FWS. The search procedure generates a lattice containing the most likely hypotheses, and based on a weighted sum of the spike trains associated with the network hypotheses, the system selects the recognized digit for the audio signal.
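
The WHO search itself is not given in code form in the text; the sketch below only shows the shape of the optimization problem it solves, with each neuron contributing a (Z1, Z2, Vth) triple and the objective returning the classification error of the resulting network. For illustration we plug the objective into SciPy's differential evolution as a stand-in for the WHO algorithm; `evaluate_fws_snn` is a hypothetical placeholder, and the bounds and population settings are ours.

```python
import numpy as np
from scipy.optimize import differential_evolution

def evaluate_fws_snn(params, X_train, y_train):
    """Hypothetical placeholder: build the FWS-SNN from per-neuron (Z1, Z2, Vth)
    triples, run the spike-based forward pass, and return the error rate."""
    z1, z2, vth = params[0::3], params[1::3], params[2::3]
    # ... forward pass with fuzzy weights and LIF neurons would go here ...
    return float(np.mean(params ** 2))   # dummy score standing in for the error rate

def train_fws_snn(X_train, y_train, layer_sizes=(15, 8, 1)):
    n_neurons = sum(layer_sizes)                       # 24 neurons -> 72 parameters
    bounds = [(-1.0, 1.0), (0.0, np.pi), (0.1, 2.0)] * n_neurons
    result = differential_evolution(                   # stand-in for the WHO search
        evaluate_fws_snn, bounds, args=(X_train, y_train),
        maxiter=20, popsize=10, seed=1)
    return result.x, result.fun

X, y = np.zeros((120, 7)), np.zeros(120)               # placeholder training data
best_params, best_err = train_fws_snn(X, y)
print(best_params.reshape(-1, 3)[:3], best_err)        # first three (Z1, Z2, Vth) triples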

With this training technique, the weighted transformation is carried out throughout the entire learning process. The original input features are modified in this representation according to the intended digit, and the information is then classified using the RMS spike value of the output neuron. Since the input data can be processed quickly inside the constrained encoding window, this encoding technique is helpful for fast inference. Starting with the first intermediate encoding layer, as shown in Fig. 4, sequential learning is applied to each neuron in the successive FWS-DSNN layers. Algorithm 3 presents the neuron definition used by the FWS-DSNN system.

Algorithm 3

Pseudo-code for spike neuron generation with fuzzy weighting system

For neural decoding, we use the summed free membrane potential of the output spiking neurons, which yields smooth learning with high-precision error gradients in the output layer. Because the dimensions of the input feature vectors and of the output classes are substantially smaller than those of the hidden layers, the computations needed to deploy these two layers at the decision boundary produced by the WHO algorithm remain limited.

Concatenated neural networks were constructed in this way, as shown in Fig. 4. To calculate the precise spike representation, forward activation propagation passes through the SNN layers, which then pass the total spike counts and spike trains on to succeeding FWS-DSNN layers. This connected layer structure coordinates the information conveyed to the interconnected FWS-DSNN layers. Algorithm 4 lays out the specifics of this sequential learning rule.

Algorithm 4

Pseudo-code for sequential learning rule

4 Simulation and Discussion of Results

4.1 Dataset and Feature Extraction

The TIDIGITS dataset consists of audio clips of speakers reading out digits. It is a straightforward audio/speech dataset of spoken digits recorded as WAV files at a sampling rate of 8 kHz, and the recordings are trimmed so that there is almost no silence at the beginning or end. This collection was compiled following references [29, 59]. The extracted features are:

  • Mel-frequency cepstral coefficients (MFCCs)—coefficients that form a spectral representation of sound based on frequency bands spaced according to the Mel scale, which approximates the response of the human auditory system.

  • Chroma—corresponds to the 12 different pitch classes.

  • Average Mel Spectrogram—spectrogram based on the Mel scale.

  • Spectral contrast—measures the difference between spectral peaks and valleys in each sub-band.

  • Tonnetz—represents the tonal space.

These attributes are NumPy arrays with the following sizes: 20, 12, 128, 7, and 6. These are concatenated to create a feature array of size (173); a sketch of this extraction pipeline is given below. Each entry in the Excel file has its label, which is placed at the head of the array.
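
The text does not name the extraction library; the sketch below reproduces the described 173-dimensional vector using librosa (our assumption), averaging each feature matrix over time before concatenation. The file name is hypothetical, and the exact settings used by the authors are not given, so the defaults shown are illustrative.

```python
import numpy as np
import librosa

def extract_feature_vector(wav_path):
    # librosa resamples to 22.05 kHz by default; the corpus itself is recorded at 8 kHz.
    y, sr = librosa.load(wav_path)
    mfcc     = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)             # 20 rows
    chroma   = librosa.feature.chroma_stft(y=y, sr=sr)                 # 12 rows
    mel      = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # 128 rows
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)           # 7 rows
    tonnetz  = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)  # 6 rows
    # Average each feature matrix over time and concatenate: 20+12+128+7+6 = 173 values.
    return np.concatenate([m.mean(axis=1) for m in
                           (mfcc, chroma, mel, contrast, tonnetz)])

features = extract_feature_vector("digit_sample.wav")   # hypothetical file name
print(features.shape)                                    # (173,)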

Several publications have distinguished different types of spoken digits primarily using mel-frequency cepstral coefficients (MFCCs) [31, 32]. The mel-frequency cepstrum (MFC) represents the short-term power spectrum of speech on the non-linear Mel frequency scale and is derived from a linear cosine transform of the log power spectrum. The MFC is composed of components called MFCCs. The MFCCs are commonly computed as follows [33, 34] (a compact sketch is given after the list):

  • Take the Fourier transform of a windowed excerpt of the signal.

  • Map the powers of the resulting spectrum onto the Mel scale using overlapping windows.

  • Take the logarithm of the power at each Mel frequency.

  • Take the discrete cosine transform of the list of log Mel powers, treating it as a signal.

  • The MFCCs are the amplitudes of the resulting spectrum.
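
For completeness, the steps above can be written out directly. The sketch below is a bare-bones single-frame version; the window length, filter count, and coefficient count are illustrative and are not the paper's settings.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=8000, n_fft=256, n_mels=26, n_mfcc=13):
    # 1) Fourier transform of a windowed excerpt -> power spectrum
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # 2) Triangular Mel-spaced filterbank mapped onto FFT bins
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 3) Logarithm of the energy in each Mel band
    log_mel = np.log(fbank @ spec + 1e-10)
    # 4) DCT of the log Mel energies; keep the first n_mfcc coefficients
    return dct(log_mel, norm="ortho")[:n_mfcc]

frame = np.random.default_rng(0).normal(size=200)   # one 25 ms frame at 8 kHz
print(mfcc_frame(frame).shape)                       # (13,)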

In this study, we apply a statistical feature model to extract a total of 77 features from each audio signal; these features include the coefficients estimated from windows of the signal, as well as the zero-crossing rate and the power of the signal. Because of the large computational load involved, only 7 main features are selected in this case study, with the help of the gray wolf algorithm.

The simulation of the proposed design was carried out in MATLAB, and the results for different training and testing examples were reviewed and analyzed. In the feature reduction stage, a feature-suitability objective function for classifying the digits 0–9 is used to choose features with the aid of the GWO algorithm. MATLAB's variance function can be used to determine which features should be prioritized, based on their ability to distinguish between classes and their density within those classes. The pseudo-code of this objective function is given below (Algorithm 5):

Algorithm 5

Pseudo-code for feature selection fitness function

Consequently, the optimal features are those that minimize the objective function's output value; they are extracted from the audio signals and fed into the training network together with the set of audio signals. The GWO algorithm is used to find the most important features. The block diagram of the training process is shown in Fig. 4.

To train the proposed network, 7 features are first picked from the set of extracted features using the GWO algorithm and the approach of reducing the audio signal features, as shown in Fig. 7, which depicts the effectiveness of the GWO algorithm in selecting features. The GWO algorithm was run with 50 population members for 500 iterations; as can be seen from the figure, it reached its final minimum after 44 iterations. With the help of this optimization, the highest-quality features are selected for the classification of digits 0–9 from the audio signals and the rest are removed; the features that exhibit the least compatibility with the classification task are eliminated. In this way, the training complexity for the spiking network is lowered and fewer input neurons are needed for the network being trained.

Fig. 7

Performance of the GWO algorithm for feature selection (selected feature indices: 15, 13, 10, 9, 14, 2, 7)

The SNN layout consists of an input layer with seven neurons, matching the number of selected features, a variable number of hidden layers with predetermined numbers of neurons, and an output layer with a single neuron. In the input layer of FWS-SNN, an Izhikevich neuron model is used to convert each real-valued stimulus into a spike train whose amplitude and firing rate correspond to the normalized value of the input feature. For the intermediate-layer neurons, with defined threshold voltages and weights, the leaky integrate-and-fire (LIF) neuron model is employed; the threshold voltage and weight values of each neuron are learned using the WHO algorithm and the fuzzy weighting scheme. A LIF neuron model, whose output is determined by the average structure of the various classes, is employed in the output layer.

The suggested network is trained using the wild horse algorithm. For the classification of audio signals for the digits 0–9 and "oh" (O), the suggested FWS-SNN method with the WHO algorithm has been compared with two machine learning methods: a feedforward ANN and the adaptive neuro-fuzzy inference system (ANFIS). These methods are evaluated on a test set of 30 samples after being trained on 120 samples under the different techniques, and the results are shown in Fig. 8. All of the machine learning networks in this study are three-layer networks with layer sizes [15 8 1]. Figure 8 shows the results of several analyses and comparisons between the training and test data sets. The proposed technique, with the help of meta-heuristic algorithms, produces results with high accuracy and good agreement. Table 2 shows the comparison results for the evaluated methods. As shown, the WHO-FWS-SNN technique achieves the highest accuracy among the compared machine learning methods. The class labels for the SNN are normalized to the network's operating range [0, 0.25], corresponding to labels 1 to 11 for the digits 1 to 9, "oh" (O), and zero, respectively. The WHO algorithm, with a population of 30 and 50 or 300 iterations, was used to compute the weights and threshold values.

Fig. 8

Displaying the training results of TIDIGITS test data with the help of machine learning methods. A ANFIS. B ANN. C WHO-SNN. D WHO-FWS-SNN

Table 2 Comparison of accuracy results of different machine learning methods

4.2 Discussion

The current study developed an SNN training method based on fuzzy weighting and applied it to a dataset for audio-signal digit recognition. We used the GWO and WHO meta-heuristic algorithms to reduce the input features and to train the network's learning parameters. We compared the classification performance of this method with the unmodified WHO-SNN and the classic ANFIS and ANN methods in Fig. 8. The training and testing results of the classical approaches exhibit considerable dispersion and correspondingly reduced accuracy, as seen in Fig. 8a and b. Figure 8c shows the training results of the proposed network for the WHO-SNN method, whose performance is evaluated on the test data set. Because there are many training parameters, this method requires more time to train the network, which results in lower accuracy and increased complexity. However, we were able to attain better training speed and accuracy using the WHO-FWS-SNN method, which incorporates the fuzzy weighting methodology and reduces the number of training parameters.

In Table 2, we compare the WHO-FWS-SNN network with the WHO-SNN network for three training datasets: digit recognition, IRIS, and Trip Data. Based on the accuracy results obtained with the proposed technique on these datasets, we improved on the WHO-SNN, ANN, and ANFIS methods. The WHO algorithm no longer needs to calculate a huge number of parameters to reach the final minimum, because the search parameters defining the weights of the neurons in the different layers are reduced in this technique. For instance, in the proposed [15 8 1] network with 7 selected input features, the WHO-SNN network requires 257 parameters to be calculated by the WHO algorithm (7 × 15 + 15 × 8 + 8 × 1 = 233 connection weights plus 24 threshold values), whereas with our suggested fuzzy weighting technique only 72 parameters need to be calculated to train the WHO-FWS-SNN network (Z1, Z2, and a threshold for each of the 24 neurons). This makes it much easier to train the suggested network. The unmodified problem becomes extremely difficult to train when the number of intermediate layers and the number of neurons in each layer increase; with our fuzzy weighting method, however, the number of parameters is significantly reduced.

The results in Table 2 show that the fuzzy weighting algorithm achieves good relative accuracy for the proposed method on each of the different datasets compared with the unmodified and traditional methods. For the IRIS dataset, the proposed method achieved the highest accuracy, 98.93%, compared with WHO-SNN, ANN, and ANFIS at 95.47, 96.27, and 95.85%, respectively. On the Trip Data dataset, the proposed method achieved the highest accuracy (97.36%) compared with WHO-SNN, ANN, and ANFIS at 96.12, 90.75, and 88.43%, respectively. Finally, on the digit recognition dataset, the proposed method achieved the highest accuracy, 97.18%, compared with WHO-SNN, ANN, and ANFIS at 92.3, 84.38, and 81.2%, respectively.

Improving the training accuracy of the SNN is another concern of this research. By adjusting the number of layers and the number of neurons in each layer, we attempted to examine the results and obtain the highest accuracy. The outcomes of the suggested WHO-FWS-SNN and WHO-SNN training techniques for the case study are shown in Table 3. It is clear from these results that expanding the number of neurons and layers alone does not improve accuracy; an efficient search over the possible numbers of layers and neurons is required.

Table 3 Comparison of accuracy on the TIDIGITS dataset for the WHO-FWS-SNN and WHO-SNN methods with different network structures

Table 3 examines the accuracy of the proposed learning methods for the 0–9 digit recognition dataset in SNNs with different structures; two-, three-, and four-layer structures with different numbers of neurons have been used. In all structures, the WHO-FWS-SNN learning method is more accurate than WHO-SNN. As can be seen, however, increasing the number of neurons and layers did not yield higher accuracy, because the optimization problem becomes more complex for the proposed algorithms. According to the comparison in Table 3, the [15 8 1] structure achieves the best accuracy.

Table 4 compares previous related works with the proposed method in terms of the techniques used, pre-processing techniques, datasets, evaluation measures, and the advantages and disadvantages of each technique. This comparison allows our proposed technique to be analyzed alongside recent works.

Table 4 Comparison of the proposed method with previous related works (techniques, pre-processing, datasets, evaluation measures, advantages, and disadvantages)

5 Conclusion

5.1 Inference

For the digit classification problem on audio signals, a new modified learning method, FWS-SNN, is presented in this research. In the input layer of FWS-SNN, an Izhikevich neuron model is used to convert each real-valued stimulus into a spike train with a corresponding amplitude and firing rate. The leaky integrate-and-fire (LIF) neuron model is used for the intermediate layers and the output layer, and the WHO algorithm together with the fuzzy weighting system is used to learn the input weight values of each neuron. Figure 4 presents the flowchart of the proposed learning algorithm for a 0–9 digit recognition system. After the number of layers and the number of neurons in each layer are defined according to the steps described in the flowchart, the proposed WHO-FWS-SNN learning algorithm determines the weights and threshold of each neuron in the intermediate layers and the output layer. The synaptic weights are adapted using a random fuzzy weighting system with reference to the training data set. The weight update rules used by the FWS learning algorithm have a low processing cost and require only two parameters per neuron, which simplifies the difficulty of training the investigated neural network. The fuzzy weighting technique reduced the number of network parameters needed for weighting the neurons and thereby decreased the training complexity of the suggested SNN by 71.9%. To train the proposed network, we have also used a feature reduction technique: based on the feature distribution and density provided for each class, the gray wolf optimization (GWO) algorithm determines which features of the input speech signal are most relevant to the digit labels from the corresponding language model, and this process is repeated for each extracted feature. Our investigation of potential network topologies and learning-parameter optimizations was based on previous numerical experiments conducted on the IRIS and Trip Data datasets, with the goal of establishing a network with the best possible accuracy. In comparison with other machine learning techniques (ANN, ANFIS), the suggested methods have demonstrated good accuracy in a number of experiments.

5.2 Future Works

In future work, the proposed learning technique will be investigated in different scenarios for other classification applications, and the performance of the WHO-FWS-SNN technique can be investigated and improved with new algorithms. To further improve classification performance in various applications, we would like to investigate the training of recurrent networks of spiking neurons by adding a feedback connection to each neuron. To increase the accuracy of SNN networks, we can also continue this work by defining neurons with higher-order models.