In existing works, various techniques have been employed to convert isolated spoken words into text. The proposed SDRN uses handcrafted features: the AMS technique extracts features from the input speech signal, and the extracted features are then fed to a DNN for training and testing. The neurons of the DNN's hidden layers are optimized using the OWOA optimization technique, and the resulting method is referred to as the neural-based opposition whale optimization algorithm (NOWOA) in this paper. The OWOA algorithm is a recent optimization algorithm proposed by Alamri et al. (2018). It is an efficient algorithm that has been applied in research areas such as solar cell diode modeling (Abd Elaziz and Oliva 2018).
Whales are considered highly intelligent animals. WOA is a recently developed optimization technique that mimics the hunting behavior of humpback whales. The algorithm already achieves a good accuracy level, and for further enhancement an opposition-based strategy is incorporated.
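The opposition-based idea can be illustrated with a generic sketch. The function below is an illustration of opposition-based initialization in general, not the authors' exact OWOA implementation; the function name, population size, and fitness function are assumptions of the sketch.

```python
import random

def opposition_init(pop_size, dim, lb, ub, fitness):
    """Generic opposition-based initialization sketch (illustrative only):
    evaluate each random candidate and its opposite point lb + ub - x,
    and keep whichever of the two is fitter (minimization assumed)."""
    population = []
    for _ in range(pop_size):
        x = [random.uniform(lb, ub) for _ in range(dim)]
        x_opp = [lb + ub - xi for xi in x]              # opposite point
        population.append(min(x, x_opp, key=fitness))   # keep the better one
    return population

# toy usage: minimize the sum of squares over [-5, 5]^3
pop = opposition_init(10, 3, -5.0, 5.0, lambda v: sum(vi * vi for vi in v))
```

Evaluating the opposite point costs one extra fitness call per candidate but tends to start the search closer to the optimum, which is the motivation for adding opposition to WOA.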
In this work, 375 features of the input speech signals are used as input to the DNN. Two types of databases are used in this work, namely a standard database and a real-time database. For the standard database, the TIMIT (Garofolo et al., 1993) corpus of read speech is used. This database was developed to provide speech data for acoustic-phonetic studies and for the development of automatic speech recognition systems. It contains recordings of 630 speakers from 8 major dialect regions of the United States. The TIMIT corpus contains a total of 6300 utterances, 10 sentences spoken by each of those speakers. Each utterance is stored as a 16 kHz speech waveform file. The core test set contains 192 different texts. The chosen texts were checked for the presence of at least one phoneme. Figure 1 describes the conceptual methodology of the proposed system.
Among these 6300 speech signals, 70% are used for training and the remaining 30% for testing. For validation, 60 isolated (real-time) speech signals recorded in ambient conditions are used, of which 70% are used for training and the remaining 30% for testing. For the validation of continuous (real-time) speech, 110 speech signals are considered, of which 70% are used for training and the remaining 30% for testing. Then, 375 features are extracted from these speech signals. Initially, features are extracted from the input speech corpus using AMS. The input, a combination of clean and noisy signals, is pre-processed by normalizing, quantizing, and windowing.
The acquired signals are decomposed into different time–frequency (TF) units using bandpass filters, which pass the signal components within a specified frequency range. The signals are split into 25 TF units, each attached to a channel Ci, where i = 1, 2, 3, …, 25. The signal frequencies are thus characterized within the range of each individual channel.
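One simple way to sketch this channel decomposition is to partition the magnitude spectrum of each windowed frame into 25 contiguous bands. This is an illustrative stand-in for the actual bandpass filterbank, not the authors' filter design; the function names and the toy frame are assumptions.

```python
import cmath

def dft_magnitude(frame):
    """Naive DFT magnitude spectrum of one windowed frame (positive bins)."""
    N = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                    for n in range(N))) for k in range(N // 2)]

def split_into_channels(frame, num_channels=25):
    """Partition the spectrum into contiguous bands C_1..C_25 and return the
    per-channel energy (illustrative stand-in for a bandpass filterbank)."""
    spec = dft_magnitude(frame)
    band = max(1, len(spec) // num_channels)
    return [sum(m * m for m in spec[i * band:(i + 1) * band])
            for i in range(num_channels)]

frame = [((n % 8) - 3.5) / 3.5 for n in range(200)]  # toy periodic frame
energies = split_into_channels(frame)                # one energy per channel
```

A real filterbank would use overlapping bandpass responses (e.g., gammatone or mel-spaced filters) rather than hard spectral partitions, but the per-channel view of the signal is the same.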
Let the feature vector (FV) be denoted by afr(λ,φ), where φ is the time slot and λ is the sub-band. To capture small changes in the TF domain, we additionally apply the delta function Δati to the extracted features, as given below, where ti is the number of time slots and B is the number of sub-band channels.
$$\Delta a_{ti} (\lambda ,\varphi ) = a_{fr} (\lambda ,\varphi ) - a_{fr} (\lambda ,\varphi - 1),\quad {\text{where}}\;\varphi = 2, \ldots ,ti$$
(1)
The frequency delta function Δad is given as:
$$\Delta a_{d} (\lambda ,\varphi ) = a_{fr} (\lambda ,\varphi ) - a_{fr} (\lambda - 1,\varphi ),\quad {\text{where}}\;\lambda = 2, \ldots ,{\text{B}}$$
(2)
The cumulative FV a (λ,φ) can be defined as:
$$a(\lambda ,\varphi ) = [a_{fr} (\lambda ,\varphi ),\,\,\Delta a_{ti} (\lambda ,\varphi ),\,\,\Delta a_{d} (\lambda ,\varphi )]$$
(3)
In this way, the features are extracted from the input speech signal using the AMS technique, which will be then used as input for DNN.
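Equations (1)–(3) can be sketched directly. Below is a minimal illustration in which the AMS features afr are stored as a 2-D list indexed [sub-band][time slot]; the toy matrix is an assumption for the example.

```python
def delta_time(a_fr, lam, phi):
    """Time delta of Eq. (1): difference between adjacent time slots."""
    return a_fr[lam][phi] - a_fr[lam][phi - 1]

def delta_freq(a_fr, lam, phi):
    """Frequency delta of Eq. (2): difference between adjacent sub-bands."""
    return a_fr[lam][phi] - a_fr[lam - 1][phi]

def feature_vector(a_fr, lam, phi):
    """Cumulative feature vector of Eq. (3): [a_fr, delta_time, delta_freq]."""
    return [a_fr[lam][phi],
            delta_time(a_fr, lam, phi),
            delta_freq(a_fr, lam, phi)]

# toy AMS feature matrix: 3 sub-bands x 4 time slots
a_fr = [[1.0, 2.0, 3.0, 4.0],
        [2.0, 4.0, 6.0, 8.0],
        [3.0, 6.0, 9.0, 12.0]]
fv = feature_vector(a_fr, 1, 2)  # -> [6.0, 2.0, 3.0]
```

Stacking such vectors over all sub-bands and time slots yields the 375-dimensional feature set that is fed to the DNN.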
Deep neural network
A DNN is a network with a certain level of complexity, comprising multiple layers. A DNN uses sophisticated mathematical modeling to process data in complex ways. A DNN with many layers typically combines feature extraction and classification into a single learning body. These kinds of neural networks have achieved success in recent years in complex pattern recognition tasks. The network consists of an input layer, hidden layers, and an output layer. For a P + 1 layer DNN framework, the input layer is taken as layer 0 and the output layer as layer P, as given in Eqs. 4 and 5.
$${{\varvec{x}}}^{p}=f\left({{\varvec{y}}}^{p}\right)=f\left({{\varvec{W}}}^{p}{{\varvec{x}}}^{p-1}+{{\varvec{b}}}^{p}\right), 0<p<P$$
(4)
$${{\varvec{y}}}^{p}={{\varvec{W}}}^{p}{{\varvec{x}}}^{p-1}+{{\varvec{b}}}^{p}$$
(5)
Here x is the activation vector, y is the excitation vector, W is the weight matrix, and b is the bias vector.
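The forward pass of Eqs. (4) and (5) can be sketched as follows. This is a minimal illustration assuming a sigmoid activation f; the toy weights and layer sizes are assumptions of the sketch, not the trained network.

```python
import math

def sigmoid(v):
    """Elementwise logistic activation f applied to an excitation vector."""
    return [1.0 / (1.0 + math.exp(-vi)) for vi in v]

def excitation(W, x, b):
    """Eq. (5): y^p = W^p x^{p-1} + b^p."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def forward(layers, x):
    """Eq. (4): x^p = f(W^p x^{p-1} + b^p), applied layer by layer."""
    for W, b in layers:
        x = sigmoid(excitation(W, x, b))
    return x

# toy 2-layer network: 2 inputs -> 2 hidden units -> 1 output
layers = [([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0]),
          ([[1.0, -1.0]], [0.5])]
out = forward(layers, [1.0, 2.0])
```

In the actual system the input dimension is 375 (the extracted AMS features) and the hidden-layer neuron counts are the quantities tuned by OWOA.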
Stochastic gradient descent with backpropagation (Bengio, 2012; LeCun et al., 2012) is used for learning the weights of the DNN. The difference between each output and its target value is converted into an error derivative. In each hidden layer, error derivatives are then computed from the error derivatives in the layer above. These error derivatives with respect to the activities are used to obtain the error derivatives with respect to the incoming weights, as given in Eqs. 6–9. Here e is the error, y′ is the target value, and y is the output.
$$\begin{gathered} e_{{}} = \frac{1}{2}\sum\limits_{j \in output} {\;(y^{\prime}_{j} } - y_{j} )^{2} \hfill \\ \frac{{\partial e_{{}} }}{{\partial y_{j} }} = - (y^{\prime}_{j} - y_{j} ) \hfill \\ \end{gathered}$$
(6)
$$\frac{\partial e}{{\partial x_{j} }}\;\; = \;\;\frac{{dy_{j} }}{{dx_{j} }}\;\frac{\partial e}{{\partial y_{j} }}\;\; = \;\;y_{j} \;(1 - y_{j} )\;\frac{\partial e}{{\partial y_{j} }}$$
(7)
$$\frac{\partial e}{{\partial y_{i} }}\;\; = \;\;\sum\limits_{j} {\frac{{dx_{j} }}{{dy_{i} }}\frac{\partial e}{{\partial x_{j} }}} \;\; = \;\;\sum\limits_{j} {w_{ij} \frac{\partial e}{{\partial x_{j} }}}$$
(8)
$$\frac{\partial e}{{\partial w_{ij} }}\;\; = \;\;\frac{{\partial x_{j} }}{{\partial w_{ij} }}\;\;\frac{\partial e}{{\partial x_{j} }}\;\; = \;\;y_{i} \frac{\partial e}{{\partial x_{j} }}$$
(9)
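A minimal sketch of Eqs. (6)–(9) for a single sigmoid output layer is given below, assuming y_j = σ(x_j) so that dy_j/dx_j = y_j(1 − y_j); the variable names and toy values are illustrative.

```python
def backprop_output_layer(y, y_target, y_prev, W):
    """Gradients of Eqs. (6)-(9) for one sigmoid output layer.
    y: outputs y_j, y_target: targets y'_j, y_prev: activities y_i of the
    layer below, W[i][j]: weight from unit i below to output unit j."""
    # Eq. (6): de/dy_j = -(y'_j - y_j)
    de_dy = [-(t - o) for t, o in zip(y_target, y)]
    # Eq. (7): de/dx_j = y_j (1 - y_j) de/dy_j
    de_dx = [o * (1.0 - o) * g for o, g in zip(y, de_dy)]
    # Eq. (8): de/dy_i = sum_j w_ij de/dx_j
    de_dy_prev = [sum(w_ij * g for w_ij, g in zip(row, de_dx)) for row in W]
    # Eq. (9): de/dw_ij = y_i de/dx_j
    de_dW = [[yi * g for g in de_dx] for yi in y_prev]
    return de_dy_prev, de_dW

# toy layer: 2 units below, 1 output unit
grads_prev, grads_W = backprop_output_layer(
    y=[0.5], y_target=[1.0], y_prev=[1.0, 2.0], W=[[0.1], [0.2]])
```

The returned de/dy_i values are what the layer below receives as its "error derivatives from the layer above", which is how the recursion of Eqs. (7)–(8) proceeds down the network.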
Convolutional neural network
A CNN is a class of deep learning neural networks.
A CNN has the following layers:
- Convolutional layer
- ReLU layer
- Pooling layer
- Fully connected layer
Convolutional layers apply a convolution operation to the input and pass the result on to the subsequent layer. Pooling combines the outputs of groups of neurons into a single neuron in the following layer. Fully connected layers connect every neuron in one layer to every neuron in the succeeding layer.
CNN architecture
A typical CNN (Abdel-Hamid et al., 2014) design looks broadly as follows. Figure 2 illustrates the CNN architecture, showing a pair of convolution and pooling layers in series, with a mapping from either the input layer or a pooling layer to a convolution layer.
Once the input feature maps are formed, the convolution and pooling layers apply their respective operations to train the elements of those layers. The elements of the convolution and pooling layers can also be organized into maps, like those of the input layer. A convolution layer and a pooling layer in series are typically referred to in CNN terminology as a single "layer" of the CNN; a deep CNN thus contains two or more of these pairs in series.
The convolution layer unit of one feature map can be calculated as:
$${l}_{s,m}=\sigma \left(\sum_{i=1}^{p}\sum_{n=1}^{F}{I}_{i,n+m-1}\,{w}_{i,s,n}+{w}_{0,s}\right)$$
(10)
$$(s=1,2,\ldots ,S)$$
where Ii,m is the mth unit of the ith input feature map, ls,m is the mth element of the sth feature map in the convolution layer, wi,s,n is the nth component of the weight vector that joins the ith input feature map to the sth feature map of the convolution layer, and w0,s is the bias. F is called the filter size; it determines the number of frequency bands in each input feature map that each element of the convolution layer receives as input. Because of the locality that arises from the use of mel frequency spectral coefficient (MFSC) features, these feature maps are limited to a restricted frequency range of the speech signal. Using the max pooling function, the pooling layer in the CNN is given as
$${p}_{i,m}=\max_{n=1}^{G}{q}_{i,\left(m-1\right)\times s+n}$$
(11)
where G represents the pooling size and s denotes the shift size, which determines the overlap of adjacent pooling windows. An alternative form of the pooling layer in the CNN is
$${p}_{i,m}=x\sum_{n=1}^{G}{q}_{i,\left(m-1\right)\times s+n}$$
(12)
where x represents a scaling factor that can be learned. In image recognition applications, under the constraint that G = s, i.e., when the pooling windows do not overlap and have no gaps between them, max-pooling has been found to perform better than average-pooling.
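The convolution and pooling operations of Eqs. (10)–(12) can be sketched in pure Python. This is an illustrative sketch with zero-based list indexing; the function names, the placement of the bias term, and the toy inputs are assumptions, not the authors' exact implementation.

```python
import math

def conv_feature_map(I, w, bias, F):
    """Eq. (10): one convolution-layer feature map. I[i][m] is the m-th unit
    of the i-th input map; w[i][n] joins input map i to this feature map."""
    sigma = lambda v: 1.0 / (1.0 + math.exp(-v))
    out_len = len(I[0]) - F + 1
    return [sigma(sum(I[i][n + m] * w[i][n]
                      for i in range(len(I)) for n in range(F)) + bias)
            for m in range(out_len)]

def max_pool(q, G, s):
    """Eq. (11): max pooling with pooling size G and shift size s."""
    return [max(q[(m - 1) * s + n] for n in range(G))
            for m in range(1, (len(q) - G) // s + 2)]

def scaled_sum_pool(q, G, s, x):
    """Eq. (12): pooling as a learned-scale x times the sum over each window."""
    return [x * sum(q[(m - 1) * s + n] for n in range(G))
            for m in range(1, (len(q) - G) // s + 2)]

q = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0]
pooled = max_pool(q, G=2, s=2)  # non-overlapping windows (G = s)
```

With G = s as in the example, each value of q contributes to exactly one pooling window, which is the non-overlapping case discussed above.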