Introduction

With the advent of the intelligent era, traditional interaction modes such as switches, buttons, keyboards, mice and touch screens are increasingly unable to meet people's growing needs for intelligent computing and control. The Internet of Things, with its intelligent interconnection of "thing to thing" and "thing to human", is believed to be the fourth wave of world information industry development, spanning intelligent vehicles, intelligent transportation, smart logistics, intelligent offices, smart home furnishings and almost all of smart agriculture and industry. Its defining characteristics are speed, convenience, intelligence and the integration of people and things. Therefore, exploring and implementing new modes of interaction and providing feasible solutions or devices is becoming increasingly urgent. We therefore develop a unit middleware for the implementation of human–machine interconnection, namely human–machine interaction based on phonetic and semantic control, for constructing the intelligent ecology of the Internet of Things.

For the unit middleware, the key is to realize speech analysis and semantic recognition with small memory capacity and low computing power. Recognition models, performance, capacity and computing power are all interrelated and influence one another: in general, a large model with good performance requires large capacity and strong computing power, and vice versa. General speech recognition models have many parameters, use a great deal of data, and take a long time to train and test, which demands large capacity and strong computing power. This is therefore a constrained optimization problem, and different applications impose different requirements. Clearly, realizing speech analysis and semantic recognition with small capacity and low computing power is a highly challenging research area. To achieve it, we first present a novel deep hybrid intelligent algorithm designed around three requirements: (1) the algorithm must extract features efficiently, reduce data redundancy, and improve the recognition rate and stability; (2) the algorithm must have a certain elasticity and flexibility and be easy to expand and prune, so that the recognition model is small and light and the requirements on computing power and capacity are reduced; (3) speech data are serialized and non-modular, with strong correlations and dependencies between earlier and later data. For different speech sequences, the dependency length can be varied: a greater length may yield higher accuracy but requires more computing power and capacity, while a smaller length may yield lower accuracy but requires less. A serialization model is therefore used to better capture the dependencies between sequential words and to make the corresponding trade-offs according to actual needs. Second, we establish the unit middleware with an embedded chip as the core of the motherboard. Third, we develop the important auxiliary tools, a writer-burner and a cross-compiler. Fourth, we prune the procedures and the system, then download, burn and write the algorithms and code into the unit middleware and cross-compile them. Fifth, we expand the functions of the motherboard and provide more components and interfaces, including RFID, ZigBee, Wi-Fi, GPRS, RS-232 serial ports, USB interfaces and so on. Sixth, we combine algorithms, software and hardware to make machines "understand" human speech and "think" about and "comprehend" human intentions, so as to implement human–machine interconnection and further structure the intelligent ecology of the Internet of Things. Finally, the experimental results show that the unit middleware achieves very good results, with fast recognition speed, high accuracy and good stability, consequently realizing the intelligent ecology construction of the Internet of Things.

Previous foreign and domestic studies

The research in this paper is multidisciplinary and cross-cutting; its content is extensive and difficult and requires professional knowledge from many fields, including speech recognition and semantic control, deep hybrid intelligent algorithms, human–machine interaction, artificial intelligence, the Internet of Things, embedded development and so on. It is also a combination of algorithms, hardware and software. Although each of these is a current research hotspot in its own right, no application or research combining all of them has been reported. In particular, realizing speech analysis and semantic recognition with small capacity and low computing power is a highly challenging research area for constructing the intelligent ecology of the Internet of Things. Previous work in each relevant domain therefore needs to be reviewed.

At present, speech recognition with semantic control is arguably the most suitable mode of human–machine interaction. For example, it can be applied to indoor equipment control, voice-controlled telephone exchanges, intelligent toys, industrial control, home services, hotel services, banking services, ticketing systems, web information queries, voice communication systems, voice navigation and all kinds of other voice control systems and self-service customer service systems. In particular, with the vigorous development of artificial intelligence technology [1,2,3,4], and in contrast to traditional human–machine interaction modes based mainly on keyboards, mice and so on, people naturally expect machines to have highly intelligent voice communication abilities. Such intelligent machines can "understand" human speech, "think" about and "comprehend" human intentions, and finally respond with speech or actions. This has always been one of the ultimate goals of artificial intelligence and is also one of the critical components for structuring the intelligent interconnections of the Internet of Things [5,6,7,8,9,10,11,12]. Intelligent voice interaction technology has therefore become one of the current research hotspots. Until 2006, there were no major breakthroughs in speech recognition. The most representative recognition methods had long been the feature parameter-matching method, the HMM (Hidden Markov Model) and other key technologies based on the HMM, for example the MAP (Maximum A-posteriori Probability) estimation criterion [13] and MLLR (Maximum Likelihood Linear Regression) [14]. After Hinton et al. presented the layer-by-layer greedy unsupervised pre-training of deep neural networks, known as deep learning, in 2006 [15,16,17,18,19], speech recognition began to make breakthroughs. Microsoft successfully applied it to its own speech recognition system and achieved a reduction in the word recognition error rate of approximately 30% compared with previous optimal methods [20, 21], which was a major breakthrough for speech recognition. At present, many domestic and foreign research institutions, for example Xunfei, Microsoft, Google, IBM and so on, are actively pursuing research on deep learning [22]. So far, hundreds of neural networks have been proposed, such as SOFM (Self-Organizing Feature Mapping), LVQ (Learning Vector Quantization), LAM (Local Attention Mechanism), RBF (Radial Basis Function), ART (Adaptive Resonance Theory), BAM (Bidirectional Associative Memory), CMAC (Cerebellar Model Articulation Controller), CPN (Counter Propagation Network), quantum neural networks, fuzzy neural networks and so on [23, 24]. In particular, in 1995, Y. LeCun et al. proposed the CNN (Convolutional Neural Network) [25, 26]. In 2006, Hinton et al. proposed the DBN (Deep Belief Network) [24], which used the RBM (Restricted Boltzmann Machine) [27] as its building block. Rumelhart, D. E. proposed the AENN (Automatic Encoding Neural Network) [28, 29]. Subsequently, other neural networks were proposed based on these models, for example the SDBN (Sparse Deep Belief Network) [30], SSAE (Sparse Stacked Autoencoder) [31], DCGAN (Deep Convolutional Generative Adversarial Network) [32] and so on. All of these have become the main constituent models of deep neural networks, namely deep learning [33, 34].

The concept of the IoT (Internet of Things) was first proposed by Professor Ashton in 1999 [35]. He presented the "intelligent interconnection of thing to thing", in which information sensing equipment collects information in real time and, combined with the Internet, constitutes a huge network [36,37,38,39,40,41]. As early as 1999, the Chinese Academy of Sciences had launched research on sensor networks and has since made significant progress in wireless intelligent sensor network communication technology, micro-sensors, sensor terminals, mobile base stations and so on [42]. In 2010, the Beijing municipal government launched the first Internet of Things demonstration project, the "Perception of Beijing".

An embedded system is a dedicated computer system centered on an application. It is based on computer technology, allows its software and hardware to be tailored, and can adapt to application systems with stringent requirements on functionality, reliability, cost, volume, power consumption and so on [43, 44]. An embedded processor is the core of an embedded system; it is the hardware unit that controls and assists the system's operations. The four popular architectures are the EMP (embedded microprocessor unit), EMCU (embedded microcontroller unit), EDSP (embedded digital signal processor) and ESoC (embedded system on chip) [45].

ESR (embedded speech recognition) refers to speech recognition in which all processing is performed on the target device. Traditional speech recognition systems generally adopt an acoustic model based on the GMM-HMM (Gaussian mixture model–hidden Markov model) or an n-gram language model. In recent years, with the rise of deep learning, acoustic and language models based on deep neural networks have achieved significant performance improvements over the traditional GMM-HMM and n-gram models [46,47,48,49]. Automatic speech recognition on an embedded mobile platform is one of the key technologies.

The remainder of this paper is organized as follows. "Principle of speech recognition control and mathematical theory model" section discusses the principle of speech recognition control and the mathematical theory model. "Deep hybrid intelligent algorithm" section introduces the novel deep hybrid intelligent algorithm and training methods. The experimental results are presented and discussed in "Experiments and result analysis" section. "Summary and prospect" section provides the concluding remarks and prospects.

Principle of speech recognition control and mathematical theory model

Although the degrees of complexity differ, the principle of speech recognition is the same across all languages. Therefore, we choose Chinese for speech recognition and semantic control, and then realize human–machine interaction control.

Speech semantic recognition mainly includes the following steps: speech input, data acquisition, feature extraction, encoding and decoding, and speech-to-semantic recognition, as shown in Fig. 1. Using statistical theory, in particular conditional, prior and posterior probabilities, a relationship can be established between a word sequence \(W\) and the speech signal \(O\); the task can then be regarded as a MAPP (maximum a posteriori probability) problem [13]. After processing and transformation, a sequence of speech feature vectors \(O\) is obtained, and the word sequence maximizing the posterior probability is sought, giving the following formula:

$$W* = \arg \left\{ {\mathop {\max }\limits_{{W \in \tau }} P(W|O)} \right\}$$
(1)
Fig. 1

Schematic diagram of the traditional acoustic model based on the GMM-HMM and the novel speech semantic recognition system based on the deep hybrid intelligent algorithm

The posterior probabilities of all possible word sequences are calculated and maximized, where \(W*\) is the word sequence with the maximum probability and \(\tau\) is the collection of all word sequences. Because \(P(O)\) is constant and, once \(W\) is determined, \(O\) is uniquely determined, the conditional probability \(P(O|W)\) equals \(1\). From formula (1) we obtain formula (2):

$$\begin{aligned} W*&= \arg \left\{ \mathop {\max }\limits_{W \in \tau } \frac{P(O|W)P(W)}{{P(O)}}\right\} \hfill \\& = \arg \left\{ \mathop {\max }\limits_{W \in \tau } P(O|W)P(W)\right\} \hfill \\& = \arg \left\{ \mathop {\max }\limits_{W \in \tau } P(W)\right\} \hfill \\ \end{aligned}$$
(2)

Since \(W\) is a word string sequence, \(P(W)\) can be decomposed as:

$$\begin{aligned} P(W)& = P(w_{n} ,w_{n - 1} ,w_{n - 2} , \cdots ,w_{1} ) \hfill \\& = P(w_{1} )P(w_{2} |w_{1} )P(w_{3} |w_{2} ,w_{1} ) \cdots \hfill \\ P(w_{n} |w_{n - 1} ,w_{n - 2} , \cdots ,w_{1} ) \hfill \\& = \prod\limits_{i = 1}^{n} {P(w_{i} |w_{i - 1} ,w_{i - 2} , \cdots ,w_{1} )} \hfill \\& = \prod\limits_{i = 1}^{n} {P(w_{i} |\omega^{i - 1} )} \propto \sum\limits_{i = 1}^{n} {\log (P(w_{i} |\omega^{i - 1} ))} \hfill \\ \end{aligned}$$
(3)

From formulas (2) and (3) we obtain formula (4):

$$\begin{aligned} W*& = \arg \left\{ \mathop {\max }\limits_{W \in \tau } \frac{P(O|W)P(W)}{{P(O)}}\right\} \hfill \\& = \arg \left\{ \mathop {\max }\limits_{W \in \tau } P(O|W)P(W)\right\} \hfill \\& = \arg \left\{ \mathop {\max }\limits_{W \in \tau } P(W)\right\} \hfill \\& = \arg \left\{ \mathop {\max }\limits_{W \in \tau } \sum\limits_{i = 1}^{n} {\log (P(w_{i} |\omega^{i - 1} ))} \right\} \hfill \\ \end{aligned}$$
(4)

where \(w_{i}\) is the \(ith\) word of the string, \(n\) is the total number of words, and \({\varvec{\omega}}^{i - 1}\) represents the word sequence \(w_{i - 1} ,w_{i - 2} , \cdots ,w_{1}\).

Considering the large number of words, it is hard to calculate the conditional probability \(P(w_{i} |{\varvec{\omega}}^{i - 1} )\) directly. Therefore, a finite number of words is selected as the calculation range; namely, the n-gram model is widely used, for example the 2-gram, 3-gram and so on. It assumes that the conditional probability \(P(w_{i} |{\varvec{\omega}}^{i - 1} )\) is related only to the preceding \(n - 1\) words. As a result, it can be simplified as:

$$\begin{aligned} P(w_{i} |\omega^{i - 1} )& = P(w_{n} |w_{n - 1} ,w_{n - 2} , \cdots ,w_{1} ) \hfill \\& = P(w_{n} |w_{n - 1} ,w_{n - 2} , \cdots ,w_{n - N + 1} ) \hfill \\ \end{aligned}$$
(5)

Thus, using the bigram model, namely the 2-gram, \(P(W)\) can be approximated as follows:

$$\begin{gathered} P(W) \approx \prod\limits_{i = 1}^{n} {P(w_{i} |w_{i - 1} )} \hfill \\ \propto \sum\limits_{i = 1}^{n} {\log (P(w_{i} |w_{i - 1} ))} \hfill \\ \end{gathered}$$
(6)
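
To make the decomposition in formulas (3)–(6) concrete, the following Python sketch scores candidate word sequences with a 2-gram model and selects the one with the largest summed log-probability, mirroring formula (4). The probability table and the candidate sequences are hypothetical values introduced only for illustration, not part of the system described in this paper.

```python
import math

# Hypothetical bigram probabilities P(w_i | w_{i-1}); "<s>" marks the sentence start.
# These numbers are illustrative only, not estimated from any real corpus.
bigram_prob = {
    ("<s>", "turn"): 0.4, ("turn", "on"): 0.5, ("turn", "off"): 0.4,
    ("on", "light"): 0.6, ("off", "light"): 0.6, ("<s>", "light"): 0.1,
    ("light", "on"): 0.2,
}

def log_score(words, floor=1e-8):
    """Sum of log P(w_i | w_{i-1}) over the sequence, as in formula (6)."""
    score = 0.0
    prev = "<s>"
    for w in words:
        score += math.log(bigram_prob.get((prev, w), floor))  # unseen pairs get a small floor
        prev = w
    return score

# Candidate word sequences produced by the acoustic front end (hypothetical).
candidates = [["turn", "on", "light"], ["turn", "off", "light"], ["light", "on", "turn"]]

# argmax over the candidates, i.e. W* in formula (4) restricted to this candidate set.
best = max(candidates, key=log_score)
print(best, log_score(best))
```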

Deep hybrid intelligent algorithm

Deep training and residual

The mean square error cost function, with a weight-decay regularization term, can be expressed as:

$$\begin{aligned} J(W,b)& = \left[\frac{1}{m}\sum\limits_{i = 1}^{m} {J(W,b;x^{(i)} ,y^{(i)} )} \right] + \frac{\lambda }{2}\sum\limits_{l = 1}^{{n_{l} - 1}} {\sum\limits_{i = 1}^{{s_{l} }} {\sum\limits_{j = 1}^{{s_{l + 1} }} {(W_{ji}^{(l)} )^{2} } } } \hfill \\& = \left[\frac{1}{m}\sum\limits_{i = 1}^{m} {(\frac{1}{2}||h_{W,b} (x^{(i)} ) - y^{(i)} ||^{2} )}\right] \hfill \\& + \frac{\lambda }{2}\sum\limits_{l = 1}^{{n_{l} - 1}} {\sum\limits_{i = 1}^{{s_{l} }} {\sum\limits_{j = 1}^{{s_{l + 1} }} {(W_{ji}^{(l)} )^{2} } } } \hfill \\ \end{aligned}$$
(7)

By taking the partial derivative of this cost with respect to each unit's input, a value called the "residual" is calculated for each unit and denoted \(\delta_{i}^{(l)}\). First, the residuals of the units in the output layer are obtained:

$$\begin{aligned} \delta_{i}^{{(n_{l} )}}& = \frac{\partial }{{\partial z_{i}^{{n_{l} }} }}J(W,b;x,y) = \frac{\partial }{{\partial z_{i}^{{n_{l} }} }}\frac{1}{2}||y - h_{W,b} (x)||^{2} \hfill \\& = \frac{\partial }{{\partial z_{i}^{{n_{l} }} }}\frac{1}{2}\sum\limits_{j = 1}^{{s_{{n_{l} }} }} {(y_{j} - a_{j}^{{(n_{l} )}} )^{2} } \hfill \\& = \frac{\partial }{{\partial z_{i}^{{n_{l} }} }}\frac{1}{2}\sum\limits_{j = 1}^{{s_{{n_{l} }} }} {(y_{j} - f(z_{j}^{{(n_{l} )}} ))^{2} } \hfill \\& = - (y_{i} - f(z_{i}^{{(n_{l} )}} )) \cdot f^{\prime}(z_{i}^{{(n_{l} )}} ) \hfill \\& = - (y_{i} - a_{i}^{{(n_{l} )}} ) \cdot f^{\prime}(z_{i}^{{(n_{l} )}} ) \hfill \\ \end{aligned}$$
(8)

Next, the residuals of the individual units in the other layers, for example layers \(l = n_{l} - 1,n_{l} - 2, \cdots ,2\), can also be obtained; for the residuals of layer \(l = n_{l} - 1\):

$$\begin{aligned} \delta_{i}^{{(n_{l} - 1)}} & = \frac{\partial }{{\partial z_{i}^{{(n_{l} - 1)}} }}J(W,b;x,y) = \frac{\partial }{{\partial z_{i}^{{(n_{l} - 1)}} }}\frac{1}{2}||y - h_{W,b} (x)||^{2} \hfill \\ & = \frac{\partial }{{\partial z_{i}^{{(n_{l} - 1)}} }}\frac{1}{2}\sum\limits_{j = 1}^{{s_{{n_{l} }} }} {(y_{j} - a_{j}^{{(n_{l} )}} )^{2} } = \frac{1}{2}\sum\limits_{j = 1}^{{s_{{n_{l} }} }} {\frac{\partial }{{\partial z_{i}^{{(n_{l} - 1)}} }}(y_{j} - a_{j}^{{(n_{l} )}} )^{2} } \hfill \\ & = \frac{1}{2}\sum\limits_{j = 1}^{{s_{{n_{l} }} }} {\frac{\partial }{{\partial z_{i}^{{(n_{l} - 1)}} }}(y_{j} - f(z_{j}^{{(n_{l} )}} ))^{2} } = \sum\limits_{j = 1}^{{s_{{n_{l} }} }} { - (y_{j} - f(z_{j}^{{(n_{l} )}} ))} \cdot \frac{\partial }{{\partial z_{i}^{{(n_{l} - 1)}} }}f(z_{j}^{{(n_{l} )}} ) \hfill \\ & = \sum\limits_{j = 1}^{{s_{{n_{l} }} }} { - (y_{j} - f(z_{j}^{{(n_{l} )}} ))} \cdot f^{\prime}(z_{j}^{{(n_{l} )}} ) \cdot \frac{{\partial z_{j}^{{(n_{l} )}} }}{{\partial z_{i}^{{(n_{l} - 1)}} }} = \sum\limits_{j = 1}^{{s_{{n_{l} }} }} {\delta_{j}^{{(n_{l} )}} } \cdot \frac{{\partial z_{j}^{{(n_{l} )}} }}{{\partial z_{i}^{{(n_{l} - 1)}} }} \hfill \\ & = \sum\limits_{j = 1}^{{s_{{n_{l} }} }} {\left( {\delta_{j}^{{(n_{l} )}} \cdot \frac{\partial }{{\partial z_{i}^{{(n_{l} - 1)}} }}\sum\limits_{k = 1}^{{s_{{n_{l} - 1}} }} {f(z_{k}^{{(n_{l} - 1)}} )} \cdot W_{jk}^{{(n_{l} - 1)}} } \right)} \hfill \\ & = \sum\limits_{j = 1}^{{s_{{n_{l} }} }} {\delta_{j}^{{(n_{l} )}} } \cdot W_{ji}^{{(n_{l} - 1)}} \cdot f^{\prime}(z_{i}^{{(n_{l} - 1)}} ) = \left( {\sum\limits_{j = 1}^{{s_{{n_{l} }} }} {W_{ji}^{{(n_{l} - 1)}} \delta_{j}^{{(n_{l} )}} } } \right)f^{\prime}(z_{i}^{{(n_{l} - 1)}} ) \hfill \\ \end{aligned}$$
(9)

where \(W\) is the weight, \(b\) is the bias, \((x,y)\) is the sample, \(h_{W,b} (x)\) is the final output and \(f(\cdot)\) is the activation function. Further, the relationship between the residuals of units in two adjacent layers can be obtained:

$$\delta_{i}^{{(n_{l} - 1)}} = \left(\sum\limits_{j = 1}^{{s_{{n_{l} }} }} {W_{ji}^{{(n_{l} - 1)}} \delta_{j}^{{(n_{l} )}} } \right)f^{\prime}(z_{i}^{{(n_{l} - 1)}} )$$
(10)

Finally, from all of these formulas, the learning and training of the novel deep hybrid intelligent algorithm can be realized, namely:

$$\left\{ \begin{gathered} \frac{\partial }{{\partial W_{ij}^{{(n_{l} - 1)}} }}J(W,b;x,y) = a_{j}^{{(n_{l} - 1)}} \delta_{i}^{{(n_{l} )}} \hfill \\ \frac{\partial }{{\partial b_{i}^{{(n_{l} - 1)}} }}J(W,b;x,y) = \delta_{i}^{{(n_{l} )}} \hfill \\ \end{gathered} \right.$$
(11)
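
As a concrete illustration of Eqs. (8)–(11), the following numpy sketch computes the output-layer residuals for a small fully connected network with sigmoid activations, propagates them back one layer, and forms the weight and bias gradients. The layer sizes, random weights, sample and learning rate are hypothetical; this is a minimal sketch of the residual rule, not the full training code of the proposed algorithm.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical 3-layer network: 4 inputs -> 5 hidden units -> 3 outputs.
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)

x = rng.normal(size=4)          # one sample
y = np.array([1.0, 0.0, 0.0])   # its target

# Forward pass.
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)            # h_{W,b}(x)

# Output-layer residual, Eq. (8): delta = -(y - a) * f'(z).
delta2 = -(y - a2) * a2 * (1.0 - a2)

# Residual of the previous layer, Eq. (10): delta1 = (W2^T delta2) * f'(z1).
delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)

# Gradients, Eq. (11): dJ/dW = delta * a_prev^T, dJ/db = delta.
grad_W2, grad_b2 = np.outer(delta2, a1), delta2
grad_W1, grad_b1 = np.outer(delta1, x), delta1

# One gradient-descent step with a hypothetical learning rate.
eta = 0.1
W2 -= eta * grad_W2; b2 -= eta * grad_b2
W1 -= eta * grad_W1; b1 -= eta * grad_b1
```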

DBNESR (deep belief network embedded with softmax regression)

The DBN uses the RBM [50, 51], an unsupervised learning network, as the basis of a multi-layer learning system and uses a supervised learning algorithm, BP (back-propagation), for fine-tuning after the pre-training. Its architecture is shown in Fig. 2. The deep architecture is a fully interconnected directed belief network with one input layer \(v^{1}\), parameter space \(W = \{ W^{1} ,W^{2} , \cdot \cdot \cdot ,W^{N} \}\), hidden layers \(h^{1}\), \(h^{2}\),\(\cdot \cdot \cdot\), \(h^{N}\), and one labelled layer at the top. The input layer \(v^{1}\) has \(D\) units, equal to the number of features of the samples. The label layer has \(C\) units, equal to the number of classes of the label vector \(Y\). The numbers of units in the hidden layers are pre-defined according to experience or intuition. The goal of finding the mapping function is thus transformed into the problem of finding the parameter space \(W = \{ W^{1} ,W^{2} , \cdot \cdot \cdot ,W^{N} \}\) of the deep architecture [52].

Fig. 2

Architecture of the DBNESR

The semi-supervised learning method based on the DBN architecture can be divided into two stages. First, the DBN architecture is constructed by greedy layer-wise unsupervised learning using RBMs as building blocks; all samples are utilized to find the parameter space \(W\) with \(N\) layers. Second, the DBN architecture is trained according to the log-likelihood using the gradient descent method. Since it is difficult to optimize a deep architecture using supervised learning directly, the unsupervised learning stage abstracts the features effectively and prevents over-fitting during the supervised training. The BP algorithm is used to propagate the error from the top down for fine-tuning after the pre-training.

For unsupervised learning, the energy of the joint configuration \((h^{k - 1} ,h^{k} )\) is defined as [53]:

$$\begin{gathered} E(h^{k - 1} ,h^{k} ;\theta ) \hfill \\ = - \sum\limits_{i = 1}^{{D_{k - 1} }} {\sum\limits_{j = 1}^{{D_{k} }} {w_{{_{ij} }}^{k} } } h_{i}^{k - 1} h_{j}^{k} - \sum\limits_{i = 1}^{{D_{k - 1} }} {b_{{_{i} }}^{k - 1} h_{i}^{k - 1} - \sum\limits_{j = 1}^{{D_{k} }} {c_{j}^{k} } } h_{j}^{k} \hfill \\ \end{gathered}$$
(12)

where \(\theta = (W,b,c)\) are the model parameters. \(w_{ij}^{k}\) is the symmetric interaction term between unit \(i\) in layer \(h^{k - 1}\) and unit \(j\) in layer \(h^{k}\), \(k = 1, \cdot \cdot \cdot ,N - 1\). \(b_{i}^{k - 1}\) is the \(i\)th bias of layer \(h^{k - 1}\) and \(c_{j}^{k}\) is the \(j\)th bias of layer \(h^{k}\). \(D_{k}\) is the number of units in the \(k\)th layer. The network assigns a probability to every possible data point via this energy function. The probability of a training data point can be raised by adjusting the weights and biases to lower the energy of that data point and to raise the energy of similar, confabulated data that the network would prefer to the real data. Given the value of \(h^{k}\), the network can learn the content of \(h^{k - 1}\) by minimizing this energy function.

The probability that the model assigns to \(h^{k - 1}\) is:

$$P(h^{k - 1} ;\theta ) = \frac{1}{Z(\theta )}\sum\limits_{{h^{k} }} {\exp ( - E(h^{k - 1} ,h^{k} ;\theta ))}$$
(13)
$$Z(\theta ) = \sum\limits_{{h^{k - 1} }} {\sum\limits_{{h^{k} }} {\exp ( - E(h^{k - 1} ,h^{k} ;\theta ))} }$$
(14)

where \(Z(\theta )\) denotes the normalizing constant. The conditional distributions over \(h^{k}\) and \(h^{k - 1}\) are given as:

$$p(h^{k} |h^{k - 1} ) = \prod\limits_{j} {p(h_{j}^{k} |h^{k - 1} } )$$
(15)
$$p(h^{k - 1} |h^{k} ) = \prod\limits_{i} {p(h_{i}^{k - 1} |h^{k} } )$$
(16)

The probability of turning on unit \(j\) is a logistic function of the states of \(h^{k - 1}\) and \(w_{ij}^{k}\):

$$p(h_{j}^{k} = 1|h^{k - 1} ) = sigm(c_{j}^{k} + \sum\limits_{i} {w_{ij}^{k} } h_{i}^{k - 1} )$$
(17)

The probability of turning on unit \(i\) is a logistic function of the states of \(h^{k}\) and \(w_{ij}^{k}\):

$$p(h_{i}^{k - 1} = 1|h^{k} ) = sigm(b_{i}^{k - 1} + \sum\limits_{j} {w_{ij}^{k} } h_{j}^{k} )$$
(18)

Here, the chosen logistic function is the sigmoid function:

$$sigm(x) = 1/(1 + e^{ - x} )$$
(19)

The derivative of the log-likelihood with respect to the model parameter \(w^{k}\) can be obtained from Eq. (13):

$$\frac{{\partial \log p(h^{k - 1} )}}{{\partial w_{ij}^{k} }} = \langle h_{i}^{k - 1} h_{j}^{k} \rangle_{{p_{0} }} - \langle h_{i}^{k - 1} h_{j}^{k} \rangle_{{p_{Model} }}$$
(20)

where \(\langle \cdot \rangle_{{p_{0} }}\) denotes an expectation with respect to the data distribution and \(\langle \cdot \rangle_{{p_{Model} }}\) denotes an expectation with respect to the distribution defined by the model [54]. The expectation \(\langle \cdot \rangle_{{p_{Model} }}\) cannot be computed analytically. In practice, it is replaced by \(\langle \cdot \rangle_{{p_{1} }}\), which denotes the distribution of samples obtained when the feature detectors are driven by the reconstructed \(h^{k - 1}\). This corresponds to approximating the gradient of a different objective function called CD (contrastive divergence) [55]. Using the Kullback–Leibler distance, denoted \(KL(P||P^{^{\prime}} )\), to measure the "diversity" between two probability distributions, the CD objective is given in Eq. (21):

$$CD_{n} = KL(p_{0} ||p_{\infty } ) - KL(p_{n} ||p_{\infty } )$$
(21)

where \(p_{0}\) denotes the joint probability distribution of the initial state of the RBM network, \(p_{n}\) denotes the joint probability distribution of the RBM network after \(n\) steps of MCMC (Markov chain Monte Carlo) transformation, and \(p_{\infty }\) denotes the joint probability distribution of the RBM network at the end of the MCMC chain. Therefore, \(CD_{n}\) can be regarded as measuring the location of \(p_{n}\) between \(p_{0}\) and \(p_{\infty }\). The algorithm repeatedly assigns \(p_{n}\) to \(p_{0}\) and obtains a new \(p_{0}\) and \(p_{n}\). Experiments show that \(CD_{n}\) tends to zero and that the accuracy of full MCMC is approximated after the correction parameter \(\theta\) has been adjusted along the gradient \(r\) times. The training process of the RBM is shown in Fig. 3.

Fig. 3

Training process of the RBM using CD

We can get Eq. (22) through the training process of the RBM using CD:

$$\vartriangle w_{ij}^{k} = \eta \left(\left\langle {h_{i}^{k - 1} h_{j}^{k} } \right\rangle_{{P_{0} }} - \left\langle {h_{i}^{k - 1} h_{j}^{k} } \right\rangle_{{P_{1} }}\right)$$
(22)

where \(\eta\) is the learning rate. Then, the parameter can be adjusted through:

$$w_{ij}^{k} = \mu w_{ij}^{k} + \vartriangle w_{ij}^{k}$$
(23)

where \(\mu\) is the momentum.
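
For illustration, a minimal numpy sketch of one contrastive-divergence (CD-1) update of a single RBM layer, following Eqs. (17), (18), (22) and (23), is given below. The layer sizes, learning rate and momentum value are hypothetical, and the momentum is applied to the weight increment, which is the standard reading of the update in Eq. (23).

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

n_vis, n_hid = 6, 4                       # hypothetical sizes of h^{k-1} and h^{k}
W = 0.01 * rng.normal(size=(n_vis, n_hid))
b = np.zeros(n_vis)                       # biases of h^{k-1}
c = np.zeros(n_hid)                       # biases of h^{k}
dW = np.zeros_like(W)                     # momentum buffer
eta, mu = 0.1, 0.5                        # learning rate and momentum (hypothetical)

def cd1_step(v0):
    """One contrastive-divergence update for a single sample v0 (values in [0, 1])."""
    global W, dW
    # Up pass, Eq. (17): p(h_j = 1 | v0).
    ph0 = sigm(c + v0 @ W)
    h0 = (rng.random(n_hid) < ph0).astype(float)       # sample hidden states
    # Down pass, Eq. (18): reconstruct v, then drive the hidden units again.
    pv1 = sigm(b + h0 @ W.T)
    ph1 = sigm(c + pv1 @ W)
    # Eq. (22): eta * (<v h>_p0 - <v h>_p1), then momentum update of W as in Eq. (23).
    grad = np.outer(v0, ph0) - np.outer(pv1, ph1)
    dW = mu * dW + eta * grad
    W += dW

v = (rng.random(n_vis) < 0.5).astype(float)   # a hypothetical binary input vector
cd1_step(v)
```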

The above discussion concerns training the parameters between two hidden layers with one sample \(x\). For unsupervised learning, the deep architecture is constructed using all samples by inputting them one by one from layer \(h^{0}\) and training the parameters between \(h^{0}\) and \(h^{1}\). The value of \(h^{1}\) is then calculated from \(h^{0}\) and the trained parameters between \(h^{0}\) and \(h^{1}\), and is in turn used to construct the next layer \(h^{2}\), and so on. The deep architecture is thus constructed layer by layer from the bottom to the top. In each iteration, the parameter space \(W^{k}\) is trained from the data calculated in the \((k - 1)\)th layer. According to the \(W^{k}\) calculated above, the layer \(h^{k}\) is obtained as below for a sample \(x\) fed from layer \(h^{0}\):

$$\begin{gathered} h_{j}^{k} (x) = sigm\left( {c_{j}^{k} + \sum\limits_{i = 1}^{{D_{k - 1} }} {w_{ij}^{k} } h_{i}^{k - 1} (x)} \right) \hfill \\ {\text{ j = 1,}} \cdot \cdot \cdot {\text{,D}}_{{\text{k}}} {\text{;k = 1,}} \cdot \cdot \cdot {\text{,N - 1}} \hfill \\ \end{gathered}$$
(24)

For supervised learning, the DBN architecture is trained with the \(C\)-class labelled data. The optimization problem is formulated as:

$$\arg \min err = - \sum\limits_{k} {p_{k} } \log \hat{p}_{k} - \sum\limits_{k} {(1 - p_{k} )} \log (1 - \hat{p}_{k} )$$
(25)

This is done to minimize cross-entropy. In the equation, \(p_{k}\) denotes the real label probability and \(\hat{p}_{k}\) denotes the model label probability.

The greedy layer-wise unsupervised learning is used solely to initialize the parameters of the deep architecture, and these parameters are updated based on Eq. (23). After the initialization, real values are used in all nodes of the deep architecture, and gradient descent is applied through the whole deep architecture to retrain the weights for optimal classification.
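
The sketch below illustrates how a sample is propagated upward layer by layer with Eq. (24) and how the labelled softmax layer is evaluated with the cross-entropy of Eq. (25). The pre-trained weights are random stand-ins for the parameters produced by the RBM stage above, the architecture sizes are hypothetical, and the fine-tuning step uses the standard softmax cross-entropy gradient as a simplification.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)

# Hypothetical architecture: D = 20 input features, two hidden layers, C = 4 classes.
sizes = [20, 16, 8]
# Stand-ins for the RBM-pretrained parameters W^k, c^k (random here for illustration).
Ws = [0.1 * rng.normal(size=(sizes[k], sizes[k + 1])) for k in range(len(sizes) - 1)]
cs = [np.zeros(sizes[k + 1]) for k in range(len(sizes) - 1)]
# Softmax (labelled) layer on top.
C = 4
V, d = 0.1 * rng.normal(size=(sizes[-1], C)), np.zeros(C)

def forward(x):
    """Propagate a sample upward layer by layer, Eq. (24), then apply softmax."""
    h = x
    for W, c in zip(Ws, cs):
        h = sigm(c + h @ W)
    return h, softmax(d + h @ V)

x = rng.random(sizes[0])                 # a hypothetical feature vector
y = np.array([0.0, 1.0, 0.0, 0.0])       # its one-hot label

h_top, p_hat = forward(x)
# Cross-entropy of Eq. (25) (binary form summed over the classes).
eps = 1e-12
err = -np.sum(y * np.log(p_hat + eps) + (1 - y) * np.log(1 - p_hat + eps))

# One fine-tuning step of the top softmax layer (standard softmax cross-entropy gradient,
# used here as a simplification of Eq. (25)).
delta_top = p_hat - y
V -= 0.1 * np.outer(h_top, delta_top)
d -= 0.1 * delta_top
print(err)
```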

DLSTM (deep long short-term memory network)

Speech signals are serialized data with the characteristics of consistency and causality, so a serialization model is used to better capture the dependencies between sequential words. To do this, we present a DLSTM [50] integrated with the DBNESR to constitute the novel deep hybrid intelligent algorithm, which captures the dependencies between earlier and later data in a sequence, reduces dimensionality, and overcomes the disadvantages of gradient vanishing and gradient explosion. It can also retain memory even over very long sequences, so as to better model and perform speech recognition and semantic control. The schematic diagram of the DLSTM is shown in Fig. 4.

Fig. 4

The schematic diagram of DLSTM

In the recurrent neural network, the final gradient of the weight array \(W\) is the sum of the gradients at each moment, namely:

$$\begin{aligned} \nabla_{W} E& = \sum\limits_{k = 1}^{t} {\nabla_{{W_{k} }} E} \hfill \\& = \nabla_{{W_{t} }} E + \nabla_{{W_{t - 1} }} E + \nabla_{{W_{t - 2} }} E + \cdots + \nabla_{{W_{1} }} E \hfill \\ \end{aligned}$$
(26)

In this formula, the gradient can be almost zero at a certain moment, contributing nothing to the final gradient value, so that earlier states are effectively lost; that is, long-distance dependencies cannot be processed. For this reason, a cell state \(c\) is added to preserve the long-term state, and a gate mechanism is used to control the contents of \(c\). The first gate is the forgetting-gate, expressed as follows:

$$f_{t} = \sigma (W_{f} \cdot [h_{t - 1} ,x_{t} ] + b_{f} )$$
(27)

where \(W_{f}\) denotes the weight matrix, \([h_{t - 1} ,x_{t} ]\) denotes the concatenation of the two vectors \(h_{t - 1}\) and \(x_{t}\), \(b_{f}\) is the bias, and \(\sigma\) is the activation function. The inputting-gate is expressed as:

$$i_{t} = \sigma (W_{i} \cdot [h_{t - 1} ,x_{t} ] + b_{i} )$$
(28)

Based on the previous output and the current input, the candidate cell state describing the current input can be derived:

$$\mathop {c_{t} }\limits^{ \sim } = \tanh (W_{c} \cdot [h_{t - 1} ,x_{t} ] + b_{c} )$$
(29)

Then the cell state at the current moment can be calculated:

$$c_{t} = f_{t} \circ c_{t - 1} + i_{t} \circ \mathop {c_{t} }\limits^{ \sim }$$
(30)

The notation \(\circ\) denotes element-wise multiplication. The outputting-gate can be expressed as:

$$o_{t} = \sigma (W_{o} \cdot [h_{t - 1} ,x_{t} ] + b_{o} )$$
(31)

The final output of DLSTM is determined by the outputting-gate and the cell state:

$$h_{t} = o_{t} \circ \tanh (c_{t} )$$
(32)
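
A minimal numpy sketch of one forward step of the gated cell described by Eqs. (27)–(32) is shown below. The input and hidden sizes and the random weights are hypothetical, and the concatenation \([h_{t-1}, x_{t}]\) is formed explicitly so that each gate uses a single weight matrix.

```python
import numpy as np

def sigma(x):          # logistic activation used by the gates
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in, n_hid = 8, 6                      # hypothetical input and hidden sizes

# One weight matrix and bias per gate / candidate state, acting on [h_{t-1}, x_t].
Wf, Wi, Wo, Wc = (0.1 * rng.normal(size=(n_hid, n_hid + n_in)) for _ in range(4))
bf, bi, bo, bc = (np.zeros(n_hid) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    """One time step: Eqs. (27)-(32)."""
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigma(Wf @ z + bf)                     # forgetting-gate, Eq. (27)
    i_t = sigma(Wi @ z + bi)                     # inputting-gate,  Eq. (28)
    c_tilde = np.tanh(Wc @ z + bc)               # candidate state, Eq. (29)
    c_t = f_t * c_prev + i_t * c_tilde           # cell state,      Eq. (30)
    o_t = sigma(Wo @ z + bo)                     # outputting-gate, Eq. (31)
    h_t = o_t * np.tanh(c_t)                     # output,          Eq. (32)
    return h_t, c_t

# Run a short hypothetical sequence of feature vectors through the cell.
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c)
```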

There are eight groups of parameters to be learned in DLSTM training, namely the weights and biases of the forgetting-gate, the inputting-gate, the outputting-gate and the candidate cell state: \(W_{f}\) (\(W_{fh}\) and \(W_{fx}\)) and \(b_{f}\), \(W_{i}\) (\(W_{ih}\) and \(W_{ix}\)) and \(b_{i}\), \(W_{o}\) (\(W_{oh}\) and \(W_{ox}\)) and \(b_{o}\), and \(W_{c}\) (\(W_{ch}\) and \(W_{cx}\)) and \(b_{c}\). Since the DLSTM has four weighted inputs, the error terms are defined as the derivatives of the loss function with respect to the output value and to these weighted inputs, as shown below:

$$\begin{gathered} \delta_{t} \mathop = \limits^{def} \frac{\partial E}{{\partial h_{t} }}(\delta_{f,t} \mathop = \limits^{def} \frac{\partial E}{{\partial net_{f,t} }},\delta_{i,t} \mathop = \limits^{def} \frac{\partial E}{{\partial net_{i,t} }}, \hfill \\ \delta_{{\mathop c\limits^{ \sim } ,t}} \mathop = \limits^{def} \frac{\partial E}{{\partial net_{{\mathop c\limits^{ \sim } ,t}} }},\delta_{o,t} \mathop = \limits^{def} \frac{\partial E}{{\partial net_{o,t} }}) \hfill \\ \end{gathered}$$
(33)

During training, the error term is propagated backwards along time, and the error term at time \(t - 1\) is:

$$\begin{aligned} \delta_{{_{t - 1} }}^{T}& = \frac{\partial E}{{\partial h_{t - 1} }} = \frac{\partial E}{{\partial h_{t} }}\frac{{\partial h_{t} }}{{\partial h_{t - 1} }} \hfill \\& = \delta_{{_{t} }}^{T} \frac{{\partial h_{t} }}{{\partial h_{t - 1} }} \hfill \\ \end{aligned}$$
(34)

Using Eqs. (30), (32) and the total derivative formula, we obtain:

$$\begin{gathered} \delta_{{_{t} }}^{T} \frac{{\partial h_{t} }}{{\partial h_{t - 1} }} = \delta_{{_{o,t} }}^{T} \frac{{\partial net_{o,t} }}{{\partial h_{t - 1} }} + \delta_{{_{f,t} }}^{T} \frac{{\partial net_{f,t} }}{{\partial h_{t - 1} }} + \hfill \\ \delta_{{_{i,t} }}^{T} \frac{{\partial net_{i,t} }}{{\partial h_{t - 1} }} + \delta_{{^{{_{{\mathop c\limits^{ \sim } ,t}} }} }}^{T} \frac{{\partial net_{{\mathop c\limits^{ \sim } ,t}} }}{{\partial h_{t - 1} }} \hfill \\ \end{gathered}$$
(35)

Solving each partial derivative in Eq. (35), we obtain:

$$\begin{gathered} \delta_{t - 1} = \delta_{{_{o,t} }}^{T} W_{oh} + \delta_{{_{f,t} }}^{T} W_{fh} + \hfill \\ \delta_{{_{i,t} }}^{T} W_{ih} + \delta_{{^{{_{{\mathop c\limits^{ \sim } ,t}} }} }}^{T} W_{ch} \hfill \\ \end{gathered}$$
(36)

According to the definitions of \(\delta_{o,t}\),\(\delta_{f,t}\), \(\delta_{i,t}\) and \(\delta_{{\mathop c\limits^{ \sim } ,t}}\), we can get:

$$\left\{ \begin{gathered} \delta _{{_{{o,t}} }}^{T} = \delta _{{_{t} }}^{T} \circ \tanh (c_{t} ) \circ o_{t} \circ (1 - o_{t} ) \hfill \\ \delta _{{_{{f,t}} }}^{T} = \delta _{{_{t} }}^{T} \circ o_{t} \circ (1 - \tanh (c_{t} )^{2} ) \circ c_{{t - 1}} \circ f_{t} \circ (1 - f_{t} ) \hfill \\ \delta _{{_{{i,t}} }}^{T} = \delta _{{_{t} }}^{T} \circ o_{t} \circ (1 - \tanh (c_{t} )^{2} ) \circ \widetilde{{c_{t} }} \circ i_{t} \circ (1 - i_{t} ) \hfill \\ \delta _{{^{{_{{\widetilde{c},t}} }} }}^{T} = \delta _{{_{t} }}^{T} \circ o_{t} \circ (1 - \tanh (c_{t} )^{2} ) \circ i_{t} \circ (1 - \widetilde{{c_{t} }}^{2} ) \hfill \\ \end{gathered} \right.$$
(37)

Equations (36) and (37) are the formulas for back-propagating the error by one step along time, so the formula for propagating the error term back to any earlier moment \(k\) can be obtained:

$$\delta_{k}^{T} = \prod\limits_{j = k}^{t - 1} {\left( {\delta_{o,j}^{T} W_{oh} + \delta_{f,j}^{T} W_{fh} + \delta_{i,j}^{T} W_{ih} + \delta_{{\mathop c\limits^{ \sim } ,j}}^{T} W_{ch} } \right)}$$
(38)

At the same time, the formula for transmitting the error to the previous layer \(l - 1\) can also be obtained:

$$\begin{aligned} &\delta_{t}^{l - 1} \mathop = \limits^{def} \frac{\partial E}{{\partial net_{t}^{l - 1} }} \hfill \\&\quad\quad = \left(\delta_{{_{f,t} }}^{T} W_{fx} + \delta_{{_{i,t} }}^{T} W_{ix} + \delta_{{^{{_{{\mathop c\limits^{ \sim } ,t}} }} }}^{T} W_{cx} + \delta_{{_{o,t} }}^{T} W_{ox} \right) \circ f^{^{\prime}} (net_{t}^{l - 1} ) \hfill \\ \end{aligned}$$
(39)

The gradients of \(W_{oh}\), \(W_{fh}\), \(W_{ih}\) and \(W_{ch}\), and the gradients of \(b_{o}\), \(b_{i}\), \(b_{f}\) and \(b_{c}\), are all the sums of their gradients at each moment, and the final gradients are obtained as:

$$\left\{ \begin{gathered} \frac{\partial E}{{\partial W_{oh} }} = \sum\limits_{j = 1}^{t} {\delta_{o,j} h_{j - 1}^{T} } \hfill \\ \frac{\partial E}{{\partial W_{fh} }} = \sum\limits_{j = 1}^{t} {\delta_{f,j} h_{j - 1}^{T} } \hfill \\ \frac{\partial E}{{\partial W_{ih} }} = \sum\limits_{j = 1}^{t} {\delta_{i,j} h_{j - 1}^{T} } \hfill \\ \frac{\partial E}{{\partial W_{ch} }} = \sum\limits_{j = 1}^{t} {\delta_{{^{{_{{\mathop c\limits^{ \sim } }} }} ,j}} h_{j - 1}^{T} } \hfill \\ \end{gathered} \right.$$
(40)
$$\left\{ \begin{gathered} \frac{\partial E}{{\partial b_{o} }} = \sum\limits_{j = 1}^{t} {\delta_{o,j} } \hfill \\ \frac{\partial E}{{\partial b_{i} }} = \sum\limits_{j = 1}^{t} {\delta_{i,j} } \hfill \\ \frac{\partial E}{{\partial b_{f} }} = \sum\limits_{j = 1}^{t} {\delta_{f,j} } \hfill \\ \frac{\partial E}{{\partial b_{c} }} = \sum\limits_{j = 1}^{t} {\delta_{{\mathop c\limits^{ \sim } ,j}} } \hfill \\ \end{gathered} \right.$$
(41)

The gradients of \(W_{ox}\), \(W_{fx}\), \(W_{ix}\) and \(W_{cx}\) only need to be calculated directly from the corresponding error terms:

$$\left\{ \begin{gathered} \frac{\partial E}{{\partial W_{ox} }} = \frac{\partial E}{{\partial net_{o,t} }}\frac{{\partial net_{o,t} }}{{\partial W_{ox} }} = \delta_{o,t} x_{t}^{T} \hfill \\ \frac{\partial E}{{\partial W_{fx} }} = \frac{\partial E}{{\partial net_{f,t} }}\frac{{\partial net_{f,t} }}{{\partial W_{fx} }} = \delta_{f,t} x_{t}^{T} \hfill \\ \frac{\partial E}{{\partial W_{ix} }} = \frac{\partial E}{{\partial net_{i,t} }}\frac{{\partial net_{i,t} }}{{\partial W_{ix} }} = \delta_{i,t} x_{t}^{T} \hfill \\ \frac{\partial E}{{\partial W_{cx} }} = \frac{\partial E}{{\partial net_{{\mathop c\limits^{ \sim } ,t}} }}\frac{{\partial net_{{\mathop c\limits^{ \sim } ,t}} }}{{\partial W_{cx} }} = \delta_{{\mathop c\limits^{ \sim } ,t}} x_{t}^{T} \hfill \\ \end{gathered} \right.$$
(42)

Through the above gradients, the values of the weights and biases can be updated so as to realize the training of the DLSTM.
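
As an illustration of Eqs. (37) and (40)–(42), the sketch below computes the four error terms at one time step from a given \(\delta_{t} = \partial E/\partial h_{t}\) and forms this step's contributions to the weight and bias gradients. The stored forward quantities and \(\delta_{t}\) are hypothetical random values, since the full forward pass and loss are omitted for brevity; a complete implementation would sum these contributions over all time steps as in Eqs. (40) and (41).

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid = 8, 6

# Hypothetical stored forward quantities at time t (these would come from the forward pass).
x_t = rng.normal(size=n_in)
h_prev = rng.normal(size=n_hid)
c_prev = rng.normal(size=n_hid)
f_t, i_t, o_t = rng.random(n_hid), rng.random(n_hid), rng.random(n_hid)
c_tilde = np.tanh(rng.normal(size=n_hid))
c_t = f_t * c_prev + i_t * c_tilde
delta_t = rng.normal(size=n_hid)          # dE/dh_t arriving from the layer above / later steps

tanh_c = np.tanh(c_t)

# Eq. (37): error terms of the three gates and the candidate state.
delta_o = delta_t * tanh_c * o_t * (1 - o_t)
delta_f = delta_t * o_t * (1 - tanh_c ** 2) * c_prev * f_t * (1 - f_t)
delta_i = delta_t * o_t * (1 - tanh_c ** 2) * c_tilde * i_t * (1 - i_t)
delta_c = delta_t * o_t * (1 - tanh_c ** 2) * i_t * (1 - c_tilde ** 2)

# Eqs. (40)-(41): this time step's contributions to the recurrent-weight and bias gradients.
grad_Woh, grad_bo = np.outer(delta_o, h_prev), delta_o
grad_Wfh, grad_bf = np.outer(delta_f, h_prev), delta_f
grad_Wih, grad_bi = np.outer(delta_i, h_prev), delta_i
grad_Wch, grad_bc = np.outer(delta_c, h_prev), delta_c

# Eq. (42): gradients of the input weights at this step.
grad_Wox, grad_Wfx = np.outer(delta_o, x_t), np.outer(delta_f, x_t)
grad_Wix, grad_Wcx = np.outer(delta_i, x_t), np.outer(delta_c, x_t)
```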

Experiments and result analysis

Experimental environment

  • Hardware: the motherboard, which provides an integrated development environment including the core processing unit, memory, various interfaces, and an onboard speech processing module that can amplify, filter, sample, convert with an A/D (analog-to-digital) or D/A (digital-to-analog) converter and digitize the speech signal, as well as MIC (microphone), ZigBee, RFID, GPRS, Wi-Fi, RS-232, USB and other interfaces.

  • Software: a Linux system for embedded development, combined with the important auxiliary tools SecureCRT and ESP8266, used respectively for downloading, cross-compiling, burning and writing the algorithms, code and other data.

Experimental process and results

The implementation process of speech recognition semantic control is as follows. First, voice signals are obtained from audio files or input devices, A/D conversion is performed, the signals are encoded and decoded, and the novel deep hybrid intelligent algorithm is used for learning and training. Second, the corresponding semantic vocabularies are obtained and the language-to-semantics conversion is realized. Third, based on the semantic information, the system achieves the corresponding I/O output controls through system call functions and performs the related system operations. For example, it can turn the LED (light-emitting diode) lights of the corresponding equipment on and off. To do this, the system should implement at least the "open", "read", "write", "close" and similar system operations [56,57,58]; a minimal sketch of such a control flow is given below. In the experiment, we also refer to the development boards of YueQian and the phonetic components of Hkust XunFei.
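
The following Python sketch illustrates the third step: mapping a recognized semantic phrase to an I/O operation using only the basic "open", "write" and "close" system calls. The sysfs GPIO path and the mapping from light numbers to GPIO lines are hypothetical placeholders for the board-specific values used on the actual middleware.

```python
import os

# Hypothetical mapping from light ID (No. 1-6) to a GPIO line exported under sysfs.
GPIO_OF_LIGHT = {1: 17, 2: 18, 3: 27, 4: 22, 5: 23, 6: 24}

def set_light(light_id, on):
    """Turn one LED on or off through the open/write/close system calls."""
    gpio = GPIO_OF_LIGHT[light_id]
    path = "/sys/class/gpio/gpio%d/value" % gpio   # hypothetical board-specific path
    fd = os.open(path, os.O_WRONLY)                # "open"
    try:
        os.write(fd, b"1" if on else b"0")         # "write"
    finally:
        os.close(fd)                               # "close"

def act_on_semantics(text):
    """Map a recognized phrase such as 'turn on light 3' to the corresponding control."""
    words = text.lower().split()
    if "light" in words:
        light_id = int(words[-1]) if words[-1].isdigit() else 1
        set_light(light_id, on=("on" in words))

# Example: semantics produced by the recognizer (hypothetical).
act_on_semantics("turn on light 3")
```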

The intelligent control system implemented in this paper offers broader functionality. It can realize a wider range of recognition and interaction, for example the recognition of vocabularies of one, two, three, four, five and multiple words, with voice data coming respectively from audio files, microphone input devices or mobile phone terminals via Wi-Fi, and so on. The experimental results are shown below.

  1.

    First of all, we carried out experiments on the recognition of a variety of vocabularies, for example of one, two, three, four, five and multiple words, taken respectively from the audio file or the microphone input device. For generality and validity, each experiment was repeated 30 times; the recognition results are shown in Tables 1 and 2 respectively. As can be seen from both tables, except for the first recognition of the multiple-word vocabulary from the microphone, which was wrong because of initialization, the system achieved very good and stable recognition results, and the recognition rate almost reached 100%.

  2.

    For example, for the recognition of the voice data "Turning on light" and "Turning off light" to implement intelligent interactive control, we used six lights with ID numbers from No. 1 to No. 6 in the experiment and realized the operations of turning any light, such as No. 3 or No. 6, on or off. A correct operation is denoted as 1 and a wrong operation as 0. Each experiment was repeated 30 times for each light. To be more general and authentic, we used two types of circuit boards with these lights for the experiments, named category I and category II respectively. All recognition and interaction results are shown in Tables 3 and 4 respectively. From the results in these tables we can see that the speech recognition semantic control system for the audio file achieved very good and stable recognition and interaction results on both category I and category II circuit boards, and the recognition and interaction rate reached 100%.

  3.

    For the speech recognition semantic control system driven by the microphone on category I and category II circuit boards, all recognition and interaction results are shown in Tables 5 and 6 respectively. From the results in these tables we can see that the system also achieved very good and stable recognition and interaction results, and the recognition and interaction rate again reached 100%.

  4.

    For voice data from mobile phone terminals via Wi-Fi, the speech recognition semantic control system produced one recognition error on each of the category I and category II circuit boards, namely for the No. 1 light on its first trial and the No. 6 light on its first trial. The recognition and interaction results are therefore slightly worse, but the recognition and interaction rate is still close to 100%, namely 99.4444%. The main reason is that the signal is not stable when the Wi-Fi connection is first established. All recognition and interaction results are shown in Tables 7 and 8 respectively.

  5.

    In addition to achieving very good and stable recognition and interaction results, we also measured the recognition time. To offer more modes of human–machine interaction, the paper realizes several kinds of recognition, namely recognition based on audio files, on microphones and on mobile phone terminals. Considering the information processing involved, it is intuitive that recognition based on mobile phone terminals takes a little longer, recognition based on microphones is second, and recognition based on audio files is fastest. We therefore take the middle case, recognition based on microphones, for the timing experiments. Each experiment was repeated 20 times for each light; the results are shown in Tables 9 and 10. Time is given in seconds and milliseconds, where 1 s = 1000 ms. It can be seen that all recognition times are less than one second, which is very good and completely able to meet actual needs.

  6.

    To show how the recognition time varies, we plotted the curve diagrams in Figs. 5 and 6. As can be seen from Fig. 5, all recognition times are less than one second: the maximum recognition time is only 0.982 s, the shortest is 0.447 s, and the average over all recognitions is 0.7493 s. These results are very good and completely able to meet actual needs. Likewise, Fig. 6 shows that all recognition times are also less than one second, with a maximum of only 0.968 s, a minimum of 0.624 s and an average of 0.7767 s, which is equally good and completely able to meet actual needs.

  7.

    For each light, we also obtained the average recognition times, which are 0.84785, 0.78010, 0.69420, 0.67705, 0.71850 and 0.77810 s on the category I boards and 0.78040, 0.84755, 0.78865, 0.74245, 0.77340 and 0.72800 s on the category II boards; the bar charts are shown in Figs. 7 and 8. From these values and figures it can be seen that all mean recognition times are also less than one second, with very little variation between them, which shows that the recognition and interaction performance of the system is good and very stable, and completely able to meet actual needs.

Table 1 Recognition results for a variety of vocabularies from the audio file for 30 times
Table 2 Recognition results for a variety of vocabularies from the microphone for 30 times
Table 3 Recognition and interaction results for the audio file on category I circuit boards for 30 times
Table 4 Recognition and interaction results for the audio file on category II circuit boards for 30 times
Table 5 Recognition and interaction results for the microphone on category I circuit boards for 30 times
Table 6 Recognition and interaction results for the microphone on category II circuit boards for 30 times
Table 7 Recognition and interaction results for mobile phone terminals on category I circuit boards for 30 times
Table 8 Recognition and interaction results for mobile phone terminals on category II circuit boards for 30 times
Table 9 Recognition time based on microphones on category I circuit boards for 20 times
Table 10 Recognition time based on microphones on category II circuit boards for 20 times
Fig. 5

Curve diagram of recognition time on category I circuit boards for 20 times

Fig. 6

Curve diagram of recognition time on category II circuit boards for 20 times

Fig. 7

Bar chart of average recognition time for each light on category I circuit boards

Fig. 8

Bar chart of average recognition time for each light on category II circuit boards

Beyond the voice data recognition above, the system can recognize almost any vocabulary of paired opposite meanings, for example "Up and Down", "Left and Right", "Before and After", "Go and Stop", "Black and White" and so on. The supported vocabulary is therefore very large, which is enough to meet the needs of almost all applications of the Internet of Things, namely to implement human–machine interaction based on phonetic and semantic control for constructing the intelligent ecology of the Internet of Things.

Summary and prospect

The implementation of general speech recognition models pursues only performance, without considering capacity and computing power, and usually requires large capacity and strong computing power. Realizing speech analysis and semantic recognition with small capacity and low computing power is a highly challenging research area for constructing the intelligent ecology of the Internet of Things. For this purpose, we set up the unit middleware for implementing human–machine interconnection with small capacity and low computing power, namely human–machine interaction based on phonetic and semantic control for constructing the intelligent ecology of the Internet of Things. First, through calculation, theoretical derivation and verification, we presented a novel deep hybrid intelligent algorithm that realizes speech analysis and semantic recognition. Second, we established the unit middleware with an embedded chip as the core of the motherboard. Third, we developed the important auxiliary tools, a writer-burner and a cross-compiler. Fourth, we pruned the procedures and the system, then downloaded, burned and wrote the algorithms and code into the unit middleware and cross-compiled them. Fifth, we expanded the functions of the motherboard and provided more components and interfaces, including RFID, ZigBee, Wi-Fi, GPRS, RS-232 serial ports, USB interfaces and so on. Sixth, we combined algorithms, software and hardware to make machines "understand" human speech and "think" about and "comprehend" human intentions, so as to implement human–machine interconnection and further structure the intelligent ecology of the Internet of Things. Finally, the experimental results show that the unit middleware achieves very good results, with fast recognition speed, high accuracy and good stability, consequently realizing the intelligent ecology construction of the Internet of Things.

Recognition models, performance, capacity and computing power are all interrelated and influence one another. Generally, a large model with good performance requires large capacity and strong computing power, and vice versa. Obviously, speech recognition with small capacity and low computing power is much harder and remains a big challenge. Different applications also have different requirements, and the problem is essentially a constrained optimization problem. Further revealing the relationships and laws involved, such as how to do this better and what the quantitative relationships are, is the direction of our future efforts.