1 Introduction

Voice activity detection plays a significant role in enabling natural and spontaneous Human-Computer and Human-Robot Interaction applications. However, voice activity detection based solely on the audio modality faces various challenges in real-world environments. The fact that visual information is complementary to the acoustic speech signal is well established in the literature. The influence of visual features on the perception of speech was demonstrated in [1] and is widely known as the McGurk effect. Moreover, when the acoustic signal is corrupted by noise, the availability of lip movement data grants a listener an extra 4–6 dB of noise tolerance compared to audio data alone [2].

1.1 Related Works

Many works in the area of visual voice activity detection utilize intensity-based features of the mouth region. Siatras et al. [3] used variations in the number and intensity of mouth-region pixels as cues for voice activity detection; a case-specific threshold was needed to identify the relevant pixels. Ahmad et al. [4] presented a method in which changes in the mean intensity values of the mouth area during speech and silence periods were modeled by Gaussian Mixture Models. These methods rely solely on image intensities, which makes them unsuitable for adverse lighting conditions. In [5], Song et al. described a method based on a chaos-inspired similarity measure to mitigate changes in lighting conditions; however, their classification model depends on a number of predefined thresholds which are not easily generalizable. Moreover, none of the approaches described above considers natural lip movements during non-speaking intervals. Aubrey et al. [6] attempted to model both speech and non-speech movements with Hidden Markov Models (HMMs), using optical flow vectors computed from consecutive mouth-area frames as observations. A similar approach, based on an optical-flow-derived mouth image energy feature and a bi-level HMM, was proposed by Tiawongsombat et al. [7]. Most optical flow techniques require appropriate illumination conditions and assume that the displacement of pixels between frames is not abrupt [8]. Moreover, Hidden Markov Models are not well suited to modeling long data sequences because of the independence assumption, whereby each hidden state can depend only on the immediately preceding one.

2 Proposed Method

The technique proposed in this paper utilizes only simple lip shape information extracted from video sequences containing the speaker’s face. First, the speaker’s face is detected by a histogram of oriented gradients (HOG) based detector [9]. Second, the detected face is used as a template to initialize the tracking algorithm [11], and in each subsequent frame the facial landmark coordinates of the tracked face are located using the pretrained shape predictor [9]. Next, the lip shape vectors are geometrically normalized and the centroid distance function is applied to them, yielding the lip shape features. Temporal variations of lip shape during speech and silence periods, including non-trivial head motions, are modeled using a Long Short-Term Memory (LSTM) recurrent neural network. Figure 1 illustrates the overall approach. Details of each step are described below.

Fig. 1. Proposed approach

2.1 Face Detection

Face detection is performed on the initial frames until a face is found, using the pre-trained sliding window detector provided by [9], which utilizes a structural SVM classifier trained on histogram of oriented gradients (HOG) features. It produces substantially fewer false detections than the Viola-Jones face detector [10].
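As an illustration, this step can be reproduced with dlib's Python bindings, which expose the same pre-trained HOG + structural SVM frontal face detector; the frame source in the sketch below is only a placeholder.

```python
# Illustrative sketch of the face detection step using dlib's pre-trained
# HOG + structural SVM frontal face detector (the detector described in [9]).
import cv2
import dlib

detector = dlib.get_frontal_face_detector()

frame = cv2.imread("frame.png")                  # placeholder input frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = detector(gray, 1)                        # upsample once to catch smaller faces

if len(faces) > 0:
    r = faces[0]                                 # bounding rectangle of the detected face
    face_box = (r.left(), r.top(), r.width(), r.height())
```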

2.2 Face Tracking

The Kernelized Correlation Filter (KCF) tracker [11] is then applied. It is an online discriminative tracking algorithm: the object of interest is tracked continuously while the tracking model is adapted to incorporate new information about it. The KCF tracker exploits the performance and computational efficiency afforded by utilizing cyclically translated samples and by performing the corresponding kernel correlations in the Fourier domain. The bounding rectangle returned by the face detector is used to extract the face template that initializes the KCF tracker. Given a base template x of size (\(m\,\times \,n\)) and the Gaussian-shaped regression targets y of size (\(m\,\times \,n\)), the dual coefficients \(\hat{\alpha }\) that solve the kernelized ridge regression in the Fourier domain are given by

$$\begin{aligned} \hat{\alpha } = \frac{\hat{y}}{\hat{k}^{xx} + \lambda } \end{aligned}$$
(1)

where

  • \(\hat{\alpha }\) = DFT (Discrete Fourier Transform) of dual coefficients

  • \(\hat{y}\) = DFT of Gaussian-shaped regression targets

  • \(\hat{k}^{xx} \) = DFT of kernel autocorrelation vector

  • \(\lambda \) = Regularization term

In the subsequent frames, the kernel cross-correlation (\(\hat{k}^{xz}\)) between cyclically shifted versions of the base sample x and a new candidate patch z can be computed, and an \(m \times n\) map of responses given by the following equation is obtained. \(\odot \) represents element-wise multiplication.

$$\begin{aligned} \hat{f}(z) = \hat{k}^{xz} \odot \hat{\alpha } \end{aligned}$$
(2)

By locating the maximum response in the real part of the inverse Discrete Fourier Transform of \(\hat{f}(z)\), i.e. Re(\(\mathcal {F}^{-1}(\hat{f}(z))\)), the location of the tracked face in the current frame can be found. The tracking model is then adapted to the newly found face patch \(x_{n}\) in frame n to reflect the changes.

$$\begin{aligned} \hat{\alpha }_{n} = (1 - \eta ) *{\hat{\alpha }_{n-1}} + \eta *(\frac{\hat{y}_{n-1}}{\hat{k}^{x_{n}x_{n}} + \lambda }) \end{aligned}$$
(3)
$$\begin{aligned} x_{n} =(1 - \eta ) *x_{n-1} + \eta *x_{n} \end{aligned}$$
(4)

Here \(\eta \) is the adaptation parameter, and \(\hat{\alpha }_{n}\) and \(x_{n}\) are the Fourier transform of the dual coefficients and the newly interpolated base template, respectively.
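In practice, a KCF tracker initialized from the detector's bounding rectangle behaves as described above. The following sketch uses OpenCV's KCF implementation from opencv-contrib rather than the authors' own code; note that the constructor name varies slightly between OpenCV versions.

```python
# Hedged sketch of the face tracking step with OpenCV's KCF tracker
# (cv2.TrackerKCF_create in opencv-contrib; newer builds may expose cv2.TrackerKCF.create()).
import cv2

cap = cv2.VideoCapture(0)                        # placeholder webcam source
ok, frame = cap.read()

tracker = cv2.TrackerKCF_create()
tracker.init(frame, face_box)                    # face_box = (x, y, w, h) from the face detector

while True:
    ok, frame = cap.read()
    if not ok:
        break
    found, bbox = tracker.update(frame)          # peak of the correlation response map (Eq. 2)
    if found:
        x, y, w, h = [int(v) for v in bbox]      # tracked face in the current frame
```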

2.3 Landmark Localization and Geometric Normalization

To localize the facial landmarks in the tracked face, an implementation of the method described in [12] is utilized. It has been trained on the iBUG 300-W facial landmark dataset to predict the locations of 68 facial landmarks in real time. Let \(c_{i} \in \mathbb {R}^{2}\) be the \(i\)-th (x, y) coordinate of the lip contour in a face image F. The shape vector C = \((c_{1}^T, c_{2}^T, c_{3}^T, ...,c_{20}^T)\) represents the 20 coordinates of the inner and outer lip contours in F. Next, the scale, orientation and translation components of the detected lip landmarks have to be normalized. In this step, it is necessary to find the \(2 \times 3\) affine transformation matrix \(\mathbf {A}\) which maps the vector C onto a coordinate frame of size \(128 \times 96\), as shown in Fig. 2. \(\mathbf {A}\) is defined as

$$\begin{aligned} \mathbf {A} = \left[ \begin{array}{ccc} \alpha &{} \beta &{} [(1-\alpha )*{c_{x}}_{center} - \beta *{c_{y}}_{center}] \\ -\beta &{} \alpha &{} [\beta *{c_{x}}_{center} + (1-\alpha )*{c_{y}}_{center}] \\ \end{array} \right] \end{aligned}$$
(5)

where

  • \(\alpha \) = scale \(\cdot \cos \theta \)

  • \(\beta \) = scale \(\cdot \sin \theta \)

  • \({c_{x}}_{center}\) = x coordinate of mouth center

  • \({c_{y}}_{center}\) = y coordinate of mouth center

  • \(\theta \) = angle of rotation about the mouth center \(({c_{x}}_{center}, {c_{y}}_{center})\)

To get \(\theta \) and scale, the following equations are computed.

$$\begin{aligned} \theta = \arctan (\frac{d_{y}}{d_{x}}) *\frac{180}{\pi } \end{aligned}$$
(6)
$$\begin{aligned} scale = \frac{\gamma * 128}{||c_{lmc} - c_{rmc}||^{2}} \end{aligned}$$
(7)

where \(d_{y}\) and \(d_{x}\) are the differences between the y and x coordinates of the left (\(c_{lmc}\)) and right (\(c_{rmc}\)) mouth corners, respectively, and \(\gamma \) is a scaling factor.

Finally, the geometrically normalized lip shape vector \(C_{norm}\) can be obtained by

$$\begin{aligned} C_{norm} = \mathbf {A} \left[ \begin{array}{c} C\\ 1\\ \end{array} \right] ^{T} \end{aligned}$$
(8)
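A compact sketch of this normalization step is given below, assuming the 20 lip points come from dlib's 68-point shape predictor (outer contour points 48–59, inner contour points 60–67). The rotation matrix produced by cv2.getRotationMatrix2D has exactly the form of Eq. 5; for simplicity the sketch uses the plain corner-to-corner distance in Eq. 7 and an illustrative value for \(\gamma \).

```python
# Sketch of lip landmark extraction and geometric normalization (Eqs. 5-8),
# assuming dlib's 68-point shape predictor; gamma and point indices are illustrative.
import numpy as np
import cv2
import dlib

predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_shape(gray, face_rect):
    """Return the 20 lip contour points (outer 48-59, inner 60-67) as a (20, 2) array."""
    shape = predictor(gray, face_rect)
    return np.array([[shape.part(i).x, shape.part(i).y] for i in range(48, 68)], dtype=np.float64)

def normalize_lip_shape(C, gamma=0.6, frame_w=128):
    left, right = C[0], C[6]                          # outer mouth corners (points 48 and 54)
    dx, dy = right - left
    theta = np.degrees(np.arctan2(dy, dx))            # rotation angle in degrees (Eq. 6)
    scale = gamma * frame_w / np.linalg.norm(right - left)   # Eq. 7
    center = C.mean(axis=0)                           # mouth center
    A = cv2.getRotationMatrix2D((float(center[0]), float(center[1])), theta, scale)  # Eq. 5
    C_h = np.hstack([C, np.ones((C.shape[0], 1))])    # homogeneous coordinates
    return (A @ C_h.T).T                              # normalized lip shape C_norm (Eq. 8)
```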
Fig. 2. Lip shape normalization

2.4 Centroid Distance Features

In this step, the centroid distance function [13] (CDF) is applied to the normalized outer and inner lip contour points in \(C_{norm}\). CDF measures the distances between outer and inner lip boundary points and their respective centroids.

$$\begin{aligned} {d^{O}}_{n} = \sqrt{({x^{O}}_{n} - x^{O}_{c})^{2} + ({y^{O}}_{n} - y^{O}_{c})^2} \end{aligned}$$
(9)
$$\begin{aligned} {d^{I}}_{m} = \sqrt{({x^{I}}_{m} - x^{I}_{c})^{2} + ({y^{I}}_{m} - y^{I}_{c})^2} \end{aligned}$$
(10)

The resulting 20-dimensional feature vector \(d_{i}\) \(=\) \([{d^{O}}_{n}\ {d^{I}}_{m}]\) conveys the lip shape information during speech and silence frames.
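Under the same assumption about landmark ordering (12 outer points followed by 8 inner points), the feature computation reduces to a few lines:

```python
# Centroid distance features (Eqs. 9-10); the 12/8 outer-inner split is an
# assumption based on the 68-point landmark scheme.
import numpy as np

def centroid_distance_features(C_norm):
    outer, inner = C_norm[:12], C_norm[12:]
    d_outer = np.linalg.norm(outer - outer.mean(axis=0), axis=1)   # distances to the outer centroid
    d_inner = np.linalg.norm(inner - inner.mean(axis=0), axis=1)   # distances to the inner centroid
    return np.concatenate([d_outer, d_inner])                      # 20-dimensional feature vector d_i
```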

2.5 LSTM Recurrent Neural Network

Recurrent neural networks (RNNs) belong to the family of supervised learning techniques. They have a major advantage over feed-forward neural networks and support vector machines in that they can efficiently capture time dynamics and handle long-range temporal dependencies. Moreover, unlike Hidden Markov Models (HMMs), they do not suffer from the independence assumption whereby the current state can depend only on a limited number of previous ones. RNNs can model a probability distribution over an arbitrarily long sequence, \(P(x_{1},x_{2},...,x_{T})\), without the simplifications necessary to make HMMs mathematically and computationally tractable. At time step t, a recurrent neural network retains a state \({{\varvec{s}}}_{t}\) which encodes information about the entire sequence of previous inputs \((x_{1},x_{2},...,x_{t})\). Therefore, an RNN can be trained to learn a function f such that

$$\begin{aligned} {{\varvec{s}}}_{t} = f({{\varvec{s}}}_{t-1},{{\varvec{x}}}_{t}) \end{aligned}$$
(11)

For a simple recurrent neural network, the Eq. 11 becomes

$$\begin{aligned} {{\varvec{s}}}_{t} = \phi (\textit{W}^{ss}{{\varvec{s}}}_{t-1} + \textit{W}^{sx}{{\varvec{x}}}_{t}) \end{aligned}$$
(12)

where \(\phi \) is either a logistic sigmoid or hyperbolic tangent nonlinearity, \(\textit{W}^{ss}\) represents the state-to-state connections, and \(\textit{W}^{sx}\) represents the input-to-state connections. However, simple RNNs cannot effectively learn long-range temporal dependencies, since the backpropagated errors may either decay (vanishing gradients) or grow (exploding gradients) across many time steps.

Long Short-Term Memory recurrent neural networks [14, 15] are an extension of the original RNNs that addresses the major shortcomings of the latter. They achieve this by replacing the traditional hidden recurrent nodes with “LSTM cells” that ensure constant error propagation across time steps. Hence, they are readily suited to capturing the dynamics of lip movements over long temporal scales. An LSTM cell contains a special memory node (c) with a self-recurrent connection and an input node (I), while the flow of information to and from the cell is controlled by input (i) and output (o) gates. The forget gate (f) determines the persistence of the state of the special memory node. The equations governing the forward propagation through an LSTM layer are as follows:

The input node \({{\varvec{I}}}\) receives the previous state of the network \({{\varvec{s}}}_{t-1}\) and the current input \({{\varvec{x}}}_{t}\). A squashing function \(\phi \) is then applied to the affine transformation of its inputs.

$$\begin{aligned} {{\varvec{I}}}_{t} = \phi (\textit{W}^{Ix}{{\varvec{x}}}_{t}+\textit{W}^{Is}{{\varvec{s}}}_{t-1}+{{\varvec{b}}}^{I}) \end{aligned}$$
(13)

The input and forget gates have the same inputs, but the sigmoid nonlinearity \(\sigma \) is used as a gating function to produce a value between 0 and 1. Once learned, the input and forget gates determine how much new information is allowed into the memory node and how much of the previously memorized content should be discarded.

$$\begin{aligned} {{\varvec{i}}}_{t} = \sigma (\textit{W}^{ix}{{\varvec{x}}}_{t}+\textit{W}^{is}{{\varvec{s}}}_{t-1}+{{\varvec{b}}}^{i}) \end{aligned}$$
(14)
$$\begin{aligned} {{\varvec{f}}}_{t} = \sigma (\textit{W}^{fx}{{\varvec{x}}}_{t}+\textit{W}^{fs}{{\varvec{s}}}_{t-1}+{{\varvec{b}}}^{f}) \end{aligned}$$
(15)

Based on the activation values of the input and forget gates, the new internal state \({{\varvec{c}}}_{t}\) of the cell is computed as a weighted sum of the new input information \({{\varvec{I}}}_{t}\) and past internal state \({{\varvec{c}}}_{t-1}\). \(\odot \) represents element-wise multiplication.

$$\begin{aligned} {{\varvec{c}}}_{t} = {{\varvec{I}}}_{t}\odot {{\varvec{i}}}_{t}+{{\varvec{c}}}_{t-1}\odot {{\varvec{f}}}_{t} \end{aligned}$$
(16)

The output gate modulates the extent to which the new state of the LSTM cell is exposed to the rest of the network. \({{\varvec{b}}}^{I}\), \({{\varvec{b}}}^{i}\), \({{\varvec{b}}}^{o}\), \({{\varvec{b}}}^{f}\) are bias vectors.

$$\begin{aligned} {{\varvec{o}}}_{t} = \sigma (\textit{W}^{ox}{{\varvec{x}}}_{t}+\textit{W}^{os}{{\varvec{s}}}_{t-1}+{{\varvec{b}}}^{o}) \end{aligned}$$
(17)

Lastly, the hidden state of the LSTM network is updated according to the following equation.

$$\begin{aligned} {{\varvec{s}}}_{t} = {{\varvec{c}}}_{t}\odot {{\varvec{o}}}_{t} \end{aligned}$$
(18)
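Taken together, Eqs. 13–18 define one forward step through an LSTM layer. The sketch below is a compact NumPy rendering of that step, assuming 20-dimensional inputs and 100 hidden units (the configuration used in Sect. 3.2); it is intended only to make the equations concrete, not to reproduce the authors' implementation.

```python
# One LSTM forward step (Eqs. 13-18); under the stated assumption, the W['*x'] matrices
# have shape (100, 20), the W['*s'] matrices (100, 100), and the biases (100,).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, s_prev, c_prev, W, b):
    I_t = np.tanh(W['Ix'] @ x_t + W['Is'] @ s_prev + b['I'])   # input node (Eq. 13)
    i_t = sigmoid(W['ix'] @ x_t + W['is'] @ s_prev + b['i'])   # input gate (Eq. 14)
    f_t = sigmoid(W['fx'] @ x_t + W['fs'] @ s_prev + b['f'])   # forget gate (Eq. 15)
    c_t = I_t * i_t + c_prev * f_t                             # memory cell update (Eq. 16)
    o_t = sigmoid(W['ox'] @ x_t + W['os'] @ s_prev + b['o'])   # output gate (Eq. 17)
    s_t = c_t * o_t                                            # hidden state (Eq. 18)
    return s_t, c_t
```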

A single LSTM unit and the overall network architecture are shown in Figs. 3 and 4, respectively. The HMM-based classification scheme is presented in the next section.

2.6 Hidden Markov Model

In this section, a brief description of the Hidden Markov Model (HMM) used to model the time-varying lip shape vectors is provided. Hidden Markov Models are probabilistic generative models widely used in a variety of sequence generation and classification tasks. An HMM is parameterized by the initial state distribution \(\pi \), the state transition matrix A, and the observation model O. Given M hidden states and N-dimensional feature vectors, the state variable \(X_{t} \in \{1,...,M\}\) and the observation variable \(Y_{t} \in \mathbb {R}^{N}\) at time t can be defined. Then, it follows that

$$\begin{aligned} \pi (i) = P(X_{1} = i) \end{aligned}$$
(19)
$$\begin{aligned} A(i, j) = P(X_{t}=j|X_{t-1}=i) \end{aligned}$$
(20)
$$\begin{aligned} O(i) = P(Y_{t}|X_{t}=i) \end{aligned}$$
(21)

From a series of T observations \({Y^{c}}_{t=1:T}\) generated by a process c, the parameters \({\pi }^{c}\), \({A}^c\), and \({O}^c\) of the model \({M}^c\) can be estimated using the well-known Baum-Welch algorithm [16]. The classification task using the trained HMMs, \(M^{c=speech}\) and \(M^{c=silence}\), can then be formulated using the log likelihoods of the models.

$$\begin{aligned} {M^{c}}^{*} = \arg \!\max _c P(Y|M^{c})P(M^{c}) \end{aligned}$$
(22)

\(P(Y|M^{c})\) is the likelihood, defined as the probability of the observed data given the model \(M^{c}\), and \(P(M^{c})\) is the prior for the model, which can be omitted as it is assumed to be uniform.

Fig. 3. An LSTM memory cell

Fig. 4. LSTM network architecture

Fig. 5. Confusion matrix for LSTM classification

Fig. 6. Confusion matrix for HMM classification

3 Experiment Settings

3.1 Dataset

To evaluate the performance of the proposed approach, a visual data collection system was set up as follows. A standard webcam with a resolution of 640 × 480, capturing at 30 frames per second, was connected to a laptop running the feature extraction algorithm, including face detection and tracking. Six subjects were asked to sit in front of the webcam and instructed to perform speech and non-speech lip movements. During the non-speech period, in addition to keeping their lips stationary, the subjects were instructed to perform typical behaviors such as smiling, laughing, and shaking and nodding the head for approximately 5 min. For data collection during speech, each subject was asked to read aloud a collection of Thai words and articles for about 5 min. In total, 20-dimensional centroid distance features were collected in real time from approximately 230,000 frames. This process was carried out under uncontrolled, everyday lighting conditions.

3.2 Network Architecture

In this experiment, the temporal evolution of lip shapes during speaking and non-speaking states is modeled with an LSTM recurrent neural network containing one hidden layer of LSTM units. The input layer of the network receives a 20-dimensional feature vector at each time step. The number of LSTM units in the hidden layer is 100, which was experimentally found to be optimal for the current problem. A dropout layer [17], which randomly sets a portion of its inputs to zero with probability P (\(P=0.5\)), is added to reduce overfitting. The final layer is a fully connected layer with a sigmoid activation function which outputs a value between 0 and 1. The binary cross-entropy between the target and network output values is then evaluated. The network is trained using the RMSProp algorithm [18] with the initial learning rate set to 0.001.
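The paper does not name the implementation framework; purely as an assumption, the architecture just described can be written in Keras as follows, with input sequences of 60 frames matching the window length used elsewhere in the paper.

```python
# Sketch of the described network in Keras (the framework choice and the sequence
# length of 60 frames are assumptions; the paper specifies only the layer sizes
# and the training setup).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.optimizers import RMSprop

model = Sequential([
    LSTM(100, input_shape=(60, 20)),      # one hidden layer of 100 LSTM units, 20-dim inputs
    Dropout(0.5),                         # dropout with P = 0.5 to reduce overfitting
    Dense(1, activation='sigmoid'),       # single output between 0 and 1
])
model.compile(optimizer=RMSprop(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
```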

3.3 HMM-based Classifier

HMMs with Gaussian mixture observation models are trained using the same lip shape sequences as described in the previous section. 20-dimensional feature vectors spanning a time window of 60 frames are used as training sequences to estimate the parameters of the speech and silence HMMs. The classification scheme described in Eq. 22 is employed to classify the video frame sequences as either speech or silence.
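A minimal sketch of this baseline, assuming the hmmlearn library and illustrative values for the number of hidden states and mixture components (the paper does not report these), is given below.

```python
# HMM baseline sketch using hmmlearn; n_states and n_mix are illustrative choices,
# not values reported in the paper.
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_hmm(windows, n_states=3, n_mix=2):
    """windows: list of (60, 20) arrays of centroid distance features for one class."""
    X = np.concatenate(windows)
    lengths = [len(w) for w in windows]                  # one entry per 60-frame training sequence
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type='diag', n_iter=50)
    model.fit(X, lengths)                                # Baum-Welch estimation of pi, A and O
    return model

def classify(window, hmm_speech, hmm_silence):
    # Eq. 22 with uniform priors: choose the model with the higher log-likelihood.
    return 'speech' if hmm_speech.score(window) > hmm_silence.score(window) else 'silence'
```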

Fig. 7. An example of the program running in real time

Table 1. Classification rate
Table 2. Classification accuracies for different subjects using LSTM
Table 3. Classification accuracies for different subjects using HMM

4 Experimental Results

In the first part of the experiment, the performance of the LSTM network and the HMM-based classifier is evaluated on the dataset containing lip shape features extracted from all subjects. The two HMMs are trained on 75 % of the data, while the remaining 25 % are used for testing. For the LSTM-based classifier, the dataset is divided into training, validation and test sets, with the training set containing 50 % of the data. To investigate the effectiveness and generalizability of the proposed approach, the two models are then trained solely on the centroid distance features extracted from each person, and their classification performance is assessed on the datasets of every other subject. The results of this second experiment are shown in Tables 2 and 3. Even though the LSTM network is trained on only a fraction of the whole dataset, it is still able to classify with high accuracy in most cases, whereas the HMM-based classifier shows consistently lower performance. The proposed approach using the LSTM network was also tested in real time on a laptop with an Intel Core i5 processor and a 30 fps webcam with a resolution of 640 × 480, performing a forward pass through the network every 60 frames. Given that the user’s face is detected and tracked correctly, the algorithm can efficiently and accurately classify speech and complex non-speech lip shape sequences (Fig. 7).

5 Conclusion

A novel method for visual voice activity detection with an integrated face tracking framework has been proposed. Since the features utilized by this method are simple and purely geometric, the efficiency and robustness of the algorithm are greatly increased. Another contribution of this paper is the use of a Long Short-Term Memory neural network to model long-range temporal evolutions of lip shapes during periods of speech and non-speech. To the best of the authors’ knowledge, this is the first time that a temporal connectionist model has been applied to visual voice activity detection. Moreover, the performances of an LSTM recurrent neural network and the classical HMM on the visual voice activity detection task, using the proposed features, are compared. Experimental results show that the trained network can achieve a classification rate above 98 %, and the generalization performance of the proposed approach using the LSTM network is shown to be better than that of the HMM.