1 Introduction

Visual Language IDentification (VLID) is the recognition of the language a subject speaks in a video with no audio track.

Successful VLID is useful in scenarios in which conventional audio processing is ineffective (very noisy environments) or impossible (no audio signal is available) [1]. It has been shown that visual speech cues may improve a human’s ability to understand speech under noisy conditions [2]. There are two tasks related to VLID: lip-reading and speaker authentication using visual lip information. Lip-reading means understanding the content of the uttered speech; there is a significant body of research on it, see [3] for a review. Speaker authentication is a biometric problem, where the identity of a person is recognized based on the way of speaking [4]. Unlike these tasks, we are concerned only with the language being spoken.

Fig. 1.

An example of VLID. The output score of our method over 100 frames. From left to right: expressions producing a strong French score, a neutral score and a strong English score.

Automatic Language Identification (LID), i.e. the recognition of language from an audio channel, is a mature technology achieving high identification accuracy from a few seconds of speech [5]. In contrast, VLID is still a largely unexplored area of research, considered both an interesting research topic and one of practical usability [1].

Related Work. Newman and Cox [5] present a preliminary study on automatic VLID. The authors describe a system that adopts a model successfully used in audio-based speech recognition systems. Bigram language models are built for each language using a highly speaker-dependent representation extracted from the videos. The database used for the experiments consists of 21 multilingual subjects reading the UN Declaration of Human Rights in all languages in which they are proficient; the dataset is thus also highly content-dependent. Although the database contains 21 speakers, results are presented only for two bilingual speakers (English - Arabic, English - German) and for one trilingual speaker (English, French and German).

Newman and Cox continue their research in [6]. They fully articulate some of the details of the method presented in [5] and extend it towards speaker-independent VLID. The authors manually transcribe the audio track at word level and automatically expand this transcription to units of sound (phonemes) and then, via a custom-designed mapping table, to units of visual communication (visemes). Tied-state mixture triphone HMMs are used to model the probability of a given viseme sequence belonging to a particular language. Classification is done with an RBF-kernel SVM built on the outputs of the individual HMMs. Results for five English/French bilingual speakers are presented.

In their latest work [1], Newman and Cox repeat both the speaker-dependent and speaker-independent experiments on a larger dataset. Again, the videos are highly content-dependent and recorded in a controlled environment. The authors perform a new speaker-independent experiment with features based only on the shape of the lips. After 30 s of speech, \(\approx \)22% mean error rate is achieved in speaker-independent experiments using only lip-shape features and \(\approx \)10% mean error rate using a combination of lip-shape and lip-appearance features. In the final discussion, the authors point out that the results of their speaker-independent experiments with the combined features seem to depend on the difference in the speakers’ skin colour; the experiments were performed with nineteen Arabic and English speakers in total. The speaking rate is also suspected to affect both the speaker-dependent and speaker-independent experiments.

In [7], automatic LID in music videos is explored. A “bag-of-words” approach based on audio-visual features and linear SVM classifiers is presented. Using a combination of audio and video features, 48% accuracy (compared to 4% for chance) is obtained in experiments with 25000 music videos and 25 languages. Using only video features, 14.3% accuracy is achieved. The visual features consist of an \(8\times 8\) hue-saturation histogram, several statistics derived from a face detector and textons computed for each video frame.

All the VLID experiments of Newman and Cox are performed on videos recorded in a studio-like environment with all speakers reading the same text. In [5], highly speaker-dependent features are used. In [6], the experiments are performed with only five bilingual speakers. Despite the low mean error rates reported in [1], a high standard deviation is present in all the results.

To the best of our knowledge, there are no other works on VLID.

In this paper, an approach to VLID using facial landmarks is described (see Fig. 1 for an example output of the method). We examine the hypothesis that the language being spoken may be deduced by observing the lip-shapes of the speaker. A convex optimisation problem is formulated in which a linear classifier is learnt and, simultaneously, the most discriminative representation of the lip-shapes is found.

Our contribution is the introduction of a speaker- and content-independent VLID method. Unlike the existing approaches, our method was validated on realistic videos that were not recorded specifically for the purpose of the research. Additionally, we collected the CMP-VLID dataset, consisting of videos of native English and French speakers, mostly bloggers, downloaded from youtube.com. The dataset is publicly available at the following website: http://cmp.felk.cvut.cz/~spetlrad/cmp-vlid/.

The rest of the paper is structured as follows: Sect. 2 describes the proposed method, Sect. 3 gives the implementation details, Sect. 4 presents the datasets, Sect. 5 covers the experimental settings and results, and Sect. 6 discusses the results and concludes the paper.

2 The Proposed Method

The basic idea of the proposed method is to perform VLID by observing the shape of the speaker’s lips. The language is recognized by measuring the frequency of the most discriminative lip-shapes in a given video. We model a single lip-shape as a subset of standard facial landmarks automatically detected in every frame. The most discriminative lip-shapes are found jointly with the classifier.

Let the training set be

$$\begin{aligned} T = \{(V_i,y_i)\}_{i=1}^{N} \end{aligned}$$
(1)

where \(V_i\) is a video with a single speaker, \(y_i\) is the label of one of two possible languages of the speaker in the video, \(y_i \in \{-1,1\}\), and N is the number of the videos.

Facial landmarks are detected in all frames of all videos. Every video has an associated set of facial landmarks, i.e.

$$\begin{aligned} V_i = \{\mathbf {s}_i^t\}_{t=1}^{L_i} \end{aligned}$$
(2)

where \(\mathbf {s}_i^t\) are the facial landmarks detected in the t-th frame of the i-th video, \(\mathbf {s}_i^t \in \mathbb {R}^D\), D is twice the number of landmark points (there are \(\frac{D}{2}\) x-coordinates and \(\frac{D}{2}\) y-coordinates), and \(L_i\) is the number of frames in video i.

The i-th video is represented by a soft-assignment variant of bag-of-words. A histogram with K bins

$$\begin{aligned} \mathbf {x}_i = \begin{bmatrix}x_i^1&\cdots&x_i^K&1\end{bmatrix}^T \end{aligned}$$
(3)

is used where for \(k \in \{1, \dots , K\}\) the bin values are given as

$$\begin{aligned} x^k_i = \frac{1}{L_i} \sum _{t=1}^{L_i} \exp \left( -\sigma \Vert {\mathbf {s}_i^t-\mathbf {\mu }_k}\Vert _2^2\right) \!, \end{aligned}$$
(4)

\(L_i\) is the number of frames with detected facial landmarks in video i and \(\sigma \) is a hyperparameter. The centroids \(\mu _k\) are constructed by clustering the set \( \{s\} = \bigcup \limits _{i=1}^N \{\mathbf {s}_i^t\}_{t=1}^{L_i}, \) which contains all landmarks from all videos, with the k-means algorithm using the Euclidean norm. The result

$$\begin{aligned} \{\mathbf {\mu }_k\}_{k=1}^K \end{aligned}$$
(5)

is a set of K centroids, \(\mathbf {\mu }_k \in \mathbb {R}^D\).
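
The following Python sketch illustrates how the soft-assignment histogram (3)-(4) and the k-means centroids (5) can be computed. The function and variable names are ours, introduced only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def video_histogram(landmarks, centroids, sigma=1.0):
    """Soft-assignment bag-of-words representation (3)-(4) of one video.

    landmarks : (L_i, D) array of per-frame landmark vectors s_i^t.
    centroids : (K, D) array of k-means centres mu_k.
    """
    # squared Euclidean distance of every frame to every centroid
    d2 = ((landmarks[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    x = np.exp(-sigma * d2).mean(axis=0)   # Eq. (4), averaged over frames
    return np.append(x, 1.0)               # append the constant bias term of (3)

# The centroids (5) are obtained by k-means over the landmarks of all videos, e.g.
# centroids = KMeans(n_clusters=500).fit(np.vstack(all_landmarks)).cluster_centers_
```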

Finally, to discriminate between the languages, a linear binary classification over the representation

$$\begin{aligned} \hat{y}_i = {{\mathrm{sign}}}(\mathbf {w}^T \mathbf {x}_i) \end{aligned}$$
(6)

is performed, where \(\mathbf {w}\) is the unknown weight vector \(\mathbf {w} = \begin{bmatrix}w_1,&\cdots ,&w_K,&b\end{bmatrix}^T\).

To train the classifier, the optimisation problem

$$\begin{aligned} \min _{\mathbf {w}} \big \{\lambda _1 \Vert {\mathbf {w}}\Vert _1 + \lambda _2 \Vert {\mathbf {w}}\Vert _2^2 + \frac{1}{N}\sum _{i=1}^{N} \max \{0, 1-y_i\cdot \mathbf {w}^\intercal \mathbf {x}_i \} \big \} \end{aligned}$$
(7)

is formulated. The last term is the hinge loss. The other two terms are the regularizers. Hyperparameters \(\lambda _1\) and \(\lambda _2\) control the strength of the regularization.

Problem (7) is convex and can be converted into a standard quadratic programming problem for which fast and efficient solvers are available.
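
As an illustration, the sketch below states problem (7) with the CVXPY modelling library. It is a minimal example, not the CPLEX-based implementation used in Sect. 3, and it assumes the histogram matrix already contains the constant bias column.

```python
import cvxpy as cp
import numpy as np

def train_classifier(X, y, lam1=1e-3, lam2=1e-6):
    """Solve the regularized linear hinge-loss problem (7).

    X : (N, K+1) matrix of soft histograms, last column equal to 1.
    y : (N,) vector of labels in {-1, +1}.
    """
    N, d = X.shape
    w = cp.Variable(d)
    hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ w))) / N
    objective = lam1 * cp.norm1(w) + lam2 * cp.sum_squares(w) + hinge
    cp.Problem(cp.Minimize(objective)).solve()
    return w.value
```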

Let us now briefly discuss the \(\ell _1\) norm in Eq. (7). Omitting this term yields a standard linear hinge-loss SVM, whose solution \(\mathbf {w}^*\) assigns an optimal weight to each soft histogram bin \(x^k_i\). The \(\ell _1\) norm enforces sparsity of the solution \(\mathbf {w}^*\), so that only a subset of the soft histogram bins is used. In other words, the optimisation simultaneously searches for the optimal bin weights and selects the most informative bins, i.e. it learns the optimal representation. In Sect. 5, we show that the \(\ell _1\) norm improves the results of the experiments and that a subset of centroids (those with non-zero weight) is indeed selected.

In theory, the landmarks from all videos could be used directly as the centroids \(\mu _k\) in the representation (4); the most discriminative centroids would still be found by solving (7). However, the problem would be too large and computationally intractable. Therefore the data are pre-clustered with the k-means algorithm.

Also note that the method has the following properties. First, although a binary decision problem is formulated here, an extension to multi-class scenarios is straightforward. Second, the fixed-length representation (3) enables classification of an arbitrarily long sequence of images; we expect the representation to be more stable for longer image sequences. Third, a simple linear classifier is used, making real-time applications feasible. Fourth, the representation (3) is inherently invariant to any permutation of the input sequence frames.

3 Implementation Details

In this section the details concerning the implementation of the method are specified. The details are described in the order following the structure of Sect. 2.

Before the landmark detection, a face bounding box is detected. A commercial implementation of a WaldBoost [8] based detector (see footnote 1) was used to perform this task. The detector provides a head yaw estimate. All detected faces with \(|yaw| > 15^{\circ }\) were discarded, as were all faces for which the width of the face bounding box was less than 150 pixels. These two steps were taken to ensure that the landmark detector has good enough input to provide precise results.

Once the face bounding boxes were detected, facial landmark detection was performed with the IntraFace detector [9]. Each detected set of landmarks was translated to the origin and normalized by the interocular distance.
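
A minimal sketch of this normalization step, assuming mean-centering as the translation to the origin; the eye-centre indices are placeholders, since the actual IntraFace landmark indexing is not specified here.

```python
import numpy as np

def normalize_landmarks(s, left_eye_idx, right_eye_idx):
    """Translate landmarks to the origin and scale by the interocular distance.

    s : (P, 2) array of landmark (x, y) positions for one frame.
    left_eye_idx, right_eye_idx : indices of the eye-centre landmarks
    (placeholders, not the actual IntraFace indexing).
    """
    s = s - s.mean(axis=0)                       # translate centroid to the origin
    iod = np.linalg.norm(s[left_eye_idx] - s[right_eye_idx])
    return (s / iod).reshape(-1)                 # flatten to a D-dimensional vector
```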

After the facial landmarks were detected, the non-speaking parts of the video were removed. This was done by computing, separately for each video, the location variance of each landmark over a sliding temporal window of 75 frames (3 s). The mean temporal window variance was then computed for each frame, and frames where this mean was less than \(10^{-4}\) were discarded.
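
A sketch of this filtering step; the exact handling of the window borders is our assumption.

```python
import numpy as np

def keep_speaking_frames(landmarks, window=75, threshold=1e-4):
    """Discard non-speaking frames based on landmark location variance.

    landmarks : (L, D) array with one row of normalized landmark
    coordinates per frame of a single video.
    """
    L, D = landmarks.shape
    half = window // 2
    keep = np.zeros(L, dtype=bool)
    for t in range(L):
        lo, hi = max(0, t - half), min(L, t + half + 1)
        # variance of each coordinate over the temporal window,
        # averaged over all coordinates
        mean_var = landmarks[lo:hi].var(axis=0).mean()
        keep[t] = mean_var >= threshold
    return landmarks[keep]
```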

The quadratic problem defined in Eq. (7) was solved by the IBM ILOG CPLEX Optimization Studio quadratic problem solver using the default settings.

Table 1. The CMP-VLID database.

4 Data

In this section, the experimental datasets are described. First, it is explained how the CMP-VLID dataset was collected. Then the CMP-VLID Test dataset, used for the experiment with human participants and for testing, is presented.

Fig. 2.

CMP-VLID Test dataset. The first and the second rows contain randomly selected frames from videos with five English (EN1, ..., EN5) and five French (FR1, ..., FR5) speakers, respectively. The third row contains the expressions yielding a strong French score and the fourth row a strong English score for every subject in the set.

CMP-VLID Dataset. Both the English and French parts of the CMP-VLID dataset were collected from youtube.com, for the most part semi-automatically. Two main techniques were used. First, a script was created that automatically downloaded all results of a single full-text search on the site. The search terms were similar to “English blogger” or “English YouTuber”, and their equivalents in the French language, i.e. “Une Blogueuse Française”, “Un Blogueur Français”, “Une YouTubeuse Française” and “Un YouTubeur Français”. The downloaded videos were manually checked to ensure that the persons in the videos speak the expected language. Second, a Wikipedia page containing a list of English and French film and theatre actors was crawled, and the youtube full-text search engine was used to find interviews with these individuals. One interview per actor was then manually selected from the search results. Table 1 contains statistics describing the collected database. Two non-overlapping datasets, the CMP-VLID Test dataset and the Cross-validation dataset, were created from CMP-VLID.

Cross-Validation Dataset. Consists of 322 videos for each language. Each video was trimmed to the first 500 frames. The Cross-validation dataset was used for training and for introspection experiments.

CMP-VLID Test Dataset. Consists of 5 videos for each language and is used for the experiment with human participants. The selection was performed as follows: a video was randomly picked and then manually checked; single-speaker videos with neutral surroundings were selected. The videos were demultiplexed, the audio track was removed, and they were cut to a length of 60 s. The proposed method, trained on the CMP-VLID dataset, was evaluated on the CMP-VLID Test dataset to provide a comparison between humans and the algorithm.

5 Experiments

This section contains the description of the performed experiments. First, an experiment with human participants is presented, then a series of experiments demonstrating the accuracy and properties of the proposed method follows.

5.1 Experiment with Human Participants

This experiment was intended to uncover the extent to which a human can distinguish between English and French speakers when only visual information is available. The techniques people used to make their guesses were also investigated. An on-line form was created using the Test dataset described in Sect. 4. The participants’ task was to guess which language the person in a video (with no audio track) speaks. One hundred people participated. Every participant was also asked whether he or she was familiar with the French language. The experiment was not carried out under controlled conditions, but since participation was voluntary, we believe that the participants were interested in their own performance rather than in “getting it all right”.

Fig. 3.

VLID accuracy of 100 human participants on the CMP-VLID Test dataset videos with 5 English (EN \(\cdot \) columns) and 5 French (FR \(\cdot \) columns) speakers. The green dashed line in (a) is the average accuracy. The light and dark green dashed lines in (b) are the average accuracies of the people familiar and unfamiliar with French, respectively. (Color figure online)

Figure 3(a) shows the mean guess accuracy for each of the ten videos. Two videos had a mean guess accuracy lower than fifty percent, while the accuracy for the remaining videos fluctuates around the same average. The mean guess accuracy is 72.6% and is depicted by the green dashed line. The first hard-to-guess video contained a young, fast-speaking English man. The second hard-to-guess video included a middle-aged French man with a slow speaking pace. These results indicate that the participants had expectations about how fast a speaker of a given language should speak. Figure 3(b), where the familiarity with the French language is taken into account, supports this hypothesis: people familiar with French were more accurate in guessing the language of the speaker in both cases than people not familiar with French.

Figure 3(b) shows the average guess accuracy for each of the ten videos, displayed separately for both groups of participants, i.e. the participants familiar and not familiar with the French language. For videos EN2 and FR5, there is a ten and a twenty percentage point difference, respectively, between the two groups. There are also two other videos with a difference of ten percentage points or more: video FR1 contained a clearly articulating French teenage man and video FR2 a young French man with a fast speaking pace. The results show that the participants familiar with French were able to identify the language of the French speakers better than the participants not familiar with French.

We asked some of the participants to give us more detailed feedback on how they tried to guess the language of the speaker. The participants mentioned the suspected cues, i.e. the pace of the speech and the shapes the speakers’ lips took most frequently. But they also mentioned something else: their guesses were also based on what we call “an overall appearance”. We heard comments like: “This must be a French girl, look at the way she dresses...”, or “This is totally a guy from Algeria...”. In other words, things like the colour of the skin, the shape of the head or the way one dresses also supported the decision whether the subject speaks one language or another. In the light of this finding, it is interesting to inspect the video that was the easiest to guess, video FR3. It included a young French woman who articulated very clearly and quite often formed her lips into the shape of a circle. Looking at Fig. 3(b), we see that the guess accuracy in this case was the same in both groups of participants. We believe that the overall appearance was one of the factors that helped the participants unfamiliar with the French language to correctly guess the language of the woman in video FR3 (see Fig. 1).

The majority of participants correctly guessed seven videos; one participant guessed only one video correctly, and only six participants guessed all the videos in the set correctly.

Fig. 4.

(a) Accuracy of the method on the CMP-VLID Test dataset as a function of the video segment length. (A) the accuracy when the representation (4) is built using the whole input sequence, (B) the accuracy when an ensemble of classifiers is used. (b) Average accuracy as a function of different choices of \(\mathbf {s}_i^t\) and \(\lambda _2\) for fixed \(\sigma \) and \(\lambda _1\). The same colours as in Fig. 5 are used to distinguish between the different choices of \(\mathbf {s}_i^t\). Cross-validation results. (Color figure online)

5.2 Evaluation of the Proposed Method

Three hyperparameters in formulation (7) needed to be set: the weight of the \(\ell _1\) norm \(\lambda _1\), the weight of the \(\ell _2\) norm \(\lambda _2\) and the width of the soft-histogram bin \(\sigma \). An exhaustive grid search with \(\lambda _1, \lambda _2 \in \{10^{-10},10^{-9},\ldots ,10^6\}\) and \(\sigma \in \{10^{-3},10^{-2},\ldots ,10^2\}\) was performed using 10-fold cross-validation on the Cross-validation dataset defined in Sect. 4. The parameters \(\lambda _1 = 10^{-3}\), \(\lambda _2 = 10^{-6}\) and \(\sigma = 1\) were selected, resulting in \(73\%\) accuracy with a 6 percentage point standard deviation on the validation data. The parameter K in (5) of the k-means algorithm was set to 500; this choice empirically gave the best results when the training time was taken into account. Solving problem (7) takes approximately 3 s.
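
A simplified sketch of the cross-validated grid search, restricted to \(\lambda _1\) and \(\lambda _2\) for a fixed \(\sigma \) (changing \(\sigma \) would require rebuilding the histograms); train_classifier() refers to the hypothetical helper sketched in Sect. 2.

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

def grid_search(X, y, lambdas1, lambdas2, folds=10):
    """Return the (lambda_1, lambda_2) pair with the best mean CV accuracy.

    X : (N, K+1) histogram matrix of the Cross-validation dataset, y : labels.
    """
    best_params, best_acc = None, -np.inf
    for lam1, lam2 in itertools.product(lambdas1, lambdas2):
        accs = []
        for train, val in KFold(n_splits=folds, shuffle=True).split(X):
            w = train_classifier(X[train], y[train], lam1, lam2)
            accs.append(np.mean(np.sign(X[val] @ w) == y[val]))
        if np.mean(accs) > best_acc:
            best_params, best_acc = (lam1, lam2), np.mean(accs)
    return best_params, best_acc
```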

Accuracy Evaluation. The proposed method was evaluated on the CMP-VLID Test dataset described in Sect. 4. The sequences of landmarks detected in each video were divided sequentially and exhaustively to give test durations of 60, 45, 30, 20, 7, 3, 1 and \(\frac{1}{25}\) s (assuming a frame rate of 25 frames per second). Figure 4(a) presents the results of the evaluation. The blue plot depicts the accuracy of the representation (4) built from all available frames. The best accuracy is achieved at 20 s and drops for longer durations. Let us discuss this result. The described data partitioning leads to a situation in which the number of test sequences for shorter test durations greatly exceeds the number of sequences for longer test durations, so the results for longer durations may not be statistically significant. We therefore performed an additional experiment in which a sliding temporal window of length 500 frames was applied to each sequence of length \(L > 500\), yielding another \(L - 500 + 1\) sequences. The label of the whole sequence was obtained by a vote of \(L - 500 + 1\) classifiers, each having the same weight. The results in Fig. 4(a) suggest that the descending trend observed for durations longer than 20 s in the case of a single classifier may indeed be caused by the small size of the test set. The classifier ensemble, represented by the orange plot, yields the best accuracy of 90% at the longest duration.
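
A sketch of the ensemble voting scheme, reusing the hypothetical video_histogram() helper from Sect. 2.

```python
import numpy as np

def classify_sequence(landmarks, centroids, w, sigma=1.0, window=500):
    """Classify a long landmark sequence by an equal-weight vote of
    sliding-window classifiers; short sequences use a single classifier.
    """
    L = len(landmarks)
    if L <= window:
        return np.sign(w @ video_histogram(landmarks, centroids, sigma))
    votes = [np.sign(w @ video_histogram(landmarks[t:t + window], centroids, sigma))
             for t in range(L - window + 1)]
    return np.sign(np.sum(votes))   # majority vote over the L - 500 + 1 windows
```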

The results of both presented experiments lead to the conclusion that the proposed method is better than the average human at discriminating between French and English.

Fig. 5.

Different choices of \(\mathbf {s}_i^t\). The top row (from left to right): (a) full set of facial landmarks, (b) subset of (a) – inner and outer lips, (c) subset of (a) – left half of the lips, (d) subset of (a) – lip contours. The bottom row (from left to right): (e) features proposed in [10] computed for left and right eye, left and right eyebrow and lips landmarks, (f) subset of (e) – features computed from inner and outer lips, (g) subset of (e) – features computed from the left half of the lips, a full set of facial landmarks highlighted by green lines. (Color figure online)

Introspection. So far, \(\mathbf {s}_i^t\) in Eq. (4) was considered to be the full set of facial landmarks. In fact, seven different subsets of landmarks and features derived from them were tried as \(\mathbf {s}_i^t\) (see Fig. 5). These included: (a) the full set of concatenated x and y facial landmark positions, (b) a subset of (a) containing only the lips landmarks, (c) a subset of (a) containing only the left half of the lips landmarks, (d) a subset of (a) containing only the lip contour landmarks, i.e. the lips landmarks with the inner lips landmarks excluded. Saitoh and Konishi [10] compute the radius features \(r_0\), \(r_1\), ..., \(r_{L}\), i.e. the distances from the centre of gravity of the lips to the contour, where L is the number of facial lip landmarks. This method was used to compute: (e) a set of the radius features computed for the left eye, right eye, left eyebrow, right eyebrow and lips landmarks separately, (f) a subset of (e) containing only the lips features, (g) a subset of (e) containing only the left half of the lips features. An overview of how the different choices of \(\mathbf {s}_i^t\) affected the average accuracy is given in Fig. 4(b). Specifically, the graph shows accuracies obtained in a 10-fold cross-validation with the parameters \(\sigma \) and \(\lambda _1\) fixed and \(\lambda _2\) varied. The grid search discussed in Sect. 5.2 also included the search over the different choices of \(\mathbf {s}_i^t\). The best results were obtained for the concatenated x and y positions of the lips landmarks, i.e. for type (b).
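
A sketch of the radius features of [10] as we understand them; the function name and the input layout are our assumptions.

```python
import numpy as np

def radius_features(points):
    """Radius features: distances from the centre of gravity of a landmark
    group (e.g. the lips) to each of its landmarks.

    points : (P, 2) array of landmark (x, y) positions for one frame.
    """
    centre = points.mean(axis=0)                    # centre of gravity
    return np.linalg.norm(points - centre, axis=1)  # one radius per landmark
```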

Fig. 6.

(a) Dependency of average accuracy (orange plot) and the non-zero bins count (blue bars) on \(\lambda _1\) for fixed \(\sigma \) and \(\lambda _2\). The vertical bars represent the standard deviation. Cross-validation results. (b) Dependency of average accuracy on \(\lambda _2\) for fixed \(\sigma \) and \(\lambda _1\). The vertical bars represent the standard deviation. Cross-validation results. (Color figure online)

In Figs. 6(a), (b) and 7(a), the dependency of the mean accuracy on the parameters \(\lambda _1\), \(\lambda _2\) and \(\sigma \) is shown. Besides the average accuracy, Fig. 6(a) shows the number of non-zero bins for different \(\lambda _1\) settings. The highest average accuracy is obtained when \(\lambda _1 = 10^{-3}\). For this particular choice, the \(\ell _1\) norm reduces the number of non-zero bins to 99, effectively selecting the most informative centroids \(\mu _k\) defined in (5). The representation is indeed learnt simultaneously with the classifier. Setting \(\lambda _1\) to smaller values makes the \(\ell _1\) norm ineffective, i.e. the regularization is left purely to the \(\ell _2\) norm; setting it to high values leads to an overly sparse weight vector and a low classification accuracy on the validation set.

Figure 7(a) presents the dependency of the average accuracy on the \(\sigma \) parameter. As \(\sigma \) represents the width of the soft-histogram bin defined in (4), setting its value too high leads to bins that are too wide, spreading the representation of a single lip-shape across many bins. Setting it too low leads to bins that are too narrow, so that only a small number of bins is activated for a given lip-shape. Both situations negatively affect the generalization capability of the representation, so a proper \(\sigma \) must be selected. In our case the highest average accuracy occurs for \(\sigma = 1\).

Fig. 7.

(a) Dependency of average accuracy on \(\sigma \) for fixed \(\lambda _1\) and \(\lambda _2\). Vertical bars represent the standard deviation. Cross-validation results. (b) English (orange) and French (blue) centroids. The two shapes on the left are the centroids with the highest corresponding weight, the next two have the second highest corresponding weight. (Color figure online)

The two centroids with the highest corresponding weights for each language are visualised in Fig. 7(b). These centroids have the highest impact on the classification. A tendency of English and French speakers to deform their lips in a particular way, also visible in Fig. 2, is captured. Notice that the top two English and the top two French centroids both represent an opened and a closed mouth, only in reversed order. The most discriminative English lip-shape corresponds to a fully opened mouth slightly rotated to the side, while the French one resembles lips formed into the shape of a circle. This suggests that typical English and French lip-shapes were found.

6 Conclusion

A novel speaker- and content-independent VLID method using facial landmarks was presented. It was shown that the method performs better than the average human at discriminating between English and French in a 10-video study and achieves 73% average accuracy in a 10-fold cross-validation on realistic videos. Learning the classifier was formulated as a convex problem that simultaneously searches for the best representation and its best parametrization.

The CMP-VLID dataset, consisting of videos of native English and French speakers collected from youtube.com, was presented.