1 Introduction

Automated quality checking is currently based on detecting major video and audio artefacts. The Monitoring Of Audio Visual quality by key Indicators (MOAVI) subgroup of the Video Quality Experts Group (VQEG) is an open collaborative project for developing No-Reference (NR) models for monitoring audiovisual service quality. MOAVI is a complementary, industry-driven alternative to Quality of Experience (QoE), which is used as a subjective measure of the viewer's experience.

Existing NR QoE models, such as those reported in related research work [7, 26], follow the less useful Full-Reference (FR) models (e.g. [8]), which measure the quality of networked multimedia using objective parametric models. These models have some shortcomings in predicting the overall audiovisual QoE. MOAVI can be used to automatically measure audiovisual quality by using simple indicators of perceived degradation.

The goal of the project is to develop a set of key indicators (including blockiness, blur, freeze/jerkiness effects, block missing errors, slice video stripe errors, aspect ratio problems, field order problems, interlace, lip synchronization (also known as lip sync), muting (signal losses), and clipping [2]; the list is not complete although it does include the major artefacts) describing service quality in general, and to select subsets for each potential application. Therefore, the MOAVI project concentrates on models based on key indicators rather than on models predicting overall quality.

The MOAVI project focuses on indicators which are yet to be addressed by other VQEG projects. The audio quality of a low bit-rate signal may be poor due to artefacts such as compression artefacts in signal coding/transmission/encoding, limited sampling rate, limited dynamics, etc.; however, these aspects have already been studied and evaluated in numerous previous VQEG works. The artefacts which are yet to be addressed are muting, clipping and lip sync. While the clipping and muting detection algorithms are rather rudimentary, the main contribution of this paper is measuring the lip sync artefact.

The classic quality metric approach cannot provide pertinent predictive scores with a quantitative description of specific (new) audiovisual artefacts, such as stripe errors or exposure distortions. MOAVI is an interesting approach because it can detect artefacts present in videos, as well as predict quality as it is described by consumers. In realistic situations, when video quality decreases in audiovisual services, customers can call a helpline to describe the problem and the visibility of the defects or degradations in order to report the outage. In general, they are not required to provide a Mean Opinion Score (MOS). As such, the concept used in MOAVI is fully aligned with user experience. There are many reasons for video disturbance, and they can arise at any point along the video transmission chain (from the filming stage to the end-user stage) [13].

The video signal requires some signal processing to be performed on it. Quality checking can be conducted before, during, and/or after the encoding process. However, in MOAVI, no MOS is provided. Instead, a binary indicator showing the presence or absence of each artefact is provided.

Figure 1 shows the concept of MOAVI. The audio or video stream (video only for video artefacts, audio only for audio artefacts, and both together for audiovisual artefacts) is the input to the system. The metric of each artefact is used to determine the level of impairment of the media to be analysed. These results are converted into binary indicators using a threshold which determines whether the artefact is noticeable in the video. This way MOAVI obtains a key indicator for each artefact.

Fig. 1 Concept of monitoring of audiovisual quality
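As a minimal illustration of this pipeline, the sketch below (in Python, with purely illustrative names and an arbitrary example threshold) shows how a numeric impairment level produced by an artefact metric is converted into a binary key indicator.

```python
# Minimal sketch of the MOAVI key-indicator concept: a per-artefact metric
# yields a numeric impairment level, and a threshold converts it into a
# binary indicator. The function name and the example threshold are
# illustrative, not taken from the paper.

def key_indicator(impairment_level: float, threshold: float) -> bool:
    """Return True if the artefact is considered noticeable."""
    return impairment_level > threshold

# Example: a hypothetical blockiness score of 0.42 against a 0.3 threshold
print(key_indicator(0.42, 0.3))  # True -> artefact flagged as present
```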

This paper is organized as follows. Section 2 describes the measurement of the key audio indicators – the presence of muting and clipping. Section 3 describes the measurement of the key audiovisual indicator – the presence of the lip sync artefact – including the video database for the assessment of the metrics, the algorithms and the results obtained. Section 4 concludes the paper and summarizes the results.

2 Measuring mute and clipping artefacts

In recent years, interest has been growing in real-time audio services over packet networks. For quality evaluation, it is essential to quantify user perception of the audio sequence. Signal loss is one of the most common degradations in audio streaming at low bit rates. The end-user perceives a silence followed by an abrupt clipping. Cell loss in packet networks or a restitution strategy could be the origin of this perceived temporal audio discontinuity.

It is important to detect and prevent or correct the clipping problem caused by digital capture, conversion and downscaling processing. The audio signal is always stored digitally in order to improve the quality of audio. In certain situations, the original audio signal may be clipped during the recording due to the impact of environmental noise or recording equipment. This means clipping can originate at the capture stage. The maximum amplitude of the clipped signal is frequently limited to a constant. This clipping distortion leads to a harsh noise. It significantly affects the subjective listening quality if the clipping intensity is strong or the clipping density is high.

Muting and clipping are the most frequent impairments present in audio streaming and audio files in general. Therefore, a key indicator for each of these artefacts needs to be added to the most suitable subset of metrics when audio is present in the file/stream being evaluated. These indicators are based on metrics developed for this project and a threshold optimized with preliminary tests carried out during the implementation and improvement phases of the development.

2.1 Mute

The advent of protocols for quasi real-time communications and rapidly increasing computing power are driving an increasing interest in real-time audio services over packet networks. Audio streaming is used in real-time applications since the data needs to be transmitted as soon as it is generated in order to deliver continuous media play out. These applications can only tolerate a short delay in signal restitution. However, packets of data are transmitted over unreliable, lossy networks.

Packet loss produces significant temporal impairments in the received audio. When considering quality, it is essential to quantify user perception of the played-out audio sequence. Muting caused by signal losses is one of the most common degradations in audio streaming at low bit rates. The end-user perceives a silence followed by an abrupt clipping. Cell loss in packet networks or a restitution strategy could be the origin of this perceived temporal audio discontinuity. Packet loss or jitter could cause a sporadic or non-uniform signal loss at the decoding level because of the play-out buffer time limit.

The muting artefact presents as an absence of any kind of sound during a period of time detectable by the human ear. A typical waveform of a muted sound file is shown in Fig. 2. The artefact is usually generated during the transmission stage, where the majority of losses occur. This is why this detector should be applied near the far-end to check the correct transmission of the audio file.

Fig. 2 Example of sound waveform with mute artefact

Some approaches to muting detection have already been proposed, usually in the context of automatic audio classification and segmentation. A notable example is the paper by Lu and Hankinson [14], which introduced the concept of the silence ratio, a variation of the zero-crossing rate.

2.1.1 Algorithm

The algorithm for the detection of the muting artefact involves establishing a certain threshold or set of thresholds to determine whether the analysed audio samples are suffering from sporadic audio signal loss. Related research [16] describes how different lengths, contents and local activity levels affect the perceived quality. It should be noted that the goal of the MOAVI project is to develop a set of metrics that will work without analysing the content.

Two thresholds are needed to determine whether the muting artefact is present in an audio stream: one for the duration of the silence and the other for the amplitude of local activity, which describes the greatest amplitude of the audio wave for it to be considered silenced.

As the metrics for the MOAVI project are NR, we cannot compare the file with the original. An NR audio metric explores the audio file at the sample level in order to detect and measure the distortions which may have been generated.

When the characteristics of the artefact are known, the detection algorithm is simple. Figure 3 shows a schematic view of the process which determines whether the mute artefact is present in an audio file. Each sample is compared to the amplitude threshold. If its value is lower, we check whether the number of successive low amplitude samples is sufficient to become noticeable. If the silence is sufficiently long, the key indicator for the muting artefact is positive, indicating the presence of the artefact in the analysed sound.

Fig. 3 Algorithm for the detection of mute artefact

The paper [16] provides experiments for setting the duration threshold. It has been shown that, for most types of content, a signal loss of 10 ms is detectable (with the exception of news or speech-based content). An unequivocal detection, with a probability close to 1, is attained for a discontinuity of 30 ms. This result is also valid for the preliminary tests carried out in this research. Thus, the duration threshold is 30 milliseconds or its equivalent in samples.

Regarding the amplitude, the threshold for the minimum amplitude in the digital signal detectable by a listener depends on the player configuration characteristics, such as volume settings or the distance between the listener and the speaker. However, if muting is considered to be an artefact which occurs when the signal is completely lost, the amplitude threshold has to be the smallest non-zero absolute amplitude that the coding format can represent. Therefore, the assumption made here is that muting is only present when the audio signal is a sequence of zeros or the audio signal is completely absent.

A sound file can carry information for two channels. In fact, since the majority of streaming, broadcasts and music are produced, transmitted and displayed using stereo digital equipment, it is common for the mute detection algorithm to analyse and synchronize both channels. The solution is simple: since a listener perceives a file as muted only when both channels are silent, the logical operation introduced between the two channels is the 'AND' operation. This means that the key indicator is active only if both channels are detected as mute.
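A minimal sketch of this mute detector is given below in Python/NumPy. It assumes interleaved PCM samples in an array of shape (n_samples, n_channels); the 30 ms duration threshold, the zero-amplitude criterion and the AND across channels follow the description above, while the function name and interface are our own.

```python
import numpy as np

def detect_mute(samples: np.ndarray, sample_rate: int,
                min_silence_ms: float = 30.0) -> bool:
    """Mute key indicator: True if a silence of at least 30 ms is found."""
    min_run = int(sample_rate * min_silence_ms / 1000.0)
    # A sample position counts as silent only if every channel is exactly zero
    # (the signal is assumed to be completely lost, as discussed above).
    silent = np.all(samples == 0, axis=1)
    run = 0
    for is_silent in silent:
        run = run + 1 if is_silent else 0
        if run >= min_run:
            return True          # key indicator positive: mute present
    return False                 # key indicator negative
```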

As the metric takes every sample into account, it indicates the start and end of the muted subsequence with high accuracy, which can be helpful in identifying the data packet which has been lost. This data packet could even be requested again from the production/distribution centre, which solves the mute artefact problem in this scenario.

2.1.2 Results

Regarding the results obtained, the detection of the mute artefact in a simulated sequence impaired by a signal loss is shown in Figs. 4 and 5. In the first figure, the silence was artificially introduced approximately between samples 5·10^5 and 7·10^5.

Fig. 4 Example of detection of the mute artefact in an audio sequence 1

Fig. 5 Example of detection of the mute artefact in an audio sequence 2

In the second figure, the silence was introduced approximately between samples 3·10^5 and 3.2·10^5. In both example sequences it can be observed that the algorithm works accurately and detects the artefact at the time positions where it was introduced. Additionally, the metric discriminates the silent moments during speech (pauses when only background noise is heard) from artificial silence, i.e. the loss of the audio signal, which is the mute impairment that the metric was developed to detect.

Experiments were conducted to evaluate the accuracy of this detector. The set of ten audio files used as an input for the experiments was similar to that shown in Fig. 4 in the sense that an artificial mute artefact was introduced to them. In this regard, the mute artefact was present in the input audio files as silent samples of different lengths.

An accuracy rate of 95 % was found for this metric under these conditions. Most of the samples that were erroneously marked as "non-muted" (false negatives) were the first muted samples which the detector encountered in each muted section.

One of the limitations of this algorithm is the potential for false negatives when a signal bias (i.e. DC offset) is introduced in the audio wave. Under these circumstances, a muted signal does not consist of small sample values and thus would not be detected.

Although psychoacoustic experiments are not the object of this research, we use the available publications to determine the optimal thresholds for the minimum duration of the silence and the minimum noticeable amplitude of the waveform [16].

2.2 Clipping

As noted in [4, 29] on a restoration method of clipped audio signals based on MDCT, the audio signal is always stored digitally in order to improve audio quality. In certain situations, the original audio signal may be clipped during the recording due to environmental noise or recording equipment. The maximum amplitude of the clipped signal is often limited to a constant. This clipping distortion leads to a harsh noise. It significantly affects subjective listening quality if the clipping intensity or density is high.

Clipping can be divided into two classes: digital clipping and analogue clipping. For digital clipping, when the signal amplitude exceeds the upper limit of the recording equipment during the transcription, the signal amplitude will be a constant in the peak region. In analogue recording systems, the signal can be clipped by impedance mismatch or the overflow of the input electrical level. Analogue clipping shows a small deviation in amplitude, and the sample values in the clipped region are not exactly equal to each other. In both digital and analogue clipping, the front-end of the clipped signal is always in the peak regions.

While analogue and digital input clipping can both occur in the observed streams, they need to be distinguished. Although input analogue signals can be over-amplified, artificial amplification is in fact not common in real equipment. On the other hand, digital over-amplification is introduced when certain parts of the digital processing chain are not connected correctly, i.e. the digital signal is equalized without compression/limitation by the digital compressor/limiter algorithm.

A typical waveform of a clipped signal tends to be similar to the one shown in Fig. 6. The waveform in the clipped areas is a constant or semi-constant value, which is usually the highest value that the amplitude of the audio signal can have.

Fig. 6 Example of waveform of an audio signal suffering clipping

There is also another type of clipping in which the artefact is produced during the stage before the audio signal level is reduced or converted. In this case, the constant or semi constant amplitude can be any value. In this type of clipping, none of the signal samples are higher than the constant. Thus, the waveform appears to be cut off at the mid value.

Although clipping detection has been investigated for quite a long time, most of the proposed solutions (such as that of Person and Muccioli [17]) were related to analogue signals. Recently, however, solutions for digital signals (such as that of Skoglund and Linden [19]) have started to emerge as well.

The following section explains the algorithm we used to detect both types of clipping.

2.2.1 Algorithm

The algorithm for the detection of the clipping artefact involves setting a certain threshold or set of thresholds to determine whether each of the analysed audio samples is limited to a constant amplitude. This method has been used to study how different lengths and contents affect the perceived quality. As the goal of the MOAVI project is to develop a set of metrics that work without analysing the content, this is not taken into account in the clipping metric.

This means that two thresholds are needed to determine whether the clipping artefact is present in an audio stream: one for the number of consecutive samples restricted to a constant value, and one for the maximum variation of the amplitude between two consecutive samples for them to be considered constant; the latter represents the amplitude gap between two consecutive audio samples which are candidates to be clipped.

As the metrics for the MOAVI project are NR, the clipping metric cannot compare the analysed signal with the original. An NR audio metric explores the audio file at the sample level in order to detect and measure the distortions which may have been generated.

Figure 7 shows a schematic view of the process used to determine whether the clipping artefact is present in an audio file. Each sample is compared to the previous sample to determine whether the gap between their amplitudes is greater than the differential threshold. If the gap is lower, and thus two or more samples have a very similar amplitude, we check whether the number of consecutive near-constant samples is sufficient to be noticeable by a human listener as clipping (harsh noise).

Fig. 7 Block diagram describing the algorithm for the detection of the clipping artefact

If the length of the constant or semi-constant values is sufficient, the sample becomes a candidate to be clipped. Every 125 milliseconds, the number of candidate samples is compared with the total number of samples analysed in those 125 milliseconds. Therefore, the key indicator for the clipping artefact is positive if this ratio is higher than 30 percent. If the key indicator is positive, it indicates the presence of the clipping artefact in the analysed sound.

The percentage of candidate samples to be clipped (30 percent) and the length of the audio sub-sequence (125 milliseconds) over which the clipped/not clipped decision is made are based on preliminary tests, which show that the best behaviour occurs when applying the pertinent threshold to this length of sequence.
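A sketch of this clipping detector in Python/NumPy is shown below. The 125 ms window and the 30 % candidate ratio follow the description above; the differential threshold and the minimum length of a constant run are not specified numerically in the paper, so the values used here are assumptions for illustration, and the signal is assumed to be mono with amplitudes in [-1, 1].

```python
import numpy as np

def detect_clipping(x, sample_rate,
                    diff_thresh=1e-4,   # max amplitude gap between "constant" samples (assumed value)
                    min_run_ms=1.0,     # minimum length of a constant run (assumed value)
                    window_ms=125.0,    # decision window length from the text
                    ratio_thresh=0.30): # 30 % candidate ratio from the text
    """Return True if the clipping key indicator is positive for mono signal x."""
    min_run = max(2, int(sample_rate * min_run_ms / 1000.0))
    window = int(sample_rate * window_ms / 1000.0)

    # A sample is "flat" if its amplitude barely differs from the previous one.
    flat = np.abs(np.diff(x, prepend=x[0])) <= diff_thresh

    # Keep only samples belonging to sufficiently long flat runs (clipping candidates).
    candidate = np.zeros(len(x), dtype=bool)
    i = 0
    while i < len(x):
        if flat[i]:
            j = i
            while j < len(x) and flat[j]:
                j += 1
            if j - i >= min_run:
                candidate[i:j] = True
            i = j
        else:
            i += 1

    # Windowed decision: clipped if more than 30 % of the samples in any
    # 125 ms block are candidates.
    for start in range(0, len(x), window):
        block = candidate[start:start + window]
        if block.size and block.mean() > ratio_thresh:
            return True
    return False
```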

2.2.2 Results

Regarding the results, the algorithm detecting this artefact was evaluated on a sequence impaired by artificially generated clipping. This process involves two steps. In the first step, the audio signal is amplified until some of its samples reach the maximum amplitude (over-amplification). In the second step, the amplitudes are cut at the maximum value which can be reached by a sound file with a given bit depth. This generates a waveform similar to that of an audio signal affected by the impairment naturally, during the capture or processing stage (see Fig. 6).
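A possible sketch of this two-step simulation, assuming a float signal with amplitudes in [-1, 1] and the 24 dB gain used in Fig. 8, is shown below.

```python
import numpy as np

def simulate_clipping(x: np.ndarray, gain_db: float = 24.0) -> np.ndarray:
    """Over-amplify the signal and hard-limit it to full scale."""
    gain = 10.0 ** (gain_db / 20.0)       # step 1: over-amplification
    return np.clip(gain * x, -1.0, 1.0)   # step 2: cut at the full-scale value
```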

Two examples of clipping being detected are presented in Figs. 8 and 9. In both figures, clipping was artificially introduced over the entire file, since in most cases the clipping artefact affects the entire file.

Fig. 8 Example of detection of the clipping artefact in an audio sequence

Fig. 9 Example of detection of the clipping artefact in an audio sequence

In Fig. 8 the amplification is 24 dB. This makes the clipping more noticeable and the signal cuts are greater. This produces a harsh noise when the sound is played, which becomes more noticeable as the cuts become greater. In Fig. 9, the amplification is 15 dB. This means that the number of sub-sequences detected as clipped is lower; however, the indicator remains positive since the artefact is detected.

This shows that the detection occurs in the instants when the waveform is cut or limited by a constant, which corresponds to the instants when the sound is impaired when the file is played. Thus, the MOAVI indicator for clipping increases when clipping appears in the entire file, although the metric is able to determine accurately which samples are clipped in case this information is needed.

We conducted experiments to evaluate the accuracy of this detector using a set of ten audio files. The files were similar to the file shown in Fig. 8 in that they included an artificial clipping artefact. The clipping artefact was present in the input audio files as a set of samples with the maximum possible amplitude. Different values and lengths were used for this evaluation.

An accuracy rate of 90 % was found for this metric under these conditions. Most of the samples erroneously marked as “non-clipped” (false negatives) were the first clipped samples found by the detector.

Although psychoacoustic experiments are not the object of this research, we use the available publications [29] to determine the optimal thresholds for the clipping detector, i.e. the minimum duration of the constant-amplitude run and the maximum amplitude gap between consecutive samples.

2.3 Limitations and further research

There are three main limitations, which motivate further research:

  • The results could be enhanced by applying adaptive thresholds depending on the content.

  • Being an NR metric, it is impossible to discriminate between a silence introduced by the loss of an audio packet and a normal silence, which would not be an artefact. Therefore, the false alarm ratio can be high and content-dependent.

  • Being an NR metric, it is impossible to discriminate clipping introduced while the file undergoes capture, processing, transmission and display from deliberately introduced clipping, which would not be an artefact. However, deliberately introduced clipping is less frequent than deliberately introduced silence, and this limitation is not significant.

3 Measuring the lip sync artefact

This section examines the process of detecting audiovisual artefacts. We describe the algorithm, implementation and results of three different metrics developed to indicate the presence or absence of the lip sync artefact, which is the most common problem affecting audiovisual signals.

Lip sync is a key parameter in interactive communication. In video conferencing, streaming and television broadcasting, the relative delay between audio and video should remain below certain thresholds recommended by several standardization bodies. However, research shows that the thresholds can be relaxed, depending on the targeted application and use case [21].

In multimedia systems, synchronization is needed to ensure a temporal ordering of events. A single data stream consists of consecutive Logical Data Units (LDUs). For audio streams, LDUs are individual samples or blocks of samples transferred together from a source to one or more sinks. Similarly, for video, one LDU typically corresponds to a single video frame, and consecutive LDUs to a series of frames. They have to be presented at the sink with the same temporal relationship as they were captured; this is intra-stream synchronization. The temporal ordering must also be applied to related data streams, where one of the more common relationships is the simultaneous playback of audio and video with lip sync. Both media must be in sync, otherwise the result will not be satisfactory.

In general, inter-stream synchronization involves relationships between many types of media including pointers, graphics, images, animation, text, audio and video. In the following discussion, synchronization always refers to inter-stream synchronization between video and audio.

Until recently, the lip sync artefact was impossible to detect automatically with state-of-the-art solutions. This is due to the difficulty in devising the correct algorithm (technique) to detect this artefact and the high cost of the equipment required for processing video and audio. Additionally, an analysis of the literature and patents covering the lip sync detection problem shows that several solutions address this problem [3, 9, 11, 22, 24, 28]; however, none of them are innovative, scalable solutions with potential commercial applications, unlike the results of the research presented in this paper. The majority of existing solutions (including that patented by LG Electronics [9, 11]) attempt to circumvent the difficulties in detecting this artefact by introducing external timestamps into the audio and video signals. Another approach is a solution known as QuMax2000 (patented by the KWILL Corporation) [24]; this requires no external marks, but it requires simultaneous access to audiovisual streams with and without the lip sync artefact, which makes the solution unsuitable for non-laboratory conditions. Similarly, LipTracker (patented by the Pixel Instruments Corporation) [3] is not a suitable solution. While the general concept of detecting the lip sync artefact carries certain similarities with the solution proposed in this paper, an analysis of the patent indicates the existence of significant algorithmic differences. In addition, it should be noted that LipTracker, originally developed in 2005, is simply a closed, rack-mounted 19” laboratory solution for analysing analogue signals and detecting the lip sync artefact in limited cases, such as television news programmes or talk shows [22].

Recently, some related approaches to developing methods for bi-modal (audio-video) speech detection have been proposed, for example in the paper by Czyżewski et al. [5]. These methods can potentially be combined with the method proposed in this paper in order to achieve higher accuracy.

Some more facts about the lip sync problem:

  • The most common origin for the lip sync artefact is jitter produced in the transmission stage.

  • Different languages make no significant difference in synchronizing media.

  • Different languages make no significant difference in the detection of the lip sync artefact, both for human perception and for automatic detection.

  • In [23] it is also stated that professional video editors and TV-related technical personnel show a lower level of skew tolerance. When they detect an error, they are able to correctly state whether audio is ahead of or behind video.

  • Watermarks or fingerprints embedded in an audio signal are used in broadcasts to avoid this problem. However, this method is not suitable for on-line multimedia streaming.

Regarding detection thresholds, [21] describes the large number of thresholds determined by various authors. Some authors and research groups have concluded that audio may be played up to 305 ms ahead of video and, conversely, video can be displayed up to 190 ms ahead of the audio. Both temporal skews are noticed, but they can be accepted by the user without any significant loss of effect. However, some authors report a tolerance of only 4-16 ms.

Figure 10 is a graphical representation of different audio/video delay and lip sync thresholds of detectability as identified by several standards bodies and independent studies. The thresholds used for the lip sync artefact in MOAVI are set to 100 ms when the audio is delayed with respect to the video and 140 ms when the video is delayed with respect to the audio. These thresholds are based on the research by Steinmetz on human perception of jitter and media synchronization [23].

Fig. 10 Different audio/video delay and lip sync thresholds of detectability

3.1 Video database for the assessment of metrics

The development of experiments analysing the behaviour and measuring the accuracy of the different metrics in this section requires a small database of videos and key information about them. It is a set of 15 video sequences between 13 and 37 seconds long, originating from various types of media. In all the videos the speaker faces the camera, although some include several frames with a profile view. Usually only the face and the shoulders are visible. Only one person is seen and heard in each video.

Some of the videos originate from TV news shows or interviews; a few are videos uploaded directly to the internet.

The most important characteristics of each video are shown in Table 1. The audio files extracted from the videos have been stored and analysed, so they can be used for tests of Voice Activity Detection (VAD).

Table 1 Characteristics of the video database for the assessment of metrics

The MOAVI indicator for lip sync is based on the lip sync metric explained below. The audio part of the metric is described first, together with the signal processing used to implement a VAD algorithm. The video part of the metric is described in the second subsection, which explains the combination of techniques used to detect lip movement. In the third and final subsection, the algorithm comparing the audio and visual information is described; this is the method developed to detect the delay between the video and audio media. Each subsection includes its own results.

3.2 Voice activity detector

VAD, also known as speech activity detection or speech detection, is a technique used in speech processing in which the presence or absence of human speech is detected [18]. The main applications of VAD are in speech coding, speech recognition and speech searching [25].

Developing an indicator analysing whether audio and video are synchronized is a challenging goal. The process is simplified if the task is divided into smaller parts, therefore the first algorithm to develop is a voice activity detector.

3.2.1 Algorithm

In lip sync detection, it is necessary to process a signal containing utterances of speech, silence and background noise. The detection of speech embedded in various types of non-speech events and background noise is known as endpoint detection, voice detection, or VAD.

The VAD algorithm includes two steps. The algorithm for the detection of voice is represented in Fig. 11. The two detectors are used together to obtain better results.

Fig. 11 Algorithm for the detection of speech instants

The first step is signal processing leading to the detection of the endpoints of voice in the audio. An algorithm based on [20] was developed in MATLAB.

The second step is the analysis of the Minimum Energy Density (MED) feature which is a key distinction between music and similar waveforms and speech waveforms. The algorithm is described in [10]; the MATLAB code was completed based on this algorithm.

In [20], a VAD for variable rate speech coding is decomposed into two parts - a decision rule and a background noise statistic estimator - which are analysed separately by applying a statistical model. A robust decision rule is derived from the generalized likelihood ratio test by assuming that the noise statistics are known a priori. To estimate the time-varying noise statistics, allowing for the occasional presence of a speech signal, a noise spectrum adaptation algorithm using soft decision information of the proposed decision rule was developed. The algorithm is robust, especially for time-varying noise.

In [10], MED is used to discriminate between speech and music audio signals. This method is based on the analysis of local energy for local sub-sequences of audio signals. The sub-sequences in this method are those in which voice activity has been detected by the first detector. An elementary analysis of the probability density of the power distribution in these sub-sequences is an effective tool supporting the decision-making. Distinguishing between speech and music is intuitive, based on the shape of the signal's energy envelope. As Fig. 12 shows, speech signals have distinctive high and low amplitude parts, which represent voiced and unvoiced speech, respectively. In turn, the music signal envelope is steadier. Moreover, it is known that speech has a distinctive 4 Hz energy modulation, which matches the syllabic rate.

Fig. 12 Comparison between a music waveform (up) and a speech waveform (down)

Considering these characteristics, a decision is made to discriminate between speech and music sub-sequences using the probability density function of short timeframe energy inside a time window known as the normalization window. The window has to be long enough to capture the nature of the signal. This value is set to 200 ms, provided the sub-sequence of speech produced by the first discriminator is longer than this value.
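The sketch below is a deliberately simplified, energy-based stand-in for the two-step VAD: it is not the statistical decision rule of [20] nor the exact MED feature of [10], and the 5 % energy ratio is an assumption. It only illustrates the 50 ms framing and the binary voiced/unvoiced output vector that is later compared with the video information.

```python
import numpy as np

def simple_vad(x: np.ndarray, sample_rate: int,
               frame_ms: float = 50.0, energy_ratio: float = 0.05) -> np.ndarray:
    """Return a binary vector with one value per 50 ms frame (1 = voiced)."""
    frame_len = int(sample_rate * frame_ms / 1000.0)
    n = len(x) // frame_len
    frames = x[:n * frame_len].reshape(n, frame_len)
    energy = np.mean(frames.astype(float) ** 2, axis=1)    # short-time energy
    # Threshold relative to a robust estimate of the speech energy level;
    # the real metric instead analyses the energy distribution inside a
    # 200 ms normalization window, as described above.
    threshold = energy_ratio * np.percentile(energy, 95)
    return (energy > threshold).astype(int)
```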

As explained above, these two algorithms work together to make the resulting combination more robust and to improve the accuracy of the metric in order to provide better information to be compared with information coming from video; this provides a lip sync artefact indicator.

3.2.2 Results

Regarding the results of the VAD developed for MOAVI, the output of the metric resembles the one presented in Fig. 13. The metric provides an accurate classification of samples. Every subsequence of 50 ms is classified into two different values: voiced (1) or unvoiced (0). Thus, a binary vector is constructed to be compared with information originating from the video concerning endpoints of speech. The final goal is calculating the delay between the signals. The binary vector originating from the VAD metric described above is stored.

Fig. 13 Example of voice detection

These results were compared with the ground truth prepared by listening to the 15 audio files and developing a small database for each sound, in which every instant is classified as voiced or unvoiced with a precision of 50 ms. The audio files were selected based on two characteristics: they mainly comprised human voice, and they featured different environments/sources, such as old radio, recent interviews or noisy conferences.

Table 2 shows the Hamming distance, precision, accuracy and the F1 metric for each of the video files stored.

Table 2 Accuracy results of the VAD algorithm in each video from the database

Table 3 shows the same parameters describing the performance of the metric as Table 2, although this time the data shows the results for all the videos together. In this regard, the total Hamming distance column shows the sum of all the Hamming distances calculated for each audio file, and the precision, accuracy and F1 metric are the mean of the corresponding statistical indicator for each audio file.

Table 3 Accuracy results of the VAD algorithm in the whole video database

It should be noted that the VAD algorithm has an accuracy of 92.17 % and an F1 metric of 95.47 % regarding the measurements made based on the database.
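For reference, the quantities reported in Tables 2 and 3 can be computed from the 50 ms ground-truth vector and the VAD output as sketched below; the definitions are the standard ones, and the paper's exact aggregation across files may differ.

```python
import numpy as np

def vad_scores(truth: np.ndarray, pred: np.ndarray) -> dict:
    """Hamming distance, precision, accuracy and F1 for two binary vectors."""
    tp = int(np.sum((truth == 1) & (pred == 1)))
    tn = int(np.sum((truth == 0) & (pred == 0)))
    fp = int(np.sum((truth == 0) & (pred == 1)))
    fn = int(np.sum((truth == 1) & (pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"hamming_distance": fp + fn,          # number of mismatching frames
            "precision": precision,
            "accuracy": (tp + tn) / len(truth),
            "f1": f1}
```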

3.3 Lip activity detector

This section describes the lip sync sub-metric based on video analysis. The combination of techniques detecting frames with lip motion is explained.

3.3.1 Algorithm

In this paper the video metrics are developed in OpenCV, a cross-platform library of programming functions mainly aimed at real-time computer vision.

OpenCV is fast and easy to use; it provides fast execution of high-level metrics thanks to optimization for multi-core systems, and it advances research by providing open and optimized code for basic vision infrastructure.

The algorithm tracking and detecting lip activity in this environment is explained in Fig. 14. The algorithm classifies each frame into one of two groups, i.e. frames in which the lips are moving and frames in which they are not. The block diagram represents the following algorithm:

  • The next frame of the video being analysed is read. If it is the first frame, two frames have to be read.

  • In this frame, a Haar cascade is used for the detection of the mouth region based on an OpenCV implementation of the Viola and Jones algorithm for face detection. The Viola and Jones object detection framework was the first such framework to provide competitive detection rates in real time. It was proposed in 2001 by Viola and Jones [27]. Although it can be trained to detect a variety of object classes [1, 12], for example the mouth region as in this algorithm, its development was motivated by the problem of face detection. The mouth region will be our Region Of Interest (ROI).

  • In the ROI of the frame, we measure the motion that appears between the previous and current frame. The algorithm for estimating the amount of motion is explained in detail in the next figure.

  • The calculated motion is compared with a motion threshold to determine whether the output of the metric is lip-active. This threshold was optimized for the final output of the metric, which is the audiovisual delay.

  • The first of the two frames is released and the last frame read is used to compare with the next one, until we reach the end of the video file.

Fig. 14 Algorithm for the detection of lip movement

Figure 14 describes the algorithm in general. The key block for the detection of lip movement is known as motion measure. Figure 15 explains in more detail the process carried out to determine the amount of movement between two frames in the mouth ROI. The algorithm is described here:

  • The inputs of the block are two consecutive frames in which the mouth region has been located.

  • The optical flow between them is calculated. The implementation is based on the algorithm described in research carried out by Farneback [6]. Optical flow estimates the quantity and direction of the motion in every corresponding point of the two consecutive frames the algorithm receives.

  • Once the direction and intensity of the motion are estimated, the next step is to discriminate between the movement of the entire face and the independent movement of the lip region. This is achieved by calculating the edges of the optical flow output, which involves computing the Laplacian of the motion field and analysing the borders. If a border lies in the mouth ROI, we consider it an indicator of independent lip movement.

  • The final step is to count how many edges of the optical flow field were found in the mouth region. The number of these edges is strongly correlated with the amount of lip motion in the frame.

Fig. 15 Detailed block diagram for motion measure
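The sketch below illustrates the lip activity measure of Figs. 14 and 15 using OpenCV's Python bindings instead of the original C++ implementation. The cascade file name, both thresholds and the video file name are assumptions; in particular, the mouth cascade (haarcascade_mcs_mouth.xml) comes from opencv_contrib and is not bundled with every OpenCV distribution.

```python
import cv2
import numpy as np

# Haar cascade for the mouth ROI (assumed file name, see the note above).
mouth_cascade = cv2.CascadeClassifier("haarcascade_mcs_mouth.xml")

def lip_motion(prev_frame: np.ndarray, frame: np.ndarray,
               edge_thresh: float = 10.0) -> float:
    """Count motion-field edges inside the mouth ROI of two consecutive frames."""
    gray_prev = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray_curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Viola-Jones detection of the mouth region of interest (ROI).
    mouths = mouth_cascade.detectMultiScale(gray_curr, 1.3, 5)
    if len(mouths) == 0:
        return 0.0
    x, y, w, h = mouths[0]

    # Farneback dense optical flow inside the ROI.
    flow = cv2.calcOpticalFlowFarneback(
        gray_prev[y:y + h, x:x + w], gray_curr[y:y + h, x:x + w],
        None, 0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2).astype(np.float32)

    # Edges (Laplacian) of the motion field: independent lip movement shows
    # up as motion borders inside the mouth ROI.
    edges = cv2.Laplacian(magnitude, cv2.CV_32F)
    return float(np.count_nonzero(np.abs(edges) > edge_thresh))

def lip_active(prev_frame, frame, motion_thresh: float = 50.0) -> int:
    """1 if the frame is classified as showing lip movement, else 0."""
    return int(lip_motion(prev_frame, frame) > motion_thresh)

# Usage: build the binary LAD vector for a whole video (hypothetical file name).
cap = cv2.VideoCapture("stossel.avi")
ok, prev = cap.read()
lad = []
while ok:
    ok, curr = cap.read()
    if not ok:
        break
    lad.append(lip_active(prev, curr))
    prev = curr
cap.release()
```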

The information from the OpenCV metric is loaded into MATLAB to be processed and compared with the information coming from the audio part. This means that only the video part of the lip sync algorithm is implemented in OpenCV. Future plans include the full implementation of the metrics included in this study in C++ and OpenCV.

3.3.2 Results

The output of the algorithm for Lip Activity Detection (LAD) is a binary vector showing the instants in which the video information analysis provides evidence of lip movement. This binary vector is compared with the binary vector obtained with the VAD algorithm. The comparison is carried out using the delay calculation algorithm, which is explained in the next section.

A video metric has the advantage that its behaviour can be shown in an image, which is not possible for audio metrics. Figure 16 shows the graphical output for a frame of the LAD metric for MOAVI. The frame originates from one of the audiovisual sequences, named STOSSEL, which is included in the MOAVI database. All elements presented by OpenCV can be seen in this capture. The green rectangle shows the position of the mouth and defines the ROI of the frame. The optical flow is calculated and the edges of its output are drawn in the black and white square on the right. The graphical representation of the output of the metric is shown in the middle of the figure.

Fig. 16 Graphical output of the LAD algorithm

This results subsection of the LAD shows graphs of the outputs of the metrics described above. A typical output of the motion measure block is represented in the upper graph of Fig. 17. The binary vector determined from this information is shown in the lower graph. This binary vector, based on the threshold on the amount of motion, indicates which of the frames are considered active in terms of lip movement.

Fig. 17 Example of detection of lip activity

3.4 Delay calculation

The goal of the previous algorithms, VAD and LAD, was to provide a binary vector originating from the audio information and another from the video information. In the second step, they are compared with each other to obtain the delay between them. This section explains the algorithm used in this comparison and shows the results.

3.4.1 Algorithm

Some delay estimation algorithms were implemented in the time-domain. For example, the basic but well-known delay estimation based on cross-correlation was used in this application, without good results. Most advanced time delay estimation algorithms are implemented in the frequency-domain, such as the generalized cross-correlation method. The problem with using the frequency-domain is the lack of accuracy in the spectral estimation for short signal segments. The delay algorithm needed in this synchronization stage aims to estimate the time shift of the audio with respect to video, and it needs to be used in short audiovisual sequences such as those stored in the database described above.

For this reason, the estimation algorithm from [15], a time-domain implementation that satisfies the needs of this application, was adopted. Its information delay criterion is used. The basis of the algorithm is a time-domain implementation of the maximum likelihood method. Although numerically motivated convergence criteria are commonly used, this method uses statistically motivated convergence criteria.

The delay algorithm is outlined in the block diagram (Fig. 18). The implementation was done in MATLAB. The first input of the delay estimator is the binary vector from the VAD, while the second input is the binary vector from the LAD. Both vectors have the same length. The delay algorithm introduces different delays between the two signals, and calculates the likelihood of the pair of signals for each delay introduced. The delay that maximizes the likelihood value is the estimated delay of the two signals, and thus the output of the delay algorithm. The algorithm process is as follows:

  • First, a covariance matrix is constructed based on the possible delays. In this metric, the possible delays were set to ±2 s.

  • The criterion is built up next. The goal is to establish a statistically motivated convergence criterion to make the decision.

  • Finally the maximum of the criterion is calculated. The estimated delay will be the shift that corresponds to that maximum.

Fig. 18 Block diagram for delay estimation
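The sketch below is a simplified stand-in for this delay estimator: the VAD and LAD binary vectors (one value per 50 ms frame) are shifted against each other over ±2 s and the shift with the highest frame agreement is returned. The actual implementation uses the statistically motivated maximum likelihood criterion of [15]; only the exhaustive search over candidate delays is illustrated here.

```python
import numpy as np

def estimate_delay_ms(vad: np.ndarray, lad: np.ndarray,
                      frame_ms: float = 50.0, max_delay_s: float = 2.0) -> float:
    """Estimate the audio/video delay from two equal-length binary vectors."""
    max_shift = int(max_delay_s * 1000.0 / frame_ms)
    best_shift, best_score = 0, -1.0
    for shift in range(-max_shift, max_shift + 1):
        # Align vad[t + shift] with lad[t]; a positive shift therefore means
        # that voice activity lags lip activity (audio delayed w.r.t. video)
        # under this sketch's sign convention.
        if shift >= 0:
            a, b = vad[shift:], lad[:len(lad) - shift]
        else:
            a, b = vad[:len(vad) + shift], lad[-shift:]
        n = min(len(a), len(b))
        if n == 0:
            continue
        score = np.mean(a[:n] == b[:n])      # fraction of agreeing frames
        if score > best_score:
            best_score, best_shift = score, shift
    return best_shift * frame_ms
```

In the real metric, the stationary lead of roughly 300 ms of lip activity over voiced audio (discussed below) is additionally compensated before the threshold comparison.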

One of the problems with this method is that it is assumed that the audio and video activity are perfectly synchronized, meaning that when a person is talking and the lips are visible, the viewer can see the lips moving only when a sound can be heard.

This is clearly not accurate. One example of an absence of audiovisual speech correlation is noisy, unvoiced motion of the lips, such as smiling or licking of the lips. They are impossible to discriminate using this algorithm, although some differences are accepted and the estimated delay remains accurate. An example of a problem which can be corrected easily is the absence of complete synchronization between lip activity and voice activity even when the lip sync artefact has not occurred. It can be observed that lip activity starts around 300 ms before the voice can be perceived. This is a stationary delay which can be corrected simply by taking into account the 300 ms in the estimated delay. The results shown below include this artificially added gap.

3.4.2 Results

Section 3.2.2 shows that the accuracy of the Voice Activity Detector is 92.7 %. It has been noted that in certain situations the VAD method is not able to perfectly discriminate between human speech and other sounds. In addition, the Lip Activity Detector experiences difficulties in certain situations, such as discriminating between lip motion while speaking and other types of lip movement.

In these circumstances, the two binary vectors used as inputs for the Delay Estimation Algorithm are not going to be active (value = 1) at the same instants, even if no delay is introduced. This is why detecting the Lip Sync artefact is challenging. It is also the reason why an advanced delay estimation algorithm is used. The results of estimating the delay using this algorithm are presented in this subsection.

Since the Delay Estimation Block is the final stage of the Lip Sync Artefact Key Indicator Determination, the output of this block is a key indicator. Therefore, if the estimated delay is above the thresholds determined in previous sections (140 ms), the determined Lip Sync Artefact Key Indicator is active.
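As a sketch, the final decision step can be written as below, using the asymmetric detectability thresholds given earlier in this section (100 ms when the audio lags the video, 140 ms when the video lags the audio) and the sign convention of the delay sketch above.

```python
def lip_sync_indicator(delay_ms: float,
                       audio_late_ms: float = 100.0,
                       video_late_ms: float = 140.0) -> bool:
    """Lip sync key indicator: True if the estimated delay exceeds the thresholds."""
    # Positive delay: audio delayed with respect to the video (sketch convention).
    return delay_ms > audio_late_ms or delay_ms < -video_late_ms
```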

Delays of 0, 300, 500 and 800 ms were artificially introduced to analyse the delays determined by the metric. The absolute error was also calculated. An average error of 34.28 ms (standard deviation: 32.92 ms) was obtained for the 60 estimations carried out during the experiment. Moreover, only 12 % of the cases failed the test by detecting a delay when none was present. This is a satisfactory result, since in 88 % of the test audiovisual sequences the binary key indicator is correct. Thus, in 88 % of cases, the key indicator correctly determines whether the lip sync artefact is present (i.e. whether the threshold is exceeded) and whether the audio is delayed with respect to the video or vice versa.

3.5 Limitations and future research

As limitations, we list a few main aspects which should be improved during further research.

With respect to VAD, certain sounds that should not be detected as speech, because they appear without any correlation with the video information, are in fact detected as voice activity. Examples include speakers who are not visible in the scene (common in films) or background music. Further research should include audio signal processing in terms of speaker recognition to discriminate between different speakers.

With respect to LAD, certain noisy lip movements which should not be detected as speech, because they appear without any correlation with the audio information, are in fact detected as lip activity. Examples include people smiling or licking their lips, which are impossible to discriminate using this algorithm. Further research should include video signal processing in terms of speaker recognition to discriminate between different people in the scene.

With respect to the Delay Estimator, further research should make it capable of detecting both types of delay rather than just audio delayed with respect to video.

4 Conclusions

The purpose of this paper was to report the development of the audiovisual part of the MOAVI project, which includes the detection of mute, clipping and lip synchronization (also known as lip sync) artefacts.

Regarding the results obtained for the mute artefact, the algorithm works accurately and detects the artefact at the time positions where it was introduced. We suggest that two further phenomena be evaluated in future research which, if detected, should improve the mute detection accuracy. First, muting may be detected if there is no audio while lip movement is recognized, which relates to lip sync detection. Second, muting may be detected if the first sample of a sequence of zero values is preceded by a high value (this often produces an annoying effect).

Regarding the clipping results obtained, the detection occurs at the instants at which the waveform is cut or limited by a constant, which are exactly the instants that sound annoying when the file is played.

Regarding the results of the lip sync indicator, in 88 % of the test audiovisual sequences, the binary key indicator is correct.