1 Introduction

The human face is arguably among the most well-studied deformable objects in Computer Vision, owing to its central role in numerous applications. For example, accurate detection of faces is an essential step for tasks such as controller-free gaming, surveillance, digital photo album organisation and image tagging. Additionally, detection of facial features plays a crucial role in facial behaviour analysis, facial attribute analysis (e.g., gender and age recognition), facial image editing (e.g., digital make-up), surveillance, sign language recognition, lip reading, and human-computer and human-robot interaction. In this work, we study the deformable face tracking task and we develop the first, to the best of our knowledge, comprehensive evaluation of multiple deformable face tracking pipelines.

Current research has been monopolised by the tasks of face detection, facial landmark localisation and face recognition or verification. Firstly, face detection, despite having permeated many forms of modern technology such as digital cameras and social networking, is still a challenging problem and a popular line of research, as shown by the recent surveys of Jain and Learned-Miller (2010), Zhang and Zhang (2010) and Zafeiriou et al. (2015). Although face detection on well-lit frontal facial images can be performed reliably on an embedded device, face detection on arbitrary images of people is still extremely challenging (Jain and Learned-Miller 2010). Images of faces under these unconstrained conditions are commonly referred to as “in-the-wild” and may include scenarios such as extreme facial pose, defocus, faces occupying a very small number of pixels, or occlusions. Given that face detection is still regarded as a challenging task, many generic object detection architectures, such as those of Yan et al. (2014) and King (2015), are either directly assessed on in-the-wild facial data, or are appropriately modified in order to explicitly perform face detection, as done by Zhu and Ramanan (2012) and Felzenszwalb and Huttenlocher (2005). The interested reader may refer to the most recent survey by Zafeiriou et al. (2015) for more information on in-the-wild face detection.

Secondly, the problem of localising facial landmarks that correspond to fiducial facial parts (e.g., eyes, mouth) is still extremely challenging and has only recently become possible to perform reliably. Although the history of facial landmark localisation spans back many decades (Cootes et al. 1995, 2001), the ability to accurately recover facial landmarks on in-the-wild images has only become possible in recent years (Matthews and Baker 2004; Papandreou and Maragos 2008; Saragih et al. 2011; Cao et al. 2014). Much of this progress can be attributed to the release of large annotated datasets of facial landmarks (Sagonas et al. 2013b, a; Zhu and Ramanan 2012; Le et al. 2012; Belhumeur et al. 2013; Köstinger et al. 2011), and very recently the area of facial landmark localisation has become extremely competitive, with recent works including Xiong and De la Torre (2013), Ren et al. (2014), Kazemi and Sullivan (2014), Zhu et al. (2015) and Tzimiropoulos (2015). For a recent evaluation of facial landmark localisation methods, the interested reader may refer to the survey by Wang et al. (2014) and to the results of the 300 W competition by Sagonas et al. (2015).

Finally, face recognition and verification are extremely popular lines of research. For the past two decades, the majority of statistical machine learning algorithms, spanning from linear/non-linear subspace learning techniques (De la Torre 2012; Kokiopoulou et al. 2011) to deep convolutional neural networks (DCNNs) (Taigman et al. 2014; Schroff et al. 2015; Parkhi et al. 2015), have been applied to the problem of face recognition and verification. Recently, due to the revival of DCNNs, as well as the development of graphics processing units (GPUs), remarkable face verification performance has been reported (Taigman et al. 2014). The interested reader may refer to the recent survey by Learned-Miller et al. (2016), as well as to the most popular benchmark for face verification in-the-wild by Huang et al. (2007).

In all of the aforementioned fields, significant progress has been reported in recent years. The primary reasons behind these advances are:

  • The collection and annotation of large databases. Given the abundance of facial images available primarily through the Internet via services such as Flickr, Google Images and Facebook, the collection of facial images is extremely simple. Some examples of large databases for face detection are FDDB (Jain and Learned-Miller 2010), AFW (Zhu and Ramanan 2012) and LFW (Huang et al. 2007). Similar large-scale databases for facial landmark localisation include 300 W (Sagonas et al. 2013b), LFPW (Belhumeur et al. 2013), AFLW (Köstinger et al. 2011) and HELEN (Le et al. 2012). Similarly, for face recognition there exist LFW (Huang et al. 2007), FRVT (Phillips et al. 2000) and the recently introduced Janus database (IJB-A) (Klare et al. 2015).

  • The establishment of in-the-wild benchmarks and challenges that provide a fair comparison between state-of-the-art techniques. FDDB (Jain and Learned-Miller 2010), 300 W (Sagonas et al. 2013a, 2015) and Janus (Klare et al. 2015) are the most characteristic examples for face detection, facial landmark localisation and face recognition, respectively.

Contrary to face detection, facial landmark localisation and face recognition, the problem of deformable face tracking across long-term sequences has yet to attract much attention, despite its crucial role in numerous applications. Given that cameras are embedded in many common electronic devices, it is surprising that current research has not yet focused on providing robust and accurate solutions for long-term deformable tracking. Almost all face-based applications, including facial behaviour analysis, lip reading, surveillance, and human-computer and human-robot interaction, require accurate continuous tracking of the facial landmarks. The facial landmarks are commonly used as input signals to higher-level methodologies that compute motion dynamics and deformations. The performance of currently available technologies for facial deformable tracking has not been properly assessed (Yacoob and Davis 1996; Essa et al. 1996, 1997; Decarlo and Metaxas 2000; Koelstra et al. 2010; Snape et al. 2015). This is attributed to the fact that, until recently, there was no established benchmark for the task. At ICCV 2015, the first benchmark for facial landmark tracking (the so-called 300 VW) was presented by Shen et al. (2015), providing a large number of annotated videos captured in-the-wild. In particular, the benchmark provides 114 videos with an average duration of around one minute, split into three categories of increasing difficulty. The frames of all videos (218,595 in total) were annotated by applying semi-automatic procedures, as shown in Chrysos et al. (2015). Five different facial tracking methodologies were evaluated in the benchmark (Rajamanoharan and Cootes 2015; Yang et al. 2015a; Wu and Ji 2015; Uricar and Franc 2015; Xiao et al. 2015) and the results are indicative of the current state-of-the-art performance.

In this paper, we make a significant step further and develop the first, to the best of our knowledge, comprehensive evaluation of multiple deformable face tracking pipelines. In particular, we assess:

  • A pipeline which combines a generic face detection algorithm with a facial landmark localisation method. This pipeline is typically assumed in related tracking papers, e.g. Wolf et al. (2011), Best-Rowden et al. (2013), Chrysos et al. (2015), in various publicly available implementations, e.g. King (2009), Asthana et al. (2014), Chrysos et al. (2015), and in demos presented at various conferences. The pipeline is fairly robust, since the probability of drifting is reduced by applying the face detector at each frame. Nevertheless, it does not exploit the dynamic characteristics of the tracked face. Several state-of-the-art face detectors, as well as facial landmark localisation methodologies, are evaluated in this pipeline.

  • A pipeline which combines a model free tracking system with a facial landmark localisation method. This approach takes into account the dynamic nature of the tracked face, but is susceptible to drifting and thus losing the tracked object. We evaluate the combinations of multiple state-of-the-art model free trackers, as well as landmark localisation techniques.

  • Hybrid pipelines that include mechanisms for detecting tracking failures and performing re-initialisation, as well as using models for ensuring robust tracking.

Some of the above pipelines, especially the first one, have been used extensively by practitioners. Nevertheless, to the best of our knowledge, this is the first paper that explicitly lays out the various alternatives and provides a thorough examination of the different components of the pipelines (i.e., detectors, trackers, smoothing, landmark localisation, etc.).

Summarising, the findings of our evaluation show that current face detection and model free tracking technologies are advanced enough that even a naive combination with landmark localisation techniques is adequate to achieve state-of-the-art performance on deformable face tracking. Specifically, we experimentally show that pipelines based on model free tracking are very accurate when applied to videos with moderate lighting and pose conditions. Furthermore, the combination of state-of-the-art face detectors with landmark localisation systems demonstrates excellent performance, with a surprisingly high true positive rate, on videos captured under arbitrary conditions (extreme lighting, pose, occlusions, etc.). Moreover, we show that hybrid approaches provide only a marginal improvement, which does not justify their complexity and computational cost. Finally, we compare these approaches with the systems that participated in the 300 VW competition of Shen et al. (2015).

The rest of the paper is organised as follows. Sect. 2 presents a survey of the current literature on both rigid and deformable face tracking. In Sect. 3, we present the current state-of-the-art methodologies for deformable face tracking. Since modern face tracking consists of various modules, including face detection, model free tracking and facial landmark localisation, Sects. 3.1–3.3 briefly outline the state-of-the-art in each of these domains. Experimental results are presented in Sect. 4. Finally, in Sect. 5 we discuss the challenges that still remain to be addressed, provide future research directions and draw conclusions.

2 Related Work

Rigid and deformable tracking of faces and facial features have been very popular topics of research over the past twenty years (Black and Yacoob 1995; Lanitis et al. 1995; Sobottka and Pitas 1996; Essa et al. 1996, 1997; Oliver et al. 1997; Decarlo and Metaxas 2000; Jepson et al. 2003; Matthews and Baker 2004; Matthews et al. 2004; Xiao et al. 2004; Patras and Pantic 2004; Kim et al. 2008; Ross et al. 2008; Papandreou and Maragos 2008; Amberg et al. 2009; Kalal et al. 2010a; Koelstra et al. 2010; Tresadern et al. 2012; Tzimiropoulos and Pantic 2013; Xiong and De la Torre 2013; Liwicki et al. 2013; Smeulders et al. 2014; Asthana et al. 2014; Tzimiropoulos and Pantic 2014; Li et al. 2016a; Xiong and De la Torre 2015; Snape et al. 2015; Wu et al. 2015; Tzimiropoulos 2015). In this section we provide an overview of face tracking spanning the past twenty years up to the present day. In particular, we outline the methodologies for rigid 2D/3D face tracking, as well as for deformable 2D/3D face tracking using a monocular camera. Finally, we outline the benchmarks for both rigid and deformable face tracking.

2.1 Prior Art

The first methods for rigid 2D tracking generally revolved around the use of various features or transformations and mainly explored various colour spaces for robust tracking (Crowley and Berard 1997; Bradski 1998b; Qian et al. 1998; Toyama 1998; Jurie 1999; Schwerdt and Crowley 2000; Stern and Efros 2002; Vadakkepat et al. 2008). The general methods of choice for tracking were Mean Shift and variations such as the Continuously Adaptive Mean Shift (CamShift) algorithm (Bradski 1998a; Allen et al. 2004). The Mean Shift algorithm is a non-parametric technique that climbs the gradient of a probability distribution to find the nearest dominant mode (peak) (Comaniciu and Meer 1999; Comaniciu et al. 2000). CamShift is an adaptation of the Mean Shift algorithm for object tracking. The primary difference between CamShift and Mean Shift is that the former uses continuously adaptive probability distributions (i.e., distributions that may be recomputed for each frame) while the latter is based on static distributions, which are not updated unless the target experiences significant changes in shape, size or colour. Other popular methods of choice for tracking are linear and non-linear filtering techniques, including Kalman filters, as well as methodologies that fall into the general category of particle filters (Del Moral 1996; Gordon et al. 1993), such as the popular Condensation algorithm by Isard and Blake (1998). Condensation is the application of Sampling Importance Resampling (SIR) estimation by Gordon et al. (1993) to contour tracking. A recent successful 2D rigid tracker that updates the appearance model of the tracked face was proposed in Ross et al. (2008). The algorithm uses incremental Principal Component Analysis (PCA) (Levey and Lindenbaum 2000) to learn a statistical model of the appearance in an on-line manner and, contrary to other eigentrackers, such as Black and Jepson (1998), it does not require a separate training phase. The method in Ross et al. (2008) uses a variant of the Condensation algorithm to model the distribution over the object's location as it evolves over time. The method has initiated a line of research on robust incremental object tracking, including the works of Liwicki et al. (2012b, 2013, 2012a, 2015). Rigid 3D tracking has also been studied by using generic 3D models of the face (Malciu and Prěteux 2000; La Cascia et al. 2000). For example, La Cascia et al. (2000) formulate the tracking task as an image registration problem in the cylindrically unwrapped texture space, and Sung et al. (2008) combine active appearance models (AAMs) with a cylindrical head model for robust recovery of the global rigid motion. Currently, rigid face tracking is generally treated along the same lines as general model free object tracking (Jepson et al. 2003; Smeulders et al. 2014; Liwicki et al. 2013, 2012b; Ross et al. 2008; Wu et al. 2015; Li et al. 2016a). An overview of model free object tracking is given in Sect. 3.2.
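
To make the mechanics concrete, the following is a minimal sketch of CamShift-based face tracking using OpenCV's implementation. The video path, initial bounding box and histogram mask thresholds are illustrative assumptions, not settings taken from any of the works cited above.

```python
import cv2
import numpy as np

def camshift_face_track(video_path, init_box):
    """Track a face with CamShift, given an initial (x, y, w, h) box."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    x, y, w, h = init_box
    roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    # Hue histogram of the face region; mask out dark/desaturated pixels.
    mask = cv2.inRange(roi, np.array((0., 60., 32.)), np.array((180., 255., 255.)))
    hist = cv2.calcHist([roi], [0], mask, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    window, boxes = (x, y, w, h), []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        # Back-project the histogram: a per-pixel face-colour likelihood map.
        prob = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        # CamShift climbs the likelihood map and adapts the window size,
        # recomputing the distribution mode at every frame.
        _, window = cv2.CamShift(prob, window, term)
        boxes.append(window)
    cap.release()
    return boxes
```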

Non-rigid (deformable) tracking of faces is important in many applications, spanning from facial expression analysis to motion capture for graphics and game design. Deformable tracking of faces can be further subdivided into (i) tracking of certain facial landmarks (Lanitis et al. 1995; Black and Yacoob 1995; Sobottka and Pitas 1996; Xiao et al. 2004; Matthews and Baker 2004; Matthews et al. 2004; Patras and Pantic 2004; Papandreou and Maragos 2008; Amberg et al. 2009; Tresadern et al. 2012; Xiong and De la Torre 2013; Asthana et al. 2014; Xiong and De la Torre 2015) or (ii) tracking/estimation of dense facial motion (Essa et al. 1996; Yacoob and Davis 1996; Essa et al. 1997; Decarlo and Metaxas 2000; Koelstra et al. 2010; Snape et al. 2015). The latter category of estimating a dense facial motion through a model-based system was proposed by the MIT Media Lab in the mid-1990s (Essa et al. 1997, 1996, 1994; Basu et al. 1996). In particular, the method by Essa and Pentland (1994) tracks facial motion using optical flow computation coupled with a geometric and a physical (muscle) model describing the facial structure. This modelling results in a time-varying spatial patterning of facial shape and a parametric representation of the independent muscle action groups, which are responsible for the observed facial motions. In Essa et al. (1994), the physically-based face model of Essa and Pentland (1994) is driven by responses from a set of templates that characterise facial regions. Model-generated flow has been used by the same group in Basu et al. (1996) for motion regularisation. 3D motion estimation using sparse 3D models and optical flow estimation has also been proposed by Li et al. (1993) and Bozdaği et al. (1994). Dense facial motion tracking is performed in Decarlo and Metaxas (2000) by solving a model-based (using a facial deformable model) least-squares optical flow problem. The constraints are relaxed by the use of a Kalman filter, which permits controlled constraint violations based on the noise present in the optical flow information, and enables optical flow and edge information to be combined more robustly and efficiently. Free-form deformations (Rueckert et al. 1999) are used in Koelstra et al. (2010) for the extraction of dense facial motion for facial action unit recognition. Recently, Snape et al. (2015) proposed a statistical model of the facial flow for fast and robust dense facial motion extraction.

Arguably, the category of deformable tracking that has received the majority of attention is that of tracking a set of sparse facial landmarks. The landmarks are either associated with a particular sparse facial model, i.e. the popular Candide facial model by Li et al. (1993), or correspond to fiducial facial regions/parts (e.g., mouth, eyes, nose) (Cootes et al. 2001). Even early attempts, such as Essa and Pentland (1994), recognised the usefulness of tracking facial regions/landmarks in order to perform robust fitting of complex facial models (currently the vast majority of dense 3D facial model tracking techniques, such as Wei et al. (2004), Zhang et al. (2008), Amberg (2011), rely on the robust tracking of a set of facial landmarks). Early approaches for tracking facial landmarks/regions included: (i) the use of templates built around certain facial regions (Essa and Pentland 1994), (ii) the use of facial classifiers to detect landmarks (Colmenarez et al. 1999), where tracking is performed using modal analysis (Tao and Huang 1998), or (iii) the use of face and facial region segmentation to detect the features, where tracking is performed using block matching (Sobottka and Pitas 1996). Currently, deformable face tracking has converged with the problem of facial landmark localisation on static images. That is, the methods generally rely on fitting generative or discriminative statistical models of appearance and 2D/3D sparse facial shape at each frame. Arguably, the most popular methods are generative and discriminative variations of Active Appearance Models (AAMs) and Active Shape Models (ASMs) (Pighin et al. 1999; Cootes et al. 2001; Dornaika and Ahlberg 2004; Xiao et al. 2004; Matthews and Baker 2004; Dedeoğlu et al. 2007; Papandreou and Maragos 2008; Amberg et al. 2009; Saragih et al. 2011; Xiong and De la Torre 2013, 2015). The statistical models of appearance and shape can either be generic, as in Cootes et al. (2001), Matthews and Baker (2004), Xiong and De la Torre (2013), or incrementally updated in order to better capture the face at hand, as in Sung and Kim (2009), Asthana et al. (2014). The vast majority of facial landmark localisation methodologies require an initialisation provided by a face detector. More details regarding the current state-of-the-art in facial landmark localisation can be found in Sect. 3.3.

Arguably, the current practice in deformable face tracking is to combine a generic face detection technique with a generic facial landmark localisation technique (Saragih et al. 2011; Xiong and De la Torre 2013, 2015; Alabort-i-Medina and Zafeiriou 2015; Asthana et al. 2015). For example, a popular approach is the successive application of face detection and facial landmark localisation at each frame. Another approach performs face detection in the first frame and then applies facial landmark localisation at each consecutive frame, using the fitting result of the previous frame as initialisation; face detection can be re-applied in case of failure. This is the approach used by popular packages such as Asthana et al. (2014). In this paper, we thoroughly evaluate variations of the above approaches. Furthermore, we consider the use of modern state-of-the-art model free trackers for rigid 2D tracking, to be used as initialisation for the facial landmark localisation procedure. This is pictorially described in Fig. 1.
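
A minimal sketch of these standard pipelines follows. The functions `detect_face`, `localise_landmarks` and `fit_failed` are hypothetical placeholders for any of the detectors, landmark localisers and failure-detection heuristics discussed in this paper.

```python
import numpy as np

def shape_to_box(shape):
    """Tight bounding box (x_min, y_min, x_max, y_max) of a (68, 2) shape."""
    return (shape[:, 0].min(), shape[:, 1].min(),
            shape[:, 0].max(), shape[:, 1].max())

def detection_pipeline(frames, detect_face, localise_landmarks):
    """Re-detect the face in every frame, then fit landmarks (Fig. 1, top)."""
    shapes = []
    for frame in frames:
        box = detect_face(frame)   # most confident detection, or None
        shapes.append(localise_landmarks(frame, box) if box is not None else None)
    return shapes

def previous_frame_pipeline(frames, detect_face, localise_landmarks, fit_failed):
    """Detect once, then initialise each frame's fitting from the previous
    frame's result; re-detect whenever fitting is judged to have failed."""
    shape, shapes = None, []
    for frame in frames:
        if shape is None or fit_failed(frame, shape):
            shape = localise_landmarks(frame, detect_face(frame))
        else:
            shape = localise_landmarks(frame, shape_to_box(shape))
        shapes.append(shape)
    return shapes
```

The model free tracking pipeline of Fig. 1 (bottom) has the same structure as `detection_pipeline`, with the per-frame detection replaced by a tracker update seeded from the bounding box in the first frame.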

2.2 Face Tracking Benchmarking

For assessing the performance of rigid 2D face tracking, several short face sequences have been annotated with regard to the facial region (using a bounding box style annotation). One of the first sequences annotated for this task is the so-called Dudek sequence by Ross et al. (2015). Nowadays, several such sequences have been annotated and are publicly available, such as the ones by Liwicki et al. (2016), Li et al. (2016b), Wu et al. (2015).

Fig. 1 Overview of the standard approaches for deformable face tracking. (Top) face detection is applied independently at each frame of the video, followed by facial landmark localisation. (Bottom) model free tracking is employed, initialised with the bounding box of the face in the first frame, followed by facial landmark localisation

The performance of deformable dense facial tracking methodologies was usually assessed using markers (Decarlo and Metaxas 2000), simulated data (Snape et al. 2015), visual inspection (Decarlo and Metaxas 2000; Essa et al. 1997, 1996; Yacoob and Davis 1996; Snape et al. 2015; Koelstra et al. 2010), or indirectly through the use of the dense facial motion for certain tasks, such as expression analysis (Essa et al. 1996; Yacoob and Davis 1996; Koelstra et al. 2010). Regarding tracking of facial landmarks, up until recently, the preferred method for assessing performance was visual inspection on a number of selected facial videos (Xiong and De la Torre 2013; Tresadern et al. 2012). Other methods were assessed on a small number of short (a few seconds in length) annotated facial videos (Sagonas et al. 2014; Asthana et al. 2014). Until recently, the longest annotated facial video sequence was the so-called talking face of Cootes (2016), which was used to evaluate many tracking methods, including Orozco et al. (2013), Amberg et al. (2009). The talking face video comprises 5000 frames (around 200 seconds) taken from a video of a person engaged in a conversation. The video was initially tracked using an Active Appearance Model (AAM) with a shape model, and annotations of a total of 68 landmarks are provided. The tracked landmarks were visually checked and manually corrected where necessary.

Recently, Xiong and De la Torre (2015) introduced a benchmark for facial landmark tracking using videos from the Distracted Driver Face (DDF) and Naturalistic Driving Study (NDS) in Campbell (2016). The DDF dataset contains 15 sequences with a total of 10,882 frames. Each sequence displays a single subject posing as a distracted driver in a stationary vehicle or an indoor environment. Twelve of the 15 videos were recorded with subjects sitting inside a vehicle: five were recorded during the night under infrared (IR) light and the rest were recorded during the daytime under natural lighting. The remaining three were recorded indoors. The NDS database contains 20 sub-sequences of driver faces recorded during a drive conducted between the Blacksburg, VA and Washington, DC areas (NDS is more challenging than DDF, since its videos are of lower spatial and temporal resolution). Each video of the NDS database is one minute long, recorded at 15 frames per second (fps) at a resolution of \(360 \times 240\). For both datasets, one in every ten frames was annotated, using either 49 landmarks for near-frontal faces or 31 landmarks for profile faces. The database contains many extreme facial poses (90\(^\circ \) yaw, 50\(^\circ \) pitch), as well as many faces under extreme lighting conditions (e.g., IR). In total, the dataset presented in Xiong and De la Torre (2015) contains between 2000 and 3000 annotated faces (please refer to Xiong and De la Torre (2015) for exemplar annotations).

The only existing large in-the-wild benchmark for facial landmark tracking was recently introduced by Shen et al. (2015). The benchmark consists of 114 videos of varying difficulty and provides annotations generated in a semi-automatic manner (Chrysos et al. 2015; Shen et al. 2015; Tzimiropoulos 2015). This challenge, called 300 VW, is the only existing large-scale comprehensive benchmark for deformable model tracking. More details regarding the dataset of the 300 VW benchmark can be found in Sect. 4.1. The performance of the pipelines considered in this paper is compared with the participating methods of the 300 VW challenge in Sect. 4.8.

3 Deformable Face Tracking

In this paper, we focus on the problem of performing deformable face tracking across long-term sequences within unconstrained videos. Tracking across long-term sequences is particularly challenging, as the appearance of the face may change significantly during the sequence due to occlusions, illumination variation, motion artifacts and head pose. Deformable tracking is further complicated by the expectation of recovering a set of accurate fiducial points in conjunction with successfully tracking the object. As described in Sect. 2, current deformable facial tracking methods mainly concentrate on performing face detection per frame and then performing facial landmark localisation. We consider facial landmark localisation accuracy to be the most important metric for measuring the success of deformable face tracking. Given this, there are a number of strategies that could feasibly be employed to minimise the total facial landmark localisation error across the entire sequence. We therefore take advantage of current advances in face detection, model free tracking and facial landmark localisation in order to perform deformable face tracking. Specifically, we investigate three strategies for deformable tracking:

  1. Detection + landmark localisation Face detection per frame, followed by facial landmark localisation initialised within the facial bounding boxes. This scenario is visualised in Fig. 1 (top).

  2. Model free tracking + landmark localisation Model free tracking, initialised around the interior of the face within the first frame, followed by facial landmark localisation within the tracked box. This scenario is visualised in Fig. 1 (bottom).

  3. Hybrid systems Hybrid methods that attempt to improve the robustness of the placement of the bounding box for landmark localisation. Namely, we investigate methods for failure detection, trajectory smoothness and reinitialisation. Examples of such methods are pictorially demonstrated in Figs. 4 and 8.

Note that we focus on combinations of methods that provide bounding boxes of the facial region followed by landmark localisation. This is due to the fact that the current set of state-of-the-art landmark localisation methods are all local methods and require initialisation within the facial region. Although joint face detection and landmark localisation methods have been proposed (Zhu and Ramanan 2012; Chen et al. 2014), they are not competitive with the most recent set of landmark localisation methods. For this reason, in this paper we focus on the combination of bounding box estimators with state-of-the-art local landmark localisation techniques.

The remainder of this section gives a brief overview of the literature concerning face detection, model free tracking and facial landmark localisation.

3.1 Face Detection

Face detection is among the most important and popular tasks in Computer Vision and an essential step for applications such as face recognition and face analysis. Although it is one of the oldest tasks undertaken by researchers (the early works appeared about 45 years ago (Sakai et al. 1972; Fischler and Elschlager 1973)), it is still an open and challenging problem. Recent advances can achieve reliable performance under moderate illumination and pose conditions, which led to the installation of simple face detection technologies in everyday devices such as digital cameras and mobile phones. However, recent benchmarks (Jain and Learned-Miller 2010) show that the detection of faces on arbitrary images is still a very challenging problem.

Since face detection has been a research topic for so many decades, the existing literature is, naturally, extremely extensive. The fact that all recent face detection surveys (Hjelmås and Low 2001; Yang et al. 2002; Zhang and Zhang 2010; Zafeiriou et al. 2015) provide different categorisations of the relative literature is indicative of the huge range of existing techniques. Consequently, herein, we only present a basic outline of the face detection literature. For an extended review, the interested reader may refer to the most recent face detection survey in Zafeiriou et al. (2015).

According to the most recent literature review (Zafeiriou et al. 2015), existing methods can be separated into two major categories. The first one includes methodologies that learn a set of rigid templates, which can be further split into the following groups: (i) boosting-based methods, (ii) approaches that utilise SVM classifiers, (iii) exemplar-based techniques, and (iv) frameworks based on Neural Networks. The second major category includes deformable part models, i.e. methodologies that learn a set of templates per part, as well as the deformations between them.

Boosting Methods Boosting combines multiple “weak” hypotheses of moderate accuracy in order to determine a highly accurate hypothesis. The most characteristic example is Adaptive Boosting (AdaBoost), which is utilised by the most popular face detection methodology, i.e. the Viola–Jones (VJ) detector of Viola and Jones (2001, 2004). Characteristic examples of other methods that employ variations of AdaBoost include Li et al. (2002), Wu et al. (2004), Mita et al. (2005). The original VJ algorithm used Haar features; however, boosting (and cascade-of-classifiers methodologies in general) has been shown to greatly benefit from robust features (Köstinger et al. 2012; Jun et al. 2013; Li et al. 2011; Li and Zhang 2013; Mathias et al. 2014; Yang et al. 2014a), such as HOG (Dalal and Triggs 2005), SIFT (Lowe 1999), SURF (Bay et al. 2008) and LBP (Ojala et al. 2002). For example, SURF features have been successfully combined with a cascade of weak classifiers in Li et al. (2011), Li and Zhang (2013), achieving faster convergence. Additionally, Jun et al. (2013) propose robust face-specific features that combine both LBP and HOG. Mathias et al. (2014) recently proposed an approach (so-called HeadHunter) with state-of-the-art performance that employs various robust features with boosting. Specifically, they propose the adaptation of Integral Channel Features (ICF) (Dollár et al. 2009) with HOG and LUV colour channels, combined with global feature normalisation. A similar approach is followed by Yang et al. (2014a), who combine gray-scale, RGB, HSV, LUV, gradient magnitude and histograms within a cascade of weak classifiers.
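
As a concrete example, the VJ detector ships with OpenCV together with pre-trained Haar cascades; a minimal usage sketch follows. The image path is illustrative and the parameters are common defaults rather than values tuned in this paper.

```python
import cv2

# Pre-trained frontal-face Haar cascade shipped with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("frame.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Boosted cascade evaluated over a multi-scale sliding window; early weak
# classifiers reject most windows cheaply, which keeps detection fast.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```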

SVM Classifiers Maximum margin classifiers, such as Support Vector Machines (SVMs), have become popular for face detection (Romdhani et al. 2001; Heisele et al. 2003; Rätsch et al. 2004; King 2015). Even though their detection speed was initially slow, various schemes have been proposed to speed up the process. Romdhani et al. (2001) propose a method that computes a reduced set of vectors from the original support vectors that are used sequentially in order to make early rejections. A similar approach is adopted by Rätsch et al. (2004). A hierarchy of SVM classifiers trained on different resolutions is applied in Heisele et al. (2003). King (2015) proposes an algorithm for efficient learning of a max-margin classifier using all the sub-windows of the training images, without applying any sub-sampling, and formulates a convex optimisation that finds the global optimum. Moreover, SVM classifiers have also been used for multi-view face detection (Li et al. 2000; Wang and Ji 2004). For example, Li et al. (2000) first apply a face pose estimator based on support vector regression (SVR), followed by an SVM face detector for each pose.
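
For instance, a HOG-based detector trained with the max-margin formulation of King (2015) ships with the Dlib library (the same implementation employed in our experiments, see Sect. 4.2). A minimal usage sketch, with an illustrative image path:

```python
import dlib

# Dlib's default frontal face detector: a HOG scanning-window pyramid
# scored by a max-margin linear classifier.
detector = dlib.get_frontal_face_detector()

img = dlib.load_rgb_image("frame.jpg")
dets = detector(img, 1)   # second argument: number of upsampling steps
for d in dets:
    print(d.left(), d.top(), d.right(), d.bottom())
```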

Exemplar-Based Techniques These methods aim to match a test image against a large set of facial images. This approach is inspired by principles used in image retrieval and requires that the exemplar set covers the large appearance variation of the human face. Shen et al. (2013) employ bag-of-words image retrieval methods to extract features from each exemplar, which creates a voting map per exemplar that functions as a weak classifier. The final detection is then performed by combining the voting maps. A similar methodology is applied in Li et al. (2014), with the difference that specific exemplars are used as weak classifiers based on a boosting strategy. Recently, Kumar et al. (2015) proposed an approach that enhances the voting procedure by using semantically related visual words, as well as weighted occurrences of visual words based on their spatial distributions.

Table 1 The set of detectors used in this paper

Convolutional Neural Networks Another category, similar to the previous rigid template-based ones, includes methods that employ Convolutional Neural Networks (CNNs) and Deep CNNs (DCNNs) (Osadchy et al. 2007; Zhang and Zhang 2014; Ranjan et al. 2015; Li et al. 2015a; Yang et al. 2015b). Osadchy et al. (2007) use a network with four convolution layers and one fully connected layer that rejects the non-face hypotheses and estimates the pose of the correct face hypothesis. Zhang and Zhang (2014) propose a multi-view face detection framework that employs a multi-task DCNN for face pose estimation and landmark localisation in order to obtain better features for face detection. Ranjan et al. (2015) combine deep pyramidal features with Deformable Part Models. Recently, Yang et al. (2015b) proposed a DCNN architecture that is able to discover facial part responses from arbitrary uncropped facial images without any part supervision, and report state-of-the-art performance on current face detection benchmarks.

Deformable Part Models DPMs (Schneiderman and Kanade 2004; Felzenszwalb and Huttenlocher 2005; Felzenszwalb et al. 2010; Zhu and Ramanan 2012; Yan et al. 2013; Li et al. 2013a; Yan et al. 2014; Mathias et al. 2014; Ghiasi and Fowlkes 2014; Barbu et al. 2014) learn a patch expert for each part of an object and model the deformations between parts using spring-like connections based on a tree structure. Consequently, they perform joint facial landmark localisation and face detection. Even though they are not the best performing methods for landmark localisation, they are highly accurate for face detection in-the-wild. However, their main disadvantage is their high computational cost. Pictorial Structures (PS) (Fischler and Elschlager 1973; Felzenszwalb and Huttenlocher 2005) are the first family of DPMs that appeared. They are generative DPMs that assume Gaussian distributions to model the appearance of each part, as well as the deformations. They became a very popular line of research after the influential work of Felzenszwalb and Huttenlocher (2005), which proposed a very efficient dynamic programming algorithm for finding the global optimum based on the Generalised Distance Transform. Many discriminatively trained DPMs (Felzenszwalb et al. 2010; Zhu and Ramanan 2012; Yan et al. 2013, 2014) appeared afterwards, which learn the patch experts and deformation parameters using discriminative classifiers, such as the latent SVM.
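
To illustrate the core of that dynamic programming step, the following is a sketch of the 1D generalised distance transform for a quadratic deformation cost, \(f(p) = \min _q \left( w(p-q)^2 + g(q) \right)\); in a DPM it is applied along each image dimension in turn to propagate part scores to their parent. The O(n) lower-envelope construction follows Felzenszwalb and Huttenlocher (2005); the deformation weight `w` is an illustrative parameter.

```python
import numpy as np

def distance_transform_1d(g, w=1.0):
    """Generalised distance transform f(p) = min_q ( w*(p - q)^2 + g(q) ),
    computed in O(n) via the lower envelope of parabolas rooted at each q."""
    g = np.asarray(g, dtype=float)
    n = len(g)
    f = np.empty(n)
    v = np.zeros(n, dtype=int)   # locations of parabolas in the envelope
    z = np.empty(n + 1)          # boundaries between adjacent parabolas
    z[0], z[1] = -np.inf, np.inf
    k = 0
    for q in range(1, n):
        # Intersection of parabola q with the rightmost envelope parabola.
        s = ((g[q] + w * q * q) - (g[v[k]] + w * v[k] * v[k])) / (2 * w * (q - v[k]))
        while s <= z[k]:
            k -= 1
            s = ((g[q] + w * q * q) - (g[v[k]] + w * v[k] * v[k])) / (2 * w * (q - v[k]))
        k += 1
        v[k] = q
        z[k], z[k + 1] = s, np.inf
    k = 0
    for p in range(n):           # read off the envelope at every position
        while z[k + 1] < p:
            k += 1
        f[p] = w * (p - v[k]) ** 2 + g[v[k]]
    return f
```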

DPMs can be further separated with respect to their training scenario into: (i) weakly supervised and (ii) strongly supervised. Weakly supervised DPMs (Felzenszwalb et al. 2010; Yan et al. 2014) are trained using only the bounding boxes of the positive examples and a set of negative examples. The most representative example is the work by Felzenszwalb et al. (2010), which has proved to be very efficient for generic object detection. Under a strongly supervised scenario, it is assumed that a training database annotated with fiducial landmarks is available. Several strongly supervised methods exist in the literature (Felzenszwalb and Huttenlocher 2005; Zhu and Ramanan 2012; Yan et al. 2013; Ghiasi and Fowlkes 2014). Ghiasi and Fowlkes (2014) propose a hierarchical DPM that explicitly models part occlusions. In Zhu and Ramanan (2012) it is shown that a strongly supervised DPM outperforms a weakly supervised one by a large margin. In contrast, HeadHunter by Mathias et al. (2014) shows that a weakly supervised DPM can outperform all current state-of-the-art face detection methodologies, including the strongly supervised DPM of Zhu and Ramanan (2012).

According to FDDB (Jain and Learned-Miller 2010), which is the most well established face detection benchmark, the currently top-performing methodology is the one by Ranjan et al. (2015), which combines DCNNs with a DPM. Since some of the top-performing systems consist of commercial software, we used the deep methods of Hu and Ramanan (2016) and Zhang et al. (2016), which are available as open source, with the method of Hu and Ramanan (2016) reporting the latest best performance on FDDB. Additionally, we employ the top performing SVM-based method for learning rigid templates (King 2015), the best weakly and strongly supervised DPM implementations of Mathias et al. (2014) and Zhu and Ramanan (2012), along with the best performing exemplar-based technique of Kumar et al. (2015). Finally, we also use the popular VJ algorithm (Viola and Jones 2001, 2004) as a baseline face detection method. The employed face detection implementations are summarised in Table 1.

3.2 Model Free Tracking

Model free tracking is an extremely active area of research. Given the initial state (e.g., position and size of the containing box) of a target object in the first image, model free tracking attempts to estimate the states of the target in subsequent frames. Therefore, model free tracking provides an excellent method of initialising landmark localisation methods.

The literature on model free tracking is vast. For the rest of this section, we will provide an extremely brief overview of model free tracking that focuses primarily on areas that are relevant to the tracking methods we investigated in this paper. We refer the interested reader to the wealth of tracking surveys (Li et al. 2013b; Smeulders et al. 2014; Salti et al. 2012; Yang et al. 2011) and benchmarks (Wu et al. 2013, 2015; Kristan et al. 2013, 2014, 2015, 2016; Smeulders et al. 2014) for more information on model free tracking methods.

Generative Trackers These trackers attempt to model the object's appearance directly. This includes template based methods, such as those by Matthews et al. (2004), Baker and Matthews (2004), Sevilla-Lara and Learned-Miller (2012), as well as parametric generative models such as Balan and Black (2006), Ross et al. (2008), Black and Jepson (1998), Xiao et al. (2014). The work of Ross et al. (2008) introduces online subspace learning for tracking with a sample mean update, which allows the tracker to account for changes in illumination, viewing angle and pose of the object. The idea is to incrementally learn a low-dimensional subspace and adapt the appearance model to object changes. The update is based on an incremental principal component analysis (PCA) algorithm; however, it appears to be ineffective at handling large occlusions or non-rigid movements due to its holistic model. To alleviate partial occlusions, Xiao et al. (2014) suggest the use of square templates along with PCA. Another popular area of generative tracking is the use of sparse representations for appearance. In Mei and Ling (2011), a target candidate is represented by a sparse linear combination of target and trivial templates. The coefficients are extracted by solving an \(\ell _1\) minimisation problem with non-negativity constraints, while the target templates are updated online. However, solving the \(\ell _1\) minimisation for each particle is computationally expensive. A generalisation of this tracker is the work of Zhang et al. (2012), which learns the representation for all particles jointly. It additionally improves robustness by exploiting the correlation among particles. An even further abstraction is achieved in Zhang et al. (2014d), where a low-rank sparse representation of the particles is encouraged. In Zhang et al. (2014c), the authors generalise the low-rank constraint of Zhang et al. (2014d) and add a sparse error term in order to handle outliers. Another low-rank formulation was used by Wu et al. (2012), which is an online version of the RASL (Peng et al. 2012) algorithm and attempts to jointly align the input sequence using convex optimisation.
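
As an illustration of the generative paradigm, the sketch below scores candidate patches under a PCA appearance model, in the spirit of Ross et al. (2008); the incremental subspace update and the particle-based candidate generation are omitted for brevity.

```python
import numpy as np

def subspace_score(candidates, mean, U):
    """Score vectorised candidate patches under a PCA appearance model.
    candidates: (n, d); mean: (d,); U: (d, k) orthonormal basis.
    Lower reconstruction error indicates a better match to the model."""
    X = candidates - mean
    recon = (X @ U) @ U.T                     # project onto the subspace
    return np.sum((X - recon) ** 2, axis=1)   # residual ||x - U U^T x||^2
```

The best-scoring candidate becomes the new target state, and its patch is then used to incrementally update `mean` and `U`, allowing the model to adapt online.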

Keypoint Trackers These trackers (Pernici and Del Bimbo 2014; Poling et al. 2014; Hare et al. 2012; Nebehay and Pflugfelder 2015) attempt to use the robustness of keypoint detection methodologies like SIFT (Lowe 1999) or SURF (Bay et al. 2008) in order to perform tracking. Pernici and Del Bimbo (2014) collect multiple descriptors of weakly aligned keypoints over time and combine the matched keypoints in a RANSAC voting scheme. Nebehay and Pflugfelder (2015) utilise keypoints to vote for the object centre in each frame. A consensus-based scheme is applied for outlier detection, and the votes are transformed based on the current keypoint arrangement to account for scale and rotation. However, since they only consider local points, keypoint methods may struggle to capture the global appearance of the tracked target.

Discriminative Trackers These trackers attempt to explicitly model the difference between the object appearance and the background. Most commonly, these methods are named “tracking-by-detection” techniques, as they involve classifying image regions as either part of the object or the background. In their work, Grabner et al. (2006) propose an online boosting method to select and update discriminative features, which allows the system to account for minor changes in the object appearance. However, the tracker fails to model severe changes in appearance. Babenko et al. (2011) advocate the use of a multiple instance learning boosting algorithm to mitigate the drifting problem. More recently, discriminative correlation filters (DCF) have become highly successful at tracking. The DCF is trained by performing a circular sliding window operation on the training samples. This periodic assumption enables efficient training and detection by utilising the Fast Fourier Transform (FFT). Danelljan et al. (2014) learn separate correlation filters for translation and scale estimation. In Danelljan et al. (2015), the authors introduce a sparse spatial regularisation term to mitigate the artifacts at the boundaries of the circular correlation. In contrast to the linear regression commonly used to learn DCFs, Henriques et al. (2015) apply kernel regression and propose its multi-channel extension, enabling the use of features such as HOG (Dalal and Triggs 2005). Li et al. (2015b) propose a new use for particle filters in order to choose reliable patches that can be considered part of the object. These patches are modelled using a variant of the method proposed by Henriques et al. (2015). Hare et al. (2011) propose the use of structured output prediction; by explicitly allowing the outputs to parametrise the needs of the tracker, an intermediate classification step is avoided.
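
The following sketch shows the core DCF computation for a single-channel filter (a simplified, MOSSE-style formulation; the kernelised, multi-channel and spatially regularised variants cited above build on the same Fourier-domain structure).

```python
import numpy as np

def train_dcf(patch, sigma=2.0, lam=1e-2):
    """Learn a correlation filter from one grayscale patch. The circular
    shift assumption diagonalises ridge regression in the Fourier domain,
    so training reduces to element-wise operations on FFTs."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Desired response: a Gaussian peaked at the target centre.
    g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
    F, G = np.fft.fft2(patch), np.fft.fft2(g)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)   # conjugate filter

def detect_dcf(Hc, patch):
    """Correlate the filter with a new patch of the same size; the offset
    of the response peak from the centre gives the estimated translation."""
    resp = np.real(np.fft.ifft2(np.fft.fft2(patch) * Hc))
    return np.unravel_index(np.argmax(resp), resp.shape)
```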

Table 2 The set of trackers that are used in this paper

Part-based Trackers These trackers attempt to implicitly model the parts of an object in order to improve tracking performance. Adam et al. (2006) represent the object with multiple arbitrary patches. Each patch votes on potential positions and scales of the object, and a robust statistic is employed to minimise the voting error. Kalal et al. (2010b) sample points within the object and track them independently in each frame by estimating optical flow. Using a forward–backward measure, the erroneous points are identified and the remaining reliable points are utilised to compute the optimal object trajectory (see the sketch after this paragraph). Yao et al. (2013) adapt the latent SVM of Felzenszwalb et al. (2010) for online tracking, by restricting the search to the vicinity of the location of the target object in the previous frame. In comparison to the weakly supervised part-based model of Yao et al. (2013), in Zhang and van der Maaten (2013) the authors recommend an online strongly supervised part-based deformable model that learns the representation of the object and of the background by training a classifier. Wang et al. (2015) employ a part-based tracker by estimating a direct displacement prediction of the object. A cascade of regressors is utilised to localise the parts, while the model is updated online and the regressors are initialised by multiple motion models at each frame.
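
The forward–backward consistency check of Kalal et al. (2010b) can be sketched with OpenCV's pyramidal Lucas–Kanade flow as follows: points are tracked forward, the results are tracked backward, and only points that return close to where they started are trusted. The window size and error threshold are illustrative assumptions.

```python
import cv2
import numpy as np

def forward_backward_filter(prev, curr, pts, max_fb=1.0):
    """Keep only reliably tracked points (Median-Flow style).
    prev, curr: grayscale frames; pts: (n, 1, 2) float32 points in prev."""
    lk = dict(winSize=(15, 15), maxLevel=3)
    fwd, st1, _ = cv2.calcOpticalFlowPyrLK(prev, curr, pts, None, **lk)
    bwd, st2, _ = cv2.calcOpticalFlowPyrLK(curr, prev, fwd, None, **lk)
    fb_err = np.linalg.norm(pts - bwd, axis=2).ravel()
    ok = (st1.ravel() == 1) & (st2.ravel() == 1) & (fb_err < max_fb)
    return pts[ok], fwd[ok]   # surviving points vote for the object motion
```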

Given the wealth of available trackers, selecting appropriate trackers for deformable tracking purposes poses a difficult proposition. In order to give as broad an overview as possible, we selected trackers from each of the aforementioned categories. Therefore, in this paper we compare against 27 trackers, which are outlined in Table 2. SRDCF (Danelljan et al. 2015), KCF (Henriques et al. 2015), LCT (Ma et al. 2015), STAPLE (Bertinetto et al. 2016a) and DSST (Danelljan et al. 2014) are all discriminative trackers based on DCFs. They all performed well in the VOT 2015 (Kristan et al. 2015) challenge, and DSST was the winner of VOT 2014 (Kristan et al. 2014). The trackers of Danelljan et al. (2016), Qi et al. (2016), Nam and Han (2016), Bertinetto et al. (2016b) are indicative trackers that employ neural networks and achieve top results. STRUCK (Hare et al. 2011) is a discriminative tracker that performed very well in the Online Object Tracking benchmark (Wu et al. 2013), while the more recent method of Ning et al. (2016) reduces the computational burden of the structural SVM of STRUCK and reports superior results. SPOT (Zhang and van der Maaten 2014) is a strong performing part-based tracker, CMT (Nebehay and Pflugfelder 2015) is a strong performing keypoint-based tracker, and LRST (Zhang et al. 2014d) and ORIA (Wu et al. 2012) are recent generative trackers. RPT (Li et al. 2015b) is a recently proposed technique that reported state-of-the-art results on the Online Object Tracking benchmark (Wu et al. 2013). TLD (Kalal et al. 2012), MIL (Babenko et al. 2011), FCT (Zhang et al. 2014c), DF (Sevilla-Lara and Learned-Miller 2012) and IVT (Ross et al. 2008) were included as baseline tracking methods with publicly available implementations. Finally, the CAMSHIFT and PF methods (Bradski 1998a; Isard and Blake 1996) are included as very influential trackers used for tracking in previous decades.

3.3 Facial Landmark Localisation

Statistical deformable models have emerged as an important research field over the last few decades, existing at the intersection of computer vision, statistical pattern recognition and machine learning. Statistical deformable models aim to solve generic object alignment in terms of localisation of fiducial points. Although deformable models can be built for a variety of object classes, the majority of ongoing research has focused on the task of facial alignment. Recent large-scale challenges on facial alignment (Sagonas et al. 2013b, a, 2015) are characteristic examples of the rapid progress being made in the field.

Table 3 The landmark localisation methods employed in this paper

Currently, the most commonly-used and well-studied face alignment methods can be separated into two major families: (i) discriminative models that employ regression in a cascaded manner, and (ii) generative models that are iteratively optimised.

Regression-Based Models The methodologies of this category aim to learn a regression function that regresses from the object's appearance (commonly handcrafted features) to the target output variables (either the landmark coordinates or the parameters of a statistical shape model). Although the history of using linear regression to tackle the problem of face alignment spans back many years (Cootes et al. 2001), the research community turned towards alternative approaches due to the lack of sufficient data for training accurate regression functions. Nevertheless, regression-based techniques have recently prevailed in the field thanks to the wealth of annotated data and effective handcrafted features (Lowe 1999; Dalal and Triggs 2005). Recent works have shown that excellent performance can be achieved by employing a cascade of regression functions (Burgos-Artizzu et al. 2013; Xiong and De la Torre 2013, 2015; Dollár et al. 2010; Cao et al. 2014; Kazemi and Sullivan 2014; Ren et al. 2014; Asthana et al. 2014; Tzimiropoulos 2015; Zhu et al. 2015). Regression-based methods can be approximately separated into two categories, depending on the nature of the regression function employed. Methods that employ linear regression, such as the supervised descent method (SDM) of Xiong and De la Torre (2013), tend to employ robust hand-crafted features (Xiong and De la Torre 2013; Asthana et al. 2014; Xiong and De la Torre 2015; Tzimiropoulos 2015; Zhu et al. 2015). On the other hand, methods that employ tree-based regressors, such as the explicit shape regression (ESR) method of Cao et al. (2014), tend to rely on data-driven features that are optimised directly by the regressor (Burgos-Artizzu et al. 2013; Cao et al. 2014; Dollár et al. 2010; Kazemi and Sullivan 2014).
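
The test-time structure shared by linear cascaded regression methods such as the SDM can be sketched as follows; `extract_features` stands in for the handcrafted features (e.g., SIFT extracted around each landmark), and each level is assumed to have been learned off-line by ridge regression on perturbed training shapes.

```python
import numpy as np

def fit_cascade(image, shape, cascade, extract_features):
    """Apply a trained cascade: at each level, extract features around the
    current landmark estimate x and apply the update x <- x + R^T phi + b."""
    for R, b in cascade:                       # R: (d, 2L), b: (2L,)
        phi = extract_features(image, shape)   # phi: (d,)
        shape = shape + phi @ R + b
    return shape

def train_level(Phi, dX, lam=1e2):
    """Learn one level by ridge regression from stacked features Phi (n, d)
    to ground-truth shape increments dX (n, 2L)."""
    d = Phi.shape[1]
    R = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ dX)
    b = (dX - Phi @ R).mean(axis=0)            # bias absorbs residual mean
    return R, b
```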

Table 4 The set of experiments conducted in this paper

Generative Models The most dominant representative algorithm of this category is, by far, the active appearance model (AAM). AAMs consist of parametric linear models of both the shape and appearance of an object, typically modelled by Principal Component Analysis (PCA). The AAM objective function involves the minimisation of the appearance reconstruction error with respect to the shape parameters. AAMs were initially proposed by Cootes et al. (1995, 2001), where the optimisation was performed by a single regression step between the current image reconstruction residual and an increment to the shape parameters. However, Matthews and Baker (2004), Baker and Matthews (2004) linearised the AAM objective function and optimised it using the Gauss-Newton algorithm. Since then, Gauss-Newton optimisation has been the standard method for optimising AAMs. Numerous extensions have been published, related either to the optimisation procedure (Papandreou and Maragos 2008; Tzimiropoulos and Pantic 2013; Alabort-i-Medina and Zafeiriou 2014, 2015; Tzimiropoulos and Pantic 2014) or to the model structure (Tzimiropoulos et al. 2012; Antonakos et al. 2014; Tzimiropoulos et al. 2014; Antonakos et al. 2015b, a).
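
In its standard form, the objective that these Gauss-Newton methods optimise is the appearance reconstruction error

\[
\{\mathbf{p}^{*}, \mathbf{c}^{*}\} = \mathop{\arg\min}_{\mathbf{p},\, \mathbf{c}} \left\| \mathbf{i}\!\left(\mathcal{W}(\mathbf{p})\right) - \bar{\mathbf{a}} - \mathbf{A}\mathbf{c} \right\| ^{2}
\]

where \(\mathcal{W}(\mathbf{p})\) is the warp controlled by the shape parameters \(\mathbf{p}\), \(\mathbf{i}(\mathcal{W}(\mathbf{p}))\) is the image vectorised after warping, \(\bar{\mathbf{a}}\) is the mean appearance, and \(\mathbf{A}\) and \(\mathbf{c}\) are the appearance basis and parameters. Gauss-Newton methods linearise \(\mathbf{i}(\mathcal{W}(\mathbf{p}))\) around the current estimate of \(\mathbf{p}\) and iterate additive or compositional parameter updates; the cited extensions differ mainly in how this linearisation and composition are performed.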

In the recent challenges by Sagonas et al. (2013a, 2015), discriminative methods have been shown to represent the current state-of-the-art. However, in order to enable a fair comparison between different types of methods, we selected a representative set of landmark localisation methods to compare in this paper. The set of landmark localisation methods used in the paper is given in Table 3. We chose ERT (Kazemi and Sullivan 2014) as it is extremely fast, and the implementation provided by King (2009) is the best known implementation of a tree-based regressor. We chose CFSS (Zhu et al. 2015) as it is the current state-of-the-art on the data provided by the 300W competition of Sagonas et al. (2013a). We used the Gauss-Newton part-based AAM of Tzimiropoulos and Pantic (2014) as the top performing generative localisation method, as provided by the Menpo Project (Alabort-i-Medina et al. 2014). Finally, we also evaluate an SDM (Xiong and De la Torre 2013), as implemented by Alabort-i-Medina et al. (2014), as a baseline.

4 Experiments

In this section, details of the experimental evaluation are established. Firstly, the datasets employed for evaluation, training and validation are introduced in Sect. 4.1. Next, Sect. 4.2 provides details of the training procedures and of the implementations that are relevant to all experiments. Following this, in Sects. 4.3–4.7, we describe the set of experiments that were conducted in this paper, which are summarised in Table 4. Finally, Sect. 4.8 compares the best results from the previous experiments to the winners of the 300 VW competition in Shen et al. (2015).

In the following sections, due to the very large number of methodologies taken into account, we provide a summary of all the results as tables and, for clarity, only the top five methods as graphs. Please refer to the supplementary material for an extensive report of the experimental results. Additionally, we provide videos with the tracking results for the experiments of Sects. 4.3 and 4.5 for qualitative comparison.

4.1 Dataset

All the comparisons are conducted on the test set of the 300 VW dataset collected by Shen et al. (2015). This recently introduced dataset contains 114 videos (50 for training and 64 for testing). The videos are separated into the following three categories:

  • Category 1 This category is composed of videos captured in well-lit environments without any occlusions.

  • Category 2 The second category includes videos captured in unconstrained illumination conditions.

  • Category 3 The final category consists of video sequences captured in totally arbitrary conditions (including severe occlusions and extreme illuminations).

Each video includes only one person and is annotated using the 68-point mark-up employed by Gross et al. (2010) and Sagonas et al. (2015) for the Multi-PIE and 300W databases, respectively. All videos include between 1500 and 3000 frames with a large variety of expressions, poses and capturing conditions, which makes the dataset very challenging for deformable facial tracking. A number of exemplar images, which are indicative of the challenges of each category, are provided in Fig. 2. We note that, in contrast to the results of Shen et al. (2015) in the original 300 VW competition, we used the most recently provided annotations, which have been corrected and do not contain missing frames. Therefore, we also provide updated results for the participants of the 300 VW competition.

The public datasets of IBUG (Sagonas et al. 2013a), HELEN (Le et al. 2012), AFW (Zhu and Ramanan 2012) and LFPW (Belhumeur et al. 2013) are employed for training all the landmark localisation methods. This is further explained in Sect. 4.2.1 below.

Fig. 2 Example frames from the 300 VW dataset by Shen et al. (2015). Each row contains 10 exemplar images from each category, that are indicative of the challenges that characterise the videos of the category. a Category 1. b Category 2. c Category 3

4.2 Implementation Details

The authors’ implementations are utilised for the trackers, as outlined in Table 2. Similarly, the face detector implementations are outlined in Table 1. HOG + SVM was provided by the Dlib project of King (2015, 2009), the Weakly Supervised DPM (DPM) (Felzenszwalb et al. 2010) was the model provided by Mathias et al. (2014), with the code of Dubout and Fleuret (2012, 2013) used to perform the detection. Moreover, the Strongly Supervised DPM (SS-DPM) of Zhu and Ramanan (2012) was provided by the authors and, finally, the OpenCV implementation by Bradski (2000) was used for the VJ detector (Viola and Jones 2004). The default parameters were used in all cases. The pre-trained detector models were utilised and only the most confident detection was exported per frame. No effort was made to maximise the overlap with the ground-truth bounding box, since in all videos there is only one person per frame.

For face alignment, as outlined in Table 3, the implementation of CFSS provided by Zhu et al. (2015) is adopted, while the implementations provided by Alabort-i-Medina et al. (2014) in the Menpo Project are employed for the patch-based AAM of Tzimiropoulos and Pantic (2014) and the SDM of Xiong and De la Torre (2013). Lastly, the implementation of ERT (Kazemi and Sullivan 2014) is provided by King (2009) in the Dlib library. For the latter three methods, following the original papers and the code documentation, several parameters were validated and chosen based on the results on a validation set consisting of a few videos from the 300 VW training set.

The details of the parameters utilised for the patch-based AAM, SDM and ERT are the following: for the AAM, we used the algorithm of Tzimiropoulos and Pantic (2014) and applied a 2-level Gaussian pyramid with 4 and 10 shape components, and 60 and 150 appearance components at each scale, respectively. For the SDM, a 4-level Gaussian pyramid was employed; SIFT (Lowe 1999) feature vectors of length 128 were extracted at the first 3 scales, using RootSIFT by Arandjelović and Zisserman (2012), while raw pixel intensities were used at the highest scale.
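The AAM configuration above can be expressed, for instance, through the menpofit package of the Menpo Project (Alabort-i-Medina et al. 2014). The following is a minimal sketch only; the exact class names and keyword arguments are assumptions that may differ between menpofit versions.

```python
# Minimal sketch, assuming menpofit's PatchAAM interface; class names and
# keyword arguments are indicative and may differ between versions.
from menpofit.aam import PatchAAM, LucasKanadeAAMFitter

aam = PatchAAM(
    training_images,                      # menpo Images carrying 68-point shapes
    scales=(0.5, 1.0),                    # 2-level Gaussian pyramid
    max_shape_components=[4, 10],         # shape components per scale
    max_appearance_components=[60, 150],  # appearance components per scale
)
# Gauss-Newton fitting, in the spirit of Tzimiropoulos and Pantic (2014).
fitter = LucasKanadeAAMFitter(aam, n_shape=[4, 10], n_appearance=[60, 150])
```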

Table 5 Exemplar deformable tracking results that are indicative of the fitting quality that corresponds to each error value for all video categories
Table 6 Results for experiment 1 of Sect. 4.3 (detection + landmark localisation) (Color table online)

Part of the experiments was conducted on the cloud software of Koukis et al. (2013) and the web application of Pérez and Granger (2007), while the rest of the functionality was provided by the Python libraries of Alabort-i-Medina et al. (2014) and Pedregosa et al. (2011). The source code, as well as the list of errors for the top methods, will be released to the research community at https://github.com/grigorisg9gr/deformable_tracking_review_ijcv2016.

4.2.1 Landmark Localisation Training

All the landmark localisation methods were trained with respect to the 68 facial point mark-up employed by Sagonas et al. (2013a, 2015) in 300W, while the rest of the parameters were determined via cross-validation. Again, this validation set consisted of frames from the 300 VW trainset, as well as 60 privately collected images with challenging poses. All of the discriminative landmark localisation methods (SDM, ERT, CFSS) were trained on images from the public datasets of IBUG (Sagonas et al. 2013a), HELEN (Le et al. 2012), AFW (Zhu and Ramanan 2012) and LFPW (Belhumeur et al. 2013). The generative AAM was trained on less data, since generative methods do not benefit as strongly from large training datasets. The training data used for the AAM were the recently released 300 images of the 300W dataset (Sagonas et al. 2015), 500 challenging images from LFPW (Belhumeur et al. 2013) and the 135 images of the IBUG dataset (Sagonas et al. 2013a).

Discriminative landmark localisation methods are tightly coupled with the initialisation statistics, as they learn to model a given variance of initialisations. Therefore, it is necessary to re-train each discriminative method for each face detection method employed, which allows the landmark localisation methods to correctly model the large variance between detectors. In aggregate, 5 models are trained per discriminative landmark localisation method: one for each detector and landmark localisation pair (totalling 4), and a single model trained using a validation set that estimates the variance of the ground truth bounding boxes throughout the sequences. This last model is used for all trackers.
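A minimal sketch of this per-detector training loop is given below. The helper names (`bbox_statistics`, `train_localiser`) and the exact perturbation interface are hypothetical stand-ins for the procedure described above, not the paper's implementation.

```python
# Hypothetical helpers: bbox_statistics estimates the offset/scale noise of a
# detector's boxes relative to the ground truth; train_localiser trains one
# discriminative localiser with initialisations drawn from those statistics.
models = {}
for detector in ("DPM", "SS-DPM", "HOG+SVM", "VJ"):
    stats = bbox_statistics(detector, validation_set)      # detector-specific noise
    models[detector] = train_localiser(train_images, perturb_stats=stats)

# A fifth model, trained with ground truth bounding box statistics estimated
# on the validation set, is shared by all model free trackers.
gt_stats = bbox_statistics("ground_truth", validation_set)
models["TRACKER"] = train_localiser(train_images, perturb_stats=gt_stats)
```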

4.2.2 Quantitative Metrics

The errors reported for all the following experiments are with respect to the landmark localisation error. The error metric employed is the mean Euclidean distance of the 68 points, normalised by the diagonal of the ground truth bounding box (\(\sqrt{width^2 + height^2}\)). This metric was chosen as it is robust to the changes in head pose that are frequent within the 300 VW sequences. The graphs shown are cumulative error distribution (CED) plots, which give the proportion of images with an error less than or equal to a particular value. We also provide summary tables with respect to the Area Under the Curve (AUC) of the CED plots, considered up to a maximum error. Errors above this maximum threshold, which is fixed to 0.08, are considered failures to accurately localise the facial landmarks. Therefore, we also report the failure rate, as a percentage, which marks the proportion of images that are not considered within the CED plots. Table 5 shows some indicative examples of the deformable fitting quality that corresponds to each error value for all video categories. When ranking methods, we consider the AUC as the primary statistic and only resort to the failure rate in cases where there is little distinction between methods' AUC values.
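The metric and the summary statistics can be computed in a few lines of numpy. The sketch below follows the description above; the CED binning resolution is our own assumption.

```python
import numpy as np

def localisation_error(pred, gt_points, gt_bbox):
    """Mean Euclidean distance over the 68 points, normalised by the
    diagonal of the ground truth bounding box (width, height)."""
    w, h = gt_bbox
    diag = np.sqrt(w ** 2 + h ** 2)
    return np.mean(np.linalg.norm(pred - gt_points, axis=1)) / diag

def ced_statistics(errors, max_error=0.08, n_bins=1000):
    """AUC of the CED up to max_error (rescaled to [0, 1]) and the failure
    rate, i.e. the percentage of frames exceeding max_error; frames with a
    missed detection carry infinite error and count as failures."""
    errors = np.asarray(errors, dtype=float)
    thresholds = np.linspace(0.0, max_error, n_bins)
    ced = [(errors <= t).mean() for t in thresholds]
    auc = np.trapz(ced, thresholds) / max_error
    failure_rate = 100.0 * (errors > max_error).mean()
    return auc, failure_rate
```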

The indicative speed (timing) figures reported in the results are measured on 100 frames of a single video at \(640 \times 360\) resolution. Note that the utilised detectors' performance is highly affected by the resolution. The times were measured on a single machine with an i7 processor at 3.6 GHz and 8 GB RAM, entirely in CPU mode, and are reported in seconds. The implementations were not optimised to minimise computational complexity, i.e. the public C/C++ implementations have a considerable advantage.

Fig. 3 Results for experiment 1 of Sect. 4.3 (detection + landmark localisation). The top 5 performing curves are highlighted in each legend. Please see Table 6 for a full summary

Fig. 4 Diagram of the reinitialisation scheme proposed in Sect. 4.4: if the face detector does not return a bounding box for a frame, the bounding box of the previous frame is used as a successful detection for the missing frame

Table 7 Results for experiment 2 of Sect. 4.4 (detection + landmark localisation + initialisation from previous frame) (Color table online)

4.3 Experiment 1: Detection and Landmark Localisation

In this experiment, we validate the most frequently employed facial deformable tracking strategy, i.e. performing face detection followed by landmark localisation on each frame independently. If a detector fails to return a detection for a frame, that frame is assigned infinite error and thus appears among the failures in Table 6. Note that the AUC is robust to the use of infinite errors. In frames where multiple bounding boxes are returned, the box with the highest confidence is kept, limiting the output of each detector to a single bounding box per image. A high level diagram of the detection procedure for this experiment is given in Fig. 1.
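The per-frame pipeline can be sketched as follows. `detector` and `localiser` are hypothetical callables standing in for the methods of Tables 1 and 3; this is a sketch of the procedure, not the released implementation.

```python
def detect_and_localise(frames, detector, localiser):
    """Experiment 1 pipeline: detect, keep the most confident bounding box,
    then localise the 68 landmarks inside it. Frames without a detection
    yield None, which is later scored as infinite error (a failure)."""
    results = []
    for frame in frames:
        detections = detector(frame)      # list of (bbox, confidence) pairs
        if not detections:
            results.append(None)          # missed frame: infinite error
            continue
        bbox, _ = max(detections, key=lambda d: d[1])  # most confident box
        results.append(localiser(frame, bbox))         # 68 x 2 landmark array
    return results
```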

Specifically, in this experiment we consider the 8 face detectors of Table 1 (DPM, HR-TF, MTCNN, NPD, SS-DPM, HOG \(+\) SVM, VJ, VPHR) with the 4 landmark localisation techniques of Table 3 (AAM, CFSS, ERT, SDM), for a total of 32 results. The results of the experiment are given in Table 6 and Fig. 3. The results indicate that the AAM performs poorly, achieving the lowest performance across all face detectors. The discriminative CFSS and ERT landmark localisation methods consistently outperform SDM. From the detectors' point of view, the strongly supervised DPM (SS-DPM) is the worst, yielding the highest failure rates. On the other hand, the weakly supervised DPM (DPM) outperforms the rest of the detectors in the first two categories in terms of both accuracy (i.e. AUC) and robustness (i.e. failure rate), while in the third category the deep detector of Zhang et al. (2016) marginally outperforms DPM. In all three categories the state-of-the-art deep networks achieve top results; however, they do not appear consistently better than DPM or the VPHR of Kumar et al. (2015). The detailed graphs per method (32 in total), as well as a video with the results of the top five methods (see footnote 5), are deferred to the supplementary material.

4.4 Experiment 2: Detection and Landmark Localisation with Reinitialisation

Complementing the experiments of Sect. 4.3, the same set-up was utilised to study the effect of missed frames by assuming a first order Markov dependency: if the detector does not return a bounding box in a frame, the bounding box of the previous frame is used as a successful detection for the missing frame. This procedure is depicted in Fig. 4. Given that the frame rate of the input videos is adequately high (over 20 fps), this is a reasonable assumption. The results of this experiment are summarised in Table 7 and Fig. 5. As expected, the ranking of the methods is almost identical to that of the previous experiment of Sect. 4.3, with the minor differences arising from the detection thresholds of the different detectors. For instance, HOG \(+\) SVM, which has a high threshold and therefore 'missed' several challenging frames in the previous experiment, benefits further from the Markov dependency, while VPHR has exactly the same statistics, as it returned a detection in every single frame in the previous experiment.
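Under the same hypothetical callables as in the sketch of Sect. 4.3, the Markov fallback amounts to a small change in the per-frame loop:

```python
def detect_with_markov_fallback(frames, detector, localiser):
    """Experiment 2 variant: when the detector returns nothing, the bounding
    box of the previous frame is reused as if it were a successful detection
    (first order Markov assumption)."""
    results, prev_bbox = [], None
    for frame in frames:
        detections = detector(frame)
        if detections:
            prev_bbox, _ = max(detections, key=lambda d: d[1])
        if prev_bbox is None:          # no detection yet in the sequence
            results.append(None)
            continue
        results.append(localiser(frame, prev_bbox))
    return results
```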

In order to better investigate the effect of this reinitialisation scheme, we also provide Fig. 6, which directly shows the improvement. Specifically, we plot the CED curves with and without the reinitialisation strategy for 3 top performing methods, as well as the 3 techniques for which the highest improvement is achieved. It becomes evident that the top performing methods of Sect. 4.3 do not benefit from reinitialisation, since the improvement is marginal; this is explained by the fact that these methods already achieve a very high true positive rate. The largest difference is observed for methods that utilise the AAM. As shown by Antonakos et al. (2015b), AAMs are very sensitive to initialisation, due to the nature of Gauss-Newton optimisation. Additionally, note that we have not attempted any kind of greedy approach for improving the detectors' bounding boxes in order to provide a better AAM initialisation. Since the initialisation of a frame with a failed detection is given by the bounding box of the previous frame's landmarks, it is highly likely that its area is well constrained to include only the facial parts and not the forehead or background. This kind of initialisation is very beneficial for AAMs, which justifies the large improvements shown in Fig. 6. For the graphs that correspond to all 32 methods, please refer to the supplementary material.

4.5 Experiment 3: Model-free Tracking and Landmark Localisation

In this section, we provide, to the best of our knowledge, the first detailed analysis of the performance of model free trackers on "in-the-wild" facial sequences. For this reason, we have considered a large number of trackers in an attempt to give a balanced overview of the performance of modern model free trackers for deformable face alignment. The 27 trackers considered in this section are summarised in Table 2. To initialise all trackers, the tightest possible bounding box of the ground truth facial landmarks is provided as the initial tracker state. We also include a baseline method, referred to as PREV in Table 8, defined as applying the landmark localisation methods initialised from the bounding box of the result in the previous frame. This scheme is obviously highly sensitive to drifting, and we have therefore included it as a basic baseline that does not involve any model free tracking. A high level diagram explaining the procedure for this experiment is given in Fig. 1.

Specifically, in this experiment we consider the 27 model free trackers of Table 2, plus the PREV baseline, with the 4 landmark localisation techniques of Table 3 (AAM, CFSS, ERT, SDM), for a total of 112 results. The results of the experiment are given in Table 8 and Fig. 7. Please see the supplementary material for full statistics.

By inspecting the results, we first notice that most generative trackers perform poorly (i.e. ORIA, DF, FCT, IVT), except LRST, which achieves decent performance on the most challenging video category. The discriminative approaches of SRDCF and SPOT consistently perform very well; however, the trackers employing deep neural networks achieve the most accurate outcomes, consistent with the latest VOT competition results. Additionally, similar to the face detection experiments, the combination of all trackers with CFSS returns the best results, whereas the AAM consistently demonstrates the poorest performance. Finally, it becomes evident that a straightforward application of the simplistic baseline approach (PREV) is not suitable for deformable tracking, even though it surprisingly outperforms some model free trackers, such as DF, ORIA and FCT. For the curves that correspond to all 112 methods, as well as a video with the tracking results of the top five methods (see footnote 6), please refer to the supplementary material.

Fig. 5 Results for experiment 2 of Sect. 4.4 (detection + landmark localisation + initialisation from previous frame). The top five performing curves are highlighted in each legend. Please see Table 7 for a full summary

Fig. 6 Results for experiment 2 of Sect. 4.4 (detection + landmark localisation + initialisation from previous frame). These results show the effect of initialisation from the previous frame, in comparison to missing detections. The top three performing results are given in red, green and blue, respectively, and the top three most improved are given in cyan, yellow and brown, respectively. The dashed lines represent the results before the reinitialisation strategy is applied, solid lines after (Color figure online)

4.6 Experiment 4: Failure Checking and Tracking Reinitialisation

Complementing the experiments of Sect. 4.5, we investigate the improvement in performance gained by performing failure checking during tracking. Here we define failure checking as the process of determining whether or not the currently tracked object is a face. Given that we have prior knowledge of the class of object we are tracking, namely faces, we can train an offline classifier that attempts to determine whether a given input is a face or not. Furthermore, since we are also applying landmark localisation, we can perform a strong classification by using the facial landmarks as position priors when extracting features for the failure checking. To train the failure checking classifier, we follow the methodology below:

Table 8 Results for experiment 3 of Sect. 4.5 (model free tracking + landmark localisation) (Color table online)
Fig. 7 Results for experiment 3 of Sect. 4.5 (model free tracking + landmark localisation). The top five performing curves are highlighted in each legend. Please see Table 8 for a full summary

  1. For all images in the landmark localisation training set, extract a fixed-size patch around each of the 68 landmarks and compute HOG (Dalal and Triggs 2005) features for each patch. These patches are the positive training samples.

  2. Generate negative training samples by perturbing the ground truth bounding box, extracting fixed-size patches and computing HOG.

  3. Train an SVM classifier using the positive and negative samples.

For the experiments in this section, we use a fixed patch size of \(18 \times 18\) pixels, with 100 negative patches sampled for each positive patch. The failure checking classification threshold is chosen via cross-validation on two sequences from the 300 VW training videos. Any hyper-parameters of the SVM are also tuned using these two validation videos. A minimal sketch of this training procedure is given below.
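The sketch below follows the three steps above using scikit-learn and scikit-image. The HOG cell/block settings, the `training_set` iterable and the bounding box `perturb` helper are assumptions not specified in the text; landmarks are assumed to lie away from the image border.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

PATCH = 18  # fixed 18 x 18 patch around each of the 68 landmarks

def landmark_features(image, landmarks):
    """Concatenated HOG descriptors of the patch centred on each landmark;
    the image is assumed grayscale and the HOG parameters are assumptions."""
    feats = []
    for (x, y) in landmarks.astype(int):
        patch = image[y - PATCH // 2:y + PATCH // 2,
                      x - PATCH // 2:x + PATCH // 2]
        feats.append(hog(patch, pixels_per_cell=(6, 6), cells_per_block=(2, 2)))
    return np.concatenate(feats)

# Positives: patches around the ground truth landmarks. Negatives: 100
# perturbations of the ground truth bounding box per positive sample, where
# perturb() is a hypothetical helper that shifts/scales the box and its points.
X, y = [], []
for image, landmarks in training_set:
    X.append(landmark_features(image, landmarks)); y.append(1)
    for _ in range(100):
        X.append(landmark_features(image, perturb(landmarks))); y.append(0)

classifier = LinearSVC(C=1.0).fit(np.array(X), np.array(y))  # C set by validation
```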

Given the failure detector, our restart procedure is as follows:

  • Classify the current frame to determine if the tracking has failed. If a failure is verified, perform a restart, otherwise continue.

  • Following the convention of the VOT challenges by Kristan et al. (2013, 2014, 2015), we attempt to reduce the probability that poor trackers will overly rely on the output of the failure detection system. In the worst case, a very poor tracker would fail on most frames, and thus the accuracy of the detector would be validated rather than the tracker itself. Therefore, when a failure is identified, the tracker is allowed to continue for 10 more frames. The results from the drifting tracker are used in these 10 frames in order to reduce the effect of the detector. The tracker is then reinitialised at the frame where the failure was first detected. The next 10 frames, as previously described, already have results computed and therefore no landmark localisation or failure checking is performed on them. At the 11th frame, the tracker continues as normal, with landmark localisation and failure checking.

  • In the unlikely event that the detector fails to detect the face, the previous frame is used as described in Sect. 4.4.

Fig. 8 gives a pictorial representation of this scheme; a simplified code sketch follows.
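The sketch below is a simplified rendering of the restart procedure under the stated assumptions; `tracker` (with `init`/`track` methods), `localise`, `is_face` and `detect` are hypothetical stand-ins, and the bookkeeping that replays the reinitialised tracker through the grace period is omitted.

```python
def track_with_failure_checking(frames, tracker, localise, is_face, detect,
                                grace=10):
    """Simplified sketch of the Sect. 4.6 restart scheme: on a verified
    failure, keep the drifting results for `grace` frames, reinitialise the
    tracker at the failing frame, and resume failure checking afterwards."""
    results, prev_box, skip_until = [], None, -1
    for t, frame in enumerate(frames):
        box = tracker.track(frame)
        shape = localise(frame, box)
        results.append(shape)
        if t < skip_until:
            continue                        # grace period: no failure checking
        if not is_face(frame, shape):       # failure verified at frame t
            # Reinitialise at the failing frame; if the detector also fails,
            # fall back to the previous frame's box (Sect. 4.4).
            new_box = detect(frame) or prev_box
            if new_box is not None:
                tracker.init(frame, new_box)
            skip_until = t + grace          # resume checking at the 11th frame
        prev_box = box
    return results
```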

Fig. 8 Diagram of the reinitialisation scheme proposed in Sect. 4.6 for tracking with failure detection. For all frames after the first, the result of the current landmark localisation is used to decide whether or not a face is still being tracked. If the classification fails, a re-detection is performed and the tracker is reinitialised with the bounding box returned by the detector

Table 9 Results for experiment 4 of Sect. 4.6 (model free tracking + landmark localisation + failure checking) (Color table online)
Fig. 9 Results for experiment 4 of Sect. 4.6 (model free tracking + landmark localisation + failure checking). The top five performing curves are highlighted in each legend. Please see Table 9 for a full summary

Fig. 10 Results for experiment 4 of Sect. 4.6 (model free tracking + landmark localisation + failure checking). These results show the effect of the failure checking, in comparison to only tracking. The results are coloured by their performance red, green, blue and orange, respectively. The dashed lines represent the results before the reinitialisation strategy is applied, solid lines after (Color figure online)

The results of this experiment are given in Table 9 and Fig. 9. In contrast to Sect. 4.5, we only perform the experiments on a subset of the trackers, combined with CFSS. We use three of the top performing trackers (SRDCF, RPT, SPOT), as well as FCT, which had mediocre performance in Sect. 4.5. The results indicate that SRDCF is the best model free tracking methodology for the task.

In order to better investigate the effect of this failure checking scheme, we also provide Fig. 10, which shows the differences between the initial tracking results of Sect. 4.5 and the results after applying failure detection. The performance of the top trackers (i.e. SRDCF, SPOT, RPT) does not improve much, which is expected since they already return a robust tracking result. However, FCT benefits from the failure checking process, which apparently minimises its drifting issues.

Table 10 Results for experiment 5 of Sect. 4.7 (Kalman Smoothing) (Color table online)

4.7 Experiment 5: Kalman Smoothing

In this section, we report the effect of performing Kalman smoothing (Kalman 1960) on the results of the detectors of Sect. 4.3 and the trackers of Sect. 4.5. This experiment is designed to highlight the stability of the current landmark localisation methods with respect to noisy movement between frames (or jitter, as it is often known). When attempting to smooth the trajectories of the tracked bounding boxes themselves, we found an extremely negative effect on the results; therefore, to remove jitter we perform Kalman smoothing on the landmarks themselves. To robustly smooth the landmark trajectories, a generic facial shape model is constructed in a similar manner as described in the AAM literature by Cootes et al. (2001). Specifically, given the sparse shape of the face consisting of n landmark points, we denote the coordinates of the i-th landmark point within the Cartesian space of the image \(\mathbf {I}\) as \(\mathbf {x}_i=[x_i,y_i]^T\). Then a shape instance of the face is given by the \(2n\times 1\) vector \(\mathbf {s} = \left[ \mathbf {x}_1^T,\ldots ,\mathbf {x}_n^T\right] ^T = \left[ x_1,y_1,\ldots ,x_n,y_n\right] ^T\). Given a set of N such shape samples \(\{\mathbf {s}^1,\ldots ,\mathbf {s}^N\}\), a parametric statistical subspace of the object's shape variance can be retrieved by first applying Generalised Procrustes Analysis on the shapes to normalise them with respect to the global similarity transform (i.e., scale, in-plane rotation and translation) and then using Principal Component Analysis (PCA). The resulting shape model, denoted as \(\{\mathbf {U}_s,\bar{\mathbf {s}}\}\), consists of the orthonormal basis \(\mathbf {U}_s\in \mathbb {R}^{2n\times n_s}\) with \(n_s\) eigenvectors and the mean shape vector \(\bar{\mathbf {s}}\in \mathbb {R}^{2n}\). This parametric model can be used to generate new shape instances as \(\mathbf {s}(\mathbf {p})=\bar{\mathbf {s}}\,+\,\mathbf {U}_s\mathbf {p}\), where \(\mathbf {p}=[p_1,\ldots ,p_{n_s}]^T\) is the \(n_s\times 1\) vector of shape parameters that control the linear combination of the eigenvectors. The Kalman smoothing is thus learnt via Expectation-Maximisation (EM) on the parameters \(\mathbf {p}\) of each shape within a sequence.
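Under the definitions above, the smoothing pipeline can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: `shapes` (the N × 2n matrix of Procrustes-aligned training shapes), `sequence_shapes` (the fitted shapes of one video) and the number of components `n_s` are assumed inputs, and pykalman is an assumed choice of EM-based Kalman smoother.

```python
import numpy as np
from pykalman import KalmanFilter

# Assumed inputs: `shapes` is an (N, 2n) matrix of GPA-aligned shape vectors
# s = [x_1, y_1, ..., x_n, y_n]; `sequence_shapes` is an (n_frames, 2n) matrix
# of fitted shapes from one video; `n_s` is the number of retained components.
s_bar = shapes.mean(axis=0)                       # mean shape \bar{s}
_, _, Vt = np.linalg.svd(shapes - s_bar, full_matrices=False)
U_s = Vt[:n_s].T                                  # (2n, n_s) orthonormal basis

# Shape parameters of every frame: p = U_s^T (s - \bar{s})
P = (sequence_shapes - s_bar) @ U_s               # (n_frames, n_s) trajectories

# EM-learnt Kalman smoothing of the parameter trajectories.
kf = KalmanFilter(n_dim_state=n_s, n_dim_obs=n_s).em(P, n_iter=5)
P_smooth, _ = kf.smooth(P)

# Reconstruct the smoothed landmarks: s(p) = \bar{s} + U_s p
smoothed_shapes = s_bar + P_smooth @ U_s.T
```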

The results of this experiment are given in Table 10 and Fig. 11. These experiments also provide a direct comparison between the best detection and model free tracking based techniques. In categories 1 and 2, Kalman smoothing applied to the model free trackers followed by the discriminative landmark localisation methods of ERT or CFSS scores best, with the trackers MDNET and SRDCF being the top performers. In category 3, the DPM and the deep detector MTCNN achieve the top performance, because they are less prone to drifting (in comparison to trackers) in the most challenging clips of the dataset.

In order to better investigate the effect of the smoothing, we also provide Fig. 12, which shows the differences between the initial tracking results and the results after applying Kalman smoothing. This comparison is shown for the best methods of Table 10. It becomes obvious that the improvement introduced by Kalman smoothing is consistent, but marginal.

Fig. 11 Results for experiment 5 of Sect. 4.7 (Kalman Smoothing). The top five performing curves are highlighted in each legend. Please see Table 10 for a full summary

Fig. 12 Results for experiment 5 of Sect. 4.7 (Kalman Smoothing). These results show the effect of Kalman smoothing on the final landmark localisation results. The top three performing results are given in red, green and blue, respectively, and the top three most improved are given in cyan, yellow and brown, respectively. The dashed lines represent the results before the smoothing is applied, solid lines after (Color figure online)

Table 11 Comparison between the best methods of Sects. 4.3–4.7 and the participants of the 300 VW challenge by Shen et al. (2015) (Color table online)
Fig. 13 Comparison between the best methods of Sects. 4.3–4.7 and the participants of the 300 VW challenge by Shen et al. (2015). The top five methods are shown, coloured red, blue, green, orange and purple, respectively. Please see Table 11 for a full summary (Color figure online)

4.8 300 VW Comparison

In this section we compare the best performing methods of the previous sections (Sects. 4.3–4.7) to the participants of the 300 VW challenge by Shen et al. (2015). The challenge had 5 competitors. Rajamanoharan and Cootes (2015) employ a multi-view Constrained Local Model (CLM) with a global shape model and different response maps per pose, and explore shape-space clustering strategies to determine the optimal pose-specific CLM. Uricar and Franc (2015) apply a DPM at each frame, as well as Kalman smoothing on the face positions. Wu and Ji (2015) utilise a shape augmented regression model, where the regression function is automatically selected based on the facial shape. Xiao et al. (2015) propose a multi-stage regression-based approach that progressively provides initialisations for ambiguous landmarks, such as the boundary and eyebrows, based on landmarks with semantically strong meaning, such as the eyes and mouth corners. Finally, Yang et al. (2015a) employ a multi-view spatio-temporal cascade shape regression model along with a novel reinitialisation mechanism.

The results are summarised in Table 11 and Fig. 13. Note that the error metric considered in this paper (as described in Sect. 4.2.2) differs from that of the original competition; this was intended to improve the robustness of the results with respect to variation in pose. Also, as noted in Sect. 4.2, the 300 VW annotations have been corrected, and thus this experiment represents updated results for the 300 VW competitors. The results indicate that Yang et al. (2015a) outperforms the rest of the methods on the videos of categories 1 and 2, whereas the deep network of Zhang et al. (2016), combined with CFSS and either Kalman smoothing or initialisation from the previous frame, is the top performer on the challenging videos of category 3. Moreover, it becomes evident that methodologies which employ face detection dominate category 3, whereas in categories 1 and 2 the model free trackers dominate.

5 Discussion and Conclusions

In Sect. 4 we presented a number of experiments on deformable tracking of sequences containing a single face. We investigated the performance of state-of-the-art face detectors and model free trackers on the recently released 300 VW dataset (see footnote 1). We also devised a number of hybrid systems that attempt to improve the performance of both detectors and trackers with respect to tracking failures. A summary of the proposed experiments is given in Table 4.

Overall, it appears that modern detectors are capable of handling videos of the complexity provided by the 300 VW dataset. This supports the most commonly proposed deformable face tracking methodology, which couples a detector with a landmark localisation algorithm. More interestingly, it appears that modern model free trackers are also highly capable of tracking videos that contain variations in pose, expression and illumination. This is particularly evident in the videos of category 2, where the model free trackers perform best. The performance on the videos of category 2 is likely due to the decreased amount of pose variation in comparison to the other two categories; category 2 contains many illumination variations, to which model free trackers appear largely invariant. Our work also supports the most recent model free tracking benchmarks (Kristan et al. 2015; Wu et al. 2015), which have demonstrated that DCF-based trackers are currently the most competitive, along with the deep neural network approaches. However, the performance of the trackers deteriorates significantly in category 3, which supports the designation of these videos in 300 VW as the most difficult category. The difficulty of the videos of category 3 largely stems from the amount of pose variation present, which both detectors and model free trackers struggle with.

The DPM detector provided by Mathias et al. (2014) is very robust across a variety of poses and illumination conditions. The more recent face detector of Zhang et al. (2016) outperforms the rest of the employed methods in the challenging category 3, yet it appears less robust than the DPM detector in the easier categories. Among the recent advances in model free tracking, the MDNET tracker of Nam and Han (2016) emerges as a top performing method, outperforming the pre-trained detectors in the first two categories. MDNET belongs to the family of discriminatively learned Convolutional Neural Network trackers; its architecture comprises several shared CNN layers along with a branched last layer during training. During inference, the last layer is discarded and a new layer that is updated online is added. This online update capability of the last layer makes the tracker very robust to abrupt changes and a top performing method in all tracking benchmarks. The SRDCF tracker of Danelljan et al. (2015), from the category of trackers with discriminatively learned correlation filters (DCF), constitutes an alternative top performing method. DCF trackers are currently a very popular choice for bounding box based tracking; they capitalise on a periodic assumption of the training samples to efficiently learn a classifier over all patches in the target neighbourhood. Nevertheless, the periodic assumption may introduce unwanted boundary effects, which severely degrade the quality of the tracking model. SRDCF incorporates a spatial regularisation component in the learning that penalises correlation filter coefficients depending on their spatial location, as sketched below. The CFSS landmark localisation method of Zhu et al. (2015) outperforms all other considered landmark localisation methods, although the random forest based ERT method of Kazemi and Sullivan (2014) also performed very well. In contrast to conventional Cascade Regression approaches that iteratively refine an initial shape in a cascaded manner, CFSS explores a diverse shape space and employs a probabilistic heuristic to constrain the finer search in the subsequent cascade levels. The authors argue that this procedure prevents the final solution from being trapped in a local optimum, unlike similar regression techniques. Our experimental results support the claim of the authors of Zhu et al. (2015), as the videos contain very challenging pose variations.
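For reference, a sketch of the objective minimised by SRDCF, following Danelljan et al. (2015): training samples \(x_k\) with d feature channels and desired correlation outputs \(y_k\) are weighted by \(\alpha_k\), while a spatial weight function w, growing with the distance from the target centre, penalises filter coefficients \(f^l\) that lie far from the target.

```latex
% Spatially regularised DCF objective (after Danelljan et al. 2015):
% the first term is the standard weighted correlation-filter data fit,
% the second the spatial penalty that suppresses boundary coefficients.
\varepsilon(f) \;=\; \sum_{k=1}^{t} \alpha_k
    \Bigl\| \sum_{l=1}^{d} x_k^l \ast f^l \,-\, y_k \Bigr\|^2
  \;+\; \sum_{l=1}^{d} \bigl\| w \cdot f^l \bigr\|^2
```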

The stable performance of both the best model free trackers and detectors on these videos is further demonstrated by the minimal improvement gained from the proposed hybrid systems. Neither reinitialisation from the previous frame (Sect. 4.4), nor the proposed failure detection methodology (Sect. 4.6), improved the best performing methods with any significance. Such hybrid systems could nevertheless be very useful in cases of person re-appearance or multiple person cross-overs. Furthermore, smoothing the facial shapes across the sequences (Kalman smoothing) also yielded only a very small positive improvement, which can be attributed to the human factor; nonetheless, such smoothing could be more useful for reducing the amount of jitter in consecutive frames.

In comparison to the recent results of the 300 VW competition (Shen et al. 2015), our review of combinations of modern state-of-the-art detectors and trackers found that very strong performance can be obtained with fairly simple deformable tracking schemes. In fact, only the work of Yang et al. (2015a) outperforms our best performing methods in the easier categories 1 and 2, and the difference, as shown in Fig. 13, appears marginal. However, the overall results show that, particularly for videos that contain significant pose variation, there are still improvements to be made.

To summarise, there are a number of important issues that must be tackled in order to improve deformable face tracking:

  1. Pose is still a challenging issue for landmark localisation methods. In fact, the videos of 300 VW do not even exhibit the full range of possible facial pose, as they do not contain profile faces. The challenges of profile faces have yet to be adequately addressed and have not been verified with respect to current state-of-the-art benchmarks.

  2. In this work, we only consider videos that contain a single visible face. However, there are many scenarios in which multiple faces may be present, and this represents further challenges to deformable tracking. Detectors, for example, are particularly vulnerable in multi-object tracking scenarios, as they require extending with the ability to determine whether the object being localised is the same as in the previous frame.

  3. It is very common for objects to leave the frame of the camera during a sequence and then reappear. Few model free trackers are robust to reinitialisation after an object has disappeared and then reappeared. When combined with multiple objects, this scenario becomes particularly challenging, as it requires a re-identification step to verify whether the object to be tracked is one that was seen before.

We believe that deformable face tracking is a very exciting line of research and that future advances in the field can have an important impact on several areas of Computer Vision.