A Comprehensive Performance Evaluation of Deformable Face Tracking “In-the-Wild”

Recently, technologies such as face detection, facial landmark localisation and face recognition and verification have matured enough to provide effective and efficient solutions for imagery captured under arbitrary conditions (referred to as “in-the-wild”). This is partially attributed to the fact that comprehensive “in-the-wild” benchmarks have been developed for face detection, landmark localisation and recognition/verification. A very important technology that has not been thoroughly evaluated yet is deformable face tracking “in-the-wild”. Until now, the performance has mainly been assessed qualitatively by visually assessing the result of a deformable face tracking technology on short videos. In this paper, we perform the first, to the best of our knowledge, thorough evaluation of state-of-the-art deformable face tracking pipelines using the recently introduced 300 VW benchmark. We evaluate many different architectures focusing mainly on the task of on-line deformable face tracking. In particular, we compare the following general strategies: (a) generic face detection plus generic facial landmark localisation, (b) generic model free tracking plus generic facial landmark localisation, as well as (c) hybrid approaches using state-of-the-art face detection, model free tracking and facial landmark localisation technologies. Our evaluation reveals future avenues for further research on the topic. Electronic supplementary material The online version of this article (doi:10.1007/s11263-017-0999-5) contains supplementary material, which is available to authorized users.

The human face is arguably among the most well-studied deformable objects in the field of Computer Vision. This is due to the many roles it has in numerous applications. For example, accurate detection of faces is an essential step for tasks such as controller-free gaming, surveillance, digital photo album organization, image tagging, etc. Additionally, detection of facial features plays a crucial role for facial behaviour analysis, facial attributes analysis (e.g., gender and age recognition, etc.), facial image editing (e.g., digital make-up, etc.), surveillance, sign language recognition, lip reading, human-computer and human-robot interaction.
Due to the above applications, current research has been monopolised by the tasks of face detection, facial landmark localisation and face recognition or verification. Firstly, face detection, despite having permeated many forms of modern technology such as digital cameras and social networking, is still a challenging problem and a popular line of research, as shown by the recent surveys of Jain and Learned-Miller (2010); Zhang and Zhang (2010); Zafeiriou et al (2015). Although face detection on well-lit frontal facial images can be performed reliably on an embedded device, face detection on arbitrary images of people is still extremely challenging (Jain and Learned-Miller (2010)). Images of faces under these unconstrained conditions are commonly referred to as "in-the-wild" and may include scenarios such as extreme facial pose, defocus, faces occupying a very small number of pixels or occlusions. Given the fact that face detection is still regarded as a challenging task, many generic object detection architectures such as Yan et al (2014); King (2015) are either directly assessed on in-the-wild facial data, or are appropriately modified in order to explicitly perform face detection as done by Zhu and Ramanan (2012); Felzenszwalb and Huttenlocher (2005). The interested reader may refer to the most recent survey by Zafeiriou et al (2015) for more information on in-the-wild face detection.
The problem of localising facial landmarks that correspond to fiducial facial parts (e.g., eyes, mouth, etc.) is still extremely challenging and has only been possible to perform reliably relatively recently. Although the history of facial landmark localisation spans back many decades (Cootes et al (1995, 2001)), the ability to accurately recover facial landmarks on in-the-wild images has only become possible in recent years (Matthews and Baker (2004); Papandreou and Maragos (2008); Saragih et al (2011); Cao et al (2014)). Much of this progress can be attributed to the release of large annotated datasets of facial landmarks (Sagonas et al (2013b,a); Zhu and Ramanan (2012); Le et al (2012); Belhumeur et al (2013); Köstinger et al (2011)) and very recently the area of facial landmark localisation has become extremely competitive with recent works including Xiong and De la Torre (2013); Ren et al (2014); Kazemi and Sullivan (2014); Zhu et al (2015); Tzimiropoulos (2015). For a recent evaluation of facial landmark localisation methods the interested reader may refer to the survey by Wang et al (2014) and to the results of the 300W competition by Sagonas et al (2015).
Finally, face recognition and verification are extremely popular lines of research. For the past two decades, the majority of statistical machine learning algorithms spanning from linear/non-linear subspace learning techniques (De la Torre (2012); Kokiopoulou et al (2011)) to Deep Convolutional Neural Networks (DCNNs) (Taigman et al (2014); Schroff et al (2015); Parkhi et al (2015)) have been applied to the problem of face recognition and verification. Recently, due to the revival of DCNNs, as well as the development of Graphics Processing Units (GPUs), remarkable face verification performance has been reported (Taigman et al (2014)). The interested reader may refer to the recent survey by Learned-Miller et al (2016) as well as the most popular benchmark for face verification in-the-wild in Huang et al (2007).
In all of the aforementioned fields, significant progress has been reported in recent years. The primary reasons behind these advances are:
- The collection and annotation of large databases.
- The establishment of in-the-wild benchmarks and challenges that provide a fair comparison between state-of-the-art techniques. FDDB (Jain and Learned-Miller (2010)), 300W (Sagonas et al (2013a, 2015)) and Janus (Klare et al (2015)) are the most characteristic examples for face detection, facial landmark localisation and face recognition, respectively.
Contrary to face detection, facial landmark localisation and face recognition, the problem of deformable face tracking across long-term sequences has yet to attract much attention, despite its crucial role in numerous applications. Given the fact that cameras are embedded in many common electronic devices, it is surprising that current research has not yet focused towards providing robust and accurate solutions for long-term deformable tracking. Almost all face-based applications, including facial behaviour analysis, lip reading, surveillance, human-computer and human-robot interaction etc., require accurate continuous tracking of the facial landmarks. The facial landmarks are commonly used as input signals of higher-level methodologies to compute motion dynamics and deformations. The performance of currently available technologies for facial deformable tracking has not been properly assessed (Yacoob and Davis (1996); Essa et al (1996, 1997); Decarlo and Metaxas (2000); Koelstra et al (2010); Snape et al (2015)). This is attributed to the fact that, until recently, there was no established benchmark for the task. At ICCV 2015, the first benchmark for facial landmark tracking (so-called 300VW) was presented by Shen et al (2015), providing a large number of annotated videos captured in-the-wild. In particular, the benchmark provides 114 videos with average duration around 1 minute, split into three categories of increasing difficulty. The frames of all videos (218,595 in total) were annotated by applying semi-automatic procedures, as shown in Chrysos et al (2015). Five different facial tracking methodologies were evaluated in the benchmark (Rajamanoharan and Cootes (2015); Yang et al (2015a); Wu and Ji (2015); Uricar and Franc (2015); Xiao et al (2015)) and the results are indicative of the current state-of-the-art performance.
In this paper, we make a significant step further and present the first, to the best of our knowledge, comprehensive evaluation of multiple deformable face tracking pipelines. In particular, we assess:
- A pipeline which combines a generic face detection algorithm with a facial landmark localisation method. This is the most common method for facial landmark tracking. It is fairly robust since the probability of drifting is reduced due to the application of the face detector at each frame. Nevertheless, it does not exploit the dynamic characteristics of the tracked face. Many state-of-the-art face detectors as well as facial landmark localisation methodologies are evaluated in this pipeline.
- A pipeline which combines a model free tracking system with a facial landmark localisation method. This approach takes into account the dynamic nature of the tracked face, but is susceptible to drifting and thus losing the tracked object. We evaluate the combinations of multiple state-of-the-art model free trackers, as well as landmark localisation techniques.
- Hybrid pipelines that include mechanisms for detecting tracking failures and performing re-initialisation, as well as using models for ensuring robust tracking.
Summarising, the findings of our evaluation show that current face detection and model free tracking technologies are advanced enough so that even a naive combination with landmark localisation techniques is adequate to achieve state-of-the-art performance on deformable face tracking. Specifically, we experimentally show that model free tracking based pipelines are very accurate when applied on videos with moderate lighting and pose circumstances. Furthermore, the combination of state-of-the-art face detectors with landmark localisation systems demonstrates excellent performance with surprisingly high true positive rate on videos captured under arbitrary conditions (extreme lighting, pose, occlusions, etc.). Moreover, we show that hybrid approaches provide only a marginal improvement, which is not worth their complexity and computational cost. Finally, we compare these approaches with the systems that participated in the 300VW competition of Shen et al (2015).
The rest of the paper is organised as follows. Section 2 presents a survey of the current literature on both rigid and deformable face tracking. In Section 3, we present the current state-of-the-art methodologies for deformable face tracking. Since modern face tracking consists of various modules, including face detection, model free tracking and facial landmark localisation, Sections 3.1, 3.2 and 3.3 briefly outline the state-of-the-art in each of these domains. Experimental results are presented in Section 4. Finally, in Section 5 we discuss the challenges that still remain to be addressed, provide future research directions and draw conclusions.

Prior Art
The first methods for rigid 2D tracking generally revolved around the use of various features or transformations and mainly explored various color-spaces for robust tracking (Crowley and Berard (1997); Bradski (1998b); Qian et al (1998); Toyama (1998); Jurie (1999); Schwerdt and Crowley (2000); Stern and Efros (2002); Vadakkepat et al (2008)). The general methods of choice for tracking were Mean Shift and variations such as the Continuously Adaptive Mean Shift (Camshift) algorithm (Bradski (1998a); Allen et al (2004)). The Mean Shift algorithm is a non-parametric technique that climbs the gradient of a probability distribution to find the nearest dominant mode (peak) (Comaniciu and Meer (1999); Comaniciu et al (2000)). Camshift is an adaptation of the Mean Shift algorithm for object tracking. The primary difference between Camshift and Mean Shift is that the former uses continuously adaptive probability distributions (i.e., distributions that may be recomputed for each frame) while the latter is based on static distributions, which are not updated unless the target experiences significant changes in shape, size or color. Other popular methods of choice for tracking are linear and non-linear filtering techniques including Kalman filters, as well as methodologies that fall in the general category of particle filters (Del Moral (1996); Gordon et al (1993)), such as the popular Condensation algorithm by Isard and Blake (1998). Condensation is the application of Sampling Importance Resampling (SIR) estimation by Gordon et al (1993) to contour tracking. A recent successful 2D rigid tracker that updates the appearance model of the tracked face was proposed in Ross et al (2008). The algorithm uses incremental Principal Component Analysis (PCA) (Levey and Lindenbaum (2000)) to learn a statistical model of the appearance in an on-line manner and, contrary to other eigentrackers, such as Black and Jepson (1998), it does not contain any training phase. The method in Ross et al (2008) uses a variant of the Condensation algorithm to model the distribution over the object's location as it evolves over time. The method has initiated a line of research on robust incremental object tracking including the works of Liwicki et al (2012b, 2013, 2012a, 2015b). Rigid 3D tracking has also been studied by using generic 3D models of the face (Malciu and Prêteux (2000); La Cascia et al (2000)). For example, La Cascia et al (2000) formulate the tracking task as an image registration problem in the cylindrically unwrapped texture space and Sung et al (2008) combine Active Appearance Models (AAMs) with a cylindrical head model for robust recovery of the global rigid motion. Currently, rigid face tracking is generally treated along the same lines as general model free object tracking (Jepson et al (2003); Smeulders et al (2014); Liwicki et al (2013, 2012b); Ross et al (2008); Wu et al (2015); Li et al (2015a)). An overview of model free object tracking is given in Section 3.2.
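The mode-seeking behaviour of Mean Shift described above can be illustrated with a minimal sketch. It assumes a flat (uniform) kernel, a fixed square window and a precomputed per-pixel probability map (for instance a colour back-projection); the function name and its parameters are illustrative and are not taken from any of the cited implementations. Camshift would additionally recompute the probability map and adapt the window size at every frame, which is omitted here.

```python
import numpy as np

def mean_shift_window(prob, centre, half_size, max_iter=20, tol=0.5):
    """Move a fixed-size window towards the nearest mode of a 2D probability map.

    prob      : 2D array of non-negative weights (assumed given, e.g. a back-projection)
    centre    : (row, col) starting position of the window centre
    half_size : half of the window side length, in pixels
    """
    h, w = prob.shape
    cy, cx = float(centre[0]), float(centre[1])
    for _ in range(max_iter):
        # Clip the current window to the image borders.
        y0 = max(int(round(cy)) - half_size, 0)
        y1 = min(int(round(cy)) + half_size + 1, h)
        x0 = max(int(round(cx)) - half_size, 0)
        x1 = min(int(round(cx)) + half_size + 1, w)
        patch = prob[y0:y1, x0:x1]
        mass = patch.sum()
        if mass <= 0:
            break  # no evidence inside the window, keep the last estimate
        # The weighted centroid of the window is the Mean Shift update.
        ys, xs = np.mgrid[y0:y1, x0:x1]
        ny = (ys * patch).sum() / mass
        nx = (xs * patch).sum() / mass
        shift = np.hypot(ny - cy, nx - cx)
        cy, cx = ny, nx
        if shift < tol:
            break
    return cy, cx

# Toy usage: a single Gaussian blob centred at (30, 40).
rows, cols = np.mgrid[0:64, 0:64]
blob = np.exp(-((rows - 30) ** 2 + (cols - 40) ** 2) / (2 * 5.0 ** 2))
mode_row, mode_col = mean_shift_window(blob, centre=(24, 34), half_size=10)
```

In a tracker, the returned centre would typically seed the window for the next frame.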
Non-rigid tracking of faces is important in many applications, spanning from facial expression analysis to motion capture for graphics and game design. Non-rigid tracking of faces can be further subdivided into tracking of certain facial landmarks (Lanitis et al (1995); Black and Yacoob (1995); Sobottka and Pitas (1996); Xiao et al (2004); Matthews and Baker (2004); Matthews et al (2004); Patras and Pantic (2004) 2015)) or tracking/estimation of dense facial motion (Essa et al (1996); Yacoob and Davis (1996); Essa et al (1997); Decarlo and Metaxas (2000); Koelstra et al (2010); Snape et al (2015)). The first series of model-based methods for dense facial motion tracking were proposed by the MIT Media Lab in the mid 1990s (Essa et al (1997, 1996, 1994); Basu et al (1996)). In particular, the method by Essa and Pentland (1994) tracks facial motion using optical flow computation coupled with a geometric and a physical (muscle) model describing the facial structure. This modeling results in a time-varying spatial patterning of facial shape and a parametric representation of the independent muscle action groups which are responsible for the observed facial motions. In Essa et al (1994) the physically-based face model of Essa and Pentland (1994) is driven by a set of responses from a set of templates that characterise facial regions. Model generated flow has been used by the same group in Basu et al (1996) for motion regularisation. 3D motion estimation using sparse 3D models and optical flow estimation has also been proposed by Li et al (1993); Bozdagi et al (1994). Dense facial motion tracking is performed in Decarlo and Metaxas (2000) by solving a model-based (using a facial deformable model) least-squares optical flow problem. The constraints are relaxed by the use of a Kalman filter, which permits controlled constraint violations based on the noise present in the optical flow information, and enables optical flow and edge information to be combined more robustly and efficiently. Freeform deformations (Rueckert et al (1999)) are used in Koelstra et al (2010) for extraction of dense facial motion for facial action unit recognition. Recently, Snape et al (2015) proposed a statistical model of the facial flow for fast and robust dense facial motion extraction.
Arguably, the problem that has received the majority of attention is tracking of a set of sparse facial landmarks. The landmarks are either associated to a particular sparse facial model, i.e. the popular Candide facial model by Li et al (1993), or correspond to fiducial facial regions/parts (e.g., mouth, eyes, nose etc.) (Cootes et al (2001)). Even earlier attempts such as Essa and Pentland (1994) understood the usefulness of tracking facial regions/landmarks in order to perform robust fitting of complex facial models (currently the vast majority of dense 3D facial model tracking techniques, such as Wei et al (2004); Zhang et al (2008); Amberg (2011), rely on the robust tracking of a set of facial landmarks). Early approaches for tracking facial landmarks/regions included: (i) the use of templates built around certain facial regions (Essa and Pentland (1994)), (ii) the use of facial classifiers to detect landmarks (Colmenarez et al (1999)) where tracking is performed using modal analysis (Tao and Huang (1998)) or (iii) the use of face and facial region segmentation to detect the features where tracking is performed using block matching (Sobottka and Pitas (1996)). Currently, deformable face tracking has converged with the problem of facial landmark localisation on static images. That is, the methods generally rely on fitting generative or discriminative statistical models of appearance and 2D/3D sparse facial shape at each frame. Arguably, the most popular methods are generative and discriminative variations of Active Appearance Models (AAMs) and Active Shape Models (ASMs) (Pighin et al (1999); Cootes et al (2001); Dornaika and Ahlberg (2004); Xiao et al (2004); Matthews and Baker (2004); Dedeoglu et al (2007); Papandreou and Maragos (2008); Amberg et al (2009); Saragih et al (2011); Xiong and De la Torre (2013, 2015)). The statistical models of appearance and shape can either be generic as in Cootes et al (2001); Matthews and Baker (2004); Xiong and De la Torre (2013) or incrementally updated in order to better capture the face at hand, as in Sung and Kim (2009); Asthana et al (2014). The vast majority of the facial landmark localisation methodologies require an initialisation provided by a face detector. More details regarding current state-of-the-art in facial landmark localisation can be found in Section 3.3.
Arguably, the current practice regarding deformable face tracking includes the combination of a generic face detection and generic facial landmark localisation technique (Saragih et al (2011); Xiong and De la Torre (2013, 2015); Alabort-i-Medina and Zafeiriou (2015); Asthana et al (2015)). For example, popular approaches include successive application of the face detection and facial landmark localisation procedure at each frame. Another approach performs face detection in the first frame and then applies facial landmark localisation at each consecutive frame using the fitting result of the previous frame as initialisation. Face detection can be re-applied in case of failure. This is the approach that is used by popular packages such as Asthana et al (2014). In this paper, we thoroughly evaluate variations of the above approaches. Furthermore, we consider the use of modern model free state-of-the-art trackers for rigid 2D tracking in order to be used as initialisation for the facial landmark localisation procedure. This is pictorially described in Figure 1.
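The second approach above (detect once, initialise each fit from the previous frame, re-detect on failure) can be sketched as a short control loop. The three callables `detect`, `fit` and `has_failed` are hypothetical placeholders for a face detector, a landmark localisation method and a failure check; they are assumptions made for illustration and do not correspond to the interface of any cited package.

```python
import numpy as np

def bounding_box(landmarks):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around an (N, 2) landmark array."""
    pts = np.asarray(landmarks, dtype=float)
    return (*pts.min(axis=0), *pts.max(axis=0))

def track(frames, detect, fit, has_failed):
    """Track landmarks through `frames`, re-initialising with the detector on failure.

    detect(frame)            -> box or None        (hypothetical face detector)
    fit(frame, box)          -> (N, 2) landmarks   (hypothetical landmark localiser)
    has_failed(frame, shape) -> bool               (hypothetical failure check)
    """
    results = []
    box = None
    for frame in frames:
        if box is None:
            box = detect(frame)            # first frame, or target previously lost
        if box is None:
            results.append(None)           # nothing to fit in this frame
            continue
        shape = fit(frame, box)
        if has_failed(frame, shape):
            box = detect(frame)            # re-apply detection in case of failure
            shape = fit(frame, box) if box is not None else None
        results.append(shape)
        # The fitting result of this frame initialises the next one.
        box = bounding_box(shape) if shape is not None else None
    return results
```

Calling `detect` unconditionally at every frame, instead of only on failure, turns the same loop into the per-frame detection strategy.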

Face Tracking Benchmarking
For assessing the performance of rigid 2D face tracking several short face sequences have been annotated with regards to the facial region (using a bounding box style annotation). One of the first sequences that has been annotated for this task is the so-called Dudek sequence by Ross et al (2015). Nowadays, several such sequences have been annotated and are publicly available, such as the ones by Liwicki et al (2015a); Li et al (2015b); Wu et al (2015).
The performance of non-rigid dense facial tracking methodologies was usually assessed by using markers (Decarlo and Metaxas (2000)), simulated data (Snape et al (2015)), visual inspection (Decarlo and Metaxas (2000); Essa et al (1997, 1996); Yacoob and Davis (1996); Snape et al (2015); Koelstra et al (2010)) or indirectly by the use of the dense facial motion for certain tasks, such as expression analysis (Essa et al (1996); Yacoob and Davis (1996); Koelstra et al (2010)). Regarding tracking of facial landmarks, up until recently, the preferred method for assessing the performance was visual inspection in a number of selected facial videos (Xiong and De la Torre (2013); Tresadern et al (2012)). Other methods were assessed on a small number of short (a few seconds in length) annotated facial videos (Sagonas et al (2014); Asthana et al (2014)). Until recently the longest annotated facial video sequence was the so-called talking face of Cootes (2015), which was used to evaluate many tracking methods including Orozco et al (2013); Amberg et al (2009). The talking face video comprises 5000 frames (around 200 seconds) taken from a video of a person engaged in a conversation. The talking face video was initially tracked using an Active Appearance Model (AAM) with a shape model, and a total of 68 landmarks are provided. The tracked landmarks were visually checked and manually corrected where necessary.
Recently, Xiong and De la Torre (2015) introduced a benchmark for facial landmark tracking using videos from the Distracted Driver Face (DDF) and Naturalistic Driving Study (NDS) in Campbell (2015). The DDF dataset contains 15 sequences with a total of 10,882 frames. Each sequence displays a single subject posing as the distracted driver in a stationary vehicle or indoor environment. 12 out of 15 videos were recorded with subjects sitting inside of a vehicle. Five of them were recorded during the night under infrared (IR) light and the rest were recorded during the daytime under natural lighting. The remaining three were recorded indoors. The NDS database contains 20 sub-sequences of driver faces recorded during a drive conducted between the Blacksburg, VA and Washington, DC areas (NDS is more challenging than DDF since its videos are of lower spatial and temporal resolution). Each video of the NDS database has a duration of one minute, recorded at 15 frames per second (fps) with a 360 × 240 resolution. For both datasets one in every ten frames was annotated using either 49 landmarks for near-frontal faces or 31 landmarks for profile faces. The database contains many extreme facial poses (90° yaw, 50° pitch) as well as many faces under extreme lighting conditions (e.g., IR). In total the dataset presented in Xiong and De la Torre (2015) contains between 2,000 and 3,000 annotated faces (please refer to Xiong and De la Torre (2015) for exemplar annotations).
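As a rough consistency check of the figures quoted above, the number of annotated frames can be estimated from the stated frame counts. The sketch assumes that every NDS video lasts exactly 60 seconds and that exactly one frame in ten is annotated in both datasets; both are simplifications of the description, so the result is only an approximation.

```python
ddf_frames = 10_882          # total DDF frames, as stated
nds_videos = 20              # NDS sub-sequences, as stated
nds_seconds = 60             # assumption: each NDS video is exactly one minute long
nds_fps = 15                 # NDS frame rate, as stated
stride = 10                  # assumption: exactly one in every ten frames is annotated

nds_frames = nds_videos * nds_seconds * nds_fps      # 20 * 60 * 15 = 18,000
annotated = (ddf_frames + nds_frames) // stride      # 28,882 // 10 = 2,888
print(nds_frames, annotated)
```

Under these assumptions the estimate is 2,888 annotated frames, which falls inside the quoted range of 2,000 to 3,000 annotated faces.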
The only existing large in-the-wild benchmark for facial landmark tracking was recently introduced by Shen et al (2015). The benchmark consists of 114 videos with varying difficulty and provides annotations generated in a semi-automatic manner (Chrysos et al (2015); Shen et al (2015); Tzimiropoulos (2015)). This challenge, called 300VW, is the only existing large-scale comprehensive benchmark for deformable model tracking. More details regarding the dataset of the 300VW benchmark can be found in Section 4.1. The performance of the pipelines considered in this paper is compared with the participating methods of the 300VW challenge in Section 4.8.

Deformable Face Tracking
In this paper, we focus on the problem of performing deformable face tracking across long-term sequences within unconstrained videos. The problem of tracking across long-term sequences is particularly challenging as the appearance of the face may change significantly during the sequence due to occlusions, illumination variation, motion artifacts and head pose. For deformable tracking, however, the problem is further complicated by the expectation of recovering a set of accurate fiducial points in conjunction with successfully tracking the object. As described in Section 2, current deformable facial tracking methods mainly concentrate on performing face detection per frame and then performing facial landmark localisation. However, we consider the most important metric for measuring the success of deformable face tracking to be the facial landmark localisation accuracy. Given this, there are a number of strategies that could feasibly be employed in order to attempt to minimise the total facial landmark localisation error across the entire sequence. Therefore, we take advantage of current advances in face detection, model free tracking and facial landmark localisation techniques in order to perform deformable face tracking. Specifically, we investigate three strategies for deformable tracking:
1. Detection + Landmark Localisation. Face detection per frame, followed by facial landmark localisation initialised within the facial bounding boxes. This scenario is visualised in Figure 1 and 8.
Note that we focus on combinations of methods that provide bounding boxes of the facial region followed by landmark localisation. This is due to the fact that the current set of state-of-the-art landmark localisation methods are all local methods and require initialisation within the facial region. Although joint face detection and landmark localisation methods have been proposed (Zhu and Ramanan (2012); Chen et al (2014)), they are not competitive with the most recent set of landmark localisation methods. For this reason, in this paper we focus on the combination of bounding box estimators with state-of-the-art local landmark localisation techniques.
The remainder of this Section will give a brief overview of the literature concerning face detection, model free tracking and facial landmark localisation.

Face Detection
Face detection is among the most important and popular tasks in Computer Vision and an essential step for applications such as face recognition and face analysis. Since face detection has been a research topic for so many decades, the existing literature is, naturally, extremely extensive. The fact that all recent face detection surveys (Hjelmås and Low (2001); Yang et al (2002); Zhang and Zhang (2010); Zafeiriou et al (2015)) provide different categorisations of the relevant literature is indicative of the huge range of existing techniques. Consequently, herein, we only present a basic outline of the face detection literature. For an extended review, the interested reader may refer to the most recent face detection survey in Zafeiriou et al (2015).
According to the most recent literature review by Zafeiriou et al (2015), existing methods can be separated into two major categories. The first one includes methodologies that learn a set of rigid templates, which can be further split in the following groups: (i) boosting-based methods, (ii) approaches that utilise SVM classifiers, (iii) exemplar-based techniques, and (iv) frameworks based on Neural Networks. The second major category includes deformable part models, i.e. methodologies that learn a set of templates per part as well as the deformations between them.
Table 1: The set of detectors used in this paper. The table reports the short name of the method, the relevant citation(s) as well as the link to the implementation used. (Columns: Method; Citation(s); Rigid Template; DPM; Implementation.)

Boosting Methods. Boosting combines multiple "weak" hypotheses of moderate accuracy in order to determine a highly accurate hypothesis. The most characteristic example is Adaptive Boosting (AdaBoost) which is utilised by the most popular face detection methodology, i.e. the Viola-Jones (VJ) detector of Viola and Jones (2001, 2004). Characteristic examples of other methods that employ variations of AdaBoost include Li et al (2002); Wu et al (2004); Mita et al (2005). The original VJ algorithm used Haar features; however, boosting (and cascade-of-classifiers methodologies in general) has been shown to greatly benefit from robust features (Köstinger et al (2012); Jun et al (2013); Li et al (2011); Li and Zhang (2013); Mathias et al (2014); Yang et al (2014)), such as HOG (Dalal and Triggs (2005)), SIFT (Lowe (1999)), SURF (Bay et al (2008)) and LBP (Ojala et al (2002)). For example, SURF features have been successfully combined with a cascade of weak classifiers in Li et al (2011); Li and Zhang (2013), achieving faster convergence. Additionally, Jun et al (2013) propose robust face specific features that combine both LBP and HOG. Mathias et al (2014) recently proposed an approach (so called HeadHunter) with state-of-the-art performance that employs various robust features with boosting. Specifically, they propose the adaptation of Integral Channel Features (ICF) (Dollár et al (2009)) with HOG and LUV colour channels, combined with global feature normalisation. A similar approach is followed by Yang et al (2014), in which they combine gray-scale, RGB, HSV, LUV, gradient magnitude and histograms within a cascade of weak classifiers.
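The idea of combining weak hypotheses into a strong one can be made concrete with a minimal sketch of discrete AdaBoost over decision stumps. This is a textbook formulation given for illustration only: it uses raw feature thresholds rather than Haar or channel features, has no attentional cascade, and is not the training procedure of the VJ detector or of any other cited method. All function names are illustrative.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    """Weak hypothesis: +1 where pol * (X[:, j] - thr) >= 0, else -1."""
    return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)

def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error under weights w."""
    best = (np.inf, 0, 0.0, 1)  # (weighted error, feature, threshold, polarity)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = w[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, rounds=10):
    """Discrete AdaBoost; labels y must be in {-1, +1}."""
    w = np.full(len(y), 1.0 / len(y))          # start from uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, j, thr, pol = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero / log(0)
        alpha = 0.5 * np.log((1 - err) / err)  # vote of this weak hypothesis
        pred = stump_predict(X, j, thr, pol)
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified samples
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    """Sign of the weighted vote of all weak hypotheses."""
    score = sum(a * stump_predict(X, j, t, p) for a, j, t, p in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy usage: an interval-shaped class that no single threshold can separate.
X = np.arange(6, dtype=float).reshape(-1, 1)
y = np.array([1, 1, -1, -1, 1, 1])
strong = adaboost(X, y, rounds=3)
```

On this toy problem a single stump necessarily misclassifies at least two of the six samples, whereas the weighted vote of three stumps separates the two classes.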
SVM Classifiers. Maximum margin classifiers, such as Support Vector Machines (SVMs), have become popular for face detection (Romdhani et al (2001); Heisele et al (2003); Rätsch et al (2004); King (2015)). Even though their detection speed was initially slow, various schemes have been proposed to speed up the process. Romdhani et al (2001) propose a method that computes a reduced set of vectors from the original support vectors that are used sequentially in order to make early rejections. A similar approach is adopted by Rätsch et al (2004). A hierarchy of SVM classifiers trained on different resolutions is applied in Heisele et al (2003). King (2015) proposes an algorithm for efficient learning of a max-margin classifier using all the sub-windows of the training images, without applying any sub-sampling, and formulates a convex optimisation that finds the global optimum. Moreover, SVM classifiers have also been used for multi-view face detection (Li et al (2000); Wang and Ji (2004)). For example, Li et al (2000) first apply a face pose estimator based on Support Vector Regression (SVR), followed by an SVM face detector for each pose.
Exemplar-based Techniques. These methods aim to match a test image against a large set of facial images. This approach is inspired by principles used in image retrieval and requires that the exemplar set covers the large appearance variation of the human face. Shen et al (2013) employ bag-of-word image retrieval methods to extract features from each exemplar, which creates a voting map for each exemplar that functions as a weak classifier. Thus, the final detection is performed by combining the voting maps. A similar methodology is applied in Li et al (2014), with the difference that specific exemplars are used as weak classifiers based on a boosting strategy. Recently, Kumar et al (2015) proposed an approach that enhances the voting procedure by using semantically related visual words as well as weighted occurrence of visual words based on their spatial distributions.
Deformable Part Models. DPMs learn a patch expert for each part of an object and model the deformations between parts using spring-like connections based on a tree structure. Consequently, they perform joint facial landmark localisation and face detection. Even though they are not the best performing methods for landmark localisation, they are highly accurate for face detection in-the-wild. However, their main disadvantage is their high computational cost. Pictorial Structures (PS) (Fischler and Elschlager (1973); Felzenszwalb and Huttenlocher (2005)) are the first family of DPMs that appeared. They are generative DPMs that assume Gaussian distributions to model the appearance of each part, as well as the deformations. They became a very popular line of research after the influential work in Felzenszwalb and Huttenlocher (2005) that proposed a very efficient dynamic programming algorithm for finding the global optimum based on Generalized Distance Transform. Many discriminatively trained DPMs (Felzenszwalb et al (2010); Zhu and Ramanan (2012); Yan et al (2013, 2014)) appeared afterwards, which learn the patch experts and deformation parameters using discriminative classifiers, such as latent SVM.

DPMs can be further separated with respect to their training scenario into: (i) weakly supervised and (ii) strongly supervised. Weakly supervised DPMs (Felzenszwalb et al (2010); Yan et al (2014)) are trained using only the bounding boxes of the positive examples and a set of negative examples. The most representative example is the work by Felzenszwalb et al (2010), which has proved to be very efficient for generic object detection. Under a strongly supervised scenario, it is assumed that a training database with images annotated with fiducial landmarks is available. Several strongly supervised methods exist in the literature (Felzenszwalb and Huttenlocher (2005); Zhu and Ramanan (2012); Yan et al (2013); Ghiasi and Fowlkes (2014)). Ghiasi and Fowlkes (2014) propose a hierarchical DPM that explicitly models the occlusion of parts. In Zhu and Ramanan (2012) it is shown that a strongly supervised DPM outperforms, by a large margin, a weakly supervised one. In contrast, HeadHunter by Mathias et al (2014) shows that a weakly supervised DPM can outperform all current state-of-the-art face detection methodologies, including the strongly supervised DPM of Zhu and Ramanan (2012).
According to FDDB (Jain and Learned-Miller (2010)), which is the most well established face detection benchmark, the currently top-performing methodology is the one by Ranjan et al (2015), which combines DCNNs with a DPM. However, most DCNN-based techniques could not be used, because their authors do not provide publicly available implementations and it is very complicated and time-consuming to train and fine-tune such networks. Thus, even though many DCNN-based techniques have been shown to achieve state-of-the-art performance, it was not feasible to use them for deformable face tracking pipelines. Nevertheless, we employ the top performing SVM-based method for learning rigid templates (King (2015)), as well as the best weakly and strongly supervised DPM implementations of Mathias et al (2014) and Zhu and Ramanan (2012). Finally, we also use the popular VJ algorithm (Viola and Jones (2001, 2004)) as a baseline face detection method. The employed face detection implementations are summarised in Table 1.

Model Free Tracking
Model free tracking is an extremely active area of research. Given the initial state (e.g., position and size of the containing box) of a target object in the first image, model free tracking attempts to estimate the states of the target in subsequent frames. Therefore, model free tracking provides an excellent method of initialising landmark localisation methods.
The literature on model free tracking is vast. For the rest of this section, we will provide an extremely brief overview of model free tracking that focuses primarily on areas that are relevant to the tracking methods we investigated in this paper. We refer the interested reader to the wealth of tracking surveys (Li et al (2013b); Smeulders et al (2014); Salti et al (2012); Yang et al (2011)) and benchmarks (Wu et al (2013, 2015); Kristan et al (2013, 2014, 2015, 2016); Smeulders et al (2014)) for more information on model free tracking methods.
Generative Trackers. These trackers attempt to model the object's appearance directly. This includes template based methods, such as those by Matthews et al (2004); Baker and Matthews (2004); Sevilla-Lara and Learned-Miller (2012), as well as parametric generative models such as Balan and Black (2006); Ross et al (2008); Black and Jepson (1998); Xiao et al (2014). The work of Ross et al (2008) introduces online subspace learning for tracking with a sample mean update, which allows the tracker to account for changes in illumination, viewing angle and pose of the object. The idea is to incrementally learn a low-dimensional subspace and adapt the appearance model on object changes. The update is based on an incremental principal component analysis (PCA) algorithm; however, it seems to be ineffective at handling large occlusions or non-rigid movements due to its holistic model. To alleviate the partial occlusion problem, Xiao et al (2014) suggest the use of square templates along with PCA.
Keypoint Trackers. These trackers (e.g. Pernici and Del Bimbo (2014); Nebehay and Pflugfelder (2015)) attempt to use the robustness of keypoint detection methodologies like SIFT (Lowe (1999)) or SURF (Bay et al (2008)) in order to perform tracking. Pernici and Del Bimbo (2014) collect multiple descriptors of weakly aligned keypoints over time and combine these matched keypoints in a RANSAC voting scheme. Nebehay and Pflugfelder (2015) utilise keypoints to vote for the object center in each frame. A consensus-based scheme is applied for outlier detection and the votes are transformed based on the current keypoint arrangement to account for scale and rotation. However, because keypoint methods consider only local points, they may have difficulty capturing the global information of the tracked target.
Discriminative Trackers. These trackers attempt to explicitly model the difference between the object appearance and the background. Most commonly, these methods are named "tracking-by-detection" techniques as they involve classifying image regions as either part of the object or the background. In their work, Grabner et al (2006) propose an online boosting method to select and update discriminative features, which allows the system to account for minor changes in the object appearance. However, the tracker fails to model severe changes in appearance. Babenko et al (2011) advocate the use of a multiple instance learning boosting algorithm to mitigate the drifting problem. More recently, discriminative correlation filters (DCF) have become highly successful at tracking. The DCF is trained by performing a circular sliding window operation on the training samples. This periodic assumption enables efficient training and detection by utilizing the Fast Fourier Transform (FFT). Danelljan et al (2014) learn separate correlation filters for the translation and the scale estimation. In Danelljan et al (2015), the authors introduce a sparse spatial regularisation term to mitigate the artifacts at the boundaries of the circular correlation. In contrast to the linear regression commonly used to learn DCFs, Henriques et al (2015) apply a kernel regression and propose its multi-channel extension to enable the use of features such as HOG (Dalal and Triggs (2005)). Li et al (2015d) propose a new use for particle filters in order to choose reliable patches to consider as part of the object. These patches are modelled using a variant of the method proposed by Henriques et al (2015). Hare et al (2011) propose the use of structured output prediction. By explicitly allowing the outputs to parametrize the needs of the tracker, an intermediate classification step is avoided.
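To illustrate why the periodic assumption makes DCF training cheap, the following minimal sketch trains a single-channel, one-dimensional linear correlation filter in closed form in the Fourier domain and uses it to locate a cyclically shifted copy of the training signal. It is not the implementation of any of the cited trackers: the function names, the regularisation weight `lam`, the Gaussian target response and the toy signal are assumptions made purely for illustration, and a naive DFT replaces the FFT only to keep the example free of dependencies.

```python
import cmath
import math
import random


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (an FFT would be used in practice)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += signal[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def train_filter(x, g, lam=1e-3):
    """Ridge regression over all cyclic shifts of x, solved per frequency.

    Returns the conjugate filter H*_k = G_k conj(F_k) / (F_k conj(F_k) + lam).
    """
    F, G = dft(x), dft(g)
    return [Gk * Fk.conjugate() / (Fk * Fk.conjugate() + lam) for Fk, Gk in zip(F, G)]


def respond(h_conj, z):
    """Correlation response of the learnt filter on a new signal z."""
    Z = dft(z)
    product = [Zk * Hk for Zk, Hk in zip(Z, h_conj)]
    return [value.real for value in dft(product, inverse=True)]


n = 32
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # toy training signal (assumption)
# Desired response: a Gaussian peak at shift 0, wrapping around the signal ends.
g = [math.exp(-0.5 * min(t, n - t) ** 2) for t in range(n)]
h_conj = train_filter(x, g)

shift = 5
z = [x[(t - shift) % n] for t in range(n)]  # target moved by `shift` samples
response = respond(h_conj, z)
estimated_shift = max(range(n), key=lambda t: response[t])
```

The location of the response peak gives the estimated displacement of the target; the trackers discussed above extend this idea to two dimensions, multiple feature channels, kernels and scale estimation.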
Part-based Trackers. These trackers attempt to implicitly model the parts of an object in order to improve tracking performance. Adam et al (2006) represent the object with multiple arbitrary patches. Each patch votes on potential positions and scales of the object and a robust statistic is employed to minimise the voting error. Kalal et al (2010b) sample the object and the points are tracked independently in each frame by estimating optical flow. Using a forward-backward measure, the erroneous points are identified and the remaining reliable points are utilised to compute the optimal object trajectory. Yao et al (2013) adapt the latent SVM of Felzenszwalb et al (2010) for online tracking, by restricting the search to the vicinity of the location of the target object in the previous frame. In comparison to the weakly supervised part-based model of Yao et al (2013), in Zhang and van der Maaten (2013) the authors recommend an online strongly supervised part-based deformable model that learns the representation of the object and the representation of the background by training a classifier. Wang et al (2015) employ a part-based tracker by estimating a direct displacement prediction of the object. A cascade of regressors is utilised to localise the parts, while the model is updated online and the regressors are initialised by multiple motion models at each frame.
Given the wealth of available trackers, selecting appropriate trackers for deformable tracking purposes is a difficult proposition. In order to give as broad an overview as possible, we selected a representative tracker from each of the categories described previously. Therefore, in this paper we compare against 14 trackers, which are outlined in Table 2. SRDCF (Danelljan et al (2015)), KCF (Henriques et al (2015)) and DSST (Danelljan et al (2014)) are all discriminative trackers based on DCFs. They all performed well in the VOT 2015 (Kristan et al (2015)) challenge and DSST was the winner of VOT 2014 (Kristan et al (2014)). STRUCK (Hare et al (2011)) is a discriminative tracker that performed very well in the Online Object Tracking benchmark (Wu et al (2013)). SPOT (Zhang and van der Maaten (2014)) is a strong performing part based tracker, CMT (Nebehay and Pflugfelder (2015)) is a strong performing keypoint based tracker, and LRST (Zhang et al (2014b)) and ORIA (Wu et al (2012)) are recent generative trackers. RPT (Li et al (2015d)) is a recently proposed technique that reported state-of-the-art results on the Online Object Tracking benchmark (Wu et al (2013)). Finally, TLD (Kalal et al (2012)), MIL (Babenko et al (2011)), FCT (Zhang et al (2014a)), DF (Sevilla-Lara and Learned-Miller (2012)) and IVT (Ross et al (2008)) were included as baseline tracking methods with publicly available implementations.

Facial Landmark Localisation
Statistical deformable models have emerged as an important research field over the last few decades, existing at the intersection of computer vision, statistical pattern recognition and machine learning. Statistical deformable models aim to solve generic object alignment in terms of localisation of fiducial points. Although deformable models can be built for a variety of object classes, the majority of ongoing research has focused on the task of facial alignment. Recent large-scale challenges on facial alignment (Sagonas et al (2013a,b, 2015)) are characteristic examples of the rapid progress being made in the field.
Currently, the most commonly-used and well-studied face alignment methods can be separated into two major families: (i) discriminative models that employ regression in a cascaded manner, and (ii) generative models that are iteratively optimised.
Regression-based models. The methodologies of this category aim to learn a regression function that regresses from the object's appearance (commonly handcrafted features) to the target output variables (either the landmark coordinates or the parameters of a statistical shape model). Although the history behind using linear regression in order to tackle the problem of face alignment spans back many years (Cootes et al (2001)), the research community turned towards alternative approaches due to the lack of sufficient data for training accurate regression functions. Nevertheless, regression-based techniques have recently prevailed in the field thanks to the wealth of annotated data and effective handcrafted features (Lowe (1999); Dalal and Triggs (2005)). Recent works have shown that excellent performance can be achieved by employing a cascade of regression functions (e.g. Burgos-Artizzu et al (2013)).
Table 3: The landmark localisation methods employed in this paper. The table reports the short name of the method, the relevant citation(s) as well as the link to the implementation used.
Generative models. The most dominant representative algorithm of this category is, by far, the Active Appearance Model (AAM). AAMs consist of parametric linear models of both shape and appearance of an object, typically modelled by Principal Component Analysis (PCA). The AAM objective function involves the minimisation of the appearance reconstruction error with respect to the shape parameters. AAMs were initially proposed by Cootes et al (1995, 2001), where the optimisation was performed by a single regression step between the current image reconstruction residual and an increment to the shape parameters. However, Matthews and Baker (2004); Baker and Matthews (2004) linearised the AAM objective function and optimised it using the Gauss-Newton algorithm. Following this, Gauss-Newton optimisation has been the standard method for optimising AAMs. Numerous extensions have been published, including ones related to the optimisation procedure (Papandreou and Maragos (2008); Tzimiropoulos and Pantic (2013); Alabort-i-Medina and Zafeiriou (2014, 2015)).
In recent challenges by Sagonas et al (2013a, 2015), discriminative methods have been shown to represent the current state-of-the-art. However, in order to enable a fair comparison between types of methods, we selected a representative set of landmark localisation methods to compare with in this paper. The set of landmark localisation methods used in the paper is given in Table 3. We chose to use ERT (Kazemi and Sullivan (2014)) as it is extremely fast and the implementation provided by King (2009) is the best known implementation of a tree-based regressor. We chose CFSS (Zhu et al (2015)) as it is the current state-of-the-art on the data provided by the 300W competition of Sagonas et al (2013a). We used the Gauss-Newton Part-based AAM of Tzimiropoulos and Pantic (2014) as the top performing generative localisation method, as provided by the Menpo Project (Alabort-i-Medina et al (2014)). Finally, we also employed an SDM (Xiong and De la Torre (2013)), as implemented by Alabort-i-Medina et al (2014), as a baseline.

Experiments
In this section, details of the experimental evaluation are established. Firstly, the datasets employed for the evaluation, training and validation are introduced in Section 4.1. Next, Section 4.2 provides details of the training procedures and of the implementations that are relevant to all experiments. Following this, in Sections 4.3-4.7, we describe the set of experiments that were conducted in this paper, which are summarised in Table 4. Finally, Section 4.8 compares the best results from the previous experiments to the winners of the 300VW competition in Shen et al (2015).
In the following sections, due to the very large number of methodologies taken into account, we provide a summary of all the results as tables and only the top 5 methods as graphs for clarity. Please refer to the supplementary material for an extensive report of the experimental results. Additionally, we provide videos with the tracking results for the experiments of Sections 4.3, 4.4 and 4.5 for qualitative comparison.
Table 4: The set of experiments conducted in this paper. This table is intended as an overview of the battery of experiments that were conducted, as well as providing a reference to the relevant section.

Dataset
All the comparisons are conducted on the test set of the 300VW dataset collected by Shen et al (2015). This recently introduced dataset contains 114 videos (50 for training and 64 for testing). The videos are separated into the following 3 categories:
- Category 1: This category is composed of videos captured in well-lit environments without any occlusions.
- Category 2: The second category includes videos captured in unconstrained illumination conditions.
- Category 3: The final category consists of video sequences captured in totally arbitrary conditions (including severe occlusions and extreme illuminations).
Each video includes only one person and is annotated using the 68 point mark-up employed by Gross et al (2010) and Sagonas et al (2015) for the Multi-PIE and 300W databases, respectively. All videos are between 1500 frames and 3000 frames long, with a large variety of expressions, poses and capturing conditions, which makes the dataset very challenging for deformable facial tracking. A number of exemplar images, which are indicative of the challenges of each category, are provided in Figure 2. We note that, in contrast to the results of Shen et al (2015) in the original 300VW competition, we used the most recently provided annotations, which have been corrected and do not contain missing frames. Therefore, we also provide updated results for the participants of the 300VW competition.

Implementation Details
The authors' implementations are utilised for the trackers, as outlined in Table 2. Similarly, the face detectors' implementations are outlined in Table 1. HOG+SVM was provided by the Dlib project of King (2009, 2015), the Weakly Supervised DPM (DPM) (Felzenszwalb et al (2010)) was the model provided by Mathias et al (2014) and the code of Dubout and Fleuret (2012, 2013) was used to perform the detection. Moreover, the Strongly Supervised DPM (SS-DPM) of Zhu and Ramanan (2012) was provided by the authors and, finally, the OpenCV implementation by Bradski (2000) was used for the VJ detector (Viola and Jones (2004)). The default parameters were used in all cases.
For face alignment, as outlined in Table 3, the implementation of CFSS provided by Zhu et al (2015) is adopted, while the implementations provided by Alabort-i-Medina et al (2014) in the Menpo Project are employed for the patch-based AAM of Tzimiropoulos and Pantic (2014) and the SDM of Xiong and De la Torre (2013). Lastly, the implementation of ERT (Kazemi and Sullivan (2014)) is provided by King (2009) in the Dlib library.
The details of the parameters utilised for the patch-based AAM, SDM and ERT are mentioned below. For AAM, we used the algorithm of Tzimiropoulos and Pantic (2014) and applied a 2-level Gaussian pyramid with 4 and 10 shape components, and 60 and 150 appearance components in each scale, respectively. For the SDM, a 4-level Gaussian pyramid was employed. SIFT (Lowe (1999)) feature vectors of length 128 were extracted at the first 3 scales, using RootSIFT by Arandjelović and Zisserman (2012). Raw pixel intensities were used at the highest scale. Finally, part of the experiments were conducted on the cloud software of Koukis et al (2013).
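For readers unfamiliar with RootSIFT, the transformation applied to each 128-dimensional SIFT vector is short enough to sketch: the descriptor is L1-normalised and the element-wise square root is then taken, so that the result has unit Euclidean norm. The function name and the small constant `eps` that guards against division by zero are assumptions of this sketch, not part of the implementation used in the experiments.

```python
import math


def root_sift(descriptor, eps=1e-12):
    """L1-normalise a SIFT descriptor, then take element-wise square roots."""
    l1_norm = sum(abs(value) for value in descriptor) + eps
    return [math.sqrt(abs(value) / l1_norm) for value in descriptor]


# Toy 128-dimensional descriptor (assumption: real SIFT entries are non-negative).
toy_descriptor = [float(i % 7) for i in range(128)]
rooted = root_sift(toy_descriptor)
squared_norm = sum(value * value for value in rooted)
```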

Landmark Localisation Training
All the landmark localisation methods were trained with respect to the 68 facial points mark-up employed by Sagonas et al (2013a, 2015) in 300W, while the rest of the parameters were determined via cross-validation. The validation set consisted of frames from the 300VW trainset, as well as 60 privately collected images with challenging poses. All of the discriminative landmark localisation methods (SDM, ERT, CFSS) were trained from images in the public datasets of IBUG (Sagonas et al (2013a)), HELEN (Le et al (2012)), AFW (Zhu and Ramanan (2012)) and LFPW (Belhumeur et al (2013)). The generative AAM was trained on less data, since generative methods do not benefit as strongly from large training datasets. The training data used for the AAM was the recently released 300 images from the 300W dataset (Sagonas et al (2015)), 500 challenging images from LFPW (Belhumeur et al (2013)) and the 135 images of the IBUG dataset (Sagonas et al (2013a)).
Discriminative landmark localisation methods are tightly coupled with the initialisation statistics, as they learn to model a given variance of initialisations. Therefore, it is necessary to re-train each discriminative method for each face detection method employed. This allows the landmark localisation methods to correctly model the large amount of variance present between detectors. In aggregate, 5 different detector and landmark localisation models are trained: one for each detector and landmark localisation pair (totalling 4) and a single model trained using a validation set that estimates the variance of the ground truth bounding box throughout the sequences. This latter model is used for all trackers.
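One plausible way to obtain such initialisation statistics is to measure, over a validation set, how the boxes returned by a detector differ from the ground truth boxes in translation and scale. The sketch below does this for boxes given as `(x_min, y_min, x_max, y_max)`; the parameterisation (centre offsets normalised by the ground truth size, log of the scale ratio) and all names are assumptions for illustration and are not taken from the implementations used in the paper.

```python
import math
import statistics


def box_offset(gt, det):
    """Normalised centre offsets and log scale ratio of a detection w.r.t. ground truth."""
    gt_w, gt_h = gt[2] - gt[0], gt[3] - gt[1]
    det_w, det_h = det[2] - det[0], det[3] - det[1]
    dx = ((det[0] + det[2]) - (gt[0] + gt[2])) / (2.0 * gt_w)
    dy = ((det[1] + det[3]) - (gt[1] + gt[3])) / (2.0 * gt_h)
    log_scale = math.log(math.sqrt((det_w * det_h) / (gt_w * gt_h)))
    return dx, dy, log_scale


def initialisation_statistics(gt_boxes, det_boxes):
    """Mean and standard deviation of each offset component over a set of frames."""
    offsets = [box_offset(gt, det) for gt, det in zip(gt_boxes, det_boxes)]
    stats = {}
    for name, values in zip(("dx", "dy", "log_scale"), zip(*offsets)):
        stats[name] = (statistics.mean(values), statistics.pstdev(values))
    return stats


# Toy example: two detections shifted by +/-10% of the ground truth width.
gt_boxes = [(0, 0, 100, 100), (0, 0, 100, 100)]
det_boxes = [(10, 0, 110, 100), (-10, 0, 90, 100)]
stats = initialisation_statistics(gt_boxes, det_boxes)
```

Statistics of this kind could then be used to perturb the ground truth boxes when generating training initialisations for a given detector, which is why a separate model per detector is needed.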

Quantitative Metrics
The errors reported for all the following experiments are with respect to the landmark localisation error. The error metric employed is the mean Euclidean distance of the 68 points, normalised by the diagonal of the ground truth bounding box, $\sqrt{\mathrm{width}^2 + \mathrm{height}^2}$. This metric was chosen as it is robust to changes in head pose, which are frequent within the 300VW sequences. The graphs that are shown are cumulative error distribution (CED) plots that provide the proportion of images with an error less than or equal to a particular value. We also provide summary tables with respect to the Area Under the Curve (AUC) of the CED plots, considered up to a maximum error. Errors above this maximum threshold, which is fixed to 0.08, are considered failures to accurately localise the facial landmarks. Therefore, we also report the failure rate, as a percentage, which marks the proportion of images that are not considered within the CED plots. Table 5 shows some indicative examples of the deformable fitting quality that corresponds to each error value for all video categories. When ranking methods, we consider the AUC as the primary statistic and only resort to considering the failure rate in cases where there is little distinction between methods' AUC values.
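The metrics above can be summarised in a few lines of code. The following is a minimal sketch rather than the evaluation code used for the reported numbers: it assumes that the ground truth bounding box is the tight box around the ground truth landmarks, that the CED curve is sampled at evenly spaced thresholds, and that the AUC is integrated with the trapezoidal rule and divided by the maximum error so that it lies in [0, 1]. All function names are hypothetical.

```python
import math


def normalised_error(pred, gt):
    """Mean point-to-point distance divided by the ground truth box diagonal."""
    xs = [point[0] for point in gt]
    ys = [point[1] for point in gt]
    diagonal = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    distances = [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(distances) / len(distances) / diagonal


def ced_curve(errors, max_error=0.08, steps=80):
    """Proportion of frames with error <= threshold, for evenly spaced thresholds."""
    thresholds = [max_error * i / steps for i in range(steps + 1)]
    proportions = [sum(e <= t for e in errors) / len(errors) for t in thresholds]
    return thresholds, proportions


def auc_and_failure_rate(errors, max_error=0.08, steps=80):
    """Normalised area under the CED curve and failure rate in percent."""
    thresholds, proportions = ced_curve(errors, max_error, steps)
    area = sum(
        (proportions[i] + proportions[i + 1]) / 2.0 * (thresholds[i + 1] - thresholds[i])
        for i in range(steps)
    )
    failure_rate = 100.0 * sum(e > max_error for e in errors) / len(errors)
    return area / max_error, failure_rate


# Frames with a missing result are given infinite error and count as failures.
toy_errors = [0.0, 0.0, math.inf, 0.2]
auc, failure_rate = auc_and_failure_rate(toy_errors)
```

Because infinite errors never fall below any threshold, they lower the CED curve uniformly without breaking the computation of the AUC.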

Experiment 1: Detection and Landmark Localisation
In this experiment, we validate the most frequently used facial deformable tracking strategy, i.e. performing face detection followed by landmark localisation on each frame independently. If a detector fails to return a bounding box for a frame, that frame is considered as having infinite error and thus will appear as part of the failures in Table 6. Note that the AUC is robust to the use of infinite errors.
In frames where multiple bounding boxes are returned, the box with the highest confidence is kept, limiting the results of the detectors to a single bounding box per image. A high level diagram explaining the detection procedure for this experiment is given by Figure 1.
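The per-frame protocol of this experiment can be sketched as follows. The representation of detections as `(box, confidence)` pairs and the callable `fit_and_score`, which stands for landmark localisation followed by error computation, are hypothetical placeholders introduced only for this illustration.

```python
import math


def select_detection(detections):
    """Keep the most confident box, or None if the detector returned nothing."""
    if not detections:
        return None
    return max(detections, key=lambda detection: detection[1])[0]


def frame_error(detections, fit_and_score):
    """Error for one frame; a missed detection is given infinite error."""
    box = select_detection(detections)
    if box is None:
        return math.inf
    return fit_and_score(box)


# Toy usage with two candidate boxes and a dummy scoring function.
candidates = [((0, 0, 1, 1), 0.2), ((5, 5, 9, 9), 0.9)]
best_box = select_detection(candidates)
missed = frame_error([], lambda box: 0.01)
```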
Specifically, in this experiment we consider the 4 face detectors of Table 1 (DPM, SS-DPM, HOG+SVM, VJ) with the 4 landmark localisation techniques of Table 3 (AAM, CFSS, ERT, SDM), for a total of 16 results. The results of the experiment are given in Table 6 and Figure 3. The results indicate that the AAM performs poorly, as it achieves the lowest performance across all face detectors. The discriminative CFSS and ERT landmark localisation methods consistently outperform SDM. From the detectors' point of view, it seems that the strongly supervised DPM (SS-DPM) is the worst and provides the highest failure rates. On the other hand, the weakly supervised DPM (DPM) outperforms the rest of the detectors for all video categories in terms of both accuracy (i.e. AUC) and robustness (i.e. Failure Rate). For the graphs that correspond to all 16 methods, as well as a video with the results of the top 5 methods, please refer to the supplementary material.

Fig. 4: Diagram of the pipeline stages Detection, Landmark Localisation and Initialise From Previous Frame. Please see Table 7 for a full summary.

Experiment 2: Detection and Landmark Localisation with Reinitialisation
Complementing the experiments of Section 4.3, the same set-up was utilised to study the effect of missed frames by assuming a first order Markov dependency. If the detector does not return a bounding box in a frame, the bounding box of the previous frame is used as a successful detection for the missing frame. This procedure is depicted in Figure 4. Given that the frame rate of the input videos is adequately high (over 20 fps), this assumption is a reasonable one. The results of this experiment are summarised in Table 7 and in Figure 5. As expected, the ranking of the methods remains the same as in the previous experiment of Section 4.3.
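A minimal sketch of this reinitialisation scheme is given below. It assumes that the box carried over to a frame with a missed detection is the tight bounding box of the landmarks fitted in the previous frame, and that frames before the first successful detection have no result. The callable `fit`, which stands for any landmark localisation method, and the other names are hypothetical.

```python
def bounding_box(landmarks):
    """Tight axis-aligned box (x_min, y_min, x_max, y_max) around a set of points."""
    xs = [point[0] for point in landmarks]
    ys = [point[1] for point in landmarks]
    return (min(xs), min(ys), max(xs), max(ys))


def track_with_fallback(detections, fit):
    """Fit landmarks per frame, reusing the previous frame's box on missed detections."""
    results = []
    previous_box = None
    for box in detections:
        if box is None:
            box = previous_box  # first order Markov assumption
        if box is None:
            results.append(None)  # nothing to fall back on yet
            continue
        landmarks = fit(box)
        results.append(landmarks)
        previous_box = bounding_box(landmarks)
    return results


# Toy usage: the dummy "fit" simply returns two corners of the initial box.
toy_fit = lambda box: [(box[0], box[1]), (box[2], box[3])]
toy_detections = [None, (0, 0, 10, 10), None, (2, 2, 8, 8)]
toy_results = track_with_fallback(toy_detections, toy_fit)
```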
In order to better investigate the effect of this reinitialisation scheme, we also provide Figure 6 that directly shows the improvement. Specifically, we plot the CED curves with and without the reinitialisation strategy for the 3 best performing methods, as well as the 3 techniques for which the highest improvement is achieved. It becomes evident that the top performing methods from Section 4.3 do not benefit from reinitialisation, since the improvement is marginal. This is explained by the fact that these methods already achieve a very high true positive rate. The largest difference is observed for methods that utilise AAM. As shown by Antonakos et al (2015b), AAMs are very sensitive to initialisation, due to the nature of Gauss-Newton optimisation. Additionally, note that we have not attempted to apply any kind of greedy approach for improving the detectors' bounding boxes in order to provide a better AAM initialisation. Since the initialisation of a frame with a failed detection is achieved by the bounding box of the previous frame's landmarks, it is highly likely that its area will be well constrained to include only the facial parts and not the forehead or background. This kind of initialisation is very beneficial for AAMs, which justifies the large improvements that are shown in Figure 6. For the graphs that correspond to all 16 methods, as well as a video with the results of the top 5 methods, please refer to the supplementary material.
Fig. 6: Results for Experiment 2 of Section 4.4 (Detection + Landmark Localisation + Initialisation From Previous Frame). These results show the effect of initialisation from the previous frame, in comparison to missing detections. The top 3 performing results are given in red, green and blue, respectively, and the top 3 most improved are given in cyan, yellow and brown, respectively. The dashed lines represent the results before the reinitialisation strategy is applied, solid lines are after.

Experiment 3: Model-free Tracking and Landmark Localisation
In this section, we provide, to the best of our knowledge, the first detailed analysis of the performance of model free trackers for tracking "in-the-wild" facial sequences. For this reason, we have considered a large number of trackers in order to give a balanced overview of the performance of modern model free trackers for deformable face alignment. The 14 trackers considered in this section are summarised in Table 2. To initialise all trackers, the tightest possible bounding box of the ground truth facial landmarks is provided as the initial tracker state. We also include a baseline method, which appears in Table 8, referred to as PREV, which is defined as applying the landmark localisation methods initialised from the bounding box of the result in the previous frame. Obviously this scheme is highly sensitive to drifting and therefore we have included it as a basic baseline that does not include any model free tracking. A high level diagram explaining the detection procedure for this experiment is given by Figure 1.
Specifically, in this experiment we consider the 14 model free trackers of Table 2, plus the PREV baseline, with the 4 landmark localisation techniques of Table 3 (AAM, CFSS, ERT, SDM), for a total of 60 results. The results of the experiment are given in Table 8 and Figure 7. Note that the results for ORIA (Wu et al (2012)) and DF (Sevilla-Lara and Learned-Miller (2012)) do not appear in Table 8 due to lack of space and the fact that they did not perform well in comparison to PREV. Please see the supplementary material for full statistics.
By inspecting the results, we can firstly notice that most generative trackers perform poorly (i.e. ORIA, DF, FCT, IVT), except LRST, which achieves the second best performance for the most challenging video category. On the other hand, the discriminative approaches of SRDCF and SPOT consistently perform very well. Additionally, similar to the face detection experiments, the combination of all trackers with CFSS returns the best result, whereas AAM consistently demonstrates the poorest performance. Finally, it becomes evident that a straightforward application of the simplistic baseline approach (PREV) is not suitable for deformable tracking, even though it surprisingly outperforms some model free trackers, such as DF, ORIA and FCT. For the curves that correspond to all 60 methods, as well as a video with the tracking result of the top 5 methods, please refer to the supplementary material.

Experiment 4: Failure Checking and Tracking Reinitialisation
Complementing the experiments of Section 4.5, we investigate the improvement in performance obtained by performing failure checking during tracking. Here we define failure checking as the process of determining whether or not the currently tracked object is a face. Given that we have prior knowledge of the class of object we are tracking, namely faces, this enables us to train an offline classifier that attempts to determine whether a given input is a face or not. Furthermore, since we are also applying landmark localisation, we can perform a strong classification by using the facial landmarks as position priors when extracting features for the failure checking. To train the failure checking classifier, we perform the following methodology:
1. For all images in the Landmark Localisation training set, extract a fixed sized patch around each of the 68 landmarks and compute HOG (Dalal and Triggs (2005)) features for each patch. These patches are the positive training samples.
2. Generate negative training samples by perturbing the ground truth bounding box, extracting fixed size patches and computing HOG.
3. Train an SVM classifier using the positive and negative samples.
For the experiments in this section, we use a fixed patch size of 18 × 18 pixels, with 100 negative patches sampled for each positive patch. The failure checking classification threshold is chosen via cross-validation on two sequences from the 300VW training videos. Any hyperparameters of the SVM are also tuned using these two validation videos.
Given the failure detector, our restart procedure is as follows:
- Classify the current frame to determine if the tracking has failed. If a failure is verified, perform a restart, otherwise continue.
- Following the convention of the VOT challenges by Kristan et al (2013, 2014, 2015), we attempt to reduce the probability that poor trackers will overly rely on the output of the failure detection system. In the worst case, a very poor tracker would fail on most frames and thus the accuracy of the detector would be validated rather than the tracker itself. Therefore, when a failure is identified, the tracker is allowed to continue for 10 more frames. The results from the drifting tracker are used in these 10 frames in order to reduce the effect of the detector. The tracker is then reinitialised at the frame at which it was first detected as failing. The next 10 frames, as previously described, already have results computed and therefore no landmark localisation or failure checking is performed in these frames. At the 11th frame, the tracker continues as normal.
The diagram given in Figure 8 gives a pictorial representation of this scheme. The results of this experiment are given in Table 9 and Figure 9. In contrast to Section 4.5, we only perform the experiments on a subset of the total trackers using CFSS. We use the top 3 performing trackers (SRDCF, RPT, SPOT) as well as FCT, which had mediocre performance in Section 4.5. The results indicate that SRDCF is the best model free tracking methodology for the task.
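The control flow of the restart procedure described above can be sketched as follows. This is one possible reading of the scheme and not the code used in the experiments: the `tracker` object with `step` and `reinitialise` methods, the `is_failure` callable and the `ToyTracker` class are hypothetical, and the sketch assumes that the reinitialised tracker is advanced silently over the 10 frames whose results are already fixed before normal processing resumes.

```python
def track_with_restarts(num_frames, tracker, is_failure, grace=10):
    """Run a tracker with failure checking and delayed reinitialisation."""
    results = [None] * num_frames
    restarts = []
    t = 0
    while t < num_frames:
        results[t] = tracker.step(t)
        if not is_failure(t, results[t]):
            t += 1
            continue
        failed_at = t
        last = min(failed_at + grace, num_frames - 1)
        # The drifting tracker still provides the reported results for `grace` frames.
        for k in range(failed_at + 1, last + 1):
            results[k] = tracker.step(k)
        # Reinitialise at the frame where the failure was first detected.
        tracker.reinitialise(failed_at)
        restarts.append(failed_at)
        # Advance over frames that already have results; no checking is done here.
        for k in range(failed_at + 1, last + 1):
            tracker.step(k)
        t = last + 1
    return results, restarts


class ToyTracker:
    """Hypothetical 1-D tracker whose output drifts away from the true position t."""

    def __init__(self, drift_from):
        self.drift_from = drift_from
        self.drift = 0

    def step(self, t):
        if self.drift_from is not None and t >= self.drift_from:
            self.drift += 1
        return t + self.drift

    def reinitialise(self, t):
        self.drift = 0
        self.drift_from = None


toy_results, toy_restarts = track_with_restarts(
    20, ToyTracker(drift_from=5), lambda t, result: abs(result - t) >= 3
)
```

In the toy run, the failure is flagged once the drift reaches 3 units, the drifting outputs are kept for the following 10 frames, and the outputs return to the true position afterwards.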
In order to better investigate the effect of this failure checking scheme, we also provide Figure 6 which shows the differences between the initial tracking results of Section 4.5 and the results after applying failure detection. The performance of the top trackers (i.e. SRDCF, SPOT, RPT) does not improve much, which is expected since they are already able to return a robust tracking result. However, FCT benefits from the failure checking process, which apparently minimises its drifting issues.

[...]ing). These results show the effect of the failure checking, in comparison to only tracking. The results are coloured by their performance red, green, blue and orange, respectively. The dashed lines represent the results before the reinitialisation strategy is applied, solid lines are after.
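The restart procedure of this experiment can be sketched in a few lines of Python. This is only one reading of the description above, not the implementation used in the experiments: the callables `track_step`, `has_failed` and `reinit` are hypothetical stand-ins for the model free tracker plus landmark localisation, the failure detector, and the re-detection step, and the handling of the frames after the grace period is an assumption.

```python
def track_with_restarts(frames, init_state, track_step, has_failed, reinit, grace=10):
    """Track `frames`, restarting the tracker when a failure is flagged.

    All callables are hypothetical stand-ins (not from the paper):
      track_step(state, frame) -> (new_state, result)
      has_failed(result)       -> bool   # the failure detector
      reinit(frame)            -> state  # e.g. re-detection on that frame
    """
    results = [None] * len(frames)
    state = init_state
    t = 0
    while t < len(frames):
        state, results[t] = track_step(state, frames[t])
        if not has_failed(results[t]):
            t += 1
            continue
        # Failure at frame t: the drifting tracker keeps running for `grace`
        # more frames and its outputs are the ones that are reported.
        end = min(t + grace, len(frames) - 1)
        drifting = state
        for k in range(t + 1, end + 1):
            drifting, results[k] = track_step(drifting, frames[k])
        # Reinitialise at the failing frame and advance silently: no results
        # are stored and no failure checking is done in the grace window.
        state = reinit(frames[t])
        for k in range(t + 1, end + 1):
            state, _ = track_step(state, frames[k])
        t = end + 1
    return results
```

Keeping the drifted outputs for the grace window means that a tracker which fails often is still penalised for it, rather than being rescued immediately by the detector.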

Experiment 5: Kalman Smoothing
In this section, we report the effect of performing Kalman Smoothing (Kalman (1960)) on the results of the detectors of Section 4.3 and the trackers of Section 4.5. This experiment is designed to highlight the stability of the current landmark localisation methods with respect to noisy movement between frames (or jittering, as it is often known). However, when attempting to smooth the trajectories of the tracked bounding boxes themselves, we found an extremely negative effect on the results. Therefore, to remove jitter from the results we perform Kalman smoothing on the landmarks themselves. To robustly smooth the landmark trajectories, a generic facial shape model is constructed in a similar manner as described in the AAM literature by Cootes et al (2001). Specifically, given the sparse shape of the face consisting of $n$ landmark points, we denote the coordinates of the $i$-th landmark point within the Cartesian space of the image $\mathbf{I}$ as $\mathbf{x}_i = [x_i, y_i]^T$. Then a shape instance of the face is given by the $2n \times 1$ vector $\mathbf{s} = [\mathbf{x}_1^T, \ldots, \mathbf{x}_n^T]^T$.

Fig. 12: Results for Experiment 5 of Section 4.7 (Kalman Smoothing). These results show the effect of Kalman smoothing on the final landmark localisation results. The top 3 performing results are given in red, green and blue, respectively, and the top 3 most improved are given in cyan, yellow and brown, respectively. The dashed lines represent the results before the smoothing is applied, solid lines are after.
Expectation-Maximisation (EM) for the parameters p of each shape within a sequence. The results of this experiment are given in Table 10 and Figure 11. These experiments also provide a direct comparison between the best detection and model free tracking based techniques. For the videos of categories 1 and 3, the Kalman smoothing applied on DPM followed by a discriminative landmark localisation method (CFSS, ERT) outperforms all the combinations that involve model free rigid tracking. The combination of SRDCF with CFSS with Kalman smoothing achieves the best performance for Category 2.
In order to better investigate the effect of the smoothing, we also provide Figure 12 which shows the differences between the initial tracking results and the results after applying Kalman smoothing. This comparison is shown for the best methods of Table 10. It becomes obvious that the improvement introduced by Kalman smoothing is marginal.
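To make the smoothing step concrete, the following is a minimal sketch of a Kalman filter followed by a Rauch-Tung-Striebel backward pass, applied independently to each column of a trajectory (for instance, each shape parameter over time). It is not the procedure used in this experiment: the random-walk state model and the fixed noise variances `q` and `r` are assumptions made for illustration, whereas the experiment learns the smoother via EM.

```python
import numpy as np

def rts_smooth(z, q=1e-3, r=1e-1):
    """Kalman filter + RTS smoother under an assumed random-walk model.

    z : array of shape (T,) or (T, d), one observation per frame.
    q : assumed process noise variance (how fast the true value may move).
    r : assumed observation noise variance (how jittery the input is).
    """
    z = np.asarray(z, dtype=float)
    T = z.shape[0]
    x_pred = np.empty_like(z)   # predicted means
    x_filt = np.empty_like(z)   # filtered means
    p_pred = np.empty(T)        # predicted variances (shared by all columns)
    p_filt = np.empty(T)        # filtered variances

    # Forward (filtering) pass.
    x_pred[0], p_pred[0] = z[0], r
    for t in range(T):
        if t > 0:
            x_pred[t] = x_filt[t - 1]
            p_pred[t] = p_filt[t - 1] + q
        gain = p_pred[t] / (p_pred[t] + r)
        x_filt[t] = x_pred[t] + gain * (z[t] - x_pred[t])
        p_filt[t] = (1.0 - gain) * p_pred[t]

    # Backward (smoothing) pass.
    x_smooth = x_filt.copy()
    for t in range(T - 2, -1, -1):
        g = p_filt[t] / p_pred[t + 1]
        x_smooth[t] = x_filt[t] + g * (x_smooth[t + 1] - x_pred[t + 1])
    return x_smooth
```

A larger ratio `r / q` yields stronger smoothing; with poorly chosen values, the smoother lags behind genuine motion.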

300VW Comparison
In this section we provide results that compare the best performing methods of the previous sections (4.3-4.7) to the participants of the 300VW challenge by Shen et al (2015). The challenge had 5 competitors. Rajamanoharan and Cootes (2015) employ a multi-view Constrained Local Model (CLM) with a global shape model and different response maps per pose and explore shape-space clustering strategies to determine the optimal pose-specific CLM. Uricar and Franc (2015) apply a DPM at each frame as well as Kalman smoothing on the face positions. Wu and Ji (2015) utilise a shape augmented regression model, where the regression function is automatically selected based on the facial shape. Xiao et al (2015) propose a multi-stage regression-based approach that progressively provides initialisations for ambiguous landmarks such as boundary and eyebrows, based on landmarks with semantically strong meaning such as eyes and mouth corners. Finally, Yang et al (2015a) employ a multi-view spatio-temporal cascade shape regression model along with a novel reinitialisation mechanism.
The results are summarised in Table 11 and Figure 13. Note that the error metric considered in this paper (as described in Section 4.2.2) differs from that of the original competition. This was intended to improve the robustness of the results with respect to variation in pose. Also, as noted in Section 4.2, the 300VW annotations have been corrected and thus this experiment represents updated results for the 300VW competitors. The results indicate that Yang et al (2015a) outperform the rest of the methods for the videos of Categories 1 and 2, whereas a weakly supervised DPM combined with CFSS and Kalman smoothing is the top performing method for the challenging videos of Category 3. Moreover, it becomes evident that methodologies which employ face detection dominate Categories 1 and 3. Category 2 is dominated by approaches that utilise a model free tracker.

Fig. 13: Comparison between the best methods of Sections 4.3-4.7 and the participants of the 300VW challenge by Shen et al (2015). The top 5 methods are shown and are coloured red, blue, green, orange and purple, respectively. Please see Table 11 for a full summary.

Discussion and Conclusions
In Section 4 we presented a number of experiments on deformable tracking of sequences containing a single face. We investigated the performance of state-of-the-art face detectors and model free trackers on the recently released 300VW dataset. We also devised a number of hybrid systems that attempt to improve the performance of both detectors and trackers with respect to tracking failures. A summary of the proposed experiments is given in Table 4.
Overall, it appears that modern detectors are capable of handling videos of the complexity provided by the 300VW dataset. This supports the most commonly proposed deformable face tracking methodology that couples a detector with a landmark localisation algorithm. More interestingly, it appears that modern model free trackers are also highly capable of tracking videos that contain variations in pose, expression and illumination. This is particularly evident in the videos of Category 2 where the model free trackers perform the best. The performance on the videos of Category 2 is likely due to the decreased amount of pose variation in comparison to the other two categories. Category 2 contains many illumination variations, to which model free trackers appear invariant. Our work also supports the most recent model free tracking benchmarks (Kristan et al (2015) and Wu et al (2015)) which have demonstrated that DCF-based trackers are currently the most competitive. However, the performance of the trackers does deteriorate significantly in Category 3, which supports the categorisation of these videos in the 300VW as the most difficult category. The difficulty in the videos of Category 3 largely stems from the amount of pose variation present, which both detectors and model free trackers struggle with.
The DPM detector provided by Mathias et al (2014) is very robust across a variety of poses and illumination conditions. Overall, it outperformed the other methods by a fairly significant margin, particularly when failure rate is considered. Even in the most challenging videos of Category 3, the failure rate of DPM is only approximately 5%, around 3 percentage points lower than that of the next best performing method, SRDCF, at 8%. The CFSS landmark localisation method of Zhu et al (2015) outperforms all other considered landmark localisation methods, although the random forest based ERT method of Kazemi and Sullivan (2014) also performed very well. The difference between CFSS and SDM supports the findings of Zhu et al (2015) as the videos contain very challenging pose variations.
The stable performance of both the best model free trackers and detectors on these videos is further demonstrated by the minimal improvement gained from the proposed hybrid systems. Neither reinitialisation from the previous frame (Section 4.4), nor the failure detection methodology proposed (Section 4.6) improved the best performing methods significantly. Furthermore, Kalman smoothing the facial shapes across the sequences yielded only a minimal improvement.
In comparison to the recent results of the 300VW competition (Shen et al (2015)), our review of combinations of modern state-of-the-art detectors and trackers found that very strong performance can be obtained through fairly simple deformable tracking schemes. In fact, only the work of Yang et al (2015a) outperforms our best performing method and the difference shown by Figure 13 appears to be marginal, particularly in Category 3. However, the overall results show that, particularly for videos that contain significant pose variation, there are still improvements to be made.
To summarise, there are a number of important issues that must be tackled in order to improve deformable face tracking:
1. Pose is still a challenging issue for landmark localisation methods. In fact, the videos of 300VW do not even exhibit the full range of possible facial pose as they do not contain profile faces. The challenges of considering profile faces have yet to be adequately addressed and have not been verified with respect to current state-of-the-art benchmarks.
2. In this work, we only consider videos that contain a single visible face. However, there are many scenarios in which multiple faces may be present and this represents further challenges to deformable tracking. Detectors, for example, are particularly vulnerable in multi-object tracking scenarios, as they must be extended with the ability to determine whether the object being localised is the same as in the previous frame.
3. It is very common for objects to leave the frame of the camera during a sequence, and then reappear. Few model free trackers are robust to reinitialisation after an object has disappeared and then reappeared. When combined with multiple objects, this scenario becomes particularly challenging as it requires a re-identification step in order to verify whether the object to be tracked is one that was seen before.
We believe that deformable face tracking is a very exciting line of research and that future advances in the field can have an important impact on several areas of Computer Vision.

Fig. 1: Overview of the standard approaches for deformable face tracking. (Top): Face detection is applied independently at each frame of the video followed by facial landmark localisation. (Bottom): Model free tracking is employed, initialised with the bounding box of the face at the first frame, followed by facial landmark localisation.
Neural Networks. Another category, similar to the previous rigid template-based ones, includes the employment of Convolutional Neural Networks (CNNs) and Deep CNNs (DCNNs) (Osadchy et al (2007); Zhang and Zhang (2014); Ranjan et al (2015); Li et al (2015c); Yang et al (2015b)). Osadchy et al (2007) use a network with four convolution layers and one fully connected layer that rejects the non-face hypotheses and estimates the pose of the correct face hypothesis. Zhang and Zhang (2014) propose a multi-view face detection framework by employing a multi-task DCNN for face pose estimation and landmark localization in order to obtain better features for face detection. Ranjan et al (2015) combine deep pyramidal features with Deformable Part Models. Recently, Yang et al (2015b) proposed a DCNN architecture that is able to discover facial parts responses from arbitrary uncropped facial images without any part supervision and report state-of-the-art performance on current face detection benchmarks.

Deformable Part Models. DPMs (Schneiderman and Kanade (2004); Felzenszwalb and Huttenlocher (2005); Felzenszwalb et al (2010); Zhu and Ramanan (2012); Yan et al (2013); Li et al (2013a); Yan et al (2014); Mathias et al (2014); Ghiasi and Fowlkes (2014); Barbu et al ( [...]

[...] Xiong and De la Torre (2013, 2015); Dollár et al (2010); Xiong and De la Torre (2013); Cao et al (2014); Kazemi and Sullivan (2014); Ren et al (2014); Asthana et al (2014); Tzimiropoulos (2015); Zhu et al (2015)). Regression based methods can be approximately separated into two categories depending on the nature of the regression function employed. Methods that employ a linear regression such as the Supervised Descent Method (SDM) of Xiong and De la Torre (2013) tend to employ robust hand-crafted features (Xiong and De la Torre (2013); Asthana et al (2014); Xiong and De la Torre (2015); Tzimiropoulos (2015); Zhu et al (2015)). On the other hand, methods that employ tree-based regressors such as the Explicit Shape Regression (ESR) method of Cao et al (2014) tend to rely on data driven features that are optimised directly by the regressor (Burgos-Artizzu et al (2013); Cao et al (2014); Dollár et al (2010); Kazemi and Sullivan (2014)).

Fig. 2: Example frames from the 300VW dataset by Shen et al (2015). Each row contains 10 exemplar images from each category, that are indicative of the challenges that characterise the videos of the category.

Fig. 4: This figure gives a diagram of the reinitialisation scheme proposed in Section 4.4. Specifically, in case the face detector does not return a bounding box for a frame, the bounding box of the previous frame is used as a successful detection for the missing frame.

Fig. 5: Results for Experiment 2 of Section 4.4 (Detection + Landmark Localisation + Initialisation From Previous Frame). The top 5 performing curves are highlighted in each legend. Please see Table 7 for a full summary.

Fig. 8: This figure gives a diagram of the reinitialisation scheme proposed in Section 4.6 for tracking with failure detection. For all frames after the first, the result of the current landmark localisation is used to decide whether or not a face is still being tracked. If the classification fails, a re-detection is performed and the tracker is reinitialised with the bounding box returned by the detector.

Table 2: The set of trackers that are used in this paper. The table reports the short name of the method, the relevant reference as well as the link to the implementation used. The initials stand for: (D)iscriminative, (G)enerative, (P)art-based and (K)eypoint trackers.

[...] tracking is the use of sparse representations for appearance. In Mei and Ling (2011), a target candidate is represented by a sparse linear combination of target and trivial templates. The coefficients are extracted by solving an $\ell_1$ minimisation problem with non-negativity constraints, while the target templates are updated online. However, solving the $\ell_1$ minimisation for each particle is computationally expensive. A generalisation of this tracker is the work of Zhang et al (2012), which learns the representation for all particles jointly. It [...]

Table 5: Exemplar deformable tracking results that are indicative of the fitting quality that corresponds to each error value for all video categories. The Area Under the Curve (AUC) and Failure Rate for all the experiments are computed based on the Cumulative Error Distributions (CED) limited at maximum error of 0.08.
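As an illustration of how the two summary statistics can be derived from per-frame errors, the sketch below computes a CED truncated at 0.08 and reads off the AUC and failure rate. The normalisation of the AUC to the range [0, 1], the number of thresholds, and the definition of a failure as an error above the cut-off are assumptions of this sketch, not details confirmed by the text; the function name `ced_summary` is hypothetical.

```python
import numpy as np

def ced_summary(errors, max_error=0.08, n_thresholds=1000):
    """Return (AUC, failure rate) from per-frame normalised errors.

    AUC is the area under the cumulative error distribution on
    [0, max_error], divided by max_error so that a perfect method scores 1
    (assumed normalisation). The failure rate is taken to be the fraction
    of frames whose error exceeds max_error (assumed definition).
    """
    errors = np.asarray(errors, dtype=float)
    thresholds = np.linspace(0.0, max_error, n_thresholds)
    # CED: proportion of frames with error at or below each threshold.
    ced = (errors[None, :] <= thresholds[:, None]).mean(axis=1)
    # Trapezoidal integration written out explicitly.
    area = np.sum((ced[1:] + ced[:-1]) / 2.0 * np.diff(thresholds))
    auc = float(area / max_error)
    failure_rate = float(np.mean(errors > max_error))
    return auc, failure_rate
```

Reporting both numbers is useful because two methods with a similar AUC can differ considerably in how often they lose the face entirely.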

Fig. 3: Results for Experiment 1 of Section 4.3 (Detection + Landmark Localisation). The top 5 performing curves are highlighted in each legend. Please see Table 6 for a full summary.

Table 6: Results for Experiment 1 of Section 4.3 (Detection + Landmark Localisation). The Area Under the Curve (AUC) and Failure Rate are reported. The top 4 performing curves are highlighted for each video category.

Table 7: Results for Experiment 2 of Section 4.4 (Detection + Landmark Localisation + Initialisation From Previous Frame). The Area Under the Curve (AUC) and Failure Rate are reported. The top 4 performing curves are highlighted for each video category.

Table 8: Results for Experiment 3 of Section 4.5 (Model Free Tracking + Landmark Localisation).

Fig. 7: Results for Experiment 3 of Section 4.5 (Model Free Tracking + Landmark Localisation). The top 5 performing curves are highlighted in each legend. Please see Table 8 for a full summary.

Table 9: Results for Experiment 4 of Section 4.6 (Model Free Tracking + Landmark Localisation + Failure Checking). The Area Under the Curve (AUC) and Failure Rate are reported. The top 3 performing curves are highlighted for each video category.

Fig. 9: Results for Experiment 4 of Section 4.6 (Model Free Tracking + Landmark Localisation + Failure Checking). The top 5 performing curves are highlighted in each legend. Please see Table 9 for a full summary.

Table 10: Results for Experiment 5 of Section 4.7 (Kalman Smoothing). The Area Under the Curve (AUC) and Failure Rate are reported. The top 4 performing curves are highlighted for each video category.
[...] $\{\mathbf{U}_s, \bar{\mathbf{s}}\}$, consists of the orthonormal basis $\mathbf{U}_s \in \mathbb{R}^{2n \times n_s}$ with $n_s$ eigenvectors and the mean shape vector $\bar{\mathbf{s}} \in \mathbb{R}^{2n}$. This parametric model can be used to generate new shape instances as $\mathbf{s}(\mathbf{p}) = \bar{\mathbf{s}} + \mathbf{U}_s \mathbf{p}$, where $\mathbf{p} = [p_1, \ldots, p_{n_s}]^T$ is the $n_s \times 1$ vector of shape parameters that control the linear combination of the eigenvectors. The Kalman smoothing is thus learnt via

Fig. 11: Results for Experiment 5 of Section 4.7 (Kalman Smoothing). The top 5 performing curves are highlighted in each legend. Please see Table 10 for a full summary.
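The linear shape model with orthonormal basis and mean shape can be illustrated with a short NumPy sketch. It assumes the training shapes are already aligned (no alignment step is performed here) and obtains the basis from an SVD of the centred data; the function names `build_shape_model`, `project` and `reconstruct` are hypothetical and not taken from any implementation used in the paper.

```python
import numpy as np

def build_shape_model(shapes, n_components):
    """Build a PCA shape model from aligned shapes.

    shapes : array of shape (N, 2n), one flattened shape per row.
    Returns (U, mean): U is (2n, n_s) with orthonormal columns, and
    mean is the (2n,) mean shape vector.
    """
    shapes = np.asarray(shapes, dtype=float)
    mean = shapes.mean(axis=0)
    # Rows of vt are the principal directions of the centred shapes.
    _, _, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    U = vt[:n_components].T
    return U, mean

def project(U, mean, s):
    """Shape parameters p of shape s: p = U^T (s - mean)."""
    return U.T @ (np.asarray(s, dtype=float) - mean)

def reconstruct(U, mean, p):
    """Shape instance generated by parameters p: s(p) = mean + U p."""
    return mean + U @ p
```

Under these assumptions, a landmark trajectory can be projected frame by frame to a low-dimensional parameter trajectory, smoothed there, and mapped back to landmarks with `reconstruct`, which keeps the smoothed shapes within the span of the model.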

Table 11: Comparison between the best methods of Sections 4.3-4.7 and the participants of the 300VW challenge by Shen et al (2015). The Area Under the Curve (AUC) and Failure Rate are reported. The top 5 performing curves are highlighted for each video category.