1 Introduction

The eyes and their movements convey our attention, indicate our interests, and play a key role in communicating social and emotional information [1]. Estimating eye gaze is therefore an important problem for computer vision, with applications ranging from facial analysis [2] to gaze-based interfaces [3, 4]. However, estimating gaze remotely under unconstrained lighting conditions and with significant head pose remains an outstanding challenge. Appearance-based methods that directly estimate gaze from an eye image have recently improved person- and device-independent gaze estimation by learning invariances from large amounts of labelled training data. In particular, Zhang et al. trained a multi-modal convolutional neural network with 200,000 images collected during everyday laptop use [5], and Wood et al. rendered over one million synthetic training images with artificial illumination variation [6]. It has been shown that the performance of such methods heavily depends on the head pose and gaze range that the training data covers – results are best when the training data closely matches the desired test condition [7]. This means a gaze estimator trained in one scenario does not perform well in another. Instead, we would prefer a generic gaze estimator that performs well in all conditions.

Fig. 1. Our generic gaze estimator is enabled by two contributions. First, a novel 3DMM of the eye built from high-quality head scans. Second, a new method for gaze estimation – we fit our 3DMM to an image using analysis-by-synthesis, and estimate gaze from the fitted parameters.

3D morphable models (3DMMs) are a powerful tool as they combine a model of face variation with a model of image formation, allowing pose and illumination invariance. Since their introduction [8], they have become an established method for many tasks including inverse rendering [9, 10], face recognition [11, 12], and expression re-targeting [13]. Given a face image, such systems use model fitting to discover the most likely shape, texture, expression, pose, and illumination parameters that generated it. However, previous work has failed to accurately model the eyes, portraying them as static geometry [8, 11], or removing them from the face entirely [13, 14]. This is a result of two complexities that are not handled by current methods: (1) The eyeball’s materials make it difficult to reconstruct in 3D, leading to poor correspondence and loss of detail in the 3DMM; (2) Previous work uses blendshapes to model facial expression – a technique not compatible with independent eyeball movement. We make two specific contributions:

An Eye Region 3DMM. Our first contribution is a novel multi-part 3DMM that includes the eyeball, allowing us to accurately model variation in eye appearance and eyeball pose (see Fig. 1 left). Recent work presented a morphable shape model of the eye region, but did not capture texture variation [6]. We constructed a 3DMM of the facial eye region by carefully registering a set of high-quality 3D head scans, and extracting modes of shape and texture variation using PCA. We combined this with an anatomy-based eyeball model that can be posed separately to simulate changes in eye gaze.

Analysis-by-Synthesis for Gaze Estimation. Our second contribution is a novel method for gaze estimation: fitting our 3DMM to an input image using analysis-by-synthesis (see Fig. 1 right). We solve for shape, texture, pose, and illumination simultaneously, so our fitted model parameters provide us with a robust estimate of where someone is looking in a 3D scene. Previous approaches for remote RGB gaze estimation can be categorized as either appearance-based, feature-based, or model-based [3]. Our method is the first to combine the benefits of all three: (1) We minimize the appearance difference between synthesized and observed images using a dense image-error term. (2) We use sparse facial features localized with a face tracker [15] for initialization and regularization. (3) We use our morphable model to capture variation between people and eye motion itself. We iteratively fit our model using gradient descent with numerical derivatives efficiently calculated with a tailored GPU rasterizer.

Fig. 2. A comparison between the Basel Face Model (BFM, left) [11] and our own model (right). Note the BFM’s lack of caruncle and unrealistic eyeball proxy geometry. Our model has well-defined correspondences for these difficult regions.

2 Related Work

2.1 3D Morphable Models

A 3D morphable model is a statistically-derived generative model, parameterized by shape and texture coefficients. 3DMMs are closely related to their 2D analogue, active appearance models [16]. They have been successfully applied to various face-related computer vision problems ranging from reconstruction [8, 10] to recognition [11, 12], and have also been extended to other body parts, such as the hand [17] and the entire body [18, 19].

Blanz and Vetter built the first 3DMM from a set of 200 laser scans of faces with neutral expression [8]. They first computed dense correspondences between the scans, then used PCA to extract modes of variation. Subsequent work with 3DMMs has followed the same approach, building similar models with higher quality scans [11], or more training samples [12, 20]. However, despite advances in scanning technology, the eye remains problematic for 3D reconstruction, leading to poor correspondences and loss of quality in the 3DMM (see Fig. 2).

3DMMs represent a face with neutral expression, so they are often combined with a model of facial motion. Vlasic et al. used a multi-linear model to separately encode identity and expression, and demonstrated its use in facial transfer [21]. More recent works have instead used blend shapes – an animation technique that stores a different version of a mesh for each expression, and interpolates between them [14]. However, while blend shapes work well for skin, they cannot represent the independent motion of the eyeball. For these reasons, previous work either replaced the scanned eyeball with a proxy mesh [11] or completely removed the eye from the 3DMM mesh [13, 22]. Bérard et al. recently presented a 3D morphable eyeball model [23] built from a database of eyeball scans [24], showing impressive results for high-quality semi-automatic eyeball reconstruction. Our work uses a simpler model that is sufficient for low-quality input data, and our fitting procedure is fully automatic.

2.2 Remote Gaze Estimation

Gaze estimation is a well established topic in computer vision (see [3, 25] for reviews). Methods can be categorized as (1) appearance-based – map directly from image pixels to a gaze direction [5, 26, 27], (2) feature-based – localize facial feature points (e.g. pupil centre, eye corner) and map these to gaze [28, 29], or (3) model-based – estimate gaze using a geometric model of the eye [30–32]. Some systems combine these techniques, e.g. using facial features for image alignment [26, 33], mapping appearance to a 2D generative model [34], or combining head pose with image pixels in a multi-modal neural network [5]. To the best of our knowledge, no work so far has combined appearance, facial features, and a generative model into a single method, solving for shape, texture, eyeball pose, and illumination simultaneously.

The current outstanding challenge for remote RGB gaze estimation is achieving person- and device-independence under unconstrained conditions [5]. The state-of-the-art methods for this are appearance-based, attempting to learn invariances from large amounts of training data. However, such systems are still limited by their training data with respect to appearance, gaze, and head pose variation [5, 27]. To address this, recent work used graphics to synthesize large amounts of training images. These learning-by-synthesis methods cover a larger range of head pose, gaze, appearance, and illumination variation without additional costs for data collection or ground truth annotation. Specifically, Wood et al. rendered 10K images and used them to pre-train a multi-modal CNN, significantly improving upon state-of-the-art gaze estimation accuracy [7]. They later rendered 1M images with improved appearance variation for training a k-Nearest-Neighbour classifier, again improving over state-of-the-art CNN results [6].

While previous work used 3D models to synthesise training data [6], ours is the first to use analysis-by-synthesis – a technique where synthesis is used for gaze estimation itself. This approach is not constrained by a limited variation in training images but instead can, in theory, generalise to arbitrary settings. Additionally, while previous work strove for realism [7], our forward synthesis method focuses on speed in order to make analysis-by-synthesis tractable.

3 Overview

At the heart of our generic gaze estimator are two core contributions. In Sect. 4 we present our first contribution: a novel multi-part eye region 3DMM. We constructed this from 22 high-resolution face scans acquired from an online store, combined with an anatomy-based eyeball model. Our model is described by a set of parameters \(\varPhi \) that cover both geometric (shape, texture, and pose) and photometric (illumination and camera projection) variation.

Fig. 3. An overview of our fitting process: we localize landmarks L in an image and use them to initialize our 3DMM. We then use analysis-by-synthesis to render an \(I_{syn}\) that best matches \(I_{obs}\). Finally, we extract gaze \(\varvec{g}\) from the fitted parameters \(\varPhi ^*\).

In Sect. 5 we present our second contribution: analysis-by-synthesis for gaze estimation (see Fig. 3). The core idea is to fit our 3DMM to an image using analysis-by-synthesis – given an observed image \(I_{obs}\), we wish to produce a synthesized image \(I_{syn}\) that matches it. We then estimate gaze from the fitted eyeball pose parameters. Key in this process is our objective function \(E(\varPhi )\), which considers both a local dense measure of appearance similarity, as well as a holistic sparse measure of facial feature-point similarity (see Eq. 10).

4 3D Eye Region Model

Our goal is to use a 3D eye region model to synthesize an image which matches an input RGB eye image. To render synthetic views, we used a multi-part model consisting of the facial eye region and the eyeball. These were posed in a scene, illuminated, and then rendered using a model of camera projection. Our total set of model and scene parameters \(\varPhi \) are:

$$\begin{aligned} \varPhi = \left\{ \beta , \tau , \theta , \iota , \kappa \right\} , \end{aligned}$$
(1)

where \(\beta \) are the shape parameters, \(\tau \) the texture parameters, \(\theta \) the pose parameters, \(\iota \) the illumination parameters, and \(\kappa \) the camera parameters. In this section we describe each part of our model, and the parameters that affect it.
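
For concreteness, \(\varPhi \) can be kept in a small container whose grouping mirrors Eq. (1). The sketch below is illustrative only: the field layout, names, and the choice to exclude the known intrinsics \(\kappa \) from the optimized vector are our assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ModelParams:
    """Parameter set Phi = {beta, tau, theta, iota, kappa} (illustrative grouping)."""
    beta: np.ndarray   # shape: face PCA coefficients and iris diameter
    tau: np.ndarray    # texture: face PCA coefficients, iris coefficients, sclera tint
    theta: np.ndarray  # pose: global translation/rotation plus eyeball pitch and yaw
    iota: np.ndarray   # illumination: ambient colour, directional colour, light direction
    kappa: np.ndarray  # camera intrinsics (assumed known and held fixed during fitting)

    def free_parameters(self) -> np.ndarray:
        """Flatten the parameters the optimizer may change into one vector."""
        return np.concatenate([self.beta, self.tau, self.theta, self.iota])
```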

Morphable facial eye region model – \(\beta , \tau \) The first part of our model is a 3DMM of the eye region, and serves as a prior for facial appearance. While previous work used a generative shape model of the eye region [6], ours captures both shape and texture variation.

Fig. 4. We re-parameterize high-resolution 3D head scan data (left) into a more efficient lower-resolution form (right). We use a carefully designed generic eye region topology [6] for consistent correspondences and realistic animation.

We started by acquiring 22 high-quality head scans as source data. The first stage of constructing a morphable model is bringing scan data into correspondence, so a point in one face mesh is semantically equivalent to a point in another. While previous work computed a dense point-to-point correspondence from original scan data [8, 11], we compute sparse correspondences that describe 3D shape more efficiently. We manually re-parameterised each original high-resolution scan into a low resolution topology containing the eye region only (see Fig. 4). This topology does not include the eyeball, as we wish to pose that separately to simulate its independent movement. Additionally, we maintain correspondences for detailed parts, e.g. the interior eyelid margins, which are poorly defined for previous models [11]. We uv-unwrap the mesh and represent color as a texture map, coupling our low-resolution mesh with a high-resolution texture.

Fig. 5. The mean shape \(\varvec{\mu }_s\) and texture \(\varvec{\mu }_t\) along with the first four modes of variation. The first shape mode \(\varvec{U}_1\) varies between hooded and protruding eyes, and the first texture mode \(\varvec{V}_1\) varies between dark and light skin.

Following this registration, the facial eye regions are represented as a combination of 3D shape \(\varvec{s}\) (n vertices) and 2D texture \(\varvec{t}\) (m texels), encoded as 3n and 3m dimensional vectors respectively,

$$\begin{aligned} \varvec{s}&= \left[ x_1, y_1, z_1, x_2, ... y_n, z_n\right] ^T \in \mathbb {R}^{3n} \end{aligned}$$
(2)
$$\begin{aligned} \varvec{t}&= \left[ r_1, g_1, b_1, r_2, ... g_m, b_m\right] ^T \in \mathbb {R}^{3m} \end{aligned}$$
(3)

where \(x_i, y_i, z_i\) is the 3D position of the ith vertex, and \(r_j, g_j, b_j\) is the color of the jth texel. We then performed Principal Component Analysis (PCA) on our set of c registered scans to extract orthogonal shape and texture basis functions: \(\varvec{U} \in \mathbb {R}^{3n \times c}\) and \(\varvec{V} \in \mathbb {R}^{3m \times c}\). For each of the 2c shape and texture basis functions, we fit a Gaussian distribution to the original data. Using this we can construct linear models that describe variation in both shape \(\mathcal {M}_s\) and texture \(\mathcal {M}_t\),

$$\begin{aligned} \mathcal {M}_s = \left( \varvec{\mu }_s, { }\varvec{\sigma }_s, \varvec{U}\right) \qquad \mathcal {M}_t = \left( \varvec{\mu }_t, { }\varvec{\sigma }_t, \varvec{V}\right) \end{aligned}$$
(4)

where \(\varvec{\mu }_s\in \mathbb {R}^{3n}\) and \(\varvec{\mu }_t \in \mathbb {R}^{3m}\) are the average 3D shape and 2D texture, and \(\varvec{\sigma }_s = [\sigma _{s1} ... \sigma _{sc}]\) and \(\varvec{\sigma }_t = [\sigma _{t1} ... \sigma _{tc}]\) describe the Gaussian distributions of each shape and texture basis function. Figure 5 shows the mean shape and texture, along with the four most important modes of variation. Facial eye region shapes \(\varvec{s}\) and textures \(\varvec{t}\) can then be generated from shape (\(\beta _{ face } \subset \beta \)) and texture coefficients (\(\tau _{ face }\subset \tau \)) as follows:

$$\begin{aligned} \varvec{s}(\beta _{ face })&= \varvec{\mu }_s + \varvec{U} \, \text {diag} (\varvec{\sigma }_s) \, \beta _{ face } \end{aligned}$$
(5)
$$\begin{aligned} \varvec{t}(\tau _{ face })&= \varvec{\mu }_t + \varvec{V} \, \text {diag} (\varvec{\sigma }_t) \, \tau _{ face } \end{aligned}$$
(6)

From our set of \(c =22\) scans, \(90\,\%\) of shape and texture variation can be encoded in 8 shape and 7 texture coefficients. This reduction in dimensionality is important for fitting our model efficiently. Additionally, as eyelashes can provide a visual cue to gaze direction, we model them using a semi-transparent mesh controlled by a simple hair simulation [6].
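
As a concrete reading of Eqs. (5) and (6), the sketch below generates a shape and texture from coefficients expressed in units of standard deviations; the function and variable names are ours, and the arrays are assumed to come from the registered scans.

```python
import numpy as np

def sample_face(mu_s, sigma_s, U, beta_face, mu_t, sigma_t, V, tau_face):
    """Generate a facial eye region shape s and texture t from PCA coefficients.

    mu_s: (3n,) mean shape     U: (3n, c) shape basis     sigma_s: (c,) shape std-devs
    mu_t: (3m,) mean texture   V: (3m, c) texture basis   sigma_t: (c,) texture std-devs
    beta_face, tau_face: (c,) coefficients in units of standard deviations.
    """
    s = mu_s + U @ (sigma_s * beta_face)   # Eq. (5): s(beta) = mu_s + U diag(sigma_s) beta
    t = mu_t + V @ (sigma_t * tau_face)    # Eq. (6): t(tau) = mu_t + V diag(sigma_t) tau
    return s, t
```

Setting beta_face and tau_face to zero recovers the mean eye region, which matches the initialization used in Sect. 5.2, where \(\beta \) and \(\tau \) start at \(\varvec{0}\).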

Fig. 6. Our eyeball mesh, mean iris texture \(\varvec{\mu }_{ iris }\), and some examples of iris texture variation captured by our linear model \(\mathcal {M}_{ iris }\).

Parametric eyeball model – \(\mathbf {\beta , \tau }\) The second part of our multi-part model is the eyeball. Accurately recovering eyeball shape is difficult due to its complex structure [24], so instead we created a mesh using standard anatomical measurements [6] (see Fig. 6). Eyeballs vary in shape and texture between different people. We model changes in iris size geometrically, by scaling vertices on the iris boundary about the 3D iris centre as specified by iris diameter \(\beta _{ iris }\). We used a collection of aligned high-resolution iris photos to build a generative model \(\mathcal {M}_{ iris }\) of iris texture using PCA,

$$\begin{aligned} \mathcal {M}_{ iris }&= \left( \varvec{\mu }_{ iris }, { }\varvec{\sigma }_{ iris }, \varvec{W}\right) \end{aligned}$$
(7)

This can be used to generate new iris textures \(\varvec{t}_{ iris }\). As the “white” of the eye is not purely white, we model variations in sclera color by multiplying the eyeball texture with a tint color \(\tau _{tint}\in \mathbb {R}^{3}\). In reality, the eyeball has a complex layered structure with a transparent cornea covering the iris. We avoid explicitly modelling this by computing refraction effects in texture-space [6, 35].

Posing our multi-part model – \(\theta \) Global and local pose information is encoded by \(\theta \). Our model’s parts are defined in a local coordinate system with origin at the eyeball centre, so we use model-to-world transforms \(\varvec{M}_{ face }\) and \(\varvec{M}_{eye}\) to position them in a scene. The facial eye region part has degrees of freedom in translation and rotation. These are encoded as \(4 \times 4\) homogeneous transformation matrices \(\varvec{T}\) and \(\varvec{R}\), so model-to-world transform \(\varvec{M}_{ face } = \varvec{T}\varvec{R}\). The eyeball’s position is anchored to the face model, but it can rotate separately through local pitch and yaw transforms \(\varvec{R}_x(\theta _{p})\) and \(\varvec{R}_y(\theta _{y})\), giving \(\varvec{M}_{eye} = \varvec{T}\varvec{R}_x\varvec{R}_y\).
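
The transform composition above can be written directly; a minimal sketch assuming column vectors, angles in radians, and R_head standing in for the \(4 \times 4\) rotation \(\varvec{R}\) above (the helper names are ours).

```python
import numpy as np

def rot_x(a):  # pitch rotation about the x-axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]])

def rot_y(a):  # yaw rotation about the y-axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]])

def translation(t):
    T = np.eye(4)
    T[:3, 3] = t
    return T

def pose_transforms(t_world, R_head, pitch, yaw):
    """M_face = T R positions the eye region; M_eye = T R_x R_y rotates the eyeball locally."""
    T = translation(t_world)
    M_face = T @ R_head                      # R_head: 4x4 global head rotation
    M_eye = T @ rot_x(pitch) @ rot_y(yaw)    # eyeball anchored at the same origin
    return M_face, M_eye
```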

When the eye looks up or down, the eyelid follows it. Eyelid motion is modelled using procedural animation [6] – each eyelid vertex is rotated about the inter-eye-corner axis, with rotational amounts chosen to match measurements from an anatomical study [36]. As our multi-part model contains disjoint parts, we also “shrinkwrap” the eyelid skin to the eyeball, projecting eyelid vertices onto the eyeball mesh to avoid gaps and clipping issues.
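
The shrinkwrap step can be approximated by snapping skin vertices onto a sphere standing in for the eyeball mesh. The sphere approximation, the offset, and the choice of which vertices to move are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def shrinkwrap_eyelid(eyelid_verts, eye_centre, eye_radius, offset=1e-3):
    """Snap eyelid skin vertices onto the eyeball surface to avoid gaps and clipping.

    eyelid_verts: (k, 3) world-space positions of vertices near the eyeball.
    Each vertex is projected radially onto a sphere of radius eye_radius + offset,
    i.e. just above the eyeball surface (sphere used as a stand-in for the mesh).
    """
    d = eyelid_verts - eye_centre                     # vectors from the eyeball centre
    dist = np.linalg.norm(d, axis=1, keepdims=True)   # current radial distances
    return eye_centre + d / dist * (eye_radius + offset)
```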

Scene illumination – \(\iota \) As we focus on a small region of the face, we assume a simple illumination model where lighting is distant and surface materials are purely Lambertian. Our illumination model consists of an ambient light with color \(\varvec{l}_{ amb }\in \mathbb {R}^3\), and a directional light with color \(\varvec{l}_{ dir }\in \mathbb {R}^3\) and 3D direction vector \(\varvec{L}\). We do not consider specular effects, global illumination, or self-shadowing, so illumination depends only on surface normal and albedo. Radiant illumination \(\mathcal {L}\) at a point on the surface with normal \(\varvec{N}\) and albedo \(\varvec{c}\) is calculated as:

$$\begin{aligned} \mathcal {L}(\varvec{N}, \varvec{c}) = \varvec{c} \, \varvec{l}_{ amb } + \varvec{c} \, \varvec{l}_{ dir } \, (\varvec{N} \cdot \varvec{L}) \end{aligned}$$
(8)

While this model is simple, we found it to be sufficient. If we considered a larger facial region, or fit models to both eyes at once, we would explore more advanced material or illumination models, as seen in previous work [13].
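
Equation (8) reduces to a few lines of per-point shading. The sketch below assumes unit-length normal and light direction and RGB values in [0, 1]; the clamp on the dot product is our addition and is not stated in Eq. (8).

```python
import numpy as np

def shade_lambert(albedo, normal, l_amb, l_dir, light_dir):
    """Eq. (8): L(N, c) = c * l_amb + c * l_dir * (N . L); no specular or shadow terms.

    albedo, l_amb, l_dir: (3,) RGB arrays; normal, light_dir: (3,) unit vectors.
    """
    n_dot_l = max(float(np.dot(normal, light_dir)), 0.0)  # clamp back-facing points (assumption)
    return albedo * (l_amb + l_dir * n_dot_l)              # c*l_amb + c*l_dir*(N.L)
```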

Camera projection – \(\kappa \) For a complete model of image formation, we also consider camera projection. We fix our axis-aligned camera at world origin, allowing us to set our world-to-view transform as the identity \(\varvec{I}_4\). We assume knowledge of intrinsic camera calibration parameters \(\kappa \), and use these to build a full projection transform \(\varvec{P}\). A local point in our model can then be transformed into image space using the model-view-projection transform \(\varvec{P}\varvec{M}_{\{{ face }|{ eye }\}}\).
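
A minimal sketch of this projection chain, assuming a standard \(3 \times 3\) pinhole intrinsic matrix K built from \(\kappa \); since the camera sits at the world origin, world space is treated as view space, and the sign convention for the viewing axis is left to the caller.

```python
import numpy as np

def project_point(x_model, M_model, K):
    """Map a model-space point into image space via the model-view-projection chain.

    x_model: (3,) point in model space; M_model: 4x4 model-to-world transform
    (M_face or M_eye); K: 3x3 pinhole intrinsics built from kappa.
    The world-to-view transform is the identity, so world space equals view space.
    """
    x_view = (M_model @ np.append(x_model, 1.0))[:3]  # model -> world (= view) space
    u, v, w = K @ x_view                              # apply intrinsics
    return np.array([u / w, v / w])                   # perspective divide
```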

Fig. 7. We measure dense image-similarity as the mean absolute error between \(I_{ obs }\) and \(I_{ syn }\), over a mask of rendered foreground pixels P (white). We ignore error for background pixels (black).

5 Analysis-by-Synthesis for Gaze Estimation

Given an observed image \(I_{ obs }\), we wish to produce a synthesized image \(I_{ syn }\left( \varPhi ^*\right) \) that best matches it. 3D gaze direction \(\varvec{g}\) can then be extracted from eyeball pose parameters. We search for optimal model parameters \(\varPhi ^*\) using analysis-by-synthesis. To do this, we iteratively render a synthetic image \(I_{ syn }\left( \varPhi \right) \), compare it to \(I_{ obs }\) using our energy function, and update \(\varPhi \) accordingly. We cast this as an unconstrained energy minimization problem for unknown \(\varPhi \).

$$\begin{aligned} \varPhi ^* = \mathop {{{\mathrm{arg\!min}}}}\limits _\varPhi \, E(\varPhi ) \end{aligned}$$
(9)

5.1 Objective Function

Our energy is formulated as a combination of a dense image similarity metric \(E_{ image }\) that measures differences in image appearance, and a sparse landmark similarity metric \(E_{ ldmks }\) that regularizes our model against reliable facial feature points, with a weight \(\lambda \) controlling their relative importance.

$$\begin{aligned} E(\varPhi ) = E_{ image }(\varPhi ) + \lambda \cdot E_{ ldmks }(\varPhi ,\,L) \end{aligned}$$
(10)

Image similarity metric. Our primary goal is to minimise the difference between \(I_{ syn }\) and \(I_{ obs }\). This can be seen as an ideal energy function: if \(I_{ syn } = I_{ obs }\), our model must have perfectly fit the data, so virtual and real eyeballs should be aligned. We approach this by including a dense photo-consistency term \(E_{ image }\) in our energy function. However, as the 3DMM in \(I_{ syn }\) does not cover all of \(I_{ obs }\), we split our image into two regions: a set of rendered foreground pixels P that we compute error over, and a set of background pixels that we ignore (see Fig. 7). Image similarity is then computed as the mean absolute difference between \(I_{ syn }\) and \(I_{ obs }\) for foreground pixels \(p \in P\).

$$\begin{aligned} E_{ image }(\varPhi ) = \frac{1}{\left| P\right| } \sum _{p \in P} \, \left| I_{ syn }(\varPhi , p) - I_{ obs }(p) \right| \end{aligned}$$
(11)
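
In code, Eq. (11) is a masked mean absolute difference. The sketch assumes float images and a boolean foreground mask produced by the rasterizer; averaging over colour channels as well as pixels is our simplification.

```python
import numpy as np

def image_energy(i_syn, i_obs, foreground):
    """E_image: mean absolute error over rendered foreground pixels P only (Eq. 11).

    i_syn, i_obs: (H, W, 3) float images; foreground: (H, W) boolean mask of P.
    """
    diff = np.abs(i_syn - i_obs)              # per-pixel, per-channel absolute difference
    return float(diff[foreground].mean())     # background pixels are ignored
```
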
Fig. 8. \(I_{ obs }\) with landmarks L (white dots), and model fits with our landmark similarity term (top) and without (bottom). Note how it prevents erroneous drift in global pose, eye region shape, and local eyelid pose.

Landmark similarity metric. The face contains important landmark feature points that can be localized reliably [13]. These can be used to efficiently consider the appearance of the whole face, as well as the local appearance of the eye region. We use a state-of-the-art face tracker [15] to localize 14 landmarks L around the eye region in image-space (see Fig. 8). For each landmark \(l\in L\) we compute a corresponding synthesized landmark \(l^\prime \) using our 3DMM. The sparse landmark-similarity term is calculated as the distance between both sets of landmarks, normalized by the foreground area to avoid bias from image or eye region size. This acts as a regularizer to prevent our pose \(\theta \) from drifting too far from a reliable estimate.

$$\begin{aligned} E_{ ldmks }(\varPhi ,\,L) = \frac{1}{\left| L\right| } \sum _{i = 1}^{|L|} \, ||l_i - l^\prime _i ||\end{aligned}$$
(12)
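
The landmark term and the combined objective can be sketched as below, reusing image_energy from the previous sketch. Folding the foreground-area normalization mentioned above into E_ldmks in this particular way is our assumption about where it enters.

```python
import numpy as np

def landmark_energy(l_obs, l_syn, num_foreground):
    """E_ldmks: mean distance between tracked and synthesized landmarks (Eq. 12),
    normalized by the foreground pixel count to remove bias from image/eye size."""
    dists = np.linalg.norm(l_obs - l_syn, axis=1)   # l_obs, l_syn: (|L|, 2) positions
    return float(dists.mean()) / num_foreground

def total_energy(i_syn, i_obs, foreground, l_obs, l_syn, lam):
    """E(Phi) = E_image + lambda * E_ldmks (Eq. 10)."""
    e_img = image_energy(i_syn, i_obs, foreground)
    e_ldm = landmark_energy(l_obs, l_syn, int(foreground.sum()))
    return e_img + lam * e_ldm
```
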
Fig. 9. Example model fits on gaze datasets Eyediap [39] (HD and VGA) and Columbia [40], showing estimated gaze (yellow) and labelled gaze (blue).

5.2 Optimization Procedure

We fit our model to the subject’s left eye. This is a challenging non-convex, high-dimensional optimization problem. To approach it we use gradient descent (GD) with an annealed step size. Calculating analytic derivatives for a scene as complex as our eye region is challenging due to occlusions. We therefore use numeric central derivatives \(\nabla E\) to guide our optimization procedure:

$$\begin{aligned} \varPhi _{i+1} = \varPhi _i - \varvec{t} \cdot r^i \, \nabla E(\varPhi _{i}) \qquad \text {where} \end{aligned}$$
(13)
$$\begin{aligned} \nabla E(\varPhi _{i}) = \left( \frac{\partial E}{\partial \phi _1} \ldots \frac{\partial E}{\partial \phi _{|\varPhi |}}\right) \quad \text {and} \quad \frac{\partial E}{\partial \phi _j} = \frac{E(\varPhi _{i} + h_j) - E(\varPhi _{i} - h_j)}{2 h_j} \end{aligned}$$
(14)

where \(\varvec{t} = [t_1...t_{|\varPhi |}]\) are per-parameter step sizes, \(\varvec{h} = [h_1...h_{|\varPhi |}]\) are per-parameter finite-difference offsets, and r is the annealing rate. \(\varvec{t}\) and \(\varvec{h}\) were calibrated through experimentation. We explored alternative optimization techniques, including L-BFGS [37], rprop [38], and momentum variants of GD, but we found these to be less stable, perhaps due to our use of numerical rather than analytical derivatives. Computing our gradients is expensive, requiring rendering and differencing two images per parameter. Their efficient computation is possible with our tailored GPU DirectX rasterizer, which can render \(I_{ syn }\) at over 5000 fps.
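
The update rule of Eqs. (13) and (14) can be sketched in plain NumPy. The energy callable is assumed to render \(I_{syn}(\varPhi )\) internally, the per-parameter step sizes t and offsets h are taken as given, and the GPU rasterizer that makes this affordable is not shown.

```python
import numpy as np

def numerical_gradient(energy, phi, h):
    """Central differences, Eq. (14): dE/dphi_j = (E(phi + h_j) - E(phi - h_j)) / (2 h_j)."""
    grad = np.zeros_like(phi)
    for j in range(len(phi)):
        e_j = np.zeros_like(phi)
        e_j[j] = h[j]                   # perturb only the j-th parameter
        grad[j] = (energy(phi + e_j) - energy(phi - e_j)) / (2.0 * h[j])
    return grad

def fit(energy, phi0, t, h, r=0.95, iters=60):
    """Annealed gradient descent, Eq. (13): phi_{i+1} = phi_i - t * r^i * grad E(phi_i).

    energy: callable returning E(phi) (renders I_syn(phi) internally);
    t, h: per-parameter step sizes and finite-difference offsets; r: annealing rate.
    """
    phi = phi0.copy()
    for i in range(iters):
        phi = phi - t * (r ** i) * numerical_gradient(energy, phi, h)
    return phi
```

Each iteration needs two renders per parameter for the central differences, which is why the fast tailored rasterizer described above matters.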

Initialization. As we perform local optimization, we require an initial model configuration to start from. We use 3D eye corner landmarks and head rotation from the face tracker [15] to initialize \(\varvec{T}\) and \(\varvec{R}\). We then use 2D iris landmarks and a single sphere eyeball model to initialize gaze [2]. \(\beta \) and \(\tau \) are initialized to \(\varvec{0}\), and illumination \(\varvec{l}_{amb}\) and \(\varvec{l}_{dir}\) are set to [0.8, 0.8, 0.8].

Runtime. Figure 10 shows convergence for a typical input image, with \(I_{ obs }\) size \(800 \times 533\)px and \(I_{ syn }\) size \(125 \times 87\)px. We converge after 60 iterations for 39 parameters, taking 3.69 s on a typical PC (3.3 GHz CPU, GTX 660 GPU).

5.3 Extracting Gaze Direction

Our task is estimating 3D gaze direction \(\varvec{g}\) in camera-space. Once our fitting procedure has converged, \(\varvec{g}\) can be extracted by applying the eyeball model transform to a vector pointing along the optical axis in model-space: \(\varvec{g} = \varvec{M}_{eye} \left[ 0, 0, -1\right] ^T\).
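
In code, the extraction is a single rotation of the model-space optical axis. Applying only the rotational part of \(\varvec{M}_{eye}\) (so the translation does not affect the direction) and normalizing the result is our reading of the formula.

```python
import numpy as np

def extract_gaze(M_eye):
    """g = M_eye [0, 0, -1]^T: rotate the optical axis from model space into camera space."""
    g = M_eye[:3, :3] @ np.array([0.0, 0.0, -1.0])   # rotation only: g is a direction
    return g / np.linalg.norm(g)
```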

6 Experiments

We evaluated our approach on two publicly available eye gaze datasets: Columbia [40] and Eyediap [39]. We chose these datasets as they show the full face, as required for our facial-landmark based initialization.

Columbia contains images of 56 people looking at a target grid on the wall. The participants were constrained by a head-clamp device, and images were taken from five different head orientations (from \(-30^\circ \) to \(30^\circ \)). Example fits can be seen in Fig. 9 right. In our experiments we used a subset of 34 people (excluding those with eyeglasses) with 20 images per person, resulting in 680 images. As the images were taken by a high quality camera (\(5184\times 3456\)px), we downsampled them to \(800\times 533\)px for faster processing.

Eyediap contains videos of 16 participants looking at two types of targets: screen targets on a monitor; and floating physical targets. Recordings were made with two cameras: a VGA camera (\(640\times 480\)px) below the screen, and an HD camera (\(1920\times 1080\)px) placed to the side. Example fits can be seen in Fig. 9 left. Participants displayed both static and free head motion. We extracted images from the VGA videos for our experiment – 622 images with screen targets and 500 images with floating targets. In both cases we used a gradient descent step size of 0.0025 with an annealing rate of 0.95 that started after the \(10^\text {th}\) iteration.

6.1 Gaze Estimation

In the first experiment we evaluated how well our method predicts gaze direction for Columbia. The results are shown in Fig. 10, giving average gaze error of \(\text {M} = 8.87^\circ , \text {Mdn}=7.54^\circ \) after convergence. As we do not impose a prior on predicted gaze distribution, our system can produce outliers with extreme error, so we believe its performance is best represented by the median (Mdn). Note how the decrease in fitting error corresponds to a monotonic decrease in mean and median gaze errors. Furthermore, our approach outperforms the geometric approach used to initialize it [2], a recently proposed k-Nearest-Neighbour approach [6] (\(\text {M}=19.9^\circ , \text {Mdn}=19.5^\circ \)), and a naïve model that always predicts forwards gaze (\(\text {M}=12.00^\circ , \text {Mdn}=11.17^\circ \)).

Fig. 10. Fitting error (left) and gaze estimation error (right). Note how gaze error improves from the initial estimate. Filled regions show inter-quartile range.

Fig. 11. Fitting (blue) and gaze estimation (red) error on Eyediap (VGA). We outperform a state-of-the-art CNN [5]. Additionally, the CNN was not able to generalize to the floating target condition, while ours can. (Color figure online)

Table 1. We outperform state-of-the-art cross-dataset methods trained on UT [27] and synthetic data [6]: CNN [5], Random Forests (RF) [27], kNN [5], Adaptive Linear Regression (ALR) [33], and Support Vector Regression (SVR) [26].

The results for Eyediap VGA images can be seen in Fig. 11. As before, the decrease in pixel error corresponds to a decrease in gaze error. Furthermore, our final gaze estimation error on the Eyediap screen condition (\(\text {M}=9.44^\circ , \text {Mdn}=8.63^\circ \)) outperforms the best previously reported result (\(p < .0001\), independent t-test) – \(10.5^\circ \) using a Convolutional Neural Network [5]. See Table 1 for other comparisons. We also outperform the initialization model, a kNN model (\(\text {M}=21.49^\circ , \text {Mdn}=20.93^\circ \)), and a naïve model (\(\text {M}=12.62^\circ , \text {Mdn}=12.79^\circ \)). The results for floating targets are less accurate but still improve upon our initialization baseline. Zhang et al. [5] did not evaluate on floating targets due to head pose variations not present in their training set. Despite a drop in accuracy, our method can still generalize to this difficult scenario and outperforms a kNN model (\(\text {M}=30.85^\circ , \text {Mdn}=28.92^\circ \)) and a naïve model (\(\text {M}=31.4^\circ , \text {Mdn}=31.37^\circ \)).

We performed a similar experiment for Eyediap HD images that exhibit head pose, achieving a gaze error of \(\text {M}=11.0^\circ , \text {Mdn}=10.4^\circ \) for screen targets and \(\text {M}=22.2^\circ , \text {Mdn}=19.0^\circ \) for floating targets. Despite extreme head pose and gaze range, we still perform comparably with the state-of-the-art and outperform a kNN model (\(\text {M}=29.39^\circ , \text {Mdn}=28.62^\circ \) for screen, and \(\text {M}=34.6^\circ , \text {Mdn}=33.19^\circ \) for floating target), and a naïve model (\(\text {M}=22.67^\circ , \text {Mdn}=22.06^\circ \) for screen, and \(\text {M}=35.08^\circ , \text {Mdn}=34.35^\circ \) for floating target).

6.2 Morphable Model Evaluation

In addition to evaluating our system’s gaze estimation capabilities, we performed experiments to measure the expressive power of our morphable model and the effect of including \(E_{ldmks}\) in our objective function.

Fig. 12. As we include more shape and texture principal components (PCs) in the facial morphable model, both fitting and gaze error decrease. Also note the effect of our landmark regularization term \(\lambda \), which decreases the error (and its standard deviation) by not allowing the fit to drift.

First, we assessed the importance of our facial point similarity weight (\(\lambda \)) to gaze estimation accuracy on the Columbia dataset. We used the same fitting strategy, but varied \(\lambda \). Results can be seen in Fig. 12 (right). It is clear that \(\lambda \) has a positive impact on gaze estimation accuracy, by not allowing fits to drift too far from the reliable estimates and by reducing the variance of the error.

Second, we wanted to see if modelling more degrees of shape and appearance variation led to better image fitting and gaze estimation. We therefore varied the number of shape (\(\beta \)) and texture (\(\tau \)) principal components (PCs) that our model was allowed to use during fitting on Columbia. We varied both the texture and shape PCs together, using the same number for both. As seen in Fig. 12 (left), more PCs lead to better image fitting error, as \(I_{syn}\) matches \(I_{obs}\) better when allowed more variation. A similar downward trend can be seen for gaze error, suggesting better modelling of nearby facial shape and texture is important for correctly aligning the eyeball model, and thus determining gaze direction.

7 Conclusion

We presented the first multi-part 3D morphable model of the eye region. It includes a separate eyeball model, allowing us to capture gaze – a facial expression not captured by previous systems [13, 14]. We then presented a novel approach for gaze estimation: fitting our model to an image with analysis-by-synthesis, and extracting the gaze direction from fitted parameters. Our method is the first to jointly optimize a dense image metric, a sparse feature metric, and a generative 3D model together for gaze estimation. It generalizes to different quality images and wide gaze ranges, and outperforms a state-of-the-art CNN method [5].

Limitations still remain. While other gaze estimation systems can operate in real time [2, 5], ours takes several seconds per image. However, previous analysis-by-synthesis systems have been made real time through careful engineering [41]; we believe this is possible for our method too. Our method can also become trapped in local minima (see Fig. 8). To avoid this and improve robustness, we plan to fit both eyes simultaneously in future work.