1 Introduction

The eyes and their movements convey our attention, indicate our interests, and play a key role in communicating social and emotional information [1]. Estimating eye gaze is therefore an important problem for computer vision, with applications ranging from facial analysis [2] to gaze-based interfaces [3, 4]. However, estimating gaze remotely under unconstrained lighting conditions and with significant head pose remains an outstanding challenge. Appearance-based methods that directly estimate gaze from an eye image have recently improved person- and device-independent gaze estimation by learning invariances from large amounts of labelled training data. In particular, Zhang et al. trained a multi-modal convolutional neural network with 200,000 images collected during everyday laptop use [5], and Wood et al. rendered over one million synthetic training images with artificial illumination variation [6]. It has been shown that the performance of such methods heavily depends on the head pose and gaze range that the training data covers – results are best when the training data closely matches the desired test condition [7]. This means a gaze estimator trained in one scenario does not perform well in another. Instead, we would prefer a generic gaze estimator that performs well in all conditions.

Fig. 1. Our generic gaze estimator is enabled by two contributions. First, a novel 3DMM of the eye built from high-quality head scans. Second, a new method for gaze estimation – we fit our 3DMM to an image using analysis-by-synthesis, and estimate gaze from the fitted parameters.

3D morphable models (3DMMs) are a powerful tool as they combine a model of face variation with a model of image formation, allowing pose and illumination invariance. Since their introduction [8], they have become an established method for many tasks including inverse rendering [9, 10], face recognition [11, 12], and expression re-targeting [13]. Given a face image, such systems use model fitting to discover the most likely shape, texture, expression, pose, and illumination parameters that generated it. However, previous work has failed to accurately model the eyes, portraying them as static geometry [8, 11], or removing them from the face entirely [13, 14]. This is a result of two complexities that are not handled by current methods: (1) The eyeball’s materials make it difficult to reconstruct in 3D, leading to poor correspondence and loss of detail in the 3DMM; (2) Previous work uses blendshapes to model facial expression – a technique not compatible with independent eyeball movement. We make two specific contributions:

An Eye Region 3DMM. Our first contribution is a novel multi-part 3DMM that includes the eyeball, allowing us to accurately model variation in eye appearance and eyeball pose (see Fig. 1 left). Recent work presented a morphable shape model of the eye region, but did not capture texture variation [6]. We constructed a 3DMM of the facial eye region by carefully registering a set of high-quality 3D head scans, and extracting modes of shape and texture variation using PCA. We combined this with an anatomy-based eyeball model that can be posed separately to simulate changes in eye gaze.

Analysis-by-Synthesis for Gaze Estimation. Our second contribution is a novel method for gaze estimation: fitting our 3DMM to an input image using analysis-by-synthesis (see Fig. 1 right). We solve for shape, texture, pose, and illumination simultaneously, so our fitted model parameters provide us with a robust estimate of where someone is looking in a 3D scene. Previous approaches for remote RGB gaze estimation can be categorized as either appearance-based, feature-based, or model-based [3]. Our method is the first to combine the benefits of all three: (1) We minimize the appearance difference between synthesized and observed images using a dense image-error term. (2) We use sparse facial features localized with a face tracker [15] for initialization and regularization. (3) We use our morphable model to capture variation between people and eye motion itself. We iteratively fit our model using gradient descent with numerical derivatives efficiently calculated with a tailored GPU rasterizer.

Fig. 2. A comparison between the Basel Face Model (BFM, left) [11] and our own model (right). Note the BFM’s lack of caruncle and unrealistic eyeball proxy geometry. Our model has well-defined correspondences for these difficult regions.

2 Related Work

2.1 3D Morphable Models

A 3D morphable model is a statistically-derived generative model, parameterized by shape and texture coefficients. 3DMMs are closely related to their 2D analogue, active appearance models [16]. They have been successfully applied to various face-related computer vision problems ranging from reconstruction [8, 10] to recognition [11, 12], and have also been extended to other body parts, such as the hand [17] and the entire body [18, 19].

Blanz and Vetter built the first 3DMM from a set of 200 laser scans of faces with neutral expression [8]. They first computed dense correspondences between the scans, then used PCA to extract modes of variation. Subsequent work with 3DMMs has followed the same approach, building similar models with higher quality scans [11], or more training samples [12, 20]. However, despite advances in scanning technology, the eye remains problematic for 3D reconstruction, leading to poor correspondences and loss of quality in the 3DMM (see Fig. 2).

3DMMs represent a face with neutral expression, so they are often combined with a model of facial motion. Vlasic et al. used a multi-linear model to separately encode identity and expression, and demonstrated its use in facial transfer [21]. More recent works have instead used blend shapes – an animation technique that stores a different version of a mesh for each expression, and interpolates between them [14]. However, while blend shapes work well for skin, they cannot represent the independent motion of the eyeball. For these reasons, previous work either replaced the scanned eyeball with a proxy mesh [11] or completely removed the eye from the 3DMM mesh [13, 22]. Bérard et al. recently presented a 3D morphable eyeball model [23] built from a database of eyeball scans [24], showing impressive results for high-quality semi-automatic eyeball reconstruction. Our work uses a simpler model that is sufficient for low-quality input data, and our fitting procedure is fully automatic.

2.2 Remote Gaze Estimation

Gaze estimation is a well established topic in computer vision (see [3, 25] for reviews). Methods can be categorized as (1) appearance-based – map directly from image pixels to a gaze direction [5, 26, 27], (2) feature-based – localize facial feature points (e.g. pupil centre, eye corner) and map these to gaze [28, 29], or (3) model-based – estimate gaze using a geometric model of the eye [30–32]. Some systems combine these techniques, e.g. using facial features for image alignment [26, 33], mapping appearance to a 2D generative model [34], or combining head pose with image pixels in a multi-modal neural network [5]. To the best of our knowledge, no work so far has combined appearance, facial features, and a generative model into a single method, solving for shape, texture, eyeball pose, and illumination simultaneously.

The current outstanding challenge for remote RGB gaze estimation is achieving person- and device-independence under unconstrained conditions [5]. The state-of-the-art methods for this are appearance-based, attempting to learn invariances from large amounts of training data. However, such systems are still limited by their training data with respect to appearance, gaze, and head pose variation [5, 27]. To address this, recent work used graphics to synthesize large amounts of training images. These learning-by-synthesis methods cover a larger range of head pose, gaze, appearance, and illumination variation without additional costs for data collection or ground truth annotation. Specifically, Wood et al. rendered 10K images and used them to pre-train a multi-modal CNN, significantly improving upon state-of-the-art gaze estimation accuracy [7]. They later rendered 1M images with improved appearance variation for training a k-Nearest-Neighbour classifier, again improving over state-of-the-art CNN results [6].

While previous work used 3D models to synthesise training data [6], ours is the first to use analysis-by-synthesis – a technique where synthesis is used for gaze estimation itself. This approach is not constrained by a limited variation in training images but instead can, in theory, generalise to arbitrary settings. Additionally, while previous work strove for realism [7], our forward synthesis method focuses on speed in order to make analysis-by-synthesis tractable.

3 Overview

At the heart of our generic gaze estimator are two core contributions. In Sect. 4 we present our first contribution: a novel multi-part eye region 3DMM. We constructed this from 22 high-resolution face scans acquired from an online store, combined with an anatomy-based eyeball model. Our model is described by a set of parameters \(\varPhi \) that cover both geometric (shape, texture, and pose) and photometric (illumination and camera projection) variation.

Fig. 3. An overview of our fitting process: we localize landmarks L in an image and use them to initialize our 3DMM. We then use analysis-by-synthesis to render an \(I_{syn}\) that best matches \(I_{obs}\). Finally, we extract gaze \(\varvec{g}\) from the fitted parameters \(\varPhi ^*\).

In Sect. 5 we present our second contribution: analysis-by-synthesis for gaze estimation (see Fig. 3). The core idea is to fit our 3DMM to an image using analysis-by-synthesis – given an observed image \(I_{obs}\), we wish to produce a synthesized image \(I_{syn}\) that matches it. We then estimate gaze from the fitted eyeball pose parameters. Key in this process is our objective function \(E(\varPhi )\), which considers both a local dense measure of appearance similarity, as well as a holistic sparse measure of facial feature-point similarity (see Eq. 10).

4 3D Eye Region Model

Our goal is to use a 3D eye region model to synthesize an image which matches an input RGB eye image. To render synthetic views, we used a multi-part model consisting of the facial eye region and the eyeball. These were posed in a scene, illuminated, and then rendered using a model of camera projection. Our total set of model and scene parameters \(\varPhi \) are:

$$\begin{aligned} \varPhi = \left\{ \beta , \tau , \theta , \iota , \kappa \right\} , \end{aligned}$$
(1)

where \(\beta \) are the shape parameters, \(\tau \) the texture parameters, \(\theta \) the pose parameters, \(\iota \) the illumination parameters, and \(\kappa \) the camera parameters. In this section we describe each part of our model, and the parameters that affect it.
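
For concreteness, \(\varPhi \) can be kept in a small container whose grouping mirrors Eq. (1). The sketch below is illustrative only: the field layout, names, and the choice to exclude the known intrinsics \(\kappa \) from the optimized vector are our assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ModelParams:
    """Parameter set Phi = {beta, tau, theta, iota, kappa} (illustrative grouping)."""
    beta: np.ndarray   # shape: face PCA coefficients and iris diameter
    tau: np.ndarray    # texture: face PCA coefficients, iris coefficients, sclera tint
    theta: np.ndarray  # pose: global translation/rotation plus eyeball pitch and yaw
    iota: np.ndarray   # illumination: ambient colour, directional colour, light direction
    kappa: np.ndarray  # camera intrinsics (assumed known and held fixed during fitting)

    def free_parameters(self) -> np.ndarray:
        """Flatten the parameters the optimizer may change into one vector."""
        return np.concatenate([self.beta, self.tau, self.theta, self.iota])
```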

Morphable facial eye region model – \(\beta , \tau \) The first part of our model is a 3DMM of the eye region, and serves as a prior for facial appearance. While previous work used a generative shape model of the eye region [6], ours captures both shape and texture variation.

Fig. 4. We re-parameterize high-resolution 3D head scan data (left) into a more efficient lower-resolution form (right). We use a carefully designed generic eye region topology [6] for consistent correspondences and realistic animation.

We started by acquiring 22 high-quality head scans as source data. The first stage of constructing a morphable model is bringing scan data into correspondence, so a point in one face mesh is semantically equivalent to a point in another. While previous work computed a dense point-to-point correspondence from original scan data [8, 11], we compute sparse correspondences that describe 3D shape more efficiently. We manually re-parameterised each original high-resolution scan into a low resolution topology containing the eye region only (see Fig. 4). This topology does not include the eyeball, as we wish to pose that separately to simulate its independent movement. Additionally, we maintain correspondences for detailed parts, e.g. the interior eyelid margins, which are poorly defined for previous models [11]. We uv-unwrap the mesh and represent color as a texture map, coupling our low-resolution mesh with a high-resolution texture.

Fig. 5. The mean shape \(\varvec{\mu }_s\) and texture \(\varvec{\mu }_t\) along with the first four modes of variation. The first shape mode \(\varvec{U}_1\) varies between hooded and protruding eyes, and the first texture mode \(\varvec{V}_1\) varies between dark and light skin.

Following this registration, the facial eye regions are represented as a combination of 3D shape \(\varvec{s}\) (n vertices) and 2D texture \(\varvec{t}\) (m texels), encoded as 3n and 3m dimensional vectors respectively,

$$\begin{aligned} \varvec{s}&= \left[ x_1, y_1, z_1, x_2, ... y_n, z_n\right] ^T \in \mathbb {R}^{3n} \end{aligned}$$
(2)
$$\begin{aligned} \varvec{t}&= \left[ r_1, g_1, b_1, r_2, ... g_m, b_m\right] ^T \in \mathbb {R}^{3m} \end{aligned}$$
(3)

where \(x_i, y_i, z_i\) is the 3D position of the ith vertex, and \(r_j, g_j, b_j\) is the color of the jth texel. We then performed Principal Component Analysis (PCA) on our set of c registered scans to extract orthogonal shape and texture basis functions: \(\varvec{U} \in \mathbb {R}^{3n \times c}\) and \(\varvec{V} \in \mathbb {R}^{3m \times c}\). For each of the 2c shape and texture basis functions, we fit a Gaussian distribution to the original data. Using this we can construct linear models that describe variation in both shape \(\mathcal {M}_s\) and texture \(\mathcal {M}_t\),

$$\begin{aligned} \mathcal {M}_s = \left( \varvec{\mu }_s, { }\varvec{\sigma }_s, \varvec{U}\right) \qquad \mathcal {M}_t = \left( \varvec{\mu }_t, { }\varvec{\sigma }_t, \varvec{V}\right) \end{aligned}$$
(4)

where \(\varvec{\mu }_s\in \mathbb {R}^{3n}\) and \(\varvec{\mu }_t \in \mathbb {R}^{3m}\) are the average 3D shape and 2D texture, and \(\varvec{\sigma }_s = [\sigma _{s1} ... \sigma _{sc}]\) and \(\varvec{\sigma }_t = [\sigma _{t1} ... \sigma _{tc}]\) describe the Gaussian distributions of each shape and texture basis function. Figure 5 shows the mean shape and texture, along with the four most important modes of variation. Facial eye region shapes \(\varvec{s}\) and textures \(\varvec{t}\) can then be generated from shape (\(\beta _{ face } \subset \beta \)) and texture coefficients (\(\tau _{ face }\subset \tau \)) as follows:

$$\begin{aligned} \varvec{s}(\beta _{ face })&= \varvec{\mu }_s + \varvec{U} \, \text {diag} (\varvec{\sigma }_s) \, \beta _{ face } \end{aligned}$$
(5)
$$\begin{aligned} \varvec{t}(\tau _{ face })&= \varvec{\mu }_t + \varvec{V} \, \text {diag} (\varvec{\sigma }_t) \, \tau _{ face } \end{aligned}$$
(6)

From our set of \(c =22\) scans, \(90\,\%\) of shape and texture variation can be encoded in 8 shape and 7 texture coefficients. This reduction in dimensionality is important for fitting our model efficiently. Additionally, as eyelashes can provide a visual cue to gaze direction, we model them using a semi-transparent mesh controlled by a simple hair simulation [6].
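
As a concrete reading of Eqs. (5) and (6), the sketch below generates a shape and texture from coefficients expressed in units of standard deviations; the function and variable names are ours, and the arrays are assumed to come from the registered scans.

```python
import numpy as np

def sample_face(mu_s, sigma_s, U, beta_face, mu_t, sigma_t, V, tau_face):
    """Generate a facial eye region shape s and texture t from PCA coefficients.

    mu_s: (3n,) mean shape     U: (3n, c) shape basis     sigma_s: (c,) shape std-devs
    mu_t: (3m,) mean texture   V: (3m, c) texture basis   sigma_t: (c,) texture std-devs
    beta_face, tau_face: (c,) coefficients in units of standard deviations.
    """
    s = mu_s + U @ (sigma_s * beta_face)   # Eq. (5): s(beta) = mu_s + U diag(sigma_s) beta
    t = mu_t + V @ (sigma_t * tau_face)    # Eq. (6): t(tau) = mu_t + V diag(sigma_t) tau
    return s, t
```

Setting beta_face and tau_face to zero recovers the mean eye region, which matches the initialization used in Sect. 5.2, where \(\beta \) and \(\tau \) start at \(\varvec{0}\).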

Fig. 6. Our eyeball mesh, mean iris texture \(\varvec{\mu }_{ iris }\), and some examples of iris texture variation captured by our linear model \(\mathcal {M}_{ iris }\).

Parametric eyeball model – \(\mathbf {\beta , \tau }\) The second part of our multi-part model is the eyeball. Accurately recovering eyeball shape is difficult due to its complex structure [24], so instead we created a mesh using standard anatomical measurements [6] (see Fig. 6). Eyeballs vary in shape and texture between different people. We model changes in iris size geometrically, by scaling vertices on the iris boundary about the 3D iris centre as specified by iris diameter \(\beta _{ iris }\). We used a collection of aligned high-resolution iris photos to build a generative model \(\mathcal {M}_{ iris }\) of iris texture using PCA,

$$\begin{aligned} \mathcal {M}_{ iris }&= \left( \varvec{\mu }_{ iris }, { }\varvec{\sigma }_{ iris }, \varvec{W}\right) \end{aligned}$$
(7)

This can be used to generate new iris textures \(\varvec{t}_{ iris }\). As the “white” of the eye is not purely white, we model variations in sclera color by multiplying the eyeball texture with a tint color \(\tau _{tint}\in \mathbb {R}^{3}\). In reality, the eyeball has a complex layered structure with a transparent cornea covering the iris. We avoid explicitly modelling this by computing refraction effects in texture-space [6, 35].

Posing our multi-part model – \(\theta \) Global and local pose information is encoded by \(\theta \). Our model’s parts are defined in a local coordinate system with origin at the eyeball centre, so we use model-to-world transforms \(\varvec{M}_{ face }\) and \(\varvec{M}_{eye}\) to position them in a scene. The facial eye region part has degrees of freedom in translation and rotation. These are encoded as \(4 \times 4\) homogeneous transformation matrices \(\varvec{T}\) and \(\varvec{R}\), so model-to-world transform \(\varvec{M}_{ face } = \varvec{T}\varvec{R}\). The eyeball’s position is anchored to the face model, but it can rotate separately through local pitch and yaw transforms \(\varvec{R}_x(\theta _{p})\) and \(\varvec{R}_y(\theta _{y})\), giving \(\varvec{M}_{eye} = \varvec{T}\varvec{R}_x\varvec{R}_y\).
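
The transform composition above can be written directly; a minimal sketch assuming column vectors, angles in radians, and R_head standing in for the \(4 \times 4\) rotation \(\varvec{R}\) above (the helper names are ours).

```python
import numpy as np

def rot_x(a):  # pitch rotation about the x-axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]])

def rot_y(a):  # yaw rotation about the y-axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]])

def translation(t):
    T = np.eye(4)
    T[:3, 3] = t
    return T

def pose_transforms(t_world, R_head, pitch, yaw):
    """M_face = T R positions the eye region; M_eye = T R_x R_y rotates the eyeball locally."""
    T = translation(t_world)
    M_face = T @ R_head                      # R_head: 4x4 global head rotation
    M_eye = T @ rot_x(pitch) @ rot_y(yaw)    # eyeball anchored at the same origin
    return M_face, M_eye
```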

When the eye looks up or down, the eyelid follows it. Eyelid motion is modelled using procedural animation [6] – each eyelid vertex is rotated about the inter-eye-corner axis, with rotational amounts chosen to match measurements from an anatomical study [36]. As our multi-part model contains disjoint parts, we also “shrinkwrap” the eyelid skin to the eyeball, projecting eyelid vertices onto the eyeball mesh to avoid gaps and clipping issues.
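
The shrinkwrap step can be approximated by snapping skin vertices onto a sphere standing in for the eyeball mesh. The sphere approximation, the offset, and the choice of which vertices to move are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def shrinkwrap_eyelid(eyelid_verts, eye_centre, eye_radius, offset=1e-3):
    """Snap eyelid skin vertices onto the eyeball surface to avoid gaps and clipping.

    eyelid_verts: (k, 3) world-space positions of vertices near the eyeball.
    Each vertex is projected radially onto a sphere of radius eye_radius + offset,
    i.e. just above the eyeball surface (sphere used as a stand-in for the mesh).
    """
    d = eyelid_verts - eye_centre                     # vectors from the eyeball centre
    dist = np.linalg.norm(d, axis=1, keepdims=True)   # current radial distances
    return eye_centre + d / dist * (eye_radius + offset)
```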

Scene illumination – \(\iota \) As we focus on a small region of the face, we assume a simple illumination model where lighting is distant and surface materials are purely Lambertian. Our illumination model consists of an ambient light with color \(\varvec{l}_{ amb }\in \mathbb {R}^3\), and a directional light with color \(\varvec{l}_{ dir }\in \mathbb {R}^3\) and 3D direction vector \(\varvec{L}\). We do not consider specular effects, global illumination, or self-shadowing, so illumination depends only on surface normal and albedo. Radiant illumination \(\mathcal {L}\) at a point on the surface with normal \(\varvec{N}\) and albedo \(\varvec{c}\) is calculated as:

$$\begin{aligned} \mathcal {L}(\varvec{N}, \varvec{c}) = \varvec{c} \, \varvec{l}_{ amb } + \varvec{c} \, \varvec{l}_{ dir } \, (\varvec{N} \cdot \varvec{L}) \end{aligned}$$
(8)

While this model is simple, we found it to be sufficient. If we considered a larger facial region, or fit models to both eyes at once, we would explore more advanced material or illumination models, as seen in previous work [13].
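
Equation (8) reduces to a few lines of per-point shading. The sketch below assumes unit-length normal and light direction and RGB values in [0, 1]; the clamp on the dot product is our addition and is not stated in Eq. (8).

```python
import numpy as np

def shade_lambert(albedo, normal, l_amb, l_dir, light_dir):
    """Eq. (8): L(N, c) = c * l_amb + c * l_dir * (N . L); no specular or shadow terms.

    albedo, l_amb, l_dir: (3,) RGB arrays; normal, light_dir: (3,) unit vectors.
    """
    n_dot_l = max(float(np.dot(normal, light_dir)), 0.0)  # clamp back-facing points (assumption)
    return albedo * (l_amb + l_dir * n_dot_l)              # c*l_amb + c*l_dir*(N.L)
```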

Camera projection – \(\kappa \) For a complete model of image formation, we also consider camera projection. We fix our axis-aligned camera at world origin, allowing us to set our world-to-view transform as the identity \(\varvec{I}_4\). We assume knowledge of intrinsic camera calibration parameters \(\kappa \), and use these to build a full projection transform \(\varvec{P}\). A local point in our model can then be transformed into image space using the model-view-projection transform \(\varvec{P}\varvec{M}_{\{{ face }|{ eye }\}}\).
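
A minimal sketch of this projection chain, assuming a standard \(3 \times 3\) pinhole intrinsic matrix K built from \(\kappa \); since the camera sits at the world origin, world space is treated as view space, and the sign convention for the viewing axis is left to the caller.

```python
import numpy as np

def project_point(x_model, M_model, K):
    """Map a model-space point into image space via the model-view-projection chain.

    x_model: (3,) point in model space; M_model: 4x4 model-to-world transform
    (M_face or M_eye); K: 3x3 pinhole intrinsics built from kappa.
    The world-to-view transform is the identity, so world space equals view space.
    """
    x_view = (M_model @ np.append(x_model, 1.0))[:3]  # model -> world (= view) space
    u, v, w = K @ x_view                              # apply intrinsics
    return np.array([u / w, v / w])                   # perspective divide
```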

Fig. 7. We measure dense image-similarity as the mean absolute error between \(I_{ obs }\) and \(I_{ syn }\), over a mask of rendered foreground pixels P (white). We ignore error for background pixels (black).

5 Analysis-by-Synthesis for Gaze Estimation

Given an observed image \(I_{ obs }\), we wish to produce a synthesized image \(I_{ syn }\left( \varPhi ^*\right) \) that best matches it. 3D gaze direction \(\varvec{g}\) can then be extracted from eyeball pose parameters. We search for optimal model parameters \(\varPhi ^*\) using analysis-by-synthesis. To do this, we iteratively render a synthetic image \(I_{ syn }\left( \varPhi \right) \), compare it to \(I_{ obs }\) using our energy function, and update \(\varPhi \) accordingly. We cast this as an unconstrained energy minimization problem for unknown \(\varPhi \).

$$\begin{aligned} \varPhi ^* = \mathop {{{\mathrm{arg\!min}}}}\limits _\varPhi \, E(\varPhi ) \end{aligned}$$
(9)

5.1 Objective Function

Our energy is formulated as a combination of a dense image similarity metric \(E_{ image }\) that measures differences in image appearance, and a sparse landmark similarity metric \(E_{ ldmks }\) that regularizes our model against reliable facial feature points, with a weight \(\lambda \) controlling their relative importance.

$$\begin{aligned} E(\varPhi ) = E_{ image }(\varPhi ) + \lambda \cdot E_{ ldmks }(\varPhi ,\,L) \end{aligned}$$
(10)

Image similarity metric. Our primary goal is to minimise the difference between \(I_{ syn }\) and \(I_{ obs }\). This can be seen as an ideal energy function: if \(I_{ syn } = I_{ obs }\), our model must have perfectly fit the data, so virtual and real eyeballs should be aligned. We approach this by including a dense photo-consistency term \(E_{ image }\) in our energy function. However, as the 3DMM in \(I_{ syn }\) does not cover all of \(I_{ obs }\), we split our image into two regions: a set of rendered foreground pixels P that we compute error over, and a set of background pixels that we ignore (see Fig. 7). Image similarity is then computed as the mean absolute difference between \(I_{ syn }\) and \(I_{ obs }\) for foreground pixels \(p \in P\).

$$\begin{aligned} E_{ image }(\varPhi ) = \frac{1}{\left| P\right| } \sum _{p \in P} \, \left| I_{ syn }(\varPhi , p) - I_{ obs }(p) \right| \end{aligned}$$
(11)
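
In code, Eq. (11) is a masked mean absolute difference. The sketch assumes float images and a boolean foreground mask produced by the rasterizer; averaging over colour channels as well as pixels is our simplification.

```python
import numpy as np

def image_energy(i_syn, i_obs, foreground):
    """E_image: mean absolute error over rendered foreground pixels P only (Eq. 11).

    i_syn, i_obs: (H, W, 3) float images; foreground: (H, W) boolean mask of P.
    """
    diff = np.abs(i_syn - i_obs)              # per-pixel, per-channel absolute difference
    return float(diff[foreground].mean())     # background pixels are ignored
```
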
Fig. 8. \(I_{ obs }\) with landmarks L (white dots), and model fits with our landmark similarity term (top) and without (bottom). Note how it prevents erroneous drift in global pose, eye region shape, and local eyelid pose.

Landmark similarity metric. The face contains important landmark feature points that can be localized reliably [13]. These can be used to efficiently consider the appearance of the whole face, as well as the local appearance of the eye region. We use a state-of-the-art face tracker [15] to localize 14 landmarks L around the eye region in image-space (see Fig. 8). For each landmark \(l\in L\) we compute a corresponding synthesized landmark \(l^\prime \) using our 3DMM. The sparse landmark-similarity term is calculated as the distance between both sets of landmarks, normalized by the foreground area to avoid bias from image or eye region size. This acts as a regularizer to prevent our pose \(\theta \) from drifting too far from a reliable estimate.

$$\begin{aligned} E_{ ldmks }(\varPhi ,\,L) = \frac{1}{\left| L\right| } \sum _{i = 1}^{|L|} \, ||l_i - l^\prime _i ||\end{aligned}$$
(12)
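
The landmark term and the combined objective can be sketched as below, reusing image_energy from the previous sketch. Folding the foreground-area normalization mentioned above into E_ldmks in this particular way is our assumption about where it enters.

```python
import numpy as np

def landmark_energy(l_obs, l_syn, num_foreground):
    """E_ldmks: mean distance between tracked and synthesized landmarks (Eq. 12),
    normalized by the foreground pixel count to remove bias from image/eye size."""
    dists = np.linalg.norm(l_obs - l_syn, axis=1)   # l_obs, l_syn: (|L|, 2) positions
    return float(dists.mean()) / num_foreground

def total_energy(i_syn, i_obs, foreground, l_obs, l_syn, lam):
    """E(Phi) = E_image + lambda * E_ldmks (Eq. 10)."""
    e_img = image_energy(i_syn, i_obs, foreground)
    e_ldm = landmark_energy(l_obs, l_syn, int(foreground.sum()))
    return e_img + lam * e_ldm
```
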
Fig. 9. Example model fits on gaze datasets Eyediap [39] (HD and VGA) and Columbia [40], showing estimated gaze (yellow) and labelled gaze (blue).

5.2 Optimization Procedure

We fit our model to the subject’s left eye. This is a challenging non-convex, high-dimensional optimization problem. To approach it we use gradient descent (GD) with an annealed step size. Calculating analytic derivatives for a scene as complex as our eye region is challenging due to occlusions. We therefore use numeric central derivatives \(\nabla E\) to guide our optimization procedure:

$$\begin{aligned} \varPhi _{i+1} = \varPhi _i - \varvec{t} \cdot r^i \, \nabla E(\varPhi _{i}) \qquad \text {where} \end{aligned}$$
(13)
$$\begin{aligned} \nabla E(\varPhi _{i}) = \left( \frac{\partial E}{\partial \phi _1} \ldots \frac{\partial E}{\partial \phi _{|\varPhi |}}\right) \quad \text {and} \quad \frac{\partial E}{\partial \phi _j} = \frac{E(\varPhi _{i} + h_j) - E(\varPhi _{i} - h_j)}{2 h_j} \end{aligned}$$
(14)

where \(\varvec{t} = [t_1...t_{|\varPhi |}]\) are per-parameter step sizes, \(\varvec{h} = [h_1...h_{|\varPhi |}]\) are per-parameter finite-difference offsets, and r is the annealing rate. \(\varvec{t}\) and \(\varvec{h}\) were calibrated through experimentation. We explored alternative optimization techniques, including L-BFGS [37], rprop [38], and momentum variants of GD, but we found these to be less stable, perhaps due to our use of numerical rather than analytical derivatives. Computing our gradients is expensive, requiring rendering and differencing two images per parameter. Their efficient computation is possible with our tailored GPU DirectX rasterizer, which can render \(I_{ syn }\) at over 5000 fps.
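
The update rule of Eqs. (13) and (14) can be sketched in plain NumPy. The energy callable is assumed to render \(I_{syn}(\varPhi )\) internally, the per-parameter step sizes t and offsets h are taken as given, and the GPU rasterizer that makes this affordable is not shown.

```python
import numpy as np

def numerical_gradient(energy, phi, h):
    """Central differences, Eq. (14): dE/dphi_j = (E(phi + h_j) - E(phi - h_j)) / (2 h_j)."""
    grad = np.zeros_like(phi)
    for j in range(len(phi)):
        e_j = np.zeros_like(phi)
        e_j[j] = h[j]                   # perturb only the j-th parameter
        grad[j] = (energy(phi + e_j) - energy(phi - e_j)) / (2.0 * h[j])
    return grad

def fit(energy, phi0, t, h, r=0.95, iters=60):
    """Annealed gradient descent, Eq. (13): phi_{i+1} = phi_i - t * r^i * grad E(phi_i).

    energy: callable returning E(phi) (renders I_syn(phi) internally);
    t, h: per-parameter step sizes and finite-difference offsets; r: annealing rate.
    """
    phi = phi0.copy()
    for i in range(iters):
        phi = phi - t * (r ** i) * numerical_gradient(energy, phi, h)
    return phi
```

Each iteration needs two renders per parameter for the central differences, which is why the fast tailored rasterizer described above matters.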

Initialization. As we perform local optimization, we require an initial model configuration to start from. We use 3D eye corner landmarks and head rotation from the face tracker [15] to initialize \(\varvec{T}\) and \(\varvec{R}\). We then use 2D iris landmarks and a single sphere eyeball model to initialize gaze [2]. \(\beta \) and \(\tau \) are initialized to \(\varvec{0}\), and illumination \(\varvec{l}_{amb}\) and \(\varvec{l}_{dir}\) are set to [0.8, 0.8, 0.8].

Runtime. Figure 10 shows convergence for a typical input image, with \(I_{ obs }\) size \(800 \times 533\)px and \(I_{ syn }\) size \(125 \times 87\)px. We converge after 60 iterations for 39 parameters, taking 3.69 s on a typical PC (3.3 GHz CPU, GTX 660 GPU).

5.3 Extracting Gaze Direction

Our task is estimating 3D gaze direction \(\varvec{g}\) in camera-space. Once our fitting procedure has converged, \(\varvec{g}\) can be extracted by applying the eyeball model transform to a vector pointing along the optical axis in model-space: \(\varvec{g} = \varvec{M}_{eye} \left[ 0, 0, -1\right] ^T\).
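
In code, the extraction is a single rotation of the model-space optical axis. Applying only the rotational part of \(\varvec{M}_{eye}\) (so the translation does not affect the direction) and normalizing the result is our reading of the formula.

```python
import numpy as np

def extract_gaze(M_eye):
    """g = M_eye [0, 0, -1]^T: rotate the optical axis from model space into camera space."""
    g = M_eye[:3, :3] @ np.array([0.0, 0.0, -1.0])   # rotation only: g is a direction
    return g / np.linalg.norm(g)
```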

6 Experiments

We evaluated our approach on two publicly available eye gaze datasets: Columbia [40] and Eyediap [39]. We chose these datasets as they show the full face, as required for our facial-landmark based initialization.

Columbia contains images of 56 people looking at a target grid on the wall. The participants were constrained by a head-clamp device, and images were taken from five different head orientations (from \(-30^\circ \) to \(30^\circ \)). Example fits can be seen in Fig. 9 right. In our experiments we used a subset of 34 people (excluding those with eyeglasses) with 20 images per person, resulting in 680 images. As the images were taken by a high quality camera (\(5184\times 3456\)px), we downsampled them to \(800\times 533\)px for faster processing.

Eyediap contains videos of 16 participants looking at two types of targets: screen targets on a monitor; and floating physical targets. Recordings were made with two cameras: a VGA camera (\(640\times 480\)px) below the screen, and an HD camera (\(1920\times 1080\)px) placed to the side. Example fits can be seen in Fig. 9 left. Participants displayed both static and free head motion. We extracted images from the VGA videos for our experiment – 622 images with screen targets and 500 images with floating targets. In both cases we used a gradient descent step size of 0.0025 with an annealing rate of 0.95 that started after the \(10^\text {th}\) iteration.

6.1 Gaze Estimation

In the first experiment we evaluated how well our method predicts gaze direction for Columbia. The results are shown in Fig. 10, giving average gaze error of \(\text {M} = 8.87^\circ , \text {Mdn}=7.54^\circ \) after convergence. As we do not impose a prior on predicted gaze distribution, our system can produce outliers with extreme error, so we believe its performance is best represented by the median (Mdn). Note how the decrease in fitting error corresponds to a monotonic decrease in mean and median gaze errors. Furthermore, our approach outperforms the geometric approach used to initialize it [2], a recently proposed k-Nearest-Neighbour approach [6] (\(\text {M}=19.9^\circ , \text {Mdn}=19.5^\circ \)), and a naïve model that always predicts forwards gaze (\(\text {M}=12.00^\circ , \text {Mdn}=11.17^\circ \)).

Fig. 10. Fitting error (left) and gaze estimation error (right). Note how gaze error improves from the initial estimate. Filled regions show inter-quartile range.

Fig. 11. Fitting (blue) and gaze estimation (red) error on Eyediap (VGA). We outperform a state-of-the-art CNN [5]. Additionally, the CNN was not able to generalize to the floating target condition, while ours can. (Color figure online)

Table 1. We outperform state-of-the-art cross-dataset methods trained on UT [27] and synthetic data [6]: CNN [5], Random Forests (RF) [27], kNN [5], Adaptive Linear Regression (ALR) [33], and Support Vector Regression (SVR) [26].

The results for Eyediap VGA images can be seen in Fig. 11. As before, the decrease in pixel error corresponds to a decrease in gaze error. Furthermore, our final gaze estimation error on the Eyediap screen condition (\(\text {M}=9.44^\circ , \text {Mdn}=8.63^\circ \)) outperforms the best previously reported result (\(p < .0001\), independent t-test) – \(10.5^\circ \) using a Convolutional Neural Network [5]. See Table 1 for other comparisons. We also outperform the initialization model, a kNN model (\(\text {M}=21.49^\circ , \text {Mdn}=20.93^\circ \)), and a naïve model (\(\text {M}=12.62^\circ , \text {Mdn}=12.79^\circ \)). The results for floating targets are less accurate but still improve upon our initialization baseline. Zhang et al. [5] did not evaluate on floating targets due to head pose variations not present in their training set. Despite a drop in accuracy, our method can still generalize to this difficult scenario and outperforms a kNN model (\(\text {M}=30.85^\circ , \text {Mdn}=28.92^\circ \)) and a naïve model (\(\text {M}=31.4^\circ , \text {Mdn}=31.37^\circ \)).

We performed a similar experiment for Eyediap HD images that exhibit head pose, achieving a gaze error of \(\text {M}=11.0^\circ , \text {Mdn}=10.4^\circ \) for screen targets and \(\text {M}=22.2^\circ , \text {Mdn}=19.0^\circ \) for floating targets. Despite extreme head pose and gaze range, we still perform comparably with the state-of-the-art and outperform a kNN model (\(\text {M}=29.39^\circ , \text {Mdn}=28.62^\circ \) for screen, and \(\text {M}=34.6^\circ , \text {Mdn}=33.19^\circ \) for floating target), and a naïve model (\(\text {M}=22.67^\circ , \text {Mdn}=22.06^\circ \) for screen, and \(\text {M}=35.08^\circ , \text {Mdn}=34.35^\circ \) for floating target).

6.2 Morphable Model Evaluation

In addition to evaluating our system’s gaze estimation capabilities, we performed experiments to measure the expressive power of our morphable model and the effect of including \(E_{ldmks}\) in our objective function.

Fig. 12. As we include more shape and texture principal components (PCs) in the facial morphable model, both fitting and gaze error decrease. Also note the effect of our landmark regularization term \(\lambda \), which decreases the error (and its standard deviation) by not allowing the fit to drift.

First, we assessed the importance of our facial point similarity weight (\(\lambda \)) to gaze estimation accuracy on the Columbia dataset. We used the same fitting strategy, but varied \(\lambda \). Results can be seen in Fig. 12 (right). It is clear that \(\lambda \) has a positive impact on gaze estimation accuracy, by not allowing fits to drift too far from the reliable estimates and by reducing the variance of the error.

Second, we wanted to see if modelling more degrees of shape and appearance variation led to better image fitting and gaze estimation. We therefore varied the number of shape (\(\beta \)) and texture (\(\tau \)) principal components (PCs) that our model was allowed to use during fitting on Columbia. We varied both the texture and shape PCs together, using the same number for both. As seen in Fig. 12 (left), more PCs lead to better image fitting error, as \(I_{syn}\) matches \(I_{obs}\) better when allowed more variation. A similar downward trend can be seen for gaze error, suggesting better modelling of nearby facial shape and texture is important for correctly aligning the eyeball model, and thus determining gaze direction.

7 Conclusion

We presented the first multi-part 3D morphable model of the eye region. It includes a separate eyeball model, allowing us to capture gaze – a facial expression not captured by previous systems [13, 14]. We then presented a novel approach for gaze estimation: fitting our model to an image with analysis-by-synthesis, and extracting the gaze direction from fitted parameters. Our method is the first to jointly optimize a dense image metric, a sparse feature metric, and a generative 3D model together for gaze estimation. It generalizes to different quality images and wide gaze ranges, and outperforms a state-of-the-art CNN method [5].

Limitations still remain. While other gaze estimation systems can operate in real time [2, 5], ours takes several seconds per image. However, previous analysis-by-synthesis systems have been made real time through careful engineering [41]; we believe this is possible for our method too. Our method can also become trapped in local minima (see Fig. 8). To avoid this and improve robustness, we plan to fit both eyes simultaneously in future work.