1 Introduction

The importance of eye-contact in non-verbal human communication cannot be overstated. From infancy, humans use eye-contact as a means of attracting and acknowledging attention, and can effortlessly sense others' eye-gaze direction [5]. In today's ubiquitous computing environment, it is critical for devices to effectively attract and manage users' attention for proactive communication and information rendering. HCI would therefore greatly benefit from devices that can sense user attention via eye-contact, a phenomenon termed gaze locking in [16].

Gaze locking is a sub-problem of gaze tracking, whose objective is to determine where the user is looking. Gaze tracking has been extensively studied by the HCI [11, 12], psychology [14, 20], medical [7] and multimedia/computer vision [10, 19] communities. With the exception of a few works such as [8], gaze-tracking techniques have inferred the point-of-gaze from eye-based cues alone, even though the social attention literature has shown that other cues, such as head orientation, also contribute significantly to attention inference [9].

This paper proposes gaze-locking using deep convolutional neural networks (CNNs), which have recently become popular for solving visual recognition problems as they obviate the need for hand-crafted features (e.g., expressly modeling head pose). Specifically, our work makes the following research contributions:

  1. Even though the gaze-locking methodology outlined in [16] detects eye-contact from distant faces, it requires an elaborate processing pipeline comprising eye-region rectification for head-pose compensation, eye-mask extraction, compression of a high-dimensional eye-appearance feature vector via dimensionality reduction, and a classifier for gaze-lock detection. In contrast, we leverage the learning power of CNNs for gaze-locking with minimal data pre-processing. We validate our model on three datasets, and obtain over 90% detection accuracy on the Columbia Gaze (CG) [16] test set; in comparison, [16] reports 92% accuracy on the CG training set.

  2. Unlike [16] and most gaze-tracking methods, we use facial appearance, which implicitly conveys head pose, in addition to eye appearance. As seen in Fig. 1, face orientation crucially determines whether the user is gaze-locked with a (reference) camera. The eyes in the left and right images have very similar appearance; however, eye-contact is clearly made only in the right instance, once gazing direction is inferred as eye orientation relative to head pose. Combining face and eye cues achieves superior gaze-locking performance to either cue alone, as demonstrated in prior work [17].

  3. CNNs are typically implemented on CPU/GPU clusters given their large computation and memory requirements; their deployment on mobile platforms is precluded by the limited computation and energy resources of these environments. We demonstrate gaze-locking on an Android mobile platform via CNN compression, using ideas from the dark knowledge concept [6].

Fig. 1. The left image is non-gaze-locked, while the right image is gaze-locked; their eye crops, however, look very similar.

Fig. 2. Overview of our gaze-lock detector. Inputs include 64 \(\times \) 64 left eye, right eye and face images, and the detector outputs a binary label: gaze-locked or non-gaze-locked. The CNN architecture has three parallel networks, each comprising four convolutional layer blocks (denoted as filter size/number of filters): CONV-L1: 3 \(\times \) 3/64, CONV-L2: 3 \(\times \) 3/128, CONV-L3: 3 \(\times \) 3/256, and CONV-L4: 3 \(\times \) 3/128, and three fully-connected layers denoted as FC1 (of size 2048 inputs \(\times \) 128 outputs), FC2: 384 \(\times \) 128 and FC3: 128 \(\times \) 2. (Color figure online)

2 Methodology

Figure 2 presents our proposed system and the convolutional neural network (CNN) architecture. CNNs automatically learn problem-specific features, obviating the need to devise hand-crafted descriptors such as HoG [3]. Furthermore, replacing largely independent feature extraction and feature learning modules with an end-to-end framework allows classification errors to be handled efficiently, since they propagate back through the entire network during training. System components are described below.

2.1 Image Pre-processing

We essentially use face and eye appearance to detect eye-contact, and pre-processing is limited to the extraction of these regions. A state-of-the-art facial landmark detector [1] is used to obtain 64 \(\times \) 64 left- and right-eye patches. Since face pose serves as an additional cue, a 64 \(\times \) 64 face patch obtained using the Viola-Jones detector [18] is also fed to the CNN. The red, green and blue channels of each patch are z-normalized prior to input.
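As a concrete illustration, the per-channel z-normalization can be written in a few lines of Python/NumPy; the function name and the small epsilon guarding against constant patches are our own additions, and the eye/face crops are assumed to have already been extracted by the detectors above.

```python
import numpy as np

def z_normalize_patch(patch):
    """Z-normalize each colour channel of a 64x64x3 uint8 patch independently.

    Returns a float32 array with zero mean and unit variance per channel,
    as described in Sect. 2.1 (function name and epsilon are our own).
    """
    patch = patch.astype(np.float32)
    out = np.empty_like(patch)
    for c in range(3):  # red, green and blue channels
        channel = patch[:, :, c]
        out[:, :, c] = (channel - channel.mean()) / (channel.std() + 1e-6)
    return out
```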

2.2 CNN Architecture

Our system comprises three parallel networks (one each for face, left eye and right eye) with a VGGnet [15]-like configuration. CNNs are stacked with convolutional (Conv) layers composed of groups of neurons (or filters), which automatically compute locally salient features (or activations) from input data. Conv layers are interleaved with max-pooling layers, which isolate the main activations on small data blocks and allow later layers to work on a 'zoomed out' version of previous outputs, facilitating parameter reduction. Convolutions are also usually followed by a non-linear operation (the rectified linear unit, or ReLU [13]) to make the CNN more expressive and powerful. Finally, in a fully-connected (FC) layer, neurons have access to all activations from the previous layer, in contrast to a Conv layer whose neurons only access local activations.

Each of our three networks has four blocks, with each block comprising two Conv layers, a ReLU and a max-pooling layer (only Conv layers are shown in Fig. 2). Similar activations are enforced for the left- and right-eye networks by constraining their neurons to learn identical (shared) weights. The filter size, or spatial extent of activations input to a Conv-layer neuron, is \(3 \times 3\) for all blocks, and there are 64, 128, 256 and 128 neurons respectively in the four blocks. A stride of 1 is used while convolving (computing the dot product of) the filters with the input patches. The Conv-L4 outputs are vectorized into a 2048-dimensional vector, which is input to the FC1 layer with 128 outputs. The FC1 outputs from the three networks are combined and fed to FC2 followed by FC3, which labels the input as either gaze-locked or non-gaze-locked. The CNN model was implemented in Torch [2], and trained over 250 epochs with a batch size of 100. An initial learning rate of 0.001 was reduced by 5% after every epoch. To avoid overfitting, dropout was used to randomly remove 40% of the FC-layer neurons during training. Interested readers may refer to [15] for further details.
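For concreteness, the following is a minimal PyTorch sketch approximating the architecture above (the original model was implemented in Torch [2]); padding of 1 per \(3 \times 3\) convolution, the placement of ReLU after the second convolution of each block, an extra ReLU after FC2 and the exact location of the dropout layers are our assumptions where the text leaves details open.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One block: two 3x3 convolutions, a ReLU and 2x2 max pooling
    # (padding=1 is assumed so that Conv-L4 yields a 2048-d vector).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),
    )

class Stream(nn.Module):
    """One of the three parallel networks: CONV-L1..L4 followed by FC1."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64),     # CONV-L1: 64x64 -> 32x32
            conv_block(64, 128),   # CONV-L2: 32x32 -> 16x16
            conv_block(128, 256),  # CONV-L3: 16x16 -> 8x8
            conv_block(256, 128),  # CONV-L4: 8x8 -> 4x4 (4*4*128 = 2048)
        )
        self.fc1 = nn.Linear(2048, 128)

    def forward(self, x):
        return self.fc1(self.features(x).flatten(1))

class GazeLockNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.face_stream = Stream()
        self.eye_stream = Stream()  # shared weights for left and right eyes
        self.classifier = nn.Sequential(
            nn.Dropout(0.4),
            nn.Linear(384, 128),    # FC2: three concatenated 128-d FC1 outputs
            nn.ReLU(inplace=True),
            nn.Dropout(0.4),
            nn.Linear(128, 2),      # FC3: gaze-locked vs. non-gaze-locked
        )

    def forward(self, face, left_eye, right_eye):
        feats = torch.cat([self.face_stream(face),
                           self.eye_stream(left_eye),
                           self.eye_stream(right_eye)], dim=1)
        return self.classifier(feats)
```

The training schedule above (250 epochs, batch size 100, 5% per-epoch learning-rate decay) could be mirrored, for example, with torch.optim.SGD(model.parameters(), lr=0.001) and torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95); the choice of optimizer is an assumption, as it is not specified above.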

3 Experiments and Results

3.1 Datasets

To expressly address eye-contact detection, the authors of [16] compiled the Columbia Gaze (CG) dataset, which comprises 5880 images of 56 persons captured over 21 gaze directions and 5 head poses. Of these, 280 are gaze-locked, while 5600 are non-gaze-locked; sample CG images are shown in Fig. 3 (left). The CG dataset was compiled in a controlled environment, and contains little variation in terms of illumination and background. Its limited size makes it unsuitable for training CNNs, and we therefore used two large datasets to train our CNN, namely (1) MPIIGaze [21], comprising 213,659 images collected from 15 subjects during everyday laptop use. As shown in Fig. 3 (center-top), MPIIGaze images vary with respect to illumination, face size and background; however, only cropped eye images (center-bottom) are publicly available for MPIIGaze. (2) The Eyediap dataset [4] (Fig. 3 (right)) contains 19 HD videos, each with more than 3000 frames, captured from 16 participants. We ignore the depth information available for this dataset, and only use the raw video frames.

Fig. 3. (left) Sample images from the CG dataset. (center-top) Original exemplars and (center-bottom) publicly available eye-only images from MPIIGaze. (right) Sample images from Eyediap.

3.2 Data Synthesis and Labeling

As only 280 gaze-locked images exist in the CG dataset, we generated 2280 gaze-locked and 5900 non-gaze-locked samples by scaling and randomly perturbing the original images as described in [16]. Conversely, we downsampled the number of images for the MPIIGaze and Eyediap datasets. MPIIGaze comprises images with continuous gaze directions from 0\(^{\circ }\) to \(-20^{\circ }\) pitch (vertical head rotation) and \(-20^{\circ }\) to 20\(^{\circ }\) yaw (horizontal rotation). The 3D gaze direction (x, y, z) is converted to 2D angles (\(\theta \), \(\phi \)) as \(\theta = \arcsin (-y)\), \(\phi = \arctan (-x, -z)\), so that gaze-locking corresponds to \((\theta , \phi ) = (0, 0)\). This way, we obtained 6892 gaze-locked and 12000 non-gaze-locked images from MPIIGaze. Likewise, Eyediap images show users making eye-contact with various screen regions on a \(24''\) PC monitor. We labeled images with the target looking straight ahead (around the screen center) as gaze-locked, and the others as non-gaze-locked. Table 1 presents the training and test set statistics for the three datasets. We now discuss gaze-locking results with different train and test sets.
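The angle conversion and labeling step can be sketched as follows (Python/NumPy); the angular tolerance deciding when \((\theta , \phi )\) is close enough to (0, 0) is an illustrative assumption, not a value reported above.

```python
import numpy as np

def gaze_angles(gaze_vec):
    """Convert a 3D gaze direction (x, y, z) to 2D angles (theta, phi)
    following theta = arcsin(-y), phi = arctan2(-x, -z)."""
    x, y, z = gaze_vec
    theta = np.arcsin(-y)
    phi = np.arctan2(-x, -z)
    return theta, phi

def is_gaze_locked(gaze_vec, tol_deg=1.0):
    """Label a sample as gaze-locked if (theta, phi) lies within tol_deg of (0, 0).

    The tolerance is an illustrative assumption; ideally gaze-locking
    corresponds to exactly (0, 0).
    """
    theta, phi = gaze_angles(gaze_vec)
    return abs(np.degrees(theta)) <= tol_deg and abs(np.degrees(phi)) <= tol_deg
```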

Table 1. Training and test set details for the various datasets.

Experiment 1 (Ex1). To begin with, we used only the CG dataset for model training. Specifically, we trained our detector with (a) images of only one eye; (b) images from both eyes; (c) only face images; and (d) face-plus-eye images as in Fig. 2.

Experiment 2 (Ex2). Here, we repeated Ex1(a) and (b), but first pre-trained the CNN with MPIIGaze and then fine-tuned it on CG. Fine-tuning involved modifying only the FC-layer weights by re-training with CG images, assuming that the learned Conv-L4 activations are relevant for both MPIIGaze and CG.
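A minimal sketch of this fine-tuning scheme, written against the hypothetical GazeLockNet class sketched in Sect. 2.2 (PyTorch, not the authors' Torch code): the convolutional blocks are frozen after pre-training and only the FC layers are updated on CG; the checkpoint name and learning rate are illustrative.

```python
import torch

model = GazeLockNet()  # sketch class from Sect. 2.2
model.load_state_dict(torch.load("pretrained_mpiigaze.pth"))  # hypothetical checkpoint

# Freeze the convolutional blocks; Conv-L4 activations are assumed transferable.
for stream in (model.face_stream, model.eye_stream):
    for p in stream.features.parameters():
        p.requires_grad = False

# Re-train only the FC layers (FC1 of each stream plus FC2/FC3) on CG images.
fc_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(fc_params, lr=0.001)  # illustrative optimizer settings
```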

Experiment 3 (Ex3). We repeated Ex1(a–d), but pre-trained the CNN with Eyediap followed by fine-tuning on CG.

Table 2. Detection performance for Ex1(a)–3(d) and comparison with [16]. Model tested on CG in all cases. [16] reports results only on the training set.

Experiment 4 (Ex4). To examine the effect of our framework on datasets other than CG, we repeated Ex1(a–d) with a CNN trained on CG and fine-tuned with Eyediap.

3.3 Results and Discussion

Gaze-locking results are tabulated in Tables 2 and 3. Detection performance is evaluated in terms of accuracy and the Matthews correlation coefficient (MCC). MCC is useful for evaluating binary classifier performance on unbalanced datasets, as in our case where the number of gaze-locked instances is far smaller than that of non-gaze-locked ones. In Ex1, accuracy and MCC decrease as more information is input to the CNN (e.g., face-plus-eyes vs. eyes/face only), contrary to our expectation. This reduction is attributable to overfitting, given the small size of the CG dataset in comparison to the number of CNN parameters.
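For reference, MCC is computed from the confusion-matrix counts as sketched below; an equivalent value is returned by sklearn.metrics.matthews_corrcoef.

```python
import numpy as np

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns a value in [-1, 1]; 0 corresponds to chance-level prediction,
    which makes it informative on unbalanced test sets.
    """
    num = tp * tn - fp * fn
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den > 0 else 0.0
```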

Table 3. Detection results for Ex4. Model trained on CG and fine-tuned/tested on Eyediap.

However, the benefit of using additional information for gaze-lock detection is evident from Ex2, Ex3 and Ex4 (Ex2 and Ex3 involve pre-training the CNN model with larger and visually richer datasets). Using two-eye information instead of one-eye information in Ex2 improves accuracy and MCC by 4.7% and 7% respectively. The Ex3 and Ex4 results are consistent with the social attention literature: they confirm that while gaze direction is more critical than head pose for inferring eye contact, combining head and eye orientation cues is optimal for gaze-locking. Our system achieves a best accuracy of 93% and MCC of 0.83 on the CG dataset. Table 2 also compares our results with the state-of-the-art [16]; [16] reports detection results on the training set, while our results are achieved on an independent test set. With minimal data pre-processing, our model performs similarly to [16] using only eye appearance, and outperforms [16] with face-plus-eye information. Finally, while the Ex4 results again confirm the insufficiency of the CG dataset for training the CNN, gaze-locking performance significantly improves on incorporating facial and binocular information.

3.4 Visualizing CNN Activations

Figure 4 illustrates four neuronal activations learned in the Conv-L1 layer of our CNN model for the input eye and face images. Conv-L1 activations are informative as ReLU network activations are dense in the early layers, and progressively become sparse and localized. As eye gaze direction is given by the pupil orientation, the eye activations capture edges and textures relating to the pupil. Similarly, the face network activations encode face shape and structural details for pose inference.
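Feature maps such as those in Fig. 4 can be pulled out of the first convolution with a forward hook, as sketched below against the hypothetical GazeLockNet class from Sect. 2.2; the dummy input tensors and the indexing into the first block are properties of our sketch, not of the authors' Torch implementation.

```python
import torch

model = GazeLockNet().eval()
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Hook the first 3x3 convolution of CONV-L1 in the (shared) eye stream.
model.eye_stream.features[0][0].register_forward_hook(save_activation("conv_l1"))

# Placeholders for pre-processed 64x64 RGB inputs (batch of one).
face_batch = torch.randn(1, 3, 64, 64)
left_eye_batch = torch.randn(1, 3, 64, 64)
right_eye_batch = torch.randn(1, 3, 64, 64)

with torch.no_grad():
    model(face_batch, left_eye_batch, right_eye_batch)

conv_l1_maps = activations["conv_l1"]  # shape: (1, 64, 64, 64) feature maps to visualize
```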

Fig. 4. Exemplar Conv-L1 neuron outputs for input eye (top) and face (bottom) images.

4 CNN Implementation on Android

While our CNN-based gaze-lock detector requires minimal pre-processing, the end-to-end framework also obviates the need for heuristics such as the eye-mask extraction step in [16]. Our system achieves a 15 fps throughput on a PC with an Intel Core i7 2.6 GHz CPU, 16 GB RAM and a GeForce GTX 960M GPU. However, CNNs require large computational and memory resources, which precludes their implementation on mobile devices with limited computation and energy capacity.

Fig. 5. Compressed version of our model working on an Android (quad-core, 2.3 GHz, 3 GB RAM) phone. A green rectangle denotes gaze-locking, while red denotes non-gaze-locking. (Color figure online)

This problem can be circumvented by compressing the knowledge in a large, complex model to train a simpler model with minimal accuracy loss, using the dark-knowledge concept [6]. Figure 5 shows our gaze-lock detector on an Android platform, which has a throughput of 1 fps. A more efficient implementation, as described in [8], can achieve up to 15 fps.
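A minimal sketch of the distillation loss underlying the dark-knowledge idea [6] is given below (PyTorch); the temperature and mixing weight are illustrative values rather than the settings used for the phone model.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Combine softened teacher targets with the usual hard-label loss [6].

    T (temperature) and alpha (mixing weight) are illustrative values.
    """
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    # The KL term is scaled by T^2 so its gradients keep a comparable magnitude.
    soft_loss = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

A smaller student network trained with this loss against the large model's soft outputs is what runs on the phone-class hardware shown in Fig. 5.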

5 Conclusion

This work exploits the power of deep CNNs to perform passive eye-contact detection with minimal data pre-processing. Combining facial appearance with eye information improves gaze-locking performance. Our system can also run on an Android mobile device with limited throughput. Our end-to-end system with minimal heuristics can be leveraged by today’s smart devices for capturing and managing user attention (e.g., a smart selfie application), as well as in image/video retrieval (detecting shots where a certain character is facing the camera). Future work involves implementation of a seamless, real-time vision-voice system for assistive applications such as photo-capturing for the blind.