1 Introduction

Appearances influence what people think about the personality of others, even in the absence of any interaction with them. These judgments can be made very quickly, after as little as 100 ms of exposure [35]. Although some studies have shown that people form accurate first impressions about the personality traits of others after viewing their photographs or videos [4, 21], it has also been shown that relying on appearance alone does not always result in correct first impression judgments [22].

Several characteristics of people, ranging from clothing to facial expressions, contribute to first impression judgments about personality [29]. For example, [30] showed that photographs of the same person taken with different facial expressions change the judgments about that person’s personality traits, such as trustworthiness and extraversion, as well as other perceived characteristics, such as attractiveness and intelligence. Furthermore, after short encounters, people are better at guessing others’ personality traits if they find them attractive [18]. The same study also showed that people form more positive first impressions of more attractive people.

Studies of personality prediction generally deal either with recognizing the actual personality traits of people, which can be measured via self- or acquaintance-reports, or with apparent personality traits, which are the impressions about the personality of an unfamiliar individual [34]. Below we review recent work on apparent personality prediction.

Most of the previous work on apparent personality modeling and prediction has been in the domain of paralanguage, i.e. speech, text, prosody, other vocalizations and fillers [34]. Conversations (both text and audio) [19] and speech clips [20, 23] were the most commonly analyzed materials. In this domain, the INTERSPEECH 2012 Speaker Trait Challenge [25] enabled a systematic comparison of computational methods by providing a dataset comprising audio data and extracted features. The competition had three sub-challenges for predicting the Big Five personality traits, likability and pathology of speakers.

Recently, the prediction of apparent personality traits from social media content has attracted much attention in the field. For example, [6, 26] demonstrated that the images that users “favorite” on Flickr enable the prediction of both apparent and actual (self-assessed) personality traits of Flickr users. [32] looked at the influence of a large number of physical attributes (e.g. chin length, head size, posture) on people’s impressions regarding their approachability, youthful-attractiveness and dominance. They studied these influences based on people’s impressions formed after looking at face photographs. They performed factor analysis to quantify the contribution of the physical attributes and used these factors as inputs to a linear neural network to predict the impressions. Their predictions were significantly correlated with the actual impression data.

Given that the exact facial expression [30] and posture [32] of the person in a photograph influence the first impression judgments about that person, and given the importance of paralinguistic information in impression formation [19], continuous audio-visual data seem to be a suitable medium for studying first impressions. In a series of studies using YouTube video blogs (vlogs) [13, 29], researchers showed that this is indeed the case. Furthermore, [5] showed that combining audiovisual annotations with audiovisual cues enabled the best prediction performance for their regression models, compared to using either one alone.

At the same time, deep neural networks [16, 24] in general and deep residual networks [11] in particular have achieved state-of-the-art results in many computer vision tasks. For example, [11] won first place in both the object detection and the object localization tasks of the ImageNet Large Scale Visual Recognition Challenge 2015Footnote 1 with the seminal work that introduced deep residual networks. Furthermore, deep residual networks have been successfully used in a variety of other computer vision tasks ranging from style transfer [14] and image super-resolution [14] to semantic segmentation [7] and face hallucination [9].

Recently, [33] suggested that deep neural networks can be used for personality trait recognition because of the hierarchical organization of personality traits [36]. Following this line of reasoning, as well as the recent success of deep residual networks, we develop an audiovisual deep residual network for multimodal personality trait recognition. The network is trained end-to-end to predict the apparent Big Five personality traits of people from their videos. That is, the network does not require any feature engineering or visual analysis such as face detection, face landmark alignment or facial expression recognition.

2 Methods

2.1 Architecture

Figure 1 shows an illustration of the network architecture. The network comprises an auditory stream (a 17-layer deep residual network), a visual stream (another 17-layer deep residual network) and an audiovisual stream (a fully-connected layer).

The auditory stream and the visual stream are similar to the first 17 layers of the 18-layer deep residual network in [11]. That is, each stream comprises one convolutional layer and eight residual blocks of two convolutional layers each. The convolutional layers are followed by batch normalization [13] (all layers), rectified linear units (all layers), max pooling (first layer) and global average pooling (last layer). Identity shortcut connections are used in the residual blocks that do not change the dimensionality of their inputs; convolutional shortcut connections are used in the remaining residual blocks. In contrast to [11], the number of convolutional kernels is halved.
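A minimal sketch of one such residual block, written in Chainer [31] (the framework used for our implementation), is given below. The block structure follows the description above, whereas the padding values, the bias settings and the class/variable names are illustrative assumptions rather than the exact values used in our implementation.

import chainer
import chainer.functions as F
import chainer.links as L

class ResidualBlock(chainer.Chain):
    # One residual block of the visual stream: two 3 x 3 convolutional layers,
    # each followed by batch normalization, with a shortcut connection around
    # them. When the block changes the dimensionality of its input (stride > 1
    # or in_channels != out_channels), a 1 x 1 convolutional shortcut replaces
    # the identity shortcut.
    def __init__(self, in_channels, out_channels, stride=1):
        super(ResidualBlock, self).__init__()
        self.use_projection = (stride != 1) or (in_channels != out_channels)
        with self.init_scope():
            self.conv1 = L.Convolution2D(in_channels, out_channels, ksize=3,
                                         stride=stride, pad=1, nobias=True)
            self.bn1 = L.BatchNormalization(out_channels)
            self.conv2 = L.Convolution2D(out_channels, out_channels, ksize=3,
                                         stride=1, pad=1, nobias=True)
            self.bn2 = L.BatchNormalization(out_channels)
            if self.use_projection:
                self.projection = L.Convolution2D(in_channels, out_channels,
                                                  ksize=1, stride=stride,
                                                  nobias=True)

    def __call__(self, x):
        h = F.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        shortcut = self.projection(x) if self.use_projection else x
        return F.relu(h + shortcut)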

Similar to [8], the difference between the auditory stream and the visual stream is that the inputs, convolutional/pooling kernels and strides of the auditory stream are one-dimensional, whereas those of the visual stream are two-dimensional (if the number of channels is ignored). That is:

  • An \(n ^ 2 \times 1 \times 1\) input of the auditory stream corresponds to an \(n \times n \times m\) input of the visual stream.

  • An \(n ^ 2 \times 1 \times m\)/\(n ^ 2 \times 1\) convolutional/pooling kernel of the auditory stream corresponds to an \(n \times n \times m\)/\(n \times n\) convolutional/pooling kernel of the visual stream.

  • An \(n ^ 2 \times 1\) stride of the auditory stream corresponds to an \(n \times n\) stride of the visual stream.

where m is the number of channels.
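As a concrete illustration of this correspondence, the sketch below pairs the first convolutional layer of the visual stream (a \(7 \times 7\) kernel with a \(2 \times 2\) stride over a 224 pixels \(\times \) 224 pixels \(\times \) 3 channels crop, with 32 kernels, i.e. half of the 64 in [11]) with its auditory counterpart (a \(49 \times 1\) kernel with a \(4 \times 1\) stride over a \(50176 \times 1 \times 1\) crop). The padding values are assumptions chosen so that the output sizes also respect the correspondence.

import numpy as np
import chainer.links as L

# First convolutional layer of the visual stream: 7 x 7 kernel, 2 x 2 stride.
conv_visual = L.Convolution2D(3, 32, ksize=7, stride=2, pad=3)

# Corresponding first convolutional layer of the auditory stream:
# 49 x 1 (= 7^2 x 1) kernel, 4 x 1 (= 2^2 x 1) stride.
conv_auditory = L.ConvolutionND(1, 1, 32, ksize=49, stride=4, pad=24)

x_visual = np.zeros((1, 3, 224, 224), dtype=np.float32)   # 224 x 224 x 3 crop
x_auditory = np.zeros((1, 1, 50176), dtype=np.float32)    # 50176 (= 224^2) samples

print(conv_visual(x_visual).shape)      # (1, 32, 112, 112)
print(conv_auditory(x_auditory).shape)  # (1, 32, 12544), where 12544 = 112^2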

Outputs of the auditory stream and the visual stream are merged in the audiovisual stream, which comprises a fully-connected layer followed by hyperbolic tangent units. Outputs of the audiovisual stream are scaled to [0, 1].
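A minimal sketch of the audiovisual stream is given below. The feature dimensionality (256 per stream, i.e. half of the 512 features of the 18-layer network in [11]) and the linear rescaling of the hyperbolic tangent outputs from [-1, 1] to [0, 1] are assumptions made for the purpose of illustration.

import chainer
import chainer.functions as F
import chainer.links as L

class AudiovisualStream(chainer.Chain):
    # The globally pooled activities of the auditory stream and the visual
    # stream (assumed to be 256-dimensional each) are concatenated and passed
    # through a single fully-connected layer with hyperbolic tangent units,
    # whose outputs are rescaled to [0, 1].
    def __init__(self, n_features=2 * 256, n_traits=5):
        super(AudiovisualStream, self).__init__()
        with self.init_scope():
            self.fc = L.Linear(n_features, n_traits)

    def __call__(self, pooled_auditory, pooled_visual):
        h = F.concat((pooled_auditory, pooled_visual), axis=1)
        return 0.5 * (F.tanh(self.fc(h)) + 1.0)  # map [-1, 1] to [0, 1]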

Fig. 1. Illustration of the network architecture.

2.2 Training

We used Adam [15] with an initial \(\alpha = 0.0002\), \(\beta _1 = 0.5\), \(\beta _2 = 0.999\), \(\epsilon = 10^{-8}\) and a mini-batch size of 32 to train the network for 900 epochs by iteratively minimizing the mean absolute error loss function between the target traits and the predicted traits. We initialized the biases/weights as in [10] and reduced \(\alpha \) by a factor of 10 after every 300 epochs. Each training video clip was processed as follows (a sketch of one training step is given after the list):

  • The audio data and the visual data of the video clip are extracted.

  • A random 50176-sample temporal crop of the audio data is fed into the auditory stream. The activities of the penultimate layer of the auditory stream are temporally pooled.

  • A random 224 pixels \(\times \) 224 pixels spatial crop of a random frame of the visual data is randomly flipped in the left/right direction and fed into the visual stream. The activities of the penultimate layer of the visual stream are spatially pooled.

  • The pooled activities of the auditory stream and the visual stream are concatenated and fed into the fully-connected layer.

  • The fully-connected layer outputs five continuous prediction values in the range [0, 1], one for each trait of the video clip.
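The sketch below illustrates one training step under these settings, again in Chainer. The function and variable names are placeholders rather than the identifiers used in our implementation; model stands for the full audiovisual network and the mini-batch variables stand for the random crops and target traits described above.

import chainer.functions as F
from chainer import optimizers

def make_optimizer(model):
    # Adam with the settings stated above.
    optimizer = optimizers.Adam(alpha=0.0002, beta1=0.5, beta2=0.999, eps=1e-8)
    optimizer.setup(model)
    return optimizer

def training_step(model, optimizer, audio_batch, frame_batch, target_batch):
    # audio_batch, frame_batch, target_batch: a mini-batch of 32 random audio
    # crops, random frame crops and the corresponding target traits.
    predictions = model(audio_batch, frame_batch)             # values in [0, 1]
    loss = F.mean_absolute_error(predictions, target_batch)   # loss over the five traits
    model.cleargrads()
    loss.backward()
    optimizer.update()
    return float(loss.data)

# After every 300 epochs, alpha is reduced by a factor of 10, e.g.:
# optimizer.alpha *= 0.1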

2.3 Validation/Test

Each validation/test video clip was processed as follows:

  • The audio data and the visual data of the video clip are extracted.

  • The entire audio data are fed into the auditory stream. The activities of the penultimate layer of the auditory stream are temporally pooled (see the note below).

  • The entire visual data are fed into the visual stream one frame at a time. The activities of the penultimate layer of the visual stream are spatiotemporally pooled (see the note below).

  • The pooled activities of the auditory stream and the visual stream are concatenated and fed into the fully-connected layer.

  • The fully-connected layer outputs five continuous prediction values in the range [0, 1], one for each trait of the video clip.

It should be noted that the network can process video clips of arbitrary length since the penultimate layers of the auditory stream and the visual stream are followed by global average pooling.
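In other words, the global average pooling step reduces the penultimate feature maps to one value per channel, regardless of their temporal or spatial extent. A framework-agnostic sketch of this step is given below; the channel count of 256 and the temporal extents are arbitrary examples.

import numpy as np

def global_average_pool(features):
    # features: penultimate-layer activities of one clip, with shape
    # (channels, time) for the auditory stream or (channels, frames, height,
    # width) for the visual stream. Averaging over all non-channel axes
    # yields a fixed-length (channels,) vector regardless of the clip length
    # or frame size.
    return features.reshape(features.shape[0], -1).mean(axis=1)

# A shorter clip and a longer clip both map to 256 values:
print(global_average_pool(np.random.rand(256, 1568)).shape)  # (256,)
print(global_average_pool(np.random.rand(256, 3136)).shape)  # (256,)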

3 Results

We evaluated the network on the dataset that was released as part of the ChaLearn First Impressions ChallengeFootnote 2 [17]. The dataset consists of 10000 15-second video clips drawn from YouTubeFootnote 3, of which 6000 were used for training, 2000 for validation and 2000 for testing. The video clips were annotated with the Big Five personality traits (i.e. openness to experience, conscientiousness, extraversion, agreeableness and neuroticism) by Amazon Mechanical TurkFootnote 4 workers. Each trait was represented with a value in the range [0, 1].

The video clips were preprocessed by temporally resampling the audio data to 16000 Hz and spatiotemporally resampling the video data to 456 pixels \(\times \) 256 pixels and 25 frames per second.
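One way to reproduce this preprocessing step is sketched below using ffmpeg called from Python; the tool and the exact command-line options are illustrative assumptions, and only the target sampling rate, frame size and frame rate come from the description above.

import subprocess

def preprocess_clip(clip_path, audio_path, video_path):
    # Extract and resample the audio track to 16000 Hz.
    subprocess.run(["ffmpeg", "-i", clip_path, "-vn", "-ar", "16000", audio_path],
                   check=True)
    # Rescale the frames to 456 x 256 pixels and resample to 25 frames per second.
    subprocess.run(["ffmpeg", "-i", clip_path, "-an", "-vf", "scale=456:256",
                    "-r", "25", video_path], check=True)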

We implemented the network in Chainer [31] with CUDA and cuDNN. Most of the processing took place on a single chip of an Nvidia Tesla K80 GPU acceleratorFootnote 5, where processing took approximately 50 ms per training example and 2.7 s per validation/test example. Figure 2 shows five validation examples and the corresponding predictions.

Accuracy was defined as 1 - mean absolute error. We report the validation accuracies of the network after 300, 600 and 900 epochs of training (Table 1). The average validation accuracy of the network increased as a function of the number of training epochs, reaching a highest average validation accuracy of 0.9121. We also report the test accuracy of the network after 900 epochs of training, which won third place in the challenge, and compare it with those of the models that won the first two places in the challenge (Table 2).
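For reference, the accuracy measure used in Tables 1 and 2 can be computed as follows (a minimal sketch; the array names are illustrative):

import numpy as np

def accuracy(predictions, targets):
    # predictions, targets: arrays of shape (n_clips, 5) with trait values in [0, 1].
    # Accuracy is defined as 1 - mean absolute error.
    return 1.0 - np.abs(predictions - targets).mean(axis=0)  # per-trait accuracies

def average_accuracy(predictions, targets):
    return float(accuracy(predictions, targets).mean())      # averaged over the five traits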

Table 1. Validation accuracies of the challenge model after 300, 600 and 900 epochs of training.
Table 2. Test accuracies of the models that won the first three places in the challenge.
Fig. 2. Example thumbnails of the videos of five people and the corresponding predicted personality traits. Each trait takes a value between [0, 1]. Each color represents a trait. From left to right: Openness, agreeableness, conscientiousness, neuroticism and extraversion.

4 Post Challenge Models

For completeness, we briefly report our preliminary work on two models that we evaluated after the end of the challenge.

First, we separately fine-tuned the original DNN after 300 epochs of training for each trait. Everything about the fine-tuned DNNs (i.e. architecture, training and validation/test) was the same as for the original DNN, except for their fully-connected layers, which output one value rather than five.

Second, we trained a recurrent neural network (RNN) on top of the original network. The RNN comprised two layers of 512 long short-term memory units [12] and one layer of five linear units. At each time point, the RNN took as input the layer 5 features of a one-second video clip, and its output was the predicted traits. Dropout [27] was used to regularize the hidden layers.

We used Adam to train the model by iteratively minimizing the mean absolute error loss function between the target traits and the predicted traits at each time point. Backpropagation was truncated after every 15 time points. Once the model was trained, the predicted traits were averaged over the entire video clip.
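A minimal sketch of this RNN and its truncated training loop, again in Chainer, is given below. The input dimensionality, the dropout ratio and its placement, and the handling of clips whose length is not a multiple of 15 time points are assumptions made for the purpose of illustration.

import chainer
import chainer.functions as F
import chainer.links as L

class TraitRNN(chainer.Chain):
    # Two layers of 512 long short-term memory units followed by a layer of
    # five linear units. At each time point the input is the layer 5 feature
    # vector of a one-second clip (its dimensionality, n_features, is assumed).
    def __init__(self, n_features=256, n_units=512, n_traits=5):
        super(TraitRNN, self).__init__()
        with self.init_scope():
            self.lstm1 = L.LSTM(n_features, n_units)
            self.lstm2 = L.LSTM(n_units, n_units)
            self.out = L.Linear(n_units, n_traits)

    def reset_state(self):
        self.lstm1.reset_state()
        self.lstm2.reset_state()

    def __call__(self, x):
        h = self.lstm1(x)
        h = self.lstm2(F.dropout(h))     # dropout regularizes the hidden layers
        return self.out(F.dropout(h))

def train_on_clip(rnn, optimizer, features_per_second, targets):
    # features_per_second: list of layer 5 feature batches, one per time point.
    rnn.reset_state()
    loss = 0
    for t, features in enumerate(features_per_second, start=1):
        loss = loss + F.mean_absolute_error(rnn(features), targets)
        if t % 15 == 0:                  # truncate backpropagation every 15 time points
            rnn.cleargrads()
            loss.backward()
            loss.unchain_backward()
            optimizer.update()
            loss = 0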

Table 3 shows the validation accuracies of the post challenge models. While the post challenge models largely failed to outperform the challenge model, we believe that variants thereof have the potential to do so, and exploring such variants will be the subject of future work.

Table 3. Validation accuracies of the challenge model and the post challenge models.

5 Conclusion

In this study, we presented the approach and results that won third place in the ChaLearn First Impressions Challenge. In summary, we developed and trained an audiovisual deep residual network that predicts the apparent personality traits of people in an end-to-end manner. This approach enabled us to obtain very high performance for all traits, while exploiting the similarity between the hierarchical organization of personality traits and that of deep neural networks, and while circumventing extensive analyses for identifying or designing features relevant to the task of apparent personality trait prediction. Our results demonstrate the potential of deep neural networks in the field of automatic (perceived) personality prediction. Future work will focus on extensions of the current work with recurrent neural networks and language models, as well as on identifying the factors that drive first impressions.