Introduction

Facial expression represents one crucial component of an emotion (Scherer, 2005). In the large corpus of research on emotional facial expression, three major measurement methods have been established: (a) the Facial Action Coding System (Ekman & Friesen, 1978; Ekman, Friesen, & Hager, 2002), (b) electromyography (Fridlund & Cacioppo, 1986; Tassinary, Cacioppo, & Vanman, 2007), and (c) automatic facial expression recognition (AFER) systems (Corneanu, Oliu, Cohn, & Escalera, 2016). All three methods suffer from certain drawbacks, which are outlined in the following. The aim of the present paper is to propose the blenderFace method, which overcomes the disadvantages of the established methods and provides very accurate, direct measurements of facial movement at high spatial and temporal resolution.

The necessity for a new measurement method for facial expressions

The measurement of facial movement via the Facial Action Coding System (FACS; Ekman & Friesen, 1978; Ekman, Friesen, & Hager, 2002) or the Maximally Discriminative Facial Movement Coding System (MAX; Izard, 1983), as well as the emotional interpretation of facial movements via the Emotional Facial Action Coding System (EMFACS; Friesen & Ekman, 1983), the Facial Action Coding System Affect Interpretation Dictionary (FACSAID; Ekman & Hager, 2002), or the Affect Expression Identification System (AFFEX; Izard, Dougherty, & Hembree, 1983), has emphasized the rating of mostly static facial expressions (Cohn & Schmidt, 2004; Tcherkassof, Bollon, Dubois, Pansu, & Adam, 2007). In FACS coding, one or more trained FACS coders judge the extent of facial muscle movement, mostly from a still picture, for virtually all facial muscles on a 5-point ordinal scale: the specific muscle movements are denoted as numbered Action Units (AUs), and the extent of each movement is graded from A to E. Although it is possible to rate the dynamics of facial expressions in an image sequence, this technique has turned out to be very time consuming (Cohn & Schmidt, 2004). Further drawbacks of a FACS rating with subsequent emotional interpretation are that (a) only salient, but not minor or subtle, facial movements can be detected and therefore rated, (b) an ordinal 5-point rating scale might not be precise enough to adequately represent the emotional dynamics of facial expression, whereas facial expression measured at a higher resolution might reveal new findings (e.g., microexpressions; Ekman & Friesen, 1969), and (c) the ratings and the subsequent emotional interpretations may be subject to rating bias (e.g., Horstmann, 2002). In addition, in this line of research, the stimuli used to elicit a facial expression have mostly been instructional, whereas natural emotion-eliciting situations or experimental settings with induced emotions have rarely been investigated (Reisenzein, Studtmann, & Horstmann, 2013). This line of research has therefore raised questions about the ecological validity of the FACS and the generalizability of the results (Motley & Camden, 1988; Russell, 1994).

The second method is the measurement of emotionally relevant facial muscles via electromyography (EMG; Fridlund & Cacioppo, 1986; Tassinary et al., 2007). Fridlund and Cacioppo proposed recording from 11 emotion-specific facial muscles. According to their proposal, a pair of electrodes (Footnote 1) should be placed above and parallel to each emotionally relevant muscle fiber, accompanied by a ground electrode placed on the upper forehead. However, due to the physical size of the electrodes and the conductivity of the skin, crosstalk with neighboring muscle groups is possible; for example, the measurement of the corrugator supercilii may be confounded by activity from the depressor supercilii and procerus (Fridlund & Cacioppo, 1986). Although the EMG method can be applied to measure dynamic activity in the specific muscles used to produce a facial expression, in many cases the data are highly aggregated in subsequent processing: The muscle activity is sampled at a high frequency (usually between 10 and 2500 Hz), passed through several filters (e.g., to filter out the power-supply frequency), optionally integrated or smoothed (e.g., via a moving average), often split into a pre-stimulus-onset phase and a post-stimulus-onset phase, averaged per phase, and then statistically tested. In the end, this leads to a loss of dynamic information, in addition to an increase in the familywise error rate (Hochberg & Tamhane, 1987). Moreover, in many cases, the activity of only one or two muscles is recorded. This is problematic for the interpretation of emotional facial expressions because, first, a single muscle may be involved in several emotional states. For example, according to EMFACS (Friesen & Ekman, 1983), AU1 (frontalis, pars medialis) may be involved in the emotions fear, sadness, and surprise, and AU4 (corrugator supercilii) may indicate sadness, fear, or anger. Therefore, in many cases, emotion specificity is difficult to determine, and an interpretation in terms of valence is preferred. Second, a facial expression might not be shown in a prototypical manner. For example, in a natural, spontaneous facial expression, the emotion might be shown in only some parts of the face (Porter & Brinke, 2008, 2010). Therefore, depending on the position of the electrodes, an emotional expression might not be successfully detected. Finally, the measurement of facial expression via EMG is subject to several practical constraints, such as the potentially disturbing sensation of electrodes and cables on the facial skin and the need for an electromagnetically shielded laboratory (Fridlund & Cacioppo, 1986).

The third and most recent method for measuring facial expressions relies on the analysis of video data by applying automatic facial expression recognition (AFER) algorithms. Computer vision research in this area has reached a mature state and has provided fascinating results (for an overview, see Corneanu, Oliu, Cohn, & Escalera, 2016). Most AFER procedures include four steps (face detection in a picture or footage, detection of natural landmarks, extraction of emotion-relevant features [e.g., mouth corner detection], and expression recognition) and are implemented in both open-source software (e.g., OpenFace: Baltrušaitis, Robinson, & Morency, 2016; Menpo: Alabort-i-Medina, Antonakos, Booth, Snape, & Zafeiriou, 2014; IntraFace: De la Torre, Chu, Xiong, Vicente, Ding, & Cohn, 2015; Computer Expression Recognition Toolbox [CERT]: Bartlett, Littlewort, Wu, & Movellan, 2008; Littlewort, Whitehill, Wu, Fasel, Frank, Movellan, & Bartlett, 2011) and commercial software (e.g., Affectiva, 2017; Emotient Inc., 2016; Kairos, 2017; Noldus Information Technology, 2017; RealEyes, 2017). AFER procedures have been able to achieve a reliable and valid classification of emotional expressions from footage and images, close to or even better than humans (Littlewort, Bartlett, & Lee, 2009; Terzis, Moridis, & Economides, 2010). Despite this very good correct-classification rate, these procedures have drawbacks from an epistemological and psychometric perspective and may be problematic for psychological facial expression research for the following reasons: (a) Most AFER procedures return highly aggregated output, either as basic emotion classifications or as AU activation estimates. In most cases, the classification is performed by artificial neural networks that have been trained on one or more preclassified training samples. This means that only preclassified categories can be detected. For example, an inherent assumption is that the classification categories and the corresponding training samples are correct, but this assumption may be arguable, for example, regarding the number of prototypical expressions involved in basic emotions (Jack, Garrod, & Schyns, 2014) or regarding whether basic emotions have a prototypical appearance at all (Crivelli, Jarillo, Russell, & Fernández-Dols, 2016a; Crivelli, Russell, Jarillo, & Fernández-Dols, 2016b; Elfenbein, Beaupré, Lévesque, & Hess, 2007; Jack, Garrod, Yu, Caldara, & Schyns, 2012). Therefore, AFER procedures bear the risk of circular reasoning because classifications will always be consistent with the predetermined classes, and thus the procedures will not be able to detect prototypical facial expressions other than the ones they were trained to detect (Footnote 2). (b) In addition, when AFER algorithms are applied to classify emotions, they suffer from the shortcoming that the intensities of emotion expressions are not measured, although there are approaches that can do this (Corneanu et al., 2016). In the case of AU activation classification, the output represents a downgrade in the scale of measurement from metrically measured 2D or 3D facial movement to a nominal scale, which is accompanied by a loss of information and accuracy (e.g., AU 12 represents the lip corner puller, AU 15 the lip corner depressor).
(c) Despite the fact that the classification rates of AFER algorithms are very good (Littlewort et al., 2009; Terzis et al., 2010), and the commonly used support vector machines (SVMs) have a strong mathematical basis, translating the classification rules of AFER algorithms into a format that can be comprehended by humans is a complex procedure. For example, when an SVM is used as the classifying algorithm, the coordinates of an n-dimensional hyperplane must be interpreted semantically. Therefore, AFER output may be used descriptively (e.g., the occurrence or nonoccurrence of a specific emotion) rather than analytically (e.g., to test various hypotheses about measured facial movement). (d) Most AFER procedures also allow lower-level output in the form of coordinates of tracked fiducial points/natural landmarks (e.g., the corners of the mouth or the eyes). This information is gathered during the face segmentation phase and is needed to classify facial expressions of emotion. These natural landmarks are determined, for example, by exploiting color and texture information along with ellipsoid fitting or via face saliency maps (Corneanu et al., 2016). AFER algorithms thus try to match a predetermined pattern onto each frame of the footage or image. Because this is a relatively robust (Footnote 3) but not always very accurate and reliable method, a confidence level is computed. This confidence level is estimated by the tracking algorithm and represents the confidence in the current landmark detection. However, the tracker, and consequently the computed confidence level, is not bound to any external, “true” criterion in a psychometric sense. Therefore, the confidence level can take on high values even when the classifications are completely wrong (Evtimov, Eykholt, Fernandes, Kohno, Li, Prakash, & Song, 2017; Sharif, Bhagavatula, Bauer, & Reiter, 2016), thus causing potentially misleading results. (e) In addition, AFER algorithms rely on tracking the natural landmarks of the face (e.g., corners of the eyes and mouth, eyebrows), although there are procedures that allow researchers to model parts of the face that lack natural facial features (e.g., Wood, Baltrušaitis, Morency, Robinson, & Bulling, 2016), for example, the analysis-by-synthesis approach. The basic idea of this approach is to replicate (a subaspect of) a given entity and iteratively minimize the differences between the replica and the target entity; the process of replication and the resulting replica aid the understanding of the entity. For example, Wood et al. (2016) estimated a 3D model (along with light conditions, skin tone, etc.) for a part of the face that lacks natural landmarks and minimized the difference between the original image and the 3D-rendered image by adapting the 3D model in an iterative process. However, this approach does not measure facial movements in facial areas that lack facial landmarks (e.g., movement of the skin on the cheeks); rather, it synthesizes that part of the facial surface iteratively so as to minimize the differences in visual appearance. (f) For commercial AFER software, in most cases the emotion classification algorithm that makes a decision about an emotion (and its intensity) is a trade secret and is therefore unknown. This means that the interpretation and a conclusive assessment of the results remain arguable and do not conform to open-science requirements.

Aims of the present research

The aim of the present research was to develop a new measurement method for psychological facial expression research that avoids the disadvantages and combines the advantages of the established methods while keeping the procedure simple and clear. This method should meet the following requirements: (a) The new method should provide a very accurate, metric-level procedure for measuring raw, uninterpreted facial movements (i.e., not interpreted in terms of prototypical emotions or AUs). The measurement of facial areas that lack natural landmarks must be accurate and reliable. The data generated by the measurement procedure must ideally be free from measurement artifacts and trustworthy in the sense of psychometric objectivity (i.e., via visual transparency and the comprehensibility of the measurement procedure). The measured raw facial movement data can later be analyzed with various statistical analyses to test different theoretical approaches for describing facial movement. (b) The measurement of the temporal dynamics of facial movements should be possible for approximately 10 min of video footage per participant and sample sizes of around 100 participants (Footnote 4). (c) The simultaneous measurement of facial movements for the whole face should be possible without restricting the participant unduly (e.g., with cables), and the procedure should be as unobtrusive as possible in order to facilitate the measurement of facial movements in natural settings. (d) The measurement procedure must be flexible enough to be easily adapted to various research questions and experimental settings and also be extensible by providing interfaces at different levels of data processing, for example, via the use of open-source software. Although very sophisticated solutions for partial aspects of the proposed measurement procedure have been developed (e.g., building an individual facial surface: Jeni, Cohn, & Kanade, 2015; Suwajanakorn, Kemelmacher-Shlizerman, & Seitz, 2014; or the scaling of virtual objects: Ham, Lucey, & Singh, 2014), these approaches could not be integrated because they pursue a different aim (e.g., a photorealistic 3D representation of the facial surface), have technical requirements that do not fit our experimental setting (e.g., a moving camera with an inertial measurement unit; Footnote 5), do not support our aim of directly measuring facial movements, or are very difficult to integrate into the rest of the measurement workflow.

The blenderFace method

The basic idea is simple because it adopts an approach developed by the film industry: Around 1990, when computers became powerful enough to render animated 3D faces, film directors were confronted with the problem that scripted facial movements of 3D figures (e.g., the direction, extent, and speed of the movement of virtual facial muscles; Footnote 6) did not appear authentic and plausible to human perception. This problem was solved by employing a motion-capture procedure: Markers were placed on the head and on emotionally relevant positions of an actor's face, and the face was recorded while the person was acting. Subsequently, the movements of the markers were digitally tracked and transferred onto the virtual face of a 3D figure. In principle, the proposed procedure follows this idea; however, it is aimed solely at measuring facial movements (Footnote 7). In contrast to the expensive equipment used in the film industry, a single standard webcam and the open-source software Blender (www.blender.org) are all that is needed to apply the suggested procedure. The proposed approach uses a 3D model to represent head movements and face topology and is therefore not subject to measurement bias due to head movements. In addition, the proposed method is deliberately based on markers that are applied to the face (i.e., markers painted on the skin), as opposed to the natural-landmark fitting procedures of AFER algorithms, as applied, for example, in OpenFace (Baltrušaitis et al., 2016). The use of applied markers means that movements of the facial skin can be measured very precisely and reliably even in parts of the face that are far from the available natural facial landmarks (e.g., on the cheeks). Furthermore, the proposed measurement procedure satisfies all the requirements mentioned earlier: (a) very accurate measurement at (b) high temporal resolution for an arbitrary time period (e.g., 10 min), with (c) the simultaneous measurement of movements for the whole face, and with (d) an open and flexible procedure.

For the postprocessing of the measured data (e.g., visualization, standardization, plausibility checks), we developed the open-source blenderFace package for the free and open-source R statistical programming language (R Core Team, 2016). In the following, we describe the proposed procedure in principle. For a detailed description, see the supplemental material of this article, especially the documents “Step_by_Step_Instructions.pdf” for the tracking procedure with Blender and the vignette of the blenderFace package (“blenderFace.pdf”) on how to postprocess facial movement data. The supplemental material also includes an example video and training video clips, as well as example Blender data files.

Tracking setup

Illumination

Because the proposed procedure is an optical measurement, a well-illuminated, shadowless face and head is essential. Shadows may change the visual pattern that is to be tracked and may thus cause the Blender pattern recognition algorithm to fail and abort tracking. This does not constitute a complete failure, because Blender can easily be instructed to use an updated pattern; however, it decelerates the tracking procedure and requires manual intervention (Footnote 8). It is also advisable to use flicker-free LED illumination and to switch off neon tubes: Neon tubes flicker at their power-supply frequency, which may interfere with the webcam shutter frequency and lead to a small but noticeable up-and-down sliding of tracked markers; this does not, however, impair the blenderFace method in general.

Camera

There are no special requirements for the camera. Any existing lens distortion, for example from a wide-angle lens, can be corrected in later steps of the blenderFace method. The camera must be firmly mounted (e.g., on top of the computer monitor) to record the face of the participant from a frontal view. We successfully used a Logitech C910 webcam, a Logitech C920 webcam, and a Mobius Actioncam with resolutions of 640 × 480 pixels (px), 1280 × 720 px, and 1920 × 1080 px at 24 and 30 frames per second (fps). The higher the resolution of the camera, the more accurate the measurement of facial movements, albeit at the cost of a larger video file. Moreover, the final file size also depends on the length of the video and the type of compression chosen for the video clip.

Regarding the optimal frame rate, according to the Nyquist–Shannon sampling theorem (Shannon, 1949), a signal should be sampled at a rate of at least twice the maximal frequency that is to be measured. For practical reasons, for example due to noise and measurement error, the rate should be 4 to 6 times that high (Fridlund & Cacioppo, 1986). Because the fastest facial movements last approximately 250–300 ms (Dimberg, Thunberg, & Grunedal, 2002; Porter & Brinke, 2008; Yan, Wu, Liang, Chen, & Fu, 2013), a frame rate of 24 or 30 fps is still more precise by a factor of approximately 10 and can therefore be considered sufficient to measure even rapid facial movements. To achieve high and reliable video quality, automatic adaptation of the frame rate, aperture, and so on should be disabled or held constant in the webcam driver configuration. In addition, the video should ideally be saved in a lossless format or at high data rates in order to prevent compression artifacts in the video clip and to increase tracking speed in a later step of the procedure.
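To make the sampling-rate argument concrete, the following R snippet (a minimal sketch that only restates the numeric values given above) works through the arithmetic:

# Rough check of the sampling-rate argument, using the values given in the text
fastest_ms   <- c(300, 250)          # duration of the fastest facial movements (ms)
signal_hz    <- 1000 / fastest_ms    # corresponding signal frequencies: ~3.3-4.0 Hz
nyquist_hz   <- 2 * signal_hz        # theoretical minimum sampling rate: ~6.7-8.0 Hz
practical_hz <- c(4, 6) * signal_hz  # recommended 4-6 times the signal frequency: ~13-24 Hz

webcam_fps <- c(24, 30)              # common webcam frame rates
all(webcam_fps >= max(nyquist_hz))   # TRUE: both frame rates satisfy the Nyquist minimum
round(1000 / webcam_fps)             # frame duration: 42 ms and 33 ms, far shorter than
                                     # the 250-300 ms of the fastest facial movements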

Synchronization

Options for and the expense of synchronizing the recorded footage with external events (e.g., the beginning and the ending of the stimulus presentations) depend to a large extent on experimental and laboratory settings. From the technical side, Blender is capable of processing time codes (e.g., from time-code capable cameras) or building time-code proxies, which may be used in combination with time-code-capable stimulus presentation software or data-recording devices (e.g., EEG). However, in many cases, the technical expense of using a time-code or beacon synchronization may be more than is necessary, and a burn-in of a time stamp with the option of additional information (e.g., the subject number, action markers, stimulus presentation episodes) inserted into the video data stream should be sufficient. As a fallback, it has been shown to be advantageous to have a mirror or a second monitor behind a participant’s back, thus recording the stimuli as well. This provides a simple yet effective, reliable, and technically easy way to synchronize the presentation of the stimuli.

Characteristics of the participants

In general, there are no restrictions regarding which participants are suitable for the measurement procedure. Table 1 presents an overview of the reasons that participants had to be excluded from the studies presented in this paper. Problems may occur when the markers that need to be tracked are drawn on the participants' skin: The markers should be placed in a way that prevents them from disappearing into skin folds during the expression of an emotion (e.g., in the fold of a smile). Heavily made-up or bearded participants are also difficult to track because the markers might not be clearly visible. However, this does not mean that such participants must be excluded completely; rather, marker tracking requires more corrections and manual work to ensure that the markers are tracked correctly. Facial markers that are visible through participants' glasses are more problematic because they appear in a biased position. This leads to a higher tracking error because lens distortion by glasses is not considered in Blender's optical model (Footnote 9). Therefore, markers that are visible through glasses should not be tracked. No further standardization procedures (e.g., a fixed distance between the camera and the face at the beginning of the measurements) are required for the participant.

Table 1. Reasons for and numbers of participants excluded from data analysis in Studies 1 and 2

Markers

For the proposed procedure, three types of markers are needed: (a) static markers on the head of the participant, used to track head movements and disentangle these movements from movements that are part of emotional facial expressions; (b) surface markers, used to estimate the individual's facial surface; and (c) emotion markers, used to track the emotionally relevant facial movements. To measure head movements, we recommend tracking 12 static head markers (Blender needs a minimum of eight), which must be visible in all frames of the video clip. These markers must be mounted in a fixed position, for example, on a cap or on the headphones (see Fig. 1), and tracked throughout the video. If some markers are hidden by parts of the face or the head (e.g., during extreme head turns), it is possible to track additional static head markers that overlap for a few frames with existing static head markers at the beginning and end of that episode. In addition, to ensure reliable and accurate three-dimensional tracking of head movements, the static head markers must be placed at different depth levels of the head (e.g., some on the forehead and some at the depth level of the ears).

Fig. 1. Participant with glue dots as static markers on a cap and headphones, and black fluid eyelid-liner dots as surface and emotion markers painted on the facial skin

A pattern of 68 surface/emotion markers on the participant's skin (cf. Figs. 1 and 2) is needed to allow a precise measurement of facial movement. Markers, irrespective of type, should be easily recognizable as a relatively stable pattern by the Blender pattern recognition algorithm over the sequence of frames of the video clip. This is best achieved by a large contrast in shape, brightness, and color compared with the background. In the studies presented in this paper, we used colored glue dots on a cap and headphones as static markers, and distinct, quickly drawn black fluid eyelid-liner dots on the facial skin as surface/emotion markers; both can reliably be recognized as a stable pattern by Blender's tracking algorithm (see Fig. 1). This preparation step was completed in approximately 2 min per participant in our studies (Footnote 10).

Fig. 2. Positions of the proposed facial surface and facial emotion markers. The left part of the figure shows a 2D projection and the right part a 3D projection of the markers on the face. In the left part of the figure, grey markers represent the facial surface markers needed to estimate the individual's facial surface; black markers additionally represent facial emotion markers that may be associated with AUs from the FACS

Parallax phase

To accurately estimate the three-dimensional surface of the participant's face, the 68 surface markers (see the grey dots in the left part of Fig. 2) are needed on the participant's face to allow for a stable and reliable assessment in three dimensions (Footnote 11). It is sufficient to track the surface markers for only a short episode (e.g., 100–200 frames) to estimate an individual's facial surface. It is important that no facial movement is shown during this episode and that the episode contains parallax of the head. Parallax is the displacement in the apparent position of an object viewed along two different lines of sight. To be able to use images from the video clip to estimate the three-dimensional surface of the face, at least two images from different lines of sight are needed; in our case, this is the head with the face viewed from two slightly different perspectives.

Therefore, a short parallax phase at the beginning or the end of the video is needed to generate parallax for the face and thereby provide an individual estimate of the three-dimensional facial surface: We instructed participants to direct the tips of their noses toward the camera lens and toward the middle of the right border of the computer screen for 3 s at each head position. In general, any brief episode in the footage that shows a slight head movement with no facial expression can be used. Therefore, a suitable parallax episode may also be found for participants who do not comply with the parallax phase instructions. However, participants who do not show any head movement at all (during the parallax phase or in any other phase) or who always show a facial movement during the parallax phase have to be excluded from the tracking procedure. An inappropriate parallax phase (very slight head movements or facial movement during the parallax phase) will result in a high solve error and thus in an imprecise 3D model and an inaccurate estimation of head movement.

The estimated individual facial surface is needed to provide a projection surface for the emotionally relevant markers. Surface markers at emotionally relevant positions on the face can not only be used for 3D surface estimation but can also be tracked during the emotionally relevant episodes of the video clip. The emotionally relevant episodes are the sections of the video clip that are of substantive interest, for example, a social interaction or several phases of an experiment in which different stimuli are presented. The movement of these emotion markers represents the outcome of the measurement procedure and can be interpreted emotionally (see the black dots in the left part of Fig. 2). Therefore, emotion markers have to be tracked only for the episodes in which researchers want to measure emotional expression.

Tracking procedure

In this section, we describe the general blenderFace method of tracking the static markers on the head, the facial markers used to generate an individual 3D model of a participant's face, and the emotion markers. This procedure ensures that the movement of the markers is measured with high precision and independently of head movements. At the end of this procedure, raw motion data as well as scaling data are exported for further standardization and statistical analyses.

Starting with Blender version 2.61, the Movie Clip Editor provided a motion tracking module that relies on a visual pattern recognition algorithm. In this algorithm, one or more key visual patterns (markers in our case) have to be defined on a start frame. In the sequence of follow-up frames, these key patterns are searched for in a predefined search area.

Tracking head movements

In the first step of the tracking procedure, the static head markers have to be tracked to obtain information about head movements. In addition, the 68 facial surface markers must be tracked for the short parallax phase of the video clip. Subsequently, from the parallax displacement of the tracked markers across the frames of the video clip, 3D coordinates for each tracked marker, along with Blender's virtual camera movement, are computed in the Blender 3D space. In contrast to the setting in reality, in which the head and the (neutral) face turn in front of a static camera, in Blender the movement is reattributed to the camera. Therefore, in Blender's 3D space, head movements are transformed into virtual camera movements and the head remains in a static position (see Fig. 3). For example, a head turning upward is represented as a camera moving downward, and a head turning to the left is represented as a camera moving to the right. This reattribution is optically (Footnote 12), logically, and mathematically equivalent to the original movement. Because the 3D coordinates of the tracked markers are estimated simultaneously with Blender's virtual camera movement, the two are represented as one “marker coordinates/camera track” unit in Blender. This has three advantages for our purpose: (a) The “marker coordinates/camera track” unit can be moved to the origin of Blender's 3D coordinate system without affecting the proportions. This is important in later steps of the blenderFace method because it facilitates the export of meaningful coordinates. (b) The “marker coordinates/camera track” unit can easily be rescaled, also without affecting the proportions. This is used in later steps of the blenderFace method to rescale the “marker coordinates/camera track” unit, for example, into mm. (c) Due to the reattribution of the movement to the camera, the virtual head is static and does not move in Blender's virtual coordinate system. As a consequence, the emotion markers that are tracked later are no longer affected by head movements, which also simplifies the export of emotion marker movements in later steps of the blenderFace method.

Fig. 3
figure 3

Blender screenshot: the orange pyramid in the upper left represents the camera; the black line represents the movement of the camera in front of the grey static facial mask; the orange circles on the mask represent the tracked emotion markers projected from the camera onto the facial mask (blue dotted lines)

The overall accuracy of this “marker coordinates/camera track” unit estimation is made available in Blender's solve error parameter. The solve error represents the mean deviation of the marker positions computed from the parallax model from the actually tracked marker positions in the video clip. The solve error should be below 0.3, which corresponds to a mean deviation of less than a third of a pixel between the model-based tracks and the tracked markers in the video clip (Footnote 13).

Building the individual facial surface

In the second step, the individual 3D surface of the participant's face is built on the basis of the 3D coordinates of the 68 surface markers. First, the facial surface is constructed by connecting four markers at a time to form rectangles (see the connection lines between the facial surface markers in the left part of Fig. 2). Subsequently, this rough approximation of the facial surface by rectangles is interpolated and smoothed to closely fit the participant's real facial surface. Afterward, the tip of the nose in the facial surface is centered at the origin of Blender's 3D coordinate system.

Tracking emotion markers

In a third step, the facial emotion markers must be tracked for the emotionally relevant episodes of the video clip. The movement of these markers is also exported into Blender's 3D space, namely as a projection from the moving camera onto the static facial surface (Fig. 3). The movement of these markers on the facial surface represents the movement of the markers painted on the participants' skin. A Python script uses Blender's application programming interface (API) to access the 3D coordinates of the emotion markers per frame, exports the coordinates, and saves them in a comma-separated values (CSV) file for each participant for further data processing.
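To give an impression of what this per-participant output looks like when it arrives in R, the following sketch reads and inspects one exported file (the file name and the column layout, one row per frame with x/y/z columns per marker, are hypothetical examples; the actual layout is defined by the export script):

# Minimal sketch: read one exported per-participant file and inspect it.
# The file name and column names (e.g., Frame, A5_x, A5_y, A5_z) are hypothetical.
raw <- read.csv("Subject_037.csv", stringsAsFactors = FALSE)

str(raw)             # one row per frame, x/y/z columns per tracked marker
range(raw$Frame)     # frames covered by the tracked episodes
colSums(is.na(raw))  # frames in which a marker could not be tracked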

Scaling and standardization procedures

Because coordinates in Blender are represented in Blender's default unit of measurement, the so-called Blender Unit (BU), two individual scaling procedures must be carried out in a fourth and last step to allow the Blender data to be rescaled into meaningful measurement units. In principle, 1 BU roughly represents 1 meter in the real world (Footnote 14). However, the process of taking the data from the tracks of the video clip and transforming them into Blender's 3D space cannot be controlled and is therefore relatively arbitrary. Therefore, a BU measurement of a real-world object of known size is needed for the rescaling. In practice, a glue dot of known diameter (8 mm in our case) on the participant's headphones is measured in BUs. With this distance measurement, it is possible to rescale the marker movement, originally measured in BUs, into mm. The second standardization procedure addresses the problem of comparing the facial expressions of participants with different face sizes, for example, a child's face with an adult's face. To prevent a potential bias in face size from affecting the extent of facial movement, the eye–eye distance is measured and used to rescale movement along the x-axis. Accordingly, the eye–mouth corner distance is used to rescale movement along the y-axis. This allows marker movements to be represented in a “standardized” face so that comparisons of the movement can be made across individuals.

The complete tracking of 14 emotion markers in a 12-min video clip takes approximately 40 min (Footnote 15). The procedure thus meets all the requirements described earlier when we argued for the necessity of a new method for measuring emotional facial expression.

Postprocessing of Blender data

The data generated by the proposed Blender tracking procedure are saved in a CSV file for each participant and need to be postprocessed before they can be analyzed statistically. Because the amount of data and the resulting file size can become quite large (Footnote 16), and to provide a standardized postprocessing procedure, we developed the blenderFace package (Footnote 17) for the R language (R Core Team, 2016). The blenderFace package serves to (a) concatenate the single CSV files into one R data file, (b) rescale the data into mm or into a standardized face, (c) center the marker movement at the onset of stimulus presentation, (d) plot raw and aggregated data for plausibility and descriptive checks, (e) compute higher-order movement variables, such as the angle and the median distance of a marker movement, and (f) make use of several CPU cores, if available, to speed up the postprocessing. Examples of these functions will be presented in the sections describing Studies 1 and 2. For a detailed view of the procedures, see the vignette of the blenderFace package. In the following, the general principles of these functions are described.

Concatenating Blender’s CSV data

The first step in postprocessing is to concatenate the Blender data of all participants into one large data file that can be analyzed more easily. The appropriate function in the blenderFace package performs plausibility checks; for example, it tests whether unique marker names have been used in the CSV files, and it also integrates the data when different numbers of markers have been tracked for different participants.
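The following base-R sketch illustrates the underlying logic of this step (it is not the package function itself, and it assumes the hypothetical file layout introduced above with identical columns across files; the package additionally handles differing marker sets):

# Sketch of the concatenation logic: read all per-participant CSV files
# and bind them into one data frame, keeping track of the participant.
files <- list.files("blender_export", pattern = "\\.csv$", full.names = TRUE)

read_one <- function(f) {
  d <- read.csv(f, stringsAsFactors = FALSE)
  d$subject <- sub("\\.csv$", "", basename(f))  # participant identifier from file name
  d
}

all_data <- do.call(rbind, lapply(files, read_one))

# Simple plausibility check: marker columns must not be duplicated.
marker_cols <- grep("_(x|y|z)$", names(all_data), value = TRUE)
stopifnot(!anyDuplicated(marker_cols))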

Rescaling into meaningful measurement units

In a second step, the Blender data, which are scaled in BUs, need to be rescaled into more meaningful units of measurement. The blenderFace package contains two functions to perform a rescaling into either mm or a standardized face. In principle, the rescaling is performed via the rule of proportion.

To rescale into mm, we use the measurement in BUs of an object whose real-world dimensions are known. In the Blender tracking procedure, the diameter of an 8-mm glue dot is used. If the individual measurement of this glue dot diameter in BUs was 0.03, for example, the proportion would be set up as follows:

$$\begin{array}{@{}rcl@{}} \frac{\text{glue dot size in BU}}{\text{glue dot size in mm}} &=& \frac{\text{value in BUs to be scaled}}{\text{outcome value in mm}} \\ \frac{\text{0.03}}{\text{8}} &=& \frac{\text{value in BUs to be scaled}}{\text{outcome value in mm}} \end{array} $$

After solving this equation for the outcome value, it is possible to rescale the x-, y-, and z-coordinates of the markers into mm.
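In R, this rule of proportion amounts to a one-line helper; the following sketch uses our own (hypothetical) function and variable names rather than the package's:

# Rule of proportion: rescale Blender Units (BU) into a target unit,
# given a reference object measured in both units.
rescale_bu <- function(values_bu, ref_bu, ref_out) {
  values_bu * ref_out / ref_bu
}

# Example from the text: a glue dot of 8 mm diameter was measured as 0.03 BU.
rescale_bu(0.03, ref_bu = 0.03, ref_out = 8)   # 8 mm (the glue dot itself)
rescale_bu(0.015, ref_bu = 0.03, ref_out = 8)  # 4 mm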

Rescaling the Blender data into a standardized face follows a similar procedure. We define the standardized face as a two-dimensional square of length 1. According to general proportional features of the face, the eye–eye and the eye–mouth distances are each set to 1/3 of the head width and head height, respectively. The individual eye–eye distance (Footnote 18) and eye–mouth corner distance were measured in the preceding Blender tracking procedure and are then used to rescale the x-axis and the y-axis, respectively. For example, if the individual eye–eye distance in BUs is 0.4, the rule of proportion is constituted as

$$\frac{0.4}{1/3} = \frac{\text{value in BUs to be scaled}}{\text{standardized outcome value}} $$

for the x-axis. The y-axis is scaled accordingly. However, we refrained from rescaling the z-axis because we were not able to find a convincing scale factor. For example, the distance between the eyes is largely stable in proportion to head and body size, because this distance is needed for stereoscopic vision. This is not the case for the height of the nose (which might be used to rescale the z-axis), because the height of the nose differs considerably between individuals and may be influenced by the climate zone that an individual's ancestors came from (Noback, Harvati, & Spoor, 2001). The z-axis may be considered in later versions of the blenderFace package; however, for the statistical analyses and two-dimensional plots presented in the following, no z-axis is needed. The reason is that, according to test runs, only a negligible amount of the variability in facial expression movements takes place along the z-axis. Therefore, ignoring the z-axis provides computational efficiency with presumably very little loss of information. Moreover, the z-values are predetermined by the facial surface on which the markers move. Therefore, using the 3D facial surface as a projection surface serves to prevent the projection bias that, for example, a flat projection surface would produce.
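The same hypothetical helper from the mm example can be reused for the standardized face, with the individually measured eye–eye distance as the reference and 1/3 as the target value:

# Standardized face: an eye-eye distance of 0.4 BU (the example above) is
# mapped onto 1/3 of the standardized head width; y-values are rescaled
# analogously with the eye-mouth corner distance.
x_values_bu <- c(0.02, -0.05, 0.10)  # example marker x-coordinates in BU
rescale_bu(x_values_bu, ref_bu = 0.4, ref_out = 1/3)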

Centering data at the beginning of an emotionally relevant episode

Because individual faces differ in size and topology, it is not possible to draw the markers at exactly the same standardized positions on the face for each participant. If uncorrected, these between-person differences in the drawn marker positions would introduce unsystematic error into the measurement procedure: If we aggregated uncorrected raw data across participants, the variability in the start positions of a marker would bias the start positions of the movement. Therefore, it is necessary to center the markers at the onset of each emotionally relevant episode. This is possible because it is not the absolute position of a marker on the face that is of interest but the marker's movement in reaction to a presented stimulus.

Depending on the experimental settings (e.g., the number of subjects, the number of experimental conditions, the number of tracked emotion markers, the length of the tracked footage), the raw data set containing the tracked markers can become relatively large (Footnote 19). To center the emotionally relevant episodes of the raw data set, the corresponding R function can use several CPU cores to speed up the centering process. To estimate how long the centering process might take, Fig. 4 shows the relationship between the number of CPU cores used and the duration of the centering process in minutes for different processors.

In principle, the centering is performed by selecting the values of the stimulus onset frame for the x-, y-, and z-axes of a marker per presented stimulus per subject. Subsequently, these values are subtracted from the corresponding values of the following frames for the duration of the episode in which the stimulus is presented. For example, if the onset frame of the stimulus episode “posing disgust” of Subject 37 contains the value −14.5 for the x-axis of a marker position, this value is subtracted from the x-axis values of all frames within the stimulus episode “posing disgust”.
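The sketch below illustrates this centering logic for a single marker within one stimulus episode (the data frame and function name are made up for illustration; the package function operates on the full data set and can use several CPU cores):

# Centering one marker within one stimulus episode (e.g., Subject 37 posing disgust):
# subtract the value of the stimulus onset frame (the first row) from all frames.
center_episode <- function(episode, cols = c("x", "y", "z")) {
  for (col in cols) {
    episode[[col]] <- episode[[col]] - episode[[col]][1]
  }
  episode
}

episode <- data.frame(frame = 1:4,
                      x = c(-14.5, -14.2, -13.0, -12.1),
                      y = c(3.1, 3.3, 4.0, 4.4),
                      z = c(0.2, 0.2, 0.3, 0.3))
center_episode(episode)  # all coordinates now start at 0 for this episode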

Fig. 4. Relationship between the number of CPU cores used and the duration in minutes of the centering process for the emotional episodes, for different processors. All calculations were done in RAM without swapping. The data set used for centering contained 113 participants, 14 emotion markers, and eight emotional episodes per participant

Visual representation of the data

A visual representation of the data at different levels of aggregation offers a quick plausibility check and also allows possible outliers and artifacts to be detected (e.g., markers disappearing in skin folds, markers hidden by a hand moved in front of the face, tracks jumping between two different positions because of two highly probable matching patterns, etc.). To keep things simple, all plot functions ignore the z-axis, for the reasons given above.

The blenderFace package offers functions to plot (a) individual or aggregated raw data or marker movement on a standardized face, to get an impression of the overall marker movement and detect markers that may contain outliers (e.g., Fig. 5); (b) the individual median movement per marker, to detect individuals with unusual marker movement (e.g., Fig. 6); (c) the x- and y-movement of (symmetrically painted) markers per frame, to identify frames with suspicious marker movement (e.g., Fig. 7); and (d) the individual or aggregated median movement per stimulus episode with quartile ellipses, to get an overall impression of marker movement per presented stimulus (e.g., Fig. 8). These plots are explained in more detail in the sections in which Studies 1 and 2 are presented.
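For orientation, a plot of type (a) can be approximated with a few lines of base R; the sketch below uses illustrative data and is not one of the package's own plotting functions:

# Base-R sketch of a raw-trajectory plot for one centered marker (illustrative data).
episode <- data.frame(x = c(0, 0.3, 1.5, 2.4), y = c(0, 0.2, 0.9, 1.3))
plot(episode$x, episode$y, type = "b", asp = 1,
     xlab = "x movement (mm)", ylab = "y movement (mm)",
     main = "Centered movement of one marker")
abline(h = 0, v = 0, lty = 3)  # origin = marker position at stimulus onset

Setting asp = 1 keeps the x- and y-axes on the same scale, so that the plotted angle reflects the true direction of movement, as in the figures below.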

Fig. 5. Raw data plot of centered marker movements on a standardized face per emotionally relevant episode, aggregated over participants. For each stimulus episode (posing happiness, sadness, disgust, and anger), four symmetrical facial markers were plotted: on the forehead, the cheeks, the nasolabial area, and the mouth corners

Fig. 6. Plot of the centered median movement of the right cheek marker (Cheek_R) during the “disgust” stimulus episode for each individual, as denoted by the subject number. Movement is scaled to the standardized face. The x- and y-axes of the plot have the same range, so the plotted angle reflects the correct direction of movement

Fig. 7. Plot of Participant 37 for the disgust stimulus episode (Frames 1707–1969), showing the x- and y-axis movements of the right (red, orange) and left (blue, green) cheek markers. Note that the origin of the coordinate system is at the tip of the nose; therefore, the left and right markers target opposite directions on the x-axis (red and blue lines). The plot's x-axis represents the frame numbers, and the y-axis represents the movement in mm

Fig. 8. Plot of the centered median movement in mm of the four symmetrical markers per stimulus episode, aggregated over participants in Study 1. The ellipses represent the quartiles of the movement distributions for the x- and y-axes. The x- and y-axes of the plots are scaled to the same range, so that the angle represents the true direction of movement

Higher order parameters of facial movement

After the Blender data are postprocessed and corrected for outliers and artifacts, it is possible to analyze the data in several ways. Currently, the blenderFace package provides functions to compute the angle and the distance of marker movements, so that movements can be compared with respect to direction and extent across different stimuli. However, the package will be developed continuously to extend its capabilities. One of our next tasks, for example, is to add functions to compute the speed of movement; the onset, apex, and offset phases of an expressive episode; and symmetry parameters of facial movement.
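As an illustration of these two parameters, the following sketch computes the direction and the median distance of a centered marker trajectory (the function names and exact definitions are ours; the package may define these parameters differently):

# Direction (angle) and median distance of one centered marker trajectory.
movement_angle <- function(x, y) {
  # angle of the median displacement in degrees (0 = rightward, 90 = upward)
  atan2(median(y), median(x)) * 180 / pi
}
movement_distance <- function(x, y) {
  # median Euclidean distance from the onset position
  median(sqrt(x^2 + y^2))
}

x <- c(0, 0.3, 1.5, 2.4); y <- c(0, 0.2, 0.9, 1.3)  # centered example trajectory
movement_angle(x, y)     # ~31 degrees, i.e., upward and to the right
movement_distance(x, y)  # ~1.1 (in the unit of the input, e.g., mm)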

Study 1

The main purpose of Study 1 was to determine the optimal tracking settings, tracking parameters, and tracking procedure for Blender. These properties were tested in a small sample comprising 55 participants and eight emotion markers. In addition, in this study, we tested the required hardware characteristics (e.g., camera, processing performance), illumination settings, and synchronization with the stimulus presentation procedure.

Method

Participants

A total of 55 students from different disciplines at the University of Koblenz-Landau, Germany, were recruited and received either course credit or payment for their participation. However, due to improper lighting conditions, incorrectly placed markers, obliterated markers, or improper behavior by the participants (see Table 1), data from only 39 participants (age M = 22.1, SD = 5.78; 77% female) could be used in the subsequent analyses.

Design and procedure

In Study 1, participants' data were collected in individual sessions. At the beginning of each session, black markers were painted on the participant's face, and the participant was equipped with a black cap and black headphones with colored glue dots attached. Four symmetrical pairs of emotion markers were used: Forehead markers were placed at the positions of the left and right inner eyebrows, nasolabial markers were placed to the left and right beneath the nose, mouth markers were placed at the left and right corners of the mouth, and cheek markers were placed on the left and right cheeks. The cheek markers were included to provide a measurement of movement where no natural landmarks are available.

A script started and stopped the video recording (FFmpeg, ver. 2.2.16) and the presentation of the stimuli simultaneously. The presentation of the stimuli was implemented in Millisecond Inquisit (ver. 3.0.6.0), with each consecutive stimulus presented for a predefined duration. First, the participants were instructed to direct the tip of the nose toward the upper left corner and toward the middle of the right border of the computer screen. This parallax phase lasted 10 s at each position to obtain some parallax for the facial surface estimation. Subsequently, the participants were asked to show the emotions happiness, sadness, disgust, and anger for 10 s at a time. The instructions read, for example, “Please show the emotion happiness” (in the case of happiness), accompanied by an example picture of a facial expression of the corresponding emotion taken from Olszanowski, Pochwatko, Kukliński, Ścibor-Rylski, and Ohme (2008). The instructions were presented below the picture along with a 10-s countdown. Thereafter, additional stimuli were presented to the participants; however, these were outside the scope of the present study. Finally, the participants were debriefed and compensated for their participation.

Measures

Tracking was done for the complete video clip, including the parallax phase and the facial expression episodes. However, due to missing, hidden, or obliterated facial surface markers, we decided not to estimate individual facial surfaces but to use a standard facial surface mask that was based on averaged faces from preliminary investigations.

Thereafter, the four symmetrical pairs of facial emotion markers were tracked and projected onto the standard facial surface mask. Subsequently, the tracked emotion markers were exported into a CSV file, matched with the stimulus presentation episodes, merged into a single raw data file in R, and saved for further processing.

Results

Accuracy of the measurement procedure

To estimate the accuracy of the tracking procedure, we calculated the following parameters: (a) Accuracy of measuring head movements and estimating the facial surface: The solve error represents the mean deviation between the tracked marker positions in the video and the marker positions estimated by the model (and projected onto a 2D surface). This solve error was below 0.3 for all participants, representing a mean deviation between the video-tracked marker position and the model-estimated marker position of maximally one third of a pixel. (b) Accuracy of measurement of the scaling parameters: The measurements of the distances between the eyes, between the mouth corners, between the left eye and the left mouth corner, and between the right eye and the right mouth corner can be used to compute a reliability estimate. The reliability of these four measures was α = .996, which also reflects interindividual differences in head proportions (i.e., participants had different mouth widths relative to the distance between their eyes). (c) Accuracy of the model building and rescaling procedure: In a pilot study, a paper cuboid of 10 × 10 × 20 cm, roughly representing the area and size for which the blenderFace method is intended, was constructed and equipped with glue dots. For this cuboid, the real-world positions of the glue dots were known. The cuboid was recorded on video, and the blenderFace measurement procedure with the subsequent scaling into mm was performed. The distances that were measured and scaled by Blender differed from the real-world measures by a mean of M = 0.80 mm (SD = 0.54), with a maximum of 1.82 mm, across the 12 distances that were measured. Although the glue dots were placed very carefully by hand, this deviation also includes a manufacturing bias in the glue dot placement. Altogether, these parameters do not directly estimate the accuracy of the measurement of emotion marker movement, but they show that the measurement procedure itself is very accurate.

Outliers and artifacts

The raw data file was rescaled into mm and also into the standardized face. Both rescaled data sets were centered per stimulus episode (posing happiness, posing sadness, posing disgust, and posing anger). For each stimulus episode, a raw data plot was generated to detect outliers or unusual movements (see Fig. 5).

A close inspection of these plots (Footnote 20) revealed potential outliers. A deeper analysis of these marker movements was indicated to rule out errors or artifacts. As an example, this procedure will be shown for the right cheek marker in the “posing disgust” episode; however, it was performed for all markers in all stimulus episodes. The lower left plot of Fig. 5, for the “posing disgust” episode, revealed an unusual movement pattern of the right cheek marker toward the left. This movement did not appear to be common to all participants; however, this could not be reliably determined from the plot aggregated over participants. Therefore, a second plot of the individual median (Footnote 21) movement of the right cheek marker in the disgust episode revealed that Participant 37 had caused this deviant movement (Fig. 6).

Once the participant, the marker, and the episode for which the outlier or artifact occurred were identified, a very specific inspection of the x- and y-movement of the marker per frame was performed. Figure 7 shows the x- and y-movements of the right and left cheek markers of Participant 37 for the frames of the “posing disgust” episode. The deviations in the plot can be interpreted in mm, because the scaled-to-mm data set was used.

However, the plot revealed no artifacts or outliers but rather an asymmetrical expression of disgust in the cheek markers of Participant 37. Whereas the left and right y-axis lines (yellow and green) run largely parallel, this was not true for the x-axis (red and blue lines). Note that the origin of the coordinate system is at the tip of the nose, which means that, in the plot, the x-values of the markers on the left and right sides of the face run in opposite directions. Nevertheless, if the x-values of the right cheek marker (red line) were mirrored along y = 0, the median deviation would be larger (≈ 6 mm) than that of the x-values of the left cheek marker (blue line, ≈ 4 mm). A visual inspection of Frames 1,700 to 2,000 of the video clip of Participant 37 confirmed the assumption of an asymmetric expression of disgust.

For other cases in which artifacts were actually detected, for example, tracks jumping between two positions because the search pattern had a high probability of fitting two positions in the search area, the artifacts were corrected. This was done in Blender by setting a new pattern for the affected track and subsequently retracking, exporting, and postprocessing the data. For cases in which retracking was not possible (e.g., bad illumination conditions, hidden markers), the x-, y-, and z-values of this marker were set to “not available” (NA) in the raw data file in R for the corresponding frames, followed by the usual postprocessing of the raw data.

Facial movement in response to the emotional stimuli

As the main outcome of the procedure, we created a combined plot of marker movement per stimulus episode aggregated over participants. Figure 8 shows the markers arranged in accordance with their actual facial positions, beginning with the left and right forehead markers, followed by the cheek markers, the nasolabial markers, and the mouth corner markers. Data scaled to mm were used, and the x- and y-axes were scaled to the same range. Therefore, the direction of the plotted movement reflects the true movement of the facial skin. The ellipse around the median point represents the quartiles of the movement distribution for the x- and y-axes. In addition to the stimulus episodes in which participants showed happiness, sadness, disgust, and anger, a neutral episode, taken from the parallax phase of the video clip, was added to reflect measurement noise when no emotional expression was shown. However, the neutral episode reflects not only the measurement error of the blenderFace method but also unintentional movements (e.g., mouth corner movement occurring while swallowing). The relevant parameters for this plot, the angle and the distance for each stimulus episode per marker, can also be printed (see Table 2).

Table 2. Angle and distance of facial movement in Study 1

Discussion

The study was conducted to test the blenderFace method and its boundary conditions. As a result, we can state that the blenderFace method can be implemented with good, shadow-free illumination on standard hardware (computer, webcam), using only open-source software. The tracking procedure, and therefore the measurement, turned out to be very accurate, because the mean deviation of the tracks was below 1 pixel and below 1 mm, respectively. Because the blenderFace method uses markers that are painted on the facial skin, it becomes possible to measure movement in facial areas that lack natural landmarks (e.g., on the cheeks). In the postprocessing of the Blender data, a reliable detection of outliers and artifacts is possible. Once the data are corrected, statistical analyses of the movements of the facial skin can be performed. The measured movement corresponds closely to the definitions of the EMFACS/FACSAID (see Ekman & Hager, 2002; Friesen & Ekman, 1983). For example, for the emotion disgust, the cheek, nose, and mouth markers move upward, which represents the activation of AU 9 (nose wrinkler) and AU 10 (upper lip raiser). For the emotion happiness, the cheek, nose, and mouth markers move upward and sideways, representing the activation of AU 6 (cheek raiser) and AU 12 (lip corner puller).

In a second study, the blenderFace method was tested with a larger number of simultaneously recorded emotion markers, with a larger sample, and by estimating each participant’s individual facial surface. In addition, technical improvements that facilitate the synchronization of the stimulus presentation with the timescale of the video clip were checked (e.g., a timestamp and the participant number burnt into the video clip).

Study 2

The aim of Study 2 was to test the full set of proposed markers (see Fig. 2), to test the improved experimental settings (e.g., lighting, video compression), to test the improved stimulus–video synchronization (e.g., a time stamp burned into the video clip), and to replicate the findings of Study 1.

Because there were more markers to track, it was no longer practical to use marker labels based on facial landmarks (e.g., “mouth corner”, “inner eyebrows”). A labeling scheme based on the FACS (Ekman & Friesen, 1978; Ekman et al., 2002) using Action Units (AUs) as marker labels is also inappropriate, because an AU defines both a position on the face and a direction of movement. In contrast, the blenderFace method measures the visible movement of a marker drawn on the facial skin, which may move in virtually any direction. Therefore, we decided to use a straightforward labeling scheme that defines only the position of a marker on the face (see the left part of Fig. 2): Letters define the x-axis position in the sense of a longitude, whereas numbers define the y-axis position (i.e., latitude) of a marker in the surface marker mesh. The face is divided vertically by the “A”-axis—running from the center of the forehead, via the tip of the nose, to the chin—into two symmetrical halves, labeled the “left” and “right” parts of the face. Left and right refer to the observer's perspective when looking at the face (not to the participant's own left and right). Starting from the “A”-axis, the subsequent vertical axes are labeled “B”, “C”, “D”, and “E” toward the ears, combined with the suffix “L” for the left and “R” for the right part of the face. This labeling scheme facilitates the comparison of corresponding markers on the two sides of the face. The horizontal grid lines are labeled from top to bottom, from “1” at the upper forehead, via “5” intersecting the tip of the nose, to “10” at the bottom of the chin. For example, the marker “A5” denotes the tip of the nose, and “CL7” denotes the left corner of the mouth.
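
For readers who want to work with these labels programmatically, the full grid of label names implied by the scheme can be generated as in the short R sketch below; note that the actual marker set uses only a subset of these grid positions (see Fig. 2).

```r
# Generate the label grid implied by the scheme: the central "A"-axis has no
# side suffix, whereas the axes B-E exist once per side ("L"/"R"); rows run 1-10.
rows <- 1:10
axes <- c("B", "C", "D", "E")
labels <- c(
  paste0("A", rows),                                   # e.g., "A5"  = tip of the nose
  as.vector(outer(paste0(axes, "L"), rows, paste0)),   # e.g., "CL7" = left mouth corner
  as.vector(outer(paste0(axes, "R"), rows, paste0))    # e.g., "CR7" = right mouth corner
)
labels
```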

Method

Participants

One hundred fifty students from different disciplines at the University of Koblenz-Landau, Germany, were recruited and received either course credit or payment for their participation. However, 37 participants were excluded due to illumination problems, inappropriate camera driver settings (e.g., autofocus, automatic brightness and contrast), hidden markers, or extreme head movements (see Table 1). A total of 113 participants (age M = 23.1, SD = 2.3; 82% female) were retained for further analyses.

Design and procedure

In Study 2, participants' data were collected in individual sessions as the last part of an experimental sequence that fell outside the scope of the present article. As a cover story, the entire experimental sequence was presented as being about eye tracking. As in Study 1, before the experimental sequence began, the participant was equipped with a marker cap and marker headphones. In contrast to Study 1, the participants were painted with the full set of 68 facial markers (see Figs. 1 and 2). Again, the simultaneous beginning and end of the video recording and the stimulus presentation were controlled by a script. For additional video–stimulus synchronization, FFmpeg was used to burn the time code, the participant number, and the stimulus episode into the video clip. In addition, a mirror behind the participants allowed the presented stimuli to be recorded, which permits synchronization at the level of individual video frames.
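
As an illustration of this step, a participant label and a running frame counter can be burned into a clip with FFmpeg's drawtext filter, for example from R via system2(). The call below is a minimal sketch: file names, text content, and placement are assumptions, and the filter string may need to be adapted to the local FFmpeg build (e.g., available fonts).

```r
# Hypothetical sketch: burn a participant label and the running frame number
# into the video with FFmpeg's drawtext filter (file names are made up).
args <- c(
  "-i", "participant_042.mp4",
  "-vf", "drawtext=text='P042 frame %{n}':x=10:y=10:fontsize=24:fontcolor=white",
  "-an",                                   # audio is not needed for tracking
  "participant_042_stamped.mp4"
)
system2("ffmpeg", args)
```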

The participants were asked to show a neutral, a disgusted, a happy, and a fearful facial expression for 5 s each. The instruction to show an emotion was given by the sentence “Please show the emotion …” along with a 5-s countdown presented at the bottom of the screen. However, we did not present an example picture of the corresponding emotion, to prevent participants from simply imitating the depicted expression. In the last part of the experimental episode relevant to the present study, the participants were asked to point the tip of their nose directly at the webcam and then at the middle of the right border of the computer monitor for 5 s each. Again, this was done to obtain a parallax for estimating the individual's facial surface.

Thereafter, further stimuli were presented, but these fell outside the scope of the present investigation. Finally, the participants were debriefed and received course credit or payment for their participation.

Measures

Analogous to Study 1, the tracking of the static head markers was performed for each participant's complete video clip, whereas the 68 surface markers were tracked only for the short parallax episode. Subsequently, the head movements were estimated and the 3D surfaces of the individual faces were built. Finally, the emotion markers were tracked for the emotionally relevant stimulus episodes, projected onto the individual facial surfaces, and exported as CSV files. Subsequent postprocessing included adding the stimulus presentation information, merging the CSV files, scaling the marker movements to mm and to the standardized face, and centering the marker movement at the onset of each stimulus episode.
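
To make the last step concrete, centering can be thought of as subtracting each marker's position in the first frame of a stimulus episode from all frames of that episode, separately per participant and marker. The R sketch below illustrates this under hypothetical file and column names; it is not the package's own postprocessing code.

```r
# Hypothetical sketch: center the scaled x/y values at the onset of each
# stimulus episode, per participant and marker (file and columns are made up).
center_at_onset <- function(d) {
  d <- d[order(d$frame), ]
  d$x_centered <- d$x_mm - d$x_mm[1]   # first frame of the episode = origin
  d$y_centered <- d$y_mm - d$y_mm[1]
  d
}

merged <- read.csv("study2_markers_mm.csv")
groups <- split(merged, list(merged$participant, merged$episode, merged$marker),
                drop = TRUE)
centered <- do.call(rbind, lapply(groups, center_at_onset))
```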

Results

Accuracy of the measurement procedure

Again, the solve error for all participants was below 0.3, indicating a mean deviation of one third of a pixel per participant between actual tracks and model-estimated tracks. The internal consistency of the measurement of the four scaling distances (eye–eye distance, mouth corner–mouth corner distance, and left/right eye–mouth corner distance) was α = .985.
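
For illustration, such a coefficient can be computed from a data set with one row per participant and one column per scaling distance, for example with the psych package; the data below are simulated and merely stand in for the real measurements.

```r
# Hypothetical sketch: Cronbach's alpha across the four scaling distances,
# computed on simulated data (one row per participant, values in pixels).
library(psych)
set.seed(42)
face_size <- rnorm(113, mean = 300, sd = 25)   # latent overall face size
scaling <- data.frame(
  eye_eye         = 0.35 * face_size + rnorm(113, sd = 2),
  mouth_mouth     = 0.30 * face_size + rnorm(113, sd = 2),
  left_eye_mouth  = 0.40 * face_size + rnorm(113, sd = 2),
  right_eye_mouth = 0.40 * face_size + rnorm(113, sd = 2)
)
alpha(scaling)   # reports raw and standardized alpha
```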

Outliers and artifacts

An artifact/outlier analysis was performed for all markers in all stimulus conditions. We ascertained that the data did not contain any errors and were suitable for further statistical analyses.

Facial movement in response to the emotional stimuli

As the main outcome, the median movements per stimulus episode aggregated over participants, along with the quartile ellipses, are presented in Figs. 9 and 10. The data set scaled to mm was used for these plots; therefore, the median movements can be interpreted in mm.

Fig. 9

Plot of the median movement in mm of the first four symmetrical markers per stimulus episode aggregated over participants in Study 2. The ellipse represents the quartiles of the movement distribution for the x- and y-axes. The x- and y-axes of the plots are scaled to the same range so that the angle represents the true direction of movement. The marker positions are shown in the left part of Fig. 2

Fig. 10

Plot of the median movement in mm of the second four symmetrical markers per stimulus episode aggregated over participants in Study 2. The ellipse represents the quartiles of the movement distribution for the x- and y-axes. The x- and y-axes of the plots are scaled to the same range so that the angle represents the true direction of movement. The marker positions are shown in the left part of Fig. 2

We also included an episode with no facial expression (neutral) to represent measurement noise. However, the overall measurement noise was very low (see the black ellipses in Figs. 9 and 10 and Table 3; mean distance in mm, irrespective of angle: M = 0.026, SD = 0.013). The marker movement can also be expressed in terms of parameters such as angle and distance (Table 3).

Table 3 Angle and distance of facial movement in Study 2

Again, the marker movements conformed to the EMFACS/FACSAID (Ekman & Hager, 2002; Friesen & Ekman, 1983) descriptions of the AUs activated in emotional facial expressions: For the emotion of happiness, AU 6 (cheek raiser) and AU 12 (lip corner puller) should be activated; these are visible in Figs. 9 and 10 for the markers CL4/CR4 and CL7/CR7 as a sideways and upwards movement (green dot and ellipse). The other markers (e.g., BL5/BR5 in Fig. 10) follow this movement to some extent because the skin is also pulled in the corresponding direction.

For the emotion of disgust, the description of activated AUs is heterogeneous: AU 9 (nose wrinkler) and AU 10 (upper lip raiser) are very common, but so are pressed lips and lowered mouth corners (AU 15, lip corner depressor; AU 16, lower lip depressor; AU 17, chin raiser) or a slightly opened mouth (AU 25, lips part; AU 26, jaw drop). The red points and ellipses in Fig. 10 show upwards movements of BL4/BR4 and BL7/BR7, but no dragging down of the mouth corners or raising of the chin in the data aggregated over participants. However, a tightening of the eyebrows toward the root of the nose (BL2/BR2 and DL2/DR2) was measured, as mentioned in the literature (Ekman & Friesen, 1975; Rozin, Lowery, & Ebert, 1994).

For the emotion of fear, mainly the activation of AU 1 (inner brow raiser) and AU 2 (outer brow raiser) and sometimes AU 4 (brow lowerer, which leads to a narrowing of the eyebrows) are reported. These movements are visible as blue points and ellipses in Fig. 9 for the markers BL2/BR2 and DL2/DR2 as an upwards and center-directed movement. Movements in other, seldom reported AUsFootnote 2 referring to the mouth region (AU 20, lip stretcher; AU 25, lips part; AU 26, jaw drop) could be measured only to a very small extent.

In summary, the findings obtained with the blenderFace method in Study 2 replicate and extend the findings of Study 1. Again, they closely confirm the EMFACS/FACSAID (Ekman & Hager, 2002; Friesen & Ekman, 1983) assumptions about the AUs activated in emotional facial expressions, while the blenderFace method offers an objective way to measure the underlying facial movements.

Discussion

Study 2 showed that it is possible to measure up to 14 emotion markers simultaneously in a sample of 113 participants for 12 min of footage with the blenderFace method. The optical measurement of facial movements with the blenderFace method was very precise, as can be seen from the mean deviation of one third of a pixel between the video tracks and the model-estimated tracks, an internal consistency of the scaling measures close to 1, and measurement noise in the neutral condition corresponding to a mean movement of 0.026 mm.

The measured movements for the 14 emotion markers corresponded closely to the AUs assumed by EMFACS/FACSAID (Ekman & Hager, 2002; Friesen & Ekman, 1983). The practical improvements in Study 2, such as the illumination or the time code in the video clip, strongly facilitated the synchronization of the tracking procedure and the stimulus presentation. In addition, the use of individual facial surfaces improved the measurement accuracy, as can be seen from the decrease in marker movement noise in the neutral stimulus condition in Study 2 compared with Study 1, in which no individual facial surfaces were used.

General discussion

In this paper, we presented the blenderFace method, which circumvents the drawbacks of the existing methods used to measure facial expressions because it offers a very accurate (sub-second, sub-pixel, and sub-millimeter range), non-intrusive, and simultaneous raw-data measurement of several emotionally relevant markers at a high temporal resolution. Study 2 showed that it is possible to simultaneously measure the temporal dynamics of up to 14 markers at a temporal resolution of 33 ms for approximately 12 min of video footage for each of 113 participants. Due to its open-source approach, the blenderFace method is very versatile, transparent at every step of data processing, and in line with the desiderata of reproducible research (e.g., Fomel & Claerbout, 2009). Although we presented only descriptive findings of the blenderFace method in this paper, they correspond with findings from established measurement instruments (EMFACS, FACSAID; Ekman & Hager, 2002; Friesen & Ekman, 1983). In addition, there are many possible research applications for this method:

(a) The raw data generated by the blenderFace method can be analyzed with inferential statistical methods to test various hypotheses. This can include, for example, the identification of parameters for scoring emotional facial expressions, similar to what Olderbak, Hildebrandt, Pinkpank, Sommer, and Wilhelm (2013) did for AFER output. With the blenderFace method, this can be investigated more efficiently than with FACS; in contrast to EMG measurement, it can easily be done for facial expressions involving the whole face; and, unlike AFER algorithms, it does not presuppose basic emotions, whose existence is still under scientific debate (Gendron & Barrett, 2017). Another way to analyze raw data obtained with blenderFace could be to investigate the temporal dynamics of facial emotion expression. For example, it may be possible to identify movement parameters that allow a distinction between posed and spontaneous facial expressions. Although very good classifiers exist for this distinction (e.g., in the case of pain; Littlewort et al., 2009), these classifiers do not allow researchers to test for the specific parameters used to make the distinction. Parameters that have been suggested in the literature are, for example, pauses, stepwise intensity changes, and several onset, apex, and offset phases of specific facial markers (Dubois et al., 2013; Hess & Kleck, 1990, 1994; Schmidt, Ambadar, Cohn, & Reed, 2006; Schmidt, Bhattacharya, & Denlinger, 2009; Weiss, Blum, & Gleberman, 1987). Therefore, in contrast to AFER classifiers, which do a very good job at classifying, the blenderFace method may lead to gains in scientific knowledge in this area of research. A third way to use blenderFace raw data could be, for example, the analysis of microexpressions (Ekman & Friesen, 1969). Although the concept of microexpressions has existed for decades, it has not yet been extensively investigated (30 hits in the PsycARTICLES and PsycINFO databases as of May 2018).

(b) A novel and unique feature of the proposed blenderFace method is the possibility of superimposing the standardized measurements of several respondents as they react to an experimental stimulus condition. With this, it may be possible to identify common characteristics of facial expressions and individual deviations (e.g., asymmetries or atypical reactions) in an objective manner. This opens a large field of further research options, for example, the identification of facial reaction patterns without the restriction of a priori hypotheses such as basic emotions.

(c) More generally, the blenderFace method can also be used as an optical method for measuring other nonverbal behavior. For example, the head movement and head turn data collected in the blenderFace measurement procedure (in order to disentangle them from facial movements) have not yet been taken into account in the data evaluation. However, head movements and head turns may be used as indicators of approach-avoidance behavior; for example, the speed with which and the extent to which the face turns away from an aversive stimulus can be measured and used as an indicator. Provided that appropriate markers are used, other nonverbal behavior can also be measured. For example, with markers on the hands and arms, speech-supporting illustrators could be recorded, and with markers on the body (e.g., on the shoulders), body movements during a dyadic interaction could be measured.

(d) The video material also contains additional information that could be evaluated in combination with the blenderFace method. For example, it would be possible to derive biophysiological indicators such as pulse rate or increased blood flow to the skin by amplifying the color intensity of the video (Wu, Rubinstein, Shih, Guttag, Durand, & Freeman, 2012) and to combine them with participants' facial expressions.

However, the proposed blenderFace method is also subject to limitations: (a) The use of markers in the blenderFace method is relatively complex, not so much in comparison with FACS ratings or EMG measurement, but in comparison with AFER procedures. However, it is precisely this use of markers painted on participants' facial skin that allows the accurate measurement of facial movements. (b) At the moment, facial movement is measured in three dimensions but analyzed in a two-dimensional way. One reason for this is data reduction; as outlined above, this comes at a reasonable cost in measurement precision. Another reason is that, for the transformation into a standardized face, no suitable scaling object exists for the z-axis (e.g., there is a great deal of interindividual variability in the size and height of the nose, independent of head size). However, provided a suitable scaling object is available, three-dimensional processing is possible. (c) Markers next to the eyelids (e.g., for AU 5 or AU 7) are currently very hard to track because eye blinking constantly changes the tracking pattern and requires many manual corrections. For this purpose, other techniques (e.g., robust ellipsoid fitting) are more suitable.

Due to its openness, the blenderFace method can be adapted to various experimental conditions and can also be improved on the technical side. For example, in combination with the openCV library (www.opencv.org), the functionality of Blender may be enhanced to find and track a given set of markers automatically. Furthermore, a second camera could be used to obtain a stereoscopic view, which in turn could be used to compute a 3D surface of the face automatically. In this spirit of openness, we hope that the blenderFace method will be used, adapted, and improved.