Psychophysics is the quantitative study of the relation between physical stimuli and perception (Kingdom and Prins, 2016). As such, its success is determined by our ability to precisely manipulate and measure physical stimuli. This is a relatively straightforward task when the focus is on low-level stimulus features, such as brightness, color, contrast, or spatial frequency information, among others. However, the task becomes more complex when one is interested in the stimulus features that affect our perception of complex objects and scenes. We sorely lack methods to produce naturalistic stimuli in a controlled manner, or to measure their relevant properties.

In vision science, faces are one class of complex objects for which we have made important advances. An advantage of faces over other complex objects is that images of different faces can be aligned in their main features (e.g., nose, eyes, mouth). This has enabled the adoption of morphing techniques as a way to manipulate complex face dimensions in a relatively controlled manner (e.g., Steyvers, 1999), the application of traditional psychophysical techniques to estimate templates used to classify faces, such as reverse correlation or bubbles (e.g., Gosselin and Schyns, 2001; Macke and Wichmann, 2010; Mangini and Biederman, 2004; Schyns et al., 2002; Soto, 2019), and the application of multivariate statistics to study the distribution of naturally varying faces, through principal component analysis (PCA) and related techniques (Calder et al., 2001; O’Toole et al., 1993; Turk and Pentland, 1991). For the past two decades, research on face perception has been strongly supported by the availability of such techniques and a vast number of face photograph databases (e.g., Dailey et al., 2001; Ebner et al., 2010; Lucey et al., 2010; Lundqvist et al., 1998; Ma et al., 2015; Strohminger et al., 2016).

However, naturalistic face databases and two-dimensional image manipulation techniques are sometimes insufficient to obtain the tight stimulus control required to answer certain questions in face perception research. For example, imagine an experiment aimed at determining whether people perceive males as angrier than females (e.g., Aguado et al., 2009; Bayet et al., 2015; Becker et al., 2007). An experimenter obtains photographs of male and female actors showing angry expressions and observes that people are faster at recognizing anger in males than in females. Unfortunately, such results could have nothing to do with face perception, and everything to do with the production of expressions by the actors. If males simply produce stronger expressions of anger, then the observed results are not due to perception, but to the specific stimuli used. The exact poses adopted by actors and the strength of their expressions are difficult to control using naturalistic photographs and morphing techniques. It would be desirable to have a way to obtain the exact same expression pose from faces with different identities and sexes. It would also be desirable to easily standardize a set of faces with regard to features that are not of interest in a particular study. Examples include texture and coloration of the face, shape of external features like ears and neck, eye color, and facial hair. Other studies might aim to precisely manipulate such features rather than simply control them. For example, one might want to ask whether neck width, face color, or facial hair, all of which differ between sexes, is what facilitates the perception of anger in male faces. Such control can possibly be achieved using two-dimensional image manipulation, but the process is painstaking and has traditionally involved some degree of manual image manipulation.

Morphing sequences are also used as a way to obtain dynamic stimuli. For example, some studies on facial expression have presented morphing sequences going from a neutral to some target expression, concatenating the morphed images into a video representing a dynamic emotional expression. However, this approach has been criticized on the basis that such morphing sequences might not reflect the changes in face shape that would occur naturally during dynamic expression (Bernstein and Yovel, 2015; Roesch et al., 2011).

An additional limitation imposed by techniques that manipulate two-dimensional images is that they tend to be obscure regarding the specific properties of the physical object that are being modified and the extent of such changes. Take the example of face morphing. There are several different algorithms that can be used to produce morphed sequences. They all first require the manual creation of a template for each two-dimensional image included in the morph, characterizing the positions of landmark features of the depicted objects. Such templates are used to apply a complex nonlinear transformation from one image to another, which is summarized by a single number representing “percentage of change”. Both the morphing transformation space and the changes produced within it are hard to interpret. Ideally, one would want the ability to more precisely manipulate specific shape features of the relevant object, and to describe transformations in terms of such interpretable shape features. Similarly, results from PCA and related multivariate techniques are often difficult to interpret, as the analysis is performed in the space of image pixels rather than in a space composed of higher-level face features. In addition, the obtained statistical description applies to face images obtained from a specific viewpoint (usually frontal).

The use of three-dimensional computer graphics face models provides researchers with more precise control over face stimuli. Indeed, some researchers in the visual neuroscience community have concluded that the level of control necessary to answer some important standing questions can only be achieved through the use of computer-generated face stimuli (Anzellotti and Caramazza, 2014). However, such models are usually described by thousands of parameters that are no more interpretable than image pixels. For this reason, it would be desirable to have software able to create and manipulate realistic three-dimensional face models that are described by a smaller number of parameters that are easily interpretable by researchers.

Face research toolkit (FaReT)

The use of three-dimensional models to generate stimuli for face research has the advantage of providing strong stimulus control without losing a naturalistic appearance. However, such models are usually described as meshes with thousands of parameters, making them just as difficult to interpret as two-dimensional images described in terms of pixel luminance. In addition, such complexity is often compounded by intentional obscurity: the available face databases and relevant morphing software are frequently proprietary and closed-source, limiting the community’s ability to access, expand, and comprehend the resources. Thankfully, MakeHuman (http://www.makehumancommunity.org/; Bastioni et al., 2008; Ceipidor et al., 2008) is a free and open-source computer graphics software package that allows users to create three-dimensional face models described by a relatively small number of easily interpretable and sensibly named parameters (nostrils angle, cheeks outer volume, etc.). In MakeHuman, any face model can be described as a point in a multidimensional shape space, with each shape parameter representing one dimension, as depicted in Fig. 1. The software converts changes in this shape space into changes in the higher-dimensional space of the face mesh.

Fig. 1
figure 1

Schematic representation of how face identity models are generated in MakeHuman. Each model is represented as a point in a multidimensional shape space, where a dimension represents a nameable shape feature (e.g., eye bag distortion, lower lip volume, horizontal scaling of nose). Different models differ in one or more shape dimensions. If a model (e.g., John) is connected to the average model through a line, models on the opposite side of the average along that line can be considered anti-identity models (e.g., anti-John)

Our goal is to enhance MakeHuman’s capabilities in order to facilitate its use by face perception researchers. With this goal in mind, we have developed the Face Research Toolkit, or FaReT, which is composed of several Python plugins that add utilities to MakeHuman. These plugins give researchers who currently rely on image morphing and manipulation analogous functionality, but with much greater control and interpretability. In addition, FaReT includes a database of 27 identity models and 6 expression pose models (sadness, anger, happiness, disgust, fear, and surprise), so researchers can easily get started generating face stimuli using our plugins.

MakeHuman

Originally, MakeHuman was intended to help game designers and animators easily create and rig human models, even going so far as to supply a base set of skin textures, clothing assets, and several skeletons of varying complexity. The MakeHuman developers managed to reduce thousands of parameters (the positions of every vertex of a human body mesh, which consists exclusively of quadrilateral polygons) to dozens that are identified through sensible, easily interpreted names (nostrils angle, cheek outer volume, etc.), while also adding safeguards to help keep the mesh’s vertices smooth (Bastioni et al., 2008).

MakeHuman allows one to easily change common parameters of the face models, such as sex, age, ethnicity, and level of face fat, by manipulating a single slider for each. In theory, any identity can be reproduced through the software, and it is also possible to create completely new realistic faces. Interaction with the software is through an intuitive graphical user interface that does not require prior knowledge or training. Skin texture models can be obtained from MakeHuman or developed independently, and they can be applied to any face model.

Similarly, an expression pose model can also be developed independently and applied to any face model in MakeHuman. The software itself comes with several pose models, and others can be obtained from the active online community. In addition, we have developed our own set of standardized basic emotional expressions (see below), based on pictures from one actor in the Karolinska Directed Emotional Faces (KDEF) database (Lundqvist et al., 1998).

Just as with identity, MakeHuman allows one to easily develop a pose model for any facial expression through its expression mixer plugin. The plugin involves a number of expression parameters that have easily interpretable names, such as “left eye down”, “right inner brow up”, and so on. Each parameter has its own slider in MakeHuman’s graphical interface, which facilitates the creation of new pose models without any prior training. The MPEG-4 standard (Pandzic and Forchheimer, 2002) influenced the development of MakeHuman’s expression mixer in the early years (Ceipidor et al., 2008), and thus many of MakeHuman’s expression parameters are equivalent to facial animation parameters (FAPs) from that widely used standard. In the behavioral sciences, a more common standard used to generate expressions for face research is the facial action coding system (FACS; Ekman and Friesen, 1975), which is related to the MPEG-4 system (Pandzic and Forchheimer, 2002). Researchers who are interested in using the traditional FACS system to generate expression models can do so in MakeHuman through the FACSHuman plugin (Gilbert et al., 2018; https://github.com/montybot/FACSHuman).

Models created with MakeHuman can be exported in formats that are easy to open and use in other 3D modeling and game development software, such as Blender (https://www.blender.org/). This greatly expands the types of research that can be performed using the 3D models created with MakeHuman. Interactive games, virtual reality (VR) videos, and motion capture are all possibilities that are relatively accessible to an interested researcher with enough technical knowledge, facilitated by the fast and easy creation of 3D face models in MakeHuman.

The functionality provided by MakeHuman can be obtained from other software, such as FaceGen Modeller expanded with the FACSGen plugin (Roesch et al., 2011). Stimuli created using FaceGen have been relatively popular among face researchers in recent years (e.g., Ho et al., 2018; Lamer et al., 2017; Oosterhof and Todorov, 2008; Soto, 2019; Soto and Ashby, 2015, 2019; Thorstenson et al., 2019; Uddenberg and Scholl, 2018). However, an important advantage of MakeHuman is that it is open-source software. This means that it can be downloaded and used absolutely free of charge by the scientific community. In addition, the base code for MakeHuman is available for examination and expansion by interested researchers, so any features that the software lacks can be added by the scientific community. Plugins and other add-ons like those provided by FaReT are completely available for anyone not only to use, but also to expand and maintain, even if the original developers discontinue maintenance. Such plugins are written using the Python programming language, which has in recent years become the standard in several areas of psychology and neuroscience.

In contrast, the use of commercial software like FaceGen Modeller can stifle scientific development. For example, there was a period of several years when the incredibly useful FACSGen was no longer being actively developed. If FACSGen had been developed on top of open-source software like MakeHuman, interested researchers could have taken the project into their own hands for maintenance and updating. Although FACSGen is now being developed and maintained again, a license must be purchased both for the plugin and for the base FaceGen software. Thus, only researchers with financial resources can use this setup.

FaReT’s database of three-dimensional face models

To help us establish the usefulness of the toolbox, we created a database of 24 identities. The identity database consists of 12 male and 12 female models. All 24 models are inspired by photographs of real people’s faces; however, the models went through several iterations of modifications to avoid matching the actual identities, while also ensuring that the faces were realistic. The final versions were judged by all three authors as realistic. See Fig. 2 for a frontal view of the final models’ faces.

Fig. 2
figure 2

Renders of the 24 identity models included with FaReT. The top two rows are the male identities and the bottom two rows are the female identities

To add further utility to the database, we also included standardized facial expressions. Using MakeHuman’s expression mixer, we established six poses for expressions that convey six basic emotions (Ekman, 1999): happiness, sadness, anger, fear, surprise, and disgust. With the exception of the happiness expression, we based our expressions on a single actor in the KDEF database (Lundqvist et al., 1998). The original happiness expression was judged (by the authors) to be somewhat off-putting when applied to some faces, so we manually adjusted it to make it more generally applicable. By generating expression pose files, we can systematically apply the same expression to any identity we created in MakeHuman. See Fig. 3 for a frontal view of two identities with the same expression poses applied. This database was the basis for testing the tools we added to MakeHuman through its Python plugin system.

Fig. 3
figure 3

Renders of the six expression pose models originally included with FaReT, applied to two different identity models. Each row contains one identity; the top row is a male identity and the bottom is a female identity. Each column contains a separate expression: anger, disgust, fear, happiness, sadness, and surprise

An advantage of this small database of three-dimensional models is that it can be easily expanded to a much larger database using MakeHuman, including models that are completely different from the originals in some important face category or dimension. Figure 4 shows an example. Here, we have taken a single identity model from our database and used it to easily create seven novel variations. The original model is a 25-year-old Caucasian male with average face fat. By manipulating a single slider in MakeHuman, and in some cases selecting a different skin model (MakeHuman includes standard skin models for different combinations of age, sex, and race), it is possible to create the following versions of the model: teenager, elderly, fat, skinny, Asian, African, and female. Each one of these models took only seconds to generate, which means that in practice, at least 192 models can be generated from our database, tailored to the goals of each research project.

Fig. 4
figure 4

Example of how one of the models in our database (CM01: male, 25 years old, Caucasian, average face fat) can easily be used to generate novel models with a different age, level of face fat, race, or sex. Each one of these models took only seconds to generate in MakeHuman by manipulating key sliders in the GUI (named Age, Head fat, African, Asian, Caucasian, and Gender)

FaReT plugins

Averaging and standardization of three-dimensional face models

Taking advantage of MakeHuman's ability to accept Python plugins that extend its usefulness, we designed several such plugins to assist researchers who are interested in fully controlling their stimuli during face perception experiments. Conveniently, MakeHuman creates human-shaped models given a set of simple, named, numerical parameters. Because researchers often need to average faces (e.g., for studies on norm-based face encoding; reviewed by Rhodes 2017; Webster and MacLeod 2011), we created a plugin to help users average the identities generated in MakeHuman by simply obtaining the numerical average of each parameter. The end-user can specify a set of identity models within a directory and its sub-directories to produce an average identity. In addition to the mean, the plugin also collects information about each parameter’s variability, which allows new faces to be generated roughly within the space of the models being averaged.
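
To make the arithmetic concrete, the following is a minimal sketch of the averaging idea, not the plugin's actual source code; it assumes each model has already been read into a Python dictionary mapping parameter names to numeric values, and the parameter names shown are invented for the example.

```python
# Minimal sketch of parameter-wise averaging (illustrative, not the plugin's code)
from statistics import mean, stdev

def average_models(models):
    """Average a list of parameter dicts and record each parameter's variability."""
    names = set().union(*(m.keys() for m in models))
    avg, variability = {}, {}
    for name in names:
        values = [m.get(name, 0.0) for m in models]  # treat missing parameters as 0
        avg[name] = mean(values)
        variability[name] = stdev(values) if len(values) > 1 else 0.0
    return avg, variability

# Two hypothetical identity models described by a few named parameters
john = {"nose/nose-width": 0.4, "cheek/cheek-outer-volume": -0.2}
mary = {"nose/nose-width": -0.1, "cheek/cheek-outer-volume": 0.3}
average_identity, spread = average_models([john, mary])
```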

Researchers might be interested in creating face stimuli that differ in a number of chosen features but have standard values in all other attributes. This allows one to draw conclusions about the effect of the selected features on perceptual processing, while easily controlling for the possible confounding effect of the nuisance features. To facilitate this process, we developed a standardization plugin. Figure 5a shows a schematic representation of how the standardization process works. The plugin requires as input an original unstandardized model together with a target standard model, and provides as output the original model standardized in the selected features. In MakeHuman, there are two types of features: shapes and geometries. The shape parameters affect only the mesh/wireframe of the human model. Geometries are like accessories for the wireframe: skin, teeth, eyebrows, hair, etc. (anything added onto the wireframe). Therefore, we developed a plugin that can standardize one or both types of features across the identity models. When standardizing the shape features, any of the named parameters used to produce the body’s wireframe can be standardized (e.g., age, sex, ovalness). Users select parameters by name through comma-space-separated, Perl-style regular expressions; e.g., “head” selects any name that contains the ordered letters “head” (forehead-bulge, head-size, etc.), whereas “^head” selects only parameter names that begin with “head” (e.g., head-size, but not forehead-bulge). As an example of formatting, the default shape standardization setting is “^head, neck, ear, proportions”.
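
To illustrate how such comma-space-separated, Perl-style patterns can be matched against parameter names, the sketch below re-implements only the selection logic; it is not the plugin's code, and the parameter names are invented for the example.

```python
import re

def select_parameters(pattern_string, parameter_names):
    """Return the parameter names matched by any of the comma-space-separated regex patterns."""
    patterns = [re.compile(p) for p in pattern_string.split(", ")]
    return [name for name in parameter_names
            if any(p.search(name) for p in patterns)]

names = ["forehead-bulge", "head-size", "neck-circumference", "nose-width"]
select_parameters("head, neck", names)   # ['forehead-bulge', 'head-size', 'neck-circumference']
select_parameters("^head, neck", names)  # ['head-size', 'neck-circumference']
```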

Fig. 5
figure 5

Standardization plugin. Panel (a) shows a schematic representation of the functionality of the standardization plugin. An original unstandardized model is provided together with a target standard model. The plugin outputs a standardized original model. Panel (b) shows the inputs and options included in the plugin's graphical user interface

The simpler of the two categories of features is geometries. The geometries of every identity in the specified directory are set to match the geometries of the standard model selected at the start of the standardization process: that is, the teeth, eyes, skin tone, etc., will all be set to match the specified model. This is important to consider when using separate skin models for men and women (differences in skin tone and contrast are important cues for sex classification; see Russell 2003, 2009), and simply means that stimuli that require distinct geometries (including skin texture/color models) should be standardized separately. With our standardization plugin, it is possible to standardize shapes, geometries, or both at the same time. The standardized models are written to a separate, user-specified directory to avoid overwriting the originals. Figure 5b describes the options offered through the graphical user interface of the standardization plugin.

“Morphing” in three-dimensional shape space

Morphing is an incredibly important tool in face perception research, but the process of creating morphed faces is slow and painstaking, and the resulting sequence is difficult to interpret in terms of meaningful shape parameters. Morphing pixels within images is a nonlinear operation that requires the creation of templates through the manual placement of several nodes outlining each important feature (mouth, eyes, nose, head outline, etc.) for each view, for each expression, and for each face, with the placement of each node subject to human error. Obtaining high-quality morphs usually requires the manual placement of 150+ nodes in each face image (e.g., using Psychomorph or its web-based version, WebMorph). Such templates are view-specific: if they are created for a given face view (typically the front view), then morphing the same face from a different viewpoint will require the creation of a new, independent template.

Through MakeHuman, however, we have full control of the models' parameters and can use the line connecting two identity models in shape space (see black line in Fig. 1) to interpolate between them (e.g., any point in the line segment going from John to the average model in Fig. 1) or extrapolate along the line beyond each of them (e.g., the line segment extrapolating toward anti-John in Fig. 1). This reduces the workload required to create image morphs while also increasing the accuracy of the morphs by removing the process of placing nodes altogether. Because MakeHuman models are three-dimensional, any linear interpolation from one face identity or expression to another can be rendered from any viewpoint and in any illumination condition. Finally, because the interpolation is performed in the multidimensional shape space of MakeHuman parameters, the transformation is interpretable in terms of relative changes in such intuitively named parameters.

Linear interpolation and extrapolation of models is easily achieved in FaReT through the interpolation render plugin. Figure 6 describes the inputs and options included in the graphical user interface of this plugin. The output of the plugin is a sequence of rendered models that can be used to create animated face stimuli, as explained later.

Fig. 6
figure 6

Inputs and options included in the graphical user interface of the interpolation render plugin

Linear interpolation in 3D shape space

The interpolation plugin achieves something similar to “morphing” in MakeHuman, by creating a sequence of renders from linear interpolation between two or more 3D models. In addition to the face identity parameters, it is possible to interpolate between expression pose models, orbital camera positions, or any simultaneous combination of those (see Fig. 7).
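
Conceptually, the interpolation amounts to blending two parameter dictionaries frame by frame. The sketch below shows that arithmetic under the assumption that models are represented as Python dictionaries of named parameters; the function names are illustrative rather than the plugin's API.

```python
def blend_params(source, target, alpha):
    """Linearly blend two parameter dicts: alpha = 0 gives source, alpha = 1 gives target.
    Values of alpha outside [0, 1] extrapolate along the same line (see the next section)."""
    names = set(source) | set(target)
    return {name: (1 - alpha) * source.get(name, 0.0) + alpha * target.get(name, 0.0)
            for name in names}

def interpolation_sequence(source, target, n_frames):
    """One parameter dict per frame, evenly spaced from source to target (n_frames >= 2)."""
    return [blend_params(source, target, i / (n_frames - 1)) for i in range(n_frames)]

# e.g., 30 evenly spaced steps from a neutral expression pose to an angry pose
# frames = interpolation_sequence(neutral_pose, angry_pose, 30)
```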

Fig. 7
figure 7

Examples of linear model interpolation using FaReT. Each row shows a different type of interpolation: camera position, expression, identity, or all of them simultaneously. The Camera morph rotates the camera by 90 degrees. The Expression morph interpolates a male identity between a neutral expression and an angry expression. The Identity morph interpolates from a female identity to a male identity; note that because skin tone settings contain Boolean graphical shader settings, the skin tone is not interpolated in the current version of FaReT. The Complex morph changes a female identity expressing sadness to a male identity expressing surprise while rotating the camera to and from 30 degrees

There are several advantages of this procedure over traditional image morphing. First, the transformation is interpretable in the space of shape parameters. Second, the procedure allows precise control of what parameters are changed, so that “partial morphs” in which only some shape features are changed can be easily achieved. Third, the template creation process can be bypassed while maintaining complete control over the specific, well-defined parameters being changed as well as the number of frames to render. Fourth, as indicated above, interpolation can be performed using not only shape parameters, but also simultaneously expression pose and camera position (see bottom row of Fig. 7).

Linear extrapolation in 3D shape space

During interpolation, the shape parameters of any two identity models, or the pose parameters of any two expressions, are used to create a directional vector from one point in parameter space to the other. The plugin also includes the ability to extrapolate along that vector, beyond the original two models. Extrapolation can be applied using any two models, but if a directional vector is defined from an average face to a target identity, then extrapolation can be used to create caricatured identities and expressions by continuing to apply the vector after reaching the target identity’s parameters. Furthermore, by reversing the direction of the vector, extrapolation can be used to create anti-identities and anti-expressions (see Fig. 1). Both caricatures (e.g., Byatt and Rhodes, 1998; Lee et al., 2000) and anti-faces (e.g., Burton et al., 2015; Cook et al., 2011; Leopold et al., 2001; Rhodes and Jeffery, 2006; Skinner and Benton, 2010) have proven to be important stimulus types for examining face encoding in the human visual system. See Fig. 8 for linear extrapolation examples.
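
In terms of the hypothetical blend_params() function sketched in the previous section, caricatures and anti-models simply correspond to blending weights outside the [0, 1] range; the variable names and weights below are illustrative only.

```python
# average_identity and john are assumed parameter dicts (see the earlier sketches)
partial_john = blend_params(average_identity, john, 0.5)   # interpolation toward John
caricature   = blend_params(average_identity, john, 1.5)   # exaggerates John's deviations from the average
anti_john    = blend_params(average_identity, john, -1.0)  # reverses those deviations (anti-identity)
```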

Fig. 8
figure 8

Examples of linear extrapolation using FaReT. Depending on the two models chosen, the plugin makes it possible to morph to anti-identities and anti-expressions. Creation of caricatures is also possible (not shown)

Obtaining videos of dynamic faces

Face perception researchers have recently become increasingly interested in using dynamic face stimuli in their research, particularly in the study of emotional expression (for reviews, see Bernstein and Yovel, 2015; Duchaine and Yovel, 2015; Lander and Butcher, 2015). A common practice is to create sequences of morphed images from a neutral expression to some target expression, and to concatenate such images into an animated video. A similar process is possible using FaReT. Our plugins automate the linear interpolation/extrapolation process and output a sequence of images that can be used to create videos or animated GIFs (see Fig. 7), which can be presented as experimental stimuli. Because of the high-quality GIFs that the GNU Image Manipulation Program (GIMP; https://www.gimp.org/) exports (surpassing the standard 256-color limit for GIFs), we have also created a Python plugin for GIMP 2.8 to help transform folders of MakeHuman renders into GIFs. Figure 9 shows the graphical user interface developed for this plugin. Researchers interested in creating videos with other formats (e.g., AVI) can also use the image folders created by FaReT as input to the excellent open-source software ImageJ (https://imagej.net).
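
For researchers who prefer to stay within Python, a folder of renders can also be assembled into an animated GIF using the Pillow library; the sketch below is an alternative to the GIMP plugin (and, unlike it, quantizes each frame to the standard 256-color GIF palette), assuming the renders are saved as numbered PNG files.

```python
from pathlib import Path
from PIL import Image  # Pillow

def folder_to_gif(render_dir, out_path, fps=30):
    """Assemble numbered PNG renders from a folder into an animated GIF."""
    files = sorted(Path(render_dir).glob("*.png"))
    frames = [Image.open(f).convert("RGB") for f in files]
    frames[0].save(out_path, save_all=True, append_images=frames[1:],
                   duration=int(1000 / fps), loop=0)  # duration is per frame, in ms

# folder_to_gif("renders/anger_morph", "anger_morph.gif", fps=30)
```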

Fig. 9
figure 9

Graphical user interface of the GIMP plugin used to create high-quality GIF animations from MakeHuman renders

An advantage of creating dynamic face stimuli using FaReT rather than image morphs is that our toolkit uses linear interpolation in the shape and pose spaces, producing a more natural outcome than changes in an arbitrary morphing space, and takes advantage of MakeHuman’s algorithms aimed at simulating morphological features of the human face and its musculature. Recently, some researchers have questioned whether dynamic stimuli created using image morphs can capture the true dynamics of natural emotional expressions (Bernstein and Yovel, 2015; Roesch et al., 2011). The use of MakeHuman is a step in the right direction, especially if future applications precisely manipulate the speed of unfolding of emotional expressions to match those observed in natural face videos.

Communicating with experiment software (e.g., PsychoPy) to render faces dynamically online

As indicated in the introduction, an important motivation behind developing FaReT was to be able to freely manipulate and control face stimuli to such an extent that traditional psychophysical techniques would be available to face researchers. Among such techniques, some of the most useful require online generation of stimuli as a function of the participant’s behavior, including adaptive procedures to estimate thresholds, psychophysical functions, or internal representations (Leek, 2001; Lu and Dosher, 2013; Shen, 2013; Treutwein, 1995; Watson and Pelli, 1983; Watson, 2017).

By implementing a plugin that uses the Python socket module, MakeHuman can respond to external requests to generate and render faces from external experiment control software like PsychoPy (Peirce, 2007, 2009). By generating stimuli online, it becomes possible to run adaptive psychophysics experiments. This tool allows researchers for the first time to easily program adaptive psychophysics experiments involving complex visual objects. However, because of the cumulative nature of rendering time, we recommend generating stimuli that use single images per trial for online experimental paradigms (especially if rendering with advanced lighting settings). For the time being, experiments that require stimuli with many frames (i.e., dynamic face stimuli) should be constrained to pre-generated sets of stimuli in most systems. The fast rendering required to generate dynamic stimuli online is possible, but it is likely to require both powerful hardware and additional programming (e.g., by parallelizing the task of rendering a sequence of images).

Instructions and examples on how to manipulate shape and expression parameters from Python code using the Socket Render plugin are available in the FaReT GitHub page: https://github.com/fsotoc/FaReT#communicating-with-psychopy-to-render-faces-online. The plugin allows one to load MakeHuman identity and expression models, modify their parameters, and render the modified model online. It is also possible to create the models from scratch within Python, by simply generating dictionaries of parameter values. To use as a reference, we include tables with all the relevant parameters for the models of face shape and expression (see Appendices 1 and 2), together with the name given to the corresponding feature in MakeHuman’s GUI, and the range of values that the parameter accepts. This will allow researchers to use the MakeHuman GUI to understand how different parameters change face shape and expression, and then use that knowledge to directly change the parameters from within their Python code.
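
To give a sense of what this looks like from the experiment side, the following is a schematic client sketch; the port number, JSON message format, and command name are placeholders rather than the Socket Render plugin's documented protocol, which is described on the GitHub page above.

```python
import json
import socket

HOST, PORT = "127.0.0.1", 12345  # assumed: MakeHuman with the plugin running locally

def request_render(shape_params, expression_params, out_file):
    """Send a (hypothetical) JSON render request and wait for the plugin's reply."""
    message = {"command": "render",            # placeholder command name
               "shape": shape_params,          # e.g., {"nose/nose-width": 0.4}
               "expression": expression_params,
               "output": out_file}
    with socket.create_connection((HOST, PORT)) as conn:
        conn.sendall(json.dumps(message).encode("utf-8"))
        return conn.recv(4096).decode("utf-8")  # e.g., an acknowledgement once the render is saved

# Inside a PsychoPy trial loop, the rendered image could then be shown as a stimulus:
# request_render(current_shape, current_expression, "trial_042.png")
# stim = visual.ImageStim(win, image="trial_042.png")
```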

Installing and learning to use FaReT

The latest version of FaReT can be downloaded from the following link: https://github.com/fsotoc/FaReT. The page includes instructions on how to install and use the different components of the toolkit. A folder is included with our database of MakeHuman models of face identity and expression.

Study 1: database validation

We performed a validation study to determine whether dynamic stimuli created using FaReT are reliably perceived by human participants. If our synthesized identities and expressions are indeed perceived correctly, then we would expect that facial expressions would receive the highest ratings for the emotion that they actually represent and that the participants would rate the male identities as more masculine than the female identities. Both of these criteria have been previously used to validate computer-generated face stimuli (Roesch et al., 2011). In addition, we wanted to know to what extent the stimuli generated with FaReT are considered to be unusual by human participants. Ideally, computer-generated face stimuli would be comparable to naturalistic photos, but it seems clear that this level of realism cannot be achieved by currently available software (e.g., MakeHuman or FaceGen). However, it would be useful to quantitatively evaluate the level to which participants judge our stimuli to be unusual compared with naturalistic stimuli. With that goal in mind, our study was modeled after a previous validation study by Ma et al. (2015), who collected norming data for photographs of faces. We used their questions so that we could compare the responses about our computer-generated face images to the responses about their naturalistic face images.

Method

Participants

One hundred and twenty-three students from Florida International University (located in Miami, Florida) were recruited to participate in this study in exchange for credit. Thirty-eight participants tried to take the survey multiple times after being removed by catch questions (see description below) that assessed whether they were actually reading the questions. Six additional participants did not finish the survey and were also removed. The remaining 79 participants (71 female participants) finished the entire survey.

The participants’ ages ranged from 18 to 45 years (M = 22.28, SD = 4.82). The participants’ reported ethnicity distribution was as follows: 5 participants (6.33%) reported Asian, 15 reported Black or African American (18.99%), 33 (41.77%) reported White, 2 (2.53%) reported Native Hawaiian or Pacific Islander, 47 (59.49%) reported “Hispanic or Latino(a)”, and 1 (1.27%) reported “Other”.

Stimuli

The face stimuli shown to participants were animated GIFs, created from renders of one identity model showing one of the six expression pose models (i.e., happy, sad, afraid, surprised, angry, or disgusted). The animation involved a simple camera rotation, with the camera viewpoint orbiting back and forth horizontally between −30 and 30 degrees. The full camera motion consisted of 120 frames shown at 30 frames per second. The camera started from a frontal viewpoint, and the full camera movement loop lasted four seconds.

All 27 identity models were shown to participants: 12 male models, 12 female models, average male, average female, and global average. All possible combinations of identity and expression models were used, for a total of 162 animated stimuli. However, not every participant saw the same combination of identity and expression. Rather, each participant rated a subset of 23 face/expression combinations, which were selected randomly without replacement across participants. This sampling ensured that all face/expression combinations were shown once before the sampling process was started again.

Procedure

The study was performed online through the Qualtrics survey program (https://www.qualtrics.com/). Participants were tasked with rating identities from our database by answering 13 questions per stimulus, in which they were asked to rate the extent to which the identities appeared threatening, attractive, baby-faced, trustworthy, masculine, fearful, sad, angry, happy, surprised, disgusted, and unusual on a scale from 1 to 7 (1 being the lowest and 7 being the highest). Participants also rated the identity’s Euro/Afrocentricity on a scale from 1 to 100. As indicated earlier, the study included the same questions as a prior validation study by Ma et al. (2015). Responses to all questions are available to interested researchers (raw data can be downloaded from https://osf.io/grp9d/), but here we report analyses of only some of the questions, aimed at validating the stimuli generated by FaReT.

Each participant only had to rate 23 of our stimuli, which were intended to be pseudo-randomly assigned to ensure that the stimuli were all rated before reshowing one. There were four catch trials inserted randomly among the other trials, with the goal of determining whether participants were paying attention to the rating task, rather than responding randomly without regard for instructions. In catch trials, participants were simply asked to select a specific numerical rating on the 1-to-7 scale (even though no dimension was specified). If participants failed to provide the specified rating, they were redirected to the end of the survey, and their data was not included in the results. The same two stimuli were always presented during catch trials: a surprised female model and a happy global average model.

Results and discussion

One criterion used in the past to determine whether participants reliably perceive synthetic identities is that they unambiguously recognize them as either male or female (Roesch et al., 2011). Seventy-nine participants provided masculinity ratings, and they rated the masculinity of male identities (M = 5.32, SD = 1.09) significantly higher than that of female identities (M = 3.53, SD = 1.19); t(78) = 13.01, p < .001, Cohen’s d = 1.46. This implies that the participants were able to differentiate males from females in our database despite the lack of geometry-based cues (i.e., hair).

A second criterion used in prior research to validate computer-generated face stimuli is checking that participants rate the displayed expression most highly in its corresponding emotional category (e.g., faces expressing fear are given a relatively high fearful rating, faces expressing happiness are given a relatively high happiness rating, etc.; see Roesch et al., 2011). As shown in Table 1, this was true of our stimuli for the most part. When participants were shown fearful faces, they rated them higher in fear expression than in any other expression. This was confirmed by a repeated-measures ANOVA, F(5, 438) = 19.74, p < .001, \( {\eta}_p^2 \) = .1839, and by a planned contrast comparing the mean fear rating versus the mean rating of all other emotions, F(1, 438) = 40.52, p < .001, \( {\eta}_p^2 \) = .0847. When participants were shown sad faces, they rated them higher in sadness expression than in any other expression. This was confirmed by a repeated-measures ANOVA, F(5, 438) = 32.68, p < .001, \( {\eta}_p^2 \) = .2717, and by a planned contrast comparing the mean sadness rating versus the mean rating of all other emotions, F(1, 438) = 122.11, p < .001, \( {\eta}_p^2 \) = .2180. When participants were shown angry faces, they rated them higher in anger expression than in any other expression. This was confirmed by a repeated-measures ANOVA, F(5, 426) = 39.37, p < .001, \( {\eta}_p^2 \) = .3160, and by a planned contrast comparing the mean anger rating versus the mean rating of all other emotions, F(1, 426) = 152.8, p < .001, \( {\eta}_p^2 \) = .2639. When participants were shown happy faces, they rated them higher in happiness expression than in any other expression. This was confirmed by a repeated-measures ANOVA, F(5, 444) = 22.2, p < .001, \( {\eta}_p^2 \) = .2, and by a planned contrast comparing the mean happiness rating versus the mean rating of all other emotions, F(1, 444) = 110.4, p < .001, \( {\eta}_p^2 \) = .1991. When participants were shown disgusted faces, they rated them higher in anger expression than in any other expression, and second highest in disgust expression. The difference in ratings was significant according to a repeated-measures ANOVA, F(5, 438) = 35.39, p < .001, \( {\eta}_p^2 \) = .2877. A planned contrast showed that disgusted faces were rated significantly higher in disgust expression than in all other expressions combined (i.e., on average), F(1, 438) = 14.98, p < .001, \( {\eta}_p^2 \) = .0331. Finally, when participants were shown surprised faces, they rated them higher in surprise expression than in any other expression. This was confirmed by a repeated-measures ANOVA, F(5, 438) = 35.8, p < .001, \( {\eta}_p^2 \) = .2901, and by a planned contrast comparing the mean surprise rating versus the mean rating of all other emotions, F(1, 438) = 127.4, p < .001, \( {\eta}_p^2 \) = .2253.

Table 1 Emotion ratings provided by participants in Study 1

In sum, for the most part, the participants rated the displayed expression most highly in its corresponding emotional category. However, participants rated identities expressing disgust as showing a higher anger expression than a disgust expression (see Table 1). Given that our expression pose models were developed based on the photographs from a single actor from the KDEF database (Lundqvist et al., 1998), one possibility to consider is that the poor recognition of disgust in our study could stem from the ambiguity of this expression in the original photographs of that particular actor. This is highly unlikely, as norms have been published for the KDEF database (Goeleven et al., 2008), and people consistently classified the actor’s expression as disgusted (hit rate above 90%), with the most common misclassification error being anger. We discuss other explanations for this specific result later.

The mean unusualness rating for our identities was right in the middle of the scale (M = 4.09, SD = 1.06), significantly higher than the rating assigned to the photographs in the Ma et al. (2015) study (M = 2.56); t(78) = 12.82, p < .001, Cohen’s d = 1.44. In that study, participants rated face photographs at the lower end of unusualness (i.e., well below the scale midpoint of 4). Assuming that our participant population is similar to that used in the Ma et al. (2015) study (it might not be, as their demographics differ considerably), this implies that our identities are more unusual than the naturalistic photos, which could be due to the lack of hair, lighting/shading effects, etc. We reasoned that if lack of hair contributed to the relatively high unusualness ratings, then, because men are more prone to balding, their unusualness ratings should be lower than those of women. We performed an exploratory test comparing the unusualness ratings of male (M = 4.20, SD = 1.12) and female (M = 4.07, SD = 1.24) identities, but the difference was nonsignificant; t(78) = 1.06, p = .29, Cohen’s d = .12.

Table 2 shows the mean unusualness of the stimuli broken down by expression pose model. It can be seen that, on average, some models had higher unusualness than others. The two highest-rated models were happiness and disgust. As mentioned before, these models seemed problematic for other reasons as well. In the case of happiness, the model was manually modified after judging that its application to some identities could produce off-putting results. We might have been unable to solve this issue completely. In the case of disgust, this model produced much higher ratings of anger than disgust from participants (see Table 1). Although confusion between disgust and anger is commonly found in the literature, even with naturalistic photographic stimuli, it is problematic that the mean anger rating for these stimuli is about 1.5 points higher than the mean disgust rating. To address both issues, we built two new expression pose models and validated them in a new study that is presented in the following section.

Table 2 Unusualness ratings provided by participants in both studies

Study 2: validation of alternative models of happiness and disgust

As mentioned above, the results of Study 1 prompted us to develop two new expression pose models of happiness and disgust. For happiness, research has shown that open-mouth smiles in three-dimensional models are judged as more authentic than closed-mouth smiles (Korb et al., 2014), and thus we built a model showing an open-mouth smile. For disgust, we reasoned that we needed to build a model showing a feature that is found only in disgust and not in anger. Rozin et al. (1994) showed that there are a variety of ways to express disgust depending on the presence or absence of upper lip retraction, nose wrinkle, and gape and tongue extension. Each of these disgust expressions carries a different meaning, with upper lip retraction carrying information about disgust in social situations (e.g., breaking of moral rules), the nose wrinkle communicating a bad smell (and to a lesser extent a bad taste), and the gape and tongue extension communicating oral irritation and the need to expel contaminated food from the mouth (Rozin and Fallon, 1987; Rozin et al., 1994). Upper lip retraction is common in expressions of anger and contempt, which are important in social communication. Probably due to constraints in the structure of the face (Rozin et al., 1994), a nose wrinkle tends to be accompanied by a general contraction of face features thought to be related to sensory rejection (Susskind et al., 2008). Such contraction can also be found in expressions of anger and contempt. On the other hand, gape and tongue extension is found exclusively in disgust and not in anger or contempt. Thus, we chose to include these features in the new disgust model, while noting that the new model involves a different form of disgust (related to contaminated food) than the model presented earlier (related to social communication and sensory rejection).

Renders of the two new models for happiness and disgust are shown in Fig. 10. The study used the exact same procedures as Study 1, but with stimuli generated using the new expression models.

Fig. 10
figure 10

Renders of the two new expression pose models created for Study 2, applied to two different identity models. Each row contains one identity; the top row is a male identity and the bottom is a female identity. Each column contains a separate expression: open-mouth happiness and tongue-out disgust

Method

Participants

Thirty-three people participated in this study. The participants’ ages ranged from 18 to 43 years (M=23.5, SD=5.19), and their reported ethnicity distribution was as follows: 2 participants (6.06%) reported Asian, 4 reported Black or African American (12.12%), 8 (24.24%) reported White, 0 (0%) reported Native Hawaiian or Pacific Islander, 25 (75.75%) reported “Hispanic or Latino(a)”, and 3 (9.09%) reported “Other”. Other features of the sample were as described for Study 1.

Stimuli

Stimuli were generated as described for Study 1, but using the two new models for happiness and disgust displayed in Fig. 10.

Procedures

Procedures were as described for Study 1, with the exception that here every participant rated every stimulus (i.e., combination of identity model and the two new expression pose models).

Results and discussion

As in Study 1, the 31 participants who provided masculinity ratings rated the masculinity of male identities (M = 4.93, SD = 1.66) significantly higher than that of female identities (M = 3.05, SD = 1.57); t(30) = 7.61, p < .001, Cohen’s d = 2.79. As shown in Table 3, when participants were shown happy faces with open-mouth smiles, they rated them higher in happiness expression than in any other expression. This was confirmed by a repeated-measures ANOVA, F(5, 186) = 7.59, p < .001, \( {\eta}_p^2 \) = .1694, and by a planned contrast comparing mean happiness rating versus mean rating of all other emotions, F(1, 186) = 36.41, p < .001, \( {\eta}_p^2 \) = .1637. When participants were shown disgusted faces with an open mouth and the tongue sticking out, they rated them highest in anger expression, just as in the previous study. However, unlike Study 1, anger ratings were much lower and close to disgust ratings. The repeated-measures ANOVA showed a significant difference in ratings of different emotions, F(5, 180) = 21.60, p < .001, \( {\eta}_p^2 \) = .3750, and the planned contrast comparing mean disgust rating versus mean rating of all other emotions was also significant, F(1, 180) = 39.95, p < .001, \( {\eta}_p^2 \) = .1816. Thus, an improvement was observed in this study regarding ratings of the disgust expression model. However, we were unable to eliminate confusion between anger and disgust. As indicated earlier, this is a common finding in the literature using naturalistic stimuli (see General Discussion below).

Table 3 Emotion ratings provided by participants in Study 2

The mean unusualness ratings for our stimuli were on average lower than in the previous study (M = 3.36, SD = 1.33), but still significantly higher than the ratings assigned to the photographs in the Ma et al. (2015) study (M = 2.56); t(33) = 3.48, p < .01, Cohen’s d = 1.21. As shown in Table 2, both models showed lower unusualness than their Study 1 counterparts, although the happiness model showed a larger drop in rating.

Overall, from these results it does seem advisable to use the open-mouth models depicted in Fig. 10 rather than the originally developed models for happiness and disgust depicted in Fig. 3, although further improvements are likely possible toward a completely unambiguous model of disgust.

General discussion

Here, we presented FaReT (Face Research Toolkit), a free and open-source toolkit of 3D models and software to study face perception. FaReT allows face perception researchers to easily manipulate 3D models of faces in ways that are common in the literature (morphing, averaging, standardization, etc.), but with improved speed, control, and interpretability of such manipulations. FaReT is built on top of MakeHuman, taking advantage of this software’s intuitive shape and pose spaces, where transformations are natural and interpretable. Our toolkit also easily produces dynamic stimuli that are more natural and interpretable than those produced through concatenation of image morphs, but also more controlled and manipulable than natural videos. The toolkit includes applications to enable easy control of relevant face features and standardization of irrelevant features, where “relevant” and “irrelevant” can be defined by research goals. Finally, the toolbox allows easy implementation of adaptive psychophysical techniques in the study of face perception, which traditionally have been available only with very simple, low-level stimuli.

FaReT also includes a database of 27 face identity models and six expression pose models. We performed a validation study in which the models in our database were used to create dynamic facial stimuli that participants rated on a number of perceptual features. We found that raters typically assigned the highest scores to the emotion corresponding to the facial expression actually displayed by the artificial stimuli. The exception was our pose model for disgust, which was rated highest in anger. Throughout the face research literature, anger and disgust are often rated similarly or confused (Thorstenson et al., 2019), including in previous studies involving three-dimensional face models (see Roesch et al., 2011, Fig. 5). Studies using adaptation suggest that these emotions may be encoded through overlapping channels (Skinner and Benton, 2010), so the artificial expressions appear to have been portrayed well enough to capture that similarity. In the same vein, the expressions of fear and surprise also received similar ratings of emotionality.

Some of the functionality provided by FaReT is also included in the FACSHuman plugin (Gilbert et al., 2018; https://github.com/montybot/FACSHuman). For example, both provide the ability to create image sequences and videos through linear interpolation in model space. However, each package has its own strengths. In FACSHuman, expression modeling is encoded using the FACS system, which is familiar to many researchers in face perception. Some existing face image data sets are already FACS-rated (e.g., Lucey et al., 2010; Mavadati et al., 2013), which would allow for easy development of new valid models of emotional expression using FACSHuman.

On the other hand, FaReT was designed with the broader goal of helping face perception researchers produce controlled stimuli in psychophysical experiments, using tasks and designs already common in the current literature, as well as those that have previously been available only for research in low-level vision. Given that goal, FaReT allows manipulation of not only expression parameters but also identity parameters, and includes linear extrapolation in addition to interpolation, allowing researchers to create anti-models and caricature models. This, plus the ability to easily generate average models and standardize large sets of models, allows FaReT to perform all stimulus manipulations that researchers in face perception currently achieve using morphing software. Additionally, FaReT includes a plugin that enables interactive communication with MakeHuman from Python, which opens the possibility of designing adaptive psychophysics experiments using software such as PsychoPy. In combination with flexible adaptive procedures such as QUEST+ (Watson, 2017; https://github.com/hoechenberger/questplus/tree/master/questplus), researchers have access to fast estimation of parameters of any parametrically defined psychophysical function, such as psychometric curves, threshold vs. stimulus parameter functions (i.e., similar to contrast sensitivity functions), transducer functions, threshold versus external noise functions (both similar to those estimated using contrast in low-level vision), and parameters of observer models (see Lu and Dosher, 2013).

Despite these successes, one limitation of FaReT highlighted by our results is that the artificial stimuli were given ratings of unusualness that are higher than those previously provided for static photographs of human faces (Ma et al., 2015). At this point, it is not clear what aspects of our stimuli or procedure might have produced the difference, as the current study and that of Ma et al. (2015) differed in several ways besides the synthetic versus natural faces. However, the assumption that the result is mostly due to the synthetic nature of our stimuli seems reasonable, and in that case a clear goal for future releases of FaReT is enhancing the quality of the assets used to generate the faces (such as skin textures), deriving parameters for faces and expressions from infrared depth-based images, or altering the type of camera used to create renders (from MakeHuman’s hard-coded orthographic camera to a perspective camera).

A second limitation of the current version of FaReT is that its model database has a limited number of identities and expressions. As a result, the average identities and expressions are likely to be biased estimates of the true averages (Cook et al., 2011). Fortunately, this database can be expanded without limit by the scientific community. Indeed, the larger community of MakeHuman users has already provided an extensive and growing database of identity ( http://www.makehumancommunity.org/models.html) and expression (http://www.makehumancommunity.org/expressions.html) models, as well as other useful assets and plugins (see http://www.makehumancommunity.org/content/user_contributed_assets.html). In addition, the fact that models in MakeHuman are described in a multidimensional space of shape parameters offers the possibility of taking a much more useful approach to this problem: fully characterizing the distribution of realistic models of identity and expression in such space, rather than any individual model, and sampling novel identities and expressions from that distribution. This would allow not only a larger database, but actually the generation of an infinite number of novel realistic models. We expect this to be the next step of development in FaReT.

We hope that FaReT will help face perception researchers in solving some of the methodological problems that they currently face, as well as opening new avenues of research, all while maintaining an open-source status. We want to allow users to precisely modify 3D facial features through MakeHuman’s sensibly named parameters, without depriving them of the important morphing procedures that had originally arisen from image manipulation software. This level of quantitative control is imperative to successful psychometric research. Our goal is to provide vision scientists with malleable high-level visual stimuli while maintaining the benefits of control that low-level vision scientists enjoy. We hope that this will help the scientific community to answer bigger questions without paying the unnecessary costs (both financial and scientific) imposed by copyrighted software.