Py-Feat: Python Facial Expression Analysis Toolbox

Studying facial expressions is a notoriously difficult endeavor. Recent advances in the field of affective computing have yielded impressive progress in automatically detecting facial expressions from pictures and videos. However, much of this work has yet to be widely disseminated in social science domains such as psychology. Current state-of-the-art models require considerable domain expertise that is not traditionally incorporated into social science training programs. Furthermore, there is a notable absence of user-friendly and open-source software that provides a comprehensive set of tools and functions that support facial expression research. In this paper, we introduce Py-Feat, an open-source Python toolbox that provides support for detecting, preprocessing, analyzing, and visualizing facial expression data. Py-Feat makes it easy for domain experts to disseminate and benchmark computer vision models and also for end users to quickly process, analyze, and visualize face expression data. We hope this platform will facilitate increased use of facial expression data in human behavior research. Supplementary Information The online version contains supplementary material available at 10.1007/s42761-023-00191-4.


Introduction
Facial expressions can reveal insights into an individual's internal mental state and provide nonverbal channels to aid in interpersonal and cross-species communication 1,2. One of the main challenges in studying facial expressions has been arriving at a consensus on how best to represent and objectively measure expressions. The Facial Action Coding System (FACS) 3 is one of the most popular systems to reliably 4 quantify the intensity of groups of facial muscles referred to as action units (AUs). However, extracting facial expression information using FACS coding can be a laborious and time-intensive process. Becoming a certified FACS coder requires 100 hours of training, and manual labeling is slow (e.g., one minute of video can take an hour 5) and inherently contains cultural biases and errors 6,7. Facial electromyography (EMG) provides one method to objectively record from a finite number of facial muscles at a high temporal resolution 8,9, but it requires specialized recording equipment that restricts data collection to the laboratory and can visually obscure the face, making it less ideal for social contexts.
Automated methods using techniques from computer vision have emerged as a promising approach to extract representations of facial expressions from pictures, videos, and depth cameras both inside and outside the laboratory. Participants can be untethered from cumbersome wires and can naturally engage in tasks such as watching a movie or having a conversation 10-14. In addition to AUs, computer vision techniques have provided alternative embedding spaces to represent facial expressions such as facial landmarks 15 or lower dimensional latent representations 16. These tools have a number of applications relevant to psychology such as predicting the intensity of emotions 17-20 and other affective states such as pain 21,22, distinguishing between genuine and fake expressions 23, detecting signs of depression 24, inferring traits such as personality 25-27 or political orientations 28, and predicting the development of interpersonal relationships 12,14. Though facial expression research has seen rapid growth in affective computing facilitated by recent advances in machine learning, adoption in fields outside the domain of computer science such as psychology has been surprisingly slow.
In our view, there are at least two specific barriers contributing to the slow adoption of automated methods in social science fields such as psychology. First, there is a relatively high barrier to entry to training and accessing state of the art models capable of quantifying facial expressions. This requires knowledge of computer vision techniques, neural network architectures, and access to large labeled datasets and computational infrastructure that includes Graphics Processing Units (GPUs). Though there are impressive efforts to share high quality datasets 29-35, there are still difficulties sharing this data involving participants' privacy, complicated end user agreements, expensive handling fees, contacting data curators, and finding affordable and stable long-term hosting solutions. Though hundreds of models have been developed to characterize facial expressions, no standards have emerged for disseminating these models to end users. These models are typically reported in conference proceedings, occasionally shared on open code repositories such as GitHub, and require considerable domain knowledge as they have been developed using a multitude of computer languages, rarely have documentation, and occasionally have restrictive licensing. Each model may require the data to be preprocessed in a specific way or rely on additional features (e.g., landmarks, predefined regions of interest). Because there are currently no generally agreed upon standards for training and benchmarking beyond data competitions (e.g., WIDER, 300W, FERA), each model is typically trained on different datasets, which makes it difficult to benchmark the models using the same dataset to aid in the model selection process 17,36. Platforms such as paperswithcode.com are helping to standardize the dissemination and benchmarking of models, but sharing state of the art models has not yet become a norm in the field. Other domains such as natural language processing and reinforcement learning have begun to overcome this issue with a variety of high quality software platforms such as Stanza 37, SpaCy, OpenAI Gym 38, and HuggingFace.
Second, there is a notable lack of free open-source software to aid in detecting, preprocessing, analyzing, and visualizing facial expressions (Table 1). Commercial software options such as Affdex (Affectiva Inc) available through iMotions 39 and Noldus FaceReader 40 can be expensive, have limited functionality, and typically do not employ state of the art models 41-43 (see 17,20 for commercial software performance comparisons). Furthermore, due to strong interest from industry, there have been several free software packages such as the Computer Expression Recognition Toolbox 44, Intraface 15, and Affectiva API 45 (Affectiva Inc) that have turned into commercial products or been acquired by larger technology companies such as Apple Inc or Meta and rendered unavailable to researchers. Currently, OpenFace 46 is the most widely used open-source software that allows users to extract facial landmarks and action units from face images and videos. However, OpenFace does not provide a comprehensive suite of tools for preprocessing, analyzing, and visualizing data, which would make these tools more accessible to non-domain experts. As an example, in other fields such as neuroscience, the rapid growth of neuroimaging research has been facilitated by the widespread use of free tools such as FSL 47, AFNI 48, SPM 49, and NiLearn 50 that enable end users to preprocess, analyze, and visualize complex brain imaging data. We believe the broader emotion research community would greatly benefit from additional software platforms dedicated to facial expression analysis with functions for extracting, preprocessing, analyzing, and visualizing facial expression data.

[Table 1 not recovered; surviving header text: Facial feature detection, Preprocessing, Analysis, Free, Facial landmarks]

To meet this need, we have created the Python Facial Expression Analysis Toolbox (Py-Feat), a free, open-source package dedicated to supporting the analysis of facial expression data. It provides tools to extract facial features like OpenFace 46, but additionally provides modules for preprocessing, analyzing, and visualizing facial expression data (Figure 1). Py-Feat is designed to meet the needs of two distinct types of users. Py-Feat benefits computer vision researchers who can use our platform to disseminate their state of the art models to a broader audience and easily compare their models with others on the same benchmark metrics. It also benefits social science researchers looking for free and easy to use tools that can both detect and analyze facial expressions. In this paper, we outline the key components of the Py-Feat toolbox including the facial feature detection module and analysis tools, provide quantitative assessments of the performance of the detection models on benchmark data including the robustness of the models to real world data, and provide a tutorial of how the toolbox can be used to analyze an open facial expression dataset.

Py-Feat Design and Module Overview
Py-Feat is written in the Python programming language. We selected Python over other popular languages (e.g., Matlab, C) for several reasons. First, Python is open source, completely free to use, and runs on all major operating systems (e.g., Mac, Windows, Unix). This makes the software accessible to the largest number of users. Second, Python is among the easiest programming languages to read and learn and is increasingly being taught in introduction to programming classes. Though we do not currently provide a graphical user interface (GUI) to Py-Feat, we believe it is easy to use with minimal background in programming (see our example code below). Third, Python has emerged as one of the primary languages used across academia and industry for data science. There is a vibrant developer community that has already created a rich library of tightly integrated high quality scientific computing packages for working with arrays, such as numpy 52 and pandas 53; scientific numerical routines with scipy 54; machine learning algorithms with scikit-learn 55, tensorflow 56, and PyTorch 57; and plotting with matplotlib 58, seaborn 59, and plotly. This makes it easy for Py-Feat to incorporate new functionality as it becomes available in other toolboxes, but also for Py-Feat users to incorporate any Python package into other processing pipelines. Many of the core libraries are supported by big tech companies and are rapidly providing functionality to enable users to take advantage of newer innovations in hardware such as GPUs and distributed computing systems. In addition, Python libraries tend to have comprehensive documentation and testing, and there are many excellent tutorials online for learning how to use Python, which makes the language very accessible to beginners. For example, we have developed basic tutorials for learning to analyze data with Python on our DartBrains.org course 60 and more advanced tutorials on analyzing naturalistic neuroimaging data 61. We have built a jupyter-book 62 to accompany our toolbox with tutorials on how to perform analyses that can be easily augmented by the user community (https://py-feat.org/).
Py-Feat currently has two main modules for working with facial expression data. First, the Detector module makes it easy for users to detect facial expression features from image or video stimuli. We offer multiple models for extracting the primary face expression features that most end users will want to work with. This includes detecting faces in the stimuli and identifying the coordinates of the spatial location of a bounding box for each face. We also detect 68 facial landmarks, which are coordinates identifying the spatial location of the eyes, nose, mouth, and jaw. The bounding box and landmarks can be used in models to detect the head pose such as the face orientation in terms of rotation around axes in three-dimensional space. Py-Feat also detects higher level facial expression features such as AUs and basic emotion categories. We offer multiple models for each detector to keep the toolbox flexible for many use cases, but we also have picked sensible defaults for users who may be overwhelmed by the number of options. The features cover the majority of the ways in which facial expressions can be currently described by computer vision algorithms. Importantly, new features and models can be added to the toolbox as they become available in the field. The majority of the models in the toolbox are implemented in PyTorch 57, which means they can also utilize Nvidia GPUs if they are available, which can dramatically speed up performance.
In addition, Py-Feat includes the Fex data module to work with the features extracted from the Detector module. This module includes methods for preprocessing, analyzing, and visualizing facial expression data. We offer an easy to use application programming interface (API) for slicing, grouping, sampling, and summarizing data as well as selecting different types of data (i.e., faceboxes, landmarks, action units, emotions, face poses), preprocessing facial expression time series data, extracting additional features from time series data, analyzing aggregates of facial expression data, and visualizing intermediary preprocessing steps.

Py-Feat Performance
Computer vision models are highly complex and often employ completely different preprocessing steps and model architectures. All of the technical details about the architecture of each of the models and how they were trained can be found in the Supplementary Materials.
To provide users with an estimate of how well these models are likely to perform on their own datasets, we report benchmark performance on datasets that were never used in training the models. Importantly, we primarily used benchmark datasets that are the standard for each domain in data competitions and include highly variable naturalistic images collected in the wild when possible.

Face detection
One of the most basic steps in the facial feature detection process is to identify if there is a face in the image and where that face is located. Py-Feat includes three popular face detectors: Faceboxes 63, Multi-task Convolutional Neural Network (MTCNN) 64,65, and RetinaFace 66. These detectors are widely used in other open-source software 46 and are known to achieve fast and accurate face detection results even for partially occluded or non-frontal faces. Face detection results are reported as a rectangular bounding box of the face and include a confidence score for each detected face. We benchmarked the face detection models on the validation set of the WIDER FACE dataset, which is a standard dataset containing images in the wild retrieved from the internet 67, using average precision as described in the WIDER FACE technical paper 68. Overall, we found that the Py-Feat implementations of each of the models achieved acceptable levels of performance, although lower than what was reported in the original papers 66 (Table 3). This may be a consequence of using different hyperparameters. We also observed decreased performance as the classification task becomes increasingly difficult, which includes small, inverted, and highly occluded faces.
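Average precision for bounding boxes rests on deciding whether a predicted box matches a ground-truth box. A minimal sketch of that matching step is given below using intersection over union (IoU). The function names, the (x1, y1, x2, y2) box format, and the 0.5 threshold are illustrative assumptions and are not taken from the Py-Feat code base.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if boxes are disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, truth_box, threshold=0.5):
    """Count a detection as correct when IoU reaches the (assumed) threshold."""
    return iou(pred_box, truth_box) >= threshold
```

Sorting detections by confidence score and accumulating true and false positives with this rule yields the precision-recall curve from which average precision is computed.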

Landmark detection
After a face is identified in an image, it is common to identify the facial landmarks, which are coordinate points in the image space outlining the jaw, mouth, nose, eyes, and eyebrows of a face. The distance and angular relationships between the landmarks can be used to represent face expressions and to infer affective states such as pain 21. Py-Feat uses a standard 68-coordinate facial landmark scheme that is widely used across datasets and software 46,69,70 and currently includes three facial landmark detectors: the Practical Facial Landmark Detector (PFLD) 71, MobileNets 72, and MobileFaceNets 73. We benchmarked these models on the 300 Faces in the Wild (300W) dataset 70,74, which is a standard used in data competitions and contains in-the-wild face images that vary across luminance, scale, pose, expressions, and occlusion levels. We compute the average root mean squared error between the predicted and ground truth coordinates across the landmark points normalized by the interocular distance. Overall, we found that the Feat-MobileFaceNet performed the best on our benchmark.
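The normalized landmark error described above can be sketched as follows. This is an illustration under stated assumptions, not the benchmarking code used for the paper: the per-landmark error is taken as the Euclidean distance, the interocular distance is measured between the outer eye corners, and those corners are assumed to sit at 0-based indices 36 and 45 of the 68-point scheme.

```python
import math

# Assumed 0-based indices of the outer eye corners in the 68-point scheme
LEFT_OUTER_EYE, RIGHT_OUTER_EYE = 36, 45


def normalized_mean_error(pred, truth):
    """Mean per-landmark Euclidean error divided by the interocular distance.

    pred, truth: sequences of 68 (x, y) coordinate pairs for one face.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must contain the same number of points")
    interocular = math.dist(truth[LEFT_OUTER_EYE], truth[RIGHT_OUTER_EYE])
    mean_error = sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)
    return mean_error / interocular
```

Dividing by the interocular distance makes the error comparable across faces of different sizes in the image.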

Head pose detection
Another feature of a face expression beyond its location in an image or the location of specific parts of the face is the position of the head in three-dimensional space. Rotations from a head-on view can be described in terms of rotation around the x, y, and z axes and are referred to as pitch, roll, and yaw respectively. Py-Feat includes support for the Img2Pose model. This model does not rely on prior face detections, so it can also be used as a face bounding box detector. The constrained version of Img2Pose is fine-tuned on the 300W-LP dataset, which only includes head poses in the range -90° to +90°. We benchmarked our head pose models using the BIWI Kinect dataset, which contains videos of participants rotating their heads according to specific pose instructions 75. We computed the Mean Absolute Error in degrees for pitch, roll, and yaw. Overall, we found that the constrained version of Img2Pose achieved slightly better performance than the unconstrained version on our benchmark.
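As an illustration of the head pose metric, the sketch below computes the mean absolute error in degrees separately for pitch, roll, and yaw. The function names and the tuple layout are hypothetical, and wrapping differences around 360° is an added assumption so that, for example, 179° and -179° count as 2° apart.

```python
def angular_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def pose_mae(pred, truth):
    """Mean absolute error per rotation axis.

    pred, truth: equal-length lists of (pitch, roll, yaw) tuples in degrees.
    """
    names = ("pitch", "roll", "yaw")
    n = len(truth)
    return {
        name: sum(angular_difference(p[i], t[i]) for p, t in zip(pred, truth)) / n
        for i, name in enumerate(names)
    }
```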

Action unit detection
In addition to the basic properties of a face in an image, Py-Feat also includes models for detecting deviations of specific facial muscles (i.e., action units; AUs) from a neutral face expression using the FACS coding system. Py-Feat currently contains two models for detecting action units. The architecture of the models is based on the highly robust and well-performing model used in OpenFace 46, which extracts Histogram of Oriented Gradient (HOG) features from within the landmark coordinates using a convex hull algorithm, compresses the HOG representation using Principal Components Analysis (PCA), and finally uses these features to individually predict each of the 12 AUs using popular shallow learning methods based on kernels (i.e., linear Support Vector Machine; SVM 76) and ensemble learning (i.e., optimized gradient boosting; XGB 77) (see supplemental materials for training details). We compare the performance of our models to OpenFace and also FACET, which was previously available in iMotions before the company was acquired by Apple Inc. We benchmarked the AU detection models using the Extended DISFA Plus dataset 33, which contains short videos of participants making posed facial expressions based on imitating a target image and also spontaneous facial expressions elicited from viewing experimental stimuli. We used F1 scores, an accuracy metric for binary classification, to quantify the performance of twelve different AUs. We found that the previously available FACET-iMotions achieved the best overall accuracy and was the best detector for AUs

Emotion detection
Finally, Py-Feat also includes models for detecting the presence of specific emotion categories based on third party judgments. Emotion detectors are trained on manually posed or naturalistically elicited emotional facial expressions, which allows detectors to classify new images based on how much a face resembles a canonical emotional facial expression. It is important to note that there is currently no consensus in the field on whether categorical representations of emotion are the most reliable and valid nosology of emotional facial expressions 78,79. For example, detecting a smiling face as happy does not necessarily imply that the individual is experiencing an internal subjective state of happiness 80, as these types of latent state inferences require additional contextual information beyond a static image 81. However, labeling specific configurations of AUs with the semantic concepts of emotions can still be useful in emotion research to characterize the contexts in which people tend to display these facial expressions or how the display of certain emotion expressions accompanies changes in learning 82 and social behaviors 14. Py-Feat includes two emotion detectors capable of detecting seven categories of emotions: anger, disgust, fear, happiness, sadness, surprise, and neutral. The Residual Masking Network (ResMaskNet) 83 is an end-to-end convolutional neural network model that combines deep residual networks with masking blocks. The masking blocks help focus the model's attention on local regions of interest to refine its feature map for more fine-grained predictions, and the residual structure helps to maintain performance in deeper layers. We also provide a statistical learning model that uses a Linear SVM 76 with a similar procedure as our AU models. We benchmarked our models using F1 scores on a random subset of 500 images from the AffectNet dataset 84, which contains unposed expressions of emotions as they naturally occur in the wild outside of a carefully curated laboratory environment. We found that the Residual Masking Network model 83

Robustness Experiments
While computer vision researchers typically focus on developing new face expression models that can outperform previous work on standard benchmarking datasets, end users are often more interested in how well the models perform in real world data collection contexts. This type of data is typically messier than the carefully curated open datasets. We intentionally selected benchmark datasets that contain spontaneous or naturalistic images collected outside the laboratory in the wild. In addition to these benchmarks, we also evaluated the robustness of the models included in Py-Feat to different types of real world scenarios that are known to create problems for computer vision models, including variations in luminance, occlusions of specific regions of the face, and head rotation.

Luminance
To test the robustness of our models to different lighting conditions, we modified our benchmark datasets to include two different levels of luminance: low, where the brightness factor was uniformly sampled from [0.1, 0.8] for each image, and high, where the brightness factor was uniformly sampled from [1.2, 1.9] for each image. This can be useful for knowing how the models might be impacted by inconsistent lighting or smaller variations in skin pigmentation. Overall, we found that the majority of the deep learning detectors were fairly robust to variations in luminance. However, the shallow learning detectors that rely on HOG features were more dramatically impacted by both high and low levels of luminance (Figure 2).
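The luminance manipulation can be sketched as below. The sampling ranges come from the text; everything else is an assumption made for illustration: the brightness factor is taken to be a multiplicative scaling of pixel values clipped to [0, 255], images are represented as nested lists of grayscale values, and the function names are hypothetical.

```python
import random

# Sampling ranges for the brightness factor, as reported in the text
LOW_RANGE = (0.1, 0.8)
HIGH_RANGE = (1.2, 1.9)


def adjust_brightness(image, factor):
    """Scale every pixel by `factor` and clip to the valid 0-255 range."""
    return [[min(255, max(0, round(px * factor))) for px in row] for row in image]


def perturb_luminance(image, level, rng=random):
    """Draw one brightness factor for the image and apply it.

    level: "low" or "high". Returns the perturbed image and the sampled factor.
    """
    if level not in ("low", "high"):
        raise ValueError("level must be 'low' or 'high'")
    low, high = LOW_RANGE if level == "low" else HIGH_RANGE
    factor = rng.uniform(low, high)
    return adjust_brightness(image, factor), factor
```

Passing a seeded `random.Random` instance as `rng` makes the perturbation reproducible across benchmark runs.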

Occlusion
In addition, we evaluated the performance of all of the detectors in three different occlusion contexts. Occlusions of the face are very common in real world data collection scenarios where a participant may cover their face with a hand or be partially hidden behind some other physical object. We separately masked out the eyes, nose, and mouth on the benchmark datasets described above by applying a black mask to regions of the face using the facial landmark information (Figure 2A). The pose and landmark models were fairly robust to facial occlusions. However, face detection performance dropped substantially with occlusions, particularly when the nose was masked. Occlusion of specific facial structures can also provide an interesting lesion test for higher level facial feature extraction such as action units and emotions. Consistent with our expectations, the AU detector performance dropped for AUs 1, 2, 4, 5, 6, and 9 when the eyes were masked, while performance dropped for AUs 12, 15, 20, 25, and 26 when the mouth was masked. AU 9 and 20 detection performance dropped when the nose was blocked. The emotion models were even more dramatically affected by occlusion of specific facial structures. Anger, fear, sadness, and surprise detection was substantially impacted by occlusion of the eyes, while disgust, happiness, and neutral detection dropped when the mouth was blocked, and anger, disgust, fear, and sadness detection was degraded by occlusion of the nose.
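A landmark-based occlusion of this kind can be sketched as follows. The index ranges for each region follow the common 68-point layout and are an assumption here, as is the choice to black out the rectangular bounding box of the region's landmarks; the shape of the mask used for the benchmarks may differ. The function name and image representation (nested lists of pixel values) are hypothetical.

```python
# Assumed 0-based landmark index ranges in the 68-point scheme
REGION_LANDMARKS = {
    "eyes": range(36, 48),
    "nose": range(27, 36),
    "mouth": range(48, 68),
}


def occlude_region(image, landmarks, region, pad=0):
    """Return a copy of `image` with a black box drawn over one facial region.

    image: list of rows of pixel values; landmarks: 68 (x, y) pairs.
    """
    points = [landmarks[i] for i in REGION_LANDMARKS[region]]
    height, width = len(image), len(image[0])
    # Bounding box of the region's landmarks, clamped to the image borders
    x_min = max(0, int(min(x for x, _ in points)) - pad)
    x_max = min(width - 1, int(max(x for x, _ in points)) + pad)
    y_min = max(0, int(min(y for _, y in points)) - pad)
    y_max = min(height - 1, int(max(y for _, y in points)) + pad)
    masked = [row[:] for row in image]  # leave the input image untouched
    for y in range(y_min, y_max + 1):
        for x in range(x_min, x_max + 1):
            masked[y][x] = 0
    return masked
```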

Robustness against Head Rotation
Most action unit models are trained using images in which the participants directly face the camera. However, in real world situations, faces are likely to be rotated relative to the camera position. Prior work has evaluated the performance of different AU detection algorithms on a new dataset in which participants (N=12) were instructed to imitate specific facial expressions while a camera recorded their expressions at specific rotation angles of 0°, 15°, 30°, and 45° 85. Action units for each image were manually annotated by a trained FACS coder. We tested our Py-Feat XGB AU detection model using this dataset and found that AU detection performance tends to decrease as rotation angles increase. However, the XGB model is fairly robust to rotation for most of the AUs except for AUs 9, 12, 17, and 26, where performance drops substantially for the largest 45° rotation (Figure 2G).

Visualization
We provide several plotting tools to help visualize the Fex detection results in each stage of the analysis pipeline. In the facial feature detection stage, we offer the plot_detections function that overlays the face, facial landmarks, action units, and emotion detection results in a single figure (Figure 1). This function can be used to validate the detection results at each video frame or image. The Fex class also allows users to plot time series graphs, which can be useful for examining how detected action unit activities vary over time or whether there are segments of missing data.
In addition, we provide a model that can be used to visualize how combinations of activated AUs will look on a stylized anonymous face (Figure 3). This model visualizes the intensity of AUs overlaid onto a face in the approximate locations of the facial muscles and also shows how AUs deform the face. Using this model, users can visualize the action units and their accompanying 2D landmark deformation on a standard face from any combination of action unit activations identified from their analyses (see supplemental materials for training details) 22,86. We hope to incorporate other types of visualization models as they become available.

Example Py-Feat Analysis Walkthrough
Py-Feat easily facilitates numerous complex analyses. As a demonstration, we used a subset of the open video dataset from Watson and colleagues 87 in which participants were filmed while speaking in two conditions: delivering good news statements (e.g., "your application has been accepted") or bad news statements (e.g., "your application was denied"). A more comprehensive walkthrough using these data is included in the Py-Feat full analysis tutorial. Facial features can be extracted in Py-Feat with relative ease using an intuitive API. After importing the Detector class, only two lines of code are required: one to initialize a Detector and another to process a video:

    from feat import Detector

    detector = Detector()  # initialize default detectors
    fex = detector.detect_video('video.mp4')  # process each video frame

The fex object is a dataframe organized as frames by features, and contains all detections for every frame of the video including faceboxes, landmarks, poses, action units, and emotions. Each fex object makes use of a special .sessions property that facilitates easy data aggregation and comparison. For example, we can compare the means of each condition of the data by setting sessions to the condition labels with .update_sessions(), followed by .extract_summary() to compute summary statistics aggregated by condition (Fig 4A):

    # dictionary mapping video name to the condition it belonged to
    by_condition = fex.update_sessions({'001': 'good_news', '002': 'bad_news', ...})
    # plot condition mean per action unit
    by_condition.extract_mean().aus.plot(kind="bar")

Py-Feat also makes it easy to perform time series analyses using the .isc() method. For example, we can estimate the similarity between videos in terms of how their detected happiness varies over time (Fig 4B):

    # calculate the pairwise similarity between videos in terms of their detected happiness
    intervideo_similarity = fex.isc(col="happiness", method='pearson')

    # visualize the video x video correlation matrix
    from seaborn import heatmap
    heatmap(intervideo_similarity)
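To make the pairwise similarity computation concrete, the standalone sketch below correlates the happiness time series of every pair of videos with Pearson's r. It does not use Py-Feat and is not the toolbox's implementation; the function names and the dictionary input format are hypothetical, and all series are assumed to have the same number of frames.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


def pairwise_similarity(series_by_video):
    """Video-by-video correlation matrix stored as a {(a, b): r} dictionary."""
    names = sorted(series_by_video)
    return {
        (a, b): pearson(series_by_video[a], series_by_video[b])
        for a in names
        for b in names
    }
```

The resulting dictionary holds the same information as a video x video correlation matrix, with the diagonal equal to 1.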
Py-Feat makes it simple to perform formal comparisons using the .regress() method. This method performs a "mass-univariate" style analysis 88 across all specified features. For example, we can use the experiment condition labels ("good" or "bad" news) as contrast codes and AUs as outcomes to perform a t-test on every AU. This returns the associated regression beta-values, standard-errors, t-statistics, p-values, degrees-of-freedom, and residuals for each AU:

    # set up mean difference contrast of good news > bad news
    by_condition_codes = fex.update_sessions({"goodNews": 1, "badNews": -1})
    # compare condition differences at every AU
    b, se, t, p, df, residuals = by_condition_codes.regress(X="sessions", y="aus", fit_intercept=True)

Py-Feat can just as easily facilitate a decoding analysis like the classification analysis performed by Watson and colleagues 87 using the .predict() method (Fig 4C).

These simple examples are only a fraction of the analyses that are possible using Py-Feat, but provide an example of how the toolbox makes it possible to conduct complex analyses with minimal Python code.
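The idea behind a mass-univariate analysis is to fit the same simple model independently to every feature. The sketch below illustrates this with an ordinary least squares regression of each AU on a single contrast code plus an intercept. It is a standalone illustration under these assumptions and not the toolbox's .regress() implementation; the function names are hypothetical, and p-values are omitted because they require the t-distribution, which is not in the Python standard library.

```python
import math


def simple_regression(x, y):
    """OLS fit of y = intercept + beta * x; returns beta, se, t, and df."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - beta * mean_x
    residuals = [yi - (intercept + beta * xi) for xi, yi in zip(x, y)]
    df = n - 2  # two estimated parameters: intercept and slope
    se = math.sqrt(sum(r * r for r in residuals) / df / sxx)
    t = beta / se if se > 0 else math.inf
    return {"beta": beta, "se": se, "t": t, "df": df}


def mass_univariate(codes, features):
    """Apply the same regression separately to every feature (e.g., every AU)."""
    return {name: simple_regression(codes, values) for name, values in features.items()}
```

With contrast codes of +1 and -1 and equal group sizes, each beta equals half the difference between the two condition means, and the t-statistic is equivalent to an independent-samples t-test with pooled variance.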

Discussion
In this paper, we describe the motivation, design principles, and core functionality of the open-source Python package Py-Feat. This package aims to bridge the gap between model developers creating new algorithms for detecting faces, facial landmarks, action units, and emotions and end users hoping to use these cutting edge models in their research. To achieve this, we designed an easy to use and open-source Python toolbox that allows researchers to quickly detect facial expressions from face images and videos and subsequently preprocess, analyze, and visualize the results. We hope this project will make facial expression analysis more accessible to researchers who may not have sufficient domain knowledge to implement these techniques themselves. In addition, Py-Feat provides a platform for model developers to disseminate their models to end-user researchers and compare the performance of their model with others included in the toolbox.
Automated detection of facial expressions has the potential to complement other techniques such as psychophysiology, brain imaging, and self-report 14,22,89 along with 3-D simulations 90 in improving our understanding of how emotions interact with perception, cognition, and social interactions and are impacted by our physical and mental health. Studying facial expressions is becoming increasingly accessible to non-specialists. For example, recording participants has become more convenient with a number of affordable recording options such as webcams that can be used to record remote participants, open-source head mounted cameras allowing reliable face recordings in social settings 13, and 360 cameras that can be used to record multiple individuals simultaneously. The primary goal of Py-Feat is to make the preprocessing, analysis, and visualization of these results similarly accessible and free of charge to non-specialists. Open source software focused on the full analysis pipeline has been instrumental in contributing to the rapid progress of research in other domains such as neuroimaging with FSL 47, AFNI 48, SPM 49, and NiLearn 50 and natural language processing with Stanza 37, SpaCy, and HuggingFace. We believe the broader emotion research community would greatly benefit from additional software platforms dedicated to facial expression analysis with functions for extracting, preprocessing, analyzing, and visualizing facial expression data.
Our toolbox is designed to be flexible and dynamic and includes models that are performing near state of the art. However, there are several limitations that are important to note. First, our current implementations of some of the models are not performing as well as the original versions. This could be attributed to nuances in hyperparameter optimization, variations in random seeds, and variations in the benchmarking datasets. We anticipate that these models will improve over time as more datasets become available and also plan to continually incorporate new models as they become available. Benchmarking of new models will be added to a living document on our project website to allow users to make informed choices in selecting models. Second, we have not yet attempted to optimize our toolbox for speed. For example, we did not benchmark our models on processing time because we believe most users will be applying these detectors on batches of pre-recorded videos rather than in real-time applications. Currently, our models are able to process a single image in about 400 milliseconds with a GPU and about 1.5 seconds on a CPU. For users who need faster processing times on videos, processing can be sped up by temporally downsampling and skipping frames. We hope to optimize our code and improve processing time in future versions of our toolbox. Third, our models likely contain some degree of bias with respect to gender and race. We have attempted to use as much high quality publicly available data as possible to train our models and selected challenging real world datasets for benchmarking when available. This problem is inherent to the field of affective computing and will only improve as datasets increase in diversity and representation and preprocessing pipelines improve (e.g., faces with darker pigmentation are often more difficult to detect) 91,92 . Fourth, our toolbox currently only includes detection of core facial features (i.e., facial landmarks, action units, and emotions) but there are additional signals in the face that can be informative for social science researchers. Head pose can be used to detect nodding or a shaking of the head which can be signals of consent or dissent in social interactions. Gaze extracted from face videos can be used to infer the attention of the recorded individual. Heart rate and respiration can also be extracted from face videos 93 which can be used to infer arousal or stress levels of the recorded individual. Models for detecting these facial features could be implemented in future versions of Py-Feat pending community interest.
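As a rough illustration of the frame-skipping strategy mentioned above, the following sketch selects every k-th frame of a video and estimates the resulting processing time. The function names are hypothetical and not part of Py-Feat, and the 400 ms per-frame cost is simply the approximate GPU figure quoted above.

```python
def sample_frame_indices(n_frames, skip):
    """Return the indices of the frames kept when processing every `skip`-th frame."""
    if skip < 1:
        raise ValueError("skip must be >= 1")
    return list(range(0, n_frames, skip))


def estimated_ms(n_frames, skip, ms_per_frame):
    """Estimate total processing time in milliseconds for the kept frames."""
    return len(sample_frame_indices(n_frames, skip)) * ms_per_frame


# Hypothetical one-minute video recorded at 30 frames per second.
n_frames = 60 * 30
all_frames_ms = estimated_ms(n_frames, skip=1, ms_per_frame=400)
every_fifth_ms = estimated_ms(n_frames, skip=5, ms_per_frame=400)
print(all_frames_ms / 1000, "seconds when processing every frame")
print(every_fifth_ms / 1000, "seconds when processing every fifth frame")
```

Under these assumptions, keeping one frame in five reduces the estimated processing time by the same factor, at the cost of temporal resolution.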
The modular architecture of the Py-Feat toolbox should theoretically be able to flexibly accommodate future developments in facial expression research. For example, adding improved models for our existing detection suite should be relatively straightforward assuming the models are trained using PyTorch. New functionality can easily be added to the detector class in the form of a new method. Finally, new types of data can be accommodated by adding a new data loader class and data type specific models. For example, as 3D faces using depth cameras or thermal cameras become more ubiquitous accompanying rapid developments in virtual and augmented reality research, researchers can train new models to detect facial expression features, which can be incorporated into the toolbox without impacting extant functionality. We also hope that the research community will contribute new tutorials to our documentation to accelerate the pace of discovery in the field.
In summary, we introduce Py-Feat, an open source full stack framework implemented in Python for performing facial expression analysis, spanning detection, preprocessing, analysis, and visualization. This work leverages efforts from the broader affective computing community by relying on high quality datasets, state of the art models, and building on other open source efforts such as OpenFace. We hope others in the community may be interested in improving this toolbox by providing feedback and bug reports, and also contributing bug fixes, new models and features. We have outlined our contribution guidelines as well as the necessary code and tutorials on how to replicate our work on our main project website (https://py-feat.org). We look forward to the increasing synergy between the fields of computer science and social science and welcome feedback and suggestions from the broader community as we continue to refine and add features to the Py-Feat platform.

Pre-trained Facial Detectors
The Detector module offers several pre-trained models for detecting each of the following facial features: (a) finding a face in an image or video frame ("face model"), (b) locating facial landmarks ("landmark model"), (c) detecting activations of facial muscle action units ("AU model"), and (d) detecting displays of canonical emotional expressions ("emotion model"). These models are designed to be modular so users can decide which algorithms to use for each detection task based on their needs for accuracy and speed. In general, we included models with high reported accuracy, written in Python, easy to install (e.g., Pytorch 57 for neural network models and scikit-learn 94 for statistical models), and open to use for academic research. We have trained several models specifically for Py-Feat and describe the training procedures in detail here.

AU Detection
Py-Feat includes two AU detectors which were based on the robust model included in OpenFace outlined in Baltrusaitis et al. (2015) 95 . Following face and landmark detection, we used Histogram of Oriented Gradients (HOGs) as features in predicting action unit activations. HOGs are feature descriptors that describe an image as a distribution of orientations such as edges and corners measured across the image and have proven effective in identifying people in images as well as action units 95,96 . We first preprocessed each image by aligning the detected faces using the interocular distance to a neutral facial expression. We then detected the facial landmarks for the aligned faces and applied a convex hull to mask out the background irrelevant to the face. To include facial features of the forehead, a convex hull was applied with the eyebrows shifted upwards 1.5 times the distance between the eyebrows and the upper eye landmarks. We extracted HOGs using the scikit-image implementation 97 with 8 orientations, 8x8 pixels per cell, and 2x2 cells per block which led to a total of 5,408 HOG features. We then applied a principal component analysis (PCA) to retain 95% of the variance, which compressed the dimensionality of these features down to 1,195 while also removing noise. The PCA-reduced HOG features were then used to predict individual action units using two statistical learning algorithms, specifically a linear Support Vector Machine classifier 76 implemented in scikit-learn 55 (Feat-SVM) and an XGBoost classifier (Feat-XGB) 77 . Both models were trained using multiple publicly available datasets including BP4D 32 , BP4D+, DISFA 31 , CK+ 30 , Shoulder Pain 98 and 99-101 .
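The HOG feature count reported above can be checked with a short calculation. This is a sketch under two assumptions that are not stated in this section: that the aligned face image is 112x112 pixels, and that blocks slide one cell at a time, which is how the scikit-image HOG implementation lays out its blocks. The helper name is hypothetical.

```python
def hog_feature_count(image_size, orientations=8, pixels_per_cell=8, cells_per_block=2):
    """Count HOG features for a square image, assuming a block stride of one cell."""
    cells_per_side = image_size // pixels_per_cell
    blocks_per_side = cells_per_side - cells_per_block + 1
    features_per_block = cells_per_block * cells_per_block * orientations
    return blocks_per_side * blocks_per_side * features_per_block


# Assumed 112x112 aligned face: 14x14 cells, 13x13 blocks, 32 features per block.
print(hog_feature_count(112))  # 5408
```

With these settings the count matches the 5,408 features reported in the text, which PCA then compresses to 1,195 components.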
Aggregating across these datasets enabled us to make predictions about a larger number of AUs (20 in total) and expose our model to both controlled and in-the-wild data. Hyperparameters were tuned with a grid search during training using 3-fold cross validation. Model performance was evaluated using F1 scores, an accuracy metric for binary classification, defined as:
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$ (eq1)
where precision is the number of true positives divided by the total number of positive results:
$$\text{precision} = \frac{TP}{TP + FP}$$ (eq2)

Emotion detectors
Emotion detectors are trained on manually posed or naturalistically elicited emotional facial expressions which allows detectors to classify new images based on how much a face resembles a canonical emotional facial expression. Py-Feat also includes two emotion detectors. The Residual Masking Network (ResMaskNet) 83 is an end-to-end convolutional neural network model that combines deep residual networks with masking blocks. The masking blocks help focus the model's attention on local regions of interest to refine its feature map for more fine-grained predictions and the residual structure helps to maintain performance in deeper layers. ResMaskNet achieved state of the art performance on the facial expression recognition (FER) 2013 102 dataset at the time of preparing this article. Despite its accuracy, ResMaskNet has a large memory footprint (500MB) due to the depth of the architecture. We also trained an emotion detector model using an identical pipeline as our statistical learning AU models. This includes performing face alignment, applying a convex hull, and extracting HOG features, which are compressed using a PCA model that retains 95% of the variance. These features are used to classify the presence of each emotion category using a linear SVM implemented in scikit-learn 55 . The model was trained using the ExpW 103 , CK+ 30 and JAFFE 104 facial expressions datasets with a 3-fold cross validation for identifying the best hyperparameters. Similar to AU detectors, we evaluate model performance with F1 scores for each emotion category.

AU Visualization Model
Py-Feat includes a model to visualize facial expression results on an anonymized and stylized face. Using this model, users can visualize the action units and their accompanying 2D landmark deformation on a standard face from any combination of action unit activations identified from their analyses. This can be useful for visualizing aspects of a model in an intuitive manner similar to how brain imaging software overlays statistical maps on a canonical brain 22,86 . We trained this action unit to landmark model on 20 action units (AUs 1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 18, 20, 23, 24, 25, 26, 28, 43) with a subset of images from the EmotioNet 105 , BP4D 32 , and Extended DISFA Plus 33 datasets to balance the representation of each AU. We chose these datasets because they all include ground truth action unit labels. We used our toolbox with the Feat-RetinaFace face detector and MobileNets landmark detector to detect the landmarks on these images. We aligned these landmarks to a neutral face with an affine transformation using the facial landmarks and fit a Partial Least Squares Regression model with 20 components to predict these aligned landmarks from the ground truth action unit labels provided by the datasets using 3-fold cross-validation. Code to reproduce training and testing our visualization model is available in the Py-Feat Training Visualization Model Tutorial.
Overall, the PLS model achieved a cross-validated r² of 0.155 in predicting landmark coordinate positions on 10,000 sample images. We used our model to illustrate how visualizations can be created in two ways. First, we visualize emotions by detecting happy, sad, surprise, and anger expressions from single images in the CK+ 30 dataset using the Residual Masking Network implemented in Py-Feat and then passing the AU vectors detected by the Feat-XGB AU classifier to our visualization model (Figure 3A). This is all handled seamlessly by the detector.plot_detections() method. In principle, Py-Feat's visualization model can generate a face from any 20 element array of numerical values between 0 and 1. This enables Py-Feat's second mode of visualization handled by the plot_face() and animate_face() functions, which can activate one or more AUs and their underlying muscles, e.g., AU1 (inner brow raiser), AU12 (lip corner puller), etc. (Figure 3B).
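To make the "20 element array" concrete, the sketch below builds such a vector from a mapping of AU numbers to activations. It assumes that the element order follows the AU list given in the training description above; that ordering and the helper name are assumptions for illustration, so the toolbox documentation should be checked before passing such a vector to plot_face() or animate_face().

```python
# Assumed element order: the 20 AUs listed for the visualization model.
AU_ORDER = [1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 18, 20, 23, 24, 25, 26, 28, 43]


def make_au_vector(activations):
    """Build a 20 element list of values in [0, 1] from {AU number: activation}."""
    vector = [0.0] * len(AU_ORDER)
    for au, value in activations.items():
        if au not in AU_ORDER:
            raise KeyError(f"AU{au} is not one of the 20 modeled action units")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"activation for AU{au} must lie between 0 and 1")
        vector[AU_ORDER.index(au)] = float(value)
    return vector


# Hypothetical smile: cheek raiser (AU6) partly active, lip corner puller (AU12) fully active.
smile = make_au_vector({6: 0.5, 12: 1.0})
print(len(smile), smile)
```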
DISFA 108 contains 27 participants (15 male and 12 female, 18-50 years old, 1 Asian, 1 African American, 2 Hispanic and 21 Euro-American) that watched a 4-minute video clip designed to elicit a certain emotional expression. For each participant, 4,845 video frames were captured and manually annotated by a single expert FACS coder. AU intensity is rated on a six-point ordinal scale from 0 to 5. We binarized AUs using a threshold of 2. We used annotations for AUs 1, 2, 4, 5, 6, 9, 12, 17, 20, 25, and 26.
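The binarization step can be sketched as follows. The text does not say whether an intensity equal to the threshold counts as active; this example assumes that intensities greater than or equal to 2 are labeled as present, and the function name is hypothetical.

```python
def binarize_au(intensities, threshold=2):
    """Map ordinal AU intensities (0 to 5) to binary labels.

    Assumption: an intensity equal to the threshold counts as active.
    """
    return [1 if value >= threshold else 0 for value in intensities]


frame_intensities = [0, 1, 2, 3, 5]
print(binarize_au(frame_intensities))  # [0, 0, 1, 1, 1]
```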
DISFA+ 109 is an extended dataset from the original DISFA dataset. It includes 9 participants (4 males and 5 females, 18-50 years old, 1 Asian, 1 African American and 7 Euro-American). DISFA+ contains both posed and spontaneous facial expressions. Participants first watched a 3-minute video clip intended to elicit a certain emotional feeling. In a following experiment, each participant was asked to imitate 30 facial action units, either single AUs or combinations of AUs, and 12 facial expressions corresponding to emotions. A trained FACS coder annotated AU intensities (on an ordinal scale from 0 to 5) for a total of over 57,000 frames. We used annotations for AUs 1, 2, 4, 5, 6, 9, 12, 17, 20, 25, and 26.
JAFFE 104 contains images of ten female Japanese college students. Each participant posed 3 or 4 examples for each of the 6 basic emotion facial expressions plus a neutral face. JAFFE is a relatively small dataset with a total of 219 images.
EmotioNet 101 contains approximately one million images of facial expressions with facial action unit labels. The images were downloaded from the Internet. Of these, 100,000 images were annotated by trained FACS coders and 900,000 were automatically annotated. The dataset contains faces of different ages, genders, ethnicities and emotional expressions. We used AU annotations for AUs 1, 2, 4, 5, 6, 9, 12, 17, 20, 25, 26, and 43.
AffectNet 84 contains 440,000 images collected in the wild downloaded from the Internet with various gender and ethnicity information. Images are manually annotated with eight different emotion categories: neutral, surprise, happy, fear, sad, disgust, contempt and anger.
Shoulder Pain 111 contains 200 face videos from 25 different patients suffering from shoulder pain (total 48,398 frames). Participants were asked to perform a series of either active or passive range-of-motion tests. AUs 4, 6, 7, 9, 10, 12, 20, 25, 26, 27, and 43 were rated on a 5-level intensity scale by 3 independent certified FACS coders, and a fourth FACS coder reviewed the coding.
WIDER FACE 68 contains images collected in the wild retrieved from search engines (e.g., Google or Bing). The bounding boxes for each face were manually annotated with a total of 32,203 images with 393,703 labeled faces. This dataset is a standard for benchmarking face detection algorithms in data competitions and includes small, occluded, and upside down faces.
300W 112 contains both indoor and in-the-wild facial images retrieved from Google searches. Facial landmarks for each image were semi-automatically annotated by the AOM algorithm 113,114 . The 300W dataset covers a wide variation in luminance, pose, identity, expression, occlusion and face size.

Robustness Tests
In addition to assessing the performance of our detector models on standard benchmark datasets, we also evaluated the robustness of the detector models included in Py-Feat to different types of real world scenarios that are known to create problems for computer vision models, including variations in luminance, occlusions of specific regions of the face, and head rotation. A brief summary of these results is available in Figure 2 for the default models in Py-Feat. We have also included tables that report the results of our robustness experiments for all detector models included in the toolbox. Table S1 includes results for all face detection models. Table S2 includes results for the landmark detector models. Table S3 includes results for pose estimation models. Table S4 includes results for action unit detectors. Table S5 includes results for the emotion category models. Finally, we include the performance of our action unit detector models in comparison to OpenFace on the Namba head rotation dataset 85 .
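The luminance and occlusion manipulations can be illustrated with a minimal sketch that operates on a grayscale image stored as a nested list of pixel values from 0 to 255. This is only an illustration of the kind of perturbation described; the function names, scaling factors, and mask coordinates are hypothetical and do not reproduce the procedure used for the reported experiments.

```python
def adjust_luminance(image, factor):
    """Scale every pixel by `factor`, clipping the result to the 0-255 range."""
    return [[min(255, max(0, round(pixel * factor))) for pixel in row] for row in image]


def mask_region(image, top, left, height, width, fill=0):
    """Return a copy of the image with a rectangular region replaced by `fill`."""
    masked = [row[:] for row in image]
    for r in range(top, min(top + height, len(masked))):
        for c in range(left, min(left + width, len(masked[r]))):
            masked[r][c] = fill
    return masked


# Tiny hypothetical 4x4 "image" with uniform gray pixels.
image = [[100] * 4 for _ in range(4)]
darker = adjust_luminance(image, 0.5)    # lower luminance
brighter = adjust_luminance(image, 1.5)  # higher luminance
occluded = mask_region(image, top=1, left=1, height=2, width=2)  # e.g. covering the nose
```

Each perturbed image would then be passed through the same detector as the original so that the change in the benchmark metric can be compared.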

Figure 1. Facial expressions analysis pipeline. Analysis of facial expressions begins with recording face photos or videos using a recording device such as webcams, camcorders, head mounted cameras, or 360 cameras. After capturing the face, researchers can use Py-Feat to detect facial features such as the location of the face within a rectangular bounding box, the location of key facial landmarks, action units, and emotions, and check the detection results with image overlays and bar graphs. The detection results can be preprocessed by extracting additional features such as Histogram of Oriented Gradients (HOG) or multi-wavelet decomposition. Resulting data can then be analyzed within the toolbox using statistical methods such as t-tests, regressions, and intersubject correlations. Visualization functions can generate face images from models of action unit activations to show vector fields depicting landmark movements and heatmaps of facial muscle activations.

Figure 2. Py-Feat Detector Robustness Experiments. A) Example image for robustness manipulations. B) RetinaFace face detection robustness results. Values are Average Precision where larger values indicate better performance. C) Landmark detection robustness results. Values are Normalized Mean Absolute Error (MAE) where smaller values indicate better performance. D) img2pose-constrained pose detection robustness results. Values are Mean Absolute Error (MAE) where smaller values indicate better performance. E) Feat-XGB AU detection robustness results. Values are F1 scores where larger values indicate better performance. We note that the DISFA+ dataset does not include labels for AU7. F) Residual Masking Network emotion detection robustness results. Values are F1 scores where larger values indicate better performance.

Figure 3 | Demonstration of action unit to landmark visualization. (A): Facial expressions generated from AU detections on real images. Detected AU activations were extracted from each of six labeled images displaying one emotion and projected through Py-Feat's visualization model. (B): Facial expressions generated by manually activating each AU in sequence.

Figure 4 | Illustrative Py-Feat Analyses. (A): Average probability of action unit (AU) activation differences when delivering good news and bad news for AUs 6, 12, and 25. The dashed line reflects maximal detector uncertainty. (B): Clustered intervideo time-series correlations of happiness detected over video-frames. Warmer colors indicate a pair of videos was more similar in terms of their Happiness time-courses. (C): Example replication analysis of Watson et al. 87 . Each bar depicts the cross-validated accuracy decoding good vs bad news clips using emotion, AU, pose or combined features. Error-bars reflect the standard deviation across cross-validation folds. Py-Feat's default emotion detector performs perfectly on the subset of data in this example. The dashed line reflects chance performance. (D): Facial expression reconstructed from the AU classifier weights using the AU decoder (orange bar).
Recall is the proportion of true positives relative to the ground truth:
$$\text{recall} = \frac{TP}{TP + FN}$$ (eq3)
F1 scores range from 0 to a perfect precision and recall of 1.0.
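The F1 computation can be written out directly from counts of true positives (TP), false positives (FP), and false negatives (FN). The following is a minimal sketch for binary labels; the function name and the convention of returning 0 when a denominator is empty are illustrative choices rather than a description of the toolbox's evaluation code.

```python
def f1_score(y_true, y_pred):
    """Compute the F1 score for binary labels (1 = AU or emotion present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for five frames: TP = 2, FP = 1, FN = 1.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(round(f1_score(y_true, y_pred), 3))
```

In this example precision and recall are both 2/3, so the F1 score is also 2/3.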

Figure S2. Robustness Test results for Action Unit detection algorithms on the Namba head rotation dataset. Head rotation values range from 0° (head on) to 45° rotations. Values are F1 scores for each action unit, where higher values indicate better performance. Each bar indicates the performance of each algorithm on varying degrees of rotation.

Table 1 . Software comparison on functionalities and affordability
X indicates features provided by each package. Features from the Py-Feat toolbox are shown in brackets. Facial landmarks are points pertaining to locations of key spatial positions of the face including the jaw, mouth, nose, eyes, and eyebrows. Action units are facial muscle groups defined by FACS 51 . Emotions refer to the detection of canonical emotional expressions. Headpose refers to the pitch, roll, and yaw orientations of the face. Gaze refers to the direction the eyes are looking. *iMotions is a platform and its feature extraction relies on the purchase of either the AFFDEX or FACET modules. **Detection of action units and analysis functionalities require a separate add-on purchase of The Action Unit Module and the Project Analysis Module for the Noldus FaceReader. ***We note that OpenFace can perform some preprocessing such as median face image subtraction and post-processing of AUs to correct for at-rest expressions.

Table 2 . Benchmarking datasets. Details about each dataset used for benchmarking the Py-Feat detectors.
Table 2 includes details about each of the benchmark datasets. Full details can be found in the supplementary materials.

Table 3 . Benchmarking results for face bounding box detection
Easy, Medium, Hard results retrieved from WIDER Face. Numbers are average precision scores with higher numbers indicating better detection accuracy. Bold numbers indicate best performance for each column and bracketed numbers indicate the performance of the model selected as the default for Py-Feat.

Table 4 . Benchmarking results for face landmark detection.
Feat models were initialized with face bounding boxes using RetinaFace. Numbers are root mean squared errors of coordinates with lower numbers indicating better alignment. Bolded numbers indicate best performance and bracketed numbers indicate the performance of the model selected as the default for Py-Feat.

Table 5 : Model Performance on BIWI Kinect Head Pose Dataset
Model performance on the BIWI Kinect dataset, where Mean Absolute Error (MAE) values are reported in degrees (lower is better). Table shows performance of the img2pose models. Bolded numbers indicate best performance and bracketed numbers indicate the performance of the model selected as the default for Py-Feat.

Table 6 . Benchmarking results for AU models on DisfaPlus
Numbers shown are F1 scores. Bolded numbers indicate best performance and bracketed numbers indicate the performance of the model selected as the default for Py-Feat.
2, 4, 9, 15, and 17. OpenFace and our Feat-XGB model achieved the second highest average F1 scores followed by the Feat-SVM model. OpenFace was the most accurate in detecting AUs 1, 6, and 12. The Feat-XGB model performed the best on AUs 5, 20, and 25, while the Feat-SVM model only performed the best on AU26. We have selected the Feat-XGB model to be the default model as it provides AU detection probability estimates rather than binary classifications.

Table 7 . Benchmarking results for emotion models on AffectNet
Numbers shown are F1 scores. Bolded numbers indicate best performance and bracketed numbers indicate the performance of the model selected as the default for Py-Feat.
achieved the highest F1 score, followed by the FACET-iMotions model, and the statistical learning models trained on HOG features.
For example, we can use all AUs as features and try to classify the condition in which participants were delivering news. This returns the decoder model object along with its cross-validated performance. Unique to Py-Feat is the ability to use its visualization model to reconstruct any facial expression from AU values (Fig 4D). A compelling use case is reconstructing the facial expression implied by the weights estimated for each AU by the decoder. Py-Feat offers two functions to do this: plot_face(), which reconstructs a single image, and animate_face(), which can morph one facial expression to another to emphasize what is changing.

Table S2 :
Robustness Test results for Pose detection algorithms with the BIWI-Kinect dataset. Values are absolute error in degrees for Pitch, Roll and Yaw, where lower values indicate better performance. We conducted 5 robustness tests for each algorithm (lower/higher luminance, eyes/nose/mouth masking). Each box indicates the performance of each algorithm on the original test set, and on each robustness test.

Table S3 :
Robustness Test results for face landmark detection algorithms with the 300W dataset. Values are normalized mean squared error (nMSE), where lower values indicate better performance. We conducted 5 robustness tests for each algorithm (lower/higher luminance, eyes/nose/mouth masking). Each row shows results for each landmark algorithm in our toolbox, and the columns show each robustness test.

Table S4 :
Robustness Test results for Action Unit detection algorithms with the DISFA+ dataset. Values are F1 scores for each Action Unit, where higher values indicate better performance. We conducted 5 robustness tests for each algorithm (lower/higher luminance, eyes/nose/mouth masking). Each box indicates the performance of each algorithm on the original test set, and on each robustness test.

Table S5 .
Robustness Test results for Emotion detection algorithms with the subset AffectNet dataset. Values are F1 scores for each Emotion category, where higher values indicate better performance. We conducted 5 robustness tests for each algorithm (lower/higher luminance, eyes/nose/mouth masking). Each box indicates the performance of each algorithm on the original test set, and on each robustness test.