Training and Test Datasets
The data used in the present research were from publicly available datasets and previously published studies. The linear regression models were fit to 183 studio portraits of neutral, frontal, white faces of men and women, together with their ratings on various social attributes, from the Chicago Face Database (Ma et al., 2015). This database originally contained social attribute ratings for 597 portraits of neutral, frontal faces from four races (Asian, Black, Latino, and White); the remaining 414 non-white faces were excluded because the effect of race is beyond the scope of our current research. The database provides, for each face, ratings by human subjects on 15 social attributes (afraid, angry, attractive, baby-faced, disgusted, dominant, feminine, happy, masculine, prototypic, sad, surprised, threatening, trustworthy, and unusual) on a 1–7 Likert scale (1 = Not at all, 7 = Extremely). We excluded judgments of unusual because neither this social attribute nor any synonym or antonym of it was rated in any of the out-of-sample test datasets that we used. Thus, we fit 14 linear regression models, one for each of the remaining 14 social attributes. The design matrix for the linear regression had 183 rows; each row represented one of the 183 face images in the training dataset, and each column represented one feature of the facial feature space under consideration (Fig. 1b).
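For concreteness, a minimal sketch of how such a design matrix and rating vector could be assembled; the file names and the pandas/NumPy workflow are illustrative assumptions, not the exact pipeline used here:

```python
import numpy as np
import pandas as pd

# Hypothetical file names; each row of `features` corresponds to one of the 183 faces,
# and each column to one feature of the chosen facial feature space (Fig. 1b).
features = np.load("cfd_feature_space.npy")      # shape: (183, n_features)
ratings = pd.read_csv("cfd_mean_ratings.csv")    # mean rating per face per attribute

X = features                                     # design matrix: 183 x n_features
y = ratings["Trustworthy"].to_numpy()            # one regression model per social attribute
assert X.shape[0] == y.shape[0] == 183
```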
The models were tested on five out-of-sample independent datasets that are publicly available (Lin et al., 2021; Oh et al., 2020; Oosterhof & Todorov, 2008; Walker et al., 2018; White et al., 2017). These test datasets were selected to sample social judgments from different types of faces, including studio portraits of frontal, neutral faces, computer-generated faces, and ambient photos of faces taken under unconstrained conditions. All faces in our training and test datasets were limited to white faces; the effects of race and context (e.g., image background and facial expression) are beyond the scope of our current study. Specifically, the Lin et al. (2021) dataset included ratings for 100 studio portraits of frontal, neutral, white faces (of which 60 were non-overlapping with the training dataset, i.e., 60 novel faces) on 100 social attributes. The Oh et al. (2020) dataset included ratings for 66 novel studio portraits of frontal, neutral, white faces on 14 social attributes. The Walker et al. (2018) dataset included ratings for 40 novel studio portraits of frontal, neutral, white faces on seven social attributes. The Oosterhof and Todorov (2008) dataset included ratings for 300 computer-generated frontal, neutral, white faces on nine social attributes. The White et al. (2017) dataset originally included ratings for 1224 ambient photos (12 images of each of the 102 individuals of various races) taken in real-world contexts downloaded from their Facebook accounts (varied in viewpoint, facial expression, background, illumination, etc.) on five social attributes. We only used 504 photos of white individuals (12 images of each of the 42 individuals). Model training and testing were performed using ratings averaged across human subjects per face per social attribute.
To assess how well the linear regression models with different feature sets predicted social judgments from faces (see the first section of Results, “Generalizability Across Faces, Raters, and Social Attributes”), we fit a model for a social attribute on the Chicago Face Database and tested the model on the same social attribute, or on a highly similar or dissimilar one (a synonym or antonym), in the out-of-sample test datasets. Ideally, we would fit a model for a social attribute and test the model on that same social attribute. However, the test datasets generally measured judgments of different social attributes than the training dataset. Therefore, when the social attribute from the training dataset was not available in a test dataset, we used a synonym or antonym of the fitted social attribute in the test dataset (if available). Based on this rationale, we tested the models that were fit to the corresponding social attributes in the Chicago Face Database on nine social attributes in the Lin et al. (2021) dataset, four social attributes in the Oh et al. (2020) dataset, four social attributes in the Oosterhof and Todorov (2008) dataset, and three social attributes in the White et al. (2017) dataset.
To assess how well a model fitted for a social attribute would predict other social attributes (i.e., the last section of Results, “Non-specific Predictions Across Social Attributes”), we did not require ratings on the exact same or highly (dis)similar social attribute between the training dataset and test datasets. Therefore, we fit a model for each of the 14 social attributes in the Chicago Face Database, and assessed how well the predicted ratings from these models correlated with the ratings in the test datasets on all available social attributes (except for the Lin et al. (2021) dataset, where ratings were measured for 100 social attributes; we only used a subset of 15 social attributes that are commonly studied in the literature).
DCNN-Identity Features
To extract identity features from face images, we used the dlib C++ machine learning library, which offers an open-source implementation of face recognition with deep neural networks (King, 2009, 2017). The network’s final layer represents each face image with a vector of 128 features. The network had originally been trained to identify 7,485 face identities in a dataset of about three million faces, with a loss function such that two face images of the same identity were mapped closer to each other in the face space than face images of two different identities. Built on a ResNet architecture with 29 convolutional layers, the network achieved an accuracy of 99.38% on the “Labeled Faces in the Wild” benchmark (King, 2009, 2017). We used the feature vectors from the last layer of the network directly, without tuning the network or its last layer specifically for social judgments from faces.
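As a rough illustration, the 128-dimensional descriptors can be obtained through dlib’s Python bindings along the following lines; the model file names follow dlib’s standard distribution, and the image path is an assumption:

```python
import dlib

# Pre-trained dlib models (standard file names from dlib's model distribution).
detector = dlib.get_frontal_face_detector()
shape_predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
face_encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

img = dlib.load_rgb_image("portrait.jpg")        # assumed path to one studio portrait
face_rect = detector(img, 1)[0]                  # bounding box of the detected face
landmarks = shape_predictor(img, face_rect)      # 68 facial landmarks used for alignment
descriptor = face_encoder.compute_face_descriptor(img, landmarks)
identity_features = list(descriptor)             # 128 DCNN-Identity features for this face
```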
DCNN-Object Features
To extract object features from face images, we used the features obtained from the block5_conv2 layer of the VGG16 network, because prior studies showed that features from this layer successfully predicted social judgments from faces (Song et al., 2017). We also repeated our analyses with features from other layers of the network, which produced worse performance (Fig. S1); we therefore used the features from the block5_conv2 layer for all subsequent analyses. To extract the object features from a face image, the face region of the image was first detected and segmented automatically using the histogram-of-oriented-gradients-based face detector implemented in the dlib C++ library (King, 2009, 2017). The segmented image was then presented to the VGG16 model implemented in the Keras deep learning library (Chollet, 2015), with weights pre-trained on the ImageNet dataset (Deng et al., 2009) for object recognition. The output of the block5_conv2 layer had a volume shape of 14 × 14 × 512, which was flattened into a 100,352-dimensional feature vector. Thus, the layer represented each face image with a vector of 100,352 features.
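A minimal sketch of this feature extraction step in Keras; the cropped face image path and the use of the tensorflow.keras namespace are assumptions, while the layer name block5_conv2 is as described above:

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

base = VGG16(weights="imagenet")                                   # ImageNet-pretrained VGG16
extractor = Model(inputs=base.input,
                  outputs=base.get_layer("block5_conv2").output)   # tap the block5_conv2 layer

img = image.load_img("face_crop.jpg", target_size=(224, 224))      # assumed cropped face image
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
object_features = extractor.predict(x).reshape(-1)                 # 14 x 14 x 512 -> 100,352 features
```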
Due to the large number of features, we used principal component analysis (PCA) to reduce the dimensionality and redundancy of these features. Our goal was to retain a much smaller number of PCs from the 100,352 features and to project the 100,352-dimensional feature vectors of the face images in both the training and test datasets onto these PCs, which we then used in the linear regression models. To prevent the PCs applied to the faces in the test datasets from being biased by the variance in the faces of the training dataset, we performed PCA on a larger and more comprehensive set of faces: face images of 426 white adults with neutral expressions aggregated from three popular, publicly available face databases (Chelnokova et al., 2014; DeBruine & Jones, 2017; Ma et al., 2015). We determined the optimal number of PCs based on their performance in predicting social judgments from the faces in the model training dataset (i.e., the 183 studio portraits from the Chicago Face Database). Specifically, the 426 faces were first represented with the 100,352-dimensional DCNN-Object feature vectors, on which we performed PCA to extract PCs of the features. Next, the 100,352-dimensional feature vectors of the 183 faces in the training dataset were projected onto the PCs obtained from the 426 faces. Finally, we fit ridge regression models using different numbers of PCs (from 10 to 110, in steps of one) to predict the ratings of the 183 faces. The first 26 PCs offered the best average prediction accuracy across all 14 social attributes, and we therefore used the first 26 PCs to represent the DCNN-Object features in all subsequent analyses (Fig. S1).
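A sketch of this PCA step, assuming features_426 (426 × 100,352) holds the aggregated reference faces and features_train (183 × 100,352) the training faces; the selection loop below is a simplified stand-in for the ridge-regression evaluation described above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

pca = PCA(n_components=110).fit(features_426)   # PCs defined on the independent 426-face set
train_pcs = pca.transform(features_train)       # project the 183 training faces onto those PCs

# Evaluate candidate numbers of PCs (10 to 110) against one attribute's ratings `y`;
# in the study, accuracy was averaged across all 14 social attributes.
scores = {k: cross_val_score(Ridge(), train_pcs[:, :k], y, cv=5).mean()
          for k in range(10, 111)}
best_k = max(scores, key=scores.get)             # 26 PCs in the present study
```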
Facial-Geometry Features
The brute-force approach offered by DCNNs has the well-known effect of producing representations, such as the face features described above, that are not easily interpretable. We therefore also used a complementary, human-specified set of interpretable face features. Physical and geometric features of the face (e.g., brighter skin, larger eyes, and a rounder face) have been shown to influence how humans make social judgments of unfamiliar others based on faces (Ma et al., 2015). To obtain these features, we referred to the 40 facial-geometry features provided in the Chicago Face Database (Ma et al., 2015), which were defined based on a review of the social perception literature (Blair & Judd, 2011; Zebrowitz & Collins, 1997). In the Chicago Face Database, these 40 physical and geometric features were manually measured using image editing software (Ma et al., 2015). In our present study, given the large number of faces we used, we aimed to generate a subset of those physical and geometric features that could be measured automatically but that remained easily interpretable. A recent study showed that automatically measured physical and geometric features are highly correlated with manually measured ones (Jones et al., 2021). Here, to automatically measure physical and geometric features, we used a pre-trained facial landmark detection model implemented in the dlib C++ library to estimate the locations of 68 key points on each face image. This model had originally been built using an ensemble-of-regression-trees approach and trained on the IBUG 300-W facial landmark dataset (Kazemi & Sullivan, 2014; King, 2017; Sagonas et al., 2016). We used another pre-trained model, for face parsing, to segment each face image into facial parts such as the skin area, the left and right eyes, and the nose (see Fig. S2). This model had originally been built using a BiSeNet architecture and trained on the CelebAMask-HQ dataset (Lee et al., 2020; Yu et al., 2018; Zllrunning, 2020). These automated methods allowed us to obtain 30 physical and geometric features (Facial-Geometry features) that closely imitate the manually measured physical and geometric features provided in the Chicago Face Database. The 30 Facial-Geometry features were the median luminance of skin area, nose width, nose length, lip thickness, face length, eye height (left, right), eye width (left, right), face width at cheek, face width at mouth, distance between pupils, distance between pupil and upper lip (left, right, asymmetry), chin length, length of cheek to chin (left, right), face shape, (face) heartshapeness, nose shape, lip fullness, eye shape, eye size, midface length, chin size, cheekbone height, cheekbone prominence, face roundness, and facial width-to-height ratio. We verified that the 30 automatically extracted Facial-Geometry features described social judgments from faces as well as the 40 manually measured features did, by comparing the prediction accuracy of models based on the two sets of features (see Fig. S2).
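To make this concrete, a hypothetical sketch of deriving a few of the named features from the 68 dlib landmarks; the landmark indices follow the standard 68-point annotation scheme, but the specific index choices and the pupil approximation are illustrative assumptions rather than the exact definitions used in the study:

```python
import numpy as np

def landmarks_to_array(shape):
    """Convert a dlib full_object_detection into a 68 x 2 array of (x, y) coordinates."""
    return np.array([[p.x, p.y] for p in shape.parts()])

pts = landmarks_to_array(landmarks)                      # landmarks from the dlib shape predictor
nose_width = np.linalg.norm(pts[31] - pts[35])           # outer nostril landmarks
lip_thickness = np.linalg.norm(pts[51] - pts[57])        # mid upper lip to mid lower lip
face_width_at_cheek = np.linalg.norm(pts[1] - pts[15])   # jaw points near the cheekbones
pupil_left = pts[36:42].mean(axis=0)                     # eye-contour centroids serve as
pupil_right = pts[42:48].mean(axis=0)                    #   rough pupil estimates
distance_between_pupils = np.linalg.norm(pupil_left - pupil_right)
```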
Model Fitting
L2-regularized linear regression (also known as ridge regression; Hoerl & Kennard, 1970) was used to fit a set of model weights, separately for each social attribute, that optimally mapped facial features onto human subjects’ social judgments from faces (Fig. 1). Cross-validation was used to determine the optimal regularization parameter for ridge regression. Specifically, the training dataset was randomly split into 80% training and 20% validation samples for 2,000 iterations. At each iteration, a range of regularization parameters (n = 30, log-spaced between 1 and 100,000) was used to fit models to the training part, and each fitted model was used to predict the human ratings of the faces in the validation part. This procedure yielded a model accuracy per regularization parameter, iteration, and social attribute, assessed with the mean squared error (MSE). For each social attribute, the regularization parameter that minimized the average error across all iterations was selected, and the model weights were refit with this optimal regularization parameter using the entire training dataset (i.e., the final model). We also repeated this procedure of selecting the regularization parameter using evaluation metrics in addition to MSE, including the coefficient of determination (R2) and the root mean squared error (RMSE); the results corroborated those using MSE reported here.
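A minimal sketch of this regularization search, assuming X (183 × n_features) and y (183 mean ratings for one social attribute); the variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit

alphas = np.logspace(0, 5, 30)                       # 30 values, log-spaced from 1 to 100,000
splits = ShuffleSplit(n_splits=2000, test_size=0.2, random_state=0)

errors = np.zeros((2000, len(alphas)))               # validation MSE per iteration per alpha
for i, (train_idx, val_idx) in enumerate(splits.split(X)):
    for j, alpha in enumerate(alphas):
        model = Ridge(alpha=alpha).fit(X[train_idx], y[train_idx])
        errors[i, j] = mean_squared_error(y[val_idx], model.predict(X[val_idx]))

best_alpha = alphas[errors.mean(axis=0).argmin()]    # minimizes the average validation MSE
final_model = Ridge(alpha=best_alpha).fit(X, y)      # refit on the entire training dataset
```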
The final fitted model for each social attribute was used to predict ratings of the same social attribute for the novel faces in each test dataset. Some out-of-sample test datasets did not include ratings of the exact same social attributes as the training dataset (i.e., the Chicago Face Database). In those cases, we used the final model for a social attribute (e.g., dominant) to predict ratings of a semantically highly similar or dissimilar social attribute in the test dataset (e.g., submissive), if one was available. A bootstrap procedure was used to robustly estimate the prediction accuracy of each model on each test dataset. Specifically, the face images and their ratings in each test dataset were randomly resampled 10,000 times with replacement, and the Spearman rank-order correlation between the resampled predicted ratings and the resampled human ratings was computed per social attribute (Lescroart & Gallant, 2019). We used the Spearman rank-order correlation to assess model accuracy because the ratings in some test datasets were collected on a different scale than in the training dataset, and the rank order of faces on an attribute (i.e., whether one face looks more trustworthy than another) is a more reliable metric than the raw rating values assigned to the faces. The mean prediction accuracy for each social attribute was obtained by averaging the accuracies across bootstrap iterations. For the test dataset that contained a large number of ambient photos (504 photos of 42 white individuals; White et al., 2017), one image was randomly sampled from the set of images available for each identity at each bootstrap iteration (i.e., 42 images were included at each iteration) to prevent bias in prediction accuracy.
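A sketch of the bootstrap estimate for one social attribute in one test dataset, assuming predicted and observed are aligned rating vectors over the test faces:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_faces = len(observed)

boot_rho = np.empty(10000)
for b in range(10000):
    idx = rng.integers(0, n_faces, n_faces)              # resample faces with replacement
    boot_rho[b] = spearmanr(predicted[idx], observed[idx]).correlation

mean_accuracy = boot_rho.mean()                           # mean prediction accuracy for this attribute
```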
To assess the statistical significance of the mean prediction accuracy and estimate the chance threshold for the prediction per social attribute in each test dataset, we performed a permutation analysis to generate an empirical null distribution of correlations for each social attribute and test dataset separately. At each permutation iteration, the ratings in a test dataset were shuffled across face images, and the Spearman correlation between the predicted and permuted ratings was computed for each social attribute. This procedure was repeated 10,000 times to obtain a distribution of the correlations, under the null hypothesis that there is no relationship between facial features and social judgments from faces. The chance threshold was determined by taking the 95th percentile of the empirical null distribution (p = 0.05). The permutation p-value for each social attribute was defined as the proportion of the null correlations that were greater than or equal to the observed prediction accuracy. The p-values were corrected for multiple comparisons across the predicted social attributes using the false discovery rate (FDR) procedure (Benjamini & Hochberg, 1995).
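The permutation null and the FDR correction could be sketched as follows; the use of statsmodels’ multipletests for the Benjamini-Hochberg step is an assumption, and pvals stands for the collection of one permutation p-value per predicted attribute:

```python
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
observed_rho = spearmanr(predicted, observed).correlation

# Null distribution: shuffle the human ratings across faces 10,000 times.
null_rho = np.array([spearmanr(predicted, rng.permutation(observed)).correlation
                     for _ in range(10000)])

chance_threshold = np.percentile(null_rho, 95)   # chance threshold at p = 0.05
p_value = np.mean(null_rho >= observed_rho)      # permutation p-value for this attribute

# Across all predicted attributes (pvals): Benjamini-Hochberg FDR correction.
reject, p_fdr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```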
To characterize the robustness of our findings to the specific analysis pipeline, we also repeated the above analysis procedures using linear regression methods in addition to ridge regression, namely LASSO regression and ordinary least squares (OLS) regression. The same cross-validation procedure described for ridge regression was used to select the optimal regularization parameter for LASSO regression from a range of regularization parameters (n = 30, log-spaced between 0.01 and 100). No cross-validation procedure was used in training the OLS models, since this method has no regularization parameter to be determined. We found that ridge regression provided the best predictions across social attributes and test datasets (mean prediction accuracy measured with Spearman’s \(\rho\) = 0.552 ± 0.197 for the DCNN-Identity models; 0.430 ± 0.218 for the DCNN-Object models; 0.385 ± 0.213 for the Facial-Geometry models; mean ± standard deviation across test datasets and attributes). LASSO regression provided prediction accuracies similar to those of ridge regression across social attributes and test datasets (mean Spearman’s \(\rho\) = 0.527 ± 0.198 for the DCNN-Identity models; 0.426 ± 0.226 for the DCNN-Object models; 0.369 ± 0.224 for the Facial-Geometry models). However, OLS regression provided worse prediction accuracies for the DCNN-Identity models (mean Spearman’s \(\rho\) = 0.239 ± 0.140 across social attributes and test datasets) and the Facial-Geometry models (0.169 ± 0.098), due to multicollinearity in the features, and similar prediction accuracies for the DCNN-Object models (0.434 ± 0.221). We therefore used ridge regression, the linear regression method that produced the best prediction accuracies across feature spaces, in our present investigation.
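For completeness, the comparison models can be sketched in scikit-learn under the same assumptions about X and y as above; best_lasso_alpha stands for the value selected by the same cross-validation loop used for ridge regression:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

lasso_alphas = np.logspace(-2, 2, 30)    # 30 candidate values, log-spaced from 0.01 to 100
# best_lasso_alpha is chosen with the same 80/20 cross-validation procedure as for ridge.
lasso_model = Lasso(alpha=best_lasso_alpha, max_iter=10000).fit(X, y)
ols_model = LinearRegression().fit(X, y)  # OLS has no regularization parameter to tune
```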
Variance Partitioning Analysis
We used a variance partitioning analysis to compare the unique and shared explained variance between each pair of feature spaces (Çukur et al., 2016; Lescroart & Gallant, 2019). Specifically, for each social attribute and each pair of feature spaces, we fit three models using the training dataset: one fit the ratings to the first feature space (e.g., the 128 DCNN-Identity features), the second fit the ratings to the second feature space (e.g., the 26 DCNN-Object features), and the third fit the ratings to both feature spaces combined (e.g., the 154 DCNN-Identity and DCNN-Object features). These three fitted models were used to predict the ratings of the faces in the test dataset. The variance explained by each model for each social attribute was quantified with the coefficient of determination (R2). Finally, the unique variance explained by each of the two compared feature spaces (A and B) and the shared variance explained by both feature spaces were computed as follows:
$${R}_{uA}^{2}={R}_{A\cup B}^{2}-{R}_{B}^{2}$$
$${R}_{uB}^{2}={R}_{A\cup B}^{2}-{R}_{A}^{2}$$
$${R}_{A\cap B}^{2}={R}_{A}^{2}+{R}_{B}^{2}-{R}_{A\cup B}^{2}$$
where \({R}_{A}^{2}\) is the total variance explained by the first model using feature space A, \({R}_{B}^{2}\) is the total variance explained by the second model using feature space B, \({R}_{A\cup B}^{2}\) is the total variance explained by the third model using features from both spaces, \({R}_{uA}^{2}\) is the unique variance explained by feature space A, \({R}_{uB}^{2}\) is the unique variance explained by feature space B, and \({R}_{A\cap B}^{2}\) is the shared variance explained by feature spaces A and B.
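A sketch of this computation for one social attribute and one pair of feature spaces, assuming X_A and X_B are the training design matrices, Xt_A and Xt_B their test counterparts, and alpha the regularization parameter selected as described above:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def fit_and_score(X_train, X_test, y_train, y_test, alpha):
    """Fit a ridge model on the training faces and return R^2 on the test faces."""
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))

r2_A = fit_and_score(X_A, Xt_A, y_train, y_test, alpha)               # feature space A alone
r2_B = fit_and_score(X_B, Xt_B, y_train, y_test, alpha)               # feature space B alone
r2_AB = fit_and_score(np.hstack([X_A, X_B]), np.hstack([Xt_A, Xt_B]),
                      y_train, y_test, alpha)                          # A and B combined

unique_A = r2_AB - r2_B          # variance explained uniquely by A
unique_B = r2_AB - r2_A          # variance explained uniquely by B
shared_AB = r2_A + r2_B - r2_AB  # variance shared between A and B
```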
Semi-partial Correlation Analysis
To understand how different social judgments contribute to the cross-predictions across multiple social attributes (i.e., when a model fitted to one social attribute predicted other social attributes), we performed a semi-partial correlation analysis. This analysis procedure measures the relationship between two variables X and Y while statistically controlling for (or partialing out) the effect of a third variable Z on Y. Note that, in contrast, the (standard) partial correlation controls for the effect of Z on both X and Y. In this analysis, the actual ratings of a social attribute provided by the human subjects in the test dataset were used as the variable X (i.e., the social attribute to be cross-predicted by a model that was not fitted to this social attribute). The ratings of a second social attribute predicted by a model for the same set of faces were used as the variable Y (i.e., a second social attribute that was used to fit a model). The ratings of a third social attribute predicted by another model for the same set of faces were used as the variable Z (i.e., a third social attribute that was used to fit another model). To partial out the effect of Z from Y, a simple bivariate regression of Y on Z was performed, and the residuals were obtained. These residuals quantified the unique variance in Y that was not linearly associated with or predictable from Z. Finally, we computed the Spearman correlation coefficient between X and the residuals.
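A sketch of this computation, assuming x, y, and z are aligned rating vectors over the same set of test faces, as defined above:

```python
import numpy as np
from scipy.stats import spearmanr

def semipartial_spearman(x, y, z):
    """Spearman correlation between x and the part of y not linearly predictable from z."""
    z_design = np.column_stack([np.ones_like(z), z])     # intercept plus predictor Z
    beta, *_ = np.linalg.lstsq(z_design, y, rcond=None)  # bivariate regression of Y on Z
    residuals = y - z_design @ beta                      # variance in Y not associated with Z
    return spearmanr(x, residuals).correlation
```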