Psychological trait inferences from women’s clothing: human and machine prediction

Rosenbusch, Hannes; Aghaei, Maya; Evans, Anthony M.; Zeelenberg, Marcel

doi:10.1007/s42001-020-00085-6


Research Article
Open access
Published: 22 September 2020

Volume 4, pages 479–501, (2021)

Journal of Computational Social Science


Abstract

People use clothing to make personality inferences about others, and these inferences steer social behaviors. The current work makes four contributions to the measurement and prediction of clothing-based person perception: first, we integrate published research and open-ended responses to identify common psychological inferences made from clothes (Study 1). We find that people use clothes to make inferences about happiness, sexual interest, intelligence, trustworthiness, and confidence. Second, we examine consensus (i.e., interrater agreement) for clothing-based inferences (Study 2). We observe that characteristics of the inferring observer contribute more to the drawn inferences than the observed clothes, which entails low to medium levels of interrater agreement. Third, the current work examines whether a computer vision model can use image properties (i.e., pixels alone) to replicate human inferences (Study 3). While our best model outperforms a single human rater, its absolute performance falls short of reliability conventions in psychological research. Finally, we introduce a large database of clothing images with psychological labels and demonstrate its use for exploration and replication of psychological research. The database consists of 5000 images of (western) women’s clothing items with psychological inferences annotated by 25 participants per clothing item.


Introduction

People use an array of social information sources, such as physical appearance, language, and belongings, to form first impressions of others [30, 40, 88]. In response, people select and display symbols to influence which inferences are made about them [59, 60]. Clothing is one of the most common symbols used for this purpose [19, 39]. People use clothing to communicate their group memberships, jobs, and interests [4, 12, 28], as well as their emotional states [54], personality traits [6, 85], and capabilities [3, 51]. Thus, clothing is seen as a primary tool of impression management by researchers and laypeople alike [34, 46], and psychological inferences based on clothing influence impression formation [58].

In the current work, we use computational social science methods to add to the understanding of how people make inferences based on women’s clothing items. First, we review the social scientific research on clothing-based impression formation. Then we investigate which psychological attributes people believe they can infer from women’s clothes (Study 1). Subsequently, we examine interpersonal consensus in clothing inferences (i.e., we assess interrater agreement), and determine how many human raters are needed for stable average inferences (Study 2). Lastly, we use the insights from our first two studies to build a labeled database of clothes for social scientific research, and to test whether a machine learning model can replicate (i.e., predict) human inferences from clothing (Study 3). We conclude by demonstrating the use of the database, and discussing the insights from all three studies. Importantly, note that these studies focus on how people make inferences from clothing, and are agnostic about whether these inferences are accurate (or inaccurate).

Psychological inferences from clothing

The role of clothing in impression formation is a popular topic in public discourse. Popular media devotes considerable attention to how clothing can help consumers make good impressions and ‘dress for success’ (e.g., [21]). People use clothes to attempt to convey a positive image of themselves, assuming that the way they dress affects the inferences that others will make about them [8]. In scholarly work, this theme has also received interdisciplinary attention, with much of this work rooted in evolutionary approaches to social signaling (e.g., [86]). For example, various non-human animals benefit in reproductive competition through displaying ornaments or costly behaviors [18, 72]. Humans also engage in similar forms of status signaling in their display of clothing. Nelissen and Meijers [59], for instance, observed that perceivers judge targets wearing luxury clothing brands as higher status, and also express favorable behaviors towards them across a range of social situations.

A complementary account of clothing selection, again rooted in evolutionary processes, identifies clothes as a culturally acquired indicator of group memberships. Group-specific clothing serves to strengthen group cohesion [20] and signals social resources and embeddedness to observers [70]. Clothes help to express such social identities (cf. [23]) and they are often explicitly brought in line with people’s personal identity, including self-ascribed personality, capabilities, and aesthetic preferences [12, 66].

In personality psychology, personality judgment from observable indicators, or cues (e.g., language, [69]) is often analyzed using Brunswik’s classic lens model [11]. The lens framework highlights that actual personality traits, as well as people’s judgments of personality traits, are based on the target person’s cue behaviors (e.g., wearing specific shoes, Gillath et al. 2012). However, the degree of association between the observable cue and the actual personality measurement (i.e., the cue validity) might differ from the association between cue and personality judgment (i.e., the cue utilization).

Clothing choices are often utilized as cues in person perception [46]. Similar to, for instance, music selection, wearing specific clothes may serve as a means to fulfill personal psychological needs, thereby allowing for valid associations between clothes and personality [68]. On the other hand, humans are widely known to see and rely on patterns to an unrealistic degree [13, 82] and are especially prone to this tendency in the context of person perception [73]. Accordingly, psychological work has shown repeatedly that inferences made from clothing can be unreliable (i.e., low cue validity) or even fully inaccurate [26, 55]. Clothes may be more likely to reflect the personality traits that targets hope to project, rather than the traits targets actually possess. This relatively poor accuracy is in line with research on psychological inferences from other indicators, such as facial and vocal features, where accuracy is, at best, extremely limited [57, 61, 62, 77, 81].

Despite their lack of validity, psychological inferences from clothes are very relevant in everyday life, as they steer people’s perception and subsequent behavior towards each other [40]. For example, red clothes are associated with perceived dominance (e.g., [86]) and skin-revealing clothes are perceived as indicators of sexual interest [26]. However, past studies on the social psychology of clothing often relied on small, study-specific clothing samples to test general theories (e.g., three outfit options for teachers [22]; Taekwondo equipment [29]). In the current work, we address this limitation using a large database (5000 images labeled by 25 human raters each) to answer questions about clothing-based inferences.

Uniting computational and social psychological approaches

Computational research on clothing to date has focused on accurate machine classification of clothing images [49], extraction of clothes from images [41], and building recommendation systems for customers [35]. While social subtleties of clothes are not considered very often yet, computational research does possess two advantages over psychological work in the field: typically large datasets and powerful analysis tools. In the current work, we utilize both these resources to answer our research questions about the nature of clothing-based inferences, the origins of their variance, and their predictability.

Uniting computational and psychological research in the field of clothing and impression formation was first proposed by Aghaei et al. [5]. In their position paper, they argued that clothing-focused research in computer science and machine learning has the methods to venture beyond superficial categorizations (e.g., into colors or cuts [49, 87]) towards processing the social signals within clothes. First steps into this direction were taken by Ma et al. [50], who demonstrated automatic extraction of semantic styles from clothing images (e.g., ‘classic’ vs. ‘modern’). Wei et al. [85] extended this line of work by extracting significant correlations between the ascribed personality traits of 300 celebrities (e.g., ‘friendly’) and their clothing styles (e.g., ‘light-colored’) from online images. Note that, like in the current work, the authors did not attempt to predict people’s actual personality traits from clothes, but rather investigated ascribed (here: inferred) characteristics. Personality inferences may be easier to predict from clothing than actual personality traits. The nature of inferences already implies that the clothes are (supposed to be) the source of the measurement variance, whereas there is no such connection between clothes and actual personality. Despite the conceptual difference between inferred and actual traits, both are extremely important in everyday interactions, as inferences steer people’s behavior towards each other [40]. Similarly, research on human faces began by testing potential connections between facial features and actual psychological characteristics, before realizing the lack of reliable relationships and transitioning to focus on inferred characteristics and the downstream consequences of such inferences [62, 81]. In the current work, we aim to use computational methods to advance research on clothing-based inferences in the same direction.

Overview of studies

As mentioned, there are four overarching goals of the present research. First, we investigate which psychological inferences are most commonly being made based on clothes (Study 1). Then we determine how much variance there is in clothing-based inferences and to which degree this variance emerges from differences in clothes versus differences in raters (Study 2). Subsequently, we utilize the insights from Study 1 and Study 2 to build a database for clothing-based research on psychological inferences. Lastly, we test whether a statistical model can be trained to replicate human-like inferences from images (Study 3).

In the current work, we focus on women’s clothes to limit the vast diversity of existing clothes and psychological associations. This decision makes our studies much more economical, while maintaining a clearly defined target population for the database and prediction model. We concentrate on women’s clothes as they receive more attention from both researchers and laypeople [2, 6, 27, 65, 75].

Study 1: identifying common psychological inferences

The role of clothes in impression formation is an active research field across several disciplines in psychology. But what are the most prevalent traits that people infer from clothing items? In this first study, we aim to identify the traits that are commonly inferred according to both researchers and laypeople. Past literature reviews provide an overview of psychological attributes examined in clothing-focused research [19, 39, 46]. We used these reviews (and the reviewed publications) to generate a set of trait inferences made about the wearers and owners of specific clothes. This first set of traits consisted of inferences commonly examined by researchers. Additionally, we collected data on the trait inferences commonly made by participants/non-researchers. Together, these two sets informed us about which psychological inferences are considered important in research as well as in people’s daily life. In the following section, we describe how we acquired, condensed, and ultimately combined these two sets.

It is worth noting that we did not aim to build a comprehensive theoretical framework specifying all latent psychological dimensions of clothes. Such work would likely employ a factor analytic approach and condense a large set of numerical inferences into a smaller set of overarching theoretical dimensions (cf., [19]). Instead, our goal was to obtain a set of the most prevalent psychological inferences from (women’s) clothes according to social science researchers and laypeople. Note that these most common inferences might not coincide exactly with the sets of characteristics mentioned in general models of psychological traits (e.g., Big Five [90]) or states (e.g., basic emotions [64]). This is because specific contexts lead to different levels of prevalence for different psychological inferences (cf., dimensions in Big Five versus dimensions of face-based inferences [79]). Therefore, we conducted a dedicated entry study to identify the most prevalent inferences from clothing.

Given the large number of strategies to determine a most relevant subset, we used methods from Aaker [1], who developed a taxonomy for personality inferences about corporate brands, as a guideline. We modified this procedure to allow an integration of open-ended responses from participants and previous publications on clothing-based inferences. The overall procedure to identify the most prevalent inferences consisted of three steps: first, two sets of words describing psychological inferences from clothes were collected. These words (e.g., ‘smart’, ‘happy’) were provided by participants through online surveys and by researchers through past publications in the field. Second, the two lists of words were aggregated into two lists of topics (e.g., ‘happy’ and ‘joyful’ might be assigned to a ‘positive mood’ topic). Third, the topics that were commonly mentioned in both online surveys and past publications were identified as the most relevant subset. This procedure allowed us to utilize basic textual data to answer our entry question: Which psychological inferences are commonly being made from clothes according to both researchers and laypeople?

Commonly investigated inferences in academic research

We identified 53 empirical research papers and 3 literature reviews that described psychological inferences that people make based on clothing. We concentrated our search on papers cited in or citing the 3 review papers and conducted an unstructured check for major oversights through Google Scholar. From these papers, we extracted 756 traits (394 unique words). While this is likely not a complete sample of research papers or trait inferences, it allowed us to find the most common clothing-based inferences. The top row in Table 1 shows the words that were mentioned most frequently in the scientific publications.

Table 1 Counts of six most frequent inferences mentioned in 56 scientific publications and by 201 participants


Common inferences among laypeople

We generated an additional set of common inferences (according to laypeople) by asking 201 participants on Prolific Academic (125 female, 74 male, 2 other; M_age = 33.5, SD = 10.8), which psychological attributes can be predicted from clothes. Each participant was asked to provide up to 10 open answers, resulting in a total of 1620 answers (460 unique answers). The bottom row in Table 1 shows the most common inferences mentioned by participants.

Identifying the most relevant subset

Our goal was to identify a subset of traits based on two criteria: first, the included traits should be mentioned by both researchers and participants. Second, the included traits should appear relatively often in both lists. The simplest approach to satisfy these criteria would be to sort the terms according to their frequency (as in Table 1) and look for terms with relatively high counts in both sets. However, this approach would not account for the presence of synonyms, meaning that important constructs might be overlooked (e.g., because some constructs may be described frequently using a multitude of different terms, resulting in relatively low counts). Similarly, academic language may differ from participants’ language, potentially leading to difficulties in matching latent overlap between the two groups of traits.

Given these challenges, we introduced two intermediate text processing steps to examine, spot, and compare mentioned concepts (benefits of such methods for research synthesis are described by Ref. [16]). First, we used a pretrained word2vec model (semantic space generated by Ref. [53]) to convert each mentioned word into a sequence of numerical coordinates. This numerical representation (often called an embedding) of each word was constructed in the original training process of the word2vec model. More precisely, each word received scores on 300 variables computed from the word’s relative co-occurrence with other words (see word2vec script in supplementary materials). After this transformation of words to numerical coordinates, we used k-means clustering to define word clusters or ‘topics’ (in the 300-dimensional space) and simultaneously assign cluster memberships to each word. We estimated 100 clusters (as a compromise between cluster uniqueness and differentiability), as we expected that the most important topics would be identifiable at this degree of complexity. Note that these steps were merely introduced to support our human selection of traits and the manual matching between research and participant content. While it would be possible to fully automate the process, we preferred human decision-making for the final matching between both lists.
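The clustering step described above can be illustrated with a minimal sketch. It is not the authors' script: the two-dimensional vectors below are hypothetical stand-ins for 300-dimensional word2vec embeddings, the word list is invented, and a plain k-means loop with a deterministic farthest-point initialisation replaces whatever library routine was actually used.

```python
import math

# Hypothetical 2-D stand-ins for 300-dimensional word2vec vectors (assumption).
TOY_EMBEDDINGS = {
    "happy": (0.90, 0.10), "joyful": (0.85, 0.15), "cheerful": (0.80, 0.20),
    "smart": (0.10, 0.90), "intelligent": (0.15, 0.85), "clever": (0.20, 0.80),
}


def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm; returns one cluster label per point."""
    # Deterministic start: first point, then repeatedly the point that lies
    # farthest from its nearest already-chosen centroid.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    def assign():
        return [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                for p in points]

    for _ in range(n_iter):
        labels = assign()
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assign()


words = list(TOY_EMBEDDINGS)
labels = kmeans([TOY_EMBEDDINGS[w] for w in words], k=2)

# Group words by cluster label to obtain candidate 'topics'.
topics = {}
for word, label in zip(words, labels):
    topics.setdefault(label, []).append(word)
```

With real data, the vectors would come from the pretrained model and k would be set to the 100 clusters mentioned in the text; the grouping logic stays the same.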

Condensing the two word-frequency lists to two topic-frequency lists allowed us to better spot the targeted overlap between the researcher terms and laypeople terms. For each identified topic, we computed a simple count describing how often this topic was mentioned (through one of its indicator words) by researchers and participants, respectively. Then we examined the 20 most commonly mentioned topics in each of the two lists and manually searched for overlap. Table 2 shows the five overarching inferences which were mentioned relatively often by both researchers and participants. The final selection of common psychological inferences consisted of five traits: happiness, sexual interest, intelligence, trustworthiness, and confidence. This set of inferences strongly resembles inferences that people draw from faces [63], which fall into the overarching dimensions trustworthiness (also including attributes like happiness) and dominance (broadly related to attributes of strength and potency). Inferences of sexual interest seem to be more specific to clothing. More general models of person perception, for instance, the prominent distinction of warmth (here trustworthiness and happiness) and competence (here confidence and intelligence) in social cognition research [24] are also strongly reflected in the list of clothing-based inferences.
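The topic-counting step can be sketched as follows. All words, topic labels, and mention lists are hypothetical, and the exact-label intersection is a simplification: in the study, the overlap between the two top-20 lists was searched manually.

```python
from collections import Counter

# Hypothetical word -> topic assignments, e.g. produced by a clustering step.
WORD_TO_TOPIC = {
    "happy": "positive mood", "joyful": "positive mood",
    "smart": "intelligence", "intelligent": "intelligence",
    "formal": "formality", "rich": "wealth",
}


def topic_counts(mentions, word_to_topic):
    """Count how often each topic is mentioned via one of its indicator words."""
    return Counter(word_to_topic[w] for w in mentions if w in word_to_topic)


def shared_top_topics(counts_a, counts_b, top_n=20):
    """Topics that appear among the top_n most frequent topics of both lists."""
    top_a = {topic for topic, _ in counts_a.most_common(top_n)}
    top_b = {topic for topic, _ in counts_b.most_common(top_n)}
    return sorted(top_a & top_b)


# Invented mention lists standing in for publications and survey answers.
researcher_mentions = ["smart", "intelligent", "smart", "happy", "happy", "formal"]
participant_mentions = ["happy", "joyful", "happy", "smart", "smart", "rich"]

researcher_counts = topic_counts(researcher_mentions, WORD_TO_TOPIC)
participant_counts = topic_counts(participant_mentions, WORD_TO_TOPIC)

# top_n=2 only because the toy lists are tiny; the text uses the top 20.
candidates = shared_top_topics(researcher_counts, participant_counts, top_n=2)
```

The resulting `candidates` list is best read as a shortlist for human review rather than a final selection.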

Table 2 Final set of clothing-based inferences


Study 2: sources of variance in trait inferences from clothing

As described in our review of previous literature, psychological inferences from clothes often turn out to be inaccurate. The question remains to which degree people agree in their inferences, meaning whether independent raters would form the same inferences from the same piece of clothing (regardless of their accuracy). Alternatively, the source of variance in inferences might lie within the specific rater (as opposed to the clothes) leading to reliable patterns for an individual rater, but inconsistent ratings for a single piece of clothing when collected by different raters. To answer the question whether inferences lie ‘in the eye of the beholder’ versus in the characteristics of the observed clothes, Study 2 quantifies the contribution of both sources of variance. Our approach is similar to research that quantifies the interrater reliability for face-based inferences, which vary considerably across individual raters (e.g., [32]).

Additionally, quantifications of interrater agreement allow us to estimate how many raters are needed to generate stable average inferences for clothes. Intuitively, low interrater agreement entails the need to collect many inferences per piece of clothing, whereas high agreement allows for a lower number of raters for a stable average. Knowledge about this critical quantity is necessary for constructing a useful database of clothing-based inferences in Study 3. Similarly, it is useful to know how much noise can be expected in the averaged inference scores when training a statistical model to re-predict these scores as also planned for Study 3 [42]. That is, if the reliability of the averaged scores is low, then the achievable prediction accuracy will inevitably be low as well.
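The link between agreement, number of raters, and attainable prediction accuracy can be made concrete with two standard psychometric formulas. This is an illustration under assumptions the article does not state here: raters are treated as parallel measurements (Spearman-Brown), prediction error is assumed to be unrelated to rating noise (classical test theory), and the single-rater agreement values in the usage lines are hypothetical.

```python
import math


def reliability_of_mean(single_rater_icc, n_raters):
    """Spearman-Brown formula: reliability of the average of n parallel raters."""
    return n_raters * single_rater_icc / (1 + (n_raters - 1) * single_rater_icc)


def raters_needed(single_rater_icc, target_reliability):
    """Smallest number of raters whose averaged rating reaches the target."""
    if not 0 < single_rater_icc <= 1 or not 0 < target_reliability < 1:
        raise ValueError("icc must be in (0, 1] and target in (0, 1)")
    n = 1
    while reliability_of_mean(single_rater_icc, n) < target_reliability:
        n += 1
    return n


def max_observable_correlation(reliability):
    """Upper bound on the correlation of any predictor with a noisy criterion."""
    return math.sqrt(reliability)


# Hypothetical single-rater agreement levels (not values from the study).
n_low_agreement = raters_needed(0.06, target_reliability=0.8)
n_high_agreement = raters_needed(0.30, target_reliability=0.8)
ceiling = max_observable_correlation(0.64)
```

Under these assumptions, lower single-rater agreement translates directly into a larger required rater pool, and the reliability of the averaged labels caps how well any model can appear to predict them.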

Methods

In short, Study 2 answers two interrelated questions. The first is the question of the determinants of clothing-based inferences (i.e., do they lie in the clothes or the observer?), and the second is how many raters are needed to obtain reliable measures of clothing-based inferences. We answer the first question by collecting clothing-based inferences from independent raters, specifying ‘raters’ and ‘pieces of clothing’ as higher-level variables in a multi-level model, and estimating the relative variance explained by these variables (cf., methodology of [32]). In other words, we ask to what extent the ratings of clothing items are based on the differences between clothing items versus differences between individual raters.
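The idea of splitting rating variance into a clothing part and a rater part can be sketched with a simpler estimator than the multi-level model used in the study. The code below swaps in a method-of-moments decomposition from a two-way layout and assumes a fully crossed design (every rater rates every item, no missing cells), which differs from the study, where each participant rated only a subset of images. The toy ratings are constructed, not observed.

```python
def variance_shares(ratings):
    """Share of rating variance due to clothing items, raters, and residual.

    ratings[i][j] is the rating of clothing item i by rater j; the design must
    be fully crossed and the ratings must not all be identical.
    """
    n_items, n_raters = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n_items * n_raters)
    item_means = [sum(row) / n_raters for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n_items
                   for j in range(n_raters)]

    # Mean squares of the two-way layout without replication.
    ms_item = n_raters * sum((m - grand) ** 2 for m in item_means) / (n_items - 1)
    ms_rater = n_items * sum((m - grand) ** 2 for m in rater_means) / (n_raters - 1)
    ms_resid = sum(
        (ratings[i][j] - item_means[i] - rater_means[j] + grand) ** 2
        for i in range(n_items) for j in range(n_raters)
    ) / ((n_items - 1) * (n_raters - 1))

    # Method-of-moments variance components, truncated at zero.
    var_item = max((ms_item - ms_resid) / n_raters, 0.0)
    var_rater = max((ms_rater - ms_resid) / n_items, 0.0)
    total = var_item + var_rater + ms_resid
    return {"clothing": var_item / total,
            "rater": var_rater / total,
            "residual": ms_resid / total}


# Constructed toy data: small item effects, large rater effects, and a
# checkerboard residual that sums to zero in every row and column.
toy_ratings = [
    [3 + 0.5 * i + 1.5 * j + (0.5 if (i + j) % 2 == 0 else -0.5)
     for j in range(4)]
    for i in range(4)
]
shares = variance_shares(toy_ratings)
```

For unbalanced data such as those collected in Study 2, a crossed random-effects model is the more appropriate tool; the sketch only shows what "variance explained by clothing versus rater" means numerically.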

We answer the second question by iteratively including more raters per piece of clothing and judging at which point the confidence interval around a ‘true’ average inference (estimated based on a much larger sample of raters) becomes sufficiently small. This iteratively shrinking confidence interval has been labeled the corridor-of-confidence [33]. Naturally, a threshold for sufficiently narrow confidence intervals is somewhat subjective; therefore, we add a more intuitive, supporting metric: the correlation coefficient between subsample average inferences and the ‘true’ average inferences. A higher correlation indicates that the average inferences of the subsample are closer in line with the true average inferences. Iteratively increasing the number of raters also increases this correlation coefficient to a point where social scientists would evaluate it as a reliable measurement (here we chose r = 0.8). Further details are given in the “Results” section.
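A minimal sketch of the subsampling procedure behind such a corridor follows. It is an illustration rather than the authors' code: the ratings are simulated, and the function names, the number of random draws, and the subsample sizes are assumptions. For each subsample size, it draws random raters per item and records both the mean absolute deviation from the full-sample mean and the correlation with it.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def corridor(ratings_per_item, sizes, n_draws=50, seed=0):
    """Map each subsample size k to (mean abs. deviation, mean correlation)."""
    rng = random.Random(seed)
    full_means = [sum(r) / len(r) for r in ratings_per_item]
    result = {}
    for k in sizes:
        deviations, correlations = [], []
        for _ in range(n_draws):
            # Draw k raters without replacement for every item.
            sub_means = [sum(rng.sample(r, k)) / k for r in ratings_per_item]
            deviations.append(
                sum(abs(s - f) for s, f in zip(sub_means, full_means))
                / len(full_means)
            )
            correlations.append(pearson(sub_means, full_means))
        result[k] = (sum(deviations) / n_draws, sum(correlations) / n_draws)
    return result


# Simulated data: 50 items, 80 raters each, noisy ratings around item means.
data_rng = random.Random(1)
item_means = [data_rng.uniform(3, 7) for _ in range(50)]
ratings = [[m + data_rng.gauss(0, 2) for _ in range(80)] for m in item_means]

corridor_by_size = corridor(ratings, sizes=[2, 10, 25, 80])
```

Plotting the two values against the subsample size gives the two forms of the corridor; the size at which the correlation crosses the chosen threshold suggests how many raters to recruit.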

Rater sample

To quantify interrater reliability and a corridor of confidence for clothing-based inferences, we collected data from a labeling task with a sample of clothing items and a sample of raters. We collected responses from 400 raters (250 female, 146 male, 2 other, 2 missing; M_age = 35.4, SD = 12.6) using Prolific Academic.

Clothing sample

We obtained an initial sample of 5000 images of clothing items by scraping eight large shopping websites (The Gap, Topshop, Esprit, Primark, H&M, Zara, Prada, and Gucci). We chose websites representing the largest retailers from the USA and Europe from a diverse price range. Thus, our database represents the clothes commonly worn in Western countries at the time of data collection (autumn 2019). We further obtained images from Vestiaire Collective, a second-hand website, to account for psychologically unique signals from non-new clothes. We downloaded all available article images from the shopping websites using Python scripts primarily involving the selenium package for accessing web elements [56]. The supplementary materials include an annotated Python script showing how to download the images. We manually sorted out falsely included images not showing clothes. Table 3 depicts the distribution of the six most common clothing categories across the six most common colors.

Table 3 Distribution of common clothing categories and colors in the database


We used a white background for each image (unless the piece of clothing was white itself, in which case we used a gray background to enhance visibility). We only included upper and lower body outerwear, which are commonly visible in social interactions (i.e., we excluded underwear, socks, and swimwear). Further, we excluded shoes and accessories such as scarves, hats, and jewelry to minimize complexity in the dataset and the resulting strain on the prediction models. Out of the 5000 available photos, we used 200 randomly selected images in Study 2. We aimed to collect 80 ratings per image on all five traits, as 80 ratings clearly exceed the usual number of raters per stimulus in psychological research [17, 45, 52]. While there was some random dropout, we obtained a relatively stable number of ratings per image (minimum = 79, median = 82, maximum = 86).

Procedure

Each participant was presented with a subset of 40 randomly selected images. For each image, participants were asked to indicate their inference of each of the five selected attributes on a 10-point scale ranging from 0 (not at all) to 10 (very much). Afterwards, participants were asked about their belief in clothing-based inferences and face-based inferences respectively (7-point scales with three items from [36]). An example item is “I can learn something about a person’s personality just from looking at his or her face (/clothes)”. The rating task was written in Python and administered using oTree [15].

Results

Generally, people believed slightly more in the validity of clothing-based inferences than in the validity of inferences from neutral faces (d = 0.34, Welch’s t(401) = 6.038, p < 0.001). Also, the belief in clothing-based inference correlated positively with belief in face-based inference (r = 0.35, p < 0.001).

Regarding the clothing ratings, there were bell-shaped distributions of participant inferences as depicted in Fig. 1. There were no strong floor or ceiling effects, but sometimes a slight dominance of the neutral scale midpoint (e.g., for perceptions of sexual interest), likely indicating that there were many low-signal clothing items.

Regarding the level of interrater agreement, our results suggest that the level of agreement is comparable to the results of studies on agreement in face-based inferences [32]. In multilevel models predicting the clothing-based inferences, the variance explained by the respective piece of clothing ranged from 5.8% (trustworthiness) to 16.9% (sexual interest). The variance explained by individual rater tendencies was slightly higher, again mirroring findings from face-based research (see Fig. 2). For example, Hehman et al. [32] estimated that face stimuli explain only about 5–10% of rating variance in perceived intelligence, whereas interrater differences accounted for 3–4 times more variance. Our results suggest that when people infer psychological attributes from clothes, interrater differences are more important than the actual characteristics of the clothes.

To probe how many raters are needed to generate reliable mean inferences, we examined each trait’s corridor of confidence. That is, we took a subsample of raters and plotted the deviation of their mean inference against the full-sample mean inference (i.e., based on all 80 raters) across a range of sample sizes (with larger subsamples of raters naturally leading to smaller deviations). These deviations can be regarded as residuals indicating how far off the subsample of raters was from the actual average score. As such residual sizes are not normalized and might thus be difficult to interpret, we also provide an alternative form of the corridor of confidence by plotting the correlation between sub-sample inferences and full-sample inferences across the same range of sub-sample sizes. The corridors of confidence for the most reliable trait (sexual interest) and least reliable trait (trustworthiness) are depicted in Fig. 3.

As shown in all panels of Fig. 3, higher numbers of raters allowed for a better approximation of the average inferences obtained by the full sample of raters. With 25 raters (see the dotted lines), the mean’s deviation from the full-sample mean was usually less than 0.4 scale points on the 10-point scales (in our study, standard deviations were between 0.502 and 0.956). More intuitively, the average inference of 25 raters always correlated above 0.8 with the inferences of the full rater sample. As mentioned above, most social scientists would consider this a reasonably reliable approximation. Therefore, in Study 3, we decided to recruit 25 raters per image.

Study 3: building a database and testing the predictability of clothing-based inferences

In our final study, we collected data for a fully labeled clothing image database. Additionally, we aimed to test whether it is possible to train a statistical model to replicate psychological inferences from images. We employed a convolutional neural network (for an introduction to these models, see [7]), a predominant approach in image-based machine learning that commonly outperforms other image-based prediction models, conditional on large sample sizes [14, 47, 48]. Accordingly, research on clothing recognition, classification, and synthesis has predominantly relied on Deep Convolutional Neural Networks (DNNs; [49, 84, 89]). Consequently, we also targeted this family of models to address the current problem. However, building a DNN from scratch requires a large quantity of training data, which is often out of the budget of research projects. In such a case, a common practice is to deploy a pre-trained model with pre-initialized coefficients and to fine-tune it for a new task (e.g., AlexNet [43], ResNet [31], VGG [74]). Employing this technique enables DNNs to adequately learn new tasks with less training data. Thus, we used a pretrained network (built for a general object recognition task) and fine-tuned it for the task at hand [80].