Introduction

Internal bodily signals, such as hunger, thirst, fatigue, nausea, pain, temperature, and cardiac and respiratory signals, are essential for human survival, indicating the physiological state and functioning of the body (e.g. the sensation of thirst signalling the level of dehydration in the body). The ability to perceive and identify these internal sensations, known as interoception (Craig, 2003a), is fundamental to multiple psychological processes, such as emotional processing (e.g., Critchley & Garfinkel, 2017; Garfinkel & Critchley, 2013; Schachter & Singer, 1962; Seth, 2013), and learning and decision-making (Bechara & Damasio, 2002; Dunn et al., 2010; Werner et al., 2009). Furthermore, a growing body of research has linked interoception to mental health and subjective wellbeing; atypical perception of interoceptive states has been found in several mental health conditions and neurodevelopmental disorders, such as Eating Disorders (Klabunde et al., 2013; Pollatos et al., 2008), autism (Garfinkel et al., 2016; Hatfield et al., 2019; Mul et al., 2018; Nicholson et al., 2019), anxiety and Panic Disorder (Ehlers, 1993; Paulus & Stein, 2006; see Khalsa et al., 2018 for a review), depression (Dunn et al., 2007; Furman et al., 2013; Forrest et al., 2015; Harshaw, 2015; see Eggart et al., 2019 for a review), and schizophrenia (Ardizzi et al., 2016). Given the vital role of interoception in understanding typical emotion processing and learning and decision-making, as well as its atypicality in several mental health conditions, research on interoception and emotion has grown significantly in recent years.

While numerous studies have focused on the perception of interoceptive states in the self, very few (e.g., Kaulard et al., 2012) have researched the recognition of these states in others, beyond the domain of affective emotion (e.g., happiness, anger, sadness). Recognition of others’ affective emotional states (which feature an interoceptive component; Schachter & Singer, 1962) has been studied in detail in typical adulthood, in clinical samples, and across development; indeed, a PubMed search using the term “emotion recognition” generated 15,009 results. Recognition of others’ emotional states is crucial for successful social interactions, as well as for building and maintaining relationships, making it an important area for psychological research. Recognition of interoceptive states (beyond the affective domain) in others, including identifying others’ hunger, nausea, pain, and breathlessness, is presumably equally important for social interaction, and arguably more important from an evolutionary perspective, as identifying perturbations in these states is necessary in order to offer care and assistance to others. Similarly, studying the mechanisms behind the ability to recognise other people’s bodily sensations is crucial for improving our understanding of empathy for these states in others, with important theoretical and clinical implications. It is somewhat surprising, therefore, that research has thus far neglected to investigate this ability.

One reason for the dearth of research investigating the recognition of others’ non-affective internal states is presumably the lack of available stimuli. Compared to affective emotion recognition research, the lack of stimuli for the investigation of non-affective state recognition is striking. Since the publication of the “Pictures of Facial Affect” (Ekman & Friesen, 1976), the first standardised battery of facial emotion stimuli, several databases of visual stimuli depicting facial and bodily affective expressions have been developed (e.g., Beaupré et al., 2000; Langner et al., 2010; Lundqvist et al., 1998; Matsumoto & Ekman, 1988; Volkova et al., 2014; Wingenbach et al., 2016). Visual stimuli depicting facial and bodily expressions of affective states have been a key component of emotion research, and substantially contributed to our knowledge of affective and cognitive neuroscience, and social and clinical psychology. A purpose-built battery of stimuli depicting non-affective interoceptive states in others will enable research on social cognition to investigate the ability to perceive and recognise these signals in others. This will lead to an expansion of our theoretical understanding of the constructs of interoception and social perception in typical adult populations, developmental samples and clinical groups, both at the behavioural and neurological levels.

This report presents the development and validation of the Interoceptive States Static Images (ISSI) database, a set of full-body static images of actors expressing either a non-affective interoceptive state or a control action, which is freely available to other researchers. The battery consists of 423 stimuli, depicting eight actors expressing various exemplars of nine internal states and nine control actions. All photos were taken from a frontal view in a controlled environment, and underwent a standardised image-processing procedure to control for lighting conditions, size, position, and background. Stimuli were validated in two stages, the first utilising free labelling and the second utilising forced-choice label selection with quality ratings. Recognition data for each individual stimulus and for the state and control actions overall, including the extent to which they are confused with each other, are provided.

Methods

Stimulus development

Actors

Eight trained actors (four female) aged 22 to 48 were recruited through online and campus advertisements. Neither ethnicity nor first language was specified as a recruitment criterion, although both were recorded; no specific ethnic group was targeted and no actors were excluded on the basis of ethnicity. Recruitment ceased once the required number of actors was reached, and all actors who responded to the recruitment call reported being of Caucasian ethnicity. Actors were either drama students or had previously completed acting training. Actors were informed about the procedure and the purpose of the stimulus set, and gave their consent to take part in the recording session and for their images to be used in scientific research, presented at conferences, published in academic journal articles, and shared with other researchers. Actors received financial remuneration for their time.

Procedure

Prior to the recording session, actors were provided with the list of the internal states and actions they would be required to perform, and were asked to practise depicting each state or action in advance. During the recording session, they were required to wear black trousers, black socks, and a black t-shirt. Female actors were asked not to wear make-up and to tie their hair back to ensure their face was completely visible at all times. Photos were taken in a purpose-built photography studio. Actors stood in a specified position in the centre of a white background, facing a camera placed on a tripod. Softbox LED lighting was used to control lighting conditions across different shooting sessions, and to reduce shadows.

Actors first produced ten control actions (jumping, clapping, lifting, running, washing hands, spinning/twirling, stumbling, walking, waving, beckoning) and then expressed ten non-emotional internal states (cold, fatigue, nausea, pain, breathlessness, hunger, thirst, hot, satiety, itch). For each stimulus category, the actor was asked to practise before posing the state or control action five separate times; these five poses served as different exemplars of the same stimulus. Between attempts, the actor was asked to reset to a neutral body position and to re-position in the middle of the background. Between stimulus categories, a longer break was given to allow actors to rest and prepare for the next category. The order of stimulus production was fixed and did not vary across actors.

Image processing

Raw photos were edited in Adobe Photoshop 2019. The backdrop was replaced with an artificial white matte background. Image artefacts and distracting visual information (e.g. tattoos) were removed. Brightness and contrast were adjusted and standardised across images. Sharpness was increased using the Smart Sharpen filter (amount: 309%; radius: 0.6 pixels; noise reduction: 100%; lens blur removal).

Image size and actor position were matched across stimuli using a 3456 × 5184 pixel white template. The first stimulus was positioned in the centre of the template and scaled to the desired size, such that the actor subtended 12° of visual angle vertically when viewed at 60 cm. Guidelines were drawn to delimit the boundaries of the actor in this position (the extremes of the head and feet on the vertical axis, and the extremes of the right and left shoulders on the horizontal axis) and to provide a frame of reference for all subsequent stimuli. For each actor, images were layered onto the original template and the size and position of the actor were adjusted to fit these guidelines. Each layered image was saved as a new file.
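For reference, the on-screen actor height implied by this specification follows from the standard visual-angle relation (a derivation added here for clarity, not stated in the original):

$$ h \;=\; 2d \tan\!\left(\frac{\theta}{2}\right) \;=\; 2 \times 60\ \mathrm{cm} \times \tan(6^{\circ}) \;\approx\; 12.6\ \mathrm{cm} $$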

Stimuli validation

All stimuli went through a pre-selection process by the researchers based on basic visual properties. Photos in which parts of the actor’s body were missing (e.g. the head extending beyond the top edge of the picture in some ‘jumping’ exemplars), or in which motion blur could not be resolved through editing, were removed from the database. This resulted in all stimuli depicting the control action ‘stumbling’ being removed, due to a high proportion of images including several of these issues. To retain an equal number of control actions and internal states, stimuli depicting ‘thirst’ were also removed: actors reported that this state was difficult to portray because of the lack of a visible behavioural response to thirst, and the authors agreed that the stimuli were not recognisable as depicting thirst. For each actor and each stimulus category, the four exemplars with the highest visual quality, as judged by the researchers, were selected for inclusion in the first validation task, yielding a total of 560 stimuli.

Stimulus selection: Free-labelling task

Forty participants (four male) aged 18–30 years (M = 19.05, SD = 2.68) were recruited through the Royal Holloway, University of London (RHUL) SONA System to take part in a free-labelling task. Participants were all students at RHUL and received course credits for their participation. There were no exclusion criteria for this task, although any diagnosis of a mental health condition was recorded. A general description of the task procedure was provided, but participants were not informed of the aim of the study until the end of the session, to avoid influencing their responses. Stimuli were divided into two sets of 280 images, and each participant viewed one of the two sets, in order to reduce fatigue. Instructions were standardised across participants and were provided verbatim by the experimenter as follows:

“You will see a series of body postures, one by one. For each one, you need to provide a very brief description of what you think the body posture represents (for example what the person is doing, thinking or feeling). There will be many stimuli, so it’s very important that you keep your answers as brief as possible. Ideally, you will use a single word or a short phrase. For example, if you see an image depicting a person sneezing, you can simply answer ‘sneezing’. If you think that the person could be doing, thinking or feeling more than one thing, you can give multiple answers, but please try to keep the description of each one brief. If I need more details, I will ask for them. There are no right or wrong answers, so I will not provide any feedback during or after the session. I will simply record your answers and occasionally intervene if I think something is not clear or if I need more details.”

Following the instructions, participants were invited to ask any questions they had about the procedure. The experimenter then sat a few metres behind the participant and typed their responses verbatim. When additional information was required, the experimenter used standardised phrases to prompt the participant: if an answer required more detail, “Can you tell me more about that?”; if an answer was ambiguous or unclear, “Can you be more specific?” or “Can you tell me what you mean by that?”; and if a response was too verbose, “Try to use single words or short phrases”. The task took approximately 20 min to complete.

Stimulus validation: Label selection and rating task

Based on the results of the free-labelling task, 423 stimuli were selected for use in the second step of validation (details of the selection procedure can be found in the Results section). Of these, 202 stimuli depicted nine internal states (breathlessness, cold, fatigue, hot, hunger, itch, nausea, pain, satiety) (Fig. 1a) and 221 stimuli depicted nine control actions (beckoning, clapping, jumping, lifting, running, twirling, walking, washing hands, waving) (Fig. 1b). Participants were recruited from the RHUL SONA System, the Testable Minds database (www.testable.org), and through advertisements on social media. A total of 412 participants (169 female) aged 18–71 years (M = 30.08, SD = 10.56) with no diagnosis of any mental health condition took part in an online labelling task. The task was programmed in Testable and presented participants with a random sample of 100 stimuli. On each trial, a single image was presented in the centre of the computer screen and remained visible until participants had finished responding. Participants were provided with a list of the nine internal state and nine action labels (presented in alphabetical order) and asked to select the label that best described the image. If they were unsure, participants could select more than one label, or skip to the next trial if they thought no label applied. Following label selection, participants were prompted to rate how well each chosen label described the image, using a five-point Likert scale (Very Poorly; Poorly; Moderately; Well; Very Well). The task took approximately 30 min to complete.

Fig. 1 Examples of stimuli from the ISSI database. a Examples of internal state stimuli. b Examples of control action stimuli

Results

Free-labelling task

Participants’ responses in this stage were analysed qualitatively. First, two coders independently coded the responses for accuracy (i.e., identification of the intended state or action). A score of 1 was given to responses which either correctly identified the state or action or, for state stimuli, correctly described the portrayed action associated with the state (e.g. for the ‘fatigue’ stimuli, both ‘tired’ and ‘yawning’ were considered correct responses). A score of 0 was given to inaccurate responses (e.g. ‘hot’ or ‘shocked’ to describe a ‘breathlessness’ stimulus). In instances where coders disagreed, responses were discussed by all authors until agreement was reached. Inter-coder agreement was almost perfect (κ = .81).
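A minimal sketch of this agreement check, assuming a long-format table of the two coders’ binary accuracy codes (the file and column names are hypothetical, not the authors’ materials):

```python
# Cohen's kappa between two independent coders' binary accuracy codes.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# One row per response; coder_1 and coder_2 hold the 0/1 accuracy codes
# each coder assigned independently (assumed schema).
codes = pd.read_csv("free_labelling_codes.csv")
kappa = cohen_kappa_score(codes["coder_1"], codes["coder_2"])
print(f"Cohen's kappa = {kappa:.2f}")  # the paper reports .81
```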

Each stimulus was given a recognisability index (RI), which corresponded to its mean accuracy score. Overall, internal state and action stimuli were recognised correctly 65% and 75% of the time, respectively. Of the internal states, itch (M = 88%, SD = 13%, range 55–100%) and cold (M = 88%, SD = 15%, range 55–100%) were the best recognised states, while hunger was the least well recognised state (M = 22%, SD = 8%, range 10–40%). Among the control action stimuli, walking was the best recognised action (M = 91%, SD = 12%, range 55–100%), whereas beckoning stimuli were the least well recognised (M = 49%, SD = 11%, range 30–75%). See Table 1 for a full summary of RIs.
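The RI computation amounts to a per-stimulus mean of the binary accuracy codes; a sketch under an assumed long-format schema (file and column names are hypothetical):

```python
# Recognisability index (RI): mean coded accuracy per stimulus,
# then summarised per stimulus category.
import pandas as pd

# Columns: stimulus_id, category, accuracy (0/1, after coder agreement).
resp = pd.read_csv("coded_responses.csv")

ri = resp.groupby(["category", "stimulus_id"])["accuracy"].mean()
summary = ri.groupby(level="category").agg(["mean", "std", "min", "max"])
print(summary)  # e.g. itch M = .88, hunger M = .22 in the paper
```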

Table 1 Recognisability Indices (RI) for each Internal State and Action category. RIs represent the proportion of recognition accuracy in the free-labelling task (Stage 1)

Based on the RI, each stimulus was categorised into one of five recognisability categories: Very poor (RI 0.0–0.2), Poor (RI 0.21–0.4), Average (RI 0.41–0.6), Good (RI 0.61–0.8), and Very good (RI 0.81–1). All stimuli categorised as Very good, Good, or Average were kept in the final database. In addition, we retained a minimum of two exemplars per actor for each stimulus category: for stimulus categories where fewer than two stimuli for a given actor were categorised as Very good, Good, or Average, the two stimuli with the highest RI were retained. See Appendix Table 3 for RI scores for every retained stimulus. A final set of 423 stimuli was retained and used in the second stage of validation (Footnote 1). In this final stimulus set, 209 stimuli depicted male actors, whilst 214 depicted female actors. Each of the eight actors was present in at least 50 stimuli, and the most depicted actor appeared in 56 stimuli.
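Continuing the sketch above, the banding and the primary retention rule might be expressed as follows (the two-exemplars-per-actor safeguard is noted in a comment but not implemented):

```python
# Bin each stimulus's RI into the five recognisability categories and
# flag stimuli meeting the primary retention criterion.
import pandas as pd

bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
labels = ["Very poor", "Poor", "Average", "Good", "Very good"]
ri_band = pd.cut(ri, bins=bins, labels=labels, include_lowest=True)
retained = ri_band.isin(["Average", "Good", "Very good"])
# NB: the paper additionally retains the two highest-RI exemplars per
# actor and category when fewer than two pass this criterion.
```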

Label selection and rating task

Quality and Accuracy Scores

Each stimulus was rated by a mean of 97 participants (Min = 74, Max = 123). There are multiple ways in which the validity and quality of stimuli can be defined, so, to allow researchers to select stimuli based on their own requirements, a comprehensive range of stimulus measures was created and is provided below. For each stimulus, five separate scores were calculated: the quality index (QI); the specificity index (SI); the maximum-distractor specificity index (SI+); the choice rate (CR); and the high-quality choice rate (CR+) (Table 2). The scores were calculated based on the ratings of the whole sample (both female and male observers), as well as on the ratings of female and male observers separately.

Table 2 Summary of scores. T = target; D = distractor. In the formulae for QI, SI, and SI+, T and D correspond to a value between 0 and 5 (participants’ ratings of how well a stimulus depicts a given state label). In the formulae for CR and CR+, T and D correspond to a binary value, 0 or 1 (indicating whether the label was selected (1) or not (0)). n = total number of stimulus ratings across all participants. i = ‘for all individual stimulus ratings across all participants’
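The verbal definitions in Table 2 and in the following paragraph are consistent with the formalisation below (a reconstruction offered for clarity, not the authors’ exact notation), where $T_i$ is the rating of the target label on the $i$-th of $n$ stimulus ratings ($T_i = 0$ if the target was not selected) and $D_{ij}$ are the ratings of the selected distractor labels:

$$\mathrm{QI} = \frac{1}{n}\sum_{i=1}^{n} T_i, \qquad \mathrm{SI} = \frac{1}{n}\sum_{i=1}^{n}\left(T_i - \overline{D_i}\right), \qquad \mathrm{SI^{+}} = \frac{1}{n}\sum_{i=1}^{n}\left(T_i - \max_j D_{ij}\right),$$

$$\mathrm{CR} = \frac{100\%}{n}\sum_{i=1}^{n} \mathbb{1}\left[T_i > 0\right], \qquad \mathrm{CR^{+}} = \frac{100\%}{n}\sum_{i=1}^{n} \mathbb{1}\left[T_i > \max_j D_{ij}\right].$$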

The QI is a score ranging between 0 and 5, computed by taking the mean (across all stimulus ratings) of the quality judgements given to the target (intended) label; a score of 0 was assigned whenever the target label was not selected. The QI therefore reflects the extent to which the target label is perceived as describing the image well: high QI scores indicate that the target label describes the image very well, whereas lower QI scores indicate that it does not.

The SI reflects the extent to which the target label is perceived as a good description of the image, over and above distractor state or action labels. SI was computed by subtracting the mean rating given to selected distractor labels from the rating given to the target label, and taking the mean of these values across all stimulus ratings. SI values range between – 5 and 5: negative values indicate that the target label received a lower rating than the distractor labels taken together, whereas positive values signify that the target label received a higher rating than the distractor labels taken together.

The SI+ was obtained by subtracting the highest distractor rating from the rating given to the target label, and taking the mean of these values across all stimulus ratings. SI+ also ranges between – 5 and 5: negative values indicate that distractor labels were given higher ratings than the target label, whilst positive values indicate that the target label received a higher rating than the highest-rated distractor. The SI and SI+ are more conservative scores than the QI, as they take into account the discrepancy between ratings of intended and unintended labels; values of SI and SI+ close to 0 indicate that the target label is not perceived to be a better description of the stimulus than the distractor labels.

The CR is the proportion of participants who selected the target label to describe the stimulus, regardless of the quality rating given. CR scores range between 0% and 100%, whereby 0% indicates that the target label was never selected to describe the image and 100% indicates that it was always selected.

The CR+ is the proportion of participants who gave the target label the highest quality rating of all labels. CR+ was calculated by assigning a score of 1 to those responses in which the target label received the highest quality rating; whenever a distractor obtained a quality rating equal to or higher than the target, a score of 0 was assigned. A CR+ score of 0% indicates that the target label was never rated higher than the distractor labels, and a score of 100% indicates that the target label always received the highest rating. All five scores are presented for each stimulus in Table 3 in the Appendix.
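As an illustration, the five scores could be computed from trial-level data as follows (a sketch under an assumed data structure, not the authors’ code; how unselected distractors enter SI and SI+ is not fully specified in the text, so treating them as contributing 0 is an assumption flagged in the comments):

```python
# Compute QI, SI, SI+, CR, and CR+ for one stimulus from trial-level
# ratings. Each trial is a dict with the intended label and a mapping
# from each selected label to its 1-5 quality rating (assumed schema).
import numpy as np

def stimulus_scores(trials):
    qi, si, si_plus, cr, cr_plus = [], [], [], [], []
    for t in trials:
        target = t["ratings"].get(t["target"], 0)   # 0 if not selected
        distractors = [r for lab, r in t["ratings"].items()
                       if lab != t["target"]]
        mean_d = np.mean(distractors) if distractors else 0  # assumption
        max_d = max(distractors) if distractors else 0       # assumption
        qi.append(target)
        si.append(target - mean_d)
        si_plus.append(target - max_d)
        cr.append(int(t["target"] in t["ratings"]))
        cr_plus.append(int(target > max_d))  # ties score 0, per the text
    return {"QI": np.mean(qi), "SI": np.mean(si),
            "SI+": np.mean(si_plus),
            "CR": 100 * np.mean(cr), "CR+": 100 * np.mean(cr_plus)}

# Example: one clean trial, one trial confused with a distractor.
print(stimulus_scores([
    {"target": "cold", "ratings": {"cold": 5}},
    {"target": "cold", "ratings": {"cold": 3, "pain": 4}},
]))  # QI 4.0, SI 2.0, SI+ 2.0, CR 100.0, CR+ 50.0
```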

Whole-sample analyses revealed that QI scores were higher for action stimuli (M = 3.62, SD = .55) than for internal state stimuli (M = 3.22, SD = 1.02) [t(421) = – 5.09, p < .001]. Separate ANOVAs were conducted for QI of internal states and QI of control actions, with Stimulus Category (all internal state/action stimulus categories) and Actor Sex (male, female) as IVs. For the internal states, a significant main effect of Stimulus Category [F (17, 184) = 62.97, p < .001, η2 = .73] was found. Cold received the highest QI (M = 4.2, SD = .35); conversely, the lowest QI was attributed to hunger (M = 1.67, SD = .44) (Fig. 2a). Post hoc t tests were conducted across all pairs of states with Bonferroni corrections and are shown in Fig. 2a. Similarly, the ANOVA for the action stimuli resulted in a significant main effect of Stimulus Category [F (17, 203) = 2.63, p = .009, η2 = .09]. Clapping stimuli had the highest QI (M = 3.8, SD = .59), while the mean QI for twirling was the lowest of the action stimulus set (M = 3.18, SD = .60) (Fig. 2b). Post hoc t tests comparing all pairs of actions, with Bonferroni corrections, are shown in Fig. 2b. Actor Sex did not contribute to variations in QI for either internal states [F (17, 184) = .06, p = .81] or control actions [F (17, 203) = .49, p = .48], and did not interact with Stimulus Category for either internal states [F(17, 184) = 1.75, p = .09] or control actions [F(17, 203) = 1.38, p = .21].
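A sketch of one such analysis (the QI ANOVA for the internal-state stimuli) using statsmodels; the per-stimulus table, file name, and column names are assumptions for illustration:

```python
# Two-way ANOVA: QI ~ Stimulus Category x Actor Sex, internal states.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

stim = pd.read_csv("stimulus_scores.csv")            # hypothetical file
states = stim[stim["stimulus_set"] == "internal_state"]
model = smf.ols("qi ~ C(category) * C(actor_sex)", data=states).fit()
print(anova_lm(model, typ=2))  # main effects and interaction
```

Bonferroni-corrected pairwise comparisons of the kind reported in Fig. 2 could then be obtained by, for example, applying scipy.stats.ttest_ind to each pair of categories and multiplying each p value by the number of comparisons.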

Fig. 2 Distribution of Quality Index (QI) scores across different Stimulus Categories of Internal States (a) and Control Actions (b). The boxplots for each state and action are presented. Individual stimuli are plotted as single data points over the boxplot. Both graphs are presented alongside tables of post hoc t tests showing the mean difference (row - column) for each pair of Internal States (panel a) and Control Actions (panel b). Asterisks denote statistical significance at alpha level of .001 (**) and .05 (*) after Bonferroni corrections. The p value before Bonferroni correction is reported in italics below the mean difference value

SI scores were higher for action stimuli (M = 3.01, SD = .86) than internal state stimuli (M = 1.90, SD = 1.84) [t(421) = – 8.08, p < .001]. Again, separate ANOVAs were conducted for the action and internal state stimulus sets, with SI as the DV and Stimulus Category and Actor Sex as IVs.

For the internal states, the main effect of Stimulus Category was significant [F (17, 184) = 63.197, p < .001, η2 = .73]. Cold had the highest SI (M = 3.71, SD = .50), while SI was lowest for satiety (M = – 1.07, SD = .91) (Fig. 3a). Post hoc t tests for Stimulus Category using Bonferroni corrections are shown in Fig. 3a. There was a significant main effect of Actor Sex [F (17, 184) = 5.04, p = .02, η2 = .03], whereby SI scores were higher for stimuli depicted by female actors (M = 1.99, SD = 1.77) than those portraying male actors (M = 1.81, SD = 1.90). Actor Sex did not interact significantly with Stimulus Category [F (17, 184) = 1.96, p = .054]. For the action stimulus set, a significant main effect of Stimulus Category [F (17, 203) = 4.76, p < .001, η2 = .16] was observed. Beckoning and twirling had the highest (M = 3.36, SD = .58) and lowest (M = 2.21, SD = 1.14) SIs, respectively (Fig. 3b). Post hoc t tests using Bonferroni corrections were conducted on Stimulus Category and are reported in Fig. 3b. The main effect of Actor Sex [F (17, 203) = .41, p = .52] and the interaction between Actor Sex and Stimulus Category [F (17, 203) = 1.67, p = .11] were non-significant.

Fig. 3 Distribution of Specificity Index (SI) scores across different Stimulus Categories of Internal States (a) and Control Actions (b). The boxplots for each state and action are presented. Individual stimuli are plotted as single data points over the boxplot. Both graphs are presented alongside tables of post hoc t tests showing the mean difference (row - column) for each pair of Internal States (panel a) and Control Actions (panel b). Asterisks denote statistical significance at alpha level of .001 (**) and .05 (*) after Bonferroni corrections. The p value before Bonferroni corrections is reported in italics below the mean difference value

SI+ was significantly higher for action stimuli (M = 2.99, SD = .87) than for internal state stimuli (M = 1.82, SD = 1.89) [t(421) = – 8.21, p < .001]. Separate ANOVAs were conducted for the action and internal state stimulus sets, with Stimulus Category and Actor Sex as IVs and SI+ scores as the DV. A significant main effect of Stimulus Category was found for internal state stimuli [F (17, 184) = 63.797, p < .001, η2 = .735]. Cold and Satiety had the highest (M = 3.68, SD = .51) and lowest (M = – 1.25, SD = .95) SI+, respectively (Fig. 4a). Bonferroni-corrected post hoc t tests were conducted on all the levels of Stimulus Category and are reported in Fig. 4a. A significant main effect of Actor Sex was found [F (17, 184) = 5.35, p < .05, η2 = .03], whereby stimuli depicting female actors (M = 1.93, SD = 1.82) received slightly higher SI+ scores than those depicting male actors (M = 1.73, SD = 1.96). Actor Sex did not interact with Stimulus Category [F (17, 184) = 1.93, p = .06]. The ANOVA for SI+ scores of action stimuli resulted in a significant main effect of Stimulus Category [F (17, 203) = 4.87, p < .001, η2 = .16]. Beckoning was the category to receive the highest SI+ scores (M = 3.34, SD = .58), whilst Twirling received the lowest SI+ scores (M = 2.17, SD = 1.16) (Fig. 4b). Post hoc t tests for each pair of action categories were conducted, using Bonferroni corrections (Fig. 4b). Actor Sex did not contribute to variation in total SI+ scores [F (17, 203) = .38, p = .54] and did not interact with Stimulus Category [F (17, 203) = 1.65, p = .11].

Fig. 4 Distribution of Max-distractor Specificity Index (SI+) scores across different Stimulus Categories of Internal States (a) and Control Actions (b). The boxplots for each state and action are presented. Individual stimuli are plotted as single data points over the boxplot. Both graphs are presented alongside tables of post hoc t tests showing the mean difference (row - column) for each pair of Internal States (panel a) and Control Actions (panel b). Asterisks denote statistical significance at alpha level of .001 (**) and .05 (*) after Bonferroni corrections. The p value before post hoc corrections is reported in italics below the mean difference value

CR scores showed that participants selected the target label to describe action stimuli (M = 86%, SD = 8%) significantly more often than internal state stimuli (M = 77%, SD = 20%) [t(421) = – 6.10, p < .001]. ANOVAs were computed for CR scores of internal states and control actions separately, with Stimulus Category and Actor Sex as IVs. For the internal states, a significant main effect of Stimulus Category [F (17, 184) = 73.89, p < .001, η2 = .76] was observed. Cold was the state with the highest CR (96%), whilst satiety had the lowest CR (40%) (Fig. 5a). Post hoc t tests using Bonferroni corrections are displayed in Fig. 5a. There was no main effect of Actor Sex [F (17, 184) = .09, p = .93] or interaction between Actor Sex and Stimulus Category [F (17, 184) = 1.08, p = .38]. The ANOVA for the action stimuli resulted in a significant main effect of Stimulus Category [F (17, 203) = 4.03, p < .001, η2 = .14]. Clapping had the highest CR (89%), whereas twirling was the action with the lowest CR (78%) (Fig. 5b). Post hoc t tests with Bonferroni corrections across all categories are shown in Fig. 5b. Actor Sex did not contribute to variations in CR [F (17, 203) = 1.25, p = .26] or interact with Stimulus Category [F (17, 203) = .83, p = .58].

Fig. 5 Distribution of Choice Rate (CR) scores across different Stimulus Categories of Internal States (a) and Control Actions (b). The boxplots for each state and action are presented. Individual stimuli are plotted as single data points over the boxplot. Both graphs are presented alongside tables of post hoc t tests showing the mean difference (row - column) for each pair of Internal States (panel a) and Control Actions (panel b). Asterisks denote statistical significance at alpha level of .001 (**) and .05 (*) after Bonferroni corrections. The p value before post hoc corrections is reported in italics below the mean difference value

Finally, CR+ scores for action stimuli (M = 84%, SD = 10%) were significantly higher than CR+ scores for internal state stimuli (M = 69%, SD = 25%) [t(421) = – 8.56, p < .001]. Once again, separate ANOVAs were conducted for the CR+ scores of action and internal state stimuli, with Stimulus Category and Actor Sex as IVs. For the internal states, a main effect of Stimulus Category was found [F (17, 184) = 61.80, p < .001, η2 = .73]. Cold stimuli received the highest (M = 92%, SD = 6%) CR+ scores, whilst Satiety stimuli had the lowest CR+ scores (M = 30%, SD = 12%) (Fig. 6a). Bonferroni-corrected post hoc t tests across all pairs of states are shown in Fig. 6a. A significant main effect of Actor Sex was found [F (17, 184) = 5.77, p < .05, η2 = .03], whereby internal state stimuli depicting female actors (M = 70%, SD = 24%) received slightly higher CR+ scores than those depicting male actors (M = 67%, SD = 25%). Finally, Actor Sex did not interact significantly with Stimulus Category [F (17, 184) = 1.62, p = .12]. The ANOVA for the action stimuli returned a significant main effect of Stimulus Category [F (17, 203) = 6.27, p < .001, η2 = .198], whereby Walking and Twirling received the highest (M = 89%, SD = 8%) and lowest (M = 75%, SD = 14%) CR+ scores, respectively (Fig. 6b). Post hoc t tests with Bonferroni corrections across all pairs of actions are shown in Fig. 6b. The effect of Actor Sex on variations in CR+ scores did not reach statistical significance [F (17, 203) = .41, p = .52]. Likewise, Actor Sex did not interact with Stimulus Category [F (17, 203) = 1.71, p = .10].

Fig. 6 Distribution of High-quality Choice Rate (CR+) scores across different Stimulus Categories of Internal States (a) and Control Actions (b). The boxplots for each state and action are presented. Individual stimuli are plotted as single data points over the boxplot. Both graphs are presented alongside tables of post hoc t tests showing the mean difference (row - column) for each pair of Internal States (panel a) and Control Actions (panel b). Asterisks denote statistical significance at alpha level of .001 (**) and .05 (*) after Bonferroni corrections. The p value before post hoc corrections is reported in italics below the mean difference value

To investigate the effect of observer gender on the evaluation of internal state stimuli, separate ANOVAs were conducted for the five recognition indices with Stimulus Category (all the internal states) and Observer Gender (female, male) as factors. The ANOVA with QI as DV did not reveal a main effect of Observer Gender [F(17, 386) = .57, p = .45, η2 = .001]. Moreover, Observer Gender did not interact significantly with Stimulus Category [F(17, 386) = .77, p = .63, η2 = .02]. The ANOVA for SI scores returned a main effect of Stimulus Category [F(17, 386) = 111.41, p < .001, η2 = .698], and a main effect of Observer Gender [F(17, 386) = 10.74, p = .001, η2 = .03], whereby female observers (M = 2.09, SD = 1.83) had higher SI indices than male observers (M = 1.76, SD = 1.90). Observer Gender did not interact significantly with Stimulus Category [F(17, 386) = .57, p = .80, η2 = .01]. The ANOVA for SI+ scores revealed a main effect of Stimulus Category [F(17, 386) = 112.55, p < .001, η2 = .70], and a main effect of Observer Gender [F(17, 386) = 11.25, p = .001, η2 = .03], with female observers (M = 2.03, SD = 1.88) having higher SI+ indices than male observers (M = 1.67, SD = 1.95), but no interaction between Observer Gender and Stimulus Category [F(17, 386) = .54, p = .82, η2 = .01]. The ANOVA for CR scores resulted in a main effect of Stimulus Category [F(17, 386) = 123.91, p < .001, η2 = .72], and a main effect of Observer Gender [F(17, 386) = 7.13, p = .008, η2 = .02], where female observers (M = 95.53, SD = 4.65) had slightly higher CR scores than males (M = 94.87, SD = 4.87). The interaction between the two factors was non-significant [F(17, 386) = .68, p = .708, η2 = .01]. Finally, the ANOVA for CR+ scores returned a main effect of Stimulus Category [F(17, 386) = 108.03, p < .001, η2 = .691], and a main effect of Observer Gender [F(17, 386) = 5.66, p = .02, η2 = .01], with higher CR+ scores in female observers (M = 72.59, SD = 23.70) than male observers (M = 69.34, SD = 24.50), but no interaction between Observer Gender and Stimulus Category [F(17, 386) = .56, p = .812, η2 = .01].

Confusion across stimulus categories

In order to determine which states/actions were confused with each other, confusion scores were created based on CR and CR+ scores. Confusion matrices were created whereby each row corresponds to the intended state or action portrayed by the actor, and each column represents the proportion of times each state or action label was selected regardless of quality rating in the CR matrix, and the proportion of times each state or action label was given the highest quality rating in the CR+ matrix. Among the internal states, some categories were particularly confused with others; Hunger stimuli were often rated as depicting Pain (CR = 46%; CR+ = 20%) and Nausea (CR = 40%; CR+ = 16%), Satiety stimuli were also rated as depicting Hunger (CR = 39%; CR+ = 25%) and Nausea (CR = 36%; CR+ = 17%), and Nausea stimuli were often rated as depicting Pain (CR = 30%; CR+ = 7%) (Fig. 7a). On the other hand, the confusion matrix for action stimuli revealed lower levels of confusion (i.e. target actions were less often labelled as non-target actions). Clapping stimuli were sometimes labelled as depicting Washing Hands (CR = 22%; CR+ = 10%), Running stimuli were also rated as depicting Walking (CR = 23%; CR+ = 11%), Twirling stimuli were sometimes rated as depicting Jumping (CR = 16%; CR+ = 6%), Waving stimuli were also rated as depicting Beckoning (CR = 14%; CR+ = 6%), and Beckoning stimuli were occasionally labelled as depicting Waving (CR = 10%; CR+ = 4%) (Fig. 7b).
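A sketch of how the CR confusion matrix could be assembled from trial-level selections (hypothetical file and column names; rows need not sum to 100% because participants could select multiple labels on a trial):

```python
# CR-style confusion matrix: proportion of trials on which each label
# was selected, per intended state/action.
import pandas as pd

# One row per label selected on a trial:
# columns: trial_id, intended, selected (assumed schema).
sel = pd.read_csv("label_selections.csv")

n_trials = sel.groupby("intended")["trial_id"].nunique()
counts = pd.crosstab(sel["intended"], sel["selected"])
cr_matrix = 100 * counts.div(n_trials, axis=0)
print(cr_matrix.round(0))  # e.g. hunger row: pain ~46, nausea ~40
```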

Fig. 7 Confusion matrices showing the proportion of the time that each label was used to describe stimuli of each intended state (Choice Rate (CR) matrix) and the proportion of the time that each label was given the highest quality rating to describe stimuli of each intended state (High-quality Choice Rate (CR+) matrix). Confusion matrices are presented separately for Internal States (panel a) and Control Actions (panel b)

Discussion

The current report presents the creation and validation of the ISSI database, a novel stimulus set of 423 static images representing non-affective internal bodily states and control actions. Each stimulus is presented alongside a range of indices from the second stage of validation, representing the quality and specificity of depiction, and the extent to which each stimulus was recognised as the intended state or action. Confusion matrices of internal states and control actions are also included to provide an indication of which states and which actions tend to be confused with each other. The stimuli are freely available to researchers for their use in scientific research and can be downloaded from the Insulab website (https://www.insulab.uk).

Overall, 77% (range: 40–96%) of participants selected the intended label to describe the internal state stimuli, and 86% (range: 78–89%) of participants selected the intended label to describe action stimuli. When observer gender was considered, female observers gave higher ratings and were more likely to select the intended label for the stimuli compared to male observers. Within the internal state stimulus set, there was high variability between stimulus categories in terms of quality and specificity of depiction, and the proportion of participants selecting the target state, with the pattern of results across different indices being relatively consistent. Satiety was the most difficult state to recognise and discriminate from other states, followed by hunger. Hunger and satiety stimuli were given fairly low quality (QI) scores, with the majority being given a mean score below 2 (‘Poor’ on the rating scale), and negative specificity (SI and SI+) scores, indicating that distractor labels were often judged to be better descriptors of the stimulus than the target label. Similarly, CR and CR+ scores were often under 50%, indicating that the target label was selected to describe the stimulus (CR), or as the best descriptor of the stimulus (CR+), less than half of the time. Other internal state categories, however, were given high quality and specificity ratings, and the intended label was selected frequently. The vast majority of cold, itch, pain, fatigue and nausea stimuli, for example, were given QI scores above 3, positive SI/SI+ scores, and CR/CR+ scores above 70%. While there is therefore variability across internal state categories and individual stimuli, all stimuli rated in the second validation stage have been retained in the final stimulus set, so that researchers can select stimuli according to their own research requirements. While we would recommend using stimuli with high quality, specificity and choice rate scores where studies require stimuli that have been validated and are recognisable by typical participants as their intended state, the range of recognition scores also allows for the study of ambiguous stimuli, or of internal states that are easily confused. Notably, action stimuli were consistently recognised better than internal state stimuli, and there was less variability among action stimulus categories in terms of quality and specificity ratings, and the extent to which the intended label was selected. Variability was also observed across actors, both in the quality of depiction and in the recognisability of the stimuli produced. Individual differences in the ability to produce recognisable non-affective internal states are expected, and the predictors of such differences should be investigated in future research. Previous work on facial expressions of emotion indicates, for example, that autistic individuals produce less typical emotional expressions than neurotypical individuals (e.g., Brewer et al., 2016; Langdell, 1981). Further research is needed to elucidate whether a similar pattern is observed for the expression of interoceptive states.

It is likely that internal states were recognised less well than action stimuli due to the associations and similarities between internal states giving rise to greater confusability. In particular, there is an over-representation of gastric internal signals in the current stimulus set (i.e. nausea, hunger, satiety), which could be responsible for lower specificity scores and choice rates for these stimulus categories. Actors frequently expressed these internal states by placing their hands on or around the abdomen, likely making these stimuli difficult to differentiate. Crucially, despite variability in the low-level visual features of the stimuli within state categories, there was consistency across actors’ depictions of states, and visual cues were often in line with those that would be expected based on the location at which states are perceived within the body (e.g., the abdomen). Notably, recognition scores are likely to be dramatically increased if fewer gastric response options are available to participants (e.g., researchers could include nausea, hunger, and satiety under the same umbrella term ‘gastric discomfort’); in the current validation task, the availability of all target labels may have led to more conservative recognition estimates, while in a two-alternative forced choice task where stimuli must be labelled as either cold or satiety, for example, it is likely that participants would perform near ceiling, as the visual cues associated with these states are highly distinct. Indeed, in tasks assessing affective emotion recognition, recognition accuracy is improved by having fewer available response options, or less confusable response options in alternative forced choice tasks. For example, angry facial expressions were less likely to be labelled as depicting anger in a task where response options included “anger”, “frustration”, and “contempt” than when fewer closely related response options were available (Russell, 1993). Similarly, recognition of happiness expressions (which often shows ceiling effects even in those with difficulties recognising other facial expressions) has been found to be impaired in those with emotion processing impairments (alexithymia) when stimuli depicting pain are included in the recognition task, likely due to painful expressions sharing perceptual characteristics with happy expressions (Brewer et al., 2015). In contrast, action stimulus categories were more distinct from each other in their associated behavioural cues, and therefore less confusable with each other.

Naturalness of expression may also have played a role in the disparity of recognition scores among stimulus categories. Although behavioural expressions of states such as hunger and satiety do occur (e.g. rubbing the abdomen for hunger or satiety, or exhaling heavily for satiety), they may be less spontaneous than behavioural expressions of other states such as feeling cold (e.g. rubbing one’s arms) or feeling itchy (e.g. scratching one’s skin). This may be because the behavioural responses to cold and itch serve to reduce the internal state, and are thus performed frequently, whereas expressions of hunger and satiety serve a more communicative purpose and are therefore used mainly in social situations, and potentially less often. Similarly, actions that are performed with a more communicative purpose may be more frequently accompanied by a verbal description (e.g. stating ‘I’m so hungry’ while rubbing one’s abdomen), reducing the requirement for an observer to recognise the visual signals. It is worth noting that facial expressions of affective states can be either spontaneous or posed for communicative purposes, and these tend to differ in their visual features, such as onset time, duration, and amplitude of physical facial movement (Schmidt et al., 2006; Valstar et al., 2006). It is likely that spontaneous and posed/communicative expressions of non-affective internal states also differ, and the extent to which they differ may vary across internal states. Actors’ depictions of internal states may therefore have been more recognisable for states in which spontaneous and posed expressions are more similar, making the depiction more ecologically valid; for states which are either infrequently expressed, or for which spontaneous expressions differ greatly from posed expressions, actors’ depictions may have been less recognisable. Notably, the communicative value of individual internal states may also vary across cultures. Future research is needed to examine cross-cultural influences on the expression and recognition of internal states.

Moreover, the expression of certain internal states is likely to be multidimensional, with expressions including a combination of visual (e.g. kinematic), auditory (e.g. vocal) and contextual cues. Recognition of internal states in others may, therefore, be greatly improved by the addition of vocal cues, body movement, or contextual information. When observing an individual rubbing their abdomen, for example, contextual information might be necessary in order to interpret the action accurately as a sign of hunger (e.g. it is lunch time and we are in a queue to buy food), rather than a sign of satiety (e.g. we just ate a large meal). Future research is needed to elucidate whether some states rely more than others on visual cues for their expression, and what type of cues are necessary for their recognition. The availability of stimuli depicting states that are easily confused with each other in this stimulus set will make it possible to address these research questions. Notably, research into the perception and recognition of non-affective internal states in others will pose new methodological challenges, in part complementary to those faced when studying the perception of internal states in the self. On one hand, some internal states (e.g. itch, fatigue) are associated with visual cues but are difficult to measure objectively, potentially making study of these states easier in relation to others than to the self. Conversely, some internal states, such as cardiac signal changes, are easy to objectively assess in the individual, but are not accompanied by visual cues, making them difficult to observe in others.

Another crucial aspect to consider is the relative role of facial and bodily information in participants’ recognition of the current stimulus set. Facial cues were not obscured from the stimuli in either validation stage, as both facial expressions and postural cues are likely to be important for conveying internal states, and full body postures were deemed to be the most ecologically valid. It is possible, however, that facial and body information are recognisable in isolation, or that the relative contribution of facial and body cues to state recognition varies across internal states. As emotional cues are particularly expressed by the face (Adolphs, 1999; Frith, 2009), and it may be possible to experience affective and non-affective states simultaneously, interference effects from emotional cues may be especially evident when facial cues are present. While we note that stimuli have only been validated with integrated facial and body cues, it is of course possible for future work to investigate recognition from distinct regions of the stimuli, for example by separating or manipulating facial and body cues.

Crucially, the theoretical distinction between affective (emotional) and interoceptive states is not clear-cut. Here we refer to interoceptive states as internal bodily sensations beyond the affective domain. With this, we do not imply that emotions and interoceptive states are necessarily separate entities. On the contrary, according to the leading model of emotion perception, interoception is a fundamental component of emotional experience, which derives from sensory and affective experiences in combination with contextual cues (Schachter & Singer, 1962). However, it is common in the literature to find emotion processing and interoception treated as separate components. Similarly, some states, such as pain, seem to be considered both emotional and interoceptive, with Craig describing pain as a ‘homeostatic emotion’ due to its sensory component alongside a motivational drive to re-establish the body’s homeostasis (Craig, 2003b). This definition could arguably be applied to a number of interoceptive states. Future work is needed to assess whether individuals process affective and interoceptive states in others differently. To this end, the need for stimuli depicting internal sensations beyond the affective domain is all the more critical. Going forward, it is important that categories of internal states are clearly defined, both theoretically and operationally.

In conclusion, the ISSI stimulus set will allow, for the first time, the investigation of humans’ ability to recognise non-affective internal states in others. There are opportunities for investigating this basic process, for example the role of contextual cues and the contribution of facial and body postural cues to recognition, as well as for investigating correlates of individual differences in this ability, the genetic and neural basis of recognition, developmental trajectories, and the relationship between psychopathology and recognition abilities. Less recognisable stimuli have not been eliminated from the database, as researchers are encouraged to select stimuli based on their specific needs and research questions. If the aim of a study is to assess the accuracy of internal state recognition, we advise researchers to select stimuli with higher quality, specificity, and choice rate scores, as these offer greater validity. The availability of more ambiguous stimuli, however, will allow investigation of individual differences in interpretation, and of the biasing role of additional cues, for example. Researchers using the ISSI stimuli are encouraged to report their stimulus selection process transparently, and may utilise the validation statistics in the ISSI database to do this.
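For instance, a researcher might filter the published validation statistics along these lines (the thresholds and column names are illustrative assumptions, not recommendations prescribed by the database):

```python
# Select high-validity ISSI stimuli, or deliberately ambiguous ones.
import pandas as pd

issi = pd.read_csv("issi_validation_stats.csv")  # hypothetical export

high_validity = issi[(issi["QI"] >= 3.0) &    # target label fits well
                     (issi["SI_plus"] > 0) &  # beats best distractor
                     (issi["CR_plus"] >= 70)] # usually rated highest

ambiguous = issi[issi["CR"] < 50]  # target chosen < half the time
```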