Participants
Because of pandemic-related constraints on laboratory research, the study was conducted online. An a priori power analysis using G*Power [32], with power (1 − β) set at 0.90 and α = 0.05, yielded a target sample size of 117. Of the 169 participants who started the study, 36 dropped out before finishing. This drop-out rate of 21.3% is comparable to other online studies in German-speaking regions [33]. Of the 133 participants who completed the study, six had to be excluded because of invalid IAT scores. A further eight were excluded after a visual examination of response time outliers, resulting in a final sample of 119 participants. The sample consisted of 73 participants who were recruited via the local university participant pool and received course credit, and 46 participants who were recruited via the platform Prolific and received a small monetary compensation calculated on the basis of the German minimum wage (3.36 € for 20 minutes). The same inclusion criteria were used for both recruitment approaches: age between 18 and 50, German nationality, and German as first language. Beyond these criteria, participation was not restricted. As a result, the sample covers various professional domains, allowing more extensive insights into the expectations of general users in terms of preferences and public stereotypes.
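The participant flow reported above can be retraced with a few lines of arithmetic. The following is only an illustrative sketch: the variable names are hypothetical, and the numbers are those stated in the text.

```python
# Participant flow, using the counts reported in the text.
# All variable names are illustrative and not taken from the study materials.
started = 169
dropped_out = 36
excluded_invalid_iat = 6   # invalid IAT scores
excluded_rt_outliers = 8   # visual examination of response time outliers

completed = started - dropped_out
final_sample = completed - excluded_invalid_iat - excluded_rt_outliers
dropout_rate_percent = round(100 * dropped_out / started, 1)

# Recruitment sources of the final sample.
recruited_pool = 73
recruited_prolific = 46

print(completed, final_sample, dropout_rate_percent)
```

The two recruitment groups sum to the final sample, which exceeds the targeted sample size of 117.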
Participants from the two sampling strategies did not differ in age (MPool = 26.62, SDPool = 4.62, MProlific = 28.61, SDProlific = 7.54, p = 0.112), gender (Pool 59% female, Prolific 43% female, p = 0.146), or the control variable tendency to anthropomorphize (MPool = 43.29, SDPool = 11.62, MProlific = 42.15, SDProlific = 11.76, p = 0.608). Taken together, participants from both recruitment strategies were on average 27.38 years old (SD = 5.97), and 53% of them identified as female. Moreover, participants were asked about their profession and how they would classify their own professional background. Most of the participants were students (57.98%) or employees (32.77%). The sample included people with backgrounds in the industrial domain (n = 36), the service domain (n = 20), the social domain (n = 36) and other domains (n = 27).
Task and materials
Context descriptions
The context descriptions were generic textual representations of context-specific joint human–robot interactions (Table 1). The description of the industrial field of application included a robotic assistance that supported workers with assembling products, moving objects from one workstation to another and placing parts in designated areas. In the service context the robotic system delivered goods to a respective destination, sorted parcels into designated areas, cleaned work surfaces and supported employees in potential customer care. The social context described a robot that supported workers in the care of other people on an organizational, social and emotional level and could be used for social interactions such as sport exercises.
Table 1 Translated context descriptions for the industrial, service, and social domain [originally presented in German, accessible via https://osf.io/6zq9e/ (OSF)]
Robot stimuli
The ABOT (Anthropomorphic roBOT) Database was used to select robots with different degrees of anthropomorphism. This database contains over 250 standardized images of existing robots with varying anthropomorphic features; each robot has a score from zero to 100 that indicates the robot’s degree of anthropomorphism [34]. This overall score comprises four dimensions of robot appearance features that were identified with a principal component analysis: surface features (e.g., skin, gender, hair, eyebrows), facial features (head, face, eyes, mouth), body-manipulators (e.g., legs, arms, torso) and mechanical locomotion (treads, wheels). Following the approach of previous research [13, 35, 36], three different degrees of anthropomorphism, as represented by the overall score, were considered for the study (low, medium, high). For every degree of anthropomorphism, three robots were chosen to minimize carryover effects within each domain, as each context description was presented three times. Apart from differences in perceived anthropomorphism, all robots had similar color schemes, similar abilities based on their appearance and no obvious gender cues like hairstyle [37] or body proportion [38].
See Fig. 1 for examples of low, medium and high anthropomorphic robots. The scores within each category were comparable, whereas the scores between the low (M = 9.14, SD = 0.56), medium (M = 23.06, SD = 0.54) and high (M = 49.2, SD = 1.82) levels of anthropomorphism differed substantially. Robots with extremely high perceived anthropomorphism were deliberately not selected, for two reasons. First, such robots often already have an assigned gender or at least gender-specific cues (like long hair or wearing a dress). Second, robots that are too anthropomorphic might in general be perceived negatively and generate a feeling of uncanniness [39].
Implicit association test
The IAT is a computer-based discrimination task in which subjects are asked to classify individual stimuli representing concepts or attributes as quickly as possible into four different categories by pressing one of two answer keys [16]. For the four categories, stimuli that are easy to categorize have to be selected [16]. Typically, IAT stimuli [15] are words, but images or symbols can be used as well [16]. Because gender categories with regard to robots are difficult to realize verbally, the stimuli were implemented using images of robots with typically male- and female-associated features. Categories in the IAT are usually represented by eight stimuli each [15]. Therefore, eight images each of male-looking robots, female-looking robots, industrial contexts and social contexts were selected. The robot stimuli were mostly derived from the ABOT database, while the context stimuli were taken from free stock image databases (see Fig. 2 for examples of the robot and context stimuli). A pre-test was conducted to find the most suitable robot stimuli. Eighteen participants (12 female) with a mean age of 34.44 years (SD = 15.35) rated 20 stimuli with regard to the perceived gender of the robot on a scale from zero (male) to 100 (female). The mean score for every robot was calculated, and the eight most male-looking robots (scores between 4.9 and 30.3) and the eight most female-looking robots (scores between 66.2 and 99.4) were selected. The final stimuli can be accessed at the Open Science Framework (OSF) via https://osf.io/6zq9e/.
For the analysis of the IAT, the improved D-score was calculated according to Greenwald et al. [43]. This score is the difference in average response time between the two combined stages of the IAT, divided by the standard deviation of the respondent’s response times across both combined stages [16]. In one combined stage, “social” and “female robot” share an answer key while “industry” and “male robot” share the other answer key; in the other combined stage, this pairing is reversed (social + male robot; industry + female robot). For the exact procedure see Greenwald et al. [43].
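The core of the score described above can be sketched as follows. This is a simplified illustration and not the full improved scoring procedure of Greenwald et al. [43], which includes further steps (for example the handling of extreme latencies and error trials) that are not reproduced here. The function name, the argument names and the sign convention (incongruent minus congruent, so that positive values indicate faster responses in the congruent pairing) are assumptions made for this example.

```python
from statistics import mean, stdev


def simplified_d_score(congruent_rts, incongruent_rts):
    """Mean latency difference between the two combined stages, divided by
    the standard deviation of all latencies from both stages.

    Assumption: latencies are given in milliseconds, one list per stage, for
    a single respondent. No trial exclusions or error penalties are applied.
    """
    if len(congruent_rts) < 2 or len(incongruent_rts) < 2:
        raise ValueError("each stage needs at least two response times")
    # Standard deviation over the trials of both combined stages together.
    inclusive_sd = stdev(list(congruent_rts) + list(incongruent_rts))
    if inclusive_sd == 0:
        raise ValueError("response times have zero variance")
    return (mean(incongruent_rts) - mean(congruent_rts)) / inclusive_sd


# Hypothetical latencies for one respondent (not data from the study).
congruent = [600, 650, 700, 750]      # industry + male, social + female
incongruent = [800, 850, 900, 950]    # industry + female, social + male
print(simplified_d_score(congruent, incongruent))
```

Dividing by the respondent's own standard deviation puts respondents with generally slow and generally fast responses on a common scale.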
Dependent measures
Control measures
Though all robots had a specific score from the ABOT database, qualifying them as stimuli with low, medium or high anthropomorphism, it was still necessary to verify that the participants did perceive the differences in anthropomorphic robot design. A single item was therefore used as a manipulation check to assess the perceived anthropomorphism for each robot. The nine robots were displayed in a randomized order and participants had to indicate the human-likeness of each robot on a scale ranging from 0 (not at all human-like) to 100 (completely human-like). The scale was chosen to enable a comparison with the ABOT score which ranges from zero to 100, too.
Furthermore, to control for confounding effects on the participants’ responses, the individual tendency to anthropomorphize was measured. Research has shown that the tendency to anthropomorphize non-human entities is not universal [40, 41]. To assess stable individual differences in this tendency, the Individual Differences in Anthropomorphism Questionnaire (IDAQ) by Waytz et al. [41] was used in the study.
Preferred degree of anthropomorphism
The main dependent variable to assess the preference for differently anthropomorphic robots was the frequency with which the different degrees of anthropomorphism were chosen with regard to each context. In addition to the frequencies of the chosen robots, the response latency (in milliseconds) of every selection was measured.
Gender attribution: naming frequencies
To examine gender associations in the application contexts, a naming technique derived from previous research [18, 19] was used. After selecting a robot in a specific context, the participants were asked to give the robot a name. This open format was used so as not to impose answer options on the participants. It also allowed participants to give not only traditional male or female names but any name they could imagine, such as neutral or technical ones, a tendency that Keay [42] observed in the naming of robots for robot competitions. For the analysis, the names had to be coded into categories. For this purpose, the categories employed in earlier research [41] were modified and applied here. The categories used were female, male, nickname (including names of unknown gender, popular robot names, typical animal names) and functional (including technical and mechanical qualities). In a first rating round, three raters coded the names into the categories independently of each other. In a second round, the raters discussed and resolved ambiguities together. All three raters were associated with the department, and two of the raters are authors of the paper (E. Roesler & L. Naendrup-Poell). The inter-rater reliability of the coded names after the first round was κFleiss = 0.74. After the diverging codings were discussed, agreement was reached in almost all cases, resulting in a nearly perfect inter-rater reliability (κFleiss = 0.96). In cases where no full agreement could be achieved, the category chosen by two of the three raters was selected.
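For illustration, Fleiss' kappa and the two-out-of-three rule described above can be sketched in a few lines. This is a minimal example under stated assumptions, not the analysis code of the study: the function names, the data layout (one list of rater codes per name) and the example names are hypothetical.

```python
from collections import Counter

CATEGORIES = ["female", "male", "nickname", "functional"]


def fleiss_kappa(ratings, categories=CATEGORIES):
    """Fleiss' kappa for items that were each coded by the same number of raters.

    `ratings` is a list with one entry per item; each entry lists the
    categories assigned to that item by the individual raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    # counts[i][j]: number of raters who put item i into category j.
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Observed agreement per item, averaged over all items.
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items
    # Agreement expected by chance from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1 - p_expected)


def majority_category(item_ratings, minimum=2):
    """Category chosen by at least `minimum` raters, or None if there is none."""
    category, votes = Counter(item_ratings).most_common(1)[0]
    return category if votes >= minimum else None


# Hypothetical codings of two names by three raters (not data from the study).
example = [["female", "female", "male"], ["male", "male", "male"]]
print(fleiss_kappa(example))
print([majority_category(item) for item in example])
```

With three raters and four categories, all three raters can in principle disagree; the helper returns `None` in that case, since the text does not state how such a case would be handled.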
Gender attribution: implicit association
In this work, the automatic semantic association of the concepts “industry” and “social” with the concepts “male robot” and “female robot” was investigated. For the analysis, the so-called improved D-score was calculated. The D-score is an index of the relative strength of association and is based on the difference in response times between congruent and incongruent category pairings [43]. It is assumed that response times are faster when two strongly associated concepts share an answer key (congruent pairing: industry and male robot or social and female robot) than when two less strongly associated concepts share a key (incongruent pairing: industry and female robot or social and male robot) [15].
Study design and procedure
The study was conducted as an online survey using the platform SoSci Survey. Participants completed the study on their private computers without the presence of the experimenter.
First, participants were informed about the general terms and conditions. Afterwards, the procedure of the study was presented, and they were instructed that all the robots shown in this study are equipped with the same functional capabilities.
Subsequently, participants read one of the three context descriptions; the order in which the descriptions were presented was randomized for each participant. The descriptions of the industrial, service and social domain represented the levels of the factor “application field”. Since every participant read every domain description, the study had a one-factorial within-subjects design. After reading a context description, participants were shown three robots and asked to decide which one they would prefer in this context. The displayed robots varied in their degree of anthropomorphism across three levels: low, medium and high anthropomorphism (Fig. 1). After selecting a robot, participants were asked to provide a name for the robot. This procedure was repeated nine times in total, three times for each application domain.
Thereafter, the implicit association test was conducted. Participants were instructed that they had to do a categorization task and were then presented with the standardized instruction of the IAT [15, 16]. After the IAT, as a manipulation check, participants indicated how anthropomorphic they perceived each of the nine robots they had seen before. Participants then completed the IDAQ [41], with fifteen items rated on an eleven-point scale from zero “not at all” to ten “very much”, and answered several socio-demographic questions. The entire experiment lasted 15–20 min.