Perceived Human-Likeness of Social Robots: Testing the Rasch Model as a Method for Measuring Anthropomorphism

Anthropomorphism is generally defined as the attribution of human-like characteristics to social robots and other non-human objects. We argue that different researchers have different interpretations of this concept, leading to measuring instruments that focus on different subsets of human-like characteristics. In the current paper, we discuss these different interpretations and explore a new method for measuring anthropomorphism, based on the Rasch model. The aim of the current work is to map anthropomorphism as a range of human-like characteristics on a one-dimensional scale. The scale’s validity and sensitivity were tested by comparing it with two available measuring instruments and by comparing people’s responses to different types of agents. In three studies, we explored whether the Rasch model is suitable for measuring anthropomorphism. Despite some limitations, results showed that the Rasch model can successfully be applied to the measurement of anthropomorphism. Implications for future work on anthropomorphism of social robots are discussed.


Introduction
During the past decades we have seen an increasing interest in research on social robots.They are being commercialized and becoming available to the general public.Because of this, understanding the effects of their appearance and behavior on people's interactions with them is gaining importance as well.In the near future, social robots may be provided more human-like features and are likely to be represented as social entities with faces, engaging in social conversation with humans.These developments could make people perceive those robots as more and more human-like.This perceived human-likeness is an important determinant for people's responses to social robots (see [1,36,38]).Moreover, human-likeness in social robots has been shown to positively affect people's engagement with those robots [4], people's expectations of the robot's navigation behavior [30], and the robot's persuasive power [7].The term anthropomor-

Subsets of Human-Like Characteristics
An examination of existing measurement instruments (e.g., [2,5,32]) revealed that human-like characteristics are generally categorized into appearances, thoughts and emotions.With appearances we refer to characteristics that reflect human form or behavior (i.e., how an object or robot looks and/or moves), including both physical shapes and physical abilities.With their measuring instrument for anthropomorphism, Bartneck and colleagues [2] focused on such appearances by asking people to indicate, among others, to what extent a robot looks human-like, looks life-like, and shows realistic movements.These items clearly focus on a robot's appearance and the extent to which this appearance resembles the human body.
With thoughts we refer to characteristics that reflect cognitive states and processes.According to Waytz and colleagues [32], anthropomorphism is a process of inductive inference which most likely occurs by attributing cognitive states that are perceived to be uniquely human to other agents (for a review, see [9]).Hence, anthropomorphism was measured by asking people to indicate to what extent an agent has cognitive abilities like consciousness and free will [32].These cognitive abilities cannot be included in the physical design of robots, because they can only be inferred from its behavior.
Finally, with emotions we refer to characteristics that indicate subjective conscious experiences which can be distinguished in primary and secondary ones (for an overview of the hierarchical organization of emotions, see [26]).Eyssel and colleagues [11] measured anthropomorphism by asking people to indicate to what extent robots can experience such primary and secondary emotions.
In their further work, Eyssel and colleagues [10,12] differentiated between personality characteristics that reflect human nature and human uniqueness.This distinction was adapted from earlier research on social perception in humans [16,17].In this line of research, human nature characteristics are described as characteristics of the human species that are shared with other animals (e.g., innate and affective traits).Uniquely human characteristics, on the other hand, are considered to be exclusive to humans and not possessed by any other species (e.g., social learning and higher cognition, see [17]).Whether or not such a clear-cut distinction exists between humans and other entities is outside the scope of the current work, but it does explain why others have approached it as a 2-dimensional construct.
The aim of this paper is twofold.First, we conceptualize anthropomorphism and discuss why measurements obtained with existing instruments may have little correspondence.Second, we propose a method for measuring anthropomorphism based on the Rasch model (see for example [3]), and test its dimensionality and validity in three studies.We end with a discussion on the benefits and limitations of using the Rasch model for measuring anthropomorphism.

Conceptualizing Anthropomorphism
We argue that anthropomorphism is a single predisposition, meaning that whatever human-like characteristics an individual ascribes to a robot-be it human nature or human uniqueness characteristics, basic physical abilities or moral decision making, they all stem from that single predisposition to do so.This is in line with Waytz and colleagues [31], who stated that attributions of human-like characteristics stem from stable individual predispositions.
However, Waytz and colleagues [31] also referred to anthropomorphism as the attribution of characteristics that people regard as distinctively human, in particular mental capacities.This view is shared by Zawieska and colleagues [37], who argued that anthropomorphism only includes human-like characteristics that robots do not have.As a result of this ostensibly narrow focus, only mental capacities (i.e., having intentions, free will, a mind of your own, consciousness, and experiencing emotions) are included in the measurement of anthropomorphism (see [31]).More specifically, the Individual Differences in Anthropomorphism Questionnaire (IDAQ) consists of items that describe characteristics of specific agents (e.g., cars, cows, and mountains) within one of three categories (i.e., technologies, animals, and natural entities).We argue that this operationalization of anthropomorphism insufficiently measures the concept, because it focuses merely on a narrow subset of human-like characteristics (mental capacities), and it includes only very specific agents.In other words, we believe that asking a person to what extent an average fish has free will does not explain or predict that person's responses to a social robot.We also wonder whether such a scale can be used for comparing the perceived human-likeness of different types of agents.
It is evident that the different measuring instruments have been developed with a focus on different subsets of the construct.Consequently, the question arises as to what extent these measuring instruments of anthropomorphism measure the same concept, and thus ultimately to whether we can faithfully compare research findings.We view anthropomorphism as a one-dimensional construct, and argue that all human-like characteristics-no matter which subset they belong to-are ordered according to the probability with which they are ascribed to robots.Some human-like characteristics are expected to be more easily ascribed to robots than others.For example, human-like appearances are expected to be more easily ascribed to social robots than underlying cognitive states and processes, regardless of an individual's general tendency to anthropomorphize.
Finally, we argue that the ordering of human-like characteristics with respect to the difficulty to attribute them to robots is similar for all individuals.More specifically, people are expected to be more likely to attribute human-like appearances to robots than they are to attribute cognitive states.Such an invariant ordering also entails that if a person attributes the ability of moral reasoning to a robot, (s)he is also expected to attribute the ability of seeing to that robot.Another person who does not attribute the ability of seeing to the same robot is not expected to attribute the ability of moral reasoning to it.
If all human-like characteristics can be invariantly ordered across people, we can compare people's individual predispositions to anthropomorphize and the perceived humanlikeness of a robot on a single scale.One model that is able to map a person's predisposition to anthropomorphize and the human-like characteristics (s)he is likely to attribute to a robot as locations on a single dimension, and thus seems highly suitable for measuring anthropomorphism, is the Rasch model [3].

The Rasch Model
The Rasch model (see Eq. 1) describes the odds of a certain response as a logistic function of person and item parameters.In our case, this relates to the probability of attributing a specific human-like characteristic i as an additive function of a person n's general predisposition to anthropomorphize (θ n ) and the difficulty to attribute that specific human-like characteristic to a robot (δ i ). ln Both parameters in this equation (an individual's predisposition to anthropomorphize θ , and the difficulty of attributing a specific characteristic to a robot δ) are estimated by means of maximum likelihood estimation.Predispositions of persons and the difficulty of the various self-report items (i.e., whether or not a robot is thought to possess a certain characteristic) are expressed in log odd units (also called logits).For a specific characteristic i to have a 50% chance to be attributed to a robot, the difficulty of that characteristic (e.g., δ i = 1) has to be matched numerically with an equivalent amount of a person n's predisposition to anthropomorphize (θ n = 1).
The Rasch model, however, is not only descriptive, but also prescriptive.It requires items (or human-like characteristics) to be ordered invariantly and transitively across persons, and persons to be ordered invariantly and transitively across items.Thus, if a person attributes, for example, four out of 10 characteristics to a social robot, the Rasch model also prescribes which four should be attributed: They should be amongst the four least difficult ones.For the most difficult to attribute characteristics, we expect to exclusively find such attributions amongst the individuals that have a high predisposition to anthropomorphize.
These formal Rasch model expectations can be tested empirically against the observed data.Provided the responses are ordered from the person with the highest to the lowest predisposition to anthropomorphize, the model anticipates the following response string for an averagely difficult to attribute characteristic: 111101010000.In this response string, a one indicates that a person agreed to the robot having that characteristics, and a zero that a person did not.The first four individuals all have high predisposition, so all four are likely to attribute this characteristic to the robot.The next four individuals have a predisposition that approaches numerically the difficulty of this characteristic or item.As a result, some of them will attribute this characteristic to the robot, and others will not.Finally the last four individuals have such a low predisposition to anthropomorphize that all are very unlikely to attribute this characteristic to the robot.
The mean square (MS) statistic is commonly used to test the match between model-predicted and observed response patterns.The MS infit-value is the weighted average of the squared standardized residuals, in which each residual is weighted by its variance [3].The model-predicted response pattern (e.g., 111101010000), yields a MS infit-value of MS = 1.00.Excessively high MS-values can be expected when the observed response string opposes the Rasch model prediction, for example when the likelihood of reporting an experiential effect increases with diminishing susceptibility (e.g., 000000111111).In contrast, the MS-value would be smaller than 1.00 for an item with a deterministic response pattern (i.e., 111111000000).In such a case, the model prediction-to-data fit is better than what one would anticipate with a probabilistic model.MS-values below 1.00 do not really challenge the Rasch model prediction, but can be used to improve one's measurement instrument.Besides MS infit also MS outfit-values are often reported.MS outfit-values are unweighted fit statistics, and are more sensitive to unexpected responses on relatively easy or difficult items.
The invariance assumption should be sufficiently met in order to map both persons and human-like characteristics on a single scale (see 1), and thus to compare individuals and human-like characteristics against each other in a meaningful way.For assessing item fit, MS-values up to 1.20 are considered excellent, and MS-values below 1.50 are considered acceptable [35].
The additional advantage of the model is the invariance between items and persons.As a result, the assessment of a persons predisposition to antropomorphize and of the items with respect to their difficulty are independent of each other (so-called specific objectivity, see [34]).As results, and in contrast to many other measurement models, measurements of personal attributes are not defined by the specific set of items used (see [18]).It thus also allows for items being deleted and/or replaced.This enables the use of different sets of items (see [33]), making this method easily adjustable for measuring responses to different types of agents (e.g., animals or natural phenomena).

Research Aims
The current research was designed with the aim to explore a new method for measuring anthropomorphism.We hypothesized that anthropomorphism can be successfully mapped onto a one-dimensional scale and that human-like characteristics are ordered with respect to their likelihood of being attributed to robots in a way that is similar for all individuals in their encounter with different types of agents in different contexts.Data from three studies were used to test these hypotheses.Two of these studies were originally designed with different purposes, but we will only assess results on the included measuring instruments for anthropomorphism.
In the first study, we developed and tested the construct validity of a 37-item anthropomorphism scale.With construct validity we refer to the relation between the ordering of items on the scale according to their difficulty and their perceived human nature and human uniqueness.Construct validity is high when strong relations between the locations of items on the dimension and their perceived human nature and human uniqueness are found.Because of the expected unidimensionality, human nature and human uniqueness were also expected to be strongly correlated.
In the second study, a 25-item version of the anthropomorphism scale was tested on its convergent validity.With convergent validity we refer to the extent to which estimates of the scale are related to estimates obtained with two other measuring instruments for anthropomorphism: the questionnaire used by Waytz and colleagues [32, referred to as the Waytzinstrument, see "Appendix A"] and the anthropomorphism part of the Godspeed questionnaire developed by Bartneck and colleagues [2, referred to as the Godspeedinstrument, see "Appendix B"].Because the focus of these instruments is on different subsets of human-like characteristics, we expected to find moderate correlations between the anthropomorphism scale and the Waytz-and Godspeedinstruments.The second study was originally designed to compare two different robots on their perceived humanlikeness.Because of this, we could use this study for testing the invariant ordering of the items on the scale for those two robots.
In the third study, we extended our scope from robots to other types of agents.A 19-item version of the anthropomorphism scale was tested on its sensitivity for differentiating between humans and other types of agents.This study was originally designed to investigate people's responses to four types of players in a game (i.e., humans, robots, computers, and algorithms), enabling us to compare responses for other types of technologies than robots as well.We expected that the scale would successfully differentiate between humans and other types of agents.
Finally, we expected that the anthropomorphism scale would show an invariant ordering of human-like characteristics on a single dimension across all studies and all experimental conditions.

Study 1
In this study, a list with 37 human-like characteristics was created, largely based on earlier work on humanness and anthropomorphism [2,11,16,17,32], and tested on its construct validity.The human-like characteristics were modeled as a function of a person's predisposition to anthropomorphize and the difficulty to attribute that human-like characteristic to a robot.We hypothesized that items and per-sons could be mapped onto a single one-dimensional scale, and that the items would be invariantly ordered according to the difficulty with which they are attributed to a robot.
Additionally, for construct validity purposes, the extent to which the 37 human-like characteristics were perceived as being human nature and uniquely human was measured.We hypothesized that the estimated difficulties with which the 37 characteristics are attributed to a robot would be related to their perceived human nature and human uniqueness.Because of the proposed unidimensionality of anthropomorphism, human nature and human uniqueness were expected to be strongly correlated as well.

Participants and Design
One hundred and sixty one participants sampled through social media participated in one of three groups in the current study.The first group consisted of 124 participants (53 males and 71 females; M age = 26.08,S D age = 8.82, Range = 15 to 59) who were given a description about a robot and completed 37 survey items.Another group of 20 participants (9 males and 9 females, M age = 19.94,S D age = 1.98,Range = 18 to 23; two participants did not indicate their age and gender) rated the 37 human-like characteristics on human nature.The remaining 17 participants (11 males and 6 females, M age = 21.12,S D age = 1.80,Range = 18 to 24) rated all characteristics on human uniqueness.Participants in all three groups participated voluntarily, gave informed consent, and were not compensated for participation.

Materials and Procedure
A set of 37 items describing human-like characteristics was constructed.For all three groups of participants, items were arranged in alphabetical order.
For the first group of 124 participants, items were formulated as a statement which could be answered with yes (coded with a 1) or no (coded with a 0).The items were presented through an online survey.After reading a short explanation about the study, participants were provided with a short description about a robot: 'The robot has eyes to perceive the environment, has arms and legs to move around in this environment, and today the robot is trying to solve a moral dilemma'.This description was followed by an instruction to not think elaborately about the statements and to give the answers that first came to mind.Finally, participants indicated their gender, age and education level, and they were thanked for their contribution.This study took approximately 5 min to complete.
The two other groups of participants were not given the description of the robot, but instead were asked to indicate to what extent each of the 37 items on the scale was perceived as 'typically human' (i.e., human nature) or 'uniquely human' respectively.Human nature was measured with one question ('To what extent is this characteristic typical for humans?') on a 7-point response format ranging from 'not at all' (coded with a 1) to 'very much' (coded with a 7).Human uniqueness was measured with one question ('Is this characteristic unique for humans?') on a dichotomous response format with 'not unique' (coded with a 0) and 'unique' (coded with a 1) as options.The two concepts were not further introduced or explained to participants.We chose for a dichotomous scale for human uniqueness because a characteristic either is, or is not uniquely human (i.e., characteristics cannot be 'a little' unique for humans).Both these evaluations took approximately 5 min to complete.

Model Test
To test whether the data sufficiently fit the model, four tests were conducted.First, fit statistics were used to test whether items and persons fitted the Rasch model (for an overview, see [3]).For assessing item fit, both infit and outfit were used.Infit indicates unexpected observations on items that are close in difficulty to a person's predisposition, whereas outfit indicates unexpected observations on items that are relatively easy or difficult [23].Infit and outfit mean square (MS) values ≤ 1.20 are considered excellent, and MS-values ≤ 1.50 are considered acceptable [35].The second test determined whether the items were sufficiently spread over the perceived human-likeness dimension.The third one tested the hypothesis that items all belong to a single dimension.Finally, the fourth one tested the hypothesis that items would be invariantly ordered according to the difficulty of attributing them to robots.Item fit Ideally, each of the items contributes in a meaningful way to the measuring instrument, indicated by a sufficient item fit.Considering the notion that the Rasch model is stochastic and that data depend on probability (and not on certainty), some misfit is to be expected [27].An acceptable five of the 37 items had outfit MS values outside of the acceptable boundaries.These were items 1 ('experience pain', outfit MS = 2.09), 26 ('anticipate on surroundings', outfit MS = 1.62), 32 ('organized', outfit MS = 2.74), 33 ('estimate distances', outfit MS = 2.12), and 37 ('avoid objects', outfit MS = 1.97).Another 7 items had outfit MS values between the good and acceptable values (see Table 1 for estimated item difficulties).
Item difficulties were estimated with a reliability of .98,and the average item difficulty was anchored at M = .00logits (SD = 2.34, Range = − 4.60 to 4.08).Infit MS values of the 37 items ranged from 0.72 to 1.22 (M = 0.98, SD = 0.12), and outfit MS values ranged from 0.46 to 2.74 (M = 1.10,SD = 0.50).These findings together indicate that there was an acceptable item fit, meaning that there were not many observations that did not fit the overall structure of the data.Person fit The purpose of person fit measurement is to detect response patterns that are unlikely given the model [24].More specifically, person fit indicates whether a person responds as expected given his/her individual predisposition to anthropomorphize.Individual predispositions to anthropomorphize were estimated with a separation reliability of .80.The average predisposition was M = − .21logits (SD = 1.16;Range = − 4.13 to 2.74).Since items are responded to by individuals who can be tired or misread the statements, some misfit is to be expected.For an acceptable ten out of 124 participants (8.1%), the model prediction did not fit the data as indicated by a t-value of t ≥ 1.96.These findings indicate that there was an acceptable person fit, meaning that there were not many observations that did not fit the overall structure of the data.Item spread All items and persons are mapped onto a single scale in Fig. 1.As can be seen in this figure, the spread of items sufficiently covers the spread of persons.In other words, the current scale was able to reliably measure individual predispositions to anthropomorphize for all participants in the current sample.It also appeared that the top region of the scale comprised many items, but not so many persons.Thus, some items appeared to be too unlikely for persons in the present sample to attribute to the robot, and therefore did not contribute to the assessment of individual differences in people's predisposition to anthropomorphize.For this reason, some of these items will be omitted from the scale in the next studies.Dimensionality Next, we tested the hypothesis that the items of the anthropomorphism scale would all belong to one dimension.Results showed that the Rasch model explained 52.8% of the variance in the data (for computational details, see [22]).A Principal Component Analysis was performed on the standardized residuals (i.e., the data not explained by the model), which checks whether multiple items share the same unexpected response pattern (for details, see [22,29]).If the model would fit perfectly, then 52.5% of the overall variance would be quantification variance, revealing a slight overfit (0.3%) to the model.Because the Rasch model estimates probabilities for discrete events (i.e., whether a person attributes a certain human-like characteristic to a robot or not), substantial quantification variance is to be expected (see also [15]).The empirical proportion of unexplained variance (i.e., 47.2%) was thus highly similar to the proportion of quantification variance one would expect with a perfect data-to-model fit (i.e., 47.5%).
An additional factor would result in an increase of 6.8% in the proportion of explained variance.The set of items thus largely tapped into a single dimension only.These findings supported the expected unidimensionality of the Invariant ordering To test the hypothesis that items on the scale would be invariantly ordered according to the diffi-culty of attributing them to robots, the sample was split in half and item difficulties were estimated twice: once for participants with even and once for participants with odd identification numbers.Consistent with the hypothesis of person-independent item difficulties, the two estimates of the 37 items were highly similar, r = .97,p < .001.The item invariance plot is provided in Fig. 2a.As can be seen in this figure, the ordering of items on the scale by their difficulty to ascribe them to the robot is similar across the samples of participants with even and odd identification numbers.We also performed the 'Wright's challenge' (see [3]).For this, the sample was split in half once more, but this time according to the participants' estimated predispositions.More specifically, item difficulties were estimated for participants with high predispositions and for participants with low predispositions to anthropomorphize separately.The estimates of the 37 items were again highly similar, r = .92,p < .001.The item invariance plot is provided in Fig. 2b.As can be seen in this figure, the ordering of items on the scale is also similar across the samples of participants with high and low individual predispositions to anthropomorphize.

Construct Validity
To test the hypothesis that the ordering of items of the anthropomorphism scale according to their difficulty was related to perceived human nature and human uniqueness, item difficulties were compared with the mean scores on these concepts.Results showed moderate but significant correlations between item difficulties and human nature (r = .60,p < .001)and between item difficulties and human uniqueness (r = .44,p < .01).The higher a characteristic was rated on human nature and/or human uniqueness by participants in the content validity groups, the less likely participants in the survey were to attribute that specific human-like characteristic to a robot.These results support the expectation that the difficulty to attribute a specific characteristic to a robot is related to that characteristic's perceived human nature and human uniqueness.In line with previous literature [10], a significant correlation was found between human nature and human uniqueness (r = .71,p < .001),indicating that items that were rated high on human nature were more likely to be indicated as being uniquely human, whereas items that were rated low on human nature were less likely to be indicated as being uniquely human.This finding supports the expectation that human-like characteristics can be mapped onto a one-dimensional scale.

Conclusions
In the current study, a 37-item anthropomorphism scale was tested on its construct validity.Results showed that people's responses sufficiently fitted the Rasch model, indicated by an acceptable data-to-model fit.Also, all human-like characteristics included in the scale could be successfully mapped onto a single dimension, confirming the hypothesis that anthropomorphism can be measured on a single scale of equal additive units.Moreover, an invariant ordering was found when splitting the sample in half, supporting the expectation that human-like characteristics can be invariantly ordered with respect to the probability of ascribing them to a robot.As expected, items high in human nature were found to be more difficult to attribute to a robot than those low in human nature.Additionally, uniquely human characteristics were shown to be more difficult to attribute to robots than non-unique ones, as indicated by their locations on the scale.These results indicated that the difficulty of attributing a specific item of the scale to a robot was related to that item's perceived human-likeness.As such, item difficulties are a valuable indicator of how a specific characteristic perceived human-likeness, which is an important aspect in the measurement of anthropomorphism.In sum, the scale had high construct validity and the method so far seems to be suitable for measuring anthropomorphism.In the next study, the scale's convergent validity will be tested by comparing it with existing measuring instruments.

Study 2
The current study was designed to investigate the relation between different measuring instruments for anthropomorphism in people's evaluations of two different robots.Data of this study will be used to test the convergent validity of our anthropomorphism scale.With convergent validity we refer to the extent to which estimates obtained with the scale converged with those obtained with two commonly used measuring instruments for anthropomorphism: the Waytzinstrument and the Godspeed-instrument.We tested to which extent estimates made with the three different instruments would be related.In addition, two different robots were evaluated on their perceived human-likeness, and we expected an invariant ordering of the items on the scale for those two robots.

Participants and Design
One hundred and thirty one participants sampled through social media participated in this study.Of these 131 participants, 48 were male and 83 female (M age = 34.86,S D age = 17.59,Range = 13 to 77).They were randomly assigned to one of two groups in which they watched a video of a robot that either resembled mostly human-like physical features (n = 68) or mostly human-like cognitive features (n = 63).The two robots did not differ on any of the three included anthropomorphism measuring instruments (all t's < 1.31, all p's > .20), allowing data of both experimental conditions to be combined into a single sample for the analyses.All participants participated voluntarily, gave informed consent, and were not compensated for participation.

Materials and Procedure
Participants performed the study online.On the welcome page, they could choose to complete the study in Dutch or in English, after which they were provided information about the procedure of the study in their preferred language.Next, they watched a short (about 1 min) video of one of the two robots, depending on the experimental condition they were in.The robot with human-like physical features was running around and pouring water in a cup (the video can be found at https://goo.gl/npyDfG),and the robot with human-like cognitive features appeared to become angry at a person who left dirt on the floor (the video can be found at https://goo.gl/i2Sqgg).
After participants watched the video of one of the two robots, they completed the three measuring instruments for anthropomorphism.The first one was a 25-item version of the anthropomorphism scale that was developed and tested in Study 1 and adjusted for the current study (see Table 2 for the items in this study).Some of the most difficult items, as well as the easiest one, were deleted because they were expected to contribute little to estimations of people's predisposition to anthropomorphize (i.e., items 1, 3, 6, 7, 8, 10, and 37 in Table 1).Three items were deleted because the construct validity test in Study 1 showed that they did not sufficiently relate to anthropomorphism (i.e., items 24, 28, and 32, in Table 1).Item 15 in Table 1 was deleted because it was phrased as a double negation.Items were formulated as a statement which could be answered with yes (coded with a 1) or no (coded with a 0).The second questionnaire was the Godspeed-instrument (see "Appendix B"), which consisted of 5 items (α = .71)with a 5-point response format.Five dummy items were included to make the goal of this questionnaire and of this study less obvious.Participants' averaged responses across the five target items were used in the analyses.The third questionnaire was the Waytz-instrument (see "Appendix A"), which consisted of 6 items (α = .78)with a 5-point response format.
Participants' averaged responses across the six items were used in the analyses.
After completing these questionnaires, participants completed several questionnaires that were related to the original purpose of the study.These questionnaires measured concepts such as desire for control, need to belong, trust, and predictability.Data on these questionnaires will not be used in our analysis.Finally, participants indicated their age and gender, were debriefed and thanked for their participation.The study took approximately 10 min to complete.

Model Test
In this section, the hypotheses that items and persons can be mapped onto a single one-dimensional scale, and that items are invariantly ordered according to the difficulty with which they are ascribed to a robot are tested.The section has the same structure as in Study 1. Item fit As in Study 1, most items fitted the model sufficiently with infit and outfit MS values ≤ 1.50 (see Table 2 for estimated item difficulties), except for items 13 ('understands Person fit Individual predispositions to anthropomorphize were estimated with a reliability of α = .78.The average predisposition was M = .33logits (SD = 1.59,Range = − 5.70 to 5.47).For a reasonable eight out of 131 participants (6.1%), the model prediction did not fit the data as indicated by a t-value of t ≥ 1.96.Dimensionality The Rasch model explained 63.7% of the variance in the data.If the model would fit perfectly, then 63.5% of the overall variance would be quantification variance.The empirical proportion of unexplained variance (i.e., 36.3%) was thus highly similar to the proportion of quantification variance expected with a perfect data-to-model fit (i.e., 36.5%).An additional factor would result in an increase of a trivial 3.6% in the proportion of explained variance.The set of items thus largely tapped into a single factor only.Invariant ordering The ordering of the item difficulties was highly similar to that obtained in Study 1, as indicated by a strong positive correlation between the item difficulties estimated in Studies 1 and 2 (r = .88,p < .001,see Fig. 3a for the invariance plot).This result supports the expectation that the probability with which the various human-like characteristics are ascribed to robots is largely independent of the individual's predisposition to do so.In other words, the scale showed an ordering of human-like characteristics that is similar for different individuals in different samples.
To explore whether the expected invariance of item difficulties also holds across the two different robots that were evaluated, the sample was split in half with respect to the robot that was evaluated.Consistent with the hypothesis of robot-independent item difficulties, the two sets of estimates (one for the robot with mostly physical human-like features, the other for the robot with mostly cognitive human-like features) were highly similar (r = .97,p < .001,see Fig. 3b for the invariance plot).This finding again supports the expectation that human-like characteristics are invariantly ordered with respect to the difficulty to attribute them to robots.

Convergent Validity
To test to what extent estimates obtained with the anthropomorphism scale converged with the two commonly used measuring instruments for anthropomorphism, the three scales were compared.Results indicated a low, but statistically significant correlation between our scale and the Godspeedinstrument (r = .22,p = .01).After correcting for measurement error attenuation, the correlation remained rather low (r = .30,for computational details, see [6]).In addition, a moderate and statistically significant correlation was found between our scale and the Waytz-instrument (r = .46,p < .001).After correcting for measurement error attenuation, this correlation remained rather moderate (r = .59).

Conclusions
In this study, an adjusted 25-item version of the anthropomorphism scale was compared with two available measuring instruments for anthropomorphism to test for convergent validity.Results showed that, as in Study 1, people's responses sufficiently fitted the Rasch model, indicated by an acceptable data-to-model fit.This result supported the expected invariant ordering of human-like characteristics with respect to their item difficulty.Additionally, the scale correlated with existing measures of anthropomorphism, but not to the extent to which we could claim convergence.This indicates that our understanding of anthropomorphism is still limited.We will elaborate more on this in the general discussion.
No differences between the robots were found on any of the measuring instruments for anthropomorphism.The ordering of items on our anthropomorphism scale was also highly similar for both robots, indicating that the two robots in this study were perceived as equally human-like.An interesting question thus is how well the scale differentiates between other types of agents.In the next study, people's predispositions to anthropomorphize humans, robots, computers, and algorithms will be compared.

Study 3
The current study was originally designed to investigate people's responses to different types of players in a social game: the Ultimatum Game (see [13]).Data of this study will be used to explore the anthropomorphism scale's sensitivity for differentiating between various types of technologies.We compared people's responses to humans, robots, computers, and algorithms.We hypothesized that the scale would successfully differentiate humans from robots.In addition, we explored whether a difference exists between the attribution of human-like characteristics to computers and algorithms.

Participants and Design
Two hundred and two participants (89 males and 113 females; M age = 34.69,S D age = 11.12,Range = 18 to 76) were recruited via Amazon Mechanical Turk (MTurk) to partici-pate in an online experiment.They were randomly assigned to one of four experimental conditions (Agent type: human vs. robot vs. computer vs. algorithm) of a between-subjects design.Participants were paid $1 for their participation.

Materials and Procedure
Participants performed the study online.To create a social interaction that provides opportunities to anthropomorphize the agents, participants played the Ultimatum Game (see [13]).In this game, two players divide a sum of credits.The first player (the agent) proposes a certain division and the second player (the participant) decides whether (s)he accepts or rejects the offer.Participants in the algorithm condition were told that there were no other players available, and that they would be connected to an algorithm that would 'randomly generate offers' during the game.Participants in the other three conditions were told to be playing the game with humans, robots, or computers.During the game, participants were shown pictures of the other players.An example of each of the agents is provided in Fig. 4.
After playing the Ultimatum Game, participants completed an adjusted 19-item version of the anthropomorphism scale.The main dependent variable in this study was participants' behavior in the ultimatum game, and the anthropomorphism scale was included to explore the possible mediating/moderating role of anthropomorphism.Since the latter was not the main goal, and due to time constraints, fewer indicators were used in this study.Items for this adjusted version were selected from the original set of 37 items in such a way that they would still cover a wide range of the anthropomorphism continuum (see Table 3 for the items in this study).Items were formulated as a statement which could be answered with yes (coded with a 1) or no (coded with a 0).
After completing the anthropomorphism questionnaire, participants indicated their age and gender, were debriefed, thanked for their participation, and paid through the MTurk system.The experiment took approximately 6 min to complete.

Model Test
In this section, the hypotheses that items and persons can be mapped onto a single one-dimensional scale and that items are invariantly ordered according to the difficulty with which they are ascribed to agents are tested.The section has the same structure as in studies 1 and 2. Item fit Most items fitted the model sufficiently with infit and outfit MS values ≤ 1.50 (see Table 3 for estimated item difficulties), except for items 12 ('ambitious', outfit MS = 1.59), 33 ('estimate distances', outfit MS = 1.97), and 30 ('calculate', outfit MS = 5.59).
Item difficulties were estimated with a reliability of α = .98.The average item difficulty was anchored at M = .00logits (SD = 1.83,Range = − 4.15 to 3.05).Infit MS values of the 19 items ranged from 0.76 to 1.29 (M = 1.00,SD = 0.17).Outfit MS values of the 19 items ranged from 0.46 to 5.59 (M = 1.14, SD = 1.12).Person fit Individual predispositions to anthropomorphize were estimated with a reliability of α = .88.The average predisposition was M = − .15logits (SD = 3.28, Range = − 5.79 to 5.31).For a reasonable ten out of 202 participants (5.0%), the model prediction did not fit the data as indicated by a t-value of t ≥ 1.96.Dimensionality The Rasch model explained 52.4% of the variance.If the model would fit perfectly, then 52.1% of the overall variance would be quantification variance.The pro-portion of unexplained variance (i.e., 47.6%) was highly similar to the proportion of quantification variance expected with a perfect data-to-model fit (i.e., 47.9%).An additional factor would result in an increase of 8.3% of the explained variance.
Next, the ordering of items in the current study was compared with that of Study 1. Results showed a moderate but significant correlation between the two estimates (r = .58,p < .01,see Fig. 5 for the invariance plot).Despite the significant correlation between the item difficulties, many of the estimated difficulties appeared outside of the 95% confidence interval, indicating substantial differences between the two studies.Some of these items (i.e., 'pick up objects', or 'walk') had lower item difficulties in Study 1 than in Study 3, and some (i.e., 'calculate', or 'understands language') had lower item difficulties in Study 3 than in Study 1.Some human-like characteristics were thus more easy or more difficult to attribute to a specific type of agent than other characteristics.More specifically, the expected invariant ordering of human-like characteristics with respect to their probability of being ascribed to non-human agents was not supported when other agents than robots were evaluated.In the following sections, we will investigate the invariant ordering of the items in Study 3 in more detail.

Invariant Ordering of Items
To explore which human-like characteristics differed in their probability of being attributed to humans, robots, computers, and algorithms, item difficulties were estimated separately  for each of these four player types.Although all correlations between the four sets of item difficulties were significant (see Table 4), some of the characteristics were clearly different in their probability of being ascribed to specific agents.Figure 6 displays the item invariance plots between each of the agent types.As can be seen in this Figure, three of the 19 items (i.e., items 18 'jump', 34, 'pick up objects', and 35 'walk') were consistently less likely to be ascribed to algorithms and computers than to robots and humans.This should not come as a surprise, as computers and algorithms lack the morphology that accommodates such physical activities.This finding revealed that a comparison between the agent types is unfair on specific attributes.Using a computer's ability to perform physical features as a measurement of its human-likeness is similar to judging people who are bound to a wheelchair as less human than their able counterparts for not being able to jump.

Sensitivity in Differentiating Between Agents
To test the extent to which the anthropomorphism scale differentiates humans from other player types, a one-way Analysis of Variance (ANOVA) was conducted with agent type as independent variable and the individual predisposition to anthropomorphize as dependent variable.Results indicated a statistically significant effect of agent type, F(3, 201) = 42.04,p < .001,η 2 = .39.More specifically, anthropomorphism was highest for humans (M = 3.50, SD = 2.35), followed by robots (M = − 0.63, SD = 2.81), algorithms (M = − 1.53, SD = 2.82) and computers (M = − 1.59, SD = 2.29), see Fig. 7a.
Pairwise comparisons (LSD) showed a statistically significant difference between humans and all other agents, t(198) = 11.03,p < .001,d = 1.89.Post-hoc comparisons using Bonferroni correction indicated that differences between any two groups other than human were not significant.
Interestingly, after removing the three morphology-related items (i.e., items 18, 34, and 35) from the scale, the differences in anthropomorphism between non-human player types became smaller, with estimations being highest for humans (M = 3.48, SS = 2.34), followed by robots (M = − 0.83, SD = 2.90), algorithms (M = − 1.32, SD = 2.83) and computers (M = − 1.35, SD = 2.37), see Fig. 7b.This could be caused by the nature of the experiment, because all players behaved in the exact same way during the Ultimatum Game.When those players were subsequently compared with respect to characteristics that they were equally likely to possess based on the behavior that they showed, it should not come as a surprise that no differences between them were found.Computer (e) computer vs human

Robot
(f) robot vs human

Conclusions
In the current study, an adjusted 19-item version of the anthropomorphism scale was used to investigate the scale's sensitivity.Results showed that, as in Studies 1 and 2, people's responses sufficiently fitted the Rasch model, indicated by an acceptable data-to-model fit.This result supported the expected invariant ordering of the human-like characteristics with respect to their item difficulty.An adequate level of sensitivity was found.The scale was able to differentiate humans from different types of technological players, but it did not differentiate those players from each other.Presumably the conceptual differences between those player types were too small to be detected by the current version of the scale.Additionally, the scale's sensitivity dropped when three morphology-related items were removed from the analysis.

General Discussion
The current research was designed to explore whether anthropomorphism can be successfully measured using the Rasch model, whether the concept can be mapped onto a onedimensional scale, and whether human-like characteristics are ordered in a way that is similar for all individuals in their encounter with robots and other types of agents in different contexts.We argued that human-like characteristics can be ordered according to the probability with which they are ascribed to robots, and that this ordering of human-like characteristics on the range of perceived human-likeness is similar for all individuals in their encounter with different types of agents.We developed a set of human-like characteristics and used data from three studies to test the scale's psychometric qualities.These studies had designs with differing contexts and experimental conditions, and people's predispositions to anthropomorphize were compared in different samples with different types of agents.
In the first study, we hypothesized that items and persons could be mapped onto a single one-dimensional scale, and that items would be invariantly ordered according to the difficulty with which they are attributed to a robot.Additionally, the estimated difficulties with which the 37 characteristics are attributed to a robot were expected to be related to their perceived human nature and human uniqueness.In the second study, we tested the extent to which estimates made with three different measuring instruments for anthropomorphism would be related.In the third study, we hypothesized that the anthropomorphism scale would successfully differentiate humans from other types of agents.
All hypotheses were (at least partially) confirmed, and the next sections describe in more detail the dimensionality of anthropomorphism, implications for further research, and the comparison of the three different measuring instruments.

Dimensionality of Anthropomorphism
Across studies, an invariant ordering of human-like characteristics was found, indicating that this ordering was similar for different people in their encounter with different types of agents in different contexts.This finding was supported by dimensionality tests that consistently indicated that the data could be represented in a one-dimensional structure.More specifically, in each of the studies, an additional factor would result in only a small increase in the proportion of explained variance.This result supported the hypothesis that anthropomorphism can be represented as a one-dimensional construct.
Tests of construct validity showed significant correlations between item difficulties and their perceived human nature and human uniqueness, supporting the expectation that the scale measures anthropomorphism.Together, these findings indicate that human-like characteristics are ordered in such a way that they range from low to high on a single dimension.

Implications for Further Research
Findings on some of the characteristics have important implications for future research and thus need some further consideration.For example, in Study 1 the item 'Experience pain' was rated as extremely low in human uniqueness and as medium in human nature, but it appeared to be the most difficult one to ascribe to a robot.One possible explanation for this unexpected finding could be that a nervous system is necessary for experiencing pain, which is not unique for humans, but is something that robots clearly do not have.
This result also raises the issue of physical versus cognitive capacities.For humans and other organisms, the ability to move around in the (physical) environment is a given, whereas for certain technological artifacts this may not be so obvious.Likewise, future artificial intelligence may create agents with high mental capacities without a body, which makes it easier for them to solve a moral dilemma than to pick up an object.This issue may become apparent in the near future, and more research is needed to further investigate this.A large benefit of the Rasch approach is the ability to select items for a specific design, and as such can cope with such technological developments.
Another important finding was that when morphologyrelated characteristics (i.e., the items about a robot being able to jump, walk, and pick up objects) were disregarded in Study 3, the differences on anthropomorphism between robots, computers, and algorithms decreased.Future research can be designed to explore whether we can create a measure for anthropomorphism that is universal for various types of agents (e.g., humans, animals, technologies, and deities).

Comparison of Measuring Instruments
The anthropomorphism scale was compared with two available measuring instruments for anthropomorphism to test for convergent validity.These available instruments were the Godspeed-and Waytz-instruments.We tested whether measurements obtained with the three instruments would be related.This hypothesis was confirmed by significant correlations between the scale and both other instruments, but these correlations were rather low.
Although these correlations were smaller than expected, this should not come as a surprise, given the nature of the different instruments.They were all developed with different views on the concept anthropomorphism.The Godspeedinstrument focuses on mostly appearance-related features, whereas the Waytz-instrument focuses mostly on cognitionrelated features.

Limitations and Future Research
In all studies, the items of the anthropomorphism scale were ordered alphabetically, which may have influenced people's responses to certain items because of order effects.The items on the scale should be ordered randomly in future studies to prevent the occurrence of such ordering effects.
All studies were performed with mainly Dutch and American participants, so their cultural backgrounds and experiences with technology could have been quite similar.Earlier work has shown that people's tendencies to attribute humanlikeness to non-humans can be related to religion [14], and that people from different countries (such as individualistic versus collectivistic ones) respond differently to computers [19] and evaluate robots differently [28].It would be interesting to investigate whether cultural differences influence people's predispositions to anthropomorphize.
In addition, we used social media for sampling participants in studies 1 and 2, and thus relied on data gathered from experimenters' acquaintances.This may have led us to end up with homogeneous groups of participants, which could partly explain the high consistency between those studies.For increasing our understanding of the concept anthropomorphism, it is important to also collect data from more diverse groups and check whether the high consistency prevails.
In Study 2, each statement was presented as a general statement about 'a robot', and not specifically aimed at the robot that participants watched in the short movie clip.This could explain why no differences in anthropomorphism were found between the two robots.Future research that is designed to investigate evaluations of different robots should therefore not use the scale in a general way, but rather phrase each question to be specifically about a certain robot (or other non-human object).
We did not test whether people had previous knowledge about or experience with robots.This experience could influence people's responses to those robots (see e.g., [21,25]) and should therefore be taken into account in future studies.The Rasch model offers a promising approach to investigate how such experiences affect a person's predisposition to anthropomorphize, the difficulty of ascribing a specific characteristics to a robot, or both.

Conclusions
Despite some limitations, we have shown that anthropomorphism can be measured on a one-dimensional scale, and that items are ordered invariantly when this scale is applied to robots.The Rasch model thus provides a reliable way of measuring anthropomorphism.Because of the invariant ordering of the human-like characteristics, the scale provides opportunities for comparing various types of agents and robots with each other across studies.The method also provides the possibility to select items based on the context of the study, making it a versatile tool for measuring anthropomorphism.
Findings of Study 3 showed that different types of agents (except humans) were difficult to distinguish, presumably because the behavior of these different agents was identical.It is therefore an interesting question how the method performs when interactions with different types of agents are evaluated.Ultimately, a scale can be developed that contributes to our understanding of what makes people attribute human-like characteristics to social robots and other non-humans.This understanding can help designers to create robots as social entities that are accepted as members of our society.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecomm ons.org/licenses/by/4.0/),which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

A Waytz-Instrument
Items of the anthropomorphism questionnaire adapted from [32], answered on a 5-point or a 7-point response format.

Fig. 1
Fig. 1 Item-person map of Study 1, displaying the estimates of participant's predisposition to anthropomorphize and the item difficulty linked with each human-like characteristic mapped onto a single scale of equal additive units.Each number on the right represents an item.Each dot on the left represents a person

Fig. 2
Fig. 2 Item invariance plots of the item difficulties of subjects with a even and odd identification numbers and b high and low predispositions to anthropomorphize in Study 1.Each number represents an item,

Fig. 3
Fig. 3 Item invariance plot of the item difficulties of a Studies 1 and 2, and b robots with mostly physical and cognitive human-like features.Each number represents an item, corresponding with the numbers in Table 2. Red lines indicate 95% confidence intervals.(Color figure online)

Fig. 4 123 4 . 2
Fig. 4 Examples of pictures used in the a human, b robot, c computer, and d algorithm groups in Study 3

Fig. 5
Fig. 5 Item invariance plot of the item difficulties in Studies 1 and 3.Each number represents an item.Red lines indicate 95% confidence intervals.(Color figure online)

Fig. 6
Fig. 6 Item invariance plots of the item difficulties of each of the four experimental conditions in Study 3: algorithm, computer, robot, and human.Each number represents an item.Red lines represent 95% confidence intervals.(Color figure online)

Fig. 7
Fig. 7 Visualization of averaged person measures on the scale a before and b after removing the biased items in Study 3. Whiskers represent 95% confidence intervals The low correlation also clearly shows that we still have limited understanding of what anthropomorphism entails, and thus what indicators are best used in its measurement.It is only when multiple different measuring instruments converse that we can claim to fully grasp what the indicators of anthropomorphism are.We believe that the Rasch model is a promising tool to uncover such indicators and thus what anthropomorphism truly entails.

"
To what extent does the robot have thoughts of its own?" "To what extent does the robot have intentions?""To what extent does the robot have a free will?"To what extent does the robot have a consciousness?" "To what extent does the robot have desires?""To what extent does the robot have values and norms?" "To what extent does the robot experience emotions?"

Table 1
Item difficulties (δ), infit-and outfit mean squares, average values of human uniqueness and human nature (with the latter adjusted to the same scaling by subtracting 1 and dividing it by 7) of the anthropomorphism scale in Study 1