Asynchronously Embedding Psychological Test Questions into Human–Robot Conversations for User Profiling

Psychological variables of a person (e.g., cognitive abilities, personality traits, emotional states, and preferences) are valuable information that can be utilized by social robots to offer personalized human–robot interaction. These variables are often latent and inferred indirectly from a third-person perspective based on an individual’s behavioral manifestations (e.g., facial emotion expressions), and hence the true values of inferred psychological variables remain unknown to a robot observer. Although earlier studies have employed robot-administered psychological tests to infer psychological variables based on an individual’s first-person responses, these tests were formally presented and could be tedious to some users. To leverage the validity and reliability of well-established psychological tests for user profiling with ease, the present study examined the possibility of asynchronously embedding psychological test questions into casual human–robot conversations. In our experiment using a big-five personality inventory, the verbal responses from users to these asynchronous test questions were then compared with the written responses to the same personality test. The personality measures estimated from the two approaches correlated strongly in a young adult population but only moderately in an older population. These findings demonstrate the validity of the proposed asynchronous method for psychological testing in human–agent interactions and suggest some caveats when this testing method is applied to older adults or other special populations.

Social robots have gradually moved from research laboratories into shopping malls [5] and consumer markets (e.g., Cozmo and RoBoHoN) to provide service and entertainment. According to the International Federation of Robotics, 1.7 million social and entertainment robots were sold in 2015, and sales are projected to exceed 7.4 million units in 2025 [6]. In another example, socially assistive robots [5] have been gradually adopted to assist in the health care of children [7], adults [8], and the elderly [9].
For successful long-term interactions with humans, social robots are recommended to identify returning users, remember past interactions with them, and get to know them in order to offer personalized responses [10,11]. For example, a user may be annoyed by the same non-personalized greeting question from a robot: "Do you mind telling me more about yourself?" As another example, the personal spatial zone a user prefers during human-robot interaction is influenced by that user's personality [12]. Therefore, it is important for a social robot to learn the personal information and idiosyncrasies of a user so as to foresee and meet the social, emotional, and cognitive needs of that user [13].
Despite its importance, understanding or profiling a user is a challenging task, especially when it concerns hidden psychological variables (e.g., traits, states, values, and preferences). Although not directly observable, a user's psychological variables (e.g., personality) have been inferred indirectly from behavioral manifestations (e.g., linguistic or prosodic patterns [14]) or psychological tests (e.g., a personality questionnaire [15]) to facilitate human-robot interaction (HRI). Note, however, that inferences from behavioral manifestations usually rely on observations from a third-person perspective, and hence the true values of inferred psychological variables actually remain unknown to an observer, be it a social robot or a real person. By contrast, psychological tests, particularly those with sound psychometric properties, can reliably estimate the values of an individual's psychological variables based on the individual's first-person responses.
Although psychological tests have been utilized in previous HRI studies, they have generally not been employed as user-profiling tools or used to their full potential for improving HRI. In most HRI studies, psychological tests were administered by researchers using paper questionnaires to understand a user before or after HRI [15], rather than during HRI so that a robot could dynamically adapt its behavior to individual users. Only a few recent studies have explored the possibility of having a robot administer psychological tests during HRI, and all of these robot-administered psychometric evaluations used standard testing procedures as in human-administered tests (e.g., [16]), which may sometimes appear uninteresting or even tedious to users.
To circumvent the formality problem, here we propose using asynchronous test questions (ATQs) as a user-friendly way of administering a psychological test during HRI. Specifically, our proposed testing procedure consists of three steps: (1) obtaining items or questions of a psychological test; (2) embedding parts of these questions into contextually relevant periods in a conversation during HRI as asynchronous mini-tests; and (3) aggregating all the answers to these ATQs for scoring the psychological test. Compared to a traditional psychological test that usually assesses one psychological domain with items temporally grouped together, ATQs from a psychological test are not given to a testee all at once. As a result, ATQs that assess a psychological domain as a whole are less susceptible to the issues of sustained attention and cognitive demand relative to their original test. Furthermore, because ATQs are presented casually in contextually relevant periods in a conversation rather than formally in a test setting, it is less likely for a testee to be self-conscious about being tested and modify his/her responses accordingly [17].
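The three-step procedure above can be sketched as follows. The item texts, slot names, and the `simulate_reply` stub are illustrative assumptions, not the study's actual stimuli or dialogue manager.

```python
# A minimal sketch of the three-step ATQ procedure: (1) obtain test items,
# (2) embed them into contextually relevant conversation slots, and
# (3) aggregate the answers for scoring. All names here are hypothetical.

# Step 1: items of a psychological test (texts are illustrative).
TEST_ITEMS = {
    "extraversion_1": "Are you, in general, an extraverted, enthusiastic person?",
    "openness_2": "Do you consider yourself a conventional, uncreative person in general?",
}

# Step 2: map each item to a contextually relevant slot in the HRI session.
CONVERSATION_SLOTS = {
    "greeting_small_talk": "openness_2",
    "before_toy_playing": "extraversion_1",
}

def simulate_reply(question):
    # Placeholder for speech recognition plus Likert mapping (1-5).
    return 4

def ask_atq(slot, answers):
    """Present the ATQ assigned to this conversational slot and store the reply."""
    item_id = CONVERSATION_SLOTS.get(slot)
    if item_id is None:
        return  # slot carries no test question; plain small talk
    answers[item_id] = simulate_reply(TEST_ITEMS[item_id])

# Step 3: aggregate all ATQ answers once the session ends.
answers = {}
for slot in CONVERSATION_SLOTS:
    ask_atq(slot, answers)
print(answers)
```

In a deployed system, the robot would speak each question aloud at its slot and parse the spoken reply; the aggregation step is what allows the temporally scattered answers to be scored as one test.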
As has been reported previously, different formats of the same psychological assessment-such as computer, questionnaire, or interview-could lead to differences in evaluations [18]. To examine whether a temporally fragmented psychological test could yield results comparable to those of the original test, and thus substitute for it, the present study used a social robot to administer a big-five personality test amidst a broader HRI session for each research participant using the ATQ procedure. We chose a big-five test because big-five personality measures are commonly used predictors of human behavior [19,20] and important dimensions in HRI [21]. The verbal responses from each participant to the ATQs were then compared with the written responses to the same personality test to validate the ATQ procedure. The ideal result would be a perfect positive correlation between verbal and written responses to the same test, although theoretically the upper bound of this correlation is the test-retest reliability of the psychological test. The procedure and results of our ATQ experiment are detailed in the following sections.

Methods
The ATQ experiment was part of a larger human-robot interaction study of both young and older adults. Approved by the Research Ethics Office of National Taiwan University (REC 201803HS017), the 1-h larger study consisted of three main HRI events: robot-administered cognitive testing, followed by robot-accompanied toy-playing, and then robot-assisted tablet-using. All participants gave written informed consent for their participation in the study.

Human Participants
Relative to young adults, older adults tend to have stronger negative attitudes toward robots [22], which may, in turn, affect their human-robot interactions in general and our robot-administered psychological testing in particular. Therefore, we experimented with our ATQ procedure on two age groups of participants to examine its suitability for use with both young and older adults, which cannot be taken for granted [23].

Social Robot
We used a programmable humanoid robot-RoBoHoN (Sharp Co., Ltd.)-in our study. RoBoHoN is 19.5 cm tall when standing. So that research participants could maintain eye contact with RoBoHoN during HRI, the RoBoHoN unit used in our experiment was placed on a table to converse with the participant (Fig. 1). RoBoHoN has built-in speech-to-text and text-to-speech engines for speech recognition and production, respectively. Although it could be fully autonomous, our RoBoHoN was remotely controlled by a human operator in a dark observation room to accurately detect sentence endpoints in participant speech and to manage relevant conversational contingencies, such as speech pauses, repetitions, and queries, that arise during memory recall and decision-making.

Psychological Test
We used the Ten Item Personality Inventory (TIPI), in which each of the five personality dimensions-Extraversion, Agreeableness, Conscientiousness, Emotional Stability, and Openness to Experience-is measured by two items [24]. The personality inventory was administered twice to each participant to validate whether the original test can be substituted by its ATQs. The first administration was conducted verbally by RoBoHoN using ATQs spread across each 1-h HRI session (Fig. 2), whereas the second was conducted by an experimenter right after the whole HRI session as a post-study questionnaire in TIPI's original paper form. The order of the two administrations was not counterbalanced, to prevent participants from experiencing ATQs as repeated "test" questions, which would defeat the purpose of using ATQs in non-testing contexts. Both administrations imposed a 5-point response format on each item: Strongly Agree = 5, Slightly Agree = 4, Neutral = 3, Slightly Disagree = 2, and Strongly Disagree = 1. Whenever a participant's response to an ATQ could not be clearly mapped onto this 5-point Likert scale, RoBoHoN would verbally describe the five response options in a follow-up question: "Do you strongly or slightly agree, strongly or slightly disagree, or are you neutral?" The ATQs used in this study were not a verbatim copy of the original TIPI items [24]; they were adapted to sound natural in human-robot conversations. The ten pairs of personality-describing adjectives, as shown in Tables 1 and 2, were embedded in sentences such as "Are you, in general, an [anxious, easily upset] person?", "Do you consider yourself a [conventional, uncreative] person in general?" or "I'm curious whether you are a [dependable, self-disciplined] person in general." The original instruction of TIPI is: "Here are a number of personality traits that may or may not apply to you.
Please write a number next to each statement to indicate the extent to which you agree or disagree with that statement. You should rate the extent to which the pair of traits applies to you, even if one characteristic applies more strongly than the other." It should be noted that the human-robot conversational contexts were prearranged such that the 10 pre-programmed ATQs were asked by RoBoHoN as parts of small talk before or after an interaction event, such as toy-playing or cognitive testing. As a result, the presentation order of the 10 personality items was fixed across participants and not identical to the order in the original test. For example, when RoBoHoN greeted a participant at the beginning of the study, it expressed interest in learning more about the participant and asked whether he/she was a conventional person. In another example, right before a toy-playing event, RoBoHoN asked a participant whether he/she was a person open to new experiences.
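To illustrate how the aggregated ATQ answers are turned into dimension scores, here is a minimal scoring sketch under the study's 5-point format. The dimension-item pairings follow the published TIPI scoring key; the 5-point reverse-scoring rule (6 minus the raw rating) is our adaptation for illustration, since the original TIPI uses a 7-point scale.

```python
# TIPI-style scoring sketch for the 5-point format used in this study.
# Reverse-keyed items are recoded as 6 - raw on a 1-5 scale.

LIKERT = {
    "strongly agree": 5, "slightly agree": 4, "neutral": 3,
    "slightly disagree": 2, "strongly disagree": 1,
}

# (normal item, reverse-keyed item) per dimension, by TIPI item number.
DIMENSIONS = {
    "Extraversion": (1, 6),
    "Agreeableness": (7, 2),
    "Conscientiousness": (3, 8),
    "Emotional Stability": (9, 4),
    "Openness to Experience": (5, 10),
}

def score_tipi(raw):
    """raw: dict mapping TIPI item number -> 1-5 rating; returns dimension scores."""
    scores = {}
    for dim, (normal, reverse) in DIMENSIONS.items():
        scores[dim] = (raw[normal] + (6 - raw[reverse])) / 2
    return scores

# Hypothetical participant: neutral everywhere except a strongly
# extraverted response pattern on items 1 and 6.
responses = {i: LIKERT["neutral"] for i in range(1, 11)}
responses[1] = LIKERT["strongly agree"]      # extraverted, enthusiastic
responses[6] = LIKERT["strongly disagree"]   # reserved, quiet (reverse-keyed)
print(score_tipi(responses)["Extraversion"])  # -> 5.0
```

The same scoring function can be applied to both the verbal (ATQ) and written administrations, which is what makes the two sets of dimension scores directly comparable.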
Verbal responses to ATQs were initially coded from video recordings by two independent coders from our research team, who knew the purpose of the present study but were not provided with the participants' written responses at the time of behavioral coding. The coding instructions were as follows: "Please help label the degree of agreement expressed in each participant's verbal response by a number from one to five, with one being a strong disagreement, two being a slight disagreement, three being neutral, four being a slight agreement, and five being a strong agreement. If a particular response of a participant is difficult to judge, please make your best guess based on the response patterns of that participant, if any." There were indeed difficult cases and minor coding differences (ordinal Krippendorff's α = 0.995 and 0.87 for the younger and older groups, respectively). For example, an older participant shook her head without a verbal response to an ATQ but then verbally answered "Agree" when RoBoHoN reminded her of the five response options. The two coders discussed such difficult cases in person to resolve their coding differences and reach a consensus on each item score of each verbal response.
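The two-coder workflow described above can be sketched as follows. The codes are illustrative, and raw percent agreement is used as a simple stand-in; the study itself reported ordinal Krippendorff's α, which additionally weights how far apart the two ordinal codes are.

```python
# A minimal sketch of the two-coder workflow: compare item-level codes,
# measure raw agreement, and flag mismatches for in-person consensus.
# The codes below are illustrative, not the study's data.

coder_a = [5, 4, 3, 2, 5, 4, 1, 3, 4, 5]
coder_b = [5, 4, 3, 2, 4, 4, 1, 3, 4, 5]

# Indices where the two coders disagree; these go to a consensus discussion.
disagreements = [i for i, (a, b) in enumerate(zip(coder_a, coder_b)) if a != b]
agreement = 1 - len(disagreements) / len(coder_a)

print(f"raw agreement: {agreement:.2f}")     # simple proportion, not Krippendorff's alpha
print(f"items to discuss: {disagreements}")  # resolved by consensus in the study
```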
The final item scores were then compared with the scores obtained from the written responses of the same participants. Below we report the results of Pearson correlations, paired t-tests, and their associated equivalence tests [25]. For each test item or personality dimension, an ideal result would be a perfect positive correlation, and even no difference, between verbal and written responses; this translates to a correlation significantly larger than 0 and not statistically equivalent to 0, together with a difference not significantly different from 0 and statistically equivalent to 0.
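A minimal pure-Python sketch of the comparison statistics is given below. It computes Pearson's r and the paired-t statistic on illustrative data; p-values and the equivalence (TOST) bounds used in the study are omitted for brevity.

```python
# Pearson correlation and paired-t statistic between verbal (ATQ) and
# written scores. Data are illustrative, not the study's results.
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def paired_t(x, y):
    """Paired-samples t statistic; compare against t(n-1) for a p-value."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    md = sum(d) / n
    sd = sqrt(sum((v - md) ** 2 for v in d) / (n - 1))
    return md / (sd / sqrt(n))

# Hypothetical per-participant dimension scores from the two administrations.
verbal  = [4.0, 3.5, 5.0, 2.5, 4.5, 3.0]
written = [4.0, 3.0, 5.0, 3.0, 4.5, 3.5]
print(round(pearson_r(verbal, written), 3))
print(round(paired_t(verbal, written), 3))
```

An equivalence test (e.g., two one-sided t-tests against a chosen smallest effect of interest) would complement the paired t-test here by testing whether the verbal-written difference is statistically equivalent to 0.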

The Young Adult Group
The results of the younger participants are summarized in Table 1. For almost all of the ten test items and five personality dimensions, both null-hypothesis and equivalence tests suggest very strong positive correlations and little difference between the participants' verbal and written responses. In other words, the young participants responded similarly to the same questions regardless of whether these questions were synchronously or asynchronously presented, suggesting good validity of our proposed ATQ testing procedure in the younger adult population.

The Older Adult Group
The analysis results of the older participants are summarized in Table 2. Overall, the results were not as ideal as those of the younger participants. Several correlations between the two response formats were weak or even statistically not different from 0. Moreover, the distributions of verbal and written responses were not statistically equivalent for some test items and personality dimensions. In particular, the ratings of one's Openness to Experience trait were not very consistent between the two testing procedures, suggesting some potential problems in the use of the ATQ procedure with the older adult population.

Discussion
We used a brief personality test as an example to explore the possibility of using temporally fragmented psychological tests for user-friendly psychological assessment in human-robot interaction. As a first attempt at such an endeavor, the present study examined whether a psychological test, when administered informally in the form of asynchronous test questions, could still lead to assessment results comparable to those from a formal psychological test. Our results showed that the proposed ATQ testing procedure was quite successful with younger adults but only partially successful with older adults.
The discrepancies between participants' verbal and written responses, especially those observed in older adults, could result from several sources. First, participants, particularly those low in behavioral consistency, might not respond consistently to the same test question even with the same administration method. Second, participants, when receiving the same question through different perceptual modalities (audition vs. vision), might engage distinct attentive/memory processes [26,27] and thus make different choice decisions. Third, participants might mishear some questions unnaturally pronounced by a robot, a situation that had been observed during robot-administered cognitive assessments [16]. Fourth, participants might be less aware of being tested and hence less defensive when answering informal ATQs than formal test questions. Fifth, when asked the same question, participants might be more willing to disclose themselves to a robot than a researcher. Previous studies have found people to be more willing to share personal information with non-judgmental computer agents [18,28].
While the aforementioned factors might all affect the experimental results, and the current study design could not distinguish their respective contributions, the proposed ATQ procedure, when applied to the young adults, still yielded results on par with those from a conventional paper-based test. By contrast, some of these factors might affect older adults more pronouncedly than younger adults. Revisiting the video recording of each HRI session allowed us to offer some speculations as to why the ATQ procedure did not work as well in the older adults and how this might be overcome in the future. Below we elaborate on these findings together with other implications from this explorative study.

Note for Table 1: For the two-tailed paired t tests, t_null are the t values of the null-hypothesis tests, none of which suggested a difference in means between verbal and written responses; t_eq are the t values of the zero-equivalence tests, most of which suggested the differences in means to be statistically equivalent to 0 with power set to 0.8 and Cohen's dz = 0.57. *p < 0.05; **p < 0.01; ***p < 0.001

Validity, Reliability, and Applicability of ATQs
The validity and reliability of ATQs are bounded by those of their original psychological test. For example, in our younger group, the Pearson correlations between participants' verbal and written responses ranged from 0.77 (Openness to Experience) to 0.88 (Extraversion), which are comparable to the six-week test-retest reliability that ranged from 0.62 (Openness to Experience) to 0.77 (Extraversion) in the original TIPI study on young adults [24]. Additionally, as evidenced by the marked differences between the results of the two age groups in our study, the psychometric properties of ATQs may vary with different tested populations in a way similar to their original test, such as an overall reduced test-retest reliability of TIPI in older adults relative to younger adults [29]. The applicability of ATQs is critically constrained by the cognitive capabilities of testees. It is important to note that questions asynchronously presented in a conversation and questions simultaneously presented on paper are processed respectively through audition and vision, two fundamentally different perceptual modalities. Thus, compared to the questions of a paper-based test that can be seen all at once, ATQs can only be heard sequentially, and a testee cannot voluntarily reinspect words that have been presented or modify an answer to an earlier ATQ. Consequently, if a testee cannot maintain auditory attention or short-term memory throughout the entire presentation of an ATQ, he/she may not process the question as fully as in the case of a paper-based test.
Note for Table 2: For the two-tailed paired t tests, t_null are the t values of the null-hypothesis tests, and only the one concerning "Anxious, easily upset" suggested a difference in means between verbal and written responses; t_eq are the t values of the zero-equivalence tests, most of which suggested the differences in means to be statistically equivalent to 0 with power set to 0.8 and Cohen's dz = 0.66. *p < 0.05; **p < 0.01; ***p < 0.001

As a case in point, the older participants in our study might not have processed ATQs thoroughly. For example, when responding to the test item "Do you consider yourself a conventional, uncreative person in general?", many of our elderly participants paused, pondered, and then responded, "Yes, I'm conventional." Such a narrow focus on part of a heard sentence could result from an age-associated, smaller capacity of attention and short-term memory [30] or a developed habit of selective attention [31]. This can explain why older participants agreed more with the compound question "conventional, uncreative" in their verbal than in their written responses-sequential auditory processing might have accentuated the earlier "conventional" part [32], with which they agreed more relative to the later "uncreative" part.
Moreover, a cognitively demanding task that precedes particular ATQs may set up a stressful or even frustrating context and induce age-based stereotype threats [33,34] in older but not younger adults when they answer those ATQs. In our study, the ATQ "anxious, easily upset" was asked right after a series of RoBoHoN-administered cognitive tests in which the older participants showed much poorer performance (not shown here) than did the younger participants, presumably because of cognitive decline [35][36][37]. Such results suggest that these tests might have been rather challenging to our older participants and therefore induced their stress responses, including negative affect [38,39]. By contrast, the paper version of our personality test was given right after a task that was relatively easy for both younger and older participants, which could explain why our older participants agreed less with being "anxious, easily upset" in their written than in their verbal responses.

Guidelines for Using ATQs
Based on the above discussions, we recommend the following for effective use of ATQs: (1) base questions on a psychological test with good validity and reliability to avoid invalid measures, such as the unreliably measured Openness to Experience in the present study; (2) adapt questions to the target population in a way unambiguous to them to avoid unintended responses, such as our older participants' partial answers to the compound questions; (3) ask the same question on different occasions if the answer to it may lack cross-situational consistency, as with our older participants' responses to the "anxious, easily upset" question.
Importantly, although the validity and reliability of ATQs may vary across populations, it is unnecessary and impractical to also administer a paper version of the same test in addition to ATQs, especially when ATQs are applied in commercial products. This is because the test-retest reliability of ATQs can be directly evaluated using our third recommendation, and arguably the ultimate validity of ATQs is whether they help estimate the internal variables of a person and improve predictions of that person's behavior [40,41].
Last but not least, because estimated personal characteristics may be exploited for malicious purposes [42], such sensitive information should be securely protected [43], and users have the right to decline any form of psychological testing in the first place [44], including ATQs. One possible way to address users' privacy concerns is to ask for users' informed consent [45] so that users are aware of being profiled. The personalization-privacy paradox can also be mitigated by storing personal information locally inside a conversational agent rather than in the cloud; this approach has proved effective in reducing smartphone users' perception of privacy violations [46].

Possible Future Directions
The present study has several limitations and can be extended in the following directions.
• Exploring the longest temporal window within which the ATQ procedure can remain valid;
• Using a psychological test with more items to improve test reliability [47,48];
• Embedding a question into a contextually relevant conversation by an automatic information retrieval mechanism [49];
• Taking non-Likert, natural language responses from participants as they are and scoring them by fuzzy methods [50].
These possible extensions are important for real-world applications. They would either clarify the boundary conditions of the ATQ testing procedure or automate its question-distributing and answer-scoring components. Future feasibility studies are needed to address these issues and put ATQs into use in real-life HRI.

Conclusion
The present study put forward and experimented with the possibility of asynchronously administering a psychological test by embedding items from the test into human-robot conversations. This proposed ATQ procedure was then successfully validated on a young adult group but less so on an older group, based on which we derived our guidelines for future effective use of this approach. The ATQ procedure is designed as a user-friendly method of psychological testing during human-robot interaction. Social robots can leverage such a non-strenuous procedure to support fragile individuals in the completion of psychological tests or to profile general users for response personalization. Overall, the asynchronous testing procedure holds great promise for improving user understanding and thereby human-robot connections.
As a concluding remark, it should be pointed out that the ATQ testing procedure can, in theory, be generalized to various psychological tests and applied to various populations once the test questions are properly adapted for a target population. The proposed method can also be implemented in various conversational agents designed for companionship or assistance, from text-based chatbots [51] to embodied conversational agents [52]. All in all, we hope that the ATQ testing procedure can help import the long-accumulated knowledge of psychology-in the crystallized form of psychological tests-into robotics for improving machine cognition and service.

Compliance with Ethical Standards
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Su-Ling Yeh received her B.S. and M.S. degrees in Psychology from National Taiwan University (NTU), Taiwan, and her Ph.D. degree in cognitive psychology from the University of California, Berkeley, USA. Since 1994, she has been with the Department of Psychology, NTU, and was awarded a Lifetime Distinguished Professorship in 2012. She is a recipient of the Academic Award of the Ministry of Education and the Distinguished Research Award of the National Science Council of Taiwan. She is an APS (Association for Psychological Science) fellow and a 2019-20 Stanford-Taiwan Social Science Fellow at the Center for Advanced Study in the Behavioral Sciences, Stanford University. She serves as associate director of the NTU Center for Artificial Intelligence and Advanced Robotics and on the Editorial Board of Scientific Reports. Her research interests include cognitive neuroscience, perception, attention, consciousness, multisensory integration, aging, and applied research on display technology, eye-tracking devices, affective computing, and AI/robots.