1 Introduction

For speech-based dialog systems without visual representation, the system’s voice is the only feature a user can relate to. Therefore, it has to be perceived as trustworthy and empathic. This especially applies to Companion-systems, which are “cognitive technical systems with their functionality completely individually adapted to each user […which] interact with [him/her] as competent and cooperative service partners” [1]. Companion-systems should be able to support every user in different situations and in all kinds of emotional states – positive as well as negative [1].

In general, anthropomorphic and/or naturally sounding voices are used in most areas of Human-Computer Interaction (HCI), such as navigation systems, smart home environments or voice user interfaces (VUIs) in smartphones [2]. There is indeed evidence for the human tendency to apply schemes from human-human interaction to the communication with computer systems or virtual agents, regardless of the level of anthropomorphism of their voice [3]. However, empirical findings supporting the hypothesis that human-like, anthropomorphic voices foster human-computer cooperation and users’ perceived trust better than artificial voices are rare [4]. In order to provide a deeper comprehension of users’ individual experiences with different kinds of voices, an established experimental design [5] was adapted to focus on users’ subjective perceptions of two different voices in a task-related dialog with a simulated Companion-system.

2 Background

During the past decades, the recognition of the importance of users’ emotions has increased significantly in the field of HCI, which led to the emergence of the research area of Affective Computing [6]. By now, it hardly seems imaginable to do research without considering users’ affective states, especially as far as User-Companion Interaction (UCI) is concerned [7]. A user’s current affect influences most factors of the perception of and experience with systems, e.g. performance, cognition, concentration, or memory [8]. Therefore, systems must be able to avoid inducing negative affective states in order to create and maintain cooperation as well as trust. In the present research, this shall be achieved by the voice alone.

The question arose whether varying the speech output alone is able to fulfill this requirement. To answer this, several studies that investigated the impact of system voices on users’ perception and behavior were surveyed. Important effects were detected regarding system voices and their influence on users. Variations of a voice’s gender, speed, volume, manner etc. [9–11] have different impacts. For example, a female voice helps to communicate emotional content, whereas male voices tend to sound competent and convey task-related information [9]. The manner can help to increase interaction success, e.g. when motivational feedback is provided [10, 11]. When a human-sounding voice is compared with a computerized one, users significantly prefer the human-sounding voice and even learn faster while solving a task [10]. These findings show that a system voice is indeed able to affect the interaction. Furthermore, users’ personality characteristics strongly influence the perception of the speed and volume of system voices. For example, introverted users prefer low speed and volume, whereas extraverted users sympathize with louder and even exaggerated tones [12]. Therefore, user characteristics also have to be taken into account when examining the effect of different voices on users.

Before the intended research goal is explained, it needs to be clarified when an interaction between a Companion-system and a user can be labeled as successful. For this purpose, Frommer et al. [13] developed a Wizard of Oz (WOz) experiment in which users had to interact with a simulated Companion-system (described below) while solving a task. An artificial, computerized voice was chosen to guide users through this experiment. During the interaction, challenges occurred at specific stages, which demanded that users adapt their current task-solving strategies [5].

Quantitative as well as qualitative methods were used to analyze users’ perceptions, their interaction behavior as well as user characteristics. This research process established the basis for the forthcoming study introduced here. Individual user characteristics (e.g. personality traits, socio-biographic variables or technical experience) directly influence actual user behavior and have to be taken into account when analyzing data of users of technical devices [14]. User characteristics were shown to influence users’ (task-)performance, especially during situations that were perceived as challenging. Participants with greater performance “were younger, more experienced with computers, showed lower amounts of neuroticism and higher amounts of agreeableness (NEO-FFI) on average” [15]. Furthermore, the analysis of semi-structured interview material showed the importance of users’ subjective experience while interacting with the artificial speech-based system. It became obvious that users tended to anthropomorphize the system, even if it is just a voice and a screen [16]. However, this is not necessarily linked to comfortable feelings in the interaction. In fact, the artificial voice is associated with feelings such as anxiety or eeriness and with a tendency to distance oneself from the system by reducing initiative. Hence, participants expressed wishes to change the artificial voice into a more human-like one, possibly because they imagined a deeper and more trustworthy relationship, including a more comfortable interaction atmosphere, with such a system voice [16, 17]. Furthermore, participants used more negative than neutral or positive attributes to describe the voice [18].

To address these findings, the aforementioned experiment was modified and the application of two different voices was chosen: an anthropomorphic voice compared with an artificial one. The psychological research goal is to find out what effects the voice has on users’ perception of its support while solving the task. This study shall also evaluate which kind of voice is most likely to evoke positive affect and greater perceived trust in users. Furthermore, we want to survey the influence of user characteristics on voice preference.

3 Methods

The aforementioned WOz experiment represents a suitable approach for our research. Before we explicate the modifications and hypotheses, we will give a short description of this experiment and the LAST MINUTE corpus, which is the result of previous research.

3.1 Wizard of Oz Experiment LAST MINUTE

The WOz experiment and the resulting LAST MINUTE corpus were developed as a research tool to investigate subjects during an interaction with a speech-based interactive dialog system, including a problem-solving task with planning, re-planning and strategy change [15]. All tasks had to be solved by users with the help of a solely speech-controlled computer system. In accordance with the central design feature of WOz experiments, this system was controlled by hidden human operators. The subjects believed they were communicating with an autonomous computer system. A male-sounding, clearly computerized voice (MARY TTS, mary.dfki.de) was chosen to reinforce the feeling of interacting with a computer system [5].
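
To illustrate how such prompts can be generated, the following minimal sketch (Python) queries a locally running MARY TTS server over its HTTP interface and stores one synthesized system prompt as a WAV file. The server address follows the default MARY TTS configuration; the example sentence and the commented-out voice parameter are illustrative assumptions.

    import requests  # third-party HTTP client

    # Minimal sketch: synthesize one German system prompt with a locally
    # running MARY TTS server (default HTTP interface on port 59125).
    MARY_URL = "http://localhost:59125/process"

    params = {
        "INPUT_TEXT": "Bitte wählen Sie die Kategorie Oberteile aus.",
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": "de",
        # "VOICE": "...",  # name of an installed German male voice
    }

    response = requests.get(MARY_URL, params=params, timeout=30)
    response.raise_for_status()

    with open("prompt.wav", "wb") as f:
        f.write(response.content)  # raw WAV audio returned by the server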

According to Frommer et al. [13] as well as Rösner et al. [5], the experiment was executed as follows: At first, the system introduced itself and asked some personal questions, the so-called personalization module. The system explained to users that this information was needed for individual adaptation. After that, the actual last minute module [5] began with the explanation of the task. Subjects had to pack a suitcase for a suggested fourteen-day summer vacation within a predefined time. They were informed that detailed weather information would be gathered and provided later. Participants could choose items out of twelve categories (e.g. tops, shoes, accessories), which were presented in a predefined order on a screen in front of them. This stage is called “baseline” (BSL). During the course of the interaction, particular restrictions, so-called challenges, occurred. The first of these challenges is called the “weight limit barrier” (WLB). Here, users were informed that their suitcase was subject to the airline’s weight limit. New items could be added only after others had been unpacked. As a result, participants had to adapt to this unexpected condition and to cope with their possibly emerging stress. After they had passed more than half of all categories, the final information regarding the destination was revealed. The vacation resort was located in the southern hemisphere, where the seasons are reversed. Now, subjects had to pack for a cold climate, which means they had to change their strategy. This challenge is called the “weather information barrier” (WIB). Apart from the time and weight restrictions, this rendered the packing process even more complicated. In this situation, about half of the participants received an empathic intervention inviting them to express their actual feelings. The remaining time could be used for corrections; this is called the “revision stage” (RES). In the end, participants had the chance to explicate how satisfied they were with the contents of their suitcase [5, 13].
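
For clarity, the interaction course can also be summarized as an ordered stage sequence. The following sketch is our own illustrative abstraction, not part of the original wizard software; the stage codes follow the abbreviations introduced above (the code for the personalization module is our own).

    from dataclasses import dataclass

    # Illustrative abstraction of the LAST MINUTE interaction course;
    # a flag marks the two challenges users have to cope with.
    @dataclass(frozen=True)
    class Stage:
        code: str
        description: str
        is_challenge: bool = False

    LAST_MINUTE_STAGES = [
        Stage("PERS", "personalization module: personal questions"),
        Stage("BSL", "baseline: unrestricted packing of categories"),
        Stage("WLB", "weight limit barrier: airline weight limit", True),
        Stage("WIB", "weather information barrier: cold destination", True),
        Stage("RES", "revision stage: remaining time for corrections"),
    ]

    for stage in LAST_MINUTE_STAGES:
        marker = " (challenge)" if stage.is_challenge else ""
        print(f"{stage.code}: {stage.description}{marker}")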

3.2 Modification of the Wizard of Oz Experiment LAST MINUTE

For the purpose of our prospective research we modified the established WOz experiment to focus on users’ perceptions of the system voice. Particular attention is paid to users’ individual ratings of the system and its voice as well as possible changes in users’ affective states during the course of interaction.

With respect to prior results, we modified the personalization module to avoid initial uncertainty regarding the system and the interaction [16] and to strengthen users’ sympathy towards the system from the beginning. The empathic intervention was removed because of its inconclusive effects [19].

There will be two experimental groups: one half of the participants will interact with the artificial voice that was already used in the prior experiments; the other half will interact with an anthropomorphic voice (IVONA TTS, www.ivona.com). We took care to use male voices for both conditions to avoid the aforementioned gender effects. The setting remains identical for both groups. We expanded the experiment with two rating phases to gather information about users’ current states and to detect significant changes during the interaction. Altogether, we survey users’ states and perceptions in three particular experimental phases, as described below (see also Fig. 1).

Fig. 1. The three rating phases during the modified WOz experiment

The first rating occurs before the start of the experiment. Here, we survey general information about the user, such as socio-biographic variables and experience with technical devices. Furthermore, we measure users’ task-related motivation (Achievement Motives Scale, AMS) [20] and their current affective state (Positive and Negative Affect Schedule, PANAS) [21]. This rating phase represents the Baseline (see Fig. 1) for further points of measurement. The actual experiment begins after this phase. Participants pass through the personalization module to get to know the system (Introduction) and immediately start with the last minute module. Here, users can practice the task while packing the first three categories (tops, jackets & coats, trousers & skirts) (Cooperation). After the third category is finished, the second rating occurs. Here, we measure the current affective state (PANAS) again as well as the subjective system evaluation, using a shortened version of the AttrakDiff (AttrakDiff mini) [22], which quantifies hedonic and pragmatic product quality. This rating takes place away from the experimental screen, which ensures an objective appraisal by participants without politeness effects towards the system [23]. After that, the last minute module continues. During this last phase, participants have to face all challenges (WLB, WIB and RES, see Sect. 3.1) (Interference) and finish the task (Conclusion). The third and last rating occurs after the system-initiated goodbye. The applied questionnaires gather information about users’ present affective state (PANAS), the final subjective system evaluation (AttrakDiff mini) and the perceived trustworthiness of the system (Human-Computer Trust) [24].
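
The resulting measurement schedule can be summarized compactly. The following sketch only restates the plan described above; the phase labels follow Fig. 1 and the instrument abbreviations follow the text.

    # Illustrative sketch of the measurement schedule (cf. Fig. 1).
    RATING_PHASES = {
        "Baseline": {
            "when": "before the start of the experiment",
            "instruments": ["socio-biographic items", "technical experience",
                            "AMS", "PANAS"],
        },
        "Introduction & Cooperation": {
            "when": "after the third packed category",
            "instruments": ["PANAS", "AttrakDiff mini"],
        },
        "Interference & Conclusion": {
            "when": "after the system-initiated goodbye",
            "instruments": ["PANAS", "AttrakDiff mini", "Human-Computer Trust"],
        },
    }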

Therefore, we have three particular rating phases during the experiment: the Baseline, the Introduction & Cooperation as well as the Interference & Conclusion (as shown in Fig. 1). By doing so, we want to evaluate possible changes in users’ emotional states during these phases. Furthermore, we want to survey differences in the subjective system-rating and perceived trustworthiness of both groups. A comparison of all ratings between the experimental groups may offer a profound basis for reaching the intended research goals.

With respect to users’ individuality, all participants will answer open questions regarding their subjective experience of the system’s voice, including possible influences on their feelings and behavior during the experiment as well as possible ideas regarding a change of the voice. Furthermore, some questions refer to users’ ascriptions to the system [16, 25] as well as users’ experiences of the relationship between themselves and the system.

We will also gather information about specific user characteristics in a second, separate session. Standardized psychological questionnaires are used to gather information about users’ affinity towards technology, emotion regulation, personality dimensions, coping with stress, self-efficacy, locus of control in the usage of technical devices as well as the psychological concept of the individual need to evaluate. This information may help to classify the different reactions and perceptions into distinct groups of users.

3.3 Hypotheses

This design aims at evaluating the perception of trust and cooperation between user and Companion-system by means of a variation in the system’s speech output. More precisely, we survey the impact of an anthropomorphic system voice compared to an artificial system voice on users’ current affect, their system evaluation as well as the development of trust. Furthermore, this design serves to detect possible correlations between the perception of the system voice and specific user characteristics, e.g. gender, personality dimensions or affinity towards technology.

With regard to the previous explanations, we suppose that the anthropomorphic voice has a more positive influence during the interaction with the simulated Companion-system than the artificial voice. The anthropomorphic voice may increase the perceived trustworthiness of the system. Furthermore, the possibly cooperative relationship between user and simulated Companion-system will be influenced during the phase of Interference & Conclusion to an as yet unknown extent.

Hence, several hypotheses were formulated:

  1. Regarding the phase of Introduction & Cooperation, we expect more positive affect (PANAS) and a higher system rating (AttrakDiff mini) from those users who interact with the anthropomorphic voice, in comparison to the other group.

  2. Both experimental groups have to face the barriers during the phase of Interference & Conclusion and will show a lower system rating (AttrakDiff mini) compared to the phase of Introduction & Cooperation.

Especially interesting is the change from cooperative interaction (second rating phase) to a possibly interfered (or even failed) interaction (third rating phase). Here, however, we can only formulate explorative questions (a sketch of the planned comparisons follows the list below):

  3. Will significant differences occur between the two groups regarding their perceived trustworthiness and subjective system-evaluation for the third rating phase?

  4. Will significant differences occur between the two groups regarding users’ affective state for the second and third rating phase?

  5. Are there significant influences of user characteristics on the following goal criteria: perceived trustworthiness, subjective system-evaluation and users’ emotional state?
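
The following minimal analysis sketch illustrates how these group and phase comparisons could be computed. It assumes a hypothetical CSV file with one row per participant, a column coding the voice condition, and per-phase score columns; it is not the project’s actual analysis pipeline, and the planned evaluation may well use different tests.

    import pandas as pd
    from scipy import stats

    # Hypothetical data layout: one row per participant, a "voice"
    # column ("anthropomorphic" / "artificial"), and per-phase scores.
    df = pd.read_csv("ratings.csv")

    # Between-group comparisons (hypothesis 1, questions 3 and 4).
    for measure in ["panas_positive_cooperation", "attrakdiff_cooperation",
                    "attrakdiff_interference", "trust_interference"]:
        grp_a = df.loc[df["voice"] == "anthropomorphic", measure]
        grp_b = df.loc[df["voice"] == "artificial", measure]
        t, p = stats.ttest_ind(grp_a, grp_b)
        print(f"{measure}: t = {t:.2f}, p = {p:.3f}")

    # Within-group decline of the system rating (hypothesis 2).
    t, p = stats.ttest_rel(df["attrakdiff_cooperation"],
                           df["attrakdiff_interference"])
    print(f"AttrakDiff decline: t = {t:.2f}, p = {p:.3f}")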

The human-like voice may evoke an assumption of competence in users, which may or may not be satisfied during the interaction. Trust issues and higher levels of negative affect may be the result. As mentioned before, the perception of system voices is strongly influenced by user characteristics. Even if we suppose that the anthropomorphic voice will evoke more positive affect in general, individual preferences and perceptions have to be taken into account, too. Therefore, it seems possible that some users perceive the artificial voice as less competent and are thus more likely to forgive its mistakes.

Besides these assumptions, the design of the study shall help to get a profound understanding of the effects of several user characteristics (e.g. personality dimensions, coping with stress, motivation, self-efficacy) on the preference of a specific system voice.

4 Outlook

The experiment takes place in a research lab of the Otto von Guericke University Magdeburg. A small sample of six participants has already passed a test phase. The first (not yet systematically analyzed) results show that participants indeed experience different, albeit marginal, affective states during the interaction. Of course, we will need a greater sample size to support our hypotheses. We plan experimental group sizes of about 30 participants for the statistical evaluation of all measurements. In order to reduce influences based on participants’ age or gender, both experimental groups will be homogeneous regarding these characteristics (only students aged 18 to 28, gender balanced).
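
A group size of about 30 per condition is in line with what a conventional power calculation suggests for detecting a large effect in a two-group comparison. The following sketch assumes Cohen’s d = 0.8, α = .05 and a power of .80; these parameter values are illustrative choices, not figures from the study plan.

    from statsmodels.stats.power import TTestIndPower

    # Required per-group n for an independent-samples t-test, assuming
    # a large effect (Cohen's d = 0.8), alpha = .05, power = .80.
    n_per_group = TTestIndPower().solve_power(
        effect_size=0.8, alpha=0.05, power=0.80, alternative="two-sided")
    print(f"participants required per group: {n_per_group:.1f}")  # ~25.5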

The inclusion of users’ current affective states and a profound understanding of users’ subjective perceptions of UCI are required for the development of Companion-systems that shall be experienced as supportive, empathic and trustworthy partners by their individual users.