1 Introduction

In modern information societies (such as the Netherlands), the societal participation of low-literate citizens is problematically low [20]. Limited information skills (reading and writing) and communication skills (speaking and understanding) can lead to problems with societal participation. These problems can be cognitive in nature (such as a lack of these skills, and societal knowledge and experience), but also affective (fear, shame, and low self-efficacy, an individual’s task- and context-specific judgment of their own capabilities, cf. [5, 89]) or social (low motivation and desire to learn, or low trust in teachers and other learners, cf. [36, 39]). We can address them by designing information and communication skills training that is grounded in relevant real-life societal participation scenarios, so-called crucial practical situations [55, 56]. To this end, we are designing the virtual learning environment VESSEL: A Virtual Environment to Support the Societal participation Education of Low-literates [89]. In VESSEL, learners will perform interactive exercises situated in the domain of societal participation, while the system provides learning support by addressing the combined cognitive, affective, and social spectrum of learning problems that low-literates experience. We predict that learning in VESSEL will result in higher learning effectiveness, which we define as consisting of learning accessibility (there should be no practical or emotional barriers for the learner to start learning, cf. [94,95,96]), learning experience (the learners’ skills, needs, and wishes should be incorporated throughout the learning, cf. [24, 60, 96]), and learning outcomes (the learning should aim to reach meaningful and desired goals, cf. [7, 60, 94, 108]) [88]. By applying cognitive, affective, and social perspectives, we identify nine concrete system objectives. 
Learning accessibility can be increased by lowering (1) cognitive, (2) affective, and (3) social barriers to learning. The learning experience can be made more (4) cognitively achievable, (5) affectively positive, and (6) socially engaging. Finally, the following important learning outcomes can be reached: Learners can (7) train applied information and communication skills and gain practical experience, (8) raise their self-efficacy, and (9) become more motivated to participate in society independently.

We use the Socio-Cognitive Engineering method (SCE, see Fig. 1) to design VESSEL. The SCE method integrates operational demands (describing the system’s context-of-use), human factors knowledge (describing theory relevant to user-system interactions), and technology (describing current and envisioned technology drivers and constraints) into an iterative software design process [74,75,76]. Relevant data about these operational demands, human factors, and technology are collected in a theoretical foundation. This foundation is used to derive a system specification, consisting of system objectives (the general operational or domain goals of the envisioned system), functional requirements (the system’s intended functionality), claims (hypotheses about how the requirements help reach system objectives), and use cases (sequences of actions that describe how the system results in valuable outcomes for particular actors, cf. [1]). This specification can then be developed into a prototype, which is used to experimentally evaluate the claims.

Fig. 1 Socio-Cognitive Engineering method [74, 76]

Earlier work [88, 89] has resulted in a high-level VESSEL specification, consisting of a requirements baseline with eight requirements (see Table 1) and claims that connect these requirements to the nine system objectives of learning effectiveness. This requirements baseline is theoretically supported [89]: The requirements in the specification were derived from theories of adult learning (andragogy [51,52,53], transformative learning [65, 66], constructivism [10, 45], and e-learning [31, 46, 47]) and theories on computer-supported learning that highlight the value of information provision, worldwide communication, interactivity, and use of gaming principles/gamification [85]. The baseline is also empirically grounded [88]: The requirements were refined by applying grounded theory [35] to qualitative data obtained through workshops, focus groups, and cultural probes (a qualitative data collection method wherein participants use provided recording tools, such as cameras, notepads, and sound recorders, to provide insight into their daily lives, cf. [32, 33]) used with low-literate participants. However, the requirements and claims in this specification have not yet been practically evaluated. As such, the next steps in our design and development process should be the ‘prototype development’ and ‘claim evaluation’ steps in the evaluation phase (Fig. 1). During prototype development, we must translate the generic requirements baseline into use cases, low-level claims, and functional VESSEL prototypes. We then use these prototypes during claim evaluation to experimentally test the validity of the claims with low-literate end users. The experimental outcomes of this evaluation phase can then be used to update the foundation and refine the specification, iteratively improving the VESSEL design.

The high-level specification affords a range of possible technological implementations, each meeting certain requirements in certain ways. We choose to design VESSEL as an autonomous rules-driven Embodied Conversational Agent (ECA) coach that helps low-literate learners with situated interactive exercises by offering cognitive, affective, and social learning support. Figure 2 presents the envisioned VESSEL system setup, showing the ECA coach and exercise elements. Here, exercises are scenario-based training (cf. [82]) situated in crucial practical situations important to low-literate learners. ECAs are “anthropomorphic interface agents” [12, p. 1] that can directly interact with system users. We expect that an ECA coach implementation of VESSEL has theoretical and empirical benefits. Example benefits include the following: ECAs can adapt their looks and behaviours easily, allowing them to match the demands of different training scenarios or the needs and wishes of different users; ECAs inherently afford natural language and spoken dialogue, making them easier to communicate with especially for users who struggle with text and reading; ECAs can add a social presence to exercises; particularly in a coaching role, they can serve as a focal point for user support, allowing users to naturally ask questions and request help. Finally, an ECA coach matches the support desires of low literates: In [88], we show that low-literate learners strongly prefer personalized ‘human’ support over ‘computer’ support. An anthropomorphic ECA puts a human face on the computer system, allowing low-literate learners to access the benefits of automated support, thereby enhancing learning effectiveness. However, to the best of our knowledge, little experimental validation of the effectiveness of ECA coaching with low-literate societal participation learners currently exists.

Following the SCE methodology, we aim to address this larger problem by carrying out multiple design and evaluation cycles. In this paper, we aim to design and develop a VESSEL prototype consisting of situated interactive exercises and an ECA coach that provides cognitive, affective, and social learning support, and to evaluate this prototype with low-literate learners. The prototype developed in this work will be a proof-of-concept, used in a Wizard-of-Oz experiment (i.e. controlled by a human ‘wizard’ instead of a computer, cf. [64]) to investigate both how we can best translate the existing VESSEL specification into an ECA coach-supported virtual learning environment, and whether or not the idea of ECA coach support for low-literate learners provides the envisioned learning benefits as described above: Better learning accessibility, an improved learning experience, and more success at reaching important and meaningful learning outcomes. The question of whether or not low-literate learners will be able to benefit from this prototype is non-trivial, as low-literate learners are known to struggle with accessing and using complex technology due to cognitive, affective, and social barriers [88, 90]. Additionally, Kramer et al. [54] highlight that many ECA studies underreport the actual ECA design process and argue for studies that “open the black box” (p. 8) and clearly articulate the methods, objectives, and assumptions that go into this design. Ter Stal et al. [101] similarly report a dearth of clear guidelines for and taxonomies of ECA design features. Consequently, the comprehensive design, development, and evaluation of a proof-of-concept ECA coach that provides cognitive, affective, and social learning support meaningfully integrated into exercises are a unique and interesting contribution. A complementary study by Schouten, Venneker et al. [91] has zoomed in on the affective and social support contributions of a different prototype, further exploring the boundaries of this problem space.

The above yields two research questions:

  • Q1. Design: How can we create an ECA coach that provides effective cognitive, affective, and social learning support to low-literate learners doing situated interactive exercises in a virtual learning environment?

    • Q1a. In what ways can an ECA coach provide cognitive, affective, and social learning support?

    • Q1b. Which functionalities, interaction methods, and appearances should an ECA coach have to effectively provide this learning support in a virtual learning environment?

  • Q2. Evaluation: Does this support-providing ECA coach increase learning effectiveness for low-literate learners working with VESSEL, compared to low-literate learners working with VESSEL but not receiving coach support?

Fig. 2 VESSEL system setup. Three double-sided arrows indicate user-system interactions. Bottom left arrow: The user performs exercises. Bottom right arrow: The coach monitors the user’s actions, and interacts with the user by giving feedback and support. Top arrow: The coach monitors exercise state and changes support as appropriate

The first research question is answered in four steps. First, we update our SCE foundation (Fig. 1). We update technology by explaining the potential benefits that ECAs in general, and an ECA coach specifically, can offer to our VESSEL design. We update human factors knowledge by incorporating theory that describes how an ECA coach could offer cognitive, affective, and social learning support. We update operational demands by designing the situated interactive exercises that make up the educational content of VESSEL. Second, this updated foundation is used to refine the requirements in the specification. Third, we translate this refined specification into practical use cases, to make explicit how the prototype should work, what effects we expect our ECA coach to have, and how these effects can be measured. Fourth, we create the prototype and describe it in terms of functionality, interaction methods, and appearance. To answer the second question, we experimentally evaluate the learning effectiveness impact of the coach with low-literate learners. We claim that a support-providing ECA coach will raise VESSEL’s learning effectiveness for low-literate learners by improving the system’s cognitive, affective, and social learning accessibility, learning experience, and learning outcomes.

The structure of this paper is as follows. Section 2 provides additional background information on the demographic of low-literate learners, and on the current state of learning support aimed at this demographic. Section 3 shows the updated VESSEL foundation. In Sect. 4, the specification requirements are refined, and use cases are derived and written. Sections 5, 6, and 7 present the evaluation process. In Sect. 5, the prototype is described in terms of functionality, interaction methods, and appearance. Section 6 describes the experiment created to evaluate the effectiveness of the learning support provided by the prototype’s ECA coach. Section 7 presents the results of the evaluation. Finally, Sect. 8 presents conclusions, discussion of findings, and directions for future work.

2 Background

2.1 People of low literacy

People of low literacy (or low-literate people) are defined as adults whose mastery of reading, writing, speaking, and understanding/comprehension skills is limited to such a degree that they cannot independently participate in society. The OECD (Organisation for Economic Cooperation and Development) defines literacy as the ability to use printed and spoken information in the pursuit of one’s daily and overall goals [78]. Low literacy can specifically be measured against crucial practical situations: The set of behaviours that a person must be able to carry out in order to be able to participate independently [55, 56]. Examples of these crucial practical situations include banking, renting a living space, grocery shopping, and communicating with neighbours [55, 88].

As the elements of low literacy are highly culturally dependent (owing to, e.g. different norms, expectations, governmental institutions, and crucial practical situations, cf. [88]), we focus on low literacy in the Netherlands, the country where this study is situated. Following the most recent definitions provided by the Dutch Court of Audit (Algemene Rekenkamer), around 2.5 million Dutch people are considered low-literate [2], including 1.8 million people in the labour force (i.e. aged between 16 and 65). This represents an increase over the 2012 Programme for the International Assessment of Adult Competencies (PIAAC) survey, which reported 1.3 million low-literate Dutch people [37, 40]. Gubbels et al. [41] report that almost a quarter of Dutch 15-year-olds struggle with language mastery and literacy issues, further underlining the widespread nature and persistence of this problem. The low-literate population is also heterogeneous along other demographic dimensions: Low literacy can be found among both native Dutch speakers and non-native migrants, among both men and women, and among a variety of educational backgrounds [87]. A more comprehensive look into the demographic makeup of Dutch low literates can be found in Schouten, Smets et al. [89] and De Greef, Segers & Nijhuis [38].

2.2 Existing learning support

Learning support for Dutch low literates is provided primarily in the form of adult language and integration classes, which are made available to interested learners at regional education centres, libraries, volunteer centres, and private institutions [11, 100]. In these classes, which focus equally on language acquisition, practical skills training, and knowledge of society, students discuss and practice crucial practical situation exercises with the support of teachers and peers [34, 55]. This classroom-based approach to training remains the gold standard because it meaningfully applies scenario-based learning [80, 82] to the actual practice of societal participation that low-literate people struggle with, and because low-literate learners place enormous value on support from both trusted authority figures (i.e. teachers) and people that they perceive to be in the same situation (i.e. student peers) [88].

However, Schouten [87] highlights that classroom learning can meaningfully be supplemented with computer-based learning, citing three broad reasons: First, computer-based learning can improve learning accessibility for learners who cannot access the classroom easily. Second, computer-based learning has great inherent potential for individualization (as learning software is easily adaptable, cf. [6, 80]); individualized learning and support are very valuable for increasing learning effectiveness [69, 97], especially for a heterogeneous learner group like low-literate learners [89]. Finally, computer-based learning enables the use of learning scenarios that would be impractical or impossible to train in a classroom setting (e.g. scenarios that have an element of real risk, or that require expensive tools). Schouten, Smets et al. [90] assess the current practice of learning support software aimed at low-literate learners and use this to establish a set of eight design requirements for learning support software that emphasizes these benefits. They describe that a virtual learning environment (VLE) could potentially be an effective way of meeting these requirements. These design requirements (further refined in [88] and shown in Table 1) form the basis of VESSEL: A VLE aimed at effectively supporting low-literate learners who want to improve their societal participation, designed to make the benefits of computer-based learning accessible to low-literate learners and to be used in concert with existing learning solutions.

3 Foundation

3.1 Technology: embodied conversational agents

Embodied Conversational Agents (ECAs) are a subclass of Intelligent Virtual Agents: Autonomous software programmes that can interact with humans and other agents [12]. ECAs extend traditional intelligent agents by being ‘embodied’ as animated virtual characters in a virtual environment. Being embodied has consequences for agent appearance and behaviour. In contrast to non-embodied agents, ECAs can be judged on their appearance; particularly when ECAs look human-like, humans evaluate them on appearance factors such as sex, age, ethnicity, and dress style. Studies suggest that humans judge ECA characters on the same qualities as they do other humans, such as similarity to themselves [3, 70, 105], attractiveness [49, 72], and cultural appearance stereotypes [3, 4, 109]. In addition, ECAs can use not only verbal communication behaviours (e.g. speech and natural language understanding), but also nonverbal behaviours (e.g. body language, gesturing, facial expressions, and gaze direction [23, 54, 101]). ECAs can be designed to behave as social actors: Possibilities include recognizing and responding to verbal and nonverbal input from humans, taking part in ongoing discussions, paying attention, and using conversational functions like turn-taking [12, 23, 28, 62]. This lets humans react to the social cues and behaviours of ECAs as if they were human conversation partners [50, 73, 84, 98].

Because ECA behaviours and appearances can be adjusted [6, 80], ECAs can be used to fulfil a variety of roles in a virtual environment. For instance, Bickmore et al. [14] adapted the ethnicity of a virtual nurse character to better match different user demographics, increasing user-system satisfaction. Prior studies have shown the potential effectiveness of using ECAs in the role of a digital coach. Lane et al. [57] report on an ECA coach that increased users’ willingness to attempt challenging programming problems and their self-efficacy in computer science education. Coaching ECAs developed by both Bickmore et al. [16] and de Rosis et al. [86] were effective in changing user behaviour patterns. Shamekhi et al. [92] show that an ECA coach focussed on teaching self-care was appreciated and accepted by spinal cord injury patients, with participants suggesting that this approach could be valuable particularly for adults dealing with ‘sensitive topics’. Hudlicka [44] shows that even when users express negative opinions on an ECA coach’s affective and social realism (i.e. the ECA’s ability to conduct natural conversations and come across as a ‘real’ person), the coach’s interactive feedback and support are still valued, and the coach still supports users in implementing a meaningful meditation practice routine. Finally, Kramer et al. [54] provide a scoping literature review of the use of coaching ECAs in physical health domains. They report that while no significant increase in user health literacy is found, ECA coaches do increase user motivation to apply health measures, user identification of preconception risks, and system usability.

Because of the social interaction options, individualization potential, and learning support outcomes described above, an ECA fulfilling the role of a digital coach in VESSEL could be effective in supporting the cognitive, affective, and social issues of low-literate learners. Cognitively, digital coaches in general can help learners reach stated learning goals: Bickmore et al.’s [16] health counsellor increased physical activity and fruit and vegetable consumption, and Veletsianos and Miller [104] show that learners deeply engage and converse with a digital coach, increasing learning. ECA digital coaches in particular can provide individualized learning support by using scaffolding to structure their verbal/textual feedback and by using multimodal media for support [63]. De Rosis et al. [86] show that users converse with and learn from an ECA for health promotion with adaptive dialogue, and Miao et al. [67] show that a scaffolding-based ECA coach is both technically feasible and accepted by learners. Affectively, ECA digital coaches improve the affective experience of situated, interactive learning exercises: Shaw et al. [93] describe how the embodied, human-like nature of an ECA can emotionally benefit learners in situated exercises, while Lester et al. describe that “the very presence of an animated agent in an interactive learning environment - even one that is not expressive - can have a strong positive effect on student’s perception of their learning experience” [59, p. 6]; they call this the persona effect. Also, socially, digital coaches can be seen as ‘friends’ and trusted mentors in a learning system [13, 86], forming a long-term relationship of trust between learner and coach [15, 71]. Ter Stal et al. [101] report that social relational agents are seen as more likeable, caring, and trustworthy, particularly if the ECA shares information about itself with the user (cf. [48]). 
ECA digital coaches can use nonverbal behaviours and appearance factors, such as similarity attraction, to form these relationships quickly and strongly [70]. Digital coaches also enhance engagement and learning in a virtual learning space, by acting as conversation partners that human users will genuinely talk to [104].

3.2 Human factors knowledge: learning support

To support learners with cognitive, affective, and social issues, the ECA coach must be able to offer cognitive, affective, and social support. Cognitive support is operationalized in VESSEL as scaffolding. Scaffolding is a learning support technique that focuses on providing the right amount of support to learners at the right time. Support is first increased to the level that the learners need to progress and then gradually decreased over time [83]. This way, “students are encouraged to develop their own creativity, motivation, and resourcefulness” [103, p. 652]. The coach can use verbal scaffolding techniques [27, 29] by offering exercise-specific explanations and hints; this helps learners understand the learning content and successfully complete the exercise. Affective support is operationalized as motivational interviewing. Motivational interviewing is a counselling technique aimed at leveraging intrinsic motivation to enact behavioural change [68]. The motivational interviewing techniques help learners to feel good about the learning process, and to reframe and solidify positive self-efficacy information (cf. [17, 30, 61]). The coach can use motivational interviewing techniques by offering exercise-specific feedback, eliciting self-reflection, and applying social persuasion to raise learner self-efficacy [99]. Social support is operationalized as small talk, which is a cornerstone of building trust. Trust is an important element of the learning process [12, 21], as it makes learners more receptive to the coach’s suggestions and motivates learner persistence. Small talk leads to the building of trust by increasing coordination between speaking partners, establishing common ground, and helping to keep the conversation at a safe level of depth, thereby avoiding ‘face threat’ [22]. The coach can apply these three mechanisms in exercise-specific small talk.
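The scaffolding principle described above (raise support to the level the learner needs, then fade it again) can be illustrated with a minimal Python sketch. It is not the VESSEL implementation: the two end points of the ladder are taken from the prototype description in Sect. 5, while the two middle levels, their order, the one-step update rule, and the names LEVELS and next_level are assumptions made purely for illustration.

```python
# Support levels, least to most directive. The two end points come from the
# prototype description; the two middle levels and their order are assumptions.
LEVELS = (
    "ask whether the user needs help",
    "give a hint",
    "give an explanation",
    "tell the user what to do",
)


def next_level(current, user_struggling):
    """Raise support by one step when the user struggles, otherwise fade it by one step."""
    step = 1 if user_struggling else -1
    # Clamp so the level never leaves the ladder.
    return max(0, min(len(LEVELS) - 1, current + step))


# Hypothetical sequence of observations: struggling twice, then progressing.
level = 0
for struggling in (True, True, False):
    level = next_level(level, struggling)
print(LEVELS[level])
```

A one-step update is only one possible fading policy; a real coach could also jump several levels at once or fade more slowly than it escalates.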

3.3 Operational demands: exercises

For this prototype, we require a set of situated exercises that covers a range of possible cognitive, affective, and social issues that low-literates can encounter in daily life. We draw two exercise scenarios from the list of crucial practical situations: ‘using online banking’ (online banking), and ‘requesting a new passport at a city hall service desk’ (service desk). These two exercise domains test different skill sets: online banking tests reading and writing skills, while service desk tests speaking and comprehension. Furthermore, we apply two difficulty levels to each scenario, ‘Easy’ and ‘Hard’, resulting in four exercises: Easy online banking, Hard online banking, Easy service desk, and Hard service desk. This step has two purposes: Firstly, using four different exercises will provide a larger and more varied range of data than using two, while keeping pairs of exercises in the same domain (i.e. two online banking and two service desk exercises) enables more meaningful direct comparison of the outcomes. Secondly, this setup more accurately mirrors the participation experiences of people of low literacy, who can encounter both simple and difficult challenges in any given domain [88]. This allows us to compare the difference in practical experience between ‘easy’ and ‘hard’ situations and evaluate the types and amounts of support that are needed for each. All four exercises are designed to incorporate the ECA coach.
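The resulting two-by-two exercise set can be written down compactly. The following Python sketch only restates the design above as data; the dictionary layout and the names SCENARIOS, DIFFICULTIES, and build_exercises are illustrative assumptions, not part of the prototype.

```python
from itertools import product

# Scenario names and the skills each one tests, as described in the text.
SCENARIOS = {
    "online banking": ("reading", "writing"),
    "service desk": ("speaking", "comprehension"),
}
DIFFICULTIES = ("Easy", "Hard")


def build_exercises():
    """Return the 2 x 2 exercise set as a list of dictionaries."""
    return [
        {
            "name": f"{difficulty} {scenario}",
            "scenario": scenario,
            "difficulty": difficulty,
            "skills": SCENARIOS[scenario],
        }
        for scenario, difficulty in product(SCENARIOS, DIFFICULTIES)
    ]


exercises = build_exercises()
for exercise in exercises:
    print(exercise["name"], "->", ", ".join(exercise["skills"]))
```

Keeping the scenario as a separate field makes the within-domain comparison (Easy versus Hard in the same scenario) a simple grouping operation.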

We determine which cognitive, affective, and social challenges are likely to appear in each of these four exercises, and what level of information and communication skills will be needed, using the Societal Participation Experience of Low-Literates (SPELL) model from Schouten, Paulissen et al. [88] and domain-specific literature. In the online banking exercises, the user must transfer money from a personal account to a web shop. These exercises are designed using Bayles’ [8] overview of critical online banking usability factors, and Nielsen’s [77] and Leavitt & Shneiderman’s [58] general usability guidelines. The difference in difficulty between Easy online banking and Hard online banking is created by raising or lowering the usability and user-friendliness of the websites: The Easy online banking website is less complex, less information-rich, and easier to navigate than the Hard online banking website. Visual appearances were based on examples of real-life online banking websites (see Fig. 3).

In the service desk exercises, the user must speak to a city hall employee to report the loss of a passport. This city hall employee is presented as an ECA character. The difference in difficulty between Easy service desk and Hard service desk is created in two steps, by presenting the Easy service desk ECA as more friendly and polite than the Hard service desk one. First, we use De Jong et al.’s [25] overview of social demeanour and politeness effects to write dialogue for the ECAs. De Jong et al. provide politeness ratings for 21 dialogue tactics, ranging from imperative requests (“Do this for me”) to apologetic speech (“I’m sorry, could you please do this for me”); using this overview, we write friendly and polite dialogue for the Easy service desk ECA, and curt and impolite dialogue for the Hard service desk ECA. Second, we give the ECAs different appearances: This will help learners distinguish between the two characters, reinforcing the idea that one character is a polite person, while the other one is mean. The Easy service desk ECA is given a friendly appearance, while the Hard service desk ECA is given an unfriendly appearance.

To increase the likelihood that players interpret the visual appearances of the conversation partner ECAs as ‘friendly’ and ‘unfriendly’, these appearances are taken from a pre-study [26], in which eight low-literate participants (in groups of two) were asked to rate a set of twelve ECA characters of diverse age, ethnicity, gender presentation, and dress style; the literature currently does not show clear consensus on which ECA designs are preferred in which situations [101, 102], necessitating this approach. We expected that participants would prefer those ECAs that were similar to them, and dislike ECAs that were dissimilar, based on Moreno & Flowerday [70]. ECAs were drawn from Brinkman et al.’s [19] ‘Virtual Reality Exposure Therapy (VRET)’ virtual environment. Participants were given paper pictures of a service desk setting (Fig. 4) and of the twelve ECAs, and they were asked to ‘select the four characters you would like to have as a conversation partner in this setting and order these four from best to worst’. They were then asked to ‘select and order the four characters you would least like to see’, and finally, to fill out the ordering with the last four characters; the process was done in three steps to avoid overloading participants. The eight obtained orderings were evaluated to see which characters were considered ‘best’ or ‘worst’ most often. All participants strongly disliked one particular ECA; this ECA was chosen for the Hard service desk exercise. Three similar-looking ECAs shared the ‘best’ spot; we selected one of these for the Easy service desk exercise.
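One simple way to tally which characters were ranked ‘best’ or ‘worst’ most often is sketched below in Python. This is an assumed reading of the evaluation step, counting only the first and last position of each ordering; the pre-study may have used a different tally. The function name count_extremes and the toy orderings are invented for illustration and do not reproduce the pre-study data.

```python
from collections import Counter


def count_extremes(orderings):
    """Count how often each character is ranked first ('best') and last ('worst').

    Each ordering is a list of character labels, from most to least preferred.
    """
    best = Counter(ordering[0] for ordering in orderings)
    worst = Counter(ordering[-1] for ordering in orderings)
    return best, worst


# Invented toy data: three orderings over four placeholder characters.
toy_orderings = [
    ["A", "B", "C", "D"],
    ["B", "A", "C", "D"],
    ["A", "C", "B", "D"],
]
best, worst = count_extremes(toy_orderings)
print(best.most_common(1))   # character ranked 'best' most often
print(worst.most_common(1))  # character ranked 'worst' most often
```

Ties, such as the three characters that shared the ‘best’ spot in the pre-study, would show up as equal counts and still require a manual choice.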

Both ECAs share a small number of visual commonalities. They have one set facial expression. They open and close their mouth on a set pattern while speaking, regardless of speech content. They go through one simple ‘idle’ animation loop, swaying left and right slightly while sitting on a chair; apart from this, they employ no other body movements or gestures of any kind. Both ECAs can be seen in the context of the service desk exercises in Fig. 4; in both cases, the service desk background is the same static image.

Fig. 3 Online banking exercise websites. Left: Easy online banking. Right: Hard online banking

Fig. 4 ‘Conversation partner’ ECAs, shown inside the virtual environment used for the service desk exercise. Left image: Easy service desk exercise ECA. Right image: Hard service desk exercise ECA

For each exercise, written instructions are provided on-screen. In the online banking exercises, the instructions show the task (to transfer money to another account) and the information necessary to complete it: Recipient name and bank account number, and money amount. In the service desk exercises, the instructions only show the task. All exercises have been designed with a 6-minute time limit, in order to define ‘success’ (the exercise is correctly completed within 6 min) and ‘fail’ (the exercise is completed incorrectly or not within 6 min) states for the exercise. When the limit is reached, the exercise must be stopped. This 6-minute limit is based on a cognitive walkthrough of the exercises and on practical considerations: A shorter limit would not give participants enough time to reasonably do the exercises, while a longer limit would inflate the time footprint of the envisioned experimental study (see Sect. 6).
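The ‘success’ and ‘fail’ states defined above amount to a small decision rule, sketched here in Python. The function and parameter names are illustrative, and treating a completion at exactly 6 min as still within the limit is an assumption, since the text does not specify the boundary case.

```python
TIME_LIMIT_SECONDS = 6 * 60  # the 6-minute limit stated in the text


def exercise_outcome(completed_correctly, elapsed_seconds):
    """Return 'success' or 'fail' for one exercise attempt.

    completed_correctly: True only if the exercise was finished correctly.
    elapsed_seconds: time at which the attempt ended.
    """
    # Assumption: finishing at exactly the limit still counts as 'within 6 min'.
    if completed_correctly and elapsed_seconds <= TIME_LIMIT_SECONDS:
        return "success"
    # Incorrect completion, or no correct completion before the limit.
    return "fail"


print(exercise_outcome(True, 200))
print(exercise_outcome(True, 400))
print(exercise_outcome(False, 100))
```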

4 Specification

Following Fig. 2, we refine the generic ‘VESSEL’ requirements. This means that for each existing requirement, we create new, more detailed requirements that zoom in on one or both of the system’s two core components: The exercises, and the coach. When working with a large requirements baseline, careful choices must be made about which requirements to test in which configuration [107]. Since we are building a single-user prototype, we choose for the time being to discard requirement R4 (Collaboration), as it demands a prototype that supports multiple users at once. The remaining seven generic requirements are refined, resulting in a set of coach-specific and exercise-specific requirements. Table 1 shows the new requirements baseline. An in-depth description of the refinement process can be found in Appendix A. In addition, two use cases have been created to demonstrate the envisioned optimal way that a user would interact with VESSEL. These use cases can be found in Appendix B.

Table 1 Refined VESSEL requirements baseline, from the perspective of VESSEL as an ECA coach supporting exercises 

5 Evaluation: prototype development

Functionality. The prototype consists of the ECA coach that offers cognitive, affective, and social learning support as described in Sects. 3.1 and 3.2, and the exercises described in Sect. 3.3 (Easy and Hard online banking, and Easy and Hard service desk). Cognitive support is offered during the exercises. Bloom’s [18] taxonomy of keywords has been used to identify all cognitively challenging elements in the exercises, including (long) difficult words, complex scenario-specific terms, and necessary exercise steps that may not be intuitive. The coach knows when the user is having difficulty with these challenges and offers scaffolding support that ranges from ‘asking the user if they need help’ to ‘telling the user what to do’. If the user asks a question, the coach uses general-purpose utterances to answer it. In the service desk exercises, the coach can also show the user images of a Dutch passport, ID, or driver’s license. Affective support is offered after the exercises. The coach knows the user’s accuracy and completion time and uses this to provide motivational interviewing feedback. The coach also asks the user’s opinion on either the online banking website or the conversation partner, to encourage the user to reflect on their experience. Social support is offered before the exercises. The coach follows a short small talk script based on the topic of the exercise. The coach asks about the user’s experiences with and opinions on the topic of the exercise; depending on user answers, follow-up questions may be asked as well.

Interaction methods. Users use a mouse and keyboard to navigate and use the online banking websites. Users can talk to the service desk conversation partner ECA and the coach ECA in natural language. For the purposes of the evaluation, the ECAs are designed to be controlled via the Wizard-of-Oz method [64]. Both ECA programmes contain a list of pre-recorded natural language utterances, which were written and recorded during prototype development: All coach utterances were voiced by one research confederate, while the conversation partner ECAs were voiced by two other research confederates. The wizard operator controls the ECAs by selecting these utterances in a control program, causing the ECA to ‘say’ the utterance. Apart from selecting pre-recorded utterances, the wizard has no further control over the ECAs; the ECAs’ possibility space is fully described by the utterances. The wizard is not allowed to interact with participants in any other way.

The ECA coach has access to four groups of utterances: Cognitive support utterances, affective support utterances, social support utterances, and general-purpose utterances like “yes”, “no”, “I don’t know”, and “I did not understand you”. The wizard uses these utterances in accordance with the following rules. At the start of an exercise, social support is used. The wizard must follow the ‘small talk’ social support script as closely as possible, selecting utterances from the list of social support utterances in a pre-described order. Some of these utterances are questions that the coach asks of the user: If users answer these questions, the wizard must interpret the user’s speech and choose the correct response for the coach from a small list of possible responses. During the exercise, cognitive support is used. The wizard must interpret the user’s actions and speech to choose appropriate utterances from the list of cognitive support utterances. If users are struggling with a pre-identified challenging element, the coach should offer support about that element. In these cases, the wizard must use their own expertise to judge on a case-by-case basis which users are ‘struggling’, and which specific cognitive support utterance to use. After the exercise, affective support is used. The wizard must follow one of four motivational interviewing scripts, depending on the user’s performance (the exercise was completed with little coach support vs. the exercise was completed with a lot of coach support or not completed) and speed (the exercise was completed in under 3 min vs. the exercise was completed in 3 to 6 min or not completed). Finally, the use of general-purpose utterances is up to the wizard’s interpretation of the user’s speech and actions: This includes reacting to unanticipated user questions, prompting the user to repeat themselves if their speech was not understood, and getting the exercise ‘back on track’ as quickly as possible should unanticipated questions or disturbances occur.
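The post-exercise rule amounts to a two-by-two lookup on performance and speed. A minimal Python sketch of that rule follows. In the study, this choice was made by the human wizard, so the function name, its arguments, and the labels ‘high’/‘low’ and ‘fast’/‘slow’ are hypothetical and serve only to make the four cases explicit.

```python
FAST_LIMIT_MINUTES = 3  # speed threshold stated in the text


def select_mi_script(completed, much_support, minutes):
    """Pick one of the four motivational interviewing scripts.

    completed:    True if the exercise was finished within the time limit.
    much_support: True if the wizard judged that a lot of coach support was needed.
    minutes:      completion time in minutes (ignored when not completed).
    Returns a (performance, speed) pair that identifies the script.
    """
    # 'Not completed' falls into the low-performance and slow categories.
    performance = "high" if completed and not much_support else "low"
    speed = "fast" if completed and minutes < FAST_LIMIT_MINUTES else "slow"
    return performance, speed
```

Whether ‘a lot of coach support’ applies remains a case-by-case judgment by the wizard; the sketch simply takes that judgment as an input.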

The conversation partner ECA has access to two groups of utterances. A scenario script contains all utterances, in the correct order, to hold the exercise conversation. The wizard must follow the scenario script perfectly when the exercise is conducted. Here, too, some of the ECA’s utterances are questions; the wizard must interpret user answers to these questions to select the correct follow-up utterance. A second list contains general-purpose utterances, similar to the coach’s (but recorded in the conversation partner ECA’s voices).

Appearance. Exercise appearances are shown in Sect. 3.3 (see Figs. 3 and 4). The coach ECA’s visual appearance was based on the same pre-study used for the conversation partner ECAs [26]. Participants were asked to imagine the twelve ECA characters as digital coaches and order them from most to least preferred. While not unanimous, one particular ECA was ranked in the top spot more than any other. We selected this ECA as the coach (see Fig. 5). Like the conversation partner ECAs, the coach ECA has one facial expression, opens and closes its mouth while talking on a single animation cycle regardless of the audio being played, does not gesture or use body language, and animates in a simple ‘swaying its head back and forth’ animation loop. A grey background is used.

Fig. 5

VESSEL Coach ECA (top right corner) and supporting material for the Easy online banking exercise. Text is in Dutch. From top to bottom, lines read: ‘The exercise is: Transferring money using online banking. To whom: Mister Jansen. How much money: 10 euro. Account number: NL POST 1200 1111 00’

6 Evaluation: methods

6.1 Experimental design

An experiment was carried out to evaluate the learning effectiveness impact of our VESSEL prototype coach, in terms of the six claims that were presented as use case post-conditions (see Appendix B). In the SCE method, specification claims serve as evaluation hypotheses. This results in the following six hypotheses:

Learning Experience

  • H1: Cognitive Experience (Performance) The coach leads to a shorter exercise completion time, and higher perceived performance.

  • H2: Affective Experience (Positive Affect) The coach leads to more positive affective states during and after the exercise.

  • H3: Social Experience (Engagement) The coach increases the amount of user-system interaction and results in learners viewing VESSEL as more helpful and easy to learn with.

Learning Outcomes

  • H4: Cognitive Outcomes (Success) The coach leads to a higher exercise completion rate.

  • H5: Affective Outcomes (Self-Efficacy) The coach leads to higher self-efficacy.

  • H6: Social Outcomes (Retention) The coach leads to a higher motivation to continue learning.

To test these hypotheses, a mixed-method repeated-measures within-subjects experiment was designed. The study’s main independent variable was Coach Presence. This variable had two levels: With Coach, and Without Coach. Participants were invited to work with the prototype in two consecutive sessions (one week apart): One session in which they tested the complete prototype, including all exercises and the ECA coach (the ‘With Coach’ session), and one session in which they tested a version of the prototype that only included the four exercises, but not the coach (the ‘Without Coach’ session). In the With Coach session, participants completed all four exercises (Easy online banking, Hard online banking, Easy service desk, and Hard service desk) with support from the coach. In the Without Coach session, participants completed the same exercises without coach support. All participants participated in both sessions. Session order was counterbalanced: 50% of participants did the With Coach session the first week and the Without Coach session the second week, and 50% of participants did the opposite. Exercise order was partially counterbalanced: Each participant was offered the four exercises according to one of four pre-determined orderings. These orderings were counterbalanced across participants, but kept the same per participant in both the With Coach and Without Coach sessions.
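One way to realise such a counterbalancing scheme is sketched below in Python. The four exercise orderings used in the study are not listed here, so the cyclic rotations in `ORDERINGS` are a hypothetical stand-in, as are the function name and the assignment rule; the group size of twelve matches the number of participants reported in Sect. 6.3.

```python
from collections import Counter

EXERCISES = [
    "Easy online banking",
    "Hard online banking",
    "Easy service desk",
    "Hard service desk",
]

# Hypothetical orderings: the four cyclic rotations of the exercise list.
ORDERINGS = [EXERCISES[k:] + EXERCISES[:k] for k in range(4)]

SESSION_ORDERS = [
    ("With Coach", "Without Coach"),
    ("Without Coach", "With Coach"),
]


def assign(index):
    """Assign a session order and one exercise ordering to participant `index`."""
    sessions = SESSION_ORDERS[index % 2]
    # Shift by one every four participants so that a given exercise ordering
    # is not always paired with the same session order.
    exercises = ORDERINGS[(index + index // 4) % 4]
    # The same exercise ordering is reused in both of the participant's sessions.
    return {"sessions": sessions, "exercises": exercises}


plan = [assign(i) for i in range(12)]
first_week = Counter(p["sessions"][0] for p in plan)
ordering_use = Counter(tuple(p["exercises"]) for p in plan)
```

With twelve participants, this assignment places six participants in each session order and uses each exercise ordering three times.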

6.2 Measures

Twenty dependent variables were measured: Eighteen variables were self-report measures, obtained using two questionnaires (see Sect. 6.4), and two variables were objective performance metrics. Table 2 describes the variables. Additionally, semi-structured interviews were used to gain qualitative insight into the proceedings and the participants’ experiences with the VESSEL prototype, with the following questions:

  • How did you like the session? Do you think it went well, or poorly?

  • What went well for you? What went poorly for you? What did you think was the cause?

  • What parts of the session did you enjoy? And what parts of the session did you dislike?

  • What would you change about the exercises you just did?

One question was only asked in the With Coach session:

  • What do you think about the coach? Has the coach helped you? Was it nice to have the coach around, or annoying?

Additionally, the following questions were only asked after the second session:

  • Did you notice any differences between the two sessions? What differences did you see?

  • Which of the two sessions did you like best? And why?

6.3 Participants

Twelve low-literate users participated in the entire experiment. Kurvers et al.’s [56] five language learner profiles were used to select suitable participants; these profiles divide low-literate first-language learners (L1) and second-language learners (L2) into categories based on their language comprehension skills and their learning ability. Only learners that matched profiles 2 (L1 and L2 learners with no particular strengths or weaknesses, considered ‘average low-literate learners’), 3 (typical L2 learners, particularly struggling with vocabulary and with speaking and understanding spoken Dutch), and 4 (low-skilled L1 learners, with decent speaking skills but serious difficulties with reading and writing) were invited to participate, as these learners can realistically benefit from computer-supported learning. Learners at the extreme ends of the low-literacy spectrum (profiles 1, relatively high-skilled and self-directed L1 and L2 learners, and 5, L1 and L2 learners with serious learning difficulties and very limited educational backgrounds) are expected, respectively, to be too skilled to benefit from our support, and too low-skilled to be able to use our prototype in the first place. Participants were recruited in several language classes throughout the Netherlands; teachers in these classes used the profiles to select and invite suitable learners to participate. Six men and six women participated, with ages ranging from 30 to 63 (M=48.2, SD=10.5). Two of the participants were natively fluent in Dutch; the other ten participants identified as ‘somewhat fluent’. Other (native) languages spoken by the participants included: Arabic, Bosnian, Edo, English, French, Somali, Spanish, and Turkish. Four participants reported having prior experience with online banking, and all twelve participants had prior experience with service desk conversations. Of the latter, seven participants specifically had experience with passport recovery. There was no overlap between these participants and the participants of the pre-study [26].

6.4 Materials

The experimental setup consisted of two laptops connected to two additional monitors (Fig. 6). The laptops, on one side of the table, were used by the experimenters. The monitors, on the other side of the table, were used by the participants. The laptop and monitor on the right were used to run and control the coach. The laptop and monitor on the left were used to run and control the exercise environment. On the participant side, a mouse and keyboard (plugged into the left laptop) were provided for the online banking exercises.

Fig. 6

Schematic overview of experimental setup: 2 laptops (lower figures) connected to 2 monitors (upper figures). Monitors are placed and angled such that participants could not see the laptops and the experimenters while seated

Three questionnaires were used. Two questionnaires measured the eighteen self-report variables (see Table 2): These were called the ‘exercise’ questionnaire (EQ), and the ‘session’ questionnaire (SQ). Answers were given on a five-point bipolar Likert scale, using greyscale answer bars (Fig. 7). Participants were told to mark one of the five boxes per question. Bars were labelled ‘Nee’ (No) and ‘Ja’ (Yes) at the left and right extremes. Question SQ.1 was included as a practice question to allow low-literate participants to ‘get used to’ the answer schema and was not included in later analysis. A third ‘demographic’ questionnaire measured participant age, sex, schooling history, time spent in the Netherlands, known languages (‘fluent’ and ‘somewhat fluent’), and prior experiences with online banking and city hall service desk situations. For objective measures, exercise completion time was measured with a stopwatch, and exercise completion was tallied by hand. Finally, an audio recorder was used to record the experimental proceedings and the end-of-session interviews.

Table 2 Overview of quantitative measures

Fig. 7

Example answer bar for the EQ and SQ (‘Nee’ means ‘No’, ‘Ja’ means ‘Yes’)

6.5 Procedure

The first session started with a general introduction, informed consent forms, and the demographic questionnaire. Next, the first SQ was administered. Experimental proceedings diverged after that, based on experimental condition. In the Without Coach condition, researchers explained the general experiment flow. Participants were introduced to their first exercise and shown the instruction material. Participants were told to complete the exercise alone, without outside help, within the 6-minute time limit (which they could not see). After that time, or as soon as participants were finished, an EQ was administered. Proceeding from there, the remaining exercises were carried out the same way. In the With Coach condition, researchers instead introduced the coach. The coach (controlled always by the same experimenter) introduced itself to the user (with the name ‘Anna’) and explained the general experiment flow. The coach introduced the first exercise and the instruction material and offered social learning support. Participants were told to complete the exercise, with the help of the coach, within the 6-minute time limit. During the exercise, the coach provided cognitive learning support when needed. After the time limit, or as soon as participants completed the exercise, the coach offered affective learning support. Researchers then administered an EQ. The remaining exercises were carried out the same way. After the end of the fourth exercise, conditions converged. The researchers administered a second SQ. Then, a semi-structured ending interview was conducted, using the questions presented above. And finally, participants were debriefed.

In the second week of experimental sessions, each participant completed the same exercises in the opposite condition (With Coach instead of Without Coach, or vice versa). Otherwise, the same procedure from week 1 was used. The week 2 ending interview used the same questions as week 1, with the addition of the questions about perceived differences between the two conditions. Finally, at the end of week 2, participants were fully debriefed and rewarded for their participation.

7 Evaluation: results

Three sets of analyses were carried out. First, a repeated-measures general linear model (GLM) analysis was conducted on the EQ data. Second, a factor analysis was used to condense the data of the SQs into several factors; another repeated-measures GLM analysis was conducted on these results, as well as a paired-samples T-test. Third, a final repeated-measures GLM analysis was used to analyze the performance results of the online banking exercises. Finally, qualitative observations were made by the researchers, both live during the experiment and by listening to the audio recordings afterwards.

Prior to analysis, questionnaire reliabilities were investigated. The EQ had an average reliability of \(\alpha\)=.845. Besides question 1 (the ‘practice question’), question 13 was also dropped from the SQ as it showed scattered answers and low reliability (based on general descriptives and Cronbach’s alpha). The complex wording of this question seems to have led to confusion and misunderstanding. The remaining eleven questions show an average reliability of \(\alpha\)=.600.
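For reference, Cronbach’s alpha for a set of \(k\) items is \(\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i s_i^2}{s_T^2}\right)\), where \(s_i^2\) is the variance of item \(i\) and \(s_T^2\) the variance of the summed scores. The Python sketch below implements this standard formula using only the standard library. It is not the analysis code used in the study; the function name and input layout (one list of item scores per respondent) are assumptions.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents.

    rows: one list of item scores per respondent, all of the same length (>= 2).
    """
    k = len(rows[0])
    if k < 2 or any(len(row) != k for row in rows):
        raise ValueError("each respondent needs the same number (>= 2) of items")
    item_columns = list(zip(*rows))  # transpose: one tuple of scores per item
    item_variance_sum = sum(variance(column) for column in item_columns)
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)
```

Sample variances are used throughout; because the formula only depends on a ratio of variances, population variances would give the same value.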

7.1 Exercise questionnaire analysis

A 2-by-2-by-2 repeated-measures GLM analysis was done with the EQ data. Three GLM factors were chosen. The Coach factor had two levels: ‘With Coach’ and ‘Without Coach’. The Scenario factor had two levels: ‘online banking’ and ‘service desk’. The Difficulty factor had two levels: ‘Easy’ and ‘Hard’. The five questions of the EQ were all treated independently: They were designed to measure entirely separate concepts, and Pearson correlation analysis showed no significant correlations. All main effects and all interaction effects were tested. Table 3 shows means and standard deviations of the five questions for each of the eight measurement moments. Table 4 shows the analysis results.

Table 3 EQ means (standard deviations)  
Table 4 Significant results of EQ repeated-measures GLM analysis  

The following significant results were found:

  • Coach. Results showed that perceived performance was higher, positive affect was higher, and perceived computer support was higher for With Coach compared to Without Coach.

  • Scenario. Results showed higher perceived performance, higher self-efficacy, and higher positive affect for service desk compared to online banking. Online banking showed higher experienced difficulty.

  • Difficulty. Results showed higher perceived performance, higher self-efficacy, and higher positive affect for Easy compared to Hard. Hard showed higher experienced difficulty.

  • Coach*Scenario. Two sets of effects were found. For online banking only, the With Coach condition showed increased perceived performance, positive affect, and perceived computer support compared to Without Coach. This was not seen for service desk. Furthermore, With Coach showed lower experienced difficulty for online banking, but higher experienced difficulty for service desk; Without Coach did not show this.

  • Coach*Difficulty. Results showed that With Coach raised self-efficacy in the Hard exercises compared to Without Coach. No similar effect was seen for the Easy exercises.

  • Scenario*Difficulty. Results showed that for online banking, the Hard exercise resulted in lower perceived performance, higher experienced difficulty, lower positive affect, and lower perceived computer support, compared to the Easy exercise. No similar effects were seen for service desk.

  • Coach*Scenario*Difficulty. Two effects were found. In the Hard online banking exercise, With Coach showed higher self-efficacy than Without Coach; in the Easy online banking exercise, no difference was found. Perceived computer support was much higher for the Easy online banking exercise than for the Hard online banking exercise, although both dropped significantly in the Without Coach condition compared to With Coach. In both cases, no effects were seen for service desk.

Tests for between-subjects effects showed no significant results for age, gender, experience with online banking/service desk, and exercise counterbalancing order.

7.2 Session questionnaire analysis

Three analysis steps were used for the SQ. First, factor analysis was used to effect data reduction: Pearson correlation analysis showed several potentially significant correlations. An exploratory factor analysis was conducted, using principal component analysis for extraction and Varimax Rotation (with Kaiser Normalization) for rotation. Both eigenvalues and scree plots suggested a solution with four factors. Table 5 shows the factor loadings for this solution.

Table 5 Factor loadings for the 11 questions used in the factor analysis  

Based on the factor loadings shown in Table 5, a two-factor solution was decided on. Factor 1, ‘Information Skills’, contained questions 2, 5, 6, and 10, with a reliability of \(\alpha\)=.86. Factor 2, ‘Communication Skills’, contained questions 3, 4, 9, and 11, with a reliability of \(\alpha\)=.82. While questions 7 and 12 seemed to form a third factor, the reliability of this factor was only \(\alpha\)=.41; these questions were kept as separate items instead. Question 8 was also kept as a separate item.

Second, a 2-by-2 repeated-measures GLM analysis was conducted on the two factors and three independent questions. Because the SQ was only administered at the start and end of each experimental session, only two GLM factors were chosen. The Coach factor had levels corresponding to the coach’s presence or absence, ‘With Coach’ and ‘Without Coach’, and the Time factor had levels corresponding to the SQ measurement moment, ‘Pre-Session’ (the questionnaire was administered before a session) and ‘Post-Session’ (the questionnaire was administered after the end of a session). All main effects and interaction effects were tested. Only one significant result was observed: Participant information skill was higher for Post-Session than for Pre-Session (F=5.474, p=.039). Tests for between-subjects effects showed no significant results for age, gender, experience with online banking/service desk, and exercise counterbalancing order. Table 6 shows means and standard deviations for the factors and questions.

Table 6 SQ data means (standard deviations)  

Third, a paired-samples T-test was conducted to compare ‘Information Skills’ and ‘Communication Skills’ means in the first and second week. These means are different from the means in Table 6: 50% of participants did With Coach sessions in the first week, and 50% did Without Coach in the first week. First week/second week means were compared for the four SQ measurement moments (before and after each session). Results are shown in Fig. 8. Before the first experimental session and before and after the second session, ‘Communication Skills’ were significantly higher than ‘Information Skills’.
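The test statistic of a paired-samples t-test is the mean of the per-participant differences divided by its standard error. The Python sketch below shows this computation with the standard library only; it is an illustration of the standard formula rather than the study’s analysis code, and the function name and argument names are hypothetical. Here, each participant would contribute one ‘Communication Skills’ score and one ‘Information Skills’ score for a given measurement moment.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t statistic, degrees of freedom) for two paired score lists."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    # Standard error of the mean difference (sample standard deviation).
    standard_error = stdev(differences) / sqrt(n)
    return mean(differences) / standard_error, n - 1
```

The p-value follows from the t distribution with the returned degrees of freedom; this step is left to a statistics package and is not reproduced here.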

7.3 Performance metrics analysis

A 2-by-2 repeated-measures GLM analysis was done with participant completion time and completion rate. Only data from the online banking exercises were used for this analysis: Data from the service desk exercises did not show enough variance, as completion rates were 100% for both exercises and completion times were strongly homogeneous. Two GLM factors were chosen: The Coach factor had the levels ‘With Coach’ and ‘Without Coach’, and the Difficulty factor had the levels ‘Easy’ and ‘Hard’. All main effects and interaction effects were tested. Significant results were only seen for the Difficulty factor: Exercise completion time was higher (F=13.035, p=.006), and exercise completion rate was lower (F=22.559, p=.001) for Hard compared to Easy. Tests for between-subjects effects showed no significant results for age, gender, experience with online banking/service desk, and exercise counterbalancing order. Table 7 shows means and standard deviations.

Fig. 8

Means for ‘Information Skills’ and ‘Communication Skills’ factors, across the four measurement moments. Values next to bars represent mean (standard deviation). Boxes on the right show the test statistics of the paired-samples T-test that compared means for Information Skills and Communication skills in that measurement moment; bold text indicates significant results

Table 7 Performance metrics means (standard deviations) for online banking exercises  

7.4 Observations

Experimenters observed that the coach seemed to work as intended, particularly for the online banking exercises. In Without Coach sessions, participants often seemed to quietly struggle with completing the exercise; no participants tried to talk to the computer system, and only some participants tried to get researcher help. However, in With Coach sessions, almost all participants interacted with the coach in some way and benefited from its help. Broad personal differences were observed in the degree to which this happened. Participants who spoke with the researchers a lot during the introduction to the experiment spoke to the coach in the same way they would talk to a human actor, including attributing personality traits to ‘her’ and asking complex questions (e.g. “This bill I have to pay seems way too high. Coach, what do you think?”). Participants who spoke less to the researchers commonly spoke less to the coach as well, generally restricting themselves to answering coach questions and asking for direct instructions (e.g. “Coach, how do I pay a bill?”). However, these participants were still seen acting on the coach’s advice. Significantly less coach-participant interaction was seen during the service desk exercises. Participants asked for help less often, and in fact seemed to get stuck less often. Interestingly, whenever help was needed, participants more often asked the service desk ECA directly. With the focus on the conversation partner, participants seemed to overlook the coach’s presence. One participant echoed this, saying that (paraphrased) “\(\ldots\) I kind of forgot she was there.” Participants spoke to the service desk ECA the same way they spoke to the coach, i.e. some participants really engaged with her, while others only answered direct questions. Interestingly, this most often happened in situations where the scenario was inaccurate or incomplete compared to real life: For instance, many participants asked if they would be required to ‘bring passport photos next time’, something that we had not incorporated in the exercise. Participants would use their own experience and expertise with these scenarios to catch these inaccuracies, and then press the conversation partner for clarification.

Some unexpected technical difficulties occurred during the experiment. Both the coach and the conversation partner programmes showed an unexplained, variable time delay when speaking, ranging from two seconds to twenty. From interviews, it seems participants perceived this as ‘the coach just being very quiet’. But for the experimenters, this made it hard to use the right support at the right time. Particularly in situations where participants asked questions and then quickly moved on, this was a problem: The coach would either be stuck using a now-irrelevant speech utterance, confusing the participant, or it would have to be muted for the duration, causing a strange visual effect (the coach soundlessly ‘talking’) that some participants noticed. In either case, no further support would be possible for a while.

Post-test interviews showed that most participants accepted the Wizard-of-Oz illusion quite readily. Participants did not notice the behind-the-scenes technical difficulties, and even the aforementioned ‘soundless talking’ was usually mentioned as an oddity, not as something that stood out. When asked about the coach, participant response was almost universally positive. This seemed inversely correlated with ‘participant skill’: Participants who completed the exercises easily and quickly were more often ambivalent or negative about the coach, while participants who required a lot of help to complete exercises were very happy with the coach’s help. Participants were positive about the entire VESSEL prototype: Many mentioned that they enjoyed this way of learning and doing exercises and expressed hope that they would be able to ‘do something like this at home’ soon. The interviews also showed that many participants saw the online banking exercises as much more difficult than the service desk exercises. Particularly, the Hard online banking exercise was considered very challenging, and almost impossible to complete (within the time limit) without the coach. The coach’s support was much appreciated here. In contrast, both service desk exercises were seen as easy. Participants did sometimes notice differences in politeness between the two conversation partners, but otherwise seemed to consider the exercises equivalent; this was not the case with the online banking exercises, which were more clearly ‘easy’ and ‘difficult’.

8 Conclusions and discussion

8.1 Findings

Building on earlier design work for the envisioned system VESSEL [88, 89], this study has designed, developed, and evaluated VESSEL as a virtual learning environment, wherein societal participation exercises are supported by an ECA coach. The foundation of data was updated with situated interactive exercises, the literature on cognitive, affective, and social learning support, and the benefits of ECA coaching. The specification requirements were refined to reflect VESSEL as an ECA coach-supported exercise environment (see Table 1). Use cases were derived and used to design and develop a functional VESSEL prototype. This prototype was tested with low-literate end users in order to evaluate the claims of learning effectiveness underlying the ECA coach.

The study’s first research question was: “How can we create an ECA coach that provides effective cognitive, affective, and social learning support to low-literate learners doing situated interactive exercises in a virtual learning environment?” Sub-question 1a, “In what ways can an ECA coach provide cognitive, affective, and social learning support?”, is answered in Sect. 3. The coach should offer cognitive support in the form of scaffolding, affective support in the form of motivational interviewing, and social support in the form of small talk. Sub-question 1b, “Which functionalities, interaction methods, and appearances should an ECA coach have to effectively provide this learning support in a virtual learning environment?” is answered in Sects. 4 and 5. The coach should provide learning support that is adapted to the individual learner, to help them complete exercises. The coach should interact with learners in the form of pre-recorded utterances, complemented with visual materials when necessary; and the coach’s appearance should align with user expectations of the role of a ‘digital coach’. This outcome seems true across all participating learners, regardless of age, sex, or ethnicity. Expectations for the pre-study [26] were that participants would prefer ECAs that were similar to them in gender and ethnicity, as humans can experience similarity attraction to ECAs just as to humans [70, 72]. But instead, all ECAs in the prototype were valued on matching the (visual) stereotype of their role. Participants chose the coach depicted in Fig. 5 because ‘she looked friendly and approachable’, and the Easy service desk conversation partner depicted in Fig. 4 because ‘she looked like she belonged there, like she would know what was happening’. Participants would actually dislike similar-looking ECAs, saying that (paraphrased) “if this person is like me, also low literate, then they won’t be able to help me”. This clashes with expectations that user-ECA similarity attraction would be high [70], but does confirm that stereotype-reinforcing appearances can have a strong impact [4]; our results seem to suggest a ‘job-appropriate clothing’ stereotype rather than Angeli & Brahnam’s [4] sex and gender stereotypes, though it should be mentioned that our ‘most positive’ ECAs were both read as female, while our ‘most negative’ ECA was read as male. Our most-popular ECAs were also generally the more conventionally attractive ones, mirroring results by Khan and de Angeli [49] and Nass et al. [72]. In this study’s post-test interviews, our participants (no overlap with the pre-study participants) primarily reported that they judged the ECA characters on how well they fit the scenario: The coach and the Easy service desk conversation partner were liked and valued, while the Hard service desk conversation partner was disliked.

The study’s second research question was: “Does this support-providing ECA coach increase learning effectiveness for low-literate learners working with VESSEL, compared to low-literate learners working with VESSEL but not receiving coach support?” Six hypotheses were derived, based on six claims of learning effectiveness: Cognitive, affective, and social learning experience and cognitive, affective, and social learning outcomes. Using the results from Sect. 7, these hypotheses resolve in the following ways:

Learning Experience

  • H1: Cognitive Experience (Performance) This hypothesis is partially supported. Self-reported performance increased in the presence of the coach. However, completion time did not improve: Users felt that they were doing better in the presence of the coach, but were not actually any faster

  • H2: Affective Experience (Positive Affect) This hypothesis is supported. User positive affect significantly increased in the presence of the coach

  • H3: Social Experience (Engagement) This hypothesis is supported. Users reported feeling ‘supported by the computer’ significantly more when the coach was present. Users were also observed to interact with the system much more when the coach was present: Users actually talked to the coach, with some interactions going beyond the exercise topics

Learning Outcomes

  • H4: Cognitive Outcomes (Success) This hypothesis is not supported. No significant main effect of coach presence was found for exercise completion rate

  • H5: Affective Outcomes (Self-Efficacy) This hypothesis is partially supported. No significant main effect of coach presence on any self-efficacy measure was found. But an interaction effect shows that the coach significantly raised online banking self-efficacy only after the Hard online banking exercise

  • H6: Social Outcomes (Retention) This hypothesis is not supported. After factor analysis, only SQ question 12 measured this hypothesis. No significant main effect of coach presence was found

The ECA coach created in this study, designed to provide cognitive, affective, and social learning support meaningfully integrated into four online banking and service desk exercises, has significantly increased several aspects of the learning effectiveness of VESSEL. The hypothesis that working with the ECA coach would improve the learning experience is fully supported in hypotheses H2 and H3, and partially supported in H1. The hypothesis that working with the coach would improve learning outcomes is only partially supported in hypothesis H5. This seems to indicate that the coach particularly influenced participants’ subjective experience of working with VESSEL: Participants were more positive and more engaged, felt like they performed better, and showed a higher self-efficacy regarding online banking. A similar increase in social engagement and self-efficacy was found for Lane et al.’s [57] ECA coach, and similar improvements in the affective experience are reported by Lester et al.’s persona effect study [59], Lane et al.’s computer science education ECA coach [57], Bercht & Vicari’s pedagogical support agent [9], and Shaw et al.’s embodied situated support agent [93]. However, objective measures of performance and success (exercise completion rate and completion time) were not influenced. This result runs counter to other studies that show that ECA coaches can influence objective learning outcomes such as learner behaviour (e.g. increasing rate of physical exercise and changing diet [16], and increasing learner involvement in dialogue with a learning agent [86]) and user-system satisfaction [14]. This discrepancy bears further investigation. It is possible that our skew towards subjective (self-reported) findings is a result of the mostly subjective set of measurements. 
Future studies should investigate if other objective performance measures (such as number of mistakes made, or amount of coach support needed) reveal more digital coach effects, or if the influence of the VESSEL coach as described in this work is mostly subjective.

8.2 Limitations

As this prototype was designed to be a proof-of-concept first design, a number of unexpected shortcomings were encountered during the experiments. These can be related to the functionality, interaction methods, and appearance of the exercises and the coach. A significant issue with exercise functionality was that difficulty levels of the exercises did not come out as balanced as designed. In analysis, both of the Easy (online banking/service desk) and both of the Hard exercises were treated as equivalent in difficulty level (as intended). However, the Hard online banking exercise was significantly more difficult than any other. This can be seen in the main effects and interaction effects for the ‘Scenario’ and ‘Difficulty’ factors: All main effects for either factor are always accompanied by either a ‘Scenario*Difficulty’ interaction effect, or a ‘Coach*Scenario*Difficulty’ one that shows strong differences between the Hard online banking exercise and the other three exercises. Additionally, in the post-experiment interviews, many participants reported seeing the Hard online banking exercise as an outlier. Almost no differences were seen between the two service desk exercises. Following up on the ‘Coach*Scenario’ and ‘Coach*Scenario*Difficulty’ interaction effects seems to suggest that all coach-related main findings only apply to ‘difficult information skill exercises’, or maybe only to ‘difficult online banking’. Consequently, result generalizability suffers. This can be seen as a failure to adhere to requirement R1.1-E (exercise adaptability). Difficulty levels were not properly calibrated. For the online banking exercises, difficulty was intended to come from complexity and information density differences. But these differences were much stronger than expected. For the service desk exercises, difficulty was intended to come from sensitivity and politeness differences. 
However, strict adherence to R2.1-E (exercise sensitivity) meant that these exercises were not significantly different in practice. Additionally, participant communication skill was significantly higher than information skill, throughout the experiment. All participants reported having prior experience with ‘service desk conversations’, and over half of all participants had explicit experience with ‘passport application’ conversations. This likely lowered the experienced difficulty for these exercises. Both of these issues highlight the importance of user involvement in all steps of the design process, particularly when designing for demographics with particular needs: Pre-testing the exercises with low-literate users would have revealed the low impact of the politeness manipulation and the users’ pre-existing knowledge of and focus on the exercise domain, allowing more careful calibration to take place. This stands as a lesson for future work.

The service desk exercise interaction methods showed two more shortcomings. First, because participants held a natural dialogue with the conversation partner, it turned out to be unexpectedly difficult for the coach to provide support without interfering in the conversation. To provide support, the service desk conversation would have to be stopped, creating unrealistic pauses (in scenario context). Additionally, participants reported in the interviews that switching attention between the conversation partner and the coach felt strange and took effort. Participants would direct their questions at the conversation partner instead of the coach, and (in some cases) forget about the coach entirely. While the single ECA coach in the online banking exercises worked, having multiple (talking) ECAs in a single exercise may require more careful design; collecting all required functionality in a single ECA seems like the optimal solution (and one we intend to study in later work), but if this is not possible, user-centred design and testing could ensure that the different ECAs are actually perceived as uniquely meaningful. Second, measures of exercise completion rate and completion time were uninformative for the service desk exercises: Regardless of difficulty level or coach presence, exercise completion rate was 100%, and completion times showed very little variance. Again, this can primarily be attributed to high participant communication skills and experience with the scenario. Adding to this, the fact that the exercise was a conversation gave it a clear, easy-to-understand structure that the online banking exercises did not have. On the online banking websites, participants could get lost and lose time, while during the conversations, the conversation partner guided the participant with directed questions. This may have disincentivized ‘exploratory’ behaviour: Participants felt they had to follow suit in the conversation, instead of (for example) asking about unfamiliar words. 
The combination of prior experience and a strong guided structure meant that all participants completed the conversation in close to minimum possible time. For this type of exercise, ‘completion time’ may not be a valuable performance metric.

The most significant issue with the ECA coach was the informal nature of the Wizard-of-Oz control rules. Clear behavioural rules are important for the success of the Wizard-of-Oz method [64]. During the small talk and motivational interviewing sections, there was a flow structure based (partially) on measurable objectives and keywords. However, particularly during the scaffolding section, the provided support was highly dependent on the wizard’s appraisal of the situation. This led to two uncertainties. Functionality-wise, it was unclear what support the coach should give at any given time and for any given problem, which can be cast as a failure to adhere to requirement R1.1-C (coach adaptability). Regarding interactions, it was unclear how much of the participants’ utterances the coach (represented by the wizard) was supposed to understand. Due to the lack of clear rules, the wizard probably responded to more participant utterances and behaviours than an automated ECA could have done. A human wizard can understand participant questions, perceive and read participant nonverbal cues such as body language, and analyze their emotional state. A human wizard can also apply their own reasoning to understand what a participant is ‘trying to do’, and direct support accordingly. This makes the found effects uncertain. Would a fully autonomous digital coach, with limited interaction possibilities, still have the same effects for low-literate participants? Veletsianos & Miller [104] emphasize the importance of a social, human-like experience for users working with pedagogical agents, suggesting that more machine-like interaction might not have the same positive effects. 
Future work should investigate ways of structuring and formalizing the coach’s control rules, regarding both support functionality and speech recognition (taking into account the opportunities afforded for the latter by state-of-the-art technology), in order to increase accuracy and study the VESSEL concept as envisioned.
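As an illustration of what formalized control rules could look like, the following minimal Python sketch selects a pre-recorded utterance from an ordered rule table. It is not the control logic used in this study, where support was chosen by the human wizard. The observable learner features (`seconds_idle`, `errors_on_step`, `asked_for_help`), the thresholds, and the utterance identifiers are all hypothetical assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class LearnerState:
    """Observable learner features (hypothetical, chosen for illustration)."""
    seconds_idle: float    # time since the learner's last action
    errors_on_step: int    # wrong attempts on the current exercise step
    asked_for_help: bool   # learner explicitly addressed the coach


@dataclass(frozen=True)
class Rule:
    """One explicit control rule: if the condition holds, play the utterance."""
    name: str
    condition: Callable[[LearnerState], bool]
    utterance_id: str      # key of a pre-recorded utterance (hypothetical)


# Rules are ordered from most to least specific support; the first match wins.
# All thresholds are placeholders, not values derived from the study.
RULES: List[Rule] = [
    Rule("explicit_request", lambda s: s.asked_for_help, "explain_step"),
    Rule("repeated_errors", lambda s: s.errors_on_step >= 3, "explain_step"),
    Rule("first_errors", lambda s: s.errors_on_step >= 1, "hint_step"),
    Rule("long_idle", lambda s: s.seconds_idle >= 30.0, "prompt_continue"),
]


def select_support(state: LearnerState,
                   rules: List[Rule] = RULES) -> Optional[str]:
    """Return the utterance of the first matching rule, or None for no support."""
    for rule in rules:
        if rule.condition(state):
            return rule.utterance_id
    return None
```

Because every decision in such a table depends only on explicitly listed, measurable features, a wizard following it could not respond to cues that an automated coach would be unable to perceive, which is the uncertainty raised above.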

One issue shared by the coach and the service desk exercises was the low graphical fidelity of the ECAs. All ECAs had a low-fidelity, somewhat unrealistic appearance, and only one facial expression. The question of whether human ECAs should be ‘naturalistic’ (as human-looking as possible) or ‘stylized’ (non-realistic and exaggerated) has no clear answer: Haake & Gulz [42] collect and discuss arguments for both approaches and conclude that the ‘right’ answer in any context depends on the agent’s intended goals and motives. While the VESSEL ECAs were accepted as social actors, it is possible that more naturalistic appearances would have resulted in stronger emotional and social bonding: Perhaps the coach’s emotional support, or the intended politeness of the Easy service desk conversation partner and intended rudeness of the Hard service desk conversation partner, would have shown stronger effects. This stands as a direction for future study.

Finally, two important oversights in the experimental design relate to the participants. First, the relatively low total number of participants almost certainly reduced statistical power, and with it the chance of finding significant effects. Second, the relative lack of first-language learner (L1) participants may have made it impossible to find differences between these participants and second-language learner (L2) participants. It is well-documented that low-literate first- and second-language learners encounter different problems in learning and participation [43, 55, 56, 79]. Since we could only find two L1 participants for our evaluation, we cannot say if the two groups experienced the prototype (the coach, the exercises, or the interaction methods) in different ways. This is an important aspect of designing for these demographics. To address both oversights, we intend to recruit larger and more varied participant samples in future studies.

8.3 Conclusions

The previous caveats notwithstanding, our results do indicate that our digital coach has significant beneficial effects for low-literate learners (using VESSEL). We mention in Sect. 1 that an ECA coach could benefit low-literates by ‘putting a human face on computer support’. The results from this study show that the low-literate users accepted our coach as a useful source of help that could be relied on. Real interaction was observed between participants and coach: Help was asked for, offered, and accepted, and a small number of participants actually engaged the coach in dialogue, suggesting that, as predicted by Bickmore & Picard [15], a friendly relationship of trust has started to form. Miao et al. [67] have already shown that ECA coaches in general can be accepted by learners; our work extends this by showing that our ECA coach design is accepted by low-literate learners in particular. These positive effects were particularly seen with participants who struggled with the exercises, suggesting that they were helped the most by the coach’s presence and support. Since our primary goal with VESSEL is to support exactly these learners (learners in Kurvers et al.’s [56] profiles 2 and 3, see Sect. 6.3), this is promising. All the same, we do note that these positive effects were only found in the Hard online banking exercise, which was designed to test information skills. Our results indicate that the coach supports information skills learning, but do not (yet) show a similar benefit to communication skills learning.

In conclusion, the starting assumption of our work (that a carefully designed virtual coach with integrated cognitive/affective/social learning support would work with low-literate societal participation learners) is confirmed, opening the possibility for more specialized work in this area. Our own future work will build on these results. The following iteration in our VESSEL design process will focus on addressing the issue of unstructured rules described above, taking the other study pitfalls and learned lessons into account. Now that the proof-of-concept evaluation has shown the validity of the core VESSEL ideas, we intend to create a formally structured VESSEL design specification that comprehensively describes how to create situated interactive exercises at the right level of difficulty, and how to structure learning support such that an ECA coach can accurately provide it without requiring a human operator.