
1 Introduction

Decision-making is a high-level cognitive process that builds on more basic processes such as perception, attention, and memory [9]. Conventional decision-making research in psychology and cognitive science generally begins by developing a theory or hypothesis about what should happen in pre-defined behavioural paradigms given to subjects, where the paradigms are designed to mimic cognitive operations in the real world. Economic and reinforcement learning theories have been widely applied to formalize the computations taking place in the brain in these simulated scenarios [8, 18]. In particular, reinforcement learning has been a valuable framework for understanding the underlying deficits in the cognitive processes of patients with mental illnesses [1, 3, 10, 11, 15]. The computational parameters extracted from such theory-driven models have been used as a promising phenotyping tool for humans. More importantly, the phenotypes represented by these computational parameters can be linked to the activities of the underlying neural substrates [13].

Human decision-making research has long focused on reinforcement learning and decision-making from visual input or symbolic information, although various modalities have been examined in animal studies [2, 16]. Nevertheless, humans interact with their environment using a wide array of senses. In particular, much of the time, voice-based natural language communication is the dominant modality for learning and making decisions in the real world, especially in social contexts. Investigating the influence of information presentation formats, e.g. visual versus voice, on reinforcement learning and decision-making processes is beneficial to various fields, such as human-machine interaction, cognitive science, psychiatry, economics, and marketing. The format of the information presented, which serves as input to human cognitive processes, may guide, constrain, and even determine cognitive behaviour [20]. In particular, it is necessary to investigate how people learn and make decisions when information is provided via a natural-language voice interface rather than visual symbols, given that speech has become a more prominent way of interacting with automatic systems in recent years and is a more natural interface for many users. Voice-enabled intelligent personal assistants (IPAs) like Amazon Alexa, Apple Siri, Google Assistant, and Microsoft Cortana, which use input such as the user's voice and context information to provide assistance, are widely available on smartphones [19]. With the growth of intelligent personal assistants, the level of spoken interaction with technology is unprecedented. Home-based devices such as Amazon Echo, Apple HomePod, and Google Home increasingly use speech as the primary form of interaction. Across industries, voice-enabled IPAs are assisting with customer service, technical support, scheduling tasks, and many other personalized services [5]. Voice interfaces have become an indispensable part of our daily life. Importantly, the ubiquity of voice-based products makes it possible to capture human decision-making data under various contexts with high ecological validity.

However, voice user interfaces have hardly been applied in cognitive and decision-making research. Given the same decision-making paradigm, will a subject behave differently when interacting with conventional visual instructions and stimuli compared to an auditory version? Although previous studies have shown that visual and auditory stimuli are processed differently at the input stage [12], it is unclear whether these differences have downstream effects on cognitive processing. Most findings to date stem from research on modality effects. Penney [14] reviewed research on the effects of visual and auditory presentation on short-term retention of verbal stimuli, developing the separate-streams hypothesis of modality effects. According to this theory, there are separate processing streams in short-term memory for auditorily and visually presented information. Accordingly, encoding information as both visual and auditory representations improves the chances of successful retrieval, as both subsystems can be used to recover it. The classic application of these modality effects is the educational practice of presenting to-be-learned information both graphically and as textual information delivered through an auditory mode [7].

No study has yet tested whether people perform differently on reinforcement learning and decision-making tasks when interacting with voice-enabled IPAs using natural language compared to conventional visual interfaces. If they do, how does modality influence performance? If, on the other hand, the characterization is equivalent, is it feasible to embed such auditory paradigms into widely accessible IPAs to sample computational phenotypes that reflect the status of the general population and, for clinical applications, of those with psychiatric conditions? The present study compares people's performance on conversational voice-based and visual-based two-armed bandit tasks in order to test whether there are significant differences as a function of stimulus modality, and to provide empirical evidence for the feasibility of adopting voice-enabled IPAs for ecologically valid sampling of human decision-making phenotypes in future cognitive studies. Both the superficial behavioural measures and the cognitive processes represented by reinforcement learning models were compared across the two versions of the task.

2 Method

2.1 Participants

This study was approved by the local ethics committee of the School of Computing, Dublin City University. Participants were recruited through advertisements and poster boards on the university campus. A total of 30 participants took part in the experiment, with ages ranging from 20 to 40 years.

2.2 Procedure

People who were interested in participating in the experiment were given a URL link to the study. After reading the plain language statement and completing the informed consent form on the first two pages, participants were asked to provide basic demographic information. They were then directed to the gamified two-armed bandit task. Two interfaces for the task, i.e. a visual interface and a conversational voice interface, were developed (details of the task are given in the next section). Each participant played both versions, one after the other, with the order of the two interfaces randomly assigned across subjects. A "wash-out" task was included between the two versions to reduce any after-effects; this was done by allowing the subjects to play a minesweeper game for 5 min. The workflow of the complete experiment is illustrated in Fig. 1.

Fig. 1. Workflow diagram of the experimental procedure.

2.3 The Two-Armed Bandit Task

In order to make the task less monotonous, we placed the participants in a story-based scenario in which they had to undertake a journey through a forest. On this journey, they pass through a series of crossroads, i.e. junctions, and at each junction they interact with two leprechauns distinguished by colour. The participants are given 1,000 gold coins at the beginning of their journey and have to navigate through the trees and bushes to reach home. At each junction there are two leprechauns, one blue and one red, who may steal gold coins from the participant. Unknown to the participants, they pass through 120 such junctions. The probability that a leprechaun will steal gold coins fluctuates slowly and independently for each leprechaun according to a Gaussian distribution. At any point in time, one leprechaun is on average less prone to stealing, although which one this is can change slowly over time; this leprechaun represents the more beneficial choice at that junction. After choosing a leprechaun, the participants receive feedback indicating whether or not they lost gold coins. The participants are instructed that the chance of the stealing leprechaun being blue or red depends only on the recent outcome history. The aim for the participants is to learn to choose the better leprechaun, the one that steals less from them, as often as possible, so as to preserve as many gold coins as possible by the time they get home.
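For illustration, the slowly fluctuating steal probabilities can be generated as a bounded Gaussian random walk for each leprechaun. In the sketch below, the number of trials follows the task description, while the step size and bounds are illustrative assumptions rather than the values used in the study.

```python
import numpy as np

def simulate_steal_probabilities(n_trials=120, sigma=0.03, lo=0.1, hi=0.9, seed=0):
    """Bounded Gaussian random walk for each leprechaun's steal probability.

    n_trials follows the task description (120 junctions); the step size
    sigma and the bounds [lo, hi] are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    p = rng.uniform(lo, hi, size=2)          # initial steal probability per leprechaun
    trajectory = np.empty((n_trials, 2))
    for t in range(n_trials):
        trajectory[t] = p
        p = np.clip(p + rng.normal(0.0, sigma, size=2), lo, hi)  # slow, independent drift
    return trajectory                        # trajectory[t, k] = P(leprechaun k steals on trial t)
```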

Fig. 2. The visual-based two-armed bandit task. The participant is initially given 100 gold coins. In the first example trial (the first row), the blue leprechaun was chosen and it stole one gold coin from the participant and ran away. In the second example trial (the second row), the red leprechaun was chosen and the 'Good choice!' feedback was given. (Color figure online)

Two versions of the task were developed: the traditional visual-based version and the conversational voice-based version. A screenshot of the visual-based version of the task is shown in Fig. 2. The implementation was developed using HTML, CSS, and JavaScript; the server runs on a Heroku instance with the data stored on a managed MongoDB service, and a series of open-source libraries were used for the visual and voice features. There are no visual aesthetics in the conversational voice-based interface; the interaction is maintained entirely through speech over the whole system-participant session. Initially, the system narrated the scenario to the participant. The script of the instructions for the voice-based interface was exactly the same as that for the visual-based interface. Participants had to click or touch a leprechaun to make their selection in the visual version of the task, whereas they spoke their choice aloud, i.e. 'blue' or 'red', in response to a query from the voice interface. A lexicographic, keyword-based matching strategy was implemented so that the voice interface could also recognise utterances such as 'the blue leprechaun', 'blue leprechaun', or 'blue one', as long as the key attribute 'blue' or 'red' was included. Based on the subject's response, whether coins were lost or saved was calculated, and the result was reported back to the subject on each iteration. In the visual version of the task, positive feedback was shown as a text dialogue saying "Good choice!", while negative feedback was presented as the leprechaun taking the gold coins and running away. In the voice version of the task, the participant was told 'Yay! Good selection!' when the selected leprechaun did not steal coins and 'Oops! Bad selection.' when the selected leprechaun stole gold coins on that round.
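The interfaces themselves were implemented in JavaScript; purely as an illustration of the keyword-based matching described above, a Python sketch of the choice-parsing logic might look as follows. The function name and the re-prompting behaviour on ambiguous input are our assumptions.

```python
from typing import Optional

def parse_choice(utterance: str) -> Optional[str]:
    """Map a spoken response onto a choice using only the key colour attribute.

    Accepts variants such as 'the blue leprechaun', 'blue one', or simply
    'red'; returns None when no (or an ambiguous) colour keyword is found,
    so that the system can re-prompt the participant.
    """
    words = set(utterance.lower().split())
    has_blue, has_red = "blue" in words, "red" in words
    if has_blue and not has_red:
        return "blue"
    if has_red and not has_blue:
        return "red"
    return None
```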

2.4 Wash-Out Task

A spatial memory test was developed to distract the participants during the wash-out period between the two versions of the two-armed bandit task. A screenshot of the spatial memory game is shown in Fig. 3.

Fig. 3. A screenshot of the wash-out task.

2.5 Comparison of the Model-Independent Behavioural Measures

We first compared participants' performance on the task in terms of superficial behavioural statistics that capture fundamental aspects of learning: the probability of shifting to the other option, \(p_{shift}\), and the probability of choosing the correct action, \(p_{correct}\). We calculated the probability of shifting after receiving a loss versus no loss, and the overall shifting rate, as a function of the task version.
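As an illustration, the three shift measures can be computed per participant from the recorded choice and outcome sequences. The sketch below assumes binary-coded choices and losses and is not the analysis code used in the study.

```python
import numpy as np

def shift_probabilities(choices, losses):
    """Model-independent shift measures for one participant.

    choices: 0/1 arm selections across trials.
    losses:  0/1 flags, 1 if gold coins were lost on that trial.
    Returns P(shift | loss), P(shift | no loss), and the overall shift rate.
    """
    choices = np.asarray(choices)
    losses = np.asarray(losses)
    shift = choices[1:] != choices[:-1]      # did the choice change on the next trial?
    prev_loss = losses[:-1].astype(bool)     # outcome on the preceding trial
    return shift[prev_loss].mean(), shift[~prev_loss].mean(), shift.mean()
```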

2.6 Computational Modelling Analysis

The Reinforcement Learning Model. The participant's choice and the outcome (whether or not gold coins were lost) in each trial were recorded and the data were fitted to a simple reinforcement learning model. The model assumes that participants first learn the expected value of each leprechaun based on the history of previous outcomes and then use these values to decide what to do next. The classic model of such learning is the Rescorla-Wagner learning rule [17], whereby the value of option k is updated in response to the loss \(p_t\) in trial t according to:

$$\begin{aligned} Q^k_{t+1}=Q^k_t + \alpha (p_t-Q^k_t) \end{aligned}$$
(1)

where \(\alpha \) is the learning rate, which ranges from 0 to 1 and captures the extent to which the aversive prediction error, \(p_t - Q^k_t\), updates the value. \(p_t\) was encoded as -1 if a loss occurred and 0 otherwise. The initial value of each option, \(Q^k_0\), is assumed to be zero.
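For illustration, the update in Eq. 1 can be written as a short function; this is a sketch of the learning rule only, not the fitting code used in the study.

```python
def rescorla_wagner_update(q, choice, outcome, alpha):
    """One Rescorla-Wagner update (Eq. 1) applied to the chosen option.

    q:       current Q-values, one per leprechaun.
    choice:  index of the selected leprechaun.
    outcome: p_t, encoded as -1 for a loss and 0 for no loss.
    alpha:   learning rate in [0, 1].
    """
    q = list(q)
    q[choice] += alpha * (outcome - q[choice])   # aversive prediction error update
    return q
```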

The simplest model of decision-making is to assume that participants always choose the most valuable option. However, this assumption is not consistent with what is observed when people select between options: a basic finding is that people do not always choose the better of two options. If they did, we would expect their choices to follow a step function, always selecting whichever option currently has the higher value. Instead, people's choices follow a sigmoid-like pattern: more step-like when there is a large difference between the option values, but, as that difference narrows, people choose the higher-valued option with less consistency (i.e., they are increasingly likely to choose the "objectively" lower-valued option). One choice rule with these properties is the 'softmax' choice rule, which chooses option k with probability:

$$\begin{aligned} p^k_t = \frac{exp(\beta Q^k_t)}{\sum ^K_{i=1}exp(\beta Q^i_t)} \end{aligned}$$
(2)

where \(\beta \) is the inverse temperature parameter that controls the level of stochasticity in the choice, ranging from \(\beta = 0\) to \(\beta = \infty \). \(\beta = 0\) corresponds to a participant choosing completely at random, whereas \(\beta = \infty \) means they deterministically chose the option with the highest value.
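Combining Eqs. 1 and 2, a minimal trial-by-trial simulation of the model might look as follows. This is a sketch only: the parameter values and the `simulate_agent` helper are illustrative, and `steal_probs` can be generated with the random-walk sketch in Sect. 2.3.

```python
import numpy as np

def softmax_choice_probs(q, beta):
    """Softmax choice probabilities (Eq. 2) over the current Q-values."""
    z = beta * np.asarray(q, dtype=float)
    z -= z.max()                                 # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def simulate_agent(steal_probs, alpha=0.3, beta=5.0, seed=0):
    """Simulate trial-by-trial choices on the aversive two-armed bandit."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)                              # Q_0 = 0 for both leprechauns
    choices, losses = [], []
    for p_steal in steal_probs:                  # p_steal[k] = P(leprechaun k steals this trial)
        k = rng.choice(2, p=softmax_choice_probs(q, beta))
        outcome = -1.0 if rng.random() < p_steal[k] else 0.0
        q[k] += alpha * (outcome - q[k])         # Rescorla-Wagner update (Eq. 1)
        choices.append(k)
        losses.append(int(outcome < 0))
    return np.array(choices), np.array(losses)
```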

Hierarchical Bayesian Estimation of Parameters. A hierarchical Bayesian procedure was used to estimate distributions over model parameters at both the individual and population level, separately for the visual-based and voice-based datasets. Specifically, each parameter was assigned an independent population-level distribution that was shared across participants within each dataset, and the standard deviation of the population-level distribution was estimated separately for each parameter. Posterior distributions were estimated using Hamiltonian Monte Carlo with the No-U-Turn Sampler (HMC with NUTS) as implemented in Stan [4] via its RStan interface. The Gelman-Rubin index \(\hat{R}\) (Rhat) was used to assess the convergence of the MCMC samples [6]; \(\hat{R}\) values close to 1.00 indicate that the MCMC chains have converged to stationary target distributions. No population-level parameter had an \(\hat{R}\) value greater than 1.1 (most were below 1.01). Four chains were run, each with 1000 warm-up iterations and 4000 sampling iterations.

In order to examine the effect of task modality on the underlying cognitive process, we compared the posterior distributions of the population-level parameters across the two versions of the task using the 95% Highest Density Interval (HDI). Specifically, we calculated the difference in each population-level parameter between the two datasets and report the 95% HDI of that difference. If this HDI did not overlap zero, we considered there to be a meaningful difference in performance between the two versions of the reinforcement learning task. The individual posterior means of each parameter were also extracted and compared in order to examine the effect of the task version on each individual.
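As an illustration of this comparison, the sketch below computes the 95% HDI of the element-wise difference between two sets of posterior draws. It assumes the group-level draws have already been extracted from the fitted Stan models into arrays; the function names are ours.

```python
import numpy as np

def hdi(samples, cred_mass=0.95):
    """Highest Density Interval of a one-dimensional array of posterior samples."""
    s = np.sort(np.asarray(samples))
    width = int(np.floor(cred_mass * len(s)))
    widths = s[width:] - s[:len(s) - width]      # widths of all candidate intervals
    lo = int(np.argmin(widths))                  # narrowest interval covering cred_mass
    return s[lo], s[lo + width]

def compare_group_parameter(draws_visual, draws_voice, cred_mass=0.95):
    """95% HDI of the difference between two sets of group-level posterior draws.

    If the returned interval excludes zero, the task version is taken to
    have a meaningful effect on that parameter.
    """
    diff = np.asarray(draws_visual) - np.asarray(draws_voice)
    return hdi(diff, cred_mass)
```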

Fig. 4. The probability of shifting to the other option after receiving a loss (left) and no loss (middle), and the overall probability of shifting regardless of the outcome, as a function of the task version (right). The probability of loss-shift was not significantly different between the two versions of the task, whereas the no-loss-shift and the overall shift rate were significantly elevated in the voice-based version. Each dot represents a participant and error bars represent 1 standard error of the mean.

3 Results

3.1 Comparison of the Model-Independent Behavioural Measures

Three model-independent behavioural measures were compared between participants' performance on the visual-based and voice-based interfaces. The probability of loss-shift, no-loss-shift, and shift regardless of the outcome across trials was calculated for each participant for the visual-based and voice-based versions of the task, respectively, and is shown in Fig. 4. Each dot represents a participant and error bars represent 1 standard error of the mean. Visually, the group means of both loss-shift and no-loss-shift were higher in the voice-based task than in the visual-based task. A paired t-test shows that the no-loss-shift probability in the voice-based task was significantly increased compared to the visual-based version (\(t=-2.62, p=0.01\)), whereas there was no significant difference in the loss-shift probability between the two versions of the task (\(t=-1.11, p=0.28\)). Additionally, the overall tendency to shift was marginally significantly increased in the voice-based version of the task (\(t=-2.09, p=0.05\)).

Fig. 5. The posterior means along with the 95% HDI for the difference of the group means of the learning rate \(\text{diff}_A\) and the inverse temperature \(\text{diff}_{tau}\) between the two versions of the task. The 95% posterior interval included zero for the effect of the task version on the learning rate parameter, whereas it excluded zero for the inverse temperature parameter.

Fig. 6. The mean of the posterior distribution of the individual-level learning rate (left panel) and the inverse temperature parameter (right panel) for each participant on the visual-based versus voice-based version of the task. Each dot represents one participant.

3.2 Comparison of the Cognitive Parameters

Both the group- and individual-level free parameters of the reinforcement learning model, i.e. the learning rate and the inverse temperature, were estimated for the two versions of the task. In order to evaluate the influence of the task version on overall performance, the difference in the group-level posterior distributions of the learning rate parameter, \(\text{diff}_A\), and the inverse temperature parameter, \(\text{diff}_{tau}\), between the visual-based task and the voice-based task was calculated and is illustrated in Fig. 5. Participants adopted similar learning rates in the two versions of the task at the group level, as the 95% HDI of \(\text{diff}_A\) included zero. However, the group-level inverse temperature in the voice-based version of the task was significantly decreased compared to the visual-based task, indicating that participants were more deterministic in choosing the option with the highest expected value in the visual-based task. Figure 6 shows the mean of the posterior distribution of each individual-level parameter for each participant. The means of the posterior distributions of the individual-level parameters for the visual-based task were more dispersed than those for the voice-based task.

4 Discussion

The current study compares for the first time performance on an aversive two-armed bandit task delivered through voice and visual interfaces with an otherwise identical experimental protocol. Furthermore, this is, as far as the authors are aware, the first time such a learning task has been conducted over a voice interface. Although the participants demonstrated equivalent loss-shift rates, the overall shifting rate and the probability of shifting on trials where no loss occurred were significantly elevated in the voice-based version of the task. The comparison of the underlying cognitive parameters revealed that participants adopted similar learning strategies for the two versions of the task, though more decision noise was present in the voice-based version. The increased decision noise may reflect how the format of the input information (visual versus auditory) affects the overall weight given to the two options at that moment in the decision process. Another possible explanation is that responding to colour questions when no colours have been seen may be confusing and difficult for participants; a parallel task with questions suited to the auditory modality (e.g. left versus right with stereo auditory input) would be useful in a future study. Additionally, the change in response modality (click/touch versus speech) may also contribute to the alteration of the decision-making process: more noise may be introduced when the outcome probability of each option and the demands of the control action have to be evaluated simultaneously. Although efforts were made to improve the efficiency of the system-subject interaction, use of the voice interface may have been a more deliberate action for the participants in this experiment, especially given that they sometimes needed to repeat their answers several times before the system identified what was said. We suspect that the elevated shifting rate in the voice-based version of the task may be the behavioural-level expression of the decision noise reflected in the inverse temperature parameter, given that the learning rate did not differ significantly between the two versions of the task. Overall, we anticipate future work in this area, as natural speech interfaces present opportunities for human phenotyping based on learning behaviour in uncertain environments. In particular, given the ability to perform these experiments outside the laboratory, it is plausible that the human behaviour captured may be more representative of real-world behaviour and more valuable in terms of ecological validity.

5 Conclusion

The rapid advancement of voice-enabled IPAs provides opportunities to investigate how people learn and make decisions when using natural language to communicate. It is necessary to examine whether people perform equivalently on decision-making and learning tasks when interacting with voice interfaces versus the conventional, and to some degree validated, approaches using text and graphic stimuli. Our findings suggest that stimulus modality has no influence on the learning strategy in the reinforcement learning task, although more decision noise was introduced by the voice-based interface. These findings have implications for the presentation of reinforcement learning tasks in experimental settings. It is important, for example, to further enhance the efficiency and ease of interaction with the voice interface if we wish to use voice-based IPAs as sensors to measure human decision-making in people's daily environments in the future. What is clear, however, is that characterisation of human behaviour in a way that may be useful for the derivation of computational biomarkers, for example in clinical applications, is possible over contemporary pervasive computing technologies.