1 Introduction

The concept of ‘affective computing’ suggests that computers may be capable of detecting emotion, responding appropriately and expressing emotion [1]. In this sense, computer systems incorporate human-like emotional intelligence and empathy [2]. In the field of affective computing, emotion is defined as a system’s subjective interpretation of meaningful events [3]. Within robotics, sentiment analysis and emotion understanding are essential to developing longer-term relationships and rapport with human users, especially to maintain interest when the novelty of engagement wears off [4].

Sentiment expression is widely understood to be multimodal, requiring both non-verbal and verbal cues that are mutually understood by, and aligned between, the interacting parties. Non-verbal parameters include facial expressions and head movements (e.g., tilting). Verbal expression of emotion is more complicated and includes semantics (the content of speech) and prosodic cues that shape meaning (e.g., intonation, pitch, volume/energy, pauses and speed) [5].

Through changes in verbal and non-verbal parameters, various emotions can be expressed (and ultimately detected). Previous work on cross-cultural human-human interaction has reported that many of these speech features can be used universally to correctly identify emotion [6]. Specifically, humans identify joy/happiness by a rapid speaking rate, higher pitch and larger pitch range, while sadness can be detected through a slower speaking rate, lower average pitch and narrower range [7]. Anger and fear both have a faster speaking rate, but anger has a rising pitch contour, while fear may have more varied loudness [7].
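For illustration, these tendencies could be encoded as simple rules over measured prosodic features. The sketch below is a minimal, hypothetical example; the feature names and thresholds are our own assumptions, since the cited work reports relative tendencies (faster/slower, higher/lower) rather than absolute values.

```python
# A minimal, hypothetical sketch of how the tendencies reported in [7] could be
# encoded as rules of thumb. Feature names and thresholds are illustrative
# assumptions; the literature reports relative tendencies, not absolute values.

def guess_emotion(rate_sps: float, mean_pitch_hz: float, pitch_range_hz: float) -> str:
    """Rule-of-thumb emotion guess from speaking rate and pitch statistics.

    rate_sps       -- speaking rate in syllables per second (assumed measure)
    mean_pitch_hz  -- average fundamental frequency (F0)
    pitch_range_hz -- difference between maximum and minimum F0
    """
    fast, slow = rate_sps > 5.5, rate_sps < 3.5          # arbitrary cut-offs
    high, low = mean_pitch_hz > 220, mean_pitch_hz < 160
    wide, narrow = pitch_range_hz > 120, pitch_range_hz < 60

    if fast and high and wide:
        return "joy/happiness"   # rapid rate, higher pitch, larger range
    if slow and low and narrow:
        return "sadness"         # slower rate, lower pitch, narrower range
    if fast and high:
        return "anger or fear"   # both are fast; contour and loudness would disambiguate
    return "undetermined"
```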

The ‘Big Six’ or ‘Big Eight’ categories are often used to understand and distinguish between main emotions. The Big Six include: anger, disgust, fear, joy/happiness, sadness, and surprise [8,9,10]. Plutchik identified eight emotions, adding anticipation and trust to the Big Six [11]. These emotions were also understood to have opposites (e.g., happiness/joy and sadness) and to vary in intensity. Such observations have informed Russell’s Circumplex Model of Affect, in which emotions are positioned along the dimensions of arousal and valence [12]. Computational modelling of sentiment has aimed to detect these continuous values [13, 14]. Researchers have also used a form of multiclass classification, in which the classes comprise the categories identified above, or simply: positive, negative and neutral [9, 10, 15].
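To make the relation between the categorical and dimensional schemes concrete, the sketch below places the categorical labels on Russell’s valence and arousal dimensions and collapses valence into the common three-class scheme. The coordinates and thresholds are rough, illustrative assumptions rather than values from the cited work.

```python
# Indicative placement of categorical emotions in Russell's circumplex as
# (valence, arousal) pairs in [-1, 1]. Coordinates and thresholds are rough
# illustrative assumptions, not values taken from the cited literature.

CIRCUMPLEX = {
    "joy/happiness": (0.8, 0.5),
    "surprise":      (0.3, 0.8),
    "anger":         (-0.6, 0.7),
    "fear":          (-0.7, 0.6),
    "disgust":       (-0.6, 0.2),
    "sadness":       (-0.7, -0.5),
    "trust":         (0.6, -0.1),   # Plutchik's additions to the Big Six
    "anticipation":  (0.2, 0.4),
}

def to_polarity(valence: float) -> str:
    """Collapse a continuous valence value into the common three-class scheme."""
    if valence > 0.2:
        return "positive"
    if valence < -0.2:
        return "negative"
    return "neutral"

print(to_polarity(CIRCUMPLEX["sadness"][0]))   # -> "negative"
```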

However, sentiment expression and detection are extremely complex. This is because prosodic factors such as intonation are not single, independent systems, but emerge from the combination of various features, including tone, loudness, tempo, rhythm, and pitch range and contour [16]. Combining such verbal expressions with non-verbal expressions in a system can be helpful, but it also adds complexity. This is because emotional incongruence may occur, whereby the different modalities indicate differing emotions [17]. For example, an individual may smile while presenting bad news, or demonstrate sarcasm with a serious facial expression. While humans can typically navigate this incongruence, ongoing research seeks to understand how they do so [18, 19]. Nuances such as these are important to understand when attempting to design computer and robotic systems that can detect and express sentiment.

Previous research on human-robot interaction has focussed on non-verbal expressions of sentiment, including facial and body expressions [20, 21]. Less research has been conducted purely on prosodic factors, although some work has augmented ‘robotic’ voices with forms of prosody in an attempt to convey sentiment [5]. A gap in knowledge remains on which prosodic factors are required for successful sentiment expression and detection in human-robot interaction.

This review is part of an over-arching project that seeks to develop a sentiment analyser that can be implemented on a robot. Our previous work has included the development of a coverage-based sentiment and sub-sentence extraction system that estimates a span of input text and recursively feeds this information back to the networks for sentiment identification [4]. Twenty-four ablation studies were conducted and showed promising results. Our next step, and the aim of this review, is to understand how emotional speech is expressed or detected within existing robotic systems, with a focus on prosody.

2 Methods and Methodology

We conducted a scoping review, as this method seeks to explore and synthesise the available literature, as well as to map relevant ideas and concepts related to the research topic [22, 23]. Like other review types, scoping reviews are conducted systematically and transparently; however, like narrative and descriptive reviews, they cover the breadth (rather than the depth) of research by summarizing previous knowledge [24, 25].

The methods and procedures of this scoping review aligned with the established five-step process proposed by Arksey and O’Malley [24]: (1) identifying the research question, (2) identifying relevant studies, (3) selecting studies, (4) charting the data, and (5) collating, summarizing, and reporting the results. We also report this review in alignment with the PRISMA-ScR guideline [26].

2.1 Identifying the Research Question

The purpose of this project is to develop a sentiment analyser that can be implemented on a robot. We therefore wanted to understand how emotional speech is expressed or identified within existing robot systems, for example through tone, speed or pitch. This led us to the research question: What prosodic elements are related to emotional speech in human-computer/robot interaction?

2.2 Identifying Relevant Studies

Four engineering and social science databases were searched on 5 May 2021: SCOPUS, IEEE Xplore, ACM Digital Library and PsycINFO.

Keywords (usually synonyms) were separated by the Boolean operators ‘AND’ and ‘OR’. The search strategy included the following: (emotion OR sentiment) AND (speech OR verbal OR tone OR pitch) AND (expression OR identification) AND (experiment OR evaluation) AND robot.

To focus the review on recent, state-of-the-art methods, we limited our search to the last 10 years, covering 2011 to 2021. Other limits were publication in English, a focus on humans (not animals) and full-text availability. All items had to include an experiment/evaluation or research component, as well as focus on prosody rather than semantics. Items on multimodal emotion (e.g., facial expression and speech) were only included if the content on speech could be separated and was discussed in sufficient detail. Table 1 exemplifies the search syntax used in two of the databases.

Table 1 Examples of the search syntax used in two of the databases

2.3 Selecting the Studies

We created an Excel sheet to document the searches. This document included the dates of each search, the databases searched and the literature identified through the searches. A two-step process was used to screen the literature for eligibility. First, we read the titles and abstracts of the items and removed those that did not meet the eligibility criteria; duplicates between the searches were also removed in this step. Second, we downloaded and read the full-text items and identified reasons for exclusion. Items that passed the full-text screening stage were included in the review. The screening process is presented in a PRISMA diagram [27].

2.4 Charting the Data and Collating, Summarizing, and Reporting the Results

A second Excel sheet formed our coding framework, into which we extracted relevant data from each study. This included the following: title of the publication, first author’s surname and publication date (year), publication type (journal or conference paper), study setting and country, robot/system, purpose of the robot/system, speech detection/expression, prosodic factors and description of them, study participants, evaluation/study method and outcome.
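As a structured view of this coding framework, the extraction fields could be represented roughly as follows. This is a hypothetical sketch for illustration, not the actual spreadsheet schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CodedStudy:
    """One row of the coding framework used to chart data from each study."""
    title: str
    first_author_surname: str
    publication_year: int
    publication_type: str               # "journal" or "conference paper"
    setting_and_country: Optional[str]  # None where not stated
    robot_or_system: str
    purpose: str
    detection_or_expression: str        # detection, expression, or both
    prosodic_factors: List[str] = field(default_factory=list)
    participants: Optional[str] = None
    evaluation_method: str = ""
    outcome: str = ""
```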

Data from the coding sheet were summarised and presented in a manner which best answered the research question.

3 Results

The database search yielded 1,889 results. Twenty-seven duplicates were removed, and another 1,806 were excluded during the first screening process. During the second screening, 43 studies were excluded with reasons. These included not being on speech (n = 12), not focusing on prosody (n = 11), not reporting on sentiment (n = 7) or experimental results (n = 7), not including anything on human-robot interaction (n = 3) and being published in languages other than English (n = 3). Consequently, a total of 13 publications were included in the review. The PRISMA diagram [27] in Fig. 1 demonstrates the search and screening process.

Fig. 1 PRISMA diagram showing the literature search and screening process

3.1 Characteristics of the Included Literature

The literature was published from 2012 [28] to 2020 [29, 30]. The majority of the studies were presented at conferences and subsequently published as full-text conference papers [5, 30,31,32,33,34,35,36,37]. Only two were published as journal papers [28, 29]. One study appeared to be published both as a journal article and as a conference paper [38, 39], and the two are therefore reported together.

The studies were conducted in France [32, 35], Japan [29], and Spain and Poland [5]. Nine did not state the specific country [28, 30, 31, 33, 34, 36,37,38,39]. The settings of the research were most commonly places of education (schools and universities) [29, 32, 33, 35, 38, 39] or online [30, 31]. One was conducted in an Alzheimer’s Center and hospital [5], and four did not state the specific setting [28, 34, 36, 37].

Overall, most of the studies were experimental in nature. Two appeared to be observational: one explored differences in prosodic elements between synthesised and natural human speech [37], and one observed responses to different pitches [29]. A further two studies collected mixed methods data, such as qualitative perceptions (via a survey or interviews) in addition to quantitative measures [31, 32].

Across the studies, the sample sizes varied from three [37] to 300 participants [36], including children, adults and students. Of the studies (n = 10) that specified a sample size, a total of 774 participants were included (mean: 77.4). The studies are further summarised in Table 2.

Table 2 Summary of the included studies

3.2 The Systems

Most of the studies used robots, including NAO [30], ERICA [38, 39], Pepper [29, 31], the Survivor Buddy robot [33], RAMCIP [5], ALICE [35] and Hobbit [31]. Three studies used speech systems rather than physical robots [28, 36, 37], such as the VOICEROID 2 Yuduki Yukari speech synthesiser [37]. Only one used Poppy, a virtual robot (embodied conversational agent) [32]. One did not provide specific details on the system used [34]. Figure 2 shows some of the robots used.

Fig. 2 Images showing the NAO [40], Hobbit [41] and Pepper [42, 43] robots

The purpose of almost all of the studies was to advance emotional speech detection and/or expression through prosody, with the ultimate purpose of implementing the system in a social robot/agent [5, 29,30,31,32, 34,35,36,37,38,39]. Thus, five studies focussed on emotion detection, four focussed on emotion expression and three focussed on both (see Table 2).

3.3 Prosodic Elements Used in the Literature

Across the literature, various prosodic elements were used to convey or detect emotion. In six studies, the emotions included some or all of the Big Eight [5, 31, 33, 35,36,37]. However, Antona et al. [5] also supplemented these with emotions such as ‘tired/confused’ and ‘focussed.’ Three studies categorised emotions as positive, negative or neutral (or variations thereof) [5, 30, 36], while two used affective dimensions to determine emotion [28, 38, 39]; these included activation (level of arousal), valence, power, expectation and intensity. Regardless of the approach, the emotions were sometimes determined in comparison to a baseline emotion, usually referred to as the ‘neutral’ or ‘calm’ emotion [5, 28, 33, 35, 36].

The prosodic elements from most to least common were tone (also referred to as pitch/frequency) (n = 8), loudness (also referred to as energy/volume) (n = 6), speech speed (n = 4), pauses (n = 3) and non-linguistic vocalisations (n = 1).

Changes in intonation (tone) and loudness of speech helped to express emotion. For example, a higher pitch and volume were used to express/detect a positive emotion (e.g., excitement or happiness) [5, 33, 35, 36]. Conversely, a slightly lower volume and much lower pitch expressed negative emotions (e.g., sadness) [5, 33, 35, 36]. Anger was associated with increased volume [33, 37] and a high pitch peak [36].

Speech speed and pauses were sometimes interrelated, which is logical because adding pauses also affects the overall speed of the speech. Pauses were created by inserting additional spaces, commas and full stops into the text [5]. Speed also referred to the rate of speech, i.e., the number of words spoken within a specific timeframe [5]. A faster rate of speech was associated with anger [33, 35, 36], excitement [5] and fear [33, 36], while a slower rate was associated with sadness [5, 33, 35, 36].
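As a concrete illustration of how such adjustments can be realised in a speech pipeline, the sketch below generates standard SSML, whose prosody element supports pitch, rate and volume attributes and whose break element inserts pauses. The emotion-to-parameter values are illustrative assumptions guided by the directions of change reported above, not settings taken from any of the included studies.

```python
# Illustrative generation of SSML with emotion-dependent prosody settings.
# The parameter values are assumptions reflecting the directions of change
# reported above (higher/lower, faster/slower), not settings from any study.

EMOTION_PROSODY = {
    "happy":   {"pitch": "+15%", "rate": "fast",   "volume": "loud"},
    "sad":     {"pitch": "-15%", "rate": "slow",   "volume": "soft"},
    "angry":   {"pitch": "+10%", "rate": "fast",   "volume": "x-loud"},
    "neutral": {"pitch": "+0%",  "rate": "medium", "volume": "medium"},
}

def to_ssml(text: str, emotion: str, pause_ms: int = 0) -> str:
    """Wrap text in an SSML <prosody> element; optionally insert a leading pause."""
    p = EMOTION_PROSODY.get(emotion, EMOTION_PROSODY["neutral"])
    pause = f'<break time="{pause_ms}ms"/>' if pause_ms else ""
    return (
        "<speak>"
        f'{pause}<prosody pitch="{p["pitch"]}" rate="{p["rate"]}" volume="{p["volume"]}">'
        f"{text}</prosody></speak>"
    )

print(to_ssml("I have some good news.", "happy"))
print(to_ssml("I am sorry to hear that.", "sad", pause_ms=400))
```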

One study did not specify individual prosodic elements and instead focussed on stress/emphasis of words and rhythm, compared with speaking in monotone [32]. Tsiourti et al. [31] took a different approach, synthesising commonly understood non-linguistic vocalisations with a commercial Text-To-Speech (TTS) service. These included laughter to convey happiness, a negative sounding “oh” to represent sadness and a fast, sudden intake of breath to convey surprise.

3.4 Effect of the Prosodic Elements on Sentiment Detection or Expression

Most of the literature reported successful results, showing that prosodic elements are useful in helping to express or detect emotion. Some positive findings were evident in the literature on human-robot/agent interaction [29, 31, 32, 35]. For example, children smiled more and were more responsive to questions when prosody was used in the Poppy avatar, compared to when only facial expressions were used [32]. Crumpton et al. [33] reported promising detection results: after some of the prosodic elements were adjusted, participants identified anger (65.9%), calm (68.9%), fear (33.3%), sadness (49.2%) and happiness (30.3%) above chance level (20%). Participants were also able to accurately identify the happy and surprised emotions from robots using non-linguistic vocalisations [31].

Mixed findings were reported in some of the literature on speech systems. The system used by Eyben et al. [28] was effective at detecting sentiment across five dimensions (activation, expectation, intensity, power and valence), outperforming standard neural networks. However, in another study, the system was only effective when sentiment analysis was included in addition to prosody, increasing the correlation coefficient by 0.15 (from 0.41 to 0.56) [38, 39]. This was attributed to valence conflicting with sentiment (i.e., emotional incongruence).
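The combination reported in [38, 39] can be viewed as a form of score-level fusion of acoustic and text-based estimates. The sketch below is our own simplified illustration of that general idea, not the authors’ actual model; the weighting is an arbitrary assumption.

```python
# A simplified, hypothetical illustration of score-level fusion of a
# prosody-based valence estimate with a text-based sentiment score, in the
# spirit of [38, 39]; the weighting is an arbitrary assumption.

def fuse_valence(prosodic_valence: float, text_sentiment: float, w_text: float = 0.4) -> float:
    """Weighted combination of two valence estimates, each in [-1, 1]."""
    assert 0.0 <= w_text <= 1.0
    return (1.0 - w_text) * prosodic_valence + w_text * text_sentiment

# Example: sarcastic praise -- positive words delivered with flat, negative prosody.
print(fuse_valence(prosodic_valence=-0.3, text_sentiment=0.8))  # ~0.14
```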

Negative findings were also reported. Specifically, some of the research found that several emotions were more difficult for participants to detect. These included negative emotions such as frustration, disappointment, anxiety [5], anger [35], fear, disgust [36] and sadness [31]. Aly et al. [35] explained that participants depend on non-verbal cues (e.g., gestures) for emotions such as anger, and that the Mary TTS engine limited their ability to design a convincing vocal pattern for this emotion. Additionally, Rabiei et al. [36] highlighted that some emotions are simply more difficult for humans to identify.

4 Discussion

The most effective and commonly used prosodic elements related to emotional speech in human-computer/robot interaction were tone (n = 8), loudness (n = 6), speech speed (n = 4), pauses (n = 3) and non-linguistic vocalisations (n = 1). However, some of the literature did not specify which elements were used and instead contrasted prosodic speech with its absence (i.e., a monotone voice).

It was evident that research in this field is still at an early stage, as indicated by the small number of available studies and by the fact that many studies focussed on the speech synthesiser systems themselves and had not yet reached the stage of implementing them in social robots/agents. However, positive findings in the literature on human-robot/agent interaction indicated a promising opportunity for implementing such systems in various robots, including the popular NAO, Pepper and Hobbit robots [29,30,31].

It is important to note that synthesis of the findings was difficult, due to the varied definitions of emotion and measures of prosody. Regardless, the categorisation of emotion was consistent with established frameworks and often adhered to or included the Big Six [8,9,10], the Big Eight [11], arousal and valence [12] or classification as negative, positive and neutral [9, 10, 15]. A novel finding was the addition of other emotions and dimensions. Specifically, Crumpton et al. [33] added the baseline emotion ‘calm,’ while Antona et al. [5] also used the ‘tired/confused’ and ‘focussed’ emotions. Some of the affective dimensions used to determine emotion were also novel: these included the established categories of valence and arousal (also referred to as activation), but also power, expectation and intensity [28, 38, 39].

It was interesting that the common prosodic elements were mostly effective in helping to express or detect emotion within human-robot/agent interaction [29, 31, 32, 35], but negative emotions were often more difficult to identify [5, 31, 35, 36]. This may be because people often rely on non-verbal cues [35]. This suggests that while prosody is important for affective computing, it is not the sole solution. Instead, it should be complemented with gestures and facial expressions (where possible) in a multimodal strategy (e.g., in [35]). Even in appearance-constrained robots, prosody can be supplemented with changes in visual appearance. For example, a study with 33 participants found that colour and motion can be combined to convey emotion [44]. Specifically, anger is best conveyed with colour, fear can be conveyed with motion, and joy/happiness is best conveyed with a combination of colour and motion. Generally agreed colour associations, such as the common pairings of red with anger and yellow with joy/happiness [45, 46], can be leveraged in conjunction with speech. Regardless of effectiveness, incorporating prosodic elements may further help to augment robotic voices with affective capabilities and overcome the issue of voices sounding too ‘robotic’, as reported in some human-robot interaction studies [5, 47].
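For appearance-constrained robots, such pairings could be implemented as a simple lookup alongside the speech channel. The sketch below draws only on the associations named above (colour for anger and joy, motion for fear and joy); the specific motion descriptions are hypothetical placeholders.

```python
# Illustrative lookup pairing emotional speech with supplementary visual cues
# for appearance-constrained robots, based only on the associations named
# above (colour for anger and joy, motion for fear and joy). The specific
# motion descriptions are hypothetical placeholders.

SUPPLEMENTARY_CUES = {
    "anger":         {"colour": "red",    "motion": None},
    "fear":          {"colour": None,     "motion": "rapid trembling"},  # hypothetical pattern
    "joy/happiness": {"colour": "yellow", "motion": "lively swaying"},   # hypothetical pattern
}

def cues_for(emotion: str) -> dict:
    """Return visual cues to pair with emotional speech, if any are defined."""
    return SUPPLEMENTARY_CUES.get(emotion, {"colour": None, "motion": None})

print(cues_for("fear"))  # -> {'colour': None, 'motion': 'rapid trembling'}
```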

4.1 Implications for Development and Future Research

Literature on sentiment expression and detection through prosody is still at an early stage. Thus, further research is warranted before design and development recommendations can be made. It was evident that some prosodic elements (tone, loudness and speech speed) were used more often than others (e.g., non-linguistic vocalisations), with promising results in sentiment expression and/or detection. Future research should explore the effectiveness of these specific prosodic elements in emotional speech, using larger sample sizes. Once these have been shown to be effective, the speech should be implemented on a robot and tested in real-life interaction scenarios. Long-term research and development should also include non-verbal parameters and verbal semantics, to determine which combination leads to the most successful expression and detection of negative sentiment.

4.2 Strengths and Limitations

The review adhered to the Arksey and O’Malley [24] method for conducting scoping reviews, and was reported in adherence with the PRISMA-ScR items [26]. Another strength of our scoping review is that the included research was limited to the last 10 years of publication, meaning that the findings represent the most recent state-of-the-art methods. This also helped us to overcome a common limitation of scoping reviews, whereby the included literature is vast [48], which may result in limited detail in the findings.

As in other scoping reviews (e.g., [49, 50]), our work was limited to literature published in English and did not include grey literature. We also did not conduct quality or bias assessments of the included literature, meaning that the included studies were of varying quality. However, such assessments are typically not a requirement of scoping reviews [48, 50].

5 Conclusion

This scoping review of recently published literature helped to identify common prosodic elements used in human-computer interaction: tone, loudness, speech speed and pauses. Non-linguistic vocalisations and emphasis/stress were less frequently used. Future research should explore the effectiveness of commonly used prosodic elements in emotional speech, using larger sample sizes and real-life interaction scenarios. Finally, the success of prosody in conveying negative sentiment may be improved with additional non-verbal parameters (e.g., motion or changes in light that represent emotion). Thus, it is essential that more work be conducted to determine how these may be combined with prosody and which combination is most effective in human-robot affective interaction.