1 Introduction

Humanoid robots can serve in human society as elderly companions, shop clerks, or television broadcasters. In these scenarios, interactions between robots and humans are assumed to involve both verbal and non-verbal behaviors, during which users’ perceptions of the robots are formed. Such perceptions are claimed to be closely related to how well society or an individual accepts robots [1]. It is therefore crucial to develop methods that can successfully influence user perceptions of robots as a way to increase user acceptance.

Studies in cognitive science show that several characteristics of robots, including appearance and verbal and non-verbal behavior, can influence how people perceive them [2,3,4,5,6,7]. Gestures are among the non-verbal behaviors that significantly influence how people perceive robots. Studies have shown that altering specific gestural movements can produce distinct impressions [8, 9]. However, these studies do not aim at the automatic synthesis of gestures; therefore, while rule-based systems for controlling robot impressions can be built on their findings, they cannot be easily integrated into an automatic robot control system.

Co-speech gesture datasets used to train data-driven gesture generation models have been developed using open-source technologies for recording human motion [10,11,12,13,14]. These models have primarily been assessed on virtual agents through subjective evaluations in which participants rank or score gestures from various systems [14,15,16,17,18,19,20,21]. Such data-driven models can generalize to voice inputs that are not in the training data, providing a key component for a multimodal scheme for social robots, in contrast to rule-based systems in which gestures are manually constructed for particular circumstances. These studies, however, did not consider how to change people’s perceptions of robots. No dataset contains impressions as training labels, because the amount of data required to create one would be prohibitively large. For example, if each item of the Big Five personality traits is considered to have three levels (low, mid, and high), covering every personality combination would require data for at least \(3^5 = 243\) combinations. To find commonality, it is also necessary to collect data from multiple individuals for each combination. As a result, modeling impressions solely through data collection is a difficult task.

We propose a pseudo-labeling method based on the above-mentioned existing cognitive science approaches, which automatically categorizes gesture data to address the infeasibility of collecting labeled data. To create a data-driven model that can generate gestures for various impressions, a model based on generative adversarial networks (GANs) was trained using pseudo-labeled data. Extroversion was chosen as the objective impression since it has been extensively investigated in cognitive science. From related studies, we summarized gesture characteristics that influence extroversion, such as the speed or amplitude of motion. The gestures in the dataset are divided into different classes as pseudo labels based on these characteristics, and these classes are subsequently employed as additional input in our proposed model. We conducted subjective studies to evaluate the effectiveness of our model, which demonstrated that it could produce gestures that influenced the perceived extroversion of a virtual agent and a humanoid robot.

This paper is an extension of our previous work [22]. First, we improved the motion retargeting used to control the robot and achieved the expected results with it (Sects. 4.2 and 5.1). Second, the previous single-item measure was replaced by four questions covering different aspects of extroversion for a more robust assessment (Sect. 4.3). Finally, we analyzed the obtained results for each aspect of extroversion separately and found that different aspects were affected differently by the motion statistics (Sects. 4 and 5.4).

The remainder of this paper is structured as follows: we present research efforts that are related to the current study in Sect. 2. Our proposed model is described in Sect. 3. We detail our experiment that evaluated our method in Sect. 4, and we discuss our findings in Sect. 5.

2 Related Works

2.1 Deep-Learning-Based Co-Speech Gesture Generation

For the task of speech-driven gesture generation, deep-learning-based models have shown promising performance. Hasegawa et al. [23] found that Mel-Frequency Cepstral Coefficients (MFCCs) of the audio are useful for modeling the relationship between speech and gestures with a long short-term memory (LSTM) network. In terms of the accuracy of the generated gestures, Kucherenko et al. [20, 24] claim that MFCCs are superior to other types of audio features. Yoon et al. [13, 17] suggested that spoken text is useful for modeling the semantic relationship between speech and gesture. In addition to deterministic generation, generative models have enabled probabilistic generation. Taylor et al. [18] combined Generative Flow (GF) and a Variational Auto-Encoder (VAE) to model the conditional gesture distribution. Adversarial training losses have been used alongside the L2 distance to push models toward more realistic motion [25, 26], although these models remain deterministic. The gesture distribution conditioned on speech signals has also been modeled with GANs [21, 27], where prosody features serve as the condition. Prosody features were also used in another gesture generation work [28], in which part-of-speech information extracted from text was simultaneously used to select keywords for gesticulation. Alexanderson et al. [16] advocated explicitly controlling the speed, radius, and height of hand positions in generated gestures using GF-based models, but they did not consider the potential of these controls for human-robot interaction. Additionally, while both GANs and GF are probabilistic models, GF typically has more parameters than the generator of a GAN due to its weak non-linearity, which slows down inference. In this study, we trained a conditional GAN-based model to directly regulate the gesture features related to personality, and we examined its effect on the perceived personality of an avatar and a humanoid robot.

Fig. 1
figure 1

High-level overview of proposed method. Different blocks represent different steps. Leftmost block shows that a speech gesture dataset is divided into three subsets according to gesture features (Sect. 3.1). Middle block shows that our gesture generator takes speech and trait label as input to generate gestures that exhibit different gesture characteristics corresponding to the input trait label, and it is trained using a gesture discriminator (Sect. 3.3). Rightmost block shows that the generated gestures yield different impressions in terms of extroversion (Sect. 4)

2.2 Human Impressions of Robots

Interpersonal relationships and human-robot interactions are greatly facilitated by personality [29,30,31]. One of the most common indicators of an individual’s personality is the level of extroversion or introversion [32]. By altering the size, pace, and frequency of movements, Kim et al. [8] created various personality types for robots, revealing that consumers were more amused and impressed by extroverted robots than by introverted robots. Neff et al. [33] developed a technique for altering the perception of a virtual agent’s extroversion by changing parameters for language generation, gesture pace, and movement. Mileounis et al. [34] examined the relationship between extroversion and social intelligence using NAO, a remotely controlled humanoid robot, and found that extroversion gives the appearance of higher social intelligence. According to Deshmukh’s investigation of the relationship between the speed and amplitude of a gesture and Godspeed scores [35], higher amplitude and speed are associated with higher levels of extroversion and neuroticism. Dou et al. [36] ran tests in real-world shopping settings to examine how robotic speech and movements affected how people perceived the robots’ personalities. Their findings demonstrated that particular instructional movements produced extroverted impressions. A number of studies have investigated the emotions that robots can express through body language in order to influence how humans perceive them during human-robot interaction. In one study [37], participants were shown gestures representing Ekman’s six basic emotions, indicating that social robots can communicate emotions simply by moving their heads and arms. According to Costa et al. [38], individuals can accurately identify emotions after viewing footage of a robot’s facial expressions, and movements were also useful for identifying emotions. Gjaci et al. [39] found cultural differences in gestures by analyzing the impressions of robots with a speech-driven gesture generation model trained on culture-dependent and culture-independent datasets. Certain movement characteristics of the upper body of the robot Pepper were found to correlate with perceived extroversion and introversion through interactive experiments with human movement analysts [40]. Zabala et al. [41] proposed a multimodal non-verbal behavior generation model for Pepper to control emotional expression, which was also claimed to be effective for expressing personality. The key difference between our method and theirs [41] is that they apply gates on the conditional emotion inputs of their model to affect the personality, whereas our method directly associates motion statistics with the expression of extroversion.

The above studies investigated the characteristics that affect how people perceive robots. We found that the factors most frequently discussed in association with extroversion are the speed and amplitude of gestures. It is natural to assume that distinct personalities can be conveyed by altering these two factors. Consequently, we created a system for automatically generating co-speech gestures that can change these characteristics and thereby subtly alter people’s perceptions of robots.

3 Methodology

Our proposed method consists of two main steps. The first step is to heuristically assign a trait label to each sample based on gesture features that are associated with extroversion. The second step involves training an adversarial neural network on the pseudo-labeled samples to obtain a gesture generator that produces gesture sequences from extracted speech features and a trait label. As a result, our model produces gesture sequences whose characteristics correspond to the input trait labels related to extroversion, and it is thus expected to yield different degrees of perceived extroversion among users. An overview of our proposed method is shown in Fig. 1.

3.1 Pseudo-Label

We chose features involving the speed and amplitude of gestures for labeling because they are considered to be related to extroversion. Specifically, speed is defined as the moving distance of all joints in 3D coordinates, averaged over the total duration. Denoting the coordinates of a joint k at time step t as the tuple \((x_{t}^{k}, y_{t}^{k}, z_{t}^{k})\), the speed feature is defined as follows:

$$\begin{aligned} \frac{1}{TK}\sum ^{T}_{t=1}\sum ^{K}_{k=1}\sqrt{ ({x}_{t}^{k} - {x}_{t-1}^{k})^2 + ({y}_{t}^{k} - {y}_{t-1}^{k})^2 + ({z}_{t}^{k} - {z}_{t-1}^{k})^2 } \end{aligned}$$
(1)

where K is the total number of joints and T is the total length of a gesture segment. Amplitude is defined as the average, over the two hands, of the maximum 3D distance between the hand positions at any two time steps, denoted as follows:

$$\begin{aligned} \frac{1}{2} (\max _{i, j \in T} (dist({lh}_{i}, {lh}_{j})) + \max _{i, j \in T} (dist({rh}_{i}, {rh}_{j}))) \end{aligned}$$
(2)

where lh and rh stand for the left and right-hand positions. \(dist(\cdot ,\cdot )\) is defined as

$$\begin{aligned} {dist}(a, b) =\sqrt{ ({a}_{x} - {b}_{x})^2 + ({a}_{y} - {b}_{y})^2 + ({a}_{z} - {b}_{z})^2 } \end{aligned}$$
(3)

where x, y and z are the coordinates along different axes in 3D.
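As an illustration, the two features in Eqs. (1)–(3) can be computed directly from an array of joint positions. The following NumPy sketch shows one way to do this; the array layout and the hand-joint indices (`lh_idx`, `rh_idx`) are illustrative assumptions, not the exact implementation used in our code.

```python
# Minimal sketch of the speed (Eq. 1) and amplitude (Eq. 2) features.
# `joints` is assumed to have shape (T, K, 3): T frames, K joints, 3D positions.
import numpy as np

def speed_feature(joints: np.ndarray) -> float:
    # Mean displacement between consecutive frames, averaged over all joints.
    diffs = np.diff(joints, axis=0)                  # (T-1, K, 3)
    step_dist = np.linalg.norm(diffs, axis=-1)       # (T-1, K)
    return float(step_dist.mean())

def amplitude_feature(joints: np.ndarray, lh_idx: int, rh_idx: int) -> float:
    # For each hand, the maximum distance between its positions at any two
    # time steps; the amplitude is the average of the two maxima.
    def max_pairwise(traj):                          # traj: (T, 3)
        d = np.linalg.norm(traj[:, None, :] - traj[None, :, :], axis=-1)
        return d.max()
    return 0.5 * (max_pairwise(joints[:, lh_idx]) + max_pairwise(joints[:, rh_idx]))
```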

Based on the features described above, we define three categories for gestures as trait labels, namely low, mid, and high. Gestures with low speed and small amplitude are expected to be categorized as low, gestures with high speed and large amplitude as high, and neutral gestures as mid. For this purpose, it is necessary to determine the boundaries that divide all gesture samples into the different categories. We use K-means [42] for this division, with the number of clusters set to three. The distance metric is defined as follows:

$$\begin{aligned} l(x, y) = \frac{1}{2} \sqrt{ ({speed}_{x} - {speed}_{y})^2 + ({amp}_{x} - {amp}_{y})^2 } \end{aligned}$$
(4)

where speed and amp are the gesture features we defined above. Before executing K-means, speed and amplitude features are normalized according to the following equation as they have different scales:

$$\begin{aligned} x_{normalized} = (x - \mu ) / \sigma \end{aligned}$$
(5)

where \(\mu \) is the mean and \(\sigma \) is the standard deviation. After this process, each sample in the dataset is assigned one of the trait labels we defined.
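A minimal sketch of this labeling step with scikit-learn is given below. Since the constant factor in Eq. (4) does not change cluster assignments, the standard Euclidean metric is used; the mapping from cluster index to low/mid/high by mean cluster speed is an illustrative assumption.

```python
# Pseudo-labeling sketch: z-normalize the two features (Eq. 5) and run K-means
# with three clusters, then order clusters so 0/1/2 correspond to low/mid/high.
import numpy as np
from sklearn.cluster import KMeans

def assign_trait_labels(speed: np.ndarray, amp: np.ndarray, seed: int = 0) -> np.ndarray:
    feats = np.stack([speed, amp], axis=1)                       # (N, 2)
    feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)     # Eq. (5)
    km = KMeans(n_clusters=3, random_state=seed, n_init=10).fit(feats)
    order = np.argsort(km.cluster_centers_[:, 0])                # sort by mean speed
    remap = {cluster: rank for rank, cluster in enumerate(order)}
    return np.array([remap[c] for c in km.labels_])              # 0=low, 1=mid, 2=high
```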

A classification based on the data distribution rather than on human perception may not be optimal. However, since manual labeling is infeasible due to the huge amount of work required, our goal is to leverage findings in cognitive science about personality to obtain such labels without requiring humans to perform the labeling. Our experimental results validated the proposed method in that gestures generated with different labels left different impressions on the participants (Sect. 4).

3.2 Speech Feature Extraction

We extract prosody features from the speech signal for gesture generation. The prosody feature consists of the fundamental frequency (F0) and the power of the audio signal. Although the prosody feature does not carry the semantic information that captures the semantic relationship between speech and gestures, it reflects when a person is speaking and the emphasis in the speech. F0 is extracted on a log scale in semitone units using a conventional autocorrelation-based method following Ishi et al. [43], with an estimation interval of 10 ms. The power is in dB units. After extraction, for the speech segment of each sample we obtain a two-dimensional feature vector \(s\in {\mathcal {R}}^{T\times 2}\), where T is the length of the speech segment.
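For illustration, a rough approximation of this extraction is sketched below with librosa; the pYIN estimator stands in for the autocorrelation-based method of [43], and the frequency range and reference pitch are assumptions rather than the settings used in our implementation.

```python
# Rough sketch of prosody feature extraction (F0 in semitones, power in dB)
# at a 10-ms frame interval.
import numpy as np
import librosa

def prosody_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(0.010 * sr)                                        # 10-ms frames
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)
    # Convert F0 to semitones relative to an assumed 55-Hz reference.
    f0_semitone = 12.0 * np.log2(np.where(np.isnan(f0), 1.0, f0) / 55.0)
    f0_semitone[np.isnan(f0)] = 0.0                              # unvoiced frames -> 0
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    power_db = 20.0 * np.log10(rms + 1e-8)
    n = min(len(f0_semitone), len(power_db))
    return np.stack([f0_semitone[:n], power_db[:n]], axis=1)     # shape (T, 2)
```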

Fig. 2
figure 2

Illustration of proposed model

3.3 Gesture Generator

The gesture generator resembles the generator proposed by Wu et al. [21], a neural network based on bi-directional gated recurrent units (bi-GRUs). The network takes as input the speech features described in Sect. 3.2, the previous poses, and a noise vector drawn from the standard Gaussian distribution, and it outputs a sequence of gestures defined as joint rotation Euler angles along the 3D axes. The difference is that the trait label is included as an additional input in our model. The input to our generator thus consists of trait, noise, and prosody. trait is a one-hot vector. noise is a one-dimensional vector randomly drawn from the standard Gaussian distribution. prosody is a two-dimensional vector with a time dimension and is averaged every five frames to match the gesture frame rate of 20 fps. Before being fed into the bi-GRU layers, trait and noise are repeated to the same length as prosody and then concatenated together. The seed pose is also concatenated with this vector. The output is a sequence of vectors whose length is the same as that of prosody. When generating gestures, previously generated gestures are used as seed poses; the initial seed pose is all zeros. The generator is trained against a discriminator using an adversarial loss.
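A minimal PyTorch sketch of such a generator is shown below; the layer sizes, noise dimension, and the exact handling of the seed pose are illustrative assumptions, not the configuration of the released implementation.

```python
# Conditional bi-GRU generator sketch: trait and noise are repeated over time
# and concatenated with prosody and the seed pose before the recurrent layers.
import torch
import torch.nn as nn

class GestureGenerator(nn.Module):
    def __init__(self, pose_dim=36, noise_dim=16, trait_dim=3, hidden=256):
        super().__init__()
        in_dim = 2 + trait_dim + noise_dim + pose_dim   # prosody + trait + noise + seed pose
        self.gru = nn.GRU(in_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, pose_dim)      # joint rotation angles per frame

    def forward(self, prosody, trait, noise, seed_pose):
        # prosody: (B, T, 2); trait: (B, 3) one-hot; noise: (B, noise_dim);
        # seed_pose: (B, T, pose_dim), previously generated poses (zeros initially).
        T = prosody.size(1)
        trait_rep = trait.unsqueeze(1).expand(-1, T, -1)   # repeat along time
        noise_rep = noise.unsqueeze(1).expand(-1, T, -1)
        x = torch.cat([prosody, trait_rep, noise_rep, seed_pose], dim=-1)
        h, _ = self.gru(x)
        return self.out(h)                                 # (B, T, pose_dim)
```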

The discriminator has an additional input for the trait label, which forces it to recognize gestures that do not correspond to the input trait label, thereby optimizing the generator toward producing gestures with the correct characteristics. As input to the discriminator, the gestures, trait, and prosody are concatenated and fed into 1D convolutional layers. The output is a scalar that can be interpreted as an approximation of the Wasserstein distance between the distribution of the generated gestures conditioned on all inputs and that of the ground-truth gestures, and it is used as the learning signal for training the generator. An illustration of our model is shown in Fig. 2.
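A companion sketch of the conditional discriminator (critic) follows; channel sizes and kernel parameters are again illustrative assumptions.

```python
# Conditional critic sketch: gesture, trait (repeated over time) and prosody are
# concatenated channel-wise, passed through 1D convolutions, and reduced to a
# single scalar score per sample.
import torch
import torch.nn as nn

class GestureDiscriminator(nn.Module):
    def __init__(self, pose_dim=36, trait_dim=3, channels=128):
        super().__init__()
        in_ch = pose_dim + trait_dim + 2
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, channels, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),
        )
        self.out = nn.Linear(channels, 1)

    def forward(self, gesture, trait, prosody):
        # gesture: (B, T, pose_dim); trait: (B, 3); prosody: (B, T, 2)
        T = gesture.size(1)
        trait_rep = trait.unsqueeze(1).expand(-1, T, -1)
        x = torch.cat([gesture, trait_rep, prosody], dim=-1).transpose(1, 2)  # (B, C, T)
        return self.out(self.net(x).squeeze(-1))            # (B, 1) critic score
```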

Denoting the generator as G, discriminator as D, all inputs to the discriminator as c, and gestures as x, the loss function for the adversarial training is defined as follows:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{adv} =&\max _{D}\min _{G}\frac{1}{N}\sum _{i=1}^{N}D(x^{(i)}, c^{(i)}) \\&- D(G(c^{(i)}, x_{prev}^{(i)}, z), c^{(i)}) \\&+ \lambda _{GP}(\Vert \nabla _{D}^{(i)}\Vert _{2} - 1)^2 \end{aligned} \end{aligned}$$
(6)

where the third term is the gradient penalty (GP) used for training stability, as proposed by Gulrajani et al. [44] and used by Wu et al. [21]. z is drawn from a standard normal distribution. Following Wu et al. [21], we also include a continuity loss to maintain continuity between consecutive gestures as follows:

$$\begin{aligned} {\mathcal {L}}_{continuity} = \frac{1}{N}\sum _{i=1}^{N}\text {Huber}({\hat{x}}^{(i)}_{:k}, x^{(i-1)}_{-k:}) \end{aligned}$$
(7)

where k is a hyperparameter defining the number of frames considered and \(\text {Huber}(\cdot ,\cdot )\) is the Huber loss [45]. Finally, our total loss function is defined as:

$$\begin{aligned} {\mathcal {L}} = {\mathcal {L}}_{adv} + \alpha {\mathcal {L}}_{continuity} \end{aligned}$$
(8)

where \(\alpha \) is a hyperparameter weighting the continuity loss.
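The sketch below shows how one training step could compute these losses (a WGAN-style critic loss with gradient penalty plus the Huber continuity term). It assumes the generator and discriminator interfaces sketched above and is a simplification of the actual training loop.

```python
# One-step loss computation sketch for Eqs. (6)-(8).
import torch
import torch.nn.functional as F

def gradient_penalty(D, real, fake, trait, prosody):
    # Penalize the critic's gradient norm on interpolated samples [44].
    eps = torch.rand(real.size(0), 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = D(x_hat, trait, prosody).sum()
    grad, = torch.autograd.grad(score, x_hat, create_graph=True)
    return ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def losses(G, D, real, prev, trait, prosody, noise, lambda_gp=10.0, alpha=1.0, k=4):
    fake = G(prosody, trait, noise, prev)
    # Critic: maximize D(real) - D(fake)  ->  minimize the negation, plus GP.
    d_loss = -(D(real, trait, prosody).mean() - D(fake.detach(), trait, prosody).mean()) \
             + lambda_gp * gradient_penalty(D, real, fake.detach(), trait, prosody)
    # Generator: fool the critic while staying continuous with the previous segment.
    continuity = F.huber_loss(fake[:, :k], prev[:, -k:])      # Eq. (7)
    g_loss = -D(fake, trait, prosody).mean() + alpha * continuity
    return d_loss, g_loss
```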

Fig. 3
figure 3

Boxplot of the distribution of the features defined in Sect. 3.1 for different generated gesture groups defined in Sect. 4

4 Experiment

Our experiment is based on the dataset proposed by Takeuchi et al. [12], in which the speech is recorded as wave files at a 44-kHz sampling rate and the gestures are human joint rotation values recorded with a motion-capture toolkit. Although the dataset contains gesture data for the whole body, we used only the 12 upper-body joints to train our model, both because upper-body movements correlate more strongly with speech and impressions in the co-speech gesture generation setting and for comparison with the baseline model. The information on the joints used can be found in Appendix A. The palm and fingers are excluded due to noisy data. The dataset used in our experiment was downloaded from an earlier work’s repository [24]. There are 1047 utterances in total: 957 samples are training data, 45 are validation data, and 45 are test data. In total, 192 min of dataFootnote 1 were used during training at 20 fps.

We trained our model using the proposed method. The length of each gesture segment is set to two seconds. Other hyperparameters and details can be found in our code.Footnote 2 After training, we obtained a model that generates a sequence of joint rotation values from an input audio segment and an input trait label. As a result, there are three generated gesture sequences for each audio segment, which we denote as Gh, Gm, and Gl for trait labels high, mid, and low, respectively. Our comparison also involves the ground-truth gestures and the current state-of-the-art gesture generation model proposed by Wu et al. [21] trained on the same dataset, which we denote as Gg and Gb, respectively. The purpose is to compare the impressions yielded by the ground truth and by gestures generated by a model trained on the full dataset. In total, there are five groups for comparison: Gg, Gb, Gh, Gm, and Gl.

4.1 Effect of Pseudo-Label

The purpose of the pseudo-label is to control the speed and amplitude of the generated motion. To verify whether this has been achieved, we generated motions using each label for every audio sequence in the test set and calculated their speed and amplitude using the equations described in Sect. 3.1. A comparison with the baseline and the ground truth is shown in Fig. 3. The results show that, by inputting different trait labels, our model produces motions with different degrees of speed and amplitude, indicating that our purpose has been achieved.

4.2 Visualization

We used a virtual humanoid avatar and a small humanoid robot for visualization and evaluation. A snapshot of the avatar and robot is shown in Fig. 4. The avatar is a freely downloadable character from the UnityFootnote 3 asset store. To control its motion, the joint rotation values are directly mapped to the nearest joints based on the avatar’s joint configuration, with a few simple transformations including the right-hand-rule to left-hand-rule conversion and a change of rotation order. Each gesture sequence was visualized on the avatar and the video was recorded using Unity’s built-in recording tool. Since there was a short delay in the audio relative to the gestures when using the built-in audio player, we recorded only the gesture videos and attached the corresponding audio manually.

Fig. 4
figure 4

Snapshot of CommU (left) and avatar (right)

The small humanoid robot is CommU, a commercially available robot. It has fewer joints than a human: only pitch and roll for the shoulders, hip, and head, yaw for the head, and one degree of freedom (DoF) for mouth opening, as summarized in Table 1. For shoulder control, we first transform the sequence of joint rotation values into joint positions, including the fingertips, by forward kinematics based on the bone lengths and joint hierarchy provided by the dataset. We then compute the direction from the shoulder to the fingertip of the middle finger and solve for the pitch and roll values by inverse kinematics. For the hip and the head’s yaw and roll, we directly map the joint rotation values to the corresponding joints using transformations similar to those used for the avatar. The mouth motion and the head’s pitch were controlled by a model proposed by Ishi et al. [46]. The videos were recorded with a 4K camera and the audio was attached afterward. A sample video can be found online.Footnote 4
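A simplified sketch of the shoulder retargeting is given below. The axis convention (x forward, y left, z up) is an illustrative assumption, so the exact angle formulas would have to be adapted to the robot's actual coordinate frame.

```python
# Given the 3D positions of a shoulder and the middle fingertip (obtained by
# forward kinematics on the dataset's skeleton), recover a pitch and roll angle
# for the corresponding CommU shoulder joint.
import numpy as np

def shoulder_pitch_roll(shoulder: np.ndarray, fingertip: np.ndarray):
    d = fingertip - shoulder
    d = d / np.linalg.norm(d)                        # unit direction shoulder -> fingertip
    pitch = np.arctan2(d[0], -d[2])                  # lifting the arm forward/backward
    roll = np.arcsin(np.clip(d[1], -1.0, 1.0))       # lifting the arm sideways
    return pitch, roll
```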

Table 1 Joint configuration for CommU
Fig. 5
figure 5

Results of perceived extroversion on avatar. Only the p-values between Gl, Gm, and Gh are annotated. The error bar represents standard error. ***\(p<0.001\). Details of all pair-wise comparison can be found in Table 2 and 3

4.3 User Study

The goal of the user study is to investigate the effectiveness of our method in generating gestures that yield different perceived extroversion for the avatar and for CommU in interactions with humans. To evaluate the perceived extroversion of the avatar and the robot, we designed two similar questionnaires, one for each, which include videos for participants to watch and, for each video, questions on extroversion and on the naturalness of the avatar or robot. There are two types of video. The first type shows the gesturing avatar or robot together with the audio, where the gesture is either generated from the audio or taken from the ground truth. The second type is the muted version of the first. Although only the with-audio condition is fully meaningful, since the avatar and robot are multimodal, the no-audio condition was also included so that, in case no significant difference is found in the with-audio condition, we could still analyze how the generated gestures are perceived without the effect of audio. There are 15 videos for each type. These samples were randomly chosen from the test set, which was not used during training. Based on the above description, our experiment involves four conditions as follows:

  • Condition 1: Avatar performed 5 types of gesture with audio (with-audio).

  • Condition 2: Avatar performed 5 types of gesture without audio (no-audio).

  • Condition 3: CommU performed 5 types of gesture with audio (with-audio).

  • Condition 4: CommU performed 5 types of gesture without audio (no-audio).

The questions are phrased as ‘How [aspect] do you think the avatar/robot is?’, where [aspect] is one of the following: (a) sociable, (b) enthusiastic, (c) reserved, and (d) quiet, following Ludewig et al. [47] for evaluating the extroversion of robots. The question evaluating naturalness is ‘How natural is the gesture of the avatar/robot?’ After watching a video, participants rated each aspect from one to seven, where one represents strongly negative, seven represents strongly positive, and four represents neutral. The final perceived extroversion score is calculated as:

$$\begin{aligned} (a + b + (8 - c) + (8 - d)) / 4 \end{aligned}$$
(9)

where a, b, c and d represent the score of each specific question.

There are five sections in the questionnaire. The first section includes an explanation of the purpose and steps of our experiment for obtaining consent, as well as questions about the participant’s gender, age, and the current time. In sections two and three, the with-audio and no-audio videos are evaluated, respectively. At the beginning of each of these sections, an attention-check question is included for data screening. At the end of the experiment, participants are asked to record the current time once again so that the completion duration can be calculated. Each participant is paid 1200 JPY (approx. 9 USD) via the crowdsourcing company for completing a questionnaire. Since our experiment involves human subjects, we acquired approval from the ethics committee of the Advanced Telecommunications Research Institute International (ATR, review number 22-605).

Table 2 Pair-wise comparison for condition 1
Table 3 Pair-wise comparison for condition 2
Fig. 6
figure 6

Results of the four aspects used in the questionnaire on avatar in condition 1. Error bar represents standard error. Only the p-values between Gl, Gm, and Gh are annotated. *\(p<0.05\), **\(p<0.01\), ***\(p<0.001\). (sociable: Gg: \(M=5.02\), \(SE=0.09\). Gb: \(M=4.39\), \(SE=0.15\). Gl: \(M=2.24\), \(SE=0.15\). Gm: \(M=3.25\), \(SE=0.14\). Gh: \(M=5.17\), \(SE=0.13\). enthusiastic: Gg: \(M=4.97\), \(SE=0.12\). Gb: \(M=4.24\), \(SE=0.13\). Gl: \(M=2.37\), \(SE=0.14\). Gm: \(M=3.24\), \(SE=0.11\). Gh: \(M=5.15\), \(SE=0.12\). reserved: Gg: \(M=3.42\), \(SE=0.14\). Gb: \(M=3.82\), \(SE=0.16\). Gl: \(M=5.68\), \(SE=0.16\). Gm: \(M=4.76\), \(SE=0.14\). Gh: \(M=2.86\), \(SE=0.16\). quiet: Gg: \(M=3.39\), \(SE=0.13\). Gb: \(M=3.89\), \(SE=0.15\). Gl: \(M=5.42\), \(SE=0.18\). Gm: \(M=4.62\), \(SE=0.13\). Gh: \(M=2.82\), \(SE=0.15\).)

4.4 Results

Throughout the experiment, 50 participants were recruited for each questionnaire via a crowdsourcing service.Footnote 5 The statistical testing procedure is as follows. For the results of all groups (Gg, Gb, Gh, Gm, and Gl) in each experimental condition, we first performed the Shapiro-Wilk test and Bartlett’s test to check the normality and the equality of variance of the obtained data. If all groups passed the normality test and had equal variances, a one-way analysis of variance (ANOVA) was used to test whether all group means are equal, followed by Tukey’s honestly significant difference (Tukey-HSD) test for multiple comparisons. Otherwise, the non-parametric Kruskal–Wallis one-way ANOVA was used, followed by Dunn’s test for post-hoc multiple comparisons. The alpha level for all statistical tests was set to 0.05. Finally, Cohen’s d was calculated as the effect size to indicate how large the differences are.
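This procedure can be sketched with SciPy and scikit-posthocs as follows; the choice of libraries is an assumption for illustration, not the exact analysis scripts used.

```python
# Omnibus test + post-hoc comparison sketch following the procedure above.
import numpy as np
from scipy import stats
import scikit_posthocs as sp

def compare_groups(groups: dict, alpha: float = 0.05):
    samples = list(groups.values())                          # e.g. {'Gg': scores, ...}
    normal = all(stats.shapiro(g).pvalue > alpha for g in samples)
    equal_var = stats.bartlett(*samples).pvalue > alpha
    if normal and equal_var:
        omnibus = stats.f_oneway(*samples)                   # one-way ANOVA
        posthoc = stats.tukey_hsd(*samples)                  # Tukey-HSD (recent SciPy)
    else:
        omnibus = stats.kruskal(*samples)                    # Kruskal-Wallis
        posthoc = sp.posthoc_dunn(samples)                   # Dunn's test
    return omnibus, posthoc

def cohens_d(a, b):
    pooled = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2)
    return (np.mean(a) - np.mean(b)) / pooled
```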

Table 4 Pair-wise effect size for different aspects on avatar in condition 1

4.4.1 Avatar

We first report the statistics of the participants for conditions 1 and 2. The average age of the participants is 36 with a standard deviation of 8. Half of them are male. The average completion time is 17 min with a standard deviation of 5 min. All of them passed the data screening question.

The obtained results and their analysis are described in the following. The results of condition 1 are shown in Fig. 5a. Although all groups have equal variance (\(p=0.14\)), the obtained scores for Gl are not normally distributed (\(p<0.05\)); thus the Kruskal–Wallis test and Dunn’s test were used. Not all group means are equal (\(p<0.001\)). As expected, the perceived extroversion increases monotonically across Gl, Gm, and Gh (\(p<0.05\), \(d=0.96\) between Gl and Gm; \(p<0.001\), \(d=2.33\) between Gm and Gh), indicating that our model produces three levels of extroversion on the avatar. The most extroverted group, Gh, is not significantly more extroverted than Gg (\(p=0.18\)), meaning that our model can only weaken the perceived extroversion. The perceived extroversion of Gb differs from that of all generated groups (\(p<0.001\) for Gh, Gm, and Gl), showing that training the model on all data has a different effect than training on a subset. Additionally, there is a significant difference between Gg and Gb (\(p<0.05\), \(d=0.68\)), suggesting that the baseline model failed to replicate the perceived extroversion of the ground truth. Based on the analysis above, hypothesis 1 was well supported.

The results of condition 2 are shown in Fig. 5b. Although all groups have equal variance (\(p=0.93\)), the obtained scores for Gl are not normally distributed (\(p<0.001\)); thus the Kruskal–Wallis test and Dunn’s test were used. Not all group means are equal (\(p<0.001\)). Similarly to condition 1, the perceived extroversion increases across Gl, Gm, and Gh (\(p<0.001\); \(d=1.24\) between Gm and Gl, \(d=2.02\) between Gh and Gm). When not combined with audio, Gh is perceived as more extroverted than Gg (\(p<0.05\), \(d=0.62\)), showing that without the effect of audio, our model succeeded in extrapolating the perceived extroversion. The perceived extroversion of Gb is not significantly different from that of Gm (\(p=0.1\)), showing that training the model on a subset yields results similar to training on the whole dataset when not combined with audio. Additionally, similarly to condition 1, there is a significant difference between Gg and Gb (\(p<0.005\), \(d=0.86\)). Based on these observations, hypothesis 2 was also supported.

Fig. 7
figure 7

Results of perceived naturalness on avatar in condition 1. The error bar represents the standard error. ***: \(p<0.001\). (Gg: \(M=5.7\), \(SE=0.12\). Gb: \(M=4.84\), \(SE=0.14\). Gl: \(M=4.67\), \(SE=0.15\). Gm: \(M=4.83\), \(SE=0.12\). Gh: \(M=4.92\), \(SE=0.13\))

Fig. 8
figure 8

Results of perceived extroversion on CommU. Only the p-values between Gl, Gm, and Gh are annotated. The error bar represents standard error. ***\(p<0.001\). Details of all pair-wise comparison can be found in Table 5 and 6

Table 5 Pair-wise comparison for condition 3
Table 6 Pair-wise comparison for condition 4

We additionally report the results for the four extroversion-related aspects used in our questionnaire (Fig. 6). The statistical tests were conducted independently for each aspect. Overall, the results are similar to those obtained when combining the aspects. While the evaluation scores increase for sociable and enthusiastic, they decrease for reserved and quiet. This indicates that our model controls the impression in terms of each aspect. However, we also note that some aspects were affected differently from others, as shown in Table 4. Specifically, between Gl, Gm, and Gh, the absolute effect sizes for reserved and quiet are generally smaller than those for sociable and enthusiastic. Also, the effect on enthusiastic appears to be larger than on the other three aspects.

The results of the naturalness evaluation in condition 1 are shown in Fig. 7. Although the naturalness of Gg is significantly higher than that of all other groups, no significant difference was found among the generated groups. This ensures that the evaluation results on extroversion are not affected by naturalness.

4.4.2 CommU

Fig. 9
figure 9

Results of the four aspects used in the questionnaire on CommU in condition 3. Error bar represents standard error. Only the p-values between Gl, Gm, and Gh are annotated. *\(p<0.05\), **\(p<0.01\), ***\(p<0.001\). (sociable: Gg: \(M=4.86\), \(SE=0.1\). Gb: \(M=4.35\), \(SE=0.1\). Gl: \(M=3.65\), \(SE=0.12\). Gm: \(M=4.33\), \(SE=0.13\). Gh: \(M=4.44\), \(SE=0.13\). enthusiastic: Gg: \(M=4.84\), \(SE=0.1\). Gb: \(M=4.23\), \(SE=0.09\). Gl: \(M=3.51\), \(SE=0.11\). Gm: \(M=4.41\), \(SE=0.11\). Gh: \(M=4.67\), \(SE=0.1\). reserved: Gg: \(M=3.27\), \(SE=0.12\). Gb: \(M=3.86\), \(SE=0.1\). Gl: \(M=4.61\), \(SE=0.16\). Gm: \(M=3.47\), \(SE=0.14\). Gh: \(M=3.16\), \(SE=0.14\). quiet: Gg: \(M=3.38\), \(SE=0.12\). Gb: \(M=3.87\), \(SE=0.1\). Gl: \(M=4.56\), \(SE=0.16\). Gm: \(M=3.56\), \(SE=0.14\). Gh: \(M=3.18\), \(SE=0.13\).)

Table 7 Pair-wise effect size for different aspects on CommU in condition 3

The statistics of the participants for conditions 3 and 4 are as follows. The average age of the participants is 36 with a standard deviation of 8. The average completion time is 22 min with a standard deviation of 6 min. One participant failed the data screening and was excluded. Another participant took an excessively long time (141 min) to complete the questionnaire and was also excluded. In total, we have 24 males and 25 females in the experiment.

The results of condition 3 are shown in Fig. 8a. Although all groups have equal variance (\(p=0.66\)), the obtained scores for Gg and Gb are not normally distributed (\(p<0.01\)); thus the Kruskal–Wallis test and Dunn’s test were used. Not all group means are equal (\(p<0.001\)). The perceived extroversion of Gl is significantly lower than that of Gm (\(p<0.001\), \(d=0.82\)). However, no significant difference was found between Gm and Gh (\(p=0.053\)), which indicates that our model can produce two levels of extroversion on CommU. This differs from the avatar, and we discuss it in Sect. 5.2. The most extroverted group, Gh, is not significantly more extroverted than Gg (\(p=0.26\)), meaning that our model can only weaken the perceived extroversion for CommU. The perceived extroversion of Gb is not significantly different from that of Gm (\(p=0.27\)), showing that training the model on a subset achieved results similar to training on the whole dataset in the case of CommU. Additionally, there is a significant difference between Gg and Gb (\(p<0.01\), \(d=0.52\)), suggesting that the baseline model failed to replicate the perceived extroversion of the ground truth, which is consistent with the results obtained for the avatar. Hypothesis 3 was partially supported based on the observations above.

The results of condition 4 are shown in Fig. 8b. Although all groups have equal variance (\(p=0.12\)), the obtained scores for Gb, Gl, and Gh are not normally distributed (\(p<0.05\)); thus the Kruskal–Wallis test and Dunn’s test were used. Not all group means are equal (\(p<0.001\)). The perceived extroversion of Gl is lower than that of Gm (\(p<0.001\), \(d=0.8\)). In contrast to condition 3, the perceived extroversion of Gm is also lower than that of Gh (\(p<0.001\), \(d=0.69\)). This indicates that our model produces three levels of perceived extroversion on CommU when not combined with audio. Gh is perceived as more extroverted than Gg (\(p<0.01\), \(d=0.36\)), showing that without the effect of audio, our model also succeeded in extrapolating the perceived extroversion for CommU. As when combined with audio, the perceived extroversion of Gb is not significantly different from that of Gm (\(p=0.55\)). Moreover, there is no significant difference between Gg and Gb (\(p=0.07\)), meaning that the model trained on the whole dataset can yield perceived extroversion similar to the ground truth. Hypothesis 4 was supported by the above analysis.

Additionally, we report the results for the four extroversion-related aspects used in our questionnaire (Fig. 9). The statistical tests were conducted independently for each aspect. The results are similar to those obtained when combining the aspects, which indicates that our model can control all four aspects for CommU by using Gl and one of Gm or Gh. However, the results also show different effects on different aspects, as shown in Table 7. Specifically, the effect on sociable appears to be smaller than on the other aspects. Also, although none of the four aspects shows a significant difference between Gm and Gh, the scales of their effect sizes differ.

The results of the naturalness evaluation in condition 3 are shown in Fig. 10. Although the naturalness of Gg is significantly higher than that of all other groups, no significant difference was found among the generated groups. This ensures that the evaluation results on extroversion are not affected by naturalness.

Fig. 10
figure 10

Results of perceived naturalness on CommU in condition 3. The error bar represents the standard error. ***: \(p<0.001\). (Gg: \(M=5.21\), \(SE=0.11\). Gb: \(M=4.31\), \(SE=0.11\). Gl: \(M=4.05\), \(SE=0.12\). Gm: \(M=4.33\), \(SE=0.12\). Gh: \(M=4.03\), \(SE=0.12\))

Fig. 11
figure 11

Illustration of motion retargeting from avatar to CommU. Blue arrows indicate the coordinate system on which the joint position data of the avatar lie and around which the shoulder of CommU can rotate. (Color figure online)

5 Discussion

5.1 Motion Retargeting

It is crucial to develop a proper motion retargeting method for the robot since its rigging configuration may differ from the avatar’s. In the current experiment, the results obtained on the avatar and CommU are similar in the no-audio condition, i.e., when participants evaluated only the motions, indicating that our motion retargeting from the avatar to CommU was successful.

In contrast, in our previous work [22] we obtained different results for the avatar and CommU in the no-audio condition, which suggests a possible failure in motion retargeting. Previously, the shoulder actuators of CommU were controlled by computing the vector from the shoulder to the wrist to estimate the pitch and yaw for the shoulder, considering that CommU has no DoF for the elbows and wrists. Consequently, the wrist position was retargeted to the hand position of CommU, resulting in different hand trajectories for CommU and the avatar, which were thus perceived differently by users. To improve this, we developed a more suitable motion retargeting for CommU using the vector from the shoulder to the fingertips, as described in Sect. 4.2. An illustration is shown in Fig. 11. Our results demonstrate the importance of implementing an appropriate motion retargeting algorithm. Note that we did not compare the previous and current retargeting directly because the previous retargeting was not able to resemble the avatar’s motion well, as shown in Fig. 11.

5.2 Difference Between Avatar and CommU

Although we made a great effort to reproduce the avatar’s gestures on CommU and ensured that the hand trajectories and head motion are almost the same, the results for the avatar and CommU still differ. As reported in Sect. 4.4, while the perceived extroversion of Gh is higher than that of Gm for the avatar, they did not differ for CommU. To investigate this difference, we performed additional analysis by comparing the results of the no-audio condition with those of the with-audio condition.

For both the avatar and CommU, the perceived extroversion increases when combined with audio compared with the no-audio condition. This can be verified by comparing the paired differences between the two conditions. Treating the perceived extroversion obtained from each video for one participant and that of its muted version for the same participant as a paired sample, we calculate the differences for all pairs. The perceived extroversion in the with-audio condition is higher than that in the no-audio condition for the avatar (\(p<0.001\), \(d=0.42\)). Likewise, for CommU, the perceived extroversion in the with-audio condition is higher than that in the no-audio condition (\(p<0.001\), \(d=0.28\)). It is reasonable to conclude that in our experiment, including audio increases the perceived extroversion for both the avatar and CommU.
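A sketch of this paired comparison is given below; the specific paired test is not named above, so a Wilcoxon signed-rank test is used here as one reasonable choice, and the score arrays are assumed to be aligned per (participant, video) pair.

```python
# Paired with-audio vs. no-audio comparison sketch.
import numpy as np
from scipy import stats

def paired_audio_effect(with_audio: np.ndarray, no_audio: np.ndarray):
    diff = with_audio - no_audio
    test = stats.wilcoxon(with_audio, no_audio)          # paired non-parametric test
    d = diff.mean() / diff.std(ddof=1)                   # effect size for paired samples
    return test.pvalue, d
```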

Table 8 Results of the comparison between the simulated neutral group and experimental groups

Furthermore, by comparing the perceived extroversion of Gh between the with-audio and no-audio conditions, we found that while the perceived extroversion of the avatar in the with-audio condition was higher than in the no-audio condition (\(p<0.001\), \(d=0.47\)), there was no significant difference for CommU (\(p=0.99\)). This does not follow the expectation that the perceived extroversion of CommU should also increase when combined with audio, especially considering that the perceived extroversion increased for all other groups (Gg, Gb, Gl, and Gm) on CommU when combined with audio (\(p<0.05\)), and that there is a significant difference between Gm and Gh for CommU in the no-audio condition.

We suspect that the upper limit of the perceived extroversion of CommU has been reached. The physical constraints of CommU (fewer joints and DoFs, lower maximum speed, smaller reachable space) can limit its expressiveness, whereas these are not problems for the avatar. Consequently, the results differ between the avatar and CommU. Although we believe that a more flexible robot would likely yield results closer to the avatar’s, further investigation is necessary and we leave this as future work.

5.3 Extrovert or Introvert?

While comparing the perceptions of the various groups reveals which group is seen as more or less extroverted than the others, it is not clear whether a given group is seen as extroverted or introverted. The design of our questionnaire can help to clarify this. Each video is rated by our participants on a scale of 1 to 7, with 1 denoting strongly negative and 7 denoting strongly positive, as explained in Sect. 4.3. A score of 4 should therefore be considered neutral in our experiment because it is neither positive nor negative. Accordingly, an extroverted or introverted impression should be represented by a perceived extroversion score that is higher or lower than 4, respectively.

We can simulate a neutral group whose perceived extroversion follows a normal distribution with a mean of 4 and a standard deviation of 1, from which we draw samples to statistically test whether the perceived extroversion of each group is higher or lower than 4. The number of samples is equal to the number of samples gathered in our experiment. We test the difference between the simulated group and all other groups following the test procedure outlined in Sect. 4.4. The results, summarized in Table 8, demonstrate that our model is capable of producing gestures that are perceived as extroverted or introverted, for both the avatar and CommU, when different trait labels are input.
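The following sketch illustrates the idea with a simplified two-group comparison; a Mann-Whitney U test stands in here for the full procedure of Sect. 4.4, and the random seed and sample size handling are assumptions for illustration.

```python
# Compare one experimental group against a simulated neutral group ~ N(4, 1).
import numpy as np
from scipy import stats

def compare_to_neutral(group_scores: np.ndarray, seed: int = 0):
    rng = np.random.default_rng(seed)
    neutral = rng.normal(loc=4.0, scale=1.0, size=len(group_scores))
    res = stats.mannwhitneyu(group_scores, neutral, alternative='two-sided')
    direction = 'extroverted' if group_scores.mean() > 4.0 else 'introverted'
    return res.pvalue, direction
```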

5.4 Different Aspects of Extroversion

In the experiment, we utilized four questions representing different aspects of extroversion for the evaluation. Overall, averaging over all aspects and analyzing each aspect independently yield similar findings. Both demonstrate that our model is capable of controlling extroversion as well as the four specific aspects. However, it is important to note that certain aspects are affected differently than others.

For instance, when examining the avatar’s results (Table 4), between Gl, Gm, and Gh, the absolute effect sizes for reserved and quiet are generally smaller than those for sociable and enthusiastic. Also, the effect on enthusiastic appears to be larger than on the other three aspects. For CommU’s results (Table 7), the effect on sociable appears to be smaller than on the other aspects. Also, although none of the four aspects shows a significant difference between Gm and Gh, the scales of their effect sizes differ.

These findings indicate that our method has distinct effects on different aspects of extroversion. One possible explanation is that the relationship between the chosen features and the various aspects of extroversion may have different scales or may only impact specific aspects. Recent studies, including our own, have primarily focused on the average of different extroversion aspects, overlooking the potential diverse effects on each individual aspect.

Future work should investigate how different aspects of extroversion are influenced differently and develop methods for fine-grained control over these aspects. This will contribute to enhancing the modeling and understanding of extroversion.

6 Conclusion

Leveraging the flexibility of data-driven techniques, we introduced a conditional GAN-based co-speech gesture generation model that utilizes cognitive heuristics. Our method effectively controlled the perception of extroversion, as evidenced by experimental results with both an avatar and a humanoid robot using four questions related to distinct aspects of extroversion. Intriguingly, while the perceived extroversion varied with different input labels, we observed that separate aspects of extroversion responded differently. Our findings suggest that the extensive data collection typically associated with deep learning can be mitigated using heuristic methods that incorporate insights from cognitive science. This underscores the potential of melding learning techniques with the principles of cognitive science.