1 Introduction

Humanoid robots can serve in human society as elderly companions, shop clerks, or television broadcasters. In these scenarios, interactions between robots and humans are assumed to involve both verbal and non-verbal behaviors, during which users’ perceptions of the robots are formed. Such perceptions are claimed to be closely related to how well society or an individual accepts robots [1]. It is therefore crucial to develop methods that can successfully influence user perceptions of robots as a way to increase user acceptance.

Studies in cognitive science show that several characteristics of robots, including appearance and verbal and non-verbal behavior, can influence how people perceive them [2,3,4,5,6,7]. Gestures are among the non-verbal behaviors that significantly influence how people perceive robots. Studies have shown that altering specific gestural movements can produce distinct impressions [8, 9]. However, these studies do not aim at the automatic synthesis of gestures; therefore, while rule-based systems for controlling robot impressions can be built on their findings, they cannot be easily integrated into an automatic robot control system.

Co-speech gesture datasets used to train data-driven gesture generation models have been developed using open-source technologies for recording human motion [10,11,12,13,14]. These models have primarily been assessed on virtual agents through subjective evaluations in which participants rank or score gestures from various systems [14,15,16,17,18,19,20,21]. Such data-driven models can generalize to voice inputs that are not in the training data, providing a key component for a multimodal scheme for social robots, in contrast to rule-based systems in which gestures are manually constructed for particular circumstances. These studies, however, did not consider how to change people’s perceptions of robots. No dataset contains impressions as training labels, because the amount of data required to create one would be prohibitively large. For example, if each item of the Big Five personality traits is considered to have three levels (low, mid, and high), covering every personality combination would require data for at least \(3^5 = 243\) combinations. To find commonality, it is also necessary to collect data from multiple individuals for each combination. As a result, modeling impressions solely through data collection is a difficult task.

We propose a pseudo-labeling method based on the above-mentioned existing cognitive science approaches, which automatically categorizes gesture data to address the infeasibility of collecting labeled data. To create a data-driven model that can generate gestures for various impressions, a model based on generative adversarial networks (GANs) was trained using pseudo-labeled data. Extroversion was chosen as the objective impression since it has been extensively investigated in cognitive science. From related studies, we summarized gesture characteristics that influence extroversion, such as the speed or amplitude of motion. The gestures in the dataset are divided into different classes as pseudo labels based on these characteristics, and these classes are subsequently employed as additional input in our proposed model. We conducted subjective studies to evaluate the effectiveness of our model, which demonstrated that it could produce gestures that influenced the perceived extroversion of a virtual agent and a humanoid robot.

This paper is an extension of our previous work [22]. First, we improved the motion retargeting used to control the robot and achieved the expected results with it (Sects. 4.2 and 5.1). Second, the previous single-item measure was replaced by four questions covering different aspects of extroversion for a more robust assessment (Sect. 4.3). Finally, we analyzed the obtained results for each aspect of extroversion separately and found that different aspects were affected differently by the motion statistics (Sects. 4 and 5.4).

The remainder of this paper is structured as follows: we present research efforts that are related to the current study in Sect. 2. Our proposed model is described in Sect. 3. We detail our experiment that evaluated our method in Sect. 4, and we discuss our findings in Sect. 5.

2 Related Works

2.1 Deep-Learning-Based Co-Speech Gesture Generation

For the task of speech-driven gesture generation, deep-learning-based models have shown promising performance. Hasegawa et al. [23] found that Mel-Frequency Cepstral Coefficients (MFCCs) of the audio are useful for modeling the relationship between speech and gestures with a long short-term memory (LSTM) network. In terms of the accuracy of the generated gestures, Kucherenko et al. [20, 24] claim that MFCCs are superior to other types of audio features. Yoon et al. [13, 17] suggested that spoken text is useful for modeling the semantic relationship between speech and gesture. In addition to deterministic generation, generative models have enabled probabilistic generation. Taylor et al. [18] combined Generative Flow (GF) and a Variational Auto-Encoder (VAE) to model the conditional gesture distribution. Adversarial training losses have been used alongside the L2 distance to push models toward more realistic motion [25, 26], although these models remain deterministic. The gesture distribution conditioned on speech signals has also been modeled with GANs [21, 27], where prosody features serve as the condition. Prosody features were also used in another gesture generation work [28], in which part-of-speech information extracted from text was simultaneously used to select keywords for gesticulation. Alexanderson et al. [16] advocated explicitly controlling the speed, radius, and height of hand positions in generated gestures using GF-based models, but they did not consider the potential of these controls for human-robot interaction. Additionally, while both GANs and GF are probabilistic models, GF typically has more parameters than the generator of a GAN due to its weak non-linearity, which slows down inference. In this study, we trained a conditional GAN-based model to directly regulate the gesture features related to personality, and we examined its effect on the perceived personality of an avatar and a humanoid robot.

Fig. 1
figure 1

High-level overview of proposed method. Different blocks represent different steps. Leftmost block shows that a speech gesture dataset is divided into three subsets according to gesture features (Sect. 3.1). Middle block shows that our gesture generator takes speech and trait label as input to generate gestures that exhibit different gesture characteristics corresponding to the input trait label, and it is trained using a gesture discriminator (Sect. 3.3). Rightmost block shows that the generated gestures yield different impressions in terms of extroversion (Sect. 4)

2.2 Human Impressions of Robots

Interpersonal relationships and human-robot interactions are greatly facilitated by personality [29,30,31]. One of the most common indicators of an individual’s personality is the level of extroversion or introversion [32]. By altering the size, pace, and frequency of movements, Kim et al. [8] created various personality types for robots, revealing that consumers were more amused and impressed by extroverted robots than by introverted robots. Neff et al. [33] developed a technique for altering the perception of a virtual agent’s extroversion by changing parameters for language generation, gesture pace, and movement. Mileounis et al. [34] examined the relationship between extroversion and social intelligence using NAO, a remotely controlled humanoid robot, and found that extroversion gives the appearance of higher social intelligence. According to Deshmukh’s investigation of the relationship between the speed and amplitude of a gesture and Godspeed scores [35], higher amplitude and speed are associated with higher levels of extroversion and neuroticism. Dou et al. [36] ran tests in real-world shopping settings to examine how robotic speech and movements affected how people perceived the robots’ personalities. Their findings demonstrated that particular instructional movements produced extroverted impressions. A number of studies have investigated the emotions that robots can express through body language in order to influence how humans perceive them during human-robot interaction. In one study [37], participants were shown gestures representing Ekman’s six basic emotions, indicating that social robots can communicate emotions simply by moving their heads and arms. According to Costa et al. [38], individuals can accurately identify emotions after viewing footage of a robot’s facial expressions, and movements were also useful for identifying emotions. Gjaci et al. [39] found cultural differences in gestures by analyzing the impressions of robots with a speech-driven gesture generation model trained on culture-dependent and culture-independent datasets. Certain movement characteristics of the upper body of the robot Pepper were found to correlate with perceived extroversion and introversion through interactive experiments with human movement analysts [40]. Zabala et al. [41] proposed a multimodal non-verbal behavior generation model for Pepper to control emotional expression, which was also claimed to be effective for expressing personality. The key difference between our method and theirs [41] is that they apply gates on the conditional emotion inputs of their model to affect the personality, whereas our method directly associates motion statistics with the expression of extroversion.

The above studies investigated the characteristics that affect how people perceive robots. We found that the factors most frequently discussed in association with extroversion are the speed and amplitude of gestures. It is natural to assume that distinct personalities can be conveyed by altering these two factors. Consequently, we created a system for automatically generating co-speech gestures that can change these characteristics and thereby subtly alter people’s perceptions of robots.

3 Methodology

Our proposed method consists of two main steps. The first step is to heuristically assign a trait label to each sample based on gesture features that are associated with extroversion. The second step involves training an adversarial neural network on the pseudo-labeled samples to obtain a gesture generator that produces gesture sequences from extracted speech features and a trait label. As a result, our model produces gesture sequences whose characteristics correspond to the input trait labels related to extroversion, and it is thus expected to yield different degrees of perceived extroversion among users. An overview of our proposed method is shown in Fig. 1.

3.1 Pseudo-Label

We chose features involving the speed and amplitude of gestures for labeling because they are considered to be related to extroversion. Specifically, speed is defined as the moving distance of all joints in 3D coordinates, averaged over the total duration. Denoting the coordinates of a joint k at time step t as the tuple \((x_{t}^{k}, y_{t}^{k}, z_{t}^{k})\), the speed feature is defined as follows:

$$\begin{aligned} \frac{1}{TK}\sum ^{T}_{t=1}\sum ^{K}_{k=1}\sqrt{ ({x}_{t}^{k} - {x}_{t-1}^{k})^2 + ({y}_{t}^{k} - {y}_{t-1}^{k})^2 + ({z}_{t}^{k} - {z}_{t-1}^{k})^2 } \end{aligned}$$
(1)

where K is the total number of joints and T is the total length of a gesture segment. Amplitude is defined as the average, over the two hands, of the maximum 3D distance between the hand positions at any two time steps, denoted as follows:

$$\begin{aligned} \frac{1}{2} (\max _{i, j \in T} (dist({lh}_{i}, {lh}_{j})) + \max _{i, j \in T} (dist({rh}_{i}, {rh}_{j}))) \end{aligned}$$
(2)

where lh and rh stand for the left and right-hand positions. \(dist(\cdot ,\cdot )\) is defined as

$$\begin{aligned} {dist}(a, b) =\sqrt{ ({a}_{x} - {b}_{x})^2 + ({a}_{y} - {b}_{y})^2 + ({a}_{z} - {b}_{z})^2 } \end{aligned}$$
(3)

where x, y and z are the coordinates along different axes in 3D.
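As an illustration, the two features in Eqs. (1)–(3) can be computed directly from an array of joint positions. The following NumPy sketch shows one way to do this; the array layout and the hand-joint indices (`lh_idx`, `rh_idx`) are illustrative assumptions, not the exact implementation used in our code.

```python
# Minimal sketch of the speed (Eq. 1) and amplitude (Eq. 2) features.
# `joints` is assumed to have shape (T, K, 3): T frames, K joints, 3D positions.
import numpy as np

def speed_feature(joints: np.ndarray) -> float:
    # Mean displacement between consecutive frames, averaged over all joints.
    diffs = np.diff(joints, axis=0)                  # (T-1, K, 3)
    step_dist = np.linalg.norm(diffs, axis=-1)       # (T-1, K)
    return float(step_dist.mean())

def amplitude_feature(joints: np.ndarray, lh_idx: int, rh_idx: int) -> float:
    # For each hand, the maximum distance between its positions at any two
    # time steps; the amplitude is the average of the two maxima.
    def max_pairwise(traj):                          # traj: (T, 3)
        d = np.linalg.norm(traj[:, None, :] - traj[None, :, :], axis=-1)
        return d.max()
    return 0.5 * (max_pairwise(joints[:, lh_idx]) + max_pairwise(joints[:, rh_idx]))
```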

Based on the features described above, we define three categories for gestures as trait labels, namely low, mid, and high. Gestures with low speed and small amplitude are expected to be categorized as low, gestures with high speed and large amplitude as high, and neutral gestures as mid. For this purpose, it is necessary to determine the boundaries that divide all gesture samples into the different categories. We use K-means [42] for this division, with the number of clusters set to three. The distance metric is defined as follows:

$$\begin{aligned} l(x, y) = \frac{1}{2} \sqrt{ ({speed}_{x} - {speed}_{y})^2 + ({amp}_{x} - {amp}_{y})^2 } \end{aligned}$$
(4)

where speed and amp are the gesture features we defined above. Before executing K-means, speed and amplitude features are normalized according to the following equation as they have different scales:

$$\begin{aligned} x_{normalized} = (x - \mu ) / \sigma \end{aligned}$$
(5)

where \(\mu \) is the mean and \(\sigma \) is the standard deviation. After this process, each sample in the dataset is assigned one of the trait labels we defined.
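A minimal sketch of this labeling step with scikit-learn is given below. Since the constant factor in Eq. (4) does not change cluster assignments, the standard Euclidean metric is used; the mapping from cluster index to low/mid/high by mean cluster speed is an illustrative assumption.

```python
# Pseudo-labeling sketch: z-normalize the two features (Eq. 5) and run K-means
# with three clusters, then order clusters so 0/1/2 correspond to low/mid/high.
import numpy as np
from sklearn.cluster import KMeans

def assign_trait_labels(speed: np.ndarray, amp: np.ndarray, seed: int = 0) -> np.ndarray:
    feats = np.stack([speed, amp], axis=1)                       # (N, 2)
    feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)     # Eq. (5)
    km = KMeans(n_clusters=3, random_state=seed, n_init=10).fit(feats)
    order = np.argsort(km.cluster_centers_[:, 0])                # sort by mean speed
    remap = {cluster: rank for rank, cluster in enumerate(order)}
    return np.array([remap[c] for c in km.labels_])              # 0=low, 1=mid, 2=high
```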

A classification based on the data distribution rather than on human perception may not be optimal. However, since manual labeling is infeasible due to the huge amount of work required, our goal is to leverage findings in cognitive science about personality to obtain such labels without requiring humans to perform the labeling. Our experimental results validated the proposed method in that gestures generated with different labels left different impressions on the participants (Sect. 4).

3.2 Speech Feature Extraction

We extract prosody features from the speech signal for gesture generation. The prosody feature consists of the fundamental frequency (F0) and the power of the audio signal. Although the prosody feature does not carry the semantic information that captures the semantic relationship between speech and gestures, it reflects when a person is speaking and the emphasis in the speech. F0 is extracted on a log scale in semitone units using a conventional autocorrelation-based method following Ishi et al. [43], with an estimation interval of 10 ms. The power is in dB units. After extraction, for the speech segment of each sample we obtain a two-dimensional feature vector \(s\in {\mathcal {R}}^{T\times 2}\), where T is the length of the speech segment.
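For illustration, a rough approximation of this extraction is sketched below with librosa; the pYIN estimator stands in for the autocorrelation-based method of [43], and the frequency range and reference pitch are assumptions rather than the settings used in our implementation.

```python
# Rough sketch of prosody feature extraction (F0 in semitones, power in dB)
# at a 10-ms frame interval.
import numpy as np
import librosa

def prosody_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(0.010 * sr)                                        # 10-ms frames
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)
    # Convert F0 to semitones relative to an assumed 55-Hz reference.
    f0_semitone = 12.0 * np.log2(np.where(np.isnan(f0), 1.0, f0) / 55.0)
    f0_semitone[np.isnan(f0)] = 0.0                              # unvoiced frames -> 0
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    power_db = 20.0 * np.log10(rms + 1e-8)
    n = min(len(f0_semitone), len(power_db))
    return np.stack([f0_semitone[:n], power_db[:n]], axis=1)     # shape (T, 2)
```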

Fig. 2
figure 2

Illustration of proposed model

3.3 Gesture Generator

The gesture generator resembles the generator proposed by Wu et al. [21], a neural network based on bi-directional gated recurrent units (bi-GRUs). The network takes as input the speech features described in Sect. 3.2, the previous poses, and a noise vector drawn from the standard Gaussian distribution, and it outputs a sequence of gestures defined as joint rotation Euler angles along the 3D axes. The difference is that the trait label is included as an additional input in our model. The input to our generator thus consists of trait, noise, and prosody. trait is a one-hot vector. noise is a one-dimensional vector randomly drawn from the standard Gaussian distribution. prosody is a two-dimensional vector with a time dimension and is averaged every five frames to match the gesture frame rate of 20 fps. Before being fed into the bi-GRU layers, trait and noise are repeated to the same length as prosody and then concatenated together. The seed pose is also concatenated with this vector. The output is a sequence of vectors whose length is the same as that of prosody. When generating gestures, previously generated gestures are used as seed poses; the initial seed pose is all zeros. The generator is trained against a discriminator using an adversarial loss.
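A minimal PyTorch sketch of such a generator is shown below; the layer sizes, noise dimension, and the exact handling of the seed pose are illustrative assumptions, not the configuration of the released implementation.

```python
# Conditional bi-GRU generator sketch: trait and noise are repeated over time
# and concatenated with prosody and the seed pose before the recurrent layers.
import torch
import torch.nn as nn

class GestureGenerator(nn.Module):
    def __init__(self, pose_dim=36, noise_dim=16, trait_dim=3, hidden=256):
        super().__init__()
        in_dim = 2 + trait_dim + noise_dim + pose_dim   # prosody + trait + noise + seed pose
        self.gru = nn.GRU(in_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, pose_dim)      # joint rotation angles per frame

    def forward(self, prosody, trait, noise, seed_pose):
        # prosody: (B, T, 2); trait: (B, 3) one-hot; noise: (B, noise_dim);
        # seed_pose: (B, T, pose_dim), previously generated poses (zeros initially).
        T = prosody.size(1)
        trait_rep = trait.unsqueeze(1).expand(-1, T, -1)   # repeat along time
        noise_rep = noise.unsqueeze(1).expand(-1, T, -1)
        x = torch.cat([prosody, trait_rep, noise_rep, seed_pose], dim=-1)
        h, _ = self.gru(x)
        return self.out(h)                                 # (B, T, pose_dim)
```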

The discriminator has an additional input for the trait label, which forces it to recognize gestures that do not correspond to the input trait label, thereby optimizing the generator toward producing gestures with the correct characteristics. As input to the discriminator, the gestures, trait, and prosody are concatenated and fed into 1D convolutional layers. The output is a scalar that can be interpreted as an approximation of the Wasserstein distance between the distribution of the generated gestures conditioned on all inputs and that of the ground-truth gestures, and it is used as the learning signal for training the generator. An illustration of our model is shown in Fig. 2.
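A companion sketch of the conditional discriminator (critic) follows; channel sizes and kernel parameters are again illustrative assumptions.

```python
# Conditional critic sketch: gesture, trait (repeated over time) and prosody are
# concatenated channel-wise, passed through 1D convolutions, and reduced to a
# single scalar score per sample.
import torch
import torch.nn as nn

class GestureDiscriminator(nn.Module):
    def __init__(self, pose_dim=36, trait_dim=3, channels=128):
        super().__init__()
        in_ch = pose_dim + trait_dim + 2
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, channels, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),
        )
        self.out = nn.Linear(channels, 1)

    def forward(self, gesture, trait, prosody):
        # gesture: (B, T, pose_dim); trait: (B, 3); prosody: (B, T, 2)
        T = gesture.size(1)
        trait_rep = trait.unsqueeze(1).expand(-1, T, -1)
        x = torch.cat([gesture, trait_rep, prosody], dim=-1).transpose(1, 2)  # (B, C, T)
        return self.out(self.net(x).squeeze(-1))            # (B, 1) critic score
```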

Denoting the generator as G, discriminator as D, all inputs to the discriminator as c, and gestures as x, the loss function for the adversarial training is defined as follows:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{adv} =&\max _{D}\min _{G}\frac{1}{N}\sum _{i=1}^{N}D(x^{(i)}, c^{(i)}) \\&- D(G(c^{(i)}, x_{prev}^{(i)}, z), c^{(i)}) \\&+ \lambda _{GP}(\Vert \nabla _{D}^{(i)}\Vert _{2} - 1)^2 \end{aligned} \end{aligned}$$
(6)

where the third term is the gradient penalty (GP) used for training stability, as proposed by Gulrajani et al. [44] and used by Wu et al. [21]. z is drawn from a standard normal distribution. Following Wu et al. [21], we also include a continuity loss to maintain continuity between consecutive gestures as follows:

$$\begin{aligned} {\mathcal {L}}_{continuity} = \frac{1}{N}\sum _{i=1}^{N}\text {Huber}({\hat{x}}^{(i)}_{:k}, x^{(i-1)}_{-k:}) \end{aligned}$$
(7)

where k is a hyperparameter defining the number of frames considered and \(\text {Huber}(\cdot ,\cdot )\) is the Huber loss [45]. Finally, our total loss function is defined as:

$$\begin{aligned} {\mathcal {L}} = {\mathcal {L}}_{adv} + \alpha {\mathcal {L}}_{continuity} \end{aligned}$$
(8)

where \(\alpha \) is a hyperparameter weighting the continuity loss.
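The sketch below shows how one training step could compute these losses (a WGAN-style critic loss with gradient penalty plus the Huber continuity term). It assumes the generator and discriminator interfaces sketched above and is a simplification of the actual training loop.

```python
# One-step loss computation sketch for Eqs. (6)-(8).
import torch
import torch.nn.functional as F

def gradient_penalty(D, real, fake, trait, prosody):
    # Penalize the critic's gradient norm on interpolated samples [44].
    eps = torch.rand(real.size(0), 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = D(x_hat, trait, prosody).sum()
    grad, = torch.autograd.grad(score, x_hat, create_graph=True)
    return ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def losses(G, D, real, prev, trait, prosody, noise, lambda_gp=10.0, alpha=1.0, k=4):
    fake = G(prosody, trait, noise, prev)
    # Critic: maximize D(real) - D(fake)  ->  minimize the negation, plus GP.
    d_loss = -(D(real, trait, prosody).mean() - D(fake.detach(), trait, prosody).mean()) \
             + lambda_gp * gradient_penalty(D, real, fake.detach(), trait, prosody)
    # Generator: fool the critic while staying continuous with the previous segment.
    continuity = F.huber_loss(fake[:, :k], prev[:, -k:])      # Eq. (7)
    g_loss = -D(fake, trait, prosody).mean() + alpha * continuity
    return d_loss, g_loss
```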

Fig. 3
figure 3

Boxplot of the distribution of the features defined in Sect. 3.1 for different generated gesture groups defined in Sect. 4

4 Experiment

Our experiment is based on the dataset proposed by Takeuchi et al. [12], in which the speech is recorded as wave files at a 44-kHz sampling rate and the gestures are human joint rotation values recorded with a motion-capture toolkit. Although the dataset contains gesture data for the whole body, we used only the 12 upper-body joints to train our model, both because upper-body movements correlate more strongly with speech and impressions in the co-speech gesture generation setting and for comparison with the baseline model. The information on the joints used can be found in Appendix A. The palm and fingers are excluded due to noisy data. The dataset used in our experiment was downloaded from an earlier work’s repository [24]. There are 1047 utterances in total: 957 samples are training data, 45 are validation data, and 45 are test data. In total, 192 min of dataFootnote 1 were used during training at 20 fps.

We trained our model using the proposed method. The length of each gesture segment is set to two seconds. Other hyperparameters and details can be found in our code.Footnote 2 After training, we obtained a model that generates a sequence of joint rotation values from an input audio segment and an input trait label. As a result, there are three generated gesture sequences for each audio segment, which we denote as Gh, Gm, and Gl for trait labels high, mid, and low, respectively. Our comparison also involves the ground-truth gestures and the current state-of-the-art gesture generation model proposed by Wu et al. [21] trained on the same dataset, which we denote as Gg and Gb, respectively. The purpose is to compare the impressions yielded by the ground truth and by gestures generated by a model trained on the full dataset. In total, there are five groups for comparison: Gg, Gb, Gh, Gm, and Gl.

4.1 Effect of Pseudo-Label

The purpose of the pseudo-label is to control the speed and amplitude of the generated motion. To verify whether this has been achieved, we generated motions using each label for every audio sequence in the test set and calculated their speed and amplitude using the equations described in Sect. 3.1. A comparison with the baseline and the ground truth is shown in Fig. 3. The results show that, by inputting different trait labels, our model produces motions with different degrees of speed and amplitude, indicating that our purpose has been achieved.

4.2 Visualization

We used a virtual humanoid avatar and a small humanoid robot for visualization and evaluation. A snapshot of the avatar and robot is shown in Fig. 4. The avatar is a freely downloadable character from the UnityFootnote 3 asset store. To control its motion, the joint rotation values are directly mapped to the nearest joints based on the avatar’s joint configuration, with a few simple transformations including the right-hand-rule to left-hand-rule conversion and a change of rotation order. Each gesture sequence was visualized on the avatar and the video was recorded using Unity’s built-in recording tool. Since there was a short delay in the audio relative to the gestures when using the built-in audio player, we recorded only the gesture videos and attached the corresponding audio manually.

Fig. 4
figure 4

Snapshot of CommU (left) and avatar (right)

The small humanoid robot is CommU, a commercially available robot. It has fewer joints than a human: only pitch and roll for the shoulders, hip, and head, yaw for the head, and one degree of freedom (DoF) for mouth opening, as summarized in Table 1. For shoulder control, we first transform the sequence of joint rotation values into joint positions, including the fingertips, by forward kinematics based on the bone lengths and joint hierarchy provided by the dataset. We then compute the direction from the shoulder to the fingertip of the middle finger and solve for the pitch and roll values by inverse kinematics. For the hip and the head’s yaw and roll, we directly map the joint rotation values to the corresponding joints using transformations similar to those used for the avatar. The mouth motion and the head’s pitch were controlled by a model proposed by Ishi et al. [46]. The videos were recorded with a 4K camera and the audio was attached afterward. A sample video can be found online.Footnote 4
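A simplified sketch of the shoulder retargeting is given below. The axis convention (x forward, y left, z up) is an illustrative assumption, so the exact angle formulas would have to be adapted to the robot's actual coordinate frame.

```python
# Given the 3D positions of a shoulder and the middle fingertip (obtained by
# forward kinematics on the dataset's skeleton), recover a pitch and roll angle
# for the corresponding CommU shoulder joint.
import numpy as np

def shoulder_pitch_roll(shoulder: np.ndarray, fingertip: np.ndarray):
    d = fingertip - shoulder
    d = d / np.linalg.norm(d)                        # unit direction shoulder -> fingertip
    pitch = np.arctan2(d[0], -d[2])                  # lifting the arm forward/backward
    roll = np.arcsin(np.clip(d[1], -1.0, 1.0))       # lifting the arm sideways
    return pitch, roll
```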

Table 1 Joint configuration for CommU
Fig. 5
figure 5

Results of perceived extroversion on avatar. Only the p-values between Gl, Gm, and Gh are annotated. The error bar represents standard error. ***\(p<0.001\). Details of all pair-wise comparison can be found in Table 2 and 3

4.3 User Study

The goal of the user study is to investigate the effectiveness of our method in generating gestures that yield different perceived extroversion for the avatar and for CommU in interactions with humans. To evaluate the perceived extroversion of the avatar and the robot, we designed two similar questionnaires, one for each, which include videos for participants to watch and, for each video, questions on extroversion and on the naturalness of the avatar or robot. There are two types of video. The first type shows the gesturing avatar or robot together with the audio, where the gesture is either generated from the audio or taken from the ground truth. The second type is the muted version of the first. Although only the with-audio condition is fully meaningful, since the avatar and robot are multimodal, the no-audio condition was also included so that, in case no significant difference is found in the with-audio condition, we could still analyze how the generated gestures are perceived without the effect of audio. There are 15 videos for each type. These samples were randomly chosen from the test set, which was not used during training. Based on the above description, our experiment involves four conditions as follows:

  • Condition 1: Avatar performed 5 types of gesture with audio (with-audio).

  • Condition 2: Avatar performed 5 types of gesture without audio (no-audio).

  • Condition 3: CommU performed 5 types of gesture with audio (with-audio).

  • Condition 4: CommU performed 5 types of gesture without audio (no-audio).

The questions are phrased as ‘How [aspect] do you think the avatar/robot is?’, where [aspect] is one of the following: (a) sociable, (b) enthusiastic, (c) reserved, and (d) quiet, following Ludewig et al. [47] for evaluating the extroversion of robots. The question evaluating naturalness is ‘How natural is the gesture of the avatar/robot?’ After watching a video, participants rated each aspect from one to seven, where one represents strongly negative, seven represents strongly positive, and four represents neutral. The final perceived extroversion score is calculated as:

$$\begin{aligned} (a + b + (8 - c) + (8 - d)) / 4 \end{aligned}$$
(9)

where a, b, c and d represent the score of each specific question.

There are five sections in the questionnaire. The first section includes an explanation of the purpose and steps of our experiment for obtaining consent, as well as questions about the participant’s gender, age, and the current time. In sections two and three, the with-audio and no-audio videos are evaluated, respectively. At the beginning of each of these sections, an attention-check question is included for data screening. At the end of the experiment, participants are asked to record the current time once again so that the completion duration can be calculated. Each participant is paid 1200 JPY (approx. 9 USD) via the crowdsourcing company for completing a questionnaire. Since our experiment involves human subjects, we acquired approval from the ethics committee of the Advanced Telecommunications Research Institute International (ATR, review number 22-605).

Table 2 Pair-wise comparison for condition 1
Table 3 Pair-wise comparison for condition 2
Fig. 6
figure 6

Results of the four aspects used in the questionnaire on avatar in condition 1. Error bar represents standard error. Only the p-values between Gl, Gm, and Gh are annotated. *\(p<0.05\), **\(p<0.01\), ***\(p<0.001\). (sociable: Gg: \(M=5.02\), \(SE=0.09\). Gb: \(M=4.39\), \(SE=0.15\). Gl: \(M=2.24\), \(SE=0.15\). Gm: \(M=3.25\), \(SE=0.14\). Gh: \(M=5.17\), \(SE=0.13\). enthusiastic: Gg: \(M=4.97\), \(SE=0.12\). Gb: \(M=4.24\), \(SE=0.13\). Gl: \(M=2.37\), \(SE=0.14\). Gm: \(M=3.24\), \(SE=0.11\). Gh: \(M=5.15\), \(SE=0.12\). reserved: Gg: \(M=3.42\), \(SE=0.14\). Gb: \(M=3.82\), \(SE=0.16\). Gl: \(M=5.68\), \(SE=0.16\). Gm: \(M=4.76\), \(SE=0.14\). Gh: \(M=2.86\), \(SE=0.16\). quiet: Gg: \(M=3.39\), \(SE=0.13\). Gb: \(M=3.89\), \(SE=0.15\). Gl: \(M=5.42\), \(SE=0.18\). Gm: \(M=4.62\), \(SE=0.13\). Gh: \(M=2.82\), \(SE=0.15\).)

4.4 Results

Throughout the experiment, 50 participants were recruited for each questionnaire via a crowdsourcing service.Footnote 5 The statistical testing procedure is as follows. For the results of all groups (Gg, Gb, Gh, Gm, and Gl) in each experimental condition, we first performed the Shapiro-Wilk test and Bartlett’s test to check the normality and the equality of variance of the obtained data. If all groups passed the normality test and had equal variances, a one-way analysis of variance (ANOVA) was used to test whether all group means are equal, followed by Tukey’s honestly significant difference (Tukey-HSD) test for multiple comparisons. Otherwise, the non-parametric Kruskal–Wallis one-way ANOVA was used, followed by Dunn’s test for post-hoc multiple comparisons. The alpha level for all statistical tests was set to 0.05. Finally, Cohen’s d was calculated as the effect size to indicate how large the differences are.
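This procedure can be sketched with SciPy and scikit-posthocs as follows; the choice of libraries is an assumption for illustration, not the exact analysis scripts used.

```python
# Omnibus test + post-hoc comparison sketch following the procedure above.
import numpy as np
from scipy import stats
import scikit_posthocs as sp

def compare_groups(groups: dict, alpha: float = 0.05):
    samples = list(groups.values())                          # e.g. {'Gg': scores, ...}
    normal = all(stats.shapiro(g).pvalue > alpha for g in samples)
    equal_var = stats.bartlett(*samples).pvalue > alpha
    if normal and equal_var:
        omnibus = stats.f_oneway(*samples)                   # one-way ANOVA
        posthoc = stats.tukey_hsd(*samples)                  # Tukey-HSD (recent SciPy)
    else:
        omnibus = stats.kruskal(*samples)                    # Kruskal-Wallis
        posthoc = sp.posthoc_dunn(samples)                   # Dunn's test
    return omnibus, posthoc

def cohens_d(a, b):
    pooled = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2)
    return (np.mean(a) - np.mean(b)) / pooled
```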

Table 4 Pair-wise effect size for different aspects on avatar in condition 1

4.4.1 Avatar

We first report the statistics of the participants for conditions 1 and 2. The average age of the participants is 36 with a standard deviation of 8. Half of them are male. The average completion time is 17 min with a standard deviation of 5 min. All of them passed the data screening question.

The obtained results and their analysis are described in the following. The results of condition 1 are shown in Fig. 5a. Although all groups have equal variance (\(p=0.14\)), the obtained scores for Gl are not normally distributed (\(p<0.05\)); thus the Kruskal–Wallis test and Dunn’s test were used. Not all group means are equal (\(p<0.001\)). As expected, the perceived extroversion increases monotonically across Gl, Gm, and Gh (\(p<0.05\), \(d=0.96\) between Gl and Gm; \(p<0.001\), \(d=2.33\) between Gm and Gh), indicating that our model produces three levels of extroversion on the avatar. The most extroverted group, Gh, is not significantly more extroverted than Gg (\(p=0.18\)), meaning that our model can only weaken the perceived extroversion. The perceived extroversion of Gb differs from that of all generated groups (\(p<0.001\) for Gh, Gm, and Gl), showing that training the model on all data has a different effect than training on a subset. Additionally, there is a significant difference between Gg and Gb (\(p<0.05\), \(d=0.68\)), suggesting that the baseline model failed to replicate the perceived extroversion of the ground truth. Based on the analysis above, hypothesis 1 was well supported.

The results of condition 2 are shown in Fig. 5b. Although all groups have equal variance (\(p=0.93\)), the obtained scores for Gl are not normally distributed (\(p<0.001\)); thus the Kruskal–Wallis test and Dunn’s test were used. Not all group means are equal (\(p<0.001\)). Similarly to condition 1, the perceived extroversion increases across Gl, Gm, and Gh (\(p<0.001\); \(d=1.24\) between Gm and Gl, \(d=2.02\) between Gh and Gm). When not combined with audio, Gh is perceived as more extroverted than Gg (\(p<0.05\), \(d=0.62\)), showing that without the effect of audio, our model succeeded in extrapolating the perceived extroversion. The perceived extroversion of Gb is not significantly different from that of Gm (\(p=0.1\)), showing that training the model on a subset yields results similar to training on the whole dataset when not combined with audio. Additionally, similarly to condition 1, there is a significant difference between Gg and Gb (\(p<0.005\), \(d=0.86\)). Based on these observations, hypothesis 2 was also supported.

Fig. 7
figure 7

Results of perceived naturalness on avatar in condition 1. The error bar represents the standard error. ***: \(p<0.001\). (Gg: \(M=5.7\), \(SE=0.12\). Gb: \(M=4.84\), \(SE=0.14\). Gl: \(M=4.67\), \(SE=0.15\). Gm: \(M=4.83\), \(SE=0.12\). Gh: \(M=4.92\), \(SE=0.13\))

Fig. 8
figure 8

Results of perceived extroversion on CommU. Only the p-values between Gl, Gm, and Gh are annotated. The error bar represents standard error. ***\(p<0.001\). Details of all pair-wise comparison can be found in Table 5 and 6

Table 5 Pair-wise comparison for condition 3
Table 6 Pair-wise comparison for condition 4

We additionally report the results for the four extroversion-related aspects used in our questionnaire (Fig. 6). The statistical tests were conducted independently for each aspect. Overall, the results are similar to those obtained when combining the aspects. While the evaluation scores increase for sociable and enthusiastic, they decrease for reserved and quiet. This indicates that our model controls the impression in terms of each aspect. However, we also note that some aspects were affected differently from others, as shown in Table 4. Specifically, between Gl, Gm, and Gh, the absolute effect sizes for reserved and quiet are generally smaller than those for sociable and enthusiastic. Also, the effect on enthusiastic appears to be larger than on the other three aspects.

The results of the naturalness evaluation in condition 1 are shown in Fig. 7. Although the naturalness of Gg is significantly higher than that of all other groups, no significant difference was found among the generated groups. This ensures that the evaluation results on extroversion are not affected by naturalness.

4.4.2 CommU

Fig. 9
figure 9

Results of the four aspects used in the questionnaire on CommU in condition 3. Error bar represents standard error. Only the p-values between Gl, Gm, and Gh are annotated. *\(p<0.05\), **\(p<0.01\), ***\(p<0.001\). (sociable: Gg: \(M=4.86\), \(SE=0.1\). Gb: \(M=4.35\), \(SE=0.1\). Gl: \(M=3.65\), \(SE=0.12\). Gm: \(M=4.33\), \(SE=0.13\). Gh: \(M=4.44\), \(SE=0.13\). enthusiastic: Gg: \(M=4.84\), \(SE=0.1\). Gb: \(M=4.23\), \(SE=0.09\). Gl: \(M=3.51\), \(SE=0.11\). Gm: \(M=4.41\), \(SE=0.11\). Gh: \(M=4.67\), \(SE=0.1\). reserved: Gg: \(M=3.27\), \(SE=0.12\). Gb: \(M=3.86\), \(SE=0.1\). Gl: \(M=4.61\), \(SE=0.16\). Gm: \(M=3.47\), \(SE=0.14\). Gh: \(M=3.16\), \(SE=0.14\). quiet: Gg: \(M=3.38\), \(SE=0.12\). Gb: \(M=3.87\), \(SE=0.1\). Gl: \(M=4.56\), \(SE=0.16\). Gm: \(M=3.56\), \(SE=0.14\). Gh: \(M=3.18\), \(SE=0.13\).)

Table 7 Pair-wise effect size for different aspects on CommU in condition 3

The statistics of the participants for conditions 3 and 4 are as follows. The average age of the participants is 36 with a standard deviation of 8. The average completion time is 22 min with a standard deviation of 6 min. One participant failed the data screening and was excluded. Another participant took an excessively long time (141 min) to complete the questionnaire and was also excluded. In total, we have 24 males and 25 females in the experiment.

The results of condition 3 are shown in Fig. 8a. Although all groups have equal variance (\(p=0.66\)), the obtained scores for Gg and Gb are not normally distributed (\(p<0.01\)); thus the Kruskal–Wallis test and Dunn’s test were used. Not all group means are equal (\(p<0.001\)). The perceived extroversion of Gl is significantly lower than that of Gm (\(p<0.001\), \(d=0.82\)). However, no significant difference was found between Gm and Gh (\(p=0.053\)), which indicates that our model can produce two levels of extroversion on CommU. This differs from the avatar, and we discuss it in Sect. 5.2. The most extroverted group, Gh, is not significantly more extroverted than Gg (\(p=0.26\)), meaning that our model can only weaken the perceived extroversion for CommU. The perceived extroversion of Gb is not significantly different from that of Gm (\(p=0.27\)), showing that training the model on a subset achieved results similar to training on the whole dataset in the case of CommU. Additionally, there is a significant difference between Gg and Gb (\(p<0.01\), \(d=0.52\)), suggesting that the baseline model failed to replicate the perceived extroversion of the ground truth, which is consistent with the results obtained for the avatar. Hypothesis 3 was partially supported based on the observations above.

The results of condition 4 are shown in Fig. 8b. Although all groups have equal variance (\(p=0.12\)), the obtained scores for Gb, Gl, and Gh are not normally distributed (\(p<0.05\)); thus the Kruskal–Wallis test and Dunn’s test were used. Not all group means are equal (\(p<0.001\)). The perceived extroversion of Gl is lower than that of Gm (\(p<0.001\), \(d=0.8\)). In contrast to condition 3, the perceived extroversion of Gm is also lower than that of Gh (\(p<0.001\), \(d=0.69\)). This indicates that our model produces three levels of perceived extroversion on CommU when not combined with audio. Gh is perceived as more extroverted than Gg (\(p<0.01\), \(d=0.36\)), showing that without the effect of audio, our model also succeeded in extrapolating the perceived extroversion for CommU. As when combined with audio, the perceived extroversion of Gb is not significantly different from that of Gm (\(p=0.55\)). Moreover, there is no significant difference between Gg and Gb (\(p=0.07\)), meaning that the model trained on the whole dataset can yield perceived extroversion similar to the ground truth. Hypothesis 4 was supported by the above analysis.

Additionally, we report the results for the four extroversion-related aspects used in our questionnaire (Fig. 9). The statistical tests were conducted independently for each aspect. The results are similar to those obtained when combining the aspects, which indicates that our model can control all four aspects for CommU by using Gl and one of Gm or Gh. However, the results also show different effects on different aspects, as shown in Table 7. Specifically, the effect on sociable appears to be smaller than on the other aspects. Also, although none of the four aspects shows a significant difference between Gm and Gh, the scales of their effect sizes differ.

The results of the naturalness evaluation in condition 3 are shown in Fig. 10. Although the naturalness of Gg is significantly higher than that of all other groups, no significant difference was found among the generated groups. This ensures that the evaluation results on extroversion are not affected by naturalness.

Fig. 10
figure 10

Results of perceived naturalness on CommU in condition 3. The error bar represents the standard error. ***: \(p<0.001\). (Gg: \(M=5.21\), \(SE=0.11\). Gb: \(M=4.31\), \(SE=0.11\). Gl: \(M=4.05\), \(SE=0.12\). Gm: \(M=4.33\), \(SE=0.12\). Gh: \(M=4.03\), \(SE=0.12\))

Fig. 11
figure 11

Illustration of motion retargeting from avatar to CommU. Blue arrows indicate the coordinate system on which the joint position data of the avatar lie and around which the shoulder of CommU can rotate. (Color figure online)

5 Discussion

5.1 Motion Retargeting

It is crucial to develop a proper motion retargeting method for the robot since its rigging configuration may differ from the avatar’s. In the current experiment, the results obtained on the avatar and CommU are similar in the no-audio condition, i.e., when participants evaluated only the motions, indicating that our motion retargeting from the avatar to CommU was successful.

In contrast, in our previous work [22] we obtained different results for the avatar and CommU in the no-audio condition, which suggests a possible failure in motion retargeting. Previously, the shoulder actuators of CommU were controlled by computing the vector from the shoulder to the wrist to estimate the pitch and yaw for the shoulder, considering that CommU has no DoF for the elbows and wrists. Consequently, the wrist position was retargeted to the hand position of CommU, resulting in different hand trajectories for CommU and the avatar, which were thus perceived differently by users. To improve this, we developed a more suitable motion retargeting for CommU using the vector from the shoulder to the fingertips, as described in Sect. 4.2. An illustration is shown in Fig. 11. Our results demonstrate the importance of implementing an appropriate motion retargeting algorithm. Note that we did not compare the previous and current retargeting directly because the previous retargeting was not able to resemble the avatar’s motion well, as shown in Fig. 11.

5.2 Difference Between Avatar and CommU

Although we made a great effort to reproduce the avatar’s gestures on CommU and ensured that the hand trajectories and head motion are almost the same, the results for the avatar and CommU still differ. As reported in Sect. 4.4, while the perceived extroversion of Gh is higher than that of Gm for the avatar, they did not differ for CommU. To investigate this difference, we performed additional analysis by comparing the results of the no-audio condition with those of the with-audio condition.

For both the avatar and CommU, the perceived extroversion increases when combined with audio compared with the no-audio condition. This can be verified by comparing the paired differences between the two conditions. Treating the perceived extroversion obtained from each video for one participant and that of its muted version for the same participant as a paired sample, we calculate the differences for all pairs. The perceived extroversion in the with-audio condition is higher than that in the no-audio condition for the avatar (\(p<0.001\), \(d=0.42\)). Likewise, for CommU, the perceived extroversion in the with-audio condition is higher than that in the no-audio condition (\(p<0.001\), \(d=0.28\)). It is reasonable to conclude that in our experiment, including audio increases the perceived extroversion for both the avatar and CommU.
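A sketch of this paired comparison is given below; the specific paired test is not named above, so a Wilcoxon signed-rank test is used here as one reasonable choice, and the score arrays are assumed to be aligned per (participant, video) pair.

```python
# Paired with-audio vs. no-audio comparison sketch.
import numpy as np
from scipy import stats

def paired_audio_effect(with_audio: np.ndarray, no_audio: np.ndarray):
    diff = with_audio - no_audio
    test = stats.wilcoxon(with_audio, no_audio)          # paired non-parametric test
    d = diff.mean() / diff.std(ddof=1)                   # effect size for paired samples
    return test.pvalue, d
```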

Table 8 Results of the comparison between the simulated neutral group and experimental groups

Furthermore, by comparing the perceived extroversion of Gh between the with-audio and no-audio conditions, we found that while the perceived extroversion of the avatar in the with-audio condition was higher than in the no-audio condition (\(p<0.001\), \(d=0.47\)), there was no significant difference for CommU (\(p=0.99\)). This does not follow the expectation that the perceived extroversion of CommU should also increase when combined with audio, especially considering that the perceived extroversion increased for all other groups (Gg, Gb, Gl, and Gm) on CommU when combined with audio (\(p<0.05\)), and that there is a significant difference between Gm and Gh for CommU in the no-audio condition.

We suspect that the upper limit of the perceived extroversion of CommU has been reached. The physical constraints of CommU (fewer joints and DoFs, lower maximum speed, smaller reachable space) can limit its expressiveness, whereas these are not problems for the avatar. Consequently, the results differ between the avatar and CommU. Although we believe that a more flexible robot would likely yield results closer to the avatar’s, further investigation is necessary and we leave this as future work.

5.3 Extrovert or Introvert?

While comparing the perceptions of the various groups reveals which group is seen as more or less extroverted than the others, it is not clear whether a given group is seen as extroverted or introverted. The design of our questionnaire can help to clarify this. Each video is rated by our participants on a scale of 1 to 7, with 1 denoting strongly negative and 7 denoting strongly positive, as explained in Sect. 4.3. A score of 4 should therefore be considered neutral in our experiment because it is neither positive nor negative. Accordingly, an extroverted or introverted impression should be represented by a perceived extroversion score that is higher or lower than 4, respectively.

We can simulate a neutral group whose perceived extroversion follows a normal distribution with a mean of 4 and a standard deviation of 1, from which we draw samples to statistically test whether the perceived extroversion of each group is higher or lower than 4. The number of samples is equal to the number of samples gathered in our experiment. We test the difference between the simulated group and all other groups following the test procedure outlined in Sect. 4.4. The results, summarized in Table 8, demonstrate that our model is capable of producing gestures that are perceived as extroverted or introverted, for both the avatar and CommU, when different trait labels are input.
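The following sketch illustrates the idea with a simplified two-group comparison; a Mann-Whitney U test stands in here for the full procedure of Sect. 4.4, and the random seed and sample size handling are assumptions for illustration.

```python
# Compare one experimental group against a simulated neutral group ~ N(4, 1).
import numpy as np
from scipy import stats

def compare_to_neutral(group_scores: np.ndarray, seed: int = 0):
    rng = np.random.default_rng(seed)
    neutral = rng.normal(loc=4.0, scale=1.0, size=len(group_scores))
    res = stats.mannwhitneyu(group_scores, neutral, alternative='two-sided')
    direction = 'extroverted' if group_scores.mean() > 4.0 else 'introverted'
    return res.pvalue, direction
```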

5.4 Different Aspects of Extroversion

In the experiment, we utilized four questions representing different aspects of extroversion for the evaluation. Overall, averaging over all aspects and analyzing each aspect independently yield similar findings. Both demonstrate that our model is capable of controlling extroversion as well as the four specific aspects. However, it is important to note that certain aspects are affected differently than others.

For instance, when examining the avatar’s results (Table 4), between Gl, Gm, and Gh, the absolute effect sizes for reserved and quiet are generally smaller than those for sociable and enthusiastic. Also, the effect on enthusiastic appears to be larger than on the other three aspects. For CommU’s results (Table 7), the effect on sociable appears to be smaller than on the other aspects. Also, although none of the four aspects shows a significant difference between Gm and Gh, the scales of their effect sizes differ.

These findings indicate that our method has distinct effects on different aspects of extroversion. One possible explanation is that the relationship between the chosen features and the various aspects of extroversion may have different scales or may only impact specific aspects. Recent studies, including our own, have primarily focused on the average of different extroversion aspects, overlooking the potential diverse effects on each individual aspect.

Future work should investigate how different aspects of extroversion are influenced differently and develop methods for fine-grained control over these aspects. This will contribute to enhancing the modeling and understanding of extroversion.

6 Conclusion

Leveraging the flexibility of data-driven techniques, we introduced a conditional GAN-based co-speech gesture generation model that utilizes cognitive heuristics. Our method effectively controlled the perception of extroversion, as evidenced by experimental results with both an avatar and a humanoid robot using four questions related to distinct aspects of extroversion. Intriguingly, while the perceived extroversion varied with different input labels, we observed that separate aspects of extroversion responded differently. Our findings suggest that the extensive data collection typically associated with deep learning can be mitigated using heuristic methods that incorporate insights from cognitive science. This underscores the potential of melding learning techniques with the principles of cognitive science.