Personality trait estimation in group discussions using multimodal analysis and speaker embedding

The automatic estimation of personality traits is essential for many human–computer interface (HCI) applications. This paper focused on improving Big Five personality trait estimation in group discussions via multimodal analysis and transfer learning with the state-of-the-art speaker individuality feature, namely, the identity vector (i-vector) speaker embedding. The experiments were carried out by investigating the effective and robust multimodal features for estimation with two group discussion datasets, i.e., the Multimodal Task-Oriented Group Discussion (MATRICS) (in Japanese) and Emergent Leadership (ELEA) (in European languages) corpora. Subsequently, the evaluation was conducted by using leave-one-person-out cross-validation (LOPCV) and ablation tests to compare the effectiveness of each modality. The overall results showed that the speaker-dependent features, e.g., the i-vector, effectively improved the prediction accuracy of Big Five personality trait estimation. In addition, the experimental results showed that audio-related features were the most prominent features in both corpora.


Introduction
The aspects of nonverbal communication have become important focuses in human-computer interaction (HCI) studies. This is because nonverbal aspects are naturally delivered in human-to-human communication. When we interact with other people, we consider not only what they are saying but also how they are speaking. If nonverbal aspects were not considered, communication would become very unnatural or robot-like. The study of nonverbal aspects, e.g., personality, has attracted much attention in HCI. Personality extensively influences human life, in areas such as decision making, preferences, and reactions. It comprises the patterns of the habitual behaviors, emotions, and cognition of a person [34]. We could achieve a better understanding of ourselves and other people around us by understanding personality.
The integration between personality science and HCI studies has been emerging since the mid-2000s, and thus, the term personality computing (PC) was established as a research field [33,53]. Vinciarelli and Mohammadi [53] argued that three phenomena fuel PC from a technological perspective: (1) the availability of personal information in social networks, (2) the possibility of data collection via mobile technology on a daily basis, (3) the consideration of social and affective intelligence in the computing and machinery research. Subsequently, three major problems are addressed in PC, i.e., automatic personality recognition, perception, and synthesis tasks [53]. The features for these tasks are extracted from personality-expressive signals, such as behavioral modalities from various data sources [33].
The most popular and influential personality taxonomy is the Big Five personality trait system [31,34]. Since it is relatively stable in time as well as applicable across various cultures and trait measures, the Big Five personality trait system is accepted in a wide range of areas, including in PC [33,34,53]. This measurement classification system comprises five traits:
1. Openness to experience (O): the degree of being curious and inventive;
2. Conscientiousness (C): the degree of being efficient and organized;
3. Extraversion (E): the degree of being energetic, active, and outgoing;
4. Agreeableness (Ag): the degree of being cooperative and compassionate;
5. Neuroticism (N): the degree of being sensitive and nervous.
In an earlier study, manual assessment was conducted by using a standardized factor analysis of personality description questionnaires to determine one's Big Five personality traits. However, this type of manual assessment is very costly and thus not applicable for HCI interfaces. Accordingly, automatic personality trait assessment studies have attracted great attention in recent years. Several techniques have been proposed from the perspectives of various modalities for automatic personality trait estimation. For instance, personality detection studies based on facial expression analysis were reviewed in [17] using image processing techniques. Concurrently, studies on speech personality trait recognition have also progressed in the speech research community, especially since the Interspeech 2012 Speaker Trait Challenge was released [42]. Other approaches using language models have also been widely employed to estimate personality traits, such as those derived from conversations through social media [10,56]. Instead of focusing on one modality, several studies have also used multimodal analysis to infer personality traits [9,22,25,29].
Despite the growth in the number of automatic personality detection studies, the reliability of detection performance is still far from ideal. Most of the existing studies focused on inferring individually perceived personality traits in self-presentation scenarios [5,42], which is not ideal for representing personality. McCrae and Costa (1996) reported that personality shows the basic tendencies of a person, particularly in dealing with social interactions [31]. Manifesting personality traits in interactions is more meaningful than self-presentation.
In recent years, several studies have considered the automatic inference of personality traits from interaction processes, such as small group interactions [19,22,25,29]. Okada et al. [29] proposed a personality trait estimation method based on a co-occurrent multimodal event discovery approach using the audio-visual (AV) subset of a group meeting from the Emergent Leadership (ELEA) corpus (ELEA-AV). Subsequently, the study of Kindiroglu et al. [19] demonstrated a multidomain and multitask approach for predicting the extraversion and leadership traits in the ELEA corpus. Additionally, prior work in [25] focused on personality trait estimation by using multimodal features and communication skills indices for datasets with multiple discussion types.
Many transformer-based methods and various types of multimodal fusion techniques have been proposed for solving various computing tasks [21]. Most of the methods required a large-scale dataset which is difficult to fulfill for analyzing a social interaction, such as the main task addressed in this study. With a relatively smaller size of data, we focus on how to handle individuality features and how to mitigate the issue of individual differences in more diverse group discussion corpora (different language and environment settings).
This study aims to address two novel points. First, we investigate the relationships between the state-of-the-art speaker individuality feature extracted from speech, namely, the identity vector (i-vector), and the Big Five personality traits. Our hypothesis is that speaker individuality is interrelated with personality. Second, we investigate the effectiveness of multimodal features regardless of the selected language. In this study, we consider two group discussion datasets, including the Multimodal Task-Oriented Group Discussion (MATRICS) corpus (in Japanese) and the ELEA-AV corpus (in European languages), to infer the Big Five personality traits. By the end of this paper, we will discuss the following key questions: 1. Is the speaker individuality feature effective for inferring the Big Five personality traits? 2. What are the effective multimodal features for estimating the Big Five personality traits for the MATRICS and ELEA-AV corpora?
The rest of this paper is organized as follows. Section 2 describes works that are closely related to this study. In Sect. 3, we introduce the utilized multimodal corpora. Subsequently, we describe the employed feature representation approach in Sect. 4. The experimental settings and results are summarized in Sect. 5. In Sect. 6, we discuss the results and answer the key questions in this study. Finally, this paper is concluded in Sect. 7.

Related work
Automatic personality computing is useful for many HCI applications because it can model the relationships between stimuli and the outcomes of social perception processes. In other words, an automatic personality computing method  [35]. Their study aimed to automatically predict personality traits obtained from self-reported questionnaires. In [1], Aran et al. investigated video blogs in a small group meeting to predict personality traits, especially extraversion traits. Another work from Jayagopi and Gatica-Perez [18] attempted to propose a solution for predicting group performance and personality traits by mining typical behavioral patterns. Subsequently, a mining approach for extracting co-occurrent events from multimodal time-series data for personality estimation was also proposed by Okada et al. [29]. Batrinca et al. [6] conducted a comparative analysis to observe the difference between the personality trait recognition accuracies obtained for a human-machine interaction (HMI) scenario and a human-human interaction (HHI) scenario.
In addition to the studies mentioned above, several studies specifically focused on improving Big Five personality trait prediction. For instance, Fang et al. [16] used three nonverbal features, including intrapersonal features, dyadic features, and one-vs.-all features, to predict the Big Five model. Lin et al. [22] developed a Big Five predictor based on the use of an interaction mechanism in bidirectional long short-term memory (BLSTM) to model the vocal behaviors of participants. In the prior study [25], communication skills and task types were considered for estimating the Big Five personality traits.
Our current work differs significantly from the existing studies in terms of the utilized features and dataset dependency. In most studies, low-level features were extracted for the estimation process. We consider the transfer learning technique by extracting higher-level features using state-of-the-art pretrained speaker embedding models (i-vector and x-vector extractors [13,47]). To ensure the effectiveness of our proposed system regardless of the selected language, we use two different language corpora, i.e., a European language corpus and a Japanese corpus (Table 1). Figure 1 presents an overview of the utilized multimodal group discussion corpora. The MATRICS corpus was used as the main dataset for analyzing the effectiveness of each modality. In addition, the ELEA-AV corpus was used to analyze the speaker individuality features as audio-related features despite the different nature of this dataset.

MATRICS corpus
The MATRICS corpus is a Japanese group discussion dataset introduced in [28]. Forty participants were involved in ten uniformly distributed discussion groups (four participants in each discussion group). The MATRICS corpus consists of multimodal raw data, i.e., audio data, video data, and head motion data. In addition, reliable manual transcriptions and assessments of the Big Five personality traits and communication skills are also available. The audio data were recorded via an Audio-Technica HYP-190H hands-free head-worn microphone. In contrast, the video data were recorded using two SONY HDR-CX630V cameras that captured two opposite angles of the group interaction overview. The head motion data were recorded by ATR-Promotions WAA-010 accelerometers.
The Big Five personality trait scores in the MATRICS corpus were obtained from a survey, while the communication skills were annotated by 21 human resource management experts using the recorded video data. The communication skills annotations presented in [30] contained five different indices: listening attitude (LA), smooth interaction (SI), aggregation of opinions (AO), communicating one's own claim (CC), and logical and clear presentation (LP). The overall total score was also calculated as the total communication (TC) index. Each annotator assessed all the communication skills indices of each participant from the given segmented video sessions. The reliability of the assessment was confirmed by the level of agreement among the annotators with Cronbach's alpha (α) and the Pearson correlation coefficient (ρ), except for LA (with α < 0.85 and ρ = 0.59).
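As an illustration of the agreement measure mentioned above, the following minimal sketch computes Cronbach's alpha with its standard formula. Treating each annotator as one "item" and each participant as one row is an assumption made for this example, and the function name `cronbach_alpha` and the toy ratings are hypothetical.

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for a (n_participants, n_annotators) rating matrix.

    alpha = k / (k - 1) * (1 - sum(per-annotator variances) / variance of row sums)
    """
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                              # number of annotators
    annotator_vars = ratings.var(axis=0, ddof=1)      # variance of each annotator's scores
    total_var = ratings.sum(axis=1).var(ddof=1)       # variance of the summed scores
    return k / (k - 1) * (1.0 - annotator_vars.sum() / total_var)

# Toy ratings: three annotators who agree perfectly on three participants.
perfect_agreement = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha_perfect = cronbach_alpha(perfect_agreement)

# Toy ratings with some disagreement among annotators.
noisy = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [5, 4, 4]]
alpha_noisy = cronbach_alpha(noisy)
```

With identical annotator columns, the formula returns exactly 1; any disagreement lowers the value.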
Unlike the other group discussion datasets with only one discussion task available per group, such as the ELEA corpus [38], the MATRICS corpus consists of three different tasks for each discussion group. These tasks are distinct in terms of freedom and the scope of the given prior information regarding the conversation structure. The freedom levels of task-1, task-2, and task-3 are ordered from low to high, whereas the amount of given preliminary information is ordered from more to less. The details of the discussion topic for each task are described as follows:
1. task-1 (in-basket): the selection of an invited guest for a school festival;
2. task-2 (a case study with prior information): preparation of a food and beverage booth at a school festival;
3. task-3 (a case study without prior information): arrangement of a two-day travel itinerary in Japan for a foreign friend.

ELEA-AV corpus
In addition to using the MATRICS corpus, we used the AV subset from the ELEA corpus [38] to check the effectiveness of speaker individuality features. This subset includes recordings from 27 group meetings with 102 participants. Each recording has a length of 15 minutes. The task in the ELEA corpus is known as a winter survival task. In this task, the participants had to order 12 different items to bring with them as if they were the survivors of an airplane crash that occurred in winter. This corpus originally aimed to analyze emergent leadership in group discussions. Nevertheless, this corpus also provided both self-assessed and perceived Big Five personality trait scores for each participant. Therefore, the Big Five estimation model could be constructed using this corpus. We aimed to verify whether speaker individuality features, as audio-related features, could be practical in more general cases (regardless of the different characteristics of the MATRICS and ELEA-AV corpora).

Feature representation
In this study, we extracted three modality groups (i.e., audio, language, and motion & visual groups) and communication skills indices as the inputs for Big Five estimation. Table 2 shows a summary of the multimodal features.

Audio-related features
In prior work [25], audio-related features were obtained by OpenSMILE [15], which was configured for perceived speaker traits in the Interspeech 2012 Speaker Trait Challenge proposed by Schuller et al. [42]. Unlike prior work, we aimed to thoroughly analyze the effectiveness of audio-related features specifically for Big Five personality trait estimation in group discussions. Accordingly, five categories of audio-related features were extracted in this study, including speaker identity features, spectral-related features, voice-related features, energy-related features, and turn-taking features.
Speaker identity features-We aimed to investigate whether the features related to speaker identity could contribute to the performance of an automatic Big Five personality trait estimator. Accordingly, we extracted the i-vector and x-vector features in this study. The i-vector subspace modeling approach introduced by Dehak and Shum [13] has become the state-of-the-art technology in speaker recognition systems. In the i-vector approach for speaker recognition [12,13], a low-dimensional vector that is extracted using joint factor analysis (JFA) represents a speech segment. This approach has been reported to reduce high-dimensional sequential speech data to a lower-dimensional fixed-length vector representation that contains more relevant information. Figure 2 shows the simplified block diagram of the i-vector extraction process.
In the former i-vector modeling approach, the assumption of a Gaussian feature distribution was made; however, this is not always applicable in practice. Thus, a DNN model was developed to address this issue [45]. Subsequently, to improve the robustness of the i-vector obtained with the DNN model, the process of obtaining an i-vector from a DNN with embedding layers was proposed by Snyder et al. [46,47]. This i-vector is also known as an x-vector [47]. The architecture of the x-vector extractor is shown in Fig. 3. We utilized the pretrained VoxCeleb [27] i-vector and x-vector models provided by David Snyder that are available in the Kaldi toolkit [37,47]. These pretrained models were constructed using Mel-frequency cepstral coefficients (MFCCs) as their input features.
Before extracting an i-vector or x-vector using the pretrained models, we selected the "long" utterances (utterances with lengths of more than 3 s) of each speaker in a session (one instance). The speaker individuality vector for an instance was then defined as the average of the individuality vectors derived from all "long" utterances. This preprocessing step was conducted to ensure the reliability of the extracted vector. Figure 4 shows the principal components of the x-vectors extracted from five speakers (MATRICS corpus) in three-dimensional space.
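The selection and averaging step described above can be sketched as follows. This is a minimal illustration that assumes the per-utterance embeddings have already been extracted (for example, with the pretrained Kaldi models); the function name `session_embedding` and the toy 4-dimensional vectors are hypothetical.

```python
import numpy as np

def session_embedding(utterances, min_duration=3.0):
    """Average the embeddings of all utterances longer than min_duration seconds.

    utterances: list of (duration_in_seconds, embedding_vector) pairs for one
    speaker in one session. Returns None if no utterance is long enough.
    """
    long_vectors = [np.asarray(vec, dtype=float)
                    for duration, vec in utterances
                    if duration > min_duration]
    if not long_vectors:
        return None
    return np.mean(long_vectors, axis=0)

# Toy data: three utterances with 4-dimensional placeholder embeddings.
toy = [
    (1.2, [9.0, 9.0, 9.0, 9.0]),   # shorter than 3 s, ignored
    (3.5, [1.0, 2.0, 3.0, 4.0]),
    (6.0, [3.0, 4.0, 5.0, 6.0]),
]
embedding = session_embedding(toy)
print(embedding)  # [2. 3. 4. 5.]
```

Averaging only the longer utterances trades some data for stability, since embeddings from very short segments tend to be noisier.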
Spectral-related features-MFCCs are widely used as standard features in speech processing domains, including emotion and speaker trait recognition [40,41,42]. MFCCs represent the spectral envelope of a signal (timbral information) [50] and were reported to have the ability to separate the impacts of the source and filter of the input speech. An MFCC can be obtained by mapping the Fourier power spectrum of a signal onto the Mel scale [48]. Subsequently, the discrete cosine transform is applied to the Mel log powers, which results in the Mel spectrum, in which the amplitude refers to the corresponding MFCC. Figure 5 shows the block diagram of deriving the MFCC of an input signal.
In addition to MFCCs, the first- and second-order frame-based derivatives of the MFCCs (delta and delta-delta, respectively) are also considered prominent features in several applications. The following equation gives the delta coefficient $d_t$ for a frame $t$, computed from the static coefficients $c_{t+n}$ and $c_{t-n}$, where a typical value of $N$ is 2:

$$d_t = \frac{\sum_{n=1}^{N} n \, (c_{t+n} - c_{t-n})}{2 \sum_{n=1}^{N} n^2}$$

In this study, we extracted MFCC features with delta and delta-delta using a speech processing toolkit (SPTK [52]) to infer Big Five personality traits. In general, it was suggested that the first 8-13 MFCCs represented the shape of the spectrum. Furthermore, the higher-order coefficients were related to the finer spectral details, such as pitch and tone. However, using a large number of cepstral coefficients results in more analytical complexity. Therefore, the first 12 to 20 MFCCs are typically used for optimal speech analysis [26]. We used the first 12 coefficients and both delta and delta-delta as the spectral-related features.
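The delta computation can be sketched in a few lines of NumPy. This is an illustrative implementation of the commonly used regression formula with N = 2, not the SPTK routine used in this study; the edge handling (repeating the first and last frames) and the function name `delta` are assumptions.

```python
import numpy as np

def delta(features, N=2):
    """Regression-based delta coefficients.

    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2)

    features: array of shape (num_frames, num_coeffs).
    Edge frames are handled by repeating the first and last frames (assumption).
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features)
    for n in range(1, N + 1):
        # padded[N + t + n] is c_{t+n}; padded[N + t - n] is c_{t-n}
        ahead = padded[N + n:N + n + num_frames]
        behind = padded[N - n:N - n + num_frames]
        out += n * (ahead - behind)
    return out / denom

# Toy "MFCC" track: one coefficient rising linearly with slope 3 per frame.
mfcc = np.arange(10, dtype=float).reshape(-1, 1) * 3.0
d = delta(mfcc)        # first-order (delta)
dd = delta(d)          # second-order (delta-delta)
stacked = np.hstack([mfcc, d, dd])   # static + delta + delta-delta
```

For a linear ramp, the interior delta values equal the slope, which is a quick sanity check on the implementation.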
Voice-related features-We extracted the statistical properties of the fundamental frequency (F0), linear predictive coding coefficients (LPCs), and line spectral pairs (LSPs). Before extracting these features, we conducted preprocessing on the raw audio data via the selection of "long" utterances (more than 3 s), downsampling to 16 kHz, and framing with a 30-ms length and a 50% overlap. This preprocessing step was conducted to capture better information related to voiced speech.
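The framing step can be illustrated as follows. This is a minimal sketch under the stated settings (16 kHz, 30-ms frames, 50% overlap); the function name `frame_signal` is hypothetical, and dropping the incomplete final frame is an assumption.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=30.0, overlap=0.5):
    """Split a 1-D signal into overlapping frames (incomplete tail is dropped)."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))   # 480 samples
    hop = int(round(frame_len * (1.0 - overlap)))             # 240 samples
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    num_frames = 1 + (len(signal) - frame_len) // hop
    # Row i holds samples [i * hop, i * hop + frame_len)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    return signal[idx]

one_second = np.zeros(16000)          # placeholder for 1 s of audio at 16 kHz
frames = frame_signal(one_second)
print(frames.shape)                   # (65, 480)
```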
The F0 trajectory estimation was acquired using the robust algorithm for pitch tracking (RAPT) [49] in SPTK. The LPC and LSP features were obtained using tenth-order linear predictive coding, which is commonly used for mimicking speech production systems [3]. The LPCs and LSPs were useful for estimating speech formants. For this reason, we extracted these features only from the "long" voiced utterances.
Energy-related features-This feature set was derived from sound energy (hereafter referred to as PI). The sound energy was represented by statistical properties calculated on a frame basis.
Turn-taking features-This feature set is represented by three speaking turn (ST) feature variables for participants: (i) the total speaking length (the total duration of the speaking utterances in a session), (ii) the total utterance count (the number of utterances in a session), and (iii) the average utterance length (the total speaking length in a session divided by the total utterance count).
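The three speaking turn variables can be computed directly from utterance boundaries. The sketch below assumes that each utterance is given as a (start, end) pair in seconds and that the average utterance length is the total speaking length divided by the utterance count; the function name and dictionary keys are hypothetical.

```python
def turn_taking_features(utterances):
    """Compute speaking turn (ST) features from (start_sec, end_sec) pairs."""
    durations = [end - start for start, end in utterances]
    total_length = sum(durations)            # (i) total speaking length
    count = len(durations)                   # (ii) total utterance count
    # (iii) average utterance length; defined as 0 when there are no utterances
    average_length = total_length / count if count else 0.0
    return {
        "total_speaking_length": total_length,
        "utterance_count": count,
        "average_utterance_length": average_length,
    }

# Toy session with three utterances lasting 2 s, 4 s, and 3 s.
st = turn_taking_features([(0.0, 2.0), (5.0, 9.0), (12.0, 15.0)])
print(st)
```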

Motion and visual features
In the MATRICS corpus, the motion and visual features can be categorized into two groups. The first group includes the features obtained from the head movements recorded by accelerometers. The statistical properties of the head movements were calculated (as shown in Table 2) [30]. Head movement refers to the norm of the three-dimensional head acceleration ($|a_t|$) at a particular time $t$ (where $a_t = \{x_t, y_t, z_t\}$). The movements performed while speaking were calculated by joining the head activity data with the speaking time data via manual transcription for each participant. This feature set was also normalized using z-score normalization. We further referred to this feature set as head motion (HM).
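A minimal sketch of the head motion computation is given below: the acceleration norm per sample, restriction to speaking time, summary statistics, and z-score normalization across participants. The choice of mean and standard deviation as the statistics, the boolean speaking mask, and all names and toy values are assumptions for illustration.

```python
import numpy as np

def head_motion_magnitude(acceleration):
    """Norm |a_t| of each three-dimensional acceleration sample (x_t, y_t, z_t)."""
    return np.linalg.norm(np.asarray(acceleration, dtype=float), axis=1)

def zscore(matrix):
    """Column-wise z-score normalization; constant columns are left at zero."""
    matrix = np.asarray(matrix, dtype=float)
    std = matrix.std(axis=0)
    std[std == 0] = 1.0
    return (matrix - matrix.mean(axis=0)) / std

# Toy accelerometer samples for one participant and a mask marking speaking time.
acc = np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 0.0], [1.0, 2.0, 2.0], [0.0, 6.0, 8.0]])
speaking = np.array([True, False, True, True])

magnitude = head_motion_magnitude(acc)          # [5, 0, 3, 10]
while_speaking = magnitude[speaking]            # [5, 3, 10]
one_participant = [while_speaking.mean(), while_speaking.std()]

# Stack the statistics of several (hypothetical) participants, then normalize.
all_participants = np.array([one_participant, [4.0, 1.0], [2.0, 3.0]])
normalized = zscore(all_participants)
```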
The second group includes the face-related features extracted by using OpenFace [4], the state-of-the-art facial behavior analysis toolkit. We extracted action units (AUs), head pose (PSs), and eye gazes (GZs) by inputting the raw video data that captured the face of each participant while having a discussion. Figure 6 shows the example of This feature set was reported as a prominent cue in social event analysis [54]. Last, GZs show the eye movements that contribute to social and emotional communications, especially for tracking the attention directions of participants [14,54,55]. In this study, we extracted GZs using the facial landmark detection model [57].
In the ELEA-AV corpus, there are three groups of motion- and visual-related features. The first group is referred to as visual activity features, which capture body activity (bMotion) and head activity (hMotion). These features were extracted using body tracking, head tracking, and optical flow [29]. The second group is based on motion energy images (MEIs) [7]. MEIs were obtained by integrating different images of the whole recorded clip. Since the MEIs changed on a time-series basis, the segmentation of time-series MEI data according to categorical patterns followed the procedure described in [29]. The third group of motion- and visual-related features in the ELEA-AV corpus is the visual focus of attention (VFOA). These features employed a probabilistic framework to estimate head locations and poses on the basis of a state-space formulation [39]. The VFOA features that we employed followed those utilized in [29].

Communication skills and leadership indices
As mentioned in Sect. 3, the communication skills (CS) indices in the MATRICS corpus were obtained by manual assessment from 21 experts in human resource management. Subsequently, the leadership (Ld) indices included in the ELEA-AV corpus were related to individual impressions about dominance and leadership. These indices were determined by other participants in the meeting as perceived interaction scores. Five Ld items were included: perceived leadership, perceived dominance, perceived competence, perceived liking, and dominance ranking. More details on the CS and Ld indices are described in [28,38], respectively. As a preprocessing step for these features, we applied z-score normalization to both the CS and Ld indices.

Experiment
In our preliminary study [25], one of the objectives was to clarify the effectiveness of verbal and nonverbal features and CS indices for estimating the Big Five personality traits. We extracted audio-related features in the same manner as the baseline system in the Interspeech 2012 Speaker Trait Challenge [42] designed for estimating perceived speaker traits from single-speaker utterances. In contrast, we aimed to thoroughly study which audio-related features are more suitable for estimating the self-assessed speaker traits of each participant in a group discussion, as provided in the MATRICS corpus. Self-assessed speaker traits are more robust than perceived speaker traits, regardless of the speech content and environment. Since the sizes of the group discussion corpora are relatively limited, we also considered performing transfer learning by using the state-of-the-art speaker individuality features for estimating the Big Five personality traits. Figure 7 shows the main ideas of our experimental process.
The experiment in the current study aimed to investigate the effectiveness of (1) speaker individuality features (i-vector and x-vector) (Sect. 5.2.1); (2) nonverbal behaviors, e.g., face gestures (Sect. 5.2.2); and (3) a combination of modality groups (Sect. 5.2.3) for Big Five personality trait estimation in both the MATRICS and ELEA-AV corpora. Accordingly, we conducted unimodal analysis followed by multimodal analysis by considering each modality group. An ablation test was also conducted to study the importance of each modality group. In this study, the experiment was conducted as a binary classification task (similar to [2,29]). The input was the combination of features explained in Sect. 4, and the targets were the Big Five personality trait scores, i.e., neuroticism (N), extraversion (E), openness (O), agreeableness (Ag), and conscientiousness (C). As mentioned above, the Big Five scores were obtained from a self-assessed questionnaire, which is usually more accurate but more difficult to predict than the perceived Big Five scores used in prior studies [22,29].

Experimental settings
In the prior study [25], the support vector machine (SVM), random forest, Naïve Bayes, and decision tree algorithms were investigated for predicting the Big Five personality traits in the MATRICS corpus. The results showed that the random forest classifier obtained the most reliable estimation accuracy for most traits and was therefore the most suitable choice for building a generalizable prediction model. A random forest is an ensemble learning algorithm that builds a set of decision trees, each trained on a randomly selected subset of the given data samples, and combines the predictions of the individual trees by voting. This algorithm can reduce overfitting issues and result in a robust and high-performance model [8]. Figure 8 shows an illustration of the random forest algorithm. We utilized the random forest algorithm in the ensemble module from scikit-learn [32] to build our classification model. Parameter tuning was applied for the number of estimators ($N_{est}$) and the maximum depth. To achieve our goals, we conducted a comparative analysis on the basis of the obtained feature set. The feature set for unimodal analysis is shown in Table 2. Additionally, an ablation test was conducted with respect to the modality groups for multimodal analysis. Four modality groups were involved, including the audio-related modality (A), language-related modality (L), motion- and visual-related modality (M), and communication-related modality (C). The combinations of these modality groups for multimodal analysis are listed in Table 3.
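As a rough illustration of this setup, the sketch below tunes the number of estimators and the maximum depth of a scikit-learn random forest with a grid search. The candidate values, the synthetic data, the 3-fold inner split, and the F1 scoring choice are assumptions for the example and are not the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a feature matrix and binary trait labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = (X[:, 0] + 0.1 * rng.normal(size=60) > 0).astype(int)

# Hypothetical search grid over the two tuned parameters.
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 4, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",   # binary F1-score
    cv=3,           # stratified 3-fold split for classifiers
)
search.fit(X, y)
best_params = search.best_params_
best_score = search.best_score_
```

In a person-independent evaluation, the tuning data would need to exclude the held-out participant to avoid optimistic estimates.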
The feature selection procedure was conducted for each feature set, where the number of selected features was based on the best overall unimodal analysis result with default classifier parameters (no parameter tuning). This feature selection process was conducted only for feature sets with more than ten elements. A support vector regressor (SVR) fitted to the training features and training outputs was used for this feature selection process. Figure 9 shows an example of i-vector feature selection analysis using several numbers of elements (chosen from {$N_i/8$, $N_i/4$, $N_i/2$, $N_i$}, where $N_i$ = 400 is the number of i-vector dimensions). Although a larger number of elements resulted in better accuracy for neuroticism, the estimates for other traits worsened. Therefore, we selected 100 as the number of features for the i-vector to balance the estimation accuracy across the traits. Subsequently, to reduce the probability of imbalance issues, we also conducted late fusion for each modality group before merging it with the other modalities. The number of selected features from each modality group (except the CS and Ld groups) was uniform and selected from {5, 10, 20, 30}.
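One plausible way to realize SVR-based feature selection is recursive feature elimination (RFE) with a linear-kernel SVR, sketched below. The text does not specify the exact selection algorithm, so the use of RFE, the linear kernel, the step size, and the synthetic 40-dimensional data are all assumptions; only the idea of keeping a quarter of the dimensions mirrors the i-vector example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

# Synthetic training data: 40-dimensional features, target driven by two of them.
rng = np.random.default_rng(0)
n_samples, n_dims = 80, 40
X_train = rng.normal(size=(n_samples, n_dims))
y_train = (2.0 * X_train[:, 0] - 1.5 * X_train[:, 1]
           + 0.1 * rng.normal(size=n_samples))

k = n_dims // 4   # keep a quarter of the dimensions (cf. 100 of 400 for the i-vector)

# RFE repeatedly fits the SVR and drops the features with the smallest weights.
selector = RFE(SVR(kernel="linear"), n_features_to_select=k, step=5)
selector.fit(X_train, y_train)

selected = np.flatnonzero(selector.support_)    # indices of the kept features
X_train_reduced = selector.transform(X_train)
```

The selector is fitted on training data only, so the same `selector.transform` call can then be applied to held-out samples.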
Following a previous study [30], we used 107 of the 120 data samples in the MATRICS corpus because some accelerometer recordings contained missing values. For the ELEA-AV corpus, we used all 102 existing data samples. From the available data samples, we conducted leave-one-person-out cross-validation (LOPCV): each participant's data were used in turn as the testing data, while the other participants' data were used as the training data. Forty-fold cross-validation was carried out for the MATRICS corpus because it contains 40 participants in total (4 people in each of the 10 discussion groups). To evaluate the performance of the binary classification model, we used the F1-score metric, which considers the balance between the precision and recall of the estimation results.
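The LOPCV protocol can be sketched with scikit-learn's `LeaveOneGroupOut`, using the participant identifier as the group so that all sessions of one person are held out together. The synthetic data, the forest size, and the pooling of all held-out predictions into a single F1-score are assumptions for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneGroupOut

# Synthetic data: 8 participants, 3 sessions (tasks) each, 5 features.
rng = np.random.default_rng(0)
n_participants, n_tasks, n_features = 8, 3, 5
participant_ids = np.repeat(np.arange(n_participants), n_tasks)
labels_per_person = np.array([0, 1, 0, 1, 0, 1, 0, 1])
y = labels_per_person[participant_ids]
X = rng.normal(size=(len(y), n_features)) + 2.0 * y[:, None]

predictions = np.empty_like(y)
splitter = LeaveOneGroupOut()
for train_idx, test_idx in splitter.split(X, y, groups=participant_ids):
    # All sessions of one participant form the test fold.
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    predictions[test_idx] = model.predict(X[test_idx])

n_folds = splitter.get_n_splits(groups=participant_ids)   # one fold per participant
score = f1_score(y, predictions)                          # F1 over pooled predictions
```

Grouping by participant prevents sessions of the same person from appearing in both the training and testing data.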

Results
This subsection presents the results of our experiments, including those obtained from (1) unimodal and multimodal analyses for both the MATRICS and the ELEA-AV corpora and (2) a comparison with prior works [2,25,29].
To investigate the effectiveness of each feature set, we carried out a unimodal analysis to estimate the Big Five personality traits. After obtaining the most effective feature sets for each modality, we carried out a multimodal analysis. Tables 4 and 5 show the unimodal analysis and multimodal analysis results regarding the inference of the Big Five personality traits in the MATRICS corpus, respectively. In the same way, we also conducted unimodal and multimodal analyses to infer the Big Five personality traits in the ELEA-AV corpus. Tables 6 and 7 show the results of the unimodal analysis and multimodal analysis, respectively, for the ELEA-AV corpus.

Speaker individuality features for personality estimation
The Big Five personality trait estimation results in the MATRICS corpus using speaker individuality features (i-vector or x-vector) are shown in the first and second rows of Table 4. This table shows that speaker individuality features were effective for estimating the neuroticism trait (F1-score > 70%). The x-vector is also useful for estimating the extraversion trait (F1-score > 65%). The comparison between the fusion of modality A (audio-related features) and A' (audio-related features without speaker individuality features) in Table 5 shows how speaker individuality affects the Big Five personality estimation in multimodal analysis. For most of the traits (except for the neuroticism trait), using speaker individuality features could improve the estimation results. Similarly, the Big Five personality trait estimation results in the ELEA-AV corpus using speaker individuality features are shown in Table 6. Almost all of the personality trait estimations could achieve an F1-score of more than 60% (except for the conscientiousness trait). The best estimation using the x-vector could be achieved for the openness trait. When fusing with other modalities (as shown in Table 7), a noticeable improvement is shown in the estimation of the openness and agreeableness traits. For instance, the estimation result using all modalities, including the x-vector, could achieve an approximately 8% higher F1-score than the one excluding the x-vector.

Nonverbal behaviors as features for personality estimation
We analyzed nonverbal behaviors, i.e., motion- and visual-related features, CS indices, and Ld indices, for Big Five personality trait estimation. The nonverbal features available in the MATRICS corpus are HMs, AUs, PSs, GZs, and CS. From Table 4, the best results for the openness and conscientiousness traits were achieved by the GZs and HMs, respectively. The nonverbal features available in the ELEA-AV corpus are bMotion, hMotion, MEIs, VFOA, and Ld. The highest F1-scores obtained during the single feature set analysis (Table 6) were mostly achieved using nonverbal features, except for the openness trait. The extraversion trait was best predicted by the Ld feature. The VFOA feature was best for predicting the agreeableness trait. In addition, the most effective feature set for the neuroticism and conscientiousness traits was the set of MEIs.

Multimodal features for personality estimation
On the basis of the unimodal analysis results, we used the prospective feature sets as one modality group. For instance, the feature sets for A included the x-vector, MFCC, F0, PI, and LSP. For the MATRICS corpus, four modality groups were considered in this multimodal analysis. An ablation test was carried out to check the significance of each modality. Table 5 shows the results of the ablation test. These results demonstrate that the multimodal analysis could only slightly improve the prediction results of the extraversion and openness traits in comparison with those obtained in the single feature analysis. Unfortunately, the prediction results of neuroticism, agreeableness, and conscientiousness obtained using multimodal analysis were worse than those obtained by using a single feature set. The best predictors for each Big Five trait (neuroticism, extraversion, openness, agreeableness, and conscientiousness) were A' + L + M, A + L, L + M, CS, and L, respectively. As an overall review, we can conclude that the A features are the most significant features for predicting the Big Five personality traits. Aside from A, the features related to motion and vision (M) are best for predicting the openness and conscientiousness traits. Subsequently, Table 7 shows the multimodal analysis results for the ELEA-AV corpus. These results indicate that the multimodal analysis could slightly improve the estimation results of the neuroticism and agreeableness traits for this corpus. The best results were achieved by using the audio-related modality (A). In contrast, the Big Five personality trait inference models for extraversion, openness, and conscientiousness could not achieve better performance than that yielded by the model utilizing a single feature set.
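The ablation procedure described above can be illustrated with a minimal sketch. It assumes four modality group labels (A, L, M, CS) and a hypothetical scoring callable, `score_fn`, that stands in for training and evaluating a classifier on a given combination. The `toy_score` function returns placeholder numbers only and does not reproduce any result from Tables 5 or 7.

```python
from itertools import combinations

# Assumed modality group labels, for illustration only.
MODALITIES = ("A", "L", "M", "CS")


def modality_subsets(modalities):
    """Yield every non-empty combination of modality groups, smallest first."""
    for size in range(1, len(modalities) + 1):
        for combo in combinations(modalities, size):
            yield combo


def ablation(modalities, score_fn):
    """Score every modality combination and return (best_combo, all_scores).

    score_fn is a hypothetical callable: it should train and evaluate a
    model on the given combination and return a score such as the F1-score.
    """
    scores = {combo: score_fn(combo) for combo in modality_subsets(modalities)}
    best = max(scores, key=scores.get)
    return best, scores


def toy_score(combo):
    """Placeholder scorer with made-up weights (not results from the paper)."""
    weights = {"A": 0.30, "L": 0.20, "M": 0.25, "CS": 0.10}
    return round(0.4 + sum(weights[m] for m in combo) / (1 + len(combo)), 3)


best, scores = ablation(MODALITIES, toy_score)
print(f"{len(scores)} combinations evaluated")
print("best combination:", " + ".join(best), "score:", scores[best])
```

In practice, `toy_score` would be replaced by a function that concatenates the feature sets of the selected modalities, fits the classifier, and returns the cross-validated F1-score for one trait.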

Comparison with prior work
We carried out a comparative analysis with [25] for the MATRICS corpus and other related works [2,19,29] for the ELEA-AV corpus regarding the proposed features. For the MATRICS corpus, the evaluation was conducted using 10-fold cross-validation, and the dataset distribution was based on that contained in a prior study [25]. Table 8 shows the comparative results yielded by an ablation test in terms of the F1-score metric. The overall results of our current study were substantially better than those of the prior study since the estimates of all traits were improved, with an F1-score increase of 8% on average. Significant improvement was achieved in terms of neuroticism and extraversion prediction (more than 10%).
From Table 8, we could also conclude that the features related to A and M that we used in the current study were more suitable for Big Five estimation with the MATRICS corpus than the features used in the prior study. For instance, the F1-score for predicting the neuroticism trait using the A features was improved from 68 to 79%, whereas the results obtained using the M features improved from 60 to 65%. Furthermore, the best modality for estimating the neuroticism and conscientiousness traits in the current work matched well with that in the prior work (A and M, respectively). In the current study, the highest F1-score for the estimation of each Big Five personality trait was acquired by the following pairs: neuroticism (A), extraversion (A + CS or A), openness (A + M), agreeableness (A + CS), and conscientiousness (M).
Table 9 shows the comparative results obtained using various features proposed in the current work and three prior works [2,19,29]. The evaluation methods used in all of these works were based on LOPCV. These results show that for most of the Big Five traits (except for the agreeableness trait), the best results obtained by our proposed features achieved better performance than those in prior works. Significant improvement was obtained by using the audio-related modality (A) for predicting the neuroticism trait (from 61% to 68%).

Table 8 The Big Five personality trait estimation results obtained for the MATRICS corpus with 10-fold cross-validation, evaluated in the same manner as the prior work [25] (left) and the current work (right). These results were obtained using the random forest algorithm with the optimal parameters. Red cells with bold captions represent the best overall prediction results. Blue cells with bold captions represent the best prediction results of each work. Green captions represent the improvement results. Meanwhile, red captions represent the declining results.

Table 9 The Big Five personality trait estimation results obtained for the ELEA-AV corpus based on the current work and three prior works by Aran et al. [2], Okada et al. [29], and Kindiroglu et al. [19]. The three classifiers used in the corresponding works were a random forest, ridge regression, and a support vector machine (SVM). Red cells represent the best overall prediction results. Blue cells represent the best prediction results of each work.
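The LOPCV protocol and the F1-score metric used in the comparison can be sketched as follows. This is a minimal illustration under stated assumptions: labels are binary (e.g., high versus low on a trait), predictions are pooled over all folds before computing the F1-score, and a majority-label baseline stands in for the actual classifier. The participant IDs and labels are toy values, not data from either corpus, and the averaging scheme used in the paper may differ.

```python
def leave_one_person_out(person_ids):
    """Yield (held_out_person, train_indices, test_indices) for each person."""
    for person in sorted(set(person_ids)):
        test = [i for i, p in enumerate(person_ids) if p == person]
        train = [i for i, p in enumerate(person_ids) if p != person]
        yield person, train, test


def f1_score(y_true, y_pred, positive=1):
    """Binary F1-score: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def majority_label(labels):
    """Stand-in classifier (assumption): predict the most frequent training label."""
    return max(sorted(set(labels)), key=labels.count)


# Toy data: one entry per sample, with the participant it belongs to.
person_ids = ["p1", "p1", "p2", "p3", "p3", "p4"]
labels = [1, 1, 0, 1, 1, 0]

predictions = [None] * len(labels)
for person, train, test in leave_one_person_out(person_ids):
    guess = majority_label([labels[i] for i in train])
    for i in test:
        predictions[i] = guess

print("pooled F1-score:", f1_score(labels, predictions))
```

Holding out all samples of one participant at a time keeps that participant's data out of the training set, which is the property that distinguishes LOPCV from an ordinary k-fold split over samples.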

Discussion
In this section, we discuss the key information obtained in this study. We also discuss the prospective multimodal interfaces that utilize the results of our findings. Finally, the limitations and the future direction to address the remaining issues in this study will be discussed.
From the experimental results, as shown in Sect. 5.2, we can discuss two main points that answer the following key questions.
1. Is the speaker individuality feature effective for inferring the Big Five personality traits? On the basis of our experimental results, the speaker individuality feature, i.e., the i-vector or x-vector, could improve the prediction performance of the model for several traits. For instance, as a unimodal feature, it could improve the prediction of the neuroticism and extraversion traits for the MATRICS corpus. It could also achieve accuracy values greater than 60% for the neuroticism, extraversion, openness, and agreeableness traits for the ELEA-AV corpus. These results suggest that the neuroticism and extraversion traits could be represented by the characteristics captured in the state-of-the-art speaker individuality feature from speech. We conjecture that the speech characteristics representing speaker individuality are also related to several personality traits. For instance, it has been reported that prosodic features are highly related to speaker individuality [44]. As neuroticism represents the degree of being nervous and extraversion describes the degree of being energetic and active, the perceptions regarding the rising and falling patterns of the voice of a speaker affect the perceptions of these traits. In the case of the conscientiousness trait, our results show that speaker individuality and this trait do not share the same features.
2. What are the effective multimodal features for estimating the Big Five personality traits for the MATRICS and ELEA-AV corpora? As shown in Tables 5 and 7, most of the Big Five personality trait predictions obtained by using audio-related features (A) or combining them with another modality achieved the best accuracy for both the MATRICS and ELEA-AV corpora. In addition, using the motion-related feature (M) could improve the prediction accuracy for the conscientiousness trait.
With the MATRICS corpus, we also analyzed the language-related feature (L) and CS indices. Although it was not as effective as M, the conscientiousness trait could also be reflected in the DT feature in L. As a unimodal feature, CS was not as effective for this task as other features. In the ELEA-AV corpus, the Ld indices were effective for predicting the extraversion trait.
Most well-known studies on personality trait estimation have focused on self-presentation scenarios, for instance, the Speaker Trait Challenge 2012 [42] and ChaLearn Looking at People 2016 [36]. However, the findings from these studies might be limited because psychological science suggests that situations and social interactions are highly associated with personality states [31]. Only a few studies have worked on predicting personality traits in social interactions, including [19,22,25,29]. This study specifically addressed personality trait estimation using speaker individuality and multimodal cues in group discussion corpora covering multiple languages (i.e., MATRICS and ELEA-AV).
As one of the primary key findings, the speaker individuality feature is considered beneficial for estimating the neuroticism and extraversion traits in both the European and Japanese language group discussion corpora. The neuroticism and extraversion traits are statistically significant in stimulating people's attitudes when receiving or making a call in public places [24]. Hence, the estimated personality can be utilized in a virtual call center for giving customer-centric responses. In addition, we can build an interface based on speaker embedding to detect the user's attitude. Similarly, the multimodal analysis results of this study could be used for developing a virtual agent for group interactions that can respond appropriately to each participant based on the estimated personality traits. An appropriate response could lead to a smooth conversation.
While this study provides several key findings, the corpora used are relatively small, limited to group discussion settings, and cover only European and Japanese languages. The investigation of more diverse corpora will be considered as a future direction. In this study, we did not focus on analyzing recent advanced machine learning algorithms. Instead, we focused on mitigating individual differences in relatively small but more diverse group discussion corpora, which can be analyzed using classical machine learning algorithms. In future work, we will thoroughly consider how to model personality traits and other internal properties based on recent trends in multimodal machine learning [21].

Conclusion and future work
This paper analyzed the effectiveness of the state-of-the-art speaker individuality feature, namely, the i-vector, to predict the Big Five personality traits in two different group discussion datasets. Our experimental results showed that this feature could effectively estimate the Big Five personality traits in both datasets, i.e., MATRICS and ELEA. A significant improvement was obtained when predicting the neuroticism and extraversion traits. Subsequently, a multimodal analysis was also carried out to compare the effectiveness of each modality and psychological feature. The psychological features included CS and Ld indices. The results showed that the audio-related features contributed most significantly to this task. An improvement could be achieved by using motion-related features, especially for predicting the conscientiousness trait. Furthermore, the i-vector speaker embedding system could improve the estimation results of personality traits, even when only using one modality (audio-related).
In our future work, we will develop a multimodal interface based on speaker embedding for automatic personality trait estimation using multimodal features. For instance, an interface can give adequate personalized feedback to the user based on the estimated traits. Additionally, recent multimodal machine learning approaches, the relationship between personality traits and other internal properties, and the explainability of the multimodal cues will be thoroughly investigated.