Introduction

Interpersonal communication skills and personality traits have been identified as critical success factors for job performance and organizational effectiveness [1, 2]. Communication skills enable workplace members to effectively exchange, share, and give feedback on information to different stakeholders through verbal and nonverbal messages [3]. Verbal messages convey the exact words, whereas nonverbal messages, such as gestures, facial expressions, posture, and tone of voice, help reveal underlying emotions, attitudes, and feelings [1, 4]. Personality traits refer to an individual's patterns of thinking, feeling, and behaving that can be used to predict whether that individual is a good fit for a specific job context or organizational environment [2]. Face-to-face interviews are a common method of employment selection [5], and they are a valid assessment tool for measuring interpersonal communication skills in a structured manner [6]. Moreover, interviewers may judge a candidate's personality traits based on his/her nonverbal communication during the interview, and this judgment may influence hiring recommendations [7].

However, inviting every job candidate to a face-to-face interview is not cost-effective [8]. The asynchronous video interview (AVI) has been developed as an alternative: job candidates log in to an interview platform and record their responses to predefined interview questions via the webcam and microphone of their mobile device or computer, and their answers are analyzed by human raters at a later time [9]. AVI allows candidates to record their answers at any place and time. Moreover, AVI accelerates the selection process because the interview recordings can be shared and assessed independently at the human raters' convenience, without scheduling an interview [10]. Scholars and practitioners may therefore ask whether, given suitable standards for assessing interview performance, interviewing can be automated to fully or partially replace conventional human raters [11].

Advanced biometric recognition [38] and face detection [41] techniques have been developed that can rapidly and accurately extract multiple patterns from near-infrared images with limited computing resources [40]. With the emergence of artificial intelligence (AI), many computer scientists have applied AI-based decision agents equipped with biometric and facial recognition techniques to build automated interview platforms on top of AVI (called AVI-AI) [17]. AVI-AI technologies have attracted considerable attention in both computer science and human resources, particularly for automatically assessing communication skills [12] and personality traits [13]. However, because AVI-AI is a novel employment selection tool, its validity and accuracy are, to the best of our knowledge, still unknown.

Artificial intelligence is a branch of computer science that seeks to build intelligent machines with capabilities similar to human intelligence [14]. Machine learning (ML) is a popular way to achieve AI [15], and deep learning (DL) is a technique used to implement ML [16]. DL can perform feature extraction automatically rather than manually [14]. Three major approaches to DL exist: supervised learning, unsupervised learning, and semisupervised learning [14]. Research has shown that semisupervised learning can achieve pattern recognition with relatively small quantities of labeled data supplemented by unlabeled data [17]. Convolutional neural networks (CNNs) have been proven to be high-performing models that can automatically classify patterns in image records [13], and according to Sun et al. [18], CNNs are the most commonly used classifiers that can be trained to accurately detect and recognize facial expressions without manual feature extraction.

Therefore, we used semisupervised DL and CNN classifiers based on TensorFlow to develop an AVI-AI that can automatically assess job candidates' communication skills and predict the big five personality traits perceived by real interviewers from the candidates' facial expressions. TensorFlow is a popular open-source DL framework that can be ported to different heterogeneous systems across a variety of devices and platforms, including mobile and desktop [19]. Accordingly, a TensorFlow-based CNN framework is expected to achieve good face recognition performance in the context of video interviewing [17]. This study tested the validity and accuracy of assessing interpersonal communication skills and perceived big five personality traits [20] using AVI-AI.

Related works

Structured interviews for communication skills and personality

According to industrial and organizational (I/O) psychology, structured interviews are more reliable and valid than unstructured interviews [21]. In a structured interview, the questions are specified beforehand, and all candidates are asked the same set of predefined lead and probe (i.e., follow-up) questions and are scored on the same scales. In an unstructured interview, the questions are spontaneous, may differ for every candidate, and are assessed individually; such interviews are therefore not standardized and not reliable [21]. Structured interviews can be divided into situational and behavioral formats: situational interviews ask candidates to describe how they would behave in a simulated context, whereas behavioral interviews ask candidates to describe what they actually did in a similar context. Behavioral interviews have shown higher validity because they reflect how candidates are likely to perform a job and interact with others, not merely whether the candidates know how to do the job [21]. In the AVI setting, a behavior-based structured interview format can be used to assess a candidate's interpersonal communication skills, which are significantly related to self-rated job performance and organizational tenure [22]. In addition to assessing the interviewees' answers to each interview question, many interviewers infer the interviewees' personality traits from their expressions during structured interviews to subjectively judge whether those traits match the requirements of the job context (person-job fit, P-J fit) and the organizational culture (person-organization fit, P-O fit). However, the interview questions do not assess personality traits directly [23] because personality traits imply how an individual would react to different situations [24].

Nonverbal cues for communication skills and personality

Social signaling theory [4] implies that a job candidate who can demonstrate his/her past behavior in terms of interpersonal communication skills possesses nonverbal as well as verbal communication skills, because nonverbal signals, including gestures and postures, face and eye movement, and vocal behavior [26], have more influence than verbal signals during human interaction [25]. According to Brunswik's Lens Model [27], people observe and interpret nonverbal messages during an interaction in addition to verbal messages [28]. These nonverbal signals provide clues and additional information beyond the verbal signals, and some estimates suggest that approximately 70% to 80% of effective communication is nonverbal [29]. Past research has found that facial expression is the most important nonverbal message for exerting control over interpersonal communication. Unlike other forms of nonverbal communication, facial expressions are universal and convey human emotions that computers can recognize with a high degree of accuracy [30]. In line with the Lens Model shown in Fig. 1, interviewees externalize their underlying traits as observable nonverbal cues, such as facial expression and movement in the AVI, while human interviewers or raters make attributions or inferences about the interviewees' personality traits and communication skills, in addition to job-related behavior and information, according to these nonverbal cues [7]. Accordingly, interviewees' facial expressions and movements determine their personality traits as perceived by interviewers [23].

Fig. 1. The process of interviewers' judgments toward interviewees' communication skills and traits in AVI

Previous studies have found that high levels of agreement between interviewers' ratings and interviewees' self-ratings of personality traits (called "accuracy" in the Lens Model [29]) can occur when appropriate information (e.g., nonverbal cues) about the interviewees' visible traits is available to the interviewers, and that an interviewer can come to know an interviewee as well as his or her close friends do through a short (approximately 15 min) interview in a zero-acquaintance situation [23]. Research suggests that other-rated personality traits are preferable to self-reported personality because self-ratings may carry social desirability bias, especially in the job application process [7].

AI assessment agent for communication skills and personality

A study by [31] showed that AI can be used to extract and analyze nonverbal signals to predict an individual's communication skills. Pooja Rao and colleagues used behavior-based structured interviews in an automated communication skill assessment interface based on AVI and ML and found that automatically extracted nonverbal features can accurately predict a candidate's communication skills as scored by human interviewers [10]. Similar work in social computing and social signal processing has shown how advanced ML can contribute to an automatic understanding of how human nonverbal signals affect communication competencies [31] and the effectiveness of interpersonal interaction [12]. In other words, ML plus signal processing can automatically predict a candidate's interpersonal communication skills from nonverbal cues, rather than having human raters assess the candidate's responses (i.e., past behavioral incidents) in an AVI setting. Moreover, researchers in the field of personality computing [27] have adopted AVI and ML to predict interviewees' personality traits based on the Lens Model in zero-acquaintance contexts, such as the relationship between job interviewers and interviewees [32, 33].

However, these AI assessment systems were developed with traditional ML or supervised DL, which require considerable manual effort for behavior annotation and labeling [12]. Although unsupervised DL can automatically learn the correct patterns without predefined labels, it requires huge quantities of data to do so [34]. Semisupervised DL can reduce the required labeling effort while maintaining high accuracy [17]. Since a CNN can effectively classify patterns from AVI image records [13] and the TensorFlow engine can increase prediction accuracy [35], a CNN with a TensorFlow engine is an ideal learning model for predicting an interviewee's attributes from his/her facial expressions [17]. In line with previous work, our study aims to develop an intelligent video interview agent based on AVI and semisupervised DL, using a CNN with TensorFlow to extract facial expression features, learn the patterns relating interviewees' facial expressions to their communication skills and personality traits, and build a model that automatically predicts an interviewee's personality from his/her AVI records without administering any personality assessment instrument. We then examine the validity and accuracy of predicting interviewees' communication skills and personality traits as perceived by interviewers.

Method and modeling

Data collection

We invited 57 human raters and 57 interviewees to participate in our experiment. All human raters were human resource professionals; their average work experience was 12.49 years (SD = 7.19), including an average of 5.81 years of experience as job interviewers. The interviewees were new graduates or students seeking full-time or part-time jobs in the field of human resources (HR); they had an average work experience of 2.28 years (SD = 4.73). The interviewees were invited to sign up for our AVI-AI software application on any Android or iOS mobile device and could decide when they were ready to start the interview. The software guided them through the interview step by step, and the interviewees were informed that their interview answers and responses, including audio and visual information, would be recorded and analyzed by our AI algorithms.

The interview questions followed a standard structured pattern in which each interviewee answered the same five behaviorally oriented questions designed to assess interpersonal communication skills [36]. Each question was displayed on the screen, and 1 min was allowed for thinking after the question was announced. Audiovisual recording started automatically upon entering the answer screen, and 3 min were provided to answer each question. An interviewee who finished early could choose to skip to the next question; otherwise, the system automatically moved on after 3 min. The entire video interview took approximately 20 min per interviewee; the flow is sketched below. After all the interviewees had finished the video interview, each human rater was randomly assigned three interviewees whose communication skills and personality traits he or she evaluated.
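To make the protocol concrete, here is a minimal, purely hypothetical sketch of the question flow in Python; the function names and the `recorder`/`skip_requested` hooks are illustrative stand-ins, not part of the actual platform.

```python
# Hypothetical sketch of the AVI question flow described above
# (timings from the text: 1 min thinking, 3 min answering, auto-advance).
import time

QUESTIONS = [f"Question {i}" for i in range(1, 6)]  # five structured questions
THINK_SECONDS = 60    # 1 min to think after each question is announced
ANSWER_SECONDS = 180  # 3 min to record an answer

def run_interview(recorder, skip_requested):
    """Walk an interviewee through the five questions.

    `recorder` starts/stops audiovisual capture; `skip_requested()`
    polls whether the interviewee chose to move on early.
    """
    for question in QUESTIONS:
        print(question)
        time.sleep(THINK_SECONDS)        # thinking period, no recording
        recorder.start()                 # capture starts on the answer screen
        deadline = time.time() + ANSWER_SECONDS
        while time.time() < deadline:
            if skip_requested():         # interviewee skips to the next question
                break
            time.sleep(1)
        recorder.stop()                  # auto-advance after 3 min
```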

Data labeling

Following Suen et al.'s [36] measures, the interpersonal communication skills score was the mechanical average of three raters' ratings of the five interview questions on a five-point scale, as shown in Table 1.

Table 1 Structured interview questions and scoring scale

The Cronbach’s \(\alpha\) value for commutation skills was 0.901, suggesting that the five question items have relatively high internal consistency (more than 0.7). The intraclass correlation coefficient (ICC) was 0.641. According to [37], the ICC ranges from 0 to 1; a value greater than 0.75 is considered excellent, a value between 0.6 and 0.74 is good, a value between 0.4 and 0.59 is fair, and a value less than 0.4 is poor. In this study, the interrater reliability of communication skills was good.

Additionally, each rater judged his or her randomly assigned interviewees' personality traits using a 50-item inventory from Goldberg's [20] International Personality Item Pool (IPIP), which measures the big five dimensions of personality: openness to experience (creative and imaginative), conscientiousness (organized and self-disciplined), extraversion (assertive and sociable), agreeableness (tolerant, honest, and altruistic), and neuroticism (vulnerable to frequent strong negative emotions).

Every interviewee’s five personality trait scores were combined (averaged) from the three raters’ judgment according to the raters’ subjective perception of the interviewee’s self-presentation. The Cronbach’s \(\alpha\) values for the big five traits were all acceptable (more than 0.7): openness (\(\alpha\) = .93), conscientiousness (\(\alpha\) = .94), extraversion (\(\alpha\) = .93), agreeableness (\(\alpha\) = .90), and neuroticism (\(\alpha\) = .88). The ICC values for the big five traits were openness (ICC = .68), conscientiousness (ICC = .74), extraversion (ICC = .71), agreeableness (ICC = .67), and emotional stability (ICC = .50), suggesting that the interrater reliability was acceptable (more than 0.4) for all big five traits.

Feature extraction and modeling

To develop AVI-AI software that could predict interpersonal communication skills and personality traits as perceived by the human raters, we constructed a three-stage model, as illustrated in Fig. 2: video data processing, classifier training, and classifier validation.

Fig. 2. Video data processing, classifier training, and classifier validation

In the video data processing stage, we used FFmpeg to extract frames containing the interviewees' facial expressions from our own AVI dataset. Facial features were detected using OpenCV and Dlib by tracking 86 facial landmark points per frame, and features were extracted from frames sampled at 5 s intervals from each interviewee's AVI record. Preprocessing was required to reduce undesirable noise in the feature extraction, such as interference caused by hair and cosmetics [38]. We detected and cropped the face images as shown in Fig. 3, which illustrates how we obtained the original face image, detected the facial landmarks, and cropped the facial image to train the classifier. Afterward, we converted the cropped images to grayscale to reduce the impact of illumination and to highlight the facial expression and movement features. We then located the 86 facial landmarks, shown in green in Fig. 4. Any frames in which a face could not be detected were dropped. A minimal sketch of this pipeline follows the figures below.

Fig. 3. Face detection and cropping

Fig. 4. Converting the cropped face images to grayscale and locating the facial landmarks
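The sketch below approximates the preprocessing pipeline just described, assuming Dlib's off-the-shelf 68-point landmark model as a stand-in for the 86-point tracker used in the study; the file path, sampling helper, and crop logic are illustrative.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# off-the-shelf 68-point model; the study tracked 86 landmark points
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_faces(video_path: str, interval_s: float = 5.0):
    """Sample one frame every `interval_s` seconds and yield cropped,
    grayscale face images plus their landmark coordinates."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = int(fps * interval_s)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # grayscale reduces illumination effects
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for rect in detector(gray, 1):   # frames with no detection are dropped
                shape = predictor(gray, rect)
                landmarks = [(p.x, p.y) for p in shape.parts()]
                top, left = max(rect.top(), 0), max(rect.left(), 0)
                crop = gray[top:rect.bottom(), left:rect.right()]
                yield crop, landmarks
        idx += 1
    cap.release()
```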

In the classifier training stage, we combined the labeled data of the 57 interviewees with their extracted features to train our prediction model for communication skills and the big five personality traits. The model was a TensorFlow-based CNN, as illustrated in Fig. 5, whose structure consisted of four convolutional layers, three pooling layers, ten mixed layers, a fully connected layer, and a softmax output layer. The input images were normalized to 640 pixels in width [39] because the cropped images could vary in rotation and shifting, and forcing a fixed pixel ratio (VGA: 640 * 480) might distort the original facial image [40, 41]. We used the interviewees' extracted facial expression features as the inputs (see Fig. 6), and the communication skill scores and big five trait scores as perceived by the three human raters as the outputs of the neural network. In addition to the input, each layer contains trainable parameters (connection weights). We also used the rectified linear unit (ReLU) activation to combat the vanishing gradient problem that can occur with a sigmoid function [39]. The final layer of the model was a softmax layer with 60 possible outputs.

Fig. 5. The architecture of the CNN model

Fig. 6. Inputting the featurized images into the CNN
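The exact network is not reproduced here (the "mixed" layers suggest an Inception-style backbone), but the following Keras sketch mirrors the stated ingredients, four convolutional layers, three pooling layers, ReLU activations, a fully connected layer, and a 60-way softmax output; the input size and filter counts are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 60  # 60 possible outputs in the final softmax layer

def build_model(input_shape=(224, 224, 1)):  # grayscale crops; size illustrative
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),   # conv 1
        layers.MaxPooling2D(),                     # pool 1
        layers.Conv2D(64, 3, activation="relu"),   # conv 2
        layers.MaxPooling2D(),                     # pool 2
        layers.Conv2D(128, 3, activation="relu"),  # conv 3
        layers.MaxPooling2D(),                     # pool 3
        layers.Conv2D(256, 3, activation="relu"),  # conv 4
        layers.Flatten(),
        layers.Dense(128, activation="relu"),      # fully connected layer
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```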

In the classifier validation stage, the training set (50%) and validation set (50%) were obtained via random sampling. Each interviewee had six labels: one communication skill score and five personality trait scores. We ran 4000 training iterations with a learning rate of 0.01, an evaluation frequency of 10, and a training batch size of 256.
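Under the stated hyperparameters, the validation protocol might look like the following sketch; the arrays are placeholders, and `build_model` refers to the CNN sketch above.

```python
import numpy as np

# placeholder arrays standing in for the featurized face crops and labels
features = np.zeros((1000, 224, 224, 1), dtype="float32")
labels = np.zeros((1000,), dtype="int32")

rng = np.random.default_rng(42)
perm = rng.permutation(len(features))
train_idx, val_idx = perm[: len(perm) // 2], perm[len(perm) // 2:]  # random 50/50 split

model = build_model()
for step in range(4000):                          # 4000 training iterations
    batch = rng.choice(train_idx, size=256)       # training batch size 256
    model.train_on_batch(features[batch], labels[batch])
    if step % 10 == 0:                            # evaluation frequency 10
        val_loss, val_acc = model.evaluate(
            features[val_idx], labels[val_idx], verbose=0)
```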

Results

Following [17], we used the Pearson correlation coefficient (R), the explained variation (\(R^{2}\)), and the mean square error (MSE) to measure the concurrent validity of the AVI-AI. \(R^{2}\) represents the proportion of variance in the dependent variable that can be predicted from the independent variables; the higher \(R^{2}\) is (1 is perfect), the better the estimator. Conversely, a lower MSE (0 is perfect) indicates smaller estimator error.
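These three metrics can be computed as in the following sketch with SciPy and scikit-learn; the arrays are placeholders standing in for the human raters' scores and the model's predictions.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.2, 4.1, 2.8, 3.9, 3.5])  # human raters' scores (placeholder)
y_pred = np.array([3.0, 4.3, 3.0, 3.7, 3.6])  # AVI-AI predictions (placeholder)

r, _ = pearsonr(y_true, y_pred)               # Pearson correlation coefficient R
r2 = r2_score(y_true, y_pred)                 # explained variation R^2
mse = mean_squared_error(y_true, y_pred)      # mean square error
print(f"R={r:.3f}  R^2={r2:.3f}  MSE={mse:.3f}")
```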

Table 2 shows that interpersonal communication skills, openness, agreeableness, and neuroticism as perceived by the human raters were learned and predicted successfully by the AVI-AI, whereas conscientiousness and extraversion were not. The results suggest that a candidate's facial expression patterns reflect his/her communication skills as scored by HR professionals in structured behavioral interviews. Moreover, our prediction model learned to judge whether a candidate is likely to be perceived as open minded, agreeable, or neurotic during the interview. However, the model did not converge when learning to judge whether a candidate is likely to be perceived as conscientious or extraverted.

Table 2 Experimental results

Discussion and conclusion

In this study, we developed a semisupervised CNN model based on TensorFlow to automatically predict an interviewee's communication skills and personality traits. The results support social signaling theory [4] and the Lens Model [27], indicating that human raters judge candidates' social skills and some apparent personality traits from nonverbal communication signals and that different human raters apply a similar evaluative lens when perceiving nonverbal cues and attributing a target's characteristics. Accordingly, we adopted AVI to extract interviewees' facial expressions and embedded an AI agent in the AVI to learn the lens used to predict an interviewee's communication skills and personality traits.

Although our AVI-AI can help screen a large pool of job candidates when the vacancy requires interpersonal skills and specific personality traits such as openness, agreeableness, and emotional stability (a low level of neuroticism), the study has some limitations that should be considered when interpreting the experimental results.

First, a limited number of participants agreed to engage in this study, which might affect the predictive power, generalizability, and performance of the DL model. Future research may therefore recruit more diverse participants and examine whether the AVI-AI can be designed to assess job performance criteria directly, and what predictive validity is gained by doing so.

Second, we used only facial expression and movement as features to predict communication skills and big five personality traits, but other nonverbal cues, such as gestures, prosody, gaze behavior, and upper body movement, are likely to influence interviewers' attribution processes [27]. This may explain why the DL model could not predict interviewees' conscientiousness and extraversion from facial cues alone: the interviewers might rely on other cues to infer these traits. Future work may therefore expand the features from exclusively visual to multimodal by encoding additional cues in the audio and textual channels.

Third, this study adopted an automatic personality perception (APP) approach [27] to train our intelligent interview agent, in which the learning objectives were not the true personalities of the interviewees but the personalities attributed to them by the interviewers. Consequently, the AI agent might reproduce human biases if it is trained on biased perceptions and attributions [42]. In contrast to APP, the automatic personality recognition (APR) approach focuses on the externalization of interviewees' self-assessed personality traits, inferring the true personality of an individual from nonverbal behavioral evidence [17, 27]. Past research has suggested that other-rated personality traits (as in APP) may be preferable to self-rated traits (as in APR), because interviewers' attributions have been shown to converge with self-assessments and other-rated assessment can avoid the socially desirable and faking behaviors performed by job candidates [43]. Nevertheless, future work may combine the APP and APR approaches to train the DL model to predict an interviewee's personality from different perspectives through multigroup analysis comparing the results among different raters (e.g., ratings from the self and from friends).

Finally, psychologists have found that job applicants may behave differently toward different interviewers (including an AI decision agent), for example, through impression management and socially desirable behavior [43], and different interviewers may trigger different reactions during the interview process. Such interactions may further influence how interviewers judge a candidate's personality [44]. Future work may include a comparative study examining the accuracy of interviewers' personality perceptions of interviewees in different interview settings (such as AI vs. non-AI, or synchronous video interviews vs. AVI).

Many related works have developed facial expression detection tools to automatically assess job applicants' characteristics, but those studies focused on "how" (i.e., methods of assessment) rather than "what" (the constructs being assessed), which matters for validation, explainability, and acceptability in personnel selection and assessment [45]. By contrast, this study not only developed a detection tool but also assessed job applicants' communication skills and personality traits, which have been identified as important criteria in employment selection. Commonly used selection tools include structured interviews, whether face-to-face, by phone, or via synchronous video conference, all of which require considerable human effort, time, and financial cost. With the emergence of AI, people may imagine AI agents automatically performing work similar to that of a group of experienced interviewers, making the hiring process more efficient for both employers and candidates [17, 45]. Beyond the cost-efficiency benefits of automation, AI decision agents can be adopted to decrease the human biases (implicit or explicit) that may affect how cues from an interviewee are interpreted, because an AI agent evaluates all interviewees against the same criteria, which could make the judgment of communication skills and personality traits more consistent and fair.

This study applied a novel DL model with high concurrent validity and accuracy to automatically predict an interviewee's communication skills and personality traits, which may help bridge the gap between human imagination and computing reality and realize the potential of data-driven AI in selection assessment and human-centric computing [45].