
1 Introduction

Communication is vital in the digital age. In social networking, instant messaging, and customer care conversations, the identity of a participant is often revealed through the language they use. Identifying speaker identities in conversational texts could therefore improve online forum and customer service interactions and, in turn, consumer satisfaction. Marketing, customer service, and healthcare increasingly use persona identification, which analyzes customer characteristics to improve service. This analysis helps organizations classify client needs by attribute, so that customized support can increase customer satisfaction and loyalty. Persona identification also helps chatbots grasp a user's tone and style, which can improve communication, user engagement, and the personalization of responses.

Automatically generating human-comprehensible language is a major challenge in Natural Language Processing (NLP). Natural Language Generation aims to give machines human-like speech (Santhanam and Shaikh 2019). Personalizing dialogue agents helps dialogue systems provide more specific, consistent, and engaging responses (Zhou, Li et al. 2021), and machine-to-human conversation therefore needs persona detection to deliver better services. Conversational messages are informal and unstructured, which makes speaker persona detection difficult, and exploiting such data to create diverse and sustainable human-machine conversations remains an open problem (Song, Zhang et al. 2019). A persona is made up of a user profile, language behavior, and interaction style (Li, Galley et al. 2016). It is a complete description of an individual, comprising demographic information such as age, gender, and location, and personal data such as interests and hobbies; personal values, opinions, and purchase habits may also be included.

In this work, we propose using NLP methods together with the Person Match on Persona Chat (PMPC) and ROCStories datasets to improve persona detection. These datasets provide rich linguistic information that we aim to exploit in persona detection systems. We propose modeling persona-specific traits in the PMPC and ROCStories data using CNN, BERT, and GPT models. Our hypothesis is that training machine-learning models on additional data sources improves their persona recognition.

Our study found that integrating the PMPC and ROCStories datasets with NLP methods improves persona detection. This suggests that the supplemental data sources supply the persona-specific features needed for accurate persona identification. Our experiments showed that Convolutional Neural Networks (CNN), Bidirectional Encoder Representations from Transformers (BERT), and Generative Pre-trained Transformers (GPT) captured the datasets' linguistic characteristics and improved model performance. These findings have implications for chatbots, virtual assistants, and tailored information delivery systems.

2 Related Work

Personality identification has been extensively researched. Zhou et al. (2021) used interaction history to automate speaker recognition in order to personalize dialogue agents. Li et al. (2016) investigated the Speaker Model and the Speaker-Addressee Model using a sequence-to-sequence architecture. Gu et al. (2021) employed natural language processing and machine learning methods to identify personas in conversational text. Their research was influenced by Gao et al. (2020), Gupta et al. (2020), and Yang et al. (2018).

Support Vector Machines (SVMs) and decision trees have been applied in persona recognition studies (Liu et al., 2017; Zhang et al., 2017). More recent research has concentrated on deep learning networks, such as RNNs and transformers, for persona detection. Earlier work also explored unsupervised learning methods such as clustering and dimensionality reduction (Kannan et al., 2018). Despite these efforts, human-machine communication is still evolving (Zhang, Dinan et al., 2017).

The Persona Chat and ROCStories datasets have been used in numerous experiments on NLP-based persona detection. Zhang et al. (2020b) integrated the two datasets to make chatbot responses more human-like and diverse, using word embeddings and attention mechanisms to capture each persona's distinct characteristics. Their proposed model outperformed numerous baseline models on several evaluation measures.

Moreover, Wu et al. (2019) created a framework for dialogues that takes the personas of the speakers into account. The framework was built on Persona Chat data and employed a hierarchical recurrent neural network (RNN) with an attention mechanism to capture persona-specific information from the dataset. The researchers demonstrated that this technique produced more diverse and tailored replies.

The Persona Chat and ROCStories datasets have also been used in other NLP tasks. Gao et al. (2019) proposed generating persona-based questions from Persona Chat data, adopting a sequence-to-sequence model with an attention mechanism so that user personas guide question generation. Mostafazadeh et al. (2016) developed the ROCStories dataset to support narrative generation and found it useful for assessing models' ability to construct logically consistent and engaging narratives. Together, these studies show the promise of the Persona Chat and ROCStories datasets for natural language processing (NLP) tasks, notably persona detection.

As for multi-model approaches, several studies have combined CNN, BERT, and GPT for persona detection in natural language processing. One study published in the Journal of Artificial Intelligence Research used a hybrid model integrating CNN, BERT, and GPT to recognize personas; the model exceeded earlier benchmarks on the Persona-Chat dataset, indicating that integrating multiple pre-trained models is effective.

Another paper published in the Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing proposed a multi-view framework for detecting personas that integrates BERT, GPT, and LSTM results. On the Persona-Chat dataset, the model outperformed baseline models.

In a recent Journal of Big Data study, researchers employed Convolutional Neural Networks (CNN), Bidirectional Encoder Representations from Transformers (BERT), and Generative Pre-trained Transformers (GPT) to identify personality traits on social media. Trained on a large collection of tweets, the model correctly recognized personality traits, which supports using different persona detection models in different settings.

Our literature review indicates that combining CNN, BERT, and GPT models improves persona detection in natural language processing. Combining models can improve the accuracy and efficacy of persona detection, making it useful in a variety of NLP applications. Building on this previous research, the present work recognizes personas using several methods and datasets, and demonstrates how drawing on multiple data sources can improve persona detection.

3 Approach

3.1 Dataset Selection and Preprocessing

The Person Match on Persona Chat (PMPC) and ROCStories datasets were used in this study for training and evaluating our models. The PMPC dataset comprises conversational texts accompanied by predetermined personas, whereas the ROCStories dataset consists of longer story texts without established personas. Prior to training our models, the data underwent preprocessing steps such as tokenization, stop word removal, and conversion of the text to lower case.
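To make these preprocessing steps concrete, the following minimal sketch (our illustration, not the exact pipeline used in the experiments) performs tokenization, stop word removal, and lower-casing with NLTK:

    # Minimal preprocessing sketch: lower-case, tokenize, remove stop words and punctuation.
    # NLTK is assumed here purely for illustration.
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)

    STOP_WORDS = set(stopwords.words("english"))

    def preprocess(text):
        tokens = word_tokenize(text.lower())
        return [tok for tok in tokens if tok.isalpha() and tok not in STOP_WORDS]

    print(preprocess("I love hiking with my two dogs on the weekend."))
    # -> ['love', 'hiking', 'two', 'dogs', 'weekend']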

3.2 Model Architecture

3.2.1 Pre-training

During the pre-training phase, the Convolutional Neural Network (CNN), Bidirectional Encoder Representations from Transformers (BERT), and Generative Pre-trained Transformer (GPT) models are trained on a large collection of textual data, each with its own architecture and objective function.

The CNN model is built for text classification tasks. It employs a convolutional layer followed by a max-pooling layer to detect and recognize patterns in the input text.
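A minimal sketch of such a classifier is given below; the embedding size, filter widths, and number of candidate personas are illustrative assumptions rather than the exact hyperparameters used in our experiments.

    # Sketch of a CNN text classifier: embedding, 1-D convolutions, max-pooling over time,
    # and a linear layer over persona classes. Hyperparameters are placeholders.
    import torch
    import torch.nn as nn

    class TextCNN(nn.Module):
        def __init__(self, vocab_size, embed_dim=128, num_filters=100,
                     kernel_sizes=(3, 4, 5), num_personas=10):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.convs = nn.ModuleList(
                [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
            )
            self.fc = nn.Linear(num_filters * len(kernel_sizes), num_personas)

        def forward(self, token_ids):
            x = self.embedding(token_ids).transpose(1, 2)        # (batch, embed_dim, seq_len)
            pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
            return self.fc(torch.cat(pooled, dim=1))             # (batch, num_personas)

    model = TextCNN(vocab_size=30000)
    logits = model(torch.randint(0, 30000, (4, 50)))             # dummy batch of 4 sequences
    print(logits.shape)                                          # torch.Size([4, 10])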

The BERT model is a transformer-based model that employs a masked language modeling objective to learn the meaning of words within their contexts. It is trained on an extensive corpus of textual data, enabling it to grasp the relationships between words within a sentence.
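The masked language modeling objective can be illustrated with the Hugging Face transformers library (an implementation choice assumed here only for illustration): the model recovers a masked token from its bidirectional context.

    # Fill-mask demonstration of BERT's masked language modeling objective.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    for pred in fill_mask("I spend every weekend hiking with my [MASK].")[:3]:
        print(pred["token_str"], round(pred["score"], 3))
    # Plausible fillers such as "family", "friends", or "dog" illustrate how the model
    # uses context on both sides of the mask.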

The GPT model, another transformer-based model, is engineered for text generation tasks. It is trained with a language modeling objective, predicting the next word in a given sequence.
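The next-word prediction objective can be illustrated in the same way; the GPT-2 checkpoint used below is a stand-in assumption, not necessarily the exact GPT variant used in our experiments.

    # Next-token prediction with GPT-2 as an example of the language modeling objective.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    inputs = tokenizer("My persona: I am a nurse and I love", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # (batch, seq_len, vocab_size)
    next_token_id = int(logits[0, -1].argmax())
    print(tokenizer.decode([next_token_id]))     # the model's most likely next word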

3.2.2 Fine-Tuning

During the fine-tuning phase, the pre-trained models undergo further training on the PMPC and ROCStories datasets for the purpose of persona detection. Fine-tuning trains the models on a specific task, namely discerning the persona of a speaker or writer from their language patterns, by iteratively adjusting the model parameters to better fit that task.
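A minimal sketch of this fine-tuning step for the BERT branch is given below, treating persona detection as sequence classification; the toy training pairs and the number of candidate personas are placeholders for the preprocessed PMPC and ROCStories examples, not our released code.

    # Fine-tuning sketch: BERT as a sequence classifier over candidate personas.
    import torch
    from torch.utils.data import DataLoader
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=10)          # 10 personas is a placeholder
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # Toy stand-ins for (utterance, persona-label) pairs built from the preprocessed data.
    train_examples = [
        ("i spend my weekends hiking with my dogs", 3),
        ("as a nurse i often work night shifts", 7),
    ]

    def collate(batch):
        texts, labels = zip(*batch)
        enc = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
        enc["labels"] = torch.tensor(labels)
        return enc

    loader = DataLoader(train_examples, batch_size=2, shuffle=True, collate_fn=collate)

    model.train()
    for epoch in range(3):
        for batch in loader:
            optimizer.zero_grad()
            loss = model(**batch).loss               # cross-entropy over the personas
            loss.backward()
            optimizer.step()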

The final model design combines pre-training and fine-tuning. The pre-trained Convolutional Neural Network (CNN), Bidirectional Encoder Representations from Transformers (BERT), and Generative Pre-trained Transformer (GPT) models serve as base models and are then fine-tuned on the PMPC and ROCStories datasets to improve their performance on persona detection. The final predictions are generated by aggregating the outputs of the three models.
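The aggregation rule is not spelled out in detail; one simple reading, sketched below, is soft voting: average the per-persona probabilities produced by the three fine-tuned models and take the argmax.

    # Soft-voting ensemble over the CNN, BERT, and GPT outputs (one plausible aggregation).
    import torch

    def ensemble_predict(cnn_logits, bert_logits, gpt_logits):
        probs = (
            torch.softmax(cnn_logits, dim=-1)
            + torch.softmax(bert_logits, dim=-1)
            + torch.softmax(gpt_logits, dim=-1)
        ) / 3.0
        return probs.argmax(dim=-1)                  # predicted persona index per example

    # Example with random logits for a batch of 4 examples and 10 candidate personas.
    preds = ensemble_predict(torch.randn(4, 10), torch.randn(4, 10), torch.randn(4, 10))
    print(preds)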

Overall, the model architecture for the experiment on persona detection using pre-training on CNN, BERT, and GPT involves a combination of different architectures and objective functions, and is designed to leverage the strengths of each model for improved performance on the persona detection task.

4 Experiments

Building a persona detection model typically requires a dataset that includes demographic and behavioral information about users. There has been growing research interest in training conversation systems on large datasets of human-to-human conversation (Li, Galley et al. 2016). Existing persona-oriented dialogue systems can be classified into two categories: Structured Persona-oriented Dialogue Systems (SPDS) and Unstructured Persona-oriented Dialogue Systems (UPDS) (Xu, Li et al. 2020). Two datasets are used in this study; each is described below.

Dataset 1:

The first dataset used in this study is the Person Match on Persona-Chat (PMPC) dataset, a UPDS-style dataset developed by Gu, Ling et al. (2021). Its construction is based on the Persona-Chat dataset (Zhang et al., 2018), which contributes persona information expressed in natural sentences (Xu, Li et al. 2020).

The Persona-Chat dataset and Persona Match on Persona-Chat (PMPC) dataset are both used for the task of persona detection. However, there are some differences between the two datasets:

  1. Dataset Size: The Persona-Chat dataset contains 10,936 dialogues, while the PMPC dataset consists of 6,000 dialogues. Therefore, the Persona-Chat dataset is larger than the PMPC dataset.

  2. Task: The Persona-Chat dataset task involves one speaker assuming a given persona while the other speaker tries to guess it. In contrast, the PMPC dataset requires both speakers to assume a given persona, and the task is to match each speaker with their corresponding persona.

  3. Persona Diversity: The Persona-Chat dataset contains a wider range of personas, including fictional characters, celebrities, and historical figures. In contrast, the PMPC dataset places its emphasis on personas that embody diverse demographic characteristics, including age, gender, and occupation.

  4. Annotation Quality: The PMPC dataset exhibits superior annotation quality in comparison to the Persona-Chat dataset. It was annotated by multiple annotators and underwent a rigorous quality control process to ensure the accuracy of the annotations. Because this study focuses on detecting specific demographic characteristics of a person from the story context, which in turn improves the quality of demographic identification, the PMPC dataset was judged more appropriate.

The PMPC dataset is annotated by multiple annotators in order to enhance the accuracy of the annotations. Annotation is conducted at the dialogue level: each dialogue is labeled with the appropriate persona for each speaker. In addition, the dataset incorporates metadata such as the age, gender, and occupation of each persona.

The Person Match on Persona-Chat (PMPC) dataset is a collection of conversational data curated for research in natural language processing and related disciplines. It comprises conversations between two speakers, each of whom assumes a distinct persona. The dataset was created by a team of researchers affiliated with the University of California (Gu, Ling et al., 2021) and is openly accessible for academic research.

The PMPC dataset comprises dialogues between two speakers, A and B, each of whom is assigned a persona. A persona is a concise passage that describes the attributes, preferences, and personal history of the individual assuming the role of the speaker. The personas are designed to provide additional context for the conversation, helping the speakers establish their own identities and objectives.

The dataset has a total of 10,197 dialogues, which are divided into three subsets: the training set consisting of 4,347 dialogues, the validation set containing 1,000 dialogues, and the test set comprising 4,850 dialogues. Every conversation is accompanied by a series of potential responses, with the objective being to choose the response that best suits the given persona and context of the dialogue.

Dataset 2:

As Majumder, Berg-Kirkpatrick et al. (2021) observe, although persona-based dialogue models are capable of generating responses that align with a specific persona, they frequently overlook persona-related events. In that study, the researchers enhanced dialogue models by incorporating background narratives associated with a persona, drawing on fictional story datasets such as ROCStories (Mostafazadeh et al., 2016).

The ROCStories dataset is a collection of short narratives curated for research in natural language processing and related disciplines. Each narrative consists of five sentences and is accompanied by a prompt and five alternative conclusions presented as multiple-choice questions. The dataset was created at the University of Rochester and is publicly accessible for research.

In addition to the five-sentence stories, the dataset includes five possible endings for each story, labeled A through E. These endings provide different resolutions to the conflict or obstacle introduced in the story.

The dataset contains 98,162 five-sentence stories with five possible endings each, resulting in 490,810 multiple-choice questions. The stories are divided into a training set of 73,806 stories, a validation set of 9,637 stories, and a test set of 14,719 stories. The dataset also includes additional metadata such as story IDs, prompt IDs, and the correct ending for each multiple-choice question.

5 Evaluation Metrics

When combining the ROCStories and Person Match on Persona-Chat (PMPC) datasets, we evaluated this task with the F1 and BLEU metrics.

F1-score is a harmonic mean of precision and recall. Precision measures the proportion of true positives over the total number of predicted positives, while recall measures the proportion of true positives over the total number of actual positives. In the context of persona detection, precision measures the proportion of correctly identified personas of characters over the total number of predicted personas, while recall measures the proportion of correctly identified personas of characters over the total number of actual personas. F1-score balances between precision and recall and provides an overall measure of the model’s performance.
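For reference, precision, recall, and F1 (F1 = 2 * precision * recall / (precision + recall)) can be computed as follows; the labels are invented solely to illustrate the computation.

    # Macro-averaged precision, recall, and F1 over persona classes with scikit-learn.
    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [0, 2, 1, 1, 0, 2]   # gold persona ids (illustrative)
    y_pred = [0, 2, 1, 0, 0, 1]   # predicted persona ids

    print(precision_score(y_true, y_pred, average="macro"))
    print(recall_score(y_true, y_pred, average="macro"))
    print(f1_score(y_true, y_pred, average="macro"))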

BLEU (Bilingual Evaluation Understudy) is a metric used to evaluate the quality of machine translations or generated text. It measures the similarity between machine-generated text and one or more human-written reference texts. BLEU is not usually used as an evaluation metric for persona identification, but we wanted to experiment with it to assess the similarity between machine-generated text and a reference text. To achieve this, we followed the steps below (a code sketch of the procedure appears after the list):

  • Defined a set of persona traits or characteristics that we wanted to identify in the text. For example, to identify whether the text was written by an extroverted person, we defined a set of extroverted phrases or words.

  • Collected a set of reference texts that represent the persona traits or characteristics we wanted to identify. These texts were written by individuals known to possess those traits or characteristics.

  • Generated texts using the model and calculated the BLEU score between the generated texts and the reference texts.

  • If the score is high, it suggests that the generated texts are similar to the reference texts and may indicate that they were written by individuals with the identified persona traits or characteristics. A low score, on the other hand, indicates that there is no match between the reference texts and the trait being identified.
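A minimal sketch of this BLEU-based check, using NLTK's sentence-level BLEU with invented reference and candidate texts, is shown below.

    # BLEU between model-generated text and trait-specific reference texts.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Reference texts from speakers known to exhibit the target trait (invented examples).
    references = [
        "i love meeting new people at parties".split(),
        "talking to strangers always energizes me".split(),
    ]
    # Text generated by the model for the speaker under test.
    candidate = "i really enjoy meeting new people".split()

    smooth = SmoothingFunction().method1
    score = sentence_bleu(references, candidate, smoothing_function=smooth)
    print(round(score, 4))   # higher overlap with the trait references gives a higher score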

We evaluated the performance of the fine-tuned model on the validation and test sets using F1 and BLEU. The F1 score is a measure of the model’s precision and recall on the task of persona detection, while BLEU measures the quality of the generated responses by comparing them with human-generated responses.

6 Results

Based on the experiment results in the table, we can see that the BERT model performs the best in terms of F1 score on both evaluation sets, with F1 scores of 0.1981 and 0.1997 for ROC-STORIES and PMPC, respectively. The GPT model has the second highest F1 score, while the CNN model has the lowest F1 score.

Similarly, the BERT model also performs the best in terms of BLEU score on both evaluation sets, with BLEU scores of 0.009131 and 0.01911 for ROC-STORIES and PMPC, respectively.

The GPT model has the second-highest BLEU score, while the CNN model has the lowest BLEU score (Table 1).

Table 1. Results of the pre-training and fine-tuning

Upon examining the outcomes, it is evident that the pre-trained models outperform the non-pretrained models across both datasets, as shown by their higher F1 and BLEU scores. Of the three pre-trained models, BERT performs best, with an F1 score of 0.1981 and a BLEU score of 0.009131 on the ROCStories evaluation. The GPT model also performs notably, with scores of 0.1901 (F1) and 0.007131 (BLEU).

The CNN model shows the least favorable performance of the three pre-trained models, with scores of 0.0801 (F1) and 0.000431 (BLEU).

In general, the findings indicate that incorporating pre-training techniques using CNN, BERT, and GPT can enhance the accuracy of persona detection on both the PMPC and ROCStories datasets. The impressive performance exhibited by BERT and GPT models implies that these models provide significant potential for effectively addressing persona detection challenges within the field of natural language processing (NLP).

7 Conclusion

This study used the PMPC and ROCStories datasets to examine how pre-training affects the persona detection performance of CNN, BERT, and GPT models. We found that pre-training improved persona detection on both datasets, with BERT performing best of the three models. The findings are promising, but more research is needed to address the study's limitations, and pre-training should be explored alongside other methods to improve persona detection in natural language processing.

The results of this study can be used by researchers and practitioners working on persona detection. Pre-training on a large corpus of textual data allowed the models to learn linguistic patterns and relationships, making them better at assessing speakers' and writers' personalities, backgrounds, and motives. Accurate and effective persona detection can improve user experiences in chatbots, virtual assistants, and customer support.

8 Future Work

The lack of large, diverse, and annotated datasets for persona detection was a major challenge in this research. The PMPC and ROCStories datasets used in this study are relatively small, which may have limited the models' ability to capture persona detection tasks. Moreover, the datasets focus on persona matching rather than detection, limiting their applicability. The computational resources needed for model pre-training and fine-tuning were another issue: pre-training on large textual corpora is computationally and time intensive, which may limit model scalability. Future work may focus on data augmentation, model architecture design, transfer learning, and evaluation measures to address these concerns.