1 Introduction

Customer service is necessary for companies to keep their customers and to attract new ones [8]. Each consultant in a customer service team spends time solving different customer requests. However, as the number of customers increases, the waiting time increases as well, which results in poor customer satisfaction [49]. Several studies report that about 75% of customers experience poor customer service [1, 57]. Traditional customer service has two main weaknesses: (1) the staff usually receives repetitive questions asked by a variety of customers, and (2) it is difficult to support services 24/7, especially for small companies [8]. Chatbots, in contrast, are able to handle multiple users simultaneously and to operate around the clock [49].

A chatbot is a conversational software system that automatically interacts with a user or customer and is designed to emulate the communication capabilities of a human being [39]. A service-oriented chatbot acts as an automated customer service representative, giving natural-language answers [22]. Despite their great potential, many customer service chatbots do not meet customer expectations because they fail to correctly recognize the customer's requirements [37]. As a result, many service providers stop using their chatbots due to the negative feedback received from their customers [10]. For this reason it is necessary to evaluate a chatbot's effectiveness, i.e., its ability to recognize the customer's questions or requirements and to give a clear response or enable the required service.

Typically, automatic approaches evaluate a chatbot according to its ability to generate responses similar to those of a human [60]. However, in production environments, chatbots must correctly address the customer's requirements above all other aspects, such as naturalness, personality, or the ability to imitate human conversations. Figure 1 shows the operational cycle of a chatbot in production environments: the chatbot is exposed to a group of customers for some time to collect conversations between the chatbot and real customers, and those conversations are then evaluated to determine the chatbot's effectiveness and to update the chatbot rules in a retraining process. In practice, conversations are manually assessed using a small, randomly chosen reference group, which is rather limited due to the high cost of the manual evaluation process [42]. Hence, the number of evaluated conversations is not representative of the real state of the customer service process.

This work proposes an automatic effectiveness detector in which all collected conversations are used to measure the chatbot's effectiveness. The study considers a joint fusion strategy to merge information from customer questions and chatbot responses in each conversation. Customer questions and the answers provided by the chatbot are modeled with an embedding layer that feeds the convolutional layers of a Parallel Convolutional Neural Network (P-CNN). In the convolutional layers, questions and answers are analyzed independently; both representations are then concatenated in the flatten layer before the classification stage. The main difference between this fusion strategy and traditional approaches, such as early or late fusion, is that the loss is back-propagated to the feature extraction stage of the neural network, creating a loop that enables finding better feature representations based on both information sources [19]. This methodology is compared with a baseline model based on classical neural embedding methods and an early fusion strategy. Accuracies of up to 80.1% were obtained with the methodology presented in this work, which shows that it can accurately and automatically evaluate chatbot effectiveness in real production environments.

Fig. 1 Operational cycle of a chatbot in production environments using the automatic effectiveness detector proposed in this paper (denoted by dotted lines)

2 Related works

Usually, the effectiveness of chatbots in production is manually evaluated based on a small sample of randomly chosen conversations, or by using self-reported customer satisfaction; the latter is more often used because it is less expensive. Most studies based on self-reported satisfaction use Likert scales to evaluate different aspects of the conversation, such as effectiveness, quality, humanity, manner, and others [7, 16, 20, 44, 54, 70]. In [7] the authors proposed a framework for chatbot evaluation based on Grice's maxims, i.e., the four conversational maxims of quality, quantity, relation, and manner. A similar methodology is presented in [24], where users were asked to rate the chatbot's performance on Grice's maxims using a Likert scale; the authors used the chatbot conversations to test the correlation between human judgments and Grice's maxims. Another widely used methodology to evaluate chatbot performance is the PARAdigm for DIalogue System Evaluation (PARADISE), which estimates subjective factors by collecting user ratings through questionnaires. These subjective factors include ease of use, clarity, naturalness, friendliness, robustness, and willingness to use the system again [4]. The main drawback of using self-reported data to evaluate customer service and the effectiveness of conversations with chatbots is that typically only a few users complete the questionnaire, and the completed questionnaires are often influenced by external factors [40, 66].

There are also evaluation methods based on statistics of the conversations between customers and chatbots, for instance the number of times the chatbot was used [23], the number of times the user had to use help commands [45], the total number of dialogues and their duration [64], and the frequency of keywords related to the customer's feelings [42, 45]. In [17], the authors considered online human-human and human-chatbot conversations. They computed three principal variables to evaluate conversation quality: words per conversation, messages per conversation, and the average number of words per message. The results indicated that human-chatbot conversations lasted longer than human-human conversations but had shorter messages. In addition, human-human conversations had a more expansive vocabulary and contained more words expressing positive emotion. This approach can help to evaluate customer satisfaction or the customer's empathy with the chatbot. However, for evaluating chatbot effectiveness in real production environments, the duration of a conversation depends on the complexity of the requirement, so approaches based on word counts or conversation duration may not be appropriate.

Other methodologies assess the effectiveness of chatbots by comparing chatbot responses with reference corpora using similarity metrics such as the BiLingual Evaluation Understudy (BLEU) [41] and the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [32]. These metrics were originally proposed for machine translation systems and are based on similarity measures between generated texts and expected responses. Several works rely on these methods [12, 29, 30, 56, 58, 59, 69]; however, they have several limitations. They require reference corpora and focus on token-level overlap between the reference text and the generated one, so a valid response to a given statement in a conversation may have low token overlap with the reference response [64]. Additionally, token-level overlap has shown poor correlation with human judgments, which limits its use in real-world applications [33].

Adversarial evaluation methods for dialogue systems aim to solve these problems. In [25], the authors train a Generative Adversarial Network where the generator responds to a message and the discriminator aims to predict whether, given a message and a response, the response was produced by the generator or by a human. The idea is to evaluate chatbot quality according to whether the conversations it generates can be distinguished from conversations generated by a human. As generator, the authors take a fully trained production-scale conversation model deployed as part of the Smart Reply system, and they train a Recurrent Neural Network (RNN) as discriminator. The discriminator achieved accuracies of up to 62.5% when discriminating between conversational-model and human responses. The authors showed that the discriminator scores were strongly correlated with response length, which could pose challenges for chatbot evaluation in production environments where the response length depends on the customer's requirements. Moreover, this approach evaluates dialogue system quality based on the similarity of the system's responses to human responses, whereas in this paper we assess chatbot effectiveness by the ability to correctly address customer requests; other important aspects such as naturalness, satisfaction, and customer empathy are not considered.

Recent studies are mainly based on word embeddings extracted from pre-trained models, where the chatbot is evaluated using a similarity metric between a given response and a reference [15, 33, 55, 67]. The most commonly used embeddings are Global Vectors for word representation (GloVe) [43], Word2Vec [36], Bidirectional Encoder Representations from Transformers (BERT) [9], and BETO, the Spanish version of BERT [5]. These embedding-based approaches are less studied for evaluating the effectiveness of communications between chatbots and customers, but they are capturing the attention of the research community thanks to their flexibility and good performance in other fields such as sentiment analysis [18], mental health evaluation [47, 52], satisfaction assessment [51], and others.

3 Materials and methods

3.1 Data

Two databases are considered in this work. Both contain conversations between chatbots and customers of two different companies in Colombia. A summary of the metadata for the two databases is provided in Table 1. Each conversation was labeled by a group of linguistic experts according to the effectiveness of the service/responses provided by the chatbot (i.e., effective vs. ineffective). Each corpus has a specific semantic content because the companies offer different products and their chatbots were trained independently. The average number of words in the chatbots' answers also differs because their response structures are different. These databases allow us to validate whether our approach is general and suitable for different types of chatbots trained for different markets, regardless of the semantic content or the structure of their responses. In both databases, all conversations go through a pre-processing stage that consists of removing capital letters, accents, emoticons, URLs, buttons, punctuation, and numbers. Figure 2 shows real examples from our databases of ineffective (left) and effective (right) conversations in which the user reports technical issues with the Internet service. The conversations were translated from their original versions in Spanish.
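As a minimal sketch, the pre-processing stage could be implemented as follows (the exact rules are an assumption; the paper lists the removed elements but not their implementation):

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    """Rough sketch of the pre-processing described above."""
    text = text.lower()                                   # remove capital letters
    text = text.replace("ñ", "\u0000")                    # protect the Spanish ñ
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))  # strip accents
    text = text.replace("\u0000", "ñ")
    text = re.sub(r"https?://\S+", " ", text)             # remove URLs
    text = re.sub(r"[^a-zñ\s]", " ", text)                # punctuation, digits, emoticons
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("¡Hola! Mi Internet NO funciona :( ver https://example.com 123"))
# -> "hola mi internet no funciona ver"
```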

Table 1 Summary of information included in the two databases
Fig. 2 Examples of ineffective (left) and effective (right) conversations when a user reports failures with the Internet

3.1.1 Database 1

Conversations were collected from a chatbot trained to provide service to customers of a pension administration company. Most customers who interact with the chatbot are looking for information about an already purchased service, account status, membership certificates, how to cancel a service, and others. This corpus consists of 3536 conversations between the chatbot and customers. Effective conversations have an average of 5.53 interactions (questions + answers) while the ineffective ones have an average of 5.29 interactions.

3.1.2 Database 2

This corpus contains 1660 conversations collected from a chatbot that provides customer service for a telecommunications (Telco) company. The requirements handled by this chatbot are related to technical support or to contracting telecommunication services. The chatbot should provide information about service plans, coverage, technical assistance, and others. In this case the chatbot's answers have a simpler structure and therefore use fewer words than in database 1. The average numbers of words in effective and ineffective conversations are 2.93 and 5.76, respectively.

3.2 Text representation with word embeddings: baseline models

For a given set of text documents \(D = \{d_1, d_2, d_3, \ldots, d_n\}\), the problem of text representation consists of finding a numerical representation for each element of \(D\) such that the similarity between each pair of points is well defined [68]. Word embeddings are real-valued representations of words produced by distributional semantic models [2]. There are different methods and architectures to generate word embeddings, and they can be either context-independent or context-dependent. In this work we consider four embedding methods as baselines for comparison purposes: two variants of Word2Vec, which are context-independent, and BERT/BETO, which are context-dependent. Details of each method are presented below.

3.2.1 Context-independent method: Word2Vec

Word2Vec is one of the most used embedding methods in the literature [31]. The model takes a large text corpus as input and produces a vector space, typically of several hundred dimensions. Word vectors are positioned in this space such that words sharing a common context in the corpus are geometrically close to each other [36]. There is a unique vector for each word in the corpus; for this reason it is known as a context-independent embedding, where the representation of a word is the same regardless of its context.

The algorithm uses a neural network to learn word relations. Words are transformed into one-hot representations to feed the network, and the size of these vectors depends on the number of words in the corpus. Word2Vec embeddings can be obtained with two strategies: Skip-Gram or Continuous Bag Of Words (CBOW). In this study we only consider the latter, for practical reasons. CBOW takes the context of each word (one-hot encoded) as input, and the network aims to predict the word corresponding to that context. The number of context words is defined beforehand; typically a number between 3 and 7 is a good choice [6]. The network structure is shown in Fig. 3: the input layer includes the V-dimensional one-hot encoded vectors of the context words, the hidden layer contains N neurons, and the output is a V-dimensional vector corresponding to the target word.

Fig. 3 CBOW model. C: number of context words; V: number of unique words in the corpus; W: weight matrix before the hidden layer; W′: weight matrix after the hidden layer; N: Word2Vec embedding dimension. Figure adapted from [53]

Embeddings are trained on large-scale corpora, and their dimension typically varies between 50 and 1024 depending on the vocabulary size. To obtain a vector representation for a document or a sentence, statistical functionals are computed over the embeddings of the words within the text. We trained two models with dimensions 100 and 300, namely 100-W2V and 300-W2V, respectively. Both models were trained on the Spanish WikiCorpus, which contains 120 million words [50], using the CBOW strategy with 7 context words.
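For illustration, the 100-W2V configuration could be reproduced roughly as follows with the gensim library (a sketch; the paper does not disclose its implementation, and the two-sentence corpus here is a hypothetical stand-in for the tokenized Spanish WikiCorpus):

```python
import numpy as np
from gensim.models import Word2Vec

# Placeholder corpus: in the paper this would be the tokenized Spanish WikiCorpus.
sentences = [["mi", "internet", "no", "funciona"],
             ["necesito", "un", "certificado", "de", "afiliacion"]]

# CBOW (sg=0) with a 7-word context window and 100 dimensions: the 100-W2V setting.
w2v = Word2Vec(sentences, vector_size=100, window=7, sg=0, min_count=1)

def doc_vector(tokens):
    """One simple functional (the mean) over the word vectors of a document."""
    E = np.array([w2v.wv[t] for t in tokens if t in w2v.wv])
    return E.mean(axis=0)

print(doc_vector(["mi", "internet", "no", "funciona"]).shape)  # (100,)
```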

3.2.2 Context-dependent methods: BERT and BETO

In this type of embedding the representation of a word is not unique because it depends on the word's context. Recent developments in context-dependent embeddings [9, 48] show that systems based on such representations achieve good results in many Natural Language Processing (NLP) tasks [35]. BERT [9] is one of the most popular context-dependent embeddings. It is based on the Transformer [63], originally created for machine translation. The Transformer includes two separate mechanisms: an encoder that reads the input text and a decoder that produces a prediction for the task. The most important part of the Transformer architecture is the multi-head attention mechanism, which learns contextual relations among words (or sub-words) in a text. The encoder is formed by a stack of layers that include self-attention and feed-forward connections. Decoders include all the elements of the encoder, with an additional encoder-decoder attention layer between the self-attention and feed-forward layers [63]. Figure 4 shows the encoder and decoder of the Transformer architecture. As with the context-independent word embeddings described above, documents or sentences are represented by statistical functionals.

Fig. 4 Topology of the Transformer architecture. K is the number of layers in the encoder and decoder. Figure adapted from [63]

Two pre-trained BERT models are considered for our baseline. The first is the BERT-Base Multilingual Uncased model, which was trained with the Multi-Genre Natural Language Inference (MultiNLI) corpus. The second is a BERT-Base model trained with Spanish data from Wikipedia and all sources of the OPUS project [62]; this corpus contains about 3 billion words and is available online. The model is commonly known as BETO [5]. The BERT-Base architecture consists of 12 self-attention layers, each with 768 hidden units, for a total of 110M parameters. The last layer (768 units) is taken as the word-embedding representation. The source code to compute BERT and BETO embeddings is also available online [46].
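As an illustration, last-layer BETO embeddings can be obtained with the Hugging Face transformers library roughly as follows (a sketch; the model identifier is an assumption and may differ from the exact checkpoint used by the authors):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "dccuchile/bert-base-spanish-wwm-uncased"  # assumed BETO checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def word_embeddings(text: str) -> torch.Tensor:
    """Last-layer hidden states: one 768-dimensional vector per token."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state.squeeze(0)   # shape: (n_tokens, 768)

print(word_embeddings("mi internet no funciona desde ayer").shape)
```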

3.3 Parallel convolutional neural networks (P-CNN)

Convolutional networks integrate the feature extraction and feature selection stages together with the pattern classification algorithm in a single architecture [71]. These networks contain a structure formed by convolutional filters and pooling layers, instead of the fully connected layers typically used in classical deep neural networks [13]. CNNs have been widely used in computer vision [3, 21, 28], and in recent years their applications have been extended to other NLP domains, including machine translation [22], sentence/document classification [26, 65], generic text representations [11, 61], text-based sentiment analysis [27, 34, 38], and others. There are also works related to market analysis, such as the prediction of customer withdrawal based on transcriptions [14, 71].

CNN architectures for NLP are typically used to extract sentence representations. Typical architectures include convolutional layers and max-pooling operations over all resulting feature maps. A sentence with \(l\) words can be represented as a matrix \(\mathbf{X} \in \mathbb{R}^{l \times d}\), whose rows are the word embeddings of dimension \(d\). In the convolutional layer the matrix is convolved with filters \(\mathbf{W} \in \mathbb{R}^{m \times d}\); each filter has a different size \(m\), but all filters share the dimension \(d\), which corresponds to the word-embedding dimension. The main idea of the CNN for text classification is to extract semantic features with a temporal resolution through the convolution operation. Different filter sizes correspond to different values of \(n\) in n-grams: filters of size 2 × d, 3 × d, and 4 × d map bi-gram, tri-gram, and four-gram relationships, respectively. During the convolution, each filter gradually moves down one word at a time along the sequence of words (i.e., vertically). A max-pooling operation is applied after the convolutions, and the final step is performed by a fully connected layer. The classification result for a sample is obtained by applying a Sigmoid function to the output layer.
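The following sketch illustrates this classical single-input text CNN in Keras (the vocabulary size, sequence length, and number of filters are illustrative assumptions, not the paper's settings):

```python
from tensorflow.keras import Model, layers

VOCAB, SEQ_LEN, EMB_DIM = 20000, 200, 100     # illustrative values

inp = layers.Input(shape=(SEQ_LEN,))
x = layers.Embedding(VOCAB, EMB_DIM)(inp)     # word-embedding matrix X (l x d)
pooled = []
for m in (2, 3, 4):                           # bi-, tri-, and four-gram filters
    c = layers.Conv1D(filters=64, kernel_size=m, activation="relu")(x)
    pooled.append(layers.GlobalMaxPooling1D()(c))   # max-pooling per filter size
h = layers.Concatenate()(pooled)
out = layers.Dense(1, activation="sigmoid")(h)      # Sigmoid output layer

model = Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```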

In this work, we propose a novel architecture based on parallel CNNs, as shown in Fig. 5. The input corresponds to word embeddings of the questions and answers of each conversation. Two parallel convolutional layers extract the feature vectors for questions and answers separately. Each convolutional layer has three parallel filters of different orders to exploit bi-gram, tri-gram, and four-gram relationships among words, simultaneously allowing semantic features to be extracted at multiple temporal resolutions. The output vectors of this process have the same dimension because both convolutional layers have the same number of filters. These two vectors are concatenated before the fully connected layer to obtain a complete representation of the conversation. In Fig. 5, \(l_q\) and \(l_a\) correspond to the maximum number of words in the questions and answers of the corpus, respectively: \(l_q\) is 1297 for database 1 and 665 for database 2, while \(l_a\) is 4429 for database 1 and 7790 for database 2. The other dimension of the matrices corresponds to the size of the embedding layer, which was set to 100 for consistency with other studies based on simple pre-trained Word2Vec models and with our baseline models. The best parameter configuration was chosen experimentally based on performance, considering the smallest possible number of parameters. The proposed approach keeps the computational cost low compared with Transformer-based methodologies and captures spatial dependencies between words, which is not possible with RNN-based architectures. The general methodology of the proposed approach is shown in Fig. 6, and a sketch of the architecture is given below.
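A sketch of the parallel topology in Fig. 5, reusing the single-branch pattern shown above for questions and answers (the number of filters and the vocabulary size are illustrative assumptions; \(l_q\), \(l_a\), and the embedding dimension follow the values reported for database 1):

```python
from tensorflow.keras import Model, layers

VOCAB, EMB_DIM = 20000, 100       # vocabulary size is an assumption
L_Q, L_A = 1297, 4429             # l_q and l_a for database 1

def branch(seq_len):
    """Embedding + parallel 2/3/4-gram convolutions + max-pooling."""
    inp = layers.Input(shape=(seq_len,))
    x = layers.Embedding(VOCAB, EMB_DIM)(inp)
    pooled = [layers.GlobalMaxPooling1D()(
        layers.Conv1D(64, m, activation="relu")(x)) for m in (2, 3, 4)]
    return inp, layers.Concatenate()(pooled)

q_in, q_feat = branch(L_Q)        # customer questions
a_in, a_feat = branch(L_A)        # chatbot answers
h = layers.Concatenate()([q_feat, a_feat])     # joint fusion before classification
out = layers.Dense(1, activation="sigmoid")(h)

pcnn = Model([q_in, a_in], out)
pcnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Because the two branches are trained jointly, the loss is back-propagated through both embedding layers, which is the joint fusion property discussed in the introduction.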

Fig. 5 Typical CNN architecture. \(l_q\) and \(l_a\) are the number of words in questions and answers, respectively

Fig. 6 General methodology of the proposed approach

3.4 Classification

Four word-embedding models are used separately as baselines. Each conversation is divided into two subgroups: questions and answers. Once the word-embedding model (100-W2V, 300-W2V, BERT, or BETO) is used to create the embeddings of each subgroup, six statistical functionals are computed: mean, standard deviation, skewness, kurtosis, minimum, and maximum. The vectors with the six functionals of the two subgroups are concatenated into a new vector \(\mathbf{p}\), whose dimension depends on the embedding model: \(\mathbf{p} \in \mathbb{R}^{1200}\) for 100-W2V, \(\mathbf{p} \in \mathbb{R}^{3600}\) for 300-W2V, and \(\mathbf{p} \in \mathbb{R}^{4608}\) for BERT and BETO. An SVM with Gaussian kernel is used for classification. The hyper-parameters of the classifier (C and γ) are optimized on the training set via grid search, with C ∈ {0.0001, 0.001, …, 1000} and γ ∈ {0.0001, 0.001, …, 1000}. A sketch of this baseline pipeline is given at the end of this subsection.

Note that in the proposed approach the classification decision is made on the final layer of the architecture, which has a single output neuron, as shown in Fig. 5. The activation function is a Sigmoid, so the output values range from 0 to 1. The decision threshold is set to 0.5 to classify ineffective vs. effective conversations.
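The baseline pipeline described above can be sketched as follows (function and variable names are hypothetical; the grid matches the powers of ten stated in the text):

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def functionals(E: np.ndarray) -> np.ndarray:
    """Six functionals over a (words x dim) embedding matrix."""
    return np.concatenate([E.mean(0), E.std(0), stats.skew(E, axis=0),
                           stats.kurtosis(E, axis=0), E.min(0), E.max(0)])

def conversation_vector(Q: np.ndarray, A: np.ndarray) -> np.ndarray:
    """p = [functionals(questions); functionals(answers)]."""
    return np.concatenate([functionals(Q), functionals(A)])

# Grid search over C and gamma on the training set, as described above.
grid = {"C": 10.0 ** np.arange(-4, 4), "gamma": 10.0 ** np.arange(-4, 4)}
svm = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
# svm.fit(P_train, y_train)   # P_train: one vector p per conversation
```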

4 Experiments and results

We evaluated the proposed P-CNN architecture on the two databases described in Section 3.1. The results are compared with those obtained with the baseline models. The datasets are divided into 70% for training and 30% for testing, ensuring class balance in each subset. Optimal parameters are found on the training set and then evaluated on the test set. This process is repeated 10 times with random selections of the train and test subsets, i.e., 10 independent experiments; a sketch of this protocol is shown below. Details of the baseline experiments, the proposed architecture, and the evaluation process can be found in the source code available online.
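A sketch of the evaluation protocol with scikit-learn (the feature matrix and labels are hypothetical placeholders):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.random.rand(3536, 1200)            # placeholder feature matrix
y = np.random.randint(0, 2, 3536)         # placeholder effectiveness labels

# 10 independent, class-balanced 70/30 train/test splits.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
for train_idx, test_idx in splitter.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit on (X_train, y_train), report metrics on (X_test, y_test)
```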

4.1 Evaluation of the baseline models

Context-dependent and context-independent embeddings are considered to evaluate the baseline approaches, following the methodology depicted in Fig. 7. Results are reported in Table 2 in terms of accuracy (Acc), sensitivity (Sens), specificity (Spec), and Area Under the ROC Curve (AUC). Note that among the classical word embeddings, 100-W2V yields the best results on the two databases, with accuracies of 76.04% and 79.80% for database 1 and database 2, respectively. It is also worth noting that specificity is always higher on both databases, indicating that the models are better at detecting conversations where the chatbot was not able to provide good service or was unable to understand what the customer was asking for. This is a desirable characteristic because the QoS areas of companies mainly focus on detecting problematic cases, i.e., those where the company or service provider failed to leave the customer happy or satisfied. Figure 8 shows the results more compactly through the ROC curves obtained for each database and each pre-trained word-embedding model.

Fig. 7 General methodology to evaluate the baseline models

Table 2 Results obtained with classical word embeddings
Fig. 8 ROC curves obtained for database 1 (left) and database 2 (right) with classical word embeddings

4.2 Evaluation of the parallel CNN

In this experiment each database is evaluated independently, following the methodology shown in Fig. 6. Results are reported in Table 3 in terms of accuracy (Acc), sensitivity (Sens), specificity (Spec), F1-score, and AUC.

Table 3 Results obtained with the proposed CNN

Accuracies of 79.0% and 80.2% are obtained for database 1 and database 2, respectively. The proposed approach yields better results on both databases than the baseline models (see Table 2). Its main advantage is that no pre-trained models were used to generate the embeddings that build the input matrices; the embeddings are generated directly by the network through an embedding layer. As with the baseline models, performance is similar on both databases. Specificity is again higher in both cases, indicating that the proposed approach is more accurate at detecting ineffective conversations. The ROC curves for these experiments are shown in Fig. 9. Performance is similar in both cases, which likely indicates that the proposed method is equally suitable for the two scenarios considered in this paper.

Fig. 9 ROC curves obtained for database 1 (DB1) and database 2 (DB2) with the proposed approach

4.3 Comparison between the proposed approach and the baseline models

Besides the classification experiments, Mann-Whitney U tests are performed to evaluate how significantly the classification scores differentiate the two classes: effective vs. ineffective conversations. Four cases are explored, and the results are summarized in Fig. 10. For the CNN approach, the scores of the activation function in the output layer are used in the tests. For the baseline approach, the samples' distances to the optimal hyperplane of the SVM are used as scores; only the 100-W2V embedding model is considered. The SVM scores were divided by the maximum absolute score and then passed through the Sigmoid function, so that the scores of the P-CNN and the baseline model lie in the same range. A sketch of this procedure is shown below.
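A sketch of the score normalization and the statistical test with SciPy (the score arrays are placeholders standing in for the actual model outputs):

```python
import numpy as np
from scipy.special import expit          # logistic Sigmoid
from scipy.stats import mannwhitneyu

def normalize_svm_scores(d: np.ndarray) -> np.ndarray:
    """Divide by the maximum absolute score, then apply the Sigmoid."""
    return expit(d / np.abs(d).max())

# Placeholder score distributions for the two classes.
scores_effective = np.random.beta(5, 2, size=300)
scores_ineffective = np.random.beta(2, 5, size=300)

u, p = mannwhitneyu(scores_effective, scores_ineffective,
                    alternative="two-sided")
print(f"U = {u:.1f}, p = {p:.3e}")
```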

Fig. 10 Distribution and box-plots of the scores obtained from the CNN approach and the baseline model

According to the statistical tests, there are significant differences between the score distributions of the two classes for both approaches and both databases (p-value ≪ 0.001). Hence, in principle, the two methods (word embeddings and the proposed parallel CNN) are both suitable to classify effective vs. ineffective conversations. We performed additional Mann-Whitney U tests to evaluate whether the difference between the P-CNN and the best baseline model is significant: for each class (ineffective and effective), we compared the score distributions of the P-CNN and the best baseline model on both databases. According to these tests, there is a significant difference between the score distributions of the two models for both classes and both databases (p-value ≪ 0.001). Figure 11 shows the comparison between the score distributions of both models. In addition, the classification results improved by 2.9 and 0.38 percentage points for database 1 and database 2, respectively, when using the proposed P-CNN instead of the baseline models.

Fig. 11 Comparison between the best baseline model and the CNN approach

5 Discussion

The methodology proposed in this paper aims to discriminate between effective and ineffective conversations. It allows word embeddings to be trained within the model itself; therefore, specific terms commonly used in the company's or market's language obtain a representation according to the semantics of the conversations and do not depend on the semantics of an external corpus. On the other hand, the vocabulary of the trained model is limited to the one used in the conversations of the training database, so if the model is tested in another environment, its performance will depend on how similar the semantic fields of the new environment are to the vocabulary of the training database. This limitation may negatively affect its generalization capability; however, we believe that in general terms the advantages outweigh this drawback, especially considering that the typical application scenario for this technology is a company that already has a chatbot in production and needs to evaluate the effectiveness of its service. In such cases, a training set can be generated to feed the P-CNN and create the evaluation model.

Regarding the experiments with the baseline models, we expected better results from modern pre-trained models such as BERT or BETO. However, the evidence shows that a simpler model is sufficient to address the problem presented in this paper. We believe this is due to the specific application addressed: in both cases the customer knew that (s)he was interacting with a chatbot, and thus used more concrete and precise language with few context words.

It is noteworthy that the classification of conversations using the proposed model does not evaluate the quality of service perceived by the customer. Our system is trained with labels assigned by a human expert in linguistics according to whether the customer's requirements were correctly addressed (effective conversation) or not. This means that the output of the system is in line with the evaluation a human would assign to a given conversation. Hence, the result can be used to assess the effectiveness of a given chatbot, to detect ineffective conversations, and then to re-train the chatbot to improve its capability to effectively serve customers. Note that a model aligned with customer feedback would capture customer satisfaction rather than chatbot effectiveness: for instance, a conversation can be correctly addressed by the chatbot while the customer still does not feel satisfactorily served.

The main application of our system is to evaluate and improve a chatbot's performance. On the one hand, the percentage of ineffective conversations detected by the model in a given time period can be used to determine whether the chatbot's training was adequate or whether some rules need to be updated. On the other hand, the conversations classified as ineffective can be analyzed to determine the source of the chatbot's failure and to update the rules accordingly.

A limitation of the proposed methodology is its computational complexity, which does not allow the real-time evaluation of conversations. However, it is possible to make evaluations over time intervals, for instance the number of ineffective conversations per hour, per day, or per week, depending on the traffic of conversations processed by the chatbot. Finally, this methodology was implemented by the Colombian company Pratech Group to measure the effectiveness of its chatbots.

6 Conclusion

The effectiveness of chatbots during conversations with customers of two different companies is automatically evaluated in terms of whether the service requested by the customer was effectively or ineffectively provided by the chatbot. Classical word-embedding approaches such as Word2Vec and BERT are used as baselines, and their performance is compared with a novel approach, introduced in this paper, based on parallel CNNs with multiple temporal resolutions. Questions from the customers and answers from the chatbots are modeled independently by two parallel convolutional layers. Each layer is composed of three filters covering multiple temporal resolutions, so that bi-gram, tri-gram, and four-gram relationships among the words are considered simultaneously when extracting the feature vectors of the questions and the answers.

The proposed approach produces better accuracies than the baseline models on the two databases, with improvements between 0.38 and 2.9 percentage points depending on the database. Its main advantage is that it does not depend on pre-trained models, which are typically created from millions of words that are not necessarily related to the context of the given task (i.e., the target corpus). Our method allows specific models to be created for each context, generating more accurate systems adapted to particular needs. This work is a step forward in automatic chatbot effectiveness evaluation and will allow companies to improve their QoS monitoring processes: instead of relying on self-reported satisfaction surveys, the service provided by the chatbot can be evaluated accurately and automatically. A product based on this methodology has already been developed by the company Pratech Group.

Other challenges like naturalness and personality for the chatbots’ language might be studied in future research.