1 Introduction

Customer service is necessary for companies to keep their customers and to attract new ones [8]. Each consultant in a customer service team spends time solving different customer requests. However, as the number of customers increases, the waiting time increases as well, which results in poor customer satisfaction [49]. Several studies report that about 75% of customers experience poor customer service [1, 57]. Traditional customer service has two main weaknesses: (1) the staff usually receives repetitive questions asked by a variety of customers, and (2) it is difficult to support services 24/7, especially for small companies [8]. Chatbots, in contrast, are able to handle multiple users simultaneously and to operate around the clock [49].

A chatbot is a conversational software system that automatically interacts with a user or customer and is designed to emulate the communication capabilities of a human being [39]. A service-oriented chatbot acts as an automated customer service representative, giving natural-language answers [22]. Despite their great potential, many customer service chatbots do not meet customer expectations because they fail to correctly recognize the customer's requirements [37]. As a result, many service providers stop using their chatbots due to the negative feedback received from their customers [10]. For this reason it is necessary to evaluate a chatbot's effectiveness, i.e., its ability to recognize the customer's questions or requirements and to give a clear response or enable the required service.

Typically, automatic approaches evaluate a chatbot according to its ability to generate responses similar to those of a human [60]. However, in production environments, chatbots must correctly address the customer's requirements above all other aspects, such as naturalness, personality, or the ability to imitate human conversations. Figure 1 shows the operational cycle of a chatbot in production environments: the chatbot is exposed to a group of customers for some time to collect conversations between the chatbot and real customers, and those conversations are then evaluated to determine the chatbot's effectiveness and to update the chatbot rules in a retraining process. In practice, conversations are manually assessed using a small, randomly chosen reference group, which is rather limited due to the high cost of the manual evaluation process [42]. Hence, the number of evaluated conversations is not representative of the real state of the customer service process.

This work proposes an automatic effectiveness detector in which all collected conversations are used to measure the chatbot's effectiveness. The study considers a joint fusion strategy to merge information from customer questions and chatbot responses in each conversation. Customer questions and the answers provided by the chatbot are modeled with an embedding layer that feeds the convolutional layers of a Parallel Convolutional Neural Network (P-CNN). In the convolutional layers, questions and answers are analyzed independently; both representations are then concatenated in the flatten layer before the classification stage. The main difference between this fusion strategy and traditional approaches, such as early or late fusion, is that the loss is back-propagated to the feature extraction stage of the neural network, creating a loop that enables finding better feature representations based on both information sources [19]. This methodology is compared with a baseline model based on classical neural embedding methods and an early fusion strategy. Accuracies of up to 80.1% were obtained with the methodology presented in this work, which shows that it can accurately and automatically evaluate chatbot effectiveness in real production environments.

Fig. 1 Operational cycle of a chatbot in production environments using the automatic effectiveness detector proposed in this paper (denoted by dotted lines)

2 Related works

Usually, the effectiveness of chatbots in production is manually evaluated based on a small sample of randomly chosen conversations, or by using self-reported customer satisfaction; the latter is more often used because it is less expensive. Most studies based on self-reported satisfaction use Likert scales to evaluate different aspects of the conversation, such as effectiveness, quality, humanity, manner, and others [7, 16, 20, 44, 54, 70]. In [7] the authors proposed a framework for chatbot evaluation based on Grice's maxims, i.e., the four conversational maxims of quality, quantity, relation, and manner. A similar methodology is presented in [24], where users were asked to rate the chatbot's performance on Grice's maxims using a Likert scale; the authors used the chatbot conversations to test the correlation between human judgments and Grice's maxims. Another widely used methodology to evaluate chatbot performance is the PARAdigm for DIalogue System Evaluation (PARADISE), which estimates subjective factors by collecting user ratings through questionnaires. These subjective factors include ease of use, clarity, naturalness, friendliness, robustness, and willingness to use the system again [4]. The main drawback of using self-reported data to evaluate customer service and the effectiveness of conversations with chatbots is that typically only a few users complete the questionnaire, and the completed questionnaires are often influenced by external factors [40, 66].

There are also evaluation methods based on statistics of the conversations between customers and chatbots, for instance the number of times the chatbot was used [23], the number of times the user had to use help commands [45], the total number of dialogues and their duration [64], and the frequency of keywords related to the customer's feelings [42, 45]. In [17], the authors considered online human-human and human-chatbot conversations. They computed three principal variables to evaluate conversation quality: words per conversation, messages per conversation, and the average number of words per message. The results indicated that human-chatbot conversations lasted longer than human-human conversations but had shorter messages. In addition, human-human conversations had a more expansive vocabulary and contained more words expressing positive emotion. This approach can help to evaluate customer satisfaction or the customer's empathy with the chatbot. However, for evaluating chatbot effectiveness in real production environments, the duration of a conversation depends on the complexity of the requirement, so approaches based on word counts or conversation duration may not be appropriate.

Other methodologies assess the effectiveness of chatbots by comparing chatbot responses with reference corpora using similarity metrics such as the BiLingual Evaluation Understudy (BLEU) [41] and the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [32]. These metrics were originally proposed for machine translation systems and are based on similarity measures between generated texts and expected responses. Several works rely on these methods [12, 29, 30, 56, 58, 59, 69]; however, they have several limitations. They require reference corpora and focus on token-level overlap between the reference text and the generated one, so a valid response to a given statement in a conversation may have low token overlap with the reference response [64]. Additionally, token-level overlap has shown poor correlation with human judgments, which limits its use in real-world applications [33].

Adversarial evaluation methods for dialogue systems aim to solve these problems. In [25], the authors train a Generative Adversarial Network where the generator responds to a message and the discriminator aims to predict whether, given a message and a response, the response was produced by the generator or by a human. The idea is to evaluate chatbot quality according to whether the conversations it generates can be distinguished from conversations generated by a human. As generator, the authors take a fully trained production-scale conversation model deployed as part of the Smart Reply system, and they train a Recurrent Neural Network (RNN) as discriminator. The discriminator achieved accuracies of up to 62.5% when discriminating between conversational-model and human responses. The authors showed that the discriminator scores were strongly correlated with response length, which could pose challenges for chatbot evaluation in production environments where the response length depends on the customer's requirements. Moreover, this approach evaluates dialogue system quality based on the similarity of the system's responses to human responses, whereas in this paper we assess chatbot effectiveness by the ability to correctly address customer requests; other important aspects such as naturalness, satisfaction, and customer empathy are not considered.

Recent studies are mainly based on word embeddings extracted from pre-trained models, where the chatbot is evaluated using a similarity metric between a given response and a reference [15, 33, 55, 67]. The most commonly used embeddings are Global Vectors for word representation (GloVe) [43], Word2Vec [36], Bidirectional Encoder Representations from Transformers (BERT) [9], and BETO, the Spanish version of BERT [5]. These embedding-based approaches are less studied for evaluating the effectiveness of communications between chatbots and customers, but they are capturing the attention of the research community thanks to their flexibility and good performance in other fields such as sentiment analysis [18], mental health evaluation [47, 52], satisfaction assessment [51], and others.

3 Materials and methods

3.1 Data

Two databases are considered in this work. Both contain conversations between chatbots and customers of two different companies in Colombia. A summary of the metadata for the two databases is provided in Table 1. Each conversation was labeled by a group of linguistic experts according to the effectiveness of the service/responses provided by the chatbot (i.e., effective vs. ineffective). Each corpus has a specific semantic content because the companies offer different products and their chatbots were trained independently. The average number of words in the chatbots' answers also differs because their response structures are different. These databases allow us to validate whether our approach is general and suitable for different types of chatbots trained for different markets, regardless of the semantic content or the structure of their responses. In both databases, all conversations go through a pre-processing stage that consists of removing capital letters, accents, emoticons, URLs, buttons, punctuation, and numbers. Figure 2 shows real examples from our databases of ineffective (left) and effective (right) conversations in which the user reports technical issues with the Internet service. The conversations were translated from their original versions in Spanish.
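As a minimal sketch, the pre-processing stage could be implemented as follows (the exact rules are an assumption; the paper lists the removed elements but not their implementation):

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    """Rough sketch of the pre-processing described above."""
    text = text.lower()                                   # remove capital letters
    text = text.replace("ñ", "\u0000")                    # protect the Spanish ñ
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))  # strip accents
    text = text.replace("\u0000", "ñ")
    text = re.sub(r"https?://\S+", " ", text)             # remove URLs
    text = re.sub(r"[^a-zñ\s]", " ", text)                # punctuation, digits, emoticons
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("¡Hola! Mi Internet NO funciona :( ver https://example.com 123"))
# -> "hola mi internet no funciona ver"
```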

Table 1 Summary of information included in the two databases
Fig. 2 Examples of ineffective (left) and effective (right) conversations when a user reports failures with the Internet

3.1.1 Database 1

Conversations were collected from a chatbot trained to provide service to customers of a pension administration company. Most customers who interact with the chatbot are looking for information about an already purchased service, account status, membership certificates, how to cancel a service, and others. This corpus consists of 3536 conversations between the chatbot and customers. Effective conversations have an average of 5.53 interactions (questions + answers) while the ineffective ones have an average of 5.29 interactions.

3.1.2 Database 2

This corpus contains 1660 conversations collected from a chatbot that provides customer service for a telecommunications (Telco) company. The requirements handled by this chatbot are related to technical support or to contracting telecommunication services. The chatbot should provide information about service plans, coverage, technical assistance, and others. In this case the chatbot's answers have a simpler structure and therefore use fewer words than in database 1. The average numbers of words in effective and ineffective conversations are 2.93 and 5.76, respectively.

3.2 Text representation with word embeddings: baseline models

For a given set of text documents \(D = \{d_1, d_2, d_3, \ldots, d_n\}\), the problem of text representation consists of finding a numerical representation for each element of \(D\) such that the similarity between each pair of points is well defined [68]. Word embeddings are real-valued representations of words produced by distributional semantic models [2]. There are different methods and architectures to generate word embeddings, and they can be either context-independent or context-dependent. In this work we consider four embedding methods as baselines for comparison purposes: two variants of Word2Vec, which are context-independent, and BERT/BETO, which are context-dependent. Details of each method are presented below.

3.2.1 Context-independent method: Word2Vec

Word2Vec is one of the most used embedding methods in the literature [31]. The model takes a large text corpus as input and produces a vector space, typically of several hundred dimensions. Word vectors are positioned in this space such that words sharing a common context in the corpus are geometrically close to each other [36]. There is a unique vector for each word in the corpus; for this reason it is known as a context-independent embedding, where the representation of a word is the same regardless of its context.

The algorithm uses a neural network to learn word relations. Words are transformed into one-hot representations to feed the network, and the size of these vectors depends on the number of words in the corpus. Word2Vec embeddings can be obtained with two strategies: Skip-Gram or Continuous Bag Of Words (CBOW). In this study we only consider the latter, for practical reasons. CBOW takes the context of each word (one-hot encoded) as input, and the network aims to predict the word corresponding to that context. The number of context words is defined beforehand; typically a number between 3 and 7 is a good choice [6]. The network structure is shown in Fig. 3: the input layer includes the V-dimensional one-hot encoded vectors of the context words, the hidden layer contains N neurons, and the output is a V-dimensional vector corresponding to the target word.

Fig. 3 CBOW model. C: number of context words; V: number of unique words in the corpus; W: weight matrix before the hidden layer; W′: weight matrix after the hidden layer; N: Word2Vec embedding dimension. Figure adapted from [53]

Embeddings are trained on large-scale corpora, and their dimension typically varies between 50 and 1024 depending on the vocabulary size. To obtain a vector representation for a document or a sentence, statistical functionals are computed over the embeddings of the words within the text. We trained two models with dimensions 100 and 300, namely 100-W2V and 300-W2V, respectively. Both models were trained on the Spanish WikiCorpus, which contains 120 million words [50], using the CBOW strategy with 7 context words.
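For illustration, the 100-W2V configuration could be reproduced roughly as follows with the gensim library (a sketch; the paper does not disclose its implementation, and the two-sentence corpus here is a hypothetical stand-in for the tokenized Spanish WikiCorpus):

```python
import numpy as np
from gensim.models import Word2Vec

# Placeholder corpus: in the paper this would be the tokenized Spanish WikiCorpus.
sentences = [["mi", "internet", "no", "funciona"],
             ["necesito", "un", "certificado", "de", "afiliacion"]]

# CBOW (sg=0) with a 7-word context window and 100 dimensions: the 100-W2V setting.
w2v = Word2Vec(sentences, vector_size=100, window=7, sg=0, min_count=1)

def doc_vector(tokens):
    """One simple functional (the mean) over the word vectors of a document."""
    E = np.array([w2v.wv[t] for t in tokens if t in w2v.wv])
    return E.mean(axis=0)

print(doc_vector(["mi", "internet", "no", "funciona"]).shape)  # (100,)
```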

3.2.2 Context-dependent methods: BERT and BETO

In this type of embedding the representation of a word is not unique because it depends on the word's context. Recent developments in context-dependent embeddings [9, 48] show that systems based on such representations achieve good results in many Natural Language Processing (NLP) tasks [35]. BERT [9] is one of the most popular context-dependent embeddings. It is based on the Transformer [63], originally created for machine translation. The Transformer includes two separate mechanisms: an encoder that reads the input text and a decoder that produces a prediction for the task. The most important part of the Transformer architecture is the multi-head attention mechanism, which learns contextual relations among words (or sub-words) in a text. The encoder is formed by a stack of layers that include self-attention and feed-forward connections. Decoders include all the elements of the encoder, with an additional encoder-decoder attention layer between the self-attention and feed-forward layers [63]. Figure 4 shows the encoder and decoder of the Transformer architecture. As with the context-independent word embeddings described above, documents or sentences are represented by statistical functionals.

Fig. 4 Topology of the Transformer architecture. K is the number of layers in the encoder and decoder. Figure adapted from [63]

Two pre-trained BERT models are considered for our baseline. The first is the BERT-Base Multilingual Uncased model, which was trained with the Multi-Genre Natural Language Inference (MultiNLI) corpus. The second is a BERT-Base model trained with Spanish data from Wikipedia and all sources of the OPUS project [62]; this corpus contains about 3 billion words and is available online. The model is commonly known as BETO [5]. The BERT-Base architecture consists of 12 self-attention layers, each with 768 hidden units, for a total of 110M parameters. The last layer (768 units) is taken as the word-embedding representation. The source code to compute BERT and BETO embeddings is also available online [46].
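As an illustration, last-layer BETO embeddings can be obtained with the Hugging Face transformers library roughly as follows (a sketch; the model identifier is an assumption and may differ from the exact checkpoint used by the authors):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "dccuchile/bert-base-spanish-wwm-uncased"  # assumed BETO checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def word_embeddings(text: str) -> torch.Tensor:
    """Last-layer hidden states: one 768-dimensional vector per token."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state.squeeze(0)   # shape: (n_tokens, 768)

print(word_embeddings("mi internet no funciona desde ayer").shape)
```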

3.3 Parallel convolutional neural networks (P-CNN)

Convolutional networks integrate the feature extraction and feature selection stages together with the pattern classification algorithm in a single architecture [71]. These networks contain a structure formed by convolutional filters and pooling layers, instead of the fully connected layers typically used in classical deep neural networks [13]. CNNs have been widely used in computer vision [3, 21, 28], and in recent years their applications have been extended to other NLP domains, including machine translation [22], sentence/document classification [26, 65], generic text representations [11, 61], text-based sentiment analysis [27, 34, 38], and others. There are also works related to market analysis, such as the prediction of customer withdrawal based on transcriptions [14, 71].

CNN architectures for NLP are typically used to extract sentence representations. Typical architectures include convolutional layers and max-pooling operations over all resulting feature maps. A sentence with \(l\) words can be represented as a matrix \(\mathbf{X} \in \mathbb{R}^{l \times d}\), whose rows are the word embeddings of dimension \(d\). In the convolutional layer the matrix is convolved with filters \(\mathbf{W} \in \mathbb{R}^{m \times d}\); each filter has a different size \(m\), but all filters share the dimension \(d\), which corresponds to the word-embedding dimension. The main idea of the CNN for text classification is to extract semantic features with a temporal resolution through the convolution operation. Different filter sizes correspond to different values of \(n\) in n-grams: filters of size 2 × d, 3 × d, and 4 × d map bi-gram, tri-gram, and four-gram relationships, respectively. During the convolution, each filter gradually moves down one word at a time along the sequence of words (i.e., vertically). A max-pooling operation is applied after the convolutions, and the final step is performed by a fully connected layer. The classification result for a sample is obtained by applying a Sigmoid function to the output layer.
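The following sketch illustrates this classical single-input text CNN in Keras (the vocabulary size, sequence length, and number of filters are illustrative assumptions, not the paper's settings):

```python
from tensorflow.keras import Model, layers

VOCAB, SEQ_LEN, EMB_DIM = 20000, 200, 100     # illustrative values

inp = layers.Input(shape=(SEQ_LEN,))
x = layers.Embedding(VOCAB, EMB_DIM)(inp)     # word-embedding matrix X (l x d)
pooled = []
for m in (2, 3, 4):                           # bi-, tri-, and four-gram filters
    c = layers.Conv1D(filters=64, kernel_size=m, activation="relu")(x)
    pooled.append(layers.GlobalMaxPooling1D()(c))   # max-pooling per filter size
h = layers.Concatenate()(pooled)
out = layers.Dense(1, activation="sigmoid")(h)      # Sigmoid output layer

model = Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```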

In this work, we propose a novel architecture based on parallel CNNs, as shown in Fig. 5. The input corresponds to word embeddings of the questions and answers of each conversation. Two parallel convolutional layers extract the feature vectors for questions and answers separately. Each convolutional layer has three parallel filters of different orders to exploit bi-gram, tri-gram, and four-gram relationships among words, simultaneously allowing semantic features to be extracted at multiple temporal resolutions. The output vectors of this process have the same dimension because both convolutional layers have the same number of filters. These two vectors are concatenated before the fully connected layer to obtain a complete representation of the conversation. In Fig. 5, \(l_q\) and \(l_a\) correspond to the maximum number of words in the questions and answers of the corpus, respectively: \(l_q\) is 1297 for database 1 and 665 for database 2, while \(l_a\) is 4429 for database 1 and 7790 for database 2. The other dimension of the matrices corresponds to the size of the embedding layer, which was set to 100 for consistency with other studies based on simple pre-trained Word2Vec models and with our baseline models. The best parameter configuration was chosen experimentally based on performance, considering the smallest possible number of parameters. The proposed approach keeps the computational cost low compared with Transformer-based methodologies and captures spatial dependencies between words, which is not possible with RNN-based architectures. The general methodology of the proposed approach is shown in Fig. 6, and a sketch of the architecture is given below.
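A sketch of the parallel topology in Fig. 5, reusing the single-branch pattern shown above for questions and answers (the number of filters and the vocabulary size are illustrative assumptions; \(l_q\), \(l_a\), and the embedding dimension follow the values reported for database 1):

```python
from tensorflow.keras import Model, layers

VOCAB, EMB_DIM = 20000, 100       # vocabulary size is an assumption
L_Q, L_A = 1297, 4429             # l_q and l_a for database 1

def branch(seq_len):
    """Embedding + parallel 2/3/4-gram convolutions + max-pooling."""
    inp = layers.Input(shape=(seq_len,))
    x = layers.Embedding(VOCAB, EMB_DIM)(inp)
    pooled = [layers.GlobalMaxPooling1D()(
        layers.Conv1D(64, m, activation="relu")(x)) for m in (2, 3, 4)]
    return inp, layers.Concatenate()(pooled)

q_in, q_feat = branch(L_Q)        # customer questions
a_in, a_feat = branch(L_A)        # chatbot answers
h = layers.Concatenate()([q_feat, a_feat])     # joint fusion before classification
out = layers.Dense(1, activation="sigmoid")(h)

pcnn = Model([q_in, a_in], out)
pcnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Because the two branches are trained jointly, the loss is back-propagated through both embedding layers, which is the joint fusion property discussed in the introduction.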

Fig. 5 Typical CNN architecture. \(l_q\) and \(l_a\) are the number of words in questions and answers, respectively

Fig. 6 General methodology of the proposed approach

3.4 Classification

Four word-embedding models are used separately as baselines. Each conversation is divided into two subgroups: questions and answers. Once the word-embedding model (100-W2V, 300-W2V, BERT, or BETO) is used to create the embeddings of each subgroup, six statistical functionals are computed: mean, standard deviation, skewness, kurtosis, minimum, and maximum. The vectors with the six functionals of the two subgroups are concatenated into a new vector \(\mathbf{p}\), whose dimension depends on the embedding model: \(\mathbf{p} \in \mathbb{R}^{1200}\) for 100-W2V, \(\mathbf{p} \in \mathbb{R}^{3600}\) for 300-W2V, and \(\mathbf{p} \in \mathbb{R}^{4608}\) for BERT and BETO. An SVM with Gaussian kernel is used for classification. The hyper-parameters of the classifier (C and γ) are optimized on the training set via grid search, with C ∈ {0.0001, 0.001, …, 1000} and γ ∈ {0.0001, 0.001, …, 1000}. A sketch of this baseline pipeline is given at the end of this subsection.

Note that in the proposed approach the classification decision is made on the final layer of the architecture, which has a single output neuron, as shown in Fig. 5. The activation function is a Sigmoid, so the output values range from 0 to 1. The decision threshold is set to 0.5 to classify ineffective vs. effective conversations.
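The baseline pipeline described above can be sketched as follows (function and variable names are hypothetical; the grid matches the powers of ten stated in the text):

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def functionals(E: np.ndarray) -> np.ndarray:
    """Six functionals over a (words x dim) embedding matrix."""
    return np.concatenate([E.mean(0), E.std(0), stats.skew(E, axis=0),
                           stats.kurtosis(E, axis=0), E.min(0), E.max(0)])

def conversation_vector(Q: np.ndarray, A: np.ndarray) -> np.ndarray:
    """p = [functionals(questions); functionals(answers)]."""
    return np.concatenate([functionals(Q), functionals(A)])

# Grid search over C and gamma on the training set, as described above.
grid = {"C": 10.0 ** np.arange(-4, 4), "gamma": 10.0 ** np.arange(-4, 4)}
svm = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
# svm.fit(P_train, y_train)   # P_train: one vector p per conversation
```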

4 Experiments and results

We evaluated the proposed P-CNN architecture on the two databases described in Section 3.1. The results are compared with those obtained with the baseline models. The datasets are divided into 70% for training and 30% for testing, ensuring class balance in each subset. Optimal parameters are found on the training set and then evaluated on the test set. This process is repeated 10 times with random selections of the train and test subsets, i.e., 10 independent experiments; a sketch of this protocol is shown below. Details of the baseline experiments, the proposed architecture, and the evaluation process can be found in the source code available online.
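A sketch of the evaluation protocol with scikit-learn (the feature matrix and labels are hypothetical placeholders):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.random.rand(3536, 1200)            # placeholder feature matrix
y = np.random.randint(0, 2, 3536)         # placeholder effectiveness labels

# 10 independent, class-balanced 70/30 train/test splits.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
for train_idx, test_idx in splitter.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit on (X_train, y_train), report metrics on (X_test, y_test)
```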

4.1 Evaluation of the baseline models

Context-dependent and context-independent embeddings are considered to evaluate the baseline approaches, following the methodology depicted in Fig. 7. Results are reported in Table 2 in terms of accuracy (Acc), sensitivity (Sens), specificity (Spec), and Area Under the ROC Curve (AUC). Note that among the classical word embeddings, 100-W2V yields the best results on the two databases, with accuracies of 76.04% and 79.80% for database 1 and database 2, respectively. It is also worth noting that specificity is always higher on both databases, indicating that the models are better at detecting conversations where the chatbot was not able to provide good service or was unable to understand what the customer was asking for. This is a desirable characteristic because the QoS areas of companies mainly focus on detecting problematic cases, i.e., those where the company or service provider failed to leave the customer happy or satisfied. Figure 8 shows the results more compactly through the ROC curves obtained for each database and each pre-trained word-embedding model.

Fig. 7 General methodology to evaluate the baseline models

Table 2 Results obtained with classical word embeddings
Fig. 8 ROC curves obtained for database 1 (left) and database 2 (right) with classical word embeddings

4.2 Evaluation of the parallel CNN

In this experiment each database is evaluated independently, following the methodology shown in Fig. 6. Results are reported in Table 3 in terms of accuracy (Acc), sensitivity (Sens), specificity (Spec), F1-score, and AUC.

Table 3 Results obtained with the proposed CNN

Accuracies of 79.0% and 80.2% are obtained for database 1 and database 2, respectively. The proposed approach yields better results on both databases than the baseline models (see Table 2). Its main advantage is that no pre-trained models were used to generate the embeddings that build the input matrices; the embeddings are generated directly by the network through an embedding layer. As with the baseline models, performance is similar on both databases. Specificity is again higher in both cases, indicating that the proposed approach is more accurate at detecting ineffective conversations. The ROC curves for these experiments are shown in Fig. 9. Performance is similar in both cases, which likely indicates that the proposed method is equally suitable for the two scenarios considered in this paper.

Fig. 9 ROC curves obtained for database 1 (DB1) and database 2 (DB2) with the proposed approach

4.3 Comparison between the proposed approach and the baseline models

Besides the classification experiments, Mann-Whitney U tests are performed to evaluate how significantly the classification scores differentiate the two classes: effective vs. ineffective conversations. Four cases are explored, and the results are summarized in Fig. 10. For the CNN approach, the scores of the activation function in the output layer are used in the tests. For the baseline approach, the samples' distances to the optimal hyperplane of the SVM are used as scores; only the 100-W2V embedding model is considered. The SVM scores were divided by the maximum absolute score and then passed through the Sigmoid function, so that the scores of the P-CNN and the baseline model lie in the same range. A sketch of this procedure is shown below.
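A sketch of the score normalization and the statistical test with SciPy (the score arrays are placeholders standing in for the actual model outputs):

```python
import numpy as np
from scipy.special import expit          # logistic Sigmoid
from scipy.stats import mannwhitneyu

def normalize_svm_scores(d: np.ndarray) -> np.ndarray:
    """Divide by the maximum absolute score, then apply the Sigmoid."""
    return expit(d / np.abs(d).max())

# Placeholder score distributions for the two classes.
scores_effective = np.random.beta(5, 2, size=300)
scores_ineffective = np.random.beta(2, 5, size=300)

u, p = mannwhitneyu(scores_effective, scores_ineffective,
                    alternative="two-sided")
print(f"U = {u:.1f}, p = {p:.3e}")
```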

Fig. 10 Distribution and box-plots of the scores obtained from the CNN approach and the baseline model

According to the statistical tests, there are significant differences between the score distributions of the two classes for both approaches and both databases (p-value ≪ 0.001). Hence, in principle, the two methods (word embeddings and the proposed parallel CNN) are both suitable to classify effective vs. ineffective conversations. We performed additional Mann-Whitney U tests to evaluate whether the difference between the P-CNN and the best baseline model is significant: for each class (ineffective and effective), we compared the score distributions of the P-CNN and the best baseline model on both databases. According to these tests, there is a significant difference between the score distributions of the two models for both classes and both databases (p-value ≪ 0.001). Figure 11 shows the comparison between the score distributions of both models. In addition, the classification results improved by 2.9 and 0.38 percentage points for database 1 and database 2, respectively, when using the proposed P-CNN instead of the baseline models.

Fig. 11 Comparison between the best baseline model and the CNN approach

5 Discussion

The methodology proposed in this paper aims to discriminate between effective and ineffective conversations. It allows word embeddings to be trained within the model itself; therefore, specific terms commonly used in the company's or market's language obtain a representation according to the semantics of the conversations and do not depend on the semantics of an external corpus. On the other hand, the vocabulary of the trained model is limited to the one used in the conversations of the training database, so if the model is tested in another environment, its performance will depend on how similar the semantic fields of the new environment are to the vocabulary of the training database. This limitation may negatively affect its generalization capability; however, we believe that in general terms the advantages outweigh this drawback, especially considering that the typical application scenario for this technology is a company that already has a chatbot in production and needs to evaluate the effectiveness of its service. In such cases, a training set can be generated to feed the P-CNN and create the evaluation model.

Regarding the experiments with the baseline models, we expected better results from modern pre-trained models such as BERT or BETO. However, the evidence shows that a simpler model is sufficient to address the problem presented in this paper. We believe this is due to the specific application addressed: in both cases the customer knew that (s)he was interacting with a chatbot, and thus used more concrete and precise language with few context words.

It is noteworthy that the classification of conversations using the proposed model does not evaluate the quality of service perceived by the customer. Our system is trained with labels assigned by a human expert in linguistics according to whether the customer's requirements were correctly addressed (effective conversation) or not. This means that the output of the system is in line with the evaluation a human would assign to a given conversation. Hence, the result can be used to assess the effectiveness of a given chatbot, to detect ineffective conversations, and then to re-train the chatbot to improve its capability to effectively serve customers. Note that a model aligned with customer feedback would capture customer satisfaction rather than chatbot effectiveness: for instance, a conversation can be correctly addressed by the chatbot while the customer still does not feel satisfactorily served.

The main application of our system is to evaluate and improve a chatbot's performance. On the one hand, the percentage of ineffective conversations detected by the model in a given time period can be used to determine whether the chatbot's training was adequate or whether some rules need to be updated. On the other hand, the conversations classified as ineffective can be analyzed to determine the source of the chatbot's failure and to update the rules accordingly.

A limitation of the proposed methodology is its computational complexity, which does not allow the real-time evaluation of conversations. However, it is possible to make evaluations over time intervals, for instance the number of ineffective conversations per hour, per day, or per week, depending on the traffic of conversations processed by the chatbot. Finally, this methodology was implemented by the Colombian company Pratech Group to measure the effectiveness of its chatbots.

6 Conclusion

The effectiveness of chatbots during conversations with customers of two different companies is automatically evaluated in terms of whether the service requested by the customer was effectively or ineffectively provided by the chatbot. Classical word-embedding approaches such as Word2Vec and BERT are used as baselines, and their performance is compared with a novel approach, introduced in this paper, based on parallel CNNs with multiple temporal resolutions. Questions from the customers and answers from the chatbots are modeled independently by two parallel convolutional layers. Each layer is composed of three filters covering multiple temporal resolutions, so that bi-gram, tri-gram, and four-gram relationships among the words are considered simultaneously when extracting the feature vectors of the questions and the answers.

The proposed approach produces better accuracies than the baseline models on the two databases, with improvements between 0.38 and 2.9 percentage points depending on the database. Its main advantage is that it does not depend on pre-trained models, which are typically created from millions of words that are not necessarily related to the context of the given task (i.e., the target corpus). Our method allows specific models to be created for each context, generating more accurate systems adapted to particular needs. This work is a step forward in automatic chatbot effectiveness evaluation and will allow companies to improve their QoS monitoring processes: instead of relying on self-reported satisfaction surveys, the service provided by the chatbot can be evaluated accurately and automatically. A product based on this methodology has already been developed by the company Pratech Group.

Other challenges like naturalness and personality for the chatbots’ language might be studied in future research.