1 Introduction

The rapid development of the Internet and mobile intelligent devices has motivated more enterprises to develop their online business while creating new demand for developing user-friendly and effective human–computer interaction technologies such as chatbots. A business chatbot refers to a conversational agent that can interact with users via text (Przegalinska et al. 2019), image (Chiu and Chuang 2021), or voice (Sanchez-Diaz et al. 2018) to accomplish specific commercial tasks, such as customer relationship management (Steinbauer et al. 2019), financial investment advising (Wen 2018), online shopping (Thomas 2016) and so on. During the pandemic, the global demand for epidemic prevention resulted in increased online business demand and higher requirements for 24/7 services. Using chatbot technology to automate messaging services with customers could reduce customer servicing costs or save on the operating costs of entire call centers (Lee et al. 2018). Considering the rapid increase in chatbot demand to improve customer services, a thorough study of state-of-the-art technology in business chatbots can help popularize chatbot applications and identify potential business values from human–computer interactions (Steinhoff et al. 2019). It is beneficial to chatbot developers, researchers, end-users, and sellers.

Various computational methods from multiple disciplines have been applied to chatbot design. The evolution of its mainstream techniques follows a development pattern similar to that of information technology (IT), including the algorithm, computer language, operating system, hardware, and Internet (Lokman and Zain 2010). Nowadays, chatbot development is in the middle of a boom driven by contemporary AI technologies. In particular, recent deep learning advancements have led to various innovative and effective approaches to improving and enriching chatbot functionalities. A qualified business chatbot is expected to understand the context and user intent, integrate with domain knowledge (Luo et al. 2021), orchestrate workflows within a customer relationship management (CRM) system (Jonke and Volkwein 2018), deliver personalized experiences and emotional values (Zhu et al. 2022), and collaborate with humans to ensure efficient and secure services (Ferrod et al. 2021; Yang et al. 2021b).

In comparison, chatbots in other fields may not require all these characteristics. For example, chatbots designed for psychological counseling may not need to integrate with CRM systems because they only interact with individuals and record their mental health status. Educational chatbots responsible for student assignments are not required to provide emotional values. However, to our knowledge, a literature review outlining the manifold deep learning technologies and identifying their utilization has been lacking in business chatbot studies. Conducting a review that presents the technology preferences of researchers in the business chatbot literature will be helpful because it can provide focused insights for further business value exploration. Through a review, a better understanding of researcher-preferred techniques and recognition of the technical and application gaps can be achieved. Hence, such a review is expected to help researchers and developers delve into the value of chatbots under specific commercial applications, considering the distinct characteristics and applicable scenarios related to diverse deep learning branches. Accordingly, we focus on the thorough review and analysis of mainstream deep learning technologies in the dialogue system, which produces the core service of business chatbots. It is an essential component that can significantly enrich user experience during human–computer interactions.

We start this systematic literature review following the guidelines proposed by Petersen et al. (2015), which were upgraded from Petersen et al. (2008). According to their summary, literature reviews are of two types: systematic reviews and mapping studies. Our literature review belongs to the mapping study paradigm, which aims at structuring a research area through classifying and counting contributions concerning the categories of that classification (Petersen et al. 2008). To provide differentiated insight into this burgeoning technique and its commercial use, we aim to contribute in the following aspects: (1) provide conclusions on the current development structure of chatbots and recognize the mainstream deep learning techniques adopted for contemporary business chatbot building; (2) summarize the usages of the deep learning technologies and compare their performance in each usage; and (3) propose a comprehensive classification framework to characterize different chatbot architectures in terms of how they produce responses.

Hence, we identify three sets of keywords from the research motivation: chatbot, business, and deep learning. Each set of searches is performed on the databases of Web of Science (WoS) and Scopus using Boolean operators (“OR,” “AND”). The search strings used for each database can be found in Table 1 and have been applied to all fields. This review paper focuses on English-language proceedings papers and articles published in the past 5 years (2017 to 2021) to ensure the timeliness of recognized technologies. The process of filtering literature from the two databases is shown in Fig. 1, and the distribution of the search results is shown in Fig. 2.

Table 1 Definition of search strings

Fig. 1 Process of filtering literature from Web of Science and Scopus

Fig. 2 Numbers of retrieved publications over time (2017–2021)

To outline our mapping study, we created a taxonomy diagram after reviewing all the qualified literature, as shown in Fig. 3. Deep learning technologies have four usages (in blue boxes) in chatbot dialogue systems, and some usages can be subdivided further (in purple boxes). We identify the mainstream deep learning techniques (in orange boxes) mentioned in the literature for each use and provide relevant references (in green boxes). We will conduct a more detailed study surrounding this taxonomy in Sects. 3 and 4.

Fig. 3 Taxonomy of deep learning usages in chatbot dialogue systems

The remainder of this article is organized as follows. Section 2 introduces the chatbot background and emphasizes the pipeline and end-to-end development structures at the current stage. Section 3 summarizes seven deep learning technologies from business chatbot literature and describes their computational mechanisms in dialogue systems. Section 4 presents the main applications of deep learning and compares various neural network characteristics for the same usage, followed by Sect. 5, which conducts a critical analysis of technical features for four chatbot architectures. Sections 6 and 7 discuss the future research directions and conclude the review article, respectively.

2 Background

Generally, a chatbot is composed of three parts: (1) a user interface for receiving inputs and delivering outputs; (2) the management system to provide a variety of chatbot core services; and (3) hardware support for the entire operation of a chatbot system. The general system architecture of a chatbot is shown in Fig. 4.

Fig. 4 General architecture of a chatbot system

The interface of a chatbot receives an input message or instruction from a user and transmits it to the respective service management components. The service management components accept the input request and assign tasks to the respective sub-service components for operations. If a chatbot receives a query, the dialogue management system needs to produce an appropriate response to start a conversation with the user. The functionalities of a chatbot can be enriched by embedding it with internal or external application programming interfaces (APIs), such as the APIs for e-mail service, external messaging, geographical maps, weather, and other standard utility services. The chatbot interface also receives responses from the service management system and formats the responses according to the specific presentation style required by the user. These chatbot services are executed by invoking hardware components, including temporary storage for dialogue and service data, permanent memory for fundamental system operations, central processing units and graphical processors, intelligent routers for Internet connections, and so on. The automatic reply (auto-responder) of the dialogue system is one of the most primitive functions of chatbots (Mufadhol et al. 2020), and its benefit of emancipating productivity is one of the well-recognized values of developing business chatbots (Sandu and Gide 2019; Steinbauer et al. 2019).

Starting in the early twenty-first century, a surge in machine learning and deep learning research has spread to chatbot studies (Adamopoulou and Moussiades 2020), gradually making chatbots more intelligent and modernized. The research and development of business chatbots have entered an era of explosive growth. Contemporary chatbots are designed to provide context-sensitive responses and deliver an array of sophisticated functionalities. They have gradually evolved into two common architectures: the pipeline and end-to-end structure. Figure 5 illustrates the information processing flow of a dialogue system in the pipeline structure. Conceptual components in this kind of conversation system generally include natural language understanding (NLU) for dialogue intent and slot recognition, dialogue state tracking (DST) for conversation record management, dialogue policy learning (DPL) for response policy controlling, and natural language generation (NLG) for response generation (Chen et al. 2018; Hirschberg and Manning 2015). In some cases, the DST and DPL components are integrated into a single concept named dialogue management (DM), which is responsible for storing and controlling the conversation states. Current studies might de-emphasize the DM concept and distribute its functionality to the NLU or NLG components when dialogue state management is not a focus of the system design. Chatbots with this architecture explicitly expose the response generation process to developers and researchers, making it convenient to adjust or independently improve each component as user requirements change.
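As a rough illustration of the pipeline flow described above, the following Python sketch chains four toy components. The rule-based NLU, the intent and slot names (`track_order`, `order_id`), and the response templates are hypothetical stand-ins for the trained neural models a real system would use; only the NLU, DST, DPL, NLG ordering is taken from the text.

```python
# Toy pipeline dialogue system: NLU -> DST -> DPL -> NLG.
# All intents, slots, and templates are hypothetical illustrations.

def nlu(utterance):
    """Rule-based stand-in for a neural NLU model: returns intent and slots."""
    text = utterance.lower()
    intent = "track_order" if "order" in text else "unknown"
    slots = {}
    for token in text.replace("?", " ").split():
        if token.isdigit():
            slots["order_id"] = token
    return {"intent": intent, "slots": slots}

def dst(state, nlu_result):
    """Dialogue state tracking: merge the new turn into the running state."""
    new_state = {"intent": state.get("intent"),
                 "slots": dict(state.get("slots", {}))}
    if nlu_result["intent"] != "unknown":
        new_state["intent"] = nlu_result["intent"]
    new_state["slots"].update(nlu_result["slots"])
    return new_state

def dpl(state):
    """Dialogue policy: choose the next system action from the tracked state."""
    if state["intent"] == "track_order":
        return "inform_status" if "order_id" in state["slots"] else "request_order_id"
    return "fallback"

def nlg(action, state):
    """Template-based response generation for the chosen action."""
    templates = {
        "request_order_id": "Could you give me your order number?",
        "inform_status": "Order {order_id} is on its way.",
        "fallback": "Sorry, I did not understand that.",
    }
    return templates[action].format(**state["slots"])

def respond(state, utterance):
    state = dst(state, nlu(utterance))
    return state, nlg(dpl(state), state)

state = {}
state, reply1 = respond(state, "Where is my order?")
state, reply2 = respond(state, "It is 12345")
```

Because each stage only consumes the previous stage's output, any one function could be swapped for a learned model without touching the others, which is the independence property the pipeline structure is valued for.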

Fig. 5 Illustration of a dialogue system in the pipeline structure

Building a chatbot in pipeline architecture requires extensive manual work from designers, especially in adapting the NLU module to specific application scenarios. To lessen manual intervention, some researchers prefer the other kind of architecture, the end-to-end structure. An example is shown in Fig. 6; the dialogue system operates with raw data input and directly outputs the final processed results. Except for the text vectorization and result interpretation, the remaining computational space allows the ensemble neural network to adjust its model parameters automatically according to the training data. This structure increases the model’s overall fit but requires massive data to “teach” the model to learn the intrinsic data relationships (Yang et al. 2019). The choice between the two structures depends on the actual business needs and designers’ preferences.

Fig. 6 Illustration of a dialogue system in the end-to-end structure

To facilitate both academic research and practical application, several reputable enterprises have provided development frameworks, some of them open source, that can unify and simplify chatbot design, including Google’s DialogFlow, Facebook’s Wit.ai and Messenger, Microsoft Bot Framework, Amazon Alexa, IBM Watson Assistant, and Rasa’s RASA. Some tools support both development structures to satisfy designers’ diverse needs. For example, RASA is a machine learning infrastructure to automate text-based conversations, providing pipeline and end-to-end developing libraries (Bocklisch et al. 2017). In the pipeline structure, each dialogue system component is trained independently and can be replaced by an equivalent method. In contrast, the ensemble model in end-to-end design is trained simultaneously, where the parameter updates of each subpart affect each other. These two architectures have developed distinctive construction advantages to shape the mainstream options. A detailed comparative analysis involving architecture characteristics will be carried out along with their internal deep learning applications in Sect. 5.

3 Mainstream deep learning methods for business chatbot development

Deep learning refers to a series of burgeoning computational technologies with a unique operation structure named neural network (LeCun et al. 2015), inspired by the biological neural networks that constitute animal brains (McCulloch and Pitts 2016). A computational deep learning model comprises many nodes (or artificial neurons) connected in diverse forms. Different ways of connecting and operating the nodes constitute various artificial neural networks. Although Rosenblatt (1958) created the first neural network and Rumelhart et al. (1986) designed the backpropagation algorithm to train the model decades ago, deep learning research was stuck for a long time due to the limitation of computing capability. Driven by the rapid development of modern computer software and hardware technology, the feasibility and potential of deep learning have revived researchers’ interest in diverse neural network algorithms. These algorithms enrich the machine learning family and expand AI influence with their ground-breaking capability of multidimensional data processing. In this section, we identify several mainstream deep learning technologies from recent business chatbot research and briefly introduce their characteristics and computational mechanisms. A concise summary of the technologies and their characteristics is shown in Table 2.

Table 2 Summary of mainstream deep learning technologies and their uniqueness

3.1 Artificial neural network

Artificial neural network (ANN), also called the neural network, is a mathematical model that imitates the structure and function of a biological neural network to estimate or approximate the nonlinear functional relationship between the network inputs and outputs. An ANN comprises many nodes connected in different ways to convey information. Each non-input node represents a specific output function called the activation function. Each connection (edge) between two nodes represents a weighted value for the signal passing through the connection, equivalent to the ANN memory. We usually divide the neural network into the input, output, and hidden layers according to the input and output positions of the model. The network output varies with the connection mode (network structure), weight values of edges, and activation functions of nodes for accomplishing different tasks.

An instance of a simple neural network is illustrated in Fig. 7. It is a fully connected feed-forward neural network where each neuron in a layer connects to each node in the adjacent layer. Its information flows only in a forward direction (from the input to the hidden to the output layers) without a cycle or loop connection (Schmidhuber 2015). This sample model has one hidden layer with N dimensions, and the operations of the different layers are as follows.

$${\varvec{h}}={\sigma }_{h}({{\varvec{W}}}_{{\varvec{h}}}{\varvec{x}}+{b}_{h})$$
$$y={\sigma }_{y}({{\varvec{W}}}_{{\varvec{y}}}{\varvec{h}}+{b}_{y})$$

where \({\varvec{x}}=\left\{{x}_{0}, {x}_{1}, \dots , {x}_{i}, \dots , {x}_{M-1}\right\}, i\in [0, M)\) stands for an M-dimensional input vector, \({\varvec{h}}=\left\{{h}_{0}, {h}_{1}, \dots , {h}_{j}, \dots , {h}_{N-1}\right\}, j\in [0, N)\) and \(y\) stand for the hidden layer vector and output of the network, respectively, \({{\varvec{W}}}_{{\varvec{h}}}\) and \({{\varvec{W}}}_{{\varvec{y}}}\) are the weight matrices, \({b}_{h}\) and \({b}_{y}\) are the bias terms in the corresponding layer, and \({\sigma }_{h}\) and \({\sigma }_{y}\) stand for the activation functions, often Rectified Linear Unit (ReLU), tanh, or sigmoid, which are applied to the weighted sum of the inputs in each node.
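The two layer equations can be traced numerically with a short Python sketch. The weights, the choice of ReLU for \({\sigma }_{h}\) and sigmoid for \({\sigma }_{y}\), and the treatment of \({b}_{h}\) as a per-unit bias vector are assumptions made only for this illustration (M = 3, N = 2).

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W_h, b_h, W_y, b_y):
    # h = sigma_h(W_h x + b_h), here with sigma_h = ReLU
    h = relu([a + b for a, b in zip(matvec(W_h, x), b_h)])
    # y = sigma_y(W_y h + b_y), here with sigma_y = sigmoid and a scalar output
    y = sigmoid(sum(w * v for w, v in zip(W_y, h)) + b_y)
    return h, y

# Illustrative parameters: M = 3 inputs, N = 2 hidden units.
x = [1.0, 2.0, 3.0]
W_h = [[0.5, 0.5, 0.0],
       [1.0, 0.0, -1.0]]
b_h = [0.0, 0.5]
W_y = [1.0, -1.0]
b_y = 0.0

h, y = forward(x, W_h, b_h, W_y, b_y)
```

With these values the second hidden unit receives a negative weighted sum and is zeroed by ReLU, so only the first unit contributes to the output.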

Fig. 7 Double-layer fully connected feed-forward neural network

An ANN with few hidden layers is named a shallow neural network and was explored for decades in the last century (Schmidhuber 2015). Hinton et al. (2006) alleviated the local optimal solution problem by applying pre-training methods to an ANN with seven hidden layers, allowing for deep-layer operations and rekindling people’s attention to deep learning. An ANN with complex neural structures and many network layers is called a Deep Neural Network (DNN). For example, a fully connected neural network with multiple hidden layers is a standard DNN. DNN has dominated ANN applications due to its robust feature extraction and learning ability. The following sections introduce several common DNN variants in business chatbots.

3.2 Recurrent neural network

A recurrent neural network (RNN) is an essential branch of deep learning technologies whose connections form directed cycles, giving the model a memory and allowing it to take temporal sequences as input. It is particularly suited to textual data processing, capturing the semantic relationship between words, with the information stored in the multidimensional weights of the networks. The rise of chatbots that can “generate” a response also greatly benefits from RNN development. Two primitive types of RNN share the characteristic of using the internal state (memory) stored in the hidden layer unit to process sequences of inputs. The Jordan network (Jordan 1986) uses the output of the output layer at a previous time point as one of the inputs in the current hidden layer. The Elman network (Elman 1990), a more general RNN, uses the output of the hidden layer at a previous time point as one of the inputs in the current hidden layer. The inputs of the hidden layer distinguish these two networks.

However, simple RNN models often have difficulty obtaining satisfactory results. They suffer from vanishing and exploding gradients, which are difficult to solve by adjusting the learning rate or other model parameters. Long short-term memory (LSTM) is an efficient gradient-based method that Hochreiter and Schmidhuber (1997) proposed to alleviate the gradient vanishing problem. Compared to simple RNN units, LSTM adds three gates (input, forget, and output) to control the input information, long-term memory, and short-term memory. These gates let information and gradients persist across many time steps, which alleviates the gradient vanishing problem. The gated recurrent unit (GRU) (Cho et al. 2014) is another RNN variant similar to LSTM, and its most prominent characteristic is reducing three gates to two (update and reset); hence, the training speed is accelerated. Because a single update gate interpolates between the previous state and the candidate state, a GRU takes in new information only to the extent that it overwrites the existing state. The performances of these two structures are not far apart but are better than those of traditional recurrent units (Chung et al. 2014).
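To make the two GRU gates concrete, the sketch below implements one recurrent step for a single scalar unit. The parameter names and values are hypothetical, and the interpolation convention (update gate weighting the candidate state) is one of two equivalent conventions found in the literature; it is a simplification for illustration, not a full vectorized GRU.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x, h_prev, p):
    """One step of a scalar GRU unit with update gate z and reset gate r."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev)                 # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev)                 # reset gate
    h_cand = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev))    # candidate state
    # Interpolation: whatever share goes to the candidate is taken from the old state.
    return (1.0 - z) * h_prev + z * h_cand

# Hypothetical parameters for illustration.
params = {"w_z": 1.0, "u_z": 0.0, "w_r": 1.0, "u_r": 0.0, "w_h": 1.0, "u_h": 1.0}

h = 0.0
states = []
for x in [1.0, -1.0, 0.5]:
    h = gru_step(x, h, params)
    states.append(h)

# With the update gate driven towards 0, the previous state is kept almost unchanged.
closed = dict(params, w_z=-50.0)
kept = gru_step(1.0, 0.7, closed)
```

The last two lines show the memory behaviour: a nearly closed update gate carries the old state forward, which is how gated units preserve information over long sequences.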

With the efforts of deep learning researchers, RNN has formed various typical structures after multiple iterations. Seq2seq is one of the most groundbreaking and far-reaching designs. It is a general end-to-end model in an encoder-decoder structure proposed by Cho et al. (2014) and Sutskever et al. (2014). As shown in Fig. 8, it comprises two RNNs, an encoder and a decoder. One RNN encodes a sequence of word vectors into a fixed-length vector representation, and the other decodes the representation into another sequence. The units in this model (blue and green squares in Fig. 8) are usually LSTM or GRU. The encoder and decoder are trained jointly to maximize the conditional probability of a target sequence based on the given source sequence (Cho et al. 2014). This kind of design that can generate a response word by word has been widely used in text generation tasks.

Fig. 8 Seq2seq model

Other well-known improvements include the Bi-LSTM design and hierarchical recurrent encoder-decoder (HRED) architecture. Zhou et al. (2016) proposed the Bi-LSTM that could read the input sequence from two directions to utilize the contextual features instead of the one-side information at a specific time. HRED was proposed by Serban et al. (2016) to model the contextual information for dialogue generation. They considered the turn-taking nature of dialogues and added a context encoder based on the Seq2seq structure to encode the temporal features of appeared utterances. This contextual modeling can reduce the computational steps among adjacent sentences and facilitate the propagation of the information and gradients in the network, thus enabling multi-turn conversation.

3.3 Convolutional neural network

A convolutional neural network (CNN) is a special neural network that replaces the general matrix multiplication with the convolution operation in at least one network layer (Goodfellow et al. 2016). Inspired by research findings on the visual cortices of cats and monkeys (Hubel and Wiesel 1959, 1968), Fukushima (1980) proposed a neural network composed of a convolutional and a downsampling layer, and the max-pooling computation approach was further introduced by Weng et al. (1993) for downsampling. A max-pooling layer often follows a convolution layer in modern CNN design. The uniqueness of CNN architecture lies in extracting spatial features through sequential operations in the convolutional and pooling layers, which effectively enables the network to cope with two- and three-dimensional input data.

Kim (2014) and Zhang and Wallace (2015) adapted the CNN model for text analysis by developing the TextCNN for sentence classification tasks. A simplified illustration is shown in Fig. 9. Each word in the input sentence is transformed into a vector with word embedding techniques, and the combined matrix represents the whole sentence. Suppose an input sentence comprises N word vectors of D dimensions, forming an N × D word matrix. The CNN model in Fig. 9 has six filters covering three window sizes, with two filters of different values per window size to capture distinct information. The window size (denoted as M) represents the number of word vectors an M × D filter can cover in a convolution operation. Each filter slides along the input matrix to perform convolution operations in the convolutional layer, and the computed (N − M + 1) × 1 vector is passed to the next pooling layer. A pooling operation is essentially a nonlinear form of downsampling that can reduce the dimensions of the received vector. For the frequently used max-pooling, the maximum value is extracted from the vector. In this instance, six values produced by the pooling layer correspond to the convolution and pooling results of six filters; they are conveyed to a fully connected network to carry out the final calculation.
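The convolution and max-pooling steps can be sketched in plain Python as follows. The sentence matrix and filter values are made up for illustration (N = 5, D = 2), and the sketch uses one all-ones filter per window size rather than the two per size shown in Fig. 9; bias terms and activation functions are omitted.

```python
def conv_text(sentence, filt):
    """Slide an M x D filter down an N x D sentence matrix.

    Returns a feature map of length N - M + 1.
    """
    n, m = len(sentence), len(filt)
    feature_map = []
    for start in range(n - m + 1):
        window = sentence[start:start + m]
        feature_map.append(sum(w * v
                               for w_row, v_row in zip(filt, window)
                               for w, v in zip(w_row, v_row)))
    return feature_map

# Hypothetical 5-word sentence with 2-dimensional word vectors.
sentence = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 1]]

# One illustrative filter for each window size M = 2, 3, 4.
filters = [[[1, 1]] * m for m in (2, 3, 4)]

feature_maps = [conv_text(sentence, f) for f in filters]
pooled = [max(fm) for fm in feature_maps]   # max-pooling: one value per filter
```

Each filter contributes exactly one pooled value regardless of sentence length, which is why the pooled vector can be fed to a fixed-size fully connected layer.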

Fig. 9 CNN’s computational mechanism for textual data

3.4 Capsule neural network

The capsule neural network (CapsNet) was initially proposed by Sabour et al. (2017) to overcome CNN’s shortcoming that the max-pooling layer could only concentrate on detecting important local features in computer vision applications. Because the relative positions among the features are not captured, CNN models would misidentify an image of a face with disordered features as correct or an oblique real face as wrong in cases where sufficient data samples are unavailable. Hence, Sabour et al. (2017) invented a special neuron called a capsule to replace the max-pooling operation, which they believed caused valuable spatial features to be lost. The most distinguishing characteristic of the capsule neuron is that it uses a vector as the model output to represent the spatial information (the direction of the vector) and probability value (the norm of the vector) of a detected pattern. This design enables the network to model the intrinsic spatial relationship between a part and a whole, automatically generalizing the learned knowledge to novel viewpoints and reducing the sample size requirement. Similar to CNN, CapsNet has also been applied to textual data mining; it was first adapted by Zhao et al. (2018) for text classification tasks.

The input of a CapsNet model is usually multiple vectors. Figure 10 illustrates a capsule neuron with two input vectors reshaped from the computed results of the convolutional layer in Fig. 9. The affine transformation is first applied to the input vectors (\({{\varvec{v}}}_{1}\) and \({{\varvec{v}}}_{2}\)) to convert the low-dimensional feature vectors into high-dimensional ones (\({{\varvec{v}}}_{1}^{\boldsymbol{^{\prime}}}\) and \({{\varvec{v}}}_{2}^{\boldsymbol{^{\prime}}}\)). The transformed vectors are then combined in a weighted sum to obtain a new vector \({\varvec{s}}\). The weights \({c}_{1}\) and \({c}_{2}\) (named coupling coefficients) are determined by the dynamic routing algorithm (Sabour et al. 2017), and the sum of the coupling coefficients equals 1. Finally, vector \({\varvec{s}}\) is processed by a “squashing” operation (the formula is shown in the grey box of Fig. 10) to generate the output vector \({\varvec{v}}\) with a norm between 0 and 1. This norm represents the predicted probability of the pattern that the neuron is responsible for detecting.
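The three steps of a capsule neuron (affine transformation, weighted sum, squashing) can be sketched as below. The squashing function follows Sabour et al. (2017), \({\varvec{v}}=\frac{{\Vert {\varvec{s}}\Vert }^{2}}{1+{\Vert {\varvec{s}}\Vert }^{2}}\frac{{\varvec{s}}}{\Vert {\varvec{s}}\Vert }\). The transformation matrices and input vectors are hypothetical, and the coupling coefficients are fixed by hand here instead of being computed by dynamic routing, which this sketch does not implement.

```python
import math

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def squash(s):
    """Shrink vector s so its norm lies in [0, 1) while keeping its direction."""
    norm_sq = sum(a * a for a in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * a for a in s]

def capsule(inputs, transforms, coupling):
    # Step 1: affine transformation of each low-dimensional input vector.
    transformed = [matvec(W, v) for W, v in zip(transforms, inputs)]
    # Step 2: weighted sum with coupling coefficients (assumed given, sum to 1).
    dim = len(transformed[0])
    s = [sum(c * t[k] for c, t in zip(coupling, transformed)) for k in range(dim)]
    # Step 3: squashing.
    return squash(s)

# Hypothetical 2-D inputs mapped to 3-D by illustrative matrices.
v1, v2 = [1.0, 0.0], [0.0, 1.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output = capsule([v1, v2], [W, W], coupling=[0.5, 0.5])
output_norm = math.sqrt(sum(a * a for a in output))
```

Here the weighted sum is (0.5, 0.5, 1.0) with squared norm 1.5, so the output norm is 1.5 / 2.5 = 0.6, which would be read as the predicted probability of the detected pattern.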

Fig. 10 Computational mechanism of a capsule neuron

3.5 Graph neural network

A graph neural network (GNN) is an optimizable transformation on all attributes of a graph (nodes, edges, global-context) that preserves graph symmetries such as permutation invariance (Sanchez-Lengeling et al. 2021). The concept was proposed by Gori et al. (2005) and further elaborated by Scarselli et al. (2009). GNN adopts a “graph-in, graph-out” architecture, as shown in Fig. 11. Its input is a graph with feature information loaded into its nodes, edges, and global context, and the model progressively transforms these embeddings without changing the connectivity of the input graph.

Fig. 11 One GNN layer for graph data

In a business chatbot, GNN is usually associated with a knowledge graph, a semantic network that can provide additional domain knowledge and assist chatbots in downstream decision-making. A knowledge graph represents a network of entities (nodes) and illustrates their relationships (edges). The entity can be any object, event, situation, or abstract concept. Each node has its attributes. For example, Fig. 12 presents a product knowledge graph Lin et al. (2021a) used for beauty shops. Take the node Product 2 and its relations as an example. It connects two nodes, the entities Ingredient 1 and Flavor 1, through the relations has_ingredient and has_flavor, respectively. Its attributes include title and description in this knowledge graph.
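One common way to hold such a graph in code is as a list of (head, relation, tail) triples plus a dictionary of node attributes. The sketch below encodes only the fragment around Product 2 described above; the attribute values are placeholders, and the helper name `related` is hypothetical.

```python
# Node attributes; the title and description values are placeholders.
entities = {
    "Product 2": {"title": "<product title>", "description": "<product description>"},
    "Ingredient 1": {},
    "Flavor 1": {},
}

# Edges stored as (head entity, relation, tail entity) triples.
triples = [
    ("Product 2", "has_ingredient", "Ingredient 1"),
    ("Product 2", "has_flavor", "Flavor 1"),
]

def related(head, relation=None):
    """Return the tail entities linked from `head`, optionally filtered by relation."""
    return [t for h, r, t in triples
            if h == head and (relation is None or r == relation)]

ingredients = related("Product 2", "has_ingredient")
all_neighbors = related("Product 2")
```

A chatbot could use such lookups to answer a question about a product's ingredients directly, or pass the structure to a GNN to obtain node representations.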

Fig. 12 Product knowledge graph used in the beauty shopping contexts

GNN is adopted to process such a graph structure as input and produce new representations embedded with graph information for nodes or edges. Usually, the graph can be modeled as \(G=(V, E)\), where \(V\) is the node set and \(E\) is the edge set; \(U\) denotes the global context, such as the number of nodes and edges. The corresponding feature representations of the nodes, edges, and global context in the GNN’s n-th layer are \({V}_{n}\), \({E}_{n}\), and \({U}_{n}\). The product knowledge graph in Fig. 12 illustrates the entity attribute information, and thus, the node feature matrix \({V}_{n}\) can be considered here. For node i, the GNN’s n-th layer updates its feature representation from \({V}_{i,n}\) to \({V}_{i,n+1}\), aggregating information from its attributes and immediate neighbors with a specific optimizable transformation method. Zhou et al. (2020) have summarized existing optimizable transformation methods, mainly including the convolution operator, recurrent operator, skip connection, etc. The updated feature representations of all nodes \({V}_{n+1}\) will be passed to the next layer n + 1, and outputs of the GNN’s final layer are new node representations that can be further used for downstream tasks. After GNN’s iterative processing, the representations for nodes have integrated with the whole graph information.
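A minimal sketch of one such node update is given below. It assumes a particular transformation, namely averaging a node's own features with those of its immediate neighbors and then applying a linear map followed by ReLU; this is only one of many possible operators, and the node names, features, and weight matrix are illustrative.

```python
def gnn_layer(features, edges, weight):
    """Update every node from its own features and its neighbors' features.

    The set of edges is read but never modified, so connectivity is preserved.
    """
    neighbors = {node: [] for node in features}
    for a, b in edges:                       # treat edges as undirected
        neighbors[a].append(b)
        neighbors[b].append(a)
    updated = {}
    for node, feat in features.items():
        group = [feat] + [features[n] for n in neighbors[node]]
        mean = [sum(col) / len(group) for col in zip(*group)]
        updated[node] = [max(0.0, sum(w * m for w, m in zip(row, mean)))
                         for row in weight]
    return updated

# Hypothetical 2-D node features for the graph fragment around Product 2.
features = {"Product 2": [1.0, 0.0], "Ingredient 1": [0.0, 1.0], "Flavor 1": [0.0, 0.0]}
edges = [("Product 2", "Ingredient 1"), ("Product 2", "Flavor 1")]
identity = [[1.0, 0.0], [0.0, 1.0]]

layer1 = gnn_layer(features, edges, identity)
layer2 = gnn_layer(layer1, edges, identity)
```

After one layer, Flavor 1 has only seen Product 2; after two layers its second feature becomes nonzero because information from Ingredient 1 has travelled two hops, which illustrates how stacking layers spreads information across the graph.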

3.6 Generative adversarial network

The generative adversarial network (GAN) is a framework that Goodfellow et al. (2014) proposed for estimating generative models via an adversarial process. It was initially used widely in computer vision to generate new image samples. It comprises two sub-models: a generative model (generator) to approximate the data distribution and a discriminative model (discriminator) to estimate the probability that a sample came from the real data rather than the generative model. These two parts are trained simultaneously in the form of a minimax two-player game, and the training procedure for the generator is to maximize the probability of the discriminator making a mistake.
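For reference, the minimax game can be written as the value function given by Goodfellow et al. (2014), where \(G\) is the generator, \(D\) is the discriminator, \({p}_{data}\) is the real data distribution, and \({p}_{z}\) is the prior over the generator's input noise \(z\):

$$\underset{G}{\mathrm{min}}\,\underset{D}{\mathrm{max}}\,V(D,G)={\mathbb{E}}_{x\sim {p}_{data}(x)}\left[\mathrm{log}\,D(x)\right]+{\mathbb{E}}_{z\sim {p}_{z}(z)}\left[\mathrm{log}\left(1-D\left(G(z)\right)\right)\right]$$

The discriminator maximizes both terms by scoring real samples close to 1 and generated samples close to 0, whereas the generator minimizes the second term by producing samples the discriminator scores as real.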

Seq2seq-based artifacts are the most common generative model for text generation. However, they suffer from a severe problem in that the generative model tends to produce a safe response (Zhang et al. 2018b) without practical significance, such as “I don’t know” or “I think so.” The main reason comes from the effects of many safe answers in the training corpus, with many responses starting with “I.” The probability distribution of words in different sentence positions has an apparent long tail characteristic. Hence, the decoder would be affected to select the most probable “I” as the first word of the response, further affecting the generation of subsequent words. The appearance of a safe response implies that the Seq2seq model is trapped in a locally optimal solution. A feasible operation is to impose a disturbance, such as GAN, on the model to make it jump out of the local solution and enter a more optimal global state, thus alleviating the problem.

Figure 13 briefly illustrates a response generation model implemented in a GAN schema with the Seq2seq model as the generator and another neural network as the discriminator. The generator outputs a generated utterance based on the received input, which may first be perturbed with noise, such as masking some words of the input utterance (Fedus et al. 2018), to increase the conversation variety. The discriminator is responsible for judging the gap between the generated text and the real one and transmitting the computed loss to update the generator parameters. The progress of this framework owes much to researchers who migrated the GAN mechanism from image creation to text generation. Kusner and Hernández-Lobato (2016) introduced the Gumbel-softmax distribution for the generator to handle the discrete sequence data generation problem. Zhang et al. (2016) designed the textGAN model with a feature distribution matching method to solve a similar text generation problem, and they further improved the objective function with a kernelized discrepancy metric to ameliorate the mode-collapsing problem in GAN’s training process (Zhang et al. 2017). In addition, Yu et al. (2017) and Guo et al. (2018) proposed the SeqGAN and LeakGAN models to optimize the generative performance of long text with policy gradient-based reinforcement learning. These GAN-based models have a prominent characteristic of generating from text to text with a similar structure, which gives the models the potential to solve data paucity problems. They can enhance the original data by this kind of text retelling and improve robustness to unseen data through data augmentation from the generative model.
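The discrete-data problem arises because picking a word with argmax is not differentiable, so the discriminator's loss cannot flow back into the generator. The Gumbel-softmax relaxation replaces the hard choice with a smooth probability vector. The sketch below shows the generic relaxation only, not the specific model of Kusner and Hernández-Lobato (2016); the vocabulary, logits, and temperature are hypothetical.

```python
import math
import random

def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed (soft) one-hot sample over the vocabulary."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)   # keep both logs finite
        gumbel = -math.log(-math.log(u))                 # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    top = max(noisy)                                     # subtract max for stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
vocab = ["I", "thanks", "order", "refund"]     # hypothetical toy vocabulary
logits = [2.0, 0.5, 0.3, 0.1]                  # hypothetical generator scores
soft_sample = gumbel_softmax(logits, temperature=0.5, rng=rng)
```

Lower temperatures push the output closer to a one-hot vector, while higher temperatures make it smoother; in a trainable model the same computation would be expressed in a framework with automatic differentiation.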

Fig. 13 GAN schema for text generation

3.7 Deep reinforcement learning

Deep reinforcement learning (DRL) refers to a series of algorithms that blend deep learning with reinforcement learning to optimize objective functions and make better decisions in sequential decision problems. Reinforcement learning is a cyclic process in which an agent takes actions according to explicit or implicit policies and then interacts with the environment to gain a reward and change its perceived state. It is designed to maximize the cumulative reward (formulated as the objective function) and settle the decision optimization problem. The distinct characteristic is to consider long-term income through frequent sequential interactions in a trial-and-error mechanism.

The reinforcement learning environment was initially abstracted as a Markov Decision Process (MDP) and solved with the dynamic programming method (Otterlo and Wiering 2012; Bellman 1954). Working out the state transition process required much computation time and space, which caused algorithm development to stagnate. The renaissance is often traced to 2015, when the computer Go program AlphaGo defeated the European Go champion Fan Hui without handicap on a full-sized 19 × 19 board. AlphaGo was built on DRL, which united function approximation and objective optimization. It leveraged the perceptual ability of deep learning to represent the state transition and modeled the policy and objective function of reinforcement learning. The program leader, David Silver, is thus widely regarded as a pioneer of DRL, and the achievements were published in Nature (Mnih et al. 2015; Silver et al. 2016).

A typical DRL process is shown in Fig. 14. Regarding the environment observed at time step t, the agent constructs a perceived state \({S}_{t}\) and follows a value-based or a policy-based method to map from the state to an action \({A}_{t}\). Then, the environment reacts to the agent’s action with a scalar reward \({R}_{t}\). The agent receives feedback from the environment and updates its perception into a new state \({S}_{t+1}\). Deep learning is leveraged to capture complicated relationships among the state, the agent action, and the reward or objective function for action determination and policy learning.

Fig. 14 Overview of the DRL process
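The interaction cycle described above (state, action, reward, next state) can be sketched as a short loop. Everything in the example is hypothetical: the toy environment, the fixed policy, and the discount factor `gamma` are assumptions introduced for illustration, and a fixed rule stands in for the deep network that a DRL agent would learn.

```python
def run_episode(env_step, policy, initial_state, max_steps=10, gamma=0.9):
    """Roll out one episode and accumulate the discounted cumulative reward."""
    state = initial_state
    discounted_return = 0.0
    trajectory = []
    for t in range(max_steps):
        action = policy(state)                               # map S_t to A_t
        reward, next_state, done = env_step(state, action)   # environment returns R_t and S_{t+1}
        discounted_return += (gamma ** t) * reward
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory, discounted_return


def toy_env_step(state, action):
    """Toy environment: positions 0..3 on a line, reward 1.0 on reaching position 3."""
    next_state = max(0, min(3, state + action))
    done = next_state == 3
    reward = 1.0 if done else 0.0
    return reward, next_state, done


def always_right(state):
    """Fixed stand-in policy that always moves one step to the right."""
    return 1


trajectory, episode_return = run_episode(toy_env_step, always_right, initial_state=0)
```

Because the only reward arrives on the third step, the discounted return equals `gamma ** 2`, which shows how a delayed reward is weighted less than an immediate one.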

An agent built with a value-based method can use deep learning models to approximate a value function that estimates the cumulative reward. It selects the action with the highest value; one classical DRL algorithm of this kind is the Deep Q-Network (DQN) developed by DeepMind (Mnih et al. 2015). Policy-based methods, such as Policy Gradients (Sutton et al. 2000), compute a probability distribution over actions according to the learned policy (represented by network parameters), so every action has some probability of being chosen; these methods are especially applicable to scenarios with a continuous action space. To sum up, DRL combines the perceptual ability of deep learning and the decision-making ability of reinforcement learning to define a decision problem and optimize the objective function in consideration of long-term benefits. Serban et al. (2017) adopted it in a social bot design, with the dialogue process as the state and the candidate response as the action, to improve users’ multi-round interactive experience. Their entry in the Amazon Alexa Prize competition highlights DRL's feasibility in conversation systems.
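The contrast between value-based and policy-based action selection can be made concrete with a small sketch. The numbers and function names below are assumptions for illustration; in a DRL agent, the value estimates or action preferences would come from a neural network rather than a hand-written list.

```python
import math
import random


def greedy_action(q_values):
    """Value-based choice: pick the index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax(preferences):
    """Turn arbitrary action preferences into a probability distribution."""
    peak = max(preferences)
    exps = [math.exp(p - peak) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(preferences, rng=random):
    """Policy-based choice: sample an action from the softmax distribution."""
    probs = softmax(preferences)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Made-up scores for three candidate responses.
scores = [0.1, 0.7, 0.2]
chosen_by_value = greedy_action(scores)
action_probs = softmax(scores)
chosen_by_policy = sample_action(scores, rng=random.Random(0))
```

The greedy rule always returns the same action for the same scores, whereas the sampled policy gives every action a non-zero probability, which is the property the text attributes to policy-based methods.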

3.8 Transformer

The Transformer is a deep learning architecture initially proposed by a team from Google Brain (Vaswani et al. 2017) to accomplish machine translation tasks. It has been extended to various AI applications and is widely used to handle sequential data. Among traditional deep learning technologies for processing sequence data, the convolutional operation makes CNN essentially a local encoder of n-grams, limiting the network’s capability to establish long-range dependencies. Although the recurrent structure of RNN can capture long sequence information, it is difficult to run in parallel and often costs a tremendous amount of time. To process an entire sentence in parallel and globally, Vaswani et al. (2017) relied heavily on the self-attention mechanism, a weighted-average method that extracts key features and fuses the focal information, to design the encoder-decoder Transformer model.

The general process of the transformer architecture is shown in Fig. 15. On receiving the vector representation of a whole sentence, the transformer model adds positional information to it to form the model input. The transformer encoder consists of six identical blocks whose operations are shown on the far left of Fig. 15. Each block includes a multi-head self-attention layer and a position-wise fully connected feed-forward layer. The multi-head self-attention comprises multiple self-attention sub-layers that share the same input. They relate different positions of a single sequence through operations on three matrices, Q, K, and V (queries, keys, and values). In each self-attention sub-layer, Q, K, and V are derived from three different linear transformations of the same source, designed to capture the relationships within the input information.

Fig. 15 Transformer computational mechanism
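A single self-attention head, as described above, can be sketched in plain Python using the scaled dot-product form of Vaswani et al. (2017), softmax(QKᵀ/√d_k)V. The tiny input and the identity projection matrices are assumptions for illustration; a trained model would learn the projection weights and run several such heads in parallel.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x."""
    # Q, K, and V are three linear transformations of the same source x.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    # Each row of weights says how much one position attends to every position.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Three token vectors of dimension 2 and identity projections (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
attended, attention_weights = self_attention(tokens, identity, identity, identity)
```

Every output row is a weighted average of the value vectors, and all positions are computed from the same matrix products, which is what allows the whole sentence to be processed in parallel.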

The structure of the transformer decoder is very similar to the encoder’s, also consisting of six identical blocks, as shown on the far right of Fig. 15. The most significant difference is that each block in the decoder has two multi-head attention layers, both of which differ slightly from those in the encoder. The input matrix of the first layer is partially masked to prevent positions from attending to subsequent positions. For example, Fig. 15 illustrates the information state at a certain moment during the transformer training process, where the model generates the word “doing” based only on the input utterance “How is it going?” and the already generated content “I am”. The masking operation in the first multi-head self-attention layer ensures that the decoder block cannot use content that has not yet been produced. In the second layer, the matrices V and K are computed from the encoder output, while the matrix Q is derived from the preceding layer in the block. The six decoder blocks thus capture both the information extracted by the encoder and the content the decoder itself has produced before the current moment.
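The masking step in the decoder can be sketched as follows. The helper names and the all-zero score matrix are assumptions for illustration; the sketch only shows how setting disallowed scores to negative infinity before the softmax gives later positions exactly zero attention weight.

```python
import math


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def masked_softmax(scores, mask):
    """Row-wise softmax in which masked-out positions receive zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        peak = max(masked)  # finite, because each position may attend to itself
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Uniform raw scores for a three-token decoder input (illustrative values).
raw_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
decoder_weights = masked_softmax(raw_scores, causal_mask(3))
```

With uniform scores, the first position attends only to itself, the second splits its attention over the first two positions, and only the last position sees the whole prefix.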

A burgeoning paradigm of pre-training and fine-tuning in recent years has further accelerated the spread of the transformer architecture. Fine-tuning is a common transfer learning method that partially adapts a model pre-trained on large-scale datasets to a new scenario with similar data characteristics. It is particularly beneficial for preventing overfitting when the data volume is limited and for accelerating model training when the data amount is enormous. In most cases, the success of a text-related task largely depends on mining semantic features from the given corpus. A model pre-trained on a large number of corpora can, through fine-tuning, provide efficient text feature extraction for transfer to new applications. Two well-acknowledged pre-trained transformer techniques are the Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT). The former, proposed by Devlin et al. (2019) from the Google AI team and the more commonly used of the two, is a bi-directional language model based on the transformer encoder structure, while the latter, provided by OpenAI (Radford et al. 2018), is essentially a transformer decoder that adopts masked self-attention and can only embed one-sided information for a word in the sentence. Both are trained on massive unlabeled datasets and can be fine-tuned for various applications without extensive task-specific modifications to the architecture.

4 Summary of deep learning applications in business chatbots

With the understanding of various deep learning structures and their computational methods, we further analyze how each stream of deep learning technology can be used in dialogue systems and compare their characteristics, as summarized in Tables 3 and 4. Table 5 presents selected papers’ adopted deep learning streams in each application and concludes their artifact design. This section provides a thorough summary and critical discussion on the deep learning applied in business chatbots from four critical perspectives: pre-processing for natural languages, NLU, NLG, and external knowledge enhancement for response quality improvement.

Table 3 Summary of deep learning technologies and their usages
Table 4 Summary of deep learning technological highlights and limitations
Table 5 Summary of paper contributions

4.1 Pre-processing of natural languages

Pre-processing is a series of adjustment operations to convert natural language text into an analyzable and predictable computer language form, including spelling correction, tokenization, stop word removal, word normalization, and text vectorization (word embedding). Traditional pre-processing technologies are based on the intuitive understanding of human languages and statistical learning methods (Johnson 2009). For computer processing, designers would program to split the words, remove the unimportant high-frequency terms, convert words to lowercase, change parts of speech, and conduct many other operations. The texts are eventually mapped to vector representations through bag-of-words or n-gram statistical language models before further NLU or NLG processing. This process of vector representation of words is called word embedding. However, not all tasks require the same level of pre-processing operations, and previous methods might ignore the order of words, part-of-speech, and synonyms, resulting in a massive loss of information. The current increase in deep learning technologies provides more options for the word embedding process of NLP. Novel text vector representation techniques in neural network structure reduce the tedious step-by-step pre-processing, replace the traditional statistical learning methods, and overcome the strict demand of statistical learning for elaborate feature engineering. The performance of chatbots has benefitted from the leap in deep learning. It has improved significantly from the pre-processing stage by capturing more information from human language, which further promotes chatbot upgrading. The following contents elaborate on how developers choose deep learning to pre-process received input texts for dialogue systems in consideration of development requirements and technical features.
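The traditional pipeline described above (tokenization, lowercasing, stop word removal, and bag-of-words or n-gram vectorization) can be sketched with the standard library alone. The stop word list, the vocabulary, and the example sentences are hypothetical and far smaller than anything used in practice.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not a standard resource).
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "i"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def preprocess(text):
    """Tokenize and drop high-frequency stop words."""
    return [token for token in tokenize(text) if token not in STOP_WORDS]


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs, ignoring word order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def bigrams(tokens):
    """Adjacent token pairs, which retain a little local word order."""
    return list(zip(tokens, tokens[1:]))


tokens = preprocess("I want to book a train ticket")
vector = bag_of_words(tokens, ["book", "ticket", "refund"])

# Two sentences with opposite meanings share one bag-of-words vector.
vocab = ["dog", "bites", "man"]
same_vector = bag_of_words(preprocess("dog bites man"), vocab) == bag_of_words(
    preprocess("man bites dog"), vocab
)
```

The last comparison illustrates the information loss noted in the text: once word order is discarded, the two sentences become indistinguishable, while their bigrams still differ.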

One typical neural network embedding technology is Word2vec, proposed by Mikolov et al. (2013). It is a group of techniques that builds a standard neural network structure model to reconstruct linguistic word texts and detect synonymous words or suggest additional words for a partial sentence. The by-products of the Word2vec model are used to generate word vectors. The essence of this technology is to reduce dimensions for the one-hot vector of a word through the parameters of the input or output layer of a shallow, two-layer neural network, which significantly improves the performance of computing continuous vector representations of words from huge datasets. Many business chatbots adopted this technology as the standard pre-processing of text vectorization for the received user input, such as customer service support chatbots on Twitter (Aleedy et al. 2019) or in call centers (Xue et al. 2019), a train ticket trading system (Prasomphan 2019a), and an online stock information provider (Jiao 2020). Other technologies, such as GloVe developed by Pennington et al. (2014) at Stanford and fastText offered by Facebook’s AI Research lab (Bojanowski et al. 2017), apply a similar standard neural network structure while exploring more relationships among words. GloVe considers the global word-word co-occurrence statistics of the training corpus for word embedding, in which the distance between words implies more semantic information. This technology has been utilized in common chatbots devoted to customer support services on social media for marketing needs (Hardalov et al. 2019; Kushwaha and Kar 2020). FastText aggregates the n-gram model for word order information representation and the subword model (a text processing method between word and character levels) for atomic unit segmentation of word vectors. Its feasibility has been tested under e-commerce scenarios, such as the commercial telco platform (Ferrod et al. 2021) and the online food delivery company (Brahma et al. 2021).
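The subword idea behind fastText can be illustrated with a short sketch that extracts character n-grams from a word wrapped in boundary markers. The function name and the example words are assumptions; the sketch shows only the segmentation step and not how the resulting subword vectors are trained or summed.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


trigrams = char_ngrams("where", n_min=3, n_max=3)

# Related word forms share subword units, which helps with unseen words.
shared = set(char_ngrams("booking", 3, 3)) & set(char_ngrams("booked", 3, 3))
```

Because "booking" and "booked" share several trigrams, a word that never appeared in training can still receive a meaningful vector from the subword units it has in common with known words.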

The convolutional and pooling operations allow CNN to perform well in dealing with two-dimensional data. At the same time, some researchers have attempted to adapt the structure for the one-dimensional convolutional model to extract text features. For instance, Chen et al. (2019) combined CNN with attention and gate mechanisms to extract product review features and encode product-related questions in an e-commerce application. Liu et al. (2020) used a similar technique to learn the dialogue sequence representation. In other customer service scenarios, such as hotel and restaurant consulting, developers (He et al. 2019; He and Tang 2021) have also attempted to use CNN to encode conversation information. In particular, CNN can produce embedded information at the character level. This advantage of CNN is conducive to solving the out-of-vocabulary problem. One well-accepted technique is Char-CNN proposed by Zhang et al. (2015), and Quan et al. (2018) used it to process dialogue information of a chatbot built for the real estate industry.

RNN and its variants, such as LSTM and GRU, are created to handle sequence data, which is naturally suitable for dialogue information processing as human languages are sequential. The internal state (memory) structure can help the system learn the nonlinear features of input sentences, integrate the order of words, and process variable-length sequences, which solves the polysemy problem, earning the interest of many dialogue system developers. In Word2vec, the order of words is not emphasized under the assumption of the bag-of-words model, but it is crucial in the temporal dynamic recurrent network. Because of the powerful text parsing performance of the Seq2seq (encoder-decoder) structure, encoders with different processing units are widely used for sentence embedding. Bartl and Spanakis (2017) and Kushwaha and Kar (2020) used the RNN encoder to extract context features and embed dialogues to build customer service supporting social media chatbots. Primitive encoders only encoded the sentence in a fixed order, sequentially or reversely, to process the text word by word. The improved RNN techniques with the bi-directional structure have proved more effective (Zhou et al. 2016) and have been extensively used in NLP. ELMo (Peters et al. 2018) is one of the representative bi-directional language models to create contextualized word embedding. This architecture can capture more text features by reading the input sentence from two directions, which means the content before and after a word can be processed simultaneously. Variants such as Bi-LSTM and Bi-GRU are common for the dialogue embedding in auto-response applications, including the service industry (Quan et al. 2018; Ren et al. 2020) and e-commerce platform (Kulkarni et al. 2019).

The essence of neural network embedding technologies is to leverage the network structure characteristics to explore the relationships between words in a given text. Each network can extract distinct text features, although not necessarily effective. A hybrid method might be feasible for complex text processing to capture multiple text features. For example, CNN is good at dealing with image data but did not improve much when representing one-dimensional text information. However, when combined with RNN, it can help integrate various extracted text features. Moirangthem et al. (2018) introduced a hybrid of CNN and GRU and evaluated its performance in modeling longer semantic sequences. Prasomphan (2019b) combined CNN with RNN to extract the underlying abstract features of data for dialogue representation in an online sales assistant application system. Ren et al. (2020) adopted a similar method to extract n-gram features from conversations and learn the synergic representation in a restaurant conversational recommender system.

Since the team from Google Brain proposed the transformer architecture (Vaswani et al. 2017), its strong performance in handling sequential data has led to a boom in replacing RNN-based models across various NLP tasks, even extending to other research fields such as computer vision. Its full use of the encoder-decoder structure and self-attention mechanism lets a transformer process a sentence in parallel and globally, which is more efficient than the word-by-word mode of RNN models. Transformer embedding can provide the vector of each input word with global position and context information. BERT and GPT are the most frequently mentioned transformer embedding techniques in up-to-date business chatbots. They have attracted considerable attention in the latest chatbot design and have been rapidly utilized in various business scenarios, including multi-domain FAQs for small and medium enterprises (SMEs) (Damani et al. 2020; Shalyminov et al. 2020), customer services in the banking industry (Yang et al. 2021b) or e-commerce platforms (Yang et al. 2021a; Li et al. 2021), and emotionally resonant conversation (Chang and Hsing 2021). Observing an increasing number of studies using transformer technology, we believe it will continue to dominate the NLP field for a while.

4.2 Natural language understanding

NLU refers to the challenging task of endowing a machine (computer) with the reading comprehension capability of human languages, including making a computer understand natural languages and confirming its realization. It is a common, necessary component in pipeline chatbot design to maintain human–computer conversations and recognize the primary semantic information based on pre-processed user input. Previous practices were closely related to pattern-matching techniques. An NLU module must acknowledge the possible sentence pattern in the user utterance and determine the most matched predefined one for the subsequent NLG. The matching process was usually string-level and widely used the distance comparison between two sentences (e.g., cosine similarity) as the matching similarity (Thomas 2016; Pilato et al. 2007; Wei et al. 2014; Setiaji and Wibowo 2016; Augello et al. 2009). Currently, the modularization design of the pipeline structure demands refining the extracted knowledge and capturing more information from natural languages. Deep learning technologies can produce extraordinary performance in such fine-grained mining of sentence semantic information. Unlike matching a similar sentence structure to give a preset response, the NLU module in the current pipeline structure will perform several subtasks to generate more granular semantic content, leading to more accurate answers.
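The string-level pattern matching used in earlier chatbots can be sketched as a cosine similarity between word-count vectors. The predefined patterns and the user utterance below are made up for illustration, and whitespace tokenization is a simplifying assumption.

```python
import math
from collections import Counter


def cosine_similarity(sentence_a, sentence_b):
    """Cosine similarity between the word-count vectors of two sentences."""
    counts_a = Counter(sentence_a.lower().split())
    counts_b = Counter(sentence_b.lower().split())
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_pattern(utterance, patterns):
    """Return the predefined pattern most similar to the user utterance."""
    return max(patterns, key=lambda pattern: cosine_similarity(utterance, pattern))


# Hypothetical predefined sentence patterns.
patterns = ["what are your opening hours", "how do i reset my password"]
matched = best_pattern("when are your opening hours today", patterns)
```

A matcher like this depends entirely on shared surface words, so a paraphrase with different vocabulary would score zero, which is one reason the finer-grained semantic analysis discussed above became attractive.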

4.2.1 Intent recognition and slot filling

The most commonly detected information includes user intents and slots concerned with the predefined entities. The intents and slots are template information designed in advance. Intent recognition refers to classifying the user intent in the utterance. Slot filling detects the possible entity values and matches the correspondent entity type from the utterance. Figure 16 illustrates the intents and slots in the model training corpus of the RASA Version 3.0 NLU component. The intent “greet” includes three kinds of greeting ways in which the third example involves the entity “Sara” that belongs to the entity type “name.” These corpora are designed to train the NLU model. In practical application, the trained NLU component will classify a user greeting as “greet” intent and try to detect the entity (username) to fill in the “name” entity type.

Fig. 16 Short example of intents and slots defined in RASA 3.1. Source: rasa.com/docs/rasa/training-data-format/#example
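The training examples described above can be mimicked with a small parser. The sketch assumes an inline annotation style of the form `[value](entity_type)`, mirroring the "Sara" and "name" example in the text; the dictionary layout and helper name are hypothetical, and no Rasa library is used.

```python
import re

# Matches inline annotations written as [entity value](entity_type).
ENTITY_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_example(annotated):
    """Split an annotated training example into plain text and its entity annotations."""
    entities = [
        {"value": match.group(1), "entity": match.group(2)}
        for match in ENTITY_PATTERN.finditer(annotated)
    ]
    text = ENTITY_PATTERN.sub(r"\1", annotated)
    return text, entities


# Hypothetical training data: one intent with three greeting examples.
TRAINING_DATA = {"greet": ["Hey", "Hi", "hey there [Sara](name)"]}

parsed = {
    intent: [parse_example(example) for example in examples]
    for intent, examples in TRAINING_DATA.items()
}
```

Each parsed example pairs the plain utterance (the classifier input for intent recognition) with the labeled entity spans (the targets for slot filling).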

Existing literature deals with these two classification subtasks in two modes, separately or jointly. When they are handled separately, intent recognition is usually conducted independently before slot filling. As with the technique preferences in word embedding, CNN- and RNN-based deep learning models have been widely used to capture the relationships between the user input and the predefined intents and slots because of their outstanding data processing ability. Li et al. (2017) used a one-layer convolution-pooling CNN to classify user intents in an e-commerce service assistant. LSTM or its well-known variant, Bi-LSTM, has been frequently applied to recognize intents in customer service scenarios, such as fashion need analysis (Liao et al. 2018), call centers (Xue et al. 2019; Zhao et al. 2019), and other commercial applications (Majid and Santoso 2021; Arsovski et al. 2019). Researchers regard slot filling as a sequence labeling task that is naturally suited to an improved RNN model. For example, the Bi-LSTM model could be used to achieve the task based on intent classification (Yu et al. 2020) or independent of intent classification (Haihong et al. 2020).

Some researchers prefer a joint training method for its advantages of utilizing the dependency between intents and slots (Liu and Lane 2016; Xu and Sarikaya 2013; Hakkani-Tur et al. 2016), which appears to be closer to how the information flows in human brains with semantic hierarchy exploited (Zhang et al. 2018a). For example, the sentences “Welcome to New York is a marvelous song” and “Welcome to New York, babe” have the same strings, “Welcome to New York”, but their intents are intuitively different based on our understanding. Slot types may be helpful for a computer to understand the intents of human languages. The slot in the first one could be recognized as a “song” slot with the value of “Welcome to New York,” while a “location” slot could be extracted from the second one with the value of “New York”. Their intents should be classified separately according to different slot values because of the discussion about a song and the welcome greeting. Deep learning technologies based on RNN, especially the improved LSTM and GRU, were once more the common choices for chatbot designers (Bhathiya and Thayasivam 2020; Wu et al. 2021) to complete the joint task of intent and slot detection. In the last two years, the potential of transformer architecture has been exploited greatly in NLP, and the mode of pre-training and fine-tuning makes BERT applicable for various tasks related to text features, including this joint detection task. It can appear in scenarios such as mobile selling-buying in an integrated form of BERT and CapsNet (Tiwari et al. 2021), financial investment (Yu et al. 2021), and banking (Lothritz et al. 2021) for customer services.

4.2.2 Topic domain and question type classification

To improve the responding performance of pipeline structure, some researchers have tried coming up with more refined subtasks to help the chatbot obtain a better understanding of natural languages. Apart from intent and slot detection, chatbot developers seek to produce more semantic information through the topic classification of a user utterance to increase the accuracy of downstream tasks. Oh et al. (2018) used a fully connected neural network to classify the dialogue domain to detect out-of-domain utterances in banking services. Similarly, Paul et al. (2019) applied a neural network to classify topics in given texts for the services in an electric shop. A pre-division of user questions might be another feasible way of improving reading comprehension. Kulkarni et al. (2019) developed a CNN model to classify e-commerce question types for customer services. Zhao et al. (2019) adopted a hybrid model of CNN and LSTM to implement a question category classifier in mobile customer services.

4.3 Natural language generation

NLG is the fundamental function of a chatbot: producing the responses through which it interacts with users. It is the last component in dialogue systems with a pipeline structure and the only essential component in an end-to-end design. From a technical point of view, response-producing methods can be categorized into two types: retrieval-based and generation-based. When receiving user input, retrieval-based methods return predefined responses, while generation-based methods generate responses word-for-word. Furthermore, based on our observations from the literature of the last five years, retrieval-based techniques can be further divided into scoring models and response action selection, according to how the preset responses are used. This difference in model input and output design affects developers’ preference for deep learning technologies in the response retrieval function. Accordingly, we summarize these three response-producing ways of NLG to conduct a comprehensive comparative analysis in this section: (1) scoring model to rank response candidates, (2) classification model for response selection, and (3) response generation.

4.3.1 Response scoring model

A scoring model is designed to calculate a matching score for evaluating a preset response filtered by the NLU component (if any). The model inputs generally include the pre-processed representations of the user utterance, a possible response, and the chat context or dialogue history. The model then computes, for every candidate, a score measuring the matching degree between the given context and the response. The chatbot selects the candidate(s) with the highest score and replies to continue the conversation with users. A predefined response can be presented to the model in two formats: as the “answer” to the given context, or as the “question” in a “question–answer” pair. A dialogue system using the former format replies to users with the input “answer” directly, while one using the latter replies with the “answer” paired with the best-matching “question.” These two ways share a similar design philosophy and technical principle. An effective scoring model is expected to learn the features of the input contexts and responses and capture the relationships among them. The score can be the similarity (Bartl and Spanakis 2017; Prasomphan 2019b; Zhao et al. 2019), probability (Shukla et al. 2020; Hardalov et al. 2019), matching degree (Yang et al. 2018; Qiu et al. 2018; Prasomphan 2019a), confidence score (Li et al. 2017), or another metric that evaluates the extent to which the response fits the corresponding user utterance.
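The two input formats described above can be sketched with a pluggable scoring function. A simple word-overlap (Jaccard) score stands in for the neural scoring model, and all utterances, candidates, and function names are hypothetical.

```python
def overlap_score(context, candidate):
    """Stand-in scorer: Jaccard overlap between the word sets of two texts."""
    context_words = set(context.lower().split())
    candidate_words = set(candidate.lower().split())
    union = context_words | candidate_words
    return len(context_words & candidate_words) / len(union) if union else 0.0


def rank_responses(context, candidates, score_fn=overlap_score, top_k=1):
    """'Answer' format: score each candidate response directly against the context."""
    ranked = sorted(candidates, key=lambda c: score_fn(context, c), reverse=True)
    return ranked[:top_k]


def answer_from_qa_pairs(utterance, qa_pairs, score_fn=overlap_score):
    """'Question' format: match the utterance to stored questions, return the paired answer."""
    question, answer = max(qa_pairs, key=lambda pair: score_fn(utterance, pair[0]))
    return answer


best = rank_responses(
    "where is my order",
    ["your order is on the way", "our store opens at nine"],
)

reply = answer_from_qa_pairs(
    "track my order please",
    [
        ("how do i track my order", "Use the tracking link in your email."),
        ("what is your return policy", "Returns are accepted within 30 days."),
    ],
)
```

Swapping `overlap_score` for a learned model that returns a similarity, probability, or confidence value leaves the selection logic unchanged, which reflects the shared design philosophy of the two formats.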

Unsurprisingly, CNN- and RNN-based deep learning technologies and their hybrids are frequently used for complex text feature learning. Seq2seq and bi-directional RNN-based structures remain the most common choices for generating the evaluation score in customer service scenarios such as e-commerce (Li et al. 2017), Microsoft technical support (Yang et al. 2018) and other applications (Bartl and Spanakis 2017; Damani et al. 2020). CNN-based models integrated with strategies such as attention mechanism and transfer learning have also been applied in e-commerce services (Qiu et al. 2018; Song et al. 2020). The performance of hybrid models has been demonstrated in different combinations, including CNN and RNN in SME trading systems (Prasomphan 2019b), CNN and LSTM in mobile customer services (Zhao et al. 2019), and CNN and GRU in a train trading system (Prasomphan 2019a).

RNN- and CNN-based deep learning technologies have been commonly used in the pre-processing and NLU stages of extracting semantic features for input texts, and hence, some researchers might try other types of neural network models in the NLG process to learn natural language information that differs from that captured by CNN- or RNN-based models. The standard neural networks with structural fine-tuning can also effectively calculate ranking scores. Hardalov et al. (2019) improved the neural network model to predict the probability of response candidates in the simulation of Apple’s customer support on Twitter. Kulkarni et al. (2019) built a neural network for similarity-based response ranking in e-commerce services. With the advancements in transformer technology, its pre-training and fine-tuning schema has been utilized to upgrade a similar function. For example, Tahami et al. (2020) developed a scoring model in BERT to accelerate the computation of the matching degree between any new conversation history and a response candidate.

4.3.2 Response selection model

Compared to the scoring model using responses as input, a classification model for response selection takes responses or response-producing ways as output with user utterance and chat contexts as input. A dialogue system in such classification methods reacts to users by selecting the most probable output action (a response or a way to produce the response) according to the classification results (usually a probability distribution over actions). This approach is common in small and medium-scale application scenarios where the conversation domain and response type are finite.

Almost all typical neural network models are capable of the response classification task. In the context of customer services for small businesses, Singh et al. (2018) and Paul et al. (2019) adopted RNN and CNN, respectively, to classify the “tags” that can uniquely identify the correspondent “pattern-response” pairs. With the support of the DialogFlow framework, Canas et al. (2021) added a standard neural network for response selection in e-commerce services, and Franco et al. (2020) created an LSTM model for a similar purpose with another popular framework, RASA, in the scenario of cybersecurity-related queries. They all treated each chatbot response as action and used a neural network model to select the most suitable one based on the NLU information processed by the pipeline design. The fashionable transformer technique with the pre-training mechanism was applied to this task. Li et al. (2021) combined RNN, CNN, and BERT to improve the response selection performance of multi-turn conversations in e-commerce.

Because of the technical characteristics of the response classification task, the quantity of the responses (actions) is fixed, and each unit in the output layer of neural network models uniquely corresponds to a predetermined action, enabling it to leverage the value-based DRL at an advantage in long-term (multi-turn) performance. DQN is a reinforcement learning algorithm integrated with a deep learning model that calculates a value for measuring the long-term reward of the corresponding response. It has been applied in the scenarios of restaurant services (Williams et al. 2017), movie booking (Hatua et al. 2019), and Microsoft’s customer support (Shukla et al. 2020). Researchers have also improved the DQN-based algorithm to fit specific application scenarios, such as the Dynamic Reward-based Dueling Deep Dyna-Q to mitigate the negative impact of data noise (Zhao et al. 2020) and the Emotion-Sensitive Deep Dyna-Q to provide emotion-related immediate feedback (Zhang et al. 2021) in the movie-ticket booking communication. They significantly improve users’ interactive experience in long-term transactions.
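The value-based idea behind DQN can be sketched with a tabular Q-learning update, in which a lookup table stands in for the deep network that DQN trains. The dialogue states, the two response actions, the learning rate `alpha`, and the discount factor `gamma` are all assumptions for illustration.

```python
def td_target(reward, next_q_values, gamma=0.9, done=False):
    """Bootstrapped target: immediate reward plus discounted best future value."""
    return reward if done else reward + gamma * max(next_q_values)


def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9, done=False):
    """Move Q(state, action) part of the way toward the temporal-difference target."""
    target = td_target(reward, q_table[next_state], gamma, done)
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


# Hypothetical dialogue states, each with two fixed response actions.
q_table = {"ask_date": [0.0, 0.0], "confirm": [1.0, 0.0]}

# Action 0 in "ask_date" yields no immediate reward but leads to the valuable "confirm" state.
updated_value = q_update(q_table, "ask_date", action=0, reward=0.0, next_state="confirm")
```

The updated value becomes positive even though the immediate reward is zero, which shows how the long-term (multi-turn) benefit of a response is propagated back to earlier turns.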

4.3.3 Response generation

Compared to the previous retrieval-based response-producing methods, generation-based methods are those technologies that can “generate” responses without a searching step when receiving user utterances. Generation-based technologies imply that the neural network model can generate a brand-new response word-for-word. Theoretically, a dialogue system built with generation-based techniques has the potential to answer any query compared to retrieval-based dialogue systems that rely on predefined corpora to give responses.

The rise of these generative methods in chatbots follows the development of RNN-based models proficiently processing sequence data. Among the RNN variants, LSTM has been the most preferred neuronal structure in recent years. With the enhancement of the Seq2seq design, the generative operation mode of RNN has gradually evolved into a universal architecture, as shown in Fig. 8. A Seq2seq model built with LSTM for generating responses can be extensively found in the customer services of social media (Xu et al. 2017; Kushwaha and Kar 2020, 2021), web assistants (Pradana et al. 2017; Prajwal et al. 2019), chit-chat (Kang and Lee 2019), insurance consultancy (Nuruzzaman and Hussain 2020), e-commerce (Lin et al. 2021b), and many other scenarios (Ma et al. 2018). LSTM can deal with long-range dependencies of the input sequence to avoid the tendency of stressing recent contextual information. Improvements with bi-directional design, such as Bi-LSTM and Bi-GRU, can be adopted to increase the extracted sequence information from two directions (Aalipour et al. 2018; Sheikh et al. 2019; Haihong et al. 2020; Chang and Hsing 2021). Seq2seq allows variable-length sentences as input and output to simplify the processing of model training data. Its upgrade, HRED, can consider the turn-taking nature and perform better in multi-turn conversations (Olabiyi et al. 2019; Liao et al. 2018; Bartl and Spanakis 2017). The feasibility of all these improvements has been examined in applications from large enterprises to small shops, and the Seq2seq-based design has also become a classic generation-based artifact.

RNN and its variants can be further improved by integrating other neural networks and advanced technologies to build a satisfactory generation-based dialogue system. As in other applications of deep learning in text processing, CNN is the network most commonly combined with RNN or Seq2seq-based models. For example, Aleedy et al. (2019) integrated CNN, LSTM, and GRU into a hierarchical dialogue system design for customer support services on Twitter, while Chen et al. (2019) incorporated the Gated CNN into the Seq2seq model to build an e-commerce chatbot in cellphone and household electrics contexts. Combining these network structures allows the model to learn different text features. GAN is another common choice integrated with RNN-based models for text-to-text generation with a similar structure. This technique can enhance the original data characteristics through text retelling and improve robustness to unseen data through data augmentation. Olabiyi et al. (2019) used it to capture and shrink the syntactic and semantic difference between the ground truth and the generated persona-specific response. Similarly, Ren et al. (2020) designed a GAN in the dialogue system of restaurant services to bring the responses generated by Bi-LSTM closer to high-quality responses produced by humans.

The progressive technologies of transformer and DRL have also been examined in the generative task. The transformer’s parallel computing capability in global sequence processing greatly speeds up the computation of encoder-decoder architecture compared to RNN models, and its combination with pre-training and fine-tuning mechanism can help with the data efficiency problem. For example, Shalyminov et al. (2020) employed the transformer technique GPT-2 in a multi-domain chatbot design to fit the dialogue system into a rapid prototyping cycle for new products.

Another advanced technology, DRL, can be used to train the model to favor generated responses with high long-term rewards instead of getting stuck in universal and repetitive replies such as “I don’t know.” Generation-based methods can be regarded as models with an infinite space of response actions; hence, to design the rewards, the responses need to be evaluated by continuous indicators with a limited range of values. The rewards adjust the probability that the model generates a given response. A policy gradient algorithm is an appropriate choice for training such a policy-based DRL design to generate high-quality responses that facilitate the continuation of the conversation. Kandasamy et al. (2017) and Liao et al. (2018) applied this technology to improve the user experience in recommendation services, and Ren et al. (2020) employed it in a generative ensemble model of Bi-LSTM and GAN for serving restaurant customers.
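
The policy gradient idea can be illustrated with a minimal sketch. It is not the implementation of any cited study: it assumes a fixed list of three hypothetical candidate replies in place of a generative model's distribution over token sequences, hand-picked bounded rewards in [0, 1], and the basic REINFORCE update with a running-average baseline. The names RESPONSES, REWARDS, softmax, and train are illustrative only.

```python
import math
import random

# Hypothetical candidate replies; a real generative model would define a
# distribution over token sequences rather than over a fixed list.
RESPONSES = ["I don't know.", "Which size do you need?", "It ships in two days."]
# Hypothetical bounded rewards in [0, 1]; the dull reply scores lowest.
REWARDS = [0.1, 0.8, 0.6]


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, lr=0.1, seed=0):
    """Run REINFORCE with a running-average baseline; return final probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(RESPONSES)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a reply from the current policy.
        action = rng.choices(range(len(RESPONSES)), weights=probs)[0]
        reward = REWARDS[action]
        advantage = reward - baseline
        # Gradient of log pi(action) w.r.t. logit i is 1[i == action] - probs[i].
        for i in range(len(logits)):
            indicator = 1.0 if i == action else 0.0
            logits[i] += lr * advantage * (indicator - probs[i])
        # Track the average reward so that only above-average replies are reinforced.
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)
```

Calling train() returns one probability per candidate reply. Under these assumed rewards, probability mass shifts away from the generic reply toward the more informative ones, which is the effect the reward design is meant to achieve.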

4.4 External knowledge enhancement

To enrich diversity and improve the relevance of responses, some researchers utilize external knowledge apart from the given contexts and utterances to assist in response production. A common practice based on deep learning is introducing extra memory space to build a neural network and training it to learn and store task-related knowledge. This additional knowledge needs to be provided artificially to train the neural networks. The models are required to derive more accurate and valuable content from user utterances and enhance the performance of downstream identification and generation components with the additional learned knowledge.

One trending approach is infusing user emotion recognition into the NLU process to optimize the response performance regarding anthropomorphism. Researchers considered empathetic conversation an interactive service of high quality in business situations (Chang and Hsing 2021). They believed that user emotions could reflect the performance of dialogue systems and thus have made various attempts to incorporate human emotion detection into conversational agents and improve their anthropomorphism. Majid and Santoso (2021) added a sentiment classification function in the NLU process to provide satisfactory services. Chang and Hsing (2021) integrated a CNN into Bi-LSTM for emotionally resonant response generation to recognize user emotions, which were further infused into the generating process. Detected emotions can be utilized as indicators for user desire measurement to design the reward in DRL technology, which could speed up the agent learning human emotional feedback. For example, Tiwari et al. (2021) and Zhang et al. (2021) designed immediate rewards based on the numerical representation of contextual sentiments detected by GRU or LSTM models to achieve an instant evaluation of response actions in transaction services.
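
As a rough illustration of a sentiment-based immediate reward, the sketch below adds a weighted sentiment score of the user's next utterance to a task reward. It is an assumption-laden toy rather than the design of Tiwari et al. (2021) or Zhang et al. (2021): a tiny hand-written word lexicon stands in for the GRU or LSTM sentiment models, and the function names, the lexicon, and the weight of 0.5 are all hypothetical.

```python
# Tiny hypothetical lexicon; the cited studies use GRU or LSTM sentiment models.
POSITIVE = {"great", "thanks", "perfect", "good"}
NEGATIVE = {"bad", "wrong", "useless", "annoying"}


def sentiment_score(utterance):
    """Return a score in [-1, 1]: (positive hits - negative hits) / total hits."""
    tokens = [w.strip(".,?!").lower() for w in utterance.split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0


def immediate_reward(task_reward, next_user_utterance, weight=0.5):
    """Add a weighted sentiment term to the task reward for the agent's last action."""
    return task_reward + weight * sentiment_score(next_user_utterance)
```

Because the sentiment term is available after every turn, the agent receives feedback on each response action instead of waiting for the end of the dialogue.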

In addition to the universal sentiment knowledge, some attempts to utilize specialized external knowledge have been made for specific commercial applications. Some researchers tried to incorporate a topic domain-related recognizer into the NLU component to improve the user service experience by increasing the relevance of responses. In transaction services, Liao et al. (2018) leveraged an HRED model to provide tips on product styles satisfying customers’ fashion needs. Nangoy and Shabrina (2020) utilized a CNN model to classify the product type in user queries. For persona-specific conversations, Olabiyi et al. (2019) introduced and upgraded the HRED model integrated with GAN to learn the features of speaker personality, location, and sub-topic and disambiguate them before the process of generating responses.

Some additional knowledge can be brought in to increase the efficiency of automated conversation services. Ferrod et al. (2021) designed a bi-directional RNN-based model to match an appropriate expertise level with user demands in telco commerce. Lin et al. (2021b) added a text classifier to decide whether to wait for user input or give a response during the e-commerce conversation. Yang et al. (2021b) adopted an LSTM to predict user abandonment behaviors to switch to other modes of banking services timely.

The external knowledge can be represented in a high-dimensional form to facilitate pertinent conversations. Lin et al. (2021a) developed a GNN to jointly learn the customer and product embedding from a customer-product knowledge graph, which was expected to enhance the conversation connections between the customers and their target products in shopping guidance scenarios. The above attempts can effectively improve the performance of downstream tasks and optimize the self-service experience for users, which inspires us to enrich and intensify NLU functions by magnifying the multidimensional exploitation of available knowledge.

5 Critical analysis of the characteristics of business dialogue systems

Section 2 divides contemporary dialogue systems of chatbots appearing in academic business studies into two categories (pipeline and end-to-end chatbot) based on the building architecture. Combined with the acknowledged taxonomy from the NLG perspective, we further differentiate the chatbots regarding the retrieval-based and generation-based deep learning methods and conduct a comprehensive analysis of business chatbots in the four categories, as summarized in Table 6. Whether in the pipeline or end-to-end structure, a dialogue system employed with different deep learning methods in NLG has distinct technology choices and realization characteristics. We believe a critical comparison of technical features would provide detailed insights into the construction of business chatbots in the era of flourishing deep learning technologies. In this section, we illustrate the characteristics of each category and compare their existing application scenarios.

Table 6 Classification of contemporary dialogue systems and their characteristics

5.1 Chatbot in pipeline architecture

A chatbot built in pipeline architecture presents an explicit information processing flow for user utterances. Its NLU component extracts the knowledge from pre-processed natural languages, while the NLG component receives the NLU extracts and produces the response considering dialogue contexts and states. Each component can be adjusted and improved independently based on the designers’ requirements, and the NLU component is a signature ingredient in this structure. Although the pipeline design requires more human manipulation and suffers from the unavoidable error propagation issue among components, it facilitates distributed processing and lets the effectiveness of each component be validated and optimized independently. In addition, the eventual pipeline design can be further differentiated according to the technical characteristics of different NLG technologies.

5.1.1 Pipeline architecture with retrieval methods

Most chatbots in pipeline design adopt a retrieval method (scoring model or response action selection) to determine a predefined response. Developers need to preset all the replies corresponding to the specified questions and segment the conversational knowledge that each component is supposed to capture at different stages. Deep learning methods are selected for their feature extraction strengths so that each learns the specific knowledge required by its assigned NLU or NLG task. This approach is particularly suited for application scenarios where collected dialogue data are limited and the chatting domain and query type are easily identified. The entire design is intuitive and easy to adjust part by part, even in large-scale applications. However, as the volume of data increases, so do the manual operation and system maintenance workloads.
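
The two-stage flow can be sketched in a few lines. The example below is a simplified stand-in, not a reconstruction of any cited system: keyword overlap replaces a trained intent classifier in the NLU step, and bag-of-words cosine similarity replaces a learned scoring model in the NLG step. The intents, keywords, preset replies, and function names are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical predefined replies, grouped by intent (all names are illustrative).
RESPONSES = {
    "shipping": ["Orders ship within two business days.",
                 "You can track your parcel from the order page."],
    "refund": ["Refunds are issued within five business days.",
               "Please share your order number to start a refund."],
}
# Keyword sets stand in for a trained intent classifier (the NLU component).
INTENT_KEYWORDS = {
    "shipping": {"ship", "delivery", "track", "parcel"},
    "refund": {"refund", "return", "money"},
}


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    return [w.strip(".,?!").lower() for w in text.split()]


def detect_intent(utterance):
    """NLU step: pick the intent whose keywords overlap the utterance most.

    Ties (including no overlap at all) fall back to the first intent listed.
    """
    tokens = set(tokenize(utterance))
    return max(INTENT_KEYWORDS, key=lambda name: len(tokens & INTENT_KEYWORDS[name]))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def respond(utterance):
    """NLG step: score only the candidates preset for the detected intent."""
    intent = detect_intent(utterance)
    query = Counter(tokenize(utterance))
    best = max(RESPONSES[intent], key=lambda r: cosine(query, Counter(tokenize(r))))
    return intent, best
```

For instance, respond("How can I track my parcel?") first resolves the intent to "shipping" and then ranks only the two shipping replies. Each stage can be swapped for a stronger model without touching the other, but every new question type requires new preset replies and keywords, which reflects the maintenance burden noted above.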

Many online shopping platforms have created pipeline-based chatbots responding with retrieval methods. Chatbot developers design a deep learning classification model to recognize the intents, slots, query patterns, product types, emotional states, or other prerequisite knowledge and train another model to determine the most suitable response candidate in a regression or classification manner (Li et al. 2017; Paul et al. 2019; Kulkarni et al. 2019; Nangoy and Shabrina 2020; Tiwari et al. 2021; Lin et al. 2021a; Canas et al. 2021). In particular, Singh et al. (2018) applied the machine learning software library TensorFlow to demonstrate a neural network instance that ranked the response candidates filtered from the preset JSON file for feasibility in small business applications. Interestingly, some researchers (Hatua et al. 2019; Zhao et al. 2020; Zhang et al. 2021) have examined the utilization of DQN-based response selection methods and evaluated their availability to consider long-term effects in a movie-ticket booking scenario where multiple rounds of interaction and frequent information queries exist.

Another common commercial use is technical support services. Chatbots are expected to provide customers with instructional answers to technical problems that might otherwise require massive time, money, energy, or material costs. In mobile services, Xue et al. (2019) and Zhao et al. (2019) improved LSTM models adapted to match the most semantically similar responses based on recognized intent categories and question types. For simple enterprise-grade conversational AI experiences, a well-adopted approach is building a chatbot based on an existing comprehensive framework. Franco et al. (2020) embedded a RASA-based chatbot into Telegram software and trained an LSTM model to select a response for cybersecurity-related queries. Shukla et al. (2020) used the Hybrid Code Network (HCN), an RNN-based improved artifact proposed by Williams et al. (2017), to build a scoring model integrated into a chatbot designed in Microsoft Bot Framework SDK.

In the financial area, a trend in chatbot adoption over the past two years has been to migrate a mature development framework directly into specific applications. Financial firms prefer stable techniques that can accurately provide the specified responses. For example, RASA and Facebook Messenger have been widely used in the primary banking business (Bhattacharyya et al. 2020; Lothritz et al. 2021; Yang et al. 2021b), stock information query (Jiao 2020), and financial investment services (Yu et al. 2021). Previously, self-established attempts without a building template could also be found in the banking industry (Oh et al. 2018), where the authors created a fully connected neural network to detect user queries from the banking service domain.

5.1.2 Pipeline architecture with generative methods

A chatbot implemented with generative methods can add an NLU component to form a pipeline structure, increasing the coherence and relevance of the generated response. In some cases, many corpora were available to train a generative model. However, the model might get lost in the massive dialogue data and fail to capture topic-related features, generating universal but meaningless replies. Accordingly, some researchers introduced the NLU design to extract domain-related knowledge and conducted experiments to inspect whether the knowledge could help enhance the thematic coherence and relevance between the generated response and dialogue contexts. Liao et al. (2018) used an LSTM model for intent identification and an HRED model for style preference extraction to emphasize users’ fashion needs in the generative model. Ren et al. (2020) and Haihong et al. (2020) improved customer services similar to the NLU of retrieval-based chatbots; they utilized the slot and response token information to assist in the generation process. These attempts revolved around the dialogue contents, and the mechanism was extended to an emotion-infused improvement. Instead of focusing on topic enhancement, Chang and Hsing (2021) focused on producing emotional replies. They designed a hybrid model of CNN and LSTM to predict user emotions and incorporate them to optimize the training process of the generative model.

5.2 Chatbot in end-to-end architecture

An end-to-end architecture means the design artifact can process original input data and produce final output without following the paradigm of step-by-step knowledge extraction. The development of deep learning technology enables these distributed operations to be integrated into a hierarchical model for collaborative learning of text features. From a supervised learning perspective, all constituent parts of an end-to-end model are trained jointly to learn the data features and update the model parameters. A dialogue system built in this structure can lessen cumbersome manual intervention in the knowledge construction process. It can be regarded as an ensemble model with NLG functionality. Generally, researchers would utilize sufficient corpora of a specific domain to train a deep learning model with a complex structure to capture the syntactic and semantic features of dialogues as fully as possible. Similarly, chatbots in end-to-end architecture can also be differentiated by the common retrieval- or generation-based methods, essentially the same as the techniques used in the NLG component of the pipeline-based design.

5.2.1 End-to-end architecture with retrieval methods

This category of chatbots is employed with retrieval methods described in Sects. 4.3.1 (response scoring model as a regression task) or 4.3.2 (response selection model as a classification task) to directly determine a predefined response based on word embedding of input utterances. It applies to small and medium-scale scenarios (Prasomphan 2019b) where the prepared responses are limited, with the infrequent maintenance work of corpora. The deep learning model is trained to learn hierarchical features of the relationships between the user queries and correspondent responses from the given dialogue data, and this training process will be repeated for the whole dialogue system once the predefined corpora are updated. Hence, the end-to-end structure of high complexity may increase the difficulty of model tuning as the data volume increases.

The approach of building a scoring model in an end-to-end chatbot has been widely attempted in e-commerce services (Qiu et al. 2018; Prasomphan 2019a, b; Song et al. 2020) and customer technical support (Yang et al. 2018; Hardalov et al. 2019; Tahami et al. 2020; Damani et al. 2020). They utilized deep learning techniques to produce matching or ranking scores for the query and response candidates. As for treating the response-producing process as a classification task, the mapping from the input utterances to the multidimensional output places higher requirements on the structural design of neural networks to parse the input text information as much as possible. Li et al. (2021) leveraged an RNN, CNN, and BERT hybrid to build a complex response-selecting model in e-commerce conversation to classify the selection action. In addition, an upgrade of the combination with DRL was introduced to improve the long-term interactive performance of the response classification model corresponding to this dimensionally fixed multi-output structure. Williams et al. (2017) optimized the HCN model with a policy gradient-based reinforcement learning to enable interaction with various users and improve the consumer experience of restaurant services.

5.2.2 End-to-end architecture with generative methods

The generation-based chatbots were commonly built in end-to-end architecture after the LSTM model and Seq2seq structure proposal. Their unique generative performance of literally generating a response based on the given utterance and contexts was widely sought after by researchers. With the theoretical potential of answering any query, this particular structure enables this kind of chatbot to be applied in scenarios covering a wide range of conversation topics.

As mentioned in Sect. 4.3.3, the typical Seq2seq-based model has been examined in e-commerce (Kandasamy et al. 2017; Chen et al. 2019; Lin et al. 2021b; Ma et al. 2018), social media (Xu et al. 2017; Kushwaha and Kar 2020, 2021; Aleedy et al. 2019), technical support (Aalipour et al. 2018; Golchha et al. 2019), and other applications (Olabiyi et al. 2019; Prajwal et al. 2019; Kang and Lee 2019). These techniques present powerful computing and abstraction capabilities to automatically generalize and extract valuable knowledge and text features from massive dialogue data to solve the response-generating problem, allowing developers to avoid the uncertainty and heavy workload associated with manual feature engineering. However, this model still suffers from shortcomings in the contextual logic, correctness, or coherence of the generated response at the current stage. Section 4.3.3 also introduced various methods for improvement, such as combining with hybrid neural network models (Lin et al. 2021b; Aleedy et al. 2019), optimizing the objective function (Kandasamy et al. 2017; Golchha et al. 2019), embedding personalized features (Olabiyi et al. 2019), and adding external memory (Kang and Lee 2019). In general, these methods are difficult to apply in scenarios where the accuracy and professionalism of responses are critical, but their potential to generate new responses deserves further exploration.

6 Future research directions

Currently, research on business chatbots is still in the developing stage. Most published papers discussed applying relatively mature deep learning technologies to develop business chatbots. More research related to emerging technologies should be explored further. From our perspective, research on deep learning utilization in business chatbots should be extended concerning three aspects: (1) new scenarios and emerging technologies; (2) human–computer interaction and usability analysis; and (3) meta-theory and design principles for chatbot development. A summary of the research opportunity is shown in Table 7.

Table 7 Summary of possible research directions and topics

6.1 New scenarios and emerging technologies

Chatbots should be deployed in more industries or businesses to automate customer services because user queries can be handled 24/7, thus increasing productivity and reducing the need for human labor and expenditure. Researchers or developers have designed and implemented chatbot functions adapted to specific domain requirements. However, most existing chatbots are built for e-commerce scenarios or technical support of products, where their business value is not fully reflected. Applying chatbots to new areas may have unexpected effects. Even Eliza’s creators could not have anticipated that chatbots would be used beyond psychology or that a deep learning boom would follow. This exploration requires us to observe the tedious or complex human tasks in daily life that can be automated or partially replaced, especially under the impact of COVID-19. Thus, broadening chatbot adoption into new areas, such as individualized customer care, live streaming, short video marketing, online collaboration, and farmers’ market digitization, is well worth exploring.

We should also try to integrate more emerging technologies into chatbot design to examine the improvement and adaptiveness of the technique. Attempts to apply new technologies to chatbots are conducive to popularizing chatbot technology. Although chatbots are closely tied to the development of computer technology, the chatbot technology involved in business research is often less cutting-edge. Studies using relatively dated technologies not only fail to tap into the commercial value of burgeoning technologies in time but also have difficulty attracting the interest of researchers from other fields, which is not conducive to enhancing the influence of our discipline in this field. For example, the CapsNet technique, a branch of deep learning that is expected to achieve better performance than CNN, has rarely been reported in the business chatbot literature, so its practical commercial value remains largely unexamined.

Apart from the fascinating algorithms used for better response production and plentiful functionalities, chatbot performance also benefits considerably from ground-breaking physics and engineering technology, much as the computational applications of neural networks were inspired by biology. Quantum technology application is a field where an exponential set of input information can be manipulated simultaneously (Chen and Zhang 2014) from a microscopic quantum perspective. A quantum computational system can deal with data with the distinctively quantum properties of uncertainty and entanglement (Bennett and DiVincenzo 2000) and represent more data states with less space than conventional computation. A possible quantum chatbot is expected to receive multiple contradictory inputs from the environment and still develop an excellent response given the conflicting inputs. For example, a quantum chatbot could react accurately and in a targeted manner to complicated human expressions if the complex emotions in an utterance can be understood and captured effectively.

We believe that as computer technology matures and develops, chatbot behaviors will be closer to human beings. Chatbots are also expected to be the next significant technological leap in automatic conversational services (Suhaili et al. 2021). Consequently, we should keep pace with advanced technology associated with chatbots and try to adapt and compare them for business chatbot design.

6.2 Human–computer interaction and usability analysis

The chatbot is an IT artifact that engages in frequent human–computer interaction, and user feedback is integral to the interaction. Considering and studying users’ behaviors or psychological activities is necessary when applying and designing chatbots. However, deep learning models are particularly perplexing because of their black-box nature (Gehrmann et al. 2020), confining human understanding and technology adoption (Gaur et al. 2021). Although this article conducts a thorough review and systematic analysis of applications and characteristics of mainstream deep learning technologies, how the adoption of different network techniques in chatbot design affects human–computer interactions remains unclear.

Inspired by a famous design science research framework proposed by Hevner et al. (2004), we address chatbot research by building and evaluating the artifact and testing whether the employed deep learning technologies meet the identified business needs. At this stage, where deep learning has been widely utilized in laboratory-level chatbot design, few studies have examined its human-oriented effects, limiting its extension to practical business scenarios that require high certainty and stability. An observation is that the banking industry prefers a pipeline-based chatbot framework that can explicitly demonstrate the information flow in chatbot systems with a predictable response produced. Therefore, apart from the feasibility of new technology, assessing the effects of adopting the technology on human experience and activities and examining its practical usability for improving chatbot design are also necessary. Although lacking in examinations of deep learning under chatbot contexts, some interesting studies can contribute to our better understanding. Luo et al. (2021) summarized several chatbot studies of usability analysis from the perspective of task performance, user attitudes, user trust, and user adoption, which could be extended to deep learning perception in chatbot design to unfold the mechanism of how it affects human beings.

6.3 Meta-theory and design principles for advanced chatbot development

A framework or guideline for systematically designing a practical chatbot and identifying its reasonable value may be necessary because developers differ in their mastery of the various disciplines and technologies involved. The major challenge for chatbot development is its resource-intensive nature in terms of skills, time, and user interactions (Jackson and Latham 2022). With the rise of powerful technology, such as new deep learning and reinforcement learning methods, it has become possible to consider technique features, business needs, and human factors together in IT artifact design to improve performance comprehensively. As modern chatbots and their building technologies have become more diverse, more meta-theories or design principles are necessary to guide and regulate artifact construction.

Some researchers have been committed to generalizing practical frameworks or approaches to guide the chatbot design in some commercial applications. Lai et al. (2019) designed a security control procedure for banking chatbots and analyzed the e-commerce security strategies to reduce the security risk and concretely protect customer data security and personal privacy. Albert et al. (2019) affirmed the powerful impact of deep learning and considered its incorporation into a robust step-by-step development approach to business interaction systems. Sperli (2021) proposed a framework to model different cultural objects into a unified data model to support tourists’ self-services. Other researchers have shared far-sighted views following characteristics of different scenarios based on their observations of chatbot performance. Prabowo et al. (2018) compared the LSTM and simple RNN models in the context of customer service talks across business fields. They summarized their performances and offered suggestions from the perspective of the response accuracy and consumed time. Khan et al. (2019) conducted a correlative analysis of advanced AI technologies. They discussed the applications and inexpediencies of intelligent assistants to provide insight into adapting chatbots for specified scenarios.

Chatbot measurement methods also need to be upgraded to support advanced chatbot testing. Przegalinska et al. (2019) proposed a new approach to track and evaluate the interaction performance of deep learning-supported chatbots to help build better social bots fitting business or commercial environments. Huang et al. (2021) designed an assessment framework that can be used to identify the adoption susceptibility of chatbot applications in the hospitality and tourism industries.

The above studies provide a theoretical reference for the construction of business chatbots and guide individuals and enterprises in choosing the aspects or indicators to consider when evaluating a chatbot design. They also help various disciplines learn from each other about the focus and improvement direction of appropriate and effective chatbot construction.

7 Conclusion

We present a thorough literature review on the design and applications of business chatbots in several application domains. We are currently in the middle of a boom in applying state-of-the-art deep learning techniques to design intelligent and adaptive chatbots. We first introduce two chatbot architectures that have evolved before conducting the mapping study. One of the main contributions of our study is the systematic illustration of the deep learning methods applied to construct business chatbots in the literature. We elaborate on the seven classes of computational approaches for business chatbot development. Another contribution of our study is that we summarize the four main usages of deep learning and compare their performance in each use. We thoroughly discuss the mainstream technology usages from the perspectives of natural language pre-processing, NLU, NLG, and external knowledge enhancement. The third contribution of our study is that it provides a new framework to classify chatbot construction architectures according to their technical characteristics. In particular, we differentiate the traditional classification of retrieval- and generation-based chatbots in terms of the pipeline and end-to-end structures. Finally, we highlight three promising future research directions for business chatbot design and development and call for a more profound exploration of the commercial values of business chatbots.