1 Introduction

Conversational agents have been proposed and designed to enable seamless interactions with people through computer-based means for communication, language processing, interpretation, and dialogue exchange (Adamopoulou and Moussiades 2020). These agents have substantially evolved since their first incarnation, the seminal ELIZA developed by Joseph Weizenbaum (Weizenbaum 1966). Ever since, conversational agents have leveraged Natural Language Processing (NLP), state-machine engines, and pattern matching with the intent of engaging in purposeful conversations with human users. Several milestones mark the technological evolution of conversational bots. Towards the end of the 1980s, Rollo Carpenter developed Jabberwacky (Rollo 1997), a self-learning agent mainly employing contextual pattern matching to identify the best answer (accessible over the internet only later, in 1997). In 1994, the term ChatterBot made its first appearance, used by Michael Mauldin to describe conversational programs (Mauldin 1994). Nowadays, this term has been shortened to chatbot and is used daily to describe these technologies. In the 1990s, considerable progress was made on conversational agent technologies, building on advances in Artificial Intelligence. For example, Richard Wallace developed ALICE (Artificial Linguistic Internet Computer Entity), which leveraged heuristic pattern matching.

In the 2010s, chatbot technologies started to gain adoption outside the academic sphere, in industrial and mainstream applications. Apple was among the first to commercialize a personal assistant with conversational capabilities, releasing Siri in 2011. Initially based on the Active platform (Guzzoni 2008), it assisted iPhone users by recognizing both written and spoken language. Other major technology companies released their virtual assistants shortly after. Google Now for Android and iOS devices appeared in 2012, evolving from a simple recommendation engine into a personal assistant able to dialogue with the user (similar to Siri). Microsoft followed with Cortana, released in 2014. The same year, Amazon launched Alexa, primarily targeting home automation and online shopping; although not tied to any particular OS, it quickly gained market adoption (Etherington 2014). The widespread acceptance of these virtual assistants and their use of asynchronous text-based interactions stimulated instant-messaging applications (e.g., Telegram, Facebook Messenger, and WhatsApp) to release APIs for third-party chatbot development, in addition to the chatbots mainly dedicated to customer service on companies' web pages.

The increasing adoption of chatbots has been boosted by anywhere/anytime availability, immediate response, confidentiality, social acceptance, and massive scalability. Leveraging these aspects, chatbots have proven effective in a wide range of domains such as eCommerce (Cui et al. 2017), education (Winkler and Söllner 2018), and in particular for motivation (e.g., social-network campaigns (Calvaresi et al. 2019)) and support (e.g., customer management (Xu et al. 2017), eHealth (Calbimonte et al. 2019), and assisted-living scenarios (Fadhil and Gabrielli 2017)).

Remarkable recent technological advancements are pushing the evolution of chatbots from keyword-based text recognition or static finite state machines (FSMs) for interpreting and orchestrating user interactions (today still representing a significant share of the market) towards hybrid solutions merging NLP (for text recognition) and FSMs (for the management of intents and user stories) (DeepLink 2022). However, purely FSM-based solutions still expose significant limitations, such as inadequate personalization; lack of real-time monitoring, reporting, and customization; lack of mechanisms to integrate communities of chatbots; limited knowledge-sharing capabilities; and the impossibility of deploying multi-domain campaigns within the same framework. These limitations stem from the predominantly rigid architectures proposed in most existing approaches, which rely on very specific scenarios translated into chatbot logic that must be reprogrammed every time a new scenario arises. This raises the cost of modifying a chatbot's behavior and prevents administrators from adapting it to specific situations. Moreover, most chatbot solutions rely on monolithic and centralized data-management strategies, making it hard to comply with privacy regulations (e.g., the European Union's General Data Protection Regulation, GDPR (Voigt and Von dem Bussche 2017)). The sensitive nature of the data collected through chatbot interactions makes it necessary to shift the control of personal data towards the users themselves, empowering them in the process. Many chatbot systems have used AI to boost the accuracy and user experience of their interactions; examples include the use of NLP to generate asynchronous follow-up questions (Rao et al. 2021) and the application of neural networks to perform emotion detection in chatbot conversations (Huddar et al. 2021). However, these AI techniques focus on response generation and the monitoring of conversational context, without considering the autonomous, decentralized, and collaborative nature of chatbots.
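To make the hybrid NLP+FSM design concrete, the following minimal sketch pairs a keyword-based intent detector (a stand-in for a real NLP classifier) with an FSM that manages the user story. All intents, states, and replies here are invented for illustration and do not correspond to any specific system from the literature:

```python
from typing import Optional

# Hypothetical intents; a production system would use an NLP classifier
# instead of this keyword lookup.
INTENT_KEYWORDS = {
    "greet": {"hello", "hi"},
    "order": {"order", "buy"},
    "bye": {"bye", "goodbye"},
}

# FSM transition table: (state, intent) -> (next state, reply).
TRANSITIONS = {
    ("start", "greet"): ("menu", "Hello! How can I help you?"),
    ("menu", "order"): ("ordering", "What would you like to order?"),
    ("ordering", "bye"): ("start", "Order saved. Goodbye!"),
}

def detect_intent(text: str) -> Optional[str]:
    """Naive keyword-based stand-in for an NLP intent classifier."""
    words = set(text.lower().split())
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return None

class HybridChatbot:
    """Dialogue manager: the FSM orchestrates the user story."""

    def __init__(self) -> None:
        self.state = "start"

    def respond(self, text: str) -> str:
        transition = TRANSITIONS.get((self.state, detect_intent(text)))
        if transition is None:
            return "Sorry, I did not understand that."
        self.state, reply = transition
        return reply
```

The rigidity discussed above is visible even in this toy: supporting a new scenario means rewriting the transition table, which is precisely the cost that agent-based approaches aim to reduce.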

In the last decade, the trend of combining chatbots with multi-agent system (MAS) models and technologies has sought to mitigate the limitations mentioned above. Particular emphasis is given to application domains where the social and collaborative dimensions (e.g., crowd-sourcing, user profiling, and personalization) are essential in the interaction with users. These features are particularly relevant for domains such as healthcare and the fostering of behavioral change (Pereira and Díaz 2019), where the majority of the studies bridging chatbots and MAS can be found (Calbimonte et al. 2019; Calvaresi et al. 2019).

To better understand the current panorama of chatbot technology solutions employing agent-based approaches, this work presents a Systematic Literature Review (SLR) investigating the application domains, end users, requirements, objectives, technology readiness levels (TRL) (European Commission 2017), designs, strengths, limitations, and future challenges of the solutions found in the literature. The goal is to provide a tool for researchers, software engineers, innovation managers, and other practitioners to investigate the current state of the art and discuss the open challenges.

The rest of the paper is structured as follows: Sect. 2 presents the methodology applied for performing the SLR. Section 3 presents the review planning phase, including the definition of the protocol and the research questions. Section 4 describes how the review was performed. Section 5 analyses the outcomes of the applied methodology structured according to the research questions. Section 6 discusses the obtained results, projecting them into the stated (by the primary studies) and envisioned (by the authors of this paper) future directions. Finally, Sect. 7 concludes the paper.

2 Systematic literature review methodology

The approach employed in this paper aims at being both rigorous and reproducible. It relies on the methodology outlined by Kitchenham (Kitchenham et al. 2009), which has also been employed in similar contexts (Palmarini et al. 2018; Calvaresi et al. 2021b; Anjomshoae et al. 2019; Mualla et al. 2019; Calvaresi et al. 2018). Figure 1 presents a schematic representation of the adopted procedure, which comprises three stages:

P1: Planning the review. This phase consists of defining the main generic question(s), deriving the Structured Research Questions, characterizing the entire search protocol, matching the requirements (rigor and reproducibility), and validating the protocol.

P2: Performing the review. This phase entails the execution of the planned activities: literature collection and selection, literature elaboration, and disagreement resolution.

P3: Dissemination. This phase includes the analysis, documentation, and reporting of the results, and a summary of the lessons learned.

Fig. 1 Systematic literature review phases (Kitchenham et al. 2009)

3 Review planning

This section describes the definition of the structured research questions and the development of the review protocol describing the search strategy, the inclusion and exclusion criteria, the biases and disagreement resolution, and the quality criteria.

3.1 Research questions

As introduced in Sect. 1, the research community has proposed multi-agent-based chatbots in recent years for different domains, stakeholders, and purposes. The main research question can therefore be formulated as follows: How are agent-based chatbots characterized, envisioned, and employed? To investigate this question, we follow the Goal-Question-Metric (GQM) approach (Galster et al. 2014; Kitchenham et al. 2010). This approach has been employed in several other studies in computer-science-related domains (e.g., augmented reality for maintenance (Palmarini et al. 2018), virtual reality for education (Radianti et al. 2020), explainable agents and robots (Anjomshoae et al. 2019), agents and blockchains (Calvaresi et al. 2018)) and in other domains (e.g., tourism (Yang et al. 2017; Calvaresi et al. 2021b)). The dimensions targeted in this study apply to "intelligent" technologies and research: scientific interest over the years, application domains, stakeholders, requirements, goals, technologies, advantages, limitations, countermeasures, and future research. By formulating questions addressing these aspects, we provide investigations and analyses in support of practitioners (an aggregated understanding of current work), new tech pioneers (an overview of what has been tried and what might be future targets), and industrial researchers (bringing research ideas to the real-world market). Thus, we devised a set of ten structured research questions.

SRQ1:

To establish an understanding of the demographic evolution of agent-based chatbots, we inquire: How are the research efforts temporally and geographically distributed?

SRQ2:

To elicit the domains on which the agent-based chatbots research focuses, we inquire: Which application domains have employed agent-based chatbots?

SRQ3:

To clarify who are the stakeholders of agent-based chatbots, we inquire: Who are the users of the chatbot systems relying on the agent paradigm?

SRQ4:

To formalize the requirements arranged w.r.t. the given stakeholders, we inquire: What are the requirements standing behind the employment of agent-based chatbots?

SRQ5:

To explore what research tried to achieve with agent-based chatbots, we inquire: What are the objectives set for agent-based chatbots?

SRQ6:

To better understand the technological characterization, we structured SRQ6 into four sub-questions:

a):

Which chatbot design (e.g., paradigms) and implementations have been proposed?

b):

Which technologies have been employed in the proposed solutions?

c):

Which technologies have been previously employed?

d):

What is the Technology Readiness Level (European Commission 2017) of the solutions proposed in the primary studies?

SRQ7:

To explore the benefit of existing solutions, we inquire: What are the strengths of employing agent-based chatbots?

SRQ8:

To identify the shortcomings of the existing solutions, we inquire: What are the limitations of employing agent-based chatbots?

SRQ9:

To understand the measures employed by the authors to achieve their objectives and overcome the limitations, we inquire: What are the proposed solutions for the limitations identified in SRQ8?

SRQ10:

Finally, to foster the establishment of future objectives, we inquire: What are the future challenges for chatbot-based solutions envisioned by the primary studies?

3.2 Review protocol

The search strategy included the selection of the following information sources: IEEE Xplore, ScienceDirect, ACM Digital Library, CiteSeerX, and PubMed. The selection of keywords relied on the reviewers' background and knowledge of agent-based chatbots; the keywords include: multi-agent system, MAS, agent-based, chatbot, conversational agent, virtual assistant, personal assistant. To increase the results' accuracy, some keywords were combined. For example, MAS was expanded into three different queries: (i) MAS + chatbot + virtual assistant, (ii) MAS + chatbot + personal assistant, and (iii) MAS + chatbot + conversational agent.
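The expansion of a keyword into several queries amounts to taking the Cartesian product of keyword groups, one term per group. A small sketch (the grouping of keywords is our illustration, not the reviewers' exact procedure):

```python
from itertools import product

def build_queries(*keyword_groups):
    """Return every AND-combination taking one keyword per group."""
    return [" + ".join(combo) for combo in product(*keyword_groups)]

# Reproduces the three MAS-based expansions mentioned in the text.
queries = build_queries(
    ["MAS"],
    ["chatbot"],
    ["virtual assistant", "personal assistant", "conversational agent"],
)
```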

Each search query produced a set of articles added to the list of papers to be considered. The reviewers screened the results of each query to evaluate the articles' coherence with the study; in particular, titles and abstracts were assessed according to the criteria presented in the next section.

3.2.1 Inclusion and exclusion criteria

The initial search collected 108 papers, hereafter referred to as primary studies. Additional filtering criteria were then applied (see Table 1). In particular, the criteria were selected to (i) avoid multiple (usually incremental) papers describing the same work, (ii) bound the time window of the investigation (e.g., excluding older, less relevant works, given the technological advancements), (iii) select works contributing to the actual investigated topic, and (iv) ensure the presence of a tangible theoretical/practical contribution, avoiding purely visionary and blue-sky studies. Criteria definitions are usually quite specific per topic/review; nevertheless, several studies adopt similar criteria (Yang et al. 2017; Anjomshoae et al. 2019). Applying the criteria defined in Table 1, we purged unrelated papers and narrowed the set down to 38 contributions. Three reviewers were instructed to verify the compliance of the papers with the aforementioned inclusion criteria. Each reviewer operated independently while filtering the list of papers. After the filtering, a paper was included if at least two reviewers agreed on it.
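The two-of-three agreement rule used above can be expressed compactly; the boolean vote encoding and paper identifiers below are assumptions of this sketch:

```python
def select_papers(assessments):
    """Keep papers that at least two independent reviewers voted to include.

    `assessments` maps a paper identifier to a tuple of boolean votes,
    one per reviewer.
    """
    return [paper for paper, votes in assessments.items() if sum(votes) >= 2]
```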

Table 1 Inclusion and exclusion criteria

3.2.2 Biases and disagreement resolution policy

The policy for bias and disagreement resolution allows the reviewers to cross-examine each task, limiting biases and resolving disagreements among themselves. In particular, during the article selection task, three reviewers cross-validated the inclusion/exclusion decisions. During the elaboration of the articles, uncertainties were discussed in periodic meetings.

3.2.3 Features and quality criteria

Assessing the quality of the extracted information is crucial. The following set of features has been chosen to answer the structured research questions: publication year, geographical localization, main purpose, context, kind of users involved, scenarios, level of abstraction\(\dagger\), architectures and designs, development methodologies, techniques, technologies and devices, user needs coverage\(\ddagger\), need–offered support relation, kind of disease or difficulties supported\(\ddagger\), awareness provided, architectural evidence\(\ddagger\), technological evidence\(\ddagger\), technical evidence\(\ddagger\), architectural limitations\(\ddagger\), technological limitations\(\ddagger\), technical limitations\(\ddagger\), identified future directions, identified future challenges. The feature annotated with (\(\dagger\)) takes C, P, or T as possible values, which stand for: C = conceptual; P = prototype architectures and frameworks, no results provided; T = tested architectures and frameworks, results provided. The features annotated with (\(\ddagger\)) take Y, P, or N values, which stand for: Y = the information is explicitly defined/evaluated; P = the information is implicit/stated; N = the information is not inferable. This categorization of the collected features follows the DARE criteria elaborated and proposed by (Kitchenham et al. 2009).
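The coding scheme can be captured as a small validation helper over one row of the extraction table. The field names below are hypothetical and serve only to illustrate the C/P/T and Y/P/N value sets:

```python
ABSTRACTION_LEVELS = {"C", "P", "T"}  # Conceptual / Prototype / Tested
EVIDENCE_LEVELS = {"Y", "P", "N"}     # explicit / implicit / not inferable

# Hypothetical field names for one extraction-table row.
EVIDENCE_FIELDS = (
    "user_needs_coverage",
    "architectural_evidence",
    "technological_evidence",
)

def validate_record(record):
    """Check one extracted-feature row against the coding scheme."""
    if record["level_of_abstraction"] not in ABSTRACTION_LEVELS:
        raise ValueError("invalid level of abstraction")
    for field in EVIDENCE_FIELDS:
        if record[field] not in EVIDENCE_LEVELS:
            raise ValueError(f"invalid value for {field}")
    return True
```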

4 Review execution

This section details the Perform Review task in Fig. 1. In particular, it elaborates on the review's execution, including details on article collection, selection, and elaboration. The semi-automatic search presented in Sect. 2 resulted in a total of 108 selected articles. The assessment of the primary studies to be finally included in the elaboration phase was conducted by a total of three reviewers. In particular, the articles were organized into three equally sized groups, each elaborated by two reviewers (in rotation), with the third one involved in case of conflict. Table 2 details the selection assessments, referring to the reviewers with the letters \({\mathcal {A}}\), \({\mathcal {B}}\), and \({\mathcal {C}}\).

Table 2 Summary of the inclusion/exclusion phase of the collected papers

The papers have been listed following the collection order and respecting the relevance-based sorting obtained when querying the scientific web collectors. Notably, the third set of papers recorded a drastic reduction in the acceptance rate. This observation suggests two possible interpretations: (i) the stopping criterion was too loose, and/or (ii) titles and abstracts do not properly mirror the papers' content.

The filtering phase concluded with 38 papers to be elaborated out of the 108 initially collected (21.1% total acceptance rate). In turn, the features presented in Sect. 3.2 were extracted and collected in a tabular format to facilitate their elaboration and the identification of possible correlations to be discussed. Nevertheless, in some cases, the extraction of relevant information was challenging due to the lack of explicit statements (e.g., very few studies clearly mention the limitations of their approaches). To cope with this situation, the reviewers leveraged their knowledge of the topic to produce a more comprehensive understanding and offer the reader additional information (rigorously decoupled from the presentation of the results and solely addressed in the discussion).

5 Review results and analysis

In the following, we structure the results of the SLR according to the research questions defined in Sect. 3.1.

5.1 Demographics

Referring to question SRQ1, Figs. 2 and 4 show the temporal and geographical distribution of papers targeting agent-based chatbots. Figure 2 reports the primary studies' distribution over the time window selected for this study. A slight upward trend can be observed in recent years; nevertheless, the research field of multi-agent-based chatbots still seems to be a niche area. Indeed, looking at Fig. 4, the geographical localization of the first authors' institutions (organized per country) reflects the distribution of research groups in the field of multi-agent systems (i.e., centered in the US and Europe). Finally, Fig. 3 provides a further view of the selected primary studies by grouping the papers per continent.

Fig. 2 Total papers per year

Fig. 3 Number of papers per continent per year

Fig. 4 Number of papers per country

5.2 Application domains

Regarding SRQ2, Fig. 5 graphically represents the application domains addressed by the primary studies. The panorama of application domains is remarkably broad and diversified, ranging from education (Alencar and Netto 2014) to healthcare (Kökciyan et al. 2021) and finance (de Bayser et al. 2018). Nevertheless, personalized assistive purposes appear to have attracted the most effort across domains.

Fig. 5 Contributions per application domain

Fig. 6 Type of studies

5.3 Intended user classes

Concerning SRQ3, Fig. 7 shows the distribution of the intended user classes identified by the selected primary studies, which is a direct consequence of the application domains. On the one hand, it is evident that most of the literature operates in the context of education, having students, tutors, or professors as the main users. On the other hand, although a minority, a considerable number of studies are solely conceptual or general (see Fig. 6) and do not tackle a specific use case. Overall, the majority (\(57.89\%\)) of the primary studies present some form of prototype, \(23.69\%\) deal with technical or scientific concepts, and \(18.42\%\) of the selected papers contain extensively tested artifacts.

Fig. 7 Number of papers per type of users

5.4 Requirements

Concerning question SRQ4, we elicited the requirements expressed by the primary studies. We can see the evolution of the main features captured by these requirements in Fig. 9. We categorized the requirements as follows:

  • Functional Requirements: requirements affecting the behavior of the platforms (see Table 3);

  • Architectural Requirements: requirements steering the system or the back end of the platforms (see Table 4);

  • Front-end Requirements: requirements applied to the front end of the platforms (see Table 5).

Figure 8 depicts the distribution of the types of requirements characterizing the primary studies. The authors of the elaborated papers focus primarily on functional (41.7%) and architectural (40.0%) requirements. Front-end requirements were explicitly formalized in only 18.3% of the studies.

Fig. 8 Type of requirements

Table 3 Functional requirements
Table 4 Architectural requirements
Table 5 Front end requirements
Fig. 9 Evolution of features in agent-based chatbots according to the requirements

5.5 Objectives of the studies

Investigating SRQ5, we collected and clustered the objectives of the primary studies, as depicted in Fig. 10. Most of the papers tackle the theoretical foundations of MAS-based chatbots (i.e., nine studies focus primarily on conceptual aspects of the current state of the art or on non-concrete systems). Among them, (Augello et al. 2017) define a notion of "social intelligence" for chatbots and link it to current technologies' capability to develop social chatbots, while (Hung et al. 2009) define an evaluation process to assess the "naturalness" of a chatbot system.

Concerning more practical studies, goal-driven behaviors (e.g., intended to tackle user personalization) have been studied for dietary and entertainment purposes. (Angara et al. 2017) describe a chatbot designed to support users in the kitchen by providing recipe recommendations while adhering to their dietary goals, medical conditions, preferences, and available ingredients. Similarly, (Wong et al. 2012) describe a goal-oriented virtual chat companion for children, focusing on structured entertainment (e.g., story-telling, collaborative games) and on engaging in "free-flowing" dialogue with unstructured responses. Concerning behavioral change, studies such as (Calvaresi et al. 2019; Calbimonte et al. 2019) target profiling and craving analysis to tailor smoking-cessation support, while (Calvaresi et al. 2021a) target the maintenance/improvement of physical balance capabilities with personalized exercises. (Chapman et al. 2019) and (Kökciyan et al. 2021) demonstrate the development of a chatbot system helping stroke patients manage their care: the system processes data from multiple inputs (e.g., blood pressure monitor, electronic health record) to feed a computational argumentation engine and respond to user queries.

From a different perspective, data-driven behavior has been addressed in contributions including (Agostaro et al. 2005; Pilato et al. 2007; Augello et al. 2009), which deal with the limitations of conventional, rule-based semantics by introducing the paradigm of Latent Semantic Analysis (LSA). Indeed, according to (Landauer et al. 1998), LSA overcomes the limits of rule-based pattern matching and introduces an element of intuitiveness by constructing a conceptual space. Another targeted objective is the integration of multiple domain-specific knowledge sources into one chatbot system. For example, (Jiang et al. 2015; Augello et al. 2011) deal with the integration of different static sources (i.e., vector-space-model-based indices, XML, relational databases, SPARQL queries, and AIML), while (Pilato et al. 2011; Tarau and Figa 2004) manage knowledge dynamically based on the current dialogue context.
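The conceptual space built by LSA can be illustrated with a truncated singular value decomposition of a toy term-document matrix. The corpus, terms, and counts below are invented purely for illustration:

```python
import numpy as np

# Toy term-document co-occurrence counts (rows = terms, columns = documents).
terms = ["diet", "recipe", "story", "game"]
X = np.array([
    [2.0, 1.0, 0.0],  # "diet"   appears in docs 1 and 2
    [1.0, 2.0, 0.0],  # "recipe" appears in docs 1 and 2
    [0.0, 0.0, 2.0],  # "story"  appears in doc 3 only
    [0.0, 1.0, 1.0],  # "game"   appears in docs 2 and 3
])

# Truncated SVD: keep k latent "concepts" to build the conceptual space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]  # term coordinates in concept space

def term_similarity(i: int, j: int) -> float:
    """Cosine similarity between two terms in the latent concept space."""
    a, b = term_vectors[i], term_vectors[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Terms that co-occur across documents (e.g., "diet" and "recipe") end up close together in the latent space even when exact keyword matching would treat them as unrelated strings, which is the "intuitiveness" the cited studies exploit.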

While the studies mentioned above operate in a user-to-single-agent scope, a few studies operate in a user-to-multiple-agents (i.e., chatbots) scope. For example, (de Bayser et al. 2017, 2018) address the coordination of multiple bots providing financial advice within the same chat, with the final goal of moderating the user–bots interaction. Finally, (Calvaresi et al. 2021a) focus, among other aspects, on the facets of data protection and data privacy.

Fig. 10 Primary studies' objectives

5.6 Technology characterization

Studying SRQ6, we classified the primary studies according to their technology readiness level (European Commission 2017) (see Table 6). In turn, we analyzed the technologies, architectures, and design principles employed in the primary studies.

Assessing the TRL is a valuable way to measure the maturity of a technology/system. The scale was originally devised by NASA (Sadin et al. 1989) and is nowadays used in many areas in various forms. In this context, we rely on the definition provided by the European Commission for research and innovation projects (European Commission 2017), as shown in Table 6.

Table 6 Technology readiness levels according to the definition provided by (European Commission 2017)

The TRL distribution of the primary studies is depicted in Fig. 11. Most of the studies lie at Levels 3 and 4 (68.1%), meaning that their final outcome is either a non-validated prototype (TRL 3) or at the laboratory-test stage (TRL 4). Two studies (i.e., (Calvaresi et al. 2019) and (Calvaresi et al. 2021a)) are classified as TRL 5: they have been deployed and validated in real-world health- and social-related campaigns.

Fig. 11 Technology readiness level distribution of the primary studies

In addition to the TRL of each study, the front-end and back-end technologies applied in the presented systems were analyzed. All studies with a TRL of 3 or higher were considered. Figure 12 depicts the distribution of the back-end technologies used in the primary studies. The majority (38.7%) of the systems employ Java-based back ends. This prevalence can be related to the wide use of MAS frameworks such as JADE and MaSMT. For example, (Alencar and Netto 2014), (de M. Batista et al. 2009), and (Bentivoglio et al. 2010) rely on JADE, and (Hettige and K. 2015) implement their system on top of MaSMT. Although not relying on a pre-existing MAS framework, (Pilato et al. 2007) and (Tarau and Figa 2004) implemented their own ad-hoc Java-based systems. Moreover, (Estes 2011) exploits features of the Java Enterprise Edition platform (Java EE) to develop a chatbot system, and (Memon et al. 2018) use communication sockets of the Java Standard Edition (Java SE). Several studies use unconventional technologies to develop MAS. For example, (de Bayser et al. 2017) use Akka, an actor-based framework, and (Z. et al. 2016) rely on ActiveMQ, a multi-protocol messaging server.

Fig. 12 Overview of utilized back-end technologies

Python-based back ends account for 9.7% of the total. In particular, (Jiang et al. 2015) and (Calvaresi et al. 2019) developed ad-hoc systems, while (Calvaresi et al. 2021a) rely on the SPADE framework.

Several studies (9.7%) relied on existing proprietary systems. For example, (Kalia et al. 2017) and (Angara et al. 2017) rely on IBM Watson's Conversation Platform, and (Zolitschka 2020) relies on Aimpulse Spectrum.

A number of studies (9.7%) developed their back ends as ad-hoc solutions using JavaScript (i.e., (de Bayser et al. 2018), (Thosani et al. 2020), and (Bosse 2021)).

6.5% of the studies (i.e., (Tarau and Figa 2004) and (Bosse 2021)) implemented a Prolog-based back end. Finally, with a share of 25.8%, a substantial number of studies developed prototypes but failed to mention details about their back-end implementation. One such example is (Kökciyan et al. 2021): although the authors specify the human interface, they do not detail how the actual back end is implemented.

Fig. 13 Overview of utilized front-end technologies

Figure 13 displays the distribution of the front-end technologies used in the developed chatbot systems. Web-based technologies have received the most attention (31.3%), mostly using JavaScript or JavaServer Pages (JSP) in Java.

Using existing web/mobile messaging platforms is a choice made by 15.6% of the studies. In particular, (Calvaresi et al. 2019) rely on Facebook Messenger, (Calvaresi et al. 2021a) offer Telegram Messenger among the available interfaces, (Tarau and Figa 2004) use Yahoo Instant Messenger (deprecated since 2012), and (Bentivoglio et al. 2010) adopt Jabber.

The development of ad-hoc solutions accounts for 15.6%; the programming languages involved are Java (e.g., (Hettige and K. 2015) or (Tatai et al. 2003)), C#, and C++ (e.g., (Huang et al. 2008)).

6.3% of the elaborated solutions' front ends use cross-platform frameworks, which allow the same code base to serve both web and smartphone app development. For example, (Thosani et al. 2020) use Ionic, and (Calvaresi et al. 2021a) offer, among the possible interfaces, HemerApp, which is written in Flutter.

3.1% of the systems use an Android application as front end (e.g., (Kökciyan et al. 2021)).

Finally, 28.1% of the studies do not mention the technologies used in their solutions or provide only simplistic, non-classifiable descriptions. For example, (de Bayser et al. 2018) focus primarily on the conception of the back end without mentioning how their human-interfacing system was implemented.

5.7 Strengths of the primary studies

Referring to question SRQ7, the strengths of the primary studies are listed in Table 7. Overall, 22% of the strengths are classified as Y (explicitly defined and evaluated), 21% as P (implicitly stated), and 57% as N (not inferable) (see Fig. 14). Figure 15 shows the classification per strength.

Table 7 Strengths of the primary studies
Fig. 14 Overview of strength assessment according to the YPN classification. In particular, Y = the information is explicitly defined/evaluated; P = the information is implicit/stated; N = the information is not inferable

Fig. 15 Qualitative assessment of the strengths (Y-P-N criteria). S1: dynamic update of knowledge base; S2: adaptability to different domains; S3: profiling (according to user behavior); S4: personalization (according to user input); S5: reusability of components; S6: scalability; S7: performance

5.8 Limitations and solutions of the primary studies

Referring to questions SRQ8 and SRQ9, the limitations stated in the studies and their proposed solutions were analyzed. Table 8 lists all limitations acknowledged by the authors and the proposed solutions. Only five of the ten papers pointing out limitations propose solutions to address them; as an unfortunate habit, limitations are often overlooked. Among those that do mention limitations, two main categories can be identified: architectural and functional. Architectural limitations are of a technical nature and can be solved by changing the applied architecture or technologies. An example is (de Bayser et al. 2017), which reports performance problems when raising the number of participants in a chat group; to solve this problem, the authors suggest switching to a micro-service architecture. Another example is (Calvaresi et al. 2019), which emphasizes several limitations of the system architecture, specifically scaling issues with more complex behaviors, a lack of standardized inter-agent communication, and no means of integrating third-party data-analysis tools. The solution to these limitations is an entirely new platform based on a MAS. Functional limitations are issues at the functional level that can usually be overcome by exploring alternative approaches to a problem. Examples are (Hettige and K. 2015) and (Jiang et al. 2015), both of which mention limitations related to semantic processing. (Hettige and K. 2015) propose updating the corresponding subsystem, while (Jiang et al. 2015) propose analyzing the user input with domain-independent analyzers (e.g., linguistic or keyword analysis).

Table 8 Study limitations and proposed solutions

5.9 Future challenges stated in the primary studies

Concerning SRQ10, given the heterogeneous perspectives of the primary studies, the stated future challenges are rather disparate. However, they can generally be divided into three categories:

  • System-related challenges relate to extending already existing functionalities.

  • Functionality-related challenges refer to new functionality to be implemented.

  • User-related challenges refer to collecting user experiences (usually in the form of trials).

The studies were analyzed for these three categories. Figure 16 shows the breakdown of the three categories across all studies. With 57.9%, most studies aim to enhance their current system’s stability or expand already implemented functionalities. For example, (Shashaj et al. 2019) see improving the system component stability and interoperability with other FIPAFootnote 25-compliant MAS environments as a future goal, whereas (Calvaresi et al. 2019) wish to adapt their architecture to allow distributed computing among several servers to increase performance and to handle agent migration from one server instance to another. A complete list of system-related challenges can be seen in Table 9. At 28.9%, about one-third of the studies endeavor to add new functionalities to their existing systems. (Vasconcelos et al. 2017) aim to implement additional metrics to test more aspects of a chatbot system, and (Memon et al. 2018) seek to expand their chatbot with a graphical user interface and to extend its user input capabilities with voice recognition and interpretation. All functionality-related future challenges are listed in Table 10. Finally, 13.2% of the future challenges focus on capturing user feedback. (Alencar and Netto 2014) seek to test their tutoring system with the help of students and to improve it based on the collected feedback, while (Kökciyan et al. 2021) are conducting two pilot studies with patients to test different aspects of their system. Table 11 lists all user-related challenges stated in the primary studies.

Fig. 16
figure 16

Distribution of future challenges per category

Table 9 Future challenges: system-related
Table 10 Future challenges: functionality-related
Table 11 Future challenges: user-related

6 Discussion

Analyzing the primary studies, it emerges that the adoption of the MAS paradigm has increased over the past twenty years, although only moderately. The elaborated works acknowledge the suitability and the intrinsic added value of agent-based systems, including autonomy, goal-setting, and behavior definition. Nevertheless, these technologies appear to be mostly at an early stage of development. On the one hand, the TRL of most primary studies did not exceed level 3 or 4 (as shown in Fig. 11), and it is questionable whether these early-stage systems would be capable of meeting the requirements of a real-world scenario. On the other hand, a few systems have been studied in real-world scenarios (i.e., Calvaresi et al. (2021a), testing the developed chatbot in a physical balance-preserving campaign, and Kökciyan et al. (2021), letting both experts and real users analyze the system). However, such systems still remain to be tested in fully operational environments.

Several studies focused on aspects revolving around the management and reconciliation of different knowledge bases. However, only one (Calvaresi et al. 2021a) has directly addressed the topic of data privacy and user consent. This remains a pressing concern that practitioners must address. Indeed, too many studies dealing with topics such as user profiling and the processing of user input to enhance chatbot knowledge have either ignored data privacy or not tackled it explicitly. Whenever people are involved, it is of paramount importance to ensure their control over their data. With the enforcement of stricter data privacy laws such as the GDPR, next-generation systems have no room left to neglect this topic.

The analysis of the technologies’ distribution within the primary studies reveals several trends. Figure 17a shows the back-end technologies used over the years. Java-based systems have been used extensively. However, since 2015, Python-based systems have emerged, likely due to Python’s prevalence in machine learning and data science. Moreover, since 2017, proprietary platforms (e.g., IBM Watson) have been increasingly considered. Although initially rather rudimentary, such platforms now offer a wide range of possibilities, such as integrating machine learning modules or extensive analytical capabilities. Figure 17b shows that a shift occurred in the area of front-end technologies too. In addition to the increasing prevalence of web-based solutions, messaging services such as Facebook Messenger or Telegram have become increasingly popular since 2015. Nevertheless, in recent years, the use of cross-platform frameworks has become a consistent practice. Cross-platform frameworks such as Ionic or Flutter make it possible to develop front-end solutions for mobile phones and web browsers from a single code base. Moreover, a trend can be observed toward more complex multi-agent chatbots (e.g., Bosse 2021) that blend the IoT and micro-service domains with highly scalable multi-agent chatbot networks.

Fig. 17
figure 17

MAS-based chatbot technologies over the years

Most studies have used MAS to let agents abstract individual components such as language processing or output composition. (Calvaresi et al. 2019) and (Calvaresi et al. 2021a) have taken a different approach by coupling the users themselves with personalized agents. According to these studies, the goal of this 1:1 relation is to facilitate user profiling, data management, privacy preservation, and personalization. Indeed, by interacting with the user, the respective agent is expected to increase its knowledge and enhance the accuracy of its personalization over time.
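The 1:1 user-agent coupling described above can be illustrated with a minimal sketch. The class names, the token-counting profile, and the registry are illustrative assumptions for this survey's discussion, not the implementation of the cited studies:

```python
class PersonalAgent:
    """Hypothetical per-user agent that accumulates a profile over time."""

    def __init__(self, user_id: str):
        self.user_id = user_id
        self.profile: dict[str, int] = {}  # knowledge gathered from past interactions

    def handle(self, message: str) -> str:
        # Record observed terms so personalization can improve with each exchange.
        for token in message.lower().split():
            self.profile[token] = self.profile.get(token, 0) + 1
        return f"[agent of {self.user_id}] ack: {message}"


class AgentRegistry:
    """Maintains the 1:1 mapping between users and their personal agents."""

    def __init__(self):
        self._agents: dict[str, PersonalAgent] = {}

    def agent_for(self, user_id: str) -> PersonalAgent:
        # Lazily create one dedicated agent per user; reuse it afterwards.
        if user_id not in self._agents:
            self._agents[user_id] = PersonalAgent(user_id)
        return self._agents[user_id]


registry = AgentRegistry()
agent = registry.agent_for("alice")
agent.handle("I like hiking")
agent.handle("hiking trails near me")
# The same agent instance serves the user across interactions,
# so its profile (here, term frequencies) grows over time.
```

Keeping the agent (and its profile) as the single custodian of a user's data is what, in those studies, is argued to ease privacy preservation and consent management.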

Looking at the evaluation of the strengths of the primary studies in Fig. 14, it is noticeable that S2 (i.e., adaptability to different domains) and S6 (i.e., scalability) have an above-average number of implicitly defined and evaluated strengths. In the case of S2, this is primarily because studies justified their system’s adaptability with the implementation of a single case study, concluding that the system can also be applied to other domains. This is not necessarily a wrong assumption, but implementing several distinct scenarios would have demonstrated this strength more convincingly. Compared to S2, S6 is a more generic strength. Since most of the studies are at an early prototype stage, even when a system’s scalability was reported as a strength, it was mostly not evaluated. This raises the question of which methods can be used to evaluate a chatbot platform’s scalability. All studies use the term scalability as a synonym for size scalability as defined by (Neuman 1994): a system scales easily with the number of users and resources without noticeable loss of performance. To evaluate this aspect, a load test with several simulated users, analyzing the system’s response times and hardware load, could theoretically be sufficient.
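As a rough illustration of such a size-scalability check, the following sketch simulates concurrent users against a stub chatbot and reports latency statistics. The endpoint stub, user counts, and reported metrics are hypothetical assumptions; a real test would target the deployed system over its actual interface and also record hardware load:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def query_chatbot(message: str) -> str:
    """Stub standing in for a real chatbot endpoint (e.g., an HTTP call)."""
    time.sleep(0.01)  # simulated processing latency
    return f"echo: {message}"


def load_test(num_users: int, requests_per_user: int) -> dict:
    """Simulate concurrent users and collect per-request response times."""

    def user_session(user_id: int) -> list[float]:
        times = []
        for i in range(requests_per_user):
            start = time.perf_counter()
            query_chatbot(f"user {user_id}, msg {i}")
            times.append(time.perf_counter() - start)
        return times

    with ThreadPoolExecutor(max_workers=num_users) as pool:
        sessions = list(pool.map(user_session, range(num_users)))
    latencies = [t for session in sessions for t in session]
    return {
        "requests": len(latencies),
        "mean_s": statistics.mean(latencies),
        "p95_s": statistics.quantiles(latencies, n=20)[-1],  # ~95th percentile
    }


# Size scalability (in Neuman's sense) holds if latency does not degrade
# disproportionately as the simulated user population grows.
baseline = load_test(num_users=2, requests_per_user=5)
scaled = load_test(num_users=20, requests_per_user=5)
print(baseline)
print(scaled)
```

Comparing the two latency distributions (rather than a single run) is what makes the result a statement about scalability and not just about raw performance.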

7 Conclusions

This paper has analyzed the current state of the art of chatbot solutions leveraging the multi-agent approach and agent-based frameworks by performing an SLR. In particular, it employs a well-established methodology characterized by ten structured research questions. The investigation focused on aspects including application domains, end-users, requirements, objectives, technology readiness level, designs, strengths, limitations, and future challenges of the solutions found in the literature. Such aspects have been analyzed per feature and then aggregated in a reconciling discussion. The insights elicited in this work can be beneficial for both theoretical and practical future research.