1 Introduction

This work focuses on a linguistics-based methodology to select and assess arguments in an argumentation-based dialogue (ABD) scenario. According to past research on ABD, argumentation takes into account both agent-related and dialogical aspects. In fact, ABD deals with phenomena that depend on the dynamic exchange of information, which can vary according to turns and participants. For the specific case of argumentation in dialogues, although it has long been studied, there is still no shared theoretical framework to manage it (Prakken 2018). Nevertheless, one of its most popular fields of application is represented by Conversational Recommender Systems (CoRS).

In recent years, CoRS have gained significant attention for their ability to provide personalised recommendations in a natural and interactive manner (Pramod and Bafna 2022; Jannach et al. 2021). These systems engage users in dialogues, understanding their preferences and constraints, and offering recommendations that align with their individual needs. Nevertheless, one important aspect emphasised by several scholars of argumentation theory lies in the assessment of the quality and effectiveness of the arguments (Macagno 2016, 2022). Since this aspect has not been widely investigated in CoRS, we introduce a novel methodology for evaluating the quality of argumentation dialogues in recommendation scenarios within the framework of what we call argumentative conversational recommender systems (A-CoRS). Differently from purely data-driven approaches, our cross-disciplinary approach is fully explainable and goal-oriented, while still being domain independent. The linguistics-based computational model implements an exploration–exploitation mechanism based on mathematical measures corresponding to cognitive pragmatics concepts, so that at each step it is always possible to investigate the reasons behind a system dialogue move. Finally, a low-cost procedure is needed for a first evaluation of such models; in our case, this consists of synthetic dialogues evaluated by human judges for naturalness and efficacy.

The definition of this methodology is founded on specific starting hypotheses that offer a comprehensive framework for investigating the dynamics of argumentative dialogue and its potential applications.

H 1

Argumentative dialogue can be primarily described using pragmatic features. This hypothesis postulates that the essence of argumentative dialogue can be effectively captured by pragmatic features, emphasising the importance of context, intention, and meaning in such interactions. For this reason, we present a theoretical model of argumentation-based dialogue, tested on the specific case of argumentative conversational recommender systems (A-CoRS). This model is based on cognitive principles such as relevance, importance, credibility, and (un-)likeability.

H 2

It is possible to find mathematical descriptors for the features identified by H1. This hypothesis suggests that these abstract features can be quantified through mathematical descriptors. This concept bridges the gap between the qualitative nature of communication and quantitative metrics, potentially leading to a deeper understanding of argumentative discourse. Our model is implemented by combining different AI tools that together represent a) the system’s domain (i.e. a graph knowledge base), b) its decision-making capabilities, regarding the elements to be explored and proposed, and c) the priority of the communicative strategies to be adopted. Decision-making based on mathematical descriptors correlating with linguistic measures must, of course, be deployed on a specific task. The movie recommendation task is widely used to develop and test this kind of system, and data availability is extensive, so the domain is appropriate for the proposed tests.

H 3

Simulated users can be used to evaluate dialogue systems that apply the aforementioned mathematical descriptors. This represents the main investigation point of this work, leveraging the argumentative infrastructure designed to investigate H1 and H2. Specifically, H3 proposes that these mathematical descriptors can be harnessed as parameters for the evaluation of dialogue systems. More specifically, these evaluation parameters can be applied to simulated dialogues as a preliminary trial ground. They collectively pave the way for a systematic exploration of argumentative dialogues. We propose an evaluation protocol assessing the plausibility and effectiveness of the system’s argumentation strategies in guiding users towards optimal choices. It is important to note that plausibility does not guarantee the truth or correctness of an argument. A plausible argument may still be incorrect or incomplete. Plausibility is a subjective assessment that depends on the evaluation of the dialogue participants or the audience. To facilitate this task for the human judges, we employ the movie recommendation task, which can reasonably be proposed for evaluation in this sense.

The evaluation process adopted for H3 involved generating dialogues between the system and a population of simulated users, in which the system employed theory-based argumentation strategies to support recommendations. The simulator implements the system’s theory-based strategies, generated by considering the beliefs collected from personal experience (i.e. the knowledge base) and from the user’s feedback. It can also refer to the preferences expressed by simulated users. To generate responses from simulated users, we utilised the Movielens dataset (Harper and Konstan 2015), a widely used and comprehensive movie rating database. The dataset provides a rich source of information, including user preferences, movie ratings, and contextual information. This enabled us to simulate plausible conversations between simulated users and the recommender system. The resulting conversations are then manually evaluated with crowd-sourcing on multiple aspects: (i) naturalness, concerning dialogue life-likeness and the perceived recommender expertise; (ii) plausibility, in terms of appropriateness of the selected arguments during both the information-seeking and recommendation phases. The dialogues were evaluated together with data collected from a control group composed of two subsets: a positive subset consisting of real human–human examples, and a negative subset consisting of dialogues generated with random argumentative strategies. This allows us to quantify the effectiveness of the argumentation and the plausibility of the system’s recommendations. Our results confirm that, from the perspective of functional aspects, both the source theory and its implementation are valid. Conversely, social aspects appear not to be captured by the current version of the simulator. This is reasonable, as these aspects were not explicitly considered by the implementation. This validates the methodology and will allow its use in future experiments in which the theoretical model will include social aspects. Moreover, the results obtained in a simulated environment motivate the future organisation of more time-consuming human–machine interaction experiments with real users.

The paper is organised as follows: Sect. 2 provides a general overview of how Conversational Recommender Systems have been studied in the literature. To cover the general problem of ABD, Sect. 3 provides a comprehensive review of related work in the field, with a specific focus on argumentative conversational recommender systems. Using a mathematical interpretation of this theoretical basis, Sect. 4 presents our proposed model of argumentation-based dialogue (H2) and its technological implementation (H1, H2). Additionally, it describes an evaluation protocol based on simulated dialogues for preliminary validation of argumentation dialogue management (H3). Section 5 presents the experimental setup and the adopted evaluation procedures. Finally, Sect. 6 presents the results of our evaluation, followed by a discussion in Sect. 7.

2 Related work

Unlike traditional approaches to item recommendation, like the popular Matrix Factorization (Rendle 2010), recent developments in natural language processing have been consistently applied to conversational technology for recommender systems (Zhang et al. 2018; Lei et al. 2020). Past approaches have mainly concentrated on the problem of using machine learning to derive the policy to explore the domain of interest so that a recommendation can be produced. The capability of such systems to ask the right questions has been at the basis of this interest in the scientific community (Zou et al. 2020).

Graph-based models, in particular, have gained significant attention in the field of CoRS due to their ability to capture complex relationships between items and users, and to the enhanced expressiveness and explainability they can offer. Graphs can be used to represent the ongoing dialogue and relate it to the knowledge base to extract the appropriate features to explore, depending on their connectivity in the current graph configuration (Di Maro et al. 2021; Origlia et al. 2022b). Similar path reasoning strategies have already been investigated in other studies (Lei et al. 2020), where recommendation dialogue was treated as an interactive path reasoning problem on a graph. According to their conversational path reasoning framework, by leveraging the graph structure, irrelevant candidate attributes can be effectively pruned, thereby increasing the chances of identifying user-preferred attributes. Conversational path reasoning is also at the basis of the approach presented in (Deng et al. 2021), where the proposed dialogue system iteratively explores the knowledge domain using graph convolutional networks to learn recommendation policies. While this approach provides an elegant framework to integrate structured knowledge and probabilistic approaches, it does not offer a way to interpret the decisions made by the system at each step: a problem common to most approaches relying heavily on machine learning.

This work is framed within a research program aiming to combine different artificial intelligence models, providing an interpretable, theoretically motivated framework for the management of conversational recommendations, including argumentation. In Sect. 4, we present the technological implementation of one of the core components of such a system, managing the exploration/exploitation cycle while collecting users’ preferences and using them to support claims.

Our main contribution in this work consists of a general methodology to manage CoRS through an interpretable policy that is not learned from the data but is rather derived from formal theories proposed in the literature to describe the problem of assuming a stance and providing arguments to support it.

3 Theoretical background

In this work, we present a system for ABD leveraging a linguistics-based theoretical background. Specifically, we investigate this topic by concentrating on argumentative conversational recommender systems (A-CoRS). In general, CoRS are acquiring a fundamental role in information seeking and retrieval to point users to potential items of interest (Jannach et al. 2021). Recommendation dialogues are characterised by two or more participants disclosing their preferences and making recommendations to satisfy the requirements retrieved during the communicative exchange. CoRS, in the same way, aim at finding or recommending the most relevant information (e.g. web pages, answers, movies, products) for users based on textual or spoken dialogues, through which users can communicate with the system more efficiently using natural language conversations (Fu et al. 2020). CoRS, being typically equipped with some kind of strategy to collect information and support recommendations with rich natural language capabilities, may be developed in the frame of formal argumentation and, more specifically, using the ABD framework. For this reason, we refer here to A-CoRS.

The recommendation task, with its inherent dialogical structure and goals, represents a relevant case study to test a theoretical model of argumentation, particularly concerning the selection of arguments supporting claims. The recommendation task, in fact, tends to present a clear dialogical pattern structured in two phases, exploration and exploitation (E&E). These can be viewed as two types of dialogues embedded into each other and are the basis on which the proposal of this work is modelled. According to Gao et al. (2021, p. 15), with exploration “[...] the system takes some risks to collect information about unknown options”. On the other hand, during the exploitation phase, “[...] the system takes advantage of the best option that is known”. In this type of interaction, there are different goals that keep changing according to the dialogue state. The main goal of a recommendation dialogue is to have the interlocutors, during the exploitation phase, agree over the selection of a specific item, due to its supporting arguments. The need to select these arguments constitutes a secondary goal pursued during the exploration phase, while building the common ground (Clark 1996).

The exploitation phase can be seen as a ground-level dialogue (Krabbe 2003, p. 83). If the recommendation is not accepted, a shift into the exploration phase occurs. In this case, participants may move to a meta-dialogue, i.e. a secondary dialogue on whether the moves in the first dialogue can be judged as correct or not by some criteria (Krabbe 2003; Macagno and Bigi 2020). A meta-dialogue’s primary purpose, thus, is to help the first dialogue achieve its end successfully. For this reason, the information collected during the exploration phase is used to support the proposals made in the exploitation phase by relying on the argumentative features typical of an explicative/negotiating CoRS.

The joint purposes of a dialogue, namely the interlocutors’ generic “we-intentions” of pursuing a joint activity (Searle and Willis 2002), were classified by Walton in seven “types of dialogue”, namely persuasion, negotiation, inquiry, discovery, deliberation, information seeking, and eristic (Walton and Krabbe 1995; Walton 1998; Macagno 2008; Macagno and Bigi 2020). This typology represents the most common and generic goal-oriented types of dialogical interactions (Dunin-Keplicz and Verbrugge 2001; McBurney and Parsons 2009). These types of dialogues described by Walton show how arguments become essentially intertwined with the dialogical dimension, defined by the dialogical goal that the user proposes through their argument. Furthermore, an argument has a pragmatic dimension since it is part of a dialogue and is grounded in accepted inferential rules, according to the common ground shared between the participants. Thus, this dimension mainly refers to how an argument is related to the individuals involved (Kecskes 2014) in terms of their wills, beliefs, and commitments. This highlights how the acceptability of the arguments is strictly correlated with the shared information but also with the amount of good evidence provided in support of them, framing the evaluation of the arguments in the epistemic dimension as well (Macagno 2022).

Starting from the presented theoretical background and focusing on our specific case of recommendation tasks, we addressed the dialogical goals and considered the deliberation process occurring in the exploitation phase. The deliberation dialogue is a collaborative type of dialogue in which parties collectively steer actions towards a common goal by agreeing on a proposal that can solve a problem affecting all of the concerned parties, taking all their interests into account (Walton 2019). Argumentation in deliberation primarily involves identifying proposals and the arguments supporting them, as well as finding critiques of other proposals (Walton 2010). Regarding the exploration phase, on the other hand, the type of dialogue we take into account could be the information-seeking one, since it is described according to an initial situation in which one party lacks the information that is known by the other, and the goal of the former is to request information from the other (Walton and Krabbe 1995). However, since in real dialogues the situation of lack of knowledge is not stable (i.e. new information can modify the epistemic status and the type of information needed by the participants), a dialogue move can be considered more generally as aimed at information sharing, namely providing, requesting, offering information (Macagno and Bigi 2017, 2020).

The evaluation parameters of A-CoRS, thus, should not only focus on the quality of the item proposed in the exploitation phase but also consider the features collected and shared by the user during the information-sharing phase. Other studies, such as (Lamche et al. 2014; Millecamp et al. 2019), also investigate the impact of the use of different feature-based explanations on users’ perception of a recommender system. The selection of the most appropriate items and features is pivotal for achieving the conversational goal. In this context, this can be accomplished through the collection of beliefs during the exploration phase. Recovering relevant information during this phase is fundamental, so we need to adopt a strategy to collect feedback about candidate system beliefs.

For this reason, we need a theoretical model that allows the exploration of relevant data, selecting them as beliefs, for the system to succeed in the process and confirm H1. Paglieri and Castelfranchi (2005b), in this regard, presented an alternative model of belief revision (Data-oriented belief revision, DBR) where a selection over such data is performed to determine the subset of reliable information (i.e. beliefs) and their degree of strength. Moreover, the authors state that Toulmin (2003)’s model, one of the most influential schemas of argumentation, “[...] is liable for immediate implementation in our model of belief revision, since it defines a specific data structure” (Paglieri and Castelfranchi 2005b, p. 9) to represent the reasons that determine the construction and revision of beliefs. This epistemic-driven process is in line with the theory, as argumentation is mainly concerned with the manipulation of reasons in order to change the audience’s beliefs. According to the authors, previous models of belief revision fail to integrate with argumentation theories. In particular, the AGM paradigm (Gärdenfors 1988) deviates from argumentation for two main reasons: (i) it does not make any prediction about why a proposition should be believed; (ii) there is no characterisation of the epistemic states. In their model, on the other hand, data are selected as beliefs on the basis of their properties, i.e. the possible cognitive reasons to believe such data. More specifically, their theoretical model accounts for four distinct properties of data, which are as follows:

  • Credibility a measure of the number and values of all supporting data, contrasted with all conflicting data, down to external and internal sources;

  • Importance a measure of the epistemic connectivity of the datum, i.e. the number and values of the data that the agent will have to revise, should he revise that single one;

  • Relevance a measure of the pragmatic utility of the datum, i.e. the number and values of the (pursued) goals that depend on that datum;

  • (Un-)Likeability a measure of the motivational appeal of the datum, i.e. the number and values of the (pursued) goals that are directly fulfilled by that datum.

The authors pointed out that credibility, importance, and likeability determine the outcomes of belief selection, i.e. whether a candidate datum is to be believed or not, and with which strength. Meanwhile, relevance is crucial in pre-selecting the subset of active data (focusing), i.e. determining which information in the agent’s database is useful/appropriate for the current task and should therefore be taken into consideration as candidate beliefs (Paglieri and Castelfranchi 2004).

An important aspect for the selection of this model is also the measurability of the features described, which guided us in the verification of H2. In the next section, we will focus on how the theoretical model motivated the definition of the computational approach and how this model will be validated through it, in order to answer H3.

4 A model for argumentation-based dialogue management

The first part of this Section consists of the definition of a CoRS with argumentative capabilities (A-CoRS), representing the module to be tested. The implementation of the dialogue management module is based on a graph database (GD) hosting a knowledge base of common facts collected from Linked Open Data sources using the procedure described in Origlia et al. (2022b).

Next, graph analysis allows finding regularities that can be exploited to form new background theories and to support technological approaches built upon them. Among the various measures, we adopt a network analysis procedure based on the HITS algorithm (Kleinberg 1999), which attributes authority and hub scores to the nodes. Respectively, these measures indicate how many nodes with high hub scores point towards the considered node and, symmetrically, how many authoritative nodes can be reached from the considered node. This step, combined with the analysis of human–human dialogues, provides indications about how the disambiguation of nodes in the network should be prioritised. For our case study on the movies domain, other information is represented on the basis of sources like Movielens, to estimate priors concerning how popular movies are, or IMDB for catalographic information. For details about the structure of this knowledge graph, the reader is referred to Origlia et al. (2022b). The knowledge base thus obtained is queried by the A-CoRS to extract, at each step, a subset of relevant data, with respect to the beliefs the system has about the user, on which it reasons.
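To make this step concrete, the following sketch (in Python, using networkx) computes hub and authority scores on a toy fragment of a movie knowledge graph; the nodes, the edges, and their orientation are illustrative assumptions and do not reproduce the actual schema of Origlia et al. (2022b), where the scores are computed on the Neo4j knowledge base.

    import networkx as nx

    # Toy fragment of a movie knowledge graph (illustrative nodes only).
    G = nx.DiGraph()
    G.add_edges_from([
        ("Harrison Ford", "Raiders of the Lost Ark"),
        ("Harrison Ford", "Blade Runner"),
        ("Steven Spielberg", "Raiders of the Lost Ark"),
        ("Ridley Scott", "Blade Runner"),
        ("Raiders of the Lost Ark", "Adventure"),
        ("Blade Runner", "Sci-Fi"),
    ])

    # HITS attributes a hub and an authority score to every node: a good
    # hub points to many good authorities and vice versa (Kleinberg 1999).
    hubs, authorities = nx.hits(G, max_iter=500, normalized=True)
    for node in G:
        print(f"{node:25s} hub={hubs[node]:.3f} auth={authorities[node]:.3f}")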

Plausibility can be defined as the degree of connectivity or effectiveness of an argument within the dialogue. It assesses how relevant and believable a new argument is, based on the quality of the new data and also on its connection with data already available to that user. Specifically, a plausible argument is one that appears to be well-supported, logical, and consistent with the available information and common ground (Paglieri and Castelfranchi 2004). In our proposal, as shown in Fig. 1, we mapped the theoretical concepts of data plausibility onto numerical aspects computed over the graph database structure, such as authority and hub scores (Fig. 1b). As previously mentioned in Sect. 1, a crucial factor in determining whether a new piece of information will be accepted or rejected as a belief is the degree of connectivity of the new datum in the user’s background knowledge, which thus defines its level of plausibility. According to Paglieri and Castelfranchi (2004)’s data networks, there are two cases of argumentation through plausibility: (i) self-evident data, which have a large number of data connections supporting them; (ii) explanatory data, which are in turn connected to many other data to support them, as shown in Fig. 1a. At the same time, from a mathematical point of view, a datum is considered authoritative when it has several data supporting it (i.e. a self-evident datum), whereas a datum has a high hub score when it supports other authoritative data (i.e. an explanatory datum), as shown in Fig. 1b. Hubs and authorities represent what Kleinberg (1999) called a mutually reinforcing relationship, meaning that a good hub is a node that points to many good authorities, while a good authority is a node that is referred to by many good hubs.

Fig. 1
figure 1

The structural appearance of plausible arguments (a) and a densely linked set of hub (cyan) and authoritative nodes (orange) (b). Nodes with a high hub score refer to nodes with a high authority score, while nodes with a high authority score are referred to by nodes with a high hub score. These scores provide insight over the relevant parts of the graph structure. The image shows how the structure of the cognitive model for data representation in (a) can be easily mapped on the network analysis concepts presented in (b). (Color figure online)

The proposed system selects data based on four measures corresponding to cognitive properties. The system maps these properties onto numerical aspects computed over the graph structure (a code sketch of this mapping follows the list):

  • Credibility corresponds to the authority score since it can be associated with “a measure of the number and values of all supporting data” (Paglieri and Castelfranchi 2005b). In fact, an authority node has a fundamental role in the graph since a solid number of hub nodes support its validity;

  • Importance corresponds to the hub score, considered as “a measure of the epistemic connectivity of the datum” (Paglieri and Castelfranchi 2005b), since a node with a high hub score points towards many authoritative nodes;

  • Relevance corresponds to the BN’s entropy, since it can be mapped onto “a measure of the pragmatic utility of the datum” (Paglieri and Castelfranchi 2005b), that is, a value that allows the system to decide which relevant but uncertain data need feedback for the dialogue interaction to continue;

  • (Un-)likeability corresponds to the system beliefs (hard evidence) involved in the selection of the feature, since it expresses the user’s appeal towards that kind of data.
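As an illustration of this mapping, the sketch below gathers the four measures for a single graph node. All names are hypothetical placeholders for values produced by the HITS step, the Bayesian inference step, and the system’s belief store.

    import math

    def normalized_entropy(probs):
        # Shannon entropy of a discrete distribution, scaled to [0, 1]:
        # the 'relevance' measure, marking data the system is most
        # uncertain about and should seek feedback on.
        h = -sum(p * math.log2(p) for p in probs if p > 0)
        return h / math.log2(len(probs)) if len(probs) > 1 else 0.0

    def cognitive_scores(node, hubs, authorities, posterior, positive_beliefs):
        # Map one node onto the four cognitive properties of
        # Paglieri and Castelfranchi (2005b). `posterior` is the node's
        # rating distribution after inference; `positive_beliefs` counts
        # hard evidence involving the node (hypothetical structures).
        return {
            "credibility": authorities[node],             # supporting data
            "importance": hubs[node],                     # epistemic connectivity
            "relevance": normalized_entropy(posterior),   # pragmatic utility
            "likeability": positive_beliefs.get(node, 0), # motivational appeal
        }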

Leveraging the above-described measures, the system is able to explore the knowledge base and select the most appropriate items to recommend, using them or their features to support its decision. The mapping between cognitive properties and their mathematical interpretation constitutes the basis on which the presented computational model has been built. Given this computational model, we will now describe how the simulator adopted to test it has been built.

4.1 A hybrid architecture for dialogue management

The presented approach is based on the system architecture assumed by the Framework for Advanced Natural Tools and Applications with Social Interactive Agents (FANTASIA) (Origlia et al. 2019, 2022a), summarised in Fig. 2.

Fig. 2
figure 2

The FANTASIA architecture. The system organisation presented here reflects the system that is simulated in this work. The future implementation of the full system will be based on this framework to keep consistency between simulated and real interactions

FANTASIA is a plugin for the Unreal Engine designed to support the development of Embodied Conversational Agents, and it is freely available on GitHub. The Unreal Engine is one of the most powerful tools to implement Real-Time Interactive 3D (RTI3D) experiences, and the advent of Metahumans, developed by Epic Games, has significantly increased the potential of Virtual Humans acting as dialogical interfaces.

While most present approaches heavily rely on Large Language Models (LLMs), FANTASIA enables the integration of multiple AI tools into the efficient, real-time infrastructure of the Unreal Engine. Among its features, FANTASIA provides a connector to run Neo4j queries from within the Unreal Engine and functionalities to dynamically assemble and query Bayesian Networks. In principle, dialogue systems based on FANTASIA are designed to keep knowledge representations, decision models, and logical capabilities outside the domain of LLMs, which are, instead, used as extremely powerful natural language generators acting on directives generated by other AI modules. From a design point of view, a Conversational AI built with FANTASIA follows these main principles:

  • Behaviour Trees (Flórez-Puga et al. 2009), implemented by the Unreal Engine, are used to organise and prioritise sub-tasks. For example, they may hierarchically structure the sequence of checks needed to generate clarification requests;

  • Graph databases are used for knowledge representation and dialogue state tracking. Combining the representation of the knowledge domain with the way people are referring to it is used to extract relevant sub-parts of the available knowledge on which to reason to produce the next system utterance;

  • Bayesian Networks, implemented using the aGRuM library (Ducamp et al. 2020), are used as decision models, to estimate what the most useful system action is, depending on the target goals. Bayesian Networks and their variants, like Influence Diagrams, can be dynamically assembled based on knowledge sub-graph structures;

  • Large Language Models are used to verbalise the decisions taken by probabilistic graphical models.

While most current research attempts to introduce graph structures inside LLMs to improve their reliability, FANTASIA keeps graph-based knowledge and functionalities separate and exploits the structural similarity between graph structures and probabilistic graphical models to dynamically assemble decision systems. In very loose terms, and only to provide a general idea of the approach, FANTASIA-based dialogue systems use graph databases to simulate long-term memory, and probabilistic graphical models assembled from relevant sub-graphs as working memory to make decisions. Template-based decisions are then verbalised using LLMs.
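As a minimal sketch of this working-memory idea, the following code assembles a small Bayesian Network from part_of edges and queries it under soft evidence using pyAgrum, the Python interface of the aGRuM library. The edges, the 5-point appreciation domain, and the randomly generated CPTs are illustrative assumptions: the actual system derives both structure and distributions from the knowledge graph and Movielens.

    import pyAgrum as gum

    # part_of edges extracted from a (hypothetical) relevant sub-graph.
    edges = [("SciFi", "Blade_Runner"), ("Harrison_Ford", "Blade_Runner")]

    bn = gum.BayesNet("working_memory")
    added = set()
    for src, dst in edges:
        for name in (src, dst):
            if name not in added:
                bn.add(gum.LabelizedVariable(name, name, ["1", "2", "3", "4", "5"]))
                added.add(name)
        bn.addArc(src, dst)
    bn.generateCPTs()  # random placeholder CPTs; the real model uses aggregators

    ie = gum.LazyPropagation(bn)
    # Soft evidence: a likelihood over ratings, e.g. estimated from
    # Movielens via kernel density estimation.
    ie.setEvidence({"Harrison_Ford": [0.05, 0.05, 0.2, 0.4, 0.3]})
    ie.makeInference()
    print(ie.posterior("Blade_Runner"))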

Although we do not directly use FANTASIA in this work, the components used in the simulator we present, and their organisation, follow the FANTASIA architecture and its design principles. This ensures that the results obtained with the simulator can be transferred to a full dialogue management architecture based on FANTASIA. For a general overview of FANTASIA, the reader is referred to the original papers. Specific details on how the simulator adapts these principles to facilitate preliminary evaluations of the model’s efficiency are presented in the rest of this section.

4.2 A-CoRS simulator

To conduct an initial evaluation of the described hybrid model, we implemented a dialogue simulator to generate synthetic exchanges between a recommender and a seeker in the movies domain. Although simulated dialogues are inherently less informative than real interactions between human users and a machine, they are more convenient in this phase of our research for developing and testing the theory before involving human users. A dialogue simulator has the advantage of highlighting major flaws in the theoretical model: by generating clearly unacceptable dialogues and providing a clear interpretation of the causes of the failure, it makes it possible to refine the theoretical model before involving human users, which is a costly activity. Building dialogue simulators is a common practice in the development of dialogue systems and provides the same advantages that classic approaches to AI development used to provide, for example, with the use of inferential engines and expert systems: they allow refining the theory by studying the characteristics of the unwanted items the theory admits in its iterative formulations. In this Section, we first provide a general overview of the simulator’s architecture and then the implementation details of each module.

The dialogue simulator is composed of the conversational recommender AI, implemented on the basis of the previously described principles, and a batch of simulated users, acting on the basis of a probabilistic model that approximates the expected behaviour of real users. While the simulator generates template sentences, the GPT-4 model from OpenAI was used to rephrase the template sentences to improve variability and contextual cohesion. This indicates our position with respect to the role of Generative AI for dialogue systems: we advocate that this kind of technology can be used very effectively to solve the problem of natural language generation, while decision-making should be delegated to other kinds of AI techniques.

The communication between the system and the simulated user is template based, which means that both the natural language understanding (NLU) and natural language generation (NLG) modules make use of regular expressions to decode the intentions of the interlocutor. This is useful to exclude interpretation problems from the simulated environment, where the dialogue management strategy is the focus of interest. When the user produces an utterance, the system NLU module decodes it and the dialogue state tracking (DST) module updates the graph structure so that the dialogue evolution is related to the domain knowledge. Given the updated graph, the system extracts a sub-graph of relevant items on which to reason. This sub-graph is composed of a set of items that can potentially be recommended, their most important features, and another set of secondary items sharing the selected features with the first set. The structure connecting these nodes, composed of part_of relationships, allows the system to convert it into a BN by replicating the orientation of the arcs. The system then applies evidence to the BN, given both population and dialogue-specific data, so that it can be queried to estimate what the most useful dialogue move is (exploration or exploitation). The generated system move, in the form of a template utterance assembled by the system NLG module, is passed to the model simulating the user. More details about this process are provided in the rest of this Section.

The simulated user model uses its own NLU module to interpret the system move and generates a user reaction given the data coming from a randomly selected user in the Movielens dataset. This way, the simulated user acts on the basis of the information collected about real people. An overview of the simulator’s architecture, including the role of LLMs, is shown in Fig. 3.

Fig. 3
figure 3

The architecture of the simulator. The system and the simulated user communicate using template sentences, since the focus is on dialogue management rather than understanding. The resulting interaction is, then, passed to a Large Language Model (ChatGPT-4) to generate a more natural conversation, which can be evaluated by human users. The parts where the two models access the graph database are marked accordingly

In the rest of this Section, we describe how the models simulating the Conversational AI and the users are implemented.

4.3 Conversational AI

The Conversational AI represents the implementation of the proposed theoretical model. By making it interact with simulated users, in this work we aim at verifying that it is sufficiently stable to be deployed in a human–machine interaction scenario. The model is designed to extract relevant sub-graphs from the knowledge database, analyse their structure with respect to what is known about the user, and compute the utility of possible dialogue moves. This approach aims at combining the advantages of long-term planning, typical of rule-based AI, with generalisation capabilities and fuzzy decision-making typical of probabilistic approaches.

When the interaction starts, nothing is known about the user. The system then uses population data coming from Movielens (Harper and Konstan 2015) and the results of the network analysis to extract the most informative, generic sub-graph (i.e. focusing (Paglieri and Castelfranchi 2004)). To this aim, the system computes a utility value related to the probability of a movie having been seen by the user, using Movielens user ratings as an estimator. As the goal is to recommend a movie that is neither obscure nor too well known, we adopt a scoring function that peaks at the maximum k value of 1 at 50% probability and produces 0 at both the 0% and 100% probability levels, as follows:

$$\begin{aligned} k = 1-\left( \frac{2n}{\textrm{max}(n)}-1\right) ^2 \end{aligned}$$
(1)

where n is the number of opinions expressed by Movielens users. This function attributes maximum utility value to movies that have a normalised probability of being known of 0.5. The lowest values are assigned to both movies that are most certainly known or unknown, highlighting that a useful suggestion consists of an item that is neither obvious nor obscure.

The utility value of each candidate item is obtained by considering the best balance between the probability of an item being known and the probability of the same item being liked. We thus define a utility function \(U_{sel}\) as the harmonic mean of k and the average Movielens opinion score o, expressed on a scale of 1 to 5 for the item of interest and normalised between 0 and 1. We therefore obtain

$$\begin{aligned} U_{\textrm{sel}} = \frac{2 \cdot k \cdot o}{k+o}. \end{aligned}$$
(2)
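In code, the two scoring functions read as follows; the min–max normalisation of the opinion score is our assumption, as the exact normalisation step is not specified.

    def novelty_score(n, n_max):
        # Eq. (1): peaks at 1 when a movie was rated by half of the
        # maximum number of raters; falls to 0 for items that are almost
        # certainly known or almost certainly unknown.
        return 1 - (2 * n / n_max - 1) ** 2

    def selection_utility(k, o):
        # Eq. (2): harmonic mean of the novelty score k and the average
        # Movielens opinion o, normalised to [0, 1].
        return 2 * k * o / (k + o) if (k + o) > 0 else 0.0

    # Hypothetical item: rated by 300 of at most 1000 users, average 4.2/5.
    k = novelty_score(300, 1000)    # 0.84
    o = (4.2 - 1) / 4               # assumed min-max normalisation
    print(selection_utility(k, o))  # ~0.82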

The best three items are taken as reference to start building the BN; their features, like actors, directors, and genres, are also considered to extract the most useful sub-graph to reason upon. In addition, secondary items sharing features with the primary ones are considered. Specifically, the following procedure is adopted to assemble the BN:

  • Extract the features of primary items;

  • For each primary item, extract the top five most useful secondary items, ranked using equation 2;

  • Extract all ontological part_of relationships involved in the set of nodes composed by the union of both features and primary/secondary items;

  • Rank the list of relationships using, for each relationship, the authority score of target nodes as the primary sorting value and the hub score of source nodes as secondary sorting value.

The size of the BN is limited to at most 30 arcs for performance reasons. Ordering candidate relationships in the graph first by authority scores and then by hub scores follows the principles described in Sect. 3 and summarised, for the readers’ convenience, in Table 1. Authority is a measure of credibility, so nodes with high support are generally preferred in the selection. Hub scores measure importance, so that, when there are not enough authoritative nodes in the network, the possibility of discovering them is increased by considering nodes that support many authoritative nodes.
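As a concrete illustration of this ordering step, the following sketch ranks toy relationships; all score values are invented for illustration.

    # Toy stand-ins for the sub-graph extraction and HITS steps.
    relationships = [("Harrison_Ford", "Blade_Runner"),
                     ("SciFi", "Blade_Runner"),
                     ("Harrison_Ford", "Witness")]
    authorities = {"Blade_Runner": 0.7, "Witness": 0.2}
    hubs = {"Harrison_Ford": 0.6, "SciFi": 0.4}

    # Rank by authority score of the target node first, hub score of the
    # source node second; keep at most 30 arcs for performance reasons.
    ranked = sorted(relationships,
                    key=lambda e: (authorities[e[1]], hubs[e[0]]),
                    reverse=True)
    bn_arcs = ranked[:30]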

Table 1 Cognitive properties described in Paglieri (2004) mapped on computational scores

The extracted data compose a network in which part_of relationships guide the assembly of the BN used for decision-making. The orientation of the part_of relationships defines the arcs of the BN, as shown in Fig. 4.

Fig. 4
figure 4

The structure of dynamically assembled BNs. Blue nodes represent primary items (movies the system may want to recommend). Red nodes represent secondary items (movies the system may ask about to collect information). Green nodes represent features of primary items and may either be part_of items (actors and directors) or items may be part_of them (genres). Secondary items are selected among items sharing these features with primary items. (Color figure online)

Concerning a priori distributions, we apply uniform, therefore maximally entropic, distributions to nodes with no incoming relationships. All other nodes are represented as aggregator nodes, computing the median appreciation value of their parents. The contribution of each parent node may, in principle, be weighted depending on authority and hub scores, for example, but we do not consider this aspect, in our experiments, for the sake of simplicity. In future versions, the relevance of parent nodes may be used to assign more detailed causal inference from parents to children nodes.

Given the obtained Bayesian Network, rating distributions extracted from Movielens are applied as soft evidence, where the applied probability density is computed using kernel density estimation. This way, after Bayesian inference, the BN can represent the probability of each movie and each feature being of interest to a generic user. At this point, it is possible to estimate the utility of recommending each item using the \(U_{\textrm{sel}}\) scoring function (Eq. 2). To establish whether the most useful item is adequate for a recommendation, we use a dynamic threshold, implemented as a sigmoid function taking the number of turns as a parameter. The threshold decreases at increasing speed as the dialogue becomes longer, simulating the system’s growing urgency to reach a conclusion. As soon as the item with the maximum utility value exceeds the dynamic threshold, the system attempts a recommendation. When a recommendation cannot be provided, the system computes the most useful question to ask the user.
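A possible form of this dynamic threshold is sketched below; the midpoint and steepness parameters are illustrative assumptions, as their exact values are not reported here.

    import math

    def recommendation_threshold(turn, midpoint=8.0, steepness=0.6):
        # Descending sigmoid over the turn number: the utility bar an
        # item must clear drops faster and faster as the dialogue grows
        # longer, pushing the system towards a conclusion.
        return 1.0 / (1.0 + math.exp(steepness * (turn - midpoint)))

    # The system attempts a recommendation as soon as the most useful
    # item's utility exceeds the current threshold.
    for turn in range(1, 13):
        print(turn, round(recommendation_threshold(turn), 3))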

To perform an exploration move, the system follows, in order of priority, four different strategies, employing the Markov blanket of the most useful item to recommend, the movie m. The Markov blanket MB of a random variable (node) X in a set of random variables is the subset of variables conditioned on which all other variables are independent of X. Given a parameterised exploration utility function \(U_{ex}(f_{m_1},f_{m_2})\) providing the harmonic mean of \(f_{m_1}\) and \(f_{m_2}\) as a utility score, the values assigned to the parameters of this function vary depending on the following priority order (sketched in code after the list):

  • Attempt 1: starting from MB(m), identify the most useful node to explore by assigning their authority score to \(f_{m_1}\) and their entropy score to \(f_{m_2}\);

  • Attempt 2: if no useful nodes can be found with the preceding strategy (e.g. all authoritative nodes in MB(m) have been explored), identify the most useful node to explore among the ones outside MB(m) by assigning their authority score to \(f_{m_1}\) and their entropy score to \(f_{m_2}\);

  • Attempt 3: if no useful nodes can be found with the preceding strategy, starting from MB(m), identify the most useful node to explore by assigning their hub score to \(f_{m_1}\) and their entropy score to \(f_{m_2}\);

  • Attempt 4: if no useful nodes can be found with the preceding strategy, starting from MB(m), identify the most useful node to explore among the ones outside MB(m) by assigning their hub score to \(f_{m_1}\) and their entropy score to \(f_{m_2}\).
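The cascade of attempts referenced above can be sketched as follows; all inputs are hypothetical stand-ins for the system’s graph and BN measures, and the “no useful node” condition is approximated here by a zero utility score.

    def harmonic_mean(a, b):
        return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

    def next_exploration_node(candidates, mb, authority, hub, entropy, explored):
        # Four-attempt policy: authority x entropy inside the Markov
        # blanket mb, then outside it; then hub x entropy inside, then
        # outside (dictionaries map node names to scores).
        attempts = [(authority, True), (authority, False),
                    (hub, True), (hub, False)]
        for score, inside_mb in attempts:
            pool = [n for n in candidates
                    if (n in mb) == inside_mb and n not in explored]
            scored = [(harmonic_mean(score[n], entropy[n]), n) for n in pool]
            useful = [(u, n) for u, n in scored if u > 0]
            if useful:
                return max(useful)[1]
        return None  # no exploration move available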

This step relies on the entropy of each node, following the principles described in Sect. 3: entropy is a measure of relevance, so the system aims at collecting information about the data showing the highest pragmatic utility.

Exploration moves can be presented in two forms: an open question or a polar question. The system prefers the former when a certain node class (i.e. genres, actors, movies, etc.) accumulates more than 60% of the total node utility and the relative size of the class domain is lower than 10% of the total domain size of the candidate items. These thresholds were chosen to capture the situation in which a single category collects most of the informative nodes in the network, so that it is useful to investigate the full category with an open question instead of the single nodes. At the same time, the number of items in the domain of the category (e.g. movie genres) should be small, to avoid cognitively overloading the user with a difficult question. These thresholds are empirically set for the simulator and will be further refined in the future. This leads the system to ask open questions when most of the utility falls in a specific class of nodes and the number of options is relatively small. This is based on the assumption that an open question is intrinsically harder to answer, from a cognitive point of view, than a polar question, so it should be asked only when the number of possible answers is reduced, to avoid imposing cognitive load on the user. Otherwise, a polar question in the basic form “Do you like X?” is asked to disambiguate the user’s stance towards the most useful node.
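The decision between the two question forms can be sketched as follows, with the 60% and 10% thresholds taken from the description above and the input values invented for illustration.

    def question_type(class_utility, class_domain_size, total_domain_size,
                      target_class):
        # Open question only if one node class gathers more than 60% of
        # the total utility and its domain covers less than 10% of the
        # candidate domain; otherwise a polar "Do you like X?" question.
        share = class_utility[target_class] / sum(class_utility.values())
        rel_size = class_domain_size[target_class] / total_domain_size
        return "open" if share > 0.6 and rel_size < 0.1 else "polar"

    # Hypothetical state: genres collect most of the utility and form a
    # small domain, so an open question about the favourite genre is asked.
    print(question_type({"genre": 7.0, "actor": 2.0, "movie": 1.0},
                        {"genre": 12, "actor": 300, "movie": 250},
                        562, "genre"))  # -> "open"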

Fig. 5
figure 5

The interpretation steps performed to update the belief graph, once a new user utterance is collected. In the example, an initial exchange is analysed using natural language processing techniques (e.g. intent and entity recognition) and represented in the form of a graph to support subsequent system decisions on the basis of the belief graph, consisting of the system’s representation of the common ground

When the simulated user answers, the graph structure is updated to represent the belief graph according to the feedback, and a new set of base target items, consistent with the new beliefs, is extracted together with their features. The details of this processing are exemplified in Fig. 5. When beliefs become available in the graph, the extraction procedure uses them as a further ordering factor to prioritise extraction, in the following order (a sort-key sketch follows the list):

  • items that the user explicitly reported not having seen, therefore being the predicate of a negated knows belief and having the highest number of features being the predicate of positive wants beliefs;

  • items having the highest number of features being the predicate of positive wants beliefs;

  • items that the user explicitly reported not having seen, therefore being the predicate of a negated knows belief and having the highest number of features being the predicate of positive likes beliefs;

  • items having the highest number of features being the predicate of positive likes beliefs.
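This priority cascade amounts to a lexicographic sort; below is a sketch with hypothetical per-item belief records, whose field names are our own invention.

    def extraction_priority(item):
        # Lexicographic key: unseen items matching 'wants' beliefs first,
        # then any 'wants' match, then unseen items matching 'likes',
        # then any 'likes' match.
        unseen = item["negated_knows"]
        return (item["wants_matches"] if unseen else 0,
                item["wants_matches"],
                item["likes_matches"] if unseen else 0,
                item["likes_matches"])

    candidates = [
        {"title": "Witness", "negated_knows": True,
         "wants_matches": 2, "likes_matches": 1},
        {"title": "Blade_Runner", "negated_knows": False,
         "wants_matches": 3, "likes_matches": 2},
    ]
    candidates.sort(key=extraction_priority, reverse=True)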

When the utility of the best recommendable item exceeds the dynamic threshold, an exploitation move is produced. Together with the actual suggestion, the parameters considered in the theoretical model are used to rank the item’s features. Consistently with what has been described in Sect. 3, the number of features that are a predicate for the positive wants or likes beliefs associated with the user represents the (un-)likeability parameter l. The relevance parameter r is estimated as the normalised feature entropy of the probability distribution over ratings, after Bayesian inference. The credibility parameter c is represented by the authority score of the feature. The importance parameter i is represented by its hub score. After normalisation, the harmonic mean of the four parameters, for each feature, is considered as a ranking score \(U_f\), as follows:

$$\begin{aligned} U_f = \frac{4}{\frac{1}{r}+\frac{1}{c}+\frac{1}{i}+\frac{1}{l}} \end{aligned}$$
(3)
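For concreteness, a direct transcription of Eq. (3) follows; the guard against zero-valued parameters is our assumption, as this case is not discussed here.

    def feature_ranking_score(r, c, i, l, eps=1e-6):
        # Harmonic mean of relevance (BN entropy), credibility (authority
        # score), importance (hub score) and (un-)likeability (belief
        # count), each normalised to [0, 1] beforehand.
        parts = [max(x, eps) for x in (r, c, i, l)]
        return 4.0 / sum(1.0 / x for x in parts)

    # Hypothetical feature with mid-range scores on all four dimensions.
    print(feature_ranking_score(0.6, 0.8, 0.5, 0.7))  # ~0.63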

The first three features in the ranking are then used to support the recommendation of the item. The system also reports a short description of the item (the movie plot) to complete the recommendation. This is mainly necessary to help human evaluators judge whether the proposed item, with the selected features, would be accepted by the user. If the user reports having already seen the proposed item, a new knows belief is created in the graph and the full process is repeated until a recommendation is accepted. The procedure is summarised in its main parts by Algorithm 1.

Excluding the NLU module, the implementation of the functional capabilities of the simulator can be transferred as-is to a FANTASIA-based Conversational AI system. The use of BTs, in a real setting, makes it possible to handle clarification requests, common ground inconsistencies in general, and barge-in.

Algorithm 1
figure a

System move generation pseudocode

4.4 Simulated users

In the presented simulator, users’ behaviours are generated using Movielens data. Specifically, for each run, a random user is chosen from Movielens and their score distribution is used to generate answers.

The simulated user can act either to respond to a recommendation or to answer a question from the system, covering the following cases:

  • Answer to a recommendation if Movielens contains explicit data concerning the actual response of the user taken as reference, this feedback is provided to the system to state that the user knows the movie and how much they liked it. If no explicit feedback from the considered user is found in Movielens, the simulated user reports not having seen the movie and accepts the recommendation. Whether the recommendation was a good one and the supporting arguments convincing is one of the aspects evaluated by human participants in our experimental protocol, so there is no need to simulate acceptance reasoning in the simulated user;

  • Answer to a question (movie) if Movielens contains explicit data concerning the actual response of the user taken as reference, this feedback is provided to the system to state that the user knows the movie and how much they liked it. If no explicit feedback from the considered user is found in Movielens, the simulated user reports not having seen the movie. The simulated user can also trivially answer an open question over items (e.g. their favourite movie);

  • Answer to a question (feature) the distribution of the Movielens scores given by the considered user to movies exhibiting the target feature is considered, as sketched after this list. When the average score is higher than 3.5, the user reports liking the feature. When it is lower than 2.5, the user reports not liking the feature. When neither threshold is exceeded, the user reports not having a strong opinion about the feature. In all cases, the simulated user also considers the presence of outliers in the distribution, to generate sentences containing exceptive statements such as “I don’t like adventure movies but I liked Indiana Jones”. For open questions, given the target class (e.g. a genre or an actor), items showing features belonging to that class are considered, and the feature with the highest average score among the selected items is provided as an answer.
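The feature-answer strategy can be sketched as follows; the thresholds come from the description above, while the outlier rule (two points from the mean) is an illustrative assumption.

    import statistics

    def feature_answer(user_scores):
        # `user_scores`: Movielens ratings the reference user gave to
        # movies exhibiting the target feature.
        mean = statistics.mean(user_scores)
        if mean > 3.5:
            answer = "likes"
        elif mean < 2.5:
            answer = "dislikes"
        else:
            answer = "no strong opinion"
        # Outliers trigger exceptive statements such as "I don't like
        # adventure movies but I liked Indiana Jones".
        has_exception = any(abs(s - mean) >= 2 for s in user_scores)
        return answer, has_exception

    print(feature_answer([2, 2, 1, 5, 2]))  # ('dislikes', True)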

The user answer generation strategy for system queries is summarised in Algorithm 2, while the strategy to answer system recommendations is summarised in Algorithm 3. Since the simulator concentrates on the dialogue management strategy, the A-CoRS dialogue system and the simulated user interact using templates, generating repetitive and mechanical transcripts that do not need advanced NLP processing, which would introduce a potential source of error that is not of interest at this step. NLP analysis is, of course, important, as shown in Fig. 5, to deal with the noisy way real human users express themselves. To enable human evaluation of the simulated dialogues and their credibility, the generated templates are passed to an LLM to generate a more fluent and natural transcript, while keeping the underlying management strategy intact. For our experiments, we used ChatGPT-4 to convert the template-based synthetic dialogues into more natural transcripts using the following prompt: Rephrase the following dialogue to make it sound more natural. Keep the structure and only change the sentences. We chose a state-of-the-art natural language generation tool to avoid the problems of alternative approaches that may have produced less varied dialogues or introduced errors.
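A sketch of the rephrasing call, using the prompt reported above; the call structure follows the current OpenAI Python SDK and does not necessarily reproduce the exact setup used for the experiments.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    PROMPT = ("Rephrase the following dialogue to make it sound more "
              "natural. Keep the structure and only change the sentences.\n\n")

    def naturalise(template_dialogue: str) -> str:
        # Pass the template-based transcript to the LLM; decision-making
        # stays in the simulator, the model only rewrites surface forms.
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": PROMPT + template_dialogue}],
        )
        return response.choices[0].message.content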

Algorithm 2
figure b

Simulated user query answer generation pseudocode

Algorithm 3
figure c

Simulated user recommend answer generation pseudocode

5 Experimental setup

In this work, we delve into the evaluation parameters of A-CoRS, which do not only focus on the quality of the item proposed, as in other recommender systems’ evaluations, but also consider the features constructing the adopted argument. Other studies, like Lamche et al. (2014) and Millecamp et al. (2019), also investigate the impact of different feature-based explanations on users’ perception of a recommender system. In this work, we concentrate on a human evaluation of the plausibility of synthetic dialogues, considering also the quality of the supporting features. As also reported by other scholars (Walton 2001; Jiménez-Aleixandre and Brocos 2018), plausibility should be evaluated considering several factors, including (i) evidence: the presence of relevant, reliable, and sufficient evidence strengthens the plausibility of an argument; (ii) reasoning: the logical coherence and soundness of the reasoning used to connect the evidence and the claim influence the plausibility; (iii) consistency: the argument should be consistent with established facts, widely accepted principles, and background knowledge; (iv) contextual factors: plausibility can be influenced by contextual factors such as the expertise of the participants, the relevance of the argument to the topic under discussion, and the prior beliefs or biases of the audience.

Given that plausibility is influenced by many different factors, human evaluations are the most reliable indicator for it. For this reason, we evaluate our approach by collecting human evaluations in a test comparing dialogues generated with our model, dialogues based on a random item selection model, and dialogues taken from the INSPIRED Corpus (Hayati et al. 2020), a dataset of human–human interactions for movie recommendation. Results are analysed from both a quantitative and a qualitative point of view. Quantitatively, aspects like focusing (Paglieri and Castelfranchi 2005a), entropy, and cognitive properties are considered. Qualitatively, ratings about the level of plausibility of the dialogue and the quality of the features selected to support the arguments are considered.

For our subjective evaluation protocol, we adopt crowd-sourced judgements about a set of simulated dialogues. Online crowd-sourcing platforms have become increasingly popular among researchers as a way to collect data from a large and diverse population. These platforms allow researchers to recruit participants for online studies quickly and efficiently and have been used in a variety of fields, including psychology, sociology, and computer science (Gadiraju et al. 2017). The current study utilised the crowd-sourcing platform Prolific (Palan and Schitter 2018) to investigate how people evaluate our simulated dialogues compared with dialogues from a control group composed of positive and negative examples. We presented a survey on Qualtrics, consisting of 20 dialogues, to 20 participants and asked them to answer specific questions. The study aimed to explore how participants’ responses to the questions varied across dialogues and whether there were any patterns or trends in the data.

In this section, we will provide the details of the study methodology, including the selection of dialogues, the question design, and data collection and analysis. Overall, we believe that this study provides insights into how people evaluate recommendation dialogues and highlights whether the theoretical background here described contributes to the generation of plausible recommendation strategies.

The dialogue selection for this experiment aimed to compare simulated dialogues to those representing the positive and negative extremes. Specifically, we used:

  1. Positive subset of the control group five dialogues from the INSPIRED Corpus (Hayati et al. 2020) to represent ideal human–human interactions in a recommendation scenario; the dialogues were extracted from the Corpus according to their level of coherence, based on the concept of ring-like patterns described in (Di Bratto et al. 2021), whereby the items explored are connected in the knowledge representation;

  2. Negative subset of the control group five dialogues generated with our system, where the selection of both the target items and the supporting features during exploitation was randomised;

  3. Target group ten simulated dialogues produced using the proposed computational model. In five dialogues, the simulated user did not express any initial preference (system initiative). In the remaining five, the simulated user would start the interaction by stating their main preference (user initiative), as computed to answer an open question about the preferred genre. This was useful to account for different interaction strategies we observed in the INSPIRED corpus.

As an example, the following dialogue represents the result of a simulated interaction between our A-CoRS (Mary) and a simulated user (George).

figure d

Each participant was given the following instructions:

figure e

For each of the above-mentioned questions, participants were asked to give a score on a Likert scale ranging from 1 (not at all) to 5 (very). People were not informed that the dialogues were mostly synthetically generated. Q1 is important to understand whether the questions posed during the exploration phase are consistent. In fact, if the participants’ perception tends towards consistency, the argument features towards which feedback is sought are correctly selected. Q2 refers to the perceived naturalness of the dialogue. Since naturalness can be rather subjective, the absence of communicative effort was used here as a proxy for it. Q3 refers to another aspect of naturalness, in that, if the dialogue is plausible, the Recommender should be able to show a certain degree of expertise. Q4 is about the quality of the arguments used, similarly to Q1, but during the exploitation phase.

6 Results

Since the collected data represent opinion scores on an ordinal scale, we aim to assess the statistical significance of the association between the observed scores and the dialogue types. To this end, we fit a Cumulative Link Mixed Model (CLMM; Agresti 2012) with Laplace approximation (Shun and McCullagh 1995) to our data, accounting for random effects due either to the single participants or to the specific stimuli by treating these as blocking variables. This model estimates the odds of observing high values on the dependent variable (i.e. the Likert score) given the value of the independent variable (i.e. the dialogue type). We fit five different models: one to analyse our data as a whole and, subsequently, one to concentrate on each specific aspect investigated through the proposed questions. Figure 6 shows the full distribution of the scores among the questions for each considered dialogue type. Figure 7 shows the distribution of the scores for each question.
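As a rough sketch of this kind of analysis, the following Python fragment fits the fixed-effects part of a cumulative link (proportional odds) model with statsmodels' OrderedModel. The data frame and its column names are purely hypothetical, and, unlike the CLMM we use, this sketch omits the random effects for participants and stimuli (in practice, the mixed model is typically fit with dedicated tools such as the R ordinal package).

```python
# Minimal fixed-effects sketch of the ordinal analysis (hypothetical data).
# The CLMM used in this work additionally includes random intercepts for
# participants and stimuli, which statsmodels' OrderedModel does not support.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# One row per judgement: the Likert score and the type of the judged dialogue.
scores = pd.DataFrame({
    "score": pd.Categorical([2, 1, 3, 4, 4, 5, 3, 5, 3, 4, 2, 4],
                            categories=[1, 2, 3, 4, 5], ordered=True),
    "dialogue_type": ["negative"] * 4 + ["positive"] * 4 + ["target"] * 4,
})

# Dummy-code the dialogue type; "negative" becomes the baseline level.
exog = pd.get_dummies(scores["dialogue_type"], drop_first=True).astype(float)

# Cumulative link model with a logit link: a positive coefficient means higher
# odds of high Likert scores for that dialogue type relative to the baseline.
model = OrderedModel(scores["score"], exog, distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())
```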

Fig. 6

The full distribution of scores. The negative subset of the control group shows lower average evaluation scores than the positive subset, indicating that the human judges were able to correctly evaluate the control group. Target scores are positioned in between the negative and positive groups, as expected

Fig. 7

Score distributions for each question. The association between high scores and positive dialogues is very strong for Q2, Q3, and Q4 and weak for Q1. For the target dialogues, the association is strong for Q4 and weak for Q2. The negative dialogues are never found to be associated with high scores

The full model detected a highly significant association between positive dialogues and high scores (\(p < 0.0001\)), while the negative control dialogues did not exhibit any statistically significant association with high values. The target group was only weakly associated with high scores, although the p value was very close to the significance threshold (\(p = 0.0144\)). This first model confirms the quality of the collected data, as participants were clearly able to separate positive from negative samples in the control group.

The model built specifically for Q1 provides information about the perceived coherence of the features explored during the exploration phase. In this case, positive dialogues showed only a weakly significant association with high scores, whereas the other two groups showed no significant association. It is possible, in this case, that participants had trouble following the rationale of the exploration strategy adopted by the recommenders. This, however, appears to be true for human–human dialogues too, suggesting that an evaluation of the exploration strategy from an external point of view may not be feasible. This aspect will be further investigated in the next step of the research project, which involves human users directly in the interaction.

Analysing Q2, positive dialogues showed a highly significant association with high scores (\(p < 0.0001\)). The target group exhibited a weakly significant association with high scores, with the p value again very close to the significance threshold (\(p = 0.0145\)). This result shows that the recommender was not perceived to be as natural as the positive control dialogues, which is expected from a simulated interaction. However, the communication strategy built upon the theoretical model was sufficient to shift the perception of naturalness with respect to the negative examples, even though the natural language generation strategy remained the same.

For Q3, only positive dialogues displayed a strong and highly significant association with high scores, while the other two groups did not exhibit any significant association. This is coherent with our previous observations: in the simulated dialogues, the system does not provide sufficient explanations for its behaviour during the interaction, while human users do. In the simulated setting, where the more functional requirements of the theoretical model are evaluated, these aspects cannot be easily tested. This observation will be verified by comparing these data with the ones collected in a human–machine interaction experiment.

Regarding Q4, positive dialogues were strongly and highly significantly associated with high scores (\(p < 0.0001\)). The target group also showed a significant association (\(p<0.01\)) with positive values. This aspect is important for our theoretical model, as it shows that the selection of supporting arguments, based on the aspects we are considering, makes the recommendation both plausible and convincing.

Table 2 summarises the statistical results we reported, highlighting the association between high scores and the type of dialogue.

Table 2 Summary of the statistical significance of the association between high scores and dialogue types for each considered question

On average, the target dialogues obtained a higher score than the negative samples, while remaining below the positive samples. The analysis we present here, however, provides a more accurate view of our results than simple average scores. For completeness, Table 3 also summarises the average scores obtained by the considered groups.

Table 3 Average scores for the considered groups over the four questions

The proposed analysis highlights, however, that since the average scores of the negative samples are also relatively high on Q3, while the separation between groups is clearer on Q4, the probability that the scores depend on the type of dialogue is higher for Q4. The methodological procedure we propose in this paper concentrates on establishing the capabilities of the recommender on the different aspects as perceived by external observers, highlighting those that need further investigation with real users and certifying the quality of those that a simulator can cover. This result is obtained without considering the collected scores in an absolute way, which may be misleading, and by providing an analysis that respects the ordinal nature of the scale, while also including random effects and inter-group interaction to extract the correct indications from the collected data. In the case of Q3 and Q4, we were able to detect that, although the average score is higher for Q3 than for Q4, the probability that the high scores obtained on Q4 are due to the type of dialogue is higher than it is for Q3.

While this may seem counter-intuitive, the reader should note that we are not concentrating on absolute scores, for which the target stimuli systematically outperform the negative ones. By using the Cumulative Link Mixed Model, we concentrate on the much more specific question of whether the occurrence of high scores is due to the type of dialogue rather than to other factors. While this very conservative evaluation downplays the good scores obtained by the system on all questions, it provides a stronger foundation for the claims where statistical significance was reached or closely approached.
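The following toy computation, with purely hypothetical scores, illustrates the distinction: a question can attract a higher overall mean while showing weaker separation between dialogue types, and it is the separation, not the mean, that the model attributes to the type of dialogue.

```python
# Toy example (hypothetical scores): a higher overall mean does not imply a
# stronger association between dialogue type and high scores.
import numpy as np

# Q3-like pattern: all groups score fairly high, so the between-group gap is
# small and little of the variation is explained by the dialogue type.
q3 = {"target": [4, 4, 3, 4, 4], "negative": [4, 3, 4, 3, 4]}
# Q4-like pattern: lower overall mean, but a clear separation between groups.
q4 = {"target": [4, 4, 3, 4, 3], "negative": [2, 1, 2, 2, 1]}

for name, groups in (("Q3", q3), ("Q4", q4)):
    overall = np.mean(groups["target"] + groups["negative"])
    gap = np.mean(groups["target"]) - np.mean(groups["negative"])
    print(f"{name}: overall mean = {overall:.2f}, target-negative gap = {gap:.2f}")
# Q3 has the higher overall mean (3.70 vs 2.60) but the smaller gap (0.20 vs 2.00).
```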

Fig. 8

Objective measures computed over a sample simulated dialogue. The utility of the best recommendable item increases during the interaction until it meets the dynamic threshold (a). The domain size in which the solution is searched tends to get smaller (b), and the average entropy of the distributions in the Bayesian Network decreases, showing higher confidence in the decision model (c)

Concerning the objective measures we collect from the simulation, Fig. 8 shows the evolution, over the dialogue, of a) the score of the best recommendable item together with the dynamic threshold, b) the domain size of the items selected as candidates from which to extract the three target ones, and c) the average entropy of the BN. The application of the linguistics-based principles tends to reduce, at each step, the number of potentially interesting items and/or the average entropy of the decision model, which becomes more informed at each turn. Cases where these measures increase are typically associated with negative user feedback about the subject of the posed question: in these cases, the relevant sub-graph changes strongly as potentially interesting items leave the network and new ones enter. The difference with the same measures taken from a negative sample produced using the random item selection strategy is clear, as shown in Fig. 9.
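A minimal sketch of how such per-turn measures can be logged follows; the function and variable names are our own illustrative choices rather than the system's actual interface, and the entropy is the average Shannon entropy over the distributions of the Bayesian Network nodes.

```python
# Illustrative sketch (hypothetical names) of the per-turn objective measures.
import numpy as np
from scipy.stats import entropy

def avg_entropy(distributions):
    """Average Shannon entropy (bits) over the BN node distributions."""
    return float(np.mean([entropy(p, base=2) for p in distributions]))

def log_turn(best_utility, threshold, candidate_items, node_distributions):
    """Collect the three measures plotted in Fig. 8 for one dialogue turn."""
    return {
        "best_utility": best_utility,         # (a) rises until it meets...
        "threshold": threshold,               # ...the dynamic threshold
        "domain_size": len(candidate_items),  # (b) should shrink over turns
        "avg_entropy": avg_entropy(node_distributions),  # (c) should fall
    }

# Example turn: utility still below the threshold, three candidates remaining.
print(log_turn(0.62, 0.75, ["item_1", "item_2", "item_3"],
               [[0.7, 0.3], [0.25, 0.25, 0.5]]))
```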

Fig. 9

The same objective measures shown in Fig. 8, computed over a negative simulated dialogue. The evolution of the curves is ineffective with respect to the target solution

7 Discussion and conclusions

To summarise, we verified our starting hypotheses as follows:

H 1

Argumentation dialogues have been described through a data-oriented model of belief revision based on cognitive principles for the selection and assessment of arguments.

H 2

Mathematical descriptors of the pragmatic features at the basis of the theoretical model were presented and motivated; they were adopted for the selection of plausible arguments and realised in the model's computational implementation.

H 3

On the basis of the mathematical scores, it is possible to evaluate A-CoRS using simulated environments. We have presented a methodology to accomplish this in the technological framework of the FANTASIA architecture.

Concerning the results of the A-CoRS evaluation, participants were able to effectively differentiate the subsets of the control group, indicating the reliability of the crowd-sourced data. There is a statistically significant likelihood of observing positive values for Q4 when the stimuli originate from either the target group or the positive group. The same may be concluded for Q2, although the significance threshold was not technically reached for the target group. These results indicate that the proposed model does not behave inconsistently and that it appears to be convincing in its argumentation capabilities. On this basis, it is now possible to plan human–machine interaction experiments targeted at better measuring the experience humans may have with such an artefact. By involving actual participants, the study will evaluate the theory's applicability in a broader context and account for nuances and complexities that may arise through human interaction. Furthermore, attention to the register and personalisation of the exploration phase will be deepened to improve the reception of the system turns. The promising results concerning the effectiveness of the argument selection strategies allow us to say that the proposed theoretical model is valid for the continued implementation of the system in its entirety.

In conclusion, this work presents a theoretical model for argumentative recommendation dialogues based on linguistic and pragmatic theories. Our cross-disciplinary approach has led us to propose a new methodology to assess the quality of argumentation dialogues in the context of movie recommendations using the A-CoRS framework. The objective was to evaluate the effectiveness and plausibility of the system's argumentation strategies in guiding users towards optimal item choices. The proposed evaluation method involves simulated dialogues generated through a theoretical model that incorporates cognitive pragmatics principles. An experiment was carried out to evaluate the plausibility of the resulting dialogues. For the test, 20 participants were recruited through crowd-sourcing to investigate how people evaluate our simulated dialogues (target group) in comparison with dialogues from two control groups, namely human–human conversations (positive subset) and simulated dialogues without strategies (negative subset).

Results revealed that participants were able to distinguish between the dialogues in the control group, suggesting that the responses were not randomly given. Specifically, for all four questions in the experiment, the scores were higher for the positive group, as expected. This finding supports the notion that the questions were appropriately formulated and coherent with the characteristics typically observed in human–human dialogues. The simulated dialogues, in turn, were usually scored higher than the negative ones, especially for Q2 (naturalness) and Q4 (argument selection in exploitation), for which the differences with the negative group were statistically significant. The adopted statistical model checks whether the occurrence of high scores is associated with the type of dialogue, so that the detected statistical significance is strongly related to the aspects of interest for this work.

The results suggest that, although sociable communicative aspects are not adopted by the simulator, the selected arguments are perceived as relevant and plausible enough to support the acceptability of the recommendation. Future developments will focus on conducting tests with real users interacting with the system, a crucial step to assess the theory's validity in more realistic scenarios.

Concerning the objective measurements, we considered the evolution of metrics describing the size of the domain of possible items to recommend, the average entropy of the BN, and the utility score of the best recommendable item at each step. These metrics show the emergence of a coherent strategy that filters out irrelevant items and increases confidence in the information used for decision-making. Compared to a random selection strategy, the difference is clear, so we can conclude that the theoretical model produces system behaviour that is reasonable both from an objective point of view and from a human-perception point of view.

The limitations of the present study lie in the fact that the reported results only concern the functional aspects of argumentative dialogue management. However, dialogues involve a significant amount of linguistic and sociable strategies, as highlighted in Hayati et al. (2020) for the case of movie recommendations. The results obtained with the presented simulator confirm the validity of the theoretical model but still need confirmation in a real setting where human users are involved. In such a setting, social aspects not captured by the simulator are expected to have a significant impact on the perception of the system by the involved users. Additionally, the need for clarification requests and common ground management is more complex in real settings, so NLU models must be adequately designed and different dialogue moves should be considered and prioritised. In this sense, the role of Behaviour Trees will become more relevant than in the simulated setup.