1 Introduction

Chatbots can be defined as intelligent conversational applications that simulate natural-language conversation by engaging in text or voice (or both) input and output exchanges with humans [56]. These tools may be designed to perform in different contexts (web platforms, social networks, home devices, etc.) and to serve a wide range of goals in domains ranging from entertainment to health assistance and customer-service support [4, 28]. As suggested by Radziwill and Benton [69]: ‘chatbots are one class of intelligent, conversational software agents activated by natural language input’. Conversational agents are generally categorised as highly driven by artificial intelligence, while chatbots may be more or less sophisticated in their ability to drive a natural conversation with end-users or to help customers achieve their goals. Nevertheless, in the literature, chatbots and conversational agents are often used as synonyms [42, 77]. When attached to a company service, chatbots aim to support the decision-making and information retrieval of end-users [62] and are generally used as customer relationship management (CRM) tools. These CRM chatbots may be used to reduce the operational costs associated with customer service and to enhance the brand image by providing rapid and effective 24/7 exchanges with customers, facilitating access to information [28]. These service tools, being usually proprietary or customised systems of a company, may vary substantially in appearance, behaviour and capabilities, and so provide different experiences to end-users [17].

Forecasting data suggest that around 85% of customer interactions will be handled without a human agent by 2020, with an expected market value for conversational agents of $6 billion by 2023 [7].

Despite the potential market for conversational agents, Valério et al. [78] suggest that still too little is known about how to assess end-users’ perception of quality when interacting with chatbots. Evaluation frameworks such as the PARAdigm for Dialogue System Evaluation (PARADISE, [82]) suggest that end-user satisfaction with chatbots should be considered a weighted product of success in achieving the tasks (maximising task success) at an acceptable cost (efficiency and quality of the chatbot’s performance). In line with this paradigm, Radziwill and Benton [69] recently conducted a literature review and compiled a list of thirty-eight quality attributes which can be used to design conversational agents. These authors proposed their list of chatbot quality attributes as guidelines (or checklists) for designers. In this work, we convert this design-oriented ‘attributes list’ into an inventory to measure satisfaction with chatbots. Satisfaction is a complex measure of end-users’ reactions to and reasoning about systems; it relates to efficiency and effectiveness and requires accurate and reliable modalities of assessment [3, 22, 29, 44, 54]. Nevertheless, as recognised by Thorne [76], researchers in the field of conversational agents tend simply to transfer methods from Human-Computer Interaction (HCI). Often, satisfaction with chatbots is measured using tools developed to assess web or digital interfaces [71]. Reliable and short scales such as the System Usability Scale (SUS, [6]) and its shorter proxies, the Usability Metric for User Experience (UMUX, [25]) and the UMUX-LITE [53], can undoubtedly guarantee comparable measures of satisfaction; however, these tools were not developed to consider the conversational aspects of a user’s interaction with conversational agents. Tools to assess speech user interfaces, voice response systems and voice-controlled interfaces are available, e.g. the Speech User Interface Service Quality scale [52, 67], the Mean Opinion Scale [50] and the Subjective Assessment of Speech System Interfaces [38]. Such tools, however, focus on technologies that are significantly less interactive than CRM chatbots based on artificial intelligence (or advanced algorithms). The ability to communicate and maintain an efficient and effective conversational exchange is not a secondary but a characterising element of chatbots, and it should be considered in the assessment of satisfaction with these tools [18, 19]. Partially recognising this issue, some researchers have used qualitative instruments that directly inquire about the overall impression/experience of end-users after a given interaction with chatbots; these assessments also take the conversational aspects of the experience into account [61, 68, 72]. The use of qualitative methods provides insight into what constitutes a quality interactional experience with a chatbot system, but such methods have not yet been translated into reliable and comparable instruments for assessment. As recently noted by Federici et al. [23], there is a growing need in the domain of chatbots and conversational agents to translate qualitative results into validated scales to measure, diagnose and compare the quality of a chatbot-based interaction.

Attempts have been made in the marketing domain to systematise customer satisfaction toward a brand or a service that utilises conversational agents. For instance, in a recent marketing-oriented study [13], consumers of luxury brands with previous chatbot experience assessed different agents by simply viewing screenshots of these systems, to identify the benefits of using chatbots for marketing purposes. Concurrently, a recent work investigated the sources of satisfaction and dissatisfaction during interaction with chatbots from the marketing perspective [86]. Moreover, it was recently proposed to use sentiment analysis to automatically infer the sentiment toward a brand or a company after an exchange with a chatbot [24].

However, there is a difference between satisfaction as intended in the marketing domain, ‘the customer’s emotive post-consumption evaluation of the service performance’ [80], which is inherently connected to the concept of loyalty [8], and the satisfaction of interaction defined in ISO 9241-11 [43] as the ‘extent to which the user’s physical, cognitive and emotional responses that result from the use of a system, product or service meet the user’s needs and expectations’. In the first case, conversational agents are assessed to understand how to optimise the reaction toward a brand or a company service; in the latter, chatbots are the object of observation and are evaluated to understand how to improve their performance so that users are satisfied with the interaction with the chatbot as a tool.

While we believe that the marketing and interaction perspectives on satisfaction are complementary, the present work focuses on the latter. It aims to provide a toolkit that helps chatbot designers consider the needs of end-users during design and assess user satisfaction with the interaction during the formative phase of development, without considering the marketing implications that can and should be integrated at later stages of product development.

To the best of our knowledge, no previous study has attempted to identify and test a model of users’ satisfaction in the context of interaction with conversational agents with the aim of developing a reliable tool to guide designers during development.

To achieve the goal of developing tools to support designers in the evaluation of chatbots’ interactive quality, we performed four studies in sequence:

  i. The first study re-examines the attributes identified by Radziwill and Benton [69] on the basis of a systematic review, to identify attributes that end-users may directly or indirectly use to assess the quality of interaction after interacting with an information-retrieval chatbot.

  ii. The second study was aimed at reaching consensus on this list of attributes; an online survey with chatbot designers and end-users was developed to accomplish this.

  iii. The third study aimed to expand the list of attributes and to develop a list of ‘items’ for the questionnaire. Focus-group sessions were used to develop an initial version of the scale, called the Bot Usability Scale (BUS).

  iv. The goal of the fourth study was to pilot the initial version of the BUS scale and explore its psychometric properties, to create a final version of the scale for future analysis.

2 Study 1: Attributes Collection

2.1 Methods

In line with Scale Development Theory [21, 75], we adopted a deductive approach to define an initial construct for assessing end-user satisfaction. Researchers screened and reviewed the 26 references and 38 quality attributes proposed by Radziwill and Benton [69], which were initially developed as guidelines for designers. The goal of this screening was to identify attributes that could be used by end-users. During the re-examination of the literature and ‘quality attributes’, attributes were retained only when they described a perceivable characteristic that people may use, after using a CRM chatbot, to assess and judge the system and the experience of using it, and to rate that experience as satisfactory or not.

In parallel, a systematic literature review was performed following the PRISMA guidelines [59]. The outcomes of the review were also used to specify and add attributes to the final list. Researchers performed the initial process of reviewing and adapting the list of attributes and also reviewed the process and the list (see Appendix 1).

2.2 Results

Figure 1 reports on the PRISMA process that resulted in a final database of thirty-four literature items.

Fig. 1

PRISMA process of literature selection on the quality attributes of chatbots

Twelve attributes from the list of Radziwill and Benton [69] were excluded because they were not relevant or not applicable to the assessment of satisfaction with CRM chatbots (see Appendix 2). A revised list of 27 attributes was composed by using the remaining attributes from Radziwill and Benton [69] as a basis and by adding attributes in line with the new set of references (see Table 1).

Table 1 The revised list of 27 attributes that can play a role in the end-users’ assessment of satisfaction after CRM chatbot–user interaction

2.3 Discussion

A total of 27 attributes were identified by extending and reviewing the previous work of Radziwill and Benton [69] for the specific purpose of assessing user satisfaction with CRM chatbots. Using the same list mechanism proposed by Radziwill and Benton [69], these attributes could be used as a checklist to control the quality of chatbot functionalities during the design phase. To further refine the list, a group of experts and end-users was involved in a second study to ensure that the list was developed in a robust manner.

3 Study 2: Attributes Selection

3.1 Methods

3.1.1 Participants

Fifty experts and users were invited to complete an online survey based on the quality assessment attribute collection. Participants were recruited from a pool of expert designers and end-users provided by industry—the company UserBot.ai (https://userbot.ai/) and from the student population of the University of Twente. Twenty-nine (58%) completed the survey.

3.1.2 Procedure

First, the participants were asked to complete a consent form and provide demographic data; participants indicated their role as either chatbot designers or end-users. Designers declared their expertise as the number of years they had worked in the field, and end-users declared the amount of interaction with chatbots they had had in the last 12 months (on a scale ranging from 1 = None to 6 = Every Day). In the main part of the survey, participants rated how much they agreed with the importance of each attribute on a 7-point Likert scale. Finally, participants were asked to leave comments on (i) the comprehensibility and wording of the attributes’ names and descriptions and (ii) missing attributes and additional aspects they thought should be included.

3.1.3 Data Analysis

Consensus on the attributes was analysed using the median score for each attribute. To be inclusive and representative of different experiences, and because we aimed to build a tool usable by end-users with different levels of expertise, we weighted the opinions of all stakeholders equally. Interquartile ranges (IQRs) were used to estimate the level of agreement per attribute (Polisena et al., 2019). Only attributes with an overall median between 5 (agree) and 7 (strongly agree) were retained, in line with Polisena et al. (2019). Moreover, agreement on the final list of attributes was estimated using Krippendorff’s Alpha with 10,000 bootstrap resamples to estimate inter-coder reliability [35].
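The retention rule above can be sketched as follows; this is a minimal pure-Python illustration (the attribute names and ratings are hypothetical, and the authors’ actual analysis was not necessarily implemented this way):

```python
from statistics import median, quantiles

def consensus(scores, lo=5, hi=7):
    """Return (median, IQR, retained) for one attribute's 7-point ratings."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    med = median(scores)
    return med, q3 - q1, lo <= med <= hi

# Hypothetical ratings (1 = strongly disagree ... 7 = strongly agree),
# one value per respondent; a small IQR signals higher agreement.
ratings = {
    "Response time": [6, 7, 5, 6, 7, 6],
    "Graceful degradation": [3, 4, 2, 5, 3, 4],
}
for name, scores in ratings.items():
    med, iqr, keep = consensus(scores)
    print(f"{name}: median={med}, IQR={iqr}, retained={keep}")
```

Attributes whose median falls below 5 (agree) are dropped regardless of how tight their IQR is.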

3.2 Results

Among the 29 stakeholders involved in the survey (all volunteers; 27 male, 2 female; mean age: 36.5, SD: 9.3): (i) eight declared themselves experts (designers or programmers) with an average of two years of expertise in the chatbot field; (ii) ten reported being frequent users, having used a chatbot every day or at least once a week in the last 12 months; and (iii) eleven declared themselves novices, with minimal experience of chatbots in the last 12 months. The consensus analysis (Table 2) showed that only seventeen of the twenty-seven attributes in the revised list were considered important enough by the different stakeholder types for assessing ‘satisfaction’. Agreement among the participants was equal to 0.780, above the minimum acceptable level of .667 for Krippendorff’s Alpha [35].

Table 2 Agreement of the different stakeholders (Expert Designers, End-Users with a good or high level of expertise and Novices) on the importance of the attributes used to assess the quality of interaction with CRM chatbots

Among the attributes which were excluded, seven were related to conversational capabilities and appearance of the chatbots (A2, A9, A10, A14, A18, A19 and A22), and one attribute was related to the sensitivity of the chatbot in recognising whether people needed help or support (A24). Finally, two attributes that are usually reported in the literature as critical interactive aspects determining the overall quality of experience with chatbots, ‘Personality’ (A20) and ‘Interaction enjoyment’ (A21), were not considered essential by the stakeholders of the survey for determining people’s satisfaction with CRM chatbots.

Participants suggested some minor changes to the wording of the attributes to improve the readability of each attribute’s characterisation and highlighted omissions and perceivable errors. However, only two designers commented on potential extra attributes missing from the list. One designer suggested adding attributes connected to the linguistic capability of the chatbot, saying that, from the conversational point of view, it is vital that a ‘chatbot understands needs and mood of the users by giving precise information in as less time as possible’ (D3). The other designer suggested that it is crucial that a conversational agent sets and reframes the end-user’s expectations by ‘acknowledging when it doesn’t have enough confidence in emitting a response’ (D11).

3.3 Discussion

Study 2 suggested that seventeen attributes were considered the most relevant. Among the attributes rated less important were those that were hard to judge (A24) or related to aesthetics and conversational capabilities that may only minimally affect the overall experience of using chatbots (A2, A9, A10, A14, A18, A19 and A22). Conversely, the exclusion of the attributes ‘Personality’ (A20) and ‘Interaction enjoyment’ (A21) was unexpected. These attributes seem strongly connected to the user experience: the personality of chatbots and the enjoyment of interaction are often discussed as key elements in the adoption and use of conversational agents [11, 45, 64, 87]. The fact that CRM chatbots usually have a short-term relationship with end-users could have led to the exclusion of these attributes [28], whereas attributes such as A20 and A21 could be seen as more critical in judging conversational agents meant for long-term interactional exchanges. Attributes A20 and A21 are therefore excluded from this study, which focuses on CRM chatbots, but they could be used to assess satisfaction with other conversational agents. The results also suggested that other potentially essential attributes concerning linguistic capabilities and user expectations could be included in the list. These suggestions seem in line with the indications of Zamora [87], who reported that the capability of chatbots to accommodate different conversational styles, and their ability to make it easy for end-users to start a conversation and achieve relevant results, are essential to boost the experience of interacting with conversational agents.

4 Study 3: Revision of Attributes, Item Generation and Focus Groups

4.1 Methods

In line with the recommendations provided by participants of the previous study and supported by the research literature, three more attributes were added to the list:

  i) Linguistic flexibility. This attribute refers to the perceived capability of the chatbot to manage and adapt to different conversational styles, avoiding, for instance, the need for end-users to rephrase input in different ways to get answers from the conversational agent [15, 47, 87].

  ii) Easiness to start a conversation. This attribute refers to the affordances provided by the design of the chatbot to make it easy for an end-user to understand how to initiate a conversation [10, 87].

  iii) Expectations setting. This attribute refers to the ability of a chatbot to make its capabilities clear and not to create false expectations in end-users [5, 32, 45, 87].

Moreover, the attributes that were unexpectedly excluded in Study 2 (Personality and Enjoyment) were re-inserted in the list to further double-check their importance. Therefore, the list of attributes for Study 3 was composed of 22 elements (see Table 3).

Table 3 The revised list of attributes to assess the quality of interaction with CRM chatbots (code, name) and descriptors of new items. Attributes included in the previous list were coded A, new attributes were coded N, attributes previously excluded and re-inserted were coded R

A panel of four experts on interaction (three junior experts external to the previous phases, and one of the authors) proposed, for each attribute, a list of three items to create a questionnaire. Similar items were merged, and the wording of each item was discussed in multiple sessions. However, only 21 of the attributes in the list were used: attribute A15 was excluded from this exploratory study because it was not possible to recruit people with disabilities for the panel or the focus groups. In this sense, endorsing the motto ‘Nothing About Us Without Us!’ [9], the authors of the present work decided to postpone the adaptation of the scale for people with disabilities to future studies. Agreement among panel members was reached on 61 of the 63 items generated (Appendix 3); therefore, two items were excluded from the study.

Focus-group sessions were performed to revise the wording of the items and to inspect whether the connection between items and attributes was understandable to potential end-users.

4.1.1 Participants

A total of 16 volunteers (8 female, 8 male; age M = 22.1, SD = 2.84) participated in the focus-group sessions. Participants were randomly assigned to sessions with a maximum of five people each.

4.1.2 Material

During the focus groups, the list of attributes (Table 3) and the list of items (Appendix 3) were reviewed by participants. Consent and demographic data were obtained via Qualtrics (Appendix 4). Moreover, a demonstration was given to exemplify interaction with service chatbots using an actual bot, the Finnair Messenger chatbot (https://www.messenger.com/t/Finnair), a real-world example of a CRM chatbot integrated into a social media platform. Each session was both audio- and video-recorded to facilitate and support the analysis and provide reliable data.

4.1.3 Procedure

Each participant was asked to fill in a consent form and a demographic questionnaire. A definition of CRM chatbots and conversational agents was given to and discussed with the group. During the demonstration, the moderator operated the Finnair chatbot while asking the participants to offer input. Following the demonstration, the moderator asked participants to reflect on and discuss the positive and negative aspects of interaction with chatbots. At the end of the discussion, participants were asked to:

  i. Review the list of attributes: each participant was provided with the list of attributes and asked to discuss each attribute’s relevance for assessing their satisfaction in the use of a CRM chatbot and to review the clarity of the attributes’ descriptors.

  ii. Review the list of items: each participant was provided with the list of items and asked to read it, comment on the clarity of the wording, and verbally express any unclear association between items and attributes. It was explained to participants that an item could be matched to several attributes, or to none if they thought this was the case.

4.1.4 Data Analysis

The panel reviewed the video recordings and notes of the focus-group sessions to:

  i. Change or adapt the list of attributes: positive written indications and verbal comments about the importance of each attribute, and the comprehensibility of its descriptor, were coded as ‘1’, indicating that the attribute was comprehensible and considered necessary by a participant. Conversely, doubts about an attribute, its descriptor or its importance for assessing satisfaction were coded as ‘0’. Positive responses were used to estimate the level of agreement on the relevance of each attribute for assessing user satisfaction during the use of service chatbots.

  ii. Change or adapt the list of items: comments about ambiguity in the items’ wording, typos, or unclear associations between items and attributes were noted during the focus groups and analysed post-session using the video recordings.
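The binary coding scheme described above amounts to computing, per attribute, the share of positive (‘1’) codes across participants. A minimal sketch with hypothetical codes (the paper does not state a numeric retention threshold, so none is applied here):

```python
# Hypothetical 1/0 codes from 16 focus-group participants
# (1 = attribute comprehensible and considered necessary).
codes = {
    "A1 Response time": [1] * 16,
    "A11 Ease of use": [1] + [0] * 15,  # e.g. most found it too vague
}

# Share of positive codes = level of agreement on each attribute's relevance.
agreement = {attr: sum(c) / len(c) for attr, c in codes.items()}
for attr, share in agreement.items():
    print(f"{attr}: {share:.0%} positive")
```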

4.2 Results

As reported in Table 4, the initial list of 21 attributes was reduced to 14 main attributes. Attributes R21 and R22 (Enjoyment and Personality) were excluded: as in Study 2, these attributes were not considered essential for assessing satisfaction with CRM chatbots. The attribute A11 (ease of use), despite being considered important, was described by 15 out of 16 participants as too vague; participants suggested that ‘ease of use’ was already covered by other attributes and that each person may have a different idea of what ‘easy to use’ entails. Therefore, A11 was also excluded from the list. Moreover, the descriptions of five attributes were slightly adjusted to avoid ambiguity, in line with the feedback from participants.

Table 4 Revised list of key attributes from Study 2 (code and names). These attributes were listed as essential aspects to assess the quality of interaction with CRM chatbots after the focus groups. The participants’ agreement on the importance of each attribute, and the indications that emerged during the focus groups, were used to decide whether to retain (R), change (C) or merge (M) attributes. The rationale behind the decision-making is reported together with the final list of attributes (names and amended descriptions)

Participants also suggested merging the following attributes:

  • Attributes A3 (Maxim of quality) and A16 (Trustworthiness) were often confused by participants, who reported that, to judge the trustworthiness of a chatbot, they would rely on its ability to act and respond credibly. In agreement with participants, we retained only the items of A3 (see Appendix 3) to measure a new attribute that we named ‘Perceived conversational credibility’.

  • Attributes A1 (Response time), A12 (Engage in on-the-fly problem solving) and A17 (Process tracking and follow up) were all considered by participants to relate to the ability to answer end-users’ requests quickly. In agreement with participants, we retained only the items of A1 (see Appendix 3) to measure the new attribute ‘Speed of answer’.

  • Attributes A4 (Maxim of manners) and A6 (Appropriate language style) were considered closely associated. Therefore, the items of A4 (see Appendix 3) were retained to measure the new attribute ‘Understandability and politeness’.

Regarding the quality of the item wording, no major changes were requested, although participants highlighted some typos. Therefore, all the proposed items were corrected and retained for further testing. In line with the indications from the focus groups, the preliminary version of the BUS was composed of 42 items associated with 14 attributes (see Appendix 5).

4.3 Discussion

Participants of Study 3, in line with the results of Study 2, suggested excluding the attributes ‘Interaction Enjoyment’ and ‘Personality’. This seems to confirm that end-users consider these attributes less important than others for assessing satisfaction with CRM chatbots or, as mentioned earlier, too generic and addressed by other factors in the scale. However, as stated earlier, these two attributes should be considered and employed when dealing with chatbots for long-term, relationship-based interaction. It is also worth discussing the exclusion of ‘ease of use’ from the attribute list. The overall perspective of the participants was that ‘ease of use’ could not be fully represented by one attributional factor; rather, the ability to judge ‘ease of use’ with a CRM chatbot is something that emerges from a related set of interactive and conversational factors during the exchanges with chatbots. Participants in the focus groups also considered relevant those attributes and items that could be concretely perceived and observed during the interaction. Participants agreed that, from an end-user perspective:

  i) It is easier to assess ‘trust’ in a CRM chatbot interaction by assessing the bot’s capacity to provide information and help attain a goal (i.e. the credibility of information) than by assessing trustworthiness as a general and unspecified sense of trust. Assessing trustworthiness could require a different set of items, more in line with trust and technology-acceptance theory [55].

  ii) The ability of chatbots to provide speedy (and accurate) answers to requests was considered easier to assess than their capacity to solve emerging issues or their ability to inform users about progress toward the achievement of the goal.

  iii) It was easier and more relevant to assess the capability of chatbots to understand and be understandable than their ability to use an appropriate style of language.

The list of 14 attributes resulting from the analysis is reported in Appendix 6 as a checklist to assess the quality of chatbots (BOT-Check). BOT-Check could enable designers to control quality during the development of CRM chatbots, i.e. agents for short-term interaction. Moreover, by adding back the three attributes excluded from the present work, as previously discussed (‘Interaction Enjoyment’, ‘Personality’ and ‘Meets neurodiverse needs’), designers could aim to assess long-term conversational agents more inclusively.

5 Study 4: Psychometric Exploration of the BUS

A test was performed in which participants each interacted with five of ten available chatbots, to explore the psychometric properties of the scale (BUS-42) and to systematically reduce the number of items.

5.1 Methods

5.1.1 Participants and Measures

A total of 480 questionnaires were collected from a sample of 96 volunteers (22 female, 74 male; age M = 23.7, SD = 4.8). Eighty percent (385) of the questionnaires were entirely and correctly completed.

5.1.2 Material

Ten chatbots were used in this pilot study of the scale; each was associated with an information-retrieval task (Appendix 7). Qualtrics was used to collect demographic information (see Appendix 4), to present the tasks to be accomplished and to collect feedback after the use of each chatbot using the 42-item BUS and the UMUX-LITE [53]. Each item of the BUS was presented as a statement, and participants were asked to rate their agreement with it on a five-point Likert scale from 1 (‘Strongly Disagree’) to 5 (‘Strongly Agree’). A five-point Likert version of the UMUX-LITE was used, in line with the recommendations of Sauro [73] and Lewis [51].
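Both instruments yield 1–5 Likert responses that are later reported as 0–100 percentages. One common rescaling is the linear transformation sketched below; this is an assumption for illustration, as the paper does not spell out its scoring formula:

```python
def likert_to_percent(responses, points=5):
    """Rescale the mean of 1..points Likert responses to the 0-100 range."""
    mean = sum(responses) / len(responses)
    return (mean - 1) / (points - 1) * 100

# A respondent answering mostly 'Agree' (4) on a five-point scale:
print(likert_to_percent([4, 4, 5, 3, 4]))  # 75.0
```

With this mapping, all-1 responses score 0% and all-5 responses score 100%, making five-point and seven-point scales directly comparable.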

5.1.3 Procedure

Participants were tested in a dedicated room. Consent and demographic information were acquired, and participants were asked to interact with five of the ten available chatbots, selected at random (see the list of chatbots and tasks in Appendix 7), to achieve a goal presented as an information-retrieval task. After the interaction with each chatbot, whether or not they had achieved the task, participants were required to fill in the 42-item BUS and the UMUX-LITE; they then had a 10-min break. Each participant used the same computer and monitor for the test. As some of the data were collected before and some during the COVID-19 pandemic, 60% of the data were collected in person, and 40% were collected remotely, mediated by video-calling systems, with the same procedure as the in-person sessions.

5.1.4 Data Analysis

The 385 questionnaires were used to perform a Bayesian Exploratory Factor Analysis (BEFA, [16]) with 50,000 iterations using the R package ‘BayesFM’ [66]. The ‘psych’ R package was used to perform a parallel analysis [70]. Multiple BEFAs were performed, as Conti et al. [16] suggest that BEFA is an iterative approach that reduces items and analyses factor loadings. Bayesian approaches to factor analysis are considered more reliable than classic approaches [39]. Reliability analysis was conducted individually for each latent factor using the alpha function of the R package ‘psych’ [70]; this analysis was used to systematically drop items and improve internal consistency. Finally, participants’ answers (per chatbot) were used to perform descriptive and Pearson correlation analyses to explore the relationship between the final version of the BUS (and its factors) and the UMUX-LITE.

5.2 Results

5.2.1 Bayesian Exploratory Factor Analysis

A parallel analysis suggested a structure with five components. The BEFA confirmed the five-factor structure (35%; Metropolis-Hastings acceptance rate = 0.996). In line with DeVellis [21], we retained only the items with loadings above 0.7 (Table 5).

Table 5 Loading of attributes and items and excluded items

5.2.2 Internal Consistency

To reduce the number of items while maintaining a level of reliability above .7 for each factor, multiple iterations of reliability analysis were performed, dropping items iteratively until a satisfactory solution was identified. The coherence between the attributes associated with each factor was also considered when deciding to exclude or retain an item.
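The drop-one reliability loop can be illustrated with a pure-Python Cronbach’s alpha. The study used the R ‘psych’ package; this sketch, with made-up data, only shows the mechanics:

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of responses per item, same respondents in each."""
    k = len(items)
    totals = [sum(resp) for resp in zip(*items)]
    return k / (k - 1) * (1 - sum(variance(i) for i in items) / variance(totals))

def alpha_if_deleted(items):
    """Alpha of the scale recomputed after dropping each item in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]

# Made-up responses: three items that track each other plus one noisy item.
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
    [5, 1, 4, 2, 3],  # weak item: dropping it should raise alpha
]
print([round(a, 2) for a in alpha_if_deleted(items)])
```

Iterating this procedure, one drops the item whose removal most improves alpha and recomputes, stopping once the factor’s reliability is satisfactory.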

As reported in Table 6, the final questionnaire was reduced to 15 items (BUS-15, see Appendix 8), as follows:

  • Factor 1, initially composed of 8 items (alpha = .77), was reduced to 2 items (alpha = .87). This factor was named ‘Perceived accessibility to chatbot functions’, intended as the capacity of the chatbot’s design to enable users to start a conversation and to achieve their goal.

  • Factor 2, initially composed of 14 items (alpha = .85), was reduced to 7 items (alpha = .74). This factor was named ‘Perceived quality of chatbot functions’, intended as the ability of the chatbot to communicate its functions and to use the information available on the screen to guide people’s interaction politely and in line with end-user expectations.

  • Factor 3, initially composed of 12 items (alpha = .95), was reduced to 4 items (alpha = .86) after repeated dropping of items. This factor was named ‘Perceived quality of conversation and information provided’, intended as the perceived ability of a chatbot to engage adequately in a conversation.

  • Factor 4, initially composed of 3 items (alpha = .87), was reduced to one item regarding the privacy and security of the interaction exchange. This factor was named ‘Perceived privacy and security’, intended as the perceived privacy and security of the interaction exchange with the chatbot.

  • Factor 5, composed of 3 items (alpha = .92), was reduced to one item concerning the response waiting time; this factor was therefore named ‘Time response’.

Table 6 Reliability analysis of the items: Initial reliability estimated per each factor, reliability expected when dropping items, retained (R) items and the final alpha of each factor after dropping items

When all the items included in the BUS-15 are considered, the overall alpha was equal to .87.

5.2.3 Relationship between BUS-15 and UMUX-LITE

Figure 2 reports the average reaction to the different chatbots under assessment, measured by the UMUX-LITE and the BUS-15. A total of 13 participants only partially completed the UMUX-LITE; therefore, this analysis was performed on 372 valid measurements. Satisfaction measured by the BUS-15 ranged from a minimum of 51.9% to a maximum of 80.1%, compared with the UMUX-LITE results, which ranged from a minimum of 46.8% to a maximum of 88.1%.
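The percentage scores above imply that raw questionnaire responses were rescaled to a 0–100% range. The paper does not state the exact formula; a minimal sketch of a standard min–max rescaling, assuming (hypothetically) a 5-point Likert response format:

```python
def to_percentage(mean_score, scale_min=1, scale_max=5):
    """Rescale a mean Likert response to a 0-100% satisfaction score."""
    return 100 * (mean_score - scale_min) / (scale_max - scale_min)
```

Under this assumption, a mean response of 3 on a 1–5 scale would correspond to a 50% score.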

Fig. 2

Graphic presentation of the participants’ average satisfaction scores measured by the BUS-15 and the UMUX-LITE per chatbot. Descriptive statistics are also reported per chatbot: the number of participants and the mean and standard deviation of the BUS-15 and the UMUX-LITE

Table 7 suggests that, looking at the results per chatbot, the UMUX-LITE and the overall BUS-15 scale strongly correlate; however, the five factors of the BUS-15 seem to provide a broader perspective and capture aspects not considered by the UMUX-LITE. Specifically, two factors of the BUS-15, namely ‘Perceived quality of chatbot functions’ (F2) and ‘Perceived quality of conversation and information provided’ (F3), consistently correlate with the average items of the UMUX-LITE. In comparison, three factors, namely ‘Perceived accessibility to chatbot functions’ (F1), ‘Perceived privacy and security’ (F4) and ‘Time response’ (F5), seem to correlate only mildly, or not at all, with the UMUX-LITE on several occasions.

Table 7 Correlation between UMUX-LITE, the five factors of the BUS-15 and the overall BUS-15 scale

5.3 Discussion of Study 4

The exploratory analysis we performed suggested that, with 15 items, the BUS could reliably enable end-users to express their perception of their experience with a chatbot. The overall BUS scale seems to correlate strongly with the ultra-short, unidimensional standardised measure of satisfaction provided by the UMUX-LITE. However, the BUS-15, with its five factors, would still enable the assessment of differences in people’s perspectives by considering aspects such as accessibility to the chatbot’s functions, response time and privacy; these are aspects not usually considered by ‘classic’ measures of satisfaction developed for non-conversational tools.

The current version of the BUS-15 (Appendix 8) should be considered an initial step into a somewhat uncharted domain, i.e. the assessment of satisfaction with conversational agents. The scale could already be applied in practice to obtain comparable data across chatbot-based tools and systems, or during cycles of design and redesign; however, the results cannot yet be considered conclusive, and further studies are needed to extend and revise the construct and to fully validate the scale. Conversely, the BOT-Check (Appendix 6) could be used by designers as a tool to ensure quality in the design and functioning of chatbots before testing with end-users. This checklist should be considered complementary to the use of the BUS-15.

6 Conclusion

The advantage of having a reliable scale to test people’s perception of the quality of interaction with conversational agents is that such a tool may enable (i) potential end-users to express their level of satisfaction in a consistent and replicable way, and (ii) designers and evaluators to develop benchmarks against which to compare their results by modelling the different end-users and their needs during the formative and summative phases of product assessment. Currently, the BOT-Check can be considered a ready-to-use diagnostic tool to check the extent to which a chatbot interacts with people in line with guidelines and principles of quality design for conversational agents, e.g. through heuristic inspection. Conversely, the BUS-15 cannot currently be used as an off-the-shelf product for user research and usability tests. Although we included a reasonable number of chatbots widely used by customers, further validation studies are needed with a larger number of chatbots and a more diverse range of participants to ensure the reliability of the construct and to streamline the current version of the BUS. During the testing carried out as part of the exploratory analysis of the BUS, some tools were closed for proprietary reasons or temporarily suspended due to COVID-19, e.g. https://www.ato.gov.au/. This was not an issue, as we were still able to collect enough data to perform the analysis; however, it is representative of the volatile nature of the market for CRM chatbots. The threats to the validity of the present study should also be considered before using the BUS-15. As stated earlier, a more diverse range of people (in age, gender and ability) is needed in future iterations; in this study, mainly young participants (below 35 years of age) were involved in the focus groups and in the pilot of the scale. A more systematic sampling of participants should be performed in future work to better capture the perspectives of different potential end-users.
Concurrently, as we reported above, the present version of the construct did not include the perspective of people with disabilities, and future research and evaluations should plan for this.

Despite the limitations, the present work provides a new list of attributes specifically developed to measure satisfaction with CRM chatbots and a preliminary tool for assessment. We invite practitioners and researchers who want to contribute to the development of this tool to use the BUS, together with other tools, as a way to gain insights into the needs and points of view of end-users regarding their interaction with a chatbot.

Conversational agents are creating an interactional paradigm shift and a range of new research and design opportunities in the field of HCI [27]; nevertheless, the quality of interaction with these tools can only be guaranteed by defining reliable criteria and assessment tools that ensure comparability and support a satisfactory exchange between people and this evolving type of intelligent technology.