1 Introduction

Chatbots for customer relationship management (CRM) are intelligent conversational applications intended to assist users in decision making through text (or voice) input and output [30, 33]. CRM chatbots are usually developed and adapted by the service provider to enable 24/7 rapid exchanges with potential customers. In this sense, CRM chatbots can vary substantially in terms of appearance, behaviour, and capabilities, providing different experiences to end-users [10].

As indicated by ISO 9241–210 [24], a central aspect of the user experience (UX) is the satisfaction of end-users, defined as the “extent to which the user’s physical, cognitive, and emotional responses that result from the use of a system, product, or service meet the user’s needs and expectations”.

Satisfaction is a complex measure of end-users’ reactions to, and reasoning about, systems, relating to efficiency and effectiveness as well as to the accuracy and reliability of the assessment modalities [3, 13, 20, 25, 29].

User satisfaction is generally assessed after interaction with a given system, using reliable usability scales such as the System Usability Scale (SUS; [7]) and its shorter proxies, the Usability Metric for User Experience (UMUX; [18]) and the UMUX-LITE [28]. Unlike ad hoc questions about user satisfaction and ease of use, which are barely comparable across studies [2, 3], standardised scales aim to assess the users’ perspective after interacting with products, usually on a scale from 0 to 100, by investigating participants’ perception of, and reaction to, key interactive aspects of the experience, thus providing comparable insights regarding the quality of tools. Such standardised subjective assessment, when coupled with objective measures of effectiveness and efficiency, can provide relevant, replicable, and comparable information about usability (ISO 9241–11). Moreover, when used in conjunction with data collected over time in the context of use, together with measures of expected value and acceptance, satisfaction measures can support user experience researchers in their efforts to model the overall experience people develop with a given product [4].
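
For illustration only, a minimal R sketch of how raw Likert responses can be mapped onto the 0–100 range mentioned above is given below. The item names (umux1, umux2), the 5-point format, and the simple min–max rescaling are assumptions for the example, not the prescribed scoring procedure of any specific scale.

```r
# Minimal sketch: rescale k-point Likert items to a 0-100 score.
# Assumes items are coded 1..k; column names are hypothetical.
rescale_0_100 <- function(items, k = 5) {
  # Sum of (response - 1) across items, divided by the maximum
  # attainable sum, then mapped onto 0-100.
  rowSums(items - 1) / (ncol(items) * (k - 1)) * 100
}

responses <- data.frame(umux1 = c(4, 5, 3), umux2 = c(4, 4, 2))
rescale_0_100(responses, k = 5)  # one 0-100 score per respondent
```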

Nevertheless, unlike classic interactive systems based on graphical elements, chatbots rely on textual and conversational aspects to engage end-users [38], co-constructing the interaction and the meaning of the conversation [12]. In this sense, chatbots in many respects create a new paradigm in human–computer interaction by placing the conversational exchange at the centre of the interaction between the user and the technology [19]. Therefore, the assessment of end-user satisfaction with chatbots should also consider aspects that are not usually included in classic satisfaction evaluation, e.g. the quality of the conversational exchange.

As reported by Borsci et al. [5] in their review of the chatbot domain, little is known about how to evaluate end-users’ perception of quality when interacting with chatbots. There is a growing interest in understanding how to assess and improve the interaction with such systems [16, 22, 31]; however, to our knowledge, there are currently no standardised tools to assess end-user satisfaction with chatbots, except for the recently developed ChatBot Usability Scale (BUS-15) [5]. The BUS-15 was developed from an initial model of 42 items (i.e. key aspects associated with the experience with chatbots), derived from a systematic literature review and from interviews and surveys of chatbot designers and users, and was tested using an exploratory factorial analysis. The exploratory analysis of this initial model resulted in 15 items divided into five factors (Table 1) with an overall reliability of 0.87: factor 1 is composed of two items (Cronbach’s alpha of 0.87), factor 2 of seven items (alpha of 0.74), and factor 3 of four items (alpha of 0.86); factors 4 and 5 are composed of single items. Moreover, the BUS-15 factors strongly correlate with the UMUX-LITE (between 0.61 and 0.87), suggesting that the BUS-15 is reliable when used to assess end-users’ overall satisfaction while also adding new elements not considered by classic satisfaction scales.

Table 1 The original (English) version of the BUS-15. Each item is assessed on a 5-point Likert scale from 1 (“strongly disagree”) to 5 (“strongly agree”)

The present work aims to perform a confirmatory factorial analysis of the BUS-15, testing its psychometric properties and potential alternative factorial models. The confirmatory analysis is considered a necessary step [32, 40] to validate a new scale by statistically checking and optimising the factorial model that emerged in the exploratory analysis [3]. In addition, we also conduct an analysis from a designometric perspective. Designometrics was recently introduced by Schmettow [36], who noted that the purpose of a user experience self-report scale (such as the BUS or the UMUX-LITE) is to compare designs, whereas psychometrics is focused on people. While the statistical workflow is the same, the data and the interpretation differ. A psychometric analysis of reliability requires multiple persons responding to multiple items, often referred to as the psychometric response matrix. In a designometric assessment, the data collection must also include a sample of designs, resulting in a three-dimensional response matrix, which can be reduced to a design-by-item matrix to fit standard psychometric tools (such as reliability scores and exploratory and confirmatory factorial analysis). The interpretation of a designometric analysis refers to designs rather than to people. If a chatbot satisfaction scale has good designometric reliability, it measures very precisely how well a chatbot can lead to a high degree of user satisfaction. In contrast, under a psychometric perspective, the same scale measures how easily individual users are satisfied by any chatbot. Obviously, these interpretations are not the same, and the second could be considered less relevant in the context of interactional assessment; Schmettow [36] went as far as calling it a psychometric fallacy to validate designometric scales only under a psychometric perspective. Here we present both perspectives, on two grounds: firstly, psychometric evaluation of user experience scales is mainstream, and we aim for compatibility with existing research; secondly, in the present case, the focus is on the structure of a multi-scale inventory (BUS-15), and we predict that the partitioning of items into multiple scales is dominated by mental processes and would hence produce similar results under both perspectives.
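
As an illustration of this reduction, the following R sketch collapses a long-format response table (participant by chatbot by item) into a design-by-item matrix by averaging over participants; the data frame `responses_long` and its column names are assumptions for the example.

```r
# Minimal sketch of the designometric reduction described above,
# assuming a hypothetical long-format data frame `responses_long`
# with columns participant, chatbot, item, and response (1-5).
library(dplyr)
library(tidyr)

design_by_item <- responses_long |>
  group_by(chatbot, item) |>                        # average over persons
  summarise(mean_response = mean(response), .groups = "drop") |>
  pivot_wider(names_from = item, values_from = mean_response)

# `design_by_item` now has one row per chatbot and one column per item,
# so standard psychometric tools (reliability, EFA, CFA) can be applied
# with designs playing the role usually played by persons.
```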

Moreover, this work has two additional aims: (i) to test the validity of three versions of the BUS, translated into Spanish, Dutch, and German by native speakers, and (ii) to investigate whether the correlation (convergent validity) between the BUS and the UMUX-LITE identified in the previous study of Borsci et al. [5] is replicated.

2 Methods

2.1 Participants

The study was approved by the ethical committee of the University of Twente; it was advertised by a specialised service and on social media, aiming to target a pool of international potential users of different ages. The sampling strategy was convenience-based; the only inclusion criterion specified in the advertisement was that participants should be proficient in reading and writing English.

Each participant evaluated the interaction experience with a minimum of five and a maximum of ten chatbots, resulting in a total of 1292 completed questionnaires in multiple languages (English, German, Dutch, and Spanish), as reported in Table 2.

Table 2 Number of questionnaires (BUS-15) for each available language: English, German, Dutch, and Spanish

A total of 259 people participated in the testing of the scale; of these, only 80.7% (209 participants) filled in the survey correctly (128 female, 131 male; mean age 37.78 years, range 18–83; 89% of the participants were European). Fifty-four percent of the sample was under 40 years old, while the remaining part of the sample was over 40. The other participants were excluded because they decided to stop the assessment for personal reasons, or because they had technical problems and were not willing to continue the evaluation. In some cases, fewer than ten chatbots were correctly displayed to participants for various reasons, e.g. issues with the availability of the chatbots, issues due to internet connections, etc. In such cases, we retained a participant’s answers only when responses to at least five chatbots were collected correctly.

2.2 Materials

The study was designed to enable participants to interact with chatbots and answer the questionnaire in a survey developed with the Qualtrics software. At the beginning of the survey, participants were asked to declare their nationality and native language. If their native language was Dutch, German, or Spanish, participants were assigned to answer the questionnaire in that language; otherwise, they were asked to fill out the questionnaire in English. Participants were informed (see instructions in Appendix A) that most of the chatbots were mainly in English, and that when chatbots were available in the other languages (Dutch, German, or Spanish) these were also presented in the native language of the participants. Each participant was asked to assess ten chatbots selected from the list of 26 (see the list in Appendix A) by performing an information retrieval task, e.g. finding specific information to inform their decision making (see an example in Appendix A). The language capabilities of the chatbots were considered in the randomised presentation of the chatbots, allowing participants to also experience some of the chatbots in their native language. Moreover, participants were asked to fill in a demographic section reporting two individual characteristics: age and sex.

The chatbot systems were selected among CRM chatbots available online and associated with a service offered by a company to guide its users in the process of information retrieval.

For each chatbot, after the interaction, participants were asked to fill in the BUS-15 [5] and the two items of the UMUX-LITE [28]. The UMUX-LITE was presented on a 5-point Likert scale instead of the classic 7-point format, a reduction commonly considered safe [27, 35]. The UMUX-LITE items were presented in one of the four languages. The translation and back-translation of the questionnaires were performed by native speakers.

2.3 Procedure

After the introduction to the study, participants were asked to fill in the demographic section. Then they were asked to interact with each chatbot to achieve a specific information retrieval task.

As the online survey was designed to ask participants to perform tasks with chatbots, each participant, when possible, performed the test while sharing their screen with a member of the research team. This procedure offered support to the participants during the interaction and ensured control over the gathered data. Researchers were instructed to answer only questions regarding potential misunderstandings of the survey, and mainly to monitor that participants interacted with the chatbots.

When it was not possible to connect during the session (about 10% of the participants), a post-session interview was performed to confirm that participants had interacted with the chatbots and to ask about their difficulties in performing the tasks.

After each interaction with a chatbot, a total of seventeen questions (the 15 items of the BUS and the 2 items of the UMUX-LITE) were presented in a fully randomised order.

2.4 Data analysis

Data analysis was performed in R. A linear regression model was used to inspect whether there were significant differences between the satisfaction ratings obtained with the translated versions of the scales. For this analysis, the BUS and the UMUX-LITE scores were used as the dependent variables, while the language of translation was used as the independent variable, with the English version as the intercept.
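
A minimal R sketch of this model is given below; the data frame `scores` and its columns `bus` and `language` are hypothetical placeholders for the collected data.

```r
# Hypothetical data frame `scores`: one row per completed questionnaire,
# with the overall BUS score and the language of the administered scale.
scores$language <- relevel(factor(scores$language), ref = "English")

# With English as the reference level (the intercept), each coefficient
# estimates the shift of a translated version relative to the original.
fit_lang <- lm(bus ~ language, data = scores)
summary(fit_lang)  # repeated analogously for the UMUX-LITE score
```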

The “lavaan” and “ggplot2” packages of R were used for the confirmatory factorial analysis (CFA), with weighted least squares means and variance adjusted (WLSMV) estimation. A factor loading was considered acceptable when at least 0.6 and optimal at 0.7 and above [21]. Model fit was established by looking at multiple criteria [8, 23, 41]: a ratio between chi-square and degrees of freedom below 3; a comparative fit index (CFI) of 0.90 or higher; a root mean square error of approximation (RMSEA) below 0.07; and a standardised root mean square residual (SRMR) below 0.08.
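
For illustration, a minimal lavaan sketch of a five-factor CFA of this kind is given below. The item names, their assignment to factors (two, seven, and four items plus two single-item factors), and the data frame `bus_data` are hypothetical placeholders; for simplicity, the sketch treats items as continuous and uses lavaan’s default maximum likelihood estimation rather than the WLSMV estimation reported above.

```r
# Minimal CFA sketch with lavaan; names and item assignments are
# placeholders, not the exact published specification.
library(lavaan)

bus_model <- '
  F1 =~ bus1 + bus2                                     # two items
  F2 =~ bus3 + bus4 + bus5 + bus6 + bus7 + bus8 + bus9  # seven items
  F3 =~ bus10 + bus11 + bus12 + bus13                   # four items
  F4 =~ bus14                                           # single item
  F5 =~ bus15                                           # single item
  # Single-indicator factors need a constraint for identification;
  # here their residual variances are fixed to zero:
  bus14 ~~ 0*bus14
  bus15 ~~ 0*bus15
'
fit <- cfa(bus_model, data = bus_data)
fitMeasures(fit, c("chisq", "df", "cfi", "rmsea", "srmr"))
```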

Cronbach’s alpha was calculated for the overall scale and for each factor of the BUS-15. To establish convergent validity, a Kendall tau correlation analysis was performed between the BUS and the UMUX-LITE. Finally, a regression analysis was performed to assess the differences among the chatbots in terms of satisfaction measured by the BUS, and the effect of individual characteristics on the satisfaction rated by participants with this scale.
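
These analyses can be sketched as follows, again with hypothetical object and column names (`bus_data`, `scores`, `umux_lite`):

```r
# Cronbach's alpha for the overall scale; per-factor values are obtained
# by subsetting the item columns belonging to each factor.
library(psych)
psych::alpha(bus_data[, paste0("bus", 1:15)])

# Convergent validity: Kendall tau between the two overall scores.
cor.test(scores$bus, scores$umux_lite, method = "kendall")
```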

3 Results

3.1 Validity of the translated versions of the scales

A sub-sample of 503 questionnaires regarding 5 chatbots was collected in all four available languages of the two scales, i.e. the BUS-15 and the UMUX-LITE. The regression analysis suggested that there are no significant differences between the three translated versions of the BUS-15 and the original version in English; however, participants who used the German version, on average, tended to rate their satisfaction with chatbots slightly lower than participants who used the other versions (see Fig. 1).

Fig. 1

Overall score of the BUS-15 for each available language: English, Dutch, German, and Spanish

A significant effect was identified for the UMUX-LITE (Fig. 2), suggesting that the Dutch and German versions of the UMUX-LITE were associated with lower overall satisfaction ratings (R2 = 0.036, F(3, 500) = 5.83, p < 0.001). Specifically, compared to participants who used the original scale in English, participants who rated their satisfaction with the Dutch version tended to report significantly lower satisfaction ratings (b = −0.13, t(503) = −2.63, p < 0.05). Similarly, participants who used the German version tended to rate their satisfaction lower than the other participants (b = −0.11, t(503) = −3.86, p < 0.001). This suggests that the Dutch and German versions of the UMUX-LITE used in this study cannot be considered reliable for further analysis. Conversely, the original and Spanish versions (Appendix B) could be retained for further tests. The reliability of the English UMUX-LITE (α = 0.89) is higher than that identified in previous studies (Cronbach’s alpha between 0.82 and 0.83 [28]); the Spanish version shows a comparable level of reliability (α = 0.83).

Fig. 2

Overall score of the UMUX-LITE for each available language: English, Dutch, German, and Spanish

3.2 The factor loadings of the BUS-15

The results of the CFA are in line with the previous exploratory analysis [5], suggesting that the solution with five factors is acceptable, with a CFI of 0.924 and loadings above the threshold of 0.6 (Table 3). The SRMR is equal to 0.039, and the RMSEA is equal to 0.065. The scale is strongly reliable, with an overall Cronbach’s alpha of 0.90.

Table 3 Factor loadings of the BUS-15 with the original five-factor solution proposed by [5]

Although the model appears robust (Fig. 3), the factor loadings for items 4, 5, 7, and 8 are only acceptable, i.e. above but very close to 0.6. This might indicate that alternative models could be explored.

Fig. 3

Graphical representation of the factorial model of the BUS-15

As reported in Table 4, two alternative factorial models were tested to optimise the scale. The first model was obtained by removing the items with a barely acceptable factor loading, i.e. items 4, 5, 7, and 8, resulting in a solution with 11 items (BUS-11) and five factors. The second alternative model was obtained by also removing the factors with single items, providing a solution with 9 items (BUS-9). This second model was tested because retaining factors with fewer than three items is usually not considered an optimal solution [9, 15].
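
Continuing the earlier hypothetical lavaan sketch, the two alternative models can be refitted and compared as follows; item names and assignments remain placeholders, and estimation is again simplified relative to the published analysis.

```r
library(lavaan)

# BUS-11: the five-factor model without items 4, 5, 7, and 8.
bus11_model <- '
  F1 =~ bus1 + bus2
  F2 =~ bus3 + bus6 + bus9
  F3 =~ bus10 + bus11 + bus12 + bus13
  F4 =~ bus14
  F5 =~ bus15
  bus14 ~~ 0*bus14
  bus15 ~~ 0*bus15
'
# BUS-9: additionally dropping the two single-item factors.
bus9_model <- '
  F1 =~ bus1 + bus2
  F2 =~ bus3 + bus6 + bus9
  F3 =~ bus10 + bus11 + bus12 + bus13
'
fit_bus11 <- cfa(bus11_model, data = bus_data)
fit_bus9  <- cfa(bus9_model,  data = bus_data)
sapply(list(BUS11 = fit_bus11, BUS9 = fit_bus9),
       fitMeasures, c("chisq", "df", "cfi", "rmsea", "srmr"))
```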

Table 4 Comparative analysis of the BUS-15 and two alternative models: BUS-11 and BUS-9

The BUS-9 is a short and very reliable solution, with a Cronbach’s alpha of 0.89; nevertheless, it misses two aspects that emerged as relevant in the interviews and focus groups of the original study [5], namely perceived privacy and security (factor 4) and time response (factor 5).

The BUS-11 has a better fit than the BUS-15, and it maintains the structure of the original solution while reducing the scale by four items (see Fig. 4). The overall reliability of the 11-item solution is relatively high (α = 0.89), but the RMSEA is slightly above the expected threshold of 0.07.

Fig. 4

Graphical representation of the factorial model of the BUS-11

The difference in items between the two alternative models (BUS-11 and BUS-9) is minimal, and their overall reliability is the same; however, the BUS-11 is a more complete solution, as it provides insights into specific and relevant characteristics of chatbots. Hence, the 11-item model (BUS-11) seems preferable to the shorter one (BUS-9).

The result of the CFA using the designometric matrix shows that, while the model was stable in terms of factor loadings, the first item of the scale presented a problem of negative variance, and the model fit was inferior to that of the psychometric model (chi-square/df = 2.7; CFI = 0.86; RMSEA = 0.27; SRMR = 0.06).

3.3 Correlation between the UMUX-LITE and the BUS-11 and effects of individual characteristics on the BUS-11

The Kendall tau correlation analysis suggests that there is a significant positive relationship (τ = 0.68, p < 0.001) between the BUS-11 and the UMUX-LITE (English and Spanish versions), as shown in Fig. 5.

Fig. 5

Graphical representation of the correlation between the overall scores of the BUS-11 and the UMUX-LITE

A linear regression model was performed to observe differences in participants’ satisfaction ratings (BUS-11) after the interaction with the chatbots. Using a randomly chosen chatbot as the intercept (here reported as C1), significant differences emerged for all chatbots except chatbots 13, 18, and 19 (see Appendix C). Figure 6 shows the differences in the assessment of satisfaction among the chatbots.

Fig. 6

Overall scores of the BUS-11 per chatbot. Chatbots are numbered in an order different from the list presented in Appendix A, as we did not ask permission from, or inform, the service providers about the usage of these chatbots

The gender declared by participants does not affect the overall scores reported with the BUS-11; however, the age of the participants has a small but significant effect (R2 = 0.008, F(1, 1285) = 12.17, p < 0.001), suggesting that older participants were more conservative in their satisfaction ratings of chatbots than younger participants (b = −0.0009, t(1290) = −3.48, p < 0.001).

4 Discussion

The 15-item model of the BUS, previously identified by Borsci et al. [5], is reliable but could be further optimised. We identified two alternative solutions. The 9-item solution (BUS-9) is very reliable, short, and solid, but with it designers lose relevant aspects that could inform their decision making. Therefore, we recommend using the BUS-11, which also collects data about key aspects such as privacy, security, and time to respond. However, the BUS-9 could be used with mock-ups or early-stage prototypes, when chatbots are not yet fully functional and specific aspects of the systems are still hard for participants to judge.

When comparing the 11-item model from the psychometric approach, using the 1259 answers of the participants, with the designometric perspective based on the 26 chatbots, the BUS-11 appears to be stable in terms of factor and item organisation. Nevertheless, the designometric perspective resulted in an inferior fit of the factorial model: the CFI and SRMR indexes are acceptable, while the RMSEA is particularly high, and one item presents a problem of variance. These are common issues with small cohorts [14, 26] and should be compensated for by adding more chatbots to the database, to fully model the ability of the BUS-11 to differentiate critical design aspects of chatbots. However, considering both the psychometric and the designometric analyses, the BUS-11 can be considered overall reliable. Moreover, the tool correlates with a classic satisfaction questionnaire (the UMUX-LITE) while adding relevant elements regarding the specificity of chatbots’ functions, and it is now available in four languages: English, Dutch, German, and Spanish (Table 5).

Table 5 The BUS-11 versions in English, Dutch, German, and Spanish. Each item is assessed on a 5-point Likert scale from 1 (“strongly disagree”) to 5 (“strongly agree”)

The age of the users seems to slightly affect participants’ BUS-11 satisfaction ratings; this result is in line with a recent qualitative study suggesting that the age of end-users is a relevant factor in the assessment of trustworthiness of, and interaction with, chatbots [39]. Although the effect is minimal, it should be further explored in future studies.

The BUS-11 is a flexible scale for assessing user satisfaction with CRM chatbots. The current version of the tool has yet to be tested outside the CRM domain, but it can be considered a solid basis for investigating and supporting the design of other types of chatbots (e.g. general or domain-specific conversational agents), such as tools for daily interaction or for supporting rehabilitation and adherence to medical treatment. In such cases, we recommend using the BUS together with other reliable scales. Future studies will investigate the application of the BUS in other domains.

The present study has some limitations. First, because the study focused on investigating the factorial structure of the BUS and participants were asked to interact with several chatbots, we minimised the questions about individual characteristics. Future studies will explore which characteristics affect the satisfaction with chatbots measured with the BUS-11, considering, for instance, education, expertise with chatbots, etc. [6]. Second, participants with different native languages interacted with chatbots in English; although this does not affect the goal of the study (i.e. testing the validity of the scale), it could have affected the interaction with the chatbots, and future studies should adopt a more naturalistic approach, testing chatbots available in the native language of the participants to assess the perceived quality of the tools. Third, the translation of the UMUX-LITE into Dutch and German was not optimal, as the results suggest a significant difference from the English version. However, we proposed and preliminarily validated the Spanish version of the UMUX-LITE which, to our knowledge, was not yet available in the literature; future studies could use this version for further validation purposes. Finally, we used a small population of chatbots to build a three-dimensional perspective on the reliability of the BUS, looking at the scale from the participants’ (psychometric) and the chatbots’ (designometric) perspectives. Although the results suggest that the overall construct of the BUS-11 holds up when tested from the designometric perspective, a future study is under preparation in which we are collecting data on additional chatbots to increase the sample of our “design” population.

5 Conclusion

The interest of practitioners in using chatbots to support end-users of services is growing, even in sensitive domains including, for instance, e-government [17, 34] and health and rehabilitation [1, 16, 37]. As suggested by De Filippis et al. [11] and Federici et al. [16], there is a need for specific and calibrated tools to assess the quality of interaction with chatbots, to support designers during the development and assessment of such systems. Borsci et al. [5] suggested that the quality of interaction with chatbots can only be ensured by defining reliable assessment criteria that ensure comparability and support a satisfactory interaction between people and these new types of intelligent technology.

The BUS-11 is a tool that can facilitate the evaluation of interaction with chatbots, and its diffusion could enable practitioners to compare performance and benchmark their conversational systems during the formative and summative phases of product assessment. Concurrently, as Borsci et al. [5] proposed, designers could rely on a specific heuristic list, the BOT-check, to support their design thinking during the development phase of chatbots.

The interactional paradigm shift created by chatbots is also opening a range of new research and design opportunities in the field of HCI [19], and the diffusion and usage of the BUS-11 could be a way to harmonise methods and ensure the comparability of results.