Chatbots for customer relationship management (CRM) are intelligent conversational applications intended to assist users in decision making through text (or voice) input and output [30, 33]. CRM chatbots are usually developed and adapted by the service provider to enable 24/7 rapid exchanges with potential customers. As such, CRM chatbots can vary substantially in appearance, behaviour, and capabilities, providing different experiences to end-users.
As indicated by ISO 9241–210, a central aspect of the user experience (UX) is the satisfaction of end-users, defined as the “extent to which the user’s physical, cognitive, and emotional responses that result from the use of a system, product, or service meet the user’s needs and expectations”.
Satisfaction is a complex measure of end-users’ reactions to, and reasoning about, a system, relating to its efficiency and effectiveness as well as to the accuracy and reliability of the assessment modalities [3, 13, 20, 25, 29].
User satisfaction is generally assessed after interaction with a given system by using reliable usability scales such as the System Usability Scale (SUS) and its shorter proxies, the Usability Metric for User Experience (UMUX) and the UMUX-LITE. Unlike ad hoc questions about user satisfaction and ease of use, which are barely comparable [2, 3], standardised scales assess the users’ perspective after interacting with products, usually on a score from 0 to 100, providing comparable insights regarding the quality of tools by investigating participants’ perception of, and reaction to, key interactive aspects of the experience. Such standardised subjective assessment, when coupled with objective measures of effectiveness and efficiency, can provide relevant, replicable, and comparable information about usability (ISO 9241–11). Moreover, when used in conjunction with data collected over time in the context of use, the expected value and acceptance with respect to the satisfaction measures can support user experience researchers in their efforts to model the overall experience people have with a given product.
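To make the 0–100 scoring concrete, the following is a minimal sketch of UMUX-LITE scoring in Python. The two 7-point items and the linear rescaling follow the standard formulation; the regression constants used to align scores with the SUS are as commonly reported in the UMUX-LITE literature and should be treated as an assumption here, not as part of the present study.

```python
def umux_lite(item1: int, item2: int, adjust: bool = False) -> float:
    """Score the two 7-point UMUX-LITE items on a 0-100 scale.

    item1: "This system's capabilities meet my requirements."
    item2: "This system is easy to use."
    Both rated from 1 (strongly disagree) to 7 (strongly agree).
    """
    if not (1 <= item1 <= 7 and 1 <= item2 <= 7):
        raise ValueError("items must be on a 1-7 scale")
    # Shift each item to 0-6, sum (max 12), and rescale to 0-100.
    raw = ((item1 - 1) + (item2 - 1)) / 12 * 100
    # Optional regression adjustment toward the SUS metric
    # (constants taken from the UMUX-LITE literature; an assumption here).
    return 0.65 * raw + 22.9 if adjust else raw
```

For example, a respondent answering 4 on both items obtains a raw score of 50, the midpoint of the 0–100 range.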
Nevertheless, unlike classic interactive systems based on graphical elements, chatbots rely on textual and conversational aspects to engage end-users, co-constructing the interaction and the meaning of the conversation. In this sense, chatbots in many respects create a new paradigm in human–computer interaction by placing the conversational exchange at the centre of the interaction between the user and the technology. Therefore, the assessment of chatbot end-user satisfaction should also consider aspects that are not usually included in classic satisfaction evaluation, e.g. the quality of the conversational exchange.
As reported by Borsci et al. in their review of the chatbot domain, little is known about how to evaluate the end-user’s perception of quality when interacting with chatbots. There is growing interest in understanding how to assess and improve the interaction with such systems [16, 22, 31]; however, to our knowledge, there are currently no standardised tools to assess the end-user’s satisfaction with chatbots, except for the recently developed ChatBot Usability Scale (BUS-15). The BUS-15 was developed from an initial model of 42 items (i.e. key aspects associated with the experience with chatbots), derived from a systematic literature review and from interviews and surveys of chatbot designers and users. An exploratory factorial analysis of this initial model resulted in 15 items divided into five factors (Table 1) with an overall reliability of 0.87: factor 1 is composed of two items (Cronbach’s alpha = 0.87), factor 2 of seven items (alpha = 0.74), and factor 3 of three items (alpha = 0.86); factors 4 and 5 are single items. Moreover, the BUS-15 factors strongly correlate with the UMUX-LITE (between 0.61 and 0.87), suggesting that the BUS-15 is reliable when used to assess the end-user’s overall satisfaction while also adding elements not considered by classic satisfaction scales.
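The per-factor reliability coefficients reported above (Cronbach’s alpha) can be computed from a respondents-by-items response matrix. A minimal sketch, assuming complete numeric responses (the data below are illustrative, not from the study):

```python
import numpy as np

def cronbach_alpha(responses) -> float:
    """Cronbach's alpha for a respondents-by-items response matrix.

    alpha = k/(k-1) * (1 - sum(item variances) / variance of sum scores)
    """
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                          # number of items
    item_vars = responses.var(axis=0, ddof=1)       # variance of each item
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Two perfectly consistent items yield alpha = 1.0.
perfect = [[1, 1], [2, 2], [3, 3], [5, 5]]
# Partially inconsistent responses lower alpha.
noisy = [[1, 1], [2, 3], [3, 2], [4, 4]]
```

In practice, alpha would be computed separately on the columns belonging to each BUS-15 factor (e.g. the seven items of factor 2), which is why single-item factors 4 and 5 have no associated alpha.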
The present work aims to perform a confirmatory factorial analysis of the BUS-15, testing its psychometric properties and potential alternative factorial models. Confirmatory analysis is considered a necessary step [32, 40] to validate a new scale by statistically checking and optimising the factorial model that emerged in the exploratory analysis. In addition, we also conduct an analysis from a designometric perspective. Designometrics was recently introduced by Schmettow, who notes that the purpose of a user experience self-report scale (such as the BUS or the UMUX-LITE) is to compare designs, whereas psychometrics is focused on people. While the statistical workflow is the same, the data and the interpretation differ. A psychometric analysis of reliability requires multiple persons to respond to multiple items, yielding what is often referred to as the psychometric response matrix. In a designometric assessment the data collection must also include a sample of designs, resulting in a three-dimensional response matrix, which can be reduced to a design-by-item matrix to fit standard psychometric tools (such as reliability scores and exploratory and confirmatory factorial analysis). The interpretation of a designometric analysis refers to designs, rather than to people. If a chatbot satisfaction scale has good designometric reliability, it measures very precisely how well a chatbot can lead to a high degree of user satisfaction. In contrast, under a psychometric perspective, the same scale measures how easily individual users are satisfied by any chatbot. These interpretations are clearly not the same, and the second could be considered less relevant in the context of interactional assessment; Schmettow even went as far as calling the validation of designometric scales solely under a psychometric perspective a psychometric fallacy.
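The reduction of the three-dimensional response matrix described above can be sketched as follows. The array shapes and ratings are hypothetical, chosen only to illustrate the two views of the same data; they are not those of the actual study.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical response cube: 30 persons x 8 chatbot designs x 15 items,
# each cell a rating on a 1-5 scale (illustrative data only).
cube = rng.integers(1, 6, size=(30, 8, 15)).astype(float)

# Psychometric view: persons x items, collapsing over designs.
# Rows are people; reliability describes how consistently *persons* differ.
person_by_item = cube.mean(axis=1)   # shape (30, 15)

# Designometric view: designs x items, collapsing over persons.
# Rows are chatbot designs; reliability describes how consistently
# *designs* differ in the satisfaction they elicit.
design_by_item = cube.mean(axis=0)   # shape (8, 15)
```

Either matrix can then be passed to standard tools (reliability coefficients, exploratory or confirmatory factorial analysis); what changes is the unit of analysis that the results describe.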
Here, we present both perspectives, on two grounds: firstly, psychometric evaluation of user experience scales is mainstream, and we aim for compatibility with existing research. Secondly, in the present case the focus is on the structure of a multi-scale inventory (BUS-15). We predict that the partitioning of items into multiple scales is dominated by mental processes, and hence would produce similar results under both perspectives.
Moreover, additional aims of this work are (i) to test the validity of three versions of the BUS, translated by native speakers into Spanish, Dutch, and German, and (ii) to investigate the correlation (convergent validity) between the BUS scale and the UMUX-LITE that was identified in the previous study of Borsci et al.