1 Introduction

In recent years, technological advances in artificial intelligence (AI) and natural language processing (NLP) have accelerated the proliferation of speech-based dialog systems (SDSs) in customer service (Rzepka et al. 2020). Organizations leverage SDSs as a cost-efficient alternative to human operators to perform routine support tasks such as answering frequently asked questions, authenticating customers, or transmitting process-relevant information (Jusoh 2018; Doherty and Curran 2019). Businesses can benefit from the use of SDSs by saving personnel costs while satisfying customers with 24/7 availability and reduced hold times in phone queues (Jusoh 2018; Kaczorowska-Spychalska 2019). Given these capabilities, SDSs have the potential to create competitive advantages such as increased customer loyalty, net promoter scores (i.e., the likelihood to recommend a company), and sales conversion rates (Deloitte 2019).

However, customers often report frustration when interacting with SDSs due to poorly designed solutions (Walsh et al. 2018; zendesk 2019). Aside from technical issues such as limited natural language understanding capabilities, SDSs in business practice frequently exhibit deficiencies in their capacity to engage in convincing, human-like, and goal-oriented conversations (Forrester 2017). These deficiencies are related to the design of the dialog strategy employed by an SDS. For instance, many SDSs employ closed dialog strategies that allow navigation only along predefined paths (Dale 2016). Users can perceive this as unsatisfactory, as listening to long instructions and predefined menu options can be tedious (Walsh et al. 2018). Similarly, SDSs that are less path-oriented and more open to direct customer queries often fail to meet the expectations raised by their imitation of human conversation, as they are incapable of responding convincingly to users’ individual intents, e.g., due to the variety of possible inputs (Forrester 2017; Kirkpatrick 2017). Taken together, these issues demonstrate that reaping the touted benefits of SDSs requires careful dialog design to create a satisfying user experience.

Following the technological advances in the realms of AI and NLP, related studies in the fields of human–computer interaction and information systems (IS) have devoted growing attention to the design of dialog systems in recent years. However, the vast majority of these studies address the design of text-based dialog systems (Diederich et al. 2022). For instance, Gnewuch et al. (2017) introduce four design principles for social and cooperative chatbots in customer service. Aside from suggesting the integration of social cues to provide a human-like dialog, the authors propose the implementation of informative opening messages and conversational breakdown recovery strategies to ensure a goal-oriented interaction. Existing design studies on speech-based interaction address socio-phonetic design (Schmitt et al. 2021) or anthropomorphic features (Pfeuffer et al. 2019) without focusing on dialog strategies. Moreover, studies addressing SDSs frequently adopt a behavioral science perspective to synthesize evidence-based recommendations for the design of dialog systems; several of them empirically compare the user experience of different dialog strategies (e.g., Jurafsky 2000; Chu et al. 2005; Merdivan et al. 2019). Accordingly, the existing body of knowledge lacks design principles targeted at operationalizing dialog strategies for SDSs in customer service.

On a conceptual level, dialog strategies can be operationalized through a frame-based or a finite-state dialog strategy. SDSs that follow the finite-state approach are characterized by a system-guided dialog based on predefined menu options (also referred to as closed dialog strategies). By contrast, frame-based SDSs offer users the possibility to freely express their concerns in response to open questions in a human-like conversation (also denoted as open dialog strategies) (Griol et al. 2017). However, which dialog strategy is appropriate for providing a satisfying user experience in customer service settings is not evident (Meng et al. 2003; Savcheva and Foster 2018). To provide guidance to scholars and practitioners on this design uncertainty and to address the lack of design knowledge related to dialog strategies for SDSs, our overarching objective in this study is to systematically develop, theoretically ground, and justify design knowledge in terms of a design theory. The design theory draws on Bunt’s (2000) dialog theory and comprises both requirements and design principles (DPs) for SDS dialog strategies in customer service. We empirically evaluate instantiations of the DPs in terms of user experience through a two-phase experiment with 205 participants to test the hypotheses concerning the proposed design theory.

The remainder of this study is organized into several sections. In Sect. 2, we outline the theoretical background and related work in the area of SDSs. In Sect. 3, we explain our multi-method design science-oriented research approach. In Sect. 4, the requirements and DPs are elaborated and modified in three iterations. Additionally, we describe the development of the SDS prototypes based on the elaborated requirements and DPs. In Sect. 5, we describe the evaluation of the artifact. We thereafter discuss the findings of our research in Sect. 6 by highlighting the main implications for research and practice and outlining the limitations of our study. Concluding remarks are provided in Sect. 7.

2 Theoretical background

2.1 Speech Dialog Systems in customer service

To date, a wealth of terms is used for different kinds of dialog systems, including digital assistants, chatbots, conversational agents, and machine conversation systems (McTear et al. 2016, p. 39; Luger and Sellen 2016; Diederich et al. 2019a). Dialog systems can process concerns and inquiries from customers based on text- or speech-based inputs. Our focus is on SDSs used in phone-based customer service. Speech as an interaction modality in customer service remains very popular with customers (zendesk 2019). An SDS thereby serves as the machine-based interface to customer service that resembles human support personnel (Cho et al. 2019); furthermore, as several studies indicate, an SDS can be convenient for customers because issues can often be explained more quickly orally than in writing (Ruan et al. 2018; Pfeuffer et al. 2019; Schmitt et al. 2021). This is especially true for elderly users, who are less familiar with typing (Pfeuffer et al. 2019; Gupta 2021).

SDSs can be differentiated into task-oriented and non-task-oriented systems (Hussain et al. 2019; Mairittha et al. 2019). Task-oriented systems are designed to assist users in performing basic tasks in short dialogs, such as booking a flight or purchasing a product, whereas non-task-oriented systems are configured to simulate a natural conversation that resembles human-to-human interactions (Hussain et al. 2019). The focus of our study is on task-oriented SDSs, as we explore customer service, which generally revolves around solving a specific request or concern.

As schematically illustrated in Fig. 1, an SDS consists of five central modules: (1) automatic speech recognition, (2) natural language understanding, (3) dialog management, (4) response generation, and (5) speech generation (Merdivan et al. 2019).

Fig. 1 Schematic structure of an SDS, adapted from Merdivan et al. (2019)

When a customer contacts customer service, usually via telephone, the system provides an introduction to which the customer responds with spoken words. These utterances are picked up by the microphone of the user device and transmitted to the automatic speech recognition module, which converts the speech into text for further processing. The resulting text serves as input for an NLP engine, which interprets it by extracting semantic information such as dialog acts, entities, and intents (Firdaus et al. 2021). In linguistic terms, a dialog act is a functional tag of an utterance (e.g., question, statement, conversation opener), entities are named objects such as places, times, customer numbers, or personal names, and intents relate to the user goal (Chen et al. 2018; Firdaus et al. 2021). To capture the semantics of an utterance, dialog acts, entities, and intents are classified according to predefined classification schemes (Firdaus et al. 2021). The central module of an SDS is the dialog manager, which fulfills several functions, namely providing and updating the dialog context, coordinating external modules, and deciding what information is needed and when it should be extracted (Traum and Larsson 2003). Thus, dialog management can be understood as the component of an SDS that is responsible for controlling dialog flows and making context-based decisions (McTear et al. 2016, p. 210; Zhao et al. 2019). In addition, dialog management defines how incorrect, unforeseen, or unclear information is handled. Coordinating the dialog flow has a major impact on user satisfaction and consequently requires ample attention during development (McTear et al. 2016, p. 210).
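
For illustration, the hand-off between the NLU module and the dialog manager can be sketched as follows; this is a minimal toy example, and all names, fields, and the decision logic are hypothetical rather than taken from any specific SDS:

```python
from dataclasses import dataclass, field

@dataclass
class NluResult:
    """Hypothetical output of the NLU module for a single utterance."""
    dialog_act: str   # functional tag, e.g., "question" or "statement"
    intent: str       # classified user goal, e.g., "book_experience"
    entities: dict = field(default_factory=dict)  # extracted entity values

def decide_next_action(result: NluResult, required: list) -> str:
    """Toy dialog-management policy: ask for the first missing entity,
    otherwise confirm the captured values."""
    missing = [slot for slot in required if slot not in result.entities]
    return f"ask_for:{missing[0]}" if missing else "confirm_values"

# Example turn: the user states an intent and two of three required entities.
parse = NluResult(
    dialog_act="statement",
    intent="book_experience",
    entities={"experience": "canoeing", "participants": 7},
)
print(decide_next_action(parse, ["experience", "participants", "date"]))
# -> ask_for:date
```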

Depending on the utterance of the customer and the selected dialog strategy, the system response is generated by the response generation module (Klüwer 2011). The most widespread method is the use of response templates with so-called slots or placeholders that are filled with the entities from dialog management (Singh and Arora 2020). In the last step, the generated response is rendered in natural language by the speech generation module, which synthesizes speech (Burgoon et al. 2017, p. 257). Depending on the dialog flow, multiple conversation turns may be necessary to fulfill the customer’s inquiry (Merdivan et al. 2019).
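
A template-based response generator as described above can be sketched in a few lines; the template texts and slot names are illustrative and not the ones used in our prototypes:

```python
# Illustrative response templates with placeholder slots in curly braces.
TEMPLATES = {
    "ask_for:date": "On which date would you like to book {experience}?",
    "confirm_booking": "All right, {experience} for {participants} people on {date}.",
}

def generate_response(template_id: str, slots: dict) -> str:
    """Fill the selected template with entity values from dialog management;
    the resulting string is then passed to the speech generation module."""
    return TEMPLATES[template_id].format(**slots)

print(generate_response("ask_for:date", {"experience": "canoeing"}))
# -> On which date would you like to book canoeing?
```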

2.2 Dialog strategies for Speech Dialog Systems

The underlying strategy of a dialog system steers the flow of the conversation. This conversational flow can be directed by following the finite-state approach, such that the user navigates through the system by using predefined menu options (McTear 2017). With this approach, the system initiative is generally high, offering possible paths of the conversational direction, whereas user initiative is limited to selecting the preferred path by command (Chu et al. 2005). Thus, the underlying closed dialog strategy provides system-guided support, which aims to collect relevant data successively through a fixed sequence of questions (McTear 2002). Accordingly, the main goal of the closed dialog strategy is successful task realization through system-controlled guidance, thereby providing structure for all menu and error correction options and narrowing down the possible utterances (Lee et al. 2017). By contrast, the frame-based approach (open dialog strategy) merely determines the boundaries of the conversation and offers users the possibility to freely express their concerns (Torres et al. 2019). Instead of detailed menu prompts, open questions convey a natural conversation, thus imitating human dialogs (Griol et al. 2017). This feature allows users to directly name multiple entities, which are captured and, if necessary, supplemented through specific system questions to obtain the required information (slot filling) (Singh and Arora 2020). Consequently, in contrast to the finite-state approach, this approach is characterized by low system initiative and a low level of system support. However, this conversational flexibility also has shortcomings: due to the wider range of expressions to be considered in model training, such systems are more error-prone (Lee et al. 2017).
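
The conceptual difference between the two strategies can be made concrete with a minimal sketch: the finite-state (closed) strategy walks a predefined state graph, whereas the frame-based (open) strategy tracks a frame of slots that the user may fill in any order. States, slots, and prompts below are illustrative assumptions, not the configuration of any real system:

```python
# Finite-state (closed) strategy: the dialog follows a predefined state graph.
MENU_STATES = {
    "main": {"prompt": "Say 'book', 'edit', or 'cancel'.",
             "transitions": {"book": "choose_experience", "edit": "edit_booking"}},
    "choose_experience": {"prompt": "Which experience would you like?",
                          "transitions": {}},
}

def closed_turn(state: str, command: str) -> str:
    """Follow a predefined path; unknown commands keep the current state."""
    return MENU_STATES[state]["transitions"].get(command, state)

# Frame-based (open) strategy: a frame of slots filled in any order.
def open_turn(frame: dict, extracted: dict):
    """Merge whatever the user mentioned; return the next empty slot, if any."""
    frame.update({k: v for k, v in extracted.items() if k in frame})
    empty = [slot for slot, value in frame.items() if value is None]
    return empty[0] if empty else None

print(closed_turn("main", "book"))  # -> choose_experience
frame = {"experience": None, "participants": None, "date": None}
print(open_turn(frame, {"experience": "canoeing", "date": "August 12"}))
# -> participants
```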

Although closed dialog strategies predominate in business practice (Dale 2016), researchers have been interested in the differences and comparisons between dialog strategies since the 1990s. Delogu et al. (1998) examine the first forms of interactive voice response systems based on natural speech input and compare them with closed dialog technologies such as dual-tone multi-frequency (DTMF) signaling, which allows the user to interact with the system via a closed menu using phone keys. Similarly, a number of more recent studies compare open with closed SDSs, yielding rather contradictory results. For example, Meng et al. (2003) demonstrate that open dialog systems are superior to closed dialog systems in terms of performance accuracy and error rates when used in a simple foreign exchange domain. By contrast, Savcheva and Foster (2018) show that open dialog systems do not provide higher customer satisfaction, although the more human-like interaction leads to a somewhat more efficient interaction in terms of errors encountered. Given these mixed findings (Meng et al. 2003; Savcheva and Foster 2018), we do not consider it beneficial to determine whether an open or a closed dialog strategy is generally preferable, as both strategies have their strengths and weaknesses depending on the use scenario. Rather, dialog strategies must be adapted to the specific use case to ensure a satisfactory user experience (Kvale et al. 2021). Therefore, our study focuses on devising a design theory that integrates design principles that apply to both strategies and aim for a satisfactory user experience in customer service. We instantiate an SDS with an open and a closed dialog strategy to evaluate the utility and effectiveness of the devised design theory for both dialog strategies and to highlight their strengths and weaknesses in the user experience in the context of a task-oriented use case.

3 Design Science Research Approach

Our primary purpose is to devise and evaluate a design theory that integrates the design principles of an open and a closed dialog strategy for an SDS in customer service. Therefore, we rely on the design science research (DSR) paradigm to guide the entire research process, based on the wealth of guidelines and principles proposed by various IS scholars (Walls et al. 1992; Hevner et al. 2004; Gregor and Jones 2007; Peffers et al. 2007; Gregor and Hevner 2013). Similar to several earlier DSR studies (Markus et al. 2002; Abbasi and Chen 2008; Meth et al. 2015; Diederich et al. 2020), we follow the approach of Gregor and Jones (2007) and Walls et al. (1992) and propose a design theory for SDSs for a class of artifacts, with the aim of addressing a class of problems rather than a single artifact. In doing so, we strive for a solid theoretical foundation and the empirical validation of our design theory throughout the research process.

With this purpose in mind, we draw on the DSR methodological approach of Peffers et al. (2007), which provides a structured development process with several continuous design and evaluation cycles. Although Peffers et al. (2007) adopt a different view toward the role of artifacts and design theory, their multi-step approach supports the basic principles of the design theory development process as proposed by Gregor and Jones (2007) and Walls et al. (1992); therefore, this multi-step approach is considered appropriate for guiding the research process of the current study. The development and evaluation of the SDS design theory takes place in three iteration rounds, as illustrated in Fig. 2.

Fig. 2 Design science research approach according to Peffers et al. (2007)

The development process is initiated by a brief problem identification and motivation (Activity 1) to justify the value of a solution for the problem. The objectives for a solution are subsequently defined (Activity 2). Through a systematic literature review according to vom Brocke et al. (2009) and Webster and Watson (2002), we acquire knowledge about the research problem as well as justificatory knowledge that can be used to inform the design of our artifacts in the sense of kernel theory (Walls et al. 1992; Gregor and Jones 2007; Gregor and Hevner 2013). Accordingly, the systematic literature review benefits not only the initial steps of problem identification and motivation (Sect. 1, Introduction) and the definition of objectives for a solution (Sect. 2, Theoretical Background) but also the subsequent design and evaluation (Activity 3) (Sect. 4, Design Theory, and Sect. 5, Evaluation).

Due to the interdisciplinary nature of the research topic located at the interfaces between IS, computer science, human–computer interaction, and other related fields (Gnewuch et al. 2017), we conduct a broad literature search in several interdisciplinary databases. The literature search yields a sample of 74 articles that can be considered relevant for the objectives of our study.

Based on this body of academic and business knowledge, we start the design of the artifact (Activity 3) by identifying a set of 14 requirements to help us address the class of goals to be achieved. Guided by these requirements, we explicate five DPs following the principles of Gregor et al. (2020) to meet the requirements for designing a dialog strategy for SDSs in customer service. Central to each design theory is a set of hypotheses for testing whether the proposed DPs meet the requirements (Walls et al. 1992; Gregor and Jones 2007). Thus, we initially develop a system architecture based on the five DPs, which serves as a foundation for the subsequent development of the prototypes using the evolutionary prototyping approach. Consistent with the principles of the DSR approach, the evolutionary prototyping method is characterized by a process of constant revision, refinement, and testing of an artifact (Davis 1992). This method enables us to develop, test, and redesign the SDS in several iterations until we meet the requirements. The prototypes are iteratively tested by potential users and modified based on the users’ feedback (Carter et al. 2001). Only when the SDS is considered to meet the requirements is the evolutionary prototyping of the respective DSR iteration complete, and the next activity can begin (Activity 4). A diligent evaluation process is considered essential to every DSR project (Hevner et al. 2004; Peffers et al. 2007). Similarly, Walls et al. (1992) and Gregor and Jones (2007) posit that design theories must be subject to a thorough empirical investigation to test the hypotheses concerning the proposed design theory.

We follow the framework of Venable et al. (2016) to develop an appropriate evaluation strategy comprising three evaluation rounds in a naturalistic setting (Activity 5). Naturalistic evaluation allows us to study the performance of our artifacts in a real business environment and to increase the external validity and rigor of the assessment process (Pries-Heje et al. 2008; Venable et al. 2016). After the first prototyping phase, our initial SDS prototypes are evaluated by four SDS experts with several years of professional experience in the field of dialog systems and user-experience design. The experts analyze the systems in terms of usability and feasibility through the cognitive walkthrough method, an effective approach for evaluating the design of user interfaces in early prototyping phases based on cognitive theory (Rieman et al. 1995). In the second iteration round, we subject the revised prototypes to further testing by five potential users to ensure that the tasks in the dialog system can be mastered without prior experience or further assistance. We use the feedback of the experts and users from both iterations to revise and refine our prototypes. Finally, in the third iteration, we conduct a two-phase experiment with 205 participants to empirically validate the developed prototypes. For this purpose, we invite the participants to take part in a remote test in which two tasks have to be completed using the prototypes. The participants subsequently fill out a user-experience questionnaire based on their experiences with the closed and the open SDS. The questionnaire is informed by the Subjective Assessment of Speech System Interfaces framework established for evaluating SDSs (Hone and Graham 2000). With the completion of the user survey, the design and evaluation cycle ends. Finally, as the communication step of the DSR process, we interpret and present the results and key findings (Activity 6).

4 Design theory of the Speech Dialog Systems

In this section, we focus on the development of the design theory according to the model proposed by Walls et al. (1992) and Abbasi and Chen (2008), which encompasses four main design components of a design theory (cf. Table 1). To ensure a consistent design theory, we draw on the body of knowledge in the research field of SDSs for a theoretical foundation of the development process. We thereby use dialog theory (Bunt 2000) as justificatory knowledge (kernel theory) to identify the requirements as a major precondition for deriving the corresponding DPs that can be adapted to the dialog management of the SDSs. Several other theories would be equally suited to guiding the socio-technical design of speech dialog systems, such as task–technology fit theory (Goodhue 1995), according to which a match between task characteristics and technology characteristics leads to improved user performance, or social response theory (Nass and Moon 2000; Moon 2000) and embodied social presence theory (Mennecke et al. 2011), which consider technologies such as SDSs as social actors that should be designed to be as human-like as possible. In our research, we rely on dialog theory as kernel theory because it is ideally suited to guiding the design of dialog systems that assist users with simple tasks in short dialogs, thus aligning with our focus as described in Sect. 2. Furthermore, dialog theory provides guidelines on the socio-technical design of dialog systems, for example on the communicative behavior of the agents (Bunt 2000).

The requirements represent a set of main goals and requisites that specify the functions of an SDS. The DPs, in turn, form a set of corresponding principles devised from the requirements. Guided by the main features of frame-based (open) and finite-state (closed) dialog strategies as presented in Sect. 2.2, we identify five main categories of requirements and DPs: prompt design (with menu design as its counterpart for the closed strategy), persona design, confirmation strategy, error management, and functional design. As shown in Table 1, we also include five hypotheses (H1–H5) that serve as a foundation for the subsequent qualitative and quantitative evaluations of the prototypes to empirically validate the value claims of an open SDS compared to those of a closed SDS. We proceed to develop the requirements and DPs based on justificatory knowledge. We subsequently present the empirical results of our qualitative and quantitative evaluations.

Table 1 Main components of the design theory of an SDS dialog strategy

4.1 Requirements and Design Principles for the Speech Dialog System

According to Walls et al. (1992), a design theory includes prescriptive instructions for how to realize a more effective and feasible design and use. With regard to our design theory for an SDS, we must therefore identify the main requirements and DPs that help us achieve these goals. According to Bunt’s dialog theory (2000, p. 2), an SDS consists of “structures of goals, beliefs, preferences, expectations, and other types of information, plus memory and processing capabilities” that dynamically change during communicative acts as a reaction to other acts. In task-oriented dialogs in customer service, the goal of users is to express their concerns and inquiries in natural language and to have their requests handled effectively. To this end, we identify requirements related to the DP categories of prompt design, menu design, persona design, confirmation strategy, error management, and functional design. These requirements are essential to support the user through the dialog and to achieve the desired objective.

The first category of requirements is concerned with the design of system prompts as one major design aspect of SDSs. In this context, dialog theory assumes that communicative agents strive for rationality in reaching their goals (Bunt 2000), which in turn requires an effective design of an SDS. In recent years, an increasing number of studies on the prompt design of dialog systems have been published (Robertson et al. 2016; Jha 2019; Przegalinska et al. 2019). Overall, the academic literature agrees that system responses should be kept short because long messages confuse the user (Delogu et al. 1998; McTear et al. 2016, p. 64). Moreover, direct and precise expressions and a strong task orientation should guide the answers of users (R1) (McTear et al. 2016, p. 64; Robertson et al. 2016; Jain et al. 2018). Additionally, for the sake of comprehensibility, Lewis (2016, p. 222) advises using simple expressions and little variation in technical terms. The SDS should convey competence within the context of the application while remaining comprehensible (R2) (Verhagen et al. 2014). Based on these requirements, the following DP can be explicated:

DP1

For SDS designers to shape an efficient dialog between customers and an SDS, ensure that the SDS employs brief but precise and goal-oriented prompts; such a design facilitates the understandability of the dialog while conveying competence in addressing the customer’s goal and task (Verhagen et al. 2014).

Aside from the task-oriented acts, dialog control acts are considered important for a smooth and successful communication according to dialog theory (Bunt 2000). Dialog control acts comprise social acts and behaviors for natural communication purposes. In this regard, another crucial stream of human–computer interaction research currently deals with anthropomorphism to examine the impact of human-like characteristics and design elements of conversational agents, so-called “social cues,” on user perception (Araujo 2018; Pfeuffer et al. 2019; Diederich et al. 2020). Prior evidence shows that anthropomorphic characteristics are not necessarily related to a higher trustworthiness of a system; instead, their impact depends on the specific context. When the system is intended to replace a human expert (e.g., for customer support), human-like characteristics are considered beneficial for generating familiarity and trust with the agent. By contrast, the humanness of a system is not considered helpful when it is designed to substitute an existing computer system, given the “automation bias” (Diederich et al. 2020). Guided by these findings, we adopt the view that human characteristics are positively related to the trustworthiness of conversational agents when deriving the requirements and DPs for our design theory in the context of customer support.

In customer service, customer satisfaction depends not only on measurable criteria (i.e., the time required to process the request) but also on social factors such as the feelings of users (Hudson et al. 2017). Therefore, a major aim is to create high-quality conversations that resemble human interaction in terms of not only expression but also the emotions generated (Lee and Choi 2017). According to the academic literature, users desire certain human characteristics when interacting with dialog systems. First, a dialog system should be honest and authentic (Przegalinska et al. 2019), that is, it should neither deny its status as a machine nor behave like one (Luo et al. 2019). The positive associations of an efficiently and rationally acting machine should be combined with the communication characteristics of a human interlocutor (R3) (Portela and Granell-Canut 2017). Nonetheless, the SDS should admit mistakes without making the user feel responsible for them to maintain user trust (R4) (Branham and Mukkath Roy 2019).

According to dialog theory, the communicative behavior of the agents, including communicative acts such as greetings, apologies, gratitude, and agreement, should conform to the social norms and conventions of the specific context and application area (Bunt 2000). Following this recommendation, the mode of expression should correspond to the specific context of an application (Gnewuch et al. 2017). If the context of application allows informal language, users may appreciate small talk, humor, sarcasm, and playfulness (Hill et al. 2015; Jain et al. 2018). However, the SDS should not show negative character traits such as being rude or offensive. Users prefer a friendly dialog partner, provided this friendly behavior does not appear artificial (R5) (Verhagen et al. 2014). With regard to the voice, there is no prevailing opinion on whether the voice of an SDS should generally be female or male (Luo et al. 2019). Nonetheless, Eyssel et al. (2012) state that users prefer a voice of their own gender. In summary, the following DP should be considered when designing an SDS character:

DP2

For SDS designers to enable customers to have a human-like dialog with an SDS, prompts should be responsive to errors, and their expressions should be appropriate to the customer service context, using natural and friendly phrases coupled with social cues for a more comfortable and trusting interaction (Lee and Choi 2017; Gnewuch et al. 2017).

The third category of requirements is concerned with confirmation and error management strategies, which are recognized as major components of an SDS (Gnewuch et al. 2017). Confirmation strategies check whether the system has correctly captured the variables once the customer responds to a question or makes a request (R6). A distinction is made between explicit and implicit confirmation strategies (McTear et al. 2016, p. 214). Explicit strategies prompt users to actively confirm their input, whereas implicit strategies only require passive confirmation (Lee et al. 2010). In the latter case, the system repeats the mentioned inputs in connection with a new question. If the user answers this question, the system automatically confirms the variables (McTear et al. 2016, p. 214). If the repeated variables do not apply, the user can point this out, and the system initiates the correction process (R7) (Lee et al. 2010). Mané and Levin (2008) conclude that users prefer implicit strategies; McTear et al. (2016) likewise consider implicit strategies as beneficial to a more efficient conversation. To assess these findings in more detail, we explore and evaluate whether an implicit or explicit confirmation strategy is preferable. The DP regarding the confirmation strategy is as follows:
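
To make the distinction tangible, the two confirmation styles can be sketched as prompt builders; the wording is purely illustrative:

```python
def explicit_confirmation(slot: str, value: str) -> str:
    """Explicit strategy: the user must actively confirm before proceeding."""
    return f"I understood {value} as the {slot}. Is that correct?"

def implicit_confirmation(slot: str, value: str, next_question: str) -> str:
    """Implicit strategy: the captured value is repeated with the next
    question; answering it passively confirms the value, while objecting
    triggers a correction turn."""
    return f"{value} for the {slot}. {next_question}"

print(explicit_confirmation("date", "August 12"))
print(implicit_confirmation("date", "August 12", "How many people will take part?"))
```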

DP3

For SDS designers to ensure that an SDS has correctly captured all the required information during a conversation turn, a confirmation strategy should be implemented to guide customers in providing required and missing values for a structured and effective conversation (McTear et al. 2016, p. 214).

SDS error management similarly requires special attention (McTear et al. 2016, p. 266). Errors and dialog breakdowns can occur when user statements cannot be assigned to an intent (Uchida et al. 2019), which can create negative experiences and consequently reduce user trust in the SDS (Begany et al. 2016). To prevent the interruption of a conversation, an error management strategy is required (Opfermann and Pitsch 2017). Additionally, the error prompt should be based on the type of error; for example, the SDS must react differently to misunderstandings of the automatic speech recognition module than to unrecognized intents (R8) (Opfermann and Pitsch 2017). If further errors occur despite error prompts, the SDS must react in a differentiated way to boost the chance of problem resolution (R9). Multi-stage error recovery strategies increase the level of assistance when the system repeatedly misunderstands the user. In particular, the error recovery strategies “ask” and “solve” according to Benner et al. (2021) are employed. The “ask” strategy includes the options for the customer to make another request at any stage of the dialog and to rephrase the request after the input options are repeated, whereas the “solve” strategy aims to actively provide solutions for avoiding the dialog breakdown. Overall, the following DP should be considered for a consistent error management strategy:
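
A multi-stage escalation of this kind can be sketched as follows; the staging loosely follows the “ask” and “solve” strategies of Benner et al. (2021), while the concrete prompts and thresholds are our own illustrative assumptions:

```python
def error_prompt(error_count: int, error_type: str) -> str:
    """Escalate assistance with repeated errors: first 'ask' the user to
    repeat or rephrase, then 'solve' by actively offering a way out."""
    if error_type == "speech_not_recognized" and error_count == 1:
        return "Sorry, I did not catch that. Could you please repeat it?"
    if error_count == 1:   # unrecognized intent: ask the user to rephrase
        return "Sorry, I did not understand. Could you rephrase your request?"
    if error_count == 2:   # ask again, now with an example statement
        return ("You could say, for example: 'I would like to book the "
                "canoeing experience for seven people on August 12.'")
    # Third error and beyond: solve by actively offering concrete options.
    return "Shall I list the available options, or connect you with an agent?"

for n in (1, 2, 3):
    print(error_prompt(n, "intent_not_recognized"))
```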

DP4

For SDS designers to equip the SDS to handle errors (e.g., unrecognized intents, wrong navigation turns) and dialog aborts without interrupting the conversation for customers, a multi-stage error recovery strategy should provide customers with context-sensitive support to successfully communicate their requests (Begany et al. 2016).

The final important set of requirements is related to the functional design of the dialog flow, which defines the rules for the entire dialog course and thus describes the users’ different action alternatives (Handoyo et al. 2018). A logical dialog structure should enable the effortless handling of the system by incorporating the user perspective and using available information (Gardner-Bonneau and Blanchard 2007). The automatic verification and completion of user input ensure a more efficient dialog flow (Jain et al. 2018). For example, incomplete addresses can be completed with the help of the Google Maps API to enable a more efficient dialog (Vaira et al. 2018). As another example, the integration of mathematical checksums can help validate credit card or customer numbers (Pearl 2016).
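
For instance, the Luhn checksum that most credit card numbers satisfy can be verified in a few lines, allowing an SDS to reject a misrecognized number before the dialog continues. The code below is the standard Luhn check, not the specific validation logic of our prototypes:

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, subtract 9
    from results above 9, and check that the sum is divisible by 10."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4539148803436467"))  # True: a well-formed test number
print(luhn_valid("4539148803436468"))  # False: one misheard digit
```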

To summarize, on the one hand, all necessary user options should be integrated into the dialog flow to ensure completeness (R10) (McTear et al. 2016, p. 63); on the other hand, the number of functions of a task-oriented SDS should be limited, as an oversupply of functions can result in a higher development effort, an increasing error rate, and dissatisfied users (Michiels 2017). The functions should be designed to meet the expectations of users but avoid complex tasks that dissatisfy them (R11) (Kiseleva et al. 2016). To meet these requirements, the following DP should apply:

DP5

For SDS designers to provide customers with a functional range that adds value to customer service, only domain-specific functions that meet user expectations should be included, but the functions should be limited to the essential ones to achieve customer objectives and avoid overwhelming customers with options (Michiels 2017).

As stated in the methodology section, we conduct an empirical study to highlight the strengths and weaknesses in the user experience in the context of a task-oriented use case (Walls et al. 1992; Gregor and Jones 2007). To compare the effects of the open and closed dialog strategies on user experiences in detail, comparability between the systems is required. The main difference between the two strategies can be found in the menu-oriented structure of the closed dialog system. Menu design is a major category of requirements of the closed dialog strategy. Menu prompts belong to the category of system prompts, and they should also fulfill the requirements of being efficient, precise, and understandable (Robertson et al. 2016). Thus, we consider menu design as the equivalent category of requirements and DP for the “prompt design” of the open dialog strategy.

Menus are considered important for informing the user about the possible dialog paths, but an excessive number of menu options can overwhelm users (Bigot et al. 2013). In this context, classic memory research suggests that humans can hold five to nine items in short-term memory (Miller 1956). With each additional option, the ability to remember is negatively affected (Bigot et al. 2013). Thus, the recommendation is to limit the number of menu options (R12) (Robertson et al. 2016) to a maximum of five (Bigot et al. 2013). In a similar vein, the arrangement of the options should be properly designed. In this case, the primacy-recency effect must be taken into account, according to which the information that is named first (primacy effect) or last (recency effect) is better remembered (Murdock 1962). Consequently, important or frequently requested menu options should be placed at the beginning or end to prevent errors and time-outs. Furthermore, the listing of the options to choose from should not be followed by additional information because such a structure impairs the ability to remember the previously mentioned options (R13) (Bigot et al. 2013). In addition, Gardner-Bonneau and Blanchard (2007) recommend a strong distinction between the wording of individual menu options and commands (R14). Overall, the following DP should be considered when designing menus:
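
Such constraints can even be checked mechanically at design time; the following sketch encodes the five-option limit and the primacy-recency placement rule (the helper function and its rules are hypothetical, derived from the requirements above):

```python
def check_menu(options: list, frequent: set) -> list:
    """Flag menus that exceed five options (R12) or bury important or
    frequently requested options mid-list (primacy-recency effect)."""
    issues = []
    if len(options) > 5:
        issues.append(f"{len(options)} options exceed the five-option limit")
    misplaced = [o for o in options[1:-1] if o in frequent]
    if misplaced:
        issues.append(f"frequently requested options buried mid-list: {misplaced}")
    return issues

menu = ["book experience", "edit booking", "cancel booking", "FAQ"]
print(check_menu(menu, frequent={"book experience", "FAQ"}) or "menu OK")
```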

DP6

To enable SDS designers to facilitate a menu-driven conversation between the customer and the SDS, the SDS should be equipped with a menu of up to five differentiated options within a conversation turn, with important or frequently requested menu options placed at the beginning or end of the menu to allow for a goal-oriented dialog (Lee et al. 2017).

The outlined requirements and DPs as well as the conceptual link between them are summarized in Table 2. As proposed by Gregor et al. (2020), DPs specify the mechanisms that SDS developers must implement to satisfy a particular set of requirements.

Table 2 Requirements and design principles for the design theory

4.2 System Design

To allow for a naturalistic design, we collaborate with a German IT consulting company from Lower Saxony, for whom we build two SDS prototypes: one with an integrated open dialog strategy and another with a closed dialog strategy. We apply the design theory to an adventure booking portal (Adventure Guru) that is aligned with the business logic specifications provided by our cooperation partner. The design instantiations are built and tested in each iteration. Within the first iteration, the effectiveness of the prototype design is evaluated via cognitive walkthroughs with two user-experience designers, a creative technologist, and an expert in digital business. All experts have distinguished professional experience in the field of SDSs and work for the IT consulting firm of our cooperation partner. In the second iteration, the refined design instantiations are subjected to user tests with researchers and practitioners to validate that the SDS variants can be used for booking, editing, or cancelling adventures. Following DP5, the SDS instances are equipped with a limited number of functions that allow users to book 12 different adventures (e.g., bungee jumping), edit bookings, cancel bookings, or obtain answers to frequently asked questions. To capture customer intents, we use Google’s Dialogflow phone gateway as the customer interface, along with the underlying natural language understanding engine. Although we add some specific training sentences as well as alternative wordings and statements for model training, we are otherwise able to rely on an already well-functioning natural language understanding engine. The extra training phrases are not fed into the model by speech-to-text conversion but rather as plain text. We integrate and utilize the Parloa development platform for the design and management of dialog flows. Furthermore, we design the SDS prompts for the instances to deal with customer booking inquiries and to handle customer conversations based on the business logic specifications and the devised DPs (cf. Figure 3).

Fig. 3 System architecture

Open SDS instantiation – DP1: The welcome prompt of the open SDS instantiation welcomes the user and invites them to start the conversation with the open question: “How can I help you?” (cf. Figure 5, right). The user decides on the further course of the conversation by either posing a question or placing a booking. However, the lack of clear instructions can also cause considerable uncertainty; this is intercepted if the user does not react within 4 s, whereupon the system automatically informs the user about the central functions available. This interception does not constitute instructions for action, as in the closed SDS, but is intended to provide information about the available functions. Instead of enumerating individual menu items and querying individual variables, the open SDS allows the user to provide several variables within a single statement. For example, the selected experience, the number of people participating, and the date can be recorded within one statement. However, if the user specifies only one variable, the system proactively asks for the remaining input to complete the process step. Even when the system asks for a specific variable, the user can still provide additional information concerning several variables.
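
The described behavior, accepting several variables in one statement and proactively asking for whatever is still missing, corresponds to a slot-filling loop of the following shape; slot names, prompts, and the silence fallback are modeled on the description above rather than taken from the actual Dialogflow/Parloa configuration:

```python
REQUIRED_SLOTS = ["experience", "participants", "date"]
PROMPTS = {
    "experience": "Which experience would you like to book?",
    "participants": "For how many people?",
    "date": "On which date?",
}

def next_prompt(frame: dict, extracted: dict, silence_seconds: float = 0.0) -> str:
    """Merge newly extracted entities; after 4 s of silence, hint at the
    central functions; otherwise ask for the first missing slot."""
    frame.update(extracted)
    if silence_seconds >= 4.0:
        return "You can book, edit, or cancel an experience, or ask a question."
    for slot in REQUIRED_SLOTS:
        if slot not in frame:
            return PROMPTS[slot]
    return "All right, I have everything I need."

frame = {}
print(next_prompt(frame, {"experience": "canoeing", "participants": 7}))
# -> On which date?
```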

Fig. 4 The humorous and anthropomorphic side of the SDS (DP2)

Furthermore, the SDS is capable of telling jokes and engaging in simple small talk (cf. Figure 4); nonetheless, to ensure that the system does not lose task orientation, the prompts always end with the question of the respective process step. If the user deviates so far from the actual task that the system cannot interpret the statement, an error prompt occurs. In such cases, the multi-stage error recovery strategy enables the user to correct errors by intervening with statements such as “this is not correct; the booking should actually be made for the [date].” If the correction results in another error, the system provides an example statement and, if necessary, refers to the corresponding help intent. The level of assistance only increases after the second error. However, the error prompts remain short and rely on the user’s initiative to independently correct the error. The correction within the open SDS can be realized through a single user turn (DP4).

Fig. 5 Welcome prompt – closed SDS (left) and open SDS (right)

Closed SDS instantiation – DP6: The closed SDS starts by welcoming the user, naming the menu options, and offering navigation hints (cf. Figure 5, left). Booking experiences constitutes the central functionality of the SDS and is therefore named first among the main menu options. We place the frequently asked questions last as a form of assistance in case of uncertainties.

In addition to the menu options, the closed SDS lists the command options to help the user understand how to operate the SDS. Due to the tree navigation structure of the closed dialog strategy, an incorporated “return” command ensures easy and quick navigation corrections during the booking process. Additionally, by selecting the “main menu” command, the user can cancel ongoing processes and return to the main menu, where only the welcome prompt is repeated, as the user should still be familiar with the command options. After selecting a menu path (e.g., “book experience”), the menu items of the next navigation level are listed, and necessary input variables such as experience category, experience, number of participants, and date are successively captured. If errors occur despite the coherent closed dialog strategy, the SDS responds with the prompt “Sorry, I’m probably hearing particularly badly today. Could you please repeat that?” The SDS thus admits its mistake in a funny and friendly way and asks the user to repeat the statement. Varied wordings of the error prompt ensure that the SDS does not repeat itself over the course of the dialog (DP2). The available options explicitly express that the user should repeat, not rephrase, the input. If the user still fails to select the desired option, the system assistance is increased. For example, the system advises the user to follow the exact wording of the menu options before repeating them (DP4).

In the formative evaluation cycles, it became apparent that implicit confirmation is not always understood as confirmation of data entry; hence, we implement explicit confirmation after each process step, although this approach lengthens the dialog. Successful user inputs are explicitly acknowledged with “all right,” “understood,” or “OK” in both instantiations. After this acknowledgment, the user confirms the repeated variables, receives a booking number (at the end of the booking process), or returns to the main menu (DP3).

To explicitly summarize the described technical realization of our DPs, we outline the corresponding implemented design features in Table 3. These design features reflect a series of specific design choices that instantiate each DP (Meth et al. 2015; Schoormann et al. 2021).

Table 3 Design features

5 Evaluation of the Speech Dialog Systems

For the evaluation, we follow the framework of Venable et al. (2016) to ensure alignment between our research goals and framework settings, demonstrate design utility, and generate implications for research and practice. The applied framework comprises four steps: (1) explicate the goals of the evaluation, (2) select the evaluation strategy, (3) determine the properties to evaluate, and (4) design the individual evaluation episodes.

(1) Given our need to analyze the utility and efficacy of both SDS dialog strategies with respect to achieving a specific goal (Venable 2006), our evaluation aims to test the rigor of both designs by assessing their functional effectiveness (Venable et al. 2012). Furthermore, we aim to outline the strengths and weaknesses of both dialog strategies by empirically testing the user experience in customer service, thereby reducing design uncertainty and risk.

(2) According to Venable et al. (2016), the “human risk and effectiveness” strategy is suitable for problem spaces where user-centric design is paramount (Gnewuch et al. 2017; Diederich et al. 2020). Therefore, with our focus on the design risks related to the interaction between users and the SDS, we also follow this strategy.

(3), (4) By applying formative qualitative evaluation methods during the first two iterations and a summative quantitative evaluation method at the end of the third iteration, we operationalize the human risk and effectiveness strategy in a naturalistic framework.

The goal of the formative evaluation cycles is to improve design and implementation to ensure effective instantiations; by contrast, the purpose of the summative evaluation is to capture the usability of the final design. After the completion of the first prototyping phase, the initial SDS prototypes are evaluated by four SDS experts who analyze the systems in terms of usability and feasibility via the cognitive walkthrough method. In human–computer interaction research, cognitive walkthrough represents an effective method for evaluating the design of a user interface in early prototyping phases based on cognitive theory (Rieman et al. 1995). For the open SDS, the results of these cognitive walkthroughs relate in particular to issues of unrecognized intents and prompt wording; for the closed SDS, they pertain to the number of options and prompt length as well as the categorization and order of prompts. The implicit confirmation strategy emerges as the most challenging issue for the experts, as it is not always understood as confirmation of data entry (cf. DP3). Based on the experts’ feedback, we refine the prototypes in the second design cycle and correct major pitfalls (e.g., change to an explicit confirmation strategy, addition of test phrases for the model training, improvement of prompt design, categorization of the service offerings). By subsequently conducting in-house user tests with five potential users, we aim to ensure that users can master the tasks in the dialog systems without prior experience and further assistance. We record, transcribe, and analyze the conducted user tests through a qualitative content analysis (Mayring 2001). The results from the user tests reveal rather minor issues (wording of the prompts, isolated intent detection issues), which are resolved by further refinement of the instances.

After the second iteration, the instantiations are prepared for the final evaluation, a two-phase experiment with 205 participants. The properties to be evaluated for the comprehensive summative evaluation are captured in the hypotheses in Table 4. The hypotheses represent “statements required to test whether the design satisfies the requirements” (Gregor and Jones 2007, p. 319).

5.1 Experimental design

To test the hypotheses, we conduct a two-phase experiment. In the first phase, the participants are asked to familiarize themselves with both the open and the closed SDS by performing two tasks in each instantiation:

  • Task 1: “Call Adventure Guru to book the canoeing experience, on August 12, for seven people.”

  • Task 2: “In the same call, you want to edit your booking and change the number to five people.”

The tasks provide a clear and comprehensible use case for the interaction with the SDSs and allow for comparability across participants. We log the user activities (i.e., completion time), errors made (number of corrections), and number of dialog steps required to complete the tasks. The log file is automatically created by an integrated function of Dialogflow as soon as the participants begin their task by calling the Adventure Guru. In the second phase of the experiment, the participants complete an online survey that captures the user experience with both instantiations. The constructs and items operationalizing the survey are drawn from existing validated measures (cf. Appendix A.4). We use the construct of perceived humanness from Gnewuch et al. (2017) to test H1 (resp. DP2, with the aim of enabling customers to have a human-like dialog with an SDS). Furthermore, we employ the constructs of the Subjective Assessment of Speech System Interfaces framework (Hone and Graham 2000); this framework is a standardized user-experience questionnaire for conversational interfaces, which features a broad selection of user-experience dimensions (Kocaballi et al. 2019). To test H2 and thereby examine the DP3 design, we use the construct of habitability, which refers to “the extent to which the user knows what to do and knows what the system is doing” (Hone and Graham 2000, p. 23). In addition, the construct of system response accuracy is utilized to test H3, which examines the DP4 design of error handling, and the construct of likability is used to test H4 by assessing preferences between an open (DP1) and a closed (DP6) menu design. We utilize a five-point Likert scale to measure all constructs. We further conduct a small-scale preliminary study to test the comprehensibility of the items and to ensure validity and reliability by refining the measurement instrument (Straub et al. 2004). To strengthen and extend the testing of the hypotheses, we include the results from the log file analysis to support the subjective assessment of the participants with objective information about the system tests, providing additional validation of DP1, DP4, and DP6 and testing the functional effectiveness anchored in DP5. In doing so, hypotheses H1–H5 allow us to test the aims of the DPs implicitly through the implementation of the SDS designers (implementers) and explicitly through customers (users). An overview of the hypotheses and the corresponding measures is presented in Table 4.

Table 4 Hypotheses and corresponding measures

To recruit participants, we use several social media groups and the public news hub of our cooperation partner. Of the 214 survey participants, nine are excluded due to incomplete data (e.g., no registered call in the system). The descriptive statistics of the remaining 205 participants are shown in Table 5.

Table 5 Descriptive statistics of the participants’ demographical data (n = 205)

5.2 Experiment results

Following the approach of Diederich et al. (2020), we analyze our data by using descriptive statistics and conducting statistical hypothesis tests. Descriptive statistics from the logging information show that participants are able to complete tasks more efficiently with the open SDS, requiring an average of 9.66 dialog steps and nearly 50 s less time than with the closed SDS, which requires an average of 11.72 dialog steps. However, navigation errors occur more frequently with the open SDS (average 1.46) than with the closed SDS (average 1.07). Moreover, the success rate in fulfilling both tasks is a convincing 96.10% for the closed system, compared to 91.22% for the open SDS. The survey data are validated for the internal consistency reliability of our latent constructs by calculating Cronbach’s alpha (α) and the composite reliability, both of which exceed the recommended threshold of 0.7 (Nunnally and Bernstein 1994). Descriptive statistics reveal higher subjective average scores for the open SDS in the areas of perceived humanness, system response accuracy, and likability, whereas the closed SDS is more convincing in habitability.
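
For transparency, Cronbach’s alpha for a construct can be computed from its item responses as follows; the data below are hypothetical Likert scores, not our survey data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of Likert scores.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Hypothetical five-point Likert responses (4 respondents, 3 items).
scores = np.array([[4, 5, 4], [3, 3, 4], [5, 5, 5], [2, 3, 2]])
print(round(cronbach_alpha(scores), 2))  # -> 0.94
```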

To assess the significance of the differences between the systems for each of the examined variables, we first analyze our continuous survey data and the logging variables for univariate normality by applying the Kolmogorov–Smirnov test, which shows significant results (p < .01) for all continuous variables, indicating that the sample distributions do not follow a normal distribution (Field 2009). Based on these pre-tests, we use the non-parametric Wilcoxon signed-rank test to conduct the hypothesis tests on our related samples (Wilcoxon 1992). In addition to the Wilcoxon signed-rank test, we use a chi-square test to examine the difference between the task success rates due to the dichotomous nature of the variable (1 = successful completion of both tasks, 0 = unsuccessful completion of both tasks). The descriptive statistics and the results of the hypothesis tests are highlighted in Table 6 (Wilcoxon signed-rank test) and Table 7 (chi-square test).
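
Both tests are available in standard statistics libraries; the sketch below uses SciPy with hypothetical data (the success counts merely mirror the reported rates, and the sketch treats the two systems as independent groups for simplicity):

```python
import numpy as np
from scipy import stats

# Paired five-point Likert scores for the same participants (hypothetical).
open_sds = np.array([5, 4, 4, 5, 3, 4, 5, 4])
closed_sds = np.array([3, 3, 3, 4, 2, 3, 4, 3])
stat, p = stats.wilcoxon(open_sds, closed_sds)  # related-samples, non-parametric
print(f"Wilcoxon: W = {stat:.1f}, p = {p:.3f}")

# Dichotomous task success per system (successes, failures out of 205).
contingency = np.array([[187, 18],   # open SDS (~91.2% success)
                        [197, 8]])   # closed SDS (~96.1% success)
chi2, p, df, _ = stats.chi2_contingency(contingency)
print(f"Chi-square: chi2 = {chi2:.3f}, df = {df}, p = {p:.3f}")
```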

Table 6 Results from the Wilcoxon signed-rank test (n = 205)

We can confirm H1 (T+, Z = −5.545, p < .01), denoting that the open SDS is perceived as significantly more human-like than the closed one. We reject H2 (T−, −2.450, p = .014), as the closed SDS is perceived as more comprehensible than the SDS that follows an open dialog strategy. We identify significant differences in perceived system response accuracy in favor of the open SDS instance (H3a, T+, −2.234, p = .025), but the number of errors is significantly higher with this strategy (H3b, T+, −2.763, p = .006). Overall, the user experience with the open SDS is perceived as significantly more likeable than with its closed counterpart (H4, T+, −5.033, p = .006). With regard to functional effectiveness (H5), we obtain mixed results depending on the definition. In terms of the time (T−, −10.344, p = .000) and the dialog steps (T−, −8.027, p = .000) required to successfully complete the assigned tasks, we observe significant advantages for the open SDS. However, when considering functional effectiveness as the total number of successfully completed tasks, we note a descriptive advantage for the closed SDS that does not reach statistical significance (χ² = 2.734, df = 1, p > .05).

Table 7 Results from the chi-square test (n = 205)

The logging data are clearly objective in nature. To substantiate the associated hypotheses (H3b, H5) from a subjective point of view, we conduct a qualitative content analysis of the open-ended answers in the survey, in which the participants can optionally report what they like and dislike about the two SDS variants. Two researchers inductively code the patterns occurring in the text fields. The coding is validated using Krippendorff’s alpha (α = 0.83) (Krippendorff 1989). We count the subjective positive and negative sentiments in the text fields per variable (number of errors, dialog steps, and duration to task completion). We summarize the counts of the examined variables assigned to H5 because the statements often refer to both the effectiveness and the efficiency of task fulfilment and therefore cannot be clearly distinguished. The results from the qualitative content analysis support the findings from the hypothesis tests (cf. Table 8).

Table 8 Results from the qualitative content analysis of the open-ended questions

Considering the demographic data, we perform the non-parametric counterpart to the one-way ANOVA, the Kruskal–Wallis H test (Kruskal and Wallis 1952), to evaluate the differences between the group distributions. The Kruskal–Wallis H test shows strong evidence of intergenerational differences in the perceptions of humanness (H = 15.921, df = 2, p < .01), habitability (H = 22.582, df = 2, p < .01), system response accuracy (H = 26.279, df = 2, p < .01), and likability (H = 22.394, df = 2, p < .01) of the closed SDS. A pairwise comparison using post-hoc (Dunn–Bonferroni) tests reveals that this result is predominantly due to the differences between Generations Z and X as well as Y and X, with the older generation showing a stronger bias in all respects toward the closed SDS and the effect increasing with the age difference. We also observe statistically significant differences between Generations Z and Y in terms of habitability; in other words, the older the user, the more comprehensible the closed SDS is perceived to be (cf. Table 9).
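
The omnibus test is available in SciPy; Dunn’s post-hoc test with Bonferroni adjustment is provided, for example, by the scikit-posthocs package. The scores below are hypothetical per-generation ratings, not our survey data:

```python
from scipy import stats

# Hypothetical habitability scores for the closed SDS by generation.
gen_z = [3.0, 3.5, 2.5, 3.0, 3.5, 2.0, 3.0]
gen_y = [3.5, 4.0, 3.0, 3.5, 4.0, 3.5, 3.0]
gen_x = [4.5, 4.0, 4.5, 5.0, 4.0, 4.5, 5.0]

h, p = stats.kruskal(gen_z, gen_y, gen_x)  # omnibus test across the groups
print(f"Kruskal-Wallis: H = {h:.3f}, df = 2, p = {p:.4f}")

# Pairwise Dunn-Bonferroni post-hoc comparison, e.g., via scikit-posthocs
# (pip install scikit-posthocs):
# import scikit_posthocs as sp
# sp.posthoc_dunn([gen_z, gen_y, gen_x], p_adjust="bonferroni")
```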

Table 9 Results from the pairwise intergenerational comparison (post-hoc)

These findings are also confirmed when analyzing the users’ preferences according to the survey results (cf. Table 10). Overall, 68.78% of users prefer the open SDS, whereas 31.22% consider the closed SDS preferable. The younger user groups of 18–24 and 25–44 years prefer the open SDS, whereas the older user group of 45–65 years clearly opts for the closed SDS.

Table 10 User preferences according to user group (open vs. closed SDS)

6 Discussion

Guided by the DSR paradigm, the primary purpose of this study is to devise and evaluate a design theory for an SDS dialog strategy in customer service. The proposed design theory, comprising 14 requirements and five DPs, is informed by the principles of dialog theory (Bunt 2000) and related work in prior conversational agent and SDS research; it is also empirically validated in three iteration rounds through five hypotheses. In doing so, we contribute to research and practice in several ways. First, we enrich the body of knowledge by proposing a design theory that codifies design knowledge for a class of artifacts (SDS dialog strategies) to address a class of problems according to Walls et al. (1992) and Gregor and Jones (2007). This type of knowledge can be referred to as “nascent design theory,” which provides “knowledge as operational principles/architecture” according to Gregor and Hevner (2013, p. 342). With this contribution, we respond to recent calls for more design knowledge on conversational agents for enhancing user experience, particularly in the customer service context (Gnewuch et al. 2017). In the following sections, we discuss the main findings of this study before highlighting the major implications for research and practice. In addition, the limitations of this study are outlined together with propositions for future research.

6.1 Implications for Research and Practice

The insights gained in the three iteration rounds of the applied DSR approach contribute to an iterative revision and refinement of our design theory. Based on the key findings of the evaluation rounds, we derive numerous implications for research and practice concerning the effectiveness and user experience of the proposed design theory. The main findings for the tested hypotheses and the corresponding implications for research and practice are summarized in Table 11.

Table 11 Main findings and implications for research and practice

With regard to perceived humanness (H1), we find support for the notion that the open SDS is perceived as more human-like than the closed system. This result is not only quantitatively validated but also indicated by the qualitative answers of the survey participants, with positive comments on the humanness of the open SDS such as “it feels almost like talking to an employee” or “it was like a normal conversation with a real person.” In comparison, the strict enumeration of options makes the closed system appear cold and robot-like. Some users are annoyed by the closed system, as indicated by statements such as “the long announcements are annoying,” “mechanical communication,” or “too much talk, too many options.” Such negative attitudes toward SDSs can be attributed to users’ discomfort and distrust when talking to a machine without a personality (Luo et al. 2019). These findings have several implications for research and practice. First, they are consistent with social response theory (Nass and Moon 2000; Moon 2000) and support the human–human trust perspective, according to which anthropomorphic characteristics tend to positively affect user trust (Gnewuch et al. 2017; Seeger and Heinzl 2018). Accordingly, we can confirm the findings of previous studies that human-like characteristics are beneficial for the design of conversation-based technologies when the system is intended to substitute a human expert, for example in customer support (Diederich et al. 2020).

We also find that the participants hardly use some functions of the open SDS. For example, the participants are only interested in performing their task and show no initiative to utilize the small talk function of the open system. Instead, the function is triggered only in a few cases and unintentionally, causing more misunderstandings in the dialogs than benefits. Thus, small-talk intents should be avoided in a task-oriented SDS because too many different intents increase the error probability. This finding is consistent with one of the major assumptions of dialog theory, which posits that task-oriented dialogs are instrumental, with people only engaging in a dialog when they intend to achieve a particular task or goal (Bunt 2000). One major implication that can be derived from this finding is that the design of task-oriented SDSs should differ from the design of social SDSs. Similarly, prior research has concluded that interactions with voice assistants, as is the case with SDSs, should be designed differently than conventional human–computer interactions (Schmitt et al. 2021). Among others, the human-like design of voice assistants should be context- and task-dependent. Therefore, investigating the main similarities and differences between task-oriented and social SDSs in future research would help to enhance the understanding of how to design desirable AI-based digital assistants for different task types.

When investigating the habitability of the open SDS compared to the alternative (H2), we find that the closed SDS is perceived as more habitable than the open SDS. One reason for the higher habitability of the closed SDS is that this form is still predominant in business practice (Dale 2016). Users who are unfamiliar with open dialog strategies feel overwhelmed when using them. For the design of task-oriented SDSs, this finding has several important implications. On the first call, more assistance should be provided to carefully familiarize users step-by-step with the open system. In particular, the welcome prompt has a significant influence on user expectations; hence, a brief explanation of the available self-service options would be useful before posing the open question on user intent. By naming the various options, users can easily initiate the desired process and start the conversation without making mistakes, similar to a closed system. Once users are familiar with the open system (i.e., on subsequent calls or when returning to the main menu), the level of assistance can be reduced. Furthermore, the findings indicate that the clear confirmation of inputs in the closed system is beneficial for enhancing habitability. Accordingly, implicit input confirmation should be taken into account more consistently in the design of the open SDS. However, an issue that remains unanswered is how an optimal level of assistance can be achieved in an open SDS while still benefiting from open expression for an intuitive, human-like conversation, which is frequently perceived as positive by the users.

System response accuracy and the number of errors during the dialogs are additional aspects that substantially affect user experience (H3a/H3b). The findings based on the analyzed quantitative and qualitative data underline the importance of an efficient and error-free dialog. The operation of both systems is generally perceived as easy to learn. However, the users are more satisfied with the controllability of the closed SDS due to the higher predictability of communication. On average, both tasks are completed faster in the open system, which is also confirmed by the subjective perception of the users. The majority of users indeed perceive the system response accuracy of the open system as higher than that of the closed system, based on their subjective impression that the open SDS makes fewer mistakes. However, this perception contradicts the recorded system data, which reveal that users make more mistakes when completing the two tasks in the open system. The lower level of habitability described above could be one reason why the number of errors is higher in the open system. The contradiction between the perceived system response accuracy and the actual number of errors based on the logging information implies that other system characteristics such as likability or perceived humanness may be more important to users of SDSs than system response accuracy. Thus, user satisfaction with the open SDS in terms of likability or humanness may lead users to underestimate the error rate. However, further research is needed to gain deeper insights into the influence of different system characteristics on user experience. Future design studies could rank the requirements and DPs by their relative importance (e.g., based on frequency analysis, factor analysis, or other ranking methods).

The number of errors varies substantially from user to user, with some users sharing their impressions in comments such as “The system understood me and was easy to control” or “Good speech recognition.” By contrast, other users experience considerable problems with fulfilling their tasks in the open SDS and criticize the number of errors in the open system. Statements such as “You’re somewhere in a menu and can’t get any further, even when yelling at the phone” reflect this result. As indicated by DP4, a high priority in designing an SDS should be allocated to the successful handling of errors by including a multi-stage error recovery strategy that provides users with context-sensitive support to successfully communicate their request. Although this DP is considered when designing the open SDS, including two test phases with different participants in which the error recovery strategy is iteratively improved, unforeseen errors still occur. Frequent failures to recognize user input cause user dissatisfaction and represent a major challenge in the development of SDSs (Goetsu and Sakai 2019). Thus, the findings concerning the error recovery strategy indicate further room for improvement.

Given the varying preferences and needs of different user groups, the system design should allow for tailored levels of help prompts. Hence, help prompts should be more detailed when the user specifically asks for support. By calling up help prompts, users could control the level of system help themselves. Additionally, more contextualization is required to avoid unnecessary errors and misunderstandings. Instead of allowing users to access help at any time, this function should be available only in the dialog steps in which help is relevant. Furthermore, an early provision of help can avoid unnecessary errors. To increase the probability that users request help when necessary, the function can be mentioned in the welcome prompt of the first call. For the developed use case, the system could formulate an example statement with several filled slots, as illustrated in the sketch below. The users then customize the sentence with their desired content and in this way learn how to use the system. Thus, the knowledge gap regarding slot filling can be closed.
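The following sketch illustrates how such step-dependent help prompts and a slot-filling example statement might be operationalized; all step names and prompt texts are our hypothetical additions, not the wording of the evaluated artifact.

```python
# Illustrative sketch of step-dependent help prompts with a slot-filling
# example statement in the first-call welcome prompt (hypothetical names
# and texts, not the evaluated artifact's actual prompts).
FIRST_CALL_WELCOME = (
    "Welcome to Adventure Guru! You can say everything in one sentence, "
    "for example: 'I would like to book a rafting tour for two people "
    "on Saturday.' Say 'help' whenever you get stuck."
)

# Brief and detailed help variants, keyed to the dialog steps in which
# help is actually relevant; other steps offer no help function.
CONTEXTUAL_HELP = {
    "collect_activity": ("Which activity would you like?",
                         "You can choose rafting, climbing, or hiking. "
                         "Just name the activity you are interested in."),
    "collect_date": ("For which day?",
                     "Please name a day, for example 'next Saturday'."),
    "collect_party_size": ("For how many people?",
                           "Tell me how many people will take part."),
}

def help_prompt(step: str, first_call: bool) -> str | None:
    """Return a context-sensitive help prompt, or None if help is not
    relevant at this step; first-time callers get the detailed variant."""
    entry = CONTEXTUAL_HELP.get(step)
    if entry is None:
        return None
    brief, detailed = entry
    return detailed if first_call else brief
```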

Another way of avoiding dialog breakdowns is to implicitly integrate a de-escalation intent. Depending on the length of the dialog, the system should forward the user to a human employee after a certain number of errors to avoid dialog breakdowns. Overall, two steps are necessary: adding unexpected user statements to the dialog management rules and formulating prompts more purposefully when the number of errors becomes too high. SDSs must therefore be supervised by qualified personnel over the course of their deployment to eliminate sources of error in the long term.
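A minimal sketch of such a de-escalation mechanism embedded in a multi-stage error recovery strategy might look as follows; the threshold and prompt texts are illustrative assumptions rather than the evaluated artifact's exact rules.

```python
# Sketch of a multi-stage error recovery strategy with a final
# de-escalation step (illustrative threshold and prompts).
ESCALATION_THRESHOLD = 3  # forward to a human employee after three errors

RECOVERY_PROMPTS = [
    "Sorry, I did not catch that. Could you please rephrase your request?",
    "I still did not understand you. You can, for example, say "
    "'book a tour' or 'change my reservation'.",
]

def handle_recognition_error(error_count: int) -> str:
    """Return the next system prompt after a failed recognition attempt,
    escalating to a human employee once the error threshold is reached."""
    if error_count >= ESCALATION_THRESHOLD:
        # De-escalation intent: hand the dialog over before it breaks down
        return ("I am sorry for the inconvenience. I am connecting you "
                "with one of our employees now.")
    # Increasingly explicit recovery prompts (multi-stage strategy)
    stage = min(error_count, len(RECOVERY_PROMPTS)) - 1
    return RECOVERY_PROMPTS[stage]

# Example: first error -> rephrase request; third error -> human handover
for n in (1, 2, 3):
    print(f"error {n}: {handle_recognition_error(n)}")
```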

Likability is another major aspect to be considered when designing an open SDS (H4). The survey participants generally prefer the flexibility of the open system, which allows the user to fill several slots at once and helps to shape the course of the dialog. The open SDS is quantitatively and qualitatively rated as friendlier than the closed SDS, with participants expressing statements such as “friendly voice” and “I found the robot very nice.” In addition, the users describe the navigation in the open system as intuitive and on average report more enjoyment with the system. The open expression mode should therefore be possible throughout the SDS to enable a human-like conversation and to support the human–human perspective according to social response theory (Nass and Moon 2000; Moon 2000).

With regard to functional effectiveness and performance (H5), the hypothesis is only partially supported by the empirical results, which are based on the duration of task completion, the number of dialog steps required to achieve task completion, and the completion success rate. In terms of the duration of task completion and the number of dialog steps, the open SDS shows higher functional effectiveness and performance than the closed SDS. However, with respect to the completion success rate, the closed SDS performs better. The open SDS is described as more professional and useful, but the higher number of unfulfilled tasks in the open system indicates a lower level of effectiveness. To improve the completion rate in the open SDS, the propositions made in P3a–P3f should be considered when designing the error recovery strategy to avoid dialog breakdowns.

However, with 74 positive and only 3 negative comments, the duration of the dialog in the open SDS is clearly considered superior to that of the closed dialog, which is criticized in 47 comments and positively mentioned in only 9 statements. Among others, the closed system is criticized for its detailed prompts, which cause longer dialogs, and for error prompts with partly unnecessary repetitions. This result is reflected in statements such as “the long announcements are annoying,” “too long instructions,” or “if you know what you want, the selection is annoying.” The users’ negative perceptions of the closed SDS are understandable when comparing the average duration of both dialog forms. The average duration of task completion is significantly shorter in the open SDS than in the closed SDS. The shorter prompts and the possibility of capturing several variables at once through slot filling contribute to a more efficient dialog in the open SDS. This result is also indicated by the average number of dialog steps required to achieve task completion, which is significantly lower in the open SDS than in the closed SDS. Nevertheless, the mean number of user dialog steps in the open SDS is significantly higher than the possible minimum because only a few survey participants attempt to capture several variables at once and thus benefit from the slot filling function of the open SDS. The likely reason for this non-use of slot filling is a lack of awareness of or experience with the function. To increase the probability that users utilize slot filling, they should be informed about the available function in the welcome prompt of the first call, including an example statement (see also P3d).

Overall, the participants find the open system more pleasant and express a preference for its use in the future. More than two thirds (68.8%) of the 205 participants prefer the open SDS, whereas the remaining participants (31.2%) favor the closed SDS. The popularity of the open SDS is also reflected in the qualitative statements of the users based on the frequency analysis of the negative and positive comments (cf. Table 8). The open SDS is predominantly positively emphasized (37 positive versus 12 negative comments), whereas the closed SDS yields a rather mixed ratio of 20 positive and 21 negative comments. However, differences between age groups are observed: older users have difficulties in using the open SDS and thus clearly prefer the closed SDS, whereas younger users generally perceive the open system as preferable. One reason for this observation has already been described when explaining the users’ higher habitability with the closed SDS: older participants seem to be more familiar with closed dialog systems and may feel overwhelmed by the open system. In line with prior findings in the literature on computer self-efficacy (CSE), younger users feel more comfortable using IT than older users (Reed et al. 2005; He and Freeman 2010). Thus, the decline of CSE with age may be a further explanation for our observation.

Another explanation for the age-related differences in preferences can be found in research on technology acceptance. Consistent with our observation, a recent meta-analysis of 144 individual studies on the relationship between age and technology acceptance, covering different types of technologies and user groups, has revealed that age is indeed an antecedent of technology perceptions such as perceived ease of use, perceived usefulness, and intention to use a technology (Hauk et al. 2018). Additionally, the study has found that the negative relationship between age and technology acceptance is not present for technologies addressing the needs of the older user group. Thus, we can assume that, among older users, acceptance of the closed SDS is high, whereas acceptance of the open SDS may be low. However, prior studies on CSE and technology acceptance were not conducted in the specific context of conversational agents; consequently, these findings may not be fully generalizable to this type of technology. Further research is needed to shed light on the moderating effect of age on the preferred design components.

The findings indicate that the design of SDSs is a complex and demanding task; furthermore, the extent to which the design of the open dialog should integrate elements of the closed SDS depends on the target group. Consequently, the design of an SDS should be contextualized and individualized to meet the demands of the targeted group of customers. For example, when designed to serve as a customer service agent for older customers (e.g., in healthcare), an SDS should integrate more structured elements. When designed to function as a booking assistant for younger customers (e.g., a provider of adventures, as is the case with our “Adventure Guru”), an SDS should be equipped with the DPs of the open SDS. Given these findings, SDSs should have more features of either an open or a closed dialog, depending on the user group. Nevertheless, further research efforts are required to explore the impact of different SDS types on various user groups. For example, deeper insights into the impact of hybrid SDSs on the user experience of different user groups (e.g., by age, gender, or application domain) could provide useful results for the future design of SDSs.

With the presented design theory, we contribute to research and practice by providing a consistent set of design principles, propositions for further improvement, and future research avenues for addressing an important class of problems in human–computer interaction research. This is of particular importance in the context of customer service, as research on the design of conversational agents that enhance user experience is still scarce (Gnewuch et al. 2017).

Aside from the provided design knowledge, our study shows, for a particular context, which dialog strategy users prefer for a user-friendly and efficient human–computer dialog. Thus, our study contributes to the body of knowledge in behavioral research by enhancing the understanding of user preferences toward different dialog strategies.

6.2 Limitations

As with any DSR project, the findings of this study are subject to some limitations that must be considered when interpreting the results. Some methodological limitations exist with regard to the systematic literature review conducted in this study to gather relevant literature that serves as justificatory knowledge. First, the literature search is conducted in six interdisciplinary databases to ensure a broad and comprehensive search. Despite our efforts to “accumulate a relatively complete census of relevant literature” (Webster and Watson 2002, p. 16), the identified literature is restricted to the accessed databases and the applied set of search phrases and may not cover all relevant literature in the respective research areas. Second, although two researchers are involved in this study to achieve interrater agreement (Krippendorff 1989), the process of literature screening and assessment and the qualitative analysis of the evaluation results may be affected by selection biases (Templier and Paré 2018).

The third central limitation of our study refers to the evaluation step of our design theory, which is based on expert knowledge (Iterations 1 and 2) and perceived user experience (Iteration 3). Although the experts involved in Iterations 1 and 2 possess valuable knowledge in the application domain, their feedback, which helps refine the requirements and DPs, merely exemplifies the perceptions of these experts and thus may not be representative. Another factor to be considered is that the user-experience survey is conducted only with German participants, a large proportion of whom are younger adults; relatively few participants are aged over 45 years. Hence, the sample of respondents is not demographically representative and reflects only a German point of view.

Aside from methodological issues, another limitation can be found in the explicit focus on the dialog management of task-oriented SDSs. Thus, the design theory proposed in this paper is only suitable as design knowledge for task-oriented SDSs, and it cannot be generalized to non-task-oriented SDSs or text-based dialog systems. Among other directions, future research could address the extent to which the requirements and DPs for a speech-based dialog system can be adopted for the design of text-based dialog systems. The mode of communication is considered a key design characteristic of conversational agents that use natural language for human–computer communication (Knote et al. 2019; Diederich et al. 2019b). Anecdotal evidence has shown that users perceive voice-based communication with conversational agents as more natural (Novielli et al. 2010; Elshan and Ebel 2020), although the extent of this perception strongly depends on the user group (Novielli et al. 2010). Aside from these few examples, however, studies that exclusively examine the impact of different communication modes on user experience are scarce. A comparison of the requirements and DPs for both speech-based and text-based dialog systems would help to provide more generalizable design knowledge to advance the conversational agent research field.

Another limitation relates to the moderate influence of the application domain on the design theory. The design principles are developed based on literature from the research field of conversational agents and dialog systems, including several studies from the customer service domain. In addition, during the formative evaluation, we involve experts who contributed their experience with customer service dialog systems to the design of the user experience. However, given the rather moderate focus on customer service in the first design steps, the transferability and generalizability of our research results may be limited. Further studies that exclusively address domain-specific design requirements and design principles (e.g., based on user stories and user focus groups) should complement our findings. Nevertheless, the results from our summative evaluation (cf. Section 5.2), which we conduct in the customer service domain, demonstrate that the design theory is suitable for satisfying the needs of users from the customer service domain.

Another limitation is concerned with the underlying dialog theory (Bunt 2000) that serves as kernel theory for the derivation of the requirements and DPs. Although dialog theory is central to the design of SDSs, it cannot cover all relevant SDS design aspects. The selection of another kernel theory may result in a modified set of requirements and DPs for the design theory. As stated in Sect. 4, several other theories may also serve as kernel theory for guiding the socio-technical design of speech dialog systems, for example, task–technology fit theory, social response theory (Nass and Moon 2000; Moon 2000), and embodied social presence theory (Mennecke et al. 2011). We rely on dialog theory as kernel theory because our focus is on the design of dialog systems that assist users with simple tasks and short dialogs, while taking into account the communicative behavior of the agents (Bunt 2000). When selecting another theory as kernel theory, the focus may shift to other requirements and DPs. For example, according to embodied social presence theory, technologies such as SDSs are considered social actors that should be designed as human-like as possible. Consequently, the human-like design of the system may be more important than is the case with dialog theory.

In this context, an interesting yet still unanswered question in the SDS research field relates to which kernel theory is best suited to guide the design of an SDS. To date, there is a lack of research that provides an interdisciplinary overview of available and appropriate kernel theories across research disciplines. Such an overview would help to guide future DSR projects toward a more rigorous design process.

A further limitation of this study is that it primarily focuses on optimizing efficiency and user experience when developing the design theory, neglecting socio-economic issues. However, aspects such as data privacy, user data protection, or economic factors may have an equally significant impact on the technical design of this class of artifacts. When using SDSs, many users are concerned about the protection of their data (Luo et al. 2019). Particularly in the financial and healthcare sectors, dialog systems are met with skepticism and resistance by end users, as the mere disclosure of confidential information poses a risk to the user (Carter and Knol 2019). In the course of the dialog, various user data are collected, including personal information such as name and address, customer number, and credit card or bank account details; these data must be adequately stored and properly handled. Given the sensitivity of such information, many users have privacy concerns (Lopatovska et al. 2020). Therefore, a concept is required to ensure data protection and the trustworthy handling of user data.

Furthermore, the implementation of an SDS can incur considerable costs. Although the costs are expected to be lower than the savings potential, they should not be underestimated (Ivanov and Webster 2017). For example, customizing the system and using conversation-based AI technologies involve development efforts and thus high personnel costs for qualified staff (Kirkpatrick 2017). In addition, due to a lack of technical knowledge, many users consider high-value automated services to be inferior and express an unwillingness to pay the same price despite receiving the same service (Ivanov and Webster 2017). The provision of automated services can also convey the impression that the company is uninterested in personal customer relationships (Knilans 2014). Aside from the design aspects, a variety of socio-economic aspects should therefore be considered in future studies on designing SDSs for customer support. As stated by other IS scholars, the design of AI-based digital assistants will be associated with both positive and negative consequences for humans, which must be further examined in research (Maedche et al. 2019).

7 Conclusion

Given the major role of dialog systems in today’s customer service for answering customer requests, the design of dialog strategies constitutes an important but challenging task for designers of dialog systems. By adopting a design theory-oriented approach according to Walls et al. (1992) and Gregor and Jones (2007), we develop and evaluate a design theory for an SDS dialog strategy, including 14 requirements and five DPs. Based on the quantitative and qualitative results of a user-experience survey with 205 participants, we show that the users’ experience with the proposed artifact differs depending on their age. Younger user groups tend to prefer the features of the open SDS, whereas the older user groups clearly opt for the closed SDS. Although there is still room for improvement with respect to error recovery and the completion success rate, users appreciate the elements of the open variant, such as open expression, friendliness, and humanness. However, the findings show that the design of SDSs is a complex and demanding task. Nevertheless, we believe that this study contributes to research and practice by proposing a design theory that helps to improve the development of dialog strategies for SDSs, thereby enhancing the user experience in the customer service context.