Introduction

Today’s customers expect to be able to contact a company via e-mail, chat, and social media platforms, all while demanding ever shorter response times (Microsoft 2018; Salesforce Research 2016). For example, a study by Zendesk (2017) found that while in 2013 62% of the surveyed customers expected a response to an e-mail within half a day, in 2016 this number had risen to 79%. Further, 64% of customers expected companies to respond to and interact with them in real-time (Salesforce Research 2016). In addition, customers are about five times more likely to view real-time messaging as important versus unimportant (Salesforce Research 2018). Moreover, consumers’ preferred contact channel directly matched the method they perceived as fastest (Gladly 2018). In 2018, half of all customers were disappointed with machine-based customer service such as chatbots, primarily because they demand service that is both fast and personal. As a consequence, companies face the challenge of meeting customers’ demand for both a short response time and a high level of service quality (Forrester 2018; Mero 2018). Hybrid approaches combining the complementary strengths of human and artificial intelligence show great potential for deployment in online customer service, where advanced text comprehension is required to fulfill customers’ needs. While humans have superior capabilities for empathic, complex, or intuitive tasks such as the semantic understanding of texts, artificial intelligence is particularly good at consistently solving repetitive or routine tasks in a specific area where fast processing of huge amounts of data is required (Dellermann et al. 2018; Forrester 2016; Guzmán and Pathania 2016). Moreover, human involvement in responding to customer requests is desirable: many customers, while increasingly preferring digital channels and demanding fast response times, nevertheless prefer to receive their information from a person rather than from a computer (Mero 2018; Parature 2014; Salesforce Research 2016). For instance, a study by Parature (2014) found that 60% of customers chose to interact with a human representative over a self-service system.

Case-Based Reasoning (CBR) is a promising means to augment human judgment by assisting employees in online customer service (Acorn and Walden 1992; Bedué et al. 2018; Heras et al. 2009; Lenz and Burkhard 1997; Lenz et al. 1999; Lenz et al. 1998b). CBR is often used to retrieve past and already solved customer problems – so-called cases – similar to the one currently encountered, allowing employees to draw information from past textual communication and reuse the respective solutions. In this regard, CBR constitutes a methodology based on human-machine collaboration. On the one hand, CBR provides employees with past cases suitable to solve new customer problems. On the other hand, by solving new customer problems employees constantly increase the number of cases to draw from. However, the potential of CBR as a hybrid intelligence approach in which humans and machines act as teammates has not yet been fully exploited. Particularly in approaches focusing on textual information (Burke et al. 1997; El-Sappagh and Elmogy 2015; Lenz et al. 1998b), the human capability to understand and judge the semantic relationship between potential solutions and the currently encountered problem has not been tapped to enhance case retrieval.

To further the goal of exploiting human-machine collaboration in case retrieval, we follow a design-oriented approach (Hevner et al. 2004; Peffers et al. 2007) and propose a novel long-term feedback-based approach to retrieve semantically similar cases. Our study focuses on incorporating feedback from employees in the long term to infer the semantic relation of texts. As a starting point, we assume a conventional textual CBR approach. First, we collect unary feedback regarding the semantic similarity of customer problems from employees. In the context of semantic similarity, unary feedback is a sensible choice for a rating scale, as it provides a single, unambiguous rating and clearly distinguishes the fundamentally different implications of the judgments “semantically similar” and “not semantically similar”. Second, we reuse and link the collected feedback in order to instantiate a machine-learning model, based on methods from the area of information retrieval, that is able to generate the semantic context of a customer problem. Third, we use this semantic context to enhance the case retrieval for new customer problems by creating an adapted customer problem as the weighted combination of the new customer problem and its semantic context. This way, we take advantage of the human capability to judge the semantics of solutions and the machine’s ability to comprehensively learn from employees’ input. We demonstrate the applicability and the capabilities of our hybrid intelligence approach using publicly available open-domain customer problems from the popular service website Quora. Our contribution is twofold: First, our approach builds upon research in CBR dealing with textual information and extends this line of thought by integrating long-term feedback following established approaches from the field of information retrieval. Second, it fosters human-machine collaboration by using human-generated feedback to improve a computer system.

Guided by the Design Science Research (DSR) process by Peffers et al. (2007), the remainder of this paper is organized as follows (cf. Figure 1): In the next section, we describe and illustrate the problem context that motivates our research. Subsequently, we present the prior state of relevant research in the areas of CBR and information retrieval and conclude this section with the research gap. Following a description of our research method, we detail the design of our long-term feedback-based approach. In the subsequent section, we demonstrate our novel approach’s practical applicability using a publicly available real-world data set of a popular service website. Then, we conduct a summative evaluation based on a standard metric and compare the approach’s performance to competing artifacts from the literature. Our paper concludes with a discussion of implications and limitations, identification of possible future research opportunities, and a summary of our findings.

Fig. 1

Overview of the DSR process used for the conducted research (Peffers et al. 2007)

Problem context

The starting point of our problem-centered research (Peffers et al. 2007) is the observation that an increasing fraction of customer service interactions is handled through online channels (Forrester 2018; Microsoft 2018). Indeed, online customer service has become ubiquitous across industries (e.g. in the areas of information technology (Dell) or telecommunication (AT&T)), with most customers expecting to be able to reach a company by e-mail, chat, social media, or via specialized online platforms (Altitude and Spider Marketing 2016; Microsoft 2018). Accordingly, customer service employees face the challenge of providing correct, reliable, and consistent solutions to a huge number of customer problems in written form within a short time frame.

The following example illustrates our problem context: Consider the online customer service of a telecom company that is contacted by a customer having an issue concerning their phone and expecting immediate support. As in virtually all of the most prominent online customer service channels, the customer sends a request (customer problem) as free text in the form of a question: “My phone does not turn on anymore. What should I do?”. The incoming customer problem is answered by an employee who, as a domain expert, relies on their individual knowledge, experience, and available domain-specific information to respond with troubleshooting instructions in a timely manner. The free-text solution sent as the response together with the incoming customer problem constitutes a case. Over time, a customer service department accumulates a large number of cases, which together comprise the case base.

Since many of the incoming customer problems are not unique but have been solved previously, there is often a past case in the case base that contains a semantically similar customer problem – generally with syntactical differences – and a solution that can serve as a basis for the solution to the newly incoming problem. In the example of a faulty phone, the solution might comprise standard troubleshooting instructions applicable in many related scenarios. Hence, by reusing solutions, the knowledge contained in the case base can be used to both reduce response times and increase the quality and consistency of solutions. The key challenge is to find a suitable solution among the cases in the case base in a fast and reliable way. Due to the unparalleled speed at which they can search and process large amounts of data, computer systems are prime candidates for this task. However, the large and inconsistent vocabulary and the missing or only implicitly contained information in free-text customer problems pose a major problem for traditional approaches. Considering the examples in Table 1, a second customer problem “My phone’s screen stays dark and it does not seem to start up” (Problem B) is semantically similar to the first (Problem A) in the sense that the given and the desired information are similar despite a very different description of the issue. Hence the solution from the previous case (Solution A) can be reused (Solution B). In contrast, “I dropped my phone and now it does not turn on anymore” (Problem C) and “I dropped my phone in the sink and now it does not turn on anymore” (Problem D) are similar in terms of vocabulary and phrasing, but refer to different types of damage (mechanical damage vs. water damage), likely demanding different solutions (Solution C, Solution D). Understanding the semantics of text is a task humans excel at. It therefore appears likely that a solution combining the respective strengths in the form of human-computer collaboration yields a performance superior to that of a computer system or an entirely manual approach alone. In this context, we distinguish the performance dimensions efficiency and effectiveness. We define efficiency as the time and effort required by employees to solve a customer problem. In contrast, we define effectiveness as the ability to provide customers with a textual solution that contains the requested knowledge.

Table 1 Case structure and illustrative examples
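To make the retrieval challenge concrete, the following minimal sketch (illustrative only, not part of the proposed approach) shows how a purely syntactic measure such as tf-idf cosine similarity can be misleading on the examples above. It assumes scikit-learn is available; the computed values depend on preprocessing and are not results from this study.

```python
# Illustrative sketch: surface similarity via tf-idf + cosine can be misleading.
# Assumes scikit-learn; exact values depend on preprocessing and are not from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

problems = [
    "My phone does not turn on anymore. What should I do?",                 # Problem A
    "My phone's screen stays dark and it does not seem to start up.",       # Problem B
    "I dropped my phone and now it does not turn on anymore.",              # Problem C
    "I dropped my phone in the sink and now it does not turn on anymore.",  # Problem D
]

vectors = TfidfVectorizer().fit_transform(problems)
sim = cosine_similarity(vectors)

# Problems C and D share almost all terms and typically score high, although they
# describe different damage types; A and B are semantically similar but share
# little vocabulary and typically score much lower.
print(f"cos(A, B) = {sim[0, 1]:.2f}")
print(f"cos(C, D) = {sim[2, 3]:.2f}")
```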

Related work and research gap

Prior to designing a solution to the problem of combining the strengths of humans and computers in the context of online customer service, its objectives need to be defined (Peffers et al. 2007). In the following, informed by the literature in the areas of CBR and information retrieval, we identify the targeted research gap and state the desired properties and functionality of our approach.

Related work in textual CBR and information retrieval

Irrespective of the specific application domain, approaches for online customer service that automatically provide employees with similar cases and learn from human knowledge in the form of already solved customer problems are based on CBR (Acorn and Walden 1992; Bedué et al. 2018; Heras et al. 2009; Lenz et al. 1999; Lenz and Burkhard 1997; Lenz et al. 1998b). Thus, the well-established CBR methodology paves the way for a hybrid intelligence in which humans and machines act as teammates (Gu et al. 2017; Martin et al. 2017; Reuss et al. 2015). As customer interactions in online customer service are generally based on textual messages, the research stream of textual CBR in particular seems to provide promising approaches to cope with the task of enabling hybrid intelligence.

Textual CBR approaches are based on the CBR cycle introduced by Aamodt and Plaza (1994): A case base contains all existing and solved cases cj, each consisting of a textual description of the customer problem pj and a solution sj (Burke et al. 1997; Cunningham et al. 2004; Lenz et al. 1998b; Wang et al. 2006b). Case retrieval starts with an incoming customer problem pi and aims to quantify the degree of resemblance between pi and all customer problems pj of existing cases cj in the case base by means of a similarity function sim(pi,pj) (Liao et al. 1998). Subsequently, the k most similar cases are presented to the employee, who can choose to reuse one or more solutions sj, revise them if necessary, and finally add the new case ci (incoming customer problem pi and corresponding new solution si) to the case base (Lenz et al. 1999; Wang et al. 2006b, 2011). For a more detailed review of textual CBR we refer to Appendix I (section “Supporting online customer service with Case-Based Reasoning”).
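For illustration, the Retrieve phase described above can be sketched as follows. This is a schematic example assuming a tf-idf vector space with cosine similarity as sim(pi, pj) – one common instantiation, cf. the demonstration section – and not the implementation of any particular cited approach; the case base contents are invented.

```python
# Schematic sketch of the Retrieve phase: rank all cases c_j = (p_j, s_j) in the
# case base by sim(p_i, p_j) and present the k most similar ones to the employee.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

case_base = [  # illustrative cases (problem text, solution text)
    ("My phone does not turn on anymore. What should I do?", "Solution A"),
    ("How do I activate call forwarding on my landline?", "Solution X"),
]

vectorizer = TfidfVectorizer()
problem_vectors = vectorizer.fit_transform([problem for problem, _ in case_base])

def retrieve(incoming_problem: str, k: int = 5):
    """Return the k cases whose problem descriptions are most similar to p_i."""
    p_i = vectorizer.transform([incoming_problem])
    similarities = cosine_similarity(p_i, problem_vectors)[0]
    top_k = np.argsort(similarities)[::-1][:k]
    return [(case_base[j], float(similarities[j])) for j in top_k]

print(retrieve("My phone's screen stays dark and it does not seem to start up."))
```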

While most existing textual CBR approaches are promising examples of human-machine collaboration, their retrieval results are primarily based on the information contained in the customer problem. In contrast to machines, a human reader has multiple skills that are challenging to attain for automated approaches. First of all, humans are able to understand and interpret rhetoric or linguistic specificities (e.g. irony or sarcasm). Moreover, content written by humans often contains indirectly stated information, such as social background, level of education, or age of the author, all of which could hardly be identified without human life experience. Further, despite very different (similar) vocabulary, two customer problems could refer to quite similar (different) types of problems (cf. Table 1). Understanding the semantics of texts is a task humans excel at, and human judgment has proven beneficial in enhancing and guiding computer systems (Salton and Buckley 1990; Sarwar et al. 2018; Trstenjak and Donko 2016). Hence, to fulfill customers’ needs in online customer service, understanding the semantic meaning of the textual messages exchanged between customers and employees is essential for providing correct solutions to customers’ problems. As a result, many approaches take employees or users into account to validate automated solutions for customer problems (Balakrishnan et al. 2016; Kunze and Hübner 1998; Lenz et al. 1999; Weis 2013). However, the potential of CBR as a hybrid intelligence in which humans and machines collaborate on equal terms has not yet been fully realized.

To exploit this potential, some authors in the research area of CBR have started to introduce approaches which incorporate feedback from the system user into the retrieval of new cases (Branting 2001; Cheng and Hüllermeier 2008; Coyle and Cunningham 2003; Gabel and Stahl 2004; Leake and Dial 2008; Soh and Blank 2008; Stahl 2005; Stahl and Gabel 2006; Zhang and Yang 1999). Feedback approaches offer the great opportunity to enhance a CBR approach during its operation (Stahl 2003; Weis 2013).

While a CBR approach could be trained ex-ante by domain experts linking semantically similar cases, from an economic perspective this would be a waste of resources for two reasons. First, a CBR approach can already greatly support employees to some extent right from the start, even with a small case base and in the absence of feedback. Second, employees have to be released from work to label the past cases required to instantiate the CBR approach. In contrast, feedback approaches leverage synergies, since employees already profit from a CBR approach while providing feedback.

Nevertheless, only a few researchers take feedback into account when developing textual CBR approaches (Balakrishnan et al. 2016; Daniels and Rissland 1997; Weis 2013). Some authors (Balakrishnan et al. 2016; Weis 2013) use feedback from system users to re-rank previously retrieved cases. However, re-ranking approaches discard potentially relevant cases that have not been returned by the initial retrieval. Thus, a re-ranking approach does not seem suitable to find new cases within contexts similar to the incoming customer problem. In contrast, Daniels and Rissland (1997) use so-called Pseudo-Relevance Feedback (Salton and Buckley 1990) by assuming the top two retrieved cases to be relevant. The terms of the cases treated as relevant are then added to the initial customer problem, which is expected to lead to the retrieval of semantically more similar cases.

To the best of our knowledge, besides a few preliminary studies that we review in Appendix I (section “Feedback in case retrieval”), studies in the textual CBR literature do not focus on or consider feedback in depth. Thus, in the following we investigate feedback approaches from the related area of information retrieval that attempt to capture and utilize human knowledge on semantic similarity.

Research in information retrieval offers a wide range of feedback approaches for the retrieval of text documents that aim at taking humans’ superior capability to semantically understand texts into account. Although the retrieval process in this context is similar to textual CBR, the objectives differ slightly. While textual CBR approaches aim at retrieving helpful solutions with respect to a full-text description of a problem (Burke et al. 1997; Weber et al. 2005), approaches in information retrieval try to retrieve relevant text documents regarding a query which expresses the user’s request in a few keywords (Baeza-Yates et al. 1999). Nevertheless, approaches from information retrieval show high potential for adaption to textual CBR (Burke et al. 1997; Lenz et al. 1998b; Shekhar et al. 2014). Taking a feedback-oriented perspective, the information retrieval literature can be classified into short-term (Rocchio 1971; Chen et al. 2006a; Lagun et al. 2013; Salton and Buckley 1990; Sarwar et al. 2018; Zhai and Lafferty 2001) and long-term feedback approaches (Crestani 1994, 2000; Mandl 2000; Lin et al. 2011; Mitra and Craswell 2017). Short-term feedback approaches use feedback only once for a single query, without storing it for further similar queries. Thus, these approaches require feedback for each query to enhance the retrieval of text documents, even if the query is nearly identical to previous queries. In contrast, long-term feedback approaches are characterized by the storage of feedback in order to preserve the expressed interconnections between queries and relevant text documents for later use. One popular long-term feedback approach is to instantiate an artificial neural network on the collected feedback, as these algorithms are able to learn complex mappings between patterns (Cöster and Asker 2000; Crestani 1994, 2000; Mitra and Craswell 2017). Such models can then be used to adapt new queries without the need for new user feedback. For a more detailed review of these approaches, we again refer the interested reader to Appendix I (section “Feedback in information retrieval”).

In particular, long-term relevance feedback approaches from information retrieval, which incorporate the human capability to understand and judge the semantic relationship between a query and retrieval results to enhance future retrievals, seem a promising means to foster human-machine collaboration in online customer service through a feedback-based textual CBR approach.

Research gap and objective

Having surveyed related research in textual CBR and information retrieval, in the following we identify the research gap our novel approach seeks to close, concluding with the definition of the solution’s objectives as the next step in the DSR process (Peffers et al. 2007).

Prior studies in textual CBR offer well-suited approaches as a starting point for our research (Ashley 1991; Burke et al. 1997; Daniels and Rissland 1997; Jayanthi et al. 2010; Lenz et al. 1998b; Weber et al. 2005). Since automated approaches still struggle to truly understand the full semantic meaning of texts (Berners-Lee et al. 2001; Embley 2004; Khanapure and Chirchi 2013; Wang et al. 2006b, 2011), human-generated guidance through feedback remains necessary to improve computer systems. However, there is still a lack of feedback-based approaches in textual CBR that take humans’ superior capability to semantically understand texts into account in order to support employees’ search for solutions to new customer problems. As system users in organizations can be expected to be experts in their domain and to possess similar knowledge, their feedback can be stored and reused to improve the retrieval for all users. In turn, employees in online customer service could work more efficiently and effectively when supported by advanced case retrieval that has been enhanced based on their own feedback. However, to the best of our knowledge, approaches that integrate recent long-term feedback approaches from information retrieval into textual CBR approaches for online customer service, thereby bringing together concepts and findings from both research streams, are still missing.

This conclusion drawn from an extensive review of the literature enables us to define the objectives for our solution (Peffers et al. 2007): Based on well-established methods from information retrieval, we aim to develop a novel textual CBR approach which improves case retrieval through long-term user feedback. The approach should enhance human-machine collaboration in textual CBR by making use of feedback from employees during operation of the CBR approach, leveraging the inherent human-machine synergies of CBR: With each added solution and accompanying feedback, employees increase the effectiveness of the CBR approach. At the same time, they profit from the unparalleled speed at which machines can search and process large amounts of data, hence improving their efficiency when solving customer problems. However, involving employees generally requires additional effort on their part. To maintain a high level of efficiency, we design our approach to keep this additional effort as low as possible. In summary, the objective of our approach is to provide employees with consistent and high-quality knowledge in a short time frame. Thereby, it contributes to an improved online customer service regarding effectiveness as well as efficiency.

Research method

We conducted and report our research according to the Design Science Research (DSR) process by Peffers et al. (2007), carrying out its six activities as visualized in Fig. 1. First, within the realm of online customer service we identify rapidly responding to customer problems with correct, reliable, and consistent solutions as a key challenge and open research question. Second, we uncover that using the complementary strengths of humans and computers – understanding the semantics of texts and searching vast amounts of data, respectively – appears a promising avenue. Together with the review of previous research on CBR in customer service as well as the incorporation of feedback in both case and information retrieval, this sets the stage for the third activity: The design of our research artifact, a novel long-term feedback-based approach for the retrieval of semantically similar customer problems. Fourth, we instantiate the artifact using a real-world data set and fifth rigorously evaluate its efficacy by comparing its performance to competing artifacts from the literature. The research process concludes with the sixth activity, the communication of the entire research process and findings in the present paper.

DSR efforts can be characterized by their knowledge creation strategy and theorizing mode (Baskerville et al. 2018) as well as their kind of outcome and contribution to knowledge (Gregor and Hevner 2013). Since our research is concerned with the development of a novel approach, it constitutes a contribution of nascent design theory (Gregor and Hevner 2013). As the focus of the conducted research is the design, implementation, and evaluation of an artifact, it is work in interior mode (Baskerville et al. 2018; Gregor 2009; Sonnenberg and Brocke 2012). In our research, we mainly employ an inductive, iterative knowledge creation strategy, producing prescriptive knowledge (Gregor 2009; Sonnenberg and Brocke 2012). More precisely, we start from the well-established conventional textual CBR process. Throughout the development, we draw from prior research on feedback, text retrieval, and incorporation of feedback into text retrieval as justificatory knowledge in which the design is grounded (Gregor and Hevner 2013). In terms of the knowledge contribution framework by Gregor and Hevner (2013) our artifact constitutes an “improvement”, striving to improve efficiency as well as effectiveness of online customer service.

Our approach for validation and evaluation follows the “Technical Risk & Efficacy” strategy of the Framework for Evaluation in Design Science (FEDS) put forward by Venable et al. (2016), which structures the evaluation as a four-step process: goal explication, choice of evaluation strategy, determination of the properties to evaluate, and design of the evaluation episodes. The overarching goal of the evaluation is to demonstrate that our novel hybrid approach improves the effectiveness and efficiency of online customer service. Hence, a customer sending the description of their problem shall be provided more often with the requested knowledge to raise effectiveness. Further, the time required by an employee to solve the customer problem should be reduced to improve efficiency. We chose the “Technical Risk & Efficacy” strategy as the design of our artifact is subject mainly to technical design risks. The formative, artificial evaluations throughout the earlier stages of the design process prescribed by the strategy ensure that the design choices made indeed contribute towards the overall objective, while a final summative evaluation in comparison to competing artifacts demonstrates that the artifact as a whole indeed constitutes an improvement (Gregor and Hevner 2013). In the following, we present the design, demonstration, and evaluation of the artifact as a single sequence of the process depicted in Fig. 1. Indeed, our artifact consists of a series of steps and elements, which in line with the evaluation strategy have been developed, tested, and validated throughout the search process (Gregor and Hevner 2013; Hevner et al. 2004; Sonnenberg and Brocke 2012). To give an example, we have validated two core elements – the semantic cluster vectors and the semantic context generator – empirically while developing our approach. For the sake of brevity and communicative clarity, we defer the description of this validation to the sub-section “Instantiation and application” and refrain from using any test data throughout the artifact’s description (cf. Gregor and Hevner 2013).

Design of the long-term feedback-based approach

To attain the goal of leveraging humans’ capability to judge the semantic similarity of texts to enhance the retrieval of semantically related customer problems in textual CBR, in our approach each incoming customer problem is adapted prior to the Retrieve phase of the CBR cycle. Specifically, the customer problem is transformed such that semantically similar past problems are retrieved rather than past problems which are solely syntactically similar with respect to the CBR approach’s similarity function. The knowledge necessary for this adaption is gained from human feedback on semantic similarity of customer problems collected from employees during use of the approach. In this way, by incorporating human feedback our approach enhances efficiency of online customer service by providing employees with semantically similar cases in a short time frame, so that the time required to solve a customer problem is reduced. Further, it increases effectiveness as employees are consistently provided with past cases semantically similar to the newly incoming customer problem, whose solutions contain knowledge to solve the customer problem. In turn, they can incorporate the best of this and their own knowledge to provide the customer with the requested solution.

Basic idea and overview

In conventional textual CBR approaches, the similarity between an incoming customer problem pi and each customer problem pj associated with a case in the case base is determined with respect to a similarity function sim(pi, pj) (Burke et al. 1997; Lenz et al. 1999). As a result, conventional textual CBR approaches suffer from the drawback that a customer problem pj similar to an incoming customer problem pi with respect to sim(pi, pj) is not necessarily semantically similar to pi as well. With our adapted CBR approach, we aim to assist employees with a fast and at the same time high-quality retrieval of semantically similar cases by exploiting the complementary strengths of human and artificial intelligence through long-term feedback. To this end, in our approach, each incoming customer problem is adapted based on human knowledge on semantic similarity prior to the Retrieve phase of the textual CBR cycle. Specifically, we draw on employees’ feedback on semantic similarity of customer problems collected during the Reuse phase for previously solved customer problems. This way, humans’ superior capability to interpret texts is incorporated into the textual CBR cycle.

Our approach consists of three steps (cf. Figure 2).

Fig. 2

Adapted CBR cycle in online customer service with feedback integration

First, to preserve and later benefit from the information contained in the dismissal and selection of retrieved cases, our approach enables employees to provide feedback on whether retrieved cases are indeed semantically similar to the considered customer problem (cf. step “Gathering Human Knowledge”). This feedback is collected in a feedback base comprising knowledge on semantic relationships of customer problems and in turn used to improve retrieval for further incoming customer problems.

Second, based on employees’ feedback on semantic similarities stored in the feedback base, we learn a so-called semantic context generator (cf. step “Learning the Semantic Context Generator”) that derives the semantic context of incoming customer problems. Based on long-term feedback approaches from the literature (Crestani 1994, 2000; Mandl 2000; Lin et al. 2011; Mitra and Craswell 2017), the semantic context generator draws on the combined knowledge on semantic similarity contained in the feedback base. This way, a semantic context \( {p}_i^s \) for an incoming customer problem pi is derived, taking into account humans’ superior capability to interpret texts.

Finally, in order to integrate humans’ knowledge into our approach, in the third step an adapted customer problem \( {p}_i^a \) is created (cf. step “Adapting the Customer Problem”) from the incoming customer problem pi and its semantic context \( {p}_i^s \) generated by the semantic context generator. On the one hand, the resulting adapted customer problem contains human knowledge on semantic similarity of customer problems leading to retrieval of semantically similar cases. On the other hand, the machines’ superior capability is exploited as semantically similar problems are retrieved automatically and at a rapid pace.

Combining the three steps “Gathering Human Knowledge”, “Learning the Semantic Context Generator”, and “Adapting the Customer Problem” results in a novel long-term feedback-based approach incorporating humans’ superior capability to semantically understand texts in online customer service while at the same time merging concepts from research in textual CBR and long-term feedback approaches from information retrieval. In the following, we detail the three steps of our approach and thereby illustrate how employees’ knowledge can be leveraged to incorporate the semantic relationship between texts into textual CBR approaches.

Gathering human knowledge

We intend to exploit the complementary strengths of human and artificial intelligence in solving incoming customer problems. To do so, we aim to incorporate employees’ capability to infer the semantics of texts in terms of feedback into the conventional CBR cycle (Aamodt and Plaza 1994) which serves as a starting point and well-founded basis for our approach (cf. Figure 2). To gather human knowledge, in a first step employees’ feedback on the semantic similarity of customer problems is collected and stored in the feedback base FB as an integral part of the Reuse phase of our adapted CBR cycle. More precisely, following the conventional CBR cycle, in the Reuse phase the k most similar cases are presented to the employee (Burke et al. 1997; Lenz et al. 1998b). The employee examines these cases and selects those with a customer problem semantically similar to the incoming customer problem pi and therefore most suitable to serve as a basis for its solution. In order to later benefit from this intellectual human effort and the manifested knowledge, in our adapted CBR we treat the employee’s selection as feedback on semantic similarity. Since employees operating a CBR system always have to identify the semantically most similar cases as a basis for their solution, we are able to create our feedback on-the-fly. This allows our approach to function without a dedicated, potentially time-consuming and costly feedback-collection effort, avoiding a detrimental effect on efficiency.
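A minimal sketch of this on-the-fly feedback collection is given below; the data structures and function names are illustrative assumptions rather than the authors’ implementation.

```python
# Sketch of gathering human knowledge during the Reuse phase: the employee's
# selection of suitable cases is recorded as unary feedback FB_ij on-the-fly.
feedback_base: set[frozenset[int]] = set()  # FB: pairs judged semantically similar

def record_selection(incoming_problem_id: int, selected_case_ids: list[int]) -> None:
    """Treat each case the employee reuses as feedback 'semantically similar'."""
    for case_id in selected_case_ids:
        feedback_base.add(frozenset((incoming_problem_id, case_id)))

# Example: the employee reuses the cases of problems 17 and 42 to solve problem 101.
record_selection(101, [17, 42])
```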

The choice of feedback scale constitutes an important step in the design of our novel approach. In the literature, various different feedback scales are discussed, which can, in particular, be classified by the number of scale points (e.g. unary, binary, or multi-point scale) (Boynton and Greenhalgh 2004; Cena et al. 2010; Cena et al. 2011; Krosnick and Fabrigar 1997). To find a suitable rating scale for collecting the employees’ feedback, the properties of the quantity to be measured – the semantic similarity of customer problems – need to be taken into account (Krosnick and Fabrigar 1997).

On the one hand, the statement that two customer problems pi and pj are semantically similar differs fundamentally from the statement that pi and pj are not semantically similar. Only the rating of two customer problems pi and pj as semantically similar results in a transitive relationship: If a third problem pk is semantically similar to pj, then pk and pi are semantically similar as well. If however pi and pj are not semantically similar, and pj and pk are not semantically similar either, no information regarding the semantic similarity of pi and pk can be inferred. This property is best captured by a unipolar scale (cf. Krosnick and Fabrigar 1997).

On the other hand, it appears infeasible to find clear and unambiguous criteria for rating semantic similarity on a multi-point scale. Assume, for example, a five-point rating scale for semantic similarity, with a rating of 5 implying semantically identical and a rating of 1 representing semantically completely distinct customer problems. Further, assume the two customer problems “My phone does not turn on anymore. What should I do?” and “I dropped my phone and now it does not turn on anymore” (cf. Problems A and C in Table 1). While both customer problems are related to a malfunctioning phone, the type of damage differs. Thereby, it appears difficult and context-dependent to decide if this difference leads to a rating of 4, 3, 2, or 1. Hence, a rating on a multi-point scale is necessarily ambiguous. Further, two customer problems whose similarity to a third customer problem is each rated as 3 might be semantically identical or entirely unrelated to each other.

Taking these considerations into account, a unary scale is a sensible choice to collect feedback on semantic similarity since it provides a single, unambiguous rating and clearly distinguishes the fundamentally different implications of feedback FBij = “not semantically similar” and FBij = “semantically similar”. Further, it can be assumed that employees in online customer service are domain experts who rate semantic similarity of customer problems on a unary scale consistently, avoiding the commonly encountered problem that users tend to use the very same feedback scales differently (Adomavicius and Tuzhilin 2005; Cena et al. 2010; Goldberg et al. 2001; Herlocker et al. 2004).

To sum up, employees’ knowledge on the semantic similarity of a customer problem pi to a past customer problem pj retrieved from the case base is collected using a unary feedback scale during the Reuse phase of our adapted CBR cycle (cf. “Gathering Human Knowledge” in Fig. 2). The employees’ feedbacks FBij are stored in the feedback base FB, the set of all feedbacks.

Learning the semantic context generator

As outlined above, we intend to adapt each incoming customer problem based on human knowledge on semantic similarity. To do so, we learn a model that captures the semantic context of customer problems based on the feedback collected in the step “Gathering Human Knowledge”. The semantic context of a customer problem encompasses different descriptions of the same underlying issue and can subsequently be used to improve retrieval. The literature on long-term feedback approaches in the related area of information retrieval has often applied different forms of semantic contexts to enhance retrieval accuracy (Crestani 1994, 2000; Huang et al. 2013). To derive the semantic context, we propose to instantiate a so-called semantic context generator G(p) that learns the semantic relationship between customer problems from the employees’ knowledge contained in the feedback base FB (Huang et al. 2013; Jung et al. 2007; Morrison et al. 2008). This way, we encode human knowledge in terms of employees’ feedback into a machine-interpretable semantic context while laying a foundation for an advanced human-machine collaboration.

Prior to learning the semantic context generator, the transitive nature of semantic similarity discussed above is used to augment the information regarding the semantic relationships of customer problems. As semantic similarity constitutes a transitive relationship, if two feedbacks FBij and FBik link (pi, pj) and (pi, pk) as semantically similar, this unambiguously implies that pj and pk are semantically similar as well, even though there might not be a feedback FBjk ∈ FB explicitly indicating this relationship. Consider, for instance, the two customer problems “My phone does not turn on anymore. What should I do?” (Problem A) and “My phone’s screen stays dark and it does not seem to start up.” (Problem B) (cf. Table 1) as well as feedback FBAB indicating the semantic relationship between them. Further, assume a third customer problem “My phone is no longer running. How can I start it up?” (Problem E) which is semantically linked to Problem B by feedback FBBE. Obviously, Problem E is semantically similar to Problem A as well, illustrating the transitivity implied by semantic similarity. Hence, in the absence of feedback FBAE, this relationship can be deduced from FBAB and FBBE (Morrison et al. 2008). Consequently, the set c = {pi, pj, pk, …} of customer problems linked through transitive semantic relationships implied by multiple feedbacks contains all customer problems concerning a common issue. Thus, we determine each so-called semantic cluster c to obtain all sets of customer problems that are semantically similar to each other. To learn a model for the adaption of customer problems, a representation of the semantic clusters is required such that a customer problem can be associated with the joint semantics of the respective cluster (Crestani 1994). Therefore, we encode a semantic cluster c in terms of a semantic cluster vector \( {p}_c^s \) by proposing the encoding function E(c). E(c) takes into account all customer problems pi, pj, pk, … that are part of a semantic cluster c to find \( {p}_c^s=E(c) \). As a result, we can draw on the relationship between each customer problem pi and its corresponding semantic cluster vector \( {p}_c^s \). In order to do so, we define the semantic context generator G(p) as a machine-learning model that learns to associate a customer problem with its semantic context. To this end, it is trained pairwise with each customer problem pi from the case base and its corresponding semantic cluster vector \( {p}_c^s \) to maximize the similarity function \( sim\left(G\left({p}_i\right),{p}_c^s\right) \). By this means, we enable the semantic context generator G(p) to derive a semantic context \( {p}_i^s=G\left({p}_i\right) \) based on the incoming customer problem pi (Crestani 1994).

To sum up, based on the step “Gathering Human Knowledge” the semantic context generator G(pi) learns from employees’ joint feedback in terms of semantic cluster vectors \( {p}_c^s \) to generate the semantic context \( {p}_i^s \) for incoming customer problems pi.
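As a sketch of how the semantic clusters can be derived from the feedback base by exploiting transitivity, problems connected through chains of feedbacks can be grouped via connected components on a feedback graph. The snippet below uses networkx merely as one possible, illustrative implementation and reuses the Problem A/B/E example from above.

```python
# Sketch: semantic clusters as connected components of a graph whose edges are
# the feedbacks FB_ij; transitively linked problems end up in the same cluster.
import networkx as nx

feedback_base = [("A", "B"), ("B", "E"), ("C", "D")]  # illustrative feedbacks FB_ij

graph = nx.Graph()
graph.add_edges_from(feedback_base)

# Each connected component is a semantic cluster c = {p_i, p_j, p_k, ...}.
semantic_clusters = [set(component) for component in nx.connected_components(graph)]
print(semantic_clusters)  # e.g. [{'A', 'B', 'E'}, {'C', 'D'}]
```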

Adapting the customer problem

Finally, in order to ensure a synergistic human-machine collaboration in online customer service by merging humans’ superior capability to semantically understand texts with the machine’s ability to process data at a rapid pace, we adapt the incoming customer problem pi using its semantic context obtained in the previous step. Based on the well-known short-term feedback approach Relevance Feedback, we adapt the incoming customer problem using the popular query reweighting technique derived from the Rocchio Algorithm (Carpineto and Romano 2012; Manning et al. 2008; Rocchio 1971; Salton and Buckley 1990). However, in contrast to standard Relevance Feedback, we do not collect feedback to adapt the incoming customer problem prior to retrieval. Instead, we incorporate employees’ feedback through the semantic context \( {p}_i^s \) derived by our semantic context generator G(pi), combining the incoming customer problem pi with its semantic context \( {p}_i^s \). Thus, we enable a well-founded retrieval of semantically similar cases by means of human-machine collaboration.

More precisely, we create an adapted customer problem \( {p}_i^a \) such that the given similarity function \( sim\left({p}_i^a,{p}_j\right) \) is maximized for all pj which are semantically similar to pi but minimized for all pj for which this is not the case. In order to achieve this, we combine employees’ knowledge on semantic similarity – the semantic context introduced in the previous step – and the incoming customer problem. We define \( {p}_i^a \) as the weighted combination of the incoming customer problem pi and its semantic context \( {p}_i^s \) (cf. Equation (1)). On this account, the weights α ∈ [0, ∞) and β ∈ [0, ∞) determine the extent to which the incoming customer problem pi and its semantic context \( {p}_i^s \) are represented within the adapted customer problem \( {p}_i^a \):

$$ {p}_i^a=\alpha \cdotp {p}_i+\beta \cdotp {p}_i^s $$
(1)

To sum up, the adapted customer problem \( {p}_i^a \) is defined as the weighted combination of the incoming customer problem pi and its semantic context \( {p}_i^s \). In our approach, the adapted customer problem \( {p}_i^a \) is used in place of the incoming customer problem pi in the Retrieve phase of the CBR cycle. As outlined above, the resulting refined retrieval of semantically similar problems is expected to lead to improved performance in online customer service. In this way, our approach leverages the complementary strengths of human and artificial intelligence through the incorporation of human feedback into the CBR cycle.
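A minimal sketch of this adaption step, directly following Equation (1), is shown below; the vector p_i_semantic stands in for the output of the learned generator G(p) and the default weights are placeholders.

```python
# Sketch of Equation (1): the adapted problem vector is the weighted combination
# of the incoming problem vector p_i and its generated semantic context p_i^s.
import numpy as np

def adapt_problem(p_i: np.ndarray, p_i_semantic: np.ndarray,
                  alpha: float = 1.0, beta: float = 1.0) -> np.ndarray:
    """p_i^a = alpha * p_i + beta * p_i^s"""
    return alpha * p_i + beta * p_i_semantic

# The adapted vector p_i^a then replaces p_i in the Retrieve phase of the CBR cycle.
```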

Demonstration of the approach

As an essential part of the DSR process (Gregor and Hevner 2013; Hevner et al. 2004; Peffers et al. 2007) we demonstrate the practical applicability of our approach. To this end, we use a publicly available real-world data set of the popular service website Quora. In the following, we first introduce the data set of suitable customer problems and elucidate the general validation and evaluation setup. Following this, we instantiate the components of our approach and in the process validate the design choices made during its development (Venable et al. 2016).

Data set

To instantiate our approach and evaluate its efficiency as well as its effectiveness, we use a publicly available real-world data set containing 404,288 pairs of customer questions and their corresponding semantic relationships published by the popular service website Quora (Iyer et al. 2017). On Quora, users ask questions on a wide variety of topics that are answered by an international community composed of both laypeople and topic experts (Wang et al. 2013). These answers are discussed and judged through voting by fellow community members (Wang et al. 2013). To keep their knowledge base redundancy-free, Quora aims to have each semantically distinct question answered only once (Bodnick 2015; Iyer et al. 2017). To facilitate this, all registered users can merge questions (Scharff 2015), with some complex merges requiring review by Quora staff (Wacker 2016). If questions are merged, future visitors are redirected to the incarnation of the question deemed to be phrased best by the merging user (Scharff 2015). In total, the data set comprises 149,496 distinct customer questions that are linked by feedbacks indicating whether two customer questions are semantic duplicates. One example is (“How do I get my iPhone out of recovery mode?”, “What do I do if my iPhone is stuck in recovery mode?”, Duplicate). The questions asked on Quora and contained in the chosen data set are similar in style to customer problems in online customer service. For this reason, the data set provides an appropriate setting to demonstrate our novel long-term feedback-based approach.

Instantiation and application

For the demonstration and later evaluation of our approach, we use the whole set of 149,496 distinct questions contained in the Quora data set as customer problems pj. Following conventional textual CBR approaches, we handle and store customer problems as representations in the well-established vector space model, which has proven successful in many retrieval applications (Burke et al. 1997; Manning et al. 2008; Salton et al. 1975). To transform the customer problems into vector representations, we preprocess the questions by removing stop words (Manning et al. 2008) and using a Porter stemmer (Manning et al. 2008; Porter 1980). Then, we convert each question into a tf-idf vector (Hua et al. 2009; Manning et al. 2008; Salton and Buckley 1988; Sebastiani 2002), limiting the total vocabulary size to Nvoc = 10,082 terms by omitting words which appear fewer than four times in the whole corpus (cf. Turney and Pantel 2010). As the semantic context generator G(p) learns to maximize \( sim\left(G\left({p}_i\right),{p}_c^s\right) \), it is capable of adapting to any textual CBR similarity measure. Here, we use the cosine similarity cos(pi, pj) which is commonly used for similarity determination in textual CBR with tf-idf vectorization (Bedué et al. 2018; Burke et al. 1997; Lenz et al. 1998b; Manning et al. 2008; Salton and McGill 1984).
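The preprocessing and vectorization just described could be instantiated roughly as follows. This is a hedged sketch assuming NLTK’s Porter stemmer and scikit-learn’s TfidfVectorizer, with parameters mirroring the text rather than a released implementation; the tokenization itself is deliberately simplified.

```python
# Sketch of the preprocessing pipeline: stop-word removal, Porter stemming, and
# tf-idf vectors restricted to terms occurring at least four times in the corpus.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

stemmer = PorterStemmer()

def tokenize(text: str) -> list[str]:
    """Simple whitespace tokenization, stop-word removal, and Porter stemming."""
    tokens = [token for token in text.lower().split() if token not in ENGLISH_STOP_WORDS]
    return [stemmer.stem(token) for token in tokens]

vectorizer = TfidfVectorizer(
    tokenizer=tokenize,
    lowercase=False,  # already lowercased in tokenize
    min_df=4,         # omit words appearing fewer than four times in the corpus
)
# problem_vectors = vectorizer.fit_transform(customer_problems)
```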

In line with CBR literature (Bedué et al. 2018; Burke et al. 1997; Lenz et al. 1998a), we refer to the retrieval for a specific customer problem as successful if at least one semantically similar case is contained within the k retrieved cases. In the following, we refer to the proportion of customer problems for which the retrieval was successful with respect to the total number of customer problems as successful retrievals (Bedué et al. 2018). Whenever necessary, we set k=5 as this is a reasonable number of cases to display at once for an employee to scan, a commonly made assumption in similar application contexts (cf. Balakrishnan et al. 2016; Burke et al. 1997). On this basis, we instantiate and validate our approach focusing on its major steps “Gathering Human Knowledge”, “Learning the Semantic Context Generator”, and “Adapting the Customer Problem”.
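The successful-retrievals metric can be sketched as follows; names and data structures are illustrative assumptions, not part of the original setup.

```python
# Sketch of the evaluation metric: a retrieval is successful if at least one of the
# k retrieved cases belongs to the same semantic cluster as the incoming problem;
# the metric is the proportion of successful retrievals over all test problems.
def proportion_successful_retrievals(retrieved: dict[int, list[int]],
                                     cluster_of: dict[int, int],
                                     k: int = 5) -> float:
    successes = sum(
        any(cluster_of[case_id] == cluster_of[problem_id] for case_id in case_ids[:k])
        for problem_id, case_ids in retrieved.items()
    )
    return successes / len(retrieved)
```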

Gathering human knowledge

In our approach, employees’ knowledge on the semantic similarity of customer problems is gathered in terms of feedback during the Reuse phase. In the chosen data set feedback has already been provided by humans and consolidated (Scharff 2015), which allows us to simulate the collection of feedback from employees. To this end, we assume that the pairs of questions contained in the Quora data set that are marked as semantic duplicates are semantically similar customer problems. In this regard, the Quora feedback process corresponds to an employee providing unary feedback FBij on customer problems, enabling us to create our employee feedback base FB directly from the data set. As a result, we obtain a feedback base FB comprising 149,263 distinct unary feedbacks FBij linking two semantically similar customer problems pi and pj.

Learning the semantic context generator

To derive the semantic context of customer problems based on the feedback stored in the feedback base FB, the semantic context generator has to be instantiated. To do so, we first determine the semantic clusters c; second, specify the encoding function E(c) to determine the semantic cluster vectors \( {p}_c^s \); and third, learn the semantic context generator G(p).

First, we determine the semantic clusters c by identifying all customer problems in the feedback base FB linked as semantically similar by feedback or a transitive semantic relationship. This results in 6,279 semantic clusters comprising between 2 and 109 customer problems. To learn the semantic relationship between customer problems, a sufficient number of them is needed within each cluster c. Therefore, and to support adequate splits in training, validation, and test data, we focus on the semantic clusters comprising more than 50 customer problems. This leads to 13 semantic clusters covering a total of 933 distinct customer problems and comprising between 51 and 109 customer problems each, which in the following we use for demonstration and evaluation purposes. To enable a thorough validation and evaluation of our approach, we conduct a five-fold cross-validation, splitting the 933 customer problems into equally sized balanced test sets and performing the subsequent steps for each of these (cf. Goodfellow, Bengio, & Courville, 2016). For each fold, the test set is removed prior to any processing.

Second, we specify the encoding function E(c) to create a semantic cluster vector \( {p}_c^s \) for each semantic cluster c maximizing \( sim\left({p}_i,{p}_c^s\right) \) for all customer problems pi ∈ c. In our vector space representation with cos(pi, pj) the cluster’s centroid is the vector with the highest average similarity to all customer problems pi ∈ c and is thus well-suited to represent the semantic cluster vector \( {p}_c^s \) of a cluster (Chen et al. 2006a; Crestani 1994). This is precisely the relationship exploited in Relevance Feedback, where the users’ short-term feedback is used to approximate the centroid of the cluster of documents relevant to a query (Manning et al. 2008). Hence, we define the encoding function E(c) as the cluster’s centroid:

$$ E(c)=\frac{1}{\left|c\right|}\sum \limits_{p_i\in c}{p}_i={p}_c^s $$
(2)

To empirically validate our choice of the cluster’s centroid as the semantic cluster vector \( {p}_c^s \), for each incoming customer problem pi we use the centroid \( {p}_c^s \) of the semantic cluster c it belongs to for retrieval \( \left({p}_i^a={p}_c^s\right) \). We find that this maximizes our evaluation metric, i.e. for each cross-validation fold each retrieval yields at least one customer problem that is semantically similar to the incoming customer problem, fulfilling the expectation that the cluster centroid yields optimal retrieval performance (cf. Manning et al. 2008).
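A small sketch of the encoding function E(c) from Equation (2) is given below; problem_vectors is assumed to be a matrix of tf-idf row vectors and each cluster a set of row indices.

```python
# Sketch of Equation (2): the semantic cluster vector p_c^s is the centroid of the
# tf-idf vectors of all customer problems belonging to cluster c.
import numpy as np

def encode_cluster(problem_vectors, cluster_indices) -> np.ndarray:
    """E(c): centroid of all problem vectors p_i in cluster c."""
    return np.asarray(problem_vectors[list(cluster_indices)].mean(axis=0)).ravel()

# cluster_vectors = [encode_cluster(problem_vectors, c) for c in clusters]
```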

Third, we instantiate and learn the semantic context generator G(pi) based on the customer problems (samples) and the corresponding semantic cluster vectors \( {p}_c^s \) (targets). In order to instantiate the semantic context generator G(pi), we use an artificial neural network due to its great potential to perform well on tasks involving large and sparse vectors, such as those present in our application context (Goodfellow et al. 2016). To find a suitable parameterization for the given task, we follow the procedure outlined by Ng (2018) and start with an initial configuration, validate how well it performs regarding a loss function, and then iteratively adjust the configuration to improve upon this baseline. As the loss function measuring the distance between the semantic context generator’s output G(pi) and the desired target \( {p}_c^s \), we choose the negative of the similarity function \( -sim\left(G\left({p}_i\right),{p}_c^s\right)=-\cos \left(G\left({p}_i\right),{p}_c^s\right) \), which reaches its minimum when the similarity function \( \cos \left(G\left({p}_i\right),{p}_c^s\right) \) of our textual CBR approach is maximized. Further, we opt to use the ReLU activation function (Glorot et al. 2011; Goodfellow et al. 2016) for all layers since its range of values [0, ∞) matches that of the vector for pi. To learn the semantic context generator, for each fold of the cross-validation we split the available data (note that the test set has been removed) into a training and a validation set, of which only the former is used for training (Goodfellow et al. 2016; Ng 2018). To verify that the semantic context generator was learned successfully, the model’s output for a customer problem pi is compared with the associated target from the training and validation set by means of the loss function. While the length of pi – the number of words in the vocabulary Nvoc – determines the number of neurons in the input and output layers, both the number of hidden layers and their size have to be determined by iterative experimentation. We find that two fully connected hidden layers with 5,000 neurons each are well suited to the task at hand. We further use dropout (Srivastava et al. 2014) for better generalization and stop the training of our neural network as soon as the loss on the validation set does not decrease for five epochs in order to prevent overfitting (cf. Prechelt 2012).
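One possible instantiation of this network, sketched with Keras under the configuration described above (two fully connected hidden layers of 5,000 ReLU units, dropout, the negative cosine similarity as loss, and early stopping with a patience of five epochs), could look as follows; the dropout rate, optimizer, and batch size are assumptions not specified in the text.

```python
# Sketch of the semantic context generator G(p) as a feed-forward neural network.
# Input and output size equal the vocabulary size N_voc; ReLU keeps outputs in [0, inf).
from tensorflow import keras

n_voc = 10_082  # vocabulary size, i.e. length of the tf-idf vectors

model = keras.Sequential([
    keras.layers.Input(shape=(n_voc,)),
    keras.layers.Dense(5000, activation="relu"),
    keras.layers.Dropout(0.5),                     # dropout rate is an assumption
    keras.layers.Dense(5000, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(n_voc, activation="relu"),  # output shaped like a tf-idf vector
])

# Keras' CosineSimilarity loss is the negative cosine similarity, so minimizing it
# maximizes cos(G(p_i), p_c^s) as required.
model.compile(optimizer="adam", loss=keras.losses.CosineSimilarity())

early_stopping = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                               restore_best_weights=True)
# model.fit(X_train, Y_train, validation_data=(X_val, Y_val),
#           epochs=200, batch_size=32, callbacks=[early_stopping])
```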

We find that the neural network chosen for G(pi) is well suited to infer the semantic cluster vector \( {p}_c^s \) for an incoming customer problem pi as the average losses are nearly −1.0 which represents an optimal result for the chosen loss function. Specifically, the average loss on our training sets yields −0.98 while the average loss on the validation set is −0.97, indicating good generalization. The average test set loss is comparable as well (−0.97).

Adapting the customer problem

Finally, the output of the semantic context generator \( {p}_i^s=G\left({p}_i\right) \) is used to adapt the incoming customer problem pi, creating the adapted customer problem \( {p}_i^a \). To this end, we have to determine the weights α ∈ [0, ∞) and β ∈ [0, ∞) for merging the incoming customer problem pi and its semantic context \( {p}_i^s \). To approach this issue, we use the customer problems from the validation set as input in the Retrieve phase of our adapted CBR cycle and optimize α and β regarding the proportion of successful retrievals. As the semantic context generator G(pi) outputs the semantic clusters’ centroids almost perfectly, α = 0.0 and β = 1.0 are expected to yield the optimum proportion of successful retrievals. Indeed, we find that increasing α above 0.0 decreases the metric. If, however, customer problems in other application contexts deal with multiple topics simultaneously, α > 0.0 is reasonable in order to consider both the customer problem and its semantic context.

Finally, we apply the instantiated approach and use the customer problems from the test set as input for the Retrieve phase of our adapted CBR cycle. In doing so, we incorporate employees’ feedback on the semantic similarity of customer problems contained in the feedback base FB by generating the semantic context of each customer problem in the test set. Subsequently, the semantic context is used to adapt the customer problem to improve the retrieval of semantically similar customer problems. Preprocessing the customer problem, generating the semantic context, and adapting the customer problem requires just a few milliseconds even on a standard laptop computer and thus significantly less time than the CBR retrieval from a large case base. By this means, we are able to adequately assist employees in online customer service with a fast and at the same time high-quality retrieval of semantically similar cases.

Evaluation of the approach

As demanded by the DSR process (Peffers et al. 2007), we conduct a summative evaluation (Venable et al. 2016) of our approach. The goal of this evaluation is to show that our long-term feedback-based approach is indeed an improvement (Gregor and Hevner 2013) in terms of both effectiveness and efficiency over existing approaches in our problem context. To this end, we compare the performance of our novel hybrid approach on the Quora data set chosen for its instantiation with solely machine-based (fully automated), hybrid, and entirely human-based (manual) approaches.

To evaluate our approach against machine-based as well as hybrid approaches, we compare its performance against that of competing artifacts in textual CBR. To ensure comparability, all considered approaches are based on the same conventional textual CBR approach. As the baseline (BL) we take the solely machine-based CBR retrieval of related customer problems pj in the absence of human feedback and hence without any further adaption of the incoming customer problem (\( {p}_i^{a\_BL}={p}_i \)) (Bedué et al. 2018). Additionally, we use Relevance Feedback (RF) and Pseudo-Relevance Feedback (PRF), probably the two most established short-term feedback approaches from information retrieval (cf. Manning et al. 2008), which have already been integrated into textual CBR approaches (Daniels and Rissland 1997). As in our approach, we utilize the well-known Rocchio query reweighting technique to integrate these approaches into the conventional textual CBR cycle.

Relevance Feedback (Manning et al. 2008; Salton and Buckley 1990), the first competing artifact, requires the customer service employee to provide feedback on whether each of the top five retrieved customer problems is semantically similar to the incoming customer problem. As the Quora data set already contains the semantic relationship between customer problems (cf. “Gathering Human Knowledge”), we can simulate the respective human-machine collaboration. On this basis, the adapted customer problem \( {p}_i^{a\_RF} \) is obtained as the weighted combination of the incoming customer problem pi and the simulated human feedback regarding semantically similar and not semantically similar cases:

$$ {p}_i^{a\_RF}=\alpha\, {p}_i+\beta \sum_{\text{similar}} {p}_k-\gamma \sum_{\text{not similar}} {p}_k $$
(3)

We integrate Pseudo-Relevance Feedback as the second competing artifact. Here, each of the top five customer problems returned by the conventional textual CBR retrieval is taken to be relevant and used to create the adapted customer problem \( {p}_i^{a\_ PRF} \) (Manning et al. 2008). Thus, the adapted customer problem \( {p}_i^{a\_ PRF} \) is modeled as the weighted combination of the incoming customer problem pi and the top five retrieved customer problems marked as semantically similar:

$$ {p}_i^{a\_ PRF}=\alpha {p}_i+\frac{1}{5}\beta \sum \limits_{l=1}^5{p}_l $$
(4)
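
For illustration, the two competing adaptation rules of Eqs. (3) and (4) can be written as simple vector operations. The sketch below is not the authors’ code; all vectors are assumed to be bag-of-words representations, and the pre-factors are tuned as described next.

```python
import numpy as np


def adapt_relevance_feedback(p_i, similar, not_similar, alpha, beta, gamma):
    """Eq. (3): Rocchio-style adaptation from explicit employee feedback on the top five results."""
    return (alpha * p_i
            + beta * np.sum(similar, axis=0)
            - gamma * np.sum(not_similar, axis=0))


def adapt_pseudo_relevance_feedback(p_i, top_five, alpha, beta):
    """Eq. (4): all top five retrieved problems are assumed to be semantically similar."""
    return alpha * p_i + (beta / 5.0) * np.sum(top_five, axis=0)
```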

For both competing approaches, in the absence of a deterministic algorithm (Moschitti 2003) we empirically determine the optimal pre-factors α, β, and γ on the validation set by setting α = 1.0 and using a simple hill-climbing algorithm (Russell and Norvig 2010). In the case of Relevance Feedback, we first obtain an optimal value for \( \frac{\alpha }{\beta } \) keeping γ = 0 and subsequently repeat the procedure varying γ while keeping \( \frac{\alpha }{\beta } \) fixed.
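
A rough sketch of such a hill-climbing procedure is given below; the step size, iteration limit, and the helper successful_retrievals (which would compute the proportion of successful retrievals on the validation set) are hypothetical and only serve to illustrate the tuning loop.

```python
def hill_climb(evaluate, start: float, step: float = 0.1, max_iter: int = 100) -> float:
    """Greedily adjust a single non-negative parameter as long as the metric improves."""
    best, best_score = start, evaluate(start)
    for _ in range(max_iter):
        improved = False
        for candidate in (best + step, max(0.0, best - step)):
            score = evaluate(candidate)
            if score > best_score:
                best, best_score, improved = candidate, score, True
        if not improved:
            break
    return best


# First tune beta with alpha = 1.0 and gamma = 0, then tune gamma with beta fixed:
# beta = hill_climb(lambda b: successful_retrievals(alpha=1.0, beta=b, gamma=0.0), start=1.0)
# gamma = hill_climb(lambda g: successful_retrievals(alpha=1.0, beta=beta, gamma=g), start=0.0)
```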

As the metric to evaluate the effectiveness of the machine-based as well as hybrid approaches, we calculate the proportion of successful retrievals by comparing the cases retrieved with the set of semantically similar cases in the corresponding semantic cluster as ground truth. This metric is chosen as it clearly reflects the effectiveness of an approach with regard to providing customers with the requested knowledge: Retrieving semantically similar cases more often increases the effectiveness of employees in online customer service as the knowledge to craft a solution is readily available.
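
A minimal sketch of this metric is shown below; it is illustrative only and assumes a mapping from each case to its semantic cluster as ground truth and a ranked list of retrieved case ids per query.

```python
def proportion_successful(retrieved_per_query: dict, cluster_of: dict, k: int = 5) -> float:
    """Fraction of queries for which at least one of the top-k retrieved cases
    belongs to the query's semantic cluster."""
    successes = 0
    for query_id, retrieved_ids in retrieved_per_query.items():
        query_cluster = cluster_of[query_id]
        if any(cluster_of[case_id] == query_cluster for case_id in retrieved_ids[:k]):
            successes += 1
    return successes / len(retrieved_per_query)
```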

For each of the approaches, we conduct a five-fold cross-validation as introduced above, measuring the proportion of successful retrievals on each of the five test sets. Averaged over the cross-validation folds (cf. Fig. 3), our approach retrieves a semantically similar customer problem among the top five retrieved problems (i.e., k = 5) in 97.9% of cases. In contrast, the baseline approach performs considerably worse, as only 86.3% of retrievals yield a semantically similar customer problem for k = 5, indicating inferior effectiveness. The retrieval results of the competing artifact based on Pseudo-Relevance Feedback are affected negatively by incorrect cases among the top five retrieval results of the baseline approach. Thus, depending on k, the Pseudo-Relevance Feedback approach only slightly outperforms or even fails to surpass the baseline, retrieving a semantically similar customer problem for k = 5 in 85.1% of all cases. Relevance Feedback, in contrast, returns a semantically similar customer problem in 88.9% of all cases. Since Relevance Feedback only considers the top five problems retrieved by the baseline, its effectiveness for the very first retrieved customer problem (k = 1) is approximately equal to the baseline performance for k = 5 and thus better than that of our approach for k = 1, but it does not increase much beyond that level. Our approach, in contrast, surpasses Relevance Feedback already for k = 2.

Fig. 3 Evaluation of our approach in comparison to competing artifacts

To compare the efficiency of our hybrid approach against the competing solely machine-based and hybrid approaches, we first focus on the time required for retrieval. As the approaches are based on the same conventional CBR approach, the steps required by an employee to work with them are identical. Our hybrid approach merely requires about 0.04 s more computing time (on a standard laptop) than the baseline approach, whereas, due to the twofold retrieval, Relevance and Pseudo-Relevance Feedback require at least double the baseline computing time. In the case of Relevance Feedback, the manual feedback phase in practice likely dominates the overall retrieval duration. Hence, in terms of retrieval duration our approach yields virtually the same level of efficiency as the baseline approach while outperforming the competing short-term feedback approaches. Further, in the evaluated application scenario, this efficiency advantage of our approach is amplified by the high rate of successful retrievals: Since our approach returns at least one semantically similar customer problem in close to 98% of cases, employees need to undergo the lengthy process of crafting a solution from scratch significantly less often than when using the baseline or short-term feedback approaches.

Comparing our hybrid approach to an entirely human-based approach in terms of effectiveness and efficiency is hardly possible without an additional field study in which service experts are timed while phrasing solutions to customer problems that are then rated by customers regarding their quality. While such a study is out of scope of the present work, we can nevertheless compare the approaches by relying on fair assumptions. Regarding effectiveness, we argue that providing a service expert with semantically similar cases in addition to their own knowledge does not lower effectiveness in terms of providing the customer with the requested knowledge. On the contrary, providing employees with relevant knowledge to solve customer problems should reduce the limitations and errors inherent in employees phrasing solutions from scratch based solely on their own knowledge. With regard to efficiency, we rely on empirical data on typical reading and typing speeds to estimate that a service expert on average requires 4 min to phrase a solution to a customer problem from scratch (under the generous assumption that service experts are able to immediately type out solutions to customer problems without prior deliberation). In contrast, for our hybrid approach we estimate an average total time of up to 3:30 min to provide a solution, taking into account the cases where no semantically similar past customer problem can be retrieved and allowing for up to 1 min for the selection and adaption of the reused solution. For details on the assumptions underlying these estimates, see Appendix II.

Taken together, our long-term feedback-based approach clearly outperforms solely machine-based and hybrid approaches in terms of effectiveness and outperforms both the baseline textual CBR approach as well as the competing short-term feedback approaches in terms of efficiency. Finally, our hybrid approach is arguably more effective and clearly more efficient than an entirely human-based approach.

Discussion, limitations, and future research

Truly understanding the semantic meaning of texts is still a challenge for computer systems. In particular, prior textual CBR approaches do not fully leverage the complementary strengths of human and artificial intelligence. Against this background, we proposed a novel long-term feedback-based approach taking humans’ capability to semantically understand texts into account. Our results illustrate that integrating employees’ knowledge into textual CBR by means of long-term feedback leads to superior results. In detail, our contributions to theory and practice are threefold.

First, our approach improves upon existing approaches (cf. Gregor and Hevner 2013) by fostering a synergistic human-machine collaboration. Compared to conventional CBR approaches, which only evolve with the number of customer problems solved, our approach additionally incorporates humans’ feedback as a learning component by exploiting the Reuse phase and augmenting the Retrieve phase. Thus, on the one hand, the adapted customer problem of our approach contains human knowledge regarding the semantic similarity of customer problems, leading to a refined retrieval of semantically similar cases. On the other hand, machines’ superior capabilities are exploited as semantically similar problems are retrieved at a rapid pace, fostering the efficient solving of customer problems. Furthermore, our approach provides all employees with relevant and detailed knowledge beyond their own to solve customer problems. Hence, we also contribute to more effective as well as consistent solutions in online customer service. As a result, through the incorporation of feedback, our approach enables a service quality level greater than (successful retrieval) or equal to (no successful retrieval) that achieved without the approach. In the rare case that our approach does not return a semantically similar case, employees can still phrase a solution from scratch, just as in the entirely manual approach. If, however, employees instead reuse solutions from semantically different cases presented by the system, effectiveness might be affected adversely. Hence, to avoid this competing effect, employees must be provided with clear instructions on how to work with our approach.

Second, our novel long-term feedback-based approach contributes to the development of a refined textual CBR approach incorporating humans’ superior capability to semantically understand the relation between customer problems. Since conventional textual CBR approaches suffer from the drawback that a close match regarding the similarity of two customer problems does not necessarily mean that they are semantically similar as well, human guidance is still needed to determine the semantic meaning of customer problems. Here, our approach improves upon previous approaches by leveraging employees’ capability to capture the semantic relation between customer problems in terms of feedback. Furthermore, as our approach adequately assists employees with a fast and at the same time high-quality retrieval of semantically similar cases, employees and organizations benefit from it insofar as the quality of provided solutions improves over time. Thus, we take a step towards fully grasping the semantic meaning of texts in customer service, which is still a challenge for automated approaches (Berners-Lee et al. 2001; Embley 2004; Khanapure and Chirchi 2013; Wang et al. 2006b, 2011).

Finally, our study builds upon research in textual CBR (Burke et al. 1997; Lenz and Burkhard 1997) and extends this line of thought by integrating a long-term feedback approach from information retrieval (e.g. Crestani 1994, 2000; Mandl 2000; Lin et al. 2011; Mitra and Craswell 2017). In doing so, we merge concepts from research in textual CBR with recent long-term feedback approaches in information retrieval. Since prior literature does not provide such an integrated perspective on textual CBR and long-term feedback approaches, we address this gap and substantially extend existing contributions.

Besides its benefits, our approach and study also entail limitations that can serve as starting points for future research. First, regarding the demonstration and evaluation of our approach, the implementation and evaluation of our hybrid approach within an operating online customer service department remain a desideratum. Our work paves the way to empirically investigate the influence that hybrid approaches such as ours may have on efficiency and effectiveness in online customer service as well as the trade-off that could appear between them. Further, for the demonstration and evaluation, we considered only one data set. Although the customer problems contained in the open-domain data set published by the popular service website Quora conform to the properties of customer problems in online customer service, a single customer problem in practice could refer to multiple different relevant topics and might contain significantly more text. However, the feedback on duplicates in a service context closely resembles the feedback on semantic similarity of customer problems, and the free availability of the data set enables a direct and rigorous quantitative comparison of (future) feedback-based approaches. Nevertheless, as one possible next step in exploring long-term feedback-based approaches, we encourage researchers to apply and evaluate our approach in real-world customer service settings. Second, while we focused on integrating long-term feedback by adapting the customer problem prior to the core CBR process, it also seems promising to investigate the integration of long-term feedback into other parts of the CBR cycle. Promising starting points include adapting the similarity function (Huang et al. 2013; Weis 2013) or gathering implicit feedback on the basis of employees’ behavior when reviewing and selecting cases. Finally, while we only consider employees’ feedback on customer problems, it could also prove beneficial to integrate further information during query adaption. For example, various kinds of data available about the customer (e.g. past communication, purchases, personal information) might be included. This additional context information may help to further increase the quality of the proposed solutions, for instance, when an incoming customer problem contains references to previous requests or omits important order details that can be deduced from the customer’s purchase history. Further, it might be beneficial to weight recent feedback higher than earlier feedback to reflect refinements in the employees’ understanding of semantic similarities or changes in the company’s policies (e.g., one of two similar products is discontinued by an organization and therefore earlier feedback linking these two products is outdated).

Conclusion

Nowadays, organizations face the challenge of meeting customers’ demand for reduced response times while handling customer requests with a consistently high level of service quality. Since automated approaches still struggle to truly understand the full semantic meaning of texts (Berners-Lee et al. 2001; Khanapure and Chirchi 2013), human guidance through feedback remains necessary. Despite extensive scientific work in the fields of textual CBR and information retrieval, so far no study has considered intensifying hybrid human-machine collaboration to enhance case retrieval for new customer problems by investigating the semantic relationships between free-text cases through long-term feedback. To this end, we propose a novel approach in textual CBR which incorporates human knowledge in terms of long-term feedback. We gather employees’ feedback regarding the semantic similarity of customer problems in the Reuse phase of the CBR cycle. The collected feedback from all employees is used to create semantic clusters and to train a semantic context generator. Finally, the semantic context of incoming customer problems is determined to enhance the retrieval of semantically similar cases. The demonstration and evaluation based on a real-world data set illustrate that our long-term feedback-based approach clearly outperforms solely machine-based and hybrid approaches in terms of effectiveness, retrieving semantically similar customer problems in 98% of cases compared to 87% (baseline CBR) and at most 89% (Relevance Feedback). Further, our approach outperforms the baseline textual CBR approach in terms of efficiency, as employees need to provide a solution from scratch less frequently, reducing the average time to provide a solution by at least 12.5%. It is also more efficient than the competing short-term feedback approaches, as it requires only a single retrieval. Additionally, it is arguably more effective and clearly more efficient than an entirely human-based approach. Thus, our approach improves performance in online customer service. It fosters a synergistic human-machine collaboration and contributes to the development of a more refined textual CBR approach regarding the semantic relation of customer problems by merging concepts from research in textual CBR and long-term feedback approaches in information retrieval. Against this background, our approach constitutes a promising first step towards overcoming current challenges in understanding the semantic meaning of texts in textual CBR and beyond.