1 Introduction

Although growing global markets can increase the number of customers and sales volumes of companies, they also expose companies to constant pressure to offer products in a high-quality, cost-effective, and fast manner; otherwise, global competitors can take over market positions (Thomassey and Happiette 2007). To offer products cost-effectively and quickly, a sophisticated sales planning strategy plays a decisive role as it can result in ‘improved customer service, reduced lost sales and product returns, and more efficient production planning’ (Doganis et al. 2006). On the other hand, incorrect sales forecasting can lead to increased inventories or, vice versa, to stock-outs, as production either surpasses or falls short of market demand (Petropoulos et al. 2018). Reports from the U.S. Census Bureau and the National Retail Federation indicate that retailers lose approximately $200 billion in revenue annually due to inaccurate sales forecasting (Kumar and Patel 2010).

To achieve sales planning that is as accurate as possible, companies usually use forecasting models. Those forecasting models project demand quantities into the future, based on mathematical models. Such models are usually based on past sales of the products and try to incorporate trends and seasonalities (Ali et al. 2009). Since these models are based on past sales, they are relatively easy to understand, which helps these procedures to become established in practice. Meanwhile, however, new types of forecasting methods based on machine learning (ML) have emerged. These ML-based methods offer great potential since, in addition to historical data (the primary source of information in previous methods), they include additional data such as prices, promotions, discounts, holidays, or weather data (Ali et al. 2009; Liu and Ichise 2017; Žliobaitė et al. 2012). Particularly in dynamic markets, which are characterized by various influencing factors, machine learning methods can support accurate sales planning through the use of a wide variety of input information (Tarallo et al. 2019). Current studies (Ali et al. 2009; Carbonneau et al. 2008; Tarallo et al. 2019; Thomassey and Happiette 2007; Tsoumakas 2019) show that ML-based forecasting methods raise the quality of sales planning (comparison of actual sales with the forecasted sales) in comparison with classical forecasting models. However, due to the greater complexity of ML-based methods, they are often perceived as a black box, where users do not fully understand how certain results were forecasted (Petropoulos et al. 2018). This non-transparency can even lead to a complete rejection of the entire forecast (Amershi et al. 2019; Gilpin et al. 2018). Moreover, not only the model itself but also the user interface and its explanations determine whether the underlying model is applied, accepted, and trusted (Amershi et al. 2019).

To explore these challenges and opportunities in more detail, we investigate in this paper the use of ML-based sales forecasting using the example of a medium-sized German bakery in a rural area. The production site of the local bakery supplies its own local branches and several other customers; precise sales planning is therefore necessary. Otherwise, too much quickly perishable food is thrown away, which causes high costs on the one hand and is an unsustainable way of working on the other. We therefore designed and implemented a software system that encompasses ML-based sales forecasting and offers new functionality for the ordering processes. Through our evaluation, we highlight the socio-technical challenges in planning and implementing ML, and the basic requirements and concerns that still exist despite a high forecasting accuracy.

In what follows, we look at related work in the area of sales planning as well as machine learning and outline current acceptance criteria for artificial intelligence from a user’s perspective. In Section 3, we explain our research approach and the research gap. Subsequently, we give an insight into our empirical study within the research field (Section 4). Based on the empirical findings of our qualitative approach, we present our derived concepts and their implementation in Sections 5 and 6. We implemented a software solution for a digital ordering process and applied the machine learning algorithm, taking into consideration the available data and derived factors that might influence the sales of the bakery. In Section 7, we describe the evaluation of our software which we discuss in Section 8 with regard to the extant literature. We conclude in Section 9 with the lessons learned and implications for future work and summarize our contribution to the area of machine-learning based sales forecasting in the food industry.

2 Related work

Our work combines two research areas. The first area focuses on sales planning and forecasting where we take a specific look at the food industry. The second area examines machine learning (transparency) and acceptance criteria from a Human-AI-Interaction perspective.

2.1 Sales planning and learning-based sales forecasting

Sales planning deals with forecasting future sales within a market segment and can be a critical success factor for a company (Tarallo et al. 2019). Sales planning is usually located in the sales department and is executed on the basis of past sales within specific product groups. The sold quantities within the product groups and time periods must then be disaggregated accordingly to provide figures for planning. Current planning is often based on procedures such as the mean value, which suggests a constant value for the future and thus does not consider factors such as fluctuations in customer demand (Žliobaitė et al. 2012). However, since sales planning influences all underlying processes, such as scheduling, production, and purchasing, it is important to have as accurate a basis for planning as possible (Doganis et al. 2006; Tarallo et al. 2019; Tsoumakas 2019). Incorrect or low-quality sales planning usually leads to increased or decreased inventories due to a lack of sales or to incorrect allocations of production capacities (Petropoulos et al. 2018). A variety of different parameters, such as the procurement of available material or marketing campaigns, have an influence on sales planning (Ali et al. 2009), which, on the one hand, calls for an interdisciplinary approach in sales planning (Lee et al. 2012), but on the other hand, makes appropriate forecasting even more difficult and complex.
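To illustrate why a constant mean-value forecast ignores demand fluctuations, the following minimal sketch compares it with a mean computed per weekday. The sales figures and function names are invented for illustration and are not taken from the case study.

```python
from statistics import mean


def mean_forecast(history):
    """Constant forecast: the mean of all past daily sales."""
    return mean(history)


def weekday_mean_forecast(history, weekday, period=7):
    """Mean of past sales observed on the same weekday (0 = first day of the week)."""
    same_day = history[weekday::period]
    return mean(same_day)


# Hypothetical daily sales of one product over two weeks (Mon..Sun, Mon..Sun).
sales = [100, 90, 95, 100, 120, 200, 0,
         110, 90, 85, 100, 130, 220, 0]

constant = mean_forecast(sales)                      # same value for every day
saturday = weekday_mean_forecast(sales, weekday=5)   # 210
sunday = weekday_mean_forecast(sales, weekday=6)     # 0
```

In this invented series, the constant forecast of roughly 103 units would undersupply Saturdays by about half and oversupply Sundays entirely, which is the kind of fluctuation a mean value cannot capture.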

While good sales planning plays a very important role in the manufacturing industry, it is especially important in the food industry, since food usually has a very short shelf life (Doganis et al. 2006). This short shelf life, as well as the sometimes strong dependence on external factors such as the season (summer/winter), makes it important to plan sales as accurately as possible (Chen et al. 2010). Arunraj and Ahrens (2015) even recommend a day-by-day forecast in the food industry. To support the planning process, several mathematical methods have proven useful for predicting future sales based on past data (Petropoulos et al. 2018). Since these methods are based on easily understandable parameters, such as past purchases in a certain period of time, they usually have a high acceptance in practice from a user’s perspective (Ali et al. 2009).

Meanwhile, sales forecasting approaches with prediction models based on machine learning methods have also been developed. Within the machine learning field, algorithms are derived from a model that adapts to training data by learning its structure and relationships, without being explicitly programmed (Alloghani et al. 2020; Tsoumakas 2019). Machine learning is broadly divided into two types: supervised and unsupervised learning (Alloghani et al. 2020). Supervised learning uses existing input and output pairs to learn a mapping function. The relationships in the prelabeled training data are learned, and the resulting model is evaluated on a test dataset to estimate its accuracy in predicting or classifying. On the other hand, unsupervised learning uses unlabeled data and tries to discover the data structure (features) on its own, which is mostly used for clustering and feature reduction (Alloghani et al. 2020; Mahesh 2020). Based on the so-called ‘learned model’, concrete data are used and processed (Norris 2017). The models can be divided into different categories that are specialized for the respective use case, context, or goal, for example, object identification and categorization or the prediction of a data series. In contrast, traditional so-called ‘rule-based systems’ are characterized by ‘conditional if–then rules’ and use predefined rules defined by humans to select, apply, and combine them according to a given problem to provide a result (Hayes-Roth 1985). However, machine learning-based sales planning often produces more accurate results than traditional methods using the same data, by uncovering rules and relationships (Carbonneau et al. 2008). Recently, reinforcement learning has appeared as a third type of machine learning. Reinforcement learning is neither based on supervised learning nor unsupervised learning. Instead, the algorithms learn to react to an environment on their own. The algorithms ‘learn’ how to map situations to actions by maximizing a numerical reward. The learner is not told which actions to take, but instead must discover what actions yield the most reward by trying them (Nandy and Biswas 2018; Sutton and Barto 2018). Since we will be learning from available data in our use case, supervised machine learning was chosen.

One such machine learning model is the support vector machine (SVM), a supervised learning model with associated learning algorithms that analyze data for classification and regression analysis (Du et al. 2013). For classification, an SVM tries to divide the training data into classes, where the class of each training example is known in advance. To separate the classes, the SVM draws a hyperplane (the separating surface by which the objects are divided) such that the distance between the hyperplane and the nearest points of each class, which are represented as n-dimensional vectors, is the greatest (Ma et al. 2017). In other words, it always tries to maximize the margin around the hyperplane. This should serve to assign objects that cannot be unambiguously assigned to a class to the closest similar one. Thus, the membership of a new data point is predicted by deciding on which side of the hyperplane it lies (Du et al. 2013; Ma et al. 2017). A support vector machine can be used when the parameters that are used as input values and the training data are known. However, the formation of the final model for the data remains hidden from the user. This often leads to difficulties in interpreting and explaining the results. In addition to the results, other output values can be taken, such as the vector representing the feature weights and some instances of the training data, which are called support vectors (Ma et al. 2017).
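The geometric idea behind the SVM can be sketched without any training procedure. The following minimal example, which uses invented two-dimensional points and two hand-picked candidate hyperplanes (all names and values are assumptions for illustration), computes the margin of each candidate and classifies new points by the side of the hyperplane on which they lie. It does not implement SVM training itself; an actual SVM would search for the hyperplane with the largest margin.

```python
import math


def decision_value(w, b, x):
    """Value of w . x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Assign class +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def geometric_margin(w, b, points, labels):
    """Smallest signed distance of any training point to the hyperplane.

    The result is negative if at least one point lies on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm
               for x, y in zip(points, labels))


# Invented training data: three points per class.
points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
labels = [-1, -1, -1, 1, 1, 1]

# Two hand-picked separating hyperplanes (lines in 2D), given as (w, b).
candidate_a = ((1.0, 0.0), -3.0)   # the line x = 3
candidate_b = ((1.0, 1.0), -5.5)   # the line x + y = 5.5

margin_a = geometric_margin(*candidate_a, points, labels)
margin_b = geometric_margin(*candidate_b, points, labels)

# The candidate with the larger margin is the one an SVM would prefer.
best_w, best_b = candidate_b if margin_b > margin_a else candidate_a
new_point_class = predict(best_w, best_b, (3, 3))
```

Both candidates separate the invented data, but the second one keeps a larger distance to the nearest points of each class and is therefore closer to what a maximum-margin classifier aims for.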

Carbonneau et al. (2008) show that the forecast quality achieved (the comparison of actual sales with the forecast) is higher when using machine learning than when using traditional forecasting methods. The findings of Žliobaitė et al. (2012) underline this by showing that a set of features that characterize the sales is superior to a simple moving average calculation. Tsoumakas (2019) consolidates and shows a variety of already existing approaches for food sales forecasting in terms of input variables. Tarallo et al. (2019) state that ML with its variety of variables (influencing factors) outperforms statistical techniques, resulting in emerging economic benefits. Chen et al. (2010) apply ML techniques to forecast fresh food for convenience stores to meet fluctuating demand and avoid stock outages, as well as to structure large orders to improve customer satisfaction, sales, and costs.
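Forecast quality in the sense used here, the comparison of actual sales with the forecast, can be quantified in several ways. As a minimal sketch, the following example uses the mean absolute error and a split into oversupplied and undersupplied units. The choice of metric, the function names, and all numbers are assumptions for illustration; the cited studies may use other measures.

```python
def mean_absolute_error(actual, forecast):
    """Average absolute deviation between actual and forecasted sales."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def over_and_under(actual, forecast):
    """Total units forecasted above demand and total units forecasted below demand."""
    over = sum(max(f - a, 0) for a, f in zip(actual, forecast))
    under = sum(max(a - f, 0) for a, f in zip(actual, forecast))
    return over, under


# Invented figures for four days; neither forecast is the output of a real model.
actual = [100, 120, 80, 150]
baseline = [110, 110, 110, 110]      # constant mean-value style forecast
illustrative = [105, 115, 85, 140]   # forecast that follows the fluctuations

baseline_error = mean_absolute_error(actual, baseline)          # 22.5
illustrative_error = mean_absolute_error(actual, illustrative)  # 6.25
baseline_over, baseline_under = over_and_under(actual, baseline)
```

Separating oversupply from undersupply can be useful in the food sector, because the former corresponds to returns and waste and the latter to lost sales.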

2.2 Machine learning-based sales forecasting in the food industry

The food industry has special requirements as the produced goods usually have a short shelf life and must meet a high quality standard (Doganis et al. 2006). There are a variety of factors influencing sales planning in the food sector. These factors can be interpreted as a reflection of customer needs, which have a great influence on forecasting. Arunraj and Ahrens (2015) list various studies, such as that of Ali et al. (2009), who included the type of promotion, price, and percentage discount as influencing factors. Other studies also consider factors such as holidays, the days immediately before holidays, type of day (working day, Sunday), festivals, climate/weather, brand loyalty, marketing activities, substitution or cannibalization effects, or school holidays (Liu and Ichise 2017; Tsoumakas 2019). Studies such as Agnew and Thornes (1995) deal purely with the influence of weather on purchasing behavior. Arunraj and Ahrens (2015) summarize the influencing factors regarding food (Figure 1). Those factors can be divided into three groups: (1) internal factors, such as price and product characteristics, which are controllable from a company’s perspective; (2) substitution and cannibalization, which are partly internally and partly externally controlled and can therefore only be controlled to a limited extent; (3) all other factors included in the group of external factors that are not controllable.

Figure 1. Classification of demand influencing factors (Arunraj and Ahrens 2015).
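Influencing factors of this kind have to be turned into input values before a model can use them. The following minimal sketch shows one possible encoding for a single day. The feature names, the selection of factors, and the example values are assumptions for illustration and do not reproduce the feature set of any cited study.

```python
from datetime import date, timedelta


def build_features(day, holidays, temperature_c, on_promotion):
    """Encode calendar, weather, and promotion information for one day."""
    return {
        "weekday": day.weekday(),                  # 0 = Monday ... 6 = Sunday
        "is_sunday": day.weekday() == 6,
        "is_holiday": day in holidays,
        "is_day_before_holiday": (day + timedelta(days=1)) in holidays,
        "temperature_c": temperature_c,            # external, not controllable
        "on_promotion": on_promotion,              # internal, controllable
    }


# Example holiday calendar with a single entry (German Unity Day 2021).
holidays = {date(2021, 10, 3)}

# Hypothetical weather and promotion values for the day before the holiday.
features = build_features(date(2021, 10, 2), holidays,
                          temperature_c=14.5, on_promotion=False)
```

For the example date, a Saturday, the encoding marks the day as the day before a holiday, which is one of the factors listed above.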

The mathematical methods currently in wide use for calculating a forecast are relatively intuitive models, as they use historical data, e.g., past sales figures, as the basis for calculating the forecast (Tarallo et al. 2019). It often happens that there are gaps in the data or time series for products, or that these are distorted because the effects of promotions and similar events are still included. These outliers or peculiarities have to be corrected to provide a robust basis for the calculation (Pavlyshenko 2019). Despite the fact that the methods use only one source of information (historical data), these models deliver satisfactory results and show their validity in practice (Ali et al. 2009). However, the large number of factors influencing sales shows that it is worthwhile to consider more parameters to derive more valid statements about future sales (Ali et al. 2009). The SVM shows good results in the food sector by considering the relationships of the different input parameters to each other (Ali et al. 2009; Du et al. 2013).
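How gaps and promotion effects are corrected depends on the method and is not specified here. As one simple possibility, the following sketch replaces missing values and promotion days with the median of the regular days. The replacement rule, the function name, and the numbers are assumptions for illustration.

```python
from statistics import median


def clean_series(sales, promo_days):
    """Replace gaps (None) and promotion days with the median of regular days.

    sales: list of daily quantities, with None marking a missing value.
    promo_days: set of list indices on which a promotion was running.
    """
    regular = [s for i, s in enumerate(sales)
               if s is not None and i not in promo_days]
    if not regular:
        raise ValueError("no regular days available to estimate a typical value")
    typical = median(regular)
    return [typical if (s is None or i in promo_days) else s
            for i, s in enumerate(sales)]


# Invented series: one missing day (index 1) and one promotion peak (index 3).
raw = [100, None, 95, 300, 105, 90]
cleaned = clean_series(raw, promo_days={3})
```

A median is used in this sketch because it is less sensitive to remaining outliers than a mean; other choices, such as interpolation between neighboring days, would be equally plausible.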

2.3 Machine learning from an end-user perspective

The use of machine learning, or more broadly artificial intelligence (AI), poses new challenges for users, and also developers, in terms of the presentation of results (Dove et al. 2017). Over the years, recommendations and guidelines have emerged in the area of HCI (Human–Computer Interaction) to help design interfaces that provide the end-user with the most effective and efficient yet empowering way to perform tasks (for example, Abascal and Nicolle 2005; Dünser et al. 2007; Hassenzahl 2013). One characteristic is that systems react consistently at all times, which means that a function always runs in the same way and produces the same result. Thus, a user can anticipate the result (Amershi et al. 2019).

However, systems that integrate machine learning functions deviate from such patterns. They can produce different results because these systems adapt their results over time (Graaf et al. 2017; Lee et al. 2014). Examples include search engines or news feeds that display personalized results based on different factors, such as the search history of an individual or other entity. This unpredictability of results can irritate and unsettle end-users. In some circumstances, it leads to users rejecting the system completely, as they cannot understand or interpret it (Graaf et al. 2017; Lee et al. 2014). Consequently, approaches are needed that provide a way of interpreting and explaining the results. However, trying to define interpretability is difficult. Many studies equate interpretability with transparency. Others equate interpretability with trust, which itself can take a variety of forms, e.g., trust as good results from a model or trust as controlling a model. Interpretability itself could also simply mean that the models and how they work are understandable. However, what is sufficient for one user might be insufficient for another as their focus is on a different type of interpretability, or they may have different background knowledge (Doran et al. 2017; Hirsch et al. 2017). Transparency and interpretability can help to uncover any misbehavior of the algorithms such as inequality, unfairness, and even discrimination and racism (Grace 2019; Hübner 2021).

To generate a differentiated view of transparency, Lipton (2018) defines two categories for overview purposes: Transparency (how does the model work), i.e. a technical explanation of how the model works, the input values, and how the model is formed; and the post hoc explanations (what else can the model tell me), i.e., a linking of information provided by the model, such as a visual representation or a textual explanation. Thus, developers are more interested in the mechanisms of the model and its hyperparameter tuning to achieve a valid result (Wolf 2021). End users, on the other hand, want to interpret and use the results; understanding how the model works is not the core goal and is a negligible concern (Loyola-Gonzalez 2019).

In addition to the challenge of defining interpretability, many machine learning methods are complex, and sometimes humans cannot fully understand what is happening (Alkhatib 2021; Hannon et al. 2019; Kaur et al. 2020; Lehr and Ohm 2017). Kaur et al. (2020) show that even experts, like data scientists, struggle to interpret results explained by techniques such as Shapley additive explanations (SHAP) or generalized additive models (GAM). They conclude that interpretability tools should be designed in a way that matches the user’s expectations and background. Taking complex systems into account (e.g., AI-driven autonomous driving), it becomes nearly impossible for human beings to understand the ‘reasoning’ (Hannon et al. 2019). Reducing the complexity could potentially provide easier interpretability, but it may also reduce the strength of the predictions and benefits to the process of making them more explainable (Lipton 2018). Dove et al. (2017) believe research in HCI is still in the early stages since there is hardly any standard definition of how ML and design can be tested. Kaur et al. (2020) suggest that HCI community members should be involved in the development of interpretation tools for ML, such that the HCI and ML communities work together from the start. Carmona et al. (2018) propose improved collaboration between the design and ML disciplines to promote mutual awareness and foster innovation in the areas of interpretability and data collection. Wolf and Blomberg (2019) remind us that making ML useful for end-users ties in with the requirements that the Computer Supported Cooperative Work (CSCW) community has been addressing for decades. Ultimately, CSCW is about how users integrate this technology into their daily work and practices. Simply introducing ML for the sake of relevance is not the way forward. Rather, the end-users need to see the benefits for themselves, and this can be supported by a high enough level of satisfaction derived from explaining the results (Gilpin et al. 2018).

Recent studies in the field of Human-AI-Interaction try to find approaches and guidelines for designing AI interfaces that are also acceptable to end-users. Amershi et al. (2019) collected prior approaches and used them in several iterations, followed by evaluations with experts in human–computer interaction and user experience, as well as with users, to establish 18 general design guidelines. Users also lack the ability to set what the AI monitors and knowledge of how the behavior can be adjusted. When designing AI applications, attention must be paid to the explainability or interpretability of the results produced, so that the user is favorably inclined toward the system and does not reject it (Amershi et al. 2019; Ribeiro et al. 2016). A high level of satisfaction in explaining the results uncovers benefits for the user (e.g., work simplification) which in turn is supported by an interface able to visualize and contextualize the information (Amershi et al. 2019; Gilpin et al. 2018). It finally results in the integration into daily work and practices, supports the relevance of the technology, and leads to further advancements (Harper 2019; Wolf and Blomberg 2019). Thus, a new technology with the potential of AI needs an adequate interface to become applicable (Gillies et al. 2016). Harper (2019) sums it up by saying that artificial intelligence is not the future, but only an artificial intelligence that is made usable and open to us through HCI.

3 Research gap and approach

Sales planning is a favorable way to make production planning more efficient, minimize product returns, reduce lost sales, and improve customer service (Doganis et al. 2006). To forecast future sales, statistical methods that rely solely on historical sales are usually implemented. However, machine learning methods have increasingly come into focus, as they can consider far more input values and correlations. Large retailers or food chains, whose products only have a short shelf life and require the most accurate possible prediction of future sales volumes to avoid wasting food, already try to integrate machine learning (Petropoulos et al. 2018; Thomassey and Happiette 2007). However, when applying the ML-based sales forecasting in practice, users face several challenges. These challenges include the fact that the ML-based systems are perceived as a black box, have an inconsistent and unpredictable behavior pattern, and lead to reduced trust in the system. These issues might lead to a rejection of the entire system (Graaf et al. 2017; Lee et al. 2014; Petropoulos et al. 2018). Thus, it is important to design the interface between humans and computers in such a way that the results can be explained and interpreted by humans. Amershi et al. (2019) have derived guidelines to check whether AI systems meet with user acceptance.

Our paper contributes to the current state of research by presenting a design case study in the food sector, outlining the prerequisites, design, and implementation of ML-based sales forecasting and the challenges in practical application. We therefore take a closer look at a medium-sized bakery that sells its goods regionally through its own branches, to private individuals, and to micro-companies such as kiosks. We examine the design of an ML-based sales forecasting system from a technology-related, process-related, and end-user-related perspective. We conducted a design case study consisting of an empirical pre-study in the domain mentioned, the implementation of a prototype (IT artifact), and the subsequent exemplary evaluation in the form of a functional test without an appropriation phase in the intended domain (Wulf et al. 2015).

In the pre-study phase, we familiarized ourselves with the research field and conducted several workshops together with the bakery to understand their current work practices and processes, with a special focus on sales planning. Based on the empirical findings, we conducted a co-design phase and implemented a system that tries to address current issues of the sales planning process. Our system ran for three months to simulate a first use and was then evaluated with end-users and domain experts.

4 Empirical pre-study

To get insights into the current work practices and planning process, we conducted our empirical pre-study with a local bakery in Germany. Bakeries are representative of the food industry: they concentrate on a specific product category and region, and they deal with the same obstacles concerning limited shelf life, high quality requirements for food products, and uncertainty of market demand. Although the processes and structures cannot be generalized, the reduced complexity provides a good application field as it is affected by the same restrictions. This allows an examination of the challenges in planning and implementing ML, the basic requirements, and the concerns that exist. We will briefly outline our application field and the applied methodology, as well as the specific findings of our empirical study.

4.1 Application field and methodology

The bakery is family-run and all family members support the daily business in baking as well as administration. The products are produced in one central production facility and include a wide variety of breads, buns, cakes, and other pastries. The products are sold through 14 of the company-owned branches, which are located throughout the region and supplied each morning. In addition, the bakery maintains business relationships with several external traders, for instance, small shops which are also supplied directly by the production facility (Figure 2). No goods are sold directly to consumers through the production site. Both internal customers (their own shops) and external customers (small shops, kiosks, etc.) order independently from the production site. The bakery’s administration is located directly at the production site and is responsible for all administrative tasks, as well as customer management. One task is production planning and scheduling based on the orders received. Many of the employees at the point of sale (their independent branches) are only temporary workers (i.e., students, mini-jobbers) and have limited experience in sales planning, but are asked to plan and submit the future demand. The external traders order directly from the central production unit.

Figure 2. Production and distribution network.

The bakery is confronted with constantly fluctuating sales volumes. It plans its production on the basis of purchase forecasts from its company branches and external traders, which it obtains by asking them what they need. There is no consideration of the past, nor of other factors that might influence sales, such as weather, holidays, or the like. This leads to a large number of returns and goods that have to be disposed of or, in some cases, donated to charitable organizations. In both cases, the current practice causes significant costs for the bakery. Additionally, there are situations where the demand for certain goods is higher than the produced amount.

We decided to use a qualitative approach and conducted two workshops. During the workshops, we took field notes and wrote down, verbatim, the statements that made an impression on us, as well as the emotional reactions and postures. We conducted the first workshop with the owner of the bakery, his wife, and his daughter. All three of them know the family business well (e.g., the economic situation of the company, the organization, and the processes), and they all perform important tasks within the company. The owner founded the bakery, is a trained baker, and works in production. His wife is responsible for all tasks in the administration. The daughter is expected to take over the bakery at some point in the future. She contacted us because she considered the current work processes too inefficient and realized that digitalization might offer several benefits. Since none of the family members knew about ML or AI in detail, no expectations were placed on ML; they simply wanted to explore new ways to position the company more competitively in the future.

The goal of the first workshop was to get initial insights into the research area and to understand the current ordering processes. The second workshop, which took place three weeks after the first, was meant to gain details about work practices and to capture aspects that we might have missed in the first workshop. In addition, the staff member who is currently in charge of the ordering process was present in the second workshop. Both workshops took about 90 minutes each and were held in a conference room of the bakery. Due to a lack of consent, the workshops were neither recorded nor visually documented. Therefore, we subsequently structured, clustered, and evaluated the field notes.

4.2 Empirical findings

In the first workshop, we examined the current structure of the company. The owner, his wife, and their daughter explained how the company is structured, including the distribution networks and the production unit, and detailed the current sales planning process, which we later modeled (Figure 3). In general, three different players are involved in the ordering process. First, there are the internal customers, which consist of the company’s own branches. Second, there are the external customers, such as small shops and kiosks. The third player is a member of the bakery’s administrative staff.

Figure 3. Main ordering process.

Within the current sales planning process, all customers (branches/kiosks etc.) use a printed form and indicate their demand quantity for each product per working day for the upcoming week (Figure 4).

Figure 4. Excerpt from the bakery order form (translated).

This form is faxed to the central production facility. At the production facility’s head office, the demand quantities are processed and accumulated manually. All of the company branches and customers have a deadline by which the data must be final. Before that, changes can still be communicated, again by printed form and fax. The last faxed form always counts. These changes must be manually recalculated in the head office to obtain the current demand. Based on the accumulated figures, the production is aligned and the goods are produced. Subsequently, the goods are delivered to the customers and branches on a daily basis with the company’s fleet. The tasks involved in the process can be summarized as follows:

Customers (internal and external):

  • Estimating the amount of required goods for the upcoming week

  • Reporting the quantity on a printed form sent by fax

  • Changing the quantity on a printed form sent by fax

Head Office:

  • Confirming each order via fax, email, or telephone

  • Accumulating the quantities manually for each order

  • Accumulating the quantities manually across orders

  • Providing the production department with the quantities

  • Managing and contacting the customers

  • Maintaining the product catalog
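The manual accumulation performed at the head office, including the rule that the last faxed form always counts, can be sketched in a few lines. The data layout, the customer and product names, and the restriction to a single delivery day are assumptions for illustration; the sketch does not describe the system implemented later.

```python
from collections import defaultdict


def consolidate(forms):
    """Accumulate ordered quantities per product across all customers.

    forms: list of (sequence_no, customer, {product: quantity}) tuples for one
    delivery day. A later form from the same customer replaces the earlier one.
    """
    latest = {}
    for _, customer, quantities in sorted(forms, key=lambda form: form[0]):
        latest[customer] = quantities          # the last form always counts

    totals = defaultdict(int)
    for quantities in latest.values():
        for product, quantity in quantities.items():
            totals[product] += quantity
    return dict(totals)


# Hypothetical forms received before the deadline; branch_a sends a correction.
forms = [
    (1, "branch_a", {"rye bread": 20, "bun": 100}),
    (2, "kiosk_b", {"bun": 40}),
    (3, "branch_a", {"rye bread": 15, "bun": 120}),
]

totals = consolidate(forms)   # {"rye bread": 15, "bun": 160}
```

Every corrected form triggers a full recalculation in this sketch, which mirrors the effort that the manual process causes each time a customer faxes a change.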

The owner outlined that many branches ‘scrap’ a lot of the goods as too many goods were delivered, or they go out of stock because too few were ordered. In addition, many of those employed at the point of sale are temporary workers and have little expertise in sales planning. The fact that this situation stresses the owner could be seen by his emotional expressions. He stood with his arms folded most of the time during the workshop. Between the individual explanations, he often used statements such as, ‘It might be better to close everything down.’ In our opinion, this emotional reaction showed that there was clearly a need for action because the high workload involved in planning and managing the many stores was no longer physically possible for the owner. The daughter always tried to encourage her father to break new (digital) ground with sentences like, ‘Let's try and change something here’. (As researchers, we were initially unaware of this situation and our importance for the continued operation of the company as a whole.)

In the second workshop, we validated our previously summarized process models. Furthermore, for deeper insights into the ordering process, the responsible employee from the administration took part. She emphasized that she spends a lot of time collecting the faxed orders, altering them if changes were made by the customers, or calling them back if there are mistakes or something is unreadable. She wished her task would be made easier at this point as the current manual effort is enormous and takes up a large part of her working time. She also told us that customers sometimes send changes by fax several times a day. When we asked whether there is any validation regarding past sales and current sales, she said that the paper-driven way of working hampers such comparisons. The current planning quality, and whether certain customers usually order too much or too little, are therefore unknown. Right now, no analysis is conducted due to the lack of time caused by the current time-consuming process. However, she assumed that the bakery is confronted with the problem of oversupplying or undersupplying market demand due to the current ordering process. This circumstance is particularly serious in the food industry, as the goods only have a limited shelf life, and goods from previous days cannot be sold the next day (Wen et al. 2014), at least not under the same conditions and with the same quality.

Furthermore, she told us that one of the biggest issues is that the prediction of future sales is based on the experience of the employees in the branches, or of the external customers, who are the most inexperienced workers and have no overview of past daily demands. At the same time, the high turnover of employees also means that seasonal knowledge or knowledge about sales for special events is not available. During the workshop, the owner despaired and summarized that his company would surely have to close soon if it were not possible to simplify the planning, make it more market-oriented, and provide an orientation for future demand based on past experience.

In summary, we can state that the following points need to be considered: (1) minimize the errors during the ordering process (wrong order unit, legibility), (2) automatically consolidate the order demands and, therefore, (3) minimize the effort in the entire process, and (4) provide a supportive suggestion for future demand.

5 Design and introduction of the basic technological infrastructure

Based on the empirical study, we considered whether artificial intelligence could support the lack of experience of the temporary employees in the individual branches by basing sales planning on as many detectable context parameters as possible. The goal was a better and more efficient sales planning procedure to reduce effects such as over/underproduction. However, as our empirical study and the current process have shown, there is neither a suitable digital data basis nor a technological infrastructure in place to introduce new forms of machine learning-based sales planning. It was therefore first necessary to establish a suitable technological infrastructure to gather data that might enable new forms of sales planning. To address these shortcomings within the current ordering process we, therefore, set up a website with a centralized database as a platform, and location-independent technology. Together with the participants in our workshops, we refined the ordering process and adapted it to the website (Figure 5).

Figure 5. New digital ordering process.

Many manual tasks that were previously carried out by an office worker are supported by the website in the new process. The process includes receiving orders and sending confirmations, calculating a total per product, and checking whether adjustments were made to an order. The office worker only needs to download the total quantity per product, put it into a suitable form, and pass it on to production. This reduces the administrative effort on the part of the bakery and supports increased error tolerance through digital technologies. The website was meant, on the one hand, to offer a digital interface for all customers to order goods and store the orders appropriately. On the other hand, the website provides a management module for assessing the orders digitally and automating cumbersome and exhausting tasks such as calculating and updating the different demand quantities. The website has a responsive design and can be accessed through mobile devices as well as desktop or laptop computers. Customers only need the pre-specified login parameters.

The website was implemented in HTML and PHP to ensure platform independence. The underlying database is an open-source MySQL database. The website, including the database, is hosted by a German service provider. For the prediction calculation and visualization, we developed a Python program with Spyder IDE, in which we use different open-source libraries like scikit-learn for machine learning or pandas, numpy, and matplotlib.

Based on the login, the system decides between three different roles. The first is the admin of the system who has all rights, the second role is administration staff, and the third role is customers. While the administration staff can see all orders, can add users and products, and have access to functions for calculating the quantities, the customer only sees data relevant to the customer and does not have any insight into the orders of others. If the users have any problems in using the system, a section for contact is available, which sends an email with the request to the admin. Each customer gets a unique login provided by the bakery. Here, the customer must enter some mandatory information, such as the company and name. This information is used in various places in the system. The user is given the opportunity to edit the profile and to change the password.
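The role separation described above could be sketched as follows. This is only an illustration in Python (the website itself was implemented in PHP); the permission names, the `can` helper, and the `visible_orders` function are hypothetical and do not come from the actual system.

```python
from enum import Enum


class Role(Enum):
    ADMIN = "admin"
    STAFF = "administration staff"
    CUSTOMER = "customer"


# Hypothetical permission names; the paper describes the roles only in prose.
STAFF_PERMISSIONS = {
    "view_all_orders", "manage_users", "manage_products", "consolidate_orders",
}
CUSTOMER_PERMISSIONS = {"place_order", "view_own_orders", "edit_profile"}

PERMISSIONS = {
    # The admin "has all rights", modeled here as the union of all permissions.
    Role.ADMIN: STAFF_PERMISSIONS | CUSTOMER_PERMISSIONS | {"system_settings"},
    Role.STAFF: STAFF_PERMISSIONS,
    Role.CUSTOMER: CUSTOMER_PERMISSIONS,
}


def can(role, permission):
    """Return True if the given role holds the given permission."""
    return permission in PERMISSIONS[role]


def visible_orders(role, user_id, orders):
    """Admin and staff see every order; a customer sees only their own."""
    if can(role, "view_all_orders"):
        return list(orders)
    return [order for order in orders if order["customer_id"] == user_id]
```

Keeping the permissions in one table means that hiding menu entries and filtering order data can both be derived from the same source.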

The main function of the system, as far as customers are concerned, is order entry (Figure 6). The entry allows each customer to choose from the product range and to place an order for a certain week. The respective weeks can be selected via the ‘Calendar Week Choice’ which offers the option to place an order for each product for each working day in the week. If the user has enabled the option to receive confirmations by e-mail, the system automatically sends a confirmation after the order is submitted.

Figure 6. Order form embedded in the website.

In the menu section ‘My Orders’, the user can edit his/her submitted orders at any time as long as the submission deadline has not expired (Figure 7). Each submission triggers a new confirmation message. The two pages ‘Order placement’ and ‘My orders’ represent a central step in the new digital process. The tasks of confirming changes, as well as the manual change of the orders, are now assigned to the customer. The customer can easily make changes and get an overview of current and previous orders, which leads to transparency and time savings on both sides.

Figure 7. My orders overview.

The system provides a function for ‘Order Consolidation’ (Figure 8), which was previously performed manually by the bakery staff. Various evaluations are available, including an overview per product (aggregated over all customers/orders) or combinations such as customer and product. Furthermore, the individual calendar weeks can be queried (past and future) and individual evaluations can be started for each customer and/or product. The resulting tables are also searchable and filterable. In addition, the current view can be sent to the user’s email address. With this view, the administration staff can quickly generate an overview for production or do evaluations per customer/product/week in case of inquiries. The overview created for production can now be shared with the production team for a more precise plan.
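The consolidation logic can be illustrated with a short pandas sketch. The column names and the order lines are made up for illustration; the production system performs this step on the website against the MySQL database, not necessarily in this form.

```python
import pandas as pd

# Illustrative order lines; column names and values are assumptions.
orders = pd.DataFrame({
    "customer": ["Cafe A", "Cafe A", "Shop B", "Shop B"],
    "product": ["buns", "cream slices", "buns", "buns"],
    "calendar_week": [40, 40, 40, 41],
    "quantity": [120, 10, 200, 180],
})


def consolidate(order_lines, by):
    """Sum the ordered quantities over the given grouping columns."""
    return order_lines.groupby(by, as_index=False)["quantity"].sum()


# Overview per product, aggregated over all customers (input for production).
per_product = consolidate(orders, ["calendar_week", "product"])

# Finer evaluation per customer and product, e.g. in case of inquiries.
per_customer_product = consolidate(orders, ["calendar_week", "customer", "product"])
```

The same helper covers both the production overview and the per-customer evaluation; only the grouping columns change.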

Figure 8. Order analysis: order overview.

We wanted to implement a new ordering process that would support the bakery by (1) minimizing the errors during ordering (wrong order unit, legibility), (2) automatically consolidating the order demands and, therefore (3) minimizing the effort in the entire process. With the introduction of the system and the new co-designed process, some tasks were transferred from the bakery to the customers, thus creating time savings and transparency. Further, manual, time-consuming calculations are no longer necessary because the system shows the total quantity required per product and calendar week, as well as the individual requirements of the customers. The system we developed was then used for a total of 12 weeks (September to December) within the bakery's practice.

6 Machine learning-based sales forecasting

Although the use of IT infrastructure has made the ordering process more efficient, the problem of inadequate planning on the part of customers has not yet been addressed. However, the redesign of the process created the basis for having data available digitally to be able to plan ML-based forecasting procedures for sales, even over several weeks. The stored data is now available for an analysis of the past allowing us to identify order patterns with regard to context parameters such as seasonal effects.

Based on the available data, we implemented a forecast module that is written in Python and uses libraries, such as scikit-learn. In the first step, we use an export from the ordering platform database enriched with local weather data from the German Weather Service as a data source. The module carries out correlation analyses, visualizes them, and presents the initial findings and correlations to the user accordingly. Furthermore, the module displays a week’s forecast and compares it with the sales that actually arrived to obtain an initial forecast quality in the form of a mean deviation or a percentage and absolute deviation (Tables 2 and 3).

For the analysis, we chose products that are high in demand according to the bakery. The figures contain the respective orders of the goods by a customer and are recorded daily. The goods have different quantities, are divided into different types of goods, like pastries, breads, or cakes, and are analyzed individually. In the following discussion, we focus on buns and cream slices to outline our ML-based approach.

To calculate a sales forecast, we used a support vector machine (SVM). For this, the data must be in a normalized, machine-interpretable form. We therefore had to map input parameters such as sunny or cloudy weather to numerical values (categories). After data preparation, we divided the available data set into a training set and a test set. The training set was used to teach the model, while the test set was used to perform the prediction. Finally, we compared the results of the model with the actual sales to determine the model’s accuracy. Depending on the results, further adjustments to the model may be necessary to increase the quality of the prediction. We took the following steps to perform the sales prediction:

  1. data preparation and normalization
  2. dividing the data set into training data and test data
  3. determination of the hyperparameters (C, γ)
  4. training of the SVM with the training data
  5. prediction with test data
  6. replace forecast value with real value as time progresses
  7. observation of the forecast quality
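The steps above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the actual forecast module: the use of support vector regression (`SVR`) with an RBF kernel, the feature set, the weather coding, and the fixed values for C and γ are all assumptions made for the example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_days = 81

# Step 1: data preparation. Categorical inputs are mapped to numbers
# (hypothetical coding) and all features are normalized in the pipeline below.
weather_code = {"cloudy": 0.0, "sunny": 1.0}
weather = rng.choice(["cloudy", "sunny"], size=n_days)
weather_num = np.array([weather_code[str(w)] for w in weather])
day_type = (np.arange(n_days) % 7 >= 5).astype(float)  # 1.0 on weekend-like days

# Synthetic daily sales, used only to make the sketch runnable.
sales = 500 + 300 * day_type + 40 * weather_num + rng.normal(0, 20, n_days)
X = np.column_stack([day_type, weather_num])
y = sales

# Step 2: chronological split into training data and test data.
split = int(0.6 * n_days)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Steps 3 and 4: hyperparameters are fixed here (placeholder values) and the
# model is trained on the training data.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1000.0, gamma=0.1))
model.fit(X_train, y_train)

# Step 5: prediction with the test data.
forecast = model.predict(X_test)

# Step 7: observe the forecast quality, here as mean absolute deviation.
# Step 6 (replacing forecasts with real values over time) is not shown.
mean_abs_dev = float(np.mean(np.abs(y_test - forecast)))
```

Putting the scaler inside the pipeline ensures that the normalization is fitted on the training data only and then reused unchanged for the test data.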

As written above, the historical data originates from our newly introduced system that stores orders by customers. The data is available at a daily level per customer and item. Since the main focus is on forecasting each product, the data must be condensed to this level to get the general demand per product. Outliers in the data series can be neglected due to the robust design of the SVM. The literature (Arunraj and Ahrens 2015; Doganis et al. 2006; Tsoumakas 2019; Žliobaite et al. 2009) lists numerous factors influencing sales planning, of which we need to identify the crucial ones that impact sales in our given examples. For that purpose, we analyzed the correlation between sales and potential features, and took only those into account that have a high correlation and therefore a high influence on the sales. Correlation coefficients range from -1 to 1: a high positive correlation lies between 0.5 and 1 and a high negative correlation between -0.5 and -1. Absolute values below 0.5 indicate a weak correlation, which is still considerable between 0.2 and 0.35 and very weak and inconsiderable below 0.2 (Senthilnathan 2019). For the bakery products, the data suggest that weather, promotions, season, events, and price are of greatest interest (Table 1). However, promotions were not implemented within the current system as these are not yet explicitly recorded.
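A correlation-based feature screening of this kind could look like the following sketch. The data frame and its column names are invented for illustration, the Pearson coefficient is assumed, and the cut-off of 0.2 follows the "weak but considerable" boundary named above.

```python
import pandas as pd

# Illustrative daily data; values and column names are assumptions.
df = pd.DataFrame({
    "sales": [500, 520, 510, 530, 505, 800, 820, 515, 525, 500, 540, 510, 810, 830],
    "day_type": [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    "temperature": [12, 9, 14, 10, 13, 11, 12, 10, 13, 9, 12, 14, 11, 10],
})


def select_features(data, target="sales", threshold=0.2):
    """Keep features whose absolute Pearson correlation with the target
    reaches the threshold, ordered from strongest to weakest."""
    corr = data.corr()[target].drop(target)
    kept = corr[corr.abs() >= threshold]
    order = kept.abs().sort_values(ascending=False).index
    return kept.reindex(order)


selected = select_features(df)
```

Note that a correlation coefficient only captures linear dependence, so a feature that is dropped here may still carry information that a non-linear model could use.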

Table 1. Considerable influencing factors.

As already mentioned, the correlation describes how strongly the values are dependent on each other. The more they are dependent on sales, the more valuable their inclusion in the calculation is. We provide two examples (buns, cream slices) to outline our approach.

In the case of buns, the temperature has a slight negative correlation of -0.01 (-1%) and is therefore unsuitable for proceeding. Sunshine duration, as an indicator of good weather, also has a weak correlation of 0.15 (15%). The season, i.e., the type of day, whether a weekend, public holiday, or working day, shows a high correlation of 0.88 (88.1%). This is reflected in the figure and the graph (Figure 9). The events also show a weak positive correlation of 0.26 (25.6%). As a result, the season factor is the most suitable. The events are also included as the correlation calculation seems sufficient (weak but considerable) for the small amount of data (five events in a period of > 80 days). Furthermore, the data indicated a peak in sales at each event.

Figure 9. Buns: correlation to season and sales volume.

In addition to the various factors, sales itself has a high correlation in the offset of one week (0.81) and two weeks (0.80) and is useful as input for the SVM to predict future demand. To support the usage of the past sales, we also added the past records for each of the other input parameters (Figure 10).

Figure 10. Buns: correlation of sales and sales the previous week.

In the case of the buns, this results in the combination of features: Sales (7/14 days ago), Weather (Sunshine in hours 7/14 days ago and current day), Event (7/14 days ago and current day), Day type (Season 7/14 days ago and current day).
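The 7-day and 14-day offsets can be built with a simple shift on the daily series. The sketch below assumes one row per day in chronological order; the column names and the helper `add_lags` are hypothetical.

```python
import pandas as pd


def add_lags(daily, columns, lags=(7, 14)):
    """Add one column per feature and lag holding the value from `lag` days
    earlier, then drop the leading rows that have no complete history."""
    out = daily.copy()
    for col in columns:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    return out.dropna().reset_index(drop=True)


# Illustrative data: 21 consecutive days.
daily = pd.DataFrame({
    "sales": list(range(100, 121)),
    "sunshine_hours": [float(d % 10) for d in range(21)],
    "event": [0] * 21,
    "day_type": [1 if d % 7 >= 5 else 0 for d in range(21)],
})

features = add_lags(daily, ["sales", "sunshine_hours", "event", "day_type"])
```

Because the first 14 days have no complete history, they are dropped; with short data series such as the 81 periods used here, this loss of rows is noticeable.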

In the case of cream slices, the temperature has a slight positive correlation of 0.09 (8.6%) and is therefore unsuitable for further proceedings. Sunshine duration, as an indicator of good weather, also has a very weak correlation of 0.04 (4.1%). The season, i.e., the type of day, shows a high correlation of 0.52 (51.7%). The events show a high positive correlation of 0.53 (53.3%). As a result, the event factor is the most suitable, while the second most important variable is the season. Sales itself, in contrast to the buns, has a moderate to low correlation in the offset of one week (0.30) and two weeks (0.17). This indicates that there is volatility in the data that does not follow definite patterns. To support the usage of the past sales, we also added the past data for each of the other input parameters. To keep both models comparable, we decided to use the same features for the cream slices as for the buns.

We used 60% of the available data as training data to train the model and the rest for testing. As further preparation, the hyperparameters C and γ must be determined using a grid search and validated by cross-validation. The recommendations of Hsu et al. (2010) were used to preselect the intervals for both parameters. For the buns, the search suggested C = 2014 and γ = 9e-05, and for the cream slices C = 1114 and γ = 8e-06.
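A grid search of this kind might be set up as in the following sketch. The exponentially spaced grid follows the general recommendation of Hsu et al., here with a coarser step to keep the example fast; the synthetic training data, the use of `SVR`, the time-ordered cross-validation via `TimeSeriesSplit`, and the scoring metric are assumptions and not taken from the actual module.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic training data standing in for the first 60% of the periods.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(48, 3))
y_train = 500 + 100 * X_train[:, 0] + rng.normal(0, 10, 48)

# Exponentially growing candidate values for C and gamma (coarse grid).
param_grid = {
    "svr__C": 2.0 ** np.arange(-5, 16, 4),
    "svr__gamma": 2.0 ** np.arange(-15, 4, 4),
}

search = GridSearchCV(
    make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),  # folds respect the temporal order
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

best_C = search.best_params_["svr__C"]
best_gamma = search.best_params_["svr__gamma"]
```

A common refinement is to run a second, finer grid around the best coarse values, which would also explain selected values that are not exact powers of two.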

To assess whether the forecast was accurate, the last week (seven days) of the available data was used as a reference. In this way, actual sales could be compared with forecasted demand. For the calculation, 74 periods in the past are used. Beginning from the 75th period onward, a forecast is made for the next seven periods.

Figure 11 shows bun sales (green) and forecast (blue) in one diagram, making it easy to see that in large parts both lines are congruent, but in some cases, they diverge. For example, the first two weeks (period 3–9 and period 10–16) show a similar forecast, as the SVM only has the information of the previous week available until the second week. However, the second week has higher sales in period 13 (944 units) than the previous week in period six (764 units). Period 19 has high sales, unlike the previous two weeks. This is most likely due to a public holiday (October 3rd, German Unity Day). Table 2 shows the sales of the two weeks before the forecasted day, the forecast itself, and the actual sales. In addition, the absolute deviation and the relative deviation are calculated. It is striking that for the days Thursday up to Monday, the forecast comes relatively close to the actual sales with a positive or negative deviation between 0.8% and 0.9%. Tuesday and Wednesday are an exception. Both deviate by about 25% each. The forecast for Wednesday was too low, but Tuesday’s sales were 25% higher than the forecast.

Figure 11. Buns forecast.

Table 2. Buns deviation analysis.

It is noticeable that the two previous Wednesdays themselves had lower sales (sales -7 = 583; sales -14 = 603) than the forecast Wednesday (577). Looking at these two previous Wednesdays, the deviation between forecast and sales is between 1 and 4%. Since both values are considered as features in the model, they have a corresponding influence on the forecast. Looking further back in time, we see that we have three more outliers in the data. The sales of 818 (Halloween, October 31st—period 47) and 1,154 (German Unity Day, October 3rd—period 19) can be explained by the assumption that these were holidays or events. The days without a holiday show significantly lower sales, such as period 68 (583), period 61 (603), period 54 (614), and period 40 (570), and therefore move closer to the forecast. Periods 33 (700), 26 (666), and 12 (747) show slightly higher quantities.

If we take the mean value of the apparent plausible data series (period 68, 61, 54, 40), we get 607 pieces in mean sales, which would be a deviation of 4.97%. Even with the higher values (period 33, 26, 12), the mean value would be 640, which would result in a deviation of 9.9%. Thus, the question arises whether there are influencing parameters that we do not have in our current model, such as large exceptional orders. Wednesday could thus be considered as an outlier in relation to the other two weeks. There may be other factors contributing to the increased sales. A more detailed analysis could reveal a non-intended feature.

Tuesday (sales 14 days ago) seems interesting as it is an outlier in the overall data series. The mean of all the other Tuesdays in the dataset is 540, and the median 556, indicating that the 700 is an exception. Its influence on the SVM was not directly evident from the data. Later, however, it turned out that it was an unscheduled order from an external customer.

Considering the average deviation with respect to the entire dataset (81 periods), an absolute deviation of 5.5 pieces, or a relative deviation of -1.75%, can be determined. There are periods with 38% more sales than forecast (period 19) and periods with 52% fewer sales than forecast (period 70). Our test scope has an average relative deviation of 2.1% over the whole week.
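The deviation figures can be computed per period and then averaged, as in the sketch below. The sign convention (actual minus forecast) and the choice of actual sales as the denominator are assumptions, since the exact formula is not stated; the numbers are purely illustrative and are not taken from Table 2.

```python
import numpy as np


def deviations(actual, forecast):
    """Return per-period absolute deviation (in pieces) and relative
    deviation (in percent of actual sales)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    absolute = actual - forecast
    relative = 100.0 * absolute / actual  # undefined for periods with zero sales
    return absolute, relative


# Illustrative values only.
actual_sales = [600, 800]
forecast_sales = [570, 800]

absolute, relative = deviations(actual_sales, forecast_sales)
mean_relative = float(relative.mean())
```

Positive and negative deviations cancel out in a plain mean, so averaging the absolute values of the relative deviations gives a stricter picture of forecast quality.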

Figure 12 shows actual cream slice sales (green) and forecast sales (red) in one diagram. Both lines are often congruent, but in some cases, they diverge. Table 3 shows the sales of the two weeks before the forecasted day, the forecast itself, and the actual sales. In addition, the absolute deviation and the relative deviation are calculated. It is striking that Saturday and Tuesday match the forecast with zero deviation. Wednesday, Friday, and Monday deviate between 30 and 50%. The forecast for Thursday is more than two times higher than the sales. The day with the highest deviation of about 630% is Sunday.

Figure 12. Cream slices forecast.

Table 3. Cream slices deviation analysis.

It is noticeable that the two previous Sundays themselves had higher sales (sales -7 = 18; sales -14 = 33) than the forecast Sunday (3). Looking further back in time, we see that our predicted Sunday is an outlier in the data. The sales on earlier Sundays vary from 23 to 38 sold units and have an average of 29 units. The previous week, with its 18 pieces, is already very low. Thus, it is unclear why the sales decreased and resulted in 3 sold pieces. The relevant information is not contained in our model.

Considering Thursday, and looking further back in time, we see that in period 41 there was a higher sale (34) than usual (11 – 25 pieces and an average of roughly 17 pieces). As in the buns example, we could assume that there are influencing parameters that we do not have in our current model, such as large exceptional orders. Hence period 41 could be considered as an outlier that influenced our forecast for Thursday.

Considering the average deviation in relation to the entire dataset (81 periods), a relative deviation of 21% can be determined. Our test scope has an average relative deviation of 122% over the whole week. In general, we can see that the forecast for the buns is more precise than that for the cream slices, which can be attributed to the feature selection, the greater correlation, and a distinct sales pattern. With the exception of the event factor, the features have a lower correlation for the cream slices (season 51.7%, event 53.3%, offset one week 30%) than for the buns (season 88.1%, event 25.6%, offset one week 81%).

7 Evaluation

To get insights regarding the applicability of our prototype, and the ML-based approach, we conducted an on-site evaluation consisting of a workshop to which we invited the owner, his wife, and daughter, as well as the person in charge of the ordering process (administrative assistant). Within the workshop, we used a projector to demo our prototype, which was hosted on a web server accessible via a webpage. Afterward, the participants explored the prototype. The participants were asked to think out loud while using the application (Nielsen 1994). At the same time, we took field notes.

We aimed at testing both usability and practice relevance. We were primarily concerned with identifying the various ways our prototype would be used, and what difficulties in use might be encountered. The philosophy behind the evaluation was derived from the notion of ‘situated evaluation’ (Twidale et al. 1994) in which qualitative methods are used to draw conclusions about the real-world use of technology-using domain experts. The aim was not to measure the relationship between evaluation goals and outcomes but to derive views from the experts about how useful the tool might be in use.

Following the evaluation of the application itself, we demoed the forecast tool, including explanation and results for the sales forecasting of buns. The participants were asked to give their assessment of the quality of the forecasts. We also took notes here. The aim of digitizing the ordering process is to create transparency in planning in the form of a more centralized overview and to reduce the workload on the employees. The forecast module was meant to significantly reduce overproduction and underproduction by generating suggestions for future sales based on historical sales and influencing parameters.

7.1 The technological basis for ML is already an innovation in itself for the users

We introduced the prototype, along with the new (digital) process, and referred to the current process as a comparison to illustrate differences. We prepared two demo user accounts consisting of a customer and an administrator. To outline the difference between a customer and an administrator we presented both logins in parallel and explained the menu and structure of the website (sub-areas such as administration, product management, user administration, etc.). The menu already differed by hiding the areas ‘manage product, manage user, admin area, and order overview’ from the customer based on authorization at login. ‘This is great, so we can manage everything and the customers can just place their order,’ said the administrative assistant. To answer the administrative assistant’s question, ‘Where do I set this up with the users?’, we briefly opened the user administration module and showed how a new user can be created and that a setting in the profile assigns the role ‘administrator’ or ‘customer’. Following this procedure (Figure 5), we opened the ‘Order Placement’ procedure, which supersedes the old order form. We highlighted its similarity to the older form so as to not confuse current customers, which all the participants appreciated. In addition, the embedded product pictures immediately caught the eye, and the owner pointed out: ‘[…] then we can take pictures of our products and display them on the website. These must then be appetizing pictures.’ Based on the simple digital form, the participants discussed the idea of graduated prices. As the owner said: ‘If we write that it is cheaper to buy 10 pieces or more or that one comes for free, people might buy more.’

We asked the participants to create a test order for the next week with quantities for each day for some goods and showed the error tolerance of the system by entering negative values and letters. The participants asked directly for a blocking logic with reference to the availability of certain articles as certain articles are only available for certain periods of time, such as Berlin donuts. These products may only be ordered during a specific period which is customizable in the item’s master data. The settings allow setting exact availability dates, or a product can be completely removed from the active assortment. The participants liked this option: ‘This way we don't have to change and send out new forms, but can easily add products to the assortment and also remove them again, like Berlin (donuts) at carnival,’ (the owner’s wife). Regarding this requirement, the administrator (owner’s wife) directly asked: ‘Can you please make it possible to enter until when in the week the order can be sent. And also, a calendar so that no orders can be placed on public holidays.’ After the form is filled out and submitted, a confirmation is sent via e-mail and a confirmation is also displayed on the screen itself. We showed the confirmation e-mail directly on the same laptop and received positive feedback. Afterward, we moved to the section, ‘My Orders’, to demonstrate the ability to adjust previously set orders and demonstrate that a change is accompanied by confirmation e-mails. This functionality was appreciated by the administrative assistant: ‘Then I don't have to do this and painstakingly search it out.’ The system always records the current status of the order and allows changes until a certain predefined date, analogous to the possibilities in the analog process.
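The checks discussed in this part of the evaluation (rejecting negative values and letters, product availability periods, and a submission deadline) could be combined as in the sketch below. It is written in Python for illustration only; the function name, its parameters, and the error messages are hypothetical, and the website's actual PHP validation may differ.

```python
from datetime import date


def validate_order_line(quantity_text, order_day, available_from, available_to,
                        deadline, today):
    """Return a list of error messages for one order line; empty if valid."""
    errors = []

    # Error tolerance: letters and negative values are rejected.
    try:
        quantity = int(quantity_text)
    except ValueError:
        errors.append("quantity must be a whole number")
    else:
        if quantity < 0:
            errors.append("quantity must not be negative")

    # Blocking logic: seasonal products can only be ordered for days inside
    # the availability period stored in the item's master data.
    if not (available_from <= order_day <= available_to):
        errors.append("product is not available on this day")

    # Orders can only be placed or changed until the submission deadline.
    if today > deadline:
        errors.append("submission deadline has passed")

    return errors
```

Returning all errors at once, rather than stopping at the first one, lets the form show every problem in a single round trip.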

We demonstrated the procedure to assess the total amount per product to schedule a production plan by querying the total quantity per product for the next week, as well as a summary per customer and product. All participants valued this functionality as one of the main functionalities of our prototype, as such a calculation was previously only possible with a large manual effort. At this point, the administrator expressed the wish to create further scenarios such as the concrete order quantity of a product in a defined period: ‘Can I also look at orders in the past? So, what a customer ordered and how much of a product was ordered in general?’ Even though this requirement had already been partially implemented in our prototype, querying the individual customer was not included. When downloading the total quantity used for creating a production plan, two options are available. The first is sending an email with a sketched table of the current evaluation, and the second is downloading a CSV (Comma Separated Values) file directly from the server by pressing a button on the interface. As the administrative assistant asked: ‘Can I download every evaluation directly? That's great. I wasn’t able to do such evaluations before, or only with a lot of effort.’

The daughter tried out the ordering function several times and seemed visibly enthusiastic about the system. During the evaluation, the owner asked her and the administrative assistant: ‘Can you imagine using this website and working with it?’ The administrative assistant responded: ‘This would save me a lot of work and together with this evaluation function it makes my work easier.’ The owner in return said: ‘Then you take pictures of the products and we upload them. Let's try it out with a few selected customers and install it on our infrastructure.’ We (as the evaluators) quickly realized that the owner’s opinion had changed regarding the planning process (although we had not even presented the ML component and thus the actual goal of our study). Whereas he had previously made statements such as ‘Then we might as well let the whole thing go.’ or ‘Maybe it's better to close everything down’, he now wanted to directly use our prototype productively. Also, his rather distant attitude with folded arms at the beginning of the presentation gradually changed. He asked questions and showed interest. With his statement that he wanted to implement the process with some customers, we concluded that the website met the requirements and got his approval to proceed. The statements made by the administrative assistant, who saw a possibility to simplify her work and save time, supported our assumption.

7.2 Higher reliance on poor planning by humans rather than above-average forecasting by an AI due to lack of trust

The second part of the evaluation concerned our machine learning module. Here we immediately encountered a certain skepticism. The attitude was in clear contrast to that during the walkthrough of the digital ordering process in our prototype. There were two reasons for this skepticism: the first was a lack of confidence in the results of the ML and the second was the handling of the results and their visual presentation.

The sales forecast is presented in Table 2, as well as the differences that arose between actual sales and the forecast for the buns from the bakery. Although we obtained a high forecast quality for five of the seven days, two outliers led to a complete mistrust and rejection of the results. One of the outliers was easy to explain because it was caused by a spontaneous bulk order from a special customer. While there are technological possibilities to allow for such exceptions, it was quickly clear to the owner that ML-based planning was not suitable for his company. He argued that if production had been aligned to these forecasts, then too much or too little would have been produced and customer orders would not have been fulfilled, or a considerable quantity would have had to be thrown away. The owner said: ‘Then I would have produced far too much and not sold if I trusted that, and on the other hand far too little.’ Both cases would lead to higher costs. Even though the prevailing (non-digital) processes already lead to over- or under-production, different standards were set for ML-based forecasting by the participants.

Furthermore, the participants criticized the handling of the module. Currently, data has to be consolidated and prepared in an Excel file from different sources. In addition, the current version of the module does not include a comprehensive user interface. The graphics that the forecasting module generates at runtime to show correlations, the congruence of sales and forecasts, as well as the current quality of the figures (Figure 11, Table 2) also failed to convince the users. Notably, the owner emphasized that ‘[…] there are no IT experts sitting here,’ to deal with the current module, and no one contradicted his quick, clearly formulated statement. Furthermore, responses clearly showed that a transparent explanation of the results is essential for trust, and the employees said they had too little technological understanding to fully comprehend how the system worked.

We trained the model for the forecasts using actual data, but the data were viewed critically. For example, the owner asked: ‘Why are the sales forecasts so low?’ We tried to explain the data in more detail through various graphs, but this did not appear to be sufficient. When we were unable to explain Tuesday’s prediction regarding the forecast quantity, the skepticism in the owner's facial expression increased significantly. Our attempts to get the owner to explore the data to determine possible reasons for deviations were quickly blocked: ‘There is not much interest in this kind of planning.’ We pointed out that this tool is only meant as support and that it provides suggestions to make planning easier for the employees and customers. However, the owner wanted to focus more on the website and digital ordering process. Even though we as researchers can see the advantages in terms of production planning using the ML-based application, we quickly accepted that such ML-based forecasting would not be used because the resistance of the owner was too great.

8 Discussion and limitations

As Arunraj and Ahrens (2015) describe, the food industry benefits from valid forecasts at the daily level to avoid over/underproduction. They have already shown the advantages of machine learning methods for forecasting sales. We conducted a design case study based on a bakery and used machine learning to show valid results regarding the forecasting of bun and cream slice sales for production planning. The use of ML requires digitized and normalized data as well as an examination for outliers to increase the data quality. Ideally, the collection, processing, and central storage of data are automated, as these steps are highly complex. Adding other external sources, such as weather data, increases complexity further. However, the challenge was that within the application field the data basis was insufficient and the necessary processes were not available. We therefore only had a limited selection of potential influencing factors at the outset. For this reason, it was first necessary to establish new digital processes to provide the necessary data with sufficiently high quality. We planned this process together with the end users and introduced new technologies to support the ordering process. Based on the data collected, we were able to calculate valid results using the example of bun and cream slice production. We showed how factors such as holidays or events, combined with sales data, contribute to a precise forecast.

Machine learning methods are often described as black boxes that make it difficult for users to understand the results or scrutinize the actions (Bohanec et al. 2017; Hannon et al. 2019; Ma et al. 2017). We were able to confirm this view in our study. Although we generated several diagrams showing correlations and interdependencies, such as sales and public holidays (Figure 9), or sales and forecast in one diagram to illustrate the congruence (Figure 11), the end-users remained skeptical of ML approaches. One reason for the skepticism was the supposed outliers in our results, which from their perspective could not be adequately explained by the chosen representation. Some of the outliers are explainable through historical data and available features. One way to uncover correlations is to explore the raw data to derive explanations for calculated results. However, such an exploration must also be sufficiently useful to later users, for instance by visually supporting the correlation of features and their impact on the forecast (Krause et al. 2016). An interactive way of exploring the correlations might have supported the explanations regarding both outliers in our case. Ultimately, we were not able to explain every outlier and why it occurred. Furthermore, our model appears to be missing features and performed worse in those cases (Tuesday and Wednesday) than the current human-made forecast. This lack of transparency and of an explanation of why those outliers occurred (influencing features) significantly unsettled the users and caused a complete rejection of our model.

Since our system was the first experience with ML for the users, it was especially difficult for them to understand what exactly lies behind the calculations. Explanatory components, such as those proposed by Krause et al. (2016), may help users recognize why certain results have occurred. However, this only describes the technological aspects. Gilpin et al. (2018) and Wolf and Blomberg (2019) pointed out that, ultimately, it is about the integration of this technology into users’ daily work and practices. A simple introduction for reasons of relevance is not always sufficient since users must recognize the benefit for themselves. As a result, our application not only lacks the functionality of an explanation but also requires simple integration into the working practices. The owner stated that he does not need this part of the system and does not see any benefits. Perhaps more attention to explainability and interpretability would have tipped the balance in favor of the forecast mechanism instead of leading to its rejection (Amershi et al. 2019; Ribeiro et al. 2016). None of the users’ questions about the model and its results were technical in terms of implementation, mathematics, etc.; instead, they related to concrete influencing parameters like events and weather. From their point of view, the focus was on the results and their explanation and not on how the model works technically (e.g., hyperparameters or the machine learning algorithm) (Wolf 2021; Loyola-Gonzalez 2019). Another way to show the advantages of the new model would have been to point out possibilities for reacting to exceptions to the model. However, this did not happen during the evaluation because the owner rejected such procedures after looking at the ML-based forecast values. The administrative assistant, who previously combined many of her activities in an Excel tool and prepared the sales planning there, now sees herself confronted with new challenges. While calculations in Excel are usually straightforward, the ML calculations are not easily comprehensible, and users or even developers often do not understand what is happening (Alkhatib 2021; Hannon et al. 2019; Lehr and Ohm 2017).

Taking the adaptability of ML and the changing behavior of the model into account, it becomes even harder to interpret results, even for experts such as data scientists, who face issues in trying to correctly interpret explainability models like SHAP (Kaur et al. 2020). Hence, trust issues arise and irritate users (Graaf et al. 2017; Lee et al. 2014). ML cannot simply be introduced because it suits the circumstances; rather, there must be a negotiation of tasks between users and machines or algorithms. Concerning our system, we were able to demonstrate the benefit of a digitized ordering process, whereas the benefit of the sales forecast was not realized. In other words, the existing practices combined with the digitization of the ordering process were sufficient (Gilpin et al. 2018). As a result, we can show that the usefulness of new technologies and the associated potentials must be communicated in an accessible and low-threshold manner, for example in the form of simple explanations of the results, to make the benefits more perceptible. Furthermore, our result is consistent with the ‘attitude towards using’ construct of the technology acceptance model, according to which perceived usefulness and perceived ease of use are essential factors influencing the acceptance and adoption of technology (Davis et al. 1989).

Our work was subject to several limitations. The software system was not in use long enough and could only collect data for three months. This makes forecasting difficult in terms of annual events (e.g., carnival, Halloween and other holidays). In addition, the results of the forecast should be compared with the actual sales figures over a longer period of time in order to make a medium to long-term statement about the forecast accuracy. Over an extended period, additional trend diagrams, the exploration of past events, and repetitive patterns in the data could become more obvious and support explainability, for example by comparing carnival with the previous season or with the last three years. The short period also affected the correlation analysis: random correlations can creep in during the investigation, or correlations can simply go undetected. ML requires a broad data foundation (e.g., large data sets and multiple sources) to unfold its full potential. In addition, we concentrated initially on a limited set of features. Further analysis may have uncovered more features with strong correlations, which could have been integrated into the system or obtained from other sources. Additionally, we analyzed only two products and trained and applied our model with a typical product. Multiple products should be analyzed and evaluated so that other features can be identified to increase the model’s accuracy. Moreover, other ML techniques may have been able to extract more information from the data. It is necessary to implement additional methods, such as neural networks, to compare their prediction power and examine how each model deals with the same data set. We can also investigate how various methods react to special effects like the current pandemic, with the accompanying regulations regarding access restrictions to facilities, supply shortages, and the extreme situation of a lockdown. Multiple questions arise: How long do the procedures need to run to determine the effects? Which features are necessary for this? Does such an unsteady market environment generally lead to inferior prediction results? Finally, we would strive to involve the field experts, like the owner and the administrative staff, during the second iteration to incorporate their feedback and their domain knowledge. This could uncover aspects we might have missed in the initial model (Hannon et al. 2019; Zhou et al. 2022).

In addition to the technical limitations, we were not able to investigate what impact the use of the technology will have on the actual work practices over a reasonable time period. Due to the lack of interpretability, and therefore the owner’s rejection of the ML module, a continued investigation will not be possible in this case. Because the owner focused on the ordering process, no user training on the forecasting framework or on interpreting the model’s results was done. However, this is a classic chicken-and-egg problem to consider when building ML procedures in practice. On the one hand, accurate forecast values are expected, but on the other hand, these values have to be explained appropriately. To obtain accurate forecast values, the model must first be explained. This interrelation of model construction, data basis, and interpretability creates opportunities for future research. The research area of Explainable Artificial Intelligence (XAI) tries to resolve this tension by creating transparency (Lipton 2018). XAI offers the opportunity to demystify the black box of machine learning through explanatory components and make it more accessible to people so that barriers and inhibitions are reduced among those less familiar with ML. For example, the Local Interpretable Model-Agnostic Explanations (LIME) approach, as proposed by Ribeiro et al. (2016) for the Support Vector Machine, may be a possible explanatory component. LIME generates local linear models around the predictions of an opaque model to help explain it. Such methods can also determine the influence of various features on individual predictions, although such attributions describe the behavior of the model and do not by themselves demonstrate causality.
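
The idea behind LIME, fitting a weighted linear model around a single prediction of an opaque model, can be illustrated with a minimal sketch. It is a simplified, LIME-style local surrogate written from scratch, not the LIME library and not the model used in this study. The function `black_box`, the two features (temperature and number of pre-orders), the perturbation scales, and the kernel width are all hypothetical assumptions, and the feature selection step of the original method is omitted.

```python
import math
import random


def black_box(x):
    """Hypothetical opaque sales model (an assumption for illustration only)."""
    temperature, pre_orders = x
    return 100.0 + 2.0 * pre_orders + 0.05 * (temperature - 15.0) ** 2


def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def local_surrogate(predict, x0, scales, n_samples=500, kernel_width=1.0, seed=0):
    """Fit a weighted linear model around x0; returns [intercept, slope_1, ...]."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # Perturb the instance in standardized units, then rescale per feature.
        z = [rng.gauss(0.0, 1.0) for _ in x0]
        x = [xi + zi * s for xi, zi, s in zip(x0, z, scales)]
        dist2 = sum(zi * zi for zi in z)
        # Centering on x0 makes the intercept the surrogate's value at x0.
        rows.append([1.0] + [xi - x0i for xi, x0i in zip(x, x0)])
        targets.append(predict(x))
        # Exponential kernel: samples close to x0 receive more weight.
        weights.append(math.exp(-dist2 / kernel_width ** 2))
    k = len(x0) + 1
    # Normal equations of weighted least squares: (X^T W X) beta = X^T W y.
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(k)]
            for i in range(k)]
    xtwy = [sum(w * r[i] * y for r, y, w in zip(rows, targets, weights))
            for i in range(k)]
    return solve_linear_system(xtwx, xtwy)


# Explain the prediction for a hypothetical day: 25 degrees, 40 pre-orders.
intercept, temperature_effect, pre_order_effect = local_surrogate(
    black_box, x0=[25.0, 40.0], scales=[2.0, 5.0]
)
print(f"local effect per degree: {temperature_effect:.2f}")
print(f"local effect per pre-order: {pre_order_effect:.2f}")
```

The two slopes state how strongly the opaque model reacts to each feature in the neighborhood of this one day. An explanation of this kind ("about two more buns per additional pre-order") is closer to the questions the participants actually asked than a description of the algorithm, but it only describes the model's behavior near the chosen instance.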

In our study, we tried to provide a first explanation for the results of the system by visualizing correlations of different features via diagrams in our first prototype. However, these were not sufficient and led to a quick rejection of the system. The recognition of benefits is a key point in the introduction of technologies, as a mere introduction of the technical possibility leads to failure (Wolf and Blomberg 2019). In the end, this was the crucial point in our study, which, despite the good results of our system, led to its rejection. At this point, it would be interesting to evaluate whether an adequate interface with explanatory components would have increased the willingness to use the tool. Ultimately, the end-user needs to see the benefits, and this can be supported by sufficiently satisfying explanations of the results (Gilpin et al. 2018). In particular, showing the benefits of new technologies such as AI in an environment characterized by a low level of digitization seems challenging and requires a systematic, low-threshold, and easy-to-understand approach, so that it does not lead directly to a rejection as in our study. The potential users in our study are, as the bakery owner described it, ‘[…] no IT experts […].’

9 Conclusion

The purpose of our study was to examine the prerequisites, design, and implementation of ML-based sales forecasting and the challenges that exist in practice. Since the food industry is particularly affected by restrictions such as a short shelf life, high product quality requirements, and a strongly fluctuating demand, it can benefit especially from an accurate forecast (Doganis et al. 2006). We therefore examined a bakery.

Since the basis for the use of machine learning methods was not given within our application field, we first had to digitize the data basis as well as the underlying process to obtain the necessary prerequisites for ML-based prediction methods. Although this new (digital) process met with a great deal of approval, the actual forecasting module, which was meant to plan production, reduce overproduction/underproduction, and therefore relieve the inexperienced employees in the branches, was rejected. The main reason was the lack of trust in the forecast. We conclude that good ML results are no guarantee of actual use if there is no adequate explanation that helps users build trust in and acceptance of the system and recognize its benefits. Here, the field of XAI has a supporting and decisive role to play in trying to establish ML in practice.

Although there is still a lot of future research to be done in the area of sales planning using machine learning in the food industry, and the interface between users and artificial intelligence, we are confident that our study shows which socio-technical challenges exist when introducing ML in practice. Our study shows that the interface between machine learning and the end-users (Human-AI-Interaction) needs to be investigated more closely to clarify the benefits and potentials of such technologies. The technology itself can be made available quickly, but the existing and manifested practices will not change when there is no acceptance by the end-users. With our study, we hope to motivate researchers to continue research in the field of Human-AI-Interaction, sales planning, and machine learning.