Introduction

Customer expectations regarding logistic performance have strongly increased in the last decades. Nowadays, customers expect not only high quality and individual products for low prices but also short and especially reliable delivery times (Paprocka & Cyba, 2015). This results in major challenges for manufacturing companies. On one hand, although Make-to-Stock (MTS) approaches allow fast delivery times, they might not be beneficial for all types of manufacturing companies due to a large number of products or product variants. On the other hand, Make-to-Order (MTO) manufacturers are able to address more customized needs but they often tend to use standard delivery times (Rao et al., 2005). The latter approach leads to a strong fluctuation of capacity utilization and requires high effort in continuously adjusting the manufacturing capacity. In addition to that, customers’ requirements can vary greatly. While in some cases the delivery should just be as fast as possible, others demand a specific time window for the delivery or even a just-in-time delivery. The resulting pressure on manufacturing companies to meet communicated delivery dates while offering competitive delivery times pushes them to find reliable approaches for accurate predictions. Often, a reliable delivery date can only be determined by converting a customer order into internal manufacturing orders that are scheduled within the manufacturing execution system. However, since converting and planning manufacturing processes is time-consuming, companies face the choice of delaying the confirmation of the customer order or adding significant buffers to ensure on-time delivery for the communicated delivery date.

For these reasons, the estimation of delivery dates has been addressed by the literature for decades. However, the majority of contributions are based on numerous assumptions and strongly condense the underlying information from various orders. In recent years, research has been increasingly focusing on the application of machine learning approaches for production planning and control tasks (Panzer & Gronau, 2023; Waubert de Puiseau et al., 2022). Nevertheless, the estimation of delivery dates often only appears as a sidenote while the main objective of such contributions is to optimize a system regarding tasks like order acceptance (Zhang et al., 2021), order release planning (Schneckenreither et al., 2021), or sequencing (Liang et al., 2012). Existing research in manufacturing analytics has shown a notable gap in its coverage, particularly regarding the application of Quantitative Logistic Models (QLM) in extracting valuable insights from raw manufacturing data. To address this shortcoming, our study explores the use of QLM to improve feature extraction from manufacturing data and then validates the improvements achieved through practical applications in two different cases. In addition, current research in manufacturing analytics focuses mainly on the prediction of manufacturing lead times, while neglecting the consideration of desired delivery dates, which may be a crucial aspect. Our results prove the importance of this integration, showing that without considering desired delivery dates, the accuracy of delivery date forecasts is no better than that achieved by conventional methods such as moving averages or mean delivery times. The central role of desired delivery dates in improving forecast accuracy is carefully assessed through the application of Shapley values, shedding light on their profound impact in this area. Through the strategic use of established QLM, we strive to extract valuable features based on domain knowledge. This approach enables domain specialists to understand the underlying factors of predictions. As a result, it has the potential to enhance both the trust and clarity of AI, which addresses a significant challenge in its integration into decision-making processes (Adadi & Berrada, 2018; Golpayegani et al., 2023).

Therefore, the object of this study is to evaluate the use of machine learning in predicting delivery dates through a combined case study from two job shop manufacturers. For the case studies, we use two German companies that operate as contract manufacturers producing small batches from 1 to 10,000 pieces. One manufacturer produces precision mechanics utilizing 17 different machines. The other case deals with a manufacturer of rubber sealings utilizing 52 machines covering 15 different manufacturing technologies.

The contribution of this study can be summarized as follows:

  • We prove that machine learning is capable of assisting in predicting delivery times at various stages of the order processing workflow, which leads to sustainable competitive advantage for companies. We especially use the desired delivery date as a feature. This has not been taken into account in predicting the delivery date yet.

  • We evaluate how sets of input parameters available at different stages of the process contribute to the prediction accuracy.

  • We adapt the CRISP-DM methodology in this paper to focus on the intricate task of delivery time forecasting. In doing so, we extend the scope of the conventional CRISP-DM framework, introducing elements such as the utilization of QLM.

  • We demonstrate that machine learning is able to estimate delivery dates before confirming incoming orders, which reduces manual efforts throughout the process.

  • From a practical perspective, we show that QLM combined with machine learning can significantly improve the prediction of the time of order confirmation. Compared to the previous method in one Use Case, it shows that the forecast error, measured by the Normalised Root Mean Square Error (NRMSE), can be reduced by about two-thirds.

  • We show that the selection of the best features and their quantitative importance (measured by SHAP Values) is central to forecast quality. It is not only about finding the best algorithm but also about identifying and quantifying those features that give the best results. Based on the use cases we present, we are able to highlight the essential role of considering the desired delivery date for the accurate prediction of delivery dates. This finding is useful for all companies with small batch production, as it can be applied to their forecasting processes.

Our study is structured as follows. After a literature review that places our contribution in the context of current research, we outline the methodology used to utilize domain knowledge in the feature extraction process. Then, the case studies are described and the application of the steps of the methodology are documented in the main case. The results are compared with the findings of the secondary case before the most relevant input factors are discussed within the context of existing domain knowledge. Results indicate that the extraction of features using domain-specific knowledge significantly improves the quality of the prediction and the explainability of the overall results. In the presented main case, the proposed machine learning-based approach estimates delivery dates with an additional error of 2 business days compared to the estimation of the process planning department. The machine learning approach does not use data that can only be used after process planning has been conducted on the customer order. Therefore, the machine learning-based approach can be used significantly earlier in the process, for example within the offer process or before confirmation of the customer order.

Literature review

The scientific community provides different definitions for delivery times depending on the domain. Within the industrial context, Wiendahl (1997) defines the 'delivery time' as the period of time between the placing of the order by the customer and the receipt of the goods, the 'delivery date'. This definition stems from a logistics-related context in multi-link supply chains. Accordingly, Chapman et al. (2017) define delivery times as the sum of transport time, picking, provision, and order processing. Looking at the delivery time from the production point of view, the definition slightly differs. Following the Hanoverian Supply Chain Model (Schmidt & Schäfers, 2017), delivery times are the result of throughput times of the orders through the customer-order specific processes plus possible transport times. This implies, that the delivery time depends on the stocking strategy (Lödding, 2013). Thus, the delivery time in Engineer-To-Order (ETO) includes different time components than in Make-To-Stock (MTS) approaches. In addition, the delivery time is influenced by various variables. Among other factors, are the sequence-dependent setup times, the share of different product groups in the product mix, the batch size, the order lead time, and thus also the delivery time of a product in discrete manufacturing (Schuh et al., 2019). In ETO environments, the time components for the design and engineering are dominant, other components are procurement or supplier lead times, and lead times for production, including manufacturing, assembly, and testing (Alfnes et al., 2021).

The prediction of delivery times in the highly variable and complex production environment is necessary (Bhalla et al., 2023). Accurate prediction of delivery times is a good foundation for negotiations (Rau et al., 2006), as it allows companies to better plan and allocate resources, and make informed decisions based on a high number of influencing variables and gives, therefore, an advantage in the market (Amaro et al., 1999; Bhalla et al., 2023; Cannas et al., 2020; Grabenstetter & Usher, 2014; Hicks & Braiden, 2000). In the literature, delivery time estimation has been more of a byproduct of models to improve logistics performance, including order release planning or sequencing. Next to estimating delivery dates prior to order confirmation, assessing the feasibility of meeting requested delivery dates is highly important (Bhalla et al., 2023). As we move towards data-driven decision-making, various models have been developed for determining delivery time as well as different individual components such as due dates or lead times in different domains (Bezirgiannidis et al., 2013; Choetkiertikul et al., 2017; Jodlbauer & Tripathi, 2023). The literature describes deterministic, stochastic, and combined methods. In the industrial environment, mathematical calculations, heuristics, and meta-heuristics were first described decades ago (Adam et al., 1993; Ragatz et al., 1984). Thürer et al. (2012) and Moses et al. (2004) propose simulation-based heuristics for estimating delivery dates, known as simulation-based due date setting and incremental forward simulation based on an approach of Roman and Del Vallei (1996). Even today, these approaches are still being used and further research is being carried out on them (Bhalla et al., 2023).

In recent years, the use of machine learning methods to support decision-making in manufacturing has been growing, as they provide the opportunity to make accurate predictions based on large amounts of data. In particular, the use of such methods to solve production planning and control (PPC) tasks is gaining increasing interest. More recently, there has been a sharp increase in the number of publications on the use of machine learning in the estimation of components of the delivery times as well, especially throughput times (Maier et al., 2022). This can be explained by the fact that there are various variables influencing the total delivery time. Especially in this environment with different data and influencing factors, the use of machine learning helps to identify correlations in the data (Shet et al., 2022). This can be particularly valuable in the manufacturing environment to analyze the dependencies of delivery time components in detail.

The majority of existing methodologies do not utilize QLM for feature extraction in delivery time forecasting. Much of the existing literature mainly focuses on modeling techniques and neglects the crucial step of feature extraction. A more in-depth discussion of the primary influencing factors with the aim of generalization is notably observed in the works of Öztürk et al. (2006) and Bender and Ovtcharova (2021). The authors employ distinct (simulation) models to assess the impacts of various manifestations of influencing factors on forecasting accuracy.

A review of the literature in this field reveals that various authors have examined delivery times or some of their key components. In our research, we considered references that deal directly with the assignment of time components, as well as literature in which a prediction of any time components is implied as part of the optimization of a system. Table 1 classifies methods for determining delivery times as well as the type of input data. Based on this, it is possible to draw several conclusions regarding the use of machine learning for predicting delivery times in a practical setting. Firstly, many of the existing publications only utilize simulation data, which may not accurately reflect real-world scenarios (Alenezi et al., 2008; Bender & Ovtcharova, 2021; Murphy et al., 2019; Öztürk et al., 2006). Using simulation-based approaches, Adam et al. (1993) and Thürer et al. (2012) nevertheless demonstrated that it is possible to predict production lead times in complex systems with multi-level product structures, but only a few exceptions specifically focus on delivery time rather than just the order lead time (Alnahhal et al., 2021; Khiari & Olaverri-Monreal, 2020). Second, most contributions continue to use, at least for comparison, classical data-driven methods such as linear regression, heuristics, or other mathematical models (Nguyen, 2016; Polim et al., 2017). Thirdly, publications that use more complex algorithms such as Neural Networks also typically use simulation, which may lack practical validation (Schneckenreither et al., 2021; Yang & Zhang, 2018; Yang et al., 2017). The considered research, applying Neural Networks on real data, deals with the delivery of packages and is therefore assigned to a rather foreign domain (Araujo & Etemad, 2021). Finally, many publications with real-data validation use or compare tree-based models, whose use is often based on easier comprehensibility (Alnahhal et al., 2021; Khiari & Olaverri-Monreal, 2020; Mohsen et al., 2023).

Table 1 Overview of data-based determination of delivery time and its components in complex production systems, based on Rokoss et al. (2021) and Bhalla et al. (2023)

Overall, existing literature maintains a narrow focus on the determination of specific time components such as throughput time, instead of looking at the prediction of the delivery time. A holistic approach to determining delivery times requires understanding the relevance of production management theory-related features, as well as increasing the interpretability of the results with existing domain knowledge. A structured approach to applying the supervised learning models to real production data is needed. This involves several steps, including the need for clear business understanding to include the relevant features on the delivery time, detailed data pre-processing using specific models for feature extraction, as well as suitable modeling to select the best supervised learning models should be aimed for. This paper intends to consider these aspects from the literature review while shedding light on the development and application of machine learning techniques for delivery time prediction.

Methodology

This paper uses a deductive approach to contribute to the explanatory research on delivery time determination. The methodology used to predict delivery times follows an applied research approach and is based on real world primary data. Using exemplary use cases is an established approach in the field of machine learning, as it allows to gain insights with practical relevance (McCutcheon & Meredith, 1993; Steinberg et al., 2022). Thus, a machine learning-based approach for predicting delivery times has been developed using data and information from two separate manufacturing companies producing in small batches. In contrast to mass producers, small-batch producers have an earlier decoupling point from customer orders. The earlier decoupling point allows for more customization of products to meet the product variability needs of the market, but at the same time missing finished goods buffers lead to longer delivery times (Olhager, 2003). Small batch manufacturers generally have high production uncertainty due to various complexity and dynamic characteristics like production volumes, product mix, and design changes (Birkie & Trucco, 2016; Braglia et al., 2019; Powell et al., 2014). These characteristics cause variability in the individual components of the delivery time, compared to a mass producer that stores finished goods in stock. It increases the difficulty of the prediction. Analyzing those dynamics is a main challenge to manage their impact on the companies’ targets (Richter et al., 2023). External factors like unstable incoming purchases or customer request dates act as a catalyst for internal fluctuations because they lead to rush orders and non-linear scheduling, resulting for example in sub-optimal capacity utilization. As these aspects affect delivery times, the approach presented in this study considers the characteristics of small batch production.

The overall aim is to answer the following research questions:

  1. 1.

    Can delivery times be predicted earlier in the process in the same quality the process planning department provides?

  2. 2.

    Which domain knowledge-specific features do improve the quality of the prediction?

  3. 3.

    What are the different quantitative effects of the different features on the training model?

  4. 4.

    Do the domain knowledge-specific features help a domain expert to follow up on the decisions made?

The methodology used for the development of the approach follows the widespread, industry-independent Cross Industry Standard Process for Data Mining (CRISP-DM) (Chapman et al., 2000). We use domain knowledge by applying QLM described in the Hanovarian Supply Chain Model (Schmidt & Schäfers, 2017) on manufacturing systems to identify relevant input factors regarding the manufacturing process. The underlying funnel model for capacity utilization (Bechte, 1988) and the throughput diagram (Wiendahl & Tönshoff, 1988) represent the main influence on the feature extraction process. Overall, the methodology of this paper aims to specify the very broad CRISP-DM for the task of delivery time forecasting, by taking into consideration desired delivery dates and QLM.

CRISP-DM follows six key phases using a cyclic approach, including business understanding, data understanding, data preparation, modeling, evaluation, and deployment (Chapman et al., 2000). The deployment phase is strongly company-dependent and therefore not part of this paper. Although CRISP-DM itself is well documented, it only provides a very broad structure to a wide variety of data mining projects. To address the requirements of delivery time prediction, the specific methods utilized to adapt the CRISP-DM for forecasting delivery times are discussed below. Figure 1 gives a schematic overview of the specific methods that are used in the respective CRISP-DM phase.

Fig. 1
figure 1

CRISP-DM scheme (Chapman et al., 2000) enlarged with application-specific methods for delivery time prediction

Business and data understanding

The initial phase, business understanding, requires gaining knowledge about requirements and needs from a business perspective. From a value stream perspective, it is possible to identify the key stages of the ordering process, from the first customer inquiry to the product delivery. From an operational perspective, it is possible to understand the manufacturing process and the sequence of operations required to manufacture the goods, both automated and manually. Finally, interviews with stakeholders allow to unveil domain knowledge and in particular what are the key factors influencing delivery times according to the stakeholders’ business experience, thus providing guidance for future modelling. To address the specific characteristics of small batch manufacturers, the relevance of external manufacturing processes needs to be examined using value stream mapping and/or interviews.

Key factors of the data understanding phase for delivery time prediction include data on products, customers, and suppliers. Process-wise data on ordering processes, manufacturing processes, and overall market data need to be analyzed. In terms of products, it is possible to identify the structure of the product portfolio, the volume, complexity, manufacturing steps, and whether products follow a Make-to-Stock (MTS) or a Make-to-Order (MTO) approach. Customer data shows the number of orders, order frequency, pricing, and relevant customer groups. Similarly, supplier data highlights the presence of possible stable partnerships, as well as how semi-finished goods and components are delivered (i.e. based on the customer order or in stock), and in what time frame. In general, exploring the available data needs to be split into data that is available prior to the processing by the process planning department and data that is available after processing. To apply QLM during feature engineering it is crucial to extract relevant timestamps from the ERP systems. Timestamps marking the start and finish of every machining step internally are necessary, as well as timestamps on external machining and procurement. The overall focus of the data collection must be to reconstruct the full internal and external processing of a customer order, covering every step from the first offer to the handover of the finished goods to the shipping company.

Data preparation and feature extraction

The data preparation phase involves all activities required to prepare the final dataset before it is fed into any model. In workshop environments, the key manufacturing activities are measured manually by operators, who observe processes and record timestamps. This manual process ultimately often leads to gaps in the collected data. The use of Multivariate Imputation by Chain Equations (MICE) approach for handling missing values is an appropriate approach to improve the overall data quality before proceeding to feature engineering. Feature engineering is supported by domain knowledge regarding production planning in extracting features from raw data that best describe the manufacturing, procurement, and shipping processes. The Hanovarian Supply Chain Model offers different approaches to describe the logistic performance of a manufacturing system. Due to the lack of finished goods stock inventory in MTO workshop manufacturing, stock inventory models as well as service level models are not applied to the datasets. Instead, the feature extraction focuses on internal and external manufacturing processes as well as customer and supplier ratings. Internal manufacturing processes are analyzed by applying the funnel model for capacity utilization (Bechte, 1988) and the throughput diagram (Wiendahl & Tönshoff, 1988) on the existing data. The combination of Work-in-Process (WIP) and the resulting range of the systems, allows to identify internal manufacturing bottlenecks for every order. WIP is defined as the workload that is waiting to be processed by a system. WIP can be measured using a time scale (e.g. manufacturing minutes) or using a quantitative scale (e.g. number of orders). For further analysis, the average output of a system needs to be calculated. The average output of a system is measured by the sum of the output over all orders divided by the time it took the system to finish the orders. The range of the system x is calculated by dividing WIP by the average output over the past n manufacturing orders. The range measures the time a system needs to finish all waiting orders (Kettner & Bechte, 1981).

$${range}_{\left(n,x\right)}=\frac{{WIP}_{(x)}}{\frac{1}{n}\sum_{i=1}^{n}{{output}_{(x)}}_{i}}$$
(1)

The range can also be applied to the whole factory to get an idea of the overall waiting time for new orders. The range of the production system is calculated by dividing the WIP of waiting customer orders by the average quantity of dispatched orders. With the range for every system available, every order can be checked for their respective bottleneck system by finding the maximum value of the ranges for the systems k the specific order needs to pass.

$${{range}_{bottleneck}}_{(n,k)}=\underset{x\in k}{{\text{max}}}\left\{{range}_{\left(n,x\right)}\right\}$$
(2)

External manufacturing processes are covered in the feature set with the average delivery time \(\overline{DT }\) and the average desired delivery time deviation \(\overline{\Delta DT }\) over the past m orders placed.

$${\overline{DT} }_{m}=\frac{1}{m}\sum_{i=1}^{m}DT$$
(3)
$${\overline{\Delta DT} }_{m}=\frac{1}{m}\sum_{i=1}^{m}(DT-desired DT)$$
(4)

The presented features can be combined with an ABC classification or ranking of the suppliers. The same procedure is applicable to the customer side, offering information on key account customers that might result in prioritization of their orders to maintain a high long term delivery time reliability. An approach towards a ranking system is to generate rank by applying a sortation to customers and suppliers by the relative quantity Q of orders that have been placed by customer/supplier within a certain time frame in the past (e.g. three years) compared to all placed orders m.

$${Q}_{(z)}= \frac{1}{m}\sum {order}_{customer/supplier}$$
(5)

It is important to note that the presented approaches follow the basic steps of feature engineering, and they should be considered as a baseline for further development. Every production system has its unique mechanisms and resulting data, which needs to be processed accordingly.

Modeling

After the data has been adequately prepared, modeling techniques within the field of machine learning can be applied to predict delivery times and extract relevant input factors. This requires selecting the most performant model out of various proven supervised learning algorithms. In this study, we utilize Support Vector Machine (SVM), Artificial Neural Networks (ANN), Gradient Boosting Tree (XGB), Decision Tree (DT), Random Forests (RF), and Linear Regression (LR). To determine the influence of domain-specific knowledge on feature generation, models are trained with different feature sets. The modeling involves tuning relevant parameters of the models by applying grid search, reducing the number of features by backward feature elimination, and cross-validating the model. Since the second case only consists of 2632 orders, for consistency both cases utilize a 90–10 train-test split to improve the overall prediction quality compared to the more common 80–20 split. This is achieved by increasing the size of the training and validation data set (Xu & Goodacre, 2018) to 90%. The increased risk of overfitting that could be caused by the smaller test data set is reduced by cross validating the prediction results (Hawkins et al., 2003; Vabalas et al., 2019). When validating predicted delivery times, it is important to take into consideration the chronological sequence of the orders. Performing a k-fold cross validation, often used in classification problems, would expose the dataset to the “look-ahead bias”. This bias occurs when future data is used during training, thus leading to better results in the training process than later in a production environment. To avoid introducing future knowledge for predictions, a Time Series Split Cross Validation approach is applied. This method allows to avoid the look-ahead bias while using enough data to perform effective cross validation (see Fig. 2). While the training set increases in size, the test set maintains a consistent size and includes observations that happen after the ones in the training set.

Fig. 2
figure 2

Time series cross validation vs. K-fold cross validation (adapted from Hyndman and Athanasopoulos (2021))

Evaluation

The evaluation phase allows to assess the results provided by each model and compare the results. After the data is trained on 90% of the dataset in the modeling phase, the model is tested on the remaining 10% of the data. The process is repeated for various algorithms and feature sets, and the performance is evaluated accordingly. The quality of the predicted delivery times is measured as the deviation between the actual delivery date and the predicted delivery date in business days. To put the measurements into context, the results are compared to the prediction deviation of a) the delivery dates estimated by the process planning department and b) estimated delivery dates if the average delivery time is used. To analyze the performance of the different approaches, various metrics are used. One of the most widely adopted metrics is R2, which is used in regression problems to evaluate the quality of the model on a scale from − ∞ to 1.

$${R}^{2}\left(y,\widehat{y}\right)=1-\frac{\sum {({y}_{i}-{\widehat{y}}_{i})}^{2}}{\sum {({y}_{i}-\overline{y })}^{2}}$$
(6)

Due to its intuitive interpretability, another popular metric is using the mean average error (MAE), which returns the absolute distance of the prediction from the real value. When predicting delivery time, MAE is measured in business days.

$$MAE(y,\widehat{y})=\sum_{i=1}^{n}\frac{\left|{y}_{i}-{\widehat{y}}_{i}\right|}{n}$$
(7)

However, MAE is robust to outliers when there are enough observations close to the real value. For the prediction of delivery times for small batch manufacturers, outliers are extremely important as small deviations can be addressed by adjustments such as rescheduling or increased capacity (e.g. extra shifts), while large differences between the prediction and the real delivery time represent a big issue for the manufacturing company. Therefore, because of its ability to weight outliers more heavily, the Root Mean Squared Error (RMSE) is selected as the preferred metric for the prediction of delivery times.

$$RMSE\left(y,\widehat{y}\right)=\sqrt{\frac{{\sum }_{i=1}^{n}{({y}_{i}-{\widehat{y}}_{i})}^{2}}{n}}$$
(8)

In addition to the metrics previously mentioned, the normalized RMSE and the normalized MAE are used to allow to comparison of results between different manufacturing systems, as normalization allows to compare datasets with different scales.

$$NRMSE\left(y,\widehat{y}\right)= \frac{RMSE\left(y,\widehat{y}\right)}{\frac{1}{n}\sum \left|{y}_{i}\right|}$$
(9)
$$NMAE\left(y,\widehat{y}\right)= \frac{RMSE\left(y,\widehat{y}\right)}{\frac{1}{n}\sum \left|{y}_{i}\right|}$$
(10)

However, the proposed approach aims to provide an automated method for predicting the delivery date either when the first offer is made or shortly before the incoming customer order gets confirmed. The proposed methodology aims to adapt the broadly applicable CRISP-DM to the task of delivery time forecasting by utilizing desired delivery dates and QLM.

Case studies

The methodology presented above is validated through its application to two case studies from two real world manufacturing companies. Both companies behind the following case studies use workshop production, despite operating in different industries. After a short introduction of both cases, the first case study is described in detail. As the second case study follows the key steps of the main one, only the results are presented. Since the study aims to compare predictions made at different points in time, both case studies follow the experimental design shown in Fig. 3. In the first step (I), manufacturing as well as procurement, article, and customer order data is fetched from different systems within the company. Then, all relevant features are extracted from the existing data (II). The features are split into three groups (A, B, C). Feature group A consists of features that can be directly extracted from the customer order, such as ordered articles or the date of order confirmation. Feature group B consists of features that can be extracted from the underlying data utilizing domain knowledge. The features in group B are extracted from the data by applying QLM as described in the methodology section. Feature group C consists of features that can only be extracted late in the order process when the process planning department has finished processing the order. The features of group C contain the planned start of manufacturing, planned throughput time, and the planned delivery date. In the third step (III), the feature sets for the modeling phase are combined. All three feature sets contain features directly from the dataset (group A). Feature sets 2 and 3 also contain features that derive from domain knowledge (group B). Feature set 3 also contains features that are only available later in the process (group C). The fourth step of the experimental design is the modeling and evaluation phase as described in the methodology section (IV). For both cases, the same model parameters are applied as described. The modeling phase consists of three phases—the model selection, the feature reduction, and the model refinement phase. In the final step, the results of the application of the feature sets are discussed in the context of manufacturing management domain knowledge (V).

Fig. 3
figure 3

Experimental design for both examined case studies

Comparison: case study business characteristics

Both examined companies are make-to-order (MTO) manufacturers based in Germany. The manufacturer behind case 1 produces order-specific rubber sealings. The manufacturer behind case 2 produces metallic parts for mechanical use cases. Both companies produce in batches between 1 and 10,000 pieces for a global market. Since the global shipping times vary widely depending on the level of urgency of the shipping and the shipping provider, all delivery dates in both cases are communicated ex works. The two cases were chosen for this study because of the similarity in the underlying business process from the initial offer to the final delivery of the goods. The manufacturing process of the goods usually requires between three to ten machining steps in case 1 and three to seven machining steps in case 2. Although the two examined cases share several similar process patterns, there are some differences. The company behind case 1 does not send unfinished goods out to other manufacturers for external processing such as painting. In case 2, those external processes are part of the manufacturing process and can impact the manufacturing throughput time of the orders. Additionally, the desired delivery date was not specified in the dataset for case 2.

During the offer process, the sales department estimates delivery times based on experience, taking a large buffer time into the calculation. After receiving the customer order, along with CAD (Computer Aided Design) files describing the object details, the order gets confirmed and the details are forwarded to the process planning department. On average, the process planning department is able to estimate a delivery date using manufacturing execution systems and communicate it to the customer within 7 business days for case 1 and four business days for case 2. During this time, the process planning department verifies the CAD files sent by the customer, checks the availability of raw materials, and defines process details related to manufacturing. Decisions on utilized manufacturing methods and machines are finalized and then migrated into software files for production. Finally, the order is placed into the Enterprise Resource Planning (ERP) system and the theoretical delivery date is calculated. Overall, this planning process results in a high risk of the delivery date initially communicated in the order confirmation not being matched by the manufacturing capacities. This Mismatch will first be visible days after the confirmation of the customer order, as shown in Fig. 4. Therefore, the key objective in the given case studies is predicting the delivery date of a customer order based on the information available at the time of the first offer, although a machine learning-based approach can be applied also later in the order processing (i.e. order confirmation and manufacturing order released into the ERP system) (see Fig. 4).

Fig. 4
figure 4

Customer order processing (both cases)

Case study 1

Business and data understanding

Data was fetched from the company’s Enterprise Resource Planning (ERP) system including the time frame from 07/23/2019 to 03/31/2022. Table 2 gives a short description of the raw data that has been processed.

Table 2 Description of raw data

The main data set on manufacturing orders contained 16,361 single orders involving 58,344 manufacturing steps. The manufacturer finalizes approximately 300 customer orders on a weekly basis. Orders from 1,630 different customers covered 7,620 different articles to be produced. The full data set contains ten tables counting 73 columns and 2,393,353 rows.

Data preparation and feature extraction

To gain valuable insights, the data collected needs to be cleaned and processed. The main goal of this phase is to identify relevant features and avoid data inconsistencies that might lead to inaccurate results during the modeling phase.

Firstly, the analysis focuses on the entire dataset and the label attribute. Across the entire dataset, duplicate data has been removed, and missing data points were omitted from the dataset using Multiple Imputation by Chained Equations (MICE) and k-nearest-neighbor regression. The label attribute contains the delivery time. For this, the original timestamps have been converted into integers representing the number of business days since the customer order. Removing non-business days allows to eliminate the effect of public holidays (e.g. Christmas) on the manufacturing capacity and delivery times (see Fig. 5). This enables the extraction of clear statistics of the prediction variable. Thus, the average delivery time is around 18 days, and over 99% of deliveries are completed within 50 business days, with only a few outliers over 50 business days (see Fig. 6).

Fig. 5
figure 5

Daily utilization of available work hours (case 1)

Fig. 6
figure 6

Distribution of the label attribute (case 1)

Secondly, features have been examined and processed independently to address their individual characteristics. As a result, ad-hoc steps for data preparation have been followed for customer order-related features, features representing production planning and control information (domain knowledge), and features representing the planned times available after the order has been processed by the process planning department.

Attributes related to customer orders (feature group A) contain various information including order date, desired delivery date, products, and other customer details. These features are converted to numerical or categorical types while ensuring consistency across the dataset. Among these features, there is also the predicted number of machines to be used in the manufacturing process. The company behind case 1 is organized in job shops, with 13 different departments that group machines of the same type (e.g. milling, drilling, gluing). When the customer first requests an offer, the sales department estimates how often these departments need to be utilized to complete the order.

Additional attributes have been extracted to represent the domain knowledge in production management (feature group B), with the features deriving from QLM. These features aim to provide statistics related to the performance of the manufacturing system, based on historical data. This is due to the influence of the performance of manufacturing systems on delivery times. For example, the presence of bottlenecks, which vary depending on specific product routings, limit the amount of output a manufacturing system can deliver (Goldratt & Cox, 1984).

Finally, after the order has been processed by the process planning department, it is possible to extract the attributes representing the planned times (feature group C), namely the planned throughput time, the planned manufacturing start date (expressed in number of business days since receiving the customer order), the planned machine set up times and cycle times. Other historical information represents the status of the manufacturing system. This includes the level of work-in-progress (WIP) in the factory and in specific departments, as well as the average number of processed orders in a specific month for both the whole factory and specific departments.

The three feature groups collected represent different inputs to the modeling phase (see Table 3).

Table 3 Main data groups and availability in time

Modeling

Modeling and evaluation for the case study are iterated three times with different feature sets. The initial predictive model is developed using two of the three main data groups identified in the previous phase, namely customer order-related data, and domain knowledge (feature set 2). These two types of information are available as soon as an offer is created or a customer order is received, while planned times features become available only later in the process once the process planning department has processed the order. Thus, together these two groups are the input of the baseline scenario. After a model is developed using feature set 2 as input, the model is re-trained changing the input to feature set 1 and feature set 3, thus allowing to compare the contribution of each feature group. The modeling phase consists of the three steps model selection, feature selection, and model refinement.

1. Model selection

The dataset is trained using a 90–10 ratio. Various established machine learning algorithms have been evaluated using the metrics previously mentioned. The table below lists the models used in the training phase along with their parameters for the grid search. Training has been conducted using between 18 and 20 different combinations of parameters (see Table 4) within a separated grid search process for each model. Models have been fitted using Time Series Split Cross Validation.

Table 4 Models and parameter sets for initial grid search

The results of the initial modeling process are presented in Table 5. As shown by the results in the table, all models perform better than the baseline prediction using the delivery time average (RMSE = 12.1 business days). However, because XGB demonstrates the best performance in RMSE and MAE, further development is conducted using this algorithm.

Table 5 Results of initial modeling (case 1, feature set 2)
2. Feature reduction

During the data preparation phase, the data is extracted and aggregated from different systems, resulting in 46 features. To reduce the risk of overfitting the model, as well as the amount of data preparation needed in production, only the most relevant features are selected. Thus, features are ordered by importance using Shapley Additive Explanation (SHAP) values. In this specific case, while the number of features for a local optimum would be 41, it is possible to score a close-to-optimal RMSE within 0.1 business days with 13 features (see Fig. 7).

Fig. 7
figure 7

Feature reduction process by threshold definition

3. Model refinement

Once the best model is selected, along with a reduced number of features, the model parameters are tuned again to identify the values leading to the best performance. The final grid search identified the following optimal parameters of the XGB model:

  • colsample_bytree: 0.9

  • learning_rate: 0.1

  • max_depth: 4

  • n_estimators: 250

Then, the model is re-trained over the training dataset with 13 features, and prediction values are saved from the application of the model to the test set. In the final step, the metrics are being calculated.

Finally, the whole modeling process is re-tested to compare the relative contribution of the feature groups identified in the data preparation phase.

Evaluation and results

The value of the delivery times prediction can be best discussed by comparing the various applications represented by feature sets. Table 6 shows the performance of the machine learning approach using different feature sets. The baseline application is the simple calculation of the mean delivery time based on historical data. This results in a mean of 12 days RMSE and a negative R2, indicating a very poor correlation. Additionally, the results are compared to the best-performing approach using moving averages. Moving averages were calculated using different numbers of finished orders (n = 5, 10, 25, 50, 100, 200, 500, 1000, 2000). The prediction using the best-performing moving average (200 orders) does result in comparable results to using the mean delivery time as the prediction. Training a dataset without domain knowledge features improves the prediction by almost 3 business days in RMSE (9.5). This shows that the use of machine learning can help with predicting delivery times, even without utilizing domain knowledge. In this scenario, the prediction error is reduced by almost 50% compared to the estimates produced by the process planning department. However, by training a dataset that includes production management domain knowledge features, the RMSE can be further reduced by 1.8 business days (7.7), reducing the remaining error almost by a further 50%. In other words, including information about the current state of the performance of the manufacturing system allows considerable knowledge gain. Finally, when the planned dates defined during process planning are also included in the dataset, the results are more accurate but only by a small margin. This shows that the planned dates originate mostly from the information derived from the status of the manufacturing system. It is important to note that this last step can be performed only late into the process when the process planning department calculated the planned delivery date.

Table 6 Final results for XGB model (case 1)

Case study 2

Business and data understanding

To validate the method presented in the previous section, a second case study was examined. Since the majority of the steps applied from the methodology, this section only addresses differences in the processing of Case 2 compared to Case 1, described in the section above. Figure 8 visualizes the delivery times of the second case, showing overall longer delivery times compared to case 1.

Fig. 8
figure 8

Distribution of the label attribute (case 2)

In case 2, the mean delivery time is 27 business days, resulting in an RMSE of 10.18 business days if used as prediction. After applying the methodology, the initial results proved to be significantly less performant compared to case 1 (see Table 7). The results show that machine learning does not allow significant improvements in the prediction, compared to using the mean delivery time.

Table 7 Initial results for XGB model (case 2) without constructed desired delivery time

Data preparation and feature extraction

Although the dataset structure is similar to the one presented in case study 1, the desired delivery date was not specified in the dataset for case 2. This results in higher uncertainty, as it is unclear whether the order should be completed by a desired date in the future or as soon as possible. To address this issue, a new feature was engineered and added to the dataset. The algorithm checks whether the orders in the dataset have been completed within the longest range of the used machines and calculates the throughput time of the whole order. Combined with a normalized buffer (+ 3/-3 business days), the algorithm decides if the order was produced as fast as possible depending on the capacity at the time, or if there was an additional delay. If there was an additional delay, the algorithm establishes that there was a desired delivery date causing the delay.

Modeling, evaluation and results

To ensure the comparability of both case studies, Case Study 2 went through the same modeling phase as described above for Case Study 1. With the new feature included in the dataset (see Table 8), the results improve by 2 business days without including the domain knowledge as input. Again XGB showed the best results.

Table 8 Results of initial modeling (case 2, feature set 2)

Table 9 shows that compared to case study 1, adding the manufacturing domain knowledge does not improve the RMSE and MAE significantly (0.4 business days), while the R2 score is enhanced by 0.13. As opposed to the first case, adding planned times information (feature group C) improves the RMSE results by 4 business days, with predictions being 2 business days more accurate than the dates planned by the process planning department. This is due to the fact that the planned dates defined by the process planning department tend to be 2 days late on average (see Fig. 9).

Table 9 Final results for XGB model (case 2)
Fig. 9
figure 9

ML model compensates mean delivery time deviation of process planning department

Discussion of results over both cases

Table 10 summarizes the key results of the two case studies presented in the previous sections. The R2 score shows that with the mean delivery time being used as the delivery time prediction, there is no relationship between the real values and the prediction.

Table 10 Comparison of both cases (NRMSE, NMAE, R2)

The same effect can also be observed in the comparison between the RMSE of the machine learning-based approaches to the RMSE of the prediction from the process planning department (see Fig. 10). The diagram also shows an increasing added RMSE correlating with a decreasing amount of information given to the ML approach.

Fig. 10
figure 10

Added error of approaches compared to prediction results of process planning department

As mentioned earlier, using the baseline prediction of the mean delivery time calculated on historical data results in significant error, leading to a R2 score of 0, which means that there is no correlation between prediction and reality. The same applies to the usage of the best approach utilizing the moving average of the delivery time. In both cases, the NRSME highlights the gain in prediction accuracy when different groups of information are included in the datasets. In both cases, the NRSME decreases when all available information is given as input to the machine learning model (feature set 3). However, the decrease in case 2 derives from a global delay of all orders. After the correction of this delay, the additional decrease in prediction error for both cases is below 10%. This indicates that after scheduling all processing steps, unforeseeable changes result in an error that can neither be corrected by the ML model, nor by the process planning department. If the prediction is made earlier in the process—before the customer order is confirmed or the offer sent out—using only customer order-related data (feature set 1), the prediction error increases compared to the delivery date determined by the process planning department. Without planned delivery times from the process planning department and domain knowledge-specific features (feature set 1), the ML model scores a RMSE that adds around 50% relative error. If domain knowledge-specific features are added (feature set 2), the additional error is reduced to around 35%. However, as the domain knowledge is based on historical data, and therefore it is known prior to the order, the delivery date prediction can be done before the order gets processed by the process planning department. Although this results in less risk of late delivery when confirming the customer order, the added uncertainty results in a higher prediction error. This demonstrates the importance of the knowledge about the performance of the production system. Knowing details about production system dynamics allows the early prediction of the delivery date. From a manufacturing management perspective, the early prediction of the delivery time before the customer order is confirmed has several positive impacts on the overall production performance. Early knowledge about realistic delivery dates helps avoiding rush orders. Fewer rush orders improve the reliability of the planned schedule which improves the stability of the overall production process. Assuming a more stable production process leads to less unpredictable delays. Fewer delays will eventually result in an even more precise delivery time prediction.

The contribution of different feature groups can be further explained by analyzing the respective top 10 SHAP values. The tables below list the most important features of the two case studies except for the planned times defined by the process planning department (feature set 2). The features are split into two groups: features extracted directly from the dataset (group A) and features that utilize QLM (group B) to be calculated (Table 11).

Table 11 Most important feature groups

In both cases, with a SHAP value above 4, the most valuable feature is the desired delivery time. From a manufacturing management perspective, this information highlights the difference between manufacturing lead time and delivery time prediction. Orders may be received with a desired delivery date in the far future. The resulting desired delivery time is the only feature in the examined cases, that provides information to the model, whether the manufacturing processes need to start as soon as possible or at some point in the future, generating a buffer that can help meeting the desired delivery time. The desired delivery time is directly fetched from the input dataset. So are the quantity of items ordered by the customer and the resulting quantity of processes and elements of the BOM. From the manufacturing management perspective, both features enable the model to estimate the quantity of operations needed to fulfill the order. In the group of features generated using domain knowledge WIP, range, and bottleneck calculations are important features describing the internal manufacturing process. WIP calculations generate information for the model on how many orders are currently in the system. The range calculations add information on the existing capacity of machines and employees to the WIP and condense this information to an estimation of the time needed for a specific system to finish all existing orders. Range calculations allow the identification of bottleneck systems for an individual order, therefore giving the model insight into the earliest possible start for the manufacturing processes. The last two domain knowledge-related feature groups are the processing of external orders and the customer rank based on the quantity of orders placed. From a manufacturing management perspective, the analysis of external manufacturing processes is as important as the analysis of internal manufacturing processes, although most of the underlying information on the suppliers’ manufacturing process is expectedly not available. However, information on the suppliers' on-time delivery reliability in the past and the quantity and value of the external order condense into useful information for the model. The high impact of the customer rank based on the quantity of placed orders can be explained by domain knowledge. A high number of orders indicates a higher chance of recurring processes and therefore shorter and less spreading processing times. All of the above-mentioned feature groups improved the prediction of the actual delivery time and are therefore considered mandatory for implementation in practice. Another observation that matches domain knowledge is that the estimated machining times of the single manufacturing steps do not show in the top 10 most important features in both cases. Research shows that for manufacturing lead time determination the transition times (not available in the observed cases) are more important than the actual machining times (Schuh et al., 2019). For both cases, the replenishment times and overall procurement processes did not have a significant impact on the quality of the predicted delivery time. Since procurement processes can significantly delay a production process, if raw materials are not in stock, further research needs to be undertaken to investigate the impact of procurement delays on the delivery time in real world scenarios.

The top 10 most important features for both cases are listed in Table 12 and Table 13.

Table 12 Top ten most important features (case 1)
Table 13 Top ten most important features (case 2)

Conclusion

The proposed study aims to provide a machine learning-based approach for delivery date prediction. For the development and testing of such approach, a case study methodology has been used. Case studies are suitable for investigating the evidence of an approach—in this case using machine learning—in a real-world application and evaluating the transferability of the research idea (Yin, 2018). Based on the data collected from the two case studies, three groups of information were identified, namely customer order-related data, production planning and control information or domain knowledge, and information about planned times. Only the first two of these three groups are available for a first offer or as the customer order is received, and as this study aims at the early prediction of delivery dates, these two categories have been used to build the machine learning model. Referring to the four research questions formulated in the methodology, the case studies revealed the following results:

  1. 1.

    Can delivery times be predicted earlier in the process in the same quality the process planning department provides?

Applying ML approaches in the initial offering process or just before confirming the customer order results in less risk of late delivery. However, the added uncertainty results in a higher prediction error of + 35%. Using ML approaches in process planning just before releasing the manufacturing order can help to further reduce the prediction error of the delivery time.

  1. 2.

    Which domain knowledge-specific features do improve the quality of the prediction?

For the machine learning prediction, domain knowledge is extremely valuable, as it helps reducing the model error significantly. In other words, historical knowledge about the performance of the manufacturing system increases the prediction accuracy. The SHAP values showed a high relevance for all features extracted according to the Hanovarian Supply Chain Model. The identification of the current range of the bottleneck system as well as external manufacturing processes and customer rank by quantity of orders showed a high impact on the prediction in both examined cases. However, results across both case studies also show that the specification of the desired delivery date is the most valuable feature for the prediction of delivery times, as it allows the model to distinguish whether to aim for the fastest possible delivery or the delivery in a certain time frame.

  1. 3.

    What are the different quantitative effects of the different features on the training model?

All generated features utilizing QLM showed a high impact on the predicted output in the single-digit percentage range. However, the results showed a significant importance of the desired delivery date. In both cases, one third of the average quantitative change of the prediction value of a customer order relied on the desired delivery date.

  1. 4.

    Do the domain knowledge-specific features help a domain expert to follow up on the decisions made?

The extracted SHAP values show that similar features are used by the ML models in both cases. The trained models utilize input factors as features that enable comprehensible decision-making for a domain expert. The key reason for this lies in the fact that the underlying theory on QLM, that were used for feature generation, has been subject of research for decades and is popular within the domain of manufacturing management.

Overall, this research provides contribution to both academia and industry practice by addressing existing research gaps and providing a practical solution for delivery time prediction at the same time. The ability of machine learning to make predictions earlier in the process allows to define a delivery date at the time of the first offer. This can result in increased efficiency, lower costs for the process planning department, and ultimately it can represent a competitive advantage for manufacturing companies using this approach.

Future research on machine learning methods for delivery time prediction should aim in two directions. First, existing cases should be extended by including external input factors such as stock market indexes. Second, the application of new ML approaches to the field under investigation should be considered. A detailed examination of the impact of the desired delivery date on the prediction results seems reasonable. As this research is based on two cases, more examples are required to strengthen the observed results.