1 Introduction

In the present business world, organizations must improve their services in terms of efficiency, reliability, and availability to survive in the market. Sales forecasting and effective demand planning positively affect the performance of a supply chain [1]. The goal of demand planning is to develop a forecasting model that helps decision-makers in the areas of procurement, production, distribution, and sales. Forecasts serve as the basis for action plans carried out by various organizational units at different planning levels [2]. Managers usually predict sales based on their perceptions, intuitions, and experience. This may vary from person to person and it is very hard to continuously get reliable inputs from qualified and experienced managers. As a result, computer networks can assist in decision-making by estimating future sales. Machine learning (ML) can be utilized to create effective sales forecasting models utilizing the vast amount of data and related information [3].

ML techniques have become prevalent across various disciplines due to their ability to address the problems associated with increasingly large and complex datasets [4]. They use complex algorithms to reveal meaningful patterns in large-scale and diverse datasets, a task that would be virtually impossible even for a well-trained person [5]. Machine learning focuses on inductive inference, that is, inducing general models from specific empirical data [6]. In recent years, advancements in this field have been driven primarily by the creation of new algorithms as well as the ongoing growth of data and falling computational costs [7].

ML methods empowered by predictive analysis enhance customer engagement and forecast demand with better precision and accuracy than traditional demand forecasting methods [8, 9]. ML techniques can handle complicated correlations among many causal factors with nonlinear demand patterns, thereby boosting retail chain performance [10]. The auto-regressive integrated moving average (ARIMA) and auto-regressive integrated moving average with exogenous variables (ARIMAX) approaches are the most often used predictive models for demand forecasting. Recently developed ML algorithms, such as artificial neural networks (ANN), support vector machines (SVM), and regression trees, have been reported to outperform the traditional methods [11]. The main objectives of the study are as follows:

  • To explore the machine learning models used for forecasting.

  • To compare the selected machine learning models and hybrid models for sales forecasting of a US-based retail company.

In this study, several ML models are compared for retail demand forecasting. These models include random forest (RF), artificial neural network (ANN), gradient boosting (GB), adaptive boosting (AdaBoost), and extreme gradient boosting (XGBoost), and their performance is compared with that of the proposed hybrid model of RF, XGBoost, and linear regression (LR). To make an accurate comparison of these models, three performance metrics, namely mean squared error (MSE), R2 score, and mean absolute error (MAE), are considered. The advantages and limitations of the employed methodologies as well as future options for performance enhancement are explored. The historical sales data of a leading US-based multinational retail company is considered for the forecasting analysis. The company has a large number of retail stores across the globe, offering a wide range of products that fulfil the day-to-day demands of consumers. The sales data used for forecasting relates to various stores of the company spread across the USA.

The rest of the paper is arranged as follows: Sect. 2 presents the literature review, Sect. 3 discusses the research methodology, Sect. 4 presents the case study and discusses the results, and Sect. 5 concludes the study.

2 Literature Review

Many advances and changes have been observed in the global business world during the past couple of decades. Some major factors leading to business-related uncertainties include partner activities, consumer behaviour, rival behaviour, evolving technology, and new product development [12]. Because of these uncertainties, the market is becoming complex and competitive and needs contemporary supply chain management [13]. Precise forecasting is essential for the success of supply chains [14]. However, various endogenous factors concerning the collection and application of field data can make forecasting extremely difficult, and external factors can also have a detrimental impact on forecast accuracy [15]. Effective demand forecasting can save more than 7% of the annual operating expenditure of a business [16]. Either qualitative or quantitative techniques can be employed for demand forecasting [17]. Executives’ consensus, the Delphi technique, historical analogies, and market research are qualitative demand forecasting techniques. These techniques rely heavily on domain experts’ subjective evaluations and lack data-driven decision models. Quantitative methods, such as regression and time-series analytics, tend to be more systematic and dependable [18]. Regression methods, in particular, are concerned with determining the causal relationships between the independent and the dependent variables [19].

The notions of Industry 4.0, ML, and artificial intelligence (AI) are now being applied to supply chain analytics as an innovative framework [20, 21]. Strategic planning, ad-hoc reporting, and end-user computation are common in business intelligence and analytics and aid in robust performance evaluation and management [22, 23]. Descriptive analytics deals with what happened in the past, diagnostic analytics with why it happened, and predictive analytics with what is likely to happen in the future. Prescriptive analytics is then applied to mitigate harmful impacts [24, 25].

2.1 Machine Learning Methods

ML methods have garnered considerable interest from researchers and practitioners in demand forecasting [26]. However, in the context of supply chain management, these methods have not been investigated thoroughly. Although ML algorithms are complex, they offer a variety of distinct and flexible demand forecasting models [27]. The effectiveness of different statistical and ML methods is heavily debated, making it difficult to draw broad generalizations regarding their efficacy [28]. Under varied circumstances, each model class may outperform the others [29]. Some of the most prominent ML methods, which have been compared with the proposed hybrid model, are discussed briefly in the following paragraphs.

Random Forest

Punia et al. [30] introduced a hybrid forecasting method that is based on long short-term memory (LSTM) and random forest (RF). The model first uses an LSTM network to map the temporal characteristics of the data, and a random forest is then used to model the residuals of the LSTM network. The random forest part of the model is of vital significance, as it provides a substantial edge in forecasting accuracy through its ability to predict sudden changes due to holidays, promotions, etc.

XGBoost

Kang et al. [31] used an XGBoost hybrid model for tourism and trend prediction. They introduced location attributes and the time-lag effect of network search data to propose the hybrid model. The findings suggest that the spatiotemporal XGBoost composite model outperforms single forecasting approaches. There are several modifiable parameters in the XGBoost algorithm, including general, promotion, and learning objective parameters. They adopted the tree model to concentrate on the nonlinear interactions between the Baidu index and the number of tourists. Using the general parameters, the promotion parameters were changed according to the model chosen.

Gradient Boosting Machine

Xenochristou et al. [32] measured the influence of the spatial scale on water demand forecasting. Multiple models were trained on UK daily consumption records for different aggregations of consumption. Three different levels of spatial aggregation were created using the properties’ postcodes. A gradient boosting model was trained on each of the configurations, and water consumption was predicted 1 day into the future. The results implied that the amount of spatial aggregation had a substantial influence on forecasting accuracy and that errors can be minimized by utilizing additional explanatory variables.

AdaBoost

Walker and Jiang [33] observed that forecasts using AdaBoost are more accurate and reliable than those derived via a more traditional logistic regression method. They analysed the importance of each predictor in the AdaBoost model to better understand the relative contribution of each factor to the overall predicted outcome. They observed that AdaBoost models are not as easily interpretable as regression model coefficients.

Artificial Neural Network

Jahangir et al. [34] used a rough artificial neural network (R-ANN) approach to forecast plug-in electric vehicle travel behaviour (PEVs-TB) and PEV load on the distribution network. R-ANNs can increase the accuracy of forecasting findings due to their capacity to analyse phenomena with high uncertainty. Furthermore, two training methods are used in that paper, conventional error back propagation (CEBP) and Levenberg–Marquardt, which are specified using first- and second-order derivatives, respectively. In PEVs-TB, the results demonstrated that the Levenberg–Marquardt approach is more accurate.

It is observed that different authors report varying levels of performance for the different forecasting models, depending on the case. It is therefore very difficult to say which model performs best. In this study, RF, ANN, GB, XGBoost, AdaBoost, and hybrid models are tested and compared with each other for the sales forecasting of a US-based retail company.

The hybrid network has been developed by the separate training of the random forest and the XGBoost model on 67% of the data. Then a new dataset was created by generating the prediction from both the models for all the data points. This new dataset was passed as input to train the linear regression model and it predicted the final sales values. Some of the most common machine learning models used in forecasting are summarized in Table 1.

Table 1 Applications of machine learning models in forecasting

2.2 Measurement of Forecasting Accuracy

Chicco et al. [65] used healthcare information to forecast rates of obesity. R2 was observed to be more accurate and complete than symmetric mean absolute percentage error (SMAPE). The value of R2 becomes high if the analysis accurately predicts the majority of ground truth entities for every ground truth category taking into account their dispersion. The accuracy of various machine learning/deep learning models is compared using MSE, MAE, mean absolute percentage error (MAPE), root mean squared error (RMSE), and R2 metrics [64]. Tsoumakas [3] examined the present state of ML algorithms for forecasting food-purchasing habits. It covers essential design concerns for forecasting food sales, such as the temporal granularity of sales figures, the intake factors to use for predicting sales, and the depiction of the sales evaluation function. It also looks at machine learning algorithms for predicting food sales and important measures like MAE, MSE, RMSE, and MASE, for evaluating forecasting accuracy. Ala’raj et al. [66] used Covid infection data to model and forecast Covid-19 outbreaks. They utilized a modified SEIRD (Susceptible, Exposed, Infectious, Recovered, and Dead) dynamic model and ARIMA model for prediction. The model prediction accuracy was estimated by using 5 metrics: AE, MSE, MLSE (maximum likelihood sequence estimation), normalized MAE, and normalized MSE. Ramos et al. [67] examined the efficiency of phase space models and ARIMA regressors as a tool for predictions of retail sales of five different types of women’s footwear: boots, booties, flats, sandals, and shoes. RMSE, MAE, and MAPE were used to evaluate the ARIMA model.

3 Methodology

In this study, the performances of XGBoost, RF, ANN, gradient boosting, AdaBoost, and the proposed hybrid framework (RF-XGBoost-LR) are compared using several performance metrics, namely, MAE, MSE, and R2 score (coefficient of determination). XGBoost, RF, gradient boosting, and AdaBoost are ensemble techniques built on top of decision trees, while ANN is a deep learning technique. The framework of a decision tree (DT) comprises a root node (topmost node), internal nodes, and leaf nodes (end nodes). Simple principles are used in DT algorithms to branch out of the root node, passing through internal nodes and eventually ending in the leaves [68]. In this work, Python 3.7.12 was utilized. For data handling, Pandas version 1.1.5 and Numpy version 1.19.5 were used. For model training, XGBoost version 0.90 and Scikit-learn library version 1.0.2 were used.
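The root-to-leaf traversal of a decision tree described above can be illustrated with a minimal sketch. The `Node` class, the feature names (`size`, `is_holiday`), the thresholds, and the leaf values are all hypothetical and chosen only for illustration; they are not taken from the models trained in this study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a regression tree; a leaf has `value` set and no children."""
    feature: Optional[str] = None      # feature tested at this node
    threshold: Optional[float] = None  # go left if sample[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[float] = None      # prediction stored in a leaf


def predict(node: Node, sample: dict) -> float:
    """Walk from the root through internal nodes until a leaf is reached."""
    while node.value is None:
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.value


# Hypothetical tree: the root splits on store size, an internal node on holiday week.
tree = Node(
    feature="size", threshold=100_000,
    left=Node(value=8_000.0),
    right=Node(
        feature="is_holiday", threshold=0.5,
        left=Node(value=20_000.0),
        right=Node(value=27_000.0),
    ),
)

small_store = predict(tree, {"size": 50_000, "is_holiday": 0})
large_holiday = predict(tree, {"size": 150_000, "is_holiday": 1})
```

In this toy tree, the small store ends in the left leaf directly below the root, while the large store in a holiday week passes through one internal node before reaching its leaf.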

3.1 Proposed Framework

In this study, a hybrid model of RF-XGBoost-LR is proposed and its performance is compared with other individual models.

Bagging vs Boosting

The primary distinction between bagging and boosting is that the former decreases the variance in prediction by generating additional training data from the dataset, sampling with replacement to produce multiple sets drawn from the original data, while the latter iteratively adjusts the weight of each observation based on the result of the previous model. Unlike bagging, in which each sample is selected uniformly to build a training dataset, boosting selects samples with unequal probability. Misclassified or inaccurately predicted samples carry a higher weight and are therefore more likely to be selected. As a result, each new model can focus on the samples that earlier models handled poorly [69].
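The difference in sample selection can be sketched as follows. This is a simplified illustration, not the exact weight update of AdaBoost or any other specific boosting algorithm: the sample labels and the error values are hypothetical, and the boosting weights are simply taken to be proportional to the previous model's absolute errors.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

samples = ["a", "b", "c", "d", "e"]
abs_errors = [0.1, 0.1, 0.9, 0.1, 0.1]  # hypothetical errors of the previous model

# Bagging: every sample has the same selection probability.
bagging_weights = [1.0 / len(samples)] * len(samples)

# Boosting (simplified): selection probability grows with the previous error.
total_error = sum(abs_errors)
boosting_weights = [e / total_error for e in abs_errors]

bagging_draw = random.choices(samples, weights=bagging_weights, k=1000)
boosting_draw = random.choices(samples, weights=boosting_weights, k=1000)

# Share of draws that picked the poorly predicted sample "c".
share_c_bagging = bagging_draw.count("c") / 1000
share_c_boosting = boosting_draw.count("c") / 1000
```

Under these assumptions, sample "c" is drawn far more often in the boosting-style draw than in the uniform bagging draw, which is how a later model comes to concentrate on it.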

Random Forest

RF is an ensemble technique in which the results of many regression trees are combined to generate a single prediction. The primary premise is bagging, in which a sample of training data is selected at random and used to fit a regression tree [70]. This randomly selected sample is termed a bootstrap sample and it is chosen with replacement, meaning any previously chosen data point can be chosen again. A bootstrap sample is made by drawing N data points at random from a dataset of N points, returning each drawn point to the dataset before the next draw. In every draw, each data point has a 1/N chance of being chosen.
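A bootstrap sample of this kind can be drawn with the Python standard library, as in the minimal sketch below. The dataset is a stand-in list of row indices and the size N = 1000 is an arbitrary assumption. Because the draws are made with replacement, some rows appear several times and others not at all; the rows left out are commonly called out-of-bag rows.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

dataset = list(range(1000))  # stand-in for N training rows
n = len(dataset)

# Draw N rows with replacement: each draw picks any given row with probability 1/N.
bootstrap = random.choices(dataset, k=n)

unique_rows = set(bootstrap)
out_of_bag = set(dataset) - unique_rows

# For large N this fraction is expected to be close to 1 - 1/e (about 0.632).
in_bag_fraction = len(unique_rows) / n
```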

RF is a combination of decision tree estimators \(\left\{h\left(X,{\Theta }_{i}\right), i=1,2,\dots ,k\right\}\), in which each decision tree is built from a random vector \({\Theta }_{i}\), and the random vectors are sampled independently from the same distribution for all the decision trees present in the forest.

Once the training is complete, the result of the entire set of decision trees on sample X′ is averaged to generate predictions as shown in Eq. (1).

$$\widehat{f}=\frac{1}{k}\sum\limits_{i=1}^{k}h\left({X}^{\prime},{\Theta }_{i}\right)$$

where \(\widehat{f}\) is the final prediction and k is the number of decision trees.

XGBoost

XGBoost, short for ‘extreme gradient boosting’, is an improved implementation of gradient boosting. XGBoost enhances performance and is capable of solving problems of real-world scale while using a minimum of resources [71]. It is a parallel tree model built upon the gradient boosting model, and it uses a tree ensemble made up of a series of classification and regression trees (CART). Although XGBoost is similar to gradient boosted decision trees (GBDT), it has several distinguishing characteristics, such as a second-order Taylor expansion of the loss function and built-in regularization [72]. XGBoost models scale effectively to different scenarios while requiring fewer resources than existing prediction models. Within XGBoost, parallel and distributed computation speeds up model learning and allows for more rapid model exploration.

Hybrid (RF-XGBoost-LR) Model

RF makes parallel decision trees which help in reducing the overfitting problem. Accuracy improves as a result of the reduction in variance. In RF, individually separate decision trees are used for each of the multiple copies of original training data. Despite its widespread popularity, random forest suffers from conceptual and practical shortcomings. Random forest adaptive learning is inherently poor in terms of minimizing training error. In particular, each tree is learned autonomously. Complementary information from other trees is not fully realized in this kind of training [73]. It results in a reduction in model performance. XGBoost combines multiple weak learners in a sequential method, which iteratively improves model performance.

XGBoost is a boosting technique. It takes advantage of parallel processing and runs the model on several CPU cores. However, it is affected by the well-known overfitting problem in boosting, which also affects multiple additive regression trees (MART). This problem arises because only a few trees are available early in the iteration process, so those trees have a disproportionately large impact on the model [74]. Overfitting to the training data degrades the model’s generalization capability, resulting in unreliable performance on new observations. Estimators with high variance and low bias are common manifestations of overfitting. The extra complexity may, and frequently does, improve the model’s performance on the training data, but it harms the prediction of future points [75]. Overfitting gives an overly optimistic impression of prediction performance on new data drawn from the underlying population [76].

Hybrid models combine two or more single machine learning or soft computing models to achieve greater flexibility and capability than a single model. A hybrid model often has two components, one for prediction and one for optimizing the prediction, to achieve higher accuracy. There are two key reasons to develop a hybrid model:

  (i) To reduce the risk that a single model produces a poor forecast under some specific conditions.

  (ii) To improve upon the performance of the independent models.

A hybrid model is designed to reap the advantages and overcome the shortcomings of the individual models involved [77]. In this research, a hybrid ML model is proposed that combines a random forest regressor, which is a bagging technique, with an XGBoost regressor, which is a boosting technique.

A hybrid model has been developed to overcome the shortcomings of both the models, i.e. RF and XGBoost, as shown in Fig. 1. The random forest model addresses the overfitting problem inherent to XGBoost as it can decrease the model variance without increasing the model bias. This implies that the overfitting problem may be observed in the forecast of a single regression tree, but it can be eliminated in the average forecast of multiple regression trees. Random forest model is poor in terms of reducing the training error as multiple regression trees are trained autonomously. XGBoost addresses this shortcoming of random forest by sequentially training decision trees.

Fig. 1 Hybrid model flowchart

In the proposed framework, RF and XGBoost models are trained separately and predictions of both the models are used as input into an LR model. The LR model processes the final output.
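The two-stage structure of the framework can be sketched with scikit-learn as follows. This is an illustrative sketch under several assumptions rather than the code used in the study: the data are synthetic, `GradientBoostingRegressor` is used as a stand-in for the XGBoost regressor so that the example depends on scikit-learn only, the number of estimators is reduced to keep the run short, and the LR model is fitted on the base-model predictions for the training rows.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled sales data (not the data used in the study).
rng = np.random.RandomState(0)
X = rng.rand(600, 4)
y = 0.5 * X[:, 0] + 0.3 * np.sin(6 * X[:, 1]) + 0.05 * rng.randn(600)

# 67% of the rows are used to train the two base models.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.67, random_state=0
)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
# Stand-in for XGBoost; xgboost.XGBRegressor could be swapped in here.
gb = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)


def base_predictions(features):
    """Stack the two base-model predictions as columns x1 and x2."""
    return np.column_stack([rf.predict(features), gb.predict(features)])


# The LR meta-model learns the intercept and the two weights of the linear combination.
meta = LinearRegression().fit(base_predictions(X_train), y_train)

hybrid_pred = meta.predict(base_predictions(X_test))
hybrid_r2 = meta.score(base_predictions(X_test), y_test)
```

Fitting the meta-model on in-sample base predictions is the simplest choice; out-of-fold predictions obtained by cross-validation are a common alternative when the base models fit the training rows very closely.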

The LR equation can be defined by Eq. (2):

$$Y=\beta +{\beta }_{1}{x}_{1}+{\beta }_{2}{x}_{2}+\varepsilon$$

where the final prediction is represented by \(Y\), and the predictions from RF and XGBoost are represented by \({x}_{1}\) and \({x}_{2}\). The y-intercept is represented using \(\beta\). The coefficients and error term are represented by \({\beta }_{1}\), \({\beta }_{2}\), and \(\varepsilon\) respectively. The value of \({\beta }_{1}\) is −0.05671126, \({\beta }_{2}\) is 1.05822592, and \(\beta\) is −4.4399934967759985e-05.
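As a worked example of Eq. (2), the sketch below evaluates the linear combination with the coefficient values reported above. The function name and the two base predictions are hypothetical; the error term is omitted because it is not part of the point forecast.

```python
# Fitted values reported in the text for Eq. (2).
BETA_0 = -4.4399934967759985e-05  # intercept (written as beta in the text)
BETA_1 = -0.05671126              # weight on the RF prediction x1
BETA_2 = 1.05822592               # weight on the XGBoost prediction x2


def hybrid_prediction(x1: float, x2: float) -> float:
    """Combine the base predictions with the linear model of Eq. (2)."""
    return BETA_0 + BETA_1 * x1 + BETA_2 * x2


# Hypothetical scaled base predictions for one store-department-week.
rf_pred, xgb_pred = 0.030, 0.032
y_hat = hybrid_prediction(rf_pred, xgb_pred)
```

With these coefficients the combined forecast stays close to the XGBoost prediction, with a small correction from the RF prediction.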

The Python source codes for all the four ML forecasting models and hybrid models are shown in the Appendix.

4 Case Study and Result Discussions

The case company operates as a merchandiser of consumer products. The international segment manages supercentres, supermarkets, hypermarkets, warehouse clubs, and cash-and-carry stores outside the USA. The company was founded in 1945 and is headquartered in Bentonville, Arkansas. Based in the USA and among the largest retailers in the world, the company experiences revenue gains year over year. It operates grocery stores, supermarkets, hypermarkets, department stores, and discount stores in more than 25 countries across the globe, offering commodities at the lowest prices, which is the strategy that defines it. Fuel, gift cards, banking services, and other associated products such as money orders, prepaid cards, and wire transfers are all available through the company.

According to statistics, grocery prices were reduced by an average of 10–15% in markets that the company entered. It has a wide product range, which makes it a tough competitor among other companies in the same segment. Products offered range from electronics and office supplies, movies, music, and books to jewellery, baby products, furniture, and pharmacy items. It is capable of lowering grocery prices by another significant margin during promotional periods. Its strong market power over suppliers and competitors allows it to sell products at the lowest prices and helps it compete in the market.

In this research, the data of a retail company known to keep up with the demands of customers by offering a wide range of products at one stop has been used. The sales data of the company spans different regions in the USA. The data consists of weekly sales for all the 45 stores and 99 departments over 3 years. The data consists of different attributes of the store and geographic-specific information, namely store number, size of the store, department of the store, date mentioning the week, region’s average temperature, fuel price in the region, CPI (consumer price index), unemployment rate, and holiday week.

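Normalization is a pre-processing step that plays a crucial role in machine learning. Normalizing aids in decreasing the learning time when the datasets are too large. Min–Max normalization transforms the original dataset into the desired interval using a linear transformation. This technique has the advantage of preserving all relations between the data points. Min–Max normalization is given by Eq. (3).

$${x}^{\prime}=\frac{x-{x}_{min}}{{x}_{max}-{x}_{min}}$$

where \(x\) is the original value, \({x}_{min}\) and \({x}_{max}\) are the minimum and maximum values of the attribute, and \({x}^{\prime}\) is the normalized value in the interval [0, 1].
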
For proportionate scaling of the data, the Min–Max scale was used at the beginning of the analysis keeping the minimum and maximum values as 0 and 1 respectively.
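The Min–Max scaling step can be sketched in plain Python as follows. The function name and the weekly sales values are hypothetical; in practice a library routine such as scikit-learn's `MinMaxScaler` would typically be applied per attribute.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale `values` to the interval [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map it to the lower bound.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


weekly_sales = [12_000.0, 18_500.0, 15_250.0, 30_000.0]  # hypothetical values
scaled = min_max_scale(weekly_sales)
```

The smallest value maps to 0, the largest to 1, and the ordering and relative spacing of the remaining values are preserved.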

4.1 Performance Parameters

For the comparison of the forecasting models, mean absolute error (MAE), mean squared error (MSE), and R2 value are used as discussed in the following subsections.

Mean Absolute Error

It is an average measure of the errors in a set of predictions. Since it uses absolute values, it ignores the sign of each error, and all individual errors are equally weighted. The calculation of MAE is straightforward, as shown in Eq. (4): the absolute values of the errors are summed to get the ‘total error’, which is then divided by the total number of observations [78].

$$MAE=\frac{1}{n}\sum\limits_{i=1}^{n}\left|{y}_{i}-{\widehat{y}}_{i}\right|$$

where \({y}_{i}\) is the true value, \({\widehat{y}}_{i}\) is the prediction value, and n is the number of observations.

Mean Squared Error

It is also an average measure of the errors in a set of predictions. The errors are squared, summed, and then averaged as shown in Eq. (5). Squaring makes the direction of the error irrelevant and penalizes large errors more heavily than small ones. Since MSE is a convex quadratic function of the predictions, it has a single global minimum.

$$MSE=\frac{1}{n}\sum\limits_{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}$$

where \({y}_{i}\) is the true value, \({\widehat{y}}_{i}\) is the prediction value, and n is the number of observations.

R2 Score

It is also known as the coefficient of determination, which expresses the proportion of variance in the dependent variable explained by a model [79]. The R2 score is used to evaluate the scatter of the data about a fitted regression line. For similar datasets, higher R2 values represent smaller differences between the predicted data and the true data. It measures the relationship between predicted and true data, typically on a scale of 0–1. For example, an R2 value of 0.8 indicates that the variation of the independent variables explains 80% of the variance of the dependent variable being analysed. It is given by Eq. (6).

$${R}^{2}=1-\frac{S{S}_{res}}{S{S}_{total}}=1-\frac{\sum {\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}{\sum {\left({y}_{i}-\mu \right)}^{2}}$$

where \(SS_{res}\) is the sum of squares of residuals, \(SS_{total}\) is the total sum of squares, \({y}_{i}\) is the true value, \({\widehat{y}}_{i}\) is the prediction value, and \(\mu\) is the mean of the true values.
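The three metrics of Eqs. (4)–(6) can be computed directly from their definitions, as in the sketch below. The function names and the four true and predicted values are hypothetical and serve only as a small worked example; library implementations such as those in scikit-learn's `metrics` module would normally be used.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |y_i - y_hat_i|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average of (y_i - y_hat_i)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_total = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_total


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

example_mae = mae(y_true, y_pred)      # (0.5 + 0 + 1 + 0.5) / 4 = 0.5
example_mse = mse(y_true, y_pred)      # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
example_r2 = r2_score(y_true, y_pred)  # 1 - 1.5 / 20 = 0.925
```

In this example the single error of magnitude 1 contributes half of the MAE total but two thirds of the MSE total, which illustrates why MSE is the more outlier-sensitive of the two.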

R2 mainly shows how well the model fits the observed values. It was also necessary to quantify the errors, for which MSE and MAE were used. MSE emphasizes outliers: because the errors are squared, it weighs large errors more heavily than small ones. MAE measures the error in the same unit as the original series and indicates how large an error can be expected from the forecast on average. Using the above performance parameters (MSE, MAE, R2), all the models incorporated in this study are compared as shown in Table 2.

Table 2 Performance of the forecasting models

In the RF model, the number of estimators was set to 175 and the maximum depth to 28. In the gradient boosting model, the number of estimators was set to 125 and the maximum depth to 25. In the AdaBoost model, a decision tree with a depth of 25 was used as the base estimator. The ANN had five layers followed by an output layer, with 10, 12, 24, 12, and 10 neurons in the respective layers. The activation function for each layer was ‘relu’ and the optimizer was ‘adam’. The batch size was 256 and the model was trained for 500 epochs. In the XGBoost model, the number of estimators was set to 150 and the maximum depth to 25. In the RF-XGBoost-LR model, the RF and XGBoost models were trained first and their predictions were used as input to an LR model.

In the hyper-parameter optimization phase, determining the optimal configuration of an ML model is challenging. As a result, sampling random values within the effective range of the relevant parameters may lead to better optimization outcomes. The output from the RF and XGBoost models is passed as input to an LR model. The LR model was used to produce the final predictions because of its simplicity. If a complex model such as gradient boosting were used in the final layer, the hybrid model would be prone to overfitting. The hybrid model thus overcomes the shortcomings of both the RF and XGBoost models.

Wolpert [80] proposed stacking (also known as stacked generalization), an ensemble method that combines well-performing models to exploit their respective capabilities. Stacking uses a single model to combine the predictions from multiple models. Stacked models tend to perform best when a wide range of algorithms is used in the first layer, as different algorithms identify trends and patterns in the training data differently, and merging the models yields a more accurate and reliable output.

Various performance measures were utilized to compare the performance of all the models. Holistically, based on three metrics, the proposed forecasting method is found to outperform the other benchmarking methods with an R2 score of 0.9551, MAE of 0.0024, and MSE of 4.7932e-05.

Figure 2 shows the comparison of the week-wise sale of all the stores and departments against the sales predicted by the hybrid (RF-XGBoost-LR) model. It is observed from Table 2 that the hybrid networks can better forecast the sales as compared to the other models since they can map the trend of the actual sales most accurately.

Fig. 2 Comparison between actual sales and forecasting using the hybrid model

4.2 Academic Implications

The proposed hybrid model can be utilized to enhance supply chain-related studies and to extend research work on demand forecasting. The robust performance of the proposed framework augments its utility. Retailers, wholesalers, and other industries can use it to their benefit. It is, however, essential to have adequate domain knowledge to tailor it for applications in different industries. Products in different sectors may have distinct properties that can be extracted and fed into the forecasting framework to improve its performance. This makes industry-specific customization a potential research topic.

4.3 Managerial Implications

The findings of the study show that the proposed hybrid model improves the forecasting accuracy to a large extent compared to the individual machine learning models. The random forest and XGBoost models jointly overcome the problems of overfitting and training error, and the linear regression model combines their predictions, so the forecast values are very close to the actual sales values. Thus, the proposed model helps industry decision-makers forecast more accurately, which leads to the formulation of a better marketing strategy, increased stock turnover, optimized capacity building, lower supply chain costs, and improved customer satisfaction. An accurate demand forecasting method can improve supply chain performance by mitigating the bullwhip effect and supporting proper inventory management.

5 Conclusion

In this study, a hybrid ML model combining XGBoost, RF, and LR has been proposed for real-time analysis of sales data. The model is trained on the sales data of the retail company, which includes various attributes, with the aim of providing a more advanced and accurate forecasting model. The hybrid model is proposed to address the shortcomings of both the RF and XGBoost models. At first, the dataset was normalized, and the RF and XGBoost models were then trained and tested separately. The predictions from these models were assembled into a new dataset, which was used as input to the LR model to generate the final predictions.

It is observed that combining the XGBoost model with the RF model improves accuracy due to reduced variance and enhanced robustness to outliers, which results in improved predictive ability and less vulnerability to overfitting. Three metrics were used in this study: MAE, MSE, and R2 score. The results suggest that the proposed hybrid model RF-XGBoost-LR (MAE = 0.0024, MSE = 4.7932e-05, and R2 score = 0.9551) has better performance than the other models, namely RF, ANN, gradient boosting, AdaBoost, and XGBoost. The R2 score indicates that the model explains 95.51% of the variance in the sales data.

Precise demand forecasting in an integrated commercial planning environment can be utilized to optimize capacity building, labour scheduling, inventory, supply chain management, etc. In the proposed hybrid model, random forest helps to overcome the overfitting problem of XGBoost, while XGBoost is used to reduce the error by training the decision trees sequentially. Forecasting using the proposed model may improve stock availability and enhance stock allocation.

Despite these implications, the proposed hybrid model has some limitations, such as the requirement of a large training dataset, decision integration, etc. As the size of training datasets expands, machine learning algorithms become more effective.