
1 Introduction

Markets are places where customers meet their needs, so it is essential that a market supplies what its customers demand. If a market cannot offer customers the products they need, the result may be customer dissatisfaction and customer loss [1]. A store's reputation in the eyes of its customers also matters: customers want to buy the products they need at affordable prices. Inventory management plays an important role both in customers' access to products and in keeping prices affordable compared to competing markets [2]. Effective stock management therefore benefits both the market and the customer [3].

Inventory management ensures that the products a market offers are supplied from its suppliers and that customers can find them on the shelves at prices more affordable than those of the market's competitors. Two criteria follow from this: the customer must have continuous access to the product, and the price must remain affordable [4]. When both criteria are met, customer satisfaction increases the market's profitability.

To manage stock, it is necessary to predict how much of each product will be sold [5]. If sales are underestimated and customers cannot find the product, the market loses both customers and reputation. If sales are overestimated, the surplus must be stored, which raises warehouse costs and therefore the cost of the product, making it harder to offer affordable prices. Worse, if surplus stock passes its expiry date, the products can no longer be sold at all and the cost increases further [6]; in addition, removing expired products from the warehouse or the shelf creates extra costs for the market.

In extraordinary times such as a pandemic, it becomes even more important that the market can supply the demand arising from people's needs [7]. Such abnormal conditions cause abnormal sales patterns. Especially during the pandemic period, together with the problems experienced in logistics, stock management, and therefore sales forecasting, became vital for presenting the required products to customers [8].

Sales forecasting allows sales to continue before stocks run out and provides real-time forecasts suitable for all situations [9]. Many techniques are used for stock management, including statistics, mathematical models, machine learning, and deep learning [10]. Instead of a rule-based model [8], we decided to examine this problem with a machine learning technique that can extract the change in sales from the data. We first performed general analyses to understand our data. In the light of these analyses, we performed feature engineering to find features that would improve the sales predictions. We then developed a model using our data, the newly engineered features, and the XGBoost machine learning model [11], which is frequently used for prediction tasks. In our study, adding the pandemic feature improved the sales forecasts by approximately 25%.

2 Dataset

Our dataset contains the sales records of a 5-liter sunflower-oil product across all branches of a chain market with 10 branches located in various parts of Turkey, between 01-05-2019 and 01-05-2021. The dataset contains 1613 daily records. The selected period includes the COVID-19 pandemic so that the pandemic effect can be examined. With the COVID-19 pandemic came unusual curfews and serious logistics difficulties [12]. Limits were imposed on the markets' working hours and on the number of customers allowed inside; at the same time, certain age groups were restricted from going out at all. Such restrictions on both markets and customers led to significant changes in sales [13]. The features included in our data are given in Table 1.

Table 1. Data information.

Using the date information, the period before March 2020, when the pandemic effect first appeared, was labeled as the non-pandemic period, and the period after it as the pandemic period.

3 Methodology

3.1 Problem Identification

Figure 1 shows our methodology. The methods actively used to forecast product sales in non-pandemic times lost their validity due to the changes and limitations in shopping habits during the pandemic. A machine learning model trained on normal data makes worse predictions under pandemic conditions than before, because the data distribution has changed; such a model therefore has a high margin of error during the pandemic period. This study aims to prevent that.

3.2 Data Preparation

After our data was obtained from the database, it was pre-processed as follows to eliminate errors and deficiencies. Records with missing fields were identified and removed, which prevents the model from making mistakes and learning biased patterns during training. Features in the dataset were then type-cast so that additional features could be extracted; for example, the date column was transformed from string to datetime type.
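A minimal sketch of this preprocessing step with pandas; the column names (`date`, `sales`) are hypothetical, since the paper does not list the schema:

```python
import pandas as pd

# Hypothetical column names; the actual schema is given in Table 1.
df = pd.DataFrame({
    "date": ["2019-05-01", "2019-05-02", None, "2019-05-04"],
    "sales": [120.0, None, 95.0, 110.0],
})

# Cast the date column from string to datetime so that calendar
# features can be extracted from it later.
df["date"] = pd.to_datetime(df["date"])

# Drop records with missing fields so the model is not trained on
# incomplete rows.
df = df.dropna().reset_index(drop=True)

print(df.dtypes["date"])  # datetime64[ns]
print(len(df))            # 2 complete records remain
```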

Fig. 1. Flowchart for our processes in sales forecasting

3.3 Exploratory Data Analysis

Exploratory data analysis lets us summarize our data, discover its features, understand the relationships between features, and detect abnormal records by applying statistical methods and data visualization. These analyses can also guide the feature extraction process.

General information can be obtained by first examining the data statistically. Table 2, which summarizes the data statistically, shows that the sales figures generally decreased during the pandemic period. For the product's sales we computed the number of records, the mean, the standard deviation, the minimum and maximum, and the 25%, 50%, and 75% quantiles, using the describe() function of the pandas library. As the table shows, average sales dropped by approximately 30%; the other statistics changed at a similar level, so sales generally decreased as expected during the pandemic period (Fig. 2).
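This per-period summary can be reproduced along the following lines with pandas' `describe()`; the data here is synthetic, and only the real figures reported in Table 2 are authoritative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the daily sales records; the real
# statistics are those reported in Table 2.
df = pd.DataFrame({"date": pd.date_range("2019-05-01", "2021-05-01", freq="D")})
df["pandemic"] = df["date"] >= "2020-03-01"
df["sales"] = np.where(df["pandemic"],
                       rng.normal(70, 10, len(df)),   # pandemic: lower mean
                       rng.normal(100, 10, len(df)))  # non-pandemic period

# count, mean, std, min, 25%/50%/75% quantiles and max per period
summary = df.groupby("pandemic")["sales"].describe()
print(summary)
```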

Table 2. Data information.
Fig. 2. Distribution of data

To see how our sales data changes with and without the pandemic on the basis of days of the week, we produced Figs. 3, 4, and 5 using the seaborn library, with bar plots and line plots. Percentage changes are annotated on the bar charts.
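A sketch of how such a bar chart could be produced with seaborn; the sales values are synthetic placeholders and the column names are assumptions:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
df = pd.DataFrame({"date": pd.date_range("2019-05-01", "2021-05-01", freq="D")})
df["period"] = np.where(df["date"] >= "2020-03-01", "pandemic", "non-pandemic")
df["weekday"] = df["date"].dt.day_name()
df["sales"] = rng.normal(100, 15, len(df))  # placeholder sales figures

# Bar plot of average sales per weekday, split by period
# (sns.barplot aggregates with the mean by default).
ax = sns.barplot(data=df, x="weekday", y="sales", hue="period")
ax.set_ylabel("average sales")
plt.savefig("weekday_sales.png")
```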

The bar chart in Fig. 3 shows the sales in the pandemic period (red) and in the non-pandemic period (green) by day of the week, with the percentage changes given on the bars. The chart reflects the change in sales caused by the expected weekend closures: in the normal period, weekend sales are high, while in the pandemic period, weekend sales are lower because of the weekend curfews.

Fig. 3. Average sales by weekdays during the pandemic and non-pandemic period.

To understand the difference between weekday and weekend sales, we plotted the average weekday and weekend sales for the pandemic and non-pandemic periods as a bar chart, in the same style as Fig. 3 (Fig. 4). These charts show that the sales pattern changed during the pandemic, so the model used in normal times must be updated: to predict sales during the pandemic period, the model should be extended with new features.

Fig. 4. Average weekday and weekend sales in the pandemic and non-pandemic period.

As seen in the line graph in Fig. 5, sales follow a decreasing trend on weekends.

Fig. 5. Average sales per weekday in the pandemic and non-pandemic period

3.4 Feature Extraction

As the graphs in the analysis section show, the pattern and character of our data changed with the shift in shopping habits during the pandemic. Because of this change, a model trained only on pre-pandemic data loses consistency when predicting pandemic-period data. To help the model learn both the changed pandemic data and the non-pandemic data, we add new features informed by the analysis above; with them, we can better predict sales during the pandemic period and manage stock by placing orders accordingly. We use the pandas library's built-in functions on the datetime type to extract these features.

From the time feature we extracted the year, the month, the week of the year, the day of the year, and the day of the week, so that the model can relate each sale to when and under which conditions it was made. With these features, we expect the model to connect sales values to time and to learn better from sales made at similar times. We treat these temporal features as categorical, and we want the model to treat them that way too: since records sharing the same time feature tend to have similar sales values, this approach helps the model learn better.

We also extract a Boolean feature from the time feature indicating whether the day is a weekday or a weekend. Weekday and weekend sales differ in the retail industry; retail experts report that people shop more on non-working days. We add this feature to express that pattern in our data.

As observed in the graphs, the character of sales changes during the pandemic period. We therefore add a categorical feature that marks each record as pandemic or non-pandemic.
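The feature extraction described above could be sketched with pandas' datetime accessors as follows; the column names are illustrative, and the March 2020 cutoff follows Sect. 2:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(
    ["2020-02-28", "2020-02-29", "2020-03-02", "2020-03-07"])})

# Calendar features extracted with pandas' dt accessors.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
df["day_of_year"] = df["date"].dt.dayofyear
df["day_of_week"] = df["date"].dt.dayofweek

# Boolean weekend flag (Monday = 0, ..., Saturday = 5, Sunday = 6).
df["is_weekend"] = df["day_of_week"] >= 5

# Pandemic flag: March 2020 onwards, as described in Sect. 2.
df["is_pandemic"] = df["date"] >= "2020-03-01"

print(df[["date", "is_weekend", "is_pandemic"]])
```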

3.5 Dataset Separation

We split our data into a training set used to fit the models and a test set used to measure the trained model's predictive performance. Because our data is a time series, we determine the training and test parts by shifting: to capture the change over time, the segment immediately after the training data is used for testing. Figure 6 illustrates this approach. Each shift is one step; at each step the model is trained on the training data and evaluation metrics are computed on the test data, and the scores obtained over all steps are then averaged.

The time series cross-validation method uses the records before a certain time for training and those after it for testing, so we can train on time series data and measure the model's performance objectively. We used the TimeSeriesSplit algorithm of the sklearn library:

tscv = TimeSeriesSplit(n_splits=3, test_size=2)

Example: size of dataset = 1613.

1. Fold 1: train size = 1607, test size = 2
2. Fold 2: train size = 1609, test size = 2
3. Fold 3: train size = 1611, test size = 2
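The split above can be reproduced with sklearn's `TimeSeriesSplit`; the fold sizes match the example:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(1613).reshape(-1, 1)  # one row per daily record

tscv = TimeSeriesSplit(n_splits=3, test_size=2)
train_sizes = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    train_sizes.append(len(train_idx))
    # Each 2-day test window starts right after its training window.
    print(f"Fold {fold}: train size = {len(train_idx)}, "
          f"test size = {len(test_idx)}")
```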

Fig. 6. Time series cross validation method.

3.6 Regression and Machine Learning Model

Regression is a statistical method that tries to determine the strength of the relationship between a dependent variable and one or more independent variables, and makes predictions based on that relationship. In machine learning it is used as a predictive modeling method in which an algorithm predicts the dependent variable.

Solving regression problems is one of the most common applications of machine learning, especially supervised machine learning. Algorithms are trained to learn the relationship between the independent variables and an outcome (dependent) variable. The trained model can then predict the outcome for new, unseen input data or fill gaps in missing data.

Machine learning is a sub-branch of artificial intelligence. Its algorithms learn from data and allow us to make predictions after this learning. A supervised model takes input and output data as training data; the algorithm learns a function between them that captures the statistical pattern in the data, and predictions are then made with the learned model.

Here, we decided to use the tree-based gradient boosting [14] models XGBoost and LGBMRegressor, which are among the most popular machine learning algorithms, and the linear Ridge model. This choice took into account both the models' general success in machine learning and the number of records in our data; because our data is scarce, deep learning methods were not preferred.

XGBoost and LGBMRegressor are decision-tree-based machine learning algorithms that use gradient boosting. XGBoost adds several improvements over plain GBM, such as regularization and pruning to prevent overfitting, and parallelization. Thanks to its parallel execution it runs faster than many other algorithms, and it is used in many projects and competitions. Since the XGBoost library is an open-source project, it is developed and supported by many users.

XGBoost, short for Extreme Gradient Boosting, is a scalable, distributed gradient-boosted decision tree (GBDT) machine learning library. It provides parallel tree boosting and is a leading library for regression, classification, and ranking problems. As a supervised method, XGBoost trains a model to find patterns in a dataset of labels and features, and then uses the trained model to predict the labels from the features of a new dataset.

LGBMRegressor differs from XGBoost as follows. It grows trees leaf-wise, choosing the leaf it expects to yield the greatest reduction in loss. Also, LightGBM does not use the widely used pre-sorted decision tree learning algorithm, which searches for the best split point in the sorted feature values, as XGBoost and other implementations do. Instead, LightGBM implements a highly optimized histogram-based decision tree learning algorithm, which brings large advantages in both efficiency and memory consumption. LightGBM also uses two techniques called Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which let the algorithm run faster while maintaining a high level of accuracy.

Ridge regression is a method of estimating the coefficients of multiple regression models in scenarios where the independent variables are highly correlated. It has been used in many fields, including econometrics, chemistry, and engineering.
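A small illustration of ridge regression in the correlated-predictor setting it is designed for, on synthetic data (the variable names and values are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
# Two almost identical predictors: the collinear setting that
# destabilizes ordinary least squares.
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=200)

# The L2 penalty (alpha) shrinks the coefficients and spreads the
# weight across the correlated features instead of letting them
# explode with opposite signs.
model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_)  # two positive coefficients summing to roughly 3
```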

3.7 Performance Evaluation

Metrics let us evaluate the model's predictive performance after training: they measure how well, or how wrongly, a model trained on the training data predicts the test data. We use the MAPE [15] and RMSE [16] metrics, both of which are frequently used in regression tasks.

The problem we are trying to solve is a regression problem, and MAPE and RMSE are among the most preferred metrics for regression. Using two different metrics at the same time lets us look at our errors from two different angles and be more objective.

MAPE, the mean absolute percentage error, also known as the mean absolute percentage deviation (MAPD), is a statistical measure of the error of a forecasting method. MAPE is defined in Eq. 1.

$$ {\text{MAPE}} = \frac{1}{n}\sum\nolimits_{i = 1}^n {\left| {\frac{{y_{true} - y_{pred} }}{{y_{true} }}} \right| \times 100} $$
(1)

Here, $y_{true}$, $y_{pred}$, and $n$ denote the actual value, the predicted value, and the number of tested cases in the dataset, respectively. MAPE expresses the error as a percentage; computing the error relative to the actual value reflects it more objectively when the scale of the values changes.

RMSE, the root mean square error, also called the root mean square deviation (RMSD), is a commonly used measure of the difference between the values predicted by a model or estimator and the observed values. RMSE is defined in Eq. 2.

$$ {\text{RMSE}} = \sqrt {\frac{1}{n}\sum\nolimits_{i = 1}^n {(y_{true} - y_{pred} )^2 } } $$
(2)

The symbols in the RMSE formula have the same meaning as in MAPE. The performance of the extracted feature and of the trained models under these two metrics is given in Table 3 and Fig. 6.
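Both metrics can be implemented directly from Eqs. 1 and 2, for example with NumPy (the sample values are illustrative):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, as in Eq. (1)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def rmse(y_true, y_pred):
    """Root mean squared error, as in Eq. (2)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = [100, 80, 120, 90]
y_pred = [110, 76, 120, 81]

print(mape(y_true, y_pred))  # → 6.25
print(rmse(y_true, y_pred))  # → ~7.02
```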

4 Findings and Interpretation

We train our models comparatively to observe the effect of the added features on the results. Two datasets were created: one containing the pandemic-related features and one without them. We trained XGBRegressor, LGBMRegressor, and Ridge machine learning models on these datasets and evaluated them with the performance metrics computed on their test-set predictions.

The results obtained on the test data after training are given in Table 3 and Fig. 6. According to these results, our model achieves a 25% improvement in the one-step-ahead sales prediction (Table 4).

Table 3. Data information.
Table 4. Data information.

Figure 7 below shows our model's predictions on the test data. As can be seen, the model generally captures the sales trend, but its one-step-ahead predictions are sometimes below or above the actual sales (Fig. 8).

Fig. 7. Model predictions with pandemic feature (blue: model prediction, red: actual test data)

Fig. 8. Model predictions without pandemic feature

5 Conclusion

If market chains cannot manage their stock well, they face the risk of losing public reputation, of storage costs due to excess stock, of product spoilage, or of failing to meet incoming demand due to insufficient supply. A typical market chain sells around 10,000 products, and the stock management of such a large portfolio requires serious resources. Here, the XGBoost, LGBMRegressor, and Ridge machine learning models were preferred as methods that let us manage stock efficiently with algorithms built from today's developing techniques and the data obtained, update our models despite changing conditions, and train quickly even with so many product types.

As a result of our work, we developed a new feature usable for the pandemic period, and by adding it to the already existing features we improved the consistency of our predictions.

In the future, when the feature extraction work on our data reaches the desired point and the amount of data grows, we are considering using deep learning methods for feature extraction and modeling, since the amount of data is of great importance for deep learning.

With this model, the sales of the products in a market's portfolio can be forecast and adequate stock management realized; in this way, the stock management of the entire market can be fully automated with the proposed machine learning model.

Production and consumption, the basis of the economy, are directly related to stock management. This study serves as an example in the literature, providing information and guidance for companies working on stock management.