The long-term stability of sales depends heavily on consumer behavior and needs, not merely on having great products. In particular, during the COVID-19 pandemic, consumer behavior has been changing from day to day. Predicting consumer behavior is therefore of crucial importance for future business planning. An e-commerce company that can anticipate consumers’ shopping behavior gains several advantages, such as higher purchase rates, increased sales and consumer satisfaction, and greater competitiveness (Souri et al. 2017).
In this paper, two approaches are taken to predict consumer behavior: a statistical approach and a machine learning approach. In the statistical approach, the correlations between different features are calculated and analyzed. In the machine learning approach, a predictive model is proposed to anticipate consumer behavior in online shopping. The research process followed in this paper is illustrated in Fig. 1. The methods are examined on data collected from online purchases at the DigiKala site (www.digikala.com), one of the largest and most successful online shopping sites in the Middle East. Well-known machine learning classification methods are compared, and the one with the highest accuracy is identified.
In the rest of this section, data preprocessing is presented first. Then, correlation analysis is explained briefly. After that, the proposed machine learning method for predicting consumer behavior is presented. At the end of the section, the evaluation criteria employed to assess the proposed method are provided.
Preprocessing
In the first step, the data are preprocessed to reduce the running time and improve the results. For this purpose, the attributes are rescaled with min-max normalization, whose general formula is:
$$ {{newValue}} = \frac{{{{originalValue}} - {{oldMin}}}}{{{{oldMax}} - {{oldMin}}}}\left( {{{newMax}} - {{newMin}}} \right) + {{newMin}} $$
(1)
In this paper, the target range for normalization is [0, 1]; therefore, Eq. (1) reduces to the following equation:
$$ {{newValue}} = \frac{{{{originalValue}} - {{oldMin}}}}{{{{oldMax}} - {{oldMin}}}} $$
(2)
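The normalization in Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the function name `min_max_normalize`, its parameters, and the decision to raise an error for a constant attribute (where oldMax equals oldMin and the formula is undefined) are assumptions made here.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale one attribute to [new_min, new_max] following Eq. (1).

    With the default range [0, 1] this reduces to Eq. (2).
    The name and signature are illustrative assumptions.
    """
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # Assumption: a constant attribute cannot be rescaled (division by zero).
        raise ValueError("attribute is constant; min-max normalization is undefined")
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


# Hypothetical attribute values, not taken from the dataset.
print(min_max_normalize([10, 20, 30]))
```

Each attribute would be normalized independently, so that the minimum and maximum are taken per column rather than over the whole dataset.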
Correlation Analysis
The data employed in this paper consist of 11 features, where the 11th feature, “Effective”, is the impact of COVID-19 on consumer purchase volume (the structure of the dataset is explained in detail in Sect. 4.1). After normalizing the data, we calculated the correlation between the features to find those that correlate most strongly with the “Effective” feature. The Pearson correlation is calculated from the following equation:
$$ \rho_{X,Y} = {{corr}}\left( {X,Y} \right) = \frac{{{{cov}}\left( {X,Y} \right)}}{{\sigma_{X} \sigma_{Y} }} = \frac{{E\left[ {\left( {X - \mu_{X} } \right)\left( {Y - \mu_{Y} } \right)} \right]}}{{\sigma_{X} \sigma_{Y} }} $$
(3)
where \(\rho_{X,Y}\) is the Pearson correlation between the two variables X and Y, \(\mu_{X}\) and \(\mu_{Y}\) are the expected values of X and Y, and \(\sigma_{X}\) and \(\sigma_{Y}\) are their standard deviations. The standard deviations must be finite and positive for the Pearson correlation to be defined. E is the expectation operator, cov is the covariance operator, and corr is the correlation coefficient.
The feature “Effective” takes two values: ‘yes’ and ‘no’. The value ‘yes’ indicates that COVID-19 influenced consumers’ purchase behavior, and the value ‘no’ indicates that it did not. The correlation results are provided in Sect. 4.2.
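As an illustration of Eq. (3), the sketch below computes the sample Pearson correlation between a numeric feature and the “Effective” feature. The text does not state how ‘yes’ and ‘no’ are encoded numerically; mapping them to 1 and 0 is an assumption made here, and the function names `pearson` and `encode_effective` are hypothetical.

```python
import math


def encode_effective(labels):
    """Map 'yes'/'no' to 1.0/0.0 (an assumed encoding, not stated in the text)."""
    return [1.0 if label == "yes" else 0.0 for label in labels]


def pearson(x, y):
    """Sample estimate of Eq. (3): cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sd_x == 0 or sd_y == 0:
        # Eq. (3) requires positive standard deviations.
        raise ValueError("standard deviation is zero; correlation is undefined")
    return cov / (sd_x * sd_y)


# Hypothetical normalized feature values and labels, for illustration only.
feature = [0.1, 0.4, 0.5, 0.9]
effective = encode_effective(["no", "no", "yes", "yes"])
print(pearson(feature, effective))
```

The explicit check on the standard deviations mirrors the validity condition stated above: a feature that is constant across all records has no defined correlation with “Effective”.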
Classification
Classification methods are commonly used to build both descriptive and predictive models. In this study, we examined five classification methods, namely support vector machine (SVM), decision tree (DT), Sequential Minimal Optimization (SMO), artificial neural network (ANN), and Naïve Bayes (NB). The ensemble meta-algorithms Boosting and Bagging are also examined with these classifiers to improve the accuracy of the proposed method. Ensemble methods combine a number of classifiers to achieve better predictive performance than any of the individual classifiers. Boosting and Bagging turn weak classifiers into strong ones by reducing the bias and variance of a single classifier, which improves the accuracy of prediction.
Bagging estimates a statistical quantity, such as an average, from multiple random samples of the data. It is beneficial when only a small amount of data is available and a more robust estimate of a statistical measure is needed. Boosting works like Bagging but uses a weighted average. In the training stage, both methods generate N training datasets from the original dataset through random sampling with replacement. In Boosting, the samples are weighted, so some samples have a greater chance of influencing the classification results, whereas in Bagging all samples have the same chance of participating in the training process. Bagging improves the accuracy of weak classifiers by training the learners in parallel. Boosting, on the other hand, trains weak classifiers sequentially, with each classifier trying to improve on the results of its predecessor. To predict consumer behavior, the individual classifiers and their ensembles with Bagging and Boosting are examined in this paper, and the results are provided in Sect. 4.2.
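The Bagging procedure described above (N training sets drawn by random sampling with replacement, one weak learner per set, and a combined vote) can be sketched as follows. This is a toy illustration under stated assumptions, not the configuration used in the experiments: the weak learner is a hypothetical one-feature threshold rule, the predictions are combined by unweighted majority vote, and all function names are introduced here. A Boosting variant would instead train the learners one after another and reweight the samples between rounds.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) records uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Weak learner (assumed for illustration): predict 1 if x > t, else 0.

    Returns the threshold t with the fewest errors on the sample.
    """
    xs = [x for x, _ in sample]
    # Candidate thresholds: every observed value, plus one below the minimum
    # so that the rule can also predict 1 for every record.
    candidates = sorted(set(xs)) + [min(xs) - 1.0]

    def errors(t):
        return sum((1 if x > t else 0) != y for x, y in sample)

    return min(candidates, key=errors)


def bagging_fit(data, n_models, seed=0):
    """Train n_models stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(thresholds, x):
    """Combine the stumps by unweighted majority vote."""
    votes = Counter(1 if x > t else 0 for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical (feature, label) records, for illustration only.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
models = bagging_fit(data, n_models=25)
print(bagging_predict(models, 0.95))
```

Because each stump is trained independently of the others, the loop in `bagging_fit` could run in parallel, which is the property that distinguishes Bagging from the sequential training of Boosting.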
Evaluation Criteria
The results of applying the model to the dataset are evaluated with four criteria, namely accuracy, precision, recall, and F-Measure, all calculated from the values of the confusion matrix. A confusion matrix for a typical two-class classification problem is presented in Table 1.
Table 1 A typical confusion matrix for a binary classification problem
In the equations below, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Accuracy is the proportion of instances that are classified correctly and is calculated from Eq. (4):
$$ {{AC}} = \frac{{{{TN}} + {{TP}}}}{{{{TN}} + {{FN}} + {{TP}} + {{FP}}}} $$
(4)
Precision and recall refer to the quality and quantity of the results, respectively. In other words, precision is the fraction of the retrieved results that are relevant, and recall is the fraction of all relevant results that are retrieved. Precision and recall are calculated from Eqs. (5) and (6), respectively.
$$ {{Precision}} = \frac{{{{TP}}}}{{{{TP}} + {{FP}}}} $$
(5)
$$ {{Recall}} = \frac{{{{TP}}}}{{{{TP}} + {{FN}}}} $$
(6)
F-Measure combines precision and recall into a single score and is calculated as follows:
$$ {{F{\text{-}}Measure}} = \frac{{2 \times {{Precision}} \times {{Recall}}}}{{{{Precision}} + {{Recall}}}} $$
(7)
F-Measure is the harmonic mean of the precision and recall of the developed model. Its maximum value is 1, which corresponds to perfect precision and recall. Its minimum value is 0, which occurs when precision or recall is 0.
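Eqs. (4) to (7) can be computed directly from the four confusion matrix counts, as in the minimal sketch below. The function name `classification_metrics` is hypothetical, and returning 0 when a denominator is zero is a convention assumed here rather than one stated in the text.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-Measure from Eqs. (4)-(7).

    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts, not results from the paper.
print(classification_metrics(tp=40, fp=10, fn=20, tn=30))
```

With these hypothetical counts, precision (40/50) is higher than recall (40/60), and the F-Measure falls between the two, closer to the smaller value, as expected of a harmonic mean.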