Towards early purchase intention prediction in online session based retailing systems

Purchase prediction has an important role for decision-makers in e-commerce to improve consumer experience, provide personalised recommendations and increase revenue. Many works investigated purchase prediction for session logs by analysing users’ behaviour to predict purchase intention after a session has ended. In most cases, e-shoppers prefer to be anonymous while browsing the websites and after a session has ended, identifying users and offering discounts can be challenging. Therefore, after a session ends, predicting purchase intention may not be useful for the e-commerce strategists. In this work, we propose and develop an early purchase prediction framework using advanced machine learning models to investigate how early purchase intention in an ongoing session can be predicted. Since users could be anonymous, this could help to give real-time offers and discounts before the session ends. We use dynamically created session features after each interaction in a session, and propose a utility scoring method to evaluate how early machine learning models can predict the probability of purchase intention. The proposed framework is validated with a real-world dataset. Computational experiments show machine learning models can identify purchase intention early with good performance in terms of Area Under Curve (AUC) score which shows success rate of machine learning models on early purchase prediction.


Introduction
The market share of online/e-commerce sales has been rapidly increasing during the last three decades. In 2018, for the first time, the amount of total online sales exceeded in-store sales in the USA (Mu et al. 2020). Moreover, Google and Facebook generated 116.3 and 55.8 billion US dollars, respectively, from online advertising only (Corrigan et al. 2018). Unlike in-store sales, digital marketing and online sales generate big and valuable data about products and consumers' intention and behaviour (where consumers are coming from, what devices they are using, what items they buy or view and for how long, how shoppers respond to digital marketing ads and emails, and much more), which was not available to businesses before (Leeflang et al. 2014).
The e-commerce industry is moving rapidly towards targeted personalised adverts. Recent studies show that offering all potential consumers generic item recommendations is an ineffective strategy (Behera et al. 2020; de Pechpeyrou 2009; Stewart-Knox et al. 2016). One of the main issues in online sales is consumer conversion; the number of online sessions (i.e. user visits to a website) that end with a purchase is negligible when compared to the total number of sessions/visits (Liu et al. 2019; Zhou et al. 2019; Behera et al. 2020). A substantial number of these abandoned sessions are due to a lack of purchase intention from the consumers, which means that there is almost no chance for conversion, rendering marketing strategies ineffective. Conversely, a considerable number of abandoned sessions come from users exhibiting strong purchase intention. The lack of purchase, in this case, could be due to reasons such as a lack of offers or an inability to correctly interpret user behaviours. For consumers with strong purchase intention, personalised marketing strategies such as targeted discounts, personalised recommendation, targeted adverts and follow-up emails could be very effective. Moreover, in addition to the possibility of increasing the conversion rate, correctly identifying and targeting consumers with strong purchase intention could lead to an increase in sales. There has been growing interest in developing different purchase prediction models, not just online but also in physical environments (standard stores). Kim et al. (2020) developed a framework for real-time purchase behaviour prediction from the users' (shoppers') physical movement in a store environment. They used camera sensors and object detection algorithms to recognise purchase action. However, deploying such systems is very expensive and challenging. In contrast, online purchase prediction models are effective, easily deployed and integrated with the system (Mokryn et al. 2019).
Studies have proposed methods for purchase prediction in the literature in the last few years (Rust et al. 2011; Esmailian and Jalili 2015; Lo et al. 2016; Brodén et al. 2018; Mokryn et al. 2019; Martínez et al. 2020; Esmeli et al. 2020). However, most of these methods are offline and try to predict purchase from completed sessions (after the shopper has left the website) in order to define a follow-up action. This makes such methods ineffective for early purchase prediction while a session is ongoing.
To the best of our knowledge, there is a lack of studies in the literature that propose methods for early purchase prediction. This lack could limit the application of personalised marketing strategies, which in turn could result in the loss of potential sales from users with high purchase intention. In this work, we close this gap by developing a novel framework for early purchase intention prediction (EPP). We design an EPP framework that could enable content personalisation and the provision of real-time offers. The proposed EPP framework aids in the early prediction of purchase intentions by analysing characteristics of consumers' online shopping behaviours on an e-commerce website and extracting hidden features in real time. These features could facilitate the provision of smart personalised marketing strategies that could boost sales and improve consumer experience and retention. The work is motivated by the following questions.
1. Given session data after a user's first interaction, how helpful can machine learning (ML) models be in predicting the likelihood of a purchase in an ongoing session?
2. What is the most critical session feature for early purchase prediction and how can it be identified?
3. How can ML models be evaluated to measure their performance on early purchase prediction?
In order to understand and answer these questions, we propose and develop an Early Purchase Prediction (EPP) framework; we define 'early' in this framework as a substantial purchase indication in a session between the first item interaction and the last purchase action. We develop an early purchase prediction utility score with which we aim to evaluate the performance of the ML models on early purchase prediction. This work can be seen as an extension of Mokryn et al. (2019) and Martínez et al. (2020), where the effect of temporal features and product trendiness on purchase prediction after sessions end is analysed, and Lo et al. (2016), which investigated how a registered user's actions can build up purchase intention in the long term. However, our work mainly focuses on real-time user purchase intention detection.
Our main contributions are summarised as follows:
1. We design a framework to predict users' purchase intention early that could enable content personalisation and the provision of real-time offers, and improve consumer retention.
2. We propose a utility scoring method in order to measure how well ML models can predict purchase intention before the purchase happens.
3. The developed framework proposes a method to extract dynamic features from session logs.
4. A set of computational experiments that compares the ML models' performance on early purchase prediction is presented.
5. A detailed evaluation of the proposed EPP framework on a real-world e-commerce dataset is presented.
This paper is organised as follows. Section "Related works" provides an overview of previous works in purchase prediction. Section "Dataset description" describes the dataset used in this work. Section "Early Purchase Prediction (EPP) framework" introduces the proposed EPP framework. Section "Experiments and results" presents the results of the experiments. Section "Discussion" discusses the results and the theoretical and practical implications of this research. Finally, Section "Conclusion and future work" provides a conclusion of this work. In addition, we provide key terminologies used in this work in the Appendix.

Related works
This section gives a description of session logs, related works done in purchase prediction and an overview of ML models used in purchase prediction.

Session logs
A session is described as a certain time duration during which the user has been browsing the website. The session interval time depends on the company's policy. Session logs have been categorised as web usage logs (Zhuang et al. 2005), which could be utilised for analysing user behaviour for purchase prediction or product recommendations. Initially, these logs should be pre-processed into meaningful structured data, such as session identification (Shu-Yue et al. 2011). Generally, session logs for e-commerce data contain details of browsed product IDs, products added to the cart or products purchased, and the timestamp (Zeng et al. 2019). Since purchase prediction depends on features extracted from session logs, capturing relevant attributes from session logs is important in order to improve the accuracy of the prediction models.

Purchase prediction
Purchase prediction has been studied in several works in the literature. Recent studies have developed frameworks to investigate purchase prediction using users' previous session features and users' physical movement in a market environment in real time (Martínez et al. 2020; Kim et al. 2020). Experiment results showed that embedding these user behaviour/session features into ML models improves the performance of ML models in predicting users' purchase intention. In a physical environment, Zeng et al. (2019) proposed a purchase prediction model to analyse user behaviour during a festival in China. It was found that if a product is interesting to a user, the user was more likely to spend more time on it. In a virtual environment, Wu et al. (2015) proposed a purchase behaviour prediction model focused on identifying click patterns rather than session features. Experiment results showed that a model using features learned from click patterns can predict purchases as well as a conventional classification model trained using session features. When users are anonymous, users' web logs can be extracted and used for purchase prediction (Suh et al. 2004; den Poel and Buckinx 2005). In contrast, when users are registered, the performance of purchase prediction models can be improved by using extracted session features built up over time (Lo et al. 2016).
In Lo et al. (2016), purchase prediction models for registered Pinterest users were proposed to analyse users' long- and short-term behaviours. The authors tested the performance of prediction models on features extracted at different times before the purchase action happened, and found that purchase intentions were built up over time. Results of the experiments indicated that prediction models produced better accuracy when features were created based on the whole timeline until right before the purchase action. Some of the state-of-the-art ML models that have been used for purchase prediction include Decision Tree (DT), Neural Networks (NN), Recurrent Neural Network (RNN) and logistic regression. Mokryn et al. (2019) and Li et al. (2016) investigated users' purchase prediction using various ML models, and evaluated the performance of these models. Li et al. (2016) proposed three ML models (Bagging, DT and Random Forest), and the prediction results of the models were combined using a linear regression method. Their proposed method gained 8% accuracy (recall) on purchase prediction. In Mokryn et al. (2019), logistic regression, Bagging and DT algorithms were used to investigate the effect of temporal features (time, product trendiness, etc.) on purchase prediction performance, where it was found that Bagging performed best when temporal features were applied. Suh et al. (2004) created model attributes using Association Rules (AR) and applied a combination of different ML models (DT, NN and logistic regression), and found that performance improved when these ML models were combined. ML models trained on anonymous sessions with extracted session features have been found to perform well on purchase prediction despite user anonymity (Yagci et al. 2015; Romov and Sokolov 2015; Esmailian and Jalili 2015; Pálovics et al. 2015).

Limitations of previous approaches and proposed contributions
As discussed above and shown in Table 1, most of the current methods for purchase prediction are offline and try to predict purchase from completed sessions (i.e. after the user has left the website) in order to define a follow-up action (e.g. send a follow-up email). Moreover, existing methods could be ineffective in early purchase prediction while the shopper is still navigating the e-commerce website. Admittedly, there are methods for purchase prediction for registered users at the end of a session; however, there is a lack of early purchase prediction methods for anonymous users or registered users during an ongoing session. This limits the use of personalised marketing strategies in real time. Such marketing and business improvement strategies can boost sales, improve consumer experience and retain consumers when users are anonymous (no contact information) and cannot be contacted again after they leave the e-commerce website (Mokryn et al. 2019). In this work, we present a novel approach for early purchase prediction in an active session on an e-commerce website in order to improve the provision of real-time discounts on products. This can be effective in convincing shoppers to purchase products. Also, we develop and implement an EPP framework to evaluate the performance of ML models for early purchase prediction.

Dataset description
The dataset used in this work consists of 6 months of session logs (Ben-Shimon et al. 2015). The session logs are collected from a European e-commerce business (YooChoose, https://www.yoochoose.com/). YooChoose is a Germany-based company that offers a Software as a Service (SaaS) solution to help online shops generate a personalised shopping experience for their consumers via personalised product recommendation, search results and newsletters. Product categories in the dataset are not limited to clothes only, but also include toys, electronics and garden tools. The dataset contains two files. The first is the click data, which contains the click events on an item, each associated with the session id, item id, the category of the product and the timestamp (the time when the click occurred). The second file, called the purchase event log, contains purchase events from sessions that appear in the click events log and end with a purchase. Each purchase event is associated with a session id, item id, price and quantity of purchase. Sessions are diverse in their length (the number of browsed products) and timestamp (the time that the session started). Sessions last from a few minutes to a few hours, and the number of clicked products varies from one to hundreds depending on user activity. This dataset has been used widely for consumer intention prediction and (Yeo et al. 2017; Wu et al. 2015; Brodén et al. 2018; Bogina and Kuflik 2017; Mokryn et al. 2019).
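The relationship between the two files can be illustrated with a minimal sketch that groups click events by session and marks a session as a purchase session when its id appears in the purchase log. The field order, the item ids and the values below are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical in-memory records; field order and values are assumptions.
# click: (session_id, timestamp, item_id, category)
clicks = [
    (1, "2014-04-07T10:51:09", 1001, "0"),
    (1, "2014-04-07T10:54:09", 1002, "0"),
    (2, "2014-04-07T13:56:37", 1003, "0"),
]
# purchase: (session_id, item_id, price, quantity)
purchases = [
    (1, 1002, 1046, 1),
]

def label_sessions(clicks, purchases):
    """Group clicks by session and attach a binary purchase label."""
    sessions = defaultdict(list)
    for session_id, timestamp, item_id, category in clicks:
        sessions[session_id].append((timestamp, item_id, category))
    # A session is a purchase session if it appears in the purchase log.
    buyers = {session_id for session_id, *_ in purchases}
    return {
        sid: {"clicks": sorted(events), "purchase": int(sid in buyers)}
        for sid, events in sessions.items()
    }

labelled = label_sessions(clicks, purchases)
```

Sorting the events relies on the ISO-style timestamps ordering lexicographically, which is a property of the illustrative format used here.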
In the dataset, there are 52739 unique items, 9249729 sessions and 26637000 interactions. A majority of the sessions are view-only sessions without any purchase, which means there is a massive class imbalance problem in the dataset. Users may have different habits; for example, they may prefer shopping on specific days of the week. We therefore use the day of the week in the attributes and examine on which days of the week users mostly shop on the website. As seen in Fig. 1, users are more likely to be active on Sunday and Monday.
We also analyse the dataset in terms of the hours at which users mostly visit the website. Figure 2 shows the distribution of the number of user interactions with the website by hour.
Session duration is another indicator of purchase action. The longer a user spends browsing items in a session, the higher the probability that the session will end with a purchase. We analyse the difference between the session duration distributions for purchase and non-purchase sessions. As shown in Fig. 3, the majority of purchase sessions last between 6 and 12 minutes, while the majority of the non-purchase sessions last less than 6 minutes (Fig. 4).

Test dataset analysis
In this section, the test dataset used to validate our framework is analysed. For the purpose of this work, we selected 44999 sessions randomly. Out of the selected sessions, Table 2 shows that only 5414 sessions end with a purchase, indicating that the test dataset is heavily imbalanced.
Two critical factors in the test dataset that affect the utility score are the first purchase location and the last purchase location (if there is more than one purchase in a session), as will be explained later in the scoring function. Figure 5 shows that the first purchase location is mainly after four viewed items, which means that shoppers' purchase actions are concentrated between 2 and 5 browsed products. The last purchase location is mainly concentrated between five and ten viewed items. The line inside the box plot shows the mean of shoppers' first and last purchase locations.

Early Purchase Prediction (EPP) framework
The EPP framework (Fig. 6) is designed in order to determine whether ML models can predict a purchase before the purchase action happens in an ongoing session. Also, the EPP framework helps to evaluate the performance of ML models on early purchase prediction using the designed scoring function.
The proposed framework consists of three phases. In the first phase, the session logs from the e-commerce website are collected. Session logs store the data about which products are browsed in each session and the start and end timestamps of the session. The collected session logs are unstructured; for example, in some records, the timestamp or session id can be missing. Therefore, to have a proper dataset to feed the ML models, these records are filtered out in the pre-processing stage of the EPP framework. In order to evaluate the ML models, their performance needs to be tested. For this purpose, the dataset is split into train and test datasets. The train dataset is used to train the ML models, while the test dataset is used to evaluate the models.
The second phase covers the class imbalance problem, attribute selection, and feature generation from the selected attributes. Class imbalance can be defined as having more non-purchase sessions than purchase-ended sessions. If models are trained with the imbalanced data, they will be biased in purchase prediction for the ongoing session; for instance, all the predictions will be non-purchase. To prevent this, we apply the class imbalance methods explained in Section "Class imbalance". In order to build ML models, features about sessions need to be extracted. Choosing the right and relevant features is a critical step in achieving well-performing ML models. If well-correlated features are selected, ML models perform better on predicting purchase action for an ongoing session. The details of the selected features are given in Section "Attribute selection and feature generation". In the last phase, trained ML models are evaluated using features created dynamically in a test session after each viewed product.
Since the proposed EPP framework will provide real-time purchase prediction, the features for the sessions in the test dataset are generated dynamically, that is, after each product viewing in the session (Fig. 7). The proposed scoring function detects whether ML models can predict the purchase action before the actual purchase happens.
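The pre-processing and splitting steps of the first phase could be sketched as follows. This is only an illustration under stated assumptions: the record layout (dictionaries with `session_id` and `timestamp` keys), the 20% test fraction and the session-level split are hypothetical choices, since the text does not specify them.

```python
import random

def drop_incomplete(records):
    """Keep only records that have both a session id and a timestamp."""
    return [r for r in records
            if r.get("session_id") is not None and r.get("timestamp") is not None]

def split_by_session(records, test_fraction=0.2, seed=0):
    """Split records into train and test so that no session spans both sets.

    The test fraction is an assumption, not a value from the paper.
    """
    session_ids = sorted({r["session_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(session_ids)
    n_test = int(len(session_ids) * test_fraction)
    test_ids = set(session_ids[:n_test])
    train = [r for r in records if r["session_id"] not in test_ids]
    test = [r for r in records if r["session_id"] in test_ids]
    return train, test

# Hypothetical records, two of which are incomplete.
records = [
    {"session_id": 1, "timestamp": "2014-04-07T10:51:09", "item_id": 10},
    {"session_id": 1, "timestamp": None, "item_id": 11},
    {"session_id": None, "timestamp": "2014-04-07T11:00:00", "item_id": 12},
    {"session_id": 2, "timestamp": "2014-04-07T13:56:37", "item_id": 13},
    {"session_id": 3, "timestamp": "2014-04-08T09:00:00", "item_id": 14},
    {"session_id": 4, "timestamp": "2014-04-08T09:05:00", "item_id": 15},
    {"session_id": 5, "timestamp": "2014-04-08T09:10:00", "item_id": 16},
]
clean = drop_incomplete(records)
train, test = split_by_session(clean, test_fraction=0.2)
```

Splitting by session id rather than by record keeps all clicks of a session on one side of the split, which matters here because features are computed per session.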

Attribute selection and feature generation
In this work, we choose the Timestamp (see Appendix) attribute from session logs since this attribute will help to create other features such as session duration. The following features will be used to train the ML models. These features are generated from training dataset statistics based on users' behaviour and timestamp.
1. Total Viewed Items: This feature shows how many items are viewed in the session, that is, the length of the session. Clearly, an item can be clicked many times in a session, and this provides an indication of users' intention towards the item.
2. Total Unique Items: We take into account the unique items seen in the session to show how many of the clicked products are different products. If a user prefers to view the same item multiple times, it could be an indication of the user's interest in the item.
3. Total Session Duration: Shows the duration of a session. It is found that there is a positive correlation between dwell time and users' interest in the items in a session (Bogina and Kuflik 2017).
4. Click Rate: Defines how many products are clicked within the duration of a session. This could show the user intention in the session. For example, if the user clicked many items within a short period in the duration of a session, this could be interpreted as the user having browsing intentions rather than intending to buy an item in that session.
We used temporal features to explore their effect on the performance of the ML models. The created temporal features are described below.
1. Hour: This attribute shows the hour at which the session starts. It is seen from the dataset analyses that users are more likely to proceed to check out after 5:00 pm (Fig. 2).
2. Day of the Week: Since each day of the week has a different number of browses and purchases, we add to the features the day of the week on which a session started (Fig. 1).
3. Weekend: Indicates whether the day of the week on which the session started falls on the weekend or not. This feature can give an indication of a user's purchase intention. We found from our test dataset that Sunday is the busiest day, with the highest number of visits to the e-commerce website (Fig. 1).
4. Day of the Year: Indicates the numerical day of the year on which the session started. This attribute has an important role since user behaviours might change with seasons and on special days such as Black Friday or festive periods such as Christmas.
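The session and temporal features listed above could be computed from a time-ordered list of clicks roughly as in the sketch below. The function name, the dictionary keys and the choice of clicks per second as the unit of the click rate are assumptions made for illustration; the text does not fix these details.

```python
from datetime import datetime

def session_features(clicks):
    """Compute session and temporal features for a list of
    (timestamp, item_id) clicks ordered by time."""
    times = [t for t, _ in clicks]
    items = [i for _, i in clicks]
    start = times[0]
    duration = (times[-1] - start).total_seconds()
    return {
        "total_viewed_items": len(items),
        "total_unique_items": len(set(items)),
        "session_duration_s": duration,
        # Clicks per second; set to 0.0 for a zero-length session (assumption).
        "click_rate": len(items) / duration if duration > 0 else 0.0,
        "hour": start.hour,
        "day_of_week": start.weekday(),        # Monday = 0 ... Sunday = 6
        "weekend": int(start.weekday() >= 5),  # Saturday or Sunday
        "day_of_year": start.timetuple().tm_yday,
    }

# Hypothetical session: three clicks on a Sunday afternoon, item "A" seen twice.
clicks = [
    (datetime(2014, 4, 6, 17, 0, 0), "A"),
    (datetime(2014, 4, 6, 17, 2, 0), "B"),
    (datetime(2014, 4, 6, 17, 5, 0), "A"),
]
features = session_features(clicks)
```

For this hypothetical session the function reports three viewed items, two unique items and a duration of 300 seconds, and the weekend flag is set because 6 April 2014 was a Sunday.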

Model training
To test the proposed EPP framework, we train five ML models based on features extracted from the sessions. The ML models are Decision Tree (DT), Random Forests (RF), Bagging, K-Nearest Neighbour (KNN) and Naive Bayes (NB). DT (Berry and Linoff 2004) is a non-parametric supervised learning method used for classification and regression. The goal of DT is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. RF (Breiman 2001) is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

Dynamic feature building for the test dataset and purchase prediction
Feature extraction for the test dataset is done in a similar way to feature creation for the training dataset. However, the features of the test sessions are built after each viewed item (Fig. 7), since our purpose is to predict purchase intention in an ongoing session. For each set of features updated after a viewed product, we obtain a real-time prediction of whether the user will buy an item in the session or not.
After each prediction, the EPP utility score is calculated for the model. The EPP utility score depends on the item position and session specification. For instance, if there is a purchase, the EPP utility score is calculated using the positions of the first purchased and last purchased items.
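The per-item prediction loop described above can be sketched as follows. The feature set is reduced to three entries for brevity, and `toy_model` is a hypothetical stand-in for a trained classifier, not one of the models used in this work.

```python
def prefix_predictions(clicks, predict):
    """Return one binary prediction per viewed item, each made from
    features of the clicks observed so far in the session."""
    predictions = []
    for position in range(1, len(clicks) + 1):
        seen = clicks[:position]
        features = {
            "total_viewed_items": len(seen),
            "total_unique_items": len({item for _, item in seen}),
            "session_duration_s": seen[-1][0] - seen[0][0],
        }
        predictions.append(int(predict(features)))
    return predictions

# Hypothetical stand-in for a trained classifier: it flags a purchase once
# the user has returned to an already viewed item.
def toy_model(features):
    return features["total_viewed_items"] > features["total_unique_items"]

# Hypothetical session: (seconds since session start, item id)
clicks = [(0, "A"), (40, "B"), (95, "A"), (130, "C")]
predictions = prefix_predictions(clicks, toy_model)
# 1-based position of the earliest positive prediction, if any.
first_positive = predictions.index(1) + 1 if 1 in predictions else None
```

In this toy session the predictions array is `[0, 0, 1, 1]`, so the earliest positive prediction occurs at the third viewed item.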

EPP scoring function
The EPP scoring system is an important component of the proposed EPP framework, since it allows us to evaluate how early ML models can determine a purchase in the session based on the calculated utility score distribution. A high frequency of higher utility scores in the distribution for a model shows that the model is performing well in early purchase prediction. For example, after browsing the first item, based on the captured session features, the model can predict whether the session will end with a transaction or not. In this case, the proposed scoring system checks if there is actually a purchase in the session. If there is a purchase, the scoring system will measure how early the model gives a correct purchase prediction by looking at the distance between the position of the interacted item and the position of the purchase action.
We define the proposed EPP utility scoring function as follows. Given a set of sessions S with session s ∈ S, lp is the position of the last purchased product in session s, t is the total number of items browsed in s, p is the position of the last observed item i, and predictions is a binary array that holds the prediction results generated from the item features after each observed item i in s. If the model predicts a purchase, prediction is 1; otherwise, prediction is 0 (Eq. 1).
$$\text{prediction} = \begin{cases} 1, & \text{if a purchase is predicted} \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$
The utility scores are calculated based on the prediction location for the interacted items observed so far in the session. The classification prediction can be positive or negative (Figs. 8, 9). This means that if prediction is 1 we classify it as a positive prediction, and if prediction is 0 we classify it as a negative prediction.
The details of the different probable situations in the purchase prediction process are shown in Table 3. In this table, U(s,i) is the utility score function that gives a utility score for the observed features after each viewed item i in the session s.
Based on Table 3, there are four cases that need to be considered to calculate the EPP utility score:
1) In the case of $U_{TP}(s,i)$, Eq. 2 is used to calculate the utility score $sc_1$ for item i in position p in the session s.
2) In the case of $U_{FN}(s,i)$, we give a negative utility score based on the position p of the last observed item i (Eq. 3). In the utility score calculation ($sc_2$), we do not apply negative values until p reaches the first purchase point (fp) of the session s.
3) If there is no purchase in the session but the prediction model gives a positive prediction ($U_{FP}(s,i)$), we calculate the utility score ($sc_3$) as in Eq. 4. In this equation, −0.05 indicates the False Positive (FP) coefficient.
4) If there is no purchase in the session and the prediction model also gives a negative prediction ($U_{TN}(s,i)$), the utility score is calculated as in Eq. 5.

Table 3
$U_{TP}(s,i)$: Positive prediction of item i in purchase session s.
$U_{FN}(s,i)$: Negative prediction of item i in purchase session s.
$U_{FP}(s,i)$: Positive prediction of item i in non-purchase session s.
$U_{TN}(s,i)$: Negative prediction of item i in non-purchase session s.

We follow two methods to calculate the overall utility score of the models' early prediction performance (Figs. 8 and 9):
1. If the model has a positive prediction for a session that ends with a purchase, we consider the utility score in the earliest position. However, if the model has a negative prediction for the session, but it ends up with a purchase, we consider the utility score in the last position. For example, let us say fp = 5, purchase = 1, prediction = 1, lp = 5, t = 10, and we have a positive prediction in position p = 2. In this case, we consider the utility score calculated in position p = 2, and neither the negative utility scores calculated in previous positions nor the positive or negative utility scores after position p = 2. Note that in this method, there must be at least one positive prediction.
2. If the model has negative predictions for all observed items in the session (that is, there is no positive prediction from this model) but the session has a purchase, we assign the lowest utility score to that model for that session.
A higher utility score indicates that the model is better at early purchase prediction. The details of the utility score function are shown in Algorithm 1.
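The two selection methods for purchase sessions can be illustrated with a short sketch. It is not Algorithm 1: the per-position utility scores are taken as inputs because Eqs. 2 and 3 are not reproduced here, the function name is hypothetical, and "lowest utility score" is interpreted as the minimum of the per-position scores, which is an assumption.

```python
def session_score(predictions, scores):
    """Pick the utility score that represents a purchase session.

    predictions: binary prediction after each viewed item (1 = purchase).
    scores: utility score computed at the same positions.
    """
    if len(predictions) != len(scores):
        raise ValueError("predictions and scores must have the same length")
    if 1 in predictions:
        # Method 1: use the score at the earliest positive prediction and
        # ignore the scores before and after that position.
        return scores[predictions.index(1)]
    # Method 2: no positive prediction at all. "Lowest" is read here as the
    # minimum of the per-position scores (assumption).
    return min(scores)

# Hypothetical score values, used only to exercise the selection rule.
early_hit = session_score([0, 1, 0, 1], [0.0, 1.8, 0.0, 1.4])
never_hit = session_score([0, 0, 0], [0.0, -0.2, -0.5])
```

With these placeholder values, the first call returns the score at position 2 (the earliest positive prediction), and the second call returns the smallest score because the model never predicted a purchase.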

Experiments and results
This section shows the performance of the ML models on early purchase prediction on the dataset. As discussed in Section "EPP scoring function", utility scores above 1 indicate that there is an early purchase prediction for a purchase-ended session (Fig. 8). We conduct experiments on different ML models to see which model is better at early purchase prediction. The aims of this section are as follows.
1. Analyse the early purchase prediction performance of ML models in purchase-ended sessions.
2. Analyse how accurately ML models can predict when sessions do not have a purchase.
3. Compare the effect on the early purchase prediction score of the two methods used to fix the class imbalance problem in the dataset.

Class imbalance
Class imbalance (Berry and Linoff 2004) is a major problem in purchase prediction from session logs since the majority of the sessions end without any transaction. Techniques to deal with imbalanced data include (i) oversampling, in which the minority class is oversampled, (ii) undersampling (Kubat and Matwin 1997), in which the majority class is undersampled, and (iii) mixed methods (Batista et al. 2004), in which a combination of undersampling and oversampling is applied. For oversampling, the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al. 2002) and Adaptive Synthetic (ADASYN) sampling (He et al. 2008) are well-known methods. In this study, we apply the SMOTE oversampling and random undersampling class imbalance techniques. For each case, we train the classification algorithms. We use a Python library (Lemaître et al. 2017) to implement the methods that deal with the class imbalance problem.
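To make the idea of random undersampling concrete, the following standard-library sketch drops majority-class rows at random until both classes have the same size. It is an illustration of the principle only and is not the library implementation used in this study; the function name and the toy data are hypothetical.

```python
import random

def random_undersample(rows, labels, seed=0):
    """Randomly drop rows of the larger class until both classes are equal in size."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample without replacement; the minority class is kept in full.
        for row in rng.sample(group, n_min):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data: 2 purchase sessions (label 1) versus 8 non-purchase sessions (label 0).
rows = [[n] for n in range(10)]
labels = [1, 1] + [0] * 8
balanced_rows, balanced_labels = random_undersample(rows, labels)
```

SMOTE works in the opposite direction: instead of discarding majority rows, it synthesises new minority rows by interpolating between existing minority samples and their nearest neighbours.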

Evaluation metrics
We use the proposed EPP utility score designed in Section "EPP scoring function" to measure the performance of classification models on early purchase prediction. In addition, we use the Area Under Curve (AUC) score (Martínez et al. 2020) to evaluate which model performs better with the highest EPP utility score. AUC measures how capable the model is of differentiating between classes (Huang and Ling 2005). The classes in our work are purchasing and not purchasing. When a model has a high AUC score, it means the model is good at differentiating between purchase and non-purchase intention. Additionally, we consider the confusion matrix (Martínez et al. 2020) in order to analyse the number of correctly classified and misclassified purchase intention predictions.
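As a reminder of what these two metrics compute, the sketch below derives the AUC from its pairwise-ranking interpretation and counts the four confusion-matrix cells. The function names, the toy labels and scores, and the 0.5 decision threshold are assumptions for illustration; the pairwise AUC shown here is quadratic in the number of sessions and is meant for small examples only.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random purchase session receives a
    higher score than a random non-purchase session (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one session of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion_counts(labels, predictions):
    """Return the counts (TP, FP, FN, TN) for binary labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn

# Toy example: 1 = purchase session, 0 = non-purchase session.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
auc = auc_score(labels, scores)
# Binary predictions at a hypothetical threshold of 0.5.
predictions = [int(s >= 0.5) for s in scores]
counts = confusion_counts(labels, predictions)
```

In the toy example, five of the six purchase/non-purchase pairs are ranked correctly, giving an AUC of 5/6, while the thresholded predictions miss one of the two purchase sessions.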

Experimental setup and analysis
In order to analyse how early the ML classification models can predict whether there will be a purchase in the session or not, a set of experiments is carried out (Table 4). We tested ML models under different class imbalance situations (SMOTE, undersampling and without any class imbalance method) to understand the effect of class imbalance techniques (Section "Class imbalance") on early purchase predictions. Also, we added additional features to our dataset to see the effect of temporal features on early purchase prediction. Moreover, we tested classification models on the filtered dataset, in which we eliminated sessions that have fewer than ten interacted items in the training dataset. NB, KNN, RF, Bagging and DT models were used to conduct the experiments and compare their performances. For the KNN model, we set the number of neighbours to 64; in RF, the number of estimators is set to 50, and the base estimator is chosen as DT.
All classification models were run on each dataset category. For each category, their performance was compared to find the best performing category and model.

Utility score and evaluation of early purchase prediction for classification models
We used the proposed utility score to evaluate models in terms of how they can identify the earliest purchase intention.We created a decision condition that helps to labelling early purchase predicted sessions based on the utility score.The details of the utility score can be seen in Section "Dataset description".As seen in Fig. 8, when the utility score is greater than one; it means classification model predicts correctly that there would be purchase in this session.On the other hand, when the utility score is greater than 0.05 for a session, this means that the session is labelled correctly as purchase ended session.However, in the case when utility score is between 0 and 1, this means purchase action is predicted after the transaction happened.Moreover, if utility score equals to 0.05 which means a non-purchase session classified correctly as non-purchase session (Fig. 9).Function 6 shows the possible early prediction results according to given session utility score series s. s is created as a result of utility score calculation after predictions derived from classification models for each session.
To carry out the experiments, we pick the session features according to their utility score. We create features after each interacted item and select the features with the highest utility score to represent the session. For example, assume that a session ends with a purchase, that the prediction model gives the wrong prediction for the features derived from the 2nd to the 4th interacted items, and that the session features created after the 5th interacted item lead to a correct prediction. Then the utility score at the 5th position is picked as the early purchase prediction score for the session, and the session features created at the 5th viewed item are selected as the representative features of the session. If the model gives a wrong purchase prediction, we continue to check whether we can get the right prediction until the session ends. If no correct prediction has been obtained by the time the session ends, we use the features created after the last interacted item as the representative features of the session.
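The selection rule above can be sketched as follows. The sketch assumes that a wrong prediction yields a utility score of 0, which the text does not state explicitly; the function name `pick_representative_step` and the score values are hypothetical and only mirror the worked example (wrong at items 2 to 4, correct from item 5).

```python
def pick_representative_step(utility_scores):
    """Return the index of the step whose features represent the session.

    The step with the highest utility score is chosen (earliest step on ties).
    If no step has a positive score, the last step is used as a fallback.
    Assumption: wrong predictions are scored 0.
    """
    if not utility_scores:
        raise ValueError("a session needs at least one scored step")
    best = max(range(len(utility_scores)), key=lambda i: utility_scores[i])
    if utility_scores[best] <= 0:
        return len(utility_scores) - 1
    return best


FIRST_SCORED_ITEM = 2  # scoring starts after the second viewed item

# Hypothetical scores for items 2..6 of a purchase session.
scores = [0.0, 0.0, 0.0, 1.4, 1.2]
step = pick_representative_step(scores)
print("representative item:", step + FIRST_SCORED_ITEM)

# A session that is never predicted correctly falls back to its last item.
never_correct = [0.0, 0.0, 0.0]
fallback_step = pick_representative_step(never_correct)
```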
The experiments aim to identify which model performs better at early purchase determination. After finding the best-performing model, we analyse how early the model predicted the purchase action by analysing the utility score distribution (such as the number of steps after the session started). A higher utility score indicates an earlier purchase prediction by the model. We also analysed the machine learning models with AUC scores and confusion matrices. Our aim in using these assessment measures is to find out whether there is any positive correlation between these evaluation metrics (AUC and confusion matrices) and the calculated EPP utility score. Table 5 shows the performance comparison of the models. For the Category A dataset, Bagging achieved the highest AUC score, followed by the KNN classifier. In the other categories, DT classification models outperformed the other classification models. NB, in contrast, performs consistently across class imbalance methods and categories, but with the worst AUC score. The undersampling method for the class imbalance problem outperformed the SMOTE method for most classification models.
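For readers who want to see what the two assessment measures compute, here is a minimal sketch in plain Python. It uses the rank interpretation of AUC (the probability that a randomly chosen purchase session receives a higher score than a randomly chosen non-purchase session). The labels, scores and the 0.5 decision threshold are hypothetical toy values, not results from the experiments.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def confusion_counts(labels, predictions):
    """Count true/false positives and negatives for binary labels (1 = purchase)."""
    pairs = list(zip(labels, predictions))
    return {
        "TP": sum(1 for y, p in pairs if y == 1 and p == 1),
        "FP": sum(1 for y, p in pairs if y == 0 and p == 1),
        "FN": sum(1 for y, p in pairs if y == 1 and p == 0),
        "TN": sum(1 for y, p in pairs if y == 0 and p == 0),
    }


# Hypothetical toy sessions: 1 = purchase, 0 = non-purchase.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]

auc = auc_score(labels, scores)
counts = confusion_counts(labels, predictions)
print(round(auc, 4), counts)
```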
It can be seen from the results that the model most affected by the class imbalance problem is the KNN classifier (its lowest AUC score is 0.5124, on the imbalanced dataset). To compare DT and Bagging, we look more closely at the distribution of the EPP utility scores of the sessions. As seen from Figs. 10 and 11, DT yields more sessions with a utility score of 2.0 than Bagging, which means that DT is better than Bagging at early purchase prediction after two viewed items. We analysed the confusion matrices of the three best-performing models trained on the Category C undersampled dataset. It can be observed (Table 6) that the DT classifier is superior to the Bagging and KNN classifiers in predicting purchase-ended sessions.
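The comparison of utility score distributions can be sketched with a simple frequency count. The per-session scores below are hypothetical and serve only to show the kind of comparison behind Figs. 10 and 11; the function name `score_distribution` is likewise an assumption.

```python
from collections import Counter


def score_distribution(utility_scores, ndigits=2):
    """Count how many sessions received each (rounded) utility score."""
    return Counter(round(score, ndigits) for score in utility_scores)


# Hypothetical per-session utility scores for two models.
dt_scores = [2.0, 2.0, 1.5, 0.05, 2.0]
bagging_scores = [2.0, 1.5, 1.5, 0.05, 0.0]

dt_dist = score_distribution(dt_scores)
bagging_dist = score_distribution(bagging_scores)

# Sessions predicted at the earliest scored position (utility score 2.0).
print("DT:", dt_dist[2.0], "Bagging:", bagging_dist[2.0])
```

A model with more sessions at the highest score identifies purchase intention at the earliest scored position more often.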
The importance of data features during the ML model establishment process can be sorted and scored. We use the DT classifier to conduct the feature importance analysis, following the same method as Dutta et al. (2019) and Dou (2020), because we found the DT classifier to be the best-performing model among all the ML models we analysed for early purchase intention prediction. Based on the feature importance analysis, we found that the most important feature for purchase prediction is session duration. The reason might be that session duration is a good indicator of users' purchase intention. For example, when a shopper spends more time on the e-commerce platform, this can signal purchase intention, which increases the probability of a purchase action. The importance of the other features is shown in Fig. 12. It can be seen that the minimum popularity value of the browsed items in the session has very little effect on purchase prediction.
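To give an intuition for how a tree-based model can rank features, the sketch below scores each feature by the largest Gini impurity decrease that a single threshold split on it can achieve. This is a deliberately simplified stand-in, not the importance computation of a full DT classifier and not necessarily the criterion used in the experiments. The feature values and labels are hypothetical toy data.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split_gain(values, labels):
    """Largest Gini impurity decrease achievable by one threshold split on a feature."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    # Try every distinct value except the largest as a "<= threshold" split.
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


# Hypothetical toy sessions: 1 = purchase, 0 = non-purchase.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "session_duration": [30, 45, 60, 300, 420, 600],
    "min_popularity": [5, 9, 5, 9, 5, 9],
}

gains = {name: best_split_gain(values, labels) for name, values in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking, gains)
```

In this toy data, session duration separates the two classes perfectly while minimum popularity barely helps, so the ranking mirrors the qualitative pattern reported in the text.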

Discussion
This work explores how ML models perform on early purchase prediction. We train our ML models on anonymous session logs and investigate how early purchase intention can be determined for a test session. In this section, we discuss our findings based on the questions asked in Section "Introduction". These are: (1) Given session data after a user's first interaction, how helpful can ML models be in predicting the likelihood of a purchase in an ongoing session? (2) What is the most critical session feature for early purchase prediction and how can it be identified? (3) How can ML models be evaluated to measure their performance on early purchase prediction? The findings based on the above questions are discussed in the next section (Section "Findings"). In Section "Contributions", we discuss the implications of these findings and our contribution to the body of knowledge.

Findings
1. We found that ML models are useful in predicting the likelihood of a purchase in an ongoing session after the first interaction, especially when the DT model is used. DT is able to predict early purchase intention with an AUC score of around 97%. Bagging, KNN and RF achieved AUC scores of 94%, 93% and 90%, respectively, while NB achieved an AUC score of 75% (for more details, refer to Table 5). Based on our experiments, DT is the best-performing ML model for predicting early purchase intention in an ongoing session (Table 6). In addition, we analysed the EPP utility scores of the two best-performing ML models (Figs. 10 and 11). The results show that the DT model produces high EPP utility scores for a larger number of sessions than the Bagging model.
2. We found that "Session Duration" is the most important feature and gives a critical signal for predicting users' purchase intention. We believe this could be because the longer a user spends in a session, the higher the probability of a purchase. On the other hand, the "Min Popularity" feature, which indicates the minimum popularity of the browsed items in the session, has the lowest importance for classifying users' purchase intention. We derived the feature importance analysis (Fig. 12) while building the best-performing ML model, the DT classifier.
3. To evaluate the performance of ML models on early purchase prediction, we developed an EPP utility scoring method that assigns efficiency scores to the ML models. The developed EPP scoring method applies defined rules for purchase-ended sessions and non-purchase sessions (Section "EPP scoring function"). A high utility score indicates that an ML model is efficient in early purchase prediction.

Contributions
Although many works in the literature have investigated purchase intention prediction, they focused on using features and building ML models after sessions end (Park and Park 2016; Mokryn et al. 2019; Martínez et al. 2020; Wu et al. 2015; Kytö et al. 2019; Köcher et al. 2019). These works mainly investigate whether users will purchase a product in their next sessions. They also studied feature-based performance differences of the models on purchase prediction after sessions end. For instance, Mokryn et al. (2019) examined the performance improvement of the ML models when product trendiness is added as a feature in the model training stage. Kim et al. (2020) investigated real-time purchase action by using cameras and object detection algorithms to analyse users' physical movements in a product cabinet in order to detect purchase action. What differentiates our work from existing studies is that we designed an EPP framework that is able to predict purchase intention for an ongoing session on an e-commerce website. We also proposed a utility scoring method to investigate how early ML models can detect a purchase in an ongoing session, before the purchase action happens.

Theoretical implications
Our work offers academic and industrial implications in line with how important predicting users' purchase intention is to the digital marketing community (Qiu et al. 2015). Current studies on purchase prediction primarily focus on predicting purchase intention for the next time a consumer visits the e-commerce website (Park and Park 2016; Mokryn et al. 2019; Martínez et al. 2020). Some of these studies applied ML models for purchase prediction after a session has ended (Park and Park 2016; Mokryn et al. 2019). On the other hand, research investigating the purchase intention of a consumer while the session is active is still scarce, albeit important to digital marketing research. In this regard, we extend the existing works of Park and Park (2016) and Mokryn et al. (2019) by applying ML models to predict purchase intention in an ongoing session. The purpose of analysing a consumer's purchase probability while the session is ongoing is to target advertisements and recommend products in real time, since recommending products and advertisements or sending offers after a session has ended is difficult when a consumer is anonymous, i.e. when the consumer is not registered on the website (Jannach and Jugovac 2019).
To the best of our knowledge, this is the first work that applies and evaluates the performance of ML models in early purchase intention prediction. In this work, we develop an EPP utility scoring method to measure the performance of the ML models. We offer a valuable contribution for information retrieval researchers by utilising the EPP framework to identify shoppers who do not intend to purchase. This may provide the added benefit of recommending products and offering discounts to users in an ongoing session. Since many studies show the positive value of offering discounts and product recommendations in aiding consumers' purchase decisions (McColl et al. 2020; Liu et al. 2016; Leeflang et al. 2014; Ahmed et al. 2014), our EPP framework identifies users' intention in real time, before they abandon the website. It also provides high accuracy on early purchase intention prediction even after the first product interaction. Another contribution of this work is a framework that demonstrates the capability of ML models to determine early purchase intention regardless of a user's registration status (registered or unregistered). Existing studies have highlighted that returns from unregistered consumers are very low, as providing personalised offers for unregistered users is a challenging task (Behera et al. 2020; Hallikainen et al. 2019). Consequently, this work shows that ML models are strong predictors when shoppers' behaviours are analysed appropriately and behavioural features are generated dynamically for active sessions. In conclusion, we showed that the EPP utility score can be applied successfully to determine how early ML models can predict the true purchase intention of a consumer. Thus, researchers in other domains may find some benefit in adapting our proposed EPP utility scoring method to determine the probability of occurrence of an event. For instance, research in airline operations shows that early determination of the failure of an aircraft system is important (Dangut et al. 2020), and in healthcare research, early diagnosis of a disease is crucial to its successful treatment (Perveen et al. 2016).

Practical implications
Notable practical implications can also be drawn from this work. Early online identification of shoppers with high purchase intention could allow marketers to implement different online strategies for consumer intention, personalised recommendations, sales boosting and targeted offers. Providing better personalised recommendations can also make consumers feel special, which can improve consumer experience and loyalty. The role of discounts and price history charts in shoppers' perception has been widely investigated by the digital marketing community (Drechsler and Natter 2011). It was found that giving smart discounts and providing a price history chart can make consumers more readily persuaded to purchase products. Our framework can contribute to the digital marketing community by improving the quality of personalised recommendations, targeted advertisements and discounts. It can also aid in identifying profitable consumers, which can make marketing spending more efficient (Kumar et al. 2008).

Conclusion and future work
In this work, we proposed an Early Purchase Prediction (EPP) framework. The framework aims to predict users' buying intention from their first few interactions during a given online session. The problem was modelled as a classification problem. We evaluated the proposed framework using five ML models and different feature sets. The proposed EPP framework can be combined with existing methods for recommender systems. Our method uses only anonymous session logs as its dataset. We built dynamic feature creation after each interaction in the session, and then used those features for prediction. To determine whether a positive purchase prediction is given before or after the purchase action, we designed an early purchase prediction utility score and selected the features that produced the highest early purchase prediction utility score. We used various ML models, including ensemble models, under different class imbalance methods to test the algorithms' performance on the chosen features for each session. We found that the DT model is the best ML model for determining early purchase intention in an ongoing session. To apply the proposed EPP framework, we assumed that shoppers interact with at least two products when they visit an e-commerce website, since the utility scoring is activated after the second viewed product, as described in Section "EPP scoring function". Therefore, the framework cannot determine whether a given purchase prediction is an early purchase intention prediction when a shopper has browsed only one product. This can be viewed as a limitation of the EPP framework. Nevertheless, the EPP framework can still give a prediction of the shopper's purchase intention in that case, and once shoppers have browsed two or more products, it can successfully determine their early purchase intention. Several lines of future work follow from this work. First, our next step is to test the early purchase prediction framework on other session-based datasets to further substantiate the effectiveness and adaptability of our proposed framework. In addition, the early purchase prediction framework can be integrated into a session-based recommendation system. For example, prediction results based on the features created from the session interactions can be used as a filtering option or a guide for choosing recommended items, which can lead to more personalised recommendations.

Fig. 1 Interaction frequency and weekday correlation

Fig. 2 Session distribution by hour

Fig. 3 Session distribution by session duration for purchase sessions

Fig. 4 Session distribution by session duration for non-purchase sessions

Fig. 5 Distribution of first purchase and last purchase positions of sessions in test dataset

Fig. 8 Fig. 9
Fig. 8 Utility function for purchase session (p = 1 means 2nd viewed item in the session)

Fig. 10 EPP score distribution for DT classifier trained on Category C undersampled dataset

Fig. 11 EPP score distribution for Bagging classifier trained on Category C undersampled dataset

Fig. 12 Scored feature importance using DT classifier

Table 1
Recent works about purchase prediction in literature

Table 2
Test session details

Bagging (Pedregosa et al. 2011) is an ensemble meta-estimator that trains base classifiers each on random subsets of the original dataset and then aggregates their individual predictions by voting or by averaging to form a final prediction. The KNN classifier (Qian and Rasheed 2007) uses a simple majority vote of the nearest neighbours of each point for classification. A query point is assigned the data class which has the most representatives within the nearest neighbours of the point. Lastly, NB (Wang and Tseng 2015) is a classifier that can be extremely fast compared to other classifier methods and requires less data to produce a well-performing model. All these models have their specific learning methods. We conduct experiments for each model to identify which model performs better. We use the scikit-learn (Pedregosa et al. 2011) Python ML library for the experiments.

Table 3
Description of confusion matrix
TP(s,i): Positive prediction of item i in purchase session s.
U FN(s,i)

Table 4
Performed test categories and their details

Table 5
AUC score comparison of the early purchase prediction (EPP) framework on NB, RF, Bagging, DT and KNN classifier models for different categories and class imbalance methods

In general, the highest AUC score for early purchase prediction is produced by the DT classification model, with an AUC score of 97%. Moreover, we analysed the two highest-performing models in terms of how early they succeed in purchase prediction. The first model is DT trained on the Category C undersampled dataset, and the second model is Bagging trained on the Category A undersampled dataset.

However, when undersampling is applied, the AUC score of the KNN classifier improves to 93%. DT is the most robust classification model for imbalanced data, with the highest AUC score (its lowest AUC score is 0.89).

Table 6
Confusion matrix analysis of DT, Bagging and KNN classifier models trained on the Category C undersampled dataset