1 Introduction

In today’s business landscape, companies are faced with the challenge of identifying potential customers who are most likely to respond positively to a product or offer, this is where data mining techniques come into play. With the increasing amount of data available, data mining has become an essential tool for direct marketing efforts, allowing companies to create a prediction response model based on past client purchase data. This study aims to present a data mining preprocessing method for developing a customer profiling system that improves the sales performance of an enterprise. The study uses an RFM analysis methodology to evaluate client capital and a boosting tree for prediction. Furthermore, the study highlights the importance of customer segmentation methods and algorithms in increasing the accuracy of the prediction. The main result of this study is the creation of a customer profile and forecast for the sale of goods, which will assist decision-makers in making strategic marketing decisions. The study is expected to provide valuable insights for companies looking to improve their direct marketing efforts and increase sales performance through data mining-based customer profiling [1,2,3].

The proposed methodology in this study utilizes the RFM analysis (recency, frequency, and monetary) approach to assess client capital, coupled with a boosting tree algorithm for predictive modeling. Additionally, the study emphasizes the crucial role of customer segmentation methods and algorithms in enhancing prediction accuracy. The primary outcome of this research is the development of a customer profile that offers valuable insights into customer behavior and sales forecasts for goods. However, there is no explicit mention of any secondary results or derivative findings throughout the introduction.

To achieve the research goal of enhancing sales performance through data-driven customer profiling, the study will address a series of key tasks. These tasks include data collection, a comprehensive study of machine learning methods, specifying the structure of client profiles along with their types and relevant indicators, analyzing and organizing customer data, systematizing international practices for improving client profiling, identifying effective methods for researching client profiles, and exploring the concept of "consumer loyalty" in modern marketing. Furthermore, the research aims to clarify the nature and structure of consumer loyalty, highlight foreign experiences in enhancing consumer loyalty, and emphasize the factors influencing the selection of reward systems for developing comprehensive consumer loyalty programs for goods and services manufacturers. Practical recommendations for the formation of such loyalty programs will also be formulated.

Deep learning is a subfield of machine learning that has seen widespread applications in various industries. Deep learning models have been applied to tasks such as text classification, sentiment analysis, machine translation, speech recognition [4], and table detection and recognition [5,6,7,8]. Health care is another industry where deep learning has found several applications, including diagnosis, treatment planning, drug discovery [9], and medical imaging analysis [10,11,12]. In robotics, deep learning is used for autonomous navigation, object recognition [13,14,15], and robotic control, handwritten recognition for various languages [16,17,18,19,20], questions–answering [21,22,23,24,25], intrusion detection in IoT [26,27,28], and energy consumption prediction [29, 30].

The research focuses on studying enterprises, organizations, and examining their marketing activities within the context of client policy formation and implementation. Consumers of goods and services are also within the scope of the study. The research delves into the entirety of economic and organizational relationships that arise as firms implement relationship marketing, particularly in creating and implementing programs to build consumer loyalty.

The study’s theoretical and methodological foundation is built upon essential research by internal and international scientists in market economy, management, marketing, consumer, and brand loyalty management. Methodologies such as marketing, economic, and statistical analysis, as well as quantitative and qualitative study principles, were utilized. Expert methods were also employed to substantiate the main research provisions. The study also draws inspiration from the work of authors Muller and Hamm [31], emphasizing the significance of starting with segmentation, marketing, and customer data adjustment to achieve accurate analysis and profiling.

The scientific novelty of this research lies in the development of scientific and methodological provisions and recommendations focused on creating and implementing a client profiling framework using AI techniques. Additionally, the study identifies the most effective methods for researching client profiles and enhancing consumer loyalty in Kazakhstani enterprises, exemplifying its applicability in a specific context.

In conclusion, this research aims to provide valuable insights for companies seeking to improve their relationship marketing efforts and enhance sales performance through data-driven customer profiling. With the vast array of customer needs, behavior, and preferences observed in online business platforms, customer segmentation becomes crucial. The study will explore customer segmentation, its importance in understanding customer behavior, and the role of AI in this process, offering practical insights for organizations looking to improve customer retention and benefit upgrades through data-driven customer segmentation.

2 Related work

In the field of customer segmentation, researchers have been experimenting with different algorithms to perform segmentation on customer data. Most of these studies have focused on analyzing customer buying history and purchasing behavior to identify segments. In the following paragraphs, related work methods will be explained followed by a table, i.e., Table 1, that summarizes the advantages and disadvantages.

According to Jiang and Tuzhilin [32], it is crucial to implement both customer segmentation and buyer targeting in order to enhance marketing performance. These two tasks are integrated into a step-by-step approach; however, the challenge of unified optimization arises. To address this issue, the authors proposed the K-classifiers segmentation algorithm. This method prioritizes allocating more resources to those customers who generate the most returns for the company. A significant number of researchers have discussed various techniques for segmenting customers in their studies. Also, the authors propose a direct clustering method for grouping customers. Rather than relying on computed statistics, this approach utilizes transactional data from multiple customers. The authors also acknowledge that finding an optimal segmentation solution is computationally difficult, known as NP-hard. Therefore, Tuzhilin presents alternative sub-optimal clustering methods. The study then experimentally evaluates the customer segments obtained through direct grouping and finds them to be superior to statistical methods.

Kashwan [33] proposed a K-means algorithm and a statistical tool to propose a model that elaborates on a continuous analysis and online framework for an e-commerce organization to predict sales. They involved a clustering strategy for determining market segmentation because a developed computing-based system is intelligent enough to address results to managers for a quick and fast decision-making cycle.

Brito [34] emphasized that advertising and manufacturing approaches are highly important for customized industries because buying a large variety of products makes it difficult to find specific patterns of customer preferences. As a result, they proposed two different data mining methods, clustering and sub-cluster discovery, for customer segmentation to better understand customer preferences.

He and Li [35] propose a three-dimensional strategy for enhancing customer lifetime value (CLV), customer satisfaction, and customer behavior. The study concludes that consumers have varying needs, and segmentation helps to identify their demands and expectations, which, in turn, leads to providing better service.

Sheshasaayee [36] developed a new integrated approach to segmentation by combining the RFM (recency, frequency, and monetary) and LTV (lifetime value) methods. They employed a two-phase approach, starting with a statistical method in the first phase and then proceeding to cluster in the second phase. The objective is to apply K-means clustering following the two-phase model and then utilize a neural network to improve the segmentation.

Ballestar [37] proposed the role of customers in the use of their cashback and determined the business activity and behavior of customers on the site of a social network. They proposed a model that applied social network analysis to marketing such as loyalty, communication, customer development, and customer engagement to show the dependence of customers’ positions within an organization.

Qadadeh [38] proposed the evaluation of data analysis algorithms such as K-means for clustering and self-organized maps for the nature of clustering with visualization. They recommend that involving various procedures for segmentation with experts will further develop organizations such as insurance and study segment elements and behavior of a customer in any customer relationship management dataset.

Several studies have demonstrated the extensive use of RFM technology for customer segmentation and information access. In the context of commercial banks, marketing representatives can employ K-means classification to identify potential customers. To extract valuable insights from customers, data mining methods, including neural networks, C5.0, classification and regression trees, and Chi-squared automatic interaction detector, are highly beneficial for detecting background information related to credit card holders [39].

Christy [40] emphasized that a good understanding of the customer’s needs and identification of potential customers for the organization are satisfied by the segmentation process. They performed segmentation using RFM analysis and extended it to other algorithms such as K-means and RM K-means through minor adjustments in K-means clustering.

Table 1 Related work methods, advantages and disadvantages

3 Problem statement

The problem of customer segmentation can be based on various factors such as marketing, sales, support, product, and leadership. Experts in large or small organizations involved in the data analysis process adjust the working group and set the expectations that it will continue to do so in many stages. Some issues that can be resolved through customer segmentation are given below.

  • Marketing We can solve the problem by understanding our customer base to effectively reach them. We may not be able to observe the business’s email lists using the task to be done, but we can observe ones for business to consumer (B2C) subscription organizations with high website traffic volume.

  • Sales Many issues faced by sales representatives can be resolved by this process. We can route prospects to our self-service stream or the most appropriate group within sales, such as startups, small market businesses, and multi-model businesses, based on clear customer segments.

  • Support Issues are categorized based on their tool and field. After categorization, it can be used to route support inquiries to the appropriate channels, such as AnswerBot, Alexa, Google Assistant, our help center, or a support representative, to improve customer and business outcomes further.

  • Product This process can also resolve issues with product quality. Experts should know which product requests and feedback make the biggest impact on which customer and focus accordingly, instead of by volume alone.

  • Leadership It manages the mission run by e-commerce organizations to deliver their service and make a lead. For this, they create a common language for the product and design and go to markets to describe the customers.

In this paper, we proposed a customer segmentation strategy based on various categories. Different clustering methods such as K-means, repetitive median-based K-means (RM K-Means), and self-organized maps were used for segmentation. We proposed a business model for e-commerce organizations based on segmentation according to various categories and recency, frequency, and monetary (RFM) positioning to retain and acquire customers in e-commerce. Observing new customers is important, but retaining old customers is even more important.

4 Model, tools, environment, and technology

4.1 The customer segmentation approach

Client segmentation is a widely used marketing strategy that involves dividing the customer base into smaller groups based on characteristics such as demographics, behavior, and purchasing history. This enables businesses to understand their customer base better and implement more effective marketing strategies. Vector quantization is an algorithm commonly used for client segmentation, automatically grouping customers based on their behavior data. While it may not always achieve optimal results, it provides valuable insights for businesses to target their marketing efforts. A mapping or vector quantizer can be used to divide data into smaller groups. The mapping is an N-level k-dimensional tool that takes various client RFM values as input vectors. It uses a non-negative real distortion measure to represent the difference between the original and reproduced vectors. The error distortion measure, widely used in mathematical applications, is chosen for its computational efficiency [41].

Data mining techniques have emerged as essential tools in market segmentation. This modern approach to market research involves processing vast datasets from databases using intelligent solutions, such as neural networks, evolutionary algorithms (EA), fuzzy theory, RFM, hierarchical clustering, K-means, bagged clustering, kernel methods, Taguchi method, multidimensional scaling, model-based clustering, and rough sets, among others. These techniques offer highly effective and time-efficient means of segmenting the market [42].

Quantizer optimality is determined based on its ability to minimize average distortion. An N-level quantizer is considered ideal or globally ideal if it achieves this, or at least, it is ideal if, for any remaining quantizers, their distortion is greater than the globally ideal quantizer [40]. Quantizer design aims to obtain an optimal or locally optimal quantizer if possible, and various algorithms have been proposed for this purpose in the literature.

4.2 Machine learning

Interest in machine learning (ML) has grown due to increased processing power and data availability. ML utilizes past experience to enhance performance and make precise predictions. Tasks include classification, regression, ranking, clustering, dimensionality reduction, and complex learning.

  • Data Preprocessing and Model Optimization It is crucial for AI model creation, involving cleaning, standardization, transformation, feature extraction, and selection.

  • Data Cleaning and Transformation Class imbalance in ML can lead to issues such as improper evaluation metrics and overfitting. Techniques such as oversampling and undersampling can address this.

  • Missing Data Handling missing values involves deletion or imputation with estimated values.

  • Sampling Preprocessing plays a vital role in AI model creation, impacting performance and interpretability. Techniques such as oversampling and undersampling can address class imbalance but have drawbacks.

  • Feature and Variable Selection Feature selection is critical for identifying relevant data, improving predictive performance, and efficiency. Various methods can be used based on the dataset and computational resources.

4.3 Model for customer segmentation and market segmentation

Commonly used models for customer segmentation include:

  • Demographic segmentation;

  • Recency, frequency, and monetary (RFM) segmentation;

  • Customer status and behavioral segmentation.

Segmentation based on gender is a simple yet effective way for organizations to categorize their customer base, allowing for targeted content and promotions for gender-specific events. RFM segmentation is widely used in the direct mail industry to rank customers based on their purchasing history, considering recency, frequency, and monetary value of purchases [34].

Client status and behavior analysis involve categorizing clients into active and lapsed based on their last purchase. Behavioral analysis analyzes past customer behavior, such as shopping habits and brand preferences, to predict future actions. Data analysts use this information to segment clients into categories and develop strategies for customer retention and acquisition [43].

Market division, first defined in 1956, is a method used by organizations to categorize customers based on similar characteristics, such as geographic location, demographics, product usage, and purchasing behavior. The goal is to increase customer satisfaction and maximize efficiency by tailoring marketing efforts to specific segments. One common tool used in the market division is clustering, which groups elements with similar values into segments [44,45,46]. While early market division studies only considered one set of factors, modern market division models take into account multiple sets of factors simultaneously, called cooperative market division. There are various market division methods, including K-means clustering, hierarchical clustering, association rule mining, decision trees, and neural networks. The objective is to identify and describe customer groups and reach profitable customer segments [47].

Clustering data is a significant aspect of data mining techniques, involving the utilization of latent class analysis (LCA), prior clustering, and various similarity or distance measures to segment large customer groups based on individual expectations [48]. Sánchez-Fernández [49] presents a conceptual framework centered around tourists’ perception of sustainability policies at various destinations, along with a multidimensional measure to assess this construct. Through an empirical analysis conducted across five Mediterranean destinations, the proposed conceptual model was validated, offering substantial empirical evidence supporting the viability of perceived sustainability as a valuable factor in segmentation studies.

4.4 Customer segmentation and client profiling

Market and customer segmentation are often used interchangeably, with market segmentation seen as a high-level strategy and customer segmentation providing a more granular view. The RFM model is a valuable tool for combining customer segmentation and targeting in campaigns [34]. Genetic algorithms can enhance the customer division and targeting process, using the LTV model as a fitness function [38]. Various mathematical methods have been explored for customer segmentation, including statistical techniques, neural networks, genetic algorithms, and K-means fuzzy clustering.

Client profiling involves analyzing client characteristics for tailored marketing strategies, contributing to customer retention and CRM [50]. Segment profiling focuses on understanding specific customer group attributes to guide marketing strategies. Buyer behavior profiles consider social factors such as timing, benefits sought, usage rate, loyalty, and attitude for targeted marketing efforts [51].

The proposed architecture of customer profiling, depicted in Fig. 1, offering a comprehensive approach to understanding customer behavior and preferences.

Fig. 1
figure 1

Processes of customer segmentation

5 Results and discussion

5.1 Dataset

The data utilized for our research were sourced from the Marketing Campaign datasetFootnote 1, which encompasses a cross-border dataset that encompasses several key demographic attributes, including age, education level, ID, annual income, marital status, and presence of children in the household, as shown in Table 2.

Table 2 Attributes of first datasets

The RFM model employed in this study utilized data from the SAS Institute to calculate the recency, frequency, and monetary rankings, enabling the segmentation of customers into distinct groups. The data comprise the following attributes as shown in Table 3.

Table 3 Attributes of second datasets

5.2 Preprocessing

This study employs three distinct algorithms to perform customer clustering utilizing RFM analysis. Initially, the data undergo preprocessing to eliminate outliers and extract pertinent instances. Outliers are detected using the z-score method, which assesses data’s proximity to its mean and standard deviation. This relationship is transformed into a scale from 0 to 1, with values deviating significantly from the mean (zero) identified as outliers.

5.3 Customer profiling approach

Within this investigation, following the completion of data preprocessing, the dataset proceeds to the customer profiling phase. In this phase, the K-means algorithm is applied to the dataset for clustering purposes, process of K-means on RFM analysis shown in Fig. 2. Subsequently, the outcomes of the K-means clustering are subjected to validation procedures aimed at determining the optimal cluster value (K). This validation is executed using three distinct metrics: the Elbow method, the Silhouette coefficient, and the Gap Statistics method. The graphical representation of these validation results is depicted in Fig. 3 for the Elbow method, Fig. 4 for the Silhouette coefficient, and Fig. 5 for the Gap Statistics method. Importantly, the Matthews correlation coefficient (MCC) scorer, known for its ability to accommodate classes of varying sizes, is employed to assess the effectiveness of the chosen methodology.

Fig. 2
figure 2

Processes of RFM analysis

The outcome of the Elbow method analysis illustrates a decline in the value of cluster inertia, also known as the sum of squared errors (SSE), as the number of clusters increases. From the graphical representation, it becomes apparent that potential candidates for the optimal K value reside within the range of K = 2 and K = 8. This observation is guided by the appearance of a discernible "elbow" shape in the graph, where the decrease in SSE starts to plateau. Nonetheless, the validation process remains essential and will involve the assessment of the other two metrics for confirming the optimal cluster configuration.

Fig. 3
figure 3

Elbow method

Fig. 4
figure 4

Silhouette method

Fig. 5
figure 5

Gap Statistics method

The second validation involves employing the Silhouette coefficient method, which yields the Silhouette scores for each cluster as presented in Table 4. This method allows for a comparative assessment with DP Agustino [52].

Table 4 Silhouette scores for different K values

The Matthews correlation coefficient (MCC) for the test data was computed as 0.88, suggesting potential challenges in accurately categorizing positive examples within the test set.

Based on the outcomes of the clustering analysis, the findings reveal the existence of three distinct clusters that serve as foundational categories for digital start-ups in executing customer profiling strategies. The initial category (designated as Cluster A) pertains to new customers, denoting those who have initiated their first purchase of products within the digital start-up’s domain. In the context of this research, it is imperative to augment engagement with these new customer segments by aligning strategies with their specific needs, thereby enhancing the prospect of subsequent product repurchases. Particularly, for platform-based Edutech start-ups, augmenting the platform with enriched educational content and providing tailored teacher support emerge as a pivotal strategy to amplify customer relevance and sustained interest.

The second category (identified as Cluster B) encompasses the best customers. This category comprises individuals who have engaged in multiple purchases, particularly emphasizing recent transactions. Customers within this cluster exhibit a pronounced potential to be offered each new product launch by the digital start-up. Furthermore, to foster more robust customer loyalty, the provision of exclusive discounts to customers within this category presents an effective approach.

Cluster C, the third category, is composed of intermittent customers. These customers display a sporadic pattern of engagement, characterized by occasional purchases and fluctuations in their interaction frequency. For digital start-ups, devising targeted marketing efforts that encourage consistent engagement from these intermittent customers can enhance their loyalty and transform them into more regular purchasers. Tailored promotions and personalized offers are instrumental in motivating these customers to establish a more enduring connection with the digital start-up’s offerings.

6 Conclusion

In conclusion, this research delved into the critical domain of customer profiling within the context of digital start-ups. Through a comprehensive analysis of clustering algorithms and validation methodologies, we successfully identified distinct customer clusters that offer invaluable insights for tailored business strategies. The utilization of K-means clustering, coupled with validation metrics such as the Elbow method, Silhouette coefficient, and Gap Statistics method, provided a robust foundation for customer segmentation.

The results revealed three primary clusters that serve as significant touchpoints for digital start-ups to refine their customer engagement tactics. Cluster A, representing new customers, necessitates tailored approaches to enhance their initial experience and foster repurchase potential. In the realm of platform-based Edutech start-ups, offering enriched learning content and personalized support emerges as a potent strategy. Cluster B, housing the best customers, signifies a vital avenue for product promotion and customer loyalty enhancement. Customized incentives and exclusive offerings can solidify their engagement and elevate their lifetime value. Cluster C, comprising intermittent customers, highlights an opportunity to re-engage and cultivate consistency. Strategic interventions, such as targeted promotions and individualized incentives, can transform intermittent customers into steady patrons.

In the broader landscape of digital start-ups, the outcomes underscore the paramount importance of customer profiling in enhancing business outcomes. By acknowledging the nuanced requirements of different customer clusters, start-ups can forge more meaningful and enduring connections, thereby fostering growth, customer satisfaction, and long-term success. As the digital landscape continues to evolve, these insights hold the potential to guide start-ups toward informed decisions that resonate with their customer base, fostering a symbiotic relationship between innovation and consumer needs.

7 Future work

In the future work, more advanced methods for predicting customer churn may be explored, such as weighted random forests and hybrid models that can handle unstructured data. This would enable the extraction of relevant attributes for potential customer segmentation studies in the retail industry. As highlighted in the literature review, using hybrid models has shown promising performance gains and could be a strategy to improve the models.

Artificial intelligence has the potential to revolutionize various industries by transforming existing business processes and creating new business models. Key areas of focus include consumer engagement, digital manufacturing, smart cities, autonomous vehicles, risk management, computer vision, and speech recognition. AI has already demonstrated positive results in a range of sectors including health care, law enforcement, finance, security, trade, manufacturing, education, mining, and logistics.