Clustering
With the prepared target dataset, we sought to determine whether consumers can be segmented meaningfully in terms of recency, frequency and monetary values. The k-means clustering algorithm was employed for this purpose; it can be run using the Cluster node in SAS Enterprise Miner.
As is well known, the k-means clustering algorithm is very sensitive to datasets that contain outliers (anomalies) or variables of incomparable scales or magnitudes. The histograms of the variables Recency, Frequency and Monetary of the target dataset in SAS Enterprise Miner, illustrated in Figure 2, show that a few instances have monetary and frequency values quite different from those of the majority of the instances in the dataset. These instances are valid from the business point of view, as they are genuine transaction records; however, they are outliers from the data analysis point of view. Therefore, these instances should be isolated from the majority and treated separately. In addition, the three variables are not on comparable scales, and their value ranges are quite different: Recency [0, 12], Frequency [1, 169] and Monetary [3, 88 125]. As such, these variables should be normalized before the clustering analysis.
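The effect of normalization on the three value ranges can be illustrated with a minimal sketch. It assumes that normalization means a linear rescaling of each variable onto [0, 1]. The function name range_normalize and the three customers are hypothetical; only the range end points are taken from the text.

```python
def range_normalize(values):
    """Linearly rescale a list of numbers onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical customers: (recency in months, frequency, monetary in pounds).
# The first two rows sit at the end points of the ranges quoted in the text.
customers = [
    (0, 169, 88125.0),
    (12, 1, 3.0),
    (6, 85, 44064.0),
]

# Normalize each variable (column) independently.
recency, frequency, monetary = (list(column) for column in zip(*customers))
scaled = list(zip(range_normalize(recency),
                  range_normalize(frequency),
                  range_normalize(monetary)))

for row in scaled:
    print(row)
```

After rescaling, a difference of a few thousand pounds in Monetary no longer dominates a difference of several months in Recency when Euclidean distances are computed.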
On the basis of this initial insight into the dataset, a project diagram was set up in SAS Enterprise Miner for the clustering analysis, as depicted in Figure 3. There are four nodes in the diagram. In the Data Sources (Target Dataset) node, the three variables Recency, Frequency and Monetary were chosen as input for the clustering analysis. The Filter node was set to exclude from the analysis any instances having a rare value for any of the variables involved, and the minimum cutoff value for rare values was set to 1 per cent of the total number of instances under consideration. For example, out of the total 3799 instances, there was only one instance with a monetary value of more than £87 684, and therefore that instance was excluded from the analysis. Overall, 73 instances were excluded by the Filter node, and the summary of the resultant filtered target dataset is given in Table 5. In the Cluster node, the standard range transformation was used for normalization, with the number of clusters specified as 3, 4 and 5, respectively. Finally, the Segment Profile node was used to help interpret each cluster found.
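The clustering step for several candidate numbers of clusters can be sketched in a few lines. This is an illustrative k-means implementation only, not the procedure used internally by the Cluster node: the deterministic farthest-first initialization is an assumption made for reproducibility, the six range-normalized points are hypothetical, and the Filter node is not reproduced.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic seeding (an assumption): start from the first point,
    then repeatedly add the point farthest from the chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids)))
    return centroids


def k_means(points, k, max_iterations=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    centroids = farthest_first(points, k)
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points]
        if new_assignment == assignment:
            break  # no instance changed cluster, so the result is stable
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(d) / len(members)
                                     for d in zip(*members))
    return centroids, assignment


def within_cluster_ss(points, centroids, assignment):
    """Total squared distance of every point to its own centroid."""
    return sum(squared_distance(p, centroids[a])
               for p, a in zip(points, assignment))


# Hypothetical range-normalized (recency, frequency, monetary) values.
points = [
    (0.90, 0.05, 0.02), (0.80, 0.10, 0.03),  # lapsed, low spending
    (0.10, 0.90, 0.95), (0.05, 0.80, 0.85),  # recent, frequent, high spending
    (0.20, 0.30, 0.20), (0.30, 0.25, 0.15),  # ordinary shoppers
]

for k in (3, 4, 5):
    centroids, assignment = k_means(points, k)
    print(k, assignment, within_cluster_ss(points, centroids, assignment))
```

Comparing the assignments and the within-cluster sum of squares across the candidate values of k is one simple way to judge which segmentation is easiest to interpret.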
Table 5 Summary of the filtered target dataset (3726 instances)
The clustering and segment results with five clusters are shown in Tables 6 and 7, and the distribution of the instances within each cluster is detailed in Figures 4 and 5. This segmentation by five clusters seems to have a clearer interpretation of the target dataset than the ones by three and four clusters.
Table 6 Instances in each cluster
Table 7 Statistics of each cluster
Understanding the clusters
Interpreting and understanding each cluster identified is crucial in generating customer-centric business intelligence.
Table 7 and Figures 4 and 5 show that each cluster indeed contains a group of consumers with certain distinct and intrinsic features, as detailed below.
Cluster 1 relates to some 527 consumers, comprising about 14.1 per cent of the whole population (527 of the 3726 instances). This group seems to be the least profitable group, as none of the customers in this group purchased anything in the second half of the year. Even in the first half of the year, these consumers did not shop often, and the average value of frequency was only 1.3.
In contrast to the customers in cluster 1, the 188 customers in cluster 5 mainly started shopping with the online retailer at the beginning of the year and continued to the end of the year, with an average recency of 0.7. They purchased quite often and, as a result, spent a considerable amount of money. This group of consumers can be categorized as very high recency, very high frequency and very high monetary, with a high spending per consumer. In fact, those 188 consumers contributed 25.5 per cent of the total sales in the year. This group, although the smallest (comprising only 5.05 per cent of the whole population), seems to be the most profitable group.
Cluster 4 contains some 627 consumers with very high values for frequency and monetary, although lower than those of cluster 5. This group seems to be the second most profitable group.
There are some 459 consumers in cluster 2. Compared with clusters 4 and 5, this group of customers has a lower frequency throughout the year and a significantly smaller average value of monetary, indicating a much smaller amount of spending per consumer. This group can be categorized as low recency, high frequency and medium monetary, with a medium spending per consumer.
Cluster 3 is the largest group, with 1748 consumers. Consumers in this group have a reasonable value of frequency. Compared with clusters 2 and 4, this group has a lower but reasonable value of monetary, as the group includes many newly registered consumers who started shopping with the retailer very recently. This group seems to represent ordinary consumers and therefore carries a certain level of uncertainty in terms of profitability. In the long term, some of these consumers might become highly profitable, while others might not be profitable at all.
Figure 6 summarizes the analysis so far: of the whole population of consumers, 47 per cent were ordinary shoppers with reasonable spending and frequency, about 34 per cent were medium to high profit, 5 per cent were extremely high profit, and the remaining 14 per cent were extremely low profit. About 22 per cent of the consumers contributed roughly 60 per cent of the total sales. Overall, the business seems to be quite healthy in terms of profitability.
Enhancing clustering analysis using decision tree
As discussed above, cluster 3 is the most diverse of the five identified clusters, in the sense that it contains both newly registered and long-standing customers. To refine the segmentation of the instances in this cluster, a decision tree was used to create nested segments inside the cluster, as shown in Figure 7. In other words, these nested segments form sub-clusters inside cluster 3 and make it possible to categorize the consumers concerned into sensible sub-categories. For example, as shown in Figure 7, the customers can be divided into such categories as frequency more than 2.5 with an average monetary value of 990.66; frequency more than 2.5 and less than 3.5 with an average monetary value of 1056.70; and so on. It is also interesting to note that the relationship between frequency and monetary seems to be monotonic and approximately linear.
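How a tree arrives at frequency thresholds such as 2.5 or 3.5 can be illustrated with a minimal sketch of a single regression-tree split. It assumes that monetary value is the target and that a split is chosen to minimize the summed squared error of the two resulting segments; this is a common criterion, not necessarily the one configured in SAS Enterprise Miner. The function best_split and the eight customers are hypothetical and do not reproduce the values in Figure 7.

```python
def mean(values):
    return sum(values) / len(values)


def sum_squared_error(values):
    """Squared deviation of the values from their own mean."""
    if not values:
        return 0.0
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values)


def best_split(frequency, monetary):
    """Find the frequency threshold that best separates monetary values.

    Candidate thresholds are midpoints between consecutive distinct
    frequency values, which is why integer counts give cut points such
    as 2.5. Returns (threshold, mean_below, mean_at_or_above), or None
    when frequency takes a single value and no split is possible.
    """
    levels = sorted(set(frequency))
    best = None
    for lower, upper in zip(levels, levels[1:]):
        threshold = (lower + upper) / 2
        below = [m for f, m in zip(frequency, monetary) if f < threshold]
        above = [m for f, m in zip(frequency, monetary) if f >= threshold]
        error = sum_squared_error(below) + sum_squared_error(above)
        if best is None or error < best[0]:
            best = (error, threshold, mean(below), mean(above))
    return None if best is None else best[1:]


# Hypothetical customers from one cluster: purchase counts and spending.
frequency = [1, 1, 2, 2, 3, 3, 4, 4]
monetary = [300.0, 320.0, 600.0, 640.0, 1000.0, 1100.0, 1500.0, 1600.0]

print(best_split(frequency, monetary))
```

Applying the same search recursively to each resulting segment would produce nested sub-segments of the kind described above.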