# KNNs and Sequence Alignment for Churn Prediction

## Abstract

Large companies interact with their customers to provide a variety of services to them. Customer service is one of the key differentiators for companies. The ability to predict whether a customer will leave, in order to intervene at the right time, can be essential for pre-empting problems and providing a high level of customer service. The problem becomes more complex as customer behaviour data is sequential and can be very diverse. We present an efficient sequential forecasting methodology that can cope with the diversity of customer behaviour data. Our approach combines KNN (K-nearest neighbour) with sequence alignment: temporal categorical features of the extracted data are exploited to predict churn using a sequence alignment technique, and the diversity of the data is addressed by considering subsets of similar sequences based on KNNs. Empirical experiments demonstrate that our model offers better results than the original KNN; since the original KNN and hidden Markov models (HMMs) applied to the same data set have been reported to be equivalent in performance, this implies that our model also outperforms HMMs.

## 1 Introduction

In a very competitive and saturated market, it is important for customer centric companies to keep their existing customers because attracting a new customer is between 6 to 10 times more expensive than retaining an existing customer [1]. Churn prevention is a strategy to identify customers who are at high risk of leaving the company. The company can then target these customers with special services and offers in order to retain them.

Researchers use different approaches and different types of data to predict churn: neural networks and support vector machines [2], hidden Markov models [3], random forests [4], bagging and boosting decision trees [5], and sequential KNN [6]; demographic data, contractual information, changes in call details and usage data [7], and complaint data [8]. Word of mouth has also been studied in churn prediction [9].

Customer behaviour data consists of asynchronous sequences describing interactions between the customers and the company. To investigate the impact of sequential characteristics on churn detection, sequential approaches are of interest. There is a rich available source of mathematical models in data mining [10] but there don’t seem to be many efficient sequential approaches. The problem seems to be poorly aligned with any single classical data mining technique.

Our objective is to determine similar sequences, expecting that sequences which behave similarly in earlier steps go to the same final step. To achieve this, an original KNN is combined with sequence alignment to form a sequential KNN. A sequential KNN using Euclidean distance is introduced in [6]. As our data consists of event sequences, in this study we use the matching scores from sequence alignment as distance. The problem can be formulated as follows. A customer behaviour record \((S_j)\) is a composition of discrete events in a time-ordered sequence, \(S_j = \left\{ s_1^{(j)}, s_2^{(j)}, \ldots , s_{n_j}^{(j)}\right\} \), \(s_k\) takes values from a finite set of event types \(E = \left\{ e_1,\ldots ,e_L\right\} \). The goal of our models is to predict churn for a given event sequence \(S_{N+1} = \left\{ s_1^{(N+1)}, s_2^{(N+1)},\ldots , s_{i-1}^{(N+1)}\right\} \) based on the data from classified sequences \(S\) = \(\left\{ S_1, S_2, \ldots , S_N\right\} \) where we know if a customer has churned or not.
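The data representation described above can be sketched in a few lines. This is a hedged illustration, not the paper's data: the event names and labels are invented for the example, and churn is encoded as 1.

```python
# Illustrative sketch of the formulation above: each customer record S_j is a
# time-ordered sequence of events drawn from a finite set E of event types.
# Event names ("call", "complaint", ...) are assumptions for the example.
EVENT_TYPES = ["call", "complaint", "upgrade", "payment"]

# Classified training sequences S_1..S_N with known outcomes (1 = churned).
training_data = [
    (["call", "payment", "call"], 0),
    (["complaint", "complaint", "call"], 1),
]

# A new, partially observed sequence S_{N+1} whose outcome we want to predict.
new_sequence = ["complaint", "call"]

# Every event must come from the finite alphabet E.
assert all(e in EVENT_TYPES for seq, _ in training_data for e in seq)
assert all(e in EVENT_TYPES for e in new_sequence)
```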

The remainder of this paper is organised as follows. Section 2 presents the proposed model. It is followed by Sect. 3, where we evaluate its performance on a data set. Section 4 provides conclusions and future research directions.

## 2 Sequential KNNs

KNN is one of the classical approaches in data mining [11]. It can be used in its original non-sequential form [3] or extended into a sequential approach [6]. The core idea of KNNs is to find similar sequences based on distance functions. We adopt sequence alignment to define a distance measure that is expected to be suitable for event sequences [12]. The approach obtained by coupling KNN with sequence alignment is named KNsSA. In this work, both types of alignment, global and local, are investigated to verify which one is more effective for measuring the similarity between two given sequences in order to predict process outcomes.
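The coupling of KNN with an alignment score can be sketched as follows. This is a minimal illustration under our own naming (`knssa_predict`, `score_fn` are not from the paper): any alignment scoring function is plugged in as the similarity, and the label is decided by majority vote among the K most similar training sequences.

```python
from collections import Counter

def knssa_predict(new_seq, training_data, score_fn, k=3):
    """Sketch of the KNsSA idea: classify `new_seq` by majority vote among
    the k training sequences with the highest alignment score, where the
    score is used as a similarity measure.

    `training_data` is a list of (sequence, label) pairs; `score_fn(a, b)`
    is any alignment scoring function (global or local).
    """
    # Rank training sequences by similarity to the new sequence, highest first.
    ranked = sorted(training_data,
                    key=lambda pair: score_fn(new_seq, pair[0]),
                    reverse=True)
    # Majority vote over the labels of the k nearest sequences.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Any scoring function with the signature `score_fn(seq_a, seq_b) -> float` can be substituted, so the same skeleton serves both the global and the local variant.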

*Global algorithm.* One such algorithm was introduced by Needleman and Wunsch [13]. There are three characteristic matrices in this algorithm: the substitution matrix, the score matrix and the traceback matrix.

1. Substitution matrix: in biology, a substitution matrix describes the rate at which one amino acid in a sequence transforms to another amino acid over time. In our customer behaviour data no mutation occurs, so we use the simplest form of substitution matrix:
$$s(i,j) = \left\{ \begin{array}{ll} 0 & \text{if event } i \ne \text{event } j\\ 1 & \text{otherwise} \end{array} \right.$$

2. Score matrix: the elements of this matrix are similarity degrees of events from the two given sequences, taking the event positions into account. With \(i = 1,\ldots ,len_1\) and \(j = 1,\ldots ,len_2\), the initial values are
$$h_{i0} = -\delta \times i, \qquad h_{0j} = -\delta \times j$$ (1)
where \(\delta\) is a deletion/insertion penalty value chosen by the user. Let \(x_i\) and \(y_j\) be the events at positions \(i\) and \(j\) of the given sequences, and \(s(x_i,y_j)\) the value from the substitution matrix corresponding to events \(x_i\) and \(y_j\). Then
$$h_{ij} = \max \left\{ h_{i-1,j} - \delta ,\; h_{i-1,j-1} + s(x_i,y_j),\; h_{i,j-1} - \delta \right\}$$ (2)

3. Traceback matrix: the elements of this matrix are left, diag or up, depending on the corresponding \(h_{ij}\) from the score matrix. This matrix is used to trace back from the bottom-right corner to the top-left corner, following the indication within each cell, to find the optimal matching path:
$$q(i,j) = \left\{ \begin{array}{ll} diag & \text{if } h(i,j) = h(i-1,j-1) + s(x_i,y_j)\\ up & \text{if } h(i,j) = h(i-1,j) - \delta\\ left & \text{if } h(i,j) = h(i,j-1) - \delta\\ \end{array} \right.$$ (3)
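The global scoring recursion can be sketched as follows. This is a minimal implementation of the Needleman-Wunsch score under the 0/1 substitution matrix above; the function name and the default value of `delta` are our assumptions.

```python
def global_alignment_score(x, y, delta=1.0):
    """Needleman-Wunsch global alignment score, a minimal sketch.

    Uses the simple 0/1 substitution matrix from the text; `delta` is the
    user-chosen deletion/insertion penalty (default here is an assumption).
    """
    n, m = len(x), len(y)
    # Score matrix h with boundary values h[i][0] = -delta*i, h[0][j] = -delta*j.
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        h[i][0] = -delta * i
    for j in range(1, m + 1):
        h[0][j] = -delta * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1.0 if x[i - 1] == y[j - 1] else 0.0  # substitution matrix
            h[i][j] = max(h[i - 1][j] - delta,          # "up" (deletion)
                          h[i - 1][j - 1] + s,          # "diag" (match)
                          h[i][j - 1] - delta)          # "left" (insertion)
    return h[n][m]
```

A traceback matrix could be filled alongside `h` to recover the optimal matching path, but for use as a KNN distance only the final score `h[n][m]` is needed.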

*Local algorithm.* The aim of local algorithms [12, 13, 14] is to find the pair of most similar segments from the given sequences. These algorithms also have a substitution matrix and a score matrix like the global algorithms; however, the initial values of the score matrix are set to 0:
$$h_{i0} = h_{0j} = h_{00} = 0$$ (4)
The optimal pair of aligned segments is identified by first determining the highest score in the matrix, then tracking back from that score diagonally up toward the left corner until 0 is reached.
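The local variant can be sketched in the same style. This follows the Smith-Waterman recursion with the zero initialisation of Eq. (4) and the same 0/1 substitution matrix; as above, the function name and default `delta` are our assumptions.

```python
def local_alignment_score(x, y, delta=1.0):
    """Smith-Waterman local alignment score, a minimal sketch.

    Same 0/1 substitution matrix and gap penalty `delta` as the global
    variant, but cell values are floored at zero (Eq. 4), so the score
    reflects the best-matching pair of segments rather than a full alignment.
    """
    n, m = len(x), len(y)
    h = [[0.0] * (m + 1) for _ in range(n + 1)]  # h_{i0} = h_{0j} = 0
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1.0 if x[i - 1] == y[j - 1] else 0.0
            h[i][j] = max(0.0,                       # restart a segment
                          h[i - 1][j] - delta,       # deletion
                          h[i - 1][j - 1] + s,       # match/mismatch
                          h[i][j - 1] - delta)       # insertion
            best = max(best, h[i][j])                # highest score in matrix
    return best
```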

## 3 Evaluation

### 3.1 Data and Benchmarking Models

*RM, Random Model*: to determine the outcome of the process, we generate a random number between 0 and 1; if the generated number is greater than 0.5 the outcome is success (1), otherwise the outcome is failure (0).

*Original KNN*: we choose the \(K\) nearest sequences in terms of common unique tasks. As shown in the work of [3], churn prediction using a sequential approach (an HMM) applied to the same data set \(DS\) does not outperform non-sequential KNNs. Therefore, we benchmark our model only against a non-sequential KNN and show that our way of dealing with sequential, diverse data is more efficient.
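The similarity used by the original-KNN baseline above can be sketched as a count of shared unique tasks; the function name is our own, and ignoring event order and multiplicity is our reading of "common unique tasks".

```python
def shared_unique_tasks(seq_a, seq_b):
    """Baseline similarity for the original (non-sequential) KNN: the number
    of unique event types the two sequences have in common, ignoring order.
    """
    return len(set(seq_a) & set(seq_b))
```

Because this measure discards ordering entirely, comparing it against the alignment-based KNsSA isolates the contribution of the sequential information.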

**Table 1** Local KNsSA applied to the original DS with different \(K\) (\(K = 3, 5, 7\))

| Results / K | 3 | 5 | 7 |
|---|---|---|---|
| Actual tests successful | 793 | 793 | 793 |
| Actual tests failure | 16 | 16 | 16 |
| Predicted tests successful | 803 | 805 | 805 |
| Predicted tests failure | 6 | 4 | 4 |
| Correct predictions | 799 | 797 | 797 |
| Failed predictions | 10 | 12 | 12 |
| Predicted tests success correct | 793 | 793 | 793 |
| Correct ratio | 0.99 | 0.99 | 0.99 |

**Table 2** Local KNsSA applied to DS with different churn/no-churn ratios (\(K = 3\))

| Results / ratio | 0.05 | 0.10 | 0.15 |
|---|---|---|---|
| Actual tests successful | 45 | 83 | 120 |
| Actual tests failure | 14 | 17 | 15 |
| Predicted tests successful | 45 | 84 | 124 |
| Predicted tests failure | 14 | 16 | 11 |
| Correct predictions | 57 | 93 | 127 |
| Failed predictions | 2 | 7 | 8 |
| Predicted tests success correct | 44 | 80 | 118 |
| Correct ratio | 0.97 | 0.93 | 0.94 |

**Table 3** Global KNsSA applied to DS with different churn/no-churn ratios (\(K = 3\))

| Results / ratio | 0.05 | 0.10 | 0.15 |
|---|---|---|---|
| Actual tests successful | 43 | 79 | 121 |
| Actual tests failure | 15 | 16 | 16 |
| Predicted tests successful | 48 | 90 | 134 |
| Predicted tests failure | 10 | 5 | 3 |
| Correct predictions | 43 | 82 | 118 |
| Failed predictions | 15 | 13 | 19 |
| Predicted tests success correct | 38 | 78 | 118 |
| Correct ratio | 0.74 | 0.86 | 0.86 |

### 3.2 Results

For the proposed models, we investigate the effect of \(K\), as it is important to obtain a reasonable number of similar sequences. We now present the results of applying the two models, global and local KNsSA, to \(DS\). Three tables illustrate the different objectives of the experiments. Table 1 shows the effect of varying \(K\), using local KNsSA and the original churn data. Table 2 presents the results obtained with local KNsSA on artificial data sets created by changing the ratio between churn and no-churn sequences in order to decrease the skewness of the data. Table 3 shows the performance of global KNsSA applied to the same artificial data sets.

It can be seen from Table 1 that \(K = 3\) is the best case. Also, when the original data were modified, the performance of the model in terms of the churn detection objective improved even though the overall performance worsened. Intuitively, when the population of no-churn sequences strongly dominates, it is very likely that our model cannot catch the full churn set. Table 1 shows that the precision for churn is 100 % while the corresponding recall is only 37.5 %. With the amended data, the overall performance of the model is reduced, as is the precision for the churn class. Nonetheless, the result is still of interest because both the precision and the recall for churn prediction reach 92.85 %.
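The churn precision and recall quoted above can be checked directly from the Table 1 counts for \(K = 3\): 16 actual churn cases, 6 predicted churn cases, and (from 799 correct predictions minus 793 correct success predictions) 6 correct churn predictions.

```python
# Recomputing the churn-class metrics quoted for Table 1 (K = 3).
actual_churn = 16                    # "Actual tests failure"
predicted_churn = 6                  # "Predicted tests failure"
correct_churn = 799 - 793            # correct predictions minus correct successes

precision = correct_churn / predicted_churn   # all 6 churn predictions correct
recall = correct_churn / actual_churn         # only 6 of 16 churners caught

assert precision == 1.0     # 100 %
assert recall == 0.375      # 37.5 %
```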

The results in Tables 2 and 3 show that the local KNsSA outperforms the global KNsSA when applied to the churn data set. This could be because, in customer behaviour sequences, only a subset of special segments has a strong influence on the churn action.

## 4 Conclusion

In this paper, we propose extensions to KNNs designed to capture the temporal characteristics of the data and to profit from the ability of KNNs to deal with diverse data. These extensions are tested on real customer behaviour data from a telecommunications company, and the experiments provide some interesting results. Even though churn is a rare event, the proposed models correctly capture most of the churn cases, and the precision and recall values for the churn class are very high. This paper confirms our initial view that it is hard to model diverse sequences in a generic way to predict the outcome, and that it is important to use the temporal characteristics of the data. Hence, a KNN is a good candidate because KNNs treat a certain number of similar sequences in the same way. By combining sequence alignment and KNNs we can achieve better results, since this allows two sequences to be compared based on the ordered events themselves.

### References

1. J. Hadden, A. Tiwari, R. Roy, D. Ruta, International Journal of Intelligent Technology **1**(1), 104 (2006)
2. C. Archaux, H. Laanaya, A. Martin, A. Khenchaf, in *International Conference on Information & Communication Technologies: from Theory to Applications (ICTTA)* (2004), pp. 19–23
3. M. Eastwood, B. Gabrys, in *Proceedings of the KES2009 Conference* (Santiago, Chile, 2009)
4. B. Lariviere, D. Van den Poel, Expert Systems with Applications **29**(2), 472 (2005)
5. A. Lemmens, C. Croux, Journal of Marketing Research **43**(2), 276 (2006)
6. D. Ruta, D. Nauck, B. Azvine, in *Intelligent Data Engineering and Automated Learning IDEAL 2006, Lecture Notes in Computer Science*, vol. 4224, ed. by E. Corchado, H. Yin, V. Botti, C. Fyfe (Springer, Berlin, 2006), pp. 207–215
7. C. Wei, I. Chiu, Expert Systems with Applications **23**(2), 103 (2002)
8. *Churn Prediction using Complaints Data*
9. S. Nam, P. Manchanda, P. Chintagunta, Ann Arbor **1001**, 48 (2007)
10. R. Duda, P. Hart, D. Stork, *Pattern Classification* (Wiley, New York, 2001)
11. M. Berry, G. Linoff, *Data Mining Techniques: for Marketing, Sales, and Customer Relationship Management* (Wiley, New York, 2004)
12. M. Waterman, Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences **344**, 383 (1994)
13. S. Needleman, C. Wunsch, Journal of Molecular Biology **48**, 443 (1970)
14. T. Smith, M. Waterman, Journal of Molecular Biology **147**, 195 (1981)