An Efficient Algorithm for Bayesian Nearest Neighbours

K-Nearest Neighbours (k-NN) is a popular classification and regression algorithm, yet one of its main limitations is the difficulty in choosing the number of neighbours. We present a Bayesian algorithm to compute the posterior probability distribution for k given a target point within a data-set, efficiently and without the use of Markov Chain Monte Carlo (MCMC) methods or simulation - alongside an exact solution for distributions within the exponential family. The central idea is that data points around our target are generated by the same probability distribution, extending outwards over the appropriate, though unknown, number of neighbours. Once the data is projected onto a distance metric of choice, we can transform the choice of k into a change-point detection problem, for which there is an efficient solution: we recursively compute the probability of the last change-point as we move towards our target, and thus de facto compute the posterior probability distribution over k. Applying this approach to both a classification and a regression UCI data-sets, we compare favourably and, most importantly, by removing the need for simulation, we are able to compute the posterior probability of k exactly and rapidly. As an example, the computational time for the Ripley data-set is a few milliseconds compared to a few hours when using a MCMC approach.


INTRODUCTION & RELATED WORK
Various authors have explored the idea of Bayesian k-NN algorithms, e.g. [1,2], and originally [3]. The simplicity and elegance of k-NN lends itself, at least intuitively, to a Bayesian setting where the aim is to allow the number of neighbours to vary depending on the data (as highlighted in [4].) Practically all of the work has relied on Markov Chain Monte-Carlo methods in some form or other; the use of simulation circumvents the need to model the full joint probability distribution of the data and the number of neighbours for any target. In an attempt to avoid the use of simulation, the authors in [5] have approximated the likelihood function, albeit at the expense of accuracy and portability to regression problems. More recently, as an alternative approach, the idea of hubness is explored in [6].
In a somewhat distinct branch of Bayesian statistics, numerous studies have focused on estimating change-point probabilities for data where the generating process is presumed to vary over time. Initial works were focused on partition analysis for the entire data-set, often using MCMC simulation: [7,8,9] (which, interestingly, is loosely connected with the computational complexity of estimating the posterior probability of k using approaches based on simulation.) Yet it was not until the authors in [10] presented an on-line version of Bayesian change-point estimation (with an O(n) computational complexity,) that change-point problems become easily estimated.
As discussed in detail by [11], a probabilistic view of k in nearest neighbour algorithms outperforms standard crossvalidation approaches, though practitioners often avoid the probabilistic approach due to its reliance on MCMC methods of estimation. By using the algorithm presented in [10] applied to the data ordered by distance to our target coordinates, we can compute the exact probability distribution for k specific to our target point.

EFFICIENT BAYESIAN NEAREST NEIGHBOUR
The idea of calibrating the number of neighbours to the data is centered around the notion that, within the appropriate neighbourhood, data points are similar, or, in other words, they are generated by the same process. It is indeed our goal to determine how many k neighbours represent such appropriate neighbourhood.

DATA GENERATING PROCESS
If we order the data using a distance measure of choice with respect to a target point, we have transformed our assumption into the idea that the data-generating process is shared for the first k points closets to the target. As such, moving from the most distant point towards our target, the underlying process generating the data can vary with a known probability and, when a change occurs, such process is drawn from a known prior distribution. We aim to determine the probability of such change-point having occurred at the various intervals between neighbours -once we have reached our target datum.
This formulation is equivalent to the change-point analysis in [10]: we can recursively keep track of the historical change-point probability until we reach the target point. The resulting probability of a change-point having occurred at k points from our target is indeed the probability of k being the correct number of neighbours.

A Simple Example
As a simple two dimensional classification problem, imagine we have the data depicted in Figure 1, with the respective order, based on the Euclidean distance, shown below in  In this case, the appropriate number of neighbours for target point (I) is five. An alternative way to view the choice of k is to order the data-points by their distance to the target point ( Figure 2, below the x-axis.) We note the prior probability of each k (before seeing any of the data) as the dotted line, which is just a geometric distribution with p γ = 0.05. To compute the posterior probability for k (solid blue line in Fig. 2,) we can start from the rightmost point and move towards our target: i.e. the probability that the data-generating process has changed (assuming a Beta prior probability for the data generation with parameters B(α = 10., β = 10.) and a probability of a change-point occurring in between any two neighbours of p γ = 0.05, i.e. our prior on the number of neighbours is k = 20.) Conversely, we are not as convinced of the appropriate k for target (II), as we can see form the posterior the distribution which is giving a rather mixed view. Note that the choice of Beta distribution is the standard conjugate prior for the parameters of a binary random variable. The choice of p γ = 0.05 is a key part of this approach. Specific to this example, we are implicitly assuming that the data will vary as we move away from our target point with a probability of 0.05. In other words, the hazard function is memoryless -with our prior for the expected number of neighbours set to 1 pγ = 20.

ALGORITHM
The first step is to represent the data into an ordered list driven by the distance from our target point 1 . If we define the target point as x τ , we order all of the available training data as x 0 , ..., x τ −1 (defined as x 0:τ −1 ) with x 0 being the most distant point from our target. We assume that the data x t is i.i.d. over a partition ρ from some probability distribution P (x t |η ρ ), where η ρ represents the parameters of the data-generating distribution. Finally, we assume that for all of the partitions, η ρ is also i.i.d from a known prior distribution (where ρ = 1, ..., n and n ≤ τ .) Now we are ready to use the algorithm presented in [10], applied to the projected data 2 .
Our objective is to compute the probability of each number of neighbours once we have reached our target point: p(k τ =i| x 0,...,τ −1 ) ∀i = 0, ..., τ − 1 with τ − 1 total data points. Note that the subscript τ in k τ indicates that we are representing the appropriate number of neighbours from the viewpoint of x τ , i.e. the target point. Starting from the point farthest away, and initializing the probability of a change-point having occurred before the initial point to 1.0, we set the initial conditions 3 : Firstly, we note that, as we observe a new datum, moving closer to our target, the number of neighbours k t within the same partition can either increase by one, with probability 1 − p γ , or terminate in favour of a nascent partition.
A key advantage of the algorithm in [10] is that we can recursively compute the probability over the number of neighbours, p(k t ), by keeping track of the joint probability of each k and the data: p(k t−1 , x 0 , ..., x t−1 ), as we observe a new datum, x t alongside the predictive probability of x t for a given number of neighbours, π t = p(x t |k t−1 , η t−1 ).
Finally, we define the notation η ρ և x t to indicate that we update the distribution parameters for η ρ with the datum x t using standard Bayesian updating rules 4 (see [12] for conjugate prior updating within the exponential family.)

Implementation Notes
(a) The hazard function need not be constant; p γ = f (.) can, interestingly, depend on distance between points,

Algorithm 1 Efficient Bayesian k-NN Algorithm
Initialize the data: x 0 , ..., x τ −1 ← ordered data for target point τ Initialize change-point variables: Observe next variable x t for i ← 0, t do Compute predictive probability: Compute probability of k: Update distributions: η i և x t end for end for return p(k τ | x 0:τ −1 ) ∀k τ ∈ {0, ..., τ } or the current run-length (i.e. not memory-less,) etc.; (b) Initializing the change-point probability: we do not have to set it to 1.0 before the first data-point. To speed up the analysis, we can start the algorithm m points away from our target point (where m can be set so that the prior probability of a change-point having occurred before m falls below a preset threshold, and m ≪ n.) In this case, and as an alternative, we can initialize the probability of a change-point before m with the prior distribution for p γ .
(c) The Bayesian update defined as 'և' can generally be computed efficiently for distributions in the exponential family. Other, possibly more complex, distribution may require a quadrature or simulation approach.
(d) For large data-sets, we resort to applying a log transform the joint probabilities in order to maintain nuthe algorithm to O(n 2 ). 3 Covered in more detail in the implementation notes. 4 As a simple example, let's assume we are updating the probability of a Bernoulli distribution, e.g. a coin toss, with a prior of α = 50 for heads and β = 50 for tails (where η = {α, β} defined as a total of 100 pseudo-observations and a prior probability p(H) = 0.5.) If we then observe a new datum x = tails, the η և x operation will update the parameters to α ′ = 50 and β ′ = 51, for a posterior predictive distribution of p(H) = 0.49505. merical stability.

RESULTS
In order to critically appraise this approach, we benchmark our analysis to the ubiquitous Ripley data-set for classification, and, as for a regression problem, to the Nuclear Power Plant output in [13]. The comparison is made against results obtained using the global optimal number of neighbours (as a manual process.) In other words, we compare this approach against the best choice of k when applied to all of the training data points. The key idea here is indeed that the optimal k varies depending on the specific target point within the same data-set 5 .  In addition to the a prediction based on various k values (weighted by their likelihood,) we now have a measure of certainty regarding our prediction. In Fig 4 we present the probability of classification computed using the training data.
The MCMC-based Bayesian analysis in [1] achieved a misclassification rate of 0.087, which is, not surprisingly, similar to our result. We also note the similarity between our Figure 4 and the one presented in such study. Indeed, we are not proposing the idea of a Bayesian approach to estimating the number of neighbours, but an efficient method to do so. Using this algorithm, the time to compute the posterior distribution over k for a test point within the Ripley data-set averaged three milliseconds per test point (for a standard PC,) compared to 50,000 paths used in the MCMC implementation in [1]. Notably, our approach 5 Data and descriptions for both data-sets are available the at UCI Machine Learning Repository (http://www.ics.uci.edu)   Finally, on the right column in Table 1 we show how, by switching the prior from a Beta distribution in the classification problem with the Normal distribution (as we only assumed the mean to be unknown,) we can obtain improved results for the Power Plant Output data. An interesting observation is that, if we plot the maximum posterior probability of the data w.r.t. the absolute error (in Fig. 5,) we can observe the ultimate limitation of k-NN algorithms. Data points with a large absolute error have clearly a very small probability of occurring, despite the fact that we are displaying the maximum likelihood across all possible k's; in other words, these are true outliers with respect to the distance measure that we have chosen 6 .
We have presented an efficient algorithm to compute a Bayesian analysis over the number of neighbours in k-NN algorithms, applicable to classification and regression, which does not rely on MCMC simulation. This yields both superior predictions and a full probabilistic view of k. Yet the biggest challenge for k-NN algorithms is likely to be within the choice of the distance measure and differentiated input scaling (as highlighted in [14].) Certainly for multidimensional problems, the challenge lies in ordering the neighbours correctly with respect to their proximity to our target point, which in turn is driven by the coordinate transform we apply to compute the distance measure. An efficient Bayesian approach in understanding such scaling aspect may well be possible.