1 Introduction

The k-Nearest Neighbor (kNN) classification algorithm is one of the most popular approaches used by researchers and practitioners in the areas of Pattern Recognition and Machine Learning. Together with the Support Vector Machine (SVM), it is considered a firm representative of the classification-by-analogy principle [4].

Generally speaking, kNN only needs one parameter to be adjusted, k, which represents how many of the closest neighbors are considered to classify an unseen object. Once this parameter is set, two main approaches are followed to classify an object: (i) a majority vote of the k neighbors, and (ii) a weighted vote of all k neighbors that takes into account the distance from each of them to the object to classify. Following these two ideas, the kNN algorithm has been successfully applied in learning tasks as diverse as data mining [14], image processing [6], and recommender systems [7].

For classification purposes, all kNN variants have so far assumed that, independently of the voting strategy they follow (by majority or weighted), all objects in the training set are equal in their classification power. For instance, if two objects from different classes are at exactly the same distance from a test object, both will contribute the same amount to the final decision. Another way to put this is to say that the two training objects have the same relevance. In this work, we are interested in proposing some ideas to alter this behavior. Motivated by how massive bodies exert an influence on nearby objects, we propose assigning a mass to each of the objects in the training set.

There are several application scenarios that make us hypothesize that assigning a mass to the training objects could have a positive effect on the classification performance of the kNN algorithm. In particular, this could be of interest when some natural aspect of the problem needs to be considered. For example, within the field of Natural Language Processing (NLP), for the task of news classification, capturing the temporal aspect may be relevant, i.e. more recent news could be more informative (or have more context) than older ones. In this case, we could let the more recent news have a larger influence, and thus a larger mass. Another application of this approach is the recognition of highly heterogeneous categories. In that case it is common that the majority of the neighbors of the object to classify vote for a wrong label. With objects of different masses it would be possible to overturn this decision, provided that the objects of the right class carry enough mass.

In this work we pursue these ideas by proposing two different ways to calculate a mass for a given object. We reformulate the kNN algorithm to take this mass into consideration by using a voting strategy based on Newton’s gravitational force. We tested our proposal on 13 benchmark data sets and contrasted the results against the regular kNN and the weighted kNN algorithms.

2 Related Work

The literature reports several ways in which the performance of the kNN algorithm can be improved. Naturally, finding an optimal value of k has been one of the questions that several works have attempted to solve [16, 17]. Besides finding this k value, there is an open question regarding which distance metric is the most suitable to use. In this regard, previous works have evaluated new and traditional metrics in a variety of classification problems [2, 8, 15].

Using a weighting scheme was first proposed by Dudani [5] in the 1970s; this variant of kNN is called the Distance-Weighted k-Nearest-Neighbor Rule (DWkNN). Since then, different weighting schemes have been proposed. Among the most recent works, Tan [12] proposed the Neighbor-Weighted k-Nearest Neighbor (NWkNN) algorithm, which applies a weighting strategy based on the distribution of classes. When working with unbalanced data sets, NWkNN gives a smaller weight to objects of majority classes and a larger weight to objects of under-represented classes. For the case of text classification, Soucy and Mineau [11] proposed a weighting based on the similarity of texts (objects), measured by the cosine similarity between their bag-of-words representations. Mateos-García et al. [9] developed a technique, similar to those used in Artificial Neural Networks, to optimize weights that indicate the importance of each neighbor with respect to the test objects. Finally, Parvinnia et al. [10] also computed a weight for each training object, based on a matching strategy between the training and testing data sets.

3 Proposed Algorithm

In this section we present two approaches to calculate a mass for a given object in the training set. We then explain the complete kNN framework that exploits the concept of mass, by considering Newton’s gravitational force.

3.1 Mass Assignment

Approach 1: Circled by Its Own Class (CC). This approach is based on an instance selection strategy known as Edited Nearest Neighbor (ENN), originally proposed by Wilson [13]. The rationale of ENN is to keep an instance that is surrounded (or circled) by other instances of its same class. For the CC approach, the mass of an object x is directly proportional to the number of objects from its same class that surround it. By doing this, we aim to give less importance to objects that lie in regions of the feature space that are more likely to represent a different class. In other words, the idea is to penalize rare objects and, as a consequence, make the classifier more robust to outliers. To calculate the mass via CC we apply Eq. 1.

$$\begin{aligned} m(x\in c_i)=\log _2(SN_k(x,c_i)+2) \end{aligned}$$
(1)

where x is a training object, \(c_i\) is its class, and the function \(SN_k()\) counts how many of the k closest objects to x belong to its same class. The \(\log_2()\) function serves as a smoothing factor; we add the constant 2 to avoid undefined values and masses equal to zero.

Approach 2: Circled by Different Classes (CD). This approach is the opposite of the CC approach. It gives more mass to objects that are surrounded by objects from different classes, that is, the mass is inversely proportional to the number of neighbors of the same class. CD aims to preserve the discriminative power of an outlier object, since it could be relevant for classifying another outlier in the testing set. It also allows better modeling of heterogeneous classes formed by several small subgroups of objects. To assign a mass following this approach we apply Eq. 2; the interpretation of its elements is the same as in Eq. 1. A small illustrative code sketch of both approaches is given after the equation.

$$\begin{aligned} m(x\in c_i)=\log _2(k-SN_k(x,c_i)+2) \end{aligned}$$
(2)
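To make the mass assignment concrete, the following Python sketch computes \(SN_k\) and both masses for every training object. It is a minimal illustration under our own naming conventions (the function assign_masses and the use of NumPy/SciPy are assumptions); only the formulas in Eqs. 1 and 2 come from the text above.

```python
# Minimal sketch of the CC and CD mass assignments (Eqs. 1 and 2).
# Function and variable names are illustrative, not from the original paper.
import numpy as np
from scipy.spatial.distance import cdist

def assign_masses(X, y, k, approach="CC"):
    """Return one mass per training object in X (n x d array) with labels y (length-n array)."""
    dists = cdist(X, X, metric="euclidean")
    np.fill_diagonal(dists, np.inf)                  # an object is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]     # indices of the k closest objects
    same_class = (y[neighbors] == y[:, None]).sum(axis=1)   # SN_k(x, c_i)
    if approach == "CC":        # Eq. 1: more mass when circled by its own class
        return np.log2(same_class + 2)
    if approach == "CD":        # Eq. 2: more mass when circled by different classes
        return np.log2(k - same_class + 2)
    raise ValueError("approach must be 'CC' or 'CD'")
```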

3.2 Weighted Attraction Force kNN Algorithm (WAF-kNN)

The traditional weighted kNN algorithm is as follows: given a set of training objects \(\{(x_1,f(x_1)),\ldots ,(x_n,f(x_n))\}\) (where \(x_i\) is an object and \(f(x_i)\) its label), an unlabeled object \(x_q\), and the set of the k closest neighbors to \(x_q\) in the training set, \(\{x_1,\ldots ,x_k\}\), the class of \(x_q\) is determined by Eq. 3:

$$\begin{aligned} f(x_q)\leftarrow \arg \max _{c\in C} \sum _{i=1}^k weight(x_i)\times \delta (c,f(x_i)) \end{aligned}$$
(3)

where C represents the set of classes, \(weight(x_i)\) indicates the weight of the vote from object \(x_i\), and \(\delta (c,f(x_i))\) is a function that returns 1 if \(x_i\) belongs to class c and 0 otherwise.

Building on this framework, our proposal, which we call Weighted Attraction Force kNN, or simply WAF-kNN, uses a weighting scheme based on the Law of Universal Gravitation, as presented in Eq. 4.

$$\begin{aligned} weight(x_i)=G\frac{m(x_q)m(x_i)}{dist^2(x_q,x_i)}\simeq \frac{m(x_i)}{dist^2(x_q,x_i)} \end{aligned}$$
(4)

where \(weight(x_i)\) is the attraction force, or voting amount, exerted by the training object \(x_i\) on the object \(x_q\) to be classified. \(m(x_q)\) and \(m(x_i)\) are the masses of the testing and training objects, respectively, and \(dist(\cdot ,\cdot )\) is a distance metric between the two objects. Note that two constants can be omitted to simplify the original equation, namely G and \(m(x_q)\): they are the same for all k neighbors of \(x_q\) and therefore only scale the votes, without affecting the arg max in Eq. 3. The mass \(m(x_i)\) can be calculated by either of the two approaches, CC or CD, presented in Sect. 3.1.
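A sketch of the resulting decision rule is given below. It assumes NumPy and the masses computed with the assign_masses sketch above; as in Eq. 4, G and \(m(x_q)\) are dropped, and the small eps term is our addition to guard against division by zero when a neighbor coincides with the query object.

```python
# Minimal sketch of the WAF-kNN decision rule (Eqs. 3 and 4).
def waf_knn_predict(X_train, y_train, masses, x_q, k, eps=1e-12):
    dists = np.linalg.norm(X_train - x_q, axis=1)   # Euclidean distances to x_q
    nn = np.argsort(dists)[:k]                      # k closest training objects
    votes = {}
    for i in nn:
        # Gravitational weight: mass over squared distance (Eq. 4),
        # with G and m(x_q) omitted since they do not change the arg max.
        w = masses[i] / (dists[i] ** 2 + eps)
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)                # class with the largest total force
```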

4 Experiments and Results

4.1 Experimental Configuration

For the evaluation of the proposed approach we considered 13 different data sets from the UCI data repository. All these data sets exclusively contain numeric features and have no missing values. These data sets are commonly used in classification tasks. Table 1 presents some statistics on these data sets, such as the number of instances, features, and classes.

Table 1. Data sets characteristics.

We applied a common experimental setting across all the collections. First, we considered three different values for k, namely 3, 5, and 7. Then, we standardized the data by means of their z-scores. In all the experiments we used the Euclidean distance as the distance measure, and employed the \(\mathrm {F}_1\) score as the main evaluation metric due to its appropriateness for describing results on unbalanced data sets. A 10-fold cross-validation procedure was applied to obtain the results. Finally, we applied the non-parametric Bayesian Signed-Rank (BSR) test [1] to analyze the statistical significance of the obtained results. A sketch of this protocol is shown below.
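The following scikit-learn sketch reproduces this protocol (z-score standardization, Euclidean distance, k in {3, 5, 7}, 10-fold cross-validation, \(\mathrm {F}_1\) score) for the distance-weighted baseline only; WAF-kNN itself is not available in scikit-learn. The macro-averaged \(\mathrm {F}_1\), the Iris stand-in data set, and the fold shuffling seed are our assumptions, not details taken from the paper.

```python
# Hedged sketch of the experimental protocol, using scikit-learn's
# distance-weighted kNN (the DWkNN baseline) on a stand-in data set.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)        # stand-in for one of the 13 UCI data sets
for k in (3, 5, 7):
    clf = make_pipeline(
        StandardScaler(),                # z-score standardization
        KNeighborsClassifier(n_neighbors=k, metric="euclidean", weights="distance"),
    )
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, scoring="f1_macro", cv=cv)
    print(f"k={k}: mean macro-F1 = {scores.mean():.3f}")
```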

4.2 Results

Table 2 presents a first comparison of the two approaches used to calculate the masses (CC and CD), each employed within the WAF-kNN algorithm. The table is organized by the three k values that were evaluated, and the best results for each k are shown in bold face. Globally, the CD approach slightly outperforms the CC approach, this being most evident when \(k=7\); nevertheless, there are data sets where the CC approach is better for all k values, e.g. Arcene and Ecoli. The analysis of the Ecoli data set tells us that its classes form reasonably well-defined, homogeneous clusters. In this situation, the CD approach gives more mass to outliers, causing a larger classification error than CC, which assigns less mass to objects far from the main centroid of their class, thereby reducing noise. Both approaches, CC and CD, aim to offer a better weighting scheme to improve classification performance, but which one to use will ultimately depend on the distribution of classes in the data set of interest.

Table 2. \(\mathrm {F}_1\) scores of WAF-kNN, using the two approaches for mass assignment.

To evaluate our proposal against the kNN and DWkNN algorithms, we chose the CD approach given its more consistent performance in the previous experiment. This comparison is presented in Table 3, where it can be observed that our proposal outperforms the baseline methods in the majority of data sets. This behavior is consistent across the three values of k that were considered. Again, the best performance is obtained with \(k=7\).

Table 3. Comparison of kNN, DWkNN and WAF-kNN using CD masses.

To further analyze these results, we applied the non-parametric BSR test [3]. According to this test, three scenarios are possible for a given pairwise comparison of methods A and B: (scenario 1) A outperforms B, (scenario 2) both methods show the same performance, or (scenario 3) B outperforms A. The BSR test computes the probability of occurrence of each scenario when approaches A and B are applied to a given data set. Table 4 presents the probabilities of occurrence of each scenario when comparing the baseline approaches, kNN and DWkNN, with our proposed WAF-kNN approach.

Table 4. BSR output probabilities. A refers to the baseline methods, kNN and DWkNN respectively, whereas B refers to the proposed WAF-kNN approach.

Looking at the performance of the WAF-kNN algorithm on each data set (with \(k=7\)), the largest improvement and the largest decrement with respect to the baseline methods were obtained in Ionosphere and Ecoli, respectively. Visualizing these data sets reveals some characteristics that shed light on the behavior of the method.

Fig. 1. t-SNE mapping of the Ionosphere and Ecoli data sets. (Color figure online)

Figure 1 shows the distribution of objects in these two data sets using t-distributed Stochastic Neighbor Embedding (t-SNE). The Ionosphere data set is composed of two classes. Class 1, shown in red, is grouped in two well-defined clusters located in the upper and lower sections of the space. Class 2, shown in blue, is mainly spread over the mapping space, with an identifiable cluster on the right side of the figure. In this case, the CD approach favors the classification of objects of class 2 by assigning more mass to training objects located in the central and upper-left regions, which are clearly circled by objects of class 1, thus obtaining correct label assignments even in regions where the majority of objects belong to a different class. On the other hand, in the Ecoli data set, CD gives more mass to presumably noisy objects located away from the bulk of their own class (see the blue and white objects over the green cluster), thereby negatively affecting the classifier.

5 Conclusions

In this work we introduced the WAF-kNN algorithm, a variant of the weighted kNN algorithm based on the attraction force that exists between two objects. We presented two methods for assigning mass to training objects, namely Circled by Its Own Class (CC) and Circled by Different Classes (CD). For testing purposes, 13 well-known data sets were employed. The comparisons indicate that our proposal obtains better classification results than kNN and is statistically competitive with DWkNN. These results were validated with the non-parametric BSR test.