A medoid-based weighting scheme for nearest-neighbor decision rule toward effective text categorization

The k-nearest-neighbor (kNN) decision rule is a simple and robust classifier for text categorization. Its performance depends heavily upon the value of the neighborhood parameter k. The method categorizes a test document even if the difference between the number of members of two competing categories is one. Hence, the choice of k is crucial, as different values of k can change the result of text categorization. Moreover, text categorization is a challenging task as text data are generally sparse and high dimensional. Note that assigning a document to a predefined category for an arbitrary value of k may not be accurate when there is no bound on the margin of majority voting. A method is thus proposed in the spirit of the nearest-neighbor decision rule, using a medoid-based weighting scheme to deal with these issues. In decision making, the method puts more weight on the training documents that lie close not only to the test document but also to the medoid of their corresponding category, unlike the standard nearest-neighbor algorithms, which stress the documents that are merely close to the test document. The aim of the proposed classifier is to enrich the quality of decision making. The empirical results show that the proposed method performs better than different standard nearest-neighbor decision rules and the support vector machine classifier on various well-known text collections in terms of macro- and micro-averaged f-measure.


Introduction
The task of the kNN decision rule is to assign a test document to a particular category using a set of training documents. The method first finds the k nearest neighbors of the test document from the training set by using a similarity measure. Then the category of the test document is determined by taking a majority vote among these k nearest neighbors [1,2]. Thus the performance of the kNN decision rule is heavily influenced by the neighborhood parameter k [3]. Different values of k can change the result of text categorization, and hence the choice of k is crucial for an effective result. Moreover, text categorization is a challenging task as text data are generally sparse and high dimensional. Hence, assigning a document to a predefined category for an arbitrary value of k may not be accurate when there is no bound on the margin of majority voting. The cross-validation technique is generally used to estimate an optimal value of k [4], but choosing an optimal k that provides satisfactory results for all test documents is still a difficult job. Moreover, a slight change in the value of k can lead to different results. For example, consider a two-class classification problem. Let there be 8 documents in the training set and let $d_t$ be a test document. Let A and B be the two categories. According to the kNN algorithm, the training documents are arranged in non-increasing order of similarity with $d_t$. Let the category labels of the ordered training documents be {A, A, B, B, A, B, B, A}. It can be seen that for k = 5, $d_t$ is categorized to A, for k = 6 there is a tie, and for k = 7, $d_t$ belongs to B. It is clear from this example that the simple majority voting rule may not be reliable for text categorization. In principle, when there is more or less the same representation from the competing categories among the nearest neighbors, it is preferable to keep the test document unclassified rather than making a wrong judgment [5].
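The two-class example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name `knn_vote` is hypothetical, and a tie is reported as `None` rather than being broken by any particular rule.

```python
from collections import Counter

def knn_vote(ordered_labels, k):
    """Majority label among the first k neighbors, or None when the top two tie."""
    counts = Counter(ordered_labels[:k]).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between the two best categories
    return counts[0][0]

# Labels of the training documents, nearest neighbor first (from the example).
labels = ["A", "A", "B", "B", "A", "B", "B", "A"]
results = {k: knn_vote(labels, k) for k in (5, 6, 7)}
print(results)  # {5: 'A', 6: None, 7: 'B'}
```

Changing k by one step is enough to flip the outcome from A to a tie to B, which is the sensitivity the example is meant to show.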
A tweak on the kNN (TkNN) decision rule has been proposed by Basu et al. to overcome these issues [5]. The method puts a bound on the majority voting of kNN by using a predefined threshold to enhance the confidence of the majority voting process. It starts with an arbitrary k and increases the value of k until it can categorize the test document. A document is thus categorized if the difference between the number of documents of the two competing categories is greater than a given threshold. The method does not require the knowledge of the neighborhood parameter k to execute kNN. However, this method does not check the similarity of the documents when increasing the span of the neighborhood, which is crucial. In principle, the similarity between the test document and the training documents should be checked when expanding the neighborhood, as the term-document matrices are generally sparse and high dimensional. The other widely used variant of the kNN decision rule is the distance-weighted kNN decision rule [6]. The method gives different weights to the k nearest neighbors based on their distances from the test document, where the closer neighbors get higher weights. Like the kNN decision rule, this method too puts no bound on the margin of majority voting for decision making. A method is thus desirable to overcome these limitations of the kNN decision rule and its variants for effective text categorization.
A nearest-neighbor decision rule is proposed here in the spirit of the weighted kNN and TkNN decision rules. The proposed decision rule forms the neighborhood of a test document by considering the documents from the training set that are closely related to both the medoid of a category and the test document. The medoid of a category is a representative document whose average dissimilarity to all the other documents in that category is minimal [7,8]. Note that medoids are always restricted to be members of the data set. The method first finds the medoid of each category in the data set, and subsequently it identifies the training documents that are closely related to the medoid of their category and to the test document. These training documents constitute the neighborhood of the test document. The weight of a training document in that neighborhood is computed by considering the distance of that document from the medoid and also from the test document. Thereafter the first few neighbors are considered, and the weights of these documents are aggregated for each category. The test document is then assigned to the category that has the maximum aggregated weight, provided this weight is greater than the weight of its competing category by a given threshold. Otherwise the neighborhood is enlarged, and the method continues until this condition is satisfied or all the documents in the neighborhood have been checked. The objective of the proposed decision rule is to enrich the quality of decision making. In the worst case, the proposed decision rule may examine all the neighbors of the test document without being able to take a decision. The test document remains unclassified in such cases. Note that, in practice, it is better not to take a decision when we are not sure about it. The proposed technique is developed in this spirit.
The performance of the proposed method is compared with different standard nearest-neighbor decision rules and the support vector machine classifier using standard text collections. The empirical results show that the proposed method outperforms the state-of-the-art classifiers in terms of macro- and micro-averaged f-measure.
The paper is organized as follows. The works related to this study are described in Sect. 2. Section 3 explains the vector space model for the representation of text data. The proposed method is described in Sect. 4. Section 5 presents the experimental evaluation. Finally, we conclude with the scope for future work in Sect. 6.

Related works
Text categorization is the problem of assigning predefined categories to new documents. A new document is assigned to a particular category when it is similar to more documents of that category than of any other category [9]. A number of methods have been developed for effective text categorization [9]. The support vector machine (SVM) was introduced to solve two-class classification problems using the structural risk minimization principle [10]. In its simplest linear form, SVM finds a hyperplane that separates the documents of two different categories with maximum margin [11]. Joachims reported an efficient implementation of SVM and its application to text categorization on the Reuters-21578 corpus [12]. The kNN decision rule is a simple and effective similarity-based classifier, and it has performed well for text categorization [13,14]. Cover and Hart [1] introduced the kNN decision rule, where a test sample is assigned to the category that has the maximum number of representative training samples among the k nearest neighbors of the test sample.
The other widely used variant of the kNN decision rule is the distance-weighted kNN decision rule [6]. The method assigns different weights to the k nearest neighbors based on their distances from the test document, where the closer neighbors get higher weights. Let $d_1, d_2, \ldots, d_k$ be the k nearest neighbors of a test document, say, $d_t$. Let the corresponding distances of these neighbors from $d_t$ be denoted by $\delta(d_j, d_t)$, $\forall j = 1, 2, \ldots, k$, where $\delta$ is a distance function. The weight $w_j$ associated with the jth nearest neighbor is defined as a decreasing function of $\delta(d_j, d_t)$ [6]. The test document is assigned to the category for which the sum of the weights of the representative documents of the category among these k nearest neighbors is maximum [6]. The major limitation of this method is that it also suffers from the influence of the neighborhood parameter k. Different values of k may lead to different assignments of categories to the test document.
Gowda et al. have developed the condensed nearest-neighbor (CNN) technique [15], which eliminates similar or redundant data that do not add extra information. Although it reduces the memory requirements and recognition rate while improving query time, it still poses the problem of computational cost. The reduced nearest-neighbor (RNN) algorithm [16] does an extra job over CNN by removing the samples that are independent of the training set. The rank-based kNN (RNN) decision rule is quite effective for data with huge variations between features [17]. Bagui et al. [17] have proposed a generalization of the RNN rule by assigning ranks to the training data for each category. However, these methods have never been used for text categorization.
Guan et al. have proposed a modification of the kNN decision rule for text categorization, which mostly considers the documents that lie on the boundary region of individual categories in decision making and ignores the other documents [18]. The efficiency and effectiveness of the method are demonstrated using the standard Reuters corpus [18]. Tan has proposed a method called neighbor-weighted K-nearest neighbor (NWKNN) for unbalanced text categorization problems [19]. Instead of balancing the training data, NWKNN assigns a high weight to the neighbors belonging to the categories that contain few documents and a small weight to the neighbors belonging to the categories that contain a large number of documents [19].
Basu et al. have proposed the TkNN decision rule by putting a bound on the majority voting process of the kNN decision rule, as discussed earlier [5]. The TkNN rule restricts the majority voting of kNN by a predefined positive integer threshold, say $\beta$, to assign a test document to a category. The method starts with $\beta$ neighbors, i.e., $k = \beta$. Subsequently, it checks whether the difference between the number of members of the best and the second-best competing categories meets the threshold $\beta$. If so, then the test document is assigned to the best competing category by this rule. Otherwise, the value of the neighborhood parameter k is increased by one. The process continues until a decision is made or it reaches the last document of the set of neighbors. The set of neighbors is simply the training set ordered by distance from the test document. If the test document is not categorized by the time the process has checked all the documents of the set of neighbors, then it remains unclassified. However, this method does not consider the distance between the neighbors and the test document when performing the majority voting for decision making. A training document that is far away from the test document can take part in decision making by this rule, which is not desirable.
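The TkNN procedure described above can be sketched in Python as follows. This is not the implementation of [5]: the function name `tknn` and the parameter name `beta` are hypothetical, and the sketch assumes that a decision is made as soon as the count difference between the best and second-best categories is at least `beta`.

```python
from collections import Counter

def tknn(ordered_labels, beta):
    """Sketch of TkNN on category labels sorted nearest neighbor first.

    Assumption: decide once (best count - second-best count) >= beta.
    Returns (category, k used), or (None, number of neighbors) if unclassified.
    """
    counts = Counter(ordered_labels[:beta - 1])
    for k in range(beta, len(ordered_labels) + 1):
        counts[ordered_labels[k - 1]] += 1      # grow the neighborhood by one
        ranked = counts.most_common(2)
        best, best_n = ranked[0]
        second_n = ranked[1][1] if len(ranked) > 1 else 0
        if best_n - second_n >= beta:
            return best, k
    return None, len(ordered_labels)

labels = ["A", "A", "B", "B", "A", "B", "B", "A"]
print(tknn(labels, 2))  # ('A', 2)
print(tknn(labels, 3))  # (None, 8)
```

With a threshold of 2 the first two neighbors already settle the decision, whereas a threshold of 3 is never met on this sequence and the document stays unclassified. Note that the rule looks only at label counts, not at how far the neighbors are from the test document.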

Representation of text data
Different documents in a corpus have different lengths, where length means the number of terms in a document. It is very difficult to find the similarity between two document vectors of different dimensions (lengths). Therefore, it is necessary to represent all the documents in the corpus with a uniform length. Several models have been introduced in the information retrieval literature to represent the document data sets in the same frame [20,21]. The vector space model enables efficient analysis of huge document collections in spite of its simple idea [21]. It was originally introduced for indexing and information retrieval, but is now used in several text categorization and clustering techniques as well as in most of the currently available document retrieval systems [22].
Let us assume that the number of documents in the corpus is n and the number of terms is m. Let us also assume that the ith term is represented by $t_i$ and the number of times the term $t_i$ occurs in the jth document is denoted by $tf_{ij}$, $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$. The document frequency $df_i$ is the number of documents in which $t_i$ occurs. The inverse document frequency $idf_i = \log\left(\frac{n}{df_i}\right)$ reflects how rare the term is in the corpus: the fewer documents contain $t_i$, the larger $idf_i$ becomes. The weight of $t_i$ in the jth document, denoted by $w_{ij}$, is determined by combining the term frequency with the inverse document frequency [22] as follows:

$$w_{ij} = tf_{ij} \times idf_i = tf_{ij} \times \log\left(\frac{n}{df_i}\right), \quad \forall i = 1, 2, \ldots, m \ \text{and} \ \forall j = 1, 2, \ldots, n \tag{2}$$

The documents can be efficiently represented using the vector space model in most of the text categorization and clustering algorithms [22]. In this model each document $d_j$ is considered to be a vector $\vec{d_j}$, where the ith component of the vector is $w_{ij}$, i.e., $\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{mj})$.
The similarity between two documents is measured through a similarity (or distance) function. Given two document vectors $\vec{d_i}$ and $\vec{d_j}$, it is required to find the similarity (or dissimilarity) between them. Various similarity measures are available in the literature, but the commonly used measure is the cosine similarity between two document vectors [20], which is given by

$$\cos(\vec{d_i}, \vec{d_j}) = \frac{\vec{d_i} \cdot \vec{d_j}}{\|\vec{d_i}\| \, \|\vec{d_j}\|} \tag{3}$$

Note that the weight of each term in a document is nonnegative. As a result, the cosine similarity is nonnegative and bounded between 0 and 1, both inclusive. $\cos(\vec{d_i}, \vec{d_j}) = 1$ means the documents are exactly similar, and the similarity decreases as the value goes to 0. An important property of the cosine similarity is its independence of document length. Thus cosine similarity has become popular as a similarity measure in the vector space model [23]. The vector space model is used here to represent a document vector.
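As a small illustration of the tf-idf weighting and the cosine similarity described in this section, the following Python sketch builds document vectors from a toy term-frequency matrix. The function names and the toy counts are hypothetical, and a term that occurs in no document is given an idf of 0 purely to avoid a division by zero.

```python
import math

def tfidf(tf):
    """tf[i][j] is the count of term i in document j; returns one vector per document."""
    m, n = len(tf), len(tf[0])
    df = [sum(1 for j in range(n) if tf[i][j] > 0) for i in range(m)]
    idf = [math.log(n / df[i]) if df[i] else 0.0 for i in range(m)]
    return [[tf[i][j] * idf[i] for i in range(m)] for j in range(n)]

def cosine(u, v):
    """Cosine similarity; defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy corpus: 3 terms (rows) and 3 documents (columns).
tf = [
    [2, 0, 1],
    [0, 3, 0],
    [1, 1, 0],
]
docs = tfidf(tf)
sim_02 = cosine(docs[0], docs[2])
sim_01 = cosine(docs[0], docs[1])
```

Because all tf-idf weights are nonnegative, every similarity computed this way falls in [0, 1], and scaling a document vector by a positive constant leaves its cosine similarity to other documents unchanged.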

A medoid-based nearest-neighbor decision rule for text categorization
In this work, a medoid-based weighting scheme is proposed to overcome the influence of the boundary documents on the nearest-neighbor decision rule. A medoid is a document of a particular category whose average similarity to all the other documents in the category is maximal [8,24]. Let $D = \{\vec{d_1}, \vec{d_2}, \ldots, \vec{d_n}\}$ be the set of n document vectors corresponding to the n documents in the training corpus. Here $\vec{d_i} \in \mathbb{R}^m$, $\forall i = 1, 2, \ldots, n$ are generated from the raw texts following the tf-idf weighting scheme of the vector space model [20]. Let us consider that there are r categories in the training corpus, say, $C_1, C_2, \ldots, C_r$. The medoid of the documents of a particular category, say, $C_j$ is defined as

$$\hat{d}_j = \arg\max_{\vec{d} \in C_j} \frac{1}{|C_j| - 1} \sum_{\vec{d'} \in C_j,\ \vec{d'} \neq \vec{d}} \Psi(\vec{d}, \vec{d'}) \tag{4}$$

Note that $\Psi$ is a normalized similarity measure, i.e., $\Psi \in [0, 1]$, where 1 indicates the highest similarity and the similarity decreases as the value decreases to 0.
In the experimental analysis of this article, Ψ is treated as cosine similarity.
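The medoid of a category can be found by a direct search over its members. The sketch below is a minimal illustration under the definition given in this section (maximal average similarity to the other documents of the category) with cosine similarity as the similarity measure; the function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def medoid(vectors, sim=cosine):
    """Index of the vector with the largest average similarity to the others."""
    best_idx, best_avg = None, -1.0
    for i, u in enumerate(vectors):
        others = [sim(u, v) for j, v in enumerate(vectors) if j != i]
        avg = sum(others) / len(others) if others else 0.0
        if avg > best_avg:
            best_idx, best_avg = i, avg
    return best_idx

# Toy category with four two-dimensional document vectors.
category = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3]]
print(medoid(category))  # 3
```

The search compares every pair of documents in the category, so its cost grows quadratically with the category size; the medoid is always one of the category's own documents.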

Medoid-based weighting scheme
Let $\vec{d_t}$ be the test document, whose category is to be identified. The proposed method considers a training document $\vec{d}$ as an effective neighbor of $\vec{d_t}$ if its similarity with $\vec{d_t}$ is greater than both a predefined threshold and the similarity between $\vec{d_t}$ and the medoid of the category of that training document. It forms the set of effective neighbors (EN) of $\vec{d_t}$ as follows:

$$EN = \{\vec{d} \in D : \Psi(\vec{d}, \vec{d_t}) > \theta \ \text{and} \ \Psi(\vec{d}, \vec{d_t}) > \Psi(\vec{d_t}, \hat{d}_j), \ \vec{d} \in C_j\} \tag{5}$$

This indicates that the effective neighbors of a test document are those training documents which lie between the test document and the medoids of the individual categories and have sufficient content similarity with $d_t$. Here $\theta > 0$ is a threshold on document similarity, and it ensures that the documents in EN have sufficient content similarity with $d_t$. The weight of a particular document, say, $d \in EN$ in terms of $d_t$ is defined as

$$W(\vec{d}, \vec{d_t}) = \Psi(\vec{d}, \vec{d_t}) \times \Psi(\vec{d}, \hat{d}_j) \tag{6}$$

Here $d \in C_j$, $\forall j = 1, 2, \ldots, r$ and $\hat{d}_j$ is the medoid of $C_j$. It may be noted that $W(\vec{d}, \vec{d_t}) \in [0, 1]$.
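• The highest value of $W(\vec{d}, \vec{d_t})$ is 1, which indicates that d is close to both $\hat{d}_j$ and $d_t$.
• The value of $W(\vec{d}, \vec{d_t})$ is 0 when $\Psi(\vec{d}, \vec{d_t}) = 0$.
• When d is close to $\hat{d}_j$ but far from $d_t$, i.e., $\Psi(\vec{d}, \hat{d}_j)$ is high but $\Psi(\vec{d}, \vec{d_t})$ is low, then $W(\vec{d}, \vec{d_t})$ will be low.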
Note that $\Psi(\vec{d}, \vec{d_t})$ in Eq. 6 indicates the similarity between the test document and a training document, whereas $\Psi(\vec{d}, \hat{d}_j)$ denotes the similarity between the same training document and the medoid of the category of this training document. The product of these two similarity values will be high only when their individual values are very high. Thus this weighting scheme ensures that the training documents which are close not only to the test document but also to the medoid of the corresponding category are given higher preference than the other documents in EN when taking part in the majority voting of the proposed decision rule to categorize the test document.
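The selection of effective neighbors and the product weighting can be sketched on precomputed similarity values. This is only one reading of the description above, not the authors' code: it assumes that a training document is compared against the medoid of its own category, and the names `effective_neighbors`, `sim_test`, `sim_medoid`, `medoid_sim_to_test` and `theta` are all hypothetical.

```python
def effective_neighbors(train, medoid_sim_to_test, theta):
    """Select effective neighbors and attach their weights.

    train: list of dicts with
        'label'      - category of the training document
        'sim_test'   - similarity between the document and the test document
        'sim_medoid' - similarity between the document and its category medoid
    medoid_sim_to_test: category -> similarity between its medoid and the test document
    theta: threshold on the similarity with the test document
    """
    en = []
    for doc in train:
        s = doc["sim_test"]
        # Assumed membership test: similar enough to the test document, and more
        # similar to it than the medoid of the document's own category is.
        if s > theta and s > medoid_sim_to_test[doc["label"]]:
            en.append({**doc, "weight": s * doc["sim_medoid"]})
    # Nearest neighbor first.
    return sorted(en, key=lambda d: d["sim_test"], reverse=True)

train = [
    {"label": "A", "sim_test": 0.90, "sim_medoid": 0.8},
    {"label": "A", "sim_test": 0.30, "sim_medoid": 0.9},  # farther from the test doc than A's medoid
    {"label": "B", "sim_test": 0.60, "sim_medoid": 0.2},
    {"label": "B", "sim_test": 0.05, "sim_medoid": 0.7},  # below theta
]
medoid_sim_to_test = {"A": 0.5, "B": 0.4}
en = effective_neighbors(train, medoid_sim_to_test, theta=0.1)
```

In this toy input the category B document that passes the filter is fairly close to the test document (0.6) but far from its medoid (0.2), so its weight is only 0.12, while the category A document that is close to both receives 0.72.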

Proposed text categorization technique
In the first stage, the proposed method finds the medoids of the individual categories. Then it creates the effective neighborhood, EN, of the test document $\vec{d_t}$. EN is then rearranged in non-increasing order of the similarity values between $\vec{d_t}$ and the individual members of EN. The method considers the first L documents of EN and stores them in $S_L$ to categorize $\vec{d_t}$. The initial value of L is predefined, and it is denoted by $\alpha$ in Algorithm 1. Subsequently, $W(\vec{d}, \vec{d_t})$ is computed for each document $\vec{d} \in S_L$. The weight of a category $C_j$, $j = 1, 2, \ldots, r$ is computed by aggregating the weights of the individual documents of $C_j$ as follows:

$$W(C_j) = \sum_{\vec{d} \in S_L \cap C_j} W(\vec{d}, \vec{d_t}) \tag{7}$$
The weights of the maximum and the second maximum category are obtained from the set of category weights $\{W(C_j) : j = 1, \ldots, r\}$. Let them be called $W(C_{max1})$ and $W(C_{max2})$, respectively. These weights are then divided by the total number of documents of the respective categories, i.e., $|C_{max1}|$ and $|C_{max2}|$, respectively, to get normalized scores. The proposed decision rule assigns the test document to the best category when the normalized weights of the best category and its competing category differ by more than a predefined threshold, say, $\beta$, i.e., if $\frac{W(C_{max1})}{|C_{max1}|} - \frac{W(C_{max2})}{|C_{max2}|} > \beta$. If this criterion is not satisfied, then the value of L is increased by 1 and the weight of the next document in EN is computed. The method is repeated until the aforesaid condition is satisfied or the method has checked all the members of EN. In the worst case, $d_t$ is kept unclassified if the method cannot categorize it after exploring all the documents in EN. The steps of the proposed method are presented in Algorithm 1.
Note that an initial neighborhood size of $\alpha = 1$ implies the one-nearest-neighbor decision rule, and thus the minimum value of $\alpha$ is 2. The value of $\alpha$ can be at most |EN|. Note that the threshold $\beta$ ensures a sufficient difference between the weights of the majority category and its competing categories, and thus it enriches the confidence of the decision making. The value of $\beta$ is at least 0. As the category weights are normalized between 0 and 1, the maximum value of $\beta$ cannot be greater than 1. Thus the value of $\beta$ lies between 0 and 1, both inclusive.
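The decision loop of this section can be sketched as follows. It is a hedged illustration rather than Algorithm 1 itself: the names `categorize`, `alpha`, `beta` and `cat_sizes` are hypothetical, the input is assumed to be the list of effective neighbors already sorted nearest first with their weights computed, and the category size used for normalization is assumed to be the number of training documents in the category. A test document with fewer than `alpha` effective neighbors is left unclassified in this sketch.

```python
from collections import defaultdict

def categorize(neighbors, cat_sizes, alpha=3, beta=0.3):
    """Return the assigned category, or None if the test document stays unclassified.

    neighbors: list of (label, weight) pairs, nearest effective neighbor first
    cat_sizes: label -> number of documents in that category (assumed training counts)
    alpha:     initial number of neighbors considered
    beta:      required gap between the two best normalized category weights
    """
    totals = defaultdict(float)
    for label, w in neighbors[:alpha - 1]:
        totals[label] += w
    for L in range(alpha, len(neighbors) + 1):
        label, w = neighbors[L - 1]
        totals[label] += w  # add the weight of the L-th neighbor
        ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
        best_label, best_w = ranked[0]
        best_score = best_w / cat_sizes[best_label]
        # A category with no neighbor so far has weight 0.
        second_score = ranked[1][1] / cat_sizes[ranked[1][0]] if len(ranked) > 1 else 0.0
        if best_score - second_score > beta:
            return best_label
    return None

neighbors = [("A", 0.9), ("A", 0.8), ("B", 0.7), ("A", 0.85), ("B", 0.2)]
cat_sizes = {"A": 3, "B": 2}
print(categorize(neighbors, cat_sizes))            # A
print(categorize(neighbors, cat_sizes, beta=0.9))  # None
```

With the first three neighbors the normalized gap is about 0.22, which is below 0.3, so the neighborhood grows to four documents, where the gap becomes 0.5 and category A is chosen. Raising the threshold to 0.9 leaves the document unclassified, which illustrates the trade-off between confidence and coverage.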

Description of data
The proposed method and the state-of-the-art classifiers are evaluated using seven text corpora. All the corpora were developed by Karypis and Han [25], and they are mostly collected from TREC. The corpora contain between 204 and 4069 documents, and the number of terms ranges from 3758 to 18,483. The number of categories varies from 5 to 25. An overview of the corpora is presented in Table 1.

Evaluation techniques
The performance of the proposed method and the state-of-the-art classifiers is evaluated using the standard precision, recall and f-measure [13]. The precision and recall for a two-class classification problem can be computed as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

Here TP stands for true positive and it counts the number of documents correctly predicted to the positive category. FP stands for false positive and it counts the number of documents that actually belong to the negative category, but are predicted as positive (i.e., falsely predicted as positive). FN stands for false negative and it counts the number of documents that actually belong to the positive category, but are predicted as negative. The f-measure combines recall and precision with an equal weight in the following form:

$$\text{F-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

The closer the values of precision and recall, the higher is the f-measure [26]. F-measure becomes 1 when the values of precision and recall are 1, and it becomes 0 when precision is 0, or recall is 0, or both are 0. Thus f-measure lies between 0 and 1. A high f-measure value is desirable for good classification [26]. There are two conventional methods to generalize these evaluation functions for multi-class classification problems, namely macro-averaging and micro-averaging [27]. The macro-averaged measure finds the precision and recall score for each class, and then these scores for all the categories are aggregated [13]. The micro-averaged measure individually aggregates the true positives, false positives and false negatives over all the categories and then finds the precision and recall [13]. We have used both macro-averaged and micro-averaged f-measure to evaluate the performance of the classifiers.
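The two averaging schemes can be illustrated on per-category counts. In the sketch below the function names and the counts are hypothetical, and the macro-averaged f-measure is assumed to be computed from the arithmetic means of the per-category precision and recall; other conventions exist, for example averaging the per-category f-measures directly.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from counts; a ratio with an empty denominator is taken as 0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def f_measure(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f(counts):
    """counts: list of (TP, FP, FN) tuples, one per category."""
    per_class = [precision_recall(*c) for c in counts]
    macro_p = sum(p for p, _ in per_class) / len(per_class)
    macro_r = sum(r for _, r in per_class) / len(per_class)
    # Micro-averaging pools the counts of all categories first.
    tp, fp, fn = (sum(col) for col in zip(*counts))
    return f_measure(macro_p, macro_r), f_measure(*precision_recall(tp, fp, fn))

# Hypothetical counts for a large and a small category.
counts = [(8, 2, 2), (1, 1, 3)]
macro_f, micro_f = macro_micro_f(counts)
```

For these counts the micro-averaged value (about 0.69) is dominated by the large category, while the macro-averaged value (about 0.58) is pulled down by the poorly classified small category, which is why both are usually reported together.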

Experimental setup
The performance of the proposed method is compared with the SVM [10], kNN [1], weighted kNN [6] and TkNN [5,28] classifiers. It may be noted that SVM has been widely used for text categorization in the last few years [29], and so the performance of SVM is reported in this work for comparison. The proposed method has been introduced in the spirit of the nearest-neighbor decision rule, and therefore its performance is compared with the kNN, weighted kNN and TkNN classifiers. The corpora used here have no specific training and test sets. Therefore we have randomly split each data set into two parts: 80% is considered as the training set and the rest as the test set. The random split is done in such a way that documents of each category are represented in both the training and the test set. The training set is used to train the classifiers and the test set is used to evaluate the performance of the individual classifiers.
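A split of this kind can be obtained by sampling within each category separately. The following sketch is an assumption about how such a split might be done, not the authors' procedure; the function name, the seed and the toy labels are hypothetical, and any category with at least two documents is forced to appear on both sides.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.8, seed=0):
    """Split document indices so that each category appears in both parts."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_cat[lab].append(idx)
    train, test = [], []
    for idxs in by_cat.values():
        rng.shuffle(idxs)
        cut = round(train_frac * len(idxs))
        if len(idxs) > 1:
            # Keep at least one document of the category on each side.
            cut = min(max(cut, 1), len(idxs) - 1)
        train += idxs[:cut]
        test += idxs[cut:]
    return sorted(train), sorted(test)

labels = ["A"] * 10 + ["B"] * 5
train_idx, test_idx = stratified_split(labels)
```

Here 8 of the 10 category A documents and 4 of the 5 category B documents go to the training part, so the 80/20 proportion holds within each category as well as overall.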
The proposed algorithm has two major parameters. The first one is $\alpha$, which is used to initialize the neighborhood of the test document, and the other one is $\beta$, which is used as the bound on the weights of the competing categories. It may be noted that $\alpha \in \{2, 3, \ldots, |EN|\}$, where EN is the set of effective neighbors of the test document. In the experiments $\alpha = 3$ is used. The value of $\beta$ is experimentally fixed by using the grid search-based tenfold cross-validation technique on the training set with $\beta = 0.1, 0.2, 0.3, 0.4, 0.5$. The value of $\beta$ is fixed as 0.3.
The parameters of the state-of-the-art classifiers, e.g., kNN and SVM, are tuned using the grid search-based tenfold cross-validation technique on the training set. In the case of the kNN and weighted kNN classifiers, the value of k is chosen by varying it from 2 to 20. The state-of-the-art classifiers are implemented using Scikit-learn [30], a machine learning tool in Python.

Analysis of results
The performance of the proposed method and the state-of-the-art classifiers on different text corpora is shown in Tables 2 and 3 using micro-averaged and macro-averaged f-measure, respectively. The raw text data are transformed into feature vectors using the vector space model as described in Sect. 3. The values of the parameter k that have been selected by the tenfold cross-validation technique on the training set to perform the kNN and weighted kNN algorithms on the test documents are shown in Tables 2 and 3.

Table 1 Overview of the corpora

Dataset    #Documents    #Terms    #Categories
re1        1657          3758      25
reviews    4069          18,483    5
tr45       690           8261      10
tr41       878           7454      10
tr11       414           6429      9
tr23       204           5832      6
tr12       303

Tables 2 and 3 show that the proposed method performs better than the other classifiers for all the data sets except tr41. For the tr41 data set, SVM performs better than the proposed method in terms of both macro-averaged and micro-averaged f-measure scores. It can be seen from Tables 2 and 3 that there are 56 comparisons for the proposed method, and the proposed one has performed better than the other methods in 51 cases. The statistical significance of these results is to be tested. For example, for tr12, the macro-averaged f-measure of SVM is 0.85 and for the proposed method it is 0.86, so we have to test whether this difference is statistically significant.
A paired t test is suitable for testing the equality of means when the variances are unknown. A suitable test statistic is described and tabled in [31] and [32], respectively. The statistic uses the null hypothesis of equal means, assuming unequal variances on the same sample size. The statistic t is measured as $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}$, where $\bar{x}_1, \bar{x}_2$ are the means, $s_1, s_2$ are the standard deviations and $n_1, n_2$ are the numbers of observations [31]. It has been found that the results are statistically significant in 39 out of 51 cases where the proposed technique performs better than the other methods, for the level of significance 0.05. The test results are statistically significant in 3 out of 5 cases for the same level of significance when other methods have an edge over the proposed technique. Thus in 92.85% cases the performance of the proposed technique is significantly better than the other classifiers. The effectiveness of the proposed method can be observed from these results.
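The test statistic for two means with unequal variances is straightforward to compute. In the sketch below the two means are the tr12 macro-averaged f-measure values quoted in the text, whereas the standard deviations and the numbers of observations are hypothetical placeholders, since they are not reported here; the function name is also an assumption.

```python
import math

def t_statistic(mean1, mean2, sd1, sd2, n1, n2):
    """t statistic for the difference of two means with unequal variances."""
    return (mean1 - mean2) / math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Means from the tr12 example; sd = 0.01 and n = 10 are hypothetical placeholders.
t = t_statistic(0.86, 0.85, 0.01, 0.01, 10, 10)
print(round(t, 3))  # 2.236
```

The resulting value is then compared with the critical value of the t distribution at the chosen level of significance, which depends on the degrees of freedom and is therefore not computed in this sketch.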
The robustness of different classification algorithms can be determined by using the idea of Friedman [33]. The robustness of a classifier h for a particular data set is defined as $R_h = E_h / E_0$, where $E_h$ is either the macro-averaged or the micro-averaged f-measure of h and $E_0 = \max_h E_h$ [28]. The best classifier for a particular corpus will have $R_h = 1$, while the other competing algorithms will have $R_h \leq 1$. Lower values of $R_h$ indicate the lack of robustness of the algorithm h. We have computed this ratio for all the classifiers and for all the corpora using micro-averaged and macro-averaged f-measure, and they are graphically shown by box-plots in Figs. 1 and 2, respectively. It can be observed from these figures that the proposed method outperforms the competing classifiers.
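The robustness ratio amounts to dividing each classifier's score on a corpus by the best score on that corpus. The sketch below uses only the two tr12 macro-averaged f-measure values quoted in the text; the function name and the dictionary keys are hypothetical.

```python
def robustness(scores):
    """scores: classifier name -> f-measure on one corpus; returns score / best score."""
    best = max(scores.values())
    return {h: e / best for h, e in scores.items()}

# tr12 macro-averaged f-measure values mentioned in the text.
ratios = robustness({"SVM": 0.85, "proposed": 0.86})
```

The best classifier on a corpus always gets a ratio of exactly 1, and the ratios of the others show how far they fall behind it, which makes scores from corpora of different difficulty comparable in one box-plot.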

Conclusion
A method has been introduced in this article to overcome some of the limitations of the state-of-the-art nearest-neighbor decision rules for effective text categorization. The performance of the proposed method is evaluated on different standard benchmark corpora. The method uses a parameter $\beta$ to provide a bound on the difference between the weights of the competing categories. Note that for a high value of $\beta$, many documents may remain unclassified, and for a low value of $\beta$, we may compromise on the quality of the decision making. Thus the choice of $\beta$ is crucial. In the experiments, the value of $\beta$ is chosen using the cross-validation technique on the training set. The empirical analysis shows that the proposed technique outperforms the state-of-the-art classifiers in most of the cases. It is also observed that no document remains unclassified by the proposed method for any of the corpora. This supports the effectiveness of the method. In future, the performance of the proposed method should be tested in other applications, e.g., customer review analysis.