
A supervised term selection technique for effective text categorization

  • Original Article

International Journal of Machine Learning and Cybernetics

Abstract

Term selection methods in text categorization effectively reduce the size of the vocabulary to improve the quality of the classifier. Every corpus generally contains many irrelevant and noisy terms, which eventually reduce the effectiveness of text categorization. Term selection therefore focuses on identifying the terms relevant to each category without affecting the quality of text categorization. A new supervised term selection technique has been proposed for dimensionality reduction. The method assigns a score to each term of a corpus based on its similarity with all the categories, and all the terms of the corpus are then ranked accordingly. Subsequently, the significant terms of each category are selected to create the final subset of terms, irrespective of the size of the category. The performance of the proposed term selection technique is compared with that of nine other term selection methods for categorizing several well-known text corpora using kNN and SVM classifiers. The empirical results show that the proposed method performs significantly better than the other methods in most cases across all the corpora.
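The selection step the abstract describes can be pictured with a minimal sketch (ours, not the authors' implementation; the score dictionary, the lower-is-better convention and top_k are illustrative assumptions, following the appendix's note that a minimum TRL value indicates the optimum term):

    # Hypothetical sketch: given a relevance score for every (term, category)
    # pair, rank the terms of each category and keep its top_k terms,
    # irrespective of the category's size.
    from collections import defaultdict

    def select_terms(scores, top_k):
        # scores: dict mapping (term, category) -> score, lower = more relevant
        per_category = defaultdict(list)
        for (term, category), s in scores.items():
            per_category[category].append((s, term))
        selected = set()
        for ranked in per_category.values():
            ranked.sort()                      # best (smallest) scores first
            selected.update(term for _, term in ranked[:top_k])
        return selected                        # final subset of terms

    # e.g., select_terms({("ball", "sport"): 0.1, ("vote", "sport"): 0.8,
    #                     ("vote", "politics"): 0.2}, top_k=1)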


Notes

  1. http://in.mathworks.com/help/stats/knnsearch.html.

  2. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

  3. http://www.textfixer.com/resources/common-english-words.txt.

  4. The test statistic is of the form \(t=\frac{\overline{x}_1-\overline{x}_2}{\sqrt{s^2_1/n_1+s^2_2/n_2}}\), where \(\overline{x}_1, \overline{x}_2\) are the means, \(s_1, s_2\) are the standard deviations and \(n_1, n_2\) are the number of observations.
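As a concrete illustration, a minimal sketch of the statistic in note 4 (the sample values are illustrative assumptions, not results from the paper):

    # t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2), as in note 4
    from math import sqrt
    from statistics import mean, stdev

    def t_statistic(x1, x2):
        n1, n2 = len(x1), len(x2)
        return (mean(x1) - mean(x2)) / sqrt(stdev(x1) ** 2 / n1 + stdev(x2) ** 2 / n2)

    # e.g., comparing two methods' scores over repeated runs
    print(t_statistic([0.82, 0.85, 0.81], [0.78, 0.76, 0.79]))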


Author information


Correspondence to Tanmay Basu.

Appendix

In this section we shall justify the claims made in Sect. 4.2 for all possible TRL values between a term and a category. Let \(t_{1}, t_{2a}, t_{2b}, t_{2c}, t_{2d}\) and \(t_{2e}\) be the terms obtained by case 1, case 2(a), case 2(b), case 2(c), case 2(d) and case 2(e), respectively, and let all these terms belong to the same category, i.e., \(C_{i}\). Similarly, let \(t_{3a}, t_{3b}, t_{3c}, t_{3d}\) and \(t_{3e}\) be the terms obtained by case 3(a), case 3(b), case 3(c), case 3(d) and case 3(e), respectively, all belonging to the same category \(C_{i}\).

Claim 1

\(t_{1}\) gets the highest preference among all the terms.

Justification: The minimum TRL value between a term t and a category c is 0, and it can be obtained only when \(P(t, C_{i})=P(t)=P(C_{i})\); in all other cases the TRL value is greater than 0. \(t_{1}\) is such a term, with \(TRL(t_{1},C_{i})=0\). Note that the minimum TRL value indicates the optimum term. The terms of all the sub-cases of case 2 and case 3 are obtained by using either TF, TRF or TCR. It has been explained in Sect. 4 that the values of TF, TRF and TCR lie in (0, 1) whenever they are applied to find the TRL between a term and a category. Thus the TRL values of all the sub-cases of case 2 and case 3 lie in (0, 1). Hence \(t_{1}\) gets the highest preference among all the terms.

Claim 2(a)

\(t_{2a}\) gets higher preference than \(t_{2b}\) by TRL when \(P(t_{2a},C_{i})=P(t_{2b},C_{i})\).

Justification: As \(1+P(t_{2b}) > 1+P(C_{i})\), we have

$$\begin{aligned} \frac{1+P(t_{2a},C_{i})}{1+P(C_{i})} = \frac{1+P(t_{2b},C_{i})}{1+P(C_{i})} > \frac{1+P(t_{2b},C_{i})}{1+P(t_{2b})} \end{aligned}$$
$$\begin{aligned} \therefore\, TRL(t_{2a},C_{i}) &= 1- \frac{1+P(t_{2a},C_{i})}{1+P(C_{i})} \times E(C_{i}) \\ &< 1- \frac{1+P(t_{2b},C_{i})}{1+P(t_{2b})} \times E(C_{i}) \\ &= TRL(t_{2b},C_{i}) \end{aligned}$$

Hence the terms of case 2(a) get higher preference than the terms of case 2(b) for the same \(P(C_{i})\) and \(P(t,C_{i})\) values.

Claim 2(b)

If \(P(t_{2b})=P(t_{2c})\) then \(t_{2b}\) gets higher preference than \(t_{2c}\) by TRL.

Justification: It can be seen that \(P(t_{2b},C_{i})>P(t_{2c},C_{i})\), since both terms occur in the same category \(C_{i}\) with \(P(t_{2b},C_{i})=P(C_{i})\) and \(P(t_{2c},C_{i})<P(C_{i})\).

$$\begin{aligned} \therefore\, TRL(t_{2b},C_{i}) &= 1- \frac{1+P(t_{2b},C_{i})}{1+P(t_{2b})} \times E(C_{i}) \\ &< 1- \frac{P(t_{2b},C_{i})}{P(t_{2b})} \times E(C_{i}) \\ &= 1- \frac{P(t_{2b},C_{i})}{P(t_{2c})} \times E(C_{i}) \\ &< 1- \frac{P(t_{2c},C_{i})}{P(t_{2c})} \times E(C_{i}) \\ &= 1- \frac{P(t_{2c})-P(t_{2c},\overline{C_{i}})}{P(t_{2c})} \times E(C_{i}) \\ &= TRL(t_{2c},C_{i}) \end{aligned}$$

Hence for the same \(P(C_{i})\) and P(t) values the terms of case 2(b) get higher preference than the terms of case 2(c).

Claim 2(c)

If \(P(t_{2b})=P(t_{2d})\) then \(t_{2b}\) gets higher preference than \(t_{2d}\) by TRL.

Justification: It can be seen that \(P(t_{2b},C_{i})>P(t_{2d},C_{i})\), since both terms occur in the same category \(C_{i}\) with \(P(t_{2b},C_{i})=P(C_{i})\) and \(P(t_{2d},C_{i})<P(C_{i})\).

$$\begin{aligned} \therefore\, TRL(t_{2b},C_{i}) &= 1- \frac{1+P(t_{2b},C_{i})}{1+P(t_{2b})} \times E(C_{i}) \\ &< 1- \frac{P(t_{2b},C_{i})}{P(t_{2b})} \times E(C_{i}) \\ &= 1- \frac{P(C_{i})}{P(t_{2b})} \times E(C_{i}) \\ &< 1- \frac{P(C_{i})-P(t_{2d},C_{i})}{P(t_{2d})-P(t_{2d},C_{i})} \times E(C_{i}) \\ &< 1- \frac{P(C_{i})-P(t_{2d},C_{i})}{P(t_{2d})-P(t_{2d},C_{i})} \times \frac{P(t_{2d})-P(t_{2d},\overline{C_{i}})}{P(t_{2d})} \times E(C_{i}) \\ &= TRL(t_{2d},C_{i}) \end{aligned}$$

as the product of two values in (0, 1) is smaller than either of them (e.g., \(0.4 \times 0.3 =0.12\) is less than both 0.4 and 0.3). Hence for the same \(P(C_{i})\) and P(t) values the terms of case 2(b) get higher preference than the terms of case 2(d).

Claim 2(d)

\(t_{2a}\) gets higher preference than \(t_{2e}\) by TRL, if \(P(t_{2a})=P(t_{2e})\).

Justification: Note that \(P(t_{2a})= P(t_{2a},C_{i})\) and \(P(t_{2a},C_{i})>P(t_{2e},C_{i})\), as both \(t_{2a}\) and \(t_{2e}\) belong to \(C_{i}\).

$$\begin{aligned} \therefore\, TRL(t_{2a},C_{i}) &= 1- \frac{1+P(t_{2a},C_{i})}{1+P(C_{i})} \times E(C_{i}) \\ &< 1- \frac{P(t_{2a},C_{i})}{P(C_{i})} \times E(C_{i}) \\ &= 1- \frac{P(t_{2e})}{P(C_{i})} \times E(C_{i}) \\ &< 1- \frac{P(t_{2e})-P(t_{2e},C_{i})}{P(C_{i})-P(t_{2e},C_{i})} \times E(C_{i}) \\ &< 1- \frac{P(t_{2e})-P(t_{2e},C_{i})}{P(C_{i})-P(t_{2e},C_{i})} \times \frac{P(t_{2e})-P(t_{2e}, \overline{C_{i}})}{P(t_{2e})} \times E(C_{i}) \\ &= TRL(t_{2e},C_{i}) \end{aligned}$$

as the product of two values in (0, 1) is smaller than either of them. Thus the terms of case 2(a) get higher preference than the terms of case 2(e) for the same \(P(C_{i})\) and P(t) values.

Claim 2(e)

\(t_{2c}\) gets higher preference than \(t_{2d}\) by TRL when \(P(t_{2c},C_{i})=P(t_{2d},C_{i})\).

Justification:

$$\begin{aligned} TRL(t_{2c},C_{i}) &= 1- \frac{P(t_{2c})-P(t_{2c},\overline{C_{i}})}{P(t_{2c})} \times E(C_{i}) \\ &< 1- \frac{P(t_{2d})-P(t_{2d},\overline{C_{i}})}{P(t_{2d})} \times E(C_{i}) \qquad [\because\, P(t_{2c},C_{i})=P(t_{2d},C_{i}) \text{ and } P(t_{2d})>P(t_{2c})] \\ &< 1- \frac{P(C_{i})-P(t_{2d},C_{i})}{P(t_{2d})-P(t_{2d},C_{i})} \times \frac{P(t_{2d})-P(t_{2d},\overline{C_{i}})}{P(t_{2d})} \times E(C_{i}) \\ &= TRL(t_{2d},C_{i}) \end{aligned}$$

since the product of two values in (0, 1) is smaller than either of them. Therefore the terms of case 2(c) get higher preference than the terms of case 2(d) for the same \(P(C_{i})\) and \(P(t,C_{i})\) values.
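Before turning to case 3, the case-2 TRL expressions displayed above are collected in the following sketch so that Claims 2(a)-2(e) can be checked numerically. This is our illustration, not the authors' code; every probability value and \(E(C_{i})\) below is an assumption, and the identity \(P(t)-P(t,\overline{C_{i}})=P(t,C_{i})\) is used to simplify cases 2(c)-2(e).

    # Case-wise TRL expressions, as displayed in the justifications above.
    # p_tc = P(t, C_i), p_t = P(t), p_c = P(C_i), e_c = E(C_i).

    def trl_2a(p_tc, p_c, e_c):
        return 1 - (1 + p_tc) / (1 + p_c) * e_c

    def trl_2b(p_tc, p_t, e_c):
        return 1 - (1 + p_tc) / (1 + p_t) * e_c

    def trl_2c(p_tc, p_t, e_c):
        return 1 - (p_tc / p_t) * e_c

    def trl_2d(p_tc, p_t, p_c, e_c):
        return 1 - ((p_c - p_tc) / (p_t - p_tc)) * (p_tc / p_t) * e_c

    def trl_2e(p_tc, p_t, p_c, e_c):
        return 1 - ((p_t - p_tc) / (p_c - p_tc)) * (p_tc / p_t) * e_c

    e_c = 0.9  # assumed value of E(C_i)

    # Claim 2(a): equal P(t, C_i); P(t_2b) = 0.30 exceeds P(C_i) = 0.25
    assert trl_2a(0.20, 0.25, e_c) < trl_2b(0.20, 0.30, e_c)

    # Claim 2(b): P(t_2b) = P(t_2c) = 0.30; P(t_2b, C_i) = P(C_i) = 0.25
    assert trl_2b(0.25, 0.30, e_c) < trl_2c(0.20, 0.30, e_c)

    # Claim 2(c): P(t_2b) = P(t_2d) = 0.30; P(t_2d, C_i) = 0.20 < P(C_i) = 0.25
    assert trl_2b(0.25, 0.30, e_c) < trl_2d(0.20, 0.30, 0.25, e_c)

    # Claim 2(d): P(t_2a) = P(t_2e) = 0.28; P(C_i) = 0.30
    assert trl_2a(0.28, 0.30, e_c) < trl_2e(0.26, 0.28, 0.30, e_c)

    # Claim 2(e): P(t_2c, C_i) = P(t_2d, C_i) = 0.26; P(t_2d) = 0.32 > P(t_2c) = 0.30
    assert trl_2c(0.26, 0.30, e_c) < trl_2d(0.26, 0.32, 0.30, e_c)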

Claim 3(a)

\(t_{3a}, t_{3b}, t_{3c}, t_{3d}\) and \(t_{3e}\) get lower preference than \(t_{2a}, t_{2b}, t_{2c}, t_{2d}\) and \(t_{2e}\) in \(C_{i}\).

Justification: It has been justified in Claim 2(d) that \(t_{2a}\) gets higher preference than \(t_{2e}\) if \(P(t_{2a}) =P(t_{2e})\), and Claim 2(b) states that \(t_{2b}\) gets higher preference than \(t_{2c}\) if \(P(t_{2b})=P(t_{2c})\). Hence, to justify Claim 3(a), it remains to show the following:

  • (i) \(t_{3a}\) gets lower preference than \(t_{2c}\) by TRL.

    $$\begin{aligned} TRL(t_{2c},C_{i}) &= 1- \frac{P(t_{2c})-P(t_{2c},\overline{C_{i}})}{P(t_{2c})} \times E(C_{i}) \\ TRL(t_{3a},C_{i}) &= 1- \frac{1+P(t_{3a},C_{i})}{1+P(C_{i})} \times E(C_{i}) \end{aligned}$$

    Since \(P(C_{i})\gg P(t_{3a})\) and \(P(t_{3a})=P(t_{3a},C_{i})\), we have \(1+P(t_{3a},C_{i}) \ll 1+P(C_{i})\), so \(\displaystyle \frac{1+P(t_{3a},C_{i})}{1+P(C_{i})} \ll 1\) and, as a result, \(TRL(t_{3a},C_{i})\) is close to 1. On the other hand, \(P(t_{2c},C_{i}) < P(C_{i}) = P(t_{2c})\) and \(P(t_{2c})\) is close to \(P(t_{2c},C_{i})\), so \(\displaystyle \frac{P(t_{2c})-P(t_{2c},\overline{C_{i}})}{P(t_{2c})}\) is close to 1. Hence \(TRL(t_{2c},C_{i})\) is less than \(TRL(t_{3a},C_{i})\), and thus \(t_{3a}\) gets lower preference than \(t_{2c}\) by TRL.

  • (ii) \(t_{3a}\) gets lower preference than \(t_{2d}\) by TRL.

    $$\begin{aligned} TRL(t_{2d},C_{i}) &= 1- \frac{P(C_{i})-P(t_{2d},C_{i})}{P(t_{2d})-P(t_{2d},C_{i})} \times \frac{P(t_{2d})-P(t_{2d},\overline{C_{i}})}{P(t_{2d})} \times E(C_{i}) \\ TRL(t_{3a},C_{i}) &= 1- \frac{1+P(t_{3a},C_{i})}{1+P(C_{i})} \times E(C_{i}) \end{aligned}$$

    It has been shown that \(TRL(t_{3a},C_{i})\) is close to 1. Note that \(P(t_{2d},C_{i}) < P(C_{i}) < P(t_{2d})\), and both \(P(t_{2d},C_{i})\) and \(P(t_{2d})\) are close to \(P(C_{i})\); thus \(P(t_{2d})\) is close to \(P(t_{2d},C_{i})\). Therefore \(\displaystyle \frac{P(t_{2d})-P(t_{2d},\overline{C_{i}})}{P(t_{2d})}\) and \(\displaystyle \frac{P(C_{i})-P(t_{2d},C_{i})}{P(t_{2d})-P(t_{2d},C_{i})}\) are both close to 1, so \(TRL(t_{2d},C_{i})\) becomes close to 0. Hence \(TRL(t_{2d},C_{i}) < TRL(t_{3a},C_{i})\) and \(t_{3a}\) gets lower preference than \(t_{2d}\).

  • (iii) \(t_{3a}\) gets lower preference than \(t_{2e}\) by TRL.

    $$\begin{aligned} TRL(t_{2e},C_{i}) &= 1- \frac{P(t_{2e})-P(t_{2e},C_{i})}{P(C_{i})-P(t_{2e},C_{i})} \times \frac{P(t_{2e})-P(t_{2e},\overline{C_{i}})}{P(t_{2e})} \times E(C_{i}) \\ TRL(t_{3a},C_{i}) &= 1- \frac{1+P(t_{3a},C_{i})}{1+P(C_{i})} \times E(C_{i}) \end{aligned}$$

    \(P(t_{2e},C_{i}) < P(t_{2e}) < P(C_{i})\), and both \(P(t_{2e},C_{i})\) and \(P(t_{2e})\) are close to \(P(C_{i})\); thus \(P(t_{2e})\) is close to \(P(t_{2e},C_{i})\). Therefore \(\displaystyle \frac{P(t_{2e})-P(t_{2e},\overline{C_{i}})}{P(t_{2e})}\) and \(\displaystyle \frac{P(t_{2e})-P(t_{2e},C_{i})}{P(C_{i})-P(t_{2e},C_{i})}\) are both close to 1, so \(TRL(t_{2e},C_{i})\) is close to 0. Since \(TRL(t_{3a},C_{i})\) is close to 1, \(TRL(t_{2e},C_{i}) < TRL(t_{3a},C_{i})\), i.e., \(t_{3a}\) gets lower preference than \(t_{2e}\) by TRL.
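A quick numeric check of items (i)-(iii), in the same spirit as the sketch after Claim 2(e); all probability values and \(E(C_{i})\) are assumed for illustration:

    # Assumed probabilities with P(C_i) = 0.30 and e_c = E(C_i) = 0.9.
    e_c, p_c = 0.9, 0.30

    # case 3(a): P(t) = P(t, C_i) = 0.01, far below P(C_i)
    trl_3a = 1 - (1 + 0.01) / (1 + p_c) * e_c                             # ~0.30

    # case 2(c): P(t) = P(C_i) = 0.30, P(t, C_i) = 0.28
    trl_2c = 1 - (0.28 / 0.30) * e_c                                      # ~0.16

    # case 2(d): P(t, C_i) = 0.28 < P(C_i) = 0.30 < P(t) = 0.301
    trl_2d = 1 - ((p_c - 0.28) / (0.301 - 0.28)) * (0.28 / 0.301) * e_c   # ~0.20

    # case 2(e): P(t, C_i) = 0.28 < P(t) = 0.299 < P(C_i) = 0.30
    trl_2e = 1 - ((0.299 - 0.28) / (p_c - 0.28)) * (0.28 / 0.299) * e_c   # ~0.20

    assert max(trl_2c, trl_2d, trl_2e) < trl_3a  # t_3a is ranked last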

Fig. 1 All possible TRL values of a term t and a category c

Claim 3(b)

\(t_{3a}\) gets higher preference than \(t_{3b}\) by TRL when \(P(t_{3a},C_{i})=P(t_{3b},C_{i})\).

Claim 3(c)

If \(P(t_{3b})=P(t_{3c})\) then \(t_{3b}\) gets higher preference than \(t_{3c}\) by TRL.

Claim 3(d)

If \(P(t_{3b})=P(t_{3d})\) then \(t_{3b}\) gets higher preference than \(t_{3d}\) by TRL.

Claim 3(e)

\(t_{3a}\) gets higher preference than \(t_{3e}\) by TRL, if \(P(t_{3a})=P(t_{3e})\).

Claim 3(f)

\(t_{3c}\) gets higher preference than \(t_{3d}\) by TRL, if \(P(t_{3c},C_{i}) = P(t_{3d},C_{i})\).

Justification: Claim 3(b) can be justified in the same way as Claim 2(a). Claims 3(c), 3(d), 3(e) and 3(f) can be justified in the same way as Claims 2(b), 2(c), 2(d) and 2(e), respectively.
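Taken together, and under the equal-probability conditions stated in the individual claims, the above justifications induce the overall preference ordering below (the \(\succ\) notation, read "gets higher preference than", is ours):

$$\begin{aligned} t_{1} \succ \{t_{2a}, t_{2b}, t_{2c}, t_{2d}, t_{2e}\} \succ \{t_{3a}, t_{3b}, t_{3c}, t_{3d}, t_{3e}\} \end{aligned}$$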


Cite this article

Basu, T., Murthy, C.A. A supervised term selection technique for effective text categorization. Int. J. Mach. Learn. & Cyber. 7, 877–892 (2016). https://doi.org/10.1007/s13042-015-0421-y
