
A supervised term selection technique for effective text categorization

  • Original Article

International Journal of Machine Learning and Cybernetics

Abstract

Term selection methods in text categorization effectively reduce the size of the vocabulary to improve the quality of the classifier. Every corpus generally contains many irrelevant and noisy terms, which eventually reduce the effectiveness of text categorization. Term selection therefore focuses on identifying the terms relevant to each category without affecting the quality of text categorization. A new supervised term selection technique has been proposed for dimensionality reduction. The method assigns a score to each term of a corpus based on its similarity with all the categories, and all the terms of the corpus are then ranked accordingly. Subsequently, the significant terms of each category are selected to create the final subset of terms, irrespective of the size of the category. The performance of the proposed term selection technique is compared with that of nine other term selection methods for categorizing several well-known text corpora using kNN and SVM classifiers. The empirical results show that the proposed method performs significantly better than the other methods in most cases across all the corpora.
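The selection step the abstract describes can be pictured with a minimal sketch (ours, not the authors' implementation; the score dictionary, the lower-is-better convention and top_k are illustrative assumptions, following the appendix's note that a minimum TRL value indicates the optimum term):

    # Hypothetical sketch: given a relevance score for every (term, category)
    # pair, rank the terms of each category and keep its top_k terms,
    # irrespective of the category's size.
    from collections import defaultdict

    def select_terms(scores, top_k):
        # scores: dict mapping (term, category) -> score, lower = more relevant
        per_category = defaultdict(list)
        for (term, category), s in scores.items():
            per_category[category].append((s, term))
        selected = set()
        for ranked in per_category.values():
            ranked.sort()                      # best (smallest) scores first
            selected.update(term for _, term in ranked[:top_k])
        return selected                        # final subset of terms

    # e.g., select_terms({("ball", "sport"): 0.1, ("vote", "sport"): 0.8,
    #                     ("vote", "politics"): 0.2}, top_k=1)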


Notes

  1. http://in.mathworks.com/help/stats/knnsearch.html.

  2. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

  3. http://www.textfixer.com/resources/common-english-words.txt.

  4. The test statistic is of the form \(t=\frac{\overline{x}_1-\overline{x}_2}{\sqrt{s^2_1/n_1+s^2_2/n_2}}\), where \(\overline{x}_1, \overline{x}_2\) are the means, \(s_1, s_2\) are the standard deviations and \(n_1, n_2\) are the number of observations.
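As a concrete illustration, a minimal sketch of the statistic in note 4 (the sample values are illustrative assumptions, not results from the paper):

    # t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2), as in note 4
    from math import sqrt
    from statistics import mean, stdev

    def t_statistic(x1, x2):
        n1, n2 = len(x1), len(x2)
        return (mean(x1) - mean(x2)) / sqrt(stdev(x1) ** 2 / n1 + stdev(x2) ** 2 / n2)

    # e.g., comparing two methods' scores over repeated runs
    print(t_statistic([0.82, 0.85, 0.81], [0.78, 0.76, 0.79]))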


Author information


Correspondence to Tanmay Basu.

Appendix

In this section we shall justify the claims made in Sect. 4.2 for all possible TRL values between a term and a category. Let \(t_{1}, t_{2a}, t_{2b}, t_{2c}, t_{2d}\) and \(t_{2e}\) be the terms obtained by case 1, case 2(a), case 2(b), case 2(c), case 2(d) and case 2(e), respectively, and let all these terms belong to the same category, i.e., \(C_{i}\). Similarly, let \(t_{3a}, t_{3b}, t_{3c}, t_{3d}\) and \(t_{3e}\) be the terms obtained by case 3(a), case 3(b), case 3(c), case 3(d) and case 3(e), respectively, all belonging to the same category \(C_{i}\).

Claim 1

\(t_{1}\) gets the highest preference among all the terms.

Justification: The minimum TRL value between a term t and a category c is 0, and it can be obtained only when \(P(t, C_{i})=P(t)=P(C_{i})\); in all other cases the TRL value is greater than 0. \(t_{1}\) is such a term, with \(TRL(t_{1},C_{i})=0\). Note that the minimum TRL value indicates the optimum term. The terms of all the sub-cases of case 2 and case 3 are obtained by using either TF, TRF or TCR. It has been explained in Sect. 4 that the values of TF, TRF and TCR lie in (0, 1) whenever they are applied to find the TRL between a term and a category. Thus the TRL values of all the sub-cases of case 2 and case 3 lie in (0, 1). Hence \(t_{1}\) gets the highest preference among all the terms.

Claim 2(a)

\(t_{2a}\) gets higher preference than \(t_{2b}\) by TRL when \(P(t_{2a},C_{i})=P(t_{2b},C_{i})\).

Justification: As \(1+P(t_{2b}) > 1+P(C_{i})\), we have

$$\begin{aligned} \frac{1+P(t_{2a},C_{i})}{1+P(C_{i})} = \frac{1+P(t_{2b},C_{i})}{1+P(C_{i})} > \frac{1+P(t_{2b},C_{i})}{1+P(t_{2b})} \end{aligned}$$
$$\begin{aligned} \therefore\, TRL(t_{2a},C_{i}) &= 1- \frac{1+P(t_{2a},C_{i})}{1+P(C_{i})} \times E(C_{i}) \\ &< 1- \frac{1+P(t_{2b},C_{i})}{1+P(t_{2b})} \times E(C_{i}) \\ &= TRL(t_{2b},C_{i}) \end{aligned}$$

Hence the terms of case 2(a) get higher preference than the terms of case 2(b) for the same \(P(C_{i})\) and \(P(t,C_{i})\) values.

Claim 2(b)

If \(P(t_{2b})=P(t_{2c})\) then \(t_{2b}\) gets higher preference than \(t_{2c}\) by TRL.

Justification: It can be seen that \(P(t_{2b},C_{i})>P(t_{2c},C_{i})\), since both terms occur in the same category \(C_{i}\) with \(P(t_{2b},C_{i})=P(C_{i})\) and \(P(t_{2c},C_{i})<P(C_{i})\).

$$\begin{aligned} \therefore\, TRL(t_{2b},C_{i}) &= 1- \frac{1+P(t_{2b},C_{i})}{1+P(t_{2b})} \times E(C_{i}) \\ &< 1- \frac{P(t_{2b},C_{i})}{P(t_{2b})} \times E(C_{i}) \\ &= 1- \frac{P(t_{2b},C_{i})}{P(t_{2c})} \times E(C_{i}) \\ &< 1- \frac{P(t_{2c},C_{i})}{P(t_{2c})} \times E(C_{i}) \\ &= 1- \frac{P(t_{2c})-P(t_{2c},\overline{C_{i}})}{P(t_{2c})} \times E(C_{i}) \\ &= TRL(t_{2c},C_{i}) \end{aligned}$$

Hence for the same \(P(C_{i})\) and P(t) values the terms of case 2(b) get higher preference than the terms of case 2(c).

Claim 2(c)

If \(P(t_{2b})=P(t_{2d})\) then \(t_{2b}\) gets higher preference than \(t_{2d}\) by TRL.

Justification: It can be seen that \(P(t_{2b},C_{i})>P(t_{2d},C_{i})\), since both terms occur in the same category \(C_{i}\) with \(P(t_{2b},C_{i})=P(C_{i})\) and \(P(t_{2d},C_{i})<P(C_{i})\).

$$\begin{aligned} \therefore\, TRL(t_{2b},C_{i}) &= 1- \frac{1+P(t_{2b},C_{i})}{1+P(t_{2b})} \times E(C_{i}) \\ &< 1- \frac{P(t_{2b},C_{i})}{P(t_{2b})} \times E(C_{i}) \\ &= 1- \frac{P(C_{i})}{P(t_{2b})} \times E(C_{i}) \\ &< 1- \frac{P(C_{i})-P(t_{2d},C_{i})}{P(t_{2d})-P(t_{2d},C_{i})} \times E(C_{i}) \\ &< 1- \frac{P(C_{i})-P(t_{2d},C_{i})}{P(t_{2d})-P(t_{2d},C_{i})} \times \frac{P(t_{2d})-P(t_{2d},\overline{C_{i}})}{P(t_{2d})} \times E(C_{i}) \\ &= TRL(t_{2d},C_{i}) \end{aligned}$$

as the product of two values in (0, 1) is smaller than either of them (e.g., \(0.4 \times 0.3 =0.12\) is less than both 0.4 and 0.3). Hence for the same \(P(C_{i})\) and P(t) values the terms of case 2(b) get higher preference than the terms of case 2(d).

Claim 2(d)

\(t_{2a}\) gets higher preference than \(t_{2e}\) by TRL, if \(P(t_{2a})=P(t_{2e})\).

Justification: Note that \(P(t_{2a})= P(t_{2a},C_{i})\) and \(P(t_{2a},C_{i})>P(t_{2e},C_{i})\), as both \(t_{2a}\) and \(t_{2e}\) belong to \(C_{i}\).

$$\begin{aligned} \therefore\, TRL(t_{2a},C_{i}) &= 1- \frac{1+P(t_{2a},C_{i})}{1+P(C_{i})} \times E(C_{i}) \\ &< 1- \frac{P(t_{2a},C_{i})}{P(C_{i})} \times E(C_{i}) \\ &= 1- \frac{P(t_{2e})}{P(C_{i})} \times E(C_{i}) \\ &< 1- \frac{P(t_{2e})-P(t_{2e},C_{i})}{P(C_{i})-P(t_{2e},C_{i})} \times E(C_{i}) \\ &< 1- \frac{P(t_{2e})-P(t_{2e},C_{i})}{P(C_{i})-P(t_{2e},C_{i})} \times \frac{P(t_{2e})-P(t_{2e}, \overline{C_{i}})}{P(t_{2e})} \times E(C_{i}) \\ &= TRL(t_{2e},C_{i}) \end{aligned}$$

as the product of two values in (0, 1) is smaller than either of them. Thus the terms of case 2(a) get higher preference than the terms of case 2(e) for the same \(P(C_{i})\) and P(t) values.

Claim 2(e)

\(t_{2c}\) gets higher preference than \(t_{2d}\) by TRL when \(P(t_{2c},C_{i})=P(t_{2d},C_{i})\).

Justification:

$$\begin{aligned} TRL(t_{2c},C_{i}) &= 1- \frac{P(t_{2c})-P(t_{2c},\overline{C_{i}})}{P(t_{2c})} \times E(C_{i}) \\ &< 1- \frac{P(t_{2d})-P(t_{2d},\overline{C_{i}})}{P(t_{2d})} \times E(C_{i}) \qquad [\because\, P(t_{2c},C_{i})=P(t_{2d},C_{i}) \text{ and } P(t_{2d})>P(t_{2c})] \\ &< 1- \frac{P(C_{i})-P(t_{2d},C_{i})}{P(t_{2d})-P(t_{2d},C_{i})} \times \frac{P(t_{2d})-P(t_{2d},\overline{C_{i}})}{P(t_{2d})} \times E(C_{i}) \\ &= TRL(t_{2d},C_{i}) \end{aligned}$$

since the product of two values in (0, 1) is smaller than either of them. Therefore the terms of case 2(c) get higher preference than the terms of case 2(d) for the same \(P(C_{i})\) and \(P(t,C_{i})\) values.
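Before turning to case 3, the case-2 TRL expressions displayed above are collected in the following sketch so that Claims 2(a)-2(e) can be checked numerically. This is our illustration, not the authors' code; every probability value and \(E(C_{i})\) below is an assumption, and the identity \(P(t)-P(t,\overline{C_{i}})=P(t,C_{i})\) is used to simplify cases 2(c)-2(e).

    # Case-wise TRL expressions, as displayed in the justifications above.
    # p_tc = P(t, C_i), p_t = P(t), p_c = P(C_i), e_c = E(C_i).

    def trl_2a(p_tc, p_c, e_c):
        return 1 - (1 + p_tc) / (1 + p_c) * e_c

    def trl_2b(p_tc, p_t, e_c):
        return 1 - (1 + p_tc) / (1 + p_t) * e_c

    def trl_2c(p_tc, p_t, e_c):
        return 1 - (p_tc / p_t) * e_c

    def trl_2d(p_tc, p_t, p_c, e_c):
        return 1 - ((p_c - p_tc) / (p_t - p_tc)) * (p_tc / p_t) * e_c

    def trl_2e(p_tc, p_t, p_c, e_c):
        return 1 - ((p_t - p_tc) / (p_c - p_tc)) * (p_tc / p_t) * e_c

    e_c = 0.9  # assumed value of E(C_i)

    # Claim 2(a): equal P(t, C_i); P(t_2b) = 0.30 exceeds P(C_i) = 0.25
    assert trl_2a(0.20, 0.25, e_c) < trl_2b(0.20, 0.30, e_c)

    # Claim 2(b): P(t_2b) = P(t_2c) = 0.30; P(t_2b, C_i) = P(C_i) = 0.25
    assert trl_2b(0.25, 0.30, e_c) < trl_2c(0.20, 0.30, e_c)

    # Claim 2(c): P(t_2b) = P(t_2d) = 0.30; P(t_2d, C_i) = 0.20 < P(C_i) = 0.25
    assert trl_2b(0.25, 0.30, e_c) < trl_2d(0.20, 0.30, 0.25, e_c)

    # Claim 2(d): P(t_2a) = P(t_2e) = 0.28; P(C_i) = 0.30
    assert trl_2a(0.28, 0.30, e_c) < trl_2e(0.26, 0.28, 0.30, e_c)

    # Claim 2(e): P(t_2c, C_i) = P(t_2d, C_i) = 0.26; P(t_2d) = 0.32 > P(t_2c) = 0.30
    assert trl_2c(0.26, 0.30, e_c) < trl_2d(0.26, 0.32, 0.30, e_c)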

Claim 3(a)

\(t_{3a}, t_{3b}, t_{3c}, t_{3d}\) and \(t_{3e}\) get lower preference than \(t_{2a}, t_{2b}, t_{2c}, t_{2d}\) and \(t_{2e}\) in \(C_{i}\).

Justification: It has been justified in Claim 2(d) that \(t_{2a}\) gets higher preference than \(t_{2e}\) if \(P(t_{2a}) =P(t_{2e})\), and Claim 2(b) states that \(t_{2b}\) gets higher preference than \(t_{2c}\) if \(P(t_{2b})=P(t_{2c})\). Hence, to justify Claim 3(a), it remains to show the following:

  • (i) \(t_{3a}\) gets lower preference than \(t_{2c}\) by TRL.

    $$\begin{aligned} TRL(t_{2c},C_{i}) &= 1- \frac{P(t_{2c})-P(t_{2c},\overline{C_{i}})}{P(t_{2c})} \times E(C_{i}) \\ TRL(t_{3a},C_{i}) &= 1- \frac{1+P(t_{3a},C_{i})}{1+P(C_{i})} \times E(C_{i}) \end{aligned}$$

    Since \(P(C_{i})\gg P(t_{3a})\) and \(P(t_{3a})=P(t_{3a},C_{i})\), we have \(1+P(t_{3a},C_{i}) \ll 1+P(C_{i})\), so \(\displaystyle \frac{1+P(t_{3a},C_{i})}{1+P(C_{i})} \ll 1\) and, as a result, \(TRL(t_{3a},C_{i})\) is close to 1. On the other hand, \(P(t_{2c},C_{i}) < P(C_{i}) = P(t_{2c})\) and \(P(t_{2c})\) is close to \(P(t_{2c},C_{i})\), so \(\displaystyle \frac{P(t_{2c})-P(t_{2c},\overline{C_{i}})}{P(t_{2c})}\) is close to 1. Hence \(TRL(t_{2c},C_{i})\) is less than \(TRL(t_{3a},C_{i})\), and thus \(t_{3a}\) gets lower preference than \(t_{2c}\) by TRL.

  • (ii) \(t_{3a}\) gets lower preference than \(t_{2d}\) by TRL.

    $$\begin{aligned} TRL(t_{2d},C_{i}) &= 1- \frac{P(C_{i})-P(t_{2d},C_{i})}{P(t_{2d})-P(t_{2d},C_{i})} \times \frac{P(t_{2d})-P(t_{2d},\overline{C_{i}})}{P(t_{2d})} \times E(C_{i}) \\ TRL(t_{3a},C_{i}) &= 1- \frac{1+P(t_{3a},C_{i})}{1+P(C_{i})} \times E(C_{i}) \end{aligned}$$

    It has been shown that \(TRL(t_{3a},C_{i})\) is close to 1. Note that \(P(t_{2d},C_{i}) < P(C_{i}) < P(t_{2d})\), and both \(P(t_{2d},C_{i})\) and \(P(t_{2d})\) are close to \(P(C_{i})\); thus \(P(t_{2d})\) is close to \(P(t_{2d},C_{i})\). Therefore \(\displaystyle \frac{P(t_{2d})-P(t_{2d},\overline{C_{i}})}{P(t_{2d})}\) and \(\displaystyle \frac{P(C_{i})-P(t_{2d},C_{i})}{P(t_{2d})-P(t_{2d},C_{i})}\) are both close to 1, so \(TRL(t_{2d},C_{i})\) becomes close to 0. Hence \(TRL(t_{2d},C_{i}) < TRL(t_{3a},C_{i})\) and \(t_{3a}\) gets lower preference than \(t_{2d}\).

  • (iii) \(t_{3a}\) gets lower preference than \(t_{2e}\) by TRL.

    $$\begin{aligned} TRL(t_{2e},C_{i}) &= 1- \frac{P(t_{2e})-P(t_{2e},C_{i})}{P(C_{i})-P(t_{2e},C_{i})} \times \frac{P(t_{2e})-P(t_{2e},\overline{C_{i}})}{P(t_{2e})} \times E(C_{i}) \\ TRL(t_{3a},C_{i}) &= 1- \frac{1+P(t_{3a},C_{i})}{1+P(C_{i})} \times E(C_{i}) \end{aligned}$$

    \(P(t_{2e},C_{i}) < P(t_{2e}) < P(C_{i})\), and both \(P(t_{2e},C_{i})\) and \(P(t_{2e})\) are close to \(P(C_{i})\); thus \(P(t_{2e})\) is close to \(P(t_{2e},C_{i})\). Therefore \(\displaystyle \frac{P(t_{2e})-P(t_{2e},\overline{C_{i}})}{P(t_{2e})}\) and \(\displaystyle \frac{P(t_{2e})-P(t_{2e},C_{i})}{P(C_{i})-P(t_{2e},C_{i})}\) are both close to 1, so \(TRL(t_{2e},C_{i})\) is close to 0. Since \(TRL(t_{3a},C_{i})\) is close to 1, \(TRL(t_{2e},C_{i}) < TRL(t_{3a},C_{i})\), i.e., \(t_{3a}\) gets lower preference than \(t_{2e}\) by TRL.
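A quick numeric check of items (i)-(iii), in the same spirit as the sketch after Claim 2(e); all probability values and \(E(C_{i})\) are assumed for illustration:

    # Assumed probabilities with P(C_i) = 0.30 and e_c = E(C_i) = 0.9.
    e_c, p_c = 0.9, 0.30

    # case 3(a): P(t) = P(t, C_i) = 0.01, far below P(C_i)
    trl_3a = 1 - (1 + 0.01) / (1 + p_c) * e_c                             # ~0.30

    # case 2(c): P(t) = P(C_i) = 0.30, P(t, C_i) = 0.28
    trl_2c = 1 - (0.28 / 0.30) * e_c                                      # ~0.16

    # case 2(d): P(t, C_i) = 0.28 < P(C_i) = 0.30 < P(t) = 0.301
    trl_2d = 1 - ((p_c - 0.28) / (0.301 - 0.28)) * (0.28 / 0.301) * e_c   # ~0.20

    # case 2(e): P(t, C_i) = 0.28 < P(t) = 0.299 < P(C_i) = 0.30
    trl_2e = 1 - ((0.299 - 0.28) / (p_c - 0.28)) * (0.28 / 0.299) * e_c   # ~0.20

    assert max(trl_2c, trl_2d, trl_2e) < trl_3a  # t_3a is ranked last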

Fig. 1 All possible TRL values of a term t and a category c

Claim 3(b)

\(t_{3a}\) gets higher preference than \(t_{3b}\) by TRL when \(P(t_{3a},C_{i})=P(t_{3b},C_{i})\).

Claim 3(c)

If \(P(t_{3b})=P(t_{3c})\) then \(t_{3b}\) gets higher preference than \(t_{3c}\) by TRL.

Claim 3(d)

If \(P(t_{3b})=P(t_{3d})\) then \(t_{3b}\) gets higher preference than \(t_{3d}\) by TRL.

Claim 3(e)

\(t_{3a}\) gets higher preference than \(t_{3e}\) by TRL, if \(P(t_{3a})=P(t_{3e})\).

Claim 3(f)

\(t_{3c}\) gets higher preference than \(t_{3d}\) by TRL, if \(P(t_{3c},C_{i}) = P(t_{3d},C_{i})\).

Justification: Claim 3(b) can be justified in the same way as Claim 2(a). Claims 3(c), 3(d), 3(e) and 3(f) can be justified in the same way as Claims 2(b), 2(c), 2(d) and 2(e), respectively.
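Taken together, and under the equal-probability conditions stated in the individual claims, the above justifications induce the overall preference ordering below (the \(\succ\) notation, read "gets higher preference than", is ours):

$$\begin{aligned} t_{1} \succ \{t_{2a}, t_{2b}, t_{2c}, t_{2d}, t_{2e}\} \succ \{t_{3a}, t_{3b}, t_{3c}, t_{3d}, t_{3e}\} \end{aligned}$$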


Cite this article

Basu, T., Murthy, C.A. A supervised term selection technique for effective text categorization. Int. J. Mach. Learn. & Cyber. 7, 877–892 (2016). https://doi.org/10.1007/s13042-015-0421-y
