Abstract
We present a supervised learning algorithm for text categorization that earned the authors 2nd place in the text categorization division of the 2012 Cybersecurity Data Mining Competition (CDMC’2012) and 3rd prize overall. The algorithm differs from existing approaches in that it is based on similarity search in the metric space of measure distributions on the dictionary. At the preprocessing stage, given a labeled learning sample of texts, we associate with every class label (document category) a point in the space in question. Contrary to the usual practice in clustering, this point is not a centroid of the category but rather an outlier: a uniform measure distribution on a selection of domain-specific words. At the execution stage, an unlabeled text is assigned the category of the labeled point nearest to the point representing the frequency distribution of the words in the text. The algorithm is both effective and efficient, as further confirmed by experiments on the Reuters-21578 dataset.
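The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which metric on distributions is used or how the domain-specific words are selected, so the L1 distance, the whitespace tokenizer, and the hand-picked word lists below are all assumptions made purely for illustration.

```python
from collections import Counter


def word_distribution(text):
    """Empirical frequency distribution of the words in a text."""
    counts = Counter(text.lower().split())  # assumed tokenizer: whitespace split
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def uniform_distribution(words):
    """Uniform measure distribution on a selection of words."""
    selected = set(words)
    return {word: 1.0 / len(selected) for word in selected}


def l1_distance(p, q):
    """L1 distance between two distributions stored as dicts (assumed metric)."""
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))


def classify(text, class_points):
    """Assign the label whose representative point is nearest to the text."""
    p = word_distribution(text)
    return min(class_points, key=lambda label: l1_distance(p, class_points[label]))


# Preprocessing stage: one point per category.
# These word selections are hypothetical, not taken from the paper.
class_points = {
    "finance": uniform_distribution(["bank", "loan", "interest", "shares"]),
    "sports": uniform_distribution(["match", "goal", "team", "season"]),
}

# Execution stage: nearest labeled point to the text's word distribution.
print(classify("the bank raised the interest on the loan", class_points))
```

Because each category is represented by a single point, classifying a text in this sketch costs one distance computation per category rather than one per training document.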
This work has been partially supported by a 2012 NSERC Canada Graduate Scholarship and a 2013 Ontario Graduate Scholarship (Hubert Haoyang Duan), 2012–2017 NSERC Discovery Grant “New set-theoretic tools for statistical learning” (Vladimir Pestov), and the 2012 Mitacs Globalink Program (Varun Singla).
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Duan, H.H., Pestov, V.G., Singla, V. (2013). Text Categorization via Similarity Search. In: Brisaboa, N., Pedreira, O., Zezula, P. (eds) Similarity Search and Applications. SISAP 2013. Lecture Notes in Computer Science, vol 8199. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41062-8_19
DOI: https://doi.org/10.1007/978-3-642-41062-8_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41061-1
Online ISBN: 978-3-642-41062-8
eBook Packages: Computer Science, Computer Science (R0)