The Nearest Centroid Based on Vector Norms: A New Classification Algorithm for a New Document Representation Model

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8556)

Abstract

In this paper, we present a novel model for document representation. In contrast with the classical Vector Space Model, which represents each document by a single vector in the feature space, our model represents each document by a vector in the space of the training documents of each category. For this model, we develop a discriminative classifier based on the norms of the vectors that the model generates. We call this algorithm the Nearest Centroid based on Vector Norms. Our main goal in proposing this new classification framework is to overcome the problems of huge dimensionality and vector sparsity that are commonly faced in Text Classification. We evaluate the proposed framework by comparing its effectiveness and efficiency with those of some standard classifiers used with the classical document representation. The studied classifiers are Naïve Bayes (NB), Support Vector Machines (SVM) and k-Nearest Neighbors (kNN). We conduct our experiments on multi-lingual balanced and unbalanced binary data sets. Our results show that our algorithm is competitive with the classical methods and, at the same time, dramatically faster, especially in comparison with NB and kNN. We also apply our model to the Reuters-21578 corpus to evaluate its performance in a multi-class setting. The obtained result (85.4% in terms of micro-F1) is promising and can be improved in future work.
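The abstract does not give the exact construction, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes that a document's coordinate with respect to each training document is a cosine similarity over raw term counts, and that the predicted category is the one whose vector has the largest size-normalized L2 norm. The function names (`tf_vector`, `category_vector`, `classify`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term counts (assumed, simplified feature extraction)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def category_vector(doc, training_docs):
    """Represent `doc` in the space of one category's training documents:
    one coordinate per training document (assumption: cosine similarity)."""
    d = tf_vector(doc)
    return [cosine(d, tf_vector(t)) for t in training_docs]


def classify(doc, training_sets):
    """Return (label, scores), where each score is the root-mean-square of the
    category vector, i.e. its L2 norm normalized by category size (assumption)."""
    scores = {}
    for label, docs in training_sets.items():
        vec = category_vector(doc, docs)
        scores[label] = math.sqrt(sum(x * x for x in vec) / len(vec)) if vec else 0.0
    return max(scores, key=scores.get), scores


# Hypothetical toy data, for illustration only.
training_sets = {
    "pos": ["great film loved it", "wonderful acting great plot"],
    "neg": ["terrible film hated it", "boring plot awful acting"],
}
label, scores = classify("great wonderful film", training_sets)
```

Under these assumptions each document is described by as many coordinates as a category has training documents, rather than by one coordinate per vocabulary term, which is the sense in which such a representation sidesteps the dimensionality and sparsity of the feature space.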

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Mountassir, A., Benbrahim, H., Berrada, I. (2014). The Nearest Centroid Based on Vector Norms: A New Classification Algorithm for a New Document Representation Model. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2014. Lecture Notes in Computer Science, vol 8556. Springer, Cham. https://doi.org/10.1007/978-3-319-08979-9_34

  • DOI: https://doi.org/10.1007/978-3-319-08979-9_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08978-2

  • Online ISBN: 978-3-319-08979-9

  • eBook Packages: Computer Science, Computer Science (R0)
