N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus

  • Conference paper
Progress in Artificial Intelligence (EPIA 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4874)

Abstract

In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, used for text classification on a Croatian-English parallel corpus. Our approach to comparing different text preprocessing techniques is based on measuring computational performance (execution time and memory consumption), as well as classification performance. We show that although n-grams achieve classifier performance comparable to traditional word-based feature extraction and can act as a substitute for morphological normalization, they are computationally much more demanding.
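To make the contrast in the abstract concrete, the following is a minimal sketch of the two kinds of features being compared. It assumes character-level n-grams (the abstract does not say which kind was used) and plain whitespace tokenization for the word-based features. The function names `char_ngrams` and `word_features` and the sample string are illustrative only and are not taken from the paper.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams (assumed n-gram type, not confirmed by the abstract)."""
    # Lowercase and collapse runs of whitespace so spacing does not create spurious features.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def word_features(text: str) -> Counter:
    """Count whitespace-separated, lowercased word forms with no morphological normalization."""
    return Counter(text.lower().split())


# Three inflectional or derivational variants of the same stem (illustrative sample).
sample = "classification classifier classified"

words = word_features(sample)
trigrams = char_ngrams(sample, 3)

# Each variant is a distinct word feature, but all three share the trigram "cla".
print(len(words), "word features")
print(len(trigrams), "distinct trigrams")
print("count of 'cla':", trigrams["cla"])
```

In this sketch the shared trigrams tie the three word forms together without any stemmer or lexicon, which is one way to read the claim that n-grams can substitute for morphological normalization. The same sample also produces many more distinct features than the word-based representation, which is at least consistent with the higher computational cost the abstract reports.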

Editor information

José Neves Manuel Filipe Santos José Manuel Machado

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Šilić, A., Chauchat, JH., Dalbelo Bašić, B., Morin, A. (2007). N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus. In: Neves, J., Santos, M.F., Machado, J.M. (eds) Progress in Artificial Intelligence. EPIA 2007. Lecture Notes in Computer Science, vol 4874. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77002-2_56

  • DOI: https://doi.org/10.1007/978-3-540-77002-2_56

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-77000-8

  • Online ISBN: 978-3-540-77002-2

  • eBook Packages: Computer Science, Computer Science (R0)
