N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus

Šilić, Artur; Chauchat, Jean-Hugues; Dalbelo Bašić, Bojana; Morin, Annie

doi:10.1007/978-3-540-77002-2_56

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4874)

Included in the following conference series:

Portuguese Conference on Artificial Intelligence

Abstract

In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, used for text classification on a Croatian-English parallel corpus. Our approach to comparing different text preprocessing techniques is based on measuring computational performance (execution time and memory consumption), as well as classification performance. We show that although n-grams achieve classifier performance comparable to traditional word-based feature extraction and can act as a substitute for morphological normalization, they are computationally much more demanding.
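
The kind of comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`word_features`, `char_ngram_features`, `measure`), the choice of n = 3, and the two toy sentences are assumptions made for illustration only, and no morphological normalizer is included. The sketch extracts word features and character n-gram features from the same documents and records the feature count, elapsed time, and peak memory for each.

```python
import time
import tracemalloc
from collections import Counter


def word_features(text):
    """Bag of lowercased, whitespace-separated words (no normalization)."""
    return Counter(text.lower().split())


def char_ngram_features(text, n=3):
    """Bag of overlapping character n-grams; single spaces count as characters."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def measure(extract, docs):
    """Return (distinct feature count, seconds elapsed, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    vocab = set()
    for doc in docs:
        vocab.update(extract(doc))
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(vocab), elapsed, peak


# Toy documents (hypothetical, not taken from the parallel corpus).
docs = ["The parliament adopted the new law", "Parliaments adopt new laws"]

for name, fn in [("words", word_features), ("3-grams", char_ngram_features)]:
    size, secs, peak = measure(fn, docs)
    print(f"{name}: {size} features, {secs:.6f} s, peak {peak} bytes")
```

In this toy input the inflected forms "parliament" and "parliaments" are distinct word features but share most of their 3-grams, which hints at why n-grams can stand in for morphological normalization, at the price of a larger feature set to build and store.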

Author information

Authors and Affiliations

University of Zagreb, Department of Electronics, Microelectronics, Computer and Intelligent Systems, KTLab, Unska 3, 10000 Zagreb, Croatia
Artur Šilić & Bojana Dalbelo Bašić
Université de Lyon 2, Faculté de Sciences Economique et de Gestion, Laboratoire Eric, 5 avenue Pierre Mendès France, 69676 Bron Cedex, France
Jean-Hugues Chauchat
Université de Rennes 1, IRISA, 35042 Rennes Cedex, France
Annie Morin

Editor information

José Neves, Manuel Filipe Santos, José Manuel Machado

About this paper

Cite this paper

Šilić, A., Chauchat, J.-H., Dalbelo Bašić, B., Morin, A. (2007). N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus. In: Neves, J., Santos, M.F., Machado, J.M. (eds) Progress in Artificial Intelligence. EPIA 2007. Lecture Notes in Computer Science, vol 4874. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77002-2_56

DOI: https://doi.org/10.1007/978-3-540-77002-2_56
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77000-8
Online ISBN: 978-3-540-77002-2
eBook Packages: Computer Science, Computer Science (R0)
