Abstract
In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, as used for text classification on a Croatian-English parallel corpus. We compare the two techniques by measuring computational performance (execution time and memory consumption) as well as classification performance. We show that although n-grams achieve classification performance comparable to traditional word-based feature extraction and can substitute for morphological normalization, they are computationally much more demanding.
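The kind of measurement described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes character-level n-grams with n = 3, a plain whitespace tokenizer as the word-based baseline, and the standard-library `time` and `tracemalloc` modules for timing and peak memory. The function names `char_ngrams`, `word_features`, and `measure` are hypothetical, and morphological normalization is left out because it needs a language-specific lexicon.

```python
import time
import tracemalloc
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (assumed definition, n=3 by default)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_features(text):
    """Count lowercased whitespace-separated tokens (a simple word-based baseline)."""
    return Counter(text.lower().split())


def measure(extractor, text):
    """Run an extractor once; return its features, elapsed seconds, and peak bytes."""
    tracemalloc.start()
    start = time.perf_counter()
    features = extractor(text)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return features, elapsed, peak


if __name__ == "__main__":
    # Illustrative input only; a real comparison would use the corpus documents.
    sample = "text classification on a parallel corpus " * 1000
    for name, extractor in [("words", word_features), ("3-grams", char_ngrams)]:
        features, elapsed, peak = measure(extractor, sample)
        print(f"{name}: {len(features)} features, {elapsed:.4f} s, {peak} bytes peak")
```

A single run on a toy string says little on its own. A fair comparison would repeat the measurement over the whole corpus and report classification performance alongside time and memory.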
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Šilić, A., Chauchat, J.-H., Dalbelo Bašić, B., Morin, A. (2007). N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus. In: Neves, J., Santos, M.F., Machado, J.M. (eds) Progress in Artificial Intelligence. EPIA 2007. Lecture Notes in Computer Science, vol 4874. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77002-2_56
DOI: https://doi.org/10.1007/978-3-540-77002-2_56
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77000-8
Online ISBN: 978-3-540-77002-2
eBook Packages: Computer Science, Computer Science (R0)