Classifying Amharic webnews
We present work aimed at compiling an Amharic corpus from the Web and automatically categorizing the texts. Amharic is the second most spoken Semitic language in the world (after Arabic) and is used for countrywide communication in Ethiopia. It is highly inflectional and quite dialectally diversified. We discuss the issues of compiling and annotating a corpus of Amharic news articles from the Web. This corpus was then used in three sets of text classification experiments. Working with a less-researched language highlights a number of practical issues that might otherwise receive less attention or go unnoticed. The purpose of the experiments has not primarily been to develop a cutting-edge text classification system for Amharic, but rather to put the spotlight on some of these issues. The first two sets of experiments investigated the use of Self-Organizing Maps (SOMs) for document classification. Testing on small datasets, we first looked at classifying unseen data into 10 predefined categories of news items, and then at clustering it around query content, taking 16 queries as class labels. The third set of experiments investigated the effect of operations such as stemming and part-of-speech tagging on text classification performance. We compared three representations while constructing classification models based on bagging of decision trees for the 10 predefined news categories. The best accuracy was achieved using the full text as representation. A representation using only the nouns performed almost equally well, confirming the assumption that most of the information required for distinguishing between the categories is actually contained in the nouns, while stemming did not have much effect on the performance of the classifier.
Keywords: Web mining, Text classification, Semitic languages
The authors would like to thank Daniel Yacob at the Ge’ez Frontier Foundation; Mesfin Getachew, Dr. Girma Demeke, Dr. Gashaw Kebede, Kibur Lisanu, and Meshesha Legesse at Addis Ababa University; and Gunnar Eriksson, Fredrik Olsson, and Dr. Magnus Sahlgren at the Swedish Institute of Computer Science. The work was partially funded by Sida, the Swedish International Development Cooperation Agency, through the ICT support programme of SAREC (the Department for Research Cooperation) and through SPIDER (the Swedish Programme for ICT in Developing Regions), as well as by the Faculty of Informatics at Addis Ababa University.