Information Retrieval, Volume 12, Issue 3, pp 416–435

Classifying Amharic webnews

  • Lars Asker
  • Atelach Alemu Argaw
  • Björn Gambäck
  • Samuel Eyassu Asfeha
  • Lemma Nigussie Habte

Abstract

We present work aimed at compiling an Amharic corpus from the Web and automatically categorizing the texts. Amharic is the second most spoken Semitic language in the world (after Arabic) and is used for countrywide communication in Ethiopia. It is highly inflectional and quite dialectally diversified. We discuss the issues of compiling and annotating a corpus of Amharic news articles from the Web. This corpus was then used in three sets of text classification experiments. Working with a less-researched language highlights a number of practical issues that might otherwise receive less attention or go unnoticed. The purpose of the experiments has not primarily been to develop a cutting-edge text classification system for Amharic, but rather to put the spotlight on some of these issues. The first two sets of experiments investigated the use of Self-Organizing Maps (SOMs) for document classification. Testing on small datasets, we first looked at classifying unseen data into 10 predefined categories of news items, and then at clustering it around query content, taking 16 queries as class labels. The third set of experiments investigated the effect of operations such as stemming and part-of-speech tagging on text classification performance. We compared three representations while constructing classification models based on bagging of decision trees for the 10 predefined news categories. The best accuracy was achieved using the full text as representation. A representation using only the nouns performed almost equally well, confirming the assumption that most of the information required for distinguishing between categories is contained in the nouns, while stemming had little effect on the performance of the classifier.

Keywords

Web mining · Text classification · Semitic languages

Notes

Acknowledgements

The authors would like to thank Daniel Yacob at the Ge’ez Frontier Foundation; Mesfin Getachew, Dr. Girma Demeke, Dr. Gashaw Kebede, Kibur Lisanu, and Meshesha Legesse at Addis Ababa University; and Gunnar Eriksson, Fredrik Olsson, and Dr. Magnus Sahlgren at the Swedish Institute of Computer Science. The work was partially funded by Sida, the Swedish International Development Cooperation Agency, through the ICT support programme of SAREC (the Department for Research Cooperation) and through SPIDER (the Swedish Programme for ICT in Developing Regions), as well as by the Faculty of Informatics at Addis Ababa University.

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Lars Asker (1)
  • Atelach Alemu Argaw (1)
  • Björn Gambäck (2, 3)
  • Samuel Eyassu Asfeha (4)
  • Lemma Nigussie Habte (4)

  1. Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden
  2. Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
  3. SICS, Swedish Institute of Computer Science AB, Kista, Sweden
  4. Department of Information Science, Addis Ababa University, Addis Ababa, Ethiopia
