Language Resources and Evaluation, Volume 45, Issue 3, pp 311–330

Expanding a multilingual media monitoring and information extraction tool to a new language: Swahili

  • Ralf Steinberger
  • Sylvia Ombuya
  • Mijail Kabadjov
  • Bruno Pouliquen
  • Leo Della Rocca
  • Jenya Belyaeva
  • Monica de Paola
  • Camelia Ignat
  • Erik van der Goot
Original Paper

Abstract

The Europe Media Monitor (EMM) family of applications is a set of multilingual tools that gather, cluster and classify news, currently in fifty languages, and that extract named entities and quotations (reported speech) from texts in twenty languages. In this paper, we describe the recent effort of adding the African Bantu language Swahili to EMM. EMM has an entirely modular design, so that a new language can be plugged in by providing the language-specific resources for it. We therefore describe the types of language-specific resources needed, the effort involved, and ways of bootstrapping the generation of these resources in order to keep the effort of adding a new language to a minimum. The text analysis applications covered by this effort include clustering, classification, recognition and disambiguation of named entities (persons, organisations and locations), recognition and normalisation of date expressions, and the identification of reported speech quotations by and about people.

Keywords

Swahili; Multilinguality; Information extraction; Named entity recognition and classification; Geo-tagging; Quotation recognition; Date recognition; Subject domain classification; News analysis; Media monitoring

Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  • Ralf Steinberger (1)
  • Sylvia Ombuya (1)
  • Mijail Kabadjov (1)
  • Bruno Pouliquen (1)
  • Leo Della Rocca (1)
  • Jenya Belyaeva (1)
  • Monica de Paola (1)
  • Camelia Ignat (1)
  • Erik van der Goot (1)

  1. European Commission, Joint Research Centre, Ispra, Italy
