Multilingual Media Monitoring and Text Analysis – Challenges for Highly Inflected Languages

  • Ralf Steinberger
  • Maud Ehrmann
  • Júlia Pajzs
  • Mohamed Ebrahim
  • Josef Steinberger
  • Marco Turchi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8082)

Abstract

We present the highly multilingual news analysis system Europe Media Monitor (EMM), which gathers an average of 175,000 online news articles per day in tens of languages, categorises the news items and extracts named entities and various other information from them. We also give an overview of EMM’s text mining tool set, focusing on the issue of how the software deals with highly inflected languages such as those of the Slavic and Finno-Ugric language families. The questions we ask are: How to adapt extraction patterns to such languages? How to de-inflect extracted named entities? And: Will document categorisation benefit from lemmatising the texts?

Keywords

multilinguality; text mining; information extraction; text classification; inflection; Slavic and Finno-Ugric languages; media monitoring

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Ralf Steinberger (1)
  • Maud Ehrmann (2)
  • Júlia Pajzs (3)
  • Mohamed Ebrahim (4)
  • Josef Steinberger (5)
  • Marco Turchi (6)

  1. European Commission - Joint Research Centre, IPSC-GlobeSec, Ispra, Italy
  2. Department of Computer Science, Sapienza University of Rome, Rome, Italy
  3. Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, Hungary
  4. Cognizant SetCon, Munich, Germany
  5. Faculty of Applied Sciences, Department of Computer Science and Engineering, NTIS Centre, University of West Bohemia, Pilsen, Czech Republic
  6. Human Language Technology group, Fondazione Bruno Kessler, Trento, Italy
