Abstract
We present the highly multilingual news analysis system Europe Media Monitor (EMM), which gathers an average of 175,000 online news articles per day in tens of languages, categorises the news items, and extracts named entities and various other information from them. We also give an overview of EMM’s text mining tool set, focusing on how the software deals with highly inflected languages such as those of the Slavic and Finno-Ugric language families. We ask three questions: How can extraction patterns be adapted to such languages? How can extracted named entities be de-inflected? And will document categorisation benefit from lemmatising the texts?
Keywords
- multilinguality
- text mining
- information extraction
- text classification
- inflection
- Slavic and Finno-Ugric languages
- media monitoring
Invited talk.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Steinberger, R., Ehrmann, M., Pajzs, J., Ebrahim, M., Steinberger, J., Turchi, M. (2013). Multilingual Media Monitoring and Text Analysis – Challenges for Highly Inflected Languages. In: Habernal, I., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2013. Lecture Notes in Computer Science, vol 8082. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40585-3_3
DOI: https://doi.org/10.1007/978-3-642-40585-3_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40584-6
Online ISBN: 978-3-642-40585-3
eBook Packages: Computer Science, Computer Science (R0)