DAnIEL: Language Independent Character-Based News Surveillance

  • Gaël Lejeune
  • Romain Brixtel
  • Antoine Doucet
  • Nadine Lucas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7614)


This study aims to develop a news surveillance system able to handle multilingual web corpora. As an example of a domain where multilingual capacity is crucial, we focus on epidemic surveillance. This task requires worldwide news coverage in order to detect new events as quickly as possible, anywhere, whatever the language in which they are first reported. The approach relies on text-genre properties rather than sentence-level analysis. The properties of the news genre allow us to assess the thematic relevance of news items, which are filtered with the help of a specialised lexicon collected automatically from Wikipedia. A more detailed analysis of text-specific properties is then applied to the relevant documents to better characterize the epidemic event (i.e., which disease spreads where?). Results from 400 documents in each language demonstrate the value of this multilingual approach with light resources. DAnIEL achieves an F1-measure of around 85%. Two issues are addressed: the first is morphologically rich languages, e.g. Greek, Polish and Russian, as compared to English. The second is event location detection as related to disease detection. This system provides a reliable alternative to the generic IE architecture, which is constrained by the lack of numerous components in many languages.
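The character-based filtering idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.8 ratio threshold, the head/body split and the toy Polish strings are all assumptions made for illustration. A lexicon term counts as detected when a long enough character substring of it occurs in both the head and the body of a news item, so inflected forms can match without any morphological analyser.

```python
from difflib import SequenceMatcher


def longest_shared_substring(term: str, text: str) -> str:
    """Return the longest character substring common to term and text."""
    matcher = SequenceMatcher(None, term, text, autojunk=False)
    match = matcher.find_longest_match(0, len(term), 0, len(text))
    return term[match.a:match.a + match.size]


def detect_diseases(head: str, body: str, lexicon: list[str],
                    min_ratio: float = 0.8) -> list[str]:
    """Hypothetical filter: keep lexicon terms largely present in head AND body.

    min_ratio is an assumed threshold: the shared substring must cover at
    least this fraction of the lexicon term's characters.
    """
    head, body = head.lower(), body.lower()
    found = []
    for term in lexicon:
        key = term.lower()
        if not key:
            continue
        shared = longest_shared_substring(key, head)
        # The same substring must also be repeated in the body of the article.
        if len(shared) / len(key) >= min_ratio and shared in body:
            found.append(term)
    return found


# Toy example (illustrative strings, not taken from the paper's corpus):
# the Polish term "grypa" (influenza) appears only in inflected forms.
head = "Epidemia grypy w Warszawie"
body = "Liczba zachorowań na grypę rośnie."
print(detect_diseases(head, body, ["grypa", "cholera"]))
```

Working on characters rather than tokens is what lets "grypy" and "grypę" both match the lexicon entry "grypa" through the shared stem "gryp", at the price of a threshold that has to be tuned to avoid spurious matches on short terms.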


Keywords: Information Extraction, Computational Linguistics, Information Extraction System, Epidemic Event, Motif Extraction



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Gaël Lejeune (1)
  • Romain Brixtel (1)
  • Antoine Doucet (1)
  • Nadine Lucas (1)
  1. GREYC, University of Caen Lower-Normandy, Caen Cedex, France
