DAnIEL: Language Independent Character-Based News Surveillance

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI, volume 7614)

Abstract

This study aims to develop a news surveillance system able to handle multilingual web corpora. As an example of a domain where multilingual capacity is crucial, we focus on epidemic surveillance. This task requires worldwide news coverage in order to detect new events as quickly as possible, wherever they occur and whatever the language in which they are first reported. In this study, properties of the text genre are used rather than sentence-level analysis. News-genre properties allow us to assess the thematic relevance of news articles, which are filtered with the help of a specialised lexicon collected automatically from Wikipedia. A more detailed analysis of text-specific properties is then applied to the relevant documents to better characterize the epidemic event (i.e., which disease is spreading, and where?). Results on 400 documents per language demonstrate the value of this multilingual approach, which relies on light resources. DAnIEL achieves an F1-measure of around 85%. Two issues are addressed: the first is morphologically rich languages, e.g. Greek, Polish and Russian, as compared to English; the second is event location detection as related to disease detection. The system provides a reliable alternative to the generic IE architecture, which is constrained by the lack of many required components in many languages.
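To make the idea of character-based, lexicon-driven relevance filtering concrete, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that an article is split into a head (title and opening) and a body, and that a lexicon term counts as present when a large share of its characters appears as one contiguous substring in both parts. The function names, the head/body split, and the 0.8 threshold are hypothetical. Matching on substrings instead of whole words is what lets an inflected form such as Polish "cholery" still match the lexicon entry "cholera".

```python
from typing import List


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common substring ending at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def relevant_terms(head: str, body: str, lexicon: List[str],
                   min_ratio: float = 0.8) -> List[str]:
    """Return lexicon terms echoed, at character level, in both head and body.

    min_ratio is an illustrative assumption: the fraction of a term's
    characters that must appear contiguously in each part of the text.
    """
    head, body = head.lower(), body.lower()
    hits = []
    for term in lexicon:
        t = term.lower()
        needed = min_ratio * len(t)
        in_head = len(longest_common_substring(t, head)) >= needed
        in_body = len(longest_common_substring(t, body)) >= needed
        if in_head and in_body:
            hits.append(term)
    return hits


if __name__ == "__main__":
    lexicon = ["cholera", "malaria"]
    print(relevant_terms("Epidemia cholery",
                         "przypadki cholery w regionie",
                         lexicon))
```

The quadratic dynamic programme is chosen here only for readability; a system processing large corpora would more plausibly rely on an index structure such as a suffix array.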

Keywords

  • Information Extraction
  • Computational Linguistics
  • Information Extraction System
  • Epidemic Event
  • Motif Extraction

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lejeune, G., Brixtel, R., Doucet, A., Lucas, N. (2012). DAnIEL: Language Independent Character-Based News Surveillance. In: Isahara, H., Kanzaki, K. (eds) Advances in Natural Language Processing. JapTAL 2012. Lecture Notes in Computer Science, vol 7614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33983-7_7

  • DOI: https://doi.org/10.1007/978-3-642-33983-7_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33982-0

  • Online ISBN: 978-3-642-33983-7

  • eBook Packages: Computer Science, Computer Science (R0)