A Scalable and Distributed NLP Architecture for Web Document Annotation

  • Julien Deriviere
  • Thierry Hamon
  • Adeline Nazarenko
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4139)


In the context of the ALVIS project, which aims at integrating linguistic information into topic-specific search engines, we develop an NLP architecture to linguistically annotate large collections of web documents. This context leads us to address the scalability of Natural Language Processing. The platform can be viewed as a framework that integrates existing NLP tools. We focus on the efficiency of the platform by distributing linguistic processing across several machines. We carry out an experiment on a collection of 55,329 web documents focusing on biology. This 79-million-word collection was processed in 3 days on 16 computers.
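The figures reported in the abstract can be turned into rough throughput estimates. The sketch below is only back-of-the-envelope arithmetic, not a measurement from the paper: it assumes that "3 days" means exactly 72 hours of wall-clock time and that the load was spread evenly over the 16 computers. The function name `throughput` and its dictionary keys are illustrative choices, not part of the platform.

```python
# Figures taken from the abstract.
DOCUMENTS = 55_329
WORDS = 79_000_000
MACHINES = 16
DAYS = 3  # assumption: exactly 72 hours of wall-clock time


def throughput(words, documents, machines, days):
    """Derive average rates, assuming an even load across machines."""
    seconds = days * 24 * 60 * 60
    return {
        "words_per_document": words / documents,
        "documents_per_machine": documents / machines,
        "words_per_second_total": words / seconds,
        "words_per_second_per_machine": words / seconds / machines,
    }


if __name__ == "__main__":
    stats = throughput(WORDS, DOCUMENTS, MACHINES, DAYS)
    for name, value in stats.items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the collection averages a little over 1,400 words per document, and the platform as a whole handles roughly 305 words per second, or about 19 words per second per machine.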


Entity Recognition; Word Segmentation; Natural Language Processing Tool; Standard Personal Computer; Linguistic Annotation





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Julien Deriviere (1)
  • Thierry Hamon (1)
  • Adeline Nazarenko (1)
  1. LIPN – UMR CNRS 7030, Villetaneuse, France
