Towards Large Scale Semantic Annotation Built on MapReduce Architecture

  • Michal Laclavík
  • Martin Šeleng
  • Ladislav Hluchý
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5103)


Automated annotation of web documents is a key challenge of the Semantic Web effort. Web documents are structured, but their structure is understandable only to humans, which is the major problem of the Semantic Web. The Semantic Web can be exploited only if computer-understandable metadata reaches critical mass. Semantic metadata can be created manually or with automated annotation or tagging tools. The automated semantic annotation tools with the best results are built on various machine learning algorithms that require training sets. Another approach is to use pattern-based semantic annotation solutions built on NLP, information retrieval, or information extraction methods. Most of the developed methods are tested and evaluated on hundreds of documents, which cannot prove their usability on large-scale data such as the web or email communication in an enterprise or community environment. In this paper we present how a pattern-based annotation tool can benefit from Google’s MapReduce architecture to process large amounts of text data.


semantic annotation, information extraction, metadata, MapReduce



Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Michal Laclavík (1)
  • Martin Šeleng (1)
  • Ladislav Hluchý (1)
  1. Institute of Informatics, Slovak Academy of Sciences, Bratislava
