Towards Large Scale Semantic Annotation Built on MapReduce Architecture

Laclavík, Michal; Šeleng, Martin; Hluchý, Ladislav

doi:10.1007/978-3-540-69389-5_38

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5103)

Included in the following conference series:

International Conference on Computational Science

Abstract

Automated annotation of web documents is a key challenge of the Semantic Web effort. Web documents are structured, but their structure is understandable only to humans, which is the major problem of the Semantic Web. The Semantic Web can be exploited only if metadata understood by computers reaches critical mass. Semantic metadata can be created manually or by using automated annotation or tagging tools. The automated semantic annotation tools with the best results are built on various machine learning algorithms that require training sets. Another approach is to use pattern-based semantic annotation solutions built on NLP, information retrieval, or information extraction methods. Most of the developed methods are tested and evaluated on hundreds of documents, which cannot prove their usability on large-scale data such as the web or email communication in an enterprise or community environment. In this paper we present how a pattern-based annotation tool can benefit from Google’s MapReduce architecture to process large amounts of text data.
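To make the idea in the abstract concrete, the following is a minimal sketch of pattern-based annotation expressed as map and reduce steps. It is not the authors' tool or its API: the regular expressions, the function names (`map_annotate`, `reduce_annotations`, `run_job`), and the sample documents are all assumptions chosen for illustration, and the shuffle phase of a real MapReduce framework is imitated by a sequential in-memory grouping.

```python
import re
from collections import defaultdict

# Hypothetical patterns, for illustration only; not taken from the paper.
PATTERNS = {
    "Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "Phone": re.compile(r"\+\d{1,3}(?: \d{1,4}){2,4}"),
}


def map_annotate(doc_id, text):
    """Map step: emit ((annotation_type, matched_value), doc_id) pairs."""
    for ann_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            yield (ann_type, match.group(0)), doc_id


def reduce_annotations(key, doc_ids):
    """Reduce step: list the distinct documents in which an annotation occurs."""
    return key, sorted(set(doc_ids))


def run_job(documents):
    """Sequential stand-in for the shuffle phase of a MapReduce framework."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_annotate(doc_id, text):
            grouped[key].append(value)
    return dict(reduce_annotations(k, v) for k, v in grouped.items())


# Assumed sample input: two short e-mail-like documents.
documents = {
    "mail-1": "Contact alice@example.org or call +421 2 5941 1111.",
    "mail-2": "Forwarded from alice@example.org and bob@example.com.",
}
annotations = run_job(documents)
```

Because each document is matched independently in the map step, the work can in principle be split across many machines, with only the grouped annotation keys meeting in the reduce step.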

This work is supported by projects NAZOU SPVV 1025/2004, Commius FP7-213876, SEMCO-WS APVV-0391-06, VEGA 2/7098/27.

Author information

Authors and Affiliations

Institute of Informatics, Slovak Academy of Sciences, Dúbravská cesta 9, Bratislava, 845 07
Michal Laclavík, Martin Šeleng & Ladislav Hluchý

Editor information

Editors and Affiliations

Institute of Computer Science and Academic Computer Center CYFRONET, AGH University of Science and Technology, 30-950, Kraków, Poland
Marian Bubak
Department of Mathematics and Computer Science, University of Amsterdam, Kruislaan 403, 1098 SJ, Amsterdam, The Netherlands
Geert Dick van Albada
Computer Science Department, University of Tennessee, 37996, Knoxville, TN, USA
Jack Dongarra
Computational Science, University of Amsterdam, Kruislaan 403, 1098 SJ, Amsterdam, The Netherlands
Peter M. A. Sloot

About this paper

Cite this paper

Laclavík, M., Šeleng, M., Hluchý, L. (2008). Towards Large Scale Semantic Annotation Built on MapReduce Architecture. In: Bubak, M., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2008. ICCS 2008. Lecture Notes in Computer Science, vol 5103. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69389-5_38

DOI: https://doi.org/10.1007/978-3-540-69389-5_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69388-8
Online ISBN: 978-3-540-69389-5
eBook Packages: Computer Science, Computer Science (R0)
