Abstract
Automated annotation of web documents is a key challenge of the Semantic Web effort. Web documents are structured, but their structure is understandable only to humans, which is the major problem of the Semantic Web. The Semantic Web can be exploited only if computer-understandable metadata reaches a critical mass. Semantic metadata can be created manually or by using automated annotation or tagging tools. The automated semantic annotation tools with the best results are built on machine learning algorithms that require training sets. Another approach is to use pattern-based semantic annotation solutions built on NLP, information retrieval, or information extraction methods. Most of the developed methods are tested and evaluated on hundreds of documents, which cannot demonstrate their usability on large-scale data such as the web or email communication in an enterprise or community environment. In this paper we present how a pattern-based annotation tool can benefit from Google’s MapReduce architecture to process large amounts of text data.
This work is supported by projects NAZOU SPVV 1025/2004, Commius FP7-213876, SEMCO-WS APVV-0391-06, VEGA 2/7098/27.
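As a rough illustration of the idea summarized in the abstract, the following minimal sketch shows how pattern-based annotation could be phrased as a map step and a reduce step. It is not the authors' tool and does not use a real MapReduce framework: the patterns, the function names (`map_annotate`, `reduce_annotations`, `run_local`) and the in-memory grouping are all assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical annotation patterns (annotation type, regular expression).
# These are illustrative assumptions, not patterns taken from the paper.
PATTERNS = [
    ("Email", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("Organization", re.compile(r"\b[A-Z][A-Za-z]+ (?:Inc|Ltd|Corp)\b")),
]


def map_annotate(doc_id, text):
    """Map step: emit ((type, value), doc_id) for every pattern match."""
    for ann_type, regex in PATTERNS:
        for match in regex.finditer(text):
            yield (ann_type, match.group(0)), doc_id


def reduce_annotations(key, doc_ids):
    """Reduce step: collect the sorted, de-duplicated documents of one annotation."""
    return key, sorted(set(doc_ids))


def run_local(documents):
    """Simulate the shuffle phase in memory: group map output by key, then reduce."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_annotate(doc_id, text):
            groups[key].append(value)
    return dict(reduce_annotations(key, values) for key, values in groups.items())


if __name__ == "__main__":
    docs = {
        "d1": "Contact alice@example.com about Acme Inc.",
        "d2": "Acme Inc hired bob@example.org.",
    }
    for annotation, found_in in sorted(run_local(docs).items()):
        print(annotation, found_in)
```

Because each document is matched independently in the map step, the work can in principle be split across many machines, which is the property that makes this style of annotation a candidate for large text collections.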
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Laclavík, M., Šeleng, M., Hluchý, L. (2008). Towards Large Scale Semantic Annotation Built on MapReduce Architecture. In: Bubak, M., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2008. ICCS 2008. Lecture Notes in Computer Science, vol 5103. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69389-5_38
DOI: https://doi.org/10.1007/978-3-540-69389-5_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69388-8
Online ISBN: 978-3-540-69389-5
eBook Packages: Computer Science (R0)