The Semantic Web is attracting increasing interest as a way to meet the need for sharing, retrieving, and reusing information. Since Web pages are designed to be read by people, not machines, searching and reusing information on the Web is a difficult task without human participation. To this end, adding semantics (i.e., meaning) to a Web page would help machines understand Web content and better support the Web search process. One of the latest developments in this field is Google’s Rich Snippets, a service that lets Web site owners add semantics to their Web pages. In this paper we provide a structured approach to automatically annotate a Web page with Rich Snippets RDFa tags. By combining a data reverse engineering method with several heuristics and a named entity recognition technique, our method is capable of recognizing and annotating a subset of the Rich Snippets vocabulary, namely all the attributes of its Review concept and the names of the Person and Organization concepts. We implemented tools and services and evaluated the accuracy of the approach on real e-commerce Web sites.
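To make the target of the annotation concrete, the following is a minimal sketch of the final markup step only, assuming the review fields have already been recognized on the page. It is not the paper's method: the function name `annotate_review`, the input dictionary, and the output layout are hypothetical, and the property names are the Review attributes of Google's RDFa vocabulary (namespace `http://rdf.data-vocabulary.org/#`) as we understand them.

```python
from html import escape

# Namespace of the RDFa vocabulary used by Rich Snippets (data-vocabulary.org).
V_NS = "http://rdf.data-vocabulary.org/#"

# Attributes of the Review concept; the order here is only for display.
REVIEW_PROPERTIES = (
    "itemreviewed", "reviewer", "dtreviewed",
    "summary", "description", "rating",
)


def annotate_review(fields):
    """Wrap already-extracted review fields in RDFa spans (hypothetical helper)."""
    spans = []
    for name in REVIEW_PROPERTIES:
        if name in fields:
            # Escape the text so extracted content cannot break the markup.
            value = escape(str(fields[name]))
            spans.append('  <span property="v:%s">%s</span>' % (name, value))
    body = "\n".join(spans)
    return '<div xmlns:v="%s" typeof="v:Review">\n%s\n</div>' % (V_NS, body)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    print(annotate_review({
        "itemreviewed": "Camera X",
        "reviewer": "A & B",
        "rating": 4.5,
    }))
```

Fields that were not recognized are simply omitted, so a partially extracted review still yields well-formed markup.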
About this article
Cite this article
De Virgilio, R., Frasincar, F., Hop, W. et al. A reverse engineering approach for automatic annotation of Web pages. Multimed Tools Appl 64, 119–140 (2013). https://doi.org/10.1007/s11042-011-0852-8
- Rich Snippets
- Web site segmentation