Multimedia Tools and Applications, Volume 64, Issue 1, pp 119–140

A reverse engineering approach for automatic annotation of Web pages

  • Roberto De Virgilio
  • Flavius Frasincar
  • Walter Hop
  • Stephan Lachner

Abstract

The Semantic Web is attracting growing interest as a way to meet the need for sharing, retrieving, and reusing information. Because Web pages are designed to be read by people, not machines, searching and reusing information on the Web is difficult without human participation. To this end, adding semantics (i.e., meaning) to a Web page helps machines understand Web content and better supports the Web search process. One of the latest developments in this field is Google’s Rich Snippets, a service that lets Web site owners add semantics to their Web pages. In this paper we provide a structured approach to automatically annotate a Web page with Rich Snippets RDFa tags. By combining a data reverse engineering method with several heuristics and a named entity recognition technique, our method can recognize and annotate a subset of the Rich Snippets vocabulary, namely all the attributes of its Review concept and the names of the Person and Organization concepts. We implemented tools and services and evaluated the accuracy of the approach on real e-commerce Web sites.
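
To make the annotation target concrete, the following is a minimal sketch of what emitting RDFa for a Review might look like once the review fields have already been extracted. It is not the authors' reverse engineering method: the function name `annotate_review`, the sample values, and the field list are hypothetical, and the namespace and property names (`v:itemreviewed`, `v:reviewer`, `v:dtreviewed`, `v:summary`, `v:description`, `v:rating`) are assumed from Google's 2009 Rich Snippets documentation for the Review concept.

```python
from html import escape

# Assumption: namespace used by Google's 2009 Rich Snippets RDFa vocabulary.
RDFA_NS = "http://rdf.data-vocabulary.org/#"

# Assumption: property names of the Review concept in that vocabulary.
REVIEW_PROPERTIES = (
    "itemreviewed",
    "reviewer",
    "dtreviewed",
    "summary",
    "description",
    "rating",
)


def annotate_review(fields):
    """Wrap already-extracted review fields in RDFa-annotated spans.

    `fields` maps property names to their text values. Extraction itself
    (segmentation, heuristics, named entity recognition) is out of scope.
    """
    spans = []
    for name in REVIEW_PROPERTIES:
        if name in fields:
            # Escape the value so it cannot break the surrounding markup.
            value = escape(str(fields[name]))
            spans.append(f'  <span property="v:{name}">{value}</span>')
    body = "\n".join(spans)
    return f'<div xmlns:v="{RDFA_NS}" typeof="v:Review">\n{body}\n</div>'


# Hypothetical sample values, for illustration only.
example = annotate_review(
    {"itemreviewed": "Acme Camera X100", "reviewer": "J. Smith", "rating": 4.5}
)
print(example)
```

The sketch covers only the final markup step; in the approach described above, the values passed in would come from the reverse engineering, heuristic, and named entity recognition stages.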

Keywords

RDFa, Rich Snippets, DRE, Web site segmentation

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Roberto De Virgilio (1)
  • Flavius Frasincar (2)
  • Walter Hop (2)
  • Stephan Lachner (2)
  1. Dipartimento di Informatica e Automazione, Università Roma Tre, Rome, Italy
  2. Erasmus School of Economics, Erasmus University Rotterdam, Rotterdam, The Netherlands
