A reverse engineering approach for automatic annotation of Web pages

Abstract

The Semantic Web is attracting growing interest as a way to meet the need for sharing, retrieving, and reusing information. Because Web pages are designed to be read by people, not machines, searching and reusing information on the Web is difficult without human participation. Adding semantics (i.e., meaning) to a Web page would therefore help machines understand Web content and better support the Web search process. One of the latest developments in this field is Google’s Rich Snippets, a service that lets Web site owners add semantics to their Web pages. In this paper we provide a structured approach to automatically annotating a Web page with Rich Snippets RDFa tags. By combining a data reverse engineering method with several heuristics and a named entity recognition technique, our method can recognize and annotate a subset of the Rich Snippets vocabulary: all the attributes of its Review concept, and the names of the Person and Organization concepts. We implemented tools and services and evaluated the accuracy of the approach on real e-commerce Web sites.
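To illustrate the kind of output such an annotation step targets, the following minimal Python sketch wraps already-extracted review fields in RDFa attributes. It is not the authors' implementation: the function name `annotate_review` and its parameters are hypothetical, the property names are assumed to follow Google's original Rich Snippets RDFa vocabulary for the Review concept, and the extraction itself (reverse engineering, heuristics, named entity recognition) is taken as given.

```python
from html import escape

# Namespace assumed for Google's original Rich Snippets RDFa vocabulary.
RICH_SNIPPETS_NS = "http://rdf.data-vocabulary.org/#"


def annotate_review(item, reviewer, date_iso, rating, summary):
    """Return an HTML fragment marking up one review with RDFa attributes.

    All arguments are assumed to have been extracted from the page
    beforehand; this hypothetical helper only emits the markup.
    """
    date = escape(date_iso)
    return (
        f'<div xmlns:v="{RICH_SNIPPETS_NS}" typeof="v:Review">\n'
        f'  <span property="v:itemreviewed">{escape(item)}</span>\n'
        f'  <span property="v:reviewer">{escape(reviewer)}</span>\n'
        f'  <span property="v:dtreviewed" content="{date}">{date}</span>\n'
        f'  <span property="v:rating">{escape(str(rating))}</span>\n'
        f'  <span property="v:summary">{escape(summary)}</span>\n'
        f'</div>'
    )


if __name__ == "__main__":
    # Made-up example values, used only to show the shape of the output.
    print(annotate_review("Example Camera", "Jane Doe", "2009-05-12", 4.5,
                          "Sharp pictures, weak battery"))
```

Escaping the extracted text before embedding it keeps the annotated fragment well-formed even when a review contains characters such as `&` or `<`.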

Corresponding author

Correspondence to Roberto De Virgilio.

About this article

Cite this article

De Virgilio, R., Frasincar, F., Hop, W. et al. A reverse engineering approach for automatic annotation of Web pages. Multimed Tools Appl 64, 119–140 (2013). https://doi.org/10.1007/s11042-011-0852-8

Keywords

  • RDFa
  • Rich Snippets
  • DRE
  • Web site segmentation