Automatic Web Page Annotation with Google Rich Snippets

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNISA, volume 6427)

Abstract

Web pages are designed to be read by people, not machines. Consequently, searching and reusing information on the Web is a difficult task without human participation. Adding semantics (i.e., meaning) to a Web page would help machines understand Web content and better support the Web search process. One of the latest developments in this field is Google’s Rich Snippets, a service that lets Web site owners add semantics to their Web pages. In this paper we provide an approach to automatically annotate a Web page with Rich Snippets RDFa tags. Exploiting several heuristics and a named entity recognition technique, our method is capable of recognizing and annotating a subset of the Rich Snippets vocabulary, i.e., all attributes of its Review concept, and the names of the Person and Organization concepts. We implemented an online service and evaluated the accuracy of the approach on real e-commerce Web sites.
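
To make the target markup concrete, the following is a minimal sketch of what annotating a Review with Rich Snippets RDFa tags can look like. It is not the authors' implementation: it assumes the review fields have already been recognized on the page (the heuristics and named entity recognition step are not shown), and the function name `annotate_review` and the sample values are hypothetical. The namespace and property names (`itemreviewed`, `rating`, and so on) follow Google's public Rich Snippets documentation from that period.

```python
from html import escape

# Namespace Google documented for Rich Snippets RDFa markup (2009-era).
VOCAB = "http://rdf.data-vocabulary.org/#"


def annotate_review(fields: dict[str, str]) -> str:
    """Wrap already-extracted review fields in Rich Snippets RDFa markup.

    `fields` maps a Review property name (e.g. "itemreviewed", "rating")
    to the text found on the page. Extraction itself is assumed done
    elsewhere; this hypothetical helper only emits the annotation.
    """
    spans = [
        f'  <span property="v:{escape(name, quote=True)}">{escape(text)}</span>'
        for name, text in fields.items()
    ]
    # One enclosing element declares the vocabulary and the Review type.
    return "\n".join(
        [f'<div xmlns:v="{VOCAB}" typeof="v:Review">', *spans, "</div>"]
    )


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    print(annotate_review({
        "itemreviewed": "Acme Phone",
        "reviewer": "J. Doe",
        "rating": "4.5",
    }))
```

Escaping the extracted text keeps the generated markup well-formed even when a review contains characters such as `<` or `&`.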

Keywords

  • Name Entity Recognition
  • Entity Recognition
  • Page Title
  • Page Area
  • Natural Text

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hop, W., Lachner, S., Frasincar, F., De Virgilio, R. (2010). Automatic Web Page Annotation with Google Rich Snippets. In: Meersman, R., Dillon, T., Herrero, P. (eds) On the Move to Meaningful Internet Systems, OTM 2010. Lecture Notes in Computer Science, vol 6427. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16949-6_21

  • DOI: https://doi.org/10.1007/978-3-642-16949-6_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-16948-9

  • Online ISBN: 978-3-642-16949-6

  • eBook Packages: Computer Science, Computer Science (R0)