Extraction of Anchor-Related Text and Its Evaluation by User Studies

  • Bui Quang Hung
  • Masanori Otsubo
  • Yoshinori Hijikata
  • Shogo Nishida
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4557)

Abstract

A Semantic Text Portion (STP) is a portion of text in the original page that is semantically related to the anchor pointing to the target page. STPs may include facts and people's opinions about the target page. They can be used in upper-level applications such as automatic summarization and document categorization. In this paper, we concentrate on extracting STPs. We conduct a survey of STPs to determine where they occur in original pages and to identify the HTML tags that separate STPs from other text portions. We then develop a method for extracting STPs based on the results of the survey. The experimental results show that our method achieves high performance.
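
The idea of using HTML tags as boundaries around an anchor can be illustrated with a minimal sketch. This is not the method developed in the paper: the set `DELIMITER_TAGS`, the class `AnchorContextParser`, and the function `extract_anchor_context` are hypothetical names chosen for illustration, and the tag list is an assumption rather than the result of the survey. The sketch splits a page into text segments at every delimiter tag and returns the segments that contain an anchor pointing to a given target URL.

```python
from html.parser import HTMLParser

# Assumed delimiter tags for illustration only; the survey-derived
# tag list from the paper is not reproduced here.
DELIMITER_TAGS = {
    "p", "li", "td", "th", "tr", "div", "br", "hr",
    "ul", "ol", "table", "h1", "h2", "h3",
}


class AnchorContextParser(HTMLParser):
    """Collects text segments, bounded by delimiter tags, that contain
    an anchor whose href equals target_href."""

    def __init__(self, target_href):
        super().__init__(convert_charrefs=True)
        self.target_href = target_href
        self.current = []        # text pieces of the segment being read
        self.has_target = False  # True if the segment holds the anchor
        self.found = []          # finished segments that held the anchor

    def _close_segment(self):
        # Join the pieces and normalize whitespace.
        text = " ".join("".join(self.current).split())
        if self.has_target and text:
            self.found.append(text)
        self.current = []
        self.has_target = False

    def handle_starttag(self, tag, attrs):
        if tag in DELIMITER_TAGS:
            self._close_segment()
        elif tag == "a" and dict(attrs).get("href") == self.target_href:
            self.has_target = True

    def handle_endtag(self, tag):
        if tag in DELIMITER_TAGS:
            self._close_segment()

    def handle_data(self, data):
        self.current.append(data)

    def close(self):
        super().close()
        self._close_segment()  # flush text after the last delimiter


def extract_anchor_context(html, target_href):
    """Return the text segments surrounding anchors to target_href."""
    parser = AnchorContextParser(target_href)
    parser.feed(html)
    parser.close()
    return parser.found
```

For a page with three paragraphs where only the second links to the target, the function returns the text of that second paragraph alone. A flat segmentation was chosen over a tag stack so that unclosed tags such as a bare `<li>` or `<p>` still act as boundaries.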

Keywords

user study, text mining, web mining, semantic text portion, link structure, anchor

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Bui Quang Hung (1)
  • Masanori Otsubo (1)
  • Yoshinori Hijikata (1)
  • Shogo Nishida (1)
  1. Department of Systems Innovation, Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka 560-8531, Japan
