Semantic Structure Analysis of Web Documents

  • Chapter

Part of the Advances in Pattern Recognition book series (ACVPR)

Keywords

  • Basic Block
  • Content Structure
  • Semantic Structure
  • Passage Retrieval
  • Page Segmentation

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 2007 Springer-Verlag London Limited

About this chapter

Cite this chapter

Mehta, R.R., Karnick, H., Mitra, P. (2007). Semantic Structure Analysis of Web Documents. In: Chaudhuri, B.B. (ed.) Digital Document Processing. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84628-726-8_19

  • DOI: https://doi.org/10.1007/978-1-84628-726-8_19

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84628-501-1

  • Online ISBN: 978-1-84628-726-8

  • eBook Packages: Computer Science, Computer Science (R0)
