Schema Driven and Topic Specific Web Crawling

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNISA, volume 3453)

Abstract

We propose a new approach to discovering and extracting topic-specific hypertext resources from the WWW. The method, called schema driven and topical crawling, lets a user define a schema and extraction rules for a specific domain of interest, and then automatically searches for and extracts schema-relevant pages from the web. Unlike common approaches that surf solely on web pages, our approach lets the crawler surf a virtual network composed of concept instances and the relationships between them. To achieve this, we design an architecture that integrates several techniques, including a web extractor, a meta-search engine, and query expansion, and we provide a toolkit to support it.
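
To make the idea of surfing a virtual network of concept instances more concrete, the following is a minimal sketch and not the authors' toolkit. The schema, the `Instance` class, and the functions `expand_query`, `crawl`, `fake_search`, and `fake_extract` are all hypothetical names chosen for illustration. The meta-search engine and the web extractor are replaced by in-memory stubs so the example runs without network access.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """A concept instance, e.g. one particular author or paper."""
    concept: str
    value: str


# Hypothetical schema: each concept lists (relationship, target concept) pairs.
SCHEMA = {
    "Author": [("writes", "Paper")],
    "Paper": [("published_in", "Venue")],
    "Venue": [],
}


def expand_query(instance, relation, target):
    """Query expansion (assumed form): add schema terms to the instance value."""
    return f'"{instance.value}" {relation.replace("_", " ")} {target.lower()}'


def crawl(seed, search, extract, max_instances=10):
    """Breadth-first walk over concept instances instead of raw hyperlinks.

    `search(query)` stands in for a meta-search engine and returns pages.
    `extract(page, concept)` stands in for a web extractor and returns values.
    """
    seen = {seed}
    queue = deque([seed])
    edges = []
    while queue and len(seen) < max_instances:
        current = queue.popleft()
        for relation, target in SCHEMA[current.concept]:
            for page in search(expand_query(current, relation, target)):
                for value in extract(page, target):
                    found = Instance(target, value)
                    edges.append((current, relation, found))
                    if found not in seen:
                        seen.add(found)
                        queue.append(found)
    return edges


# Stub data in place of real search results (purely illustrative).
PAGES = {
    '"A. Author" writes paper': ["Paper: Example Paper"],
    '"Example Paper" published in venue': ["Venue: Example Conference"],
}


def fake_search(query):
    return PAGES.get(query, [])


def fake_extract(page, concept):
    prefix = concept + ": "
    return [page[len(prefix):]] if page.startswith(prefix) else []


edges = crawl(Instance("Author", "A. Author"), fake_search, fake_extract)
for source, relation, found in edges:
    print(f"{source.value} --{relation}--> {found.value}")
```

In this sketch the frontier holds concept instances, so the next step of the crawl is chosen by the schema's relationships and not by the hyperlinks on the current page. How the paper's actual architecture schedules and ranks such steps is not stated in the abstract.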

Keywords

  • Digital Library
  • Concept Schema
  • Virtual Network
  • Query Expansion
  • Relevant Page

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

This research work is part of the ALVIS project of EU’s 6th Framework Programme and funded by the Ministry of Science and Technology of China.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Guo, Q., Guo, H., Zhang, Z., Sun, J., Feng, J. (2005). Schema Driven and Topic Specific Web Crawling. In: Zhou, L., Ooi, B.C., Meng, X. (eds) Database Systems for Advanced Applications. DASFAA 2005. Lecture Notes in Computer Science, vol 3453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408079_55

  • DOI: https://doi.org/10.1007/11408079_55

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25334-1

  • Online ISBN: 978-3-540-32005-0

  • eBook Packages: Computer Science, Computer Science (R0)
