Personalized and Focused Web Spiders

  • Chapter
Web Intelligence

Abstract

As the Web continues to grow, searching it for useful information has become increasingly difficult. Researchers have studied different ways to search the Web automatically using programs variously known as spiders, crawlers, Web robots, Web agents, or Webbots. In this chapter, we review research in this area, present two case studies, and suggest some future research directions.


Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Chau, M., Chen, H. (2003). Personalized and Focused Web Spiders. In: Zhong, N., Liu, J., Yao, Y. (eds) Web Intelligence. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-05320-1_10

  • DOI: https://doi.org/10.1007/978-3-662-05320-1_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-07936-8

  • Online ISBN: 978-3-662-05320-1

  • eBook Packages: Springer Book Archive
