Personalized and Focused Web Spiders

  • Michael Chau
  • Hsinchun Chen
Chapter

Abstract

As the Web continues to grow, searching it for useful information has become increasingly difficult. Researchers have studied different ways to search the Web automatically using programs known variously as spiders, crawlers, Web robots, Web agents, and Webbots. In this chapter, we review research in this area, present two case studies, and suggest some future research directions.

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Michael Chau
    • 1
  • Hsinchun Chen
    • 1
  1. Department of Management Information Systems, The University of Arizona, Tucson, USA