
Information Monitoring on the Web: A Scalable Solution

World Wide Web

Abstract

This paper presents WebCQ, a continual query system for large-scale Web information monitoring. WebCQ is designed to discover and detect changes to Web pages efficiently, and to notify users of interesting changes with personalized messages. Users' Web page monitoring requests are modeled as continual queries on the Web and are referred to as Web page sentinels. The system consists of five main components: a change detection robot that discovers and detects changes; a proxy cache service that reduces communication traffic to the original information provider on the remote server; a trigger evaluation tool that passes on only the changes that meet certain thresholds; a personalized change presentation tool that highlights Web page changes; and a change notification service that displays and delivers interesting changes and fresh information to the right users at the right time. This paper describes the WebCQ system with an emphasis on the general issues in designing and engineering a large-scale information change monitoring system on the Web. There are two main contributions. First, we present the mechanisms that WebCQ provides to support various types of Web page sentinels for finding and displaying interesting changes to Web pages. The large collection of sentinel types allows WebCQ to efficiently locate and monitor a wide range of changes in Web pages. Second, we develop sentinel grouping techniques for efficient and scalable processing of a large number of concurrently running triggers and Web page sentinels. We report initial experimental results showing the effectiveness of the proposed solutions.
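The sentinel grouping idea described above can be illustrated with a minimal Python sketch. This is not WebCQ's implementation: the `Sentinel` class, the keyword trigger, the `fetch` callback, and the use of a SHA-256 digest for change detection are all assumptions made for illustration. The sketch only shows why grouping helps: sentinels that watch the same URL share one fetch and one change check per monitoring cycle, and the per-user trigger is evaluated only when the page has actually changed.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Sentinel:
    """Hypothetical monitoring request: notify `user` when `url` changes
    and the new content contains `keyword` (a stand-in for a trigger)."""
    user: str
    url: str
    keyword: str


def group_by_url(sentinels: List[Sentinel]) -> Dict[str, List[Sentinel]]:
    """Group sentinels so that each distinct URL is processed once."""
    groups: Dict[str, List[Sentinel]] = defaultdict(list)
    for s in sentinels:
        groups[s.url].append(s)
    return dict(groups)


def digest(content: str) -> str:
    """Fingerprint page content; the hash choice is an assumption."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def run_cycle(
    sentinels: List[Sentinel],
    fetch: Callable[[str], str],
    last_digest: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Run one monitoring pass and return (user, url) pairs to notify.

    `last_digest` is updated in place with the newest fingerprint per URL.
    """
    notifications: List[Tuple[str, str]] = []
    for url, group in group_by_url(sentinels).items():
        content = fetch(url)  # one fetch per URL, shared by the whole group
        new = digest(content)
        old = last_digest.get(url)
        last_digest[url] = new
        if old is None or old == new:
            continue  # first sighting or unchanged: no trigger evaluation
        for s in group:
            if s.keyword in content:  # per-sentinel trigger check
                notifications.append((s.user, url))
    return notifications
```

With three sentinels over two URLs, each cycle issues two fetches rather than three. The saving grows with the number of users watching the same page, which is the situation the grouping techniques target.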




About this article

Cite this article

Liu, L., Tang, W., Buttler, D. et al. Information Monitoring on the Web: A Scalable Solution. World Wide Web 5, 263–304 (2002). https://doi.org/10.1023/A:1021028509335


  • DOI: https://doi.org/10.1023/A:1021028509335
