A Shared Fragments Analysis System for Large Collections of Web Pages

  • Junchang Ma
  • Zhimin Gu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3872)


Dividing web pages into fragments has been shown to provide significant benefits for both content generation and caching. However, the lack of good methods for identifying interesting fragments in large collections of web pages prevents existing large web sites from adopting fragment-based techniques. A fragment is considered interesting if it is shared among multiple web pages, either completely (same content) or structurally (same structure). This paper first gives a formal description of the problem and then presents our system for shared fragments analysis. We propose a data structure for representing web pages and develop an efficient algorithm that uses database techniques. Our system is unique in performing shared fragments analysis on large collections of web pages. We have built the system and applied it successfully to several large sets of web pages; the results show its effectiveness and usefulness, and the system may serve as a core building block in many applications.
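To make the distinction between completely and structurally shared fragments concrete, the following is a minimal sketch, not the system described in the paper. It assumes a simplified stand-in for a DOM tree (nested `(tag, text, children)` tuples) and uses SHA-1 subtree digests held in in-memory dictionaries; the paper's actual data structure and database-backed algorithm are not given in the abstract. All names (`digest`, `index_page`, `shared`) are hypothetical.

```python
import hashlib
from collections import defaultdict

# Assumption: a node is (tag, text, children), a toy stand-in for a DOM node.

def digest(*parts):
    """Hash a sequence of strings, separating parts with a NUL byte."""
    h = hashlib.sha1()
    for part in parts:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()

def index_page(page_id, node, content_index, structure_index):
    """Record two digests for every subtree and return them.

    The content digest covers tag, text, and children; the structure
    digest ignores text, so it matches subtrees with the same shape.
    """
    tag, text, children = node
    child = [index_page(page_id, c, content_index, structure_index)
             for c in children]
    content = digest("C", tag, text, *[c for c, _ in child])
    structure = digest("S", tag, *[s for _, s in child])
    content_index[content].add(page_id)
    structure_index[structure].add(page_id)
    return content, structure

def shared(index, min_pages=2):
    """Keep only digests that occur in at least min_pages distinct pages."""
    return {h: pages for h, pages in index.items() if len(pages) >= min_pages}

# Two illustrative pages with the same navigation block but different stories.
nav = ("div", "", [("a", "Home", []), ("a", "News", [])])
page_a = ("body", "", [nav, ("p", "Story A", [])])
page_b = ("body", "", [nav, ("p", "Story B", [])])

content_index = defaultdict(set)
structure_index = defaultdict(set)
for page_id, page in (("a", page_a), ("b", page_b)):
    index_page(page_id, page, content_index, structure_index)

completely_shared = shared(content_index)
structurally_shared = shared(structure_index)
```

In this toy input, the navigation `div` and its two links are shared by content, while the `p` and `body` subtrees differ in text and therefore match only structurally. A system for large collections would presumably keep such digests in database tables rather than in dictionaries, which is where the database techniques mentioned above could apply.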


Keywords (added by machine, not by the authors): Large Collection, Fragment Analysis, Document Object Model, Document Object Model Tree, Shared Fragment



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Junchang Ma (1)
  • Zhimin Gu (1)
  1. Department of Computer Science and Engineering, Beijing Institute of Technology, Beijing, China
