A Shared Fragments Analysis System for Large Collections of Web Pages

Ma, Junchang; Gu, Zhimin

doi:10.1007/11669487_35

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 3872)

Included in the following conference series:

International Workshop on Document Analysis Systems

Abstract

Dividing web pages into fragments has been shown to provide significant benefits for both content generation and caching. However, existing large web sites have been slow to adopt fragment-based techniques because good methods for finding interesting fragments in large collections of web pages are lacking. A fragment is considered interesting if it is shared among multiple web pages, either completely or structurally. This paper first gives a formal description of the problem and then presents our system for shared fragments analysis. We propose a data structure for representing web pages and develop an efficient algorithm that uses database techniques. Our system is distinctive in performing shared fragments analysis over large collections of web pages. The system has been built and applied successfully to several large sets of web pages, which demonstrates its effectiveness and usefulness, and it may serve as a core building block in many applications.
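The distinction between completely and structurally shared fragments can be illustrated with a minimal sketch. It is not the authors' data structure or algorithm, which the abstract does not specify. It assumes well-formed markup that `xml.etree.ElementTree` can parse, ignores attributes, and treats every subtree as a candidate fragment. The names `subtree_digest` and `shared_fragments` are hypothetical, and an in-memory dictionary stands in for the database the paper mentions.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def subtree_digest(node, index, page_id, with_text=True):
    """Digest the subtree rooted at `node` and record it in `index`.

    `index` maps digest -> set of page ids containing that subtree.
    With with_text=False only tag structure is hashed, so subtrees
    that differ only in text receive the same digest.
    """
    parts = [node.tag]
    if with_text:
        parts.append((node.text or "").strip())
    for child in node:
        # Children are digested first, so a parent's digest covers
        # its whole subtree (bottom-up hashing).
        parts.append(subtree_digest(child, index, page_id, with_text))
        if with_text:
            parts.append((child.tail or "").strip())
    digest = hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()
    index[digest].add(page_id)
    return digest


def shared_fragments(pages, with_text=True, min_pages=2):
    """Return {digest: page ids} for subtrees found on >= min_pages pages.

    `pages` maps a page id to its markup (assumed well-formed).
    """
    index = defaultdict(set)
    for page_id, markup in pages.items():
        subtree_digest(ET.fromstring(markup), index, page_id, with_text)
    return {d: ids for d, ids in index.items() if len(ids) >= min_pages}


if __name__ == "__main__":
    pages = {
        "a": "<html><div><p>nav</p></div><p>one</p></html>",
        "b": "<html><div><p>nav</p></div><p>two</p></html>",
    }
    # Completely shared: identical tags and text.
    print(len(shared_fragments(pages)))
    # Structurally shared: identical tag structure, text ignored.
    print(len(shared_fragments(pages, with_text=False)))
```

Hashing bottom-up lets each subtree be compared by a single fixed-size value, which is what would make a grouped lookup in a database table practical for large collections. A real system would also need an HTML-tolerant parser and a rule for which subtrees are large enough to count as fragments.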

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Beijing Institute of Technology, Beijing, 100081, China
Junchang Ma & Zhimin Gu

Editor information

Editors and Affiliations

Institute of Computer Science and Applied Mathematics, University of Bern, Neubrückstrasse 10, CH-3012, Bern, Switzerland
Horst Bunke
DocRec Ltd, 34 Strathaven Place, 7001, Atawhai, Nelson, New Zealand
A. Lawrence Spitz

Cite this paper

Ma, J., Gu, Z. (2006). A Shared Fragments Analysis System for Large Collections of Web Pages. In: Bunke, H., Spitz, A.L. (eds) Document Analysis Systems VII. DAS 2006. Lecture Notes in Computer Science, vol 3872. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11669487_35

DOI: https://doi.org/10.1007/11669487_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32140-8
Online ISBN: 978-3-540-32157-6
eBook Packages: Computer Science, Computer Science (R0)
