Abstract
We present a method for discovering the set of webpages contained in a logical website, based on the link structure of the Web graph. Such a method is useful for Web archiving and for computing website importance. To identify the boundaries of a website, we combine an online version of the preflow-push algorithm, an algorithm for the maximum flow problem in traffic networks, with the Markov CLuster (MCL) algorithm. MCL is run on a crawled portion of the Web graph to build a seed of initial webpages, and the preflow-push algorithm then extends this seed. We describe an experiment on a subsite of the INRIA website.
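For readers unfamiliar with preflow-push (also called push-relabel), the following is a minimal sketch of the standard offline textbook algorithm, not the online variant used in the paper, whose details the abstract does not give. The function name `max_flow_push_relabel` and the edge-dictionary input format are assumptions chosen for illustration.

```python
from collections import defaultdict, deque


def max_flow_push_relabel(capacity, source, sink):
    """Return the maximum flow value from source to sink.

    capacity: dict mapping directed edges (u, v) to non-negative capacities.
    Illustrative textbook version (FIFO selection of active nodes);
    the online variant described in the paper is not reproduced here.
    """
    residual = defaultdict(int)    # residual capacity of each directed edge
    neighbors = defaultdict(set)   # adjacency in the residual graph
    nodes = set()
    for (u, v), c in capacity.items():
        nodes.update((u, v))
        residual[(u, v)] += c
        neighbors[u].add(v)
        neighbors[v].add(u)        # reverse edge, used to cancel flow

    height = {v: 0 for v in nodes}
    excess = {v: 0 for v in nodes}
    height[source] = len(nodes)
    active = deque()

    # Initial preflow: saturate every edge leaving the source.
    for v in neighbors[source]:
        c = residual[(source, v)]
        if c > 0:
            residual[(source, v)] -= c
            residual[(v, source)] += c
            excess[v] += c
            excess[source] -= c
            if v != sink:
                active.append(v)

    while active:
        u = active.popleft()
        # Discharge u: push along admissible edges, relabel when stuck.
        while excess[u] > 0:
            pushed = False
            for v in neighbors[u]:
                if residual[(u, v)] > 0 and height[u] == height[v] + 1:
                    delta = min(excess[u], residual[(u, v)])
                    residual[(u, v)] -= delta
                    residual[(v, u)] += delta
                    excess[u] -= delta
                    excess[v] += delta
                    # v just became active if its excess was zero before.
                    if v not in (source, sink) and excess[v] == delta:
                        active.append(v)
                    pushed = True
                    if excess[u] == 0:
                        break
            if not pushed:
                # Relabel: lift u just above its lowest residual neighbor.
                height[u] = 1 + min(
                    height[v] for v in neighbors[u] if residual[(u, v)] > 0
                )
    return excess[sink]
```

By the max-flow min-cut theorem, the value returned equals the capacity of a minimum cut separating `source` from `sink`. Flow-based boundary detection generally relies on such a cut to separate a set of seed pages from the rest of the graph.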
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Senellart, P. (2005). Identifying Websites with Flow Simulation. In: Lowe, D., Gaedke, M. (eds) Web Engineering. ICWE 2005. Lecture Notes in Computer Science, vol 3579. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11531371_18
DOI: https://doi.org/10.1007/11531371_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27996-9
Online ISBN: 978-3-540-31484-4
eBook Packages: Computer Science, Computer Science (R0)