Identifying Websites with Flow Simulation

  • Pierre Senellart
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3579)

Abstract

We present in this paper a method to discover the set of webpages contained in a logical website, based on the link structure of the Web graph. Such a method is useful in the context of Web archiving and website importance computation. To identify the boundaries of a website, we combine the use of an online version of the preflow-push algorithm, an algorithm for the maximum flow problem in traffic networks, and of the Markov CLuster (MCL) algorithm. The latter is used on a crawled portion of the Web graph in order to build a seed of initial webpages, a seed which is extended using the former. An experiment on a subsite of the INRIA Website is described.
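
The flow-based extension step can be illustrated with a minimal sketch of the classical (offline) preflow-push algorithm on a small dense capacity matrix. This is not the paper's online variant. The function name `max_flow_push_relabel`, the toy graph, its capacities, and the choice of node 0 as the seed and node 5 as an "outside" sink are all assumptions made for illustration. The sketch returns the maximum flow value together with the source side of a minimum cut, which can be read as one candidate boundary around the seed.

```python
from collections import deque


def max_flow_push_relabel(capacity, source, sink):
    """Return (max flow value, source side of a minimum cut).

    `capacity` is a square matrix of non-negative integers.
    """
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]
    height = [0] * n
    excess = [0] * n
    height[source] = n

    def residual(u, v):
        return capacity[u][v] - flow[u][v]

    def push(u, v):
        delta = min(excess[u], residual(u, v))
        flow[u][v] += delta
        flow[v][u] -= delta
        excess[u] -= delta
        excess[v] += delta

    # Initial preflow: saturate every edge leaving the source.
    excess[source] = sum(capacity[source])
    active = deque()
    for v in range(n):
        if capacity[source][v] > 0:
            push(source, v)
            if v != sink:
                active.append(v)

    # Discharge active nodes in FIFO order until no excess remains.
    while active:
        u = active.popleft()
        while excess[u] > 0:
            for v in range(n):
                if residual(u, v) > 0 and height[u] == height[v] + 1:
                    had_excess = excess[v] > 0
                    push(u, v)
                    if v != source and v != sink and not had_excess:
                        active.append(v)
                    if excess[u] == 0:
                        break
            if excess[u] > 0:
                # No admissible edge is left: relabel u.
                height[u] = 1 + min(
                    height[v] for v in range(n) if residual(u, v) > 0
                )

    # Nodes still reachable from the source in the residual graph
    # form the source side of a minimum cut.
    side = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if v not in side and residual(u, v) > 0:
                side.add(v)
                queue.append(v)
    return excess[sink], side


# Hypothetical toy graph: pages 0-2 are tightly linked, pages 3-5 lie outside.
toy_capacity = [
    [0, 3, 3, 0, 0, 0],
    [0, 0, 2, 1, 0, 0],
    [0, 2, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 3],
    [0, 0, 0, 0, 0, 3],
    [0, 0, 0, 0, 0, 0],
]
value, side = max_flow_push_relabel(toy_capacity, source=0, sink=5)
print(value, sorted(side))
```

In this toy graph the only low-capacity links are the two leaving the tightly linked group, so the cut separates pages 0, 1 and 2 from the rest. How capacities are assigned and how the sink is chosen on a real crawl are design decisions this sketch does not address.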

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Pierre Senellart (1, 2)
  1. École normale supérieure, Paris Cedex 05, France
  2. INRIA Futurs, Orsay Cedex, France
