Abstract
This chapter covers the basic properties, concepts, and models of the Web graph, as well as the main link ranking and Web page clustering algorithms. We also address important algorithmic issues such as streaming computation on graphs and Web graph compression.
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Baeza-Yates, R., Boldi, P. (2010). Web Structure Mining. In: Velásquez, J.D., Jain, L.C. (eds) Advanced Techniques in Web Intelligence - I. Studies in Computational Intelligence, vol 311. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14461-5_5
DOI: https://doi.org/10.1007/978-3-642-14461-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14460-8
Online ISBN: 978-3-642-14461-5
eBook Packages: Engineering, Engineering (R0)