World Wide Web

, Volume 11, Issue 1, pp 39–70 | Cite as

Prefetching in Content Distribution Networks via Web Communities Identification and Outsourcing

  • Antonis Sidiropoulos
  • George Pallis
  • Dimitrios Katsaros
  • Konstantinos Stamos
  • Athena Vakali
  • Yannis Manolopoulos


Content distribution networks (CDNs) improve scalability and reliability, by replicating content to the “edge” of the Internet. Apart from the pure networking issues of the CDNs relevant to the establishment of the infrastructure, some very crucial data management issues must be resolved to exploit the full potential of CDNs to reduce the “last mile” latencies. A very important issue is the selection of the content to be prefetched to the CDN servers. All the approaches developed so far, assume the existence of adequate content popularity statistics to drive the prefetch decisions. Such information though, is not always available, or it is extremely volatile, turning such methods problematic. To address this issue, we develop self-adaptive techniques to select the outsourced content in a CDN infrastructure, which requires no apriori knowledge of request statistics. We identify clusters of “correlated” Web pages in a site, called Web site communities, and make these communities the basic outsourcing unit. Through a detailed simulation environment, using both real and synthetic data, we show that the proposed techniques are very robust and effective in reducing the user-perceived latency, performing very close to an unfeasible, off-line policy, which has full knowledge of the content popularity.


data dissemination techniques on the web content distribution networks web communities web prefetching internet and web-based web data mining 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Annapureddy, S., Freedman, M.J., Mazieres D.: Shark: scaling file servers via cooperative caching. In: Proceedings of the USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI), Boston, MA, 2–4 May 2005Google Scholar
  2. 2.
    Bansal, N., Blum, A., Chawla, S.: Correlation clustering. Mach. Learn. 56(1–3), 89–113 (2004)zbMATHCrossRefGoogle Scholar
  3. 3.
    Bent, L., Rabinovich, M., Voelker, G.M., Xiao, Z.: Characterization of a large Web site population with implications for content delivery. World Wide Web J. 9(4), 505–536 (2006)CrossRefGoogle Scholar
  4. 4.
    Chakrabarti, D., Zhan, Y., Faloutsos, C.: R-MAT: A recursive model for graph mining. In: Proceedings of the SIAM Data Mining Conference (SDM) (2004)Google Scholar
  5. 5.
    Charikar, M., Chekuri, C., Feder, T., Motwani, R.: Incremental clustering and dynamic information retrieval. SIAM J. Comput. 33(6), 1417–1440 (2004)zbMATHMathSciNetGoogle Scholar
  6. 6.
    Chen, Y., Qiu, L., Chen, W., Nguyen, L., Katz, R.H.: Efficient and adaptive Web replication using content clustering. IEEE J. Sel. Areas Commun. 21(6), 979–994 (2003)CrossRefGoogle Scholar
  7. 7.
    Eiron, N., McCurley, K.S.: Untangling compound documents on the Web. In: Proceedings of the ACM Conference on Hypertext and hypermedia (HT), Nottingham, 26–30 August 2003Google Scholar
  8. 8.
    Fan, L., Cao, P., Almeida, J.M., Broder, A.Z.: Summary cache: a scalable wide-area Web cache sharing protocol. IEEE/ACM Trans. Netw. 8(3), 281–293 (2000)CrossRefGoogle Scholar
  9. 9.
    Flake, G.W., Lawrence, S., Giles, C.L., Coetzee, F.M.: Self-organization and identification of web communities. IEEE Comput. 35(3), 66–71 (2002)Google Scholar
  10. 10.
    Flake, G.W., Tarjan, R.E., Tsioutsiouliklis, K.: Graph clustering and minimum cut trees. Internet Math. 1(4), 385–408 (2003/2004)MathSciNetGoogle Scholar
  11. 11.
    Freeman, L.C.: A set of measures of centrality based on betweenness. Sociometry 40(1), 35–41 (1977)CrossRefGoogle Scholar
  12. 12.
    Fujita, N., Ishikawa, Y., Iwata, A., Izmailov, R.: Coarse-grain replica management strategies for dynamic replication of Web contents. Comput. Networks 45(1), 19–34 (2004)zbMATHCrossRefGoogle Scholar
  13. 13.
    Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. U. S. A. 99(12), 7821–7826 (2002)zbMATHCrossRefMathSciNetGoogle Scholar
  14. 14.
    Greco, G., Greco, S., Zumpano, E.: Web communities: models and algorithms. World Wide Web J. 7(1), 58–82 (2004)CrossRefGoogle Scholar
  15. 15.
    Guimera, R., Sales-Pardo, M., Nunes Amaral, L.A.: Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E 70, 025101 (2004)CrossRefGoogle Scholar
  16. 16.
    Haveliwala, T.H.: Topic-sensitive PageRank: a context-sensitive ranking algorithm for Web search. IEEE Trans. Knowl. Data Eng. 15(4), 784–796 (2003)CrossRefGoogle Scholar
  17. 17.
    He, Y., Siganos, G., Faloutsos, M., Krishnamurthy, S.: Asystematic framework for unearthing the missing links: measurements and impact. In: Proceedings of the USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI), pp. 187–200. Cambridge, USA, 11–13 April 2007Google Scholar
  18. 18.
    Ino, H., Kudo, M., Nakamura, A.: Partitioning of Web graphs by community topology. In: Proceedings of the ACM World Wide Web Conference (WWW), Chiba, 10–14 May 2005Google Scholar
  19. 19.
    Jung, Y., Krishnamurthy, B., Rabinovich, M.: Flash crowds and denial of service attacks: characterization and implications for CDNs and Web sites. In: Proceedings of the ACM World Wide Web Conference (WWW), Honolulu, 7–11 May 2002Google Scholar
  20. 20.
    Kangasharju, J., Roberts, J.W., Ross, K.W.: Object replication strategies in content distribution networks. Comput. Commun. 25(4), 376–383 (2002)CrossRefGoogle Scholar
  21. 21.
    Katsaros, D., Manolopoulos, Y.: Caching in Web memory hierarchies. In: Proceedings of the ACM Symposium on Applied Computing (SAC), Nicosia, 14–17 March 2004Google Scholar
  22. 22.
    Katsaros, D., Manolopoulos, Y.: Prediction in wireless networks by Markov chains. IEEE Wireless Communications magazine, 2007 (in press)Google Scholar
  23. 23.
    Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632 (1999)zbMATHCrossRefMathSciNetGoogle Scholar
  24. 24.
    Krishnamurthy, B., Wills, C., Zhang, Y.: On the use and performance of content distribution networks. In: Proceedings of the ACM SIGCOMM Internet Measurement Workshop (IMW), San Francisco, 1–2 November 2001Google Scholar
  25. 25.
    Kroeger, T., Long, D.E., Mogul, J.: Exploring the bounds of Web latency reduction from caching and prefetching. In: Proceedings of the USENIX Symposium on Internet Technologies and Services (USITS), Monterey, 8–11 December 1997Google Scholar
  26. 26.
    Kulkarni, P., Shenoy, P.J., Gong, W.: Scalable techniques for memory-efficient CDN simulations. In: Proceedings of the ACM World Wide Web Conference (WWW), Budapest, 20–24 May 2003Google Scholar
  27. 27.
    Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Trawling the Web for emerging cyber-communities. Comput. Networks 31(11–16), 1481–1493 (1999)CrossRefGoogle Scholar
  28. 28.
    Laoutaris, N., Zissimopoulos V., Stavrakakis, I.: On the optimization of storage capacity allocation for content distribution. Comput. Networks 47(3), 409–428 (2005)CrossRefGoogle Scholar
  29. 29.
    Leung, K.Y., Eric Wong, W.M., Yeung, K.H.: Designing efficient and robust caching algorithms for streaming-on-demand services on the Internet. World Wide Web J. 7(3), 297–314 (2004)CrossRefGoogle Scholar
  30. 30.
    Li, W.-S., Candan, K.S., Vu, Q., Agrawal, D.: Query relaxation by structure and semantics for retrieval of logical Web documents. IEEE Trans. Knowl. Data Eng. 14(4), 768–791 (2002)CrossRefGoogle Scholar
  31. 31.
    Masada, T., Takasu, A., Adachi, J.: Web page grouping based on parameterized connectivity. In: Lee, Y.-J., Li, J., Whang, K.-Y., Lee, D. (eds.) Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA), pp. 374–380. Springer, Heidelberg (2004)Google Scholar
  32. 32.
    Nanopoulos, A., Katsaros, D., Manolopoulos, Y.: A data mining algorithm for generalized Web prefetching. IEEE Trans. Knowl. Data Eng. 15(5), 1155–1169 (2003)CrossRefGoogle Scholar
  33. 33.
    Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004)CrossRefGoogle Scholar
  34. 34.
    Ninan, A.G., Kulkarni, P., Shenoy, P.J., Ramamritham, K., Tewari, R.: Scalable consistency maintenance in content distribution networks using cooperative leases. IEEE Trans. Knowl. Data Eng. 15(4), 813–828 (2003)CrossRefGoogle Scholar
  35. 35.
    Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford University (1999)Google Scholar
  36. 36.
    Pallis, G., Stamos, K., Vakali, A., Katsaros, D., Sidiropoulos, A., Manolopoulos, Y.: Replication based on objects load under a content distribution network. In: Proceedings of the IEEE International Workshop on Challenges in Web Information Retrieval and Integration (WIRI), Atlanta, 3–7 April 2006Google Scholar
  37. 37.
    Pallis, G., Vakali, A.: Insight and perspectives for content delivery networks. Commun. ACM 49(1), 101–106 (2006)CrossRefGoogle Scholar
  38. 38.
    Pallis, G., Vakali, A., Stamos, K., Sidiropoulos, A., Katsaros, D., Manolopoulos, Y. A latency-based object placement approach in content distribution networks. In: Proceedings of the IEEE Latin American Web Congress (LA-Web), Buenos Aires, 31 October–2 November, 2005Google Scholar
  39. 39.
    Pan, J., Hou, Y.T., Li, B.: An overview of DNS-based server selections in content distribution networks. Comput. Networks 43(6), 695–711 (2003)zbMATHCrossRefGoogle Scholar
  40. 40.
    Qiu, L., Padmnanabhan, V.N., Voelker, G.M.: On the placement of Web server replicas. In: Proceedings of the IEEE Conference on Computer Communications (INFOCOM), vol. 3, pp. 1587–1596. IEEE, Piscataway (2001)Google Scholar
  41. 41.
    Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., Parisi, D.: Defining and identifying communities in networks. Proc. Natl. Acad. Sci. U. S. A. 101(9), 2658–2663 (2004)CrossRefGoogle Scholar
  42. 42.
    Sivasubramanian, S., Pierre, G., van Steen, M., Alonso, G.: Analysis of caching and replication strategies for web applications. IEEE Internet Computing 11(1), 60–66 (2007)CrossRefGoogle Scholar
  43. 43.
    Spertus, E., Sahami, M., Buyukkokten, O.: Evaluating similarity measures: a large-scale study in the Orkut social network. In: Proceeding of the ACM International Conference on Knowledge Discovery in Data Mining (SIGKDD), pp. 678–684 (2005)Google Scholar
  44. 44.
    Stoica, I., Morris, R., Karger, D., Kaashoek, M.F., Balakrishnan, H.: Chord: a scalable peer-to-peer lookup service for Internet applications. In: Proceedings of ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), San Diego, 27–31 August 2001Google Scholar
  45. 45.
    Tajima, K., Hatano, K., Matsukura, T., Sano, R., Tanaka, K.: Discovery and retrieval of logical information units in Web. In: Proceedings of the ACM Workshop on Organizing Web Space, in conjunction with ACM Digital Libraries (1999)Google Scholar
  46. 46.
    Vakali, A.: Proxy cache replacement algorithms: a history-based approach. World Wide Web J. 4(4), 277–298 (2001)zbMATHCrossRefGoogle Scholar
  47. 47.
    Vakali, A., Pallis, G.: Content delivery networks: status and trends. IEEE Internet Computing 7(6), 68–74 (2003)CrossRefGoogle Scholar
  48. 48.
    Venkataramani, A., Yalagandula, P., Kokku, R., Sharif, S., Dahlin, M.: The potential costs and benefits of long-term prefetching for content distribution. Comput. Commun. 25(4), 367–375 (2002)CrossRefGoogle Scholar
  49. 49.
    Wu, B., Kshemkalyani, A.D.: Objective-greedy algorithms for long-term Web prefetching. In: Proceedings of the IEEE Conference on Network Computing and Applications (NCA), pp. 61–68. IEEE, Piscataway (2004)Google Scholar
  50. 50.
    Wu, B., Kshemkalyani, A.D.: Objective-optimal algorithms for long-term Web prefetching. IEEE Trans. Comput. 55(1), 2–17 (2006)CrossRefGoogle Scholar
  51. 51.
    Zegura, E., Calvert, K., Bhattacharjee, S.: How to model an internetwork. In: Proceedings of the IEEE Conference on Computer Communications (INFOCOM), pp. 594–602. IEEE, Piscataway (1996)CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Antonis Sidiropoulos
    • 1
  • George Pallis
    • 1
  • Dimitrios Katsaros
    • 1
    • 2
  • Konstantinos Stamos
    • 1
  • Athena Vakali
    • 1
  • Yannis Manolopoulos
    • 1
  1. 1.Informatics DepartmentAristotle UniversityThessalonikiGreece
  2. 2.Computer & Communication Engineering DepartmentUniversity of ThessalyVolosGreece

Personalised recommendations