Characterization of a Large Web Site Population with Implications for Content Delivery

World Wide Web

Abstract

This paper presents a systematic study of the properties of a large number of Web sites hosted by a major ISP. To our knowledge, ours is the first comprehensive study of a large server farm that contains thousands of commercial Web sites. We also perform a simulation analysis to estimate potential performance benefits of content delivery networks (CDNs) for these Web sites, and validate our analysis for several sites by replaying our trace through a real cache. We make several interesting observations about the current usage of Web technologies and Web site performance characteristics. First, compared with previous client workload studies, the Web server farm workload contains a much higher proportion of uncacheable responses and responses that require mandatory cache validations. A significant reason for this is that cookie use is prevalent among our population, especially among more popular sites. We found an indication of widespread indiscriminate usage of cookies, which unnecessarily impedes the use of many content delivery optimizations. We also found that most Web sites do not utilize the cache-control features of the HTTP/1.1 protocol, resulting in suboptimal performance. Moreover, the implicit expiration time in client caches for responses is strongly constrained by the maximum values allowed in the Squid proxy. Thus, supplying explicit expiration information would significantly improve Web sites’ cacheability. Finally, our simulation results indicate that while most Web sites benefit from the use of a CDN, the size of the benefit varies widely among the sites, which underscores the need for workload analysis tools.
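
The point about implicit versus explicit expiration can be illustrated with a minimal sketch of how a cache might choose a freshness lifetime. This is not the simulator or the Squid configuration used in the study; the function name `freshness_lifetime`, the factor `LM_FACTOR`, and the cap `MAX_IMPLICIT` are hypothetical values chosen only for illustration. The sketch assumes a common heuristic in which, absent explicit headers, the lifetime is a fraction of the time since the object was last modified, bounded by a configured maximum.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed heuristic parameters (illustrative only, not taken from the paper).
LM_FACTOR = 0.2                     # fraction of the object's age at fetch time
MAX_IMPLICIT = timedelta(days=3)    # cap on any heuristic (implicit) lifetime


def freshness_lifetime(date: datetime,
                       last_modified: Optional[datetime] = None,
                       max_age: Optional[int] = None,
                       expires: Optional[datetime] = None) -> timedelta:
    """Return how long a cache may reuse a response without revalidating it."""
    # Explicit information from the site wins: max-age first, then Expires.
    if max_age is not None:
        return timedelta(seconds=max_age)
    if expires is not None:
        return max(expires - date, timedelta(0))
    # No explicit information: fall back to a Last-Modified heuristic,
    # clamped to the configured maximum.
    if last_modified is not None:
        implicit = (date - last_modified) * LM_FACTOR
        return min(max(implicit, timedelta(0)), MAX_IMPLICIT)
    # Nothing to go on: treat the response as immediately stale.
    return timedelta(0)


if __name__ == "__main__":
    now = datetime(2004, 1, 1, tzinfo=timezone.utc)
    year_old = now - timedelta(days=365)
    print("implicit:", freshness_lifetime(now, last_modified=year_old))
    print("explicit:", freshness_lifetime(now, last_modified=year_old,
                                          max_age=30 * 86400))
```

Under these assumed parameters, an object unchanged for a year still receives only the capped three-day implicit lifetime, whereas an explicit `max-age` of 30 days is honored in full. That gap is the kind of effect the abstract attributes to missing expiration information.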

Author information

Corresponding author

Correspondence to Leeann Bent.

Additional information

Bent, Rabinovich, and Xiao performed this work while at AT&T Labs-Research.

About this article

Cite this article

Bent, L., Rabinovich, M., Voelker, G.M. et al. Characterization of a Large Web Site Population with Implications for Content Delivery. World Wide Web 9, 505–536 (2006). https://doi.org/10.1007/s11280-006-0224-x

  • DOI: https://doi.org/10.1007/s11280-006-0224-x
