Abstract
This paper presents a systematic study of the properties of a large number of Web sites hosted by a major ISP. To our knowledge, ours is the first comprehensive study of a large server farm that contains thousands of commercial Web sites. We also perform a simulation analysis to estimate the potential performance benefits of content delivery networks (CDNs) for these Web sites, and validate our analysis for several sites by replaying our trace through a real cache. We make several observations about the current usage of Web technologies and about Web site performance characteristics. First, compared with previous client workload studies, the Web server farm workload contains a much higher proportion of uncacheable responses and of responses that require mandatory cache validation. A significant reason for this is that cookie use is prevalent among our population, especially among the more popular sites. We found indications of widespread indiscriminate use of cookies, which unnecessarily impedes many content delivery optimizations. We also found that most Web sites do not use the cache-control features of the HTTP/1.1 protocol, resulting in suboptimal performance. Moreover, the implicit expiration time that client caches assign to responses is strongly constrained by the maximum values allowed in the Squid proxy. Thus, supplying explicit expiration information would significantly improve Web sites’ cacheability. Finally, our simulation results indicate that while most Web sites benefit from the use of a CDN, the amount of benefit varies widely among the sites, which underscores the need for workload analysis tools.
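The point about implicit expiration can be illustrated with a minimal sketch of how a cache might assign a freshness lifetime when a response carries no explicit expiration. This is not the paper's simulator or Squid's actual code. The function name `freshness_lifetime`, the 20% last-modified factor, and the three-day cap are assumptions chosen to resemble a typical Squid `refresh_pattern` setting.

```python
from datetime import datetime, timedelta
from typing import Optional


def freshness_lifetime(
    date: datetime,
    last_modified: Optional[datetime],
    max_age: Optional[timedelta] = None,
    lm_percent: float = 0.20,                     # assumed heuristic factor
    min_lifetime: timedelta = timedelta(0),       # assumed lower bound
    max_lifetime: timedelta = timedelta(days=3),  # assumed proxy cap
) -> timedelta:
    """Return how long a cache may treat a response as fresh (illustrative)."""
    # Explicit expiration supplied by the site takes precedence over heuristics.
    if max_age is not None:
        return max_age
    # Without a usable Last-Modified value there is nothing to base a heuristic on.
    if last_modified is None or last_modified > date:
        return min_lifetime
    # Heuristic: a fraction of the time since the object last changed,
    # clamped to the bounds configured in the cache.
    heuristic = (date - last_modified) * lm_percent
    return max(min_lifetime, min(heuristic, max_lifetime))


if __name__ == "__main__":
    now = datetime(2004, 5, 1)
    old_object = now - timedelta(days=100)
    # The heuristic alone would give 20 days, but the cap limits it.
    print(freshness_lifetime(now, old_object))
    # An explicit lifetime supplied by the site is not subject to the cap.
    print(freshness_lifetime(now, old_object, max_age=timedelta(days=30)))
```

Under these assumed settings, an object unchanged for 100 days is still treated as fresh for only three days unless the site states a longer lifetime explicitly, which is the effect the abstract attributes to the proxy's maximum values.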
Additional information
Bent, Rabinovich, and Xiao performed this work while at AT&T Labs-Research.
About this article
Cite this article
Bent, L., Rabinovich, M., Voelker, G.M. et al. Characterization of a Large Web Site Population with Implications for Content Delivery. World Wide Web 9, 505–536 (2006). https://doi.org/10.1007/s11280-006-0224-x
DOI: https://doi.org/10.1007/s11280-006-0224-x
Categories and Subject Descriptors
- C.2.5 [Computer-Communication Networks]: Local and Wide-Area Networks (Internet)
- C.4 [Performance of Systems]: Performance Attributes
- I.6 [Simulation and Modeling]: Applications