Scientometrics, Volume 59, Issue 1, pp 43–62

Modelling the characteristics of Web page outlinks

  • Isola Ajiferuke
  • Dietmar Wolfram
Abstract

Using data sampled from top-level Web pages across five high-level domains and from sample pages within individual websites, the authors investigate the frequency distribution of outlinks in Web pages. The observed distributions were fitted to different theoretical distributions to determine the best-fitting model for representing outlink frequency across Web pages. Theoretical models tested include the modified power law (MPL), Mandelbrot (MDB), generalized Waring (GW), generalized inverse Gaussian-Poisson (GIGP), and generalized negative binomial (GNB) distributions. The GIGP and GNB provided good fits for data sets for top-level pages across the high-level domains tested, with the GIGP performing slightly better. The lumpiness and bimodal nature of two of the observed outlink distributions from Web pages within a given website resulted in poor fits of the theoretical models. The GIGP was able to provide better fits to these data sets after the top components were truncated. The ability to effectively model Web page attributes, such as the distribution of the number of outlinks per page, paves the way for simulation models of Web page structural content, and makes it possible to estimate the number of outlinks that may be encountered within Web pages of a specific domain or within individual websites.
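
The fitting procedure described in the abstract can be illustrated with a small sketch. It is not the authors' method: it swaps in the ordinary negative binomial (not the GNB or GIGP), estimates parameters by the method of moments, and scores the fit with a Pearson chi-square statistic. The `outlinks` sample and all function names are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def fit_negative_binomial(counts):
    """Method-of-moments estimates (r, p) for an ordinary negative binomial."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    if var <= mean:
        raise ValueError("data are not overdispersed; negative binomial does not apply")
    p = mean / var
    r = mean * mean / (var - mean)
    return r, p


def nb_pmf(k, r, p):
    """P(X = k) for a negative binomial with real-valued r, computed in log space."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)


def chi_square(counts, r, p):
    """Pearson chi-square statistic; values at or above the sample maximum are pooled."""
    n = len(counts)
    observed = Counter(counts)
    top = max(counts)
    stat = 0.0
    cumulative = 0.0
    for k in range(top):
        prob = nb_pmf(k, r, p)
        cumulative += prob
        expected = n * prob
        stat += (observed.get(k, 0) - expected) ** 2 / expected
    tail_expected = n * (1 - cumulative)
    stat += (observed.get(top, 0) - tail_expected) ** 2 / tail_expected
    return stat


# Hypothetical outlink counts per page (illustrative only, not data from the study).
outlinks = [0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 8, 9, 12, 15, 20, 27, 40]

r_hat, p_hat = fit_negative_binomial(outlinks)
print(f"r = {r_hat:.3f}, p = {p_hat:.3f}, chi-square = {chi_square(outlinks, r_hat, p_hat):.2f}")
```

A lower statistic indicates a closer match between observed and expected frequencies. A real analysis would also pool sparse cells and compare the statistic against a chi-square critical value, which this sketch omits.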

Copyright information

© Kluwer Academic Publisher/Akadémiai Kiadó 2004

Authors and Affiliations

  • Isola Ajiferuke (1)
  • Dietmar Wolfram (2)
  1. Faculty of Information and Media Studies, University of Western Ontario, London, Canada
  2. School of Information Studies, University of Wisconsin-Milwaukee, Milwaukee, USA