Analysis of the Web Graph Aggregated by Host and Pay-Level Domain

  • Conference paper

Part of the book series: Studies in Computational Intelligence (SCI, volume 813)

Abstract

In this paper the web is analyzed as a graph aggregated by host and by pay-level domain (PLD). The web graph datasets are publicly available; they were released by the Common Crawl Foundation (http://commoncrawl.org) and are based on a web crawl performed from May to July 2017. The host graph has \(\sim\)1.3 billion nodes and \(\sim\)5.3 billion arcs. The PLD graph has \(\sim\)91 million nodes and \(\sim\)1.1 billion arcs. We study the distributions of degree and of the sizes of strongly and weakly connected components (SCC/WCC), focusing on power-law detection using statistical methods. The statistical plausibility of the power-law model is compared with that of several alternative distributions. While there is no evidence of power-law tails at the host level, they emerge under PLD aggregation for the indegree, SCC size and WCC size distributions. Finally, we analyze distance-related features by studying the cumulative distributions of the shortest path lengths, and we estimate the diameters of the graphs.
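As a rough illustration of what statistical power-law detection on a distribution tail involves, the sketch below estimates the tail exponent by maximum likelihood and measures the Kolmogorov-Smirnov distance between the data and the fitted model. It is a minimal sketch under stated assumptions, not the procedure used in the paper: it uses the continuous approximation (degrees and component sizes are discrete), it takes the tail cutoff `xmin` as given rather than selecting it, and the function names `fit_alpha`, `ks_distance` and `sample_power_law` are hypothetical helpers introduced here.

```python
import math
import random


def fit_alpha(values, xmin):
    """Maximum-likelihood exponent for the tail x >= xmin.

    Assumes a continuous power law p(x) ~ x**(-alpha); this is only an
    approximation for discrete data such as degrees.
    Returns (alpha, number of tail observations).
    """
    tail = [x for x in values if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0.0:
        raise ValueError("all tail values equal xmin; exponent is undefined")
    return 1.0 + len(tail) / log_sum, len(tail)


def ks_distance(values, xmin, alpha):
    """Kolmogorov-Smirnov distance between the empirical tail and the model CDF."""
    tail = sorted(x for x in values if x >= xmin)
    n = len(tail)
    worst = 0.0
    for i, x in enumerate(tail):
        model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        # Compare with the empirical CDF just below and just above x.
        worst = max(worst, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return worst


def sample_power_law(n, xmin, alpha, seed=0):
    """Draw n synthetic values from a continuous power law by inverse transform."""
    rng = random.Random(seed)
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


if __name__ == "__main__":
    data = sample_power_law(20000, xmin=1.0, alpha=2.5, seed=1)
    alpha_hat, n_tail = fit_alpha(data, xmin=1.0)
    print(alpha_hat, n_tail, ks_distance(data, 1.0, alpha_hat))
```

A small KS distance alone does not establish a power law. The abstract also describes comparing the power-law model against alternative distributions, which this sketch leaves out.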

Notes

  1. https://www.publicsuffix.org/

  2. https://www.verisign.com/assets/domain-name-report-Q22017.pdf

  3. http://github.com/ntamas/plfit

  4. https://pypi.python.org/pypi/powerlaw

References

  1. Alstott, J., Bullmore, E., Plenz, D.: Powerlaw: a Python package for analysis of heavy-tailed distributions. PLoS ONE 9(1), e85777 (2014)

  2. Barabási, A.-L., Albert, R.: Emergence of scaling in random networks. Science 286, 509–512 (1999)

  3. Broder, A., et al.: Graph structure in the web. Comput. Netw. 33(1–6), 309–320 (2000)

  4. Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical data. SIAM Rev. 51(4), 661–703 (2009)

  5. Donato, D., Leonardi, S., Millozzi, S., Tsaparas, P.: Mining the inner structure of the Web graph. J. Phys. A: Math. Theor. 41(22), 224017 (2008)

  6. Kumar, R., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A.: Trawling the Web for emerging cyber-communities. Comput. Netw. 31(11–16), 1481–1493 (1999)

  7. Leskovec, J., Sosič, R.: Snap: a general-purpose network analysis and graph-mining library. ACM Trans. Intell. Syst. Technol. (TIST) 8(1), 1 (2016)

  8. Meusel, R., Vigna, S., Lehmberg, O., Bizer, C.: The graph structure in the web - analyzed on different aggregation levels. J. Web Sci. 1, 33–47 (2015)

  9. Palmer, C.R., Gibbons, P.B., Faloutsos, C.: ANF: a fast and scalable tool for data mining in massive graphs. In: Proceedings of KDD ’02 (2002)

  10. Ponti, G., et al.: The role of medium size facilities in the HPC ecosystem: the case of the new CRESCO4 cluster integrated in the ENEAGRID infrastructure. In: Proceedings of HPCS, pp. 1030–1033 (2014)

  11. Serrano, M.A., Maguitman, A., Boguñá, M., Fortunato, S., Vespignani, A.: Decoding the structure of the WWW: a comparative analysis of web crawls. ACM Trans. Web 1(2) (2007)

  12. Zhu, J.J.H., Meng, T., Xie, Z., Li, G., Li, X.: A teapot graph and its hierarchical structure of the Chinese web. In: Proceedings of WWW ’08 (2008)

Acknowledgements

The computing resources and the related technical support used for this work have been provided by the CRESCO/ENEAGRID High Performance Computing infrastructure and its staff [10]. The CRESCO/ENEAGRID High Performance Computing infrastructure is funded by ENEA, the Italian National Agency for New Technologies, Energy and Sustainable Economic Development, and by Italian and European research programmes; see https://www.eneagrid.enea.it for information.

Author information

Corresponding author

Correspondence to Agostino Funel.

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this paper

Funel, A. (2019). Analysis of the Web Graph Aggregated by Host and Pay-Level Domain. In: Aiello, L., Cherifi, C., Cherifi, H., Lambiotte, R., Lió, P., Rocha, L. (eds) Complex Networks and Their Applications VII. COMPLEX NETWORKS 2018. Studies in Computational Intelligence, vol 813. Springer, Cham. https://doi.org/10.1007/978-3-030-05414-4_2
