Analyzing the Web: Are Top Websites Lists a Good Choice for Research?

  • Conference paper
Linking Theory and Practice of Digital Libraries (TPDL 2022)


The web has been a subject of research since its beginning, but analyzing the whole web is difficult if not impossible, even if a database of all URLs were freely accessible. Hundreds of studies have used commercial top websites lists as a shortcut, in particular the Alexa One Million Top Sites list. However, apart from the fact that Amazon decided to terminate Alexa, we question the usefulness of such lists for research, as they have several shortcomings. Our analysis shows that top sites lists miss frequently visited websites and offer little value for language-specific research. We present a heuristic-driven alternative based on the Common Crawl host-level web graph that also takes language-specific requirements into account.

  1.

    The term "sites" is used as a synonym for "hosts". A "page" is a single web page document on a host.

  2.

    Relevant queries and their search volume can be identified using tools such as the Google Keyword Planner, which provides historical search volume data for specific queries.

  5.

    Available at

  6.

    Data available at

  8.

    Data dumps of Wikipedia External links are available at

  9.

    Only search volume data from Google has been taken into account and, as a consequence, only the hosts found in Google Search.
