Linking as voting: how the Condorcet jury theorem in political science is relevant to webometrics

Abstract

A webmaster’s decision to link to a webpage can be interpreted as a “vote” for that webpage. But how far does the parallel between linking and voting extend? In this paper, we prove several “linking theorems” showing that link-based ranking tracks importance on the web in the limit as the number of webpages grows, given independence and minimal linking competence. The theorems are similar in spirit to the voting, or jury, theorem famously attributed to the 18th century mathematician Nicolas de Condorcet. We argue that the linking theorems provide a fundamental epistemological justification for link-based ranking on the web, analogous to the justification that Condorcet’s theorems bestow on majority voting as a basic democratic procedure. The analogy extends to the practical limitations facing both kinds of result, in particular due to limited voting/linking independence. However, we argue, referring to the theoretical developments inspired by the jury theorem, that some of the pessimism expressed in the webometrics literature regarding the possibility of a “theory of linking” may be unjustified. The present study connects the two academic disciplines of webometrics in information science and epistemic democracy in political science by showing how they share a common structure. As such, it opens up new possibilities for theoretical cross-fertilization and interdisciplinary transference of concepts and results. In particular, we show how the relatively young field of webometrics can benefit from the extensive and sophisticated literature on the Condorcet jury theorem.

Notes

  1. This account of the Condorcet theorem and the example that follows are due to Goodin (2003), pp. 95–96.

  2. “Larry downloaded the entire link structure of the Web, not quite knowing what he’d do with it. He realized that links weren’t organic; they were the result of conscious effort. In a sense, users were voting for the best links… when they included a link on their own site.” (Auletta 2010, p. 35).

  3. We interpret zero importance as the complete absence of any of the qualities that make a page important, and an importance value of 1 as the maximal presence of all such qualities.

  4. The probability of the surfer randomly jumping, as opposed to following a link, is a parameter in the random surfer algorithm, and it is this probability that features in the proof of Theorem 2 in this paper. The proof of Theorem 2 uses a theorem by Fortunato et al. (2008) for approximating PageRanks on the basis that PageRanks are calculated using the random surfer algorithm.

  5. Google refines the search results using a reported 200–300 other “quality signals”, which we ignore here. While the exact inner workings of the Google search engine are a trade secret, PageRank is not (U.S. Patent 6285999). PageRank is described in Brin et al. (1998). For a quick introduction and a short historical background see Wills (2006) and Franceschet (2011), respectively.

  6. One could, of course, use more stringent thresholds. Depending on the nature of the data, common alternative thresholds are 0.05 and 0.01.

  7. Putting our main question in this way also avoids a number of further issues that a more direct approach to transferring the jury theorem to the web would have to confront. What is the role of a repository of links, which serves as a signpost rather than a library? Should non-linking be considered a vote “against”, or is it an “abstention”? Do linking and non-linking constitute binary choices?

  8. The original insight that the web can be represented as a directed graph is generally attributed to Björneborn and Ingwersen (2001).

  9. In our exploratory study we assumed a positive exponential distribution characterized by its mean, which is the expected page importance given the distribution. We later explored negative exponential distributions. A positive exponential distribution implies that important pages are common; a negative exponential distribution that they are rare. Of course, any distribution of page importance could be sampled from, although most would require further parameters for their specification.

  10. We assume 3rd order polynomials for the regression analysis because, for the web ecologies concerned and across the majority of parameter configurations, exponential and logarithmic functions were found to be unsuitable, while typically 2nd and lower order polynomials under-fit and 4th and higher order polynomials over-fit the data.

  11. The run-length parameter determines the number of recursions performed to calculate the PageRanks of the pages in the web-graph. The larger this number, the more accurate the determination of PageRanks for a given size of web-graph in both the diffusion and random surfer algorithms, but the greater the computing time for a given amount of computing power.

  12. The repetitions parameter determines the number of web-graphs assessed for each parameter specification. Each parameter specification represents a particular web ecology from which graphs are sampled. Because the sampling is random, the graphs may be more or less typical of the ecology. To avoid unrepresentative results we sampled a number of graphs (10 repetitions) for each configuration and then selected a number of pages (20) from each. The metrics and page importance for these (200) pages were then plotted against each other, and the coefficient of determination for that metric in that ecology was thereby determined against a 2nd or 3rd order polynomial regression function, as appropriate.

  13. Our model of linking behavior captures key assumptions made by the Google founders and can be seen as an idealized competence model of linking on the web, emphasizing the role of webmaster reliability and (in)dependence. As such, it does not do justice to all aspects of the real web. For example, in our extended model, linking is a function solely of the importance of the source and target pages. Webmasters of more important pages are assumed to be more likely to link to other important pages. In the real web, by contrast, we would expect linking to be a function also of the topic addressed by the webpages. Thus, linking within the same topic (e.g. climate science) should be more likely to occur than linking across different topics (e.g. from climate science to modern French literature), everything else being equal. Alternatively, our model may be reinterpreted as modelling linking not in the whole web but only within one particular web topic. Future developments of the model include incorporating topic-sensitivity in the linking process and investigating related statistical models (see, for example, Schweinberger and Handcock 2015).
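
Notes 10 and 12 describe scoring a metric by its coefficient of determination against a polynomial regression on page importance. The following is a minimal sketch of that calculation, not the authors' code: the function names, the normal-equations solver, and the synthetic 200-page data set are all assumptions made for illustration.

```python
def fit_polynomial(xs, ys, order):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients [c0, c1, ..., c_order] of c0 + c1*x + ... .
    """
    m = order + 1
    # Normal equations A c = b with A[r][c] = sum x^(r+c), b[r] = sum y*x^r.
    a = [[sum(x ** (r + c) for x in xs) for c in range(m)] for r in range(m)]
    b = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(m)]
    # Forward elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for r in range(m - 1, -1, -1):
        tail = sum(a[r][c] * coeffs[c] for c in range(r + 1, m))
        coeffs[r] = (b[r] - tail) / a[r][r]
    return coeffs


def r_squared(xs, ys, coeffs):
    """Coefficient of determination of the fitted polynomial."""
    def predict(x):
        return sum(c * x ** k for k, c in enumerate(coeffs))

    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for 200 sampled pages: importance (xs) and a metric (ys)
# that happens to be an exact cubic in importance (illustrative data only).
xs = [i / 199 for i in range(200)]
ys = [0.1 + 0.5 * x - 0.3 * x ** 2 + 0.6 * x ** 3 for x in xs]
coeffs = fit_polynomial(xs, ys, order=3)
r2 = r_squared(xs, ys, coeffs)
print(round(r2, 6))
```

With real simulation output, `xs` would hold the sampled page importances and `ys` the metric (In-Degree or PageRank) for the same pages; the normal equations are adequate for low orders on the unit interval but would be a poor choice for high-order fits.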

References

  1. Almind, T. C., & Ingwersen, P. (1997). Informetric analyses on the World Wide Web: Methodological approaches to ‘webometrics’. Journal of Documentation, 53(4), 404–426.

  2. Auletta, K. (2010). Googled: The end of the world as we know it. London: The Penguin Press.

  3. Barabási, A. L. (2002). Linked: The new science of networks. Cambridge, Massachusetts: Perseus Publishing.

  4. Bar-Ilan, J. (2004). A microscopic link analysis of academic institutions within a country: The case of Israel. Scientometrics, 59(3), 391–403.

  5. Berg, S. (1993). Condorcet’s jury theorem: Dependency among voters. Social Choice and Welfare, 10, 87–95.

  6. Björneborn, L., & Ingwersen, P. (2001). Perspectives of webometrics. Scientometrics, 50(1), 65–82.

  7. Boland, P. J. (1989). Majority systems and the Condorcet jury theorem. Journal of the Royal Statistical Society, Series D (The Statistician), 38, 181–189.

  8. Boland, P. J., Proschan, F., & Tong, Y. (1989). Modelling dependence in simple and indirect majority systems. Journal of Applied Probability, 26, 81–88.

  9. Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Seventh international world-wide web conference (WWW 1998). Brisbane, Australia.

  10. Brin, S., Page, L., Motwani, R., & Winograd, T. (1998). The PageRank citation ranking: Bringing order to the web. Stanford University Technical Report.

  11. Cohen, J. (1986). An epistemic conception of democracy. Ethics, 97, 26–38.

  12. Davenport, E., & Cronin, B. (2000). The citation network as a prototype for representing trust in virtual environments. In B. Cronin & H. B. Atkins (Eds.), The web of knowledge: A Festschrift in Honor of Eugene Garfield. ASIS Monograph Series (pp. 517–534). Medford, NJ: Information Today Inc.

  13. de Condorcet, N. (1785). Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix (Essay on the application of analysis to the probability of majority decisions). Paris: L’Imprimerie Royale [facsimile edition New York: Chelsea, 1972].

  14. Dietrich, F., & Spiekermann, K. (2013). Epistemic democracy with defensible premises. Economics and Philosophy, 29, 87–120.

  15. Dietrich, F. (2008). The premises of Condorcet’s jury theorem are not simultaneously justified. Episteme, 5(1), 56–73.

  16. Dietrich, F., & List, C. (2004). A model of jury decisions where all jurors have the same evidence. Synthese, 142, 175–202.

  17. Estlund, D. M. (1994). Opinion leaders, independence, and Condorcet’s jury theorem. Theory and Decision, 36, 131–162.

  18. Estlund, D. M. (2008). Democratic authority: A philosophical framework. Princeton, NJ: Princeton University Press.

  19. Estlund, D., Waldron, J., Grofman, B., & Feld, S. L. (1989). Democratic theory and the public interest: Condorcet and Rousseau revisited. American Political Science Review, 83, 1317–1340.

  20. Fortunato, S., Boguñá, M., Flammini, A., & Menczer, F. (2008). Approximating PageRank from In-Degree. In W. Aiello, A. Broder, J. Janssen & E. Milios (Eds.), Algorithms and models for the web-graph (pp. 59–71). Berlin/Heidelberg: Springer-Verlag.

  21. Franceschet, M. (2011). PageRank: Standing on the shoulders of giants. Communications of the ACM, 54(6), 92–101.

  22. Gaus, G. (1997). Does democracy reveal the voice of the people? Four takes on Rousseau. Australasian Journal of Philosophy, 75, 141–162.

  23. Goodin, R. E. (2003). Reflective democracy. Oxford: Oxford University Press.

  24. Grofman, B., & Feld, S. L. (1988). Rousseau’s general will: A Condorcetian perspective. American Political Science Review, 82, 567–576.

  25. Hernández-Borges, A. A., Macías-Cervi, P., Gaspar-Guardado, M. A., Torres-Álvarez de Arcaya, M. L., Ruiz-Rabaza, A., & Jiménez-Sosa, A. (1999). Can examination of WWW usage statistics and other indirect quality indicators distinguish the relative quality of medical Web sites? Journal of Medical Internet Research, 1(1). http://www.jmir.org/1999/1991/e1991/index.htm.

  26. Ingwersen, P. (1998). The calculation of web impact factors. Journal of Documentation, 54(2), 236–243.

  27. Kaniovski, S. (2010). Aggregation of correlated votes and Condorcet’s jury theorem. Theory and Decision, 69, 453–468.

  28. Kendall, M. (1938). A new measure of rank correlation. Biometrika, 30, 81–89.

  29. Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632.

  30. Ladha, K. K. (1992). The Condorcet jury theorem, free speech, and correlated votes. American Journal of Political Science, 36, 617–634.

  31. Ladha, K. K. (1993). Condorcet’s jury theorem in light of de Finetti’s theorem. Social Choice and Welfare, 10, 69–85.

  32. Ladha, K. K. (1995). Information pooling through majority-rule voting: Condorcet’s jury theorem with correlated votes. Journal of Economic Behavior & Organization, 26, 353–372.

  33. List, C., & Goodin, R. E. (2001). Epistemic democracy: Generalizing the Condorcet jury theorem. Journal of Political Philosophy, 9(3), 277–306.

  34. Lorentzen, D. G. (2014). Webometrics benefitting from web mining? An investigation of methods and applications of two research fields. Scientometrics, 99, 409–445.

  35. XXXX

  36. McLean, I., & Hewitt, F. (1994). Condorcet: Foundations of social choice and political theory. Northampton, MA: Edward Elgar Publishing Limited.

  37. Nitzan, S., & Paroush, J. (1984). The significance of independent decisions in uncertain dichotomous choice situations. Theory and Decision, 17, 47–60.

  38. Owen, G., Grofman, B., & Feld, S. L. (1989). Proving a distribution-free generalization of the Condorcet jury theorem. Mathematical Social Sciences, 17, 1–16.

  39. Palmer, J. W., Bailey, J. P., & Faraj, S. (2000). The role of intermediaries in the development of trust on the WWW: The use and prominence of trusted third parties and privacy statements. Journal of Computer-Mediated Communication, 5(3). doi:10.1111/j.1083-6101.2000.tb00342.x.

  40. Pearl, J. (2000). Causality: Models, reasoning and inference. Cambridge: Cambridge University Press.

  41. Rheingold, H. (2002). Smart mobs: The next social revolution. Cambridge, MA: Perseus Publishing.

  42. Romeijn, J., & Atkinson, D. (2011). A Condorcet jury theorem for unknown juror competence. Politics, Philosophy, and Economics, 10, 237–262.

  43. Schweinberger, M., & Handcock, M. S. (2015). Local dependence in random graph models: Characterization, properties and statistical inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(3), 647–676.

  44. Shapley, L., & Grofman, B. (1984). Optimizing group judgmental accuracy in the presence of interdependencies. Public Choice, 43, 329–343.

  45. Spiekermann, K. R., & Goodin, R. E. (2012). Courts of many minds. British Journal of Political Science, 12, 555–571.

  46. Surowiecki, J. (2004). The wisdom of crowds: Why the many are smarter than the few and how collective wisdom shapes business, economies, societies, and nations. London: Little Brown.

  47. Thelwall, M. (2006). Interpreting social science link analysis research: A theoretical framework. Journal of the American Society for Information Science and Technology, 57(1), 60–68.

  48. Vaughan, L., & Thelwall, M. (2003). Scholarly use of the web: What are the key inducers of links to journal web sites? Journal of the American Society for Information Science and Technology, 54(1), 29–38.

  49. Vreeland, R. C. (2000). Law libraries in hyperspace: A citation analysis of World Wide Web sites. Law Library Journal, 92(1), 9–25.

  50. Wills, R. S. (2006). Google’s PageRank: The maths behind the search engine. The Mathematical Intelligencer, 28(4), 6–11.

Acknowledgments

The research for this article was funded by the Swedish Research Council through the framework grant Knowledge in a Digital World: Trust, Credibility and Relevance on the Web (Olsson, PI).

Author information

Corresponding author

Correspondence to Erik J. Olsson.

Appendix: Proofs

All of the following proofs assume that the distribution of importance values is given by some density \( \rho \left( {pi} \right):\left[ {0,1} \right] \mapsto \left[ {0,\infty } \right] \), such that:

$$ P\left( {pi \in \left[ {a,b} \right]} \right)\text{ := }\mathop \int \limits_{a}^{b} \rho \left( {pi} \right)dpi. $$

Proof of Theorem 1

It is assumed in the basic model that linking probability is determined solely by the importance of the target page. Hence, the web ecology is such that the probability to link to the ith page from any of the other \( n - 1 \) pages of the web-graph is a simple function of the importance of the ith page.

$$ \forall j \ne i\left[ {P\left( {j \to i} \right) = {\mathcal{F}}\left( {pi_{i} } \right)} \right], $$

for \( {\mathcal{F}}\left( {pi_{i} } \right): \left[ {0,1} \right] \mapsto \left[ {0,1} \right]. \) The precise form, and parameters, of \( {\mathcal{F}}\left( {pi_{i} } \right) \) do not matter for the proof.

Each of the other \( n - 1 \) pages either links to the ith page or does not, independently and with the same probability, so the link assignments form a sequence of \( n - 1 \) Bernoulli trials. The number of links to the ith page is therefore binomially distributed, with mean (the expected number of in-links to the ith page, \( \overline{N}_{i} \)) given by

$$ \overline{N}_{i} = \left( {n - 1} \right){\mathcal{F}}\left( {pi_{i} } \right). $$

The In-Degree of the ith webpage, \( ID_{i} \), is defined here as the ratio of the number of incoming links to the maximum possible number of incoming links, where each page may link to another only once and no page can link to itself:

$$ ID_{i } : = \frac{{N_{i} }}{n - 1}. $$

Because the link assignments are independent Bernoulli trials with a common success probability, the law of large numbers applies. Thus, with probability 1, \( \frac{{N_{i} }}{n - 1} \to \frac{{\overline{N}_{i} }}{n - 1} \) as \( n \to \infty \). Since \( ID_{i} = \frac{{N_{i} }}{n - 1} \) and \( {\mathcal{F}}\left( {pi_{i} } \right) = \frac{{\overline{N}_{i} }}{n - 1} \), it follows that, with probability 1, \( ID_{i} \to {\mathcal{F}}\left( {pi_{i} } \right) \) as \( n \to \infty \).

Since the above holds for each webpage in the web-graph, the probability is 1 that page In-Degree converges to a function of page importance as graph size goes to infinity. That is to say, the probability is 1 that In-Degree and page importance go toward perfect correlation as the number of pages in the web-graph goes to infinity: the probability is 1 that \( R^{2} \to 1 \), with \( {\mathcal{F}}\left( {pi_{i} } \right) \) as the regression function, as \( n \to \infty \). □
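
The convergence claimed in Theorem 1 can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the authors' experiment: the linear linking function `link_probability`, the uniform sampling of importance values, the seed, and the graph sizes are all hypothetical choices (the proof itself does not depend on the form of \( {\mathcal{F}} \)).

```python
import random


def link_probability(importance):
    """Hypothetical linking function F: [0, 1] -> [0, 1] (an assumption)."""
    return 0.1 + 0.8 * importance


def simulated_in_degree(importance, n, rng):
    """Normalized In-Degree ID_i = N_i / (n - 1) of one target page.

    Each of the other n - 1 pages links to the target independently
    with probability F(pi_i), as in the basic model.
    """
    p = link_probability(importance)
    in_links = sum(1 for _ in range(n - 1) if rng.random() < p)
    return in_links / (n - 1)


def mean_abs_deviation(n, targets, rng):
    """Average |ID_i - F(pi_i)| over a sample of target pages."""
    errors = [abs(simulated_in_degree(pi, n, rng) - link_probability(pi))
              for pi in targets]
    return sum(errors) / len(errors)


rng = random.Random(0)
# Importance values of 20 sampled target pages (uniform is an assumption).
targets = [rng.random() for _ in range(20)]
deviations = {n: mean_abs_deviation(n, targets, rng) for n in (50, 500, 5000)}
for n, dev in deviations.items():
    print(n, round(dev, 4))
```

Because in the basic model the in-links of a page depend only on that page's own importance, each target can be simulated without building the whole graph; the printed deviations should shrink roughly like \( 1/\sqrt{n} \).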

Proof of Theorem 2

As shown in Fortunato et al. (2008), the average PageRank \( PR_{i} \) of a node \( i \) with \( N_{i} \) in-links is

$$ PR_{i} = \frac{q}{n} + \frac{1 - q}{n}\frac{{N_{i} }}{{\overline{N}_{i} }} $$

where \( q \) is the probability for the random surfer to jump to a new random page rather than follow a link. This equation holds given that the probability of a source page to link to a target page is independent of the number of links from, or to, the source page itself, which is easily checked to hold given the assumption of the basic model that linking probability is determined solely by the importance of the target page. Substituting \( (n - 1) ID_{i} \) for \( N_{i} \), we get

$$ PR_{i} = \frac{q}{n} + \frac{1 - q}{n}\frac{{\left( {n - 1} \right) ID_{i} }}{{\overline{{\left( {n - 1} \right) ID_{i} }} }} = \frac{q}{n} + \frac{1 - q}{n}\frac{{\left( {n - 1} \right) ID_{i} }}{{\left( {n - 1} \right)\overline{{ID_{i} }} }} = \frac{q}{n} + \frac{1 - q}{n}\frac{{ ID_{i} }}{{\overline{{ID_{i} }} }}. $$

As \( n \) increases, this is ever better approximated by:

$$ PR_{i} = \frac{1 - q}{n}\frac{{ ID_{i} }}{{\overline{{ID_{i} }} }} = c ID_{i} $$

for a constant \( c \). Thus \( PR_{i} \) is linearly proportional to \( ID_{i} \) in the limit of large \( n \). This explains why their \( R^{2} \) values converge in larger web-graphs. It also explains why PageRank goes toward perfect correlation with page importance as web-graph size increases, in web ecologies where In-Degree does the same and the linking probability is independent of the number of links to, or from, the source page. □
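
The relation used in the proof of Theorem 2 can be checked numerically on a sampled web-graph. The sketch below is illustrative only: it reads the bar in the approximation as the mean in-link count across pages (an assumption about the notation), and the linking function, the jump probability `q = 0.15`, the graph size, and the iteration count are hypothetical choices, not values taken from the paper.

```python
import random


def sample_graph(importances, rng):
    """Basic model: j links to i (j != i) with probability F(pi_i)."""
    n = len(importances)
    out_links = [[] for _ in range(n)]
    for j in range(n):
        for i in range(n):
            # F(pi) = 0.1 + 0.8 * pi is a hypothetical linking function.
            if i != j and rng.random() < 0.1 + 0.8 * importances[i]:
                out_links[j].append(i)
    return out_links


def pagerank(out_links, q, iterations):
    """Random-surfer PageRank by power iteration; q is the jump probability."""
    n = len(out_links)
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(ranks[j] for j in range(n) if not out_links[j])
        new = [q / n + (1 - q) * dangling / n] * n
        for j, targets in enumerate(out_links):
            if targets:
                share = (1 - q) * ranks[j] / len(targets)
                for i in targets:
                    new[i] += share
        ranks = new
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)
n, q = 200, 0.15
importances = [rng.random() for _ in range(n)]
out_links = sample_graph(importances, rng)
in_degree = [0] * n
for targets in out_links:
    for i in targets:
        in_degree[i] += 1
mean_in = sum(in_degree) / n
# In-degree-based approximation: q/n + (1 - q)/n * N_i / mean(N).
approx = [q / n + (1 - q) / n * k / mean_in for k in in_degree]
exact = pagerank(out_links, q, iterations=50)
print(round(pearson(exact, approx), 4))
```

The `iterations` argument plays the role of the run-length parameter: more iterations give a closer fixed point at higher computational cost.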

Proof of Theorem 3

By the assumptions of the extended model, the linking probability depends on the importance of the source page and the importance of the target page. Hence, the probability of page \( j \) linking to page \( i \) is a function \( {\mathcal{F}}\left( {pi_{i} ,pi_{j} } \right): \left[ {0,1} \right] \times \left[ {0,1} \right] \mapsto \left[ {0,1} \right] \) of the importance of both these pages:

$$ \forall j \ne i\left[ {P\left( {j \to i} \right) = {\mathcal{F}}\left( {pi_{i} ,pi_{j} } \right)} \right]. $$

The precise form, and parameters, of \( {\mathcal{F}}\left( {pi_{i} ,pi_{j} } \right) \) do not matter for the proof.

Each of the other \( n - 1 \) pages links to the ith page independently, but now with a probability that varies from source page to source page, so the link assignments form a sequence of Poisson trials. The number of in-links to the ith page then follows a Poisson binomial distribution, whose mean (the expected number of in-links to the ith page, \( \overline{N}_{i} \)) is given by:

$$ \overline{N}_{i} = \mathop \sum \limits_{j = 1}^{n - 1} {\mathcal{F}}\left( {pi_{i} ,pi_{j} } \right). $$

Because the link assignments are independent Poisson trials, the law of large numbers applies. Thus, with probability 1, \( \frac{{N_{i} }}{n - 1} \to \frac{{\overline{N}_{i} }}{n - 1} \) as \( n \to \infty \). Since \( ID_{i} = \frac{{N_{i} }}{n - 1} \) and \( \overline{N}_{i} = \mathop \sum \limits_{j = 1}^{n - 1} {\mathcal{F}}\left( {pi_{i} ,pi_{j} } \right) \), it follows that, with probability 1, \( ID_{i} \to \frac{{\mathop \sum \nolimits_{j = 1}^{n - 1} {\mathcal{F}}\left( {pi_{i} ,pi_{j} } \right)}}{n - 1} \) as \( n \to \infty \).

In words, with probability 1 the In-Degree of the ith page converges to the average value of \( {\mathcal{F}}\left( {pi_{i} ,pi_{j} } \right) \), for a given \( pi_{i} \), as the number of pages goes to infinity. In that same limit, again by the law of large numbers, the average value of \( {\mathcal{F}}\left( {pi_{i} ,pi_{j} } \right) \), for a given \( pi_{i} \), converges with probability 1 to the expected value of that function relative to how the \( pi_{j} \) are distributed. For a given value of \( pi_{i} \), this expected value is given by:

$$ \overline{{{\mathcal{F}}\left( {pi_{i} ,pi_{j} } \right)}} = \mathop \int \limits_{0}^{1} {\mathcal{F}}\left( {pi_{i} ,pi_{j} } \right)\rho \left( {pi_{j} } \right)dpi_{j} = k\left( {pi_{i} } \right), $$

where \( \rho \left( {pi_{j} } \right) \) is the distribution of page importance assumed in the model and \( k\left( {pi_{i} } \right) \) is whatever function results from carrying out the integral.

Since, with probability 1, both

$$ ID_{i} \to \frac{{\mathop \sum \nolimits_{j = 1}^{n - 1} {\mathcal{F}}\left( {pi_{i} ,pi_{j} } \right)}}{n - 1} $$

and

$$ \frac{{\mathop \sum \nolimits_{j = 1}^{n - 1} {\mathcal{F}}\left( {pi_{i} ,pi_{j} } \right)}}{n - 1} \to k\left( {pi_{i} } \right) $$

as \( n \to \infty \), it follows that

$$ ID_{i} \to k\left( {pi_{i} } \right) $$

as \( n \to \infty \) with probability 1. Since the above holds for each webpage in the web-graph, the probability is 1 that page In-Degree converges to a function of page importance as graph size goes to infinity. That is to say, the probability is 1 that In-Degree and page importance go toward perfect correlation as the number of pages in the web-graph goes to infinity: the probability is 1 that \( R^{2} \to 1 \), with \( k\left( {pi_{i} } \right) \) as the regression function, as \( n \to \infty \). □
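
As a worked illustration of the integral that defines \( k\left( {pi_{i} } \right) \) in the proof of Theorem 3, suppose, purely for the sake of example, that importance is uniformly distributed, \( \rho \left( {pi_{j} } \right) = 1 \) on \( \left[ {0,1} \right] \), and that the linking function is \( {\mathcal{F}}\left( {pi_{i} ,pi_{j} } \right) = pi_{i} \left( {a + b\,pi_{j} } \right) \) with constants \( a,b \ge 0 \) and \( a + b \le 1 \). Neither choice is taken from the model; both are assumptions that make the integral easy to evaluate. Then

$$ k\left( {pi_{i} } \right) = \mathop \int \limits_{0}^{1} pi_{i} \left( {a + b\,pi_{j} } \right)dpi_{j} = pi_{i} \left( {a + \frac{b}{2}} \right). $$

With, say, \( a = 0.2 \) and \( b = 0.6 \), this gives \( k\left( {pi_{i} } \right) = 0.5\,pi_{i} \), so under these assumptions In-Degree converges to a linear function of the importance of the target page, even though individual linking probabilities also depend on the source page.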

Cite this article

Masterton, G., Olsson, E.J. & Angere, S. Linking as voting: how the Condorcet jury theorem in political science is relevant to webometrics. Scientometrics 106, 945–966 (2016). https://doi.org/10.1007/s11192-016-1837-1

Keywords

  • Webometrics
  • Condorcet jury theorem
  • Linking
  • Independence
  • Ranking
  • PageRank