Rules of Thumb for Information Acquisition from Large and Redundant Data

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6611)


We develop an abstract model of information acquisition from redundant data. We assume a random sampling process from data that contain information with bias, and we are interested in the fraction of information we expect to learn as a function of (i) the sampled fraction (recall) and (ii) varying bias of information (redundancy distributions). We develop two rules of thumb with varying robustness. We first show that, when information bias follows a Zipf distribution, the 80-20 rule or Pareto principle surprisingly does not hold; instead, we expect to learn less than 40% of the information when randomly sampling 20% of the overall data. We then analytically prove that for large data sets, randomized sampling from power-law distributions leads to “truncated distributions” with the same power-law exponent. This second rule is very robust and also holds for distributions that deviate substantially from a strict power law. We further give one particular family of power-law functions that remain completely invariant under sampling. Finally, we validate our model with two large Web data sets: link distributions to web domains and tag distributions on
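The first rule of thumb can be illustrated with a small numerical sketch. This is not the paper's derivation: it assumes a simplified model in which the piece of information with rank k appears floor(n / k) times (one possible discretization of a Zipf redundancy distribution) and in which every copy is kept independently with probability r. The function names `zipf_redundancies` and `expected_unique_recall` are hypothetical and chosen only for this illustration.

```python
def zipf_redundancies(n_items):
    """Assumed Zipf-like redundancies: rank k appears floor(n_items / k) times."""
    return [n_items // k for k in range(1, n_items + 1)]


def expected_unique_recall(redundancies, r):
    """Expected fraction of distinct pieces of information seen at least once.

    Assumption: each copy is sampled independently with probability r, so a
    piece with redundancy rho is missed entirely with probability (1 - r)**rho.
    """
    miss = 1.0 - r
    seen = sum(1.0 - miss ** rho for rho in redundancies)
    return seen / len(redundancies)


redundancies = zipf_redundancies(10_000)
recall_at_20 = expected_unique_recall(redundancies, 0.20)
print(f"sampled 20% of the data, expected to learn {recall_at_20:.1%} of the information")
```

Under these assumptions the learned fraction comes out far below the 80% that a naive reading of the Pareto principle would suggest. The exact figure depends on how the Zipf redundancies are discretized and on the sampling model, so it should not be read as a reproduction of the paper's number.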


Information Acquisition, Under Sampling, Redundant Data, Pareto Principle, Zipf Distribution
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  1. Computer Science and Engineering, University of Washington, Seattle, USA
