Word Familiarity Distributions to Understand Heaps’ Law of Vocabulary Growth of the Internet Forums

  • Masao Kubo
  • Hiroshi Sato
  • Takashi Matsubara
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6883)


In this study, lexical analysis is applied to the conversation logs of Internet forums. Documents are known to exhibit statistical regularities such as Zipf’s law and Heaps’ law, and this type of analysis has been applied to documents in various media. However, few studies have applied it to documents written by many authors, such as the conversation logs of Internet forums. Usually, the relationship between document size and these regularities is not considered important, because the size of a document is determined by its author, who is normally a single person. The size of an Internet forum’s communication log, by contrast, is an emergent property arising from the people who are interested in the forum. We believe that this makes it important to understand the dynamics of conversations.

The investigation in this study revealed the following trend: the number of posted messages is small if the vocabulary growth parameter β of Heaps’ law is not within a preferred range. Additionally, this study proposes a new explanation, based on the multiple-author environment, for the differences in this parameter β. Traditionally, documents written by more than one person, for example, web sites and computer programs, are analyzed from a single-author point of view. This traditional approach is important but not sufficient, because it cannot account for differences in vocabulary among the individual authors.
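As an illustration of the parameter discussed above, Heaps’ law is commonly written as V(n) = K·n^β, where n is the number of words read so far, V(n) is the number of distinct words among them, and K and β are constants fitted to the data. The following is a minimal sketch of how β could be estimated for one forum log; it is not the authors’ procedure. The function names `vocabulary_growth` and `fit_heaps`, the regular-expression tokenizer, and the plain least-squares fit in log-log space are all assumptions made for this example. A forum in a language without whitespace-delimited words would need a proper morphological tokenizer instead.

```python
import math
import re


def vocabulary_growth(messages):
    """Return (total_words, unique_words) after each non-empty message.

    `messages` is an iterable of message texts in posting order.
    The tokenizer is a placeholder (assumption): lowercase, split on \\w+.
    """
    seen = set()
    total = 0
    points = []
    for text in messages:
        tokens = re.findall(r"\w+", text.lower())
        if not tokens:
            continue  # skip messages with no word tokens
        total += len(tokens)
        seen.update(tokens)
        points.append((total, len(seen)))
    return points


def fit_heaps(points):
    """Fit log V = log K + beta * log n by ordinary least squares.

    Returns (K, beta). Needs at least two points with distinct n.
    """
    if len(points) < 2:
        raise ValueError("need at least two (n, V) points")
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


if __name__ == "__main__":
    # Toy log (hypothetical data, for demonstration only).
    log = [
        "hello everyone",
        "hello, does anyone know heaps law",
        "yes, vocabulary grows with text size",
        "the law says growth slows down",
    ]
    k, beta = fit_heaps(vocabulary_growth(log))
    print(f"K = {k:.3f}, beta = {beta:.3f}")
```

Computing β separately for each forum in this way would allow the fitted values to be compared with the number of posted messages, which is the kind of relationship the abstract describes. A robust fitting method may be preferable on noisy logs.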


Word Frequency, Social Network Service, Multiple Author, Vocabulary Size, Unique Word
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Masao Kubo (1)
  • Hiroshi Sato (1)
  • Takashi Matsubara (1)
  1. National Defense Academy of Japan, Yokosuka, Japan
