Investigating the Statistical Properties of User-Generated Documents

  • Giacomo Inches
  • Mark James Carman
  • Fabio Crestani
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7022)


The importance of the Internet as a communication medium is reflected in the large number of documents generated every day by users of online services. In this work we analyze the properties of these user-generated documents for several established Internet services (Kongregate, Twitter, Myspace and Slashdot) and compare them with a consolidated collection of standard information retrieval documents (from the Wall Street Journal, Associated Press and Financial Times, as part of the TREC ad-hoc collection). We investigate features such as document similarity, term burstiness, emoticons and Part-Of-Speech analysis, highlighting the applicability and limits of traditional information retrieval content analysis and indexing techniques when applied to the new online user-generated documents.
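
As a rough illustration of two of the features named in the abstract, the following minimal Python sketch computes a cosine similarity between raw term-frequency vectors and a simple burstiness proxy (average occurrences of a term in the documents that contain it). The whitespace tokenizer, the function names and the toy documents are all assumptions made for this example; the abstract does not state which similarity or burstiness definitions the authors actually use.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stopword removal.
    return text.lower().split()


def cosine_similarity(doc_a, doc_b):
    # Cosine of the angle between raw term-frequency vectors.
    tf_a, tf_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def burstiness(term, docs):
    # Hypothetical proxy: collection frequency divided by document
    # frequency, i.e. mean count of the term where it occurs at all.
    counts = [tokenize(d).count(term) for d in docs]
    doc_freq = sum(1 for c in counts if c > 0)
    return sum(counts) / doc_freq if doc_freq else 0.0


# Toy documents invented for illustration only.
docs = [
    "lol lol lol that game was fun",
    "the market fell sharply today",
    "fun game today",
]
print(cosine_similarity(docs[0], docs[2]))
print(burstiness("lol", docs), burstiness("game", docs))
```

With this proxy, a term that repeats heavily inside the few documents where it appears (such as "lol" in the toy chat-like message) scores higher than a term spread one occurrence per document.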


Similarity Class, Past Participle, Rare Word, Stopword Removal, Document Pair
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Giacomo Inches (1)
  • Mark James Carman (2)
  • Fabio Crestani (1)
  1. Faculty of Informatics, University of Lugano, Lugano, Switzerland
  2. Faculty of Information Technology, Monash University, Melbourne, Australia
