Investigating the Statistical Properties of User-Generated Documents

Inches, Giacomo; Carman, Mark James; Crestani, Fabio

doi:10.1007/978-3-642-24764-4_18

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7022)

Abstract

The importance of the Internet as a communication medium is reflected in the large number of documents generated every day by users of the various online services. In this work we analyze the properties of these user-generated documents for several established Internet services (Kongregate, Twitter, Myspace and Slashdot) and compare them with a consolidated collection of standard information retrieval documents (from the Wall Street Journal, Associated Press and Financial Times, as part of the TREC ad-hoc collection). We investigate features such as document similarity, term burstiness, emoticons and part-of-speech analysis, highlighting the applicability and limits of traditional content analysis and indexing techniques from information retrieval when applied to these new online user-generated documents.

Author information

Authors and Affiliations

Faculty of Informatics, University of Lugano, Lugano, Switzerland
Giacomo Inches & Fabio Crestani
Faculty of Information Technology, Monash University, Melbourne, Australia
Mark James Carman

Editor information

Editors and Affiliations

Department of Communication, Business and Information Technologies, Roskilde University, P.O. box 260, 4000, Roskilde, Denmark
Henning Christiansen
Department of Telecommunications and Information Processing, Ghent University, Sint-Pietersnieuwstraat 41, 9000, Ghent, Belgium
Guy De Tré
Computer Engineering Department, Middle East Technical University (METU), 06531, Ankara, Türkiye
Adnan Yazici
Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447, Warsaw, Poland
Slawomir Zadrozny
Department of Computer Science, Roskilde University, Building 42.1, P.O Box 260, 4000, Roskilde, Denmark
Troels Andreasen
Department of Electronic Systems, Aalborg University, Niels Bohrs Vej 8, H321, 6700, Esbjerg, Denmark
Henrik Legind Larsen

About this paper

Cite this paper

Inches, G., Carman, M.J., Crestani, F. (2011). Investigating the Statistical Properties of User-Generated Documents. In: Christiansen, H., De Tré, G., Yazici, A., Zadrozny, S., Andreasen, T., Larsen, H.L. (eds) Flexible Query Answering Systems. FQAS 2011. Lecture Notes in Computer Science, vol 7022. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24764-4_18

DOI: https://doi.org/10.1007/978-3-642-24764-4_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24763-7
Online ISBN: 978-3-642-24764-4
eBook Packages: Computer Science, Computer Science (R0)
