Characterizing Weblog Corpora

Perez-Tellez, Fernando; Pinto, David; Cardiff, John; Rosso, Paolo

doi:10.1007/978-3-642-12550-8_28

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5723)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems

Abstract

In order to exploit the huge volume of information being published in the blogosphere, it is essential to provide techniques, such as clustering, that can automatically analyze and classify its content. However, these techniques typically produce better results when dealing with wide-domain, full-text documents. In most cases, blogs can instead be considered “short texts”: they are not extensive documents, and they exhibit characteristics that are undesirable from a clustering perspective, such as low-frequency terms, small vocabulary size, and vocabulary overlap between some domains. Furthermore, their characteristics vary widely depending on the specific interests of the writers, their linguistic style, and the volume of text that they produce.
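The corpus properties named in the abstract (low-frequency terms, vocabulary size, vocabulary overlap between domains) can be illustrated with a minimal sketch. The measures below are illustrative assumptions, not the measures defined in the paper: the hapax ratio (share of vocabulary terms seen exactly once) stands in for "low-frequency terms", and Jaccard similarity of vocabularies stands in for "vocabulary overlap". The function names, the whitespace tokenizer, and the toy corpus are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation (a crude assumption)."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def corpus_profile(docs):
    """Return simple descriptive statistics for a list of documents."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    total_tokens = sum(counts.values())
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        # number of distinct terms in the corpus
        "vocabulary_size": len(counts),
        # mean number of tokens per document (short texts score low here)
        "avg_doc_length": total_tokens / len(docs) if docs else 0.0,
        # share of vocabulary terms that occur exactly once
        "hapax_ratio": hapaxes / len(counts) if counts else 0.0,
    }


def vocabulary_overlap(docs_a, docs_b):
    """Jaccard similarity between the vocabularies of two document sets."""
    vocab_a = {tok for doc in docs_a for tok in tokenize(doc)}
    vocab_b = {tok for doc in docs_b for tok in tokenize(doc)}
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


# Hypothetical toy data: two "domains" of very short blog-like posts.
toy_corpus = {
    "cooking": ["Fresh pasta with tomato sauce.", "Tomato soup recipe, quick and easy."],
    "travel": ["Quick trip to Valencia.", "Easy hiking routes near Dublin."],
}

if __name__ == "__main__":
    for domain, docs in toy_corpus.items():
        print(domain, corpus_profile(docs))
    print("overlap:", vocabulary_overlap(toy_corpus["cooking"], toy_corpus["travel"]))
```

A high hapax ratio combined with a short average document length gives a clustering algorithm little shared evidence per document, while a high overlap between two domains makes them harder to separate; this is one plausible reading of why such properties are described as undesirable.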

Author information

Authors and Affiliations

Social Media Research Group, Institute of Technology Tallaght, Dublin, Ireland
Fernando Perez-Tellez & John Cardiff
Benemérita Universidad Autónoma de Puebla, Mexico
David Pinto
Natural Language Engineering Lab. – EliRF, Dept. Sistemas Informáticos y Computación, Universidad Politécnica Valencia, Spain
Paolo Rosso

Editor information

Editors and Affiliations

Institut für Computertechnologie, Technische Universität Wien, A-1040, Wien, Austria
Helmut Horacek
CNAM- Laboratoire Cédric, 292 Rue St. Martin, 75141, Paris Cedex 03, France
Elisabeth Métais
Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Campus de San Vicente del Raspeig, Apdo 99, 03080, Alicante, Spain
Rafael Muñoz
Dept. of Computational Linguistics, Saarland University, Germany
Magdalena Wolska

About this paper

Cite this paper

Perez-Tellez, F., Pinto, D., Cardiff, J., Rosso, P. (2010). Characterizing Weblog Corpora. In: Horacek, H., Métais, E., Muñoz, R., Wolska, M. (eds) Natural Language Processing and Information Systems. NLDB 2009. Lecture Notes in Computer Science, vol 5723. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12550-8_28

DOI: https://doi.org/10.1007/978-3-642-12550-8_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12549-2
Online ISBN: 978-3-642-12550-8
eBook Packages: Computer Science, Computer Science (R0)
