Characterizing Weblog Corpora

  • Fernando Perez-Tellez
  • David Pinto
  • John Cardiff
  • Paolo Rosso
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5723)

Abstract

To exploit the huge volume of information published in the blogosphere, it is essential to provide techniques, such as clustering, that can automatically analyze and classify its contents. However, these techniques typically produce better results on wide-domain, full-text documents. In most cases, blogs can instead be considered “short texts”: they are not extensive documents, and they exhibit characteristics that are undesirable from a clustering perspective, such as low-frequency terms, small vocabulary size, and vocabulary overlap across some domains. Furthermore, these characteristics vary widely depending on the writer's specific interests, linguistic style, and the volume of text produced.
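The corpus characteristics named above can be made concrete with a few simple counts. The following is a minimal sketch, not the measures used in the paper: the function names (`tokenize`, `corpus_profile`, `vocabulary_overlap`), the whitespace tokenizer, the use of hapax legomena as a proxy for low-frequency terms, and the Jaccard coefficient as a proxy for vocabulary overlap are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation (assumed tokenizer)."""
    tokens = (tok.strip(".,;:!?\"'()[]") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def corpus_profile(docs):
    """Return simple size and frequency statistics for a list of texts."""
    tokenized = [tokenize(d) for d in docs]
    counts = Counter(tok for doc in tokenized for tok in doc)
    vocab_size = len(counts)
    # Terms that occur exactly once in the corpus (a proxy for low-frequency terms).
    hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "avg_doc_length": sum(len(d) for d in tokenized) / len(tokenized) if tokenized else 0.0,
        "vocabulary_size": vocab_size,
        "hapax_ratio": hapax / vocab_size if vocab_size else 0.0,
    }


def vocabulary_overlap(docs_a, docs_b):
    """Jaccard overlap between the vocabularies of two document groups (assumed measure)."""
    vocab_a = {tok for d in docs_a for tok in tokenize(d)}
    vocab_b = {tok for d in docs_b for tok in tokenize(d)}
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0
```

Under these assumptions, a short-text corpus would tend to show a low `avg_doc_length`, a small `vocabulary_size`, and a high `hapax_ratio`, while a high `vocabulary_overlap` between two categories suggests that they will be hard to separate by clustering.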

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Fernando Perez-Tellez (1)
  • David Pinto (2)
  • John Cardiff (1)
  • Paolo Rosso (3)

  1. Social Media Research Group, Institute of Technology Tallaght, Dublin, Ireland
  2. Benemérita Universidad Autónoma de Puebla, Mexico
  3. Natural Language Engineering Lab. – EliRF, Dept. Sistemas Informáticos y Computación, Universidad Politécnica Valencia, Spain
