Abstract
In recent years, the volume of information published on weblog sites has grown enormously, alongside new web technologies where people discuss current events. Automatic tools are clearly needed to organize this mass of information, but characteristics particular to weblogs, such as short posts and overlapping vocabulary, make the task difficult. In this work, we present a novel methodology for clustering weblog posts according to the topics they discuss. The approach combines a generative probabilistic model with a Self-Term Expansion methodology. Our results demonstrate a considerable improvement over the baseline.
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Perez-Tellez, F., Pinto, D., Cardiff, J., Rosso, P. (2010). Clustering Weblogs on the Basis of a Topic Detection Method. In: Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A., Kittler, J. (eds) Advances in Pattern Recognition. MCPR 2010. Lecture Notes in Computer Science, vol 6256. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15992-3_36
DOI: https://doi.org/10.1007/978-3-642-15992-3_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15991-6
Online ISBN: 978-3-642-15992-3
eBook Packages: Computer Science (R0)