Abstract
With the growing size and variety of social media files on the web, it is becoming critical to organize them efficiently into clusters for further processing. This paper presents a novel, scalable constrained document clustering method that harnesses the power of search engines capable of dealing with large text data. Instead of calculating the distance between each document and all of the cluster centroids, a neighborhood of best cluster candidates is chosen using a document ranking scheme. To make the method faster and less memory-dependent, in-memory and in-database processing are combined in a semi-incremental manner. The method has been extensively tested on the social event detection application. Empirical analysis shows that the proposed method is efficient in both computation and memory usage while producing notable accuracy.
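The candidate-selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a small in-memory inverted index stands in for the search engine, ranking is by the number of shared terms, and the class name `RankedClusterer` and the parameters `top_k` and `threshold` are hypothetical names chosen for illustration. The point it shows is that each new document is compared only with the centroids of clusters that own its top-ranked neighbours, rather than with every centroid.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class RankedClusterer:
    """Illustrative incremental clusterer (hypothetical, not the paper's code)."""

    def __init__(self, top_k=5, threshold=0.3):
        self.top_k = top_k              # assumed: how many ranked neighbours to keep
        self.threshold = threshold      # assumed: minimum similarity to join a cluster
        self.index = defaultdict(set)   # term -> ids of documents containing it
        self.doc_cluster = []           # document id -> cluster id
        self.centroids = []             # cluster id -> summed term counts

    def _candidate_clusters(self, terms):
        # Stand-in for a search-engine query: rank indexed documents by the
        # number of distinct query terms they contain, keep the top_k hits,
        # and return the clusters those hits belong to.
        hits = Counter()
        for term in terms:
            for doc_id in self.index.get(term, ()):
                hits[doc_id] += 1
        return {self.doc_cluster[d] for d, _ in hits.most_common(self.top_k)}

    def add(self, text):
        """Assign one document to a cluster and return the cluster id."""
        vec = Counter(text.lower().split())
        best, best_sim = None, self.threshold
        # Only candidate centroids are compared, not all centroids.
        for cluster_id in self._candidate_clusters(vec):
            sim = cosine(vec, self.centroids[cluster_id])
            if sim >= best_sim:
                best, best_sim = cluster_id, sim
        if best is None:
            # No candidate is similar enough: open a new cluster.
            best = len(self.centroids)
            self.centroids.append(Counter())
        self.centroids[best].update(vec)
        doc_id = len(self.doc_cluster)
        self.doc_cluster.append(best)
        for term in vec:
            self.index[term].add(doc_id)
        return best
```

Calling `add` on each incoming document in turn returns its cluster id. In this sketch the cost per document depends on the number of ranked hits and candidate clusters, not on the total number of clusters, which is the property the abstract attributes to the ranking scheme.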
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Sutanto, T., Nayak, R. (2014). The Ranking Based Constrained Document Clustering Method and Its Application to Social Event Detection. In: Bhowmick, S.S., Dyreson, C.E., Jensen, C.S., Lee, M.L., Muliantara, A., Thalheim, B. (eds) Database Systems for Advanced Applications. DASFAA 2014. Lecture Notes in Computer Science, vol 8422. Springer, Cham. https://doi.org/10.1007/978-3-319-05813-9_4
DOI: https://doi.org/10.1007/978-3-319-05813-9_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05812-2
Online ISBN: 978-3-319-05813-9
eBook Packages: Computer Science (R0)