Abstract
This paper presents a new approach designed to reduce the computational load of existing clustering algorithms by reducing document size with fingerprinting methods. A thorough evaluation was performed over three different collections using four different metrics. The presented approach to document clustering achieved good effectiveness with considerable savings in memory space and computation time.
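To make the idea of shrinking a document to a fingerprint concrete, the following is a minimal sketch, not the authors' implementation. It assumes character n-grams, an MD5-derived 32-bit hash, and a winnowing-style selection (keep the minimum hash in each sliding window); the function names, the parameters `k` and `window`, and the use of Jaccard similarity to compare fingerprints are all illustrative assumptions that the abstract does not specify.

```python
import hashlib


def kgram_hashes(text, k=5):
    """Hash every character k-gram of the normalised text (assumed scheme)."""
    # Normalisation is an assumption: keep only alphanumerics, lowercased.
    norm = "".join(ch.lower() for ch in text if ch.isalnum())
    return [
        int.from_bytes(hashlib.md5(norm[i:i + k].encode("utf-8")).digest()[:4], "big")
        for i in range(len(norm) - k + 1)
    ]


def winnow(hashes, window=4):
    """Keep the minimum hash of each sliding window (rightmost on ties)."""
    if not hashes:
        return set()
    if len(hashes) < window:
        # Too short for a full window: fall back to the single smallest hash.
        return {min(hashes)}
    selected = {}
    for start in range(len(hashes) - window + 1):
        win = hashes[start:start + window]
        smallest = min(win)
        # Position of the rightmost occurrence of the minimum in this window.
        pos = start + (window - 1 - win[::-1].index(smallest))
        selected[pos] = smallest
    return set(selected.values())


def fingerprint(text, k=5, window=4):
    """Reduce a document to a small set of selected k-gram hashes."""
    return winnow(kgram_hashes(text, k), window)


def jaccard(a, b):
    """Set overlap between two fingerprints (hypothetical similarity choice)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc_a = "Clustering groups similar documents together."
    doc_b = "Clustering groups related documents together."
    fp_a, fp_b = fingerprint(doc_a), fingerprint(doc_b)
    print(len(kgram_hashes(doc_a)), "k-grams reduced to", len(fp_a), "hashes")
    print("similarity:", jaccard(fp_a, fp_b))
```

A clustering algorithm could then operate on these compact hash sets instead of full term vectors, which is where memory and time savings of the kind the abstract reports would come from under these assumptions.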
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Parapar, J., Barreiro, Á. (2009). Evaluation of Text Clustering Algorithms with N-Gram-Based Document Fingerprints. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds) Advances in Information Retrieval. ECIR 2009. Lecture Notes in Computer Science, vol 5478. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00958-7_61
DOI: https://doi.org/10.1007/978-3-642-00958-7_61
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00957-0
Online ISBN: 978-3-642-00958-7
eBook Packages: Computer Science, Computer Science (R0)