Abstract
This paper describes the clustering of web documents that are represented by graphs rather than vectors. We present a novel method for clustering graph-based data with the standard k-means algorithm and compare its performance with that of the conventional vector-model approach based on cosine similarity. The proposed method is evaluated with five different graph representations under two clustering performance indices. The experiments are performed on two separate web document collections.
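The abstract contrasts two ways of measuring how far apart two documents are, without giving formulas. As a rough illustration only, the sketch below shows a cosine distance on term-frequency vectors next to one possible graph distance based on the maximum common subgraph. Several things here are assumptions, not taken from the abstract: that every node carries a unique term label (so the maximum common subgraph reduces to a set intersection), that graph size is counted as nodes plus edges, and the names `cosine_distance` and `mcs_distance`. How cluster centres are formed for graphs in k-means is not covered.

```python
import math
from collections import Counter


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity of two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for empty documents
    return 1.0 - dot / (norm_a * norm_b)


def graph_size(nodes: set, edges: set) -> int:
    # Assumption: size is the number of nodes plus the number of edges.
    return len(nodes) + len(edges)


def mcs_distance(g1: tuple, g2: tuple) -> float:
    """Distance 1 - |mcs(g1, g2)| / max(|g1|, |g2|).

    Each graph is a pair (nodes, edges) with edges as (source, target)
    tuples. With unique node labels (an assumption), the maximum common
    subgraph is the intersection of the node sets and of the edge sets.
    """
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    common = graph_size(nodes1 & nodes2, edges1 & edges2)
    largest = max(graph_size(nodes1, edges1), graph_size(nodes2, edges2))
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - common / largest


# Hypothetical toy documents, for illustration only.
doc_a = ({"web", "graph", "model"}, {("web", "graph"), ("graph", "model")})
doc_b = ({"web", "graph", "data"}, {("web", "graph")})
print(mcs_distance(doc_a, doc_b))
```

Unlike the vector distance, the graph distance is sensitive to which terms are connected, not only to which terms occur, which is the kind of structural information a graph representation can preserve.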
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
Cite this paper
Schenker, A., Last, M., Bunke, H., Kandel, A. (2003). Graph Representations for Web Document Clustering. In: Perales, F.J., Campilho, A.J.C., de la Blanca, N.P., Sanfeliu, A. (eds) Pattern Recognition and Image Analysis. IbPRIA 2003. Lecture Notes in Computer Science, vol 2652. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-44871-6_108
DOI: https://doi.org/10.1007/978-3-540-44871-6_108
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40217-6
Online ISBN: 978-3-540-44871-6
eBook Packages: Springer Book Archive