Abstract
In this paper, we analyze several widely used clustering algorithms for identifying duplicated or cloned pages in web applications. In particular, we consider an agglomerative hierarchical clustering algorithm, a divisive clustering algorithm, the k-means partitional clustering algorithm, and a partitional competitive clustering algorithm, namely Winner Takes All (WTA). All the clustering algorithms take as input a matrix of the distances between the structures of the web pages. The distance between two pages is computed by applying the Levenshtein edit distance to the strings that encode the sequences of HTML tags of the web pages.
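The distance computation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (tag_sequence, levenshtein, distance_matrix) and the choice to encode each page as a list of opening and closing tag names are assumptions made for illustration, and the paper's actual string encoding of tags may differ.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects the names of opening and closing tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # Closing tags are kept distinct from opening tags (assumed encoding).
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Return the sequence of HTML tags of a page, ignoring text and attributes."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute; cost 1 each)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute x with y, or match
            ))
        previous = current
    return previous[-1]


def distance_matrix(pages):
    """Symmetric matrix of pairwise tag-sequence distances between pages."""
    sequences = [tag_sequence(page) for page in pages]
    n = len(sequences)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = levenshtein(sequences[i], sequences[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

The resulting matrix is the kind of input the abstract says each clustering algorithm receives. Pages that share the same tag structure but differ only in text content get distance 0, which is what makes this measure suitable for spotting structural clones.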
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
De Lucia, A., Risi, M., Scanniello, G., Tortora, G. (2007). Comparing Clustering Algorithms for the Identification of Similar Pages in Web Applications. In: Baresi, L., Fraternali, P., Houben, GJ. (eds) Web Engineering. ICWE 2007. Lecture Notes in Computer Science, vol 4607. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73597-7_34
DOI: https://doi.org/10.1007/978-3-540-73597-7_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73596-0
Online ISBN: 978-3-540-73597-7
eBook Packages: Computer Science, Computer Science (R0)