Comparing Clustering Algorithms for the Identification of Similar Pages in Web Applications

De Lucia, Andrea; Risi, Michele; Scanniello, Giuseppe; Tortora, Genoveffa

doi:10.1007/978-3-540-73597-7_34

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4607)

Included in the following conference series:

International Conference on Web Engineering

Abstract

In this paper, we analyze some widely employed clustering algorithms to identify duplicated or cloned pages in web applications. Specifically, we consider an agglomerative hierarchical clustering algorithm, a divisive clustering algorithm, the k-means partitional clustering algorithm, and a partitional competitive clustering algorithm, namely Winner Takes All (WTA). All the clustering algorithms take as input a matrix of the distances between the structures of the web pages. The distance between two pages is computed by applying the Levenshtein edit distance to the strings that encode the sequences of HTML tags of the web pages.
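
The distance computation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how tags are encoded into strings, so the sketch assumes each opening or closing tag counts as one symbol and ignores text and attributes. The names TagCollector, tag_sequence, levenshtein, and distance_matrix are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening and closing tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Return the page's tag sequence (assumed encoding: one symbol per tag)."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def levenshtein(a, b):
    """Edit distance between two sequences, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_matrix(pages):
    """Symmetric matrix of pairwise edit distances between page structures."""
    sequences = [tag_sequence(page) for page in pages]
    n = len(sequences)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = levenshtein(sequences[i], sequences[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


pages = [
    "<html><body><p>Hello</p></body></html>",
    "<html><body><p>Bye</p></body></html>",
    "<html><body><p>Hi</p><p>x</p></body></html>",
]
print(distance_matrix(pages))
```

The first two pages differ only in their text, so their structural distance is zero, which is the sense in which they would count as clones here. A matrix of this kind is what each of the compared clustering algorithms would take as input; whether the distances are normalized by page length is not stated in the abstract and is left out of the sketch.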

Author information

Authors and Affiliations

Dipartimento di Matematica e Informatica, Università di Salerno, Via Ponte Don Melillo, 84084, Fisciano (SA), Italy
Andrea De Lucia, Michele Risi & Genoveffa Tortora
Dipartimento di Matematica e Informatica, Università della Basilicata, Viale Dell’Ateneo, Macchia Romana, 85100, Potenza, Italy
Giuseppe Scanniello

Editor information

Luciano Baresi, Piero Fraternali, Geert-Jan Houben

About this paper

Cite this paper

De Lucia, A., Risi, M., Scanniello, G., Tortora, G. (2007). Comparing Clustering Algorithms for the Identification of Similar Pages in Web Applications. In: Baresi, L., Fraternali, P., Houben, GJ. (eds) Web Engineering. ICWE 2007. Lecture Notes in Computer Science, vol 4607. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73597-7_34

DOI: https://doi.org/10.1007/978-3-540-73597-7_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73596-0
Online ISBN: 978-3-540-73597-7
eBook Packages: Computer Science, Computer Science (R0)
