Fitness Function Obtained from a Genetic Programming Approach for Web Document Clustering Using Evolutionary Algorithms

Cobos, Carlos; Muñoz, Leydy; Mendoza, Martha; León, Elizabeth; Herrera-Viedma, Enrique

doi:10.1007/978-3-642-34654-5_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7637)

Included in the following conference series:

Ibero-American Conference on Artificial Intelligence

Abstract

Web document clustering (WDC) is an alternative means of searching the web and has become a rewarding research area. Algorithms for WDC still present some problems, in particular inconsistencies in the content and description of clusters. Using evolutionary algorithms is one approach to improving results. Such algorithms use a standard index as a fitness function to evaluate the quality of different clustering solutions. Indexes such as the Bayesian Information Criterion (BIC), Davies-Bouldin, and others show good performance, but leave much room for improvement. In this paper, a modified BIC fitness function for WDC based on evolutionary algorithms is presented. This function was discovered using a genetic program (from a reverse engineering view). Experiments on datasets based on DMOZ show promising results.
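To make the idea of an index used as a fitness function concrete, the following is a minimal sketch of a BIC-style score for a candidate clustering. It assumes one commonly used form, n · ln(SSE / n) + k · ln(n), where lower values are better. The abstract does not give the baseline BIC formula or the modified function the paper presents, so the formula, the function names (`centroids_of`, `sse`, `bic_fitness`), and the toy data are all illustrative assumptions.

```python
import math

def centroids_of(points, labels, k):
    """Mean vector of each cluster (hypothetical helper, not from the paper)."""
    dims = len(points[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for d in range(dims):
            sums[c][d] += p[d]
    # Empty clusters get a zero centroid; they contribute nothing to the SSE.
    return [[s / counts[c] if counts[c] else 0.0 for s in sums[c]]
            for c in range(k)]

def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its centroid."""
    total = 0.0
    for p, c in zip(points, labels):
        total += sum((a - b) ** 2 for a, b in zip(p, centroids[c]))
    return total

def bic_fitness(points, labels, k):
    """Assumed BIC form: n * ln(SSE / n) + k * ln(n). Lower is better."""
    n = len(points)
    err = sse(points, labels, centroids_of(points, labels, k))
    err = max(err, 1e-12)  # guard against log(0) for a perfect fit
    return n * math.log(err / n) + k * math.log(n)

# Toy data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = [0, 0, 1, 1]   # groups nearby points together
bad = [0, 1, 0, 1]    # mixes the two groups

print(bic_fitness(points, good, 2) < bic_fitness(points, bad, 2))
```

In an evolutionary algorithm, each individual would encode a candidate clustering (here, the `labels` list), and a score of this kind would rank individuals during selection. The first term rewards compact clusters and the second penalizes a larger number of clusters.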

Author information

Authors and Affiliations

Computer Science Department, Universidad del Cauca, Colombia
Carlos Cobos, Leydy Muñoz & Martha Mendoza
Systems and Industrial Engineering Department, Engineering Faculty, Universidad Nacional de Colombia, Colombia
Elizabeth León
Department of Computer Science and Artificial Intelligence, University of Granada, Spain
Enrique Herrera-Viedma

Editor information

Editors and Affiliations

Facultad de Informática, Universidad Complutense de Madrid, c/ Profesor José García Santesmases, 28040, Madrid, Spain
Juan Pavón & Rubén Fuentes-Fernández
Universidad Nacional de Colombia, Carrera 30 No 45-03, Edificio 477, Bogotá, DC, Colombia
Néstor D. Duque-Méndez

About this paper

Cite this paper

Cobos, C., Muñoz, L., Mendoza, M., León, E., Herrera-Viedma, E. (2012). Fitness Function Obtained from a Genetic Programming Approach for Web Document Clustering Using Evolutionary Algorithms. In: Pavón, J., Duque-Méndez, N.D., Fuentes-Fernández, R. (eds) Advances in Artificial Intelligence – IBERAMIA 2012. Lecture Notes in Computer Science, vol 7637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34654-5_19

DOI: https://doi.org/10.1007/978-3-642-34654-5_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34653-8
Online ISBN: 978-3-642-34654-5
eBook Packages: Computer Science, Computer Science (R0)
