Automatic Web Data Extraction Based on Genetic Algorithms and Regular Expressions
Data extraction from the World Wide Web is a well-known, unsolved, and critical problem in the design of complex information systems. The difficulties relate to the extraction, management, and reuse of the huge amount of Web data available. These data are usually highly heterogeneous, volatile, and of low quality (i.e., they contain format and content mistakes), so it is quite hard to build reliable systems. This chapter proposes an Evolutionary Computation approach, based on Genetic Algorithms and regular expressions, to the problem of automatically learning software entities. These entities, also called wrappers, learn from examples how to extract certain kinds of Web data structures.
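As a rough illustration of the idea, the following minimal sketch evolves a regular expression from positive and negative examples with a simple Genetic Algorithm. It is not the chapter's actual method: the token alphabet, the phone-number-like examples, the fitness function, and all parameters (`ALPHABET`, `POSITIVES`, `NEGATIVES`, `MAX_LEN`, `evolve`, and so on) are assumptions chosen only to make the example small and runnable.

```python
import random
import re

# Hypothetical building blocks for candidate regexes (an assumption of this sketch).
ALPHABET = [r"\d", r"\d+", r"\d{3}", r"\d{4}", "-", r"\s", r"\w+", "."]
MAX_LEN = 8  # cap on chromosome length so individuals cannot grow without bound

# Illustrative training examples: strings the wrapper should and should not accept.
POSITIVES = ["555-1234", "917-5550", "202-0199"]
NEGATIVES = ["hello", "12-34", "5551234x", "phone"]


def to_regex(chromosome):
    """Decode a chromosome (list of tokens) into a regular expression string."""
    return "".join(chromosome)


def fitness(chromosome):
    """Score = fully matched positives minus fully matched negatives."""
    pattern = re.compile(to_regex(chromosome))
    hits = sum(1 for s in POSITIVES if pattern.fullmatch(s))
    false_hits = sum(1 for s in NEGATIVES if pattern.fullmatch(s))
    return hits - false_hits


def random_chromosome(rng, max_len=5):
    """Create a random individual of 1..max_len tokens."""
    return [rng.choice(ALPHABET) for _ in range(rng.randint(1, max_len))]


def crossover(rng, a, b):
    """One-point crossover with independent cut points; never returns an empty child."""
    child = a[: rng.randint(0, len(a))] + b[rng.randint(0, len(b)):]
    return child[:MAX_LEN] or [rng.choice(ALPHABET)]


def mutate(rng, chromosome, rate=0.2):
    """Replace each token with a random one with probability `rate`."""
    return [rng.choice(ALPHABET) if rng.random() < rate else g for g in chromosome]


def tournament(rng, population, k=3):
    """Pick the fittest of k randomly sampled individuals."""
    return max(rng.sample(population, k), key=fitness)


def evolve(seed=0, pop_size=40, generations=30):
    """Run a small generational GA with elitism and return the best individual."""
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        elite = max(population, key=fitness)  # keep the current best unchanged
        children = [
            mutate(rng, crossover(rng, tournament(rng, population),
                                  tournament(rng, population)))
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = evolve()
    print(to_regex(best), fitness(best))
```

In this sketch a hand-written individual such as `[r"\d{3}", "-", r"\d{4}"]` reaches the maximum score because it matches every positive example and no negative one; whether a given run of `evolve` finds such an individual depends on the seed and parameters.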
Keywords: Genetic Algorithm, Multiagent System, Regular Expression, Phone Number, Genetic Operator