Automatic Web Data Extraction Based on Genetic Algorithms and Regular Expressions

Barrero, David F.; Camacho, David; R-Moreno, María D.

doi:10.1007/978-1-4419-0522-2_9

Abstract

Data extraction from the World Wide Web is a well-known, unsolved, and critical problem in the design of complex information systems. The problem concerns the extraction, management, and reuse of the huge amount of Web data available. These data usually have high heterogeneity, high volatility, and low quality (i.e. format and content mistakes), so it is quite hard to build reliable systems. This chapter proposes an Evolutionary Computation approach, based on Genetic Algorithms and regular expressions, to the problem of automatically learning software entities. These entities, also called wrappers, will be able to extract some kinds of Web data structures from examples.
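
The following is a minimal sketch of the general idea, not the chapter's actual algorithm: a genetic algorithm searches over regular expressions assembled from a small token set, scoring each candidate by how many positive examples it matches and how many negative examples it rejects. The token set TOKENS, the fixed chromosome length, the fitness function, the selection scheme, the example strings, and every parameter value are assumptions made purely for illustration.

```python
import random
import re

# Hypothetical building blocks; "" is a no-op gene that allows shorter patterns.
TOKENS = ["", r"\d", r"\d+", r"[a-z]+", "-", r"\.", "@"]


def to_regex(chromosome):
    """Concatenate the genes of a chromosome into a regex string."""
    return "".join(chromosome)


def fitness(chromosome, positives, negatives):
    """Fraction of examples classified correctly by a full match."""
    pattern = re.compile(to_regex(chromosome))
    hits = sum(1 for s in positives if pattern.fullmatch(s))
    rejections = sum(1 for s in negatives if not pattern.fullmatch(s))
    return (hits + rejections) / (len(positives) + len(negatives))


def evolve(positives, negatives, length=5, pop_size=40,
           generations=60, mutation_rate=0.2, seed=0):
    """Evolve a token sequence with truncation selection, one-point
    crossover, per-gene mutation, and single-individual elitism."""
    rng = random.Random(seed)

    def score(chromosome):
        return fitness(chromosome, positives, negatives)

    population = [[rng.choice(TOKENS) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        if score(population[0]) == 1.0:
            break  # every example is classified correctly
        parents = population[: pop_size // 2]
        children = [population[0]]  # keep the best individual unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]
            child = [rng.choice(TOKENS) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
    return max(population, key=score)


# Invented examples: strings the wrapper should and should not extract.
positives = ["2009-01", "1999-12", "2024-07"]
negatives = ["abc", "2009", "01-2009x", "-"]

best = evolve(positives, negatives)
print(to_regex(best), fitness(best, positives, negatives))
```

Scoring against both positive and negative examples keeps the search from settling on an over-general pattern; a practical wrapper learner would need a richer token set and real Web documents as training data.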

Author information

Authors and Affiliations

Computer Science Department, Universidad de Alcalá, Madrid, Spain
David F. Barrero & María D. R-Moreno
Computer Science Department, Universidad Autónoma de Madrid, Madrid, Spain
David Camacho

Corresponding author

Correspondence to David F. Barrero.

Editor information

Editors and Affiliations

Faculty of Engineering and Information Technology, University of Technology, Sydney, Broadway, NSW 2007, Australia
Longbing Cao

About this chapter

Cite this chapter

Barrero, D.F., Camacho, D., R-Moreno, M.D. (2009). Automatic Web Data Extraction Based on Genetic Algorithms and Regular Expressions. In: Cao, L. (eds) Data Mining and Multi-agent Integration. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-0522-2_9

DOI: https://doi.org/10.1007/978-1-4419-0522-2_9
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-0521-5
Online ISBN: 978-1-4419-0522-2
eBook Packages: Computer Science, Computer Science (R0)
