Automatic Web Data Extraction Based on Genetic Algorithms and Regular Expressions

  • David F. Barrero
  • David Camacho
  • María D. R-Moreno


Data extraction from the World Wide Web is a well-known, unsolved, and critical problem in the design of complex information systems. The problem involves the extraction, management and reuse of the huge amount of Web data available. These data usually exhibit high heterogeneity, high volatility and low quality (i.e. format and content mistakes), so it is quite hard to build reliable systems. This chapter proposes an Evolutionary Computation approach, based on Genetic Algorithms and regular expressions, to the problem of automatically learning software entities. These entities, also called wrappers, will be able to extract certain kinds of Web data structures from examples.
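The idea of evolving a regular-expression wrapper from examples can be illustrated with a minimal sketch. It is not the authors' algorithm: the token alphabet, the phone-number-like examples, the fixed chromosome length, the fitness measure and all parameter values are assumptions chosen only for illustration.

```python
import random
import re

# Hypothetical building blocks: a chromosome is a fixed-length list of these
# tokens, and concatenating them always yields a valid regular expression.
TOKENS = [r"\d", r"\d+", r"-", r"\s", r"\(", r"\)", r"\w+"]

# Hypothetical labeled examples: strings the wrapper should and should not extract.
POSITIVE = ["555-1234", "914-2040", "123-4567"]
NEGATIVE = ["hello", "12-34", "5551234x", "call me"]


def to_regex(chromosome):
    """Decode a chromosome into a regular expression string."""
    return "".join(chromosome)


def fitness(chromosome):
    """Fraction of labeled examples classified correctly by a full match."""
    pattern = re.compile(to_regex(chromosome))
    hits = sum(1 for s in POSITIVE if pattern.fullmatch(s))
    hits += sum(1 for s in NEGATIVE if not pattern.fullmatch(s))
    return hits / (len(POSITIVE) + len(NEGATIVE))


def mutate(chromosome, rng, rate=0.2):
    """Replace each gene with a random token with probability `rate`."""
    return [rng.choice(TOKENS) if rng.random() < rate else gene
            for gene in chromosome]


def crossover(a, b, rng):
    """One-point crossover between two parents of equal length."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:]


def evolve(generations=30, pop_size=20, length=3, seed=0):
    """Evolve a population; the better half survives unchanged (elitism)."""
    rng = random.Random(seed)
    population = [[rng.choice(TOKENS) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        elite = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    return max(population, key=fitness), history


if __name__ == "__main__":
    best, history = evolve()
    print("best regex:", to_regex(best), "fitness:", fitness(best))
```

Because the better half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next. A more realistic wrapper learner would need a richer token alphabet, variable-length chromosomes and a fitness that rewards partial matches.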


Keywords (machine-generated, not supplied by the authors): Genetic Algorithm, Multiagent System, Regular Expression, Phone Number, Genetic Operator





Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • David F. Barrero (1)
  • David Camacho (2)
  • María D. R-Moreno (1)
  1. Computer Science Department, Universidad de Alcalá, Madrid, Spain
  2. Computer Science Department, Universidad Autónoma de Madrid, Madrid, Spain
