
Automatic Web Data Extraction Based on Genetic Algorithms and Regular Expressions

  • Chapter
Data Mining and Multi-agent Integration

Abstract

Data extraction from the World Wide Web is a well-known, unsolved, and critical problem in the design of complex information systems. The problem involves the extraction, management, and reuse of the huge amount of Web data available. These data are usually highly heterogeneous, volatile, and of low quality (e.g. format and content mistakes), so it is hard to build reliable systems on top of them. This chapter proposes an Evolutionary Computation approach, based on Genetic Algorithms and regular expressions, to the problem of automatically learning software entities. These entities, also called wrappers, are learned from examples and can then extract certain kinds of Web data structures.
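To make the idea of evolving a regular-expression wrapper from examples more concrete, the following is a minimal sketch and not the chapter's actual algorithm, which the abstract does not specify. It assumes a fixed-length genome of regex tokens, a fitness that counts fully matched positive examples minus fully matched negative ones, truncation selection, one-point crossover, and point mutation. The names `TOKENS`, `fitness`, and `evolve`, along with all parameter values, are hypothetical choices made for illustration.

```python
import random
import re

# Hypothetical alphabet of regex building blocks (an assumption, not from the chapter).
TOKENS = [r"\d", r"\w", r"[a-z]", "+", "-", "@", r"\."]


def fitness(genome, positives, negatives):
    """Fully matched positive examples minus fully matched negative examples."""
    try:
        pattern = re.compile("".join(genome))
    except re.error:
        # Invalid regexes rank below every valid one.
        return -len(negatives) - 1
    hits = sum(1 for s in positives if pattern.fullmatch(s))
    false_hits = sum(1 for s in negatives if pattern.fullmatch(s))
    return hits - false_hits


def evolve(positives, negatives, length=6, pop_size=40, generations=60, seed=0):
    """Evolve a token sequence; returns (regex string, fitness of that regex)."""
    rng = random.Random(seed)  # seeded so runs are repeatable

    def score(genome):
        return fitness(genome, positives, negatives)

    population = [
        [rng.choice(TOKENS) for _ in range(length)] for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Truncation selection: the better half survives unchanged (elitism).
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)  # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.3:  # point mutation
                child[rng.randrange(length)] = rng.choice(TOKENS)
            children.append(child)
        population = parents + children
    best = max(population, key=score)
    return "".join(best), score(best)


if __name__ == "__main__":
    regex, best_score = evolve(["12-34", "5-6"], ["abc", "12"])
    print(regex, best_score)
```

Because the best half of each generation survives unchanged, the best fitness never decreases between generations. The sketch gives no guarantee that a perfect wrapper is found within the chosen budget.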



Author information

Corresponding author

Correspondence to David F. Barrero.

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Barrero, D.F., Camacho, D., R-Moreno, M.D. (2009). Automatic Web Data Extraction Based on Genetic Algorithms and Regular Expressions. In: Cao, L. (eds) Data Mining and Multi-agent Integration. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-0522-2_9

  • DOI: https://doi.org/10.1007/978-1-4419-0522-2_9

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4419-0521-5

  • Online ISBN: 978-1-4419-0522-2

  • eBook Packages: Computer Science (R0)
