Journal of Signal Processing Systems, Volume 59, Issue 1, pp 123–137

Finding and Extracting Data Records from Web Pages

  • Manuel Álvarez
  • Alberto Pan
  • Juan Raposo
  • Fernando Bellas
  • Fidel Cacheda


Many HTML pages are generated by software programs that query an underlying database and fill in a template with the results. In the process, the meta-information about the structure of the data is lost, so automated programs cannot process these data as powerfully as they could process the same information in a database. We propose a set of novel techniques for detecting structured records in a web page and extracting the data values that constitute them. Our method needs only one input page. It starts by identifying the data region of interest in the page. It then partitions that region into records using a clustering method that groups similar subtrees in the DOM tree of the page. Finally, it extracts the attributes of the data records using a method based on multiple string alignment. We have tested our techniques on a large number of real web sources, obtaining high precision and recall values.
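The record-partitioning step can be illustrated with a minimal sketch. It is not the authors' algorithm: the similarity measure (`difflib.SequenceMatcher` over pre-order tag sequences), the greedy clustering strategy, the `0.7` threshold, the sample page, and all function names are assumptions chosen for illustration only.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def tag_sequence(node):
    """Flatten a subtree into its pre-order sequence of tag names."""
    return [el.tag for el in node.iter()]


def similarity(a, b):
    """Similarity in [0, 1] between two tag sequences.

    Assumption: SequenceMatcher stands in for whatever tree or string
    similarity measure the paper actually uses.
    """
    return SequenceMatcher(None, a, b).ratio()


def cluster_subtrees(nodes, threshold=0.7):
    """Greedy clustering (hypothetical strategy): place each subtree in the
    first cluster whose first member is similar enough, else start a new one."""
    clusters = []
    for node in nodes:
        seq = tag_sequence(node)
        for cluster in clusters:
            if similarity(seq, tag_sequence(cluster[0])) >= threshold:
                cluster.append(node)
                break
        else:
            clusters.append([node])
    return clusters


# Hypothetical, well-formed data region: three result records plus noise.
PAGE = """
<div>
  <p>Results</p>
  <table><tr><td><a>Title A</a></td><td>10</td></tr></table>
  <table><tr><td><a>Title B</a></td><td>12</td></tr></table>
  <table><tr><td><a>Title C</a></td><td><b>sale</b></td><td>9</td></tr></table>
  <hr/>
</div>
"""

region = ET.fromstring(PAGE)
clusters = cluster_subtrees(list(region))

# The largest cluster is taken here as the set of candidate data records.
largest = max(clusters, key=len)
print([node.tag for node in largest])
```

In this toy page the three `table` subtrees end up in one cluster even though the third record carries an optional field, while the heading and the rule each form their own cluster. Aligning the field values across records, which the abstract attributes to multiple string alignment, is not covered by this sketch.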


Keywords: Web; Automatic data extraction; Data mining; Web mining; Hidden web



This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730.

Alberto Pan’s work was partially supported by the “Ramón y Cajal” programme of the Spanish Ministry of Education and Science.



Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Manuel Álvarez (1)
  • Alberto Pan (1)
  • Juan Raposo (1)
  • Fernando Bellas (1)
  • Fidel Cacheda (1)

  1. Department of Information and Communications Technologies, University of A Coruña, A Coruña, Spain
