Finding and Extracting Data Records from Web Pages

Álvarez, Manuel; Pan, Alberto; Raposo, Juan; Bellas, Fernando; Cacheda, Fidel

doi:10.1007/s11265-008-0270-y

Published: 24 September 2008

Volume 59, pages 123–137, (2010)

Journal of Signal Processing Systems

Abstract

Many HTML pages are generated by programs that query an underlying database and fill in a template with the results. In the process, the metainformation about the structure of the data is lost, so automated programs cannot process the data as powerfully as they could process information held in a database. We propose a set of novel techniques for detecting structured records in a web page and extracting the data values that constitute them. Our method requires only a single input page. It first identifies the data region of interest in the page. It then partitions that region into records using a clustering method that groups similar subtrees in the DOM tree of the page. Finally, it extracts the attributes of the data records using a method based on multiple string alignment. We have tested our techniques on a large number of real web sources, obtaining high precision and recall.
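
The clustering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the clustering procedure, so everything below is an assumption made for illustration: each DOM subtree is flattened into a sequence of tag names, similarity is a normalized edit distance over those sequences, and clustering is a greedy single pass with a hypothetical threshold of 0.7. The function names (`edit_distance`, `similarity`, `cluster_subtrees`) are likewise invented here and are not taken from the article.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity in [0, 1]; 1.0 means identical tag sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cluster_subtrees(subtrees: List[List[str]],
                     threshold: float = 0.7) -> List[List[int]]:
    """Greedy clustering (an assumed strategy, not the article's):
    each subtree joins the first cluster whose first member is
    at least `threshold` similar, otherwise it starts a new cluster."""
    clusters: List[List[int]] = []
    for idx, tags in enumerate(subtrees):
        for cluster in clusters:
            if similarity(subtrees[cluster[0]], tags) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


# Hypothetical sibling subtrees of a data region, flattened to tag names.
subtrees = [
    ["tr", "td", "a", "td", "span"],          # record
    ["tr", "td", "a", "td", "span", "span"],  # record with an optional field
    ["div", "img"],                           # unrelated block, e.g. a banner
    ["tr", "td", "a", "td", "span"],          # record
]
print(cluster_subtrees(subtrees))  # prints [[0, 1, 3], [2]]
```

In this toy input the three row-like subtrees end up in one cluster and the unrelated block in another. The sketch covers only the grouping of similar subtrees; identifying the data region and aligning the attribute values across records are separate steps that it does not attempt.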

Acknowledgements

This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730.

Alberto Pan’s work was partially supported by the “Ramón y Cajal” programme of the Spanish Ministry of Education and Science.

Author information

Authors and Affiliations

Department of Information and Communications Technologies, University of A Coruña, Campus de Elviña s/n. 15071, A Coruña, Spain
Manuel Álvarez, Alberto Pan, Juan Raposo, Fernando Bellas & Fidel Cacheda

Corresponding author

Correspondence to Manuel Álvarez.

About this article

Cite this article

Álvarez, M., Pan, A., Raposo, J. et al. Finding and Extracting Data Records from Web Pages. J Sign Process Syst Sign Image Video Technol 59, 123–137 (2010). https://doi.org/10.1007/s11265-008-0270-y

Received: 29 April 2008
Revised: 04 August 2008
Accepted: 02 September 2008
Published: 24 September 2008
Issue Date: April 2010
DOI: https://doi.org/10.1007/s11265-008-0270-y
