Abstract
Many HTML pages are generated by software that queries an underlying database and fills in a template with the results. In the process, the metainformation describing the structure of the data is lost, so automated programs cannot process the data as powerfully as they could process information taken directly from a database. We propose a set of novel techniques for detecting structured records in a web page and extracting the data values that constitute them. Our method requires only one input page. It starts by identifying the data region of interest in the page. It then partitions this region into records using a clustering method that groups similar subtrees in the DOM tree of the page. Finally, it extracts the attributes of the data records using a method based on multiple string alignment. We have tested our techniques with a large number of real web sources, obtaining high precision and recall values.
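To make the record-partitioning step more concrete, the following is a minimal sketch of grouping similar DOM subtrees. It is not the algorithm of the paper, whose details the abstract does not give. It assumes each candidate subtree has already been flattened into a list of tag names, and it uses a normalized edit distance with a greedy, threshold-based grouping as a stand-in for the clustering method. The names `edit_distance`, `similarity`, `cluster_subtrees` and the `threshold` value of 0.7 are hypothetical choices made for illustration only.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity in [0, 1]; 1.0 means identical tag sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cluster_subtrees(subtrees: List[List[str]],
                     threshold: float = 0.7) -> List[List[int]]:
    """Greedy grouping (assumed, not the paper's method): each subtree joins
    the first cluster whose first member is similar enough, else starts a new one."""
    clusters: List[List[int]] = []
    for idx, tags in enumerate(subtrees):
        for cluster in clusters:
            representative = subtrees[cluster[0]]
            if similarity(tags, representative) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


# Hypothetical flattened subtrees from a result listing.
subtrees = [
    ["tr", "td", "a", "td", "span"],         # record
    ["tr", "td", "a", "td", "span", "img"],  # record with an optional image
    ["hr"],                                  # separator, not a record
    ["tr", "td", "a", "td", "span"],         # record
]
clusters = cluster_subtrees(subtrees)
print(clusters)  # [[0, 1, 3], [2]]
```

In this toy input the three row-like subtrees fall into one group even though one of them carries an extra optional element, while the separator is left on its own. A real implementation would compare subtrees extracted from an actual DOM and would likely need a more careful clustering strategy than this single pass.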
Acknowledgements
This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730.
Alberto Pan’s work was partially supported by the “Ramón y Cajal” programme of the Spanish Ministry of Education and Science.
Cite this article
Álvarez, M., Pan, A., Raposo, J. et al. Finding and Extracting Data Records from Web Pages. J Sign Process Syst Sign Image Video Technol 59, 123–137 (2010). https://doi.org/10.1007/s11265-008-0270-y
DOI: https://doi.org/10.1007/s11265-008-0270-y