Automatic Data Extraction from Data-Rich Web Pages

Hu, Dongdong; Meng, Xiaofeng

doi:10.1007/11408079_75


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3453)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications


Abstract

Extracting data from web pages using wrappers is a fundamental problem that arises in a wide variety of applications of great practical interest. In this paper, we propose a novel technique for differentiating the roles of data items in web pages, which is one of the key problems in our automatic extraction approach. The problem is addressed at three levels: semantic blocks, sections, and data items, and several approaches are proposed to effectively identify the mapping between data items that have the same role. Extensive experiments on real web sites show that the proposed technique helps extract the desired data with high accuracy in most cases.
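To make the idea of "data items having the same role" concrete, the following is a minimal sketch and not the method of the paper, whose details are not given in the abstract. It assumes, purely for illustration, that two pages generated from the same template place items with the same role at the same tag path, and that text identical across both pages is template text rather than data. The names PathTextParser, items_by_path, align_roles and classify are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathTextParser(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.items = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def items_by_path(html):
    """Group the text items of one page by their tag path."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    grouped = defaultdict(list)
    for path, text in parser.items:
        grouped[path].append(text)
    return grouped


def align_roles(page_a, page_b):
    """Pair up items that share a tag path in both pages (assumed same role)."""
    a, b = items_by_path(page_a), items_by_path(page_b)
    return {path: list(zip(a[path], b[path])) for path in a if path in b}


def classify(aligned):
    """Label a path 'template' if its text never varies, else 'data'."""
    return {
        path: "template" if all(x == y for x, y in pairs) else "data"
        for path, pairs in aligned.items()
    }


# Two hypothetical pages produced by the same template.
page_a = "<html><body><h1>Book A</h1><div><span>Price:</span><b>10</b></div></body></html>"
page_b = "<html><body><h1>Book B</h1><div><span>Price:</span><b>12</b></div></body></html>"

aligned = align_roles(page_a, page_b)
roles = classify(aligned)
```

In this toy input the title and price paths come out as data and the "Price:" label as template text. Real pages break the same-path assumption often (optional fields, repeated records, layout changes), which is presumably why the paper works at the level of semantic blocks and sections before comparing individual items.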



Author information

Authors and Affiliations

School of Information, Renmin University of China
Dongdong Hu & Xiaofeng Meng


Editor information

Editors and Affiliations

Research Institute of Information Technology, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Lizhu Zhou
National University of Singapore, Singapore
Beng Chin Ooi
School of Information, Renmin University of China
Xiaofeng Meng


About this paper

Cite this paper

Hu, D., Meng, X. (2005). Automatic Data Extraction from Data-Rich Web Pages. In: Zhou, L., Ooi, B.C., Meng, X. (eds) Database Systems for Advanced Applications. DASFAA 2005. Lecture Notes in Computer Science, vol 3453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408079_75


DOI: https://doi.org/10.1007/11408079_75
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25334-1
Online ISBN: 978-3-540-32005-0
eBook Packages: Computer Science, Computer Science (R0)
