Efficient Page-Level Data Extraction via Schema Induction and Verification

Chang, Chia-Hui; Chen, Tian-Sheng; Chen, Ming-Chuan; Ding, Jhung-Li

doi:10.1007/978-3-319-31750-2_38

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9652)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Page-level data extraction provides a complete solution for all kinds of information requirements, yet few studies focus on this task because of its difficulty and complexity. Moreover, previous page-level systems concentrate on achieving unsupervised data extraction and pay less attention to schema/wrapper generation and verification. In this paper, we emphasize the importance of schema verification for large-scale extraction tasks. Given a large number of web pages for data extraction, the system uses part of the input pages to train the schema without supervision, and then extracts data from the remaining input pages through schema verification. To speed up processing, we use the leaf nodes of the DOM trees as the processing units and dynamically adjust their encoding for better alignment. The proposed system outperforms other page-level extraction systems in terms of schema correctness and extraction efficiency. Overall, extraction is 2.7 times faster than state-of-the-art unsupervised approaches that extract data page by page without schema verification.
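
The workflow summarized in the abstract (induce a schema from some pages, then verify and extract from the rest, using DOM leaf nodes as units) can be illustrated with a small sketch. This is not the authors' algorithm: it assumes all training pages share an identical sequence of leaf-node tag paths, and it omits the alignment and dynamic encoding steps entirely. The names `leaf_nodes`, `induce_schema`, and `verify_and_extract` are hypothetical and chosen only for this illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are not pushed on the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LeafCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text leaf."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.leaves = []  # (tag_path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def leaf_nodes(html):
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    return parser.leaves


def induce_schema(training_pages):
    """Toy induction (an assumption of this sketch): every training page
    must have the same leaf-path sequence. A slot whose text is identical
    on all pages is treated as template text; otherwise it is a data slot
    (marked with None)."""
    pages = [leaf_nodes(p) for p in training_pages]
    paths = [path for path, _ in pages[0]]
    for leaves in pages[1:]:
        if [path for path, _ in leaves] != paths:
            raise ValueError("toy induction needs identically structured pages")
    schema = []
    for i, path in enumerate(paths):
        texts = {leaves[i][1] for leaves in pages}
        schema.append((path, texts.pop() if len(texts) == 1 else None))
    return schema


def verify_and_extract(schema, html):
    """Return the data-slot values if the page conforms to the schema,
    or None if verification fails."""
    leaves = leaf_nodes(html)
    if len(leaves) != len(schema):
        return None
    values = []
    for (path, fixed), (leaf_path, text) in zip(schema, leaves):
        if path != leaf_path:
            return None
        if fixed is None:
            values.append(text)
        elif fixed != text:
            return None
    return values


train = [
    "<html><body><h1>Book</h1><p>Title:</p><b>Dune</b></body></html>",
    "<html><body><h1>Book</h1><p>Title:</p><b>Emma</b></body></html>",
]
schema = induce_schema(train)
new_page = "<html><body><h1>Book</h1><p>Title:</p><b>Ulysses</b></body></html>"
extracted = verify_and_extract(schema, new_page)
rejected = verify_and_extract(schema, "<html><body><div>Oops</div></body></html>")
```

In this sketch a page that passes verification is extracted with a single linear scan over its leaf nodes and no re-induction, which is the intuition behind verifying against a learned schema instead of processing every page from scratch. A page that fails verification returns `None` and would need separate handling.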

Author information

Authors and Affiliations

CSIE, National Central University, Zhongli District, Taiwan
Chia-Hui Chang, Tian-Sheng Chen, Ming-Chuan Chen & Jhung-Li Ding

Corresponding author

Correspondence to Chia-Hui Chang.

Editor information

Editors and Affiliations

The University of Melbourne, Melbourne, Victoria, Australia
James Bailey
The University of Texas at Dallas, Richardson, Texas, USA
Latifur Khan
Osaka University, Osaka, Japan
Takashi Washio
University of Auckland, Auckland, New Zealand
Gill Dobbie
Shenzhen University, Shenzhen, China
Joshua Zhexue Huang
Massey University, Auckland, New Zealand
Ruili Wang

About this paper

Cite this paper

Chang, CH., Chen, TS., Chen, MC., Ding, JL. (2016). Efficient Page-Level Data Extraction via Schema Induction and Verification. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_38

DOI: https://doi.org/10.1007/978-3-319-31750-2_38
Published: 12 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31749-6
Online ISBN: 978-3-319-31750-2
eBook Packages: Computer Science, Computer Science (R0)
