Efficient Page-Level Data Extraction via Schema Induction and Verification

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9652)

Abstract

Page-level data extraction provides a complete solution for all kinds of information requirements; however, few studies focus on this task because of the difficulty and complexity of the problem. On the other hand, previous page-level systems focus on achieving unsupervised data extraction and pay less attention to schema/wrapper generation and verification. In this paper, we emphasize the importance of schema verification for large-scale extraction tasks. Given a large number of web pages for data extraction, the system uses part of the input pages to train the schema without supervision, and then extracts data from the remaining input pages through schema verification. To speed up processing, we use the leaf nodes of the DOM trees as the processing units and dynamically adjust the encoding for better alignment. The proposed system outperforms other page-level extraction systems in terms of schema correctness and extraction efficiency. Overall, extraction is 2.7 times faster than state-of-the-art unsupervised approaches that extract data page by page without schema verification.
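
The train-then-verify workflow described in the abstract can be illustrated with a deliberately simplified sketch. It is not the paper's algorithm: it performs no alignment and no dynamic encoding adjustment, and it assumes every page yields the same sequence of leaf nodes. The names `leaf_nodes`, `induce_schema`, and `verify_and_extract` are hypothetical, chosen only for this illustration. A leaf whose text is identical across all training pages is treated as template text; a leaf whose text varies is treated as a data slot.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LeafCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text leaf."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.leaves = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def leaf_nodes(html):
    """Return the page's text leaves in document order."""
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    return parser.leaves


def induce_schema(training_pages):
    """Build a list of (path, fixed_text) pairs; fixed_text is None for data slots.

    Simplifying assumption: all training pages share one leaf-path sequence.
    """
    sequences = [leaf_nodes(page) for page in training_pages]
    first = sequences[0]
    paths = [path for path, _ in first]
    for seq in sequences[1:]:
        if [path for path, _ in seq] != paths:
            raise ValueError("training pages do not share one leaf-path sequence")
    schema = []
    for i, (path, text) in enumerate(first):
        texts = {seq[i][1] for seq in sequences}
        schema.append((path, text if len(texts) == 1 else None))
    return schema


def verify_and_extract(schema, page):
    """Return the data-slot values if the page conforms to the schema, else None."""
    leaves = leaf_nodes(page)
    if len(leaves) != len(schema):
        return None
    record = []
    for (path, fixed), (leaf_path, text) in zip(schema, leaves):
        if path != leaf_path:
            return None
        if fixed is None:
            record.append(text)
        elif fixed != text:
            return None
    return record
```

In this sketch, pages that pass verification are extracted by a single linear scan against the induced schema, with no schema induction repeated per page. A page for which `verify_and_extract` returns `None` would need to be handled separately, for example by retraining.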

Author information

Corresponding author

Correspondence to Chia-Hui Chang.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Chang, CH., Chen, TS., Chen, MC., Ding, JL. (2016). Efficient Page-Level Data Extraction via Schema Induction and Verification. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_38

  • DOI: https://doi.org/10.1007/978-3-319-31750-2_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31749-6

  • Online ISBN: 978-3-319-31750-2

  • eBook Packages: Computer Science, Computer Science (R0)
