Adaptive Web Wrapper Based on Hybrid Hierarchical Conditional Random Fields for Web Data Integration

Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 217)

Abstract

During Web data integration, new related Web sites containing valuable data are identified constantly. Wrapper induction based on labeled examples is a widely accepted method for extracting their data. However, manually labeling the sampled Web pages of a new Web site is time-consuming. To efficiently extract structured data from new related Web sites, this paper proposes a novel model, Hybrid Hierarchical Conditional Random Fields (HH-CRFs). HH-CRFs are trained on the data accumulated in the Web data integration system, and they jointly perform Web object type identification, Web object detection, and attribute labeling. The labeled sampled Web pages are then used to induce the wrapper for the target Web site. Experimental results on a large amount of real-world data collected from diverse domains show that the proposed approach helps induce the target wrapper efficiently.

Keywords

Web data integration; Wrapper; Hybrid Hierarchical Conditional Random Fields (HH-CRFs)

Notes

Acknowledgments

This work is supported by the Open Project Program of the Shandong Provincial Key Lab of Software Engineering (Grant No. 2011SE002) and the Project of Shandong Province Higher Educational Science and Technology Program (Grant No. J11LG32).

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  1. School of Information Science and Engineering, Shandong Normal University, Shandong Provincial Key Laboratory for Distributed Computer Software Novel Technology, Jinan, China
