Adaptive Web Wrapper Based on Hybrid Hierarchical Conditional Random Fields for Web Data Integration
During Web data integration, new related Web sites containing valuable data are identified constantly. Wrapper induction based on labeled examples is a widely accepted method for extracting such data, but manually labeling the sampled Web pages of each new Web site is time-consuming. To extract structured data from new related Web sites efficiently, this paper proposes a novel model, Hybrid Hierarchical Conditional Random Fields (HH-CRFs). HH-CRFs are trained on the data accumulated in the Web data integration system, and they perform Web object type identification, Web object detection, and attribute labeling jointly. The sampled Web pages labeled in this way are then used to induce the wrapper for the target Web site. Experimental results on a large amount of real-world data collected from diverse domains show that the proposed approach helps to induce the target wrapper efficiently.
Keywords: Web data integration, Wrapper, Hybrid Hierarchical Conditional Random Fields (HH-CRFs)
This work is supported by the Open Project Program of the Shandong Provincial Key Lab of Software Engineering (Grant No. 2011SE002) and the Project of Shandong Province Higher Educational Science and Technology Program (Grant No. J11LG32).