Exploiting Attribute Redundancy for Web Entity Data Extraction

  • Yanxu Zhu
  • Gang Yin
  • Xiang Li
  • Huaimin Wang
  • Dianxi Shi
  • Lin Yuan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7008)

Abstract

Web entities are often associated with many attributes that describe them, and extracting these attributes is essential to Web entity data extraction. This paper proposes a novel approach that exploits duplicated attribute-value pairs. First, we construct an initial seed set of attributes, including their names and enumerable values, and a training set of Web pages from the target website. Second, we locate the position of each attribute by matching attribute values within the pages of the site against the values contained in the seed set. Third, we choose the position with the highest supportiveness as the extraction path, which we use to extract other attribute-value pairs with the same template. Finally, we conduct an extensive experimental study on a large real data set to demonstrate the effectiveness of our extraction approach.
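The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define "supportiveness", so the sketch assumes it is the number of training pages in which a given tag path holds a seed value. The names `PathTextParser`, `text_nodes`, `learn_paths`, `extract`, `PAGES`, and `SEED` are hypothetical, and the pages are assumed to be well-formed HTML.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so the path stack stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextParser(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # open tags with sibling index, e.g. "span[2]"
        self.child_counts = [Counter()]  # per-depth counters of child tag names
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.child_counts[-1][tag] += 1
        self.stack.append(f"{tag}[{self.child_counts[-1][tag]}]")
        self.child_counts.append(Counter())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        self.stack.pop()
        self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_paths(pages, seed):
    """For each attribute, pick the path matched by seed values on the most pages.

    Assumption: "supportiveness" = number of training pages where the path
    contains a value from the seed set.
    """
    paths = {}
    for name, values in seed.items():
        support = Counter()
        for html in pages:
            support.update({p for p, t in text_nodes(html) if t in values})
        if support:
            paths[name] = support.most_common(1)[0][0]
    return paths


def extract(pages, paths):
    """Apply the learned paths to every page that shares the template."""
    records = []
    for html in pages:
        by_path = dict(text_nodes(html))
        records.append({name: by_path.get(path) for name, path in paths.items()})
    return records


# Hypothetical training pages and seed set, for illustration only.
PAGES = [
    "<html><body><div><span>Color</span><span>Red</span></div><p>Red sale today</p></body></html>",
    "<html><body><div><span>Color</span><span>Blue</span></div><p>Free shipping</p></body></html>",
    "<html><body><div><span>Color</span><span>Teal</span></div><p>New arrival</p></body></html>",
]
SEED = {"color": {"Red", "Blue", "Green"}}

if __name__ == "__main__":
    learned = learn_paths(PAGES, SEED)
    print(learned)
    print(extract(PAGES, learned))
```

In this toy data the third page carries a value ("Teal") that is absent from the seed set; it is still extracted because the path was learned from the two pages whose values do appear in the seed set. Exact string matching is a simplification, and a real system would likely need normalization and tolerance for template variations.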

Keywords

Web Entity Data Extraction Attribute Redundancy 

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Yanxu Zhu (1)
  • Gang Yin (1)
  • Xiang Li (1)
  • Huaimin Wang (1, 2)
  • Dianxi Shi (1)
  • Lin Yuan (3)

  1. College of Computer Science and Technology, National University of Defense Technology, Changsha, China
  2. National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, China
  3. College of Electronic Technology, Information Engineering University, Zhengzhou, China
