An Indent Shape Based Approach for Web Lists Mining

  • Yanxu Zhu
  • Gang Yin
  • Huaimin Wang
  • Dianxi Shi
  • Xiang Li
  • Lin Yuan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6988)

Abstract

Mining repeated patterns from HTML documents is a key step in typical Web information extraction applications, which require efficient pattern mining techniques to generate wrappers automatically. Existing approaches such as tree matching and string matching can detect repeated patterns with high precision, but their efficiency remains a challenge. In this paper, we present a novel approach to Web lists mining based on the indent shape of HTML documents. The indent shape is a simplified abstraction of an HTML document in which tandem repeated waves indicate the potential repeated patterns to be detected. By efficiently identifying the tandem repeated waves with a horizontal line scan along the indent shape, the approach recognizes the repeated patterns in the document, from which the lists of the target Web page can be extracted. Extensive experiments show that our approach achieves better performance and efficiency than existing approaches.
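The idea can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not spell out. It assumes that the indent shape is the sequence of nesting depths of start tags, that a "wave" is a run of points beginning at a chosen depth and staying below it until the next point at that depth, and that two waves repeat only when they match exactly. The names `indent_shape`, `split_waves`, and `tandem_runs` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not deepen the shape.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class _ShapeParser(HTMLParser):
    """Records (depth, tag) for every start tag (assumed indent shape)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.shape = []

    def handle_starttag(self, tag, attrs):
        self.shape.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def indent_shape(html):
    parser = _ShapeParser()
    parser.feed(html)
    parser.close()
    return parser.shape


def split_waves(shape, level):
    """Scan a horizontal line at `level` and cut the shape into waves.

    Returns a list of segments; each segment is a list of adjacent waves,
    and a point shallower than `level` ends the current segment.
    """
    segments, waves, wave = [], [], None
    for depth, tag in shape:
        if depth == level:            # the line touches the shape: new wave
            if wave is not None:
                waves.append(tuple(wave))
            wave = [(depth, tag)]
        elif depth > level:           # still inside the current wave
            if wave is not None:
                wave.append((depth, tag))
        else:                         # rose above the line: close the segment
            if wave is not None:
                waves.append(tuple(wave))
                wave = None
            if waves:
                segments.append(waves)
                waves = []
    if wave is not None:
        waves.append(tuple(wave))
    if waves:
        segments.append(waves)
    return segments


def tandem_runs(segments, min_repeats=2):
    """Return (wave, count) for each run of adjacent identical waves."""
    runs = []
    for waves in segments:
        i = 0
        while i < len(waves):
            j = i
            while j + 1 < len(waves) and waves[j + 1] == waves[i]:
                j += 1
            if j - i + 1 >= min_repeats:
                runs.append((waves[i], j - i + 1))
            i = j + 1
    return runs


page = ("<html><body><h1>Title</h1><ul>"
        "<li><a href='#'>a</a></li>"
        "<li><a href='#'>b</a></li>"
        "<li><a href='#'>c</a></li>"
        "</ul></body></html>")
shape = indent_shape(page)
print(tandem_runs(split_waves(shape, level=3)))
```

In this sketch, scanning at depth 3 finds three adjacent identical `li`/`a` waves, which correspond to the three list items. A practical implementation would need to try several scan levels and tolerate small differences between waves; both are left out here.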

Keywords

Repeated Patterns; Web Lists Mining; Indent Shape; Tandem Repeated Wave

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Yanxu Zhu (1)
  • Gang Yin (1)
  • Huaimin Wang (1)
  • Dianxi Shi (1)
  • Xiang Li (1)
  • Lin Yuan (2)
  1. College of Computer Science and Technology, National University of Defense Technology, Changsha, China
  2. College of Electronic Technology, Information Engineering University, Zhengzhou, China
