Structured Data Extraction: Wrapper Generation

  • Bing Liu
Chapter
Part of the Data-Centric Systems and Applications book series (DCSA)

Abstract

Web information extraction is the problem of extracting target information items from Web pages. There are two general problems: extracting information from natural language text and extracting structured data from Web pages. This chapter focuses on extracting structured data. A program for extracting such data is usually called a wrapper. Extracting information from text is studied mainly in the natural language processing community.
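As a minimal sketch of what a wrapper can look like, the Python snippet below extracts records from a small HTML fragment with a regular expression keyed to the page template. The page content, the pattern, and the names `PAGE`, `RECORD`, and `extract_records` are illustrative assumptions, not taken from the chapter; real wrappers are typically learned or generated rather than written by hand.

```python
import re

# Hypothetical page fragment: every data record follows the same HTML template.
PAGE = """
<ul>
  <li><b>Canon S95</b> <span class="price">$399.00</span></li>
  <li><b>Nikon P7000</b> <span class="price">$449.50</span></li>
</ul>
"""

# The wrapper itself: a pattern tied to the template's tags,
# with one named group per target item (assumed fields: name, price).
RECORD = re.compile(
    r'<li><b>(?P<name>[^<]+)</b>\s*'
    r'<span class="price">\$(?P<price>[\d.]+)</span></li>'
)


def extract_records(html):
    """Return one dict per data record matched by the template pattern."""
    return [
        {"name": m.group("name"), "price": float(m.group("price"))}
        for m in RECORD.finditer(html)
    ]


records = extract_records(PAGE)
for record in records:
    print(record["name"], record["price"])
```

A hand-written pattern like this breaks as soon as the page template changes, which is one motivation for generating wrappers automatically.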

Keywords

Regular Expression; Target Item; Generalized Node; Longest Common Subsequence; Tree Edit Distance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  1. Department of Computer Science, University of Illinois, Chicago, Chicago, USA
