Part of the book series: SpringerBriefs in Computer Science ((BRIEFSCOMPUTER))

Abstract

In the literature, different approaches have been proposed to address the problem of extracting valuable data from the Web. This chapter presents an overview of such approaches. It begins by presenting a broad set of Web extraction methods and tools. Following a taxonomy previously used in the literature (Laender et al. 2002), they are divided into distinct groups according to their main approach: Languages for Wrapper Development, Wrapper Induction Methods, NLP-based Methods, Ontology-based Methods, and HTML-aware Methods. Next, the chapter presents probabilistic graph-based methods, both supervised and unsupervised, and discusses their main characteristics in comparison to the unsupervised approach presented in this book.

References

  • Agichtein, E., & Ganti, V. (2004). Mining reference tables for automatic text segmentation. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 20–29). USA: Seattle.

  • Arocena, G., & Mendelzon, A. (1998). WebOQL: Restructuring documents, databases and webs. Proceedings of the IEEE ICDE International Conference on Data Engineering (pp. 24–33). USA: Orlando.

  • Banko, M., Cafarella, M., Soderland, S., Broadhead, M., & Etzioni, O. (2009). Open information extraction for the web. PhD thesis, University of Washington, Washington.

  • Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., & Etzioni, O. (2007). Open information extraction from the web. Proceedings of the IJCAI International Joint Conference on Artificial Intelligence (pp. 2670–2676). India: Hyderabad.

  • Borkar, V., Deshmukh, K., & Sarawagi, S. (2001). Automatic segmentation of text into structured records. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 175–186). USA: Santa Barbara.

  • Cafarella, M., Halevy, A., Wang, D., Wu, E., & Zhang, Y. (2008). Webtables: Exploring the power of tables on the web. Proceedings of the VLDB Endowment, 1(1), 538–549.

  • Chiang, F., Andritsos, P., Zhu, E., & Miller, R. (2012). Autodict: Automated dictionary discovery. Proceedings of the IEEE ICDE International Conference on Data Engineering (pp. 1277–1280). USA: Washington.

  • Chuang, S., Chang, K., & Zhai, C. (2007). Context-aware wrapping: synchronized data extraction. Proceedings of the VLDB International Conference on Very Large Data Bases (pp. 699–710). Austria: Vienna.

  • Cortez, E., da Silva, A., Gonçalves, M., Mesquita, F., & de Moura, E. (2007). FLUX-CIM: flexible unsupervised extraction of citation metadata. Proceedings of the ACM/IEEE JCDL Joint Conference on Digital Libraries (pp. 215–224). Canada: Vancouver.

  • Cortez, E., da Silva, A. S., Gonçalves, M. A., Mesquita, F., & de Moura, E. S. (2009). A flexible approach for extracting metadata from bibliographic citations. Journal of the American Society for Information Science and Technology, 60(6), 1144–1158.

  • Crescenzi, V., & Mecca, G. (1998). Grammars have exceptions. Information Systems, 23(8), 539–565.

  • Crescenzi, V., Mecca, G., & Merialdo, P. (2001). RoadRunner: Towards automatic data extraction from large web sites. Proceedings of the VLDB International Conference on Very Large Data Bases (pp. 109–118). Italy: Rome.

  • Dalvi, N., Bohannon, P., & Sha, F. (2009). Robust web extraction: an approach based on a probabilistic tree-edit model. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 335–348). USA: Providence.

  • Elmeleegy, H., Madhavan, J., & Halevy, A. (2009). Harvesting relational tables from lists on the web. Proceedings of the VLDB Endowment, 2(1), 1078–1089.

  • Embley, D., Campbell, D., Jiang, Y., Liddle, S., Lonsdale, D., Ng, Y., et al. (1999a). Conceptual-model-based data extraction from multiple-record web pages. Data and Knowledge Engineering, 31(3), 227–251.

  • Embley, D., Jiang, Y., & Ng, Y. (1999b). Record-boundary discovery in web documents. ACM SIGMOD Record, 28(2), 467–478.

  • Etzioni, O., Banko, M., Soderland, S., & Weld, D. (2008). Open information extraction from the web. Communications of the ACM, 51(12), 68–74.

  • Freitag, D., & McCallum, A. (2000). Information extraction with HMM structures learned by stochastic optimization. Proceedings of the National Conference on Artificial Intelligence and Conference on Innovative Applications of Artificial Intelligence (pp. 584–589). USA: Austin.

  • Hammer, J., McHugh, J., & Garcia-Molina, H. (1997). Semistructured data: The TSIMMIS experience. Proceedings of the East-European Symposium on Advances in Databases and Information Systems (pp. 1–8). Russia: St. Petersburg.

  • Hsu, C., & Dung, M. (1998). Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8), 521–538.

  • Kristjansson, T., Culotta, A., Viola, P., & McCallum, A. (2004). Interactive information extraction with constrained conditional random fields. Proceedings of the AAAI National Conference on Artificial Intelligence (pp. 412–418). USA: San Jose.

  • Kushmerick, N. (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1–2), 15–68.

  • Laender, A. H. F., Ribeiro-Neto, B. A., da Silva, A. S., & Teixeira, J. S. (2002). A brief survey of web data extraction tools. SIGMOD Record, 31(2), 84–93.

  • Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the ICML International Conference on Machine Learning (pp. 282–289). USA: Williamstown.

  • Mansuri, I. R., & Sarawagi, S. (2006). Integrating unstructured data into relational databases. Proceedings of the IEEE ICDE International Conference on Data Engineering (pp. 29–41). USA: Atlanta.

  • Michelson, M., & Knoblock, C. (2007). Unsupervised information extraction from unstructured, ungrammatical data sources on the world wide web. International Journal on Document Analysis and Recognition, 10(3), 211–226.

  • Mooney, R. (1999). Relational learning of pattern-match rules for information extraction. Proceedings of the National Conference on Artificial Intelligence (pp. 328–334). USA: Orlando.

  • Muslea, I., Minton, S., & Knoblock, C. A. (2001). Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1–2), 93–114.

  • Peng, F., & McCallum, A. (2006). Information extraction from research papers using conditional random fields. Information Processing and Management, 42(4), 963–979.

  • Reis, D. C., Golgher, P. B., Silva, A. S., & Laender, A. F. (2004). Automatic web news extraction using tree edit distance. Proceedings of the WWW International World Wide Web Conferences (pp. 502–511). USA: New York.

  • Sarawagi, S. (2008). Information extraction. Foundations and Trends in Databases, 1(3), 261–377.

  • Serra, E., Cortez, E., da Silva, A., & de Moura, E. (2011). On using Wikipedia to build knowledge bases for information extraction by text segmentation. Journal of Information and Data Management, 2(3), 259.

  • Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1), 233–272.

  • Zhao, C., Mahmud, J., & Ramakrishnan, I. (2008). Exploiting structured reference data for unsupervised text segmentation with conditional random fields. Proceedings of the SIAM International Conference on Data Mining (pp. 420–431). USA: Atlanta.

  • Zhao, H., Meng, W., Wu, Z., Raghavan, V., & Yu, C. (2005). Fully automatic wrapper generation for search engines. Proceedings of the WWW International World Wide Web Conferences (pp. 66–75). Japan: Chiba.

Author information

Corresponding author

Correspondence to Eli Cortez.

Copyright information

© 2013 The Author(s)

About this chapter

Cite this chapter

Cortez, E., da Silva, A.S. (2013). Related Work. In: Unsupervised Information Extraction by Text Segmentation. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-02597-1_2

  • DOI: https://doi.org/10.1007/978-3-319-02597-1_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-02596-4

  • Online ISBN: 978-3-319-02597-1

  • eBook Packages: Computer Science, Computer Science (R0)
