Related Work

Cortez, Eli; da Silva, Altigran S.

doi:10.1007/978-3-319-02597-1_2

Part of the book series: SpringerBriefs in Computer Science (BRIEFSCOMPUTER)

Abstract

Different approaches have been proposed in the literature to address the problem of extracting valuable data from the Web. This chapter presents an overview of such approaches. It begins with a broad set of Web extraction methods and tools. Following a taxonomy previously used in the literature (Laender et al. 2002), these are divided into distinct groups according to their main approach: Languages for Wrapper Development, Wrapper Induction Methods, NLP-based Methods, Ontology-based Methods, and HTML-aware Methods. Next, the chapter presents probabilistic graph-based methods, both supervised and unsupervised, and discusses their main characteristics in comparison to the unsupervised approach presented in this book.

References

Agichtein, E., & Ganti, V. (2004). Mining reference tables for automatic text segmentation. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 20–29). USA: Seattle.
Arocena, G., & Mendelzon, A. (1998). WebOQL: Restructuring documents, databases and webs. Proceedings of the IEEE ICDE International Conference on Data Engineering (pp. 24–33). USA: Orlando.
Banko, M., Cafarella, M., Soderland, S., Broadhead, M., & Etzioni, O. (2009). Open information extraction for the web. PhD thesis, University of Washington, Washington.
Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., & Etzioni, O. (2007). Open information extraction from the web. Proceedings of the IJCAI International Joint Conference on Artificial Intelligence (pp. 2670–2676). India: Hyderabad.
Borkar, V., Deshmukh, K., & Sarawagi, S. (2001). Automatic segmentation of text into structured records. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 175–186). USA: Santa Barbara.
Cafarella, M., Halevy, A., Wang, D., Wu, E., & Zhang, Y. (2008). WebTables: Exploring the power of tables on the web. Proceedings of the VLDB Endowment, 1(1), 538–549.
Chiang, F., Andritsos, P., Zhu, E., & Miller, R. (2012). Autodict: Automated dictionary discovery. Proceedings of the IEEE ICDE International Conference on Data Engineering (pp. 1277–1280). USA: Washington.
Chuang, S., Chang, K., & Zhai, C. (2007). Context-aware wrapping: synchronized data extraction. Proceedings of the VLDB International Conference on Very Large Data Bases (pp. 699–710). Austria: Vienna.
Cortez, E., da Silva, A., Gonçalves, M., Mesquita, F., & de Moura, E. (2007). FLUX-CIM: flexible unsupervised extraction of citation metadata. Proceedings of the ACM/IEEE JCDL Joint Conference on Digital Libraries (pp. 215–224). Canada: Vancouver.
Cortez, E., da Silva, A. S., Gonçalves, M. A., Mesquita, F., & de Moura, E. S. (2009). A flexible approach for extracting metadata from bibliographic citations. Journal of the American Society for Information Science and Technology, 60(6), 1144–1158.
Crescenzi, V., & Mecca, G. (1998). Grammars have exceptions. Information Systems, 23(8), 539–565.
Crescenzi, V., Mecca, G., & Merialdo, P. (2001). RoadRunner: Towards automatic data extraction from large web sites. Proceedings of the VLDB International Conference on Very Large Data Bases (pp. 109–118). Italy: Rome.
Dalvi, N., Bohannon, P., & Sha, F. (2009). Robust web extraction: an approach based on a probabilistic tree-edit model. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 335–348). USA: Providence.
Elmeleegy, H., Madhavan, J., & Halevy, A. (2009). Harvesting relational tables from lists on the web. Proceedings of the VLDB Endowment, 2(1), 1078–1089.
Embley, D., Campbell, D., Jiang, Y., Liddle, S., Lonsdale, D., Ng, Y., et al. (1999a). Conceptual-model-based data extraction from multiple-record web pages. Data and Knowledge Engineering, 31(3), 227–251.
Embley, D., Jiang, Y., & Ng, Y. (1999b). Record-boundary discovery in web documents. ACM SIGMOD Record, 28(2), 467–478.
Etzioni, O., Banko, M., Soderland, S., & Weld, D. (2008). Open information extraction from the web. Communications of the ACM, 51(12), 68–74.
Freitag, D., & McCallum, A. (2000). Information extraction with HMM structures learned by stochastic optimization. Proceedings of the National Conference on Artificial Intelligence and Conference on Innovative Applications of Artificial Intelligence (pp. 584–589). USA: Austin.
Hammer, J., McHugh, J., & Garcia-Molina, H. (1997). Semistructured data: The TSIMMIS experience. Proceedings of the East-European Symposium on Advances in Databases and Information Systems (pp. 1–8). Russia: St. Petersburg.
Hsu, C., & Dung, M. (1998). Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8), 521–538.
Kristjansson, T., Culotta, A., Viola, P., & McCallum, A. (2004). Interactive information extraction with constrained conditional random fields. Proceedings of the AAAI National Conference on Artificial Intelligence (pp. 412–418). USA: San Jose.
Kushmerick, N. (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1–2), 15–68.
Laender, A. H. F., Ribeiro-Neto, B. A., da Silva, A. S., & Teixeira, J. S. (2002). A brief survey of web data extraction tools. SIGMOD Record, 31(2), 84–93.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the ICML International Conference on Machine Learning (pp. 282–289). USA: Williamstown.
Mansuri, I. R., & Sarawagi, S. (2006). Integrating unstructured data into relational databases. Proceedings of the IEEE ICDE International Conference on Data Engineering (pp. 29–41). USA: Atlanta.
Michelson, M., & Knoblock, C. (2007). Unsupervised information extraction from unstructured, ungrammatical data sources on the World Wide Web. International Journal on Document Analysis and Recognition, 10(3), 211–226.
Mooney, R. (1999). Relational learning of pattern-match rules for information extraction. Proceedings of the National Conference on Artificial Intelligence (pp. 328–334). USA: Orlando.
Muslea, I., Minton, S., & Knoblock, C. A. (2001). Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1–2), 93–114.
Peng, F., & McCallum, A. (2006). Information extraction from research papers using conditional random fields. Information Processing and Management, 42(4), 963–979.
Reis, D. C., Golgher, P. B., Silva, A. S., & Laender, A. F. (2004). Automatic web news extraction using tree edit distance. Proceedings of the WWW International World Wide Web Conferences (pp. 502–511). USA: New York.
Sarawagi, S. (2008). Information extraction. Foundations and Trends in Databases, 1(3), 261–377.
Serra, E., Cortez, E., da Silva, A., & de Moura, E. (2011). On using Wikipedia to build knowledge bases for information extraction by text segmentation. Journal of Information and Data Management, 2(3), 259.
Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1), 233–272.
Zhao, C., Mahmud, J., & Ramakrishnan, I. (2008). Exploiting structured reference data for unsupervised text segmentation with conditional random fields. Proceedings of the SIAM International Conference on Data Mining (pp. 420–431). USA: Atlanta.
Zhao, H., Meng, W., Wu, Z., Raghavan, V., & Yu, C. (2005). Fully automatic wrapper generation for search engines. Proceedings of the WWW International World Wide Web Conferences (pp. 66–75). Japan: Chiba.

Author information

Authors and Affiliations

Instituto de Computação, Universidade Federal do Amazonas, Manaus, AM, Brazil
Eli Cortez & Altigran S. da Silva

Corresponding author

Correspondence to Eli Cortez.

About this chapter

Cite this chapter

Cortez, E., da Silva, A.S. (2013). Related Work. In: Unsupervised Information Extraction by Text Segmentation. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-02597-1_2

DOI: https://doi.org/10.1007/978-3-319-02597-1_2
Published: 24 October 2013
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02596-4
Online ISBN: 978-3-319-02597-1
eBook Packages: Computer Science, Computer Science (R0)
