Abstract
Different approaches have been proposed in the literature to address the problem of extracting valuable data from the Web. This chapter presents an overview of such approaches. It begins with a broad set of Web extraction methods and tools. Following a taxonomy previously used in the literature (Laender et al. 2002), these are divided into distinct groups according to their main approach: Languages for Wrapper Development, Wrapper Induction Methods, NLP-based Methods, Ontology-based Methods, and HTML-aware Methods. Next, the chapter presents probabilistic graph-based methods, both supervised and unsupervised, and discusses their main characteristics in comparison with the unsupervised approach presented in this book.
References
Agichtein, E., & Ganti, V. (2004). Mining reference tables for automatic text segmentation. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 20–29). USA: Seattle.
Arocena, G., & Mendelzon, A. (1998). WebOQL: Restructuring documents, databases and webs. Proceedings of the IEEE ICDE International Conference on Data Engineering (pp. 24–33). USA: Orlando.
Banko, M., Cafarella, M., Soderland, S., Broadhead, M., & Etzioni, O. (2009). Open information extraction for the web. PhD thesis, University of Washington, Washington.
Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., & Etzioni, O. (2007). Open information extraction from the web. Proceedings of the IJCAI International Joint Conference on Artificial Intelligence (pp. 2670–2676). India: Hyderabad.
Borkar, V., Deshmukh, K., & Sarawagi, S. (2001). Automatic segmentation of text into structured records. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 175–186). USA: Santa Barbara.
Cafarella, M., Halevy, A., Wang, D., Wu, E., & Zhang, Y. (2008). WebTables: Exploring the power of tables on the web. Proceedings of the VLDB Endowment, 1(1), 538–549.
Chiang, F., Andritsos, P., Zhu, E., & Miller, R. (2012). Autodict: Automated dictionary discovery. Proceedings of the IEEE ICDE International Conference on Data Engineering (pp. 1277–1280). USA: Washington.
Chuang, S., Chang, K., & Zhai, C. (2007). Context-aware wrapping: Synchronized data extraction. Proceedings of the VLDB International Conference on Very Large Data Bases (pp. 699–710). Austria: Vienna.
Cortez, E., da Silva, A., Gonçalves, M., Mesquita, F., & de Moura, E. (2007). FLUX-CIM: flexible unsupervised extraction of citation metadata. Proceedings of the ACM/IEEE JCDL Joint Conference on Digital Libraries (pp. 215–224). Canada: Vancouver.
Cortez, E., da Silva, A. S., Gonçalves, M. A., Mesquita, F., & de Moura, E. S. (2009). A flexible approach for extracting metadata from bibliographic citations. Journal of the American Society for Information Science and Technology, 60(6), 1144–1158.
Crescenzi, V., & Mecca, G. (1998). Grammars have exceptions. Information Systems, 23(8), 539–565.
Crescenzi, V., Mecca, G., & Merialdo, P. (2001). RoadRunner: Towards automatic data extraction from large web sites. Proceedings of the VLDB International Conference on Very Large Data Bases (pp. 109–118). Italy: Rome.
Dalvi, N., Bohannon, P., & Sha, F. (2009). Robust web extraction: An approach based on a probabilistic tree-edit model. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 335–348). USA: Providence.
Elmeleegy, H., Madhavan, J., & Halevy, A. (2009). Harvesting relational tables from lists on the web. Proceedings of the VLDB Endowment, 2(1), 1078–1089.
Embley, D., Campbell, D., Jiang, Y., Liddle, S., Lonsdale, D., Ng, Y., et al. (1999a). Conceptual-model-based data extraction from multiple-record web pages. Data and Knowledge Engineering, 31(3), 227–251.
Embley, D., Jiang, Y., & Ng, Y. (1999b). Record-boundary discovery in web documents. ACM SIGMOD Record, 28(2), 467–478.
Etzioni, O., Banko, M., Soderland, S., & Weld, D. (2008). Open information extraction from the web. Communications of the ACM, 51(12), 68–74.
Freitag, D., & McCallum, A. (2000). Information extraction with HMM structures learned by stochastic optimization. Proceedings of the National Conference on Artificial Intelligence and Conference on Innovative Applications of Artificial Intelligence (pp. 584–589). USA: Austin.
Hammer, J., McHugh, J., & Garcia-Molina, H. (1997). Semistructured data: The TSIMMIS experience. Proceedings of the East-European Symposium on Advances in Databases and Information Systems (pp. 1–8). Russia: St. Petersburg.
Hsu, C., & Dung, M. (1998). Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8), 521–538.
Kristjansson, T., Culotta, A., Viola, P., & McCallum, A. (2004). Interactive information extraction with constrained conditional random fields. Proceedings of the AAAI National Conference on Artificial Intelligence (pp. 412–418). USA: San Jose.
Kushmerick, N. (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1–2), 15–68.
Laender, A. H. F., Ribeiro-Neto, B. A., da Silva, A. S., & Teixeira, J. S. (2002). A brief survey of web data extraction tools. SIGMOD Record, 31(2), 84–93.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the ICML International Conference on Machine Learning (pp. 282–289). USA: Williamstown.
Mansuri, I. R., & Sarawagi, S. (2006). Integrating unstructured data into relational databases. Proceedings of the IEEE ICDE International Conference on Data Engineering (pp. 29–41). USA: Atlanta.
Michelson, M., & Knoblock, C. (2007). Unsupervised information extraction from unstructured, ungrammatical data sources on the World Wide Web. International Journal on Document Analysis and Recognition, 10(3), 211–226.
Mooney, R. (1999). Relational learning of pattern-match rules for information extraction. Proceedings of the National Conference on Artificial Intelligence (pp. 328–334). USA: Orlando.
Muslea, I., Minton, S., & Knoblock, C. A. (2001). Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1–2), 93–114.
Peng, F., & McCallum, A. (2006). Information extraction from research papers using conditional random fields. Information Processing and Management, 42(4), 963–979.
Reis, D. C., Golgher, P. B., Silva, A. S., & Laender, A. F. (2004). Automatic web news extraction using tree edit distance. Proceedings of the WWW International World Wide Web Conferences (pp. 502–511). USA: New York.
Sarawagi, S. (2008). Information extraction. Foundations and Trends in Databases, 1(3), 261–377.
Serra, E., Cortez, E., da Silva, A., & de Moura, E. (2011). On using Wikipedia to build knowledge bases for information extraction by text segmentation. Journal of Information and Data Management, 2(3), 259.
Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1), 233–272.
Zhao, C., Mahmud, J., & Ramakrishnan, I. (2008). Exploiting structured reference data for unsupervised text segmentation with conditional random fields. Proceedings of the SIAM International Conference on Data Mining (pp. 420–431). USA: Atlanta.
Zhao, H., Meng, W., Wu, Z., Raghavan, V., & Yu, C. (2005). Fully automatic wrapper generation for search engines. Proceedings of the WWW International World Wide Web Conferences (pp. 66–75). Japan: Chiba.
Copyright information
© 2013 The Author(s)
About this chapter
Cite this chapter
Cortez, E., da Silva, A.S. (2013). Related Work. In: Unsupervised Information Extraction by Text Segmentation. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-02597-1_2
DOI: https://doi.org/10.1007/978-3-319-02597-1_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02596-4
Online ISBN: 978-3-319-02597-1
eBook Packages: Computer Science, Computer Science (R0)