Part of the book series: SpringerBriefs in Computer Science (BRIEFSCOMPUTER)

Abstract

The Information Extraction (IE) problem refers to the automatic extraction of structured information from noisy, unstructured textual sources. It is a research topic in several Computer Science communities, such as Databases, Information Retrieval, and Artificial Intelligence. This chapter introduces the problem and gives an overview of how information extraction fits into the broader topic of data management. It also lists the main contributions of this book.

Notes

  1. http://www.pbct.inweb.org.br/pbct/

Author information

Corresponding author

Correspondence to Eli Cortez.

Copyright information

© 2013 The Author(s)

About this chapter

Cite this chapter

Cortez, E., da Silva, A.S. (2013). Introduction. In: Unsupervised Information Extraction by Text Segmentation. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-02597-1_1

  • DOI: https://doi.org/10.1007/978-3-319-02597-1_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-02596-4

  • Online ISBN: 978-3-319-02597-1

  • eBook Packages: Computer Science, Computer Science (R0)
