Abstract
The Information Extraction (IE) problem refers to the automatic extraction of structured information from noisy, unstructured textual sources. It is a research topic in several Computer Science communities, such as Databases, Information Retrieval, and Artificial Intelligence. This chapter introduces the problem and gives an overview of how information extraction fits into the broader topic of data management. It also lists the main contributions presented in this book.
Copyright information
© 2013 The Author(s)
About this chapter
Cite this chapter
Cortez, E., da Silva, A.S. (2013). Introduction. In: Unsupervised Information Extraction by Text Segmentation. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-02597-1_1
DOI: https://doi.org/10.1007/978-3-319-02597-1_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02596-4
Online ISBN: 978-3-319-02597-1
eBook Packages: Computer Science, Computer Science (R0)