Abstract
This chapter describes in detail a new approach for exploiting preexisting datasets to support Information Extraction by Text Segmentation (IETS) methods. It first gives a brief overview of the approach and introduces the concept of a knowledge base. It then discusses each step of the unsupervised approach: how to learn content-based features from knowledge bases; how to automatically induce structure-based features without any prior human-driven training, a capability unique to this approach; and how to combine these features effectively to label the segments of an input text.
Notes
1. The maximum probability density of \(v_A\) is \(1/\sqrt{2\pi \sigma ^{2}}\).
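As a brief check of the value in Note 1, assume that \(v_A\) is normally distributed with mean \(\mu\) and variance \(\sigma^{2}\). The note itself does not state the distribution, and the symbol \(\mu\) is introduced here only for illustration. Under that assumption, the density and its maximum would be:

\[
f(v_A) = \frac{1}{\sqrt{2\pi \sigma^{2}}} \exp\!\left(-\frac{(v_A - \mu)^{2}}{2\sigma^{2}}\right),
\qquad
\max_{v_A} f(v_A) = f(\mu) = \frac{1}{\sqrt{2\pi \sigma^{2}}}.
\]

The exponential factor is at most 1 and equals 1 only at \(v_A = \mu\), so the peak density reduces to the normalizing constant, which matches the value given in the note.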
Copyright information
© 2013 The Author(s)
About this chapter
Cite this chapter
Cortez, E., da Silva, A.S. (2013). Exploiting Pre-Existing Datasets to Support IETS. In: Unsupervised Information Extraction by Text Segmentation. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-02597-1_3
DOI: https://doi.org/10.1007/978-3-319-02597-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02596-4
Online ISBN: 978-3-319-02597-1
eBook Packages: Computer Science (R0)