Exploiting Pre-Existing Datasets to Support IETS

Cortez, Eli; da Silva, Altigran S.

doi:10.1007/978-3-319-02597-1_3

Part of the book series: SpringerBriefs in Computer Science (BRIEFSCOMPUTER)

Abstract

This chapter describes in detail a new approach for exploiting pre-existing datasets to support Information Extraction by Text Segmentation (IETS) methods. First, it presents a brief overview of the approach and introduces the concept of a knowledge base. Next, it discusses all the steps involved in the unsupervised approach: how to learn content-based features from knowledge bases; how to automatically induce structure-based features with no previous human-driven training, a capability unique to this approach; and how to effectively combine these features to label segments of a text input.
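
As a rough illustration of the content-based idea summarized in the abstract, the sketch below treats a knowledge base as a mapping from attribute names to example values and labels a text segment with the attribute whose known terms it overlaps most. The knowledge base contents, the function names (`build_vocabulary`, `label_segment`), and the overlap score are all assumptions made for illustration; they are not the features defined in the chapter, and structure-based features are not modeled here.

```python
# Hypothetical knowledge base: attribute name -> example values.
KNOWLEDGE_BASE = {
    "author": ["Eli Cortez", "Altigran S. da Silva"],
    "venue": ["ACM SIGMOD Conference", "IEEE ICDE Conference"],
    "year": ["2010", "2011"],
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_vocabulary(knowledge_base):
    """Collect, per attribute, the set of terms seen in its values."""
    return {
        attribute: {term for value in values for term in tokenize(value)}
        for attribute, values in knowledge_base.items()
    }


def label_segment(segment, vocabulary):
    """Return the attribute covering the largest fraction of the
    segment's terms, or None if no attribute covers any term."""
    terms = tokenize(segment)
    if not terms:
        return None
    scores = {
        attribute: sum(1 for term in terms if term in known) / len(terms)
        for attribute, known in vocabulary.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


vocabulary = build_vocabulary(KNOWLEDGE_BASE)
segments = ["Cortez Eli", "SIGMOD Conference", "2011"]
labels = [label_segment(segment, vocabulary) for segment in segments]
```

The score here depends only on a segment's content, so it ignores where the segment appears in the input; a fuller treatment would combine it with positional or sequencing evidence, as the abstract indicates.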

Notes

1. The maximum probability density of \(v_A\) is \(1/\sqrt{2\pi\sigma^{2}}\).
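
The value in this note follows directly if \(v_A\) is taken to be normally distributed. That modeling assumption, and the symbols \(f\), \(x\), and \(\mu\) below, are not stated in the note itself and are introduced here only for illustration. Under that assumption the density peaks at the mean:

```latex
f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\,
       \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right),
\qquad
\max_{x} f(x) = f(\mu) = \frac{1}{\sqrt{2\pi\sigma^{2}}}
```

The exponential factor is at most 1 and equals 1 only at \(x = \mu\), so dividing \(f(x)\) by this maximum yields a score in \((0, 1]\).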

Author information

Authors and Affiliations

Instituto de Computação, Universidade Federal do Amazonas, Manaus, AM, Brazil
Eli Cortez & Altigran S. da Silva

Corresponding author

Correspondence to Eli Cortez.

About this chapter

Cite this chapter

Cortez, E., da Silva, A.S. (2013). Exploiting Pre-Existing Datasets to Support IETS. In: Unsupervised Information Extraction by Text Segmentation. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-02597-1_3

DOI: https://doi.org/10.1007/978-3-319-02597-1_3
Published: 24 October 2013
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02596-4
Online ISBN: 978-3-319-02597-1
eBook Packages: Computer Science, Computer Science (R0)
