Exploiting Pre-Existing Datasets to Support IETS

  • Chapter
Unsupervised Information Extraction by Text Segmentation

Part of the book series: SpringerBriefs in Computer Science (BRIEFSCOMPUTER)

Abstract

This chapter describes in detail a new approach for exploiting pre-existing datasets to support Information Extraction by Text Segmentation (IETS) methods. It first presents a brief overview of the approach and introduces the concept of a knowledge base. It then discusses each step of the unsupervised approach: how to learn content-based features from knowledge bases; how to automatically induce structure-based features without prior human-driven training, a capability unique to this approach; and how to effectively combine these features to label segments of a text input.

Notes

  1. The maximum probability density of \(v_A\) is \(1/\sqrt{2\pi \sigma ^{2}}\).
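
    As a brief sketch of where this value comes from, assume (as the form \(1/\sqrt{2\pi \sigma ^{2}}\) suggests, though this preview does not state it) that \(v_A\) is modeled with a normal density of mean \(\mu\) and variance \(\sigma ^{2}\). The symbols \(x\), \(\mu\), and \(f\) are introduced here for illustration only.

    \[
    f(x) = \frac{1}{\sqrt{2\pi \sigma ^{2}}} \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma ^{2}}\right) \le \frac{1}{\sqrt{2\pi \sigma ^{2}}}
    \]

    The exponential factor is at most 1 and equals 1 only at \(x = \mu\), so under this assumption the density peaks at the mean with the value given in the note.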

Author information

Corresponding author

Correspondence to Eli Cortez.

Copyright information

© 2013 The Author(s)

About this chapter

Cite this chapter

Cortez, E., da Silva, A.S. (2013). Exploiting Pre-Existing Datasets to Support IETS. In: Unsupervised Information Extraction by Text Segmentation. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-02597-1_3

  • DOI: https://doi.org/10.1007/978-3-319-02597-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-02596-4

  • Online ISBN: 978-3-319-02597-1

  • eBook Packages: Computer Science, Computer Science (R0)