On Designing an Effective Training Set for Information Extraction

  • Conference paper

Part of the Lecture Notes in Electrical Engineering book series (LNEE,volume 330)

Abstract

Training set design has received less attention from academia than its significance warrants, yet it becomes crucial in big data environments. We propose a novel way to construct a training set for information extraction. The core of the approach is an effective data collection strategy that considers the trade-off between system quality and annotation difficulty. Instead of collecting data at random, as systems usually do, we use well-defined key expressions as sampling queries. This work is part of an ongoing R&D project; manual annotation is currently in progress, and the resulting training set will be evaluated via the quality of the final system.
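The abstract does not specify how key expressions are matched or how many sentences are drawn per expression, so the following is only a minimal sketch of query-based sampling under stated assumptions: key expressions are treated as case-insensitive substrings, each expression acts as one sampling query, and at most `per_query` matching sentences are kept for annotation. The function name `sample_by_key_expressions`, its parameters, and the toy corpus are hypothetical and are not taken from the paper.

```python
import random


def sample_by_key_expressions(sentences, key_expressions, per_query, seed=0):
    """Pick annotation candidates by querying the corpus with key expressions.

    Assumption (not from the paper): a key expression matches a sentence
    when it occurs in it as a case-insensitive substring.
    """
    rng = random.Random(seed)
    selected = {}  # sentence index -> key expression that retrieved it
    for expr in key_expressions:
        needle = expr.lower()
        matches = [i for i, s in enumerate(sentences) if needle in s.lower()]
        rng.shuffle(matches)  # sample within the query's matches
        for i in matches[:per_query]:
            # Keep the first expression that retrieved a given sentence.
            selected.setdefault(i, expr)
    return [(sentences[i], selected[i]) for i in sorted(selected)]


# Hypothetical toy corpus, for illustration only.
corpus = [
    "Acme Corp acquired Widget Ltd in 2012.",
    "The weather was mild all week.",
    "Widget Ltd was founded by Jane Roe.",
    "Beta Inc acquired Gamma LLC last year.",
]
sample = sample_by_key_expressions(corpus, ["acquired", "founded by"], per_query=1)
```

Compared with drawing sentences uniformly at random, every sentence returned here contains at least one key expression, so annotators would spend less effort on sentences that carry no target information. Capping each query with `per_query` is one possible way to bound annotation cost; the paper may balance quality and difficulty differently.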

Corresponding author

Correspondence to Young-Min Kim.

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kim, YM., Song, Sk., Shin, S., Seon, CN., Hong, S., Jung, H. (2015). On Designing an Effective Training Set for Information Extraction. In: Park, J., Stojmenovic, I., Jeong, H., Yi, G. (eds) Computer Science and its Applications. Lecture Notes in Electrical Engineering, vol 330. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-45402-2_156

  • DOI: https://doi.org/10.1007/978-3-662-45402-2_156

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-45401-5

  • Online ISBN: 978-3-662-45402-2

  • eBook Packages: Engineering, Engineering (R0)
