Reducing Human Effort in Named Entity Corpus Construction Based on Ensemble Learning and Annotation Categorization

Lu, Tingming; Zhu, Man; Gao, Zhiqiang

doi:10.1007/978-3-319-50496-4_22

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10102)

Abstract

Annotated named entity corpora play a significant role in many natural language processing applications. However, human annotation is time-consuming and costly. In this paper, we propose a high-recall pre-annotator that combines multiple existing named entity taggers through ensemble learning, reducing the number of annotations that humans have to add. In addition, annotations are categorized into normal annotations and candidate annotations according to their estimated confidence, which reduces both the number of human corrective actions and the total annotation time. The experimental results show that our approach reduces annotation time more than the baseline methods, without loss in annotation performance (in terms of F-measure).
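
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the ensemble learner or the confidence estimator, so the sketch assumes a plain union of tagger outputs (to favor recall) and uses the share of taggers that agree on an annotation as a stand-in confidence. The function names, the span representation, and the 0.5 threshold are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical span representation: (start_token, end_token_exclusive, entity_type)
Annotation = Tuple[int, int, str]


def combine_taggers(outputs: List[List[Annotation]]) -> Dict[Annotation, float]:
    """Union the annotations of all taggers and score each by vote share.

    Assumption: vote share stands in for the estimated confidence; the
    paper's actual estimator is not described in the abstract.
    """
    votes: Dict[Annotation, int] = defaultdict(int)
    for tagger_output in outputs:
        for ann in set(tagger_output):  # each tagger votes at most once per span
            votes[ann] += 1
    return {ann: count / len(outputs) for ann, count in votes.items()}


def categorize(scored: Dict[Annotation, float], threshold: float = 0.5):
    """Split annotations into normal (confident) and candidate (uncertain)."""
    normal = sorted(a for a, conf in scored.items() if conf >= threshold)
    candidate = sorted(a for a, conf in scored.items() if conf < threshold)
    return normal, candidate


# Toy outputs from three hypothetical taggers on the same sentence.
outputs = [
    [(0, 2, "PER"), (5, 6, "LOC")],
    [(0, 2, "PER")],
    [(0, 2, "PER"), (8, 9, "ORG")],
]
scored = combine_taggers(outputs)
normal, candidate = categorize(scored, threshold=0.5)
print("normal:", normal)
print("candidate:", candidate)
```

In this toy run the span all three taggers agree on becomes a normal annotation, while the two spans proposed by a single tagger are kept as candidates that an annotator could accept or ignore rather than having to add from scratch.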

Notes

1. http://58.192.114.226/assistedNER.
2. http://nlp.stanford.edu/software/CRF-NER.shtml (version 3.6.0).
3. http://cogcomp.cs.illinois.edu/page/software_view/NETagger (version 2.8.8).
4. http://balie.sourceforge.net (version 1.8.1).
5. http://opennlp.apache.org/index.html (version 1.6.0).
6. http://gate.ac.uk/ (version 8.1).
7. http://alias-i.com/lingpipe/ (version 4.1.0).

Acknowledgement

This work is partially funded by the National Science Foundation of China under Grants 61170165, 61602260, and 61502095. We would like to thank all the anonymous reviewers for their helpful comments.

Author information

Authors and Affiliations

Key Lab of Computer Network and Information Integration (Southeast University), Ministry of Education, Nanjing, China
Tingming Lu & Zhiqiang Gao
School of Computer Science and Engineering, Southeast University, Nanjing, China
Tingming Lu & Zhiqiang Gao
School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing, China
Man Zhu

Corresponding author

Correspondence to Zhiqiang Gao.

Editor information

Editors and Affiliations

Microsoft Research Asia, Beijing, China
Chin-Yew Lin
Brandeis University, Waltham, Massachusetts, USA
Nianwen Xue
Peking University, Beijing, China
Dongyan Zhao
Fudan University, Shanghai, China
Xuanjing Huang
Peking University, Beijing, China
Yansong Feng

About this paper

Cite this paper

Lu, T., Zhu, M., Gao, Z. (2016). Reducing Human Effort in Named Entity Corpus Construction Based on Ensemble Learning and Annotation Categorization. In: Lin, CY., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds) Natural Language Understanding and Intelligent Applications. ICCPOL 2016, NLPCC 2016. Lecture Notes in Computer Science, vol 10102. Springer, Cham. https://doi.org/10.1007/978-3-319-50496-4_22

DOI: https://doi.org/10.1007/978-3-319-50496-4_22
Published: 02 December 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50495-7
Online ISBN: 978-3-319-50496-4
eBook Packages: Computer Science, Computer Science (R0)
