Combining rule-based and statistical mechanisms for low-resource named entity recognition

Abstract

We describe a multifaceted approach to named entity recognition that can be deployed with minimal data resources and a handful of hours of non-expert annotation. We describe how this approach was applied in the 2016 LoReHLT evaluation and demonstrate that both statistical and rule-based approaches contribute to our performance. We also demonstrate across many languages the value of selecting the sentences to be annotated when training on small amounts of data.

Notes

  1. IL3_dictionary.xml; LDC-provided.

  2. xinjiang_places.pdf with link to Wikipedia; LDC-provided.

  3. link in CategoryII_list.pdf; LDC-provided.

  4. parallel_grammar.pdf; LDC-provided.

  5. This is the numerical stability parameter typically used in AdaGrad implementations.

  6. The word shape feature collapsed all consecutive letters in a name to a single letter to attempt to identify punctuation patterns. For example, the name Bob would have the shape a, while @Bob would have the shape @a.

  7. Arabic and Mandarin were also provided but we exclude them from our experiments here due to data processing issues. Yoruba is excluded because it had too little data for meaningful experiments. Hausa was excluded because the data did not annotate the gpe type. LDC catalog numbers were 2014E115, 2015E70, and 2016E{29,87,91,93,95,97,99,103}.
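
The word shape feature described in note 6 can be illustrated with a minimal sketch. This is not the authors' implementation: the function name word_shape and the placeholder constant are hypothetical, and leaving digits and other non-letter characters unchanged is an assumption, since the note does not say how digits, letter case, or combining marks were treated.

```python
LETTER = "a"  # placeholder emitted for each run of letters (as in the note's examples)


def word_shape(token: str) -> str:
    """Collapse every run of consecutive letters to a single placeholder.

    Non-letter characters (punctuation, digits, symbols) are copied through
    unchanged; that choice is an assumption made for this sketch.
    """
    shape = []
    for ch in token:
        if ch.isalpha():
            # Emit the placeholder only at the start of a letter run.
            if not shape or shape[-1] != LETTER:
                shape.append(LETTER)
        else:
            shape.append(ch)
    return "".join(shape)


print(word_shape("Bob"))   # a
print(word_shape("@Bob"))  # @a
```

Because str.isalpha follows Unicode letter categories, the same function applies to non-Latin scripts, although combining marks that are not classified as letters would split a run.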

Acknowledgements

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0113. The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. Approved for Public Release by DARPA on Aug 29, 2017 (DISTAR Approval #28392), Distribution Unlimited.

Author information

Corresponding author

Correspondence to Ryan Gabbard.

Additional information

All work described in this article was performed at Raytheon BBN Technologies. Authors Freedman, Gabbard, Lignos, and Weischedel are currently affiliated with the University of Southern California Information Sciences Institute, 4676 Admiralty Way, Suite 1001, Marina del Rey, 90292, USA.

About this article

Cite this article

Gabbard, R., DeYoung, J., Lignos, C. et al. Combining rule-based and statistical mechanisms for low-resource named entity recognition. Machine Translation 32, 31–43 (2018). https://doi.org/10.1007/s10590-017-9208-0

Keywords

  • Named entity recognition
  • Low-resource NLP
  • Annotation