Inducing Context Gazetteers from Encyclopedic Databases for Named Entity Recognition

  • Han-Cheol Cho
  • Naoaki Okazaki
  • Kentaro Inui
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7818)

Abstract

Named entity recognition (NER) is a fundamental task for mining valuable information from unstructured and semi-structured texts. State-of-the-art NER models mostly employ a supervised machine learning approach that depends heavily on local contexts. However, recent research has demonstrated that non-local contexts at the sentence or document level can improve recognition performance. In this paper, we propose the use of a context gazetteer, a list of contexts with which entity names can co-occur, as new non-local context information. We build the context gazetteer from an encyclopedic database because manually annotated data are often too scarce to yield rich and sophisticated context patterns. In addition, we use dependency paths as sentence-level non-local context, which captures contexts that are more closely related syntactically to entity mentions than the linear context used in traditional NER. In our experiments, we build a context gazetteer of gene names and apply it to a biomedical NER task. High-confidence context patterns appear in various forms: some resemble a predicate–argument structure, whereas others take unexpected forms. The experimental results show that the proposed model, using both entity and context gazetteers, improves both precision and recall over a strong baseline model, demonstrating the usefulness of the context gazetteer.
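
To make the idea of a context gazetteer concrete, the following is a minimal sketch, not the authors' implementation. It assumes that dependency-path contexts have already been extracted as (pattern, token) pairs, and it scores each pattern by the fraction of its occurrences that co-occur with a known entity name. The function name `build_context_gazetteer`, the pattern notation, the confidence formula, and both thresholds are hypothetical choices made for illustration; the abstract does not specify how confidence is computed.

```python
from collections import Counter


def build_context_gazetteer(observations, entity_names,
                            min_count=2, min_confidence=0.8):
    """Collect context patterns that co-occur mostly with entity names.

    observations: iterable of (pattern, token) pairs, where `pattern` is a
        string describing a dependency-path context (hypothetical notation).
    entity_names: set of lowercased names from an entity gazetteer.
    Returns a dict mapping each retained pattern to its confidence.
    """
    total = Counter()        # how often each pattern was seen at all
    with_entity = Counter()  # how often it was seen with an entity name

    for pattern, token in observations:
        total[pattern] += 1
        if token.lower() in entity_names:
            with_entity[pattern] += 1

    gazetteer = {}
    for pattern, count in total.items():
        # Assumed confidence: share of occurrences involving an entity name.
        confidence = with_entity[pattern] / count
        if count >= min_count and confidence >= min_confidence:
            gazetteer[pattern] = confidence
    return gazetteer


# Toy data, invented for illustration only.
observations = [
    ("express<-nsubjpass", "BRCA1"),
    ("express<-nsubjpass", "TP53"),
    ("express<-nsubjpass", "protein"),
    ("mutation->prep_of", "BRCA1"),
    ("mutation->prep_of", "TP53"),
    ("show<-nsubj", "results"),
    ("show<-nsubj", "BRCA1"),
]
entity_names = {"brca1", "tp53"}

context_gazetteer = build_context_gazetteer(observations, entity_names)
print(context_gazetteer)
```

In a tagger, membership of a token's context in such a gazetteer could then be exposed as an additional feature alongside the entity-gazetteer features; how the paper integrates it into the model is not stated in the abstract.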

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Han-Cheol Cho (1)
  • Naoaki Okazaki (2, 3)
  • Kentaro Inui (2)
  1. Suda Lab., Graduate School of Information Science and Technology, the University of Tokyo, Tokyo, Japan
  2. Inui and Okazaki Lab., Graduate School of Information Science, Tohoku University, Sendai, Japan
  3. Japan Science and Technology Agency (JST), Japan
