Topic-Oriented Words as Features for Named Entity Recognition

  • Ziqi Zhang
  • Trevor Cohn
  • Fabio Ciravegna
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7816)


Research has shown that topic-oriented words are often related to named entities and can be used for Named Entity Recognition (NER). Many studies have proposed measuring the topicality of words in terms of ‘informativeness’, based on the global distributional characteristics of words in a corpus. However, this study shows that there can be a large discrepancy between informativeness and topicality; empirically, informativeness-based features can damage the learning accuracy of NER. This paper proposes to measure words’ topicality based on local distributional features specific to individual documents, and proposes methods to transform topicality into gazetteer-like features for NER by binning. Evaluated on five datasets from three domains, the methods consistently improve over a baseline by between 0.9 and 4.0 in F-measure, and always outperform methods that use informativeness measures.
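The abstract does not state which local measure or which binning scheme the paper uses. As a rough illustration of the general idea only, the sketch below scores each word by its relative frequency within a single document and maps the scores to equal-width bins whose labels could serve as gazetteer-like categorical features. The function names, the stopword list, the frequency-based score, and the `TOPIC_BIN_` labels are all assumptions made for this example, not the authors' method.

```python
from collections import Counter

# Hypothetical stopword list, used only to keep function words out of the example.
STOPWORDS = frozenset({"the", "and", "a", "of", "in"})


def topicality_scores(tokens, stopwords=STOPWORDS):
    """Score each distinct word by its frequency within ONE document,
    normalised by the most frequent word in that document.

    Assumption: in-document frequency is only a stand-in for a 'local'
    topicality measure; the abstract does not name the measure used.
    """
    counts = Counter(t.lower() for t in tokens if t.lower() not in stopwords)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}


def bin_features(scores, n_bins=3):
    """Map each score in (0, 1] to a discrete label using equal-width bins."""
    features = {}
    for word, score in scores.items():
        # min() keeps a score of exactly 1.0 inside the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        features[word] = f"TOPIC_BIN_{index}"
    return features


doc = "the kinase binds the receptor and the kinase activates the receptor".split()
features = bin_features(topicality_scores(doc))
for word, label in sorted(features.items()):
    print(word, label)
```

Each token's bin label would then be passed to the NER learner alongside its other features, in the same way as a gazetteer membership flag. Because the scores are computed per document, the same word can fall into different bins in different documents.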


Name Entity Recognition, Entity Recognition, Informativeness Measure, MEDLINE Abstract, Learning Accuracy
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Ziqi Zhang (1)
  • Trevor Cohn (1)
  • Fabio Ciravegna (1)
  1. Department of Computer Science, University of Sheffield, Sheffield, UK
