Methods for automatic term recognition in domain-specific text collections: A survey

Abstract

Applications for domain-specific text processing often rely on glossaries and ontologies, and term recognition is the main step in constructing such resources. This paper surveys existing definitions of the term and its linguistic features, formulates the task definition for term recognition, and analyzes currently available methods for automatic term recognition, such as methods for candidate collection, methods based on statistics and contexts of term occurrences, methods using topic models, and methods based on external resources (such as text collections from other domains, ontologies, and Wikipedia). The paper also gives an overview of standard methodologies and datasets for experimental research.

Author information

Corresponding author

Correspondence to N. A. Astrakhantsev.

Additional information

Original Russian Text © N.A. Astrakhantsev, D.G. Fedorenko, D.Yu. Turdakov, 2015, published in Programmirovanie, 2015, Vol. 41, No. 6.

About this article

Cite this article

Astrakhantsev, N.A., Fedorenko, D.G. & Turdakov, D.Y. Methods for automatic term recognition in domain-specific text collections: A survey. Program Comput Soft 41, 336–349 (2015). https://doi.org/10.1134/S036176881506002X

Keywords

  • Semantic Relatedness
  • Topic Model
  • Term Frequency
  • Term Candidate
  • Computational Linguistics