Programming and Computer Software, Volume 41, Issue 6, pp 336–349

Methods for automatic term recognition in domain-specific text collections: A survey

  • N. A. Astrakhantsev
  • D. G. Fedorenko
  • D. Yu. Turdakov
Abstract

Applications that process domain-specific text often rely on glossaries and ontologies, and the main step in constructing such resources is term recognition. This paper surveys existing definitions of the term and its linguistic features, formulates the task of term recognition, and analyzes currently available methods for automatic term recognition: methods for collecting term candidates, methods based on statistics and on the contexts of term occurrences, methods using topic models, and methods based on external resources (such as text collections from other domains, ontologies, and Wikipedia). It also gives an overview of standard methodologies and datasets for experimental research.

Keywords

Semantic Relatedness; Topic Model; Term Frequency; Term Candidate; Computational Linguistics
These keywords were added by machine and not by the authors.

Copyright information

© Pleiades Publishing, Ltd. 2015

Authors and Affiliations

  • N. A. Astrakhantsev (1)
  • D. G. Fedorenko (1)
  • D. Yu. Turdakov (1)
  1. Institute for System Programming, Russian Academy of Sciences, Moscow, Russia